The human ear is very good at telling sounds apart, so you will have a hard time doing this without noticeably altering the vocal tracks.
One of the most recognizable differences between voices is timbre: the combination of frequency spectrum and sound envelope that lets us identify who we are listening to.
You could possibly use frequency spectrum analysis to identify the difference between the two voices and create a frequency filter that you could apply to each voice to make them sound more alike. An EQ filter may also need to be applied.
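As a rough illustration of that idea, here is a minimal Python sketch using NumPy. It averages the magnitude spectrum of each voice, takes the ratio as a per-frequency gain curve, and applies that curve to one voice. The function names, the frame size, and the 12 dB gain limit are all my own assumptions for the example, and it assumes both tracks are mono float arrays at the same sample rate. It only matches the long-term spectrum, not the sound envelope, so treat it as a starting point rather than a finished tool.

```python
import numpy as np

def average_spectrum(signal, frame_size=1024, hop=512):
    """Average magnitude spectrum over Hann-windowed frames."""
    if len(signal) < frame_size:
        raise ValueError("signal is shorter than one frame")
    window = np.hanning(frame_size)
    frames = [
        signal[start:start + frame_size] * window
        for start in range(0, len(signal) - frame_size + 1, hop)
    ]
    return np.mean([np.abs(np.fft.rfft(f)) for f in frames], axis=0)

def matching_gain(source, target, frame_size=1024, hop=512, max_gain_db=12.0):
    """Per-bin gain that nudges the source's average spectrum toward the target's."""
    eps = 1e-12  # avoids division by zero in silent bins
    src = average_spectrum(source, frame_size, hop)
    tgt = average_spectrum(target, frame_size, hop)
    gain_db = 20.0 * np.log10((tgt + eps) / (src + eps))
    # Limit the correction so near-silent bins are not boosted wildly.
    gain_db = np.clip(gain_db, -max_gain_db, max_gain_db)
    return 10.0 ** (gain_db / 20.0)

def apply_gain(signal, gain):
    """Apply the gain curve to the whole signal in the frequency domain."""
    spectrum = np.fft.rfft(signal)
    # Stretch the coarse gain curve onto the full-length FFT's bins
    # (both axes run from 0 to the Nyquist frequency).
    bins = np.linspace(0.0, 1.0, len(spectrum))
    coarse = np.linspace(0.0, 1.0, len(gain))
    return np.fft.irfft(spectrum * np.interp(bins, coarse, gain), n=len(signal))
```

With two arrays voice_a and voice_b, something like apply_gain(voice_a, matching_gain(voice_a, voice_b)) would pull the first voice's tonal balance toward the second. Using a milder version of the curve on both voices may sound more natural than fully correcting one of them.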
You would then either apply the filtering to both vocal tracks so that the cross-fade is less noticeable, or apply the filtering as an effect before or during the transition to make the voices indistinct while you switch them.
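For the cross-fade itself, a minimal sketch in Python with NumPy might look like the following. It uses an equal-power (sine/cosine) fade, which is a common choice but my own assumption here, as are the function name and parameters. It assumes both tracks are mono float arrays at the same sample rate and have already been filtered.

```python
import numpy as np

def equal_power_crossfade(outgoing, incoming, fade_len):
    """Fade from the end of `outgoing` into the start of `incoming`."""
    if not 0 < fade_len <= min(len(outgoing), len(incoming)):
        raise ValueError("fade_len must be positive and fit inside both tracks")
    # Cosine/sine ramps keep the summed power roughly constant across the fade.
    t = np.linspace(0.0, np.pi / 2.0, fade_len)
    fade_out = np.cos(t)
    fade_in = np.sin(t)
    overlap = outgoing[-fade_len:] * fade_out + incoming[:fade_len] * fade_in
    return np.concatenate([outgoing[:-fade_len], overlap, incoming[fade_len:]])
```

A longer fade_len gives the ear less of a sudden change to latch onto, at the cost of a longer stretch where both voices are audible at once.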
I'm not sure what software you would use, but I'm sure the capability exists.