Those are some pretty good observations, and I've noticed the same things myself after playing around with faceswap for over a year. I can add to what you're saying with a few observations of my own.
Assuming your data on both ends is video and it's moderately high quality (720p or above), open the videos in a video editing program like Shotcut, Premiere, or Vegas and add some sharpness before you extract. Doing that helps simulate higher-quality data. Make sure you leave some normal and blurry footage in, though, so the model knows how to deal with blur.
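You can also script the sharpening per frame instead of doing it in an editor. Here's a minimal numpy sketch of the unsharp-mask idea (sharpened = original + amount * (original - blurred)); the helper names `box_blur` and `unsharp_mask` are mine, not part of faceswap, and a real pipeline would use a proper Gaussian blur:

```python
import numpy as np

def box_blur(img, k=3):
    # crude box blur: average each pixel over a k x k neighborhood.
    # img is a 2-D float array (one channel); edges are padded by replication.
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def unsharp_mask(img, amount=0.5, k=3):
    # boost edges by adding back the difference between the image
    # and its blurred copy, scaled by `amount`
    return img + amount * (img - box_blur(img, k))
```

Run it on only part of your frames so some normal and blurry data still makes it into the training set.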
Also, if you want your data to train even faster, make sure your B data doesn't teach the model anything it doesn't need to know. For example, if the subject in your A video never opens their mouth, you can speed up training by not letting the model learn how to open the B subject's mouth. Every second the model spends learning how to open B's mouth when it isn't needed could have been a second spent learning more important things instead. Of course this is a double-edged sword: the moment you introduce an A video where the mouth does open, the model will be significantly lacking in its understanding of how to do that. It's also easy to make a mistake and not realize you didn't provide sufficient data for it to learn from.
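One way to do this culling is to filter B frames by how open the mouth is before training. A rough sketch, assuming you already have dlib-style 68-point landmarks for each frame (the function names and the pixel threshold are mine, not faceswap's):

```python
import numpy as np

def mouth_openness(landmarks):
    # landmarks: (68, 2) array using the dlib 68-point convention,
    # where point 62 is the top inner lip and 66 the bottom inner lip.
    # The gap between them is a rough "mouth open" measure in pixels.
    return float(np.linalg.norm(landmarks[66] - landmarks[62]))

def keep_closed_mouth(frames, landmark_sets, threshold=3.0):
    # drop frames whose inner-lip gap exceeds the threshold, so the
    # model never spends time learning open-mouth shapes it won't need
    return [f for f, lm in zip(frames, landmark_sets)
            if mouth_openness(lm) <= threshold]
```

The threshold depends on your face crop size, so eyeball a few rejected frames before committing to it.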