We'd need to see some samples to know for sure, but a few points:
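1: Original is a 64x64 model, which is pretty low res for any large video. The swapped face has to be stretched to fit the face in the frame, so the bigger the face, the blurrier the result. You may need to downscale your video or use a larger model.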
2: Data curation is a bit of an art, and it takes time to get the hang of. Almost everyone's first video is a failure.
3: Using a single video for Face A might lead to overtraining the A side (especially if that's also the video you're swapping onto).
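
To put rough numbers on point 1, here's a quick Python sketch of the arithmetic. It isn't part of any faceswap tool: the function names are made up for illustration, and the 2x stretch limit is just an assumed threshold, not an official recommendation. It only shows how far a 64px output has to be stretched for a given face size, and what frame height would bring that back under the limit.

```python
MODEL_OUTPUT_PX = 64  # output size of the Original model, per point 1


def upscale_factor(face_px: float, model_px: int = MODEL_OUTPUT_PX) -> float:
    """How much the model output must be stretched to cover a face face_px wide."""
    return face_px / model_px


def suggested_frame_height(frame_height: int, face_px: float,
                           max_upscale: float = 2.0,
                           model_px: int = MODEL_OUTPUT_PX) -> int:
    """Frame height that keeps the stretch at or below max_upscale.

    max_upscale is an assumed, arbitrary threshold; tune it to taste.
    """
    factor = upscale_factor(face_px, model_px)
    if factor <= max_upscale:
        # The face is already small enough, so leave the video alone.
        return frame_height
    scaled = frame_height * max_upscale / factor
    # Round down to an even number, since common encoders (e.g. H.264
    # with 4:2:0 chroma) need even dimensions.
    return max(2, int(scaled) // 2 * 2)


if __name__ == "__main__":
    # A 1080p clip where the face is about 400px wide.
    print(upscale_factor(400))                # 6.25
    print(suggested_frame_height(1080, 400))  # 344
```

So a 400px face in a 1080p clip means the 64px output gets stretched over 6x, which is why it looks soft. Measure the face width on a few representative frames rather than trusting a single one.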