I think you almost have it.
If I were in your situation, as I understand it, this is what I would do:
Get good sources for A, mainly from the video you wish to use. I would try to squeeze as many faces out of it as possible. You can also use images of A from other sources, but if you don't have that available, it may still be fine.
For the B face it matters more to have as many sources as possible: use whatever good data (images) you can find.
Landmarks are used to create the mask. "Reasonable" landmarks will lead to a reasonable mask, so no big worries for training. It's AI, it will learn. I learned that the hard way: I have personally sat down and made the alignments perfect for 4K images on both sides, and it didn't make all that much difference for training.
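To illustrate why small landmark errors barely move the mask, here is a rough sketch in plain Python. It is not how the tool builds its masks; I have swapped in a simple convex hull around the landmark points, and the names convex_hull and hull_mask are made up for this example. Nudging one landmark by a pixel only changes the mask along that edge.

```python
def cross(o, a, b):
    """Sign tells which side of the line o->a the point b lies on."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain convex hull over (x, y) tuples."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # Last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]


def hull_mask(landmarks, width, height):
    """Return a height x width grid of 0/1, 1 inside the landmark hull."""
    hull = convex_hull(landmarks)
    mask = [[0] * width for _ in range(height)]
    if len(hull) < 3:
        return mask  # not enough points to enclose any area
    n = len(hull)
    for y in range(height):
        for x in range(width):
            # A pixel is inside when it is on the same side of every edge.
            if all(cross(hull[i], hull[(i + 1) % n], (x, y)) >= 0
                   for i in range(n)):
                mask[y][x] = 1
    return mask


def count_differences(mask_a, mask_b):
    """Number of pixels where two equally sized masks disagree."""
    return sum(a != b
               for row_a, row_b in zip(mask_a, mask_b)
               for a, b in zip(row_a, row_b))


if __name__ == "__main__":
    # Toy landmarks: a "clean" set and one with a single point nudged.
    clean = [(2, 2), (6, 2), (6, 6), (2, 6), (4, 4)]
    noisy = [(2, 2), (6, 2), (7, 6), (2, 6), (4, 4)]
    diff = count_differences(hull_mask(clean, 8, 8), hull_mask(noisy, 8, 8))
    print("pixels changed by the nudge:", diff)
```

Real masks are of course much larger and the real maskers are smarter than a hull, but the idea is the same: the mask follows the outline of the landmarks, so a slightly off point only shifts the outline slightly.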
Then let all that train for a while.
Then for the "to-convert-from-A-to-B" video I do a separate extraction.
Every single frame. Get the alignments pretty clean. If you have obstructions you may wish to additionally clean up the generated masks. See how it goes.
Then use the model you trained, together with the "to-convert-from-A-to-B" video and that nice, clean alignments file.
I usually make a short video first (extract, alignments, mask too), about 20 sec long, and convert that to see how it's looking before I do the full one.
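For cutting the short test clip, something like the following works if you have ffmpeg on your PATH. This is only a sketch: the function names and file names are placeholders I made up, and it uses the standard ffmpeg options -ss (start), -t (duration) and -c copy (no re-encode). With stream copy the cut snaps to keyframes, so the clip may not be exactly 20 seconds.

```python
import subprocess


def build_clip_command(source, dest, start=0.0, duration=20.0):
    """Build an ffmpeg command that copies a short clip out of source."""
    if duration <= 0:
        raise ValueError("duration must be positive")
    if start < 0:
        raise ValueError("start must not be negative")
    return [
        "ffmpeg",
        "-ss", str(start),     # seek to the start time (seconds)
        "-i", source,          # input video
        "-t", str(duration),   # keep this many seconds
        "-c", "copy",          # copy streams, no re-encode
        dest,
    ]


def make_clip(source, dest, start=0.0, duration=20.0):
    """Run ffmpeg; raises CalledProcessError if ffmpeg exits with an error."""
    subprocess.run(build_clip_command(source, dest, start, duration),
                   check=True)


if __name__ == "__main__":
    # Placeholder file names; print the command rather than running it.
    print(" ".join(build_clip_command("full_video.mp4", "test_clip.mp4")))
```

Then run the extract, clean up the alignments and masks for just that clip, and convert it with the trained model before committing to the full video.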