I know that you are extremely busy, but if you are working on the codebase, I have a suggestion/request that I think would make workflows easier.
I notice that when debug landmarks are output by extraction, they show the actual 3D orientation of the face/pose. It would be great if the manual bounding box tool also had this 3D alignment. Faces that are pointed even 60 degrees off centre in any direction often don't align properly, and especially face-down poses (e.g. looking at a phone or reading a book) don't align or swap well; it is difficult to get the manual aligner tool to align these types of faces properly. FAN tends to compress the facial features, so in practice, as a video starts to move towards a face looking downwards, the distance between the eyes and mouth starts to close, and in the swap the mouth stays in frame longer than it "should".
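To illustrate what I mean by showing the 3D orientation in the manual tool, here is a minimal sketch of the standard trick of recovering yaw/pitch/roll from the existing 68-point landmarks with OpenCV's solvePnP. The generic 3D reference points and landmark indices are common approximations, not values taken from the codebase, so treat this as an assumption about how it could be wired in, not how it currently works:

```python
import cv2
import numpy as np

# Generic 3D reference positions (roughly mm) for six of the 68 landmarks:
# nose tip (30), chin (8), outer eye corners (36, 45), mouth corners (48, 54).
MODEL_POINTS_3D = np.array([
    (0.0, 0.0, 0.0),           # nose tip
    (0.0, -330.0, -65.0),      # chin
    (-225.0, 170.0, -135.0),   # left eye outer corner
    (225.0, 170.0, -135.0),    # right eye outer corner
    (-150.0, -150.0, -125.0),  # left mouth corner
    (150.0, -150.0, -125.0),   # right mouth corner
], dtype=np.float64)

LANDMARK_INDICES = [30, 8, 36, 45, 48, 54]

def head_pose_from_landmarks(landmarks_68, frame_width, frame_height):
    """Return (yaw, pitch, roll) in degrees estimated from 68 2D landmarks."""
    image_points = np.array(
        [landmarks_68[i] for i in LANDMARK_INDICES], dtype=np.float64)

    # Rough pinhole camera: focal length ~ image width, centre at the midpoint.
    focal = frame_width
    centre = (frame_width / 2.0, frame_height / 2.0)
    camera_matrix = np.array([
        [focal, 0, centre[0]],
        [0, focal, centre[1]],
        [0, 0, 1],
    ], dtype=np.float64)
    dist_coeffs = np.zeros((4, 1))  # assume no lens distortion

    ok, rvec, _ = cv2.solvePnP(
        MODEL_POINTS_3D, image_points, camera_matrix, dist_coeffs,
        flags=cv2.SOLVEPNP_ITERATIVE)
    if not ok:
        return None

    # Convert the rotation vector to Euler angles for display in the GUI.
    rot_mat, _ = cv2.Rodrigues(rvec)
    pose_mat = cv2.hconcat([rot_mat, np.zeros((3, 1))])
    _, _, _, _, _, _, euler = cv2.decomposeProjectionMatrix(pose_mat)
    pitch, yaw, roll = euler.flatten()
    return yaw, pitch, roll
```

Something like this could drive a simple yaw/pitch/roll readout or axis overlay in the manual tool, so it's obvious when a face-down pose is being fitted with squashed landmarks.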
Perhaps there is some way to integrate facial identity into the extraction and alignment process, given that the 3D object that is a person's head has the same features regardless of position? Further to that, directional/pose information doesn't seem to feed into the training model at all.
One thing I have noticed is that with off-centre faces (e.g. a low-angle shot of a person looking away, especially as one eye starts to be obscured, or profile shots in general), the training model seems to assume that the face itself is obscured rather than off centre, so swaps for half faces tend not to work very well (as noted many times in the forum).
Similarly, while I understand that generally speaking the training process does not necessarily rely on the landmarks (especially if WTL is disabled), I have noticed that certain misaligned faces swap in a way that suggests landmark misalignment rather than a problem with the mask itself (i.e. faces misaligned to the point that the 68 points are a jumble, with the entire face rotated some degrees off centre). So it occurs to me that producing better alignments in the first place may improve the entire process.
Alternatively, I have been playing with the 3ddfa_v3 aligner coupled with the retinaface detector, and from testing it does seem to have better detection accuracy and alignments than the S3FD/FAN combo in these off-centre poses. It is much better at aligning profile faces, even in a low-angle shot, and its design takes into account that the face is part of a 3D object (hence my suggestion above). This aligner does come with quite a significant set of model/data files (annoying for end users, a huge download), although, with my limited understanding of Python, I don't think it would be exceptionally difficult to integrate it, or at least try it out to see how it compares on any given dataset. A rough sketch of what such an integration might look like is below.
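To be clear, this is only a sketch of the shape of the integration: the `detect` and `reconstruct` calls are placeholders for whatever retinaface and 3ddfa_v3 actually expose (I haven't checked their exact APIs), and the idea relies on the general property of 3DDFA-style models that sparse 68-point landmarks can be read off fixed vertex indices of the fitted 3D mesh:

```python
import numpy as np

class TDDFAAlignerStage:
    """Hypothetical extractor stage returning 68-point landmarks per face."""

    def __init__(self, detector, aligner):
        # detector: a RetinaFace-style detector returning face boxes per frame
        # aligner: a 3DDFA_V3-style model returning a fitted 3D face mesh
        self.detector = detector
        self.aligner = aligner

    def process_frame(self, frame_bgr):
        results = []
        boxes = self.detector.detect(frame_bgr)              # assumed method
        for box in boxes:
            mesh = self.aligner.reconstruct(frame_bgr, box)  # assumed method
            # 3DMM-based aligners fit a full 3D head, so the sparse 68 points
            # come from fixed vertex indices projected back into the image,
            # which is why profile/face-down poses keep sensible proportions.
            landmarks_68 = mesh.vertices_2d[mesh.landmark_indices_68]
            results.append({
                "bbox": box,
                "landmarks": np.asarray(landmarks_68, dtype=np.float32),
            })
        return results
```

If the stage emits the same (bbox, 68-landmark) pairs the current S3FD/FAN path produces, it could be swapped in behind the existing extractor interface and compared side by side on the same dataset.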
I figure what I am saying is descriptive enough, but if you want me to upload some pictures to help illustrate the point, let me know and I will.