There are datasets with over a million faces that have been used to train the newest model architectures to encode faces. They make a very good starting point for training, but they've never been able to beat an encoder trained 1:1 on the specific pair of faces.
Is it theoretically possible for a generic encoder? Almost definitely. Has it been done? Not yet.
We have ALWAYS been able to get a good bump in quality by letting the encoder train on the individual faces we're working with. It's just better able to encode the details of those two faces if we give it room to ignore every other possible face.
The big advantage pretrained encoders give is shortcutting the early training time. Once the decoder has been trained to catch up with the encoder, though, we always recommend unfreezing the encoder so it can keep improving and the results can reach their best. Is it possible you might decide "this is good enough" and just keep the encoder frozen? Definitely. That's why we leave the freeze-encoder setting up to the user. They're free to decide whether the results meet their standards and can stop training or alter settings at any time.
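To make the two-phase workflow concrete, here's a minimal sketch in PyTorch of what freezing and later unfreezing an encoder looks like. The module names and layer sizes are hypothetical stand-ins, not any particular tool's actual architecture; the point is just that "freezing" means turning off gradients for the encoder's parameters while the decoder trains, then turning them back on.

```python
import torch.nn as nn

# Hypothetical autoencoder: a pretrained, shared encoder plus a decoder.
# The tiny Linear layers stand in for a real convolutional stack.
class FaceAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(64, 32), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(32, 64))

    def forward(self, x):
        return self.decoder(self.encoder(x))

def set_encoder_frozen(model, frozen):
    # Freezing just means the optimizer won't update these weights.
    for p in model.encoder.parameters():
        p.requires_grad = not frozen

model = FaceAutoencoder()

# Phase 1: keep the pretrained encoder frozen while the decoder catches up.
set_encoder_frozen(model, True)
phase1_params = [p for p in model.parameters() if p.requires_grad]

# Phase 2: unfreeze so the encoder can specialize on the two faces.
set_encoder_frozen(model, False)
phase2_params = [p for p in model.parameters() if p.requires_grad]
```

In phase 1 only the decoder's parameters are trainable; after unfreezing, the optimizer sees the encoder's weights again and can fine-tune them on the specific pair of faces.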