Sorry if this is a newbie question, but I can't understand why one model fails while another with the same settings trains exceptionally well. I started to wonder whether my previous model had been running on the Central Storage distribution strategy, because when I started training my new model it kept crashing until I switched over to Central Storage. After several thousand iterations of training, can I switch back to the default distribution?
No, but different versions of TensorFlow may handle VRAM allocation differently.
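For anyone unclear on what the two modes actually do, here is a minimal sketch in plain TensorFlow (not the trainer's own code, and assuming it uses tf.distribute strategies under the hood). The default strategy keeps model variables on the GPU; CentralStorageStrategy stores them on the CPU, which lowers the VRAM footprint at some speed cost:

```python
# Minimal sketch of the two distribution modes being discussed,
# assuming the trainer builds on TensorFlow's tf.distribute strategies.
import tensorflow as tf

# Default strategy: variables live on the GPU alongside the compute.
default_strategy = tf.distribute.get_strategy()

# Central storage: variables live on the CPU and are copied to the GPU
# each step, trading some speed for a smaller VRAM footprint.
central_strategy = tf.distribute.experimental.CentralStorageStrategy()

with central_strategy.scope():
    # Toy model purely for illustration.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
```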
I do suspect there is some kind of memory leak in recent TensorFlow releases. Nothing big, but enough that a model may OOM after several thousand iterations. That should not be possible, as TF allocates all the VRAM it requires at the start.
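If you want to check whether usage really does creep up over time, you can disable that up-front VRAM grab with TensorFlow's memory growth setting. This is a plain-TF option, so I can't say whether the trainer exposes it, but as a sketch:

```python
# Hedged sketch: turn off TF's up-front VRAM allocation so memory is
# claimed on demand. A slow leak then shows up as gradually rising usage
# in nvidia-smi rather than a sudden OOM thousands of iterations in.
import tensorflow as tf

for gpu in tf.config.list_physical_devices("GPU"):
    # Must be set before any GPU ops run, i.e. right after import.
    tf.config.experimental.set_memory_growth(gpu, True)
```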