Lately I've been having an issue with training performance that I'm struggling to isolate.
Training (DNY256, batch size 26) starts out strong at ~100 eg/s, but after about 2 minutes the speed gradually drops to 55-60 eg/s and stays there for the rest of the session. Along with the performance drop, GPU power draw gradually falls from around 170W to a stable ~110W. Temps peak at ~70C and settle around 65C. Restarting the training session restores full speed, but only for another ~2 minutes.
At first I thought the GPU was throttling, but the slowdown doesn't happen with an older model I used before (DFL-SAE standalone, batch size 16). With DFL-SAE, performance is consistent, and GPU power draw and temps stabilize at higher levels (~180W and ~75C).
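In case it helps, this is the kind of telemetry I've been meaning to log alongside training to tell actual throttling apart from the GPU just sitting starved for data. It's only a rough sketch using pynvml (pip install nvidia-ml-py), run in a separate terminal; GPU index 0 and the 5-second poll interval are assumptions, not anything from DFL itself:

```python
# Side monitor: log utilization, power, temp, SM clock, and NVML throttle
# reasons while training runs. GPU index 0 assumed; adjust as needed.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# Readable names for the NVML throttle-reason bits we care about.
REASONS = {
    pynvml.nvmlClocksThrottleReasonSwPowerCap: "power cap",
    pynvml.nvmlClocksThrottleReasonSwThermalSlowdown: "sw thermal",
    pynvml.nvmlClocksThrottleReasonHwSlowdown: "hw slowdown",
    pynvml.nvmlClocksThrottleReasonGpuIdle: "idle",
}

try:
    while True:
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        power = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # mW -> W
        temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        sm_clock = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)
        mask = pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(handle)
        active = [name for bit, name in REASONS.items() if mask & bit] or ["none"]
        print(f"util={util.gpu}%  power={power:.0f}W  temp={temp}C  "
              f"sm={sm_clock}MHz  throttle={','.join(active)}")
        time.sleep(5)
finally:
    pynvml.nvmlShutdown()
```

My thinking is that if the throttle column stays "none" while utilization and power fall together, the GPU isn't throttling at all, it's being starved upstream.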
Am I hitting some kind of bottleneck when training with DNY256? This feels like a relatively recent development.
The GPU is a 3060 Ti, and the data is stored on an NVMe drive.
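To rule out the storage side, I figure a quick read-throughput check over the dataset folder should be enough. Another rough sketch; the path and file extension below are hypothetical placeholders for wherever the training faces actually live:

```python
# Quick-and-dirty read-throughput check on the training data folder.
# DATA_DIR is a hypothetical path; point it at the real dataset directory.
import time
from pathlib import Path

DATA_DIR = Path("workspace/data_src/aligned")  # hypothetical path

files = sorted(DATA_DIR.glob("*.jpg"))[:2000]
start = time.perf_counter()
total = sum(len(f.read_bytes()) for f in files)
elapsed = time.perf_counter() - start
print(f"read {len(files)} files, {total / 1e6:.1f} MB "
      f"in {elapsed:.2f}s ({total / 1e6 / elapsed:.0f} MB/s)")
```

If a cold run of that still shows hundreds of MB/s, the NVMe drive presumably isn't the bottleneck.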
Has anyone encountered anything like this?