Intel big-little headache

Post by **Ryzen1988** » Sun Aug 13, 2023 4:03 pm

Starting with 12th gen (Alder Lake), Intel introduced a hybrid architecture with a sort of big-little core layout.
12th gen tops out at 8 performance cores with HT plus 8 efficiency cores, depending on the SKU of course.
13th gen Raptor Lake tops out at 8 performance cores and 16 efficiency cores.
One of my systems happens to be a 12700K, unlike most of my systems, because I've been a real Ryzen/Threadripper fan.

From what I understand, the Thread Director is a hardware solution that directs work and tasks to either the big cores or the little cores.
My god this sucks!
For example with faceswap, when I start training it always redirects the training job to the little cores. They choke and its/min drops significantly.
I have every power plan set to ultimate/max super duper, but the Thread Director just does its own thing.
I can solve the issue by:
1 - Keeping the separate training preview window on top, which makes the work go to the big cores. When the normal program window is on top, the work goes to the small cores.
2 - Manually adjusting the core affinity of the work process every time, to bypass the small cores.
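
The manual affinity step in the second workaround could be scripted. Below is a minimal sketch, not a tested fix for this exact system. It assumes the OS numbers the hyperthreaded P-cores first (logical CPUs 0-15 on a 12700K) and the E-cores after them (16-19), which should be checked per machine. The helper names `p_core_cpus`, `affinity_mask` and `pin_to_p_cores` are made up for illustration, and `psutil` is a third-party package.

```python
"""Sketch: restrict a process to the P-cores only (assumed CPU numbering)."""


def p_core_cpus(p_cores: int = 8, threads_per_p_core: int = 2) -> list[int]:
    """Logical CPU indices assumed to belong to the P-cores.

    ASSUMPTION: hyperthreaded P-cores are enumerated first (0..15 on a
    12700K) and E-cores follow (16..19). Verify this on your own machine.
    """
    return list(range(p_cores * threads_per_p_core))


def affinity_mask(cpus: list[int]) -> int:
    """Bitmask form of a CPU list (bit N set means logical CPU N is allowed)."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask


def pin_to_p_cores(pid: int) -> None:
    """Apply the affinity with psutil (third-party: pip install psutil)."""
    import psutil  # imported lazily so the helpers above work without it

    psutil.Process(pid).cpu_affinity(p_core_cpus())


if __name__ == "__main__":
    cpus = p_core_cpus()
    print("P-core logical CPUs:", cpus)
    print("Affinity mask:", hex(affinity_mask(cpus)))
```

The hex mask is the same kind of value the Windows `start /affinity` command takes, so the training process could also be launched already pinned instead of being adjusted afterwards.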

Damn that sucks.
In the end I just disabled the small cores in the BIOS, but this is a warning for all who think that 12th and 13th gen are a good way to get lots of multithreaded multicore performance.
The Thread Director is a real *

Disclaimer: I do run W10, and it is supposedly somewhat better in W11.

Post by **torzdf** » Sun Aug 13, 2023 10:37 pm

Interesting. I've not seen an issue with bottlenecking in i7-12700k, but tbf, I have not investigated at all, and am on Linux

Post by **Ryzen1988** » Wed Aug 16, 2023 11:57 am

The behavior is not limited to faceswap.
The distribution of work seems to be heavily dependent on the task being in the foreground, otherwise it often gets assigned to the little cores.
Also, my SKU only has 4 little cores. If you have 16 little cores, like the Raptor Lake i9, that might be a different scenario.

During launch it was touted that you needed W11 for optimal performance, and initial benchmarks showed there was sometimes a strange performance discrepancy vs W10.
Guess they never bothered to fix that, but I still see W11 as a complete dumpster fire and would rather skip it. I'll probably go back to a Ryzen and Threadripper in a year anyway.
I've tried to adopt Linux several times, but the process is always painful and ultimately I go back to my Windows LTSC. Born and raised a Windows user, so to say.

Post by **torzdf** » Thu Aug 17, 2023 8:29 am

Yeah, updating to 12th Gen intel was what made me upgrade to W11.

Upgrading to W11 was what made me make the full time switch to Linux.

Post by **HoloByteus** » Mon Aug 21, 2023 11:26 am

I've never noticed a bottleneck due to the P/E core design with W11 and a 12700K/3060. It's pretty obvious the bottleneck is the 3060 in this case. Even E-cores aren't hit that hard with FS tasks, although recently I started using NVencC H265 encoding and there I see E-cores maxed like I've rarely seen, a solid box of 100%. Still, the GPU is also at 100% utilization, so I wouldn't call that a processor bottleneck. There is also P-core utilization, but the CPU is usually less than 20% utilized, if that. That task is just moving data after all, and that's an E-core's primary job. Total system power draw is 250w.
I see max E-core usage with Topaz Video AI too, with the GPU at 100% and the iGPU at about 8% since it's my display and Task Manager is running. Total system draw is now 380w.

Post by **Ryzen1988** » Mon Aug 21, 2023 11:57 pm

With a 3060, as you mentioned, it does not surprise me.
NVENC is almost all offloaded to the ASIC part, so CPU and GPU utilisation is meant to be very low.

I am running a system with a 4090 and an RTX A6000. I have them somewhat constrained power-wise, because if both load up on heavy work at the same time my 1300W power supply just shuts off. I really need a bigger PSU.
Anyway, in the above scenario, having the 4 little cores do all the heavy CPU work really visibly slows down the it/s.

I use Topaz Video as well, and the fact that these jobs get directed to the E-cores is just crazy stupid.
E-cores were meant for running OS background tasks.
With Topaz the E-cores slow down the 4090 massively.
When using P-cores, the 4090 pushes the cores to 70-80% load and it really speeds up. I think it speeds up the process by around 40% compared to the E-cores.

Post by **HoloByteus** » Tue Aug 22, 2023 1:21 am

I can see 4 E-cores not being enough for a 4090, and it is disappointing that P-cores aren't automatically engaged when the E-cores are clearly maxed. I've been looking to upgrade to a 4090ti or titan and suspected I'd need a processor upgrade as well. This could help solve my puzzle as to which would be better, a 14700 or 900K. I'd still rather benefit from the power savings on a GPU-primary task. I will need to see whether 12 or 16 E-cores are enough for 100% GPU utilization, or I'd agree the scheduling design needs some tweaking.

In the meantime, sounds like a job for Process Lasso. There's no need to fear, Process Lasso is here!

Post by **Ryzen1988** » Wed Aug 23, 2023 12:43 am

Haha, I did think about Lasso, but having more stuff run continuously in the background is something I always try to avoid.
So the choice to just disable them in the BIOS was the final one.
Besides, when training on a 4090, power savings are out of the window. It also warms up the room to 25 degrees with the windows open.
A very poor time to live in Europe atm.

I'm contemplating going for the new Threadripper if it ever comes out, I could really use the extra PCIe lanes.

Post by **Ryzen1988** » Thu Aug 24, 2023 8:33 am

Attachment: Totaal.jpg (220.01 KiB)

This is what it looks like with only faceswap training.
I can get GPU utilisation somewhat up to 90% by lowering the batch size.
So not running the process on the little cores is pretty important.

Post by **torzdf** » Thu Aug 24, 2023 9:09 am

This isn't the correct way to measure GPU utilization. You should use nvidia-smi. Also see here: app.php/faqpage#f0r3
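
For anyone who wants numbers rather than a Task Manager screenshot, nvidia-smi can be polled from a script. A rough sketch, assuming `nvidia-smi` is on the PATH. The query fields `utilization.gpu`, `power.draw` and `memory.used` are standard `--query-gpu` fields, while the function names and the dictionary keys are just illustrative choices.

```python
"""Sketch: read GPU utilization and power draw via nvidia-smi's CSV output."""
import subprocess

QUERY_FIELDS = "utilization.gpu,power.draw,memory.used"


def _to_float(field: str):
    """Convert one CSV field to float, or None for values like '[N/A]'."""
    try:
        return float(field)
    except ValueError:
        return None


def parse_gpu_csv(text: str) -> list[dict]:
    """Parse 'csv,noheader,nounits' output: one line per GPU."""
    rows = []
    for line in text.strip().splitlines():
        util, power, mem = (_to_float(f.strip()) for f in line.split(","))
        rows.append({"util_pct": util, "power_w": power, "mem_mib": mem})
    return rows


def query_gpus() -> list[dict]:
    """Run nvidia-smi once and return one dict per GPU."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_csv(result.stdout)


if __name__ == "__main__":
    # Illustrative text shaped like the tool's output, so this runs without a GPU.
    sample = "98, 248.73, 23178\n0, [N/A], 0\n"
    for index, gpu in enumerate(parse_gpu_csv(sample)):
        print(index, gpu)
```

Calling `query_gpus()` in a loop with a short sleep gives a running log that can be averaged over a training session, which is more useful than a single point-in-time reading.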

Post by **Ryzen1988** » Fri Aug 25, 2023 4:29 pm

I know, I should be ashamed because I have all the utilities. It was more to show how much the CPU is utilised even with all P-cores.
I took the least-effort way to get a pic of the system.

Post by **HoloByteus** » Wed Aug 30, 2023 1:55 pm

Ryzen1988 wrote: ↑Wed Aug 23, 2023 12:43 am
Haha, I did think about Lasso, but having more stuff run continuously in the background is something I always try to avoid.

That's a surprising amount of CPU use for training, I practically idle during training.

I just noticed during a recent BIOS update that MSI now offers E-core disable via a Scroll Lock key toggle, which is pretty nice. I'm a huge fan of automation, however, and no fan of bloatware. Lasso's two processes have a total .8s boot time impact on my system, and the feature set and automation make Lasso one of my favorite apps: it actively monitors system response, can automatically manage it, and most importantly it's configurable. I used to need it for flight simulation, although these days I just use it to control the power plan and prevent core parking for selected apps (ffmpeg).

In your case I'd use it to set a CPU Set (a native Windows feature that is more a preference than a strict affinity assignment) for Python and FFmpeg only. Other processes can still use the E-cores, and so can these two processes if needed; they just prefer P-cores instead of E-cores.

Post by **Ryzen1988** » Wed Aug 30, 2023 2:42 pm

Okay, you guys have inspired me to reevaluate my decision about Lasso.

Stuff like core parking I always prevent in the BIOS, combined with the ultimate power plan in Windows.
I have accepted my fate: paying a significant amount for electricity every month so I can work, relax and exist with my computers.
But having the small cores for real background tasks is of course even more practical.

Post by **HoloByteus** » Wed Aug 30, 2023 4:16 pm

The last two 4090 users that visited the Discord where I spend most of my time only showed 170w power draw during training. However, that was just after release and we weren't sure if that was normal or just the current level of support. Recently a user with both a 3090 and a 4090 commented that the 4090 is quiet and produces less heat than a 3090. I asked about power draw but have yet to see a reply. I'd be curious as to what you're seeing with nvidia-smi. 170w is 3060-level draw, and with 4090 performance that would be very nice.

Regarding Windows Task Manager, you won't find cuda broken out unless you disable resizable BAR. Microsoft updated Task Manager to include cuda in the 3D performance metrics with RB enabled. It works fine for the last two GPUs I've used and shows the same levels I see in other tools, and unlike nvidia-smi it has a graph. The only issue I have with it is that if I use the iGPU for display, which I usually do while training since it's a background task, FS will not show any GPU utilization, just VRAM used. Other cuda apps like Vegas, NVencC and Topaz do still show GPU utilization. I imagine it's down to the difference between how cuda is engaged via Python vs Windows APIs. Other tools, however, will still show GPU utilization.

Post by **Ryzen1988** » Wed Aug 30, 2023 5:18 pm

Okay, okay, I will run some tools for ya guys. I can say the 4090 really outperforms the A6000 (basically a 3090) by a royal 40-50%.
Unfortunately the memory size is not up to par with the A6000.

Timestamp : Wed Aug 30 19:11:38 2023
Driver Version : 536.99
CUDA Version : 12.2

Attached GPUs : 2
GPU 00000000:01:00.0
FB Memory Usage
Total : 24564 MiB
Reserved : 408 MiB
Used : 23178 MiB
Free : 977 MiB
BAR1 Memory Usage
Total : 32768 MiB
Used : 1 MiB
Free : 32767 MiB
Compute Mode : Default
Utilization
Gpu : 98 %
Memory : 79 %
Encoder : 0 %
Decoder : 0 %
JPEG : 0 %
OFA : 0 %
GPU Utilization Samples
Duration : 14.16 sec
Number of Samples : 71
Max : 99 %
Min : 20 %
Avg : 74 %
Memory Utilization Samples
Duration : 14.16 sec
Number of Samples : 71
Max : 83 %
Min : 6 %
Avg : 48 %
GPU Power Readings
Power Draw : 248.73 W
Current Power Limit : 450.00 W
Requested Power Limit : 450.00 W
Default Power Limit : 450.00 W
Min Power Limit : 10.00 W
Max Power Limit : 450.00 W
Power Samples
Duration : 2.61 sec
Number of Samples : 119
Max : 349.28 W
Min : 72.25 W
Avg : 239.76 W
Clocks
Graphics : 2610 MHz
SM : 2610 MHz
Memory : 10251 MHz
Video : 1980 MHz
Max Clocks
Graphics : 3135 MHz
SM : 3135 MHz
Memory : 10501 MHz
Video : 2415 MHz

Post by **HoloByteus** » Thu Aug 31, 2023 6:07 pm

Thanks, what model config? Assuming Phase-A, what output resolution, encoder used, Batch Size and EGs/s for this session?

248w is still short of 450w and 74% utilization would suggest it's not quite fully utilized although that's point in time utilization. I didn't expect training would need a flagship CPU but it's beginning to look like that may be the case. It's a shame we aren't able to leverage RTX I/O for this.

Thanks for running the test!

Post by **Ryzen1988** » Thu Aug 31, 2023 7:37 pm

It was a relatively light and small model, 256px in -> 384px out.
Batch size 16 if I am not mistaken. It worked really well, so now I am making a 300px in -> 512px out model with a way fatter FC.
I will post the performance/utilization for that one as well once it is running.

"output_size": 384,
"split_decoders": true,
"enc_architecture": "efficientnet_v2_b3",
"enc_scaling": 90,
"enc_load_weights": true,
"bottleneck_type": "max_pooling",
"bottleneck_norm": "rms",
"bottleneck_size": 512,
"bottleneck_in_encoder": true,
"fc_depth": 1,
"fc_min_filters": 512,
"fc_max_filters": 512,
"fc_dimensions": 6,
"fc_filter_slope": -0.5,
"fc_dropout": 0.0,
"fc_upsampler": "upscale_hybrid",
"fc_upsamples": 1,
"fc_upsample_filters": 448,
"dec_upscale_method": "upscale_hybrid",
"dec_upscales_in_fc": 0,
"dec_norm": "rms",
"dec_min_filters": 64,
"dec_max_filters": 384,
"dec_slope_mode": "full",
"dec_filter_slope": -0.33,
"dec_res_blocks": 1,
"dec_output_kernel": 3,
"dec_gaussian": false,
"dec_skip_last_residual": true,

Post by **HoloByteus** » Sat Sep 02, 2023 6:39 pm

I'm thinking putting a bit more work on the GPU with a move to EffV2-S @ 80% scaling might give you better workload distribution and the improved accuracy couldn't hurt.

Post by **Ryzen1988** » Mon Sep 04, 2023 7:42 pm

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 537.13                 Driver Version: 537.13       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090      WDDM  | 00000000:01:00.0  On |                  Off |
| 79%   64C    P2             308W / 450W |  23605MiB / 24564MiB |     96%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000             WDDM  | 00000000:05:00.0 Off |                  Off |
| 65%   36C    P0              62W / 300W |      0MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

==============NVSMI LOG==============
Attached GPUs : 2
GPU 00000000:01:00.0
FB Memory Usage
Total : 24564 MiB
Reserved : 408 MiB
Used : 23614 MiB
Free : 541 MiB
BAR1 Memory Usage
Total : 32768 MiB
Used : 1 MiB
Free : 32767 MiB
Utilization
Gpu : 98 %
Memory : 38 %
GPU Utilization Samples
Duration : 14.09 sec
Number of Samples : 71
Max : 99 %
Min : 37 %
Avg : 83 %
Memory Utilization Samples
Duration : 14.09 sec
Number of Samples : 71
Max : 49 %
Min : 4 %
Avg : 31 %
GPU Power Readings
Power Draw : 292.67 W
Current Power Limit : 450.00 W
Requested Power Limit : 450.00 W
Default Power Limit : 450.00 W
Min Power Limit : 10.00 W
Max Power Limit : 450.00 W
Power Samples
Duration : 2.92 sec
Number of Samples : 119
Max : 484.74 W
Min : 74.64 W
Avg : 291.33 W
Clocks
Graphics : 2310 MHz
SM : 2250 MHz
Memory : 10251 MHz
Video : 1845 MHz
Max Clocks
Graphics : 3135 MHz
SM : 3135 MHz
Memory : 10501 MHz
Video : 2415 MHz

A decent amount more usage with a somewhat fatter model than the other one posted, but at the same res.
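
To put the two nvidia-smi logs in this thread side by side, the averages can be expressed as a share of the 450 W power limit. A quick sketch using only the figures from the two logs above; the run labels and the helper name `power_fraction` are made up for illustration.

```python
"""Sketch: compare the two posted nvidia-smi runs against the power limit."""

POWER_LIMIT_W = 450.0  # "Current Power Limit" in both logs

# Averages copied from the "Samples" sections of the two logs in this thread.
runs = {
    "Aug 30 (lighter model)": {"avg_util_pct": 74, "avg_power_w": 239.76},
    "Sep 04 (fatter model)": {"avg_util_pct": 83, "avg_power_w": 291.33},
}


def power_fraction(avg_power_w: float, limit_w: float = POWER_LIMIT_W) -> float:
    """Average power draw as a fraction of the configured power limit."""
    return avg_power_w / limit_w


if __name__ == "__main__":
    for label, run in runs.items():
        share = power_fraction(run["avg_power_w"]) * 100
        print(f"{label}: avg util {run['avg_util_pct']}%, "
              f"avg power {run['avg_power_w']} W ({share:.1f}% of limit)")
```

That works out to roughly 53% of the limit for the first run and 65% for the second, so the fatter model loads the card more but still leaves headroom.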
