Hard crash and reboot - wit's end.


Locked
Akkr_
Posts: 2
Joined: Sat Jul 11, 2020 3:07 am
Has thanked: 1 time

Post by Akkr_ »

Quite a complicated set of issues.

Let's start with the specs:
i7-5850HQ (iGPU disabled)
RX 580 8GB
28GB 1866MHz DDR3
It's a weird setup, and while I know the sketchy idea of having a laptop chip frankensteined by some unknown internet merchant is frowned upon, it works great for me, until it doesn't. The low power draw and 128MB of L4 cache make it all worth it.

I used to flash my cards myself with a hex editor, so regedit, DDU, etc. are all things I've tried. However, this card was never flashed and rarely used for any intensive workloads.

When I first installed faceswap, whenever I would attempt to extract it would hard crash and reboot my computer. OK, I've seen this before, a while back while playing a game, specifically when loading assets, going from lobby to in-game or vice versa (each and every time), and I still don't know for certain why it happened. But eventually it worked itself out after rolling back every driver and smashing my head on my keyboard.

So, I figured I could get this thing to run, and after loading up DDU, "Eureka!": I had Intel HD drivers still there. It must have been that, so I uninstalled them along with my current Radeon driver and reinstalled 19.2.2 (known to be stable), and I was able to load up faceswap with no issues. I ran my extractions, ran the training all night, and ran a conversion; everything worked fine. As well, I extracted using CPU only when my PC was in this failing state and it worked fine, so whatever it was, it's got to be on the GPU side of things.

Worse still, when I gave up for the day and figured I'd play a round or two to clear my head and test for driver/hardware issues, the crashes returned. So I ran DDU, reinstalled my latest drivers, ran CCleaner, and all of a sudden my game was playable again, but faceswap still wasn't. Even when setting the trainer to lightweight, batch size 4, 10k iterations, and power limit 50% to soften the possibility of some power spike issue, it still crashes my computer and cycles a reboot.

And this only occurs when the GPU enters any sort of "significant" load; it never happens when browsing or watching videos.

Power spike on a dying power supply? I've RMA'ed it: 1200W 80+ Gold, less than a year old, never used for mining, and I've checked all my connections. I've also tested and validated the CPU with various benchmarks and stress tests; its super sketchy nature has never been an issue. I would say this whole crashing stuff only really started around the new year, give or take a month or two. I think it's Windows 20-04, or possibly the last major update before it?

Where do I go from here?

torzdf
Posts: 2649
Joined: Fri Jul 12, 2019 12:53 am
Answers: 159
Has thanked: 128 times
Been thanked: 622 times

Re: Hard crash and reboot - wit's end.

Post by torzdf »

Honestly, at this point I won't be able to offer any further advice. Persistent hard crashes are nearly always hardware related.

I would say power issues, but your PSU is serious overkill, so it's unlikely to be that. I would say, make sure that any factory overclocks on your card are disabled.

My word is final

Akkr_
Posts: 2
Joined: Sat Jul 11, 2020 3:07 am
Has thanked: 1 time

Re: Hard crash and reboot - wit's end.

Post by Akkr_ »

Thanks, I've definitely played with the card's clock settings: vanilla, OC, UC, UV and power limit. Temps were never really an issue. I think they topped out at 80°C previously when I was able to run training, but I set my fans to keep it under 70, which they did quite comfortably.

I think my card might be running at x8, so I'll have a look in my BIOS and pull out some PCIe cards if I can, but hopefully that's not the issue.

*UPDATE #1*

So, I finally gave up and did a Windows refresh for the first time in my life. I realized my registry was more of a mess than I thought, and well, sometimes you just have to burn her down. Didn't work.

It turns out that yes, even in my BIOS my card was showing x8, which is strange, so I pulled a card out of my second CPU-bound PCIe slot. Should go to x16 and work fine, right? Wrong. I loaded up GPU-Z and it did indeed show x8, but at PCIe gen 1.1 speeds, or 2.5GT/s. OK, apparently this is somewhat normal as a power-saving feature and it should go to x16 3.0 under load. Wrong. It stayed like it was; worse, the issue began to present itself again in the benchmark tool. When I re-entered the BIOS, even my board navigator was saying the link was at 2.5GT/s, something it didn't display before. It's a dual-BIOS card and I tried both. Something must be going off, some component must be dying or fried. Wrong again.
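For a sense of scale, here is a rough sketch of the per-direction bandwidth of the two link states mentioned above (x8 at gen 1.1 versus x16 at gen 3). The transfer rates and line encodings are the standard published figures for each PCIe generation; the function and table names are made up for illustration, and real-world throughput will be lower because of protocol overhead.

```python
# Theoretical per-direction PCIe bandwidth from transfer rate,
# line encoding and lane count. Names here are illustrative only.

# generation -> (transfer rate in GT/s, payload bits, total bits on the wire)
PCIE_GENERATIONS = {
    1: (2.5, 8, 10),      # gen 1.x uses 8b/10b encoding
    2: (5.0, 8, 10),      # gen 2 uses 8b/10b encoding
    3: (8.0, 128, 130),   # gen 3 uses 128b/130b encoding
}


def link_bandwidth_gb_per_s(generation, lanes):
    """Return usable bandwidth in GB/s for one direction of the link."""
    rate_gt, payload_bits, total_bits = PCIE_GENERATIONS[generation]
    usable_gbit_per_lane = rate_gt * payload_bits / total_bits
    return usable_gbit_per_lane * lanes / 8  # 8 bits per byte


stuck_link = link_bandwidth_gb_per_s(1, 8)    # x8 at gen 1.1
full_link = link_bandwidth_gb_per_s(3, 16)    # x16 at gen 3

print(f"x8  gen 1.1: {stuck_link:.2f} GB/s")
print(f"x16 gen 3:   {full_link:.2f} GB/s")
```

A link stuck at x8 gen 1.1 has roughly an eighth of the bandwidth of x16 gen 3. That would slow transfers down, but on its own it would not be expected to explain a hard reboot.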

I ran my motherboard manufacturer's update wizard, noticed no changes, uninstalled the bloat it came with, and ran DDU yet again, this time reinstalling my 19.2.2 driver. Fire up GPU-Z and? x16 @ 1.1, progress!?! I ran the benchmark, it quickly jumped to PCIe gen 3, and everything was running great.

I did notice in DDU that I had two GPUs listed this time; GPU0 and GPU1 were identical, which is odd. Afterburner didn't show anything of the sort, nor did Device Manager, etc. That reminded me that when loading up faceswap, I do get the message "Using GPU: ['opencl_amd_ellesmere.0', 'opencl_amd_ellesmere.0']" even though I am running only one card. When I tested this on another machine (with poor results for other reasons), I saw it had only enumerated the Intel HD Graphics device name once.
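A quick way to spot this kind of double listing is to count repeated device IDs. This is only a sketch: the list is copied from the log message quoted above, the helper name is hypothetical, and nothing here reflects how faceswap itself builds that list.

```python
from collections import Counter


def duplicated_devices(device_ids):
    """Return a dict of device IDs that appear more than once, with their counts."""
    counts = Counter(device_ids)
    return {device: n for device, n in counts.items() if n > 1}


# Device list as printed in the "Using GPU: [...]" message above.
reported = ['opencl_amd_ellesmere.0', 'opencl_amd_ellesmere.0']

duplicates = duplicated_devices(reported)
print(duplicates)
```

The same ID with the same `.0` suffix appearing twice may point to one physical card being registered twice (for example by leftover driver entries) rather than to two cards, though that is a guess.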

*UPDATE #2*

It still crashes when doing anything, always when first loading up, never in the middle of anything. My guess is power delivery, which would be really weird given the PSU is nearly mint; hopefully it's just a bad 8-pin. I rolled up my sleeves and tested with a spare RX 480 and had identical results, so the obvious next test is to dig out another 8-pin cable from my box in storage and, if that fails, swap the PSU (elbow grease :x ). If that too fails validation, then hey, it's the motherboard or the chip. Luckily I have a spare Celeron and a spotty H97 mining board.

cyanspark
Posts: 5
Joined: Fri Mar 18, 2022 3:37 am

Re: Hard crash and reboot - wit's end.

Post by cyanspark »

Haha! Watt an adventure you've had with the PCI slots and the 8 pin ribbons. I think my one comes from Gygabike ultradura bullet never die edition MB, so I think it's the PSU, called a no-name viper super venom bite or something....

Fortunately I have a high current gamer Antec on the side, new, will see if that fixes things.

I took the car 2 hours to Geneva and 1300 euros and got a Sooooper Dooooper AMD 2950 GTX 2080 8GB etc etc computer to last me ten years until 2031 :D However, FaceSwap crashed out my new PC!!! TWICE!!! after 2 minutes of "Train"... Especially when num concurrent images was set to 128 and then 32. Third time round I left it at 16 and it didn't crash for 15 minutes, until I gave up with only 6000 iterations.

Difficult times, faceswap. I will write an update here if I have a weird adventure on my PC, or if I find the crash is related to the concurrent images in training.

cyanspark
Posts: 5
Joined: Fri Mar 18, 2022 3:37 am

Re: Hard crash and reboot - wit's end.

Post by cyanspark »

Ok, so I upgraded from a 630W economy PSU to a 750W Antec pro gold and it fixed the issue. Then I read in the guide: "at startup faceswap consumes higher energy than games do and can crash a PC through low voltages" ...

That's kinda weird. I wonder which voltage rail was causing the PC to crash, and why, from a RAM/CPU/GPU technical front? Perhaps there could be a checkbox option to stagger processing when it's near 100 percent of total system resources.
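One way to picture why a bigger PSU could help is a back-of-the-envelope headroom check. The sketch below is only an illustration: the 630W and 750W ratings come from this post, but the sustained load and the transient spike factor are made-up placeholder numbers, not measurements of this machine, and the calculation ignores per-rail limits and PSU quality.

```python
def headroom_watts(psu_rating_w, sustained_load_w, transient_factor):
    """Return PSU rating minus estimated peak draw (negative means over budget)."""
    peak_draw_w = sustained_load_w * transient_factor
    return psu_rating_w - peak_draw_w


# HYPOTHETICAL values, chosen only to illustrate the idea.
ASSUMED_SUSTAINED_LOAD_W = 450   # assumed whole-system draw under training load
ASSUMED_TRANSIENT_FACTOR = 1.5   # assumed short spike multiplier at startup

margins = {}
for rating_w in (630, 750):      # the two PSU ratings mentioned in the post
    margins[rating_w] = headroom_watts(
        rating_w, ASSUMED_SUSTAINED_LOAD_W, ASSUMED_TRANSIENT_FACTOR
    )
    print(f"{rating_w}W PSU: headroom {margins[rating_w]:+.0f}W")
```

With these placeholder numbers the 630W unit ends up short during the spike while the 750W unit has some margin left, which would at least be consistent with crashes happening right at startup rather than mid-run.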
