Training freeze at random iteration - HRESULT failed with 0x887a0005

Post by lighting »

I couldn't find a similar topic for this error code, so I'm starting a new one.
I'm using the Original model with all settings at their defaults. After some number of iterations (it may be 36 or 506, but rarely more than 1000), the process freezes. It can be stopped with the "Stop" button (although it does not respond to the stop signal), or it stops by itself after some timeout. In the second case, I see this error message in the console:

03/04/2023 17:49:39 INFO     Loading data, this may take a while...
03/04/2023 17:49:39 INFO     Loading Model from Original plugin...
03/04/2023 17:49:39 INFO     Using configuration saved in state file
03/04/2023 17:49:40 INFO     Loaded model from disk: 'C:\Users\user\Documents\faceswap\model\original.h5'
03/04/2023 17:49:41 INFO     Loading Trainer from Original plugin...

03/04/2023 17:49:48 INFO     [Saved models] - Average loss since last save: face_a: 0.05167, face_b: 0.04069

03/04/2023 17:49:49 INFO     [Preview Updated]

03/04/2023 17:50:40 INFO     [Saved models] - Average loss since last save: face_a: 0.05577, face_b: 0.04558

03/04/2023 17:50:40 INFO     [Preview Updated]

2023-03-04 18:10:39.223119: F tensorflow/c/logging.cc:43] HRESULT failed with 0x887a0005: readback_heap->Map(0, nullptr, &readback_heap_data)
Process exited.

I tried reinstalling faceswap, with no visible effect. Googling the error code turns up a few topics on Stack Overflow, but nothing useful for me.
My graphics card is an AMD RX580 8GB. Any idea why this happens? With a stop every few hundred iterations, it is almost impossible to reach any training result.

by torzdf » Mon Mar 06, 2023 10:01 am

Unfortunately, this is a timeout within DirectML, and it comes directly from the Tensorflow-DirectML plugin. There are some mitigation steps on the Tensorflow-DirectML GitHub:

https://github.com/microsoft/tensorflow ... imeouts.md
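
For reference, the code in the log can be decoded: 0x887a0005 is the value the Windows SDK defines as DXGI_ERROR_DEVICE_REMOVED, which is consistent with the driver resetting the GPU after a timeout. Below is a minimal Python sketch that pulls the code out of a log line and names it. The helper name find_hresult and the small lookup table are illustrative only; they are not part of faceswap or Tensorflow-DirectML.

```python
import re

# A few DXGI device-loss codes as defined in the Windows SDK headers.
# This table is deliberately tiny; it is not a complete HRESULT list.
DXGI_ERRORS = {
    0x887A0005: "DXGI_ERROR_DEVICE_REMOVED",
    0x887A0006: "DXGI_ERROR_DEVICE_HUNG",
    0x887A0007: "DXGI_ERROR_DEVICE_RESET",
}

# Matches the "HRESULT failed with 0x........" text seen in the console log.
HRESULT_RE = re.compile(r"HRESULT failed with (0x[0-9a-fA-F]{8})")


def find_hresult(log_text):
    """Return (code, name) for the first HRESULT failure found, or None."""
    match = HRESULT_RE.search(log_text)
    if match is None:
        return None
    code = int(match.group(1), 16)
    return code, DXGI_ERRORS.get(code, "unknown HRESULT")


if __name__ == "__main__":
    line = "F tensorflow/c/logging.cc:43] HRESULT failed with 0x887a0005: readback_heap->Map(...)"
    print(find_hresult(line))
```

A lookup like this only identifies the error; it does not change how often the timeout occurs.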

Last edited by torzdf on Mon Mar 06, 2023 10:02 am, edited 1 time in total.

Post by lighting »

Thank you for the link. The how-to doesn't actually solve the problem, but reducing the batch size to 4 helped make the freezes less frequent.

Post by bryanlyon »

One of the suggestions at the link that torzdf posted is to disable the timeout. The timeout exists to prevent your GPU from freezing. The suggestion of setting it to 10 seconds seems like a good idea, as it should still prevent freezes lasting longer than that, while allowing kernels that take longer than 2 seconds to run.
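
As a rough illustration of what changing the timeout involves: Windows documents the GPU timeout (TDR) delay as a DWORD value named TdrDelay, in seconds, under the GraphicsDrivers registry key. The Python sketch below only builds the text of a .reg file for a chosen delay; it does not touch the registry. The function name tdr_delay_reg_file is hypothetical, and the assumption that TdrDelay alone is the value to change should be checked against the linked how-to before importing anything.

```python
# Registry key that Windows documents for TDR (timeout detection and recovery).
GRAPHICS_DRIVERS_KEY = (
    r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers"
)


def tdr_delay_reg_file(delay_seconds):
    """Build .reg file text that sets TdrDelay to delay_seconds (a DWORD)."""
    if not isinstance(delay_seconds, int) or not 0 < delay_seconds <= 0xFFFFFFFF:
        raise ValueError("delay_seconds must be a positive 32-bit integer")
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        "[" + GRAPHICS_DRIVERS_KEY + "]",
        # DWORD values in .reg files are written as 8 hexadecimal digits.
        '"TdrDelay"=dword:{:08x}'.format(delay_seconds),
        "",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Print the file contents for the 10-second delay discussed above.
    print(tdr_delay_reg_file(10))
```

Importing such a file needs administrator rights, and TDR registry changes take effect only after a restart.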

Post by lighting »

Yes, I read that, and my delay is now set to 60 seconds. I did this before creating this thread. As I said earlier, in my case changing the batch size is what really helps.
