Native API failed. Native API returns: -2 (PI_ERROR_DEVICE_NOT_AVAILABLE) -2 (PI_ERROR_DEVICE_NOT_AVAILABLE) #643

Open
MaksimDanilov opened this issue May 30, 2024 · 21 comments
Assignees
Labels
Windows, XPU/GPU (XPU/GPU specific issues)

Comments

@MaksimDanilov

MaksimDanilov commented May 30, 2024

Describe the bug

I made a program that waits for an input image to be segmented by the U2NET model. After some inputs (I can't determine the frequency, as it is always random), I start getting DEVICE_NOT_AVAILABLE errors when I send a tensor to the XPU. Can I somehow get debug logs to post here?

* I noticed that with the same program I can't reproduce this error on the CPU.
** It happens only once I have transferred data to the model; simply moving data to the XPU works fine.

Versions

Collecting environment information...
PyTorch version: 2.1.0.post2+cxx11.abi
PyTorch CXX11 ABI: No
IPEX version: 2.1.30+xpu
IPEX commit: 474a6b3
Build type: Release

OS: Microsoft Windows 11 Home
GCC version: N/A
Clang version: N/A
IGC version: 2024.1.0 (2024.1.0.20240308)
CMake version: version 3.28.0-msvc1
Libc version: N/A

Python version: 3.11.9 | packaged by Anaconda, Inc. | (main, Apr 19 2024, 16:40:41) [MSC v.1916 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.22631-SP0
Is XPU available: True
DPCPP runtime version: N/A
MKL version: N/A
GPU models and configuration:
[0] _DeviceProperties(name='Intel(R) Iris(R) Xe Graphics', platform_name='Intel(R) Level-Zero', dev_type='gpu', driver_version='1.3.29283', has_fp64=0, total_memory=7167MB, max_compute_units=96, gpu_eu_count=96)
Intel OpenCL ICD version: N/A
Level Zero version: N/A

CPU:

Revision=

Versions of relevant libraries:
[pip3] intel-extension-for-pytorch==2.1.30+xpu
[pip3] numpy==1.26.1
[pip3] torch==2.1.0.post2+cxx11.abi
[pip3] torchaudio==2.1.0.post2+cxx11.abi
[pip3] torchvision==0.16.0.post2+cxx11.abi
[conda] intel-extension-for-pytorch 2.1.30+xpu pypi_0 pypi
[conda] numpy 1.26.1 pypi_0 pypi
[conda] torch 2.1.0.post2+cxx11.abi pypi_0 pypi
[conda] torchaudio 2.1.0.post2+cxx11.abi pypi_0 pypi
[conda] torchvision 0.16.0.post2+cxx11.abi pypi_0 pypi

Code

import torch
import intel_extension_for_pytorch as ipex

import logging

# Configure logging so the INFO messages below are actually emitted
# (matches the "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
# format seen in the attached logs).
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')


class TinyModel(torch.nn.Module):
    def __init__(self):
        super(TinyModel, self).__init__()

        self.linear1 = torch.nn.Linear(750, 1000)
        self.activation = torch.nn.ReLU()
        self.linear2 = torch.nn.Linear(1000, 2)
        self.softmax = torch.nn.Softmax(dim=1)  # explicit dim avoids the implicit-dim deprecation warning

    def forward(self, x):
        x = self.linear1(x)
        x = self.activation(x)
        x = self.linear2(x)
        x = self.softmax(x)
        return x


device = 'xpu'
dtype = torch.float32

model = TinyModel()
model.eval() # Set the model to evaluation mode for inference, as required by ipex.optimize() function.
data = torch.zeros(750, dtype=dtype)[None]

model = model.to(device)
model = ipex.optimize(model, weights_prepack=False)

if __name__ == "__main__":
    while True:
        try:
            input('Press Any Key To Continue...')

            logging.info(f'Available: {ipex.xpu.is_available()}.')
            logging.info('Started.')

            with torch.no_grad():
                model(data.to(device))

            logging.info('Forwarded.')

        except KeyboardInterrupt:
            break

        except EOFError:
            break

        except Exception as ex:
            logging.exception(ex)
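A reproducer like the one above can be hardened against this intermittent failure with a CPU fallback. This is a minimal sketch, not part of the original report: run_with_fallback is a hypothetical helper, and the commented usage assumes a second, CPU-resident copy of the model.

```python
import logging

def run_with_fallback(forward_xpu, forward_cpu, retries=1):
    """Try the XPU forward; on RuntimeError (e.g. PI_ERROR_DEVICE_NOT_AVAILABLE),
    retry up to `retries` extra times, then fall back to the CPU forward."""
    for attempt in range(retries + 1):
        try:
            return forward_xpu()
        except RuntimeError as ex:
            logging.warning("XPU forward failed (attempt %d): %s", attempt + 1, ex)
    return forward_cpu()

# Hypothetical usage with the reproducer above (model_cpu is a CPU copy of model):
# result = run_with_fallback(lambda: model(data.to('xpu')),
#                            lambda: model_cpu(data))
```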
@nazneenn nazneenn self-assigned this May 31, 2024
@nazneenn

Hi @MaksimDanilov, thanks for reporting this issue. Could you please provide the traceback from your terminal by redirecting output to a file (note that "&>" is bash syntax; on Windows cmd use "> log.txt 2>&1"):
python script.py > log.txt 2>&1

Please also verify the driver installation version from here:
https://intel.github.io/intel-extension-for-pytorch/index.html#installation?platform=gpu

@MaksimDanilov

MaksimDanilov commented May 31, 2024

(screenshot anydesk00000 attached)

log.txt

@MaksimDanilov

Updated the issue description, because it only reproduces if I send data to the model.

@nazneenn

Could you provide us with a minimal code reproducer for this issue, as well as the test image? Thanks.

@nazneenn nazneenn added the XPU/GPU (XPU/GPU specific issues) label May 31, 2024
@MaksimDanilov

MaksimDanilov commented May 31, 2024

Okay. I'll try to put together a minimal example over the weekend.

@MaksimDanilov

MaksimDanilov commented Jun 1, 2024

Added code to the description (it's not production code, but the error occurs in it). Also pasted a log with timing measurements from when the exception was thrown.
log.txt

@nazneenn

nazneenn commented Jun 3, 2024

Thank you for providing the sample code. I could not reproduce the same error on my end.

Could you check whether the Intel® oneAPI Base Toolkit 2024.1.0 is installed and sourced correctly? In your collected environment information, I noticed that the DPCPP and MKL versions are listed as N/A.
Please run the sanity check below to confirm that the correct version is installed, following this documentation for Windows:
https://intel.github.io/intel-extension-for-pytorch/index.html#installation?platform=gpu&version=v2.1.30%2bxpu&os=windows&package=pip

call {DPCPPROOT}\env\vars.bat
call {MKLROOT}\env\vars.bat
python -c "import torch; import intel_extension_for_pytorch as ipex; print(torch.__version__); print(ipex.__version__); [print(f'[{i}]: {torch.xpu.get_device_properties(i)}') for i in range(torch.xpu.device_count())];"

@MaksimDanilov

Okay. I didn't invoke those two .bat files like that; I use the setvars.bat in the root directory of oneAPI.

@MaksimDanilov

MaksimDanilov commented Jun 3, 2024

(screenshot attached)

collect_env gives the same result: the versions show as N/A.

@MaksimDanilov

MaksimDanilov commented Jun 3, 2024

Maybe I should set the root vars explicitly; I'll try that now.
I think collect_env.py is not adapted to fetch that data on Windows.
(screenshot attached)

@MaksimDanilov

@nazneenn Maybe I can search for debug logs, if they exist, to help you understand this problem a bit more?

@ziyanxzy

ziyanxzy commented Jun 7, 2024

I have the same problem. When I choose the "piqa" dataset with limit 5, it works; but if I use limit 10/20/50 or more cases, it does not.
(screenshot attached)

@nazneenn

nazneenn commented Jun 7, 2024

@nazneenn Maybe I can search for debug logs, if they exist, to help you understand this problem a bit more?

@MaksimDanilov, you may try any debugger like gdb or python debugger to get more details on the issue and report back what you see when the core dump happens:
python -m pdb your_script.py
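Beyond pdb, the DPC++ runtime can emit its own traces. This is an assumption on the editor's part, not something suggested in this thread: the oneAPI DPC++ runtime documents a SYCL_PI_TRACE environment variable (1 = basic tracing, 2 = trace all plugin-interface calls), which should show which PI call returns -2. On Windows cmd that might look like:

```shell
:: Assumed debugging setup (SYCL_PI_TRACE is a DPC++ runtime variable;
:: your_script.py stands in for the actual reproducer script)
set SYCL_PI_TRACE=2
python your_script.py > log.txt 2>&1
```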

@ziyanxzy

ziyanxzy commented Jun 7, 2024

(screenshot attached)

@nazneenn Maybe I can search for debug logs if exists to help you understand this problem a bit more?

@MaksimDanilov, you may try any debugger like gdb or python debugger to get more details on the issue and report back what you see when the core dump happens: python -m pdb your_script.py
(screenshot attached)

@MaksimDanilov

MaksimDanilov commented Jun 10, 2024

@nazneenn, I can always reproduce this problem now. Step-by-step guide:

  1. Prepare a loop-forever script.
  2. Init the model and send it to the XPU with optimization.
  3. Send a tensor to the GPU.
  4. Wait ~6-7 minutes.
  5. Repeat steps 2 and 3.
  6. Get this error. In the system logs, at the moment I tried to push data to the tensor, I see "Display driver igfx stopped responding and has successfully recovered". (When I tried to change TdrDelay to 60 seconds, my device froze forever.)
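If the root cause is really the driver resetting the device after several idle minutes (the "igfx stopped responding" TDR event in step 6), one workaround to try is a keep-alive thread that touches the device periodically so it never sits idle that long. This is a sketch under that assumption, not a confirmed fix; start_keepalive is a hypothetical helper, and the touch callable would be a tiny XPU op such as lambda: torch.zeros(1, device='xpu').

```python
import threading

def start_keepalive(touch, interval_s=60.0):
    """Call `touch()` every `interval_s` seconds on a daemon thread.
    Returns a threading.Event; call .set() on it to stop the loop."""
    stop = threading.Event()

    def loop():
        # Event.wait returns False on timeout, True once stop is set.
        while not stop.wait(interval_s):
            try:
                touch()  # e.g. lambda: torch.zeros(1, device='xpu')
            except Exception:
                pass  # device may already be lost; keep trying anyway

    threading.Thread(target=loop, daemon=True).start()
    return stop
```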

@MaksimDanilov

@nazneenn any help here?

@nazneenn

Hi @MaksimDanilov ,
Has this issue been resolved? In #659 you mentioned that you resolved it by compiling the extension with AOT='dg1'.
Thanks

@MaksimDanilov

Hi @MaksimDanilov , Has this issue been resolved? In #659 you mentioned that you resolved it by compiling the extension with AOT='dg1'. Thanks

Hi :-)
No. After I compiled the DLL, I can't reduce AOT, so the related task is about that. I don't know what to do now, because I need two separate processes: one keeping the model loaded on the GPU, and one working like a command terminal to run tasks while the main program executes.

I still have the problem when pushing a tensor to the GPU after 5-6 minutes (the first try passes successfully). When I launch a second process and my GPU memory runs low, I catch a similar error. Maybe the problem is swapping or something like that, because the loading freezes after some time.
Every time it fails, the system events show that the igfx driver was restarted by Windows.

@yinghu5 yinghu5 self-assigned this Jul 18, 2024
@yinghu5

yinghu5 commented Jul 19, 2024

Hi @MaksimDanilov ,
I tried to reproduce the issue on an Intel Iris Xe Graphics + i5-1135G7 with 16 GB, and everything works fine with the latest IPEX 2.1.30 post version. I posted my steps below; could you please try again and see if they resolve the problem?
(Please also attach your Windows performance monitor showing GPU and memory. I suspect you may not need to build IPEX from source, as the latest IPEX 2.1.30 should be fine for Intel Iris Xe Graphics.)

1. set https_proxy=http://xxx.intel.com:xxx

2. conda create -n ipex_311 python=3.11

3. If creating the env fails because of the proxy, edit .condarc to add the proxy. Because the defaults channel points to anaconda, follow these steps, then retry creating the conda env:
   conda config --get
   conda config --add channels conda-forge
   conda config --remove channels defaults

4. conda activate ipex_311

5. conda install pkg-config libuv

6. python -m pip install torch==2.1.0.post2 torchvision==0.16.0.post2 torchaudio==2.1.0.post2 intel-extension-for-pytorch==2.1.30.post0 --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/

7. pip install dpcpp-cpp-rt==2024.1.2 mkl-dpcpp==2024.1

8. python -m pip install setuptools==69.5.1
   python -m pip install numpy==1.26.4

9. Verify the env is correct: python -c "import torch; import intel_extension_for_pytorch as ipex; print(torch.__version__); print(ipex.__version__); [print(f'[{i}]: {torch.xpu.get_device_properties(i)}') for i in range(torch.xpu.device_count())];"

10. Run your workload: python test.py
import torch
import intel_extension_for_pytorch as ipex

import logging

# Configure logging so the INFO messages below are actually emitted
# (matches the "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
# format seen in the log output below).
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')

class TinyModel(torch.nn.Module):
    def __init__(self):
        super(TinyModel, self).__init__()

        self.linear1 = torch.nn.Linear(750, 1000)
        self.activation = torch.nn.ReLU()
        self.linear2 = torch.nn.Linear(1000, 2)
        self.softmax = torch.nn.Softmax(dim=1)  # explicit dim avoids the implicit-dim deprecation warning

    def forward(self, x):
        x = self.linear1(x)
        x = self.activation(x)
        x = self.linear2(x)
        x = self.softmax(x)
        return x

device = 'xpu'
dtype = torch.float32

model = TinyModel()
model.eval() # Set the model to evaluation mode for inference, as required by ipex.optimize() function.
data = torch.zeros(750, dtype=dtype)[None]

model = model.to(device)
model = ipex.optimize(model, weights_prepack=False)

if __name__ == "__main__":
    while True:
        try:
            input('Press Any Key To Continue...')

            logging.info(f'Available: {ipex.xpu.is_available()}.')
            logging.info('Started.')

            with torch.no_grad():
                model(data.to(device))

            logging.info('Forwarded.')

        except KeyboardInterrupt:
            break

        except EOFError:
            break

        except Exception as ex:
            logging.exception(ex)

(screenshot attached)

@MaksimDanilov

MaksimDanilov commented Jul 19, 2024

@yinghu5 Hi, thanks for the reply.
I tried to do as you said, but got the same problem. Could you wait a bit longer (~10 minutes) after the first pass?
(screenshots anydesk00003, anydesk00004 attached)

@yinghu5

yinghu5 commented Aug 16, 2024

@MaksimDanilov , thank you for raising the issue to the driver team; I will check with them internally.
I can reproduce the problem by waiting a bit longer (~10 minutes) after the first pass. It seems the driver disconnects the CPU and XPU after some idle period.

Press Any Key To Continue...
2024-08-16 08:49:19,646 - root - INFO - Available: True.
2024-08-16 08:49:19,646 - root - INFO - Started.
2024-08-16 08:49:19,646 - root - INFO - Forwarded.
Press Any Key To Continue...
2024-08-16 09:01:54,091 - root - INFO - Available: True.
2024-08-16 09:01:54,091 - root - INFO - Started.
2024-08-16 09:01:54,615 - root - ERROR - Native API failed. Native API returns: -2 (PI_ERROR_DEVICE_NOT_AVAILABLE) -2 (PI_ERROR_DEVICE_NOT_AVAILABLE)
Traceback (most recent call last):
  File "C:\Users\yhu5\test.py", line 44, in <module>
    model(data.to(device))
RuntimeError: Native API failed. Native API returns: -2 (PI_ERROR_DEVICE_NOT_AVAILABLE) -2 (PI_ERROR_DEVICE_NOT_AVAILABLE)
Press Any Key To Continue...
