Different outputs when run on CPU vs GPU (CUDA) #21859
Labels
- `ep:CUDA`: issues related to the CUDA execution provider
- `model:transformer`: issues related to a transformer model: BERT, GPT2, Hugging Face, Longformer, T5, etc.
- `stale`: issues that have not been addressed in a while; categorized by a bot
Describe the issue
I am attempting to export a model from HuggingFace from PyTorch to Onnx. After exporting the model, I am trying to confirm the outputs are still correct however it appears that when executing the model on the GPU using the CUDAExecutionProvider the outputs of the model are not close enough to the target embeddings produced by the model before exporting. When executing the model on the CPU however, the model does pass the test.
Seems similar to issue #4488 but maybe a new CUDA version or something re-triggered it?
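For comparisons like this, the tolerance matters: FP32 GPU kernels typically match CPU results only to roughly 1e-3/1e-4, not bit-for-bit. A minimal sketch (with made-up values standing in for the two providers' outputs) showing how the choice of `rtol`/`atol` decides whether such a check passes:

```python
import numpy as np

# Simulated outputs: GPU kernels reorder reductions, so a small FP32
# drift between providers is expected even for a correct export.
cpu_out = np.array([0.123456, -0.654321, 0.999999], dtype=np.float32)
gpu_out = cpu_out + np.float32(1e-4) * np.array([1, -1, 1], dtype=np.float32)

# A near-exact comparison fails...
print(np.allclose(cpu_out, gpu_out, rtol=1e-7, atol=1e-8))  # False
# ...while a tolerance appropriate for FP32 GPU math passes.
print(np.allclose(cpu_out, gpu_out, rtol=1e-3, atol=1e-4))  # True
```

If the GPU outputs differ from the reference by much more than this (e.g. 1e-1), that points to a real bug rather than ordinary FP32 drift.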
To reproduce
```python
import torch
import onnxruntime
import torch.nn.functional as F
import numpy as np
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModel


def mean_pooling(last_hidden_state, attention_mask):
    '''Apply a mean pooling operation to the last hidden state output by the model'''
    # Standard sentence-transformers mean pooling: average token embeddings,
    # counting only non-padding positions indicated by the attention mask.
    mask = attention_mask.unsqueeze(-1).expand(last_hidden_state.size()).float()
    return torch.sum(last_hidden_state * mask, 1) / torch.clamp(mask.sum(1), min=1e-9)


def main():
    ...  # body not included in the original report


if __name__ == "__main__":
    main()
```
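A common cause of CPU-vs-CUDA mismatches (not confirmed for this specific model) is that FP32 addition is not associative, and GPU reduction kernels accumulate sums in a different order than the CPU. A tiny deterministic illustration:

```python
import numpy as np

# FP32 addition is not associative: regrouping the same three
# operands changes the result, because -1e8 + 1 rounds back to -1e8
# (the spacing between float32 values near 1e8 is 8).
a = np.float32(1e8)
b = np.float32(-1e8)
c = np.float32(1.0)

print((a + b) + c)  # 1.0
print(a + (b + c))  # 0.0
```

A GPU tree reduction and a CPU sequential sum are exactly this kind of regrouping at scale, which is why small per-element differences between execution providers are expected rather than a sign of a broken export.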
Urgency
Somewhat urgent; I am trying to optimize a model with ONNX so I can serve it from NVIDIA Triton.
Platform
Linux
OS Version
22.04.4
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.19.0
ONNX Runtime API
Python
Architecture
X64
Execution Provider
Default CPU, CUDA
Execution Provider Library Version
CUDA 12.6