Different outputs when run on CPU vs GPU (CUDA) #21859

Open
lucian-cap opened this issue Aug 26, 2024 · 2 comments
Labels: ep:CUDA, model:transformer, stale

Comments

@lucian-cap

Describe the issue

I am attempting to export a HuggingFace model from PyTorch to ONNX. After exporting, I am trying to confirm the outputs are still correct; however, when the model is executed on the GPU using the CUDAExecutionProvider, its outputs are not close enough to the target embeddings produced by the model before export. When executed on the CPU, the model does pass the test.

Seems similar to issue #4488 but maybe a new CUDA version or something re-triggered it?
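
For reference, the failing comparison below uses torch.testing.assert_close with its default float32 tolerances (rtol=1.3e-6, atol=1e-5), which is a very tight bar for cross-device comparisons. A minimal sketch of loosening the tolerances, with purely illustrative values:

```python
import torch

# Illustrative only: widen the default float32 tolerances
# (rtol=1.3e-6, atol=1e-5) to absorb expected GPU float drift.
gold = torch.randn(1, 768)                    # stand-in for the gold embeddings
test = gold + 1e-4 * torch.randn_like(gold)   # stand-in for the ONNX GPU output
torch.testing.assert_close(gold, test, rtol=1e-4, atol=1e-3)
```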

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 22.04.4 LTS running on WSL2 via Windows 11 64bit
  • CPU: 13th Gen Intel(R) Core(TM) i9-13900K 3.00 GHz
  • GPU: Nvidia GeForce RTX 4090 24GB
  • ONNX Runtime installed from (source or binary): binary (via `pip install onnxruntime-gpu`)
  • ONNX Runtime version: 1.19.0
  • Python version: 3.8.19

To reproduce

```python
import torch
import torch.nn.functional as F

import onnxruntime

from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModel


def mean_pooling(last_hidden_state, attention_mask):
    '''Apply a mean pooling operation to the last hidden state output by the model.'''
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(last_hidden_state.size()).float()
    return torch.sum(last_hidden_state * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


def main():
    # Create a large input to make sure we hit the 384-token max window before exporting the model
    sentences = ['All work and no play makes Jack a dull boy. ' * 39]

    # Create gold embeddings using the SentenceTransformer module
    sent_tran_model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2').to('cuda')
    gold_embed = torch.tensor(sent_tran_model.encode(sentences))

    # Load the tokenizer and model from HuggingFace
    tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-mpnet-base-v2')
    model = AutoModel.from_pretrained('sentence-transformers/all-mpnet-base-v2').to('cuda')

    # Tokenize the input before passing it into the model
    encoded_input = tokenizer(sentences,
                              padding=True,
                              truncation=True,
                              return_tensors='pt',
                              max_length=384).to('cuda')

    # Export the model to ONNX
    torch.onnx.export(model,
                      (encoded_input['input_ids'], encoded_input['attention_mask']),
                      'all-mpnet-base-v2.onnx',
                      export_params=True,
                      opset_version=13,
                      do_constant_folding=True,
                      input_names=['input_ids', 'attention_mask'],
                      output_names=['last_hidden_state', 'pooler_output'],
                      dynamic_axes={'input_ids': {0: 'batch_size'},
                                    'attention_mask': {0: 'batch_size'},
                                    'last_hidden_state': {0: 'batch_size'},
                                    'pooler_output': {0: 'batch_size'}})

    ####################################################################################
    # Run the ONNX model on the CPU and show the embeddings are close to the targets
    ####################################################################################

    ort_session_cpu = onnxruntime.InferenceSession('all-mpnet-base-v2.onnx',
                                                   providers=['CPUExecutionProvider'])
    ort_outs_cpu = ort_session_cpu.run(('last_hidden_state', 'pooler_output'),
                                       {k: v.numpy(force=True) for k, v in encoded_input.items()})
    ort_outs_cpu = [torch.tensor(i) for i in ort_outs_cpu]

    # Apply the mean pooling and normalization operations described on the model's HuggingFace page
    pool_output_cpu = mean_pooling(ort_outs_cpu[0], encoded_input['attention_mask'].to('cpu'))
    sent_embed_cpu = F.normalize(pool_output_cpu, p=2, dim=1)

    # Assert the CPU embeddings are close to the targets (this passes)
    torch.testing.assert_close(gold_embed, sent_embed_cpu)

    ####################################################################################
    # Run the ONNX model on the GPU and show the embeddings are NOT close to the targets
    ####################################################################################

    ort_session_gpu = onnxruntime.InferenceSession('all-mpnet-base-v2.onnx',
                                                   providers=['CUDAExecutionProvider'])
    ort_outs_gpu = ort_session_gpu.run(('last_hidden_state', 'pooler_output'),
                                       {k: v.numpy(force=True) for k, v in encoded_input.items()})
    ort_outs_gpu = [torch.tensor(i).to('cuda') for i in ort_outs_gpu]

    # Apply the same mean pooling and normalization operations
    pool_output_gpu = mean_pooling(ort_outs_gpu[0], encoded_input['attention_mask'].to('cuda'))
    sent_embed_gpu = F.normalize(pool_output_gpu, p=2, dim=1)

    # Assert the GPU embeddings are close to the targets (this fails)
    torch.testing.assert_close(gold_embed.to('cuda'), sent_embed_gpu)


if __name__ == "__main__":
    main()
```
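
On recent NVIDIA GPUs (Ampere/Ada, including the RTX 4090), part of this drift can come from TF32 matmuls in cuBLAS. If that is the cause here, newer onnxruntime-gpu releases expose a use_tf32 option on the CUDA execution provider; a hedged sketch, assuming the option is available in your build:

```python
import onnxruntime

# Sketch: disable TF32 so fp32 MatMuls run in full float32 precision.
# The 'use_tf32' provider option exists in recent onnxruntime-gpu
# releases (1.17+); confirm against the installed version.
providers = [('CUDAExecutionProvider', {'use_tf32': '0'}), 'CPUExecutionProvider']
ort_session_gpu = onnxruntime.InferenceSession('all-mpnet-base-v2.onnx',
                                               providers=providers)
```

Any difference that remains after disabling TF32 is most likely ordinary accumulation-order variance between CPU and GPU kernels.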

Urgency

Somewhat urgent: attempting to optimize a model by converting it to ONNX so I can use it in NVIDIA Triton.

Platform

Linux

OS Version

22.04.4

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.19.0

ONNX Runtime API

Python

Architecture

X64

Execution Provider

Default CPU, CUDA

Execution Provider Library Version

CUDA 12.6

github-actions bot added the model:transformer and ep:CUDA labels on Aug 26, 2024
@tianleiwu (Contributor) commented Aug 27, 2024

I saw that the absolute difference is not large:

Greatest absolute difference: 0.00011079013347625732 at index (0, 573) (up to 1e-05 allowed)

I suggest using an end-to-end metric (like precision, recall, etc.) to measure the impact. Sometimes such a small difference does not matter for the real metric. For example, when converting a SQuAD model from float32 to float16, the absolute difference can be larger than 0.001, yet the end-to-end F1 metric is not impacted at all.
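
A minimal sketch of such an end-to-end check for sentence embeddings, using cosine similarity against the gold embeddings (the function name and threshold are illustrative, not part of any library):

```python
import torch
import torch.nn.functional as F

def embeddings_equivalent(gold_embed, onnx_embed, min_cosine=0.9999):
    '''Task-level check: treat the ONNX output as equivalent if every
    sentence embedding stays nearly parallel to its gold counterpart.
    Both inputs are (batch, dim) float tensors on the same device.'''
    cos = F.cosine_similarity(gold_embed, onnx_embed, dim=1)
    return bool((cos >= min_cosine).all())
```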

github-actions bot commented
This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

github-actions bot added the stale label on Sep 27, 2024