[Bug]: Batch embedding inference is inconsistent with hf

Below is the minimal reproduction script, you may firstly setup an embedding server of 'intfloat/multilingual-e5-large-instruct' on 8000 port.
[batch_embedding.txt](https://github.com/user-attachments/files/19429471/batch_embedding.txt)

### Your current environment

PyTorch version: 2.5.1+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.27.9
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-52-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.3.107
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA A40
GPU 1: NVIDIA A40
GPU 2: NVIDIA A40
GPU 3: NVIDIA A40
GPU 4: NVIDIA A40
GPU 5: NVIDIA A40
GPU 6: NVIDIA A40
GPU 7: NVIDIA A40

Nvidia driver version: 535.161.08
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.7
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   46 bits physical, 57 bits virtual
Byte Order:                      Little Endian
CPU(s):                          86
On-line CPU(s) list:             0-85
Vendor ID:                       GenuineIntel
Model name:                      Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
CPU family:                      6
Model:                           106
Thread(s) per core:              2
Core(s) per socket:              43
Socket(s):                       1
Stepping:                        6
BogoMIPS:                        5187.80
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid md_clear arch_capabilities
Virtualization:                  VT-x
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       2.7 MiB (86 instances)
L1i cache:                       2.7 MiB (86 instances)
L2 cache:                        172 MiB (43 instances)
L3 cache:                        16 MiB (1 instance)
NUMA node(s):                    1
NUMA node0 CPU(s):               0-85
Vulnerability Itlb multihit:     KVM: Mitigation: VMX disabled
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Mitigation; Clear CPU buffers; SMT Host state unknown
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Retbleed:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Mitigation; Clear CPU buffers; SMT Host state unknown

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-dali-cuda120==1.32.0
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] nvidia-pyindex==1.0.9
[pip3] onnx==1.15.0rc2
[pip3] optree==0.10.0
[pip3] pynvml==11.4.1
[pip3] pytorch-quantization==2.1.2
[pip3] pyzmq==25.1.2
[pip3] sentence-transformers==3.2.1
[pip3] torch==2.5.1
[pip3] torch-tensorrt==2.2.0a0
[pip3] torchaudio==2.5.1
[pip3] torchdata==0.7.0a0
[pip3] torchtext==0.17.0a0
[pip3] torchvision==0.20.1
[pip3] transformers==4.49.0
[pip3] triton==3.1.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.4.post2.dev240+g7c4f9883.d20250321
vLLM Build Flags:
CUDA Archs: 5.2 6.0 6.1 7.0 7.2 7.5 8.0 8.6 8.7 9.0+PTX; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PHB     PHB     PHB     PHB     PHB     PHB     PHB     0-85    0               N/A
GPU1    PHB      X      PHB     PHB     PHB     PHB     PHB     PHB     0-85    0               N/A
GPU2    PHB     PHB      X      PHB     PHB     PHB     PHB     PHB     0-85    0               N/A
GPU3    PHB     PHB     PHB      X      PHB     PHB     PHB     PHB     0-85    0               N/A
GPU4    PHB     PHB     PHB     PHB      X      PHB     PHB     PHB     0-85    0               N/A
GPU5    PHB     PHB     PHB     PHB     PHB      X      PHB     PHB     0-85    0               N/A
GPU6    PHB     PHB     PHB     PHB     PHB     PHB      X      PHB     0-85    0               N/A
GPU7    PHB     PHB     PHB     PHB     PHB     PHB     PHB      X      0-85    0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NVIDIA_VISIBLE_DEVICES=all
CUBLAS_VERSION=12.3.4.1
NVIDIA_REQUIRE_CUDA=cuda>=9.0
CUDA_CACHE_DISABLE=1
TORCH_CUDA_ARCH_LIST=5.2 6.0 6.1 7.0 7.2 7.5 8.0 8.6 8.7 9.0+PTX
NCCL_VERSION=2.19.3
NVIDIA_DRIVER_CAPABILITIES=compute,utility,video
NVIDIA_PRODUCT_NAME=PyTorch
CUDA_VERSION=12.3.2.001
PYTORCH_VERSION=2.2.0a0+81ea7a4
PYTORCH_BUILD_NUMBER=0
CUDNN_VERSION=8.9.7.29+cuda12.2
PYTORCH_HOME=/opt/pytorch/pytorch
LD_LIBRARY_PATH=/usr/local/lib/python3.10/dist-packages/cv2/../../lib64:/usr/local/lib/python3.10/dist-packages/torch/lib:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
NVIDIA_BUILD_ID=76438008
CUDA_DRIVER_VERSION=545.23.08
PYTORCH_BUILD_VERSION=2.2.0a0+81ea7a4
CUDA_HOME=/usr/local/cuda
CUDA_HOME=/usr/local/cuda
CUDA_MODULE_LOADING=LAZY
NVIDIA_REQUIRE_JETPACK_HOST_MOUNTS=
NVIDIA_PYTORCH_VERSION=23.12
TORCH_ALLOW_TF32_CUBLAS_OVERRIDE=1
NCCL_CUMEM_ENABLE=0
TORCHINDUCTOR_COMPILE_THREADS=1


### 🐛 Describe the bug

when i use vllm to create embeddings, it turns out weird in the behavior between batching and send requests one by one.
My model is "intfloat/e5-mistral-7b-instruct", my test data is a list with 100 strings.
When i set the max-num-seqs=1， i can pass the test in https://github.com/vllm-project/vllm/commits/main/tests/models/embedding/language/test_embedding.py .

But when i use batch inference, the result is inconsistent with huggingface or sentence-transformers, only the first 20 of embeddings can stay consistent with hf, others are inconsistent with cosine_similarity of 0.98 or lower, do you have any ideas to solve this batch inference problem? Thanks

![Image](https://github.com/user-attachments/assets/c1818c24-dcd4-45e3-a750-516ec7d061eb)

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

[Bug]: Batch embedding inference is inconsistent with hf #15393

Your current environment

🐛 Describe the bug

Before submitting a new issue...

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

[Bug]: Batch embedding inference is inconsistent with hf #15393

Description

Your current environment

🐛 Describe the bug

Before submitting a new issue...

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions