Environment
OS: macOS (Apple Silicon M-series)
Python: 3.9.6
PyTorch: 2.8.0
Transformers: 4.57.3
Device: MPS (Metal Performance Shaders)
Model: tencent/WeDLM-8B-Instruct
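For reference, MPS availability on this setup can be confirmed with a quick check such as:

```python
import torch

# Both print True on an Apple Silicon build of PyTorch with Metal support
print(torch.backends.mps.is_available())
print(torch.backends.mps.is_built())
```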
Issue
The model loads successfully but fails during inference with model.generate() on macOS systems. The error occurs because the WeDLM custom model code is incompatible with the standard transformers generation interface when running without CUDA/GPU dependencies.
Steps to Reproduce
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "tencent/WeDLM-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

device = "mps" if torch.backends.mps.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.float16 if device == "mps" else torch.float32,
    device_map="auto",
    low_cpu_mem_usage=True,
)

# Model loads successfully
messages = [{"role": "user", "content": "Hello!"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# This fails:
outputs = model.generate(**inputs, max_new_tokens=100)
```
Error Traceback
```
TypeError: wrapped_fn() got an unexpected keyword argument 'input_ids'
```
The error originates from the custom modeling_wedlm.py file's forward method wrapper.
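For illustration, here is a minimal, self-contained reproduction of the same failure mode using hypothetical names (`wrap_forward`/`TinyModel` are not the actual WeDLM code): a wrapper that only forwards positional arguments rejects the keyword-style call that `generate()` makes.

```python
import torch

def wrap_forward(forward_fn):
    # Hypothetical wrapper that only accepts positional arguments,
    # mirroring the shape of the failure seen with modeling_wedlm.py.
    def wrapped_fn(*args):
        return forward_fn(*args)
    return wrapped_fn

class TinyModel(torch.nn.Module):
    def forward(self, input_ids, attention_mask=None):
        return input_ids

model = TinyModel()
model.forward = wrap_forward(model.forward)

model.forward(torch.tensor([[1, 2, 3]]))            # works: positional call
model.forward(input_ids=torch.tensor([[1, 2, 3]]))  # raises the same TypeError,
                                                    # because generate() passes input_ids as a keyword
```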
Root Cause
The WeDLM package requires triton>=3.0.0, which is not available on macOS
The custom model code expects GPU-specific optimizations (flash-attn, triton kernels)
The install.sh script only supports CUDA installations
The model's forward pass wrapper doesn't handle the standard transformers generate() API correctly on non-CUDA devices
Expected Behavior
The model should work with CPU/MPS inference on macOS, similar to other HuggingFace models, even if at reduced performance.
Suggested Solutions
Add macOS/CPU fallback: Detect when triton/flash-attn are unavailable and fall back to standard attention mechanisms (see the sketch after this list)
Update dependencies: Make triton/flash-attn optional dependencies with graceful fallbacks
Fix model wrapper: Ensure the custom model code properly handles standard transformers API calls
Documentation: Add clear notes about macOS limitations and workarounds
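A rough sketch of the fallback idea behind the first two points, assuming the WeDLM custom code honors the standard `attn_implementation` argument (that assumption is exactly what would need to be verified or fixed):

```python
import importlib.util
from transformers import AutoModelForCausalLM

def pick_attn_implementation():
    # Use flash-attn/triton kernels only when they are importable (CUDA installs);
    # otherwise fall back to plain PyTorch attention, which runs on CPU/MPS.
    has_flash = importlib.util.find_spec("flash_attn") is not None
    has_triton = importlib.util.find_spec("triton") is not None
    return "flash_attention_2" if (has_flash and has_triton) else "eager"

model = AutoModelForCausalLM.from_pretrained(
    "tencent/WeDLM-8B-Instruct",
    trust_remote_code=True,
    attn_implementation=pick_attn_implementation(),
)
```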
Additional Context
Model downloads and loads successfully (15 GB)
MPS device is detected and used
The issue affects all generation attempts, not just specific prompts
Other similar models (Qwen2.5, Llama) work fine on the same system
Workaround Attempted
Tried passing input_ids directly instead of unpacking **inputs, but the underlying model architecture still fails due to missing GPU dependencies.
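Roughly what was tried (still hits the same failure inside the custom forward path):

```python
# Passing input_ids positionally instead of unpacking **inputs
outputs = model.generate(inputs["input_ids"], max_new_tokens=100)
```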
Would appreciate guidance on any of the following:
Making WeDLM work on macOS without GPU dependencies
Clear documentation stating macOS is unsupported
Alternative inference methods for CPU/MPS devices