🚀 The feature, motivation and pitch
Currently, users need to manually download the Hugging Face safetensors, convert them to the llama_transformer format, and then load the converted checkpoint and config for export and inference.
It would be great to be able to download and cache the converted checkpoint directly (so it does not have to be re-downloaded or re-converted on subsequent runs) and run inference, similar to what mlx_lm does:
```python
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/dolphin3.0-llama3.2-3B-4Bit")

prompt = "hello"

if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True)
```
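For illustration only, a download-convert-cache helper could look roughly like the sketch below. `huggingface_hub.snapshot_download` is a real API, but `load_converted`, `convert_to_llama_transformer`, and the cache layout are hypothetical names made up for this sketch, not anything that exists in the repo today:

```python
from pathlib import Path

from huggingface_hub import snapshot_download


def convert_to_llama_transformer(src_dir: str, dst_file: Path) -> None:
    """Placeholder for the project's safetensors -> llama_transformer conversion."""
    raise NotImplementedError("run the repo's conversion step here")


def load_converted(repo_id: str, cache_root: str = "~/.cache/llama_transformer") -> Path:
    """Download a Hugging Face checkpoint once, convert it once, and reuse
    the cached llama_transformer checkpoint on every later call."""
    cache_dir = Path(cache_root).expanduser() / repo_id.replace("/", "--")
    converted = cache_dir / "checkpoint.pth"
    if converted.exists():
        return converted  # cache hit: nothing to download or convert

    # snapshot_download caches the raw files itself, so re-runs are cheap.
    src_dir = snapshot_download(repo_id, allow_patterns=["*.safetensors", "*.json"])

    cache_dir.mkdir(parents=True, exist_ok=True)
    convert_to_llama_transformer(src_dir, converted)
    return converted
```

With something along these lines, export and inference could take just a Hub repo id, the same way the mlx_lm snippet above does.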
Alternatives
No response
Additional context
No response
RFC (Optional)
No response