Try it here.
The model may not run properly on your devices with insufficient RAM!
A simple demonstration modified from HuggingFace's React-translator example with TypeScript support.
The demo utilizes Transformers.js
to load and run a smaller large language model (LLM) - or small language model (SLM) in the web browser. The app uses Vite
's Worker
to run the model in the background, hence this would have to be a React or Svelte app.
You can define the models, tasks, device and other model parameters in /src/model/Config.json
:
Notes:
- The model loading time and memory usage mostly depends on model size. It is recommended to have at least 4-8 GB free memory on your device.
- System roles may not work well on certain models.
- Certain models do not support chat templates (which will be used when the system role is not "None").
- Certain devices and dtype options may not work for certain models. "Audo" in dtype does not work for all models either.
- For example,
Gemma-3-1B-It
does not work on WebGPU, andSmolLM2-1.7B-Instruct
only works when dtype is set toint8
,uint8
,bnb4
orq4f16
.
- For example,
- After loading a model, you must refresh the page to load a different one. There is no way to release the old model from the memory, and trying to load more than two models proved to be problematic.
{
"models": {
"SmolLM2-135M-Instruct": "HuggingFaceTB/SmolLM2-135M-Instruct",
"SmolLM2-360M-Instruct": "HuggingFaceTB/SmolLM2-360M-Instruct",
"codegen-350M-mono": "Xenova/codegen-350M-mono",
"Qwen2.5-0.5B-Instruct": "Mozilla/Qwen2.5-0.5B-Instruct",
"Qwen3-0.6B": "onnx-community/Qwen3-0.6B-ONNX",
"Gemma-3-1B-It": "onnx-community/gemma-3-1b-it-ONNX",
"Falcon3-1B-Instruct": "onnx-community/Falcon3-1B-Instruct",
"AMD-OLMo-1B-SFT-DPO": "onnx-community/AMD-OLMo-1B-SFT-DPO",
"ZR1-1.5B": "onnx-community/ZR1-1.5B-ONNX",
"SmolLM2-1.7B-Instruct": "HuggingFaceTB/SmolLM2-1.7B-Instruct",
"Phi-3-mini-4k-Instruct (3.8B)": "Xenova/Phi-3-mini-4k-instruct"
},
"system_roles": {
"None (no chat template)": "",
"Helpful assistant": "You are a helpful, concise, and accurate assistant.",
"Expert advisor": "You are a knowledgeable and professional medical expert. Provide clear, evidence-based answers.",
"Socratic guide": "You are a Socratic tutor. Ask thoughtful questions to guide the user to their own conclusions.",
"Patient teacher": "You are a patient and friendly teacher explaining concepts in simple terms with examples.",
"Quiz generator": "You are a quiz master. Generate multiple-choice questions to test knowledge of a topic.",
"Code assistant": "You are a skilled software engineer. Help the user write clean, efficient code.",
"Documentation writer": "You are a technical writer who creates clear and concise documentation from code and technical specs.",
"Data analyst": "You are a data analyst. Interpret data clearly, with charts or summaries if needed.",
"Storyteller": "You are a creative and engaging storyteller. Write vivid and original fiction.",
"Poet": "You are a poetic wordsmith. Craft expressive and emotionally resonant poetry.",
"Motivational coach": "You are a motivational coach. Offer practical advice and encouragement.",
"Friendly companion": "You are a kind and empathetic companion. Listen and respond warmly."
},
"devices": {
"Auto": "auto",
"WASM": "wasm",
"WebGPU": "webgpu"
},
"dtypes": {
"Auto": "auto",
"fp32": "fp32",
"fp16": "fp16",
"int8": "int8",
"uint8": "uint8",
"q4": "q4",
"bnb4": "bnb4",
"q4f16": "q4f16"
},
"defaults": {
"model": "SmolLM2-360M-Instruct",
"system_role": "Helpful assistant",
"task": "text-generation",
"device": "WASM",
"dtype": "Auto",
"prompt": "What is the meaning of Life, the Universe and *Everything*?",
"config": {
"max_new_tokens": 1024,
"temperature": 0.2,
"top_p": 0.95,
"top_k": 30,
"repetition_penalty": 1.05,
"do_sample": true
}
}
}
Install dependencies.
Start the dev server.
Build a production at ./dist
.
Serve and view the built production.
Commit changes.