The idea was to work on a non-trivial implementation to learn a bit of Rust and get back into coding after years of engineering management. The project was timeboxed to a few days. Inspired by llama.cpp, the goal was to deliver a Llama 3 8B inference implementation that could run on a modern laptop and could also be deployed to the Internet Computer (ICP).
Functional goals
- Llama 3 8B inference on a laptop and on the ICP with maximum code reuse between the two targets, which also means the code needed to be modular enough to be deployed across ICP canisters
- Solidify knowledge around transformers
- Support GGUF files
- Support several strategies for storing weights (file-mapped, copied to the heap, ...); see the sketch after this list
- Support some form of model quantization
- Ability to deploy the same code locally and on the ICP
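As a hypothetical sketch of what that weight-storage abstraction could look like (names and layout are illustrative, not the actual code), a store trait implemented for both a file-mapped and a heap-backed variant, selected at compile time via generics:

```rust
use std::fs::File;
use memmap2::Mmap; // assumed dependency for the file-mapped variant

trait WeightStore {
    /// Raw tensor bytes at a given offset in the GGUF payload.
    fn bytes(&self, offset: usize, len: usize) -> &[u8];
}

struct MmapStore { map: Mmap }     // weights stay on disk, paged in on demand
struct HeapStore { data: Vec<u8> } // weights copied to the heap up front

impl WeightStore for MmapStore {
    fn bytes(&self, offset: usize, len: usize) -> &[u8] { &self.map[offset..offset + len] }
}

impl WeightStore for HeapStore {
    fn bytes(&self, offset: usize, len: usize) -> &[u8] { &self.data[offset..offset + len] }
}

// Generic over the store, so inference code is monomorphized and
// involves no dynamic dispatch.
struct Model<S: WeightStore> { store: S }

impl Model<MmapStore> {
    fn open(path: &str) -> std::io::Result<Self> {
        let file = File::open(path)?;
        let map = unsafe { Mmap::map(&file)? };
        Ok(Model { store: MmapStore { map } })
    }
}
```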
Non-functional goals
- Pure Rust, as it is well supported for building on the ICP
- Explore how Rust handles mutability, in particular the interior mutability pattern (sketched after this list)
- Built from scratch to maximize learning, so I didn't use Candle at all
- No dynamic dispatch or checks during model execution: the model is built statically, including value initialization (I regretted that choice!)
- Naive implementation, leaving optimization as a later act
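To illustrate the interior mutability point above: the KV cache is logically part of an otherwise immutable model, yet it must grow on every generated token. A minimal sketch, assuming a `RefCell`-based single-threaded design (not necessarily the project's actual layout):

```rust
use std::cell::RefCell;

/// A transformer layer whose weights are read-only after loading,
/// while the KV cache is mutated through a shared `&self` reference.
struct Layer {
    kv_cache: RefCell<Vec<f32>>, // single-threaded; a lock would be needed across threads
}

impl Layer {
    fn forward(&self, kv: &[f32]) {
        // borrow_mut() moves the aliasing check from compile time to run
        // time: it panics if another borrow of the cache is still live.
        self.kv_cache.borrow_mut().extend_from_slice(kv); // append this step's k/v
    }
}
```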
Status
- F32 and F16-quantized tensors are supported (see the dequantization sketch below). A GGUF file can be downloaded from Hugging Face.
- Hugging Face tokenizers is currently used but will be replaced by a custom implementation. For now a tokenizer file needs to be provided, for instance this file for Llama 3 (see the loading sketch below).
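For the F16 path, dequantization is essentially widening each 2-byte value back to f32. A sketch assuming the `half` crate (the repo's actual decode may differ):

```rust
use half::f16;

/// Widen raw little-endian F16 tensor data to f32.
fn dequantize_f16(bytes: &[u8]) -> Vec<f32> {
    bytes
        .chunks_exact(2)
        .map(|b| f16::from_le_bytes([b[0], b[1]]).to_f32())
        .collect()
}
```

Loading the tokenizer file with the upstream `tokenizers` crate looks roughly like this (paths are the ones used in the command below):

```rust
use tokenizers::Tokenizer;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // The same tokenizer.json that is passed with -t on the command line.
    let tokenizer = Tokenizer::from_file("../llama-3-tokenizer/tokenizer.json")?;
    let encoding = tokenizer.encode("Fourth of July jokes ?", false)?;
    println!("{:?}", encoding.get_ids()); // token ids fed to the model
    Ok(())
}
```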
To start:

```sh
cargo run --release -- -f ../Meta-Llama-3-8B-Instruct/ggml-model-f32.gguf -t ../llama-3-tokenizer/tokenizer.json -p "Fourth of July jokes ?"
```
- Generation speed is around 1 token per second, depending on memory
- For the deployment on ICP, please refer to this repo
- Bug: the Mmap is not freed after all the data has been copied to the heap
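The intended behavior, sketched with `memmap2` (an assumption; the real loading path is more involved), is to drop the map once the copy is done so its pages can be reclaimed, instead of keeping the mapping alive inside the model:

```rust
use std::fs::File;
use memmap2::Mmap;

fn load_to_heap(path: &str) -> std::io::Result<Vec<u8>> {
    let file = File::open(path)?;
    let map = unsafe { Mmap::map(&file)? };
    let data = map.to_vec(); // copy the weights onto the heap
    drop(map);               // unmap explicitly instead of storing the Mmap
    Ok(data)
}
```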
Learnings
- Rust is a pretty neat language with great libraries and superior tooling, and I felt productive quickly (which doesn't mean I was)
- The #beginners channel on The Rust Programming Language Discord was an amazing resource
- Typing in Rust is limited, cumbersome and verbose compared to Haskell, and that slowed me down considerably at some point. A lot of the typing decisions I took were probably wrong (llama.rs is an eyesore!)
- The inner matmul loops for both arm64 and wasm are relatively well optimized out of the box in release mode (no SIMD though); the Rust optimizer seems adequate (see the sketch after this list)
- GPT and Claude were not really able to help much
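For context on the matmul point above, the inner loop in question has roughly this naive shape, which the release-mode optimizer handles well even without emitting SIMD:

```rust
/// Naive matmul: out[r] = dot(w[r, :], x), with w stored row-major.
fn matmul(out: &mut [f32], w: &[f32], x: &[f32]) {
    let n = x.len();
    for (r, o) in out.iter_mut().enumerate() {
        let row = &w[r * n..(r + 1) * n];
        let mut acc = 0.0f32;
        // The hot loop: a plain scalar reduction that the optimizer
        // unrolls aggressively in release builds.
        for i in 0..n {
            acc += row[i] * x[i];
        }
        *o = acc;
    }
}
```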
Credits
- Meta Llama 3 8B model
- llama.cpp and llama3.c