The idea was to work on a non-trivial implementation to learn a bit of Rust and get back into coding after years of engineering management. The project was timeboxed to a few days. Inspired by llama.cpp, the goal was to deliver a Llama 3 8B inference implementation that could run on a modern laptop and could also be deployed to the Internet Computer (ICP).
Functional goals
- Llama 3 8B inference on a laptop and on the ICP with maximum code reuse between the two targets, which also means the code needed to be modular enough to be deployed across ICP canisters
- Solidify knowledge around transformers
- Support GGUF files
- Support several strategies for storing weights (file-mapped, copied to the heap, ...); see the sketch after this list
- Support some form of model quantization
- Ability to deploy the same code locally and on the ICP
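As a hypothetical sketch of what that weight-storage abstraction could look like (names and layout are illustrative, not the actual code), a store trait implemented for both a file-mapped and a heap-backed variant, selected at compile time via generics:

```rust
use std::fs::File;
use memmap2::Mmap; // assumed dependency for the file-mapped variant

trait WeightStore {
    /// Raw tensor bytes at a given offset in the GGUF payload.
    fn bytes(&self, offset: usize, len: usize) -> &[u8];
}

struct MmapStore { map: Mmap }     // weights stay on disk, paged in on demand
struct HeapStore { data: Vec<u8> } // weights copied to the heap up front

impl WeightStore for MmapStore {
    fn bytes(&self, offset: usize, len: usize) -> &[u8] { &self.map[offset..offset + len] }
}

impl WeightStore for HeapStore {
    fn bytes(&self, offset: usize, len: usize) -> &[u8] { &self.data[offset..offset + len] }
}

// Generic over the store, so inference code is monomorphized and
// involves no dynamic dispatch.
struct Model<S: WeightStore> { store: S }

impl Model<MmapStore> {
    fn open(path: &str) -> std::io::Result<Self> {
        let file = File::open(path)?;
        let map = unsafe { Mmap::map(&file)? };
        Ok(Model { store: MmapStore { map } })
    }
}
```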
Non-functional goals
- Pure Rust, as it is well supported for building on the ICP
- Explore how Rust handles mutability, in particular the interior mutability pattern (sketched after this list)
- Built from scratch to maximize learning, so I didn't use Candle at all
- No dynamic dispatch or checks during model execution: the model is built statically, including value initialization (I regretted that choice!)
- Naive implementation, leaving optimization as a later act
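To illustrate the interior mutability point above: the KV cache is logically part of an otherwise immutable model, yet it must grow on every generated token. A minimal sketch, assuming a `RefCell`-based single-threaded design (not necessarily the project's actual layout):

```rust
use std::cell::RefCell;

/// A transformer layer whose weights are read-only after loading,
/// while the KV cache is mutated through a shared `&self` reference.
struct Layer {
    kv_cache: RefCell<Vec<f32>>, // single-threaded; a lock would be needed across threads
}

impl Layer {
    fn forward(&self, kv: &[f32]) {
        // borrow_mut() moves the aliasing check from compile time to run
        // time: it panics if another borrow of the cache is still live.
        self.kv_cache.borrow_mut().extend_from_slice(kv); // append this step's k/v
    }
}
```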
Status
- F32 and F16-quantized tensors are supported (see the dequantization sketch below). A GGUF file can be downloaded from Hugging Face.
- Hugging Face tokenizers is currently used but will be replaced by a custom implementation. For now a tokenizer file needs to be provided, for instance this file for Llama 3 (see the loading sketch below).
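For the F16 path, dequantization is essentially widening each 2-byte value back to f32. A sketch assuming the `half` crate (the repo's actual decode may differ):

```rust
use half::f16;

/// Widen raw little-endian F16 tensor data to f32.
fn dequantize_f16(bytes: &[u8]) -> Vec<f32> {
    bytes
        .chunks_exact(2)
        .map(|b| f16::from_le_bytes([b[0], b[1]]).to_f32())
        .collect()
}
```

Loading the tokenizer file with the upstream `tokenizers` crate looks roughly like this (paths are the ones used in the command below):

```rust
use tokenizers::Tokenizer;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // The same tokenizer.json that is passed with -t on the command line.
    let tokenizer = Tokenizer::from_file("../llama-3-tokenizer/tokenizer.json")?;
    let encoding = tokenizer.encode("Fourth of July jokes ?", false)?;
    println!("{:?}", encoding.get_ids()); // token ids fed to the model
    Ok(())
}
```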
To start:

```sh
cargo run --release -- -f ../Meta-Llama-3-8B-Instruct/ggml-model-f32.gguf -t ../llama-3-tokenizer/tokenizer.json -p "Fourth of July jokes ?"
```
- Generation speed is around 1 token per second, depending on memory
- For the deployment on ICP, please refer to this repo
- Bug: the Mmap is not freed after all the data has been copied to the heap
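The intended behavior, sketched with `memmap2` (an assumption; the real loading path is more involved), is to drop the map once the copy is done so its pages can be reclaimed, instead of keeping the mapping alive inside the model:

```rust
use std::fs::File;
use memmap2::Mmap;

fn load_to_heap(path: &str) -> std::io::Result<Vec<u8>> {
    let file = File::open(path)?;
    let map = unsafe { Mmap::map(&file)? };
    let data = map.to_vec(); // copy the weights onto the heap
    drop(map);               // unmap explicitly instead of storing the Mmap
    Ok(data)
}
```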
Learnings
- Rust is a pretty neat language with great libraries and superior tooling, and I felt productive quickly (which doesn't mean I was)
- The #beginners channel on The Rust Programming Language Discord was an amazing resource
- Typing in Rust is limited, cumbersome and verbose compared to Haskell, and that slowed me down considerably at some point. A lot of the typing decisions I took were probably wrong (llama.rs is an eyesore!)
- The inner matmul loops for both arm64 and wasm are relatively well optimized out of the box in release mode (no SIMD though); the Rust optimizer seems adequate (see the sketch after this list)
- GPT and Claude were not really able to help much
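For context on the matmul point above, the inner loop in question has roughly this naive shape, which the release-mode optimizer handles well even without emitting SIMD:

```rust
/// Naive matmul: out[r] = dot(w[r, :], x), with w stored row-major.
fn matmul(out: &mut [f32], w: &[f32], x: &[f32]) {
    let n = x.len();
    for (r, o) in out.iter_mut().enumerate() {
        let row = &w[r * n..(r + 1) * n];
        let mut acc = 0.0f32;
        // The hot loop: a plain scalar reduction that the optimizer
        // unrolls aggressively in release builds.
        for i in 0..n {
            acc += row[i] * x[i];
        }
        *o = acc;
    }
}
```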
Credits
- Meta Llama 3 8B model
- llama.cpp and llama3.c