Inference Llama 2 in Rust. A port of llama2.c, written purely for self-education. While the code largely mirrors llama2.c, it splits the implementation into several modules.
To run the unit tests:
% cargo test
To make the release build:
% cargo build --release
It runs the TinyStories models. I haven't tested it with any Llama 2 model, though.
Running with the stories260K model:
% ./target/release/run stories260K/stories260K.bin -z stories260K/tok512.bin -t 0.9 -s 12345
It gives:
achieved tok/s 371.5529753265602
Once upon a time, there was a little girl named Lily. She loved to play with her toys in her room. One day, she asked her mom what happened. Her mom said, "Don't worry, Lily. We can go to the park to make it clean and shoot a new ones."
Lily was so happy and started to play all day. She asked her mom to help the party. Her mom said, "Let's find my bed." Lily and her mom sat down to the park, but it was not broken.
Lily's mom asked her what was wrong and they had a lot of fun. Lily smiled and said, "Okay, Lily. Let's see your mom."
Lily went to the park and got it in her bed. She showed it to her
With the same seed, it generates text identical to llama2.c's output.
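That reproducibility comes from matching llama2.c's sampler RNG, a xorshift64* generator seeded by -s. A minimal sketch of mirroring it in Rust (not necessarily how this repo structures its code):

```rust
// Sketch (assumed, not this repo's actual code): llama2.c draws randomness
// from a xorshift64* generator seeded by -s, so mirroring that generator and
// its float conversion reproduces the sampled tokens exactly.
struct Rng {
    state: u64,
}

impl Rng {
    fn new(seed: u64) -> Self {
        Rng { state: seed }
    }

    // xorshift64* step, matching llama2.c's random_u32.
    fn random_u32(&mut self) -> u32 {
        self.state ^= self.state >> 12;
        self.state ^= self.state << 25;
        self.state ^= self.state >> 27;
        (self.state.wrapping_mul(0x2545F4914F6CDD1D) >> 32) as u32
    }

    // Uniform float in [0, 1), matching llama2.c's random_f32.
    fn random_f32(&mut self) -> f32 {
        (self.random_u32() >> 8) as f32 / 16777216.0
    }
}

fn main() {
    // Same seed => same random sequence => same sampled text.
    let mut rng = Rng::new(12345);
    println!("{}", rng.random_f32());
}
```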
You can also download other TinyStories models and run them, e.g.:
% ./target/release/run ~/Downloads/stories110M.bin
Though better performance is not a goal, I did a small comparison against llama2.c just to validate the implementation.
The benchmark ran on a Mac Mini with an M2 chip. For llama2-rs, it uses the release build. For llama2.c, it uses the build from make run (not make runfast, which runs much faster), without OpenMP. The two have similar performance.
This repo is the multi-threaded version of llama2-rs; I also used a single-thread version during the benchmark.
| model | llama2.c | llama2-rs (single thread) | llama2-rs (multi-thread) |
|---|---|---|---|
| stories15M.bin | 123 tok/s | 125 tok/s | 190 tok/s |
| stories42M.bin | 35 tok/s | 36 tok/s | 89 tok/s |
| stories110M.bin | 12 tok/s | 12 tok/s | 38 tok/s |
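Most of the multi-thread gain comes from the matmuls, where each output row is an independent dot product and parallelizes cleanly across threads. A minimal sketch of such a parallel matmul, assuming rayon (this repo's actual threading code may differ):

```rust
use rayon::prelude::*;

// Sketch of a rayon-parallel matmul (rayon is an assumption; the repo may
// thread differently). w is a (d, n) row-major matrix, x has length n, and
// out has length d. Each output element is an independent dot product, so
// the rows can be computed in parallel.
fn matmul(out: &mut [f32], x: &[f32], w: &[f32], n: usize) {
    out.par_iter_mut().enumerate().for_each(|(i, o)| {
        let row = &w[i * n..(i + 1) * n];
        *o = row.iter().zip(x.iter()).map(|(a, b)| a * b).sum();
    });
}
```

In a single-thread build the same loop runs serially, which is consistent with the single-thread column tracking llama2.c closely in the table above.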
The int8 quantization is implemented in src/model/quantize. To run it, you first need to export a quantized model using export.py from llama2.c. For example:
% python3 export.py ~/Downloads/stories110M_q80.bin --version 2 --checkpoint ~/Downloads/stories110M.pt
Then run:
% ./target/release/runq ~/Downloads/stories110M_q80.bin
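The --version 2 export stores weights in a group-wise int8 (Q8_0-style) format: each group of values shares one f32 scale chosen so the group's largest magnitude maps to 127. A rough sketch of quantize/dequantize under that scheme (struct and function names here are illustrative, not the repo's actual API):

```rust
// Sketch of group-wise int8 (Q8_0-style) quantization as produced by the
// --version 2 export: one f32 scale per group, scale = max_abs / 127.
// Names and layout are illustrative, not this repo's actual API.
struct QuantizedTensor {
    q: Vec<i8>,       // quantized values
    scales: Vec<f32>, // one scale per group
    group_size: usize,
}

fn quantize(x: &[f32], group_size: usize) -> QuantizedTensor {
    let mut q = Vec::with_capacity(x.len());
    let mut scales = Vec::with_capacity(x.len() / group_size);
    for group in x.chunks(group_size) {
        let max = group.iter().fold(0.0f32, |m, v| m.max(v.abs()));
        let scale = max / 127.0;
        scales.push(scale);
        for &v in group {
            q.push(if scale == 0.0 { 0 } else { (v / scale).round() as i8 });
        }
    }
    QuantizedTensor { q, scales, group_size }
}

// Dequantize: each value is its int8 code times the scale of its group.
fn dequantize(t: &QuantizedTensor) -> Vec<f32> {
    t.q.iter()
        .enumerate()
        .map(|(i, &qi)| qi as f32 * t.scales[i / t.group_size])
        .collect()
}
```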
- Chat
MIT