Exollama provides a collection of local inference tools, designed to help bootstrap AGI on a budget ($1K).
- llama2.cu allows you to target an NVIDIA GPU from a plain C/CUDA project (the genesis of this project)
- ollama allows you to target either CPU or GPU resources
- shell scripts: see epochs/epoch-1/system-install.sh
A summary of each script deployed during setup is given below; a sketch of a typical agent round follows the list.
- agent-name.sh: Displays the name of a given agent identified by ID
- agent-step.sh: Runs the next agent step for a given agent ID
- browse.sh: Browses for avahi services
- coordinate.sh: Fetches an agent's next communication coordinate
- deploy.sh: Deploys shell scripts to /usr/local/exollama
- deps.sh: Installs SQ
- embed.sh: Obtains embeddings for the given model from ollama
- exo-step.sh: Runs the next pod step
- fetch-message.sh: Fetches a message from an SQ REST API endpoint
- finalize-entry.sh: Collects output from an agent step
- generate-agent.sh: Uses the configured LLM to generate an agent biography
- get-exocortexum.sh: Snags a message from exocortexia.phext for an agent step
- get-message.sh: Pulls the current agent's scroll from SQ into msg.txt
- init-agent.sh: Assists agents with choosing their initial pod from aurora, chronos, elysium, helios, nyx, or tyche.
- install.sh: Configures the /etc/exollama directory for production use
- launch-pod.sh: Loads the database for a pod using SQ
- list-agents.sh: Checks the integrity of the agent store, listing agents 1-100
- list-spots.sh: Lists how many open spots remain in each pod (seems to be broken)
- maybe-think.sh: Checks to see if an agent needs to complete a thought step for this round
- pod-name.sh: Displays the configured pod name, as defined in /etc/exollama/pod.id
- pods.sh: Launches an instance of SQ on ports 11000 through 11100, each serving as a distinct agent interface
- pod-status.sh: Displays runtime status for the pod and launches an SQ host if needed
- post-message.sh: Consumes msg.txt, pushing it back into the current phext via SQ
- pricing.sh: Displays LLM pricing information
- process.sh: (placeholder for high-level agent processing)
- rag.sh: Provides a quick way to interact with rag.py from exocortical
- render.sh: Pretty-prints console text with style
- reset.sh: Attempts to nudge LLMs away from groupthink
- roster.sh: Displays the current pod's roster (just 1.1.1/1.1.1/1.1.1)
- round-status.sh: Displays status information about the current thought round
- run.sh: Executes llama2_q4, which leverages GPU compute for zero-dependency inference
- send-message.sh: Routes a message to a specific agent via SQ (REST)
- setup.sh: Single-step setup for building llama2_q4
- status.sh: Exollama node status
- sync.sh: Automates workflow steps (git stash + rebase)
- system-install.sh: Creates a deployed instance in /usr/local/exollama
- talk.sh: Constructs the next interaction coordinate for the given agent
- teach.sh: Uses RAG to constrain agent output to a training set based upon phext
- test-cuda.sh: Verifies that you have a viable CUDA environment for nvcc
- think.sh: Runs an agent thought step using deepseek-r1
- tok.sh: API credit calculations
- update-pod-manifest.sh: Archives the current pod manifest report
- verify-agent.sh: Tests whether or not a given agent exists in the agent store
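To tie these together, here is a hypothetical sketch of one agent round. The numeric agent ID (42) and the argument order are illustrative assumptions, not documented interfaces; check each script before relying on it.
./pod-status.sh        # confirm the pod is up and an SQ host is running
./list-agents.sh       # verify agent store integrity (agents 1-100)
./agent-name.sh 42     # look up the agent's name by ID (assumed argument form)
./maybe-think.sh 42    # check whether this agent owes a thought step this round (assumed)
./agent-step.sh 42     # run the agent's next step (assumed argument form)
./round-status.sh      # review the state of the current thought round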
Simple and fast pure CUDA inference for 4-bit AWQ quantized models.
Based on llama2.c and forked from llama2_q4.cu.
The instructions below still apply, but you can execute all of them in one shot with setup.sh on Ubuntu 24.04 LTS with an NVIDIA GPU. AMD support via ROCm is a work in progress.
After you've completed setup, use run.sh to run your prompts through your new LLM!
./setup.sh
./run.sh
- prompt.phext: Your custom prompt
- output.phext: The content generated by the selected LLM
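For example (a minimal sketch, assuming run.sh reads prompt.phext from the working directory and writes output.phext next to it):
echo "Write a haiku about GPUs." > prompt.phext   # plain text is valid phext
./run.sh                                          # runs the prompt through llama2_q4
cat output.phext                                  # inspect the generated text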
git clone https://github.com/wbic16/exollama
cd exollama
mkdir build
cd build
cmake ..
cmake --build . --config Release
cd ..
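If the CMake step can't find CUDA, it can help to confirm that nvcc is reachable first; test-cuda.sh (listed above) performs a similar sanity check:
nvcc --version    # should print the installed CUDA toolkit release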
You can quickly run inference requests via ollama using the scripts below; an example of the underlying call follows the list.
- dq.sh: deepseek-r1:1.5b
- ds.sh: deepseek-r1:8b
- tl.sh: tinyllama
- ml.sh: mistral
- q2.sh: qwen2:7b
- gg.sh: gemma:7b
- l2.sh: llama3.2
- oc.sh: opencoder
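Each of these is a thin shortcut around ollama; the underlying call is an ordinary ollama invocation along the lines of the following (the exact flags each wrapper passes are not shown here):
ollama pull deepseek-r1:8b                                      # fetch the model once
ollama run deepseek-r1:8b "Explain AWQ quantization briefly."   # run a one-shot prompt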
The simpler way is to download a pre-converted model from Hugging Face, but you can also prepare your own weights.
You can use either of the pre-converted models below.
Here are the commands for the 7B model:
wget https://huggingface.co/abhinavkulkarni/meta-llama-Llama-2-7b-chat-hf-w4-g128-awq/resolve/main/pytorch_model.bin
wget https://huggingface.co/abhinavkulkarni/meta-llama-Llama-2-7b-chat-hf-w4-g128-awq/resolve/main/config.json
pip install numpy torch
python3 convert_awq_to_bin.py pytorch_model.bin output
./weight_packer config.json output llama2-7b-awq-q4.bin 1
./llama2_q4 llama2-7b-awq-q4.bin -n 256 -i "write an essay about GPUs"
And here are the commands for the 13B model:
wget https://huggingface.co/abhinavkulkarni/meta-llama-Llama-2-13b-chat-hf-w4-g128-awq/resolve/main/config.json
wget https://huggingface.co/abhinavkulkarni/meta-llama-Llama-2-13b-chat-hf-w4-g128-awq/resolve/main/pytorch_model-00001-of-00003.bin
wget https://huggingface.co/abhinavkulkarni/meta-llama-Llama-2-13b-chat-hf-w4-g128-awq/resolve/main/pytorch_model-00002-of-00003.bin
wget https://huggingface.co/abhinavkulkarni/meta-llama-Llama-2-13b-chat-hf-w4-g128-awq/resolve/main/pytorch_model-00003-of-00003.bin
pip install numpy torch
python3 convert_awq_to_bin.py pytorch_model-00001-of-00003.bin output
python3 convert_awq_to_bin.py pytorch_model-00002-of-00003.bin output
python3 convert_awq_to_bin.py pytorch_model-00003-of-00003.bin output
cd build
cp ../weight_packer weight_packer
./weight_packer config.json output llama2-13b-awq-q4.bin 1
./llama2_q4 llama2-13b-awq-q4.bin -n 256 -i "write an essay about GPUs"
cd ..
Note: the last argument of weight_packer indicates whether the AWQ weights use the old packing format (which needs repacking). If you use the latest AWQ repo from GitHub, it will generate weights in the new packing format. The weights at https://huggingface.co/abhinavkulkarni/ still use the old format, so we set the parameter to 1 above.
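Put another way (the model and file names here are placeholders):
./weight_packer config.json output model-q4.bin 1   # weights from huggingface.co/abhinavkulkarni (old packing format)
./weight_packer config.json output model-q4.bin 0   # weights freshly generated with the current llm-awq repo (new packing format)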
- First generate AWQ int-4 quantized weights following steps in llm-awq
- Convert AWQ weights into individual weight binary files using convert_awq_to_bin.py
- Convert/repack the weight binary files using the weight_packer.cpp utility.
- Run the inference (llama2_q4.cu) pointing to the final weight file.
Note: the AWQ scripts don't run on Windows. Use Linux or WSL.
Example:
python -m awq.entry --model_path /path-to-model/Llama-2-7b-chat-hf --w_bit 4 --q_group_size 128 --run_awq --dump_awq awq_cache/llama2-7b-chat-metadata.pt
python -m awq.entry --model_path /path-to-model/Llama-2-7b-chat-hf --w_bit 4 --q_group_size 128 --load_awq awq_cache/llama2-7b-chat-metadata.pt --q_backend real --dump_quant awq_weights/llama2-7b-awq.pt
pip install numpy torch
python3 convert_awq_to_bin.py awq_weights/llama2-7b-awq.pt output
./weight_packer config.json output llama2-7b-awq-q4.bin 0
We get ~200 tokens per second with an RTX 4090 for 7B parameter models:
llama2_q4.exe llama2-7b-awq-q4.bin -n 256 -i "write an essay about GPUs"
Model params:-
dim: 4096
hidden_dim: 11008
n_heads: 32
n_kv_heads: 32
n_layers: 32
seq_len: 2048
vocab_size: 32000
loaded weights
<s>
write an essay about GPUs
Introduction:
GPU (Graphics Processing Unit) is a specialized electronic circuit designed to accelerate the manipulation of graphical data. It is a key component of a computer's hardware that is used to improve the performance of graphics-intensive applications such as video games, computer-aided design (CAD) software, and scientific simulations. In this essay, we will explore the history of GPUs, their architecture, and their impact on the computer industry.
History of GPUs:
The concept of a GPU can be traced back to the 1960s when computer graphics were still in their infancy. At that time, computer graphics were primarily used for scientific visualization and were not yet a major component of mainstream computing. However, as computer graphics became more popular in the 1980s and 1990s, the need for specialized hardware to handle the increasingly complex graphics tasks became apparent. In the early 1990s, the first GPUs were developed, which were designed to offload the computationally intensive graphics tasks from the CPU (Central Processing Unit) to the GPU.
Architecture
achieved tok/s: 200.787402. Tokens: 255, seconds: 1.27
MIT