I want to run 8-bit GGUF files using the llama-cpp-python library. Do I need to load two files at the same time? Can you share sample code?

Replies: 1 comment

-
GGUFs pack everything into one file, so you only need one. A lot of providers offer ready-made GGUFs nowadays; search Hugging Face for a GGUF version of the model you want (a quick way to do that from Python is sketched below).
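If you prefer to search from Python rather than on the website, here is a minimal sketch using the `huggingface_hub` package (the query string is just an example, not something from the original post):

```python
from huggingface_hub import HfApi

api = HfApi()
# list repos whose name matches the search query; each result has an .id
# that can be passed to Llama.from_pretrained as repo_id
for model in api.list_models(search="Meta-Llama-3.1-8B-Instruct GGUF", limit=10):
    print(model.id)
```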
Once you have the repo id, you can run the download and inference using the built-in `Llama.from_pretrained`:

```python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="bartowski/Meta-Llama-3.1-8B-Instruct-GGUF",  # HF repo id where the model files are hosted
    filename="*q8_0.gguf",  # downloads the file matching this pattern (i.e. the 8-bit quant)
    local_dir="./ai/llm_models/",  # optional dir to save the model file (otherwise it goes to the HF cache dir)
    # verbose=True,
    # n_gpu_layers=-1,  # uncomment if llama-cpp-python was installed with GPU support, otherwise it uses the CPU
    # chat_format="llama-3",  # usually detected automatically from the GGUF metadata, so this can be left out
)

output = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": ""},
        {"role": "user", "content": "Hi, I'm just testing if it works."},
    ],
    max_tokens=256,  # max number of tokens to generate; check online what context size the model supports
)
print(output)
```
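`create_chat_completion` returns an OpenAI-style completion dict, so you can pull out just the generated text. As a small usage sketch (the local file name below is hypothetical, use whatever file the download above actually produced):

```python
# print only the assistant's reply text from the completion dict
print(output["choices"][0]["message"]["content"])

# if the .gguf file is already on disk, you can also load it directly by path
from llama_cpp import Llama
llm = Llama(model_path="./ai/llm_models/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf")
```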