In this project, Cloudera uses llama.cpp to run a Mistral-7B model for inferencing. This approach caters nicely to resource-constrained environments that may only have CPUs, or smaller GPUs with low VRAM and limited threads. In the examples herein, we deliberately use only CPUs for inferencing. llama.cpp facilitates this by supporting 4-bit integer quantization and mixed floating-point precision. If GPUs are used, llama.cpp offers the flexibility of supporting NVIDIA hardware via CUDA as well as Apple Silicon via Metal. This support also extends to AMD's ROCm library.
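As a minimal sketch of CPU-only inferencing with a quantized GGUF model, the snippet below uses the llama-cpp-python bindings; the bindings, model path, and parameter values are illustrative assumptions and may differ from what the project's demo.py actually does.

# Minimal CPU-only inference sketch using the llama-cpp-python bindings.
# The model path and parameter values here are illustrative assumptions.
from llama_cpp import Llama

llm = Llama(
    model_path="models/mistral-7b-v0.1.Q4_K_M.gguf",  # 4-bit quantized GGUF file
    n_ctx=2048,      # context window for this session
    n_threads=8,     # number of CPU threads to use
    n_gpu_layers=0,  # 0 keeps every layer on the CPU
)

output = llm("Q: What is quantization? A:", max_tokens=128, stop=["Q:"])
print(output["choices"][0]["text"])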
Please consider the detail below.
Quantization is the process of converting model weights from higher-precision floating point values to lower-precision representations, such as 4-bit integers, trading a small amount of accuracy for a much smaller memory footprint.
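As a simplified illustration of the idea, the toy example below maps a block of float weights to 4-bit integers with a single scale factor; this is only a conceptual sketch, not the Q4_K_M scheme that llama.cpp actually uses.

import numpy as np

# Toy symmetric 4-bit quantization of a block of weights.
weights = np.random.randn(32).astype(np.float32)

# Choose a scale so the largest magnitude maps into the int4 range [-8, 7].
scale = np.abs(weights).max() / 7.0
q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)

# Dequantize at inference time and inspect the rounding error introduced.
dequantized = q.astype(np.float32) * scale
print("max abs error:", np.abs(weights - dequantized).max())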
In this project, Cloudera uses llama.cpp's newer GGUF format rather than the older GGML format. These formats enable you to fit an entire model within RAM or VRAM.
GGML stands for Georgi Gerganov’s Machine Learning. The GGML format provided a way to encapsulate an entire model into a single file.
The llama.cpp team released Georgi Gerganov's Unified Format (GGUF) in August 2023 to overcome some of the limitations of the GGML format. For example, new features can now be added to the format without breaking previous models. Also, tokenization now has better support for special characters. Reportedly, there is also enhanced performance.
Versions of the tokenizers library newer than 0.13.3 should be used to support the GGUF format.
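For example, one way to satisfy this requirement at install time is:
pip install "tokenizers>0.13.3"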
Cloudera remains agnostic to the model provider and can support a variety of locally deployed and remotely served models. For this project, please note Mistral. Its performance against LLaMA 2 is compelling: Mistral's 7-billion-parameter model benchmarks favorably against larger 13-billion-parameter models. As a transformer model for text generation, it uses sliding window attention and offers a large context length of 8K tokens.
Most importantly, it demands little memory while offering decent throughput, low latency, and acceptable accuracy.
Please note that Cloudera does not provide benchmarking results to support these claims, but CML enables you to evaluate the model and form your own assessment.
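Before moving on to the setup steps, note that the instruct variant of Mistral downloaded below expects its prompts wrapped in Mistral's [INST] tags. The snippet below is a minimal formatting sketch; the question text and the n_ctx suggestion are only examples.

# Mistral-7B-Instruct expects instructions wrapped in [INST] ... [/INST] tags.
question = "Explain the trade-offs of 4-bit quantization in two sentences."
prompt = f"<s>[INST] {question} [/INST]"
print(prompt)
# When loading the model, n_ctx can be raised toward the 8K context limit,
# for example n_ctx=8192, memory permitting.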
For detailed instructions on how to run these scripts, see the documentation.
git clone https://github.com/kentontroy/cloudera_cml_llm_rag
tar xvzf vectorstore.tar.gz
In the top-level project directory:
mkdir models
cd jobs
Change the path in download_models.py:
For example, "print(subprocess.run(["./download_models.sh"], shell=True))" instead of "print(subprocess.run(["/home/cdsw/jobs/download_models.sh"], shell=True))"
Change the path in download_models.sh to point to the directory where you want the models to be stored:
wget -P ../models https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/resolve/main/mistral-7b-v0.1.Q4_K_M.gguf
wget -P ../models https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf
cd ..
Finally, run:
python download_models.py
Add a CDSW_APP_PORT environment variable in .env that references a port of your choosing.
Do not do this if you are running the Gradio application in CML, as CML natively exposes the same environment variable.
Wherever the model files are stored, change the LLM_MODEL_PATH entry in .env to point to the correct directory.
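As an illustration, the relevant .env entries might look like the following; the port number and directory are placeholders, so adjust them to your environment.

CDSW_APP_PORT=8100
LLM_MODEL_PATH=/home/cdsw/models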
Then, in the top-level project directory, run:
python demo.py