marella/ctransformers: Python bindings for the Transformer models implemented in C/C++ using GGML library. #499

irthomasthomas · 2024-02-02T10:29:27Z

CTransformers

![Build and Test](https://github.com/marella/ctransformers/actions/workflows/build.yml/badge.svg)

Python bindings for the Transformer models implemented in C/C++ using the GGML library. See also ChatDocs.

Supported Models

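| Model | Model Type | CUDA | Metal |
| :-- | :-- | :-: | :-: |
| GPT-2 | `gpt2` | | |
| GPT-J, GPT4All-J | `gptj` | | |
| GPT-NeoX, StableLM | `gpt_neox` | | |
| Falcon | `falcon` | ✅ | |
| LLaMA, LLaMA 2 | `llama` | ✅ | ✅ |
| MPT | `mpt` | ✅ | |
| StarCoder, StarChat | `gpt_bigcode` | ✅ | |
| Dolly V2 | `dolly-v2` | | |
| Replit | `replit` | | |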

Installation

To install via pip, simply run:

pip install ctransformers

Usage

It provides a unified interface for all models:

from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained("/path/to/ggml-model.bin", model_type="gpt2")

print(llm("AI is going to"))

To stream the output:

for text in llm("AI is going to", stream=True):
    print(text, end="", flush=True)
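
When streaming, it is often useful to print chunks as they arrive while also keeping the full text. The following is a minimal sketch of that pattern. `fake_stream` and `collect_stream` are hypothetical names, and `fake_stream` stands in for `llm("AI is going to", stream=True)` so the example runs without a model:

```python
from typing import Iterable, Iterator


def fake_stream(prompt: str) -> Iterator[str]:
    """Hypothetical stand-in for llm(prompt, stream=True); yields text chunks."""
    for chunk in [" change", " the", " world", "."]:
        yield chunk


def collect_stream(chunks: Iterable[str], echo: bool = True) -> str:
    """Print each chunk as it arrives and return the concatenated text."""
    parts = []
    for text in chunks:
        if echo:
            print(text, end="", flush=True)
        parts.append(text)
    if echo:
        print()  # finish the line once the stream ends
    return "".join(parts)


full_text = collect_stream(fake_stream("AI is going to"))
```

With a real model, the same helper would presumably be called as `collect_stream(llm("AI is going to", stream=True))`.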

You can load models from Hugging Face Hub directly:

llm = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml")

If a model repo has multiple model files (.bin or .gguf files), specify a model file using:

llm = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", model_file="ggml-model.bin")

🤗 Transformers

Note: This is an experimental feature and may change in the future.

To use with 🤗 Transformers, create the model and tokenizer using:

from ctransformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", hf=True)
tokenizer = AutoTokenizer.from_pretrained(model)

You can use 🤗 Transformers text generation pipeline:

from transformers import pipeline

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe("AI is going to", max_new_tokens=256))

You can use 🤗 Transformers generation parameters:

pipe("AI is going to", max_new_tokens=256, do_sample=True, temperature=0.8, repetition_penalty=1.1)

You can use 🤗 Transformers tokenizers:

from ctransformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", hf=True)  # Load model from GGML model repo.
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # Load tokenizer from original model repo.

LangChain

It is integrated into LangChain. See LangChain docs.

GPU

To run some of the model layers on GPU, set the gpu_layers parameter:

llm = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7B-GGML", gpu_layers=50)

CUDA

Install CUDA libraries using:

pip install ctransformers[cuda]

ROCm

To enable ROCm support, install the ctransformers package using:

CT_HIPBLAS=1 pip install ctransformers --no-binary ctransformers

Metal

To enable Metal support, install the ctransformers package using:

CT_METAL=1 pip install ctransformers --no-binary ctransformers

GPTQ

Note: This is an experimental feature, and only LLaMA models are supported, using [ExLlama](https://github.com/turboderp/exllama).

Install additional dependencies using:

pip install ctransformers[gptq]

Load a GPTQ model using:

llm = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7B-GPTQ")

If the model name or path doesn't contain the word gptq, specify model_type="gptq".
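
The rule above can be wrapped in a small helper that adds `model_type="gptq"` only when it is needed. This is a sketch, not part of ctransformers: `gptq_load_kwargs` is a hypothetical name, and it assumes the name check is case-insensitive, which the uppercase `GPTQ` in the example repo name suggests but the text does not state.

```python
def gptq_load_kwargs(model_path_or_repo_id: str) -> dict:
    """Return extra from_pretrained keyword arguments for a GPTQ model.

    Assumption: a name containing "gptq" in any letter case is detected
    automatically, so model_type is only added when the word is absent.
    """
    if "gptq" in model_path_or_repo_id.lower():
        return {}
    return {"model_type": "gptq"}
```

It could then be used as `AutoModelForCausalLM.from_pretrained(path, **gptq_load_kwargs(path))`, where `path` is the model name or local path.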

It can also be used with LangChain. Low-level APIs are not fully supported.

Documentation

Find the documentation on Read the Docs.

Config

| Parameter | Type | Description | Default |
| :-- | :-- | :-- | :-- |
| `top_k` | `int` | The top-k value to use for sampling | `40` |
| `top_p` | `float` | The top-p value to use for sampling | `0.95` |
| `temperature` | `float` | The temperature to use for sampling | `0.8` |
| `repetition_penalty` | `float` | The repetition penalty to use for sampling | `1.1` |
| `last_n_tokens` | `int` | The number of last tokens to use for repetition penalty | `64` |
| `seed` | `int` | The seed value to use for sampling tokens | `-1` |
| `max_new_tokens` | `int` | The maximum number of new tokens to generate | `256` |
| `stop` | `List` | A list of sequences to stop generation when encountered | `None` |
| `stream` | `bool` | Whether to stream the generated text | `False` |
| `reset` | `bool` | Whether to reset the model state before generating text | `True` |
| `batch_size` | `int` | The batch size to use for evaluating tokens in a single prompt | `8` |
| `threads` | `int` | The number of threads to use for evaluating tokens | `-1` |
| `context_length` | `int` | The maximum context length to use | `-1` |
| `gpu_layers` | `int` | The number of layers to run on GPU | `0` |
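
The generation-time parameters in the table can be collected in one place and overridden per call. The sketch below is an illustrative helper, not part of ctransformers: `GenerationDefaults` and `generation_kwargs` are hypothetical names, and the defaults are copied from the table. It leaves out `context_length` and `gpu_layers` on the assumption that they are set when the model is loaded, as the GPU example does for `gpu_layers`.

```python
from dataclasses import asdict, dataclass, fields
from typing import List, Optional


@dataclass
class GenerationDefaults:
    """Defaults copied from the Config table (hypothetical helper class)."""
    top_k: int = 40
    top_p: float = 0.95
    temperature: float = 0.8
    repetition_penalty: float = 1.1
    last_n_tokens: int = 64
    seed: int = -1
    max_new_tokens: int = 256
    stop: Optional[List[str]] = None
    stream: bool = False
    reset: bool = True
    batch_size: int = 8
    threads: int = -1


def generation_kwargs(**overrides) -> dict:
    """Merge overrides into the table defaults, rejecting misspelled names."""
    known = {f.name for f in fields(GenerationDefaults)}
    unknown = set(overrides) - known
    if unknown:
        raise TypeError(f"unknown generation parameter(s): {sorted(unknown)}")
    return asdict(GenerationDefaults(**overrides))


kwargs = generation_kwargs(temperature=0.2, stop=["\n"])
```

Assuming the model call accepts these as keyword arguments (as it does for `stream` in the Usage section), the result could be passed as `llm("AI is going to", **kwargs)`. Rejecting unknown names catches typos such as `temprature` before they reach the model.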

Find the URL for the model card for GPTQ here.

Made with ❤️ by marella
