📚 You can view our Documentation here! 📚
pip uninstall unsloth -y
pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
If you saved a LoRA adapter through Unsloth, you can also continue training using your LoRA weights. Note that the optimizer state is reset in this case. To also restore the optimizer state and continue finetuning, see the next section.
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "LORA_MODEL_NAME",
max_seq_length = max_seq_length,
dtype = dtype,
load_in_4bit = load_in_4bit,
)
trainer = Trainer(...)
trainer.train()
Add lm_head and embed_tokens to target_modules. For Colab, sometimes you will go out of memory for Llama-3 8b. If so, just add lm_head.
model = FastLanguageModel.get_peft_model(
model,
r = 16,
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
"lm_head", "embed_tokens",],
lora_alpha = 16,
)
Then use 2 different learning rates - a 2-10x smaller one for the lm_head or embed_tokens like so:
from unsloth import UnslothTrainer, UnslothTrainingArguments
trainer = UnslothTrainer(
....
args = UnslothTrainingArguments(
....
learning_rate = 5e-5,
embedding_learning_rate = 5e-6, # 2-10x smaller than learning_rate
),
)
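The idea behind the two learning rates can be sketched in plain Python. This is only an illustration of how parameters might be split into two optimizer groups by name; the helper build_param_groups and the example parameter names are hypothetical, and this is not Unsloth's actual implementation.

```python
def build_param_groups(param_names, learning_rate, embedding_learning_rate):
    """Split parameter names into two groups with different learning rates.

    Hypothetical helper for illustration only: names containing "lm_head"
    or "embed_tokens" get the smaller embedding learning rate.
    """
    embedding_keys = ("lm_head", "embed_tokens")
    embedding, other = [], []
    for name in param_names:
        if any(key in name for key in embedding_keys):
            embedding.append(name)
        else:
            other.append(name)
    return [
        {"params": other, "lr": learning_rate},
        {"params": embedding, "lr": embedding_learning_rate},
    ]


# Example parameter names (assumed, for demonstration).
names = [
    "model.layers.0.self_attn.q_proj.lora_A.weight",
    "model.embed_tokens.weight",
    "lm_head.weight",
]
groups = build_param_groups(names, learning_rate=5e-5, embedding_learning_rate=5e-6)
for group in groups:
    print(group["lr"], group["params"])
```

The second group holds the embedding and output-head parameters, which are the ones that would be trained with the smaller rate.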
We now support it! Try the following:
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
trainer = SFTTrainer(
model = model,
tokenizer = tokenizer,
train_dataset = dataset,
data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
...
args = TrainingArguments(
...
),
)
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(trainer)
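Conceptually, training on responses only means that prompt tokens do not contribute to the loss. A minimal sketch of that masking idea follows; mask_non_response_labels and the toy token IDs are hypothetical, and this is not the code that train_on_responses_only runs internally.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def mask_non_response_labels(input_ids, is_response):
    """Return labels where every non-response token is replaced by IGNORE_INDEX."""
    if len(input_ids) != len(is_response):
        raise ValueError("input_ids and is_response must have the same length")
    return [tok if keep else IGNORE_INDEX for tok, keep in zip(input_ids, is_response)]


# Toy example: the first three tokens are the user prompt, the last two the reply.
input_ids = [101, 7592, 2088, 2023, 102]
is_response = [False, False, False, True, True]
labels = mask_non_response_labels(input_ids, is_response)
print(labels)  # [-100, -100, -100, 2023, 102]
```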
You must edit the Trainer first to add save_strategy and save_steps. The example below saves a checkpoint every 50 steps to the folder outputs.
trainer = SFTTrainer(
....
args = TrainingArguments(
....
output_dir = "outputs",
save_strategy = "steps",
save_steps = 50,
),
)
Then start training with:
trainer_stats = trainer.train(resume_from_checkpoint = True)
This resumes training from the latest checkpoint in the output folder.
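To check which checkpoint is the latest one in the output folder, a small standard-library sketch such as the following can help. It assumes the Hugging Face Trainer naming convention of checkpoint-<step> folders; find_latest_checkpoint is a hypothetical helper, not part of Unsloth.

```python
import os
import re
import tempfile


def find_latest_checkpoint(output_dir):
    """Return the path of the checkpoint-<step> folder with the highest step, or None."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    best_step, best_path = -1, None
    for entry in os.listdir(output_dir):
        match = pattern.match(entry)
        path = os.path.join(output_dir, entry)
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


# Demonstration with a temporary folder standing in for "outputs".
with tempfile.TemporaryDirectory() as outputs:
    for step in (50, 100, 150):
        os.makedirs(os.path.join(outputs, f"checkpoint-{step}"))
    latest = find_latest_checkpoint(outputs)
    print(os.path.basename(latest))  # checkpoint-150
```

Comparing steps numerically matters here: a plain string sort would rank checkpoint-50 above checkpoint-150.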
To save to 16bit for vLLM, use:
model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")
To merge to 4bit to load on HuggingFace, first call merged_4bit. Then use merged_4bit_forced if you are certain you want to merge to 4bit. I highly discourage this unless you know what you are going to do with the 4bit model (for example, DPO training or HuggingFace's online inference engine).
model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")
To save just the LoRA adapters, either use:
model.save_pretrained(...) AND tokenizer.save_pretrained(...)
Or just use our builtin function to do that:
model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")
We save to .bin in Colab because it is roughly 4x faster. Set safe_serialization = None to force saving to .safetensors, for example model.save_pretrained(..., safe_serialization = None) or model.push_to_hub(..., safe_serialization = None).
To save to GGUF, use the below to save locally:
model.save_pretrained_gguf("dir", tokenizer, quantization_method = "q4_k_m")
model.save_pretrained_gguf("dir", tokenizer, quantization_method = "q8_0")
model.save_pretrained_gguf("dir", tokenizer, quantization_method = "f16")
To push to the Hub:
model.push_to_hub_gguf("hf_username/dir", tokenizer, quantization_method = "q4_k_m")
model.push_to_hub_gguf("hf_username/dir", tokenizer, quantization_method = "q8_0")
All supported quantization options for quantization_method are listed below:
# https://github.com/ggerganov/llama.cpp/blob/master/examples/quantize/quantize.cpp#L19
# From https://mlabonne.github.io/blog/posts/Quantize_Llama_2_models_using_ggml.html
ALLOWED_QUANTS = \
{
"not_quantized" : "Recommended. Fast conversion. Slow inference, big files.",
"fast_quantized" : "Recommended. Fast conversion. OK inference, OK file size.",
"quantized" : "Recommended. Slow conversion. Fast inference, small files.",
"f32" : "Not recommended. Retains 100% accuracy, but super slow and memory hungry.",
"f16" : "Fastest conversion + retains 100% accuracy. Slow and memory hungry.",
"q8_0" : "Fast conversion. High resource use, but generally acceptable.",
"q4_k_m" : "Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K",
"q5_k_m" : "Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K",
"q2_k" : "Uses Q4_K for the attention.wv and feed_forward.w2 tensors, Q2_K for the other tensors.",
"q3_k_l" : "Uses Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K",
"q3_k_m" : "Uses Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K",
"q3_k_s" : "Uses Q3_K for all tensors",
"q4_0" : "Original quant method, 4-bit.",
"q4_1" : "Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models.",
"q4_k_s" : "Uses Q4_K for all tensors",
"q4_k" : "alias for q4_k_m",
"q5_k" : "alias for q5_k_m",
"q5_0" : "Higher accuracy, higher resource usage and slower inference.",
"q5_1" : "Even higher accuracy, resource usage and slower inference.",
"q5_k_s" : "Uses Q5_K for all tensors",
"q6_k" : "Uses Q8_K for all tensors",
"iq2_xxs" : "2.06 bpw quantization",
"iq2_xs" : "2.31 bpw quantization",
"iq3_xxs" : "3.06 bpw quantization",
"q3_k_xs" : "3-bit extra small quantization",
}
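The bpw (bits per weight) figures in the list above give a quick way to estimate file sizes. The sketch below is a rough back-of-the-envelope calculation only: it assumes an 8-billion-parameter model, treats every weight as stored at the stated bpw, and ignores metadata; estimate_gguf_size_gb is a hypothetical helper.

```python
def estimate_gguf_size_gb(n_params, bits_per_weight):
    """Rough file size in GB (10**9 bytes) for n_params weights at a given bpw."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumed example: an 8-billion-parameter model.
n_params = 8_000_000_000
for name, bpw in [("iq2_xxs", 2.06), ("iq2_xs", 2.31), ("iq3_xxs", 3.06), ("f16", 16.0)]:
    print(f"{name}: ~{estimate_gguf_size_gb(n_params, bpw):.2f} GB")
```

Real files will differ somewhat, since mixed schemes such as q4_k_m store different tensors at different precisions.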
First save your model to 16bit:
model.save_pretrained_merged("merged_model", tokenizer, save_method = "merged_16bit",)
Then use the terminal and do:
git clone --recursive https://github.com/ggerganov/llama.cpp
make clean -C llama.cpp
make all -j -C llama.cpp
pip install gguf protobuf
python llama.cpp/convert_hf_to_gguf.py FOLDER --outfile OUTPUT --outtype f16
Or follow the steps at https://rentry.org/llama-cpp-conversions#merging-loras-into-a-model using the model name "merged_model" to merge to GGUF.
You can try reducing the maximum GPU usage during saving by changing maximum_memory_usage. The default is model.save_pretrained(..., maximum_memory_usage = 0.75). Reduce it to, say, 0.5 to use 50% of GPU peak memory or lower. This can reduce OOM crashes during saving.
First split your training dataset into a train and test split. Set the trainer settings for evaluation to:
new_dataset = dataset.train_test_split(test_size = 0.01)
SFTTrainer(
    args = TrainingArguments(
        fp16_full_eval = True,
        per_device_eval_batch_size = 2,
        eval_accumulation_steps = 4,
        eval_strategy = "steps",
        eval_steps = 1,
    ),
    train_dataset = new_dataset["train"],
    eval_dataset = new_dataset["test"],
)
These settings should avoid OOMs during evaluation and make it somewhat faster, because nothing is upcast to float32.
Assuming your dataset is a list of lists of dictionaries like the below:
[
    [{'from': 'human', 'value': 'Hi there!'},
     {'from': 'gpt', 'value': 'Hi how can I help?'},
     {'from': 'human', 'value': 'What is 2+2?'}],
    [{'from': 'human', 'value': "What's your name?"},
     {'from': 'gpt', 'value': "I'm Daniel!"},
     {'from': 'human', 'value': 'Ok! Nice!'},
     {'from': 'gpt', 'value': 'What can I do for you?'},
     {'from': 'human', 'value': 'Oh nothing :)'},],
]
You can use our get_chat_template to format it. Select chat_template to be any of zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, unsloth, and use mapping to map the dictionary keys and values (from, value, etc.). map_eos_token allows you to map <|im_end|> to EOS without any training.
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
tokenizer,
chat_template = "chatml", # Supports zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, unsloth
mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"}, # ShareGPT style
map_eos_token = True, # Maps <|im_end|> to </s> instead
)
def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }
pass
from datasets import load_dataset
dataset = load_dataset("philschmid/guanaco-sharegpt-style", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)
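To make the mapping concrete, here is a plain-Python sketch of roughly what the ChatML formatting produces for a ShareGPT-style conversation. It assumes the standard ChatML layout (<|im_start|>role, newline, content, <|im_end|>); sharegpt_to_chatml is a hypothetical helper, and the text produced by tokenizer.apply_chat_template may differ in details such as the EOS token when map_eos_token is set.

```python
# Maps ShareGPT speaker names to chat roles, mirroring the mapping argument above.
MAPPING = {"human": "user", "gpt": "assistant"}


def sharegpt_to_chatml(convo, add_generation_prompt=False):
    """Render a ShareGPT-style conversation as ChatML text."""
    parts = []
    for turn in convo:
        role = MAPPING.get(turn["from"], turn["from"])
        parts.append(f"<|im_start|>{role}\n{turn['value']}<|im_end|>\n")
    if add_generation_prompt:
        # Leaves the assistant turn open so the model can continue it.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


convo = [
    {"from": "human", "value": "Hi there!"},
    {"from": "gpt", "value": "Hi how can I help?"},
]
print(sharegpt_to_chatml(convo))
```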
You can also make your own custom chat templates! For example our internal chat template we use is below. You must pass in a tuple of (custom_template, eos_token) where the eos_token must be used inside the template.
unsloth_template = \
"{{ bos_token }}"\
"{{ 'You are a helpful assistant to the user\n' }}"\
"{% for message in messages %}"\
"{% if message['role'] == 'user' %}"\
"{{ '>>> User: ' + message['content'] + '\n' }}"\
"{% elif message['role'] == 'assistant' %}"\
"{{ '>>> Assistant: ' + message['content'] + eos_token + '\n' }}"\
"{% endif %}"\
"{% endfor %}"\
"{% if add_generation_prompt %}"\
"{{ '>>> Assistant: ' }}"\
"{% endif %}"
unsloth_eos_token = "eos_token"
tokenizer = get_chat_template(
tokenizer,
chat_template = (unsloth_template, unsloth_eos_token,), # You must provide a template and EOS token
mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"}, # ShareGPT style
map_eos_token = True, # Maps <|im_end|> to </s> instead
)
Unsloth has a function called add_new_tokens which allows you to add new tokens to your finetune. For example if you want to add <CHARACTER_1>, <THINKING> and <SCRATCH_PAD> we can do the following:
model, tokenizer = FastLanguageModel.from_pretrained(...)
from unsloth import add_new_tokens
add_new_tokens(model, tokenizer, new_tokens = ["<CHARACTER_1>", "<THINKING>", "<SCRATCH_PAD>"])
model = FastLanguageModel.get_peft_model(...)
Note - you MUST always call add_new_tokens before FastLanguageModel.get_peft_model!
Unsloth natively supports 2x faster inference. All QLoRA, LoRA and non LoRA inference paths are 2x faster. This requires no change of code or any new dependencies.
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
max_seq_length = max_seq_length,
dtype = dtype,
load_in_4bit = load_in_4bit,
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
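from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
# inputs is assumed to be a tokenized prompt, e.g. tokenizer([...], return_tensors = "pt").to("cuda")
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 64)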
See https://github.com/googlecolab/colabtools/issues/3409
In a new cell, run the below:
import locale
locale.getpreferredencoding = lambda: "UTF-8"
This section was authored by sebdg. It explains how parameters affect the finetuning process.
Adjusting the LoraConfig parameters allows you to balance model performance and computational efficiency in Low-Rank Adaptation (LoRA). Here’s a concise breakdown of key parameters:
r
- Description: Rank of the low-rank decomposition for factorizing weight matrices.
- Impact:
  - Higher: Retains more information, increases computational load.
  - Lower: Fewer parameters, more efficient training, potential performance drop if too small.
lora_alpha
- Description: Scaling factor for the low-rank matrices' contribution.
- Impact:
  - Higher: Increases influence, speeds up convergence, risks instability or overfitting.
  - Lower: Subtler effect, may require more training steps.
lora_dropout
- Description: Probability of zeroing out elements in low-rank matrices for regularization.
- Impact:
  - Higher: More regularization, prevents overfitting, may slow training and degrade performance.
  - Lower: Less regularization, may speed up training, risks overfitting.
loftq_config
- Description: Configuration for LoftQ, a quantization method for the backbone weights and initialization of LoRA layers.
- Impact:
  - Not None: If specified, LoftQ will quantize the backbone weights and initialize the LoRA layers. It requires setting init_lora_weights='loftq'.
  - None: LoftQ quantization is not applied.
  - Note: Do not pass an already quantized model when using LoftQ as LoftQ handles the quantization process itself.
use_rslora
- Description: Enables Rank-Stabilized LoRA (RSLora).
- Impact:
  - True: Uses Rank-Stabilized LoRA, setting the adapter scaling factor to lora_alpha/math.sqrt(r), which has been proven to work better as per the Rank-Stabilized LoRA paper.
  - False: Uses the original default scaling factor lora_alpha/r.
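The effect of r, lora_alpha and use_rslora can be checked with a little arithmetic. The sketch below computes the two scaling factors described above and the number of trainable parameters LoRA adds to a single weight matrix. The function names are hypothetical and the 4096 x 4096 matrix size is an assumed example, not a value taken from any specific model.

```python
import math


def lora_scaling(lora_alpha, r, use_rslora=False):
    """Adapter scaling factor: lora_alpha / sqrt(r) with RSLoRA, else lora_alpha / r."""
    return lora_alpha / math.sqrt(r) if use_rslora else lora_alpha / r


def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds to one d_out x d_in weight (A is r x d_in, B is d_out x r)."""
    return r * (d_in + d_out)


print(lora_scaling(16, 16))                   # 1.0
print(lora_scaling(16, 16, use_rslora=True))  # 4.0
print(lora_param_count(4096, 4096, 16))       # 131072
```

With r = 16 and lora_alpha = 16, switching use_rslora on raises the scaling factor from 1.0 to 4.0, so the two settings are not interchangeable without revisiting lora_alpha.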
gradient_accumulation_steps
- Default: 1
- Description: The number of steps over which gradients are accumulated before an optimizer (weight) update is performed.
- Impact:
  - Higher: Accumulate gradients over multiple steps, effectively increasing the batch size without requiring additional memory. This can improve training stability and convergence, especially with large models and limited hardware.
  - Lower: More frequent updates, but reaching the same effective batch size requires more memory per step, and smaller effective batches can be less stable.
weight_decay
- Default: 0.01
- Description: Regularization technique that applies a small penalty to the weights during training.
- Impact:
  - Non-zero Value (e.g., 0.01): Adds a penalty proportional to the magnitude of the weights to the loss function, helping to prevent overfitting by discouraging large weights.
  - Zero: No weight decay is applied, which can lead to overfitting, especially in large models or with small datasets.
learning_rate
- Default: 2e-4
- Description: The rate at which the model updates its parameters during training.
- Impact:
  - Higher: Faster convergence but risks overshooting optimal parameters and causing instability in training.
  - Lower: More stable and precise updates but may slow down convergence, requiring more training steps to achieve good performance.
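The relationship between batch size and gradient accumulation is simple arithmetic. The following sketch shows it under the assumption of a single GPU unless stated otherwise; both helper names are hypothetical, and the step count is an approximation that ignores how a particular trainer handles a final partial batch.

```python
def effective_batch_size(per_device_train_batch_size, gradient_accumulation_steps, num_devices=1):
    """Number of training examples that contribute to each optimizer update."""
    return per_device_train_batch_size * gradient_accumulation_steps * num_devices


def optimizer_steps_per_epoch(num_examples, per_device_train_batch_size,
                              gradient_accumulation_steps, num_devices=1):
    """Approximate optimizer updates per epoch (rounded up)."""
    batch = effective_batch_size(per_device_train_batch_size, gradient_accumulation_steps, num_devices)
    return -(-num_examples // batch)  # ceiling division


print(effective_batch_size(2, 4))             # 8
print(optimizer_steps_per_epoch(1000, 2, 4))  # 125
```

A per-device batch size of 2 with 4 accumulation steps therefore behaves like a batch of 8 for each weight update, while only 2 examples need to fit in memory at a time.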
q_proj (query projection)
- Description: Part of the attention mechanism in transformer models, responsible for projecting the input into the query space.
- Impact: Transforms the input into query vectors that are used to compute attention scores.
k_proj (key projection)
- Description: Projects the input into the key space in the attention mechanism.
- Impact: Produces key vectors that are compared with query vectors to determine attention weights.
v_proj (value projection)
- Description: Projects the input into the value space in the attention mechanism.
- Impact: Produces value vectors that are weighted by the attention scores and combined to form the output.
o_proj (output projection)
- Description: Projects the output of the attention mechanism back into the original space.
- Impact: Transforms the combined weighted value vectors back to the input dimension, integrating attention results into the model.
gate_proj (gate projection)
- Description: The gating branch of the gated feed-forward (MLP) block in Llama-style transformer models. The idea is similar in spirit to the gating units in gated recurrent units (GRUs) and other gating mechanisms.
- Impact: Controls the flow of information through the gate, allowing selective information passage based on learned weights.
up_proj (up projection)
- Description: Used for up-projection, typically increasing the dimensionality of the input.
- Impact: Expands the input to a higher-dimensional space, often used in feedforward layers or when transitioning between different layers with differing dimensionalities.
down_proj (down projection)
- Description: Used for down-projection, typically reducing the dimensionality of the input.
- Impact: Compresses the input to a lower-dimensional space, useful for reducing computational complexity and controlling the model size.
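How gate_proj, up_proj and down_proj fit together can be shown with a toy feed-forward block in plain Python. This is a sketch that assumes a Llama-style gated MLP with a SiLU activation; the tiny weight matrices are made up for illustration, and real models use tensors with thousands of dimensions.

```python
import math


def matvec(weight, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def silu(z):
    """SiLU activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def gated_mlp(x, gate_proj, up_proj, down_proj):
    """Llama-style feed-forward: down_proj(silu(gate_proj(x)) * up_proj(x))."""
    gate = [silu(g) for g in matvec(gate_proj, x)]   # gating values
    up = matvec(up_proj, x)                          # expanded representation
    hidden = [g * u for g, u in zip(gate, up)]       # gate modulates the expansion
    return matvec(down_proj, hidden)                 # back to the hidden size


# Toy sizes (assumed): hidden size 2, intermediate size 3.
gate_proj = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
up_proj = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
down_proj = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
out = gated_mlp([1.0, 2.0], gate_proj, up_proj, down_proj)
print(len(out))  # 2, the same size as the input
```

The up and gate projections expand the input from 2 to 3 dimensions, and the down projection compresses the gated result back to 2.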
Read this 3 step guide, which details how to use llama.cpp to convert an Unsloth LoRA adapter to GGML (.bin) and use it in Ollama: https://medium.com/p/edadb6d9e0f0
This article was written by Sarin Suriyakoon.
This guide explains how to set up the fine-tuned model we trained using unsloth in a Google Colab training notebook and call the model locally via the Ollama CLI.
This Ollama guide was written by Jed Tiotuico
To successfully run the fine-tuned model, we need:
- Hugging Face account
- A base unsloth model. For this guide, we have chosen unsloth/tinyllama as the base model
- A basic understanding of the unsloth FastLanguageModel. In particular, fine-tuning unsloth/tinyllama. We recommend their Google Colab training notebooks on huggingface for more information on the training data
- The LoRA adapters that were saved online via the huggingface hub
- A working local ollama installation: as of writing, we used 0.1.32, but it should work with later versions.
ollama --version
ollama version is 0.1.32
To recall, we provided some training code using unsloth FastLanguageModel. Please note that we can log in on huggingface on Google Colab by setting our API token as a secret token labeled “HF_TOKEN”
import os
from google.colab import userdata
hf_token = userdata.get("HF_TOKEN")
os.environ['HF_TOKEN'] = hf_token
We then run the cli command below to login
!huggingface-cli login --token $HF_TOKEN
To check our token is working, run
!huggingface-cli whoami
Below is a sample training code from the Unsloth notebook
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048
dtype = None
load_in_4bit = True
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "unsloth/tinyllama", # "unsloth/tinyllama" for 16bit loading
max_seq_length = max_seq_length,
dtype = dtype,
load_in_4bit = load_in_4bit,
)
Moreover, we used the training code below. We provided dataset and eval_dataset for our training data, which had only one text column.
from trl import SFTTrainer
from transformers import TrainingArguments
from transformers.utils import logging
logging.set_verbosity_info()
trainer = SFTTrainer(
model = model,
tokenizer = tokenizer,
train_dataset = dataset,
eval_dataset = eval_dataset,
dataset_text_field = "text",
max_seq_length = max_seq_length,
dataset_num_proc = 2,
packing = True, # Packs short sequences together to save time!
args = TrainingArguments(
per_device_train_batch_size = 2,
gradient_accumulation_steps = 4,
warmup_ratio = 0.1,
num_train_epochs = 2,
learning_rate = 2e-5,
fp16 = not torch.cuda.is_bf16_supported(),
bf16 = torch.cuda.is_bf16_supported(),
logging_steps = 1,
optim = "adamw_8bit",
weight_decay = 0.1,
lr_scheduler_type = "linear",
seed = 3407,
output_dir = "outputs",
),
)
trainer_stats = trainer.train()
Then, we should be able to run our inference, as shown below.
FastLanguageModel.for_inference(model)
inputs = tokenizer(
[
"""
<s>
Q:
What is the capital of France?
A:
"""
], return_tensors = "pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens = 1000, use_cache = True)
print(tokenizer.batch_decode(outputs))
Lastly, below, we demonstrate how to save the model online via huggingface
model.push_to_hub_merged("myhfusername/my-model", tokenizer, save_method = "lora")
Part of this guide is adapted from the page at https://rentry.org/llama-cpp-conversions#setup
Clone the llama.cpp repository using
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
llama.cpp has Python scripts that we need to run, so we need to pip install its dependencies
pip install -r requirements.txt
Now, let us build our local llama.cpp
make clean && make all -j
For anyone with nvidia GPUs
make clean && LLAMA_CUDA=1 make all -j
2. Clone our huggingface base model and the LoRA adapters from huggingface hub we uploaded earlier, where we used the push_to_hub_merged() function
From the llama.cpp folder let us clone our base model.
git clone https://huggingface.co/unsloth/tinyllama
Next, we clone our Lora model
git clone https://huggingface.co/myhfusername/my-model
We now need to convert both the base model and the Lora adapters.
python convert.py tinyllama --outtype f16 --outfile tinyllama.f16.gguf
python convert-lora-to-ggml.py my-model
If the conversion succeeds, the last lines from our output should be
Converted my-model/adapter_config.json and my-model/adapter_model.safetensors to my-model/ggml-adapter-model.bin
Next, merge the adapter into the base model with export-lora, where:
- --model-base is the base gguf model
- --model-out is the new gguf model
- --lora is the adapter model
export-lora --model-base tinyllama.f16.gguf --model-out tinyllama-my-model.gguf --lora my-model/ggml-adapter-model.bin
Lastly we quantize the merged model
quantize tinyllama-my-model.gguf tinyllama-my-model.Q8_0.gguf Q8_0
FROM tinyllama-my-model.gguf
### Set the system message
SYSTEM """
You are a super helpful helper.
"""
PARAMETER stop <s>
PARAMETER stop </s>
ollama create my-model -f Modelfile
ollama run my-model "<s>\nQ: \nWhat is the capital of France?\nA:\n"
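If you prefer to generate the Modelfile from a script instead of writing it by hand, a small sketch like the one below reproduces the file shown above. render_modelfile is a hypothetical helper; it only builds text using the FROM, SYSTEM and PARAMETER stop lines that appear in this guide.

```python
def render_modelfile(gguf_path, system_message, stop_tokens):
    """Build Modelfile text with FROM, SYSTEM and PARAMETER stop lines."""
    lines = [
        f"FROM {gguf_path}",
        "### Set the system message",
        'SYSTEM """',
        system_message,
        '"""',
    ]
    lines += [f"PARAMETER stop {token}" for token in stop_tokens]
    return "\n".join(lines) + "\n"


modelfile = render_modelfile(
    "tinyllama-my-model.gguf",
    "You are a super helpful helper.",
    ["<s>", "</s>"],
)
print(modelfile)
```

Writing the returned string to a file named Modelfile in the same folder as the GGUF file lets the ollama create command above pick it up.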
Support for NVIDIA Pascal family of cards, specifically the P40 and P100.
- Create three files (Dockerfile, unsloth_env_file.yml, and docker-compose.yml) with the contents provided below.
- Ensure Docker and Docker Compose are installed on your system.
- Install the NVIDIA Container Toolkit for GPU support if not already done.
- Place all three files in the same directory.
- Open a terminal and navigate to the directory containing these files.
- Run the following command to build and start the container:
docker-compose up --build
- Once the container is running, access Jupyter Lab by opening a web browser and navigating to http://localhost:8888.
# Stage 1: Base image with system dependencies
FROM nvidia/cuda:11.8.0-devel-ubuntu22.04 as base
# Install system dependencies
RUN apt-get update && apt-get install -y \
build-essential \
git \
vim \
curl \
wget \
&& rm -rf /var/lib/apt/lists/*
# Install Miniconda only if it's not already installed
RUN if [ ! -d "/opt/conda" ]; then \
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh && \
bash miniconda.sh -b -p /opt/conda && \
rm miniconda.sh; \
fi
# Set path to conda
ENV PATH /opt/conda/bin:$PATH
# Stage 2: Python environment setup
FROM base as python-env
COPY unsloth_env_file.yml unsloth_env_file.yml
RUN conda env create -f unsloth_env_file.yml
SHELL ["conda", "run", "-n", "unsloth_env", "/bin/bash", "-c"]
# Stage 3: Final image
FROM python-env as final
# Install Unsloth (This step is separate because it's likely to change more frequently)
RUN pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
ENV PATH /usr/local/cuda/bin:$PATH
ENV LD_LIBRARY_PATH /usr/local/cuda/lib64:$LD_LIBRARY_PATH
# Set the working directory
WORKDIR /workspace
# Set the default command to run Jupyter Lab
CMD ["conda", "run", "--no-capture-output", "-n", "unsloth_env", "jupyter", "lab", "--ip=0.0.0.0", "--no-browser", "--allow-root", "--NotebookApp.token=''", "--NotebookApp.password=''"]
name: unsloth_env
channels:
- xformers
- pytorch
- nvidia
- conda-forge
- defaults
dependencies:
- _libgcc_mutex=0.1=conda_forge
- _openmp_mutex=4.5=2_gnu
- aiohttp=3.9.5=py310h5eee18b_0
- aiosignal=1.2.0=pyhd3eb1b0_0
- anyio=4.2.0=py310h06a4308_0
- argon2-cffi=21.3.0=pyhd3eb1b0_0
- argon2-cffi-bindings=21.2.0=py310h7f8727e_0
- arrow-cpp=16.1.0=hc1eb8f0_0
- async-lru=2.0.4=pyhd8ed1ab_0
- async-timeout=4.0.3=py310h06a4308_0
- attrs=23.1.0=py310h06a4308_0
- aws-c-auth=0.6.19=h5eee18b_0
- aws-c-cal=0.5.20=hdbd6064_0
- aws-c-common=0.8.5=h5eee18b_0
- aws-c-compression=0.2.16=h5eee18b_0
- aws-c-event-stream=0.2.15=h6a678d5_0
- aws-c-http=0.6.25=h5eee18b_0
- aws-c-io=0.13.10=h5eee18b_0
- aws-c-mqtt=0.7.13=h5eee18b_0
- aws-c-s3=0.1.51=hdbd6064_0
- aws-c-sdkutils=0.1.6=h5eee18b_0
- aws-checksums=0.1.13=h5eee18b_0
- aws-crt-cpp=0.18.16=h6a678d5_0
- aws-sdk-cpp=1.10.55=h721c034_0
- babel=2.14.0=pyhd8ed1ab_0
- beautifulsoup4=4.12.3=py310h06a4308_0
- blas=1.0=mkl
- bleach=4.1.0=pyhd3eb1b0_0
- boost-cpp=1.82.0=hdb19cb5_2
- bottleneck=1.3.7=py310ha9d4c09_0
- brotli-python=1.0.9=py310h6a678d5_8
- bzip2=1.0.8=h5eee18b_6
- c-ares=1.19.1=h5eee18b_0
- ca-certificates=2024.7.4=hbcca054_0
- certifi=2024.7.4=pyhd8ed1ab_0
- cffi=1.16.0=py310h5eee18b_1
- charset-normalizer=3.3.2=pyhd3eb1b0_0
- cuda-cudart=11.8.89=0
- cuda-cupti=11.8.87=0
- cuda-libraries=11.8.0=0
- cuda-nvrtc=11.8.89=0
- cuda-nvtx=11.8.86=0
- cuda-runtime=11.8.0=0
- cuda-version=11.8=hcce14f8_3
- cudatoolkit=11.8.0=h6a678d5_0
- datasets=2.19.1=py310h06a4308_0
- debugpy=1.6.7=py310h6a678d5_0
- decorator=5.1.1=pyhd3eb1b0_0
- defusedxml=0.7.1=pyhd3eb1b0_0
- dill=0.3.8=py310h06a4308_0
- entrypoints=0.4=py310h06a4308_0
- ffmpeg=4.3=hf484d3e_0
- filelock=3.13.1=py310h06a4308_0
- freetype=2.12.1=h4a9f257_0
- frozenlist=1.4.0=py310h5eee18b_0
- fsspec=2024.3.1=py310h06a4308_0
- gflags=2.2.2=h6a678d5_1
- glog=0.5.0=h6a678d5_1
- gmp=6.2.1=h295c915_3
- gmpy2=2.1.2=py310heeb90bb_0
- gnutls=3.6.15=he1e5248_0
- h11=0.14.0=pyhd8ed1ab_0
- h2=4.1.0=pyhd8ed1ab_0
- hpack=4.0.0=pyh9f0ad1d_0
- httpcore=1.0.5=pyhd8ed1ab_0
- httpx=0.27.0=pyhd8ed1ab_0
- hyperframe=6.0.1=pyhd8ed1ab_0
- icu=73.1=h6a678d5_0
- idna=3.7=py310h06a4308_0
- importlib-metadata=7.0.1=py310h06a4308_0
- importlib_metadata=7.0.1=hd8ed1ab_0
- importlib_resources=6.4.0=pyhd8ed1ab_0
- intel-openmp=2023.1.0=hdb19cb5_46306
- ipykernel=6.28.0=py310h06a4308_0
- ipython_genutils=0.2.0=pyhd3eb1b0_1
- jedi=0.19.1=py310h06a4308_0
- jinja2=3.1.4=py310h06a4308_0
- jpeg=9e=h5eee18b_2
- json5=0.9.25=pyhd8ed1ab_0
- jsonschema=4.19.2=py310h06a4308_0
- jsonschema-specifications=2023.7.1=py310h06a4308_0
- jupyter-lsp=2.2.5=pyhd8ed1ab_0
- jupyter_client=7.4.9=py310h06a4308_0
- jupyter_core=5.7.2=py310h06a4308_0
- jupyter_events=0.10.0=py310h06a4308_0
- jupyter_server=2.14.1=py310h06a4308_0
- jupyter_server_terminals=0.4.4=py310h06a4308_1
- jupyterlab=4.2.4=pyhd8ed1ab_0
- jupyterlab_pygments=0.3.0=pyhd8ed1ab_1
- jupyterlab_server=2.27.3=pyhd8ed1ab_0
- krb5=1.20.1=h143b758_1
- lame=3.100=h7b6447c_0
- lcms2=2.12=h3be6417_0
- ld_impl_linux-64=2.38=h1181459_1
- lerc=3.0=h295c915_0
- libabseil=20240116.2=cxx17_h6a678d5_0
- libboost=1.82.0=h109eef0_2
- libbrotlicommon=1.0.9=h5eee18b_8
- libbrotlidec=1.0.9=h5eee18b_8
- libbrotlienc=1.0.9=h5eee18b_8
- libcublas=11.11.3.6=0
- libcufft=10.9.0.58=0
- libcufile=1.9.1.3=0
- libcurand=10.3.5.147=0
- libcurl=8.7.1=h251f7ec_0
- libcusolver=11.4.1.48=0
- libcusparse=11.7.5.86=0
- libdeflate=1.17=h5eee18b_1
- libedit=3.1.20230828=h5eee18b_0
- libev=4.33=h7f8727e_1
- libevent=2.1.12=hdbd6064_1
- libffi=3.4.4=h6a678d5_1
- libgcc-ng=14.1.0=h77fa898_0
- libgomp=14.1.0=h77fa898_0
- libgrpc=1.62.2=h2d74bed_0
- libiconv=1.16=h5eee18b_3
- libidn2=2.3.4=h5eee18b_0
- libjpeg-turbo=2.0.0=h9bf148f_0
- libnghttp2=1.57.0=h2d74bed_0
- libnpp=11.8.0.86=0
- libnvjpeg=11.9.0.86=0
- libpng=1.6.39=h5eee18b_0
- libprotobuf=4.25.3=he621ea3_0
- libsodium=1.0.18=h7b6447c_0
- libssh2=1.11.0=h251f7ec_0
- libstdcxx-ng=11.2.0=h1234567_1
- libtasn1=4.19.0=h5eee18b_0
- libthrift=0.15.0=h1795dd8_2
- libtiff=4.5.1=h6a678d5_0
- libunistring=0.9.10=h27cfd23_0
- libuuid=1.41.5=h5eee18b_0
- libwebp-base=1.3.2=h5eee18b_0
- llvm-openmp=14.0.6=h9e868ea_0
- lz4-c=1.9.4=h6a678d5_1
- markupsafe=2.1.3=py310h5eee18b_0
- mistune=2.0.4=py310h06a4308_0
- mkl=2023.1.0=h213fc3f_46344
- mkl-service=2.4.0=py310h5eee18b_1
- mkl_fft=1.3.8=py310h5eee18b_0
- mkl_random=1.2.4=py310hdb19cb5_0
- mpc=1.1.0=h10f8cd9_1
- mpfr=4.0.2=hb69a4c5_1
- mpmath=1.3.0=py310h06a4308_0
- multidict=6.0.4=py310h5eee18b_0
- multiprocess=0.70.15=py310h06a4308_0
- nb_conda_kernels=2.3.1=py310h06a4308_0
- nbclassic=1.1.0=py310h06a4308_0
- nbclient=0.8.0=py310h06a4308_0
- nbconvert=7.10.0=py310h06a4308_0
- nbformat=5.9.2=py310h06a4308_0
- ncurses=6.4=h6a678d5_0
- nest-asyncio=1.6.0=py310h06a4308_0
- nettle=3.7.3=hbbd107a_1
- networkx=3.3=py310h06a4308_0
- notebook=6.5.7=py310h06a4308_0
- notebook-shim=0.2.3=py310h06a4308_0
- numexpr=2.8.7=py310h85018f9_0
- numpy=1.26.4=py310h5f9d8c6_0
- numpy-base=1.26.4=py310hb5e798b_0
- openh264=2.1.1=h4ff587b_0
- openjpeg=2.4.0=h9ca470c_2
- openssl=3.3.1=h4bc722e_2
- orc=2.0.1=h2d29ad5_0
- overrides=7.4.0=py310h06a4308_0
- packaging=24.1=py310h06a4308_0
- pandas=2.2.2=py310h6a678d5_0
- pandocfilters=1.5.0=pyhd3eb1b0_0
- pillow=10.4.0=py310h5eee18b_0
- pip=24.0=py310h06a4308_0
- platformdirs=3.10.0=py310h06a4308_0
- prometheus_client=0.14.1=py310h06a4308_0
- prompt_toolkit=3.0.43=hd3eb1b0_0
- psutil=5.9.0=py310h5eee18b_0
- ptyprocess=0.7.0=pyhd3eb1b0_2
- pure_eval=0.2.2=pyhd3eb1b0_0
- pyarrow=16.1.0=py310h1128e8f_0
- pycparser=2.21=pyhd3eb1b0_0
- pysocks=1.7.1=py310h06a4308_0
- python=3.10.14=h955ad1f_1
- python-dateutil=2.9.0post0=py310h06a4308_2
- python-fastjsonschema=2.16.2=py310h06a4308_0
- python-json-logger=2.0.7=py310h06a4308_0
- python-tzdata=2023.3=pyhd3eb1b0_0
- python-xxhash=2.0.2=py310h5eee18b_1
- pytorch=2.1.0=py3.10_cuda11.8_cudnn8.7.0_0
- pytorch-cuda=11.8=h7e8668a_5
- pytorch-mutex=1.0=cuda
- pytz=2024.1=py310h06a4308_0
- pyyaml=6.0.1=py310h5eee18b_0
- pyzmq=24.0.1=py310h5eee18b_0
- re2=2022.04.01=h295c915_0
- readline=8.2=h5eee18b_0
- referencing=0.30.2=py310h06a4308_0
- regex=2023.10.3=py310h5eee18b_0
- requests=2.32.3=py310h06a4308_0
- rfc3339-validator=0.1.4=py310h06a4308_0
- rfc3986-validator=0.1.1=py310h06a4308_0
- rpds-py=0.10.6=py310hb02cf49_0
- s2n=1.3.27=hdbd6064_0
- safetensors=0.4.2=py310ha89cbab_1
- send2trash=1.8.2=py310h06a4308_0
- setuptools=69.5.1=py310h06a4308_0
- six=1.16.0=pyhd3eb1b0_1
- snappy=1.1.10=h6a678d5_1
- sniffio=1.3.0=py310h06a4308_0
- soupsieve=2.5=py310h06a4308_0
- sqlite=3.45.3=h5eee18b_0
- stack_data=0.2.0=pyhd3eb1b0_0
- sympy=1.12=py310h06a4308_0
- tbb=2021.8.0=hdb19cb5_0
- terminado=0.17.1=py310h06a4308_0
- tinycss2=1.2.1=py310h06a4308_0
- tk=8.6.14=h39e8969_0
- tokenizers=0.19.1=py310hff361bb_0
- tomli=2.0.1=pyhd8ed1ab_0
- torchaudio=2.1.0=py310_cu118
- torchtriton=2.1.0=py310
- torchvision=0.16.0=py310_cu118
- tornado=6.4.1=py310h5eee18b_0
- tqdm=4.66.4=py310h2f386ee_0
- traitlets=5.14.3=py310h06a4308_0
- typing-extensions=4.11.0=py310h06a4308_0
- typing_extensions=4.11.0=py310h06a4308_0
- tzdata=2024a=h04d1e81_0
- urllib3=2.2.2=py310h06a4308_0
- utf8proc=2.6.1=h5eee18b_1
- webencodings=0.5.1=py310h06a4308_1
- websocket-client=1.8.0=py310h06a4308_0
- wheel=0.43.0=py310h06a4308_0
- xformers=0.0.22.post7=py310_cu11.8.0_pyt2.1.0
- xxhash=0.8.0=h7f8727e_3
- xz=5.4.6=h5eee18b_1
- yaml=0.2.5=h7b6447c_0
- yarl=1.9.3=py310h5eee18b_0
- zeromq=4.3.5=h6a678d5_0
- zipp=3.17.0=py310h06a4308_0
- zlib=1.2.13=h5eee18b_1
- zstd=1.5.5=hc292b87_2
- pip:
    - accelerate==0.33.0
    - asttokens==2.4.1
    - bitsandbytes==0.43.2
    - comm==0.2.2
    - docstring-parser==0.16
    - exceptiongroup==1.2.2
    - executing==2.0.1
    - gguf==0.9.1
    - hf-transfer==0.1.8
    - huggingface-hub==0.24.2
    - iprogress==0.4
    - ipython==8.26.0
    - ipywidgets==8.1.3
    - jupyterlab-widgets==3.0.11
    - markdown-it-py==3.0.0
    - matplotlib-inline==0.1.7
    - mdurl==0.1.2
    - parso==0.8.4
    - peft==0.12.0
    - pexpect==4.9.0
    - prompt-toolkit==3.0.47
    - protobuf==3.20.3
    - pure-eval==0.2.3
    - pygments==2.18.0
    - rich==13.7.1
    - sentencepiece==0.2.0
    - shtab==1.7.1
    - stack-data==0.6.3
    - transformers==4.43.3
    - trl==0.8.6
    - tyro==0.8.5
    - wcwidth==0.2.13
    - widgetsnbextension==4.0.11
version: '3.8'
services:
  unsloth-env:
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - ./cache:/root/.cache
      - ./workspace:/workspace
    working_dir: /workspace
    ports:
      - "8888:8888" # For Jupyter Lab
    tty: true
    stdin_open: true
    build:
      context: .
      dockerfile: Dockerfile
[Updated 10th November 2024] Want to work on cool Triton kernels, optimizations and maths algorithms to make LLMs and AI more accessible? Come join us! We currently have over 2.5 million monthly Hugging Face model downloads and collaborate with Meta, Google, Hugging Face on open models. We fixed dozens of bugs in Gemma, Llama & Phi, Mistral, helped fix a gradient accumulation bug, showed how gradient checkpointing can be improved to reduce VRAM and more!
We value engineers who are proactive, independent and who ship features and ideas quickly - if stuff breaks, that's fine with us! Internships are 3 monthly renewable roles ($100K - $120K USD pa) SF focused or remote. Full time roles ($150K - $250K USD pa with equity) SF only. We're Y Combinator S24 alumni & backed by GitHub!
!! On our criteria for being considered for an internship / full time role !!
- Debug, solve urgent issues / bugs and make 3 merged PRs for interns / 6 merged PRs for a full time engineer
- OR Create a high quality accepted PR on 1 item below. 📚=Software Eng 🔢=Kernels 🛠️=Infra:
- 📚Optimized QLoRA finetuning for FLUX / stable diffusion models. Diffusers just added 4bit QLoRA support - make this faster. Provide a Colab notebook tutorial on how to use it. Do NOT copy paste from Diffusers. Show what optimizations you did.
- 📚Unoptimized Apple Silicon / Metal support LoRA - MLX, Core ML Tools etc support.
- 📚Utilities to export Unsloth finetunes to vLLM, SGLang & Ollama - LoRA adapters only. Provide a standalone serving interface to vLLM, SGLang
- 🔢Add float8 + QLoRA finetuning support via Torch AO into Unsloth.
- 🔢Bitsandbytes 4bit QLoRA dequant Triton kernel - must be faster than CUDA version.
- 📚Add TPU (maybe JAX?) & AMD support into Unsloth. TPU - Colab & Kaggle notebooks. AMD - Runpod equivalent.
- 🔢Use torch AO and add MXFP4 support in preparation for Blackwell. Show experiments on loss curves matching.
- 🔢Add fully optimized Deepseek finetuning support - investigate Scatter MoE - confirm loss curves match.
- 🔢Add full finetuning / pretraining support in Unsloth - Triton kernels for all.
- 🔢FSDPv2 + QLoRA (maybe via Torch AO) + torch.compile. Investigate PyTorch native Pipeline, Sequence & Tensor parallelism.
- 🛠️Using spot instances to train models with a checkpoint recovery mechanism (like SkyPilot)
- 🛠️Modal but using spot instances
- 🔢Use binary tensor cores for fast Hamming distances. Show this works in approximate nearest neighbors
- 🔢Make torch.compile work on gradient checkpointing with compiled autograd and removing torch._dynamo.disable
- 🔢Port Flex Attention to Unsloth for all models (Llama, Mistral, Gemma etc). Must be torch compilable and faster than naive SDPA.
For more details, email me, or ask me questions on Discord! For more information about us, see our:
- CUDA / GPU Mode lecture Talk. Youtube link
- Low Level Technicals of LLMs. Youtube link
- Fixing bugs in Llama, Mistral, Gemma. Youtube link
- PyTorch Conference Mini Talk. Youtube link
- PyTorch Engineers Meeting Talk. Youtube link
- Hugging Face Collab Blog. Blog link