🤗 Welcome! This repository contains minimal recipes to get started quickly with the Gemma family of models.
Note:
- Gemma 3n Conversational Fine tuning 2B on a Free Colab Notebook
- Gemma 3n Conversational Fine tuning 4B on a Free Colab Notebook
- Gemma 3n Multimodal Fine tuning 2B/4B on a Free Colab Notebook
To quickly run a Gemma model on your machine, install the latest versions of timm (for the vision encoder) and 🤗 Transformers, whether you want to run inference or fine-tune the model.
```bash
$ pip install -U -q transformers timm
```
The easiest way to start using Gemma 3n is by using the pipeline abstraction in transformers:
```python
import torch
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="google/gemma-3n-E4B-it",  # or "google/gemma-3n-E2B-it"
    device="cuda",
    torch_dtype=torch.bfloat16,
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/ariG23498/demo-data/resolve/main/airplane.jpg"},
            {"type": "text", "text": "Describe this image"}
        ]
    }
]

output = pipe(text=messages, max_new_tokens=32)
print(output[0]["generated_text"][-1]["content"])
```
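The pipeline returns the whole conversation with the model's reply appended as the last message, which is why we index `[-1]["content"]` to print only the generated answer. If you need more control over preprocessing and generation, you can drop down to the model and processor directly.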
Initialize the model and the processor from the Hub, and write a `model_generation` function that takes care of processing the prompts and running inference on the model.
```python
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

model_id = "google/gemma-3n-e4b-it"  # or google/gemma-3n-e2b-it
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id).to(device)

def model_generation(model, messages):
    # Turn the chat messages (text, image, audio) into model-ready tensors.
    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    )
    input_len = inputs["input_ids"].shape[-1]
    inputs = inputs.to(model.device, dtype=model.dtype)

    with torch.inference_mode():
        generation = model.generate(**inputs, max_new_tokens=32, disable_compile=False)
        generation = generation[:, input_len:]

    decoded = processor.batch_decode(generation, skip_special_tokens=True)
    print(decoded[0])
```
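Here `apply_chat_template` fetches and preprocesses any image or audio referenced in the messages, and slicing the generation at `input_len` strips the prompt tokens so only the newly generated text is decoded.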
We can then call it with the specific modality we want to use:
```python
# Text Only
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is the capital of France?"}
        ]
    }
]
model_generation(model, messages)

# Interleaved with Audio
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe the following speech segment in English:"},
            {"type": "audio", "audio": "https://huggingface.co/datasets/ariG23498/demo-data/resolve/main/speech.wav"},
        ]
    }
]
model_generation(model, messages)

# Interleaved with Image
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/ariG23498/demo-data/resolve/main/airplane.jpg"},
            {"type": "text", "text": "Describe this image."}
        ]
    }
]
model_generation(model, messages)
```
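Since Gemma 3n accepts text, image, and audio inputs in the same conversation, you can also interleave several media types in one turn. The snippet below is an illustrative sketch that reuses the `model_generation` helper and the demo files above; the combined prompt is our own example, not one from the notebooks.

```python
# Interleaved with Image and Audio (illustrative sketch)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/ariG23498/demo-data/resolve/main/airplane.jpg"},
            {"type": "audio", "audio": "https://huggingface.co/datasets/ariG23498/demo-data/resolve/main/speech.wav"},
            {"type": "text", "text": "Describe the image, then transcribe the audio."}
        ]
    }
]
model_generation(model, messages)
```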
We include a series of notebooks and scripts for fine-tuning the models:
- Gemma 3n Conversational Fine tuning 2B on free Colab T4
- Gemma 3n Conversational Fine tuning 4B with Unsloth on free Colab T4
- Gemma 3n Multimodal Fine tuning 2B/4B with Unsloth on free Colab T4
- Fine tuning Gemma 3n on audio
- Fine tuning Gemma 3n on GUI Grounding
- Fine tuning Gemma 3n on video+audio using FineVideo (all modalities)
- Fine tuning Gemma 3n on images using TRL
- Fine tuning Gemma 3n on images (script)
- Fine tuning Gemma 3n on audio (script)
- Fine tuning Gemma 3n on video+audio using FineVideo (all modalities)
- Reinforcement Learning (GRPO) on Gemma 3 with Unsloth and TRL
- Vision fine tuning Gemma 3 4B with Unsloth
- Conversational fine tuning Gemma 3 4B with Unsloth
Before fine-tuning the model, ensure all dependencies are installed:
```bash
$ pip install -U -q -r requirements.txt
```
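If you want a quick feel for what these recipes do before opening a notebook, the sketch below shows a minimal supervised fine-tuning loop with TRL's `SFTTrainer`. It is an illustrative example only: the dataset (`trl-lib/Capybara`), the LoRA settings, and the hyperparameters are placeholders, not the exact configuration used in the notebooks or scripts above.

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Placeholder conversational dataset; swap in your own formatted data.
dataset = load_dataset("trl-lib/Capybara", split="train")

trainer = SFTTrainer(
    model="google/gemma-3n-E2B-it",  # loaded from the Hub by id
    args=SFTConfig(
        output_dir="gemma-3n-sft",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        max_steps=100,  # keep the demo run short
    ),
    train_dataset=dataset,
    # LoRA keeps memory use low enough for modest GPUs.
    peft_config=LoraConfig(r=16, lora_alpha=32, target_modules="all-linear"),
)
trainer.train()
```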
✨ Bonus: We've also experimented with adding object detection capabilities to Gemma 3. You can explore that work in this dedicated repo.