BiDoRA is a Python package implementing true BiDoRA (Bi-level Optimization-Based Weight-Decomposed Low-Rank Adaptation) for efficient fine-tuning of Large Language Models. Specifically optimized for:
- 3D Code Generation (Rust, Blender, CAD)
- Spatial Intelligence Tasks
- Small Datasets (<10k samples)
- Automatic Hardware Adaptation (Laptop to A100)
BiDoRA uses bi-level optimization to separately optimize magnitude and direction components of weight updates:
W' = m · (W₀ + BA) / ||W₀ + BA||

where:
- m — magnitude vector (optimized at the upper level)
- (W₀ + BA) / ||W₀ + BA|| — direction (optimized at the lower level), with W₀ the frozen pretrained weight, B and A the low-rank factors, and ||·|| the column-wise norm
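As a minimal sketch (illustrative PyTorch, not the package's internals; shapes and variable names are assumptions), the decomposed weight can be computed like this:

```python
import torch

d_out, d_in, r = 64, 64, 8
W0 = torch.randn(d_out, d_in)              # frozen pretrained weight
B = torch.zeros(d_out, r)                  # low-rank factors: direction (lower level)
A = 0.01 * torch.randn(r, d_in)
m = W0.norm(dim=0, keepdim=True).clone()   # magnitude vector, one entry per column (upper level)

V = W0 + B @ A                                  # unnormalized direction
W_prime = m * V / V.norm(dim=0, keepdim=True)   # magnitude * unit-norm direction
```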
Training Process:
- Lower Level: Optimize direction (A, B matrices) on training set
- Upper Level: Optimize magnitude (m) on validation set via hypergradients
- Final Phase: Direction fine-tuning on combined data with fixed magnitude
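A self-contained toy sketch of these three phases on a single decomposed linear map (first-order approximation for readability; the actual implementation computes hypergradients for the upper level and is driven by the package's trainer, not this loop):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, r, steps = 16, 4, 200

W0 = torch.randn(d, d)                                      # frozen base weight
A = (0.01 * torch.randn(r, d)).requires_grad_()             # direction (lower level)
B = torch.zeros(d, r, requires_grad=True)
m = W0.norm(dim=0, keepdim=True).clone().requires_grad_()   # magnitude (upper level)

def forward(x):
    V = W0 + B @ A
    return x @ (m * V / V.norm(dim=0, keepdim=True)).T

x_tr, y_tr = torch.randn(64, d), torch.randn(64, d)         # synthetic train split
x_va, y_va = torch.randn(64, d), torch.randn(64, d)         # synthetic validation split

dir_opt = torch.optim.AdamW([A, B], lr=2e-4)                # lower level
mag_opt = torch.optim.AdamW([m], lr=2e-4 * 2.0)             # upper level (LR multiplier 2.0)

for _ in range(steps):
    # Lower level: optimize direction (A, B) on the training set
    dir_opt.zero_grad(); mag_opt.zero_grad()
    F.mse_loss(forward(x_tr), y_tr).backward()
    mag_opt.zero_grad()                                     # keep the training gradient off m
    dir_opt.step()

    # Upper level: optimize magnitude (m) on the validation set
    # (the paper uses hypergradients here; this sketch takes the direct gradient)
    dir_opt.zero_grad(); mag_opt.zero_grad()
    F.mse_loss(forward(x_va), y_va).backward()
    dir_opt.zero_grad()                                     # keep the validation gradient off A, B
    mag_opt.step()

# Final phase: freeze the magnitude, fine-tune the direction on combined data
m.requires_grad_(False)
x_all, y_all = torch.cat([x_tr, x_va]), torch.cat([y_tr, y_va])
for _ in range(steps):
    dir_opt.zero_grad()
    F.mse_loss(forward(x_all), y_all).backward()
    dir_opt.step()
```

In the package itself this loop is driven by `train_bidora` (see the Python API example below).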
Benefits:
- ✅ Reduces overfitting on small datasets (<10k samples)
- ✅ Better alignment with full fine-tuning (correlation: -8.042 vs. -1.784 for DoRA)
- ✅ Statistically significant improvements on GLUE (p < 0.001)
Important Notes:
- ⚠️ Training Time: 3-4x slower than standard LoRA due to bi-level optimization
- ⚠️ No Quantization: BiDoRA requires full precision (bfloat16); quantization is disabled automatically
- ⚠️ Memory: Uses the 8-bit AdamW optimizer (75% memory reduction) to compensate
- ✅ Best For: small, specialized datasets where quality > speed
- ✅ BiDoRA Bi-Level Optimization: True magnitude-direction decomposition
- ✅ Auto Hardware Detection: Automatically adapts the config to the available hardware (see the sketch after this list)
- ✅ Full Precision Training: Optimized for bfloat16 (no quantization needed for BiDoRA)
- ✅ Flexible Data Formats: JSONL, HuggingFace Datasets
- ✅ Type-Safe Config: Pydantic-validated configuration
- ✅ CLI Interface: Simple command-line interface built with Typer
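For illustration only, this is roughly the kind of probe an auto-hardware step relies on (a sketch using PyTorch's CUDA API; the threshold policy below is an assumption, not the package's actual logic):

```python
import torch

def detect_vram_gib() -> float:
    """Total VRAM of GPU 0 in GiB, or 0.0 if no CUDA device is available."""
    if not torch.cuda.is_available():
        return 0.0
    return torch.cuda.get_device_properties(0).total_memory / 1024**3

vram = detect_vram_gib()
# Hypothetical tiering, mirroring the hardware table later in this README
batch_size = 1 if vram < 12 else 2 if vram < 24 else 4
print(f"Detected {vram:.1f} GiB VRAM -> suggested batch_size={batch_size}")
```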
Install with uv (recommended) or pip:

# With uv (recommended)
uv add bidora

# With pip
pip install bidora

Development install from source:

git clone https://github.com/bjoernbethge/bidora.git
cd bidora
uv sync --dev

Show the available hardware and the recommended configuration:

bidora info

List the available models:

bidora list-models

Important: BiDoRA requires separate train and validation files for bi-level optimization.
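If you only have a single JSONL file, a few lines of standard-library Python are enough to split it (a sketch; the file names and the 90/10 split are assumptions):

```python
import json
import random
from pathlib import Path

random.seed(42)
samples = [json.loads(line) for line in Path("data/all.jsonl").read_text().splitlines() if line.strip()]
random.shuffle(samples)

split = int(0.9 * len(samples))  # 90% train / 10% validation
Path("data/train.jsonl").write_text("".join(json.dumps(s) + "\n" for s in samples[:split]))
Path("data/val.jsonl").write_text("".join(json.dumps(s) + "\n" for s in samples[split:]))
```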
bidora train \
--train-file data/train.jsonl \
--val-file data/val.jsonl \
--model Qwen/Qwen3-4B \
--output ./output \
--rank 8 \
--epochs 3

With custom learning rates for the two optimization levels:

bidora train \
--train-file data/train.jsonl \
--val-file data/val.jsonl \
--model Qwen/Qwen3-4B \
--lr 2e-4 \
--upper-lr-mult 2.0 \
--rank 8

Train on a HuggingFace dataset instead of local files:

bidora train \
--dataset "code_search_net" \
--model Qwen/Qwen3-8B \
--output ./output \
--rank 8

Supported JSONL data formats:

Instruction format:

{"instruction": "Generate a Rust function to create a 3D cube mesh", "output": "fn create_cube() -> Mesh { ... }"}
{"instruction": "Write Blender Python code to add a sphere", "input": "radius: 2.0", "output": "import bpy\nbpy.ops.mesh.primitive_uv_sphere_add(radius=2.0)"}

Prompt/completion format:

{"prompt": "// Generate 3D mesh\nfn create_mesh()", "completion": " -> Mesh {\n let vertices = vec![...];\n Mesh::new(vertices)\n}"}

Raw code format:

{"code": "use bevy::prelude::*;\n\nfn setup_3d_scene(mut commands: Commands) { ... }"}

Example: Qwen/Qwen3-4B with automatic hardware adaptation:

bidora train \
--train-file data/train.jsonl \
--val-file data/val.jsonl \
--model Qwen/Qwen3-4B \
--rank 4 \
--batch-size 1 \
--auto-hardware # Automatic adaptation

Config automatically adjusted:
- Precision: bfloat16 (full precision - BiDoRA requirement)
- Batch Size: 1-2
- Gradient Accumulation: 8-16
- Max Seq Length: 1024-2048
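These auto-adjusted values can also be pinned by hand through the Python API; a sketch (the `gradient_accumulation_steps` field name is an assumption; `batch_size` and `max_seq_length` appear in the API example further down):

```python
from bidora import TrainingConfig

# Roughly the small-GPU settings above, set explicitly
training = TrainingConfig(
    batch_size=1,
    gradient_accumulation_steps=16,  # assumed field name
    max_seq_length=1024,
)
```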
Example: Qwen/Qwen3-8B on a desktop-class GPU:

bidora train \
--train-file data/train.jsonl \
--val-file data/val.jsonl \
--model Qwen/Qwen3-8B \
--rank 16 \
--batch-size 2 \
--auto-hardware

Auto-Config:
- Precision: bfloat16 (full precision - BiDoRA requirement)
- Batch Size: 2-4
- Gradient Accumulation: 4-8
- Max Seq Length: 2048
Example: Qwen/Qwen3-32B on an A100:

bidora train \
--train-file data/train.jsonl \
--val-file data/val.jsonl \
--model Qwen/Qwen3-32B \
--rank 16 \
--batch-size 8 \
--auto-hardware

Auto-Config:
- Precision: bfloat16 (full precision - BiDoRA requirement)
- Batch Size: 4-8
- Gradient Accumulation: 2-4
- Max Seq Length: 4096
bidora train --help

Most Important Parameters:

| Parameter | Description | Default |
|---|---|---|
| `--model`, `-m` | Model name or path | Qwen/Qwen3-4B |
| `--train-file`, `-t` | Training JSONL | Required |
| `--val-file`, `-v` | Validation JSONL | Required for BiDoRA |
| `--dataset`, `-d` | HuggingFace Dataset | - |
| `--output`, `-o` | Output directory | ./output |
| `--rank`, `-r` | LoRA Rank | 8 |
| `--epochs`, `-e` | Training Epochs | 3 |
| `--batch-size`, `-b` | Batch Size | 4 |
| `--lr` | Learning Rate (lower level) | 2e-4 |
| `--upper-lr-mult` | Upper-level LR multiplier | 2.0 |
| `--max-samples` | Max Training Samples | All |
| `--auto-hardware` | Auto-adjustment | True |
bidora train \
--train-file data/train.jsonl \
--val-file data/val.jsonl \
--model Qwen/Qwen3-8B \
--rank 16 \
--batch-size 8 \
--lr 3e-4 \
--epochs 5 \
--no-auto-hardware # Manual config

Supported models and VRAM requirements:

| Model | Parameters | VRAM (bf16) | Training VRAM | Recommended For |
|---|---|---|---|---|
| Qwen3-0.6B | 0.6B | ~2GB | ~6GB | Laptop GPU (6-8GB) |
| Qwen3-1.7B | 1.7B | ~4GB | ~10GB | Laptop GPU (8GB+) |
| Qwen3-4B | 4B | ~8GB | ~16GB | Desktop GPU (12-16GB) |
| Qwen3-8B | 8B | ~16GB | ~24GB | Desktop GPU (24GB+) / A100 |
| Qwen3-14B | 14B | ~28GB | ~40GB | A100 (40GB) |
| Qwen3-32B | 32B | ~64GB | ~80GB | A100 (80GB) |
💡 Memory Optimization: Uses 8-bit AdamW optimizer (75% memory reduction) to compensate for the full-precision requirement.
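The 8-bit optimizer corresponds to bitsandbytes' AdamW8bit; a minimal sketch of how such an optimizer is constructed (illustrative, not the package's exact training setup):

```python
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(4096, 4096)  # stand-in for the prepared BiDoRA model

# 8-bit AdamW keeps its optimizer state quantized, cutting optimizer memory by roughly 75%
optimizer = bnb.optim.AdamW8bit(
    [p for p in model.parameters() if p.requires_grad],
    lr=2e-4,
    weight_decay=0.01,
)
```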
Trainable adapter parameters:

| Base Model | LoRA Params | Reduction |
|---|---|---|
| 7B | ~2M | 3500× |
| 14B | ~4M | 3500× |
| 32B | ~8M | 4000× |
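The adapter size follows from LoRA's factorization: each adapted weight matrix of shape (d_out, d_in) contributes rank · (d_out + d_in) trainable parameters, so the totals above depend on the rank and on which modules are targeted. A small estimator (the 4096×4096 shape is an assumed example, not a specific checkpoint):

```python
def lora_param_count(rank: int, shapes: list[tuple[int, int]]) -> int:
    """Trainable LoRA parameters: rank * (d_out + d_in) summed over the adapted matrices."""
    return sum(rank * (d_out + d_in) for d_out, d_in in shapes)

# Example: one 4096x4096 projection at rank 8 -> 65,536 parameters per adapted matrix
print(lora_param_count(8, [(4096, 4096)]))
```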
# data/rust_3d_train.jsonl
{"instruction": "Create a three-rs mesh for a cube", "output": "use three::*;\n\nfn create_cube(size: f32) -> Mesh {\n let geometry = Geometry::cuboid(size, size, size);\n Mesh::new(geometry, Material::default())\n}"}
{"instruction": "Generate Bevy 3D scene setup", "output": "use bevy::prelude::*;\n\nfn setup(mut commands: Commands) {\n commands.spawn(Camera3dBundle::default());\n commands.spawn(PbrBundle {\n mesh: meshes.add(Mesh::from(shape::Cube { size: 1.0 })),\n ..default()\n });\n}"}bidora train \
--train-file data/rust_3d_train.jsonl \
--val-file data/rust_3d_val.jsonl \
--model Qwen/Qwen3-4B \
--output ./rust_3d_model \
--rank 8 \
--epochs 3 \
--batch-size 2

Use the fine-tuned model for inference:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# Load base model with BiDoRA adapters
model = AutoModelForCausalLM.from_pretrained(
"./rust_3d_model/final_model",
device_map="auto",
torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")
# Generate
prompt = "### Instruction:\nCreate a three-rs function to render a sphere\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Using the Python API directly:

from bidora import (
FullConfig, ModelConfig, BiDoRAConfig, TrainingConfig, DataConfig,
load_model_and_tokenizer, prepare_bidora_model,
load_and_prepare_dataset, prepare_dataset_for_training,
train_bidora
)
from pathlib import Path
# Create config
config = FullConfig(
model=ModelConfig(
model_name="Qwen/Qwen3-4B",
quantization="none" # BiDoRA requires full precision (bfloat16)
),
bidora=BiDoRAConfig(
rank=8,
use_bidora=True, # Enable BiDoRA bi-level optimization
upper_lr_multiplier=2.0
),
training=TrainingConfig(
batch_size=2,
learning_rate=2e-4,
num_epochs=3
),
data=DataConfig(
train_file=Path("data/train.jsonl"),
val_file=Path("data/val.jsonl") # Required for BiDoRA
),
output_dir=Path("./output")
)
# Auto-adjust for hardware (will keep full precision for BiDoRA)
config.auto_adjust_for_hardware()
# Load model with BiDoRA layers
model, tokenizer = load_model_and_tokenizer(config.model)
model = prepare_bidora_model(model, config.bidora, quantized=False)
# Load data
dataset = load_and_prepare_dataset(config.data)
tokenized_dataset = prepare_dataset_for_training(
dataset, tokenizer, config.training.max_seq_length
)
# Train with bi-level optimization
trainer = train_bidora(model, tokenizer, tokenized_dataset, config)

Out-of-memory (OOM) errors:

# Reduce batch size
bidora train --batch-size 1 ...
# Or use smaller model
bidora train --model Qwen/Qwen3-1.7B ...
# Note: BiDoRA cannot use quantization (requires full precision)

If Flash Attention 2 is not available:
- It is disabled automatically
- Or disable it manually by setting `use_flash_attention=False` in `ModelConfig`
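For example (a sketch reusing the config classes from the Python API section above):

```python
from bidora import ModelConfig

# Disable Flash Attention 2 explicitly, e.g. if it is not installed
model_cfg = ModelConfig(
    model_name="Qwen/Qwen3-4B",
    use_flash_attention=False,
)
```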
# Reinstall dependencies
uv pip install --force-reinstall transformers accelerate peft bitsandbytes

References:

- BiDoRA Paper - Original bi-level optimization paper
- LoRA Paper - Low-Rank Adaptation
- DoRA Paper - Weight-Decomposed LoRA
- Qwen3 Models - HuggingFace model collection
If you use BiDoRA in your research, please cite:
@article{liu2024bidora,
title={BiDoRA: Bi-level Optimization-Based Weight-Decomposed Low-Rank Adaptation},
author={Liu, Peiran and Wang, Luning and Sun, Yanchao and Tang, Zhongwei and Xu, Dawei and Li, Jiaxi and Xu, Zhili},
journal={arXiv preprint arXiv:2410.09758},
year={2024}
}

MIT License - see LICENSE file.