Authors: Sepehr Heydarian, Rongze (Archer) Liu, Elshaday Yoseph, Tien Nguyen
This project implements an intelligent, semi-automated data pipeline for improving a wildfire object detection model. The system is designed to continuously ingest unlabelled images, generate initial annotations using AI models, refine them through human-in-the-loop review, and retrain the base model. The pipeline also includes model compression steps (e.g. distillation and quantization) to prepare models for deployment on edge devices. For more technical details, refer to the full documentation.
Manual labeling of wildfire imagery is time-consuming and error-prone. In addition, models degrade over time as environmental conditions and data distributions shift. Our system aims to continuously learn from new data using a scalable, semi-supervised approach. It automates as much of the machine learning workflow as possible and involves human review only when necessary.
- Automated pre-labeling using YOLOv8 and Grounding DINO
- Model matching and validation using IoU and confidence thresholds (see the IoU sketch after this list)
- Human-in-the-loop review for mismatches via Label Studio
- Image augmentation to improve generalization
- End-to-end training, distillation, and quantization
- CI/CD/CT-compatible design for regular updates and retraining
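For intuition on the matching step above: agreement between the YOLOv8 and Grounding DINO predictions is typically measured with intersection-over-union (IoU). Below is a minimal, self-contained sketch of that computation; the function name and the `(x1, y1, x2, y2)` box format are illustrative, not the pipeline's actual API.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) pixel format."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# Two near-identical detections agree; a pair scoring below the configured
# threshold would instead be routed to human review.
print(iou((10, 10, 50, 50), (12, 12, 52, 52)))  # ~0.82
```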
This guide will walk you through setting up and running the AutoML CI/CD/CT: Continuous Training and Deployment Pipeline project.
```bash
git clone https://github.com/Capstone-AutoML/AutoML_Capstone.git
cd AutoML_Capstone
```

Control which pipeline steps to run via `pipeline_config.json`:
```json
// Set to true to skip a step
"process_options": {
  "skip_human_review": false,
  "skip_training": false,
  "skip_distillation": false,
  "skip_quantization": false
}
```
Important: Docker cannot handle interactive Label Studio sessions for human review. Before running with Docker, you must disable human review in `automl_workspace/config/pipeline_config.json`:
"process_options": {
"skip_human_review": true
}You can simply run:
docker compose upThis command will:
- Download the necessary datasets and models on the first run (subsequent runs reuse them unless `automl_workspace/data_pipeline/`, `automl_workspace/data_pipeline/distillation/`, or `automl_workspace/model_registry/model/` are removed).
- Automatically use your GPU if the following key is updated in both `automl_workspace/config/train_config.json` and `automl_workspace/config/pipeline_config.json`:

```json
"torch_device": "cuda"
```

The default is `"cpu"`, which forces CPU-only execution.
If you want to run the auto-labeling part of the pipeline separately, do:

```bash
docker compose run auto_labeling
```

This step should always come first.
Then, to run the augmentation, training, and compression steps, use:

```bash
docker compose run train_compress
```

Before running, replace your `docker-compose.yaml` file with:
```yaml
services:
  capstone:
    image: celt313/automl_capstone:v0.0.3
    ipc: host
    platform: linux/x86_64
    container_name: automl_capstone
    working_dir: /app
    entrypoint: bash
    command: -c "source activate capstone_env && ./fetch_dataset.sh && python src/main.py"
    volumes:
      - .:/app

  generate_box:
    image: celt313/automl_capstone:v0.0.3
    ipc: host
    platform: linux/x86_64
    profiles: ["optional"]
    entrypoint: bash
    command: -c "source activate capstone_env && python src/generate_boxed_images.py"
    volumes:
      - .:/app

  auto_labeling:
    image: celt313/automl_capstone:v0.0.3
    ipc: host
    platform: linux/x86_64
    profiles: ["optional"]
    entrypoint: bash
    command: -c "source activate capstone_env && ./fetch_dataset.sh && python src/label_main.py"
    volumes:
      - .:/app

  train_compress:
    image: celt313/automl_capstone:v0.0.3
    ipc: host
    platform: linux/x86_64
    profiles: ["optional"]
    entrypoint: bash
    command: -c "source activate capstone_env && python src/train_compress.py"
    volumes:
      - .:/app

  test:
    image: celt313/automl_capstone:v0.0.3
    ipc: host
    platform: linux/x86_64
    profiles: ["optional"]
    entrypoint: bash
    command: -c "source activate capstone_env && pytest tests/"
    volumes:
      - .:/app
```

Then run:

```bash
docker compose up
```

to run the entire pipeline.
If you want to run the auto-labeling part of the pipeline separately, do:

```bash
docker compose run auto_labeling
```

This step should always come first.
Then, to run the augmentation, training, and compression steps, use:

```bash
docker compose run train_compress
```

To verify the setup and run unit tests:

```bash
docker compose run test
```

To run the script that overlays bounding boxes on sample and labeled images using predictions from YOLO, DINO, and mismatched sources:
```bash
docker compose run generate_box
```

This will:
- Sample and draw 10 images each from the YOLO, DINO, and mismatched directories.
- Draw bounding boxes on all images from the labeled directory.
- Save the visualized outputs under `automl_workspace/data_pipeline/boxed_images/`.
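For intuition, the overlay step amounts to reading each image's label file and drawing the corresponding rectangles. A minimal sketch, assuming YOLO-format labels (normalized `class x_center y_center width height`) and hypothetical file names; this is not the pipeline's actual implementation:

```python
import cv2

image = cv2.imread("sample.jpg")   # hypothetical input image
h, w = image.shape[:2]

with open("sample.txt") as f:      # matching YOLO-format label file
    for line in f:
        _cls, xc, yc, bw, bh = map(float, line.split())
        x1, y1 = int((xc - bw / 2) * w), int((yc - bh / 2) * h)
        x2, y2 = int((xc + bw / 2) * w), int((yc + bh / 2) * h)
        cv2.rectangle(image, (x1, y1), (x2, y2), (0, 0, 255), 2)

cv2.imwrite("boxed_sample.jpg", image)
```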
For human-in-the-loop validation using Label Studio, refer to the Human Intervention documentation.
To continue development based on the current project setup, follow the steps below using the provided conda environment.
Before getting started, ensure you have the following installed:
- Conda or Miniconda - For environment management
1. Clone the repository:

```bash
git clone https://github.com/Capstone-AutoML/AutoML_Capstone.git
cd AutoML_Capstone
```

2. Set up environments:
For Full Pipeline (includes pre-labeling, training, distillation, and quantization):

```bash
conda env create -f environment.yml
conda activate capstone_env

# Install GroundingDINO (required for the full pipeline)
# To keep your workspace clean, it's recommended to clone the repository outside the main project directory
git clone https://github.com/IDEA-Research/GroundingDINO.git
cd GroundingDINO
pip install .
```

For Human Review Only:
```bash
conda env create -f human_review_env.yml
conda activate human_review_env
```

Note: Both environments may be needed depending on your workflow, since human review is integrated into the main pipeline.
3. GPU Support (Optional):

```bash
# Activate the full pipeline environment
conda activate capstone_env

# Check CUDA version
nvcc -V

# Install GPU PyTorch (example for CUDA 12.4)
pip uninstall torch torchvision
pip install torch==2.5.1+cu124 torchvision==0.20.1+cu124 --index-url https://download.pytorch.org/whl/cu124
```
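After reinstalling, you can confirm that PyTorch detects the GPU with a standard check (generic PyTorch usage, not project-specific):

```python
import torch

# Should print the installed version (e.g. 2.5.1+cu124) and True on a working GPU setup
print(torch.__version__, torch.cuda.is_available())
```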
⚠️ Compatibility Note: PyTorch 2.6.0 has known compatibility issues with GroundingDINO and Ultralytics. PyTorch 2.5.1 is recommended as shown above. If you need to use PyTorch 2.6.0 or higher, please refer to the GroundingDINO issue.
Before running the pipeline, you can customize the behavior by modifying the configuration files in the `automl_workspace/config/` directory:

- `pipeline_config.json` - Main pipeline settings (thresholds, augmentation, distillation parameters)
- `augmentation_config.json` - Data augmentation settings (seed, number of augmentations, etc.)
- `train_config.json` - Model training configuration (epochs, batch size, learning rate, etc.; see the sketch below)
- `distillation_config.yaml` - Distillation settings (model paths, epochs, patience, etc.)
- `quantize_config.json` - Model quantization settings (labeled images paths, quantization method, etc.)
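As an illustration of the kind of settings involved, a training configuration might look like the following. The field names here (other than `torch_device`, which appears in the Docker section above) are hypothetical; consult the actual `train_config.json` in the repository for the real keys:

```json
// Hypothetical example — check the repo's train_config.json for the actual keys
{
  "epochs": 50,
  "batch_size": 16,
  "learning_rate": 0.001,
  "torch_device": "cuda"
}
```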
⚠️ Compatibility Note: Due to ongoing compatibility issues between required packages (such as `imx500-converter`, `uni-pytorch`, and `model-compression-toolkit`), we are currently unable to support IMX quantization in this pipeline. The default option in `quantize_config.json` is `FP16`. If you require IMX quantization, you may need to experiment with manual package pinning or use a separate, isolated environment. Refer to Sony IMX500 Export for Ultralytics YOLO11 and the Raspberry Pi AI Camera IMX500 Converter User Manual for future development.
If you want to use your own dataset as input to the pipeline, create a folder structured as `automl_workspace/data_pipeline/input/` and place your images inside it, for example as shown below.
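Staging a folder of images could look like this (the source path is hypothetical):

```python
from pathlib import Path
import shutil

# Hypothetical source folder containing your own wildfire images
source = Path("/path/to/your/images")
input_dir = Path("automl_workspace/data_pipeline/input")
input_dir.mkdir(parents=True, exist_ok=True)

for img in source.glob("*.jpg"):
    shutil.copy(img, input_dir / img.name)
```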
The distillation dataset is a subset of labeled images used to train the student model. It is a folder containing the train images/labels and validation images/labels, with the following name and structure:

```
distillation_dataset/
  train/
    images/
    labels/
  val/
    images/
    labels/
```

It is currently assumed to be located in the `automl_workspace/data_pipeline/distillation` directory. When a new custom distillation dataset is provided, overwrite the `distillation_dataset` attribute in `distillation_config.yaml` with either the relative or absolute path to the directory of the new custom distillation dataset.
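For example, the override in `distillation_config.yaml` would look like this (the path shown is a placeholder):

```yaml
# Point the pipeline at a custom distillation dataset (placeholder path)
distillation_dataset: /path/to/my_distillation_dataset
```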
To run the full pipeline locally:

```bash
python src/main.py
```

To run the human intervention step on its own:

```bash
python src/pipeline/human_intervention.py
```

To generate boxed images:

```bash
python src/generate_boxed_images.py
```

Encountering issues? Need assistance? For any questions regarding this pipeline, please open an issue in the GitHub repository.