Training for RF-DETR

This repository contains the files needed to train the RF-DETR model on AWS SageMaker or locally.

Video Processing Demo: an example processed video showing RF-DETR detections with tracking.

File Structure

aws/
├── src/                          # Python source files
│   ├── train_rfdetr_aws.py      # SageMaker training script
│   ├── train_rfdetr_local.py    # Local training/data prep script
│   └── submit_sagemaker_job.py  # Helper to submit training jobs
├── scripts/                      # Shell scripts
│   ├── build_docker.sh          # Build Docker image
│   ├── example_usage.sh         # Main entry point
│   ├── setup_sagemaker_role.sh  # Create IAM role
│   ├── fast_s3_upload.sh        # Fast S3 upload
│   ├── test_docker_local.sh     # Local GPU testing
│   └── process_video.sh         # Process video with model
├── Dockerfile                    # Container definition
├── requirements.txt              # Python dependencies
├── README.md                     # This file
└── QUICKSTART.md                 # Quick start guide

Prerequisites

  1. AWS Account with SageMaker access
  2. AWS CLI configured with appropriate credentials
  3. Docker installed locally (for building images)
  4. Python packages:
    pip install boto3 sagemaker

Setup

1. Create SageMaker Execution Role

The SageMaker execution role needs permissions to:

  • Access S3 buckets (for data and model artifacts)
  • Access ECR (for Docker images)
  • Create and manage SageMaker training jobs

The repository includes scripts/setup_sagemaker_role.sh to create this role for you. Alternatively, create a role with the AmazonSageMakerFullAccess managed policy, or define a custom policy with only the required permissions.

# Get your role ARN
aws iam list-roles | grep SageMaker
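
If you prefer to create the role programmatically instead of using the provided script, a minimal boto3 sketch (the role name below is a placeholder) could look like this:

import json
import boto3

iam = boto3.client('iam')

# Trust policy that allows SageMaker to assume the role
trust_policy = {
    'Version': '2012-10-17',
    'Statement': [{
        'Effect': 'Allow',
        'Principal': {'Service': 'sagemaker.amazonaws.com'},
        'Action': 'sts:AssumeRole',
    }],
}

role = iam.create_role(
    RoleName='rfdetr-sagemaker-role',  # placeholder name
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
iam.attach_role_policy(
    RoleName='rfdetr-sagemaker-role',
    PolicyArn='arn:aws:iam::aws:policy/AmazonSageMakerFullAccess',
)
print(role['Role']['Arn'])  # pass this ARN as --role-arn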

2. Prepare Training Data

Your training data should be in the COCO format with the following structure:

dataset/
├── train/
│   ├── _annotations.coco.json
│   ├── image1.jpg
│   ├── image2.jpg
│   └── ...
├── valid/
│   ├── _annotations.coco.json
│   ├── image1.jpg
│   └── ...
└── test/ (optional)
    ├── _annotations.coco.json
    └── ...
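
Each _annotations.coco.json follows the standard COCO object-detection schema. For reference, a minimal annotations file (with illustrative values) can be generated like this:

import json

# Minimal COCO detection annotations (illustrative values only)
coco = {
    'images': [
        {'id': 1, 'file_name': 'image1.jpg', 'width': 640, 'height': 480},
    ],
    'categories': [
        {'id': 1, 'name': 'object', 'supercategory': 'none'},
    ],
    'annotations': [
        {
            'id': 1,
            'image_id': 1,
            'category_id': 1,
            'bbox': [100, 120, 50, 40],  # [x, y, width, height] in pixels
            'area': 2000,
            'iscrowd': 0,
        },
    ],
}

with open('dataset/train/_annotations.coco.json', 'w') as f:
    json.dump(coco, f, indent=2)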

Usage

Option 1: Quick Start (Automated)

Use the example_usage.sh script to handle everything:

# First, configure the script with your AWS details
cp scripts/example_usage.sh.template scripts/example_usage.sh
nano scripts/example_usage.sh  # Update with your AWS account info

# Then run it
./scripts/example_usage.sh

Or call the Python script directly:

python src/submit_sagemaker_job.py \
  --role-arn arn:aws:iam::YOUR_ACCOUNT_ID:role/YOUR_SAGEMAKER_ROLE \
  --s3-bucket your-sagemaker-bucket \
  --local-data-path test_data/sample \
  --epochs 50 \
  --batch-size 8 \
  --model-size large \
  --instance-type ml.p3.2xlarge \
  --volume-size 100

Note: A sample dataset is provided in test_data/sample/ for testing. For production, use your own dataset.

This will:

  1. Build the Docker image
  2. Push it to ECR
  3. Upload your dataset to S3
  4. Submit the SageMaker training job

Option 2: Manual Steps

Step 1: Build and Push Docker Image

# Get your AWS account ID and region
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
REGION=$(aws configure get region)
IMAGE_NAME=rfdetr-sagemaker-training
IMAGE_TAG=latest

# Build the image
docker build -t ${IMAGE_NAME}:${IMAGE_TAG} -f Dockerfile .

# Create ECR repository (if it doesn't exist)
aws ecr create-repository --repository-name ${IMAGE_NAME} --region ${REGION} || true

# Login to ECR
aws ecr get-login-password --region ${REGION} | docker login --username AWS --password-stdin ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com

# Tag and push
docker tag ${IMAGE_NAME}:${IMAGE_TAG} ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/${IMAGE_NAME}:${IMAGE_TAG}
docker push ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/${IMAGE_NAME}:${IMAGE_TAG}

Step 2: Upload Data to S3

aws s3 sync /path/to/your/dataset s3://your-bucket/rfdetr/training-data/
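
The repository also includes scripts/fast_s3_upload.sh (see File Structure above) as a faster alternative for large datasets.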

Step 3: Submit Training Job Using Python

from sagemaker.estimator import Estimator

# Configuration
role_arn = 'arn:aws:iam::YOUR_ACCOUNT_ID:role/YOUR_SAGEMAKER_ROLE'
image_uri = 'YOUR_ACCOUNT_ID.dkr.ecr.REGION.amazonaws.com/rfdetr-sagemaker-training:latest'
s3_data_path = 's3://your-bucket/rfdetr/training-data/'
s3_output_path = 's3://your-bucket/rfdetr/output/'

# Hyperparameters
hyperparameters = {
    'epochs': 50,
    'batch-size': 8,
    'grad-accum-steps': 2,
    'model-size': 'large'
}

# Create estimator
estimator = Estimator(
    image_uri=image_uri,
    role=role_arn,
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    output_path=s3_output_path,
    hyperparameters=hyperparameters,
    max_run=86400,  # 24 hours
    volume_size=100  # GB
)

# Start training
estimator.fit({'training': s3_data_path})
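
When the job starts, SageMaker downloads the 'training' channel to /opt/ml/input/data/training inside the container (the same path mounted in the local tests below), and after training it packages everything written to /opt/ml/model into model.tar.gz at the S3 output path.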

SageMaker Instance Types

Recommended instance types for training:

  • ml.p3.2xlarge - 1x V100 GPU, 8 vCPUs, 61 GB RAM (~$3.83/hr)
  • ml.p3.8xlarge - 4x V100 GPUs, 32 vCPUs, 244 GB RAM (~$14.69/hr)
  • ml.p3.16xlarge - 8x V100 GPUs, 64 vCPUs, 488 GB RAM (~$28.15/hr)
  • ml.g4dn.xlarge - 1x T4 GPU, 4 vCPUs, 16 GB RAM (~$0.71/hr) - for testing
  • ml.g5.xlarge - 1x A10G GPU, 4 vCPUs, 16 GB RAM (~$1.41/hr)

For cost savings, consider Managed Spot Training, which can reduce costs by up to 90% (see Cost Optimization below).

Hyperparameters

Available hyperparameters:

Parameter          Type    Default    Description
epochs             int     50         Number of training epochs
batch-size         int     8          Batch size per device
grad-accum-steps   int     2          Gradient accumulation steps
model-size         str     'large'    Model size: 'large' or 'medium'
learning-rate      float   None       Learning rate (uses the model default if not set)
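
For example, to train a medium model with an explicit learning rate (values are illustrative), pass:

hyperparameters = {
    'epochs': 50,
    'batch-size': 8,
    'grad-accum-steps': 2,
    'model-size': 'medium',
    'learning-rate': 1e-4,  # omit this key to use the model default
}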

Configuration Parameters

Additional configuration options for submit_sagemaker_job.py:

Parameter          Type   Default           Description
--volume-size      int    100               EBS volume size in GB for the training instance
--instance-type    str    'ml.p3.2xlarge'   SageMaker instance type
--instance-count   int    1                 Number of instances for distributed training
--max-run          int    86400             Maximum training time in seconds (24 hours)
--job-name         str    auto              Training job name (auto-generated if not specified)

Monitoring

View Training Progress

  1. AWS Console:

    • Navigate to SageMaker → Training jobs
    • Click on your job name
    • View metrics and logs
  2. AWS CLI:

    aws sagemaker describe-training-job --training-job-name YOUR_JOB_NAME
  3. CloudWatch Logs:

    aws logs tail /aws/sagemaker/TrainingJobs --follow --filter-pattern YOUR_JOB_NAME
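
If you prefer Python, the same status check works through boto3 (the job name is a placeholder):

import boto3

sm = boto3.client('sagemaker')
job = sm.describe_training_job(TrainingJobName='YOUR_JOB_NAME')

# e.g. 'InProgress' / 'Completed' / 'Failed', plus a finer-grained secondary status
print(job['TrainingJobStatus'], job['SecondaryStatus'])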

Download Model Artifacts

After training completes:

aws s3 cp s3://your-bucket/rfdetr/output/YOUR_JOB_NAME/output/model.tar.gz .
tar -xzf model.tar.gz

Local Testing

Before deploying to SageMaker (and incurring costs), test your Docker container locally:

GPU Training Test (GPU required)

To test training locally with a GPU:

cd aws/
./scripts/test_docker_local.sh rfdetr-sagemaker-training latest test_data/sample

This will:

  • Run training for 2 epochs with small batch size
  • Use your local GPU
  • Save outputs to /tmp/sagemaker-local/
  • Simulate SageMaker's directory structure

Requirements:

  • NVIDIA Docker runtime installed
  • GPU available on host

Manual Docker Testing

You can also run the container manually:

# Build the image
./scripts/build_docker.sh

# Run interactively to debug (using sample data from repo)
docker run --rm -it \
  -v $(pwd)/test_data/sample:/opt/ml/input/data/training \
  -v /tmp/output:/opt/ml/output \
  -v /tmp/model:/opt/ml/model \
  --gpus all \
  rfdetr-sagemaker-training:latest \
  bash

# Inside the container, you can:
# - Check Python version: python --version
# - Verify imports: python -c "import torch, rfdetr, transformers"
# - Inspect data: ls /opt/ml/input/data/training
# - Run training: python /opt/ml/code/train.py --epochs 1 --batch-size 2

Test Without GPU

If you don't have a GPU locally, you can still test the container build and imports:

docker run --rm rfdetr-sagemaker-training:latest \
  python -c "import torch, rfdetr; print('Success!')"

Cost Optimization

  1. Use Spot Instances: In submit_sagemaker_job.py, pass the spot options to the estimator (see the sketch after this list):

    use_spot_instances=True,
    max_wait=90000  # Total wait time in seconds (capacity wait + training); must be >= max_run
  2. Use Smaller Instances for Testing: Start with ml.g4dn.xlarge for debugging

  3. Checkpoint Regularly: Modify the training script to save checkpoints to S3

  4. Set Maximum Runtime: Use max_run parameter to avoid runaway costs
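
Putting items 1, 3, and 4 together, a spot-enabled estimator might look like this sketch (reusing the variables from Step 3 above; the checkpoint prefix is a placeholder, and the training script must write checkpoints to /opt/ml/checkpoints for SageMaker to sync them):

from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri=image_uri,
    role=role_arn,
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    output_path=s3_output_path,
    hyperparameters=hyperparameters,
    use_spot_instances=True,  # item 1: managed spot training
    max_run=86400,            # item 4: hard cap on training time
    max_wait=90000,           # must be >= max_run; includes time spent waiting for capacity
    checkpoint_s3_uri='s3://your-bucket/rfdetr/checkpoints/',  # item 3: synced from /opt/ml/checkpoints
)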

Troubleshooting

Container Fails to Start

Check CloudWatch logs:

aws logs tail /aws/sagemaker/TrainingJobs --follow

Common issues:

  • Missing dependencies in Dockerfile
  • Incorrect Python path
  • Memory issues

Training Fails

  1. Check the error in CloudWatch logs
  2. Verify the data format matches the expected COCO format (see the sanity-check sketch after this list)
  3. Ensure sufficient instance memory for batch size
  4. Check S3 permissions for the execution role
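
A quick local sanity check of the dataset layout and annotation keys (the dataset path is illustrative) can catch format problems before a job is submitted:

import json
import os

dataset_root = 'dataset'  # illustrative path; point this at your dataset

for split in ('train', 'valid'):
    ann_path = os.path.join(dataset_root, split, '_annotations.coco.json')
    assert os.path.exists(ann_path), f'missing {ann_path}'
    with open(ann_path) as f:
        coco = json.load(f)
    # The standard COCO detection keys
    for key in ('images', 'annotations', 'categories'):
        assert key in coco, f"{split}: missing '{key}' key"
    # Every referenced image file should exist next to the annotations
    for img in coco['images']:
        img_path = os.path.join(dataset_root, split, img['file_name'])
        assert os.path.exists(img_path), f'{split}: missing {img_path}'
    print(f"{split}: {len(coco['images'])} images, {len(coco['annotations'])} annotations")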

Out of Memory (OOM)

  • Reduce batch-size
  • Increase grad-accum-steps (see the example after this list)
  • Use a larger instance type
  • Use model-size: 'medium' instead of 'large'
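
Because the effective batch size is batch-size × grad-accum-steps, you can trade the two off to cut peak GPU memory without changing the optimization behavior much. For example (values are illustrative):

hyperparameters = {
    'batch-size': 4,        # halved from the default of 8
    'grad-accum-steps': 4,  # doubled from the default of 2; effective batch stays 16
    'model-size': 'medium', # smaller model as a further fallback
}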

Support

For issues specific to:

  • SageMaker: Check AWS SageMaker Documentation
  • RF-DETR: Check the rfdetr package documentation
  • This Implementation: Open an issue in the project repository
