This directory contains the necessary files to train the RF-DETR model on AWS SageMaker.
Example of a processed video with RF-DETR detections and tracking.
aws/
├── src/                          # Python source files
│   ├── train_rfdetr_aws.py       # SageMaker training script
│   ├── train_rfdetr_local.py     # Local training/data prep script
│   └── submit_sagemaker_job.py   # Helper to submit training jobs
├── scripts/                      # Shell scripts
│   ├── build_docker.sh           # Build Docker image
│   ├── example_usage.sh          # Main entry point
│   ├── setup_sagemaker_role.sh   # Create IAM role
│   ├── fast_s3_upload.sh         # Fast S3 upload
│   ├── test_docker_local.sh      # Local GPU testing
│   └── process_video.sh          # Process video with model
├── Dockerfile                    # Container definition
├── requirements.txt              # Python dependencies
├── README.md                     # This file
└── QUICKSTART.md                 # Quick start guide
- AWS Account with SageMaker access
- AWS CLI configured with appropriate credentials
- Docker installed locally (for building images)
- Python packages:
pip install boto3 sagemaker
The SageMaker execution role needs permissions to:
- Access S3 buckets (for data and model artifacts)
- Access ECR (for Docker images)
- Create and manage SageMaker training jobs
Create a role with the AmazonSageMakerFullAccess managed policy, or create a custom policy with the required permissions.
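If you prefer to create the role programmatically rather than in the console, here is a minimal boto3 sketch (the role name `rfdetr-sagemaker-role` is illustrative; `scripts/setup_sagemaker_role.sh` covers the same ground from the shell). It prints the role ARN, which you can also look up with the CLI as shown below:

```python
import json
import boto3

iam = boto3.client("iam")

# Trust policy that lets the SageMaker service assume this role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "sagemaker.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

role = iam.create_role(
    RoleName="rfdetr-sagemaker-role",  # illustrative name
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Broad managed policy; swap in a scoped custom policy for production
iam.attach_role_policy(
    RoleName="rfdetr-sagemaker-role",
    PolicyArn="arn:aws:iam::aws:policy/AmazonSageMakerFullAccess",
)

print("Role ARN:", role["Role"]["Arn"])
```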
# Get your role ARN
aws iam list-roles | grep SageMaker

Your training data should be in the COCO format with the following structure:
dataset/
├── train/
│ ├── _annotations.coco.json
│ ├── image1.jpg
│ ├── image2.jpg
│ └── ...
├── valid/
│ ├── _annotations.coco.json
│ ├── image1.jpg
│ └── ...
└── test/ (optional)
├── _annotations.coco.json
└── ...
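Before uploading, it is worth sanity-checking the annotation files so a malformed dataset doesn't fail a paid training job. A minimal sketch, assuming the standard COCO top-level keys and the `dataset/` layout shown above:

```python
import json
from pathlib import Path

def check_coco_split(split_dir: str) -> None:
    """Verify a split has a COCO annotation file and that the referenced images exist."""
    ann_path = Path(split_dir) / "_annotations.coco.json"
    coco = json.loads(ann_path.read_text())

    # Standard top-level COCO keys
    for key in ("images", "annotations", "categories"):
        assert key in coco, f"{ann_path} is missing '{key}'"

    missing = [img["file_name"] for img in coco["images"]
               if not (Path(split_dir) / img["file_name"]).exists()]
    print(f"{split_dir}: {len(coco['images'])} images, "
          f"{len(coco['annotations'])} annotations, "
          f"{len(coco['categories'])} categories, {len(missing)} missing files")

for split in ("train", "valid"):
    check_coco_split(f"dataset/{split}")
```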
Use the example_usage.sh script to handle everything:
# First, configure the script with your AWS details
cp scripts/example_usage.sh.template scripts/example_usage.sh
nano scripts/example_usage.sh # Update with your AWS account info
# Then run it
./scripts/example_usage.sh

Or call the Python script directly:
python src/submit_sagemaker_job.py \
--role-arn arn:aws:iam::YOUR_ACCOUNT_ID:role/YOUR_SAGEMAKER_ROLE \
--s3-bucket your-sagemaker-bucket \
--local-data-path test_data/sample \
--epochs 50 \
--batch-size 8 \
--model-size large \
--instance-type ml.p3.2xlarge \
--volume-size 100

Note: A sample dataset is provided in test_data/sample/ for testing. For production, use your own dataset.
This will:
- Build the Docker image
- Push it to ECR
- Upload your dataset to S3
- Submit the SageMaker training job (you can confirm the submission as shown below)
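Once the job is submitted, you can confirm it and check its status from Python. A small boto3 sketch (not part of the provided scripts):

```python
import boto3

sm = boto3.client("sagemaker")

# Show the five most recently created training jobs and their status
response = sm.list_training_jobs(SortBy="CreationTime", SortOrder="Descending", MaxResults=5)
for job in response["TrainingJobSummaries"]:
    print(job["TrainingJobName"], job["TrainingJobStatus"])
```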
# Get your AWS account ID and region
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
REGION=$(aws configure get region)
IMAGE_NAME=rfdetr-sagemaker-training
IMAGE_TAG=latest
# Build the image
docker build -t ${IMAGE_NAME}:${IMAGE_TAG} -f Dockerfile .
# Create ECR repository (if it doesn't exist)
aws ecr create-repository --repository-name ${IMAGE_NAME} --region ${REGION} || true
# Login to ECR
aws ecr get-login-password --region ${REGION} | docker login --username AWS --password-stdin ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com
# Tag and push
docker tag ${IMAGE_NAME}:${IMAGE_TAG} ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/${IMAGE_NAME}:${IMAGE_TAG}
docker push ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/${IMAGE_NAME}:${IMAGE_TAG}

Upload your dataset to S3:

aws s3 sync /path/to/your/dataset s3://your-bucket/rfdetr/training-data/

Then create and submit the training job with the SageMaker Python SDK:

from sagemaker.estimator import Estimator
# Configuration
role_arn = 'arn:aws:iam::YOUR_ACCOUNT_ID:role/YOUR_SAGEMAKER_ROLE'
image_uri = 'YOUR_ACCOUNT_ID.dkr.ecr.REGION.amazonaws.com/rfdetr-sagemaker-training:latest'
s3_data_path = 's3://your-bucket/rfdetr/training-data/'
s3_output_path = 's3://your-bucket/rfdetr/output/'
# Hyperparameters
hyperparameters = {
'epochs': 50,
'batch-size': 8,
'grad-accum-steps': 2,
'model-size': 'large'
}
# Create estimator
estimator = Estimator(
image_uri=image_uri,
role=role_arn,
instance_count=1,
instance_type='ml.p3.2xlarge',
output_path=s3_output_path,
hyperparameters=hyperparameters,
max_run=86400, # 24 hours
volume_size=100 # GB
)
# Start training
estimator.fit({'training': s3_data_path})

Recommended instance types for training:
- ml.p3.2xlarge - 1x V100 GPU, 8 vCPUs, 61 GB RAM (~$3.83/hr)
- ml.p3.8xlarge - 4x V100 GPUs, 32 vCPUs, 244 GB RAM (~$14.69/hr)
- ml.p3.16xlarge - 8x V100 GPUs, 64 vCPUs, 488 GB RAM (~$28.15/hr)
- ml.g4dn.xlarge - 1x T4 GPU, 4 vCPUs, 16 GB RAM (~$0.71/hr) - for testing
- ml.g5.xlarge - 1x A10G GPU, 4 vCPUs, 16 GB RAM (~$1.41/hr)
For cost savings, consider using Managed Spot Training which can reduce costs by up to 90%.
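A sketch of the spot-related settings, extending the Estimator example above (it reuses `image_uri`, `role_arn`, `s3_output_path`, and `hyperparameters` from that block; the checkpoint bucket path is illustrative):

```python
from sagemaker.estimator import Estimator

# Same configuration as the example above, with Managed Spot Training enabled
estimator = Estimator(
    image_uri=image_uri,
    role=role_arn,
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    output_path=s3_output_path,
    hyperparameters=hyperparameters,
    use_spot_instances=True,   # request spot capacity instead of on-demand
    max_run=86400,             # cap on actual training time (seconds)
    max_wait=90000,            # total time including waiting for capacity; must be >= max_run
    checkpoint_s3_uri='s3://your-bucket/rfdetr/checkpoints/',  # illustrative; lets interrupted jobs resume
    volume_size=100,
)
```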
Available hyperparameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `epochs` | int | 50 | Number of training epochs |
| `batch-size` | int | 8 | Batch size per device |
| `grad-accum-steps` | int | 2 | Gradient accumulation steps |
| `model-size` | str | 'large' | Model size: 'large' or 'medium' |
| `learning-rate` | float | None | Learning rate (uses model default if not set) |
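SageMaker makes these hyperparameters available inside the container (they are written to /opt/ml/input/config/hyperparameters.json and, when the SageMaker training toolkit is used, also passed as command-line flags such as `--epochs 50`). A simplified argparse sketch of how a training script might accept them; this is illustrative, not the actual train_rfdetr_aws.py:

```python
import argparse

# Illustrative parser; the real training script may differ
parser = argparse.ArgumentParser()
parser.add_argument("--epochs", type=int, default=50)
parser.add_argument("--batch-size", type=int, default=8)
parser.add_argument("--grad-accum-steps", type=int, default=2)
parser.add_argument("--model-size", choices=["large", "medium"], default="large")
parser.add_argument("--learning-rate", type=float, default=None)
args = parser.parse_args()
print(args)
```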
Additional configuration options for submit_sagemaker_job.py:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `--volume-size` | int | 100 | EBS volume size in GB for the training instance |
| `--instance-type` | str | 'ml.p3.2xlarge' | SageMaker instance type |
| `--instance-count` | int | 1 | Number of instances for distributed training |
| `--max-run` | int | 86400 | Maximum training time in seconds (24 hours) |
| `--job-name` | str | auto | Training job name (auto-generated if not specified) |
You can monitor the training job through:

- AWS Console:
  - Navigate to SageMaker → Training jobs
  - Click on your job name
  - View metrics and logs

- AWS CLI:

  aws sagemaker describe-training-job --training-job-name YOUR_JOB_NAME

- CloudWatch Logs:

  aws logs tail /aws/sagemaker/TrainingJobs --follow --filter-pattern YOUR_JOB_NAME
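You can also poll the job from Python. A boto3 sketch (replace YOUR_JOB_NAME with your submitted job name):

```python
import time
import boto3

sm = boto3.client("sagemaker")
job_name = "YOUR_JOB_NAME"

# Poll until the job reaches a terminal state
while True:
    desc = sm.describe_training_job(TrainingJobName=job_name)
    status = desc["TrainingJobStatus"]
    print(status, desc.get("SecondaryStatus", ""))
    if status in ("Completed", "Failed", "Stopped"):
        break
    time.sleep(60)

if status == "Completed":
    print("Model artifacts:", desc["ModelArtifacts"]["S3ModelArtifacts"])
else:
    print("Reason:", desc.get("FailureReason", "n/a"))
```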
After training completes:
aws s3 cp s3://your-bucket/rfdetr/output/YOUR_JOB_NAME/output/model.tar.gz .
tar -xzf model.tar.gz
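The same step in Python, in case you want to script it (the bucket and key are illustrative and must match your output path and job name):

```python
import tarfile
import boto3

s3 = boto3.client("s3")

# Fetch the training artifact and unpack it into ./model/
s3.download_file(
    "your-bucket",
    "rfdetr/output/YOUR_JOB_NAME/output/model.tar.gz",
    "model.tar.gz",
)
with tarfile.open("model.tar.gz") as tar:
    tar.extractall("model")
```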
Before deploying to SageMaker (and incurring costs), test your Docker container locally. To test training locally with a GPU:
cd aws/
./scripts/test_docker_local.sh rfdetr-sagemaker-training latest test_data/sample

This will:
- Run training for 2 epochs with small batch size
- Use your local GPU
- Save outputs to `/tmp/sagemaker-local/`
- Simulate SageMaker's directory structure
Requirements:
- NVIDIA Docker runtime installed
- GPU available on host
You can also run the container manually:
# Build the image
./scripts/build_docker.sh
# Run interactively to debug (using sample data from repo)
docker run --rm -it \
-v $(pwd)/test_data/sample:/opt/ml/input/data/training \
-v /tmp/output:/opt/ml/output \
-v /tmp/model:/opt/ml/model \
--gpus all \
rfdetr-sagemaker-training:latest \
bash
# Inside the container, you can:
# - Check Python version: python --version
# - Verify imports: python -c "import torch, rfdetr, transformers"
# - Inspect data: ls /opt/ml/input/data/training
# - Run training: python /opt/ml/code/train.py --epochs 1 --batch-size 2

If you don't have a GPU locally, you can still test the container build and imports:
docker run --rm rfdetr-sagemaker-training:latest \
python -c "import torch, rfdetr; print('Success!')"

Cost optimization tips:

- Use Spot Instances: In `submit_sagemaker_job.py`, modify the estimator:

  use_spot_instances=True,
  max_wait=90000  # Maximum time to wait for a spot instance

- Use Smaller Instances for Testing: Start with `ml.g4dn.xlarge` for debugging
- Checkpoint Regularly: Modify the training script to save checkpoints to S3 (see the sketch after this list)
- Set Maximum Runtime: Use the `max_run` parameter to avoid runaway costs
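For the checkpointing tip: when `checkpoint_s3_uri` is set on the estimator (as in the spot example earlier), SageMaker syncs the container's /opt/ml/checkpoints directory to that S3 prefix, so the training loop only needs to write checkpoints locally. A minimal PyTorch sketch of the idea; `save_checkpoint` is a hypothetical helper, not part of train_rfdetr_aws.py:

```python
import os
import torch

CHECKPOINT_DIR = "/opt/ml/checkpoints"  # synced to checkpoint_s3_uri by SageMaker
os.makedirs(CHECKPOINT_DIR, exist_ok=True)

def save_checkpoint(model, optimizer, epoch):
    """Hypothetical helper: write a checkpoint that survives spot interruptions."""
    torch.save(
        {
            "epoch": epoch,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
        },
        os.path.join(CHECKPOINT_DIR, f"checkpoint_epoch_{epoch}.pt"),
    )
```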
If a training job fails, check the CloudWatch logs:

aws logs tail /aws/sagemaker/TrainingJobs --follow

Common issues:
- Missing dependencies in Dockerfile
- Incorrect Python path
- Memory issues
If training fails partway through:

- Check the error in CloudWatch logs
- Verify data format matches expected COCO format
- Ensure sufficient instance memory for batch size
- Check S3 permissions for the execution role
If you run out of memory:

- Reduce `batch-size`
- Increase `grad-accum-steps` (the effective batch size is `batch-size` × `grad-accum-steps`, so you can keep it constant while lowering per-device memory use)
- Use a larger instance type
- Use `model-size: 'medium'` instead of 'large'
For issues specific to:
- SageMaker: Check AWS SageMaker Documentation
- RF-DETR: Check the rfdetr package documentation
- This Implementation: Open an issue in the project repository
