This project is a comprehensive machine learning solution for the Amazon ML Challenge 2025, focused on predicting product prices from catalog text descriptions. The solution employs advanced feature engineering, ensemble learning, and multiple gradient boosting models to achieve competitive SMAPE scores.
Team: Data Miners
Institution: Dayananda Sagar University, Bangalore
Team Members: Putta Sujith, KAMANNAGARI BHAVITHAGNA, E. Yashas Kumar
Final SMAPE Score: 56.2% (Amazon ML Challenge 2025 Leaderboard)
Objective: Predict product prices from catalog content (text descriptions)
Dataset:
- Training: 75,000 samples with prices
- Test: 75,000 samples for prediction
- Price range: $0.13 - $2,796.00
Evaluation Metric: SMAPE (Symmetric Mean Absolute Percentage Error)
- Lower is better (0% = perfect, 200% = worst)
- Formula:
SMAPE = (1/n) * Σ |pred - actual| / ((|actual| + |pred|)/2) * 100%
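In code, the formula above might look like the following (a minimal sketch for intuition, not the official competition scorer):

```python
import numpy as np

def smape(actual, pred):
    """Symmetric Mean Absolute Percentage Error in percent (0 = perfect, 200 = worst)."""
    actual = np.asarray(actual, dtype=float)
    pred = np.asarray(pred, dtype=float)
    denom = (np.abs(actual) + np.abs(pred)) / 2.0
    return float(np.mean(np.abs(pred - actual) / denom) * 100.0)
```

For example, predicting $20 for a $10 item gives 10 / 15 * 100 ≈ 66.7%, illustrating how over-predictions are penalized symmetrically with under-predictions.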
Constraints:
- No external price lookup allowed
- Models must use MIT/Apache 2.0 licenses
- Maximum 8 billion parameters
The solution uses a multi-stage ensemble pipeline combining:
1. Advanced Feature Engineering (230+ features)
- Text-based features (IPQ, measurements, categories)
- Statistical features (50+ derived metrics)
- TF-IDF embeddings (800 → 150 via SVD)
- Clustering features (200 clusters)
- Target encoding (IPQ and category-based)
2. Ensemble Learning (3-5 models)
- LightGBM with different objectives (MAE, Huber, RMSE)
- XGBoost with different objectives (Absolute Error, Squared Error)
- Weighted averaging based on performance
3. Stratified Cross-Validation
- 3-7 folds depending on script
- Price-based stratification (20 quantile bins)
- Out-of-fold predictions for ensemble weighting
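The stratification step above can be sketched as follows (an illustrative example with synthetic prices, not the competition code):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Stratify on price quantiles so every fold covers the full price range.
rng = np.random.default_rng(42)
prices = pd.Series(np.exp(rng.normal(3.0, 1.0, size=1000)))  # skewed, price-like

bins = pd.qcut(prices, q=20, labels=False, duplicates="drop")  # 20 quantile bins
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
folds = list(skf.split(prices, bins))  # out-of-fold indices per fold
```

Binning a continuous target into quantiles is what lets StratifiedKFold, which expects class labels, balance the price distribution across folds.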
Amazon_ML_Challenge_2025/
│
├── 📄 Core Training Scripts
│ ├── train_ultra_fast.py # ⚡ FASTEST: 3 folds, 3 LGB models (~1.5 hours)
│ ├── train_best_fast.py # 🚀 BALANCED: 3 folds, 5 models (~2 hours)
│ └── train_ultra_optimized.py # 🎯 COMPREHENSIVE: 3 folds, 5 models (~2.5 hours)
│
├── 📂 Scripts (Modular Pipeline)
│ ├── scripts/features.py # Feature extraction functions
│ ├── scripts/train.py # Training pipeline skeleton
│ └── scripts/predict.py # Prediction pipeline
│
├── 📂 Features (Generated)
│ ├── features/production/ # Production features (218 features)
│ ├── features/advanced/ # Advanced features (246 features)
│ └── features/minimal/ # Image features (106 features)
│
├── 📂 Models (Trained)
│ ├── models/*.pkl # Trained model files
│ └── models/*.csv # Submission files
│
├── 📂 Student Resource (Dataset)
│ └── student_resource/dataset/
│ ├── train.csv # Training data (75k samples)
│ ├── test.csv # Test data (75k samples)
│ ├── sample_test.csv # Sample test
│ └── sample_test_out.csv # Sample output format
│
├── 📄 Documentation
│ ├── README.md # This file
│ ├── README_SUBMISSION.md # Submission documentation
│ ├── ML_Approach_Documentation.txt # Detailed ML approach
│ ├── FINAL_SUBMISSION_GUIDE.txt # Submission guide
│ ├── COLAB_INSTRUCTIONS.txt # Google Colab setup
│ └── FILES_TO_SUBMIT.txt # Submission checklist
│
├── 📄 Notebooks
│ └── Amazon_ML_Challenge_2025_Organized.ipynb
│
├── 📄 Output Files
│ ├── test_out.csv # Main submission
│ ├── test_out1.csv # Run 1 predictions
│ ├── test_out2.csv # Run 2 predictions
│ └── test_out3.csv # Run 3 predictions
│
├── 📄 Configuration
│ ├── requirements.txt # Python dependencies
│ └── LICENSE.md # MIT License
│
└── 📂 Virtual Environment
└── venv/ # Python virtual environment
Item Pack Quantity (IPQ) Extraction:
- 15+ regex patterns to identify bulk quantities
- Patterns: "pack of X", "X count", "X pieces", "X ct", etc.
- Critical feature: Products sold in bulk have different unit pricing
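A sketch of the extraction logic, using an illustrative subset of the patterns (the full pipeline uses 15+; the function name is hypothetical):

```python
import re

IPQ_PATTERNS = [
    r"pack\s+of\s+(\d+)",
    r"(\d+)\s*[- ]?count\b",
    r"(\d+)\s*pieces\b",
    r"(\d+)\s*ct\b",
]

def extract_ipq(text):
    """Return the first pack quantity found in catalog text, defaulting to 1."""
    text = text.lower()
    for pattern in IPQ_PATTERNS:
        match = re.search(pattern, text)
        if match:
            return int(match.group(1))
    return 1
```

For example, `extract_ipq("Granola Bars, Pack of 12")` returns 12, while text with no quantity cue falls back to a single unit.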
Measurement Extraction:
- Weight: oz, lb, kg, g (normalized to oz)
- Volume: ml, l, fl oz, gal (normalized to ml)
- Dimensions: length × width × height (inches/cm)
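Weight normalization can be sketched the same way (the conversion table and function name here are illustrative, not the submitted code):

```python
import re

# Everything is mapped to ounces so weights are directly comparable.
TO_OZ = {"oz": 1.0, "lb": 16.0, "kg": 35.274, "g": 0.035274}

def extract_weight_oz(text):
    """Parse the first weight mention ('2 lb', '500 g', ...) and convert to oz."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*(oz|lb|kg|g)\b", text.lower())
    if match is None:
        return None
    value, unit = float(match.group(1)), match.group(2)
    return value * TO_OZ[unit]
```

Note the alternation order: `kg` must be tried before `g` so "2 kg" is not parsed as grams.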
Product Categories (20 binary flags):
- Food, health, beverage, beauty, home, baby, pet
- Electronics, toys, office, outdoor, clothing, books
- Tools, garden, automotive, medical
- Premium, bulk, brand presence
Basic Counts:
- Text length, word count, number count
- Punctuation counts (comma, dot, dash)
- Character type counts (upper, lower, digit, special)
Ratios & Transformations:
- Upper/digit/special character ratios
- Average word length, words per comma
- Log transforms: log(IPQ), log(text_len), log(weight), log(volume)
- Polynomial: IPQ², IPQ³, √IPQ
Interaction Features:
- IPQ × word_count, IPQ × weight, IPQ × volume
- Weight × volume, 3D volume (L×W×H)
- Per-unit features: weight/IPQ, volume/IPQ, words/IPQ
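A hedged sketch of the polynomial, per-unit, and interaction features above (column names are illustrative, not the competition code):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ipq": [1, 12, 6],
    "word_count": [20, 35, 15],
    "weight_oz": [8.0, 96.0, 30.0],
})

df["ipq_x_words"] = df["ipq"] * df["word_count"]     # interaction
df["weight_per_unit"] = df["weight_oz"] / df["ipq"]  # per-unit feature
df["ipq_sq"] = df["ipq"] ** 2                        # polynomial terms
df["sqrt_ipq"] = np.sqrt(df["ipq"])
df["log_ipq"] = np.log(df["ipq"])                    # log transform (ipq >= 1)
```

The per-unit columns are what let the model separate "expensive product" from "cheap product sold in bulk".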
Configuration:
- Max features: 800
- N-gram range: (1, 3) - unigrams, bigrams, trigrams
- Min document frequency: 3
- Max document frequency: 0.8
- Sublinear TF scaling: True
Dimensionality Reduction:
- TruncatedSVD: 800 → 150 components
- Preserves ~85-90% of variance
- Reduces computational cost and overfitting
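The pipeline above can be sketched with the listed configuration (min_df is lowered to 1 and the SVD to 3 components only because this demo corpus is tiny):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

texts = [
    "pack of 12 granola bars premium quality",
    "premium quality coffee beans 16 oz",
    "stainless steel water bottle 32 oz",
    "pack of 6 water bottles",
]

tfidf = TfidfVectorizer(max_features=800, ngram_range=(1, 3),
                        min_df=1, max_df=0.8, sublinear_tf=True)
X = tfidf.fit_transform(texts)

svd = TruncatedSVD(n_components=3, random_state=42)  # 150 in the real pipeline
X_svd = svd.fit_transform(X)
```

TruncatedSVD works directly on the sparse TF-IDF matrix, which is why it is preferred over PCA here.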
Algorithm: MiniBatchKMeans
- 200 clusters on SVD embeddings
- Batch size: 2000, iterations: 100
Statistics per cluster:
- Median, mean, std, min, max, count
- 25th percentile (Q1), 75th percentile (Q3)
Rationale: Similar products cluster together with similar prices
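A sketch of the clustering features (synthetic data stands in for the 150-dim SVD embeddings, and 8 clusters replace the 200 used in the real pipeline):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 10))
prices = pd.Series(rng.uniform(1.0, 100.0, size=500))

km = MiniBatchKMeans(n_clusters=8, batch_size=200, max_iter=100,
                     n_init=3, random_state=42)
clusters = km.fit_predict(embeddings)

# Per-cluster price statistics, mapped back as row-level features.
stats = prices.groupby(clusters).agg(["median", "mean", "std", "min", "max", "count"])
cluster_median_price = pd.Series(clusters).map(stats["median"])
```

Each row then carries its cluster's price profile as features, which is how "similar products have similar prices" reaches the boosted trees.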
IPQ-based Encoding:
- Median, mean, std, count of prices per IPQ value
- Captures pricing patterns for different pack sizes
Category-based Encoding:
- Median and mean prices for 5 key categories
- Categories: food, health, beverage, beauty, premium
- Handles unseen values with global statistics
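A toy sketch of the IPQ encoding with a global fallback (the real pipeline also adds mean, std, and count columns):

```python
import pandas as pd

train = pd.DataFrame({"ipq":   [1, 1, 12, 12, 6],
                      "price": [5.0, 7.0, 30.0, 34.0, 18.0]})
test = pd.DataFrame({"ipq": [12, 24]})  # ipq=24 never appears in training

global_median = train["price"].median()
ipq_median = train.groupby("ipq")["price"].median()

# Unseen IPQ values fall back to the global median via fillna.
test["ipq_median_price"] = test["ipq"].map(ipq_median).fillna(global_median)
```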
- QuantileTransformer with normal output distribution
- Robust to outliers
- Ensures all features on comparable scale
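As a sketch, the scaling step maps heavy-tailed features onto a standard-normal shape; it is robust to outliers because only ranks matter:

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
X = np.exp(rng.normal(size=(1000, 3)))  # log-normal, heavy-tailed

qt = QuantileTransformer(output_distribution="normal", random_state=0)
X_scaled = qt.fit_transform(X)
```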
Total Features: ~230 (80 numerical + 150 SVD)
The solution uses 5 diverse gradient boosting models with different objectives and hyperparameters to maximize ensemble diversity:
Model 1: LightGBM (MAE)
- Objective: regression (MAE loss)
- Learning rate: 0.02-0.025
- Num leaves: 150-180
- Max depth: 14-15
- Feature fraction: 0.65-0.7
- Regularization: L1 = 1.0-2.0, L2 = 1.0-2.0

Model 2: LightGBM (Huber, robust to outliers)
- Alpha: 0.85-0.9
- Learning rate: 0.025-0.03
- Num leaves: 120-150
- Max depth: 12-13
- Feature fraction: 0.7-0.75

Model 3: LightGBM (RMSE)
- Objective: regression (RMSE loss)
- Learning rate: 0.025-0.035
- Num leaves: 100-130
- Max depth: 11-12
- Feature fraction: 0.75-0.8

Model 4: XGBoost (reg:absoluteerror)
- Learning rate: 0.02-0.025
- Max depth: 8-10
- Subsample: 0.7
- Tree method: hist

Model 5: XGBoost (reg:squarederror)
- Learning rate: 0.025-0.03
- Max depth: 7-9
- Subsample: 0.75
- Tree method: hist

Sources of ensemble diversity:
- Different loss functions: MAE, Huber, RMSE, absolute error, squared error
- Different algorithms: LightGBM vs. XGBoost
- Different hyperparameters: varying depth, learning rate, regularization
- Weighted averaging: performance-based weights (1 / SMAPE)
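The 1 / SMAPE weighting can be sketched as follows (the scores are hypothetical out-of-fold values, not the actual runs):

```python
oof_smape = {"lgb_mae": 41.0, "lgb_huber": 43.0, "xgb_mae": 46.0}

# Invert scores (lower SMAPE -> higher weight), then normalize to sum to 1.
inverse = {name: 1.0 / score for name, score in oof_smape.items()}
total = sum(inverse.values())
weights = {name: w / total for name, w in inverse.items()}

# Blend: final_pred = sum(weights[m] * preds[m] for m in weights)
```

Normalizing the inverted scores guarantees the blend stays in the convex hull of the base predictions.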
| Script | Folds | Models | Training Time | Expected SMAPE | Best For |
|---|---|---|---|---|---|
| train_ultra_fast.py | 3 | 3 LGB | ~1.5 hours | 40-50% | Quick testing |
| train_best_fast.py | 3 | 5 (3 LGB + 2 XGB) | ~2 hours | 38-45% | Balanced speed/accuracy |
| train_ultra_optimized.py | 3 | 5 (3 LGB + 2 XGB) | ~2.5 hours | 38-45% | Best accuracy |
- For quick testing: use train_ultra_fast.py
- For submission: use train_best_fast.py or train_ultra_optimized.py
- For best results: run multiple times and ensemble the outputs
Requirements:
- Python 3.8+
- Windows/Linux/Mac
- 4GB+ RAM

Installation:

pip install -r requirements.txt

Required packages:
- pandas, numpy
- scikit-learn
- lightgbm
- xgboost (optional, for full ensemble)
- scipy
- tqdm
Local run:

1. Prepare data: ensure train.csv and test.csv are in student_resource/dataset/
2. Run training (choose one):

   # Fast version (~1.5 hours)
   python train_ultra_fast.py
   # Balanced version (~2 hours)
   python train_best_fast.py
   # Full version (~2.5 hours)
   python train_ultra_optimized.py

3. Output: test_out.csv with the final predictions (75,000 rows)

Google Colab run:

1. Upload Amazon_ML_Challenge_2025_Organized.ipynb, train.csv, and test.csv
2. Modify the dataset paths in the notebook:

   # Change:
   'student_resource/dataset/train.csv' → 'train.csv'
   'student_resource/dataset/test.csv' → 'test.csv'

3. Run all cells
4. Download test_out.csv
🏆 Final Leaderboard SMAPE: 56.2%
This score was achieved on the complete 75K test set in the Amazon ML Challenge 2025, demonstrating the model's ability to generalize to unseen data.
| Metric | train_ultra_fast | train_best_fast | train_ultra_optimized |
|---|---|---|---|
| CV SMAPE | 40-50% | 38-45% | 38-45% |
| Test SMAPE | 56.2% | 56.2% | 56.2% |
| Training Time | ~1.5 hours | ~2 hours | ~2.5 hours |
| Models | 3 LightGBM | 5 (3 LGB + 2 XGB) | 5 (3 LGB + 2 XGB) |
| Folds | 3 | 3 | 3 |
Note: The gap between CV SMAPE (40-50%) and test SMAPE (56.2%) indicates some overfitting to the training distribution. This is common in competitions with diverse product catalogs.
- IPQ (Item Pack Quantity) - Most important
- IPQ median price (target encoding)
- Cluster median price
- Text length
- Word count
- Weight
- Volume
- IPQ × words interaction
- Log(IPQ)
- TF-IDF components
- LightGBM MAE: ~25-35% weight
- LightGBM Huber: ~20-30% weight
- LightGBM RMSE: ~15-25% weight
- XGBoost MAE: ~10-20% weight
- XGBoost Squared: ~10-15% weight
- Load 75,000 training samples
- Analyze price distribution ($0.13 - $2,796.00)
- Identify key patterns in catalog text
- Extract IPQ using 15+ regex patterns
- Parse measurements (weight, volume, dimensions)
- Identify product categories (20 categories)
- Compute text statistics (50+ features)
- Create interaction features
- TF-IDF vectorization (800 features)
- SVD dimensionality reduction (150 components)
- Preserve ~85-90% variance
- MiniBatchKMeans (200 clusters)
- Compute cluster statistics
- Merge with original data
- IPQ-based price statistics
- Category-based price statistics
- Handle unseen values
- Stratified K-Fold (3 folds)
- Train multiple models with different objectives
- Early stopping to prevent overfitting
- Weighted averaging based on performance
- Clip predictions to valid range
- Generate final predictions
Out of memory:

# Reduce in training script:
max_features = 800 → 500
n_components = 150 → 100
n_clusters = 200 → 100

Training too slow:

# Reduce in training script:
n_splits = 3 → 2
rounds = 1500 → 1000
# Or use train_ultra_fast.py instead

Want better accuracy:

# Increase in training script:
n_splits = 3 → 5
rounds = 1500 → 2000
# Add more models to the ensemble
# Run multiple times and average predictions

Import errors:

# Reinstall dependencies
pip install --upgrade -r requirements.txt
# Or install individually
pip install pandas numpy scikit-learn lightgbm xgboost scipy tqdm

This project is licensed under the MIT License - see LICENSE.md.
- LightGBM: MIT License ✅
- XGBoost: Apache License 2.0 ✅
- scikit-learn: BSD 3-Clause License ✅
- pandas, numpy, scipy: BSD 3-Clause License ✅
All models and libraries are open-source and competition-compliant.
Model Parameters: All models are well within the 8 Billion parameter limit.
Final SMAPE: 56.2%
The model achieved a competitive score on the Amazon ML Challenge 2025 leaderboard. The gap between cross-validation (40-50%) and test performance (56.2%) reveals important insights:
Performance Gap Analysis:
- CV SMAPE: 40-50%
- Test SMAPE: 56.2%
- Gap: ~6-16 percentage points
Reasons for the Gap:
- Distribution Shift: Test set may have different product categories or price ranges
- Overfitting: Model learned training-specific patterns that don't generalize
- Target Encoding Leakage: IPQ and category encodings may be too specific to training data
- Limited Folds: 3 folds may not capture full data variability
1. IPQ (Item Pack Quantity) is the most important feature
- Products sold in bulk have different unit pricing
- 15+ regex patterns capture various formats
2. TF-IDF embeddings capture product-specific information
- N-grams (1-3) capture phrases like "pack of", "premium quality"
- SVD reduction preserves most information while reducing noise
3. Clustering helps identify product categories
- Similar products cluster together with similar prices
- Cluster statistics provide strong price signals
4. Target encoding provides direct price signals
- IPQ-based encoding captures pack size pricing patterns
- Category-based encoding captures category-specific pricing
5. Ensemble diversity is crucial for robustness
- Different loss functions handle different error types
- Weighted averaging reduces overfitting
Challenge 1: High variance in prices ($0.13 - $2,796.00)
- Solution: Log transforms, quantile scaling, stratified CV
Challenge 2: Text data is noisy and unstructured
- Solution: Multiple regex patterns, TF-IDF with n-grams, SVD reduction
Challenge 3: Overfitting to training data (CV: 40-50% → Test: 56.2%)
- Solution: Stratified K-fold CV, early stopping, ensemble diversity
- Lesson Learned: Need more aggressive regularization and validation strategies
Challenge 4: Long training times
- Solution: MiniBatchKMeans, reduced folds, optimized hyperparameters
This project demonstrates:
1. Advanced Feature Engineering
- Text parsing with regex
- Statistical feature creation
- Interaction features
- Target encoding
2. Ensemble Learning
- Multiple model types
- Different objectives
- Weighted averaging
- Out-of-fold predictions
3. Production ML Pipeline
- Modular code structure
- Progress tracking
- Error handling
- Reproducibility
4. Optimization Techniques
- Dimensionality reduction (SVD)
- Batch processing (MiniBatchKMeans)
- Early stopping
- Hyperparameter tuning
Based on the final competition result (SMAPE: 56.2%), here are key improvements to reduce the CV-test gap:
1. More Robust Cross-Validation
- Increase folds: 3 → 7 or 10 for better generalization
- Use GroupKFold or TimeSeriesSplit if applicable
- Implement adversarial validation to detect distribution shift
2. Regularization Improvements
- Stronger L1/L2 penalties
- Lower learning rates with more rounds
- Higher min_child_samples to prevent overfitting
- Add dropout for neural network components
3. Target Encoding Refinement
- Add smoothing/noise to target encodings
- Use leave-one-out encoding instead of global
- Reduce reliance on training-specific statistics
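One common smoothing scheme blends each group's mean toward the global mean in proportion to group size; a hedged sketch (the function name and smoothing strength m are hypothetical, not part of the submitted code):

```python
import pandas as pd

def smoothed_encode(train, key, target, m=10.0):
    """Blend each group's mean toward the global mean, weighted by group size."""
    global_mean = train[target].mean()
    agg = train.groupby(key)[target].agg(["mean", "count"])
    return (agg["count"] * agg["mean"] + m * global_mean) / (agg["count"] + m)

train = pd.DataFrame({"ipq":   [1] * 2 + [12] * 50,
                      "price": [5.0, 7.0] + [32.0] * 50})
enc = smoothed_encode(train, "ipq", "price")
# Rare ipq=1 (n=2) shrinks heavily toward the global mean; common
# ipq=12 (n=50) stays close to its own group mean.
```

This would directly address the target-encoding leakage suspected in the CV-test gap.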
4. Feature Selection
- Remove highly correlated features
- Use feature importance to drop low-value features
- Implement recursive feature elimination
1. Image Features: use image_link to extract visual features
- ResNet/EfficientNet embeddings
- Color histograms, texture features
- Product category classification from images
2. Advanced Text Features
- Transformer-based embeddings (BERT, RoBERTa)
- Named entity recognition for brands
- Sentiment analysis for product descriptions
3. Hyperparameter Tuning
- Optuna or Bayesian optimization
- Grid search on key parameters
- Ensemble different hyperparameter sets
- Stacking: Meta-model on top of base models
- Pseudo-labeling: Use test predictions iteratively
- External Data: Product category databases (if allowed)
- Outlier Handling: Better strategies for extreme prices
What to do differently:
- Start with more conservative models (higher regularization)
- Use more folds (7-10) even if it takes longer
- Validate on holdout set that mimics test distribution
- Monitor CV-test gap during development
- Use simpler target encoding with smoothing
Team: Data Miners
Institution: Dayananda Sagar University, Bangalore
Competition: Amazon ML Challenge 2025
Date: January 2025
Team Members:
- Putta Sujith
- KAMANNAGARI BHAVITHAGNA
- E. Yashas Kumar
Amazon ML Challenge 2025
- Final SMAPE Score: 56.2%
- Leaderboard: Complete 75K test set evaluation
- Team: Data Miners
- Institution: Dayananda Sagar University, Bangalore
- test_out.csv - Final predictions (75,000 rows)
- ML_Approach_Documentation.txt - Detailed documentation
- Source code - Well-commented training scripts
- README.md - Project documentation
- requirements.txt - Dependencies
- LICENSE.md - License information
- Final SMAPE: 56.2% achieved
- Predictions file has correct format (sample_id, price)
- No negative or zero prices
- 75,000 predictions
- Documentation explains approach
- Source code is well-commented
- All licenses are compliant
- Team information is correct
- Competition submission completed
- Amazon ML Challenge 2025 for organizing the competition
- Dayananda Sagar University for support and resources
- LightGBM Team (Microsoft) for the excellent gradient boosting library
- XGBoost Team (DMLC) for the robust ML framework
- scikit-learn Community for comprehensive ML tools
- Open Source Community for all the amazing libraries
- TF-IDF: Term Frequency-Inverse Document Frequency
- SVD: Singular Value Decomposition
- K-Means Clustering
- Stratified K-Fold Cross-Validation
- Ensemble Learning
- Target Encoding
Built with ❤️ by Team Data Miners
Dayananda Sagar University, Bangalore
Last Updated: January 2025
- train_*.py - Training scripts
- test_out*.csv - Prediction outputs
- *_Documentation.txt - Documentation files
- *_GUIDE.txt - Guide files

Key directories:
- Dataset: student_resource/dataset/
- Features: features/
- Models: models/
- Scripts: scripts/
# Install dependencies
pip install -r requirements.txt
# Run training (choose one)
python train_ultra_fast.py # Fastest
python train_best_fast.py # Balanced
python train_ultra_optimized.py # Best
# Check output
head test_out.csv
wc -l test_out.csv  # Should be 75001 (including header)

Expected format of test_out.csv:

sample_id,price
TEST_000000,12.99
TEST_000001,24.50
TEST_000002,8.75
...

End of README