This project is a comprehensive machine learning solution for the Amazon ML Challenge 2025, focused on predicting product prices from catalog text descriptions. The solution employs advanced feature engineering, ensemble learning, and multiple gradient boosting models to achieve competitive SMAPE scores.
Team: Data Miners
Institution: Dayananda Sagar University, Bangalore
Team Members: Putta Sujith, KAMANNAGARI BHAVITHAGNA, E. Yashas Kumar
Final SMAPE Score: 56.2% (Amazon ML Challenge 2025 Leaderboard)
Objective: Predict product prices from catalog content (text descriptions)
Dataset:
- Training: 75,000 samples with prices
- Test: 75,000 samples for prediction
- Price range: $0.13 - $2,796.00
Evaluation Metric: SMAPE (Symmetric Mean Absolute Percentage Error)
- Lower is better (0% = perfect, 200% = worst)
- Formula:
SMAPE = (1/n) * Σ |pred - actual| / ((|actual| + |pred|)/2) * 100%
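In code, the formula above might look like the following (a minimal sketch for intuition, not the official competition scorer):

```python
import numpy as np

def smape(actual, pred):
    """Symmetric Mean Absolute Percentage Error in percent (0 = perfect, 200 = worst)."""
    actual = np.asarray(actual, dtype=float)
    pred = np.asarray(pred, dtype=float)
    denom = (np.abs(actual) + np.abs(pred)) / 2.0
    return float(np.mean(np.abs(pred - actual) / denom) * 100.0)
```

For example, predicting $20 for a $10 item gives 10 / 15 * 100 ≈ 66.7%, illustrating how over-predictions are penalized symmetrically with under-predictions.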
Constraints:
- No external price lookup allowed
- Models must use MIT/Apache 2.0 licenses
- Maximum 8 billion parameters
The solution uses a multi-stage ensemble pipeline combining:
1. Advanced Feature Engineering (230+ features)
- Text-based features (IPQ, measurements, categories)
- Statistical features (50+ derived metrics)
- TF-IDF embeddings (800 → 150 via SVD)
- Clustering features (200 clusters)
- Target encoding (IPQ and category-based)
2. Ensemble Learning (3-5 models)
- LightGBM with different objectives (MAE, Huber, RMSE)
- XGBoost with different objectives (Absolute Error, Squared Error)
- Weighted averaging based on performance
3. Stratified Cross-Validation
- 3-7 folds depending on script
- Price-based stratification (20 quantile bins)
- Out-of-fold predictions for ensemble weighting
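The stratification step above can be sketched as follows (an illustrative example with synthetic prices, not the competition code):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Stratify on price quantiles so every fold covers the full price range.
rng = np.random.default_rng(42)
prices = pd.Series(np.exp(rng.normal(3.0, 1.0, size=1000)))  # skewed, price-like

bins = pd.qcut(prices, q=20, labels=False, duplicates="drop")  # 20 quantile bins
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
folds = list(skf.split(prices, bins))  # out-of-fold indices per fold
```

Binning a continuous target into quantiles is what lets StratifiedKFold, which expects class labels, balance the price distribution across folds.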
Amazon_ML_Challenge_2025/
│
├── 📄 Core Training Scripts
│ ├── train_ultra_fast.py # ⚡ FASTEST: 3 folds, 3 LGB models (~1.5 hours)
│ ├── train_best_fast.py # 🚀 BALANCED: 3 folds, 5 models (~2 hours)
│ └── train_ultra_optimized.py # 🎯 COMPREHENSIVE: 3 folds, 5 models (~2.5 hours)
│
├── 📂 Scripts (Modular Pipeline)
│ ├── scripts/features.py # Feature extraction functions
│ ├── scripts/train.py # Training pipeline skeleton
│ └── scripts/predict.py # Prediction pipeline
│
├── 📂 Features (Generated)
│ ├── features/production/ # Production features (218 features)
│ ├── features/advanced/ # Advanced features (246 features)
│ └── features/minimal/ # Image features (106 features)
│
├── 📂 Models (Trained)
│ ├── models/*.pkl # Trained model files
│ └── models/*.csv # Submission files
│
├── 📂 Student Resource (Dataset)
│ └── student_resource/dataset/
│ ├── train.csv # Training data (75k samples)
│ ├── test.csv # Test data (75k samples)
│ ├── sample_test.csv # Sample test
│ └── sample_test_out.csv # Sample output format
│
├── 📄 Documentation
│ ├── README.md # This file
│ ├── README_SUBMISSION.md # Submission documentation
│ ├── ML_Approach_Documentation.txt # Detailed ML approach
│ ├── FINAL_SUBMISSION_GUIDE.txt # Submission guide
│ ├── COLAB_INSTRUCTIONS.txt # Google Colab setup
│ └── FILES_TO_SUBMIT.txt # Submission checklist
│
├── 📄 Notebooks
│ └── Amazon_ML_Challenge_2025_Organized.ipynb
│
├── 📄 Output Files
│ ├── test_out.csv # Main submission
│ ├── test_out1.csv # Run 1 predictions
│ ├── test_out2.csv # Run 2 predictions
│ └── test_out3.csv # Run 3 predictions
│
├── 📄 Configuration
│ ├── requirements.txt # Python dependencies
│ └── LICENSE.md # MIT License
│
└── 📂 Virtual Environment
└── venv/ # Python virtual environment
Item Pack Quantity (IPQ) Extraction:
- 15+ regex patterns to identify bulk quantities
- Patterns: "pack of X", "X count", "X pieces", "X ct", etc.
- Critical feature: Products sold in bulk have different unit pricing
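A sketch of the extraction logic, using an illustrative subset of the patterns (the full pipeline uses 15+; the function name is hypothetical):

```python
import re

IPQ_PATTERNS = [
    r"pack\s+of\s+(\d+)",
    r"(\d+)\s*[- ]?count\b",
    r"(\d+)\s*pieces\b",
    r"(\d+)\s*ct\b",
]

def extract_ipq(text):
    """Return the first pack quantity found in catalog text, defaulting to 1."""
    text = text.lower()
    for pattern in IPQ_PATTERNS:
        match = re.search(pattern, text)
        if match:
            return int(match.group(1))
    return 1
```

For example, `extract_ipq("Granola Bars, Pack of 12")` returns 12, while text with no quantity cue falls back to a single unit.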
Measurement Extraction:
- Weight: oz, lb, kg, g (normalized to oz)
- Volume: ml, l, fl oz, gal (normalized to ml)
- Dimensions: length × width × height (inches/cm)
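Weight normalization can be sketched the same way (the conversion table and function name here are illustrative, not the submitted code):

```python
import re

# Everything is mapped to ounces so weights are directly comparable.
TO_OZ = {"oz": 1.0, "lb": 16.0, "kg": 35.274, "g": 0.035274}

def extract_weight_oz(text):
    """Parse the first weight mention ('2 lb', '500 g', ...) and convert to oz."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*(oz|lb|kg|g)\b", text.lower())
    if match is None:
        return None
    value, unit = float(match.group(1)), match.group(2)
    return value * TO_OZ[unit]
```

Note the alternation order: `kg` must be tried before `g` so "2 kg" is not parsed as grams.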
Product Categories (20 binary flags):
- Food, health, beverage, beauty, home, baby, pet
- Electronics, toys, office, outdoor, clothing, books
- Tools, garden, automotive, medical
- Premium, bulk, brand presence
Basic Counts:
- Text length, word count, number count
- Punctuation counts (comma, dot, dash)
- Character type counts (upper, lower, digit, special)
Ratios & Transformations:
- Upper/digit/special character ratios
- Average word length, words per comma
- Log transforms: log(IPQ), log(text_len), log(weight), log(volume)
- Polynomial: IPQ², IPQ³, √IPQ
Interaction Features:
- IPQ × word_count, IPQ × weight, IPQ × volume
- Weight × volume, 3D volume (L×W×H)
- Per-unit features: weight/IPQ, volume/IPQ, words/IPQ
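A hedged sketch of the polynomial, per-unit, and interaction features above (column names are illustrative, not the competition code):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ipq": [1, 12, 6],
    "word_count": [20, 35, 15],
    "weight_oz": [8.0, 96.0, 30.0],
})

df["ipq_x_words"] = df["ipq"] * df["word_count"]     # interaction
df["weight_per_unit"] = df["weight_oz"] / df["ipq"]  # per-unit feature
df["ipq_sq"] = df["ipq"] ** 2                        # polynomial terms
df["sqrt_ipq"] = np.sqrt(df["ipq"])
df["log_ipq"] = np.log(df["ipq"])                    # log transform (ipq >= 1)
```

The per-unit columns are what let the model separate "expensive product" from "cheap product sold in bulk".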
Configuration:
- Max features: 800
- N-gram range: (1, 3) - unigrams, bigrams, trigrams
- Min document frequency: 3
- Max document frequency: 0.8
- Sublinear TF scaling: True
Dimensionality Reduction:
- TruncatedSVD: 800 → 150 components
- Preserves ~85-90% of variance
- Reduces computational cost and overfitting
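The pipeline above can be sketched with the listed configuration (min_df is lowered to 1 and the SVD to 3 components only because this demo corpus is tiny):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

texts = [
    "pack of 12 granola bars premium quality",
    "premium quality coffee beans 16 oz",
    "stainless steel water bottle 32 oz",
    "pack of 6 water bottles",
]

tfidf = TfidfVectorizer(max_features=800, ngram_range=(1, 3),
                        min_df=1, max_df=0.8, sublinear_tf=True)
X = tfidf.fit_transform(texts)

svd = TruncatedSVD(n_components=3, random_state=42)  # 150 in the real pipeline
X_svd = svd.fit_transform(X)
```

TruncatedSVD works directly on the sparse TF-IDF matrix, which is why it is preferred over PCA here.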
Algorithm: MiniBatchKMeans
- 200 clusters on SVD embeddings
- Batch size: 2000, iterations: 100
Statistics per cluster:
- Median, mean, std, min, max, count
- 25th percentile (Q1), 75th percentile (Q3)
Rationale: Similar products cluster together with similar prices
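A sketch of the clustering features (synthetic data stands in for the 150-dim SVD embeddings, and 8 clusters replace the 200 used in the real pipeline):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 10))
prices = pd.Series(rng.uniform(1.0, 100.0, size=500))

km = MiniBatchKMeans(n_clusters=8, batch_size=200, max_iter=100,
                     n_init=3, random_state=42)
clusters = km.fit_predict(embeddings)

# Per-cluster price statistics, mapped back as row-level features.
stats = prices.groupby(clusters).agg(["median", "mean", "std", "min", "max", "count"])
cluster_median_price = pd.Series(clusters).map(stats["median"])
```

Each row then carries its cluster's price profile as features, which is how "similar products have similar prices" reaches the boosted trees.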
IPQ-based Encoding:
- Median, mean, std, count of prices per IPQ value
- Captures pricing patterns for different pack sizes
Category-based Encoding:
- Median and mean prices for 5 key categories
- Categories: food, health, beverage, beauty, premium
- Handles unseen values with global statistics
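A toy sketch of the IPQ encoding with a global fallback (the real pipeline also adds mean, std, and count columns):

```python
import pandas as pd

train = pd.DataFrame({"ipq":   [1, 1, 12, 12, 6],
                      "price": [5.0, 7.0, 30.0, 34.0, 18.0]})
test = pd.DataFrame({"ipq": [12, 24]})  # ipq=24 never appears in training

global_median = train["price"].median()
ipq_median = train.groupby("ipq")["price"].median()

# Unseen IPQ values fall back to the global median via fillna.
test["ipq_median_price"] = test["ipq"].map(ipq_median).fillna(global_median)
```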
- QuantileTransformer with normal output distribution
- Robust to outliers
- Ensures all features on comparable scale
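As a sketch, the scaling step maps heavy-tailed features onto a standard-normal shape; it is robust to outliers because only ranks matter:

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
X = np.exp(rng.normal(size=(1000, 3)))  # log-normal, heavy-tailed

qt = QuantileTransformer(output_distribution="normal", random_state=0)
X_scaled = qt.fit_transform(X)
```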
Total Features: ~230 (80 numerical + 150 SVD)
The solution uses 5 diverse gradient boosting models with different objectives and hyperparameters to maximize ensemble diversity:
Model 1: LightGBM (MAE)
- Objective: regression (MAE loss)
- Learning rate: 0.02-0.025
- Num leaves: 150-180
- Max depth: 14-15
- Feature fraction: 0.65-0.7
- Regularization: L1 = 1.0-2.0, L2 = 1.0-2.0

Model 2: LightGBM (Huber, robust to outliers)
- Alpha: 0.85-0.9
- Learning rate: 0.025-0.03
- Num leaves: 120-150
- Max depth: 12-13
- Feature fraction: 0.7-0.75

Model 3: LightGBM (RMSE)
- Objective: regression (RMSE loss)
- Learning rate: 0.025-0.035
- Num leaves: 100-130
- Max depth: 11-12
- Feature fraction: 0.75-0.8

Model 4: XGBoost (reg:absoluteerror)
- Learning rate: 0.02-0.025
- Max depth: 8-10
- Subsample: 0.7
- Tree method: hist

Model 5: XGBoost (reg:squarederror)
- Learning rate: 0.025-0.03
- Max depth: 7-9
- Subsample: 0.75
- Tree method: hist

Sources of ensemble diversity:
- Different loss functions: MAE, Huber, RMSE, absolute error, squared error
- Different algorithms: LightGBM vs. XGBoost
- Different hyperparameters: varying depth, learning rate, regularization
- Weighted averaging: performance-based weights (1 / SMAPE)
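The 1 / SMAPE weighting can be sketched as follows (the scores are hypothetical out-of-fold values, not the actual runs):

```python
oof_smape = {"lgb_mae": 41.0, "lgb_huber": 43.0, "xgb_mae": 46.0}

# Invert scores (lower SMAPE -> higher weight), then normalize to sum to 1.
inverse = {name: 1.0 / score for name, score in oof_smape.items()}
total = sum(inverse.values())
weights = {name: w / total for name, w in inverse.items()}

# Blend: final_pred = sum(weights[m] * preds[m] for m in weights)
```

Normalizing the inverted scores guarantees the blend stays in the convex hull of the base predictions.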
| Script | Folds | Models | Training Time | Expected SMAPE | Best For |
|---|---|---|---|---|---|
| train_ultra_fast.py | 3 | 3 LGB | ~1.5 hours | 40-50% | Quick testing |
| train_best_fast.py | 3 | 5 (3 LGB + 2 XGB) | ~2 hours | 38-45% | Balanced speed/accuracy |
| train_ultra_optimized.py | 3 | 5 (3 LGB + 2 XGB) | ~2.5 hours | 38-45% | Best accuracy |
- For quick testing: use train_ultra_fast.py
- For submission: use train_best_fast.py or train_ultra_optimized.py
- For best results: run multiple times and ensemble the outputs
Requirements:
- Python 3.8+
- Windows/Linux/Mac
- 4GB+ RAM

Installation:

pip install -r requirements.txt

Required packages:
- pandas, numpy
- scikit-learn
- lightgbm
- xgboost (optional, for full ensemble)
- scipy
- tqdm
Local run:

1. Prepare data: ensure train.csv and test.csv are in student_resource/dataset/
2. Run training (choose one):

   # Fast version (~1.5 hours)
   python train_ultra_fast.py
   # Balanced version (~2 hours)
   python train_best_fast.py
   # Full version (~2.5 hours)
   python train_ultra_optimized.py

3. Output: test_out.csv with the final predictions (75,000 rows)

Google Colab run:

1. Upload Amazon_ML_Challenge_2025_Organized.ipynb, train.csv, and test.csv
2. Modify the dataset paths in the notebook:

   # Change:
   'student_resource/dataset/train.csv' → 'train.csv'
   'student_resource/dataset/test.csv' → 'test.csv'

3. Run all cells
4. Download test_out.csv
🏆 Final Leaderboard SMAPE: 56.2%
This score was achieved on the complete 75K test set in the Amazon ML Challenge 2025, demonstrating the model's ability to generalize to unseen data.
| Metric | train_ultra_fast | train_best_fast | train_ultra_optimized |
|---|---|---|---|
| CV SMAPE | 40-50% | 38-45% | 38-45% |
| Test SMAPE | 56.2% | 56.2% | 56.2% |
| Training Time | ~1.5 hours | ~2 hours | ~2.5 hours |
| Models | 3 LightGBM | 5 (3 LGB + 2 XGB) | 5 (3 LGB + 2 XGB) |
| Folds | 3 | 3 | 3 |
Note: The gap between CV SMAPE (40-50%) and test SMAPE (56.2%) indicates some overfitting to the training distribution. This is common in competitions with diverse product catalogs.
- IPQ (Item Pack Quantity) - Most important
- IPQ median price (target encoding)
- Cluster median price
- Text length
- Word count
- Weight
- Volume
- IPQ × words interaction
- Log(IPQ)
- TF-IDF components
- LightGBM MAE: ~25-35% weight
- LightGBM Huber: ~20-30% weight
- LightGBM RMSE: ~15-25% weight
- XGBoost MAE: ~10-20% weight
- XGBoost Squared: ~10-15% weight
- Load 75,000 training samples
- Analyze price distribution ($0.13 - $2,796.00)
- Identify key patterns in catalog text
- Extract IPQ using 15+ regex patterns
- Parse measurements (weight, volume, dimensions)
- Identify product categories (20 categories)
- Compute text statistics (50+ features)
- Create interaction features
- TF-IDF vectorization (800 features)
- SVD dimensionality reduction (150 components)
- Preserve ~85-90% variance
- MiniBatchKMeans (200 clusters)
- Compute cluster statistics
- Merge with original data
- IPQ-based price statistics
- Category-based price statistics
- Handle unseen values
- Stratified K-Fold (3 folds)
- Train multiple models with different objectives
- Early stopping to prevent overfitting
- Weighted averaging based on performance
- Clip predictions to valid range
- Generate final predictions
Out of memory:

# Reduce in training script:
max_features = 800 → 500
n_components = 150 → 100
n_clusters = 200 → 100

Training too slow:

# Reduce in training script:
n_splits = 3 → 2
rounds = 1500 → 1000
# Or use train_ultra_fast.py instead

Want better accuracy:

# Increase in training script:
n_splits = 3 → 5
rounds = 1500 → 2000
# Add more models to the ensemble
# Run multiple times and average predictions

Import errors:

# Reinstall dependencies
pip install --upgrade -r requirements.txt
# Or install individually
pip install pandas numpy scikit-learn lightgbm xgboost scipy tqdm

This project is licensed under the MIT License - see LICENSE.md.
- LightGBM: MIT License ✅
- XGBoost: Apache License 2.0 ✅
- scikit-learn: BSD 3-Clause License ✅
- pandas, numpy, scipy: BSD 3-Clause License ✅
All models and libraries are open-source and competition-compliant.
Model Parameters: All models are well within the 8 Billion parameter limit.
Final SMAPE: 56.2%
The model achieved a competitive score on the Amazon ML Challenge 2025 leaderboard. The gap between cross-validation (40-50%) and test performance (56.2%) reveals important insights:
Performance Gap Analysis:
- CV SMAPE: 40-50%
- Test SMAPE: 56.2%
- Gap: ~6-16 percentage points
Reasons for the Gap:
- Distribution Shift: Test set may have different product categories or price ranges
- Overfitting: Model learned training-specific patterns that don't generalize
- Target Encoding Leakage: IPQ and category encodings may be too specific to training data
- Limited Folds: 3 folds may not capture full data variability
1. IPQ (Item Pack Quantity) is the most important feature
- Products sold in bulk have different unit pricing
- 15+ regex patterns capture various formats
2. TF-IDF embeddings capture product-specific information
- N-grams (1-3) capture phrases like "pack of", "premium quality"
- SVD reduction preserves most information while reducing noise
3. Clustering helps identify product categories
- Similar products cluster together with similar prices
- Cluster statistics provide strong price signals
4. Target encoding provides direct price signals
- IPQ-based encoding captures pack size pricing patterns
- Category-based encoding captures category-specific pricing
5. Ensemble diversity is crucial for robustness
- Different loss functions handle different error types
- Weighted averaging reduces overfitting
Challenge 1: High variance in prices ($0.13 - $2,796.00)
- Solution: Log transforms, quantile scaling, stratified CV
Challenge 2: Text data is noisy and unstructured
- Solution: Multiple regex patterns, TF-IDF with n-grams, SVD reduction
Challenge 3: Overfitting to training data (CV: 40-50% → Test: 56.2%)
- Solution: Stratified K-fold CV, early stopping, ensemble diversity
- Lesson Learned: Need more aggressive regularization and validation strategies
Challenge 4: Long training times
- Solution: MiniBatchKMeans, reduced folds, optimized hyperparameters
This project demonstrates:
1. Advanced Feature Engineering
- Text parsing with regex
- Statistical feature creation
- Interaction features
- Target encoding
2. Ensemble Learning
- Multiple model types
- Different objectives
- Weighted averaging
- Out-of-fold predictions
3. Production ML Pipeline
- Modular code structure
- Progress tracking
- Error handling
- Reproducibility
4. Optimization Techniques
- Dimensionality reduction (SVD)
- Batch processing (MiniBatchKMeans)
- Early stopping
- Hyperparameter tuning
Based on the final competition result (SMAPE: 56.2%), here are key improvements to reduce the CV-test gap:
1. More Robust Cross-Validation
- Increase folds: 3 → 7 or 10 for better generalization
- Use GroupKFold or TimeSeriesSplit if applicable
- Implement adversarial validation to detect distribution shift
2. Regularization Improvements
- Stronger L1/L2 penalties
- Lower learning rates with more rounds
- Higher min_child_samples to prevent overfitting
- Add dropout for neural network components
3. Target Encoding Refinement
- Add smoothing/noise to target encodings
- Use leave-one-out encoding instead of global
- Reduce reliance on training-specific statistics
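One common smoothing scheme blends each group's mean toward the global mean in proportion to group size; a hedged sketch (the function name and smoothing strength m are hypothetical, not part of the submitted code):

```python
import pandas as pd

def smoothed_encode(train, key, target, m=10.0):
    """Blend each group's mean toward the global mean, weighted by group size."""
    global_mean = train[target].mean()
    agg = train.groupby(key)[target].agg(["mean", "count"])
    return (agg["count"] * agg["mean"] + m * global_mean) / (agg["count"] + m)

train = pd.DataFrame({"ipq":   [1] * 2 + [12] * 50,
                      "price": [5.0, 7.0] + [32.0] * 50})
enc = smoothed_encode(train, "ipq", "price")
# Rare ipq=1 (n=2) shrinks heavily toward the global mean; common
# ipq=12 (n=50) stays close to its own group mean.
```

This would directly address the target-encoding leakage suspected in the CV-test gap.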
4. Feature Selection
- Remove highly correlated features
- Use feature importance to drop low-value features
- Implement recursive feature elimination
1. Image Features: use image_link to extract visual features
- ResNet/EfficientNet embeddings
- Color histograms, texture features
- Product category classification from images
2. Advanced Text Features
- Transformer-based embeddings (BERT, RoBERTa)
- Named entity recognition for brands
- Sentiment analysis for product descriptions
3. Hyperparameter Tuning
- Optuna or Bayesian optimization
- Grid search on key parameters
- Ensemble different hyperparameter sets
- Stacking: Meta-model on top of base models
- Pseudo-labeling: Use test predictions iteratively
- External Data: Product category databases (if allowed)
- Outlier Handling: Better strategies for extreme prices
What to do differently:
- Start with more conservative models (higher regularization)
- Use more folds (7-10) even if it takes longer
- Validate on holdout set that mimics test distribution
- Monitor CV-test gap during development
- Use simpler target encoding with smoothing
Team: Data Miners
Institution: Dayananda Sagar University, Bangalore
Competition: Amazon ML Challenge 2025
Date: January 2025
Team Members:
- Putta Sujith
- KAMANNAGARI BHAVITHAGNA
- E. Yashas Kumar
Amazon ML Challenge 2025
- Final SMAPE Score: 56.2%
- Leaderboard: Complete 75K test set evaluation
- Team: Data Miners
- Institution: Dayananda Sagar University, Bangalore
- test_out.csv - Final predictions (75,000 rows)
- ML_Approach_Documentation.txt - Detailed documentation
- Source code - Well-commented training scripts
- README.md - Project documentation
- requirements.txt - Dependencies
- LICENSE.md - License information
- Final SMAPE: 56.2% achieved
- Predictions file has correct format (sample_id, price)
- No negative or zero prices
- 75,000 predictions
- Documentation explains approach
- Source code is well-commented
- All licenses are compliant
- Team information is correct
- Competition submission completed
- Amazon ML Challenge 2025 for organizing the competition
- Dayananda Sagar University for support and resources
- LightGBM Team (Microsoft) for the excellent gradient boosting library
- XGBoost Team (DMLC) for the robust ML framework
- scikit-learn Community for comprehensive ML tools
- Open Source Community for all the amazing libraries
- TF-IDF: Term Frequency-Inverse Document Frequency
- SVD: Singular Value Decomposition
- K-Means Clustering
- Stratified K-Fold Cross-Validation
- Ensemble Learning
- Target Encoding
Built with ❤️ by Team Data Miners
Dayananda Sagar University, Bangalore
Last Updated: January 2025
- train_*.py - Training scripts
- test_out*.csv - Prediction outputs
- *_Documentation.txt - Documentation files
- *_GUIDE.txt - Guide files

Key directories:
- Dataset: student_resource/dataset/
- Features: features/
- Models: models/
- Scripts: scripts/
# Install dependencies
pip install -r requirements.txt
# Run training (choose one)
python train_ultra_fast.py # Fastest
python train_best_fast.py # Balanced
python train_ultra_optimized.py # Best
# Check output
head test_out.csv
wc -l test_out.csv  # Should be 75001 (including header)

Expected format of test_out.csv:

sample_id,price
TEST_000000,12.99
TEST_000001,24.50
TEST_000002,8.75
...

End of README