Amazon ML Challenge 2025 - Product Price Prediction

🏆 Project Overview

This project is a comprehensive machine learning solution for the Amazon ML Challenge 2025, focused on predicting product prices from catalog text descriptions. The solution employs advanced feature engineering, ensemble learning, and multiple gradient boosting models to achieve competitive SMAPE scores.

Team: Data Miners
Institution: Dayananda Sagar University, Bangalore
Team Members: Putta Sujith, KAMANNAGARI BHAVITHAGNA, E. Yashas Kumar

🎯 Final Competition Result

Final SMAPE Score: 56.2% (Amazon ML Challenge 2025 Leaderboard)


📊 Problem Statement

Objective: Predict product prices from catalog content (text descriptions)

Dataset:

  • Training: 75,000 samples with prices
  • Test: 75,000 samples for prediction
  • Price range: $0.13 - $2,796.00

Evaluation Metric: SMAPE (Symmetric Mean Absolute Percentage Error)

  • Lower is better (0% = perfect, 200% = worst)
  • Formula: SMAPE = (1/n) * Σ |pred - actual| / ((|actual| + |pred|)/2) * 100%
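
For reference, a minimal NumPy sketch of the metric above (an illustrative helper, not the official scorer):

import numpy as np

def smape(actual, pred):
    """Symmetric Mean Absolute Percentage Error in percent (0 = perfect, 200 = worst)."""
    actual = np.asarray(actual, dtype=float)
    pred = np.asarray(pred, dtype=float)
    denom = (np.abs(actual) + np.abs(pred)) / 2.0
    return float(np.mean(np.abs(pred - actual) / denom) * 100.0)

# smape([10.0, 20.0], [12.0, 18.0])  -> ~14.4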

Constraints:

  • No external price lookup allowed
  • Models must use MIT/Apache 2.0 licenses
  • Maximum 8 billion parameters

🎯 Solution Architecture

Core Approach

The solution uses a multi-stage ensemble pipeline combining:

  1. Advanced Feature Engineering (230+ features)

    • Text-based features (IPQ, measurements, categories)
    • Statistical features (50+ derived metrics)
    • TF-IDF embeddings (800 → 150 via SVD)
    • Clustering features (200 clusters)
    • Target encoding (IPQ and category-based)
  2. Ensemble Learning (3-5 models)

    • LightGBM with different objectives (MAE, Huber, RMSE)
    • XGBoost with different objectives (Absolute Error, Squared Error)
    • Weighted averaging based on performance
  3. Stratified Cross-Validation

    • 3-7 folds depending on script
    • Price-based stratification (20 quantile bins)
    • Out-of-fold predictions for ensemble weighting

📁 Project Structure

Amazon_ML_Challenge_2025/
│
├── 📄 Core Training Scripts
│   ├── train_ultra_fast.py          # ⚡ FASTEST: 3 folds, 3 LGB models (~1.5 hours)
│   ├── train_best_fast.py           # 🚀 BALANCED: 3 folds, 5 models (~2 hours)
│   └── train_ultra_optimized.py     # 🎯 COMPREHENSIVE: 3 folds, 5 models (~2.5 hours)
│
├── 📂 Scripts (Modular Pipeline)
│   ├── scripts/features.py          # Feature extraction functions
│   ├── scripts/train.py             # Training pipeline skeleton
│   └── scripts/predict.py           # Prediction pipeline
│
├── 📂 Features (Generated)
│   ├── features/production/         # Production features (218 features)
│   ├── features/advanced/           # Advanced features (246 features)
│   └── features/minimal/            # Image features (106 features)
│
├── 📂 Models (Trained)
│   ├── models/*.pkl                 # Trained model files
│   └── models/*.csv                 # Submission files
│
├── 📂 Student Resource (Dataset)
│   └── student_resource/dataset/
│       ├── train.csv                # Training data (75k samples)
│       ├── test.csv                 # Test data (75k samples)
│       ├── sample_test.csv          # Sample test
│       └── sample_test_out.csv      # Sample output format
│
├── 📄 Documentation
│   ├── README.md                    # This file
│   ├── README_SUBMISSION.md         # Submission documentation
│   ├── ML_Approach_Documentation.txt # Detailed ML approach
│   ├── FINAL_SUBMISSION_GUIDE.txt   # Submission guide
│   ├── COLAB_INSTRUCTIONS.txt       # Google Colab setup
│   └── FILES_TO_SUBMIT.txt          # Submission checklist
│
├── 📄 Notebooks
│   └── Amazon_ML_Challenge_2025_Organized.ipynb
│
├── 📄 Output Files
│   ├── test_out.csv                 # Main submission
│   ├── test_out1.csv                # Run 1 predictions
│   ├── test_out2.csv                # Run 2 predictions
│   └── test_out3.csv                # Run 3 predictions
│
├── 📄 Configuration
│   ├── requirements.txt             # Python dependencies
│   └── LICENSE.md                   # MIT License
│
└── 📂 Virtual Environment
    └── venv/                        # Python virtual environment

🔧 Feature Engineering Pipeline

1. Advanced Text Features (26 features)

Item Pack Quantity (IPQ) Extraction:

  • 15+ regex patterns to identify bulk quantities
  • Patterns: "pack of X", "X count", "X pieces", "X ct", etc.
  • Critical feature: Products sold in bulk have different unit pricing
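
A minimal sketch of the IPQ extraction idea with two representative patterns (the actual feature code uses a longer pattern list):

import re

IPQ_PATTERNS = [
    r'pack\s+of\s+(\d+)',                       # "pack of 12"
    r'(\d+)\s*(?:count|ct|pieces|pcs|pack)\b',  # "24 count", "6 ct", "3 pieces"
]

def extract_ipq(text):
    text = str(text).lower()
    for pattern in IPQ_PATTERNS:
        match = re.search(pattern, text)
        if match:
            return int(match.group(1))
    return 1  # default: single item

# extract_ipq("Premium Almonds, Pack of 12, 16 oz")  -> 12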

Measurement Extraction:

  • Weight: oz, lb, kg, g (normalized to oz)
  • Volume: ml, l, fl oz, gal (normalized to ml)
  • Dimensions: length × width × height (inches/cm)
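
And a sketch of weight normalization to ounces (conversion factors only; the real parser covers more units, volumes, and dimensions):

import re

WEIGHT_TO_OZ = {'oz': 1.0, 'lb': 16.0, 'kg': 35.274, 'g': 0.0353}

def extract_weight_oz(text):
    match = re.search(r'(\d+(?:\.\d+)?)\s*(oz|lb|kg|g)\b', str(text).lower())
    if match:
        value, unit = float(match.group(1)), match.group(2)
        return value * WEIGHT_TO_OZ[unit]
    return 0.0  # no weight found

# extract_weight_oz("Organic Coffee Beans, 2 lb bag")  -> 32.0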

Product Categories (20 binary flags):

  • Food, health, beverage, beauty, home, baby, pet
  • Electronics, toys, office, outdoor, clothing, books
  • Tools, garden, automotive, medical
  • Premium, bulk, brand presence

2. Text Statistics (50+ features)

Basic Counts:

  • Text length, word count, number count
  • Punctuation counts (comma, dot, dash)
  • Character type counts (upper, lower, digit, special)

Ratios & Transformations:

  • Upper/digit/special character ratios
  • Average word length, words per comma
  • Log transforms: log(IPQ), log(text_len), log(weight), log(volume)
  • Polynomial: IPQ², IPQ³, √IPQ

Interaction Features:

  • IPQ × word_count, IPQ × weight, IPQ × volume
  • Weight × volume, 3D volume (L×W×H)
  • Per-unit features: weight/IPQ, volume/IPQ, words/IPQ

3. TF-IDF Text Embeddings (150 features)

Configuration:

  • Max features: 800
  • N-gram range: (1, 3) - unigrams, bigrams, trigrams
  • Min document frequency: 3
  • Max document frequency: 0.8
  • Sublinear TF scaling: True

Dimensionality Reduction:

  • TruncatedSVD: 800 → 150 components
  • Preserves ~85-90% of variance
  • Reduces computational cost and overfitting
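
A sketch of this stage in scikit-learn, using the configuration above (texts is an assumed list of catalog strings):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

tfidf = TfidfVectorizer(max_features=800, ngram_range=(1, 3),
                        min_df=3, max_df=0.8, sublinear_tf=True)
svd = TruncatedSVD(n_components=150, random_state=42)

# X_tfidf = tfidf.fit_transform(texts)         # sparse (n_samples, 800)
# X_svd = svd.fit_transform(X_tfidf)           # dense  (n_samples, 150)
# print(svd.explained_variance_ratio_.sum())   # roughly 0.85-0.90 expected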

4. Clustering Features (8 features)

Algorithm: MiniBatchKMeans

  • 200 clusters on SVD embeddings
  • Batch size: 2000, iterations: 100

Statistics per cluster:

  • Median, mean, std, min, max, count
  • 25th percentile (Q1), 75th percentile (Q3)

Rationale: Similar products cluster together with similar prices
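
A sketch of this step, assuming X_svd (the SVD embeddings for train + test) and a train DataFrame with a price column:

from sklearn.cluster import MiniBatchKMeans

kmeans = MiniBatchKMeans(n_clusters=200, batch_size=2000, max_iter=100, random_state=42)
# clusters = kmeans.fit_predict(X_svd)            # fit on combined embeddings
# train['cluster'] = clusters[:len(train)]
# grouped = train.groupby('cluster')['price']
# cluster_stats = grouped.agg(['median', 'mean', 'std', 'min', 'max', 'count'])
# cluster_stats['q25'] = grouped.quantile(0.25)
# cluster_stats['q75'] = grouped.quantile(0.75)
# train = train.merge(cluster_stats.reset_index(), on='cluster', how='left')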

5. Target Encoding (14 features)

IPQ-based Encoding:

  • Median, mean, std, count of prices per IPQ value
  • Captures pricing patterns for different pack sizes

Category-based Encoding:

  • Median and mean prices for 5 key categories
  • Categories: food, health, beverage, beauty, premium
  • Handles unseen values with global statistics
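
A sketch of the IPQ-based encoding with a global fallback for unseen values (train/test DataFrames and an ipq column are assumed):

def ipq_target_encode(train, test, col='ipq', target='price'):
    stats = train.groupby(col)[target].agg(['median', 'mean', 'std', 'count'])
    global_median = train[target].median()
    for stat in stats.columns:
        fallback = global_median if stat in ('median', 'mean') else 0.0
        train[f'{col}_{target}_{stat}'] = train[col].map(stats[stat]).fillna(fallback)
        test[f'{col}_{target}_{stat}'] = test[col].map(stats[stat]).fillna(fallback)
    return train, test

# train, test = ipq_target_encode(train, test)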

6. Feature Scaling

  • QuantileTransformer with normal output distribution
  • Robust to outliers
  • Ensures all features on comparable scale
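
In scikit-learn terms, the scaler described above (fit on training features only; arrays are assumed):

from sklearn.preprocessing import QuantileTransformer

scaler = QuantileTransformer(output_distribution='normal', random_state=42)
# X_train_scaled = scaler.fit_transform(X_train_numeric)
# X_test_scaled = scaler.transform(X_test_numeric)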

Total Features: ~230 (80 numerical + 150 SVD)


🤖 Model Architecture

Ensemble Strategy

The solution uses 5 diverse gradient boosting models with different objectives and hyperparameters to maximize ensemble diversity:

Model 1: LightGBM - MAE Optimized

Objective: Regression (MAE loss)
Learning rate: 0.02-0.025
Num leaves: 150-180
Max depth: 14-15
Feature fraction: 0.65-0.7
Regularization: L1=1.0-2.0, L2=1.0-2.0
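
As an illustration, Model 1's ranges could map to a lightgbm parameter dict like this (mid-range values chosen; boosting rounds and early stopping handled in the training loop):

lgb_mae_params = {
    'objective': 'regression_l1',  # MAE loss
    'metric': 'mae',
    'learning_rate': 0.02,
    'num_leaves': 160,
    'max_depth': 14,
    'feature_fraction': 0.7,
    'lambda_l1': 1.5,
    'lambda_l2': 1.5,
    'verbosity': -1,
}
# import lightgbm as lgb
# model = lgb.train(lgb_mae_params, lgb.Dataset(X_tr, y_tr),
#                   num_boost_round=1500,
#                   valid_sets=[lgb.Dataset(X_val, y_val)],
#                   callbacks=[lgb.early_stopping(100)])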

Model 2: LightGBM - Huber Loss

Objective: Huber (robust to outliers)
Alpha: 0.85-0.9
Learning rate: 0.025-0.03
Num leaves: 120-150
Max depth: 12-13
Feature fraction: 0.7-0.75

Model 3: LightGBM - RMSE

Objective: Regression (RMSE loss)
Learning rate: 0.025-0.035
Num leaves: 100-130
Max depth: 11-12
Feature fraction: 0.75-0.8

Model 4: XGBoost - Absolute Error

Objective: reg:absoluteerror
Learning rate: 0.02-0.025
Max depth: 8-10
Subsample: 0.7
Tree method: hist

Model 5: XGBoost - Squared Error

Objective: reg:squarederror
Learning rate: 0.025-0.03
Max depth: 7-9
Subsample: 0.75
Tree method: hist

Why This Ensemble Works

  1. Different Loss Functions: MAE, Huber, RMSE, Absolute Error, Squared Error
  2. Different Algorithms: LightGBM vs XGBoost
  3. Different Hyperparameters: Varying depth, learning rate, regularization
  4. Weighted Averaging: Performance-based weights (1 / SMAPE)
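
A sketch of item 4 above: weights proportional to 1 / out-of-fold SMAPE, normalized to sum to 1 (smape as defined earlier; the prediction lists are assumed):

import numpy as np

# oof_scores = [smape(y_train, oof) for oof in oof_preds]      # one OOF SMAPE per model
# weights = np.array([1.0 / s for s in oof_scores])
# weights /= weights.sum()
# final_pred = sum(w * p for w, p in zip(weights, test_preds))
# final_pred = np.clip(final_pred, 0.13, 2796.0)               # clip to observed price range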

🚀 Training Scripts Comparison

Script                    Folds  Models              Training Time  Expected SMAPE  Best For
train_ultra_fast.py       3      3 LGB               ~1.5 hours     40-50%          Quick testing
train_best_fast.py        3      5 (3 LGB + 2 XGB)   ~2 hours       38-45%          Balanced speed/accuracy
train_ultra_optimized.py  3      5 (3 LGB + 2 XGB)   ~2.5 hours     38-45%          Best accuracy

Recommendations

  • For quick testing: Use train_ultra_fast.py
  • For submission: Use train_best_fast.py or train_ultra_optimized.py
  • For best results: Run multiple times and ensemble the outputs

💻 Installation & Usage

Prerequisites

Python 3.8+
Windows/Linux/Mac
4GB+ RAM

Install Dependencies

pip install -r requirements.txt

Required packages:

  • pandas, numpy
  • scikit-learn
  • lightgbm
  • xgboost (optional, for full ensemble)
  • scipy
  • tqdm

Quick Start (Local)

  1. Prepare Data:

    Ensure train.csv and test.csv are in student_resource/dataset/
    
  2. Run Training:

    # Fast version (~1.5 hours)
    python train_ultra_fast.py
    
    # Balanced version (~2 hours)
    python train_best_fast.py
    
    # Full version (~2.5 hours)
    python train_ultra_optimized.py
  3. Output:

    test_out.csv - Final predictions (75,000 rows)
    

Google Colab Setup

  1. Upload Files:

    • Amazon_ML_Challenge_2025_Organized.ipynb
    • train.csv and test.csv
  2. Modify Paths in Notebook:

    # Change:
    'student_resource/dataset/train.csv' → 'train.csv'
    'student_resource/dataset/test.csv'  → 'test.csv'
  3. Run All Cells

  4. Download: test_out.csv


📈 Performance Metrics

Competition Results

🏆 Final Leaderboard SMAPE: 56.2%

This score was achieved on the complete 75K test set in the Amazon ML Challenge 2025, demonstrating the model's ability to generalize to unseen data.

Cross-Validation Results

Metric         train_ultra_fast  train_best_fast     train_ultra_optimized
CV SMAPE       40-50%            38-45%              38-45%
Test SMAPE     56.2%             56.2%               56.2%
Training Time  ~1.5 hours        ~2 hours            ~2.5 hours
Models         3 LightGBM        5 (3 LGB + 2 XGB)   5 (3 LGB + 2 XGB)
Folds          3                 3                   3

Note: The gap between CV SMAPE (40-50%) and test SMAPE (56.2%) indicates some overfitting to the training distribution. This is common in competitions with diverse product catalogs.

Feature Importance (Top 10)

  1. IPQ (Item Pack Quantity) - Most important
  2. IPQ median price (target encoding)
  3. Cluster median price
  4. Text length
  5. Word count
  6. Weight
  7. Volume
  8. IPQ × words interaction
  9. Log(IPQ)
  10. TF-IDF components

Model Contributions

  • LightGBM MAE: ~25-35% weight
  • LightGBM Huber: ~20-30% weight
  • LightGBM RMSE: ~15-25% weight
  • XGBoost MAE: ~10-20% weight
  • XGBoost Squared: ~10-15% weight

🔬 Methodology

1. Data Loading & Exploration

  • Load 75,000 training samples
  • Analyze price distribution ($0.13 - $2,796.00)
  • Identify key patterns in catalog text

2. Feature Engineering

  • Extract IPQ using 15+ regex patterns
  • Parse measurements (weight, volume, dimensions)
  • Identify product categories (20 categories)
  • Compute text statistics (50+ features)
  • Create interaction features

3. Text Embeddings

  • TF-IDF vectorization (800 features)
  • SVD dimensionality reduction (150 components)
  • Preserve ~85-90% variance

4. Clustering

  • MiniBatchKMeans (200 clusters)
  • Compute cluster statistics
  • Merge with original data

5. Target Encoding

  • IPQ-based price statistics
  • Category-based price statistics
  • Handle unseen values

6. Model Training

  • Stratified K-Fold (3 folds)
  • Train multiple models with different objectives
  • Early stopping to prevent overfitting
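
A sketch of the price stratification, binning prices into 20 quantile bins as described earlier (train DataFrame assumed):

import pandas as pd
from sklearn.model_selection import StratifiedKFold

# price_bins = pd.qcut(train['price'], q=20, labels=False, duplicates='drop')
# skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
# for fold, (tr_idx, val_idx) in enumerate(skf.split(train, price_bins)):
#     ...  # train each model on tr_idx, validate on val_idx, keep OOF predictions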

7. Ensemble

  • Weighted averaging based on performance
  • Clip predictions to valid range
  • Generate final predictions

🛠️ Troubleshooting

Memory Error

# Reduce in training script:
max_features = 800 → 500
n_components = 150 → 100
n_clusters = 200 → 100

Training Too Slow

# Reduce in training script:
n_splits = 3 → 2
rounds = 1500 → 1000
# Use train_ultra_fast.py instead

SMAPE Too High

# Increase in training script:
n_splits = 3 → 5
rounds = 1500 → 2000
# Add more models to ensemble
# Run multiple times and average predictions

Import Errors

# Reinstall dependencies
pip install --upgrade -r requirements.txt

# Or install individually
pip install pandas numpy scikit-learn lightgbm xgboost scipy tqdm

📜 License & Compliance

Project License

This project is licensed under the MIT License - see LICENSE.md

Third-Party Licenses

  • LightGBM: MIT License ✅
  • XGBoost: Apache License 2.0 ✅
  • scikit-learn: BSD 3-Clause License ✅
  • pandas, numpy, scipy: BSD 3-Clause License ✅

All models and libraries are open-source and competition-compliant.

Model Parameters: All models are well within the 8 billion parameter limit.


📊 Key Insights

Competition Performance Analysis

Final SMAPE: 56.2%

The model achieved a competitive score on the Amazon ML Challenge 2025 leaderboard. The gap between cross-validation (40-50%) and test performance (56.2%) reveals important insights:

Performance Gap Analysis:

  • CV SMAPE: 40-50%
  • Test SMAPE: 56.2%
  • Gap: ~6-16 percentage points

Reasons for the Gap:

  1. Distribution Shift: Test set may have different product categories or price ranges
  2. Overfitting: Model learned training-specific patterns that don't generalize
  3. Target Encoding Leakage: IPQ and category encodings may be too specific to training data
  4. Limited Folds: 3 folds may not capture full data variability

What Works Well

  1. IPQ (Item Pack Quantity) is the most important feature

    • Products sold in bulk have different unit pricing
    • 15+ regex patterns capture various formats
  2. TF-IDF Embeddings capture product-specific information

    • N-grams (1-3) capture phrases like "pack of", "premium quality"
    • SVD reduction preserves most information while reducing noise
  3. Clustering helps identify product categories

    • Similar products cluster together with similar prices
    • Cluster statistics provide strong price signals
  4. Target Encoding provides direct price signals

    • IPQ-based encoding captures pack size pricing patterns
    • Category-based encoding captures category-specific pricing
  5. Ensemble Diversity is crucial for robustness

    • Different loss functions handle different error types
    • Weighted averaging reduces overfitting

Challenges & Solutions

Challenge 1: High variance in prices ($0.13 - $2,796.00)

  • Solution: Log transforms, quantile scaling, stratified CV

Challenge 2: Text data is noisy and unstructured

  • Solution: Multiple regex patterns, TF-IDF with n-grams, SVD reduction

Challenge 3: Overfitting to training data (CV: 40-50% → Test: 56.2%)

  • Solution: Stratified K-fold CV, early stopping, ensemble diversity
  • Lesson Learned: Need more aggressive regularization and validation strategies

Challenge 4: Long training times

  • Solution: MiniBatchKMeans, reduced folds, optimized hyperparameters

🎓 Learning Outcomes

This project demonstrates:

  1. Advanced Feature Engineering

    • Text parsing with regex
    • Statistical feature creation
    • Interaction features
    • Target encoding
  2. Ensemble Learning

    • Multiple model types
    • Different objectives
    • Weighted averaging
    • Out-of-fold predictions
  3. Production ML Pipeline

    • Modular code structure
    • Progress tracking
    • Error handling
    • Reproducibility
  4. Optimization Techniques

    • Dimensionality reduction (SVD)
    • Batch processing (MiniBatchKMeans)
    • Early stopping
    • Hyperparameter tuning

🚀 Future Improvements

Based on the final competition result (SMAPE: 56.2%), here are key improvements to reduce the CV-test gap:

High Priority (Address Overfitting)

  1. More Robust Cross-Validation

    • Increase folds: 3 → 7 or 10 for better generalization
    • Use GroupKFold or TimeSeriesSplit if applicable
    • Implement adversarial validation to detect distribution shift
  2. Regularization Improvements

    • Stronger L1/L2 penalties
    • Lower learning rates with more rounds
    • Higher min_child_samples to prevent overfitting
    • Add dropout for neural network components
  3. Target Encoding Refinement

    • Add smoothing/noise to target encodings (see the sketch after this list)
    • Use leave-one-out encoding instead of global
    • Reduce reliance on training-specific statistics
  4. Feature Selection

    • Remove highly correlated features
    • Use feature importance to drop low-value features
    • Implement recursive feature elimination
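
One possible shape for the smoothing idea in item 3 (a blend toward the global mean weighted by group size; the smoothing value is an assumed hyperparameter):

def smoothed_target_encode(train, col, target='price', smoothing=20.0):
    # Small groups lean on the global mean; large groups keep their own mean.
    global_mean = train[target].mean()
    stats = train.groupby(col)[target].agg(['mean', 'count'])
    weight = stats['count'] / (stats['count'] + smoothing)
    return weight * stats['mean'] + (1 - weight) * global_mean  # Series indexed by col values

# train['ipq_price_smooth'] = train['ipq'].map(smoothed_target_encode(train, 'ipq'))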

Medium Priority (Enhance Features)

  1. Image Features: Use image_link to extract visual features

    • ResNet/EfficientNet embeddings
    • Color histograms, texture features
    • Product category classification from images
  2. Advanced Text Features

    • Transformer-based embeddings (BERT, RoBERTa)
    • Named entity recognition for brands
    • Sentiment analysis for product descriptions
  3. Hyperparameter Tuning

    • Optuna or Bayesian optimization
    • Grid search on key parameters
    • Ensemble different hyperparameter sets

Low Priority (Advanced Techniques)

  1. Stacking: Meta-model on top of base models
  2. Pseudo-labeling: Use test predictions iteratively
  3. External Data: Product category databases (if allowed)
  4. Outlier Handling: Better strategies for extreme prices

Lessons Learned

What to do differently:

  • Start with more conservative models (higher regularization)
  • Use more folds (7-10) even if it takes longer
  • Validate on holdout set that mimics test distribution
  • Monitor CV-test gap during development
  • Use simpler target encoding with smoothing

📞 Contact & Support

Team: Data Miners
Institution: Dayananda Sagar University, Bangalore
Competition: Amazon ML Challenge 2025
Date: January 2025

Team Members:

  • Putta Sujith
  • KAMANNAGARI BHAVITHAGNA
  • E. Yashas Kumar

🏆 Competition Results & Submission

Final Results

Amazon ML Challenge 2025

  • Final SMAPE Score: 56.2%
  • Leaderboard: Complete 75K test set evaluation
  • Team: Data Miners
  • Institution: Dayananda Sagar University, Bangalore

Submission Checklist

  • test_out.csv - Final predictions (75,000 rows)
  • ML_Approach_Documentation.txt - Detailed documentation
  • Source code - Well-commented training scripts
  • README.md - Project documentation
  • requirements.txt - Dependencies
  • LICENSE.md - License information
  • Final SMAPE: 56.2% achieved

Verification

  • Predictions file has correct format (sample_id, price)
  • No negative or zero prices
  • 75,000 predictions
  • Documentation explains approach
  • Source code is well-commented
  • All licenses are compliant
  • Team information is correct
  • Competition submission completed

🎉 Acknowledgments

  • Amazon ML Challenge 2025 for organizing the competition
  • Dayananda Sagar University for support and resources
  • LightGBM Team (Microsoft) for the excellent gradient boosting library
  • XGBoost Team (DMLC) for the robust ML framework
  • scikit-learn Community for comprehensive ML tools
  • Open Source Community for all the amazing libraries

📚 References

Libraries & Frameworks

  • LightGBM (Microsoft)
  • XGBoost (DMLC)
  • scikit-learn
  • pandas, NumPy, SciPy

Techniques

  • TF-IDF: Term Frequency-Inverse Document Frequency
  • SVD: Singular Value Decomposition
  • K-Means Clustering
  • Stratified K-Fold Cross-Validation
  • Ensemble Learning
  • Target Encoding

Built with ❤️ by Team Data Miners

Dayananda Sagar University, Bangalore

Last Updated: January 2025


📝 Quick Reference

File Naming Convention

  • train_*.py - Training scripts
  • test_out*.csv - Prediction outputs
  • *_Documentation.txt - Documentation files
  • *_GUIDE.txt - Guide files

Important Paths

  • Dataset: student_resource/dataset/
  • Features: features/
  • Models: models/
  • Scripts: scripts/

Key Commands

# Install dependencies
pip install -r requirements.txt

# Run training (choose one)
python train_ultra_fast.py        # Fastest
python train_best_fast.py         # Balanced
python train_ultra_optimized.py   # Best

# Check output
head test_out.csv
wc -l test_out.csv  # Should be 75001 (including header)

Expected Output Format

sample_id,price
TEST_000000,12.99
TEST_000001,24.50
TEST_000002,8.75
...

End of README
