This project demonstrates a complete workflow to download, process, and analyze retail orders data using the Kaggle API, Python, and SQL. It covers:
- Setting up Kaggle API credentials
- Downloading and extracting the dataset
- Data cleaning and feature engineering with Python (pandas)
- Storing data in a SQLite database
- Running analytical SQL queries for business insights
## Contents

- Prerequisites
- Setup Kaggle API
- Download and Extract Dataset
- Data Processing with Python
- Load Data into SQLite
- SQL Queries for Analysis
- Project Structure
## Prerequisites

Make sure you have Python 3.7+ installed along with pip. SQLite support ships with Python's standard library (`sqlite3`), so no separate database server is needed.

Install the required Python packages:

```bash
pip install kaggle pandas
```

## Setup Kaggle API

Create a Kaggle account and generate your API token:
- Go to your Kaggle account settings and find the API section
- Click "Create New API Token" to download `kaggle.json`

Save your credentials securely using this script:
```python
import os
import json

# Replace the placeholders with the values from your kaggle.json
kaggle_api = {
    "username": "YOUR_KAGGLE_USERNAME",
    "key": "YOUR_KAGGLE_API_KEY"
}

# The Kaggle CLI expects credentials at ~/.kaggle/kaggle.json
kaggle_dir = os.path.join(os.path.expanduser('~'), '.kaggle')
os.makedirs(kaggle_dir, exist_ok=True)

with open(os.path.join(kaggle_dir, 'kaggle.json'), 'w') as f:
    json.dump(kaggle_api, f)

# Restrict file permissions so only your user can read the API key
os.chmod(os.path.join(kaggle_dir, 'kaggle.json'), 0o600)
print("Kaggle API credentials saved.")
```

## Download and Extract Dataset

Download the dataset `orders.csv.zip` to the `~/.kaggle/` directory using the Kaggle CLI.
Then extract the dataset programmatically:
```python
import zipfile
import os

zip_path = os.path.join(os.path.expanduser('~'), '.kaggle', 'orders.csv.zip')
extract_to = os.path.join(os.getcwd(), 'retail_orders_data')

# Unpack the archive into a project-local data folder
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(extract_to)

print(f"Dataset extracted to: {extract_to}")
```

## Data Processing with Python

Now clean and process the data using pandas:
```python
import pandas as pd

csv_path = 'retail_orders_data/orders.csv'

# Treat the dataset's placeholder strings as proper missing values
df = pd.read_csv(csv_path, na_values=['Not Available', 'unknown'])

# Normalize column names: lowercase, spaces -> underscores
df.columns = df.columns.str.lower().str.replace(' ', '_')

# Derive discount, sale price, and profit from the raw price columns
df['discount'] = df['list_price'] * df['discount_percent'] * 0.01
df['sale_price'] = df['list_price'] - df['discount']
df['profit'] = df['sale_price'] - df['cost_price']

# Parse order dates and drop raw columns that are no longer needed
df['order_date'] = pd.to_datetime(df['order_date'], format='%Y-%m-%d')
df.drop(columns=['list_price', 'cost_price', 'discount_percent'], inplace=True)
```
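Before loading, it can help to sanity-check the result (a minimal sketch; the column names follow from the renaming and feature-engineering steps above):

```python
# Quick look at the engineered columns and parsed dtypes
print(df[['order_date', 'discount', 'sale_price', 'profit']].head())
print(df.dtypes)
```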
## Load Data into SQLite

Push the cleaned data into a local SQLite database:

```python
import sqlite3

# Create (or open) the database file and write the DataFrame to an 'orders' table
conn = sqlite3.connect('mydatabase.db')
df.to_sql('orders', conn, if_exists='replace', index=False)
conn.close()
```
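A quick way to confirm the load succeeded (a sketch using the same database file and table name as above):

```python
import sqlite3

conn = sqlite3.connect('mydatabase.db')
row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(f"Loaded {row_count} rows into the 'orders' table")
conn.close()
```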
## SQL Queries for Analysis

Run SQL queries to generate insights such as:

- Top 10 highest revenue-generating products
- Top 5 best-selling products in each region
- Monthly sales comparison between 2022 and 2023
- Highest-sales month for each product category
- Sub-category with the highest growth rate
Example query to find top 10 products by sales:
```sql
SELECT
    product_id,
    SUM(sale_price) AS total_sales
FROM orders
GROUP BY product_id
ORDER BY total_sales DESC
LIMIT 10;
```
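The monthly 2022-vs-2023 comparison can be handled the same way. This is a sketch that assumes `order_date` was written by pandas as ISO-formatted text, which is how `to_sql` stores datetimes in SQLite:

```sql
-- Compare monthly sales across 2022 and 2023
-- Assumes order_date is stored as ISO text (e.g. '2023-03-01 00:00:00')
SELECT
    strftime('%m', order_date) AS month,
    SUM(CASE WHEN strftime('%Y', order_date) = '2022' THEN sale_price ELSE 0 END) AS sales_2022,
    SUM(CASE WHEN strftime('%Y', order_date) = '2023' THEN sale_price ELSE 0 END) AS sales_2023
FROM orders
GROUP BY month
ORDER BY month;
```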
## Project Structure

```text
retail-orders-analysis/
│
├── setup_kaggle.py        # Script to set up Kaggle API credentials
├── extract_dataset.py     # Script to extract the downloaded dataset
├── data_processing.py     # Python script for cleaning and feature engineering
├── load_to_db.py          # Script to load processed data into the SQLite DB
├── queries.sql            # SQL queries for data analysis
├── retail_orders_data/    # Extracted dataset folder (orders.csv)
└── README.md              # Project overview and instructions (this file)
```
