Project Title: End-to-End Sales Analysis using Excel Power Query
-
Project Overview This project demonstrates an end-to-end data analysis pipeline focused on a sales dataset. The entire process, including data extraction, transformation, loading (ETL), and data cleaning, was conducted using Microsoft Excel's Power Query. The goal was to transform raw, messy data into a clean, structured dataset ready for business intelligence and strategic decision-making.
-
The Challenge: Unlocking Insights from Messy Data The initial datasets were in a raw format, containing:
Inconsistent data types
Missing values (NA and null)
Redundant and irrelevant columns
Disparate tables requiring integration
The objective was to systematically clean and prepare this data to enable accurate and meaningful sales analysis.
- Methodology: A Step-by-Step ETL Process The entire ETL process was performed within the Power Query Editor.
A. Data Import
Source: The data was imported from multiple Excel files (or a single workbook with multiple sheets) into Power Query.
B. Data Transformation & Cleaning
FactInternetSales:
New columns like Total Revenue and COGS were engineered to calculate key profitability metrics.
Product:
Columns with a high percentage of nulls were removed.
NA values in the Color column were replaced with 'Unspecified' to retain data integrity.
Date:
The table was filtered to analyze data only from 2005 to 2008.
New columns like Year, Month, Quarter, and a conditional WeekType were created to enable time-based analysis.
Customer:
First and last names were merged into a single CustomerName column for simplified analysis.
The table was cleaned by handling nulls and selecting only important columns.
Geography:
The table was streamlined by removing columns of low importance and retaining key geographical details.
- Tools Used Microsoft Excel: Used for all data manipulation, analysis, and visualization.
Power Query: The primary tool for the entire ETL process.
- Future Work Predictive Analysis: Explore the use of machine learning algorithms to identify key features contributing to sales performance. This would involve a transition to tools like Python and scikit-learn.