This project began as an exercise to analyze an online retail dataset (Online_Retail.xlsx
) and prepare strategic questions for a hypothetical CEO and CMO, aiming to evaluate business performance and guide expansion. The journey evolved into a full-fledged data analysis and visualization project, where I used Python with pandas
and matplotlib
to create professional visuals: a bar chart of top-selling products and a table of revenue by country. This repository documents the entire process, from drafting business questions to overcoming technical challenges and producing polished outputs for sharing on GitHub and LinkedIn. The project reflects my learning path as an aspiring data analyst, building on my technical background in Computer Science and experience with tools like Python and Tableau.
The initial goal was to analyze the retail dataset to prepare for a strategic meeting with business leaders. I used the dataset’s columns—InvoiceNo
, StockCode
, Description
, Quantity
, InvoiceDate
, UnitPrice
, CustomerID
, and `Country—to formulate questions that would uncover insights into revenue drivers, customer behavior, and operational efficiency. The questions were tailored to the CEO’s focus on growth and the CMO’s interest in marketing optimization.
CEO Questions:
- What are the key drivers of revenue growth across product categories, and how do they correlate with customer demographics (e.g., country, purchase frequency)?
- How does average order value (AOV) vary by country, and what operational factors (e.g., shipping costs, product mix) influence these variations?
- What is the trend in customer retention and repeat purchase rates, and how does this impact long-term revenue sustainability?
- Which regions show the highest revenue contribution, and what operational challenges (e.g., cancellations, discounts) hinder scaling these markets?
CMO Questions:
- Which products have the highest sales volume and revenue, and how can we tailor marketing campaigns to promote them to similar customer segments?
- How does purchasing behavior vary by country, and what cultural or seasonal marketing strategies can boost engagement in key markets?
- What is the impact of discounts and cancellations on customer acquisition and retention, and how can we optimize promotional strategies?
- Which customer segments (based on purchase frequency, spend, or product preferences) are most responsive to marketing efforts, and how can we refine messaging?
Thought Process: The questions were designed to align with the dataset’s structure, leveraging metrics like revenue (Quantity * UnitPrice
), geographic segmentation (Country
), and customer loyalty (CustomerID
). I aimed to provide actionable insights for strategic planning and marketing, reflecting my understanding of business needs.
To showcase my analysis skills on LinkedIn, I decided to visualize key insights from the dataset, focusing on top-selling products and revenue by country. I chose Python with pandas
for data processing and matplotlib
for visualization, inspired by my prior experience with data projects (e.g., an ESG analytics dashboard). The goal was to create professional, McKinsey-style visuals that would appeal to my professional network and demonstrate my technical and analytical abilities.
The journey wasn’t without hurdles. Initially, I faced an openpyxl
import error when loading the Excel file, as pandas.read_excel
required this library. I resolved this by installing openpyxl
with pip install openpyxl
. Another issue arose when I tried running Python code directly in PowerShell, resulting in a “df not recognized” error. This was because I mistakenly executed script lines in the shell instead of within a Python environment. I learned to run the script correctly using python retail_visuals.py
and verified the Excel filename (Online_Retail.xlsx
vs. Online Retail.xlsx
). These challenges taught me the importance of environment setup and precise file handling, skills I’ll carry forward in future projects.
The final project analyzes the online retail dataset to identify top-selling products and revenue distribution by country, producing two visuals: a bar chart of the top 5 products by revenue and a table of revenue by country. These outputs were designed to support the strategic questions posed earlier, providing clear insights for business leaders. The project showcases my ability to clean, analyze, and visualize data using Python, with a focus on professional presentation for sharing on GitHub and LinkedIn.
The dataset, Online_Retail.xlsx
, contains transactional data from a UK-based online retailer (December 2010–December 2011). Key columns include:
InvoiceNo
: Transaction ID (cancellations start with 'C').StockCode
: Product code.Description
: Product name.Quantity
: Items purchased.InvoiceDate
: Transaction date.UnitPrice
: Price per item.CustomerID
: Customer identifier.Country
: Customer’s country.
The dataset’s structure enabled analysis of sales performance, customer segmentation, and geographic trends, making it ideal for business insights.
My approach was driven by the following considerations:
- Business Alignment: The visuals needed to address the CEO’s focus on revenue growth and the CMO’s need for marketing insights, building on the initial questions.
- Data Integrity: Retail datasets often have issues like cancellations or missing values. Thorough cleaning ensured accurate results.
- Metric Selection: Revenue (
Quantity * UnitPrice
) was the core metric, with product and country groupings to highlight key drivers. - Visualization Design: I chose a bar chart for products (to emphasize top performers) and a table for countries (to summarize geographic data concisely). Visuals needed to be clear, professional, and McKinsey-inspired (navy/gold color scheme, no overlapping labels).
- Learning Showcase: As part of my data analyst journey, I aimed to produce outputs that demonstrate technical skills and business acumen to my LinkedIn network.
The analysis was implemented in Python using pandas
and matplotlib
. Below are the steps:
- Loading: Loaded
Online_Retail.xlsx
withpandas.read_excel
, requiringopenpyxl
. - Cleaning:
- Removed cancellations (
InvoiceNo
starting with 'C') to focus on valid sales. - Filtered out records with negative
Quantity
, zero/negativeUnitPrice
, or missingCustomerID
.
- Removed cancellations (
- Rationale: Clean data was essential for accurate revenue calculations and reliable insights.
- Created a
Revenue
column (Quantity * UnitPrice
) for analysis. - Rationale: Revenue is a direct indicator of business performance, central to both visuals.
- Analysis: Grouped by
Description
, summedRevenue
, and selected the top 5 usingsort_values
andhead(5)
. - Visualization:
- Plotted a bar chart with
matplotlib.pyplot.bar
. - Used navy blue (
#003087
) bars and gold (#FFD700
) value labels for a professional look. - Added £-formatted labels above bars for clarity.
- Rotated x-axis labels 45 degrees and used
tight_layout
to avoid overlap.
- Plotted a bar chart with
- Output: Saved as
top_products.png
(300 DPI). - Rationale: A bar chart highlights top products, aiding marketing prioritization.
- Analysis: Grouped by
Country
, summedRevenue
, and calculated% of Total
for the top 5. - Visualization:
- Created a table with
matplotlib.pyplot.table
(Country
,Revenue (£)
,% of Total
). - Styled with navy headers, white header text, and alternating row colors (
#F5F6F5
/white). - Hid axes and added a bold title for focus.
- Created a table with
- Output: Saved as
country_revenue_table.png
(300 DPI). - Rationale: A table summarizes geographic contributions, supporting expansion decisions.
- Colors: Navy and gold for a sophisticated, business-friendly aesthetic.
- Clarity: Ensured no label overlap and used high-resolution outputs.
- Portability: PNG format for easy sharing on GitHub/LinkedIn.
- Rationale: Professional visuals enhance credibility and appeal to stakeholders.
The script (retail_visuals.py
) is structured as:
- Imports:
pandas
,matplotlib.pyplot
,numpy
. - Data Loading/Cleaning: Loads and filters the dataset.
- Revenue Calculation: Computes
Revenue
. - Bar Chart: Groups data, plots, and saves the chart.
- Table: Formats and saves the country revenue table.
- Output: Confirms file creation.
The code is modular and reusable, with comments for clarity.
The script produces:
top_products.png
: Bar chart of top 5 products by revenue (e.g., "WHITE HANGING HEART T-LIGHT HOLDER"), navy bars, gold labels.country_revenue_table.png
: Table of top 5 countries (e.g., United Kingdom), with revenue and percentage, navy headers, alternating rows.
View them here:
- Python 3.8+
- Libraries:
pandas
,matplotlib
,numpy
,openpyxl
- Dataset:
Online_Retail.xlsx
- Clone the repository:
git clone https://github.com/your-username/retail-sales-analysis.git cd retail-sales-analysis
- Install dependencies:
pip install pandas matplotlib numpy openpyxl
- Place
Online_Retail.xlsx
in the project directory.
- Execute:
python retail_visuals.py
- Outputs:
top_products.png
,country_revenue_table.png
, and a confirmation message.
- Openpyxl Error: Resolved by installing
openpyxl
(pip install openpyxl
). - PowerShell Error: Mistakenly ran Python code in PowerShell. Fixed by running the script via
python retail_visuals.py
. - Filename Mismatch: Ensured the Excel filename matched the script (
Online_Retail.xlsx
). - Label Overlap: Used
plt.tight_layout
and rotated labels to ensure clarity.
- Interactive Visuals: Use
plotly
for hoverable charts. - Extended Analysis: Explore customer retention or seasonal trends with
InvoiceDate
. - Tableau Integration: Create complementary Tableau dashboards, as explored earlier.
- Automation: Script dynamic analysis for different time periods or categories.
This project was a milestone in my transition to a data analyst role, building on my Computer Science background and experience with Python and Tableau (e.g., past ESG dashboard project). It demonstrates my ability to analyze real-world data, overcome technical challenges, and present insights professionally. Sharing this on GitHub and LinkedIn connects me with the data analytics community and showcases my growth.
MIT License. See LICENSE.
- Dataset: UCI Machine Learning Repository.
- Inspired by my goal to excel in data analytics and share my journey.
Explore the code and visuals, and feel free to provide feedback!