- Purpose
- Performance Indicators
- Data Overview
- Initial Data Observations
- Data Loading
- Data Cleaning And Transformation
- Data Exploration And Key Insights
During the global disruption caused by COVID-19, mass layoffs reshaped industries, economies, and livelihoods. This project uses SQL to clean and analyze real-world layoff data from the pandemic period—resolving inconsistencies, standardizing formats, and preparing the dataset for meaningful exploration.
Through structured SQL queries, the project uncovers key trends in layoff volume, industry impact, and geographic distribution—offering a data-driven lens into how different sectors and regions were affected. Built entirely in MySQL, the workflow emphasizes clarity, reproducibility, and readiness for downstream visualization or reporting.
This project explores the global layoff landscape during the COVID-19 pandemic through a structured SQL-based analysis. The following performance indicators were used to uncover key patterns, outliers, and trends in the dataset.
These metrics highlight the most significant layoff events and entities impacted:
- Peak Layoff Day: The single day with the highest number of reported layoffs.
- Most Affected Company: The organization with the largest total layoffs.
- Most Impacted Industry: The industry that experienced the highest cumulative layoffs.
- Country with Peak Layoff Day: The country that recorded the highest layoffs on a single day.
These provide supporting insights into the broader layoff distribution:
- Minimum Layoff Event: The lowest number of layoffs recorded in a single event.
- Least Affected Company: The company with the smallest layoff count (excluding zero).
- Least Impacted Industry: The industry with the lowest total layoffs.
- Country with Lowest Layoffs: The country with the smallest cumulative layoffs.
These indicators help contextualize the layoffs across time and organizational structure:
- 100% Layoff Companies: List of companies that reported complete workforce layoffs.
- Layoff Date Range: Start and end dates of the recorded layoff events.
- Annual Layoff Totals: Year-wise aggregation of total layoffs.
- Monthly Layoff Totals: Month-wise breakdown of layoffs across the timeline.
- Layoff Progression: Temporal trend showing how layoffs evolved over time.
This project is based on a publicly available dataset that captures global layoff events during the COVID-19 pandemic. The dataset includes detailed records of layoffs across various companies, industries, and countries—highlighting the economic impact of the pandemic on the global workforce.
The data was sourced from a raw CSV file and processed using SQL for cleansing, transformation, and exploratory analysis. It serves as the foundation for uncovering patterns in layoff frequency, industry vulnerability, and geographic distribution.
- Source: Raw CSV file (uploaded locally)
- Format: CSV (Comma-Separated Values)
- Rows: 9,217 (including header)
- Columns: 11
- Time Period Covered: January 2020 to March 2023 (based on layoff dates)
- Key Fields: Copany, Industry, Total Layoffs, and Country
A preliminary review of the raw layoff dataset revealed several structural and quality issues that required attention before analysis:
-
Duplicate Records: Multiple identical entries were present across key fields such as company, location, industry, and layoff date, indicating potential redundancy.
-
Inconsistent Text Formatting:
- Company and country names included trailing spaces and punctuation (e.g., "United States.").
- Industry names appeared with inconsistent labels (e.g., "Crypto Currency", "Crypto.com", etc.), requiring standardization.
-
Improper Date Format: The
date
column was stored as plain text inMM/DD/YYYY
format, which limited its usability for time-based analysis. -
Missing and Empty Values:
- Several rows had missing or empty values in critical fields like
industry
,total_laid_off
, andpercentage_laid_off
. - Some companies had partial data that could potentially be inferred from other entries.
- Several rows had missing or empty values in critical fields like
-
Mixed Data Types: Numeric fields such as
funds_raised_millions
andtotal_laid_off
were inconsistently populated, and some text fields contained numeric-like values.
These observations highlighted the need for a structured data cleaning process to ensure consistency, accuracy, and analytical readiness.
-
The dataset was imported into MySQL using the following workflow:
- Opened MySQL Workbench and created a new database schema for the project.
- Used the
Table Data Import Wizard
to load the CSV file into a staging table (layoffs_staging
). - Verified column mappings and data types during the import process to ensure alignment with the CSV structure.
- Confirmed successful import by previewing the first few rows and checking for nulls or misaligned fields.
-
A duplicate of the staging table was created to preserve the raw data and enable safe transformation during the cleaning phase.
The dataset underwent a structured SQL-based cleaning process in MySQL to ensure consistency, accuracy, and readiness for analysis. Key transformations are outlined below:
1. Duplicate Table Creation
Created a staging table to preserve the original dataset and imported all records for safe transformation:
CREATE TABLE layoffs_staging LIKE layoffs;
INSERT layoffs_staging SELECT * FROM layoffs;
2. Row Number Generation
Added a row_num column using ROW_NUMBER() to identify potential duplicate entries based on key fields:
SELECT *,
ROW_NUMBER() OVER(
PARTITION BY company, industry, total_laid_off, percentage_laid_off, `date`
) AS row_num
FROM layoffs_staging;
3. Duplicate Detection via CTE
Used a Common Table Expression (CTE) to isolate duplicate records for review:
WITH duplicate_cte AS (
SELECT *,
ROW_NUMBER() OVER(
PARTITION BY company, location, industry, total_laid_off, percentage_laid_off, `date`, stage, country, funds_raised_millions
) AS row_num
FROM layoffs_staging
)
SELECT * FROM duplicate_cte
WHERE row_num > 1;
4. New Table with Row Number
Created layoffs_staging2 to include the row_num column, as MySQL does not allow column deletion in CTEs:
CREATE TABLE layoffs_staging2 (
company TEXT,
location TEXT,
industry TEXT,
total_laid_off INT DEFAULT NULL,
percentage_laid_off TEXT,
date TEXT,
stage TEXT,
country TEXT,
funds_raised_millions INT DEFAULT NULL,
row_num INT
);
INSERT INTO layoffs_staging2
SELECT *,
ROW_NUMBER() OVER(
PARTITION BY company, location, industry, total_laid_off, percentage_laid_off, `date`, stage, country, funds_raised_millions
) AS row_num
FROM layoffs_staging;
5. Duplicate Removal
Identified and removed duplicate rows based on the row_num column:
SELECT * FROM layoffs_staging2 WHERE row_num > 1;
DELETE FROM layoffs_staging2 WHERE row_num > 1;
6. Standardizing Text Fields
Trimmed whitespace, corrected inconsistent labels, and removed punctuation:
UPDATE layoffs_staging2 SET company = TRIM(company);
UPDATE layoffs_staging2 SET industry = 'Crypto' WHERE industry LIKE 'Crypto%';
UPDATE layoffs_staging2
SET country = TRIM(TRAILING '.' FROM country)
WHERE country LIKE 'United States%';
7. Date Formatting
Converted the date column from text to proper DATE format:
UPDATE layoffs_staging2
SET `date` = STR_TO_DATE(`date`, '%m/%d/%Y');
ALTER TABLE layoffs_staging2
MODIFY COLUMN `date` DATE;
8. Handling Null and Empty Values
Identified missing values and used self-joins to fill in missing industry data:
UPDATE layoffs_staging2 SET industry = NULL WHERE industry = '';
UPDATE layoffs_staging2 t1
JOIN layoffs_staging2 t2 ON t1.company = t2.company
SET t1.industry = t2.industry
WHERE (t1.industry IS NULL OR t1.industry = '')
AND t2.industry IS NOT NULL;
9. Cleanup
Dropped the temporary row_num column after all transformations were complete:
ALTER TABLE layoffs_staging2 DROP COLUMN row_num;
After cleaning and transforming the dataset in MySQL, I used SQL queries to explore patterns in layoff volume, industry impact, geographic distribution, and temporal trends. These insights helped shape the analytical narrative and guided the design of performance indicators.
2. Which company had the most layoffs?
SELECT MAX(company), MAX(total_laid_off) FROM layoffs_staging2
3. Which industry was most impacted?
SELECT MAX(total_laid_off), MAX(industry) FROM layoffs_staging2
4. Which country had the highest layoff in a day?
SELECT MAX(total_laid_off), MAX(country) FROM layoffs_staging2
6. Which company had the lowest layoffs?
SELECT MIN(company), MIN(total_laid_off) FROM layoffs_staging2
7. Which industry was least impacted?
SELECT MIN(industry), MIN(total_laid_off) FROM layoffs_staging2
8. Which country had the lowest number of layoffs?
SELECT MIN(total_laid_off), MIN(country) FROM layoffs_staging2
9. Which companies had 100% layoffs?
SELECT company, total_laid_off, percentage_laid_off
FROM layoffs_staging2
WHERE percentage_laid_off = 1 AND total_laid_off IS NOT NULL
ORDER BY 2 DESC
10. What was the date range for these layoffs?
SELECT MIN(`date`), MAX(`date`) FROM layoffs_staging2
11. What was the total layoff per year?
SELECT YEAR(`date`), SUM(total_laid_off)
FROM layoffs_staging2
GROUP BY YEAR(`date`)
ORDER BY 1 DESC
12. What was the total layoff per month?
SELECT SUBSTRING(`date`,1,7) AS `MONTH`, SUM(total_laid_off)
FROM layoffs_staging2
WHERE SUBSTRING(`date`,1,7) IS NOT NULL
GROUP BY `MONTH`
ORDER BY 1
13. What was the progression of layoffs over time?
WITH Rolling_Total AS (
SELECT SUBSTRING(`date`,1,7) AS `MONTH`, SUM(total_laid_off) AS total_laidoff
FROM layoffs_staging2
WHERE SUBSTRING(`date`,1,7) IS NOT NULL
GROUP BY `MONTH`
ORDER BY 1
)
SELECT `MONTH`, total_laidoff, SUM(total_laidoff) OVER(ORDER BY `MONTH`) AS rolling_total
FROM Rolling_Total
Note
The full coding details can be found on the SQL file attached to the repositry. Some of null values not been removed from certain columns as the other information in the row found to be usfull for EDA.