- Overview
- Dataset
- Tools
- Objectives
- Data Cleaning Process
- How to Use
- Contributions
- Future Improvements
- Contact
## Overview

This project focuses on cleaning and preparing a dataset of layoffs for further analysis. The dataset, originally from Kaggle, was imported into MySQL, then cleaned and standardized to ensure consistency and accuracy.
## Dataset

- Original Dataset: Layoffs Dataset on Kaggle
- Raw Data: `layoffs.json` (imported into MySQL)
- Cleaned Data: `final_cleaned_data.csv` (final output after cleaning)
## Tools

- SQL: Used for data processing, transformation, and scripting tasks within MySQL.
- Database Management System: MySQL to host, manage, and manipulate the dataset.
- Data Cleaning Tools: SQL for handling missing values, removing duplicates, and standardizing data formats.
## Objectives

- Handle missing or null values.
- Remove duplicate records.
- Ensure consistent formatting for dates, strings, and numerical values.
- Standardize column values for better analysis.
## Data Cleaning Process

The data cleaning was done in multiple phases, as outlined below.

### 1. Data Import and Duplicate Removal
- The raw data was first imported into a MySQL table `layoffs` from `layoffs.json`. The original dataset contained 3887 rows.
- A staging table `layoffs_staging` was created to preserve the original data. Using the `ROW_NUMBER()` window function, duplicates were identified and removed based on specific columns (company, location, total laid off, etc.), as sketched below.
- Two duplicates were identified, for 'Beyond Meat' and 'Cazoo', and were successfully removed.
- After removing duplicates, the dataset contained 3885 rows.
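A minimal sketch of this step, assuming MySQL 8.0+ for `ROW_NUMBER()`; the partition columns are inferred from the description above, and the actual `Data Cleaning for Layoffs.sql` script is authoritative:

```sql
-- Preserve the raw data in a staging table before any destructive steps.
CREATE TABLE layoffs_staging LIKE layoffs;
INSERT INTO layoffs_staging SELECT * FROM layoffs;

-- Number identical rows; any row with row_num > 1 is a duplicate.
CREATE TABLE layoffs_staging2 AS
SELECT *,
       ROW_NUMBER() OVER (
           PARTITION BY company, location, industry, total_laid_off,
                        percentage_laid_off, `date`, country, funds_raised
       ) AS row_num
FROM layoffs_staging;

DELETE FROM layoffs_staging2 WHERE row_num > 1;
```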
### 2. Data Standardization

- Whitespace Cleanup: Removed extra spaces from text columns (e.g., `company`).
- Misspelled Locations: Corrected common misspellings in the `location` column (e.g., "Ferdericton" → "Fredericton").
- Country Standardization: Unified variations of country names (e.g., "UAE" → "United Arab Emirates"). These updates are sketched below.
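A sketch of these updates, using only the example values named above (the full script likely covers more cases):

```sql
-- Trim stray whitespace from text columns.
UPDATE layoffs_staging2 SET company = TRIM(company);

-- Fix known misspellings in the location column.
UPDATE layoffs_staging2
SET location = 'Fredericton'
WHERE location = 'Ferdericton';

-- Unify country name variants.
UPDATE layoffs_staging2
SET country = 'United Arab Emirates'
WHERE country = 'UAE';
```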
### 3. Data Type Conversions

- Date Formatting: Converted the `date` column from text to the proper `DATE` type.
- Numeric Conversions: Converted the `total_laid_off` and `funds_raised` columns from text to integers, rounding where necessary. Both conversions are sketched below.
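A sketch of the conversions; the `'%m/%d/%Y'` format string is an assumption about the raw text dates and should be matched to the actual data:

```sql
-- Parse the text dates, then tighten the column type.
-- (In strict mode, blank strings should be set to NULL first.)
UPDATE layoffs_staging2
SET `date` = STR_TO_DATE(`date`, '%m/%d/%Y');

ALTER TABLE layoffs_staging2 MODIFY COLUMN `date` DATE;

-- Round the numeric text values, then convert the columns to integers.
UPDATE layoffs_staging2
SET total_laid_off = ROUND(total_laid_off),
    funds_raised   = ROUND(funds_raised);

ALTER TABLE layoffs_staging2
    MODIFY COLUMN total_laid_off INT,
    MODIFY COLUMN funds_raised INT;
```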
### 4. Handling Null Values and Empty Rows

- Null values in key columns (`industry`, `total_laid_off`, `percentage_laid_off`) were handled by replacing empty strings with NULL values.
- Empty rows, where both `total_laid_off` and `percentage_laid_off` were NULL, were removed. This reduced the dataset to 3248 rows. A sketch of this phase follows.
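A sketch of this phase, assuming it runs while the affected columns are still stored as text:

```sql
-- Convert empty strings to proper NULLs in the key columns.
UPDATE layoffs_staging2
SET industry            = NULLIF(industry, ''),
    total_laid_off      = NULLIF(total_laid_off, ''),
    percentage_laid_off = NULLIF(percentage_laid_off, '');

-- Drop rows that carry no layoff information at all.
DELETE FROM layoffs_staging2
WHERE total_laid_off IS NULL
  AND percentage_laid_off IS NULL;
```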
### 5. Export

- The final cleaned dataset, now with standardized data, was exported to `final_cleaned_data.csv` for further use in analysis. One possible export command is sketched below.
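One way to export directly from the MySQL server; the output path is illustrative and must sit inside the server's `secure_file_priv` directory (exporting via a client such as MySQL Workbench also works):

```sql
SELECT company, location, industry, total_laid_off,
       percentage_laid_off, `date`, country, funds_raised
INTO OUTFILE '/var/lib/mysql-files/final_cleaned_data.csv'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
FROM layoffs_staging2;
```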
## How to Use

To reproduce the cleaning process, you will need:

- MySQL or any compatible database system.
- The dataset files (`layoffs.json`, `layoffs.csv`, `Data Cleaning for Layoffs.sql`, and `final_cleaned_data.csv`).
1. Set up the MySQL database:
   - Import `layoffs.json` into the MySQL database using the provided SQL script.
   - Create the necessary tables (`layoffs`, `layoffs_staging`, `layoffs_staging2`); a sketch of a possible table definition follows.
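   A sketch of a possible raw table definition (the column names and text types are assumptions inferred from the cleaning steps above; the provided script is authoritative):

   ```sql
   -- Raw columns are imported as text; types are tightened during cleaning.
   CREATE TABLE layoffs (
       company VARCHAR(255),
       location VARCHAR(255),
       industry VARCHAR(255),
       total_laid_off TEXT,
       percentage_laid_off TEXT,
       `date` TEXT,
       country VARCHAR(255),
       funds_raised TEXT
   );
   ```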
2. Execute the cleaning SQL script:
   - Run the `Data Cleaning for Layoffs.sql` script to perform the data cleaning steps. This script includes commands for:
     - Removing duplicates.
     - Standardizing data formats.
     - Handling missing values.
     - Dropping unnecessary columns.
3. Output:
   - After executing the SQL script, the cleaned dataset will be saved as `final_cleaned_data.csv`.
## Contributions

Feel free to fork the repository and contribute by suggesting improvements or submitting pull requests. This project is part of ongoing efforts to clean and analyze datasets for meaningful insights.
## Future Improvements

- Incorporating additional analysis to identify trends in layoffs across different industries or locations.
- Enhancing the data validation process to automatically detect and handle other potential anomalies in future datasets.
## Contact

For questions or feedback, feel free to reach out to me on GitHub or via email.