This project focuses on cleaning and preprocessing a dataset containing information on company layoffs. The dataset includes details such as company names, locations, industries, total layoffs, funding raised, and more. The goal of this project is to enhance data quality by removing inconsistencies, handling missing values, and ensuring standardized formatting using MySQL.
- File Name: `layoffs.csv`
- Total Entries: 2,361 rows
- Columns: 9
  - `company`: Name of the company
  - `location`: Company headquarters
  - `industry`: Industry category
  - `total_laid_off`: Number of employees laid off
  - `percentage_laid_off`: Percentage of workforce laid off
  - `date`: Layoff announcement date
  - `stage`: Funding stage of the company
  - `country`: Country of the company
  - `funds_raised_millions`: Total funds raised in millions
- Used a Common Table Expression (CTE) to detect and eliminate duplicate records.
- Ensured consistent text formatting (e.g., removing extra spaces and fixing capitalization issues).
- Identified `NULL` or blank values and addressed them appropriately.
- Applied imputation techniques where necessary.
- Removed columns that were unnecessary for analysis.
- Created two staging tables (`layoffs_staging` and `layoffs_staging2`) to preserve original data and perform transformations efficiently.
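
The formatting, missing-value, and column-removal steps above might look roughly like the following in MySQL. This is a minimal sketch, not the project's actual script: it assumes the cleanup runs against `layoffs_staging2`, that `company` and `industry` are the columns needing attention, that imputation means copying a value from another row for the same company, and that a helper column named `row_num` (hypothetical) was added during deduplication.

```sql
-- Remove leading and trailing whitespace from company names.
UPDATE layoffs_staging2
SET company = TRIM(company);

-- Convert blank strings to NULL so a single IS NULL check covers both cases.
UPDATE layoffs_staging2
SET industry = NULL
WHERE industry = '';

-- Assumed imputation rule: fill a missing industry from another row
-- that belongs to the same company and has the value populated.
UPDATE layoffs_staging2 AS t1
JOIN layoffs_staging2 AS t2
  ON t1.company = t2.company
SET t1.industry = t2.industry
WHERE t1.industry IS NULL
  AND t2.industry IS NOT NULL;

-- Hypothetical: drop a helper column (here row_num) once it is no longer needed.
ALTER TABLE layoffs_staging2
DROP COLUMN row_num;
```

Rows whose company has no populated `industry` anywhere in the table stay `NULL` under this rule, so they would still need a separate decision.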
- MySQL – Used for executing SQL queries and performing data cleaning operations.
- Jupyter Notebook / Python (Optional) – Could be used for further exploratory data analysis (EDA).
The cleaning process was executed using SQL queries, including:
- `CREATE TABLE` – To create staging tables for transformation.
- `INSERT INTO` – To populate the staging tables.
- `ROW_NUMBER() OVER(PARTITION BY...)` – To detect duplicate records.
- `DELETE` – To remove unwanted rows.
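
The statements listed above could be combined along these lines. Treat it as a sketch under stated assumptions rather than the repository's exact queries: the raw import is assumed to live in a table named `layoffs`, a row counts as a duplicate only when all nine columns match, and `row_num` is an illustrative helper column. It requires MySQL 8.0 or later for CTEs and window functions.

```sql
-- Copy the raw table (assumed name: layoffs) so the original stays untouched.
CREATE TABLE layoffs_staging LIKE layoffs;

INSERT INTO layoffs_staging
SELECT * FROM layoffs;

-- Inspect duplicates: any row with row_num > 1 repeats an earlier row.
WITH duplicate_cte AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY company, location, industry, total_laid_off,
                            percentage_laid_off, `date`, stage, country,
                            funds_raised_millions
           ) AS row_num
    FROM layoffs_staging
)
SELECT *
FROM duplicate_cte
WHERE row_num > 1;

-- MySQL does not allow DELETE against a CTE, so the row numbers are
-- materialized in a second staging table before deleting.
CREATE TABLE layoffs_staging2 LIKE layoffs_staging;

ALTER TABLE layoffs_staging2
ADD COLUMN row_num INT;

INSERT INTO layoffs_staging2
SELECT *,
       ROW_NUMBER() OVER (
           PARTITION BY company, location, industry, total_laid_off,
                        percentage_laid_off, `date`, stage, country,
                        funds_raised_millions
       ) AS row_num
FROM layoffs_staging;

-- Keep the first row of each duplicate group and remove the rest.
DELETE FROM layoffs_staging2
WHERE row_num > 1;
```

Materializing `row_num` in `layoffs_staging2` is one plausible reason for having a second staging table. In clients that enable safe update mode, the final `DELETE` may be rejected until that mode is turned off for the session.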
Data cleaning is a crucial step in data analysis and machine learning. Unclean data can lead to inaccurate insights and faulty decision-making. By applying systematic data cleaning techniques, we ensure that our dataset is reliable, consistent, and ready for further analysis.
- Clone this repository to your local system.
  `git clone https://github.com/your-username/data-cleaning-project.git`