- Overview
- Dataset
- Tools
- Objectives
- Data Cleaning Process
- How to Use
- Contributions
- Future Improvements
- Contact
## Overview

This project focuses on cleaning and preparing a dataset of layoffs for further analysis. The dataset, originally from Kaggle, was imported into MySQL, then cleaned and standardized to ensure consistency and accuracy.
## Dataset

- Original Dataset: Layoffs Dataset on Kaggle
- Raw Data: `layoffs.json` (imported into MySQL)
- Cleaned Data: `final_cleaned_data.csv` (final output after cleaning)
## Tools

- SQL: Used for data processing, transformation, and scripting tasks within MySQL.
- Database Management System: MySQL to host, manage, and manipulate the dataset.
- Data Cleaning Tools: SQL for handling missing values, removing duplicates, and standardizing data formats.
## Objectives

- Handle missing or null values.
- Remove duplicate records.
- Ensure consistent formatting for dates, strings, and numerical values.
- Standardize column values for better analysis.
## Data Cleaning Process

The data cleaning was done in multiple phases, as outlined below.

### 1. Data Import and Duplicate Removal
- The raw data was first imported into a MySQL table `layoffs` from `layoffs.json`. The original dataset contained 3887 rows.
- A staging table `layoffs_staging` was created to preserve the original data. Using the `ROW_NUMBER()` window function, duplicates were identified and removed based on specific columns (company, location, total laid off, etc.), as sketched below.
- Two duplicates were identified, for 'Beyond Meat' and 'Cazoo', and were successfully removed.
- After removing duplicates, the dataset contained 3885 rows.
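A minimal sketch of this step, assuming MySQL 8.0+ for `ROW_NUMBER()`; the partition columns are inferred from the description above, and the actual `Data Cleaning for Layoffs.sql` script is authoritative:

```sql
-- Preserve the raw data in a staging table before any destructive steps.
CREATE TABLE layoffs_staging LIKE layoffs;
INSERT INTO layoffs_staging SELECT * FROM layoffs;

-- Number identical rows; any row with row_num > 1 is a duplicate.
CREATE TABLE layoffs_staging2 AS
SELECT *,
       ROW_NUMBER() OVER (
           PARTITION BY company, location, industry, total_laid_off,
                        percentage_laid_off, `date`, country, funds_raised
       ) AS row_num
FROM layoffs_staging;

DELETE FROM layoffs_staging2 WHERE row_num > 1;
```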
### 2. Data Standardization

- Whitespace Cleanup: Removed extra spaces from text columns (e.g., `company`).
- Misspelled Locations: Corrected common misspellings in the `location` column (e.g., "Ferdericton" → "Fredericton").
- Country Standardization: Unified variations of country names (e.g., "UAE" → "United Arab Emirates"). These updates are sketched below.
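A sketch of these updates, using only the example values named above (the full script likely covers more cases):

```sql
-- Trim stray whitespace from text columns.
UPDATE layoffs_staging2 SET company = TRIM(company);

-- Fix known misspellings in the location column.
UPDATE layoffs_staging2
SET location = 'Fredericton'
WHERE location = 'Ferdericton';

-- Unify country name variants.
UPDATE layoffs_staging2
SET country = 'United Arab Emirates'
WHERE country = 'UAE';
```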
### 3. Data Type Conversions

- Date Formatting: Converted the `date` column from text to the proper `DATE` type.
- Numeric Conversions: Converted the `total_laid_off` and `funds_raised` columns from text to integers, rounding where necessary. Both conversions are sketched below.
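A sketch of the conversions; the `'%m/%d/%Y'` format string is an assumption about the raw text dates and should be matched to the actual data:

```sql
-- Parse the text dates, then tighten the column type.
-- (In strict mode, blank strings should be set to NULL first.)
UPDATE layoffs_staging2
SET `date` = STR_TO_DATE(`date`, '%m/%d/%Y');

ALTER TABLE layoffs_staging2 MODIFY COLUMN `date` DATE;

-- Round the numeric text values, then convert the columns to integers.
UPDATE layoffs_staging2
SET total_laid_off = ROUND(total_laid_off),
    funds_raised   = ROUND(funds_raised);

ALTER TABLE layoffs_staging2
    MODIFY COLUMN total_laid_off INT,
    MODIFY COLUMN funds_raised INT;
```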
### 4. Handling Null Values and Empty Rows

- Null values in key columns (`industry`, `total_laid_off`, `percentage_laid_off`) were handled by replacing empty strings with NULL values.
- Empty rows, where both `total_laid_off` and `percentage_laid_off` were NULL, were removed. This reduced the dataset to 3248 rows. A sketch of this phase follows.
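A sketch of this phase, assuming it runs while the affected columns are still stored as text:

```sql
-- Convert empty strings to proper NULLs in the key columns.
UPDATE layoffs_staging2
SET industry            = NULLIF(industry, ''),
    total_laid_off      = NULLIF(total_laid_off, ''),
    percentage_laid_off = NULLIF(percentage_laid_off, '');

-- Drop rows that carry no layoff information at all.
DELETE FROM layoffs_staging2
WHERE total_laid_off IS NULL
  AND percentage_laid_off IS NULL;
```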
### 5. Export

- The final cleaned dataset, now with standardized data, was exported to `final_cleaned_data.csv` for further use in analysis. One possible export command is sketched below.
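One way to export directly from the MySQL server; the output path is illustrative and must sit inside the server's `secure_file_priv` directory (exporting via a client such as MySQL Workbench also works):

```sql
SELECT company, location, industry, total_laid_off,
       percentage_laid_off, `date`, country, funds_raised
INTO OUTFILE '/var/lib/mysql-files/final_cleaned_data.csv'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
FROM layoffs_staging2;
```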
## How to Use

To reproduce the cleaning process, you will need:

- MySQL or any compatible database system.
- The dataset files (`layoffs.json`, `layoffs.csv`, `Data Cleaning for Layoffs.sql`, and `final_cleaned_data.csv`).
1. Set up the MySQL database:
   - Import `layoffs.json` into the MySQL database using the provided SQL script.
   - Create the necessary tables (`layoffs`, `layoffs_staging`, `layoffs_staging2`); a sketch of a possible table definition follows.
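   A sketch of a possible raw table definition (the column names and text types are assumptions inferred from the cleaning steps above; the provided script is authoritative):

   ```sql
   -- Raw columns are imported as text; types are tightened during cleaning.
   CREATE TABLE layoffs (
       company VARCHAR(255),
       location VARCHAR(255),
       industry VARCHAR(255),
       total_laid_off TEXT,
       percentage_laid_off TEXT,
       `date` TEXT,
       country VARCHAR(255),
       funds_raised TEXT
   );
   ```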
2. Execute the cleaning SQL script:
   - Run the `Data Cleaning for Layoffs.sql` script to perform the data cleaning steps. This script includes commands for:
     - Removing duplicates.
     - Standardizing data formats.
     - Handling missing values.
     - Dropping unnecessary columns.
3. Output:
   - After executing the SQL script, the cleaned dataset will be saved as `final_cleaned_data.csv`.
## Contributions

Feel free to fork the repository and contribute by suggesting improvements or submitting pull requests. This project is part of ongoing efforts to clean and analyze datasets for meaningful insights.
## Future Improvements

- Incorporating additional analysis to identify trends in layoffs across different industries or locations.
- Enhancing the data validation process to automatically detect and handle other potential anomalies in future datasets.
## Contact

For questions or feedback, feel free to reach out to me on GitHub or via email.