Data Cleaning for Layoffs

Table of Contents

  1. Overview
  2. Dataset
  3. Tools
  4. Objectives
  5. Data Cleaning Process
  6. How to Use
  7. Contributions
  8. Future Improvements
  9. Contact

Overview

This project focuses on cleaning and preparing a dataset of layoffs for further analysis. The dataset, originally from Kaggle, was imported into MySQL, cleaned, and standardized to ensure consistency and accuracy for potential analysis.

Dataset

The layoffs dataset comes from Kaggle and is included in this repository as layoffs.json and layoffs.csv; the raw data contains 3887 rows before cleaning.

Tools

  • MySQL: the database management system used to host, manage, and manipulate the dataset.
  • SQL: queries and scripts for data processing and transformation, including handling missing values, removing duplicates, and standardizing data formats.

Objectives

  1. Handle missing or null values.
  2. Remove duplicate records.
  3. Ensure consistent formatting for dates, strings, and numerical values.
  4. Standardize column values for better analysis.

Data Cleaning Process

The data cleaning was done in multiple phases, as outlined below:

Importing the Raw Data

  • The raw data was first imported into a MySQL table layoffs from layoffs.json. The original dataset contained 3887 rows.
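
A minimal sketch of this import, assuming every raw column is loaded as text and that the column list matches the fields mentioned in this write-up (the repository's script may differ):

    CREATE TABLE layoffs (
        company              TEXT,
        location             TEXT,
        industry             TEXT,
        total_laid_off       TEXT,
        percentage_laid_off  TEXT,
        `date`               TEXT,
        funds_raised         TEXT,
        country              TEXT
    );

    -- Load the CSV variant of the raw data (requires local_infile to be enabled);
    -- layoffs.json can instead be imported with MySQL Workbench's import wizard.
    LOAD DATA LOCAL INFILE 'layoffs.csv'
    INTO TABLE layoffs
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\n'
    IGNORE 1 ROWS;  -- skip the header row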

Removing Duplicates

  • A staging table layoffs_staging was created to preserve the original data. Using the ROW_NUMBER() function, duplicates were identified based on key columns (company, location, total_laid_off, etc.) and then removed.
  • Two duplicate rows, for 'Beyond Meat' and 'Cazoo', were identified and removed.
  • After removing duplicates, the dataset contained 3885 rows.
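
A sketch of this step, assuming MySQL 8.0 or later (ROW_NUMBER() needs window-function support); the partition columns are illustrative and may differ from the repository's script:

    -- Work on a copy of the raw table so the original data is preserved.
    CREATE TABLE layoffs_staging LIKE layoffs;
    INSERT INTO layoffs_staging SELECT * FROM layoffs;

    -- Number repeated rows; anything with row_num > 1 duplicates an earlier row.
    CREATE TABLE layoffs_staging2 AS
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY company, location, industry,
                            total_laid_off, percentage_laid_off, `date`
           ) AS row_num
    FROM layoffs_staging;

    DELETE FROM layoffs_staging2 WHERE row_num > 1;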

Standardizing Data

  • Whitespace Cleanup: Removed extra spaces from text columns (e.g., company).
  • Misspelled Locations: Corrected common misspellings in the location column (e.g., "Ferdericton" → "Fredericton").
  • Country Standardization: Unified variations of country names (e.g., "UAE" → "United Arab Emirates").
  • Date Formatting: Converted the date column from text to the proper DATE type.
  • Numeric Conversions: Converted the total_laid_off and funds_raised columns from text to integers, rounding where necessary.
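
A sketch of these updates against the staging table; the example values come from the list above, and the assumed raw date format '%m/%d/%Y' is illustrative:

    -- Trim stray whitespace from text columns.
    UPDATE layoffs_staging2 SET company = TRIM(company);

    -- Fix misspelled locations and non-standard country names.
    UPDATE layoffs_staging2 SET location = 'Fredericton' WHERE location = 'Ferdericton';
    UPDATE layoffs_staging2 SET country  = 'United Arab Emirates' WHERE country = 'UAE';

    -- Convert the text date to a real DATE column (format string is an assumption).
    UPDATE layoffs_staging2 SET `date` = STR_TO_DATE(`date`, '%m/%d/%Y');
    ALTER TABLE layoffs_staging2 MODIFY COLUMN `date` DATE;

    -- Round the numeric text columns, then change their types to INT
    -- (assumes blank strings have already been converted to NULL; see below).
    UPDATE layoffs_staging2
    SET total_laid_off = ROUND(total_laid_off),
        funds_raised   = ROUND(funds_raised);
    ALTER TABLE layoffs_staging2
        MODIFY COLUMN total_laid_off INT,
        MODIFY COLUMN funds_raised INT;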

Handling Null Values

  • Empty strings in key columns (industry, total_laid_off, percentage_laid_off) were converted to NULL so that missing data is represented consistently.
  • Rows in which both total_laid_off and percentage_laid_off were NULL were removed, reducing the dataset to 3248 rows.
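
A sketch of this step (the order relative to the type conversions above may differ in the actual script):

    -- Represent missing values consistently as NULL.
    UPDATE layoffs_staging2 SET industry = NULL WHERE industry = '';
    UPDATE layoffs_staging2 SET total_laid_off = NULL WHERE total_laid_off = '';
    UPDATE layoffs_staging2 SET percentage_laid_off = NULL WHERE percentage_laid_off = '';

    -- Drop rows that carry no layoff information at all.
    DELETE FROM layoffs_staging2
    WHERE total_laid_off IS NULL
      AND percentage_laid_off IS NULL;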

Final Cleaned Dataset

  • The final cleaned dataset, now with standardized data, was exported to final_cleaned_data.csv for further use in analysis.
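
One way to produce the CSV from the server side, assuming the FILE privilege and a secure_file_priv directory such as /var/lib/mysql-files/ (a client-side export wizard works just as well):

    SELECT *
    INTO OUTFILE '/var/lib/mysql-files/final_cleaned_data.csv'
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\n'
    FROM layoffs_staging2;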

How to Use

Prerequisites

  1. MySQL or any compatible database system.
  2. The dataset files (layoffs.json, layoffs.csv, Data Cleaning for Layoffs.sql, and final_cleaned_data.csv).

Steps to Run

  1. Set up the MySQL database:

    • Import layoffs.json into the MySQL database using the provided SQL script.
    • Create necessary tables (layoffs, layoffs_staging, layoffs_staging2).
  2. Execute the Cleaning SQL Script:

    • Run the Data Cleaning for Layoffs.sql script to perform the data cleaning steps (an example invocation appears after this list). This script includes commands for:
      • Removing duplicates.
      • Standardizing data formats.
      • Handling missing values.
      • Dropping unnecessary columns.
  3. Output:

    • After executing the SQL script, the cleaned dataset will be saved as final_cleaned_data.csv.
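
For example, from the mysql command-line client (the database name layoffs_db and the working directory are assumptions; adjust to your setup):

    -- Run inside the mysql client, from the directory containing the script.
    CREATE DATABASE IF NOT EXISTS layoffs_db;
    USE layoffs_db;
    SOURCE Data Cleaning for Layoffs.sql;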

Contributions

Feel free to fork the repository and contribute by suggesting improvements or submitting pull requests. This project is part of ongoing efforts to clean and analyze datasets for meaningful insights.


Future Improvements

  • Incorporating additional analysis to identify trends in layoffs across different industries or locations.
  • Enhancing the data validation process to automatically detect and handle other potential anomalies in future datasets.

Contact

For questions or feedback, feel free to reach out to me on GitHub or via email.
