Skip to content

Odongi-s-data-science-projects/Data-cleaning-PostgreSQL

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Data-cleaning-PostgreSQL

Overview

This repository hosts a SQL script, nashville_data_cleaning.sql, which performs cleaning, standardization, and transformation of the raw data in Nashville property sales records. The main focus was on data integrity, eliminating duplicates and missing values, and preparing the data for further analysis or reporting.

The script is written in PostgreSQL syntax - using things like COALESCE, POSITION, SPLIT_PART, and the particular UPDATE.FROM join syntax.

Data Cleaning & Transformation Steps

The script performs the following five major data cleaning operations in sequential order:

Missing Data Population (Property Address)

Problem: The PropertyAddress column contained NULL values.

Solution: Used a self-join (UPDATE...FROM statement) to match records with the same ParcelID (since the address should be the same for the same parcel) and copied the existing address from the matched record into the NULL fields.

Address Standardization (Splitting Columns)

The original full address strings (PropertyAddress and OwnerAddress) were broken down into separate, usable columns for better querying.

  1. PropertyAddress: Separated into propertysplitaddress and propertysplitcity using SUBSTRING and POSITION functions.
  2. OwnerAddress: Separated into ownersplitaddress, ownersplitcity, and ownersplitstate using REPLACE (to handle delimiters) and the SPLIT_PART function.

Data Standardization (Sold As Vacant)

Problem: The SoldAsVacant column contained inconsistent values (e.g., 'Y', 'N', 'YES', 'NO', 'Yes', 'No').

Solution: Used a CASE statement with an UPDATE to normalize all variations to the consistent proper case strings: 'Yes' and 'No'.

Duplicate Removal

Problem: The dataset contained duplicate property records based on core identifying features.

Solution: Employed a Common Table Expression (CTE) and the ROW_NUMBER() window function, partitioned by unique identifiers (ParcelID, SalePrice, SaleDate, etc.). All rows where row_num > 1 were identified and removed using a final DELETE statement.

Final Schema Cleanup

Problem: The original full address columns were no longer needed, and the TaxDistrict column was unused.

Solution: The script uses an ALTER TABLE...DROP COLUMN statement to remove the redundant PropertyAddress, OwnerAddress, and TaxDistrict columns, resulting in a cleaner final table schema.

Final Table Schema

The final table, Nashville_Housing_Data, includes all original columns plus the new split columns, ensuring a clean and analysis-ready structure.

About

Using a dataset called Nashville House Data I perform the necessary steps of cleansing it.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors