Data-cleaning-PostgreSQL

Overview

This repository hosts a SQL script, nashville_data_cleaning.sql, which performs cleaning, standardization, and transformation of the raw data in Nashville property sales records. The main focus was on data integrity, eliminating duplicates and missing values, and preparing the data for further analysis or reporting.

The script is written in PostgreSQL syntax - using things like COALESCE, POSITION, SPLIT_PART, and the particular UPDATE.FROM join syntax.

Data Cleaning & Transformation Steps

The script performs the following five major data cleaning operations in sequential order:

Missing Data Population (Property Address)

Problem: The PropertyAddress column contained NULL values.

Solution: Used a self-join (UPDATE...FROM statement) to match records with the same ParcelID (since the address should be the same for the same parcel) and copied the existing address from the matched record into the NULL fields.

Address Standardization (Splitting Columns)

The original full address strings (PropertyAddress and OwnerAddress) were broken down into separate, usable columns for better querying.

PropertyAddress: Separated into propertysplitaddress and propertysplitcity using SUBSTRING and POSITION functions.
OwnerAddress: Separated into ownersplitaddress, ownersplitcity, and ownersplitstate using REPLACE (to handle delimiters) and the SPLIT_PART function.

Data Standardization (Sold As Vacant)

Problem: The SoldAsVacant column contained inconsistent values (e.g., 'Y', 'N', 'YES', 'NO', 'Yes', 'No').

Solution: Used a CASE statement with an UPDATE to normalize all variations to the consistent proper case strings: 'Yes' and 'No'.

Duplicate Removal

Problem: The dataset contained duplicate property records based on core identifying features.

Solution: Employed a Common Table Expression (CTE) and the ROW_NUMBER() window function, partitioned by unique identifiers (ParcelID, SalePrice, SaleDate, etc.). All rows where row_num > 1 were identified and removed using a final DELETE statement.

Final Schema Cleanup

Problem: The original full address columns were no longer needed, and the TaxDistrict column was unused.

Solution: The script uses an ALTER TABLE...DROP COLUMN statement to remove the redundant PropertyAddress, OwnerAddress, and TaxDistrict columns, resulting in a cleaner final table schema.

Final Table Schema

The final table, Nashville_Housing_Data, includes all original columns plus the new split columns, ensuring a clean and analysis-ready structure.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
Data Cleaning.sql		Data Cleaning.sql
LICENSE		LICENSE
Nashville Housing Data for Data Cleaning.csv		Nashville Housing Data for Data Cleaning.csv
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Data-cleaning-PostgreSQL

Overview

Data Cleaning & Transformation Steps

Missing Data Population (Property Address)

Address Standardization (Splitting Columns)

Data Standardization (Sold As Vacant)

Duplicate Removal

Final Schema Cleanup

Final Table Schema

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

Data-cleaning-PostgreSQL

Overview

Data Cleaning & Transformation Steps

Missing Data Population (Property Address)

Address Standardization (Splitting Columns)

Data Standardization (Sold As Vacant)

Duplicate Removal

Final Schema Cleanup

Final Table Schema

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Packages