This repository hosts a SQL script, nashville_data_cleaning.sql, which performs cleaning, standardization, and transformation of the raw data in Nashville property sales records. The main focus was on data integrity, eliminating duplicates and missing values, and preparing the data for further analysis or reporting.
The script is written in PostgreSQL syntax - using things like COALESCE, POSITION, SPLIT_PART, and the particular UPDATE.FROM join syntax.
The script performs the following five major data cleaning operations in sequential order:
Problem: The PropertyAddress column contained NULL values.
Solution: Used a self-join (UPDATE...FROM statement) to match records with the same ParcelID (since the address should be the same for the same parcel) and copied the existing address from the matched record into the NULL fields.
The original full address strings (PropertyAddress and OwnerAddress) were broken down into separate, usable columns for better querying.
- PropertyAddress: Separated into propertysplitaddress and propertysplitcity using SUBSTRING and POSITION functions.
- OwnerAddress: Separated into ownersplitaddress, ownersplitcity, and ownersplitstate using REPLACE (to handle delimiters) and the SPLIT_PART function.
Problem: The SoldAsVacant column contained inconsistent values (e.g., 'Y', 'N', 'YES', 'NO', 'Yes', 'No').
Solution: Used a CASE statement with an UPDATE to normalize all variations to the consistent proper case strings: 'Yes' and 'No'.
Problem: The dataset contained duplicate property records based on core identifying features.
Solution: Employed a Common Table Expression (CTE) and the ROW_NUMBER() window function, partitioned by unique identifiers (ParcelID, SalePrice, SaleDate, etc.). All rows where row_num > 1 were identified and removed using a final DELETE statement.
Problem: The original full address columns were no longer needed, and the TaxDistrict column was unused.
Solution: The script uses an ALTER TABLE...DROP COLUMN statement to remove the redundant PropertyAddress, OwnerAddress, and TaxDistrict columns, resulting in a cleaner final table schema.
The final table, Nashville_Housing_Data, includes all original columns plus the new split columns, ensuring a clean and analysis-ready structure.