Nashville Housing Data Cleaning in SQL
2/24/2024
Introduction
Keeping on theme with the Airbnb housing project, I decided to practice my SQL and data cleaning skills on Nashville housing data. I found this excel data sheet on Kaggle, which is a fanastic place to find all sorts of data to practice skills. Anyways, this file intrigued me as there were quite a few things wrong with it. I imported the data to Navicat using MySQL and off to the races I was, beginning to explore all of its faults and fix them.
(Side note: I did test my logic with SELECT queries before updating, but for space purposes I mostly just posted the final UPDATE query. You can check the full queries on my GitHub.)
Date Formatting
When looking through the SaleDate column, I found it was not in a usuable format for a date type in most versions of SQL and DB Admin tools. So to fix that I had to convert it to a proper format of M/d/Y, using the STR_TO_DATE function. After ensuring it worked, I updated the table and set the SaleDate column equal to the function.
UPDATE nashvillehousing
SET SaleDate = STR_TO_DATE(SaleDate, '%M %d, %Y')
SoldAsVacant Standardization
SoldAsVacant had different strings of Yes and No. I had to fix this as it would be annoying to work with later. I started by figuring out how many different ones there were using a simple query.
SELECT DISTINCT(SoldAsVacant), COUNT(SoldAsVacant)
FROM nashvillehousing
GROUP BY SoldAsVacant
ORDER BY 2
The result showed that there were four types of entries: Yes, No, Y, and N. I chose to keep the full words, convert the abbrieviations, and updated the table.
UPDATE nashvillehousing
SET SoldAsVacant = CASE
WHEN SoldAsVacant = 'Y' THEN 'Yes'
WHEN SoldAsVacant = 'N' THEN 'No'
ELSE SoldAsVacant
END
Remove Duplicates
Next was a more complex task of removing the the duplicate entries. I determined that there were multiple of the same data points by matching the ParcelID, PropertyAddress, SalePrice, SaleDate, LegalReference. If they were the same I counted the occurances of them in the table. The best way I deteremined to achieve this was using a CTE.
WITH RowNumCTE AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY ParcelID,
PropertyAddress,
SalePrice,
SaleDate,
LegalReference
ORDER BY UniqueID
) AS row_num
FROM nashvillehousing
)
DELETE n
FROM nashvillehousing n
JOIN RowNumCTE r ON n.UniqueID = r.UniqueID
WHERE r.row_num = 2;
Populate Property Address Data
Oddly, there were a many rows that did not have a property address. So to overcome this I checked to see if they were in any other entries. Luckly they were. Apparently, some of the properties were sold multiple times and the address’ were not entered again. I fixed this with the following query.
SELECT COUNT(*) - COUNT(PropertyAddress)
FROM nashvillehousing
UPDATE nashvillehousing a
JOIN nashvillehousing b
ON a.ParcelID = b. ParcelID
AND a.UniqueID <> b.UniqueID
SET a.PropertyAddress = COALESCE(a.PropertyAddress, b.PropertyAddress)
WHERE a.PropertyAddress is NULL
Address Standardization
Unfortantely the address’ have their street line and neighborhood combined in the PropertyAddress column. Using the SUBSTRING_INDEX function, I was able to give them there own columns.
ALTER TABLE nashvillehousing
ADD COLUMN StreetAddress VARCHAR(50)
ADD COLUMN Neighborhood VARCHAR(35)
UPDATE nashvillehousing
SET StreetAddress = SUBSTRING_INDEX(PropertyAddress,',',1)
UPDATE nashvillehousing
SET Neighborhood = SUBSTRING_INDEX(PropertyAddress,',',-1)
The OwenerAddress’ were even more complex as the street line, neighborhood, and state were grouped together. Thankfully, MySQL’s SUBSTRING function is very versatile, allowing nested usage. Therefore, I was able to seperate the three.
ALTER TABLE nashvillehousing
ADD COLUMN OwnerStreetAddress VARCHAR(50),
ADD COLUMN OwnerNeighborhood VARCHAR(35),
ADD COLUMN OwnerState VARCHAR(3)
UPDATE nashvillehousing
SET OwnerStreetAddress = SUBSTRING_INDEX(OwnerAddress,',',1)
UPDATE nashvillehousing
SET OwnerNeighborhood = SUBSTRING_INDEX(SUBSTRING_INDEX(OwnerAddress,',',2),',',-1)
UPDATE nashvillehousing
SET OwnerState = SUBSTRING_INDEX(SUBSTRING_INDEX(OwnerAddress,',',3),',',-1)
Delete Columns
Lastly was time to drop all the extra columns. Since this was a personal, non-crucial project I figured it was good to drop some of the columns to save some space. Of course, in a real world situation this would be done with extreme caution. I dropped the old grouped address’ and the tax district as it was not important to any future usage for me.
ALTER TABLE nashvillehousing
DROP COLUMN OwnerAddress,
DROP COLUMN TaxDistrict,
DROP COLUMN PropertyAddress;
Conclusion
Overall the cleaning project went pretty smooth. I found many oddities within the data and reformatted, standardized, and made more usable. I achieved the goals I set out for and improved my skills in the process.