
Predicting House Prices Case Study
Tableau / Python Project
conducted June 2024
Contents:
Goals
Tools & Skills Used
Overview
Goal:
This project was an assignment from my data analytics course.
The goal was to create a Tableau dashboard showing what drives real estate prices, using exploratory analysis methods.
The house sales dataset comes from Kaggle.
Sources:
GitHub - If you’d like to view my code.
Kaggle Dataset: House Sales in King County, Washington, USA
Tools:
Tableau
Python
Excel
Skills Used:
Sourcing Data
Exploring relationships
Geographical Visualizations with Python
Regression
Clustering
Sourcing & Analyzing Time Series Data
Overview:
Results & Recommendations:
The colored dots on the map correspond to price ranges for each house.
On average:
Orange 6.8M USD
Green 3.4M USD
Red 2.8M USD
Blue 1.3M USD
Value of Bathrooms:
Home prices increase significantly with more bathrooms. On average, each additional full bathroom adds around $75k to the price. The increase is not uniform, however: a unit with three baths is valued at $350k, while a unit with four baths costs $590k, which is about $240k (roughly 70%) more for one extra full bath. For a cost-effective purchase, I recommend that smaller families look for homes with three or fewer bathrooms and that larger families consider whether getting an extra half bath is feasible.
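As a quick check of the jump between the three-bath and four-bath figures, here is a minimal arithmetic sketch. It uses only the two averages quoted above; the variable names are illustrative and are not taken from the project code.

```python
# Average prices quoted above (USD); variable names are illustrative.
price_three_baths = 350_000
price_four_baths = 590_000

# Absolute and relative cost of the extra full bathroom.
extra_cost = price_four_baths - price_three_baths
ratio = price_four_baths / price_three_baths

print(f"Extra cost of the fourth bath: ${extra_cost:,}")
print(f"Four-bath price relative to three-bath price: {ratio:.2f}x")
```

The ratio puts the fourth bathroom well above the roughly $75k average cost per bathroom.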
Square Feet of Living:
Across all homes, the average unit has about 2,500 sqft of living space. Homes larger or smaller than that are priced higher or lower accordingly.
Waterfront:
On average, a waterfront home costs an extra $230k. That said, waterfront homes tend to have less square footage, though how much less is inconclusive.
Grade:
The average grade across all homes is 7.1. A higher grade increases the overall value, so buying a home with a grade near the 7.1 average can save money without sacrificing quality.
Real estate prices rise in the spring (March, April, May). I assume many people buy homes in the summer, so prices are raised in preparation.
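One way to look for a seasonal pattern like this is to average sale prices by calendar month. The sketch below is an assumption-laden illustration rather than the project's actual code: the rows are made up, and the "YYYYMMDDTHHMMSS" date format is an assumption about how the CSV stores sale dates.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Illustrative rows only (not real records): (sale date, price in USD).
# The "YYYYMMDDTHHMMSS" date format is an assumption about the CSV export.
sales = [
    ("20140502T000000", 410_000),
    ("20140518T000000", 450_000),
    ("20141013T000000", 380_000),
    ("20141209T000000", 360_000),
    ("20150225T000000", 390_000),
    ("20150410T000000", 470_000),
]

def average_price_by_month(rows):
    """Return {month number: mean sale price}, ordered by month."""
    prices = defaultdict(list)
    for date_text, price in rows:
        month = datetime.strptime(date_text, "%Y%m%dT%H%M%S").month
        prices[month].append(price)
    return {month: mean(values) for month, values in sorted(prices.items())}

monthly = average_price_by_month(sales)
for month, avg in monthly.items():
    print(f"month {month:2d}: ${avg:,.0f}")
```

With the full dataset, comparing the March to May averages against the rest of the year would show whether the spring increase holds.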
Methods Used:
Linear Regression:
Linear regression is a type of model that measures the relationship between quantitative variables to make a prediction. This method was used because we are trying to assess the relationships between a continuous independent variable and a dependent variable.
While exploring relationships, I found four variables with the most influence on price:
Square Feet of Living
Number of Bathrooms
Grade
Has Waterfront
Typically, a home costs more as the number of bathrooms increases. However, we could find one unit with 5 baths priced at $5M and another unit with 5 baths priced at $1.9M. It should be noted that the number of bathrooms alone does not determine price.
Prices increase when the Square Foot of Living Space increases.
The same can be said for Grade, which determines the quality level of construction and design.
Homes with a Waterfront have higher prices.
Cluster Analysis:
A cluster analysis groups data points into “clusters.” Comparing these new groups could reveal new patterns and relationships. This method was used because we want to look at all data attributes, not just the variables previously explored.
The cluster analysis yielded four distinct price groups:
High Price (orange)
Mid-High Price (green)
Mid-Low Price (red)
Low Price (blue)
Cluster Analysis Results:
Each price group tends to excel in one or more categories, but never all four. For example, Low Price (blue) has high values in Bathrooms, Sqft Living, and Grade, but not Waterfront. The exception is the Mid-Low Price (red) group, which has high values in all categories.
Average Prices:
High Price (orange) 6.8M USD
Mid-High Price (green) 3.4M USD
Mid-Low Price (red) 2.8M USD
Low Price (blue) 1.3M USD
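The write-up does not name the clustering algorithm, so the sketch below assumes k-means, a common choice for this kind of grouping. The points are made up (price in millions of USD, living area in thousands of sqft) and were chosen so the cluster means land near the averages listed above. The function name `k_means` and its parameters are illustrative. In practice the features would usually be scaled before clustering.

```python
from math import dist
from statistics import mean

def k_means(points, k, iterations=20):
    """Plain k-means: returns (centroids, labels) for a list of tuples."""
    # Deterministic start: use evenly spaced points as initial centroids.
    step = len(points) // k
    centroids = [points[i * step] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(tuple(mean(axis) for axis in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # leave an empty cluster in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, labels

# Made-up homes: (price in M USD, living area in thousands of sqft).
homes = [
    (1.2, 2.0), (1.4, 2.2),
    (2.7, 3.0), (2.9, 3.1),
    (3.3, 3.9), (3.5, 4.1),
    (6.6, 6.0), (7.0, 6.4),
]

centroids, labels = k_means(homes, k=4)
for c, (avg_price, avg_sqft) in enumerate(centroids):
    print(f"cluster {c}: avg price {avg_price:.1f}M USD, avg living area {avg_sqft:.2f}k sqft")
```

Each centroid's first coordinate is the average price of that cluster, which is how a per-group average price table like the one above can be read off the result.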
Time Series Analysis:
The goal of time series analysis is to better understand the current data and the patterns in it.
To compare my analysis to a larger source, I used Zillow Real Estate Data from Nasdaq Data Link (formerly Quandl). Source: https://data.nasdaq.com/databases/ZILLOW
This decomposition plot shows the Zillow Real Estate Data after subsetting it to display only home prices over time.
Overall, the average price of homes grew ~16% from 2014 to 2015, which is quite high.
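The write-up does not show how the decomposition plot mentioned above was produced (statsmodels' `seasonal_decompose` is a common choice). As a rough illustration of the idea, the sketch below implements a classical additive decomposition with the standard library and runs it on a synthetic quarterly series. The function names, the series, and the quarterly period are all assumptions for the example, not the Zillow data.

```python
from statistics import mean

def decompose_additive(series, period):
    """Classical additive decomposition into trend, seasonal and residual.

    Assumes an even `period` (4 for quarterly, 12 for monthly data) and at
    least two full cycles. Trend and residual are None at the edges, where
    the centered moving average is undefined.
    """
    n, half = len(series), period // 2
    trend = [None] * n
    for t in range(half, n - half):
        window = series[t - half : t + half + 1]
        # Centered moving average: the two end points get half weight.
        total = sum(window) - 0.5 * (window[0] + window[-1])
        trend[t] = total / period
    # Average the detrended values for each position in the cycle.
    by_position = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            by_position[t % period].append(series[t] - trend[t])
    raw = [mean(values) for values in by_position]
    offset = mean(raw)  # center the seasonal effects on zero
    seasonal = [raw[t % period] - offset for t in range(n)]
    residual = [
        None if trend[t] is None else series[t] - trend[t] - seasonal[t]
        for t in range(n)
    ]
    return trend, seasonal, residual

def percent_change(old, new):
    """Percentage growth from old to new."""
    return (new - old) / old * 100

# Synthetic quarterly series: linear trend plus a repeating seasonal pattern.
pattern = [5, -3, -4, 2]
series = [100 + 2 * t + pattern[t % 4] for t in range(12)]

trend, seasonal, residual = decompose_additive(series, period=4)
growth = percent_change(trend[2], trend[6])  # same quarter, one year apart
print(f"trend growth over one year: {growth:.1f}%")
```

Measuring growth on the trend component rather than the raw series keeps seasonal swings from distorting a year-over-year figure.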
House Sales in King County, Washington, USA
About the Data:
The Google Sheets link here shows the data profile for the Kaggle dataset and how I cleaned the Excel sheet. The dataset includes date sold, price, zip code, number of bedrooms and bathrooms, and other fields useful for real estate analysis.
Limitations:
Data was collected from 2014 to 2015. Having more recent data would improve prediction and forecasting.
The dataset contains many extreme outliers, mainly luxury homes.
Next Steps:
Clean the data to remove luxury homes and other outliers, then conduct the analysis again to examine any differences in results.
Compare price accuracy to other real estate websites.
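For the first next step, one common way to flag outliers is the 1.5 x IQR rule. The sketch below is only an assumption about how that cleaning could be done. The prices are made up, and the function name `remove_price_outliers` and the cutoff multiplier `k` are illustrative.

```python
from statistics import quantiles

def remove_price_outliers(prices, k=1.5):
    """Keep prices within k * IQR of the first and third quartiles."""
    q1, _, q3 = quantiles(prices, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [p for p in prices if low <= p <= high]

# Illustrative prices in USD (not real records); the last two mimic luxury homes.
prices = [310_000, 350_000, 420_000, 450_000, 480_000, 520_000,
          560_000, 610_000, 5_000_000, 6_800_000]

kept = remove_price_outliers(prices)
print(f"kept {len(kept)} of {len(prices)} homes")
```

Rerunning the regression and cluster analysis on the filtered prices would show how much the luxury homes were driving the earlier results.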