
Ames Housing Project

– A minimalist approach to predicting housing prices in Ames


Real estate is a tricky game for investors. Finding the next hotspot, the next trendy city, the hottest neighborhoods comes down to a balance between intuition and data. But what happens when you do find a neighborhood that is growing in popularity, with housing prices rising, so that an investment could be fruitful? How do you evaluate a house and decide whether it is undervalued or overvalued? The depth and breadth of the data compiled by Dean De Cock on houses in Ames, Iowa gives us an opportunity to explore just how to do that.

Exploratory Data Analysis

The Ames dataset is vast, with over 80 features: 20 ordinal, 25 categorical, and 36 numerical. To build a predictive model, it is important to understand which of these features are more relevant, which are less relevant, and how much multicollinearity exists among them.

We begin with an overview of how each feature correlates with the feature we want to predict, SalePrice. Here is a diagram of the most highly correlated features:

We can see that the features most correlated with SalePrice are OverallQual and GrLivArea (above-grade living area in square feet). This is intuitive: a larger house and a higher-quality house should both be more expensive. However, OverallQual is a vague, ambiguous feature, as we don't really know what it means or how it is constructed.
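The correlation screening above can be sketched as follows (the CSV path is an assumption; the feature names come from the Ames data dictionary):

```python
import pandas as pd

def top_correlations(df: pd.DataFrame, target: str = "SalePrice", n: int = 10) -> pd.Series:
    """Return the n numeric features most correlated with the target."""
    corr = df.select_dtypes("number").corr()[target].drop(target)
    return corr.sort_values(ascending=False).head(n)

# Usage against the Ames training file (path is an assumption):
# ames = pd.read_csv("train.csv")
# print(top_correlations(ames))
```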

Let's take a look at how OverallQual measures against SalePrice with a box plot:

[Figure: box plot of SalePrice by OverallQual]

One thing we notice off the bat is a positive relationship between OverallQual and SalePrice. However, this relationship becomes obscured as OverallQual increases: the spread of prices widens, and OverallQual becomes a weaker predictor at the high end.
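The widening spread can be checked numerically rather than just visually; a minimal sketch, assuming the data is in a pandas DataFrame:

```python
import pandas as pd

def spread_by_quality(df: pd.DataFrame, by: str = "OverallQual",
                      target: str = "SalePrice") -> pd.Series:
    """Standard deviation of the target within each quality level.
    A rising std with quality confirms the widening spread in the box plot."""
    return df.groupby(by)[target].std()
```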

Next, we look at how GrLivArea relates to SalePrice:

[Figure: scatter plot of GrLivArea vs SalePrice]

We can see a roughly linear relationship, and when we remove a few outliers, the relationship strengthens.

[Figure: GrLivArea vs SalePrice with outliers removed]

While GrLivArea is a moderately strong predictor, with a correlation of roughly 0.75 even after removing the outliers, let's see if we can do better by using more features to predict SalePrice and applying regularization to limit the effect of multicollinearity among the variables.
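The outlier removal can be sketched as below. The exact thresholds are assumptions (De Cock's documentation suggests removing the few houses above 4000 sq ft of living area):

```python
import pandas as pd

def drop_grlivarea_outliers(df: pd.DataFrame, max_area: int = 4000,
                            min_price: int = 300000) -> pd.DataFrame:
    """Drop very large houses that sold cheaply; these few points
    weaken the otherwise linear GrLivArea/SalePrice relationship."""
    mask = (df["GrLivArea"] > max_area) & (df["SalePrice"] < min_price)
    return df[~mask]
```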


We first look at the distribution of SalePrice:

[Figure: distribution of SalePrice]

We see that SalePrice is right-skewed. Models generally predict better when the target is closer to normally distributed, so we apply a log transformation (NumPy's log1p):

[Figure: distribution of log-transformed SalePrice]
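A minimal sketch of the transformation; log1p handles zero values safely and is inverted later with np.expm1 when reporting predictions in dollars:

```python
import numpy as np
import pandas as pd

def log_transform(series: pd.Series) -> pd.Series:
    """Apply log(1 + x) to reduce right skew; invert with np.expm1."""
    return np.log1p(series)
```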

Next, we take a look at how skewed the other features are and aim to reduce their skew with the same log transformation. Using Python, we identify the following features with skewness greater than 0.6:

  • MSSubClass
  • LotFrontage
  • LotArea
  • OverallCond
  • YearBuilt
  • YearRemodAdd
  • MasVnrArea
  • BsmtFinSF1
  • BsmtFinSF2
  • BsmtUnfSF
  • TotalBsmtSF
  • 1stFlrSF
  • 2ndFlrSF
  • LowQualFinSF
  • GrLivArea
  • BsmtFullBath
  • BsmtHalfBath
  • HalfBath
  • KitchenAbvGr
  • TotRmsAbvGrd
  • Fireplaces
  • GarageYrBlt
  • WoodDeckSF
  • OpenPorchSF
  • EnclosedPorch
  • 3SsnPorch
  • ScreenPorch
  • PoolArea
  • MiscVal
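The skew screening above can be sketched as follows, using pandas' built-in sample skewness:

```python
import pandas as pd

def skewed_features(df: pd.DataFrame, threshold: float = 0.6) -> pd.Series:
    """Skewness of each numeric feature, keeping those above the threshold."""
    skew = df.select_dtypes("number").skew().drop("SalePrice", errors="ignore")
    return skew[skew > threshold].sort_values(ascending=False)
```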

We correct these skews using the log transformation. However, some features cannot be normalized and some are considered irrelevant to our model, so we drop the following:

  • ScreenPorch
  • GarageYrBlt
  • PoolArea
  • GarageArea
  • Fireplaces
  • MasVnrArea
  • 2ndFlrSF

Finally, we fill missing values in the numerical variables using the mean of each column.
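The column drop and mean imputation can be sketched together as:

```python
import pandas as pd

# Feature names taken from the drop list above.
DROP_COLS = ["ScreenPorch", "GarageYrBlt", "PoolArea", "GarageArea",
             "Fireplaces", "MasVnrArea", "2ndFlrSF"]

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Drop unneeded columns, then mean-impute remaining numeric NAs."""
    df = df.drop(columns=DROP_COLS, errors="ignore")
    num = df.select_dtypes("number").columns
    df[num] = df[num].fillna(df[num].mean())
    return df
```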

Our preprocessing is complete; let's move on to the model!


Because of the high degree of multicollinearity in the dataset, we apply Ridge regularization. The trade-off is that regularization introduces some bias into our model.

Using Python, we gather evidence that Ridge regularization has improved our linear regression:

[Figure: Ridge regression results]

Plain linear regression without regularization gave an RMSE of 0.25; the Ridge model improves on this, so we conclude that linear regression with Ridge is a solid predictor! As minimalist data scientists, we are satisfied and move forward!
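The comparison can be sketched with cross-validated RMSE; the alpha value below is an assumption and should be tuned (e.g. with RidgeCV):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

def rmse_cv(model, X, y, folds: int = 5) -> float:
    """Mean cross-validated RMSE of a model on (X, y)."""
    scores = cross_val_score(model, X, y, cv=folds,
                             scoring="neg_mean_squared_error")
    return float(np.sqrt(-scores).mean())

# Usage on the preprocessed feature matrix X and log-price target y:
# print(rmse_cv(LinearRegression(), X, y))
# print(rmse_cv(Ridge(alpha=10), X, y))  # alpha=10 is an assumption
```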

Takeaways and further enhancements

An additional important step is to extract the features that most influence our model and understand how unit changes in these features affect SalePrice.

We would also like to try Lasso regression and see how it compares to Ridge, and to analyze the multicollinearity of the features further, to understand which variables influence each other the most and mitigate the bias this introduces.

Finally, we would like to explore more sophisticated methods of imputing missing values rather than simply filling empty data with the column mean.
