SECTION 1: Summary of Findings

Our team set out to explore two key questions about the housing market in King County, Washington, between May 2014 and May 2015. The first question, addressed with a linear regression: Can we predict a house’s selling price based on its size? The second question, addressed with a logistic regression: Can we predict whether a house falls within one of the region’s “20 Wealthiest Zip Codes” using data on its size, condition, and the size of neighboring homes?

The linear regression model for the first question found that the size of a house, measured in terms of square footage, number of bedrooms, and number of bathrooms, does play an important role in predicting the selling price. Together, these variables explained about half of the variation in house prices. Interestingly, an increase in the number of bedrooms does not always correspond to an increase in price. In fact, with all other factors held constant, our model associates more bedrooms with a lower price. Houses with up to five bedrooms generally have higher prices, while houses with more than five bedrooms show a decrease in price. This may be because only a limited segment of buyers is in the market for homes with a high number of bedrooms, which lowers demand as well as price. An unexpected result was that the number of floors in a house was not significant: including it did not substantially improve the accuracy of our model’s price predictions, which suggests that it may not be an important consideration for home buyers assessing a home’s value. Our analysis also found that a house’s grade, an indicator of construction quality, is a crucial determinant of its selling price.

The second question asked whether we can predict if a house falls within a “wealthy” zip code, defined as one with an average homeowner income of more than $120,000. We developed a model to address this question using data on the house’s size, condition, and the size of the neighboring houses. Our results found that the condition of a house does not significantly influence whether the house is located in a wealthier zip code. However, the size of the house and the size of the neighboring houses do give insight into the wealth level of the zip code the house resides in. Larger houses and neighborhoods with larger houses tend to be in wealthier zip codes. Our current research provides adequate predictions of whether a house is likely to be in a wealthier zip code, with room for improvement. Other factors may affect what type of houses reside in zip codes of different wealth classifications, such as whether the house is located in a rural or urban area. Future work could also examine how these factors interact with each other, or how steady growth in one factor might influence the prosperity of a zip code area. In essence, our research aims to provide a better understanding of which factors can help predict whether a house is in a wealthier zip code.

SECTION 2: Data Description & Visualizations

WORKSPACE PREP

Load packages

Read in data

  • kc_house_data.csv

DATASET

King County House Sale Prices

A data set containing information about more than 21,600 different house sales for King County, Washington between May 2014 and May 2015 including: price, bedrooms, bathrooms, sqft_living, sqft_lot, floors, waterfront, view, condition, grade, sqft_above, sqft_basement, yr_built, yr_renovated, zip_code, sqft_living15, sqft_lot15

VARIABLES

  • price Price of each home sold.

  • bedrooms Number of bedrooms.

  • bathrooms Number of bathrooms, where 0.5 accounts for a room with a toilet but no shower.

  • sqft_living Square footage of the home’s interior living space.

  • sqft_lot Square footage of the land space.

  • floors Number of floors.

  • waterfront A dummy variable for whether the property was overlooking the waterfront or not.

  • view An index from 0 to 4 of how good the view of the property was.

  • condition An index from 1 to 5 on the condition of the property.

  • grade An index from 1 to 13, where 1-3 falls short of building construction and design, 7 has an average level of construction and design, and 11-13 have a high quality level of construction and design.

  • We created another variable for grade, factoring the levels, called grade_group, which included level groupings: poor (1, 2, 3), moderately poor (4, 5, 6), average (7, 8, 9, 10), high (11, 12, 13).

  • sqft_above The square footage of the interior housing space that is above ground level.

  • sqft_basement The square footage of the interior housing space that is below ground level.

  • yr_built The year the house was initially built.

  • yr_renovated The year of the house’s last renovation.

  • zip_code What zip code area the house is in.

  • We created another variable for zip_code called wealthy, grouping the zip codes based on whether they are considered one of the “20 Wealthiest Zip Codes” in King County or not. The 20 wealthiest zip codes are 98039, 98040, 98004, 98112, 98075, 98033, 98074, 98053, 98121, 98006, 98199, 98105, 98065, 98177, 98005, 98005, 98029, 98119, 98027, 98072.

  • sqft_living15 The square footage of interior housing living space for the nearest 15 neighbors.

  • sqft_lot15 The square footage of the land lots of the nearest 15 neighbors.
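The two derived variables described above (grade_group and wealthy) could be constructed along the following lines. This is a minimal sketch on a few made-up rows rather than the actual kc_house_data.csv. The data frame name houses and the level label "poor" are assumptions (the other level labels follow the regression output later in this report), and the zip code vector holds only the distinct codes listed above.

```r
# Made-up rows standing in for the real data set (values are illustrative only)
houses <- data.frame(
  grade    = c(3, 5, 7, 10, 12),
  zip_code = c(98039, 98001, 98004, 98178, 98072)
)

# grade_group: bin grade into (0,3], (3,6], (6,10], (10,13]
houses$grade_group <- cut(
  houses$grade,
  breaks = c(0, 3, 6, 10, 13),
  labels = c("poor", "mod_poor", "average", "high")
)

# Distinct zip codes taken from the "20 Wealthiest Zip Codes" list above
wealthy_zips <- c(
  98039, 98040, 98004, 98112, 98075, 98033, 98074, 98053, 98121, 98006,
  98199, 98105, 98065, 98177, 98005, 98029, 98119, 98027, 98072
)

# wealthy: TRUE when the house's zip code appears in the list
houses$wealthy <- houses$zip_code %in% wealthy_zips

print(houses)
```

Using cut() yields a factor, so a regression treats the first level ("poor") as the baseline and estimates one coefficient for each remaining level.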

SECTION 3: Questions of Interest

Linear Regression Question of Interest:

  1. Is house size (measured in terms of square footage, number of bedrooms, number of bathrooms, and floors) a good predictor of selling price for houses in King County, Washington between May 2014 and May 2015?
  • Response variable: price

  • Motivation: We want to know whether we can predict the selling price of a house based on particular sizing measures. If this is accurate, this can be used in the real world to determine whether a house is being valued appropriately and not overpriced. Individuals and families looking to buy a home could utilize this tool to gauge if the listed price of a property aligns with its size-related attributes.

Logistic Regression Question of Interest:

  1. Can we predict if the house has a zip code considered one of the “20 Wealthiest Zip Codes” in King County (using size, condition, and size of neighboring houses)?
  • Response variable: wealthy

  • Motivation: We want to determine whether a house is in a wealthy zip code based on specific measures, in order to check whether a purchased house is in a wealthier neighborhood as advertised. Also, if we are given information about a house, this allows us to narrow down where it may be located in King County by predicting whether or not it is in a wealthier neighborhood.

SECTION 4: Linear Regression Data Visualizations

This plot is interesting because more bedrooms appear to correspond with a greater price up to 5 bedrooms, after which the price of houses levels off (arguably even decreasing). This suggests that the number of bedrooms might be influential in determining price, but other factors are probably more important.

From the boxplot above, the number of bathrooms appears to correlate strongly with price.

From the scatterplot above (broken down by the number of floors in a house), there appears to be a strong positive correlation between price and both the square footage of the house and the number of bedrooms within it. A larger house in square feet with more bedrooms is usually more expensive. The color of the points corresponds to the number of bedrooms in the house. A few outliers on this scatterplot should be kept in mind, specifically the 33-bedroom house in the 3.5-floors panel.

There appears to be an increase in price corresponding to an increase in square footage of living space. A linear regression will have to be performed to conclusively state this, but there are more data points at higher prices at higher amounts of square footage.

It appears that properties on a waterfront cost more than properties without a waterfront. However, just because a property is not a waterfront property does not mean that it cannot be as expensive as other waterfront houses.

Lesser views have more houses found at lower prices, and having a stellar view (a rating of 4) correlates with significantly higher house prices, but most of the density plots look similar and there doesn’t appear to be a huge difference between different views.

Grade has a clear relationship with increased price. It appears that house prices increase exponentially with better grades; the better the grade, the higher the increases in prices.

It doesn’t appear that the number of floors has any significant effect on the price of the house.

The Box-Cox plot tells us how we should transform the response in order to stabilize the variance in the data. Since λ = 0 falls within the 95% confidence interval of the Box-Cox plot, we can perform a log transformation on the y-variable in the above charts. After also performing a log transformation on the x-variable (to account for the nonlinear relationship), it appears clear that an increase in square footage of living space corresponds to an increase in the price of the house.
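The Box-Cox check described here can be sketched as follows. This is an illustration on simulated data, not the report's actual code: the variable values, the seed, and the grid of λ values are assumptions, and only the response is transformed in this sketch. The boxcox() function comes from the MASS package.

```r
library(MASS)  # provides boxcox()

set.seed(42)
n <- 500
sqft_living <- runif(n, 500, 5000)
# Simulated prices whose logarithm is linear in square footage,
# so lambda = 0 (a log transformation) is the appropriate choice here
price <- exp(12 + 0.0004 * sqft_living + rnorm(n, sd = 0.3))

fit_raw <- lm(price ~ sqft_living)

# Profile log-likelihood over a grid of lambda values, without drawing the plot
bc <- boxcox(fit_raw, lambda = seq(-2, 2, by = 0.01), plotit = FALSE)

# Lambda with the highest log-likelihood, and the approximate 95% interval
lambda_hat <- bc$x[which.max(bc$y)]
ci <- range(bc$x[bc$y > max(bc$y) - qchisq(0.95, df = 1) / 2])
print(lambda_hat)
print(ci)

# If 0 lies inside the interval, refit with a log-transformed response
fit_log <- lm(log(price) ~ sqft_living)
print(coef(fit_log))
```

The interval is the set of λ values whose log-likelihood lies within half of the 95% chi-square quantile (one degree of freedom) of the maximum, which is the same region the dashed lines mark on a Box-Cox plot.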

It doesn’t appear that year built has any significant effect on the price.

It appears that year renovated has a slight effect on the price as more newly renovated houses are selling at a higher price but the difference between decades may or may not be statistically significant.

There appears to be somewhat of a relationship between square footage of lot and increased price; this holds true for both square footage of living and lot, as well as the square footage of living of the surrounding 15 neighbors. It’s important to note that all of these variables are highly correlated with each other - most houses in the same neighborhood will probably have very similar square footage of living and lot space.

SECTION 5: Linear Regression Model

## 
## Call:
## lm(formula = price ~ sqft_living + sqft_lot + bedrooms + as.numeric(floors) + 
##     bathrooms, data = training)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1487212  -145520   -22928   102025  4224699 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         7.783e+04  9.835e+03   7.914 2.75e-15 ***
## sqft_living         3.025e+02  4.353e+00  69.504  < 2e-16 ***
## sqft_lot           -3.693e-01  5.602e-02  -6.592 4.55e-11 ***
## bedrooms           -5.604e+04  3.211e+03 -17.453  < 2e-16 ***
## as.numeric(floors)  6.490e+02  2.636e+03   0.246   0.8055    
## bathrooms           1.199e+04  5.300e+03   2.262   0.0237 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 256000 on 10800 degrees of freedom
## Multiple R-squared:  0.505,  Adjusted R-squared:  0.5047 
## F-statistic:  2203 on 5 and 10800 DF,  p-value: < 2.2e-16

The square root of the mean squared error (RMSE) is $255,748.10, a relatively large number.
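For reference, the root mean squared error can be computed with a small helper like the one below. This is a sketch only: the function name rmse is hypothetical, and the report does not state whether its value was computed on the training data or on held-out data.

```r
# Root mean squared error between observed and predicted values
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Tiny made-up example: the errors are 0, 0 and -2
print(rmse(c(1, 2, 3), c(1, 2, 5)))  # equals sqrt(4 / 3)

# With a fitted lm object `fit` and a data frame `newdata` (hypothetical names):
# rmse(newdata$price, predict(fit, newdata = newdata))
```

Because the RMSE is in the same units as the response, it can be read as a typical prediction error in dollars.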

In summary, square footage, number of bedrooms, and number of bathrooms significantly predict house prices in King County, Washington, between May 2014 and May 2015, explaining about 50% of the variation in house price. However, the number of floors does not seem to be a significant predictor in the presence of the other predictors. The coefficient for the number of bedrooms is negative, which might be counterintuitive, suggesting that further investigation is needed to understand this effect.

Since we found that ‘floors’ was not a useful variable, let’s try the model without that predictor. Running the model without ‘floors’:

## 
## Call:
## lm(formula = price ~ sqft_living + sqft_lot + bedrooms + bathrooms, 
##     data = training)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1487659  -145294   -22906   101945  4223305 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.834e+04  9.614e+03   8.149 4.07e-16 ***
## sqft_living  3.025e+02  4.352e+00  69.506  < 2e-16 ***
## sqft_lot    -3.700e-01  5.594e-02  -6.615 3.88e-11 ***
## bedrooms    -5.612e+04  3.195e+03 -17.565  < 2e-16 ***
## bathrooms    1.249e+04  4.889e+03   2.555   0.0106 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 256000 on 10801 degrees of freedom
## Multiple R-squared:  0.505,  Adjusted R-squared:  0.5048 
## F-statistic:  2754 on 4 and 10801 DF,  p-value: < 2.2e-16

This does not improve the Adjusted R-squared very much. We’ve made conclusions on how house size may correlate with house price, but what about other factors? Using our visualizations, we can see a clear relationship between house grade and house price. If we run another regression, we see that a higher grade corresponds to a higher price.

The Adjusted R-squared value for the model including the ‘grade’ variable (seen below) is 0.5444, while this model, excluding the ‘grade’ variable, is 0.5048. This means that the model including the ‘grade’ (house grade relating to construction quality) variable explains approximately 54.44% of the variability in the house prices, while the model excluding the ‘grade’ variable only explains about 50.48%. The ‘grade’ variable contributes to improving the model’s ability to explain the variability in the house prices. Therefore, the inclusion of the ‘grade’ variable results in a higher Adjusted R-squared value, demonstrating that ‘grade’ is indeed a valuable predictor in estimating house prices and should be included in the model.

## 
## Call:
## lm(formula = price ~ sqft_living + sqft_lot + bedrooms + grade + 
##     bathrooms, data = training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -843744 -133389  -24655   96979 4681648 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.839e+05  2.053e+04 -23.570  < 2e-16 ***
## sqft_living  2.200e+02  4.968e+00  44.275  < 2e-16 ***
## sqft_lot    -3.066e-01  5.369e-02  -5.710 1.16e-08 ***
## bedrooms    -3.890e+04  3.116e+03 -12.483  < 2e-16 ***
## grade        9.757e+04  3.183e+03  30.652  < 2e-16 ***
## bathrooms   -2.186e+04  4.822e+03  -4.533 5.88e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 245600 on 10800 degrees of freedom
## Multiple R-squared:  0.5446, Adjusted R-squared:  0.5444 
## F-statistic:  2583 on 5 and 10800 DF,  p-value: < 2.2e-16

We calculated the square root of the Mean Squared Error of this model to be $250,220.70. This is smaller than the model without ‘grade’, which supports our use of the model with ‘grade’.
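As a check on the Adjusted R-squared discussed above, the value can be recovered from the Multiple R-squared, the number of observations, and the number of predictors. The sketch below uses the rounded figures printed for the model with ‘grade’; the function name adj_r2 is hypothetical, and n is inferred as the residual degrees of freedom plus the six estimated coefficients.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Model with grade: R-squared 0.5446, 10800 residual df, 5 predictors
n <- 10800 + 5 + 1
value <- adj_r2(0.5446, n = n, p = 5)
print(round(value, 4))
```

With more than ten thousand observations and only five predictors, the penalty term (n - 1) / (n - p - 1) is very close to 1, which is why the two R-squared values in the output are nearly identical.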

Let’s check for outliers in our data. Checking the externally studentized residuals, we find that many observations are flagged. Here are the ten observations with the largest values:

##      7253      1316      4412     15256      8093      6509      2865     10447 
## 19.560336 13.826516 12.735047  9.091270  9.063929  8.992679  8.982808  8.668303 
##      7990     20461 
##  8.090045  7.599943
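A minimal sketch of how externally studentized residuals can be obtained and screened is given below. It uses a small deterministic toy data set with one planted outlier. The Bonferroni-style cutoff is an assumption, since the report does not state which cutoff was used to flag observations.

```r
# Toy data: a straight line with a small deterministic wiggle
x <- 1:30
y <- 3 + 2 * x + sin(x)
y[15] <- y[15] + 40          # plant one obvious outlier at observation 15

fit <- lm(y ~ x)

# Externally studentized residuals: each observation is left out when
# estimating the error variance used to scale its own residual
t_ext <- rstudent(fit)

# Largest absolute values first
print(head(sort(abs(t_ext), decreasing = TRUE), 3))

# Assumed cutoff: Bonferroni-adjusted t quantile at an overall 0.05 level
n <- length(y)
p <- length(coef(fit))
crit <- qt(1 - 0.05 / (2 * n), df = n - p - 1)
flagged <- which(abs(t_ext) > crit)
print(flagged)
```

In this toy example only the planted observation exceeds the cutoff.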

The top outliers flagged above all have very high grades, very large amounts of living space, and a large number of bedrooms and bathrooms. This is why they are flagged as outliers. It’s fair to leave these in the model, but just for testing’s sake, let’s remove the top five outliers. If we do this, running a regression again yields this result:

## 
## Call:
## lm(formula = price ~ sqft_living + sqft_lot + bedrooms + grade + 
##     bathrooms, data = data.no_out)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1002941  -133804   -23295    96779  4357016 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.900e+05  1.461e+04 -33.534  < 2e-16 ***
## sqft_living  2.197e+02  3.571e+00  61.508  < 2e-16 ***
## sqft_lot    -3.016e-01  4.070e-02  -7.410 1.31e-13 ***
## bedrooms    -3.761e+04  2.251e+03 -16.709  < 2e-16 ***
## grade        9.955e+04  2.271e+03  43.842  < 2e-16 ***
## bathrooms   -2.754e+04  3.410e+03  -8.078 6.95e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 242700 on 21602 degrees of freedom
## Multiple R-squared:  0.5448, Adjusted R-squared:  0.5447 
## F-statistic:  5171 on 5 and 21602 DF,  p-value: < 2.2e-16

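The Adjusted R-squared value rose by 0.0003, a slight improvement over the model with the outliers. (This model was fit on data.no_out, which has 21,602 residual degrees of freedom, rather than on the training subset used above, so the comparison is only approximate.) This change is not large enough to warrant removing the data, so we keep the outliers. In the model that retains them, an increase of 1 unit in grade corresponds to an increase in price of about $97,570 ($99,550 with the outliers removed), holding the other predictors constant. This suggests that grade has a very strong effect on the price of the house, arguably even more so than the factors corresponding to house size.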

## sqft_living    sqft_lot    bedrooms       grade   bathrooms 
##    3.903655    1.042611    1.606075    2.607466    2.517376

From these variance inflation factor (VIF) values, we can conclude that multicollinearity isn’t a significant issue in this model because all values are below the commonly used thresholds of 5 and 10. This means that the predictor variables in our model are not too highly correlated with each other. The ‘sqft_living’ variable has the highest VIF (3.904), suggesting it’s the most correlated with the other variables. This makes sense: the more space a house has, the more bedrooms and bathrooms it is likely to have.
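For readers unfamiliar with the variance inflation factor, it can be computed by hand as 1 / (1 - R²), where R² comes from regressing one predictor on all the others. The sketch below does this in base R on simulated predictors; the data and the name vif_manual are assumptions for illustration, and the report itself may have used a packaged function instead.

```r
set.seed(1)
n <- 200
sqft_living <- runif(n, 500, 5000)
# Bedrooms are simulated to grow with living space, so the two are correlated
bedrooms <- round(1 + sqft_living / 1000 + rnorm(n))
sqft_lot <- runif(n, 2000, 20000)
X <- data.frame(sqft_living, bedrooms, sqft_lot)

# For each predictor, regress it on the remaining predictors
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})
print(round(vif_manual, 3))
```

A VIF of 1 means a predictor is uncorrelated with the others, and larger values indicate that more of its variance is shared with them.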

From this, we can see that the variance is not constant throughout our dataset. We can fix this through a y-transformation (‘price’ being our response variable). Creating a boxcox plot, we get this:

From the Box-Cox plot, we know that we can perform a log transformation. After doing this, these are our results:

## 
## Call:
## lm(formula = price.star ~ sqft_living + sqft_lot + bedrooms + 
##     grade + bathrooms, data = training)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.70912 -0.25110  0.00167  0.23456  1.28450 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.124e+01  2.937e-02 382.492  < 2e-16 ***
## sqft_living  2.327e-04  7.108e-06  32.739  < 2e-16 ***
## sqft_lot    -1.464e-07  7.682e-08  -1.906   0.0566 .  
## bedrooms    -2.424e-02  4.458e-03  -5.438 5.52e-08 ***
## grade        1.862e-01  4.554e-03  40.894  < 2e-16 ***
## bathrooms   -7.224e-03  6.899e-03  -1.047   0.2951    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3513 on 10800 degrees of freedom
## Multiple R-squared:  0.5542, Adjusted R-squared:  0.5539 
## F-statistic:  2685 on 5 and 10800 DF,  p-value: < 2.2e-16

As the residual plot shows, this fixes the problem we were having earlier, where assumption 2 (constant variance of the errors) was not being met. The variance now appears constant across the fitted values.

  • Residuals: These are the differences between the observed and predicted values of the dependent variable. The summary statistics of the residuals can tell us about the fit of the model. Since the residual plot shows an even, flat spread of residuals, the mean-zero and constant-variance assumptions of linear regression appear to be met.

  • Coefficients: The Estimate column gives you the estimated regression coefficients for the intercept and each of your predictor variables:

  • (Intercept): The estimated intercept is 11.24. This means that \(e^{11.24} = 76114.95\) is the predicted value of price when all predictor variables are zero. In this context, the intercept has no practical interpretation since it’s not meaningful to have zero square footage, bedrooms, grade, or bathrooms.

  • Sqft_living: For every one-unit increase in living-space square footage, the price is predicted to be multiplied by \(e^{0.0002327} = 1.00023\), assuming all other variables are held constant.

  • Sqft_lot: For every one-unit increase in lot square footage, the price is predicted to be multiplied by \(e^{-0.0000001464} \approx 0.99999985\), a very slight decrease, assuming all other variables are held constant. This coefficient is not significant at the 0.05 level (p = 0.0566).

  • Bedrooms: For each additional bedroom, the price is predicted to be multiplied by \(e^{-0.02424} \approx 0.9761\), a decrease of about 2.4%, assuming all other variables are held constant. This might be counterintuitive and could potentially suggest multicollinearity or other data issues.

  • Grade: For each one-level increase in grade, the price is predicted to be multiplied by \(e^{0.1862} \approx 1.205\), an increase of about 20.5%, assuming all other variables are held constant.

  • Bathrooms: For each additional bathroom, the price is predicted to be multiplied by \(e^{-0.007224} \approx 0.9928\), a decrease of about 0.7%, assuming all other variables are held constant. This coefficient is not significant (p = 0.2951).

  • Significance codes: The stars (or lack thereof) next to the coefficients correspond to the p-values for the hypothesis tests that each coefficient equals zero. Here, the coefficients for sqft_lot (p = 0.0566) and bathrooms (p = 0.2951) are not significantly different from zero at the common 0.05 threshold, suggesting that lot size and the number of bathrooms might not be significant predictors of log price in the presence of the other predictors.

  • Residual standard error: This is the standard deviation of the residuals, which is a measure of the typical size of the residuals.

  • Multiple R-squared: This is the proportion of variance in the dependent variable that can be explained by the independent variables. In this case, about 55.42% of the variability in log price can be explained by living-space square footage, lot square footage, number of bedrooms, grade, and number of bathrooms.

  • Adjusted R-squared: This is the proportion of variance in the dependent variable that can be explained by the independent variables, but it penalizes you for including unnecessary predictors in your model. The fact that it is almost the same as the multiple R-squared suggests that the model is not being penalized much for the number of predictors it includes.

  • F-statistic and p-value: These are the results of the hypothesis test that all of the regression coefficients are zero. A low p-value (here, essentially zero) provides strong evidence against the null hypothesis that all regression coefficients are zero. Since our p-value is nearly zero, this suggests that our regression is useful in predicting house price.
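Because the response in this model is the logarithm of price, each coefficient is easiest to read after exponentiating it. The sketch below applies that conversion to the estimates printed in the summary above; the vector name coefs is hypothetical, and the values are the rounded estimates from the output rather than a refit of the model.

```r
# Rounded estimates copied from the log-price model summary
coefs <- c(
  sqft_living = 2.327e-04,
  sqft_lot    = -1.464e-07,
  bedrooms    = -2.424e-02,
  grade       = 1.862e-01,
  bathrooms   = -7.224e-03
)

# Multiplicative change in predicted price per one-unit increase
multiplier <- exp(coefs)

# The same change expressed as a percentage
percent_change <- 100 * (multiplier - 1)

print(round(multiplier, 6))
print(round(percent_change, 3))
```

A multiplier below 1 corresponds to a negative coefficient and a predicted decrease in price, while a multiplier above 1 corresponds to a predicted increase.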

Why is the coefficient for bedrooms negative, implying that the more bedrooms a house has, the less it costs? Based on the visualizations of the number of bedrooms, the price appears to increase up to five bedrooms but decrease rapidly beyond that. There are also houses with many bedrooms but relatively mediocre prices. This may account for the negative coefficient: few houses have more than five bedrooms in practice, and too many bedrooms may actually bring the price down. It’s also worth noting that this is a multivariable regression, so the coefficient describes the effect of bedrooms in the presence of the other variables (for example, at a fixed square footage) and is not necessarily the individual effect of the predictor on the response.

SECTION 6: Logistic Regression Data Visualizations

Our second question is, “can we predict if the house has a zip code within one of the ‘20 Wealthiest Zip Codes’ in King County using the size of the house, condition of the house, and size of neighboring houses?” We took the variable in our dataset named zipcode and split it into two categories:

  1. The house is located in a wealthy zip code (TRUE)

  2. The house is not located in a wealthy zip code (FALSE)

The determination of whether or not a zip code was considered “wealthy” was taken from a local news outlet in Seattle, Washington. They state that their selection was based on data from the United States Internal Revenue Service (IRS) and calculated using the average income of homeowners in each zip code. These 20 zip codes in King County have an average income of more than $120,000 before deductions, according to the IRS.

In order to answer the question, the following variables were used in the initial model:

  1. Number of bedrooms (bedrooms)

  2. Number of bathrooms (bathrooms)

  3. Square-ft living area (sqft_living)

  4. Square-ft property area (sqft_lot)

  5. Number of floors (floors)

  6. Condition level (condition)

  7. Grade level (grade_group)

  8. Square-ft living area for the 15 nearest houses (sqft_living15)

  9. Square-ft lot area for the 15 nearest properties (sqft_lot15)

Some visualizations were made in order to assess what can be expected when testing the logistic regression model.

Figure 6.1 and Figure 6.2 show the proportion of houses within the zip code category based on number of bedrooms and bathrooms (respectively). A general trend that can be seen in Figure 6.1 is that the proportion of houses within wealthier zip codes increases as the number of bedrooms increase. The same trend can be observed in Figure 6.2. The proportion of houses within wealthier zip codes increases as the number of bathrooms increase. It is also noted that in Figure 6.1, there seems to be a possible outlier as a data point claims to have 33 bedrooms. This is suspected to be a typo or error of some sort.

Figure 6.3 and Figure 6.4 are density plots of the area of the house and the area of the property for the two zip code categories. Figure 6.3 suggests a possibly significant difference between the area of houses in wealthier zip codes and in non-wealthier zip codes. Meanwhile, Figure 6.4 doesn’t show much of a difference in density between the property areas in each zip code category, suggesting that we may drop this variable from the final model.

Figure 6.5 is the proportion of houses in zip code category by how many floors the houses have. Surprisingly, the bar graph does not suggest a positive correlation between number of floors and proportion of houses in wealthier zip codes. Instead, it seems like the “golden spot” for the number of floors for houses in wealthier zip codes is around 2 floors. In the future, we may consider combining houses with 1 and 1.5 floors, 2 and 2.5 floors, and 3 and 3.5 floors together to make 3 categories within the number of floors variable which might give more insight on whether or not there’s a significant difference in the number of floors for houses in wealthier zip codes.

Figure 6.6 is the proportion of houses in each zip code category based on the condition of the houses. The graphs show that more of the houses in poor condition (1 and 2) are in non-wealthier zip codes. However, between fairly good and very good conditions (3-5), there is not much of a difference in proportion, suggesting that the condition level of a house may not be significant in predicting whether that house resides in a wealthier zip code.

Figure 6.7 shows the proportion of houses within each zip code category by grade level. There is a clear, positive correlation between the grade level and proportion of houses in wealthier zip codes, suggesting that the houses with higher construction quality and design can be found in wealthier zip codes. This also suggests that there is a significant difference in grade level between houses in wealthier zip codes and houses in non-wealthier zip codes.

Figure 6.8 and Figure 6.9 are density plots of the area of the nearest 15 houses and the area of the nearest 15 properties for the two zip code categories. Keep in mind that the former refers to the square-ft area of the house, while the latter refers to the square-ft area of the house plus the garden and any other exterior land that belongs to the house. In Figure 6.8, there is a more apparent difference between the two zip code categories based on the area of the 15 nearest neighboring houses. The graph shows that houses in wealthier zip codes tend to be surrounded by other houses with more area, so there may be a significant difference between the two zip code categories.

However, Figure 6.9 doesn’t seem to show too much of a difference in mean area for the 15 nearest neighboring properties. This indicates that there may or may not be a significant difference in mean area for the 15 nearest neighboring properties between the two zip code categories.

So far, based on our exploratory data analysis, the area of the property, the condition level, and the area for the 15 nearest properties seem to be insignificant variables and could be potentially dropped. In order to test the significance of the variables, a logistic regression was run on the model.

SECTION 7: Logistic Regression Model

In order to answer the question posed in Section 6, the initial model consisted of the 9 variables associated with the size of the house, condition of the house, and size of neighboring houses (refer to the list in Section 6). The reasoning for testing these variables is that bigger houses in better condition are usually associated with wealthy neighborhoods. In addition, if the surrounding houses or the surrounding properties are bigger, we predict that the odds of that house residing in a wealthy zip code are higher. However, it is also a known trend that smaller houses in urban areas are more likely to be expensive than those in rural areas. So theoretically, zip codes for urban areas could be in the wealthier zip code category but have houses that are smaller.
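The kind of model described here can be sketched on simulated data as shown below. Everything in this example is an assumption made for illustration: the data frame toy, the two predictors chosen, and the coefficients used to simulate the response are not taken from the King County data.

```r
set.seed(7)
n <- 1000
sqft_living15 <- runif(n, 800, 4000)
bedrooms <- sample(1:6, n, replace = TRUE)

# Assumed relationship, used only to simulate a 0/1 response
p_true <- plogis(-4 + 0.0015 * sqft_living15 - 0.1 * bedrooms)
wealthy <- rbinom(n, size = 1, prob = p_true)
toy <- data.frame(wealthy, sqft_living15, bedrooms)

# Logistic regression of the wealthy indicator on the predictors
fit <- glm(wealthy ~ sqft_living15 + bedrooms, family = "binomial", data = toy)

# Exponentiated coefficients are odds ratios per one-unit increase
print(exp(coef(fit)))

# Predicted probability of being in a wealthy zip code for each row
p_hat <- predict(fit, type = "response")
print(summary(p_hat))
```

An odds ratio above 1 means the odds of being in a wealthy zip code rise as the predictor increases, which is the direction hypothesized for house size and neighboring house size.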

We tested a regression model consisting of the 9 variables using a training subset of the King County data, resulting in the following output (Model 7.1):

Model 7.1

## 
## Call:
## glm(formula = wealthy ~ ., family = "binomial", data = train)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -1.341e+01  1.372e+02  -0.098 0.922097    
## bedrooms            -1.703e-01  2.725e-02  -6.248 4.17e-10 ***
## bathrooms            1.723e-01  4.230e-02   4.073 4.64e-05 ***
## sqft_living          5.915e-05  4.233e-05   1.398 0.162261    
## sqft_lot            -1.391e-06  8.752e-07  -1.589 0.112042    
## floors1.5           -1.716e-01  7.670e-02  -2.237 0.025312 *  
## floors2              1.667e-02  4.789e-02   0.348 0.727743    
## floors2.5            4.770e-01  1.948e-01   2.449 0.014323 *  
## floors3             -4.802e-01  1.419e-01  -3.385 0.000712 ***
## floors3.5           -1.346e+00  1.346e+00  -1.000 0.317079    
## condition            1.140e-01  3.136e-02   3.635 0.000278 ***
## grade_groupmod_poor  8.901e+00  1.372e+02   0.065 0.948264    
## grade_groupaverage   9.483e+00  1.372e+02   0.069 0.944884    
## grade_grouphigh      9.316e+00  1.372e+02   0.068 0.945855    
## sqft_living15        1.360e-03  4.385e-05  31.008  < 2e-16 ***
## sqft_lot15          -6.002e-06  1.303e-06  -4.608 4.07e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 20551  on 17289  degrees of freedom
## Residual deviance: 17374  on 17274  degrees of freedom
## AIC: 17406
## 
## Number of Fisher Scoring iterations: 10

We tested whether this overall model was more useful than an intercept-only model by calculating a p-value for the model. The p-value was essentially 0, meaning that the current model is more useful than the intercept-only model and we can move forward with testing and improving this full model.
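One common way to carry out this overall test is a likelihood ratio test that compares the null deviance with the residual deviance. The sketch below applies it to the rounded deviances printed for Model 7.1; whether the report used exactly this calculation is an assumption.

```r
# Values copied from the Model 7.1 summary
null_dev  <- 20551
null_df   <- 17289
resid_dev <- 17374
resid_df  <- 17274

# Test statistic: drop in deviance, with df equal to the number of slopes
g2 <- null_dev - resid_dev
df <- null_df - resid_df

# Upper-tail chi-square probability is the p-value
p_value <- pchisq(g2, df = df, lower.tail = FALSE)

print(c(statistic = g2, df = df, p_value = p_value))
```

A drop in deviance of this size on 15 degrees of freedom gives a p-value that is indistinguishable from 0, consistent with the conclusion stated above.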

From here, two different models were tested. The first (Model 7.2) dropped the variables that our EDA suggested may be insignificant. We ran a partial test to see if we could drop the betas associated with the area of the property, the condition level, and the area of the 15 nearest properties (sqft_lot, condition, and sqft_lot15). The p-value we obtained was 1, which we took to mean that we could drop these betas. This result should be read with caution: the residual deviances printed for Model 7.1 (17374) and Model 7.2 (17481) differ by 107 on 3 degrees of freedom, and a difference that large would ordinarily give a p-value near 0 rather than 1. The calculated AUC for this model was 0.7628711, suggesting that this model predicts fairly well.

Model 7.2

## 
## Call:
## glm(formula = wealthy ~ bedrooms + bathrooms + sqft_living + 
##     grade_group + floors + sqft_living15, family = "binomial", 
##     data = train)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -1.337e+01  1.369e+02  -0.098 0.922216    
## bedrooms            -1.365e-01  2.685e-02  -5.082 3.73e-07 ***
## bathrooms            1.772e-01  4.210e-02   4.210 2.56e-05 ***
## sqft_living          1.051e-05  4.148e-05   0.253 0.799939    
## grade_groupmod_poor  9.188e+00  1.369e+02   0.067 0.946487    
## grade_groupaverage   9.797e+00  1.369e+02   0.072 0.942942    
## grade_grouphigh      9.639e+00  1.369e+02   0.070 0.943859    
## floors1.5           -1.610e-01  7.637e-02  -2.108 0.035000 *  
## floors2              6.192e-03  4.586e-02   0.135 0.892600    
## floors2.5            4.852e-01  1.922e-01   2.524 0.011592 *  
## floors3             -4.666e-01  1.403e-01  -3.325 0.000883 ***
## floors3.5           -1.134e+00  1.290e+00  -0.879 0.379390    
## sqft_living15        1.317e-03  4.304e-05  30.604  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 20551  on 17289  degrees of freedom
## Residual deviance: 17481  on 17277  degrees of freedom
## AIC: 17507
## 
## Number of Fisher Scoring iterations: 10

The second model (Model 7.3) dropped the variables that we judged insignificant from their individual p-values (floors, condition, and grade). Note that we kept sqft_living and sqft_lot because their p-values were borderline and we expected at least one of them to become significant in the refitted model. A partial test was conducted to see whether we could drop the betas for floors, condition, and grade. The p-value was 1, meaning we could drop these betas. The calculated AUC for this model was 0.7623842, also suggesting that the model predicts fairly well.

Comparing the AUCs of Model 7.2 and Model 7.3 (0.7628711 versus 0.7623842), we saw very little difference, and we chose Model 7.3 to move forward with. The reason is that logistic regression is unreliable when there is substantial multicollinearity among the predictors, and in Model 7.3 the multicollinearity is better controlled than in Model 7.1 and Model 7.2. The final model is shown below:

Model 7.3

## 
## Call:
## glm(formula = wealthy ~ bedrooms + bathrooms + sqft_living + 
##     sqft_lot + sqft_living15 + sqft_lot15, family = "binomial", 
##     data = train)
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -3.798e+00  9.210e-02 -41.242  < 2e-16 ***
## bedrooms      -1.418e-01  2.643e-02  -5.365 8.11e-08 ***
## bathrooms      1.934e-01  3.826e-02   5.054 4.33e-07 ***
## sqft_living    6.327e-05  4.062e-05   1.557   0.1194    
## sqft_lot      -1.600e-06  8.853e-07  -1.807   0.0708 .  
## sqft_living15  1.384e-03  4.310e-05  32.105  < 2e-16 ***
## sqft_lot15    -5.822e-06  1.305e-06  -4.461 8.14e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 20551  on 17289  degrees of freedom
## Residual deviance: 17460  on 17283  degrees of freedom
## AIC: 17474
## 
## Number of Fisher Scoring iterations: 4
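The partial tests behind Model 7.2 and Model 7.3 can be cross-checked against the residual deviances printed in the three summaries. The sketch below (Python, for illustration only) takes the increase in deviance of each reduced model over Model 7.1 and compares it with the upper 5% chi-square critical value for the matching degrees of freedom. The critical values are standard table values, not output from the analysis, and `partial_test` is a name introduced here.

```python
# Residual deviance, residual df, and AIC copied from the printed summaries.
models = {
    "7.1": {"deviance": 17374, "df": 17274, "aic": 17406},
    "7.2": {"deviance": 17481, "df": 17277, "aic": 17507},
    "7.3": {"deviance": 17460, "df": 17283, "aic": 17474},
}

# Upper 5% critical values of the chi-square distribution (standard tables).
CRITICAL_05 = {3: 7.815, 9: 16.919}

def partial_test(reduced, full):
    """Return (deviance increase, df difference, exceeds 5% critical value)."""
    stat = reduced["deviance"] - full["deviance"]
    df = reduced["df"] - full["df"]
    return stat, df, stat > CRITICAL_05[df]

results = {name: partial_test(models[name], models["7.1"]) for name in ("7.2", "7.3")}
for name, (stat, df, exceeds) in results.items():
    print(f"Model {name} vs 7.1: statistic = {stat} on {df} df, "
          f"critical value = {CRITICAL_05[df]}, exceeds = {exceeds}")

best_aic = min(models, key=lambda name: models[name]["aic"])
print(f"Lowest AIC: Model {best_aic}")
```

With the printed values the statistics are 107 on 3 degrees of freedom for Model 7.2 and 86 on 9 for Model 7.3, both far above the critical values. That corresponds to p-values near 0 rather than 1, which would mean the dropped terms do contribute to the fit, so the reported partial-test p-values may be worth rechecking (a value of 1 can arise if the deviance difference is taken with the wrong sign or the lower tail is used). Model 7.1 also has the lowest AIC of the three.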

What is interesting about this final model is that, with all other variables held constant, the odds of a house being in a wealthier zip code decrease as the number of bedrooms increases (the bedrooms coefficient is negative). That contradicts our initial belief. The same observation holds for the lot area of the property and for the lot area of the 15 nearest properties, although the sqft_lot coefficient is only marginally significant (p = 0.0708). The model predicts that the odds of a house being in a wealthier zip code decrease for houses with larger lots or houses that are surrounded by larger lots.
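To make the direction and size of these effects concrete, the Model 7.3 coefficients can be converted to odds ratios by exponentiating them. The sketch below (Python, for illustration only) does this for coefficients copied from the summary. The rescaling to 100 and 1,000 square feet is a choice of units made here for readability and is not part of the original analysis.

```python
import math

# Coefficient estimates copied from the Model 7.3 summary (log-odds scale).
coef = {
    "bedrooms": -1.418e-01,
    "bathrooms": 1.934e-01,
    "sqft_lot": -1.600e-06,
    "sqft_living15": 1.384e-03,
    "sqft_lot15": -5.822e-06,
}

# Size of the increase in each predictor that the odds ratio refers to
# (illustrative units: one room, 100 sq ft of living area, 1,000 sq ft of lot).
step = {
    "bedrooms": 1,
    "bathrooms": 1,
    "sqft_lot": 1000,
    "sqft_living15": 100,
    "sqft_lot15": 1000,
}

# Odds ratio for a `step`-sized increase, all other predictors held constant.
odds_ratio = {name: math.exp(beta * step[name]) for name, beta in coef.items()}

for name, ratio in odds_ratio.items():
    direction = "decrease" if ratio < 1 else "increase"
    print(f"{name} (+{step[name]}): odds ratio = {ratio:.4f} ({direction})")
```

An odds ratio below 1 means the odds fall as the predictor rises. Here each additional bedroom multiplies the odds by about 0.87 (roughly a 13% decrease), while each additional 100 square feet of neighboring living area multiplies them by about 1.15.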

Moving forward with Model 7.3, we created a ROC curve (Figure 7.1) and a confusion matrix (Table 7.1) in order to better assess and visualize the predictive ability of our current model on test data.

Our ROC curve visualizes how well our model predicts which zip code category a house resides in. The curve suggests that our model predicts the zip code category better than random guessing. The associated AUC value, which is the area under the ROC curve, was 0.7623842 (0.5 corresponds to random guessing and 1 to perfect classification), which we consider acceptable.
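For readers less familiar with AUC, it equals the probability that a randomly chosen house from a wealthier zip code receives a higher predicted probability than a randomly chosen house from a non-wealthier one. The sketch below (Python, for illustration only) computes it that way on four made-up predictions. The labels and scores are hypothetical and are not taken from the test data.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical example: two non-wealthy and two wealthy houses.
labels = [False, False, True, True]
scores = [0.10, 0.40, 0.35, 0.80]

toy_auc = auc(labels, scores)
print(f"AUC on the toy example: {toy_auc}")
```

In the toy example three of the four wealthy/non-wealthy pairs are ordered correctly, so the AUC is 0.75. A model that assigned scores at random would average 0.5.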

Table 7.1

##        
##         FALSE TRUE
##   FALSE  2877  216
##   TRUE    833  397

Using Table 7.1, the accuracy of our model was calculated to be around 75.7% ((2877 + 397) of 4323 houses); however, the sensitivity of our model is only around 64.8% (397 of the 613 houses in the TRUE column). This means that out of all the houses residing in wealthier zip codes within our testing data, our model correctly classified 64.8% as being in wealthier zip codes. The specificity of our model is around 77.5% (2877 of the 3710 houses in the FALSE column). This means that out of all the houses residing in non-wealthier zip codes within our testing data, our model correctly classified 77.5% as being in non-wealthier zip codes.
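The figures above can be reproduced from the four cells of Table 7.1. Because the table does not say whether rows or columns hold the actual classes, the sketch below (Python, for illustration only) computes the rates both ways. It also compares the training null deviance with what each margin's TRUE share would imply, assuming the test split has a class balance similar to the training data; `null_deviance` is a helper introduced here.

```python
import math

# Counts copied from Table 7.1, keyed as (row label, column label).
table = {
    (False, False): 2877, (False, True): 216,
    (True, False): 833, (True, True): 397,
}
total = sum(table.values())
accuracy = (table[False, False] + table[True, True]) / total

# Rates taken down each column (denominator = column total).
col_true_rate = table[True, True] / (table[False, True] + table[True, True])
col_false_rate = table[False, False] / (table[False, False] + table[True, False])

# Rates taken along each row (denominator = row total).
row_true_rate = table[True, True] / (table[True, False] + table[True, True])
row_false_rate = table[False, False] / (table[False, False] + table[False, True])

def null_deviance(n, share):
    """Null deviance of an intercept-only binomial model with a given TRUE share."""
    return -2 * n * (share * math.log(share) + (1 - share) * math.log(1 - share))

n_train = 17289 + 1  # null degrees of freedom + 1, from the model summaries
row_share = (table[True, False] + table[True, True]) / total
col_share = (table[False, True] + table[True, True]) / total

print(f"accuracy = {accuracy:.3f}")
print(f"column-wise: TRUE rate = {col_true_rate:.3f}, FALSE rate = {col_false_rate:.3f}")
print(f"row-wise:    TRUE rate = {row_true_rate:.3f}, FALSE rate = {row_false_rate:.3f}")
print(f"implied null deviance, row share {row_share:.3f}: {null_deviance(n_train, row_share):.0f}")
print(f"implied null deviance, column share {col_share:.3f}: {null_deviance(n_train, col_share):.0f}")
```

The column-wise rates (64.8% and 77.5%) reproduce the sensitivity and specificity quoted above, so those figures treat the columns as the actual classes. If the rows are the actual classes instead, the same cells give a sensitivity of about 32.3% and a specificity of about 93.0%, and the quoted figures would be the positive and negative predictive values. The orientation of Table 7.1 should therefore be confirmed. One hint is that the training null deviance of 20551 is close to what the row totals' TRUE share of 28.5% would produce (roughly 20,650) and far from what the column totals' share of 14.2% would produce (roughly 14,100).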

Can we predict whether a house has a zip code considered one of the “20 Wealthiest Zip Codes” in King County (using size, condition, and size of neighboring houses)? Our model indicates that the condition of the house is not a significant predictor of the odds of a house being in a wealthier zip code. However, overall, our model does a fair job of predicting which zip code category a house resides in using the size of the house and the size of neighboring houses. To improve the model's performance, we suggest exploring interaction terms, higher-order terms that capture curvilinear relationships between the predictor variables and the response variable, and additional variables such as the location of the house (rural vs. urban).