Modeling Home Prices In Boulder County, CO

Introduction

In the fifteen years since its founding in 2006, Zillow has become synonymous with high-quality home price predictions. Following the national reckoning in the summer of 2020 with the myriad ways racism is embedded in American culture, many people within Zillow began questioning our product’s reliance on data collected by police departments. To this end, our research group has spent the last few weeks exploring alternative indicators that might replace crime data. We developed these metrics within Boulder County, CO. Although the model we developed lacks the predictive power of Zillow’s current model, we hope that our work on alternative indicators may help Zillow become a leader again, this time in prioritizing the public good within algorithm design.

Data and Data Sources

This report outlines our efforts to build a price prediction model for Boulder County, Colorado, without relying on data generated by police departments. This effort stems from a desire to move away from models reliant on data produced by institutions with well-documented and persistent racial biases. Instead, our team has developed indicators that measure access to public amenities such as trails, school district membership, the demographics of the areas in which homes are located, and proximity to polluters relevant to Boulder’s ozone issue.

Boulder’s Ozone Issue
It’s not unusual for Boulder County residents to receive warnings about high ozone on hot summer days. But due to “emissions from oil and gas development, power plants, and other industrial activity”, many counties in Colorado struggle with dangerously high ozone levels all year round. High amounts of ground-level ozone can make it hard to breathe and can even cause lung damage. While levels can get high enough to put everyone at risk in Boulder, older people and residents with asthma are the most at risk.
Source: Boulder County, https://www.bouldercounty.org/environment/air/ozone/

The polluter variable was designed to capture outsized effects on the price of homes nearest to facilities that emit ozone-forming pollution. The impact of this indicator in our Boulder model was relatively modest, but we highlight it here because it was statistically significant. We believe similar indicators may prove more effective in other markets and may shed light on the economic cost of detrimental pollution.

Our model also incorporates the effects that both house features and other neighborhood amenities have on house prices. House features include measurements like the square footage of a house or categories like roof type.

In this initial investigation into alternative indicators, our team relied on free and open-source data sources. As a result, the data used in our model mainly comes from public agencies such as the County of Boulder, the U.S. Census, and the EPA.

#---- FEATURE BUILDING ----

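### use this nn function instead b/c of Hmisc (it also exports summarize(),
### so the dplyr verbs below are called with their namespace)
library(FNN)     # get.knnx()
library(dplyr)   # %>%, arrange(), group_by(), summarize(), select(), pull()
library(tidyr)   # gather()
library(tibble)  # rownames_to_column()

# For each row of measureFrom, return the mean distance to its k nearest rows of measureTo
nn_function2 <- function(measureFrom,measureTo,k) {
  measureFrom_Matrix <-
    as.matrix(measureFrom)
  measureTo_Matrix <-
    as.matrix(measureTo)
  # query the coerced matrices (the original passed the raw inputs and left these unused)
  nn <-   
    FNN::get.knnx(measureTo_Matrix, measureFrom_Matrix, k)$nn.dist
  output <-
    as.data.frame(nn) %>%
    rownames_to_column(var = "thisPoint") %>%
    gather(points, point_distance, V1:ncol(.)) %>%
    arrange(as.numeric(thisPoint)) %>%
    dplyr::group_by(thisPoint) %>%
    dplyr::summarize(pointDistance = mean(point_distance)) %>%
    arrange(as.numeric(thisPoint)) %>% 
    dplyr::select(-thisPoint) %>%
    pull()
  
  return(output)  
}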
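
To make concrete what this helper returns, here is a minimal base-R sketch of the same quantity: the mean distance from each point to its k nearest neighbors in another point set. It is a brute-force illustration, not the FNN-based code used in the report; the function name `mean_knn_distance` and the toy coordinates are hypothetical.

```r
# Hypothetical brute-force equivalent of the average k-nearest-neighbor distance.
mean_knn_distance <- function(from, to, k) {
  from <- as.matrix(from)
  to <- as.matrix(to)
  stopifnot(ncol(from) == ncol(to), k >= 1, k <= nrow(to))
  vapply(seq_len(nrow(from)), function(i) {
    # Euclidean distance from point i to every candidate point
    d <- sqrt(rowSums(sweep(to, 2, from[i, ])^2))
    # average the k smallest distances
    mean(sort(d)[seq_len(k)])
  }, numeric(1))
}

# Toy coordinates, made up for illustration (not Boulder data)
homes <- rbind(c(0, 0), c(10, 0))
facilities <- rbind(c(3, 4), c(0, 1), c(10, 2))

nn1 <- mean_knn_distance(homes, facilities, 1)  # distance to the single nearest facility
nn2 <- mean_knn_distance(homes, facilities, 2)  # mean distance to the two nearest
```

With k = 1 the feature is simply the distance to the closest facility; larger k smooths over the local cluster of facilities, which is why the report builds several versions (k = 2, 3, 4, 6) and compares them.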



#---- LN Price ----
# log(price + 1) keeps the transform defined for records with a price of 0
boulder_sales.sf$LNprice <- log(boulder_sales.sf$price+1)

  #---- Pollution Feature ----
st_c <- st_coordinates
boulder_sales.sf <-
  boulder_sales.sf %>%  
  dplyr::mutate(
    pollut_nn6 = nn_function2(st_c(boulder_sales.sf), st_c(O3Pollutants), 6),
    pollut_nn2 = nn_function2(st_c(boulder_sales.sf), st_c(O3Pollutants), 2), 
    pollut_nn3 = nn_function2(st_c(boulder_sales.sf), st_c(O3Pollutants), 3),
    pollut_nn4 = nn_function2(st_c(boulder_sales.sf), st_c(O3Pollutants), 4)) 
  
  #---- build trailhead feature -----
  # st_c is the st_coordinates alias defined in the pollution feature block above.
  # Open question: is there a threshold distance to the nearest trail worth engineering?
  # Even adding the feature in this way significantly improved the model.
  boulder_sales.sf <-
    boulder_sales.sf %>%
    mutate(
      trail_nn1 = nn_function2(st_c(boulder_sales.sf), st_c(trail_access_sf), 1),
      trail_nn2 = nn_function2(st_c(boulder_sales.sf), st_c(trail_access_sf), 2), 
      trail_nn3 = nn_function2(st_c(boulder_sales.sf), st_c(trail_access_sf), 3),
      trail_nn4 = nn_function2(st_c(boulder_sales.sf), st_c(trail_access_sf), 4)) 
  
  #---- build school district/catchment feature and in/out boulder feature ----
  boulder_sales.sf <- st_join(boulder_sales.sf, school_zones) 
  
  #---- build census tract feature  ----
  boulder_sales.sf <- st_join(boulder_sales.sf, boulder_tracts18) 
  
  #---------------- build spatial feature using nearest neighbor ----
  
  
  coords <- st_coordinates(boulder_sales.sf) 
  
  neighborList9 <- knn2nb(knearneigh(coords, 9)) 
  spatialWeights9 <- nb2listw(neighborList9, style="W")
  
  boulder_sales.sf$lagPrice <- lag.listw(spatialWeights9, boulder_sales.sf$price, NAOK=TRUE)
  
  #### Log of the spatial lag; there are no 0's, so we don't need +1
  boulder_sales.sf <- boulder_sales.sf %>% mutate(LNlag = log(lagPrice))
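
The spatial lag built above is, with row-standardized weights (style "W"), the average price of each sale's nearest neighbors. The base-R sketch below illustrates that idea on made-up data; it does not use spdep, and the function name `spatial_lag_knn`, the coordinates, and the prices are all hypothetical.

```r
# Hypothetical helper: mean value of the k nearest other observations.
spatial_lag_knn <- function(coords, values, k) {
  d <- as.matrix(dist(as.matrix(coords)))
  diag(d) <- Inf  # a sale is never its own neighbor
  vapply(seq_len(nrow(d)), function(i) {
    neighbors <- order(d[i, ])[seq_len(k)]
    mean(values[neighbors])
  }, numeric(1))
}

# Four toy sales along a line (illustrative values only)
coords <- cbind(x = c(0, 1, 2, 10), y = c(0, 0, 0, 0))
price <- c(400000, 500000, 600000, 900000)

lag_price <- spatial_lag_knn(coords, price, k = 2)
ln_lag <- log(lag_price)  # the lag is an average of positive prices, so no +1 is needed
```

Note that the isolated fourth sale still receives a lag from distant neighbors; this is one reason a nearest-neighbor lag can behave differently in sparsely settled areas.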
  
#---- DUMMY VARIABLES ----

#### Subset of data: drop the geometry and the columns not used as predictors
  subset_boulder_sf <- boulder_sales.sf %>% 
    st_drop_geometry() %>%
    dplyr::select(-builtYear, 
                  -Stories, 
                  -nbrBedRoom,
                  -nbrRoomsNobath, 
                  -nbrThreeQtrBaths, 
                  -bld_num, 
                  -section_num, 
                  -ConstCode, 
                  -carStorageTypeDscr, 
                  -year_quarter, 
                  -saleDate, 
                  -qualityCodeDscr, 
                  -ExtWallDscrSec,
                  -AcDscr,  
                  -GEOID,
                  -Latino,
                  -OBJECTID,
                  -Shapearea,
                  -Shapelen)


  
  str(subset_boulder_sf)
  
  boulder_subset_sf_cols <- data.frame(colnames(subset_boulder_sf))
  
#### With no columns specified, dummy_cols() automatically creates dummies for all character and factor columns

variablesofinterest <- fastDummies::dummy_cols(subset_boulder_sf)



#---- Clean and combine categorical variables ----
variablesofinterest <- variablesofinterest %>%
  mutate(Roof_tile= `Roof_CoverDscr_Clay Tile`+ `Roof_CoverDscr_Concrete Tile`,
         Roof_tar_and_rubber = `Roof_CoverDscr_Rubber Membrane` +`Roof_CoverDscr_Tar and Gravel`,
         Roof_assorted = `Roof_CoverDscr_Built-Up`+ `Roof_CoverDscr_Roll`+ `Roof_CoverDscr_Shake`,
         Const_concrete_and_precast = ConstCodeDscr_Concrete +ConstCodeDscr_Precast,
         Const_na_and_frame = ConstCodeDscr_+ConstCodeDscr_Frame, 
         Const_masonry = ConstCodeDscr_Masonry+ConstCodeDscr_Veneer,
         Ext_wall_log_asbestos = `ExtWallDscrPrim_Frame Asbestos` + ExtWallDscrPrim_Log + ExtWallDscrPrim_Metal,
         Ext_wall_stucco_strawbale_brick_blck= `ExtWallDscrPrim_Block Stucco`+`ExtWallDscrPrim_Frame Stucco` + ExtWallDscrPrim_Strawbale +`ExtWallDscrPrim_Brick on Block`, 
         Ext_wall_low_medium_fancy = ExtWallDscrPrim_ +`ExtWallDscrPrim_Brick Veneer`+ ExtWallDscrPrim_Cedar+`ExtWallDscrPrim_Cement Board`+`ExtWallDscrPrim_Frame Wood/Shake`,
         Ext_wall_medium_fancy =`ExtWallDscrPrim_Faux Stone`+`ExtWallDscrPrim_Moss Rock/Flagstone`+`ExtWallDscrPrim_Painted Block`+ExtWallDscrPrim_Vinyl,
         value_added_basement = `bsmtTypeDscr_GARDEN BASEMENT FINISHED AREA` + `bsmtTypeDscr_LOWER LVL WALKOUT FINISHED (BI-SPLIT LVL)`+ `bsmtTypeDscr_WALK-OUT BASEMENT FINISHED AREA`, 
         all_other_basements = `bsmtTypeDscr_LOWER LVL GARDEN UNFINISHED (BI-SPLIT LVL)`+ bsmtTypeDscr_0+ `bsmtTypeDscr_GARDEN BASEMENT UNFINISHED AREA`+ 
           `bsmtTypeDscr_LOWER LVL GARDEN FINISHED (BI-SPLIT LVL)`+ `bsmtTypeDscr_LOWER LVL WALKOUT UNFINISHED (BI-SPLIT LVL)`+ `bsmtTypeDscr_SUBTERRANEAN BASEMENT FINISHED AREA`+
           `bsmtTypeDscr_SUBTERRANEAN BASEMENT UNFINISHED AREA`,
         IntWallDscr_Drywall= IntWallDscr_Drywall+ `IntWallDscr_Wall Board`+IntWallDscr_Unfinished) %>%
  rename(., Roof_no_description = Roof_CoverDscr_,
         Roof_asphalt = Roof_CoverDscr_Asphalt,
         Roof_metal = Roof_CoverDscr_Metal,
         Const_brick=ConstCodeDscr_Brick) %>%
  dplyr::select(-`Roof_CoverDscr_Clay Tile`, -`Roof_CoverDscr_Concrete Tile`,
                -`Roof_CoverDscr_Rubber Membrane`, -`Roof_CoverDscr_Tar and Gravel`,
                -`Roof_CoverDscr_Built-Up`, -`Roof_CoverDscr_Roll`, -`Roof_CoverDscr_Shake`,
                - ConstCodeDscr_Wood, -ConstCodeDscr_Concrete, -ConstCodeDscr_Precast,
                -ConstCodeDscr_, -ConstCodeDscr_Frame, -ConstCodeDscr_Masonry, -ConstCodeDscr_Veneer,
                -`ExtWallDscrPrim_Frame Asbestos`,- ExtWallDscrPrim_Log, -ExtWallDscrPrim_Metal, 
                - `ExtWallDscrPrim_Block Stucco`, -`ExtWallDscrPrim_Frame Stucco`,- ExtWallDscrPrim_Strawbale, -`ExtWallDscrPrim_Brick on Block`,
                - `ExtWallDscrPrim_Brick Veneer`,- ExtWallDscrPrim_Cedar,-`ExtWallDscrPrim_Cement Board`,-`ExtWallDscrPrim_Frame Wood/Shake`,
                - `ExtWallDscrPrim_Faux Stone`,-`ExtWallDscrPrim_Painted Block`,-ExtWallDscrPrim_Vinyl, -`ExtWallDscrPrim_Moss Rock/Flagstone`,-ExtWallDscrPrim_,
                -`designCodeDscr_Split-level`,-`bldgClassDscr_SINGLE FAM RES IMPROVEMENTS`,
                -`bsmtTypeDscr_WALK-OUT BASEMENT UNFINISHED AREA`,
                - `bsmtTypeDscr_GARDEN BASEMENT FINISHED AREA`,- `bsmtTypeDscr_LOWER LVL WALKOUT FINISHED (BI-SPLIT LVL)`, -`bsmtTypeDscr_WALK-OUT BASEMENT FINISHED AREA`,
                -bsmtTypeDscr_0, -`bsmtTypeDscr_GARDEN BASEMENT UNFINISHED AREA`, -`bsmtTypeDscr_LOWER LVL GARDEN FINISHED (BI-SPLIT LVL)`, -`bsmtTypeDscr_LOWER LVL WALKOUT UNFINISHED (BI-SPLIT LVL)`, -`bsmtTypeDscr_SUBTERRANEAN BASEMENT FINISHED AREA`,
                -`bsmtTypeDscr_SUBTERRANEAN BASEMENT UNFINISHED AREA`, -`bsmtTypeDscr_LOWER LVL GARDEN UNFINISHED (BI-SPLIT LVL)`, -`IntWallDscr_Wall Board`, -IntWallDscr_Unfinished, -`HeatingDscr_No HVAC`,
                -`HeatingDscr_Electric Wall Heat (1500W)`, -HeatingDscr_Electric, -`HeatingDscr_Package Unit`, -`HeatingDscr_Ventilation Only`,
                -`HeatingDscr_Wall Furnace`,
                -`HeatingDscr_Hot Water`, -`HeatingDscr_Heat Pump`,-HeatingDscr_Gravity,- `HeatingDscr_Forced Air`)

variablesofinterest <- variablesofinterest %>%
  mutate(smaller_disticts= `district_Park School District R-3`+ `district_Thompson School District R-2J`) %>%
  dplyr::select(-`district_Park School District R-3`,- `district_Thompson School District R-2J`, 
                -designCodeDscr,-qualityCode, -bldgClassDscr,-ConstCodeDscr,-CompCode,-bsmtTypeDscr, -HeatingDscr,
                -nbrFullBaths, -nbrHalfBaths, -ExtWallDscrPrim, -IntWallDscr, -Roof_CoverDscr,
                -status_cd, -lagPrice, -district)
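
The pattern used above (create one 0/1 column per category, then add the columns of sparse categories into a single indicator) can be seen on a toy example. This sketch uses base R rather than fastDummies, and the `roof` vector is made up for illustration.

```r
# Made-up roof descriptions for six homes
roof <- c("Clay Tile", "Concrete Tile", "Asphalt", "Metal", "Asphalt", "Clay Tile")

# One 0/1 column per category, similar in spirit to what a dummy-column helper produces
dummies <- sapply(unique(roof), function(level) as.integer(roof == level))

# Because the categories are mutually exclusive, adding two dummy columns
# yields a valid 0/1 indicator for "either category"
roof_tile <- dummies[, "Clay Tile"] + dummies[, "Concrete Tile"]
```

Summing works only because each home falls in exactly one category; for overlapping flags, a logical OR would be the safer way to combine them.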

Our variables, or house price predictors, are listed in Table 1, along with some summary statistics. Our model was developed using the natural log of home prices, as it does a better job of accounting for houses with very high values. We’ll refer to this variable as LNPrice for the rest of the document. 

We’ve used a combination of house features and neighborhood amenity predictors. These predictors have a strong relationship with house price. In other words, we’ve selected predictors that have a statistically significant effect on house price, whether positive or negative.

While we want these predictors to have a relationship with house price, it would not be ideal for any one of our predictors to have a strong relationship with another predictor, because such relationships between independent variables (multicollinearity) produce less reliable models. We’ve used a correlation matrix to check whether this may be the case (Figure 1).
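
As a rough illustration of this check, the sketch below computes a correlation matrix on simulated data and lists predictor pairs whose absolute correlation exceeds a cutoff. The column names, the simulated relationships, and the 0.7 cutoff are assumptions chosen for the example, not values from our analysis.

```r
set.seed(42)
n <- 200
sqft <- rnorm(n, 1300, 600)

# Simulated stand-ins for a few continuous variables (hypothetical names)
toy <- data.frame(
  LNprice = 12 + 0.0005 * sqft + rnorm(n, sd = 0.2),
  mainfloorSF = sqft,
  carStorageSF = 0.3 * sqft + rnorm(n, sd = 50),  # deliberately tied to square footage
  saleYear = sample(2019:2021, n, replace = TRUE)
)

corr <- cor(toy)

# Look only at predictor-vs-predictor correlations
predictors <- setdiff(colnames(corr), "LNprice")
pc <- corr[predictors, predictors]

# Flag each pair once (upper triangle) when |r| exceeds the assumed cutoff
high <- which(abs(pc) > 0.7 & upper.tri(pc), arr.ind = TRUE)
flagged <- data.frame(
  var1 = predictors[high[, "row"]],
  var2 = predictors[high[, "col"]],
  r = pc[high]
)
```

A flagged pair is a candidate for dropping or combining one of the two variables before fitting the regression.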

--------------------------------------------------------------------------------------------------------------------------------------------
Statistic                                                      N       Mean         SD      Minimum Percentile(25) Percentile(75)  Maximum
--------------------------------------------------------------------------------------------------------------------------------------------
Price_NaturalLog                                             11,364   13.247       1.358     0.000      13.017         13.629       17.265
Basement Type Variable                                       11,364    0.880       0.324       0          1              1            1
Drywall Variable                                             11,364    0.932       0.253       0          1              1            1
Roof Type Variable                                           11,364    0.287       0.452       0          0              1            1
Ranch House Variable                                         11,364    0.320       0.466       0          0              1            1
Radiant Heat Floor Variable                                  11,364    0.008       0.089       0          0              0            1
Exterior Wall Variable                                       11,364    0.084       0.277       0          0              0            1
In/Out of Boulder Valley School District                     11,364    0.554       0.497       0          0              1            1
Distance from Nearest 3 Trail Heads                          11,364  1,731.314    980.009   46.453    1,027.802      2,234.521    10,275.150
Log of the Lag Price                                         11,364   13.384       0.457    12.258      13.044         13.643       15.163
Main Floor Square Footage                                    11,364  1,313.513    607.191      0         923           1,619        7,795
Sale Year                                                    11,364  2,019.792     0.729     2,019      2,019          2,020        2,021
Car Storage Square Footage                                   11,364   471.891     235.855      0         380            600         3,040
Proximity to Two Nearest Polluting Facilities                11,364  1,647.872   1,947.912   5.571     753.500       1,723.139    16,280.040
Percent of Population with Bachelors Degrees in Census Tract 11,364    0.014       0.018     0.000      0.004          0.018        0.129
Median Household Income                                      11,364 95,292.380  27,494.670  22,578      74,286        114,271      167,917
Age of House                                                 11,364   33.823      26.770       0          15             49          161
Price in Dollars                                             11,364 742,520.900 617,487.300    0       450,000        830,000     31,500,000
--------------------------------------------------------------------------------------------------------------------------------------------

Table 1

The correlation matrix below illustrates the strength of the relationship between each continuous independent variable in our model, such as square footage and house age, and LNPrice. The matrix also shows how these variables relate to one another. Ideally, each variable would be highly correlated with LNPrice and not with the others; in practice, the interactions are more complex. Indeed, the only continuous variable that is not correlated with other continuous variables in our model is the sale year. While home prices rose significantly between 2019 and 2021, the housing stock, or what was being valued by the market, did not substantially change.

Figure 2 presents four scatter plots that illustrate the relationship between selected features and the log of home prices. While there is a wide spread in each plot, prices have a roughly linear relationship with these indicators. The most powerful of these predictors is the log of the lag price; in Figure 2, its points cluster most closely along the fitted line.

The figure below, Home Sale Price, maps the cost of homes in the training set across the county; more saturated areas saw more home sales between 2019 and 2021. Even a cursory glance makes clear that most home sales were in the eastern portion of the county while far fewer transactions happened in the more mountainous western area of the county. Higher sale prices also appear to cluster in and around the City of Boulder and the southeastern part of the county; however, there is also a smattering of more highly valued homes in the county’s eastern portion.

The maps below visualize how select indicators present themselves for properties throughout Boulder County. It is worth noting here that while the measure of proximity to polluters and proximity to trailheads spatially cluster, availability of car storage seems to have more variation throughout the county.

Methods

The method we followed allowed us to identify, in a simple way, which house features and neighborhood amenities influence the price of a house. The statistical technique we employed was a linear regression fit by Ordinary Least Squares (OLS).

Process for Developing a Model to Estimate Home Sale Price in Boulder Colorado.
1. Get and clean the data
2. Engineer predictor variables (e.g. distance from nearest polluting facility)
3. Explore the relationships between all variables (Figure 1)
4. Split our data into two groups, allowing us to train and then test (~75% train, ~25% test)
5. Narrow down predictors and incorporate them into a model
6. Train the model by creating it on the train data subset
7. Test the model on the test data subset, and evaluate its performance
8. Perform a Cross Validation test, evaluate the model’s performance
9. Back to Step 5, until we were happy with the results
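
Steps 4 through 7 can be sketched on simulated data as follows. This is a minimal illustration under stated assumptions, not the code behind this report: the data frame `sales`, its columns `sqft` and `lnLag`, and the simulated coefficients are all made up for the example. The toy outcome is log(price), so `exp()` back-transforms it; with the report's log(price + 1), the back-transform would be `exp() - 1`.

```r
set.seed(1)
n <- 400

# Simulated sales (hypothetical predictors and coefficients)
sales <- data.frame(
  sqft = runif(n, 600, 4000),
  lnLag = rnorm(n, 13.3, 0.4)
)
sales$LNprice <- 4 + 0.65 * sales$lnLag + 0.0002 * sales$sqft + rnorm(n, sd = 0.2)

# Step 4: ~75% train, ~25% test
train_rows <- sample(seq_len(n), size = round(0.75 * n))
train <- sales[train_rows, ]
test <- sales[-train_rows, ]

# Steps 5-6: fit an OLS model on the training subset
fit <- lm(LNprice ~ sqft + lnLag, data = train)

# Step 7: predict on the test subset and evaluate in dollars
pred_price <- exp(predict(fit, newdata = test))
actual_price <- exp(test$LNprice)
mae <- mean(abs(pred_price - actual_price))                        # mean absolute error
mape <- mean(abs(pred_price - actual_price) / actual_price) * 100  # mean absolute percent error
```

Evaluating on held-out rows, rather than on the rows used for fitting, is what lets the error figures stand in for performance on unseen sales.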

After developing several indicators, our team split our data into two portions, allocating 75% to training a model and 25% to testing that model. We then created a correlation matrix of our continuous variables to see how these indicators interact and which combination might be best included in the model. From here, our team made a variety of models using these indicators and combinations of categorical variables. Experimenting in this way allowed us to determine which combination of categorical variables would enable us to create the most powerful model.

After testing on our test set and performing a cross validation test, we identified our strongest model. To further explore the weakness of our model, we evaluated whether we failed to account for various spatial factors. To do this, we investigated if the errors in our price predictions were clustered by Census Tract. If clustering of errors were apparent and proven to be statistically significant (Moran’s I), then our model would have lacked critical spatial predictors.
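
For readers unfamiliar with Moran’s I, the sketch below computes the statistic and a one-sided permutation p-value in base R on simulated errors. It illustrates the idea only and is not the spdep call used for this report; the helper names (`morans_i`, `knn_weights`, `perm_p`), the simulated coordinates, and the choice of five neighbors are assumptions.

```r
# Moran's I for values x and a spatial weights matrix w
morans_i <- function(x, w) {
  z <- x - mean(x)
  (length(x) / sum(w)) * sum(w * outer(z, z)) / sum(z^2)
}

# Row-standardized k-nearest-neighbor weights (each row sums to 1)
knn_weights <- function(coords, k) {
  d <- as.matrix(dist(coords))
  diag(d) <- Inf
  w <- matrix(0, nrow(d), nrow(d))
  for (i in seq_len(nrow(d))) w[i, order(d[i, ])[seq_len(k)]] <- 1 / k
  w
}

# Permutation test: how often does shuffling the errors over locations
# give a statistic at least as large as the observed one?
perm_p <- function(x, w, nsim = 999) {
  observed <- morans_i(x, w)
  simulated <- replicate(nsim, morans_i(sample(x), w))
  (sum(simulated >= observed) + 1) / (nsim + 1)
}

set.seed(7)
coords <- cbind(runif(150), runif(150))
w <- knn_weights(coords, 5)

clustered <- coords[, 1] + rnorm(150, sd = 0.1)  # errors that trend across space
random <- rnorm(150)                             # errors with no spatial pattern

i_clustered <- morans_i(clustered, w)
i_random <- morans_i(random, w)
p_clustered <- perm_p(clustered, w)
```

A statistic near zero with a large p-value suggests errors are scattered without spatial pattern; a large positive statistic with a small p-value suggests neighboring errors resemble one another.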

Additionally, we investigated our model for generalizability; that is to say, we investigated whether our model did an equally good job of predicting under different conditions. Due to racial segregation in many U.S. cities, predicting home prices in majority-white and majority-minority neighborhoods can be hard to accomplish with the same model if it is not generalizable. However, Boulder County is largely racially homogeneous. To test the generalizability of our model, we instead investigated whether the errors in our sale price predictions were roughly the same between houses located in neighborhoods with low and high median income, and between areas with a share of senior citizens below or above the national average.

Results: Effectiveness of Our Model

Altogether, the predictive power of our variables varied, and most of them significantly influenced house price. Table 2 shows that the predictor with the largest coefficient in our model was a neighborhood feature, the share of residents with a bachelor’s degree. It appears that higher educational attainment in a neighborhood has a positive relationship with home price. The house characteristic with the largest coefficient was the presence or absence of a radiant floor heating system.
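
Because the model’s outcome is the natural log of price, a coefficient on a 0/1 variable or a one-unit change can be read approximately as a percent change in price via (exp(β) - 1) × 100. The sketch below applies that conversion to three estimates from Table 2; the helper name `pct_effect` is hypothetical, and the reading assumes that log(price + 1) behaves like log(price) at typical price levels.

```r
# Approximate percent change in price implied by a log-price coefficient
pct_effect <- function(beta) (exp(beta) - 1) * 100

radiant_floor <- pct_effect(0.30309)   # Radiant Heat Floor Variable
sale_year <- pct_effect(0.08501)       # one additional Sale Year
ranch <- pct_effect(-0.15649)          # Ranch House Variable
```

Under these assumptions, radiant floor heat corresponds to a price roughly 35% higher, each later sale year to roughly 9% higher, and a ranch design to roughly 14% lower, holding the other predictors fixed. The coefficient on the log of the lag price is different in kind: with both sides in logs, it reads as an elasticity.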

Table 3 shows the mean absolute error of our model, that is, the average absolute difference between the actual price and the price our model predicts within our test set. On average, our model predicts within $153,938.40 or 19.97% of the actual price.

Table 4 shows our cross-validation results; comparing this with Table 2, the models are fairly similar. This indicates to us that the model we developed is reasonably accurate for our purposes. In the histogram of errors from a 100-fold cross-validation, the errors cluster between 1.18 and 1.20, in the neighborhood of the mean absolute error we found testing our model on our training data.
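
A k-fold cross-validation of this kind can be sketched in base R as below. The simulated data, the column names, and the use of mean absolute error on the log scale as the per-fold metric are assumptions for illustration; they do not reproduce the figures reported here.

```r
set.seed(2021)
n <- 500

# Simulated sales (hypothetical predictors and coefficients)
sales <- data.frame(sqft = runif(n, 600, 4000), lnLag = rnorm(n, 13.3, 0.4))
sales$LNprice <- 4 + 0.65 * sales$lnLag + 0.0002 * sales$sqft + rnorm(n, sd = 0.2)

# Assign every row to one of k folds of equal size
k <- 100
fold <- sample(rep(seq_len(k), length.out = n))

# For each fold: fit on the other folds, predict the held-out fold, record its MAE
fold_mae <- vapply(seq_len(k), function(f) {
  fit <- lm(LNprice ~ sqft + lnLag, data = sales[fold != f, ])
  pred <- predict(fit, newdata = sales[fold == f, ])
  mean(abs(pred - sales$LNprice[fold == f]))
}, numeric(1))

cv_summary <- summary(fold_mae)
```

Calling `hist(fold_mae)` would draw the kind of histogram described above; a tight distribution suggests the model performs consistently across different subsets of the data.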

Coefficient-Level Estimates for a Model Fitted to Estimate Home Sale Price in Boulder Colorado.

Predictor                                                       Estimate        SE          t        p
------------------------------------------------------------------------------------------------------
(Intercept)                                                   -167.33328   8.90885  -18.78281  0.00000
Basement Type Variable                                          -0.07655   0.01042   -7.34372  0.00000
Drywall Variable                                                 0.02671   0.01416    1.88554  0.05939
Roof Type Variable                                              -0.05455   0.00744   -7.33602  0.00000
Ranch House Variable                                            -0.15649   0.00798  -19.61247  0.00000
Radiant Heat Floor Variable                                      0.30309   0.03475    8.72298  0.00000
Exterior Wall Variable                                           0.08195   0.01238    6.61864  0.00000
In/Out of Boulder Valley School District                         0.09507   0.00807   11.77643  0.00000
Distance from Nearest 3 Trail Heads                             -0.00003   0.00000   -7.58893  0.00000
Log of the Lag Price                                             0.65421   0.01040   62.90122  0.00000
Main Floor Square Footage                                        0.00022   0.00001   30.50632  0.00000
Sale Year                                                        0.08501   0.00441   19.26539  0.00000
Car Storage Square Footage                                       0.00015   0.00002    8.15574  0.00000
Percent of Population with Bachelors Degrees in Census Tract     1.49614   0.21554    6.94132  0.00000
Proximity to Two Nearest Polluting Facilities                    0.00000   0.00000   -1.65357  0.09825
Age of House                                                     0.00011   0.00017    0.64476  0.51910
Median Household Income                                          0.00000   0.00000   -2.84221  0.00449

Table 2


Mean Absolute Error (MAE, $)    Mean Absolute Percent Error (MAPE, %)
153938.4                        19.97

Table 3


Coefficient-Level Estimates for a Model Fitted to Estimate Home Sale Price (LN) in Boulder Colorado.

Predictor                                                       Estimate        SE          t        p
------------------------------------------------------------------------------------------------------
(Intercept)                                                      -173.21     7.753     -22.34  0.00000
Basement Type Variable                                             -0.08     0.009      -9.08  0.00000
Drywall Variable                                                    0.03     0.012       2.42  0.01554
Roof Type Variable                                                 -0.05     0.006      -8.30  0.00000
Ranch House Variable                                               -0.16     0.007     -22.58  0.00000
Radiant Heat Floor Variable                                         0.30     0.032       9.48  0.00000
Exterior Wall Variable                                              0.09     0.011       8.61  0.00000
In/Out of Boulder Valley School District                            0.09     0.007      13.34  0.00000
Distance from Nearest 3 Trail Heads                                 0.00     0.000      -9.19  0.00000
Log of the Lag Price                                                0.65     0.009      71.87  0.00000
Main Floor Square Footage                                           0.00     0.000      34.12  0.00000
Sale Year                                                           0.09     0.004      22.90  0.00000
Car Storage Square Footage                                          0.00     0.000       9.96  0.00000
Percent of Population with Bachelors Degrees in Census Tract        0.00     0.000      -2.27  0.02326
Proximity to Two Nearest Polluting Facilities                       1.32     0.187       7.07  0.00000
Age of House                                                        0.00     0.000      -3.72  0.00020
Median Household Income                                             0.00     0.000       1.09  0.27572

Table 4

Plotting our predicted prices (Fig. 4) indicates that our model performs better at predicting the prices of homes with lower values. We suspect that this is a result of having log-transformed the price data, a step we took to make our model more accurate on average.

Furthermore, the map of our residuals suggests that our errors are not evenly distributed: our model does not perform equally well throughout the county. The Moran’s I test below, however, does not find statistically significant clustering of our test-set errors (I = -0.009, p = 0.944). Next, we will examine in more detail where our model performs well and where it performs less well within Boulder County.

## 
##  Monte-Carlo simulation of Moran I
## 
## data:  boulder.test.sf$SalePrice.Error 
## weights: spatialWeights.test  
## number of simulations + 1: 1000 
## 
## statistic = -0.0090961, observed rank = 56, p-value = 0.944
## alternative hypothesis: greater

Examining this map, it is clear that the model predicts lower home values in the mountainous western portion of the county. Further examination of our model will indicate what should be clear simply by comparing this map with the mapped prices of our training data: the model our team developed does not perform exceptionally well in the less urban portion of the county. 

The map below indicates that our model performed better in the eastern rather than western portion of the county; this reflects that, on average, our model performed better in more populated areas. Future versions of this model could include indicators that might better account for variations in pricing and weighting of various variables depending on population density.

The following plot indicates that, on average, as the mean price within a Census Tract increases, so does the MAPE of our model. What’s more, the range of the MAPE is much wider among tracts with average home values under one million. While a helpful measure, it is worth considering that our training set is relatively small when broken down by Census Tract, and a few outliers within the set may have significant impacts on these measures.

Further, we examined whether our model was generalizable across different demographics in Boulder. The two categories we looked at were the share of senior citizens in the population and median household income. For our senior subgroups, we divided the census tracts in Boulder into tracts with a share of seniors above the national average and tracts without. For our income subgroups, we divided census tracts into those with a median household income above $32,000 and those below.

Ideally, the difference in error between the two categories in each subgroup should be roughly the same. The tables below show that we did a decent job at accounting for differences in income, but did not fare as well for senior populations greater than the national average.

Test set MAPE by neighborhood senior context

seniorContext                              mean.MAPE
Fewer Seniors Than the National Average    17%
More Seniors Than the National Average     25%

Test set MAPE by neighborhood income context

incomeContext    mean.MAPE
High Income      18%
Low Income       17%
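
Group-level error tables like these can be produced by computing the absolute percent error for each test-set sale and averaging within each group. The sketch below shows the calculation on four made-up sales; the data frame `test_set`, its values, and the shortened group labels are hypothetical.

```r
# Four made-up test-set sales with predictions and a neighborhood context label
test_set <- data.frame(
  price = c(400000, 500000, 800000, 1000000),
  predicted = c(440000, 450000, 800000, 1250000),
  seniorContext = c("Fewer Seniors", "Fewer Seniors", "More Seniors", "More Seniors")
)

# Absolute percent error per sale, then the mean within each context (in percent)
test_set$ape <- abs(test_set$predicted - test_set$price) / test_set$price
mape_by_group <- tapply(test_set$ape, test_set$seniorContext, mean) * 100
```

If the model generalizes well, the group means should be close to one another; a gap like the one in the senior-context table above signals that the model serves one kind of neighborhood better than the other.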

Discussion

Is this an effective model?

The model we developed is less accurate than the models currently used by Zillow and should not be adopted. Nevertheless, our team is quite pleased with the success we have had developing a model that does not rely on crime statistics and hopes it may act as a starting point for further research.

We are particularly pleased with the work we have done developing a number of novel indicators. Among them:

Interesting Indicators Used in Developing a Model to Estimate Home Sale Price in Boulder Colorado.

Indicator                             Notes
Nearness to trailheads                This measures access to outdoor recreation, a premium in Boulder County.
Nearness to polluters                 While relatively modest in terms of influence in our model, this measure is nevertheless statistically significant, a sign to our team that similar measures may work well in more industrialized cities.
Average sale price of nearby homes    We found that this indicator was able to account for nearly fifty percent of the variation in price across our training set.

With these variables and a handful of others, our model can account for ~70% of the variation within our training set.

The key feature in our model is the measure we developed of the prices of nearby homes. Alone, this feature is capable of explaining approximately fifty percent of the variation in home sale prices. At its most basic level, this indicator accounts for how properties of similar values tend to cluster. Our team suspects that this clustering happens more in urbanized areas; this may help account for our model’s poor performance in the western parts of the county.

On average, our model produced an error of $139,587.70 or 17.2%. While that’s certainly not as accurate as Zillow’s current model, our team is pleased with these results.

According to our maps, could we account for the spatial variation in prices?

These errors are not evenly distributed throughout the county but are more extreme in less dense areas, and the model performs better in more dense areas. We suspect this may be due to indicators acting differently in urbanized and non-urbanized areas. For instance, lot sizes tend to be larger (and so square footage can also increase) in non-urban areas. Similarly, outside of cities, home prices may spatially cluster in ways that our current indicator measuring the sale price of nearby properties fails to account for properly.

Conclusion

While we do not recommend that our model replace Zillow’s current model for Boulder County, we recommend that the indicators we have developed here be adopted. Further research into similar indicators may also allow Zillow to be a leader in moving away from relying on data generated by the police while continuing to offer a high-quality product.

This model could benefit from a more nuanced understanding of how price varies in less urbanized areas of the county. One approach would be to develop separate models for use in areas with radically different population densities. It could also be useful to incorporate elevation data to help predict where homes fetch higher prices because of exceptional views. This model could also benefit from fine-tuning existing features to develop more powerful indicators.