1. What other data sources might be useful to add to the Stiernstrand/Crossfield property value prediction tool? Why?
2. How would you measure the worth of the Stiernstrand/Crossfield property value prediction tool?
3. Could the Stiernstrand/Crossfield property value prediction tool be made into a data warehouse? How? What would be the value of doing so?
4. Should Zillow buy or add the Stiernstrand/Crossfield property value prediction tool to their website? Should they replicate it themselves?
5. What other technologies or tools could/should be added to the visualization part of the Stiernstrand/Crossfield property value prediction tool? Why?
This is the story:
DIY Zillow
During the 2020–21 housing boom, Charlottesville, Virginia, realtor Reidar Stiernstrand was busy helping home buyers navigate the real estate market. One of his prospective clients, Marion Sabor, was planning to join the droves of people moving away from New York City in search of more residential communities.1 Instead of the neighboring New York City suburbs, Sabor was interested in relocating her family to Charlottesville.
The Sabors had already spent some time researching the Charlottesville-area housing market using Zillow, a popular online real estate website.2 Because the couple was frustrated with how Zillow displayed homes relative to the various school districts, Stiernstrand planned a “Do It Yourself” (DIY) project to build a tool with enhanced views of new home listings. Stiernstrand’s goal was to create a customized map to guide the search for the perfect home, not only for the Sabors but also for future clients and their families.
DIY “School Filters”
One feature on Zillow’s website Stiernstrand wished to improve with his DIY project was the “School Filters.” When using this feature to filter listings where elementary public schools received an above-average GreatSchools rating (7 or above), the site rendered two types of icons on the displayed map—gray markers representing each above-average school’s physical location and a sea of red dots representing all available properties for sale. When clicking on a school’s gray marker, a user had the option to display the school district’s boundaries. However, the user could only look at one school district at a time.3
With two young children of elementary-school age, the Sabors were particularly interested in finding a house in a good public school district. Stiernstrand knew that if he created a tool that filtered homes based on school ratings and overlaid all school district boundaries on a map with available homes, it would be a great asset for his clients.
Zillow used school ratings data from GreatSchools, so Stiernstrand went directly to greatschools.org and found the data.4 The information was a list of ratings for 22 public elementary schools in either Albemarle County (the county surrounding Charlottesville) or the City of Charlottesville, which was also technically a county in Virginia. Four of the elementary schools were rated above average; all were in Albemarle County. This fact would make the home search much easier. Albemarle County and the City of Charlottesville kept separate records on real estate transactions, so now the focus shifted to only properties in Albemarle County. Finding a home in one of these elementary schools’ districts would be the highest priority.
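Once the scraped ratings sit in a table, the school filter reduces to a one-line selection. A minimal sketch in Python, with the file name and column names as assumptions (the case does not specify how the GreatSchools data was stored):

```python
import pandas as pd

# Hypothetical file and column names for the scraped ratings list.
ratings = pd.read_csv("greatschools_ratings.csv")  # school, county, rating (1-10)

# Above-average elementary schools (GreatSchools rating of 7 or higher);
# per the case, all four such schools fall in Albemarle County.
above_avg = ratings[(ratings["rating"] >= 7) & (ratings["county"] == "Albemarle")]
print(above_avg["school"].tolist())
```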
DIY “Zestimate”
The visualization aspect of Stiernstrand’s DIY tool would undoubtedly help him locate market listings and efficiently narrow down suitable homes based on school districts. However, Stiernstrand recognized that what differentiated great realtors from average realtors was their ability to find the most valuable house for the best price. Therefore, Stiernstrand planned to incorporate another feature in his tool based on Zillow’s home value estimator, referred to as a “Zestimate.”
Zillow’s Zestimate model was trained on data reported by users, data from multiple listing services (MLS), and publicly available information.6 When discussing the Zestimate’s accuracy, Zillow distinguished between off-market and on-market home estimates. The off-market data used to help train its model included information from public records, such as tax assessments, while details like a home’s listed price and the number of days the home was on the market were considered on-market data. According to Zillow’s own internal benchmarking, the benefit of on-market data was clear: “The nationwide median error rate for the Zestimate for on-market homes is 1.9%, while the Zestimate for off-market homes has a median error rate of 6.9%.”7
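Zillow does not publish the exact formula, but a “median error rate” of this kind is conventionally the median absolute difference between the estimate and the eventual sale price, expressed as a fraction of the sale price. A minimal sketch of that metric on toy numbers:

```python
import numpy as np

def median_error_rate(estimates, sale_prices):
    """Median absolute error as a fraction of the actual sale price."""
    return np.median(np.abs(estimates - sale_prices) / sale_prices)

# Toy example: one estimate 5% high, one 2% low, one exact.
print(median_error_rate(np.array([210_000, 294_000, 400_000]),
                        np.array([200_000, 300_000, 400_000])))  # -> 0.02
```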
Stiernstrand envisioned completing his DIY Zillow tool by building a similar predictive model to forecast property values in the greater Charlottesville area. The model would enable him to compare the forecasted value with the listed market price of each listing and pinpoint the most valuable homes within his client’s price range. Because he did not have access to historical on-market data from the county, Stiernstrand hoped to improve upon the Zestimate’s off-market median error rate of 6.9%. He believed this standard was the right metric for measuring the accuracy of his model.
As Stiernstrand thought through all the steps required to build the DIY tool—locate the needed information, clean and prepare the data, build a regression model to forecast property estimates, and visualize findings—he decided to turn to Janice Crossfield, a local data scientist and acquaintance interested in helping Stiernstrand construct his tool. Crossfield’s role was to perform the data- and analytics-intensive duties that would lead to the tool’s success, while leaning on Stiernstrand for real estate expertise.
Observational Data Preparation
Before Crossfield could begin building a predictive model or a visualization, she needed to construct a table that represented each property’s most recent transaction. A transaction would serve as the tool’s “unit of analysis,” even though Albemarle County kept records at a more granular level. The county’s “unit of observation” was a parcel, or a card.8 A parcel referred to a property’s land acreage, and a card referred to a building on the parcel.9 A parcel could have multiple buildings on it and thus have multiple cards associated with it. Reconstructing a transaction from the county’s parcels and cards was challenging, because some transactions included multiple parcels and cards.
To construct the transaction-level data, Crossfield first retrieved parcel- and card-level data from Albemarle County’s website. Parcel-level data included fields such as last sale price, current assessed value, and school district. Card-level data included building information such as number of bedrooms, number of bathrooms, and finished square footage.10 To find the assessed value from the year of the last sale (when the last sale occurred prior to the current year), several years of past assessed values were extracted from the county’s Land Books, which were provided in PDF form on Albemarle County’s website.11
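Pulling historical assessed values out of those PDFs is its own small task. A minimal sketch using pdfplumber, with the file name and page structure as assumptions (real Land Book pages would need bespoke parsing logic):

```python
import pdfplumber

rows = []
with pdfplumber.open("land_book_2015.pdf") as pdf:  # hypothetical file name
    for page in pdf.pages:
        for table in page.extract_tables():
            # Assumed layout: rows like [parcel_id, assessed_value, ...]
            rows.extend(table)
```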
A parcel’s current assessed value was a dollar value assigned by Albemarle County as the basis for real estate tax. In 2021, the county’s annual real estate tax rate was 0.854%.12 For example, if the county assessed a parcel’s total value—the sum of its land value and building value—at $1 million, then the annual real estate taxes due to the county would be $8,540. A property’s assessed value was often close to its market value, but sometimes a property would sell for an amount quite different from its assessed value.
In the case of new construction, where a home was built on a parcel and the parcel was sold on the market within the same year, the parcel’s total assessed value would include the value of the land, but not the value of the building. Tax authorities assessed a parcel’s value once a year, effective January 1.13 Therefore, a parcel with a newly constructed home would not have the value of its building reflected in the total assessed value until the year after construction.
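Because of this lag, a useful data-preparation step (not described in the case, so purely an assumption) is to flag same-year new construction, where the assessed-value predictor will understate the property’s value:

```python
import pandas as pd

df = pd.read_csv("transactions.csv")  # hypothetical transaction-level table
# Homes built and sold in the same year: the sale-year assessment
# covers the land but not the newly constructed building.
df["new_construction"] = df["year_built"] == df["sale_year"]
```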
Once Crossfield downloaded these files from the county’s website, she merged them together—careful to account for parcels that included multiple buildings. Next, she aggregated the merged data up from the parcel level to the transaction level. This step required Crossfield to consider transactions that included multiple parcels. See Exhibit 1 for a flow chart of the steps she took to prepare the observational data. With the joined and aggregated data, she was ready to model sale prices at the transaction level.
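A minimal sketch of that two-stage roll-up, assuming hypothetical file, key, and column names (the county data has no ready-made transaction key, so one would need to be derived, for example from deed reference plus sale date):

```python
import pandas as pd

parcels = pd.read_csv("parcels.csv")  # one row per parcel
cards = pd.read_csv("cards.csv")      # one row per building ("card")

# Stage 1: sum card-level building attributes up to the parcel level,
# since a parcel can carry multiple cards.
cards_by_parcel = cards.groupby("parcel_id", as_index=False).agg(
    finished_sqft=("finished_sqft", "sum"),
    bedrooms=("bedrooms", "sum"),
    bathrooms=("bathrooms", "sum"),
)
merged = parcels.merge(cards_by_parcel, on="parcel_id", how="left")

# Stage 2: aggregate parcels up to the transaction level, since a
# transaction can include multiple parcels. assessed_value_sale_year
# is assumed to have been joined in from the Land Book extracts.
transactions = merged.groupby("transaction_id", as_index=False).agg(
    sale_price=("last_sale_price", "first"),
    sale_year=("last_sale_year", "first"),
    assessed_value_sale_year=("assessed_value_sale_year", "sum"),
    current_assessed_value=("current_assessed_value", "sum"),
    finished_sqft=("finished_sqft", "sum"),
)
```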
Forecast Data Preparation
To estimate sale prices, Crossfield planned to train a linear regression model with transaction-level data. Because the county’s oldest available Land Book was from 2015, she first filtered out the transactions that occurred before 2015 from the training and testing data. After analyzing the distribution of last sale prices, she also removed transactions found in the top 1% and bottom 1% since 2020. Crossfield used just two predictor variables in her model—total assessed value from the year of the last sale and finished square footage.
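A minimal sketch of the training step, continuing from the transaction table sketched above. The case’s outlier rule is ambiguous, so this reading trims the top and bottom 1% of sale prices among transactions since 2020; the column names remain hypothetical:

```python
from sklearn.linear_model import LinearRegression

# Keep transactions from 2015 onward (the oldest available Land Book).
df = transactions[transactions["sale_year"] >= 2015].copy()

# Drop the top and bottom 1% of sale prices among post-2020 sales.
lo, hi = df.loc[df["sale_year"] >= 2020, "sale_price"].quantile([0.01, 0.99])
df = df[(df["sale_year"] < 2020) | df["sale_price"].between(lo, hi)]

# Two predictors only: sale-year assessed value and finished square feet.
X = df[["assessed_value_sale_year", "finished_sqft"]]
y = df["sale_price"]
model = LinearRegression().fit(X, y)
```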
After training the model, Crossfield compared its forecasting errors to those of a naive benchmark using a set of homes not included in the model’s training set. She treated a home’s total assessed value at the time of last sale as the naive benchmark forecast. The model Crossfield built proved to be competitive with the benchmark in predicting the sale price of the homes held out of the training set. She viewed this simple model as a promising start to the forecasting effort.
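A sketch of that comparison, continuing from the block above; the split ratio and random seed are assumptions, and the benchmark forecast is simply the assessed value at the time of last sale:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hold out homes the model never sees during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
fitted = LinearRegression().fit(X_train, y_train)

# Median error rates for the model and the naive benchmark.
model_err = np.abs(fitted.predict(X_test) - y_test) / y_test
naive_err = np.abs(X_test["assessed_value_sale_year"] - y_test) / y_test
print("model median error rate:", np.median(model_err))
print("naive median error rate:", np.median(naive_err))
```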
After evaluating the model’s ability to predict historical sale prices, Crossfield deployed the model on all the transaction-level data in order to come up with current estimates of property values. For this purpose, she needed to update one of the predictor variables in her model. Specifically, she replaced the total assessed value from the year of the last sale with the property’s current total assessed value. With this change, she was able to deploy the model to predict current property values. These predictions, which she called “current estimates,” would be comparable to Zillow’s Zestimates. See Exhibit 2 for a flow chart of the steps she followed to generate the forecast data.
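A sketch of the deployment step, continuing from the blocks above. Renaming the column keeps the feature names the fitted model expects; as throughout, the column names are hypothetical:

```python
# Deploy on all transactions: swap the current assessed value into the
# slot the model learned for the sale-year assessed value, so the
# fitted coefficients produce a present-day "current estimate."
X_current = transactions[["current_assessed_value", "finished_sqft"]].rename(
    columns={"current_assessed_value": "assessed_value_sale_year"}
)
transactions["current_estimate"] = model.predict(X_current)
```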
Visualization Data Preparation
To create the tool’s visualization, she planned to map each listing/property in Albemarle County and include details like current property estimate, school district boundaries, and listing price for listed homes. Before building the visualization, Crossfield tracked down some additional data, including homes on the market and shapefiles used to plot boundaries on the map.
Instead of using Zillow’s site to generate a list of homes for sale, Crossfield decided to use a specialized local site, mycaar.com, run by the Charlottesville Area Association of Realtors (CAAR). Crossfield recorded the address, listed price, and year built for a handful of homes listed on the website.14
She then used this information to join listing prices of available homes to the table that included current estimates. The only data this new table lacked were shapefiles that included information for each current parcel’s boundaries and each elementary-school district’s boundaries, in the form of a polygon for visualization.
Conveniently, the county provided these shapefiles—one for parcels and one for the elementary-school districts.15 Because each shapefile also included a Geographic Parcel Identification Number (GPIN),16 Crossfield used this field to join both shapefiles—the parcel shapefiles and the school district shapefiles— together, and then joined these to the new table that contained current property estimates and listing information. After this final merge, Crossfield and Stiernstrand could easily create a map that would highlight the most valuable real estate listings that matched a client’s wish list, reducing the time spent searching for properties.
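A minimal sketch of those joins with geopandas, assuming hypothetical file names and a tool_data table keyed by GPIN (the case’s Exhibit 4 layout is not reproduced here):

```python
import geopandas as gpd
import pandas as pd

# Both county shapefiles carry a GPIN field, per the case.
parcels_shp = gpd.read_file("parcels.shp")
districts_shp = gpd.read_file("elementary_districts.shp")
tool_data = pd.read_csv("tool_data.csv")  # current estimates + listing info

# Join district attributes onto parcels via GPIN (drop the district
# geometry so the frame keeps a single geometry column), then attach
# the estimates and listings.
shapes = parcels_shp.merge(
    districts_shp.drop(columns="geometry"), on="GPIN", how="left"
)
mapped = shapes.merge(tool_data, on="GPIN", how="left")

# Quick visual check: color parcels by current estimate.
mapped.plot(column="current_estimate", legend=True)
```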
Exhibit 3 contains a flow chart of the steps involved in preparing the data for visualization. The data for visualization, up to but not including information from the shapefiles, were stored in a table called “tool_data.” See Exhibit 4 for this table’s file layout.
Lining Up Some Home Tours
In the coming weeks, Sabor and her family were planning to visit Charlottesville to tour available homes. Stiernstrand hoped to identify at least 12–15 homes worth touring. The Sabors’ budget for a home was somewhere in the range of $500,000 to $1.5 million. They would go to the high end of this range if they thought they had found a bargain. At the lower end, the family would have money left over to upgrade the home.
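The tool’s output maps directly onto this brief. A minimal shortlisting sketch, assuming tool_data carries hypothetical listing_price and current_estimate columns:

```python
import pandas as pd

tool_data = pd.read_csv("tool_data.csv")  # per Exhibit 4; names hypothetical

# Listed homes within the Sabors' budget, ranked by estimated upside
# (current estimate minus asking price) to surface potential bargains.
in_budget = tool_data[tool_data["listing_price"].between(500_000, 1_500_000)]
shortlist = (
    in_budget.assign(upside=in_budget["current_estimate"] - in_budget["listing_price"])
    .sort_values("upside", ascending=False)
    .head(15)
)
```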
Stiernstrand saw this as the perfect opportunity to put the new DIY home-searching tool to the test. If the tool could help identify appropriate properties for the Sabors, his side project might have the potential to assist additional clients and improve his business along the way.