Visualize your data
First, you'll add a table of data to a map as a set of point features with attributes. Later in the workflow, you can include the data's spatial characteristics in your modeling process.
Create features
You'll download an ArcGIS Pro project package with a table of house sale data and create a feature class from it.
- Download the King County House Prices project package.
- Browse to the location where you downloaded the package and double-click King_County_House_Prices.ppkx to open the project in ArcGIS Pro. If prompted, sign in using your licensed ArcGIS account.
Note:
If you don't have access to ArcGIS Pro or an ArcGIS organizational account, see options for software access.
The project opens. The extent of the map is King County, Washington. In the Contents pane, the Standalone Tables section contains an item named kc_house_data.csv.
This file is a comma-separated values (.csv) file, a format frequently used to exchange tables of data. The first row of the file contains a comma-delimited list of the field names; each subsequent row contains comma-delimited values for each of those fields. In many data science or machine learning workflows, one of the first steps is to read this file into a data frame using a notebook. In this tutorial, you'll load the data into a geodatabase as a set of point features and use ArcGIS Pro as your data science workstation.
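For reference, the data frame approach mentioned above looks like this (a minimal sketch, assuming the .csv has been downloaded to your working directory):

```python
# Minimal sketch of the data frame approach (assumes the .csv is in the working directory).
import pandas as pd

df = pd.read_csv("kc_house_data.csv")                        # read the comma-separated table
print(df.shape)                                              # number of rows and columns
print(df[["price", "sqft_living", "lat", "long"]].head())    # preview a few fields
```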
- In the Contents pane, right-click kc_house_data.csv and choose Open.
The table appears under the map view. You can see the table field names and some of the values.
- On the ribbon, click the Analysis tab. In the Geoprocessing group, click Tools.
The Geoprocessing pane appears.
- In the Geoprocessing pane, in the Search box, type XY Table To Point.
- In the tool search results, click XY Table To Point.
- In the XY Table To Point tool pane, for Input Table, choose kc_house_data.csv.
Note:
If you are working in a non-American English locale, use the included kc_house_data_table geodatabase table instead of the .csv file. Locale can affect the data types of the output fields when importing .csv files with the XY Table To Point tool. If you want to make points from .csv files and also control the data types of the imported attributes, import the .csv file to a geodatabase table first and set the data type of each field with the Table To Table tool.
- For Output Feature Class, type kc_house_data.
The X Field parameter is already populated with the long field from the .csv table, while the Y Field parameter is populated with the lat field. This dataset does not have a Z Field value, so you can leave that parameter blank.
Next, you'll set an appropriate coordinate system for the data.
- For Coordinate System, click the Select coordinate system button.
The Coordinate System window appears.
- In the search box, type HARN and press Enter.
- Expand Geographic Coordinate System, North America, and USA and territories. Click NAD 1983 HARN.
- Click OK. In the XY Table To Point tool pane, click Run.
The tool runs. After it finishes, the points are added to the map.
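If you prefer to script this step, the same tool can be run with ArcPy; the following is a minimal sketch, assuming the .csv is reachable from the project folder and that output goes to the project's default geodatabase through the current workspace:

```python
# Scripted equivalent of the XY Table To Point step (sketch; assumes the .csv path and
# that outputs go to the project's default geodatabase via the current workspace).
import arcpy

arcpy.management.XYTableToPoint(
    in_table="kc_house_data.csv",
    out_feature_class="kc_house_data",
    x_field="long",
    y_field="lat",
    coordinate_system=arcpy.SpatialReference(4152),  # WKID 4152: NAD 1983 HARN
)
```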
- Close the Geoprocessing pane. Close the kc_house_data.csv table view.
Change the symbology
Before you explore the data, you'll change the default symbology.
- In the Contents pane, under kc_house_data, click the point symbol.
- In the Symbology pane, in the Gallery tab, click the Circle 3 symbol.
- Click the Properties tab. Under Appearance, for Color, choose Malachite Green.
Tip:
In the color picker, point to a color to see the color name.
- For Size, choose 4 pt. Click Apply.
The symbols change on the map.
- Close the Symbology pane.
- On the Quick Access Toolbar, click the Save button to save your project.
Note:
A message may appear warning you that saving this project file with the current ArcGIS Pro version will prevent you from opening it again in an earlier version. If you see this message, click Yes to proceed.
Explore the data
Next, you'll explore the data. First, you'll familiarize yourself with its attribute fields and their meaning. Then, you'll create a scatterplot matrix and explore relationships between attributes.
- In the Contents pane, right-click the kc_house_data layer and choose Attribute Table.
The attribute table has 20 attribute fields describing the houses and their sale prices:
- date: Date of sale
- price: Final transaction amount
- bedrooms: Number of bedrooms
- bathrooms: Number of bathrooms
- sqft_living: Living space size (in square feet)
- sqft_lot: Lot size (in square feet)
- floors: Number of floors
- waterfront: Whether the house is on the waterfront (1: yes, 0: no)
- view: Categorical variable for the view from the house
- condition: Categorical variable for the condition of the house
- grade: Overall house grade based on the King County grading system
- sqft_above: Size of the house excluding the basement (in square feet)
- sqft_basement: Size of the basement (in square feet)
- yr_built: Year the house was built
- yr_renovated: Year the house was renovated (if renovated)
- zipcode: ZIP Code of the house
- lat: Latitude of the house
- long: Longitude of the house
- sqft_living15: Size of the living space in 2015 (in square feet)
- sqft_lot15: Size of the lot in 2015 (in square feet)
Some of the fields contain codes for specific values. The codes for the condition field are explained below:
- 1 (Poor): Many repairs needed. House is showing serious deterioration.
- 2 (Fair): Some repairs needed immediately. Much deferred maintenance is needed.
- 3 (Average): Depending upon age of improvement, normal amount of upkeep for the age of the home.
- 4 (Good): Condition above the norm for the age of the home. This indicates extra attention and care has been taken to maintain it.
- 5 (Very Good): Excellent maintenance and updating on home; not a total renovation.
The grade field contains a different series of codes, which are explained below:
- 1–3: Falls short of minimum building standards; normally cabin or inferior structure.
- 4: Generally older, low-quality construction. The house does not meet code.
- 5: Lower construction costs and workmanship. The house has a small, simple design.
- 6: Lowest grade currently meeting building codes. Low-quality materials and simple designs were used.
- 7: Average grade of construction and design. This is commonly seen in plats and older subdivisions.
- 8: Just above average in construction and design. Houses of this quality usually have better materials in both the exterior and interior finishes.
- 9: Better architectural design, with extra exterior and interior design and quality.
- 10: Homes of this quality generally have high-quality features. Finish work is better, and more design quality is seen in the floor plans and larger square footage.
- 11: Custom design and higher quality finish work, with added amenities of solid woods, bathroom fixtures, and more luxurious options.
- 12: Custom design and excellent builders. All materials are of the highest quality and all conveniences are present.
- 13: Generally custom designed and built, approaching the mansion level. These houses have a large amount of highest quality cabinet work, wood trim, and marble, with large entries.
The view field uses the following codes:
- 0: Unknown
- 1: Fair
- 2: Average
- 3: Good
- 4: Excellent
The next step is to explore the data to determine the distribution of values for each variable and to determine if any of the attributes are positively or negatively correlated. A scatterplot matrix is a visualization technique commonly used for this sort of data exploration.
- Close the attribute table.
- In the Contents pane, right-click kc_house_data, point to Create Chart, and choose Scatter Plot Matrix.
- In the Chart Properties pane, for Numeric fields, click Select. Check all of the fields from price through sqft_basement.
- Click Apply.
The Chart view updates with scatterplots of the selected fields.
Note:
You can see the name of each scatterplot by pointing to it. You can also drag the boundaries of the view to increase the size of the charts.
The plot summarizes relationships between pairs of different variables. You can use the scatterplot matrix to explore the relationships by clicking one of the plots in the lower triangle; once a plot is clicked, a larger version of it is shown on the top right.
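A comparable exploration can also be sketched outside ArcGIS Pro with pandas, assuming the data frame from the earlier snippet:

```python
# Notebook-style counterpart to the scatterplot matrix (sketch; assumes df from pandas.read_csv).
import pandas as pd
import matplotlib.pyplot as plt

fields = ["price", "bedrooms", "bathrooms", "sqft_living", "sqft_lot", "floors",
          "waterfront", "view", "condition", "grade", "sqft_above", "sqft_basement"]
pd.plotting.scatter_matrix(df[fields], figsize=(12, 12), diagonal="hist")
plt.show()
```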
Why is this plot useful for analysis?
The first regression model you'll use to develop your valuation model is Generalized Linear Regression (GLR). GLR requires predictors and the target variable to be linearly related. You'll use this chart to find property characteristics that are linearly correlated with the variable you want to predict: the sale price of the home.
Price is the first column in the lower triangle portion of the scatterplot matrix. Charts in the first column display relationships between different property characteristics and the sale price of the home.
- Click the scatterplot of price and sqft_living (first column, third row from the top).
The Preview Plot in the Matrix Corner View updates to show a larger view of the scatterplot of price and sqft_living.
There is a positive linear relationship between living space size (sqft_living) and price. An increase in living space generally corresponds to an increase in house price. This variable is a good candidate for a GLR model.
- Click the scatterplot for bathrooms and price (first column, second row from the top).
The relationship between the number of bathrooms and the price does not exhibit a strong linear relationship. This suggests that the number of bathrooms does not affect the sale price of homes in this region as much as living space.
- Click the scatterplot for number of bedrooms and price (first column, first row from the top).
There seems to be a positive linear relationship between the two variables. However, it is hard to estimate the strength of the linear relationship by visual inspection.
- In the Chart Properties pane, check the Show linear trend box.
Checking this option adds a best-fit line to each scatterplot.
- Click the scatterplot of price and sqft_living.
The chart now has the best-fit line and the associated R2 measure.
R2, or R-squared, measures how much of the variation in one variable is explained by its linear relationship with the other. Values of R2 close to one indicate a strong linear relationship, whereas values close to zero indicate a weak relationship.
An R2 of 0.49 indicates that the linear relationship with sqft_living accounts for 49 percent of the variation in price.
- In the Chart Properties pane, under Matrix Layout, for Upper right, choose Pearson's r. For Diagonal, choose Field Names.
The chart updates to show Pearson's r values in addition to the scatterplot charts.
The Pearson correlation coefficient (Pearson's r) quantifies the strength and direction of the linear relationship between two variables. An absolute value of Pearson's r close to one indicates a strong linear relationship, whereas values close to zero indicate a weak linear relationship.
- If necessary, click the scatterplot for price and sqft_living.
The Pearson's r value for price and sqft_living also highlights with a black outline.
The sign of Pearson's r indicates the direction of the relationship between two variables. A Pearson's r value of 0.7 indicates a positive linear relationship between the variables: an increase in sqft_living corresponds to an increase in price and vice versa. A negative Pearson's r value indicates that an increase in one variable corresponds to a decrease in the other variable.
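For a one-predictor linear fit, R2 is simply the square of Pearson's r, which you can verify with a short notebook calculation (a sketch, assuming the data frame from the earlier snippet):

```python
# For a one-predictor linear fit, R-squared is the square of Pearson's r
# (sketch; assumes df from pandas.read_csv).
import numpy as np

r = np.corrcoef(df["sqft_living"], df["price"])[0, 1]
print(f"Pearson's r: {r:.2f}")       # roughly 0.7 for this dataset
print(f"R-squared:   {r ** 2:.2f}")  # roughly 0.49
```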
All of the property characteristics in the scatterplot matrix have a positive relationship with price.
- Click the scatterplot for bathrooms and price.
The Pearson's r of 0.53 points to a weak positive linear relationship between the number of bathrooms and the price.
- Click the scatterplot for bedrooms and price.
The Pearson's r of 0.31 indicates a weak positive linear relationship between the number of bedrooms and the price. The number of bedrooms and the price exhibit a different pattern for prices less than $1,000,000. There seems to be a strong linear relationship between these two variables if the price is more than $1,000,000.
This is an example of a piecewise relationship: a relationship that changes after a variable crosses a certain threshold. The presence of piecewise relationships suggests that a tree-based approach, such as Forest-Based Classification and Regression, may produce a more accurate estimate. Keep this in mind for now; later, you'll delineate variables for linear regression.
So far, you've created a way to understand relationships between variables. Your initial goal is to build an accurate linear model that relates attributes of a house to its sale price. You'll accomplish this goal in the following ways:
- Find property characteristics that have a strong linear relationship to the price.
- Make sure property characteristics do not have strong linear relationships between each other (to avoid multicollinearity).
The scatterplot matrix summarizes many of these relationships at once, so you can delineate the property characteristics you want to use in your analysis.
- Close the Chart of kc_house_data view and the Chart Properties pane. Save the project.
You've inspected the data to prepare for a linear regression analysis. You found that sqft_living has the strongest correlation with your target variable, the sale price of the home. Property characteristics that have strong relationships with each other may cause problems if they are in the same linear system as sqft_living. If two or more property characteristics exhibit multicollinearity, your variables may be telling the same story. For example, it's worth considering whether the total living area already captures the number of bedrooms and bathrooms, a relationship that can change from region to region. Multicollinearity can skew your model results if it isn't addressed.
Next, you'll create a linear model of the relationship between sqft_living and sale price of the home. If the model does not perform well, you can add the grade variable, also strongly related to the sale price of the home, to the linear system.
Identify market drivers with exploratory regression
Next, you'll explore relationships between property characteristics and the sale price of the home using exploratory regression. In exploratory regression, you are trying to find a model that accurately estimates the sale price of the home and gives you insight into the relationships between variables, such as whether those relationships are positive or negative.
Create a Generalized Linear Regression model
The first type of regression model that you'll create is a Generalized Linear Regression (GLR) model. You'll use one of the ArcGIS Spatial Statistics geoprocessing tools.
- Open the Geoprocessing pane.
Tip:
To open the Geoprocessing pane, on the ribbon, click the Analysis tab. In the Geoprocessing group, click Tools.
- In the Geoprocessing pane search box, type generalized linear.
- Click the Generalized Linear Regression (Spatial Statistics Tools) tool.
Note:
Some tools appear twice with similar or the same names in the Geoprocessing pane search results. Make sure you select the tool from the correct toolbox, which is listed next to the tool name.
You can use the Generalized Linear Regression tool to predict different types of dependent variables. The correct model to use depends on the type of the dependent variable. Because you are predicting a continuous variable (sale price), you'll use a Gaussian model to predict the sale price of the home.
If you were predicting a target variable that was 0 or 1 (a binary variable), such as whether a house sold for more than $500,000, you would use the binary (Logistic) option of this tool.
If the target variable was a count, such as the number of people making a bid for the house, you would use the count (Poisson) option of this tool.
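These three model types map onto familiar generalized linear model families; the following statsmodels sketch shows the correspondence (assuming the data frame from the earlier snippet; the binary and count examples are hypothetical):

```python
# The three model types map onto familiar GLM families (sketch; assumes df from
# pandas.read_csv; the binary and count examples are hypothetical).
import statsmodels.api as sm

X = sm.add_constant(df[["sqft_living"]])

gaussian = sm.GLM(df["price"], X, family=sm.families.Gaussian()).fit()  # continuous target
# binary = sm.GLM((df["price"] > 500000).astype(int), X, family=sm.families.Binomial()).fit()
# count  = sm.GLM(df["num_bids"], X, family=sm.families.Poisson()).fit()  # hypothetical count field
print(gaussian.summary())
```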
- In the Generalized Linear Regression tool pane, enter the following parameters:
- For Input Features, choose kc_house_data.
- For Dependent Variable, choose price.
- For Model Type, confirm that Continuous (Gaussian) is chosen.
Next, you'll choose the regression model's explanatory variable. In the exploration of the scatterplot matrix, you determined that sqft_living is a good variable to use to predict sale price of the houses.
- For Explanatory Variable(s), check the box for sqft_living.
- For Output Features, type valuation_sqft_living_glr.
You'll create multiple GLR models, so it is recommended that you give meaningful names to the different outputs. This name indicates the explanatory variable and the method.
You won't define any inputs in the Prediction Options section. For now, you're performing exploratory regression to define a model to describe house price given property characteristics. In other words, you are working on understanding potential drivers behind sale price of the homes. At this stage, you are not concerned with assigning a price to a house where no sale price is assigned (prediction). Later, you'll predict sale prices for new houses and this section of the tool will be useful.
- Click Run.
The tool runs and completes with a warning: WARNING 001605: Distances for Geographic Coordinates (degrees, minutes, seconds) are analyzed using Chordal Distances in meters.
Chordal distance measurements are used because they can be computed quickly and provide good estimates of true geodesic distances. However, chordal distances are not good estimates of geodesic distances beyond about 30 degrees, so be sure to project your data if your study area extends beyond 30 degrees.
One output of this tool is a standardized residual map.
Dark green and dark purple indicate a large mismatch between the predicted and actual sale prices of the homes.
- In the Contents pane, under the valuation_sqft_living_glr layer, double-click the Relationship between Variables chart.
The Relationship between Variables chart displays predictions performed by GLR and actual data points.
Ideally, data points should be close to the line. The closer to the line the data points are, the stronger the relationship is between the two variables.
In this chart, green colors indicate an underestimation of the sale price of the home, where the actual price of the house is higher than the one predicted by the model. Purple indicates an overestimation, where the predicted price is above the actual price of the house.
- Close the chart pane and the Chart Properties pane.
On the Standardized Residual map, the darker green points seem to cluster around bodies of water. The regression model is systematically underestimating the sale price of the houses close to water bodies. It looks as though small changes to the size of the living space may result in bigger changes to the price of a house by a water body compared to a house that is inland.
Next, you'll evaluate global diagnostics from the GLR output.
- On the ribbon, on the Analysis tab, in the Geoprocessing group, click History.
The Geoprocessing History pane appears.
- In the Geoprocessing History pane, right-click Generalized Linear Regression and choose View Details.
The GLR tool results details window appears.
- In the GLR tool results details window, click the Messages tab.
Tip:
You can expand the window by dragging the edges of the window.
In the GLR Diagnostics section, the Adjusted R-Squared value is 0.492830. This is the same R2 value shown on the scatterplot for price versus sqft_living.
The Joint F, Joint Wald, and Koenker (BP) Statistics are significant with P values (Prob(>chi-squared)) of approximately 0 (approximate due to rounding). This indicates that the probability that the relationship defined by this model is occurring randomly is approximately 0. In other words, there is a statistically significant relationship between the sale price of homes and the area of the living space being modelled by the GLR.
- Close the Generalized Linear Regression (GLR) (Spatial Statistics Tools) window and the History pane.
- Save the project.
You've used GLR to determine that there is a significant relationship between the sqft_living variable and price. You also discovered that the GLR model underestimates house values for houses that are near bodies of water. Next, you'll seek an improved GLR model by adding another variable to account for this underestimation. You'll use data from ArcGIS Online to geoenrich your prediction.
Enhance the analysis with geographic data
Next, you'll add a layer of geographic data from ArcGIS Online and use it to enhance your GLR model.
Find water bodies
Since the GLR model you just made underestimates values of houses near water bodies, you'll add water body data to the map and incorporate it in the GLR model. The ability to enhance data with geographic information, which can also be done via methods like GeoEnrichment, is an important advantage of ArcGIS Pro as a data science workstation.
- On the ribbon, click the View tab. In the Windows group, click Catalog Pane.
- In the Catalog pane, click the Portal tab and click the ArcGIS Online button.
- Search for USA water bodies owner:esri_dm.
- Right-click the USA Detailed Water Bodies layer package and choose Add To Current Map.
Note:
To distinguish the USA Detailed Water Bodies layer package from the USA Detailed Water Bodies feature layer, point to the item in the search results. The workflow can be completed with either the layer package or the feature layer, but the feature layer has visibility restrictions that cause it to not be visible at your current map extent.
The layer is added to the map.
- Zoom in to the large lake in the north-central part of the data, bracketed on the east and west shores by dark blue-green points.
- On the ribbon, click the Map tab. In the Selection group, click the Select button.
- On the map, click the lake.
A blue outline highlights the lake feature, indicating that it is selected.
- In the Contents pane, right-click USA Detailed Water Bodies and choose Attribute Table.
- At the bottom of the table, click the Show Selected Records button.
The single selected feature is shown in the table.
The water bodies layer represents this data as a polygon with an FTYPE variable (which stands for Feature Type) of Lake/Pond. The GLR model is consistently underestimating house values around lakes in Washington. The layer also contains water body types such as swamps and streams, but they do not impact sale price as positively as lakes do in this region. You'll use distances to Lake/Pond type water bodies in your analysis.
- On the Map tab, in the Selection group, click Select By Attributes.
- In the Select By Attributes window, confirm Input Rows is set to USA Detailed Water Bodies and Selection Type is set to New selection.
- Under Expression, build the expression Where FTYPE is equal to Lake/Pond.
- Click Apply.
Note:
Do not close the Select Layer By Attributes tool yet.
All the Lake/Pond features are highlighted on the map.
There are many small lakes and ponds that do not have clusters of dark blue-green points near them. This suggests that smaller lakes and ponds do not have the same effect as large ones on the GLR model results. You'll add a clause to the selection expression to select only the larger bodies of water.
- In the Select By Attributes window, click Add Clause.
This new clause is joined to the first clause using the And operator. This is correct for this selection, but for another project, you may use an Or operator.
- Use the Expression builder to build the expression And SQKM is greater than or equal to.
The other large lake in the county has an area of 19.34 square kilometers. This clause will filter out smaller bodies of water.
- Click the SQL toggle. After SQKM >=, type 19.00.
- Click OK.
The selection changes, highlighting only lakes and ponds over 19 square kilometers in area. According to the attribute table, there are now 689 selected features.
- Close the attribute table.
Export the lake features
You only want to analyze the selected features, not the other features in the layer. Next, you'll export the selected features to a new feature class using the Copy Features tool.
- In the Geoprocessing pane, click the Back button. Search for and open the Copy Features tool.
- In the Copy Features tool pane, for Input Features, choose USA Detailed Water Bodies. For Output Feature Class, type LargeLakes.
A message under the Input Features parameter informs you that the input layer has a selection and shows the number of selected records that will be processed. The USA Detailed Water Bodies layer contains water bodies from across the United States, but you're only interested in water bodies in King County, Washington. You'll change the tool's processing extent to limit the features that are copied to those that are within the extent of your kc_house_data layer.
- Click the Environments tab.
- In the Processing Extent section, for Extent, choose kc_house_data.
- Click Run.
Note:
Do not close the Geoprocessing pane after you run the tool; you'll return to it soon.
The LargeLakes layer is added to the Contents pane.
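The selection and export steps can also be scripted; the following is a minimal ArcPy sketch, assuming the layer and dataset names resolve in the active map and project geodatabase:

```python
# Scripted sketch of the selection and export (assumes the layer and dataset names
# resolve in the active map and project geodatabase).
import arcpy

lakes = "USA Detailed Water Bodies"
arcpy.management.SelectLayerByAttribute(
    lakes, "NEW_SELECTION", "FTYPE = 'Lake/Pond' And SQKM >= 19.00"
)

# Limit processing to the extent of the house points, then copy only the selected features.
with arcpy.EnvManager(extent="kc_house_data"):
    arcpy.management.CopyFeatures(lakes, "LargeLakes")
```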
You no longer need the USA Detailed Water Bodies layer, so you'll remove it.
- In the Contents pane, right-click USA Detailed Water Bodies and choose Remove.
- Save the project.
Use distance to lakes in the GLR model
Now that you've captured the large lake features, you can use them to geoenrich your GLR model. The regression tools in the Spatial Statistics toolbox allow you to include distance features in an analysis. These tools will automatically calculate Euclidean distances from each point to the closest distance feature and use the distance as an input variable.
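Conceptually, this is a nearest-feature distance calculation. A minimal sketch of the idea with hypothetical coordinates (not the tool's implementation, which works directly on the layer geometries):

```python
# Illustration of the "distance to the closest feature" idea (sketch; the coordinates are
# hypothetical and assumed to be in a projected coordinate system, in meters).
import numpy as np
from scipy.spatial import cKDTree

house_xy = np.array([[548000.0, 5270000.0], [552000.0, 5268000.0]])  # hypothetical house points
lake_xy = np.array([[550000.0, 5269000.0], [560000.0, 5275000.0]])   # hypothetical lake vertices

dist_to_lake, _ = cKDTree(lake_xy).query(house_xy)  # straight-line distance to the nearest vertex
print(dist_to_lake)
```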
- In the Geoprocessing pane, at the bottom of the tool window, click Open History.
- In the History pane, right-click Generalized Linear Regression and choose Open.
The tool opens with the parameters from the last time you ran the Generalized Linear Regression (GLR) tool.
You'll add the distance to lakes to enhance the GLR model.
- For Explanatory Distance Features, choose LargeLakes.
- For Output Features, type valuation_sqft_living_d2lake_glr.
- Click Run.
The tool runs and the results are added to the map. Next, you'll visually compare the results of the two runs of the GLR tool.
- In the Contents pane, confirm that the valuation_sqft_living_d2lake_glr layer is selected.
- Click the Feature Layer tab. In the Compare group, click Swipe.
- Click the map to the north of the county and drag the Swipe tool across the data.
Note:
Depending on where you click the map, you can swipe either up and down or left and right. Either way allows you to compare the two layers.
Because valuation_sqft_living_d2lake_glr is selected in the Contents pane, the Swipe tool shows you what is under it as you drag it across the map.
The areas around the lakes still have the highest standardized residuals for both GLR runs.
- On the ribbon, click the Map tab. In the Navigate group, click Explore.
- In the Contents pane, double-click the Distribution of Standardized Residual graph for both the valuation_sqft_living_glr and the valuation_sqft_living_d2lake_glr layers.
- In the chart pane, drag the tab for one of the charts and dock it on the right side of the chart pane.
Now you can compare the graphs side by side. The two distribution plots are very similar.
The similarities indicate that the estimation error was not improved by adding distance to lakes. If the GLR model with distance to lakes had performed better, you would expect fewer locations with dark tones of green and purple (the locations with high standardized residuals).
There are at least two possible reasons that adding the distance features did not improve the GLR model. First, the distance features calculated in GLR are Euclidean, or straight-line, distances. Since most travel in this area is along the road network, it may be that straight-line distances are not a reasonable representation of the road travel distance from the houses to the lakes. Second, the relationship between the size of the living space and distance to a water body variables and the sale price of the home may not be a linear one. It may be that GLR is an overly simple model for this scenario.
- Close the Distribution of Standardized Residual graphs and the Chart Properties pane.
- In the Contents pane, uncheck and collapse the valuation_sqft_living_d2lake_glr and valuation_sqft_living_glr layers.
- Save the project.
You added distance to lakes as a variable for the GLR and compared the results to your original GLR model results. The simple linear relationships modeled by GLR may not apply in this dataset. Next, you'll try a more complex model.
Create a regionalized generalized linear regression model
Next, you'll divide the county into regions and run separate GLR analyses for each region.
Check for regions in the data
First, you'll change the symbology of the data to look for regions.
- In the Contents pane, right-click the kc_house_data layer and choose Symbology.
- In the Symbology pane, set the following parameters:
- For Primary Symbology, choose Graduated Colors.
- For Field, choose price.
- For Classes, choose 10.
- For Color scheme, click the check box for Show Names and choose Yellow-Green-Blue (Continuous).
Visualizing the data this way shows distinct spatial clusters, with lower-priced clusters in the south and the northwest and with higher-priced clusters in areas close to water. Proximity to water plays a crucial role in determining sale price in this region, and prices change gradually in a given neighborhood.
Next, you'll define data-driven valuation neighborhoods and perform GLR in each region.
- Open the Geoprocessing pane and, if necessary, click the Back button. Search for and open the Spatially Constrained Multivariate Clustering tool.
You'll use this tool to identify regions that have similar market values for homes that have similar living space size.
- In the Spatially Constrained Multivariate Clustering tool, enter the following parameters:
- For Input Features, choose kc_house_data.
- For Output Features, type price_regions.
- For Analysis Fields, check price and sqft_living.
- For Spatial Constraints, confirm Trimmed Delaunay triangulation is chosen.
- For Output Table for Evaluating Number of Clusters, type num_clusters.
Note:
If you don't specify a number of clusters, the tool automatically picks the number that results in the most homogeneous regions.
- Click Run.
Note:
If the tool fails to run, save the project and close and reopen ArcGIS Pro. Open the project and run the tool again.
The tool runs and a new layer is added to the map.
Note:
After you run the tool, don't close the Geoprocessing pane. You'll return to it shortly.
There are only two clusters in the results. You'll examine the Optimized Pseudo-F Statistic Chart to get a sense of other ways the data could be clustered.
- In the Contents pane, under Standalone Tables, double-click Optimized Pseudo-F Statistic Chart.
In this plot, you are looking for elbows, or points where adding another region does not improve the homogeneity of the clusters considerably. In the chart, there is an elbow at eight regions; after the eighth region, the pseudo-F statistic consistently decreases.
You'll rerun the tool, this time with eight regions. The Geoprocessing pane is already open to the tool with the parameters you used to run it previously.
- Close the chart and the Chart Properties pane.
- In the Geoprocessing pane, for Number of Clusters, type 8.
You'll leave the other parameters unchanged. By keeping the same output name, the new tool output will replace the old one.
- Click Run.
The price_regions layer is added to the map. It has eight clusters.
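If you want to experiment with a comparable idea in a notebook, a rough analog of spatially constrained clustering is agglomerative clustering with a spatial connectivity graph (a sketch only, not the Esri algorithm; assumes the data frame from the earlier snippet):

```python
# Rough analog of spatially constrained clustering (a sketch only, not the Esri algorithm):
# agglomerative clustering with a spatial connectivity graph (assumes df from pandas.read_csv).
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import kneighbors_graph
from sklearn.preprocessing import StandardScaler

values = StandardScaler().fit_transform(df[["price", "sqft_living"]])        # analysis fields
connectivity = kneighbors_graph(df[["long", "lat"]], n_neighbors=10, include_self=False)

df["region"] = AgglomerativeClustering(n_clusters=8, connectivity=connectivity).fit_predict(values)
print(df["region"].value_counts())
```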
- In the Contents pane, under price_regions and Charts, double-click Spatially Constrained Multivariate Clustering Box-Plots.
The colors in the chart match the colors of the clusters on the map. Blue, green, yellow, brown, and purple clusters are above the third quartile for price and sqft_living. Blue corresponds to a cluster where living space is smaller compared to green and brown, but the price is higher. This color may indicate a desirable part of town. On the map, the blue cluster corresponds to an area to the east of Lake Washington. In this cluster, living space size may not be the main driving factor for sale price of the home.
The green region, located on an island in Lake Washington, corresponds to houses with larger living spaces compared to blue clusters but at a lower price.
Looking at the regions below the third price quartile, the pink cluster is cheaper than the red and grey clusters, with an average living space size the same as the red cluster. This may indicate that one can get a cheaper house for the same living space size in the pink cluster. This may also indicate why the linear model did not work.
- Close the chart and the Chart Properties pane.
Run GLR for each region
Next, you'll perform GLR in every region. To do so, you'll select the set of points for each cluster by attribute and run GLR for each selection. Because there are eight regions, it is more efficient to use ModelBuilder to automate the process.
- On the ribbon, click the Analysis tab. In the Geoprocessing group, click ModelBuilder.
The Model view appears.
- Click and drag the price_regions layer from the Contents pane onto the model canvas.
- On the ribbon, in the ModelBuilder tab, in the Insert group, click Iterators and choose Iterate Feature Selection.
- On the model canvas, drag an arrow from price_regions to Iterate Feature Selection.
A drop-down menu appears.
- In the drop-down menu, choose In Features.
The Iterate Feature Selection item and connecting items change colors. Next, you'll adjust the tool parameter so the tool cycles through each of the eight Cluster ID values and creates a selection for each of them.
- Double-click Iterate Feature Selection.
- In the Iterate Feature Selection window, under Group By Fields, set the field to Cluster ID.
- Click OK.
The iterator has two outputs. I_price_regions_CLUSTER_ID is the selected feature layer and Value is a variable that holds the value for the current selection. In this case, the value is the ID value for each cluster.
Next, you'll attach the Generalized Linear Regression tool to the output of the iterator. Because the iterator is cycling through each cluster, the tool will run for each cluster.
- In the Geoprocessing pane, click the Back button. Search for generalized linear.
- In the list of search results, drag the Generalized Linear Regression (Spatial Statistics Tools) tool onto the model canvas, next to the green I_price_regions_CLUSTER_ID output of the iterator.
- On the model canvas, drag an arrow from I_price_regions_CLUSTER_ID to Generalized Linear Regression and choose Input Features.
The tool is connected to the output.
Next, you'll adjust the GLR tool parameters.
- Double-click Generalized Linear Regression.
The Input Features parameter is set to price_regions:1 because you connected the output of the iterator to the tool.
- For Dependent Variable, choose price. For Explanatory Variable(s), check sqft_living.
- For Output Features, type valuation_sqft_living_glr_region_%Value%.
Using the text %Value% at the end of the output feature name adds the contents of the variable Value to the name. With this naming scheme, each cycle of the iterator will have a unique name that is related to the cluster that is being analyzed.
- Click OK.
- On the ribbon, on the ModelBuilder tab, in the View group, click Auto Layout.
The model elements are automatically arranged.
The Output Predicted Features and Output Trained Model File ovals remain gray because they are optional outputs of the tool that you are not using at the moment.
- On the ModelBuilder tab, in the Insert group, click Utilities and choose Collect Values.
The Collect Values, Output Values, and Output Table utilities are added to the model canvas.
- On the model canvas, drag an arrow from valuation_sqft_living_glr_region_%Value% to Collect Values and choose Input Value.
Tip:
If necessary, you can reposition any object by selecting it and dragging it.
- Right-click Output Values and click Add To Display.
The model is now ready to run.
- On the ribbon, on the ModelBuilder tab, in the Run group, click Validate.
Your model is validated. It's now ready to run.
- On the ModelBuilder tab, in the Run group, click Run.
As the model runs, the tool items turn red to indicate that they are currently running and the model results window shows results from each run of the GLR model.
The GLR result group layers, eight in total, are added to the map and to the Contents pane.
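Conceptually, the model you just ran is a loop over cluster IDs. A notebook-style analog fits one simple linear model per region and keys each result by its cluster ID, much like the %Value% naming scheme (a sketch; assumes the data frame has a region column, such as the one produced by the clustering sketch earlier):

```python
# Notebook-style analog of the ModelBuilder iteration: fit one simple linear model per
# region and key each result by its cluster ID, much like the %Value% naming scheme
# (sketch; assumes df has a "region" column, such as the one from the clustering sketch).
import numpy as np

region_models = {}
for cluster_id, group in df.groupby("region"):
    slope, intercept = np.polyfit(group["sqft_living"], group["price"], deg=1)
    predicted = slope * group["sqft_living"] + intercept
    ss_res = np.sum((group["price"] - predicted) ** 2)
    ss_tot = np.sum((group["price"] - group["price"].mean()) ** 2)
    region_models[f"valuation_sqft_living_glr_region_{cluster_id}"] = {
        "slope": slope,
        "intercept": intercept,
        "r_squared": 1 - ss_res / ss_tot,
    }

print(region_models)
```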
Review the model results
Next, you'll review the results from the model and rename the layers to be more easily understandable.
- In the Contents pane, for Output Values:valuation_sqft_living_glr_region_1, under Charts, double-click Relationship between Variables.
The chart view appears.
The R2 value for this cluster has improved from 0.49 to about 0.67. You can open the charts for the other layers to see the R2 values for the other regions.
- Close the chart view and the Chart Properties pane.
- Close the Model view. Click Yes to save the model.
The Map view becomes active again.
Areas around Lake Washington are predicted more accurately; however, other areas, such as the West Seattle District, have a high number of underestimated home sale prices (in dark green). Regionalized models run the risk of amplifying problems pertaining to outliers in regression. The overall R2 for each region is summarized below:
- Region 1: 0.667345
- Region 2: 0.511873
- Region 3: 0.573594
- Region 4: 0.785343
- Region 5: 0.672591
- Region 6: 0.587296
- Region 7: 0.369590
- Region 8: 0.587235
The overall model quality for each of these regions is greater than the result for the GLR model that you ran on the whole dataset, with the exception of Region 7, a large region that contains outliers. Having multiple regions comes at the cost of losing the parsimony of the mathematical model: valuers now have a different mathematical function for each district of the city, each explaining a differing trend. Next, you'll move up in complexity and seek a model that explains the sale price of the homes in King County, Washington, using the entire dataset in one model.
Before you continue, you'll tidy up the Contents pane by grouping the outputs of your models. Each of the Output Values layers are already in a layer group named Model Builder. You'll update the group name and remove the Output Values text from each of the layers.
- In the Contents pane, click the ModelBuilder layer group name to select it and click it again to edit its name. Rename the group Regional GLR Model.
- Rename Output Values:valuation_sqft_living_glr_region_8 by deleting the Output Values: text.
- Rename the remaining seven layers by removing the Output Values: text. Collapse all eight layers.
- Click the valuation_sqft_living_d2lake_glr layer and press Shift while clicking the valuation_sqft_living_glr layer.
- Right-click the selected layers and choose Group. Rename the layer group Global GLR Model.
- On the ribbon, click the Map tab. If necessary, in the Selection group, click Clear to clear any selections.
- Save the project.
So far, you've made two attempts to incorporate spatial characteristics into your analysis. First, you used distance to water bodies as a predictor. Then, you created data-driven regions based on sale price of the home and size of living space and performed eight spatially discrete regression models.
Next, you'll use Geographically Weighted Linear Regression to model house prices.
Model spatially varying relationships
Next, you'll use Geographically Weighted Linear Regression and Forest-Based Classification and Regression to model house prices.
Geographically Weighted Linear Regression is a continuously varying linear regression model that identifies relationships between a target variable (sale price) and several explanatory variables (property characteristics). Before you use it, you'll test whether statistically significant spatial relationships exist between the variables.
Identify spatial relationships between variables
First, you'll run the Local Bivariate Relationships tool. This tool uses an entropy-based approach to discover spatial relationships. If a significant relationship exists between two variables in a subset of the data, randomizing the data increases the entropy considerably. If there is no significant relationship, randomizing the data does not increase entropy considerably. In other words, randomization tests whether there is a relationship to destroy between the two variables: randomizing cannot change much if there was no relationship to begin with. You can read more about the idea of using entropy to discover relationships in Guo (2010).
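To build intuition for the randomization idea, you can shuffle one variable and watch the measured association collapse. This is a conceptual stand-in only, not the tool's local entropy measure:

```python
# Conceptual stand-in for the randomization idea (not the tool's local entropy measure):
# shuffle one variable and see whether the measured association collapses
# (sketch; assumes df from pandas.read_csv).
import numpy as np

rng = np.random.default_rng(0)
observed = abs(np.corrcoef(df["sqft_living"], df["price"])[0, 1])

shuffled = [
    abs(np.corrcoef(rng.permutation(df["sqft_living"].to_numpy()), df["price"])[0, 1])
    for _ in range(99)
]
# A much larger observed value means there was a relationship to destroy.
print(observed, max(shuffled))
```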
- In the Geoprocessing pane, search for and open the Local Bivariate Relationships tool.
- In the Local Bivariate Relationships tool, set the following parameters:
- For Input Features, choose kc_house_data.
- For Dependent Variable, choose sqft_living.
- For Explanatory Variable, choose price.
- For Number of Neighbors, type 50.
Why choose 50 neighbors?
The neighborhood should be large enough to capture a significant relationship between variables, when such spatial relationships exist. You may need to try a variety of values, but 50 homes is a large enough number of neighbors that you can trust the regression diagnostics to understand whether local regression would work on this dataset; at the same time, it is a small enough percentage of the entire dataset for King County that the local regression will be different than the GLR model.
This is an application of the idea of statistical power of regression, which is the probability of finding a significant best-fit line (with low fit errors) when the population (all homes in King County, Washington) exhibits a significant relationship between variables you are interested in.
- For Output Features, type local_rlns_sqft_living_vs_price.
- Click Run.
The tool runs and adds the local_rlns_sqft_living_vs_price layer to the map.
The symbols for this layer are shown in the Contents pane.
For many of the points in many of the neighborhoods, there is a positive linear relationship between price and living space. Because there are so many points drawn close to each other in this large dataset, there is a risk that positive linear relationships may draw last, which can make them appear to dominate the results. It is worth checking the tool's geoprocessing results to see the numbers of each class.
- At the bottom of the Geoprocessing pane, click View Details.
- If necessary, in the Local Bivariate Relationships (Spatial Statistics Tools) details window, click the Messages tab.
The tool results show that about 71.6 percent of the points show a positive linear relationship.
This result suggests that Geographically Weighted Regression (GWR) can model spatial relationships between sqft_living and price at a neighborhood size of 50 homes.
However, GWR does not simply fit a line at each location using a local subset of the data; it also applies a geographic weighting scheme that weights each neighboring observation in the local regression. Observing significant local linear relationships between variables is an indication that a GWR model will capture local relationships, but it is not a guarantee.
- Close the details window. On the map, click any of the points classified as showing a positive linear relationship (with a pink symbol).
Tip:
If you find it difficult to click a point due to their close proximity to one another, you can zoom in.
The pop-up for the point shows a graph of the local relationships at that location and its neighborhood.
- Close the pop-up. Click a point showing a concave relationship (with an orange symbol).
- Close the pop-up and save the project.
Both of these locations can be summarized with a line; the tool tests different regression models at locations identified as possessing statistically significant relationships in their neighborhoods and reports only the type of relationship it detects.
The majority of King County, Washington, shows statistically significant local relationships for a neighborhood of 50. Here, 50 is a neighborhood size that makes sense. However, the tool does not automatically determine the correct neighborhood value, and for different datasets, different neighborhood sizes should be explored.
If you were running this analysis on your own data, you would now run the tool with different neighborhood sizes to explore the changes to the types of spatial relationships between sqft_living and price. The neighborhood size you find to possess local linear relationships should be used in the Geographically Weighted Regression (GWR) tool in the next step.
Perform Geographically Weighted Regression
You'll define a GWR model with the same conceptualization of spatial relationships as you identified in the previous section: neighborhoods consisting of 50 houses.
- In the Geoprocessing pane, click the Back button. Search for and open the Geographically Weighted Regression (GWR) tool.
This tool can use different types of kernels that control the weight of neighbors in the local regression model.
The following image shows an example of the kernel. The line shows the Gaussian kernel where every neighbor gets a weight in regression, with more distant neighbors getting lower weights. The Bisquare kernel truncates the kernel using a distance or a number of neighbors. This pattern is shown by the part of the curve that is filled in the plot.
You'll use a Bisquare kernel to assign weights using only the 50 nearest neighbors.
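A minimal sketch of the two weighting functions follows (illustrative formulas with assumed parameter names, not the tool's internal code):

```python
# Sketch of the two weighting schemes (illustrative formulas; the parameter names are
# assumptions, not the tool's internal code).
import numpy as np

def gaussian_weight(d, bandwidth):
    # Every neighbor gets a weight; more distant neighbors get smaller weights.
    return np.exp(-0.5 * (d / bandwidth) ** 2)

def bisquare_weight(d, bandwidth):
    # Weights taper to zero at the bandwidth; neighbors beyond it are ignored.
    return np.where(d < bandwidth, (1 - (d / bandwidth) ** 2) ** 2, 0.0)

d = np.linspace(0, 2000, 5)  # hypothetical distances in meters
print(gaussian_weight(d, 1000.0))
print(bisquare_weight(d, 1000.0))
```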
- In the Geographically Weighted Regression (GWR) tool pane, set the following parameters:
- For Input Features, choose kc_house_data.
- For Dependent Variable, choose price.
- For Explanatory Variable(s), check sqft_living.
- For Output Features, type valuation_sqft_living_gwr.
- For Neighborhood Type, choose Number of neighbors.
- For Neighborhood Selection Method, choose User defined.
- For Number of Neighbors, type 50.
You're using a user-defined number of neighbors so you can use the 50-house neighborhood (the number of neighbors you determined with the Local Bivariate Relationships tool).
This tool can also select neighbors using the manual intervals linear search option or using the golden search optimization algorithm.
- Expand Additional Options and confirm that Local Weighting Scheme is set to Bisquare.
The Bisquare weighting method ensures that at every location, exactly 50 (or the number you specify) neighbors are used. The Gaussian option uses all of the locations in the data set as the neighbors (that is, all of the houses in King County) and inversely weights them with respect to their distance. The Bisquare method uses the same weighting scheme but instead of using the entire house data from all of King County, it only uses a neighborhood of 50 houses at every location.
Next, you'll set the coefficient raster workspace, which should be a geodatabase. The tool performs local regression and calculates spatially varying regression coefficients for predictors and the intercept term. It writes the raster surfaces that depict these spatially varying coefficients into this workspace.
- For Coefficient Raster Workspace, click the Browse button. In the Coefficient Raster Workspace window, click Databases and select myproject2.gdb.
- Click OK. In the Geoprocessing pane, click Run.
The tool runs and three new layers are added to the map. Two of these layers are raster layers, which you'll turn off.
- In the Contents pane, uncheck valuation_sqft_living_gwr_SQFT_LIVING and valuation_sqft_living_gwr_INTERCEPT.
As with the GLR model, this GWR model also makes underestimations for houses by the lake. Unlike the GLR model, it underestimates house value on the ocean coast as well.
- For the valuation_sqft_living_gwr layer, under Charts, double-click Distribution of Standardized Residual.
A majority of the points have standardized residuals close to 0. The model makes fewer over- and underestimations (standardized residuals more than one standard deviation away) compared to the GLR model.
Based on the tails of the curve, GWR has fewer locations with large residuals (more than two standard deviations) compared to GLR. This indicates that GWR captures variations in price better compared to the GLR model.
- Close the chart and the Chart Properties pane.
- In the Geoprocessing pane, click View Details. In the details window, scroll to the Model Diagnostics section.
The R2 value is 0.89 and the adjusted R2 (AdjR2) is 0.87. This is a much higher R2 than the GLR models you ran earlier, indicating that this is a more accurate model.
- Close the details window.
- In the Contents pane, press the Ctrl key and uncheck valuation_sqft_living_gwr.
All of the layers on the map are no longer visible.
- Check the following layers to make them visible:
- World Topographic Map
- World Hillshade
- valuation_sqft_living_gwr_SQFT_LIVING
- LargeLakes
- Right-click valuation_sqft_living_gwr_SQFT_LIVING and choose Symbology.
- In the Symbology pane, for Color Scheme, choose Yellow-Green (Continuous).
- For Stretch type, choose Histogram Equalize. Close the Symbology pane.
The Contents pane shows the legend for the valuation_sqft_living_gwr_SQFT_LIVING layer.
All of the local regression coefficients are positive. This implies that GWR modelled a positive relationship between the size of living space and the sale price of the home.
Around both large lakes, the coefficient raster shows a steeper slope for sale price with respect to the size of living space, indicating that a small change in living space for homes close to water corresponds to a much greater increase in price than it does inland. This is expected, as the sale price in these areas is greatly affected by the view, a variable not captured by the size of the living space.
Inland portions of the raster toward the east should not be considered. Due to spatial outliers, the study area is stretched and there is not enough data in the eastern portion of this dataset to trust the underlying coefficient surfaces as they are interpolated. You should not pay attention to coefficients in areas that have sparsely distributed points, as the algorithm interpolates the coefficient between locations with data points.
How can you improve this model further? What about distance features or using a second predictor?
- Uncheck the valuation_sqft_living_gwr_SQFT_LIVING layer to turn it off. Save the project.
Test the grade variable
Based on your previous data visualization, grade was another variable that was linearly correlated to price. First, you'll explore whether the grade variable is spatially correlated to sale price of the home using the Local Bivariate Relationships tool.
- At the bottom of the Geoprocessing pane, click Open History. In the History pane, right-click Local Bivariate Relationships and choose Open.
The tool opens with the parameters that you set earlier.
- In the Local Bivariate Relationships tool pane, change the following parameters:
- For Dependent Variable, choose grade.
- For Output Features, type local_rlns_grade_vs_price.
- Click Run.
The tool runs and adds a layer to the map that shows significant linear relationships between grade and price.
GWR is a linear model, like GLR, so you need to consider the issue of multicollinearity. You'll check if strong local linear relationships between the two predictors exist by performing a Local Bivariate Relationships analysis between sqft_living and grade.
- In the Local Bivariate Relationships tool, change the following parameters:
- For Explanatory Variable, choose sqft_living.
- For Output Features, type local_rlns_grade_vs_sqft_living.
- Click Run.
This map indicates strong local linear relationships between the two predictors: at a neighborhood of 50, grade and square feet of living space are significantly linearly related to each other. Remember that in GLR, you should avoid linearly related explanatory variables. This map indicates that at a local neighborhood of 50 neighbors, the GWR model may fail due to multicollinearity if you include both grade and square feet of living space.
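If you want a quick global sanity check of the same concern outside the tool, a variance inflation factor (VIF) calculation is one option (a sketch, assuming the data frame from the earlier snippet; note that GWR fails on local multicollinearity, which a global check like this can miss):

```python
# Quick global check of the same concern (sketch; assumes df from pandas.read_csv).
# Note that GWR fails on *local* multicollinearity, which a global check can miss.
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = sm.add_constant(df[["sqft_living", "grade"]])
for i, name in enumerate(X.columns):
    print(name, variance_inflation_factor(X.values, i))
```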
Next, you'll try using both variables to see if the tool fails or not.
- In the History pane, right-click the Geographically Weighted Regression (GWR) tool and choose Open.
The tool opens with the parameters that you set earlier.
- In the Geographically Weighted Regression tool, update the following parameters:
- For Explanatory Variable(s), check grade. Confirm that sqft_living is already checked.
- For Output Features, type valuation_sqft_living_grade_gwr.
- Click Run.
As expected, the tool fails.
- At the bottom of the Geoprocessing pane, point to the failure message.
A window appears showing an error message. The error message indicates that multicollinearity was the cause.
A limitation of GWR is that it doesn't work well with variables whose values are spatially clustered, which causes local multicollinearity, and such variables tend to be common among housing attributes. The result shows that you cannot use these two variables together to predict the sale price of the home locally with the current GWR model.
GWR provides a parsimonious spatial regression model; however, it does not work when there is a high correlation between pairs of predictor variables.
Perform Forest-based Classification and Regression
You have a rich dataset containing predictors that you want to incorporate into your regression model. Next, you'll use the Forest-based Classification and Regression (FBCR) model. This type of model is not affected by multicollinearity, because it is not a linear model, and it can model relationships between a vast number of predictor variables (spatial and non-spatial property characteristics) and a target variable (sale price). So far GLR and GWR modelled relationships between sqft_living and price with a line. Locally or globally, a unit increase in a house's size corresponds to an increase in house price.
- In the Geoprocessing pane, click the Back button. Search for and open the Forest-based Classification and Regression tool from the Spatial Statistics toolbox.
- In the Forest-based Classification and Regression tool pane, set the following parameters:
- For Prediction Type, choose Train only.
- For Input Training Features, choose kc_house_data.
- For Variable to Predict, choose price.
- Under Explanatory Training Variables, for Variable, click the Add Many button and check the following variables:
- bedrooms
- bathrooms
- sqft_living
- sqft_lot
- floors
- waterfront
- view
- condition
- grade
- sqft_above
- sqft_basement
- Click Add.
You must indicate whether each predictor is a categorical variable or not. When in doubt, you can check the attribute table to make sure you identify all of the categorical variables. The tool automatically detects string fields as categories, but for numerical categories, such as integers, you must manually identify categorical variables. In this dataset, bedrooms, bathrooms, floors, waterfront, view, condition, and grade are categorical variables that are stored as integers.
- Under Categorical, check the boxes for bedrooms, bathrooms, floors, waterfront, view, condition, and grade.
- For Explanatory Training Distance Features, choose LargeLakes.
This tool can automatically calculate distance to features and use that distance as input, similar to the GLR tool.
- Expand Additional Outputs. For Output Trained Features, type price_predicted, and for Output Variable Importance Table, type variable_importance.
FBCR defines decision trees for random subsets of the data and every tree makes a prediction, called a vote. The forest summarizes these votes as the average and reports a final prediction. The randomness of subsetting the data means forest-based models have results with varying accuracy. You can gauge the impact of random subsampling of training data on the output results—in other words, the stability of the forest-based model—by running the model multiple times and defining a distribution of R2.
In this case, you'll define 20 validation runs. As is the case with the number of trees, a higher number of validation runs is always desirable. Lastly, you'll calculate the uncertainty of your sale price predictions.
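As a point of comparison, the same workflow can be sketched with scikit-learn's random forest. This is an analog, not the Esri implementation: it omits the distance-to-lake feature, treats the categorical fields as numeric codes, and assumes the data frame from the earlier snippet:

```python
# scikit-learn analog of the forest workflow (a sketch, not the Esri implementation):
# many trees, repeated validation runs to gauge stability, and per-variable importances.
# Assumes df from pandas.read_csv; omits the distance-to-lake feature and treats the
# categorical fields as numeric codes for simplicity.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

predictors = ["bedrooms", "bathrooms", "sqft_living", "sqft_lot", "floors",
              "waterfront", "view", "condition", "grade", "sqft_above", "sqft_basement"]
X, y = df[predictors], df["price"]

r2_runs = []
for run in range(20):                                   # 20 validation runs
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=run)
    forest = RandomForestRegressor(n_estimators=1000, n_jobs=-1, random_state=run)
    forest.fit(X_tr, y_tr)
    r2_runs.append(forest.score(X_te, y_te))            # R-squared on the held-out 10 percent

print(f"R2 across runs: {min(r2_runs):.2f} to {max(r2_runs):.2f}")
print(dict(zip(predictors, np.round(forest.feature_importances_, 3))))
```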
- Expand Advanced Forest Options. For Number of Trees, type 1000.
How many trees is enough? The answer is as many as you are willing to wait for the tool to process. Forest-based Classification and Regression becomes more robust to outliers and stable to random data selection if more trees are used. Accept the default values for the rest of the advanced options.
- Expand Validation Options. For Number of Runs for Validation, type 20.
- Check the Calculate Uncertainty box. For Output Validation Table, type validation_r2.
- Click Run.
The tool runs.
Note:
The tool may take over 30 minutes to run. Do not close the Geoprocessing pane after the tool finishes.
After the tool finishes, you'll first investigate the distribution of R2 from the 20 simulations.
- In the Contents pane, scroll down to the Standalone Tables section. Under validation_r2, double-click the Validation R2 chart.
The average accuracy of the FBCR model is approximately 0.79. The model seems to be stable as the R2 changes between 0.74 and 0.83 over the 20 runs. Your numbers may be slightly different.
Next, you'll investigate variable importance.
- In the Contents pane, in the Standalone Tables section, under variable_importance, double-click the Distribution of Variable Importance chart.
The two most important variables are sqft_living and grade. They appear highest on the Y (Importance) axis. Here, importance corresponds to the number of times a tree split is performed based on the variable across the entire forest model. A higher value means more tree splits were based on that variable, so its impact on the forest model's result is larger. The chart also indicates that grade and sqft_living switch importance ranks between different runs of the model. Distance to a large lake is the third-most influential predictor in the model.
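For a rough analogue of this chart outside ArcGIS Pro, scikit-learn forests report impurity-based importances, which is a related but not identical measure to the split-count importance described above. The sketch below uses synthetic data and placeholder feature names.

    import pandas as pd
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor

    # Synthetic predictors x0..x4; in a real workflow these would be house attributes.
    X, y = make_regression(n_samples=1000, n_features=5, noise=10, random_state=0)

    forest = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)
    importance = pd.Series(forest.feature_importances_,
                           index=[f"x{i}" for i in range(X.shape[1])])
    print(importance.sort_values(ascending=False))   # most important variables first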
The R2 is lower than the GWR model with one variable. How can you improve this model?
One way is to remove the predictor variables with low importance. You want to remove variables that aren't important to the model, so they won't be randomly selected for a particular tree at the expense of more important explanatory variables.
The bedrooms, condition, floors, and waterfront variables were the least important, according to the Distribution of Variable Importance chart. You'll remove them.
- Close both charts and the Chart Properties pane.
- In the Geoprocessing pane, in the Explanatory Training Variables section, point to the bedrooms variable and click the Remove button.
- Remove the condition, floors, and waterfront variables.
- Change the following parameters:
- Under Additional Outputs, for Output Trained Features, type output_reduced.
- For Output Variable Importance Table, type variable_importance_reduced.
- Under Validation Options, for Output Validation Table, type validation_r2_reduced.
- Click Run.
Note:
The tool may take several minutes to run.
- After the tool finishes running, at the bottom of the Geoprocessing pane, click View Details. In the tool details window, click the Messages tab.
In the Model Characteristics section, the forest parameters show the tree depth range: the trees make between 26 and 43 splits before reaching a prediction. This suggests the decision trees are deep enough to capture the variability in the predictors that corresponds to the variability in the target variable.
The Model Out of Bag Errors section indicates the impact of adding more trees to the model:
The MSE and variation explained do not change considerably between 500 trees and 1,000 trees. Because there is little change, it can be argued that your model has enough trees and converged to its maximum accuracy.
It is possible that the apparent stability is only a temporary plateau. Even though the stability of these metrics is not conclusive on its own, you can test it by increasing the number of trees and checking whether the OOB error performance changes drastically, for example, MSE or percent of variation explained improving by 10 percent or more. If it does, that is a clear indication to keep adding trees until the performance is stable.
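One way to sanity-check convergence outside the tool is to grow a forest incrementally and watch the out-of-bag (OOB) score level off. The sketch below is a scikit-learn analogue on synthetic data, not the ArcGIS tool itself.

    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor

    X, y = make_regression(n_samples=2000, n_features=10, noise=25, random_state=0)

    # warm_start lets the forest add trees without refitting the existing ones.
    forest = RandomForestRegressor(warm_start=True, oob_score=True, random_state=0)
    for n_trees in (100, 250, 500, 1000):
        forest.set_params(n_estimators=n_trees)
        forest.fit(X, y)
        print(n_trees, "trees, OOB R2:", round(forest.oob_score_, 3))

When the OOB score stops improving as trees are added, the forest has effectively converged.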
The Top Variable Importance section shows the variables driving the forest model.
Distance to water bodies is the third most important variable.
Training data is the data used to grow the trees in the forest, so training R2 measures how well the model predicts data it has already seen; it indicates how well the forest model learns the existing patterns in the training data. Validation data, in contrast, has not been seen by the model, so validation R2 indicates how the model is likely to perform when used for prediction.
An R2 of 0.945 indicates that the FBCR model predicts data used to define the model with a high accuracy. A Validation R2 of 0.78 suggests that this model is generalizable, that is, it can predict data points it has not seen with high accuracy as well.
In regression problems, you use these training metrics as an indication of the potential quality of the model; when you use a trained model to predict data for which you do not know the true answer, you cannot compute them. These diagnostics indicate that, given the training data, the model predicts the data used in its creation well and generalizes to data points it has not seen before.
- Close the details window. In the Contents pane, for the output_reduced layer, double-click the Prediction Interval chart.
This chart shows the uncertainty bounds of the prediction, with the blue line being the actual prediction (also mapped in the output feature class). Uncertainty bounds rapidly widen for homes priced at more than $1,000,000. This trend is due to the small sample size for such expensive homes. For homes more expensive than $1,500,000, the uncertainty bounds are even bigger, as there are even fewer samples in this price range. This plot is a useful way to show the uncertainty relating to your predictions given your training sample.
- Close the chart and the Chart Properties pane. Save the project.
Assess the spatial distribution of uncertainty
Finally, you'll assess the spatial distribution of FBCR model uncertainty. Currently, the model returns P95 and P05, which represent upper and lower estimates of the house price and quantify model-based uncertainty. In other words, the uncertainty in the results comes from your model, which comprises your training data and the FBCR algorithm. If the tool returns $100,000 as the prediction, $90,000 as P05, and $120,000 as P95, the model predicts $100,000, but small changes to the training data could yield a prediction as low as $90,000 or as high as $120,000.
This uncertainty is important to quantify because you do not always know if you have enough samples to model home sale prices accurately. You'll add a new field to contain the uncertainty metric that you'll derive from the tool output. This metric summarizes the three values—P05, prediction (P50), and P95—in one field.
- In the Geoprocessing pane, search for and open the Add Field tool.
- In the Add Field tool pane, set the following parameters:
- For Input Table, choose output_reduced.
- For Field Name, type uncertainty.
- For Field Type, choose Double (64-bit floating point).
- Click Run.
The tool runs and the field is added, but no change occurs on the map.
- In the Geoprocessing pane, click the Back button. Search for and open the Calculate Field (Data Management Tools) tool.
You'll define the uncertainty field as:
uncertainty = (P95 - P05) / P50
This metric quantifies how wide the uncertainty window is with respect to the magnitude of the prediction.
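Using the example values from earlier ($90,000, $100,000, and $120,000), the metric works out as follows.

    # Hypothetical example values: P05, prediction (P50), and P95.
    p05, p50, p95 = 90_000, 100_000, 120_000
    uncertainty = (p95 - p05) / p50
    print(uncertainty)   # 0.3, i.e. the uncertainty window is 30 percent of the prediction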
- In the Calculate Field tool pane, set the following parameters:
- For Input Table, choose output_reduced.
- For Field Name, choose uncertainty.
- Under Expression, for uncertainty =, type (.
- In the Fields column, double-click PRICE_P95.
The text !Q_HIGH! is added to the equation box. This text is the field name, delimited by exclamation marks.
- Click the minus symbol button and double-click PRICE_P05. Type ).
The expression now reads: (!Q_HIGH! - !Q_LOW!)
- Click the division button and double-click PRICE(Predicted).
The full expression reads: (!Q_HIGH! - !Q_LOW!) / !PREDICTED!
- Click the Verify button.
A message informs you that your expression is valid, meaning it can be run without errors.
- Click Run.
The tool runs and the field is calculated based on your expression. No change is made to the map.
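If you prefer to script the two steps you just performed, a minimal arcpy sketch might look like the following. It assumes the output_reduced layer is available in the active map and uses the field names shown in the expression builder (Q_HIGH, Q_LOW, PREDICTED); verify these names against your own data before running.

    import arcpy

    # Add the uncertainty field, then calculate (P95 - P05) / prediction.
    arcpy.management.AddField("output_reduced", "uncertainty", "DOUBLE")
    arcpy.management.CalculateField(
        "output_reduced",
        "uncertainty",
        "(!Q_HIGH! - !Q_LOW!) / !PREDICTED!",   # field names from the tutorial output
        "PYTHON3",
    )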
Next, you'll run a hot spot analysis on the uncertainty field to investigate whether spatial patterns exist in the FBCR prediction uncertainty.
- In the Geoprocessing pane, click the Back button. Search for and open the Optimized Hot Spot Analysis tool.
- In the Optimized Hot Spot Analysis tool pane, enter the following parameters:
- For Input Features, choose output_reduced.
- For Output Features, type output_reduced_HotSpots.
- For Analysis Field, choose uncertainty.
- Click Run.
The resulting map shows that uncertainty tends to be higher in the southern half of the dataset and lower in the northern half.
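If you want to script this step instead, a minimal arcpy sketch is shown below, assuming the layer and field names used above; check the tool parameters against your ArcGIS Pro version.

    import arcpy

    arcpy.stats.OptimizedHotSpotAnalysis(
        "output_reduced",             # input features with the uncertainty field
        "output_reduced_HotSpots",    # output hot spot feature class
        "uncertainty",                # analysis field
    )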
- Save the project.
Your findings indicate that sale price predictions in the northern portion of King County, Washington, are less likely to change in response to random changes in the training data.
You've used Geographically Weighted Linear Regression and Forest-based Classification and Regression to model prices. You also explored the uncertainty of your results. Next, you'll use these models to perform valuation on a new sample of points.
Compare the predictions of models
You have two models with acceptable R2 values, both higher than 0.75 (depending on the level of accuracy you require, this threshold could be higher). One is the GWR model you built with sqft_living and the other is the FBCR model you just built. One model is parsimonious, whereas the other has greater predictive power.
Your company has built new homes in Redmond, Washington, one of the fastest-growing areas for home construction in King County. You'll use these models to perform valuation and compare the results.
Perform valuation with GWR
First, you'll apply the GWR model for valuation. This time, you'll run GWR in prediction mode. The Geographically Weighted Regression tool applies the model you developed for the kc_house_data to the new_homes dataset.
- In the Geoprocessing pane, click Open History.
- In the History pane, right-click the most recent successfully run Geographically Weighted Regression (GWR) tool and choose Open.
Note:
To determine whether a tool was run successfully or not, point to it. The pop-up that appears will state whether the tool failed or completed with warnings.
The tool opens with the parameters that you set previously.
- For Explanatory Variable(s), confirm that sqft_living is checked and grade is unchecked. For Output Features, confirm that the output name is valuation_sqft_living_gwr.
- Expand the Prediction Options section and change the following parameters:
- For Prediction Locations, choose new_homes.
- For Output Predicted Features, type new_home_valuation_gwr.
- Click Run.
The new_home_valuation_gwr layer is added to your map and Contents pane.
- In the Contents pane, right-click new_home_valuation_gwr and choose Zoom To Layer. Zoom out until you can see more context for the layer's location.
Perform valuation with FBCR
Next, you'll use FBCR to predict values. You'll run the Forest-based Classification and Regression tool in prediction mode.
- In the History pane, right-click the most recent successfully run Forest-based Classification and Regression tool and choose Open.
- In the Forest-based Classification and Regression tool pane, for Prediction Type, choose Predict to features.
- For Input Prediction Features, choose new_homes. For Output Predicted Features, type new_home_valuation_fbcr.
- Click Run.
Note:
It may take over 15 minutes for the tool to finish running.
When the tool finishes, the new_home_valuation_fbcr layer is added to the map.
- Save the project.
Compare the results with histograms
You've produced two sale price estimations for the planned development. Next, you'll compare these results. In prediction mode, you do not receive a true result, only an estimation. You can evaluate your results in terms of their consistency with respect to prices in their neighborhoods.
First, you'll compare the histograms of the model outputs.
- In the Contents pane, right-click the new_home_valuation_gwr layer, point to Create Chart, and choose Histogram.
- In the Chart Properties pane, under Variable, for Number, choose Predicted (PRICE).
- Create a histogram for the new_home_valuation_fbcr layer using the PRICE(Predicted) attribute.
- Drag the new_home_valuation_fbcr chart and dock it to the right of the new_home_valuation_gwr chart.
Now, you can compare the charts side-by-side.
The price ranges and average values are similar. Given the property characteristics, the average value for these new homes is about $770,000 to $849,000. The upper limit of the predicted sale price in this area is $1,505,000 for GWR and $1,327,000 for FBCR.
- Close the two chart windows and the Chart Properties pane.
For home prices in this area, the GWR estimate is more reasonable. This is one of the strengths of GWR; it assigns values taking the neighborhood into account. However, all the homes in the kc_house_data dataset are pre-existing homes that are not in as good a condition or grade as these new houses. FBCR uses patterns from such homes across all of King County to make an estimate from the entire dataset.
Compare price valuation per square foot
The new houses have large differences in their attributes. To put the sale price predictions in perspective, you'll calculate price per square foot. You'll join the predictions from GWR and FBCR into one feature class for further comparison.
Before you join the prediction values, you'll update the field names to distinguish them from one another.
- In the Contents pane, right-click new_home_valuation_gwr, point to Data Design, and choose Fields.
The Fields view for the layer appears.
- In the Fields view, under Field Name, double-click PREDICTED. Type Predicted_GWR and press Enter.
The Field Name is updated.
- Under Alias, double-click Predicted (PRICE). Type GWR Prediction and press Enter.
- On the ribbon, in the Fields tab, in the Changes group, click Save.
- In the Contents pane, right-click new_home_valuation_fbcr, point to Data Design, and choose Fields. Change the following fields:
- Under Field Name, change PREDICTED to Predicted_FBCR.
- Under Alias, change PRICE(Predicted) to FBCR Prediction.
- On the ribbon, in the Fields tab, in the Changes group, click Save. Close both Fields views.
Next, you'll join the GWR results and the FBCR results.
- In the Geoprocessing pane, search for and open the Spatial Join tool. Set the following parameters:
- For Target Features, choose new_home_valuation_gwr.
- For Join Features, choose new_home_valuation_fbcr.
- For Output Feature Class, type price_comparison.
- Expand Fields. Under Field Map, for Output Fields, click the Remove button to delete all fields except SOURCE_ID, sqft_living, Predicted_GWR, and Predicted_FBCR.
- Click Run.
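If you want to script the join instead, a minimal arcpy sketch is shown below. The field map editing from the previous step is omitted for brevity, so a scripted run with defaults keeps all fields unless you also configure a FieldMappings object.

    import arcpy

    arcpy.analysis.SpatialJoin(
        "new_home_valuation_gwr",     # target features
        "new_home_valuation_fbcr",    # join features
        "price_comparison",           # output feature class
    )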
The tool runs and the new layer is added to the map. Next, you'll create new fields to calculate the predicted price per square foot for each prediction model.
- In the Contents pane, right-click price_comparison, point to Data Design, and choose Fields.
- In the Fields view, click Click here to add a new field. Create a field with the following parameters:
- For Field Name, type GWR_PSQFT.
- For Alias, type GWR (price per square foot).
- For Data Type, choose Double.
- Create another new field with the following parameters:
- For Field Name, type FBCR_PSQFT.
- For Alias, type FBCR (price per square foot).
- For Data Type, choose Double.
You now have two new fields.
- On the ribbon, on the Fields tab, in the Changes group, click Save. Close the Fields view.
Now that you've added fields to hold the price per square foot values, you'll calculate values based on the predicted value and the area of living space in each house. You'll create an expression that divides the price the GWR model predicted by living space.
- In the Geoprocessing pane, search for and open the Calculate Field (Data Management Tools) tool. Set the following parameters:
- For Input Table, choose price_comparison.
- For Field Name (Existing or New), choose GWR (price per square foot).
- For Expression, build the following expression: !Predicted_GWR! / !sqft_living!
- Click Run.
You'll run the tool again after changing some of the parameters to reflect FBCR instead of GWR.
- In the Calculate Field tool pane, change Field Name (Existing or New) to FBCR (price per square foot). For Expression, create the following expression: !Predicted_FBCR! / !sqft_living!
This expression divides the FBCR Prediction values by living area.
- Click Run.
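If you prefer to script both calculations, a minimal arcpy sketch might look like the following, assuming the field names created earlier in this section.

    import arcpy

    # Predicted price divided by living area, for each model's prediction field.
    arcpy.management.CalculateField(
        "price_comparison", "GWR_PSQFT",
        "!Predicted_GWR! / !sqft_living!", "PYTHON3",
    )
    arcpy.management.CalculateField(
        "price_comparison", "FBCR_PSQFT",
        "!Predicted_FBCR! / !sqft_living!", "PYTHON3",
    )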
Now that you've calculated both fields, you'll compare them. Box plot charts are a good way to compare two distributions. You'll use a box plot to compare the price per square foot estimations of the two methods.
- In the Contents pane, right-click the price_comparison layer, point to Create Chart, and choose Box Plot.
- In the Chart Properties, for Numeric field(s), click Select. Check the boxes next to GWR (price per square foot) and FBCR (price per square foot) and click Apply.
The box plot chart updates, showing the price per square foot estimates from the GWR and FBCR models.
The long whiskers on the FBCR (price per square foot) box indicate that a few houses received significantly higher prices than the rest. The box for GWR (price per square foot) spans a larger range than the FBCR box, which means the first and third quartiles of its predictions are farther apart. In other words, the GWR predictions have higher variation in price per square foot than the FBCR predictions.
The median price per square foot is almost the same for both methods. The location of the median line inside the box for FBCR indicates a left-skewed distribution of predictions, meaning the model frequently predicted higher price per square foot. This result might be due to global patterns in King County showing high prices associated with new houses—the information provided by the grade variable used in FBCR analysis. GWR predictions are symmetric around the mean, showing a more even distribution.
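If you want to reproduce this comparison outside the Charts pane, a minimal sketch might read the two fields into NumPy with arcpy and draw the box plot with matplotlib, which is included in the ArcGIS Pro Python environment. The field names are the ones you created earlier.

    import arcpy
    import matplotlib.pyplot as plt

    # Read the two price-per-square-foot fields from the joined feature class.
    arr = arcpy.da.TableToNumPyArray("price_comparison", ["GWR_PSQFT", "FBCR_PSQFT"])

    plt.boxplot([arr["GWR_PSQFT"], arr["FBCR_PSQFT"]])
    plt.xticks([1, 2], ["GWR (price/sqft)", "FBCR (price/sqft)"])
    plt.ylabel("Predicted price per square foot")
    plt.show()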
- Close the box plot chart and Chart Properties pane. Save the project.
Map the FBCR prediction uncertainty
The distributions for FBCR and GWR predictions exhibit considerable differences. You'll investigate the uncertainty of FBCR for the predicted points.
- Right-click the new_home_valuation_fbcr layer, point to Data Design, and choose Fields.
- Add a field named P95_minus_P5 and set the type to Double. Save the change and close the Fields view.
- In the Geoprocessing pane, open the Calculate Field tool and change the following parameters:
- For Input Table, choose new_home_valuation_fbcr.
- For Field Name, choose P95_minus_P5.
- For Expression, create the following expression: !Q_HIGH! - !Q_LOW!
- Click Run.
- In the Contents pane, turn off the price_comparison and new_home_valuation_gwr layers.
- Right-click new_home_valuation_fbcr and choose Symbology.
- In the Symbology pane, set the following parameters:
- For Field, choose P95_minus_P5.
- For Classes, choose 10.
- For Color scheme, choose Greens (Continuous).
- At the bottom of the Symbology pane, in the Classes tab, click More and choose Format all symbols.
- If necessary, click the Properties tab.
- Under Appearance, for Outline width, type 0.5. For Size, type 10.
- Click Apply.
The layer updates with the new symbology.
Dark greens indicate a high uncertainty range for predictions. Some of the houses have an uncertainty range of up to $1.7 million.
- In the Contents pane, under new_home_valuation_fbcr, in the Charts section, double-click Prediction Interval.
- In the Chart Properties pane, for Date or Number, choose Sort Id by Predicted Value. For Numeric field(s), choose FBCR Prediction, PRICE_P05, and PRICE_P95.
The uncertainty range is approximately $400,000 for all homes except those with prices above $1 million. The model shows that small changes to the King County training data can result in substantial changes to the predicted sale price of a home. Unlike GLR or GWR, FBCR does not extrapolate. If the maximum price in the training data is $1.2 million, any prediction near that upper limit will have high uncertainty. Also, because there are relatively few of the highest-priced houses in the training data, the uncertainty for those houses will be high.
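The no-extrapolation behavior is a general property of forest-based regression, not specific to ArcGIS Pro. The small scikit-learn sketch below, on synthetic data, shows a forest capping its predictions near the training maximum while a linear model extrapolates past it.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    X_train = rng.uniform(500, 3000, size=(500, 1))                  # training living areas
    y_train = 400 * X_train.ravel() + rng.normal(0, 10000, 500)      # synthetic prices

    X_new = np.array([[4000.0], [5000.0]])                           # larger than any training home

    forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
    linear = LinearRegression().fit(X_train, y_train)

    print("forest:", forest.predict(X_new))   # stays near the training maximum
    print("linear:", linear.predict(X_new))   # extrapolates beyond it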
- Close the chart and Chart Properties pane. Save the project.
When comparing the FBCR and GWR models, neither method is inherently superior to the other. They address different needs of valuation. The GWR model defines a spatial model for home sale price and represents the hedonic model for sale price (Can, 1992) with geographically varying weights. In contrast, FBCR defines the relationship between the attributes of a house and its sale price globally. This can be immensely valuable to understand, as some factors affect house prices globally, without spatial variation (François et al., 2005).
In this comparison of methods, GWR is better suited to capture spatial variation in relation to price. It also works well for developing a local model for price, where the predicted house price is reasonable for the neighborhood. However, due to multicollinearity, you cannot use the grade variable as a predictor in GWR. In contrast, FBCR models the impact of the condition of the new houses by using analogs from all of King County, Washington. This model results in higher home prices, which may make sense if the grade of the structures is very high and the developer is considering listing them for a significantly higher price than other houses in the neighborhood. Uncertainty analysis in FBCR shows that prices of expensive homes, those above $1 million, may need to be reassessed. The GWR model shows reasonable values for the Redmond, Washington, area but does not factor in the condition of the new homes.
The workflow in this tutorial showcases regression models in ArcGIS Pro with varying assumptions and levels of complexity. Visualization is a vital part of regression analysis for understanding important variables and exploring relationships between variables. GLR is the simplest model; it relates explanatory variables to a target variable with a global linear model. It is a useful model to try because it is the easiest regression model to understand.
GWR defines a linear model that varies from location to location. GWR solves a linear regression model at every location, where predictor values from nearby neighbors are weighted with a spatial kernel so that near neighbors have more impact on the regression model than distant neighbors. GWR coefficient surfaces are also an effective means of visualizing spatial variation in the relationship between an explanatory variable and a target variable. Local Bivariate Relationships (LBR) is a useful tool for exploring the types of spatial relationships between two variables. When LBR between an explanatory variable and the target variable shows predominantly linear local relationships, that is an indication that a GWR model would be effective. When LBR between two explanatory variables shows a large number of linear relationships, that indicates a GWR model may suffer from multicollinearity if those variables are used jointly.
Finally, a Forest-based Classification and Regression (FBCR) model defines a forest-based model that relates explanatory variables to a target variable. Despite its algorithmic complexity, FBCR can relate a wide variety of explanatory variables, continuous or discrete, to a target variable. FBCR produces valuable diagnostics, such as the variable importance plot, which quantifies the impact of each explanatory variable on the regression model. Despite its flexibility, the FBCR model is sensitive to the training data used to define it. In the sale price example, if certain price ranges are underrepresented, such as a small number of expensive houses (more than $1 million), the forest-based model is not expected to be accurate for those ranges. In addition, FBCR cannot predict beyond the target variable range in the training dataset.
You can find more tutorials in the tutorial gallery.