Chart variable relationships

Before you run a statistical model, you'll create charts to explore the relationship between the price of home sales and other aspects of the house, such as its size and age. Your goal is to identify variables with a strong relationship, which you'll use to create your model.

Explore the data

First, you'll download an ArcGIS Pro project package containing home sales data in King County, Washington. Then, you'll explore the data and become familiar with the variables in its attribute table.

Download the Home Valuation 1 project package.
Browse to the downloaded file and double-click Home_Valuation_1.ppkx to open the project in ArcGIS Pro. If necessary, sign in using your licensed ArcGIS account.
Note:
If you don't have access to ArcGIS Pro or an ArcGIS organizational account, see options for software access.
The project contains a map of King County, Washington. Each point on the map represents a home sale that occurred between May 2014 and May 2015.

On the map, click any point.

The point's pop-up appears, listing the attribute information associated with it.

Pop-up with attribute information

The data contains the following attributes:


Name	Description
Date	The date of the sale, expressed in the format YYYYMMDDT000000.
Price	The final transaction amount in dollars.
Bedrooms	The number of bedrooms.
Bathrooms	The number of bathrooms.
Square Feet (Living)	The interior living space size in square feet.
Square Feet (Lot)	The lot size in square feet.
Floors	The number of floors in the house.
Waterfront	Whether the house is on the waterfront, expressed by the following values: 0: No; 1: No, but within 0.25 miles of the coast; 2: Yes.
Condition	The condition of the house, expressed by the following values: 1 (Poor): Many repairs needed. The house shows serious signs of deterioration. 2 (Fair): Some repairs needed immediately. Many deferred maintenance repairs are needed. 3 (Average): A normal amount of upkeep for the age of the home is needed. 4 (Good): The home is in above average condition for its age, indicating extra attention and care has been taken to maintain it. 5 (Very Good): The home has had excellent maintenance and updating; not a total renovation.
Grade	The overall house grade based on the King County grading system, expressed by the following values: 1 through 3: Falls short of minimum building standards (cabin or inferior structure). 4: Generally older, low quality construction. The house doesn't meet building codes. 5: Lower construction costs and workmanship. The house is small with a simple design. 6: Lowest grade currently meeting building codes; low-quality material, simple design. 7: Average grade of construction and design, common for older subdivisions. 8: Just above average in construction and design. Houses of this quality usually have better materials for both the exterior and interior finishes. 9: Better architectural design, with extra exterior and interior design and quality. 10: Homes generally have high-quality features. Finish work is better, and more design quality is seen in the floor plans. Larger square footage. 11: Custom design and higher quality finish work, including added amenities for solid woods and bathroom fixtures. More luxurious options. 12: Custom design and excellent builders. All materials are of the highest quality and all conveniences are present. 13: Generally custom designed and built. Mansion level. Large amount of the highest quality cabinet work, wood trim, marble, entryways, and so on.
Square Feet (Above)	The size of the house, excluding the basement, in square feet.
Square Feet (Basement)	The size of the basement in square feet.
Year Built	The year the house was built.
Year Renovated	The year the house was renovated.
ZIP Code	The house's ZIP code.

Close the pop-up.
Using data engineering, you'll calculate statistics for the dataset to identify whether any data is missing or seems unusual.
In the Contents pane, right-click King County Housing Data and choose Data Engineering.
A pane appears, listing the fields in the data.
Click the Price field to select it. Press Shift and click the Year Renovated field.
Tip:
Drag the edge of the pane to resize it and see more fields at once.
All fields between the two you clicked are selected.
Drag the selected fields to the part of the pane with the Add All Fields and Calculate button.
Click Calculate.
Statistics, such as the minimum, maximum, and mean values, are calculated for each field.
Scroll through the statistics in the table of fields.
None of the fields have null values for any statistic, meaning there is no missing data. The statistics also provide insights about the overall distribution of the data. For instance, the Min and Max fields indicate that home sales prices ranged between $75,000 and $7,700,000.
One statistic in the table seems strange. The Min value for the Year Renovated field is 0. It is unlikely any houses were actually renovated in this year. You'll examine the field more closely by looking at a chart of its values.
For the Year Renovated field, right-click the Chart Preview cell and choose Open Histogram.
A histogram chart appears, showing the distribution of values in the dataset by year.
The histogram indicates that more than 20,000 records have a value of 0, while a relatively small number have values in more recent years. For this field, a value of 0 means that a home was not renovated at all, not that it was renovated in the year 0.
This distribution may skew any analysis you perform with the field, because the placeholder value of 0 dominates it. Because only a small number of homes were renovated, you won't use this variable for analysis.
Close the histogram, the data engineering pane, and the Chart Properties pane.
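The checks in this section can also be pictured outside ArcGIS Pro. The following Python snippet is a minimal sketch, not what the Data Engineering view runs internally. The records and the field names price and yr_renovated are hypothetical and invented for illustration. It computes the minimum, maximum, and mean of a field, counts null values, and measures how many records use 0 as a placeholder for "never renovated."

```python
from statistics import mean

# Hypothetical sample records; the values and field names are invented
# for illustration and are not taken from the King County dataset.
homes = [
    {"price": 221900, "yr_renovated": 0},
    {"price": 538000, "yr_renovated": 1991},
    {"price": 180000, "yr_renovated": 0},
    {"price": 604000, "yr_renovated": 0},
    {"price": 510000, "yr_renovated": 2005},
]


def summarize(records, field):
    """Return the min, max, mean, and null count for one field."""
    values = [r[field] for r in records if r[field] is not None]
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "nulls": sum(1 for r in records if r[field] is None),
    }


price_stats = summarize(homes, "price")
renovated_stats = summarize(homes, "yr_renovated")

# A minimum of 0 in a year field signals a placeholder, not a real year.
never_renovated = sum(1 for r in homes if r["yr_renovated"] == 0)
share_never_renovated = never_renovated / len(homes)
```

When the share of placeholder values is as large as it is in the tutorial's histogram, the field carries little usable information, which is why it is left out of the analysis.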

Create a scatter plot matrix

Now that you're familiar with the data, you'll examine the relationship between the variables by creating a scatter plot matrix. A scatter plot matrix is a grid of charts showing each possible combination of two variables. The charts also statistically quantify how well one variable can be predicted by the other.

You'll use the scatter plot matrix to identify variables with a strong relationship to price. Later, you'll use these variables to create a linear regression model that predicts home prices.

In the Contents pane, right-click King County Housing Data. Point to Create Chart and choose Scatter Plot Matrix.
A scatter plot matrix pane appears. The Chart Properties pane also appears, displaying options for configuring the scatter plot matrix. First, you'll choose the variables to chart in the matrix.
In the Chart Properties pane, for Numeric fields, click Select.
Check all fields from Price to Year Built. Click Apply.
The scatter plot matrix is created based on the fields you chose.
Check Show trend line.
A line is added to each chart in the matrix that generalizes the relationship between the variables.
If necessary, resize the scatter plot matrix pane so you can see all the charts.
The rows and columns of the matrix correspond to the fields you chose. Each chart shows the relationship between two variables. The selected chart, which by default compares the Price and Year Built fields, is displayed larger in the upper right corner.
The larger version of the selected chart includes an R2 (also known as R² or R-Squared) value. This value is the coefficient of determination, which measures how much of the variation in one variable is explained by its linear relationship with the other. A value close to 1 indicates a strong linear relationship, whether positive (as one variable increases, so does the other) or negative (as one increases, the other decreases). A value close to 0 indicates little or no linear relationship.
The Price and Year Built chart has an R2 value of 0, meaning there is no linear relationship between the variables. This lack of relationship is evident in the chart itself, where prices cluster at the lower end regardless of year and the points do not follow the trend line.
Click the scatter plot for Price and Square Feet (Basement) (column 1, row 2 from the bottom).
The large chart in the upper right corner changes to the one you selected.
These two variables also have a weak relationship, with an R2 value of 0.09. It's possible the relationship is weak because many homes don't have basements. At the bottom of the chart, many points have a basement size of 0 square feet, and these points are spread across the full range of prices.
You could examine each scatter plot individually, but instead you'll change the matrix to display the R2 value for every scatter plot.
In the Chart Properties pane, in the Matrix Layout section, for Lower left, choose R-Squared.
The scatter plot matrix changes. Scatter plots with higher R2 values are displayed with a darker blue color.
The variables with the strongest relationships to price are Square Feet (Living) (0.48), Grade (0.44), and Square Feet (Above) (0.36).
Normally, you'd want to use these variables when creating a linear regression model to predict home prices. However, you'll first make sure none of these variables are redundant.
Click the scatter plot for Square Feet (Living) and Square Feet (Above) (column 4, row 3 from the bottom).
These two variables have the strongest relationship of any pair of variables in the dataset, with an R2 value of 0.77. However, based on the description of each variable you saw when you first looked at the data, this relationship exists because the variables contain similar information. One is the home's living space in square feet, while the other is the home's aboveground space in square feet.
You already saw that many homes don't have basements, so for these homes, both values will be the same. Even in homes with basements, the values will be similar. These variables are mostly redundant.
Redundant variables cause multicollinearity, which will make a linear regression model unstable. To avoid redundancy, you'll only use Square Feet (Living) in your model, not Square Feet (Above).
Lastly, you'll look at the other variable you identified as having a strong relationship with price: Grade.
Click the scatter plot for Price and Grade (column 1, row 4 from the bottom).
While this scatter plot has one of the highest R2 values, the relationship is curvilinear (convex) rather than linear. Price increases with grade, but the increase becomes steeper at higher grades.
To make the relationship stronger and more linear, you'll transform the Grade variable.
Close the scatter plot matrix pane and the Chart Properties pane.
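To make the R2 values in the matrix more concrete, here is a minimal Python sketch of how such a value can be computed for one pair of variables. It assumes the chart's R2 is that of a simple one-variable trend line, in which case it equals the squared Pearson correlation. The function name r_squared and the sample values are hypothetical and are not taken from the King County data.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences.

    For a trend line fit with a single explanatory variable, this equals
    the coefficient of determination (R2).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant variable explains none of the variation
    return (covariance * covariance) / (var_x * var_y)


# Hypothetical values: living space and aboveground space are nearly
# identical, so their R2 is close to 1 (a sign of redundancy).
sqft_living = [1000, 1500, 2000, 2500, 3000]
sqft_above = [1000, 1400, 2000, 2300, 3000]
redundancy = r_squared(sqft_living, sqft_above)
```

Because the value is a squared correlation, it does not show the direction of a relationship. A perfectly decreasing relationship also produces an R2 of 1.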

Transform a variable

The Price and Grade variables have a clear relationship, but it is convex, rather than linear. When modeling home prices, you intend to perform linear regression, which works best when relationships are linear. To improve the model's performance, you'll transform the Grade variable to make the relationship linear.

When a relationship is convex, squaring or cubing the explanatory variable is often an effective way to make the relationship more linear. In this case, cubing produces the best result. You'll add and calculate a new field that represents cubed grades.

In the Contents pane, right-click King County Housing Data and choose Attribute Table.
Above the ribbon, in the Command Search box, type Calculate Field and choose the Calculate Field tool option.
The Calculate Field window appears. Using this tool, you can calculate a new or existing field. You don't want to overwrite the existing data, so you'll make a new field.
In the Calculate Field window, for Input Table, ensure that King County Housing Data is chosen.
For Field Name (Existing or New), type Grade_Cubed.
Because this name is not an existing field, the tool will automatically create a new field. By default, the new field is a text field, but you want the field to contain numbers, so you'll change the field type.
For Field Type, choose Float (32-bit floating point).
Next, you'll create the expression that calculates the field. Your expression will cube the Grade field by multiplying three copies of it together.
For Expression, in the Fields box, scroll down and double-click Grade.
The field is added to the expression, represented by the notation !grade!.
Using the fields and mathematical symbols, create the following expression:
!grade! * !grade! * !grade!
Note:
Alternatively, you can copy and paste the expression into the box.
This expression will cube the Grade field.
Click OK.
The field is calculated.
Scroll to the end of the table and confirm the field was added.
You'll create a scatter plot matrix to see whether cubing the variable produces a linear relationship with a higher R2 value.
Close the attribute table. In the Contents pane, right-click King County Housing Data, point to Create Chart, and choose Scatter Plot Matrix.
In the Chart Properties pane, for Numeric fields, click Select. Check Price, Grade, and Grade_Cubed and click Apply.
Check Show trend line.
The transformed Grade_Cubed variable has a more linear relationship with the Price variable, with an R2 value of 0.5.
This result is an improvement over the original Grade variable, which had an R2 value of 0.44. When performing linear regression, you'll use the Square Feet (Living) and Grade_Cubed variables to predict price.
Close the scatter plot matrix pane and the Chart Properties pane.
On the Quick Access Toolbar, click the Save Project button.

You've identified the variables with the strongest relationship to the dependent variable. You're ready to build a house valuation model using linear regression.
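The effect of the transformation can be illustrated with a small Python sketch. The data here is synthetic. Prices are generated as an exact cubic function of grade, an assumption chosen to exaggerate the effect. Real prices are far noisier, which is why the tutorial's R2 rises only from 0.44 to 0.5. The helper r_squared is a hypothetical function that computes the squared Pearson correlation.

```python
def r_squared(x, y):
    """Squared Pearson correlation (R2 of a one-variable trend line)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return (covariance * covariance) / (var_x * var_y)


# Synthetic, convex relationship: price grows with the cube of grade.
grades = list(range(4, 13))
prices = [50000 + 400 * g ** 3 for g in grades]

# Same arithmetic as the Calculate Field expression
# !grade! * !grade! * !grade!
grade_cubed = [g * g * g for g in grades]

r2_raw = r_squared(grades, prices)         # curved relationship
r2_cubed = r_squared(grade_cubed, prices)  # straightened relationship
```

In this synthetic case the cubed variable fits a straight line almost perfectly, while the raw grade leaves some variation unexplained because a line cannot follow the curve.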

Perform linear regression

Now that you've identified Square Feet (Living) and Grade_Cubed as the explanatory variables to predict price, you'll perform linear regression to model home sales prices in King County. You'll specifically perform generalized linear regression (GLR), which you can access as a geoprocessing tool.

The first time you perform GLR, you'll use the two variables you identified. After assessing the results of that model, you'll improve it by running GLR using new spatial variables that also predict home prices.

Model home prices

First, you'll access the GLR tool and run it to model home prices based on the variables you identified as being strongly related to price.

On the ribbon, click the Analysis tab. In the Geoprocessing group, click Tools.
In the Geoprocessing pane, in the search box, type Generalized Linear Regression.
There are two GLR tools, each in a different toolbox. You'll use the one in the Spatial Statistics toolbox.
In the list of results, click Generalized Linear Regression (Spatial Statistics Tools).
Note:
Do not choose the version of the tool in the GeoAnalytics Desktop toolbox.
GLR uses two types of variables: a dependent variable and one or more explanatory variables. The dependent variable is the variable you want to predict (in this case, price). The explanatory variables are the variables you'll use to make the prediction (in this case, the square footage of living space and the house's grade cubed).
For Input Features, choose King County Housing Data. For Dependent Variable, choose Price.
You can also change the model type. GLR can predict different types of dependent variables based on the model type. You're predicting price, which is a continuous variable (meaning the price can be a wide range of values). The default model type, Continuous (Gaussian), is correct for this type of dependent variable.
If you were predicting a dependent variable that had either a yes or no answer, such as whether a house sold for more than $500,000, you'd instead use the Binary (Logistic) model type. If the dependent variable was a count, such as the number of people making a bid for the house, you'd use the Count (Poisson) model type.
Next, you'll choose the model's explanatory variables. As a result of your previous exploration of the data, you already know what variables you'll use.
For Explanatory Variable(s), check Square Feet (Living) and Grade_Cubed.
For Output Features, delete the existing text and type Valuation_GLR_1.
You won't define any other inputs for now. Later, you'll run GLR again, making use of some of the other optional parameters.
Click Run.
The tool runs. When it finishes, it adds a result layer to the map and the Contents pane.
The symbology of the result layer indicates where the model over- and under-predicted prices compared to the existing home sales prices. Green points are where the actual sales price was higher than the predicted price, while purple points are where the actual sales price was lower than predicted. The darker the color, the larger the mismatch between the prediction and reality.
The darkest green points tend to cluster around bodies of water. It seems your linear regression model is systematically underestimating the sales price of houses near the waterfront. A small change in living space may produce a bigger change in price for a house near a water body than for a house inland.
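The meaning of the two colors can be sketched in a few lines of Python. This is a minimal illustration that uses raw dollar differences and invented prices. The result layer's symbology may be based on standardized rather than raw residuals, and the function name classify_prediction is hypothetical.

```python
def classify_prediction(actual_price, predicted_price):
    """Return the residual and whether the model under- or over-predicted."""
    residual = actual_price - predicted_price
    if residual > 0:
        label = "under-predicted"  # actual higher than predicted (green)
    elif residual < 0:
        label = "over-predicted"   # actual lower than predicted (purple)
    else:
        label = "exact"
    return residual, label


# Hypothetical (actual, predicted) price pairs, invented for illustration.
sales = [(1200000, 850000), (400000, 470000), (515000, 515000)]
results = [classify_prediction(actual, predicted) for actual, predicted in sales]
```

A cluster of large positive residuals in one area, such as along the waterfront, suggests the model is missing a variable that matters there.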

Evaluate the results

You'll evaluate the results of the GLR model by looking at the details of the tool results.

At the bottom of the Geoprocessing pane, click View Details.
The tool results details window appears.
If necessary, click the Messages tab. Optionally, expand the window by dragging its edges to better read the data.
In the Summary of GLR Results table, the Probability and Robust_Pr columns indicate whether a variable used in the model is statistically significant by marking it with an asterisk. Both the Square Feet (Living) and Grade_Cubed variables have an asterisk, so these variables are statistically significant predictors of home sales prices.
Note:
To learn more about interpreting specific values in the results tables, scroll to the Notes on Interpretation section at the bottom of the Messages tab.
In addition, the Variance Inflation Factor (VIF) values are less than 7.5, which indicates the variables are not redundant.
In the GLR Diagnostics table, the Adjusted R-Squared value is 0.552952. This value indicates that these two variables explain about 55 percent of the variation in home sales prices.
The corrected Akaike's Information Criterion (AICc) value is 584688. This is a relative measure of the information lost in the modeling process. The smaller the AICc, the better the model. You can use this value to compare models to each other.
Both the Joint F-Statistic and the Joint Wald Statistic have a probability (p-value) of 0, which indicates the overall model is statistically significant.
Overall, you'd prefer a larger adjusted R2 and a smaller AICc, but this model is a good start for predicting home prices.
Close the details window.
How do you improve your model? Visually interpreting the results suggests that the model often underpredicted home prices near water bodies. Although the Waterfront variable didn't have a strong correlation with price when you created the scatter plot matrix, you'll run GLR again using it to see whether it produces an improvement.
In the Geoprocessing pane, for Explanatory Variable(s), check Waterfront. Ensure Square Feet (Living) and Grade_Cubed remain checked and that the other parameters are unchanged.
Tip:
If you closed the Geoprocessing pane or changed other parameters, you can reopen the tool with the same parameters as when you ran it the first time. On the ribbon, on the Analysis tab, in the Geoprocessing group, click History. In the History pane, right-click Generalized Linear Regression and choose Open.
Click Run.
The tool runs. Because you did not change the Output Features parameter, the result overwrites the Valuation_GLR_1 layer.
At the bottom of the Geoprocessing pane, click View Details.
Running GLR with the Waterfront variable improved the adjusted R2 to 0.606370 (meaning the model now explains more than 60 percent of the variation in home sales prices) and reduced the AICc to 581993. Adding this variable improved your model.
Close the details window.
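Two of the diagnostics discussed in this section follow standard textbook formulas, sketched below in Python. This is a minimal illustration and not necessarily how the GLR tool computes them. The function names and example inputs are hypothetical, and only the 7.5 threshold comes from the tutorial.

```python
VIF_THRESHOLD = 7.5  # threshold cited in the tutorial


def adjusted_r_squared(r2, n_observations, n_explanatory):
    """Penalize R2 for the number of explanatory variables in the model."""
    return 1 - (1 - r2) * (n_observations - 1) / (n_observations - n_explanatory - 1)


def variance_inflation_factor(r2_against_others):
    """VIF for one explanatory variable.

    r2_against_others is the R2 obtained by regressing that variable
    on all of the other explanatory variables.
    """
    return 1 / (1 - r2_against_others)


def is_redundant(r2_against_others):
    """Flag a variable whose VIF exceeds the threshold."""
    return variance_inflation_factor(r2_against_others) > VIF_THRESHOLD


# Hypothetical inputs: 11 observations, 2 explanatory variables, R2 = 0.5.
example_adjusted = adjusted_r_squared(0.5, 11, 2)
```

Adjusted R2 is always at or below the plain R2, and the gap shrinks as the number of observations grows. With thousands of home sales and only a few explanatory variables, the two values are nearly identical.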

Add spatial variables

There are still areas on the map where the model is not predicting home prices well, with clusters of dark green and dark purple points. Many of the dark green points are near Seattle. It's possible prices are higher in the large city compared to its surrounding suburbs.

Your home sales data doesn't include a variable that measures distance to Seattle, as it does for proximity to the waterfront. Instead of an explanatory variable, you'll add an explanatory distance feature, which predicts the dependent variable based on distance to another feature layer.
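One way to picture an explanatory distance feature is as an extra column holding each home's distance to the nearest feature in the chosen layer. The Python sketch below illustrates that idea under stated assumptions: coordinates are planar and in meters, the names and locations are invented, and the tool's own distance calculation may differ.

```python
from math import hypot


def distance_to_nearest(point, features):
    """Straight-line distance from a point to the closest feature."""
    px, py = point
    return min(hypot(px - fx, py - fy) for fx, fy in features)


# Hypothetical planar coordinates in meters, invented for illustration.
seattle = [(0.0, 0.0)]  # a layer with a single point feature
homes = {
    "home_a": (3000.0, 4000.0),
    "home_b": (-6000.0, 8000.0),
}

# The resulting distances act as one more explanatory variable.
distance_variable = {
    name: distance_to_nearest(location, seattle)
    for name, location in homes.items()
}
```

If prices fall as distance from the city center grows, this derived variable gives the regression a way to capture that pattern.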

In the Contents pane, turn on the Seattle layer. Drag the layer above Valuation_GLR_1.
The layer contains a single feature, representing the center of Seattle.
In the Geoprocessing pane, ensure that the parameters are unchanged from the last time you ran the GLR tool.
For Explanatory Distance Features, choose Seattle. For Output Features, change the output name to Valuation_GLR_2.
Click Run.
The tool runs. A new layer is added to the map.
At the bottom of the Geoprocessing pane, click View Details.
The adjusted R2 value is increased to 0.678208 and the AICc value is reduced to 577725, indicating the Seattle explanatory distance feature improved the model.
Close the details window.
You'll improve the model one last time by adding a major employer in the region as an explanatory distance feature. You also no longer need to see the Valuation_GLR_1 layer, so you'll turn it off.
In the Contents pane, turn off Valuation_GLR_1. Collapse Valuation_GLR_1 to hide its legend.
Turn on the Microsoft layer. Drag the layer above Valuation_GLR_2.
This feature is near some clusters of underpredicted prices, though not directly next to them. You'll see whether it improves the model.
In the Geoprocessing pane, for Explanatory Distance Features, add Microsoft.
Confirm the other parameters are unchanged from when you last ran the tool. Click Run.
The tool runs and the Valuation_GLR_2 layer is overwritten.
At the bottom of the Geoprocessing pane, click View Details.
The model has an adjusted R2 value of 0.691140 and an AICc value of 576857. Though these aren't major improvements, they do improve the model.
Your final GLR model explains more than 69 percent of the variation in home sales prices in King County.
Tip:
If you want to share your model or perform predictions later, you can set the Output Trained Model File parameter to create an output file of the model.
Close the details window. Save the project.
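The four model runs can be compared side by side. The Python sketch below uses the adjusted R2 and AICc values reported in this tutorial. The run labels and variable names are hypothetical and chosen for readability.

```python
# (label, adjusted R2, AICc) for each run, using values from the tutorial.
runs = [
    ("Living + Grade_Cubed", 0.552952, 584688),
    ("+ Waterfront", 0.606370, 581993),
    ("+ Seattle distance", 0.678208, 577725),
    ("+ Microsoft distance", 0.691140, 576857),
]

# The preferred model has the smallest AICc.
best_label, best_adjusted_r2, best_aicc = min(runs, key=lambda run: run[2])

# AICc reduction achieved by each refinement over the previous run.
aicc_drops = [
    previous[2] - current[2]
    for previous, current in zip(runs, runs[1:])
]
```

The reductions show that adding the Seattle distance feature lowered the AICc the most, while adding the Microsoft distance feature had the smallest effect.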

The refinements you made to your initial GLR model, from adding the Waterfront variable to incorporating the Seattle and Microsoft layers, made significant improvements. Your model is now reasonably good at predicting home prices.

GLR is a global regression model, meaning it assumes that the relationships between the dependent variable (price) and each of the explanatory variables are the same everywhere in the dataset. For instance, it assumes being on the waterfront near Seattle has the same effect on price as being on the waterfront near the suburb of Tacoma. If these global assumptions aren't correct, that could explain why, even with your best GLR model, there are clusters of over- and underpredictions near water bodies.

Other methods for predicting home prices include spatial regression and machine learning. To learn more about those methods, try the tutorial Predict home prices with machine learning. For the other tutorials in this series, see Predict home prices with regression analysis and machine learning.

You can find more tutorials in the tutorial gallery.

Chart variable relationships Create a scatter plot matrix to identify factors that influence home prices.	25 minutes
Perform linear regression Create a generalized linear regression model to predict home prices and refine it with spatial data.	20 minutes

Related Esri training

Introduction to Regression Analysis

Regression Analysis: Performing Random Forest Regression Using ArcGIS Pro

Regression Analysis: Building a Regression Model Using ArcGIS Pro