Fill gaps in your data with areal interpolation
Interpolate the percentage of seniors across Poland
If you know the values for most of the features in your dataset, you can use them to predict continuous values across the entire area. You'll do this to map the spatial distribution of seniors in Poland.
- Download the FillGaps project package.
- Locate the downloaded file on your computer. Double-click FillGaps.ppkx to open it.
Note:
If you don't have access to ArcGIS Pro or an ArcGIS organizational account, see options for software access.
The project opens in ArcGIS Pro.
This map depicts powiaty, which are administrative units similar to counties, in Poland. The polygons are colored to represent the percentage of the population aged 65 years or older. Unfortunately, the data is incomplete. Ten powiaty contain no value for the percentage of seniors.
This spatial data can be found on ArcGIS Living Atlas of the World. The values for the percentage of seniors were provided by Statistics Poland (the missing values were artificially removed for the purpose of this tutorial).
Demographic data is often difficult to model with geostatistics because urban areas show dramatically different patterns than rural ones. In this case, however, the spatial variation in the data is relatively smooth, without abrupt breaks. This means that the data might be appropriate for geostatistics.
- On the ribbon, click the Analysis tab. In the Workflows group, click Geostatistical Wizard.
The Geostatistical Wizard window appears.
- In the Geostatistical Wizard window, under Geostatistical methods, choose Areal Interpolation.
Most interpolation methods require point data as the input, but areal interpolation uses polygons. In this tutorial, you are using polygons that are nearly complete and fit together like puzzle pieces. You can also use polygons that are widely spaced or overlapping. For example, you may have data representing observations of birds, which is stored in polygons for the ground covered by each observer.
Note:
You can read more about this geostatistical method at What is areal interpolation?
Areal interpolation will process values differently if you declare them as representing averages, rates, or events. You are mapping the percentage of a population over a certain age, which is a rate.
- Under Input Dataset 1, for Type, choose Rate. For Source Dataset, choose Powiaty_Seniors.
- For Count Field, choose 2017 Senior Population, and for Population Field, choose 2017 Total Population.
- Click Next.
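Declaring the input as a rate tells the wizard to treat each polygon's value as a count divided by a population. The sketch below illustrates only that relationship, not the wizard's internal model. The record layout, key names, and figures are hypothetical and are not taken from the Powiaty_Seniors layer.

```python
# Hypothetical records; names and numbers are illustrative only.
powiaty = [
    {"name": "A", "senior_population": 12_000, "total_population": 80_000},
    {"name": "B", "senior_population": 9_000, "total_population": 50_000},
    {"name": "C", "senior_population": None, "total_population": 64_000},  # missing count
]


def senior_rate(record):
    """Return seniors / total as a proportion, or None when it cannot be computed."""
    count = record["senior_population"]
    population = record["total_population"]
    if count is None or not population:
        return None
    return count / population


rates = {record["name"]: senior_rate(record) for record in powiaty}

for name, rate in rates.items():
    print(name, "no data" if rate is None else f"{rate:.1%}")
```

The values are stored as proportions (for example, 0.15), which is why a later step multiplies predictions by 100 to express them as percentages.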
The next window shows a covariance chart. The blue crosses represent your data without any modeling. The blue line represents the model that will be used to predict the percentage of seniors over the entire area. You want to edit parameters of the model until the model line follows the path of the crosses and 90 percent of the crosses fall within the red confidence intervals. Currently, that is not the case.
Not only does the line not follow the crosses closely, but two crosses also lie far from the path. In many situations you won't be able to achieve an ideal model, but you can try to get as close as possible. A good place to start is by making the lag size smaller. Doing so will reduce the area that is searched when sampling to generate the blue crosses.
- Under General Properties, for Lag Size, type 12000.
The model changes. However, the crosses are now even farther from the confidence intervals.
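To build some intuition for what the lag size controls, the following sketch bins pairs of locations by their separation distance and averages the covariance within each bin. It is a simplified, point-based illustration and an assumption on our part: the wizard works with polygons and uses its own estimation procedure, and the function name, coordinates, and values here are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations


def empirical_covariance(points, lag_size, n_lags):
    """Bin point pairs by distance and return {bin: (mean covariance, pair count)}.

    points is a list of (x, y, value) tuples. Pairs farther apart than
    lag_size * n_lags are ignored.
    """
    mean = sum(value for _, _, value in points) / len(points)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for (x1, y1, v1), (x2, y2, v2) in combinations(points, 2):
        distance = math.hypot(x2 - x1, y2 - y1)
        lag_bin = int(distance // lag_size)
        if lag_bin < n_lags:
            sums[lag_bin] += (v1 - mean) * (v2 - mean)
            counts[lag_bin] += 1
    return {b: (sums[b] / counts[b], counts[b]) for b in sorted(counts)}


# Hypothetical centroids spaced 10,000 meters apart along a line.
points = [(0, 0, 0.10), (10_000, 0, 0.20), (20_000, 0, 0.30), (30_000, 0, 0.40)]

small_lags = empirical_covariance(points, lag_size=12_000, n_lags=2)
large_lags = empirical_covariance(points, lag_size=24_000, n_lags=2)

print("lag 12000:", small_lags)
print("lag 24000:", large_lags)
```

With the same number of lags, the smaller lag size only considers pairs up to 24,000 meters apart, so the most distant pair is left out. This is the sense in which a smaller lag size searches a smaller area.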
Next, you'll try to improve the model by changing its shape.
- For Model, choose Stable.
Note:
Stable and K-Bessel models often give the best result, but also take more time to process.
Achieving a perfect model can be difficult or even impossible, especially if you are working with demographic data instead of a natural phenomenon. In this scenario, even though only one of the crosses falls within the confidence intervals, the model line follows the crosses relatively closely. This model isn't perfect, but it is a suitable compromise.
- Click Next.
The next window contains a preview map.
- Click different parts of this preview map.
The map highlights the neighboring polygons that will be used to determine the predicted value for the location you clicked. Polygons colored red will be weighted more heavily in the analysis than those colored green.
- Click Next.
The Cross validation page opens. Cross-validation assesses the accuracy of a prediction surface. It does so by removing a single polygon from the dataset and using the remaining data to predict a value within the removed polygon, repeating the process for each polygon in turn.
The Predicted scatterplot for this model does not look good. Ideally, the red values should follow the trend of the blue and gray lines. Your chart looks more like a random cloud of points. On the other hand, the values listed on the Summary tab look good. These numbers should all be close to zero except for Root-Mean-Square Standardized, which should be close to 1. Because the values are stored as proportions, the Root-Mean-Square value of 0.02 means that the predicted proportion of senior citizens will be off by about 2 percentage points on average from the real value. This is a reasonable margin of error. These values are more indicative of the quality of your model than the scatterplot.
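If you want to see how summary statistics like these relate to the individual predictions, the sketch below computes them from lists of measured values, cross-validated predictions, and standard errors. It assumes the commonly used definitions of these statistics. The function name and the sample numbers are hypothetical, not output from the wizard.

```python
import math


def cross_validation_summary(measured, predicted, standard_errors):
    """Summarize leave-one-out results using commonly used definitions (assumed)."""
    errors = [p - m for p, m in zip(predicted, measured)]
    standardized = [e / s for e, s in zip(errors, standard_errors)]
    n = len(errors)
    return {
        "mean": sum(errors) / n,
        "root_mean_square": math.sqrt(sum(e * e for e in errors) / n),
        "mean_standardized": sum(standardized) / n,
        "root_mean_square_standardized": math.sqrt(
            sum(z * z for z in standardized) / n
        ),
        "average_standard_error": math.sqrt(
            sum(s * s for s in standard_errors) / n
        ),
    }


# Hypothetical proportions of seniors for four polygons.
measured = [0.15, 0.17, 0.16, 0.18]
predicted = [0.16, 0.16, 0.17, 0.17]
standard_errors = [0.01, 0.01, 0.01, 0.01]

summary = cross_validation_summary(measured, predicted, standard_errors)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

In this made-up example, the errors cancel out on average and are about as large as the standard errors claim, so the mean is near zero and the standardized root mean square is near 1.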
- Click Finish. In the Method Report window, click OK.
An interpolated layer is added to the map.
- In the Contents pane, turn off Powiaty_Seniors and turn on Powiaty_Seniors outlines.
The areas with heavy black outlines are the ones with missing data.
Create polygons from the interpolation
The interpolation you created is continuous and ignores the polygon outlines. Geostatistics has smoothed the demographic data to create a gradual surface. While it may not match known data precisely, smooth interpolations like this are often better at predicting unknown values.
Next, you'll convert the continuous interpolation surface into polygons.
- On the ribbon, click the Map tab. In the Navigate group, click Bookmarks and choose Kluczborski.
The map navigates to Kluczborski powiat.
The Areal Interpolation layer is a geostatistical layer, which means that every location on the map has a slightly different value. Some of the polygons that you need to fill, such as this one, have a wide range of predicted values. You'll convert this predicted surface into a polygon layer with a single predicted value for each powiat.
- On the ribbon, click the Analysis tab. In the Geoprocessing group, click Tools.
The Geoprocessing pane appears.
- In the Geoprocessing pane, in the search bar, type Areal Interpolation Layer and in the list of results, choose the Areal Interpolation Layer To Polygons tool.
- For the Areal Interpolation Layer To Polygons tool, enter the following:
- For Input areal interpolation geostatistical layer, choose Areal Interpolation.
- For Input polygon features, choose Powiaty_Seniors.
- For Output polygon feature class, change the output name to Interpolated_Polygons. Make sure to include the underscore.
- Click Run.
The Interpolated_Polygons layer is added to the map.
- On the ribbon, click the Map tab. In the Navigate group, click the Full Extent button to return to the default view of the map.
- In the Contents pane, drag the Interpolated_Polygons layer below the Powiaty_Seniors outlines layer.
- Turn off Areal Interpolation.
You now have a value for percentage of seniors in every polygon.
Although you have the real values for most of those polygons, you only want to use the predicted values for 10 of them. You will select the 10 polygons with missing values and use the Calculate Field tool to add values for those polygons alone.
- Right-click Interpolated_Polygons and choose Attribute Table.
The attribute table appears. It contains all of the data from the Powiaty_Seniors layer and it also has three new fields: Included, Predicted, and Standard Error.
- Double-click the header for the Percent Seniors column to sort it.
Now, all the empty records are at the top of the table. Next, you'll replace these <Null> values with the data from the Predicted field.
- Select all the rows with missing senior data.
Note:
To select multiple rows, click the row number for the first record, and then either press the Shift key and click the last record or drag the cursor across the row numbers you want to select. You can also use the Select By Attributes tool.
- At the top of the attribute table, click the Calculate button.
The Calculate Field tool opens in a pop-up window. The field calculation will only be applied to the selected rows.
- For Field Name, choose Percent Seniors.
- In the Fields list, scroll down and double-click Predicted.
The PercentSeniors = box populates with !Predicted!. This expression takes the values from the Predicted field and writes them into the Percent Seniors field. However, the Predicted values are stored as decimal proportions, while the Percent Seniors field holds percent values. To convert them, you'll multiply the predicted values by 100.
- After !Predicted!, type *100.
- Click Apply.
- In the attribute table, click the Show Selected Records button.
The <Null> values in the Percent Seniors column have been replaced. The unselected rows remain unchanged.
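The logic of this field calculation can be sketched in plain Python as follows. The rows and their values are hypothetical stand-ins for the attribute table. Only records whose PercentSeniors value is missing are treated as selected, and only those receive the predicted proportion multiplied by 100.

```python
# Hypothetical attribute table rows; None stands in for <Null>.
rows = [
    {"PercentSeniors": None, "Predicted": 0.1875},
    {"PercentSeniors": 16.2, "Predicted": 0.125},
    {"PercentSeniors": None, "Predicted": 0.25},
]

# Mimic the selection: only rows with a missing value are updated.
selected = [row for row in rows if row["PercentSeniors"] is None]

for row in selected:
    # Same idea as the expression !Predicted! * 100 applied to the selection.
    row["PercentSeniors"] = row["Predicted"] * 100

for row in rows:
    print(row)
```

The second row keeps its original value because it was never part of the selection.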
- At the top of the attribute table, click Clear to clear the selection.
- Close the attribute table.
Symbolize the map
Finally, you'll symbolize the new layer to match the original one. Instead of setting the symbology parameters one by one, you'll import them from the Powiaty_Seniors layer.
- In the Contents pane, turn off Powiaty_Seniors outlines and click Interpolated_Polygons to select it.
- On the ribbon, on the Feature Layer tab, in the Drawing group, click Import.
The Import Symbology window appears.
- In the Import Symbology window, for Symbology Layer, choose Powiaty_Seniors.
- Click Apply, then click OK.
The symbology of Interpolated_Polygons now matches that of Powiaty_Seniors, your initial layer, but there are no longer any holes in the data.
- On the Quick Access Toolbar, click the Save button.
The process of substituting values to replace missing data is called imputation. Often, values are imputed using the average of the remaining dataset. When your data is spatial, you have better options available to you, because you can assume that things that are closer together are more similar than things that are farther apart. In this tutorial, you used areal interpolation to create a continuous surface across Poland to model the percentage of the population that is over 65 years of age. You then sampled from that surface to predict values for the polygons that were missing data.
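To see why spatial information can help, the sketch below compares imputing a missing value with the overall average against imputing it from its immediate neighbors. Neighbor averaging is used here only as a much simpler stand-in for areal interpolation, and the values along the transect are made up for illustration.

```python
def mean_imputation(values):
    """Estimate a missing value as the average of all known values."""
    known = [v for v in values if v is not None]
    return sum(known) / len(known)


def neighbor_imputation(values, index):
    """Estimate a missing value as the average of its known adjacent values."""
    neighbors = [
        values[i]
        for i in (index - 1, index + 1)
        if 0 <= i < len(values) and values[i] is not None
    ]
    if not neighbors:
        return None
    return sum(neighbors) / len(neighbors)


# Hypothetical percentages along a line of adjacent areas with a steady trend.
transect = [10.0, 12.0, None, 16.0, 18.0, 20.0, 22.0, 24.0]
missing = transect.index(None)

global_estimate = mean_imputation(transect)
local_estimate = neighbor_imputation(transect, missing)

print(f"global mean estimate: {global_estimate:.2f}")
print(f"neighbor estimate:    {local_estimate:.2f}")
```

Because the values change gradually from one area to the next, the neighbor-based estimate falls between the adjacent values, while the global mean is pulled toward the higher values at the far end of the transect.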
Don't forget to tell your map readers that some of the values were imputed. This can be done with labels, a list, or symbology. If your map is included in a report, you can describe the method of imputation.
The Fill Missing Values tool can accomplish the same task. For some datasets, this tool will give better results. For others, geostatistics will be better. It is difficult to know until you have tried both, but if the spatial transition between values is not smooth, Fill Missing Values is recommended.
Note:
Optionally, for an extra challenge, find the Fill Missing Values tool in the Geoprocessing pane and use it to impute the missing values in the Powiaty_Seniors layer. Compare your results to the real values in the Powiaty_full_dataset layer, which you can access by opening the Catalog pane, expanding the Maps folder, and double-clicking the Full Dataset map.
Read more in Fill Missing Values (Space Time Pattern Mining) and the ArcUser article Dealing with Missing Data.
You can find more tutorials in the tutorial gallery.