Set up the project and examine the data

You'll set up the ArcGIS Pro project and examine the input data. But first, you'll learn some basics about the machine learning workflow you'll use in this tutorial.

Understand the machine learning workflow

The fundamental concept of machine learning is to enable computers to learn from sample data and apply what they have learned to unknown data. One way of doing that is to train a regression model and use it to predict new results. This is the approach you'll take in this tutorial.

You want to predict the aboveground biomass (AGB) throughout several counties in Georgia. You'll need the following data:

  • Target sample data—This will be a set of known AGB values for sample locations. You'll use point data extracted from a GEDI satellite lidar trajectory dataset, as shown in the following example image.

    Point data extracted from a GEDI satellite lidar trajectory dataset

  • Explanatory variables—This will be data that can explain the AGB sample values and can then help predict AGB values for new areas. You'll use Landsat 9 multispectral satellite imagery, digital elevation model (DEM) data, and additional derived raster layers. The following example images show the Landsat imagery (left) and DEM raster data (right).

    Landsat 9 scene and DEM raster

The Landsat 9 multispectral satellite imagery was chosen as an explanatory variable because the sensor's spectral characteristics respond to vegetation, which is directly related to biomass. The digital elevation model (DEM) captures topographic variability and terrain complexity, which can also influence vegetation growth.

You will train the model using the target sample data and explanatory variables as input. During the training, the model will capture the relationships between sample values and explanatory variables. Once you are satisfied with the model, you'll use it to predict AGB values throughout the entire Georgia counties extent. This output will be a raster, as shown in the following example image, where the higher AGB values appear in dark green and the lower values in white or light green.

Raster with predicted AGB values
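
The train-then-predict idea described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it fits a straight line by ordinary least squares instead of the random trees model used later in this tutorial, it uses a single explanatory variable, and all sample values are invented.

```python
from statistics import mean

def train(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    return slope, y_bar - slope * x_bar

def predict(model, x):
    """Apply the learned relationship to a location with no known value."""
    slope, intercept = model
    return slope * x + intercept

# Hypothetical samples: one explanatory value per location and its known AGB.
sample_x = [0.2, 0.4, 0.6, 0.8]
sample_agb = [40.0, 80.0, 120.0, 160.0]

model = train(sample_x, sample_agb)   # learn from the samples
predicted = predict(model, 0.5)       # estimate AGB at a new location
```

The two functions mirror the two phases of the workflow: training captures the relationship between sample values and explanatory variables, and prediction applies it to unsampled locations.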

Download and open the project

To get started, you'll download a project that contains the data for this tutorial and you'll open it in ArcGIS Pro.

  1. Download the Estimate_Biomass.zip file and locate the downloaded file on your computer.
    Note:

    Most web browsers download files to your computer's Downloads folder by default.

    The .zip file is 2.9 GB and might take a few minutes to download.

  2. Right-click the Estimate_Biomass.zip file and unzip it to a location on your computer, such as drive C.
  3. Open the extracted Estimate_Biomass folder and double-click Estimate_Biomass.aprx to open the project in ArcGIS Pro.

    Estimate_Biomass.aprx

  4. If prompted, sign in to your ArcGIS organizational account.
    Note:

    If you don't have access to ArcGIS Pro or an ArcGIS organizational account, see options for software access.

    The project opens.

    Initial project view

    The map displays the study area boundaries as a polygon outlined in orange. This area represents 20 counties in Georgia.

Examine the input data

You'll now examine the rest of the input data provided in the project. First, you'll add the Landsat image to the map.

  1. On the ribbon, click the View tab. In the Windows group, click Catalog Pane.

    Catalog Pane button

  2. In the Catalog pane, expand Folders, Estimate_Biomass, and InputData.

    Folders, Estimate_Biomass, and InputData expanded

  3. Under InputData, expand LC09_L2SP_018038_20221004_20230327_02_T1.

    This is a Landsat 9 satellite imagery scene that includes seven spectral bands with surface reflectance values:

    • Band 1—Coastal Aerosol
    • Band 2—Blue
    • Band 3—Green
    • Band 4—Red
    • Band 5—Near Infrared (NIR)
    • Band 6—Short wave infrared (SWIR) 1
    • Band 7—Short wave infrared (SWIR) 2

    Seven Landsat 9 spectral bands

    Note:

    You can drag to expand the width of the pane to better see the longer file names.

    Expanding the width of the pane

    These bands will be used as explanatory variables. You'll now add the Landsat scene to the map.

  4. Right-click LC09_L2SP_018038_20221004_20230327_02_T1_MTL.txt and choose Add To Current Map.

    Add To Current Map menu option

  5. If prompted to calculate statistics, click Yes.

    After a few moments, the image appears on the map. You'll rename it to a shorter name.

  6. In the Contents pane, click Surface Reflectance_LC09_L2SP_018038_20221004_20230327_02_T1_MTL to select it, and click it once more to enter edit mode. Change the name to Landsat9 and press Enter.

    Landsat9 layer renamed

    You'll change the image rendering to natural color, a combination of the red, green, and blue bands, which shows colors close to what the human eye would usually see.

  7. In the Contents pane, make sure that Landsat9 is selected.
  8. On the ribbon, click the Raster Layer tab. In the Rendering group, click the Symbology button.

    Symbology button

  9. In the Symbology pane, set the following parameter values:
    • For Primary symbology, ensure that RGB is selected.
    • For Red, choose SRB4.
    • For Green, choose SRB3.
    • For Blue, choose SRB2.

    Primary symbology parameters

    The image rendering updates to the natural color rendering.

    Natural color rendering

  10. Close the Symbology pane.

    Symbology pane close button

    Next, you'll add the digital elevation model (DEM) to the map.

  11. In the Catalog pane, in the InputData folder, collapse LC09_L2SP_018038_20221004_20230327_02_T1.

    LC09_L2SP_018038_20221004_20230327_02_T1 folder collapsed

  12. Right-click DEM.tif and choose Add To Current Map.

    Add to Current Map menu option

  13. In the Contents pane, rename the DEM.tif layer to DEM.

    DEM layer renamed

  14. Examine the DEM layer on the map.

    The DEM provides elevation data. The lighter tones indicate areas with higher elevation and the darker tones areas with lower elevation.

    DEM layer on the map

    That layer will also be used as an explanatory variable. Next, you'll review the GEDI data.

  15. In the Catalog pane, under InputData, expand the GEDI_L4A folder.

    GEDI_L4A folder expanded.

    This folder contains eight GEDI files that will be used as the samples with known AGB values, or training targets. Note that these are trajectory HDF5 files: they are not raster files but trajectory data. You will learn how to handle this data and display it on the map later in the workflow.

    There are two other data layers in the Contents pane. You have already seen the AOI layer, which delineates the overall study area. There is also the Counties layer, which provides the county boundaries. You will turn it on.

  16. In the Contents pane, expand the arrow next to the Counties layer to reveal its legend, and check the box next to the Counties layer to turn it on.

    Counties layer turned on

  17. Review the AOI and Counties layers (orange and bright purple) on the map.

    AOI and Counties layers on the map.

    You will use these two layers later in the analysis.

  18. Click the boxes next to the Counties, DEM, and Landsat9 layers to turn them off, as you won't need them for the next workflow steps.

    Counties, DEM, and Landsat9 layers turned off

  19. On the Quick Access Toolbar, click Save to save your project.

    Save button

In this part of the workflow, after an overview of the machine learning workflow, you set up the ArcGIS Pro project. You then examined the input data: a seven-band Landsat 9 scene, a DEM raster, GEDI data, and some boundary layers.
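
The long Landsat folder name encodes useful metadata about the scene. The following sketch splits it into its parts, assuming the standard Landsat Collection 2 product identifier convention (sensor and satellite, processing level, WRS path and row, acquisition date, processing date, collection number, and tier). The function name and dictionary keys are hypothetical and chosen for illustration only.

```python
from datetime import datetime

def parse_landsat_scene_id(scene_id):
    """Split a Landsat Collection 2 product identifier into named parts."""
    sensor, level, path_row, acquired, processed, collection, tier = scene_id.split("_")
    return {
        "sensor_satellite": sensor,   # LC09: combined OLI/TIRS sensors, Landsat 9
        "processing_level": level,    # L2SP: Level-2 science product
        "wrs_path": int(path_row[:3]),
        "wrs_row": int(path_row[3:]),
        "acquisition_date": datetime.strptime(acquired, "%Y%m%d").date(),
        "processing_date": datetime.strptime(processed, "%Y%m%d").date(),
        "collection": collection,
        "tier": tier,
    }

scene = parse_landsat_scene_id("LC09_L2SP_018038_20221004_20230327_02_T1")
```

Under this convention, the scene used in this tutorial was acquired on October 4, 2022, over WRS path 18, row 38.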


Process and extract GEDI data

AGB represents living vegetation above the ground, measured as mass per unit area, typically in megagrams (that is, metric tons) per hectare. Measuring AGB physically on the ground over a large study area is labor intensive and nearly impossible. Estimating AGB from remote sensing data is a practical alternative.

GEDI is a satellite lidar mission from NASA that measures the 3D structure of the Earth's surface. This includes the forest canopy height and its vertical structure, that is, the stacked-up layers of trees and shrubs that might together amount to more or less biomass. GEDI captures sample points along the sensor's tracks. From those measurements, the aboveground biomass density (AGBD) can be derived, and the GEDI L4A product contains these derived AGBD point values. The following example image shows the GEDI tracks where sample AGBD data was captured, as they intersect in this tutorial's study area.

Example of GEDI tracks

Such data is delivered as trajectory-structured HDF5 files and can be brought into ArcGIS as a trajectory dataset, a geodatabase data model meant to manage a collection of trajectory files. You will now create a trajectory dataset, add the provided GEDI data to it, and extract the relevant AGBD point data that will be used as training samples later in the workflow.

Create a trajectory dataset

First, you'll create an empty trajectory dataset in the project geodatabase.

  1. In the Catalog pane, expand Databases.
  2. Right-click Estimate_Biomass.gdb, click New, and choose Trajectory Dataset.

    Trajectory Dataset menu option

    In the Geoprocessing pane, the Create Trajectory Dataset tool appears.

  3. For Trajectory Dataset Name, type Gedi.

    Trajectory Dataset Name parameter

  4. Accept the other default values and click Run.

    The trajectory dataset appears in the Contents pane. It contains Footprint and Point sublayers.

    Trajectory dataset in the Contents pane

    This dataset is currently empty and will act as a container for the GEDI data.

Add GEDI data to the trajectory dataset

You'll now add the GEDI data that was provided for this workflow into the empty trajectory dataset you just created.

  1. Switch back to the Catalog pane.

    Catalog pane

  2. In the Catalog pane, expand the Estimate_Biomass.gdb geodatabase, right-click Gedi, and choose Add Trajectories.

    Add Trajectories menu option

    First, you'll set up the trajectory dataset type and properties.

  3. In the Add Data to Trajectory Dataset pane, for Trajectory Type choose GEDI.
  4. Under Trajectory Type, click the Properties button.

    Properties button

  5. In the Trajectory Type Properties window, click the Trajectory tab.

    The GEDI data provided is of the L4A type, so you will set the properties accordingly.

  6. Under Product Filter, choose GEDIL4A.

    GEDIL4A value for Product Filter

  7. Under Groundtracks, check the box next to Name to select all tracks.

    Check box next to Name

    GEDI data is captured as eight distinct beams, and you want to include them all.

  8. Under Predefined Variables, check the box for the Aboveground Biomass Density variable.

    Aboveground Biomass Density variable

    This is the only variable that you are interested in for this dataset.

  9. Click OK to save the properties.
  10. In the Add Data To Trajectory Dataset tool pane, under Input Data, choose Folder, and click the Browse button.

    Input Data parameter

  11. In the Input Data window, expand Folders, Estimate_Biomass, and InputData, click GEDI_L4A, and click OK.

    Input Data window

  12. In the Add Data To Trajectory Dataset tool pane, accept all other default values and click Run.

    Add Data To Trajectory Dataset parameters

    After a few moments, the GEDI data is added to the trajectory dataset and it appears on the map. You will zoom out to see the entire dataset.

  13. In the Contents pane, right-click the Gedi layer, and choose Zoom To Layer.

    Zoom To Layer menu option

    The green polygons crisscrossing North America represent the footprints of the GEDI sensor's trajectories. These specific trajectories were selected because they intersect in the study area.

    GEDI trajectories on the map

  14. In the Contents pane, right-click the Footprint layer, and choose Attribute Table.

    Attribute Table menu option

    The Footprint attribute table appears.

    Footprint attribute table

    Each row corresponds to one trajectory and contains information about it. For instance, the Count field indicates how many points there are in each trajectory.

  15. Close the Footprint table.

    Footprint table close button

    You will now look at the individual points contained in the trajectories.

  16. In the Contents pane, turn on the AOI layer. Right-click the AOI layer and choose Zoom To Layer.

    Gedi trajectory Footprint layer on the map

    Tip:

    If the Gedi trajectory layer doesn't display on the map, zoom out a bit.

  17. Turn off the Footprint layer, and turn on the Point sublayer.

    Point sublayer turned on

    The point layer may take some time to display because it contains hundreds of thousands of points.

    Gedi trajectory point layer on the map

  18. Zoom in to an area of your choice until you see the individual points.

    Gedi trajectory Point layer zoomed in

    Each point contains an AGBD value.

You added GEDI data to a trajectory dataset and you examined it.

Extract the relevant AGBD point data

Only the GEDI points within the study area are relevant to your workflow. You will now extract the points located within the AOI boundary using the Clip tool. The output will be a point feature layer.

  1. In the Geoprocessing pane, click the Back button.

    Back button

  2. In the Geoprocessing search box, type Clip. In the list of results, click the Clip tool to open it.

    Searching for the Clip tool

  3. In the Clip tool pane, set the following parameters:
    • For Input Features or Dataset, choose Point.
    • For Clip Features, choose the AOI layer.
    • For Output Features or Dataset, type AGBD_observations as the output name.

    Clip tool parameters

  4. Click Run.

    After a few moments, the AGBD_observations point layer is added to the map. You will examine it in more detail.

  5. In the Contents pane, turn off the Gedi layer, as you won't need it any longer in this workflow.

    Gedi layer turned off

  6. Right-click the AGBD_observations layer and choose Zoom To Layer.

    Zoom To Layer menu options

    You can see that the AGBD_observations layer contains only the points within the study area.

    AGBD_observations layer on the map

  7. In the Contents pane, right-click the AGBD_observations layer, and choose Attribute Table.

    The AGBD_observations attribute table appears.

    Each row corresponds to a point, and the AGBD field gives the aboveground biomass density value for each point (in metric tons per hectare). In total, there are 106,159 points in this layer.

    AGBD field

  8. Close the AGBD_observations attribute table.

    Next, you will apply an imported symbology to this layer to visualize it more effectively.

  9. In the Geoprocessing pane, click the Back button.
  10. Search for the Apply Symbology From Layer tool and open it.

    Searching for the Apply Symbology From Layer tool

  11. In the Apply Symbology From Layer tool, for Input Layer, choose AGBD_observations.
  12. For Symbology Layer, click the Browse button. Browse to Folders > Estimate_Biomass > InputData and choose the AGBD.lyrx layer file.

    Apply Symbology From Layer tool parameters

  13. Click Run.

    The map updates.

    AGBD_observations layer with the new symbology.

    The AGBD_observations layer now displays with a symbology where the points in dark green tones indicate the highest AGBD values and the points in light yellow color tones indicate the lowest AGBD values. This layer will be used as known samples, or training targets, during the model training.

  14. Press Ctrl+S to save the project.

In this part of the workflow, you created a trajectory dataset and ingested the AGBD variable from GEDI level 4A trajectory data into it. You then extracted the relevant AGBD points as a feature layer and symbolized it.
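
To make the extraction step more concrete, the following sketch shows the basic idea of keeping only the points that fall inside a boundary polygon. It is a simplified illustration, not the Clip tool's implementation: it uses a ray-casting point-in-polygon test on planar coordinates, and the square boundary, point coordinates, and AGBD values are all hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon ring."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray cast from the point cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def clip_points(points, polygon):
    """Keep only the points located inside the polygon."""
    return [p for p in points if point_in_polygon(p["x"], p["y"], polygon)]

# Hypothetical study area (a square) and three sample points.
aoi = [(0, 0), (10, 0), (10, 10), (0, 10)]
points = [
    {"x": 2, "y": 3, "agbd": 85.0},
    {"x": 12, "y": 5, "agbd": 40.0},   # outside the study area
    {"x": 9, "y": 9, "agbd": None},    # inside, but with a null AGBD value
]
agbd_observations = clip_points(points, aoi)
```

Note that the clip keeps points regardless of their attribute values, so points with null AGBD values remain in the output until they are cleaned up separately.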


Prepare derived explanatory variables

You'll now prepare additional explanatory variables from the initial Landsat 9 scene and DEM raster. Specifically, you will create seven spectral indices derived from the Landsat 9 scene and one aspect raster derived from the DEM.

Generate spectral indices

A spectral index combines different spectral bands through a mathematical formula, usually computing some type of ratio. The resulting output is a new raster image that emphasizes a specific phenomenon, such as vegetation, water, urban development, or moisture. These spectral index layers will provide additional information to account for different vegetation conditions, in turn helping better predict AGB values.

Note:

Learn more about common spectral indices.

You'll create several indices that will serve as additional explanatory variables:

  • NDVI—Normalized difference vegetation index
  • EVI—Enhanced vegetation index
  • PVI—Perpendicular vegetation index
  • NBR—Normalized burn ratio
  • NDWI—Normalized difference water index
  • NDBI—Normalized difference built-up index
  • MSI—Moisture stress index

You'll start with NDVI, used to differentiate healthy vegetation from unhealthy vegetation or absence of vegetation. You'll use the Band Arithmetic raster function.

  1. In the Contents pane, turn off the AGBD_observations layer.

    AGBD_observations layer turned off

  2. On the ribbon, on the Imagery tab, in the Analysis group, click the Raster Functions button.

    Raster Functions button

  3. In the Raster Functions pane, in the search box, type Band Arithmetic.

    Band Arithmetic search

  4. In the list of results, click the Band Arithmetic raster function to open it.

    Band Arithmetic raster function button

  5. In the Band Arithmetic Properties raster function pane, set the following parameters:
    • For Raster, choose Landsat9.
    • For Method, choose NDVI.
    • For Band Indexes, type 5 4, corresponding to the near infrared and red bands that are needed for the NDVI calculation.

    Band Arithmetic raster function pane

  6. Click the General tab, and for Name, type NDVI.

    Band Arithmetic General tab

  7. Click Create new layer.

    A new layer named NDVI_Landsat9 is added to the map. The raster in the map contains calculated NDVI values ranging between -1 (absence of vegetation) and 1 (healthy vegetation).

    NDVI_Landsat9 layer on the map

    Next, you'll create the remaining spectral index layers—EVI, NBR, PVI, NDWI, and NDBI—following the same steps.

  8. Repeat steps 4 to 7 with the following band settings:

    Name/Method   Description (for reference)                           Band Indexes   Band names
    EVI           Enhanced vegetation index                             5 4 2          NIR, red, blue
    NBR           Normalized burn ratio (used to identify burn scars)   5 7            NIR, SWIR 2
    PVI           Perpendicular vegetation index                        5 4 0.3 0.5    NIR, red (and slope and gradient values)
    NDWI          Normalized difference water index                     5 3            NIR, green
    NDBI          Normalized difference built-up index                  6 5            SWIR 1, NIR

    For MSI (moisture stress index), the Band Arithmetic raster function doesn't include the MSI option under Method. Instead, you'll use the User Defined option to calculate it, spelling out the mathematical formula explicitly: B6 / B5, where the bands are referred to by B + [a band number]. So, this formula means that the SWIR 1 band should be divided by the NIR band.

  9. Repeat steps 4 to 7 to create the MSI layer, using the following parameters:
    • For Raster, choose Landsat9.
    • For Method, choose User Defined.
    • For Band Indexes, type B6 / B5.
    • Under General, for Name, type MSI.

    Band Arithmetic raster function parameters for MSI

    At the end of this process, all seven index layers should be added to the map and listed in the Contents pane.

    Seven index layers in the Contents pane
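
For reference, the arithmetic behind several of these indices is simple enough to sketch for a single pixel. The following example assumes the commonly published formulas for NDVI, NBR, and NDBI, plus the B6 / B5 ratio given above for MSI. The Band Arithmetic raster function may differ in details such as output scaling, so treat this as an illustration of the formulas and not of the function itself. The reflectance values are hypothetical, and EVI, PVI, and NDWI are omitted for brevity.

```python
def normalized_difference(a, b):
    """Return (a - b) / (a + b), or None when the denominator is zero."""
    total = a + b
    return (a - b) / total if total != 0 else None

def spectral_indices(bands):
    """Compute four indices from one pixel's surface reflectance values.

    `bands` maps Landsat 9 band numbers (as typed in Band Indexes)
    to reflectance values.
    """
    red, nir, swir1, swir2 = bands[4], bands[5], bands[6], bands[7]
    return {
        "NDVI": normalized_difference(nir, red),     # bands 5 4
        "NBR": normalized_difference(nir, swir2),    # bands 5 7
        "NDBI": normalized_difference(swir1, nir),   # bands 6 5
        "MSI": swir1 / nir if nir != 0 else None,    # B6 / B5
    }

# Hypothetical reflectance values for a densely vegetated pixel.
pixel = {2: 0.03, 3: 0.05, 4: 0.04, 5: 0.36, 6: 0.18, 7: 0.09}
indices = spectral_indices(pixel)
```

For this pixel, the high NDVI and low MSI are what you would expect from healthy, well-watered vegetation, which reflects strongly in the near infrared.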

Derive an aspect layer from the DEM

You will now derive an aspect layer from the DEM layer using the Aspect raster function. The aspect indicates the direction that each downhill slope faces (north, south, east, west). It is relevant as an explanatory variable since solar illumination will vary according to the aspect value and this will affect vegetation growth.
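
The following sketch shows one common way to compute aspect for a single cell from its 3 by 3 elevation neighborhood, using weighted finite differences. It is an assumption that this matches the Aspect raster function closely; the actual function may differ in details such as edge handling. The elevation values in the usage lines are hypothetical.

```python
import math

def aspect_degrees(window):
    """Compass direction (0 = north, 90 = east, 180 = south, 270 = west)
    that the center cell's downhill slope faces.

    `window` is a 3x3 list of elevations with rows ordered north to south.
    Returns -1 for flat cells.
    """
    (a, b, c), (d, _, f), (g, h, i) = window
    dz_dx = ((c + 2 * f + i) - (a + 2 * d + g)) / 8  # elevation change toward the east
    dz_dy = ((g + 2 * h + i) - (a + 2 * b + c)) / 8  # elevation change toward the south
    if dz_dx == 0 and dz_dy == 0:
        return -1.0
    angle = math.degrees(math.atan2(dz_dy, -dz_dx))
    # Convert the mathematical angle to a clockwise compass bearing.
    if angle > 90:
        return 450.0 - angle
    return 90.0 - angle

# Terrain dropping toward the east, and terrain dropping toward the south.
east_facing = aspect_degrees([[2, 1, 0], [2, 1, 0], [2, 1, 0]])
south_facing = aspect_degrees([[2, 2, 2], [1, 1, 1], [0, 0, 0]])
```

In the northern hemisphere, south-facing slopes such as the second example generally receive more direct sunlight, which is why aspect can help explain differences in vegetation growth.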

  1. In the Raster Functions pane, search for and open the Aspect raster function.

    Aspect raster function

  2. In the Aspect raster function pane, for Raster, choose the DEM layer.

    Aspect raster function parameters

  3. Click Create new layer.

    A layer named Aspect_DEM is added to the map.

    Aspect_DEM on the map

    In the next section, you will use all the explanatory variable layers you created as input to the machine learning model. However, you won't need to see them on your map, so you will now turn them off.

  4. In the Contents pane, turn off all seven spectral index layers and the DEM and Aspect_DEM layers.
  5. Press Ctrl+S to save the project.

In this part of the workflow, you prepared seven layers derived from the Landsat scene and one aspect layer derived from the DEM. These layers will be used as explanatory variables alongside the Landsat scene and the DEM when training the regression model.


Train a regression model and predict biomass density

You have now prepared the target sample data and explanatory variables. Next, you'll use all this data as input to train your regression model and capture the relationships between known AGBD values and explanatory variables. You will then examine the performance of your model, proceed to do some data cleanup, and retrain your model to obtain a higher performance. Then, you'll use the resulting model to predict AGBD values throughout the entire study area. Finally, you'll summarize the results to obtain the average AGBD by county in the study area.

Train a random tree regression model

First, you'll train the model to predict biomass with the Train Random Trees Regression Model tool. Random trees regression (also known as random forest regression) is a machine learning approach that constructs a multitude of decision trees at training time and combines their individual predictions.
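
To illustrate the idea of building many trees and combining them, here is a deliberately tiny sketch. It makes strong simplifying assumptions: each "tree" is a single-split stump on one explanatory variable, each stump is trained on a bootstrap sample, and predictions are averaged. A full random trees implementation also grows deeper trees and considers random subsets of the explanatory variables. The sample values are invented.

```python
import random
from statistics import mean

def fit_stump(xs, ys):
    """Fit a one-split regression tree: (threshold, left_mean, right_mean)."""
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # every x in the sample was identical: predict the mean
        return (float("-inf"), mean(ys), mean(ys))
    return best[1:]

def fit_forest(xs, ys, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        picks = [rng.randrange(len(xs)) for _ in xs]
        forest.append(fit_stump([xs[i] for i in picks], [ys[i] for i in picks]))
    return forest

def predict(forest, x):
    """Average the predictions of all trees."""
    return mean(left if x < threshold else right for threshold, left, right in forest)

# Hypothetical samples: an NDVI-like value and a known AGBD value per point.
ndvi = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
agbd = [20.0, 25.0, 30.0, 35.0, 150.0, 160.0, 170.0, 180.0]

forest = fit_forest(ndvi, agbd)
low = predict(forest, 0.15)    # sparse vegetation
high = predict(forest, 0.85)   # dense vegetation
```

Because every tree sees a slightly different resampling of the data, the results depend on the random seed, which is one reason repeated training runs can give slightly different outputs.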

  1. In the Geoprocessing pane, if necessary, click the Back button.
    Note:

    If you closed the Geoprocessing pane, you can reopen it: on the ribbon, click the Analysis tab, and in the Geoprocessing group, click Tools.

  2. Search for and open the Train Random Trees Regression Model tool.

    Train Random Trees Regression Model tool search

    You'll define the explanatory variable inputs.

  3. In the Train Random Trees Regression Model tool pane, for Input Rasters, add Landsat9, DEM, and all eight derived explanatory variable layers.

    Input Rasters for the Train Random Trees Regression Model tool

    Caution:

    You should use the exact same order for these layers now in the Train Random Trees Regression Model tool and later in the Predict Using Regression Model tool.

    You'll then point to the AGBD target sample data.

  4. For Target Raster or Points, choose AGBD_observations.
  5. For Target Value Field, choose AGBD.

    The resulting output model will be an .ecd file. You'll choose a name for it.

  6. For Output Regression Definition File, click the Browse button.

    Target Raster and Output Regression Definition File parameters

  7. In the Output Regression Definition File window, browse to Folders > Estimate_Biomass and for Name, type Biomass_model.ecd and click Save.

    Output Regression Definition File window

    The output will also include some additional auxiliary files that you can use to understand the model's accuracy. You'll set up their names.

  8. In the Train Random Trees Regression Model tool pane, expand Additional Outputs.
  9. For Output Importance Table, click the Browse button, browse to Folders > Estimate_Biomass and for Name, type Importance.csv.
  10. For Output Scatter Plots, click the Browse button, browse to Folders > Estimate_Biomass and for Name, type Biomass_scatterplots.pdf.

    Additional Outputs parameters

    Finally, you will also set up the training option parameters.

  11. Expand Training Options.
  12. For Percent of Samples for Testing, type 5, and accept the other default values.

    Percent of Samples for Testing parameter

    Note:

    The 5 percent value (instead of the default 10) ensures that less data will be set aside for testing and more will remain available for training.

  13. Click Run.

    After a couple of minutes, the model training is complete.
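
The effect of the Percent of Samples for Testing parameter can be sketched as a simple random split. How the tool actually selects its test samples is not described in this tutorial, so the shuffle-based approach below is an assumption for illustration. The sample count is the number of points reported earlier for the AGBD_observations layer.

```python
import random

def split_samples(samples, percent_for_testing, seed=0):
    """Shuffle the samples and hold out a percentage of them for testing."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * percent_for_testing / 100)
    return shuffled[n_test:], shuffled[:n_test]

sample_ids = range(106159)  # one ID per point in AGBD_observations
train_ids, test_ids = split_samples(sample_ids, percent_for_testing=5)
```

With 5 percent held out, roughly 5,300 points are reserved for testing and about 100,850 remain for training. With the default 10 percent, about twice as many points would be set aside.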

Review the model performance

To understand the model performance, you will now review the outputs from the Train Random Trees Regression Model tool. Machine learning workflows are often iterative. You must decide if the model is performing optimally or whether cleaning up some of the input data could improve its performance. In that latter case, you will need to retrain the model using the cleaned-up data.

First, you will look at the content of the Importance.csv table, which shows how much each explanatory variable contributed to predicting the target sample values. You'll create a bar chart to summarize that information.
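
If you prefer to inspect the table outside ArcGIS Pro, a few lines of Python can rank the variables. This sketch assumes the table has Explanatory_Variables and Importance fields, as used in the chart settings in this section. The importance values shown here are invented for illustration and are not the tutorial's results.

```python
import csv
import io

# Hypothetical table contents: the values below are made up.
importance_csv = """Explanatory_Variables,Importance
Landsat9_6,0.21
DEM,0.02
MSI_Landsat9,0.15
Landsat9_5,0.18
Aspect_DEM,0.01
"""

def rank_by_importance(csv_text):
    """Return (variable, importance) pairs sorted from most to least important."""
    rows = csv.DictReader(io.StringIO(csv_text))
    ranked = [(row["Explanatory_Variables"], float(row["Importance"])) for row in rows]
    return sorted(ranked, key=lambda item: item[1], reverse=True)

ranking = rank_by_importance(importance_csv)
```

To read the real file instead of the inline text, you could pass the contents of Importance.csv from the Estimate_Biomass folder to the same function.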

  1. In the Contents pane, under Standalone Tables, right-click the Importance.csv table layer, click Create Chart and choose Bar Chart.

    Bar Chart menu option

    An Importance.csv chart pane and a Chart Properties pane appear.

  2. In the Chart Properties pane, set the following parameters:
    • For Category or Date, choose Explanatory_Variables.
    • For Aggregation, choose <none>.
    • Under Numeric field(s), click Select, check the Importance field, and click Apply.

    Chart Properties parameters

    In the Importance.csv chart pane, the Importance by Explanatory_Variable chart appears.

    Importance by Explanatory_Variable chart

    You can observe that the Landsat spectral bands, especially SWIR 1 (Landsat9_6) and near infrared (Landsat9_5), play important roles in explaining (or predicting) the biomass values. Additionally, several band indices make substantial contributions, especially MSI_Landsat9, PVI_Landsat9, and NDBI_Landsat9. On the other hand, the DEM and Aspect_DEM layers contribute the least, which makes sense, since this study area is mostly flat terrain. However, in other extents with more elevation variation, the importance of the elevation data would probably be higher. Next, you'll review the scatterplots document.

    Note:

    The Random Trees algorithm is not deterministic, so the results you obtain may vary slightly.

  3. Close the Importance.csv chart pane.

    Importance.csv chart pane closing button

  4. In File Explorer, browse to the Estimate_Biomass folder, and double-click the Biomass_scatterplots.pdf file to open it.

    Biomass_scatterplots.pdf file

    In the PDF, the first scatterplot shows for each sample point used in training:

    • The original known value (x axis).
    • The predicted value, after the training is complete (y axis).

    Scatterplot in the PDF

    The R² value, ranging from 0 to 1, serves as an indicator of the model's performance. An R² value of 0.834 for the training performance is acceptable. However, while most values are concentrated under 1,000, you can observe some extremely high values scattered from a bit under 1,000 to over 4,000.

    Extremely high values on the scatterplot

    You suspect that these points might be erroneous outliers that degrade the model's learning performance. To decide whether you should keep these extreme points or remove them from the training data, you will review them on the map. First, you will look at a histogram chart for the AGBD_observations layer to choose a more precise threshold for the outlier points.

  5. Close the PDF and switch back to ArcGIS Pro.
  6. In the Contents pane, right-click the AGBD_observations layer and choose Attribute Table.

    Attribute Table menu option

  7. In the attribute table, right-click the AGBD field, and choose Visualize Statistics.

    Visualize Statistics menu option

    The statistics for the AGBD field appear in a histogram chart named Distribution of AGBD.

    Distribution of AGBD chart

    The histogram shows the distribution of the AGBD_observations point features across all possible AGBD values. You can see that most of the points have AGBD values that are less than 700, with only a few points having values greater than 1,000. You will choose 1,000 as the threshold to define outlier points.

    You will now modify the display on the map to make the exploration of the high-value points easier.

  8. In the Contents pane, drag the Landsat9 layer to position it just above Aspect_DEM, and turn on the AGBD_observations and Landsat9 layers.

    AGBD_observations and Landsat9 layers turned on.

  9. Right-click the AGBD_observations layer and choose Symbology.

    Symbology menu option

  10. In the Symbology pane, for Primary symbology, choose Single Symbol.

    Primary symbology with Single Symbol value

    Note:

    The color of the symbol may vary.

    This symbology will make it easier to see the points you select on the map.

    Map updated with the new symbology

    Tip:

    You can shrink the size of the chart pane to increase the size of the map.

    Resizing the chart pane and the map

    You will now select the high-value AGBD points.

  11. In the Contents pane, ensure that the AGBD_observations layer is selected.

    AGBD_observations layer selected

  12. On the ribbon, on the Map tab, in the Selection group, click Select By Attributes.

    Select By Attributes button

  13. In the Select By Attributes window, under Expression, form the expression Where AGBD is greater than 1000.

    Where AGBD is greater than 1000 expression

  14. Click OK.

    About 40 points are selected. They appear in cyan blue on the map.

    40 points selected on the map.

    You will now review a few of these points individually.

  15. Click the AGBD_observations tab, and click the Show Selected Records button at the bottom of the pane.

    Show Selected Records button

    Only the selected features are now listed in the table.

  16. Double-click the row header for the first feature.

    Row header for the first feature

    On the map, the point appears highlighted in yellow.

  17. Zoom in until you can see the imagery details underneath.

    Point highlighted in yellow

    The point falls on some type of not-so-dense grass field, which should not have an AGBD value above 1,000. In contrast, you can see that neighboring points don't appear in cyan, because they were not selected. This means their AGBD value is under 1,000 and is not abnormally high.

  18. In the attribute table, double-click the row header for the third feature.

    Row header for the third feature

    That point also falls on some type of grass field, which should not have a value above 1,000. You can see that these high value points are outliers that must be faulty. You will delete them.

Clean up AGBD observations and retrain the model

You'll now delete the high-value outlier points. You'll also delete the points that have a null value, since they are of no use for training. Then, you'll retrain the model.
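
The cleanup rule you are about to apply can be expressed as a simple filter: keep a point only if its AGBD value is not null and does not exceed the 1,000 threshold. The following sketch applies that rule to a short list of hypothetical values.

```python
def clean_observations(agbd_values, max_valid=1000):
    """Drop null values and values above the outlier threshold."""
    return [v for v in agbd_values if v is not None and v <= max_valid]

# Hypothetical AGBD values, including nulls (None) and two outliers.
raw = [85.2, None, 1520.0, 310.7, 4012.3, None, 12.0]
cleaned = clean_observations(raw)
```

This is the complement of the selection expression used in the steps below: the points selected by "AGBD is greater than 1000 Or AGBD is null" are the ones that the filter discards.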

  1. In the Contents pane, right-click AGBD_observations and choose Zoom To Layer.
  2. On the ribbon, on the Map tab, click the Select By Attributes button.

    In the Select By Attributes window, the first clause Where AGBD is greater than 1000 is still present. You will add a second clause to select the features with null values.

  3. In the Select By Attributes window, click the Add Clause button.

    Add Clause button

  4. For the new clause, form the expression Or AGBD is null and click OK.

    Or AGBD is null expression

    In the AGBD_observations attribute table, there are now over 20,000 points selected, including both the abnormally high values and the null values.

    Over 20,000 points selected

  5. On the attribute table toolbar, click the Delete Selection button.

    Delete Selection button

  6. When prompted to confirm that you want to delete the data, click Yes.

    You will save these edits.

  7. On the ribbon, on the Edit tab, in the Manage Edits group, click Save.

    Save button on the Edit tab

    The selected points are deleted from the AGBD_observations feature class. Next, you will rerun the training tool with the updated data to obtain a higher performing model.

  8. On the ribbon, on the Analysis tab, in the Geoprocessing group, click History.

    History button

    The History pane appears. It contains the history of all the tools you have run so far in this project.

  9. In the History pane, double-click the Train Random Trees Regression Model entry.

    Train Random Trees Regression Model entry in the History pane

    The Train Random Trees Regression Model tool appears, with all the parameter values you used originally.

    Train Random Trees Regression Model tool with original parameters

    You will rename the outputs, so that they don't overwrite the original results.

  10. For Output Regression Definition File, rename Biomass_model.ecd to Biomass_model2.ecd.
  11. Expand Additional Outputs, rename Importance.csv to Importance2.csv, and rename Biomass_scatterplots.pdf to Biomass_scatterplots2.pdf.

    Output files renamed

  12. Click Run.

    After a couple of minutes, the model is retrained.

  13. In File Explorer, browse to the Estimate_Biomass folder, and double-click the Biomass_scatterplots2.pdf file to open it.

    Biomass_scatterplots2.pdf file

    In the PDF, in the first scatterplot, you can see that the model performance has improved to an R² of 0.888 (up from 0.834 previously). Also note that all the values in the plot are now lower than 1,000.

    New version of the scatterplot

    You have also obtained better results in the second and third scatterplots found in the PDF, which show the model performance on test points.

  14. Close the PDF and switch back to ArcGIS Pro.
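
To build some intuition for the R² values reported in the scatterplots, the following minimal Python sketch computes R² using one common definition, the coefficient of determination (1 minus the ratio of the residual sum of squares to the total sum of squares). Whether the training tool uses exactly this formula is an assumption, and all the numbers below are made up. The sketch also shows how a single faulty, very high target value can lower the score.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_observed = sum(observed) / len(observed)
    ss_residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_total = sum((o - mean_observed) ** 2 for o in observed)
    return 1 - ss_residual / ss_total


# Made-up observed and predicted AGBD values for four clean sample points.
observed_clean = [100.0, 200.0, 300.0, 400.0]
predicted_clean = [110.0, 190.0, 310.0, 390.0]

# Same points plus one hypothetical faulty observation (1,500) that the
# model, relying on the explanatory variables, predicts as 250.
observed_with_outlier = observed_clean + [1500.0]
predicted_with_outlier = predicted_clean + [250.0]

r2_clean = r_squared(observed_clean, predicted_clean)
r2_with_outlier = r_squared(observed_with_outlier, predicted_with_outlier)

print("R2 without the faulty point:", round(r2_clean, 3))
print("R2 with the faulty point:", round(r2_with_outlier, 3))
```

The faulty point cannot be explained by the explanatory variables, so it inflates the residuals and lowers the score. This is consistent with the improvement you observed after removing the outliers.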

Create biomass prediction

You'll now use the model to predict biomass for the entire study area. You will do that with the Predict Using Regression Model tool. The input will be the same explanatory variables that you used for the model training (seven-band Landsat scene, DEM layer, spectral index layers, and aspect layer).

  1. In the Geoprocessing pane, click the Back button.
  2. Search for and open the Predict Using Regression Model tool.

    Predict Using Regression Model tool search

  3. In the Predict Using Regression Model tool pane, for Input Rasters, add Landsat9, DEM, and all eight derived layers in the same order as before.

    Input Rasters for the Predict Using Regression Model tool

    Caution:

    It is important that you use the same order for these layers in the Predict Using Regression Model tool as you did earlier in the Train Random Trees Regression Model tool.

    You will now point to the trained model.

  4. For Input Regression Definition File, click the Browse button, browse to Folders > Estimate_Biomass, click Biomass_model2.ecd, and click OK.

    Finally, you will name the output.

  5. For Output predicted raster, type Biomass_prediction.crf.

    Output predicted raster parameter

  6. Click Run.

    After a few minutes, the resulting layer is added to the map. You will now change the color scheme.

  7. In the Contents pane, right-click the Biomass_prediction.crf symbol.

    Biomass_prediction.crf symbol

  8. In the color scheme drop-down list, check the Show names box, and click the Blue-Green (Continuous) color scheme.

    Blue-Green (Continuous) color scheme

  9. Turn off the AGBD_observations and Landsat9 layers.

    AGBD_observations and Landsat9 layers turned off

  10. Turn off all the derived layers (spectral indices and aspect).
  11. On the map, review the Biomass_prediction.crf layer.

    Dark green tones indicate the areas with the highest biomass density, and light or white tones indicate low density or absence of biomass.

    Biomass_prediction.crf layer on the map
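
The earlier caution about keeping the input rasters in the same order can be illustrated with a toy example. The sketch below assumes that a trained model identifies its inputs by position rather than by name. It uses a deliberately simplified linear stand-in, not the random trees algorithm, and the weights, intercept, and pixel values are all hypothetical.

```python
# Hypothetical model parameters: one weight per input position.
# During "training", position 0 was a spectral value and position 1
# was elevation in meters.
weights = [120.0, 0.05]
intercept = 10.0


def predict(features):
    """Toy positional model: intercept plus a weighted sum of the inputs."""
    return intercept + sum(w * x for w, x in zip(weights, features))


# Made-up values for a single pixel.
pixel = {"spectral": 0.4, "elevation": 150.0}

# Same order as in training versus the two inputs swapped.
same_order = predict([pixel["spectral"], pixel["elevation"]])
swapped_order = predict([pixel["elevation"], pixel["spectral"]])

print("Prediction with the training order:", same_order)
print("Prediction with the inputs swapped:", swapped_order)
```

The model runs without error in both cases, but the swapped inputs produce a wildly different value, because each weight is applied to the wrong variable.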

Summarize biomass density by county

Finally, you'll compute the biomass density per county. You'll use the Counties polygon layer and the Zonal Statistics as Table tool to find the average biomass density per county, and you'll generate a chart to give an overview of your results.
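
Conceptually, a zonal mean groups the raster cells by the zone they fall in and averages the values in each group. The following minimal Python sketch shows that idea on tiny made-up grids. The grids, the zone labels A, B, and C, and the function name zonal_mean are hypothetical, and the sketch assumes that cells with no value are skipped. It is an illustration of the concept, not the tool's implementation.

```python
from collections import defaultdict

# Made-up 3x4 grids. zone_grid holds a zone label per cell (None means the
# cell is outside every zone). value_grid holds a predicted biomass density
# per cell (None means no data).
zone_grid = [
    ["A", "A", "B", "B"],
    ["A", "A", "B", None],
    ["C", "C", "B", None],
]
value_grid = [
    [100.0, 120.0, 40.0, 60.0],
    [80.0, 100.0, 50.0, 30.0],
    [200.0, 220.0, None, 10.0],
]


def zonal_mean(zones, values):
    """Average the cell values per zone, skipping cells without zone or value."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for zone_row, value_row in zip(zones, values):
        for zone, value in zip(zone_row, value_row):
            if zone is None or value is None:
                continue
            totals[zone] += value
            counts[zone] += 1
    return {zone: totals[zone] / counts[zone] for zone in totals}


means = zonal_mean(zone_grid, value_grid)

# Sort from highest to lowest mean, as in a descending bar chart.
ranked = sorted(means.items(), key=lambda item: item[1], reverse=True)
for zone, mean_value in ranked:
    print(zone, mean_value)
```

Sorting the result in descending order mirrors the Y-axis Descending sort option you will apply to the bar chart.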

  1. In the Contents pane, turn on the Counties layer.

    Counties layer turned on

    The county boundaries appear on the map.

    County boundaries on the map

  2. In the Geoprocessing pane, click the Back button.
  3. Search for and open the Zonal Statistics as Table tool.

    Zonal Statistics as Table tool search

  4. In the Zonal Statistics as Table tool pane, set the following parameters.
    • For Input Raster or Feature Zone Data, choose Counties.
    • For Zone Field, verify that Name is selected.
    • For Input Value Raster, choose Biomass_prediction.crf.
    • For Output Table, type Average_biomass_by_county.
    • For Statistics Type, choose Mean.

    Zonal Statistics as Table tool parameters

  5. Accept all other default values and click Run.

    The Average_biomass_by_county table is added to the Contents pane.

  6. In the Contents pane, under Standalone Tables, right-click the Average_biomass_by_county table, click Create Chart, and choose Bar Chart.

    Bar chart menu option

  7. In the Chart Properties pane, on the Data tab, set the following parameters:
    • For Category or Date, choose NAME.
    • For Aggregation, choose <none>.
    • Under Numeric field(s), click Select, check the MEAN field, and click Apply.
    • Under Sort, choose Y-axis Descending.

    Chart Properties Data tab

  8. Click the General tab and set the following parameters:
    • For Chart title, type Average biomass by county.
    • For X axis title, type Counties.
    • For Y axis title, type Biomass density (in metric tons per hectare).

    Chart Properties General tab

  9. In the Average_biomass_by_county chart pane, view the Average biomass by county chart.

    Average biomass by county chart

    From the bar chart, you see that some counties, such as Telfair, Houston, Macon, and Ben Hill, have higher average biomass density. Based on the United States Energy Information Administration report, almost half of the households in Georgia use biomass as a fuel, and 80 percent of that use happens in rural areas. Understanding the status of biomass in those rural counties will help the government develop practical policies to moderate biomass consumption and prevent forest and biodiversity loss.

    Note:

    You can also join the Average_biomass_by_county table to the Counties layer to create a thematic map showing the average biomass by county. To do that, in the Contents pane, right-click Counties, click Joins and Relates, and choose Add Join.

  10. Press Ctrl+S to save the project.

In this tutorial, after setting up the project and examining the data, you prepared a trajectory dataset containing GEDI data and extracted the relevant AGBD point data for the study area. You used raster functions to prepare explanatory variables. You then trained a model to predict biomass density. You examined the performance of the model, cleaned up the data, and retrained the model to obtain a higher performance. You used this better performing model to predict biomass density throughout your entire study area. Finally, you summarized the results to obtain the average biomass density per county in the study area.

To keep this workflow brief, you used a relatively small study area. To apply a similar workflow to large areas that span several Landsat scenes and include images containing clouds or shadows, it is recommended that you first remove the clouds and shadows and combine the images into a mosaic dataset. Refer to the Python workflow and code-free workflow on creating a cloud-free image composite from satellite imagery. Furthermore, because the data used in this tutorial is also accessible from cloud platforms such as AWS or Microsoft Planetary Computer, you can take advantage of direct data access and cloud-based computing using ArcGIS Pro. To learn more, see the Cloud-Based Aboveground Biomass Mapping using Landsat and GEDI Data article.

You can find more tutorials in the tutorial gallery.