Create a training dataset

In this lesson, you'll establish a data-driven relationship between ocean measurements at a location and seagrass occurrence using random forest, a supervised machine learning method. To perform this analysis, you'll install the necessary Python libraries and clean the data. First, you'll install the scikit-learn library using the ArcGIS Pro Python Package Manager. The Python Package Manager makes installing Python libraries easy and makes their functionality directly available from ArcGIS Pro. Next, you'll prepare your data to be used in the predictive analysis. You'll create interpolation surfaces to estimate ocean measurements at 10,000 randomly created coastal locations around the United States.

Add Python packages

First, you'll install the Python libraries you'll use later for machine learning and data analysis. ArcGIS Pro includes a default conda environment, arcgispro-py3. The default conda environment includes several common packages, like ArcPy, SciPy, NumPy, and Pandas, among others. You can add and remove packages as needed by working in a clone of this environment. For this lesson, you will be adding two libraries: scikit-learn, a popular machine learning library, and seaborn, a statistical data visualization library.

  1. Download the SeagrassPrediction zipped folder.
  2. Locate the downloaded file on your computer and unzip it to your Documents folder.
    Note:

    It is important to save the file to your Documents folder because you'll need to use the exact file path in the Python script. If the file doesn't have the right path, the script will fail.

    The file contains an ArcGIS Pro package.

  3. Double-click the SeagrassPrediction.aprx file to open the project in ArcGIS Pro. If prompted, sign in using your licensed ArcGIS account.
    Note:

    If you don't have ArcGIS Pro or an ArcGIS account, you can sign up for an ArcGIS free trial.

  4. On the ribbon, click Project. On the left pane, click Python.

    Open Python Package Manager

  5. Scroll down the list of installed packages. If you already have scikit-learn and seaborn installed, skip to the next section.

    If you don't have scikit-learn and seaborn installed, you'll need to use a clone of the default environment to download them. The default environment can't be modified.

  6. Under Project Environment, click Manage Environments.

    The Manage Environments window opens.

  7. In the Manage Environments window, click Clone Default.

    The clone is typically named arcgispro-py3-clone. It may take a few minutes to clone the environment.

  8. Select the cloned environment and click OK.

    For the new environment to be recognized, you need to restart the program.

  9. Restart ArcGIS Pro.
  10. In the Package Manager, click Add Packages and type scikit in the search box.

    Add the scikit-learn package

  11. Click scikit-learn and click Install. In the Install wizard, accept the license agreement and click Install.
    Note:

    The image shown is an example, and a newer version of the package may be available. Install the latest version of the package. If you can't install scikit-learn, your software might have an earlier version of the package pinned. Under Versions, click the menu and choose 0.18.0, then follow the install procedure as written.

  12. If necessary, in the Conda_uac.exe window, click Run.
  13. When the library is done installing, type seaborn in the search box and click Install. Follow the prompts to install the library.
  14. Click Installed Packages and scroll down to make sure both libraries were added to the list.

    Installed Packages

  15. In the upper left corner, click the back arrow to return to the map view.
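
As an optional check, the installation can also be confirmed from Python, for example in the ArcGIS Pro Python window. The following is a minimal sketch using only the standard library; the helper name check_packages is hypothetical, and it assumes scikit-learn is imported under its usual module name, sklearn.

```python
import importlib.util


def check_packages(names):
    """Return {module_name: True/False} for whether each module can be located."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# scikit-learn is imported under the module name "sklearn".
status = check_packages(["sklearn", "seaborn"])
for name, found in status.items():
    print(f"{name}: {'installed' if found else 'missing'}")
```

If either library is reported as missing, make sure the cloned environment is the active one and that ArcGIS Pro was restarted after switching to it.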

Prepare input data

For your analysis, you need to know variables like ocean temperature, salinity, and nutrient concentration to predict the suitability of a location for seagrass growth. As a marine ecologist, you're working with field measurements to build your predictive model. Because field data is not perfect and often has missing values, you'll need to fill in values to complete the raw data before you use it in your analysis.

In the Contents pane, there are four feature classes:

  • EMU_Global_90m: Ecological Marine Unit point data that contains ocean measurements up to a 90-meter water depth.
  • Seagrass_USA: Polygon data for seagrass occurrence. Every polygon in Seagrass_USA is an identified seagrass habitat.
  • US_coastline_shallow: Polygon data for the United States coast that covers bathymetry up to the depth at which seagrass habitats are observed in Seagrass_USA.
  • bathymetry_shallow: Global shallow bathymetry polygon used to predict seagrass globally.

  1. Right-click EMU_Global_90m and choose Attribute Table.

    Open Attribute Table

    The attribute table opens. This feature class contains data from the Ecological Marine Unit dataset, and its attributes are the prediction variables to be used in random forest. Some of these variables include salinity, ocean temperature, and nitrate level. But you can see that this data contains lots of missing values.

    Missing values

    You'll replace the null data using the Fill Missing Values tool. This tool provides estimates for missing values in the data using spatial, spatiotemporal, or temporal neighbors. In this case, you will estimate each missing value as the average of the 100 nearest neighboring points that have values.

  2. On the ribbon, click the Analysis tab and in the Geoprocessing group, click Tools.

    Tools

    The Geoprocessing pane opens.

  3. In the Geoprocessing pane, search for the Fill Missing Values tool. Choose the first result.

    Fill Missing Values tool

  4. For Input Features, expand the list and choose EMU_Global_90m.
  5. Rename Output Features to EMU_Global_90m_Filled.
  6. For Fields to Fill, click the drop-down arrow and check the following variables:

    • dissO2
    • nitrate
    • phosphate
    • salinity
    • silicate
    • srtm30
    • temp

    Add fields

  7. Click Add, and then for Fill Method, choose Average.
  8. For Conceptualization of Spatial Relationships, choose K nearest neighbors and set Number of Spatial Neighbors to 100.

    Fill Missing Values tool parameters

  9. Click Run.

    After the tool finishes, you will see a warning message at the bottom of the Geoprocessing pane explaining that no missing values were filled in for Salinity, SRTM30, and Temp attributes. These attributes did not have missing values, but will be needed later on for analysis, and were included to carry them over to the new output attribute table.

  10. Once the tool is finished, right-click EMU_Global_90m_Filled and open its attribute table.

    Filled values

    The Fill Missing Values tool appends a string to the output field names designated either filled or unfilled. The columns marked as filled include data that has been created by the tool, while unfilled marks original data. For the fields that you filled values for, the tool also creates two more columns with the suffixes _STD and _ESTIMATED. The _STD field shows the standard deviation of the neighboring data points used in estimating the missing value. The _ESTIMATED field shows a 1 if the attribute was filled using the tool and a 0 if the data already existed. Now you have spatially complete data for the oceanic variables needed.

  11. Close the attribute table.

    Filled data symbology

    The new data has been added to the map. Currently, it is symbolized with two points, a blue circle where new data values were added and an empty circle where there is only original data. You don't care that the data has been added, just that it is available, so you'll change the symbology.

  12. In the Contents pane, right-click EMU_Global_90m_Filled and choose Symbology.
  13. In the Symbology pane, for Symbology, expand the menu and choose Single Symbol.

    Single Symbol

    The layer redraws to show the new symbol. This symbol is randomly chosen, so you'll change it to something more uniform and easily seen.

  14. Next to Symbol, click the example.

    Example symbol

    The gallery opens.

  15. In the symbol gallery, choose Circle 1.

    Circle 1 symbol

  16. Click the Properties tab and set Size to 6 pt. Click Apply.
  17. Close the Symbology pane.
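
To make the filling step in this section more concrete, the following is a rough sketch of a K-nearest-neighbor average fill in plain Python. It is not the Fill Missing Values tool's implementation. The function name fill_missing_knn, the use of straight-line planar distance, and the population standard deviation are all assumptions made for illustration. The output mimics the idea of the _STD and _ESTIMATED columns described above.

```python
import math


def fill_missing_knn(points, k=100):
    """Fill missing values with the mean of the k nearest points that have values.

    points: list of (x, y, value) tuples, where value may be None.
    Returns a list of (x, y, value, std, estimated) tuples. estimated is 1 when
    the value was filled and 0 when it was already present.
    """
    known = [(x, y, v) for x, y, v in points if v is not None]
    if not known:
        raise ValueError("At least one point must have a value.")
    result = []
    for x, y, v in points:
        if v is not None:
            result.append((x, y, v, None, 0))  # original value, nothing estimated
            continue
        # Assumption: planar straight-line distance between locations.
        nearest = sorted(known, key=lambda p: math.hypot(p[0] - x, p[1] - y))[:k]
        values = [p[2] for p in nearest]
        mean = sum(values) / len(values)
        # Assumption: population standard deviation of the neighbors used.
        std = math.sqrt(sum((val - mean) ** 2 for val in values) / len(values))
        result.append((x, y, mean, std, 1))
    return result


# Hypothetical sample: the third point has no nitrate measurement.
samples = [(0, 0, 10.0), (1, 0, 20.0), (0, 1, None), (5, 5, 30.0)]
filled = fill_missing_knn(samples, k=2)
print(filled[2])
```

With k=2, the missing point is filled from its two closest neighbors with values, so the distant point at (5, 5) does not influence the estimate.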

Create training data

Next, you'll create the training data that the random forest prediction model will need to form a relationship between seagrass occurrence and ocean conditions. The training dataset will be made up of seven predictor variables (ocean measurements) and one outcome variable (whether a location is a suitable seagrass habitat or not). To be easily accessible to the Python script you'll use later, these predictor variables need to be in a single feature class. You'll create a new feature class of random points, and then you'll add the ocean measurement data to each point.

  1. If necessary, check the boxes for EMU_Global_90m_Filled and Seagrass_USA to turn the layers on.
  2. On the ribbon, click the Map tab. In the Navigate group, click Bookmarks and choose Florida.

    Florida coast

    Most of the EMU_Global_90m_Filled data lies outside the seagrass observation layer. But if you use only the subsample of EMU_Global_90m_Filled that lies within the seagrass polygons, you'll have too few observations. You'll fix this by creating mock locations around the United States coastline and calculating the associated ocean measurements at these locations using EMU_Global_90m_Filled. Use the Create Random Points tool to create a set of random points around the United States coast.

  3. In the Geoprocessing pane, search for the Create Random Points (Data Management) tool.

    Create Random Points tool

  4. For Output Point Feature Class, type USA_Train.
  5. For Constraining Feature Class, expand the menu and choose US_coastline_shallow.
  6. Change Number of Points to 10000 and click Run.

    Create Random Points parameters

    Now you have a new feature class with 10,000 points, which will be helpful in training your random forest model quickly. The problem is that these points don't have attributes. To give these points data, you'll create continuous interpolation surfaces from the EMU_Global_90m_Filled data using empirical Bayesian kriging (EBK) so that you can extract the data from these layers at each USA_Train point. This extraction will let you save the interpolated ocean measurements for each point location.

  7. In the Geoprocessing pane, click the back arrow and search for Empirical Bayesian Kriging. Click the first result.

    Empirical Bayesian Kriging

  8. For Input Features, choose EMU_Global_90m_Filled.
  9. For Z value field, choose TEMP_UNFILLED and set the Output raster to temp.
    Note:

    The temperature attribute may also be listed under its alias, TEMP. If TEMP_UNFILLED isn't listed, check the EMU_Global_90m_Filled attribute table to double-check.

  10. Click Run.
  11. Rerun the tool for each of the remaining ocean measurements:

    Z value field                            Output raster
    DISSO2 (alias: DISSO2_FILLED)            dissO2
    NITRATE (alias: NITRATE_FILLED)          nitrate
    PHOSPHATE (alias: PHOSPHATE_FILLED)      phosphate
    SILICATE (alias: SILICATE_FILLED)        silicate
    SRTM30 (alias: SRTM30_UNFILLED)          srtm30
    SALINITY (alias: SALINITY_UNFILLED)      salinity

    Note:

    Make sure to use the output names exactly as shown, because the code you will use later looks for these specific names. Take care not to confuse the letter O with the digit 0 (for example, dissO2 is spelled with the letter O).

    Once the Empirical Bayesian Kriging tool is run for all seven ocean measurements, the next step is to extract the values for these measurements at USA_Train locations. All the surfaces should look similar to the one below, which shows the EBK model for nitrate concentration.

    Interpolated raster for nitrates

  12. In the Geoprocessing pane, click the back arrow and search for Extract Multi Values to Points (Spatial Analyst Tools).
  13. For Input point features, choose USA_Train.
  14. Next to Input rasters, click the drop-down arrow to expand the menu and add all seven of the interpolation rasters you just created.

    Extract Multi Values to Points parameters

  15. Click Run.

    This tool uses the interpolation rasters to extract the values of these surfaces at USA_Train locations.
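
Conceptually, the extraction step looks up the raster cell under each point and writes that cell's value into a new column named after the raster. The following is a simplified sketch of that idea in plain Python, not the Spatial Analyst implementation. The raster layout (a dictionary with a lower-left origin, a cell size, and rows listed from bottom to top), the function names, and the sample values are all assumptions made for illustration.

```python
import math


def sample_raster(raster, x, y):
    """Return the value of the cell containing (x, y), or None if outside the grid.

    Assumed layout: raster["origin"] is the lower-left corner (x_min, y_min),
    raster["cell_size"] is the cell width and height, and raster["rows"] lists
    rows from bottom to top.
    """
    x0, y0 = raster["origin"]
    col = math.floor((x - x0) / raster["cell_size"])
    row = math.floor((y - y0) / raster["cell_size"])
    rows = raster["rows"]
    if 0 <= row < len(rows) and 0 <= col < len(rows[row]):
        return rows[row][col]
    return None


def extract_multi_values(points, rasters):
    """Build one record per point with a column for each raster name."""
    table = []
    for x, y in points:
        record = {"x": x, "y": y}
        for name, raster in rasters.items():
            record[name] = sample_raster(raster, x, y)
        table.append(record)
    return table


# Two tiny made-up rasters standing in for the interpolated surfaces.
rasters = {
    "temp": {"origin": (0.0, 0.0), "cell_size": 1.0,
             "rows": [[20.0, 21.0], [22.0, 23.0]]},
    "salinity": {"origin": (0.0, 0.0), "cell_size": 1.0,
                 "rows": [[35.0, 35.5], [36.0, 36.5]]},
}
points = [(0.5, 0.5), (1.5, 1.2), (5.0, 5.0)]
train = extract_multi_values(points, rasters)
for record in train:
    print(record)
```

In this sketch, a point that falls outside a raster gets None for that column. How the actual tool represents such points in the attribute table is not covered here.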

Create a training label

The last step in creating the training dataset is determining where you already know seagrass grows. All the points in the training data lie within the US_coastline_shallow polygon, which overlaps with the Seagrass_USA polygons. You'll create a new field named Present and run a simple query to determine whether each point overlaps with a Seagrass_USA polygon. If it does, it is given the categorical value 1 to show that there is known seagrass growth at that location. All other points are given the value 0 to show that no seagrass has been observed there, so they are treated as unsuitable seagrass habitat. Using this attribute, the machine learning model will be able to learn what combinations of ocean conditions are suitable for seagrass growth.

  1. In the Geoprocessing pane, search for Add Field (Data Management Tools).
  2. For Input Table, choose USA_Train, and for Field Name, type Present.
  3. Set Field Type to Double and click Run.

    Add Field

    This tool created an empty field in the USA_Train feature class named Present.

  4. In the Contents pane, right-click USA_Train and click Attribute Table. Make sure Present was added to the table.

    All the rows in the Present column have null data values. You want to assign points in USA_Train that fall in a seagrass polygon a value of 1, and assign a value of 0 for points that are not in a seagrass polygon. First, you'll change all the entries from null to 0.

  5. In the attribute table, right-click Present and choose Calculate Field.

    Calculate Field

    The Calculate Field menu opens in the Geoprocessing pane.

  6. For Input Table, choose USA_Train, and set Field Name to Present.
  7. Under Expression, for Present =, type 0 and click Run.

    Calculate Field parameters

    Now that the entire field is set to 0, you can find point locations that intersect the seagrass layer.

  8. On the ribbon on the Map tab, in the Selection group, click Select by Location.
  9. In the Geoprocessing pane, for Input Feature Layer, choose USA_Train.
  10. Make sure Relationship is set to Intersect, and for Selecting Features, choose Seagrass_USA and then click Run.

    Select Layer By Location parameters

    This tool selects rows in USA_Train that intersect Seagrass_USA polygons. These are the points that you'll give a value of 1.

  11. In the attribute table, right-click Present and choose Calculate Field. Set the parameters as listed below:

    • Input Table: USA_Train
    • Field Name: Present
    • Expression: Present = 1

    Calculate Present field

  12. Click Run.
  13. After the tool is finished running, on the ribbon on the Map tab, in the Selection group, click Clear.

    Clear selection

  14. Close the attribute table and save the map.
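
The labeling workflow in this section (set every point to 0, select the points that intersect a seagrass polygon, then set the selection to 1) can be sketched in plain Python as follows. This is an illustration only, not what Select Layer By Location does internally. The function names and sample coordinates are hypothetical, polygons are assumed to be simple rings without holes, and the ray-casting test used here does not treat points lying exactly on a polygon boundary consistently.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon.

    polygon: list of (x, y) vertices of a simple ring (no holes).
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def label_presence(points, seagrass_polygons):
    """Return a 0/1 label per point, mirroring the Present field."""
    labels = [0] * len(points)  # first pass: every point starts at 0
    for i, (x, y) in enumerate(points):
        if any(point_in_polygon(x, y, poly) for poly in seagrass_polygons):
            labels[i] = 1  # second pass: points inside a seagrass polygon get 1
    return labels


# One made-up square seagrass polygon and three made-up training points.
seagrass = [[(0, 0), (2, 0), (2, 2), (0, 2)]]
train_points = [(1, 1), (3, 1), (0.5, 1.5)]
present = label_presence(train_points, seagrass)
print(present)
```

The resulting list plays the role of the Present column: 1 where a training point falls inside a known seagrass polygon and 0 everywhere else.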

In the next lesson, you'll use the training data to create a model using the random forest classifier.