
Perform random forest classification

In the previous lesson, you created a training dataset with eight variables that help determine suitability for seagrass habitats. Now that you have prepared your data, you'll use the machine learning libraries you downloaded to create a prediction model. First, you'll check the correlation of the variables to make sure a random forest classification is the best option. Random forest is a supervised machine learning method, meaning it requires training: fitting (or supervising) a predictive model with a dataset for which you know the true answer. Then, you'll split the data into two sections, one to train your random forest classifier and the other to test the results it creates. Based on the accuracy of the results, you can apply the model to the global data and save the predictions as a feature class.

Move your spatial data into Python

You'll use the ArcGIS Pro Python console to interact with the spatial training data you created in the previous lesson. First, you'll import the Python libraries that you'll use to build a predictive model and perform machine learning. Then, you'll bring your data into Python by converting it to structures that the libraries can manipulate. In the same way you need to have your data in the correct format, like a shapefile, to be read by ArcGIS, you need to have your data in arrays or data frames that can be read by Python.

  1. If necessary, open your SeagrassPrediction project.
  2. On the ribbon, click the Analysis tab, and in the Geoprocessing group, click Python.

    Open Python console

    The Python console opens within ArcGIS Pro.

  3. In the Python console, paste the following code and press Enter twice to run it.

    Pressing Enter twice after every step will run the code in small pieces.

    from sklearn.ensemble import RandomForestClassifier
    import numpy as NUM
    import arcpy as ARCPY
    import arcpy.da as DA
    import pandas as PD
    import seaborn as SEA
    import matplotlib.pyplot as PLOT
    import arcgisscripting as ARC
    import SSUtilities as UTILS
    import os as OS

    This code imports all the Python libraries you'll need for analysis.

    Loading libraries in the console

  4. Name the feature classes that contain the attributes you'll use in your analysis. USA_Train and EMU_Global_90m_Filled are inconvenient to type, so assign them to the shorter variables inputFC and globalFC, respectively.
    inputFC = r'USA_Train'
    globalFC = r'EMU_Global_90m_Filled'
  5. Define the prediction variables (ocean measurements) and the classification variable (seagrass presence). For each variable, type the attribute names that you want the variable to contain. Finally, create a variable that contains all the attributes you're using. Instead of typing all eight attribute names again, concatenate the previous two lists, or link them together.
    Note:

    Ensure you type the letter O or the digit 0 where appropriate; for example, DISSO2 contains the letter O, while SRTM30 ends with the digit 0.

    predictVars = ['DISSO2', 'NITRATE', 'PHOSPHATE', 'SALINITY', 'SILICATE', 'SRTM30', 'TEMP']
    classVar = ['PRESENT']
    allVars = predictVars + classVar

    These variables will be added to the NumPy array data structure once the feature classes are read into Python. NumPy stands for Numerical Python and is a library for scientific computing. It contains the functions you'll use to break your training data into training and test sets.

  6. Use the arcpy.da function FeatureClassToNumPyArray. For its arguments, type the input table and the field names from the input table that you want to use. Then use the ARCPY.Describe function to read the spatial reference of your training data feature class.

    FeatureClassToNumPyArray brings your feature class from ArcGIS Pro into Python as a NumPy array.

    trainFC = DA.FeatureClassToNumPyArray(inputFC, ["SHAPE@XY"] + allVars)
    spatRef = ARCPY.Describe(inputFC).spatialReference

    The fields argument ["SHAPE@XY"] calls the location coordinates for each point in the USA_Train data (trainFC) and concatenates them with the list allVars. Your array now contains the coordinates of all the points in the training data as well as all the attributes associated with them. The spatial reference preserves the projection of your original data so that you can visualize the results correctly if you export them back to a feature class.
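
    If you want to confirm what the array contains, you can print its field names and point count. This quick check is not part of the original lesson, but it uses only standard NumPy attributes:

    print(trainFC.dtype.names)
    print(trainFC.shape[0])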

  7. Now that your data is in Python arrays, convert it to a pandas data frame. With the first argument, specify which array you're converting, and with the second argument, define the attributes you want to include.
    data = PD.DataFrame(trainFC, columns = allVars)

    A data frame is a data structure, and pandas is a standard library used to create and reference the structure. Now that all your variables are formatted correctly in Python, you can start using them for analysis.
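
    As an optional sanity check (not part of the original steps), you can preview the data frame with standard pandas calls before moving on:

    print(data.head())    # First five rows of the attribute data
    print(data.shape)     # (number of points, number of attributes)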

Choose classification and separate data

The classification scheme is one of the most important parts of creating an accurate prediction model. In statistics, you want to choose a method of processing the numbers that minimizes the possibility of accidental bias. To make sure that the method you're using, random forest classification, is the best option, you'll create a correlation chart for the seven predictor variables. Then, you'll separate the data points you created earlier into training and test sets. The script will use the training dataset to build a model that predicts seagrass occurrence. Then, it will use the test dataset to determine the accuracy of that model.

  1. Use the pandas .astype function to convert the data to 64-bit floating-point values for easier computation. Then use the .corr() function to calculate the correlation between variables.

    Recall that in the last section, you created data, the pandas data frame containing all the attribute data.

    corr = data.astype('float64').corr()

    This function calculates Pearson's correlation coefficient between your predictor attributes. Correlation coefficients are a way of measuring the relationship between variables. All coefficients have a value between -1 and 1, with -1 showing a perfectly negative correlation (as variable A grows, variable B tends to shrink) and 1 showing a perfectly positive correlation (when variable A grows, variable B also tends to grow). A correlation coefficient of 0 shows no relationship at all.
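
    If correlation coefficients are new to you, the following minimal illustration (made-up numbers, not part of the lesson data) shows the two extremes:

    pairs = PD.DataFrame({'A': [1, 2, 3, 4], 'B': [2, 4, 6, 8], 'C': [8, 6, 4, 2]})
    print(pairs.corr())    # A and B correlate at 1.0; A and C at -1.0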

  2. Use the following code to plot the correlation coefficients as a correlation matrix between variables.

    The .heatmap function from the seaborn library defines the type of chart you want to use. The remaining arguments control the appearance of the chart.

    ax = SEA.heatmap(corr, cmap=SEA.diverging_palette(220, 10, as_cmap=True),
    square=True, annot = True, linecolor = 'k', linewidths = 1)
    PLOT.show()

    Correlation chart

    The chart shows the correlation between the variables. High positive correlations are shown in bright red, which is why there is a bright red diagonal line across the chart: each variable is perfectly correlated with itself. Dark blues show high negative correlations.

    Multiple predictors are highly correlated, either positively or negatively, which makes random forest a good method to use. Random forest can handle predictor variables that are dependent on each other in a way that minimizes bias.
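
    If you'd rather list the strongly correlated pairs programmatically than read them off the chart, the following optional sketch works on the corr data frame you already built (the 0.8 cutoff is an arbitrary threshold chosen for illustration):

    mask = NUM.triu(NUM.ones(corr.shape, dtype=bool), k=1)    # Upper triangle, excluding the diagonal
    strongPairs = corr.where(mask).stack()                    # One value per variable pair
    print(strongPairs[strongPairs.abs() > 0.8])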

    Now that you've confirmed that random forest is the best model you can use, you'll break the training data into two portions using random sampling.

  3. Close the chart window to continue the workflow.
  4. Define fracNum, the fraction of the data you want to sample, and then use the .sample function to take a random sample from the training dataset. Use fracNum as the frac parameter.
    fracNum = 0.1
    train_set = data.sample(frac = fracNum)

    You now have a random sample of 10 percent of the training data. The rest will become the test set.
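
    By default, .sample draws a different random subset each time it runs. If you want a reproducible split (an optional tweak, not part of the original lesson), the .sample function also accepts a random_state seed:

    train_set = data.sample(frac = fracNum, random_state = 42)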

  5. Create the dataset test_set by using the .drop function to remove all the points from data that have already been assigned to the training dataset.
    test_set = data.drop(train_set.index)

    The seagrass data for the United States was divided into a training dataset and a test dataset. You specified that 10 percent of the available data should go into the training dataset to build your random forest predictor. The remaining 90 percent of the dataset will be used as a test of how accurately the model predicts seagrass occurrence.

  6. Create the variable indicator and use the .factorize function to make sure the classification values in train_set are read as categorical.
    indicator, _ = PD.factorize(train_set[classVar[0]])

    Otherwise, Python would read the data as continuous, meaning the model could return any value between 0 and 1. For many models that is the norm, but a value like 0.5 or 0.32 wouldn't tell you anything about the presence or absence of seagrass.
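
    To see what .factorize does on its own, you can run it on a small made-up list (an optional illustration, not part of the lesson data):

    codes, uniques = PD.factorize(['no', 'yes', 'yes', 'no'])
    print(codes)      # [0 1 1 0]
    print(uniques)    # ['no' 'yes']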

  7. Use the print command to show the sizes of the training and test datasets. Type a label for each and concatenate it with the string version of the dataset's row count.
    print('Training Data Size = ' + str(train_set.shape[0]))
    print('Test Data Size = ' + str(test_set.shape[0]))

    Print the size of the two datasets

    Note:

    Because you're using a different random sample taken from the data, your results will vary slightly.

    Next, you'll train a random forest classifier to model the relationship between your predictors and seagrass occurrence.

Train your random forest classifier

Now that you have split your data, you'll train your random forest classifier using the training data you have created.

  1. Create the variable rfco to store the results of the RandomForestClassifier command, which creates 500 trees; setting oob_score to True also makes the model compute an out-of-bag accuracy estimate. Then use the .fit method to train the classifier on the training data.
    rfco = RandomForestClassifier(n_estimators = 500, oob_score = True)
    rfco.fit(train_set[predictVars], indicator)
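
    Because the classifier was created with oob_score set to True, scikit-learn computes an out-of-bag accuracy estimate during training. Printing it is an optional check, not part of the original steps:

    print(rfco.oob_score_)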
  2. Run the trained classifier on the test dataset. Create the variable seagrassPred to store the predictions, with a 1 for occurrence and a 0 for no occurrence.

    The test data is the 90 percent of the United States coastal data that was not used to train the model; it will show the accuracy of your predictions.

    seagrassPred = rfco.predict(test_set[predictVars])
  3. Use the results of the classification to check the accuracy of the model by calculating the prediction accuracy and estimation error. Use the .to_numpy function to convert the true values to a NumPy array so they can be compared with the predictions. Because both arrays contain only 0s and 1s, the absolute difference is 1 exactly where a prediction is wrong, so error is the percentage of misclassified test points.
    test_seagrass = test_set[classVar].to_numpy()
    test_seagrass = test_seagrass.flatten()
    error = NUM.sum(NUM.abs(test_seagrass - seagrassPred))/len(seagrassPred) * 100
  4. Print the accuracy metrics of your data to make sure the model's predictions are working correctly.
    print('Accuracy = ' + str(100 - NUM.abs(error)) + ' % ')
    print('Locations with Seagrass = ' + str(len(NUM.where(test_seagrass==1)[0])) )
    print('Predicted Locations with Seagrass = ' + str(len(NUM.where(seagrassPred==1)[0])))

    Print accuracy metrics

    The script prints the number of points used for the training and test data, as well as the accuracy. Approximately 95 times out of 100, the model correctly predicted whether seagrass occurred at a given test location. With such a high accuracy rate, you can now train the model on the entire United States dataset and predict global seagrass locations.
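
    If you also want to see how often the model finds seagrass where it is known to occur (the recall for the presence class), the following optional check uses only the arrays you already created and assumes, as in the steps above, that 1 encodes presence:

    print(NUM.mean(seagrassPred[test_seagrass == 1] == 1))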

  5. Use the .factorize function to create the variable indicatorUSA.
    indicatorUSA, _ = PD.factorize(data[classVar[0]])

    As when you created indicator, the .factorize function encodes the classification variable as categorical.

  6. Redefine the variable rfco as a random forest model trained on all of the United States coastal data. For the argument n_estimators, specify that you want to create 500 trees.
    rfco = RandomForestClassifier(n_estimators = 500)
    rfco.fit(data[predictVars], indicatorUSA)

    Now that the random forest model, rfco, is trained, you'll apply it to the EMU data for the world's coasts. The process for this is similar to the process you used to format the training data correctly.

  7. Read the global EMU data into Python as a NumPy array, and use the ARCPY.Describe function to save the spatial reference of the feature class.
    globalData = DA.FeatureClassToNumPyArray(globalFC, ["SHAPE@XY"] + predictVars)
    spatRefGlobal = ARCPY.Describe(globalFC).spatialReference
  8. Convert the global array to a pandas data frame and run it through the rfco model to get the global predictions.
    globalTrain = PD.DataFrame(globalData, columns = predictVars)
    seagrassPredGlobal = rfco.predict(globalTrain)
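
    Before exporting, you can optionally count how many global locations the model predicts as suitable for seagrass (this check is not part of the original steps):

    print(len(NUM.where(seagrassPredGlobal == 1)[0]))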
  9. Use the NumPyArrayToFeatureClass function to store the prediction array as a feature class. Name the feature class and specify the geodatabase. Provide the input array, the field to use for geometry, and the spatial reference.
    Note:

    Make sure to edit the outputDir location to your Documents folder where you unzipped the project. Replace your_username before running this piece of code. To easily find the correct file path, open the geodatabase in a File Explorer window, and copy the entire path.

    nameFC = 'GlobalPrediction'
    outputDir = r'C:\Users\your_username\Documents\SeagrassPrediction\SeagrassPrediction.gdb'
    # Keep the coordinates of only the points the model predicted as seagrass (value 1)
    grassExists = globalData[["SHAPE@XY"]][globalTrain.index[NUM.where(seagrassPredGlobal==1)]]
    ARCPY.da.NumPyArrayToFeatureClass(grassExists, OS.path.join(outputDir, nameFC), ['SHAPE@XY'], spatRefGlobal)
  10. Close the Python console.
  11. Save the map.

You created a prediction model of whether or not seagrass occurs at a given coastal location around the globe. The array seagrassPredGlobal contains the prediction for each point as either 1 or 0. A value of 1 indicates suitability as a seagrass habitat, and 0 indicates an unsuitable location for seagrass growth. In your pursuit of modeling seagrass habitats, you are interested in the 1 values, the locations where seagrass grows, which are the points you exported to the GlobalPrediction feature class. You are also interested in contiguous patches of locations with a high density of 1 values. In the next lesson, you'll add the prediction results to the map as a feature class and use a statistical analysis tool to show locations with dense seagrass growth.