Use forest-based classification techniques
One method of modeling species distribution uses a common machine learning algorithm, random forest. The Forest-based and Boosted Classification and Regression tool in ArcGIS Pro offers two algorithms to choose from: an adaptation of the random forest algorithm and the Extreme Gradient Boosting (XGBoost) algorithm. In this module, you’ll use the forest-based algorithm, which trains a model based on known values provided as part of a training dataset and can then be used to predict unknown values. You’ll run the tool twice: first to train a preliminary model and assess its accuracy, then to improve the model and generate a raster prediction layer.
Set up the project
First, you’ll download the data needed for species distribution modeling. This has been shared as a project package, which you can download and open in ArcGIS Pro. Data has already been extracted, clipped, and projected (processing is described below). To learn more about how to prepare your own data for species distribution modeling, see the tutorial Prepare data for species distribution modeling.
- Download the tutorial data.
- Double-click the downloaded project package to open it in ArcGIS Pro. If necessary, sign in with a licensed ArcGIS account.
This project contains the data you’ll need for species distribution modeling.
- Observation points for feral hogs (Sus scrofa) are extracted from iNaturalist observations. The Sus_scrofa_California layer contains just these observation points. The Sus_scrofa_California_absence_presence layer also contains pseudo-absence points, or points where feral hogs haven’t been observed, which is a requirement for Forest-based regression modeling.
- Bioclimate data representing 19 environmental variables on temperature and precipitation is extracted from the Bioclimate Baseline 1970-2000 layer. It has been projected to NAD 1983 California (Teale) Albers (Meters) and clipped to the state of California. The clipped layers extend slightly past the state borders to ensure that environmental data can be extracted for observation or pseudo-absence points on or near state borders.
- Elevation and slope data are derived from USGS EROS Archive - Digital Elevation - Global Multi-resolution Terrain Elevation Data 2010, projected to NAD 1983 California (Teale) Albers (Meters), and clipped to the state of California.
- Land cover has been extracted from USA NLCD Land Cover, projected to NAD 1983 California (Teale) Albers (Meters), and clipped to the state of California.
Train a Forest-based and Boosted Classification and Regression model
The Forest-based and Boosted Classification and Regression tool trains a model based on known values provided as part of a training dataset, which can then be used to predict unknown values. The tool can be run in three modes: train only, predict to features, and predict to raster. In this section, you’ll use the training mode to create a preliminary model. When run, the tool creates a series of charts and other outputs that allow you to assess the accuracy of the model and make decisions about how to improve it.
- In the Geoprocessing pane, search for and open the Forest-based and Boosted Classification and Regression tool.
You’ll run the tool twice: the first time to analyze the input data, and the second time to tweak the inputs for a better model. The Train only option allows you to assess the accuracy of the model before generating predictions. It outputs model diagnostics in the messages window and a chart of variable importance.
- For Prediction Type, choose Train only and ensure Model Type is set to Forest-based.
Forest-based models rely on multiple decision trees created from the training data. A decision tree is a flowchart-like diagram that takes known characteristics of an outcome and determines how likely an unknown data point is to match it based on a series of decisions. Each decision tree generates its own prediction and votes on an outcome. The model considers the votes from all the decision trees to predict or classify the outcome of an unknown sample. The other option is a Gradient Boosted model, in which decision trees are created sequentially and each tree corrects the errors of the previous trees.
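The voting step is easy to picture in code. Below is a minimal, hypothetical sketch of majority voting in plain Python; the three toy "trees" stand in for trained decision trees and are not how the tool actually builds them.

```python
from collections import Counter

# Each toy "tree" stands in for a trained decision tree: given a sample's
# environmental attributes, it votes for a class (1 = presence, 0 = absence).
def tree_a(sample): return 1 if sample["elevation"] < 500 else 0
def tree_b(sample): return 1 if sample["precip_seasonality"] > 40 else 0
def tree_c(sample): return 1 if sample["mean_temp"] > 10 else 0

def forest_predict(sample, trees):
    """Collect one vote per tree and return the majority class."""
    votes = Counter(tree(sample) for tree in trees)
    return votes.most_common(1)[0][0]

sample = {"elevation": 320, "precip_seasonality": 35, "mean_temp": 14}
print(forest_predict(sample, [tree_a, tree_b, tree_c]))  # 1: two of three trees vote presence
```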
- For Input Training Features, choose Sus_scrofa_California_absence_presence. For Variable to Predict, choose the Presence field, and check the Treat Variable as Categorical box.
This analysis requires both presence and absence points. In the Presence field, locations where feral swine were observed are labeled with the number 1; all other points are labeled with the number 0. Because true absence is difficult to prove definitively for a mobile species, this layer contains pseudo-absence points: a set of randomly sampled points representing locations where feral pigs weren’t observed.
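To make pseudo-absence concrete, here is a hedged sketch of one common way such points are generated: random locations within the study extent, kept only if they fall at least a minimum distance from every known observation. The extent, coordinates, and distance are invented for illustration; the tutorial layer was prepared for you in advance.

```python
import math
import random

random.seed(0)

# Hypothetical projected study extent (meters) and presence observations.
xmin, ymin, xmax, ymax = 0.0, 0.0, 100_000.0, 100_000.0
presence = [(12_000.0, 45_000.0), (80_500.0, 22_300.0), (55_000.0, 90_000.0)]
min_dist = 5_000.0  # keep pseudo-absence points at least 5 km from observations

def far_enough(pt, observations, d):
    return all(math.hypot(pt[0] - ox, pt[1] - oy) >= d for ox, oy in observations)

pseudo_absence = []
while len(pseudo_absence) < 10:
    candidate = (random.uniform(xmin, xmax), random.uniform(ymin, ymax))
    if far_enough(candidate, presence, min_dist):
        pseudo_absence.append(candidate)  # labeled 0 in the Presence field

print(len(pseudo_absence), "pseudo-absence points generated")
```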
- Check the Include All Prediction Probabilities box.
This parameter will generate an output that shows the probability of all categories in the categorical variable. In this case, it will show the probability of both absence and presence in a given location.
Next, you’ll add the explanatory data. Explanatory variables can come from fields or be calculated from distance features or extracted from rasters. You can use any combination of these explanatory variable types, but the type of input you choose will impact the available outputs. Because you want your ultimate output to be a raster surface showing presence prediction, you’ll use the Explanatory Training Rasters option.
- For Explanatory Training Rasters, click Add Many. Check the boxes to add all 19 Bioclimate variables, CA_Elevation, CA_Slope, and CA_NLCD, and click Add.
- Next to the CA_NLCD variable, check the Categorical box.
The parameters for the model have been set. Now you’ll create outputs from the training run that will help you evaluate and improve the model for prediction.
- Expand the Additional Outputs section. For Output Trained Features, type fbbcr_output_trained.
This output will test the accuracy of the prediction by showing how many points in the input dataset were correctly and incorrectly classified.
- For Output Variable Importance Table, type fbbcr_variable_importance.
The Output Variable Importance Table contains the explanatory variables used in the model and their importance. It will help you assess which of the many explanatory variables you’re using in the initial run of the model are most important to predicting feral swine presence. It also creates a chart showing the distribution of variable importance across runs.
- For Output Classification Performance Table (Confusion Matrix), type fbbcr_class_performance.
This output is only available when the dependent variable is categorical and part of the input data is used for validation. The output table shows the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) in each category based on the validation data.
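The four counts come straight from pairs of observed and predicted classes. A minimal sketch with made-up labels shows how each cell of the confusion matrix is tallied:

```python
observed  = [1, 1, 0, 0, 1, 0, 1, 0]  # hypothetical: 1 = presence, 0 = absence
predicted = [1, 0, 0, 1, 1, 0, 1, 0]

pairs = list(zip(observed, predicted))
tp = sum(o == 1 and p == 1 for o, p in pairs)  # presence correctly found
tn = sum(o == 0 and p == 0 for o, p in pairs)  # absence correctly found
fp = sum(o == 0 and p == 1 for o, p in pairs)  # absence mistaken for presence
fn = sum(o == 1 and p == 0 for o, p in pairs)  # presence missed

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")  # TP=3 TN=3 FP=1 FN=1
```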
- Expand the Advanced Model Options group.
The options in this group, known as hyperparameters, allow you to control the number of decision trees and characteristics of the trees used in the modeling. For example, increasing the number of trees in the forest or boosted model will generally result in more accurate model predictions, but the model will take longer to calculate. Smaller Minimum Leaf Size values can make your model prone to noise in your data. To better understand which of these parameters you may need to adjust, you’ll first run the model with the default parameters. Using the Optimize Parameters setting will help you make these tweaks.
- Check the Optimize Parameters box.
There are several optimization methods to choose from. To keep processing time down, you’ll use the default Random Search (Quick) method and optimize for model accuracy. The Optimize Target (Objective) parameter offers other options that focus on optimizing various metrics of model performance.
- For Number of Runs for Parameter Sets, type 10.
The Number of Runs for Parameter Sets value controls how many candidate sets of hyperparameter values the search will try. With the Random Search (Quick) method, the tool builds one model per candidate set using a random seed; the Random Search (Robust) method instead builds each candidate set 10 times with different random seeds and uses the median performance to represent that set. Either way, the tool evaluates all candidate search points, then selects the set of hyperparameter values with the best model performance. A sketch of this kind of search appears after the note below.
- For Model Parameter Setting, add the following hyperparameters:
- Parameter: Number of Trees
- Lower Bound: 100
- Upper Bound: 500
- Interval: 10
Note:
As you enter the hyperparameters, you might see the error indicator for Error 110535. The error will be resolved when you finish entering the hyperparameters and test criteria.
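To see what the optimization is doing conceptually, the sketch below mimics a random search over the Number of Trees space you just defined. The scoring function is a fake stand-in for a real training-and-validation run, so the numbers are illustrative only.

```python
import random

random.seed(1)

# Candidate space matching the hyperparameter entered above:
# Number of Trees from 100 to 500 in steps of 10.
search_space = list(range(100, 501, 10))

def train_and_score(n_trees):
    """Fake stand-in for training a forest and returning validation accuracy.
    The invented curve favors mid-sized forests so there is something to find."""
    return 0.8 + 0.1 * (1 - abs(n_trees - 300) / 200) + random.uniform(-0.01, 0.01)

# Random search samples a subset of candidates instead of trying them all.
candidates = random.sample(search_space, 10)
best = max(candidates, key=train_and_score)
print(f"Best Number of Trees found: {best}")
```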
- Expand the Validation Options section. Set Number of Runs for Validation to 25.
The more runs you allow the tool, the more confidence you can have in the model. With each validation run, the tool holds out a different 10 percent of the data to test the model. The tool’s diagnostics will allow you to compare the accuracy scores of the training runs against the validation runs. You’ll also be able to get a better sense of the importance of each variable to the overall prediction.
- For Output Validation Table, type fbbcr_out_validation.
This table comes with a chart that shows the distribution of the accuracy scores. The chart helps evaluate how stable the model is or whether it needs improvement.
- Click Run.
When the tool finishes running, the fbbcr_output_trained layer is added to the map.
The output tables you created are added to the Contents pane under Standalone Tables.
Note:
The Forest-based model by default samples a different random sample of training data each run, so if you run the tool multiple times, you may get different results.
Interpret and improve the random forest model
Now that you’ve run the tool once, you'll use the tool diagnostics, charts, and training outputs to assess how well the model can predict feral swine presence. There are two areas you should assess to decide what parameters to improve: model performance and explanatory data relevance. Tool diagnostics provide a series of statistics, such as Model Out of Bag Errors and Classification Diagnostics, that help you assess whether the parameters or hyperparameters should be updated. The Top Variable Importance table also reports on the explanatory variables with the greatest impact on prediction, allowing you to remove excess data.
When training a model, it is best practice to run it multiple times, testing different parameters for improvement. This tutorial is constrained to two runs of the tool for the sake of time. For additional analysis of outputs, reference the documentation article How Forest-based and Boosted Classification and Regression works.
Note:
Your results may vary from all the examples given in this section. This variation is due to the random sampling performed by the tool.
- When the Forest-based and Boosted Classification and Regression tool is finished running, at the bottom of the Geoprocessing pane, click View Details.
Note:
If you closed the Geoprocessing pane, you can also access the Details from the Geoprocessing History. On the ribbon, click the Analysis tab. In the Geoprocessing group, click History. In the History pane, right-click the Forest-based and Boosted Classification and Regression tool and choose View Details.
The details for the tool contain both a record of the parameters used and messages that will help you interpret the results.
- If necessary, in the Details window, click the Messages tab.
The first table shows the Model Characteristics, or the hyperparameters used to specify the forest-based model. Because you allowed for optimization of the parameters, the model was likely run with more trees than the default 100. The exact number that your model used will vary depending on the random samples it took.
Note:
Warnings for the tool show that there were issues reading some of the input features. Due to the resolution and extent of the input rasters, which were clipped to the state of California to reduce both processing time and file size, information for some of the observation points close to the coast couldn’t be extracted from the rasters to the points.
- Scroll down to the Model Out of Bag Errors table.
Model Out of Bag (OOB) Errors help you evaluate the accuracy of the model. MSE (mean squared error) is based on the ability of the model to accurately predict the Variable to Predict value. These errors are calculated for half the number of trees used and for the total number of trees used. If the errors and the percentage of variation explained are similar for both numbers of trees, you likely don’t need to increase the number of trees used. Because the variable to predict is categorical, OOB errors are calculated as the percentage of incorrect classifications for each category, where each point is classified using only the trees that did not see that point during training.
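If you want to experiment with OOB error outside ArcGIS Pro, scikit-learn's random forest exposes the same idea. This is an analogous illustration rather than the tool itself; the synthetic dataset stands in for the presence/absence training points, and the two tree counts mirror the half-versus-full comparison in the diagnostics.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for presence/absence points with 12 environmental predictors.
X, y = make_classification(n_samples=1000, n_features=12, random_state=42)

for n_trees in (50, 100):  # half and full forest sizes
    clf = RandomForestClassifier(
        n_estimators=n_trees, oob_score=True, random_state=42
    ).fit(X, y)
    # oob_score_ is accuracy on the points each tree never saw during training.
    print(f"{n_trees} trees: OOB error = {1 - clf.oob_score_:.3f}")
```

If the two errors are similar, adding trees is unlikely to help, which is the same reading the tool's table supports.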
- Note the most important variables in the Top Variable Importance table.
Because you used so many explanatory variables, the importance of each will be relatively low, but the table is still a useful way to see which variables may have the most influence on feral swine presence. You’ll use the results of this table, as well as the Summary of Variable Importance table created with the fbbcr_variable_importance output, to reduce the number of variables you use in the next run of the tool.
- Compare the scores in the Training Data: Classification Diagnostics table to those in the Validation Data: Classification Diagnostics table.
The Training Data: Classification Diagnostics table reports how well the model performed on the training data, and the Validation Data: Classification Diagnostics table reports how well it performed on data it hadn’t seen. If the model performs well on the training data but poorly on the validation data, that indicates possible overfitting. Generally, the closer the F1-Score and MCC are to 1, the better the model.
- In the Validation Data: Classification Diagnostics table, compare the Sensitivity and Accuracy values.
The statistics reported in this table are measures of model performance. Sensitivity is the percentage of times features with an observed category were correctly predicted for that category, and accuracy is the proportion of all observations that were classified correctly for that category. Both of these values are close to 1, which means the model accurately classified most of the points during validation runs. You can see the Sensitivity information in graphical format by opening the Validation Performance chart created with the fbbcr_class_performance table.
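All four diagnostics derive from the same confusion-matrix counts. A short sketch of the formulas, reusing the hypothetical counts from the earlier confusion-matrix example:

```python
import math

tp, tn, fp, fn = 3, 3, 1, 1  # hypothetical confusion-matrix counts

sensitivity = tp / (tp + fn)                # share of real presences found
accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all points classified correctly
f1 = 2 * tp / (2 * tp + fp + fn)            # harmonic mean of precision and recall
mcc = (tp * tn - fp * fn) / math.sqrt(      # Matthews correlation coefficient
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

print(f"Sensitivity={sensitivity:.2f} Accuracy={accuracy:.2f} F1={f1:.2f} MCC={mcc:.2f}")
```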
- Close the Details window. In the Contents pane, under the fbbcr_output_trained layer, right-click the Prediction Performance chart and choose Open.
The Prediction Performance chart opens. Each bar represents the predicted category, and the color of the sub bars reflects the actual category. This chart can be used to show both how often the model correctly predicted the variable of interest and what points gave it trouble. Because you ran the model with the Include All Prediction Probabilities parameter checked, each point in this layer also includes the probability of feral swine absence or presence.
While this chart shows how well the model performs on input training features, the Validation Accuracy chart created with the fbbcr_out_validation table shows how well the model performs on the validation data.
- In the Prediction Performance chart, in the 0 bar, click the smaller sub bar showing points that represent Presence but were misclassified as Absence points.
The points that were misclassified as Absence points are selected on the map. They’re scattered throughout the state.
- On the map, click one of the misclassified points. In the pop-up, scroll down to the Probability attributes.
For the selected point shown, based on its environmental attributes, the probability of absence is 57 percent and the probability of presence is 42 percent.
- In the Contents pane, under Standalone Tables, for the fbbcr_variable_importance table, double-click the Distribution of Variable Importance chart.
Because you ran the model 25 times for validation, each against a different subset of the input data, the importance of variables varies slightly. While there is variation in the importance of the variables, there is fairly high importance among the first 12: BIO15_Precipitation_Seasonality, BIO11_Mean_Temperature_of_Coldest_Quarter, CA_Elevation, BIO3_Isothermality, CA_NLCD, BIO18_Precipitation_of_Warmest_Quarter, BIO6_Min_Temperature_of_Coldest_Month, BIO8_Mean_Temperature_of_Wettest_Quarter, CA_Slope, BIO1_Annual_Mean_Temperature, BIO14_Precipitation_of_Driest_Month, and BIO12_Annual_Precipitation.
You’ll rerun the tool, focusing on these 12 explanatory variables. Removing less important explanatory variables will help you reduce the possibility of overfitting the model.
- In the Geoprocessing pane, in the Forest-based and Boosted Classification and Regression tool, change Prediction Type to Predict to Raster.
- For Explanatory Training Rasters, remove all the rasters except the following 12: Bioclimate 1, 3, 6, 8, 11, 12, 14, 15, 18, CA_Elevation, CA_NLCD, and CA_Slope.
- For Output Prediction Surface, type fbbcr_feral_swine_prediction.
- For all the outputs you created in the Additional Outputs, Advanced Model Options, and Validation Options categories, add the suffix _top12 to the end of the output name.
This will re-create each output for the prediction surface, allowing you to compare the two models to ensure you’re improving prediction.
- Click Run.
- In the Contents pane, uncheck the fbbcr_output_trained layer to turn it off. Close any tables and charts you opened while evaluating the first run of the model.
- Use what you’ve learned about the model diagnostics and output tables to evaluate the new model.
The overall statistics evaluating this model, including MSE, F1-Score, and MCC, should have improved. Unlike the first model, this run tended to err on the side of incorrectly predicting presence rather than absence. In the case of feral swine, that’s probably beneficial, as swine populations are adaptable and can survive in a range of conditions.
- In the Contents pane, uncheck the fbbcr_output_trained_top12 layer to turn it off.
The fbbcr_feral_swine_prediction layer is a raster showing where in the state pig presence is likely based on environmental characteristics.
In this section, you ran the Forest-based and Boosted Classification and Regression tool twice to train a preliminary model and assess its accuracy before generating a raster prediction layer. Realistically, this process can take more than two iterations to achieve desired results. Next, you'll use a maximum entropy algorithm to perform similar modeling and compare the results.
Use MaxEnt techniques
Another method for species distribution modeling available in ArcGIS Pro is Presence-only Prediction (MaxEnt), which uses a maximum entropy algorithm to model the presence of a phenomenon given known presence locations and explanatory variables. As with the Forest-based model, Presence-only Prediction can be run several times to evaluate and improve the model, and it generates a prediction surface for species occurrence. Unlike the Forest-based model, it doesn’t require a dataset that contains both presence and absence points (or presence and pseudo-absence points, in many cases), so the raster surface shows the probability that a species may be found in an area rather than a binary presence or absence classification.
Train a Presence-only prediction model
In this section, you’ll use the Presence-only Prediction tool in its training capacity to produce a preliminary model. Since you determined the most important explanatory variables using the forest-based classification, you’ll use them as explanatory variables in this tool as well.
- In the Geoprocessing pane, search for and open the Presence-only Prediction (MaxEnt) tool.
Unlike many regression techniques, including the Forest-based and Boosted Classification and Regression tool, Presence-only Prediction doesn’t require you to provide background or pseudo-absence points. As in the Forest-based tool, specific types of input features will generate different outputs. In this case, because you want to generate another raster surface, you’ll use only observation points.
- For Input Point Features, choose the Sus_scrofa_California layer.
- For Explanatory Training Rasters, click Add Many. Check the boxes to add the same variables as in the last run of the Forest-based tool: Bioclimate 1, 3, 6, 8, 11, 12, 14, 15, 18, CA_Elevation, CA_NLCD, and CA_Slope. Click Add.
While you can run this tool with all 19 of the bioclimate variables, it’s good practice to use tools such as Random Forest to understand variable importance to the model. When building models, it’s important to find a balance between simplifying models to reduce overfitting, and creating robust enough models for accurate prediction.
- Next to the CA_NLCD variable, check the Categorical box.
Next, you’ll choose variable expansions. Different expansions can help tease out relationships between variables. Expansion wasn’t needed in the Forest-based model because the algorithm handles nonlinear relationships between dependent and explanatory variables automatically. You can select multiple basis functions in one run of the tool using the Explanatory Variable Expansions (Basis Functions) parameter, and all transformed versions of the explanatory variables are then used in the model. The best performing variables are selected by regularization, a method of variable selection that balances trade-offs between model fit and model complexity.
- For Explanatory Variable Expansions (Basis Functions), check the boxes to select Original (Linear), Squared (Quadratic), and Pairwise interaction (Product).
The Original (Linear) function is the only one that will work for categorical data, such as land cover. The squared function, which creates a quadratic relationship, tends to model species’ relationships with environmental factors a bit better, as there are specific ranges within each variable that form the species’ ideal habitat. For example, species that thrive in areas with moderate precipitation aren’t suited to desert conditions or rainforests; the relationship is parabolic. The likelihood of habitat suitability rises as precipitation rates do, then falls again as precipitation rates climb past a certain point. The pairwise function is also well suited to modeling environmental conditions, as it can represent interactions between variables.
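Conceptually, each expansion just adds transformed copies of the raw variables before the model is fit, and regularization later keeps only the useful columns. A hedged numpy sketch of the three expansions you selected, using two invented variables:

```python
import numpy as np

# Two hypothetical raw explanatory values per location: precipitation, elevation.
X = np.array([[450.0, 120.0],
              [800.0, 950.0]])

linear = X                              # Original (Linear): variables as-is
quadratic = X ** 2                      # Squared (Quadratic): allows peaked responses
product = (X[:, 0] * X[:, 1])[:, None]  # Pairwise interaction (Product): var1 * var2

expanded = np.hstack([linear, quadratic, product])
print(expanded.shape)  # (2, 5): five candidate columns per location
```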
- For Study Area, choose Polygon study area and select the California state boundary layer as the Study Area Polygon.
- Check the Apply Spatial Thinning parameter.
Spatial thinning is applied to both observation and background points as a way of reducing potential sampling bias. Because the feral pig observation data was collected by people using the iNaturalist app, it may be biased both toward areas where people are and toward areas where iNaturalist users recognize and report various species. Spatial thinning can reduce the effects of bias by removing points that are close together, which may represent multiple sightings of the same animal, clusters in a protected area such as a national park where human-animal interactions are more likely to take place, and so on.
- For Minimum Nearest Neighbor Distance, choose 1 kilometer as the distance.
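One simple way to picture thinning is as a greedy filter: a point is kept only if it lies at least the minimum distance from every point already kept. This planar sketch uses invented coordinates and is not the tool's internal algorithm:

```python
import math

points = [(0, 0), (400, 300), (2_000, 0), (2_300, 150), (5_000, 5_000)]  # meters
min_dist = 1_000.0  # Minimum Nearest Neighbor Distance of 1 kilometer

thinned = []
for pt in points:
    # Keep the point only if no already-kept point is within 1 km of it.
    if all(math.dist(pt, kept) >= min_dist for kept in thinned):
        thinned.append(pt)

print(thinned)  # [(0, 0), (2000, 0), (5000, 5000)]
```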
The next parameters are hyperparameters for the model.
- If necessary, expand Advanced Model Options. For Relative Weight of Presence to Background, type 1.
- For Presence Probability Transformation (Link Function), choose Logistic.
Of the two available Presence Probability Transformation functions, Logistic is the better option when presence isn’t absolute. For example, because the swine likely aren’t staying in the spot where they were observed but are roaming to find food and shelter, the Logistic function is appropriate. Because you’ve chosen the Logistic function, the Relative Weight of Presence to Background parameter should be lower; in this case, you’re weighting the presence and background points equally.
You’ll also accept the Presence Probability Cutoff value of 0.5 for now—diagnostics from the first run of this tool will help you determine whether a different cutoff value is needed to improve future runs.
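The general shape of a logistic link is easy to demonstrate. The sketch below applies a generic logistic curve and the 0.5 cutoff to invented suitability scores; the tool's actual transformation is more involved, so treat this purely as intuition for how raw scores become probabilities and then presence calls:

```python
import math

def logistic(raw_score):
    """Squash an unbounded suitability score into a 0-1 probability."""
    return 1.0 / (1.0 + math.exp(-raw_score))

cutoff = 0.5  # Presence Probability Cutoff
for raw in (-2.0, 0.1, 1.5):  # invented raw suitability scores
    p = logistic(raw)
    label = "presence" if p >= cutoff else "background"
    print(f"raw={raw:+.1f} -> p={p:.2f} -> {label}")
```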
Now you can choose what diagnostics and charts you want the tool to output. The tool organizes outputs into training and prediction outputs. The main distinction is that training outputs correspond to the data that was used in the training and selection of the model, and prediction outputs correspond to data that the model has not yet been exposed to.
- Expand the Training Output group. For Output Trained Features, type pop_output_trained.
The result of this output will be a feature class containing the points used in training the model, along with three charts for additional interpretation. The output symbolizes the input presence points and any background points that were created based on a comparison between the model’s classification and the observed classification, which provides a visual way to analyze the model’s predictions.
For now, you’ll skip the output trained raster. Once you’ve run the initial model and know how well it performs on the input point features, you’ll create the raster surface. For the first run, you’ll create a Response Curve Table to show the impact of each input raster on the prediction, and a Sensitivity Table, which will help you determine a good value for the Presence Probability Cutoff parameter.
- For Output Response Curve Table, type pop_response_curve, and for Output Sensitivity Table, type pop_sensitivity.
- Expand the Validation Options group. For Resampling Scheme, choose Random and set the Number of Groups parameter to 5.
The Resampling Scheme parameter allows the tool to do cross validation to evaluate the stability of the model. The points will be randomly divided into five groups, and each group will be left out once when performing cross validation.
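The grouping can be sketched in a few lines. This illustration of random five-group assignment is not the tool's internal code, but it shows why every point is used for validation exactly once:

```python
import random

random.seed(0)
n_points, n_groups = 20, 5

# Assign each point a random group label from 0 to 4.
groups = [random.randrange(n_groups) for _ in range(n_points)]

for g in range(n_groups):
    train = [i for i in range(n_points) if groups[i] != g]
    validate = [i for i in range(n_points) if groups[i] == g]
    print(f"Fold {g}: train on {len(train)} points, validate on {len(validate)}")
```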
- Click Run.
When the tool is finished running, the output layer and tables are added to the Contents pane. The pop_output_trained layer is added to the map.
Interpret and improve the Presence-only prediction model
Now that you’ve run the tool once, you'll use the tool diagnostics, charts, and training outputs to assess how well the model can predict feral swine presence. The tool diagnostics help you assess the accuracy of the model, reporting on the number of presence and background points that were correctly classified. While all the statistics and outputs of the initial training run can help you improve aspects of your model, in this section you’ll focus on the Area Under Curve and Omission statistics, which will help you decide on an appropriate Presence Probability Cutoff parameter for your next run of the tool.
Note:
When training a model, it is best practice to run it multiple times, testing different parameters for improvement. This tutorial is constrained to two runs of the tool for the sake of time. For additional analysis of outputs, reference the documentation article How Presence-only Prediction (MaxEnt) works.
- In the Contents pane, uncheck all layers other than pop_output_trained, the California boundary layer, and the basemap to turn them off.
- At the bottom of the Geoprocessing pane, click View Details to open the tool diagnostics.
There are a few warnings shown for this tool. As before, some points close to the state borders may not have had raster information available. No background points were thinned, which is not necessarily a concern considering how large your study area is. Finally, one of the categories in the Land Cover dataset (the perennial ice and snow category) had fewer than eight data points. You can explore this issue more using the Explanatory Variable Category Diagnostics table.
The first table to review is the Count of Presence and Background Points table, which shows the accuracy of the model.
- In the Count of Presence and Background Points table, review the Number of Presence Points row to find how many points were used in training the model and how many were correctly classified as presence.
The closer the numbers in these two columns are, the better the model is performing. You also want to evaluate the Number of Background Points row. Since you set the Relative Weight of Presence to Background parameter to 1, this number should be relatively low.
The Model Characteristics table records the model parameters that were used.
- In the Model Summary table, evaluate the AUC value.
The AUC, or area under curve, statistic describes how good the model is at estimating known presence locations as presence and known background locations as background. The closer this value is to 1, the better the model is performing. The AUC statistic is used in conjunction with the Omission Rate, which shows what percentage of presence points were incorrectly classified as having a low chance of presence. You’ll evaluate both of these statistics further using charts created with the pop_sensitivity table.
- Scroll down to the Regression Coefficients table.
This table reports the variables ultimately used in the model. Most are prefixed with the word product, showing that many of the variables used were transformed using the Pairwise interaction (Product) expansion.
The final two tables show the range of values represented in the sampled data. In the final table, you can review the NLCD data and see which category was undersampled, causing the warning you saw above.
- In the Explanatory Variable Category Diagnostics table, find the category that has fewer than 8 sampled values.
Category 12, in this example, has four sample points. According to the item details for the NLCD layer, Category 12 represents Perennial Ice and Snow cover, of which there is relatively little in California. Because the number of samples roughly corresponds to the real-world presence of this particular type of land cover, you don’t need to worry about this sample size.
Next, you’ll look at the trained features and tables you created to assess your model. The pop_output_trained layer shows all the points used in the model. Presence points are shown as having either been correctly or incorrectly classified by the model’s prediction. Background points are classified as either potentially being presence points or remaining background points.
- Close the details window.
- In the Contents pane, under the pop_output_trained layer, double-click the Classification Result Percentages chart.
The chart shows a comparison of the observed and predicted classifications. You’ll start by analyzing the percentage of presence points that were correctly classified by the model.
- In the Chart pane, in the Presence column, point to the Presence – Correctly Classified sub bar to show a numeric summary of the data.
In the example image, 65.68 percent of presence points were correctly classified. That’s fairly good model performance, but it can still be improved.
One of the ways to improve this model is to revisit the Presence Probability Cutoff parameter. You’ll use the Omission Rates and ROC Plot charts to find a better value for this parameter.
- Close the Classification Result Percentages chart.
- In the Contents pane, under Standalone Tables, for the pop_sensitivity table, double-click the Omission Rates and ROC Plot charts to open them.
- Click and drag the ROC Plot chart so that you can see both it and the Omission Rates chart at the same time.
- On the Omission Rates chart, select the default presence probability cutoff value of 0.5 and note the resulting sensitivity on the ROC plot's y-axis.
In the example image, a probability cutoff of 0.5 results in an omission rate of 0.343 and a corresponding sensitivity of 0.657. The omission rate is the percentage of known presence points that were misclassified as nonpresence by the model; sensitivity is 1 minus the omission rate.
Used together, the Omission Rates and ROC Plot charts visualize how different Presence Probability Cutoff parameter values result in different rates of incorrectly classified presence points. While it is generally good to have an omission rate close to 0, lowering the cutoff value will also increase the number of background points classified as presence points, which can decrease the model’s specificity. Because feral swine are adaptable scavengers, in this case it’s beneficial to identify more areas where they might be able to survive, so you’ll choose a balance between specificity and sensitivity that favors showing more presence points.
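The relationship the two charts visualize can also be computed directly: for a given cutoff, the omission rate is the share of known presence points whose predicted probability falls below it, and sensitivity is 1 minus that. A sketch with invented probabilities:

```python
# Hypothetical predicted presence probabilities for known presence points.
presence_probs = [0.15, 0.22, 0.31, 0.38, 0.47, 0.55, 0.63, 0.72, 0.86, 0.94]

for cutoff in (0.5, 0.35, 0.24):
    omission_rate = sum(p < cutoff for p in presence_probs) / len(presence_probs)
    sensitivity = 1 - omission_rate
    print(f"cutoff={cutoff:.2f}: omission={omission_rate:.2f}, sensitivity={sensitivity:.2f}")
```

Lowering the cutoff drives omission down and sensitivity up, at the cost of admitting more background points as predicted presence.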
- On the ROC Plot chart, click one of the points with a value of around 0.9 on the y-axis.
In the example model, a sensitivity of 0.9 corresponds to an omission rate of 0.098. To get this result, you’ll rerun the tool using a Presence Probability Cutoff value of 0.24.
- In the Geoprocessing pane, for Presence Probability Cutoff, type 0.24.
- For all the outputs you created in the Training Outputs group, add the suffix _ppc to the end of the output name.
You’ll also generate an output prediction raster.
- For Output Trained Raster, type pop_trained_raster_ppc and click Run.
- In the Contents pane, turn off all layers except the pop_trained_raster_ppc layer, the California boundary layer, and the basemap.
- Use what you’ve learned about the model diagnostics and output tables to evaluate the new model.
As with the forest-based analysis you completed earlier, this modeling approach often requires more than two iterations. Using your understanding of the parameters and hyperparameters, you can continue making changes and comparing the accuracy of the outputs until you find the best combination for your data and situation.
Compare Random Forest and MaxEnt
Both analyses used in this tutorial can be used to model species distribution. Depending on your goals for the analysis, the data you have available, and other factors, you may choose to use one or both of these methods for your own modeling. As with all statistical and analytical methods, Forest-based classification and MaxEnt have strengths and weaknesses to consider. In this section, you’ll compare the output prediction surfaces you’ve produced and review some of the benefits of both modeling approaches.
- In the Contents pane, turn the fbbcr_feral_swine_prediction layer on.
- Click the pop_trained_raster_ppc layer to select it.
- On the ribbon, click the Raster Layer tab. In the Compare group, click the Swipe button.
- On the map, click and drag the cursor back and forth to compare the two raster prediction surfaces.
The prediction surfaces are similar, which is a good sign for the accuracy of the models.
When using spatial statistics methods for prediction, there are some strengths and limitations of each method you should consider to ensure you’re selecting the best method for the goal of your analysis and the data you have available.
Forest-based Classification and Regression

| Strengths of the approach | Other considerations |
|---|---|
| Can capture an unknown or more complex relationship between dependent and explanatory variables. | Requires both presence and absence (or pseudo-absence) points. |
| Relationships don't need to be specified as they do for Presence-only Prediction. | Although variable importance helps you understand the contribution of each explanatory variable to the model, it can be difficult to interpret; for example, you don't know whether a relationship is positive or negative. |
Presence-only Prediction

| Strengths of the approach | Other considerations |
|---|---|
| Designed for presence-only modeling, so you don't need to prepare absence points. | You need to assume the relationship between the dependent variable and explanatory variables. |
| Provides more flexibility in deciding how to weight the background points using the Relative Weight of Presence to Background parameter. | |
| The Apply Spatial Thinning parameter can be used to control background points. | |
| The output raster surface provides more detail on the probability of swine habitat rather than a binary decision on presence or absence. | |
In this tutorial, you used two analysis techniques to perform species distribution modeling for feral hogs in California. As an invasive species, feral hogs pose a threat to ecosystems and agriculture in the state. These modeling techniques can be used for a broad range of other species and phenomena.