Prepare the data
To make informed conservation decisions for your forest, you will identify regions that share common bioclimatic characteristics.
Explore the project
You will start by downloading the tutorial data.
- Download the IdentifyRegions.ppkx project package and locate the downloaded file on your computer.
Note:
Most web browsers download files to your computer's Downloads folder by default.
- Double-click the IdentifyRegions.ppkx project package.
Project packages are a way of sharing ArcGIS Pro projects and data. They are compressed files that, when opened, extract a copy of the project to your C:\Users\[user name]\Documents\ArcGIS\Packages folder.
The project package extracts, and the project opens to a map showing a basemap and the El Yunque National Forest outline.
El Yunque National Forest is located in the eastern part of Puerto Rico.
Download bioclimate data
The Climatologies at high resolution for the earth’s land surface areas (CHELSA) bioclimate variables dataset is a high-resolution global climate dataset that is useful for modeling species distribution. You will download a raster of one variable that may be useful in your analysis.
- In a browser, go to the CHELSA Bioclim page.
- Read the description of the Bioclim dataset.
Note:
When you are investigating a data layer that might be useful to you, it is important to study its description, which provides important context information about the data, including who compiled it, how it was collected, what it is intended to be used for, and whether there are use constraints on the data or who can use it. The CHELSA data is made available on the condition that you cite the original publication and the dataset.
- Scroll down to the layer description table.
This table provides information about each of the layers that can help you determine which layers to use.
The bio1 layer contains mean annual air temperature data. This could be useful for identifying regions within your forest.
There are 50 layers documented in this table. The 19 layers prefixed with bio are of interest for this tutorial. Depending on the details of your conservation work, you will choose different combinations of layers.
- Scroll up, to above the table, and click the THE LAYERS CAN BE DOWNLOADED HERE download link.
The S3 File Browser page opens. It shows four folders. The first folder is historical data for the period 1981-2010. The other three folders contain projected data produced using different climate models.
- On the S3 File Browser page, click the expand button next to the 1981-2010/ folder.
The folder expands.
- Click the expand button next to the bio/ folder.
- Look for the file with _bio1_ in the file name, and click the download button for the CHELSA_bio1_1981-2010.V.2.1.tif file.
The CHELSA_bio1_1981-2010.V.2.1.tif file downloads to your computer. The file is 110 MB.
According to the table you reviewed earlier, this file is a raster dataset of the mean annual air temperature for the 1981-2010 period.
You will clip and sample this file to get data for your forest.
For this tutorial, you do not need to download any of the other bioclimate data files. They have been downloaded and clipped, and are stored in the project for your use later.
Add the data to your project
Now that you've downloaded the air temperature data, you will add it to your project.
- In ArcGIS Pro, on the ribbon, click the Map tab and in the Layer section, click Add Data.
- In the Add Data window, expand Computer, expand This PC, and click Downloads.
If your browser is set to download to another location, locate the file in the Microsoft File Explorer and browse to the location in the Add Data window.
- Click the CHELSA_bio1_1981-2010.V.2.1.tif raster dataset and click OK.
A message appears asking if you would like to calculate statistics for the raster.
- Click Yes.
The CHELSA_bio1_1981-2010_V.2.1.tif layer appears on the map.
The layer has a global extent, but this part appears mostly white. It covers the basemap that was visible before. You can see the forest boundary polygon and some light gray pixels within the forest.
If you look at the layer Value range, you can see that cells in this raster have values between 2196 and 3079, with high values drawn in lighter shades of gray.
If you check the table describing the data, you see that the data has a scale factor and an offset.
Converting based on this information (multiply by the scale value of 0.1 and subtract 273.15), the values of the raster translate to degrees Celsius, ranging from -53.55 to 34.75.
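The conversion described above can be checked with a few lines of Python. This is a minimal sketch that assumes the scale (0.1) and offset (-273.15) quoted in the text. The names SCALE, OFFSET, and to_celsius are illustrative and are not part of ArcGIS Pro or the CHELSA dataset.

```python
# Scale and offset as quoted in the tutorial text (taken from the CHELSA layer table).
SCALE = 0.1
OFFSET = -273.15


def to_celsius(stored_value: float) -> float:
    """Convert a stored raster cell value to degrees Celsius."""
    return stored_value * SCALE + OFFSET


# The minimum and maximum cell values reported for the global raster.
for stored in (2196, 3079):
    print(stored, "->", round(to_celsius(stored), 2))
# 2196 -> -53.55
# 3079 -> 34.75
```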
- In the Contents pane, right-click the CHELSA_bio1_1981-2010.V.2.1.tif layer and click Zoom To Layer.
The map shows the full extent of the layer.
You can see that the equatorial regions are light colored, with high temperature values, and the poles are darker, with low temperature values. The park boundary is visible as a small dark patch.
It is good to explore data that you download so you understand whether it will be suitable for your analysis.
- In the Contents pane, right-click the boundary layer and click Zoom To Layer.
You've downloaded and explored the CHELSA_bio1_1981-2010.V.2.1.tif layer. It looks like it will be useful for your analysis.
Clip the raster
You don't need the whole extent of the data for your work in the forest, so next you'll clip the data using the forest boundary.
- On the ribbon, on the Analysis tab, in the Geoprocessing section, click Tools.
The Geoprocessing pane appears.
- In the Geoprocessing pane, in the search box, type clip raster and in the tool search results, click the Clip Raster tool.
- On the Clip Raster tool, for Input Raster, choose the CHELSA_bio1_1981-2010.V.2.1.tif layer.
- For Output Extent, choose the boundary layer.
The tool will clip the input raster to the full extent of the boundary feature layer, unless you specify that it should use the features in the layer for clipping.
- Check the Use Input Features for Clipping Geometry box.
This will clip the raster to the forest boundary.
The Output Raster Dataset name defaults to the first part of the original raster name with _Clip appended to the end. The raster will be saved in your project geodatabase.
- Click Run.
The raster is clipped and the CHELSA_bio1_19812010_V_Clip layer is added to the map.
Because the clipped raster contains a different range of values from the original, the symbology of the layer is different, with the color ramp showing the range of values in the forest more clearly.
You no longer need the original raster with the full global extent, so you'll remove it now.
- In the Contents pane, right-click the CHELSA_bio1_1981-2010.V.2.1.tif layer and click Remove.
Note:
To save time and reduce the tutorial data download size, the rasters for the other bioclimate variables have been clipped for you. If you had to do this yourself, you could open the Clip Raster tool in batch mode by searching for the tool, right-clicking it, and clicking Batch. On the batch tool set up pane, the batch parameter for the batch tool would be the Input Raster. You would click Next, and the Batch Clip Raster tool would open. You could then add all the input rasters for the Batch Input Raster, and configure the remainder of the tool settings the way you did for the Clip Raster tool. Running the tool would clip each raster to the forest extent.
Generate sample points
The next step is to create a set of points from the raster. This will allow you to combine data from multiple rasters in a way that supports visualization and multivariate cluster analysis. You will use the Raster to Point tool to create a set of points at the cell centers, with the raster values as an attribute.
- On the Clip Raster tool, click the Back button.
- In the Geoprocessing pane, in the search box, type raster to point and in the tool search results, click the Raster to Point tool.
- On the Raster to Point tool, for Input raster, choose the CHELSA_bio1_19812010_V_Clip layer.
- For Field, accept the default value of Value.
- For Output point features, type bioclimate_points.
- Click Run.
The bioclimate_points layer is added to the map.
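To illustrate what a raster-to-point conversion does conceptually, the following sketch places one point at the center of each cell that has a value. It is not the Raster to Point tool itself. The function name, the toy grid, and the coordinates are made up for illustration, and None stands in for NoData cells outside the clipped boundary. The pointid and grid_code names mirror the fields you will see on the output layer.

```python
from typing import List, Optional, Tuple

# (pointid, x, y, grid_code)
Point = Tuple[int, float, float, int]


def raster_to_points(
    grid: List[List[Optional[int]]],
    x_min: float,
    y_max: float,
    cell_size: float,
) -> List[Point]:
    """Create one point per data cell, located at the cell center.

    Row 0 is the top (northernmost) row of the grid.
    """
    points: List[Point] = []
    pointid = 1
    for row_index, row in enumerate(grid):
        for col_index, value in enumerate(row):
            if value is None:  # skip NoData cells
                continue
            x = x_min + (col_index + 0.5) * cell_size
            y = y_max - (row_index + 0.5) * cell_size
            points.append((pointid, x, y, value))
            pointid += 1
    return points


# Hypothetical 2 x 2 clipped raster with one NoData cell.
toy_grid = [
    [None, 2980],
    [2975, 2961],
]
for point in raster_to_points(toy_grid, x_min=0.0, y_max=2.0, cell_size=1.0):
    print(point)
```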
You will use this layer to sample the bioclimate rasters. These have been prepared for you and are in the Sample rasters map. You'll copy the points you just made, then switch to the map with the rasters.
- In the Contents pane, right-click the bioclimate_points layer and click Copy.
- In the map pane, click the Sample rasters map tab.
- In the Contents pane, right-click Sample rasters and click Paste.
This new map has a BioclimateRasters group layer that contains a clipped version of each of the 19 CHELSA bioclimate rasters.
The group layer is configured with options that allow you to switch between individual rasters. For more information, see Work with group layers.
Sample raster values
Now that you've created points at the raster cell centers, the next step is to run the Sample tool, which you will use to get values from each of the rasters onto those points.
- On the Raster to Point tool, click the Back button.
- In the Geoprocessing pane, in the search box, type sample and in the tool search results, click the Sample tool.
- For Input rasters, click the Add Many button.
- Check the box to select all.
- Click Add.
- For Input location raster or features, choose bioclimate_points.
Each point in the bioclimate_points layer will be used to sample each of the raster values at that location.
- For Output table or feature class, type Sample_vals.
- Accept the default values for Resampling technique and Unique ID field.
- Click Run.
The Sample_vals table is added to the Contents pane in the Standalone Tables section.
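Conceptually, sampling reads the value of every raster at every point location and writes one table row per point. The sketch below shows that idea using a nearest-cell lookup, and assumes all rasters share the same origin and cell size. The function names, raster values, and point coordinates are hypothetical. The Sample tool itself offers other resampling techniques that are not shown here.

```python
import math
from typing import Dict, List, Optional, Tuple

Grid = List[List[Optional[float]]]


def sample_nearest(
    grid: Grid, x_min: float, y_max: float, cell_size: float, x: float, y: float
) -> Optional[float]:
    """Return the value of the cell that contains (x, y), or None if outside the grid."""
    col = math.floor((x - x_min) / cell_size)
    row = math.floor((y_max - y) / cell_size)
    if 0 <= row < len(grid) and 0 <= col < len(grid[row]):
        return grid[row][col]
    return None


def sample(
    rasters: Dict[str, Grid],
    points: List[Tuple[int, float, float]],
    x_min: float,
    y_max: float,
    cell_size: float,
) -> List[dict]:
    """Build one record per point, with one field per raster."""
    table = []
    for pointid, x, y in points:
        record = {"pointid": pointid}
        for name, grid in rasters.items():
            record[name] = sample_nearest(grid, x_min, y_max, cell_size, x, y)
        table.append(record)
    return table


# Made-up rasters on the same 2 x 2 grid, and two made-up sample points.
toy_rasters = {
    "bio1": [[2990, 2980], [2975, 2961]],
    "bio2": [[61, 63], [66, 70]],
}
toy_points = [(1, 0.5, 1.5), (2, 1.5, 0.5)]
for record in sample(toy_rasters, toy_points, x_min=0.0, y_max=2.0, cell_size=1.0):
    print(record)
```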
- In the Contents pane, right-click the Sample_vals table and click Open.
- Examine the table.
The field names are derived from the raster layers' names. They contain the information to identify the bioclimate variables you need, but they could be clearer.
You will give a field a more understandable alias to make it easier to interpret the data.
Set a field alias
When you have field names that are difficult to understand, you can assign aliases to make them easier to work with. The CHELSA Bioclim description page included a table with shortname and longname fields. You can use this table to identify what bioclimatic variables the codes, such as bio1, bio2, bio3, and so on, correspond to, and use those names as field aliases.
- In a browser, go to the CHELSA Bioclim page.
- Review the layer description table.
- In the Contents pane, in the Standalone Tables section, right-click the Sample_vals table, point to Data Design, and click Fields.
The Fields design pane appears.
- Click in the Alias column for the _bio1_ variable.
Using the table from the CHELSA Bioclim web page, the _bio1_ variable corresponds to Mean annual air temperature.
- In the Alias column for the _bio1_ variable, type mean_annual_air_temperature.
You could use this method to update the aliases for each field. However, for this tutorial, you do not need to set aliases for all the fields. A layer that you will use later has been prepared, to save time.
- On the ribbon, on the Standalone Table tab, in the Manage Edits section, click Save.
The name of the column is updated in the table.
Join the sample table to the points
The last step in getting the bioclimate data onto point features for analysis is to join the table you created to the points you made from the cell centers.
- In the Contents pane, right-click the bioclimate_points layer, point to Joins and Relates, and click Add Join.
The Add Join tool opens.
- On the Add Join tool, for Input Field, choose OBJECTID.
The tool detects that there is a stand-alone table in the contents and adds it to the Join Table parameter. It also detects that the table has a field named OBJECTID and sets that as the default Join Field.
These are the correct inputs, so the tool is now ready to run. If you had other tables, or wanted to join on other fields, you could change them.
- Click OK.
- Right-click the bioclimate_points layer and click Attribute Table.
The point layer now has the attributes from the Sample_vals table.
The table shows the alias you set for the mean_annual_air_temperature field.
The data is now ready for exploration and analysis to identify regions.
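The join in this section matches records by a shared key. As a rough illustration of the idea, and not of how the Add Join tool is implemented, the sketch below attaches table fields to point records that have the same OBJECTID. The function name and the records are made up, and points without a match receive None for the joined fields, which is an assumption chosen to mimic keeping all input features.

```python
from typing import Dict, List


def join_by_key(features: List[dict], table: List[dict], key: str = "OBJECTID") -> List[dict]:
    """Attach the fields of matching table rows to each feature record."""
    lookup: Dict[object, dict] = {row[key]: row for row in table}
    table_fields = [field for field in table[0] if field != key] if table else []
    joined = []
    for feature in features:
        match = lookup.get(feature[key])
        merged = dict(feature)  # copy so the input records are not modified
        for field in table_fields:
            merged[field] = match[field] if match is not None else None
        joined.append(merged)
    return joined


# Hypothetical point records and sample table rows.
toy_points = [
    {"OBJECTID": 1, "grid_code": 2990},
    {"OBJECTID": 2, "grid_code": 2961},
    {"OBJECTID": 3, "grid_code": 2950},
]
toy_table = [
    {"OBJECTID": 1, "mean_annual_air_temperature": 2990},
    {"OBJECTID": 2, "mean_annual_air_temperature": 2961},
]
for record in join_by_key(toy_points, toy_table):
    print(record)
```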
You downloaded bioclimate data, clipped it to your area of interest, created sample points from cell centers, sampled multiple bioclimate rasters to a table, set an alias, and joined the table to your points. Next, you will do some exploration of the prepared data.
Identify variables
Now that you've prepared the bioclimate data, it is time to explore the variables.
Create a scatter plot matrix chart
A scatter plot matrix chart is a good way to compare pairs of variables when you have multiple numeric variables to consider. You will create a scatter plot matrix of the bioclimate variables to explore the relationships between them.
- In the map pane, click the Identify variables map tab.
The Identify variables map shows the forest boundary layer and a layer named sample_locations. The sample_locations layer has the bioclimate variable sample data with descriptive field names and aliases.
- In the Contents pane, right-click the sample_locations layer, point to Create Chart, and click Scatter Plot Matrix.
The Chart Properties pane appears.
This pane allows you to configure the properties of a chart. The first step to configure a scatter plot matrix is to select the numeric variables that will be plotted against each other.
- In the Chart Properties pane, click Select.
There are 19 bioclimatic variables to select for the scatter plot matrix chart.
- Click the Toggle all the check boxes button.
Since there are so many fields, it is easier to select all of the fields and uncheck the ones you don't want to use.
- Uncheck the first two fields, pointid and grid_code, and click Apply.
The scatter plot matrix appears in the chart pane.
Use a scatter plot matrix to examine data relationships
Now that you've created the scatter plot matrix chart, you will use it to explore the data and examine relationships between variables at different locations in the forest. You will decide which variables to use based on your study requirements and this exploration.
If the chart is docked in ArcGIS Pro, it may be too small to read, so first you will float the pane to make it easier to enlarge.
- Right-click the sample_locations - Scatter plot matrix of sample_locations tab and click Float.
- Click and drag the corner of the Scatter plot matrix of sample_locations chart window to make it larger so you can see the array of plots better.
The chart shows each of the variables on the X and Y axes. It shows a small scatter plot for each combination of variables.
If you hover over one of the small plots, a pop-up will show the variable names and the R-Squared value.
- Hover over one of the plots with a high positive correlation (a narrow line of points with a rising trend from left to right).
In this dataset, for this particular study area, the precipitation at each location during the driest month is strongly positively correlated with the precipitation at each location in the wettest month.
- Hover over one of the plots with a high negative correlation (a narrow line of points with a falling trend from left to right).
In this dataset, for this study area, the mean annual temperature at each location is strongly negatively correlated with the temperature seasonality (the standard deviation of the monthly mean temperatures) at each location.
- Click one of the plots with a less clear correlation, then hover over the plot.
Clicking one of the plots selects it and also shows it as a larger plot in the upper right part of the chart window.
In this dataset, for this study area, there does not seem to be a strong correlation between the temperature seasonality and the mean precipitation during the driest quarter. It is possible that different relationships between these variables exist for different subsets of the data.
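If you want to verify a correlation outside the chart, the sketch below computes the Pearson correlation coefficient for two variables and squares it. For a straight-line fit between two variables, that square equals the R-squared of the fit. Whether the pop-up value is computed exactly this way is an assumption. The function name and the sample values are made up for illustration.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)


# Made-up values for two variables at five locations.
driest_month = [60.0, 70.0, 80.0, 90.0, 100.0]
wettest_month = [300.0, 335.0, 365.0, 410.0, 440.0]

r = pearson_r(driest_month, wettest_month)
print("r =", round(r, 3), "R-squared =", round(r * r, 3))
```

A value of r near 1 corresponds to the narrow rising band of points described above, a value near -1 to a narrow falling band, and a value near 0 to a diffuse cloud.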
- On the scatter plot matrix chart toolbar, click the Selection tool.
The charts and the map are linked, so selections on the large chart are reflected on all of the smaller scatter plots and on the map. This allows you to interactively select data points on the chart and view their location on the map.
- Click and drag a box to select some of the points in the upper right section of the selected scatter plot.
The selection on the large chart is reflected on the smaller charts.
The selection is also shown on the map.
You could use this method to continue to explore the relationships between the different variables in various places throughout your study area.
- Click an area of the large chart with no points.
The selection is cleared on the chart and on the map.
- Close the chart window.
The chart is listed in the Contents pane. You can reopen it by double-clicking it.
The CHELSA Bioclimate dataset is a multipurpose dataset, applicable to many research topics. Of the 19 bioclimate variables, many are related to temperature and precipitation, and many of these variables are highly correlated. They show annual trends, seasonality, and extreme or limiting factors. Not all of the data here will be useful for any given project.
For the regionalization analysis, you want variables that are not highly correlated. You will use annual precipitation, annual temperature range, and mean annual temperature for defining regions. Depending on your area and use case, you could choose different variables.
You have explored the relationships between pairs of bioclimatic variables and seen how different geographic subsets of the data may show different relationships and clustering patterns. Now you will conduct a regionalization analysis on your selected variables.
Identify regions
Now that the data has been prepared and the variables identified, the final step is to conduct a regionalization analysis.
Regionalization involves dividing an area into smaller regions based on specific criteria that prevail within each region, to understand the characteristics of each region. The purpose of regionalization is to summarize the important prevailing environmental factors into spatial regions for conservation planning and management. Depending on the species and area of interest, different variables will be important.
Create clusters
You will use the Multivariate Clustering tool to create clusters based on the variables you've selected.
- In the map pane, click the Identify regions map tab.
This map shows the sample_locations layer and the boundary layer.
- In the Geoprocessing pane, in the search box, type multivariate clustering and in the tool search results, click the Multivariate Clustering tool.
- On the Multivariate Clustering tool, for Input Features, choose the sample_locations layer.
- In the Analysis Fields box, check Annual_precipitation, Annual_temperature_range, and Mean_annual_temperature.
- Accept the default values for Clustering Method and Initialization Method.
The default clustering algorithm is the K means algorithm. The default initialization method is optimized seed location. For more information about these parameters, see the tool help.
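For readers unfamiliar with K means, the following is a minimal sketch of the basic algorithm: assign each point to its nearest cluster center, move each center to the mean of its members, and repeat until the assignments stop changing. It is not the tool's implementation. In particular, the starting seeds are passed in by hand here, whereas the tool chooses them with its own initialization method. The function name and the points are made up.

```python
import math
from typing import List, Sequence, Tuple


def kmeans(
    points: Sequence[Tuple[float, ...]],
    seeds: Sequence[Tuple[float, ...]],
    max_iter: int = 100,
) -> Tuple[List[int], List[Tuple[float, ...]]]:
    """Basic K means: returns a cluster index per point and the final centers."""
    centers = [tuple(seed) for seed in seeds]
    labels: List[int] = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:  # converged: no point changed cluster
            break
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centers


# Two obvious groups of made-up, already standardized values.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(toy_points, seeds=[(0, 0), (10, 10)])
print(labels)
print(centers)
```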
- Leave the Number of Clusters parameter empty.
If you have a predetermined number of clusters that you want the tool to find, you can specify it here. For initial exploration of the natural clusters in your data, you can leave this value unspecified. The tool will identify an optimal number of clusters based on the data.
- For Output Table for Evaluating Number of Clusters, type Output_number_of_clusters.
Specifying this output table will also create a chart showing the pseudo F-statistic values for different numbers of clusters. The largest pseudo F-statistic values indicate solutions that perform best at maximizing both within-cluster similarities and between-cluster differences.
You will use this chart to understand the tool output and make your own decision about how many clusters to create.
- Click Run.
The tool runs, and after a short time the sample_locations_MultivariateClustering layer is added to the map.
This layer shows the sample locations symbolized by Cluster ID. Three clusters have been identified.
This is a start, but the three clusters are large and might not lend themselves to fine-grained ecological decision making. You could specify that the tool create more clusters, but how do you know how many clusters that should be?
Because you specified that the tool create the Output_number_of_clusters table, it also made a chart that you can explore to better understand the clusters in your data. You will refine your analysis next, using this information.
Refine the cluster analysis
Now you'll review the pseudo F-statistic values that the tool recorded in the Output_number_of_clusters table.
- In the Contents pane, in the Standalone Tables section, in the Charts subsection, double-click the Optimized Pseudo-F Statistic Chart.
The chart appears. If necessary, resize it to make it clearer.
This chart shows the Pseudo-F statistic values for clustering solutions between 3 and 30. This statistic is the ratio of between-cluster variance to within-cluster variance. Larger values are better for maximizing both within-cluster similarities and between-cluster differences.
The tool created three clusters because the highest Pseudo-F statistic value is at 3, as you can see on the chart. There is a second peak at 6, which is also an acceptable solution.
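To make the ratio concrete, the sketch below computes a pseudo F-statistic using a common definition: the between-cluster sum of squares divided by (k - 1), over the within-cluster sum of squares divided by (n - k), where k is the number of clusters and n the number of points. Treat this as an assumption about the general form of the statistic rather than the tool's exact computation. The function name and data are made up.

```python
from typing import List, Sequence, Tuple


def pseudo_f(points: Sequence[Tuple[float, ...]], labels: Sequence[int]) -> float:
    """Pseudo F-statistic for a clustering (needs at least 2 clusters and n > k)."""
    n = len(points)
    dims = len(points[0])
    clusters = sorted(set(labels))
    k = len(clusters)
    overall_mean = [sum(p[d] for p in points) / n for d in range(dims)]

    between = 0.0  # spread of cluster centroids around the overall mean
    within = 0.0   # spread of points around their own cluster centroid
    for c in clusters:
        members: List[Tuple[float, ...]] = [p for p, l in zip(points, labels) if l == c]
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dims)]
        between += len(members) * sum((centroid[d] - overall_mean[d]) ** 2 for d in range(dims))
        within += sum((p[d] - centroid[d]) ** 2 for p in members for d in range(dims))

    return (between / (k - 1)) / (within / (n - k))


toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good_labels = [0, 0, 0, 1, 1, 1]  # matches the two visible groups
poor_labels = [0, 1, 0, 1, 0, 1]  # mixes the groups together

print("good split:", round(pseudo_f(toy_points, good_labels), 3))
print("poor split:", round(pseudo_f(toy_points, poor_labels), 3))
```

The grouping that matches the natural structure of the data yields a much larger value, which is why peaks in the chart point to candidate numbers of clusters.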
- Close the chart.
- Click the Geoprocessing tab.
The Multivariate Clustering tool pane shows the settings you used when you ran the tool. You will change the number of clusters the tool will create and run it again.
- On the Multivariate Clustering tool, in the Number of Clusters box, type 6.
- Accept the default values for the other parameters and click Run.
The tool runs and generates a new map with six clusters.
In addition to the map of the point clusters, the tool creates a chart that shows the characteristics of the clusters. You will examine it next.
Explore the characteristics of the clusters
The points in each of these clusters tend to have similar values for each of the three variables. The Multivariate Clustering Box-Plots chart shows these sets of values.
- In the Contents pane, in the sample_locations_MultivariateClustering layer section, double-click the Multivariate Clustering Box-Plots chart.
The chart shows box plots of the distribution of standardized values for each of the variables, and lines that indicate the mean values of each of the clusters for each variable.
In this chart, the three box plots show standardized values of the annual precipitation, annual temperature range, and mean annual temperature variables.
Plotted over these box plots are lines representing where the mean values within each of the clusters fall for each variable.
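Standardized values put variables with different units on a common scale, which is what lets precipitation and temperature share one chart. The sketch below shows one way to produce such values (a z-score: subtract the mean and divide by the standard deviation) and to compute the mean standardized value per cluster. The use of the population standard deviation is an assumption, and the function names, values, and cluster IDs are made up.

```python
import statistics
from typing import Dict, List, Sequence


def standardize(values: Sequence[float]) -> List[float]:
    """Return z-scores: (value - mean) / population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(value - mean) / std for value in values]


def cluster_means(z_values: Sequence[float], cluster_ids: Sequence[int]) -> Dict[int, float]:
    """Mean standardized value for each cluster ID."""
    grouped: Dict[int, List[float]] = {}
    for z, cluster_id in zip(z_values, cluster_ids):
        grouped.setdefault(cluster_id, []).append(z)
    return {cluster_id: statistics.fmean(zs) for cluster_id, zs in grouped.items()}


# Made-up values of one variable at eight points, with made-up cluster IDs.
toy_values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
toy_clusters = [1, 1, 1, 1, 2, 2, 2, 2]

z_scores = standardize(toy_values)
print(z_scores)
print(cluster_means(z_scores, toy_clusters))
```

A cluster mean above zero indicates that the cluster tends to have higher-than-average values for that variable, and a mean below zero indicates lower-than-average values.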
- On the Multivariate Clustering Box-Plots chart, hover over the cluster 2 point on the annual precipitation box plot.
In this example, cluster 2 is depicted in red. The chart colors match the point colors on the map, which are randomly assigned. Your cluster 2 points may be a different color.
A pop-up appears, showing the Cluster ID for the point, the Analysis Fields name, and the Mean Value for that cluster.
You can use this method to examine the mean values for each of the fields for each cluster. The cluster points are connected by lines of the same color.
The red line, representing points in cluster 2 in this chart, shows a higher mean value for annual precipitation than the other clusters, an intermediate value for annual temperature range, and a lower mean value for mean annual temperature. This cluster represents the higher elevation areas in the middle of the forest, which are wetter and cooler than other areas.
In contrast, the light blue line, representing points in cluster 1, has the lowest mean annual precipitation values, the highest annual temperature range values, and intermediate mean annual temperature values. These points are lower elevation points on the western edge of the forest.
You have created six clusters, representing six bioclimatically similar areas within the forest. Two of the clusters, cluster 1 and cluster 5, have areas that are not spatially contiguous.
These clusters may be appropriate for your work.
In some cases, it may be desirable for the clusters to be spatially contiguous. You'll use the Spatially Constrained Multivariate Clustering tool to create contiguous clusters.
Create continuous clusters
Having contiguous habitat clusters can be useful for species conservation. If you need your clusters to be spatially contiguous, you can use a tool that includes spatial constraints in the clustering process.
- Click the Geoprocessing tab, and in the Multivariate Clustering tool pane, click the back button.
- In the Geoprocessing pane, in the search box, type spatially constrained multivariate clustering and in the tool search results, click the Spatially Constrained Multivariate Clustering tool.
- On the Spatially Constrained Multivariate Clustering tool, for Input Features, choose the sample_locations layer.
- In the Analysis Fields box, check Annual_precipitation, Annual_temperature_range, and Mean_annual_temperature.
- For Cluster Size Constraints, accept the default value of None.
In this case, you do not need to constrain the size of clusters.
- For Number of Clusters, type 6.
- For Spatial Constraints, accept the default value of Trimmed Delaunay triangulation.
- Click Run.
The tool runs and creates a set of spatially constrained clusters. The new layer is added to the map.
For more information about the Spatially Constrained Multivariate Clustering tool, see the tool help.
You have created bioclimatic regions using three variables and two multivariate clustering tools. You have seen how to obtain and investigate the input variables and examine the characteristics of the clusters. This was a simplified example to show the workflow and the available data. For your own work, you would likely use a different set of variables, and you would choose variables that are most relevant to the location, habitats, and species of concern in the area where you do your conservation work.