Set up a Conda environment

First, you'll use a new Conda environment to set up R and Python libraries that will work in conjunction with all the necessary libraries that ship with ArcGIS Pro, including the ArcGIS API for Python libraries. You'll copy the preexisting ArcGIS Pro libraries and add additional libraries using the Conda package manager.

Configure a new Conda environment

To begin the lesson and explore climate downscaling using spatial machine learning and geoenrichment, you'll use the ArcGIS Pro Python Package Manager to create a Conda environment that includes the ArcGIS API for Python and all other required libraries. This is an essential step for analysis that requires integration with third-party libraries that do not ship with ArcGIS Pro.

  1. Start ArcGIS Pro. If prompted, sign in using your licensed ArcGIS account.
    Note:

    If you don't have ArcGIS Pro or an ArcGIS account, you can sign up for an ArcGIS free trial.

  2. Under New, click Map.

    The Map template creates a project with a 2D map.

  3. In the Create a New Project window, name the project Climate_Downscaling. Save the project to the location of your choice (for example, C:\ClimateAnalysis) and click OK.

    Create a New Project

    The project opens and displays a map with the Topographic basemap. Before adding data, you need to set up the Conda environment.

  4. On the ribbon, click the Project tab.

    Project tab

  5. In the Project pane, on the left, click Python.

    Python tab

  6. In the Python Package Manager, under Project Environment, click Manage Environments.

    Python Package Manager

  7. In the Manage Environments pane, select the arcgispro-py3 environment and click the Clone button.

    Clone current Python environment

  8. In the Clone Environment pane, for Name, type climate_downscaling, and then click Clone.

    Rename cloned environment

    You clone the arcgispro-py3 environment as a safeguard: if the settings and changes you make break the new environment, you can always switch back to the original default environment.

  9. In the Manage Environments pane, select the climate_downscaling environment to make it the active environment, and then click OK.

    Select cloned environment

    ArcGIS Pro requires a restart for the climate_downscaling environment to replace the original arcgispro-py3 environment.

  10. In the Project pane, click Exit, and click Yes to save changes to the Climate_Downscaling project.
  11. Restart ArcGIS Pro.
  12. From Recent Projects, select your Climate_Downscaling project.

Install new Python libraries and R

At this point, you have a new climate_downscaling environment that is a clone of the one that ships with ArcGIS Pro, so you have access to all the base libraries ArcGIS Pro requires to operate. Why clone the environment you already had? For stability, a clone is recommended whenever you plan to install new Python libraries: in the worst case, you can always switch back to the base Conda environment that ships with ArcGIS Pro and continue working. In the following steps, you will install several new Python libraries you will need for climate downscaling.

Deploying libraries as a Conda package ensures that the various Python libraries work with each other and do not adversely affect the libraries already installed, so it is recommended that you use the Conda package manager to install third-party libraries. A Conda package bundles a set of Python libraries, and one such package has been created for this Learn lesson. In the next steps, you will install it.
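If you prefer to script the same clone-and-install workflow instead of using the Python Package Manager interface, it can be driven from Python by calling conda directly. The following is a minimal sketch only: it assumes conda is on your PATH (for ArcGIS Pro it normally lives under the Pro installation's Python\Scripts folder) and that the space-time-learn-lesson package is available from a channel your machine can reach (the channel name below is an assumption), so adjust names and paths for your installation.

    import subprocess

    # Clone the default ArcGIS Pro environment so the original stays untouched.
    subprocess.run(
        ["conda", "create", "--name", "climate_downscaling",
         "--clone", "arcgispro-py3", "--yes"],
        check=True,
    )

    # Install the lesson's Conda package into the clone.
    # The "esri" channel is an assumption; use the channel that hosts the package.
    subprocess.run(
        ["conda", "install", "--name", "climate_downscaling",
         "--channel", "esri", "space-time-learn-lesson", "--yes"],
        check=True,
    )

In ArcGIS Pro itself, the graphical steps below accomplish the same thing.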

  1. On the ribbon, click the Project tab.
  2. In the Project pane, click Python.
  3. In the Python Package Manager, under Project Environment, verify your current project environment is set to climate_downscaling.
  4. Next, click Add Packages.

    Add Packages will search the online Conda package repository for a published Conda package.

    Add a Conda package

  5. In the Add Packages pane, type space-time-learn-lesson.
    Note:

    Conda is an open source package and environment management system that facilitates the quick installation, running, and update of packages and their dependencies. With Conda, users can easily create, save, load, and switch between environments on a local computer. You can create your own Conda package such as this one that you can share and distribute for others to replicate your settings.

  6. In the search results, select space-time-learn-lesson and click Install.

    Search for Conda package

  7. On the Install Package dialog box, investigate the libraries that the space-time-learn-lesson Conda package will install. The package includes the following libraries:

    R_mutex: Mutex package that enables R functionality in Conda

    arcgispro: ArcGIS Pro metapackage

    jupyter: Jupyter Notebook

    mro-base: Microsoft R Open (MRO)

    r-arcgisbinding: R-ArcGIS Bridge package

    r-fda: Functional data analysis (FDA) package for R

    r-irkernel: Kernel that allows R to run in a Jupyter Notebook

    r-raster: R raster package

    r-rts: R package for space-time data structures

    r-shiny: R web application framework used for interactive charts and apps

  8. On the Install Package dialog box, review the terms and conditions, check I agree to the terms and conditions, and click Install.

    Install Package dialog box

    Note:

    The space-time-learn-lesson Conda package is an efficient way for you to ensure that all required packages needed for these lessons are installed in a single process. For future projects, you can add or remove libraries separately as needed and do not need to use a Conda package.

    This process may take a few minutes as libraries are downloaded and configured.

  9. In the Project pane, click the back arrow to return to the project.

    Back arrow

  10. Click Save to save the project.

You've successfully cloned your Python environment, and using a Conda package, you installed the libraries and packages needed for further analysis. The Python package manager allows you to create environments where numerous Python libraries coexist with the core spatial analysis libraries that ship with ArcGIS Pro, such as ArcGIS API for Python. The Jupyter Notebook and R libraries you installed will be vital for the later steps where you will automate temperature profile forecasts using Python and R-ArcGIS integration. Next, you'll evaluate regression model accuracy and useful variables to predict temperature and decide on an appropriate automation workflow.


Build a regression model

Previously, you established the foundation for exploring predictors (general circulation model [GCM] variables) and a target variable (observed temperature) by installing several Python and R libraries. Next, you'll use various exploratory regression methods to establish and define the relationships between those predictors and the observations made at weather stations.

First you'll explore the data using ArcGIS Pro charting capabilities to represent and visualize the relationships between your predictors and target variables. Following this exploratory data analysis, you will use increasingly complex regression models starting with a Generalized Linear Regression, then a Geographically Weighted Regression, and finally Forest-based Classification and Regression to explore the complex relationships between the GCM variables and observed temperature.

Click here for a review of downscaling methods for climate change projections.

Perform explanatory regression analysis

You'll begin your data exploration by using a scatterplot matrix to examine the relationships between the GCM variables (predictors) and observed temperature (the target variable).

  1. On the Insert tab, in the Project group, click Import Map.

    The Import Map window opens and allows you to search your project, your computer, or your portal for maps to add to the current project.

    Select to import a map

  2. In the Import window, under Portal, click All Portal.

    Search for online map

  3. In the search box, type Regression Model Map and press Enter.

    The portal is the licensed account you used to sign in when you first opened ArcGIS Pro. Because you signed in with an ArcGIS Online account, your portal gives you access to all ArcGIS Online data that your account has access to.

    Select Regression Model Map

  4. In the search results, locate and select Regression Model Map and click OK.

    The published map package is downloaded and adds the Regression Model Map to the project.

    Note:

    Data acknowledgments

  5. In the Contents pane, under Regression Model Map, right-click the station_data_no_missing layer and click Attribute Table.

    This dataset contains a subset of weather station data with associated simulated GCM variables at respective locations and time steps.

    Open and review attributes

  6. Review the table and note the total number of records (18,011) in the table and the values contained in the LST_YRMO_Converted field.

    This field contains the dates on which the observations were made. There are a total of 207 weather stations, but the table contains 18,011 records because it holds multiple observations for each weather station at different dates from 2006 to 2016.

    Note:

    GCMs simulate the earth's climate via mathematical equations that describe atmospheric, oceanic, and biotic processes, interactions, and feedbacks. They are the primary tools that provide reasonably accurate global, hemispheric, and continental-scale climate information and are used to understand present climate and future climate scenarios under increased greenhouse gas concentrations.

  7. Review the table, but this time note the various variables or predictors collected for each weather station.

    The attribute table contains the 32 GCM variables that you can use to build your temperature downscaling model. A description of field names and what they represent is given in the following table.

    GCMs represent the following: water vapor and cloud atmospheric interactions, direct and indirect effects of aerosols on radiation and precipitation, changes in snow cover and sea ice, the storage of heat in soils and oceans, surface fluxes of heat and moisture, and large-scale transport of heat and water by the atmosphere and oceans.

    Field name: Description

    ccb: Air Pressure at Convective Cloud Base (Pa)
    cct: Air Pressure at Convective Cloud Top (Pa)
    ci: Fraction of Time Convection Occurs
    clivi: Ice Water Path (kg m-2)
    clt: Total Cloud Fraction (%)
    clwvi: Condensed Water Path (kg m-2)
    evspsbl: Evaporation (kg m-2 s-1)
    hfls: Surface Upward Latent Heat Flux (W m-2)
    hfss: Surface Upward Sensible Heat Flux (W m-2)
    huss: Near-Surface Specific Humidity
    pr: Precipitation (kg m-2 s-1)
    prsn: Snowfall Flux (kg m-2 s-1)
    prw: Water Vapor Path (kg m-2)
    ps: Surface Air Pressure (Pa)
    rlds: Surface Downwelling Longwave Radiation (W m-2)
    rldscs: Surface Downwelling Clear-Sky Longwave Radiation (W m-2)
    rlus: Surface Upwelling Longwave Radiation (W m-2)
    rlut: TOA Outgoing Longwave Radiation (W m-2)
    rlutcs: TOA Outgoing Clear-Sky Longwave Radiation (W m-2)
    rsds: Surface Downwelling Shortwave Radiation (W m-2)
    rsdscs: Surface Downwelling Clear-Sky Shortwave Radiation (W m-2)
    rsdt: TOA Incident Shortwave Radiation (W m-2)
    rsus: Surface Upwelling Shortwave Radiation (W m-2)
    rsuscs: Surface Upwelling Clear-Sky Shortwave Radiation (W m-2)
    sbl: Surface Snow and Ice Sublimation Flux (kg m-2 s-1)
    sfcWind: Daily-Mean Near-Surface Wind Speed (m s-1)
    tasmax: Daily Maximum Near-Surface Air Temperature (K)
    tasmin: Daily Minimum Near-Surface Air Temperature (K)
    tas: Near-Surface Air Temperature (K)
    tauu: Surface Downward Eastward Wind Stress (Pa)
    tauv: Surface Downward Northward Wind Stress (Pa)
    ts: Surface Temperature (K)

    In the next step, you will inspect the relationships between the 32 variables using a scatterplot matrix to help determine if multicollinearity exists.

    The first regression model you will create is a linear regression, and it requires you to identify predictors that are linearly related to the target variable (T_MONTHLY_) but not linearly related to any other predictors. If your predictors are linearly related to each other, they will give rise to a problem known as multicollinearity, which implies that some predictors could be redundant because they can be reproduced with the linear combination of other predictors.

    Learn more about exploratory regression

    For a linear regression model, multicollinearity results in an unstable model, meaning that a small change to a predictor (such as a GCM variable) can result in a disproportionately large change to the predicted target variable (such as local temperature).

    Learn more about multicollinearity
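    If you want to quantify these pairwise relationships outside the chart, a quick way is to pull the numeric fields into NumPy and square the correlation matrix. The following is an illustrative sketch only; it assumes you run it in the project's Python environment and that the table and field names match your data, and it checks just the three fields you are about to chart.

      import arcpy
      import numpy as np

      # Fields to compare: the target plus two candidate predictors.
      fields = ["T_MONTHLY_", "ccb", "cct"]

      # Read the attribute table into a NumPy structured array (null rows skipped).
      arr = arcpy.da.TableToNumPyArray("station_data_no_missing", fields, skip_nulls=True)

      # Stack the columns and compute pairwise R-squared values.
      data = np.column_stack([arr[f] for f in fields])
      r2 = np.corrcoef(data, rowvar=False) ** 2

      for i, fi in enumerate(fields):
          for j, fj in enumerate(fields):
              if j > i:
                  print(f"{fi} vs {fj}: R2 = {r2[i, j]:.2f}")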

  8. Close the station_data_no_missing attribute table before continuing.
  9. In the Contents pane, right-click the station_data_no_missing layer, point to Create chart, and choose Scatter Plot Matrix.

    The name of your scatterplot matrix may differ from the name in the illustrations below.

    Create a scatterplot matrix

    Note:

    A scatterplot matrix (SPM) is a data exploration tool that allows you to compare several datasets to look for patterns and relationships. An SPM has two main components: a matrix of small scatterplots for each of the variable combinations and a larger preview window that shows the scatterplot for a selected pair of variables in greater detail.

  10. In the Chart Properties pane, on the Data tab, check the T_MONTHLY_, ccb, and cct variables.

    Select SPM variables

    Notice how the SPM matrix dynamically updates as you select a variable pair.

    Review initial SPM

    Your symbology may differ from the illustrations.

    Generating a complex SPM with all the variables may take a long time. For now, you will explore three variables and then open a prepared result that includes all of them.

  11. In the SPM window, inspect the graphed variables. Notice that the T_MONTHLY_ versus ccb scatterplot is enlarged and displayed as a large plot in the upper right corner. You can change which plot is enlarged by clicking other variable combinations in the matrix.
  12. In the Chart properties pane, check Show as R2.

    Check to show R2 values

    The SPM updates so that each scatterplot displays an R2 value showing the strength of the linear relationship between that pair of variables.

    Review R2 values for selected variables

    At this point, you could add the additional 30 variables to your SPM, but adding these individually would take a long time. In the next step, you will close your current SPM and open an SPM that has all 33 variables preselected for you.

  13. Click the close button in the upper right corner of your current SPM to close the scatterplot window.
  14. Move and reposition the Chart Properties pane to the right of the map.
  15. In the Contents pane, locate and expand the Result - station_data_no_missing layer. Right-click the Result - SPM of station_data_no_missing chart and choose Open.

    Open results SPM with 32 variables

    Be patient as the 33-variable SPM is generated; it may take several minutes to display because each individual chart in the matrix must be defined and generated.

    While the SPM is generating, you can review the final scatterplot matrix as a graphic.

  16. In the Result - SPM of station_data_no_missing chart, inspect the matrix of variables.

    In Result - SPM of station_data_no_missing, R2 values have been checked and are displayed.

    Complete SPM with 32 variables

    In the matrix, high R2 values (darker blue) indicate a strong linear relationship between a pair of variables. To avoid multicollinearity, you need to identify predictors that are not linearly related to each other (low R2) but that are linearly related to the target variable, local temperature (T_MONTHLY_).

  17. On your own, identify variable combinations with high and low R2 values. Click the scatterplot of a targeted combination to reveal the relationship and view the enlarged scatterplot to the upper right of the matrix.
    Note:

    Your goal is to identify as many informative predictor (GCM) variables as you can to predict the target variable (local temperature). You need as many linearly independent (low R2) predictors as possible that are also linearly related to the target variable; relative to the target, these will in most cases have an R2 value higher than 0.5.

  18. In the SPM, notice that predictors with high R2 values are displayed with darker blue boxes. Review several of these and note that the highest R2 values range from 0.80 to 0.99.

    Specifically pay attention to the following three predictors:

    • cct (Air Pressure at Convective Cloud Top)
    • huss (Near-Surface Specific Humidity)
    • rlut (TOA Outgoing Longwave Radiation)

    These predictors do not possess a strong linear relationship with each other, but they do seem to be linearly related to the target variable T_MONTHLY_ (local observed temperature). In the next step, you will explore these relationships.

  19. Close the Result - SPM of station_data_no_missing window before continuing.
  20. Save the project.

In this last step, you used a scatterplot matrix as a data exploration tool to compare predictors to search for patterns and relationships. For regression, it is important to be able to state and justify the expected relationship between each candidate explanatory variable and the dependent variable prior to analysis, and you should question models where these relationships do not match. For this reason, you will explore a generalized linear regression next.

Execute a generalized linear regression

Building a regression model is an iterative process that involves finding effective independent variables to explain the dependent variable you are trying to model or understand. This involves a workflow where you run the regression tool to determine which variables are effective predictors, and then repeatedly remove or add variables until you find the best regression model possible.

  1. On the ribbon, on the Analysis tab, click the Tools button. In the Geoprocessing pane search box, type Generalized Linear Regression.

    Search for the Generalized Linear Regression tool

    Generalized Linear Regression (GLR) can perform a regression on continuous variables via Ordinary Least Squares (OLS) regression, logistic regression (on binary variables), and Poisson regression (on counts). Since temperature is a continuous variable, you will use the Continuous (Gaussian) option.

  2. In the Generalized Linear Regression tool, set the following parameters:

    • For Input Features, choose station_data_no_missing.
    • For Dependent Variable, choose T_MONTHLY_.
    • For Model Type, choose CONTINUOUS.
    • For Output Features, type station_data_no_missing_Gene.
    • For Explanatory Variable(s), choose cct, huss, and rlut.

    Note:

    Using the Generalized Linear Regression tool without setting any prediction options means that you are performing an exploratory regression; that is, you're only exploring the linear relationships between the predictors (cct, rlut, and huss) and the target variable (monthly average temperature).

    Generalized Linear Regression tool parameters
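    Equivalently, you can run this step from Python, which becomes useful once you start automating. The following is a minimal sketch, assuming the Spatial Statistics toolbox alias (arcpy.stats) and the parameter order shown in the Generalized Linear Regression tool help for your version of ArcGIS Pro; adjust paths and names for your project.

      import arcpy

      arcpy.env.overwriteOutput = True

      # Exploratory GLR: continuous (Gaussian) model of monthly temperature
      # from the three candidate GCM predictors. No prediction options are set.
      arcpy.stats.GeneralizedLinearRegression(
          "station_data_no_missing",          # input features
          "T_MONTHLY_",                       # dependent variable
          "CONTINUOUS",                       # model type
          "station_data_no_missing_Gene",     # output features
          "cct;huss;rlut",                    # explanatory variables
      )

      # The summary and diagnostics reported in the tool details are also
      # available as geoprocessing messages.
      print(arcpy.GetMessages())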

  3. Click Run.

    The GLR tool creates and adds a new layer and several charts to the Contents pane. Next, you'll explore each chart and evaluate the tool diagnostics to determine if you can model temperature using the cct, huss, and rlut variables in a linear regression model.

  4. In the Contents pane, locate and, if necessary, expand the station_data_no_missing_Gene layer.

    Generalized Linear Regression tool results

  5. In the Contents pane, right-click the Relationships between Variables chart and choose Open.

    Relationships between Variables histogram chart

    The Relationships between Variables chart is a correlation matrix, similar to the scatterplot matrix, but it only plots the variables included in the regression: cct, huss, rlut, and T_MONTHLY_.

  6. In the Chart Properties pane, check Show as R2.

    The chart updates to display R2 values instead of histograms for each variable combination. Notice that these R2 values are the same as in the original SPM but are easier to investigate because this matrix contains fewer variables.

    Relationships between Variables R2 values

  7. In the Chart pane, review the first cell in the first column of the matrix showing a histogram of the monthly temperature measurements.

    Histogram of the monthly temperature measurements

    The first cell is the histogram of the monthly temperature measurements and shows a quasi-Gaussian distribution that is right-skewed with an elevated frequency of higher-than-average temperatures.

  8. Click the cct vs T_MONTHLY_ cell. From the top down, it is located in the second row of the first column.

    Linear relationship between monthly temperature and cct

    The cct versus T_MONTHLY_ combination shows that the linear relationship between monthly temperature and cct is weak. Review the scatterplot and note that cct versus monthly temperature does not show a linear relationship and that the data is scattered around the best-fit line. For linear regression, cct could be a problem, as it is not linearly related to monthly temperature.

  9. Review the huss versus T_MONTHLY_ and the rlut versus T_MONTHLY_ cells. From the top, they are the third and fourth rows in the first column.

    Strong linear relationships between predictors and monthly temperature

    Both the huss versus T_MONTHLY_ and rlut versus T_MONTHLY_ cells show stronger linear relationships between respective predictors and monthly temperature. At a first glance, these variables appear to be useful as predictors within a linear model.

  10. Click the cell showing the relationship between rlut and huss. This is the third cell in the fourth row representing the correlation coefficient between the variables.

    Correlation coefficient between huss and rlut

    Remembering that multicollinearity is another important factor in regression, note that huss and rlut have a correlation coefficient of 0.36, meaning that these two variables are each linearly related to the target variable but are not strongly linearly related to each other. Thus, they are good candidates to use in this model after excluding cct as a predictor.

  11. Close the Relationships between Variables chart.

    Next, you'll check the chart representing the distribution of the standardized residuals.

    A residual is the difference between the predicted temperature and the observed temperature. An accurate model has few (preferably no) large residuals, and the majority of residuals are small in magnitude (close to zero). The Distribution of Standardized Residual chart displays the residuals after they have been centered (by subtracting the mean residual) and scaled by the standard deviation of the residuals. The bars of the histogram show the standardized residuals, and the line shows a perfect standard normal distribution.
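    Concretely, a standardized residual is just the raw residual recentered and rescaled. A small illustrative sketch (NumPy only; the arrays below are placeholders for your observed and predicted temperatures, not values from this dataset):

      import numpy as np

      observed = np.array([12.1, 15.4, 9.8, 21.0])    # placeholder observed temperatures
      predicted = np.array([11.5, 16.0, 10.6, 19.7])  # placeholder predicted temperatures

      residuals = predicted - observed

      # Center on the mean residual and scale by the residual standard deviation.
      standardized = (residuals - residuals.mean()) / residuals.std()
      print(standardized)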

  12. In the Contents pane, right-click the Distribution of Standardized Residual chart and choose Open.
    Note:

    Distribution of Standardized Residual chart

    Your Distribution of Standardized Residuals chart is close to a normal distribution. The histogram shows that the linear regression model's over- and underestimations are balanced and that most residuals are small. However, the histogram only shows how often the model mismatches the observations; in a later step, you will explore whether there is a relationship between the mismatch and the magnitude of the target variable, that is, whether the model predicts average values accurately but fails to generalize to extreme temperature values.

  13. Close the Distribution of Standardized Residual chart.
    Note:

    If the bars of the histogram displaying the standardized residuals follow the line showing a perfect standard distribution, it implies that the residuals are centered on the average residual with few extremely low and few extremely high values. This alone does not imply accuracy; it is more informative about whether there is a systematic error in the regression that frequently under- or overestimates the temperature value. Using a statistic named Jarque-Bera, you can quantify how similar the distribution of standardized residuals is to the analytical standard Gaussian (normal) distribution.

  14. In the Contents pane, right-click the Standardized Residual VS. Predicted Plot chart and choose Open.

    The scatterplot opens and is color-coded to reflect the number of standard deviations each residual is away from the average.

    Standardized Residual VS. Predicted Plot chart

    In this Standardized Residual VS. Predicted Plot chart, there is no pattern in the standardized residuals with respect to the magnitude of the predicted temperature. This implies that the model's accuracy does not systematically depend on the temperature value being predicted.

    Next, you will review the map of standardized residuals to investigate if there is spatial autocorrelation in the performance of the model—in other words, if the model has higher accuracy in certain areas than in others.

  15. Close the Standardized Residual VS. Predicted Plot chart.
  16. In the Contents pane, if necessary, check the station_data_no_missing_Gene layer to display it on the map.

    Standard residual map

  17. On the map, review the distribution of the standard residual values.
  18. On the ribbon, on the Map tab, in the Navigate group, click the Explore tool. Zoom in first to a mountainous area and observe the distribution of the standard residual values. When you're done, zoom in to a flat area in the plains.

    Mountainous areas

    Note that the model significantly over- and underestimates the temperature in areas with rapidly changing elevation, such as mountains, where standardized residuals reach as much as -2.5 standard deviations (SD), while in the plains the values are close to the mean.

    USA plains

    Note:

    In the map of standardized residuals, it is clear that your model had average accuracy for flat land and had estimation errors in areas of changing elevation. In addition, you know that both observed temperature and simulated climate variables contain strong spatial autocorrelation. Temperatures of adjacent sites tend to be similar unless there are natural barriers between them, such as mountains. GCM variables are simulated using finite element models where the simulated value of a cell impacts its neighbor's values.

    Next, you will review a summary of GLR results and GLR diagnostics. ArcGIS Pro maintains a history of all tools executed within a project. These can be reviewed, updated, and reexecuted as needed.

  19. In the Catalog pane, click History and click Generalized Linear Regression.

    Select tool history

    The tool details appear in a new window.

  20. In the Generalized Linear Regression details window, click the pop out button.

    Pop-out details

  21. Position and enlarge the Generalized Linear Regression pop-out window to make it easy for you to review messages.

    The details window maintains metadata pertaining to an executed tool. This information includes the tool parameters used as well as messages and other important details. In the case of the Generalized Linear Regression tool and most other statistical tools, the details also include a summary of results and several important diagnostics that help in interpreting the results and the validity of the tool outputs.

  22. Review the summary of GLR results and note the Intercept value.

    Summary of GLR Results

    The coefficient for each variable represents the linear weight assigned to that variable. Note that magnitude plays an important role here: a variable with numerically large values can receive a small weight simply because of its scale. In this model, the reported coefficients for all predictors are significant (their probabilities are near zero). In addition, a Probability and a Robust_Pr (Robust Probability) value are reported for each coefficient.

  23. Scroll down and review the GLR diagnostics, and specifically take note of the Koenker (BP) statistic. Also note the Jarque-Bera statistic mentioned earlier.
    Note:

    To identify which probability to use to assess the statistical significance of the reported coefficients, consult the Koenker (BP) test statistic. The Koenker (BP) test statistic tells you whether the modeled relationships are consistent across the range of predictor values. In other words, if the model quality is significantly different for different ranges of the predictors, the overall fit is not a good indicator of the robustness of the model. If this statistic is significant, you should use the robust probabilities.

    GLR diagnostics

    Since the Koenker statistic is significant, you could evaluate the Joint Wald statistic instead of the Joint-F statistic to gauge the model significance. In addition, a P-value (Prob) that is smaller than 0.05 would show that the model is significant.

  24. In the summary of GLR results, review the Intercept value.

    Intercept has a value of -38, which means that if all three predictors are zero, the temperature is -38 degrees Celsius.

  25. In the GLR diagnostics, review the Adjusted R-Squared value, which shows the strength of the relationship between predicted and observed temperature.

    In this case, the Adjusted R-Squared value of 0.78 indicates a strong linear relationship between the predicted and observed temperature values; because the maximum possible value of this metric is 1, a value of 0.78 means the model explains most, though not all, of the variation in observed temperature.

    Next, to investigate if spatial autocorrelation exists for this problem, you will use a Geographically Weighted Regression model to explore the relationships between the GCM predictors and average temperature.

  26. Close the Regression Model Map.
  27. Save the project.

Explore geographically weighted regression

Geographically Weighted Regression (GWR) is one of several spatial regression techniques increasingly used in geography and other disciplines. It provides a local model of the variable or process you are trying to understand or predict by fitting a regression equation to every feature in the dataset. Because the regression is local, when the values for a particular explanatory variable cluster spatially, it is likely there are problems with local multicollinearity.

Note:

When you performed Generalized Linear Regression, you were exploring space-time data where you had station measurements and simulated climate for the distinct locations recorded at different times. But since Geographically Weighted Regression analysis weights data from closer locations higher than data from locations farther away, you need to remove the time component from your analysis and select a specific time period for exploring spatial relationships.

In addition, repeated measurements at different times for the same location would be used as two coincident stations and thus lead to a skewed data representation.

You may recall that your data consists of 18,011 observations made at different times for the 207 weather stations. While performing Geographically Weighted Regression, you will need to remove these repeated measurements and restrict your analysis to a specific time period.

  1. On the Insert tab, in the Project group, click Import Map.

    Select to import a map

  2. In the Import window, under Portal, click All Portal.

    Search for online map

  3. In the search box, type Geographically Weighted Regression Map and press Enter.
  4. In the search results, locate and select Geographically Weighted Regression Map and click OK.

    The Geographically Weighted Regression Map is added to your project.

  5. In the Contents pane, under Geographically Weighted Regression Map, right-click the station_data_no_missing layer and click Attribute Table.

    This dataset contains a subset of weather station data with associated simulated GCM variables at respective locations and time steps.

    Open and review attributes

  6. On the ribbon, on the Analysis tab, click the Tools button.
  7. On the Map tab, in the Selection group, click Select By Attributes.
  8. In the Select Layer By Attribute tool, set the following parameters:

    • Set Input Rows to station_data_no_missing.
    • Set Selection type to New selection.
    • Click Add Clause.
    • On the Add Clause dialog box, build the following expression: LST_YRMO_Converted = 2012-03-01
    • Verify the clause is correct and click Add.

    Select by date
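    Scripted, the same selection is a single call. The following is a minimal sketch, assuming the layer name matches your map; note that the exact date syntax in the where clause depends on your data source (the timestamp form shown is typical for file geodatabase date fields), so adjust it if the selection returns no records.

      import arcpy

      # Select the March 2012 snapshot: one record per weather station.
      arcpy.management.SelectLayerByAttribute(
          "station_data_no_missing",
          "NEW_SELECTION",
          "LST_YRMO_Converted = timestamp '2012-03-01 00:00:00'",
      )

      # Confirm that 207 stations are selected.
      print(arcpy.management.GetCount("station_data_no_missing"))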

  9. Click Run.

    In the station_data_no_missing attribute table, verify that 207 records are selected.

  10. In the Geoprocessing pane, click the Back button and type Geographically Weighted Regression.

    GWR performs linear regression but works with a subset of the data (due to local regression), and it can therefore suffer from multicollinearity even more than GLR. In light of your GLR analysis and findings with regard to the cct predictor, you will drop cct from the set of predictors and form a relationship between rlut, huss, and average monthly temperature. You will use the 30 nearest temperature stations to perform local regressions between huss, rlut, and monthly average temperature.

  11. In the Geographically Weighted Regression tool, set the following parameters:

    • For Input Features, choose station_data_no_missing.
    • For Dependent Variable, choose T_MONTHLY_.
    • For Model Type, confirm CONTINUOUS (Gaussian).
    • For Explanatory Variable(s), check huss and rlut.
    • For Output Features, confirm station_data_no_missing_GWR.
    • For Neighborhood Type, choose NUMBER_OF_NEIGHBORS.
    • For Neighborhood Selection Method, choose USER_DEFINED.
    • For Number of Neighbors, type 30.
    • For Additional Options, do the following:
      • Confirm Local Weighting Scheme is set to BISQUARE.
      • For Coefficient Raster Workspace, browse and select Climate_Downscaling.gdb.

    Geographically Weighted Regression tool

    Note:

    When you set a coefficient raster workspace, the GWR tool outputs a coefficient surface raster for each predictor representing the weight applied to that predictor at a given location. This is useful for assessing spatial autocorrelation.
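    As with GLR, this run can be scripted. The following is a hedged sketch only: the keyword parameter names follow the GWR tool documentation and may differ between ArcGIS Pro versions, so verify them against the tool help before running.

      import arcpy

      # Local (GWR) model of monthly temperature from huss and rlut, using the
      # 30 nearest stations for each local regression.
      arcpy.stats.GWR(
          in_features="station_data_no_missing",
          dependent_variable="T_MONTHLY_",
          model_type="CONTINUOUS",
          explanatory_variables="huss;rlut",
          output_features="station_data_no_missing_GWR",
          neighborhood_type="NUMBER_OF_NEIGHBORS",
          neighborhood_selection_method="USER_DEFINED",
          number_of_neighbors=30,
          local_weighting_scheme="BISQUARE",
          coefficient_raster_workspace="Climate_Downscaling.gdb",
      )
      print(arcpy.GetMessages())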

  12. Click Run.

    The tool executes and reports an initial warning related to the coordinate system used, and a second warning that two input points are coincident; you may ignore these for now. Next, you can evaluate the GWR tool results.

  13. After the GWR tool has executed, click View Details on the lower left of the tool.
  14. If necessary, resize and position the details window to review the Analysis Details and Model Diagnostics sections.

    GWR tool details

  15. Review the R2 value in the Model Diagnostics section.

    This GWR local model has an R2 value of 0.854, which is clearly higher than the GLR R2 value of 0.787. This implies that the problem you are solving has a spatial component. You can further assess overall accuracy by reviewing the Sigma-Squared value, which you will do next. Applying GWR shows that strong spatial autocorrelation exists in both the predictors and the target variable. But, due to the limitation of multicollinearity and the assumption of a linear relationship between predictors and the target variable, you cannot use all of the GCM variables and can only use two or three of them at a time.

    For more details, review how geographically weighted regression works.

  16. Review the Sigma-Squared value.

    Sigma-squared is the variance of the residuals, so its square root estimates their standard deviation. In your model, Sigma-Squared has a value of 7.2405, so the residual standard deviation is √7.2405 ≈ 2.7 C. This means that about 68 percent of all the residuals in this regression model fall within +-2.7 C and about 95 percent fall within +-5.4 C, which indicates that, overall, the model is not accurate to within a degree.

    Note:

    This part of your analysis shows that strong spatial autocorrelation in both predictors and the target variable exists. But due to the limitation of multicollinearity and the assumption of a linear relationship between the predictors and target variable, you cannot use all 32 GCM variables simultaneously and can only use two or three of them at a time effectively.

  17. Clear the selected station_data_no_missing features.
  18. Save the Project.

    Your results do not mean that a linear regression model has no merit; linear models, if formed properly using the metric you evaluated, can be a powerful predictor, especially for ranges of data that are not observed. Next, you will use another regression model named Random Forest Regression that is quite different from GLR and GWR. Random Forest creates an ensemble of decision trees that aim to find k-dimensional ranges among k predictors to establish complex relationships between the set of predictors (in this case, 32) and the target variable.

Use random forest regression

Random Forest Regression is a useful and powerful interpolator when there is enough data to resolve the complex relationship between predictors and a target variable. In addition, it can rank the importance of every predictor in the regression model and can form predictive regression models even in the presence of some uninformative and multicollinear predictors. The method is described as a Random Forest regressor, and ArcGIS Pro implements the method via a geoprocessing tool named Forest-based Classification and Regression.

For more details, review Forest-based Classification and Regression.

  1. On the Insert tab, in the Project group, click Import Map.

    Select to import a map

  2. In the Import window, under Portal, click All Portal.

    Search for online map

  3. In the search box, type Random Forest Regression Map and press Enter.
  4. In the search results, locate and select Random Forest Regression Map and click OK.

    The Random Forest Regression Map is added to your project.

  5. If necessary, open the map.
  6. On the ribbon, on the Analysis tab, click the Tools button. In the Geoprocessing pane, type Forest-based Classification and Regression.

    First you will run the tool in Train mode. Train mode indicates that you are not predicting to another dataset but are exploring the relationships between predictors and the target variable in the source data. In this tool, you'll be using all of the 32 GCM predictors.

  7. In the Forest-based Classification and Regression tool, set the following parameters:

    • For Prediction Type, choose Train only.
    • For Input Training Features, choose station_data_no_missing.
    • For Variable to Predict, choose T_MONTHLY_.

    Forest-based Classification and Regression tool

  8. For Explanatory Training Variables, click the Variable arrow button, and to the lower left of the list, click the Toggle All Checkboxes button and click Add. Uncheck T_MONTHLY_, LST_YRMO, and WBANNO, as these are not predictive GCM variables.

    Choose explanatory variables

    Under Additional Outputs, you define the output table name used to record variable importance. Forest-based Classification and Regression ranks variables with respect to the number of times they are used in splits across the forest, identifying both the variables that are frequently used in defining the forest and those that have little impact.

  9. Expand Additional Outputs. For Output Variable Importance Table, type train_variable_importance.

    Set output variable importance table

  10. Expand Validation Options.

    • For Training Data Excluded for Validation (%), type 10.
    • For Number of Runs for Validation, type 100.
    • For Output Validation Table, type r2_table.

    Set validation options
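    For reference, the same train-only run can be scripted, which is how you will automate it later. This is a hedged sketch only: the keyword parameter names follow the Forest-based Classification and Regression documentation and may differ between ArcGIS Pro versions, so confirm them in the tool help before running.

      import arcpy

      # GCM predictors to use (all fields except the target, date, and station ID).
      predictors = [
          "ccb", "cct", "ci", "clivi", "clt", "clwvi", "evspsbl", "hfls", "hfss",
          "huss", "pr", "prsn", "prw", "ps", "rlds", "rldscs", "rlus", "rlut",
          "rlutcs", "rsds", "rsdscs", "rsdt", "rsus", "rsuscs", "sbl", "sfcWind",
          "tasmax", "tasmin", "tas", "tauu", "tauv", "ts",
      ]

      # Train-only forest model with a variable importance table and 100
      # validation runs, each excluding 10 percent of the data.
      arcpy.stats.Forest(
          prediction_type="TRAIN",
          in_features="station_data_no_missing",
          variable_predict="T_MONTHLY_",
          explanatory_variables=[[name, "false"] for name in predictors],
          output_importance_table="train_variable_importance",
          percentage_for_training=10,      # percent excluded for validation
          output_validation_table="r2_table",
          number_validation_runs=100,
      )
      print(arcpy.GetMessages())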

  11. Click Run.

    This process may take a while to complete.

    Note:

    The number of trees is a model parameter that can be increased to improve model robustness. Random Forest works off of a consensus scheme where consensus among different predictors (trees) defines the final result of the forest. Thus, a higher number of trees makes the model more robust to random sampling in data and allows exploring complex relationships. Leaf size indicates the minimum number of entries beyond which another split in data cannot be made for every tree. For a large leaf size, it is advised to use a large number of trees. Tree depth is the number of times data is split in a decision tree. A higher number of splits will lead to more complex regimes modeled in the data.

  12. Click View Details on the lower left after the tool has executed to review the model characteristics and outputs. Note that these may vary slightly depending on the samples your tool used.
  13. Resize and position the Details window to review the Model Characteristics section.

    Model characteristics

    In the model characteristics, you see that Random Forest performed splits an average of 31 times (mean tree depth), with tree depths ranging from a minimum of 26 to a maximum of 41. As a result, you can conclude that the decision trees contain a fine-grained representation of the different complex relationships in the data. To reduce bias, Random Forest also uses a subset of the available predictors per tree; here, every decision tree had access to a random subset of 11 among 19 possible predictors. Finally, 10 percent of the data was left out for cross-validation: Random Forest ignored that 10 percent, and the model built with the remaining 90 percent was still able to predict the withheld values with accurate metrics.

  14. Review the Model Out of Bag Errors section.

    Model out of bag errors

    Out of bag errors are reported for the different numbers of trees used in Random Forest. Both the Mean Squared Error (MSE) and Percentage of variation explained values are based on the ability of the model to accurately predict the target variable from the observed values in the dataset. If the MSE and Percentage of variation values reported by the tool changed drastically as the number of trees increased, this would be a strong indicator that the number of trees needs to be increased further.

  15. Review the Top Variable Importance section.

    Another major factor in the performance of the model is the explanatory variables assigned. The Top Variable Importance list ranks the top 20 variables' importance scores. Remember that Random Forest works by splitting the data to create relatively homogeneous zones with respect to the target variable. Variable importance reports the percentage of the times a variable is used in such a split. Importance gives a metric for the number of times a GCM variable is used in the regression model.

    Top variable importance

    Note:

    Your individual percentages may vary slightly from the illustration and will likely change each time you run through the training process because the selection of the test dataset is always random.

  16. Review the Training Data: Regression Diagnostics and Validation Data: Regression Diagnostics sections.

    When predicting a continuous variable, the observed value for each of the test features is compared to the predictions for those features based on the trained model, and an associated R-Squared, p-value, and Standard Error value are reported. Keep in mind that these diagnostics will change each time you run through the training process because the selection of the test dataset is random.

    Training Data: Regression Diagnostics

    Validation Data: Regression Diagnostics

    Note that a median R2 of 0.922 was reached at a seed of approximately 215,355.

  17. Review the Explanatory Variable Range Diagnostics section.

    The explanatory range diagnostics can help you evaluate whether the values used for training, validation, and prediction are sufficient to produce a good model and allow you to trust other model diagnostics. The data used to train a Random Forest model has a large impact on the quality of the resulting classification and predictions. Ideally, the training data should be representative of the data you are modeling. The Explanatory Variable Range Diagnostics table shows the minimum and maximum values of the subsets of data used for prediction.

    Explanatory Variable Range Diagnostics

    An optional bar chart displaying the importance of each variable used in the model is generated. The chart displays the variables used in the model on the y-axis and their importance based on the Gini coefficient on the x-axis.

  18. In the Contents pane, right-click the Summary of Variable Importance chart and choose Open.

    Since your chart is showing a large number of predictors, you may want to maximize the window to visualize the variable importance chart. Remember to minimize or close the window when done reviewing.

    Summary of Variable Importance chart

    The strength of the Forest-based method is in capturing commonalities of weak predictors (or trees) and combining them to create a powerful predictor (the forest). If a relationship is persistently captured by singular trees, there is a strong relationship in the data that can be detected even when the model is not complex. Adjusting the forest parameters can help create a high number of weak predictors, resulting in a powerful model. You can create weak predictors by using less information in each tree. This can be accomplished by using any combination of a small subset of the features per tree, a small number of variables per tree, and a low tree depth. The number of trees controls how many of these weak predictors are created, and the weaker your predictors (trees), the more trees you need to create a strong model.

  19. In the Contents pane, right-click the Train_variable_importance table and choose Open.
  20. In the Train_variable_importance table, right-click the IMPORTANCE field and choose Sort Descending.

    Train_variable_importance table

  21. For comparison, reposition the chart and the table next to each other.
  22. In the table, select the top five rows.

    Variable_importance chart and table

    In the table and the chart, notice that tasmax (Daily Maximum Near-Surface Air Temperature, or simulated maximum temperature) is the most influential predictor, indicating that it has the largest influence with regard to predicting observed temperature at weather stations. In addition, it is interesting to note that the top five most important variables encapsulate about 90 percent of the overall importance, meaning that they are responsible for about 90 percent of the splits that occur in Forest-based Classification and Regression. When you review the variable descriptions, you see that the majority of these variables are related to surface and air temperature.

    tasmax: Daily Maximum Near-Surface Air Temperature (K)
    rlutcs: TOA Outgoing Clear-Sky Longwave Radiation (W m-2)
    rlus: Surface Upwelling Longwave Radiation (W m-2)
    ts: Surface Temperature (K)
    tas: Near-Surface Air Temperature (K)

  23. In the Train_variable_importance table, clear the selection.
  24. Close the Summary of Variable Importance chart.
  25. Save the project.

    In the next step, you will evaluate the impact of removing the seemingly insignificant variables.

Remove insignificant variables

Before you remove insignificant variables, you will evaluate the R-Squared (R2) distribution for the 100 runs executed by the Forest-based Classification and Regression tool.

  1. In the Random Forest Regression Map, in the Contents pane, right-click the r2_table table and choose Open.
  2. Review the table and note how many of the 100 records have values close to 1.

    The values in the table capture two aspects of the regression model, precision and accuracy. For a precise model, similar R-squared values are expected, and for an accurate model, high R-Squared values are expected for different runs. A model centered on an R-Squared value close to 1 that is also precise means that random sampling of the data does not result in a bad model.

    The R2 table statistics are as follows:

    • Mean = 0.9214
    • Min = 0.9105
    • Max = 0.9308

    Note that the R2 values in your table range from 0.91 to 0.93, which means that the model is both accurate and precise.

    R2 table

    Note:

    A Random Forest model in which only a few runs produce high R-Squared values, that is, one whose R-Squared values vary widely between runs, is an indicator that the Forest-based regressor might be overfitted to the data and thus may not generalize sufficiently to predict a different dataset. However, none of these metrics guarantee the performance of Random Forest on a dataset where the true answer is not known.

  3. In the Contents pane, right-click the Validation R2 chart and choose Open.

    R2 Validation chart

    The distribution of R-Squared values shows that the majority of the models have high predictive power: the model is both accurate and precise. The R-Squared plot is an indication of the stability of the Random Forest under randomized training data; a high mean R-Squared value with small variance indicates a good model.

  4. In the Train_variable_importance table, right-click the PERCENTAGE field and choose Sort Descending.

    The PERCENTAGE field maintains the percent of importance a variable has in predicting average monthly temperature.

  5. Review the table and focus on the values in the PERCENTAGE field. Note that the records have values ranging from 27.393 to 1.408.

    Variable percentage importance

    Various methods for reducing the number of predictors exist. In this case, you will use importance as the primary factor to eliminate variables in Random Forest that are not impactful. Here you will keep variables with high importance whose cumulative sum of importance value adds up to 90 percent. From a Random Forest perspective, this will keep variables that are responsible for 90 percent of the splits in the forest.
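    One way to choose that cutoff programmatically is to sort the importance table and keep variables until the cumulative percentage reaches 90. The following is a small sketch; the PERCENTAGE field is described above, but the name of the field holding the variable names (VARIABLE below) is an assumption, so check your train_variable_importance table and adjust it if needed.

      import arcpy
      import numpy as np

      # Read the importance table written by the previous run.
      rows = arcpy.da.TableToNumPyArray("train_variable_importance",
                                        ["VARIABLE", "PERCENTAGE"])

      # Sort by percentage importance, highest first.
      rows = rows[np.argsort(rows["PERCENTAGE"])[::-1]]

      # Keep variables until their cumulative importance reaches 90 percent.
      cumulative = np.cumsum(rows["PERCENTAGE"])
      keep = rows["VARIABLE"][cumulative <= 90]
      print(list(keep))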

  6. Close all tables and charts.

    Next, you will rerun Forest-based Classification and Regression with only the 11 most impactful variables.

  7. In the Forest-based Classification and Regression tool, set the following parameters.

    • For Prediction Type, choose Train only.
    • For Input Training Features, choose station_data_no_missing.
    • For Variable to Predict, choose T_MONTHLY_.
    • For Explanatory Training Variables, check prsn, ps, rldscs, rlus, rlutcs, rsdscs, rsdt, tas, tasmax, tasmin, and ts.

    Choose explanatory variables

  8. For Additional Outputs, do the following:

    • For Output Variable Importance Table, type variable_importance_reduced.
    • In Validation Options, do the following:
      • For Training Data Excluded for Validation (%), type 10.
      • For Number of Runs for Validation, type 100.
      • For Output Validation Table, type R2_Reduced.

    Set validation options

  9. Click Run.
  10. Click View Details on the lower left after the tool has executed.
  11. Resize and position the Details window to review the Model characteristics section.
  12. Review the Model Out of Bag Errors section.

    Model out of bag errors

    Recall that the Mean Squared Error (MSE) and percentage of variation values are based on the ability of the model to accurately predict the target variable from the observed values in the dataset. If these values changed drastically as the number of trees increased, it would indicate that the number of trees needs to be increased. In your current model, these values have not changed drastically.

  13. Review the Top Variable Importance section.

    Top variable importance

    The list shows all 11 variables, and once again, the top five variables represent 84 percent of the total importance. Some variables, such as rlus and ts, have switched ranking due to their close importance levels.

  14. In the Contents pane, below the variable_importance_reduced table, right-click the Summary of Variable Importance reduced chart and choose Open.
    Note:

    Your chart name may have been truncated, but this chart represents the values from the variable_importance_reduced table.

    Summary of Variable Importance reduced chart

    Review the chart and note that the variable importance is stable when the subset of 11 predictors is used. Some variables have switched ranking due to their close importance levels, but overall these 11 variables appear sufficient to accurately predict the target variable.

  15. Close the Summary of Variable Importance reduced chart.
  16. In the Contents pane, below the R2_Reduced table, right-click the Validation R2 chart and choose Open.

    Validation R2 Chart

    Notice that the average R2 value is the same as in the previous model, but the range of values observed is smaller as a result of the smaller number of predictors.

  17. Close the Validation R2 chart.

    In the next lesson, you will use a Jupyter Notebook to automate and execute the Forest-based Classification and Regression tool multiple times. To simplify access to your project data, you should consolidate all project data into a single location.

  18. In the Geoprocessing pane, type Consolidate Map.
  19. In the Consolidate Map tool, set the following parameters:

    • For Input Map, choose Random Forest Regression Map.
    • For Output Folder, browse and select C:\ClimateAnalysis, for example, or the project location you selected in the first lesson.
    • For Extent, click the drop-down menu and choose the US_polygon layer.
    • Check Convert data to file geodatabase.

    Consolidate all map data

  20. Click Run.

    All source data accessed in the Random Forest Regression Map will be consolidated into a new geodatabase named downscaledclimate.gdb.

  21. In the Catalog pane, if necessary, add a folder connection to your output location (for example, C:\ClimateAnalysis).
  22. Review and expand your folder connection and note the p20 folder.

    This folder contains your consolidated data and will be referenced in the Jupyter Notebook.

  23. Expand the downscaledclimate.gdb geodatabase and ensure it contains the US_polygon and station_data_no_missing feature classes.
  24. Before using the Jupyter Notebook, you must download a collection of NetCDF files containing GCM variable data needed for additional analysis.

    The NetCDF files are maintained in a zipped folder in the Learn ArcGIS organization and must be downloaded and unzipped locally to your computer. Ensure that you have sufficient space, as the zipped file is 1.83 GB and the resultant NETCDF folder is 2.60 GB.

  25. On the Downscale Climate Data - NetCDF Data Files page, click Download.

    Download NetCDF files

  26. Unzip and copy your netcdfs folder and its contents to your p20 folder.

    Add NetCDF files to analysis folder

    With all your data downloaded and consolidated, you are ready to automate temperature time snapshot estimation. The last step is to create a geodatabase to store your results and outputs.

  27. Right-click the p20 folder, click New, and choose New File Geodatabase.

    Add new database

  28. Rename the geodatabase to Default.

    Rename geodatabase

  29. Save the project.

The flexibility of Random Forest and its high prediction accuracy for average monthly temperature make it the best choice among all the regression methods you investigated.

The spatial resolution of GCMs is generally quite coarse; while they are still valuable predictive tools, they cannot account for fine-scale heterogeneity of climate variability and change. Various methods have been developed to bridge the gap between what GCMs can deliver and what society, businesses, and stakeholders need for decision making. Next, you'll use Python integration to automate the Forest-based Classification and Regression model developed in this lesson to predict average monthly temperature values at different time periods.


Automate temperature estimation

Previously, you compared various regression methods, and you now have a Random Forest model for all weather stations that is trained on GCM variables to predict local average monthly temperatures. The source dataset essentially contains information from the different weather stations recorded at various times. In effect, Random Forest has learned different patterns in space and time (although not explicitly stated in the model) between GCM variables and average temperature.

Since you are done training an accurate model, next you'll execute climate downscaling at discrete time snapshots. You'll do a five-year-long prediction of average monthly temperatures, meaning you will predict temperatures for 60 time periods. You will need an efficient way to perform the prediction and aggregate the results for further investigation.
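
As a point of reference, the prediction window spans 60 consecutive months beginning in February 2006, the same start date used by the notebook later in this lesson. A minimal sketch of those time slices, assuming that start date:

    import datetime
    from dateutil.relativedelta import relativedelta

    ## Enumerate the 60 monthly time slices, starting from February 2006
    start_time = datetime.datetime(2006, 2, 1)
    time_slices = [start_time + relativedelta(months=i) for i in range(60)]
    print(time_slices[0].date(), time_slices[-1].date())   # 2006-02-01 2011-01-01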

Since the Forest-based Classification and Regression model learned different relationships between different GCM variables and observed temperature at different locations and times, you'll now use it for predicting temperature snapshots. For instance, the relationship between snow thickness and average temperature in the United States is different in winter than it is in summer (when the snow thickness is 0). Now you'll generate a downscaling model that can predict temperature accurately for any given month of the year.

Install and set up Jupyter Notebook

In this section, you will build a Python workflow to predict temperature at monthly increments using the model you trained and improved in the previous lesson. You will use Jupyter Notebook, a browser-based environment that can run locally or remotely and that provides a convenient way to interact with code while sharing and documenting your work. To harness the power of the new libraries you installed in the first section, you will launch the Jupyter Notebook that the Python package manager set up for the Conda environment you created.

  1. Using Windows File Explorer, browse to %LOCALAPPDATA%\Esri\conda\envs\climate_downscaling.

    The system variable %LOCALAPPDATA% is a substitute for the following path: C:\Users\YourUserFolderName\AppData\Local\.

    Browse to cloned Python environment

    You may recall that in the first lesson you cloned your original Python environment and named the clone climate_downscaling. You then added more libraries to the base Python packages when you installed the space-time-learn-lesson Conda package. In the following steps, you will begin to access these additional libraries and tools.

  2. In File Explorer, browse to the Scripts folder and copy the path, for example, C:\Users\YOURUSERNAME\AppData\Local\ESRI\conda\envs\climate_downscaling\Scripts.

    YOURUSERNAME is a generic replacement for your own user name.

    Copy and identify scripts location

  3. Next, browse to your project folder (for example, C:\ClimateAnalysis) and create a new subfolder named notebooks.

    Create notebook folder

  4. In Windows Search, type cmd, and then click Command Prompt from the search results to launch the command line interface.

    Open Command Prompt

  5. In Command Prompt, change the directory to your notebooks folder. For example, type cd C:\ClimateAnalysis\notebooks.

    Change directory to project location

    Next you will open Jupyter Notebook.

  6. In Command Prompt, paste the previously copied path and type jupyter-notebook at the end of the path. Your command should look as follows: C:\Users\YOURUSERNAME\AppData\Local\ESRI\conda\envs\climate_downscaling\Scripts\jupyter-notebook.
  7. Press Enter to execute the command and open the notebook.

    Jupyter Notebook runs in a web browser and will open within a tab in your default browser at the following address: http://localhost:8888. Localhost is not a website; it indicates that the content is being served from your local machine, your own computer. Jupyter Notebook and its dashboards are web apps, and Jupyter starts a local Python server to serve these apps to your web browser, making the notebook essentially platform independent and easier to share on the web.

    Review Jupyter Notebook interface

    In the notebook dashboard, the Files tab is where all your files are kept, the Running tab keeps track of all your processes, and the third tab, Clusters, is provided by IPython's parallel computing framework and lets you control individual engines, which are extended versions of the IPython kernel.

  8. In the notebook, on the Files tab, click the New drop-down button in the upper right corner and select Python 3 to add a new Python notebook. Your first Jupyter Notebook will open in a new tab.

    Create a Python-based notebook

    Each notebook uses its own tab because you can open multiple notebooks simultaneously. If you switch back to the dashboard, you will see the new file Untitled.ipynb, and you should see some green text that tells you your new notebook is running.

  9. In the notebook, click Untitled.
  10. In the Rename Notebook window, type ClimateAnalysis.

    Rename notebook

  11. Click Rename to apply the new notebook name.

    Now that you have named your new notebook, its interface should not look entirely unfamiliar, as it is essentially an advanced word processor. Explore the menus to get a feel for it and scroll down the list of commands in the command palette, which is the small button with the keyboard icon.

    Note:

    For reference, you can download the complete ClimateAnalysis notebook. After downloading, unzip the notebook in your notebooks folder (for example, C:\ClimateAnalysis\notebooks).

  12. Click the Cell menu.

    A cell forms the body of a notebook and serves as a container for text to be displayed in the notebook or code to be executed by the notebook's kernel. The first cell in a new notebook is always a code cell.

    Review Cell menu

    You will switch from code cell to a cell containing formatted text using Markdown.

  13. In the notebook interface, locate Code. Click the drop-down arrow and choose Markdown to switch the cell type.

    Change from code to markdown

  14. In the cell, insert the following title: ## Monthly Automated Downscaling and press Enter.
  15. On the second line of the cell, insert the following description: In this notebook you will perform monthly downscaling using the ArcGIS Forest based classification and regression model. The model will be trained using spatio temporal data for a simulated global circulation model with variables to predict average monthly temperature. Press Enter.

    Insert description and title

  16. In the notebook interface, click Run, or press Shift+Enter.

    Run cell to format markdown text

    When you ran the cell, your raw text was rendered as formatted Markdown text embedded in your notebook. In addition, a second cell was generated. Notice the In [ ] label in front of the code cell. The In part of the label is short for Input. When you execute a code cell, the label updates to In [1], In [2], and so on, reflecting the kernel's execution count when that cell was run.

  17. Click File and choose the Save and Checkpoint button to save your changes.
    Note:

    It’s best practice to save regularly. Pressing Ctrl+S will save your notebook by calling the Save and Checkpoint command. Every time you create a new notebook, a checkpoint file is created along with your notebook file. By default, Jupyter will autosave your notebook every 120 seconds to this checkpoint file without altering your primary notebook file. When you use Save and Checkpoint, both the notebook and checkpoint files are synchronized. Hence, the checkpoint enables you to recover your unsaved work in the event of an unexpected issue.

  18. In the notebook, select the new blank cell labeled In [ ]. In the notebook interface, ensure Code is selected.
  19. Click in the blank cell and paste the following code:

    Import spatial machine learning libraries and utility functions

    import arcpy
    import numpy as np
    import os
    from dateutil.relativedelta import relativedelta
    import datetime
    arcpy.env.overwriteOutput = True

    This imports the spatial machine learning library (arcpy) and the utility functions you will use throughout the exercise.

  20. In the notebook interface, click Run.

    The cell executes and a new blank cell is added. Note the updated cell label In [1].

  21. Click in the new blank cell and ensure that the cell type is set to Markdown.
  22. In the cell, enter the following explanation: Define input and output gdbs for the analysis.
  23. Click Run.

    Add explanatory text

    Notice that markdown cells are not assigned a label.

  24. In the blank cell, copy and paste the following Python code to define input and output locations for your Python workflow. This block of code also defines, for the forest-based regression function, whether each input variable is continuous or categorical.
    Note:

    Data paths in the code have been set to the following:

    • C:\ClimateAnalysis\p20\downscaleclimate.gdb
    • C:\ClimateAnalysis\p20\default.gdb
    • C:\ClimateAnalysis\p20\netcdfs

    If your source data is in a different location, change the data paths before running the code.

    Set Input and Output data paths and variables

    arcpy.env.overwriteOutput = True
    in_gdb = r"C:\ClimateAnalysis\p20\downscaleclimate.gdb"
    out_gdb = r"C:\ClimateAnalysis\p20\Default.gdb"
    netcdf_dir = r'C:\ClimateAnalysis\p20\netcdfs'
    
    ## Define atmospheric variables available for analysis
    variables = ["ccb", "cct", "ci", "clivi", "clt", "clwvi", "evspsbl", "hfls", "hfss", "huss", "pr", "prc", "prsn", "prw", "ps", "rlds", "rldscs", "rlus", "rlut", "rlutcs", "rsds", "rsdscs", "rsdt", "rsus", "rsuscs", "sbl", "sfcWind", "tas", "tasmax", "tasmin", "tauu", "tauv", "ts"]
    in_fc = "station_data_no_Missing"
    dist_feature = "Large_Water_Body"
    predict_var = "T_MONTHLY_"
    
    ## Create a dictionary to define if a predictor is categorical or not
    is_categorical = {"ccb":"false", "cct":"false", "ci":"false", "clivi":"false", "clt":"false", "clwvi":"false", "evspsbl":"false", "hfls":"false", "hfss":"false", "huss":"false", "pr":"false", "prc":"false", "prsn":"false", "prw":"false", "ps":"false", "rlds":"false", "rldscs":"false", "rlus":"false", "rlut":"false", "rlutcs":"false", "rsds":"false", "rsdscs":"false", "rsdt":"false", "rsus":"false", "rsuscs":"false", "sbl":"false", "sfcWind":"false", "tas":"false", "tasmax":"false", "tasmin":"false", "tauu":"false", "tauv":"false", "ts":"false"}
    Note:

    White space in Python is important, so when you copy and paste this code, ensure all indents are preserved or it won't run correctly.
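
    Optionally, before running the cell, you can append a quick consistency check (illustrative only; it assumes the variables and is_categorical names from the cell above) to confirm that every atmospheric variable has a categorical flag defined:

    ## Optional sanity check: every variable listed has a categorical flag
    assert set(variables) <= set(is_categorical), "missing categorical flags"
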
  25. After the code is inserted and data paths checked, run the code block. A new cell is automatically added after you have run the current cell.

    Next, you will define a utility function to process the input string for Random Forest to append input variables into a string for the forest-based regression tool to process.

  26. In the next blank cell, copy and paste the following Python code. The code is a utility function to create inputs to the forest-based regression tool.

    Setup for Random Forest Regression

    ## Utility function for defining input string for Random Forest Regression
    def prepare_rf_input(vars, category_dict):
        input_string = [var + ' ' + category_dict[var] for var in vars]
        return ';'.join(input_string)
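
    To see what the helper produces, you could run a quick test with illustrative values in a scratch cell; the function joins each "name flag" pair into the semicolon-delimited string that the forest-based regression tool processes.

    ## Illustrative usage of prepare_rf_input (values are examples only)
    print(prepare_rf_input(["tas", "pr"], {"tas": "false", "pr": "false"}))
    ## -> tas false;pr false
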
  27. Add a new blank cell and ensure that the cell type is set to Markdown.
  28. In the cell, enter the following text: You will perform predictions over the contiguous U.S. For this reason you will define an environment mask that covers the lower 48 states. In addition, the forest-based regression tool can create raster output for predicted temperature. You can also define an environment setting for the default output raster cell size.

    Add markdown text

  29. After the text is inserted, run the cell.

    Next, you will define environment variables for forest-based regression to interact with.

  30. In the next blank cell, copy and paste the following Python code.

    Define Random Forest Regression Environment Variables

    #####################################################################
    ######## Perform Temperature Downscaling at Every Time Slice ########
    #####################################################################
    ## Define Environment Variables for Random Forest Tool
    arcpy.env.workspace = in_gdb
    arcpy.env.mask = os.path.join(in_gdb, 'US_polygon')
    arcpy.env.cellSize = 0.5
    
    ## Define and select the station data to be subsetted at every time-step
    start_time = datetime.datetime(2006,2,1)
    selected_stations = 'station_select'
    arcpy.management.MakeFeatureLayer(in_fc, selected_stations)
  31. After the code is inserted, run the code block.
  32. In a new markdown cell, enter the following text: As a next step you will need to subset the station data to fit a model at every time step to make a prediction. Your code will read simulated NetCDF files for every global climate variable and extract the slice corresponding to the time step of interest.

    Next, you will add code to locate the NetCDF files in the provided directory.

  33. In a new code cell, copy and paste the following Python code to define the NetCDF file to be read.

    Define a list of NetCDF files

    ## Define NetCDF files to loop through
    netcdfs = [file for file in os.listdir(netcdf_dir) if file.endswith('.nc')]
    ## List to store variable importance
    variable_importance = []
  34. After the code is inserted, run the code block.

    The next block of code you will add to the ClimateAnalysis notebook does the following:

    • Selects weather stations for a given month
    • Reads in the NetCDF files for the same month and maintains them in memory as rasters
    • Executes a Forest-based Regression Model
    • Creates prediction rasters for the contiguous United States
    • Saves resulting rasters in a geodatabase
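
    Before adding the block, this small sketch (with two illustrative raster layer names) shows the format of the two strings that the loop assembles for the Forest tool: one listing the prediction rasters and one matching each raster to itself.

    ## Illustrative only: how the raster input and match strings are built
    raster_input_list = ["tas_raster", "pr_raster"]
    raster_input_string = ' #;'.join(raster_input_list) + ' #'
    raster_match_string = ';'.join(r + ' ' + r for r in raster_input_list)
    print(raster_input_string)    # tas_raster #;pr_raster #
    print(raster_match_string)    # tas_raster tas_raster;pr_raster pr_raster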

  35. In a new code cell, copy and paste the following Python code.

    Execute a Forest-based Regression Model

    for time_slice in range(60):
        raster_input_list = []
        print('Time Slice = {0}'.format(time_slice))
    
        ## Select Weather Stations that have date at time 
        current_time = start_time + relativedelta(months=time_slice)
        arcpy.management.SelectLayerByAttribute(selected_stations, 'NEW_SELECTION', "LST_YRMO_Converted = date '{0}'".format(current_time.strftime("%m/%d/%Y")))
    
        for netcdf in netcdfs:
            
            var_name = netcdf.split('_')[0]
            if var_name in ['prsn', 'sbl']:
                continue
            else:
                #### Create temporary prediction rasters from NetCDFs
                arcpy.md.MakeNetCDFRasterLayer(os.path.join(netcdf_dir, netcdf), var_name, "lon", "lat", var_name + "_raster", None, "time " + str(time_slice + 1), "BY_INDEX", "CENTER")
                raster_input_list.append(var_name + "_raster")
    
        ## String input for input rasters
        raster_input_string = ' #;'.join(raster_input_list)
        raster_input_string += ' #'
        ##String input for matched rasters
        raster_match_string =  [raster_string + ' ' + raster_string for raster_string in raster_input_list]
        raster_match_string = ';'.join(raster_match_string)
    
        ## Perform forest-based regression on selected weather stations and GCM rasters
        arcpy.stats.Forest("PREDICT_RASTER", selected_stations, predict_var, None, None, None, raster_input_string, None, None,  os.path.join(out_gdb, "temp_pred_" + str(time_slice)), None, None, raster_match_string, None, "var_importance", "TRUE", 100, None, None, 100, None, 10, None, None, None, 1)
    
        variable_importance.append(arcpy.da.TableToNumPyArray("var_importance", ('*')))

    The code added to the last cell will execute the Forest-based Regression Model 60 times, once for each time slice. The code and notebook are provided for you as a template to analyze your own data if you choose. You do not need to execute the code to continue. If you choose to run the notebook, click the Save and Checkpoint button to save your changes, and then run the code block.

    Note:

    Running the code will take some time. You may choose to skip running this code and use the resultant rasters provided. If you do run the code, your Jupyter Notebook will create 60 snapshots of downscaled temperatures at different time slices in your Default.gdb. In the next lesson, you will use these rasters in a mosaic dataset and later for further analysis with R tools.

  36. If you would like to complete the lesson without waiting for the code to finish, it is recommended that you download the raster results.

    Download the compressed raster results.

  37. After downloading, unzip the rasters.zip folder contents and copy the RasterResults.gdb to your p20 folder.
  38. In ArcGIS Pro, in the Catalog pane, in Folders, right-click your folder connection and choose Refresh, and then locate RasterResults.gdb (or your Default.gdb if you ran the code).
  39. Locate and expand RasterResults.gdb (or your Default.gdb if you ran the Jupyter Notebook code) and verify that your geodatabase has 60 rasters added.

    Raster names start with temp_pred and iterate from 0 to 59. Remember that these rasters represent a five-year-long prediction of average monthly temperatures, which is why you have 60 time slices.
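
    If you want to confirm the count programmatically (for example, in the ArcGIS Pro Python window), a minimal sketch follows; the path is an example, and you would point the workspace at Default.gdb instead if you ran the notebook code.

    import arcpy

    ## Example path; use Default.gdb instead if you ran the notebook yourself
    arcpy.env.workspace = r"C:\ClimateAnalysis\p20\RasterResults.gdb"
    rasters = arcpy.ListRasters("temp_pred*")
    print(len(rasters))   # expect 60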

Downscaling can be performed on spatial and temporal aspects of climate projections. Spatial downscaling refers to the methods used to derive finer-resolution spatial climate information from coarser-resolution GCM output. Temporal downscaling refers to the derivation of fine-scale temporal information from coarser-scale temporal GCM output (for example, daily rainfall sequences from monthly or seasonal rainfall amounts). In this lesson, you used Jupyter Notebook to automate a downscaling workflow and create time-discrete estimations of downscaled average monthly temperature profiles over the contiguous United States. Visualizing and processing 60 rasters is challenging. Next, you'll use a mosaic dataset to process and display the rasters.


Create a time series mosaic

Previously, you generated a series of 60 rasters, representing a five-year-long prediction of average monthly temperatures. Next, you'll use a time series mosaic to combine the rasters into one data structure and add time stamps for each of the rasters. Time series rasters provide an easy way to visualize and manage time-tagged rasters.

Build a time series mosaic

Mosaic datasets are used to manage, display, serve, and share raster data. When you create a new mosaic dataset, it is created as an empty container in the geodatabase with some default properties to which you can add raster data. In this lesson, you will create an empty mosaic dataset and then define fields that allow managing time stamps of rasters and store them in a unified data structure.

  1. In the Catalog pane, under Folders, right-click Default.gdb, choose New, and then choose Mosaic Dataset.

    Create new mosaic dataset in geodatabase

    This will open the Create Mosaic Dataset tool. The output location for the mosaic is automatically set to the geodatabase you have selected.

    Note:

    Mosaic dataset source rasters do not need to be in the same location as the mosaic dataset and are often maintained as raster files in folders. In your example, the source rasters were output and stored in a file geodatabase as single-band, 32-bit floating point rasters with LZ77 compression applied.

  2. In the Create Mosaic Dataset tool, set the following parameters:

    • For Output Location, verify default.gdb is set.
    • For Mosaic Dataset Name, type temperature_time_series.
    • For Coordinate System, choose GCS_WGS_1984.

    Create Mosaic Dataset tool

  3. Click Run.

    Your empty mosaic dataset will be generated and a new group layer named temperature_time_series will be added to the Contents pane.

  4. From the Catalog pane, right-click the temperature_time_series mosaic dataset and select Add Rasters.

    The Add Rasters to Mosaic Dataset tool opens.

    Add Rasters

  5. In the Add Rasters to Mosaic Dataset geoprocessing tool, set and verify the following parameters:

    • Verify Mosaic Dataset is set to temperature_time_series.
    • Verify Raster Type is set to Raster Dataset.
    • Verify Processing Templates is set to Default.
    • For Input Data, select Dataset.

    Select input data type

  6. For Input Data, click the Browse button. Locate RasterResults.gdb or Default.gdb and select all 60 temp_pred rasters, and then click OK.

    Browse and select rasters to add

    The Add Rasters to Mosaic Dataset tool updates to display the list of raster datasets that will be included in the target mosaic dataset.

    Verify rasters added to mosaic dataset

  7. In the Add Rasters to Mosaic Dataset tool, expand Raster Processing. Check Calculate Statistics.

    Check to calculate statistics

  8. Expand Mosaic Post-processing. Verify the following are checked:

    • Update Cell Size Ranges
    • Update Boundary
    • Estimate Mosaic Dataset Statistics

    Check additional processing actions

  9. Click Run.

    The tool executes and adds the temp_pred rasters to the temperature_time_series mosaic dataset.

  10. Review the map and notice that it appears to display only 1 of the 60 rasters that make up the mosaic dataset. Currently, all rasters are coincident and represent the same extent but for different time slices.

    Review mosaic dataset in map

    Next, you will need to add fields to the mosaic dataset to register the time component of all rasters. You will define three new fields in the mosaic dataset to describe the name of the variable, dimension of the input, and the value of the time field. In a time series, dimension represents time; in this instance, you will add and populate a separate time field and then convert it to text for the dimension value.
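
    If you prefer to script the field creation instead of using the Add Field dialog, a minimal arcpy sketch follows; the mosaic dataset path is an example, and the field names and lengths match the steps below.

    import arcpy

    ## Example path to the mosaic dataset created earlier
    mosaic = r"C:\ClimateAnalysis\p20\Default.gdb\temperature_time_series"
    arcpy.management.AddField(mosaic, "Variable", "TEXT", field_length=20)
    arcpy.management.AddField(mosaic, "Dimensions", "TEXT", field_length=20)
    arcpy.management.AddField(mosaic, "StdTime", "DATE")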

  11. On the ribbon, on the Analysis tab, click the Tools button. In the Geoprocessing pane, type Add Field.
  12. In the Add Field tool, create a field with the following parameters:

    • Set Input Table to temperature_time_series.
    • For Field Name, type Variable.
    • For Field Type, choose Text.
    • For Field Length, type 20.

    Add variable field

  13. Click Run.
  14. Add a second field with the following parameters:

    • Set Input Table to temperature_time_series.
    • For Field Name, type Dimensions.
    • For Field Type, choose Text.
    • For Field Length, type 20.

    Add dimensions field

  15. Add a third field with the following parameters:

    • Set Input Table to temperature_time_series.
    • For Field Name, type StdTime.
    • For Field Type, choose Date.

    Add standard time field

    Next, you will populate the fields with attributes that will enable a time series. First, the Variable field will be set to a string that describes the name of rasters (temp_pred).

  16. In the Geoprocessing pane, type Calculate Field.
  17. In the Calculate Field tool, set the following parameters:

    • Set Input Table to temperature_time_series.
    • For Field Name, select Variable.
    • For Expression, choose Name and add [0:9].

    Note:

    Verify that the expression is defined as Variable = ! Name ! [ 0 : 9 ]. This expression will retain the first nine characters of the Name field and strip away all characters beyond nine to retain only the string "temp_pred".

    Calculate Variable field values

  18. Click Run and verify in the temperature_time_series table that the Variable field has been updated.

    Next, you will associate distinct time stamps to every raster. Remember the start time you used in the previous section; here you will use the same start time and increment it to define a time stamp for every raster.

  19. In the Calculate Field tool, set the following parameters:

    • Set Input Table to temperature_time_series.
    • For Field Name, select StdTime.
    • For Expression, type calcTime and add ( !Name! ).
    • In Code Block, copy and paste the following expression:

      Calculate time

      import datetime
      from dateutil.relativedelta import relativedelta
      
      start_time = datetime.datetime(2006,2,1)
      def calcTime(var):
          increment = int(var.split('_')[-1])
          current_time = start_time + relativedelta(months=increment)
          return current_time.strftime("%m/%d/%Y")

    Calculate standard time values

  20. Click Run and verify in the temperature_time_series table that the StdTime field has been populated.
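
    As an illustrative check of the expression, the following self-contained sketch restates the code block's logic and shows the date that calcTime returns for a few raster names; the raster names are examples that follow the temp_pred naming pattern.

    import datetime
    from dateutil.relativedelta import relativedelta

    start_time = datetime.datetime(2006, 2, 1)

    ## Same logic as the Calculate Field code block above
    def calcTime(var):
        increment = int(var.split('_')[-1])
        return (start_time + relativedelta(months=increment)).strftime("%m/%d/%Y")

    print(calcTime("temp_pred_0"))    # 02/01/2006
    print(calcTime("temp_pred_12"))   # 02/01/2007
    print(calcTime("temp_pred_59"))   # 01/01/2011
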
  21. Execute the Calculate Field tool a third time with the following parameters:

    • Set Input Table to temperature_time_series.
    • For Field Name, select Dimensions.
    • For Expression, select StdTime.

    Verify that the expression is defined as Dimensions = ! StdTime ! . This ensures that the third dimension for the mosaic dataset will be time.

    Calculate Dimensions field values

  22. Click Run and verify in the temperature_time_series table that the Dimensions field has been populated.

    After successfully creating your time series mosaic, you can now visualize time-stamped rasters.

  23. On the Map tab, in the Layer group, click the Basemap drop-down arrow. From the list of basemaps, select Terrain with Labels.

    Change basemap to Terrain with Labels

  24. In the Contents pane, right-click the temperature_time_series layer and choose Properties.
  25. In the Layer Properties pane, select Time.
  26. In the Time properties pane, set the following properties:

    • For Layer Time, choose Each feature has a single time field.
    • For Time Field, verify StdTime is selected.
    • For Time Extent, verify that the start time is set to 2/1/2006 and the end time is set to 1/1/2011.

    Set time properties

  27. Click OK to apply time property updates.

    The Map view updates to display a time slider that can be used to step through the individual time slices or all slices of the mosaic dataset.

    Map view with time slider

    The current time slice you are viewing is a result of the temporal downscaling executed in the previous lesson. The data represents a derivation of fine-scale temporal information from coarser-scale temporal GCM values. The dark-to-light gradation you are seeing represents an estimation of average monthly temperature profiles over the contiguous United States, with darker areas representing colder estimated temperatures and lighter areas representing warmer ones.

  28. On the time slider, click the Play button to automatically advance the display of time slices.

    Play button on time slider

    The map updates iteratively as new time slices fade in and old ones fade out. Spend a few minutes exploring the time slices and the time slider.

    Explore time slices

  29. Review the Time ribbon for additional Playback, Step, and Full Extent functionality if necessary.

    Step, Playback, and Full Extent functionality

  30. Click Save to save the project.

Next, you may choose to create an animation to visualize and share your time series mosaic, or continue to set up the R-ArcGIS bridge and perform functional data analysis to decompose time signals at every location into simpler analytical functions.

Create an animation of temperature time-snapshot estimations

Next, you'll create an animation of the temperature time-snapshot estimations contained in the mosaic dataset. You'll then export a video of the animation so that you can share the results of your work with colleagues and the public. This is an excellent way to illustrate your modeled temperature estimates across the United States.

Note:

If you only want to review the animation, watch the video. If you would like to create the animation yourself and export the result to your own video, continue with the steps that follow. Otherwise, skip ahead to setting up the R-ArcGIS bridge.

  1. In the Contents pane, right-click Station_data_no_missing and choose Zoom To Layer.
  2. In the Contents pane, click the temperature_time_series layer to select it.
  3. In the map display, ensure the time slider for the temperature_time_series layer is set to the start date 2/1/2006.

    Set time slider to start date

    Your map should display only the first raster in the time series, and as you move the slider, the map will update to display the appropriate raster for the time slice you have moved to. Creating an animation for each of the time slices is a useful way to visualize and share the various model results.

  4. On the ribbon, on the View tab, in the Animation group, click Add.

    Add an animation

    The Animation Timeline pane appears and the active ribbon tab changes to the Animation tab.

    Note:

    By default, video export is configured for YouTube using an MP4 video file format with a resolution of 1280x720. This means that your map display will be clipped (view-clipping) to indicate which section of the view will be captured in the video, based on the aspect ratio of the export format. As a result, you may want to adjust the extent of the map to display within the clipped view.

  5. On the Animation tab, in the Create group, for Append Time, update time to 00:03.000. This will set the duration of each frame generated to 3 seconds. You can adjust this later if needed.
  6. On the Animation tab, in the Create group, click Import and choose Time Slider Steps.

    Import time slider steps

    Note:

    If your Animation Timeline pane is not visible, click the Timeline button in the Playback group on the Animation tab.

    Since your mosaic dataset consists of various time slices, a sequence of 32 frames will be generated and displayed side by side in the Animation pane. Additional frames are automatically added to accommodate a start and end keyframe in the animation.

    Animation keyframes generated from time slices

    Note:

    Changing the step interval, number of steps, or direction of playback for your animation affects the total number of frames generated.

  7. If necessary, dock the Animation Timeline pane below the Map scene.

    Your animation frames may initially appear blank, but as the frames are generated for each time slice, they will gradually fill and update with previews of the temperature time-snapshot estimation at different times. It may take several minutes for all frames to load.

    Animation Timeline pane

  8. In the Animation Timeline pane, verify that all the frames have been generated, and then scroll to the right until you can see frame 32. Note the playback time total of roughly 76 seconds.

    Scroll to the last frame.

  9. On the ribbon, on the Animation tab, in the Playback group, note the Current and Duration fields of the animation.

    The duration of the animation is set to 1 minute 33 seconds (01:33.000); your duration may vary. Either way, this is longer than needed, so you'll shorten the animation duration to 30 seconds.

  10. For Duration, type 00:30.000 and press Enter.

    The duration between animation frames is adjusted so the total length of the entire animation is 30 seconds. Review the animation timeline and notice that the last frame (32) ends just past 28 seconds.

    Change the duration of the animation to 30 seconds.

    Next, you'll export the animation timeline to a movie.

  11. On the ribbon, on the Animation tab, in the Export group, click Movie.

    Export animation to movie.

    The Export Movie pane appears and allows you to specify options related to the exported video.

  12. In the Export Movie pane, if necessary, expand Movie Export Presets to reveal a list of predefined export presets. Change the following presets:
    • From the presets, click HD720.
    • For File Name, change the file path to the directory where you extracted your project (for example, C:\ClimateAnalysis), and then type TemperatureTimeSnapshotEstimation.mp4 for the video file name.
    • Expand File Export Settings and verify the Media Format is set to MPEG4 movie (.mp4).
    • For Frames Per Second, type 15.

    Configure the video settings.

    To save time in the export, your video uses a relatively low frame rate of 15 frames per second. To create a smoother video, you can increase the number of frames per second.

  13. Click Export.

    It may take a few minutes to export the video file. Your computer's processor and graphics card will directly influence how quickly your animation will export. In addition, the length of the animation and the size of the resolution will also impact how long it takes to generate each frame of the movie.

  14. When the video has finished exporting, click Play the video in the lower left of the Export Movie pane.

    The default video viewer of your computer opens to display your movie of Temperature Time Snapshot Estimation across the United States.

    Animation

    This video can now be shared and accessed by anyone. You can upload this video to ArcGIS Online, share it on YouTube, or show it to colleagues.

    Note:

    If Windows Media Player or another common video player is unable to play your exported movie from ArcGIS Pro, most likely you are missing the necessary codec file. The codec is a form of compression used to keep the video file size low. With the correct codecs installed, Windows Media Player will be able to play the supported movie formats.

    Before continuing, close the animation timeline and disable the time field. Since the time slider also serves as a method of feature selection, you need to ensure no time slices are currently selected.

  15. In the temperature_time_series layer properties, for Layer Time, change back to No Time. This will ensure the full time series is available.
  16. Save the project.

In this section, you created and explored the time series mosaic and created an animation of downscaled temperature profiles. In the next step, you will analyze the time series you created using the R-ArcGIS bridge.

Set up and use the R-ArcGIS bridge

In the previous section, you created a time series mosaic that contains your downscaled temperature profiles. The Conda environment you set up in the first lesson ships with R. In the first step, you will connect your R version in the Conda environment with ArcGIS Pro through the R-ArcGIS bridge.

To establish the connection, complete the following steps:

  1. On the ribbon, click the Project tab.

    Project tab

  2. In the Project pane, on the left, click Options and choose Geoprocessing.

  3. In the Geoprocessing options, locate R-ArcGIS Support.

    Geoprocessing options

    ArcGIS Pro automatically checks for an installed R home directory and for the arcgisbinding integration package (the R-ArcGIS bridge). If a valid R home directory is detected, you will see the path to the directory listed, for example, [R-3.5.2] C:\Program Files\R\R-3.5.2. In addition, the package version will be listed below the R home directory information.

    Note:

    • If arcgisbinding is found, you can check for updates, download the latest version, and update from a file.
    • If arcgisbinding is not found, you are presented with a warning and options to download and install the package.

  4. If you need to install arcgisbinding, click the drop-down arrow next to the warning and select Install package from the Internet.

    Install ArcGIS R integration

    A message appears asking if you would like to install the newest version.

  5. Click Yes on the prompt.

    Update arcgisbinding

  6. Review the installation summary window and click Close.

    Install message output

    In the first lesson, you installed a Conda package that installed R in your Conda directory. The R-ArcGIS bridge now identifies this installation and creates the connection between R and ArcGIS Pro, allowing you to use the R-ArcGIS bridge to analyze the time series data in your mosaic dataset.

    To perform the analysis, you will download and run an R Jupyter Notebook that contains the steps needed to complete functional data analysis (FDA).

    The notebook R code implements FDA to decompose the time signals at every location into simpler analytical functions. The weight associated with each function is a proxy for how salient that signature is in the time signal. Thus, you can decompose complex signals into simpler building blocks and perform analysis in this simplified domain.

    Decompose complex signals into building blocks.
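
    The lesson's FDA workflow runs in R, but the core idea can be sketched in a few lines of Python with a synthetic signal (illustrative only, not part of the notebook): each location's temperature series is approximated as a weighted sum of a constant plus sine and cosine terms, and the fitted weights are the scores you will analyze later.

    import numpy as np

    ## Synthetic monthly temperature signal over five years (60 slices)
    t = np.arange(60) / 12.0                     # time in years
    signal = 10 + 12 * np.cos(2 * np.pi * t)     # mean of 10 with an annual cycle

    ## Fourier-style basis: constant plus sin/cos at one and two cycles per year
    basis = np.column_stack([
        np.ones_like(t),
        np.sin(2 * np.pi * t), np.cos(2 * np.pi * t),
        np.sin(4 * np.pi * t), np.cos(4 * np.pi * t),
    ])
    coeffs, *_ = np.linalg.lstsq(basis, signal, rcond=None)
    print(coeffs.round(2))   # the constant and the annual cosine term dominate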

  7. Download the TimeSeriesDecomposition zipped folder.
  8. Locate the downloaded file on your computer and unzip it to your notebooks folder (for example, C:\ClimateAnalysis\notebooks).
    Note:

    Data paths in the code have been set to the following:

    • C:\ClimateAnalysis\p20\default.gdb
    • C:\ClimateAnalysis\p20\downscaleclimate.gdb

    If your source data is found at a different location, change the data paths before attempting to run the code.

  9. Open the TimeSeriesDecomposition notebook.
    Note:

    The analysis steps as executed by the code are as follows:

    • Install the analysis packages rgdal, fda, rts, and shiny.
    • Initialize connection to ArcGIS Pro and check licensing and version.
    • Define GDB raster location.
    • Read in time series mosaic and generate a raster list.
    • Define raster extent.
    • Establish forecast times.
    • Create a multidimensional raster stack to store temperature forecasts.
    • Visualize R space-time data structure.
    • Plot time snapshots of temperature predictions.
    • Plot maximum temperatures.
    • Define Fourier basis functions.
    • Create an empty target raster based on the mosaic dataset.
    • Write FDA scores and coefficients to a multiband raster.
    • Extract predicted temperature time series for each U.S. county polygon.
    • Calculate average time series value for every county.
    • Decompose county-level time series.
    • Perform FDA for every county.
    • Add base function coefficients to county polygons.

  10. Review the notebook code, update data paths if needed, and note that the code will perform functional data analysis to evaluate different signatures in the temperature profiles you predicted. In essence, the functional data analysis will decompose your complex time series into a simple composite analytical time series.
  11. When you are ready, run the notebook code.
  12. In the notebook, review the outputs.

    The code outputs are as follows:

    • Raster Time Series with monthly periodicity from 2006-02-01 to 2011-01-01
    • class: RasterStackTS
    • raster dimensions: 200, 100, 20000, 60 (nrow, ncol, ncell, nlayers)
    • raster resolution: 0.55, 0.125 (x, y)
    • raster extent: -125, -70, 25, 50 (xmin, xmax, ymin, ymax)
    • coord. ref. : +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
    • min values: -10 -10 -10 -10 -10 -10 -10 -10 -10 -10...
    • max values: 22 22 22 22 22 22 22 22 22 22...

    Temperature profiles of first year predictor

    Cyclic downscaled temperature variations

    This plot reflects temperatures predicted for every raster cell over time. Each line in the plot corresponds to the predicted temperature for a raster cell over a five-year period. Due to the cyclic nature of temperature, your plot of downscaled temperatures at every raster cell reflects variation via the peaks and troughs observed.

    Sine and cosine base functions used by Fourier basis functions

    In the previous plot, you observed the cyclic nature of temperature. As a result, you will use a Fourier basis function to decompose modeled temperature profiles as a combination of sine and cosine signals.

    In this step, you decomposed time series rasters into their building block scores. The magnitude of every score explains the similarity of the complex time series signal at that location to that basis function. For instance, if the sinusoidal basis function that has two peaks and one trough gets a high weight at a location, this implies that at that location the predominant temperature oscillation consists of two hot seasons and one colder season.

Define regions based on downscaled temperature profiles

In the previous step, you used a specialized R library to decompose the time series mosaics by using the R-ArcGIS bridge to access the mosaic dataset in your geodatabase. In this step, you will evaluate the scores defined by the FDA library. Note that in the previous step you used a Fourier basis, meaning that you decomposed the temperature time signals at a county level into weights of different basis functions that indicate different characteristics of downscaled temperature variation in that county.

Next, you need to identify distinct regions or clusters of time series profiles using the base functions defined in the R workflow described. If successful, your analysis will identify and group areas with distinct time series characteristics into clusters. For instance, locations that experience four seasons versus no seasons are expected to be in different groups. You will use a cluster analysis to discover these regions.

  1. If necessary, switch applications to ArcGIS Pro. Don't close the notebook.
  2. In the Catalog pane, in Folder connections, locate Default.gdb, and then add us_county_fda to the map.
  3. In the Contents pane, right-click us_county_fda and choose Attribute Table. In the table, scroll to the right and review the const, sin1, and cos1 fields.

    Review layer table

    Next, you will map the constant value.

  4. In the Contents pane, right-click us_county_fda and choose Symbology.
  5. In the Symbology pane, for Primary symbology, choose Graduated Colors.
  6. For Field, choose const.
  7. For Color scheme, choose Yellow to Red.

    Set layer symbology

    The map updates to display the constant for each county, which is indicative of the average value of the temperature time series for that county. Review the map and note that the decomposition of the predicted time series yields the expected profile for average temperature.

    Map of constant values per county

    Notice how the map shows a clear temperature gradient from south to north.

    South to north temperature gradients

    In addition, coastal areas show distinctly lower average temperatures.

    Coastal area effects

    Note:

    In the TimeSeriesDecomposition R notebook, in a new cell, add and execute the following code to plot the first basis function, which corresponds to the average temperature at a given location.

    Plot First base Function

    ## Explore Some of the Basis Functions
    t <- seq(from = 0, to = 1, by = 0.01)
    bvals = eval.basis(t, basis_funcs)
    
    ## The First Basis Function
    plot(t, bvals[, 1], type = "l", lwd = 4)

    Review the result.

    Plot of average temperature at a given location

  8. In the Symbology pane, change the field to sin1.

    Set sin1 field for display

    The map updates to display the sin1 values for each county. The map now shows high coefficient values for locations that show cyclicity throughout a large period of the year, such as areas with two distinctly different seasons.

    Counties displaying high cyclicity

  9. On the map, zoom to California.

    Notice the downscaled temperature patterns in California and how they show strong patterns due to distinct summer and winter temperature variations.

    Seasonal variation in California

    Review parts of the southeastern United States. These areas show how the signature decays radially for counties closer to the Atlantic Ocean and the Midwest.

    Note:

    In the TimeSeriesDecomposition R notebook, in a new cell, add and execute the following code to plot the second basis function.

    Plot Second base function

    ## The Second Basis Function
    plot(t, bvals[, 2], type = "l", lwd = 4)

    Review the result.

    Plot of basis function with signals showing increasing temperatures

    This basis function represents temperatures that increase right after January, decline midyear, and then rise again. Since such a profile is not expected in North America, this basis function is likely to receive small weights.

  10. In the Symbology pane, change the field to cos1.

    The map updates to show areas where large drops in temperature occur during the year.

    Areas with high temperature drops in a year

    Note that the northern United States shows this pattern of big temperature drops most clearly.

    Note:

    In the TimeSeriesDecomposition R notebook, in a new cell, add and execute the following code to plot the cos1 basis function.

    Plot cos1 basis function

    ## The cos1 Basis Function
    plot(t, bvals[, 3], type = "l", lwd = 4)

    Review the result.

    Plot of cos1 with temperature decline midyear

    This basis function shows a rapid decline in temperature midyear followed by temperatures going back up.

  11. In the Symbology pane, change the field to cos5.

    The map updates to show high-frequency variations of temperature throughout the year. Of particular interest here is the impact of distance from water on the downscaled temperature profile. Locations close to the Great Lakes and along the oceans show high variation, while inland areas farther away from water show much smaller short-term temperature variation.

    Impact of distance from water

    Note:

    In the TimeSeriesDecomposition R notebook, in a new cell, add and execute the following code to plot the cos5 basis function.

    Plot cos5 basis function

    ## The cos5 Basis Function
    plot(t, bvals[, 11], type = "l", lwd = 4)

    Review the result.

    Plot of cos5 showing monthly temperature variations

    This basis function shows monthly temperature variations. These distinct variations to temperatures can be expected near water bodies.

    After exploring some of the functional basis functions, you need to create spatially contiguous regions of temperature profiles. Counties that have similar predicted temperature variations throughout the year and that are neighbors are expected to be grouped in the same region. You can achieve this by using the Spatially Constrained Multivariate Clustering tool on all the basis functions except the constant basis function. The constant basis function only contains information about the average temperature at a location and nothing about temperature variation.

  12. On the ribbon, on the Analysis tab, click the Tools button. In the Geoprocessing pane, type Spatially Constrained Multivariate Clustering.

    You will run the tool on all the scores except the first (constant) score to measure dissimilarity in the overall profile of the temperature time series.
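
    If you prefer to script this step, a hedged arcpy sketch follows. The tool's Python name and its first three parameters (input features, output features, analysis fields) are assumptions based on the geoprocessing tool; verify them against the tool documentation for your version of ArcGIS Pro before running.

    import arcpy

    ## Example workspace; us_county_fda was written to Default.gdb by the R notebook
    arcpy.env.workspace = r"C:\ClimateAnalysis\p20\Default.gdb"
    fields = ["sin1", "cos1", "sin2", "cos2", "sin3", "cos3", "sin4", "cos4",
              "sin5", "cos5", "sin6", "cos6", "sin7", "cos7"]
    arcpy.stats.SpatiallyConstrainedMultivariateClustering(
        "us_county_fda", "us_county_fda_SpatiallyConst1", fields)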

  13. In the Spatially Constrained Multivariate Clustering tool, set the following parameters:

    • Set Input Features to us_county_fda.

    • For Output Features, type us_county_fda_SpatiallyConst1.

    • For Analysis Fields, select sin1, cos1, sin2, cos2, sin3, cos3, sin4, cos4, sin5, cos5, sin6, cos6, sin7, and cos7.

    Spatially Constrained Multivariate Clustering tool

  14. Click Run.
  15. Review the tool details and note the statistics generated for each input variable.

    Spatially Constrained Multivariate Clustering tool output statistics

    The Spatially Constrained Multivariate Clustering tool adds a new layer and several charts to the Contents pane. The statistics described in the tool details represent, for example, the mean or minimum and maximum weight of a variable across all counties in the United States.

  16. In the Contents pane, review and expand the us_county_fda_SpatiallyConst1 layer.

    Clustered U.S. counties

    Notice how the Spatially Constrained Multivariate Clustering tool has grouped U.S. counties into 24 clusters based on the highest pseudo F-statistic, which is a useful indicator of the number of clusters that should be assigned to data. In addition, the tool creates a spatially constrained multivariate clustering box plot and a number of features per cluster histogram. The box plot describes the temperature signatures associated with every region.

    Note:

    The largest F-statistic values indicate solutions that perform best at maximizing both within-cluster similarities and between-cluster differences. If no other criteria guide your choice for number of clusters, the tool will use the number associated with the largest pseudo F-statistic values.
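
    The pseudo F-statistic compares between-cluster variance to within-cluster variance. A minimal sketch of the idea follows, shown for a single variable and toy values for simplicity (the tool itself evaluates all analysis fields together):

    import numpy as np

    ## Calinski-Harabasz-style pseudo F: between-cluster vs. within-cluster variance
    def pseudo_f(values, labels):
        values, labels = np.asarray(values, dtype=float), np.asarray(labels)
        groups = sorted(set(labels))
        n, k, grand_mean = len(values), len(groups), values.mean()
        between = sum(values[labels == g].size * (values[labels == g].mean() - grand_mean) ** 2
                      for g in groups)
        within = sum(((values[labels == g] - values[labels == g].mean()) ** 2).sum()
                     for g in groups)
        return (between / (k - 1)) / (within / (n - k))

    ## Two well-separated toy groups yield a large pseudo F
    print(pseudo_f([1.0, 1.1, 0.9, 5.0, 5.2, 4.8], [0, 0, 0, 1, 1, 1]))
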

  17. In the Contents pane, right-click Spatially Constrained Multivariate Clustering Box-Plots and choose Open.

    Spatially Constrained Multivariate Clustering Box-Plots

    Note that the green region associated with portions of the West Coast is identified with smaller weights for cyclicity up to sin4 and higher weights for the higher-frequency basis functions, indicating smaller-scale changes to the temperature profile.

  18. In the chart, identify and select the medium green plot associated with the West Coast.

    Select the west coast cluster for review

  19. In the chart window, for Filter, click Selection to display only the selected cluster.

    Filter plot by selection

  20. Review the filtered box plot for the West Coast.

    The box plot associated with portions of the West Coast displays smaller weights for cyclicity up to sin4 and higher weights for the higher-frequency basis functions, indicating smaller-scale changes to the temperature profile.

    West Coast cluster temperature cyclicity

  21. On your own, in the box plot, unselect records and review the light pink region surrounded by the Great Lakes.

    Great Lakes region cluster temperature cyclicity

    This region shows strong signals resembling cos5, cos6, and cos7 basis functions that are identified with drastic changes to temperature within the span of a month. In addition, it is characterized with a large cos1 signal that shows a drastic temperature drop. These factors are typical of the Great Lakes region, where the lakes' effect on temperature is significant.

In this lesson, you explored time-discrete statistical climate downscaling using the regression suite in ArcGIS Pro to form relationships between simulated climate and observed weather. You performed spatial and nonspatial data exploration and experimented with the Generalized Linear Regression, Geographically Weighted Regression, and Forest-based Classification and Regression tools. Once you found an accurate regression model, you used Jupyter Notebook and Python automation to generate monthly temperature predictions for five years, a total of 60 time slices. Using a time series mosaic in ArcGIS Pro, you wrangled all of your predicted temperature rasters into one data structure. Using the R-ArcGIS bridge, you performed functional data analysis to explore dominant time signatures in your predicted temperature profiles over the United States. Last, you moved your functional data scores from R back to ArcGIS to visually inspect predicted time series signatures and establish temperature profile regions for the United States using spatially constrained multivariate clustering on the functional scores.

You can find more lessons in the Learn ArcGIS Lesson Gallery.