In the previous lesson, you created a map in ArcGIS Pro showing where there are unusually high numbers of crimes occurring adjusted to account for population. Next, you'll determine which of the other attributes you added may be influencing the prevalence of crime across San Francisco. To do this, you will use R to leverage exploratory data analysis tools to identify the most significant influencers on crime. The most influential attributes can be used as a starting point for the department when it's ready to begin researching possible predictive models for future planning and the proactive distribution of resources in ArcGIS and R.

## Create a correlation matrix in R to evaluate attribute relationships

As an important first step in modeling the relationships between crime and the variables you've chosen, you can use exploratory data analysis tools in R. These tools allow you to identify the most likely statistically relevant predictors for your analysis, potentially making any models you build later more effective at identifying trends. The exploratory data analysis tool you're going to use is a correlation matrix. The correlation matrix tool creates an illustrated grid of values that measure the level of association between your added attributes and your population-smoothed crime rates.

##### Note:

The following workflow is derived from a guide created by STHDA, a website dedicated to tutorials for data analysis and data visualization using R, and explains how to improve the appearance of your matrix so you can better communicate the results of your analysis.

- If necessary, open your Crime Analysis project in ArcGIS Pro, and then open RStudio.

Next, you'll bring your most recent version of the data into R.

- Use arc.open() to establish the project and dataset you are working with by specifying the path to your San_Francisco_Crime_Rates feature class as the first argument. Store the result in the rate_df variable.
`rate_df <- arc.open('C:/Lessons/San-Francisco/SF_Crime.gdb/San_Francisco_Crime_Rates')`
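Before calling arc.open(), the arcgisbinding package must be loaded and bound to your ArcGIS installation. As a quick sanity check (a sketch only; it assumes the R-ArcGIS bridge is installed and a licensed ArcGIS Pro is available on the machine), you can confirm the connection and inspect the fields in the feature class:

```r
# Requires the arcgisbinding package, which ships with the R-ArcGIS bridge
# and needs a licensed ArcGIS installation to run.
library(arcgisbinding)
arc.check_product()  # bind to the local ArcGIS installation and report the license

rate_df <- arc.open('C:/Lessons/San-Francisco/SF_Crime.gdb/San_Francisco_Crime_Rates')
rate_df@fields       # list the field names and types available for arc.select()
```

Listing the fields up front helps you catch misspelled attribute names before passing them to arc.select() in the next step.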

- Use arc.select() to choose the variables you want from your data to bring into R. For the first argument, specify the object from which you are selecting attributes, in this case, rate_df. For the second argument, list each attribute that you want to select in a character vector.
`rate_select_df <- arc.select(rate_df, fields = c("OBJECTID", "Crime_Counts", "Population", "Med_HomeValue", "Med_HomeIncome", "Renter_Count", "Grocery", "Restaurant", "EB_Rate"))`

- Convert your feature class into a spatial data frame object using the arc.data2sp() function.
`rate_spdf <- arc.data2sp(rate_select_df)`
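A few quick checks can confirm that the conversion succeeded before you build the matrix. This sketch assumes the preceding steps ran and that the sp package (a dependency of arc.data2sp()) is installed:

```r
# Sanity checks on the converted object; arc.data2sp() returns an
# sp object whose attribute table lives in the @data slot.
class(rate_spdf)                  # confirm you have an sp spatial data frame
head(rate_spdf@data)              # first rows of the attribute table
summary(rate_spdf@data$EB_Rate)   # distribution of the smoothed crime rate
```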

To enhance the appearance of the correlation matrix, you'll load several function libraries and custom functions that can be found in the correlation matrix tutorial. You'll use the following custom functions:

- Get lower triangle of the correlation matrix—By default, a correlation matrix returns the correlation coefficient for each pair of attributes twice. This function identifies and returns only the values for the lower triangle of your correlation matrix. All other values are set to NA.
- Get upper triangle of the correlation matrix—By default, a correlation matrix returns the correlation coefficient for each pair of attributes twice. This function identifies and returns only the values for the upper triangle of your correlation matrix. All other values are set to NA.
- Reorder correlation coefficients—This function reorders the correlation matrix by correlation coefficient magnitude.

These functions produce a correlation matrix that is polished, easier to analyze, and ready to be shared with the police department.

- Enter and run the following code in your RStudio console to add the custom functions to your workspace:
```r
# Get lower triangle of the correlation matrix
get_lower_tri <- function(cormat) {
  cormat[upper.tri(cormat)] <- NA
  return(cormat)
}

# Get upper triangle of the correlation matrix
get_upper_tri <- function(cormat) {
  cormat[lower.tri(cormat)] <- NA
  return(cormat)
}

# Reorder the correlation matrix
reorder_cormat <- function(cormat) {
  # Use correlation between variables as distance
  dd <- as.dist((1 - cormat) / 2)
  hc <- hclust(dd)
  cormat <- cormat[hc$order, hc$order]
}
```
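To see what the triangle helpers do, you can apply one to a small correlation matrix built from made-up vectors (toy data, not the crime attributes):

```r
# Toy demonstration: mask the lower triangle of a 3 x 3 correlation matrix.
get_upper_tri <- function(cormat) {
  cormat[lower.tri(cormat)] <- NA
  return(cormat)
}

m <- cor(data.frame(a = c(1, 2, 3, 4),
                    b = c(2, 1, 4, 3),
                    c = c(4, 3, 2, 1)))
upper <- get_upper_tri(m)
upper["b", "a"]  # NA: the duplicate coefficient below the diagonal is masked
upper["a", "b"]  # the same pair's coefficient survives in the upper triangle
```

Because each pair appears only once, melting the masked matrix (as in the next step) yields one heatmap tile per pair instead of two.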

- Type the following code to add the functions from the reshape2, ggplot2, and ggmap libraries to your workspace.
```r
install.packages("reshape2")
library(reshape2)
install.packages("ggplot2")
library(ggplot2)
install.packages("ggmap")
library(ggmap)
```

##### Note:

If a package is already installed, you may receive a message that the package will be updated and RStudio restarted. You may choose to update or cancel the package update.
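One way to avoid that prompt (a common R pattern, not part of the original tutorial) is to install a package only when it is missing and then attach it:

```r
# Install each package only if it is not already available, then attach it.
# This skips the reinstall-and-restart prompt for packages you already have.
pkgs <- c("reshape2", "ggplot2", "ggmap")
for (p in pkgs) {
  if (!requireNamespace(p, quietly = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}
```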

- Type the following code to create the correlation matrix:
```r
corr_sub <- rate_spdf@data[c("Grocery", "Restaurant", "Med_HomeIncome",
                             "Renter_Count", "Med_HomeValue", "EB_Rate")]
cormax <- round(cor(corr_sub), 2)
upper_tri <- get_upper_tri(cormax)
melted_cormax <- melt(upper_tri, na.rm = TRUE)
cormax <- reorder_cormat(cormax)
upper_tri <- get_upper_tri(cormax)
melted_cormax <- melt(upper_tri, na.rm = TRUE)
ggheatmap <- ggplot(melted_cormax, aes(Var2, Var1, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white",
                       midpoint = 0, limit = c(-1, 1), space = "Lab",
                       name = "Pearson\nCorrelation") +
  theme_minimal() +  # minimal theme
  theme(axis.text.x = element_text(angle = 45, vjust = 1,
                                   size = 12, hjust = 1)) +
  coord_fixed()
print(ggheatmap)
ggheatmap +
  geom_text(aes(Var2, Var1, label = value), color = "black", size = 4) +
  theme(
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.major = element_blank(),
    panel.border = element_blank(),
    axis.ticks = element_blank(),
    legend.justification = c(1, 0),
    legend.position = c(0.6, 0.7),
    legend.direction = "horizontal") +
  guides(fill = guide_colorbar(barwidth = 7, barheight = 1,
                               title.position = "top", title.hjust = 0.5))
```

- If necessary, click the Plots tab to view the resulting correlation matrix.
A correlation matrix helps identify predictors you may want to focus on when deciding what attributes influence the occurrence of crime. The matrix shows values that measure how strongly correlated the attributes are with the desired dependent variable and with one another. Additionally, by identifying attributes with a higher correlation to your dependent variable, you have a better idea of what predictors to try first when attempting to find a predictive model that fits your data well. Your correlation matrix can also be used to identify possible instances of multicollinearity between predictors.

When a predictor has a positive correlation with your dependent variable, the magnitude of the dependent variable increases as the magnitude of the predictor increases; the higher the correlation coefficient, the stronger this relationship. With a negative correlation, the magnitude of the dependent variable decreases as the magnitude of the predictor increases.
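Two toy vectors make the sign of the coefficient concrete (illustrative data only, not the crime attributes):

```r
# cor() returns the Pearson correlation coefficient, between -1 and 1.
x <- c(1, 2, 3, 4, 5)
cor(x, 2 * x + 1)   # 1: perfectly positive; y rises as x rises
cor(x, 10 - 3 * x)  # -1: perfectly negative; y falls as x rises
```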

Multicollinearity measures the similarity of two predictors. When a predictive model contains variables that represent the same effect, those variables tend to negate one another, weakening the effect's apparent impact on the response and potentially destabilizing the model. Both of these measures are valuable to consider as you investigate what influences the occurrence of high crime rates.
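One way to screen for multicollinearity programmatically (a sketch, not part of the original workflow) is to flag predictor pairs whose absolute correlation exceeds a cutoff; 0.7 is a common rule of thumb. The helper and the toy data below are hypothetical; with the crime data you would pass `corr_sub` instead:

```r
# Flag predictor pairs whose absolute correlation exceeds a cutoff.
flag_collinear <- function(df, cutoff = 0.7) {
  cm <- cor(df)
  hits <- which(abs(cm) > cutoff & upper.tri(cm), arr.ind = TRUE)
  data.frame(var1 = rownames(cm)[hits[, 1]],
             var2 = colnames(cm)[hits[, 2]],
             r = cm[hits])
}

set.seed(42)
a <- rnorm(100)
toy <- data.frame(a = a,
                  b = a + rnorm(100, sd = 0.1),  # nearly a copy of a
                  c = rnorm(100))                # independent noise
flag_collinear(toy)  # flags the a-b pair, not a-c or b-c
```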

The cells above and to the right of your EB_Rate variable have fairly light colors and low values, indicating that EB_Rate is not strongly correlated with the other attributes. In contrast, you can observe some possible instances of multicollinearity among your potential predictors. In particular, the Restaurant and Grocery attributes are strongly correlated, as indicated by the bright red square with a correlation coefficient of 0.76.

While the correlation matrix can offer clues about what predictors to include in your model, this type of exploratory data analysis is just one part of building a good model. For the purposes of this lesson, you'll use only the correlation matrix to help make decisions. For more information about other exploratory functions in R that are useful for developing predictive models, see this blog.

## Summary and conclusion of your work

Based on the results of your correlation matrix, it appears that your department still has some work to do in terms of identifying attributes that influence the occurrence of crime. Ideally, you'd like to see attributes that correlate much more strongly with your response variable, so you could be fairly confident that a relationship exists between the two. However, this is not surprising, as the process of finding good predictors in a given data set for a particular response variable can be tricky and can require advanced statistical methods to account for possible non-linear terms, spatial trends, and other factors.

Through this lesson, you learned how to install and set up the R-ArcGIS bridge, how to use the bridge to transfer data between ArcGIS and R, and you have seen one of the possible ways R can enhance your ArcGIS workflows through its powerful statistical libraries. You've analyzed statistically significant spatiotemporal trends; enriched your data with a wealth of available socioeconomic, demographic, business, and environmental factors; calculated robust crime rates; found hot spots of crime rates; and begun to explore relationships that might help explain those patterns. With the addition of the R-ArcGIS bridge to your workflow, you now have more possibilities at your fingertips than ever before to assist you and your department as you work to understand crimes in San Francisco and what you can do to reduce them.