Explore the data

In this tutorial, you will assume the role of a data analyst working with blood lead level test results. You must prepare data for analysis, visualization, and sharing. The data will be used for different purposes by different groups. Some staff need access to point level data for operations, such as case management and site assessments. Others need aggregated data to inform communications with your leadership, while some communications will be shared with the public. Yet others must analyze change over time and track the results of interventions and mitigation efforts. Because privacy laws protect patient data, you must prepare different derived data products using different de-identification strategies.

The tutorial data is fictitious. It was created solely to demonstrate the workflow in this tutorial. It is designed to look plausible and is structured similarly to data that you might use in this situation, but because of the legal limitations on sharing real data of this type, it is entirely made up. Do not rely on this data, draw conclusions from it, or make real-world decisions based on it. Do not use this data to train AI or ML models; the results would be inaccurate. The addresses in this dataset are real addresses, included to enable a demonstration of geocoding and to provide plausible data to de-identify, but the data has no real relation to these addresses. Any names or attribute values associated with these addresses in the datasets are made up and have nothing to do with any actual persons or conditions at these locations.

Download and inspect the data

First, you'll download and examine the data.

  1. Download the Blood_Lead_Levels_Zipped_Folder.zip zipped project data.

    A file called Blood_Lead_Levels_Zipped_Folder.zip is downloaded to your computer.

    Depending on your browser and settings, it may be saved in your Downloads folder or on your Desktop.

  2. Locate the downloaded file on your computer and use a zip utility to extract the zip file to a folder. Specify the output folder location and click Next.

    Extract the zip file.

    Specify the path for the extracted folder.

    This is a password-protected zip archive. A password window appears.

  3. For Password, enter the password I_Understand_This_Is_Fictitious_Data and click OK.

    Password dialog box

    Use of this password indicates that you understand that the data is fictitious.

    The zip file is extracted to your computer as a folder.

  4. Open the folder where you extracted the zip file.

    It contains a file named BloodLeadLevels.ppkx. A .ppkx file is an ArcGIS Pro project package, a compressed file for sharing projects that may contain maps, data, and other files that you can open in ArcGIS Pro.

  5. Double-click BloodLeadLevels.ppkx to open it in ArcGIS Pro. If prompted, sign in with your ArcGIS account.

    A map of Sacramento, California, appears. The fictitious High_Blood_Level_Results point layer shows home address locations of kids who had high levels of lead in their blood.

    Fictitious point locations of kids with high levels of lead in their blood

    Your lead surveillance and mitigation program uses the blood test results and the location of the individual patients to investigate the sources of lead exposure in the homes of these kids. The data is also used to investigate potential exposure of family members, and for tracing sources of lead at work, school, and community locations.

  6. If the High_Blood_Level_Results attribute table is not already open, in the Contents pane, right-click High_Blood_Level_Results and click Attribute Table.

    Attribute Table option for the high blood lead level layer

    Many countries have enacted policies to protect individual privacy for sensitive information, such as financial and health data. In the United States, the Health Insurance Portability and Accountability Act (HIPAA) was signed into law in 1996 and serves as the primary guidance for safe health data practices.

    The U.S. Department of Health and Human Services defines Protected Health Information (PHI) as "individually identifiable health information held or transmitted by a covered entity or its business associate, in any form or media, whether electronic, paper, or oral. Individually identifiable health information includes demographic data that relates to:

    • the individual’s past, present or future physical or mental health or condition,
    • the provision of health care to an individual, or
    • the past, present or future payment for the provision of health care to the individual,

    and that identifies the individual or for which there is a reasonable basis to believe it can be used to identify the individual. Individually identifiable health information includes many common identifiers (for example, name, address, birth date, Social Security Number)."

  7. Examine the attributes in the table.

    Data table contains PII and health data.

    The layer contains fictitious data for home address, first and last names, birthday, age, race, ethnicity, gender, blood test results, and test year. If this data were real, it would be considered private, highly personal information about the health status, identity, and precise location of minors.

    This is useful and valuable information, but it must be handled carefully, in accordance with health data privacy laws. Since your job requires you to use and share this data, you must be aware of the laws, and the ways in which the data can be de-identified for sharing.

    Since the High_Blood_Level_Results data table includes information about blood lead levels and identifying information about the kids, including their names, addresses, and birth dates, it is PHI according to HIPAA and must be carefully protected according to the HIPAA Privacy Rule.

    This kind of data can only be shared with staff that are authorized for access. That authorization will be determined by your internal organizational guidance and generally includes those whose job responsibilities require access to PHI or those granted access through internal processes like an institutional review board (IRB) for research and evaluation purposes.

    You may wonder if you are bound by these rules.

  8. Read the Are You a Covered Entity? section of the Centers for Medicare and Medicaid Services (CMS) page.

    This page provides guidance about who is covered by HIPAA regulations. The Covered Entity Decision Tool (PDF) provides an interactive decision tree you can use to determine if you are a covered entity who must follow the HIPAA rules.

    In general, covered entities include the following:

    • Health plans—Those that provide or pay the cost of medical care.
    • Health-care providers—Those that transmit data electronically for any purpose (billing, referrals, and so on).
    • Health-care clearinghouses—Organizations that process nonstandard health information to conform to standards for data content or format, or vice versa, on behalf of other organizations.
    • Business associates—A person or organization outside of the covered entity that performs certain functions on behalf of the covered entity that involve the use or disclosure of personally identifiable health information. In these situations, the covered entity must have a contract with the business associate that assigns the same duties and obligations for privacy protections that fall under the covered entity.

    For the purposes of this tutorial, you are a covered entity, because your organization runs health-care clinics.

    Health data like this blood lead level layer is extremely valuable for identifying health disparities, policy assessment, and strategic planning. You must use methods that protect individual privacy while maximizing the utility of the data for these important efforts.

  9. Read the De-identification Standard section of the HHS.gov page.

    You can use GIS data with PHI, but you must keep it on properly secured local computer hardware or in a secured ArcGIS Enterprise geodatabase. This data cannot be hosted in ArcGIS Online.

    If you share the data, you must first de-identify it.

    A de-identification diagram shows identifier data split from the health data

    The goal of data de-identification is to separate the identifiable information from the health information to ensure a very low risk of re-identification.

    The process of de-identification involves removing identifiers in the dataset in a way that significantly minimizes the chances that someone could figure out the identity of any individuals in that dataset. Regulators know that even when proper de-identification methods are used, there is always a greater than zero risk of identification. Therefore, the requirements for de-identification are to ensure a very low risk of re-identification of an individual. The two accepted methods for de-identification under the HIPAA standard are shown below.

    De-identification methods

    The first de-identification method, Safe Harbor, requires you to strip out the following 18 specific identifiers from the data:

    • Names
    • All geographic subdivisions smaller than a state
    • All elements of dates (except year) that are directly related to an individual
    • Telephone numbers
    • Vehicle identifiers and serial numbers
    • Fax numbers
    • Device identifiers and serial numbers
    • Email addresses
    • Web Universal Resource Locators (URLs)
    • Social Security Numbers
    • Internet Protocol (IP) addresses
    • Medical record numbers
    • Biometric identifiers, including finger and voice prints
    • Health plan beneficiary numbers
    • Full-face photographs and any comparable images
    • Account numbers
    • Certificate/license numbers
    • Any other unique identifying number, characteristic, or code, except as permitted

    Much of the data in the High_Blood_Level_Results layer would have to be removed to comply.

    Fields that must be removed

    This method will not be very useful if you’re using GIS for health, but it’s still worth knowing about. It is simpler than the second method, but it does require some thought beyond removing the 18 identifiers. The data manager must also consider whether there are any other identifiers in the dataset that a reasonable person could use to identify an individual, such as a unique job title.
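As a sketch of the idea, Safe Harbor suppression amounts to dropping direct identifiers and keeping only the year of any dates. The field names below are hypothetical examples, not the actual schema of the High_Blood_Level_Results layer:

```python
# Minimal sketch of Safe Harbor-style field suppression.
# The field names are hypothetical, not the tutorial layer's schema.

SAFE_HARBOR_FIELDS = {
    "first_name", "last_name", "address", "phone", "email",
    "ssn", "medical_record_number", "birth_date",
}

def strip_identifiers(record: dict) -> dict:
    """Return a copy of the record with direct identifiers removed
    and dates generalized to year only."""
    cleaned = {k: v for k, v in record.items() if k not in SAFE_HARBOR_FIELDS}
    # Safe Harbor permits the year of a date, but nothing finer.
    if "birth_date" in record:
        cleaned["birth_year"] = record["birth_date"][:4]  # "YYYY-MM-DD" -> "YYYY"
    return cleaned

patient = {
    "first_name": "Jane", "last_name": "Doe",
    "address": "123 Main St", "birth_date": "2015-06-01",
    "blood_lead_ug_dl": 7.2, "test_year": 2021,
}
print(strip_identifiers(patient))
# {'blood_lead_ug_dl': 7.2, 'test_year': 2021, 'birth_year': '2015'}
```

Remember that removing these fields is necessary but not sufficient; any remaining quasi-identifier a reasonable person could use still needs review.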

    You may also have noticed a problem with the second identifier, all geographic subdivisions smaller than a state. This would make the use of GIS extremely challenging at a useful resolution, like a city or neighborhood.

    You would go from these points:

    Fictitious point locations of kids with high levels of lead in their blood

    To state level data, such as in the following map:

    State level data without point locations

    The Safe Harbor rules allow you to use the initial three digits of a ZIP Code if, according to current US Census data, the three-digit ZIP Code has more than 20,000 people. However, few people in health-care GIS use three-digit ZIP Codes, and health-care GIS users are often concerned about health impacts at finer geographic levels.
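The three-digit ZIP Code rule can be sketched as follows. The population figures and function names here are hypothetical illustrations, not actual Census data:

```python
# Sketch of the Safe Harbor three-digit ZIP Code rule: keep the first
# three digits only when the combined population of all ZIP Codes
# sharing that prefix exceeds 20,000; otherwise replace with "000".
# The population counts below are made-up illustrations.

ZIP3_POPULATION = {"958": 480_000, "961": 12_000}  # hypothetical counts

def generalize_zip(zip_code: str, zip3_population: dict) -> str:
    prefix = zip_code[:3]
    if zip3_population.get(prefix, 0) > 20_000:
        return prefix
    return "000"

print(generalize_zip("95814", ZIP3_POPULATION))  # "958"
print(generalize_zip("96101", ZIP3_POPULATION))  # "000"
```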

    To make the best use of your data, you must use the second de-identification method, called the Expert Determination method.

  10. Review the guidance on Expert Determination de-identification.

    There is a lot of flexibility in the expert determination method. It requires a user with adequate knowledge and expertise to apply generally accepted scientific and statistical principles and methods in a way that renders the data de-identified with a very low risk of re-identification. A key aspect of the expert determination method is that the techniques used to make the expert determination are documented.

You have reviewed the blood lead level data. You also researched the definition of PHI, the entities that must comply with HIPAA, and two de-identification methods, Safe Harbor and Expert Determination. As you encounter PHI in your GIS-related work, it is important that you take the appropriate steps to comply with the law and prevent a privacy breach.

You must determine the best method to provide the right level of data for different members of your team, depending on their roles and tasks. You will provide point level identifiable data to some internal users. These authorized users may perform case management and investigation, looking for potential sources of exposure. They might need the residential addresses to calculate optimized routes for home visits. Others, however, will need a de-identified minimum viable dataset.

In the following sections, you will employ the Expert Determination method using several GIS techniques to create data products to support your organization's childhood lead poisoning prevention efforts.


Design map-based visualizations

In this section, you will symbolize the blood lead level data on maps using methods that retain the data integrity and spatial patterns, while still protecting the privacy of individuals within the dataset.

Different methods are useful for different use cases. You must think about the intent, audience, and delivery mechanism for the map. If the map will be static, such as a PDF, image, or paper map, and the map user cannot interact with the data, different considerations apply than if the map user can explore the data in a web browser or application where they can zoom in and out and could potentially investigate individual points and their associated attribute data.

Make a heat map

You need to make a map for a printed poster to inform stakeholders and the public about the extent of childhood lead poisoning in Sacramento to help communicate risk and target intervention, health education, and related activities. A heat map is a good choice for this, since it creates a smoothed surface indicating the density of points in your layer, while blurring the locations of the points.

  1. In the Contents pane, right-click High_Blood_Level_Results and click Symbology.

    Symbology option

  2. In the Symbology pane, click the Primary symbology drop-down list, scroll down, and click Heat Map.

    Heat Map symbology option

    The symbology for the layer switches to show the data as a heat map.

    Point density represented as a heat map

    The high intensity yellow and red spot in the northeastern section of town represents an area where multiple children with high blood lead levels are living. Importantly, you cannot see how many children are being shown, nor the exact locations of their homes. To further protect patient privacy, you can show this heat map without including other administrative boundaries, like county lines or ZIP Codes, and you can also change the basemap to one that does not show street names to protect against re-identification of the sensitive data. This visualization technique works best for datasets with many point features where at least some of them are in close proximity to others.

    Note:
    The most visually intense areas of a heat map are sometimes referred to as hot spots. While this is a reasonable way to describe these spatial patterns, you should not confuse this type of hot spot with the results of the Hot Spot Analysis tool, which identifies statistically significant clustering across a study area.

  3. On the ribbon, click the Share tab, and in the Output section, click Capture To Clipboard.

    Capture To Clipboard button

    A static image of the heat map is copied to the clipboard. You can paste it into a presentation or document and share it without exposing PHI.

  4. Zoom in to the intense area in the northeast part of town.

    Map zoomed to the northeast area, where there is a high concentration of points

    As you zoom in, the heat map symbology changes to show the relative density of points on the screen.

    Heat map symbology changes as you zoom in.

    The closer you zoom in, the more details become apparent. Even if the data is blurred relative to the original point representation, at some scales, a heat map is no longer an appropriate way to display sensitive data while still protecting privacy.

    Heat map resolves to points as you zoom in.

    Note:
    It is important to be aware that if your intent is to create an interactive map rather than a printed map, this dynamic heat map rendering could expose personal information. When creating interactive maps, beware of dynamically rendered heat maps and consider limiting the amount of zoom that is possible using scale-dependent rendering.

    At some zoom scales, you can determine house-level locations for the blurry points.

  5. Click one of the blurred points.

    Pop-up data for High_Blood_Level_Results 3012

    The pop-up shows the attributes of the point. Using heat map symbology does not protect patient data when the map is interactive. The points and their attributes are still present.

  6. In the Symbology pane, in the Radius box, type 50.

    Radius set to 50.

    The heat map symbology changes, recalculating the density using a larger radius value.

    Result of increasing the radius value

    This new representation could be captured to show the density of high blood lead level cases at a neighborhood scale.

    It is useful to explore different heat map symbology parameters to represent the degree and scale of clustering in your data, balancing the need to accurately portray the data geographically and the requirement that you protect the privacy of the subjects. Many health-related issues, including disease outbreaks, operate at different geographic scales. In some instances, there is a point source causing an outbreak, while at other times the problem involves community-level transmission. Understanding and using data at the appropriate scale is key to any successful health GIS analysis.
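Conceptually, a heat map is a kernel density surface, and the radius parameter acts like the kernel bandwidth. A minimal sketch, using a plain Gaussian kernel rather than ArcGIS Pro's actual implementation, shows why a larger radius produces a smoother, more blurred surface:

```python
import math

def density_at(x, y, points, radius):
    """Gaussian-kernel density estimate at (x, y); a simplified stand-in
    for the smoothing behind heat map symbology. 'radius' plays the role
    of the kernel bandwidth: larger values blur the surface more."""
    total = 0.0
    for px, py in points:
        d2 = (x - px) ** 2 + (y - py) ** 2
        total += math.exp(-d2 / (2 * radius ** 2))
    return total

points = [(0, 0), (1, 0), (10, 10)]
# A larger radius spreads each point's influence further, so even the
# distant point at (10, 10) contributes noticeably to the estimate.
print(density_at(0, 0, points, radius=1))
print(density_at(0, 0, points, radius=50))
```

This is why a radius tuned for a citywide extent can be inappropriate when zoomed to a neighborhood: the same bandwidth covers far fewer points on screen.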

    Your city-level static map image can be added to reports that inform stakeholders and the public about the extent of childhood lead poisoning in the community. Heat maps are useful for showing how the data is distributed and where it is particularly concentrated. You can read more about heat map symbology in the help.

  7. Click Save Project to save your project.

Make a point cluster map

You need to make a static planning map for hospital leadership that clearly communicates where there are large and small concentrations of lead poisoning cases. Of course, you still must do so in a way that protects the privacy of individuals. In this case, leadership is concerned about the actual numbers of cases within their service area because they need to ensure that they allocate specialists and coordinate the care program resources.

To do this, you'll make a cluster map. The technique of feature clustering works by grouping the clusters of points within an area and displaying a graduated symbol that shows the number of grouped points represented by that cluster. This is recommended when you want to show exact numbers at different scales, and you do not need or want to share the individual point locations.
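The grouping-and-counting idea can be sketched as a simple radius-based pass over the points. The real feature clustering in ArcGIS Pro is screen-space and scale-aware, so this is only an illustration of the concept:

```python
import math

def cluster_points(points, radius):
    """Greedy radius-based clustering sketch: each point joins the first
    cluster whose seed is within 'radius'; otherwise it seeds a new one."""
    clusters = []  # each cluster: {"center": (x, y), "count": n}
    for x, y in points:
        for c in clusters:
            cx, cy = c["center"]
            if math.hypot(x - cx, y - cy) <= radius:
                c["count"] += 1
                break
        else:
            clusters.append({"center": (x, y), "count": 1})
    return clusters

pts = [(0, 0), (1, 1), (0.5, 0.2), (20, 20)]
# With radius=3, the three nearby points form one cluster of 3 and the
# distant point forms its own cluster of 1. A larger radius would merge
# more points into fewer clusters.
for c in cluster_points(pts, radius=3):
    print(c["center"], c["count"])
```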

  1. In the Contents pane, click the High_Blood_Level_Results layer.
  2. On the ribbon, click the Feature Layer tab, and in the Drawing section, click Aggregation and click Clustering.

    Clustering option

  3. Click Yes on the message confirming that this will change the symbology.

    Yes on the Clustering dialog box

    The symbology for the layer changes to the Clusters style. The color of the symbols is randomly assigned, and the size and number of clusters will depend on your display and the map extent.

    Clustered points symbology

    The clusters of points are scaled relative to the number of points in the cluster, and they are labeled with the number of points as well.

  4. Zoom in to the cluster in the northeast of the city.

    Clusters change as you change zoom level and extent.

    As with the heat map symbology, the cluster symbology adapts to the zoom level and extent of the map.

    If you zoom in close enough, you will start to see individual patient locations.

    Zoomed in map showing individual features

    Just as with the heat map symbology, at some extents and zoom levels, the cluster symbology is not appropriate for protecting patient identity. Likewise, when you zoom in far enough in an interactive version of the map, you can click individual points and see their attributes. Cluster symbology is not sufficient to protect patient identity in an interactive map.

    For static maps, you can adjust the clustering to be more appropriate at a desired scale and extent.

  5. In the Contents pane, under High_Blood_Level_Results, right-click Clusters and click Symbology.

    Symbology option under Clusters

  6. In the Symbology pane, on the Clusters tab, click Cluster settings.

    Cluster settings

  7. Click the Cluster radius slider and drag it toward the High end of the scale.

    The slider being dragged toward the High end of the scale

    As you drag the Cluster radius slider toward the High end of the scale, the number of clusters decreases and the number of points per cluster increases.

    Increasing the Cluster radius incorporates more of the points in each cluster.

    This is similar to the way the heat map radius works. You can change the cluster radius to adjust the degree of clustering to suit your map extent and scale.

  8. In the Contents pane, right-click Features and click Zoom To Layer.

    Zoom To Layer

    As with the heat map symbology, a radius that works well for one scale and extent may not be appropriate at another.

    Radius set too high

  9. In the Symbology pane, click the Cluster radius slider and drag it toward the Low end of the scale.

    Map showing adjusted cluster radius for the map extent and scale

    Cluster maps are used in static and dynamic maps to show specific occurrence numbers (case observations in this instance) and to indicate spatial patterns in the density of the data. For privacy purposes, the benefit is that the clusters are not tied to administrative boundaries like ZIP Codes or counties that can be used to identify individuals. You must adjust the cluster radius for the specific scale and extent of the map to convey useful information about the patterns without revealing individual patient locations.

    Because you are making a static map image for hospital leadership, a cluster map can be used, if you are careful to set the cluster radius appropriately for the map.

    For your hospital leadership colleagues, your static cluster map gives them exactly the information they need to plan for a coordinated approach to treatment of local children with high blood lead levels.

    You can read more about aggregating features into clusters in the help.

  10. Click Save Project to save your project.

You used two visualization techniques—heat maps and feature clustering—to visualize the point data without showing the exact locations of the individuals.


Suppress small cells

Small cells are polygons containing aggregated data in which the number of data points is small enough to make re-identification of individuals possible. In this section, you will combine two methods to support de-identification of your data when you have small cells: hot spot analysis and tessellation. Hot spot analysis uses mathematical calculations to identify statistically significant spatial clusters of high values (hot spots) and low values (cold spots). Tessellation is a method of tiling a surface with identical, non-overlapping geometric shapes, such as squares, triangles, or hexagons. These tiles can be used to show summary information about the data points that fall within them.

Identify statistical hot and cold spots

Your next task is to make a map that shows statistically significant clusters of high blood lead level cases for a report that will be published online in a dynamic web map. You will use the Optimized Hot Spot Analysis tool to create your map and symbolize the results with a tessellation of hexagons.

In ArcGIS Pro, the Optimized Hot Spot Analysis tool allows you to aggregate the high blood lead level locations into weighted features. Using the distribution of the weighted features, the tool will identify an appropriate scale of analysis. This eliminates the need to know the size of the hexagons in advance. Aggregating or binning data with hexagons, also called hexbins, is a useful way to visualize health information while protecting patient privacy, since they do not directly align with administrative boundaries. A second level of obfuscation comes from providing an analytical output (levels of statistical significance) rather than case numbers.
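Hexagonal binning itself can be sketched with axial hex coordinates. This is a generic illustration of assigning points to hexbins and counting them, not the tool's actual implementation, and the bin size here is arbitrary:

```python
import math

def hex_round(q, r):
    """Round fractional axial coordinates to the nearest hexagon,
    using the standard cube-coordinate rounding trick."""
    x, z = q, r
    y = -x - z
    rx, ry, rz = round(x), round(y), round(z)
    dx, dy, dz = abs(rx - x), abs(ry - y), abs(rz - z)
    if dx > dy and dx > dz:
        rx = -ry - rz
    elif dy > dz:
        ry = -rx - rz
    else:
        rz = -rx - ry
    return int(rx), int(rz)

def hexbin_counts(points, size):
    """Count points per pointy-top hexagon of the given size."""
    counts = {}
    for x, y in points:
        q = (math.sqrt(3) / 3 * x - y / 3) / size
        r = (2 * y / 3) / size
        key = hex_round(q, r)
        counts[key] = counts.get(key, 0) + 1
    return counts

# Two nearby points share a bin; the distant point falls in another.
print(hexbin_counts([(0, 0), (0.1, 0.1), (10, 0)], size=1))
```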

Your web map will show the generalized patterns in the presence and absence of childhood lead poisoning across the study area while also communicating areas with higher concentrations.

  1. On the ribbon, click the Analysis tab and click Tools.

    Analysis tab and Tools button

    The Geoprocessing pane appears. You will use this pane to search for and run the Optimized Hot Spot Analysis tool.

  2. In the search box, type optimized hot spot, and in the results list, click the Optimized Hot Spot Analysis tool.

    Search result for optimized hot spot

    The tool is called Optimized Hot Spot Analysis because it searches for the best distance at which to perform the hot spot analysis. That will be the distance at which clustering among the counts in neighboring hexbins is most intense. If a clear distance is not achieved, the optimizer calculates an average distance that provides for a certain number of nearest neighbors for analysis. Finally, the tool compares the count of high blood lead level patients in each neighborhood cluster of hexbins with the entire study area to determine a z-score, which can then be directly related to a p-value upon which statistical significance is determined.
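The z-score-to-p-value relationship mentioned above can be computed directly. The two-tailed formula below is standard statistics, not anything specific to the tool:

```python
import math

def p_value_from_z(z):
    """Two-tailed p-value for a z-score under the standard normal
    distribution: p = erfc(|z| / sqrt(2))."""
    return math.erfc(abs(z) / math.sqrt(2))

# A bin whose count is about 2.58 standard deviations above the
# study-area mean is significant at roughly the 99% confidence level.
print(round(p_value_from_z(2.58), 4))  # ~0.0099
```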

  3. For Input Features, choose High_Blood_Level_Results.
  4. For Output Features, accept the default location, in the BloodLeadLevels.gdb geodatabase, and type High_Blood_Lead_Hot_Spots for the feature class name.

    Input and Output features set

  5. Leave the Analysis Field empty.

    If there is a numeric value associated with the input features, you can use the Analysis Field parameter to take those values into account for the hot spot analysis. In this case, you will not set an Analysis Field value. This will evaluate the distribution of High_Blood_Level_Results points for hot and cold spots.

  6. For Incident Data Aggregation Method, click the drop-down list and choose Count incidents within hexagon grid.

    Count incidents within hexagon grid

  7. For Bounding Polygons Defining Where Incidents Are Possible, click the drop-down list and click Sacramento_ZIP_Codes.

    Sacramento_ZIP_Codes

    This layer contains ZIP Code polygons for Sacramento. These features will be used by the tool to identify places where points can occur. You are essentially specifying your study area for the tool, so areas that are outside of your Sacramento study area, but still within the maximum bounding rectangle of input points, will not be identified as cold spots.

  8. Click Run.

    The tool runs and the High_Blood_Lead_Hot_Spots layer is added to the map.

  9. In the Contents pane, uncheck the High_Blood_Level_Results layer so you can examine the new layer.

    The hot spot analysis results layer is added to the map.

    The symbol classes for the layer are shown in the Contents pane.

    Hot spot analysis symbology

    The results of the tool are symbolized using blues for statistical cold spots, reds for statistical hot spots, and white for non-significant levels. You can learn more about the Optimized Hot Spot Analysis in the documentation.

    You could share this layer as a way to show the distribution of significantly high and low counts of cases. However, before sharing it, you would need to remove the Counts field, which you will use in the next section. This field indicates the number of cases in each hexagon. Providing specific counts, especially for cells with only a few incidents, may not protect patients' identities adequately, although this depends partly on the cell size and on the frequency of occurrence of the condition.

    Next, you will symbolize the hot spot analysis layer by the total count within each bin. This method not only shows the areas of concentration but also provides a way to clearly communicate the range of the number of cases.

  10. Click Save Project to save your project.

Symbolize hexbins by count

You must make a report that will be shared with internal analysts working on a lead mitigation project who need to know the numbers of cases in an area without needing to know the specific point locations. You will change the symbology from the Hot Spot symbology to show the total feature count within each polygon.

First, you'll make a copy of the layer so you can have a version symbolized each way.

  1. In the Contents pane, right-click the High_Blood_Lead_Hot_Spots layer and click Copy.

    Copy option for the high blood lead level hot spot layer

  2. In the Contents pane, right-click Map and click Paste.

    Paste option for the layer

  3. In the Contents pane, click the name of the layer you pasted so you can edit it.

    Editable layer name

  4. Type High_Blood_Lead_Hexbin_Counts for the layer name.
  5. In the Contents pane, uncheck the High_Blood_Lead_Hot_Spots layer to turn it off.
  6. Right-click the High_Blood_Lead_Hexbin_Counts layer and click Symbology.
  7. In the Symbology pane, click Field and click Counts.

    Counts option

  8. Click the Color scheme drop-down list, scroll down, and click the Reds (7 classes) color ramp.

    Reds (7 classes) color ramp

  9. Click the Classes drop-down list and click 5.

    Classes set to 5

  10. Right-click the color patch for the lowest class, less than or equal to 0 count, and click No color.

    Patch for 0 count with No color selected

    Removing the fill for zero count hexbins gives more context for a map reader and focuses attention on the cells where there are high blood lead level patients.

    Note that there are hexbins classified with 1 point within them. In most cases, you will not want to display a single case within a single hexbin. This is clearly a small cell. You can adjust the histogram of the graduated symbols to change the classes of the map symbology.

  11. In the Symbology pane, click the Histogram tab.

    Histogram tab

  12. Click and drag the class break marker from 1 to 2.

    Class break moved to 2

  13. Click and drag the class break marker from 3 to 4.

    Class break moved to 4

    The new class breaks are set.

    New class breaks set

    The symbology is updated to group hexbins with one and two cases in the same group.

    Hexbins with one and two cases symbolized in same group.

    The right number to choose for the minimum number of cases in a hexbin varies, depending on the scenario and on your organization's rules. For common conditions, you may be able to use a smaller number; for rare conditions, it may be better to use a larger number. It is also important to consider the area of each bin and the number of people (and potential cases) within it. The larger the bin and the larger the number of people, the lower you can set the minimum number of cases without risking re-identification of individuals.
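A minimum-case rule can also be applied as outright suppression before sharing a table of bin counts. The threshold below is illustrative; your organization's rules govern the real value:

```python
# Sketch of small-cell suppression: nonzero hexbin counts below a
# minimum threshold are withheld (reported as None) rather than
# published. The threshold of 3 is an illustrative assumption.

def suppress_small_cells(counts, min_cases=3):
    return {cell: (n if n == 0 or n >= min_cases else None)
            for cell, n in counts.items()}

bins = {"A": 0, "B": 1, "C": 2, "D": 7}
print(suppress_small_cells(bins))
# {'A': 0, 'B': None, 'C': None, 'D': 7}
```

Zero-count bins are kept here because, as with the unfilled hexbins on the map, they give context without identifying anyone.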

    Now you are ready to share this information with your colleagues performing the analysis. While they are internal to your organization and may have all required permissions to use the raw data, they do not actually need point level data for their work. It is a best practice to provide a minimum viable dataset based on work needs. This is a balanced approach that offers data accurate enough to focus on local concerns (better than ZIP Code level) while avoiding sharing PHI-containing point data where it's not needed.

  14. Click Save Project to save your project.

You used the Optimized Hot Spot Analysis tool to help establish the appropriate hexbin size (based on the best scale of analysis, not based on privacy needs) for the input point features and symbolized the hexbins to show statistical significance. Using the hot spot map to highlight areas of relative concern communicates the problem while also making it impossible to identify individuals. You also re-symbolized the hexbin data to show actual counts of cases for a different analytic process. You used a method that did not require individual points to be shared with stakeholders that might not be authorized to see them or don’t actually need them for their work. The result provided a clear visual representation of areas with more occurrences of high blood lead levels across your study area.


Generalize and aggregate data

In this section, you will review the data by year and learn how to protect individuals and not identify small clusters of data in mapping products that will be released to the public. You will learn how to generalize and aggregate data to protect sensitive information using methods that still show relevant patterns in the data. With health data, it is often the patterns that are most informative; individual case locations are not always necessary to inform many aspects of operations. For example, as an analyst you may want to use generalized or aggregated data in childhood lead poisoning and surveillance annual reports, as opposed to individual points used in case management and investigations.

Data generalization involves simplifying data by reducing its complexity or detail. For example, you might generalize date of birth data to year of birth. You can generalize age to age cohorts in 10-year groupings. And you can combine various tribal groups such as Cherokee, Navajo, and Choctaw into an American Indian category. Aggregation, on the other hand, involves combining multiple data points into a single summary statistic, such as the number of births per year. In the steps that follow, you will focus on aggregation methods, but you can often apply generalization techniques to your underlying data to further obfuscate private information.
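The generalization examples above can be sketched as simple functions. This is an illustrative Python sketch; the function names and groupings are assumptions for demonstration, not part of the tutorial data schema.

```python
from datetime import date

def generalize_dob(dob: date) -> int:
    """Generalize a date of birth to a year of birth."""
    return dob.year

def generalize_age(age: int) -> str:
    """Generalize an age to a 10-year cohort, e.g. 37 -> '30-39'."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

def generalize_tribe(tribe: str) -> str:
    """Combine specific tribal affiliations into a broader category."""
    return "American Indian" if tribe in {"Cherokee", "Navajo", "Choctaw"} else tribe

print(generalize_dob(date(2015, 6, 3)))  # 2015
print(generalize_age(7))                 # 0-9
print(generalize_tribe("Navajo"))        # American Indian
```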

Summarize data by ZIP Code and year

You'll begin by summarizing the data by year using the study area ZIP Code layer. ZIP Code boundaries are often used for reporting health statistics. They have pros and cons. On the pro side, ZIP Codes are smaller than counties, and most people know their ZIP Code and can locate it on a map. On the con side, ZIP Code boundaries are artificial constructs designed to support efficient mail delivery, and they can change over time. You, as the analyst, must decide whether they are appropriate for your needs and aligned with your organization's data release rules.

  1. In the Geoprocessing pane, click the back button.
  2. In the search box, type summarize within and in the results list, click the Summarize Within (Analysis Tools) tool.

    Summarize Within search result

    There is another Summarize Within tool that belongs to the GeoAnalytics Desktop Tools toolset, but you should use the one from the Analysis Tools toolset for this tutorial.

  3. On the Summarize Within tool dialog box, for Input Polygons, choose the Sacramento_Zip_Codes layer.

    Input Polygons parameter set to the Sacramento_Zip_Codes layer

  4. For Input Summary Features, choose the High_Blood_Level_Results layer.

    Input Summary Features parameter set to High_Blood_Level_Results

  5. For Output Feature Class, accept the default location, in the BloodLeadLevels.gdb geodatabase, and type HBLL_by_zip_year for the feature class name.

    Output Feature Class parameter set to HBLL_by_zip_year

  6. For Group Field, choose the Blood Level Test Year option.

    Group Field parameter set to Blood Level Test Year

  7. Click Run.

    The HBLL_by_zip_year layer is added to the map. In the Standalone Tables section, the testYear_Summary table is also added. This table contains the summary data with counts by ZIP Code by year. This can be joined back to the HBLL_by_zip_year layer to show the values for each year.

    Next, you'll join the data and learn how to generalize multiple years of data or aggregate adjacent ZIP Codes to meet your organization's minimum value thresholds for data protection.

Join the summary table to the result feature class

Now you will join the summary table to the result feature class so you will have a single feature class with data summarized by ZIP Code and year. This will allow you to create layers to show the data for each year.

  1. In the Contents pane, right-click the HBLL_by_zip_year layer and click Attribute Table.

    HBLL by Zip table values

    The table shows data from the original ZIP Code polygons and data that was added by the Summarize Within tool. The Count of Points field shows the total number of cases in each ZIP Code polygon. The JOIN ID field contains values that you can use to join the attributes from the testYear_Summary table onto this layer. There are 17 ZIP Code polygons in this feature class.

  2. In the Contents pane, in the Standalone Tables section, right-click the testYear_Summary table and click Open.

    Test Year Summary table values

    The JOIN ID field contains values that you can use to join the attributes to the HBLL_by_zip_year layer. The testYear field holds the values for the years of the blood tests. The Count of Points field shows the total number of cases in each ZIP Code polygon in each year, for a total of 50 records in the table.

  3. In the Contents pane, right-click HBLL_by_zip_year, point to Joins and Relates, and click Add Join.

    Add Join

  4. On the Add Join tool dialog box, the Input table parameter should default to the HBLL_by_zip_year layer that you right-clicked.
  5. For Input Join Field, choose JOIN ID.

    There is a warning icon beside Input Join Field that indicates the field is not indexed. For small tables like these, this is not a problem.

  6. For Join Table, choose testYear_Summary.
  7. For Join Table Field, choose Join ID.
  8. Click Validate Join.

    Add Join parameters filled out with Validate Join button highlighted

    The Validate Join process runs and returns a message.

    Validate join message

    Because two fields are not indexed, the tool recommends creating indexes for them to improve performance. Given the number of features involved, this is not necessary.

    The tool also reports that this is a one-to-many join, and that the resulting joined feature class will have 50 records (one for each record in the testYear_Summary table).

  9. Click Close to close the Message window.
  10. On the Add Join tool dialog box, click OK.

    The attribute table for the HBLL_by_zip_year layer updates to show the additional fields from testYear_Summary and the additional records for the combinations of ZIP Code polygons and test years.

    The results of the Add Join tool are temporary. You will create a copy of the feature class with all the features by exporting it to a new feature class.

  11. Right-click the HBLL_by_zip_year layer, point to Data, and click Export Features.
  12. Set the Output Feature Class name to be HBLL_by_zip_all_years.
  13. Click OK.

    The new feature class is stored in your project geodatabase.

Symbolize the combined layer

Now you'll symbolize the layer.

  1. In the Contents pane, uncheck all of the layers except HBLL_by_zip_all_years.
  2. In the Contents pane, right-click the HBLL_by_zip_all_years layer and click Symbology.
  3. In the Symbology pane, click the Primary Symbology drop-down list and click Graduated Colors.
  4. Click the Field drop-down list and click the second of the two Count of Points fields, below Join ID.

    Second Count of Points field selected

    This field contains the aggregated count of points within the polygon that happened in a specific year. The first field contains the total count for all three years.

  5. For the Color scheme, click Purple (5 Classes).

    The symbology for the layer updates. You may notice that the symbol classes for the layer shown in the Contents pane may not all be represented on the map.

    Not all symbol classes seem to show on the map.

    In this example, the highest class seems to be missing. This is because the HBLL_by_zip_all_years layer contains multiple copies of each ZIP Code polygon, one for each year in which that ZIP Code had cases. The symbology takes into account the complete range of values in the attribute table, but only the symbol for the top-most polygon at each location is drawn on the map.

  6. On the ribbon, on the Map tab, in the Navigate section, click the drop-down list of the Explore tool and click Visible layers.
  7. Click the northeastern-most ZIP Code polygon.

    Click the northeastern-most ZIP code.

    The Pop-up pane shows that three features from the HBLL_by_zip_all_years layer were at the location you clicked. The attributes for the top one are displayed in the lower section of the pop-up. You can see that the first one in this example is for the year 2018; there were 24 cases in the 95821 ZIP Code that year.

    Three features from the layer were returned.

    You can click the other features, listed by name (in this case, Sacramento) at the top of the Pop-up pane, to see their attributes.

    The second of the features has different values.

    The second of the features is for 2019, when there were 48 cases in the 95821 ZIP Code.

Display the data in separate layers by year

Now that you have the HBLL_by_zip_all_years layer with the ZIP Code counts by year, you will make copies of the layer so you can visualize the distribution of high blood lead level cases for each year.

  1. In the Contents pane, right-click the HBLL_by_zip_all_years layer and click Copy.
  2. In the Contents pane, right-click Map and click Paste.
  3. Click the name of the copy of the HBLL_by_zip_all_years layer and type HBLL_by_zip_2018 to rename it.
  4. Double-click the HBLL_by_zip_2018 layer, and in the Layer Properties dialog box, click Definition Query.
  5. Click New definition query.

    New definition query button

  6. In the Definition Queries section, on the Where line, click the drop-down list and click the testYear field. Accept the default operator, is equal to, and click the third drop-down list and choose 2018.

    Query set to Where testYear is equal to 2018

    This builds a definition query Where clause that filters the layer so only the polygons for 2018 will be shown on the map.

  7. Click OK.
  8. In the Contents pane, right-click the HBLL_by_zip_2018 layer and click Copy.
  9. In the Contents pane, right-click Map and click Paste.
  10. Rename the new copy of the layer HBLL_by_zip_2019.
  11. Open the Definition Query tab for the HBLL_by_zip_2019 layer.
  12. Click Edit.

    Edit button to change the definition query for the layer

    You will change the definition query for the 2019 layer to show the 2019 data.

  13. Change the value of the year to 2019 and click Apply.

    Test year set to 2019

  14. Click OK.
  15. Make a copy of the HBLL_by_zip_2019 layer, rename it HBLL_by_zip_2020 and use the process you just learned to update the definition query for that layer to show the data for 2020.

    Next, you will explore two different aggregation methods to achieve your organization’s minimum threshold value. Your leadership has determined that if 5 or more observations occur in an area, like a ZIP Code, you can display data for that ZIP Code in a product that will be released publicly.

  16. Click the Explore tool and click the central ZIP Code polygon with a low count of cases.

    The central ZIP Code polygon with few cases.

    The top layer in the Contents pane, HBLL_by_zip_2020, displays first.

    Pop-up for 2020 values

    In 2020, there were only two cases in this ZIP Code polygon. This is fewer than the minimum value of five cases that your organization has specified for releasing data by ZIP Codes.

  17. In the Pop-up pane, click the entry for Sacramento for the HBLL_by_zip_2019 layer.

    Results for 2019

    There were three cases in this ZIP Code in 2019. You could release combined data for this ZIP Code for 2019 and 2020, since the sum of the values for these two years is five.

Combine data for multiple years

One method of meeting your organization's minimum threshold value is aggregating multiple years of data until you obtain a minimum of 5 cases in each ZIP Code. This approach decreases temporal resolution to maintain spatial resolution.
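The idea of combining years until each ZIP Code meets the threshold can be sketched in Python. This is a minimal sketch; the counts and ZIP Codes below are made up for illustration and are not from the tutorial data.

```python
from collections import defaultdict

# Hypothetical counts keyed by (ZIP Code, year); values are illustrative.
counts = {
    ("95821", 2019): 3, ("95821", 2020): 2,
    ("95864", 2019): 6, ("95864", 2020): 9,
}
THRESHOLD = 5  # organization's minimum releasable count

# Combine 2019 and 2020 counts per ZIP Code, then check the threshold.
combined = defaultdict(int)
for (zip_code, year), n in counts.items():
    if year in (2019, 2020):
        combined[zip_code] += n

for zip_code, n in sorted(combined.items()):
    status = "releasable" if n >= THRESHOLD else "suppress"
    print(zip_code, n, status)
```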

  1. On the ribbon, on the Map tab, in the Selection group, click Select by Attributes.
  2. In the Select by Attributes pane, for the Input Rows, click the drop-down list and click High_Blood_Level_Results.
  3. Click Add Clause.

    Add Clause

  4. In the Where section, click the Select a field drop-down list and click Blood Level Test Year.

    Blood Level Test Year option

  5. Accept the default operator, is equal to.
  6. Click the drop-down list for the comparison value and click 2020.

    2020 option

  7. Click Add Clause.

    Add Clause button

  8. The default logical operator for combining clauses in the query is And. This allows you to build queries that select features where one field's value is something and another field's value is something else, or where values fall within a range if you are using greater-than and less-than comparisons. In this case, however, you will build the query to select features where the test year is 2020 or 2019, so you will join the clauses with Or.
  9. Click the And logical operator for the new clause.
  10. In the drop-down list, click Or.

    Or option

  11. Set the field to Blood Level Test Year and accept the default is equal to operator.
  12. Click the value drop-down lists and click 2019.

    The Select by Attributes tool is ready to select features with values of 2020 or 2019.

    The Select by Attributes tool is ready to select features with values of 2020 or 2019 in the Blood Level Test Year field.

  13. Click OK.

    The High_Blood_Level_Results features recorded for 2020 or for 2019 are selected. Now you can run the Summarize Within tool on them to get the counts by ZIP Code of the selected features.
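The selection logic the query expresses can be sketched as a simple filter in Python, with made-up rows standing in for the attribute table; the field and record values here are illustrative assumptions.

```python
# Sketch of the selection: test year equal to 2020 OR equal to 2019.
records = [
    {"id": 1, "testYear": 2018},
    {"id": 2, "testYear": 2019},
    {"id": 3, "testYear": 2020},
]

selected = [r for r in records if r["testYear"] == 2020 or r["testYear"] == 2019]
print([r["id"] for r in selected])  # [2, 3]
```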

  14. On the ribbon, on the Analysis tab, in the Geoprocessing section, click Tools.
  15. Search for and open the Summarize Within tool.

    The tool should be in the Recent list in the Geoprocessing pane.

  16. For Input Polygons, choose Sacramento_Zip_Codes.
  17. For the Input Summary Features, choose High_Blood_Level_Results.
  18. Name the Output Feature Class parameter HBLL_by_zip_2019_2020.

    Summarize Within tool for 2019 and 2020 cases

    The Summarize Within tool warns you that there is a selection on the input and that only that subset of records will be processed. That is what you want.

  19. Leave the Summary Fields and Group Field blank.
  20. Click Run.

    The new HBLL_by_zip_2019_2020 layer is added to the Contents pane.

  21. In the Contents pane, right-click the HBLL_by_zip_2019_2020 layer and click Attribute Table.

    Attribute Table menu option

  22. Right-click the column header for Count of Points and click Sort Ascending.

    Sort Ascending

    The sorted column shows that there are no ZIP Code polygons in this layer that have fewer than five cases.

    Sorted column no fewer than five cases per ZIP Code

    According to your organization's minimum threshold value, the grouped counts for 2019 and 2020 can be released at the ZIP Code level.

Merge ZIP Code geometries

Suppose you needed to report the data for 2020 and not include 2019 data. You will use a second method to meet your organization's minimum threshold, by aggregating ZIP Codes for a single year until there are at least five cases in each aggregated area. This approach decreases spatial resolution to maintain temporal resolution.

  1. Open the Geoprocessing pane.
  2. In the Search box, type build balanced zones, and in the results, click Build Balanced Zones.

    Build Balanced Zones tool in the search results

  3. For Input Features, choose the HBLL_by_zip_2020 layer.

    A note appears on the tool that the input has a filter. This is because there is a definition query on the layer, filtering it to only show the 2020 data.

  4. For Output Features, type HBLL_2020_Zones.

    Build Balanced Zones tool input and outputs

  5. For Zone Creation Method, accept the default value of Attribute target.
  6. In the Zone Building Criteria With Target section, click Variable and click Count of Points [Point_Count_1].

    Variable set to Count of Points [Point_Count_1]

  7. In the Sum box, type 12.

    This value is higher than the organization's minimum value of 5. The Build Balanced Zones tool uses the target variable values as goals for a randomly seeded genetic algorithm, and the results will only approximate those targets, so if you set a lower value, some zones would likely have fewer than five cases. Read more about how Build Balanced Zones works in the documentation.

  8. For Spatial Constraints, choose Contiguity edges only.

    Build Balanced Zones tool dialog box

    The Build Balanced Zones tool is ready to run.

    Note:
    If you had other criteria for the zones, such as a minimum population, you could add another variable and value, but for this task, making zones with a target of at least 12 cases is enough. You can read more about the tool in the documentation.

  9. Click Run.

    The results are added to the map. The original ZIP Code polygons are retained, but they have new attributes allocating them to different zones. You'll dissolve the polygons on these zone attributes.

  10. Click the back button to return to the Geoprocessing pane, and search for and open the Pairwise Dissolve tool.

    Pairwise Dissolve tool in the search results

  11. On the Pairwise Dissolve tool dialog box, for Input Features, choose HBLL_2020_Zones.
  12. For the Output Feature Class, type HBLL_2020_Zip_Dissolve.

    Pairwise Dissolve input and output feature classes

  13. In Dissolve Fields, choose Zone ID.

    Dissolve on Zone ID

  14. In Statistics Fields, choose Count of Points, and accept the default Statistic Type of Sum.
  15. Uncheck Create multipart features.

    Count of Points and Sum selected and Create multipart features unchecked

  16. Run the tool.

    The dissolved zones layer is added to the map.

    Dissolved zone

  17. In the Contents pane, right-click HBLL_2020_Zip_Dissolve and click Attribute Table.

    The point counts for the zones have more than 5 points each.

    The point counts for the zones are greater than 5, and most have 12 or more points. This is in line with your organization's guidance.

  18. As the analyst for the Childhood Lead Poisoning Prevention Program, you must consider which method is most appropriate to provide meaningful and actionable data for jurisdictions that often have their data suppressed. Aggregating across years means your end user cannot discern temporal variation across the aggregated years, but they can see numbers for small geographic areas that might otherwise be suppressed. Aggregating multiple ZIP Codes may make strong temporal trends identifiable as each single year is mapped, but the geographic specificity will be diminished. Each method must be weighed against the target audience and purpose for reporting and data sharing.

Add coordinate values to points

Up to this point, you’ve been creating maps for your stakeholders that are focused on questions relating to the extent of high blood lead levels in Sacramento County, how many cases there are overall, and various ways to look at the spatial and temporal patterns in the data.

Now you are working with your health equity team. They would like to do some research to determine whether there are any other factors associated with high blood lead levels in children such as: sex, race/ethnicity, and age. To help them with their work, you must be able to provide them with a de-identified point-level dataset that includes all the variables of interest for each child, as well as their general location. You will use coordinate rounding to accomplish this task and check some statistics to justify the rounding levels.

First, you will add attributes with latitude and longitude values in decimal degrees to your point features.

  1. In the Geoprocessing pane, search for and open the Calculate Geometry Attributes tool.

    Calculate Geometry Attributes tool in the search results

  2. For Input Features, choose High_Blood_Level_Results.
  3. In the first row of Geometry Attributes, in the Field (Existing or New) box, type Latitude.

    Field (Existing or New) set to Latitude

    This will add a new field to the attribute table, once the tool runs, to store the latitude values for each point.

  4. In the Property box for the Latitude field, click the drop-down list and click Point y-coordinate.

    Point y-coordinate selected for the Latitude field

    The y-coordinate value from each point will be added in the Latitude field.

  5. In the second row of Geometry Attributes, in the Field (Existing or New) box, type Longitude.

    Field (Existing or New) set to Longitude

  6. In the Property box for the Longitude field, click the drop-down list and click Point x-coordinate.
  7. In the Coordinate Format box, click the drop-down list and click Decimal Degrees.

    Coordinate Format set to Decimal Degrees

  8. Click Select coordinate system.

    Select coordinate system button

  9. In the Coordinate System window, in the search box, type WGS 1984.
  10. Expand Geographic Coordinate Systems and expand World.

    WGS 1984 coordinate system

  11. Click WGS 1984 and click OK.
  12. On the Calculate Geometry Attributes tool, click Run.
  13. In the Contents pane, right-click the High_Blood_Level_Results layer and click Attribute Table, and scroll to the right in the table to see the new Latitude and Longitude fields.

    New fields

    Now that you have the latitude and longitude values of the points stored in attributes, you can create new fields to hold the rounded values and calculate the new rounded values.

    Note:

    There are several ways to manipulate the latitude and longitude coordinates that represent the point locations of your high blood lead level cases. You could truncate or round the coordinates, snapping each point location to a lower resolution grid across the study area. You could also perturb the locations by replacing the last digit or two of each coordinate with a random number. This moves each point by a random distance and direction.
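The rounding and perturbation methods described in this note can be sketched in Python. The coordinate pair below is an arbitrary example, not a record from the data, and the perturbation here adds a seeded random offset within half a grid cell, equivalent in spirit to replacing the final digits with random values.

```python
import random

# Arbitrary example location (not from the tutorial data).
lat, lon = 38.61234, -121.35678

# Coordinate rounding: snap to a 0.01-degree grid.
lat_round, lon_round = round(lat, 2), round(lon, 2)

# Perturbation: move the point a random distance and direction,
# here up to +/-0.005 degrees from the rounded location.
rng = random.Random(0)  # seeded for repeatability in this sketch
lat_perturbed = lat_round + rng.uniform(-0.005, 0.005)
lon_perturbed = lon_round + rng.uniform(-0.005, 0.005)

print(lat_round, lon_round)  # 38.61 -121.36
```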

Add fields to hold the rounded coordinate values

You'll make two fields to hold the rounded coordinate values.

  1. Right-click High_Blood_Level_Results, point to Data Design, and click Fields.
  2. Scroll to the bottom of the list of fields.
  3. Click the row header for Latitude, and press Ctrl while clicking the row header for Longitude.

    Latitude and Longitude fields selected

  4. Right-click the row header for Latitude and click Copy.

    Copy option

  5. Right-click the row header for Latitude and click Paste.
  6. Click in the Field Name column for the Latitude1 field and type LatitudeRound.

    Latitude1 field changed to LatitudeRound

  7. Click in the Field Name column for the Longitude1 field and type LongitudeRound.
  8. Click in the Alias column for the LatitudeRound field and type Latitude Rounded.
  9. Click in the Alias column for the LongitudeRound field and type Longitude Rounded.

    Fields with names and aliases set

    The names and field aliases for the copied fields are set.

  10. On the ribbon, on the Fields tab, in the Changes section, click Save.

    Save button for field changes

    The two new fields are added to the table schema for the High_Blood_Level_Results feature class.

  11. Close the Fields view.

Round the values for the coordinates

Next, you'll calculate rounded coordinate values and store them in the new fields.

  1. In the attribute table for the High_Blood_Level_Results layer, right-click Latitude Rounded and click Calculate Field.

    Calculate Field option to calculate a new value for the Latitude Rounded field

  2. On the Calculate Field tool dialog box, click the Expression Type drop-down list and click Arcade.

    Expression Type set to Arcade

    Arcade is a lightweight expression language, written for ArcGIS.

  3. In the Expression box, enter the following Arcade expression:

    Round($feature.Latitude,2)

    Round to two decimal points expression

    This code uses the Arcade Round function, setting the value of the Latitude Rounded field to be equal to the value in the Latitude field, rounded to two decimal places. This rounds off the location information of the points to the nearest hundredth of a degree.

  4. Click the Verify button.

    Verify button

  5. Click Apply.

    The rounded values are calculated and added to the attribute table in the Latitude Rounded field.

    New values are added to the field.

  6. Use the same method to calculate the values for the Longitude Rounded field.

    Tip:
    In the Calculate Field tool, set the Field Name to Longitude Rounded, and use the following Arcade expression:

    Round($feature.Longitude,2)

    The values in the Latitude Rounded and Longitude Rounded fields are now rounded to two decimal places.

    Longitude Rounded values added

    Note:

    If your coordinates were in a planar spatial reference, such as California State Plane or UTM, the coordinate values would be in linear units rather than in decimal degrees. In that case, you would need to calculate an appropriate spacing for your rounded points and round to that spacing. For example, you might choose to round to the nearest 1,000 feet, or 100 meters, depending on the units and the amount of displacement that you want.
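The planar rounding described in this note can be sketched in Python. The coordinates and spacing below are illustrative assumptions, not values from the tutorial data.

```python
def round_to_spacing(value: float, spacing: float) -> float:
    """Snap a planar coordinate to the nearest multiple of the spacing."""
    return round(value / spacing) * spacing

# Made-up planar coordinates (e.g. State Plane feet), rounded to 1,000 units.
x, y = 2_001_234.7, 567_890.2
print(round_to_spacing(x, 1000))  # 2001000.0
print(round_to_spacing(y, 1000))  # 568000.0
```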

Create new points at the rounded coordinates

Now that you have the rounded values in two fields, you can create new points at these locations.

  1. In the Geoprocessing pane, search for and open the Make XY Event Layer tool.

    Make XY Event Layer tool

  2. On the Make XY Event Layer tool dialog box, for XY Table, choose High_Blood_Level_Results.
  3. For X Field, choose Longitude [LongitudeRound].
  4. For Y Field, choose Latitude [LatitudeRound].
  5. For Layer Name, type High_Blood_Level_Results_Rounded.
    Make XY Event Layer parameters filled

    This will make a new layer of points, using the rounded latitude and longitude values that you calculated.

  6. Click Run.

    Rounded points locations

    The points made from the rounded coordinate values are arranged in a grid-like formation, spaced at hundredth of a degree intervals.

    This approach moves points from their original locations but can preserve some of the original spatial pattern, which may be useful for analysis.

    Original heat map

    Original points heat map

    Rounded heat map

    Rounded coordinates points heat map

    Caution:

    Remember that once the point level positions have been masked by a method such as coordinate rounding, you should still remove unneeded identifying PHI such as names, birthdays, address fields, and the original coordinate values from the attribute table before releasing that data to your authorized internal colleagues. Moving the points to rounded coordinate values does not protect PHI if you still provide the original address or coordinates.

    You can use the Export Features tool to export a copy of a feature class to share with an authorized member of your organization. On this tool, in the Fields section, you have access to the list of fields, where you can choose to delete fields that contain PHI that are not required for the project.

    Next, you'll make lines connecting the original and rounded points and determine their length.

Document the coordinate rounding results

For expert determination de-identification, you must be able to quantify and document the extent to which the points have been moved. In this section, you will review statistics describing how far the coordinate rounding method moved the points and summarize how many points were moved to each grid point.

  1. Search for and open the XY To Line tool.

    XY To Line in the search results

  2. For Input Table, choose High_Blood_Level_Results_Rounded.
  3. For Output Feature Class, type HBLL_dist.

    XY To Line tool initial fields

    This line feature class will connect each of the original points' coordinates to their corresponding rounded coordinate location. You will use the line features to calculate the amount of displacement.

  4. For Start X Field, choose Longitude.
  5. For Start Y Field, choose Latitude.
  6. For End X Field, choose Longitude [LongitudeRound].
  7. For End Y Field, choose Latitude [LatitudeRound].

    XY To Line tool X and Y fields set

  8. For Line Type, choose Geodesic.

    This is the default value. It represents the shortest distance between two points on the surface of the earth.

  9. Leave the ID field empty.
  10. For Spatial Reference, accept the default value of GCS_WGS_1984.

    XY To Line final inputs set

  11. Click Run.

    The HBLL_dist layer is added to the map. Depending on the zoom level and extent of your map, it may be difficult to see. If you zoom in to one of the higher-density areas, you will see lines connecting each of the original points to its corresponding rounded coordinate point location.

    Zoomed in view of the HBLL_dist lines

  12. In the Contents pane, right-click the HBLL_dist layer and click Attribute Table.

    The values in the Shape_Length field are small decimal values because they are in degrees. You will convert the lengths to linear units.

    The attributes table of the HBLL_dist layer

Add a distance field and calculate its value

You will add a new field to the attribute table of the HBLL_dist layer and calculate its value to get the distances that the points were displaced.

  1. On the attribute table tab for the HBLL_dist layer, click Add.
    Add button to add a new field

    You will add a new field to hold the distances in linear units.

  2. Type Distance in the Field Name column for the new field.

    Field Name column for the new field set to Distance

  3. In the Data Type column for the Distance field, click the drop-down list and click Double.
  4. On the ribbon, on the Fields tab, in the Changes section, click Save.

    Save button for the new distance field

  5. Close the Fields: HBLL_dist pane.
  6. In the HBLL_dist attribute table, right-click the column header for the Distance field and click Calculate Geometry.

    Calculate Geometry button

  7. On the Calculate Geometry tool dialog box, in the Property drop-down list for the value to be added to the Distance field, click Length (geodesic).

    Length (geodesic)

  8. For Length Unit, choose Meters.

    Meters set for the length unit

  9. Click OK.

    The lengths of the lines, in meters, are added as attributes in the Distance field.

  10. Right-click the Distance column header and click Statistics.

    Statistics button for the Distance field

    The Statistics pane shows summary statistics for the Distance field. These show that the mean distance that points moved to their rounded coordinate locations was 376 meters, with a minimum of 18 meters and a maximum of 684 meters.

    Distance statistics results

    The Statistics tool also creates a histogram of the distance values, which you could use to document and defend your decision to create this de-identified product with coordinate rounding.

    Distance histogram

  11. Close the Chart Properties pane.
  12. Close the Distribution of Distance chart.
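As a sanity check on these statistics, you can compute the theoretical maximum displacement for rounding to a hundredth of a degree: the farthest a point can move is half a grid cell in each direction. A Python sketch, with the study area latitude and the meters-per-degree figure as rough assumptions; the result of roughly 700 meters is consistent with the observed maximum of 684 meters.

```python
import math

LAT = 38.5                    # approximate latitude of the study area (assumption)
METERS_PER_DEGREE = 111_320   # approximate length of one degree of latitude

half_cell = 0.005  # half of the 0.01-degree rounding grid
dy = half_cell * METERS_PER_DEGREE                                # north-south
dx = half_cell * METERS_PER_DEGREE * math.cos(math.radians(LAT))  # east-west
max_displacement = math.hypot(dx, dy)  # worst case: corner of cell to center

print(round(max_displacement))
```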

Count the numbers of points at the rounded coordinates

Next, you'll calculate how many stacked points there are after coordinate rounding. For privacy and de-identification purposes, you can think of this count as the size of the pool of cases that could represent the identity of any single case: the more cases in each stack, the bigger the pool and the stronger the de-identification. You'll analyze the points geographically, but recognize that you'll also need to review the uniqueness of all attributes retained in any table you plan to share, since a particular combination of attributes could also identify an individual. For this reason, it is recommended that you provide the minimum viable dataset to your stakeholders.

  1. In the Geoprocessing pane, search for and open the Collect Events tool.
  2. For Input Incident Features, choose High_Blood_Level_Results_Rounded.
  3. For Output Weighted Point Feature Class, type HBLL_rounded_counts.

    Collect Events tool

  4. Click Run.

    HBLL Collect Events results

    In this case, some of the clusters have as many as 15 points stacked, although many have only one or two. With a larger dataset, you might have more densely stacked points.

    You've used coordinate rounding to mask the locations of sensitive point data while keeping several additional attributes associated with the points. The health equity researchers now have the best opportunity to perform additional analysis and tell a more complete story about childhood blood lead poisoning in Sacramento using the de-identified data. To document your de-identification method, you calculated statistics on the offset distance for each point and counted the pool of points in each grid-location stack. Remember that it is also important to remove attributes that could lead to re-identification (such as address or original location coordinates), and it is a best practice to minimize the number of attributes in the dataset you provide.

  5. Click Save Project to save your project.
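
Conceptually, Collect Events performs a group-and-count on identical coordinates. The sketch below illustrates that counting logic with hypothetical rounded coordinates; the resulting counts play the same role as the ICOUNT field that Collect Events produces.

```python
from collections import Counter

# Hypothetical (lon, lat) pairs after rounding to two decimal places
rounded_points = [
    (-121.49, 38.58),
    (-121.49, 38.58),
    (-121.47, 38.57),
    (-121.49, 38.58),
    (-121.51, 38.60),
]

# Count how many cases stack at each rounded location
stack_counts = Counter(rounded_points)

for coords, n in stack_counts.items():
    print(coords, n)
```

A location with a count of 1 is a singleton: only one case sits there, so the de-identification pool at that cell is as small as it can be, which is why you review the counts before sharing.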

Review advanced approaches

You have now learned several approaches for de-identifying data for different use scenarios. There may be situations in which you need to adopt more advanced methods. In this section, you'll learn about two advanced methods of data de-identification: geomasking and differential privacy.

Depending on where your health GIS work takes you, you may want to dive in deeper and do your own research on the following techniques so you can apply them as needed.

Geomasking

The term geomasking refers to a group of methods that change the geographic location of individual points, but in a different and more powerful way than coordinate rounding. There are two key aspects needed to make geomasking useful. First, the perturbation of the point must be unpredictable—that is what protects confidentiality in the data. Second, the point should be moved in a way that preserves spatial relationships within the dataset. After all, your GIS work is about finding patterns. In the notes that follow, you will be introduced to a specific type of geomasking—the donut method. You will then learn how to statistically evaluate the geomasking result with k-anonymity. Finally, you will be presented with a tool that automates the entire process for you.

Donut method for geomasking

The basic idea behind donut geomasking is that it improves confidentiality by ensuring that the randomly moved point can never end up in its original position. That means a point must be displaced a minimum distance away from the original location. At the same time, to preserve spatial patterns, there is a calculated maximum displacement for each point as well. Those two distances create a donut-shaped displacement zone within which the original point may be moved. You can learn more about the donut method in this article.

Donut geomasking diagram
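
A minimal sketch of the donut method follows. It uses a planar approximation that is reasonable at city scale; the function name and parameters are illustrative, not from any ArcGIS tool.

```python
import math
import random

def donut_mask(lon, lat, r_min_m, r_max_m, rng=random):
    """Displace a point to a random location in the annulus between
    r_min_m and r_max_m meters from the original position."""
    theta = rng.uniform(0.0, 2.0 * math.pi)  # random bearing
    # Drawing the square root of a uniform value over squared radii
    # gives a uniform density across the annulus area
    r = math.sqrt(rng.uniform(r_min_m ** 2, r_max_m ** 2))
    # Approximate meters-to-degrees conversion, fine at city scale
    dlat = (r * math.cos(theta)) / 111320.0
    dlon = (r * math.sin(theta)) / (111320.0 * math.cos(math.radians(lat)))
    return lon + dlon, lat + dlat

masked = donut_mask(-121.49, 38.58, r_min_m=100, r_max_m=500)
```

Because the radius is always drawn between r_min_m and r_max_m, the masked point can never coincide with, or fall too close to, the original location. That guaranteed minimum displacement is what distinguishes donut geomasking from simple random perturbation.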

K-anonymity

The Expert Determination method of de-identification includes a requirement to document the process and justify how that process achieves a very low risk of re-identification of an individual. When using the geomasking technique, the k-anonymity statistic is the evaluative measure that supports that justification. You can read more about k-anonymity. The general idea is that k-anonymity represents the number of households in your dataset from which a de-identified subject cannot be distinguished. For example, if you decide that the minimum value for K is five (written as KMin=5), you're saying that there are at least five households (or individuals) that could potentially represent your original point.

The key decision for your organization is what minimum value of K is deemed acceptable for privacy protection. There is no single standard; however, it may be useful to review the policies of various state and federal agencies on small cell counts. Small cells are table cells that represent only a small number of people who share the same combination of characteristics. Aligning with the policy of authoritative government agencies may help support your organization's decision about developing its own standard. Also consider that one standard value for K may not be appropriate for every situation.
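
To make the k-anonymity idea concrete, here is an illustrative check with synthetic points, hypothetical helper names, and a planar distance approximation: given a masked point and the maximum displacement used, k is the number of candidate households that could have produced it.

```python
import math

def planar_dist_m(p, q):
    """Approximate distance in meters between two (lon, lat) points."""
    lat0 = math.radians((p[1] + q[1]) / 2.0)
    dx = (p[0] - q[0]) * 111320.0 * math.cos(lat0)
    dy = (p[1] - q[1]) * 111320.0
    return math.hypot(dx, dy)

def k_for_point(masked_pt, households, r_max_m):
    """k = number of households within r_max_m of the masked point,
    that is, the pool of locations that could be the true origin."""
    return sum(1 for h in households if planar_dist_m(masked_pt, h) <= r_max_m)

# Synthetic household locations, made up for illustration
households = [(-121.4901, 38.5802), (-121.4905, 38.5799),
              (-121.4890, 38.5810), (-121.4600, 38.6000)]
masked_pt = (-121.4900, 38.5800)

k = k_for_point(masked_pt, households, r_max_m=300)
print(k)  # the distant fourth household falls outside the pool
```

If the computed k falls below your organization's KMin, the point would need a larger displacement (or the area more candidate households) before the masked dataset could be shared.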

MapMasq

If geomasking or other data de-identification techniques are something you regularly need to do, consider using MapMasq, a solution built by Esri partner Axim Geospatial. It works like any ArcGIS extension and automates the geomasking process and k-anonymity evaluation for you.

Differential privacy

Differential privacy is a newer technique that many believe is superior for protecting individual privacy. It works best with larger datasets. In fact, this is the method that the U.S. Census Bureau used for data reporting starting with the 2020 census. With differential privacy, all the data within a dataset are mathematically altered in a way that makes identification of any individual impossible while maintaining the usefulness of the dataset. Noise is injected into the dataset according to a parameter, epsilon, referred to as the privacy-loss budget. Using epsilon means that the disclosure risk for the data can be quantified, which is useful for adherence to organizational policies as well as for the documentation required for Expert Determination.
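
The Laplace mechanism is the classic way noise is injected under differential privacy. The stdlib-only sketch below (a real deployment would use a vetted privacy library, not hand-rolled code) adds Laplace noise with scale sensitivity/epsilon to a count, so a smaller epsilon means more noise and stronger privacy.

```python
import math
import random

def laplace_noise(scale, rng=random):
    """Sample Laplace(0, scale) via the inverse-CDF transform."""
    u = rng.random() - 0.5                      # uniform on [-0.5, 0.5)
    sign = 1.0 if u >= 0 else -1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))

def dp_count(true_count, epsilon, sensitivity=1.0, rng=random):
    """Laplace mechanism for a counting query: one person changes the
    count by at most `sensitivity`, and `epsilon` is the privacy-loss
    budget; the noise scale is sensitivity / epsilon."""
    return true_count + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(7)
noisy = dp_count(120, epsilon=1.0, rng=rng)
```

Because the noise has mean zero, repeated or aggregated releases stay close to the true values on average, which is how the technique preserves the usefulness of the dataset while masking any individual's contribution.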

One way to think about how differential privacy works is to imagine one of those picture mosaics, where hundreds of ordinary pictures are put together in such a way that they create a new bigger image. Zooming in to the individual picture level, you could replace several pictures or move them around to different places and still, when you zoom out, the overall image will look essentially the same. The big picture may not be quite as sharp as a photograph, but the quality is improved as you add more individual pictures.

There is still a lot to learn about differential privacy and its value for health GIS. This is an area for you to be aware of because you may already be consuming census data that’s been shared using this method and because there may be tools that enable this technique in your own geospatial work.

To learn more about the impact of differential privacy on 2020 U.S. Census data, see the June 2022 Esri methodology report, as well as this handbook on disclosure avoidance from the U.S. Census Bureau.

In this section, you learned about two advanced methods for data de-identification that you can add to your toolkit for adhering to HIPAA and other privacy rules. Geomasking randomly perturbs location data so that at least KMin individuals could represent the original point. Differential privacy adjusts the entire dataset according to the epsilon privacy-loss budget to de-identify individuals. You're well on your way to keeping your data and your organization safe from privacy breaches.

This tutorial on data de-identification for visualization and sharing reviewed HIPAA, the U.S. law focused on protecting the privacy of personal health information. You learned techniques for mapping and visualizing the information safely; techniques for sharing the data, whether in a dynamic web map or as a dataset for others who may use it for research or other purposes; and advanced techniques you can call on when you need more powerful options for retaining point-level data.

One tutorial cannot cover every situation. In this tutorial, you learned how to think spatially about the problem and consider the advantages and drawbacks of various methods. No matter what techniques you use as you work with protected health information, think carefully and check with your internal organizational guidelines to stay aligned and stay safe.

You can find more tutorials in the tutorial gallery.