Design map-based visualizations

Lead is a naturally occurring metal that can cause negative health effects, especially in children under the age of six. These include developmental delays, learning difficulties, behavioral problems, and neurological damage, which may be permanent and disabling. Leaders in your department need geographic information to enhance program reports and to make decisions aimed at eliminating childhood lead poisoning.

First, you'll download and explore the data. Then, you'll symbolize the blood lead level data on maps using methods that retain the data integrity and spatial patterns, while still protecting the privacy of individuals within the dataset.

Note:

The tutorial data is fictitious. It was created to demonstrate the workflow in this tutorial. It is designed to look plausible and is structured similarly to data you might use in this situation, but because of legal limitations on sharing real data of this type, it is entirely made up. Do not rely on this data. Do not attempt to draw conclusions or make real-world decisions based on it. Do not use it to train AI or machine learning models; the results would be inaccurate. The addresses in this dataset are real addresses, included to enable a demonstration of geocoding and to provide plausible data to de-identify, but the data has no real relation to these addresses. Any names or attribute values associated with these addresses in the datasets are made up and have nothing to do with any actual persons or conditions at these locations.

Explore the data

First, you'll download and examine the data.

  1. Download the Blood_Lead_Levels_Zipped_Folder.zip zipped project data.
  2. Locate the downloaded file on your computer. Right-click the file and choose Extract All.

    Extract All option

  3. Specify the output folder location and click Extract.

    Output folder location

    This zip archive is password-protected. A password window appears.

  4. For Password, type I_Understand_This_Is_Fictitious_Data and click OK.
    Note:

    Use of this password indicates that you understand that the data is fictitious.

    The file is extracted to your computer as a folder.

  5. Open the extracted folder.

    It contains a file named BloodLeadLevels.ppkx. A .ppkx file is an ArcGIS Pro project package, a compressed file for sharing projects that may contain maps, data, and other files that you can open in ArcGIS Pro.

  6. Double-click BloodLeadLevels.ppkx to open it in ArcGIS Pro. If prompted, sign in with your ArcGIS account.
    Note:

    If you don't have access to ArcGIS Pro or an ArcGIS organizational account, see options for software access.

    A map of Sacramento, California, appears. The fictitious High_Blood_Level_Results point layer shows home address locations of kids who had high levels of lead in their blood.

    Fictitious point locations of kids with high levels of lead in their blood

    Your lead surveillance and mitigation program uses the blood test results and the location of the individual patients to investigate the sources of lead exposure in the homes of these kids. The data is also used to investigate potential exposure of family members, and for tracing sources of lead at work, school, and community locations.

  7. If the High_Blood_Level_Results attribute table is not already open, in the Contents pane, right-click High_Blood_Level_Results and choose Attribute Table.

    Attribute Table option for the high blood lead level layer

    The table appears.

    Attribute table

    The layer contains fictitious data for home address, first and last names, birthday, age, race, ethnicity, gender, blood test results, and test year. If this data were real, it would be considered private, highly personal information about the health status, identity, and precise location of minors. This information must be handled carefully, in accordance with health data privacy laws. Since your job requires you to use and share this data, you must be aware of the laws, and the ways in which the data can be de-identified for sharing.

    Many countries have enacted policies to protect individual privacy for sensitive information, such as financial and health data. In the United States, the Health Insurance Portability and Accountability Act (HIPAA) was signed into law in 1996 and serves as the primary guidance for safe health data practices.

    The United States Department of Health and Human Services defines Protected Health Information (PHI) as "individually identifiable health information held or transmitted by a covered entity or its business associate, in any form or media, whether electronic, paper, or oral. Individually identifiable health information includes demographic data that relates to:

    • the individual's past, present or future physical or mental health or condition,
    • the provision of health care to an individual, or
    • the past, present or future payment for the provision of health care to the individual,

    and that identifies the individual or for which there is a reasonable basis to believe it can be used to identify the individual. Individually identifiable health information includes many common identifiers (for example, name, address, birth date, Social Security Number)."

    Since the High_Blood_Level_Results data table includes information about blood lead levels and identifying information about the kids, including their names, addresses, and birth dates, it is PHI according to HIPAA and must be carefully protected according to the HIPAA Privacy Rule.

    This kind of data can be shared only with staff who are authorized for access. That authorization is determined by your internal organizational guidance and generally includes those whose job responsibilities require access to PHI or those granted access through internal processes, such as an institutional review board (IRB), for research and evaluation purposes.

  8. Read the Are You a Covered Entity? section of the Centers for Medicare and Medicaid Services (CMS) page.

    This page provides guidance about who is covered by HIPAA regulations. The Covered Entity Decision Tool (PDF) provides an interactive decision tree you can use to determine if you are a covered entity who must follow the HIPAA rules.

    In general, covered entities include the following:

    • Health plans—Those that provide or pay the cost of medical care.
    • Health-care providers—Those that transmit data electronically for any purpose (billing, referrals, and so on).
    • Health-care clearinghouses—Organizations that process nonstandard health information to conform to standards for data content or format, or vice versa, on behalf of other organizations.
    • Business associates—A person or organization outside of the covered entity that performs certain functions on behalf of the covered entity that involve the use or disclosure of personally identifiable health information. In these situations, the covered entity must have a contract with the business associate that assigns the same duties and obligations for privacy protections that fall under the covered entity.

    For the purposes of this tutorial, you are a covered entity, because your organization runs health-care clinics.

    Health data like this blood lead level layer is extremely valuable for identifying health disparities, policy assessment, and strategic planning. You must use methods that protect individual privacy while maximizing the utility of the data for these important efforts.

  9. Read the De-identification Standard section of the HHS.gov page.

    You can use GIS data with PHI, but you must keep it on properly secured local computer hardware or in a secured ArcGIS Enterprise geodatabase. This data cannot be hosted in ArcGIS Online.

    If you share the data, you must first de-identify it.

    De-identification diagram showing identifier data split from the health data

    The goal of data de-identification is to separate the identifiable information from the health information to ensure a very low risk of re-identification.

    The process of de-identification involves removing identifiers in the dataset in a way that significantly minimizes the chances that someone could figure out the identity of any individuals in that dataset. Regulators know that even when proper de-identification methods are used, there is always a greater-than-zero risk of re-identification. Therefore, the requirements for de-identification are to ensure a very low risk of re-identification of an individual. The two accepted methods for de-identification under the HIPAA standard are shown in the following graphic:

    De-identification methods

    The first de-identification method, Safe Harbor, requires you to strip out the following 18 specific identifiers from the data:

    • Names
    • All geographic subdivisions smaller than a state
    • All elements of dates (except year) that are directly related to an individual
    • Telephone numbers
    • Vehicle identifiers and serial numbers
    • Fax numbers
    • Device identifiers and serial numbers
    • Email addresses
    • Web Universal Resource Locators (URLs)
    • Social Security Numbers
    • Internet Protocol (IP) addresses
    • Medical record numbers
    • Biometric identifiers, including finger and voice prints
    • Health plan beneficiary numbers
    • Full-face photographs and any comparable images
    • Account numbers
    • Certificate/license numbers
    • Any other unique identifying number, characteristic, or code, except as permitted

    Much of the data in the High_Blood_Level_Results layer would have to be removed to comply.

    Fields that must be removed

    This method is not very useful if you're using GIS for health, but it's still worth knowing about. It is simpler than the second method, but it does require some thought beyond removing the 18 identifiers. The data manager must also consider whether there are any other identifiers in the dataset that a reasonable person could use to identify an individual, such as a unique job title.
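    As a sketch of what Safe Harbor stripping looks like in practice, the following removes direct identifiers from a record and keeps only the birth year. The field names and the record are hypothetical, not the tutorial layer's actual schema:

```python
# Hypothetical sketch of Safe Harbor-style field removal.
# Field names are illustrative, not the tutorial layer's actual schema.

IDENTIFIER_FIELDS = {
    "FirstName", "LastName", "Address", "Phone", "Email",
    "SSN", "MedicalRecordNumber",
}

def safe_harbor_strip(record):
    """Drop direct identifiers and reduce the birth date to its year."""
    cleaned = {k: v for k, v in record.items() if k not in IDENTIFIER_FIELDS}
    if "BirthDate" in cleaned:
        # Safe Harbor allows the year, but no other date elements.
        cleaned["BirthYear"] = cleaned.pop("BirthDate")[:4]  # "2019-06-15" -> "2019"
    return cleaned

record = {
    "FirstName": "Jane", "LastName": "Doe", "Address": "123 Main St",
    "BirthDate": "2019-06-15", "BloodLeadLevel_ugdL": 7.2, "TestYear": 2023,
}
print(safe_harbor_strip(record))
# Only BirthYear, BloodLeadLevel_ugdL, and TestYear remain.
```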

    You may also have noticed a problem with the second identifier, all geographic subdivisions smaller than a state. This would make the use of GIS extremely challenging at a useful resolution, like a city or neighborhood.

    You would go from these points:

    Fictitious point locations of kids with high levels of lead in their blood

    To state level data, such as in the following map:

    State level data without point locations

    The Safe Harbor rules allow you to use the initial three digits of a ZIP Code if, according to current US Census data, the three-digit ZIP Code has more than 20,000 people. However, few people in health-care GIS use three-digit ZIP Codes, and health-care GIS users are often concerned about health impacts at finer geographic levels.
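    The three-digit ZIP Code rule can be sketched as follows. The population lookup here is invented for illustration; a real check would use current US Census figures for each three-digit ZIP Code area:

```python
# Illustrative sketch of the Safe Harbor three-digit ZIP Code rule.
# The population figures are made up; real checks use current US Census data.

THREE_DIGIT_ZIP_POPULATION = {"958": 1_200_000, "959": 15_000}  # hypothetical

def safe_harbor_zip(zip_code):
    """Return the first three digits if the area has more than 20,000
    people; otherwise return '000' as Safe Harbor requires."""
    prefix = zip_code[:3]
    if THREE_DIGIT_ZIP_POPULATION.get(prefix, 0) > 20_000:
        return prefix
    return "000"

print(safe_harbor_zip("95814"))  # "958" (population above the threshold)
print(safe_harbor_zip("95901"))  # "000" (population at or below 20,000)
```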

    To make the best use of your data, you must use the second de-identification method, called the Expert Determination method.

  10. Review the guidance on Expert Determination de-identification.

    There is a lot of flexibility in the expert determination method. It requires a user with adequate knowledge and expertise to apply generally accepted scientific and statistical principles and methods in a way that renders the data de-identified with a very low risk of re-identification. A key aspect of the expert determination method is that the techniques used to make the expert determination are documented.

    You must determine the best method to provide the right level of data for different members of your team, depending on their roles and tasks. You'll provide point level identifiable data to some internal users. These authorized users may perform case management and investigation, looking for potential sources of exposure. They might need the residential addresses to calculate optimized routes for home visits. Others, however, will need a de-identified minimum viable dataset.

Make a heat map

Different methods of de-identification are useful for different use cases. You must think about the intent, audience, and delivery mechanism for the map. If the map will be static, such as a PDF, image, or paper map, and the map user cannot interact with the data, different considerations apply than if the map user can explore the data in a web browser or application where they can zoom in and out and could potentially investigate individual points and their associated attribute data.

You need to make a map for a printed poster to inform stakeholders and the public about the extent of childhood lead poisoning in Sacramento to help communicate risk and target intervention, health education, and related activities. A heat map is a good choice for this, since it creates a smoothed surface indicating the density of points in your layer, while blurring the locations of the points.
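Conceptually, a heat map is a kernel density surface: each point spreads its influence over a radius, and overlapping influences add up. The following is a minimal pure-Python sketch using a quartic kernel and made-up points; ArcGIS Pro's exact kernel and normalization may differ:

```python
import math

def kernel_density(points, cell, radius):
    """Quartic-kernel density of `points` at grid cell center `cell`.
    Larger `radius` values smooth the surface more, further blurring
    individual point locations."""
    density = 0.0
    for x, y in points:
        d = math.hypot(x - cell[0], y - cell[1])
        if d < radius:
            density += (1 - (d / radius) ** 2) ** 2  # quartic (biweight) kernel
    return density

# Hypothetical case locations in map units; no relation to the tutorial data.
points = [(10, 10), (11, 10), (10, 12), (40, 40)]

# The same cell reads higher with a larger radius: distant points blend in.
print(kernel_density(points, (10, 11), radius=5))
print(kernel_density(points, (10, 11), radius=50))
```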

  1. Close the attribute table.
  2. In the Contents pane, right-click High_Blood_Level_Results and choose Symbology.

    Symbology option

    The Symbology pane appears.

  3. In the Symbology pane, for Primary symbology, choose Heat Map.

    Heat Map symbology option

    The symbology for the layer switches to show the data as a heat map.

    Point density represented as a heat map

    The high intensity yellow and red spot in the northeastern section of town represents an area where multiple children with high blood lead levels are living. Importantly, you cannot see how many children are being shown, nor the exact locations of their homes. To further protect patient privacy, you can show this heat map without including other administrative boundaries, like county lines or ZIP Codes, and you can also change the basemap to one that does not show street names to protect against re-identification of the sensitive data. This visualization technique works best for datasets with many point features where at least some of them are in close proximity to others.

    Note:

    The most visually intense areas of a heat map are sometimes referred to as hot spots. While this is a reasonable way to describe these spatial patterns, you should not confuse this type of hot spot with the results of the Hot Spot Analysis tool, which identifies statistically significant clustering across a study area.

  4. On the ribbon, click the Share tab. In the Output group, click Capture To Clipboard.

    Capture To Clipboard button

    A static image of the heat map is copied to the clipboard. You can paste this into a presentation or document and share it without exposing PHI.

  5. Zoom in to the intense area in the northeast part of town.

    Map zoomed to the northeast area, where there is a high concentration of points

    As you zoom in, the heat map symbology changes to show the relative density of points on the screen.

    Heat map symbology when zoomed in

    The closer you zoom in, the more details become apparent. Even if the data is blurred relative to the original point representation, at some scales, a heat map is no longer an appropriate way to display sensitive data while still protecting privacy.

    Heat map depicted as points

    Note:

    It is important to be aware that if your intent was to create an interactive map rather than a printed map, this dynamic heat map rendering could expose personal information. When creating interactive maps, beware of dynamically rendered heat maps and consider limiting the amount of zoom that is possible using scale dependent rendering.

    At some zoom scales, you can determine house-level locations for the blurry points.

  6. Click one of the blurred points.

    A pop-up appears.

    Pop-up data for High_Blood_Level_Results 3012

    The pop-up shows the attributes of the point. Using heat map symbology does not protect patient data when the map is interactive. The points and their attributes are still present.

  7. Close the pop-up.
  8. In the Symbology pane, for Radius, type 50.

    Radius parameter in the Symbology pane

    The heat map symbology changes, recalculating the density using a larger radius value.

    Result of increasing the radius value

    This new representation could be captured to show the density of high blood lead level cases at a neighborhood scale.

    It is useful to explore different heat map symbology parameters to represent the degree and scale of clustering in your data, balancing the need to accurately portray the data geographically and the requirement that you protect the privacy of the subjects. Many health-related issues, including disease outbreaks, operate at different geographic scales. In some instances, there is a point source causing an outbreak, while at other times the problem involves community-level transmission. Understanding and using data at the appropriate scale is key to any successful health GIS analysis.

    Your city-level static map image can be added to reports that inform stakeholders and the public about the extent of childhood lead poisoning in the community. Heat maps are useful for showing how the data is distributed and where it is particularly concentrated.

  9. On the Quick Access Toolbar, click the Save Project button.

    Save Project button

Make a point cluster map

You need to make a static planning map for hospital leadership that clearly communicates where there are large and small concentrations of lead poisoning cases. Of course, you still must do so in a way that protects the privacy of individuals. In this case, leadership is concerned about the actual numbers of cases within their service area because they need to ensure that they allocate specialists and coordinate the care program resources.

To do this, you'll make a cluster map. Feature clustering works by grouping nearby points and displaying a graduated symbol labeled with the number of points that each cluster represents. This method is recommended when you want to show exact numbers at different scales and you do not need or want to share the individual point locations.
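A simplified way to see how clustering trades location detail for counts is to bin points into cells whose size plays the role of the cluster radius. This grid-based sketch, with made-up points, is a stand-in for ArcGIS Pro's screen-space clustering, not its actual algorithm:

```python
from collections import Counter

def cluster_counts(points, radius):
    """Group points into square cells of side `radius` and count each group.
    A simplified stand-in for feature clustering: each cluster is reported
    as a cell-center location plus the number of points it represents."""
    cells = Counter((int(x // radius), int(y // radius)) for x, y in points)
    return {
        ((cx + 0.5) * radius, (cy + 0.5) * radius): n
        for (cx, cy), n in cells.items()
    }

# Hypothetical case locations; a larger radius merges clusters and raises counts.
points = [(1, 1), (2, 1), (3, 2), (40, 41), (42, 40)]
print(cluster_counts(points, radius=5))   # two clusters: 3 points and 2 points
print(cluster_counts(points, radius=50))  # one cluster of all 5 points
```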

  1. In the Contents pane, click the High_Blood_Level_Results layer to select it.
  2. On the ribbon, click the Feature Layer tab. In the Drawing group, click Aggregation and choose Clustering.

    Clustering option

  3. In the Clustering window, click Yes.

    Yes button in the Clustering window

    The map updates to show cluster symbols. The symbol color is randomly assigned, and the size and number of clusters will depend on your display and the map extent.

    Clustered points symbology

    The size of each symbol is based on the number of points in the cluster, and they are labeled with the number of points as well.

  4. Zoom in to the cluster in the northeast of the city.

    Clusters when zoomed in

    As with the heat map symbology, the cluster symbology adapts to the zoom level and extent of the map. If you zoom in close enough, you will start to see individual patient locations.

    Zoomed in map showing individual features

    Just as with the heat map symbology, at some extents and zoom levels, the cluster symbology is not appropriate for protecting patient identity. Also, like with the heat map symbology, when you zoom in enough in an interactive version of the map, you can click individual points and get their attributes. Cluster symbology is not sufficient to protect patient identity in an interactive map.

    For static maps, you can adjust the clustering to be more appropriate at a desired scale and extent.

  5. In the Symbology pane, click the Clusters tab and the Cluster settings tab.

    Clusters and Cluster settings tabs

  6. Drag the Cluster radius slider toward the High end of the scale.

    Clustering radius slider

    As you drag the Cluster radius slider, the number of clusters decreases and the number of points per cluster increases.

    Map with a high cluster radius

    This is similar to the way the heat map radius works. You can change the cluster radius to adjust the degree of clustering to suit your map extent and scale.

  7. In the Contents pane, right-click High_Blood_Level_Results and choose Zoom To Layer.

    Zoom To Layer option

    As with the heat map symbology, a radius that works well for one scale and extent may not be appropriate at another.

    Radius set too high

  8. In the Symbology pane, drag the Cluster radius slider toward the Low end of the scale.

    Map showing adjusted cluster radius for the map extent and scale

    Cluster maps are used in static and dynamic maps to show specific occurrence numbers (case observations in this instance) and to indicate spatial patterns in the density of the data. For privacy purposes, the benefit is that the clusters are not tied to administrative boundaries like ZIP Codes or counties that can be used to identify individuals. You must adjust the cluster radius for the specific scale and extent of the map to convey useful information about the patterns without revealing individual patient locations.

    Because you are making a static map image for hospital leadership, a cluster map can be used, if you are careful to set the cluster radius appropriately for the map. For your hospital leadership colleagues, your static cluster map gives them exactly the information they need to plan for a coordinated approach to treatment of local children with high blood lead levels.

  9. Save the project.

You have reviewed the blood lead level data. You also researched the definition of PHI, the entities that must comply with HIPAA, and two de-identification methods, Safe Harbor and Expert Determination. Then, you used two visualization techniques—heat maps and feature clustering—to visualize the point data without showing the exact locations of the individuals.


Suppress small cells

Small cells are polygons containing aggregated data in which the number of data points in the polygon are few enough to make re-identification of individuals possible. In this section, you will combine two methods to support de-identification of your data when you have small cells: hot spot analysis and tessellation. Hot spot analysis is based on mathematical calculations that identify statistically significant spatial clusters of high values (hot spots) and low values (cold spots). Tessellation is a method of tiling a surface with identical, non-overlapping geometric shapes, like squares, triangles, or hexagons. These tiles can be used to show summary information about the data points that fall within them.
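To illustrate tessellation, the following sketch bins made-up points into flat-top hexagons using axial coordinates and cube rounding. This is a generic hexbin technique, not ArcGIS Pro's internal implementation:

```python
import math
from collections import Counter

def hex_bin(points, size):
    """Assign each point to a flat-top hexagon of circumradius `size`
    using axial coordinates with cube rounding, then count per hexagon."""
    def to_hex(x, y):
        # Fractional axial coordinates for a flat-top hexagon grid.
        q = (2 / 3) * x / size
        r = (-x / 3 + math.sqrt(3) / 3 * y) / size
        # Cube-round to the nearest hexagon center.
        cx, cz = q, r
        cy = -cx - cz
        rx, ry, rz = round(cx), round(cy), round(cz)
        dx, dy, dz = abs(rx - cx), abs(ry - cy), abs(rz - cz)
        if dx > dy and dx > dz:
            rx = -ry - rz
        elif dy > dz:
            ry = -rx - rz
        else:
            rz = -rx - ry
        return (rx, rz)

    return Counter(to_hex(x, y) for x, y in points)

# Hypothetical points: three close together, one far away.
counts = hex_bin([(0, 0), (1, 1), (0, 2), (30, 30)], size=10)
print(counts)  # two hexagons: one holding 3 points, one holding 1
```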

Identify hot and cold spots

Your next task is to make a map that shows statistically significant clusters of high blood lead level cases for a report that will be published online in a dynamic web map. You'll use the Optimized Hot Spot Analysis tool to create your map and symbolize the results with a tessellation of hexagons.

In ArcGIS Pro, the Optimized Hot Spot Analysis tool allows you to aggregate the high blood lead level locations into weighted features. Using the distribution of the weighted features, the tool will identify an appropriate scale of analysis. This eliminates the need to know the size of the hexagons in advance. Aggregating or binning data with hexagons, also called hexbins, is a useful way to visualize health information while protecting patient privacy, since they do not directly align with administrative boundaries. A second level of obfuscation comes from providing an analytical output (levels of statistical significance) rather than case numbers.

Your web map will show the generalized patterns in the presence and absence of childhood lead poisoning across the study area while also communicating areas with higher concentrations.

  1. On the ribbon, click the Analysis tab. In the Geoprocessing group, click Tools.

    Tools button

    The Geoprocessing pane appears. You'll use this pane to search for and run the Optimized Hot Spot Analysis tool.

  2. In the search box, type optimized hot spot. In the list of results, click the Optimized Hot Spot Analysis tool.

    Search result for optimized hot spot

    The tool is called Optimized Hot Spot Analysis because it searches for the best distance at which to perform the hot spot analysis. That will be the distance at which clustering among the counts in neighboring hexbins is most intense. If a clear distance is not achieved, the optimizer calculates an average distance that provides for a certain number of nearest neighbors for analysis. Finally, the tool compares the count of high blood lead level patients in each neighborhood cluster of hexbins with the entire study area to determine a z-score, which can then be directly related to a p-value upon which statistical significance is determined.
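    As an illustration of the statistic behind the tool, the following sketch computes a simplified Getis-Ord Gi* z-score for one neighborhood of hexbin counts with binary weights, then converts it to a p-value using a normal approximation. The counts and neighborhood are made up, and the tool's distance optimization and corrections are not reproduced:

```python
import math

def gi_star(values, neighbors):
    """Simplified Getis-Ord Gi* z-score for the neighborhood given by
    `neighbors` (indices of the focal bin and its neighbors), using
    binary weights. A toy illustration, not the tool's implementation."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum(v * v for v in values) / n - mean ** 2)
    w_sum = len(neighbors)                       # binary weights: 1 per neighbor
    local_sum = sum(values[j] for j in neighbors)
    numerator = local_sum - mean * w_sum
    denominator = s * math.sqrt((n * w_sum - w_sum ** 2) / (n - 1))
    return numerator / denominator

def p_value(z):
    """Two-tailed p-value for a z-score (normal approximation)."""
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical hexbin case counts; bins 0-2 form a high-count neighborhood.
counts = [9, 8, 10, 1, 0, 2, 1, 0, 1, 0]
z = gi_star(counts, neighbors=[0, 1, 2])
print(round(z, 2), round(p_value(z), 4))
# A large positive z and a small p-value: a statistically significant hot spot.
```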

  3. For Input Features, choose High_Blood_Level_Results.
  4. For Output Features, accept the default location. For the feature class name, type High_Blood_Lead_Hot_Spots.

    Input and output parameters

  5. Leave the Analysis Field parameter empty.

    If there is a numeric value associated with the input features, you can use the Analysis Field parameter to take those values into account for the hot spot analysis. In this case, you won't set an Analysis Field value. This will evaluate the distribution of High_Blood_Level_Results points for hot and cold spots.

  6. For Incident Data Aggregation Method, choose Count incidents within hexagon grid.
  7. For Bounding Polygons Defining Where Incidents Are Possible, choose Sacramento_ZIP_Codes.

    Tool settings

    This layer contains ZIP Code polygons for Sacramento. These features will be used by the tool to identify places where points can occur. You are essentially specifying your study area for the tool, so areas that are outside of your Sacramento study area, but still within the maximum bounding rectangle of input points, will not be identified as cold spots.

  8. Click Run.

    The tool runs and the High_Blood_Lead_Hot_Spots layer is added to the map.

  9. In the Contents pane, uncheck the High_Blood_Level_Results layer so you can examine the new layer.

    Hot spot analysis results layer on the map

    The symbol classes for the layer are shown in the Contents pane.

    Hot spot analysis symbology

    The results of the tool are symbolized using blues for statistical cold spots, reds for statistical hot spots, and white for non-significant levels.

    You could share this layer as a way to show the distribution of significantly high and low counts of cases. However, before sharing it, you would need to remove the Counts field. This field indicates the number of cases in each hexagon. Providing specific counts, especially for cells with only a few incidents, may not protect patients' identities adequately, although this depends partly on the cell size and on the frequency of occurrence of the condition.

    Next, you'll symbolize the hot spot analysis layer by the total count within each bin. This method not only shows the areas of concentration but also provides a way to clearly communicate the range of the number of cases.

  10. Save the project.

Symbolize hexbins by count

You must make a report that will be shared with internal analysts working on a lead mitigation project who need to know the numbers of cases in an area without needing to know the specific point locations. You'll change the symbology to show the total feature count within each polygon.

First, you'll make a copy of the layer so you can have a version symbolized each way.

  1. In the Contents pane, right-click the High_Blood_Lead_Hot_Spots layer and choose Copy.

    Copy option for the high blood lead level hot spot layer

  2. In the Contents pane, right-click Map and choose Paste.

    Paste option

  3. In the Contents pane, click the name of the layer you pasted to edit it.

    Editable layer name

  4. Type High_Blood_Lead_Hexbin_Counts and press Enter.
  5. In the Contents pane, uncheck the High_Blood_Lead_Hot_Spots layer to turn it off.
  6. Right-click the High_Blood_Lead_Hexbin_Counts layer and choose Symbology.
  7. In the Symbology pane, for Field, choose Counts.
  8. Click the Color scheme drop-down list, scroll down and click the Reds (7 classes) color ramp.

    Reds (7 classes) color ramp

  9. For Classes, choose 5.

    Classes parameter set to 5

  10. In the symbol table, right-click the symbol for the lowest class (≤ 0), and choose No color.

    No color option

    Removing the fill for zero count hexbins gives more context for a map reader and focuses attention on the cells where there are high blood lead level patients.

    Some hexbins contain only a single point. In most cases, you won't want to display a single case within a single hexbin; this is clearly a small cell. You can adjust the histogram of the graduated symbols to change the classes of the map symbology.

  11. Click the Histogram tab.

    Histogram tab

  12. On the histogram, double-click the 1 handle to edit it. Type 2 and press Enter.

    Class break moved to 2

  13. Change the 3 handle to 4.

    The new class breaks are set.

    New class breaks set

    The symbology is updated to group hexbins with one and two cases in the same group.

    Hexbins with one and two cases symbolized in same group

    The right number to choose for the minimum number of cases in a hexbin varies, depending on the scenario and on your organization's rules. For conditions that are common, you may be able to use a smaller number, and for conditions that are rare, it may be better to use a larger number. It is also important to consider the area of each bin and the number of people (and potential cases) within it. The larger the bin and the larger the number of people, the lower you can set the minimum number of cases without risking re-identification of individuals.
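    The small-cell handling above can be sketched as a classification rule, assuming a hypothetical organizational minimum of three cases per publishable cell:

```python
# Sketch of small-cell handling for published counts. The minimum cell
# size of 3 is a hypothetical organizational policy, not a fixed standard.
MIN_CELL_SIZE = 3

def classify_count(count):
    """Map a hexbin count to a publishable class label, grouping counts
    below the minimum into a single combined class."""
    if count == 0:
        return None                        # no fill: drop zero-count bins
    if count < MIN_CELL_SIZE:
        return f"1-{MIN_CELL_SIZE - 1}"    # small cells share one class
    return f"{MIN_CELL_SIZE}+"

for n in (0, 1, 2, 3, 7):
    print(n, classify_count(n))
```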

    Now you are ready to share this information with your colleagues performing the analysis. While they are internal to your organization and may have all required permissions to use the raw data, they do not actually need point level data for their work. It is a best practice to provide a minimum viable dataset based on work needs. This is a balanced approach that offers data accurate enough to focus on local concerns (better than ZIP Code level) while avoiding sharing PHI-containing point data where it's not needed.

  14. Save the project.

You used the Optimized Hot Spot Analysis tool to help establish the appropriate hexbin size (based on the best scale of analysis, not on privacy needs) for the input point features and symbolized the hexbins to show statistical significance. Using the hot spot map to highlight areas of relative concern communicates the problem while making it impossible to identify individuals. You also re-symbolized the hexbin data to show actual counts of cases for a different analytic process. You used a method that did not require individual points to be shared with stakeholders who might not be authorized to see them or don't actually need them for their work. The result provided a clear visual representation of areas with more occurrences of high blood lead levels across your study area.


Generalize and aggregate data

Next, you'll review the data by year and learn how to protect individuals and not identify small clusters of data in mapping products that will be released to the public. You'll learn how to generalize and aggregate data to protect sensitive information using methods that still show relevant patterns in the data. With health data, it is often the patterns that are most informative; individual case locations are not always necessary to inform many aspects of operations. For example, as an analyst you may want to use generalized or aggregated data in childhood lead poisoning and surveillance annual reports, as opposed to individual points used in case management and investigations.

Data generalization involves simplifying data by reducing its complexity or detail. For example, you might generalize date of birth data to year of birth. You can generalize age to age cohorts in 10-year groupings. And you can combine various tribal groups such as Cherokee, Navajo, and Choctaw into an American Indian category. Aggregation, on the other hand, involves combining multiple data points into a single summary statistic, such as the number of births per year. You'll focus on aggregation methods, but you can often apply generalization techniques to your underlying data to further obfuscate private information.
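
The distinction between generalization and aggregation can be sketched in a few lines of standalone Python (not an ArcGIS workflow; all records below are fabricated for illustration):

```python
from collections import Counter
from datetime import date

# Fabricated sample records, purely for illustration.
records = [
    {"dob": date(2015, 3, 14), "age": 4, "tribe": "Cherokee"},
    {"dob": date(2015, 9, 2),  "age": 4, "tribe": "Navajo"},
    {"dob": date(2013, 1, 30), "age": 7, "tribe": "Choctaw"},
]

def generalize(rec):
    """Reduce detail: full DOB -> birth year, age -> 10-year cohort,
    specific tribal affiliation -> broader category."""
    decade = (rec["age"] // 10) * 10
    return {
        "birth_year": rec["dob"].year,
        "age_cohort": f"{decade}-{decade + 9}",
        "race": "American Indian",
    }

generalized = [generalize(r) for r in records]

# Aggregation: collapse individual rows into a summary count per birth year.
births_per_year = Counter(g["birth_year"] for g in generalized)
print(births_per_year)  # Counter({2015: 2, 2013: 1})
```

Generalization keeps one row per individual but with coarser values; aggregation replaces the rows with summary statistics.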

Summarize the data

You'll begin by summarizing the data by year using the study area ZIP Code layer. ZIP Code boundaries are often used for reporting health statistics. They have pros and cons. On the pro side, ZIP Codes are smaller than counties, and most people know their ZIP Code and can locate it on a map. On the con side, ZIP Code boundaries are artificial constructs, designed to support efficient mail delivery, and they can change over time. You, as the analyst, must decide whether they are appropriate for your needs and aligned with your organization's data release rules.

  1. Reopen the Geoprocessing pane and click the Back button.
    Tip:

    If you can't find the Geoprocessing pane, on the ribbon, click the Analysis tab. In the Geoprocessing group, click Tools.

  2. In the search box, type summarize within. In the list of results, click Summarize Within (Analysis Tools).

    Summarize Within search result

    There is another Summarize Within tool that belongs to the GeoAnalytics Desktop Tools toolset, but you should use the one from the Analysis Tools toolset for this tutorial.

  3. For Input Polygons, choose Sacramento_Zip_Codes.

    Input Polygons parameter set to Sacramento_Zip_Codes

  4. For Input Summary Features, choose High_Blood_Level_Results.

    Input Summary Features parameter set to High_Blood_Level_Results

  5. For Output Feature Class, accept the default location. For the feature class name, type HBLL_by_zip_year.

    Output Feature Class parameter set to HBLL_by_zip_year

  6. For Group Field, choose Blood Level Test Year.

    Group Field parameter set to Blood Level Test Year

  7. Click Run.

    The HBLL_by_zip_year layer is added to the map. In the Standalone Tables section, the testYear_Summary table is also added. This table contains the summary data with counts by ZIP Code by year. It can be joined back to the HBLL_by_zip_year layer to show the values for each year.
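
Conceptually, Summarize Within with a group field performs a point-in-polygon count per group. The standalone Python sketch below illustrates the idea using simple rectangular "ZIP" extents and fabricated points; real ZIP boundaries are arbitrary polygons, and the geoprocessing tool handles them for you:

```python
from collections import defaultdict

# Toy "ZIP Code" polygons as axis-aligned boxes (xmin, ymin, xmax, ymax);
# boxes keep the sketch short, but the idea extends to true polygons.
zips = {
    "95821": (0.0, 0.0, 1.0, 1.0),
    "95825": (1.0, 0.0, 2.0, 1.0),
}

# Fabricated case points: (x, y, test_year).
cases = [
    (0.2, 0.5, 2019),
    (0.8, 0.1, 2019),
    (0.5, 0.9, 2020),
    (1.5, 0.5, 2020),
]

def contains(box, x, y):
    xmin, ymin, xmax, ymax = box
    return xmin <= x <= xmax and ymin <= y <= ymax

# Equivalent of Summarize Within with a Group Field: count cases per
# (ZIP, year) combination.
counts = defaultdict(int)
for x, y, year in cases:
    for zip_code, box in zips.items():
        if contains(box, x, y):
            counts[(zip_code, year)] += 1
            break

print(dict(counts))
# {('95821', 2019): 2, ('95821', 2020): 1, ('95825', 2020): 1}
```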

Join the table to the feature class

Next, you'll join the summary table to the result feature class to have a single feature class with data summarized by ZIP Code and year. This will allow you to create layers to show the data for each year.

  1. In the Contents pane, right-click the HBLL_by_zip_year layer and choose Attribute Table.

    HBLL_by_zip_year table

    The table shows data from the original ZIP Code polygons and data that was added by the Summarize Within tool. The Count of Points field shows the total number of cases in each ZIP Code polygon. The JOIN ID field contains values that you can use to join the attributes from the testYear_Summary table onto this layer. There are 17 ZIP Code polygons in this feature class.

  2. In the Contents pane, in the Standalone Tables section, right-click the testYear_Summary table and choose Open.

    Test Year Summary table values

    The JOIN ID field contains values that you can use to join the attributes to the HBLL_by_zip_year layer. The testYear field holds the values for the years of the blood tests. The Count of Points field shows the total number of cases in each ZIP Code polygon in each year, for a total of 50 records in the table.

  3. Close both of the tables.
  4. In the Contents pane, right-click HBLL_by_zip_year, point to Joins and Relates, and choose Add Join.

    Add Join option

    In the Add Join window, the Input Table parameter is set to the HBLL_by_zip_year layer.

  5. For Input Field, choose JOIN ID.

    There is a warning icon beside Input Field that indicates the field is not indexed. For small tables like these, this is not a problem.

  6. For Join Table, choose testYear_Summary.
  7. For Join Field, choose Join ID.
  8. Click Validate Join.

    Validate Join button

    The Validate Join process runs and returns a message.

    Validate join message

    Because two fields are not indexed, the tool recommends creating indexes for them to improve performance. Given the number of features involved, this is not necessary.

    The tool also reports that this is a one-to-many join, and that the resulting joined feature class will have 50 records (one for each record in the testYear_Summary table).

  9. Click Close to close the Message window.
  10. In the Add Join window, click OK.

    The attribute table for the HBLL_by_zip_year layer updates to show the additional fields from testYear_Summary and the additional records for the combinations of ZIP Code polygons and test years.

    The results of the Add Join tool are temporary. You'll create a copy of the feature class with all the features by exporting it to a new feature class.

  11. Right-click the HBLL_by_zip_year layer, point to Data, and choose Export Features.
  12. In the Export Features window, for Output Feature Class, type HBLL_by_zip_all_years.

    Output Feature Class

  13. Click OK.

    The new feature class is stored in your project geodatabase and added to the Contents pane. You no longer need the older layer.

  14. In the Contents pane, right-click HBLL_by_zip_year and choose Remove.

    Remove option
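
The one-to-many join you just validated can be illustrated with a small standalone Python sketch. The rows are fabricated, and the field names are simplified stand-ins for JOIN ID, testYear, and Count of Points:

```python
# Polygons: one row per ZIP Code, as in the summarized layer.
polygons = [
    {"join_id": 1, "zip": "95821"},
    {"join_id": 2, "zip": "95825"},
]

# Summary table: one row per ZIP per year (one-to-many relative to polygons).
summary = [
    {"join_id": 1, "test_year": 2019, "count": 48},
    {"join_id": 1, "test_year": 2020, "count": 26},
    {"join_id": 2, "test_year": 2020, "count": 7},
]

# A one-to-many join duplicates each polygon once per matching summary row,
# so the output row count equals the summary table's row count (50 in the
# tutorial data, 3 in this toy example).
joined = [
    {**poly, **row}
    for poly in polygons
    for row in summary
    if row["join_id"] == poly["join_id"]
]

print(len(joined))  # 3
```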

Symbolize the layer

Next, you'll symbolize the layer.

  1. In the Contents pane, uncheck all of the layers except HBLL_by_zip_all_years.
  2. Right-click HBLL_by_zip_all_years and choose Symbology.
  3. In the Symbology pane, for Primary Symbology, choose Graduated Symbols.
  4. For Field, choose the second of the two Count of Points fields, below Join ID.

    Second Count of Points field

    This field contains the aggregated count of points within the polygon that happened in a specific year. The first field contains the total count for all three years.

  5. For Maximum size, type 40 pt.

    Maximum size

    The symbology for the layer updates.

    Symbolized map

    The map shows multiple point symbols of different sizes on each polygon. This is because the HBLL_by_zip_all_years layer contains multiple copies of each ZIP Code polygon, one for each year for which there were cases in that ZIP Code. The range of symbol sizes is based on the range of values, but the map is difficult to read. You can't tell which point symbol corresponds to which year.

  6. On the ribbon, click the Map tab. In the Navigate group, click the drop-down arrow of the Explore tool and choose Visible layers.

    Explore tool set to Visible Layers

  7. Click the northeastern-most ZIP Code polygon.

    Northeastern-most ZIP Code polygon

    Only two point symbols are visible on the map, but the upper section of the pop-up shows that the location contains three features from the HBLL_by_zip_all_years layer. The lower section of the pop-up displays the attributes for the top feature. The testYear and Count of Points fields show how many cases there were in the 95821 ZIP Code in each year.

    Fields in the pop-up

  8. In the upper section of the pop-up, click the other two instances of Sacramento to view the attributes for the other two features.

    Second Sacramento pop-up

    In the 95821 ZIP Code, there were 24 cases in 2018, 48 cases in 2019, and 26 cases in 2020.

  9. Close the pop-up.

Display the data by year

Now that you have the HBLL_by_zip_all_years layer with the ZIP Code counts by year, you'll make copies of the layer so you can visualize the distribution of high blood lead level cases for each year.

  1. In the Contents pane, right-click the HBLL_by_zip_all_years layer and choose Copy.
  2. In the Contents pane, right-click Map and choose Paste.
  3. Rename the copy of the HBLL_by_zip_all_years layer to HBLL_by_zip_2018.
  4. Double-click the HBLL_by_zip_2018 layer.

    The Layer Properties window appears.

  5. In the Layer Properties pane, click the Definition Query tab.
  6. Click New definition query.

    New definition query button

  7. Create the query Where testYear is equal to 2018.

    Query set to Where testYear is equal to 2018

    This query will filter the layer so only the polygons for 2018 will be shown on the map.

  8. Click Apply and OK.
  9. In the Contents pane, right-click the HBLL_by_zip_2018 layer and choose Copy.
  10. In the Contents pane, right-click Map and choose Paste.
  11. Rename the new copy of the layer HBLL_by_zip_2019.
  12. Double-click the HBLL_by_zip_2019 layer to open the Layer Properties window.
  13. On the Definition Query tab, on the Query 1 card, click Edit.

    Edit button to change the definition query for the layer

    You'll change the definition query for the 2019 layer to show the 2019 data.

  14. Change the value of the year to 2019.

    Test year set to 2019

  15. Click Apply and OK.
  16. Make a copy of the HBLL_by_zip_2019 layer, rename it HBLL_by_zip_2020, and update the definition query for that layer to show the data for 2020.

    You now have a separate layer showing the count of high blood lead level cases for each year.

    Next, you'll explore two different aggregation methods to achieve your organization's minimum threshold value. Your leadership has determined that if 5 or more observations occur in an area, like a ZIP Code, you can display data for that ZIP Code in a product that will be released publicly.

  17. On the map, click the central ZIP Code polygon with the lowest count of cases.

    Central ZIP Code polygon with few cases

    The top layer in the Contents pane, HBLL_by_zip_2020, displays first.

    Pop-up for 2020 values

    In 2020, there were only two cases in this ZIP Code polygon. This number is fewer than the minimum value of five cases that your organization has specified for releasing data by ZIP Codes.

  18. In the Pop-up pane, under HBLL_by_zip_2019, click Sacramento to view the attributes for 2019.

    Results for 2019

    There were three cases in this ZIP Code in 2019. You could release combined data for this ZIP Code for 2019 and 2020, since the sum of the values for these two years is five.

  19. Close the pop-up.

Combine data for multiple years

One method of meeting your organization's minimum threshold value is aggregating multiple years of data until you obtain a minimum of five cases in each ZIP Code. This approach decreases temporal resolution to maintain spatial resolution.

  1. On the ribbon, on the Map tab, in the Selection group, click Select by Attributes.
  2. In the Select By Attributes pane, for Input Rows, choose High_Blood_Level_Results.

    Input Rows

  3. Click Select a field and choose Blood Level Test Year.
  4. Accept the default operator, is equal to.
  5. Click the drop-down list for the comparison value and choose 2020.

    Completed clause

  6. Click Add Clause.

    Add Clause button

    The default logical operator for combining clauses for the query is And. This operator allows you to build queries to select features where one field's value is something and another field's value is something else, or where values are within a range, if you are using greater-than and less-than comparisons. However, in this case, you'll build the query to select features where the test year is 2020 or 2019.

  7. Click the And logical operator and choose Or.

    Or option

  8. Set the field to Blood Level Test Year and accept the default is equal to operator.
  9. Click the value drop-down list and choose 2019.

    Select By Attributes tool parameters

    The Select By Attributes tool is ready to select features with values of 2020 or 2019 in the Blood Level Test Year field.

  10. Click OK.

    The High_Blood_Level_Results features recorded for 2020 or for 2019 are selected. You can't see them on the map because the High_Blood_Level_Results layer is turned off. However, below the map view, a count of 270 selected features is listed.

    Selection count

    Next, you'll run the Summarize Within tool to get the counts by ZIP Code of the selected features.

  11. On the ribbon, click the Analysis tab. In the Geoprocessing group, click Tools.
  12. Search for and open the Summarize Within tool.
  13. For Input Polygons, choose Sacramento_Zip_Codes.
  14. For Input Summary Features, choose High_Blood_Level_Results.
  15. For Output Feature Class, type HBLL_by_zip_2019_2020.

    Summarize Within tool for 2019 and 2020 cases

    The Summarize Within tool warns you that there is a selection on the input and only that subset of records will be processed, which is what you want.

  16. Click Run.

    The new HBLL_by_zip_2019_2020 layer is added to the Contents pane.

  17. In the Contents pane, right-click the HBLL_by_zip_2019_2020 layer and choose Attribute Table.
  18. Right-click the column header for Count of Points and choose Sort Ascending.

    Sort Ascending option

    The sorted column shows that there are no ZIP Code polygons in this layer that have fewer than five cases.

    Sorted column

    According to your organization's minimum threshold value, the grouped counts for 2019 and 2020 can be released at the ZIP Code level.

  19. Close the attribute table.

    You'll clear the selection so it does not affect other tools.

  20. Right-click anywhere on the map and click Clear.

    Clear option
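
The threshold logic used in this section can be sketched in standalone Python. The per-ZIP counts below are fabricated; the threshold of 5 matches the organization's rule described above:

```python
# Fabricated per-ZIP counts for single years; real counts come from
# the Summarize Within outputs.
counts_2019 = {"95821": 48, "95825": 3, "95834": 9}
counts_2020 = {"95821": 26, "95825": 2, "95834": 6}

THRESHOLD = 5  # organization's minimum cases for public release

# Single-year release: suppress any ZIP below the threshold.
suppressed_2020 = {z for z, n in counts_2020.items() if n < THRESHOLD}

# Pooled release: combine the two years, trading temporal resolution
# for spatial resolution.
pooled = {z: counts_2019.get(z, 0) + counts_2020.get(z, 0)
          for z in set(counts_2019) | set(counts_2020)}
releasable = all(n >= THRESHOLD for n in pooled.values())

print(suppressed_2020)  # {'95825'}
print(releasable)       # True
```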

Merge ZIP Code geometries

Suppose you needed to report the data for 2020 and not include 2019 data. You'll use a second method to meet your organization's minimum threshold: aggregating ZIP Codes for a single year until there are more than five cases in each aggregated area. This approach decreases spatial resolution to maintain temporal resolution.

  1. Open the Geoprocessing pane and click the Back button.
  2. Search for build balanced zones. In the list of results, click Build Balanced Zones.

    Build Balanced Zones tool in the search results

  3. For Input Features, choose HBLL_by_zip_2020.

    A note appears that the input has a filter. This is because there is a definition query on the layer, filtering it to only show the 2020 data.

  4. For Output Features, type HBLL_2020_Zones.

    Build Balanced Zones tool input and outputs

  5. Confirm Zone Creation Method is set to Attribute target.
  6. Under Zone Building Criteria With Target, for Variable, choose Count of Points [Point_Count_1].

    Variable set to Count of Points [Point_Count_1]

  7. For Sum, type 12.

    This value is higher than the organization's minimum value of 5. The Build Balanced Zones tool uses the target values to guide a randomly seeded genetic algorithm, but the results will only approximate those targets, so if you set a lower value, it is likely that some zones would have fewer than five cases.

  8. For Spatial Constraints, choose Contiguity edges only.

    Build Balanced Zones tool parameters

    The Build Balanced Zones tool is ready to run.

    Note:

    If you had other criteria for the zones, such as a minimum population, you could add another variable and value, but for this task, making zones with a target of at least 12 cases is enough. You can read more about the tool in the documentation.

  9. Click Run.

    The results are added to the map.

  10. In the Contents pane, turn off all layers except for HBLL_2020_Zones.

    Result of the Build Balanced Zones tool

    The original ZIP Code polygons are retained, but they have new attributes allocating them to different zones. You'll dissolve the polygons so there is one feature per zone.

  11. In the Geoprocessing pane, click the Back button.
  12. Search for and open the Pairwise Dissolve tool.

    Pairwise Dissolve tool in the search results

  13. For Input Features, choose HBLL_2020_Zones.
  14. For Output Feature Class, type HBLL_2020_Zip_Dissolve.

    Pairwise Dissolve input and output feature classes

  15. For Dissolve Fields, choose Zone ID.

    Dissolve Fields parameter

  16. For Statistics Fields, choose Count of Points. Confirm Statistic Type is set to Sum.
  17. Uncheck Create multipart features.

    Statistics Fields parameters

  18. Click Run.

    The dissolved zones layer is added to the map.

    Result of the Pairwise Dissolve tool

  19. In the Contents pane, right-click HBLL_2020_Zip_Dissolve and choose Attribute Table.

    The point counts for the zones are all greater than 5, and most have 12 or more points. This is in line with your organization's guidance.

    SUM_Point_Count_1 field

  20. Close the attribute table.

    As the analyst for the Childhood Lead Poisoning Prevention Program, you must consider which method is most appropriate to provide meaningful and actionable data for jurisdictions that often have their data suppressed. Aggregating across years means your end user cannot discern temporal variation across the aggregated years, but they can see numbers for small geographic areas that might otherwise be suppressed. Aggregating multiple ZIP Codes may make strong temporal trends identifiable as each single year is mapped, but the geographic specificity will be diminished. Each method must be weighed against the target audience and purpose for reporting and data sharing.
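
For intuition only, the zone-building idea can be illustrated with a simplified greedy sketch in standalone Python. This is not how Build Balanced Zones works internally (it uses a randomly seeded genetic algorithm); the adjacency, counts, and merge rule below are all fabricated for illustration:

```python
# Fabricated adjacency and case counts for five toy ZIP Codes laid out
# in a line: A-B-C-D-E.
neighbors = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B", "D"},
    "D": {"C", "E"},
    "E": {"D"},
}
counts = {"A": 2, "B": 11, "C": 3, "D": 1, "E": 13}
TARGET = 12  # above the release minimum of 5, as in the tutorial

zones = []
unassigned = dict(counts)
while unassigned:
    # Seed each zone with the largest remaining ZIP, then absorb the
    # alphabetically first contiguous neighbor until the target is met.
    seed = max(unassigned, key=unassigned.get)
    members, total = {seed}, unassigned.pop(seed)
    frontier = neighbors[seed] & unassigned.keys()
    while total < TARGET and frontier:
        nxt = min(frontier)
        frontier.discard(nxt)
        members.add(nxt)
        total += unassigned.pop(nxt)
        frontier = (frontier | neighbors[nxt]) & unassigned.keys()
    zones.append((members, total))

print([(sorted(m), t) for m, t in zones])
# [(['E'], 13), (['A', 'B'], 13), (['C', 'D'], 4)]
```

Notice that the leftover zone falls short of the target; this is one reason a target well above the minimum (12 rather than 5) and a more sophisticated balancing algorithm are used in practice.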

Add coordinate values to points

Up to this point, you've been creating maps for your stakeholders that are focused on questions relating to the extent of high blood lead levels in Sacramento County, how many cases there are overall, and various ways to look at the spatial and temporal patterns in the data.

Next, you'll work with your health equity team. They would like to do some research to determine whether any other factors, such as sex, race or ethnicity, and age, are associated with high blood lead levels in children. To help them with their work, you must be able to provide a de-identified point-level dataset that includes all the variables of interest for each child, as well as their general location. You'll use coordinate rounding to accomplish this task and check some statistics to justify the rounding levels.

First, you'll add attributes with latitude and longitude values in decimal degrees to your point features.

  1. In the Geoprocessing pane, search for and open the Calculate Geometry Attributes tool.

    Calculate Geometry Attributes tool in the search results

  2. For Input Features, choose High_Blood_Level_Results.
  3. Under Geometry Attributes, for Field (Existing or New), type Latitude.

    Field (Existing or New) set to Latitude

    This will add a new field to the attribute table to store the latitude values for each point.

  4. For Property, choose Point y-coordinate.

    Point y-coordinate selected for the Latitude field

    The y-coordinate value from each point will be added to the Latitude field.

  5. In the second row, for Field (Existing or New), type Longitude. For Property, choose Point x-coordinate.
  6. For Coordinate Format, choose Decimal Degrees.

    Coordinate Format set to Decimal Degrees

  7. Click the Select coordinate system button.

    Select coordinate system button

  8. In the Coordinate System window, search for WGS 1984.
  9. Expand Geographic Coordinate System and expand World. Click WGS 1984.

    WGS 1984 coordinate system

  10. Click OK.
  11. In the Calculate Geometry Attributes tool, click Run.
  12. In the Contents pane, right-click the High_Blood_Level_Results layer and choose Attribute Table. Scroll to the end of the table until you see the new Latitude and Longitude fields.

    New fields

    Now that you have the latitude and longitude values of the points stored in attributes, you can create new fields to hold the rounded values and calculate the new rounded values.

    Note:

    There are several ways to manipulate the latitude and longitude coordinates that represent the point locations of your high blood lead level cases. You could truncate or round the coordinates, snapping each point location to a lower resolution grid across the study area. You could also perturb the locations by replacing the last digit or two of each coordinate with a random number. This moves each point by a random distance and direction.
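
The three approaches mentioned in the note can be compared in a few lines of standalone Python. The coordinate is fabricated, and the perturbation shown adds a seeded random offset within the grid cell, which is one simple way to randomize the trailing digits:

```python
import math
import random

lat, lon = 38.61234, -121.39876  # fabricated coordinate near Sacramento

# 1. Rounding: snap to the nearest hundredth of a degree (a fixed grid).
lat_round = round(lat, 2)                # 38.61

# 2. Truncation: drop digits instead of rounding (always shifts toward zero).
lat_trunc = math.trunc(lat * 100) / 100  # 38.61

# 3. Perturbation: add a random offset within the grid cell, moving each
#    point a random distance and direction. Seeded for repeatability here;
#    a real workflow would not reuse a fixed seed.
rng = random.Random(42)
lat_pert = round(lat, 2) + rng.uniform(-0.005, 0.005)
lon_pert = round(lon, 2) + rng.uniform(-0.005, 0.005)
```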

Add fields for rounded coordinates

You'll make two fields to hold the rounded coordinate values.

  1. Right-click High_Blood_Level_Results, point to Data Design, and choose Fields.

    The fields table appears. It lists each field in the High_Blood_Level_Results layer as a row. You'll use the table to add two new fields to the layer.

  2. Scroll to the bottom of the list of fields.
  3. Click the row header for Latitude. Press Ctrl while clicking the row header for Longitude.

    Latitude and Longitude fields

  4. Right-click the row header for Latitude and choose Copy.

    Copy option

  5. Right-click the row header for Latitude and choose Paste.

    Two new rows appear in the table, named Latitude1 and Longitude1. You'll change the names and aliases of the copied fields.

  6. In the Field Name column, double-click Latitude1 and type LatitudeRound.

    Latitude1 field changed to LatitudeRound

  7. Rename Longitude1 to LongitudeRound.
  8. In the Alias column, in the LatitudeRound row, type Latitude Rounded.
  9. In the Alias column, in the LongitudeRound row, type Longitude Rounded.

    Fields with names and aliases set

    The names and aliases for the copied fields are set.
  10. On the ribbon, on the Fields tab, in the Manage Edits group, click Save.

    Save button for field changes

    The two new fields are added to the table schema for the High_Blood_Level_Results feature class.

  11. Close the Fields view.

Round the coordinates

Next, you'll calculate rounded coordinate values and store them in the new fields.

  1. In the attribute table for the High_Blood_Level_Results layer, right-click Latitude Rounded and choose Calculate Field.

    Calculate Field option

  2. In the Calculate Field window, for Expression Type, choose Arcade.

    Expression Type parameter

    Arcade is a lightweight expression language written for ArcGIS.

  3. In the expression box, type or copy and paste the following Arcade expression:

    Round($feature.Latitude,2)

    Expression parameter

    This code uses the Round function, setting the value of the Latitude Rounded field to be equal to the value in the Latitude field, rounded to two decimal places. This rounds off the location information of the points to the nearest hundredth of a degree.

  4. Click the Verify button.

    Verify button

  5. Click Apply.

    The rounded values are calculated and added to the attribute table in the Latitude Rounded field.

    Latitude Rounded field

    You'll use the same method to calculate the values for the Longitude Rounded field.

  6. In the Calculate Field window, for Field Name (Existing or New), choose Longitude Rounded.
  7. In the expression box, replace the existing expression with the following:

    Round($feature.Longitude,2)

  8. Click OK.

    The Latitude Rounded and Longitude Rounded fields are rounded to two decimal places.

    Longitude Rounded values

    Note:

    If your coordinates were in a planar spatial reference, such as California State Plane or UTM, the coordinate values would be in linear units rather than in decimal degrees. In that case, you would need to calculate an appropriate spacing for your rounded points and round to that spacing. For example, you might choose to round to the nearest 1,000 feet, or 100 meters, depending on the units and the amount of displacement that you want.

  9. Close the attribute table.
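
The planar case described in the note amounts to snapping each coordinate to a chosen grid spacing. A minimal sketch, assuming fabricated UTM-style coordinates in meters:

```python
def snap(value, spacing):
    """Round a planar coordinate to the nearest grid interval,
    e.g. the nearest 100 meters."""
    return round(value / spacing) * spacing

# Fabricated easting and northing values in meters.
x, y = 630_247.3, 4_271_983.6
print(snap(x, 100), snap(y, 100))  # 630200 4272000
```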

Create new points

Now that you have the rounded values in two fields, you'll create new points at these locations.

  1. In the Geoprocessing pane, search for and open the Make XY Event Layer tool.

    Make XY Event Layer tool

  2. For XY Table, choose High_Blood_Level_Results.
  3. For X Field, choose Longitude [LongitudeRound].
  4. For Y Field, choose Latitude [LatitudeRound].
  5. For Output Layer Name, type High_Blood_Level_Results_Rounded.
  6. Ensure that Spatial Reference is set to GCS_WGS_1984.

    Make XY Event Layer tool parameters

    With these parameters, the tool will make a new layer of points using the rounded latitude and longitude values that you calculated.

  7. Click Run.
  8. In the Contents pane, turn off all layers except for High_Blood_Level_Results_Rounded and World Street Map.

    Rounded points locations

    The points made from the rounded coordinate values are arranged in a grid-like formation, spaced at hundredth of a degree intervals. This approach moves points from their original locations but can preserve some of the original spatial pattern, which may be useful for analysis.

    Original heat map

    Original points heat map

    Rounded heat map

    Rounded coordinates points heat map

    Caution:

    Once the point level positions have been masked by a method such as coordinate rounding, you should still remove unneeded identifying PHI such as names, birthdays, address fields, and the original coordinate values from the attribute table before releasing that data to your authorized internal colleagues. Moving the points to rounded coordinate values does not protect PHI if you still provide the original address or coordinates.

    You can use the Export Features tool to export a copy of a feature class to share with an authorized member of your organization. On this tool, in the Fields section, you have access to the list of fields, where you can choose to delete fields that contain PHI that are not required for the project.

Document the rounding results

For Expert Determination de-identification, it is necessary to quantify and document the extent to which the points have been moved. You'll review some statistics related to the point movement from the coordinate rounding method and summarize how many points were moved to each grid point.

  1. Search for and open the XY To Line tool.

    XY To Line tool in the search results

  2. For Input Table, choose High_Blood_Level_Results_Rounded.
  3. For Output Feature Class, type HBLL_dist.

    XY To Line tool input and output parameters

    This line feature class will connect each of the original points' coordinates to their corresponding rounded coordinate location. You'll use the line features to calculate the amount of displacement.

  4. For Start X Field, choose Longitude [Longitude].
  5. For Start Y Field, choose Latitude [Latitude].
  6. For End X Field, choose Longitude [LongitudeRound].
  7. For End Y Field, choose Latitude [LatitudeRound].

    XY To Line tool coordinate parameters

  8. For Line Type, choose Geodesic.

    A geodesic line represents the shortest distance between two points on the surface of the earth.

  9. Leave the ID field empty.
  10. For Spatial Reference, accept the default value of GCS_WGS_1984.

    XY To Line tool parameters

  11. Click Run.

    The HBLL_dist layer is added to the map. Depending on the zoom level and extent of your map, it may be difficult to see. If you zoom in to one of the higher-density areas, you'll see that a set of lines connect each of the original points to their corresponding rounded coordinate point locations.

    Zoomed view of the HBLL_dist lines

  12. In the Contents pane, right-click the HBLL_dist layer and choose Attribute Table.

    The values in the Shape_Length field are small decimal values—they are in degrees. You'll convert the lengths to planar units.

    Attribute table of the HBLL_dist layer

Add a distance field

You'll add a new field to the attribute table of the HBLL_dist layer and calculate its value to get the distances that the points were displaced.

  1. In the attribute table, click Add.

    Add button

    The Fields table appears. You'll add a new field to hold the distances in linear units.

  2. In the Field Name column, in the bottom row, type Distance.
  3. In the Data Type column, in the bottom row, choose Double.

    Field Name column for the new field

  4. On the ribbon, on the Fields tab, in the Manage Edits group, click Save.

    Save button

  5. Close the Fields view.
  6. In the attribute table, right-click the column header for the Distance field and choose Calculate Geometry.

    Calculate Geometry option

  7. In the Calculate Geometry window, for Property, choose Length (geodesic).

    Length (geodesic) option

  8. For Length Unit, choose Meters.

    Length Unit parameter

  9. Click OK.

    The lengths of the lines, in meters, are added as attributes in the Distance field.

  10. Right-click the Distance column header and choose Visualize Statistics.

    Visualize Statistics option

    A chart and the Chart Properties pane appear.

    In the Chart Properties pane, the Statistics section shows summary statistics for the Distance field. These statistics show that the mean distance that points moved to the rounded coordinate location was 377 meters, with a minimum distance of 19 meters and a maximum distance of 685 meters.

    Distance statistics results

    The chart view shows a histogram of the distance values that you could use to defend your decisions in making this de-identified product using coordinate rounding.

    Distance histogram

  11. Close the Chart Properties pane, the chart, and the attribute table.
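
If you wanted to reproduce these displacement statistics outside ArcGIS, a standalone Python sketch might look like the following. The points are fabricated, and the haversine formula is a spherical approximation that is close enough to the geodesic length for displacements this small:

```python
import math
import statistics

def geodesic_m(lat1, lon1, lat2, lon2):
    """Approximate great-circle distance in meters (haversine on a sphere)."""
    r = 6_371_000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Fabricated original points; the rounding matches the tutorial's two
# decimal places.
points = [(38.61234, -121.39876), (38.60411, -121.41559), (38.59872, -121.40218)]
distances = [
    geodesic_m(lat, lon, round(lat, 2), round(lon, 2))
    for lat, lon in points
]

# Summary statistics comparable to those in the Chart Properties pane.
print(round(statistics.mean(distances)), round(min(distances)), round(max(distances)))
```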

Count points at rounded coordinates

Next, you'll calculate how many stacked points there are after using coordinate rounding. For purposes of analyzing privacy and de-identification, you can think of this count as representing how many cases are in the pool that could represent the identity of any single case. The more cases you have in each stack, the bigger the pool and the better for de-identification purposes. You'll analyze the points geographically but recognize that you'll also need to review the uniqueness of all attributes that you've retained in a table that you plan to share, since a particular combination of attributes could also identify an individual. For this reason, it is recommended that you provide the minimum viable dataset to your stakeholders.
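
The counting idea can be sketched in standalone Python (fabricated rounded coordinates; Collect Events performs the equivalent operation on your feature class):

```python
from collections import Counter

# Fabricated rounded coordinates; each tuple is one case snapped to the grid.
rounded = [
    (38.61, -121.40), (38.61, -121.40), (38.61, -121.40),
    (38.60, -121.42), (38.60, -121.42),
    (38.59, -121.40),
]

# The Collect Events idea: one output point per unique location, weighted
# by how many cases share it. That weight is the "pool" for de-identification.
stacks = Counter(rounded)
print(stacks.most_common())
# [((38.61, -121.4), 3), ((38.6, -121.42), 2), ((38.59, -121.4), 1)]

# Flag locations whose pool is below a chosen anonymity floor, e.g. k = 2.
risky = [loc for loc, n in stacks.items() if n < 2]
print(risky)  # [(38.59, -121.4)]
```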

  1. In the Contents pane, turn off the High_Blood_Level_Results_Rounded and HBLL_dist layers.
  2. In the Geoprocessing pane, search for and open the Collect Events tool.
  3. For Input Incident Features, choose High_Blood_Level_Results_Rounded.
  4. For Output Weighted Point Feature Class, type HBLL_rounded_counts.

    Collect Events tool

  5. Click Run.

    HBLL Collect Events results

    In this case, some of the clusters have as many as 15 points stacked, although many have only one or two. With a larger dataset, you might have more densely stacked points.

    You've used coordinate rounding to mask the locations of sensitive point data while allowing you to keep several additional attributes associated with the points. The health equity researchers now have the best opportunity to perform additional analysis and tell a more complete story about childhood blood lead poisoning in Sacramento using the de-identified data. To document your de-identification method, you calculated statistics related to the offset distance for each point and counted the pool of points in each grid location stack. Remember that it is also important to remove attributes that could lead to re-identification (such as address or the original location coordinates), and it is a best practice to minimize the number of attributes in the dataset you provide.

  6. Save the project.

Review advanced approaches

You have learned several approaches for de-identifying data for different use scenarios. In some situations, you may need to adopt more advanced methods. You'll learn about two advanced methods of data de-identification: geomasking and differential privacy.

Depending on where your health GIS work takes you, you may want to dive in deeper and do your own research on the following techniques so you can apply them as needed.

Geomasking

The term geomasking refers to a group of methods that change the geographic location of individual points, but in a different and more powerful way than coordinate rounding. There are two key aspects needed to make geomasking useful. First, the perturbation of the point must be unpredictable—that is what protects confidentiality in the data. Second, the point should be moved in a way that preserves spatial relationships within the dataset. After all, your GIS work is about finding patterns. In the notes that follow, you will be introduced to a specific type of geomasking—the donut method. You will then learn how to statistically evaluate the geomasking result with k-anonymity. Finally, you will be presented with a tool that automates the entire process for you.

Donut method for geomasking

The basic idea behind donut geomasking is that it improves confidentiality by ensuring that the randomly moved point can never end up in its original position. That means that a point must be displaced a minimum distance away from its original location. At the same time, to preserve spatial patterns, there is a calculated maximum displacement for each point as well. Those two distances create a donut-shaped displacement zone within which the original point may be moved. You can learn more about the donut method in this article.

Donut geomasking diagram
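The donut displacement described above can be sketched in a few lines of Python: pick a random bearing, then a random distance between the minimum and maximum radii. The coordinates and radii below are hypothetical, and a projected coordinate system (units in meters) is assumed; this is an illustration of the technique, not a production implementation.

```python
import math
import random

def donut_geomask(x, y, r_min, r_max, rng=random):
    """Displace a point a random distance in [r_min, r_max] at a random bearing.

    r_min guarantees the point never remains at its true location; r_max bounds
    the displacement so spatial patterns are preserved. Distances are in the
    same units as the coordinates (projected, e.g. meters).
    """
    angle = rng.uniform(0, 2 * math.pi)
    # Sample the radius so points fall uniformly over the donut's area,
    # not uniformly over its radius (which would over-weight the inner ring).
    r = math.sqrt(rng.uniform(r_min ** 2, r_max ** 2))
    return x + r * math.cos(angle), y + r * math.sin(angle)

# Hypothetical point and radii
mx, my = donut_geomask(1000.0, 2000.0, r_min=50, r_max=200)
dist = math.hypot(mx - 1000.0, my - 2000.0)
# dist always falls between r_min (50) and r_max (200)
```

Because the displacement is random, the masked location is unpredictable, which is the property that protects confidentiality; the bounded radius is what preserves the spatial pattern.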

K-anonymity

The Expert Determination method of de-identification includes a requirement to document the process and justify how that process achieves a very low risk of re-identifying an individual. When you use the geomasking technique, the K-anonymity statistic is the evaluative measure that supports that justification. You can read more about K-anonymity. The general idea is that K-anonymity represents the number of households in your dataset from which a de-identified subject cannot be distinguished. For example, if you decided that the minimum value for K is five (written as Kmin = 5), you're saying that there are at least five households (or individuals) that could potentially represent your original point.

The key decision for your organization is what minimum value of K is acceptable for privacy protection. There is no single standard; however, it may be useful to review the policies of various state and federal agencies on small cell counts. A small cell is a reported count that covers only a few people who share the same combination of attributes. Aligning with the policy of authoritative government agencies may help support your organization's decision about developing its own standard. Also consider that a single standard value for K may not be appropriate for every situation.
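One simple way to estimate K for a masked point is to count the candidate households that fall within the maximum displacement radius of the masked location, since any of them could plausibly be the true origin. The coordinates below are hypothetical (projected, in meters), and real evaluations use more refined estimators; this is only a sketch of the idea.

```python
import math

def estimate_k(masked_point, households, r_max):
    """Estimate k-anonymity for one masked point: the number of households
    within r_max of the masked location, i.e. the pool of locations the
    original point could plausibly have come from."""
    mx, my = masked_point
    return sum(1 for hx, hy in households
               if math.hypot(hx - mx, hy - my) <= r_max)

# Hypothetical household locations (meters, projected coordinates)
households = [(0, 0), (30, 40), (100, 0), (500, 500)]

k = estimate_k((10, 10), households, r_max=150)
# Three households fall within 150 m of the masked point, so k == 3
```

If your organization set Kmin = 5, this point would fail the check (k = 3 < 5), signaling that a larger displacement radius or additional aggregation is needed before sharing.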

Differential privacy

Differential privacy is a newer technique that many believe is superior for protecting individual privacy. It works best with larger datasets. In fact, this is the method that the United States Census Bureau used for data reporting starting with the 2020 census. With differential privacy, all the data within a dataset are mathematically changed in a way that protects the identity of any individual while maintaining the usefulness of the dataset as a whole. Noise is injected into the dataset according to a parameter, epsilon, referred to as the privacy-loss budget. The use of epsilon means that the disclosure risk for the data can be quantified, which is useful for adherence to organizational policies as well as for the documentation required for Expert Determination.
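The noise-injection step can be sketched with the classic Laplace mechanism: noise drawn from a Laplace distribution with scale sensitivity/epsilon is added to each released statistic. The count, epsilon value, and function name below are hypothetical, and real deployments (such as the Census Bureau's) use far more elaborate systems; this only illustrates how epsilon controls the noise.

```python
import random

def dp_count(true_count, epsilon, sensitivity=1.0, rng=random):
    """Release a count with Laplace noise calibrated to epsilon, the
    privacy-loss budget. Smaller epsilon means more noise and stronger privacy.

    A Laplace(0, scale) draw is generated as the difference of two
    exponential draws, which avoids needing a dedicated Laplace sampler.
    """
    scale = sensitivity / epsilon
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return true_count + noise

rng = random.Random(42)  # fixed seed so the example is reproducible
noisy = dp_count(120, epsilon=0.5, rng=rng)
# noisy is the true count of 120 plus zero-mean noise with scale 2
```

Note the trade-off epsilon expresses: halving epsilon doubles the noise scale, buying more privacy at the cost of accuracy, which is why the parameter is described as a budget.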

One way to think about how differential privacy works is to imagine one of those picture mosaics, where hundreds of ordinary pictures are put together in such a way that they create a new bigger image. Zooming in to the individual picture level, you could replace several pictures or move them around to different places and still, when you zoom out, the overall image will look essentially the same. The big picture may not be quite as sharp as a photograph, but the quality is improved as you add more individual pictures.

There is still a lot to learn about differential privacy and its value for health GIS. This is an area for you to be aware of because you may already be consuming census data that's been shared using this method and because there may be tools that enable this technique in your own geospatial work.

To learn more about the impact of differential privacy on 2020 United States Census data, see the June 2022 Esri methodology report, as well as this handbook on disclosure avoidance from the United States Census Bureau.

This tutorial on data de-identification for visualization and sharing provided a review of HIPAA, the United States law focused on protecting the privacy of personal health information. You learned several techniques that allow you to map and visualize the information safely, along with techniques that help you share the data, whether in a dynamic web map or as a dataset for others who may use it for research or other purposes. Finally, you learned about some advanced techniques that you may call on when you need more powerful options for retaining point-level data.

One tutorial cannot cover every situation. In this tutorial, you learned how to think spatially about the problem and consider the advantages and drawbacks of various methods. No matter what techniques you use as you work with protected health information, think carefully and check with your internal organizational guidelines to stay aligned and stay safe.

You can find more tutorials in the tutorial gallery.