2 Map Neighborhoods
When considering the health of persons, we have to also consider the neighborhood environment. Sometimes this is looking at neighborhood level health outcomes, like premature mortality at the census tract scale, or cumulative COVID rates by zip code. Sometimes we’re interested in neighborhood factors like poverty, access to affordable housing, or distance to nearest health provider, or pollution-emitting facility. These measurements of the “social determinants of health” at the neighborhood scale are increasingly urgent in modern public health thinking, and are thought to drive and/or reinforce racial, social, and spatial inequity
In this module, we’ll learn about the basics of thematic mapping – known as choropleth mapping – to visualize neighborhood level health phenomena. This will allow you to begin the process of exploratory spatial data analysis and hypothesis generation & refinement.
2.1 Clean Attribute Data
Let’s consider COVID-19 cases by zip code in Chicago. We’ll upload and inspect a summary of cases from the Chicago Data Portal first:
## ZIP.Code Week.Number Week.Start Week.End Cases...Weekly
## 1 60603 39 09/20/2020 09/26/2020 0
## 2 60604 39 09/20/2020 09/26/2020 0
## 3 60611 16 04/12/2020 04/18/2020 8
## 4 60611 15 04/05/2020 04/11/2020 7
## 5 60615 11 03/08/2020 03/14/2020 NA
## 6 60603 10 03/01/2020 03/07/2020 NA
## Cases...Cumulative Case.Rate...Weekly Case.Rate...Cumulative
## 1 13 0 1107.3
## 2 31 0 3964.2
## 3 72 25 222.0
## 4 64 22 197.4
## 5 NA NA NA
## 6 NA NA NA
## Tests...Weekly Tests...Cumulative Test.Rate...Weekly
## 1 25 327 2130
## 2 12 339 1534
## 3 101 450 312
## 4 59 349 182
## 5 6 9 14
## 6 0 0 0
## Test.Rate...Cumulative Percent.Tested.Positive...Weekly
## 1 27853.5 0.0
## 2 43350.4 0.0
## 3 1387.8 0.1
## 4 1076.3 0.1
## 5 21.7 NA
## 6 0.0 NA
## Percent.Tested.Positive...Cumulative Deaths...Weekly
## 1 0.0 0
## 2 0.1 0
## 3 0.2 0
## 4 0.2 0
## 5 NA 0
## 6 NA 0
## Deaths...Cumulative Death.Rate...Weekly Death.Rate...Cumulative
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## Population Row.ID ZIP.Code.Location
## 1 1174 60603-39 POINT (-87.625473 41.880112)
## 2 782 60604-39 POINT (-87.629029 41.878153)
## 3 32426 60611-16 POINT (-87.620291 41.894734)
## 4 32426 60611-15 POINT (-87.620291 41.894734)
## 5 41563 60615-11 POINT (-87.602725 41.801993)
## 6 1174 60603-10 POINT (-87.625473 41.880112)
Each row corresponds to a zip code at a different week. This data thus exists as a “long” format, which doesn’t work for spatial analysis. We need to convert to “wide” format, or at the very least, ensure that each zip code corresponds to one row.
To simplify, let’s identify the last week of the dataset, and then subset the data frame to only show that week. We will be interested in the cumulative case rate. Following is one way of doing this – can you think of another way? Try out different approaches of reshaping data to test your R and “tidy” skills.
## [1] 1 31
## ZIP.Code Week.Number Week.Start Week.End Cases...Weekly
## 1 60603 39 09/20/2020 09/26/2020 0
## 2 60604 39 09/20/2020 09/26/2020 0
## 36 60601 39 09/20/2020 09/26/2020 8
## 37 60602 39 09/20/2020 09/26/2020 0
## 41 60605 39 09/20/2020 09/26/2020 12
## 66 60610 39 09/20/2020 09/26/2020 35
## Cases...Cumulative Case.Rate...Weekly Case.Rate...Cumulative
## 1 13 0 1107.3
## 2 31 0 3964.2
## 36 213 54 1451.4
## 37 21 0 1688.1
## 41 391 44 1420.8
## 66 666 90 1706.9
## Tests...Weekly Tests...Cumulative Test.Rate...Weekly
## 1 25 327 2130
## 2 12 339 1534
## 36 202 4304 1376
## 37 27 460 2170
## 41 291 7160 1058
## 66 500 10680 1281
## Test.Rate...Cumulative Percent.Tested.Positive...Weekly
## 1 27853.5 0.0
## 2 43350.4 0.0
## 36 29328.8 0.0
## 37 36977.5 0.0
## 41 26018.4 0.0
## 66 27371.3 0.1
## Percent.Tested.Positive...Cumulative Deaths...Weekly
## 1 0.0 0
## 2 0.1 0
## 36 0.0 1
## 37 0.0 0
## 41 0.1 1
## 66 0.1 0
## Deaths...Cumulative Death.Rate...Weekly Death.Rate...Cumulative
## 1 0 0.0 0.0
## 2 0 0.0 0.0
## 36 6 6.8 40.9
## 37 0 0.0 0.0
## 41 3 3.6 10.9
## 66 10 0.0 25.6
## Population Row.ID ZIP.Code.Location
## 1 1174 60603-39 POINT (-87.625473 41.880112)
## 2 782 60604-39 POINT (-87.629029 41.878153)
## 36 14675 60601-39 POINT (-87.622844 41.886262)
## 37 1244 60602-39 POINT (-87.628309 41.883136)
## 41 27519 60605-39 POINT (-87.623449 41.867824)
## 66 39019 60610-39 POINT (-87.63581 41.90455)
To clean our data a bit, we’ll just keep the zip code name, and cumulative case rate for the week of September 20th, 2020.
## ZIP.Code Case.Rate...Cumulative
## 1 60603 1107.3
## 2 60604 3964.2
## 36 60601 1451.4
## 37 60602 1688.1
## 41 60605 1420.8
## 66 60610 1706.9
2.2 Merge Spatial Data
Next, let’s merge this data to our zip code master spatial file. Reload if necessary:
## Reading layer `geo_export_54bc15d8-5ef5-40e4-8f72-bb0c6dbac9a5' from data source
## `/Users/maryniakolak/code/Intro2RSpatialMed/data/geo_export_54bc15d8-5ef5-40e4-8f72-bb0c6dbac9a5.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 61 features and 4 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -87.94011 ymin: 41.64454 xmax: -87.52414 ymax: 42.02304
## CRS: 4326
## Simple feature collection with 6 features and 4 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -87.80649 ymin: 41.88747 xmax: -87.59852 ymax: 41.93228
## CRS: 4326
## objectid shape_area shape_len zip geometry
## 1 33 106052287 42720.04 60647 MULTIPOLYGON (((-87.67762 4...
## 2 34 127476051 48103.78 60639 MULTIPOLYGON (((-87.72683 4...
## 3 35 45069038 27288.61 60707 MULTIPOLYGON (((-87.785 41....
## 4 36 70853834 42527.99 60622 MULTIPOLYGON (((-87.66707 4...
## 5 37 99039621 47970.14 60651 MULTIPOLYGON (((-87.70656 4...
## 6 38 23506056 34689.35 60611 MULTIPOLYGON (((-87.61401 4...
Next, merge on zip code ID. The key in the Chi_Zips object is zip
, whereas the key for the COVID data is ZIP.code
. Always merge non-spatial to spatial data, not the other way around. Think of the spatial file as your master file that you will continue to add on to…
#Chi_Zipsf <- merge(Chi_Zips, COVID.39f, by.x = "zip", by.y = "ZIP.Code", all = TRUE)
Chi_Zipsf <- merge(Chi_Zips, COVID.39f, by.x = "zip", by.y = "ZIP.Code")
head(Chi_Zipsf)
## Simple feature collection with 6 features and 5 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -87.63999 ymin: 41.85317 xmax: -87.60246 ymax: 41.88913
## CRS: 4326
## zip objectid shape_area shape_len Case.Rate...Cumulative
## 1 60601 27 9166246 19804.58 1451.4
## 2 60602 26 4847125 14448.17 1688.1
## 3 60603 19 4560229 13672.68 1107.3
## 4 60604 48 4294902 12245.81 3964.2
## 5 60605 20 36301276 37973.35 1420.8
## 6 60606 31 6766411 12040.44 2289.6
## geometry
## 1 MULTIPOLYGON (((-87.62271 4...
## 2 MULTIPOLYGON (((-87.60997 4...
## 3 MULTIPOLYGON (((-87.61633 4...
## 4 MULTIPOLYGON (((-87.63376 4...
## 5 MULTIPOLYGON (((-87.62064 4...
## 6 MULTIPOLYGON (((-87.63397 4...
2.3 Quantile Maps
Starting with a “classic epi” approach, let’s look at case rates as quantiles. We use the tmap
library, and update the choropleth data classification using the style
parameter. We use the Blue-Purple palette, or BuPu
, from Colorbrewer.
Colorbrewer Tip: To display all Colorbrewer palette options, load the RColorBrewer
library and run display.brewer.all()
– or just Google “R Colorbrewer palettes.”
library(tmap)
tmap_mode("plot")
tm_shape(Chi_Zipsf ) +
tm_polygons("Case.Rate...Cumulative",
style="quantile", pal="BuPu",
title = "COVID Case Rate")
Let’s try tertiles:
tm_shape(Chi_Zipsf ) +
tm_polygons("Case.Rate...Cumulative",
style="quantile", n=3, pal="BuPu",
title = "COVID Case Rate")
2.4 Standard Deviation Maps
While quantiles are a nice start, let’s classify using a standard deviation map. Standard deviation is a statistical technique type of map based on how much the data differs from the mean.
tm_shape(Chi_Zipsf ) +
tm_polygons("Case.Rate...Cumulative",
style="sd", pal="BuPu",
title = "COVID Case Rate")
2.5 Jenks Maps
Another approach of data classification is natural breaks, or jenks. This approach looks for “natural breaks” in the data using a univariate clustering algorithm.
tm_shape(Chi_Zipsf ) +
tm_polygons("Case.Rate...Cumulative",
style="jenks", pal="BuPu",
title = "COVID Case Rate")
The first bin doesn’t seem very intuitive. Let’s try 4 bins instead of 5 by changing the n
parameter. In this version, we’ll also had a histogram and scale bar, and move the legend outside the frame to make it easier to view.
tm_shape(Chi_Zipsf) +
tm_polygons("Case.Rate...Cumulative",
style="jenks", pal="BuPu",
legend.hist=T, n=4,
title = "COVID Case Rate", ) +
tm_scale_bar(position = "left") +
tm_layout(legend.outside = TRUE, legend.outside.position = "right")
2.6 Integrate More Data
To explore potential disparities in COVID health outcomes, let’s bring in pre-cleaned demographic, racial, and ethnic data from the Opioid Environment Policy Scan database. This data is orginally sourced from the American Community Survey 2018 5-year estimate, which you could also pull using the tidycensus
.
## ZCTA year totPopE whiteP blackP amIndP asianP pacIsP otherP hispP
## 1 35004 2018 11762 84.39 13.09 0.00 0.94 0.00 1.57 0.94
## 2 35005 2018 7528 55.22 42.44 0.64 0.00 0.15 1.55 1.37
## 3 35006 2018 2927 96.04 3.21 0.27 0.00 0.00 0.48 0.00
## 4 35007 2018 26328 73.83 13.75 0.04 1.33 0.02 11.01 11.11
## 5 35010 2018 20625 63.07 32.43 0.39 0.65 0.00 3.45 4.10
## 6 35013 2018 40 100.00 0.00 0.00 0.00 0.00 0.00 100.00
## noHSP age0_4 age5_14 age15_19 age20_24 age15_44 age45_49 age50_54
## 1 5.52 787 1950 457 746 4552 662 541
## 2 17.48 511 1055 455 277 2429 580 469
## 3 14.44 161 413 141 203 878 129 193
## 4 12.41 1891 4161 1619 1400 9947 1993 2067
## 5 22.00 1013 2647 1383 1087 7036 1418 1545
## 6 100.00 0 0 0 0 13 8 19
## age55_59 age60_64 ageOv65 ageOv18 age18_64 a15_24P und45P ovr65P
## 1 776 832 1662 8820 7158 10.23 61.97 14.13
## 2 560 552 1372 5691 4319 9.72 53.07 18.23
## 3 316 278 559 2308 1749 11.75 49.61 19.10
## 4 1713 1315 3241 19178 15937 11.47 60.77 12.31
## 5 1510 1341 4115 16142 12027 11.98 51.86 19.95
## 6 0 0 0 40 40 0.00 32.50 0.00
## disbP
## 1 12.7
## 2 23.2
## 3 20.9
## 4 13.5
## 5 19.6
## 6 0.0
Merge to our master Zip Code dataset.
## Simple feature collection with 6 features and 31 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -87.63999 ymin: 41.85317 xmax: -87.60246 ymax: 41.88913
## CRS: 4326
## zip objectid shape_area shape_len Case.Rate...Cumulative year
## 1 60601 27 9166246 19804.58 1451.4 2018
## 2 60602 26 4847125 14448.17 1688.1 2018
## 3 60603 19 4560229 13672.68 1107.3 2018
## 4 60604 48 4294902 12245.81 3964.2 2018
## 5 60605 20 36301276 37973.35 1420.8 2018
## 6 60606 31 6766411 12040.44 2289.6 2018
## totPopE whiteP blackP amIndP asianP pacIsP otherP hispP noHSP age0_4
## 1 14675 74.17 5.57 0.45 18.00 0.00 1.81 8.68 0.00 550
## 2 1244 68.17 3.78 5.31 19.45 0.00 3.30 6.51 0.00 61
## 3 1174 63.46 3.24 0.00 27.60 0.00 5.71 9.80 0.00 13
## 4 782 63.43 5.63 0.00 29.67 0.00 1.28 4.35 0.00 12
## 5 27519 61.20 17.18 0.18 16.10 0.03 5.31 5.84 2.39 837
## 6 3101 72.75 2.35 0.00 18.09 0.00 6.80 6.29 0.73 57
## age5_14 age15_19 age20_24 age15_44 age45_49 age50_54 age55_59
## 1 156 907 909 8726 976 1009 324
## 2 87 18 91 987 46 53 0
## 3 43 179 172 684 75 47 150
## 4 7 52 168 450 27 47 54
## 5 1279 2172 2282 16364 1766 1520 1824
## 6 44 0 139 1863 213 153 168
## age60_64 ageOv65 ageOv18 age18_64 a15_24P und45P ovr65P disbP
## 1 859 2075 13855 11780 12.37 64.27 14.14 6.4
## 2 5 5 1095 1090 8.76 91.24 0.40 0.2
## 3 50 112 1118 1006 29.90 63.03 9.54 7.3
## 4 92 93 744 651 28.13 59.97 11.89 4.1
## 5 1360 2569 25259 22690 16.19 67.15 9.34 5.3
## 6 172 431 3000 2569 4.48 63.33 13.90 1.9
## geometry
## 1 MULTIPOLYGON (((-87.62271 4...
## 2 MULTIPOLYGON (((-87.60997 4...
## 3 MULTIPOLYGON (((-87.61633 4...
## 4 MULTIPOLYGON (((-87.63376 4...
## 5 MULTIPOLYGON (((-87.62064 4...
## 6 MULTIPOLYGON (((-87.63397 4...
2.7 Thematic Map Panel
To facilitate data discovery, we likely want to explore multiple maps at once. Here we’ll generate maps for multiple variables, and plot them as a map panel.
Can you think of more efficient ways to run this code? There are also other tmap
tricks to optimize this further, so enjoy your journey!
COVID <- tm_shape(Chi_Zipsf) + tm_fill("Case.Rate...Cumulative",
style="jenks", pal="BuPu", n=4, title = "COVID Rt") +
tm_layout(frame = F)
Senior <- tm_shape(Chi_Zipsf) + tm_fill("ovr65P",
style="jenks", pal="BuPu", n=4) +
tm_layout(frame = F)
NoHS <- tm_shape(Chi_Zipsf) + tm_fill("noHSP",
style="jenks", pal="BuPu", n=4) +
tm_layout(frame = F)
BlkP <- tm_shape(Chi_Zipsf) + tm_fill("blackP",
style="jenks", pal="BuPu", n=4) +
tm_layout(frame = F)
Latnx <- tm_shape(Chi_Zipsf) + tm_fill("hispP",
style="jenks", pal="BuPu", n=4) +
tm_layout(frame = F)
WhiP <- tm_shape(Chi_Zipsf) + tm_fill("whiteP",
style="jenks", pal="BuPu", n=4) +
tm_layout(frame = F)
tmap_arrange(COVID, Senior, NoHS, BlkP, Latnx, WhiP)
From the results, we see that cumulative COVID outcomes for one week in September 2020 seemed to have some geographic correlation with the Latinx/Hispanic community in Chicago. At the same time, low high school diploma rates are also concentrated in these areas, and there is some intersection with other variables considered. What are additional variables you could bring in to refine your approach? Perhaps percentage of essential workers; a different age group; internet access? What about linking in health outcomes like Asthma, Hypertension, and more at a similar scale?
In modern spatial epidemiology, associations must never be taken at face value. For example, we know that it is not “race” but “racism” that drives multiple health disparities – simply looking at a specific racial/ethnic group is not enough. Thus exploring multiple variables and nurturing a curiosity to understand these complex intersections will support knowledge discovery.
2.8 Write Data
We’re done! Well… not so fast. Let’s save the data so we don’t have to run the codebook again to access the data. Here, we’ll save as a geojson
file. This spatial format is more forgiving with long column names, which is a long-standing challenge with shapefiles. But sometimes it can be written oddly, so double-check the dimensions when reading in later.
We could also just write the data as a CSV
file which may be easier for future linking. For this, we use the st_drop_geometry()
function to remove the geometry data, so we’re just left with the attributes.
More Resources
For choropleth mapping in R: