Lab 4. Principles of Spatial Scale, Part 2
Overview
In this practical, we will continue to explore how coal mining impacts West Virginia in a contemporary context. Extending from our work last week exploring county-level variations, this week’s objectives will be to:
- learn some basics of R
- wrangle Census data
- access data through the Web
- compare data between sources & scales
Research Question
We have been focusing on coal mining jobs at the county-scale. This is partly due to a data limitation, as we know the exact number of jobs for coal-mining at the county level. However, is the scale of coal-mining at the county level, or below? Will there be more variation at a smaller area of resolution? Are jobs equally distributed throughout the county, or will they tend to clump in certain areas within a county – and if so, why would that be the case?
In other words: Is coal mining spatially homogeneous throughout the state, or throughout different counties – or is it spatially heterogeneous, with “clumping” behaviors? If there is a spatially heterogeneous pattern, what may tie coal mining jobs to a certain place? Is there a pattern of spatial dependence?
Environment Setup
For this lab, you’ll need to have R and RStudio downloaded and installed on your system. You should be able to install packages, and have very basic familiarity with R following intro-level tutorials provided through installation guides. You should also know how to find the address to a folder on your computer system.
We will work with the following libraries, so please be sure to install:
- sf
- tmap
- RColorBrewer
- tidyverse
- tidycensus
Coding snippets from https://walkerke.github.io/ and https://data.cdrc.ac.uk were used in the development of this lab practical.
Need for finer data
In order to examine this further, we need to look at data at a smaller areal resolution. While we don’t have an exact number of coal mining jobs below county, the American Community Survey (ACS) does provide an approximation of mining grouped within an industry variable at the census-tract level. Census tracts are smaller units used by the Census to track populations, and encompass a population between 2,500 to 8,000 people. When using tract-level files from the ACS, it is recommended to only use 5-year estimates because of the large margins of error in smaller year samples.
Census Data
The mining data we used last week indicated exact counts of employees working within the coal mining industry at the county level, by year. If we wanted to find similar data using the Census, the closest variable available would be Mining jobs as part of the industry category “Agriculture, forestry, fishing and hunting, and mining” at the county or tract level within the American Community Survey (ACS 5-year estimate). Before we explore this data at the tract level, let’s first compare the county-level data between sources: Bureau of Labor Statistics (BLS) and ACS.
We will first work with ACS data provided as a CSV
file format, to be joined to the master West Virginia county shapefile
we generated last week.
Working Directory Setup
To do this, we first need to ensure we have a working directory where your code and data reside. Make sure that your R script, CSV file provided, and WV county shapefile are all in the same folder. Identify the address to this folder.
Now we can set up our working directory. Again, this will be the folder on your computer system where you are storing script and data for this lab. Ensure that you are working from the script from this folder.
# Set working directory
#setwd("/Users/maryniakolak/Desktop/WVCoal")
Load Libraries
Before loading our data, we will need to load the required packages. While we can load a CSV using base R capabilities, we need the sf
package to load spatial data.
If you have not installed sf
, do so now. You can uncomment the line of code to install, but then re-comment it so you don’t install the package more than once. We also load the tidyverse
package to enable some useful data wrangling tools.
# install.packages("sf")
library(sf)
## Linking to GEOS 3.7.2, GDAL 2.4.2, PROJ 5.2.0
library(tidyverse)
## ── Attaching packages ──────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0 ✓ purrr 0.3.4
## ✓ tibble 3.0.1 ✓ dplyr 0.8.5
## ✓ tidyr 1.0.3 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ─────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Load Data
Next, load the census data and read in the shapefile of your counties file. While it reads in as a shapefile, the data structure gets changed to a “simple features” dataset that inludes geometry as an additional column. This is a key feature of the sf
package.
# Load new census data you plan to join.
Census.Data <-read.csv("ACS_15_5YR_DP03.csv")
# Load the output area shapefiles
County.Areas<- st_read("WV_Counties.shp")
## Reading layer `WV_Counties' from data source `/Users/HIPark/Documents/micrometcalf/Intro2GIS/book/WV_Counties.shp' using driver `ESRI Shapefile'
## Simple feature collection with 55 features and 46 fields
## geometry type: POLYGON
## dimension: XY
## bbox: xmin: 355916.7 ymin: 4117439 xmax: 782847.1 ymax: 4498773
## CRS: 26917
View Data
We can quickly view the attributes. One way is to use the glimpse
function. Uncomment this line in your script, and view the data.
# View the attributes associated with each county
glimpse(County.Areas)
## Rows: 55
## Columns: 47
## $ FIPSSTCO <chr> "54109", "54107", "54105", "54103", "54101", "54099", "540…
## $ STATE <chr> "West Virginia", "West Virginia", "West Virginia", "West V…
## $ COUNTY <chr> "Wyoming", "Wood", "Wirt", "Wetzel", "Webster", "Wayne", "…
## $ POP2000 <int> 25708, 87986, 5873, 17693, 9719, 42903, 23404, 9592, 7321,…
## $ WHITE <int> 25345, 85627, 5788, 17502, 9639, 42382, 22981, 9530, 7237,…
## $ BLACK <int> 161, 887, 17, 15, 1, 54, 144, 2, 5, 134, 280, 34, 14, 302,…
## $ AMERI_ES <int> 31, 188, 12, 17, 7, 99, 39, 5, 14, 31, 33, 32, 28, 46, 147…
## $ ASIAN <int> 21, 448, 6, 57, 6, 86, 72, 8, 1, 27, 12, 35, 13, 106, 568,…
## $ HAWN_PI <int> 0, 35, 0, 4, 1, 8, 3, 1, 9, 6, 5, 0, 0, 4, 13, 11, 5, 0, 0…
## $ OTHER <int> 19, 124, 6, 5, 1, 35, 30, 3, 7, 10, 13, 29, 11, 45, 99, 67…
## $ MULT_RACE <int> 131, 677, 44, 93, 64, 239, 135, 43, 48, 102, 103, 93, 71, …
## $ HISPANIC <int> 135, 514, 18, 74, 36, 202, 137, 41, 18, 95, 71, 104, 49, 1…
## $ MALES <int> 12649, 42246, 2939, 8586, 4783, 20993, 11354, 4686, 3570, …
## $ FEMALES <int> 13059, 45740, 2934, 9107, 4936, 21910, 12050, 4906, 3751, …
## $ AGE_UNDER5 <int> 1467, 5089, 329, 1007, 500, 2471, 1248, 503, 353, 864, 604…
## $ AGE_5_17 <int> 4291, 15139, 1163, 3197, 1732, 7551, 4030, 1728, 1204, 282…
## $ AGE_18_21 <int> 1257, 4181, 262, 715, 456, 2257, 2092, 378, 294, 749, 567,…
## $ AGE_22_29 <int> 2535, 8122, 514, 1463, 887, 4288, 2199, 794, 564, 1496, 11…
## $ AGE_30_39 <int> 3253, 12383, 894, 2348, 1222, 5882, 2945, 1268, 1024, 2342…
## $ AGE_40_49 <int> 4648, 13642, 903, 2699, 1579, 6292, 3446, 1523, 1078, 2565…
## $ AGE_50_64 <int> 4671, 15822, 1045, 3403, 1861, 7751, 3995, 1819, 1490, 271…
## $ AGE_65_UP <int> 3586, 13608, 763, 2861, 1482, 6411, 3449, 1579, 1314, 2539…
## $ MED_AGE <dbl> 40.1, 39.3, 37.9, 40.4, 40.4, 38.4, 37.4, 40.8, 42.0, 39.1…
## $ MED_AGE_M <dbl> 39.2, 38.0, 37.1, 39.4, 39.6, 37.0, 36.4, 40.3, 40.8, 37.7…
## $ MED_AGE_F <dbl> 41.0, 40.6, 38.6, 41.5, 41.2, 39.6, 38.4, 41.3, 43.2, 40.5…
## $ HOUSEHOLDS <int> 10454, 36275, 2284, 7164, 4010, 17239, 8972, 3836, 3052, 6…
## $ AVE_HH_SZ <dbl> 2.45, 2.39, 2.56, 2.45, 2.41, 2.48, 2.45, 2.47, 2.35, 2.47…
## $ HSEHLD_1_M <int> 1046, 3856, 229, 723, 467, 1654, 905, 350, 318, 658, 693, …
## $ HSEHLD_1_F <int> 1501, 5976, 277, 1118, 597, 2501, 1354, 538, 512, 952, 914…
## $ MARHH_CHD <int> 2493, 7682, 624, 1643, 866, 4090, 2130, 904, 657, 1466, 10…
## $ MARHH_NO_C <int> 3710, 12022, 780, 2491, 1355, 6112, 3088, 1450, 1112, 2099…
## $ MHH_CHILD <int> 198, 664, 56, 152, 91, 285, 185, 72, 53, 134, 106, 135, 74…
## $ FHH_CHILD <int> 545, 2275, 123, 372, 239, 997, 478, 185, 115, 355, 262, 31…
## $ FAMILIES <int> 7705, 24898, 1700, 5080, 2816, 12648, 6353, 2833, 2121, 44…
## $ AVE_FAM_SZ <dbl> 2.89, 2.88, 2.97, 2.92, 2.89, 2.92, 2.92, 2.89, 2.84, 2.95…
## $ HSE_UNITS <int> 11698, 39785, 3266, 8313, 5273, 19107, 10751, 4780, 4634, …
## $ URBAN <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ RURAL <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ VACANT <int> 1244, 3510, 982, 1149, 1263, 1868, 1779, 944, 1582, 805, 1…
## $ OWNER_OCC <int> 8713, 26609, 1898, 5625, 3167, 13466, 6883, 3209, 2520, 50…
## $ RENTER_OCC <int> 1741, 9666, 386, 1539, 843, 3773, 2089, 627, 532, 1287, 11…
## $ Mines15 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 22…
## $ AveEmp15 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 17…
## $ Wages15 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 13…
## $ AveWkWg15 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 14…
## $ WgEmp15 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 77…
## $ geometry <POLYGON [m]> POLYGON ((478582 4150782, 4..., POLYGON ((463072.1…
We can also use the head()
function, which shows the top 5 rows. Uncomment this in your script, and view the data.
# View the attributes associated with each county
# head(Census.Data)
If we want to see the dimensions of the data, we can use the dim()
function. I also like the str()
function as it gives an overview of what type of data we have from a data structure perspective. I comment out the str()
function here as it generates a very long output, but please try for yourself!
From either of these we can see that there 55 observations (counties), and 551 variables. In other words, 55 rows x 551 columns.
# View the attributes associated with each county
dim(Census.Data)
## [1] 55 551
#str(Census.Data)
Subset Data
This ACS dataset – a raw file from the DP03: Selected Economic Characteristics summary profile – includes a lot of variables. To work with this we need to know exactly what variabes we need. You would need the metadata document that comes with the summary profile to identify that. In this lab, I will tell you which variables we’re interested in.
HC03_VC50: Percent; INDUSTRY - Civilian employed population 16 years and over - Agriculture, forestry, fishing and hunting, and mining
HC01_VC85: Estimate; INCOME AND BENEFITS (IN 2010 INFLATION- ADJUSTED DOLLARS) - Median household income (dollars).
In other words, we need to generate a subset of the raw ACS file to include the unique numerical County geographic ID, and rename the variables of interest.
- County ID -> named “GEO.id2” in dataset
- HC03_VC50 -> rename “AgFHM15”
- HC01_VC85 -> rename “MedInc15”
Let’s further subset the data to make a new, cleaner dataset. We first generate a list of variables that we want to keep. Note that “GEO.id2” is the unique code that indicates a county. We will use this as our “key” to merge with our master spatial file.
var <- c("GEO.id2","HC01_VC85","HC03_VC50")
Next, we extract only the variables of interest from the ACS dataset. We call this new object “Census.Subset.” View the data to ensure you’ve subset correctly.
Also note that there are multiple ways to subset data in R – here we use a simple base R approach. The tidyverse
package offers even more elegant options.
Census.Subset<- (Census.Data[var])
glimpse(Census.Subset)
## Rows: 55
## Columns: 3
## $ GEO.id2 <int> 54001, 54003, 54005, 54007, 54009, 54011, 54013, 54015, 540…
## $ HC01_VC85 <int> 37066, 55239, 39958, 32750, 46215, 38344, 35568, 31325, 399…
## $ HC03_VC50 <dbl> 6.2, 1.2, 23.4, 10.5, 1.7, 1.0, 12.5, 10.7, 16.7, 8.3, 8.5,…
Next, we rename the second and third columns (identified as 2 through 3) using new names. Again, there are many ways to do this in R – this is just one approach.
names(Census.Subset)[2:3]<- c("MedInc15", "AgMnJb15")
glimpse(Census.Subset)
## Rows: 55
## Columns: 3
## $ GEO.id2 <int> 54001, 54003, 54005, 54007, 54009, 54011, 54013, 54015, 5401…
## $ MedInc15 <int> 37066, 55239, 39958, 32750, 46215, 38344, 35568, 31325, 3997…
## $ AgMnJb15 <dbl> 6.2, 1.2, 23.4, 10.5, 1.7, 1.0, 12.5, 10.7, 16.7, 8.3, 8.5, …
Merge data to Master
Now that you’ve cleaned and named the census data showing our mining proxies, we can merge to our Master spatial file.
Note that we are joining using key “FIPSSTCO” from the county shapefile, and “GEO.id2” from the census dataset. Look at the data to make sure it joined correctly.
# Joins data to the shapefile
WV_county <- merge(County.Areas, Census.Subset, by.x="FIPSSTCO", by.y="GEO.id2")
head(WV_county)
## Simple feature collection with 6 features and 48 fields
## geometry type: POLYGON
## dimension: XY
## bbox: xmin: 368368.2 ymin: 4179711 xmax: 773040.2 ymax: 4472141
## CRS: 26917
## FIPSSTCO STATE COUNTY POP2000 WHITE BLACK AMERI_ES ASIAN HAWN_PI
## 1 54001 West Virginia Barbour 15557 15147 77 111 40 3
## 2 54003 West Virginia Berkeley 75905 70392 3558 186 350 17
## 3 54005 West Virginia Boone 25535 25160 167 31 18 4
## 4 54007 West Virginia Braxton 14702 14411 101 51 16 7
## 5 54009 West Virginia Brooke 25447 24913 216 25 87 9
## 6 54011 West Virginia Cabell 96784 90370 4150 174 749 38
## OTHER MULT_RACE HISPANIC MALES FEMALES AGE_UNDER5 AGE_5_17 AGE_18_21
## 1 19 160 73 7646 7911 825 2752 928
## 2 428 974 1156 37784 38121 5024 14505 3558
## 3 17 138 117 12471 13064 1608 4317 1264
## 4 12 104 65 7444 7258 779 2567 610
## 5 22 175 99 12182 13265 1278 3922 1541
## 6 196 1107 654 46229 50555 5261 14141 8160
## AGE_22_29 AGE_30_39 AGE_40_49 AGE_50_64 AGE_65_UP MED_AGE MED_AGE_M MED_AGE_F
## 1 1483 2085 2299 2765 2420 38.7 37.3 40.0
## 2 7945 12183 11984 12240 8466 35.8 35.3 36.3
## 3 2757 3246 4406 4473 3464 38.8 37.8 39.7
## 4 1426 2049 2322 2619 2330 39.6 38.8 40.5
## 5 2249 3221 4038 4536 4662 41.2 39.6 42.7
## 6 11725 12236 13893 15869 15499 37.5 35.6 39.4
## HOUSEHOLDS AVE_HH_SZ HSEHLD_1_M HSEHLD_1_F MARHH_CHD MARHH_NO_C MHH_CHILD
## 1 6123 2.47 638 896 1399 2101 122
## 2 29569 2.53 3376 3785 7035 9118 894
## 3 10291 2.47 1030 1499 2419 3495 242
## 4 5771 2.46 638 814 1325 1980 133
## 5 10396 2.36 1104 1801 2130 3621 167
## 6 41180 2.27 5247 7652 7237 12140 619
## FHH_CHILD FAMILIES AVE_FAM_SZ HSE_UNITS URBAN RURAL VACANT OWNER_OCC
## 1 320 4367 2.94 7348 0 0 1225 4815
## 2 1960 20702 2.99 32913 0 0 3344 21927
## 3 541 7464 2.92 11575 0 0 1284 8122
## 4 289 4099 2.92 7374 0 0 1603 4511
## 5 504 7156 2.88 11150 0 0 754 7971
## 6 2535 25474 2.85 45615 0 0 4435 26591
## RENTER_OCC Mines15 AveEmp15 Wages15 AveWkWg15 WgEmp15 MedInc15 AgMnJb15
## 1 1308 NA NA NA NA NA 37066 6.2
## 2 7642 NA NA NA NA NA 55239 1.2
## 3 2169 15 1296 102248050 1517 78905 39958 23.4
## 4 1260 NA NA NA NA NA 32750 10.5
## 5 2425 NA NA NA NA NA 46215 1.7
## 6 14589 NA NA NA NA NA 38344 1.0
## geometry
## 1 POLYGON ((592517.2 4313786,...
## 2 POLYGON ((754483.2 4389441,...
## 3 POLYGON ((433341.3 4230981,...
## 4 POLYGON ((506432.2 4269583,...
## 5 POLYGON ((527478.5 4448095,...
## 6 POLYGON ((394905.6 4271170,...
Great!
Mapping in R
There are multiple libraries that exist for mapping in R. We will use tmap
and to a lesser extent,ggplot
in this lab. Let’s start with tmap.
First, we need to add the libraries. Install the libraries if you have not done so.
Load packages
library(tmap)
library(RColorBrewer)
View palette options
It is helpful to have a nice range of color palette options when mapping in R. Load the RColorBrewer
library and display all the options available to you. Each palette option has a name attached – the is the name that you indicate as a parameter for mapping in tmap.
While this appears “squished” when rendered via RMarkdown, you can expand the view when run via your Console or via R script.
# Display the color palette
display.brewer.all()
Basic tmap functions
For basic mapping in R using tmap
, you have two functions for each variable you need to map. Obviously the data needs to be the spatial file you’ve generated and loaded into the environment. Our Master spatial file is named “WV_county” in our setting here. To visualize the “AgMnJb15” variable (our ACS mining proxy) we add this to the tm_fill
parameter. This generates a default choropleth map.
# Creates a simple choropleth map of our a AgMnJb15 variable
tm_shape(WV_county) + tm_fill("AgMnJb15")
While this is not a perfect proxy for mining jobs, some of the patterns are consistent with the BLS data source. We can see that majority of jobs in this dataset are in the Southern part of West Viriginia, in the same area as the highest proportions of coal mining jobs. What areas are less well represented? Why may that be the case?
Set color palette
With ColorBrewer, we can switch palettes. If you would like to change the direction of the palette, just add a “negative” sign.
# setting a color palette
tm_shape(WV_county) + tm_fill("AgMnJb15", palette = "-BrBG")
Classify data
You can also change the data classification interval. It defaults on a “pretty classification” which has been optimized for easy-to-read breaks, but may not best reflect the actual data distribution To see different parameter options, research the tmap
documentation page.
Here we try quantile breaks, and then specific the number of levels or breaks.
# changing the intervals
tm_shape(WV_county) + tm_fill("AgMnJb15", style = "quantile", palette = "-BrBG")
# number of levels
tm_shape(WV_county) + tm_fill("AgMnJb15", style = "quantile", n = 7, palette = "-BrBG")
How does this change your interpretation? Try additional parameter specifications like “jenks” for jenks breaks, or “sd” for standard deviation breaks.
Customize Maps
There are multiple other options when designing maps using tmap.
Many of these are best optimized for generating in an R script directly, rather than in an RMarkdown, as RStudio can enforce odd behaviors when it doesn’t have full space to generate plots. Take the following code as examples for what you should try in your own RStudio environment within an R script (rather than an RMarkdown file).
Finally, experiment with the parameters here, and continue to explore the tmap documentation. (Google search terms: “tmap R” to get information directly.)
# includes a histogram in the legend
tm_shape(WV_county) + tm_fill("AgMnJb15", style = "jenks", n=6, palette = "-BrBG", legend.hist = TRUE)
# add borders
tm_shape(WV_county) + tm_fill("AgMnJb15", style = "jenks", n=6, palette = "-BrBG") +
tm_borders(alpha=.4)
# north arrow
tm_shape(WV_county) + tm_fill("AgMnJb15", style = "jenks", n=6, palette = "-BrBG") +
tm_borders(alpha=.4) +
tm_compass()
# adds in layout, gets rid of frame
tm_shape(WV_county) + tm_fill("AgMnJb15", palette = "-BrBG", style = "jenks", title = "% Jobs High Risk Employment") +
tm_borders(alpha=.4) +
tm_compass() +
tm_layout(legend.text.size = 1.1, legend.title.size = 1.5, legend.position = c("right", "bottom"), frame = FALSE)
Facet mapping
Now that we have a Master spatial dataset that includes both mining job proxies, we can compare them. How close is the mining proxy data from the ACS profile to the actual mining employment numbers from BLS? To do this we generate a “facet map” meaning we show multiple maps, side by side.
We first assign each map to a variable, and then use either tmap_arrange
or tm_facets
to generate our facet maps.
# ReMap
BLSData <- tm_shape(WV_county) + tm_fill("AveEmp15", palette = "BuPu", style = "jenks", title = "Coal Mng Jobs") +
tm_borders(alpha=.4) +
tm_layout(legend.text.size = .8, legend.title.size = 1.0, legend.position = c("right", "bottom"), frame = FALSE)
ACSData <- tm_shape(WV_county) + tm_fill("AgMnJb15", palette = "BuPu", style = "jenks", title = "High Risk Job %") +
tm_borders(alpha=.4) +
tm_layout(legend.text.size = .8, legend.title.size = 1.0, legend.position = c("right", "bottom"), frame = FALSE)
tmap_arrange(BLSData,ACSData)
Here’s another way to generate the maps side by side:
# Another Route:
tm_shape(WV_county) +
tm_polygons(c("AveEmp15", "AgMnJb15"),
style=c("jenks", "jenks"),
palette=list("BuPu", "BuPu"),
title=c("Coal Mng Jobs", "High Risk Jobs %")) +
tm_facets(ncol=2)
Obviously the ACS data is overestimating in some areas, and undestimating in others. While it captures the increase in mining jobs in the Southermost area, it misses the county with the highest mining jobs when considering all the other employment categories included. Getting finer resolution data for coal mining below county may not be available, but at least we know relatively how useful this Census category is.
Using a Census API
In this lab I provided the raw data to you for 2015 that you wrangled and cleaned. Another excellent option in R coding environments is using a package to extract Census data for you. While you won’t be able to access all the data the same way, for some of the “classic” indicators like population demographics and income, using the Census API is the most efficient approach.
Let’s extract median income for WV counties and census tracts to see if we gain any insight into our analysis.
Workspace Setup
Examples in this section were sourced from Kyle Walker’s guide to using tidycensus: Basic Usage of tidycensus and Spatial Data in tidycensus. If you run into challengs, feel free to peruse the original context for more ideas.
Load libraries
library(tidyverse)
library(tidycensus)
options(tigris_use_cache = TRUE)
Activate your API key
To work with tidycensus, you must first register for and activate your Census API key. A key can be obtained from http://api.census.gov/data/key_signup.html. Register and activate your API key.
Next, add your API key to your RStudio coding environment. This should be done in your console or your personal file; the API key should not be shared. I include dummy code below that has been commented out. For my implementation to work, I needed to add two additional parameters after simple debugging.
#census_api_key("YOURAPIKEY", install=TRUE, overwrite = TRUE)
Using the documentation for tidycensus by Kyle Walker, we first load variables from the ACS 5-year estimated for 2015 (so from years 2011-2015). The “view” option opens a window in your RStudio environment where you can explore the Census variables available to you in detail. This is how you can find the definitions of the variables of interest, using the search functionality availble in RStudio. See the Kyle Walker demo for more specifics.
The view(ACS15var)
call is a critical component to using tidycensus. This lets you explore all the available options to you when extracting variables, serving as a metadata file. Use the “search” function in RStudio to explore this dataset after it opens. Try searching for “income” for example – how many options are available to you? Uncomment and explore!
ACS15var <- load_variables(2015, "acs5", cache = TRUE)
#view(ACS15var)
Extract Census data
Next we can extract the exact data. This functions gathers data for median income levels for all counties in West Virginia, using the 2015 dataset.
Note: this call can take some time to pull. It’s not necessary to go through for the lab, so it’s okay to skip if you are running into issues.
Access census estimates
WV_county_medinc <- get_acs(geography = "county",
variables = c(medincome = "B19013_001"),
state = "WV",
year = 2015)
## Getting data from the 2011-2015 5-year ACS
WV_county_medinc
## # A tibble: 55 x 5
## GEOID NAME variable estimate moe
## <chr> <chr> <chr> <dbl> <dbl>
## 1 54001 Barbour County, West Virginia medincome 37066 2601
## 2 54003 Berkeley County, West Virginia medincome 55239 1628
## 3 54005 Boone County, West Virginia medincome 39958 2871
## 4 54007 Braxton County, West Virginia medincome 32750 4420
## 5 54009 Brooke County, West Virginia medincome 46215 2259
## 6 54011 Cabell County, West Virginia medincome 38344 1905
## 7 54013 Calhoun County, West Virginia medincome 35568 6515
## 8 54015 Clay County, West Virginia medincome 31325 3298
## 9 54017 Doddridge County, West Virginia medincome 39974 4623
## 10 54019 Fayette County, West Virginia medincome 36293 1836
## # … with 45 more rows
Access merged spatial file
The latest edition of tidycensus
adds the ability to extract census attribute data with the spatial data already merged. The object is thus a spatial file, not a simple data file.
WV_county_medinc.sp <- get_acs(state = "WV",
geography = "county",
variables = "B19013_001",
geometry = TRUE)
## Getting data from the 2014-2018 5-year ACS
head(WV_county_medinc.sp)
## Simple feature collection with 6 features and 5 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -82.50768 ymin: 37.76331 xmax: -77.82356 ymax: 40.39976
## CRS: 4269
## GEOID NAME variable estimate moe
## 1 54001 Barbour County, West Virginia B19013_001 39580 4632
## 2 54003 Berkeley County, West Virginia B19013_001 60615 1882
## 3 54005 Boone County, West Virginia B19013_001 38642 3787
## 4 54007 Braxton County, West Virginia B19013_001 42213 5363
## 5 54009 Brooke County, West Virginia B19013_001 49772 3595
## 6 54011 Cabell County, West Virginia B19013_001 38321 1341
## geometry
## 1 MULTIPOLYGON (((-80.22631 3...
## 2 MULTIPOLYGON (((-78.22901 3...
## 3 MULTIPOLYGON (((-81.97956 3...
## 4 MULTIPOLYGON (((-80.98495 3...
## 5 MULTIPOLYGON (((-80.6796 40...
## 6 MULTIPOLYGON (((-82.50768 3...
Plot data
Using Walker’s tidycensus
vignette, you can quickly plot the data using ggplot
. We didn’t need to install ggplot
because it was part of the tidyverse
library.
WV_county_medinc.sp %>%
ggplot(aes(fill = estimate)) +
geom_sf(color = NA) +
scale_fill_viridis_c(option = "magma")
Of course, you can also plot using tmap.
Note that the variable of interest is called “estimate” here, using the tidycensus
default. I recommend just working with one attribute at a time when first working with tidycensus
and R.
tm_shape(WV_county_medinc.sp) + tm_fill("estimate", palette = "BuPu", title ="Median Income")
Rinse and Repeat
What’s nice about tidycensus
is that we can quickly run the same function to get the spatial data at a different resolution. Here we can gather median income information at the tract level.
WV_tracts_medinc.sp <- get_acs(state = "WV",
geography = "tract",
variables = "B19013_001",
geometry = TRUE)
## Getting data from the 2014-2018 5-year ACS
head(WV_tracts_medinc.sp)
## Simple feature collection with 6 features and 5 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -80.22717 ymin: 38.94724 xmax: -77.82975 ymax: 39.61915
## CRS: 4269
## GEOID NAME variable
## 1 54001965500 Census Tract 9655, Barbour County, West Virginia B19013_001
## 2 54001965600 Census Tract 9656, Barbour County, West Virginia B19013_001
## 3 54001965700 Census Tract 9657, Barbour County, West Virginia B19013_001
## 4 54001965800 Census Tract 9658, Barbour County, West Virginia B19013_001
## 5 54003971101 Census Tract 9711.01, Berkeley County, West Virginia B19013_001
## 6 54003971102 Census Tract 9711.02, Berkeley County, West Virginia B19013_001
## estimate moe geometry
## 1 47092 5751 MULTIPOLYGON (((-80.22631 3...
## 2 26429 8650 MULTIPOLYGON (((-80.10631 3...
## 3 41029 11601 MULTIPOLYGON (((-80.07606 3...
## 4 33929 8376 MULTIPOLYGON (((-79.98156 3...
## 5 58643 14597 MULTIPOLYGON (((-77.92404 3...
## 6 66392 10805 MULTIPOLYGON (((-77.9613 39...
Next, we can plot it using both ggplot
and tmap.
WV_tracts_medinc.sp %>%
ggplot(aes(fill = estimate)) +
geom_sf(color = NA) +
scale_fill_viridis_c(option = "magma")
tm_shape(WV_tracts_medinc.sp) + tm_fill("estimate", style = "jenks", palette = "BuPu", title ="Median Income")
What new insights did you find?
Scale of Coal Mining
Finally – so, what about the scale of coal mining in West Virginia? While we know the ACS data is an imperfect represetnation of coal mining the state, we also know it captures some of the higher proportions of mining employment, especially in the Southern part of the state.
After exploring the “view” of the ACS data pulled, the closest variable I could find was B24031_002 - or the “Estimate of total persons in Agriculture, forestry, fishing and hunting, and mining.” The ACS profile used in tidycensus
does not link directly to the specific report we need to match our analysis exactly, but this should capture some of the same variation.
Thus, let’s map this variable at County and Tract scales to look for some patterns.
Extract Data
First, we extract the data:
WV_county_mining.sp <- get_acs(state = "WV",
geography = "county",
variables = "B24031_002",
geometry = TRUE)
## Getting data from the 2014-2018 5-year ACS
WV_tracts_mining.sp <- get_acs(state = "WV",
geography = "tract",
variables = "B24031_002",
geometry = TRUE)
## Getting data from the 2014-2018 5-year ACS
Map Data
Next, we map both side by side:
# ReMap
WVCounties <- tm_shape(WV_county_mining.sp) + tm_fill("estimate", n=4, palette = "BuPu", style = "jenks", title = "High Risk Jobs") +
tm_borders(alpha=.4) +
tm_layout(legend.text.size = .8, legend.title.size = 1.0, legend.position = c("right", "bottom"), frame = FALSE)
WVTracts <- tm_shape(WV_tracts_mining.sp) + tm_fill("estimate", n=4, palette = "BuPu", style = "jenks", title = "High Risk Jobs") +
tm_borders(alpha=.4) +
tm_layout(legend.text.size = .8, legend.title.size = 1.0, legend.position = c("right", "bottom"), frame = FALSE)
tmap_arrange(WVCounties,WVTracts)
## Some legend labels were too wide. These labels have been resized to 0.73. Increase legend.width (argument of tm_layout) to make the legend wider and therefore the labels larger.
Interpretation
What type of variation do you find between these maps? Knowing that there is more connection with mining in th Southern area, were you suprised by the variation below the county level? At what scale is coal mining employment occurring? Broadening to a wider “High Risk Employment” category for jobs that are more prone to accidents (including construction, mining, agricultural, etc) captured in this Census variable, what additional insights might you have?
Note that this variable is sometimes used for identifying areas at risk for opioid overdose. If some jobs are more prone to injury, then workers could also be more prone to getting prescription opioids for pain management, a component of the opioid epidemic. At the same time, these areas often have few other job options. When high-paying coal mining jobs go away (“high” relative to median income in the area), there may not be other options that promise the same pay and/or stability. This “economics of despair” concept likewise is thought the drive increases in opioid-related overdoses. Consider the complexity here – and how might you further detangle these concepts?
Appendix
Write to CSV
After all the hard work in this analysis, you’ll want to save the master files you generated. One option is to save the attribute component of the data you generated, using a CSV
file. Uncomment and run on your own.
# write.csv(WV_county, "WV_county.csv")
Write to SHP
The st_write
function saves the spatial file as a shapefile.
For example you can save the county file you updated earlier. Uncomment and run on your own. The file is save to your working directory.
# st_write(WV_county, "WV_county.shp")
Census Data Access
While the tidycensus
package is great for straightforward variables like income level, it may not always be for all variables. For example, if we search for “mining” in the existing View only a small selection is returned. We could for example try variable “B08126_002” corresponding to the total estimate of persons in Agriculture, forestry, fishing and hunting, and mining industries, but this is not the percentage of the workforce in these industries (which would be more precise). Rather than calculate this on our own (which also needs to account for the total able-to-work population), it’s more efficient to use pre-calculated Census metrics.
I thus recommend that for more detailed Census explorations, you take the time to get familiar with various Census data profiles.
For your own research, I recommend investigating IPUMS and the newly revised Data.gov site that releases more detailed Census profiles with ther required metadata components. The previous version, American Factfinder, has been retired.