Today we will be working with data from the National Health and Nutrition Examination Survey (NHANES).
NHANES is a program of studies designed to assess the health and nutritional status of adults and children in the United States. The survey is unique in that it combines interviews and physical examinations. NHANES is a major program of the National Center for Health Statistics (NCHS). NCHS is part of the Centers for Disease Control and Prevention (CDC) and has the responsibility for producing vital and health statistics for the Nation.
The NHANES interview includes demographic, socioeconomic, dietary, and health-related questions. The examination component consists of medical, dental, and physiological measurements, as well as laboratory tests administered by highly trained medical personnel.
Findings from this survey will be used to determine the prevalence of major diseases and risk factors for diseases. Information will be used to assess nutritional status and its association with health promotion and disease prevention. NHANES findings are also the basis for national standards for such measurements as height, weight, and blood pressure. Data from this survey will be used in epidemiological studies and health sciences research, which help develop sound public health policy, direct and design health programs and services, and expand the health knowledge for the Nation.
The dataset we will be working with data from the 2013-2014 and 2015-2016 surveys. Information about the variables recorded in the data can be found in the following links.
- 2013-2014: https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?BeginYear=2013
- Demographics: https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/DEMO_H.htm
- Sex Steroid Hormone - Serum : https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/TST_H.htm
- Reproductive Health: https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/RHQ_H.htm
- 2015-2016: https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?BeginYear=2015
- Demographics: https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm
- Sex Steroid Hormone - Serum: https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/TST_I.htm
- Reproductive Health: https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/RHQ_I.htm
Explore the NHANES website (https://www.cdc.gov/nchs/nhanes/index.htm) and in Etherpad describe some of the information and data that can be found there.
[Insert Class Roster]
- You can search for variables.
- Make sure you read the documentation for the variables and tables you want to use. E.g. the documentation for the 2013-2014 demographic data can be found at https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/DEMO_H.htm.
NHANES distributes its data in the SAS transport format (.XPT), and Tidyverse
includes a function haven::read_xpt()
to read these files directly. FFor
example, the demographic data from years 2013-2014 and 2015-2016 are at
- https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/DEMO_H.XPT
- https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.XPT
library(tidyverse)
# Construct URLs for the data we are going to use.
# demographics, sex steroid, reproductive health
tables <- c("DEMO", "TST", "RHQ")
url_h <- str_glue("https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/{tables}_H.XPT")
url_i <- str_glue("https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/{tables}_I.XPT")
url_h
url_i
# Good: Download and process data using a for-loop.
nhanes_h <- list() # Create an empty list
for(u in url_h) {
# Iterate through all the elements (urls) in url_h and download the
# data using the function `haven:read_xpt()`. Save the results as elements
# in the list
nhanes_h[[u]] <- haven::read_xpt(u)
}
# Better: Download and process data using tidyverse's `map()` function.
nhanes_i <- url_i |> map(haven::read_xpt, .progress = TRUE)
# Best: Use `set_names()` to add names to the input. Otherwise,
# nhanes_i is unnamed.
nhanes_i <- set_names(url_i) |> map(haven::read_xpt, .progress = TRUE)
# Extract the table name from the urls
names(nhanes_h) <- names(nhanes_h) |> str_extract("[^/]+(?=_H[.]XPT$)")
names(nhanes_i) <- names(nhanes_i) |> str_extract("[^/]+(?=_I[.]XPT$)")
Various sample weights are available on the data release files – such as the interview weight (WTINT2YR), the MEC exam weight (WTMEC2YR), and several subsample weights. Use of the correct sample weight for NHANES analyses depends on the variables being used. A good rule of thumb is to use “the least common denominator” where the variable of interest that was collected on the smallest number of respondents is the “least common denominator.” The sample weight that applies to that variable is the appropriate one to use for that particular analysis.
In order to work with four years of data, we will bind the rows from both of the datasets and then create new weights appropriate for the larger dataset.
# Combine datasets and create "4-year" weights from 2-year weights
# WTINT2YR - Full sample 2 year interview weight
# WTMEC2YR = Full sample 2 year MEC exam weight
dat <- bind_rows(nhanes_h$DEMO, nhanes_i$DEMO) |>
mutate(WTINT4YR = WTINT2YR/2,
WTMEC4YR = WTMEC2YR/2)
dat
Using the demographic dataset we can measure the the fraction of women who are pregnant based on age. Pregnancy was accessed during the exam stage and we will use the MEC weights in this analysis.
# Estimate number of women pregnant per age
# RIDAGEYR = respondent's age in years
# RIDEXPRG = Pregnancy status at exam
tab <- dat |>
drop_na(RIDEXPRG) |>
count(RIDAGEYR, RIDEXPRG, wt = WTMEC4YR)
tab
Wait. What the heck do 1, 2, and 3 mean under RIDEXPRG. Let's look at the documentation. https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/DEMO_H.htm
RIDEXPRG: Pregnancy status at the time of the health examination was ascertained for females 8–59 years of age. Due to disclosure risks pregnancy status is only released for women 20-44 years of age. The information used to code RIDEXPRG values included self-reported pregnancy status and urine pregnancy test results. Persons who reported they were pregnant at the time of exam were assumed to be pregnant (RIDEXPRG=1). Those who reported they were not pregnant or did not know their pregnancy status were further classified based on the results of the urine pregnancy test. If the respondent reported “no” or “don’t know” and the urine test result was positive, the respondent was coded as pregnant (RIDEXPRG=1). If the respondent reported “no” and the urine test was negative, the respondent was coded not pregnant (RIDEXPRG=2). If the respondent reported did not know her pregnancy status and the urine test was negative, the respondent was coded “could not be determined” (RIDEXPRG=3). Persons who were interviewed, but not examined also have an RIDEXPRG value = 3 (could not be determined).
Lets drop the “could not be determined” pregnancy status and recalculate.
# Let's estimate a fraction keeping only the yes/no values
tab <- dat |>
filter(RIDEXPRG %in% 1:2) |>
count(RIDAGEYR, RIDEXPRG, wt = WTMEC4YR) |>
mutate(p = n / sum(n), .by=RIDAGEYR)
Let's reshape the table.
# Reshape tab to a wide format with a column for age, and two columns for
# pregnancy status. Any columns not mentioned in these three arguments will be
# dropped.
tab |> pivot_wider(id_cols = RIDAGEYR,
names_from = RIDEXPRG,
values_from = p)
# We have some NAs in the list which represent 0.0
tail(tab)
# Make the result easier to read before we save it
tab <- tab |> rename(Age = RIDAGEYR, Status = RIDEXPRG)
tab <- tab |>
mutate(Status = factor(Status, levels=1:2,
labels=c("Yes", "No"))) |>
pivot_wider(id_cols = Age, names_from = Status,
values_from = p, values_fill = 0.0)
# It looks pretty now
tail(tab)
write_csv(tab, "frac_preg_by_age.csv")
To work with data from multiple tables, we will need to combine the tables in some fashion. Let's use map to quickly look at the number of rows per table.
# Apply the function `nrow()` to every element of a list
map(nhanes_h, nrow) # returns a list
map_int(nhanes_i, nrow) # returns an integer vector
# `map()` has alternative functions that return specific structures.
# `map_int()` returns a vector of integers, `map_chr()` returns a vector of
# characters, etc.
The number (and order) of rows vary between the tables. We will use a join operation to combine the tables based on the respondent sequence number, "SEQN". Dplyr offers multiple ways to join two data frames
inner_join(x, y)
keeps observations fromx
that have a matching key iny
.left_join(x, y)
keeps observations inx
.right_join(x, y)
keeps observations iny
.full_join(x, y)
keeps observations inx
andy
.- among others.
# Use an inner_join to combine the DEMO and TST tables
# SEQN = Respondent sequence number
dat_h <- inner_join(nhanes_h$DEMO, nhanes_h$TST, by = "SEQN")
dat_i <- inner_join(nhanes_i$DEMO, nhanes_i$TST, by = "SEQN")
# check names and number of rows
names(nhanes_h$DEMO)
names(nhanes_h$TST)
names(dat_h)
nrow(nhanes_h$DEMO)
nrow(nhanes_h$TST)
nrow(dat_h)
# Bind the years together
dat <- bind_rows(dat_h, dat_i) |>
mutate(WTINT4YR = WTINT2YR/2,
WTMEC4YR = WTMEC2YR/2)
In our next analysis we are going to look at how levels of sex hormones vary by age. This will require us to take into account the 4-year MEC weights.
# Estimate mean levels of sex hormones by age.
# RIAGENDR = respondent's sex
# RIDAGEYR = respondent's age in years
# LBXTST = Testosterone, total (ng/dL)
# LBXEST = Estradiol (pg/mL)
# Note: 1 ng/dL = 10 pg/mL
tab <- dat |>
summarize(mean_est = weighted.mean(LBXEST, w = WTMEC4YR, na.rm = TRUE),
mean_tst = weighted.mean(LBXTST, w = WTMEC4YR, na.rm = TRUE),
.by = c(RIAGENDR, RIDAGEYR))
# Note: weighted.mean uses `w` as the weight parameter
# Note: count uses `wt` as the weight parameter
# save results to a CSV file
write_csv(tab, file = "hormones_by_age_and_sex.csv")
Next we are going to create a plot of our results.
# First plot attempt of TST
ggplot(tab, aes(x = RIDAGEYR, y = mean_tst)) +
geom_line()
# The previous plot did not distinguish males and females.
# Let's fix that.
ggplot(tab, aes(x = RIDAGEYR, y = mean_tst, color = RIAGENDR)) +
geom_line()
# The plot still didn't look good because RIAGENDR was treated as a continuous variable
# Let's fix that.
ggplot(tab, aes(x = RIDAGEYR, y = mean_tst, color = as.factor(RIAGENDR))) +
geom_line()
ggplot(tab, aes(x = RIDAGEYR, y = mean_est, color = as.factor(RIAGENDR))) +
geom_line()
tab <- tab |> pivot_longer(c(mean_est, mean_tst))
ggplot(tab, aes(x = RIDAGEYR, y = value, color = as.factor(RIAGENDR))) +
facet_wrap(vars(name)) +
geom_line()
There's a lot of noise in the EST data for females. Let's see if it is due to pregnancy. Challenge: Remove pregnant women from the data set and rerun the analysis.
# SOLUTION
# Filter out pregnant women and recreate plot
tab <- dat |>
filter(is.na(RIDEXPRG) | RIDEXPRG != 1) |>
group_by(RIAGENDR, RIDAGEYR) |>
summarize(mean_est = weighted.mean(LBXEST, w = WTMEC4YR, na.rm = TRUE),
mean_tst = weighted.mean(LBXTST, w = WTMEC4YR, na.rm = TRUE)) |>
pivot_longer(c(mean_est, mean_tst))
ggplot(tab, aes(x = RIDAGEYR, y = value, color = as.factor(RIAGENDR))) +
facet_wrap(vars(name)) +
geom_line()
Let's split respondents into Male, Female (not-pregnant), and Female (pregnant).
# Encode Sex and Pregnancy status as three categories
tab <- dat |> mutate(sex = case_when(
RIAGENDR == 1 ~ "M",
RIDEXPRG == 1 ~ "FP",
TRUE ~ "FN"
))
# subset the important columns
tab <- tab |> select(SEQN, sex, RIDAGEYR,
WTMEC4YR, LBXEST, LBXTST)
# use common units
tab <- tab |> mutate(LBXTST = 10*LBXTST)
# calculate mean concentrations
tab <- tab |>
pivot_longer(starts_with("LBX")) |>
drop_na(value) |>
summarize(mean = weighted.mean(value, w = WTMEC4YR),
.by = c(sex, RIDAGEYR, name))
# Plot hormone levels based on age and category
ggplot(tab, aes(x = RIDAGEYR,
y = mean,
color = as.character(sex))) +
geom_line() +
facet_wrap(vars(name))
Next let's look at how pregnancy status varies by age and previous pregnancies.
For this we will need to combine three tables together using multiple calls to
inner_join()
.
# use two inner joins
dat_h <- nhanes_h$DEMO |>
inner_join(nhanes_h$TST) |>
inner_join(nhanes_h$RHQ)
# use the reduce function
dat_i <- nhanes_i |> reduce(inner_join)
dat <- bind_rows(dat_h, dat_i) |>
mutate(WTMEC4YR = WTMEC2YR/2,
WTINT4YR = WTINT2YR/2)
Let's plot the pregnancy-by-age data we created previously.
# Recreate the pregnancy table from earlier
# Treat RIDEXPRG as a factor, and keep missing combinations in count()
tab <- dat |>
filter(RIDEXPRG %in% 1:2) |>
count(RIDAGEYR, RIDEXPRG = as.factor(RIDEXPRG),
wt = WTMEC4YR, .drop = FALSE) |>
mutate(p = n / sum(n), .by = RIDAGEYR)
# Reshape and pretty the table
tab <- tab |> filter(RIDEXPRG == "1") |>
select(Age = RIDAGEYR, Frac_Preg = p)
# Create a column plot
ggplot(tab, aes(x = Age, y = Frac_Preg)) + geom_col()
We can add more information to this plot if we consider how many deliveries the respondents have had.
# RHQ131 - Ever been pregnant before.
# RHQ171 - How many deliveries live birth result
tab <- dat |>
filter(RIDEXPRG %in% 1:2 & RHQ131 %in% 1:2) |>
rename(Age = RIDAGEYR, Status = RIDEXPRG)
# If a respondent answered "no" to RHQ131, then RHQ171 will be missing/NA
# Use `coalesce()` to replace these NAs with 0.
tab <- tab |> mutate(Births = coalesce(RHQ171, 0))
# Count the Age x Births x Status combinations.
# Use `complete()` to add missing combinations as 0.
tab <- tab |> count(Age, Births, Status, wt = WTMEC4YR) |>
complete(Age, Births, Status, fill = list(n = 0))
# Create pregnancy table
tab <- tab |> mutate(Frac_Preg = n/sum(n), .by = Age) |>
filter(Status == 1)
# Create a new plot with additional information
ggplot(tab, aes(x = Age, y = Frac_Preg, fill = Births)) +
geom_col()
# Treat births as a factor and use a discrete palette
ggplot(tab, aes(x = Age, y = Frac_Preg, fill = as.character(Births))) +
geom_col() +
scale_fill_brewer(palette = "Set1")
# Customize the scale so that 0 is at the bottom
ggplot(tab, aes(x = Age, y = Frac_Preg, fill = fct_rev(as.character(Births)))) +
geom_col() +
scale_fill_brewer(palette = "Set1", direction = -1)
We have too many categories in our data to be properly represented by our color scheme. Let's create one category for 4 or more births.
# Create bins '(0,1]', '(1,2]', '(2,3]', and '(3,4]'
# Everything else is NA
cut(tab$Births, breaks = 0:4)
# Create bins '[0,1]', '(1,2]', '(2,3]', and '(3,4]'
cut(tab$Births, breaks = 0:4, include.lowest = TRUE)
# Create bins '[0,1]', '(1,2]', '(2,3]', '(3,4]', '(4,Inf]'
cut(tab$Births, breaks = c(0:4,Inf), include.lowest = TRUE)
# Create bins '[0,1)', '[1,2)', '[2,3)', '[3,4)', '[4,Inf)'
cut(tab$Births, breaks = c(0:4,Inf), right = FALSE)
# Add reasonable labels
cut(tab2$Births, breaks = c(0:4,Inf), right = FALSE,
labels = c(0:3, "4+"))
# Wrap everything a function
cut_births <- function(x) {
cut(x, breaks = c(0:4,Inf),
labels = c(0:3, "4+"),
right = FALSE)
}
# Finish our plot
tab2 <- tab |> mutate(Births = cut_births(Births))
ggplot(tab2, aes(x = Age, y = Frac_Preg, fill = fct_rev(Births))) +
geom_col() +
scale_fill_brewer(palette = "Set1", direction = -1)
Patchwork makes it ridiculously simple to combine multiple ggplots into a single graph.
library(tidyverse)
library(patchwork)
# Download Data
# Specify what datsets we want to use.
tables <- c("DEMO", "UM") # Demographics and Urine Metals data
# Specify what years we want to pull data from.
# Use `enframe()` to create a tibble from a named vector.
cohorts <- c(H = "2013-2014", I = "2015-2016")
cohorts <- enframe(cohorts, name = "ext", value = "years")
cohorts
# Create all permutations of years and tables.
url_data <- expand_grid(surveys, tables)
url_data
# Build the urls we will be downloading.
urls <- str_glue_data(url_data, "https://wwwn.cdc.gov/Nchs/Nhanes/{years}/{tables}_{ext}.XPT")
# Download and parse datasets.
nhanes <- set_names(urls) |> map(haven::read_xpt, .progress = TRUE)
# Split data into groups based on year.
nhanes <- split(nhanes, url_data$ext)
# Reduce each year to a single data.frame using `inner_join()`.
dat <- map(nhanes, \(x) reduce(x, inner_join))
# Reduce the entire dataset using bind_rows.
dat <- reduce(dat, bind_rows)
# Add four-year weights
dat <- dat |> mutate(WTINT4YR = WTINT2YR/2,
WTMEC4YR = WTMEC2YR/2 )
# Construct data for graph
tab <- dat |>
select(SEQN, RIDAGEYR, RIAGENDR, WTMEC4YR, starts_with("URX")) |>
mutate(RIAGENDR = as.factor(RIAGENDR)) |>
pivot_longer(starts_with("URX")) |>
summarize(mean = weighted.mean(value, w = WTMEC4YR, na.rm = TRUE),
.by = c(RIDAGEYR, RIAGENDR, name)) |>
pivot_wider(values_from="mean")
tab
# Plot manganese levels by age.
mn <- ggplot(tab, aes(x = RIDAGEYR, y = URXUMN, color = RIAGENDR)) +
geom_line() + theme(legend.position = "top") + labs(color = "Sex")
mn
# Plot lead levels by age.
pb <- ggplot(tab, aes(x = RIDAGEYR, y = URXUPB, color = RIAGENDR)) +
geom_line() + theme(legend.position = "top") + labs(color = "Sex")
pb
# Plot uranium levels by age.
u <- ggplot(tab, aes(x = RIDAGEYR, y = URXUUR, color = RIAGENDR)) +
geom_line() + theme(legend.position = "top") + labs(color = "Sex")
u
# Let's use patchwork to create figures from multiple subplots
# Patchwork has a lot of options and we will go over a few here
# Horizontal
mn + pb
# Vertical
mn / pb
# specify a column layout with heights
mn + pb + plot_layout(ncol = 1, heights = c(1,2))
# use one legend
mn + pb + plot_layout(ncol = 1, heights = c(1,2), guides = "collect")
# specify widths
mn + pb + plot_layout(ncol = 2, width = c(1, 2))
# add space
mn + plot_spacer() + pb
# complex layouts
(mn + pb) / u
(mn / pb) | u
# really complex layouts
mn + {
pb + {
pb + {
u + plot_layout(ncol = 1)
}
}
} + plot_layout(ncol = 1, guides = "collect")