Skip to content

Instantly share code, notes, and snippets.

@reedacartwright
Last active October 26, 2024 23:46
Show Gist options
  • Save reedacartwright/8d15fd091a30c38d980232b288d7ab69 to your computer and use it in GitHub Desktop.
Save reedacartwright/8d15fd091a30c38d980232b288d7ab69 to your computer and use it in GitHub Desktop.
NHANES Analysis

Introduction

Today we will be working with data from the National Health and Nutrition Examination Survey (NHANES).

NHANES is a program of studies designed to assess the health and nutritional status of adults and children in the United States. The survey is unique in that it combines interviews and physical examinations. NHANES is a major program of the National Center for Health Statistics (NCHS). NCHS is part of the Centers for Disease Control and Prevention (CDC) and has the responsibility for producing vital and health statistics for the Nation.

The NHANES interview includes demographic, socioeconomic, dietary, and health-related questions. The examination component consists of medical, dental, and physiological measurements, as well as laboratory tests administered by highly trained medical personnel.

Findings from this survey will be used to determine the prevalence of major diseases and risk factors for diseases. Information will be used to assess nutritional status and its association with health promotion and disease prevention. NHANES findings are also the basis for national standards for such measurements as height, weight, and blood pressure. Data from this survey will be used in epidemiological studies and health sciences research, which help develop sound public health policy, direct and design health programs and services, and expand the health knowledge for the Nation.

The dataset we will be working with data from the 2013-2014 and 2015-2016 surveys. Information about the variables recorded in the data can be found in the following links.

Challenge

Explore the NHANES website (https://www.cdc.gov/nchs/nhanes/index.htm) and in Etherpad describe some of the information and data that can be found there.

[Insert Class Roster]

Useful Tips for NHANES Website

Import NHANES Data from the Web

NHANES distributes its data in the SAS transport format (.XPT), and Tidyverse includes a function haven::read_xpt() to read these files directly. FFor example, the demographic data from years 2013-2014 and 2015-2016 are at

library(tidyverse)

# Construct URLs for the data we are going to use.
# demographics, sex steroid, reproductive health
tables <- c("DEMO", "TST", "RHQ")

url_h <- str_glue("https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/{tables}_H.XPT")
url_i <- str_glue("https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/{tables}_I.XPT")

url_h
url_i

# Good: Download and process data using a for-loop.
nhanes_h <- list() # Create an empty list
for(u in url_h) {
    # Iterate through all the elements (urls) in url_h and download the
    # data using the function `haven:read_xpt()`. Save the results as elements
    # in the list
    nhanes_h[[u]] <- haven::read_xpt(u)
}

# Better: Download and process data using tidyverse's `map()` function.
nhanes_i <- url_i |> map(haven::read_xpt, .progress = TRUE)

# Best: Use `set_names()` to add names to the input. Otherwise,
# nhanes_i is unnamed.
nhanes_i <- set_names(url_i) |> map(haven::read_xpt, .progress = TRUE)

# Extract the table name from the urls
names(nhanes_h) <- names(nhanes_h) |> str_extract("[^/]+(?=_H[.]XPT$)")
names(nhanes_i) <- names(nhanes_i) |> str_extract("[^/]+(?=_I[.]XPT$)")

Weights in NHANES

Various sample weights are available on the data release files – such as the interview weight (WTINT2YR), the MEC exam weight (WTMEC2YR), and several subsample weights. Use of the correct sample weight for NHANES analyses depends on the variables being used. A good rule of thumb is to use “the least common denominator” where the variable of interest that was collected on the smallest number of respondents is the “least common denominator.” The sample weight that applies to that variable is the appropriate one to use for that particular analysis.

In order to work with four years of data, we will bind the rows from both of the datasets and then create new weights appropriate for the larger dataset.

# Combine datasets and create "4-year" weights from 2-year weights
# WTINT2YR - Full sample 2 year interview weight
# WTMEC2YR = Full sample 2 year MEC exam weight
dat <- bind_rows(nhanes_h$DEMO, nhanes_i$DEMO) |>
    mutate(WTINT4YR = WTINT2YR/2,
           WTMEC4YR = WTMEC2YR/2)

dat

Fraction of Women Pregnant

Using the demographic dataset we can measure the the fraction of women who are pregnant based on age. Pregnancy was accessed during the exam stage and we will use the MEC weights in this analysis.

# Estimate number of women pregnant per age
# RIDAGEYR = respondent's age in years
# RIDEXPRG = Pregnancy status at exam
tab <- dat |>
    drop_na(RIDEXPRG) |>
    count(RIDAGEYR, RIDEXPRG, wt = WTMEC4YR)

tab

Wait. What the heck do 1, 2, and 3 mean under RIDEXPRG. Let's look at the documentation. https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/DEMO_H.htm

RIDEXPRG: Pregnancy status at the time of the health examination was ascertained for females 8–59 years of age. Due to disclosure risks pregnancy status is only released for women 20-44 years of age. The information used to code RIDEXPRG values included self-reported pregnancy status and urine pregnancy test results. Persons who reported they were pregnant at the time of exam were assumed to be pregnant (RIDEXPRG=1). Those who reported they were not pregnant or did not know their pregnancy status were further classified based on the results of the urine pregnancy test. If the respondent reported “no” or “don’t know” and the urine test result was positive, the respondent was coded as pregnant (RIDEXPRG=1). If the respondent reported “no” and the urine test was negative, the respondent was coded not pregnant (RIDEXPRG=2). If the respondent reported did not know her pregnancy status and the urine test was negative, the respondent was coded “could not be determined” (RIDEXPRG=3). Persons who were interviewed, but not examined also have an RIDEXPRG value = 3 (could not be determined).

Lets drop the “could not be determined” pregnancy status and recalculate.

# Let's estimate a fraction keeping only the yes/no values
tab <- dat |>
    filter(RIDEXPRG %in% 1:2) |>
    count(RIDAGEYR, RIDEXPRG, wt = WTMEC4YR) |>
    mutate(p = n / sum(n), .by=RIDAGEYR)

Let's reshape the table.

# Reshape tab to a wide format with a column for age, and two columns for
# pregnancy status. Any columns not mentioned in these three arguments will be
# dropped.
tab |> pivot_wider(id_cols = RIDAGEYR,
    names_from = RIDEXPRG,
    values_from = p)

# We have some NAs in the list which represent 0.0
tail(tab)

# Make the result easier to read before we save it
tab <- tab |> rename(Age = RIDAGEYR, Status = RIDEXPRG)
tab <- tab |>
    mutate(Status = factor(Status, levels=1:2,
        labels=c("Yes", "No"))) |>
    pivot_wider(id_cols = Age, names_from = Status,
        values_from = p, values_fill = 0.0)

# It looks pretty now
tail(tab)

write_csv(tab, "frac_preg_by_age.csv")

Changes in Hormone Levels by Age

To work with data from multiple tables, we will need to combine the tables in some fashion. Let's use map to quickly look at the number of rows per table.

# Apply the function `nrow()` to every element of a list
map(nhanes_h, nrow) # returns a list
map_int(nhanes_i, nrow) # returns an integer vector

# `map()` has alternative functions that return specific structures.
# `map_int()` returns a vector of integers, `map_chr()` returns a vector of
# characters, etc.

The number (and order) of rows vary between the tables. We will use a join operation to combine the tables based on the respondent sequence number, "SEQN". Dplyr offers multiple ways to join two data frames

  • inner_join(x, y) keeps observations from x that have a matching key in y.
  • left_join(x, y) keeps observations in x.
  • right_join(x, y) keeps observations in y.
  • full_join(x, y) keeps observations in x and y.
  • among others.
# Use an inner_join to combine the DEMO and TST tables
# SEQN = Respondent sequence number
dat_h <- inner_join(nhanes_h$DEMO, nhanes_h$TST, by = "SEQN")
dat_i <- inner_join(nhanes_i$DEMO, nhanes_i$TST, by = "SEQN")

# check names and number of rows
names(nhanes_h$DEMO)
names(nhanes_h$TST)
names(dat_h)
nrow(nhanes_h$DEMO)
nrow(nhanes_h$TST)
nrow(dat_h)

# Bind the years together
dat <- bind_rows(dat_h, dat_i) |>
    mutate(WTINT4YR = WTINT2YR/2,
           WTMEC4YR = WTMEC2YR/2)

In our next analysis we are going to look at how levels of sex hormones vary by age. This will require us to take into account the 4-year MEC weights.

# Estimate mean levels of sex hormones by age.
# RIAGENDR = respondent's sex
# RIDAGEYR = respondent's age in years
# LBXTST = Testosterone, total (ng/dL)
# LBXEST = Estradiol (pg/mL)
# Note: 1 ng/dL = 10 pg/mL
tab <- dat |> 
    summarize(mean_est = weighted.mean(LBXEST, w = WTMEC4YR, na.rm = TRUE),
        mean_tst = weighted.mean(LBXTST, w = WTMEC4YR, na.rm = TRUE),
        .by = c(RIAGENDR, RIDAGEYR))

# Note: weighted.mean uses `w` as the weight parameter
# Note: count uses `wt` as the weight parameter

# save results to a CSV file
write_csv(tab, file = "hormones_by_age_and_sex.csv")

Next we are going to create a plot of our results.

Testosterone Levels

# First plot attempt of TST
ggplot(tab, aes(x = RIDAGEYR, y = mean_tst)) +
    geom_line()

# The previous plot did not distinguish males and females.
# Let's fix that.
ggplot(tab, aes(x = RIDAGEYR, y = mean_tst, color = RIAGENDR)) +
    geom_line()

# The plot still didn't look good because RIAGENDR was treated as a continuous variable
# Let's fix that.
ggplot(tab, aes(x = RIDAGEYR, y = mean_tst, color = as.factor(RIAGENDR))) +
    geom_line()

Estrogen Levels

ggplot(tab, aes(x = RIDAGEYR, y = mean_est, color = as.factor(RIAGENDR))) +
    geom_line()

Both Levels Together

tab <- tab |> pivot_longer(c(mean_est, mean_tst))

ggplot(tab, aes(x = RIDAGEYR, y = value, color = as.factor(RIAGENDR))) +
    facet_wrap(vars(name)) +
    geom_line()

Challenge

There's a lot of noise in the EST data for females. Let's see if it is due to pregnancy. Challenge: Remove pregnant women from the data set and rerun the analysis.

# SOLUTION
# Filter out pregnant women and recreate plot
tab <- dat |>
    filter(is.na(RIDEXPRG) | RIDEXPRG != 1) |>
    group_by(RIAGENDR, RIDAGEYR) |>
    summarize(mean_est = weighted.mean(LBXEST, w = WTMEC4YR, na.rm = TRUE),
        mean_tst = weighted.mean(LBXTST, w = WTMEC4YR, na.rm = TRUE)) |>
    pivot_longer(c(mean_est, mean_tst))

ggplot(tab, aes(x = RIDAGEYR, y = value, color = as.factor(RIAGENDR))) +
    facet_wrap(vars(name)) +
    geom_line()

Splitting

Let's split respondents into Male, Female (not-pregnant), and Female (pregnant).

# Encode Sex and Pregnancy status as three categories
tab <- dat |> mutate(sex = case_when(
    RIAGENDR == 1 ~ "M",
    RIDEXPRG == 1 ~ "FP",
    TRUE ~ "FN"
  ))

# subset the important columns
tab <- tab |> select(SEQN, sex, RIDAGEYR, 
       WTMEC4YR, LBXEST, LBXTST)
# use common units
tab <- tab |> mutate(LBXTST = 10*LBXTST)

# calculate mean concentrations
tab <- tab |>
  pivot_longer(starts_with("LBX")) |>
  drop_na(value) |>
  summarize(mean = weighted.mean(value, w = WTMEC4YR), 
            .by = c(sex, RIDAGEYR, name))

# Plot hormone levels based on age and category
ggplot(tab, aes(x = RIDAGEYR, 
                y = mean,
                color = as.character(sex))) +
  geom_line() +
  facet_wrap(vars(name))

Exploring Pregnancy in More Detail

Next let's look at how pregnancy status varies by age and previous pregnancies. For this we will need to combine three tables together using multiple calls to inner_join().

# use two inner joins
dat_h <- nhanes_h$DEMO |>
    inner_join(nhanes_h$TST) |>
    inner_join(nhanes_h$RHQ)

# use the reduce function
dat_i <- nhanes_i |> reduce(inner_join)

dat <- bind_rows(dat_h, dat_i) |>
    mutate(WTMEC4YR = WTMEC2YR/2,
           WTINT4YR = WTINT2YR/2)

Let's plot the pregnancy-by-age data we created previously.

# Recreate the pregnancy table from earlier
# Treat RIDEXPRG as a factor, and keep missing combinations in count()
tab <- dat |>
    filter(RIDEXPRG %in% 1:2) |>
    count(RIDAGEYR, RIDEXPRG = as.factor(RIDEXPRG),
        wt = WTMEC4YR, .drop = FALSE) |>
    mutate(p = n / sum(n), .by = RIDAGEYR)

# Reshape and pretty the table
tab <- tab |> filter(RIDEXPRG == "1") |>
    select(Age = RIDAGEYR, Frac_Preg = p)

# Create a column plot
ggplot(tab, aes(x = Age, y = Frac_Preg)) + geom_col()

We can add more information to this plot if we consider how many deliveries the respondents have had.

# RHQ131 - Ever been pregnant before.
# RHQ171 - How many deliveries live birth result
tab <- dat |>
    filter(RIDEXPRG %in% 1:2 & RHQ131 %in% 1:2) |>
    rename(Age = RIDAGEYR, Status = RIDEXPRG)

# If a respondent answered "no" to RHQ131, then RHQ171 will be missing/NA
# Use `coalesce()` to replace these NAs with 0.
tab <- tab |> mutate(Births = coalesce(RHQ171, 0))

# Count the Age x Births x Status combinations.
# Use `complete()` to add missing combinations as 0.
tab <- tab |> count(Age, Births, Status, wt = WTMEC4YR) |>
    complete(Age, Births, Status, fill = list(n = 0))

# Create pregnancy table
tab <- tab |> mutate(Frac_Preg = n/sum(n), .by = Age) |>
    filter(Status == 1)

# Create a new plot with additional information
ggplot(tab, aes(x = Age, y = Frac_Preg, fill = Births)) +
    geom_col()

# Treat births as a factor and use a discrete palette
ggplot(tab, aes(x = Age, y = Frac_Preg, fill = as.character(Births))) +
    geom_col() +
    scale_fill_brewer(palette = "Set1")

# Customize the scale so that 0 is at the bottom
ggplot(tab, aes(x = Age, y = Frac_Preg, fill = fct_rev(as.character(Births)))) +
    geom_col() +
    scale_fill_brewer(palette = "Set1", direction = -1)

We have too many categories in our data to be properly represented by our color scheme. Let's create one category for 4 or more births.

# Create bins '(0,1]', '(1,2]', '(2,3]', and '(3,4]'
# Everything else is NA
cut(tab$Births, breaks = 0:4)

# Create bins '[0,1]', '(1,2]', '(2,3]', and '(3,4]'
cut(tab$Births, breaks = 0:4, include.lowest = TRUE)

# Create bins '[0,1]', '(1,2]', '(2,3]', '(3,4]', '(4,Inf]'
cut(tab$Births, breaks = c(0:4,Inf), include.lowest = TRUE)

# Create bins '[0,1)', '[1,2)', '[2,3)', '[3,4)', '[4,Inf)'
cut(tab$Births, breaks = c(0:4,Inf), right = FALSE)

# Add reasonable labels
cut(tab2$Births, breaks = c(0:4,Inf), right = FALSE,
    labels = c(0:3, "4+"))

# Wrap everything a function
cut_births <- function(x) {
    cut(x, breaks = c(0:4,Inf),
        labels = c(0:3, "4+"),
        right = FALSE)
}

# Finish our plot
tab2 <- tab |> mutate(Births = cut_births(Births))
ggplot(tab2, aes(x = Age, y = Frac_Preg, fill = fct_rev(Births))) +
    geom_col() +
    scale_fill_brewer(palette = "Set1", direction = -1)

Patchwork

Patchwork makes it ridiculously simple to combine multiple ggplots into a single graph.

library(tidyverse)
library(patchwork)

# Download Data
# Specify what datsets we want to use.
tables <- c("DEMO", "UM") # Demographics and Urine Metals data

# Specify what years we want to pull data from.
# Use `enframe()` to create a tibble from a named vector.
cohorts <- c(H = "2013-2014", I = "2015-2016")
cohorts <- enframe(cohorts, name = "ext", value = "years")

cohorts

# Create all permutations of years and tables.
url_data <- expand_grid(surveys, tables)

url_data

# Build the urls we will be downloading.
urls <- str_glue_data(url_data, "https://wwwn.cdc.gov/Nchs/Nhanes/{years}/{tables}_{ext}.XPT")

# Download and parse datasets.
nhanes <- set_names(urls) |> map(haven::read_xpt, .progress = TRUE)

# Split data into groups based on year.
nhanes <- split(nhanes, url_data$ext)

# Reduce each year to a single data.frame using `inner_join()`.
dat <- map(nhanes, \(x) reduce(x, inner_join))

# Reduce the entire dataset using bind_rows.
dat <- reduce(dat, bind_rows)

# Add four-year weights
dat <- dat |> mutate(WTINT4YR = WTINT2YR/2,
                     WTMEC4YR = WTMEC2YR/2 )

# Construct data for graph
tab <- dat |>
    select(SEQN, RIDAGEYR, RIAGENDR, WTMEC4YR, starts_with("URX")) |>
    mutate(RIAGENDR = as.factor(RIAGENDR)) |>
    pivot_longer(starts_with("URX")) |>
    summarize(mean = weighted.mean(value, w = WTMEC4YR, na.rm = TRUE),
        .by = c(RIDAGEYR, RIAGENDR, name)) |>
    pivot_wider(values_from="mean")

tab

# Plot manganese levels by age.
mn <- ggplot(tab, aes(x = RIDAGEYR, y = URXUMN, color = RIAGENDR)) +  
    geom_line() + theme(legend.position = "top") + labs(color = "Sex")
mn

# Plot lead levels by age.
pb <- ggplot(tab, aes(x = RIDAGEYR, y = URXUPB, color = RIAGENDR)) + 
    geom_line() + theme(legend.position = "top") + labs(color = "Sex")
pb

# Plot uranium levels by age.
u <- ggplot(tab, aes(x = RIDAGEYR, y = URXUUR, color = RIAGENDR)) +
    geom_line() + theme(legend.position = "top") + labs(color = "Sex")
u

# Let's use patchwork to create figures from multiple subplots
# Patchwork has a lot of options and we will go over a few here

# Horizontal
mn + pb

# Vertical
mn / pb

# specify a column layout with heights
mn + pb + plot_layout(ncol = 1, heights = c(1,2))

# use one legend
mn + pb + plot_layout(ncol = 1, heights = c(1,2), guides = "collect")

# specify widths
mn + pb + plot_layout(ncol = 2, width = c(1, 2))

# add space
mn + plot_spacer() + pb

# complex layouts
(mn + pb) / u

(mn / pb) | u

# really complex layouts
mn + {
    pb + {
        pb + {
            u + plot_layout(ncol = 1)
        }
    }
} + plot_layout(ncol = 1, guides = "collect")
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment