-
-
Save pdparker/083c56fe8bf9474340b7f76c8d583a8f to your computer and use it in GitHub Desktop.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
--- | |
title: "R Notebook" | |
output: html_notebook | |
--- | |
# Overview | |
## | |
![Overview](fig/PISA-Workshop.png) | |
## Resources | |
- [Working with Archival Data: Studying Lives](https://www.amazon.com/Working-Archival-Data-Quantitative-Applications/dp/0803942621) | |
- [Complex Surveys: A Guide to Analysis Using R](https://www.amazon.com/Complex-Surveys-Guide-Analysis-Using/dp/0470284307/ref=sr_1_1?s=books&ie=UTF8&qid=1539927970&sr=1-1&keywords=Thomas+Lumley) | |
- [Analyze Survey Data for Free](http://asdfree.com/) | |
- [R for Data Scientists](http://r4ds.had.co.nz/) | |
- [A Tidyverse Approach to Survey Data](https://cran.r-project.org/web/packages/srvyr/vignettes/srvyr-vs-survey.html) | |
# Archival Data - Advantages | |
- Cost effective | |
- Better than what you can afford to do on your own | |
- Representative | |
## The disadvantages of archival data | |
>archival data analysis often means initial questions are reformulated to fit the data and the data are reworked in coding and recoding to fit the question better | |
Elder, Clipp, and Colerick (1993, p. 2) | |
## The unique skills required for archival data | |
- Munging: What rows or columns need to be excluded, cleaned, merged | |
- Harmonizing:Can this survey data be merged with other survey data | |
- Integrating: Can this survey data be merged with data from other sources | |
- Indexing: Can I combine variables to get an index of what I want | |
- Aligning: Can I create a correspondence between one type of coding and another | |
# Intro to Complex Survey Designs | |
# Why Complex Survey Designs | |
- Simple random samples are simple and require relatively few units | |
- Simple random samples are not often used because stratified samples can provide the same precision at a lower cost | |
## Sampling units | |
- Primary Sampling Unit e.g., school | |
- Secondary Sampling Unit e.g., child | |
## Strata | |
- Break a population down into known groups | |
- Probability sample from within each group | |
- MUST know strata membership and characteristics ahead of time | |
- Stratification can reduce standard errors because more information about the population is known | |
## Oversampling | |
- Over sampling can be beneficial when important groups are small | |
- PISA oversamples Indigenous students | |
- Provided population statistics are know this can be corrected via weights | |
## Weights | |
Types of Weights: | |
- Precision (inverse-probability) weights | |
- Frequency weights | |
- Sampling weights | |
- Senate weights | |
- Sampling weights | |
- Repeated Replicate weights | |
- Attrition weights | |
## Replicate weights | |
- PISA is a two-stage design (with schools as the primary sampling unit) | |
- Standard errors will be wrong if this is not accounted for | |
- PISA provides repeated replication weights to correct for this | |
## Replicate Weights | |
> The replicate-weight approach to estimating standard errors computes the standard deviation of the estimated summary across many partially independent subsets of the one sample, and extrapolates from this to the standard deviation between completely independent samples. | |
Lumley, Thomas. Complex Surveys: A Guide to Analysis Using R (Wiley Series in Survey Methodology) (Kindle Locations 797-800). Wiley. Kindle Edition. | |
## Types of Replicate Weights | |
- Balanced Repeated Replication: Estimates from half samples | |
- Fay: All observations are weighted $(2 - p)/\pi_i$ and $p/\pi_i$ (PISA prob = .5) | |
- Jackknife: JK1 (1 ob or 1 cluster weighted at zero) JKn (n ob or n cluster) | |
- Block Bootstrap | |
## Statistic for PISA Replicate Weights | |
- Point: $\sum_{R=1}^MA_r/M$ | |
- Standard Error: $\sqrt{ \frac{4}{M} \times \sum_{R=1}^{M} (A_r - \bar{A})^2 }$ | |
## But Multilevel Models? | |
> In PISA, the sampling variances estimated with multi-level models will always be greater than the sampling variances estimated with Fay replicate samples | |
[PISA 2003 Data Analysis Manual](https://goo.gl/dfLSMC) | |
# My Approach to R | |
## Base R script | |
Normal R code can get quite messy | |
mean(subset(iris, Species == 'virginica')$Sepal.Length) | |
``` | |
## Tidyverse | |
Tidycode provides pipes which clean things up alot by passing objects from the left to the right. | |
```{r} | |
suppressPackageStartupMessages(library(tidyverse)) | |
iris %>% | |
filter(Species == 'virginica') %>% | |
summarise(m = mean(Sepal.Length)) | |
``` | |
# Setting up PISA data | |
## The packages to use | |
```{r message=FALSE} | |
## Required --- | |
library(survey) # the core of everything we will do today | |
library(mitools) # needed for working with plausible values | |
## optional but makes life easier --- | |
library(tidyverse) # a suite of tools to make coding easier | |
library(readit) # Figures out what sort of file you are giving it and read in the data | |
library(srvyr) # tidyverse for survey data | |
library(svyPVpack) # convience functions for survey data and plausible values | |
## For integrating data from statistical agencies --- | |
library(OECD) # grabs OECD data from the website | |
##F or analysis in parallel --- | |
library(furrr) # needed to easily set up and run models in parallel | |
plan(multiprocess) # prepare to run things in parallel | |
##Pretty graphs --- | |
library(ggthemes) # emulate the style of professional data vis | |
library(wesanderson) # colour pallets that work well together | |
``` | |
## Data Munging | |
```{r cache=TRUE} | |
d2015 <- readit("~/Dropbox/Databases/WORLD/PUF_SAS_COMBINED_CMB_STU_QQQ/cy6_ms_cmb_stu_qqq.sas7bdat") %>% # load file | |
filter(CNT == 'AUS') %>% # Lets just use one country | |
rename(GENDER = ST004D01T) %>% #give us better names to work with | |
mutate(GENDER = ifelse(GENDER == 1, 'girl', 'boy')) # give us meaningful names | |
``` | |
## Creating a complex survey data.frame - The easy way | |
```{r} | |
#Tidyverse | |
d2015_survey <- d2015 %>% | |
as_survey(type = "Fay", repweights = starts_with("W_FSTURWT"), | |
weights = W_FSTUWT, rho = .5) | |
#Standard R | |
# d2015_survey<-svrepdesign(weights = ~W_FSTUWT, repweights = "W_FSTURWT[0-9]+", | |
# type = "Fay", rho = .5, data = d2015) | |
``` | |
# Some basic models | |
## Sample Size and basic descriptives | |
```{r} | |
d2015_survey %>% | |
group_by(GENDER) %>% | |
summarise(un = unweighted(n()), n = survey_total(), | |
m_scie = survey_mean(PV1SCIE)) | |
``` | |
## Some Tidyverse Magic | |
```{r} | |
d2015_survey %>% | |
summarise_at(vars(matches("PV[1-9]+MATH")), survey_mean) %>% | |
t() | |
``` | |
## Balanced Repeated Replication weights by Hand | |
```{r} | |
ms <- c() | |
for (i in 1:80){ | |
tmp <- d2015 %>% select(W_FSTUWT, paste0('W_FSTURWT',i), PV2MATH) | |
wt = tmp[,paste0('W_FSTURWT',i)] | |
names(wt) <- "wt" | |
tmp <- bind_cols(tmp, wt) | |
ms[i] <- weighted.mean(x = tmp$PV2MATH, w = tmp$wt, na.rm=TRUE) | |
} | |
``` | |
## Results | |
```{r} | |
mean(ms); sqrt(sum(4/80*(ms - mean(ms))^2)) | |
tmp <- svymean(~PV2MATH, design = d2015_survey, return.replicates=TRUE) | |
tmp | |
all(ms == tmp$replicates) | |
``` | |
## Some other basic descriptives | |
```{r} | |
svytable(~GENDER, d2015_survey) | |
svytable(~GENDER, d2015_survey, Ntotal = 100) | |
``` | |
## Graph | |
```{r} | |
p1<- d2015_survey %>% | |
group_by(GENDER) %>% | |
summarise(m_scie = survey_mean(TEACHSUP, na.rm=TRUE)) %>% | |
ggplot(aes(x=GENDER, y = m_scie, fill = GENDER)) + | |
geom_bar(stat="identity") + | |
theme_wsj() + | |
scale_fill_manual(values=wes_palette(n=2, name="Royal1")) | |
``` | |
## Graph | |
```{r echo=FALSE} | |
p1 | |
``` | |
# Plausible Values | |
## What are Plausible Values | |
- Not all children in PISA receive all test items | |
- Thus we must estimate their scores with missing information | |
- When we estimate scores we do so with uncertainty | |
- Plausible values are taken from an IRT model | |
- They are estimates of the child's true proficiency | |
- Combined using Multiple Imputation logic of Rubin | |
## Dealing with plausible values | |
- The Simple Way (covers most basic cases but not very flexible) | |
- The Multiple Imputation Way (annoying but possible to do almost anything with it) | |
- Parallel Processing (mostly not worthwhile except for very big models) | |
## Replicating OECD report results | |
```{r cache=TRUE} | |
# Matches 2015 PISA Report | |
svyPVpm(by = ~CNT, d2015_survey, pvs = paste0("PV", 1:10, "SCIE")) %>% | |
select(-ends_with("SE"),-Percent, -SE.Percent, -CNT, -Group1) # p 44 | |
svyPVpm(by = ~CNT, d2015_survey, pvs = paste0("PV", 1:10, "READ")) %>% | |
select(-ends_with("SE"),-Percent, -SE.Percent, -CNT, -Group1) # p 44 | |
``` | |
## Replicating OECD Results | |
```{r cache=TRUE} | |
# Matches 2015 PISA Report | |
svyPVpm(by = ~CNT, d2015_survey, pvs = paste0("PV", 1:10, "MATH"))%>% | |
select(-ends_with("SE"),-Percent, -SE.Percent, -CNT, -Group1) # p 44 | |
svyPVglm(PV..SCIE ~ ESCS, d2015_survey) # P46 | |
``` | |
## Replicating OECD Results using data lists | |
```{r} | |
pvs <- list() | |
mis <- list() | |
K = paste0("PV",1:10, "MATH") | |
for (i in 1:10){ | |
mis[[i]] <- d2015 %>% | |
select(matches(paste0("PV", i, "[A-Z]+") ), ESCS, | |
GENDER, W_FSTUWT, starts_with("ST"), | |
starts_with("W_FSTURWT"), CNT) | |
names(mis[[i]]) <- gsub("PV[0-9]+", "", names(mis[[i]])) | |
pvs[[i]] <- mis[[i]] %>% | |
as_survey(type = "Fay", repweights = starts_with("W_FSTURWT"), weights = W_FSTUWT, rho = .5) | |
} | |
``` | |
## Replicating OECD Results using data lists | |
```{r} | |
names(pvs) <- paste0("PV", 1:10) | |
mis <- imputationList(mis) | |
mis <-svrepdesign(weights = ~W_FSTUWT, repweights = "W_FSTURWT[0-9]+", | |
type = "Fay", rho = .5, data = mis) | |
``` | |
## Data list | |
```{r cache=TRUE} | |
with(mis, svymean(~SCIE, na.rm=TRUE)) %>% | |
MIcombine() %>% | |
summary | |
pvs %>% | |
future_map(~ svymean(~SCIE, design = ., na.rm=TRUE)) %>% | |
MIcombine() %>% | |
summary | |
``` | |
# More Complex Models | |
## Complex GLM Model | |
```{r cache = TRUE} | |
sjlabelled::get_label(d2015$ST123Q01NA) | |
pvs %>% | |
future_map(~ svyolr(as.factor(ST123Q01NA)~scale(SCIE), design = .)) %>% | |
MIcombine() %>% | |
summary | |
``` | |
## SEM Models | |
``` | |
library(lavaan) | |
library(lavaan.survey) | |
M1 <- ' | |
EMOSUPS =~ ST123Q01NA + ST123Q02NA + ST123Q03NA + ST123Q04NA | |
INSTSCIE =~ ST113Q01TA + ST113Q02TA + ST113Q03TA + ST113Q04TA | |
' | |
simple.fit <- cfa(M1, data=d2015) | |
summary(simple, standardized = TRUE, fit.measures = TRUE) | |
survey.fit <- lavaan.survey(lavaan.fit=simple.fit, survey.design=d2015_survey) | |
summary(survey.fit, standardized = TRUE, fit.measures = TRUE) | |
``` | |
## Missing data | |
- Easiest to use multiple imputations via a package like Amelia II | |
- Assign one PV to each imputation | |
- Run analysis as per the models already shown | |
## Models not covered by the Survey Package | |
```{r cache = TRUE, message=FALSE} | |
library(quantreg) | |
withReplicates(d2015_survey, quote(coef(rq(INSTSCIE~scale(PV1SCIE), | |
tau=0.2, weights=.weights)))) | |
withReplicates(d2015_survey, quote(coef(rq(INSTSCIE~scale(PV1SCIE), | |
tau=0.8, weights=.weights)))) | |
``` | |
## Complications with PVs: Tables | |
```{r} | |
#At top proformance band | |
M1 <- with(mis, svytable(~I(MATH>707.93), Ntotal=100)) %>% unlist | |
mean(M1[c(T,F)]); mean(M1[c(F,T)]) | |
``` | |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment