// Additional tools for machine learning and predictive analytics in Stata
/*
Author: Jared Knowles
Date: 09/12/2018
Purpose: Survey of some additional code helpful in conducting and explaining
or demonstrating predictive analytics to stakeholders.
You do not need to run all of this code - this is a survey of commands that
tackle different techniques. Pick and choose what might be most useful to you.
*/
// Discriminant Analysis in Stata
// You need to load this up with more variables for it to be informative.
// Discriminant analysis finds the combinations of key variables that best
// differentiate members of different groups - in this case, ontime graduates
// and non-graduates. The lda version fits linear boundaries between groups;
// qda fits quadratic ones.
discrim lda scale_score_7_math sch_g7_lep_per sch_g7_gifted_per pct_days_absent_7 ell_7 iep_7 frpl_7 male, group(ontime_grad)
discrim qda scale_score_7_math sch_g7_lep_per sch_g7_gifted_per pct_days_absent_7 ell_7 iep_7 frpl_7 male, group(ontime_grad)
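// After fitting a discriminant model, Stata's postestimation tools can
// summarize how well the groups separate; a minimal sketch using two official
// post-discrim commands (they apply to the most recent discrim fit):
estat classtable // resubstitution classification table
estat errorrate // estimated classification error rates by group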
/*
Variable importance can be calculated in Stata using standardized regression
coefficients in logistic regression. This allows for more direct comparison of
the magnitude and power of "effect sizes".
*/
// ssc install center if you need it
preserve // we are going to modify the data in place
center scale_score_7_math sch_g7_lep_per sch_g7_gifted_per pct_days_absent_7 ell_7 iep_7 frpl_7 male, inplace standardize
// fit your model
logit ontime_grad scale_score_7_math sch_g7_lep_per sch_g7_gifted_per pct_days_absent_7 ell_7 iep_7 frpl_7 male, noconstant
// we can restore the original data now
restore
// plot the model coefficients (ssc install coefplot if you need it)
coefplot, xline(0) xtitle(Standardized Coefficients)
/*
Techniques like cluster analysis and principal component analysis can be useful
to explore how variables relate to one another. Principal component analysis
(PCA) is a dimension-reduction technique, closely related to factor analysis,
that you can use to explore and graph how variables correlate with one another.
*/
pca scale_score_7_math male sch_g7_lep_per sch_g7_gifted_per
loadingplot
// look at individual observations' scores
scoreplot
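// If you want help deciding how many components are worth interpreting, the
// official post-pca screeplot command plots the eigenvalues:
screeplot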
/*
Setting the probability cutoff for identifying students for additional attention
or "flagging" them in the EWS is a consequential decision. It can be helpful
to create a graph that shows the relationship between the probability threshold
and the percent of students who graduate. This is code you've already used to
explore relationships like this with predictors, adapted now to use the predicted
probabilities.
*/
// Fit an example model, you can substitute your own
logit ontime_grad scale_score_7_math male i.frpl_7 sch_g7_lep_per sch_g7_gifted_per
// predict
predict yhat
// Cut the predicted probability into categorical bins
egen prob_cat = cut(yhat), at(0(0.1)1)
tab prob_cat
// Compute the probability of graduating within each bin
egen prob_ontime_grad = mean(ontime_grad), by(prob_cat)
// plot
scatter prob_ontime_grad prob_cat
/*
This do file provides guidance and Stata syntax examples for the hands-on
predictive analytics session during the Fall 2018 Cohort 9 Strategic Data
Project Workshop in Denver.
During the workshop, we'll ask you to work with stakeholders to develop a dropout
early warning system for 7th grade students for a simulated state education agency
(Montucky) using student data. This guide will provide you with a baseline model to get you
started and introduce you to the main concepts of predictive analytics. From
there, you will work together with stakeholders to incorporate their values
into the model, secure their support and adoption of the model, and practice
an inclusive and iterative design process for building analytics.
As a data analyst, your role in this process is to measure the accuracy of the
model and communicate the tradeoffs of different models in terms of accuracy
and usability. Together with stakeholders, you might find that the most accurate
model does not reflect the values of the end users. Perhaps the most accurate
model requires variables not available for all students, or predicts all students
in certain categories to be at-risk. These are issues you will have to work
through together with stakeholders.
Logistic regression is one tool you can use, and we'll demonstrate it here.
There are many other techniques of increasing complexity. (Many of the best
predictive analytics packages are written in the R programming language.) But
for a binary outcome variable, most data scientists start with logistic
regressions, and those are very straightforward to do in Stata.
Here are the steps:
1. explore the data, especially graduation predictors and outcomes
2. examine the relationship between predictors and outcomes
3. evaluate the predictive power of different variables and select predictors for your model
4. make predictions using logistic regression
5. convert the predicted probabilities into a 0/1 indicator
6. look at the effect of different probability cutoffs on prediction accuracy (develop a "confusion matrix")
This guide will take you through these steps using a baseline model for predicting
graduation using student grade 7 data. During the workshop you will iterate on
this baseline model using the process above and find an improved model that
meets stakeholder needs.
The commands in this script won't tell you everything you need to do to develop
your model, but they will give you command syntax that you should be able to
adjust and adapt to get the project done.
You should also consider non-model approaches like the "checklist" approach
described in the pre-reading created at the Consortium on Chicago School
Research (CCSR). With that "checklist" approach, you experiment with different
thresholds for your predictor variables, and combine them to directly predict
the outcome. This approach has the advantage of being easy to explain and
implement, but it might not yield the most accurate predictions. We won't
demonstrate that approach in full here, but the sketch below and the syntax
examples in this file should give you what you need to develop it.
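As one illustration only, a checklist-style indicator could look like the
sketch below once the data are loaded. The two thresholds - a math scale score
below 35 and more than 10 percent of days absent - are arbitrary placeholders
you would tune by checking them against the outcome:

    gen flag_score = scale_score_7_math < 35 if !missing(scale_score_7_math)
    gen flag_absent = pct_days_absent_7 > 10 if !missing(pct_days_absent_7)
    egen checklist_flags = rowtotal(flag_score flag_absent)
    gen checklist_risk = checklist_flags >= 1
    tab checklist_risk ontime_grad, row mi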
Before you get started, you need to think about variables, time, and datasets.
The sooner in a student's academic trajectory you can make a prediction, the
sooner you can intervene--but the less accurate your predictions, and hence your
intervention targeting, are likely to be. What data, and specifically which
variables, do you have available to make predictions? What outcome are you
trying to predict?
In the case of the workshop, we are limiting our focus to using data available
at the end of grade 7 to predict outcomes at the end of high school. A critical
step in developing a predictive model is to identify the time points that
different measures are collected during the process you are predicting -- you
can't use data from the future to make predictions. If you're planning to use
your model to make predictions for students at the end of 11th grade, for
instance, and if most students take AP classes as seniors, you can't use data
about AP coursetaking collected during senior year to predict the likelihood of
college enrollment, even if you have that data available for past groups of
students.
In terms of datasets, you can develop your model and then test its accuracy on
the dataset you used to develop the model, but that is bad practice--in the real
world, your model is only as good as its predictions on different, out-of-sample
datasets. It's good practice to split your data into three parts: one part for
developing your model, one for repeatedly testing different versions of your
model, and a third to use for a final out-of-sample test, as sketched below.
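For example, a simple random three-way split (the 60/20/20 proportions are
placeholders) could look like this once the data are loaded:

    set seed 20180912
    gen u_split = runiform()
    sort u_split
    gen split_group = 1
    replace split_group = 2 if _n > 0.6 * _N
    replace split_group = 3 if _n > 0.8 * _N
    label define splits 1 "train" 2 "validate" 3 "test"
    label values split_group splits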
We're using multiple cohorts of middle-school students for the predictive analytics
task--students who were seventh graders between 2007 and 2011. These are the
cohorts for which we have access to reliable information on their high school
graduation status (late graduate, ontime graduate, dropout, transferout,
disappeared). For this workshop you are being given the entire set of data so
that you can explore different ways of organizing the data across cohorts.
For example, you may choose to explore your models and data using the earlier
cohorts, and evaluate their performance on more recent cohorts.
One last point -- even though the data is synthetic, we have simulated missing data
for you. In the real world, you'll need to make predictions for every
student, even if you're missing data that your model needs in order to run.
Just making predictions using a logistic regression won't be enough. You'll
need to use decision rules based on good data exploration and your best
judgment to predict and fill in outcomes for students where you have
insufficient data; one example rule is sketched below.
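One minimal example of such a decision rule: fill in a missing model
prediction with the mean outcome for the student's free/reduced price lunch
group. Here `yhat` stands in for whatever prediction variable your model
generates, and the grouping variable is an arbitrary placeholder:

    egen frpl_grad_mean = mean(ontime_grad), by(frpl_7)
    replace yhat = frpl_grad_mean if missing(yhat)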
// Getting Started
If you're using the do file version of these materials, start by saving a new
version of the do file with your initials in the title, so you can edit it
without worrying about overwriting the original. Then work through the do file
in Stata by highlighting one or a few command lines at a time, clicking the
"execute" icon in the toolbar above (or pressing control-D), and then looking at
the results in Stata's results window. Edit or add commands as you wish. If
you're using a paper or PDF version of these materials, just read on--the Stata
output appears below each section of commands.
This guide is built using the data you will be working with at the workshop.
This dataset includes simulated data for multiple cohorts of 7th graders along
with their corresponding high school outcomes. Each observation (row) is a student,
and includes associated information about the student's demographics, academic
performance in 7th grade, associated school and district (in grade 7), last
high school attended, and high school completion.
To work through the do file, you need to put the montucky.dta data file on your
computer desktop or in a working folder of your choice, and then edit the file
path in the use command below to tell Stata where to look for the data. If you
have trouble doing this, ask for help from other members of your group.
*/
// Set up
set more off
set type double
capture log close
// Open a log file. This stores a record of the commands and their output in a
// text file you can review later.
log using "montucky_ews.log", replace
// Load the data.
use data/montucky.dta, clear
// Validate the Data
// Verify that there is exactly one observation per student, and check the total
// number of observations.
isid sid
// Wait, what is the issue here?
tab sid
// Why might our IDs be repeated 45 times? Let's look at how many LEAs we have in
// our SEA dataset:
levelsof sch_g7_lea_id
// We see that our student IDs are not unique statewide - they are only unique
// within LEA. That's an easy enough fix.
isid sid sch_g7_lea_id
count
// Explore the Data
/*
A key initial task in building an EWS is to identify the cohort membership of
students. When we build a predictive model, we need to identify two timepoints -
when we will be making the prediction, and when the outcome we are predicting
will be observed. In this case, we will be making the prediction upon receiving
the 7th grade data on students (so at the completion of 7th grade), and we will
be predicting their completion of high school.
Let's focus first on identifying the 7th grade year for each student. We have
three year variables; what is their relationship?
*/
tab year
tab cohort_year
tab cohort_grad_year
/*
From the data dictionary we know that the first year variable is the year
that corresponds with the student entering 7th grade (`year`). The `cohort_year`
variable defines the 9th grade cohort a student belongs to for measuring ontime
graduation. Finally, `cohort_grad_year` is the year by which a student's
cohort should graduate to be considered "on-time".
If a student graduates, their year of graduation is recorded as `year_of_graduation`.
The definition of graduation types is an important design decision for our
predictive analytic system. We have to set a time by which students must have
graduated, so we know whether to count them as graduates or non-completers. The
state of Montucky uses a 4-year cohort graduation rate for most reporting,
defining ontime graduation as graduation within 4 years of entering high school.
This variable is defined as:
*/
gen year_test = 0
replace year_test = 1 if year_of_graduation == cohort_grad_year
tab year_test
tab ontime_grad
tab ontime_grad year_test
drop year_test
/*
This is an example of a business rule - a restriction we place on the definition
of the data so that we can use the data consistently. In this case, it
is necessary so we can definitively group students for the purposes of predicting
their outcomes. You could consider alternative graduation timelines, for example:
*/
gen year_test = 0
replace year_test = 1 if year_of_graduation <= cohort_grad_year + 1
drop year_test
/*
What does this rule say? How is it different from the definition of on-time
above?
Now that we have a sense of the time structure of the data, let's look at
geography. How many high schools and how many districts are there? What are those
regional education services coops?
We are going to be building a model for an entire state, but stakeholders
may have questions about how the model works for particular schools, districts,
or regions. Let's practice exploring the data by these different geographies.
*/
tab coop_name_g7, mi nolabel
tab first_hs_name
/*
For this exercise, districts in Montucky are organized into cooperative regions.
Cooperative regions are just groups of LEAs. It may be helpful to compare how
our model performs in a given school, LEA, or coop region to the rest of the
data in the state. As an example of this kind of analysis, select a specific
coop region and explore its data, drawing comparisons to the full dataset. Which
districts are part of this coop region and how many students do they have?
Define the `my_coop` local below, and substitute different coop values to
explore different regions.
*/
local my_coop 1 // placeholder - substitute a value from tab coop_name_g7, nolabel above (quote it if the variable is a string)
tab sch_g7_lea_id if coop_name_g7 == `my_coop'
/*
What student subgroups are we interested in? Let's start by looking at student
subgroups. Here's whether a student is male.
*/
tab male, mi
// Here's a shortcut command to look at one-way tabs of a lot of variables at once.
tab1 male race_ethnicity frpl_7 iep_7 gifted_7 ell_7, mi
/*
Let's examine the distribution of student subgroups by geography. For this
command, we'll use Stata's looping syntax, which lets you avoid repetition by
applying commands to multiple variables at once. Loops are awkward to enter
directly into the command window, but they are very powerful in do files. You
can type "help foreach" into the Stata command window if you want to learn more
about how to use loops in Stata.
*/
foreach var of varlist male race_ethnicity frpl_7 iep_7 gifted_7 ell_7 {
tab coop_name_g7 `var', row mi
}
/*
Now, let's look at outcomes. We won't examine them all, but you should. Here's
a high school graduation outcome variable:
*/
tab ontime_grad, mi
/*
Wait! What if the data includes students who transferred out of state? That
might bias the graduation rate and make it look too low, because those seventh
graders might show up as having dropped out.
*/
tab transferout, mi
tab transferout ontime_grad, mi
/*
This is another case where we may want to consider a business rule. How should
students who transfer out be treated? We don't know whether they graduated
or not. Should they be excluded from the analysis? Coded as not completing?
The decision is yours, but it is important to consider all the possible high
school outcomes when building a model and how the model will treat them.
Let's look at the distribution of another outcome variable, `any_grad`, which
includes both late and ontime graduation, by geography and by subgroup.
*/
tab coop_name_g7 any_grad, row mi
foreach var of varlist male race_ethnicity frpl_7 iep_7 gifted_7 ell_7 {
tab any_grad `var', row mi
}
// Let's look at the distribution of this outcome variable by geography and then by subgroup.
tab coop_name_g7 ontime_grad, mi row
foreach var of varlist male race_ethnicity frpl_7 iep_7 ell_7 gifted_7 {
tab `var' ontime_grad, row mi
}
// Review existing indicator
// Now that we are oriented to the data, let's turn our attention to the model
// predictions provided by the vendor. First, let's check the format:
codebook vendor_ews_score
/*
Instead of classifying students, each student receives a predicted probability
of graduating. We need a way to convert this into a classification measure.
One way to do this is to pick a threshold and declare all values greater than
that threshold to be of the positive (graduation) class, and all values
below to be members of the negative class.
*/
gen vendor_grad_class = 0 if !missing(vendor_ews_score)
replace vendor_grad_class = 1 if vendor_ews_score > 0.5 & !missing(vendor_ews_score)
tab ontime_grad vendor_grad_class, cell mi
/*
This matrix tells us, for the threshold we have selected, how many graduates
we identified successfully, how many non-graduates we identified successfully,
and how many mistakes we made in each direction (this is known as a confusion
matrix and will be discussed at length at the workshop!).
How does this compare with the vendor's marketing material and stated accuracy?
*/
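// One way to summarize overall agreement is the share of students whose
// vendor classification matches the actual outcome. A minimal sketch;
// vendor_correct is a new helper variable, not part of the original data.
gen vendor_correct = ontime_grad == vendor_grad_class if !missing(vendor_grad_class)
summ vendor_correct // the mean is the overall accuracy at this threshold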
/*
Now, let's turn to identifying the predictors available for building an alternative
model. Let's examine the performance and behavioral variables that you can
use as predictors. These are mostly numerical variables, so you should use the
summary, histogram, and table commands to explore them. Here's some syntax for
examining 7th grade math scores. You can replicate and edit it to examine other
potential predictors and their distributions by different subgroups.
*/
summ scale_score_7_math, detail
hist scale_score_7_math, width(1)
table coop_name_g7, c(mean scale_score_7_math)
table frpl_7, c(mean scale_score_7_math)
// Finally, here's some sample code you can use to look at missingness patterns
// in the data. The "gen" command is used to generate a new variable.
gen math7_miss = missing(scale_score_7_math)
tab math7_miss
foreach var of varlist coop_name_g7 male race_ethnicity frpl_7 iep_7 gifted_7 ell_7 {
tab `var' math7_miss, mi row
}
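// Stata's built-in misstable command is another quick way to summarize
// missingness across several variables at once; a minimal example:
misstable summarize scale_score_7_math pct_days_absent_7 frpl_7 iep_7 gifted_7 ell_7
misstable patterns scale_score_7_math pct_days_absent_7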
/*
Handling missing values is another case where business rules will come into play.
Did you see any outlier or impossible values while you were exploring the data?
If so, you might want to truncate them or change them to missing. Here's how you
can replace a numeric variable with a missing value if it is larger than a
certain number (in this case, 100 percent).
*/
hist pct_days_absent_7
replace pct_days_absent_7 = . if pct_days_absent_7 > 100
hist pct_days_absent_7
/*
Trimming the data in this way is another example of a business rule. You
may wish to trim the absences even further in this data. You may also wish to
assign a value other than missing to unusual values - such as the
mean or median value.
Now that you've explored the data, you can start to examine the relationship
between predictor and outcome variables. Here we'll continue to look at the high
school graduation outcome, and we'll restrict the predictors to just two: 7th
grade math scores and percent of enrolled days absent through 7th grade. For
your model, you can of course use more and different predictor
variables. First, check the correlation between outcome and predictors.
*/
corr ontime_grad scale_score_7_math pct_days_absent_7
/*
These correlations do not look very promising! But remember, a correlation
may not tell the whole story if other factors are involved, and correlations
between binary variables and continuous variables are noisy indicators.
It would be nice to have a better idea of
the overall relationship between outcomes and predictors.
But you can't make a meaningful scatterplot when the dependent variable - the
y value - is a binary outcome (try it!). Let's look at a technique to identify
the relationship between a continuous variable and a binary outcome.
The idea behind this code is to show the mean of the outcome variable for each
value of the predictor, or for categories of the predictor variable if it has
too many values. First, define categories (in this case, round to the nearest
percentage) of the percent absent variable, and then truncate the variable so that
low-frequency values are grouped together.
*/
egen pct_absent_cat = cut(pct_days_absent_7), at(0(1)100)
tab pct_absent_cat
replace pct_absent_cat = 30 if pct_absent_cat >= 30 & pct_absent_cat != . // guard against missing, which Stata treats as larger than any number
/*
Next, define a variable which is the average ontime graduation rate for each
absence category, and then make a scatter plot of average graduation rates by
absence percent.
*/
egen abs_ontime_grad = mean(ontime_grad), by(pct_absent_cat)
scatter abs_ontime_grad pct_absent_cat
/*
You can do the same thing for 7th grade test scores, this time using the
group() option of the egen cut function to bin scores into 100 quantile groups
rather than defining the categories by hand.
*/
egen math_7_cut = cut(scale_score_7_math), group(100)
egen math_7_ontime_grad = mean(ontime_grad), by(math_7_cut)
scatter math_7_ontime_grad scale_score_7_math
/*
You can see there are some 7th grade math score outliers--if you haven't
already, you might want to set the negative values to missing.
*/
replace scale_score_7_math = . if scale_score_7_math < 0
drop math_7_cut
drop math_7_ontime_grad
egen math_7_cut = cut(scale_score_7_math), group(100)
egen math_7_ontime_grad = mean(ontime_grad), by(math_7_cut)
scatter math_7_ontime_grad scale_score_7_math
/*
Looking at the plot, if you think the relationship between seventh grade math
scores and ontime graduation is more of a curve than a line, you can define
variables for the square and cube of the math scores so that Stata will be able
to fit a polynomial equation to the data instead of a straight line when you
build your model.
*/
gen math_7_squared = scale_score_7_math^2
gen math_7_cubed = scale_score_7_math^3
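// As an alternative sketch, Stata's factor-variable notation can build the
// polynomial terms on the fly instead of storing them as new variables, e.g.:
// logit ontime_grad c.scale_score_7_math##c.scale_score_7_math
// fits the same quadratic model without the generated variables.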
/*
Now we're ready to call on the logit command to examine the relationship between
our binary outcome variable and our predictor variables. When you run a logistic
regression with the logit command, Stata calculates the parameters of an
equation that fits the relationship between the predictor variables and the
outcome. A regression model typically won't be able to explain all of the
variation in an outcome variable--any variation that is left over is treated as
unexplained noise in the data, or error, even if there are additional variables
not in the model which could explain more of the variation. Once you've run a
logit regression, you can have Stata generate a variable with new, predicted
outcomes for each observation in your data with the predict command. The
predictions are calculated using the model equation and ignore the unexplained
noise in the data. For logit regressions, the predicted outcomes take the form
of a probability between 0 and 1. To start with, let's do a regression of
ontime graduation on seventh grade math scores.
*/
logit ontime_grad scale_score_7_math
/*
Even before you use the predict command, you can use the logit output to learn
something about the relationship between the predictor and the outcome variable.
The pseudo R2 (read: R-squared) is a proxy for the share of variation in the
outcome variable that is explained by the predictor. Statisticians don't like it
when you take the pseudo R2 too seriously, but it can be useful in predictive
exercises to quickly get a sense of the explanatory power of variables in a
logit model. Does adding polynomial terms increase the pseudo R2? Not by very
much. Any time you add predictors to a model, the R2 will increase, even if the
variables are fairly meaningless, so it's best to focus on including predictors
that add meaningful explanatory power.
*/
logit ontime_grad scale_score_7_math math_7_squared math_7_cubed
// Now take a look at the R2 for the absence variable. Absence rates have
// almost no explanatory power in 7th grade.
logit ontime_grad pct_days_absent_7
// Let's combine our two predictors. This model has barely any more
// explanatory power than the test-score model alone.
logit ontime_grad pct_days_absent_7 scale_score_7_math
// Now, let's use the predict command. Stata applies the predict command to
// the most recent regression model.
predict model1
/*
This generates a new variable `model1`, which is the predicted probability of
ontime high school graduation, according to the model. But if you look at the
number of observations with predictions, you'll see that it is smaller than the
total number of students. This is because Stata doesn't use observations that
have missing data for any of the variables in the model.
*/
summ model1, detail
count
/*
Let's convert this probability to a 0/1 indicator for whether or not a student
is likely to graduate ontime. A good rule of thumb when starting out is to set
the probability cutoff at the mean of the outcome variable. In this example,
we can store that value in a local macro rather than hardcoding it:
*/
summ ontime_grad // to get the threshold
local cutoff = r(mean)
gen grad_indicator = 0 if model1 < `cutoff' & model1 != .
replace grad_indicator = 1 if model1 >= `cutoff' & model1 != .
tab grad_indicator, mi
// You can also plot the relationship between the probability and the outcome.
// Ideally, you should see the proportion of graduates steadily increase for each
// percentile of the probabilities. What does this relationship tell you?
egen model_prob_rank = cut(model1), group(50)
egen model_prob_ontime_grad = mean(ontime_grad), by(model_prob_rank)
scatter model_prob_ontime_grad model1
/*
Let's evaluate the accuracy of the model by comparing the predictions to the
actual graduation outcomes for the students for whom we have predictions. This
type of crosstab is called a "confusion matrix." The observations in the upper
left corner, where the indicator and the actual outcome are both 0, are true
negatives. The observations in the lower right corner, where the indicator and
the outcome are both 1, are true positives. The upper right corner contains
false positives, and the lower left corner contains false negatives. Overall, if
you add up the cell percentages for true positives and true negatives, the model
got 58 percent of the predictions right.
*/
tab ontime_grad grad_indicator, cell
/*
However, almost all of the wrong predictions are false negatives--these are
students who have been flagged as dropout risks even though they did graduate
ontime. If you want your indicator system to have fewer false negatives, you
can lower the probability cutoff. The cutoff below produces a lower share of
false negatives and a higher share of false positives, with a somewhat lower
share of correct predictions overall.
*/
replace grad_indicator = 0 if model1 < .59445 & model1 != .
replace grad_indicator = 1 if model1 >= .59445 & model1 != .
tab ontime_grad grad_indicator, cell
// Missing Data
/*
Another key business rule is how we will handle students with missing data. A
predictive analytics system is more useful if it makes an actionable prediction
for every student. It is good to check the graduation rates for students with
missing data, where available:
*/
tab ontime_grad if math7_miss == 1
// There are a number of options. One is to run a model with fewer variables for
// only those students, and then use that model to fill in the missing
// indicators.
logit ontime_grad pct_days_absent_7 if math7_miss == 1
predict model2 if math7_miss == 1
summ model2, detail
replace grad_indicator = 0 if model2 < .59445 & model2 != . & model1 == .
replace grad_indicator = 1 if model2 >= .59445 & model2 != . & model1 == .
/*
We now have predictions for all but a very small share of students, and those
students are split between graduates and non-graduates. We have to apply a rule
or a model to make predictions for them--we can't use information from the
future, except to develop the prediction system. We'll arbitrarily decide to
flag them as potential non-graduates, since students with lots of missing data
might merit some extra attention.
*/
tab grad_indicator, mi
replace grad_indicator = 0 if grad_indicator == .
// Evaluate Fit
// Now we have a complete set of predictions from our simple models. How well
// does the prediction system work? Can we do better?
tab ontime_grad grad_indicator, cell
/*
A confusion matrix is one way to evaluate the success of a model and evaluate
tradeoffs as you are developing prediction systems, but there are others. We
will cover these more in the workshop, but in cases where we have an uneven
proportion of cases in each class (e.g. we have many more graduates than
non-graduates), it can be helpful to look at a metric like the AUC, which stands
for "area under the curve." You'll learn more about ways to evaluate a
prediction system, including the AUC metric, during Day 2 of the workshop, but
here's a sneak peek. First, look at row percentages instead of cell percentages
in the confusion matrix.
*/
tab ontime_grad grad_indicator, row
/*
Next, use the "roctab" command to plot the true positive rate (sensitivity in
the graph) against the false positive rate (1-specificity in the graph). You can
see these percentages match the row percentages in the last table. The AUC is
the "area under ROC curve" in this graph, and it is a useful single-number
summary of predictive accuracy.
*/
roctab ontime_grad grad_indicator, graph
/*
Model comparison for different logistic regressions is straightforward in Stata
as well. Fit two models and store their predictions. Let's compare a model that
adds two student demographic categories (FRPL and sex) to math scores with a
model that only includes math scores and attendance.
*/
logit ontime_grad scale_score_7_math male i.frpl_7
predict yhat_1
// plot the roc curve
lroc
logit ontime_grad scale_score_7_math pct_days_absent_7
predict yhat_2
lroc
// this takes a while :-) for speed you can subset your dataset down to a test
// set or a specific year/lea/group
roccomp ontime_grad yhat_1 yhat_2, graph summary
// more details here:
// https://stats.idre.ucla.edu/stata/faq/how-can-i-test-the-difference-in-area-under-roc-curve-for-two-logistic-regression-models/
/*
A couple of last thoughts and notes. First, note that so far we haven't done any
out-of-sample testing. We know from the pre-reading that we should never trust
our model fit measures on data the model was fit to -- statistical models are
overly confident. To combat this, you should subdivide your dataset. There are
many strategies you can choose from depending on how much data you have and the
nature of your problem - for the EWS case, we can use the first two cohorts to
build our models and the latter two cohorts to evaluate that fit.
Here is some code you can use to do that:
*/
local split = 15000
local train = "1/`=`split'-1'"
local test = "`split'/`=_N'"
logit ontime_grad pct_days_absent_7 scale_score_7_read in `train'
predict grad3
summ grad3
gen grad_indicator3 = 0
replace grad_indicator3 = 1 if grad3 >= 0.5994 & grad3 != .
tab ontime_grad grad_indicator3 in `test', cell mi
// You can use the classtabi routine (ssc install classtabi) to get more details.
// You type in the cell counts from left to right (top left, top right, bottom
// left ...). Due to dataset variation, your exact counts may vary; modify as
// needed.
classtabi 4314 32699 5873 61351
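// Rather than reading the counts off the screen, you can capture the confusion
// matrix cells with tab's matcell() option and list them in classtabi order
// (top left, top right, bottom left, bottom right); a minimal sketch:
tab ontime_grad grad_indicator3 in `test', matcell(cellcounts)
matrix list cellcounts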
/*
You can also split your data by time cohorts, which is a good idea - fit your
model on older data and see how it performs on more recent data:
*/
logit ontime_grad pct_days_absent_7 scale_score_7_read if year < 2006
predict grad4
summ grad4
gen grad_indicator4 = 0
replace grad_indicator4 = 1 if grad4 >= 0.5994 & grad4 != .
tab ontime_grad grad_indicator4 if year > 2005, cell mi
/*
Second, should we use subgroup membership variables (such as demographics or
school of enrollment) to make predictions, if they improve the accuracy of
predictions? This is more a policy question than a technical question, and you
should consider it when you are developing your models. You'll also want to
check how accurate your model is for different subgroups.
*/
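// A minimal sketch of a subgroup accuracy check, assuming the yhat_2
// predictions generated above; adjust the quoting in the if condition if
// race_ethnicity is numeric rather than string in your data.
levelsof race_ethnicity, local(groups)
foreach g of local groups {
display "AUC for race_ethnicity == `g'"
roctab ontime_grad yhat_2 if race_ethnicity == "`g'"
}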
// Bonus
// Warning: Here be dragons
// Some less tested code to try some more exotic modeling strategies using Stata
// Set up Stata to be memory efficient
set matsize 11000
set emptycells drop
// Install svmachines
// Uncomment the next two lines to install support vector machines in Stata
// net sj 16-4
// net install st0461
// Data preparation
// svmachines is more finicky about data than logit; it needs data types clearly
// declared in advance and cannot handle any missing values at all
drop if pct_days_absent_7 == .
encode male, gen(male_fac)
encode race_ethnicity, gen(race_fac)
encode iep_7, gen(iep_fac)
encode gifted_7, gen(gifted_fac)
drop if male_fac == .
drop if race_fac == .
set seed 9876
// shuffle the data
generate u = runiform()
sort u
// before the actual train and test split:
local split = 15000 // restrict the sample to something Stata can compute quickly
local train = "1/`=`split'-1'"
local test = "`split'/`=_N'"
// recode your outcome variable to be the right type for svmachines
tempvar B
generate byte `B' = ontime_grad
drop ontime_grad
rename `B' ontime_grad
// Get the svmachines model fit using multiple predictors
svmachines ontime_grad scale_score_7_math scale_score_7_read i.frpl_7 male_fac race_fac iep_fac gifted_fac sch_g7_frpl_per in `train', verbose probability
// generate predictions on the test, not training, data
predict P in `test'
// Calculate the prediction error
generate err = ontime_grad != P in `test'
// summarize the prediction error
summarize err in `test'
tab P ontime_grad in `test'
// Generate predicted probabilities
predict P_prob in `test', prob
// calculate an ROC graph for the predicted probabilities
roctab ontime_grad P_prob in `test', graph
// Calculate all accuracy metrics from the confusion matrix using
// classtabi
// Note: due to random sampling, your numbers may need to be modified
tab P ontime_grad in `test'
classtabi 11790 7654 25223 59570