// Additional tools for machine learning and predictive analytics in Stata
/*
Author: Jared Knowles
Date: 09/12/2018
Purpose: Survey of some additional code helpful in conducting and explaining
or demonstrating predictive analytics to stakeholders.
You do not need to run all of this code - this is a survey of commands that
tackle different techniques. Pick and choose what might be most useful to you.
*/
// Discriminant Analysis in Stata
// You need to load this up with more variables for it to be informative.
// Discriminant analysis finds the combinations of key variables that best
// differentiate members of different groups - in this case, ontime graduates
// and non-graduates. The lda version fits linear boundaries between groups;
// qda fits quadratic ones.
discrim lda scale_score_7_math sch_g7_lep_per sch_g7_gifted_per pct_days_absent_7 ell_7 iep_7 frpl_7 male, group(ontime_grad)
discrim qda scale_score_7_math sch_g7_lep_per sch_g7_gifted_per pct_days_absent_7 ell_7 iep_7 frpl_7 male, group(ontime_grad)
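// After fitting a discriminant model, Stata's postestimation tools can
// summarize how well the groups separate; a minimal sketch using two official
// post-discrim commands (they apply to the most recent discrim fit):
estat classtable // resubstitution classification table
estat errorrate // estimated classification error rates by group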
/*
Variable importance can be calculated in Stata using standardized regression
coefficients in logistic regression. This allows for more direct comparison of
the magnitude and power of "effect sizes".
*/
// ssc install center if you need it
preserve // we are going to modify the data in place
center scale_score_7_math sch_g7_lep_per sch_g7_gifted_per pct_days_absent_7 ell_7 iep_7 frpl_7 male, inplace standardize
// fit your model
logit ontime_grad scale_score_7_math sch_g7_lep_per sch_g7_gifted_per pct_days_absent_7 ell_7 iep_7 frpl_7 male, noconstant
// we can restore the original data now
restore
// plot the model coefficients (ssc install coefplot if you need it)
coefplot, xline(0) xtitle(Standardized Coefficients)
/*
Techniques like cluster analysis and principal component analysis can be useful
to explore how variables relate to one another. Principal component analysis
(PCA) is a dimension-reduction technique, closely related to factor analysis,
that you can use to explore and graph how variables correlate with one another.
*/
pca scale_score_7_math male sch_g7_lep_per sch_g7_gifted_per
loadingplot
// look at individual observations' scores
scoreplot
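// If you want help deciding how many components are worth interpreting, the
// official post-pca screeplot command plots the eigenvalues:
screeplot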
/*
Setting the probability cutoff for identifying students for additional attention
or "flagging" them in the EWS is a consequential decision. It can be helpful
to create a graph that shows the relationship between the probability threshold
and the percent of students who graduate. This is code you've already used to
explore relationships like this with predictors, adapted now to use the predicted
probabilities.
*/
// Fit an example model, you can substitute your own
logit ontime_grad scale_score_7_math male i.frpl_7 sch_g7_lep_per sch_g7_gifted_per
// predict
predict yhat
// Cut the predicted probability into categorical bins
egen prob_cat = cut(yhat), at(0(0.1)1)
tab prob_cat
// Compute the probability of graduating within each bin
egen prob_ontime_grad = mean(ontime_grad), by(prob_cat)
// plot
scatter prob_ontime_grad prob_cat
/*
This do file provides guidance and Stata syntax examples for the hands-on
predictive analytics session during the Fall 2018 Cohort 9 Strategic Data
Project Workshop in Denver.
During the workshop, we'll ask you to work with stakeholders to develop a dropout
early warning system for 7th grade students for a simulated state education agency
(Montucky) using student data. This guide will provide you with a baseline model to get you
started and introduce you to the main concepts of predictive analytics. From
there, you will work together with stakeholders to incorporate their values
into the model, secure their support and adoption of the model, and practice
an inclusive and iterative design process for building analytics.
As a data analyst, your role in this process is to measure the accuracy of the
model and communicate the tradeoffs of different models in terms of accuracy
and usability. Together with stakeholders, you might find that the most accurate
model does not reflect the values of the end users. Perhaps the most accurate
model requires variables not available for all students, or predicts all students
in certain categories to be at-risk. These are issues you will have to work
through together with stakeholders.
Logistic regression is one tool you can use, and we'll demonstrate it here.
There are many other techniques of increasing complexity. (Many of the best
predictive analytics packages are written in the R programming language.) But
for a binary outcome variable, most data scientists start with logistic
regressions, and those are very straightforward to do in Stata.
Here are the steps:
1. explore the data, especially graduation predictors and outcomes
2. examine the relationship between predictors and outcomes
3. evaluate the predictive power of different variables and select predictors for your model
4. make predictions using logistic regression
5. convert the predicted probabilities into a 0/1 indicator
6. look at the effect of different probability cutoffs on prediction accuracy (develop a "confusion matrix")
This guide will take you through these steps using a baseline model for predicting
graduation using student grade 7 data. During the workshop you will iterate on
this baseline model using the process above and find an improved model that
meets stakeholder needs.
The commands in this script won't tell you everything you need to do to develop
your model, but they will give you command syntax that you should be able to
adjust and adapt to get the project done.
You should also consider non-model approaches like the "checklist" approach
described in the pre-reading created at the Consortium on Chicago School
Research (CCSR). With that "checklist" approach, you experiment with different
thresholds for your predictor variables, and combine them to directly predict
the outcome. This approach has the advantage of being easy to explain and
implement, but it might not yield the most accurate predictions. We won't
demonstrate that approach in full here, but the sketch below and the syntax
examples in this file should give you what you need to develop it.
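As one illustration only, a checklist-style indicator could look like the
sketch below once the data are loaded. The two thresholds - a math scale score
below 35 and more than 10 percent of days absent - are arbitrary placeholders
you would tune by checking them against the outcome:

    gen flag_score = scale_score_7_math < 35 if !missing(scale_score_7_math)
    gen flag_absent = pct_days_absent_7 > 10 if !missing(pct_days_absent_7)
    egen checklist_flags = rowtotal(flag_score flag_absent)
    gen checklist_risk = checklist_flags >= 1
    tab checklist_risk ontime_grad, row mi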
Before you get started, you need to think about variables, time, and datasets.
The sooner in a student's academic trajectory you can make a prediction, the
sooner you can intervene--but the less accurate your predictions, and hence your
intervention targeting, are likely to be. What data, and specifically which
variables, do you have available to make predictions? What outcome are you
trying to predict?
In the case of the workshop, we are limiting our focus to using data available
at the end of grade 7 to predict outcomes at the end of high school. A critical
step in developing a predictive model is to identify the time points that
different measures are collected during the process you are predicting -- you
can't use data from the future to make predictions. If you're planning to use
your model to make predictions for students at the end of 11th grade, for
instance, and if most students take AP classes as seniors, you can't use data
about AP coursetaking collected during senior year to predict the likelihood of
college enrollment, even if you have that data available for past groups of
students.
In terms of datasets, you can develop your model and then test its accuracy on
the dataset you used to develop the model, but that is bad practice--in the real
world, your model is only as good as its predictions on different, out-of-sample
datasets. It's good practice to split your data into three parts: one part for
developing your model, one for repeatedly testing different versions of your
model, and a third to use for a final out-of-sample test, as sketched below.
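For example, a simple random three-way split (the 60/20/20 proportions are
placeholders) could look like this once the data are loaded:

    set seed 20180912
    gen u_split = runiform()
    sort u_split
    gen split_group = 1
    replace split_group = 2 if _n > 0.6 * _N
    replace split_group = 3 if _n > 0.8 * _N
    label define splits 1 "train" 2 "validate" 3 "test"
    label values split_group splits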
We're using multiple cohorts of middle-school students for the predictive analytics
task--students who were seventh graders between 2007 and 2011. These are the
cohorts for which we have access to reliable information on their high school
graduation status (late graduate, ontime graduate, dropout, transferout,
disappeared). For this workshop you are being given the entire set of data so
that you can explore different ways of organizing the data across cohorts.
For example, you may choose to explore your models and data using the earlier
cohorts, and evaluate their performance on more recent cohorts.
One last point -- even though the data is synthetic, we have simulated missing data
for you. In the real world, you'll need to make predictions for every
student, even if you're missing data that your model needs in order to run.
Just making predictions using a logistic regression won't be enough. You'll
need to use decision rules based on good data exploration and your best
judgment to predict and fill in outcomes for students where you have
insufficient data; one example rule is sketched below.
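One minimal example of such a decision rule: fill in a missing model
prediction with the mean outcome for the student's free/reduced price lunch
group. Here `yhat` stands in for whatever prediction variable your model
generates, and the grouping variable is an arbitrary placeholder:

    egen frpl_grad_mean = mean(ontime_grad), by(frpl_7)
    replace yhat = frpl_grad_mean if missing(yhat)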
// Getting Started
If you're using the do file version of these materials, start by saving a new
version of the do file with your initials in the title, so you can edit it
without worrying about overwriting the original. Then work through the do file
in Stata by highlighting one or a few command lines at a time, clicking the
"execute" icon in the toolbar above (or pressing control-D), and then looking at
the results in Stata's results window. Edit or add commands as you wish. If
you're using a paper or PDF version of these materials, just read on--the Stata
output appears below each section of commands.
This guide is built using the data you will be working with at the workshop.
This dataset includes simulated data for multiple cohorts of 7th graders along
with their corresponding high school outcomes. Each observation (row) is a student,
and includes associated information about the student's demographics, academic
performance in 7th grade, associated school and district (in grade 7), last
high school attended, and high school completion.
To work through the do file, you need to put the montucky.dta data file on your
computer desktop or in a working folder of your choice, and then edit the file
path in the use command below to tell Stata where to look for the data. If you
have trouble doing this, ask for help from other members of your group.
*/
// Set up
set more off
set type double
capture log close
// Open a log file. This stores a record of the commands and their output in a
// text file you can review later.
log using "montucky_ews.log", replace
// Load the data.
use data/montucky.dta, clear
// Validate the Data
// Verify that there is exactly one observation per student, and check the total
// number of observations.
isid sid
// Wait, what is the issue here?
tab sid
// Why might our IDs be repeated 45 times? Let's look at how many LEAs we have in
// our SEA dataset:
levelsof sch_g7_lea_id
// We see that our student IDs are not unique statewide - they are only unique
// within LEA. That's an easy enough fix.
isid sid sch_g7_lea_id
count
// Explore the Data
/*
A key initial task in building an EWS is to identify the cohort membership of
students. When we build a predictive model, we need to identify two timepoints -
when we will be making the prediction, and when the outcome we are predicting
will be observed. In this case, we will be making the prediction upon receiving
the 7th grade data on students (so at the completion of 7th grade), and we will
be predicting their completion of high school.
Let's focus first on identifying the 7th grade year for each student. We have
three year variables; what is their relationship?
*/
tab year
tab cohort_year
tab cohort_grad_year
/*
From the data dictionary we know that the first year variable is the year
that corresponds with the student entering 7th grade (`year`). The `cohort_year`
variable defines the 9th grade cohort a student belongs to for measuring ontime
graduation. Finally, `cohort_grad_year` is the year by which a student's
cohort should graduate to be considered "on-time".
If a student graduates, their year of graduation is recorded as `year_of_graduation`.
The definition of graduation types is an important design decision for our
predictive analytic system. We have to set a time by which students must have
graduated, so we know whether to count them as graduates or non-completers. The
state of Montucky uses a 4-year cohort graduation rate for most reporting,
defining ontime graduation as graduation within 4 years of entering high school.
This variable is defined as:
*/
gen year_test = 0
replace year_test = 1 if year_of_graduation == cohort_grad_year
tab year_test
tab ontime_grad
tab ontime_grad year_test
drop year_test
/*
This is an example of a business rule - a restriction we place on the definition
of the data so that we can use the data consistently. In this case, it
is necessary so we can definitively group students for the purposes of predicting
their outcomes. You could consider alternative graduation timelines, for example:
*/
gen year_test = 0
replace year_test = 1 if year_of_graduation <= cohort_grad_year + 1
drop year_test
/*
What does this rule say? How is it different from the definition of on-time
above?
Now that we have a sense of the time structure of the data, let's look at
geography. How many high schools and how many districts are there? What are those
regional education services coops?
We are going to be building a model for an entire state, but stakeholders
may have questions about how the model works for particular schools, districts,
or regions. Let's practice exploring the data by these different geographies.
*/
tab coop_name_g7, mi nolabel
tab first_hs_name
/*
For this exercise, districts in Montucky are organized into cooperative regions.
Cooperative regions are just groups of LEAs. It may be helpful to compare how
our model performs in a given school, LEA, or coop region to the rest of the
data in the state. As an example of this kind of analysis, select a specific
coop region and explore its data, drawing comparisons to the full dataset. Which
districts are part of this coop region and how many students do they have?
Define the `my_coop` local below, and substitute different coop values to
explore different regions.
*/
local my_coop 1 // placeholder - substitute a value from tab coop_name_g7, nolabel above (quote it if the variable is a string)
tab sch_g7_lea_id if coop_name_g7 == `my_coop'
/*
What student subgroups are we interested in? Let's start by looking at student
subgroups. Here's whether a student is male.
*/
tab male, mi
// Here's a shortcut command to look at one-way tabs of a lot of variables at once.
tab1 male race_ethnicity frpl_7 iep_7 gifted_7 ell_7, mi
/*
Let's examine the distribution of student subgroups by geography. For this
command, we'll use Stata's looping syntax, which lets you avoid repetition by
applying commands to multiple variables at once. Loops are awkward to enter
directly into the command window, but they are very powerful in do files. You
can type "help foreach" into the Stata command window if you want to learn more
about how to use loops in Stata.
*/
foreach var of varlist male race_ethnicity frpl_7 iep_7 gifted_7 ell_7 {
tab coop_name_g7 `var', row mi
}
/*
Now, let's look at outcomes. We won't examine them all, but you should. Here's
a high school graduation outcome variable:
*/
tab ontime_grad, mi
/*
Wait! What if the data includes students who transferred out of state? That
might bias the graduation rate and make it look too low, because those seventh
graders might show up as having dropped out.
*/
tab transferout, mi
tab transferout ontime_grad, mi
/*
This is another case where we may want to consider a business rule. How should
students who transfer out be treated? We don't know whether they graduated
or not. Should they be excluded from the analysis? Coded as not completing?
The decision is yours, but it is important to consider all the possible high
school outcomes when building a model and how the model will treat them.
Let's look at the distribution of another outcome variable, `any_grad`, which
includes both late and ontime graduation, by geography and by subgroup.
*/
tab coop_name_g7 any_grad, row mi
foreach var of varlist male race_ethnicity frpl_7 iep_7 gifted_7 ell_7 {
tab any_grad `var', row mi
}
// Let's look at the distribution of this outcome variable by geography and then by subgroup.
tab coop_name_g7 ontime_grad, mi row
foreach var of varlist male race_ethnicity frpl_7 iep_7 ell_7 gifted_7 {
tab `var' ontime_grad, row mi
}
// Review existing indicator
// Now that we are oriented to the data, let's turn our attention to the model
// predictions provided by the vendor. First, let's check the format:
codebook vendor_ews_score
/*
Instead of classifying students, each student receives a predicted probability
of graduating. We need a way to convert this into a classification measure.
One way to do this is to pick a threshold and declare all values greater than
that threshold to be of the positive (graduation) class, and all values
below to be members of the negative class.
*/
gen vendor_grad_class = 0 if !missing(vendor_ews_score)
replace vendor_grad_class = 1 if vendor_ews_score > 0.5 & !missing(vendor_ews_score)
tab ontime_grad vendor_grad_class, cell mi
/*
This matrix tells us, for the threshold we have selected, how many graduates
we identified successfully, how many non-graduates we identified successfully,
and how many mistakes we made in each direction (this is known as a confusion
matrix and will be discussed at length at the workshop!).
How does this compare with the vendor's marketing material and stated accuracy?
*/
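// One way to summarize overall agreement is the share of students whose
// vendor classification matches the actual outcome. A minimal sketch;
// vendor_correct is a new helper variable, not part of the original data.
gen vendor_correct = ontime_grad == vendor_grad_class if !missing(vendor_grad_class)
summ vendor_correct // the mean is the overall accuracy at this threshold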
/*
Now, let's turn to identifying the predictors available for building an alternative
model. Let's examine the performance and behavioral variables that you can
use as predictors. These are mostly numerical variables, so you should use the
summary, histogram, and table commands to explore them. Here's some syntax for
examining 7th grade math scores. You can replicate and edit it to examine other
potential predictors and their distributions by different subgroups.
*/
summ scale_score_7_math, detail
hist scale_score_7_math, width(1)
table coop_name_g7, c(mean scale_score_7_math)
table frpl_7, c(mean scale_score_7_math)
// Finally, here's some sample code you can use to look at missingness patterns
// in the data. The "gen" command is used to generate a new variable.
gen math7_miss = missing(scale_score_7_math)
tab math7_miss
foreach var of varlist coop_name_g7 male race_ethnicity frpl_7 iep_7 gifted_7 ell_7 {
tab `var' math7_miss, mi row
}
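// Stata's built-in misstable command is another quick way to summarize
// missingness across several variables at once; a minimal example:
misstable summarize scale_score_7_math pct_days_absent_7 frpl_7 iep_7 gifted_7 ell_7
misstable patterns scale_score_7_math pct_days_absent_7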
/*
Handling missing values is another case where business rules will come into play.
Did you see any outlier or impossible values while you were exploring the data?
If so, you might want to truncate them or change them to missing. Here's how you
can replace a numeric variable with a missing value if it is larger than a
certain number (in this case, 100 percent).
*/
hist pct_days_absent_7
replace pct_days_absent_7 = . if pct_days_absent_7 > 100
hist pct_days_absent_7
/*
Trimming the data in this way is another example of a business rule. You
may wish to trim the absences even further in this data. You may also wish to
assign a value other than missing to unusual values - such as the
mean or median value.
Now that you've explored the data, you can start to examine the relationship
between predictor and outcome variables. Here we'll continue to look at the high
school graduation outcome, and we'll restrict the predictors to just two: 7th
grade math scores and percent of enrolled days absent through 7th grade. For
your model, you can of course use more and different predictor
variables. First, check the correlation between outcome and predictors.
*/
corr ontime_grad scale_score_7_math pct_days_absent_7
/*
These correlations do not look very promising! But remember, a correlation
may not tell the whole story if other factors are involved, and correlations
between binary variables and continuous variables are noisy indicators.
It would be nice to have a better idea of
the overall relationship between outcomes and predictors.
But you can't make a meaningful scatterplot when the dependent variable - the
y value - is a binary outcome (try it!). Let's look at a technique to identify
the relationship between a continuous variable and a binary outcome.
The idea behind this code is to show the mean of the outcome variable for each
value of the predictor, or for categories of the predictor variable if it has
too many values. First, define categories (in this case, round to the nearest
percentage) of the percent absent variable, and then truncate the variable so that
low-frequency values are grouped together.
*/
egen pct_absent_cat = cut(pct_days_absent_7), at(0(1)100)
tab pct_absent_cat
replace pct_absent_cat = 30 if pct_absent_cat >= 30 & pct_absent_cat != . // guard against missing, which Stata treats as larger than any number
/*
Next, define a variable which is the average ontime graduation rate for each
absence category, and then make a scatter plot of average graduation rates by
absence percent.
*/
egen abs_ontime_grad = mean(ontime_grad), by(pct_absent_cat)
scatter abs_ontime_grad pct_absent_cat
/*
You can do the same thing for 7th grade test scores, this time using the
group() option of the egen cut function to bin scores into 100 quantile groups
rather than defining the categories by hand.
*/
egen math_7_cut = cut(scale_score_7_math), group(100)
egen math_7_ontime_grad = mean(ontime_grad), by(math_7_cut)
scatter math_7_ontime_grad scale_score_7_math
/*
You can see there are some 7th grade math score outliers--if you haven't
already, you might want to set the negative values to missing.
*/
replace scale_score_7_math = . if scale_score_7_math < 0
drop math_7_cut
drop math_7_ontime_grad
egen math_7_cut = cut(scale_score_7_math), group(100)
egen math_7_ontime_grad = mean(ontime_grad), by(math_7_cut)
scatter math_7_ontime_grad scale_score_7_math
/*
Looking at the plot, if you think the relationship between seventh grade math
scores and ontime graduation is more of a curve than a line, you can define
variables for the square and cube of the math scores so that Stata will be able
to fit a polynomial equation to the data instead of a straight line when you
build your model.
*/
gen math_7_squared = scale_score_7_math^2
gen math_7_cubed = scale_score_7_math^3
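// As an alternative sketch, Stata's factor-variable notation can build the
// polynomial terms on the fly instead of storing them as new variables, e.g.:
// logit ontime_grad c.scale_score_7_math##c.scale_score_7_math
// fits the same quadratic model without the generated variables.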
/*
Now we're ready to call on the logit command to examine the relationship between
our binary outcome variable and our predictor variables. When you run a logistic
regression with the logit command, Stata calculates the parameters of an
equation that fits the relationship between the predictor variables and the
outcome. A regression model typically won't be able to explain all of the
variation in an outcome variable--any variation that is left over is treated as
unexplained noise in the data, or error, even if there are additional variables
not in the model which could explain more of the variation. Once you've run a
logit regression, you can have Stata generate a variable with new, predicted
outcomes for each observation in your data with the predict command. The
predictions are calculated using the model equation and ignore the unexplained
noise in the data. For logit regressions, the predicted outcomes take the form
of a probability between 0 and 1. To start with, let's do a regression of
ontime graduation on seventh grade math scores.
*/
logit ontime_grad scale_score_7_math
/*
Even before you use the predict command, you can use the logit output to learn
something about the relationship between the predictor and the outcome variable.
The pseudo R2 (read: R-squared) is a proxy for the share of variation in the
outcome variable that is explained by the predictor. Statisticians don't like it
when you take the pseudo R2 too seriously, but it can be useful in predictive
exercises to quickly get a sense of the explanatory power of variables in a
logit model. Does adding polynomial terms increase the pseudo R2? Not by very
much. Any time you add predictors to a model, the R2 will increase, even if the
variables are fairly meaningless, so it's best to focus on including predictors
that add meaningful explanatory power.
*/
logit ontime_grad scale_score_7_math math_7_squared math_7_cubed
// Now take a look at the R2 for the absence variable. Absence rates have
// almost no explanatory power in 7th grade.
logit ontime_grad pct_days_absent_7
// Let's combine our two predictors. This model has barely any more
// explanatory power than the test-score model alone.
logit ontime_grad pct_days_absent_7 scale_score_7_math
// Now, let's use the predict command. Stata applies the predict command to
// the most recent regression model.
predict model1
/*
This generates a new variable `model1`, which is the predicted probability of
ontime high school graduation, according to the model. But if you look at the
number of observations with predictions, you'll see that it is smaller than the
total number of students. This is because Stata doesn't use observations that
have missing data for any of the variables in the model.
*/
summ model1, detail
count
/*
Let's convert this probability to a 0/1 indicator for whether or not a student
is likely to graduate ontime. A good rule of thumb when starting out is to set
the probability cutoff at the mean of the outcome variable. In this example,
we can store that value in a local macro rather than hardcoding it:
*/
summ ontime_grad // to get the threshold
local cutoff = r(mean)
gen grad_indicator = 0 if model1 < `cutoff' & model1 != .
replace grad_indicator = 1 if model1 >= `cutoff' & model1 != .
tab grad_indicator, mi
// You can also plot the relationship between the probability and the outcome.
// Ideally, you should see the proportion of graduates steadily increase for each
// percentile of the probabilities. What does this relationship tell you?
egen model_prob_rank = cut(model1), group(50)
egen model_prob_ontime_grad = mean(ontime_grad), by(model_prob_rank)
scatter model_prob_ontime_grad model1
/*
Let's evaluate the accuracy of the model by comparing the predictions to the
actual graduation outcomes for the students for whom we have predictions. This
type of crosstab is called a "confusion matrix." The observations in the upper
left corner, where the indicator and the actual outcome are both 0, are true
negatives. The observations in the lower right corner, where the indicator and
the outcome are both 1, are true positives. The upper right corner contains
false positives, and the lower left corner contains false negatives. Overall, if
you add up the cell percentages for true positives and true negatives, the model
got 58 percent of the predictions right.
*/
tab ontime_grad grad_indicator, cell
/*
However, almost all of the wrong predictions are false negatives--these are
students who have been flagged as dropout risks even though they did graduate
ontime. If you want your indicator system to have fewer false negatives, you
can lower the probability cutoff. The cutoff below produces a lower share of
false negatives and a higher share of false positives, with a somewhat lower
share of correct predictions overall.
*/
replace grad_indicator = 0 if model1 < .59445 & model1 != .
replace grad_indicator = 1 if model1 >= .59445 & model1 != .
tab ontime_grad grad_indicator, cell
// Missing Data
/*
Another key business rule is how we will handle students with missing data. A
predictive analytics system is more useful if it makes an actionable prediction
for every student. It is good to check the graduation rates for students with
missing data, where available:
*/
tab ontime_grad if math7_miss == 1
// There are a number of options. One is to run a model with fewer variables for
// only those students, and then use that model to fill in the missing
// indicators.
logit ontime_grad pct_days_absent_7 if math7_miss == 1
predict model2 if math7_miss == 1
summ model2, detail
replace grad_indicator = 0 if model2 < .59445 & model2 != . & model1 == .
replace grad_indicator = 1 if model2 >= .59445 & model2 != . & model1 == .
/*
We now have predictions for all but a very small share of students, and those
students are split between graduates and non-graduates. We have to apply a rule
or a model to make predictions for them--we can't use information from the
future, except to develop the prediction system. We'll arbitrarily decide to
flag them as potential non-graduates, since students with lots of missing data
might merit some extra attention.
*/
tab grad_indicator, mi
replace grad_indicator = 0 if grad_indicator == .
// Evaluate Fit
// Now we have a complete set of predictions from our simple models. How well
// does the prediction system work? Can we do better?
tab ontime_grad grad_indicator, cell
/*
A confusion matrix is one way to evaluate the success of a model and evaluate
tradeoffs as you are developing prediction systems, but there are others. We
will cover these more in the workshop, but in cases where we have an uneven
proportion of cases in each class (e.g. we have many more graduates than
non-graduates), it can be helpful to look at a metric like the AUC, which stands
for "area under the curve." You'll learn more about ways to evaluate a
prediction system, including the AUC metric, during Day 2 of the workshop, but
here's a sneak peek. First, look at row percentages instead of cell percentages
in the confusion matrix.
*/
tab ontime_grad grad_indicator, row
/*
Next, use the "roctab" command to plot the true positive rate (sensitivity in
the graph) against the false positive rate (1-specificity in the graph). You can
see these percentages match the row percentages in the last table. The AUC is
the "area under ROC curve" in this graph, and it is a useful single-number
summary of predictive accuracy.
*/
roctab ontime_grad grad_indicator, graph
/*
Model comparison for different logistic regressions is straightforward in Stata
as well. Fit two models and store their predictions. Let's compare a model that
adds two student demographic categories (FRPL and sex) to math scores with a
model that only includes math scores and attendance.
*/
logit ontime_grad scale_score_7_math male i.frpl_7
predict yhat_1
// plot the roc curve
lroc
logit ontime_grad scale_score_7_math pct_days_absent_7
predict yhat_2
lroc
// this takes a while :-) for speed you can subset your dataset down to a test
// set or a specific year/lea/group
roccomp ontime_grad yhat_1 yhat_2, graph summary
// more details here:
// https://stats.idre.ucla.edu/stata/faq/how-can-i-test-the-difference-in-area-under-roc-curve-for-two-logistic-regression-models/
/*
A couple of last thoughts and notes. First, note that so far we haven't done any
out-of-sample testing. We know from the pre-reading that we should never trust
our model fit measures on data the model was fit to -- statistical models are
overly confident. To combat this, you should subdivide your dataset. There are
many strategies you can choose from depending on how much data you have and the
nature of your problem - for the EWS case, we can use the first two cohorts to
build our models and the latter two cohorts to evaluate that fit.
Here is some code you can use to do that:
*/
local split = 15000
local train = "1/`=`split'-1'"
local test = "`split'/`=_N'"
logit ontime_grad pct_days_absent_7 scale_score_7_read in `train'
predict grad3
summ grad3
gen grad_indicator3 = 0
replace grad_indicator3 = 1 if grad3 >= 0.5994 & grad3 != .
tab ontime_grad grad_indicator3 in `test', cell mi
// You can use the classtabi routine (ssc install classtabi) to get more details.
// You type in the cell counts from left to right (top left, top right, bottom
// left ...). Due to dataset variation, your exact counts may vary; modify as
// needed.
classtabi 4314 32699 5873 61351
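// Rather than reading the counts off the screen, you can capture the confusion
// matrix cells with tab's matcell() option and list them in classtabi order
// (top left, top right, bottom left, bottom right); a minimal sketch:
tab ontime_grad grad_indicator3 in `test', matcell(cellcounts)
matrix list cellcounts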
/*
You can also split your data by time cohorts, which is a good idea - fit your
model on older data and see how it performs on more recent data:
*/
logit ontime_grad pct_days_absent_7 scale_score_7_read if year < 2006
predict grad4
summ grad4
gen grad_indicator4 = 0
replace grad_indicator4 = 1 if grad4 >= 0.5994 & grad4 != .
tab ontime_grad grad_indicator4 if year > 2005, cell mi
/*
Second, should we use subgroup membership variables (such as demographics or
school of enrollment) to make predictions, if they improve the accuracy of
predictions? This is more a policy question than a technical question, and you
should consider it when you are developing your models. You'll also want to
check how accurate your model is for different subgroups.
*/
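// A minimal sketch of a subgroup accuracy check, assuming the yhat_2
// predictions generated above; adjust the quoting in the if condition if
// race_ethnicity is numeric rather than string in your data.
levelsof race_ethnicity, local(groups)
foreach g of local groups {
display "AUC for race_ethnicity == `g'"
roctab ontime_grad yhat_2 if race_ethnicity == "`g'"
}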
// Bonus
// Warning: Here be dragons
// Some less tested code to try some more exotic modeling strategies using Stata
// Set up Stata to be memory efficient
set matsize 11000
set emptycells drop
// Install svmachines
// Uncomment the next two lines to install support vector machines in Stata
// net sj 16-4
// net install st0461
// Data preparation
// svmachines is more finicky about data than logit; it needs data types clearly
// declared in advance and cannot handle any missing values at all
drop if pct_days_absent_7 == .
encode male, gen(male_fac)
encode race_ethnicity, gen(race_fac)
encode iep_7, gen(iep_fac)
encode gifted_7, gen(gifted_fac)
drop if male_fac == .
drop if race_fac == .
set seed 9876
// shuffle the data
generate u = runiform()
sort u
// before the actual train and test split:
local split = 15000 // restrict the sample to something Stata can compute quickly
local train = "1/`=`split'-1'"
local test = "`split'/`=_N'"
// recode your outcome variable to be the right type for svmachines
tempvar B
generate byte `B' = ontime_grad
drop ontime_grad
rename `B' ontime_grad
// Get the svmachines model fit using multiple predictors
svmachines ontime_grad scale_score_7_math scale_score_7_read i.frpl_7 male_fac race_fac iep_fac gifted_fac sch_g7_frpl_per in `train', verbose probability
// generate predictions on the test, not training, data
predict P in `test'
// Calculate the prediction error
generate err = ontime_grad != P in `test'
// summarize the prediction error
summarize err in `test'
tab P ontime_grad in `test'
// Generate predicted probabilities
predict P_prob in `test', prob
// calculate an ROC graph for the predicted probabilities
roctab ontime_grad P_prob in `test', graph
// Calculate all accuracy metrics from the confusion matrix using
// classtabi
// Note: due to random sampling, your numbers may need to be modified
tab P ontime_grad in `test'
classtabi 11790 7654 25223 59570