Created
December 30, 2019 14:54
-
-
Save viniciusmss/f3fa0d29d51a045b43e43821afabd533 to your computer and use it in GitHub Desktop.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
# Quiz 2 -- More R Coding... | |
## This quiz covers competencies you should know by the first day of class. | |
## The first ten questions are intended to be basic. | |
## The final five questions are more advanced. | |
## An answer key will be provided by Monday, Aug 12. | |
## NOTE: Most students find it helpful to code in R using R-Studio. | |
## This involves installing R, and installing R-Studio, and then | |
## running R-Studio. Google it if you run into problems, and | |
## if Google can't answer your questions, feel free to email me. | |
# Load the Matching library, as described at the beginning of the prior quiz | |
# (Quiz 1). If you've never installed this Matching library before, | |
# you'll have to begin by doing so -- one way is to use use "install.packages()", | |
# but there are also 'point-and-click' ways to do it. | |
# So---the first step is to install the library, and the next step is to load it. | |
# The "Matching" library comes with a data set called | |
# GerberGreenImai which relates to a series of published papers. | |
# You can see some of those papers if (after loading the library) | |
# you type: | |
?GerberGreenImai | |
# This command opens the help page for the data set. At the bottom of the help page, | |
# you'll see a list of 4 relevant published papers. | |
# To load the data set into working memory, type: | |
data(GerberGreenImai) | |
# To take a quick look at the data set, type: | |
head(GerberGreenImai) | |
# Question 1 (using "dim") | |
# What are the dimensions of the data set (how many rows and columns)? | |
dim(GerberGreenImai) # rows: 10829 columns: 26 | |
# Question 2 (using "names") | |
# What are the variable names? | |
names(GerberGreenImai) | |
# [1] "PERSONS" "WARD" "QUESTION" "MAILGRP" "PHONEGRP" "PERSNGRP" "APPEAL" "CONTACT" "MAILINGS" "AGE" | |
# [11] "MAJORPTY" "VOTE96.0" "VOTE96.1" "MAILCALL" "VOTED98" "PHNSCRPT" "DIS.MC" "DIS.PHN" "PHN.C" "PHNTRT1" | |
# [21] "PHNTRT2" "PHN.C1" "PHN.C2" "NEW" "phone" "AGE2" | |
# Question 3 (using "is.na") | |
# Is there any missing data? If so, how many elements are missing | |
# and which rows contain the missing data? | |
any(is.na(GerberGreenImai)) # False, which means no missing data. | |
sum(is.na(GerberGreenImai)) # how many elements are missing? Zero. | |
# Question 4 (using the "mode" function, or "is.factor") | |
# In this data set, each column is a variable (e.g., "AGE", "WARD", etc.). | |
# Which of these variables (if any) is a factor? | |
lapply(GerberGreenImai, is.factor) # WARD is a factor | |
# Question 5 (using the "class" function) | |
# Code a for loop that outputs column name and variable type, per the below format: | |
# column name #1 variable type #1 | |
# column name #2 variable type #2 | |
# column name #3 variable type #3 | |
# etc., etc., etc., type of each | |
for (colname in names(GerberGreenImai)) { | |
cat(colname, ":", class(GerberGreenImai[[colname]]), "\n") | |
} | |
# Question 6 | |
# What do you think the variable AGE2 represents? | |
# Looks like age squared divided by 100 | |
# Question 7 | |
# Which individuals (and how many) voted in 1998? | |
which(GerberGreenImai$VOTED98 == 1) # Individual indices | |
sum(GerberGreenImai$VOTED98) # Number of people | |
# Question 8 | |
# Which individuals (and how many) have age > 35 and did NOT vote in 1998? | |
which(GerberGreenImai$AGE > 35 & GerberGreenImai$VOTED98 == 0) | |
sum(GerberGreenImai$AGE > 35 & GerberGreenImai$VOTED98 == 0) | |
# Question 9 | |
# What line of code eliminates all rows that satisfy the conditions of Question 8? | |
GerberGreenImai[!(GerberGreenImai$AGE > 35 & GerberGreenImai$VOTED98 == 0), ] | |
# or | |
# subset(GerberGreenImai, !(GerberGreenImai$AGE > 35 & GerberGreenImai$VOTED98 == 0)) | |
# Question 10 | |
# Append a column that records a "1" if Question 8's conditions satisfy, 0 if not. | |
# Congrats, you just created a dummy variable! Google it if you have questions. | |
GerberGreenImai$myfirstdummy <- as.numeric(GerberGreenImai$AGE > 35 & GerberGreenImai$VOTED98 == 0) | |
### More advanced questions | |
# Question 11 | |
# Write a function that performs the algorithm you identified in Question 6. | |
algo1 <- function(age) {return((age^2)/100)} | |
# and then try algo1(9) | |
# or try algo1(20) or whatever... | |
# Question 12 | |
# Run a regression w/ AGE2 and AGE as dependent and independent vars (respectively). | |
# Do you get statistically significant results? Is there something causal going on? | |
# Does the model make predictions that are not possible in the real world? Explain. | |
lm.age <- lm(AGE2 ~ AGE, data=GerberGreenImai) | |
summary(lm.age) | |
# We get statistically significant results. There's no causation proper here since AGE2 | |
# is calculated directly from AGE. The model does make impossible predictions since the intercept is negative. | |
# If AGE is zero, AGE2 will be -24.9. | |
# Question 13 (hint: see the help page for "formula") | |
# Consider the following code: | |
reg1 <- lm(PHN.C1 ~ PERSONS + VOTE96.1 + NEW + MAJORPTY + AGE + | |
WARD + I(PERSONS*VOTE96.1), data=GerberGreenImai[c(1:1000),]) | |
# you can then type something like "summary(reg1)"... | |
# This regression predicts whether or not a subject received a get-out-the-vote | |
# phone call based on the first 1000 observations. | |
# What is meant by I(PERSONS*VOTE96.1)? Use the resulting regression | |
# model to make a prediction for individuals in rows 1001 and 1002. Do these | |
# predictions make sense? Why or why not? | |
# It is an interaction term between the number of person in the household and whether | |
# they voted in 1996. Namely, the effect of both factors together in influencing the outcome | |
# prediction that is not explained by the factors in isolation. | |
testdata <- GerberGreenImai[c(1001:1002),] | |
predict(reg1, testdata) | |
# These predictions make sense if they are interpreted as | |
# probabilities of having received a phone call. | |
# Question 14 | |
# Consider the "WARD" variable. Observe the first 20 elements. | |
# What happens when you perform arithmetic or statistical operations on these | |
# elements? Why? Provide intuition for the result you obtain. | |
GerberGreenImai$WARD[1:20] | |
mean(GerberGreenImai$WARD) # argument is not numeric or logical: returning NA | |
# R does not interpret the WARD variable as a number since it is coded as a factor. | |
# hence, it returns errors if we try to perform some operations over this variable. | |
# Question 15 | |
# Write code that takes the "WARD" variable and creates dummy variables for | |
# each individual ward. We can call this "dummy-fying" a categorical variable. | |
for (lvl in levels(GerberGreenImai$WARD)) { | |
GerberGreenImai[[lvl]] <- GerberGreenImai$WARD == lvl | |
} | |
# Given your answer to Question 14, do you see why you would want to "dummy-fy" | |
# this variable prior to performing statistical analysis on it (and | |
# variables likes this one)? | |
# We dummify since there is no sensible way to translate difference categories | |
# of a factor variable into numbers that can be statistically analyzed. When we | |
# dummify, we are able to isolate the effect of being part of a certain category or not. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment