@primaryobjects
Last active April 25, 2016 18:02
PISA dataset analysis with linear regression.
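# Load the training and test sets. stringsAsFactors = TRUE keeps raceeth a
# factor so relevel() works in R >= 4.0, where the default changed to FALSE.
train <- read.csv('pisa2009train.csv', stringsAsFactors = TRUE)
test <- read.csv('pisa2009test.csv', stringsAsFactors = TRUE)
# Average reading score grouped by the male indicator.
tapply(train$readingScore, train$male, mean)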
# Which columns have an NA value?
which(unlist(lapply(train, function(x) any(is.na(x)))))
# Remove missing values.
train <- na.omit(train)
test <- na.omit(test)
# Make 'White' the reference level so the raceeth coefficients are relative to it.
train$raceeth <- relevel(train$raceeth, 'White')
test$raceeth <- relevel(test$raceeth, 'White')
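# Fit a linear regression of readingScore on all other variables.
lmScore <- lm(readingScore ~ ., data=train)
# Training-set RMSE.
SSE <- sum(lmScore$residuals^2)
RMSE <- sqrt(SSE/nrow(train))
# Predicted difference in reading score between grade 11 and grade 9, all else equal.
# Assumes grade is the first predictor, so its coefficient sits at index 2 (after the intercept).
a <- 11 * lmScore$coefficients[2]
b <- 9 * lmScore$coefficients[2]
a - b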
pred <- predict(lmScore, newdata=test)
# Range between max and min predicted reading scores.
# Computed directly from pred rather than by position in summary(), which may round.
max(pred) - min(pred)
# Test-set error. SST uses the training-set mean as the baseline prediction.
SSEtest <- sum((pred - test$readingScore)^2)
SSTtest <- sum((mean(train$readingScore) - test$readingScore)^2)
RMSEtest <- sqrt(SSEtest / nrow(test))
# Out-of-sample R^2.
R2 <- 1 - SSEtest/SSTtest
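# A minimal sketch, not part of the original analysis: the test-set metrics
# above wrapped in a hypothetical helper, evalMetrics(). The baseline argument
# is assumed to be the training-set mean, as in SSTtest.
evalMetrics <- function(actual, predicted, baseline) {
  sse <- sum((predicted - actual)^2)  # squared errors of the model's predictions
  sst <- sum((baseline - actual)^2)   # squared errors of the constant baseline
  c(RMSE = sqrt(sse / length(actual)), R2 = 1 - sse / sst)
}
# Possible usage: evalMetrics(test$readingScore, pred, mean(train$readingScore))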
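# Baseline model: always predict the mean training reading score.
baseLineScore <- mean(train$readingScore)
# The mean of the fitted training values gives the same number:
# pred2 <- predict(lmScore, newdata=train); mean(pred2)
# Baseline sum of squared errors on the test set (same value as SSTtest).
SSTbase <- sum((baseLineScore - test$readingScore)^2)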