Last active
September 25, 2018 15:29
Basics of factor level ordering
##### Basic factor level ordering and (treatment) contrasts

## Set up a data.frame with 1 continuous variable and 1 factor with 8 levels.
set.seed(1234)
x <- rnorm(80)
fac <- factor(rep(LETTERS[8:1], 10))
df <- data.frame(x, fac)
df$x[as.integer(df$fac) %% 5 == 0] <- rnorm(10, 1)
head(df, 16) ## Note the order of factor levels in this data is reverse
             ## alphanumeric
str(df)

## By default, R sets the order of factor levels to be alphanumeric
## ascending, regardless of their order in the data set.
levels(df$fac)
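
## A minimal sketch, not part of the original walkthrough: to keep levels in
## their order of first appearance instead, pass `levels` explicitly. The
## names `raw` and `fac_by_appearance` are illustrative only.
raw <- rep(LETTERS[8:1], 10)
fac_by_appearance <- factor(raw, levels = unique(raw))
levels(factor(raw))          ## default: sorted, "A" first
levels(fac_by_appearance)    ## order of appearance, "H" first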

## This is important because the order of levels impacts the specific
## contrasts entailed by each type of contrast coding.
##
## R's default is to use "treatment" contrast coding for unordered factors.
options("contrasts")
contrasts(df$fac)

## For treatment contrasts (sometimes called dummy coding), the first level
## of a factor is compared pairwise to each subsequent level, and the
## intercept is set to the mean of the first level.
summary(lm(x ~ fac, data = df))
by(df$x, df$fac, mean) ## cell means
## Note that the intercept Estimate corresponds to the cell mean for level
## A. All other Estimates correspond to the difference between the given
## cell mean and cell A.
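
## A small check of the claim above, sketched on throwaway data (the names
## `chk`, `chk_fit`, `chk_means`, and `chk_expected` are illustrative only).
## Under treatment contrasts the intercept should equal the mean of the first
## level, and each remaining coefficient should equal that level's mean minus
## the first level's mean.
chk <- data.frame(y = c(1, 3, 4, 8, 10, 16),
                  g = factor(rep(c("A", "B", "C"), each = 2)))
chk_fit <- lm(y ~ g, data = chk)
chk_means <- sapply(split(chk$y, chk$g), mean)   ## plain named vector of cell means
chk_expected <- c(chk_means[["A"]], chk_means[-1] - chk_means[["A"]])
coef(chk_fit)
chk_expected
all.equal(unname(coef(chk_fit)), unname(chk_expected))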

## It is usually a coincidence if the baseline level you want happens to be
## first in alphanumeric order. So what do you do if you want something
## different?
## You can use the relevel() function to specify which level you want to be
## the baseline. That level moves to the first position, and the remaining
## levels keep their relative order.
relevel(df$fac, "H")
df$fac <- relevel(df$fac, "H")
## Note that the Multiple R-squared for the model does not change, but the
## specific contrasts do (because the baseline has changed).
summary(lm(x ~ fac, data = df))
by(df$x, df$fac, mean)
## Now the intercept corresponds to the cell mean for H, and all other
## estimates are the difference between the given level and H.
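
## The same point, sketched on throwaway data (the names `rl`, `fit_a`, and
## `fit_b` are illustrative only): releveling changes the coefficients but
## not the fitted values, so the R-squared stays the same.
rl <- data.frame(y = c(1, 3, 4, 8, 10, 16),
                 g = factor(rep(c("A", "B", "C"), each = 2)))
fit_a <- lm(y ~ g, data = rl)            ## baseline is "A"
rl$g <- relevel(rl$g, ref = "B")
levels(rl$g)                             ## "B" moves to the front; "A", "C" keep their order
fit_b <- lm(y ~ g, data = rl)            ## baseline is "B"
coef(fit_b)                              ## intercept is now the cell mean for "B"
all.equal(fitted(fit_a), fitted(fit_b))
all.equal(summary(fit_a)$r.squared, summary(fit_b)$r.squared)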

## You can use factor() to specify an explicit order for all levels.
## This is sometimes handy for reasons we don't need to get into here.
factor(df$fac, levels = c("E", "F", "G", "H", "A", "B", "C", "D"))
df$fac <- factor(df$fac, levels = c("E", "F", "G", "H", "A", "B", "C", "D"))
## E is now the first level, so it becomes the baseline.
summary(lm(x ~ fac, data = df))
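
## One caution when reordering with factor(), sketched on a throwaway vector
## (the names `ro`, `ro_partial`, and `ro_full` are illustrative only): any
## value whose level is left out of `levels` becomes NA, so list every level.
ro <- factor(c("A", "B", "C", "A"))
ro_partial <- factor(ro, levels = c("C", "A"))      ## "B" is omitted
ro_full <- factor(ro, levels = c("C", "B", "A"))    ## all levels listed
ro_partial                                          ## the "B" observation is now NA
levels(ro_full)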