Last active
November 16, 2023 19:34
-
-
Save jmclawson/50e0fbf83d24ba4c9dde0711f362dc45 to your computer and use it in GitHub Desktop.
Understanding and using linear models in R
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
--- | |
title: "Understanding and using linear models in R" | |
subtitle: "A 10-slide guide to the dark side" | |
format: | |
revealjs: | |
slide-number: true | |
embed-resources: true | |
warning: false | |
message: false | |
theme: dark | |
--- | |
## Case study: finding mass from height | |
The `starwars` data set is always useful. We're filtering out Jabba Desilijic Tiure who is *surprisingly* massive for their height. We're also filtering out anyone whose mass is missing. | |
```{r} | |
#| label: setup | |
#| echo: true | |
library(tidyverse) | |
starwars |> | |
filter(name != "Jabba Desilijic Tiure", !is.na(mass)) |> | |
select(name, height, mass) |> | |
head() | |
``` | |
## | |
```{r} | |
# black ggplot theme: | |
source("https://gist.githubusercontent.com/jmclawson/47c84f8f3d4e6e09edfd42ed112f4e03/raw/63fecc34d65a7bbd8c61e9b7aa0c1591ac25e40c/theme_black.R") | |
base_plot <- starwars |> | |
filter(name != "Jabba Desilijic Tiure", !is.na(mass)) |> | |
lm(mass ~ height, data = _) |> | |
ggplot(aes(height, mass)) + | |
geom_point(color = "lightgray", | |
shape = 21, | |
fill = "black") + | |
labs(title = "Height and mass look related. That is, taller characters are also more massive.", | |
subtitle = "The <i>response</i> variable is typically shown on the Y-axis. We’re predicting <b>mass</b> from <b>height</b>.", | |
y = "mass (kg)", | |
x = "height (cm)") + | |
theme_black(base_size = 14, black = "#191919") + | |
theme(plot.title = ggtext::element_textbox(), | |
plot.caption = ggtext::element_textbox(), | |
plot.subtitle = ggtext::element_textbox(), | |
plot.title.position = "plot", | |
panel.grid.major = element_blank(), | |
panel.grid.minor = element_blank()) | |
base_plot | |
``` | |
## | |
```{r} | |
color_res <- "#009E73" | |
color_pred <- "#CC79A7" | |
color_trend <- "#0072B2" | |
color_actual <- "#EEEEEE" | |
second_base <- base_plot + | |
geom_smooth(method = "lm", | |
se = FALSE, | |
color = color_trend) + | |
labs(title = paste0('<b>geom_smooth(method=\\"lm\\")</b> fits a linear <b style="color:',color_trend,';">trend line</b> confirming the relationship.'), | |
subtitle = "If we want to learn more details, we'll need to model the relationship of <b>height</b> to <b>mass</b>.") | |
second_base | |
``` | |
## | |
```{r} | |
third_base <- second_base + | |
geom_segment(aes(xend = height, | |
yend = .fitted), | |
color = color_res) + | |
geom_point(color = "gray", | |
shape = 21, | |
fill = "black") + | |
geom_smooth(method = "lm", | |
formula = y ~ x, | |
se = FALSE, | |
color = color_trend) + | |
geom_point(aes(y = .fitted), | |
color = "white", | |
shape = 21, | |
fill = color_pred) + | |
labs(title = paste0('<b>lm(mass ~ height, data = starwars)</b> models relationship of <b>height</b> to <b>mass</b>.'), | |
subtitle = paste0('This model can show us values for the <b style="color:',color_pred,';">predicted fit</b> and the <b style="color:',color_res,';">residual distances</b>.')) + | |
scale_color_identity() | |
third_base | |
``` | |
## | |
```{r} | |
annotations <- tibble( | |
label = c( | |
"The model's **.fitted** column<br>shows the predicted mass<br>for a given height", | |
"The **mass** column in **starwars**<br>shows the known mass for<br>a given height", | |
"The model's **.resid**<br>column measures<br>inaccuracy"), | |
color = c(color_pred, color_actual, color_res), | |
x = c(130, 160, 220), | |
y = c(95, 145,40)) | |
home_base <- third_base + | |
annotate(geom = "segment", | |
color = color_res, | |
x = 220, xend = 195, | |
y = 40, yend = 60, | |
arrow = arrow(length = unit(0.1,"cm"))) + | |
annotate(geom = "segment", | |
color = color_res, | |
x = 220, xend = 198, | |
y = 40, yend = 74, | |
arrow = arrow(length = unit(0.1,"cm"))) + | |
annotate(geom = "segment", | |
color = color_actual, | |
x = 173, xend = 177, | |
y = 145, yend = 122.5, | |
arrow = arrow(length = unit(0.1,"cm"))) + | |
annotate(geom = "segment", | |
color = color_actual, | |
x = 173, xend = 188, | |
y = 145, yend = 116, | |
arrow = arrow(length = unit(0.1,"cm"))) + | |
annotate(geom = "segment", | |
color = color_pred, | |
x = 130, xend = 148, | |
y = 95, yend = 63, | |
arrow = arrow(length = unit(0.1,"cm"))) + | |
annotate(geom = "segment", | |
color = color_pred, | |
x = 130, xend = 157.2, | |
y = 95, yend = 70.4, | |
arrow = arrow(length = unit(0.1,"cm"))) + | |
ggtext::geom_richtext( | |
data = annotations, | |
aes(x = x, y = y, color = color, label = label), | |
fill = "#191919") + | |
labs(caption = paste0('A model’s effectiveness is expressed by its Root Mean Squared Error, <b>sqrt(var(<span style="color:',color_res,';">model$.resid</span>))</b> = <b>',round(sqrt(var(lm(mass ~ height, data = filter(starwars, name != "Jabba Desilijic Tiure", !is.na(mass)))$residuals)),1),'</b>')) | |
home_base + | |
labs(caption = NULL) | |
``` | |
## `lm()` fits a linear model | |
Inside the function, a formula `y ~ x` will show that we want to deduce `y` from `x`: | |
```{r} | |
#| echo: true | |
#| output-location: fragment | |
the_model <- starwars |> | |
filter(name != "Jabba Desilijic Tiure", !is.na(mass)) |> | |
lm(mass ~ height, data = _) | |
the_model | |
``` | |
## Use **broom** to `tidy()` | |
```{r} | |
#| echo: true | |
library(broom) | |
tidy(the_model) | |
``` | |
::: {.fragment} | |
Coefficients reveal the trend line's equation, *y = mx + a* | |
::: columns | |
::: {.column width="50%"} | |
* *m* = height = `r round(the_model$coefficients[[2]], 3)`; each cm adds this many kg | |
* At 100cm, a character's mass should be `r round(100 * the_model$coefficients[[2]] + the_model$coefficients[[1]], 1)`kg. | |
::: | |
::: {.column width="50%"} | |
* *a* = Intercept = `r round(the_model$coefficients[[1]], 1)`; or height if mass is 0kg | |
* At 200cm, a character's mass should be `r round(200 * the_model$coefficients[[2]] + the_model$coefficients[[1]], 1)`kg. | |
::: | |
::: | |
::: | |
## `augment()` adds two columns | |
```{r} | |
#| echo: true | |
starwars |> | |
select(name, height, mass) |> | |
head() |> | |
augment(x = the_model, | |
newdata = _) | |
``` | |
* `.fitted` is the predicted mass | |
* `.resid` calculates the actual `mass` minus this prediction | |
## RMSE measures a model's accuracy | |
* RMSE stands for Root Mean Squared Error | |
* lower RMSE is better than higher | |
* calculate with `sqrt(var(the_model$residuals))` | |
```{r} | |
#| fig-align: center | |
third_base + | |
labs( | |
title = paste0('This model’s RMSE is <b>',round(sqrt(var(the_model$residuals)),2),'kg</b>.'), | |
caption = NULL, | |
subtitle = NULL) | |
``` |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Slides can be found here: https://jmclawson.net/misc/linear-models-slides.html