Created
August 3, 2021 17:43
The datasaurus dozen is a collection of datasets that demonstrate how summary statistics can be misleading. This gist shows how to access the data and visualize it.
# If you don't have devtools, install that first. If you have Windows and R version 4.0 (or greater), see: https://cran.r-project.org/bin/windows/Rtools/
devtools::install_github("lockedata/datasauRus")
# Load the datasaurus dozen and our favorite suite of data processing tools
library(datasauRus)
library(tidyverse)
# Quick look at the contents of the datasaurus_dozen data frame
summary(datasaurus_dozen)
unique(datasaurus_dozen$dataset)
# Scatter plots of the "slant_up" and "high_lines" datasets
filter(datasaurus_dozen, dataset == "slant_up") %>%
  ggplot(aes(x, y)) +
  geom_point() +
  theme_classic()
filter(datasaurus_dozen, dataset == "high_lines") %>%
  ggplot(aes(x, y)) +
  geom_point() +
  theme_classic()
# Scatter plot of the legend, the myth, the datasaurus
filter(datasaurus_dozen, dataset == "dino") %>%
  ggplot(aes(x, y)) +
  geom_point() +
  theme_classic()
# Scatter plots of the full collection
ggplot(datasaurus_dozen, aes(x, y, color = dataset)) +
  geom_point() +
  facet_wrap(~ dataset) +
  theme_classic() +
  theme(legend.position = "none")
# This is what makes the datasaurus dozen so special: despite *looking* so distinctive to human eyes, they all
# have nearly identical summary statistics (the means and standard deviations of x and y, and the correlation
# coefficient, agree to about two decimal places). A reminder that we need to be discerning when we look at
# summary statistics, whether in a research article or in training a machine learning algorithm. Looking at
# the raw data, or using algorithms that look at the raw data, can be the difference between useful inference
# and misleading nonsense.
datasaurus_dozen %>%
  group_by(dataset) %>%
  summarize(mean(x),
            mean(y),
            sd(x),
            sd(y),
            cor(x, y))
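To put a number on how closely the summary statistics agree, one option is to compute, for each statistic, the gap between its largest and smallest value across datasets. The sketch below is not part of the original gist: `summary_spread()` and the `toy` data frame are hypothetical names introduced for illustration, and the helper assumes a data frame with `dataset`, `x`, and `y` columns, as in `datasaurus_dozen`. It uses base R only, so it runs even if datasauRus is not installed.

```r
# Hypothetical helper (not part of datasauRus): for each summary statistic,
# return max - min of that statistic across the groups in df$dataset.
summary_spread <- function(df) {
  # One data frame per dataset
  groups <- split(df, df$dataset)
  # Five statistics per group; result is a 5 x n_groups matrix
  stats <- vapply(groups, function(g) {
    c(mean_x = mean(g$x), mean_y = mean(g$y),
      sd_x = sd(g$x), sd_y = sd(g$y),
      cor_xy = cor(g$x, g$y))
  }, numeric(5))
  # Spread of each statistic (each row) across groups
  apply(stats, 1, function(v) max(v) - min(v))
}

# Toy data (made up for this sketch): two groups with identical means and
# standard deviations but opposite correlations.
toy <- data.frame(
  dataset = rep(c("a", "b"), each = 4),
  x = c(1, 2, 3, 4, 4, 3, 2, 1),
  y = c(1, 2, 3, 4, 1, 2, 3, 4)
)
toy_spread <- summary_spread(toy)
print(toy_spread)

# Apply the same check to the real data when the package is available
if (requireNamespace("datasauRus", quietly = TRUE)) {
  print(round(summary_spread(datasauRus::datasaurus_dozen), 3))
}
```

In the toy data only the correlation differs between groups, so its spread is the only nonzero entry. For `datasaurus_dozen`, small spreads in every row would be consistent with the near-identical statistics described above, even though the scatter plots look nothing alike.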