@brshallo
brshallo / count-workers.R
Created November 14, 2018 23:55
Question: given a dataset with employee ID, clock-in, and clock-out times, how would you get hourly counts of employees on the clock?
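One way to approach the question, as a minimal sketch in base R rather than the gist's own solution: treat a worker as "on the clock" during an hour if their shift overlaps that hour at all. The toy data, the column names `clock_in` and `clock_out`, and the overlap rule are all assumptions made for illustration.

```r
# Hypothetical toy data: three workers with clock-in and clock-out times
work_times <- data.frame(
  worker_id = 1:3,
  clock_in  = as.POSIXct(c("2018-10-30 09:15:00", "2018-10-30 10:30:00",
                           "2018-10-30 13:00:00"), tz = "UTC"),
  clock_out = as.POSIXct(c("2018-10-30 14:20:00", "2018-10-30 15:00:00",
                           "2018-10-30 19:45:00"), tz = "UTC")
)

# Start of every hour from the earliest clock-in to the latest clock-out
hour_starts <- seq(
  from = as.POSIXct(trunc(min(work_times$clock_in), units = "hours")),
  to   = as.POSIXct(trunc(max(work_times$clock_out), units = "hours")),
  by   = "hour"
)

# Assumed rule: a shift counts toward an hour if it overlaps [hour, hour + 1h)
n_on_clock <- vapply(seq_along(hour_starts), function(i) {
  h <- hour_starts[i]
  sum(work_times$clock_in < h + 3600 & work_times$clock_out > h)
}, integer(1))

hourly_counts <- data.frame(hour = hour_starts, n_on_clock = n_on_clock)
hourly_counts
```

With this toy data the counts for 09:00 through 19:00 are 1, 2, 2, 2, 3, 3, 1, 1, 1, 1, 1. A stricter rule (for example, counting only workers present for the whole hour) would only change the comparison inside `sum()`.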
# Libraries
library(tidyverse, quietly = TRUE)
library(lubridate, warn.conflicts = FALSE, quietly = TRUE)
# Create simulated data set
set.seed(123)
possible_starts <- seq.POSIXt(ymd_hms("2018-10-30 09:00:00"), ymd_hms("2018-10-30 14:00:00"), by = "min")
possible_ends <- seq.POSIXt(ymd_hms("2018-10-30 14:00:00"), ymd_hms("2018-10-30 20:00:00"), by = "min")
(work_times <- tibble(worker_id = 1:100,
@brshallo
brshallo / filter-for-network.R
Created November 29, 2018 22:13
Remove all rows whose value in `col_b` does not show up in `col_a` (question sent)
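A minimal sketch of the filtering step, written in base R so it runs without extra packages. It assumes a single pass is enough, that is, the membership check is made against the original `col_a` and is not repeated after rows are dropped; the name `kept` is hypothetical.

```r
# Same example values as the gist's data frame
df <- data.frame(
  col_a = c("a", "b", "c", "d", "d"),
  col_b = c("b", "a", "d", "a", "e"),
  stringsAsFactors = FALSE
)

# Keep only rows whose col_b value appears somewhere in col_a
kept <- df[df$col_b %in% df$col_a, , drop = FALSE]
kept
```

Here only the last row is dropped, because "e" never appears in `col_a`. With dplyr loaded, `filter(df, col_b %in% col_a)` expresses the same condition.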
library(tidyverse, quietly = TRUE)
(df <- tibble(col_a = c("a", "b", "c", "d", "d"),
col_b = c("b", "a", "d", "a", "e")))
#> # A tibble: 5 x 2
#> col_a col_b
#> <chr> <chr>
#> 1 a b
#> 2 b a
#> 3 c d
@brshallo
brshallo / list-to-df-indexed.R
Last active December 3, 2018 06:43
List-to-df-with-index-maintained-question
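One possible way to flatten a list while keeping track of which element each value came from, sketched in base R. This is an illustration under assumed column names (`index`, `value`), not necessarily the approach the gist takes.

```r
# Same example list as in the gist
list_test <- list(1, 1, 1, 1, c(1, 2), 3, 3, c(2, 4))

# Repeat each element's position once per value it holds, then flatten
indexed <- data.frame(
  index = rep(seq_along(list_test), times = lengths(list_test)),
  value = unlist(list_test)
)
indexed
```

Elements 5 and 8 each contribute two rows, so the result has 10 rows and the `index` column reads 1, 2, 3, 4, 5, 5, 6, 7, 8, 8.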
# unnest list, maintain index example
library(tidyverse, quietly = TRUE)
(list_test <- list(1, 1, 1, 1, c(1, 2), 3, 3, c(2, 4)))
#> [[1]]
#> [1] 1
#>
#> [[2]]
#> [1] 1
#>
@brshallo
brshallo / num_mods_created.Rmd
Last active February 6, 2019 04:57
Best subset vs stepwise, number of models created
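For reference, the two counts can also be written in closed form. The sketch below assumes the usual textbook convention that counts the null model: best subset selection fits 2^p models for p predictors, while forward stepwise selection fits 1 + p(p + 1)/2. Summing `choose(k, j)` over j from 1 to k gives 2^k - 1, which is the same best-subset count without the null model. The name `model_counts` is hypothetical.

```r
# Number of candidate models as the number of predictors p grows
p <- 1:10
model_counts <- data.frame(
  p = p,
  best_subset = 2^p,                        # every subset, null model included
  forward_stepwise = 1 + p * (p + 1) / 2    # null model plus p + (p - 1) + ... + 1
)
model_counts
```

At p = 10 this gives 1024 models for best subset against 56 for forward stepwise.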
``` r
library(tidyverse)
# as k increases from 1:p, see number of models created
p <- 10
map_dbl(1:p, ~ choose(.x, c(1:p)) %>% sum()) %>%
enframe(name = NULL) %>%
rename(best_subset = value) %>%
mutate(p = 1:p,
stepwise = 1 + p * (p + 1) / 2) %>%
@brshallo
brshallo / rbinom_with_sample.R
Created February 6, 2019 05:43
Overly fancy sample technique -- could just use `rbinom` rather than `sample` here
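The simpler alternative mentioned in the description might look like the sketch below: `rbinom` is vectorised over `prob`, so one call draws a Bernoulli outcome per row. The variable names are illustrative.

```r
set.seed(123)
x <- runif(100)  # per-row probability of the event

# One Bernoulli draw per element of x; compare to 1 to get TRUE/FALSE
event <- rbinom(n = length(x), size = 1, prob = x) == 1
head(event)
```

Each draw is equivalent to a single `sample(c(TRUE, FALSE), size = 1, prob = c(x, 1 - x))` call, without the row-wise `map2_lgl` loop.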
``` r
library(tidyverse)
tibble(x = runif(100), x_inv = 1 - x) %>%
mutate(event = map2_lgl(.x = x,
.y = x_inv,
~sample(c(TRUE, FALSE),
size = 1,
replace = TRUE,
prob = c(.x, .y))
)
@brshallo
brshallo / binary_continuous_plot.R
Last active February 25, 2019 07:06
Plot for viewing continuous vs binomial variable relationship
library(dplyr)
library(purrr)
library(ggplot2)
library(rlang)
# make "remainder" / throw away components more symmetric (expects balanced)
ggplot_continuous_binary <- function(df, covariate, response, rug = TRUE, snip_scales = FALSE, input_bin_size = NULL){
covariate_var <- enquo(covariate)
response_var <- enquo(response) # needs to be either a TRUE/FALSE or a 1/0
@brshallo
brshallo / gini_entropy_similarity.R
Created February 12, 2019 17:44
Plot showing how entropy and the Gini index vary with the proportion of events (and that the two follow the same pattern).
library(tidyverse)
df_metrics <- tibble(
prob = seq.int(0.001, 0.999, length.out = 999),
entropy = -2 * (prob * log(prob) + (1-prob)*log(1-prob)),
gini_index = 4 * prob * (1 - prob)
) %>%
gather(entropy, gini_index, key = "purity_metric", value = "value")
ggplot(df_metrics, aes(x = prob, y = value, colour = purity_metric))+
@brshallo
brshallo / add_multiple.R
Created February 19, 2019 15:26
Function to add multiple inputs
library(purrr)
# Element-wise sum of a list of equal-length numeric vectors
add_vecs <- function(vecs){
  if(!is.list(vecs)) stop("must be list input")
  reduce(vecs, `+`)
}
add_vecs(list(1:5, 1:5, 1:5))
#> [1] 3 6 9 12 15
@brshallo
brshallo / deviation_coding.R
Created February 20, 2019 02:43
Function to convert contrasts in factors to deviation coding
library(tidyverse)
contrast_deviation <- function(vec_fct){
# use the number of factor levels so the matrix matches even with unused levels
new_contrast <- contr.sum(nlevels(vec_fct))
contrasts(vec_fct) <- new_contrast
vec_fct
}
# convert all char columns to factors that use deviation coding
mpg %>%
@brshallo
brshallo / box_box_box_and_whiskers_plot.md
Last active February 24, 2019 23:21
Visualize distribution of mean of target by two variables within a hierarchy, segmented by a third variable.
library(tidyverse)
library(glue)
library(ggbeeswarm)

flights <- nycflights13::flights %>%
  filter(arr_delay > 0) %>%
  mutate(arr_delay_log = log(arr_delay),
         quarter = as.factor(1 + (month - 1) %/% 3))