@mikmart
Last active February 5, 2019 13:37
A grouped summarize gets slow with many groups?
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)
library(purrr)

# Summarise in a single pass, grouping by both id and cyl.
just_group <- function(df) {
  df %>%
    group_by(id, cyl) %>%
    summarise(
      mpg = mean(mpg),
      hp  = mean(hp)
    ) %>%
    ungroup()
}

# Nest by id, then summarise each nested data frame by cyl separately.
nest_map <- function(df) {
  df %>%
    nest(-id) %>%
    mutate(data = map(data, ~ {
      .x %>%
        group_by(cyl) %>%
        summarise(
          mpg = mean(mpg),
          hp  = mean(hp)
        ) %>%
        ungroup()
    })) %>%
    unnest()
}

# Benchmark both approaches with 1e3, 1e4 and 1e5 ids (32 mtcars rows per id).
bench::press(
  n = 10^(3:5),
  {
    df <- mtcars %>%
      crossing(id = seq_len(n))

    bench::mark(
      just_group(df),
      nest_map(df)
    )
  }
) -> bm
#> Running with:
#>        n
#> 1   1000
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> 2  10000
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> 3 100000
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.

bm
#> # A tibble: 6 x 11
#>   expression      n      min     mean   median      max `itr/sec` mem_alloc
#>   <chr>       <dbl> <bch:tm> <bch:tm> <bch:tm> <bch:tm>     <dbl> <bch:byt>
#> 1 just_grou~   1000   5.47ms   6.87ms   6.37ms   13.3ms 146.       496.13KB
#> 2 nest_map(~   1000     2.2s     2.2s     2.2s     2.2s   0.454     12.03MB
#> 3 just_grou~  10000  51.67ms  56.62ms  54.66ms  66.28ms  17.7        1.61MB
#> 4 nest_map(~  10000    24.2s    24.2s    24.2s    24.2s   0.0413   115.38MB
#> 5 just_grou~ 100000 897.01ms 897.01ms 897.01ms 897.01ms   1.11      16.03MB
#> 6 nest_map(~ 100000    3.44m    3.44m    3.44m    3.44m   0.00485     1.1GB
#> # ... with 3 more variables: n_gc <dbl>, n_itr <int>, total_time <bch:tm>

ggplot2::autoplot(bm)
#> Error in loadNamespace(name): there is no package called 'ggbeeswarm'

Created on 2019-02-05 by the reprex package (v0.2.1.9000)

Session info
devtools::session_info()
#> - Session info ----------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.5.2 (2018-12-20)
#>  os       Windows 10 x64              
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  English_United Kingdom.1252 
#>  ctype    English_United Kingdom.1252 
#>  tz       Europe/Helsinki             
#>  date     2019-02-05                  
#> 
#> - Packages --------------------------------------------------------------
#>  package     * version     date       lib
#>  assertthat    0.2.0       2017-04-11 [1]
#>  backports     1.1.3       2018-12-14 [1]
#>  bench         1.0.1       2018-06-06 [1]
#>  bindr         0.1.1       2018-03-13 [1]
#>  bindrcpp    * 0.2.2       2018-03-29 [1]
#>  callr         3.1.1       2018-12-21 [1]
#>  cli           1.0.1       2018-09-25 [1]
#>  colorspace    1.4-0       2019-01-13 [1]
#>  crayon        1.3.4       2017-09-16 [1]
#>  desc          1.2.0       2018-05-01 [1]
#>  devtools      2.0.1       2018-10-26 [1]
#>  digest        0.6.18      2018-10-10 [1]
#>  dplyr       * 0.7.8       2018-11-10 [1]
#>  evaluate      0.12        2018-10-09 [1]
#>  fansi         0.4.0       2018-12-04 [1]
#>  fs            1.2.6       2018-08-23 [1]
#>  ggplot2       3.1.0       2018-10-25 [1]
#>  glue          1.3.0       2018-07-17 [1]
#>  gtable        0.2.0       2016-02-26 [1]
#>  highr         0.7         2018-06-09 [1]
#>  htmltools     0.3.6       2017-04-28 [1]
#>  knitr         1.21        2018-12-10 [1]
#>  lazyeval      0.2.1       2017-10-29 [1]
#>  magrittr      1.5         2014-11-22 [1]
#>  memoise       1.1.0       2017-04-21 [1]
#>  munsell       0.5.0       2018-06-12 [1]
#>  pillar        1.3.1.9000  2019-01-22 [1]
#>  pkgbuild      1.0.2       2018-10-16 [1]
#>  pkgconfig     2.0.2       2018-08-16 [1]
#>  pkgload       1.0.2       2018-10-29 [1]
#>  plyr          1.8.4       2016-06-08 [1]
#>  prettyunits   1.0.2       2015-07-13 [1]
#>  processx      3.2.1       2018-12-05 [1]
#>  profmem       0.5.0       2018-01-30 [1]
#>  ps            1.3.0       2018-12-21 [1]
#>  purrr       * 0.3.0.9000  2019-02-05 [1]
#>  R6            2.3.0       2018-10-04 [1]
#>  Rcpp          1.0.0       2018-11-07 [1]
#>  remotes       2.0.2       2018-10-30 [1]
#>  rlang         0.3.1       2019-01-08 [1]
#>  rmarkdown     1.11        2018-12-08 [1]
#>  rprojroot     1.3-2       2018-01-03 [1]
#>  scales        1.0.0       2018-08-09 [1]
#>  sessioninfo   1.1.1       2018-11-05 [1]
#>  stringi       1.2.4       2018-07-20 [1]
#>  stringr       1.3.1       2018-05-10 [1]
#>  testthat      2.0.0.9000  2018-05-29 [1]
#>  tibble        2.0.99.9000 2019-01-22 [1]
#>  tidyr       * 0.8.2       2018-10-28 [1]
#>  tidyselect    0.2.5       2018-10-11 [1]
#>  usethis       1.4.0       2018-08-14 [1]
#>  utf8          1.1.4       2018-05-24 [1]
#>  withr         2.1.2       2018-03-15 [1]
#>  xfun          0.4         2018-10-23 [1]
#>  yaml          2.2.0       2018-07-25 [1]
#>  source                           
#>  CRAN (R 3.5.2)                   
#>  CRAN (R 3.5.2)                   
#>  CRAN (R 3.5.2)                   
#>  CRAN (R 3.5.2)                   
#>  CRAN (R 3.5.2)                   
#>  CRAN (R 3.5.2)                   
#>  CRAN (R 3.5.1)                   
#>  CRAN (R 3.5.2)                   
#>  CRAN (R 3.5.2)                   
#>  CRAN (R 3.5.2)                   
#>  CRAN (R 3.5.2)                   
#>  CRAN (R 3.5.2)                   
#>  CRAN (R 3.5.2)                   
#>  CRAN (R 3.5.2)                   
#>  Github (brodieG/fansi@ab11e9c)   
#>  CRAN (R 3.5.2)                   
#>  CRAN (R 3.5.1)                   
#>  CRAN (R 3.5.2)                   
#>  CRAN (R 3.5.2)                   
#>  CRAN (R 3.5.2)                   
#>  CRAN (R 3.5.2)                   
#>  CRAN (R 3.5.2)                   
#>  CRAN (R 3.5.2)                   
#>  CRAN (R 3.5.2)                   
#>  CRAN (R 3.5.2)                   
#>  CRAN (R 3.5.2)                   
#>  Github (r-lib/pillar@3a54b8d)    
#>  CRAN (R 3.5.2)                   
#>  CRAN (R 3.5.1)                   
#>  CRAN (R 3.5.2)                   
#>  CRAN (R 3.5.2)                   
#>  CRAN (R 3.5.2)                   
#>  CRAN (R 3.5.2)                   
#>  CRAN (R 3.5.2)                   
#>  CRAN (R 3.5.2)                   
#>  Github (tidyverse/purrr@28a6811) 
#>  CRAN (R 3.5.1)                   
#>  CRAN (R 3.5.1)                   
#>  CRAN (R 3.5.2)                   
#>  CRAN (R 3.5.2)                   
#>  CRAN (R 3.5.2)                   
#>  CRAN (R 3.5.2)                   
#>  CRAN (R 3.5.1)                   
#>  CRAN (R 3.5.2)                   
#>  CRAN (R 3.5.1)                   
#>  CRAN (R 3.5.2)                   
#>  Github (r-lib/testthat@ca2280b)  
#>  Github (tidyverse/tibble@5a6e727)
#>  CRAN (R 3.5.1)                   
#>  CRAN (R 3.5.1)                   
#>  CRAN (R 3.5.2)                   
#>  CRAN (R 3.5.2)                   
#>  CRAN (R 3.5.2)                   
#>  CRAN (R 3.5.2)                   
#>  CRAN (R 3.5.2)                   
#> 
#> [1] C:/Users/mikkmart/OneDrive - Orion Corporation/R/win-library/3.5
#> [2] C:/Users/mikkmart/R/R-3.5.2/library
mikmart commented Feb 5, 2019

Further inspection showed that this was just because of an outdated dplyr version.
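
A quick way to rule out this kind of cause before benchmarking is to compare the installed package version against a minimum. The sketch below uses only base R. `is_outdated()` is a hypothetical helper, and the `"0.8.0"` threshold is an assumption for illustration, since the comment does not say which dplyr version resolved the slowdown.

```r
# Hypothetical helper: TRUE when the installed version of `pkg`
# is older than `minimum` (given as a version string).
is_outdated <- function(pkg, minimum) {
  utils::packageVersion(pkg) < package_version(minimum)
}

# "0.8.0" is an assumed threshold, not a version named in the gist.
if (requireNamespace("dplyr", quietly = TRUE) && is_outdated("dplyr", "0.8.0")) {
  message(
    "dplyr ", as.character(utils::packageVersion("dplyr")),
    " is installed; consider update.packages(oldPkgs = \"dplyr\")"
  )
}
```

If the check fires, updating the package and rerunning the reprex shows whether the timings still differ.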
