- Probabilistic Data Structures for Web Analytics and Data Mining : A great overview of the space of probabilistic data structures and how they are used to implement approximation algorithms.
- Models and Issues in Data Stream Systems
- Philippe Flajolet’s contribution to streaming algorithms : A presentation by Jérémie Lumbroso that visits some of the historical perspectives and how it all began with Flajolet.
- Approximate Frequency Counts over Data Streams by Gurmeet Singh Manku & Rajeev Motwani : One of the early papers on the subject.
- [Methods for Finding Frequent Items in Data Streams](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.187.9800&rep=rep1&t)
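The frequent-items papers above build on small counter-based summaries. As a concrete illustration (my own sketch, not code from any of the papers), here is the classic Misra-Gries algorithm in R: with `k` counters it reports every item that occurs more than `n/k` times in a stream of length `n`, and the reported counts are lower bounds on the true frequencies.

```r
misra_gries <- function(stream, k) {
  counters <- integer(0)  # named integer vector: item -> approximate count
  for (x in stream) {
    if (x %in% names(counters)) {
      counters[x] <- counters[x] + 1L
    } else if (length(counters) < k - 1) {
      counters[x] <- 1L
    } else {
      # no free counter: decrement all, drop the ones that hit zero
      counters <- counters - 1L
      counters <- counters[counters > 0L]
    }
  }
  counters
}

stream <- c("a", "b", "a", "c", "a", "a", "b", "a", "d", "a")
misra_gries(stream, k = 3)
```

With `k = 3` the summary keeps at most two counters at a time, yet still surfaces `"a"` (6 of 10 items) as the heavy hitter.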
library(V8)
stopifnot(packageVersion("V8") >= "0.5")
# Create V8 context and load viz.js
ct <- new_context("window")
invisible(ct$source('http://mdaines.github.io/viz.js/viz.js'))
# This runs: Viz("digraph { a -> b; }", "svg")
svg <- ct$call("Viz", "digraph { a -> b; }", "svg")
cat(svg)
library(ape)
library(DiagrammeR)
library(igraph)
library(htmltools)
library(pipeR)
# use this since write.igraph needs a file
tmp <- tempfile()
data(bird.orders)
I think the two most important messages that people can get from a short course are:
a) the material is important and worthwhile to learn (even if it's challenging), and b) it's possible to learn it!
For those reasons, I usually start by diving as quickly as possible into visualisation. I think it's a bad idea to start by explicitly teaching programming concepts (like data structures), because the payoff isn't obvious. If you start with visualisation, the payoff is really obvious and people are more motivated to push past any initial teething problems. In stat405, I used to start with some very basic templates that got people up and running with scatterplots and histograms - they wouldn't necessarily understand the code, but they'd know which bits could be varied for different effects.
Apart from visualisation, I think the two most important topics to cover are tidy data (i.e. http://www.jstatsoft.org/v59/i10/ + tidyr) and data manipulation (dplyr). These are both important for when people go off and apply what they've learned to their own data.
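As a small illustration of those two topics together, here is a hedged sketch (the toy data frame is invented for the example): tidyr reshapes an untidy year-per-column layout into one observation per row, and dplyr verbs then summarise it.

```r
library(tidyr)
library(dplyr)

# Invented toy data: case counts with one column per year (untidy layout)
untidy <- data.frame(
  country = c("A", "B"),
  `1999` = c(10, 20),
  `2000` = c(15, 25),
  check.names = FALSE
)

# Tidy it: one row per (country, year) observation
tidy <- untidy %>%
  pivot_longer(-country, names_to = "year", values_to = "cases")

# Then manipulate with dplyr verbs
totals <- tidy %>%
  group_by(country) %>%
  summarise(total = sum(cases))
totals
```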
library(DiagrammeR)
# import onetrain data
head(onetrain)
edges <- onetrain
edges  # edges.minlen doesn't seem to do much in final viz
# Create a 'nodes' data frame
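Since `onetrain` isn't shown above, here is a hedged sketch of what the node/edge construction might look like, using made-up edges and assuming a simple `from`/`to` edge list of behavior labels:

```r
library(DiagrammeR)

# Made-up edge list standing in for the unseen 'onetrain' data
edges <- data.frame(
  from = c("A", "A", "O"),
  to   = c("O", "X", "X"),
  stringsAsFactors = FALSE
)

# One node per distinct label; DiagrammeR node ids are integers 1..n
labels <- sort(unique(c(edges$from, edges$to)))
ndf <- create_node_df(n = length(labels), label = labels)
edf <- create_edge_df(
  from = match(edges$from, labels),
  to   = match(edges$to, labels)
)
graph <- create_graph(nodes_df = ndf, edges_df = edf)
# render_graph(graph)  # draws the diagram in the viewer
```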
### Markov Chain Diagrams Using DiagrammeR
### Introduction and Sample Data
#' Imagine a sequence of behaviors like below, where each letter (A,I,O,R,S,X,Y) refers to
#' a distinct behavior.
#' AOXXYXXXXXXYXXYXXXXXYXXXXXYXSXXXXAXAOOOXAAAOYXXXXXXSXXXXSXXYXYXXYXXYXXXXXXXXXYXXAAAAAAOAA
#' AOAAAOAAAAAOAAAAAAAAAAAOAAAOAAAOOAAAOAAAAAOOIAOAOAOIAOOOAAARSAAOOOAAAAOAAAOOAOOOAOAAAISAA
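Before drawing a Markov chain diagram, a behavior string like this has to become transition counts. A minimal sketch of that step in base R (using a shortened excerpt of the sequence purely for illustration):

```r
# Shortened excerpt of the behavior sequence above, for illustration
seq_str <- "AOXXYXXXXXXYXXYXX"
states  <- strsplit(seq_str, "")[[1]]

# Count transitions: each state paired with the state that follows it
trans <- table(from = head(states, -1), to = tail(states, -1))
trans

# Row-normalize to get estimated transition probabilities
prop.table(trans, margin = 1)
```

The resulting counts (or probabilities) map directly onto edges of the diagram: one edge per nonzero `from -> to` cell.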
#' Patch data on the fly.
#'
#' @param object to be patched
#' @param cond logical condition(s) to be evaluated within scope of object
#' @param \dots name-value pairs
#' @param quiet suppress messages
#'
#' @examples
#' patch(mtcars, where(vs == 0, am == 1), gear = Inf, carb = carb + 10)
#'
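The implementation isn't shown here, but the documented signature suggests one way `patch()` and `where()` could work. This is a speculative sketch consistent with the `@examples` call above, not the author's actual code:

```r
# where() just captures its arguments as unevaluated conditions
where <- function(...) as.list(substitute(list(...)))[-1]

patch <- function(object, cond, ..., quiet = FALSE) {
  # rows matching ALL conditions, evaluated in the scope of `object`
  hits <- Reduce(`&`, lapply(cond, eval, envir = object))
  # name-value pairs, also captured unevaluated
  updates <- as.list(substitute(list(...)))[-1]
  for (nm in names(updates)) {
    object[hits, nm] <- eval(updates[[nm]],
                             envir = object[hits, , drop = FALSE])
  }
  if (!quiet) message(sum(hits), " row(s) patched")
  object
}

out <- patch(mtcars, where(vs == 0, am == 1), gear = Inf, carb = carb + 10)
```

Under this reading, rows with `vs == 0 & am == 1` get `gear` set to `Inf` and `carb` incremented by 10, while all other rows are untouched.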
# ttable: a grammar of tables
# https://gist.github.com/leeper/f9cfbe6bd185763762e126a4d8d7c286
# aggregate/summarize
# arrange
# annotation (metadata features)
# theme
# fun with the gt package.
# replicating the "Parts of a gt Table" at
# https://blog.rstudio.com/2020/04/08/great-looking-tables-gt-0-2/
library(dplyr)
library(gt)
data.frame(
  row_label = c("ROW LABEL 1", "ROW LABEL 2"),