Nicolas Kosinski's notes from the Coursera course "Computing for Data Analysis", which teaches R basics for statistics: https://www.coursera.org/course/compdata (session #4: https://class.coursera.org/compdata-004)
✔ R installation @done (14-01-21 10:59)
✔ Week 1 @done (14-01-22 07:55)
✔ What Makes R Different? (4:20) @done (14-01-21 12:08)
R mixes:
interactive (command-oriented) tool
programming language
✔ How to Get Help (13:53) @done (14-01-21 12:37)
✔ Background and Overview (16:38) @done (14-01-21 15:39)
background:
dialect of S (interactive language)
S created in 1976, several versions
R created in 1991, current version: 3, S-like syntax
pros:
free (GPL)
quite lean and modular
excellent graphical capabilities
active community
cons:
old technology (40 years old!)
little support on 3D graphics
limited by in-memory storage (objects must fit in RAM)
overview:
"base" packages (utils, stats, datasets etc...)
"recommended" packages (boot, class, etc...)
source: CRAN (official) / Bioconductor / personal websites...
✔ Data Types (31:06) @done (14-01-21 17:27)
5 base types: character, numeric, integer, logical and complex
vector: contains objects of same type
list: contains objects of any types
objects have attributes
assignment: <-
comment: #
range (sequence): 1:10
special values: NA (missing value), NaN (not a number; a NaN is also NA), Inf (infinity)
explicit coercion: "as.logical(0)"
vector:
created by
vector(mode, length) # e.g. vector("numeric", 10)
or
c(...) # concatenation, objects are converted to compatible type (i.e. c(1.2, "a") creates string vector)
matrix:
special kind of vector with dimension
matrix(1:6, nrow=2, ncol=3) # values fill column by column
dimension can be set after value assignment
m <- c(1:10)
dim(m) <- c(2, 5)
create via column or row binding
cbind(1:3, 4:6) # binds as columns: a 3x2 matrix with columns 1 2 3 and 4 5 6
rbind(1:3, 4:6) # binds as rows: a 2x3 matrix with rows 1 2 3 and 4 5 6
list:
are indexed
factor:
special kind of list
used for categories (enums) => index + label + count
treated with modeling functions lm() and glm()
counts occurrences (kind of map):
f <- factor(c("yes", "no", "yes"))
table(f) # no yes
# 1 2
levels(f) # "no" "yes" (alphabetical order by default)
treated as integers:
unclass(f) # 2 1 2
order can be chosen via the constructor: factor(c("yes", "no", "yes"), levels=c("yes", "no"))
missing values:
is.na(f)
is.nan(f)
data frames:
for tabular data: a list containing heterogeneous columns, plus row names
can be converted to matrix (=> coercion)
naming:
Object can have a name (meta-data)
# for objects
x <- 1:3
names(x) <- c("one", "two", "three")
# for matrices
m <- matrix(1:4, nrow=2, ncol=2)
dimnames(m) <- list(c("row1", "row2"), c("col1", "col2")) # row names first, then column names
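The coercion rules mentioned above can be sketched quickly (a small example; values invented, not from the course):

```r
# implicit coercion: mixed types collapse to the most general type
v <- c(1.2, "a")
class(v)              # "character"

# explicit coercion with the as.* functions
as.logical(0)         # FALSE
as.numeric("3.14")    # 3.14
as.numeric("abc")     # NA, with a warning: coercion can fail
```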
✔ Subsetting (17:20) @done (14-01-21 18:18)
[*index or condition*] returns a subset with same type as original
[[]] returns a subset for multi-type containers (lists, data frames)
$ extracts elements by name (same semantics as [[]])
subsetting vectors:
v <- c("a", "b", "c", "c")
# by index:
v[1] # "a"
# filtering:
v[v > "a"] # "b", "c", "c"
subsetting matrices:
m <- matrix(1:6, 2, 3)
# by indices, into single-element vector:
m[1, 2] # 3
# by indices with missing values:
m[1,] # 1 3 5
# one element into an 1x1 matrix:
m[1, 2, drop = FALSE] # a 1x1 matrix containing 3
subsetting lists:
l <- list(ints = c(1, 2, 3), doubles = c(1.1, 2.2, 3.3))
# by name:
l$ints # 1, 2, 3
l[["ints"]] # 1, 2, 3
# by name via variable:
var <- "ints"
l[[var]] # 1, 2, 3
# by name via partial matching:
l$i # 1, 2, 3 ($ allows partial matching)
l[["i", exact = FALSE]] # 1, 2, 3
l[["i"]] # NULL ([[ requires the exact name by default)
# by index:
l[[c(1, 3)]] # 3
filtering NA values:
via [cond] or complete.cases() (vectors and data frames)
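The two NA-filtering approaches above can be sketched as follows (values invented for illustration):

```r
v <- c(1, NA, 3, NA, 5)
# via a logical condition:
v[!is.na(v)]              # 1 3 5

# via complete.cases(), which also works row-wise on data frames:
df <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))
df[complete.cases(df), ]  # keeps only the first row (no NA in any column)
```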
✔ Vectorized Operations (3:46) @done (14-01-21 18:46)
many functions can be applied in parallel on objects (vectors, etc...) without looping
v1 <- 1:4
v2 <- 6:9
v3 <- v1 + v2 # 7 9 11 13
v3 < 10 # T, T, F, F
matrix product:
m1 <- matrix(1:4, 2, 2)
m2 <- matrix(rep(10, 4), 2, 2)
m1 %*% m2 # | 40 40 | | 1 3 | | 10 10 |
# | 60 60 | = | 2 4 | . | 10 10 |
# * would mean multiply element-by-element
✔ Reading/Writing Data: Part 1 @done (14-01-21 20:12)
read/write functions for CSV-like text files:
read.table()/write.table() # default separator is whitespace, default comment char is "#"; can have a header
read.csv() # similar to read.table but default separator is "," and header = TRUE by default
readLines()/writeLines()
dget()/dput() # read/write single object
source()/dump() # read/write multiple objects
load()/save() # read/write in workspace
serialize()/unserialize() # read/write binary inputs
Reading large dataset:
refer to help page (hints)
consider skipping comments (comment.char="")
help type detection via "colClasses" argument:
# hard-code the column types:
all <- read.table(file, colClasses = "numeric")
# detect the types from the first n rows:
start <- read.table(file, nrows = 10)
classes <- sapply(start, class)
all <- read.table(file, colClasses = classes)
set row count (exact or over-estimate):
all <- read.table(file, nrows = n)
estimate required memory:
number of cells * 8 bytes (per numeric) = estimated memory in bytes (optimistic lower bound)
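Worked example of the memory estimate above (the row and column counts are made up):

```r
rows <- 1500000
cols <- 120
bytes <- rows * cols * 8   # 8 bytes per numeric cell
bytes / 2^30               # about 1.34 GB, an optimistic lower bound
```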
✔ Reading/Writing Data: Part 2 (9:25) @done (14-01-21 20:18)
Text file with metadata:
pros: human-readable, Unix philosophy, versioning...
cons: space-hungry (not storage efficient)
basic functions:
d <- data.frame(a=1, b="a")
d2 <- data.frame(a=2, b="c")
# serialize object:
dput(d)
# write object to a file:
dput(d, file = "d.R")
# read file to object:
d2 <- dget("d.R")
# write objects by name to a file:
dump(c("d", "d2"), file="d_and_d2.R")
# read objects from a file:
rm(d, d2) # we can remove objects to be sure...
source("d_and_d2.R")
file reading via connection interfaces:
file
url
gzfile # open GZIP file
bzfile # open BZIP2 file
some functions such as read.csv(filename) hide the connection interface (i.e. no need to open and close the file)
connection interface may be used for partial reading (example: readLines(cnx, 10))
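A minimal sketch of partial reading through a connection (file name and contents invented):

```r
# write a small temporary file
tf <- tempfile()
writeLines(c("line 1", "line 2", "line 3"), tf)

# open a connection, read only the first two lines, close it
con <- file(tf, "r")
first_two <- readLines(con, 2)
close(con)
first_two   # "line 1" "line 2"
```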
✔ Setting Your Working Directory and Editing R Code (Windows) [7:20] @done (14-01-22 07:44)
Basic commands related to working directory:
# work dir can be changed via menu "File | Change dir"
getwd() # show working directory, used as a base path for loading files
dir() # list files from work dir
source("foo.R") # load code from script file
ls() # lists functions and variables
dev loop : edit script, save it and load it in R Console via source("filename")
✔ The str function (6:05) @done (14-01-22 07:55)
str() displays the internal structure of an object
"what's in this object?"
many usages such as:
summarizing nested objects (vectors, lists, etc...)
displaying function signature
✔ Week 2 @done (14-01-23 08:55)
✔ Control Structures (15:23) @done (14-01-22 08:42)
control-flow keywords (for scripts):
if/else
for
while
repeat # infinite loop (until break is called)
break # break loop
next # go to next loop iteration
return
NB: for interactive usage, *apply functions are more useful
conditional assignment:
y <- if (x > 2) { 1 } else { 0 }
for (details):
# on range:
for (i in 1:4) {} # beware not to overwrite an existing variable!
# on vector indexes:
for (i in seq_along(v)) { v[i] }
# on vector values:
for (elem in v) { elem }
# on matrix:
m <- matrix(...)
for (i in seq_len(nrow(m))) { for (j in seq_len(ncol(m))) { m[i, j] } } # nested loops
combining expressions:
evaluated left to right
via logical operators (&&, ...)
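Left-to-right evaluation with short-circuiting can be checked directly (a small sketch, values invented):

```r
x <- 5
# the right-hand side only runs because the left-hand side is TRUE:
x > 0 && log(x) > 1              # TRUE

# short-circuit: stop() on the right is never evaluated:
FALSE && stop("never reached")   # FALSE
```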
✔ Functions (16.32) @done (14-01-22 09:50)
first-class objects:
arguments can be functions
functions can be nested
definition:
# creates an object of class "function"
f <- function(foo, bar, bazbaz = "default") { # ... }
returned value is last expression
arguments:
are named
can have default value
f <- function(nums, best.effort = TRUE)
can be explicit defined ("formal arguments")
formals(f) returns formal arguments
matching (on caller side):
# 1. by name:
f(foo=1, bar=2, bazbaz=3)
# 2. by name with partial matching (for interactive usage, preferably):
f(fo=1, bar=2, baz=3)
# 3. by position:
f(1, 2, 3)
# mixing position and name:
# legal but beware (named args are matched first and do not count for position)
f(bazbaz=3, 1, 2)
are evaluated lazily:
=> missing arg error occurs when arg is evaluated
variables arguments:
usages
extending an existing function
function (x, ...) { otherFunction(...) }
generic functions with extra arguments => function dispatch
unknown number of args
should be used for first arg
other args must be matched by exact name (partial matching is ignored)
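A small sketch of ... in practice (the function names are invented for illustration):

```r
# forward extra arguments to another function:
trimmed_mean <- function(x, ...) mean(x, ...)
trimmed_mean(c(1, 2, 3, 100), trim = 0.25)   # 2.5

# take an unknown number of arguments:
count_args <- function(...) length(list(...))
count_args(1, "a", TRUE)                     # 3
```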
✔ Scoping Rules (19:03) @done (14-01-22 11:26)
Bind symbol to value:
Via several environments:
1. global env # always first
2. package X # libraries loaded by user via library() are inserted here, by default
3. package Y
...
last is Base package # always last
Environments are ordered: search()
Function names != object names:
an object "foo" and a function "foo" can coexist
Lexical (=static) scoping:
differs from dynamic scoping
environment:
symbol dictionary (map names to symbols)
parent
children
foo <- function() {}
environment(foo) # outputs the global env, for instance
closure:
closure = function + environment
free variable:
used in a function but not defined in it (neither a local variable nor an argument)
foo <- function(a) {
# b is a free variable
a + b
}
free vars are searched in the function's environment (call it e), then in e's parent, etc... until the last parent (usually the global env)
if not found => error
nested functions:
functions can be returned by other functions
they have a dedicated environment
example:
# declare and call nested function:
make.power <- function(n) {
pow <- function(x) {
x ^ n
}
pow
}
cube <- make.power(3)
cube(4)
# display content of cube's env:
ls(environment(cube))
[1] "n" "pow"
# display symbol bound to "n", for cube():
get("n", environment(cube))
[1] 3
Lexical scoping:
Free variables are searched in function definition's env
Example:
y <- 10
f <- function(x) {
y <- 2
y^2 + g(x)
}
g <- function(x) {
# y is bound to 10 (scope of the function definition)
# and not to 2 (scope of function call)
x * y
}
f(3)
[1] 34
Lexical scoping is also used in languages such as Python, Perl, Scheme, Common Lisp
Consequence: memory cost (to store all environments)
✔ Optimization Application (9:21) @done (14-01-22 12:01)
Lexical scoping suits optimization problems well
Optimization routines: optim(), nlm(), optimize(), etc...
Objective functions are implemented via a "constructor" function that returns a nested function
By default, objective functions are minimized (not maximized)
# Set sigma (standard deviation) and mu (mean)
optim(c(mu = 0, sigma = 1), nLL)$par
# Set sigma to 2
nLL <- make.NegLogLike(normals, c(FALSE, 2))
optimize(nLL, c(-1, 3))$minimum
# Set mu to 1
nLL <- make.NegLogLike(normals, c(1, FALSE))
optimize(nLL, c(-1, 3))$minimum
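The snippets above call make.NegLogLike without defining it. A sketch of such a constructor (a reconstruction, not verbatim from the course: it assumes a Normal likelihood, with 'fixed' marking the (mu, sigma) parameters to hold constant):

```r
make.NegLogLike <- function(data, fixed = c(FALSE, FALSE)) {
  params <- fixed
  function(p) {
    params[!fixed] <- p
    mu <- params[1]
    sigma <- params[2]
    # log-density of N(mu, sigma^2), summed over the data, then negated
    a <- -0.5 * length(data) * log(2 * pi * sigma^2)
    b <- -0.5 * sum((data - mu)^2) / sigma^2
    -(a + b)
  }
}

set.seed(1)
normals <- rnorm(100, mean = 1, sd = 2)
# with sigma fixed to 2, the minimum should land near mean(normals):
nLL <- make.NegLogLike(normals, c(FALSE, 2))
optimize(nLL, c(-1, 3))$minimum
```

The closure keeps `data` and `fixed` in its environment, which is exactly why lexical scoping suits this pattern.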
✔ (Loop function:) lapply (9:23) @done (14-01-22 13:18)
lapply:
lapply() applies a function to all elements of a list, returning a list (same size as input)
lapply(object, function, ...)
If object is not a list, it will be coerced to a list (may fail)
Examples:
# sum integers:
x <- list(a = 1:5, b = 5:10)
lapply(x, mean)
$a
[1] 3
$b
[1] 7.5
# generate random numbers from 1 to 6 demonstrating extra args (...):
n <- 1:4
lapply(n, runif, min = 1, max = 6)
[[1]]
[1] 3.537174
[[2]]
[1] 1.221711 1.470035
[[3]]
[1] 2.304671 5.510663 4.682575
# using anonymous function:
l <- 1:2
lapply(l, function(x) x+1)
[[1]]
[1] 2
[[2]]
[1] 3
sapply() is similar but the returned object is simplified:
returns a vector instead of a list if every element has length 1
returns a matrix if every element is a vector of the same length (> 1)
Examples:
# vector returned:
x <- list(a = 1:5, b = 5:10)
sapply(x, mean)
a b
3.0 7.5
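The matrix-simplification case can be sketched too (example data invented):

```r
x <- list(a = 1:4, b = 5:8)
# range() returns 2 values per element => sapply simplifies to a 2x2 matrix
sapply(x, range)
#      a b
# [1,] 1 5
# [2,] 4 8
```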
✔ (Loop function:) apply (7:16) @done (14-01-22 14:14)
apply(array, dimensionToRetain, function, ...)
Example:
# Sum the columns (=> MARGIN 2) of a matrix with 5 rows and 3 columns
m <- matrix(1:15, 5, 3)
m
[,1] [,2] [,3]
[1,] 1 6 11
[2,] 2 7 12
[3,] 3 8 13
[4,] 4 9 14
[5,] 5 10 15
apply(m, 2, sum)
[1] 15 40 65
shortcuts:
rowSums(x) = apply(x, 1, sum)
rowMeans(x) = apply(x, 1, mean)
colSums(x) = apply(x, 2, sum)
colMeans(x) = apply(x, 2, mean)
Optimized for large matrices
on array:
# 3-dim array:
a <- array(1:2*2*5, c(2, 2, 5))
# sum over dimension 3 (retaining dimensions 1 and 2):
apply(a, c(1, 2), sum)
[,1] [,2]
[1,] 50 50
[2,] 100 100
# equivalent to:
rowSums(a, dims = 2)
✔ (loop function:) tapply and split (12:22) @done (14-01-22 15:02)
tapply:
apply a function to a subset of a vector (kind of "group by")
tapply(X, INDEX, FUN = NULL, ..., simplify = TRUE)
X: vector
INDEX: factor or list of factors
FUN: function to apply
simplify: should the result be simplified? (if simplify, return a vector, otherwise a list)
Example:
# vector with
# 10 random values with normal distribution with mean = 0
# 10 random values with uniform distribution
# 10 random values with normal distribution with mean = 1
v <- c(rnorm(10), runif(10), rnorm(10, 1))
# generate a factor with 3 levels, each repeated 10 times
f <- gl(3, 10)
tapply(v, f, mean)
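A deterministic sketch may be easier to check than the random example above (values invented):

```r
v <- c(1, 2, 3, 10, 20, 30)
f <- gl(2, 3)        # factor: 1 1 1 2 2 2
tapply(v, f, mean)   # group "1" -> mean(1, 2, 3) = 2, group "2" -> mean(10, 20, 30) = 20
#  1  2
#  2 20
```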
split:
Splits a vector into several vectors using factors (groups).
Similar to tapply, but does not apply a function.
Is generally used in conjunction with lapply...
Examples:
# split on 1 level:
# vector with
# 10 random values with normal distribution with mean = 0
# 10 random values with uniform distribution
# 10 random values with normal distribution with mean = 1
v <- c(rnorm(10), runif(10), rnorm(10, 1))
# generate a factor with 3 levels, each repeated 10 times
f <- gl(3, 10)
lapply(split(v, f), sum)
# split on multiple levels (factors):
x <- rnorm(10)
f1 <- gl(2, 5)
f2 <- gl(5, 2)
# for information, the combination of factors is given by:
interaction(f1, f2)
str(split(x, list(f1, f2), drop=TRUE))
List of 6
$ 1.1: num [1:2] -0.37 -0.029
$ 1.2: num [1:2] 0.00941 -0.85438
$ 1.3: num -0.331
$ 2.3: num -1.07
$ 2.4: num [1:2] -0.165 -0.825
$ 2.5: num [1:2] -0.224 -1.61
✔ (loop function:) mapply (4:41) @done (14-01-22 15:13)
mapply() applies a function in parallel over arguments
function (FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE)
Useful to "vectorize" a function that does not accept vector arguments.
Example:
# repeat 1 three times
> rep(1, 3)
[1] 1 1 1
# repeat 1 four times, 2 three times, etc...
> mapply(rep, 1:4, 4:1)
[[1]]
[1] 1 1 1 1
[[2]]
[1] 2 2 2
[[3]]
[1] 3 3
[[4]]
[1] 4
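Sketch of "vectorizing" a scalar-only function with mapply (clamp is an invented helper):

```r
# clamp() only works on a single value because max()/min() collapse vectors
clamp <- function(x, lo, hi) max(lo, min(x, hi))

# mapply applies it element by element:
mapply(clamp, c(-5, 3, 99), lo = 0, hi = 10)   # 0 3 10
```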
✔ Debugging Tools: Part 1 (8:50) @done (14-01-22 15:34)
something goes wrong!
Kind of messages:
1. message: diagnostic/notification; function execution continues
2. warning: something unexpected happened; function execution continues
example: log(-1) => NaNs produced
3. error: function execution stops
example: if (x > 0) throws an error if x is NA
+ condition: generic type of issue (custom)
invisible(v) can be used in function to indicate that object is returned but should not be auto-printed to the console
✔ Debugging Tools: Part 2 (10:07) @done (14-01-22 15:52)
debugging functions:
traceback(): prints out call stack of last error
debug(f): flags a function for debug mode (breakpoint is created on first line)
browser(): suspends function execution so it can be inspected step by step (can be called anywhere in the function)
trace(): inserts debugging code into a function (useful for package code we did not write)
recover(): modifies the error behaviour in order to browse the call stack
✔ Debugging Tools: Part 3 (8:23) @done (14-01-22 16:08)
traceback example (dumb one!):
> mean(z)
Error in mean(z) : object 'z' not found
> traceback()
1: mean(z)
debug example:
debug(myFunction)
myFunction("foo")
# then type 'n' + return for going to next line
recover example:
options(error = recover)
read.csv("doesNotExist")
Error in file(file, "rt") : cannot open the connection
In addition: Warning message
In file(file, "rt")
cannot open file 'doesNotExist': No such file or directory
Enter a frame number, or 0 to exit
1: read.csv("doesNotExist")
2: read.table(file = file, header = header, sep = sep, quote = quote, dec = de
3: file(file, "rt")
✔ Week 3 @done (14-01-28 16:06)
✔ Week 3 Introduction (1:58) @done (14-01-23 14:11)
✔ Simulation (14:51) @done (14-01-23 15:47)
generating numbers:
function prefixes:
d: evaluation of density
r: generation of random numbers
p: evaluation of cumulative distribution
q: evaluation of the quantile
main functions:
rnorm() generates random number for Normal distribution (input: number of points, mean (def=0) and standard deviation (def=1))
dnorm() evaluates Normal probability density at point (or some points)
pnorm() evaluates the cumulative distribution for a Normal distribution
qnorm() evaluates the quantile for a Normal distribution qnorm(p) = pnorm^-1(p)
N.B.: lower tail: left part vs upper tail: right part
rpois() generates random numbers for Poisson distribution (input: number of points and rate, output: integers)
ppois() evaluates the cumulative distribution for a Poisson distribution (input: value-inferior-or-equal-to and rate, output: probability)
rbinom() generates binary numbers
Set the seed:
set.seed(x)
Always call it (before the first random call) in order to ensure reproducibility.
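Reproducibility via set.seed() can be verified directly (the seed value is arbitrary):

```r
set.seed(10)
a <- rnorm(3)
set.seed(10)     # resetting the seed replays the same sequence
b <- rnorm(3)
identical(a, b)  # TRUE
```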
generating numbers from linear model:
Example:
# Suppose y = a + b * x + e
# a: 0.5
# b: 2
# e ~ N(0, 2^2)
# x ~ N(0, 1^2)
set.seed(20)
x <- rnorm(100)
e <- rnorm(100, 0, 2)
y <- 0.5 + 2 * x + e
summary(y)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-6.4080 -1.5400 0.6789 0.6893 2.9300 6.5050
# graph plots, we can see the line
plot(x, y)
generating numbers from generalized linear model:
generalized linear model means distribution is not Normal (for instance: Poisson).
Example:
# Y ~ Poisson(mu)
# log mu = a + bx
# a = 0.5
# b = 0.3
set.seed(1)
x <- rnorm(100)
log.mu <- 0.5 + 0.3 * x
y <- rpois(100, exp(log.mu))
summary(y)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 1.00 1.00 1.55 2.00 6.00
plot(x, y)
generating samples:
sample() draws random samples from a vector
no repetition by default (permutation)
repetition can be set via arg replace=TRUE
example:
> set.seed(3)
> sample(1:10, 10)
[1] 2 8 4 3 9 6 1 5 10 7
> sample(1:10, 10, replace=TRUE)
[1] 6 6 6 6 9 9 2 8 9 3
✔ Plotting with Base Graphics (23:22) @done (14-01-23 17:50)
plotting & graphic packages:
graphics: base functions such as plot(), hist(), boxplot(), etc...
lattice: trellis functions such as xyplot(), bwplot(), etc. (independent of graphics)
grid: low-level graphic routine (seldom used)
grDevices: graphic devices (screen output / file generation) such as X11, PDF, PNG, etc...
process:
output: screen? file?
usage: for temp screen display, a presentation, a paper?
data volume: a few points vs huge data set?
resizable?: bitmap formats (JPEG, PNG) vs vector formats (PDF, PostScript)
which package?: base (simpler, built piecemeal) or grid/lattice ("single function call"); they cannot be mixed
graphics package:
example:
> x <- rnorm(100)
# draw histogram
> hist(x)
# close window
> dev.off()
null device
1
# draw histogram (opens new window)
> hist(x)
# draw another histogram (reuses window)
> hist(2 * x)
graphic parameters:
see par() function for setting graphic parameters for current session only
some params can be overriden in specific plotting functions
important params:
pch: plotting symbol (default: open circle)
lty: line type (default: solid line)
lwd: line width
col: plotting color (default: black)
las: orientation of axis labels
bg: background color
mar: margin sizes (c(bottom, left, top, right))
oma: outer margin sizes
mfrow: number of plots per row, column (plots filled row-wise)
mfcol: idem (plots filled column-wise)
functions:
plot: draws a plot (scatterplot usually, other types of plots depending on input object)
hist: draws a histogram
lines: adds lines to a plot
points: adds points to a plot
text: adds text
title: adds titles (axis, title, sub-title, margin)
mtext: adds text to margins
axis: add axis tick marks or labels
legend: adds legend
devices:
?Devices lists the available devices
vector formats (resizable):
pdf (resizes well, portable)
postscript (older, less used)
bitmap formats (+: well-suited for plots with many points, -: do not resize well):
png :good for solid color, lossless compression
jpeg: good for pictures, lossy compression
bitmap: less used
bmp: native Windows format, less used
example:
# generate plot in PDF file:
> pdf(file = "testRplot.pdf")
> x <- rnorm(100)
> hist(x)
> dev.off()
copying device:
to export to file, either: 1. open device, make plot, close device
2. make plot on default device (screen), then copy it to other device
copying device is not an exact operation!
functions: dev.copy(), dev.copy2pdf(), dev.list(), dev.set(), dev.off()
✔ Base Graphics Plotting Demo (16:56) @done (14-01-23 18:18)
# launch plot demo, useful to show params (plotting symbols, etc):
example(points)
# trick use different plotting symbols on the same plot:
plot(x, y, type = "n") # draws axes, legends, etc... but not the points
points(x[group == 'Male'], y[group == 'Male'], col = "blue") # draw "Male" points in blue via factor "group"
points(x[group == 'Female'], y[group == 'Female'], col = "pink") # draw "Female" points in pink via factor "group"
✔ Plotting with Lattice Graphics (7:18) @done (14-01-23 18:57)
Main Lattice functions:
xyplot() for scatterplots
bwplot() for boxplots
histogram() for histograms
stripplot() for boxplots with points
dotplot()
splom() for scatterplot matrices (directly on a dataset)
levelplot()/contourplot() for image data
example:
library(lattice)
library(nlme)
xyplot(distance ~ age | Subject, data = Orthodont)
functions do not directly print:
the returned object is of type "trellis" (it can be stored, but it's better to save the code)
a print() method call is needed to draw on a device (auto-printed in the console)
arguments:
first arg: formula like "y ~ x | f * g" (x, y: inputs ; f, g: optional factors)
"data" arg: data
"panel" (optional): for extra items per group (ex: draw average line via panel.abline() or regression line via panel.lmline())
✔ Lattice Graphics Plotting Demo (21:23) @done (14-01-23 19:35)
# display documentation
package ? lattice
# list the package's functions:
library(help = lattice)
# split a continuous variable into 4 ranges that overlap slightly
# useful to see the influence of a third variable on the relation between two others
temp.cut <- equal.count(environmental$temperature, 4)
wind.cut <- equal.count(environmental$wind, 4)
xyplot(
# 2 4-level factors combined (wind and temp) => 16 panels
ozone ~ radiation | temp.cut * wind.cut,
data = environmental, as.table = TRUE,
# add regression line
panel = function (x, y, ...) {
# draw plots
panel.xyplot(x, y, ...)
# add regression line
fit <- lm(y ~ x)
panel.abline(fit)
# add a smoother (loess)
panel.loess(x, y)
})
✔ Plotting with ggplot2: Part 1 (24:18) @done (14-01-27 09:34)
ggplot2 = CRAN package that implements "Grammar of Graphics" (book)
= "3rd graphics system" (after base and lattice)
concept: verbs, nouns & adjectives
workflow:
1. start with base function (ie plot)
2. use annotation functions (text, lines, points, axis) to add/modify
many automatic stuff (but customisation is possible)
basic function, qplot():
similar to base's plot()
source: a data frame (always), from input (preferred) or the workspace
output: aesthetics (size, shape, color) + geoms (points, lines)
qplot(): simple function (hides complexity)
ggplot(): advanced function (more powerful and flexible)
examples:
install.packages("ggplot2")
library(ggplot2)
# draw plots and show sub-groups (based on 'drv' factor variable):
qplot(displ # x coord
,hwy # y coord
,data = mpg # data frame
,color = drv # change aesthetics: color points via 'drv' factor
# legend is automatically added
)
# draw plots and smooth line ("loess"):
qplot(displ # x coord
,hwy # y coord
,data = mpg # data frame
,geom = c("point", "smooth") # add geom: smooth line + 95% confidence interval
)
# draw histogram:
qplot(hwy # only 1 variable
, data = mpg
, fill = drv)
# draw facets (i.e. groups, like panels for lattice):
qplot(displ, hwy, data = mpg
, facets = . ~ drv # pattern is "row var" ~ "col var", "." means empty
# here, drv has 3 levels => 3 facets
)
✔ Plotting with ggplot2: Part 2 (28:35) @done (14-01-27 17:26)
advanced function, ggplot:
example:
# prepare a plot on 2 variables:
> g <- ggplot(mpg, aes(displ, hwy))
# check:
summary(g)
data: manufacturer, model, displ, year, cyl, trans, drv, cty, hwy, fl,
class [234x11]
mapping: x = displ, y = hwy
faceting: facet_null()
# auto save and print:
p <- g + geom_point() # geom_point is a layer to draw points
> print(p)
# print only:
g + geom_point()
# print (plots) with additional smooth line:
g + geom_point() + geom_smooth()
# print (plots) with additional smooth line w/ linear model (removes noise):
g + geom_point() + geom_smooth( method = "lm" )
# print (plots) with facets:
g + geom_point() + facet_grid(. ~ cyl) + geom_smooth( method = "lm" )
annotation:
labels: xlab(), ylab(), labs(), ggtitle()
addit. drawings: geom_*() like geom_smooth, etc...
global vars: theme() (example: theme(legend.position = "none"))
appearance: theme_gray() for gray background (default), theme_bw() is a black and white UI theme
modify aesthetics:
# set point size + constant point color + alpha transparency:
g + geom_point( color = "steelblue", size = 4, alpha = 1/2)
# set point size + dynamic point color:
g + geom_point( aes(color = drv) , size = 4 )
# print (plots) with custom labels:
g + geom_point() + labs(title="my title") + labs(x = "custom x axis") + labs( y = expression ("custom y axis for " * hwy) )
# axis limit, filter outlier values:
g + geom_point() + ylim(-3, 3)
# axis limits, set outlier values out of scale (by default, all values are displayed):
g + geom_point() + coord_cartesian( ylim = c(-3, 3) )
Make categories over continuous values:
If the conditioning variable is continuous (many values, not just 3 or 4), cut it into ranges
example:
quantile(mpg$cty, seq(0, 1, length=4), na.rm = TRUE)
0% 33.33333% 66.66667% 100%
9 15 18 35
cutpoints <- quantile(mpg$cty, seq(0, 1, length=4), na.rm = TRUE)
mpg$cty_range <- cut(mpg$cty, cutpoints)
levels(mpg$cty_range)
[1] "(9,15]" "(15,18]" "(18,35]"
✔ Plotting with Mathematical Annotation (6:03) @done (14-01-28 16:06)
Math symbols can be set in labels, plots, etc... (LaTeX-like symbols)
Use expression() function to write math symbols
cf ?plotmath
examples:
# create base plot with title 'theta = 1':
plot(1, 2, main = expression(theta == 1))
# create base plot with title = sum of xi * yi (a sigma symbol is used):
plot(1, 2, main = expression(sum(x[i] * y[i], i==1, n)))
# expression with dynamic value (i.e. from variable):
v <- -2
# Will set the title to "x-bar = -2" (bar(x) renders as x with a bar on top)
plot(1, 2, main = substitute(
bar(x) == val, # the expression
list(val=v) )) # list of variables to be substituted
✔ Week 4 @done (14-01-29 10:49)
✔ Plotting and Color in R (22:06) @done (14-01-28 17:08)
default colors (1: black and white, 2: red, 3: green, etc...) are not very pretty!
grDevices package has 2 functions:
colorRamp(colors) takes colors as input and returns a function that takes a number from 0 to 1 and returns RGB values
colorRampPalette(colors) takes colors as input and returns a function that takes a number n and returns n colors (as a character vector of hex RGB values)
They blend colors (interpolation)
colorRamp() examples:
pal <- colorRamp(c("red", "blue"))
# RGB color for red:
pal(0)
[,1] [,2] [,3]
[1,] 255 0 0
# RGB color for blue:
pal(1)
[,1] [,2] [,3]
[1,] 0 0 255
# RGB color in between:
pal(0.5)
[,1] [,2] [,3]
[1,] 127.5 0 127.5
colorRampPalette examples:
pal <- colorRampPalette(c("red", "blue"))
# return 2 colors (red and blue):
> pal(2)
[1] "#FF0000" "#0000FF"
# return 10 colors from red to blue:
> pal(10)
[1] "#FF0000" "#E2001C" "#C60038" "#AA0055" "#8D0071" "#71008D" "#5500AA"
[8] "#3800C6" "#1C00E2" "#0000FF"
colors() lists color names (instead of RGB values)
gray() is the equivalent of colorRamp() for black and white
usage:
pal <- colorRampPalette(c("red", "yellow", "blue"))
x <- rnorm(100)
# plot 100 points, first will be red, 50th will be yellow, 100th will be blue, and other will be in between:
plot(x, col = pal(100))
RColorBrewer package:
3 types of palettes
sequential, for continuous order data (ex: "Blues" that goes from light blue to dark blue)
diverging (example: positive vs negative values) => from dark color 1 to light (in the middle) to dark color 2 (ex: "Spectral")
qualitative, for data that are not ordered (categorical data) => each color is very different from the previous one (ex: "Set1")
Main function: brewer.pal(numberOfColors, paletteName)
Can be used with colorRamp() and colorRampPalette()
Example:
# draw 100 plots using plot colors from light blue to dark blue via "Blues" palette:
library(RColorBrewer)
# use 3 colors (we don't need more) from the "Blues" palette
cols <- brewer.pal(3, "Blues")
pal <- colorRampPalette(cols)
# generate 100 colors
plot(x, col = pal(100))
NB: smoothScatter(), for plotting a huge number of points, uses the RColorBrewer package
Additional notes:
rgb() returns RGB colors and handles transparency ('alpha' param)
Can be useful for overlapping plots:
x <- rnorm(10000)
plot(x, col = rgb(0, 0, 0, 0.1))
✔ Dates and Times (10:29) @done (14-01-28 17:53)
date:
a day in year (no time)
via Date class (number of days since 1970/01/01)
time:
date + time + timezone
POSIXct and POSIXlt classes (number of seconds since 1970/01/01)
POSIXct uses a big integer
POSIXlt is a list with additional info such as day of the week, day of the year etc... (year, month, yday, hour, min, sec, etc...)
# date from string:
d <- as.Date("2013-12-25")
# number of days since 1970:
daysSinceNow <- unclass(d)
[1] 16064
# generic functions for dates and times (output shown for a French locale):
weekdays(d)
[1] "mercredi"
months(d)
[1] "décembre"
quarters(d)
[1] "Q4"
# system time:
t <- Sys.time()
dt <- as.POSIXlt(t)
d$sec
Error in d$sec : $ operator is invalid for atomic vectors
dt$sec
[1] 51.66967
data/time format:
strptime() converts one or more strings to time objects (see help for formats)
date/time operations:
via standard functions: +, -, ==, >, etc...
+ conversion functions: as.Date, as.POSIXct, as.POSIXlt
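Sketch of date arithmetic with the standard operators (the dates are invented):

```r
d1 <- as.Date("2013-12-25")
d2 <- as.Date("2014-01-01")
d2 - d1               # Time difference of 7 days
d2 > d1               # TRUE
as.numeric(d2 - d1)   # 7
```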
✔ Regular Expressions (27:21) @done (14-01-28 18:39)
regexp = combination of literals (text) and meta-characters (ex: starts with, alternative, word boundary)
In R, a way to extract data from "unfriendly" sources (web sites, messy text files, etc...).
meta-characters:
^: line that starts with text (ex: "^foo")
$: line that ends with text (ex: "foo$")
[]: a set of characters (ex: [Nn][Ii][Cc][Oo] to find 'nico', ignoring case)
can be used with ranges (ex: [0-9] or [a-z])
[^]: negated set (ex: "[^?.]$" matches lines that do NOT end with '?' or '.')
.: any single character
|: alternatives (ex: "dev|coder" will return lines containing either "dev" or "coder")
(): indicates scope (ex: "^foo|bar" matches lines that start with "foo" or contain "bar",
vs "^(foo|bar)" which matches lines that start with "foo" or "bar")
also stores the matched text ("grouping") in \1, \2, etc...
?: for optional expressions (ex: "George( [Ww]\. )? Bush" matches "George W. Bush" and "George Bush")
\: escape meta-character (ex: "\." means period, not "any character" meta-char)
+: at least one of...
*: any of... (including none)
"greedy" => matches the longest possible string
greediness can be stopped via ?
{}: custom repetition: {min,max} interval, {n} exactly, or {min,} at least
(ex: "[Bb]ush( +[^ ]+ +){1,5} debate" matches lines having 1 to 5 words between "Bush" and "debate")
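The meta-characters above can be exercised with grepl() (the test strings are invented):

```r
x <- c("New York", "Newark?", "Vermont.")
grepl("^New", x)       # TRUE TRUE FALSE  (starts with "New")
grepl("[^?.]$", x)     # TRUE FALSE FALSE (does not end with '?' or '.')
grepl("York|mont", x)  # TRUE FALSE TRUE  (alternatives)
```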
✔ Introduction to Baltimore City Homicide Data (4:20) @done (14-01-28 18:44)
✔ Regular Expressions in R (30:08) @done (14-01-28 20:26)
grep()/grepl() search in a character vector; return matching index numbers (grep) or a logical vector (grepl)
regexpr()/gregexpr() are similar but return the index in the string where the match begins, plus the match length
regexpr() reports the first match only, gregexpr() all matches
used in conjunction with regmatches()
sub()/gsub() search and replace (sub for first match, gsub for all matches)
regexec() gives indices for sub-expressions (with parentheses)
demo:
# Count lines with "shooting":
length(grep("[Ss]hooting", homicides))
[1] 1005
# Count lines with "cause: shooting":
length(grep("[Cc]ause: [Ss]hooting", homicides))
[1] 1003
# Troubleshooting differences:
s1 <- length(grep("[Ss]hooting", homicides))
s2 <- length(grep("[Cc]ause: [Ss]hooting", homicides))
		setdiff(s1, s2) # wrong: compares the two counts, not the matching indices
[1] 1005
s1 <- grep("[Ss]hooting", homicides)
s2 <- grep("[Cc]ause: [Ss]hooting", homicides)
setdiff(s1, s2)
[1] 318 859
setdiff(s2, s1)
integer(0)
		# states that start with "New":
grep("^New", state.name) # line numbers
[1] 29 30 31 32
grep("^New", state.name, value=TRUE) # values
[1] "New Hampshire" "New Jersey" "New Mexico" "New York"
grepl("^New", state.name) # boolean vector
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
[37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[49] FALSE FALSE
# Extract dates (initial try):
> r <- regexpr("<dd>Found(.*?)</dd>", homicides[1:5])
> regmatches(homicides[1:5], r)
[1] "<dd>Found on January 1, 2007</dd>" "<dd>Found on January 2, 2007</dd>"
[3] "<dd>Found on January 2, 2007</dd>" "<dd>Found on January 3, 2007</dd>"
[5] "<dd>Found on January 5, 2007</dd>"
		# Extract dates (second try):
> r <- regexpr("<dd>Found(.*?)</dd>", homicides[1:5])
> m <- regmatches(homicides[1:5], r)
> d <- gsub("<dd>Found on |</dd>", "", m)
> as.Date(d[1], "%B %d, %Y")
[1] NA
		# NB: %B matches month names in the current locale; works in a US/English locale (here: French)
> as.Date(d, "%B %d, %Y")
[1] NA NA NA NA NA
		# Using sub-expressions via regexec():
r <- regexec("<dd>Found on (.*?)</dd>", homicides)
m <- regmatches(homicides, r)
dates <- sapply(m, function(x) x[2])
dates[1]
[1] "January 1, 2007"
dates <- as.Date(dates, "%B %d, %Y")
# hist() can automatically handle dates
hist(dates, "month", freq=TRUE)
✔ Classes and Methods in R (34:51) @done (14-01-29 10:49)
	S (and R!) supports OOP, with a few specifics of its own
2 systems:
- S3 classes and methods: informal, "old-style", easier, quick-and-dirty
- S4 classes and methods: more formal and rigorous, "new-style"
Separate systems, but can partially mix.
	S4 style is in the "methods" package (usually loaded by default).
Class definition: setClass()
Object creation: new()
	Method: a function that only operates on a certain class of objects
	Generic function: a function that only dispatches to methods (ex: plot() dispatches to different functions depending on the input data type)
help: ?Classes, ?Methods, ?setClass, ?setMethod, ?setGeneric
Generic functions:
S3:
# display "mean" function signature:
mean
function (x, ...)
UseMethod("mean")
<bytecode: 0x065e11dc>
<environment: namespace:base>
# dispatched methods:
methods("mean")
[1] mean.Date mean.default mean.difftime mean.POSIXct mean.POSIXlt
# display code for default method:
getS3method("mean", "default")
function (x, trim = 0, na.rm = FALSE, ...)
{
if (!is.numeric(x) && !is.complex(x) && !is.logical(x)) {
# ...
S4:
# display "show" function signature:
> show
standardGeneric for "show" defined from package "methods"
function (object)
standardGeneric("show")
<bytecode: 0x052674c4>
<environment: 0x04f907a0>
Methods may be defined for arguments: object
Use showMethods("show") for currently available ones.
(This generic function excludes non-simple inheritance; see ?setIs)
# display "show" dispatch methods:
> head(showMethods("show"))
Function: show (package methods)
object="ANY"
object="classGeneratorFunction"
object="classRepresentation"
	A generic function has at least one parameter (the object).
	If a matching method is found, it is called. Otherwise the default method is called if it exists; else an error is thrown.
	S3 dispatched methods should not be called directly!
	S4 dispatched methods cannot be called directly.
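The dispatch rules above can be sketched with a minimal S3 example ("myclass" and its print method are made-up names for illustration):

```r
# An S3 "class" is just an attribute on an ordinary object:
obj <- structure(list(value = 42), class = "myclass")

# An S3 method is registered by naming convention: generic.class
print.myclass <- function(x, ...) {
    cat("myclass object, value:", x$value, "\n")
    invisible(x)
}

print(obj)    # dispatches to print.myclass()
unclass(obj)  # stripping the class falls back to print.default()
```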
Write your own type:
	Why? To represent a custom model (e.g. gene expression data) that has no built-in type
Probably need to write methods for print()/show(), summary(), plot()
For new S4 type:
1. use setClass() to define:
- name of class
- data elements, "slots"
2. use setMethod() to define:
- methods
3. check class info:
via showClass()
Example:
# define polygon:
setClass("polygon", representation(x = "numeric", y = "numeric"))
# implement plot() for polygon, and register it (side-effect) for current session:
setMethod(
"plot", # generic function
"polygon", # class name
function (x, y, ...) {
			# set up the plot region without drawing anything (type="n")
plot(x@x, x@y, type="n", ...)
# draw lines
xp <- c(x@x, x@x[1])
yp <- c(x@y, x@y[1])
lines(xp, yp)
})
	# check that it has been registered:
showMethods("plot")
Function: plot (package graphics)
x="ANY"
x="polygon"
# create polygon object and call plot():
p <- new("polygon", x = c(1,2,3,4), y = c(1,2,3,1))
plot(p)
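As noted earlier, a new type usually also wants a print()/show() method; a minimal sketch for S4, re-defining the polygon class from the example so the snippet stands alone:

```r
library(methods)  # S4 machinery (usually loaded by default)

setClass("polygon", representation(x = "numeric", y = "numeric"))

# show() is the S4 analogue of print(); called on auto-printing
setMethod("show", "polygon", function(object) {
    cat("polygon with", length(object@x), "vertices\n")
})

p <- new("polygon", x = c(1, 2, 3, 4), y = c(1, 2, 3, 1))
p   # auto-printing dispatches to show()
```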