@NobodyXu
Last active May 20, 2019 06:13
About R
  • rbindlist from data.table is very efficient at binding a list of data frames (or data tables) row-wise into one.

  • According to here, the most efficient way to remove a column is to use set from data.table:

    library("data.table")
    
    # set from data.table
    set(my_df, j = "A", value = NULL)
  • Semantics:

    • assignment:

      • When assigning a variable to another name, e.g. a = b, no new object is created: both names refer to the same object. No data is copied until one of them is modified, due to the copy-on-modify semantics.
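A minimal sketch of the assignment behaviour described above. The variable names are only illustrative; the point is that modifying one name leaves the other untouched, because the copy happens at modification time.

```r
b <- c(1, 2, 3)
a <- b          # a and b now name the same vector; no data is copied yet
a[1] <- 100     # modifying through a triggers the copy

print(a)        # the modified copy
print(b)        # the original values, unaffected by the change to a
```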
    • In order to xor booleans, use xor(a, b).

    • remainder and quotient

      • %% gives the remainder and %/% gives the quotient (integer division).
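As a short illustration of xor, %% and %/% (the numbers are arbitrary). Note that with a negative left operand, %% returns a result with the sign of the divisor and %/% rounds towards negative infinity.

```r
either_but_not_both <- xor(TRUE, FALSE)   # TRUE
both_true           <- xor(TRUE, TRUE)    # FALSE

remainder <- 7 %% 3        # 1
quotient  <- 7 %/% 3       # 2

neg_remainder <- (-7) %% 3     # 2, not -1
neg_quotient  <- (-7) %/% 3    # -3, not -2
```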
    • To access a list nested inside a list, [[index]] must be used.

    • For returning a vector from a data.frame or data.table, df[[one_list_index]] must be used.
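A small sketch of the difference between [[ ]] and [ ] mentioned in the two points above. The list and data frame here are made-up examples.

```r
nested <- list(inner = list(x = 1, y = "two"), n = 3)

inner_y    <- nested[["inner"]][["y"]]   # the element itself: "two"
still_list <- nested["inner"]            # single [ ] keeps the list wrapper

df <- data.frame(A = 1:3, B = c("a", "b", "c"))

col_vector <- df[["A"]]   # a plain integer vector
col_frame  <- df["A"]     # still a data.frame with one column
```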

    • slicing:

      • Slicing happens when you index a container (vector, list, etc.) with [] using more than one index, generated by seq, : or c(). The indices can be integers or characters.
      • When slicing a list, a shallow copy of the subset of the original container is created. That is, a new list is created, but its elements are just references to the originals, with copy-on-modify semantics. See here for more.
      • Positive integer slicing
        • When slicing using positive integer(s), only the elements specified by the integers will be in the new subset.
      • Negative integer slicing
        • This works the opposite way to positive integer slicing: the elements specified by the integers are the ones left out of the subset. See here for more.
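The slicing rules above can be sketched as follows; the vector and list contents are arbitrary examples.

```r
v <- c(10, 20, 30, 40, 50)

middle       <- v[2:4]        # positive indices keep elements 2 to 4
ends         <- v[c(1, 5)]    # keep only the first and last elements
without_ends <- v[-c(1, 5)]   # negative indices drop the first and last

l <- list(a = 1, b = "x", c = TRUE)

by_name   <- l[c("a", "c")]   # character indices select by name
without_b <- l[-2]            # drop the second element; the result is still a list
```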
    • subset(x, select) function

      • The subset function can be used to remove a column easily:
        subset(df, select = -column_name_to_remove) # "column_name_to_remove" is not a character string, it is the bare column name
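A runnable version of the same idea on a made-up data frame (the column names A, B and C are only for illustration):

```r
df <- data.frame(A = 1:3, B = c("a", "b", "c"), C = c(TRUE, FALSE, TRUE))

without_b  <- subset(df, select = -B)         # drop one column by its bare name
without_ab <- subset(df, select = -c(A, B))   # drop several columns at once

names(without_b)
names(without_ab)
```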
        
    • Compare an array/data frame with a single value and generate a logical array/data frame of the same dim

      • Each element is compared with the value, and the result can be indexed in the same way as the original array/data frame. E.g. v == value or dataframe$column_name == value.
    • Count TRUEs

      • which(x), where x is a logical vector/array, returns an integer vector holding the indices of the TRUEs. Its length equals sum(x), i.e. the number of TRUEs.
      • sum(x) gives the number of TRUEs directly, without building the index vector.
      • It seems that sum(bools) is faster than length(which(bools)) when bools is considerably long.
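A brief sketch tying the comparison and counting points together. The data is invented, and no timing claim is made here.

```r
v <- c(3, 7, 3, 9, 3)

hits      <- v == 3              # element-wise comparison, same length as v
positions <- which(hits)         # indices of the TRUEs
n_sum     <- sum(hits)           # number of TRUEs
n_which   <- length(which(hits)) # same count, via the index vector

df <- data.frame(score = c(10, 25, 10))
tens <- df[df$score == 10, , drop = FALSE]   # the logical result used as a row index
```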
    • Def function:

      name_of_function = function(arg1, arg2 = 1) { # Arguments can have default values
           # expr
           # The return statement is not always necessary: the value of the last evaluated
           # expression is returned automatically by R.
           return (expr) # If expr is omitted, NULL will be returned. expr can even be a function
      }
      
      • To be precise, this defines an anonymous function (a lambda) and binds it to a name, rather than declaring a named function.

      • Here, the function is stored in a variable. function can also be used inside the body of another function.

      • It is also worth noting that a function can access the variables defined in the environment where the function is defined.
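The points above (default arguments, implicit return, functions returning functions, and access to the defining environment) can be sketched together. The names make_scaler, double_it, triple_it and no_value are hypothetical.

```r
# A function that returns another function. The inner function can read
# `factor` because it is defined in the environment of make_scaler's call.
make_scaler <- function(factor = 2) {
  function(x) x * factor    # last expression, returned without return()
}

double_it <- make_scaler()     # uses the default factor = 2
triple_it <- make_scaler(3)

no_value <- function() {
  return()                     # return with no expression gives NULL
}

double_it(5)
triple_it(5)
```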

    • stop:

      • stop is a function that is called with a message (or a condition object). It stops the execution of the current expression and executes an error action.
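A minimal sketch of stop in use, caught with tryCatch so the script can continue. safe_sqrt is a hypothetical helper written for this example.

```r
safe_sqrt <- function(x) {
  if (x < 0) {
    stop("x must be non-negative")   # signals an error and aborts the call
  }
  sqrt(x)
}

# tryCatch intercepts the error and extracts its message instead of halting.
result <- tryCatch(
  safe_sqrt(-1),
  error = function(e) conditionMessage(e)
)
print(result)
```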
    • for loop:

          for (each in collections) { # collections can be a vector, list, data frame, matrix, etc.
              expr
          }
      
      • Speeding up your R code - vectorisation tricks for beginners shows that loops are expensive on large data compared to the apply function family written in R, and that external calls to C functions are quicker still.

      • However, this is not always true, so it is better to benchmark and to understand what happens under the hood in order to use them correctly.
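To make the comparison concrete, here is the same computation written three ways. This is only a sketch of the styles being compared; it does not measure anything, and relative speed should be checked with a benchmark on real data.

```r
x <- 1:10

# 1. Explicit for loop, filling a preallocated result vector.
squares_loop <- numeric(length(x))
for (i in seq_along(x)) {
  squares_loop[i] <- x[i]^2
}

# 2. The apply family: sapply calls the function once per element.
squares_apply <- sapply(x, function(v) v^2)

# 3. Vectorised arithmetic: the loop runs inside R's compiled code.
squares_vec <- x^2
```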

    • while, if and else work just like in C

    • switch:

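      • switch in R is like a function: switch(VALUE, COND1_ret_value, ...). If VALUE is a number n, the n-th of the remaining arguments is returned; if it is a character string, the argument with the matching name is returned.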
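A short sketch of both forms of switch. describe_shape is a hypothetical function; the trailing unnamed argument acts as the default when no name matches.

```r
describe_shape <- function(kind) {
  switch(kind,
         circle = "round",
         square = "four sides",
         "unknown shape")        # unnamed last argument: the fallback value
}

second <- switch(2, "a", "b", "c")   # numeric selector picks by position

describe_shape("square")
describe_shape("triangle")
```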
  • Builtin data structures:

    • vector and list
      • vector
        • vector is a homogeneous container. Since there is only one type of element, the elements are stored contiguously. vector also has lower memory consumption compared to list if the length is not too large.
        • vector(mode = "logical", length = 0) is used to construct a vector of the given length storing elements of type mode. For how elements are allocated, see help(vector).
        • c(...) can be used to initialize a vector. It can also be used to combine vectors and new elements of the same type into one vector (not a vector of vectors).
      • list
        • list is a heterogeneous container, so it stores each element by storing a pointer to it. It is very useful since you can make a list of lists using list(...).
        • c(...) can be used to combine list and any other type of new elements together into one list (not list of list).
        • To make list of list, you need to use list(...) to combine lists.
      • To append to a list or vector, you can use list.append(.data, ...) from package rlist, where .data is the container and ... is the elements.
      • Insert: using list.insert(.data, index, ...) from rlist.
      • push_front: using list.prepend(.data, ...) from rlist.
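The construction and combination rules above can be sketched in base R. Note the assumption: instead of the rlist functions named above, this sketch uses base R's append(), a swapped-in alternative that needs no extra package.

```r
zeros <- vector(mode = "numeric", length = 3)   # numeric vector of three zeros
longer <- c(zeros, 4, 5)                        # c() flattens into one vector

l1 <- list(1, "a")
l2 <- list(TRUE)

flat   <- c(l1, l2)      # one list with 3 elements
nested <- list(l1, l2)   # a list of 2 lists

# Base R stand-ins for append, insert and push_front:
appended  <- append(flat, list("end"))
inserted  <- append(flat, list("mid"), after = 1)
prepended <- append(flat, list("first"), after = 0)
```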
    • vector of logical
      • To perform &&, || or ! action on vector of logical: use &, | or !.
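For example, with two made-up logical vectors, the element-wise operators return a vector of the same length:

```r
a <- c(TRUE, TRUE, FALSE)
b <- c(TRUE, FALSE, FALSE)

both   <- a & b   # element-wise and
either <- a | b   # element-wise or
not_a  <- !a      # element-wise negation
```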
  • Builtin functions:

    • help(x), ?x, ??x
      • Provide the manual page about x (??x searches the help system for x).
    • object.size(x)
      • Get the size of an object.
    • rm(x)
      • Delete the name x and release its memory if no other names refer to it (due to copy-on-modify semantics).
    • gc()
      • Do garbage collection immediately. It can be useful to call it after a large object has been removed, to return memory to the operating system. GC happens automatically without any user intervention, so a call to gc() normally isn't necessary, and it can hurt performance if called after the removal of every object. For more, see help(gc) and help(gctorture).
    • help(Memory):
      • Documents how objects are allocated in R.
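A small sketch of object.size, rm and gc used together. big_vector is a hypothetical name, and the explicit gc() call is shown only for illustration since collection normally happens on its own.

```r
big_vector <- numeric(1e6)             # one million doubles

size <- object.size(big_vector)        # size in bytes, as an object_size value
print(format(size, units = "MB"))

rm(big_vector)                         # remove the name; memory becomes collectable
still_there <- exists("big_vector")    # FALSE once the name is gone

invisible(gc())                        # force a collection now
```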
  • Making packages

    1. Write a DESCRIPTION file at the root of the project:
    Package: Helloworld
    Title: What The Package Does (one line, title case required)
    Version: 0.1
    Author: person("First", "Last", email = "[email protected]",
    Maintainer:
    Description: What the package does (one paragraph)
    Depends: R (>= 3.1.0)
    License: What license is it under?
    LazyData: true
    ByteCompile: true
    RoxygenNote: 6.1.1
    
    2. Put code into root_of_pack/R/*.R.

    3. Then run roxygenise() from package roxygen2 with the current working directory at the root of the project, or run roxygenise(root_of_project).

    The info above is from Creating R packages, the byte compiler and from running vignette("roxygen2", package = "roxygen2") (it does not need library("roxygen2") to work).

    4. Then run R CMD check --check-subdirs=yes root_of_pack and fix any errors.

    5. Then run R CMD build root_of_pack to generate a *.tar.gz.

    6. Run R CMD check --check-subdirs=yes *.tar.gz, where *.tar.gz is the file generated by the previous step.

    7. Run R CMD INSTALL *.tar.gz to install the package.

    For more info on packages, check here.
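As a sketch of what a source file under root_of_pack/R/ might contain before running roxygenise(): the file name hello.R and the function hello are hypothetical, chosen only to match the Helloworld package name in the DESCRIPTION above. The #' comments are roxygen2 tags, which roxygenise() turns into documentation and NAMESPACE entries.

```r
# Hypothetical file: root_of_pack/R/hello.R

#' Say hello
#'
#' @param name Name to greet.
#' @return A character string containing the greeting.
#' @export
hello <- function(name = "world") {
  paste0("Hello, ", name, "!")
}
```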
