Created
March 8, 2019 13:09
R_workshop_3_8_19
# download R: https://cran.rstudio.com/
# download Rstudio: https://www.rstudio.com/products/rstudio/download/
### Brief intro to R language ###
# R is a FOSS statistical programming language
# R is an implementation of S (Bell Labs, 1976) by Ross Ihaka and Robert Gentleman (1995->2000)
# R name is a play on the S language and the authors' first names
# R is an interpreted language; the R command line interpreter (console) is written in R, C and Fortran
# R interpreter takes plain text input, interprets input, and generates numerical, text or graphical output
# R is a scripting language: human readable code, no need to compile programs (more like Bash or Mathematica than C or Java)
# R code can be run interactively or at command line
# Core R language and packages maintained by R Development Core Team
# Core R extended by user community to develop domain-specific packages (CRAN, Bioconductor, GitHub)
# Core R provides basic R.app CLI and GUI
# Rstudio (also FOSS) is a more advanced GUI developed by the Rstudio team (2010), who also develop many useful data science packages (ggplot2, tidyverse)
### Quick intro to Rstudio ###
# Rstudio is an Integrated Development Environment (IDE) that makes writing and running R code faster and easier
# 4 Panes: Source, Console/Terminal, Environment/History/Git, Files/Plots/Packages/Help: https://bookdown.org/ndphillips/YaRrr/the-four-rstudio-windows.html
# Only 3 visible until you open/create a source file
# Can resize and minimize/maximize panes
# Source: create, edit & save R code in plain text files. Can also select lines to be run in Console (Command + Return)
# Console/Terminal: CLI R interpreter (with arrowkey + mouse navigation). Can type directly at prompt or run commands selected in Source pane. Terminal allows access to system through shell
# Environment/History/Git: shows which packages and variables exist in your current environment; history of all commands run, and provides access to Git version control system
# Files/Plots/Packages/Help: shows what files are in project folder, interactive plot viewer, package manager, and help window/search
### Basics of R input and output ###
# boot up R.app & Rstudio
# > is prompt, blinking cursor
# any text after # symbol is treated as comment and ignored by interpreter
# use R for basic numerical calculations
```
# this is a comment
1 + 100 # this is a comment
```
# calculation is interpreted as the sum of two single-element vectors, resulting in a single-element vector indexed by [1]
# index is not a part of the output, just a helpful guidepost for when output has many elements
```
1:100 # integers 1 to 100; each output line starts with the index of its first element
```
# input can be split over multiple lines
# if input does not yield a completely interpretable statement, prompt will change from > to +
# + symbol does not indicate addition in this context
```
1 +
100
```
# if you get a + prompt that you don't want, hit Esc to kill it in R.app/Rstudio (use Ctrl+C in R CLI)
# R uses standard mathematical symbols and order of operations
```
3 + 5 * 2
(3 + 5) * 2
```
# R uses scientific E-notation format for very large and small numbers
```
2/10000
```
# 2e-04 = 2E-04 = 2e-4 = 2*10^(-4)
# 2e-04 != 2*exp(-04)
# often see P values of "< 2.2e-16" in papers, a give-away that someone is using R in their analysis (R prints P values below ~2.2e-16 this way)
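# A minimal sketch of the E-notation points above (the variable name fixed is only for illustration):

```r
2e-04 == 2E-04                              # TRUE: e and E are interchangeable
2e-04 == 2e-4                               # TRUE: the leading zero in the exponent is optional
fixed <- format(2e-04, scientific = FALSE)  # same value written in fixed notation
fixed                                       # "0.0002"
isTRUE(all.equal(2e-04, 2 * exp(-4)))       # FALSE: exp(-4) is e^-4, not 10^-4
```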
# non-integer numbers are represented as double-precision floating point numbers with 53 binary digits of accuracy (which corresponds to ~16 decimal digits)
# important because most non-integer numbers are not represented exactly & you can experience rounding errors that can compound
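# A short sketch of the rounding issue, using the classic 0.1 + 0.2 example (variable name a is arbitrary):

```r
a <- 0.1 + 0.2
a == 0.3                    # FALSE: neither side is stored exactly
print(a, digits = 17)       # shows that the stored value differs from 0.3 in the last digits
isTRUE(all.equal(a, 0.3))   # TRUE: all.equal() compares within a small tolerance
.Machine$double.digits      # 53 binary digits in a double
```

# This is why == is best kept for integers & strings, with all.equal() used for non-integer numbers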
# R has in-built mathematical functions
```
log(1) # what is the default base of log function in R?
log(10) # default base isn't 10
log(exp(1)) # default is natural log (base e)
log10(10) # log with base 10
log2(2) # log with base 2
log(3, base=3) # log with arbitrary base
```
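# The base argument should agree with the change-of-base formula log_b(x) = log(x) / log(b); a quick check with illustrative values:

```r
x <- 8
via_base_arg <- log(x, base = 2)   # log base 2 via the base argument
via_formula  <- log(x) / log(2)    # same quantity via natural logs
isTRUE(all.equal(via_base_arg, via_formula))   # TRUE
isTRUE(all.equal(via_base_arg, log2(x)))       # TRUE
```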
# R has in-built help functions that explain functions and provide examples
# help menu pane and tab autocompletion/help suggestions are better in Rstudio than R.app
```
?log
```
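# Some other base R ways to reach the same information, as a rough sketch (the search pattern is only an example):

```r
help("log")                        # same as ?log
args(log)                          # show the arguments and defaults of a function
log_functions <- apropos("^log")   # names of visible objects starting with "log"
log_functions
example("log")                     # run the examples from the help page
```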
# R has in-built comparison and logical operators
```
1 == 1 # equality (note two equals signs, read as "is equal to", only use for integers & strings)
1 != 2 # inequality (read as "is not equal to")
1 < 2 # less than
1 <= 1 # less than or equal to
1 > 0 # greater than
1 >= -9 # greater than or equal to
1 == 1 & 1 == 2 # AND
1 == 1 | 1 == 2 # OR
!(1 == 1) # NOT
```
### Variable naming, assignment & management ###
# Variables are named containers that store information
# Variables can be manipulated & referenced by the variable name
# Do not need to declare variables in R or assign a type (integer, string, dataframe) to them prior to use (dynamically typed)
# Variable names cannot start with a number or underscore, or contain spaces
# R variable naming conventions are:
```
periods.between.words
camelCaseToSeparateWords
underscores_between_words
```
# Google style guide for R code says period >> camelCase >> underscores: https://google.github.io/styleguide/Rguide.xml#identifiers (I disagree with this, and consider this merely a matter of style)
# Key is to be consistent in your variable naming style
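# A small sketch of dynamic typing: the same name can hold values of different types over time (the name my_value is arbitrary):

```r
my_value <- 1/40
class(my_value)    # "numeric"
my_value <- "forty"
class(my_value)    # "character"
my_value <- TRUE
class(my_value)    # "logical"
```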
# Assignment of values to variables is typically done using the leftward "<-" composite operator
# "<-" does not mean "less than negative"
```
x <- 1/40 # assign value to x (x added to environment)
x # print value currently assigned to x
x <- 1/30 # assign new value to x (x is overwritten; a second x is *not* added to environment)
log(x) # can use x in place of number in any calculation (value of log(x) reported to interpreter but not stored as variable in environment)
```
# the right hand side of the expression is evaluated before being assigned to the variable on the left hand side
# R also allows rightward assignment "1/30 -> x"
# R also allows "=" to be used for assignment, but this is not recommended by Google style guide: https://google.github.io/styleguide/Rguide.xml#assignment (I agree with this since "=" is used to set parameters in many functions)
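# The three assignment forms side by side, as a sketch (variable names x, y, z are arbitrary):

```r
x <- 1/30    # leftward assignment (preferred)
1/30 -> y    # rightward assignment
z = 1/30     # "=" assignment (works at top level, but not recommended)
same_value <- identical(x, y) && identical(x, z)
same_value            # TRUE: all three hold the same value
log(100, base = 10)   # inside a function call, "=" sets the argument named base; it does not assign a variable
```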
# Managing variables in your environment
# Variables that exist in current environment can be listed using:
```
ls()
```
# Variables (and their values) that exist in current environment can also be inspected in the Environment pane
# Variables starting with a "." are hidden from ls() and Environment pane
```
.x <- 0
ls()
```
# You can view all variables in your environment, including hidden ones, as follows:
```
ls(all.names=TRUE)
```
# Variables can be overwritten/modified by assignment
# Variables can be deleted as follows:
```
rm(.x)
```
# To remove all (non-hidden) variables from your environment:
```
rm(list = ls())
```
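# Note that rm(list = ls()) leaves hidden variables in place, because ls() does not list them
# A sketch run inside a scratch environment (scratch, x and .x are illustrative names) so that your own workspace is untouched:

```r
scratch <- new.env()
assign("x", 1, envir = scratch)
assign(".x", 0, envir = scratch)
rm(list = ls(scratch), envir = scratch)                     # removes x only
after_plain_rm <- ls(scratch, all.names = TRUE)
after_plain_rm                                              # ".x" is still there
rm(list = ls(scratch, all.names = TRUE), envir = scratch)   # removes hidden variables too
remaining <- ls(scratch, all.names = TRUE)
length(remaining)                                           # 0
```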
#################
### Exercises ###
#################
# 1) Will the following expression evaluate to TRUE or FALSE?
(1.25 * (1 * 0.8) - 1) == (1.25 * (3 * 0.8) - 3)
# 2) what are the values of the following expressions
1 * 0.8
1.25 * (1 * 0.8)
1.25 * (1 * 0.8) - 1
# 3) what are the values of the following expressions
3 * 0.8
1.25 * (3 * 0.8)
1.25 * (3 * 0.8) - 3
# 4) Will the following expression evaluate to TRUE or FALSE?
all.equal((1.25 * (1 * 0.8) - 1),(1.25 * (3 * 0.8) - 3))
########################
# Managing Projects in R
########################
### Creating a project in Rstudio ###
# Click the “File” menu button, then “New Project”.
# Click “New Directory”.
# Click “New Project”.
# Type in the name of the directory to store your project, e.g. “my_project”.
# If available, select the checkbox for “Create a git repository.”
# Click the “Create Project” button.
### Organizing files in a project ###
# Put each project in its own directory, which is named after the project
# Put text documents associated with the project in the /doc directory
# Put (small) raw data and metadata in the /data directory
# Put scripts in the /src directory
# Directories can be created in Files Pane
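# The same layout can also be created from the console; a minimal sketch, assuming a throwaway project location under tempdir() (replace project_dir with your own project path):

```r
project_dir <- file.path(tempdir(), "my_project")   # hypothetical location for illustration
for (sub_dir in c("doc", "data", "src")) {
  # recursive = TRUE also creates project_dir itself if it does not exist yet
  dir.create(file.path(project_dir, sub_dir), recursive = TRUE, showWarnings = FALSE)
}
list.files(project_dir)   # lists the doc, data and src directories
```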
### Formatting your input data ###
# All variables should have a separate column (don’t mix meaning in a column, add a new column if necessary)
# All data from the same variable go in same column
# Label your columns with terms that are meaningful to other people
# Do not leave white spaces in your variable names (use underscore or period, e.g. Air.Flow or air_flow) or data cells (use “n.a.”)
# Make sure headers are the first row in file
# Don’t leave any blank rows or columns in a file
# Don’t color code cells (add a new column with an indicator variable, e.g. “gfp_tagged” with values “y/n”)
# Save your data as plain text files in tab delimited or comma separated value (CSV) format
# After input data is cleaned, make read only (and/or keep under version control)
# import CSV data into R session as follows:
```
gapminder_data <- read.csv("gapminder_data.csv", header=TRUE)
head(gapminder_data)
```
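# A self-contained sketch of the formatting rules above, using a tiny made-up table (the file, column names and values are invented for illustration)
# It assumes missing cells were entered as "n.a.", which read.csv() can turn into NA via its na.strings argument:

```r
csv_path <- tempfile(fileext = ".csv")   # temporary file standing in for a real data file
writeLines(c("sample_id,air_flow,gfp_tagged",
             "s1,80,y",
             "s2,n.a.,n",
             "s3,62,y"), csv_path)
toy_data <- read.csv(csv_path, header = TRUE, na.strings = "n.a.")
str(toy_data)                    # air_flow is read as a number, with NA for the missing cell
sum(is.na(toy_data$air_flow))    # 1
```

# Without na.strings = "n.a." the whole air_flow column would be read as text rather than numbers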
### Managing results files ###
# Make separate results directories for each analysis (use date in dir_name)
# Treat generated output as disposable
# Keep large results files outside of git repository (or use git lfs)
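# One possible way to build a dated results directory, as a sketch (the location under tempdir() and the analysis label are hypothetical):

```r
results_root <- file.path(tempdir(), "results")        # hypothetical location for illustration
run_label <- paste0(format(Sys.Date(), "%Y-%m-%d"), "_gapminder_summary")
run_dir <- file.path(results_root, run_label)          # e.g. results/<today's date>_gapminder_summary
dir.create(run_dir, recursive = TRUE, showWarnings = FALSE)
dir.exists(run_dir)                                    # TRUE
```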
### installing/loading packages ###
# You can see what packages are installed by typing
```
installed.packages()
```
# You can install packages (e.g. phangorn package) by typing
```
install.packages("phangorn")
```
# You can update an installed package by typing (the first argument of update.packages() is a library path, so name the package with oldPkgs)
```
update.packages(oldPkgs = "packagename")
```
# You can remove a package with
```
remove.packages("packagename")
```
# You need to make a package available before using it in your R session (needs to be installed first)
```
library(phangorn)
```
# Package management is easiest with Packages pane (reports commands, avoids syntax errors)
# Packages often have complex dependencies which require installation of other packages
# Packages can be installed from source or binaries: http://r-pkgs.had.co.nz/package.html
# Some packages have many versions, important to record which version you are using
# As project nears completion it is a good idea to archive your R packages, since it may not be possible to reconstruct full environment in the future (alternatively use conda/bioconda)
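# A minimal sketch of recording versions, using the base package stats so that it runs anywhere (the output file name is hypothetical):

```r
R.version.string                                         # version of R itself
stats_version <- packageVersion("stats")                 # version of one installed package
stats_version
session_log <- file.path(tempdir(), "sessionInfo.txt")   # hypothetical file; in practice save it in your project
writeLines(capture.output(sessionInfo()), session_log)   # save R, OS and loaded package versions as text
```

# Keeping a file like this alongside your results makes it easier to report or rebuild the environment later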
### version control with Git ###
#################
### Exercises ###
#################
# 1) Create R project in Rstudio & create data, src, and doc directories.
# 2) Download gapminder data: https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder_data.csv
# 3) Move gapminder_data.csv into data folder.
# 4) Import gapminder_data.csv into R session, assign dataframe to variable, and inspect that your dataframe has been imported properly using head(). Note: you may need to modify the path to the input datafile.