Eric Leung erictleung
@erictleung
erictleung / stackoverflow_remote_salaries.R
Created August 19, 2016 01:12
Compare remote to non-remote developers' salaries from StackOverflow's Developer Survey 2016
library(ggplot2)
library(dplyr)
so <- read.csv(file = "2016 Stack Overflow Survey Responses.csv") %>% as_tibble()
# What is the difference in salaries between people who report working remotely
# and people who don't?
# (Full-time remote + Part-time remote) vs (Rarely work remotely and Never)
so_salary_remote <- so %>% select(salary_midpoint, remote)
erictleung / parallel_example.R
Created October 30, 2016 21:34
Example code to parallelize R with plyr and doMC
# Parallelize R
# Source: http://blog.yhat.com/posts/running-r-in-parallel.html
library(plyr)
library(doMC)
# set number of cores (mine has two cores)
doMC::registerDoMC(cores = 2) # number of cores you have access to
# Process data
erictleung / write-to-excel.R
Created June 13, 2017 21:49
Code example to write data from within R to Excel sheets using openxlsx
# Load package
library(openxlsx) # Write data to Excel sheets
# Preview what is in the data
head(iris)
head(mtcars)
# Create workbook (i.e. file) to put data in
fileName <- "iris-mtcars.xlsx"
excel <- createWorkbook() # first argument is `creator`, not a file name
erictleung / rel_abundance.R
Created August 1, 2017 04:03
Show dangers of relative abundance
# Title:
# Show dangers of relative abundance
# Import graphics library
library(ggplot2)
# Create dummy data
a <- c(2, 3, 4, 15, 3)
b <- c(2, 3, 4, 30, 3)
sample <- 1:5
erictleung / correlations.R
Last active October 12, 2017 02:17
Copy of Matthews Correlation Coefficient https://github.com/cran/mccr/blob/master/R/mccr.R
# Question: https://stats.stackexchange.com/q/59343/132399
# Source: https://stats.stackexchange.com/a/59348/132399
matrix(c(5, 3, 2, 17), nrow = 2, byrow = TRUE)
#      [,1] [,2]
# [1,]    5    3
# [2,]    2   17
# Matthews correlation coefficient directly from Wikipedia formula
(5*17-3*2) / sqrt((5+3)*(5+2)*(17+3)*(17+2))
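The same calculation can be wrapped in a small function. This is a minimal sketch, not the mccr package's code: the name `mcc_from_counts` and the mapping of matrix cells to TP, FP, FN, and TN are assumptions made for illustration (the coefficient is the same if FP and FN are swapped).

```r
# Hypothetical helper: Matthews correlation coefficient from 2x2 counts
mcc_from_counts <- function(tp, fp, fn, tn) {
  numerator <- tp * tn - fp * fn
  denominator <- sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
  # Common convention: return 0 when any marginal total is zero
  if (denominator == 0) return(0)
  numerator / denominator
}

# Assumed layout: row 1 = (TP, FP), row 2 = (FN, TN)
m <- matrix(c(5, 3, 2, 17), nrow = 2, byrow = TRUE)
mcc <- mcc_from_counts(tp = m[1, 1], fp = m[1, 2], fn = m[2, 1], tn = m[2, 2])
print(mcc)
```

The result should match the inline expression above, since both evaluate the same formula.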
erictleung / entropy.R
Created October 26, 2017 00:41
Show unintuitive nature of entropy to describe diversity
# Demonstrate unintuitiveness of entropy to describe diversity
# Resources used:
# - Code influenced by https://stackoverflow.com/a/27255815
# - Problem from http://www.biodiverseperspectives.com/2013/03/25/an-intuitive-approach-to-describing-diversity/
# Load libraries
library(entropy)
# Make two communities, each with equally abundant species
eightspecies <- rep(letters[1:8], 5)
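As a sketch of where this comparison can lead, Shannon entropy can also be computed in base R without the entropy package. The helper `shannon` and the second community `sixteenspecies` are assumptions for illustration and may differ from the rest of the gist.

```r
# Hypothetical helper: Shannon entropy (natural log) from a vector of species labels
shannon <- function(community) {
  p <- table(community) / length(community)  # relative abundances
  -sum(p * log(p))
}

# Two communities, each with equally abundant species
eightspecies <- rep(letters[1:8], 5)
sixteenspecies <- rep(letters[1:16], 5)

h8 <- shannon(eightspecies)     # equals log(8) for 8 equally abundant species
h16 <- shannon(sixteenspecies)  # equals log(16)

# Doubling the number of species does not double the entropy
print(h16 / h8)
```

With equally abundant species the entropy is the log of the species count, so doubling the count from 8 to 16 raises it by a factor of only 4/3.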
erictleung / gini_contrast.R
Created October 26, 2017 17:44
Quick script to calculate Gini index and show it is not intuitive for diversity
# Quick script to calculate Gini index and show that it is counterintuitive to
# how we should think about diversity
# Calculate Gini-Simpson quadratic index
# gini = 1 - \sum_{i = 1}^R p_i^2
gini <- function(x) {
  temp <- x / sum(x)
  1 - sum(temp * temp)
}
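A quick usage sketch for the Gini-Simpson function above. The function is repeated so the block runs on its own, and the two count vectors are invented toy communities, not data from the gist.

```r
# Gini-Simpson index: 1 minus the sum of squared relative abundances
gini <- function(x) {
  p <- x / sum(x)
  1 - sum(p * p)
}

# Toy communities (assumed): equally abundant species, 5 individuals each
eight <- rep(5, 8)
sixteen <- rep(5, 16)

g8 <- gini(eight)     # 1 - 1/8  = 0.875
g16 <- gini(sixteen)  # 1 - 1/16 = 0.9375
print(c(g8, g16))
```

Doubling the number of species moves the index only from 0.875 to 0.9375, which is the counterintuitive behavior the script's description refers to.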
# ---
# title: Export QIIME2 OTU table to compatible file for phyloseq
# description: |
# Three main steps to get to compatible file to import to phyloseq
#
# Outline:
# 1. Export OTU table
# 2. Export taxonomy table
# 3. Export phylogenetic tree
# ---
erictleung / find_largest_files.sh
Last active October 1, 2018 23:54
Bash script to find the largest files on a computer. Tested with bash on macOS.
#!/usr/bin/env bash
# Purpose: Find largest directories and files in your home directory
# Note: Tested on macOS
# To Run:
# $ bash find_largest_files.sh
# Output:
# - top_100_largest.txt
# - Output to the terminal total space used in home directory
# To explore results:
erictleung / mathematics_topics.md
Last active October 5, 2018 23:57
Mathematical topics of interest to bioinformaticians and computational biologists

Mathematics Topics

Running list of topics that may be of interest.

Statistics

  • Maximum likelihood estimation (MLE)
  • Experimental design
  • Bayesian statistics with priors and posterior distributions
  • Cross-correlations: the correlations between the entries of two random vectors X and Y. By contrast, the correlations among the entries of a single random vector X form the correlation matrix of X.
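
To make the cross-correlation distinction concrete, here is a small base R sketch on simulated data. The variable names, sample size, and the way Y is built from X are arbitrary assumptions for illustration.

```r
set.seed(1)
n <- 200

# X: three independent variables; Y: two variables, the first depends on X[, 1]
X <- matrix(rnorm(n * 3), ncol = 3)
Y <- cbind(X[, 1] + rnorm(n), rnorm(n))

# Cross-correlation matrix: entry (i, j) is cor(X[, i], Y[, j])
cross <- cor(X, Y)

# Correlation matrix of X itself: symmetric, with ones on the diagonal
within_x <- cor(X)

print(dim(cross))     # 3 rows (columns of X) by 2 columns (columns of Y)
print(dim(within_x))  # 3 by 3
```

The cross-correlation matrix is generally rectangular and not symmetric, whereas the correlation matrix of X is always square and symmetric.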