Skip to content

Instantly share code, notes, and snippets.

View lwaldron's full-sized avatar

Levi Waldron lwaldron

View GitHub Profile
@lwaldron
lwaldron / HMP_controlled.R
Last active April 10, 2018 14:30
Accessing HMP controlled metadata
# Before starting, install the NCBI SRA Tookit:
# Debian/Ubuntu:
# apt install sra-toolkit
# macOS:
# brew install sratoolkit
# Windows:
# see https://tinyurl.com/y845ppaa
# Get your dbGaP repository from https://dbgap.ncbi.nlm.nih.gov/aa/wga.cgi?page=list_wishlists
# Click "get dbGaP repository key"
library(HMP16SData)
@lwaldron
lwaldron / OVC_SingleCell.Rmd
Created May 29, 2018 15:08
Import OVC single-cell data
---
title: "OVC single-cell analysis"
author: "Levi Waldron"
date: "5/29/2018"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
@lwaldron
lwaldron / CI_confusion.R
Last active June 2, 2018 12:51
What is the fraction of sample means expected to fall within the 95% confidence intervals? This is an incorrect (and not useful) interpretation of the 95% CI, but out of curiosity, the answer is less than 95%.
fracwithin <- function(n){
sam1 <- rnorm(n)
ci <- t.test(sam1)$conf.int
means <- replicate(100, mean(rnorm(n)))
mean(means > ci[1] & means < ci[2])
}
set.seed(2)
n <- seq(5, 10000, by=2)
fraction <- sapply(n, function(i) fracwithin(i))
@lwaldron
lwaldron / metaphlanToPhyloseq.R
Last active October 15, 2024 21:04
Import a table of MetaPhlAn taxonomic abundances into phyloseq
metaphlanToPhyloseq <- function(
metaphlandir,
metadat=NULL,
simplify=TRUE){
## tax is a matrix or data.frame with the table of taxonomic abundances, rows are taxa, columns are samples
## metadat is an optional data.frame of specimen metadata, rows are samples, columns are variables
## if simplify=TRUE, use only the most detailed level of taxa names in the final object
## metaphlanToPhyloseq("~/Downloads/metaphlan_bugs_list")
.getMetaphlanTree <- function(removeGCF=TRUE, simplify=TRUE){
if (!requireNamespace("ape")) {
@lwaldron
lwaldron / simplifyTCGAData.R
Last active June 21, 2018 23:45
Coerce (and optionally remove) all RaggedExperiments in a curatedTCGAData MAE.
simplifyTCGAData <- function(obj, removeRaggedExperiments=TRUE){
##This function will convert mutations to a genes x samples RangedSummarizedExperiment of 1 for non-silent mutations, 1 for silent or no mutation
##It will convert segmented copy number to copy number per gene, using a weighted average if there are non-disjunct ranges
suppressPackageStartupMessages({
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(org.Hs.eg.db)
library(GenomeInfoDb)
})
gn <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene)
gn <- keepStandardChromosomes(granges(gn), pruning.mode="coarse")
@lwaldron
lwaldron / symbolsToRanges.R
Last active June 22, 2018 13:31
Add ranges to RNA-seq and microarray SummarizedExperiments where rownames are gene symbols in a MultiAssayExperiment
.cMAE <- function(mae, x, name="newelement"){
el <- ExperimentList(tmp=x)
names(el)[1] <- name
c(mae, el)
}
.hasSymbols <- function(x){
mean(c(FALSE, grepl("^[A-Z0-9]{1,6}|^C[0-9]orf[0-9]{1,4}", rownames(x))), na.rm=TRUE) > 0.9
}
.isSummarizedExperiment <- function(x){
library(curatedTCGAData)
curatedTCGAData("OV")
mae <- curatedTCGAData("OV", assays=c("mRNAArray", "RNASeq2GeneNorm", "RNASeqGene"), dry.run = FALSE)
library(TCGAutils)
keep <- TCGAsampleSelect(colnames(mae), 11)
mae <- mae[, keep, ]
mae <- intersectColumns(mae)
mae <- intersectRows(mae)
library(org.Hs.eg.db)
@lwaldron
lwaldron / REexample.R
Created July 11, 2018 11:37
Example of RaggedExperiment::qreduceAssay
## ------------------------------------------------------------------------
library(GenomicRanges)
library(RaggedExperiment)
sample1 <- GRanges(
c(A = "chr1:1-10:-", B = "chr1:8-14:+", C = "chr2:15-18:+"),
score = 3:5)
sample2 <- GRanges(
c(D = "chr1:1-10:-", E = "chr2:11-18:+"),
score = 1:2)
colDat <- DataFrame(id = 1:2)
@lwaldron
lwaldron / GMQLusecase
Created December 19, 2018 20:14
GMQL use case (from Masseroli et al 2018, Bioinformatics bty688)
# Masseroli et al 2018, https://doi.org/10.1093/bioinformatics/bty688
# "In TCGA data of BRCA patients, find the DNA somatic mutations
# within the first 2000 bp outside of the genes that are both
# expressed with FPKM > 3 and have at least a methylation in the same patient
# biospecimen, and extract these mutations of the top 5% patients
# with the highest number of such mutations."
library(curatedTCGAData)
system.time(mae <- curatedTCGAData("ACC", c("Mutation", "RNASeq2GeneNorm", "Methylation"), dry.run = FALSE))
@lwaldron
lwaldron / methbenchmark.R
Last active January 14, 2019 10:39
simple DelayedMatrix benchmark showing access time of n rows growing as O(n^3)
if( Biobase::package.version("curatedTCGAData") < "1.5.6" ){
BiocManager::install("waldronlab/curatedTCGAData")
}
stopifnot(BiocManager::version() >= "3.9")
library(curatedTCGAData) #requires >=1.5.6 and bioc-devel
mae <- curatedTCGAData("UCEC", "Methylation_methyl27", dry.run = FALSE) #~2 seconds from cache
dm <- assay(mae, 1)
# benchmarking showing cubic increase with # rows