Skip to content

Instantly share code, notes, and snippets.

@mikelove
Last active August 29, 2015 14:20
Show Gist options
  • Save mikelove/83e47aa27dd4a71ea73f to your computer and use it in GitHub Desktop.
Save mikelove/83e47aa27dd4a71ea73f to your computer and use it in GitHub Desktop.
This is the Rpres file, the next gist, biocII.html can be downloaded and viewed in a browser.
How It's Done: Bioc II
===
author: Michael Love
date: 30 April 2015
autosize: false
width: 1600
Survey results (n=28)
===
<center>
```{r echo=FALSE, fig.width=14}
library(rafalib)
mypar()
res <- c(gene.models=13,
read.seq=12,
read.counts=8,
Anno.DB=10,
make.pkg=10,
BEDtools=9,
nice.HTML=8,
parallel=8)
barplot(res, col=rainbow(7))
```
</center>
<https://gist.github.com/mikelove/83e47aa27dd4a71ea73f>
Quick note on versioning
===
```{r cache=TRUE}
biocVersion() # your machine
packageVersion("limma")
biocValid("limma")
```
http://bioconductor.org/bioc-version
Second quick note: help and vignettes
===
What does this function / package do? Always go for help
```{r eval=FALSE}
?function
help(package="foo", help_type="html")
browseVignettes("foo")
```
Also: http://support.bioconductor.org and post your `sessionInfo()`.
Gene models
===
This functionality comes from [GenomicFeatures](http://bioconductor.org/packages/GenomicFeatures) package.
```{r}
library("TxDb.Hsapiens.UCSC.hg19.knownGene")
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
```
Gene models
===
```{r cache=TRUE}
g <- genes(txdb)
g["128229"]
```
Also `exons()` and `transcripts()`
Gene models: exons by gene
===
```{r cache=TRUE}
ebg <- exonsBy(txdb, by="gene")
ebg[["128229"]]
```
Gene models: transcripts by gene
===
```{r cache=TRUE}
tbg <- transcriptsBy(txdb, by="gene")
tbg[["128229"]]
```
Gene models: exons by transcript
===
```{r cache=TRUE}
ebt <- exonsBy(txdb, by="tx")
ids <- tbg[["128229"]]$tx_id
ebt.sub <- ebt[ids]
lapply(ebt.sub, function(gr) gr$exon_id)
```
Read in sequencing data
===
Basic functionality in the [Rsamtools](http://bioconductor.org/packages/Rsamtools) library.
```{r echo=FALSE}
filename <- system.file("extdata/SRR1039508_subset.bam",package="airway")
```
```{r}
library(Rsamtools)
bf <- BamFile(filename)
seqinfo(bf)
```
Read in sequencing data
===
```{r echo=FALSE}
index <- indexBam(bf)
bf <- BamFile(filename)
```
```{r cache=TRUE}
countBam(bf, param=ScanBamParam(which=GRanges("1",IRanges(11e6,11.4e6))))
```
Read in sequencing data
===
```{r cache=TRUE}
reads <- scanBam(bf, param = ScanBamParam(which = GRanges("1",IRanges(11e6,11.4e6)), what = c("strand","pos")))
class(reads)
head(reads[[1]]$strand) # '1' for the first range in 'which'
head(reads[[1]]$pos)
```
Options for **what**: qname, flag, strand, pos, cigar, etc.
Read in sequencing data: nice objects
===
Nicer access to sequencing data in the [GenomicAlignments](http://bioconductor.org/packages/GenomicAlignments) package.
```{r cache=TRUE}
library(GenomicAlignments)
ga <- readGAlignmentPairs(bf)
ga[1]
```
Read in sequencing data: nice objects
===
```{r cache=TRUE}
first(ga)[1]
```
Also `last()`, `left()` and `right()`.
Read count table
===
Often people want to count NGS reads which overlap features. Two good functions for this:
* `summarizeOverlaps` in the [GenomicAlignments](http://bioconductor.org/packages/GenomicAlignments) package
* `featureCounts` in the [Rsubread](http://bioconductor.org/packages/Rsubread) package
Read count table: summarizeOverlaps
===
```{r cache=TRUE}
gr <- GRanges("1",IRanges(110:114 * 1e5, width=1e5))
so <- summarizeOverlaps(gr, bf)
assay(so)
```
Lots of options in `?summarizeOverlaps`. To count in the exons of genes or transcripts you would use `exonsBy`.
Instead of a BamFile, we could have had a BamFileList.
Read count table: featureCounts
===
```{r echo=FALSE}
gtf.file <- system.file("extdata/Homo_sapiens.GRCh37.75_subset.gtf",package="airway")
```
```{r results="hide", cache=TRUE}
library(Rsubread)
fc <- featureCounts(filename, annot.ext=gtf.file, isGTFAnnotationFile=TRUE)
```
```{r cache=TRUE}
fc$counts[,1]
```
Likewise, here we could have used multiple files.
Annotation interface
===
The annotation packages use functionality from [AnnotationDbi](http://bioconductor.org/packages/AnnotationDbi)
```{r}
library(org.Hs.eg.db)
keytypes(org.Hs.eg.db)
# columns(org.Hs.eg.db)
```
Annotation interface
===
```{r cache=TRUE}
head(keys(org.Hs.eg.db))
```
Annotation interface
===
```{r cache=TRUE}
select(org.Hs.eg.db, key=c("TP53","PTEN"),
keytype="SYMBOL", columns=c("ENSEMBL","ENTREZID"))
```
Another option is the [biomaRt](http://bioconductor.org/packages/biomaRt) package, which pings Biomart servers, but I like that the *org* packages are downloadable, and therefore versionable and reproducible.
Annotation interface
===
New in Bioc 3.1, `mapIds()` returns as a named character vector
```{r cache=TRUE}
mapIds(org.Hs.eg.db, key=c("TP53","PTEN"), keytype="SYMBOL", column="ENSEMBL")
```
Annotation interface: txdb's also are databases
===
```{r cache=TRUE}
keytypes(txdb)
select(txdb, keys=c("1","10"), keytype="GENEID", columns="TXID")
```
Make a package
===
Here I show the bare minimum. For more, see Hadley Wickham's [R packages](http://r-pkgs.had.co.nz/)
```{r eval=FALSE}
source("scripts.R")
package.skeleton(name = "foo") # packages up all functions in envir.
library(devtools)
document()
```
Also good is the *testthat* packaage, for unit testing.
Make a package
===
This is documentation using the *roxygen2* package.
```{r eval=FALSE}
#' title
#'
#' descriptive paragraph
#'
#' @param x a numeric value
#'
#' @return a numeric value, x^2
foo <- function(x) {
x^2
}
```
BEDtools-like operations
===
Read about these in the [GenomicRanges](http://bioconductor.org/packages/GenomicRanges) manual.
```{r}
gr1 <- GRanges("1", IRanges(c(1,11,21),c(5,15,25)))
gr2 <- GRanges("1", IRanges(c(1,9),c(13,12)))
gr1 %over% gr2
```
BEDtools-like operations
===
```{r}
fo <- findOverlaps(gr1, gr2)
fo
```
BEDtools-like operations
===
```{r}
intersect(gr1, gr2)
```
BEDtools-like operations
===
```{r}
union(gr1, gr2)
```
BEDtools-like operations
===
```{r}
gr1 <- GRanges("1", IRanges(5001000,width=100))
gr2 <- GRanges("1", IRanges((1:10) * 1e6, width=300))
distanceToNearest(gr1, gr2)
```
Nice HTML reports
===
From the [ReportingTools](http://bioconductor.org/packages/ReportingTools) vignette.
```{r}
my.df <- data.frame(EGID = c("103", "104", "105", "106", "107"),
RPKM = c(4, 5, 3, 100, 75),
DE = c("Yes", "Yes", "No", "No", "No"))
my.df
```
Nice HTML reports
===
```{r eval=FALSE}
library(ReportingTools)
htmlRep <- HTMLReport(shortName = "my_html_file",
reportDirectory = "report",
basePath="~/Desktop")
publish(my.df, htmlRep)
finish(htmlRep)
```
Note, this is already set up to work with limma, edgeR, DESeq2, etc. Can show boxplots and link out to external db's. Check `browseVignettes("ReportingTools")`.
BiocParallel
===
[BiocParallel](http://bioconductor.org/packages/BiocParallel) implements `bplapply` which is very similar to `mclapply`.
"*BiocParallel aims to provide a unified interface to existing parallel infrastructure where code can be easily executed in different environments*"
```{r eval=FALSE}
library(BiocParallel)
register(MulticoreParam(5)))
# register(SerialParam())
result.list <- bplapply(1:10, function(i) {
heavyWork(filename[i])
})
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment