mikelove · August 29, 2015 14:20
diff --git a/biocII.Rpres b/biocII.Rpres
 How It's Done: Bioc II
 ===
 author: Michael Love
 date: 30 April 2015
 autosize: false
 width: 1600

 Survey results (n=28)
 ===

 <center>
 ```{r echo=FALSE, fig.width=14}
 library(rafalib)
 mypar()
 res <- c(gene.models=13,
         read.seq=12,
         read.counts=8,
         Anno.DB=10,
         make.pkg=10,
         BEDtools=9,
         nice.HTML=8,
         parallel=8)
 barplot(res, col=rainbow(7))
 ```
 </center>

 <https://gist.github.com/mikelove/83e47aa27dd4a71ea73f>

 Quick note on versioning
 ===

 ```{r cache=TRUE}
 biocVersion() # your machine
 packageVersion("limma") 
 biocValid("limma")
 ```

 http://bioconductor.org/bioc-version

 Second quick note: help and vignettes
 ===

 What does this function / package do? Always go for help

 ```{r eval=FALSE}
 ?function
 help(package="foo", help_type="html")
 browseVignettes("foo")
 ```

 Also: http://support.bioconductor.org and post your `sessionInfo()`.

 Gene models
 ===

 This functionality comes from [GenomicFeatures](http://bioconductor.org/packages/GenomicFeatures) package.

 ```{r}
 library("TxDb.Hsapiens.UCSC.hg19.knownGene")
 txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
 ```

 Gene models
 ===

 ```{r cache=TRUE}
 g <- genes(txdb)
 g["128229"]
 ```

 Also `exons()` and `transcripts()`

 Gene models: exons by gene
 ===

 ```{r cache=TRUE}
 ebg <- exonsBy(txdb, by="gene")
 ebg[["128229"]]
 ```

 Gene models: transcripts by gene
 ===

 ```{r cache=TRUE}
 tbg <- transcriptsBy(txdb, by="gene")
 tbg[["128229"]]
 ```

 Gene models: exons by transcript
 ===

 ```{r cache=TRUE}
 ebt <- exonsBy(txdb, by="tx")
 ids <- tbg[["128229"]]$tx_id
 ebt.sub <- ebt[ids]
 lapply(ebt.sub, function(gr) gr$exon_id)
 ```

 Read in sequencing data
 ===

 Basic functionality in the [Rsamtools](http://bioconductor.org/packages/Rsamtools) library.

 ```{r echo=FALSE}
 filename <- system.file("extdata/SRR1039508_subset.bam",package="airway")
 ```

 ```{r}
 library(Rsamtools)
 bf <- BamFile(filename)
 seqinfo(bf)
 ```

 Read in sequencing data
 ===

 ```{r echo=FALSE}
 index <- indexBam(bf)
 bf <- BamFile(filename)
 ```

 ```{r cache=TRUE}
 countBam(bf, param=ScanBamParam(which=GRanges("1",IRanges(11e6,11.4e6))))
 ```

 Read in sequencing data
 ===

 ```{r cache=TRUE}
 reads <- scanBam(bf, param = ScanBamParam(which = GRanges("1",IRanges(11e6,11.4e6)), what = c("strand","pos")))
 class(reads)
 head(reads[[1]]$strand) # '1' for the first range in 'which'
 head(reads[[1]]$pos)
 ```

 Options for **what**: qname, flag, strand, pos, cigar, etc.

 Read in sequencing data: nice objects
 ===

 Nicer access to sequencing data in the [GenomicAlignments](http://bioconductor.org/packages/GenomicAlignments) package.

 ```{r cache=TRUE}
 library(GenomicAlignments)
 ga <- readGAlignmentPairs(bf)
 ga[1]
 ```

 Read in sequencing data: nice objects
 ===

 ```{r cache=TRUE}
 first(ga)[1]
 ```

 Also `last()`, `left()` and `right()`.

 Read count table
 ===

 Often people want to count NGS reads which overlap features. Two good functions for this:

 * `summarizeOverlaps` in the [GenomicAlignments](http://bioconductor.org/packages/GenomicAlignments) package 
 * `featureCounts` in the [Rsubread](http://bioconductor.org/packages/Rsubread) package

 Read count table: summarizeOverlaps
 ===

 ```{r cache=TRUE}
 gr <- GRanges("1",IRanges(110:114 * 1e5, width=1e5))
 so <- summarizeOverlaps(gr, bf)
 assay(so)
 ```

 Lots of options in `?summarizeOverlaps`. To count in the exons of genes or transcripts you would use `exonsBy`.

 Instead of a BamFile, we could have had a BamFileList.

 Read count table: featureCounts
 ===

 ```{r echo=FALSE}
 gtf.file <- system.file("extdata/Homo_sapiens.GRCh37.75_subset.gtf",package="airway")
 ```

 ```{r results="hide", cache=TRUE}
 library(Rsubread)
 fc <- featureCounts(filename, annot.ext=gtf.file, isGTFAnnotationFile=TRUE)
 ```

 ```{r cache=TRUE}
 fc$counts[,1]
 ```

 Likewise, here we could have used multiple files.

 Annotation interface
 ===

 The annotation packages use functionality from [AnnotationDbi](http://bioconductor.org/packages/AnnotationDbi)

 ```{r}
 library(org.Hs.eg.db)
 keytypes(org.Hs.eg.db)
 # columns(org.Hs.eg.db)
 ```

 Annotation interface
 ===

 ```{r cache=TRUE}
 head(keys(org.Hs.eg.db))
 ```

 Annotation interface
 ===

 ```{r cache=TRUE}
 select(org.Hs.eg.db, key=c("TP53","PTEN"), 
       keytype="SYMBOL", columns=c("ENSEMBL","ENTREZID"))
 ```

 Another option is the [biomaRt](http://bioconductor.org/packages/biomaRt) package, which pings Biomart servers, but I like that the *org* packages are downloadable, and therefore versionable and reproducible.

 Annotation interface
 ===

 New in Bioc 3.1, `mapIds()` returns as a named character vector

 ```{r cache=TRUE}
 mapIds(org.Hs.eg.db, key=c("TP53","PTEN"), keytype="SYMBOL", column="ENSEMBL")
 ```

 Annotation interface: txdb's also are databases
 ===

 ```{r cache=TRUE}
 keytypes(txdb)
 select(txdb, keys=c("1","10"), keytype="GENEID", columns="TXID")
 ```

 Make a package
 ===

 Here I show the bare minimum. For more, see Hadley Wickham's [R packages](http://r-pkgs.had.co.nz/)

 ```{r eval=FALSE}
 source("scripts.R")
 package.skeleton(name = "foo") # packages up all functions in envir.
 library(devtools)
 document()
 ```

 Also good is the *testthat* packaage, for unit testing.

 Make a package
 ===

 This is documentation using the *roxygen2* package.

 ```{r eval=FALSE}
 #' title
 #'
 #' descriptive paragraph
 #'
 #' @param x a numeric value
 #' 
 #' @return a numeric value, x^2
 foo <- function(x) {
  x^2
 }
 ```

 BEDtools-like operations
 ===

 Read about these in the [GenomicRanges](http://bioconductor.org/packages/GenomicRanges) manual.

 ```{r}
 gr1 <- GRanges("1", IRanges(c(1,11,21),c(5,15,25)))
 gr2 <- GRanges("1", IRanges(c(1,9),c(13,12)))
 gr1 %over% gr2
 ```

 BEDtools-like operations
 ===

 ```{r}
 fo <- findOverlaps(gr1, gr2)
 fo
 ```

 BEDtools-like operations
 ===

 ```{r}
 intersect(gr1, gr2)
 ```

 BEDtools-like operations
 ===

 ```{r}
 union(gr1, gr2)
 ```

 BEDtools-like operations
 ===

 ```{r}
 gr1 <- GRanges("1", IRanges(5001000,width=100))
 gr2 <- GRanges("1", IRanges((1:10) * 1e6, width=300))
 distanceToNearest(gr1, gr2)
 ```

 Nice HTML reports
 ===

 From the [ReportingTools](http://bioconductor.org/packages/ReportingTools) vignette.

 ```{r}
 my.df <- data.frame(EGID = c("103", "104", "105", "106", "107"),
                    RPKM = c(4, 5, 3, 100, 75),
                    DE = c("Yes", "Yes", "No", "No", "No"))
 my.df
 ```

 Nice HTML reports
 ===

 ```{r eval=FALSE}
 library(ReportingTools)
 htmlRep <- HTMLReport(shortName = "my_html_file",
                      reportDirectory = "report",
                      basePath="~/Desktop")
 publish(my.df, htmlRep)
 finish(htmlRep)
 ```

 Note, this is already set up to work with limma, edgeR, DESeq2, etc. Can show boxplots and link out to external db's. Check `browseVignettes("ReportingTools")`.

 BiocParallel
 ===

 [BiocParallel](http://bioconductor.org/packages/BiocParallel) implements `bplapply` which is very similar to `mclapply`.

 "*BiocParallel aims to provide a unified interface to existing parallel infrastructure where code can be easily executed in different environments*"

 ```{r eval=FALSE}
 library(BiocParallel)
 register(MulticoreParam(5)))
 # register(SerialParam())
 result.list <- bplapply(1:10, function(i) {
    heavyWork(filename[i])
 })
 ```
	How It's Done: Bioc II
	===
	author: Michael Love
	date: 30 April 2015
	autosize: false
	width: 1600

	Survey results (n=28)
	===

	<center>
	```{r echo=FALSE, fig.width=14}
	library(rafalib)
	mypar()
	res <- c(gene.models=13,
	read.seq=12,
	read.counts=8,
	Anno.DB=10,
	make.pkg=10,
	BEDtools=9,
	nice.HTML=8,
	parallel=8)
	barplot(res, col=rainbow(7))
	```
	</center>

	<https://gist.github.com/mikelove/83e47aa27dd4a71ea73f>

	Quick note on versioning
	===

	```{r cache=TRUE}
	biocVersion() # your machine
	packageVersion("limma")
	biocValid("limma")
	```

	http://bioconductor.org/bioc-version

	Second quick note: help and vignettes
	===

	What does this function / package do? Always go for help

	```{r eval=FALSE}
	?function
	help(package="foo", help_type="html")
	browseVignettes("foo")
	```

	Also: http://support.bioconductor.org and post your `sessionInfo()`.

	Gene models
	===

	This functionality comes from [GenomicFeatures](http://bioconductor.org/packages/GenomicFeatures) package.

	```{r}
	library("TxDb.Hsapiens.UCSC.hg19.knownGene")
	txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
	```

	Gene models
	===

	```{r cache=TRUE}
	g <- genes(txdb)
	g["128229"]
	```

	Also `exons()` and `transcripts()`

	Gene models: exons by gene
	===

	```{r cache=TRUE}
	ebg <- exonsBy(txdb, by="gene")
	ebg[["128229"]]
	```

	Gene models: transcripts by gene
	===

	```{r cache=TRUE}
	tbg <- transcriptsBy(txdb, by="gene")
	tbg[["128229"]]
	```

	Gene models: exons by transcript
	===

	```{r cache=TRUE}
	ebt <- exonsBy(txdb, by="tx")
	ids <- tbg[["128229"]]$tx_id
	ebt.sub <- ebt[ids]
	lapply(ebt.sub, function(gr) gr$exon_id)
	```

	Read in sequencing data
	===

	Basic functionality in the [Rsamtools](http://bioconductor.org/packages/Rsamtools) library.

	```{r echo=FALSE}
	filename <- system.file("extdata/SRR1039508_subset.bam",package="airway")
	```

	```{r}
	library(Rsamtools)
	bf <- BamFile(filename)
	seqinfo(bf)
	```

	Read in sequencing data
	===

	```{r echo=FALSE}
	index <- indexBam(bf)
	bf <- BamFile(filename)
	```

	```{r cache=TRUE}
	countBam(bf, param=ScanBamParam(which=GRanges("1",IRanges(11e6,11.4e6))))
	```

	Read in sequencing data
	===

	```{r cache=TRUE}
	reads <- scanBam(bf, param = ScanBamParam(which = GRanges("1",IRanges(11e6,11.4e6)), what = c("strand","pos")))
	class(reads)
	head(reads[[1]]$strand) # '1' for the first range in 'which'
	head(reads[[1]]$pos)
	```

	Options for what: qname, flag, strand, pos, cigar, etc.

	Read in sequencing data: nice objects
	===

	Nicer access to sequencing data in the [GenomicAlignments](http://bioconductor.org/packages/GenomicAlignments) package.

	```{r cache=TRUE}
	library(GenomicAlignments)
	ga <- readGAlignmentPairs(bf)
	ga[1]
	```

	Read in sequencing data: nice objects
	===

	```{r cache=TRUE}
	first(ga)[1]
	```

	Also `last()`, `left()` and `right()`.

	Read count table
	===

	Often people want to count NGS reads which overlap features. Two good functions for this:

	* `summarizeOverlaps` in the [GenomicAlignments](http://bioconductor.org/packages/GenomicAlignments) package
	* `featureCounts` in the [Rsubread](http://bioconductor.org/packages/Rsubread) package

	Read count table: summarizeOverlaps
	===

	```{r cache=TRUE}
	gr <- GRanges("1",IRanges(110:114 * 1e5, width=1e5))
	so <- summarizeOverlaps(gr, bf)
	assay(so)
	```

	Lots of options in `?summarizeOverlaps`. To count in the exons of genes or transcripts you would use `exonsBy`.

	Instead of a BamFile, we could have had a BamFileList.

	Read count table: featureCounts
	===

	```{r echo=FALSE}
	gtf.file <- system.file("extdata/Homo_sapiens.GRCh37.75_subset.gtf",package="airway")
	```

	```{r results="hide", cache=TRUE}
	library(Rsubread)
	fc <- featureCounts(filename, annot.ext=gtf.file, isGTFAnnotationFile=TRUE)
	```

	```{r cache=TRUE}
	fc$counts[,1]
	```

	Likewise, here we could have used multiple files.

	Annotation interface
	===

	The annotation packages use functionality from [AnnotationDbi](http://bioconductor.org/packages/AnnotationDbi)

	```{r}
	library(org.Hs.eg.db)
	keytypes(org.Hs.eg.db)
	# columns(org.Hs.eg.db)
	```

	Annotation interface
	===

	```{r cache=TRUE}
	head(keys(org.Hs.eg.db))
	```

	Annotation interface
	===

	```{r cache=TRUE}
	select(org.Hs.eg.db, key=c("TP53","PTEN"),
	keytype="SYMBOL", columns=c("ENSEMBL","ENTREZID"))
	```

	Another option is the [biomaRt](http://bioconductor.org/packages/biomaRt) package, which pings Biomart servers, but I like that the org packages are downloadable, and therefore versionable and reproducible.

	Annotation interface
	===

	New in Bioc 3.1, `mapIds()` returns as a named character vector

	```{r cache=TRUE}
	mapIds(org.Hs.eg.db, key=c("TP53","PTEN"), keytype="SYMBOL", column="ENSEMBL")
	```

	Annotation interface: txdb's also are databases
	===

	```{r cache=TRUE}
	keytypes(txdb)
	select(txdb, keys=c("1","10"), keytype="GENEID", columns="TXID")
	```

	Make a package
	===

	Here I show the bare minimum. For more, see Hadley Wickham's [R packages](http://r-pkgs.had.co.nz/)

	```{r eval=FALSE}
	source("scripts.R")
	package.skeleton(name = "foo") # packages up all functions in envir.
	library(devtools)
	document()
	```

	Also good is the testthat packaage, for unit testing.

	Make a package
	===

	This is documentation using the roxygen2 package.

	```{r eval=FALSE}
	#' title
	#'
	#' descriptive paragraph
	#'
	#' @param x a numeric value
	#'
	#' @return a numeric value, x^2
	foo <- function(x) {
	x^2
	}
	```

	BEDtools-like operations
	===

	Read about these in the [GenomicRanges](http://bioconductor.org/packages/GenomicRanges) manual.

	```{r}
	gr1 <- GRanges("1", IRanges(c(1,11,21),c(5,15,25)))
	gr2 <- GRanges("1", IRanges(c(1,9),c(13,12)))
	gr1 %over% gr2
	```

	BEDtools-like operations
	===

	```{r}
	fo <- findOverlaps(gr1, gr2)
	fo
	```

	BEDtools-like operations
	===

	```{r}
	intersect(gr1, gr2)
	```

	BEDtools-like operations
	===

	```{r}
	union(gr1, gr2)
	```

	BEDtools-like operations
	===

	```{r}
	gr1 <- GRanges("1", IRanges(5001000,width=100))
	gr2 <- GRanges("1", IRanges((1:10) * 1e6, width=300))
	distanceToNearest(gr1, gr2)
	```

	Nice HTML reports
	===

	From the [ReportingTools](http://bioconductor.org/packages/ReportingTools) vignette.

	```{r}
	my.df <- data.frame(EGID = c("103", "104", "105", "106", "107"),
	RPKM = c(4, 5, 3, 100, 75),
	DE = c("Yes", "Yes", "No", "No", "No"))
	my.df
	```

	Nice HTML reports
	===

	```{r eval=FALSE}
	library(ReportingTools)
	htmlRep <- HTMLReport(shortName = "my_html_file",
	reportDirectory = "report",
	basePath="~/Desktop")
	publish(my.df, htmlRep)
	finish(htmlRep)
	```

	Note, this is already set up to work with limma, edgeR, DESeq2, etc. Can show boxplots and link out to external db's. Check `browseVignettes("ReportingTools")`.

	BiocParallel
	===

	[BiocParallel](http://bioconductor.org/packages/BiocParallel) implements `bplapply` which is very similar to `mclapply`.

	"BiocParallel aims to provide a unified interface to existing parallel infrastructure where code can be easily executed in different environments"

	```{r eval=FALSE}
	library(BiocParallel)
	register(MulticoreParam(5)))
	# register(SerialParam())
	result.list <- bplapply(1:10, function(i) {
	heavyWork(filename[i])
	})
	```