Conference info: https://bioc2018.bioconductor.org/
My first Bioconductor meeting, and I'm not a BioC or R expert so these notes are probably going to be naïve!
(I arrived at 1:00pm and missed the morning sessions)
- Boston first, then NYC
- Bioconductor has a community Slack, there is an #meetups channel
- Boston: 40 to 100 people turn up
Local file management. Cache files locally to avoid downloading from remote sources if not needed. Also, try to have a better way to organize files.
BiocFileCache()
. Backed by a sqlite database.
bfcadd( rname=..., fpath=...)
adds an existing file to track in the cachebfcnew( rname=... )
gives a new path in the cachebfcneedsupdate()
check if a dataset has changed remotely and needs to be downloaded againbfcquery(...)
search for datasets in the cachebfcrpath(...)
gives the local path of file by id
Can also attach metadata to datasets.
- Stored variants (genotypes, multiple assays, multiple individuals).
- Extends
RangedSummarizedExperiment
. - Can construct from
gds
file or from avcf
file. - Subsetting and range slicing.
- "Many statistical methods are defined". Example: hwe.
- map/reduce in R:
lapply( X, FUN, ...)
. - BioCParallel:
bplapply( ..., bpparam )
bpparam
determines the BioCParallel backend to use- e.g.
SerialParam
orMulticoreParam
- New: scalaing across clusters,
BatchToolsParam( workers=..., cluster=... )
- cluster might be SGE, SLURM, LSF, PBS, et cetera
- Example: Salmon psuedoalignment
- instantiate
BatchToolsParam
with institution specific template - Write a function that processes a single sample
- Pass function to
bplapply
to run in parallel- Progress Bar!
- instantiate
- Benefits: easier cluster management, ...
- https://github/nturaga/BatchtoolsParam_examples
How the confernece workshops (will run Thursday/Friday) materials were built.
- https://github.com/Bioconductor/BiocWorkshops
- What's different about (BioC conference) workshops this year?
- Collection of
Rmd
files collated with bookdown.- 130 package dependencies (ouch!)
- Coordinated with GitHub issues+tags
- Produces a single gh_pages site with all of the workshops (https://bioconductor.github.io/BiocWorkshops)
- Building an AMI for all workshops
- Using packer.io to build AMI with everything needed for every workshop
- Alternative interface to the
org.*
packages, similar purpose toOrganismDBI
- Any organism with both a
org
andtxdb
package can be used src_organism( "org..." )
provides the interface. Compatible with all methods fromdplyr
- 11 genomic coordinate extractor methods available, e.g.
transcripts
gets a GRanges,transcripts_tbl
gets a tibble - Examples of a variety of complex filters (too small to read!)
Parallel tracks, Peter Hickey on Effectively using DelayedArray and Levi on New Data Structures for Bioconductor. (Sorry @PeteHaitch but I can only choose one).
First, a presentation "Why re-use core classes: A plea to developers of Bioconductor packages" (Levi).
- What is Bioconductor? 1,400 packages on a backbone of data structures.
- e.g.: GenomicRanges, SummarizedExperiment
- Why do core classes matter?
- Suppose you want to build a rocket powered bike. You could start from raw steel and forge your own frame.
- But your frame has limited testing, and probably doesn't handle many use cases
- It is easy to define a new S4 class in R
- But you shouldn't very difficult to build a robust and flexible class for genomic data analysis
- Example from phylogenetics / microbiome packages: not using common classes ⟶ limits interoperability
- Suppose you want to build a rocket powered bike. You could start from raw steel and forge your own frame.
- What are the core classes?
SummarizedExperiment
,GenomicRanges
,Biostrings
,GSEABase
,MultiAssayExperiment
,SingleCellExperiment
,MSnbase
...- https://bioconductor.org/developers/how-to/commonMethodsAndClasses/
- Core classes represent years of work and maintenence and have been used by tens of thousands of users
-
Q: What would you do instead of defining your own class (in the case of phyloseq)
-
.MicrobiomeExperiment <- setClass("MicrobiomeExperiment", contains="SummarizedExperiment", representation( rowData="MicrobiomeFeatures" ) )
- Gives you the benefits of
SummarizedExperiment
, compatible withMultiAssayExperiment
-
-
Exploring S4 classes
extends
tells you what superclasses a given class extends- e.g.
RangedSummarizedExperiment
isaSummarizedExperiment
,Vector
,Annotated
- e.g.
showclass
adds known subclasses and what slots it containsmethods
tells you what methods are defined for a class- e.g. 100+ methods on
SummarizedExperiment
, but 54 of those from parent classes)
- e.g. 100+ methods on
-
Example that has "done things right":
SingleCellExpriment
- extends
RangedSummarizedExperiment
and defines additional methods...
- extends
https://bioconductor.org/packages/release/bioc/html/PharmacoGx.html
-
WIP: Fixing up to work better with Bioconductor objects
-
Drug sensitivity data: "treat a cell line with a drug and see how well it kills it"
-
Structure
- molecularProfiles: List of ExpressionSet objects
- sensitivity: List of a couple of data frames and an array
- Initially Experiment IDs with dose/viability pairs
- But, drugs combinations, other dimensions, not naturally a matrix
- Solution:
LongArray
object- col.ids and row.ids: data.frame's
- Example (combination_name, drugA, drugB)
- Get data as if from a list
- "long array but behaves when you use a single bracket as if a matrix"
- Slicing example
- dcText is a longArray object with "rows" across 2 variables and "columns" across 1 variable
- slicing dcTest[c("5-FU", "Bortezomib", "Erlotinib"), "A2058"]
- col.ids and row.ids: data.frame's
- Initially Experiment IDs with dose/viability pairs
-
Q: Why is this different from a SummarizedExperiment?
- Multiple experiments, e.g. drug combinations of 2, 3, ... n drugs
- Followup Q: what about MultiAssayExperiment?
- Would lose quick subsetting through multiple dimensions (?)
https://bioconductor.org/packages/release/bioc/html/TxRegInfra.html
- Investigator's idea: eQTL from GTex, DHS,... from ENCODE, TFBS from FIMO... use this to interpret GWAS hits
- Developers: do as little as possible to resolve and keep metadata
- Existing resources that could help:
- rtracklayer+tabix, GenomicRanges, RaggedExperiments, ...
- mongolite
- RaggedExperiment
- (completely missed this and the slides blanked out)
- But seems like experiment where observations have different features, some shared some not
- I need to read this later: https://bioconductor.org/packages/release/bioc/manuals/RaggedExperiment/man/RaggedExperiment.pdf
- Example
- Data
- collection of eQTL from GTEx
- encode footprint (not sure what this actually is)
- encode DHS hotspots
- Documents in a mongodb database (RaggedMongoExperiment)
- Every document has a genomic range, so can respond to range queries
- Data
- Summary
- Basic layout: genomic coordinates x sample/tissue type x assay type
- MutltiAssayExperiment: could work but not an immediate fit
- "There is a competitor called Giggle"
http://steinbaugh.com/basejump/
- Recommended packages
GenomicRanges
,rtracklayer
(GTF -> GRanges),AnnotationHub
,ensembldb
,GenomicFeatures
basejump
extends these tools- Rich metadata columns (GRanges),
mcol(...)
https://f1000research.com/articles/7-741/v1
- Interactively explore any data in a
SummarizedExperiment
object (or subclass) - Multiple panels with different visualizations, can see how they are linked
https://pepkit.github.io; https://databio.org
- Motivation: Most pipelines require individual metadata organization
- PEP: a standard format for project metadata -- "Portable Encapsualted Project"
- Ecosystem of tools:
- format itself: project_config.yaml, samples.csv
- peppy: Python package
- pepr: R package
- geofetch, looper -- map samples onto pipelines and run in different compute environments
github.com/shians/biocexplorer
- Bioconductor packages are not that easy to find
- Prioritization: can sort by title (alpha), author (alpha), not really that useful
- Alternative: BiocExplorer:
- Prioritize packages based on usage
- Provides a graph of packages, prioritizing those that are widely used (not sure what
So fast... so small...
-
Summary first
- DelayedArray: seamless element level access to out-of-memory / remote array-like resources
- SummarizedExperiment/MAE: a sort of query language for annotated omics resources
- Current efforts: improve efficiency of statistical learning using Delayed* resources
-
DelayedArray backends: HDF5 server, BigTable, ...
-
HDF cloud / HDF Kita (Example using 10x)
-
BigTable (Example using (OncoTk)
-
The point (I think): You can work with all of these types of remote data in the current version of Bioconductor
https://www.bioconductor.org/packages/devel/bioc/html/sesame.html
- Improves masking on hyper-polymorphic region (e.g. MHC)
Brainstorm and prioritize some products that can be produced in ~45 minutes (and then do the thing).
(Martin using slido.com to accumulate suggestions from the attendees)
Voting, winners are:
- Come up with a data structure for PharmaGx data -- 9
- Strategies for posting and answering support site questions -- 8
- Checks on SummarizedExperiment rownames, rowData()<- -- 7
- name clashes between BiocGenerics, S4Vectors etc. and tidyverse -- 6
- Pull requests to fix usage and other warnings in core packages, e.g., Rsamtools -- 6
- Initiate collaborative development of ... (like iSEE at BiocEurope) -- 6
We'll see what happens...
- Main idea: Template for the "ask a question" box: provide some guidance for how to ask a good question
- Reproducible example
- Things you tired
- sessionInfo
- Other ideas (Google doc link)
- "Biocverse": visualizing the Bioconductor ecosystem
- Use cases
- New users: given a task, show me all packages, ranked by "importance"
- Experienced users
- New developers
Fixing a problem in assigning rownames. I think.
(more) Discussion of how to store drug sensitivity data.
Store as a database. Write construtor functions that create matrix / SummarizedExperiment from the database.
There's some code in github in a project called longArray but I can't read the user/org name.
Q&A for the project leadership team.
Q: "Can you think of ways to expand the network of project leaders, expand ownership, expand people who feel they are part of the project"
- Martin: project has life outside the core. e.g. Single Cell developments largely outside the core.
- Vince: (turns question back) Is there a lack of recognition or barriers to participation?
- Aedin: Surprising how many R people don't know Bioconductor
Q: Mechanism to kick packages out?
- M: There are obstacles that discourage people from participating, but the tradeoff is worth it. Quality of packages, having vignettes. Contrast in how tidyverse works with how Bioconductor works, different views on how software should work. Interesting to think about compromises in making those play along.
- V: Advantages of putting your package in Bioconductor: development vs release branch (not available in CRAN), vignettes and examples. Not trying to sell Bioconductor methods to people not trying to do them.
Q: Synchronization between Bioconductor and CRAN
- M: We communicate regularly. At a technical level there is communication. At a social level, much more restricted. CRAN has task views, but no overlap with Bioconductor
Q: How do you recommend new users learn the Bioconductor ecosystem
- Wolfgang: need a new book. Something like a textbook. Challenge defining what it means to use/learn Bioconductor
- A: A beginners user guide to Bioconductor should start with SummarizedExperiment, basics... The thing I direct people to is the f1000 channel
- If you want to learn Bioconducter there is a reason for that. You have some data.
- V: Talking about content from the edX MOOC
- Kasper: introduction should focus on the things that EVERY user of Bioconductor should know. Should streamline and cleanup what is presented.
Q: Funding mechanisms the Bioconductor community should be applying for
-
M: Historically get a lot of money into one big shop, easier than scrambling for smaller grants from a variety of places. More recently, have started to diversify. e.g. Human cell atlas grants written by a diverse group of people. More junior faculty wanting to expand participation but need funding to do so. But wants to call on James Taylor...
-
ACK IT ME: What works for Galaxy 1) Core group goes after diverse funding opportunities, one big pot but also lots of other pots 2) Full time funded community outreach. Multiple people across the project dedicated to this. Really makes a huge difference to have someone spending all their time on this so it is never lost/back-burnered [writing what I think not what I say right now]
Q: Increasing participation, diversity, gender balance
- M: (I didn't capture this well so nothing here)
Q: Question about connections with Africa (H3A?) training and outreach
- A: There are some efforts/connections
- V: There is a foundation for Bioconductor devoted to charitable works, has some money, could potentially be used to expand training
(discussion about website here, needs to be simplified, streamlined, refreshed... couldn't hear everything)
(Back to diversity, a plug for participation in Girls who Code and other such groups, always looking for help)
END OF DEVELOPER DAY.
Starting with introductory remarks from Martin (thanking sponsors, organizers, logistics -- all on the conference website).
Last item, Code of Conduct (https://bioc2018.bioconductor.org/code_of_conduct), interesting, shorter than many.
Virtual ChIP-seq: predicting transcription factor binding by learning from the transcriptome (@michaelhoffman)
https://www.biorxiv.org/content/early/2018/02/28/168419
- Introduction
- "Transcription over-simplified": TF binds DNA, recrits PolII, RNA is made (yup, that's all)
- ChIP-seq, you might have heard of it...
- Problem: ChIP-seq needs 10^6 to 10^8 cells ("Determined using 'Cunningham's Law'")
- Solution: computational prediction of transcription factor binding
- Old problem, originally entirely sequence based, current methods use open-chromatin and so a lot better
- Michael thinkg HINT works pretty well
- Old problem, originally entirely sequence based, current methods use open-chromatin and so a lot better
- How to move forward
- Use experimental data from ChIP-seq in other cell types
- Learn association between local TF binding and global cellular state as measured from transcribed RNA
- Learning from the transcriptome
- ChIP-seq data from some cell types, RNA-seq data from more cell types.
- Bin genome and look for bins where ChIP-seq/RNA-seq correlate (I think ChIP-seq is 200bp bins and RNA-seq is gene level)
- Correlation matrix is genome-wide, consider cases where p < 0.1
- For a given bin, consider all genes with "significant" correlation, compute spearman rank correlationp between expression and correlation == expression score (yes, correlations of correlations)
- phastCons + chromatin accessibility + expression score + number of cell types with TF binding + motif score into a "very simple neural network" [MH: it's simple in that it is fully connected, nothing fancy in the architecture]
- MLP... optimized # layers, size of layers, activation function, ... looks like lots of hyperparameters [MH: only 4 hyperparameters! 3×3×3×4 = 36 different possible values examined in the grid search]
- Bin genome and look for bins where ChIP-seq/RNA-seq correlate (I think ChIP-seq is 200bp bins and RNA-seq is gene level)
- ChIP-seq data from some cell types, RNA-seq data from more cell types.
- Evaluation
- (Describing cross-validation scheme, use of precision-recall curves)
- ChIP-seq data from other cell types most important, then expression score, then chromatin accessibility
- Performance
- "A few" TFs where we do very well -- auPR >= 0.5 -- SMC3, CTCF, RAD21... some others, went by too fast [MH: performance plot is in preprint]
- For 36 TFs MCC more than 0.3 in validation cell types (Roadmap)
- Correctly predicts novel TF binding sites
- Trackhub available, didn't see a URL though, probably in the preprint [MH: should add to presentation. https://virchip.hoffmanlab.org/]
- Future:
- Position dependency -- adding time dimension to network
Questions:
- Q: Now that we have 1600 cell lines with RNA-seq, would you be interested in inferring ChIP-seq for those cell lines
- M: Yes, I would be intersted in that, but open-chromatin data is important, not sure can do without
- Q: Can this be applied to single cell transcriptome data?
- M: Yes, could do something, again need measurement of open chromatin ("having single cell ATAC would be best")
- Q: (I think the question is using the model to find most similar cell line)
- M: (Not sure)
- Q: What two assays would you do on cells from a donor (for a difficult to acquire tissue)
- M: transcripts and open chromatin -- but still unclear for single cell
https://www.biorxiv.org/content/early/2018/04/02/196915
-
First, an appeal to the audience: "Need more tools for visualization across different matrix factorization techniques"
-
Introduction
- Pattern detection is critical in the genomics big data era
- (Many types of) omics data can be represented in matrices
- Focus here is mainly on transcription
- Omics data can be interpreted through matrix factorization (PCA, ICA, NMF, ...)
-
Data = Amplitude x Pattern
- D: molecules by samples
- A: molecules by features
- P: features by samples
-
Data = Amplitude x Pattern
-
Focus: Smooth sparse NMF, Bioconductor package CoGAPS
$$A_{i,j} ~ \Gamma(\alpha^A_{i,j},\lambda); \alpha^A_{i,j} ~ Poisson(\alpha)$$ - Gamma yields sparsity constraint
- Implementation: finds constrained non-negative sparse matrices using MCMC Gibbs sampler
-
Application: Biological model of theraputic resistance
- Cancer cell line initially sensitive, generate long term resistance, acquire time series (weekly) for cells acquiring resistance and controls
- Generated gene expression data (bulk RNA-seq?)
- Initial clustering
- Gives information about treatment but not much about resistance
- Matrix factorization for time-course analysis
- Perform sparse NMF (CoGAPS) and view the Patterns over time
- Reveals time dependent patterns in resistance
- Perform sparse NMF (CoGAPS) and view the Patterns over time
- BUT: how does one make these abstract patterns useful?
- Amplitude matrix allows mapping patterns back into gene expression space (or whatever original feature space)
- Instead of finding genes most highly associated with each pattern, what are the genes associated with only one of the patterns
- "Pattern marker genes"
- Group that slowly increases with resistance, another group that slowly decreases, clearly groups treatment and controls
- "Pattern marker genes"
- Can do standard GSEA on these marker genes
- Relate this back to non cell-line data?
- Take weights and project onto another dataset (ProjectR package on GitHub)
- Human tumors treated with the same therapy
- Found that the resistance patterns were elevated in the patient tumors that were resistant
-
"Resurgance of matrix factorization for single-cell data"
- What's different?
- Datasets are orders of magnitude larger
- Cell types and timing of individual cells are unknown a priori
- Showing UMAP of 10x 100k cell data, 10 different time points in mouse retina
- Cell types are hand annotated to get "ground truth" (hrmmmmmm...)
- scCoGAPS distinguishes cell types and trajectories
- Looks like a rod pattern and a cell type pattern...
- What's different?
-
Conclusions
- Matrix factorization has a long history in genomics
- Adding new visualization andnew statistics to the ouputs of MF can enable robust pattern detection
- Applicable to single-cell datasets
-
Q: 1. about manually classifying 100k cells, 2. (didn't get this one)
- Research question: how do you use these factorizations to aid classification
- Replicates: not able to replicate the whole time course (not enough $), but a collaborator had previously developed resistance in same cell line, found tremendous heterogeneity, sounds like generalizability still unclear
-
Q: Scalibility, algorithms people should focus on in the face of HCA, 2M cell scale...
- My algorithm won't converge on data at that scale, gradient alogirthms will converge (but badly). Two approaches
- || across different sets of genes, cells
- compaction approahces (group related cells, factorize in reduced space)
- My algorithm won't converge on data at that scale, gradient alogirthms will converge (but badly). Two approaches
-
Q: "Are you aware of groups that have ressurected CUR decomposition... quantization approaches... where you hit the limit"
- Haven't seen that. Surprised at the amount of reinventing the wheel. Need to go back to that literature.
(BREAK)
Analysis of high content microscopy data generated through automated yeast genetics (Brenda Andrews)
-
Introduction
- Major challenge, predicting phenotype from genotype using genetic interactions
- Using budding yeast because reagents for systematic genetics including
- Yeast deletion collection: 5000 yeast strains each deleted for a single non-essential gene
- 1000 temperature sensitive alleles of essential genes
- Need methods for detecting gene interactions
- SGA (Synthetic genetic array): introduce any marked allele into an arrayed set of straings
- Main phenotype is growth (colony size)
- e.g. tested 23.4 million double mutants identifying 1.1M genetic interactions
- generated "hierarchical modle of cell function" (Costanzo et al. Science 2016)
- ~35% of nonessential query gene mutatnts exhibit weak genetic interaction profiles
- Most of the time double mutants do not have a growth phenotype, but may have other phenotypes
-
"Marker project"
- Introduce flourescent markers for sub cellular compartments in to the arrayed strains
- How does mutation of any gene influence sub-cellular compartments?
- Developing a general phenotypic profiling pipeline
- Make strain collection: Use SGA to introduce three markers: compartment of interest, nucleus, cytoplasm
- Image: Opera Phenix automated confocal live cell
- Data collection
- Single cell images
- Single cell morphological features
- Cell Profiler, ~300 features, 10-50 PCs
- VAE (autoencode) to find latent feature vector
- Phenotyping profiling
- Detecting mutants -> penetrence
- Finding outliers, one-class SVM, distance methods, ... "no one size fits all"
- Classifying mutant pheonotypes
- "Neural networks"
-
Application: Endocytosis
- Four markers: actin pathc, clathrin coat, late endosome, vacuole
- Phenotype assignment and classification: two hidden layer MLP, "probabilistic" output layer
- 21 phenotypes: 4 WT, 17 mutants, ~88,5% accuracy
- Marker penetrance: ~1230 genes with sig penetrance for at least one marker, ~50% of mutations affect more than one marker
- What are the machnisms leading to incomplete penetrance / cell heterogeneity?
- (I missed a bit here but obviously lots of cool stuff you can do with systematic genetics at this scale!)
-
Q: (Can't hear, not using the microphone)
-
Q: Relationship between imaging features and growth phenotype
- A: Typically get a morphological phenotype when there is a growth defect
-
Motivation
- Emergence of adult neural stem cells using scRNA-seq
- First question is a hard question, how many clusters?
- Answer: keep clustering until there are no longer significant differences between clusters
- FDR corrected Wilcoxon rank-sum test
- Emergence of adult neural stem cells using scRNA-seq
-
From this, built an R/Shiny app for visualizing clustering
- Plots for evaluating number of genes
- Number of differentially expressed genes (Tukey plot), next to silohette
- "Classic" tSNE and UMAP
- Cell cycle -- "cycling" tool? (missed this, look up)
- Answer from @PeteHaitch! Thanks: it was the
cyclone()
function the in the scran Bioconductor package. Some detail in 'Classification of cell cycle phase' section of https://f1000research.com/articles/5-2122/v2 and based on https://www.ncbi.nlm.nih.gov/pubmed/26142758
- Answer from @PeteHaitch! Thanks: it was the
- Gene expression distribution (both detection rate and normalized expression in the cells where detected)
- JT: I like Scanpy's
dotplot
for this, more compact visualization for selected markers
- JT: I like Scanpy's
- Manually select cells for DE testing
- Plots for evaluating number of genes
-
Q
- A: right now handles seurat objects. Can load in whatever single cell object you want
-
Example: ATAC-seq, mouse (brain?) before/after ECT
-
DEScan2
- Call and filter peaks + counts ⟶ SummarizedExperiment peak (range) x count
- Peak calling based on Poison Liklihood without overdispersion
- Filter score threshold, number of samples
- Differential expression with edgeR, DESeq2
- Integration: few samples, annotation, many samples, mixomicx
- Call and filter peaks + counts ⟶ SummarizedExperiment peak (range) x count
-
Future: comparing with other packages, testing on ChIP-seq, visualiztion, test on ATAC single cell.
-
Q: If we wanted to look at ATAC-seq is it in Biocondutor experimental datasets
- A: No
Improving the accuracy of taxonomic classification for identifying taxa in microbiome samples (Eric Wright)
- First, udpates on DECIPHER (https://bioconductor.org/packages/release/bioc/html/DECIPHER.html)
- Classification for Microbiome sequences
- Two types of pipelines, 16s (or other marker genes) or WGS
- Unsupervised: create OTUs and Phylogenetic trees, ...
- Supervised: classify into known taxonomy (idTaxa in DECIPHER)
- Two types of pipelines, 16s (or other marker genes) or WGS
- Problem for taxonomic classification: reference databases are incomplete
- Evalutaing accuracy on a Mock community
- When there are missing organisms in database BLAST has very high false positive rate. idTaxa does not (0.01% false positive rate?)
- (Either I spaced out or there were no details on how it actually works)
- Evalutaing accuracy on a Mock community
DEsingle for detecting three types of differential expression in single-cell RNA-seq data (Zhun Miao)
https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/bty332/4983067
- Detect three types of DE between two groups in a count matrix
- Using ZINB model, estimates proportion real vs drop-out zeros
- DE types
- DE status -- paramters \theta
- DE abundance -- argh... slide is gone
- "General DE"
- Showing heatmaps... not using the microphone so I'm missing a lot here
- Presenting yutorial of actually using DESingle
- Goal: benchmarking GSEA
- GSA theory null hypotheses: 1) competitive (vs other set), 2) self-contained (no genes in the set of interest are DE)
- Generations: 1) overrepresentation 2) functional classification 3) integration of network topology
- Bioconductor package: EnrichmentBrowser
- Enrichment analysis in practice: GO/KEGG overenrichment, dozens of methods claim improvements, how to evaluate?
- Solution: GSEABenchmarkeR, standardized benchmarking of GSEA methods
- Comprehensive real data compendium
- Systematic and reproducible assessment
- Benchmark Panel
- Array Panel: Tarca12/13 19 diseases
- RNA-seq: 33 datasets from TCGA
- Assessment
- Statistical sig
- For each method how many genes found to be significant
- Competitive methods 10-20%, Self-contained much higher, like 80%?
- For each method how many genes found to be significant
- Statistical sig
- Phenotype relevance
- Zooooooom....
(BREAK)
Workshop content is online
- Introduction
- What are TF/RBP binding motifs; Sequence logos; using a PFM to estimate binding affinity
- Lots of ways to do this; log scale mis-represents what protein actually does, prefers to stay in linear domain
- Where do motifs come from? 1) SELEX and variants, 2) PBMS, RNAcompute, and variants, 3) ChIP/CLIP-seq and variants
- PBMs easier to deal with, more realistic
- ChIP/CLIP "for the afficienados this is actually the worst way to obtains the motfis". Influence of chromatin, combinatorial relationships, ...
- All data types ammenable to k-mer counting, and learning PWMs in counting or discrimination mode
- "Most people are using MEME approaches which are not trying to be quantitative, something to keep in mind"
- Sources of motifs
- TRANSFAC, HOCOMOCO, JASPAR, UniPROBE
- In Bioconductor a panel of TF motif collections are collated as MotifDb
- PLEASE cite original data for individual findings, not JUST aggregation
- In Bioconductor a panel of TF motif collections are collated as MotifDb
- TRANSFAC, HOCOMOCO, JASPAR, UniPROBE
- What are TF/RBP binding motifs; Sequence logos; using a PFM to estimate binding affinity
- CIS-BP (http://cisbp.ccbr.utoronto.ca/)
- Determining binding preferences of all human TFs first requires... a list of all human TFs!
- Collaboration with JUssi Taipale and Matt Weirauch to manually reassess and collate human TFs
- Pooling existing sources gives 2700 potential TFs; ~14% of human genes "way above anyones reasonable estimate"
- Review by two independent expert judges
- PIs resolved disagreements (less than 100, mostly "a specific assay that somebody doesn't believe"
- Final list of 1639 (actually 1638) TFs. 1105 with Motif, 104 with homolog motif
- Collaboration with JUssi Taipale and Matt Weirauch to manually reassess and collate human TFs
- Take home messages
- New list has hundreds of TFs not in the last TF survey paper (2009) [+347, -49]
- Most human TFs (74%) have at least one known motif and most that don't are C2H2 ZFs
- Tim has clones for the remaining 426, "if you have an assay you want to run we will give you the clone"
- C2H2 are not only largest and least well characterized, also most diverse, ...
- Het ChIP-seq motifs for 131 C2H2-ZFs
- Almost all have a different motif -- diversification
- Many bind endogenous retro elements
- So, duplicate easily, carry a KRAB domain around, diversify quickly.. largest within class diversity of any family by far
- Het ChIP-seq motifs for 131 C2H2-ZFs
Cancer Biomarker Discovery: Building a Bridge Between Preclinical and Clinical Research (Benjamin Haibe-Kains)
- Acknowledgements first, Yay!
- Introduction
- Genotype matched clinical trials
- Only 18% to 39% of patients can be matched to a trial based on targetted sequencing
- Biological materials for biomarker discovery
- In situ: patient tumors
- In vivo: patient derived xenografts, engineered mouse models
- In vitro: immortalized cell lines, 2d cell lines from biopsy, organoids
- Why cell lines?
- No ethical issues, cheap high throughput, approved for drugs
- Biomarker discovery using cell lines
- Comprehensive molecular profiling
- Drug senitivity screening
- Data sharing
- Long history. NCI60/DTP........... CCLE and GDSC (2012)........ FIMM
- Pharmacogenomics datasets require curation
- Genesis of PHarmacoGx
- Cellosaurus to uniquely identify cell lines and tissues
- Drugs annotated with PubChemID, InChiKey, SMILES
- Genotype matched clinical trials
- PharmacoGx
- 1691 cell lines, 41 tissues, 759 compounds, 650k drug response experiments, 200 million+ gene/drug associations
- (Some block diagrams of the database model here but way too small to read anything)
- PharmacoDB (http://pharmacodb.ca)
- Web interface to mine data, cell line, tissue, compound, gene, ...
- Response curves across multiple datasets
- Lot of heterogeneity; even within one dataset!
- Response curves across multiple datasets
- Web interface to mine data, cell line, tissue, compound, gene, ...
- "Real power is being able to combine across datasets"
- Cell line overlaps, compound overlaps between datasets (showing UpSet plots)
- Concordance of drug sensitivity
- Correlation performs poorly
- Disconnect between statistical analysis and how scientists look at the data
- Defining modified concordance index greatly increases condcordance between GDSC and CCLE
- (I didn't really follow the details of the mCI unfortuantely)
- Preclinical <--> Clinical
- Drug matching, integration with cBioPortal
- Conclusions
- Selecting right therapy for patient is crucial, need biomarkers predictive of drug response
- cBioPortal/PharmacoDB can discovery robust biomarkers and share them with clinicians
- Challenge: how do we properly validate the pre-clinical biomarkers for clinical use?
- Organoids? xenografts? clinical trials?
- On reproducibility: GEO, GitHub, Bioconductor, Docker + Code Ocean
https://sa-lee.github.io/plyranges
"If our technology behaves" (RevealJS...)
- Introduction
- Bioconductor infra is powerful: GenomicRanges, IRanges
- Enables developing new tools
- However, R world has changed (tidyverse). A lot is expected of a new R user to analyze data with Bioconductor
- More complexity, S4 system
- Bioconductor infra is powerful: GenomicRanges, IRanges
- How far can user get with just GRanges?
- GRanges are (already) tidy!
- Range is a variable, metadata columns are observations
- GRanges are (already) tidy!
- plyranges: design a grammer
- endomorphism (ops always give back a GRanges)
- cohesion
- consistency (lease surprise)
- expressiveness (funtions describe what they do)
- Incldue genomic semantics (bioRxiv 327841)
- arithmatic (shift, resize, flank, coverage)
- restrict (by metadata or by range queries)
- aggregation (summariz over ranges)
- merges
- Examples
- Changing coordinates;
exons %>% mutate( width=2*width )
doubles the width, by default starting at the START- Anchoring:
exons %>% anchor_center() %>% mutate( width=2*width )
- [This seems really odd to me, why is anchoring not an argument to mutate? This needs to carry some state along magically]
- [Stuart just explained it to me, anchoring is like
group_by
, it gives back an "AnchoredGRanges" which behaves like / wraps the original but also provides the anchoring info. Okay, makes sense, not that magical]
- Anchoring:
- Merging: treating genomic intersections like table joins
join_overlap_inner(a, b)
join_overlap_inner_within(a, b)
join_overlap_inner_directed(a, b)
- Changing coordinates;
bit.ly/tximeta
- Motiating example, data re-use
- From a paper: "we performed RNA-seq quantification with Salmon using hg38 and the RefSEq annotation"
- Provided raw and summary data: we have a count matrix, know RefSeq ans hg38
- Reusing this data
- read_tsv( "counts" ) ⟶ gene symbols
- genes(TxDB...)
- MapIDS ⟶ lose some genes right away (3770)
- Just work with interesection...
- Make a SummarizedExperiment
- Import ChIP-seq peaks... don't know what genome they are from...
- 😡
- Thus: reuse is difficult, time consuming
- From a paper: "we performed RNA-seq quantification with Salmon using hg38 and the RefSEq annotation"
coldata = data.frame( files, names, condition, batch )
se = tximeta(coldata)
- uses
BiocFileCache
- matches transcriptome automatically, adds transcript range data and metadata
- Whether or not originaly analyst kept track of it! [Awesome]
- uses
- What's it actually doing?
- Salmon hashes the contents of the transcriptome. Salmon index contains a hash of the transcriptome
- Salmon quant outputs include this hash
txmieta
will hash "all the transcriptomes"- If you post entire quantification directory
linkedTxomes
- Link your custome (e.g. filtered) transcriptome to upstream source, e.g. FASTA sources + GTFs
A junction coverage compatibility score to quantify the reliability of transcript abundance estimates and annotation catalogs (Charlotte Soneson @CSoneson)
"Should be a pre-print any minute"
-
Motivating Example
- ZADH2 data from GTEx. Has four isoforms, distinct 5' and 3' UTRs
- Which isoforms are expressed? Given junction mapping data
- GTEx expression data at this locus does not appear to be consistent with the junction evidence
- Can we detect genes with conflicting isoform expression in an automated fashion
- Which isoforms are expressed? Given junction mapping data
- ZADH2 data from GTEx. Has four isoforms, distinct 5' and 3' UTRs
-
The idea
- Assume: the annotation is complete AND we know the abundance of all transcripts AND we can model library prep and sequencing bias
- Then: should be able to predict the coverage of each genomic region
- In particular the number of reads spanning each exon/exon junction
- Thus: if we can't predict the coverage accurately, one of the assumptions is wrong (e.g. abundances)
-
In practice
- Predict junction coverages from transcript abundance and bias models
- Can estimate in many ways, Salmon kallisto sailfish rsem sringtie, ... whatever
- Predict coverage from transcript coverage and bias models. Bioconductor package Alpine
- https://bioconductor.org/packages/release/bioc/html/alpine.html
- From predicted coverage, weight and combine by transcript ⟶ expected junction coverage
- Consider correlation of observed and predicted junction counts
- Generally high correlation
$r = 0.945$ ,$r_s = 0.903$ - Compatibility score
$JCC_i$ : for each gene$i$ quantify deviation between predicted and observed junction coverage
-
Most genes have low
$JCC$ scores- "Different abundance estimations give different scores, but approximately equally affected"
-
Back to ZADH2
- Read coverage of exons explains why the "wrong" isoform is over-weighted
-
Choice of annotation effects score
- Some genes for which CHESS is better and others for which Ensembl is better.
-
Assembling missing transcripts improves the score
- Run stringtie and include new transcripts in quantification, many more genes are better with StringTie
- [Shouldn't we always do this?]
- Run stringtie and include new transcripts in quantification, many more genes are better with StringTie
-
Summary
- JCC score lets you flag genes for which tx abundance estimates are unreliable (regardless of underlying cause)
- works on a sample by sample basis (don't need replicates)
- requires only RNA-seq data
https://www.biorxiv.org/content/early/2018/06/21/352823
- Two different count types: exon (any annotated exon) and intron (uniquely to intron)
- Rsubread: align using
subjunc
, summarize usingfeatureCounts
- Total RNA: 56% exon, 21% intron; polyA: 69%, 7%
- Rsubread: align using
- Making use on intron reads
- What are they useful for? No consensus... intron retention? Nascent transcription? Noise/useless?
- Intro reads are informative: MDS plots using limma, samples separate by biological groups and library preparation
- Looking at coverage patterns in bins across genes, metagene and individual genes
- Final thoughts
- Majority of expressed genes have both exon and intron signal
- 3' coverage bias may impact intron retention detection
- Lots of intron reads in single cell data... Needs to be looked at
Slides: https://github.com/steinbaugh/presentations/blob/master/2018-07-27/bioc2018.pdf
- Motivation: cell population analysis facilitates cell type (clustering) and differentiation (trajectory, pseudotime) analysis
- Technologies: InDrops, Chromium ("use this one if you can"), DropSeq, Seq-well
- bcbio: best practice pipelines for lots of bioinformatics tasks...
- bcbio single-cell pipeline
- Not in R: bcl2fastq -> umis -> RapMap/Kallisto
- bcbioSingleCell() -> QC: scrater, scran, TRAJ: monocle, STREAM, SPRING, Cluster: SEurat, ...
- Sparse matrix (
Matrix
)SingleCellExperiment
. QC tools, visualization, DE testing
- Sparse matrix (
- bcbio single-cell pipeline
- (Usage examples for the
bcbioSingleCellExperiment
stuff)
gwasurvivr: an R package to perform survival association testing on imputed genetic data (Abbas Rizvi @aarizvi)
- Introduction: what's a GWAS?
- Association between genotype and pheonotype
- Typically using arrays, 750k to 1M markers
- Standard practice is to impute genotypes
- Popular options: IMPUTE2, Minimac3, BEAGLE
- Estimte haploytypes using a reference panel (1000 genomes, etc)
- Popular options: IMPUTE2, Minimac3, BEAGLE
- Survival packages for GWAS.. a bunch
- gwassurvir
- Fast survival analysis
- model SNP by covariate interactions
- Filter SNP imputation/quality metrics
- Leverage existing Bioconductor packages, GWASTools, VariantAnnotation
- || across compute cores, modify existing survival::coxph.fit
- Goal: decrease number of iterations needed for convergence
- (Usage examples, as usual these are too small to read)
- Benchmarking
- Runtime comparison, orders of magnitude faster
- "Highly correlated results"
- [Looks identical? Is correlation the right metric to use here?]
methyvim package
I can't read these slides at all due to the colors ;( bit.ly/bioc_methyvim_2018
- Data: Infinium EPIC methylation arrays; ~850k sites
- Is disease state related to methylation?
- First pass analysis, fit linear model to each site, get
$\beta$ , test significance- Does this answer the right question?
- Sites are not independent, consider effects of neighboring sites
- Approach uses "causual inference and machine learning"
- Filter sites with some evidence of differential methylation and cluster
- Estimate variable importance measure at each site controlling for pattern at neighboring sites
- "and of course we need to correct for multiple testing"
Data Adaptive Evaluation of Preprocessing Methods using Ensemble Machine Learning (Rachael Phillips)
- Introduction
- Arsenic: naturally occuring, acute and chronic health effects
- Question: Qhat is the effect of early life arsenic exposure on CpG methylation in adults
- Data
- 44 blood samples (21 exposed, 23 unexposed)
- EPIC chip, 850k CpGs, batches...
- Pre-processing
- Raw -> QC -> Transform -> Filter -> Batch Correction -> Downstream
- Lots of options for every step... "why not let my data tell me what appraoch might be the best?" "maximize my biological system"
- Raw -> QC -> Transform -> Filter -> Batch Correction -> Downstream
- Data Adaptive Methodology
- Assess using positive control vars (age, exposure, smoking) and negative control variables (batch, date, who processed, ...)
- Zoooooom... but basically maximize (avreage risk) prediction of positive controls, minimize prediction of negatives
- Ensemble learning
- ~10 different base learners, ~10 different normalization strategies
- 5-fold cross validation
- Future
- Defining optimal choice
- Refine screening to be more efficient
- Limitations of data (confounding)
- ...
(END OF TALKS!)
Thanks, James. These notes are excellent and it was great to have you here.