Keles -- Statistical Methods for profiling long range chromatin interactions from repetitive regions of the genome
- Multi-mapping reads (multi-reads) are typically thrown out in many HTS analyses, including Hi-C
- Assays predominantly rely on short reads (50-150 bp), so multi-reads are common
- Using ChIP-seq as an example, incorporating multi-reads finds peaks in regions where "uni-reads" do not
- e.g. Perm-seq using DHS + ChIP-seq data and multi-reads. 27.3% more peaks compared to ENCODE uniform processing pipeline
- How to combine this with Hi-C data?
- Hi-C read processing
- Typical pipelines: singletons, multi-mapping ends, low map quality, and unaligned all discarded
- Evaluation of the impact of this using IMR90 and Plasmodium datasets
- Impact on sequencing depth - high quality multimaps look like ~20% in all cases (interesting, need to check how repeat content varies)
- Restriction fragment filtering (invalid read pairs) makes some multi-reads become single-reads
- Continuing with normal Hi-C processing: Bin -> Raw contact maps -> Normalize -> Identify significant contacts
- Overall impact of including multi-reads: ~5-9% of reads
- Assigning reads that remain multi-reads after all filters needs modeling
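The "Bin -> Raw contact maps" step above can be illustrated with a minimal sketch. It assumes read pairs are already filtered and lie on a single chromosome; `bin_contacts` and the 40 kb bin size are hypothetical choices for illustration, not the speaker's pipeline.

```python
from collections import Counter


def bin_contacts(read_pairs, bin_size):
    """Count read pairs per (bin_a, bin_b), stored with bin_a <= bin_b.

    read_pairs: iterable of (pos_a, pos_b) coordinates on one chromosome
    (a simplifying assumption; real data also carries chromosome names).
    """
    counts = Counter()
    for pos_a, pos_b in read_pairs:
        a, b = pos_a // bin_size, pos_b // bin_size
        counts[(min(a, b), max(a, b))] += 1
    return counts


# Four toy read pairs binned at a hypothetical 40 kb resolution.
pairs = [(1_200, 45_300), (44_000, 3_900), (81_000, 82_500), (10, 39_999)]
raw = bin_contacts(pairs, bin_size=40_000)
```

Storing only the upper triangle keeps the map symmetric by construction; normalization and significance calling would follow as separate steps.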
- Model for Hi-C multi-reads
- Leverage other reads within the same vicinity
- Observed
- $Y_{i,(j,k)} = 1$: valid read pair $i$ aligned to bin pair $(j,k)$; the summation of $Y$ might be more than one for multi-reads
- Hidden
- $Z_{i,(j,k)} = 1$; the summation of $Z$ must be 1
- $Z_i \sim \mathrm{Multinomial}(\pi_{(1,2)}, \ldots, \pi_{(M,M-1)})$; the $\pi$'s get a Dirichlet prior, based on genomic distance between bins
- (Fit-Hi-C like stuff here, I'm probably not capturing it perfectly)
- Fit with EM, get posterior probabilities of read-pairs over each contact bin, threshold to get to counts
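The EM fit and thresholding described above might look roughly like the following toy sketch. It is not the mHiC implementation: the distance-based Dirichlet prior is replaced by optional pseudo-counts (`pseudo`), and the function names and the 0.5 cutoff are assumptions made for illustration.

```python
def em_multiread(candidates, n_iter=50, pseudo=None):
    """Toy EM for allocating multi-reads.

    candidates[i] lists the bin pairs that read pair i aligns to.
    pseudo: optional {bin_pair: pseudo-count}, a stand-in for the prior
    (an assumption; the talk used a distance-based Dirichlet prior).
    Returns (pi, posteriors).
    """
    bins = sorted({b for cands in candidates for b in cands})
    pseudo = pseudo or {}
    pi = {b: 1.0 / len(bins) for b in bins}
    post = []
    for _ in range(n_iter):
        # E-step: posterior over each read pair's candidate bin pairs.
        post = []
        for cands in candidates:
            total = sum(pi[b] for b in cands)
            post.append({b: pi[b] / total for b in cands})
        # M-step: re-estimate pi from expected counts plus pseudo-counts.
        mass = {b: pseudo.get(b, 0.0) for b in bins}
        for p in post:
            for b, w in p.items():
                mass[b] += w
        norm = sum(mass.values())
        pi = {b: m / norm for b, m in mass.items()}
    return pi, post


def threshold_counts(post, cutoff=0.5):
    """Assign each read pair to its best bin pair if the posterior exceeds cutoff."""
    counts = {}
    for p in post:
        best, w = max(p.items(), key=lambda kv: kv[1])
        if w > cutoff:
            counts[best] = counts.get(best, 0) + 1
    return counts


# Three uni-reads support (1, 2), one supports (3, 4), one read pair is ambiguous.
reads = [[(1, 2)], [(1, 2)], [(1, 2)], [(3, 4)], [(1, 2), (3, 4)]]
pi, post = em_multiread(reads)
counts = threshold_counts(post)
```

In this toy case the ambiguous read pair is pulled toward the bin pair that has more uni-read support, which is the "leverage other reads in the same vicinity" idea in miniature.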
- Evaluation
- Number of significant contacts: Always gaining more contacts than losing
- 41% more significant contacts at higher FDR, 31% specific to using multi-reads
- Reproducibility (across replicates)
- Common to Uni and Multi are highly reproducible
- Specific to Multi more reproducible than specific to Uni
- Novel enhancer/promoter interactions: 20.4% more EPIs that are reproducible using multi-reads (not sure how EPIs were called here)
- Beta version "mHiC" available from [email protected]
- Concludes that multi-reads play an even bigger role in Hi-C data (than other data types)
- Future: incorporate multi-mapping into interaction calls
- Questions
- On structural variants, A: incorporate copy number parameter into model
- Do 3D models change when incorporating multi-reads? A: not tested
- (Not able to hear all questions)
- Introducing TADs
- Histone marks around boundaries (peaks and dips)
- Predict TADs from histone marks (from ENCODE)
- Classifiers: SVM, SGD, Random Forest (scikit-learn)
- Extract 1D features from the 2 boundaries of each TAD
- Two ways: Binarized and a continuous strength
- Compute correlation (Pearson) across all marks (?)
- Negative examples:
- Case 1: sample another region of similar length (anywhere on the same chromosome?)
- Case 2: Fix one boundary to real TAD
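The two negative-example cases above could be sampled along these lines. This is a guess at the procedure, assuming negatives are drawn from the same chromosome (the notes leave that open); `negative_examples` and its arguments are hypothetical.

```python
import random


def negative_examples(tads, chrom_len, rng):
    """Return (case1, case2) negative regions for a list of (start, end) TADs.

    Case 1: a region of the same length placed at random on the chromosome.
    Case 2: the real start boundary paired with a random end position.
    """
    real = set(tads)
    case1, case2 = [], []
    for start, end in tads:
        length = end - start
        while True:  # resample if we happen to hit a real TAD
            s = rng.randrange(0, chrom_len - length)
            if (s, s + length) not in real:
                case1.append((s, s + length))
                break
        while True:
            e = rng.randrange(start + 1, chrom_len)
            if (start, e) not in real:
                case2.append((start, e))
                break
    return case1, case2


tads = [(100_000, 350_000), (400_000, 650_000)]
case1, case2 = negative_examples(tads, chrom_len=1_000_000, rng=random.Random(0))
```

Case 2 negatives share a true boundary with a positive example, so they are presumably the harder test for a classifier built on boundary features.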
- Single cell line test, 5 fold cross validation AUC ~0.9 (SVD)
- CTCF is most important feature
- Does this generalize across cell lines? AUC ~0.9 on held out cell line (RF)
- "Bag of boundaries" approach
- Hold out one cell line and train on the bag of boundaries
- predict bag of boundaries with held-out features
- "Enables TAD prediction in new cell line"
- "Limited predictive power"
- Basically, you get a set of boundaries from other cell lines and then predict whether they form a TAD in a new cell line
- So, any boundaries that are new in that cell line will not be predicted
- From questions (important) the TADs here are 200-300kb domains (WHAT IS A TAD?)
Jacob Schreiber -- Nucleotide sequence and DNaseI sensitivity are predictive of 3D chromatin architecture
- Introducing Hi-C, distance effect, Fit-Hi-C, Splines, nulls, outliers
- Would like to predict interactions (and such) without doing Hi-C, reduce cost / inform genetic basis
- What features? DNA sequence and DNase hypersensitivity
- What training data? 82M Fit-Hi-C contacts with q <= 0.1 (1kb on genome), 1-hot encoded DNA, binarized DHS signal
- "Obvious classifier is neural networks"
- DNA network C/P/C/P
- DNase P/C
- Combine and then C/C/P
- Combine two arms + distance, D/D/Predict
- Looks like AUC ~.8, better than genomic distance and other single features
- To validate in cell types with lower resolution Hi-C data, convert from 1kb to 5kb resolution (how?)
- Performs better than genomic distance or using GM12878 contact map in other cell types
- Predictions can recreate insulation score (or anyway, good correlations), also good correlations with replication timing
- Predicted structures cluster by cell function
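For the input features mentioned in the training-data bullet above (1-hot encoded DNA, binarized DHS signal), a minimal sketch of the encoding follows. The A/C/G/T column order, the all-zero row for ambiguous bases, and the threshold value are assumptions, not details from the talk.

```python
def one_hot(seq):
    """Encode a DNA string as rows of [A, C, G, T]; unknown bases become all zeros."""
    table = {
        "A": [1, 0, 0, 0],
        "C": [0, 1, 0, 0],
        "G": [0, 0, 1, 0],
        "T": [0, 0, 0, 1],
    }
    return [table.get(base, [0, 0, 0, 0]) for base in seq.upper()]


def binarize(signal, threshold):
    """Turn a per-position DHS signal into 0/1 using a hypothetical threshold."""
    return [1 if value > threshold else 0 for value in signal]


encoded = one_hot("acgn")
dhs = binarize([0.2, 3.1, 0.0, 5.5], threshold=1.0)
```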
- Introducing distal gene regulation, TAD disruption in disease, Hi-C contact maps
- Hi-C-Reg: regression approach for predicting contact counts
- Extract aggregated histone mark and DHS signal across cell lines
- Pair regions (?)
- Predict contact count for pair regions
- RF regression correlation of ~.83
- Window features (I believe these are the features from the genomic region between the endpoints) are "very helpful" for improving predictions
- Window features also important for capturing domains
- Picture of predicted map. No taddy domains without the window feature (or reduced), more clear with window features
- Ensemble: average training across cell lines. As good or better than cross-cell line predictions
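One way to read the pair and window features above is sketched below: each region pair gets the signal at its two endpoints plus an aggregate over the bins in between. This follows the note-taker's interpretation of "window features"; the function name, the use of a mean, and the feature keys are assumptions.

```python
def pair_features(signal, i, j):
    """Features for the bin pair (i, j) from one 1D signal track.

    signal: per-bin values for a single mark (e.g. a histone mark or DHS).
    The window is the set of bins strictly between the two endpoints.
    """
    lo, hi = sorted((i, j))
    window = signal[lo + 1:hi]
    window_mean = sum(window) / len(window) if window else 0.0
    return {
        "left": signal[lo],
        "right": signal[hi],
        "window": window_mean,
        "distance": hi - lo,
    }


track = [1.0, 4.0, 2.0, 6.0, 3.0]
far = pair_features(track, 4, 0)
adjacent = pair_features(track, 1, 2)
```

A regression model would then be trained on such feature dictionaries, one per mark, concatenated for each pair.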
end of day 1
- Three parts:
- Mapping RNA interactome in vivo (MARIO)
- Mapping RNA-genome interactions (MARGI)
- Protocol
- Cross-link RNA/protein complex
- Attach to surface
- Add linker DNA to 5' end of RNA
- Double ligate to RNA-biotin+linker+RNA
- RT into DNA complement of chimera and sequence
- Advantages
- Unbiased selection
- Applicable to human tissue
- Risk
- Random ligation of RNA
- Mitigate: extreme wash conditions, large distance between complexes on surface
- Output
- Paired-end reads with ends mapping anywhere on the genome (hopefully in a known RNA locus)
- Use to create pairwise interaction network
- Validation
- Test co-localization with single molecule RNA imaging (two color labeling)
- Appears to validate (a few images shown)
- Protocol
- Similar idea, RNA/DNA complex tethered on solid surface, add to RNA a ss/ds adapter, enriches for RNA/DNA interactions
- How to determine which side was originally RNA vs DNA? phase the linker so the junction is very specific
- Circularize/linearize to ensure the linker remains in the read
- (How efficient is all this?)
- Purify, RT, amplify, sequence
- Figure: a bipartite genome browser showing links between RNA ends and DNA ends
- What are the chromatin interaction non-coding RNAs?
- snoRNA (~200 genes), miRNA (~100), misc, antisense, miRNA, pseudo, (big drop), linc (~20), proc transc
- Where do they interact?
- 80% proximal, 4% cis-distal, 16% trans
- Distal and trans accumulate at TSSs, and density appears correlated with expression level
- Reverse correlation between RNA attachment and H3K9me3, but no correlation with H3K4me3/H3K27ac (scatterplots but no correlation quantification here) (also showing RNA attachment peaks have high H3K9me3)
- Oooooooooh... secret stuff here.
- Motivation: dynamics of regulatory networks in lineages (either cell lineage or species phylogeny), how do networks change
- What controls cell-type specific regulation
- Computational tools I: Comparing 3D organization across cell types / species
- A graph is a natural representation of a Hi-C dataset: regions == nodes, interaction strengths == edges
- Does graph clustering help? (to identify structures/domains)
- Spectral clustering
- Adjacency Matrix -> Laplacian -> Eigen decomp -> k-means
- Assessment: how good are the clusters? enrichment of genomic signals?
- Spectral clustering tends to do better on different measures (compared to hierarchical and k-means)
- Spectral clustering of Hi-C data for human ESCs -- 10 clusters
- Two types: 1) associated with chromatin marks, 2) associated with LADs and gene poor
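The spectral clustering pipeline above (adjacency matrix, Laplacian, eigendecomposition, clustering) can be shown in a reduced two-cluster form. This sketch swaps in simpler pieces and should not be read as the speaker's method: it uses the unnormalized Laplacian, finds only the Fiedler vector by shifted power iteration, and splits by sign instead of running k-means. `fiedler_bipartition` is a hypothetical name.

```python
import random


def fiedler_bipartition(adj, n_iter=500, seed=0):
    """Split a weighted graph in two using the sign of the Fiedler vector.

    adj: symmetric matrix (list of lists) of interaction strengths.
    Power iteration on (shift*I - L) with the constant vector projected out
    converges to the eigenvector of the second-smallest eigenvalue of L = D - W.
    """
    n = len(adj)
    degree = [sum(row) for row in adj]
    shift = 2.0 * max(degree)  # upper bound on the eigenvalues of L
    rng = random.Random(seed)
    v = [rng.random() for _ in range(n)]
    for _ in range(n_iter):
        mean = sum(v) / n
        v = [x - mean for x in v]  # remove the trivial constant eigenvector
        lv = [
            degree[i] * v[i] - sum(adj[i][j] * v[j] for j in range(n))
            for i in range(n)
        ]
        v = [shift * v[i] - lv[i] for i in range(n)]
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
    return [0 if x < 0 else 1 for x in v]


# Two tightly connected groups of regions joined by one weak interaction.
adj = [
    [0, 5, 5, 0, 0, 0],
    [5, 0, 5, 0, 0, 0],
    [5, 5, 0, 1, 0, 0],
    [0, 0, 1, 0, 5, 5],
    [0, 0, 0, 5, 0, 5],
    [0, 0, 0, 5, 5, 0],
]
labels = fiedler_bipartition(adj)
```

For the 10-cluster result in the notes one would take several eigenvectors and cluster them with k-means; the sign split is only the simplest special case.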
- Arboretum for clustering regulatory networks across species (existing work), how to adapt for Hi-C
- Graph combines orthology maps (trees) for regions (genes) and interactions of regions (genes?) within each species
- Assert: chromatin organization is more similar within species than between
- Algorithm gives conserved signatures in matched clusters
- Chromatin organization is conserved -- changes in clusters are between clusters of the same type (these are the two types from earlier)
- Summary: Graph based methods may be more effective, Arboretum-Hi-C allows comparison of related datasets, organization is conserved across species
- Tools II: Chromatin state dynamics across cell lineages
- Data: characterizing chromatin state during reprogramming (MEF->IPS-C->IPS) 5+ marks and 3- marks.
- Chromatin module: group of genomic loci that have the same chromatin state (where state == same chromatin marks)
- What are the modules in each cell type and how do they transition?
- CMINT: Chromatin Module INference on Trees: each module is a MV gaussian, group is mixture
- Chromatin state during reprogramming defined by 15 different patterns (each is labeled by one dominant histone mark)
- Transitions between modules
- e.g. chromatin transition states of Oct4 -- switched completely in iPSC but not completely in pre-
- Conclusions
- Chromatin state can be studied in 1D and 3D
- Predict EPIs
- Compare chromatin state and 3D state across cell types and species
I think we were just asked not to post (unpublished work)
- Challenges in current Hi-C data
- Expensive (10 day protocol), 6+ billion reads for kb resolution
- Most datasets are low (40kb) resolution; too low to infer EPIs
- Introducing deep learning for resolution enhancement in the context of images
- Convolutional net: Low res -> Feature extraction (low) -> Fully connected mapping -> High resolution features -> Output
- Chromatin interactions are predictable from neighboring regions, hence can impute
- Training on chromosomes 1-17, test on 18: Local average has correlation ~0.8, larger matrices do better (decays with distance)
- Prediction
- Down-sample to 1/16 of reads, create Hi-C map (noisy)
- Enhanced matrix is highly similar to original matrix -- enhanced is very close to correlation with biological replicate (probably a ceiling on performance)
- Across cell types: train on GM12878, IMR90, K562 all perform well in correlation measures
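The down-sampling step above (keeping 1/16 of the reads) can be mimicked directly on a contact matrix by binomial thinning. This is a sketch under the assumption that thinning counts is equivalent to subsampling reads before binning; `downsample` is a hypothetical helper, not part of HiCPlus.

```python
import random


def downsample(matrix, fraction, rng):
    """Keep each read independently with probability `fraction`.

    matrix: symmetric integer contact matrix. Only the upper triangle is
    thinned and then mirrored, so the result stays symmetric.
    """
    n = len(matrix)
    out = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            kept = sum(1 for _ in range(matrix[i][j]) if rng.random() < fraction)
            out[i][j] = out[j][i] = kept
    return out


high_depth = [[1600, 800], [800, 1600]]
low_depth = downsample(high_depth, fraction=1 / 16, rng=random.Random(1))
```

Pairs of (low_depth, high_depth) sub-matrices generated this way are the kind of training examples a resolution-enhancement network would need.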
- Identification of interactions in HiCPlus enhanced matrices
- Enhanced and hi-res recover similar numbers of interactions, low-depth misses many
- Recovers 50% of peaks from ChIA-PET, most missed in low-resolution Hi-C, similar results with Capture Hi-C data
- Summary
- Convolutional net to impute hi-res from low-res Hi-C
- Works with 1/16 to 1/25 depth
- Will be available at ("If you haven't tried it try it, it's very fast")
- Existing work: Fit-Hi-C: Assigning statistical confidence estimates to chromatin contact maps
- Software available in Python (more scalable) and R
- Captures 3C validated cell specific enhancer-promoter contacts
- Model works for other chromatin conformation capture assays (e.g. PLAC-seq)
- Three distinct diseases
- Malaria
- Plasmodium falciparum
- 3D reconstruction of genome. Centromeres colocalize in 3D, so do telomeres
- Virulence gene clusters also colocalize
- Plasmodium has 60 var genes (isoforms of same gene), at ends of chromosomes, exactly one expressed per cell
- Colocalization confirmed using DNA FISH
- Genes close together have correlated expression profiles
- Telomeres have a repressive effect (closer you are to telomere in 3d lower expression)
- H2A.Z is depleted in these regions
- Newer work
- Major changes in genome organization between transmission stages (PCA plot stratifying stages shown)
- Gametocyte specific super-domain formation (similar to X chromosome inactivation)
- Other parasites (vivax, knowlesi [monkey], Toxoplasma [feline], yoelii, berghei [mus])
- Other plasmodium show telomeres with low expression, opposite phenomenon (centromeres) in Toxo?
- Asthma
- GWAS locus known on chr17 (and nearly every other immune disorder)
- Increased ORMDL3 expression in 17q21 locus when asthma risk variants present
- SNPs overlap DHS, switch a CTCF binding site
- Risk allele creates interactions with open chromatin sites far away
- Gene normally has a nearby enhancer, which is lost with the risk allele; this creates an interaction in a different place == aberrant expression
- Cancer
- Chromosomal rearrangements of all kinds common in cancer (and transformed cell lines)
- HiCtrans: detect chromosomal translocations from Hi-C data (ENCODE 3D Nucleome group)
- HiCnv: detect copy number variations
Abhijit Chakraborty -- A versatile pipeline to simulate Hi-C data with genomic rearrangements (AVeSim)
- Hi-C provides "clue" to find genomic alterations (rearrangements of various types)
- Use CNV information to setup a Hi-C matrix simulation pipeline
- (I think the point here is that local amplifications / copy number variation can be seen in Hi-C matrices under simulation)
- Two different simulation approaches: random counts and scaled observed counts
- Lots of examples of how different events look in simulated Hi-C maps
- Preprint available and code on github
- Choco: Predicting Chromosomal Organization using Constraints
- Hypothesis: Use constraint logic programming (ECLiPSe;
- Finding a scalable representation of the "3D genome reconstruction problem" in CLP.
- Model organism: Yeast (small genome, haploid, etc)
- For each row of interaction map, select one representative cell
- So, list with N column numbers and list with N frequency values
- Intra and inter problems performed independently (1, 2, 3 in cis, 1+2, 2+3, 1+3 trans)
- Cytoscape visualization
- Centromeres and telomeres are clustered (more validation experiments in progress)
Tao Yang -- HiCRep: assessing the reproducibility of Hi-C data using a stratum-adjusted correlation coefficient
- Pearson correlation as a metric for reproducibility may be misleading
- Sometimes non-pairs have higher correlation
- Distance dependence affects (really, dominates) the correlation
- Steps
- Smoothing
- 2d mean filter
- (Can't read the equation but looks like a 2D local average. weighted?)
- Stratification and aggregation
- Analogy to CMH statistic (
- Stratum-adjusted correlation coefficient
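The steps above (smoothing, stratification by distance, aggregation) can be sketched as follows. This is not the hicrep package: the unweighted mean filter and the stratum weights (number of pairs times the product of the two standard deviations) are assumptions based on the general idea, and the talk's exact formula could not be read from the slides.

```python
def mean_filter(m, h):
    """Replace each entry by the unweighted mean of its (2h+1) x (2h+1) neighbourhood."""
    n = len(m)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            vals = [
                m[a][b]
                for a in range(max(0, i - h), min(n, i + h + 1))
                for b in range(max(0, j - h), min(n, j + h + 1))
            ]
            out[i][j] = sum(vals) / len(vals)
    return out


def scc(m1, m2, max_dist):
    """Weighted average of per-distance Pearson correlations (assumed weights)."""
    n = len(m1)
    num = den = 0.0
    for d in range(1, max_dist + 1):
        xs = [m1[i][i + d] for i in range(n - d)]
        ys = [m2[i][i + d] for i in range(n - d)]
        k = len(xs)
        if k < 2:
            continue
        mx, my = sum(xs) / k, sum(ys) / k
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / k
        vx = sum((x - mx) ** 2 for x in xs) / k
        vy = sum((y - my) ** 2 for y in ys) / k
        scale = (vx * vy) ** 0.5
        if scale == 0:
            continue  # a constant stratum carries no correlation information
        num += k * cov        # equals weight * rho with weight = k * scale
        den += k * scale
    return num / den if den else float("nan")


# Two toy maps sharing the distance decay but with opposite local structure.
size = 6
m = [[10.0 / (1 + abs(i - j)) + (i + j) % 3 for j in range(size)] for i in range(size)]
m2 = [[10.0 / (1 + abs(i - j)) - (i + j) % 3 for j in range(size)] for i in range(size)]
same = scc(m, m, max_dist=3)
opposite = scc(m, m2, max_dist=3)
smoothed = mean_filter([[1, 1], [1, 1]], 1)
```

Because each stratum is centred separately, the shared distance decay no longer inflates the score, which is the problem with plain Pearson correlation raised at the start of this talk.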
- Evaluation
- SCC differentiates pseudo-replicate pairs, biological replicate pairs, and non-replicate
- Differentiates biological replicates from non-replicates
- SCC + clustering allows reconstruction of the true relationship between cells
- hicrep on github
- SOM: grid of nodes in output space, each of which maps to a point in data space (constrained by the grid, sort of a manifold)
- Example: 50x50 grid, trained on inter-chromosomal interaction frequencies
- Each node contains a set of genomic loci
- Genomic datasets can then be projected on the map
- Projected six sub compartments onto map
- Compartments are non overlapping but are split up on the map
- Chromosomes cluster on map
- Project epigenetic marks, etc
- Use Gini coefficient for measuring level of segregation on the map
- CTCF and cohesin friends are low on segregation scale, znf??? is high (probably znf274)
- Compartment associations with histone marks from Rao et al. 2014 are recapitulated in the maps
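The Gini coefficient used above as a segregation measure can be computed as in this sketch. It assumes the input is a list of non-negative per-node signal totals for one projected dataset; how the talk actually aggregated signal per SOM node is not recorded in these notes.

```python
def gini(values):
    """Gini coefficient of non-negative values: 0 is even, near 1 is concentrated."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((rank + 1) * x for rank, x in enumerate(xs))
    return (2.0 * weighted) / (n * total) - (n + 1.0) / n


even_mark = gini([1, 1, 1, 1])        # signal spread evenly over map nodes
segregated_mark = gini([0, 0, 0, 4])  # signal concentrated in one node
```

Under this reading, marks like CTCF that are spread across the map would score low, and a mark confined to a few nodes would score high.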
