Skip to content

Instantly share code, notes, and snippets.

@arq5x
Last active December 15, 2015 07:09
Show Gist options
  • Save arq5x/5221518 to your computer and use it in GitHub Desktop.
Save arq5x/5221518 to your computer and use it in GitHub Desktop.

Genome arithmetic for data exploration

The goal of today's practical session is to get your hands dirty with bedtools. We will be studying ChiP-seq data from three different cell types. Each cell type was assayed for H3K27ac. Our research goal is to understand and explore the similarities and differences between the ChIP peaks observed in the 3 different cell types.

Useful links

The data

http://dl.dropbox.com/u/515640/BIOCH508/H3K27a_HPC7_final.bed http://dl.dropbox.com/u/515640/BIOCH508/H3K27a_HPC7-derived_final.bed http://dl.dropbox.com/u/515640/BIOCH508/H3K27a_HPC7-Primed_final.bed

Annotations:

  • Genes: /data/slib/quinlan-bedtools-session/data/annotations/gencode.v14.bed

  • Conservation: /data/slib/quinlan-bedtools-session/data/annotations/conservation.bed

The tools

bedtools is installed on Franklin at /data/slib/quinlan-bedtools-session/bin/. Use the full PATH to the executable.

Examples:

bedtools --help

bedtools intersect -a a.bed -b b.bed

bedtools intersect -h

bedtools multiinter -i *.bed

bedtools jaccard -a a.bed -b b.bed

bedtools closest -a peaks.bed -b genes.bed

Today's experiments

Visualize your data in the UCSC Genome Browser

  1. Download the data (above) to your laptop
  2. Upload the data to UCSC.
  3. Explore the data.
  4. How similar are the peaks between the cell types?
  5. Can you make an estimate as to how similar (that is, how much overlap) the peaks from each cell type are?

How MANY peaks overlap between each file?

  1. Hints: use bedtools; what is another word for overlap?

What about a summary statistic summarizing the similarity?

  1. http://en.wikipedia.org/wiki/Jaccard_index
  2. bedtools has a jaccard metric as well.

How do we find all intervals that are common among the 3 files?

  1. bedtools multiinter. Have a look at the -examples, -header, and -names options.
  2. Examples here
  3. Hint: the fourth column tells you how many datasets had a peak for each interval.
  4. awk '$4 == 3' will identify all intervals where all three cell types had a peak.
  5. The awk example above is the same as perl -ane 'if ($F[3] == 3) {print}'. Note that awk` uses 1-based column numbers whereas Perl uses 0-based.

How do we find all intervals that are private to a single cell type?

  1. Focus on H3K27a_HPC7-Primed_final.bed
  2. awk is still your friend...what column?

Which genes are closest to peaks found in all 3 cells?

Genes: /data/slib/quinlan-bedtools-session/data/annotations/gencode.v14.bed

What fraction of the peaks found in all cells are conserved? Is the fraction significant?

Conservation: /data/slib/quinlan-bedtools-session/data/annotations/conservation.bed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment