@arq5x
Created September 14, 2012 00:55
ENCODE consensus segmentations
# 1. Get the ENCODE segmentations from EBI.
# consensus
wget http://ftp.ebi.ac.uk/pub/databases/ensembl/encode/awgHub/byDataType/segmentations/jan2011/gm12878.combined.bb
wget http://ftp.ebi.ac.uk/pub/databases/ensembl/encode/awgHub/byDataType/segmentations/jan2011/h1hesc.combined.bb
wget http://ftp.ebi.ac.uk/pub/databases/ensembl/encode/awgHub/byDataType/segmentations/jan2011/helas3.combined.bb
wget http://ftp.ebi.ac.uk/pub/databases/ensembl/encode/awgHub/byDataType/segmentations/jan2011/hepg2.combined.bb
wget http://ftp.ebi.ac.uk/pub/databases/ensembl/encode/awgHub/byDataType/segmentations/jan2011/huvec.combined.bb
wget http://ftp.ebi.ac.uk/pub/databases/ensembl/encode/awgHub/byDataType/segmentations/jan2011/k562.combined.bb
# Segway (ahem; https://twitter.com/michaelhoffman/status/246679147164880897)
wget http://ftp.ebi.ac.uk/pub/databases/ensembl/encode/awgHub/byDataType/segmentations/jan2011/gm12878.segway.bb
wget http://ftp.ebi.ac.uk/pub/databases/ensembl/encode/awgHub/byDataType/segmentations/jan2011/h1hesc.segway.bb
wget http://ftp.ebi.ac.uk/pub/databases/ensembl/encode/awgHub/byDataType/segmentations/jan2011/helas3.segway.bb
wget http://ftp.ebi.ac.uk/pub/databases/ensembl/encode/awgHub/byDataType/segmentations/jan2011/hepg2.segway.bb
wget http://ftp.ebi.ac.uk/pub/databases/ensembl/encode/awgHub/byDataType/segmentations/jan2011/huvec.segway.bb
wget http://ftp.ebi.ac.uk/pub/databases/ensembl/encode/awgHub/byDataType/segmentations/jan2011/k562.segway.bb
# ChromHMM
wget http://ftp.ebi.ac.uk/pub/databases/ensembl/encode/awgHub/byDataType/segmentations/jan2011/gm12878.ChromHMM.bb
wget http://ftp.ebi.ac.uk/pub/databases/ensembl/encode/awgHub/byDataType/segmentations/jan2011/h1hesc.ChromHMM.bb
wget http://ftp.ebi.ac.uk/pub/databases/ensembl/encode/awgHub/byDataType/segmentations/jan2011/helas3.ChromHMM.bb
wget http://ftp.ebi.ac.uk/pub/databases/ensembl/encode/awgHub/byDataType/segmentations/jan2011/hepg2.ChromHMM.bb
wget http://ftp.ebi.ac.uk/pub/databases/ensembl/encode/awgHub/byDataType/segmentations/jan2011/huvec.ChromHMM.bb
wget http://ftp.ebi.ac.uk/pub/databases/ensembl/encode/awgHub/byDataType/segmentations/jan2011/k562.ChromHMM.bb
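# The 18 downloads above follow one pattern: <cell line>.<set>.bb under the same
# base URL. A minimal sketch that generates the same URLs with two nested loops.
# The names `base` and `list_segmentation_urls` are hypothetical helpers, not part
# of the original recipe.

```shell
# Base URL copied from the wget commands above.
base=http://ftp.ebi.ac.uk/pub/databases/ensembl/encode/awgHub/byDataType/segmentations/jan2011

# Hypothetical helper: print one URL per (segmentation set, cell line) pair.
list_segmentation_urls() {
    for method in combined segway ChromHMM; do
        for cell in gm12878 h1hesc helas3 hepg2 huvec k562; do
            echo "$base/$cell.$method.bb"
        done
    done
}

# Nothing is downloaded by this block; to fetch the files, pipe the list to wget:
#   list_segmentation_urls | wget -i -
```

# Printing the URLs first makes it easy to eyeball the list before fetching anything.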
# 2. Make BEDGRAPHs of the ENCODE segmentation BigBeds
for bigbed in *.bb
do
    # Keep chrom, start, end, and the segment label (columns 1-4).
    bigBedToBed "$bigbed" stdout | cut -f 1-4 | bgzip > "$bigbed.bedg.gz"
done
# 3. Use bedtools to make the union of each ENCODE segmentation set.
# That is, make one file for the consensus segmentations covering all 6 cell
# lines, another for Segway, and another for ChromHMM.
bedtools unionbedg -i gm12878.combined.bb.bedg.gz \
h1hesc.combined.bb.bedg.gz \
helas3.combined.bb.bedg.gz \
hepg2.combined.bb.bedg.gz \
huvec.combined.bb.bedg.gz \
k562.combined.bb.bedg.gz \
-header \
-names gm12878 \
h1hesc \
helas3 \
hepg2 \
huvec \
k562 \
-filler unknown \
| bgzip \
> encode.6celltypes.consensus.bedg.gz
bedtools unionbedg -i gm12878.segway.bb.bedg.gz \
h1hesc.segway.bb.bedg.gz \
helas3.segway.bb.bedg.gz \
hepg2.segway.bb.bedg.gz \
huvec.segway.bb.bedg.gz \
k562.segway.bb.bedg.gz \
-header \
-names gm12878 \
h1hesc \
helas3 \
hepg2 \
huvec \
k562 \
-filler unknown \
| bgzip \
> encode.6celltypes.segway.bedg.gz
bedtools unionbedg -i gm12878.ChromHMM.bb.bedg.gz \
h1hesc.ChromHMM.bb.bedg.gz \
helas3.ChromHMM.bb.bedg.gz \
hepg2.ChromHMM.bb.bedg.gz \
huvec.ChromHMM.bb.bedg.gz \
k562.ChromHMM.bb.bedg.gz \
-header \
-names gm12878 \
h1hesc \
helas3 \
hepg2 \
huvec \
k562 \
-filler unknown \
| bgzip \
> encode.6celltypes.ChromHMM.bedg.gz
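# The three unionbedg calls above differ only in the segmentation set name. A
# sketch of a dry-run helper that assembles the same command line for a given set.
# `cells` and `union_cmd` are hypothetical names, and the helper only prints the
# command; it does not run bedtools.

```shell
# Cell lines in the same order as the -names arguments above.
cells="gm12878 h1hesc helas3 hepg2 huvec k562"

# Hypothetical helper: print the unionbedg command for one set
# (combined, segway, or ChromHMM).
union_cmd() {
    method=$1
    files=""
    for cell in $cells; do
        files="$files $cell.$method.bb.bedg.gz"
    done
    echo "bedtools unionbedg -i$files -header -names $cells -filler unknown"
}

# Example: inspect the command for the Segway set before running it.
union_cmd segway
```

# Note that the "combined" inputs are written to a file named "consensus" above,
# so the output name would still need to be chosen per set.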
# 4. Take a peek (gzcat is for OSX; use zcat elsewhere).
#
# Glossary for ENCODE chromatin segment predictions. Taken verbatim from Table 3
# of doi:10.1038/nature11247
# CTCF: CTCF-enriched element
# E: Predicted enhancer
# PF: Predicted promoter flanking region
# R: Predicted repressed or low-activity region
# TSS: Predicted promoter region including TSS
# T: Predicted transcribed region
# WE: Predicted weak enhancer or open chromatin cis-regulatory element
# unknown: added by us via `-filler unknown`; marks intervals where a cell line
#          has no segment in its input file
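# For decoding labels while browsing the consensus file, the glossary above can be
# wrapped in a small lookup. This is only a sketch: `describe_label` is a
# hypothetical helper, and it covers the consensus labels listed above, not the
# Segway or ChromHMM label sets.

```shell
# Hypothetical helper: map a consensus label to its glossary description.
describe_label() {
    case "$1" in
        CTCF)    echo "CTCF-enriched element" ;;
        E)       echo "Predicted enhancer" ;;
        PF)      echo "Predicted promoter flanking region" ;;
        R)       echo "Predicted repressed or low-activity region" ;;
        TSS)     echo "Predicted promoter region including TSS" ;;
        T)       echo "Predicted transcribed region" ;;
        WE)      echo "Predicted weak enhancer or open chromatin cis-regulatory element" ;;
        unknown) echo "No segment for this cell line (filler value)" ;;
        *)       echo "unrecognized label: $1" >&2; return 1 ;;
    esac
}

# Example lookup.
describe_label WE
```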
(gzcat encode.6celltypes.consensus.bedg.gz | head -1; gzcat encode.6celltypes.consensus.bedg.gz | \
awk 'NR >= 100000 && NR <= 100005')
chrom start end gm12878 h1hesc helas3 hepg2 huvec k562
chr1 21710600 21710800 WE CTCF CTCF R R CTCF
chr1 21710800 21711000 WE R R R R R
chr1 21711000 21711200 WE unknown R R R R
chr1 21711200 21711298 WE unknown R R R unknown
chr1 21711298 21711400 WE unknown R R R WE
chr1 21711400 21711468 WE R R R R WE
(gzcat encode.6celltypes.segway.bedg.gz | head -1; gzcat encode.6celltypes.segway.bedg.gz | \
awk 'NR >= 100000 && NR <= 100005')
chrom start end gm12878 h1hesc helas3 hepg2 huvec k562
chr1 5103241 5103270 Low5 Low7 Low3 Low3 Low5 Low2
chr1 5103270 5103289 Low5 Low7 Low3 Low3 Low5 Low6
chr1 5103289 5103299 Low5 Low7 Low5 Low3 Low5 Low6
chr1 5103299 5103381 Low5 Low7 Low5 Low3 Quies Low6
chr1 5103381 5103388 Low5 Low7 Low1 Low3 Quies Low6
chr1 5103388 5103389 Low5 Low1 Low1 Low3 Quies Low6
(gzcat encode.6celltypes.ChromHMM.bedg.gz | head -1; gzcat encode.6celltypes.ChromHMM.bedg.gz | \
awk 'NR >= 100000 && NR <= 100005')
chrom start end gm12878 h1hesc helas3 hepg2 huvec k562
chr1 36245000 36245600 EnhF Quies Low Quies Quies Low
chr1 36245600 36245800 EnhWF Quies Low Quies Quies Low
chr1 36245800 36251000 Low Quies Low Quies Quies Low
chr1 36251000 36251800 Low Quies Low Low Quies Low
chr1 36251800 36252200 Low Quies Low Low Quies EnhWF
chr1 36252200 36252400 Low Quies H4K20 Low Quies EnhWF
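# Beyond peeking at a few rows, a quick per-cell-line tally of labels gives a
# feel for each union file. A minimal sketch with awk, assuming the column layout
# shown above (columns 4-9 hold the six cell lines, line 1 is the header).
# `tally_labels` is a hypothetical helper name.

```shell
# Hypothetical helper: count how often each label occurs in one column.
# Reads a union file on stdin, skips the header line, prints "label count".
tally_labels() {
    awk -v col="$1" 'NR > 1 { n[$col]++ } END { for (label in n) print label, n[label] }' \
        | LC_ALL=C sort
}

# Example on the six consensus rows shown above; column 9 is k562.
printf '%s\n' \
    "chrom start end gm12878 h1hesc helas3 hepg2 huvec k562" \
    "chr1 21710600 21710800 WE CTCF CTCF R R CTCF" \
    "chr1 21710800 21711000 WE R R R R R" \
    "chr1 21711000 21711200 WE unknown R R R R" \
    "chr1 21711200 21711298 WE unknown R R R unknown" \
    "chr1 21711298 21711400 WE unknown R R R WE" \
    "chr1 21711400 21711468 WE R R R R WE" \
    | tally_labels 9
```

# On a real file the same helper would be fed from gzcat (or zcat), for example:
#   gzcat encode.6celltypes.consensus.bedg.gz | tally_labels 9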
# 5. tabix the 6-way segmentation maps for use within gemini.
tabix -p bed encode.6celltypes.consensus.bedg.gz
tabix -p bed encode.6celltypes.segway.bedg.gz
tabix -p bed encode.6celltypes.ChromHMM.bedg.gz