Last active
March 3, 2025 07:35
-
-
Save taoliu/5772f4b44767ff07b85ba11c42a57b78 to your computer and use it in GitHub Desktop.
Merge multiple peak files (BED format) then call the consensus regions
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
#!/bin/bash | |
# | |
# This script will find the consensus peak regions from peak files (in | |
# BED format) of multiple samples by: | |
# | |
# 1. Converting the peak file of each sample into non-overlapping 3 | |
# cols BED file and concatenating them; | |
# | |
# 2. Sorting the concatenated file and Building a genome coverage | |
# track in BedGraph, of which the value (the 3rd col) indicates the | |
# number of samples with a peak at a particular location; | |
# | |
# 3. Using MACS to call regions being covered by peaks from more than | |
# a certain number of samples. | |
# ------------------------------------------------------------------ | |
# Modify the following parameters | |
# | |
# define the peaks from multiple samples | |
SAMPLE_PEAKS=`ls *.narrowPeak` | |
# define the CUTOFF | |
CUTOFF=3 | |
# define the minlen and maxgap for peak calling (arbitrary) | |
MINLEN=200 | |
MAXGAP=30 | |
# define the genome file with the 1st column as chromosome name and | |
# the 2nd column as chromosome length | |
GENOME=chrom.len | |
# define the OUTPUT_PREFIX | |
OUTPUT_PREFIX="consensus_peaks" | |
# ------------------------------------------------------------------ | |
# 1 | |
I=$SAMPLE_PEAKS | |
O=${OUTPUT_PREFIX}.all.bed | |
rm -f $O | |
touch $O | |
for f in $I; do | |
bedtools sort -i $f | bedtools merge -i - | cut -f 1,2,3 >> $O; | |
done | |
# 2 | |
I=${OUTPUT_PREFIX}.all.bed | |
O=${OUTPUT_PREFIX}.bdg | |
bedtools sort -i $I | bedtools genomecov -bga -i - -g $GENOME > $O | |
#3 | |
I=${OUTPUT_PREFIX}.bdg | |
O=${OUTPUT_PREFIX}.consensus.bed | |
macs2 bdgpeakcall -i $I -o $O --no-trackline -c $CUTOFF -g $MAXGAP -l $MINLEN | |
# end | |
echo "All done. Check ${O}" |
bedtools intersect
can give you overlap regions between one file (-a) and one or more files (-b). However, it's hard to implement the same idea as shown in this script -- so that we can get common regions from e.g. any 3 out of 4 samples. Maybe I don't know the tricks of bedtools
. If you have some insight, please share with me :)
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Hello!
Very helpful script, thank you.
I have some, pearhaps naive, questions about tje consensus peaks. Say, for example, I have 4 replicate samples and I manually find the overlapping peaks between them and keep the peaks with overlap in at least 3 of the 4 samples (using something like bedtools intersect).
Do you think those will be different than the consensus peaks found using your script?
Also, say I have found consensus peaks across replicates for a TF ChIP in control versus stimulation conditions. I now want to get differential peaks between control and stimulation with the MAnorm tool. The tool requires that I provide the peak files as well as the read files. Can I use the consensus peak file of each condition with the read file produced by merging all the initial replicate bam files?
Or does it make more sense to merge the bam files for each condition, call peaks and then get differential peaks between control and stimulation?
Thanks in advance!