Bedtools Cheatsheet

General:

Tools	Description
flank	Create new intervals from the flanks of existing intervals.
slop	Adjust the size of intervals.
shift	Adjust the position of intervals.
subtract	Remove intervals based on overlaps b/w two files.
complement	Extract intervals not represented by an interval file.
closest	Find the closest, potentially non-overlapping interval.
intersect	Find overlapping intervals in various ways.
window	Find overlapping intervals within a window around an interval.
cluster	Cluster (but don't merge) overlapping/nearby intervals.
merge	Combine overlapping/nearby intervals into a single interval.
map	Apply a function to a column for each overlapping interval.
groupby	Group by common columns and summarize other columns (~ SQL "GROUP BY").

Formatting:

Notes: BED file format, GFF vs BED indexing

Tools	Description
getfasta	Use intervals to extract sequences from a FASTA file.
maskfasta	Use intervals to mask sequences from a FASTA file.
sort	Order the intervals in a file.
bed12tobed6	Breaks BED12 intervals into discrete BED6 intervals.
bamtofastq	Convert BAM records to FASTQ records.
bamtobed	Convert BAM alignments to BED (& other) formats.
bedpetobam	Convert BEDPE intervals to BAM records.
bedtobam	Convert intervals to BAM records.

Statistics:

Tools	Description
jaccard	Calculate the Jaccard statistic b/w two sets of intervals.
random	Generate random intervals in a genome.
reldist	Calculate the distribution of relative distances b/w two files.
shuffle	Randomly redistribute intervals in a genome.
makewindows	Makes adjacent or sliding windows across a genome or BED file.
nuc	Profile the nucleotide content of intervals in a FASTA file.

Coverage:

Tools	Description
annotate	Annotate coverage of features from multiple files.
coverage	Compute the coverage over defined intervals.
genomecov	Compute the coverage over an entire genome.
multicov	Counts coverage from multiple BAMs at specific intervals.
unionbedg	Combines coverage intervals from multiple BEDGRAPH files.

Common flags:

-s, -S : Require same strandedness or opposite strandedness, respectively.
-f, -F : Minimum overlap required as a fraction of A or a fraction of B respectively.
-r, -e : Require that the minimum overlap be satisfied for A AND B, or A OR B respectively.
-split : Treat "split" BAM or BED12 entries as distinct BED intervals.
-abam : A is a BAM file.
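
How -f, -F, -r and -e combine can be sketched in a few lines of Python. This only illustrates the semantics listed above and is not bedtools' own code: the function name `passes_overlap` and its parameters are hypothetical, and intervals are assumed to be 0-based, half-open (start, end) tuples.

```python
def passes_overlap(a, b, f=1e-9, F=1e-9, either=False):
    """Return True if intervals a and b overlap enough under -f/-F (and -e)."""
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    if overlap <= 0:
        return False
    ok_a = overlap >= f * (a[1] - a[0])   # -f: minimum fraction of A
    ok_b = overlap >= F * (b[1] - b[0])   # -F: minimum fraction of B
    # Default: both thresholds must hold; -e: either one is enough.
    return (ok_a or ok_b) if either else (ok_a and ok_b)

a, b = (0, 100), (90, 110)      # 10 bp overlap: 10% of A, 50% of B
print(passes_overlap(a, b, f=0.5))                       # False
print(passes_overlap(a, b, f=0.5, F=0.5, either=True))   # True
```

In this sketch, -r corresponds to passing the same value for `f` and `F` with `either=False`.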

General

flank, slop

Create new intervals from the flanks of existing intervals. (flank Docs)

Adjust the size of intervals. (slop Docs)

IN           ▓▓▓▓▓       ▓▓▓
Flank      ██     ██   ██   ██
Slop       █████████   ███████

$ bedtools flank [OPTIONS] -i <BED/GFF/VCF> -g <GENOME> [-b or (-l and -r)]

$ bedtools slop [OPTIONS] -i <BED/GFF/VCF> -g <GENOME> [-b or (-l and -r)]

OPTIONS	.
-b, -l, -r	Flank/extend regions by x bp on both sides, on the left, or on the right respectively.
-s	Define -l and -r based on strand.
-pct	Define -l and -r as a fraction of the feature's length.
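
The difference between the two tools can be sketched as follows. This is a simplified illustration rather than bedtools' implementation: `slop` and `flank` are hypothetical helper names, strand is ignored, and coordinates are assumed to be 0-based, half-open and clamped to the chromosome length taken from the genome file.

```python
def slop(start, end, chrom_size, left=0, right=0):
    """Extend an interval on each side, clamped to [0, chrom_size]."""
    return max(0, start - left), min(chrom_size, end + right)

def flank(start, end, chrom_size, left=0, right=0):
    """Return the new intervals lying to the left and right of the original."""
    out = []
    if left:
        out.append((max(0, start - left), start))
    if right:
        out.append((end, min(chrom_size, end + right)))
    return out

print(slop(100, 200, 1000, left=50, right=50))    # (50, 250)
print(flank(100, 200, 1000, left=50, right=50))   # [(50, 100), (200, 250)]
```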

shift

Adjust the position of intervals while respecting chromosome edges. (Docs)

IN      ██   ██      ████
OUT        ██   ██      ████

$ bedtools shift [OPTIONS] -i <BED/GFF/VCF> -g <GENOME> [-s or (-m and -p)]

OPTIONS	.
-s	Number of BPs to shift the features.
-m, -p	Number of BPs to shift the features on the - strand or + strand, respectively.
-pct	Define -s, -m and -p as a fraction of the feature's length.

subtract

Remove intervals based on overlaps b/w two files. (Docs)

A        ▓▓▓▓▓▓▓▓▓▓   ▓▓▓     ▓▓▓▓▓▓
B          ▓▓▓▓           ▓▓▓▓▓▓▓  
A sub B  ██    ████   ███        ███

$ bedtools subtract [OPTIONS] -a <BED/GFF/VCF> -b <BED/GFF/VCF>

OPTIONS	.
-A	Remove entire feature if any overlap.
common	strandedness: -s, -S; overlap: -f, -F; overlap mode: -r, -e
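
The default behaviour (keep whatever part of A is not covered by B) can be sketched for a single A interval. The helper name `subtract` is hypothetical, and coordinates are assumed to be 0-based, half-open tuples on one chromosome.

```python
def subtract(a, b_list):
    """Remove from interval a every part covered by an interval in b_list."""
    pieces = []
    cursor = a[0]                         # left edge of what is still uncovered
    for b_start, b_end in sorted(b_list):
        if b_end <= cursor or b_start >= a[1]:
            continue                      # no overlap with the remainder of a
        if b_start > cursor:
            pieces.append((cursor, b_start))
        cursor = max(cursor, b_end)
    if cursor < a[1]:
        pieces.append((cursor, a[1]))
    return pieces

print(subtract((9, 19), [(11, 15)]))     # [(9, 11), (15, 19)]
print(subtract((30, 36), [(26, 33)]))    # [(33, 36)]
```

With -A, the whole A feature would be dropped whenever any overlap exists, instead of being trimmed.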

complement

Extract intervals not represented by an interval file. (Docs)

IN           ▓▓▓▓▓     ▓▓▓     ▓▓▓▓▓▓
          ▓▓▓▓            ▓▓▓  
OUT  █████        █████      ██

$ bedtools complement -i <BED/GFF/VCF> -g <GENOME>
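
As a rough sketch, the complement on one chromosome is the set of gaps left after walking the sorted intervals. The helper name `complement` is hypothetical, and `chrom_size` stands in for the length given in the genome file.

```python
def complement(intervals, chrom_size):
    """Return the regions of [0, chrom_size) not covered by any interval."""
    gaps, cursor = [], 0
    for start, end in sorted(intervals):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < chrom_size:
        gaps.append((cursor, chrom_size))
    return gaps

print(complement([(5, 10), (8, 12), (20, 25)], 30))
# [(0, 5), (12, 20), (25, 30)]
```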

closest

Find the closest, potentially non-overlapping interval. (Docs)

A            █████   ✓
B   ████            ███

$ bedtools closest [OPTIONS] -a <FILE> -b <FILE1, FILE2, ..., FILEN>

OPTIONS	.
-d	Also report distance from A to the closest feature.
-k	Report the k closest hits. Default: 1.
-io	Ignore features in B that overlap A.
-iu, -id	Ignore features in B that are upstream or downstream, respectively, of features in A.
common	strandedness: -s, -S

intersect

Find overlapping intervals in various ways. (Docs)

A           ██████████
B         ▓▓▓▓    ▓▓        ▓▓▓  
A int B     ▓▓    ▓▓

$ bedtools intersect [OPTIONS] -a <BAM/BED/GFF/VCF> -b <FILE1, FILE2, ..., FILEN>

OPTIONS	.
-wa, -wb	Write the original entry in A/original entry in B, respectively, for each overlap.
-loj	For each feature in A report each overlap with B. Report a NULL feature for B if no overlap.
-wao	Report A and B features and no. of bp overlap between them.
-u	Only report each overlapping A feature once.
-c	For each entry in A, report count of overlapping B features.
-v	Only report features in A not overlapping B.
common	strandedness: -s, -S; overlap: -f, -F; overlap mode: -r, -e; bam/bed12: -abam, -split
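
The default report (the overlapping portion of A for each hit in B) and its relation to -c and -v can be sketched like this. The helper name `intersect` is hypothetical, and the coordinates are read off the diagram above as 0-based, half-open tuples.

```python
def intersect(a, b_list):
    """Report the overlapping portion of a for each overlapping interval in b_list."""
    hits = []
    for b in b_list:
        start, end = max(a[0], b[0]), min(a[1], b[1])
        if start < end:
            hits.append((start, end))
    return hits

a = (12, 22)
b = [(10, 14), (18, 20), (28, 31)]
hits = intersect(a, b)
print(hits)        # [(12, 14), (18, 20)]
print(len(hits))   # 2, the count -c would report for this A feature
print(not hits)    # False, so -v would not report this A feature
```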

window

Find overlapping intervals within a window around an interval. (Docs)

A           ┌────█████────┐
B         ▓▓▓▓    ▓▓▓        ▓▓▓  
A win B   ▓▓▓▓    ▓▓▓

$ bedtools window [OPTIONS] [-a|-abam] -b <BED/GFF/VCF>

OPTIONS	.
-w, -l, -r	Flank length of overlap window in each direction, upstream or downstream, respectively.
-sw	Define -l and -r based on strand.
-u	Only report each overlapping A feature once.
-c	For each entry in A, report count of overlapping B features.
-v	Only report features in A not overlapping B.
common	strandedness: -sm, -Sm; bam: -abam

cluster

Cluster (but don't merge) overlapping/nearby intervals. (Docs)

BED        ████     █████  ███  
clustID   └─#1─┘   └────#2────┘

$ bedtools cluster [OPTIONS] -i <BED/GFF/VCF>

OPTIONS	.
-d	Max distance between features in cluster.
common	strandedness: -s, -S
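
Clustering labels each interval instead of combining them. The sketch below assumes input sorted by start on a single chromosome and treats -d as the largest gap allowed within a cluster; the helper name `cluster` is hypothetical.

```python
def cluster(intervals, d=0):
    """Append a cluster ID to each (start, end) interval; input sorted by start."""
    out, cluster_id, cluster_end = [], 0, None
    for start, end in intervals:
        if cluster_end is None or start - cluster_end > d:
            cluster_id += 1               # gap too large: start a new cluster
            cluster_end = end
        else:
            cluster_end = max(cluster_end, end)
        out.append((start, end, cluster_id))
    return out

print(cluster([(11, 15), (20, 25), (27, 30)], d=2))
# [(11, 15, 1), (20, 25, 2), (27, 30, 2)]
```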

Aggregation Tools

For merge, groupby, and map, the following* aggregation functions (specified by -o) can be applied to the column(s) specified by -c: sum, count, count_distinct, min, max, mean, median, mode, antimode, stdev, sstdev, collapse, distinct, first, last

*Other functions are available.

merge

Combine overlapping/nearby intervals into a single interval. (Docs)

IN       ▓▓▓      ▓        ▓▓··d··▓▓▓
      ▓▓▓▓         ▓▓        
OUT   ██████      ███      ██████████

$ bedtools merge [OPTIONS] -i <BED/GFF/VCF/BAM>

OPTIONS	.
-s	Require same strandedness.
-S	Force merge for one specific strand only. Options: <`+`/`-`>.
-d	Maximum distance between features to be merged.
common	aggregation: -o, -c
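
The merging rule can be sketched as a single pass over sorted intervals. The helper name `merge` is hypothetical; it assumes one chromosome, 0-based half-open coordinates, and that -d is the largest gap that still gets merged (0 merges overlapping and book-ended intervals).

```python
def merge(intervals, d=0):
    """Combine intervals whose gap to the previous merged interval is at most d."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] <= d:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

print(merge([(1, 5), (3, 8), (10, 12)]))        # [(1, 8), (10, 12)]
print(merge([(1, 5), (3, 8), (10, 12)], d=2))   # [(1, 12)]
```

The sketch sorts for convenience; bedtools merge itself expects input that is already sorted by chromosome and start (for example with bedtools sort).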

map

Apply a function to a column for each overlapping interval. (Docs)

        score = 3  1     5                 4      6
B              ▓▓▓ ▓   ▓▓▓▓▓             ▓▓▓▓▓▓ ▓▓▓▓
A               ██████████                 ███████
B map(mean) A   ██████████ mean(3,1,5)=3   ███████ mean(4,6)=5

$ bedtools map [OPTIONS] -a <BED/GFF/VCF> -b <BED/GFF/VCF>

OPTIONS	.
common	aggregation: -o, -c; strandedness: -s, -S; overlap: -f, -F; overlap mode: -r, -e; bed12: -split
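
The arithmetic for a mean over overlapping scores can be checked with a small sketch. The helper name `map_mean` and the coordinates are illustrative; only the scores (3, 1, 5 and 4, 6) come from the diagram above.

```python
from statistics import mean

def map_mean(a, b_features):
    """Mean score of the B features (start, end, score) that overlap interval a."""
    scores = [score for start, end, score in b_features
              if start < a[1] and end > a[0]]
    # bedtools prints "." when nothing overlaps; None plays that role here.
    return mean(scores) if scores else None

b = [(15, 18, 3), (19, 20, 1), (23, 28, 5), (41, 47, 4), (48, 52, 6)]
print(map_mean((16, 26), b))   # 3
print(map_mean((43, 50), b))   # 5
```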

groupby

Group by common columns and summarize other columns (~ SQL "GROUP BY"). (Docs)

$ bedtools groupby [OPTIONS] -i <BED> -g <groupby columns> -c <op. column> -o <operation>

OPTIONS	.
common	aggregation: -o, -c
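
A Python analogue of grouping on columns 1 and 4 and summing column 5 (roughly `-g 1,4 -c 5 -o sum`) is sketched below. The rows are made-up example data, and the comparison to `itertools.groupby` is only an analogy.

```python
from itertools import groupby

# Made-up BED-like rows: chrom, start, end, name, score.
rows = [
    ("chr1", 10, 20, "geneA", 5),
    ("chr1", 30, 40, "geneA", 7),
    ("chr2", 15, 25, "geneB", 2),
]

def group_key(row):
    return row[0], row[3]        # columns 1 and 4

# itertools.groupby only groups adjacent rows, so the rows must already be
# sorted on the grouping columns, as bedtools groupby also expects.
summary = [(chrom, name, sum(row[4] for row in group))
           for (chrom, name), group in groupby(rows, key=group_key)]
print(summary)   # [('chr1', 'geneA', 12), ('chr2', 'geneB', 2)]
```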

Formatting

BED file format

Column	e.g.	Definition
chrom	Sc112.1	<STR> name of chromosome/scaffold
start	2134	<INT> start position of feature
end	2565	<INT> end position of feature
name	gene123	<STR> name of feature
score	544	<NUM> score for the feature e.g. bit score
strand	+	<+/-/.> strand on which feature is located
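thickStart	2235	<INT> start position of thick drawing (e.g. start codon)
thickEnd	2489	<INT> end position of thick drawing (e.g. stop codon)
itemRgb	255,0,0	<R,G,B> display color of the feature
blockCount	2	<INT> number of blocks (e.g. exons)
blockSizes	150,80	<INT,...> comma-separated block sizes
blockStarts	0,351	<INT,...> comma-separated block starts, relative to start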

GFF vs BED indexing

GFF    ┌─1   2   3─┐ 4   ...
         G---A---T   C   ...
BED    └─0   1   2 └─3   ...

.	gff -> bed	bed -> gff
new_start =	gff_start - 1	bed_start + 1
new_end =	gff_end	bed_end
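
The conversion table amounts to shifting only the start coordinate, which can be written as two one-line helpers (the function names are hypothetical):

```python
def gff_to_bed(gff_start, gff_end):
    """1-based, inclusive GFF coordinates -> 0-based, half-open BED coordinates."""
    return gff_start - 1, gff_end

def bed_to_gff(bed_start, bed_end):
    """0-based, half-open BED coordinates -> 1-based, inclusive GFF coordinates."""
    return bed_start + 1, bed_end

# "GAT" in the diagram above: GFF 1..3 corresponds to BED 0..3.
print(gff_to_bed(1, 3))   # (0, 3)
print(bed_to_gff(0, 3))   # (1, 3)
```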

getfasta

Use intervals to extract sequences from a FASTA file. (Docs)

FASTA   ACTGATCATGATACATGATACCATTAGGATACAATA
BED         ████       █████      ████
OUTFA       ATCA       TGATA      GGAT

$ bedtools getfasta [OPTIONS] -fi <input FASTA> -bed <BED/GFF/VCF>

OPTIONS	.
-name	Use "name" column in BED file for FASTA headers in the output.
-s	Reverse complement features on "-" strand. Default: strand information ignored.
-split	Given BED12 input, concatenate the sequences from BED blocks (e.g., exons).
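
Because BED coordinates are 0-based and half-open, extraction behaves like Python string slicing. The sketch below reproduces the diagram above and shows what -s adds for a "-" strand feature; `getfasta` here is a hypothetical helper, not the bedtools code.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def getfasta(seq, start, end, strand="+"):
    """Extract seq[start:end]; reverse complement it for '-' strand features."""
    sub = seq[start:end]
    if strand == "-":                      # what -s does for "-" features
        sub = sub.translate(COMPLEMENT)[::-1]
    return sub

seq = "ACTGATCATGATACATGATACCATTAGGATACAATA"
print(getfasta(seq, 4, 8))         # ATCA
print(getfasta(seq, 15, 20))       # TGATA
print(getfasta(seq, 4, 8, "-"))    # TGAT
```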

maskfasta

Use intervals to mask sequences from a FASTA file. (Docs)

FASTA   ACTGATCATGATACATGATACCATTAGGATACAATA
BED           ████       █████      ████
FASTA'  ACTGATNNNNATACATGNNNNNATTAGGNNNNAATA

$ bedtools maskfasta [OPTIONS] -fi <input FASTA> -bed <BED/GFF/VCF> -fo <output FASTA>

OPTIONS	.
-soft	Soft-mask (convert to lower-case bases) instead of masking with "N".
-mc	Specify masking character.
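
Hard and soft masking differ only in what replaces each base, as this sketch shows. `maskfasta` is a hypothetical helper whose `soft` and `mask_char` parameters mirror -soft and -mc; the intervals are read off the diagram above.

```python
def maskfasta(seq, intervals, soft=False, mask_char="N"):
    """Mask the given 0-based, half-open intervals of seq."""
    chars = list(seq)
    for start, end in intervals:
        for i in range(start, min(end, len(chars))):
            chars[i] = chars[i].lower() if soft else mask_char
    return "".join(chars)

seq = "ACTGATCATGATACATGATACCATTAGGATACAATA"
bed = [(6, 10), (17, 22), (28, 32)]
print(maskfasta(seq, bed))              # ACTGATNNNNATACATGNNNNNATTAGGNNNNAATA
print(maskfasta(seq, bed, soft=True))   # ACTGATcatgATACATGataccATTAGGatacAATA
```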

sort

Order the intervals in a file. (Docs)

$ bedtools sort [OPTIONS] -i <BED/GFF/VCF>

OPTIONS	.
-sizeA	Sort by feature size (asc).
-sizeD	Sort by feature size (desc).
-chrThenSizeA	Sort by chromosome (asc), then by feature size (asc).
-chrThenSizeD	Sort by chromosome (asc), then by feature size (desc).
-chrThenScoreA	Sort by chromosome (asc), then by score (asc).
-chrThenScoreD	Sort by chromosome (asc), then by score (desc).

Statistics

jaccard

Calculate the Jaccard statistic b/w two sets of intervals. (Docs)

A                 ███████████  15bp
B               ▓▓▓▓ 10bp ▓▓ 4bp       ▓▓▓ 8bp
A int B           ▓▓ 6bp  ▓▓ 4bp
Jaccard(A,B)     (6+4)/((15+10+4+8)-(6+4)) =  0.37

$ bedtools jaccard [OPTIONS] -a <BED/GFF/VCF> -b <BED/GFF/VCF>

OPTIONS	.
common	strandedness: -s, -S; overlap: -f, -F; overlap mode: -r, -e; bed12: -split
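
The statistic is the length of the intersection divided by the length of the union. The sketch below reproduces the worked example above; it assumes each input list contains no self-overlapping intervals, and the coordinates are made up to match the lengths in the diagram.

```python
def total_length(intervals):
    return sum(end - start for start, end in intervals)

def jaccard(a, b):
    """Jaccard statistic for two lists of non-overlapping (start, end) intervals."""
    inter = sum(
        max(0, min(a_end, b_end) - max(a_start, b_start))
        for a_start, a_end in a
        for b_start, b_end in b
    )
    union = total_length(a) + total_length(b) - inter
    return inter / union

a = [(10, 25)]                       # 15 bp
b = [(6, 16), (21, 25), (40, 48)]    # 10, 4 and 8 bp
print(round(jaccard(a, b), 2))       # 0.37
```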

random

Generate random intervals in a genome. (Docs)

$ bedtools random [OPTIONS] -g <GENOME>

OPTIONS	.
-l	The length of the intervals to generate. Default: 100
-n	The number of intervals to generate. Default: 1,000,000
-seed	Supply an integer seed so the random intervals are reproducible.

reldist

Calculate the distribution of relative distances b/w two files. (Docs)

                ───────r──────
A            ▓▓▓▓▓▓         ▓▓▓▓
B                      ███
                ───d1─── ──d2──
reldist = min(d1,d2)/r

$ bedtools reldist [OPTIONS] -a <BED/GFF/VCF> -b <BED/GFF/VCF>

OPTIONS	.
-detail	Instead of a summary, report relative distance for each region in A.
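
The formula above can be evaluated directly once the two flanking A positions are known. The sketch takes single positions as input, which is a simplification; the helper name `reldist` and its arguments are hypothetical.

```python
def reldist(b_point, a_left, a_right):
    """Relative distance of a B position between its two flanking A positions."""
    d1 = b_point - a_left
    d2 = a_right - b_point
    r = a_right - a_left
    return min(d1, d2) / r

print(reldist(30, 20, 40))   # 0.5, exactly halfway between the A features
print(reldist(22, 20, 40))   # 0.1, close to the left A feature
```

Values range from 0 (B sits on an A position) to 0.5 (B is midway between two A positions).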

shuffle

Randomly redistribute intervals in a genome. (Docs)

$ bedtools shuffle [OPTIONS] -i <BED/GFF/VCF> -g <GENOME>

OPTIONS	.
-excl	BED file with regions into which features won't be shuffled.
-incl	BED file with regions into which features will be shuffled.
-chrom	Keep features on the same chromosome.
-chromFirst	Distribute features ~uniformly across chroms, not across total sequence.
-noOverlapping	Don't allow shuffled intervals to overlap.

Coverage

annotate

Annotate coverage of features from multiple files. (Docs)

$ bedtools annotate -i variants.bed -files genes.bed conserve.bed known_var.bed
chr1  100 200 nasty 1 - 0.500000  1.000000  0.300000
chr2  500 1000  ugly  2 + 0.000000  0.600000  1.000000

$ bedtools annotate [OPTIONS] -i <BED/GFF/VCF> -files FILE1 FILE2 FILE3 ... FILEn

OPTIONS	.
-counts	Report count of features that overlap -i in each file. Default: report fraction of -i covered by each file.
-both	Report counts & fractions for each file.
common	strandedness: -s, -S

coverage

Compute the coverage over defined intervals. (Docs)

BED FILE A  ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓     ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓     ▓▓▓▓▓▓  
BED File B  ████ ████              ██             █████████
              ████████                                      
Result      [  N=3, 10/15 ]     [  N=1, 2/15  ]    [N=1,6/6]

$ bedtools coverage [OPTIONS] -a <BAM/BED/GFF/VCF> -b <FILE1, FILE2, ..., FILEN>

OPTIONS	.
-d	Report the depth at each position in each A feature.
common	strandedness: -s, -S; overlap: -f, -F; overlap mode: -r, -e; bam/bed12: -abam, -split
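
The two numbers in each result box (how many B features overlap, and how many bases of A they cover) can be reproduced with a small sketch. `coverage` is a hypothetical helper, and the coordinates are read off the first A interval in the diagram above, relative to its start.

```python
def coverage(a, b_list):
    """Return (overlapping B features, covered bases of a, length of a)."""
    n = 0
    covered = set()
    for b_start, b_end in b_list:
        start, end = max(a[0], b_start), min(a[1], b_end)
        if start < end:
            n += 1
            covered.update(range(start, end))   # positions count once even if hit twice
    return n, len(covered), a[1] - a[0]

a = (0, 15)
b = [(0, 4), (5, 9), (2, 10)]
print(coverage(a, b))   # (3, 10, 15)
```

Collecting positions in a set keeps the sketch short; it is not meant to be efficient for genome-scale input.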

edawson/bedtools_cheatsheet.md