This is a small experiment on the alignment of ~50bp INDELs. The query sequences are shown in 0.01.fq
below, where seq_ori
is a 204bp sequence extracted from the human reference genome, seq_del54
contains a 54bp deletion in the middle, seq_del84
contains a 84bp deletion in a 120bp read, and seq_ins40
contains a 40bp insertion in a 140bp read. These four short sequences were mapped to the human reference genome with Bowtie2, BWA-MEM, LAST, Novoalign, SNAP and Stampy with default settings. Non-default scoring functions were also tested for Bowtie2 (--rdg 5,1 --rfg 5,1), BWA-MEM (-A2 -E1) and LAST (-r2 -q4). The output by various mappers/settings can be found in this gist. The following table gives my summary:
Mapper | Setting | -84bp | -54bp | +40bp |
---|---|---|---|---|
BBMAP | default | Yes | Yes | Yes |
Bowtie2 | default | No | No | No |
Bowtie2 | --rdg 5,1 --rfg 5,1 | as insertion | as insertion | Yes |
BWA-MEM | default | as split | Yes | Yes |
BWA-MEM | -A2 -E1 | Yes | Yes | Yes |
LAST | default | as split | as split | as split |
LAST | -r2 -q4 | as split | as split | as split |
Novoalign | default | No | Yes | Yes |
SNAP | default | No | No | No |
Stampy | default | No | No | Yes |
SNAP finds hits within an edit-distance threshold (14 by default). It is practically unable to find ~50bp INDELs. I believe Bowtie2, LAST and Stampy should all be able to find the deletions in theory. I guess they don't because they have not used long enough reference subsequence in Smith-Waterman.
Disclaimer: I developed BWA-MEM and have evaluated it against similar examples (not this one) in development.