Note 1: The index sequences provided in the Nextera kits are edit distance 2 from one another (meaning it takes at least 2 insertions, deletions, or substitutions to turn one sequence into another). This means that we can detect an error in the index, but we cannot correct it. We can only begin to correct errors when the edit distance is ≥ 3 [see below].
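To make the detect-versus-correct distinction concrete, here is a minimal Python sketch. It is not one of the scripts used in this protocol; the function names `levenshtein`, `detectable_errors`, and `correctable_errors` are made up for illustration. It assumes the usual rule for a set of tags with minimum pairwise distance d: up to d - 1 errors can be detected and up to floor((d - 1) / 2) can be corrected.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def detectable_errors(min_distance: int) -> int:
    """Errors that can always be noticed for a given minimum distance."""
    return max(min_distance - 1, 0)


def correctable_errors(min_distance: int) -> int:
    """Errors that can always be repaired for a given minimum distance."""
    return max((min_distance - 1) // 2, 0)
```

With a minimum distance of 2, `correctable_errors(2)` is 0 and `detectable_errors(2)` is 1, which matches the note above; at distance 3 a single error becomes correctable.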
Note 2: I fully realize that this is not the perfectly optimal way to do what we need to do. For example, we won't get the maximal set of 6 nt barcodes that includes the IDX barcodes. But what is here is good enough for 99% of what we typically need to do (i.e. 24 barcodes or so).
We are going to use some programs from different places. First, the awesome create_index_sequences.py and suggest_subset.py from [bioinf.eva.mpg.de](http://bioinf.eva.mpg.de/multiplex/), and second, my levenshtein.py helper script.
-
Generate (or copy from the file) the full set of 6nt, distance 2 tags using:

    create_index_sequences.py -l 6 -d 2 -o 6nt_ed2.txt
-
Following the format of the above, paste in the 12 IDX sequences in the Epicentre kit. Then, use some regex magic to make the file look like so (truncated):
    [Linker]
    IDX1 = ATCACG
    IDX2 = CGATGT
    IDX3 = TTAGGC
    IDX4 = TGACCA
    IDX5 = ACAGTG
    IDX6 = GCCAAT
    IDX7 = CAGATC
    IDX8 = ACTTGA
    IDX9 = GATCAG
    IDX10 = TAGCTT
    IDX11 = GGCTAC
    IDX12 = CTTGTA
    index_6nt_1 = AACCAG
    index_6nt_2 = AACCGA
    index_6nt_3 = AACCTC
    index_6nt_4 = AACGAA
    index_6nt_5 = AACGCC
    ...
-
Run this file through levenshtein.py:

    python levenshtein.py --configuration 6nt_ed2.txt \
        --section=Linker
-
Remove the sequences from 6nt_ed2.txt starting with index_6nt_* (i.e. don't remove any sequences that start with IDX) that levenshtein.py indicates as causing the edit distance to be = 0.
-
Run the file through levenshtein.py again and remove the sequences starting with index_6nt_* that levenshtein.py indicates as causing the edit distance to be = 1.
-
Run the edited file through levenshtein.py once more to ensure that the minimum edit distance is ≥ 2.
-
This final set contains the extended set of edit distance 2 barcodes that you can use with the Nextera adapters. Before settling on a set, you need to build a different input file that looks like so:
    #Index  Name
    ATCACG  IDX_1
    CGATGT  IDX_2
    TTAGGC  IDX_3
    TGACCA  IDX_4
    ACAGTG  IDX_5
    GCCAAT  IDX_6
    CAGATC  IDX_7
    ACTTGA  IDX_8
    GATCAG  IDX_9
    TAGCTT  IDX_10
    GGCTAC  IDX_11
    CTTGTA  IDX_12
    AACCAG  index_6nt_ed2_13
    AACCGA  index_6nt_ed2_14
    AACCTC  index_6nt_ed2_15
    AACGAA  index_6nt_ed2_16
    ...
-
Run suggest_subset.py on the barcodes in this file to select the set of N that you need. Thus, if we use IDX 1-12 and we need 20 adapters in total, we run:

    suggest_subset.py -i extended_nextera_tags_6nt_ed2_barcodes.txt \
        -s 8 -p 1-12 -o set_of_extended_nextera_barcodes.txt
-
Using the outfile, build the actual adapter sequences that we need to order:
    barcode_inserter.py --5-prime CAAGCAGAAGACGGCATACGAGAT \
        --3-prime CGGTCTGCCTTGCCAGCCCGCTCAG \
        --input set_of_extended_nextera_barcodes.txt > \
        set_of_extended_nextera_adapters.txt
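The pruning steps of the procedure above (run levenshtein.py, drop the offending index_6nt_* tags, repeat) can be sketched in a few lines of Python. This is a hedged illustration, not the actual levenshtein.py: the names `read_section` and `prune_candidates` are hypothetical, the greedy one-pass filter is my own simplification, and the example tags at the bottom are toy data rather than output from create_index_sequences.py.

```python
import configparser


def levenshtein(a, b):
    """Edit distance counting insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def read_section(text, section="Linker"):
    """Parse 'name = sequence' pairs from an INI-style section."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep names such as IDX1 case-sensitive
    parser.read_string(text)
    return dict(parser[section])


def prune_candidates(tags, min_distance, protected_prefix="IDX"):
    """Keep every protected tag; keep a candidate only if it is at least
    min_distance away from everything kept so far (greedy, order-dependent)."""
    kept = {name: seq for name, seq in tags.items()
            if name.startswith(protected_prefix)}
    for name, seq in tags.items():
        if name in kept:
            continue
        if all(levenshtein(seq, other) >= min_distance
               for other in kept.values()):
            kept[name] = seq
    return kept


# Toy input (assumed data): two kit tags and three made-up candidates.
EXAMPLE = """
[Linker]
IDX1 = ATCACG
IDX2 = CGATGT
index_6nt_1 = ATCACG
index_6nt_2 = ATCACC
index_6nt_3 = GGGGGG
"""

kept = prune_candidates(read_section(EXAMPLE), min_distance=2)
```

In this toy run, `index_6nt_1` (distance 0 from IDX1) and `index_6nt_2` (distance 1 from IDX1) are dropped, while the IDX tags are never removed. Because the filter is greedy, the surviving set depends on the order of the candidates; it is a convenience, not an optimal selection.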
-
Format a file with the IDX sequences in it, like so:
    [Linker]
    IDX1 = ATCACG
    IDX2 = CGATGT
    IDX3 = TTAGGC
    IDX4 = TGACCA
    IDX5 = ACAGTG
    IDX6 = GCCAAT
    IDX7 = CAGATC
    IDX8 = ACTTGA
    IDX9 = GATCAG
    IDX10 = TAGCTT
    IDX11 = GGCTAC
    IDX12 = CTTGTA
-
Run this through levenshtein.py:

    python levenshtein.py --configuration IDX_sequences.txt \
        --section=Linker
-
Remove those sequences indicated as having edit distance 2, then run the code again to make sure every remaining pair is at edit distance ≥ 3.
-
Create a set of 6nt tags of edit distance 3:

    create_index_sequences.py -l 6 -d 3 -o 6nt_ed3.txt
-
Format these sequences to look like the file containing the IDX sequences above. Run this file through levenshtein.py:

    python levenshtein.py --configuration IDX_sequences.txt \
        --section=Linker
-
Remove from the file those sequences starting with index_6nt_* that make the edit distance ≤ 1, run levenshtein.py again, and remove those that make the edit distance ≤ 2. We should now have a file containing only sequences at edit distance ≥ 3.
-
Double-check this by running, one last time:
    python levenshtein.py --configuration IDX_sequences.txt \
        --section=Linker
-
This final set contains the extended set of edit distance 3 barcodes that you can use with the Nextera adapters. Before settling on a set, you need to build a different input file that looks like so:
    #Index  Name
    CGATGT  IDX_2
    TTAGGC  IDX_3
    TGACCA  IDX_4
    ACAGTG  IDX_5
    GCCAAT  IDX_6
    CAGATC  IDX_7
    GATCAG  IDX_9
    TAGCTT  IDX_10
    GGCTAC  IDX_11
    CTTGTA  IDX_12
    AACGCA  index_6nt_ed3_13
    ACTATA  index_6nt_ed3_14
    AGTTGG  index_6nt_ed3_15
    ...
-
Run suggest_subset.py on the barcodes in this file to select the set of N that you need. Thus, if we use the ten remaining IDX sequences (2-7 and 9-12) and we need 20 adapters in total, we run:

    suggest_subset.py -i extended_nextera_tags_6nt_ed3_barcodes.txt \
        -s 10 -p 2,3,4,5,6,7,9,10,11,12 \
        -o set_of_extended_nextera_barcodes.txt
-
Using the outfile, build the actual adapter sequences that we need to order:
    barcode_inserter.py --5-prime CAAGCAGAAGACGGCATACGAGAT \
        --3-prime CGGTCTGCCTTGCCAGCCCGCTCAG \
        --input set_of_extended_nextera_barcodes.txt > \
        set_of_extended_nextera_adapters.txt
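For readers without barcode_inserter.py at hand, the adapter-building step can be approximated as below. This is a sketch under assumptions, not the script itself: it assumes the subset file uses the same two-column "#Index Name" layout shown earlier, and it simply places each barcode between the two flanking sequences without any reverse-complementing. The function names `read_barcodes` and `build_adapter` are hypothetical; check the result against the real script before ordering oligos.

```python
# Flanking sequences copied from the barcode_inserter.py command above.
FIVE_PRIME = "CAAGCAGAAGACGGCATACGAGAT"
THREE_PRIME = "CGGTCTGCCTTGCCAGCCCGCTCAG"


def read_barcodes(text):
    """Parse lines of 'SEQUENCE NAME', skipping blanks and '#' comments.

    The two-column layout is an assumption about the subset file.
    """
    barcodes = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sequence, name = line.split()[:2]
        barcodes.append((name, sequence))
    return barcodes


def build_adapter(barcode, five_prime=FIVE_PRIME, three_prime=THREE_PRIME):
    """Place the barcode between the 5' and 3' flanks (plain concatenation)."""
    return five_prime + barcode + three_prime


def build_adapters(text):
    """Return (name, adapter) pairs for every barcode in the table."""
    return [(name, build_adapter(seq)) for name, seq in read_barcodes(text)]
```

Calling `build_adapters` on the contents of set_of_extended_nextera_barcodes.txt would give one (name, adapter) pair per barcode, ready to be written out one per line.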