@brantfaircloth
Created November 9, 2010 23:22
Extending the Nextera indexing set

Dependencies

Methods

Note 1: The index sequences provided in the Nextera kits are a minimum of edit distance 2 from one another (meaning it takes at least 2 insertions, deletions, or substitutions to turn one sequence into another). This means that we can detect an error in the index, but we cannot correct it. Error correction only becomes possible when the edit distance is ≥ 3 [see below].
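
To make the idea of edit distance concrete, here is a minimal Python sketch. It is not the levenshtein.py helper used in the steps below (that script's internals are not shown here); the function names `edit_distance` and `correctable_errors` are illustrative only. The second function assumes the usual coding-theory rule of thumb that a set with minimum distance d can correct up to (d - 1) // 2 errors, which matches the statement above that distance 2 allows detection only and distance 3 allows correction of a single error.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def correctable_errors(min_distance):
    """Number of errors correctable for a given minimum pairwise distance."""
    return (min_distance - 1) // 2


# Two of the generated 6 nt tags that appear in the example files in this document
print(edit_distance("AACCAG", "AACCGA"))               # 2
print(correctable_errors(2), correctable_errors(3))    # 0 1
```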

Note 2: I fully realize that this is not the optimal way to do what we need to do. For example, we won't get the maximal set of 6 nt barcodes that includes the IDX barcodes. But what is here is good enough for 99% of what we typically need to do (i.e. 24 barcodes or so).

We are going to use programs from a few places: first, the awesome create_index_sequences.py and suggest_subset.py from [bioinf.eva.mpg.de](http://bioinf.eva.mpg.de/multiplex/); second, my levenshtein.py helper script; and third, the barcode_inserter.py script included at the end of this document.

Extending the original edit distance 2 set

  1. Generate (or copy from the file) the full set of 6nt, distance 2 tags using

     create_index_sequences.py -l 6 -d 2 -o 6nt_ed2.txt
    
  2. Following the format of the above, paste in the 12 IDX sequences in the Epicentre kit. Then, use some regex magic to make the file look like so (truncated):

     [Linker]
     IDX1 = ATCACG
     IDX2 = CGATGT
     IDX3 = TTAGGC
     IDX4 = TGACCA
     IDX5 = ACAGTG
     IDX6 = GCCAAT
     IDX7 = CAGATC
     IDX8 = ACTTGA
     IDX9 = GATCAG
     IDX10 = TAGCTT
     IDX11 = GGCTAC
     IDX12 = CTTGTA
     index_6nt_1 = AACCAG
     index_6nt_2 = AACCGA
     index_6nt_3 = AACCTC
     index_6nt_4 = AACGAA
     index_6nt_5 = AACGCC
    
     ...
    
  3. Run this file through levenshtein.py:

     python levenshtein.py --configuration 6nt_ed2.txt \
         --section=Linker
    
  4. Remove from 6nt_ed2.txt the sequences starting with index_6nt_* that levenshtein.py indicates as causing the edit distance to be 0 (i.e. don't remove any sequences that start with IDX).

  5. Run the file through levenshtein.py again and remove the sequences starting with index_6nt_* that levenshtein.py indicates as causing the edit distance to be = 1.

  6. Run the edited file through levenshtein.py once more to ensure that the minimum edit distance is ≥ 2.

  7. This final set contains the extended set of edit distance 2 barcodes that you can use with the Nextera adapters. Before settling on a set, you need to build a different input file that looks like so:

     #Index	Name
     ATCACG	IDX_1
     CGATGT	IDX_2
     TTAGGC	IDX_3
     TGACCA	IDX_4
     ACAGTG	IDX_5
     GCCAAT	IDX_6
     CAGATC	IDX_7
     ACTTGA	IDX_8
     GATCAG	IDX_9
     TAGCTT	IDX_10
     GGCTAC	IDX_11
     CTTGTA	IDX_12
     AACCAG	index_6nt_ed2_13
     AACCGA	index_6nt_ed2_14
     AACCTC	index_6nt_ed2_15
     AACGAA	index_6nt_ed2_16
    
     ...
    
  8. Run suggest_subset.py on the barcodes in this file to select the set of N that you need. Thus, if we use IDX 1-12 and we need 20 adapters in total, we need 8 more, so we run:

     suggest_subset.py -i extended_nextera_tags_6nt_ed2_barcodes.txt \
         -s 8 -p 1-12 -o set_of_extended_nextera_barcodes.txt
    
  9. Using the outfile, build the actual adapter sequences that we need to order:

     barcode_inserter.py --5-prime CAAGCAGAAGACGGCATACGAGAT \
         --3-prime CGGTCTGCCTTGCCAGCCCGCTCAG \
         --input set_of_extended_nextera_barcodes.txt > \
         set_of_extended_nextera_adapters.txt
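
The filtering in steps 3-6 above can be approximated in a few lines of Python. This is only a sketch of the idea, not what levenshtein.py does: instead of repeatedly removing offending index_6nt_* sequences by hand, it keeps every protected (IDX) sequence and adds a candidate only if it is at least `min_distance` away from everything already kept. The names `edit_distance` and `extend_set`, and the "dup" and "near" candidates, are made up for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (char_a != char_b)))
        previous = current
    return previous[-1]


def extend_set(protected, candidates, min_distance):
    """Keep all protected tags; add candidates that stay far enough from all kept tags."""
    kept = dict(protected)
    for name, seq in candidates.items():
        if all(edit_distance(seq, other) >= min_distance for other in kept.values()):
            kept[name] = seq
    return kept


protected = {"IDX1": "ATCACG", "IDX2": "CGATGT"}
candidates = {
    "index_6nt_1": "AACCAG",  # from the example file above
    "dup": "ATCACG",          # hypothetical: identical to IDX1 (distance 0)
    "near": "ATCACC",         # hypothetical: one substitution from IDX1 (distance 1)
}
for name, seq in extend_set(protected, candidates, min_distance=2).items():
    print("{0} = {1}".format(name, seq))
```

Because the result depends on the order in which candidates are visited, a greedy pass like this may keep a slightly different set than the manual procedure; calling it with `min_distance=3` corresponds to the edit distance 3 workflow in the next section.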
    

Turning the original edit distance 2 set into an edit distance 3 set

  1. Format a file with the IDX sequences in it, like so:

     [Linker]
     IDX1 = ATCACG
     IDX2 = CGATGT
     IDX3 = TTAGGC
     IDX4 = TGACCA
     IDX5 = ACAGTG
     IDX6 = GCCAAT
     IDX7 = CAGATC
     IDX8 = ACTTGA
     IDX9 = GATCAG
     IDX10 = TAGCTT
     IDX11 = GGCTAC
     IDX12 = CTTGTA
    
  2. Run this through levenshtein.py:

     python levenshtein.py --configuration IDX_sequences.txt \
         --section=Linker
    
  3. Remove those sequences indicated as having edit distance 2 (IDX1 and IDX8 are absent from the example file in step 8), then run the code again to make sure that every remaining pair has edit distance ≥ 3.

  4. Create a set of 6nt tags of edit distance 3:

     create_index_sequences.py -l 6 -d 3 -o 6nt_ed3.txt
    
  5. Format these sequences to look like the file containing the IDX sequences above. Run this file through levenshtein.py:

     python levenshtein.py --configuration IDX_sequences.txt \
         --section=Linker
    
  6. Remove from the file those sequences starting with index_6nt_* that make the edit distance ≤ 1, run levenshtein.py again, and then remove those that make the edit distance ≤ 2. We should now have a file containing only sequences with edit distance ≥ 3.

  7. Double-check this by running, one last time:

     python levenshtein.py --configuration IDX_sequences.txt \
         --section=Linker
    
  8. This final set contains the extended set of edit distance 3 barcodes that you can use with the Nextera adapters. Before settling on a set, you need to build a different input file that looks like so:

     #Index	Name
     CGATGT	IDX_2
     TTAGGC	IDX_3
     TGACCA	IDX_4
     ACAGTG	IDX_5
     GCCAAT	IDX_6
     CAGATC	IDX_7
     GATCAG	IDX_9
     TAGCTT	IDX_10
     GGCTAC	IDX_11
     CTTGTA	IDX_12
     AACGCA	index_6nt_ed3_13
     ACTATA	index_6nt_ed3_14
     AGTTGG	index_6nt_ed3_15
    
     ...
    
  9. Run suggest_subset.py on the barcodes in this file to select the set of N that you need. Thus, if we use the 10 remaining IDX sequences (IDX 2-7 and 9-12) and we need 20 adapters in total, we need 10 more, so we run:

     suggest_subset.py -i extended_nextera_tags_6nt_ed3_barcodes.txt \
         -s 10 -p 2,3,4,5,6,7,9,10,11,12  \
         -o set_of_extended_nextera_barcodes.txt
    
  10. Using the outfile, build the actual adapter sequences that we need to order:

     barcode_inserter.py --5-prime CAAGCAGAAGACGGCATACGAGAT \
         --3-prime CGGTCTGCCTTGCCAGCCCGCTCAG \
         --input set_of_extended_nextera_barcodes.txt > \
         set_of_extended_nextera_adapters.txt
    
#!/usr/bin/env python3
# encoding: utf-8
"""
barcode_inserter.py
Created by Brant Faircloth on 09 November 2010 10:16 PST (-0800).
Copyright (c) 2010 Brant C. Faircloth. All rights reserved.
"""
import os
import sys
import optparse


def interface():
    '''Command-line interface'''
    usage = "usage: %prog [options]"
    p = optparse.OptionParser(usage)
    p.add_option('--input', dest='input', action='store',
        type='string', default=None,
        help='The path to the input barcodes file.', metavar='FILE')
    p.add_option('--5-prime', dest='head', action='store',
        type='string', default=None,
        help='The sequence 5-prime of the barcode.', metavar='SEQ')
    p.add_option('--3-prime', dest='tail', action='store',
        type='string', default=None,
        help='The sequence 3-prime of the barcode.', metavar='SEQ')
    (options, arg) = p.parse_args()
    if not options.input:
        p.print_help()
        sys.exit(2)
    if not os.path.isfile(options.input):
        print("You must provide a valid path to the input barcodes file.")
        p.print_help()
        sys.exit(2)
    return options, arg


def rev_comp(seq):
    '''Return reverse complement of seq'''
    bases = str.maketrans('AGCTagct', 'TCGAtcga')
    # translate it, reverse, return
    return seq.translate(bases)[::-1]


def main():
    options, args = interface()
    with open(options.input) as infile:
        for line in infile:
            line = line.strip('\n')
            # skip blank lines and comment/header lines
            if not line.strip() or line.startswith('#'):
                continue
            if '\t' in line:
                # tab-delimited rows are "barcode<TAB>name"
                barcode, name = line.split('\t')
            else:
                # space-delimited rows are "name barcode"
                name, barcode = line.split(' ')
            # the barcode is inserted as its lower-case reverse complement
            primer = options.head + rev_comp(barcode).lower() + options.tail
            print('Index Primer {0}: 5\' - {1} ===>'.format(name, primer))


if __name__ == '__main__':
    main()
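
As a worked example of what the script produces, the sketch below builds the primer for IDX1 using the 5-prime and 3-prime sequences from the commands above. The helper name `build_primer` is hypothetical; it only mirrors the concatenation done in `main()`, and `rev_comp` is written with Python 3's `str.maketrans`.

```python
def rev_comp(seq):
    """Return the reverse complement of a DNA sequence."""
    bases = str.maketrans('AGCTagct', 'TCGAtcga')
    return seq.translate(bases)[::-1]


def build_primer(head, barcode, tail):
    """Join head + lower-case reverse-complemented barcode + tail."""
    return head + rev_comp(barcode).lower() + tail


head = "CAAGCAGAAGACGGCATACGAGAT"
tail = "CGGTCTGCCTTGCCAGCCCGCTCAG"

# IDX1 = ATCACG, whose reverse complement is CGTGAT
print(build_primer(head, "ATCACG", tail))
# CAAGCAGAAGACGGCATACGAGATcgtgatCGGTCTGCCTTGCCAGCCCGCTCAG
```

Writing the barcode in lower case makes it easy to spot within the full primer sequence.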