@brantfaircloth
Created December 18, 2012 06:34
Preparing sequences for GenBank using tbl2asn, BLAST, and a bit of custom vector screening

This is primarily directed towards preparing large amounts of UCE data for GenBank. However, parts of the following should work with most NGS data sets and other types of sequence data. The phyluce programs used below are available from:

https://github.com/faircloth-lab/phyluce

Sequin will trim vector contamination, but it does not handle huge files (nor do you want it to try). The vector-screening steps below attempt to replicate that process.

  • run the prep script against your contigs and the probe-match database (this step is specific to UCE data):

    python ~/phyluce/bin/ncbi/prep_uce_fasta_files_for_ncbi.py \
    contigs \
    lastz/probe.matches.sqlite \
    example.conf \
    output.fsa
    
  • you should produce a fasta file that looks like this (header lines should not wrap):

    >0ard [organism=Ardeotis kori] [moltype=genomic] [location=genomic] [note=ultra conserved element locus chr3_18428] [specimen-voucher=FLMNH:44254] Ardeotis kori ultra-conserved element locus chr3_18428
    AAATTGTTCAGCCGGTGGGGGGTTGGATGCAGCCGACACAGGTACTTACCCCGAGAGAGAGCTTCCGCCAGAGCAGAGCAGCCAAACGGTC...
    >1ard [organism=Ardeotis kori] [moltype=genomic] [location=genomic] [note=ultra conserved element locus chr1_14351] [specimen-voucher=FLMNH:44254] Ardeotis kori ultra-conserved element locus chr1_14351
    GAATCTGAGCTTGTCTCACTAATTATCGTTCCACTCTTCCAGGTCAATGATGCCATAGGAAGGGACTGGCCCTGGATCTATTTTGTTACAC...
    
  • run the output fasta (test.fsa in the commands below) against UniVec_Core:

    blastn -task blastn -db UniVec_core -query test.fsa \
        -evalue 1 -gapopen 3 -gapextend 3 -word_size 11 \
        -reward 1 -penalty -5 -out blast.out -num_threads 4 \
        -dust yes -searchsp 1750000000000 -soft_masking true \
        -outfmt 6
    
  • trim sequences:

    python ~/phyluce/bin/ncbi/trim_vecscreen_results.py \
        blast.out \
        test.fsa \
        output-univec-trimmed.fsa
    
  • create template file: http://www.ncbi.nlm.nih.gov/WebSub/template.cgi

  • create a directory (named sequence in the tbl2asn command below) containing your fasta file and any cmt file needed:

    *.fasta
    *.cmt
    
  • the cmt file at the end of this document shows the structured comment format for sequencing technology and assembly method

  • run tbl2asn:

    tbl2asn -t ASN-template.sbt -p sequence -a s -X C -V v
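
The vector-screening and trimming steps above can be sketched in Python. This is only an illustration, not the logic of trim_vecscreen_results.py, which is not shown here. It assumes the default column order of BLAST tabular output (`-outfmt 6`: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore) and an assumed trimming rule: drop every region with a vector hit and keep the longest remaining stretch. The function names `read_hits` and `longest_clean_segment` are hypothetical.

```python
from collections import defaultdict


def read_hits(blast_lines):
    """Collect 0-based, half-open (start, end) query intervals per query id."""
    hits = defaultdict(list)
    for line in blast_lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split("\t")
        qseqid, qstart, qend = fields[0], int(fields[6]), int(fields[7])
        # -outfmt 6 coordinates are 1-based and inclusive
        start, end = min(qstart, qend) - 1, max(qstart, qend)
        hits[qseqid].append((start, end))
    return hits


def longest_clean_segment(seq, intervals):
    """Return the longest stretch of seq not covered by any hit interval."""
    best = ""
    pos = 0
    for start, end in sorted(intervals):
        # the gap between the previous hit and this one is vector-free
        if start > pos and start - pos > len(best):
            best = seq[pos:start]
        pos = max(pos, end)
    # the tail after the last hit is also vector-free
    if len(seq) - pos > len(best):
        best = seq[pos:]
    return best
```

In use, `read_hits(open("blast.out"))` would give the hit intervals for each sequence id, and each sequence in the fasta file would then be passed through `longest_clean_segment` before being written out.
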
    
Example conf file for prep_uce_fasta_files_for_ncbi.py (passed as example.conf above):

[exclude]
phalacrocorax carbo
megalaima virens
urocolius indicus
struthio camelus
pitta guajana
gallus gallus
eudromia elegans
dromaius novaehollandiae
anser erythropus
zanclostomus javanicus
mus musculus
[metadata]
molecule:DNA
moltype:genomic
location:genomic
note:ultra conserved element locus {}
specimen_voucher:{}
[vouchers]
Ardeotis kori:FLMNH 44254
Balaeniceps rex:LSUMZ B13372
Cathartes aura:LSUMZ B17242
Colibri coruscans:LSUMZ B5574
Eudyptula minor:LSUMZ B36558
Eurypyga helias:LSUMZ B1980
Gampsonyx swainsonii:LSUMZ B15046
Gavia immer:LSUMZ B7923
Megalaima virens:LSUMZ B20788
Micrastur semitorquatus:LSUMZ B11298
Momotus momota:LSUMZ B927
Nyctibius grandis:LSUMZ B15415
Oceanites oceanicus:LSUMZ B37197
Opisthocomus hoazin:LSUMZ B9660
Phaethon rubicauda:LSUMZ B35135
Phalacrocorax carbo:LSUMZ B45740
Phegornis mitchelli:LSUMZ B103926
Phoenicopterus chilensis:LSUMZ B37257
Pitta guajana:LSUMZ B36368
Podiceps auritus:LSUMZ B19296
Psittacula alexandri:LSUMZ B36554
Psophia crepitans:LSUMZ B7513
Pterocles exustus:LSUMZ B20765
Rhinopomastus cyanomelas:LSUMZ B34262
Scopus umbretta:LSUMZ B28330
Sphyrapicus varius:FLMNH 43569
Tauraco erythrolophus:LSUMZ B5354
Treron vernans:LSUMZ B47229
Trogon personata:LSUMZ B7644
Tyto alba:LSUMZ B19295
Urocolius indicus:LSUMZ B34225
Vidua macroura:LSUMZ B16749
[remap]
pterocles:pterocles exustus
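
To show how a conf file like the one above could map onto the fasta headers shown earlier, here is a minimal sketch using Python's standard `configparser`. It is not the code in prep_uce_fasta_files_for_ncbi.py. The helpers `load_conf` and `build_header` are hypothetical, and two details are inferred only from the example header: the sequence id is a counter plus the first three letters of the genus, and the space in the voucher is replaced by a colon.

```python
import configparser

# A shortened copy of the conf file, for illustration only.
CONF_TEXT = """\
[exclude]
gallus gallus
[metadata]
molecule:DNA
moltype:genomic
location:genomic
note:ultra conserved element locus {}
specimen_voucher:{}
[vouchers]
Ardeotis kori:FLMNH 44254
"""


def load_conf(text):
    # allow_no_value handles the bare species names under [exclude];
    # interpolation=None keeps "{}" and any "%" characters literal.
    conf = configparser.ConfigParser(allow_no_value=True, interpolation=None)
    conf.optionxform = str  # keep the capitalisation of species names
    conf.read_string(text)
    return conf


def build_header(conf, index, organism, locus):
    meta = conf["metadata"]
    # Assumption: "FLMNH 44254" becomes "FLMNH:44254" in the header.
    voucher = conf["vouchers"][organism].replace(" ", ":")
    # Assumption: the id is a counter plus the first three genus letters.
    prefix = organism[:3].lower()
    return (
        f">{index}{prefix} [organism={organism}] "
        f"[moltype={meta['moltype']}] [location={meta['location']}] "
        f"[note={meta['note'].format(locus)}] "
        f"[specimen-voucher={meta['specimen_voucher'].format(voucher)}] "
        f"{organism} ultra-conserved element locus {locus}"
    )


conf = load_conf(CONF_TEXT)
header = build_header(conf, 0, "Ardeotis kori", "chr3_18428")
```

With these assumptions, `header` reproduces the first header line of the example fasta file.
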
Example cmt file (structured comment; field name and value are separated by a tab):

Assembly Method	velvet v. 1.1.04
Sequencing Technology	Illumina
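
If you generate the cmt file from a script, a sketch like the following may help. It assumes the cmt file is a plain two-column table with the field name and the value separated by a tab. The function name `write_cmt` and the output file name are hypothetical; check the tbl2asn documentation for any additional lines your submission needs.

```python
import os
import tempfile


def write_cmt(path, fields):
    """Write (name, value) pairs as a two-column, tab-separated file."""
    with open(path, "w") as handle:
        for name, value in fields:
            handle.write(f"{name}\t{value}\n")


# Field names and values taken from the cmt file above.
fields = [
    ("Assembly Method", "velvet v. 1.1.04"),
    ("Sequencing Technology", "Illumina"),
]
# Written to a temporary directory here; in practice the file would go
# into the sequence directory next to the fasta file.
cmt_path = os.path.join(tempfile.mkdtemp(), "example.cmt")
write_cmt(cmt_path, fields)
```
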