This is primarily directed towards preparing large amounts of UCE data for GenBank. However, parts of the following should work with most NGS data sets and other types of sequence data. Programs within phyluce are available from:
https://github.com/faircloth-lab/phyluce
Sequin will trim vector contamination, but it cannot handle huge files (nor do you want it to try). So, the vector-screening steps below attempt to be equivalent to that process.
run the phyluce prep script against the contigs and the probe-match db (this is specific to UCE data):
python ~/phyluce/bin/ncbi/prep_uce_fasta_files_for_ncbi.py \
    contigs \
    lastz/probe.matches.sqlite \
    example.conf \
    output.fsa
you should produce a fasta file that looks like this (header lines should not wrap):
>0ard [organism=Ardeotis kori] [moltype=genomic] [location=genomic] [note=ultra conserved element locus chr3_18428] [specimen-voucher=FLMNH:44254] Ardeotis kori ultra-conserved element locus chr3_18428
AAATTGTTCAGCCGGTGGGGGGTTGGATGCAGCCGACACAGGTACTTACCCCGAGAGAGAGCTTCCGCCAGAGCAGAGCAGCCAAACGGTC...
>1ard [organism=Ardeotis kori] [moltype=genomic] [location=genomic] [note=ultra conserved element locus chr1_14351] [specimen-voucher=FLMNH:44254] Ardeotis kori ultra-conserved element locus chr1_14351
GAATCTGAGCTTGTCTCACTAATTATCGTTCCACTCTTCCAGGTCAATGATGCCATAGGAAGGGACTGGCCCTGGATCTATTTTGTTACAC...
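Before moving on, it can help to confirm that every header carries the source modifiers shown above and that no header has wrapped onto a second line. The following is only a minimal sketch, not part of phyluce: the function name `check_fasta_headers` and the `REQUIRED` list are hypothetical, and the list assumes you want exactly the five modifiers that appear in the example.

```python
import re

# Modifiers seen in the example header above (an assumption, adjust as needed).
REQUIRED = ("organism", "moltype", "location", "note", "specimen-voucher")
# Matches bracketed key=value pairs such as [organism=Ardeotis kori].
MODIFIER = re.compile(r"\[([^=\]]+)=([^\]]*)\]")


def check_fasta_headers(lines):
    """Return a list of (line_number, problem) tuples for FASTA lines."""
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line.startswith(">"):
            # Sequence lines should never contain modifier syntax; if one
            # does, the header above it has probably wrapped.
            if "[" in line or "=" in line:
                problems.append((number, "possible wrapped header"))
            continue
        found = {key for key, _ in MODIFIER.findall(line)}
        for key in REQUIRED:
            if key not in found:
                problems.append((number, "missing [%s=...]" % key))
    return problems
```

Typical use would be `check_fasta_headers(open("output.fsa"))`; an empty list means no problems were detected by these two checks.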
run the output fasta (named test.fsa in the commands below) against NCBI's UniVec_Core database:
blastn -task blastn -db UniVec_core -query test.fsa \
    -evalue 1 -gapopen 3 -gapextend 3 -word_size 11 \
    -reward 1 -penalty -5 -out blast.out -num_threads 4 \
    -dust yes -searchsp 1750000000000 -soft_masking true \
    -outfmt 6
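With -outfmt 6, blast.out is a tab-delimited table whose default columns are qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue and bitscore. To see which contigs picked up vector hits, and where, something like the following sketch could be used. The function name `read_blast_tabular` is hypothetical, and the code assumes the default column order was not changed.

```python
from collections import defaultdict


def read_blast_tabular(lines):
    """Map each query id to a sorted list of 1-based (qstart, qend) ranges."""
    hits = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and any comment lines
        fields = line.split("\t")
        # Default -outfmt 6 columns: qseqid sseqid pident length mismatch
        # gapopen qstart qend sstart send evalue bitscore
        qseqid, qstart, qend = fields[0], int(fields[6]), int(fields[7])
        hits[qseqid].append((min(qstart, qend), max(qstart, qend)))
    return {query: sorted(ranges) for query, ranges in hits.items()}
```

Calling `read_blast_tabular(open("blast.out"))` returns a dictionary keyed by contig name; contigs without any vector hit do not appear in it.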
trim sequences:
python ~/phyluce/bin/ncbi/trim_vecscreen_results.py \
    blast.out \
    test.fsa \
    output-univec-trimmed.fsa
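To illustrate the general idea of trimming by hit coordinates, here is one simple strategy: keep the longest stretch of a contig that no vector hit covers. This is only a sketch under that assumption and is not necessarily what trim_vecscreen_results.py does internally. The function name `longest_clean_segment` is hypothetical, and hit ranges are taken to be 1-based, inclusive qstart/qend values from the BLAST table.

```python
def longest_clean_segment(sequence, hit_ranges):
    """Return the longest stretch of `sequence` not covered by any hit.

    `hit_ranges` holds 1-based, inclusive (start, end) query coordinates,
    as reported in the qstart/qend columns of BLAST tabular output.
    """
    best_start, best_end = 0, 0  # 0-based, half-open slice of the best gap
    cursor = 0                   # first position not yet covered by a hit
    for start, end in sorted(hit_ranges):
        # The uncovered gap before this hit is [cursor, start - 1).
        gap_start, gap_end = cursor, start - 1
        if gap_end - gap_start > best_end - best_start:
            best_start, best_end = gap_start, gap_end
        cursor = max(cursor, end)
    # Consider the tail after the last hit as well.
    if len(sequence) - cursor > best_end - best_start:
        best_start, best_end = cursor, len(sequence)
    return sequence[best_start:best_end]
```

A contig with hits only at its ends comes back with both ends removed, a contig with no hits comes back unchanged, and a contig covered end to end by hits comes back empty, which would flag it for removal rather than submission.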
create template file: http://www.ncbi.nlm.nih.gov/WebSub/template.cgi
create a sequence directory containing your fasta file and any cmt file needed:
*.fasta *.cmt
structured cmt format in file below for sequencing tech and assembly type
run tbl2asn:
tbl2asn -t ASN-template.sbt -p sequence -a s -X C -V v