If you have a transcriptome that has been assembled from shotgun reads, the TSA (Transcriptome Shotgun Assembly) database is a good place to put it so that it can be widely accessed.
This guide assumes that you simply want to submit the assembled sequences from your transcriptome without annotations. NCBI sets a high bar for the inclusion of annotations, so annotations for most non-model organisms are unlikely to meet the criteria.
To create a TSA submission, take a look at the NCBI guidelines. This gist is based on those guidelines.
Start your TSA submission by creating a BioProject. When you create this BioProject you should select Assembly as the project type. If you have previously created a BioProject for your raw data with a different project type, you will need to create a new one for the TSA submission. Ensure that you have uploaded all of the raw data that was used to create the assembly to SRA.
Probably the trickiest part of the process is cleaning the data so that it passes NCBI's automated checks. This gist contains a simple Ruby script that performs some very basic cleaning, but you will almost certainly need to do more than this after you make your first attempt at submission.
First clean the transcriptome FASTA file so that
- No sequence is shorter than 200 nucleotides
- No sequence has more than 14 consecutive Ns
This can be done with the ncbi_ise.rb script:
./ncbi_ise.rb transcriptome.fasta > cleaned_transcriptome.fasta
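Before moving on it can be worth confirming that the cleaned file really satisfies both rules. The following is a minimal sketch of such a check, not the ncbi_ise.rb implementation; the helper names `read_fasta` and `find_problems` are made up for illustration, and it assumes a plain nucleotide FASTA file.

```ruby
# Minimal FASTA check for the two rules above.
# This is an illustrative sketch, not the code used by ncbi_ise.rb.
MIN_LENGTH = 200 # shortest sequence allowed
MAX_N_RUN  = 14  # longest allowed run of consecutive Ns

# Parse FASTA text into an array of [id, sequence] pairs.
# The id is the first whitespace-delimited word of the header line.
def read_fasta(text)
  records = []
  text.each_line do |line|
    line = line.strip
    next if line.empty?
    if line.start_with?(">")
      records << [line[1..-1].split.first, String.new]
    elsif records.any?
      records.last[1] << line
    end
  end
  records
end

# Return a hash of id => list of problems for every record that breaks a rule.
def find_problems(records)
  problems = {}
  records.each do |id, seq|
    issues = []
    issues << "shorter than #{MIN_LENGTH} nt" if seq.length < MIN_LENGTH
    issues << "run of more than #{MAX_N_RUN} Ns" if seq =~ /N{#{MAX_N_RUN + 1},}/i
    problems[id] = issues unless issues.empty?
  end
  problems
end

# When run as a script with a FASTA path, print one line per offending sequence.
if __FILE__ == $PROGRAM_NAME && ARGV[0]
  find_problems(read_fasta(File.read(ARGV[0]))).each do |id, issues|
    puts "#{id}\t#{issues.join('; ')}"
  end
end
```

Saved under a name of your choosing and run against cleaned_transcriptome.fasta, it should print nothing if the file meets both rules.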
Fill out the template form at https://submit.ncbi.nlm.nih.gov/templates/ and save the resulting template file.
To build the submission file you need the tbl2asn tool installed. This tool is distributed by NCBI but can also be installed with Homebrew:
brew tap homebrew/science
brew install tbl2asn
Once you have the tbl2asn utility installed you can use it to create the final .sqn file (in ASN.1 format) for upload.
Be sure to edit the command below to use the correct organism genus and species:
tbl2asn -t template.sbt -w assembly.cmt -M t -i cleaned_transcriptome.fasta -j "[organism=Genus species][moltype=transcribed_RNA][tech=TSA]"
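If you run this command from a script, the square brackets and the space in the organism name are easy to get wrong when quoting. As a rough sketch, the same command can be assembled as an argument list in Ruby. The flags and file names are copied from the command above; the `tbl2asn_command` helper is hypothetical, and "Genus species" is a placeholder you must replace.

```ruby
require "shellwords"

# Build the tbl2asn argument list for a given organism name.
# Flags and file names mirror the command shown above; nothing new is added.
def tbl2asn_command(organism, fasta: "cleaned_transcriptome.fasta")
  source = "[organism=#{organism}][moltype=transcribed_RNA][tech=TSA]"
  ["tbl2asn", "-t", "template.sbt", "-w", "assembly.cmt",
   "-M", "t", "-i", fasta, "-j", source]
end

cmd = tbl2asn_command("Genus species") # placeholder organism name
puts cmd.shelljoin                     # shell-escaped form, for inspection
# system(*cmd)                         # uncomment to actually run tbl2asn
```

Passing the arguments as a list to `system` bypasses the shell, so no manual quoting is needed.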
This .sqn file can be uploaded directly to the NCBI website, where it will be subjected to more checks. Most likely these will identify all kinds of other problems. If problems are found immediately, the site will display a number of lines in red with errors like this one:
ERROR fhd.sqn FhD07614 1..34 VECTOR_MATCH File: fhd.sqn, Code(VECTOR_MATCH), Sequence-id: FhD07614, Interval: 1..34, This sequence has a Strong match on the following UniVec vector: gnl|uv|NGB00309.1:1-48 Multimer of Pharmacia EcoRI adaptor used in I.M.A.G.E. library NCI_CGAP_CLL1 and other libraries
You should download all of these errors to a file (e.g. called 'errors.txt') and then provide them as input to another round of cleaning with the ncbi_ise.rb script, like this:
./ncbi_ise.rb -r errors.txt cleaned_transcriptome.fasta > cleaned_transcriptome2.fasta
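If you need to handle errors that the script does not cover, the error lines are regular enough to parse. The sketch below is one possible approach, not necessarily what ncbi_ise.rb does: it assumes every error line follows the layout of the single example above, and the helper names `parse_error` and `trim_terminal` are hypothetical.

```ruby
# Pattern inferred from the one example error line above; other error
# types may be laid out differently.
ERROR_PATTERN = /Code\((\w+)\), Sequence-id: (\S+?), Interval: (\d+)\.\.(\d+)/

# Parse one error line into a hash, or return nil if it does not match.
def parse_error(line)
  m = ERROR_PATTERN.match(line)
  return nil unless m
  { code: m[1], id: m[2], first: m[3].to_i, last: m[4].to_i }
end

# Remove a 1-based, inclusive interval when it sits at either end of the
# sequence. Internal matches are returned unchanged for manual review.
def trim_terminal(seq, first, last)
  return seq[last..-1] if first == 1
  return seq[0, first - 1] if last == seq.length
  seq
end

# Print the parsed fields for every matching line in an errors file.
if __FILE__ == $PROGRAM_NAME && ARGV[0]
  File.foreach(ARGV[0]) do |line|
    err = parse_error(line)
    puts "#{err[:id]}\t#{err[:code]}\t#{err[:first]}..#{err[:last]}" if err
  end
end
```

Trimming can push a sequence below the 200 nucleotide minimum, so the length check needs to be repeated afterwards.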
After doing this, rerun the tbl2asn command on the newly cleaned FASTA file to regenerate the .sqn file and try the NCBI upload again. If it keeps failing, look at the errors it produces and see if you can fix them manually, or perhaps edit the ncbi_ise.rb script to improve the cleaning process.
Eventually you will satisfy NCBI's automated checks and your file will go through. But the process is most likely not over. NCBI may get back to you with further errors. These come in a file called FCSreport.txt. The ncbi_ise.rb script is also able to use this file to perform further cleaning, like this:
./ncbi_ise.rb -f FCSreport.txt cleaned_transcriptome2.fasta > cleaned_transcriptome3.fasta
Once you've done this, regenerate the .sqn file and resubmit.