@iracooke, last active September 20, 2015

Creating a protein database from 6-frame and transdecoder sequences

Analyses in Galaxy

  1. Run the TranscriptomePGMakeDatabase workflow. Inputs for this include a Trinity assembly, predicted proteins from Transdecoder, gff3 coordinates corresponding to the Transdecoder predictions, and the cRAP database of contaminants.
  2. Ensure that the known_novel_crap_decoy.fasta output from the above workflow is loaded onto Mascot for searching.
  3. Use the outputs from TranscriptomePGMakeDatabase to run a workflow based on the Transcriptome PG workflow, modified to include a Mascot search against the database for your specific organism.
  4. Download the observed_peptides.gff3 file that you get from running the previous workflow step.

Use bedtools to find novel only peptides

  1. First gather the following files in a directory on your laptop

     | File | Description |
     | ---- | ----------- |
     | observed_peptides.gff3 | Output from the Galaxy workflow above |
     | transdecoder.gff3 | GFF coordinates for Transdecoder predictions (used as an input to the Galaxy workflow above) |
     | known.fasta | Output from the database generation workflow above |
     | novel.fasta | Output from the database generation workflow above |
     | crap.fasta | The cRAP contaminants. Download |
  2. Make sure you install bedtools on your laptop

  3. Extract only CDS entries from the transdecoder file

        awk '$3 == "CDS"' transdecoder.gff3 > transdecoder_cds.gff3
  4. Compare the observed peptides with the Transdecoder CDS entries, keeping only those peptides that are not fully contained within a Transdecoder CDS prediction

        bedtools subtract -s -f 1.0 -a observed_peptides.gff3 -b transdecoder_cds.gff3 > really_novel.gff3
  5. Download this filter_fasta.rb script and put it in the same directory as the files above. Make sure it has execute permissions, i.e.

        chmod u+x filter_fasta.rb
  6. Extract the IDs from really_novel.gff3 and then grab their corresponding entries from the novel.fasta file

        cat really_novel.gff3 | awk -F '=' '{print $2}' | awk -F '.' '{print "lcl|"$1}' | sort -u > really_novel_ids.txt
    
        ./filter_fasta.rb novel.fasta really_novel_ids.txt > really_novel.fasta
  7. Concatenate the novel proteins with known.fasta

        cat really_novel.fasta known.fasta > known_updated.fasta
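The `bedtools subtract -s -f 1.0` call in step 4 discards a peptide only when it lies entirely within a CDS feature on the same strand. A minimal Ruby sketch of that containment test (illustrative only; the intervals and names below are invented, and the real tool parses GFF and handles many cases this sketch ignores):

```ruby
# Sketch of the test performed by `bedtools subtract -s -f 1.0`:
# an A interval (peptide) is dropped only when 100% of it overlaps
# a B interval (CDS) on the same strand.
def fully_contained?(a, b_list)
  b_list.any? do |b|
    a[:strand] == b[:strand] &&       # -s : require same strand
      b[:start] <= a[:start] &&       # -f 1.0 : B must cover all of A
      a[:end] <= b[:end]
  end
end

peptides = [
  { name: "pep1", start: 100, end: 130, strand: "+" }, # inside a CDS -> dropped
  { name: "pep2", start: 400, end: 450, strand: "+" }, # outside -> kept
  { name: "pep3", start: 100, end: 130, strand: "-" }, # wrong strand -> kept
]
cds = [{ start: 50, end: 200, strand: "+" }]

really_novel = peptides.reject { |p| fully_contained?(p, cds) }
puts really_novel.map { |p| p[:name] }.join(",")   # => pep2,pep3
```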

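The awk pipeline in step 6 takes everything after the `=` in each GFF line, truncates it at the first `.`, adds an `lcl|` prefix (so the IDs match the FASTA headers in novel.fasta), and de-duplicates. A Ruby sketch of the same transformation, using hypothetical GFF lines:

```ruby
# Sketch of the ID-extraction pipeline: split each GFF line on '=',
# keep the attribute value up to the first '.', prefix 'lcl|', and
# de-duplicate. The input lines here are hypothetical examples.
lines = [
  "comp1_c0_seq1\t.\tpolypeptide\t10\t40\t.\t+\t.\tID=comp1_c0_seq1.p1",
  "comp1_c0_seq1\t.\tpolypeptide\t60\t90\t.\t+\t.\tID=comp1_c0_seq1.p1",
  "comp2_c0_seq1\t.\tpolypeptide\t5\t35\t.\t-\t.\tID=comp2_c0_seq1.p2",
]

ids = lines.collect { |l| "lcl|" + l.split("=")[1].split(".")[0] }.uniq.sort
puts ids
```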
Name and install the database

For MaxQuant, simply rename the DB. Modify the example below for your species

    cp known_updated.fasta SAustrinumMQ_20151908.fasta

To create a database for normal TPP decoy searching, and for loading on Mascot, you need to install protk, which provides the make_decoy.rb tool.

   cat crap.fasta known_updated.fasta > known_updated_crap.fasta
   make_decoy.rb -A known_updated_crap.fasta -o SAustrinum_20151908.fasta
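make_decoy.rb appends decoy (nonsense) sequences to the target database so that downstream tools can estimate false discovery rates. As an illustration of the general idea only, here is a minimal Ruby sketch of reversed-sequence decoy generation; the `decoy_` header prefix and the reversal strategy are assumptions for this sketch, not necessarily what make_decoy.rb does:

```ruby
# Sketch of reversed-sequence decoy generation: for each target protein,
# append a decoy whose sequence is the target reversed. The "decoy_"
# header prefix is an assumption for illustration.
def add_decoys(entries)
  entries + entries.collect do |header, seq|
    ["decoy_" + header, seq.reverse]
  end
end

targets = [["prot1", "MKLVA"], ["prot2", "GGSTP"]]
db = add_decoys(targets)
db.each { |h, s| puts ">#{h}\n#{s}" }
```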
filter_fasta.rb

    #!/usr/bin/env ruby
    #
    # filter_fasta.rb - print only those entries of a FASTA file whose IDs
    # appear in an ID list (one ID per line).
    #
    # Usage: filter_fasta.rb input.fasta ids.txt > filtered.fasta
    require 'set'
    require 'bio'

    fasta_file = ARGV[0]
    id_filter = ARGV[1]
    output_fh = $stdout

    # Load the ID list into a Set for fast membership tests
    $filter_ids = Set.new(File.readlines(id_filter).collect { |e| e.chomp })

    # An entry passes if the filter list is empty or contains the entry's ID
    def passes_filters(entry)
      return true if $filter_ids.empty?
      $filter_ids.include? entry.entry_id
    end

    file = Bio::FastaFormat.open(fasta_file.chomp)
    file.each do |entry|
      output_fh.write entry if passes_filters(entry)
    end