Skip to content

Instantly share code, notes, and snippets.

@genomewalker
Last active August 4, 2024 06:29
Show Gist options
  • Save genomewalker/43e6a9934d41704326231c52a883512b to your computer and use it in GitHub Desktop.
Save genomewalker/43e6a9934d41704326231c52a883512b to your computer and use it in GitHub Desktop.
Get xRNA from Prokka gff
import argparse
import gzip
from Bio import SeqIO
from Bio.SeqFeature import SeqFeature, FeatureLocation
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
def extract_rrna_trna_features(input_file, output_file):
# Determine if the input file is gzipped
if input_file.endswith(".gz"):
input_handle = gzip.open(input_file, "rt")
else:
input_handle = open(input_file, "r")
# Determine if the output file should be gzipped
if output_file.endswith(".gz"):
output_handle = gzip.open(output_file, "wt")
else:
output_handle = open(output_file, "w")
# Read and write rRNA and tRNA features
for record in SeqIO.parse(input_handle, "genbank"):
features = [feature for feature in record.features if feature.type in ['rRNA', 'tRNA']]
for feature in features:
# Extract the sequence for the feature
feature_seq = feature.extract(record.seq)
# Prepare the FASTA header
locus_tag = feature.qualifiers.get('locus_tag', ['unknown locus'])[0]
product = feature.qualifiers.get('product', ['unknown product'])[0]
header = f"{locus_tag}|{feature.type}|{product}"
# Create a new SeqRecord for each feature
feature_record = SeqRecord(
feature_seq,
id=header,
description=""
)
SeqIO.write(feature_record, output_handle, "fasta")
input_handle.close()
output_handle.close()
def main():
parser = argparse.ArgumentParser(description="Extract rRNA and tRNA features from a GenBank file")
parser.add_argument('--input', '-i', required=True, help='Input GenBank file (can be gzipped)')
parser.add_argument('--output', '-o', required=True, help='Output file to store rRNA and tRNA features (can be gzipped)')
args = parser.parse_args()
extract_rrna_trna_features(args.input, args.output)
if __name__ == "__main__":
main()

We get the rRNA and tRNA sequences from the non-redundant contigs using the Prokka gff and ffn files.

for i in *gff; do 
    DNAM=../../contigs-derep/ancientGut/${i/.gff/_derep_contigs.tsv};
    DNAM=${DNAM/.proteins./.assm.};
    seqkit grep -r -n -f <(awk -F'\t' '$3 == "rRNA" || $3 == "tRNA" {print $1, $9}' <(grep -f ${DNAM} $i) \
    | sed -e 's/ID//;' \
    | awk -F'[=;]' '{print $2}') ${i/.gff/.ffn} -o ${i/.gff/.rna.fa.gz}; 
done

Prepare the genomic xRNA database. After getting the gbff files from NCBI, we extract the rRNA and tRNA sequences.

for i in *gbff.gz; do 
    python get-rna-seqs.py -i ${i} -o ${i/genomic.gbff.gz/rna.fa.gz}; 
done

Then we remove the identical sequences in each genome:

for i in *_rna.fa.gz; do 
    NAM=$(basename $i _rna); vsearch --fastx_uniques $i --fastaout ${NAM}_derep.fa --uc ${NAM}_derep.uc --strand both; 
    cat *_rna.fa.gz_derep.fa > gut_simulation_rna.fa
done

Then we search the rRNA and tRNA sequences in the genomic xRNA database:

for i in *rna.fa.gz; do 
    vsearch --usearch_global ${i} --db ../../genome-rna/ancientGut/gut_simulation_rna.fa --id 0.9 --userout ${i/.fa.gz/.search.tsv} --userfields query+target+id+alnlen+mism+opens+ql+tl+qcov+tcov --maxaccepts 100 --maxrejects 100 --maxhits 1 --strand both --output_no_hits --threads 16 --query_cov 0.9; 
done  
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment