We have the fasta files of some viral genomes. We want to aligend all of them and extract only one gene of them.
cat *.fasta > all.fasta
cat gene.fasta all.fasta > all_with_gene.fasta
The I ran mafft on them
mafft --auto all_with_gene.fasta > aligned.fasta
The first record in aligned.fasta
shows the position of interest.
By openning the file in an editor atom.
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
-------------------------------atgtttgtgtttctggtgctgctgccgct
ggtgagcagccagtgcgtgaacctgaccacccgcacccagctgccgccggcgtataccaa
cagctttacccgcggcgtgtattatccggataaagtgtttcgcagcagcgtgctgcatag
cacccaggatctgtttctgccgttttttagcaacgtgacctggtttcatgcgattcatgt
gagcggcaccaacggcaccaaacgctttgataacccggtgctgccgtttaacgatggcgt
..
gctgtatcaggatgtgaactgcaccgaagtgccggtggcgattcatgcggatcagctgac
cccgacctggcgcgtgtatagcaccggcagcaacgtgtttcagacccgcgcgggctgcct
gattggcgcggaacatgtgaacaacagctatgaatgcgatattccgattggcgcgggcat
ttgcgcgagctatcagacccagaccaacagcccgcgccgcgcgcgc--------------
------------------------------------------------------------
------------------------------------------------------------
I found out the line number, the start and the end which was 361 and 395, respectively. So, for 0-index python list, we should have
start_line=359
end_line=393
The code extracts these lines from all scaffolds in the fasta files.
The code can be improved for extracting the line numbers, which can also be done in python by checking the first scaffold (as the reference, the gene ), and find the first line containing a/t/c/g rather than complete '-'.
ref_genome=genomes[1]
for line in ref_genome:
num_dash=0
for letter in line:
if letter=='-': num_dash+=1
first_n
if num_dash!=60:
print(line)