Skip to content

Instantly share code, notes, and snippets.

@knmkr
Created May 1, 2015 08:39
Show Gist options
  • Save knmkr/03fdabf3d4af2fb69e67 to your computer and use it in GitHub Desktop.
Save knmkr/03fdabf3d4af2fb69e67 to your computer and use it in GitHub Desktop.
[bioinfo] Get FASTA sequence of rs ID
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import argparse
import gzip
import re
import sys
from Bio import SeqIO
from Bio.Alphabet import IUPAC
def _main():
parser = argparse.ArgumentParser(description='')
parser.add_argument('multi_fasta_path', help='Multi-fasta file downloaded from ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/rs_fasta')
parser.add_argument('query_path', help='Text file, each line contains one rs ID for query.')
args = parser.parse_args()
rs_regexp = re.compile('.*(rs(\d+)).*')
seq_split_regexp = re.compile('([^atgc]+)', re.IGNORECASE)
#
query_rs = set()
with open(args.query_path, 'r') as qfin:
for line in qfin:
query_rs.update([line.strip()])
print >>sys.stderr, '[INFO] {}'.format(query_rs)
#
with gzip.open(args.multi_fasta_path, 'r') as fin:
for i, record in enumerate(SeqIO.parse(fin, 'fasta', alphabet=IUPAC.ambiguous_dna)):
rs_match = rs_regexp.match(record.id)
rs_id = rs_match.group(1) if rs_match else ''
if rs_id in query_rs:
seqs = seq_split_regexp.split(str(record.seq))
# FIXME
if not len(seqs) == 3:
print >>sys.stderr, '[WARN] ambigous dna found more than one.'
print >>sys.stderr, '[WARN] {}'.format(rs_id)
else:
seq_5, seq_ambigous, seq_3 = seqs
print ','.join([rs_id, seq_5, seq_ambigous, seq_3])
if __name__ == '__main__':
_main()
@knmkr
Copy link
Author

knmkr commented May 1, 2015

Download all multi-FASTA files from FTP for current dbSNP build:

$ wget -r -l 0 ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/rs_fasta
$ ls ~/ftp.ncbi.nih.gov/snp/organisms/human_9606/rs_fasta/
index.html      rs_ch12.fas.gz  rs_ch16.fas.gz  rs_ch2.fas.gz   rs_ch3.fas.gz  rs_ch7.fas.gz        rs_chMT.fas.gz     rs_chUn.fas.gz
rs_ch1.fas.gz   rs_ch13.fas.gz  rs_ch17.fas.gz  rs_ch20.fas.gz  rs_ch4.fas.gz  rs_ch8.fas.gz        rs_chMulti.fas.gz  rs_chX.fas.gz
rs_ch10.fas.gz  rs_ch14.fas.gz  rs_ch18.fas.gz  rs_ch21.fas.gz  rs_ch5.fas.gz  rs_ch9.fas.gz        rs_chNotOn.fas.gz  rs_chY.fas.gz
rs_ch11.fas.gz  rs_ch15.fas.gz  rs_ch19.fas.gz  rs_ch22.fas.gz  rs_ch6.fas.gz  rs_chAltOnly.fas.gz  rs_chPAR.fas.gz

Create a query text file, then run posted python script:

$ cat rslist.txt
rs671
rs672
$ for x in {1..22} X Y AltOnly MT Multi NotOn PAR Un; python get_fasta_sequence_of_rs_id.py ~/ftp.ncbi.nih.gov/snp/organisms/human_9606/rs_fasta/rs_ch${x}.fas.gz rslist.txt > ch${x}.seq.txt &; done

Results:

$ cat ch12.seq.txt
rs671,AGGCATAGTGGCACATACTTGTTATCTTAACTACTTGGGAGGCTGAGGCAGGAGGATCACTGAAGACCAGGAGTTGGAGACCAGCCTGGGTAACATAATCAGACCCTGTCTCTTAAAAAAAAATTTATTGCCAGGCGTGGTTGCACGTGCTGGTAGTCCAGCTACTCAGGAAGCTGAGGCAGGAGAATCTCTTGAACCCCAGATGTGGAGGTTGCAACGAGCCAAGATCATGCCATGGCAACTCCAGCCTGGGCAACAGAGAAAGATTCTATCTCAAAAAAAAAAATTTTTTTTTAAGTTAAAAATAAAATAAAGACTTTGGGGCAATACAGGGGGTCCTGGGAGTGTAACCCATAACCCCCAAGAGTGATTTCTGCAATCTCGTTTCAAATTACAGGGTCAACTGCTATGATGTGTTTGGAGCCCAGTCACCCTTTGGTGGCTACAAGATGTCGGGGAGTGGCCGGGAGTTGGGCGAGTACGGGCTGCAGGCATACACT,R,AAGTGAAAACTGTGAGTGTGGGACCTGCTGGGGGCTCAGGGCCTGTTGGGGCTTGAGGGTCTGCTGGTGGCTCGGAGCCTGCTGGGGGATTGGGGTCTGTTGGGGGCTCGGGGCCTGCCAGAGGTTCAGGACCTGCCGGGGACTCAGGGCCTGCTGGAAGTTCAGGACCTGCTGGGGATCAGGGCCTGCCAGGGATTTAGGGTCTGCTGGGCGGGCCACCTTTTGGCCTCTCCCTCATGCTTGAGGCCATCAGTGTTTCCTACTAATTTCCCATTTTAAGCCTGAGAAGTGACAAGAGAGGGTAAAGACCCAGCCTCTGCTCTGTCCCATGAGAAATACTGAGGGACGTGCCCCCATCAGGCCTATGCGGTCATTTGCTGGGCTTCGTTATACGCCAAGGCCTGTAGGCCTGAGAAGAGGGAGAGACTTCAGGGGGCGGAGCGGAGAGGAAAAGCTTCTAGTAAGAATCTTTTCAGATTTTCACCAGGCGCGGTGGCTTT
$ cat ch1.seq.txt
rs672,GCCGCAATGCCCGGGAGTTTCTCCAATGTGTGGAGAAGGCCTTAGAAGACATGTTTGATGCCTTAGAAGGCAAATCCATCAAAAGTTAACTTCTGGGCAGATGAAAAGCTACCATCACTTCCTCATCATGAAAACTGGGAggccgggcat,V,gtgctcatgcctgtaatcccagcattttgagaggctgaggcgggtggatcacttgaggtcaggagtttgagaccaacctggccaacat

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment