@fomightez
Last active August 29, 2015 14:07
get list of just accession numbers from fasta sequence entries list using regular expressions

Step 1: eliminate all but description line of FASTA entries

First, reduce the file to just the lines beginning with a greater-than sign (>), i.e., leave only the description line of each entry (adapted from http://stackoverflow.com/questions/7310598/remove-all-lines-without-an-character-in-notepad):

FIND:

^[^>]*$

REPLACE:

[LEAVE EMPTY, SO THAT THE MATCHED LINES ARE SIMPLY DELETED]
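For anyone who prefers a script to an editor's find-and-replace, Step 1 can be sketched in Python. This is only an illustration, not part of the original recipe: the function name `description_lines` is hypothetical, and the sequence lines in the sample are placeholders rather than the real sequences. It keeps lines that start with `>`, which for well-formed FASTA should give the same result as deleting every line that contains no `>`.

```python
def description_lines(fasta_text):
    """Return only the FASTA description lines (those starting with '>')."""
    return [line for line in fasta_text.splitlines() if line.startswith(">")]


# Sample input: the headers come from the example in this gist; the
# residue lines are placeholders, NOT the real sequences.
example = """>gi|16580628|emb|CAC82173.1| FSH receptor [Podarcis siculus]
PLACEHOLDERSEQ
PLACEHOLDERSEQ
>gi|37778925|gb|AAO72730.1| follicle-stimulating hormone receptor [Bothrops jararaca]
PLACEHOLDERSEQ
"""

headers = description_lines(example)
print("\n".join(headers))
```

Unlike the editor approach, this also drops blank lines, because a blank line does not start with `>`.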

Step 2: eliminate all but accession

Extract the accession.version value from each description line using a regex. (If you are starting from full FASTA entries, run Step 1 above first to discard the sequence lines.)

Find:

^>gi\|\d+\|\w+\|(\w+\.\d+)\|.*$

Replace:

\1
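The same find-and-replace can be sketched with Python's `re` module. Treat this as a rough equivalent under a couple of assumptions: the names `ACCESSION_RE` and `extract_accessions` are made up for illustration, and the version part is written as `\d+` so that accessions with a multi-digit version (for example `.10`) also match. Lines that do not fit the `>gi|...|db|accession.version|` layout are skipped rather than passed through.

```python
import re

# Capture group 1 is the accession.version field of a gi-style header.
# Assumption: \d+ is used for the version so multi-digit versions match.
ACCESSION_RE = re.compile(r"^>gi\|\d+\|\w+\|(\w+\.\d+)\|.*$")


def extract_accessions(header_lines):
    """Return the accession.version of every header that matches the pattern."""
    accessions = []
    for line in header_lines:
        match = ACCESSION_RE.match(line)
        if match:
            accessions.append(match.group(1))
    return accessions


headers = [
    ">gi|16580628|emb|CAC82173.1| FSH receptor [Podarcis siculus]",
    ">gi|37778925|gb|AAO72730.1| follicle-stimulating hormone receptor [Bothrops jararaca]",
    ">gi|16580628|emb|CAC82173.1| FSH receptor [Podarcis siculus]",
]

accessions = extract_accessions(headers)
print("\n".join(accessions))
```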

Example:

Starting with:

>gi|16580628|emb|CAC82173.1| FSH receptor [Podarcis siculus]
>gi|37778925|gb|AAO72730.1| follicle-stimulating hormone receptor [Bothrops jararaca]
>gi|16580628|emb|CAC82173.1| FSH receptor [Podarcis siculus]

Following the two steps results in:

CAC82173.1
AAO72730.1
CAC82173.1

You can then feed this list to Batch Entrez to get back the FASTA entries, if you want. An added bonus is that the duplicates will be discarded in the process. This means you can collect FASTA entries from many different BLAST queries, combine them, run these two regex steps, and end up with a clean list containing a single instance of each entry.

Final result after that bonus step:

CAC82173.1
AAO72730.1
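
If you would rather drop the duplicates locally instead of relying on Batch Entrez, a minimal Python sketch is below. The helper name `unique_in_order` is hypothetical. It relies on the fact that a `dict` keeps its keys in insertion order, so the first occurrence of each accession is kept and the original order is preserved.

```python
def unique_in_order(accessions):
    """Remove repeated accessions while keeping first-seen order."""
    return list(dict.fromkeys(accessions))


accessions = ["CAC82173.1", "AAO72730.1", "CAC82173.1"]
unique = unique_in_order(accessions)
print("\n".join(unique))
```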