Get a list of just the accession numbers from a list of FASTA sequence entries using regular expressions

Step 1: eliminate all but description line of FASTA entries

First, reduce the text to just the lines beginning with a greater-than sign (>), i.e., leave only the description lines (from http://stackoverflow.com/questions/7310598/remove-all-lines-without-an-character-in-notepad).

FIND:

^[^>]*$

REPLACE:

[LEAVE EMPTY, SO THAT THE MATCHED LINES ARE REPLACED WITH NOTHING]
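
For anyone who prefers a script to an editor's find-and-replace, here is a minimal Python sketch of the same step. The function name `keep_description_lines` is made up for illustration, and the pattern adds `\n` to the character class (an assumption on my part) so a match never runs across line breaks in Python's `re` module.

```python
import re

def keep_description_lines(fasta_text):
    """Return only the FASTA description lines (the ones containing '>')."""
    # Blank out every line that has no '>' in it, like the FIND pattern above.
    # The added \n keeps each match confined to a single line.
    blanked = re.sub(r"^[^>\n]*$", "", fasta_text, flags=re.MULTILINE)
    # The replacement leaves empty lines behind, so drop them.
    return "\n".join(line for line in blanked.splitlines() if line)

if __name__ == "__main__":
    sample = ">entry one\nMKVLLA\nGGTTAA\n>entry two\nQQQRRR\n"
    print(keep_description_lines(sample))
```

Run on the small made-up sample, this should print just the two lines that start with `>`.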

Step 2: eliminate all but accession

Extract the accession.version value from the downloaded list of FASTA entries using a regex. (If you are starting from full FASTA entries, run Step 1 first to discard the sequences, so that only the description lines are left.)

Find:

^>gi\|\d+\|\w+\|(\w+\.\d+)\|.*$

Replace:

\1
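
The same find-and-replace can be sketched in Python with `re.sub`. The names `ACCESSION_PATTERN` and `extract_accessions` are hypothetical, and the version part is written as `\d+` here so that accessions with a version of 10 or higher would also match. It assumes the old `>gi|...|db|accession.version|` style of description line shown in this note.

```python
import re

# Capture group 1 is the accession.version field of a ">gi|...|db|ACC.V| ..." line.
# \d+ for the version is a choice made in this sketch (allows multi-digit versions).
ACCESSION_PATTERN = re.compile(r"^>gi\|\d+\|\w+\|(\w+\.\d+)\|.*$", flags=re.MULTILINE)

def extract_accessions(description_lines):
    """Replace each matching description line with just its accession.version."""
    return ACCESSION_PATTERN.sub(r"\1", description_lines)

if __name__ == "__main__":
    lines = (
        ">gi|16580628|emb|CAC82173.1| FSH receptor [Podarcis siculus]\n"
        ">gi|37778925|gb|AAO72730.1| follicle-stimulating hormone receptor [Bothrops jararaca]"
    )
    print(extract_accessions(lines))
```

Lines that do not fit the pattern are left untouched, which makes it easy to spot description lines in some other format.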

## Example:

Starting with:

>gi|16580628|emb|CAC82173.1| FSH receptor [Podarcis siculus]
>gi|37778925|gb|AAO72730.1| follicle-stimulating hormone receptor [Bothrops jararaca]
>gi|16580628|emb|CAC82173.1| FSH receptor [Podarcis siculus]

Following the two steps results in:

CAC82173.1
AAO72730.1
CAC82173.1

You can then feed this list to Batch Entrez to get back the FASTA entries, if you want. An added bonus is that the duplicates will have been discarded in the process. This means you can collect FASTA entries from many different BLAST queries, combine them, run these two regex steps, and end up with a clean list containing a single instance of each sequence entry.

Final FINAL result after that BONUS step:

CAC82173.1
AAO72730.1
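
If you want the deduplicated list without a round trip through Batch Entrez, a small Python sketch can do it locally. This is an alternative to the Batch Entrez route described above, not what that service does internally, and the function name `unique_accessions` is made up for illustration.

```python
def unique_accessions(accessions):
    """Drop repeated accessions while keeping the first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(accessions))

if __name__ == "__main__":
    step_two_output = "CAC82173.1\nAAO72730.1\nCAC82173.1"
    for accession in unique_accessions(step_two_output.splitlines()):
        print(accession)
```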

fomightez/get accession numbers regex.md