@fomightez
Last active June 5, 2018 17:29
Regex to make unique id with genus species in modified fasta entries

NOTE: These FASTA entries were first run through my namerv.1.py Python program, which puts the scientific name at the start of each header in place of the many codes that come back in the versions from Batch Entrez.

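WAIT! This doesn't quite work. For example, it fails on every header like '>Pichia kudriavzevii |gi|695112010|gb|KGK38559.1|' and '>Colletotrichum higginsianum |gi|380481846|emb|CCF41606.1|'. In those headers the species name is followed directly by '|gi|' with no strain designation, so the third (\w+) in the pattern has nothing to match and the header is left unchanged. NEEDS PERFECTING.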
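
One possible way to cover headers like the two above is to make the third word optional. The following is only a sketch in Python, not a tested replacement for the editor regex: the names HEADER and shorten_header are made up for illustration, and it assumes every header starts with '>' followed by a genus and a species separated by single spaces.

```python
import re

# Genus, species, and an OPTIONAL third word (e.g. a strain name).
# Anchored at the start of the line so sequence lines are never touched.
HEADER = re.compile(r"^(>\w)\w+ (\w+)(?: (\w+))?")

def shorten_header(line):
    """Return the line with 'Genus species [strain]' shortened to 'G.species[_strain]'."""
    def _sub(m):
        short = f"{m.group(1)}.{m.group(2)}"
        if m.group(3):  # only append the third word when one was present
            short += f"_{m.group(3)}"
        return short
    return HEADER.sub(_sub, line, count=1)

if __name__ == "__main__":
    print(shorten_header(">Pichia kudriavzevii |gi|695112010|gb|KGK38559.1|"))
    print(shorten_header(">Saccharomyces cerevisiae S288c |gi|6323174|ref|NP_013246.1| Rmp1p"))
```

Note that without a strain word the shortened id is just the genus initial plus the species, so two entries from the same species would no longer be unique; appending part of the accession would be one way around that.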

FIND:

(>\w)\w+ (\w+) (\w+)

REPLACE:

\1.\2_\3
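
The find/replace pair above can also be tried outside a text editor. This is a minimal sketch using Python's re.sub with the same pattern and replacement; the names FIND, REPLACE, and rename_headers are illustrative only, and the FASTA text is assumed to be already loaded into a string.

```python
import re

FIND = r"(>\w)\w+ (\w+) (\w+)"   # '>' + genus initial, rest of genus, species, third word
REPLACE = r"\1.\2_\3"            # '>G' + '.' + species + '_' + third word

def rename_headers(fasta_text):
    """Apply the find/replace pair to every matching header in the text."""
    return re.sub(FIND, REPLACE, fasta_text)

if __name__ == "__main__":
    sample = (
        ">Ashbya gossypii ATCC 10895 |gi|45200937|ref|NP_986507.1| AGL160Wp\n"
        "MSDKALRAGEDGTEIRNALRSLQQELRVIHILYHRNKNQHRVATWWKQLNSLKRSVSQVVTVTSKPVRTE\n"
    )
    print(rename_headers(sample))
```

Headers that lack a third word directly after the species do not match the pattern and pass through unchanged.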

EXAMPLE:

INPUT:

>Saccharomyces cerevisiae S288c |gi|6323174|ref|NP_013246.1| Rmp1p [Saccharomyces cerevisiae S288c]
MDEMDNVIRSLEQEYRLILLLNHRNKNQHRAASWYGSFNEMKRNCGQIITLFSSRRLQAKRLKDVEWVKL
HRLLQRALFRQLKRWYWQFNGVIALGQFVTLGCTLVTLLANVRALYMRLWEINETEFIRCGCLIKNLPRT
KAKSVVNDVEELGEIIDEDIGNNVQENELVITSIPKPLTENCKKKKKRKKKNKSAIDGIFG
>Schizosaccharomyces pombe 972h- |gi|19115290|ref|NP_594378.1| ribonuclease MRP complex subunit (predicted) [Schizosaccharomyces pombe 972h-]
MQELQYDVVLLQKIVYRNRNQHRLSVWWRHVRMLLRRLKQSLDGNEKAKIAILEQLPKSYFYFTNLIAHG
QYPALGLVLLGILARVWFVMGGIEYEAKIQSEIVFSQKEQKKLELQSQDDIDTGTVVARDELLATEPISL
SINPASTSYEKLTVSSPNSFLKNQDESLFLSSSPITVSQGTKRKSKNSNSTVKKKKKRARKGRDEIDDIF
G
>Ashbya gossypii ATCC 10895 |gi|45200937|ref|NP_986507.1| AGL160Wp [Ashbya gossypii ATCC 10895]
MSDKALRAGEDGTEIRNALRSLQQELRVIHILYHRNKNQHRVATWWKQLNSLKRSVSQVVTVTSKPVRTE
ADLEALAGLLRRFAVRQAPAMYYEFNGVIALGQFVTLGVVLVAALARVWALYGQLREALGLLPVRAAQAE
RECDVAPTEEIGEEVAVAVAASPPGAAALPGGKRIKKKSKSKRSAIDDIFG

OUTPUT:

>S.cerevisiae_S288c |gi|6323174|ref|NP_013246.1| Rmp1p [Saccharomyces cerevisiae S288c]
MDEMDNVIRSLEQEYRLILLLNHRNKNQHRAASWYGSFNEMKRNCGQIITLFSSRRLQAKRLKDVEWVKL
HRLLQRALFRQLKRWYWQFNGVIALGQFVTLGCTLVTLLANVRALYMRLWEINETEFIRCGCLIKNLPRT
KAKSVVNDVEELGEIIDEDIGNNVQENELVITSIPKPLTENCKKKKKRKKKNKSAIDGIFG
>S.pombe_972h- |gi|19115290|ref|NP_594378.1| ribonuclease MRP complex subunit (predicted) [Schizosaccharomyces pombe 972h-]
MQELQYDVVLLQKIVYRNRNQHRLSVWWRHVRMLLRRLKQSLDGNEKAKIAILEQLPKSYFYFTNLIAHG
QYPALGLVLLGILARVWFVMGGIEYEAKIQSEIVFSQKEQKKLELQSQDDIDTGTVVARDELLATEPISL
SINPASTSYEKLTVSSPNSFLKNQDESLFLSSSPITVSQGTKRKSKNSNSTVKKKKKRARKGRDEIDDIF
G
>A.gossypii_ATCC 10895 |gi|45200937|ref|NP_986507.1| AGL160Wp [Ashbya gossypii ATCC 10895]
MSDKALRAGEDGTEIRNALRSLQQELRVIHILYHRNKNQHRVATWWKQLNSLKRSVSQVVTVTSKPVRTE
ADLEALAGLLRRFAVRQAPAMYYEFNGVIALGQFVTLGVVLVAALARVWALYGQLREALGLLPVRAAQAE
RECDVAPTEEIGEEVAVAVAASPPGAAALPGGKRIKKKSKSKRSAIDDIFG