REGEX for replacing SGD FASTA description lines with chromosome names
Recreating the steps probably used in the process described in the ChIP-Seq example at the NUCwave site:
S. cerevisiae reference genome was downloaded from SGD and FASTA headers for chromosome names were replaced with chrI-chrXVI.
FIND:
>.*chromosome=(\w+)\]
REPLACE:
>chr\1
ALSO TRY WITH $ at the right-hand end of the FIND pattern. Sublime Text matches with it, but other flavors of regular expressions, such as the one at Regular Expressions 101, didn't like this. (The g global modifier also needs to be on at Regular Expressions 101 to see the same matches as in Sublime Text.) I think Regular Expressions 101 regards $ as the end of the string and not the end of the line like Sublime Text does; if so, turning on the m (multiline) modifier there should make $ match at the end of each line.
The text after chromosome= (up to the closing bracket) is what is being captured and used as \1 in the REPLACE.
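A minimal Python sketch of the $ behavior described above, not part of the original Sublime Text workflow. It assumes Python's re module, where $ without the MULTILINE flag only matches at the end of the whole string (similar to what Regular Expressions 101 seemed to do), and uses a shortened, made-up two-record sample.

```python
import re

# Shortened sample: two SGD-style header lines, each followed by one sequence line.
sample = (
    ">ref|NC_001133| [org=Saccharomyces cerevisiae] [strain=S288C] [moltype=genomic] [chromosome=I]\n"
    "CCACACCACACC\n"
    ">ref|NC_001134| [org=Saccharomyces cerevisiae] [strain=S288C] [moltype=genomic] [chromosome=II]\n"
    "AAATAGCCCTCA\n"
)

# The FIND pattern from above with $ added at the right-hand end.
pattern = r">.*chromosome=(\w+)\]$"

# Without MULTILINE, $ only matches at the end of the whole string,
# so neither header line matches and the text comes back unchanged.
no_flag = re.sub(pattern, r">chr\1", sample)

# With MULTILINE, $ matches at the end of every line, so both headers
# are replaced; \1 is the captured text after "chromosome=".
with_flag = re.sub(pattern, r">chr\1", sample, flags=re.MULTILINE)

print(with_flag)
```

re.sub replaces every match by default, so no separate global modifier is needed here.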
They don't mention it in the description, but they also changed the mitochondrion description line to be very succinct. That header has no chromosome= field, so the mitochondrion will have to be done separately.
FIND:
>.*mitochondrion.*
REPLACE:
>chrmt
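Both FIND/REPLACE steps above can be sketched together in Python for anyone who wants to script this instead of using a text editor. This is only an illustration under the assumption that the headers look like the SGD ones in these notes; the function name rename_sgd_headers is made up for the example.

```python
import re

def rename_sgd_headers(fasta_text):
    """Apply the two FIND/REPLACE steps from these notes to FASTA text."""
    # Step 1: nuclear chromosomes, e.g. "[chromosome=II]" becomes ">chrII".
    text = re.sub(r">.*chromosome=(\w+)\]", r">chr\1", fasta_text)
    # Step 2: the mitochondrial header has no "chromosome=" field,
    # so it is matched on the word "mitochondrion" instead.
    text = re.sub(r">.*mitochondrion.*", ">chrmt", text)
    return text

# Shortened sample records (sequence lines truncated for the example).
example = (
    ">ref|NC_001133| [org=Saccharomyces cerevisiae] [strain=S288C] [moltype=genomic] [chromosome=I]\n"
    "CCACACCACACC\n"
    ">ref|NC_001224| [org=Saccharomyces cerevisiae] [strain=S288C] [moltype=genomic] [location=mitochondrion] [top=circular]\n"
    "TTCATAATTAAT\n"
)

renamed = rename_sgd_headers(example)
print(renamed)
```

Since . does not match a newline by default in Python, each pattern stays within a single header line and the sequence lines are left untouched.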
##EXAMPLE:
###INPUT:
>ref|NC_001133| [org=Saccharomyces cerevisiae] [strain=S288C] [moltype=genomic] [chromosome=I]
CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACCCACACACACA
CATCCTAACACTACCCTAACACAGCCCTAATCTAACCCTGGCCAACCTGTCTCTCAACTT
ACCCTCCATTACCC.....
>ref|NC_001134| [org=Saccharomyces cerevisiae] [strain=S288C] [moltype=genomic] [chromosome=II]
AAATAGCCCTCATGTACGTCTCCTCCAAGCCCTGTTGTCTCTTACCCGGATGTTCAACCA
AAAGCTACTTACTACCTTTATTTTATGTTTACTTTTTATAGGTTGTCTTTTTATCCCACT
TCTTCGCACTTGTCTCTCGCTACTGCCGTGCAACAAACACTAAATCAAAACAATGAAATA
CTACTACATCAAAACGCATTTTCCCTAGAAAAAAAATTTTCTTACAATATACTATACTAC
ACAATACATAATCACTGACTTTCGTAACAACAATTTCCTTCACTCTCCAACTTCTCTGCT
CGAATCTCTACATAGTAATATTATATCAAATCTACCGTCTGGAACATCATC...
>ref|NC_001224| [org=Saccharomyces cerevisiae] [strain=S288C] [moltype=genomic] [location=mitochondrion] [top=circular]
TTCATAATTAATTTTTTATATATATATTATATTATAATATTAATTTATATTATAAAAATA
ATATTTATTATTAAAATATTTATTCTCCTTTCGGGGTTCCGGCTCCCGTGGCCGGGCCCC
GGAATTATTAATTAATAATAAATTATTATTAATAATTATTTATTATTTTATCATTAAAAT
ATATAAATAAAAAATATTAAAAAGATAAAAAAAATAATGTTTATTCTTTATATAAATTAT
ATATATATATATAATTAATTAATTAATTAATTAATTAATAATA...
###FINAL OUTPUT:
>chrI
CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACCCACACACACA
CATCCTAACACTACCCTAACACAGCCCTAATCTAACCCTGGCCAACCTGTCTCTCAACTT
ACCCTCCATTACCC.....
>chrII
AAATAGCCCTCATGTACGTCTCCTCCAAGCCCTGTTGTCTCTTACCCGGATGTTCAACCA
AAAGCTACTTACTACCTTTATTTTATGTTTACTTTTTATAGGTTGTCTTTTTATCCCACT
TCTTCGCACTTGTCTCTCGCTACTGCCGTGCAACAAACACTAAATCAAAACAATGAAATA
CTACTACATCAAAACGCATTTTCCCTAGAAAAAAAATTTTCTTACAATATACTATACTAC
ACAATACATAATCACTGACTTTCGTAACAACAATTTCCTTCACTCTCCAACTTCTCTGCT
CGAATCTCTACATAGTAATATTATATCAAATCTACCGTCTGGAACATCATC...
>chrmt
TTCATAATTAATTTTTTATATATATATTATATTATAATATTAATTTATATTATAAAAATA
ATATTTATTATTAAAATATTTATTCTCCTTTCGGGGTTCCGGCTCCCGTGGCCGGGCCCC
GGAATTATTAATTAATAATAAATTATTATTAATAATTATTTATTATTTTATCATTAAAAT
ATATAAATAAAAAATATTAAAAAGATAAAAAAAATAATGTTTATTCTTTATATAAATTAT
ATATATATATATAATTAATTAATTAATTAATTAATTAATAATA...