Last active
April 22, 2016 20:31
-
-
Save aseetharam/83530ca3644b78e32dabac7950fd09ed to your computer and use it in GitHub Desktop.
to extract the sequence of interest
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| ## OPTION 1 | |
| # convert the gzipped fastq file to a single line sequence files | |
| zcat input.fastq.gz | sed -n '2~4p' > single_line_sequences.txt | |
| # you are aksing it to print 2nd line followed every 4th line after that. | |
| perl -ne 'while ($_ =~ m/GTGTTCCCCGCGCCAGCGGGGATAAACC([ATCG]{32})/g) {print $1."\t"} {print "\n"}' single_line_sequences.txt | |
| # here, you are printing the 32 bases after the matching string using perl and using tab as delimiter. | |
| # the input is the above file you created in the first step. | |
| # this will generate the output for the first part. | |
| ## OPTION 2 | |
| # if you prefer only bash then: | |
| grep -one "GTGTTCCCCGCGCCAGCGGGGATAAACC.\{32\}" single_line_sequences.txt | \ | |
| sed 's/:GTGTTCCCCGCGCCAGCGGGGATAAACC/\t/g' | \ | |
| awk -F"\t" '{a[$1]=a[$1]"\t"$2;} END{for (x in a) print x "\t" substr(a[x],2);}' | |
| # does exactly as before, but it is little slower | |
| # first step uses grep to "o"nly print the matches, with line "n"umbers using r"e"gular expressions. The matching part is crispr | |
| # followed by 32 letters | |
| # Next step removes the crispr sequence and pritns only the line numbers and spacers | |
| # awk (complicated) will merge all the lines with same numbers using tab as delimiters | |
| #example: | |
| $ cat test.txt | |
| CCGTAATTAATAATAGGTTATGTTTAGAGTGTTCCCCGCGCCAGCGGGGATAAACCGGCAAAAACCGGGCAATCGCAA | |
| CCGTAATTAATAATAGGTTATGTTTAGAGTGTTCCCCGCGCCAGCGGGGATAAACCGATAATGCTGTATGATAAAAAAATGCTGGATAGGTGTTCCCCGCGCCAGCGGGGATAAACCGGCAAAAACCGGGCAATCGCAA | |
| CCGTAATTAATAATAGGTTATGTTTAGAGTGTTCCCCGCGCCAGCGGGGATAAACCGGTTTTGCGCCATTCGATGGTGTCCGGGATCTCGTGTTCCCCGCGCCAGCGGGGATAAACCGGACAAGTTTTGGTGACTGCGCTCCTCCAAGCCGTGTTCCCCGCGCCAGCGGGGATAAACCGGCAAAAACCGGGCAATCGCAA | |
| # OPTION 1 | |
| $ time perl -ne 'while ($_ =~ m/GTGTTCCCCGCGCCAGCGGGGATAAACC([ATCG]{32})/g) {print $1."\t"} {print "\n"}' test.txt | |
| GATAATGCTGTATGATAAAAAAATGCTGGATA | |
| GGTTTTGCGCCATTCGATGGTGTCCGGGATCT GGACAAGTTTTGGTGACTGCGCTCCTCCAAGC | |
| real 0m0.023s | |
| user 0m0.005s | |
| sys 0m0.004s | |
| #OPTION 2 | |
| $ time grep -one "GTGTTCCCCGCGCCAGCGGGGATAAACC.\{32\}" test.txt | sed 's/:GTGTTCCCCGCGCCAGCGGGGATAAACC/\t/g' | awk -F"\t" '{a[$1]=a[$1]"\t"$2;} END{for (x in a) print x "\t" substr(a[x],2);}' | |
| 2 GATAATGCTGTATGATAAAAAAATGCTGGATA | |
| 3 GGTTTTGCGCCATTCGATGGTGTCCGGGATCT GGACAAGTTTTGGTGACTGCGCTCCTCCAAGC | |
| real 0m0.033s | |
| user 0m0.011s | |
| sys 0m0.018s | |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment