Skip to content

Instantly share code, notes, and snippets.

@aseetharam
Last active April 22, 2016 20:31
Show Gist options
  • Save aseetharam/83530ca3644b78e32dabac7950fd09ed to your computer and use it in GitHub Desktop.
Save aseetharam/83530ca3644b78e32dabac7950fd09ed to your computer and use it in GitHub Desktop.
to extract the sequence of interest
## OPTION 1
# convert the gzipped fastq file to a single line sequence files
zcat input.fastq.gz | sed -n '2~4p' > single_line_sequences.txt
# you are aksing it to print 2nd line followed every 4th line after that.
perl -ne 'while ($_ =~ m/GTGTTCCCCGCGCCAGCGGGGATAAACC([ATCG]{32})/g) {print $1."\t"} {print "\n"}' single_line_sequences.txt
# here, you are printing the 32 bases after the matching string using perl and using tab as delimiter.
# the input is the above file you created in the first step.
# this will generate the output for the first part.
## OPTION 2
# if you prefer only bash then:
grep -one "GTGTTCCCCGCGCCAGCGGGGATAAACC.\{32\}" single_line_sequences.txt | \
sed 's/:GTGTTCCCCGCGCCAGCGGGGATAAACC/\t/g' | \
awk -F"\t" '{a[$1]=a[$1]"\t"$2;} END{for (x in a) print x "\t" substr(a[x],2);}'
# does exactly as before, but it is little slower
# first step uses grep to "o"nly print the matches, with line "n"umbers using r"e"gular expressions. The matching part is crispr
# followed by 32 letters
# Next step removes the crispr sequence and pritns only the line numbers and spacers
# awk (complicated) will merge all the lines with same numbers using tab as delimiters
#example:
$ cat test.txt
CCGTAATTAATAATAGGTTATGTTTAGAGTGTTCCCCGCGCCAGCGGGGATAAACCGGCAAAAACCGGGCAATCGCAA
CCGTAATTAATAATAGGTTATGTTTAGAGTGTTCCCCGCGCCAGCGGGGATAAACCGATAATGCTGTATGATAAAAAAATGCTGGATAGGTGTTCCCCGCGCCAGCGGGGATAAACCGGCAAAAACCGGGCAATCGCAA
CCGTAATTAATAATAGGTTATGTTTAGAGTGTTCCCCGCGCCAGCGGGGATAAACCGGTTTTGCGCCATTCGATGGTGTCCGGGATCTCGTGTTCCCCGCGCCAGCGGGGATAAACCGGACAAGTTTTGGTGACTGCGCTCCTCCAAGCCGTGTTCCCCGCGCCAGCGGGGATAAACCGGCAAAAACCGGGCAATCGCAA
# OPTION 1
$ time perl -ne 'while ($_ =~ m/GTGTTCCCCGCGCCAGCGGGGATAAACC([ATCG]{32})/g) {print $1."\t"} {print "\n"}' test.txt
GATAATGCTGTATGATAAAAAAATGCTGGATA
GGTTTTGCGCCATTCGATGGTGTCCGGGATCT GGACAAGTTTTGGTGACTGCGCTCCTCCAAGC
real 0m0.023s
user 0m0.005s
sys 0m0.004s
#OPTION 2
$ time grep -one "GTGTTCCCCGCGCCAGCGGGGATAAACC.\{32\}" test.txt | sed 's/:GTGTTCCCCGCGCCAGCGGGGATAAACC/\t/g' | awk -F"\t" '{a[$1]=a[$1]"\t"$2;} END{for (x in a) print x "\t" substr(a[x],2);}'
2 GATAATGCTGTATGATAAAAAAATGCTGGATA
3 GGTTTTGCGCCATTCGATGGTGTCCGGGATCT GGACAAGTTTTGGTGACTGCGCTCCTCCAAGC
real 0m0.033s
user 0m0.011s
sys 0m0.018s
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment