Files are downloaded from the MGnify resource and then processed using the calc.pl and checksum_checker.pl scripts. We also use the Unix commands awk and sort to reformat the input into a sanitised form.
- Process FASTA files
  - Read two lines at a time
  - Extract ID
  - Extract seq
  - Calculate MD5 and TRUNC512
  - Write to file as "ID MD5 TRUNC512 SEQ" (all tab separated)
- Find checksum clashes between non-identical sequences
  - Create a tmp dir to use for sorting
  - Choose a checksum, then cut and re-assemble the files into "CHECKSUM SEQ ID"
  - Sort all tab-delimited files by checksum and then seq
  - Pipe into the checksum checker
- Checksum checker
  - Compare the previous line to the current line
  - If the checksum is the same, check the sequence
    - If the sequence is not the same, report a clash of checksums with the previous ID and the current ID
  - Write all clashes to screen as "CHECKSUM ID1 SEQ1 ID2 SEQ2"
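
The FASTA processing step can be sketched in shell as follows. This is a minimal illustration, not the actual calc.pl: the function name `fasta_to_tsv` is hypothetical, and the sketch assumes that every record occupies exactly two lines (header, then sequence), that the ID is the header text up to the first whitespace, and that TRUNC512 is the first 24 bytes (48 hex characters) of the SHA-512 digest. It relies on the `md5sum` and `sha512sum` commands from GNU coreutils.

```shell
# Sketch only; fasta_to_tsv is a hypothetical helper, not part of calc.pl.
# Reads two-line FASTA records on stdin and writes
# "ID MD5 TRUNC512 SEQ" (tab separated) on stdout.
fasta_to_tsv() {
  while IFS= read -r header && IFS= read -r seq; do
    id=${header#>}            # drop the leading '>'
    id=${id%%[[:space:]]*}    # assumption: the ID ends at the first whitespace
    # printf '%s' avoids hashing a trailing newline
    md5=$(printf '%s' "$seq" | md5sum | cut -d' ' -f1)
    # assumption: TRUNC512 = first 24 bytes of SHA-512, i.e. 48 hex characters
    trunc512=$(printf '%s' "$seq" | sha512sum | cut -c1-48)
    printf '%s\t%s\t%s\t%s\n' "$id" "$md5" "$trunc512" "$seq"
  done
}
```

It could be run as `fasta_to_tsv < input.fa > output.tsv`, where both file names are placeholders.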
Example bad data (held in example.bad.data) contains a duplicate checksum for MD5 and TRUNC512 with a non-identical ID. These rows are identified as MGYP000639767355 and BOGUS, the first and last lines of the file respectively. You can pass this through the following command:
$ cat example.bad.data | awk '{print $3,$4,$1}' | sort -k1,1 -k2,2 | ./checksum_checker.pl
And it will emit the following output:
01fe63fd972452286cfc7be81c47ded3e0d5825811eced30 BOGUS MFGKLSILRFVSIFVIYILLVGHAPWGAYQAYRQQHLLVMSTREDAPTYPFSKLLVKII MGYP000639767355 MFGKLSILRFVSIFVIYILLVGHAPWGAYQAYRQQHLLVMSTREDAPTYPFSKLLVKIIN
This indicates that the same TRUNC512 checksum is attached to two non-identical sequences: the MGYP000639767355 sequence carries an extra trailing N.
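
The comparison the checker performs can be sketched with awk. This is not the actual checksum_checker.pl: the function name `find_clashes` is hypothetical, and the sketch assumes whitespace-separated "CHECKSUM SEQ ID" input that has already been sorted by checksum and then by sequence.

```shell
# Sketch only; find_clashes is a hypothetical stand-in for checksum_checker.pl.
# Input:  "CHECKSUM SEQ ID" lines, sorted by checksum and then sequence.
# Output: "CHECKSUM ID1 SEQ1 ID2 SEQ2" for each pair of adjacent lines that
#         share a checksum but differ in sequence.
find_clashes() {
  awk '
    # Appending "" forces string comparison, so an all-digit checksum
    # is never compared as a number.
    NR > 1 && ($1 "") == prev_sum && ($2 "") != prev_seq {
      print $1, prev_id, prev_seq, $3, $2
    }
    # Remember the current line for comparison with the next one.
    { prev_sum = $1 ""; prev_seq = $2 ""; prev_id = $3 }
  '
}
```

Because the input is sorted, lines with the same checksum are adjacent, so comparing each line only with its predecessor is enough and the whole file never has to be held in memory.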