Skip to content

Instantly share code, notes, and snippets.

@andrewyatz
Last active October 24, 2018 11:11
Show Gist options
  • Save andrewyatz/d9762ec8ef4106ded3305b7d0b137263 to your computer and use it in GitHub Desktop.
Save andrewyatz/d9762ec8ef4106ded3305b7d0b137263 to your computer and use it in GitHub Desktop.
Creates a file of TRUNC512, MD5, ID and sequence from a Gzip compressed FASTA where line 1 is the ID line and line 2 is the sequence

Files

Files are downloaded from the MGnify resource and then processed using the calc.pl and checksum_checker.pl scripts. We also use the Unix commands awk and sort to format input into a santised format

Algorithm

  • Process FASTA files
    • Read two lines at a time
    • Extract ID
    • Extract seq
    • Calculate MD5 and TRUNC512
    • Write to file as "ID MD5 TRUNC512 SEQ" (all tab separated)
  • Find checksum clashes non-identical sequence
    • Create a tmp dir to use for sorting
    • Choose a checksum, cut and re-assemble files into "CHECKSUM SEQ ID"
    • Sort all tab delim files by checksum and then seq
    • Pipe into checksum checker
  • Checksum checker
    • Compare previous line to current line
    • If checksum is the same check sequence
    • If sequence not the same then report clash of checksums with previous ID and current ID
    • Write all clashes to screen as "CHECKSUM ID1 SEQ1 ID2 SEQ2"

Testing

Example bad data (held in example.bad.data) contains a duplicate checksum for MD5 and TRUNC512 with a non-identical ID. These rows are identified as MGYP000639767355 and BOGUS; the first and last lines of the file respectively. You can pass this through the following command:

$ cat example.bad.data | awk '{print $1,$4,$3}' | sort -k1,1 -k2,2 | ./checksum_checker.pl

And it will emit the following output:

01fe63fd972452286cfc7be81c47ded3e0d5825811eced30	BOGUS	MFGKLSILRFVSIFVIYILLVGHAPWGAYQAYRQQHLLVMSTREDAPTYPFSKLLVKII	MGYP000639767355	MFGKLSILRFVSIFVIYILLVGHAPWGAYQAYRQQHLLVMSTREDAPTYPFSKLLVKIIN

Indicating there is an identical checksum for TRUNC512 with non-identical sequences.

#!/usr/bin/env perl
use strict;
use warnings;
use Digest::MD5 qw/md5_hex/;
use Digest::SHA qw/sha512_hex/;
my ($file, $output) = @ARGV;
open my $fh, "zcat ${file}|" or die "Cannot open $file for zcat reading: $!";
my $fh;
if($output && $output eq '-') {
$fh = \*STDOUT;
}
if(defined !$fh) {
$output = "${file}.processed" if ! $output;
open $out, '>', or die "Cannot open ${file}.processed for writing: $!";
}
else {
$output = "${file}.processed";
}
while(my $header = <$fh>) {
chomp($header);
my $sequence = <$fh>;
chomp($sequence);
$header =~ s/>(.+?\s).+/$1/;
my $sha = sha512_hex($sequence);
my $trunc512 = substr($sha, 0, 48);
my $md5 = md5_hex($sequence);
print $out "${trunc512}\t${md5}\t${header}\t${sequence}\n";
}
close $fh;
#!/bin/bash
if [ -f ${1}.processed ]; then
rm ${1}.processed
fi
zcat $1 | while read -r ONE; do
read -r TWO;
trunc512=$(echo -n $TWO | sha512sum | cut -c 1-48);
md5=$(echo -n $TWO | md5sum | awk '{print $1}')
id=$(echo $ONE | perl -pe 's/>(.+?\s).+/$1/');
echo -e "${trunc512}\t${md5}\t${id}\t${TWO}" >> ${1}.processed
done
#!/usr/bin/env perl
# Input (STDIN): CHECKSUM SEQ ID
# Input must be sorted by CHECKSUM, then SEQ
# Outputs (STDOUT): CLASHING_CHECKSUM ID1 SEQ1 ID2 SEQ2
use strict;
use warnings;
my ($previous_checksum, $previous_seq, $previous_id) = (q{},q{},q{});
while(my $line = <STDIN>) {
my ($checksum, $seq, $id) = split(/\s/, $line);
# Only check if we had a previous checksum
if($previous_checksum) {
# Check checksums are the same
if($previous_checksum eq $checksum) {
# Print out if there was a mismatch
if($previous_seq ne $seq) {
print "${checksum}\t${previous_id}\t${previous_seq}\t${id}\t${seq}\n";
}
}
}
# Copy current row into holding varibles
($previous_checksum, $previous_seq, $previous_id) = ($checksum, $seq, $id);
}
01fe63fd972452286cfc7be81c47ded3e0d5825811eced30 c7bdbb444a7597eb861d531b43780d23 MGYP000639767355 MFGKLSILRFVSIFVIYILLVGHAPWGAYQAYRQQHLLVMSTREDAPTYPFSKLLVKIIN
4df30e7140edb20ef444d3e4824e1482f5ce6644318b09de c089de3a42c8d1feb3e128a684a9388f MGYP000636998339 MKTVLLHGLGQTAKDWATIAKQLSSDADCPELFALAQSGLSYPQILAGLEKRYADTAEPLRICGLSLGGMLALDYAIRHGEKV
59e80ba14ab4a877f9ba95a429b460d5e03e117f361cc393 6e7a3004ff7c8f32584da646cadb822b MGYP000639305754 MNFAAWRLASVAFKQRDLENGGKVFYSIKLTMEMILLKIAIGLKRNVLSGYILSENFSSCVTYSDFFVMIDSE
605c765863ef3dfca60b6196daf3f8196bbc74c1759db62f 14f9872b9a47ea8c8fc113a40d12075e MGYP000637651890 VLQMNFRASTGYGRKFTELGYKQWGQTMQNDITDGVEWLIKKGIADPKRVAIYGGSYGGYATLAGGTFTPGLYACAIDYVGVSNLFTFMQTIPPYWKPLLDMMYEMVGDPVKDKEMMEKYSPVFHVDQIKAPLFIAQGANDPRVNKAESDQMVEALKKRGIEVEYMVKDNEGHGFHNEENKFDFYRAMEKFLDAHLKK
b7bc66f0e06568596d072ddf725d32a49e4eba222a911d01 c1f2acfcdfebfef4fb3607dc8395409a MGYP000638483824 PISSKKKKLTKEIAKELLDKGWCRVTGLYTPKKPQLYDAVIRLDDSGGKYVSFKMEFDR
886e5a31e46a9a66bde97ae06b17213ecc2f2de005bbef70 23b279dcf1554ec878b78c0f6d716d15 MGYP000637557897 MKLDTDIKYLKGVGERRAAMLSRLGVSDVNALVRLYPRVYEDWSRIKSINEAQIGEICCIKGIVGSPVRKNLIRKGLTLYKTEITDGSGIMGITIFNSRFAAEKLTEGDEFLFFGRVGGNLYRKEMNSPEIEPAEGADRIRPIYPQTHGLNSKMIEKLVRTALTECRDELVDPIPPWLREKYCLMNLPDSLWNIHFPKSPDYLEEARRRLIFEELLILQLGLEKMRSQTQKNAGAIIERDFSEEYFSHLPFSPTGAQRRAVKEAMRDMMSGRQMNRLLQGDVGSGKTAVAAALVYSAAKNSMQSALMAPTEVLAEQHYKTFLKLFEGCSINVELLTGSDTAAQKRRKKEALKAGEIDLLIGTHAIIQSDVEFKSLALVITDEQHRFGVEQRNALGEKGENPHLYVMSATPIPRTLALIIYGELDISILDELPPGRQKIETYAVTSELRQRAYNYVKKHLDAGRQGYIICPLVDEGESDTELASAVKYADELQRGDFRGYTVGLMHGKMRSADKKKVMESFSNGETQLLVSTTVIEVGVDVPNAVIMVIENAERFGLSQLHQLRGRIGRGQYKSTCILITDAKNDTAQRRMKVMETTTDGFKIADEDLKLRGPGEFFGSRQHGLPEMKIADMLEDRSTLEETQRAAKEIVARDPELSSPESTALKNEIQRLFDAVGSAGMN
dd416602e04d44ea20d7713a3ecad23599e91cd50426631b 07b51ccac84c92c787f6c36851e49284 MGYP000638734167 MNRWQRRWKCDEVVYSELSLVEWEEFVQSRPWAALVNELMERDKYITDCLRKGDKDWSDAEMRARLDELEFVLTTPSSIIEELQLITKNNLKGELIGKTYGVGIFGFNKYTYRFIHGI
79f976449665813e79e55c935e75e7082b97946d6d9d24af 95896ea6508a633c72b737e7f1f61402 MGYP000638603629 MADSGITLRELDARPVTDLNGVGDARARALAAAEVHSILDLLGFYPRRYLDRTREATVGGLSIGDEGMVLVRVGTVSSRRTRNRRSLVEARVSDDTGELRLTFFNQPWRADQLKVG
b75f6d0f4ef1970504b35fb412cc81f9e93a7363600959b1 9429eced5d4b16c0f33ff03391e4a41e MGYP000639615096 GGSTVDALTGEGASLATSFFSGKIVMDGAIKGGSKILSRFMTPLVKTDSWRKLGQPVIDQVAAPVTQMLKYAGPELGIAFGAETDMVTLISPFDVDPNSTDAEQVLAARMNIFVDSLLISGILETAAKSGLKLIDFVNGASVGAVAKSLLQSDSAKMISAMDNVLTD
70186aed8699826816f1a34393409372286e341435f414e9 463a0c73e1f83371ef4a33638c27c442 MGYP000639067606 MRITYLGHSGFLAETKEHYLLFDYYTGKLPAFLPEKKLTVFVSHSHQDHYNKEIFALQCGADEGNGQEKHAMQIVTEKMQEDHAVQEVPEKKLSERRRIVYVLSKDIRRKQAEELVRKENTLLSVKAHEKYTLYEGQMTVETLRSTDLGVAYIVTVTEDGGKSYRFYHAGDLNLWKWEGESKAYNNNME
7002e58f882241adde38d5daf0fee55dd94d05cbee68dabb eb7fd1c2c1e495e168244839411b3ba4 MGYP000637674570 LNRIIVEDVLLEDRRGREMLKVARLSAKFEILPLLKGKITITSVQLFGFSANMERETPQSPPNYQFVLDAFASKDTVKTQTNLDLRINSLLIRRGRITYDVLSEPETPGRFNTGHLAVKNFAATVSLKALRNDSLNASIRRLSFEEQCGFALQKFSMKVTANNKRLNIKDFSLELSNSALSIDSLSMRYDSLPDLPKLTDNVQYQGSVNASVVLKDIAPFVPALAKFDTPLHFNMDFSGKGKNIDGPTLRLTDNHSLMINGEASVADWNAGRNMYIMGKLTDVNLNKEGISTVMSNVAGKVPPIFQQLEYVRFSGDATGYLHDLIITGLLRTGEGTIKADVMMSIDKQQNRTYSGNIASADLNVGKLLDNEKKFGTADFNLELKGFNYKNHYPESYIKGNIASFEYSQYKYENIALDGVYKDGGFNGKLAMDDDNGSVQINGNFNVACAIPEFNLKAVLKNLRPDNLHLSDKYKDTDISLNVTADFTGKSIDDMNGKIRLDSLQMNTPEGNNYFLDNLTITAGQVEGQKEIQVLSSFMTAVIRGDYSYNTIPASILQTVQRYIPSLLTLNENMPVPHNNFKFDIRVNNTEFFQKLFQIPLELKFPASIKGYFNDREEKLRVEGHFPAFKYNGTHYDSGVLICENPSDWFKCSLRTGMLMNSGAMVNLTVDTQAKNDRLETTVNWGNNTNVTYGGKFAAVTRFYKTEGKHPILQADINIKPTKVVLNDTIWNIHPSHIAVDSGRVYIDNFLFEHENQHLRINGKLTKSKEDSCHIDLKNINLEYVLDIVHFDDVAFNGLVTGTVNMKGVMKEPAMHTKLNVHNFCLNKALLGEADITGVWDSELGGVRLDAKIAEEGISATHVTGYVSPKMKGLDLMIEADSTNIGLIQPFVEGIFSDLTGRVNGHIRLYGDFKHLDLEGEARANIDTKIDVLNTYFQIRDDSVHISSGQLLFTDVHVYDREGNSGVVNGYLHHTKLKNLMYHFNISGDNLLMYDTNEGGDMPFYGKVYGSGSVILDGGNNAMTVDASLTTGHNTSFTYITGITTEATNTQFITFVDKTPKRIHDNVETNLYHHSNVQKEKEEEGPPMDLRINMLINATPNATMKIVMDPVAGDNIT
01fe63fd972452286cfc7be81c47ded3e0d5825811eced30 c7bdbb444a7597eb861d531b43780d23 BOGUS MFGKLSILRFVSIFVIYILLVGHAPWGAYQAYRQQHLLVMSTREDAPTYPFSKLLVKII
#!/bin/sh
# Currently set to June 2018 release of MGnify
release="2018_06"
for n in 1 2 3 4; do
wget ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/${release}/mgy_proteins_${n}.fa.gz
done
for f in mgy_proteins_*.fa.gz; do
perl ./calc.pl $f - | gzip -c > ${f}.processed.gz
done
mkdir -p $PWD/tmpsort
zcat *.processed.gz | awk '{print $1,$4,$3}' | sort -k1,1 -k2,2 -T $PWD/tmpsort | ./checksum_checker.pl > report.trunc512.out
zcat *.processed.gz | awk '{print $2,$4,$3}' | sort -k1,1 -k2,2 -T $PWD/tmpsort | ./checksum_checker.pl > report.md5.out
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment