This is the first of a series of write-ups that demonstrate using Bash tricks to solve an Informatics Problem.
The format is the Overview section briefly describes the Problem, and describes pitfalls and difficulties.
The Method section describes any relevant background and CS theory, note it may be blank.
The Solution section describes how to solve the described problem.
A BAM file is suspected of being corrupted, due to EOF errors in an analysis. The file was re-downloaded, but the same EOF/truncated error appeared at the end of the analysis, and caused the command to fail.
Two steps will be taken to validate the BAM file: re-verify the checksum, and scanning the file in depth.
The former can be done quickly, while the latter will be time and computing intensive. Both will be described here.
Validating a file's checksum is self-explanatory, and uses the md5 command. Keep in mind that BSD and Linux have a different output and syntax for this command, and code may need to be updated appropriately.
samtools is a multipurpose Bioinformatics tool that can manipulate Binary Alignment Map (BAM) files. It will be used here to scan the file for missing EOF markers, which would indicate a truncated/malformed file. Unfortunately, if samtools does not throw the error immediately, this can be easily missed. The solution is to redirect the output, but, samtools may ignore redirects, necessitating the need to use Bash for this task. Also note that more recent versions of samtools may require you to explicitly use a redirect operator, and will output to a file by default.
For validating files using md5 checksums, use the following command:
# mac osx
md5 file
# output
MD5 (file) = ca199a157957e4da315277246abe4175
# Ubuntu 14.04
md5sum file
# output
42c2bdf4bfdee43a14ec5e4c657c76c2 file
# Fancy way of doing it
if [[ $(uname) = 'Darwin' ]] ; then
V1=$(md5 file | cut -d\ -f4)
V2=$(md5 file | cut -d\ -f4)
if [[ "$V1" = "$V2" ]] ; then
echo 'Files match!'
else
echo 'Files do not match!'
fi
else
V1=$(md5sum file | cut -d\ -f1)
V2=$(md5sum file | cut -d\ -f1)
if [[ "$V1" = "$V2" ]] ; then
echo 'Files match!'
else
echo 'Files do not match!'
fi
fi
To redirect samtools output from STDOUT to a file, use the following command:
samtools view someFile.bam > someFile.bam.sam
# the output is an uncompressed text file
# The '>' operator redirects output from STDOUT to a file, note that some versions of samtools
To redirect samtools output and error messages to a file, and potentially use grep to filter the output, use the following command:
{ samtools view someFile.bam 1>&2; } 2>&1 | grep EOF
# Note the command in {} is executed first, and output is redirected to STDERR.
# Then STDERR is redirected to STDOUT, and scanned using grep for an EOF error.
A few final comments:
- The above command to redirect output and error messages will scan the entire BAM file, and will take awhile
- If you run the samtools redirect command as-is, no output file will be created. STDOUT will show the results of the grep command, i.e. nothing for no errors, output for errors.