Skip to content

Instantly share code, notes, and snippets.

@disulfidebond
Last active May 9, 2019 15:59
Show Gist options
  • Save disulfidebond/e376a617afb999815111ed887a353601 to your computer and use it in GitHub Desktop.
Save disulfidebond/e376a617afb999815111ed887a353601 to your computer and use it in GitHub Desktop.
Tips and Tricks using Bash to solve Informatics problems

Overview

This is the first of a series of write-ups that demonstrate using Bash tricks to solve an Informatics Problem.
The format is the Overview section briefly describes the Problem, and describes pitfalls and difficulties. The Method section describes any relevant background and CS theory, note it may be blank. The Solution section describes how to solve the described problem.

A BAM file is suspected of being corrupted, due to EOF errors in an analysis. The file was re-downloaded, but the same EOF/truncated error appeared at the end of the analysis, and caused the command to fail. Two steps will be taken to validate the BAM file: re-verify the checksum, and scanning the file in depth.
The former can be done quickly, while the latter will be time and computing intensive. Both will be described here.

Methods

Validating a file's checksum is self-explanatory, and uses the md5 command. Keep in mind that BSD and Linux have a different output and syntax for this command, and code may need to be updated appropriately.

samtools is a multipurpose Bioinformatics tool that can manipulate Binary Alignment Map (BAM) files. It will be used here to scan the file for missing EOF markers, which would indicate a truncated/malformed file. Unfortunately, if samtools does not throw the error immediately, this can be easily missed. The solution is to redirect the output, but, samtools may ignore redirects, necessitating the need to use Bash for this task. Also note that more recent versions of samtools may require you to explicitly use a redirect operator, and will output to a file by default.

Solution

For validating files using md5 checksums, use the following command:

    # mac osx
    md5 file
    # output
    MD5 (file) = ca199a157957e4da315277246abe4175
    
    # Ubuntu 14.04
    md5sum file
    # output
    42c2bdf4bfdee43a14ec5e4c657c76c2  file
    
    # Fancy way of doing it
    if [[ $(uname) = 'Darwin' ]] ; then 
      V1=$(md5 file | cut -d\  -f4)
      V2=$(md5 file | cut -d\  -f4)
      if [[ "$V1" = "$V2" ]] ; then 
        echo 'Files match!'
      else
        echo 'Files do not match!'
      fi
    else
      V1=$(md5sum file | cut -d\  -f1)
      V2=$(md5sum file | cut -d\  -f1)
      if [[ "$V1" = "$V2" ]] ; then 
        echo 'Files match!'
      else
        echo 'Files do not match!'
      fi
    fi

To redirect samtools output from STDOUT to a file, use the following command:

    samtools view someFile.bam > someFile.bam.sam 
    # the output is an uncompressed text file
    # The '>' operator redirects output from STDOUT to a file, note that some versions of samtools 

To redirect samtools output and error messages to a file, and potentially use grep to filter the output, use the following command:

    { samtools view someFile.bam 1>&2; } 2>&1 | grep EOF
    # Note the command in {} is executed first, and output is redirected to STDERR. 
    # Then STDERR is redirected to STDOUT, and scanned using grep for an EOF error.

A few final comments:

  • The above command to redirect output and error messages will scan the entire BAM file, and will take awhile
  • If you run the samtools redirect command as-is, no output file will be created. STDOUT will show the results of the grep command, i.e. nothing for no errors, output for errors.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment