@ernstki
Last active June 16, 2025 20:46
How to create (and verify) checksums for NGS data

Hello, collaborators!

If you have the ability to do so, please compute SHA1 hashes for any files you intend to share with the Weirauch Lab. These hashes, or "checksums," are critical for verifying that we've received your data 100% intact and unmodified.

You will need to save these checksums into a file, then transmit or store that file alongside the original data. Instructions for macOS and Linux are directly below. For Windows, please follow the instructions in the "Generating checksums on Windows" section.

Generating checksums on Unix

Here is a one-liner that will do the job on a typical Unix/Linux system. macOS is also Unix, but if you are not familiar with the command line, see the section "Extra guidance for Mac users," below. If you use Windows and have Git Bash or Cygwin available, this method should also work for you.

sha1sum * > SHA1SUMS

Failing that, this should work on macOS, or any Unix with the Digest::SHA Perl module:

shasum -a1 * > SHA1SUMS

If you're writing a shell script and want to account for both possibilities, see the "Additional notes" section.
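
For reference, each line of the resulting SHA1SUMS file holds the 40-character hex digest, two spaces, then the filename. The sketch below shows this in a scratch directory. The directory and the file names are made up for illustration, and it assumes sha1sum is installed (substitute shasum -a1 on macOS).

```shell
# Scratch directory and file names are invented for this demo.
workdir=$(mktemp -d)
(
  cd "$workdir" || exit 1
  : > empty.fastq                 # zero-byte file; its SHA1 is well known
  printf 'ACGT\n' > reads.fastq
  sha1sum * > SHA1SUMS
)

# Each line has the form: <40 hex digits><two spaces><filename>
empty_line=$(grep ' empty.fastq$' "$workdir/SHA1SUMS")
echo "$empty_line"
# → da39a3ee5e6b4b0d3255bfef95601890afd80709  empty.fastq
```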

Computing checksums for files in multiple subdirectories

If you have multiple samples in separate subdirectories (say, a collection of experiments and their associated .fastq.gz files), you can use find to locate all files matching a pattern, then run shasum / sha1sum on each of those files in sequence. See the next section for a parallel method.

The command below will generate a SHA1SUMS file in the parent directory, for all files matched by the find pattern:

find * -name "*.fastq.gz" -exec sha1sum {} \; | tee SHA1SUMS

Piping through tee just allows you to monitor the progress.
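
A possible variant, sketched under assumptions rather than taken from the instructions above: adding -type f restricts matches to regular files, and ending -exec with + instead of \; hands many filenames to each sha1sum run rather than starting one process per file. The folder names exp1 and exp2 are hypothetical.

```shell
# Build a throwaway tree; 'exp1' and 'exp2' are hypothetical sample folders.
tree=$(mktemp -d)
mkdir -p "$tree/exp1" "$tree/exp2"
: > "$tree/exp1/a.fastq.gz"
: > "$tree/exp2/b.fastq.gz"
: > "$tree/exp2/notes.txt"        # not matched by the pattern

(
  cd "$tree" || exit 1
  # '+' batches filenames into as few sha1sum invocations as possible
  find * -type f -name '*.fastq.gz' -exec sha1sum {} + | tee SHA1SUMS
)

# Count the entries that were written
n_sums=$(wc -l < "$tree/SHA1SUMS" | tr -d ' ')
echo "$n_sums checksums written"
```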

Computing checksums in parallel

If you have GNU Parallel available on your end, this command will compute checksums in parallel (one task per CPU core) for all .fastq.gz files in the current directory:

find *.fastq.gz | parallel --keep-order sha1sum {} | tee SHA1SUMS

The --keep-order option to parallel just rearranges the output to match the order of the input (filenames from find), since there's no guarantee the output of the sha1sum tasks will show up in that order otherwise. Let me know if you've heard this one… Why'd the parallel computing chicken cross the road? Answer: side. other get To to the

To include all .fastq.gz files including those in subdirectories, do this instead:

find * -name "*.fastq.gz" | parallel --keep-order sha1sum {} | tee SHA1SUMS
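
If GNU Parallel is not installed, xargs with its -P option (a common extension in GNU and BSD xargs, not part of POSIX) is one possible fallback. This sketch is not from the original instructions: the job count of 4 is arbitrary, the file names are invented, and because results arrive in completion order the output is sorted by filename afterwards.

```shell
# Throwaway directory with three empty stand-in files.
par=$(mktemp -d)
: > "$par/s1.fastq.gz"
: > "$par/s2.fastq.gz"
: > "$par/s3.fastq.gz"

(
  cd "$par" || exit 1
  # -print0 / -0 keep unusual filenames intact; -n 1 gives one file per task;
  # -P 4 runs up to four sha1sum tasks at once
  find * -name '*.fastq.gz' -print0 \
    | xargs -0 -n 1 -P 4 sha1sum \
    | sort -k 2 \
    | tee SHA1SUMS
)

# Filenames in the order they ended up in the file
par_names=$(awk '{ print $2 }' "$par/SHA1SUMS" | tr '\n' ' ')
```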

Extra guidance for Mac users

If you have never used the Terminal application on your Mac, it can be found in the "Utilities" folder, inside the "Applications" folder. You can also open Spotlight with ⌘+space and start typing t-e-r-m… until you see its icon appear.

You will follow the instructions from the "Generating checksums on Unix" section above, but first you need to cd (change directory) to the folder containing the files you want to checksum. An easy way to do this is to

  1. in the Terminal app, type cd␣
    • that is cd, then a space
  2. drag-and-drop the icon for the folder containing those files from Finder (the file manager) into the open Terminal window, and finally
    • if you have trouble finding the Terminal window again, drag the folder icon—without letting go—over the Terminal icon in the dock at the bottom, and hold it there for a while until the Terminal window comes to the front, then drop the folder icon on that window
  3. press Enter.

You can run the command pwd (print working directory) to confirm that it worked. It should display the full path to the folder you dragged into the Terminal window.

Now run the commands as directed above, substituting shasum -a1 wherever you see sha1sum. macOS does not have a built-in sha1sum command (nor md5sum), although you can add these yourself after the fact by installing MacPorts' md5sha1sum package.

Generating checksums on Windows

If you have a recent version of Windows 10*, you can use PowerShell to create SHA1 hashes in the format we require on our end. If you are stuck with an older version of Windows, see the "Generating checksums on older versions of Windows" section, below.

First, right-click on the "Start" menu, then choose "Windows PowerShell" from the menu that appears.

Then, cd ("change directory") to the folder containing the files you wish to checksum. An easy way to do this is to

  1. type cd␣ at the command prompt
    • that is cd, then a space
  2. drag-and-drop the icon for the folder containing those files from the Windows file manager into the open PowerShell window
    • if you have trouble finding the PowerShell window again, drag the folder icon—without letting go—over the PowerShell icon in the taskbar at the bottom, and hold it there for a while
  3. finally, press Enter.

Your prompt should now look something like PS X:\Path\To\The\Folder> , where X:\Path\To... is the actual drive letter and path to the folder you dragged into the window.

Next, copy the command (select with the mouse, then press Ctrl+C)

ls -recurse -file -name `
  | foreach { get-filehash $_ -a SHA1 } `
  | foreach { echo $(
         $_.hash.tolower() + "  " +
         $(resolve-path -relative $_.path `
             | foreach { $_ -replace "\\","/" -replace "^\./","" }
          )
       )
     } `
  | tee SHA1SUMS.txt

…and paste it into an open PowerShell terminal window by clicking with the right (context-click) mouse button, or pressing Ctrl+V.

Copy-paste in the Windows terminal is a bit tricky, and requires a steady hand. If you click the wrong mouse button, or find yourself otherwise frustrated, just click on the "control icon" in the upper-left of the PowerShell window and use the "Edit" menu there.

Finally, press Enter to invoke the command, creating the file SHA1SUMS.txt containing the computed checksums. If you end up uploading the analysis data to a different server, please include the SHA1SUMS.txt file along with the rest of the data.

*for Windows 10 prior to the "Creators Update" (version 1703), you'll just need to find and launch "Windows PowerShell" from the Start menu

Other notes

You can use the arrow keys to recall, edit, and re-run this command, if necessary. Be sure not to accidentally delete any of the backticks (a.k.a. grave accents, which are PowerShell's line continuation character) or you'll get an error.

Omit -recurse from the first line if you do not want to descend into subfolders (that is, only checksum the files in the current folder). If you want to compute checksums for specific file types only, add the -include option with a wildcard pattern; for example, -include *.fastq.gz. Multiple patterns may be separated with commas, like this: -include *.vcf,*.bam.

Using tee at the end just lets you see the progress on-screen while simultaneously saving the output to the file SHA1SUMS.txt. Like a "T" in a piping system.

Generating checksums on older versions of Windows

These instructions apply to Windows 7 and 8. You will still need to open a PowerShell terminal, but the commands below differ for lack of the Get-FileHash command present in newer Windowses. You will use certutil instead.

Open the Start menu and start typing p-o-w… until the PowerShell program icon appears. You don't need to run it as administrator, if you see that option there.

In the PowerShell window you'll need to first cd to the folder containing the files you want to compute checksums for, as detailed in the "Generating checksums on Windows" section, above.

Now, copy the command (select with the mouse, then press Ctrl+C)

ls -file -recurse -name `
  | resolve-path -relative `
  | foreach { certutil -hashfile $_ SHA1 } `
  | tee SHA1SUMS.txt

…and paste it into an open PowerShell terminal window by clicking with the right (context-click) mouse button.

Copy-paste in the Windows terminal is a bit tricky, and requires a steady hand. If you click the wrong mouse button, or find yourself otherwise frustrated, just click on the "control icon" in the upper-left of the PowerShell window and use the "Edit" menu there.

Omit -recurse from the first line if you do not want to descend into subfolders (that is, only checksum the files in the current folder). If you want to compute checksums for specific file types only, add the -include option with a wildcard pattern; for example, -include *.fastq.gz. Multiple patterns may be separated with commas, like this: -include *.vcf,*.bam.

Using tee at the end just lets you see the progress on-screen while also saving the output to the file SHA1SUMS.txt.

You can use the arrow keys to edit this command, if necessary, but be sure not to accidentally delete any of the backticks (a.k.a. grave accents, which are PowerShell's line continuation character) or you'll get an error.

Finally, press Enter to invoke the command, creating the file SHA1SUMS.txt containing the computed checksums. If you end up uploading the analysis data to a different server, please include the SHA1SUMS.txt file along with the rest of the data.


Sections that follow are a reference for Weirauch Lab members…

Verifying checksums

Important note: If you receive a SHA1SUMS.txt from a collaborator who created it using the Windows PowerShell commands above, you must first convert it to Unix format using dos2unix. This is because PowerShell's default output encoding is UTF-16 with a BOM, which Unix utilities can't understand; dos2unix will handle the encoding as well as removing the extraneous CR (carriage return) characters for you, updating the file in place.
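
If dos2unix is not at hand, roughly the same conversion can be sketched with iconv and tr. This is an assumption-laden alternative, not part of the workflow described here: to_unix_text is a hypothetical helper name, and the check only looks for a UTF-16 byte order mark in the first two bytes of the file.

```shell
# Hypothetical helper: write a UTF-8, LF-terminated copy of a sums file.
# usage: to_unix_text INPUT OUTPUT
to_unix_text() {
  # First two bytes as hex; 'fffe' or 'feff' indicates a UTF-16 BOM
  bom=$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')
  case $bom in
    fffe|feff) iconv -f UTF-16 -t UTF-8 "$1" | tr -d '\r' > "$2" ;;
    *)         tr -d '\r' < "$1" > "$2" ;;
  esac
}

# Demo: fake a tiny UTF-16LE file ("ab" plus CR+LF) that starts with a BOM
demo=$(mktemp -d)
printf '\377\376a\000b\000\r\000\n\000' > "$demo/SHA1SUMS.txt"
to_unix_text "$demo/SHA1SUMS.txt" "$demo/SHA1SUMS"
converted=$(cat "$demo/SHA1SUMS")
echo "$converted"
```

Unlike dos2unix, this writes a new file and leaves the original untouched.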

To verify the checksums of data transmitted by a collaborator, just give the -c option to sha1sum (or shasum -a1, as appropriate) followed by the name of the checksum file and it will verify checksums of files which exist in the current working directory against entries in the file. The file SHA1SUMS here is assumed to be the output of a previous sha1sum command:

sha1sum -c SHA1SUMS

# on macOS, or BSD or Linux with a reasonable Perl installation
shasum -a1 -c SHA1SUMS

Files listed in SHA1SUMS which don't exist on the filesystem are reported, as are files whose checksums don't match what's in the provided checksum file.
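
Because sha1sum -c exits with a non-zero status when any file fails verification, a script can branch on the result. A minimal sketch, using an invented scratch directory and file name, and assuming sha1sum is installed (substitute shasum -a1 on macOS):

```shell
# Create a sample file and its checksum list in a scratch directory.
vdir=$(mktemp -d)
(
  cd "$vdir" || exit 1
  printf 'ACGT\n' > sample.fastq
  sha1sum sample.fastq > SHA1SUMS
)

# Intact file: verification succeeds (prints "sample.fastq: OK")
if (cd "$vdir" && sha1sum -c SHA1SUMS); then before=pass; else before=fail; fi

# Simulate corruption in transit, then verify again
printf 'extra\n' >> "$vdir/sample.fastq"
if (cd "$vdir" && sha1sum -c SHA1SUMS); then after=pass; else after=fail; fi

echo "before: $before, after: $after"
# → before: pass, after: fail
```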

Verifying checksums in parallel

If you have GNU Parallel available, this will put one sha1sum task on each CPU core, and finish much faster than computing hashes one-by-one by hand, or sequentially in a for loop:

parallel --pipepart --cat sha1sum -c {} :::: SHA1SUMS

The --pipepart --cat options cause Parallel to divide up the input into separate temporary files, the filenames of which are then given to sha1sum -c, which expects the filename containing the pre-computed hashes as an argument. This divides the work over all the cores in your system.

Given the output of certutil on Windows:

Note: The format of the certutil output changed somewhere between Windows 7 and Windows 10. See an example for NOTEPAD.EXE in the header of the script below. Just an FYI if you're trying to parse the output of certutil with other tools.

This shell script below, certutil2shasum, will remove the DOS/Windows CR line endings and do the necessary transformation to turn the output of certutil into the format that sha1sum (or shasum) understands. It can handle the hash output format from Windows 7's certutil, too, which has a space between every pair of hex digits. See the "Installation" and "Usage" sections in the script's headers to get started.

As an alternative to using the script, here's a one-liner that uses built-in Unix utilities to handle the reformatting and line end conversion of the input file transparently. Use shasum -a1 -c instead of sha1sum -c on macOS and alter the input filename SHA1SUMS.txt as appropriate.

sha1sum -c <(
  tr -d '\r' < SHA1SUMS.txt \
    | sed -n '/^SHA/{
        s/.*hash of\( file\)* \(.*\):/\2/;s^\\^/^g;h;n;s/ //g;G;s/\n/  /p
      }'
)

This one-liner requires the process substitution feature (cf. command substitution) of the Korn and Bash shells. This feature may not be present in other shells (like csh); just use the full script below in that case. See the comments in the script for a thorough explanation of the sed regex.

The s^\\^/^g part replaces Windows-style path separators (backslashes) with Unix-style ones (forward slashes). If the person who sent you the checksums used the PowerShell one-liner from above, this will correctly process checksums for files in subdirectories.
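
To see what the sed expression does, here is a small sketch that feeds it the two certutil sample outputs quoted in the script's header. The wrapper name certutil_to_sums is hypothetical; the sed expression itself is the one from the one-liner above.

```shell
# Hypothetical wrapper: reads certutil output on stdin and writes
# "hash  path" lines on stdout.
certutil_to_sums() {
  tr -d '\r' | sed -n '/^SHA/{
    s/.*hash of\( file\)* \(.*\):/\2/;s^\\^/^g;h;n;s/ //g;G;s/\n/  /p
  }'
}

# Windows 10 style: no "file" keyword, hash without spaces
win10=$(certutil_to_sums <<'EOF'
SHA1 hash of C:\windows\notepad.exe:
867b54f1bc5b71045a9a00baca485a24176b202c
CertUtil: -hashfile command completed successfully.
EOF
)

# Windows 7 style: "hash of file", hash with a space between hex pairs
win7=$(certutil_to_sums <<'EOF'
SHA1 hash of file c:\windows\notepad.exe:
26 40 c2 90 04 2e 09 e0 41 fa 76 ae 86 37 74 7e 10 0a 5c 2c
CertUtil: -hashfile command completed successfully.
EOF
)

printf '%s\n%s\n' "$win10" "$win7"
# → 867b54f1bc5b71045a9a00baca485a24176b202c  C:/windows/notepad.exe
# → 2640c290042e09e041fa76ae8637747e100a5c2c  c:/windows/notepad.exe
```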

Additional notes

Using sha1sum or shasum, as appropriate, in a shell script

In general, every "standard" Linux desktop or server environment will have the sha1sum command available out of the box, as well as commands for other hashing algorithms, like md5sum, sha256sum, and sha512sum. These come with the GNU Coreutils distribution, which is as important as it sounds.

However, sha1sum is not present on a Mac system unless the user has taken extra steps to install it. You would need to use shasum -a1 instead on macOS. If you're writing a script and you want to cover both possibilities, you can do this:

if [ "$(uname -s)" = "Linux" ]; then
  SHA1SUM='sha1sum'
else
  SHA1SUM='shasum -a1'
fi

Now use $SHA1SUM in your script in place of shasum or sha1sum and it will use the appropriate utility for both Linux and macOS / BSD.
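
An alternative sketch (not the method shown above) checks for the commands themselves rather than the OS name, which also covers a Mac where sha1sum has been installed separately. The error message wording is made up, and the unquoted expansion relies on word splitting as done by Bash and POSIX sh.

```shell
# Pick whichever SHA1 tool is actually on the PATH.
if command -v sha1sum >/dev/null 2>&1; then
  SHA1SUM='sha1sum'
elif command -v shasum >/dev/null 2>&1; then
  SHA1SUM='shasum -a1'
else
  echo "neither sha1sum nor shasum found" >&2
  SHA1SUM=''
fi

# $SHA1SUM is left unquoted on purpose so that 'shasum -a1' splits into
# a command name plus its option.
if [ -n "$SHA1SUM" ]; then
  empty_hash=$(printf '' | $SHA1SUM | cut -d ' ' -f 1)
  echo "$empty_hash"
  # → da39a3ee5e6b4b0d3255bfef95601890afd80709
fi
```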

#!/usr/bin/env bash
##############################################################################
##
## Make the output of Windows' 'certutil -hashfile' match the format that
## 'sha1sum -c' expects
##
## Author: Kevin Ernst
## Date: 30 October 2018 (updated 28 October 2021)
##
## Installation:
##
## $ curl -LOJ https://tf.cchmc.org/s/certutil2shasum # or use 'wget'
## $ chmod a+x certutil2shasum
##
## Usage:
##
## Briefly, given 'winsums.txt' generated from 'certutil' on Windows:
##
## $ ./certutil2shasum winsums.txt > SHA1SUMS
##
## Details:
##
## The input file, 'winsums.txt', is assumed to be the product of having
## done something like:
##
## X:\Path> for %i in (*) do certutil -hashfile "%i" SHA1 >> winsums.txt
##
## using the Windows command prompt (CMD.EXE).
##
## You then convert the output of 'certutil' into a format that 'sha1sum'
## or 'shasum' on Unix (including macOS) can understand, like this:
##
## $ ./certutil2shasum winsums.txt > SHA1SUMS
## $ sha1sum -c SHA1SUMS # use 'shasum -a1 -c' on macOS
##
## Because Windows text files have a different (CR+LF) line ending, and may
## be UTF-16 with a "byte order mark", the 'dos2unix' utility is required
## to convert the sums file into a format Unix utilities can work with.
## This utility is present by default on any reasonable Linux system, or
## can be installed on a Mac with MacPorts or Homebrew.
##
## Examples:
##
## Output of 'certutil' on Windows 7; note 'hash of file <filename>' and
## the spaces in the hash:
##
## SHA1 hash of file c:\windows\notepad.exe:
## 26 40 c2 90 04 2e 09 e0 41 fa 76 ae 86 37 74 7e 10 0a 5c 2c
## CertUtil: -hashfile command completed successfully.
##
## Output of 'certutil' on Windows 10 (possibly 8, too?):
##
## SHA1 hash of C:\windows\notepad.exe:
## 867b54f1bc5b71045a9a00baca485a24176b202c
## CertUtil: -hashfile command completed successfully
##
## Resulting output from this script:
##
## 4894bd2097ec497206975385f2e2b0e57fe65017  UI_input_form_wireframe.png
##
## Closing Remarks:
##
## Shebangs are tricky; see https://unix.stackexchange.com/a/14894.
##
##############################################################################
# set TRACE=1 in the environment to enable execution tracing
(( TRACE )) && set -x
# '$(readlink -f "$BASH_SOURCE")' is better, but 'readlink' is GNU/Linux-only
ME=$(basename "${BASH_SOURCE[0]}")
# I did it this way so the resulting program will work without GNU sed (e.g.,
# on macOS), but I can still have comments
SED_SCRIPT=(
    # for lines starting with "SHA"…
    /^SHA/ {
        # replace entire line w/ just filename part
        # (Win 7/10 output differs slightly here)
        's/.*hash of\( file\)* \(.*\):/\2/;'
        's^\\^/^g;'   # replace backslashes with slashes
        's^\./^^;'    # and remove the leading './'
        'h;'          # then save that in the hold space
        'n;'          # read the next line (the checksum)
        's/ //g;'     # replace all spaces (with nothing)
        'G;'          # append the hold space
        's/\n/  /p;'  # replace the embedded newline with two spaces
    }
)
if [[ $# -eq 0 || ! -r $1 ]]; then
    echo "$ME: Expected first argument to be a (readable) file" >&2
    exit 1
fi
if ! type -t dos2unix &>/dev/null; then
    echo "$ME: 'dos2unix' needed; please install with MacPorts / Homebrew" >&2
    exit 1
fi
# remove the DOS/Windows CR line endings, convert to UTF-8 w/ no BOM
dos2unix "$1"
# 'tr' /could/ fix the line endings (`<infile tr -d '\r' >outfile`) but that
# doesn't fix the fact that the input file is UTF-16 with byte order mark
# 'sed' with a "heredoc" for the script wouldn't work, so this trickery
# see also: https://unix.stackexchange.com/a/14894
sed -n -f <(echo "${SED_SCRIPT[@]}") -- "$1"
# certutil2shasum

ernstki commented Jan 7, 2020

@bioinformike Good point. I've updated the instructions.
