Hello, collaborators!
If you have the ability to do so, please compute SHA1 hashes for any files you intend to share with the Weirauch Lab. These hashes, or "checksums," are critical for verifying that we've received your data 100% intact and unmodified.
You will need to save these checksums into a file, then transmit or store that file alongside the original data. Instructions for macOS and Linux are directly below. For Windows, please follow the instructions in the "Generating checksums on Windows" section.
Here is a one-liner that will do the job on a typical Unix/Linux system. macOS is also Unix, but if you are not familiar with the command line, see the section "Extra guidance for Mac users," below. If you use Windows and have Git Bash or Cygwin available, this method should also work for you.
sha1sum * > SHA1SUMS
Failing that, this should work on macOS, or any Unix with the Digest::SHA Perl module:
shasum -a1 * > SHA1SUMS
If you're writing a shell script and want to account for both possibilities, see the "Additional notes" section.
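One caveat with the bare * wildcard: it also matches subdirectories, and on a second run it matches the existing SHA1SUMS file, whose recorded hash would generally not match later because the file is still being rewritten when it is read. The sketch below is one possible way around that, not part of the original one-liner. It runs in a scratch directory with made-up file names (sample_1.txt, sample_2.txt, subdir), hashes regular files only, and skips the checksum file.

```shell
#!/bin/sh
# Demo in a scratch directory so no real data is touched.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf 'abc' > sample_1.txt      # hypothetical data files
printf 'abc' > sample_2.txt
mkdir subdir                     # directories are matched by * too
: > SHA1SUMS                     # stale checksum file from an earlier run

# Use sha1sum where it exists, otherwise fall back to shasum -a1.
SHA1SUM='sha1sum'
command -v sha1sum >/dev/null 2>&1 || SHA1SUM='shasum -a1'

# Hash regular files only, and never the checksum file itself.
for f in *; do
    [ -f "$f" ] || continue
    [ "$f" = SHA1SUMS ] && continue
    $SHA1SUM "$f"
done > SHA1SUMS

cat SHA1SUMS
```

In real use you would drop the scratch-directory setup and run only the loop from inside your data folder.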
If you have multiple samples in separate subdirectories (say, a collection of experiments and their associated .fastq.gz
files), you can use find
to locate all files matching a pattern, then run shasum
/ sha1sum
on each of those files in sequence. See the next section for a parallel method.
The command below will generate a SHA1SUMS
file in the parent directory, for all files matched by the find
pattern:
find * -name "*.fastq.gz" -exec sha1sum {} \; | tee SHA1SUMS
Piping through tee
just allows you to monitor the progress.
If you have GNU Parallel available on your end, this command will compute checksums in parallel (one task per CPU core) for all .fastq.gz
files in the current directory:
find *.fastq.gz | parallel --keep-order sha1sum {} | tee SHA1SUMS
The --keep-order
option to parallel
just rearranges the output to match the order of the input (filenames from find
), since there's no guarantee the output of the sha1sum
tasks will show up in that order otherwise. Let me know if you've heard this one… Why'd the parallel computing chicken cross the road? Answer: side. other get To to the
To include all .fastq.gz
files including those in subdirectories, do this instead:
find * -name "*.fastq.gz" | parallel --keep-order sha1sum {} | tee SHA1SUMS
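If GNU Parallel is not installed, xargs -P (present in both GNU and BSD xargs) can act as a rough stand-in. This is a sketch under assumptions rather than a tested replacement for the commands above: the directory and file names are invented, the files are placeholders rather than real gzip data, and the job count of 4 is arbitrary. Because xargs has no equivalent of --keep-order, the output is sorted by filename afterwards.

```shell
#!/bin/sh
# Demo in a scratch directory with placeholder files.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
mkdir -p exp1 exp2
printf 'abc' > exp1/reads.fastq.gz    # placeholder content, not real gzip data
printf 'abc' > exp2/reads.fastq.gz

# Use sha1sum where it exists, otherwise fall back to shasum -a1.
SHA1SUM='sha1sum'
command -v sha1sum >/dev/null 2>&1 || SHA1SUM='shasum -a1'

# -print0 / -0 keep filenames containing spaces intact.
# -n 1 hands each process a single file; -P 4 runs up to four at once.
# Results arrive in completion order, so sort on the filename column.
find . -type f -name '*.fastq.gz' -print0 \
    | xargs -0 -n 1 -P 4 $SHA1SUM \
    | sort -k 2 \
    | tee SHA1SUMS
```

Starting the search from . gives each path a leading ./ in the output, which sha1sum -c accepts when verifying.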
If you have never used the Terminal application on your Mac, it can be found in the "Utilities" folder, inside the "Applications" folder. You can also open Spotlight with ⌘+space and start typing t-e-r-m… until you see its icon appear.
You will follow the instructions from the "Generating checksums on Unix" section above, but first you need to cd
(change directory) to the folder containing the files you want to checksum. An easy way to do this is to
- in the Terminal app, type
cd␣
- that is
cd
, then a space
- drag-and-drop the icon for the folder containing those files from Finder (the file manager) into the open Terminal window, and finally
- if you have trouble finding the Terminal window again, drag the folder icon—without letting go—over the Terminal icon in the dock at the bottom, and hold it there for a while until the Terminal window comes to the front, then drop the folder icon on that window
- press Enter.
You can run the command pwd
(print working directory) to confirm that it worked. It should display the full path to the folder you dragged into the Terminal window.
Now run the commands as directed above, substituting shasum -a1
wherever you see sha1sum
. macOS does not have a built-in sha1sum
command (nor md5sum
), although you can add these yourself after the fact by installing MacPorts' md5sha1sum
package.
If you have a recent version of Windows 10*, you can use PowerShell to create SHA1 hashes in the format we require on our end. If you are stuck with an older version of Windows, see the "Generating checksums on older versions of Windows" section, below.
First, right-click on the "Start" menu, then choose "Windows PowerShell" from the menu that appears.
Then, cd
("change directory") to the folder containing the files you wish to checksum. An easy way to do this is to
- type
cd␣
at the command prompt
- that is
cd
, then a space
- drag-and-drop the icon for the folder containing those files from the Windows file manager into the open PowerShell window
- if you have trouble finding the PowerShell window again, drag the folder icon—without letting go—over the PowerShell icon in the taskbar at the bottom, and hold it there for a while
- finally, press Enter.
Your prompt should now look something like PS X:\Path\To\The\Folder>
, where X:\Path\To...
is the actual drive letter and path to the folder you dragged into the window.
Next, copy the command (select with the mouse, then press Ctrl+C)
ls -recurse -file -name `
| foreach { get-filehash $_ -a SHA1 } `
| foreach { echo $(
$_.hash.tolower() + " " +
$(resolve-path -relative $_.path `
| foreach { $_ -replace "\\","/" -replace "^\./","" }
)
)
} `
| tee SHA1SUMS.txt
…and paste it into an open PowerShell terminal window by clicking with the right (context-click) mouse button, or pressing Ctrl+V.
Copy-paste in the Windows terminal is a bit tricky, and requires a steady hand. If you click the wrong mouse button, or find yourself otherwise frustrated, just click on the "control icon" in the upper-left of the PowerShell window and use the "Edit" menu there.
Finally, press Enter to invoke the command, creating the file SHA1SUMS.txt
containing the computed checksums. If you end up uploading the analysis data to a different server, please include the SHA1SUMS.txt
file along with the rest of the data.
*For Windows 10 prior to the "Creators Update" (version 1703), you'll just need to find and launch "Windows PowerShell" from the Start menu.
You can use the arrow keys to recall, edit, and re-run this command, if necessary. Be sure not to accidentally delete any of the backticks (a.k.a. grave accents, which are PowerShell's line continuation character) or you'll get an error.
Omit -recurse
from the first line if you do not want to descend into subfolders (that is, only checksum the files in the current folder). If you want to compute checksums for specific file types only, add the -include
option with a wildcard pattern; for example, -include *.fastq.gz
. Multiple patterns may be separated with commas, like this: -include *.vcf,*.bam
.
Using tee
at the end just lets you see the progress on-screen while simultaneously saving the output to the file SHA1SUMS.txt
. Like a "T" in a piping system.
These instructions apply to Windows 7 and 8. You will still need to open a PowerShell terminal, but the commands below differ for lack of the Get-FileHash
command present in newer versions of Windows. You will use certutil
instead.
Open the Start menu and start typing p-o-w… until the PowerShell program icon appears. You don't need to run it as administrator, if you see that option there.
In the PowerShell window you'll need to first cd
to the folder containing the files you want to compute checksums for, as detailed in the "Generating checksums on Windows" section, above.
Now, copy the command (select with the mouse, then press Ctrl+C)
ls -file -recurse -name `
| resolve-path -relative `
| foreach { certutil -hashfile $_ SHA1 } `
| tee SHA1SUMS.txt
…and paste it into an open PowerShell terminal window by clicking with the right (context-click) mouse button.
Copy-paste in the Windows terminal is a bit tricky, and requires a steady hand. If you click the wrong mouse button, or find yourself otherwise frustrated, just click on the "control icon" in the upper-left of the PowerShell window and use the "Edit" menu there.
Omit -recurse
from the first line if you do not want to descend into subfolders (that is, only checksum the files in the current folder). If you want to compute checksums for specific file types only, add the -include
option with a wildcard pattern; for example, -include *.fastq.gz
. Multiple patterns may be separated with commas, like this: -include *.vcf,*.bam
.
Using tee
at the end just lets you see the progress on-screen while also saving the output to the file SHA1SUMS.txt
.
You can use the arrow keys to edit this command, if necessary, but be sure not to accidentally delete any of the backticks (a.k.a. grave accents, which are PowerShell's line continuation character) or you'll get an error.
Finally, press Enter to invoke the command, creating the file SHA1SUMS.txt
containing the computed checksums. If you end up uploading the analysis data to a different server, please include the SHA1SUMS.txt
file along with the rest of the data.
Sections that follow are a reference for Weirauch Lab members…
Important note: If you receive a SHA1SUMS.txt
from a collaborator who created it using the Windows PowerShell commands above, you must first convert it to Unix format using dos2unix
. This is because PowerShell's default output encoding is UTF-16 with a BOM, which Unix utilities can't understand; dos2unix
will handle the encoding as well as removing the extraneous CR (carriage return) characters for you, updating the file in place.
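If dos2unix is not installed, a similar result can usually be had with iconv and tr. The sketch below assumes the file really is UTF-16 with a BOM and CRLF line endings. It fabricates such a file first (the entry for sample.txt is made up), and unlike dos2unix it writes a new file named SHA1SUMS rather than converting in place.

```shell
#!/bin/sh
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Fabricate a stand-in for a PowerShell-generated checksum file:
# UTF-16 with a BOM, CRLF line endings.
printf 'a9993e364706816aba3e25717850c26c9cd0d89d  sample.txt\r\n' \
    | iconv -f UTF-8 -t UTF-16 > SHA1SUMS.txt

# Decode to UTF-8 (iconv reads the BOM to pick the byte order),
# then strip the carriage returns.
iconv -f UTF-16 -t UTF-8 SHA1SUMS.txt | tr -d '\r' > SHA1SUMS

cat SHA1SUMS
```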
To verify the checksums of data transmitted by a collaborator, just give the -c
option to sha1sum
(or shasum -a1
, as appropriate) followed by the name of the checksum file and it will verify checksums of files which exist in the current working directory against entries in the file. The file SHA1SUMS
here is assumed to be the output of a previous sha1sum
command:
sha1sum -c SHA1SUMS
# on macOS, or BSD or Linux with a reasonable Perl installation
shasum -a1 -c SHA1SUMS
Files listed in SHA1SUMS
which don't exist on the filesystem are reported, as are files whose checksums don't match what's in the provided checksum file.
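When verification runs inside a script, the exit status is often more convenient than reading the per-file report: -c returns non-zero if any file fails or is missing. The following is a minimal sketch with invented file names (intact.txt, damaged.txt) that simulates one file being altered after its checksum was recorded.

```shell
#!/bin/sh
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf 'abc' > intact.txt        # hypothetical files
printf 'abc' > damaged.txt

# Use sha1sum where it exists, otherwise fall back to shasum -a1.
SHA1SUM='sha1sum'
command -v sha1sum >/dev/null 2>&1 || SHA1SUM='shasum -a1'

$SHA1SUM intact.txt damaged.txt > SHA1SUMS

# Simulate corruption in transit by appending a byte.
printf 'x' >> damaged.txt

# The per-file OK/FAILED report goes to the terminal; the exit
# status tells the script whether everything matched.
if $SHA1SUM -c SHA1SUMS; then
    result='all files verified'
else
    result='verification failed'
fi
echo "$result"
```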
If you have GNU Parallel available, this will put one sha1sum
task on each CPU core, and finish much faster than computing hashes one-by-one by hand, or sequentially in a for
loop:
parallel --pipepart --cat sha1sum -c {} :::: SHA1SUMS
The --pipepart --cat
options cause Parallel to divide up the input into separate temporary files, the filenames of which are then given to sha1sum -c
, which expects the filename containing the pre-computed hashes as an argument. This divides the work over all the cores in your system.
Note: The format of the certutil
output changed somewhere between Windows 7 and Windows 10. See an example for NOTEPAD.EXE
in the header of the script below. Just an FYI if you're trying to parse the output of certutil
with other tools.
This shell script below, certutil2shasum
, will remove the DOS/Windows CR line endings and do the necessary transformation to turn the output of certutil
into the format that sha1sum
(or shasum
) understands. It can handle the hash output format from Windows 7's certutil
, too, which has a space between every pair of hex digits. See the "Installation" and "Usage" sections in the script's headers to get started.
As an alternative to using the script, here's a one-liner that uses built-in Unix utilities to handle the reformatting and line end conversion of the input file transparently. Use shasum -a1 -c
instead of sha1sum -c
on macOS and alter the input filename SHA1SUMS.txt
as appropriate.
sha1sum -c <(
tr -d '\r' < SHA1SUMS.txt \
| sed -n '/^SHA/{
s/.*hash of\( file\)* \(.*\):/\2/;s^\\^/^g;h;n;s/ //g;G;s/\n/ /p
}'
)
This one-liner requires the process substitution feature (cf. command substitution) of the Korn and Bash shells. This feature may not be present in other shells (like csh
); just use the full script below in that case. See the comments in the script for a thorough explanation of the sed
regex.
The s^\\^/^g
part replaces Windows-style path separators (backslashes) with Unix-style ones (forward slashes). If the person who sent you the checksums used the PowerShell one-liner from above, this will correctly process checksums for files in subdirectories.
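To see the transformation in isolation, the sketch below feeds a small hand-written sample through the same sed steps, without involving sha1sum -c. The sample mimics the spaced-hex certutil layout described above; the file name sub\a.txt is invented, and the hash is the well-known SHA-1 of the string "abc". It writes two spaces between hash and filename, which is the layout sha1sum itself produces.

```shell
#!/bin/sh
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Hand-written sample: CRLF line endings, spaces between hex pairs,
# and a backslash in the (hypothetical) file path.
printf 'SHA1 hash of file sub\\a.txt:\r\n' > SHA1SUMS.txt
printf 'a9 99 3e 36 47 06 81 6a ba 3e 25 71 78 50 c2 6c 9c d0 d8 9d\r\n' >> SHA1SUMS.txt
printf 'CertUtil: -hashfile command completed successfully.\r\n' >> SHA1SUMS.txt

converted=$(
    tr -d '\r' < SHA1SUMS.txt \
    | sed -n '/^SHA/{
        # keep only the file name from the header line
        s/.*hash of\( file\)* \(.*\):/\2/
        # backslashes become forward slashes
        s^\\^/^g
        # stash the name, then read the hash line
        h
        n
        # remove spaces between hex pairs
        s/ //g
        # append the name and join the two lines
        G
        s/\n/  /p
    }'
)
echo "$converted"
```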
In general, every "standard" Linux desktop or server environment will have the sha1sum
command available out of the box, as well as commands for other hashing algorithms like md5sum
, sha256sum
, sha512sum
, and so on. These come with the GNU Coreutils distribution, which is as important as it sounds.
However, sha1sum
is not present on a Mac system, without the user having taken extra steps to install it. You would need to use shasum -a1
instead on macOS. If you're writing a script and you want to cover both possibilities, you can do this:
if [ "$(uname -s)" = "Linux" ]; then
SHA1SUM='sha1sum'
else
SHA1SUM='shasum -a1'
fi
Now use $SHA1SUM
in your script in place of shasum
or sha1sum
and it will use the appropriate utility for both Linux and macOS / BSD.
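An alternative to checking the operating system name is to test for the command itself, which also covers Macs where sha1sum has been installed separately. This is a sketch of that idea, not a replacement for the snippet above; the self-check at the end uses the well-known SHA-1 of the string "abc".

```shell
#!/bin/sh
# Prefer sha1sum when it is on the PATH; otherwise fall back to shasum.
if command -v sha1sum >/dev/null 2>&1; then
    SHA1SUM='sha1sum'
elif command -v shasum >/dev/null 2>&1; then
    SHA1SUM='shasum -a1'
else
    echo 'error: neither sha1sum nor shasum was found' >&2
    exit 1
fi

# Quick self-check: hash the string "abc" and keep only the digest column.
digest=$(printf 'abc' | $SHA1SUM | cut -d ' ' -f 1)
echo "$digest"
```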