Installation:
A quick installation does not require root access, as shown:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
On Ubuntu (as tested on 16.04), it is already available in the repositories:
sudo apt-get install parallel
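To verify the installation you can print the version and run a trivial test job; the second command should simply print "works":
parallel --version | head -n 1
parallel echo ::: works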
EXAMPLE: Replace a for-loop
It is often faster to write a command using GNU Parallel than to write a for loop:
for i in *gz; do
zcat $i > $(basename $i .gz).unpacked
done
Can also be written as:
parallel 'zcat {} > {.}.unpacked' ::: *.gz
The added benefit is that the zcat jobs are run in parallel, one per CPU core.
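GNU Parallel defaults to one job per CPU core; if you want a different level of parallelism, the -j option sets the maximum number of simultaneous jobs (the value 4 below is an arbitrary choice):
parallel -j 4 'zcat {} > {.}.unpacked' ::: *.gz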
EXAMPLE: Parallelizing BLAT:
This will start one blat process per CPU core and distribute foo.fa to them in 1 MB blocks:
cat foo.fa | parallel --round-robin --pipe --recstart ">" "blat -noHead genome.fa stdin >(cat) >&2" >foo.psl
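Here --pipe splits stdin into blocks and --recstart ">" ensures each block starts at a FASTA header, so no sequence is cut in half. To see the chunking independently of blat, you can count how many sequences each block receives (grep -c '>' is just a stand-in command that counts headers per block):
cat foo.fa | parallel --pipe --recstart '>' --block 1M "grep -c '>'"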
EXAMPLE: Blast on multiple machines:
Assume you have a 1 GB fasta file that you want to BLAST. GNU Parallel can split the fasta file into 100 KB chunks and run one job per CPU core:
cat 1gb.fasta | parallel --block 100k --recstart '>' --pipe blastp -evalue 0.01 -outfmt 6 -db db.fa -query - > results
If, in addition to the local machine, you have access to other servers over SSH (ideally using SSH keys so no password is needed), here named server1 and server2, GNU Parallel can distribute the jobs across all of them. It will automatically detect how many CPU cores each server has:
cat 1gb.fasta | parallel -S :,server1,server2 --block 100k --recstart '>' --pipe blastp -evalue 0.01 -outfmt 6 -db db.fa -query - > result
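With more servers the list after -S gets long; the --sshloginfile option reads the list of sshlogins from a file instead, one per line, where ':' means the local machine. Assuming a hypothetical file nodes.txt containing ':', 'server1' and 'server2', the same run becomes:
cat 1gb.fasta | parallel --sshloginfile nodes.txt --block 100k --recstart '>' --pipe blastp -evalue 0.01 -outfmt 6 -db db.fa -query - > result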
EXAMPLE: Run bigWigToWig for each chromosome:
If you have one file per chromosome, it is easy to parallelize processing each file. Here we run bigWigToWig for chromosomes 1..19 plus X, Y and M. The jobs run in parallel, but only one per CPU core. The {} will be substituted with the arguments following the separator ':::'.
parallel bigWigToWig -chrom=chr{} wgEncodeCrgMapabilityAlign36mer_mm9.bigWig mm9_36mer_chr{}.map ::: {1..19} X Y M
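If you want to check which commands will be generated before running anything, --dry-run prints each command instead of executing it:
parallel --dry-run bigWigToWig -chrom=chr{} wgEncodeCrgMapabilityAlign36mer_mm9.bigWig mm9_36mer_chr{}.map ::: {1..19} X Y M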
EXAMPLE: Running composed commands:
GNU Parallel is not limited to running a single command. It can run a composed command. Here is how you process multiple FASTA files using Biopieces (which uses pipes to communicate):
parallel 'read_fasta -i {} | extract_seq -l 5 | write_fasta -o {.}_trim.fna -x' ::: *.fna
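If a composed command grows long, it can be easier to wrap it in a Bash function, since recent versions of GNU Parallel can run exported Bash functions. A minimal sketch under that assumption (trim_fasta is a hypothetical name; it reproduces the pipeline above for one file):
trim_fasta() { read_fasta -i "$1" | extract_seq -l 5 | write_fasta -o "${1%.fna}_trim.fna" -x; }
export -f trim_fasta
parallel trim_fasta ::: *.fna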
EXAMPLE: Running experiments:
Experiments often have several parameters where every combination should be tested. Assume we have a program called experiment that takes three arguments: --age, --sex and --chr:
experiment --age 18 --sex M --chr 22
Now we want to run experiment for every combination of ages 1..80, sexes M and F, and chromosomes 1..22 plus X and Y:
parallel experiment --age {1} --sex {2} --chr {3} ::: {1..80} ::: M F ::: {1..22} X Y
To save the output in different files you could do:
parallel experiment --age {1} --sex {2} --chr {3} '>' output.{1}.{2}.{3} ::: {1..80} ::: M F ::: {1..22} X Y
But GNU Parallel can structure the output into directories, so you avoid having thousands of output files in a single directory:
parallel --results outputdir experiment --age {1} --sex {2} --chr {3} ::: {1..80} ::: M F ::: {1..22} X Y
This will create files like outputdir/1/80/2/M/3/X/stdout containing the standard output of the job.
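For long parameter sweeps it can also help to record a job log: --joblog writes one line per job with its runtime and exit status, and combined with --resume it lets you continue an interrupted sweep where it left off (exp.log is an arbitrary file name):
parallel --joblog exp.log --resume --results outputdir experiment --age {1} --sex {2} --chr {3} ::: {1..80} ::: M F ::: {1..22} X Y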
If you have many different parameters it may be handy to name them:
parallel --results outputdir --header : experiment --age {AGE} --sex {SEX} --chr {CHR} ::: AGE {1..80} ::: SEX M F ::: CHR {1..22} X Y
Then the output files will be named like outputdir/AGE/80/CHR/Y/SEX/F/stdout
If one of your parameters takes on many different values, these can be read from a file using '::::':
echo AGE > age_file
seq 1 80 >> age_file
parallel --results outputdir --header : experiment --age {AGE} --sex {SEX} --chr {CHR} :::: age_file ::: SEX M F ::: CHR {1..22} X Y
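With --header :, the first line of age_file (AGE) is used as the parameter name and the remaining lines as its values, one per line. You can check what the two commands above put in the file:
head -3 age_file
which prints AGE followed by 1 and 2.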
Advanced example: Using GNU Parallel to parallelize your own scripts:
Assume you have a BASH/Perl/Python script called launch. It takes one argument, ID:
launch ID
Using GNU Parallel you can run multiple IDs in parallel:
parallel launch ::: ID1 ID2 ...
But you would like to hide this complexity from the user, so the user only has to do:
launch ID1 ID2 ...
You can do that using --shebang-wrap. Change the shebang line from:
#!/usr/bin/env bash
#!/usr/bin/env perl
#!/usr/bin/env python
to:
#!/usr/bin/parallel --shebang-wrap bash
#!/usr/bin/parallel --shebang-wrap perl
#!/usr/bin/parallel --shebang-wrap python
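As a concrete sketch, a minimal wrapped Bash version of launch could look like this (the echo is just a stand-in for real work; GNU Parallel runs one copy of the script per ID, and each copy receives its ID in $1):
#!/usr/bin/parallel --shebang-wrap bash
# launch (sketch): one run of this script per ID, executed in parallel
echo "Processing ID $1"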
You further develop your script so it now takes an ID and a DIR:
launch ID DIR
You would like it to take multiple IDs but only one DIR, and run the IDs in parallel. Again, simply change the shebang line to:
#!/usr/bin/parallel --shebang-wrap bash
And now you can run:
launch ID1 ID2 ID3 ::: DIR
If you want/need to build a cluster of nodes that can run jobs with parallel over a shared sshfs mount, see this gist for the definitive guide.