@samrocketman
Last active November 25, 2015 21:44
Using xargs to process multiple arguments and commands in parallel. The use case is software that does not use multiple cores on its own but works on independent data, so the data can be split and processed separately.

About

How to parallelize across tmp directories using full paths and a wrapper script. This assumes every script uses the current working directory for all of its processing.

The use case is software that does not take advantage of multiple cores even though it processes independent data. Because the data is independent, it can be split up and each piece processed separately.
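
As a rough, self-contained sketch of the idea (not the gist's actual scripts): the snippet below creates three hypothetical job directories under a throwaway mktemp directory, then lets xargs run one process per directory. Each process changes into its directory first, so a tool that only writes to the current working directory does not collide with the others. The names work and jobN are placeholders I made up for the demo, and the directory paths are assumed to contain no whitespace.

```shell
#!/bin/bash
# Hypothetical demo layout: three independent work directories,
# each holding its own data.dat.
work="$(mktemp -d)"
for n in 1 2 3; do
  mkdir -p "$work/job$n"
  printf 'data point %s\n' "$n" > "$work/job$n/data.dat"
done

# One directory per line is piped to xargs. Each job runs in its own
# process, changes into its directory, and writes output relative to
# the current working directory.
find "$work" -mindepth 1 -maxdepth 1 -type d |
  xargs -P 3 -L 1 sh -c 'cd "$1" && md5sum data.dat > data.dat.md5' _

# Every directory should now hold a checksum file next to its data.
find "$work" -name 'data.dat.md5' | sort
```

The trailing _ fills $0 for sh -c, so the directory handed over by xargs arrives as $1.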

Try this out

git clone https://gist.github.com/4e1bafb3c464879668d3.git xargs_example
cd xargs_example
./main.sh
find tmp

Clean up:

  • rm -rf tmp
  • Or git clean -xfd, which also deletes every other untracked or ignored file in the clone

Scripts

  • main.sh - the main script to launch all parallelized processing.
  • setup.sh - creates the temporary directories and sets up arguments for wrapper.sh.
  • filter.sh - used by the split command in setup.sh (through --filter) to write each chunk of data into its own temporary directory.
  • wrapper.sh - wraps all processing scripts. This script is designed to be executed by xargs with parallelism. It will cd to the temporary directory and then execute all processing scripts in the context of that working directory.
  • process.sh - an example processing script that uses the current working directory for output.
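
To make the split and filter step more concrete, here is a minimal sketch that assumes GNU split from coreutils, which provides --filter. It is a variation on filter.sh, not a copy of it: instead of mktemp, it names each directory after the $FILE variable that split exports to the filter command. The work variable and the part. prefix are placeholders of mine.

```shell
#!/bin/bash
# Assumes GNU split (coreutils), which provides --filter and exports $FILE.
work="$(mktemp -d)"

# 12 input lines at 5 lines per chunk gives chunks of 5, 5, and 2 lines.
# For each chunk, split runs the filter with the chunk on stdin and $FILE
# set to the name the chunk would otherwise have been written to.
seq 1 12 |
  split -l 5 --filter='mkdir -p "$FILE.d" && cat > "$FILE.d/data.dat"' - "$work/part."

# Show the line count of each per-chunk data file.
wc -l "$work"/part.*.d/data.dat
```

The filter is single quoted so that $FILE is expanded by the shell that split starts for each chunk, not by the calling script.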

main.sh explained

./setup.sh | xargs -P0 -l2 ./wrapper.sh

xargs options:

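  • -P0 runs as many processes at a time as possible. If you want to limit it to, say, 4 concurrent processes then change it to -P4.
  • -l2 is the maximum number of input lines (not arguments) that xargs uses for one run of the script. My setup.sh script "echoes" 2 lines for each job, so -l2 tells xargs to read two lines and pass both in as arguments to a single wrapper.sh run. You can hand N lines to each run of wrapper.sh this way. GNU xargs documents -l as deprecated in favor of the POSIX option -L, so -L 2 is the portable spelling. The main.sh in this gist uses -l1 because its setup.sh prints a single directory per line.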
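
A small sketch of both options, using echo and sh in place of wrapper.sh. The directory names and somearg values are placeholders. The second half relies on GNU xargs behaviour as I understand it: when any invocation exits with a status from 1 to 125, xargs itself exits non-zero (GNU findutils documents 123), which is how a failed job becomes visible to the caller.

```shell
#!/bin/bash
# -L 2 consumes two input lines per invocation, so each job receives a
# directory and one extra argument. Without -P the jobs run one at a time,
# which keeps the output order stable for this demo.
pairs="$(printf '%s\n' tmp/dirA somearg1 tmp/dirB somearg2 | xargs -L 2 echo job:)"
echo "$pairs"

# -P 2 caps concurrency at two processes. The middle job fails on purpose,
# and xargs reports that through its own exit status.
status=0
printf '%s\n' ok fail ok |
  xargs -P 2 -L 1 sh -c '[ "$1" != fail ]' _ || status=$?
echo "xargs exit status: $status"
```

With -P greater than 1, output from different jobs can interleave in any order, so anything that must stay ordered is better written to a file inside each job's own directory.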

Here's output of that command:

^_^[sam@autopsy:~/sandbox/xargs_example]$ ls
main.sh  process.sh  README.md  setup.sh  wrapper.sh
^_^[sam@autopsy:~/sandbox/xargs_example]$ ./main.sh
$1: tmp/tmp.kEnlmRbamS
$2: somearg1
$1: tmp/tmp.dRdpOpaJUS
$2: somearg2
$1: tmp/tmp.K2fI8YWwqP
$2: somearg3
^_^[sam@autopsy:~/sandbox/xargs_example]$ ls
main.sh  process.sh  README.md  setup.sh  tmp  wrapper.sh
^_^[sam@autopsy:~/sandbox/xargs_example]$ find tmp
tmp
tmp/tmp.kEnlmRbamS
tmp/tmp.kEnlmRbamS/datafile.txt
tmp/tmp.kEnlmRbamS/datafile.txt.md5
tmp/tmp.dRdpOpaJUS
tmp/tmp.dRdpOpaJUS/datafile.txt
tmp/tmp.dRdpOpaJUS/datafile.txt.md5
tmp/tmp.K2fI8YWwqP
tmp/tmp.K2fI8YWwqP/datafile.txt
tmp/tmp.K2fI8YWwqP/datafile.txt.md5

data.dat

data point 1
data point 2
data point 3
data point 4 with additional data
data point 5
data point 6
data point 7
data point 8
data point 9
data point 10
data point 11 with additional data
data point 12
data point 13
data point 14
data point 15 with additional data
data point 16
data point 17
data point 18
data point 19 with additional data
data point 20

filter.sh

#!/bin/bash
#Wed Nov 25 12:37:37 PST 2015
#Ubuntu 14.04.3 LTS
#Linux 3.13.0-68-generic x86_64
#GNU bash, version 4.3.11(1)-release (x86_64-pc-linux-gnu)
#cat (GNU coreutils) 8.21
#md5sum (GNU coreutils) 8.21
#mktemp (GNU coreutils) 8.21
#xargs (GNU findutils) 4.4.2
mkdir -p tmp
TEMPDIR="$(mktemp -dp tmp)"
cat > "$TEMPDIR/data.dat"

main.sh

#!/bin/bash
#Wed Nov 25 12:37:37 PST 2015
#Ubuntu 14.04.3 LTS
#Linux 3.13.0-68-generic x86_64
#GNU bash, version 4.3.11(1)-release (x86_64-pc-linux-gnu)
#cat (GNU coreutils) 8.21
#md5sum (GNU coreutils) 8.21
#mktemp (GNU coreutils) 8.21
#xargs (GNU findutils) 4.4.2
./setup.sh | xargs -P0 -l1 ./wrapper.sh
#xargs -l1 states how many stdout lines should be passed to a single wrapper.sh
#script. In this case it's 1, since setup.sh prints one directory per line.

process.sh

#!/bin/bash
#Wed Nov 25 12:37:37 PST 2015
#Ubuntu 14.04.3 LTS
#Linux 3.13.0-68-generic x86_64
#GNU bash, version 4.3.11(1)-release (x86_64-pc-linux-gnu)
#cat (GNU coreutils) 8.21
#md5sum (GNU coreutils) 8.21
#mktemp (GNU coreutils) 8.21
#xargs (GNU findutils) 4.4.2
#hash a pre-created data.dat
md5sum ./data.dat > ./data.dat.md5
md5sum -c data.dat.md5

setup.sh

#!/bin/bash
#Wed Nov 25 12:37:37 PST 2015
#Ubuntu 14.04.3 LTS
#Linux 3.13.0-68-generic x86_64
#GNU bash, version 4.3.11(1)-release (x86_64-pc-linux-gnu)
#cat (GNU coreutils) 8.21
#md5sum (GNU coreutils) 8.21
#mktemp (GNU coreutils) 8.21
#xargs (GNU findutils) 4.4.2
#split the data into parts
split --filter="./filter.sh" -l 5 "data.dat"
#output all of the tmp directories for xargs
find tmp/* -maxdepth 0 -type d

wrapper.sh

#!/bin/bash
#Wed Nov 25 12:37:37 PST 2015
#Ubuntu 14.04.3 LTS
#Linux 3.13.0-68-generic x86_64
#GNU bash, version 4.3.11(1)-release (x86_64-pc-linux-gnu)
#cat (GNU coreutils) 8.21
#md5sum (GNU coreutils) 8.21
#mktemp (GNU coreutils) 8.21
#xargs (GNU findutils) 4.4.2
#first argument is temp directory where scripts will be processed
echo "\$1: $1"
#Additional argument (optional), this could be used to provide additional
#arguments to scripts
echo "\$2: $2"
#save the directory where process scripts are located
SCRIPTDIR="${SCRIPTDIR:-$PWD}"
#cd to temp directory which should have been first argument; stop if that fails
cd "$1" || exit 1
#execute all process scripts
"${SCRIPTDIR}/process.sh"