@erochest
Last active December 21, 2015 17:59

Notes

This contains the files I used to perform the timings, as well as the timings themselves.

The timings cover building one bag from 60,000 small files and one bag from a single large (10GB) file. Scripts related to the bag with many files are named like *-lots, and scripts related to the bag with one large file are named like *-large.

What I'm Timing

Ruby

$ ruby --version
ruby 2.0.0p247 (2013-06-27 revision 41674) [x86_64-darwin12.4.0]

The Ruby implementation is a fork of https://github.com/tipr/bagit with these changes.

The actual script that I'm running for this uses these changes. It's listed below as bagit-dir.

GoLang

$ go version
go version go1.1.2 darwin/amd64

The Go implementation is https://github.com/APTrust/bagins.

Generating Input

I generated the input files using the rand-lots and rand-large scripts. Output went into the directories bag-lots and bag-large.
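For readers who would rather not drive dd from a shell loop, the small-file input could be produced along these lines. This is only a sketch, not one of the scripts used for the timings: generate_random_files and its max_bytes parameter are made-up names, and the 32,767-byte ceiling is an assumption chosen to mirror the upper bound of bash's $RANDOM.

```ruby
require 'fileutils'
require 'securerandom'

# Hypothetical helper (not part of the original scripts): writes `count`
# files of random bytes into `out_dir` and returns their paths.
# Each file gets a size drawn uniformly from 1..max_bytes.
def generate_random_files(out_dir, count, max_bytes: 32_767)
  FileUtils.mkdir_p(out_dir)
  Array.new(count) do |n|
    path = File.join(out_dir, "data-#{n}.binary")
    File.binwrite(path, SecureRandom.random_bytes(rand(1..max_bytes)))
    path
  end
end
```

Calling generate_random_files('bag-lots', 60_000) would give a directory of roughly the same shape as the one rand-lots produces, though the exact bytes and sizes will differ.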

Timing

Timings were done with the time-lots and time-large scripts. Each script runs each implementation five times under the UNIX time utility. It's kind of the sledgehammer approach to benchmarking.
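The same run-it-five-times idea can be sketched in Ruby, which makes the numbers easier to collect than copying them out of time's output. This is an illustration under assumptions, not what produced the timings: time_command and the Timing struct are hypothetical names, and user/sys are read from the child-process counters in Process.times.

```ruby
# Hypothetical record for one run: wall-clock, user CPU, and system CPU seconds.
Timing = Struct.new(:real, :user, :sys)

# Hypothetical helper: runs `command` (an argv array) `runs` times and
# returns one Timing per run. Raises if the command fails or cannot start.
def time_command(command, runs: 5)
  Array.new(runs) do
    cpu_before  = Process.times
    wall_before = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    ok = system(*command, out: File::NULL)
    wall_after = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    cpu_after  = Process.times
    raise "command failed: #{command.join(' ')}" unless ok

    # cutime/cstime accumulate CPU time of child processes that have been
    # waited for, so the difference is the CPU time of this one run.
    Timing.new(wall_after - wall_before,
               cpu_after.cutime - cpu_before.cutime,
               cpu_after.cstime - cpu_before.cstime)
  end
end
```

For example, time_command(['./bin/bagit-dir', 'bags/bag-lots-rb', 'bag-lots/']) would return five Timing records, provided the output bag is removed between runs the way the shell scripts do.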

#!/usr/bin/env ruby
# usage: bagit-dir BAGDIR INPUTDIR
#
# Builds a bag in BAGDIR from every regular file under INPUTDIR.
require 'bagit'

# Both arguments are required.
if ARGV.length < 2
  abort 'usage: bagit-dir BAGDIR INPUTDIR'
end

bag_dir, input_dir = ARGV
bag = BagIt::Bag.new(bag_dir)

# Collect regular files at any depth below the input directory.
files = Dir[File.join(input_dir, '**', '*')].select { |f| File.file?(f) }

files.each do |input_name|
  # Files are added under their basename, so same-named files in
  # different subdirectories collide.
  bag.add_file(File.basename(input_name), input_name, true)
end

bag.write_bag_info
bag.quick_sha1!
#!/bin/bash
# rand-large: writes one 10 GiB file of random bytes (2048-byte blocks x 5242880).
OUTDIR="$HOME/p/bagins/bag-large"
time dd if=/dev/urandom of="$OUTDIR/data-10.binary" bs=2048 count=5242880
# Larger variants (12 GiB through 20 GiB), left disabled:
# time dd if=/dev/urandom of=$OUTDIR/data-12.binary bs=2048 count=6291456
# time dd if=/dev/urandom of=$OUTDIR/data-14.binary bs=2048 count=7340032
# time dd if=/dev/urandom of=$OUTDIR/data-16.binary bs=2048 count=8388608
# time dd if=/dev/urandom of=$OUTDIR/data-18.binary bs=2048 count=9437184
# time dd if=/dev/urandom of=$OUTDIR/data-20.binary bs=2048 count=10485760
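#!/bin/bash
# rand-lots: writes K small files of random bytes, one dd call per file.
K=60000
OUTDIR="$HOME/p/bagins/bag-lots"
for ((n=0; n<K; n++)); do
    echo "$n"
    # $RANDOM is 0..32767; add 1 so dd never gets an invalid bs=0.
    time dd if=/dev/urandom of="$OUTDIR/data-$n.binary" bs=$((RANDOM + 1)) count=1
done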
#!/bin/bash
# time-large: times both implementations K times on the one-large-file input.
DATADIR=./bag-large/
OUTPUTDIR=./bags
PREFIX=$OUTPUTDIR/bag-large-
K=5

for ((n=0; n<K; n++)); do
    echo "Timing Ruby bagit ($n) ..."
    # Remove any previous bag so each run starts from scratch.
    rm -rf "${PREFIX}rb" && time ./bin/bagit-dir "${PREFIX}rb" "$DATADIR"
done
rm -rf "${PREFIX}rb"

for ((n=0; n<K; n++)); do
    echo "Timing go bagit ($n) ..."
    rm -rf "${PREFIX}go" && time ./bin/bagmaker -algo=sha1 -dir="$OUTPUTDIR" -name=bag-large-go -payload="$DATADIR"
done
rm -rf "${PREFIX}go"
# Output:
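#!/bin/bash
# time-lots: times both implementations K times on the many-small-files input.
WORKING=$(pwd)
DATADIR=$WORKING/bag-lots/
OUTPUTDIR=$WORKING/bags
PREFIX=$OUTPUTDIR/bag-lots-
K=5

for ((n=0; n<K; n++)); do
    echo "Timing Ruby bagit ($n) ..."
    # Remove any previous bag so each run starts from scratch.
    rm -rf "${PREFIX}rb"
    time "$WORKING/bin/bagit-dir" "${PREFIX}rb" "$DATADIR"
done

for ((n=0; n<K; n++)); do
    echo "Timing go bagit ($n) ..."
    # bagmaker is run from inside the data directory with -payload=.
    pushd "$DATADIR"
    rm -rf "${PREFIX}go"
    time "$WORKING/bin/bagmaker" -algo=sha1 -dir="$OUTPUTDIR" -name=bag-lots-go -payload=.
    popd
done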
engine  test           real    user      sys
ruby    lots        378.649  27.163  123.151
ruby    lots        386.593  26.368  109.979
ruby    lots        430.980  27.264  133.121
ruby    lots        328.839  26.475  127.351
ruby    lots        329.551  25.325  126.748
go      lots        148.867   9.740   48.095
go      lots        195.299  10.217   47.960
go      lots        152.904  10.061   49.178
go      lots        160.352   9.993   48.938
go      lots        182.452  10.003   47.701
ruby    large 10g   981.635  39.656   26.424
ruby    large 10g  1399.347  36.550   28.453
ruby    large 10g  1029.103  39.370   25.654
ruby    large 10g   885.638  39.514   26.910
ruby    large 10g   863.371  39.684   28.630
go      large 10g   579.519  47.592   16.573
go      large 10g    622.52  46.448   17.481
go      large 10g   658.321  45.850   16.513
go      large 10g   941.326  43.202   16.779
go      large 10g   677.335  46.151   17.504
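
With only five runs per case and a fair amount of spread between them, it helps to reduce the table above to a mean, minimum, and maximum per engine and test. The following is a sketch: the real times are copied from the table, while REAL_SECONDS, mean, and summarize are names made up for this illustration.

```ruby
# Wall-clock ("real") values copied from the timings table, keyed by
# [engine, test].
REAL_SECONDS = {
  %w[ruby lots]  => [378.649, 386.593, 430.980, 328.839, 329.551],
  %w[go lots]    => [148.867, 195.299, 152.904, 160.352, 182.452],
  %w[ruby large] => [981.635, 1399.347, 1029.103, 885.638, 863.371],
  %w[go large]   => [579.519, 622.52, 658.321, 941.326, 677.335]
}.freeze

# Arithmetic mean of a non-empty list of numbers.
def mean(values)
  values.sum / values.size
end

# Maps each [engine, test] key to its mean, min, and max.
def summarize(samples)
  samples.transform_values do |values|
    { mean: mean(values), min: values.min, max: values.max }
  end
end

summarize(REAL_SECONDS).each do |(engine, test), stats|
  puts format('%-5s %-6s mean=%.3f min=%.3f max=%.3f',
              engine, test, stats[:mean], stats[:min], stats[:max])
end
```

The minimum is often the more stable figure for this kind of benchmark, since outliers such as the 1399.347 Ruby run pull the mean upward.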