Skip to content

Instantly share code, notes, and snippets.

@nc7s
Last active August 15, 2019 17:34
Show Gist options
  • Save nc7s/70425bd083811d2fe0c4f41366f5869e to your computer and use it in GitHub Desktop.
Save nc7s/70425bd083811d2fe0c4f41366f5869e to your computer and use it in GitHub Desktop.
diff-then-copy strategy benchmarks

diff-then-copy strategy benchmarks

methods

  • direct_copy direct copy using language standard library

  • shell_copy direct copy using shell copy command

  • sha1_diff calculate sha1 hashes of both files using language standard library

  • shell_sha1_diff calculate sha1 hashes of both files using /usr/bin/sha1sum

  • shell_diff using diff programs like GNU diff or git-diff to determine difference

  • read_and_compare_whole read files as a whole and compare in-memory

  • read_and_compare_chunk read chunks of files and compare in-memory until differ or end

implementation details

  • direct_copy

    • Python shutil.copyfile()

    • Node fs.writeFileSync(fs.readFileSync) on 7.10, fs.copyFileSync on >= 8.5

  • shell_copy

    • Python subprocess.call(); xcopy on Windows, cp on Linux

    • Node child_process.execSync('copy ..')

  • sha1_diff

    • Python hashlib.sha1()

    • Node crypto.createHash('sha1')

  • shell_sha1_diff

    • Python subprocess.run(['sha1sum', ..], capture_output=True).stdout

    • Node child_process.execSync('sha1sum ..')

  • shell_diff

    • Python subprocess.run(['git-diff', '--no-index', '--name-status'], capture_output=True).stdout

    • Node child_process.execSync('git-diff --no-index --name-status')

  • read_and_compare_chunk files read in 1024 byte chunks

conclusion

  • if files are likely to differ, read in chunks and compare

  • if not,

    • for small files (~100k to ~1m), direct copy - don’t compare at all

    • for big files (~10m to ~100m), diff 'em in the shell then copy if differ

results

on an XPS 13 9370

specs
  • Intel Core i7-8550U 4x 1.8GHz

  • 16GB RAM

  • Debian 10.0 64-bit

  • Python 3.7.4

  • Node 10.15.2

Table 1. ~100k sized files, 100 rounds, results in ms

average

min

max

total

Python direct_copy

0.17

0.14

0.87

17.31

Python shell_copy

2.74

2.28

3.25

273.63

Python sha1_diff

0.31

0.24

0.48

30.71

Python shell_sha1_diff

4.71

4.13

7.28

470.61

Python shell_diff

2.55

2.23

3.11

254.98

Python read_and_compare_whole

0.06

0.06

0.12

6.25

Python read_and_compare_chunk

0.01

0.01

0.03

1.45

Node direct_copy

0.15

0

1

15

Node shell_copy

4.49

3

8

449

Node sha1_diff

0.58

0

2

58

Node shell_sha1_diff

4.41

3

10

441

Node shell_diff

10.84

6

16

1084

Node read_and_compare_whole

0.12

0

1

12

Node read_and_compare_chunk

0.03

0

1

3

Table 2. ~1m sized files, 100 rounds, results in ms

average

min

max

total

Python direct_copy

1.40

0.81

4.17

139.63

Python shell_copy

3.34

2.77

4.48

333.57

Python sha1_diff

2.39

2.17

3.68

239.11

Python shell_sha1_diff

8.35

7.65

11.78

834.59

Python shell_diff

2.65

2.46

3.42

264.43

Python read_and_compare_whole

0.76

0.54

2.58

76.41

Python read_and_compare_chunk

0.02

0.02

0.03

1.78

Node direct_copy

1.84

1

4

184

Node shell_copy

6.48

4

18

648

Node sha1_diff

3.02

2

6

302

Node shell_sha1_diff

12.17

10

19

1217

Node shell_diff

4.4

3

8

440

Node read_and_compare_whole

2.06

1

4

206

Node read_and_compare_chunk

0.02

0

1

2

Table 3. ~10m sized files, 100 rounds, results in ms

average

min

max

total

Python direct_copy

12.65

7.00

21.08

1264.54

Python shell_copy

11.26

10.12

28.93

1126.17

Python sha1_diff

24.55

23.63

31.09

2454.66

Python shell_sha1_diff

42.78

42.20

45.49

4277.63

Python shell_diff

2.57

2.21

3.26

257.058

Python read_and_compare_whole

5.90

5.58

10.01

590.50

Python read_and_compare_chunk

0.02

0.02

0.06

2.00

Node direct_copy

11.63

9

22

1163

Node shell_copy

13.79

12

30

1379

Node sha1_diff

25.92

24

41

2592

Node shell_sha1_diff

47.28

45

68

4728

Node shell_diff

4.92

4

9

492

Node read_and_compare_whole

11.01

9

20

1101

Node read_and_compare_chunk

0.03

0

1

3

Table 4. ~100m sized files, 100 rounds, results in ms

average

min

max

total

Python direct_copy

143.91

70.96

196.10

14390.73

Python shell_copy

131.35

91.16

301.93

13134.70

Python sha1_diff

248.78

242.84

267.23

24877.72

Python shell_sha1_diff

391.50

387.80

404.79

39149.54

Python shell_diff

2.49

2.17

3.16

249.32

Python read_and_compare_whole

54.07

52.60

66.63

5407.09

Python read_and_compare_chunk

0.02

0.02

0.07

2.32

Node direct_copy

117.78

80

177

11778

Node shell_copy

131

89

190

13100

Node sha1_diff

250.32

245

278

25032

Node shell_sha1_diff

410.98

405

444

41098

Node shell_diff

7.22

6

10

722

Node read_and_compare_whole

96.44

93

125

9644

Node read_and_compare_chunk

0.03

0

1

3

fs = require('fs')
crypto = require('crypto')
cp = require('child_process')
sts = 'Lorem ipsum dolor, sit amet consectetur adipisicing elit. Quaerat tenetur porro vero amet eveniet pariatur corrupti asperiores esse culpa tempore enim commodi earum repudiandae, itaque cupiditate. Aperiam ullam id molestiae.\n'
size_in_kb = parseInt(process.argv[process.argv.length - 1])
size_in_kb = parseInt(size_in_kb)
if(size_in_kb === NaN) {
console.log('usage: node test.js size_in_kb rounds')
process.exit()
}
rounds = parseInt(process.argv[process.argv.length - 2])
diffSrc = `a${size_in_kb}.js.txt`
diffDst = `b${size_in_kb}.js.txt`
Array.prototype.sum = function arraySum() {
return this.reduce((prev, cur) => prev + cur)
}
Array.prototype.avg = function arrayAverage() {
return this.sum() / this.length
}
Array.prototype.min = function arrayMinimum() {
let min = this[0]
for(let i = 0; i < this.length; i++) {
if(this[i] < min) {
min = this[i]
}
}
return min
}
Array.prototype.max = function arrayMaximum() {
let max = this[0]
for(let i = 0; i < this.length; i++) {
if(this[i] > max) {
max = this[i]
}
}
return max
}
round = (num, digits) => {
num = num.toString().split('.')
return num[0] + '.' + (num[1] ? num[1].substr(0, digits) : '0')
}
randInt = (a, b) => {
if(!b) { b = a, a = 0 }
return Math.floor(Math.random() * (b - a) + a)
}
randStr = (l, s) => {
l = randInt(100)
s = ''
while(l-- != 0) {
s += 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'[randInt(61)]
}
return s
}
const
_colorPrefix = "\x1b[",
colorCodes = {
reset: 0,
black: 30,
red: 31,
green: 32,
yellow: 33,
blue: 34,
magenta: 35,
cyan: 36,
white: 37
}
setColor = color => _colorPrefix + colorCodes[color] + 'm'
colored = (text, color) => setColor(color) + text + setColor('reset')
timeIt = (rounds, f) => {
let elapsed = [],
start = 0
while(rounds-- != 0) {
start = Date.now()
f()
elapsed.push(Date.now() - start)
}
console.log(colored(f.name, 'cyan'))
console.log(`average|${colored(elapsed.avg(), 'green')}\tmin|${colored(elapsed.min(), 'cyan')}\tmax|${colored(elapsed.max(), 'magenta')}\ttotal|${colored(elapsed.sum(), 'yellow')}`)
}
function direct_copy() {
fs.copyFileSync(diffSrc, diffDst)
}
function shell_copy() {
cp.execSync(`cp ${diffSrc} ${diffSrc}.target`)
}
function sha1_digest(path) {
return crypto.createHash('sha1').update(fs.readFileSync(path)).digest('hex')
}
function sha1_diff() {
return sha1_digest(diffSrc) == sha1_digest(diffDst)
}
function shell_sha1_diff() {
return cp.execSync(`sha1sum ${diffSrc}`) == cp.execSync(`sha1sum ${diffDst}`)
}
function shell_diff() {
let result
try {
result = cp.execSync(`git diff --no-index --name-status ${diffSrc} ${diffDst}`)
} catch(e) {
result = e.stdout
}
return result.length
}
function read_andcompare_whole() {
return fs.readFileSync(diffSrc).toString() === fs.readFileSync(diffDst).toString()
}
function read_andcompare_chunk() {
let srcBuf = new Buffer(1024),
dstBuf = new Buffer(1024),
srcFd = fs.openSync(diffSrc, 'r'),
dstFd = fs.openSync(diffDst, 'r'),
position = 0,
lastReadSrc = -1,
lastReadDst = -1
while(lastReadSrc + lastReadDst != 0) {
lastReadSrc = fs.readSync(srcFd, srcBuf, 0, 1024, position)
lastReadDst = fs.readSync(dstFd, dstBuf, 0, 1024, position)
if(lastReadSrc != lastReadDst || srcBuf.toString() != dstBuf.toString()) {
fs.close(srcFd, f => null)
fs.close(dstFd, f => null)
return false
}
position += 1024
}
fs.close(srcFd, f => null)
fs.close(dstFd, f => null)
return true
}
let fillRounds = Math.ceil(size_in_kb * 1000 / sts.length),
writtenBytes = 0,
changed = null,
pos = 0,
generateStart = Date.now(),
srcStream = new fs.WriteStream(diffSrc),
dstStream = new fs.WriteStream(diffDst)
while(fillRounds-- != 0) {
pos = randInt(sts.length)
changed = sts.substr(0, pos) + randStr() + sts.substr(pos)
srcStream.write(sts)
dstStream.write(changed)
writtenBytes += sts.length + changed.length
}
let intv = setInterval(() => {
let srcSize = fs.lstatSync(diffSrc).size,
dstSize = fs.lstatSync(diffDst).size
if(srcSize + dstSize != writtenBytes) {
return
}
clearInterval(intv)
console.log(`test files generated, using ${colored(Date.now() - generateStart, 'cyan')} ms`)
srcStream.close()
dstStream.close()
console.log(`source file size ${colored(round(srcSize / 1024, 3), 'cyan')}KB, destination file size ${colored(round(dstSize / 1024, 3), 'cyan')}KB, ${colored(rounds, 'cyan')} rounds`)
timeIt(rounds, direct_copy)
timeIt(rounds, shell_copy)
timeIt(rounds, sha1_diff)
timeIt(rounds, shell_diff)
timeIt(rounds, shell_sha1_diff)
timeIt(rounds, read_andcompare_whole)
timeIt(rounds, read_andcompare_chunk)
cp.execSync('rm *.js.txt *.js.txt.target')
}, 10)
import random
import shutil
import string
import time
import sys
import hashlib
import subprocess as sp
import tempfile
import os
ALPHABET = string.ascii_letters + string.digits
sts = 'Lorem ipsum dolor, sit amet consectetur adipisicing elit. Quaerat tenetur porro vero amet eveniet pariatur corrupti asperiores esse culpa tempore enim commodi earum repudiandae, itaque cupiditate. Aperiam ullam id molestiae.\n'
sts_len = len(sts)
try:
rounds = int(sys.argv[-2])
size_in_kb = int(sys.argv[-1])
fill_rounds = int(size_in_kb * 1024 / sts_len)
except:
print('usage: python test.py size_in_kb rounds')
exit()
diff_src, diff_dst = 'a{}.py.txt'.format(size_in_kb), 'b{}.py.txt'.format(size_in_kb)
try:
from ctypes import windll
stdout_handle = windll.kernel32.GetStdHandle(-11)
_color_codes = {
'black': 0,
'blue': 1,
'green': 2,
'cyan': 3,
'red': 4,
'magenta': 5,
'yellow': 6,
'grey': 7
}
def set_console_color(color):
windll.kernel32.SetConsoleTextAttribute(stdout_handle, color)
def print_c(text, color='grey', end='\n'):
set_console_color(color)
sys.stdout.flush()
print(text, end=end)
sys.stdout.flush()
set_console_color('grey')
sys.stdout.flush()
except:
_color_codes = {
'reset': 0,
'black': 30,
'red': 31,
'green': 32,
'yellow': 33,
'blue': 34,
'magenta': 35,
'cyan': 36,
'white': 37
}
def colored(text, color):
return '\x1b[{}m{}\x1b[{}m'.format(_color_codes[color], text, _color_codes['reset'])
def print_c(text, color='white', end='\n'):
print(colored(text, color), end=end)
def round(num, digits):
integer, fraction = str(num).split('.')
if len(fraction) >= digits and int(fraction[digits]) > 4:
digits += 1
return '.'.join((integer, fraction[:digits]))
def timeit(times, func):
elapsed = []
start = 0
for i in range(times):
start = time.time()
func()
elapsed.append(time.time() - start)
print_c(func.__name__, color='cyan')
print_c('average', end='|')
print_c(round(sum(elapsed) / times * 1000, 2), color='green', end='\t')
print_c('min', end='|')
print_c(round(min(elapsed) * 1000, 2), color='cyan', end='\t')
print_c('max', end='|')
print_c(round(max(elapsed) * 1000, 2), color='magenta', end='\t')
print_c('total', end='|')
print_c(round(sum(elapsed) * 1000, 2), color='yellow')
def direct_copy():
shutil.copyfile(diff_src, diff_src + '.target')
def shell_copy():
sp.call(['cp', diff_src, diff_src + '.target'], stdout=sp.DEVNULL)
def sha1_diff():
return hashlib.sha1(open(diff_src, 'rb').read()).hexdigest() == hashlib.sha1(open(diff_dst, 'rb').read()).hexdigest()
def shell_sha1_diff():
return sp.run(['sha1sum', diff_src], capture_output=True).stdout[:40] == sp.run(['sha1sum', diff_dst], capture_output=True).stdout[:40]
def shell_diff():
return len(sp.run(['git', 'diff', '--no-index', '--name-status', diff_src, diff_dst], capture_output=True).stdout) == 0
def read_and_compare_whole():
return open(diff_src, 'rb').read() == open(diff_src, 'rb').read()
def read_and_compare_chunk():
last_read_src = last_read_dst = 0
with open(diff_src, 'rb') as src, open(diff_dst, 'rb') as dst:
read_src, read_dst = src.read(1024), dst.read(1024)
if len(read_src) != len(read_dst) or read_src != read_dst:
return False
return True
generate_start = time.time()
with open(diff_src, 'w') as f1, open(diff_dst, 'w') as f2:
for i in range(fill_rounds):
f1.write(sts)
pos = random.randint(0, sts_len)
f2.write(sts[:pos] + ''.join([random.choice(ALPHABET) for j in range(random.randint(0, 50))]) + sts[pos + 1:])
print_c('test files generated, using', end=' ')
print_c(round((time.time() - generate_start) * 1000, 5), color='cyan', end='')
print('ms')
print_c('source file size', end=' ')
print_c(round(os.lstat(diff_src).st_size / 1024, 2), color='cyan', end=' ')
print_c('KB, destination file size', end=' ')
print_c(round(os.lstat(diff_dst).st_size / 1024, 2), color='cyan', end=' ')
print_c('KB, ', end='')
print_c(rounds, color='cyan', end=' ')
print_c('rounds')
for func in (direct_copy, shell_copy, sha1_diff, shell_sha1_diff, shell_diff, read_and_compare_whole, read_and_compare_chunk):
timeit(rounds, func)
[os.unlink(fn) for fn in [diff_src, diff_dst, diff_src + '.target']]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment