Sample from arXiv, October 2009 (yymm=0910):
- 5281 HTML files
- the biggest file is 20MB: arXiv 0910.2294, with 1472 math formulas
- HTML preview in CorTeX
- a possible cause: the very long auto-generated id/xref attributes, e.g. (a quick measurement is sketched after the example):
<ci id="S7.Ex655.m1.9.9.1.1.1.1.1.1.1.1.1.1.2.cmml"
xref="S7.Ex655.m1.9.9.1.1.1.1.1.1.1.1.1.1.2">Θ</ci>
Experiment: compressing the directory of these sources with different algorithms and tools (a reproduction sketch is at the end of these notes):
algorithm | command | size (MB) | pack time (min) | unpack (min) |
---|---|---|---|---|
unpacked | | 6997 | | |
lzop | tar --lzop -cvf | 1500 | <1 | |
gzip | tar -czvf | 789 | 2.5 | 0.5 |
zip | zip -9 -r | 772 | 8 | 0.6 |
bzip2 | tar -cjvf | 530 | 10 | 2.2 |
lzip | tar --lzip -cvf | 559 | 31 | 1.1 |
xz | tar -cJvf | 487 | 37 | 0.67 |
7-Zip LZMA | advzip -3 | 715 | 284 | |
zopfli | advzip -4 | 711 | 385 | |
- lzop is substantially faster than alternatives (<1 min)
- zip and bzip2 are reasonable, ~10 min
- lzip is slowish, ~30 min
- xz is substantially slower than alternatives (>35 min)
- advzip, which claims extreme size reduction, is too slow to use in practice and, what's worse, isn't particularly effective size-wise either.
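
For reference, a minimal Python sketch to reproduce the pack-time and size numbers in the table above, assuming the sources sit in a directory named 0910/; the archive names are made up, and the advzip rows are omitted because advzip recompresses an existing .zip rather than packing a directory tree:

```python
# Sketch: time each packing command and report the resulting archive size.
import subprocess
import time
from pathlib import Path

SRC = "0910"  # hypothetical directory holding the 5281 HTML files

COMMANDS = [
    ("lzop",  "0910.tar.lzo", ["tar", "--lzop", "-cf", "0910.tar.lzo", SRC]),
    ("gzip",  "0910.tar.gz",  ["tar", "-czf", "0910.tar.gz", SRC]),
    ("zip",   "0910.zip",     ["zip", "-9", "-r", "-q", "0910.zip", SRC]),
    ("bzip2", "0910.tar.bz2", ["tar", "-cjf", "0910.tar.bz2", SRC]),
    ("lzip",  "0910.tar.lz",  ["tar", "--lzip", "-cf", "0910.tar.lz", SRC]),
    ("xz",    "0910.tar.xz",  ["tar", "-cJf", "0910.tar.xz", SRC]),
]

for name, archive, cmd in COMMANDS:
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    minutes = (time.perf_counter() - start) / 60
    size_mb = Path(archive).stat().st_size / 2**20
    print(f"{name:6s} {size_mb:7.0f} MB  {minutes:5.1f} min")
```

Unpack times could be collected the same way by timing the corresponding extraction commands on the produced archives.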