Exploring compression over HTML5+MathML from arXiv

Sample from arXiv, October 2009 (yymm=0910):

  • 5281 HTML files
  • the biggest file is 20 MB in size, arXiv 0910.2294, with 1472 math formulas
  • HTML preview in CorTeX
  • possible cause? very long ids, such as the one below (a rough byte-share check is sketched after this list):
  <ci id="S7.Ex655.m1.9.9.1.1.1.1.1.1.1.1.1.1.2.cmml"
    xref="S7.Ex655.m1.9.9.1.1.1.1.1.1.1.1.1.1.2">Θ</ci>
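A quick way to test the id hypothesis is to measure how many bytes the `id`/`xref` attributes account for in the suspect file. A minimal sketch, assuming the conversion has been saved locally as `0910.2294.html` (the filename is a placeholder):

```bash
# Rough estimate of the byte share of id/xref attributes in one HTML+MathML file.
# The filename is a placeholder for the locally saved arXiv 0910.2294 conversion.
FILE=0910.2294.html
total=$(wc -c < "$FILE")
attrs=$(grep -oE '(id|xref)="[^"]*"' "$FILE" | wc -c)
echo "total bytes:   $total"
echo "id/xref bytes: $attrs"
echo "share:         $(echo "scale=1; 100 * $attrs / $total" | bc)%"
```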

Experiment in compressing the directory of these sources with different algorithms and tools (a sketch of the timing harness appears after the observations below):

| algorithm   | command            | size (MB) | pack time (min) | unpack time (min) |
|-------------|--------------------|-----------|-----------------|-------------------|
| (unpacked)  |                    | 6997      |                 |                   |
| lzop        | `tar --lzop -cvf`  | 1500      | <1              |                   |
| gzip        | `tar -czvf`        | 789       | 2.5             | 0.5               |
| zip         | `zip -9 -r`        | 772       | 8               | 0.6               |
| bzip2       | `tar -cjvf`        | 530       | 10              | 2.2               |
| lzip        | `tar --lzip -cvf`  | 559       | 31              | 1.1               |
| xz          | `tar -cJvf`        | 487       | 37              | 0.67              |
| 7-Zip LZMA  | `advzip -3`        | 715       | 284             |                   |
| zopfli      | `advzip -4`        | 711       | 385             |                   |
  • lzop is substantially faster than alternatives (<1 min)
  • zip and bzip2 are reasonable, ~10 min
  • lzip is slowish, ~30 min
  • xz is substantially slower than alternatives (>35 min)
  • advzip, which claims extreme size reduction, is too slow to use in practice and, what's worse, isn't that effective size-wise.
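For context, the pack/unpack timings above can be collected with a small shell harness along the following lines. This is a sketch, not the exact procedure used: the directory and archive names are placeholders, and only three of the tools from the table are shown, using the same flags as listed there.

```bash
#!/usr/bin/env bash
set -e
# Directory holding the 5281 HTML+MathML files (name is a placeholder).
DIR=arxiv-0910-html

# gzip via tar: pack, then unpack into a scratch directory
time tar -czvf 0910.tar.gz "$DIR" > /dev/null
mkdir -p scratch-gzip && time tar -xzf 0910.tar.gz -C scratch-gzip

# xz via tar (best ratio in the table, but slowest to pack)
time tar -cJvf 0910.tar.xz "$DIR" > /dev/null
mkdir -p scratch-xz && time tar -xJf 0910.tar.xz -C scratch-xz

# zip at maximum deflate level
time zip -9 -r 0910.zip "$DIR" > /dev/null
mkdir -p scratch-zip && time unzip -q 0910.zip -d scratch-zip

# resulting archive sizes in MB
du -m 0910.tar.gz 0910.tar.xz 0910.zip
```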