Exploring compression over HTML5+MathML from arXiv

Sample from arXiv, October 2009 (yymm=0910):

  • 5281 HTML files
  • the biggest file is 20 MB in size, arXiv 0910.2294, with 1472 math formulas
  • HTML preview in CorTeX
  • possible cause? very long ids, such as the one below (a rough byte-share check is sketched after this list):
  <ci id="S7.Ex655.m1.9.9.1.1.1.1.1.1.1.1.1.1.2.cmml"
    xref="S7.Ex655.m1.9.9.1.1.1.1.1.1.1.1.1.1.2">Θ</ci>
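A quick way to test the id hypothesis is to measure how many bytes the `id`/`xref` attributes account for in the suspect file. A minimal sketch, assuming the conversion has been saved locally as `0910.2294.html` (the filename is a placeholder):

```bash
# Rough estimate of the byte share of id/xref attributes in one HTML+MathML file.
# The filename is a placeholder for the locally saved arXiv 0910.2294 conversion.
FILE=0910.2294.html
total=$(wc -c < "$FILE")
attrs=$(grep -oE '(id|xref)="[^"]*"' "$FILE" | wc -c)
echo "total bytes:   $total"
echo "id/xref bytes: $attrs"
echo "share:         $(echo "scale=1; 100 * $attrs / $total" | bc)%"
```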

Experiment in compressing the directory of these sources with different algorithms and tools (a sketch of the timing harness appears after the observations below):

| algorithm   | command            | size (MB) | pack time (min) | unpack time (min) |
|-------------|--------------------|-----------|-----------------|-------------------|
| (unpacked)  |                    | 6997      |                 |                   |
| lzop        | `tar --lzop -cvf`  | 1500      | <1              |                   |
| gzip        | `tar -czvf`        | 789       | 2.5             | 0.5               |
| zip         | `zip -9 -r`        | 772       | 8               | 0.6               |
| bzip2       | `tar -cjvf`        | 530       | 10              | 2.2               |
| lzip        | `tar --lzip -cvf`  | 559       | 31              | 1.1               |
| xz          | `tar -cJvf`        | 487       | 37              | 0.67              |
| 7-Zip LZMA  | `advzip -3`        | 715       | 284             |                   |
| zopfli      | `advzip -4`        | 711       | 385             |                   |
  • lzop is substantially faster than alternatives (<1 min)
  • zip and bzip2 are reasonable, ~10 min
  • lzip is slowish, ~30 min
  • xz is substantially slower than alternatives (>35 min)
  • advzip, which claims extreme size reduction, is too slow to use in practice and, what's worse, isn't that effective size-wise.
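For context, the pack/unpack timings above can be collected with a small shell harness along the following lines. This is a sketch, not the exact procedure used: the directory and archive names are placeholders, and only three of the tools from the table are shown, using the same flags as listed there.

```bash
#!/usr/bin/env bash
set -e
# Directory holding the 5281 HTML+MathML files (name is a placeholder).
DIR=arxiv-0910-html

# gzip via tar: pack, then unpack into a scratch directory
time tar -czvf 0910.tar.gz "$DIR" > /dev/null
mkdir -p scratch-gzip && time tar -xzf 0910.tar.gz -C scratch-gzip

# xz via tar (best ratio in the table, but slowest to pack)
time tar -cJvf 0910.tar.xz "$DIR" > /dev/null
mkdir -p scratch-xz && time tar -xJf 0910.tar.xz -C scratch-xz

# zip at maximum deflate level
time zip -9 -r 0910.zip "$DIR" > /dev/null
mkdir -p scratch-zip && time unzip -q 0910.zip -d scratch-zip

# resulting archive sizes in MB
du -m 0910.tar.gz 0910.tar.xz 0910.zip
```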