@gerrard00
Created September 13, 2017 05:57
Xml Compression tests
using System;
using System.Diagnostics;
using System.IO;
using System.Xml;

namespace Test
{
    class Test
    {
        static void Main()
        {
            // Baseline: time how long it takes to load the uncompressed XML file into an XmlDocument.
            var stopWatch = new Stopwatch();
            stopWatch.Start();

            using (var inputStream = File.OpenRead(@"c:\junk\supp2017.xml"))
            {
                var doc = new XmlDocument();
                doc.Load(inputStream);
            }

            stopWatch.Stop();
            Console.WriteLine("Loading xml {0}", stopWatch.Elapsed.TotalMilliseconds);
            Console.WriteLine("Done");
        }
    }
}
using System;
using System.Diagnostics;
using System.IO;
using System.IO.Compression;
using System.Xml;

namespace Test
{
    class Test
    {
        static void Main()
        {
            // Compressed variant: time loading the same document from a deflate-compressed
            // file, decompressing on the fly while XmlDocument reads the stream.
            var stopWatch = new Stopwatch();
            stopWatch.Start();

            using (var inputStream = File.OpenRead(@"c:\junk\supp2017.xml.cmp"))
            using (var decompressingStream = new DeflateStream(inputStream, CompressionMode.Decompress))
            {
                var doc = new XmlDocument();
                doc.Load(decompressingStream);
            }

            stopWatch.Stop();
            Console.WriteLine("Loading xml {0}", stopWatch.Elapsed.TotalMilliseconds);
            Console.WriteLine("Done");
        }
    }
}
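
The gist doesn't show how supp2017.xml.cmp was produced. Since the second test decompresses with DeflateStream, a minimal sketch that would create a compatible file, reusing the paths from the tests above, could be:

using System.IO;
using System.IO.Compression;

namespace Test
{
    class Compress
    {
        static void Main()
        {
            // Sketch only: write a deflate-compressed copy of the XML file so the
            // decompression test above has something to read.
            using (var input = File.OpenRead(@"c:\junk\supp2017.xml"))
            using (var output = File.Create(@"c:\junk\supp2017.xml.cmp"))
            using (var compressor = new DeflateStream(output, CompressionMode.Compress))
            {
                input.CopyTo(compressor);
            }
        }
    }
}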
@gerrard00 (Author)

Please run a test and demonstrate the difference in performance. I think that would be more valuable than conjecture. At this point I still think the time taken to load that entire .NET object graph, with hundreds of thousands of objects taking up gigabytes of memory, is the more expensive part of the process.

Disk I/O is much slower than memory I/O; I don't think anyone disputes that. The question is whether that difference is larger than the cost of whatever other work you do once the data is loaded into memory. I don't think you can make that assertion. Loading gigabytes worth of data into hundreds of thousands of .NET objects is slow and expensive.

If the issue was disk caching wouldn't the first run of the tests have been much slower than the subsequent runs?
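
One way to check would be to time the same load twice within a single process, ideally right after a reboot: if the file-system cache is what matters, the second, warm run should come out clearly faster than the first. A rough sketch, reusing the path from the tests above:

using System;
using System.Diagnostics;
using System.IO;
using System.Xml;

class CacheCheck
{
    static void Main()
    {
        // Sketch: load the same file twice in one process. A large gap between the
        // runs would point at the file-system cache rather than XML parsing cost.
        for (int run = 1; run <= 2; run++)
        {
            var stopWatch = Stopwatch.StartNew();
            using (var inputStream = File.OpenRead(@"c:\junk\supp2017.xml"))
            {
                var doc = new XmlDocument();
                doc.Load(inputStream);
            }
            stopWatch.Stop();
            Console.WriteLine("Run {0}: {1}ms", run, stopWatch.Elapsed.TotalMilliseconds);
        }
    }
}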

@robert4 commented Sep 13, 2017

Indeed, the first run of the tests should have been noticeably slower than the subsequent runs. How much slower is debatable, but it should have been clearly slower to some degree. Since that was not the case, it indicates that disk caching affected the tests.

I have downloaded the supp2017.xml file you used. (That's 572M, not 572K.) Then I used a different approach: I read the whole file into memory and measured the time of only XmlDocument.Load() on it:

// Read the whole file into memory first, so the timed section covers only
// XmlDocument.Load() and not disk I/O (needs System, System.IO and System.Xml).
byte[] data = File.ReadAllBytes(@"G:\supp2017.xml");
var m = new MemoryStream(data);
var stopWatch = new System.Diagnostics.Stopwatch();
stopWatch.Start();
var doc = new XmlDocument();
doc.Load(m);
stopWatch.Stop();
Console.WriteLine("XmlDocument.Load(): {0}ms", stopWatch.Elapsed.TotalMilliseconds);
Console.WriteLine("Done");

This reported 9.1-9.3 seconds on my computer. That is the cost of building the object graph in the given .NET environment, work done entirely in memory. To this I should add the time it takes to load the file, with or without compression. In fact what matters is not the sum but the ratio of its two components to each other (graph building vs. loading). Since it's difficult to measure the file-loading time precisely (because of the file-system cache), I use estimates based on everyday practice. We don't need the precise loading time anyway; we only want to know whether it is much more or much less than 9s.

Clearly that depends on the speed of the media the XML file is read from. I usually see about 100M/s read speed with my HDD, so I estimate a cold load of the 572M uncompressed file at approx. 6s. When the file is compressed to 44M, it would load in about 0.5s, plus roughly 3s for decompression (I measured that with a program similar to the one above, which read the compressed file into a byte[] array and then timed decompression + XmlDocument.Load() together).
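
The program for that second measurement isn't shown; a sketch of what it might look like (the compressed file's path on G: is an assumption) is:

using System;
using System.Diagnostics;
using System.IO;
using System.IO.Compression;
using System.Xml;

class CompressedLoadTest
{
    static void Main()
    {
        // Read the compressed file into memory first, so the timed section covers
        // only decompression + XmlDocument.Load(), not disk I/O.
        byte[] data = File.ReadAllBytes(@"G:\supp2017.xml.cmp");   // path is a guess

        var stopWatch = Stopwatch.StartNew();
        using (var memory = new MemoryStream(data))
        using (var decompressor = new DeflateStream(memory, CompressionMode.Decompress))
        {
            var doc = new XmlDocument();
            doc.Load(decompressor);
        }
        stopWatch.Stop();

        Console.WriteLine("Decompress + XmlDocument.Load(): {0}ms", stopWatch.Elapsed.TotalMilliseconds);
    }
}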

You are right that this approx. 6s is not larger but in fact smaller than the 9s cost of XmlDocument.Load(). So the cost of the in-memory work is not dwarfed by the disk I/O, and you are right that the in-memory work is the more expensive part of the process.
Yet the lack of compression still makes the whole process about 20% slower (15s vs. 12.5s), and the effect gets more pronounced as the speed of the media decreases:

Media             Speed    Loading uncompressed   Loading compressed and decompressing   Speed loss if not compressed
internal hdd      100M/s   6s                     3.5s                                   ×1.2 = (9+6)/(9+3.5)
external hdd      30M/s    19s                    4.5s                                   ×2.1 ≈ (9+19)/(9+4.5)
pendrive or NAS   10M/s    57s                    7.5s                                   ×4   ≈ (9+57)/(9+7.5)
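
For reference, the "speed loss" column is simply (9s parse + uncompressed load) divided by (9s parse + compressed load and decompression); a small sketch that reproduces those ratios from the table's numbers:

using System;

class SpeedLoss
{
    static void Main()
    {
        const double parse = 9.0;                       // measured XmlDocument.Load() time, in seconds
        string[] media  = { "internal hdd", "external hdd", "pendrive or NAS" };
        double[] plain  = { 6.0, 19.0, 57.0 };          // estimated uncompressed load times
        double[] packed = { 3.5,  4.5,  7.5 };          // estimated compressed load + decompression times

        for (int i = 0; i < media.Length; i++)
        {
            double ratio = (parse + plain[i]) / (parse + packed[i]);
            Console.WriteLine("{0}: x{1:0.0} slower without compression", media[i], ratio);
        }
    }
}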

@robert4 commented Sep 14, 2017

You convinced me that XmlDocument.Load() is slow enough to be on par with the extremely slow disk I/O (especially for GB-sized documents), but it still isn't slow enough to render compression worthless, as you stated. So your reasoning is not completely wrong, and I'm willing to remove the downvote from your answer. But that can only be done after the answer is edited, because more than 2 days have elapsed. Thank you for the constructive discussion.
