A colleague of mine was attempting to improve throughput of an application that was being fed FHIR data, and noticed some problems.
The upstream system was sending bundles as pretty-printed JSON, close to 2 MB each. Removing the pretty printing reduced them to about 1 MB, and compressing them brought the size down to about 70 kB.
That compression ratio is a sign that the data carries little entropy. This is expected: JSON repeats the same field names over and over. We also found that our system's performance was in part limited by the amount of IO involved in transmitting these bundles.
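To make the entropy point concrete, here is a small sketch (using Python's standard library and a made-up bundle, not the real data) of how heavily a payload with repeated field names compresses:

```python
import gzip
import json

# Hypothetical stand-in for a FHIR bundle: many entries repeating
# the same field names, as real bundles do.
bundle = {
    "resourceType": "Bundle",
    "entry": [
        {"resource": {"resourceType": "Observation",
                      "status": "final", "id": str(i)}}
        for i in range(500)
    ],
}

pretty = json.dumps(bundle, indent=2).encode("utf-8")
ugly = json.dumps(bundle, separators=(",", ":")).encode("utf-8")
zipped = gzip.compress(pretty)

# Dropping pretty printing shrinks the text; gzip shrinks it far
# further, because the repeated field names compress extremely well.
print(len(pretty), len(ugly), len(zipped))
```

The exact numbers depend on the data, but the ordering (pretty > ugly >> gzipped) mirrors the 2 MB / 1 MB / 70 kB observation above.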
So I decided to investigate using a binary format. There are quite a few popular ones out there - Protobuf, Thrift, Avro, MsgPack.
I started with Avro, but cycles in the structures made it unsuitable for FHIR data. Next I tried Protobuf. This worked out pretty well, and I think it is good enough to enable a discussion.
For the impatient, here are the benchmark results. I used two bundles: one about 2 kB as uncompressed pretty-printed JSON, the other about 20 kB.
The source files can be modified to test against your own data.
The first table shows file sizes for different scenarios.
- Input: Input file (generated)
- P Json: Pretty printed JSON
- P XML: Pretty printed XML
- U Json: Non-pretty (ugly) printed JSON
- U XML: Non-pretty (ugly) printed XML
- Proto: Protobuf binary format
Each column shows, in parentheses, the size of the GZip-compressed data; this should reflect the IO for a typical web server with GZip encoding enabled.
The second table shows the performance for parsing and serializing different formats for each bundle type.
Environment
===========
* Groovy: 2.4.12
* JVM: Java HotSpot(TM) 64-Bit Server VM (25.141-b15, Oracle Corporation)
* JRE: 1.8.0_141
* Total Memory: 440.5 MB
* Maximum Memory: 3641 MB
* OS: Mac OS X (10.13.1, x86_64)
Options
=======
* Warm Up: Auto (- 60 sec)
* CPU Time Measurement: On
Filename        | Input (Zipped) | P Json (Zipped) | P XML (Zipped) | U Json (Zipped) | U XML (Zipped) | Proto (Zipped)
----------------|----------------|-----------------|----------------|-----------------|----------------|---------------
bundle-2k.json  | 2028 (585)     | 2028 (585)      | 2622 (652)     | 1027 (510)      | 1717 (595)     | 455 (382)
bundle-20k.json | 20327 (1228)   | 20327 (1228)    | 27782 (1374)   | 10869 (1055)    | 17655 (1215)   | 4331 (729)

All sizes are in bytes.
user system cpu real
Print Pretty JSON - 2k 75787 520 76307 76722
Print Ugly JSON - 2k 66706 316 67022 67444
Print Pretty XML - 2k 78654 624 79278 80399
Print Ugly XML - 2k 68887 504 69391 70010
Serialize Protobuf - 2k 985 2 987 989
Humanize Protobuf - 2k 70677 905 71582 73008
Parse Pretty JSON - 2k 80398 212 80610 80869
Parse Ugly JSON - 2k 76539 451 76990 77377
Parse Pretty XML - 2k 123847 597 124444 125119
Parse Ugly XML - 2k 123952 1715 125667 127153
Parse Protobuf - 2k 4441 117 4558 4709
Print Pretty JSON - 20k 764057 2842 766899 769137
Print Ugly JSON - 20k 680447 1574 682021 682836
Print Pretty XML - 20k 805454 3954 809408 814134
Print Ugly XML - 20k 653903 1352 655255 656354
Serialize Protobuf - 20k 9888 7 9895 9900
Humanize Protobuf - 20k 822226 1850 824076 826057
Parse Pretty JSON - 20k 763755 3275 767030 770772
Parse Ugly JSON - 20k 775614 4480 780094 782850
Parse Pretty XML - 20k 1025954 1102 1027056 1028101
Parse Ugly XML - 20k 968764 2623 971387 972552
Parse Protobuf - 20k 43236 717 43953 45119
For JSON and XML formats, I used HAPI-FHIR.
For Protobuf, I created a quick-and-dirty library - fhir-protobuf.
It uses the file fhir.schema.json from FHIR's download section.
Naturally, it doesn't do as much work as HAPI-FHIR does, so this is not an apples-to-apples performance comparison.
However, given that serialization is a couple of orders of magnitude faster and parsing is more than an order of magnitude faster, the numbers may be of interest to some FHIR users. My main interest, though, is the size of the data.
An obvious downside is the difficulty of reading binary files without custom tooling. With most of these libraries (Protobuf, Avro, Thrift, MsgPack), that custom tooling is about a dozen lines of code in your favorite scripting language, so it should not be a major deterrent. Some of them even support two-way conversion between JSON and the binary format.
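As a sketch of how small such tooling can be, here is a minimal reader for Protobuf's wire format in Python (standard library only). Field names require the schema, so it reports only field numbers and raw values, much like `protoc --decode_raw`:

```python
def read_varint(buf, i):
    """Decode a base-128 varint at offset i; return (value, next offset)."""
    shift = value = 0
    while True:
        byte = buf[i]
        i += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, i
        shift += 7

def decode_raw(buf):
    """Split a Protobuf message into (field number, wire type, raw value) tuples."""
    fields, i = [], 0
    while i < len(buf):
        key, i = read_varint(buf, i)
        num, wire = key >> 3, key & 7
        if wire == 0:                      # varint
            val, i = read_varint(buf, i)
        elif wire == 1:                    # 64-bit
            val, i = buf[i:i + 8], i + 8
        elif wire == 2:                    # length-delimited (strings, submessages)
            size, i = read_varint(buf, i)
            val, i = buf[i:i + size], i + size
        elif wire == 5:                    # 32-bit
            val, i = buf[i:i + 4], i + 4
        else:
            raise ValueError("unsupported wire type %d" % wire)
        fields.append((num, wire, val))
    return fields

# Example message: field 1 = varint 150, field 2 = string "hi".
print(decode_raw(bytes([0x08, 0x96, 0x01, 0x12, 0x02, 0x68, 0x69])))
# → [(1, 0, 150), (2, 2, b'hi')]
```

Length-delimited values can be recursively fed back through `decode_raw` to walk nested messages, which is all a quick inspection tool really needs.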
The other major downside is the absence of validation. Plain JSON and XML parsers don't offer validation either; in HAPI-FHIR's case, validation has been built on top of the parser. Something similar should be possible in this case as well.
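A minimal sketch of what such post-parse validation could look like (the field rules here are hypothetical illustrations, not HAPI-FHIR's actual validator):

```python
# Hypothetical required-field rules layered on top of a parsed bundle;
# real FHIR validation (cardinality, bindings, profiles) is far richer.
REQUIRED = {"Bundle": ["type"], "Observation": ["status", "code"]}

def validate(resource):
    """Collect error strings for missing required fields, recursing into entries."""
    rtype = resource.get("resourceType")
    if rtype is None:
        return ["missing resourceType"]
    errors = []
    for field in REQUIRED.get(rtype, []):
        if field not in resource:
            errors.append("%s: missing required field '%s'" % (rtype, field))
    for entry in resource.get("entry", []):
        errors.extend(validate(entry.get("resource", {})))
    return errors

print(validate({"resourceType": "Bundle", "type": "collection",
                "entry": [{"resource": {"resourceType": "Observation",
                                        "status": "final"}}]}))
# → ["Observation: missing required field 'code'"]
```

Since the checks run against the parsed structure rather than the wire format, the same layer would work whether the bundle arrived as JSON, XML, or Protobuf.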
I also tested with a real 2 MB bundle.