@rahulsom
Last active July 21, 2022 13:17
Protocol Buffers as a format for FHIR data
# Created by .ignore support plugin (hsz.mobi)
### JetBrains template
# Covers JetBrains IDEs: IntelliJ, RubyMine, PhpStorm, AppCode, PyCharm, CLion, Android Studio and Webstorm
# Reference: https://intellij-support.jetbrains.com/hc/en-us/articles/206544839
# User-specific stuff:
.idea/**/workspace.xml
.idea/**/tasks.xml
.idea/dictionaries
# Sensitive or high-churn files:
.idea/**/dataSources/
.idea/**/dataSources.ids
.idea/**/dataSources.xml
.idea/**/dataSources.local.xml
.idea/**/sqlDataSources.xml
.idea/**/dynamic.xml
.idea/**/uiDesigner.xml
# Gradle:
.idea/**/gradle.xml
.idea/**/libraries
# CMake
cmake-build-debug/
# Mongo Explorer plugin:
.idea/**/mongoSettings.xml
## File-based project format:
*.iws
## Plugin-specific files:
# IntelliJ
out/
# mpeltonen/sbt-idea plugin
.idea_modules/
# JIRA plugin
atlassian-ide-plugin.xml
# Cursive Clojure plugin
.idea/replstate.xml
# Crashlytics plugin (for Android Studio and IntelliJ)
com_crashlytics_export_strings.xml
crashlytics.properties
crashlytics-build.properties
fabric.properties
### Project specific files
data

Protocol Buffers as a format for FHIR data

A colleague of mine was attempting to improve throughput of an application that was being fed FHIR data, and noticed some problems.

The system sending us data was sending bundles as pretty-printed JSON. These pretty-printed bundles were close to 2 MB in size. Removing pretty printing reduced the bundles to about 1 MB. Compressing these bundles reduced the size to 70 kB.

That was a sign that the data does not carry much entropy. This is expected, since JSON repeats field names over and over. We also discovered that our system's performance was limited in part by the amount of IO involved in transmitting these bundles.
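The effect is easy to reproduce with any repetitive JSON document. The following is a minimal sketch, not the original measurement: the `bundle` dictionary is a made-up stand-in for a FHIR bundle, and the exact byte counts will differ from the 2 MB, 1 MB, and 70 kB figures above.

```python
import gzip
import json

# Hypothetical stand-in for a FHIR bundle: many entries sharing the same field names.
bundle = {
    "resourceType": "Bundle",
    "type": "transaction",
    "entry": [
        {
            "resource": {
                "resourceType": "Observation",
                "id": f"Observation{i}",
                "status": "final",
                "valueQuantity": {"value": 100, "unit": "mg"},
            }
        }
        for i in range(1, 201)
    ],
}

# Same data, three encodings: indented JSON, compact JSON, gzipped compact JSON.
pretty = json.dumps(bundle, indent=2).encode("utf-8")
compact = json.dumps(bundle, separators=(",", ":")).encode("utf-8")
zipped = gzip.compress(compact)

sizes = {"pretty": len(pretty), "compact": len(compact), "gzip": len(zipped)}
print(sizes)
```

Because every entry repeats the same keys, the gzipped form should come out far smaller than either JSON form, which is the same pattern we saw with the real bundles.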

So I decided to investigate using a binary format. There are quite a few popular ones out there - Protobuf, Thrift, Avro, MsgPack.

I started with Avro, but the cycles in the FHIR structures made it unsuitable for FHIR data. Next, I tried Protobuf. This worked out pretty well, and I think it is good enough to start a discussion.

Benchmarks

For the impatient, here are the benchmark results. I used 2 bundles: one about 2 kB as uncompressed pretty-printed JSON, and the other about 20 kB in the same format.

The source files can be modified so that you can run the same test on your own data.

File Sizes

The first table shows file sizes for different scenarios.

Input: input file (generated)
P Json: pretty-printed JSON
P XML: pretty-printed XML
U Json: non-pretty ("ugly") printed JSON
U XML: non-pretty ("ugly") printed XML
Proto: Protobuf binary format

Each size is followed, in parentheses, by the size of the GZip-compressed data. This should reflect IO for a typical web server with GZip encoding enabled.
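To make the size comparison concrete, here is a small sketch that computes how large Protobuf is relative to non-pretty JSON, both raw and gzipped. The numbers are copied from the file-size table in the benchmark output; the `ratio` helper is only an illustration.

```python
# (raw bytes, gzipped bytes) copied from the file-size table in the benchmark output
sizes = {
    "bundle-2k": {"U Json": (1027, 510), "Proto": (455, 382)},
    "bundle-20k": {"U Json": (10869, 1055), "Proto": (4331, 729)},
}

RAW, ZIPPED = 0, 1


def ratio(row, index):
    """Protobuf size as a fraction of the non-pretty JSON size."""
    return row["Proto"][index] / row["U Json"][index]


for name, row in sizes.items():
    print(f"{name}: raw {ratio(row, RAW):.2f}, gzipped {ratio(row, ZIPPED):.2f}")
```

Raw Protobuf is well under half the size of raw JSON for both bundles, and the gap narrows once both are gzipped, because gzip already removes most of the repeated field names from JSON.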

Parse/Serialize Performance

The second table shows the performance for parsing and serializing different formats for each bundle type.

Environment
===========
* Groovy: 2.4.12
* JVM: Java HotSpot(TM) 64-Bit Server VM (25.141-b15, Oracle Corporation)
    * JRE: 1.8.0_141
    * Total Memory: 440.5 MB
    * Maximum Memory: 3641 MB
* OS: Mac OS X (10.13.1, x86_64)

Options
=======
* Warm Up: Auto (- 60 sec)
* CPU Time Measurement: On

Filename             |    Input(  Zipped) |   P Json(  Zipped) |    P XML(  Zipped) |   U Json(  Zipped) |    U XML(  Zipped) |    Proto(  Zipped)
-------------------- + ------------------ + ------------------ + ------------------ + ------------------ + ------------------ + ------------------
bundle-2k.json       |     2028(     585) |     2028(     585) |     2622(     652) |     1027(     510) |     1717(     595) |      455(     382)
bundle-20k.json      |    20327(    1228) |    20327(    1228) |    27782(    1374) |    10869(    1055) |    17655(    1215) |     4331(     729)



                             user  system      cpu     real

Print Pretty JSON  - 2k     75787     520    76307    76722
Print Ugly JSON    - 2k     66706     316    67022    67444
Print Pretty XML   - 2k     78654     624    79278    80399
Print Ugly XML     - 2k     68887     504    69391    70010
Serialize Protobuf - 2k       985       2      987      989
Humanize Protobuf  - 2k     70677     905    71582    73008
Parse Pretty JSON  - 2k     80398     212    80610    80869
Parse Ugly JSON    - 2k     76539     451    76990    77377
Parse Pretty XML   - 2k    123847     597   124444   125119
Parse Ugly XML     - 2k    123952    1715   125667   127153
Parse Protobuf     - 2k      4441     117     4558     4709
Print Pretty JSON  - 20k   764057    2842   766899   769137
Print Ugly JSON    - 20k   680447    1574   682021   682836
Print Pretty XML   - 20k   805454    3954   809408   814134
Print Ugly XML     - 20k   653903    1352   655255   656354
Serialize Protobuf - 20k     9888       7     9895     9900
Humanize Protobuf  - 20k   822226    1850   824076   826057
Parse Pretty JSON  - 20k   763755    3275   767030   770772
Parse Ugly JSON    - 20k   775614    4480   780094   782850
Parse Pretty XML   - 20k  1025954    1102  1027056  1028101
Parse Ugly XML     - 20k   968764    2623   971387   972552
Parse Protobuf     - 20k    43236     717    43953    45119

For JSON and XML formats, I used HAPI-FHIR.

For Protobuf, I created a quick-and-dirty library - fhir-protobuf. It uses the file fhir.schema.json from FHIR’s download section. Naturally, it doesn’t do as much work as HAPI-FHIR does, so this should not be read as an apples-to-apples comparison of performance.

However, given that serialization is nearly two orders of magnitude faster and parsing is more than an order of magnitude faster in these runs, the results might be of interest to some FHIR users. My main interest is the size of the data.
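The speed ratios can be checked directly against the benchmark output. This sketch uses the "real" column for the non-pretty JSON rows and the Protobuf rows; the `speedup` helper is only an illustration, and the caveat about HAPI-FHIR doing more work still applies.

```python
# "real" column from the benchmark output, keyed by bundle size
real_time = {
    "2k": {
        "Print Ugly JSON": 67444,
        "Serialize Protobuf": 989,
        "Parse Ugly JSON": 77377,
        "Parse Protobuf": 4709,
    },
    "20k": {
        "Print Ugly JSON": 682836,
        "Serialize Protobuf": 9900,
        "Parse Ugly JSON": 782850,
        "Parse Protobuf": 45119,
    },
}


def speedup(row, slow, fast):
    """How many times faster the `fast` row is than the `slow` row."""
    return row[slow] / row[fast]


for size, row in real_time.items():
    write = speedup(row, "Print Ugly JSON", "Serialize Protobuf")
    read = speedup(row, "Parse Ugly JSON", "Parse Protobuf")
    print(f"{size}: serialize {write:.1f}x, parse {read:.1f}x")
```

For both bundles, serialization comes out a little under 70 times faster and parsing roughly 16 to 17 times faster.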

An obvious downside is the difficulty in reading binary files without custom tooling. With most of these libraries (Protobuf, Avro, Thrift, MsgPack), this custom tooling is about a dozen lines of code in your favorite scripting language. That should not be a major deterrent. Some of them even offer two-way conversion between JSON and binary.
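To give a feel for how little tooling is needed, here is a rough, library-free sketch that lists the top-level fields of a Protobuf message straight from the wire format. It is an illustration only: it knows nothing about the FHIR schema, so it reports field numbers instead of names and does not descend into nested messages. The function names and the sample bytes are mine, not part of fhir-protobuf. With the generated classes, the same job is mostly a call to `parseFrom` followed by `toString`, as in the benchmark script.

```python
def read_varint(data, pos):
    """Decode one base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear marks the last byte of the varint
            return result, pos
        shift += 7


def dump_fields(data):
    """Return (field_number, wire_type, value) for each top-level field."""
    fields = []
    pos = 0
    while pos < len(data):
        tag, pos = read_varint(data, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:  # varint
            value, pos = read_varint(data, pos)
        elif wire_type == 1:  # fixed 64-bit
            value, pos = data[pos:pos + 8], pos + 8
        elif wire_type == 2:  # length-delimited: string, bytes, or nested message
            length, pos = read_varint(data, pos)
            value, pos = data[pos:pos + length], pos + length
        elif wire_type == 5:  # fixed 32-bit
            value, pos = data[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        fields.append((field_number, wire_type, value))
    return fields


# Hand-built sample: field 1 = varint 150, field 2 = the string "testing".
sample = bytes.fromhex("089601120774657374696e67")
fields = dump_fields(sample)
print(fields)
```

The sample also shows why the binary form is compact: each field costs a one-byte tag here, where JSON would spell out the field name every time it appears.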

The other major downside is the absence of validation. I think neither JSON nor XML offers validation on its own. In the case of HAPI FHIR, validation has been built on top of the parser. Something similar should be possible here as well.

@Grab('com.github.rahulsom:fhir-protobuf-translate:0.1.0')
@Grab('ca.uhn.hapi.fhir:hapi-fhir-structures-dstu3:3.1.0')
@Grab('org.gperfutils:gbench:0.4.3-groovy-2.4')
import ca.uhn.fhir.context.FhirContext
import com.github.rahulsom.fhirprotobuf.Converter
import com.google.protobuf.util.JsonFormat
import org.fhir.stu3.Bundle
import org.hl7.fhir.instance.model.api.IBaseResource
import java.text.DecimalFormat
import java.util.zip.GZIPOutputStream
def rootDir = new File('data')
if (!rootDir.exists()) {
    rootDir.mkdirs()
}

// Convert data/<name>.json (FHIR JSON) to data/<name>.proto.json, then to binary data/<name>.buf.
void createProtoFiles(String name) {
    new File("data/${name}.proto.json").text =
            Converter.fromFhirJson(new File("data/${name}.json").text)
    def builder = Bundle.newBuilder()
    JsonFormat.parser().ignoringUnknownFields().merge(new File("data/${name}.proto.json").text, builder)
    def stream = new File("data/${name}.buf").newOutputStream()
    builder.build().writeTo(stream)
    stream.flush()
    stream.close()
}

new JsonFactory(rootDir: rootDir).
        with {
            createBundle('bundle-2k', [doc, patient])
            createProtoFiles('bundle-2k')
            createBundle('bundle-20k',
                    ([doc, patient] + visits(6) + observations(21)))
            createProtoFiles('bundle-20k')
            // No createBundle call exists for this one; presumably data/bundle-170k.json is supplied by hand.
            createProtoFiles('bundle-170k')
        }

def fhirContext = FhirContext.forDstu3()

// All FHIR JSON inputs (excluding the intermediate .proto.json files), smallest first.
def files = rootDir.
        listFiles().
        findAll { it.name.endsWith('.json') && !it.name.contains('.proto') }.
        sort { it.size() }

// Size in bytes of the GZip-compressed form of the given data.
def zipLength(byte[] s) {
    def targetStream = new ByteArrayOutputStream()
    def zipStream = new GZIPOutputStream(targetStream)
    zipStream.write(s)
    zipStream.close()
    def zipped = targetStream.toByteArray()
    targetStream.close()
    return zipped.length
}
def r = benchmark {
    def types = ['Input', 'P Json', 'P XML', 'U Json', 'U XML', 'Proto']
    def dataFormat = "%-20s | %6s(%6s) | %6s(%6s) | %6s(%6s) | %6s(%6s) | %6s(%6s) | %6s(%6s)"
    println String.sprintf(dataFormat, ['Filename'] + types.collect { [it, 'Zipped'] }.flatten())
    def sepFormat = "%-20s + %14s + %14s + %14s + %14s + %14s + %14s"
    println String.sprintf(sepFormat, ['-' * 20] + types.collect { '-' * 14 })

    files.each { file ->
        // "bundle-2k.json" -> "2k"
        def name = file.name.split(/\./)[0].split('-')[1]
        IBaseResource resource = fhirContext.newJsonParser().parseResource(file.newReader())
        def protoFile = new File(rootDir, "bundle-${name}.buf")
        def protobufBundle = Bundle.parseFrom(protoFile.newInputStream())

        String json = ''
        String xml = ''
        String prettyJson = ''
        String prettyXml = ''
        byte[] protoByteArray = []
        String protoString = ''

        "Print Pretty JSON - $name" {
            def parser = fhirContext.newJsonParser()
            parser.prettyPrint = true
            prettyJson = parser.encodeResourceToString(resource)
        }
        "Print Ugly JSON - $name" {
            def parser = fhirContext.newJsonParser()
            parser.prettyPrint = false
            json = parser.encodeResourceToString(resource)
        }
        "Print Pretty XML - $name" {
            def parser = fhirContext.newXmlParser()
            parser.prettyPrint = true
            prettyXml = parser.encodeResourceToString(resource)
        }
        "Print Ugly XML - $name" {
            def parser = fhirContext.newXmlParser()
            parser.prettyPrint = false
            xml = parser.encodeResourceToString(resource)
        }
        "Serialize Protobuf - $name" {
            protoByteArray = protobufBundle.toByteArray()
        }
        "Humanize Protobuf - $name" {
            protoString = protobufBundle.toString()
        }
        "Parse Pretty JSON - $name" {
            fhirContext.newJsonParser().parseResource(prettyJson)
        }
        "Parse Ugly JSON - $name" {
            fhirContext.newJsonParser().parseResource(json)
        }
        "Parse Pretty XML - $name" {
            fhirContext.newXmlParser().parseResource(prettyXml)
        }
        "Parse Ugly XML - $name" {
            fhirContext.newXmlParser().parseResource(xml)
        }
        "Parse Protobuf - $name" {
            Bundle.parseFrom(protoByteArray)
        }

        // One row of the file-size table: size, then gzipped size, for each format.
        println String.sprintf(dataFormat,
                file.name,
                format(file.size()), format(zipLength(file.bytes)),
                format(prettyJson.length()), format(zipLength(prettyJson.bytes)),
                format(prettyXml.length()), format(zipLength(prettyXml.bytes)),
                format(json.length()), format(zipLength(json.bytes)),
                format(xml.length()), format(zipLength(xml.bytes)),
                format(protoFile.size()), format(zipLength(protoFile.bytes)))
    }
}

// Format a number in engineering notation with an SI suffix (k, M, ...).
private static String format(long number) {
    String[] suffix = ["", "k", "M", "G", "T", "P", "E", "Z", "Y"]
    String r = new DecimalFormat("##0E0").format(number)
    def exponent = r.split('E')[1].toInteger()
    def eng = suffix[exponent / 3 as int]
    r.split('E')[0] + eng
}

println ''
println ''
println ''
r.prettyPrint()
@Grab(group = 'ca.uhn.hapi.fhir', module = 'hapi-fhir-structures-dstu3', version = '3.1.0')
import ca.uhn.fhir.context.FhirContext
import org.hl7.fhir.dstu3.model.*
import static org.hl7.fhir.dstu3.model.Bundle.BundleType.TRANSACTION
import static org.hl7.fhir.dstu3.model.ContactPoint.ContactPointUse.HOME
import static org.hl7.fhir.dstu3.model.ContactPoint.ContactPointUse.WORK
import static org.hl7.fhir.dstu3.model.Enumerations.AdministrativeGender.MALE
import static org.hl7.fhir.dstu3.model.Identifier.IdentifierUse.OFFICIAL
import static org.hl7.fhir.dstu3.model.Identifier.IdentifierUse.SECONDARY
class JsonFactory {
    File rootDir = new File('data')

    private static Bundle.BundleEntryComponent entry(Resource patient) {
        new Bundle.BundleEntryComponent().setResource(patient)
    }

    // Write the given resources as a pretty-printed transaction bundle to <rootDir>/<fileName>.json.
    void createBundle(String fileName, List<Resource> resources) {
        def ctx = FhirContext.forDstu3()
        def bundle = new Bundle().
                setType(TRANSACTION).
                setEntry(resources.collect { entry(it) }).
                setId(UUID.randomUUID().toString())
        def jsonParser = ctx.newJsonParser()
        jsonParser.prettyPrint = true
        new File(rootDir, "${fileName}.json").text = jsonParser.encodeResourceToString(bundle)
    }

    def doc = new Practitioner().with { it.id = "Practitioner_1"; it }.
            setName([
                    new HumanName().setFamily('Kelso').setGiven([new StringType('Bob')])
            ]).
            setGender(MALE).
            setIdentifier([
                    new Identifier().
                            setSystem('https://stmarys.com/practitioners').
                            setValue('BK001'),
            ]).
            setQualification([
                    new Practitioner.PractitionerQualificationComponent().
                            setCode(new CodeableConcept().setCoding([
                                    new Coding('http://codesystem/', 'code', 'Some display name')
                            ]))
            ])

    def patient = new Patient().with { it.id = "Patient_1"; it }.
            setName([new HumanName().
                             setFamily("Doe").
                             setGiven([new StringType("John")])]).
            setAddress([new Address().
                                setUse(Address.AddressUse.HOME).
                                setLine([new StringType('11 Spooner St')]).
                                setCity('Quahog').
                                setState('Rhode Island').
                                setPostalCode('90210').
                                setCountry('United States of America')
            ]).
            setIdentifier([
                    new Identifier().
                            setUse(OFFICIAL).
                            setSystem('https://ssn.gov/id').
                            setValue('101239980'),
                    new Identifier().
                            setUse(SECONDARY).
                            setSystem('https://stmarys.com/patients').
                            setValue('P100390142'),
                    new Identifier().
                            setUse(SECONDARY).
                            setSystem('https://dmv.ca.gov/').
                            setValue('D5103342'),
            ]).
            setActive(true).
            setBirthDate(Date.parse('yyyyMMdd', '19800102')).
            setTelecom([
                    new ContactPoint().setValue('[email protected]').setUse(HOME),
                    new ContactPoint().setValue('+18005359090').setUse(HOME),
                    new ContactPoint().setValue('+18008889797').setUse(WORK),
            ]).
            setGender(MALE).
            setGeneralPractitioner([
                    new Reference().setReference('Practitioner_1')
            ])

    List<Encounter> visits(int count) {
        (1..count).collect { n ->
            // Name the index explicitly: inside `with`, `it` is the Encounter, not the index.
            new Encounter().with { it.id = "Encounter$n"; it }.
                    setIdentifier([
                            new Identifier().
                                    setUse(SECONDARY).
                                    setSystem('https://stmarys.com/encounters').
                                    setValue("E00010$n"),
                    ]).
                    setDiagnosis([
                            new Encounter.DiagnosisComponent()
                    ]).
                    setLength(new Duration().setValue(10).setUnit('days')).
                    setStatus(Encounter.EncounterStatus.FINISHED)
        }
    }

    List<Observation> observations(int count) {
        (1..count).collect { n ->
            // Same fix as in visits(): use the index, not the Observation, in the id.
            new Observation().with { it.id = "Observation$n"; it }.
                    setStatus(Observation.ObservationStatus.FINAL).
                    setIdentifier([
                            new Identifier().
                                    setUse(SECONDARY).
                                    setSystem('https://stmarys.com/observations').
                                    setValue("E00010$n"),
                    ]).
                    setValue(new Quantity().setUnit('mg').setValue(100)).
                    setBodySite(new CodeableConcept().setText("Foo").setCoding([
                            new Coding().setSystem('http://loinc.com').setCode('12043'),
                            new Coding().setSystem('http://snomed.com').setCode('H123-J010'),
                    ]))
        }
    }
}

faraway commented Feb 7, 2018

I ran the test on a real 2 MB bundle.

Environment
===========
* Groovy: 2.4.8
* JVM: Java HotSpot(TM) 64-Bit Server VM (25.91-b14, Oracle Corporation)
    * JRE: 1.8.0_91
    * Total Memory: 256.5 MB
    * Maximum Memory: 3641 MB
* OS: Mac OS X (10.12.3, x86_64)

Options
=======
* Warm Up: Auto (- 60 sec)
* CPU Time Measurement: On

Filename             |  Input(Zipped) | P Json(Zipped) |  P XML(Zipped) | U Json(Zipped) |  U XML(Zipped) |  Proto(Zipped)
-------------------- + -------------- + -------------- + -------------- + -------------- + -------------- + --------------
bundle-2M.json       |  2.01M( 60.7k) |  2.01M( 60.7k) |  2.48M( 67.4k) |   919k( 37.7k) |  1.47M( 47.7k) |   434k( 34.4k)



                             user  system       cpu      real

Print Pretty JSON  - 2M  76390727  333571  76724298  77600570
Print Ugly JSON    - 2M  56845726   20163  56865889  57127932
Print Pretty XML   - 2M  64914837   32971  64947808  65500018
Print Ugly XML     - 2M  58923753  136146  59059899  59481353
Serialize Protobuf - 2M    976486     596    977082    977082
Humanize Protobuf  - 2M  57245356  199879  57445235  57789294
Parse Pretty JSON  - 2M  75509314  166434  75675748  77075182
Parse Ugly JSON    - 2M  66827548  118188  66945736  67324599
Parse Pretty XML   - 2M  89544560  211673  89756233  89983939
Parse Ugly XML     - 2M  80500225  237758  80737983  81181709
Parse Protobuf     - 2M   3172113    8101   3180214   3192213
