kchristidis/protobuf-serialization.md

Last active June 27, 2025 14:10

Notes on protocol buffers and deterministic serialization (or lack thereof)

There doesn't seem to be a good resource online describing the issues with protocol buffers and deterministic serialization (or lack thereof). This is a collection of links on the subject.

Protocol Buffers v3.0.0 release notes:

The deterministic serialization is, however, NOT canonical across languages; it is also unstable across different builds with schema changes due to unknown fields.

Maps documentation:

Wire format ordering and map iteration ordering of map values is undefined, so you cannot rely on your map items being in a particular order.
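To make the consequence concrete, here is a minimal sketch that hand-encodes a `map<string, int32>` field, assuming the standard wire layout in which each map entry is a length-delimited submessage with the key in field 1 and the value in field 2. The field number 7 and the helper names are hypothetical, chosen only for illustration; no protobuf library is used.

```python
import hashlib

def encode_varint(value):
    """Encode a non-negative integer as a minimal base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_map_entry(field_number, key, value):
    """One map entry: submessage with key (field 1, string) and value (field 2, varint)."""
    key_bytes = key.encode("utf-8")
    entry = (bytes([0x0A]) + encode_varint(len(key_bytes)) + key_bytes
             + bytes([0x10]) + encode_varint(value))
    return encode_varint((field_number << 3) | 2) + encode_varint(len(entry)) + entry

def encode_map(field_number, items):
    """Emit entries in exactly the order given, as a serializer iterating a map would."""
    return b"".join(encode_map_entry(field_number, k, v) for k, v in items)

a = encode_map(7, [("x", 1), ("y", 2)])
b = encode_map(7, [("y", 2), ("x", 1)])

# Same two entries, opposite order: the bytes differ, and so does any hash or signature over them.
print(a.hex())
print(b.hex())
print(hashlib.sha256(a).hexdigest() == hashlib.sha256(b).hexdigest())  # False
```

Both byte strings describe the same map, so a parser yields equal messages from either one; only a byte-level comparison notices the difference.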

Encoding & Field Order documentation:

While you can use field numbers in any order in a .proto, when a message is serialized its known fields should be written sequentially by field number, as in the provided C++, Java, and Python serialization code. This allows parsing code to use optimizations that rely on field numbers being in sequence. However, protocol buffer parsers must be able to parse fields in any order, as not all messages are created by simply serializing an object – for instance, it's sometimes useful to merge two messages by simply concatenating them.
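As a rough illustration of the "parse fields in any order" requirement, the sketch below hand-encodes a message with two varint fields in both orders and decodes each with a toy parser. The field numbers and values are arbitrary, and the parser handles only wire type 0; it is not how any real protobuf runtime is implemented.

```python
def encode_varint(value):
    """Encode a non-negative integer as a minimal base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(buf, pos):
    """Decode a varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

def encode_varint_field(field_number, value):
    """Tag (field number, wire type 0) followed by the varint value."""
    return encode_varint((field_number << 3) | 0) + encode_varint(value)

def parse_varint_fields(buf):
    """Toy parser for messages that contain only varint fields."""
    fields = {}
    pos = 0
    while pos < len(buf):
        tag, pos = decode_varint(buf, pos)
        if tag & 0x7 != 0:
            raise ValueError("this sketch only handles wire type 0")
        value, pos = decode_varint(buf, pos)
        fields[tag >> 3] = value
    return fields

ascending = encode_varint_field(1, 150) + encode_varint_field(2, 7)
descending = encode_varint_field(2, 7) + encode_varint_field(1, 150)

print(ascending.hex())   # 0896011007
print(descending.hex())  # 1007089601
print(parse_varint_fields(ascending) == parse_varint_fields(descending))  # True
```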

Jason Bouzane:

Proto3 does not help you. There are at least two places in proto3 that allow equivalent messages to differ in their serialized form. One is field order. While the proto3 specification recommends that fields be written in numerical order, this is not required, and it explicitly requires parsers to deal with fields out of order. The second is that packed repeated fields may be specified any number of times and they are to be concatenated. While the specification recommends against encoding more than one packed repeated field for a particular tag number in a message, it does require that parsers deal with this situation correctly.
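The packed-field point can be sketched the same way. Below, a packed repeated varint field (hypothetical field number 4) is written once as a single chunk and once as two chunks; the toy parser concatenates the chunks, as the quoted text says parsers must. This is a hand-rolled illustration, not library code.

```python
def encode_varint(value):
    """Encode a non-negative integer as a minimal base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(buf, pos):
    """Decode a varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

def encode_packed(field_number, values):
    """One packed chunk: tag with wire type 2, payload length, then the varints."""
    payload = b"".join(encode_varint(v) for v in values)
    return encode_varint((field_number << 3) | 2) + encode_varint(len(payload)) + payload

def parse_packed(buf, field_number):
    """Toy parser: concatenate every packed chunk found for the given field."""
    values = []
    pos = 0
    while pos < len(buf):
        tag, pos = decode_varint(buf, pos)
        if tag != (field_number << 3) | 2:
            raise ValueError("this sketch only handles one packed field")
        length, pos = decode_varint(buf, pos)
        end = pos + length
        while pos < end:
            value, pos = decode_varint(buf, pos)
            values.append(value)
    return values

one_chunk = encode_packed(4, [3, 270, 86942])
two_chunks = encode_packed(4, [3, 270]) + encode_packed(4, [86942])

print(one_chunk.hex())   # 2206038e029ea705
print(two_chunks.hex())  # 2203038e0222039ea705
print(parse_packed(one_chunk, 4) == parse_packed(two_chunks, 4))  # True
```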

[...]

In any case, the upshot of this is that while a particular implementation of the proto library may deterministically produce the same serialized proto every time when given a particular proto message, there's no guarantee that two different proto libraries will serialize it in the same way, nor are there any guarantees that any particular proto library serializer will be stable over time. While I doubt that any official Google implementation would ever change the serialization, third party implementations may do whatever they like. For example, some serializers may choose to output the fields in hash order instead of ascending order, and that could even make the serialization non-deterministic between invocations of the program.

Feng Xiao:

The nondeterminism comes from unknown fields and a new feature, protobuf maps. If you can guarantee there are no such fields in your proto, the protobuf library will always serialize other fields ordered by field number and thus should output the same bytes.

Petteri Aimonen:

In general, the same data will serialize in exactly the same way.

However, this is not guaranteed by the protobuf specifications. For example, the following differences in encoding are allowable and must decode to the same result in all conforming libraries:

Encoding fields in different order than the tag number order.

Encoding packed fields as unpacked.

Encoding integers as longer varint byte sequences than needed.

Encoding same (non-repeated) field multiple times.

Probably others.
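Two of the items above (over-long varints and a singular field encoded more than once) can be shown with a small hand-rolled decoder. This is a sketch under the usual wire-format rules, where a later occurrence of a singular scalar field replaces an earlier one; the byte strings are constructed by hand for illustration.

```python
def decode_varint(buf, pos=0):
    """Decode a varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

def parse_varint_fields(buf):
    """Toy parser for varint-only messages; a repeated occurrence overwrites the earlier one."""
    fields = {}
    pos = 0
    while pos < len(buf):
        tag, pos = decode_varint(buf, pos)
        value, pos = decode_varint(buf, pos)
        fields[tag >> 3] = value
    return fields

# The integer 1 as a minimal varint and as a padded, two-byte varint.
minimal = bytes([0x01])
padded = bytes([0x81, 0x00])
print(decode_varint(minimal)[0], decode_varint(padded)[0])  # 1 1

# Field 1 written once, versus written twice with the last value winning.
once = bytes([0x08, 0x09])
twice = bytes([0x08, 0x05, 0x08, 0x09])
print(parse_varint_fields(once) == parse_varint_fields(twice))  # True
```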

pherl:

The main concern that the deterministic serialization isn't canonical is due to the unknown fields. As string and message type share the same wire type, when parsing an unknown string/message type, the parser has no idea whether to recursively canonicalize the unknown field. The cross-language inconsistency is mainly due to string field comparison performance, i.e. java/objc use UTF-16 encodings, which have different orderings than UTF-8 strings due to surrogate pairs.
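The surrogate-pair effect is easy to reproduce outside protobuf. The sketch below sorts two strings once by their UTF-8 bytes and once by their UTF-16 code units (big-endian bytes compare the same way as 16-bit code units). The two keys are arbitrary examples; whether a given runtime actually orders map keys either way is an assumption here, not something this snippet verifies.

```python
# One character from the upper Basic Multilingual Plane and one supplementary
# character, which UTF-16 encodes as a surrogate pair starting with 0xD83D.
keys = ["\uFF5E", "\U0001F600"]

# UTF-8 order follows code point order: U+FF5E (EF BD 9E) < U+1F600 (F0 9F 98 80).
by_utf8 = sorted(keys, key=lambda s: s.encode("utf-8"))

# UTF-16 order compares code units: 0xD83D (lead surrogate) < 0xFF5E.
by_utf16 = sorted(keys, key=lambda s: s.encode("utf-16-be"))

print([hex(ord(k)) for k in by_utf8])   # ['0xff5e', '0x1f600']
print([hex(ord(k)) for k in by_utf16])  # ['0x1f600', '0xff5e']
```

Two serializers that both "sort map keys" can therefore still emit entries in different orders if one compares UTF-8 bytes and the other compares UTF-16 code units.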

anderson-dan-w commented Mar 13, 2018

Thanks for consolidating these, making it clear that things aren't super clear. Exactly what I needed to be sure (sure that I can't count on deterministic serialization, that is).

MBoldyrev commented Nov 22, 2019

Thank you for bringing this together. Please fix the Encoding & Field Order documentation link in your gist, it leads to this same gist now (I think it was supposed to point at the docs). Also, here are several other related snippets:

From C++ API documentation on method SetSerializationDeterministic that enables deterministic serialization (disabled by default):
https://github.com/protocolbuffers/protobuf/blob/a1bb147e96b6f74db6cdf3c3fcb00492472dbbfa/src/google/protobuf/io/coded_stream.h#L834-L846

// Deterministic serialization, if requested, guarantees that for a given
// binary, equal messages will always be serialized to the same bytes. This
// implies:
// . repeated serialization of a message will return the same bytes
// . different processes of the same binary (which may be executing on
// different machines) will serialize equal messages to the same bytes.
//
// Note the deterministic serialization is NOT canonical across languages; it
// is also unstable across different builds with schema changes due to unknown
// fields. Users who need canonical serialization, e.g., persistent storage in
// a canonical form, fingerprinting, etc., should define their own
// canonicalization specification and implement the serializer using
// reflection APIs rather than relying on this API.

There is an analogous method in Java API with a similar note.
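The suggestion to "define their own canonicalization specification" might look roughly like the following. This is only a sketch of one possible rule set (fields emitted in ascending field-number order, minimal varints, no unknown fields), applied to a plain dict rather than through any reflection API; `canonical_encode` and its input shape are hypothetical and not part of protobuf.

```python
def encode_varint(value):
    """Encode a non-negative integer as a minimal base-128 varint."""
    if value < 0:
        raise ValueError("this sketch does not handle negative integers")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def canonical_encode(fields):
    """Encode {field_number: int | bytes} with fields in ascending field-number order.

    ints become varint fields (wire type 0); bytes become length-delimited
    fields (wire type 2). Anything else is rejected rather than guessed at.
    """
    out = bytearray()
    for number in sorted(fields):
        value = fields[number]
        if isinstance(value, int):
            out += encode_varint((number << 3) | 0) + encode_varint(value)
        elif isinstance(value, bytes):
            out += encode_varint((number << 3) | 2) + encode_varint(len(value)) + value
        else:
            raise TypeError("unsupported value type in this sketch")
    return bytes(out)

# Insertion order of the dict does not affect the output.
first = canonical_encode({2: b"hi", 1: 150})
second = canonical_encode({1: 150, 2: b"hi"})
print(first.hex())      # 08960112026869
print(first == second)  # True
```

A real specification would also have to pin down map key ordering, default-value handling, and nested messages, which this sketch leaves out.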

Encoding docs:

By default, repeated invocations of serialization methods on the same protocol buffer message instance may not return the same byte output; i.e. the default serialization is not deterministic.
Deterministic serialization only guarantees the same byte output for a particular binary. The byte output may change across different versions of the binary.

Author

kchristidis commented Nov 27, 2019

@MBoldyrev: Thanks for suggesting the edit (done), and for adding more snippets!

rsmets commented Aug 18, 2021

Thank you for the comprehensive references on the topic. I find it odd that still, there is no way to cleanly enforce deterministic byte serialization with protos. Everything was smooth sailing for us across various languages (js, java, swift) until we started to handle signatures over a message with a Struct field. =/

fmg-lydonchandra commented May 23, 2023

Is this still current, or has any of the above gone stale?

cheako commented May 23, 2023

A valid question, but I sus your motivations. Unless you need to be told that deterministic is never expected to be a consideration and that hasn't changed. I think chat would be a better place for this discussion.

fmg-lydonchandra commented May 23, 2023

OK, thanks for confirming, @cheako. I was under the impression that when deterministic serialization is used and the schema is identical, serialization would produce the same binary result in the Java library and the C++ library.
Obviously I was incorrect.

Found this note from protobuf Java binding.

Note the deterministic serialization is NOT canonical across languages; it is also unstable across different builds with schema changes due to unknown fields. Users who need canonical serialization, e.g. persistent storage in a canonical form, fingerprinting, etc, should define their own canonicalization specification and implement the serializer using reflection APIs rather than relying on this API.

cheako commented May 23, 2023

You could use rust with JNI to get Hash and Eq trait implementations derived.

fmg-lydonchandra commented Jun 27, 2023

Will the byte output be guaranteed to be IDENTICAL between a Windows build and a Linux build that use the same C++ protobuf version and the same message schema (with deterministic serialization enabled)?

Encoding docs:

By default, repeated invocations of serialization methods on the same protocol buffer message instance may not return the same byte output; i.e. the default serialization is not deterministic.
Deterministic serialization only guarantees the same byte output for a particular binary. The byte output may change across different versions of the binary.

caspermeijn commented Feb 13, 2024

The text linked to as Encoding & Field Order documentation has changed since the creation of this document.

New text:

Field numbers may be declared in any order in a .proto file. The order chosen has no effect on how the messages are serialized.

When a message is serialized, there is no guaranteed order for how its known or unknown fields will be written. Serialization order is an implementation detail, and the details of any particular implementation may change in the future. Therefore, protocol buffer parsers must be able to parse fields in any order.

justusranvier commented Jun 27, 2025

The text linked to as Encoding & Field Order documentation has changed since the creation of this document.

In earlier versions of Protobuf, Google made promises about a set of circumstances under which determinism could be guaranteed, and it has since broken those promises in newer versions of the library. Back in 2014, a project I worked on relied on those promises for components which are now very painful to change, because back then Google didn't yet have a reputation for breaking every promise it ever made.

Originally (proto2 era) it was the case that if you didn't use map types, and if you only used the official Google library instead of a third-party library which might serialize fields in arbitrary orders, then identical schemas were guaranteed to have deterministic output.