Let's Talk About Protocol Buffers

First of all, they're not nearly as complicated and low level as the name makes them sound.

When I heard the word protocol I pictured network switches and embedded programming, and that's not what this is.

Heck, the word buffer is incredibly loaded too imo. It's used loosely.

According to the website, Protocol Buffers are "a language-neutral, platform-neutral, extensible mechanism for serializing structured data – think XML, but smaller, faster, and simpler."

That's all true and makes sense once you understand them, but let's unpack it a bit.

If I had to describe them in one sentence to most web developers I'd say something like this.

Protocol Buffers are like JSON, except the payloads are smaller and you can generate stub code for working with known types in any language.

A little more compelling, right?

I will come back to the generated code and types later, but first I will explain the "like JSON but smaller" part.

Picture a JSON payload of 10 simple "user" objects, where each item has an id prop that is a number, and a name prop that is a string. Actually picture the text.

Notice how many of the raw characters are just the property names repeated over and over. It's literally the majority of the text in the payload.

Now imagine you have a million user objects instead of 10. At 10 characters per object for "id" and "name" with their quotes, you are sending roughly 10 MB of repeated prop names you already know.

If you're at all concerned with how much data you are making your users download, several megabytes of property names hardly seems like a good use of your bandwidth budget.
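
To put a rough number on that, here is a small Python sketch that builds the 10-user payload and counts the characters spent on property names (including their quotes and the colon after each). The sample names are invented for illustration.

```python
import json

# Ten hypothetical users; the names are made up purely for illustration.
names = ["Ace", "Bea", "Cal", "Dee", "Eli", "Fay", "Gus", "Hal", "Ivy", "Jon"]
users = [{"id": i, "name": n} for i, n in enumerate(names, start=1)]

# Compact JSON with no extra whitespace.
payload = json.dumps(users, separators=(",", ":"))

# Characters spent on keys, counting their quotes plus the colon that follows.
key_chars = sum(len(json.dumps(key)) + 1 for user in users for key in user)

share = key_chars / len(payload)
print(payload[:45] + "...")
print(f"{key_chars} of {len(payload)} characters are property names ({share:.0%})")
```

Longer values shrink that share and longer or more numerous field names grow it, so treat the result as a ballpark for small records like this one.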

With Protocol Buffers, you'd define your user type like this.

message User {
  uint32 id = 1;
  string name = 2;
}

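And then you can serialize them using the field numbers (the = 1 and = 2) to identify each field, instead of spelling out its name. In a positional text form, a user with id 123 and name "Ace" looks like this instead of typical JSON.

[123, "Ace"]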
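
As a rough illustration of that positional idea in Python: this is not the real protobuf library, and `USER_FIELDS`, `to_positional`, and `from_positional` are hypothetical names invented for the sketch. It also assumes the field numbers are contiguous starting at 1, as they are in the User message.

```python
import json

# Field number -> field name, mirroring the User message definition.
# Hypothetical lookup table, standing in for what generated code would know.
USER_FIELDS = {1: "id", 2: "name"}


def to_positional(user: dict) -> list:
    """Drop the names: values are emitted in field-number order."""
    return [user[USER_FIELDS[n]] for n in sorted(USER_FIELDS)]


def from_positional(values: list) -> dict:
    """Put the names back using the same table on the receiving side."""
    return {USER_FIELDS[n]: values[n - 1] for n in sorted(USER_FIELDS)}


user = {"id": 123, "name": "Ace"}
keyed = json.dumps(user, separators=(",", ":"))
positional = json.dumps(to_positional(user), separators=(",", ":"))

print(keyed)       # {"id":123,"name":"Ace"}
print(positional)  # [123,"Ace"]
```

Both sides need the same table, which is exactly the role the shared message definition plays.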


[[An Introduction to Protocol Buffer Definitions]]

[[Modeling Complex Types with Protocol Buffers]]

Now again imagine a million users encoded like that versus spelling out every field name a million times.

Protocol Buffer payloads are often a fraction of the size of the equivalent JSON, and the gap grows with the number of fields and the length of their names.

And yet, in this form it's still JSON. No property names, but it's a valid JSON array.

That's what I mean by JSON being the "transport format" here. Note that this array form is not the official JSON representation: the proto3 JSON mapping keeps the field names.

Separate from the format is the encoding. The example above is a simple text encoding, which makes it easy to decipher what the data is if we know the message definition.

But you can also use a binary encoding, which results in serialized data that is even smaller (!). For a user with id 123 and name "Ace" it is just 7 bytes, shown here in hex.

08 7b 12 03 41 63 65


This of course makes debugging more difficult, but in a production environment there's no reason not to use it.

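Note that the binary encoded data is not JSON anymore. It's a compact stream of bytes where each field is a small tag (the field number plus a type code) followed by the value. This binary form is what the Protocol Buffers docs call the wire format.

Where base64 comes in is when those bytes have to pass through something text-only, for example the grpc-web-text mode of grpc-web or a string inside a JSON document. It's a wrapper around the binary encoding, not the encoding itself.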
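
For a feel of what the binary encoding contains, here is a hand-rolled Python sketch that encodes the User message without the protobuf library. `encode_varint` and `encode_user` are hypothetical helpers written for this example. They cover only the two field types used here and skip details such as omitting default values, so real code should use the generated classes instead.

```python
import base64


def encode_varint(value: int) -> bytes:
    """Encode a non-negative int as a varint: 7 bits per byte, low bits first."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_user(user_id: int, name: str) -> bytes:
    """Encode User { uint32 id = 1; string name = 2; } by hand."""
    name_bytes = name.encode("utf-8")
    return (
        encode_varint((1 << 3) | 0)    # tag: field 1, wire type 0 (varint)
        + encode_varint(user_id)
        + encode_varint((2 << 3) | 2)  # tag: field 2, wire type 2 (length-delimited)
        + encode_varint(len(name_bytes))
        + name_bytes
    )


data = encode_user(123, "Ace")
print(data.hex(" "))                    # 08 7b 12 03 41 63 65
print(base64.b64encode(data).decode())  # CHsSA0FjZQ==
```

That is seven bytes, against 23 characters for the same user as compact keyed JSON. The base64 line shows what those bytes look like once wrapped for a text-only channel.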

That's probably a good place to take a breath and see if there are questions about encoding or transport.

Any of that confusing or unclear before we move on to code generation and the other benefits of protocol buffers?

https://developers.google.com/protocol-buffers/docs/proto3 shows the options available when defining your protocol buffers. Types can be nested and imported from other files, and pretty much any structured data can be represented.

You define your whole data model, and then you generate code for working with it.

The code is generated by protoc, a command line tool that processes protocol buffer definitions and spits out code for the languages you request. Which languages are supported? Most of them.

C++, Go, Java, Python, Ruby, C#, Objective-C, PHP, and JS.

[[Comparing Protocol Buffer Usage in Varying Languages]]



Then on the back end you do this kind of thing in Python or whatever you're using.

import datamodel_pb2  # module generated by protoc from datamodel.proto

u = datamodel_pb2.User()
u.id = 123
u.name = "Ace"

And to send it over the network you just do u.SerializeToString(), which despite the name returns bytes in the binary encoding, and ship it off.

Then on the client or wherever, you use generated code again, this time to read it. In JS it's something like this.

const u = User.deserialize(stringFromPython);

And now u is an instance of User with the properties sent over the wire. It might have methods like getName, etc.
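
To make the read side less magical, here is a hedged Python sketch of roughly what a generated reader does with the binary encoding. `decode_varint` and `decode_user` are hypothetical helpers for this example and handle only the two wire types the User message needs. A real project would rely on the generated code.

```python
def decode_varint(data: bytes, pos: int) -> tuple:
    """Read one varint starting at pos; return (value, next position)."""
    value = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: last byte of this varint
            return value, pos
        shift += 7


def decode_user(data: bytes) -> dict:
    """Decode User { uint32 id = 1; string name = 2; } from binary bytes."""
    user = {"id": 0, "name": ""}  # defaults used when a field is absent
    pos = 0
    while pos < len(data):
        tag, pos = decode_varint(data, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:  # varint value
            value, pos = decode_varint(data, pos)
            if field_number == 1:
                user["id"] = value
        elif wire_type == 2:  # length-delimited value
            length, pos = decode_varint(data, pos)
            chunk = data[pos:pos + length]
            pos += length
            if field_number == 2:
                user["name"] = chunk.decode("utf-8")
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
    return user


# The bytes for id=123, name="Ace" in the binary encoding.
wire = bytes.fromhex("087b1203416365")
print(decode_user(wire))  # {'id': 123, 'name': 'Ace'}
```

Fields with numbers the reader does not recognize are read and ignored here, which hints at why old clients can keep working when a message gains new fields.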

The exact syntax of the generated code can vary depending on your toolchain, but the functionality is consistent.

[[Protocol Buffer Output with Varying Toolchains]]

No matter the environment, your data objects have a known structure and API. If you use TypeScript this extends all the way to coding for the browser.

[grpc] (grpc-web)

In case it's not obvious, that's a super helpful thing to have on a project that crosses language and environment boundaries. So, you know, all of them.

No more "what is the server sending me?" or "how do I structure this payload?".

[[Stop Using Ad Hoc JSON for Data Transport]]

There's also a more amorphous benefit to formally documenting your data model that I will mostly leave for another time, but I think it's significant.

I think it helps at all levels of the stack to understand how the various parts of the system are structured and fit together.

I can see the structure of data the embedded system is sending the server, and what that server is sending to other servers, and what is being sent to the client.

All by reading the .proto files I find and seeing where they're used.

[The Intangible Benefits of Adopting Protocol Buffers]

Oh, since they're fully serializable, you can also easily persist protocol buffers. Write them to disk, IndexedDB, NoSQL, whatever.

[Protocol Buffers, Databases, and Hybrid Storage]

They make great test fixtures.

[Protocol Buffers Make Great Serializable Test Fixtures]