Lars

Protocol buffers transform data into a compact binary stream for storage or transmission. In this blog post, we will use a proto definition of a sample message and serialize it to binary data.

The sample message

For our sample we use the following .proto file:

syntax = "proto3";

message Fruit {
int32 weight = 1;
string name = 2;
}

This defines a Fruit message with two fields, weight and name. Each field has a type, a name and a field number.

We will serialize a simple sample message with the following values:

weight: 150
name: 'Apple'

Each of the field value pairs is encoded as a combination of the field number, the wire type and a payload. The binary stream always starts with the tag of the first field. The tag is a varint-encoded value consisting of the field number and the wire type.

The varint

Varint is a method of serializing integers using one or more bytes. Smaller numbers take fewer bytes. This encoding is used for the tag of each field as well as for several types in protobuf (int32, enum, bool, and others). Each byte of a varint carries seven bits of the number's value and uses its eighth (most significant) bit as a continuation bit to indicate whether more bytes follow. Here are the steps involved in encoding integers into varints:

  1. Grouping: The integer is broken into 7-bit groups from the least significant to the most significant bits.
  2. Continuation Bit: Each 7-bit group is prefixed with a continuation bit. This bit is set to 1 for all byte groups except the last, which is set to 0. This bit tells the decoder whether to expect another byte.
  3. Combination: These groups are then combined in a little-endian format, where the least significant group (the rightmost 7 bits) is stored first.

Let's look at encoding the number 150 as a varint:

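10010110           # decimal 150 in binary
0000001 0010110    # split into 7-bit groups (padded with leading zeros)
0010110 0000001    # change to little endian (least significant group first)
10010110 00000001  # add continuation bits (1 = more bytes follow, 0 = last byte)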

As you can see, the number 150 in varint is 10010110 00000001 in binary or 96 01 in hexadecimal.
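The three steps can be sketched in a few lines of Python. This is only an illustration for non-negative integers, not how any particular protobuf library implements it; the function name encode_varint is made up for this post.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a protobuf varint (illustrative sketch)."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        group = value & 0x7F          # take the lowest 7 bits
        value >>= 7                   # drop them from the remaining value
        if value:
            out.append(group | 0x80)  # more groups follow: set the continuation bit
        else:
            out.append(group)         # last group: continuation bit stays 0
            return bytes(out)


print(encode_varint(150).hex(" "))  # 96 01
```

Because the lowest group is emitted first, the little-endian ordering from step 3 falls out of the loop naturally.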

The main benefit of the varint encoding is its space efficiency for small numbers. Numbers smaller than 128 are stored in just one byte. As numbers get larger, additional bytes are used. This is very efficient for data that is usually small but can occasionally be large (which in practice is often the case).

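For fields that almost always contain large numbers, the varint encoding is inefficient because every byte spends one bit on the continuation flag: a 32-bit value of 2^28 or more needs five bytes as a varint but only four as a fixed32. In this case fixed-size types, for example fixed32, should be preferred.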
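To get a feeling for where the trade-off flips, here is a small sketch that compares the size of a varint with the size of a fixed32 for a few values. The helper varint_size is a made-up name; struct.pack("<I", ...) is used as a stand-in for the fixed32 encoding, which is four little-endian bytes.

```python
import struct


def varint_size(value: int) -> int:
    """Number of bytes a non-negative integer occupies as a varint."""
    size = 1
    while value >= 0x80:  # more than 7 bits left: another byte is needed
        value >>= 7
        size += 1
    return size


for value in (1, 150, 2**28 - 1, 2**28, 2**32 - 1):
    fixed = struct.pack("<I", value)  # fixed32: always 4 little-endian bytes
    print(f"{value:>10}: varint {varint_size(value)} bytes, fixed32 {len(fixed)} bytes")
```

Up to 2^28 - 1 the varint is never larger than the fixed32; from 2^28 onwards it needs a fifth byte.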

The wire types

Protobuf knows six different wire types, two of which are deprecated. A wire type describes the encoding format of a payload.

Value  Name    Proto types
0      VARINT  int32, int64, uint32, uint64, sint32, sint64, bool, enum
1      I64     fixed64, sfixed64, double
2      LEN     string, bytes, embedded messages, packed repeated fields
3      SGROUP  group start (deprecated)
4      EGROUP  group end (deprecated)
5      I32     fixed32, sfixed32, float

The tag

The tag is a varint-encoded value consisting of the field number and the wire type. The field number of our first field is 1 and since it is an int32 which gets encoded as varint, the wire type is 0. The low three bits represent the wire type, the other bits represent the field number. This can be expressed as

wire_type | (field_number << 3)

Our field number fits into the four bits that remain in a single byte after the continuation bit and the three wire type bits, so the tag needs only one byte (this holds for field numbers 1 to 15). For the wire type 0 and the field number 1 this results in 08 with the following binary representation:

0 0001 000
│  │    └── Wire type (0)
│  └─────── Field number (1)
└────────── Varint continuation bit
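The tag formula is easy to try out in Python. The following is a minimal sketch; make_tag and split_tag are names invented for this example. It also shows where the single-byte limit lies: field number 16 is the first one whose tag no longer fits below 0x80 and therefore needs a second varint byte.

```python
def make_tag(field_number: int, wire_type: int) -> int:
    """Combine field number and wire type into the tag value (before varint encoding)."""
    return (field_number << 3) | wire_type


def split_tag(tag: int) -> tuple[int, int]:
    """Return (field_number, wire_type) for a decoded tag value."""
    return tag >> 3, tag & 0b111


print(f"{make_tag(1, 0):#04x}")  # 0x08
print(make_tag(15, 0) < 0x80)    # True: still a single varint byte
print(make_tag(16, 0) < 0x80)    # False: needs a second varint byte
print(split_tag(0x08))           # (1, 0)
```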

The value

Immediately after the tag the value of the field gets encoded according to the wire type. For the weight field we want to encode 150 as int32 with a wire type of varint. As the sample in the varint paragraph shows, this results in 96 01.

So far we have the following data:

08 96 01
    └──── Payload of the field "weight", varint encoded
└───────── Tag of the field "weight" (field number and wire type)

Length delimited field

On to the field name. According to our wire type table, a string is a length delimited field and therefore encoded with wire type 2. Together with the field number 2 this results in the tag 12:

0 0010 010
│  │    └── Wire type (2)
│  └─────── Field number (2)
└────────── Varint continuation bit

The tag of a length delimited field is followed by a varint which specifies the length of the payload. The UTF-8 representation of Apple is 41 70 70 6c 65. These are 5 bytes, therefore the length is 5.

12 05 41 70 70 6c 65
  │        └──────── UTF-8 encoded string payload (Apple)
  └───────────────── Count of UTF-8 bytes of the payload (5)
└──────────────────── Tag of the field "name" (field number and wire type)
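As a sketch, a length-delimited string field can be built like this in Python. The function names are made up for illustration. Note that the length counts UTF-8 bytes, not characters, which only makes a difference for non-ASCII text.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a varint."""
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)  # 7 payload bits plus continuation bit
        value >>= 7
    out.append(value)                      # last byte, continuation bit 0
    return bytes(out)


def encode_string_field(field_number: int, text: str) -> bytes:
    """Encode a string as tag + length + UTF-8 payload (wire type 2)."""
    payload = text.encode("utf-8")
    tag = (field_number << 3) | 2
    return encode_varint(tag) + encode_varint(len(payload)) + payload


print(encode_string_field(2, "Apple").hex(" "))  # 12 05 41 70 70 6c 65
```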

The encoded sample message

Concatenating our two encoded fields leads to the following bytes:

08 96 01 12 05 41 70 70 6c 65
        └────────── Length delimited field "name" with field number 2
└─────────────────── Varint encoded field "weight" with field number 1
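Putting both fields together, a hand-rolled encoder for our Fruit message could look roughly like the sketch below. It is not generated protobuf code; encode_fruit is a hypothetical helper that only handles non-negative weights. It skips fields that hold their default value (0 or the empty string), which is how proto3 treats such fields.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a varint."""
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def encode_fruit(weight: int, name: str) -> bytes:
    """Serialize a Fruit message by hand (sketch, non-negative weight only)."""
    out = bytearray()
    if weight:  # default value 0 is not written in proto3
        out += encode_varint((1 << 3) | 0) + encode_varint(weight)
    if name:    # the empty string is not written either
        payload = name.encode("utf-8")
        out += encode_varint((2 << 3) | 2) + encode_varint(len(payload)) + payload
    return bytes(out)


print(encode_fruit(150, "Apple").hex())  # 08960112054170706c65
```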

We can verify our encoding using protoc:

echo '08960112054170706c65' | xxd -r -p | protoc --decode=Fruit ./fruit.proto
weight: 150
name: "Apple"

The output matches our sample values exactly, as expected 🥳🎉

The command works like this:

  1. Echo our hex encoded bytes
  2. Pass them through xxd to transform the hex into binary
  3. Pass the binary stream to protoc with the decode flag
    • For protoc to access our sample proto we stored it in the working directory in a file named fruit.proto

Even if we do not have access to the proto file, we can extract some information from the encoded protobuf by using the decode_raw flag:

echo '08960112054170706c65' | xxd -r -p | protoc --decode_raw
1: 150
2: "Apple"

This tells us there are two fields, one with the field number one that decodes to the value of 150, and one with the field number of two which decodes to the string "Apple".
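Decoding without a schema works because every tag tells the decoder how to read the payload that follows. A rough Python sketch of that idea, limited to the two wire types used in this post (the names read_varint and decode_raw are made up and this is not how protoc is implemented):

```python
def read_varint(data: bytes, pos: int) -> tuple[int, int]:
    """Read a varint starting at pos, return (value, position after it)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift  # groups arrive least significant first
        if not byte & 0x80:               # continuation bit 0: last byte
            return result, pos
        shift += 7


def decode_raw(data: bytes) -> list[tuple[int, object]]:
    """Return (field_number, value) pairs for wire types 0 and 2 only."""
    fields = []
    pos = 0
    while pos < len(data):
        tag, pos = read_varint(data, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:                # varint payload
            value, pos = read_varint(data, pos)
        elif wire_type == 2:              # length, then that many payload bytes
            length, pos = read_varint(data, pos)
            value = data[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} is not handled in this sketch")
        fields.append((field_number, value))
    return fields


print(decode_raw(bytes.fromhex("08960112054170706c65")))
# [(1, 150), (2, b'Apple')]
```

Without the proto file the decoder cannot know that the bytes of field 2 are a string rather than raw bytes or a nested message, which is why the sketch leaves them as bytes.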

Other wire types

  • Bytes and nested messages are encoded exactly the same way as strings with a length delimited encoding.
  • Boolean values are encoded as varints resulting in 01 for true and 00 for false.
  • Enums are also encoded as varints.
  • Repeated fields (as long as they are not packed) end up as multiple tag value pairs in the byte stream with the same tag being present multiple times.
  • Packed repeated fields are encoded as length delimited fields.
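A few of these cases can be tried out with a short sketch. The field numbers 3 and 4 are hypothetical and not part of our Fruit message, and the helper names are invented for this example.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer (or bool) as a varint."""
    value = int(value)  # True/False become 1/0
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def encode_varint_field(field_number: int, value: int) -> bytes:
    """Tag with wire type 0 followed by the varint payload."""
    return encode_varint((field_number << 3) | 0) + encode_varint(value)


def encode_unpacked(field_number: int, values: list[int]) -> bytes:
    """Unpacked repeated field: one tag-value pair per element."""
    return b"".join(encode_varint_field(field_number, v) for v in values)


def encode_packed(field_number: int, values: list[int]) -> bytes:
    """Packed repeated field: one length-delimited field holding all varints."""
    payload = b"".join(encode_varint(v) for v in values)
    return encode_varint((field_number << 3) | 2) + encode_varint(len(payload)) + payload


print(encode_varint_field(3, True).hex(" "))   # 18 01
print(encode_unpacked(4, [1, 2, 3]).hex(" "))  # 20 01 20 02 20 03
print(encode_packed(4, [1, 2, 3]).hex(" "))    # 22 03 01 02 03
```

The packed form repeats the tag only once, which is where its size advantage comes from.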

Closing

In this blog post we have encoded a sample protobuf message and validated the encoded bytes with protoc. If you want to dig deeper and understand alternative protocol buffer formats such as protoscope or how other features of protocol buffers such as maps, negative numbers or packed repeated fields are encoded, check out the excellent protobuf encoding guide.

Manuel

Protobuf Editions bring a major change in the way Protocol Buffer versions are handled. Currently there are two versions, "proto2" and "proto3", and the differences between them aren't always obvious. Migrating from proto2 to proto3 isn't easy either. In this blog post we will explore how Protobuf Editions are going to improve this.

Silvan

Kreya 1.14 brings you a lot of new features. Starting with running a directory as a collection, an Insomnia importer, automatic import of system environment variables and opening items in the file manager. The CLI now has the ability to invoke multiple collections with a single call and generate a JUnit test report. Scripting has been improved and trace messages can be filtered.

Silvan

2.87 million invoked operations, almost 4,000 monthly active users and many more numbers that our telemetry has provided us with over the last year.

Silvan

Kreya 1.13 is now available! With a new major feature .. 🥁🥁 .. collections! This allows you to invoke multiple operations with a single click. It's also now possible to import your Postman collections and environments. The protobuf declaration for gRPC operations can be viewed in the new Declaration tab. The CLI has a new 'create project' command and a few bugs have been fixed.

Silvan

Today I was sitting at my computer desk and it was raining and cold outside. Nothing special, just a Friday. I was implementing some code for a project and had to run over 2,000 tests. I started the tests and had a moment to surf the internet. Opened Hacker News and saw the link to Advent of Code 2023. Like almost every year, I opened the page, logged in and looked at the first puzzle. My interest and motivation to solve such a puzzle was quite high.

Silvan

Kreya 1.12 has been released! This new version comes with a lot of new features and some bug fixes. Starting with a new dialog for unimported operations that is visible after an import run. If you ever wanted to see how long a request takes on the wire and server side, this is now possible with the new Timing tab. And for REST, we now support server-sent events.

Silvan

First of all, bring your own storage means you have the freedom to choose which storage provider you want to use to store your data. Unlike Postman, Insomnia and other API clients, Kreya gives you an easy way to bring your own storage. Kreya intentionally stores the project data in readable text files in the location of your choice, so you can easily read, edit, and share/sync the files with your favorite tools.

Silvan

Kreya 1.11 is out now! This release prioritises enhancing the user experience. New users are now provided with a tour to familiarise themselves with Kreya on their first launch. The UI components, including buttons, have been improved to boost usability. The welcome screen has been completely redesigned and now includes quick links and tips. Moreover, quick action hints have been added when no tab is open. It is now possible to purge user variables, and an option to duplicate importers has been added. Additionally, this release includes numerous bug fixes.

Manuel

Do you have a legacy API without any tests? Multiple APIs written in different languages? Or a QA team that wants to test all APIs with the same powerful tool? Then Kreya is the perfect solution for you!

Testing APIs with Kreya has its advantages. You can test your API both manually (e.g. during development) and in an automated way. Learn more about testing with Kreya in this article.