This exercise aims to facilitate the understanding of the binary serialization method of protobuf by comparing the storage density of information with the text/json format.


TL;DR

Content serialized as text/JSON will be roughly 1.5 to 4.5 times larger than the same data serialized in binary form with protobuf, depending on the payload and on whether compression is applied. This shows up as more IO and higher CPU usage at de/serialization time, and is one of the reasons protobuf has been adopted in so many scenarios.


But, before the tests...

*Ceci n'est pas une pipe*

Beyond being an exercise in understanding protobuf, some comments on information vs. representation are worthwhile. Something I have noticed many times, especially with beginner developers, is a certain lack of understanding of the difference between information and its forms of representation, and of how different representations have different densities (i.e., the amount of information per unit of storage space).

For example, an integer between 0 and 255 (such as the value 100) can be stored as a uint8 (8-bit unsigned integer) occupying 1 byte; as an int32 (a Java int) it will occupy 4 bytes; while its decimal representation in readable characters, '100' (serialized in a text format, for example), will occupy at least 3 bytes (and can reach 12 bytes in some cases, depending on the encoding).
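A quick sketch of this in Node.js (the buffer sizes are just illustrating the point above):

```js
// The same value, 100, in three representations with different densities.
const asUint8 = Buffer.alloc(1)            // uint8: 1 byte
asUint8.writeUInt8(100)

const asInt32 = Buffer.alloc(4)            // int32 (Java int): 4 bytes
asInt32.writeInt32LE(100)

const asText = Buffer.from('100', 'utf8')  // decimal characters: 3 bytes

console.log(asUint8.length, asInt32.length, asText.length) // 1 4 3
```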

Another example is UUIDs [RFC 4122], which at maximum density occupy 16 bytes (128 bits), but whose best-known representation, a string of 32 hexadecimal characters, occupies at least 32 bytes (36 with the '-' separators). Although MySQL allows binary columns (e.g. binary(16)) and PostgreSQL has a native UUID data type, both of which use the 128 bits efficiently, an unsuspecting developer might store a UUID as text in a varchar(36), which significantly impacts IO, especially when it is a primary key.
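In Node.js the difference is easy to see (the UUID below is a made-up example value, not taken from the article's data):

```js
// Text representation: 36 bytes; packed binary representation: 16 bytes.
const uuidText = '25e40068-02c1-4c76-b923-1a3e44bb5265' // hypothetical example
const uuidBytes = Buffer.from(uuidText.replace(/-/g, ''), 'hex')

console.log(Buffer.byteLength(uuidText, 'utf8')) // 36
console.log(uuidBytes.length)                    // 16
```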

Of course, there are situations where you need to use an inefficient representation, especially when it comes to presenting it in a human-readable way. But it is important to understand these aspects to know when and how to represent information in the best way.

About Metadata

In addition to the binary representation, another factor that makes protobuf denser and more efficient is that in JSON the metadata (which describes the data) is transported together with the data, while in protobuf the metadata is known in advance through the .proto contract and is not transported with the serialized payload. In this sense, protobuf is closer to a positional format. In the JSON example below, the keys "id" and "name" are metadata describing the corresponding values:

```json
{
    "id": 1,
    "name": "James Wilson"
}
```
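A rough way to see this overhead in Node.js (a sketch that only counts the bytes the keys themselves consume):

```js
const obj = { id: 1, name: 'James Wilson' }
const serialized = JSON.stringify(obj) // {"id":1,"name":"James Wilson"}

// Bytes spent purely on metadata: each key plus its quotes and colon, e.g. "id":
const keyOverhead = Object.keys(obj)
    .reduce((sum, key) => sum + key.length + 3, 0)

console.log(Buffer.byteLength(serialized, 'utf8')) // 30 bytes in total
console.log(keyOverhead)                           // 12 of them only describe the data
```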

## Ok, now enough talk and let's get to the tests...

To perform this test/comparison we will use data from 1000 fake users from the API [https://randomuser.me/api/?format=json&results=1000&seed=teste_proto_vs_json](https://randomuser.me/api/?format=json&results=1000&seed=teste_proto_vs_json).

Not all data obtained from this API will be used: the mapping to the .proto contract keeps only some of the fields, and the rest is discarded in the process.

The json will be generated from the object mapped to proto, so that both reflect the same information.

The proto file is this:
```proto
syntax="proto3";

package my.system.person;

message User {
    string gender = 1;
    Name name = 2;
    string email = 3;
    Login login = 4;
    Picture picture = 5;
    Location location = 6;
    bool isactive = 7;
}

message Name {
    string title = 1;
    string first = 2;
    string last = 3;
}

message Location {
    Street street = 1;
    string city = 2;
    string state = 3;
    string country = 4;
    string postcode = 5;
    Geo coordinates = 6;
    TZ timezone = 7;
}

message Street {
    int32 number = 1;
    string name = 2;
}

message Geo {
    float latitude = 1;
    float longitude = 2;
}

message TZ {
    string offset = 1;
    string description = 2;
}

message Login {
    string username = 1;
    bytes uuid = 2;
    bool isloggedin = 3;
}

message Picture {
    string large = 1;
    string medium = 2;
    string thumbnail = 3;
}
```

An object mapped to this proto contract looks like this:

```js
{
  gender: 'male',
  name: { title: 'Mr', first: 'Sean', last: 'Perkins' },
  email: 'sean.perkins@example.com',
  login: {
    username: 'someperson84',
    uuid: '25eAaALBTHa5Ixo+RLtSZQ==',
    isloggedin: true
  },
  picture: {
    large: 'https://randomuser.me/api/portraits/men/1.jpg',
    medium: 'https://randomuser.me/api/portraits/med/men/1.jpg',
    thumbnail: 'https://randomuser.me/api/portraits/thumb/men/1.jpg'
  },
  location: {
    street: { number: 4481, name: 'Northaven Rd' },
    city: 'Adelaide',
    state: 'Queensland',
    country: 'Australia',
    postcode: 7056,
    coordinates: { latitude: -86.0805, longitude: -24.3252 },
    timezone: { offset: '+3:30', description: 'Tehran' }
  },
  isactive: false
}
```

The code that maps the data, serializes it, and writes the files is written in JavaScript with Node.js, but it could be any other language. The dependencies (package.json):

```json
{
    "name": "proto_vs_json",
    "version": "0.1.0",
    "description": "Testing serialization with protobuf and json",
    "scripts": {
        "test-user": "node users/user_serializer.js",
        "build-user-proto": "protoc --js_out=import_style=commonjs,binary:./ --plugin=protoc-gen-grpc=node_modules/grpc-tools/bin/grpc_node_plugin users/user.proto"
    },
    "keywords": [],
    "author": "Artus Rocha",
    "dependencies": {
        "google-protobuf": "^3.15.6",
        "grpc": "^1.24.6"
    },
    "devDependencies": {
        "grpc-tools": "^1.11.1"
    }
}
```

Command to generate code from .proto file:

```sh
protoc --js_out=import_style=commonjs,binary:./ --plugin=protoc-gen-grpc=node_modules/grpc-tools/bin/grpc_node_plugin user.proto
```

I won't walk through the entire code, since that's not the focus here, but the complete code is available here. I will highlight just a few excerpts.
Serializing to binary/protobuf and writing the file:

```js
function writeProtobuf(user, i) {
    const filepath = './data/user-' + zeroPad(i, 3) + '.pb'
    const content = user.serializeBinary() // Uint8Array in the protobuf wire format
    fs.writeFile(filepath, content, "binary", function (err) {
        if (err) console.log("Error bin", err);
    })
}
```

Serializing to text/JSON and writing the file:

```js
function writeJson(user, i) {
    const filepath = './data/user-' + zeroPad(i, 3) + '.json'
    const content = JSON.stringify( user.toObject() )
    fs.writeFile(filepath, content, function (err) {
        if (err) console.log("Error json", err);
    })
}
```

Files with the data serialized as JSON text had an average size of 688 bytes:
```sh
$> wc -c data/*.json
# ...
   686 data/user-991.json
   662 data/user-992.json
   684 data/user-993.json
   682 data/user-994.json
   663 data/user-995.json
   729 data/user-996.json
   669 data/user-997.json
   699 data/user-998.json
   698 data/user-999.json
687943 total
```

Files with the data serialized in protobuf binary format, which I saved with the '.pb' extension, had an average size of 362 bytes:
```sh
$> wc -c data/*.pb 
# ...
   360 data/user-991.pb
   338 data/user-992.pb
   357 data/user-993.pb
   358 data/user-994.pb
   333 data/user-995.pb
   406 data/user-996.pb
   341 data/user-997.pb
   371 data/user-998.pb
   378 data/user-999.pb
362108 total
```

Here we see that the protobuf binary version was, in this case, on average 47% smaller than the same data serialized as text/JSON.

But... so far we are not using compression, and when transferring this data, good practice recommends compressing it, with gzip for example.
So let's compress the files and see how the ratio looks:

```sh
$> gzip -6 data/*
$> wc -c data/*.json.gz
# ...
   429 data/user-991.json.gz
   413 data/user-992.json.gz
   433 data/user-993.json.gz
   431 data/user-994.json.gz
   413 data/user-995.json.gz
   455 data/user-996.json.gz
   416 data/user-997.json.gz
   442 data/user-998.json.gz
   438 data/user-999.json.gz
432815 total
$> wc -c data/*.pb.gz
# ...
   285 data/user-991.pb.gz
   274 data/user-992.pb.gz
   290 data/user-993.pb.gz
   287 data/user-994.pb.gz
   266 data/user-995.pb.gz
   320 data/user-996.pb.gz
   271 data/user-997.pb.gz
   303 data/user-998.pb.gz
   310 data/user-999.pb.gz
291953 total
```

Using gzip with a compression factor of 6, we get an average of 433 bytes for the JSON files and 292 bytes for the protobuf files. The gap narrows: protobuf is now on average 32% smaller than JSON. The better compression ratio of the text/JSON version is expected: since binary/protobuf is already much denser, with less repeated data, there is less room left for compression.


But why so many strings?

The test with the User object is interesting because it is a very typical scenario and gives us some insights. But this kind of object is composed mostly of UTF-8 strings. That's why I decided to run a second test, following the same methodology, but with an object composed only of scalar values: booleans, integers, and floats.

The proto file:

```proto
syntax="proto3";

package my.system.scalar;

message Scalar {
    bool boolean1 = 1;
    bool boolean2 = 2;
    float float1 = 3;
    float float2 = 4;
    uint32 uint1 = 5;
    uint32 uint2 = 6;
    int32 int1 = 7;
    int32 int2 = 8;
}
```
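Part of why the scalar .pb files turn out so small is that protobuf encodes integers as varints, spending 7 payload bits per byte, so small values cost very few bytes. A sketch of the size calculation (this helper is illustrative, not part of the article's code):

```js
// Number of bytes a protobuf varint needs for a given unsigned value.
function varintSize(value) {
    let bytes = 1
    while (value >= 128) {          // 7 bits of payload per byte
        value = Math.floor(value / 128)
        bytes++
    }
    return bytes
}

console.log(varintSize(1))          // 1
console.log(varintSize(300))        // 2
console.log(varintSize(2000000000)) // 5
```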
```sh
$> wc -c data/scalar*.json
# ...
   164 data/scalar-991.json
   165 data/scalar-992.json
   161 data/scalar-993.json
   164 data/scalar-994.json
   163 data/scalar-995.json
   162 data/scalar-996.json
   162 data/scalar-997.json
   161 data/scalar-998.json
   163 data/scalar-999.json
162527 total
$> wc -c data/scalar*.pb
# ...
   38 data/scalar-991.pb
   33 data/scalar-992.pb
   37 data/scalar-993.pb
   34 data/scalar-994.pb
   33 data/scalar-995.pb
   36 data/scalar-996.pb
   35 data/scalar-997.pb
   35 data/scalar-998.pb
   33 data/scalar-999.pb
35395 total
```

In this new scenario the difference between JSON and protobuf was much greater. On average, JSON serialization produced files of 163 bytes (144 bytes with gzip), while the same data in protobuf averaged 35 bytes, making the JSON version more than 4 times larger than protobuf on average.

This is mainly due to the difference in size between the data itself and its representation in readable characters. For example, an int32 value occupies 4 bytes in memory, but its representation in text (for example when we serialize it to JSON) can occupy considerably more (e.g. the integer 2000000000 serialized as text/JSON will occupy at least 10 bytes).
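The same comparison in Node.js, using the integer from the example above:

```js
const n = 2000000000

const asText = Buffer.from(String(n), 'utf8') // the JSON-style text representation
const asFixed = Buffer.alloc(4)
asFixed.writeInt32LE(n)                       // the fixed 4-byte in-memory int32

console.log(asText.length)  // 10 bytes as readable characters
console.log(asFixed.length) // 4 bytes as a 32-bit integer
```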




The code with these tests

That's all folks...

Artus