Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite the gRPC API to ditch google.protobuf.Any in favor of protobuf native oneof #2880

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

Tuetuopay
Copy link
Contributor

@Tuetuopay Tuetuopay commented Feb 18, 2025

Hi,

This pull request introduces a major change to the GoBGP API: transitioning from google.protobuf.Any to protobuf's native oneof feature.

Why

The BGP protocol uses a lot of list of various types, that reflects in the gRPC API. For example, path attributes can be any type of a specific subset. Currently, the API uses runtime dynamic types in the form of Any, allowing to hold any protobuf type. Which values are accepted is documented by a comment above the field, indicating which types are legal.

There are, however, a few issues with this:

  • the API is not easily discoverable, as this is akin to a void* in C, and one has to rely on the comment for this
  • those manual lists have to be maintained, and can fall out of sync between the actual implementation and the comment. In fact, while performing this rewrite, I found several such occurrences while adapting the various Marshal and Unmarshal helpers for Attribute, Nlri, etc.
  • debugging tools cannot leverage Any, and easily drop down to the raw proto bytes contained by the Any (e.g. Wireshark)
  • similarly, most API tools will stumble upon this (e.g. grpcurl)
  • api producers and consumers must wrangle with the Any type, and manually (un)marshal to the correct type, instead of relying on protobuf's typing, especially in generated code.

In a nutshell, the main motivation is type safety and discoverability.

In addition to those usability aspects, a performance aspect is present as well, since it requires dynamic typing and parsing of the actual type URL. The GC will also be happier, as it means less allocations and garbage (it was, in fact, looking at pprof that initially triggered this initiative).

There also is an improvement on the wire usage, as all variants of a oneof share the same memory once encoded. Since that is always a sub-message for gobgp, it's always a pointer, so all those oneofs only eat up a pointer on the wire. A net win.

How

The API structure did not change. All Anys were taken one by one, and a corresponding oneof message was written.

  • oneofs specific to a single message type were nested near their usage (e.g. TunnelEncapTLV's tlvs has its oneof message nested since it's only used in this message)
  • oneofs that are used in multiple places have their own message (e.g. NLRI, RouteDistinguisher)
  • oneofs that are large and significant enough have also their own non-nested message (e.g. ExtendedCommunity that does even have its own .proto file)

Once those were identified, the various Marshal/Unmarshal helpers were used to infer the correct set of accepted values in the oneof (many documentation comments were out of sync, especially for stuff used at several places).

Due to include cycles, the big attribute.proto file had to be broken down in smaller files, like extcom.proto, common.proto, nlri.proto, etc.

Result

In the end, the API is much more self-describing, avoids wrangling with Any, and (subjectively) better overall. A simple look at some of the main oneofs ought to be enough: NLRI, Attribute, ExtendedCommunity just to name those three.

For the performance aspect, I measured notable improvements on the gRPC API. The tests were performed using EVPN type-2 routes and a small Rust program for route injection.

Load Anyful Anyless Improvement
Insertion 100k, unary backend 4352/s 4459/s +2.5% paths/s
Insertion 100k, stream backend 49619/s 59925/s +21% paths/s
Insertion 1M, stream backend 43989/s 52693/s +20% paths/s
44.63s 36.67s -18% cpu time (user)
3.35s 2.79s -17% cpu time (system)
Listing 1M, batch 128 40.82s 33.91s -17% wall time

Note: "unary backend" means calling the "AddPath" gRPC endpoint, while "stream backend" means using a single "AddPathStream" gRPC call and streaming all routes. I include both for completeness: most users will use the unary backend and won't see a meaningful improvement, while heavy users (like myself) will use the stream backend and will see meaningful improvements.

CPU time was measured by sampling /proc/pid/stat before and after applying the workload. It is included as an indicator of how much less work gobgp does, and that the increased performance does not come at a CPU cost. We do less work.

The tests were performed on a Scaleway POP2-4C-16G instance running Ubuntu 24.04.

Downsides

A few downsides are present with this approach:

  • it is not possible to have a different set of variants depending on where the type is used. previously, there could be some variants (e.g. of nlri) illegal in some place but legal in any other. while I personally don't see where this could be useful, and think it would be bad api design given the goal of this api, it is up to the maintainer to weigh in on this.
    (it could be documented via comments that a variant is illegal in some situation, which is IMHO better, as to explicitly document exceptions)
  • some languages may see increased memory usage for deserialized messages. AFAICT only Rust is affected: it has tagged enums, and a oneof is represented as a rust enum. Those will be sized to fit their largest variants, which means that any other variant pays the price. (e.g. with this new oneofs, the Rust generated Attribute is more than 500 bytes).
    IMHO this is a language / tooling issue, as a simple patch in protoc-gen-prost would allow to box enum variants, and reduce the enum size to a pointer.
  • the other class of languages not having tagged enums resort to wrapping structs with interfaces (e.g. Go), which can make the code a bit more verbose to write (see the attribute utils tests). I think this is a small price to pay for an actually discoverable and safe API.
  • this is a breaking API change

However, I think the benefits far outweigh the drawbacks.

Other words

I don't expect this pull request to be merged straight away, or as-is. It definitely could, however. But I don't come to get this merged ASAP, and I'm open to discussion. This is a big changed, and we already discussed on slack about this. I hope this clear most of the doubts that could cloud this initiative.

On a side note, I did not fully nuke Anys from the proto. Deep down SRv6, there is a pair of Anys left, which I didn't touch, because I'm far from being knowledgeable enough to touch this (though I'll read up on it and end up doing it, it's a tiny leaf struct anyways). EDIT: now done

Thank you for considering this :)

@Tuetuopay
Copy link
Contributor Author

Update: I did the SRv6 TLVs, so now the API truly is Any-free.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant