This blog post is Part 2 of our previous post about "Deploying Hitless and HA Load Balancing" in LoxiLB. Before continuing, we highly recommend reading the previous post.
In this post, we take a deep dive into the technology and methods we explored for state synchronization in LoxiLB, along with performance numbers.
Why do we need state synchronization?
When eBPF started gaining popularity, its first adoption was mainly for observability. As time passed and the technology evolved, eBPF programs started being developed for stateful networking solutions such as load balancing, connection tracking, firewalls and even CGNAT. These need to be deployed in clusters to avoid single points of failure. If the application is stateless, there is no need for synchronization, but the stateful case is different, especially when the application must be deployed with high availability.
In traditional stateful applications, state is maintained in the application itself or in some centralized DB, but in an eBPF application, the state (or rather the information) lives in eBPF maps. And the state of each node, pod or network session needs to be synchronized across the cluster.
Synchronization, in itself, is not a new concept. Traditionally, netfilter/iptables handled Linux stateful filtering, and conntrackd, a purpose-built daemon, was used for synchronization. But there is no known synchronization tool or daemon available for eBPF maps.
"We are in the brave new world powered by eBPF. So, Let's explore!"
Different approaches to eBPF Map synchronization
When you stand at a crossroads, it is very important to choose the right path, because you don't want an unnecessarily long route that delays your journey. The same is true when designing a synchronization strategy: if it is not done efficiently, it will cost you performance and resources. Let's discuss:
Let's start with the easiest approach, the straightforward way: periodically fetch the map entries, traverse them and then sync them. This black-box style of syncing may look easy, but it is not. There are two problems with this approach. First, it is very difficult to scan all those entries and track whether something changed. Second, the scan-track-sync cycle is very costly in terms of compute. On one hand we are saving compute with eBPF, and on the other hand we would be wasting it on synchronization.
The alternative: asynchronous map notifications for add, delete and update of a map entry. This is our preferred way, as it is efficient in terms of compute and there is no need to track changes explicitly. eBPF itself comes to the rescue to solve synchronization for eBPF maps. We attach eBPF kprobes to the kernel functions that modify eBPF maps. Next, we funnel these events to user space via an eBPF perf or ring buffer, filtering at the user-plane level for the particular eBPF map we want to synchronize. The perf buffer (or perfbuf) is a collection of per-CPU circular buffers which allows efficient data exchange between kernel and user space. As per the documentation, the perf buffer size is defined in terms of page count and must be 1 + 2^n pages. Finally, we just have to announce these events to cluster-wide entities via gRPC or another well-known, reliable messaging framework.
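To make this concrete, here is a minimal user-space sketch using the cilium/ebpf library. It assumes a separately compiled BPF object (map_sync.o) containing a kprobe program named trace_map_update and a BPF_MAP_TYPE_PERF_EVENT_ARRAY named events; those names, and the handleEvent helper, are illustrative rather than LoxiLB's actual code.

package main

import (
	"log"
	"os"

	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/link"
	"github.com/cilium/ebpf/perf"
)

const pageCount = 16384 // perf buffer pages per CPU (value used in our tests below)

func main() {
	// Load the pre-compiled BPF object holding the kprobe program and its
	// BPF_MAP_TYPE_PERF_EVENT_ARRAY ("events"). Object/program names are illustrative.
	coll, err := ebpf.LoadCollection("map_sync.o")
	if err != nil {
		log.Fatalf("loading BPF collection: %v", err)
	}
	defer coll.Close()

	// htab_map_update_elem is the kernel entry point for updates to
	// BPF_MAP_TYPE_HASH maps; delete hooks can be attached the same way.
	kp, err := link.Kprobe("htab_map_update_elem", coll.Programs["trace_map_update"], nil)
	if err != nil {
		log.Fatalf("attaching kprobe: %v", err)
	}
	defer kp.Close()

	// The per-CPU buffer size is given in bytes and rounded to whole pages.
	rd, err := perf.NewReader(coll.Maps["events"], pageCount*os.Getpagesize())
	if err != nil {
		log.Fatalf("creating perf reader: %v", err)
	}
	defer rd.Close()

	for {
		rec, err := rd.Read()
		if err != nil {
			log.Fatalf("reading perf event: %v", err)
		}
		if rec.LostSamples > 0 {
			log.Printf("kernel dropped %d events (buffer too small or reader too slow)", rec.LostSamples)
			continue
		}
		// rec.RawSample carries the opaque key/value bytes emitted by the
		// kprobe program; hand them to the cluster-sync layer.
		handleEvent(rec.RawSample)
	}
}

// handleEvent forwards one notification to the peers (see the RPC section below).
func handleEvent(sample []byte) {}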
If we look at the components on a single node, you will notice there are two sets of eBPF programs. First, the "eBPF user program", which is responsible for loading and unloading the eBPF programs on the tc hook; these data-path programs update the eBPF maps. Second, the "eBPF sync programs", which place kprobes on the map add, delete and update entry points so that all notifications are caught. Every event notification is first received by the eBPF kernel (kprobe) program. There are no built-in filters at the hook level, which means we receive notifications for all map entries; if required, events can be filtered by map name, with the names passed to the kprobe program through a separate eBPF map. When the user-space program receives the event notifications, it does not need to know anything about the map's semantics: it can stay completely agnostic and simply read or write the key/value pairs. For our current use case in LoxiLB, we use kprobes on the BPF_MAP_TYPE_HASH entry points, which can be changed as needed.
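As an illustration of this map-agnostic handling, the sketch below assumes a simple wire format for the samples emitted by the kprobe program (a small header followed by raw key and value bytes). The layout and the names are our assumption, not LoxiLB's actual format; the point is that user space only slices bytes and never interprets them.

package mapsync

import (
	"encoding/binary"
	"fmt"
)

// mapEvent is a hypothetical layout for a perf sample: a small header
// followed by opaque key and value bytes. Key/Value are never interpreted
// by user space; they are only forwarded to peers.
type mapEvent struct {
	Op       uint32 // assumed encoding: 0 = add/update, 1 = delete
	KeyLen   uint32
	ValueLen uint32
	Key      []byte
	Value    []byte
}

// decodeEvent slices a raw perf sample into a mapEvent without touching
// the semantics of the underlying eBPF map.
func decodeEvent(sample []byte) (*mapEvent, error) {
	const hdr = 12
	if len(sample) < hdr {
		return nil, fmt.Errorf("short sample: %d bytes", len(sample))
	}
	ev := &mapEvent{
		Op:       binary.LittleEndian.Uint32(sample[0:4]),
		KeyLen:   binary.LittleEndian.Uint32(sample[4:8]),
		ValueLen: binary.LittleEndian.Uint32(sample[8:12]),
	}
	need := hdr + int(ev.KeyLen) + int(ev.ValueLen)
	if len(sample) < need {
		return nil, fmt.Errorf("truncated sample: want %d bytes, got %d", need, len(sample))
	}
	ev.Key = sample[hdr : hdr+int(ev.KeyLen)]
	ev.Value = sample[hdr+int(ev.KeyLen) : need]
	return ev, nil
}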
We installed multiple instances of LoxiLB, each with state-sync running as a sidecar. BGP is used to advertise the service IP to the external client. We will not go into the details of how service-IP connectivity is managed here, as that was already covered in our previous blog.
Synchronization is not useful only for providing high availability to applications. The same data can also be fed to an observability platform to understand what is going on in the system, so that users can analyze it and make informed decisions.
So far, we have seen how map entries are received. Now, let's talk about the next step: synchronizing those entries with the other entities in the cluster to achieve high availability. Once the eBPF user-space program receives the map entries, they need to be synchronized with all peers. There are a number of well-known messaging frameworks available, so we decided to bring two of them on board: net/rpc and gRPC.
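Below is a minimal net/rpc sketch of this peer-sync path. The CtSync service, its Apply method and the applyToLocalMap helper are hypothetical names chosen for illustration; on the receiving side, the opaque key/value bytes would typically be written back into the peer's own eBPF map (for example via cilium/ebpf's Map.Put and Map.Delete on the pinned conntrack map).

package mapsync

import (
	"net"
	"net/rpc"
)

// CtEntry carries one opaque conntrack key/value pair plus the operation.
type CtEntry struct {
	Add   bool
	Key   []byte
	Value []byte
}

// CtSync is the RPC receiver running on every peer. Apply writes the entry
// into the peer's local eBPF map via a helper (assumed here).
type CtSync struct{}

func (s *CtSync) Apply(entry *CtEntry, ack *bool) error {
	*ack = true
	return applyToLocalMap(entry)
}

// applyToLocalMap is a placeholder for the map write-back on the peer.
func applyToLocalMap(e *CtEntry) error { return nil }

// Serve starts the sync listener on a peer node.
func Serve(addr string) error {
	if err := rpc.Register(&CtSync{}); err != nil {
		return err
	}
	ln, err := net.Listen("tcp", addr)
	if err != nil {
		return err
	}
	go rpc.Accept(ln)
	return nil
}

// Push sends one entry from the active node to a peer.
func Push(peer string, e *CtEntry) error {
	client, err := rpc.Dial("tcp", peer)
	if err != nil {
		return err
	}
	defer client.Close()
	var ack bool
	return client.Call("CtSync.Apply", e, &ack)
}

In practice you would keep a persistent *rpc.Client per peer rather than dialing for every entry; the sketch dials each time only for brevity.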
Performance - Show me the numbers!
We tested the performance of LoxiLB with and without the map state-sync infra (MSS), in the same topology shown above, while varying parameters such as the number of traffic-generating threads and the perf buffer page count.
The first test was to see the impact on TCP connections per second (CPS) when event notifications are enabled. Below are the parameters used for the test.
CPS with and without MSS Infra
Test System Configuration: Intel(R) Xeon(R) Silver 4210R CPU @ 2.40GHz, 40 cores, 125 GB RAM, Kernel 5.15.0-52-generic
Performance Tool : Netperf
Netperf Threads : 50
Page Count : 16384
We observed a ~5% penalty in data-plane performance after introducing map state-sync infra.
The second test was to determine the performance of the map-sync infra itself. We measured TCP connections per second and event drops together, varying one parameter while keeping the other fixed.
CPS and Event drops after varying Page Count
Test System Configuration: Intel(R) Xeon(R) Silver 4210R CPU @ 2.40GHz, 40 cores, 125 GB RAM, Kernel 5.15.0-52-generic
Performance Tool : Netperf
Netperf Threads : 30
Initially we saw event drops, but they fell to zero once we set the page count to 32K.
CPS and Event drops after varying Netperf threads
Test System Configuration: Intel(R) Xeon(R) Silver 4210R CPU @ 2.40GHz, 40 cores, 125 GB RAM, Kernel 5.15.0-52-generic
Performance Tool : Netperf
Page Count : 16384
These bottlenecks are due to the eBPF perf buffer and its reader-side implementation: events are more likely to be dropped if the reader is too slow to process them.
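One general mitigation (a sketch of a common pattern, not necessarily what LoxiLB implements) is to keep the perf-reader loop as tight as possible and hand samples off to worker goroutines over a buffered channel, so that slow peer RPCs never stall the reader:

package mapsync

import (
	"log"

	"github.com/cilium/ebpf/perf"
)

// drainEvents keeps the perf.Reader loop tight: samples are copied onto a
// buffered channel and fanned out to workers, so slow peer RPCs do not
// stall the reader and cause kernel-side drops.
func drainEvents(rd *perf.Reader, workers int, syncToPeers func([]byte)) {
	events := make(chan []byte, 65536)
	for i := 0; i < workers; i++ {
		go func() {
			for sample := range events {
				syncToPeers(sample)
			}
		}()
	}
	for {
		rec, err := rd.Read()
		if err != nil {
			close(events)
			return
		}
		if rec.LostSamples > 0 {
			log.Printf("kernel dropped %d events", rec.LostSamples)
			continue
		}
		buf := append([]byte(nil), rec.RawSample...) // defensive copy
		select {
		case events <- buf:
		default:
			log.Println("sync queue full, dropping event in user space")
		}
	}
}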
The Sync Rate
Next, we compared the two messaging frameworks for syncing map entries to a peer. We observed that net/rpc performed better than gRPC.
Test System Configuration: Intel(R) Xeon(R) Silver 4210R CPU @ 2.40GHz, 40 cores, 125 GB RAM, Kernel 5.15.0-52-generic
Page Count : 32768
But on further analysis, we found that the way gRPC messages are defined and the encoding/decoding methods chosen can significantly impact performance. In this test, we had defined our RPC message in a generic way:
message CtInfoMod {
  bool add = 1;
  google.protobuf.Any details = 2;
}
This approach is good when you want a generic message, but it was the reason for the performance drop: for transmission, the payload had to be converted to JSON, and both sides had to marshal/unmarshal it, which proved to be very costly.
Later, we defined our message more precisely, since we knew exactly what information we were synchronizing. Our message definition now looks like this:
message CtInfo {
  bytes dip = 1;
  bytes sip = 2;
  int32 dport = 3;
  int32 sport = 4;
  ........
}

message CtInfoMod {
  bool add = 1;
  CtInfo ct = 2;
}
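For illustration, here is roughly how a node could send the typed message from Go, assuming the proto above is compiled into a hypothetical ctsyncpb package that exposes a CtSync service with a Sync method (the package, service and method names are ours, not LoxiLB's):

package main

import (
	"context"
	"log"
	"net"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	// ctsyncpb is the hypothetical package generated from the proto above.
	ctsyncpb "example.com/loxilb-sync/proto"
)

func main() {
	conn, err := grpc.Dial("peer-node:50051",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("dialing peer: %v", err)
	}
	defer conn.Close()

	client := ctsyncpb.NewCtSyncClient(conn)

	// The typed CtInfo is filled directly from the conntrack entry; no
	// intermediate JSON or Any packing is needed on either side.
	_, err = client.Sync(context.Background(), &ctsyncpb.CtInfoMod{
		Add: true,
		Ct: &ctsyncpb.CtInfo{
			Dip:   net.ParseIP("10.0.0.2").To4(),
			Sip:   net.ParseIP("192.168.1.10").To4(),
			Dport: 80,
			Sport: 34567,
		},
	})
	if err != nil {
		log.Fatalf("syncing conntrack entry: %v", err)
	}
}

Because the fields map one-to-one onto the conntrack entry, protobuf's binary encoding is used end to end and the extra encode/decode round trip disappears.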
After that, gRPC matched the performance of net/rpc.
DEMO TIME!!!
Here is a demo video showing the synchronization of 50,000 concurrent sessions. We used tcpkali to create these sessions.
FUTURE WORK
Currently, a lot of glue code is needed to expose eBPF maps via gRPC or other frameworks. We are exploring whether the clang compiler can be extended to auto-generate .proto files for gRPC, which would help reduce boilerplate code for eBPF developers. Further, a ring buffer vs. perf buffer performance comparison is also planned.
References:
* Inspired by work done in https://github.com/CrowdStrike/bpfmon-example