Authors:
(1) Tianyi Cui, University of Washington ([email protected]);
(2) Chenxingyu Zhao, University of Washington ([email protected]);
(3) Wei Zhang, Microsoft ([email protected]);
(4) Kaiyuan Zhang, University of Washington ([email protected]).
Editor's note: This is Part 4 of 6 of a study detailing attempts to optimize layer-7 load balancing. Read the rest below.
4.2 End-to-end Throughput
4.3 End-to-end Latency
4.4 Evaluating the Benefits of Key Techniques
4.5 Real-world Workload
In this section, we conduct both microbenchmarks and end-to-end application benchmarks to understand the performance of Laconic. Specifically, we aim to answer the following three questions:
• Can Laconic’s design provide end-to-end performance benefits compared with existing load balancers?
• What benefits are provided by each of the individual techniques used in Laconic’s design?
• How well does Laconic perform under real-world application workloads?
Implementation: We implement Laconic on both on-path and off-path SmartNICs using DPDK 21.11, in about 6K lines of C++ code.
Testbed: Our testbed consists of five x86 servers with Intel Xeon 5218 CPUs and 64GB of memory. Each server is equipped with a Mellanox ConnectX-5 two-port NIC; each port runs at 100 Gbps. A dedicated set of servers hosts the SmartNICs. The on-path SmartNIC is a Marvell LiquidIO 3 CN3380 with 2x50 Gbps ports, and the off-path SmartNIC is an Nvidia BlueField-2 MBF2H516C-CECOT with 2x100 Gbps ports. An Arista DCS-7060CX switch is configured with VLAN and L4 routing. We partition the clients and servers into two different subnets, a common setting in practical deployments. Two of the five servers act as clients and the other three as backend servers. The load balancer operates within the server subnet and receives and forwards requests through the same NIC port. This "router-on-a-stick" setup lets the load balancer accept incoming requests and distribute them among the backend servers. Figure 7 shows the path of a request to one of the backend servers.
Baseline: There are multiple widely adopted L7 load balancers on the market, including Envoy [18], Nginx [15], and HAProxy [24]. We choose Nginx as our baseline since it is widely used in practical deployments and provides higher performance than the others [13].
Client and backend server: We use wrk and wrk2 with customized Lua scripts as the client software. We generate static files of various sizes to serve the HTTP requests issued by wrk. We use the requests-per-second (RPS) metric reported by wrk and multiply it by the file size to obtain goodput. Nginx is configured as the backend server software, responding to wrk requests with static files of varied sizes.
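For concreteness, the goodput computation reduces to simple arithmetic; the helper below is a hypothetical illustration of the conversion, not code from Laconic or wrk.

```cpp
// Hypothetical helper: derive goodput from wrk's requests-per-second (RPS)
// metric and the size of the static file being served.
#include <cstdint>

double goodput_gbps(double requests_per_sec, uint64_t response_bytes) {
    // goodput (Gbps) = RPS * response size (bytes) * 8 bits/byte / 1e9
    return requests_per_sec * static_cast<double>(response_bytes) * 8.0 / 1e9;
}
```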
4.2 End-to-end Throughput

We evaluate the throughput of Laconic on the off-path BlueField-2 SmartNIC and the on-path LiquidIO3 SmartNIC (Figure 8) while varying the number of cores and the file sizes requested by wrk.
Using a single core: To show the processing capability of Laconic, we first evaluate single-core throughput while varying the response size. Figure 8a clearly shows that the single-core performance of Laconic on BlueField-2 outperforms Nginx on x86. In this experiment, we set the threshold for offloading onto the flow processing engine to 1MB. With the help of the hardware flow processing engine, Laconic achieves 8.7x the throughput of Nginx on x86 for 16MB responses. As Figure 8c shows, Laconic on LIO3 also performs much better than Nginx, even though Nginx runs on a more powerful x86 core. Although LiquidIO3 does not have a hardware flow processing engine, Laconic’s throughput is still 4.5x that of Nginx on x86.
Scaling to more cores: To show scalability, we measure throughput while varying the number of cores. Figure 8b shows Laconic’s performance on BlueField-2. When processing for messages larger than 1MB is offloaded to the hardware engine, throughput for large messages scales with more cores and reaches up to 150 Gbps. For small messages, we do not offload processing onto the flow processing engine; their throughput also scales with more cores. In Section 4.4, we explain the scalability of small-message processing in more detail and why the flow processing engine has limited impact on it. Figure 8d depicts scalability on LIO3, where throughput increases linearly until line rate is reached, except for the 1KB small-message case, which is limited by frequent connection setup and tear-down. Appendix A.1 shows zoomed-in figures of the throughput with all cores.
4.3 End-to-end Latency

We next study Laconic’s latency, as shown in Figure 9. We use wrk2 with the target throughput set to 50% of the peak throughput measured in the previous experiments. In this experiment, we focus on Laconic on BlueField-2, since it achieves higher throughput than LiquidIO3 in the experiments of Section 4.2.
Using a single core: Figure 9a shows the average latency with a single core. Although Laconic runs on BlueField-2 while delivering higher throughput, its latency is comparable to that of Nginx running on x86 servers. Moreover, by efficiently offloading onto the flow processing engine, Laconic achieves lower latency than the other baselines for large message sizes.
As Figures 9b and 9c show, we zoom in on specific response sizes and plot the cumulative distribution function (CDF) of latencies across all requests. Laconic on BlueField-2 achieves lower latency than Nginx running on the ARM cores or the x86 cores for both short and large responses, benefiting from our lightweight network stack and efficient synchronization mechanisms. In addition, large messages are processed by the flow processing engine with little involvement of the cores.
Scaling to all cores: Figure 9d shows the average latency when Laconic uses all ARM cores on BlueField-2. Compared with using only one core, Laconic achieves lower latency with more cores, and it still outperforms the other baselines in the multi-core scenario. In Section 4.4, we also investigate how core scalability affects latency by varying the number of cores.
4.4 Evaluating the Benefits of Key Techniques

Offloading to the Flow Processing Engine: Figure 10 shows the benefits of using the hardware flow processing engine. In this experiment, we vary the offloading threshold such that requests with response sizes larger than the threshold are processed by the flow processing engine, while requests with smaller responses are handled by the ARM cores. Note that, before a flow is fully offloaded, its initial packets are processed by the ARM cores to establish the connection and insert the flow rules.
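As a rough illustration of how a connection’s packets might be steered once the ARM cores finish connection setup, the sketch below installs a per-connection match rule with DPDK’s rte_flow API. The 5-tuple match and the QUEUE action are assumptions for this minimal example, not Laconic’s actual rule layout on the BlueField-2 flow processing engine.

```cpp
// Minimal sketch: install a per-connection rule via DPDK rte_flow.
// The exact pattern and action used by a real deployment may differ
// (e.g., an action that lets packets bypass the ARM cores entirely).
#include <rte_byteorder.h>
#include <rte_ethdev.h>
#include <rte_flow.h>

struct rte_flow *
offload_connection(uint16_t port_id, rte_be32_t src_ip, rte_be32_t dst_ip,
                   rte_be16_t src_port, rte_be16_t dst_port, uint16_t rx_queue)
{
    struct rte_flow_attr attr = {};
    attr.ingress = 1;                              // match received packets

    struct rte_flow_item_ipv4 ip_spec = {}, ip_mask = {};
    ip_spec.hdr.src_addr = src_ip;
    ip_spec.hdr.dst_addr = dst_ip;
    ip_mask.hdr.src_addr = 0xffffffff;             // exact-match both addresses
    ip_mask.hdr.dst_addr = 0xffffffff;

    struct rte_flow_item_tcp tcp_spec = {}, tcp_mask = {};
    tcp_spec.hdr.src_port = src_port;
    tcp_spec.hdr.dst_port = dst_port;
    tcp_mask.hdr.src_port = 0xffff;                // exact-match both ports
    tcp_mask.hdr.dst_port = 0xffff;

    struct rte_flow_item pattern[] = {
        { .type = RTE_FLOW_ITEM_TYPE_ETH },
        { .type = RTE_FLOW_ITEM_TYPE_IPV4, .spec = &ip_spec, .mask = &ip_mask },
        { .type = RTE_FLOW_ITEM_TYPE_TCP,  .spec = &tcp_spec, .mask = &tcp_mask },
        { .type = RTE_FLOW_ITEM_TYPE_END },
    };

    struct rte_flow_action_queue queue = { .index = rx_queue };
    struct rte_flow_action actions[] = {
        { .type = RTE_FLOW_ACTION_TYPE_QUEUE, .conf = &queue },
        { .type = RTE_FLOW_ACTION_TYPE_END },
    };

    struct rte_flow_error err;
    return rte_flow_create(port_id, &attr, pattern, actions, &err);
}
```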
Figures 10a and 10b show how offloading affects throughput. As Figure 10a shows, when the ARM cores process requests without offloading, it is hard to achieve high throughput even with more cores. There are two reasons for the poor scalability: 1) the memory bandwidth available to the SmartNIC for accessing on-board memory is limited, as also reported in other work [30]; and 2) on BlueField-2, every two cores share a 1MB L2 cache, which can cause cache contention when multiple cores are used [41]. This poor scalability necessitates offloading to the flow processing engine. Figure 10b shows throughput when requests of all sizes are offloaded to the flow processing engine. Processing of large responses scales with more cores: because packet processing is offloaded, the ARM cores are only responsible for setting up connections and installing flow rules. However, processing of small responses does not improve with the flow processing engine, largely due to the overhead of updating flow rules, which we discuss next.
Table 2 shows the cost of updating a flow rule in the flow processing engine. Flows can be inserted into or deleted from the engine in batches. As Table 2 shows, the rule update cost is not negligible for small responses, especially when updating singleton rules; it dominates the flow completion time of small requests. This is why offloading the processing of small responses, such as those under 1KB, does not improve performance. In practice, we therefore only offload the processing of large responses (e.g., >1MB) to the flow processing engine, and we use batching to amortize the update cost. The batch size is determined by the flow arrival pattern.
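To make this policy concrete, the sketch below shows one way the size threshold and rule batching could be expressed. The names kOffloadThreshold, PendingRule, and install_rules_bulk() are placeholders for this illustration; the real bulk-insert interface is specific to the flow engine.

```cpp
// Illustrative offload policy and rule batching (placeholder types/functions).
#include <cstdint>
#include <vector>

constexpr uint64_t kOffloadThreshold = 1 << 20;   // offload responses > 1MB

struct PendingRule { /* 5-tuple and forwarding action, omitted */ };

// Hand a whole batch of rules to the flow engine in one call (placeholder).
void install_rules_bulk(const std::vector<PendingRule> &rules);

bool should_offload(uint64_t expected_response_bytes) {
    return expected_response_bytes > kOffloadThreshold;
}

// Accumulate rule updates and flush them in batches to amortize the
// per-update cost reported in Table 2; the batch size would be tuned
// to the observed flow arrival pattern.
class RuleBatcher {
public:
    explicit RuleBatcher(size_t batch_size) : batch_size_(batch_size) {}
    void enqueue(const PendingRule &r) {
        pending_.push_back(r);
        if (pending_.size() >= batch_size_) flush();
    }
    void flush() {
        if (pending_.empty()) return;
        install_rules_bulk(pending_);
        pending_.clear();
    }
private:
    size_t batch_size_;
    std::vector<PendingRule> pending_;
};
```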
Figures 10c and 10d show how offloading affects latency. In Figure 10c, we offload the processing of responses larger than 1MB. As expected, latency remains stable as we increase the number of cores because, after offloading, the ARM cores no longer process response packets. Figure 10d depicts the case where the ARM cores process responses without offloading. In this case, skipping the flow processing engine yields lower latency with all cores and performance similar to offloaded processing with a single core, because the rule update costs are avoided.
Network Stack: We next study the performance of our lightweight network stack. Table 3 shows the packet processing throughput of our stack, which reaches up to 3 million packets per second (Mpps) on a single ARM core of BlueField-2. To further analyze its efficiency, we use the perf tool to break down ARM core utilization across packet processing operations. Among the top five operations by core utilization, the DPDK API accounts for up to 40% of utilization, which is internal DPDK cost. The remaining operations are memory-intensive: they frequently perform memory allocation and population and are bottlenecked by memory bandwidth. The breakdown shows that our network stack implementation is efficient, as it is limited mainly by DPDK API cost and memory bandwidth.
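For reference, a run-to-completion DPDK loop on a single core typically looks like the minimal sketch below; handle_packet() is a placeholder standing in for the stack’s connection and L7 logic, and this is not Laconic’s actual code.

```cpp
// Minimal run-to-completion receive loop on one core (illustrative only).
#include <rte_ethdev.h>
#include <rte_mbuf.h>

// Placeholder for header parsing, connection lookup, and L7 processing.
void handle_packet(struct rte_mbuf *m);

constexpr uint16_t kBurst = 32;

void rx_loop(uint16_t port_id, uint16_t queue_id) {
    struct rte_mbuf *bufs[kBurst];
    for (;;) {
        // Poll a burst of packets: no interrupts, no syscalls, no copies.
        uint16_t n = rte_eth_rx_burst(port_id, queue_id, bufs, kBurst);
        for (uint16_t i = 0; i < n; i++) {
            struct rte_mbuf *m = bufs[i];
            handle_packet(m);
            // Forward the (possibly rewritten) packet; free it if TX is full.
            if (rte_eth_tx_burst(port_id, queue_id, &m, 1) == 0)
                rte_pktmbuf_free(m);
        }
    }
}
```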
Concurrent Hashtable: We evaluate our concurrent hash table design by measuring the throughput of two basic operations, insert and lookup, on the ARM cores. As Figure 11 shows, Laconic’s hash table achieves higher throughput than the hash table provided by the DPDK library and scales well with more cores. Furthermore, Table 3 shows that Laconic handles 3M packets per second, a rate our hash table can sustain but the alternatives cannot.
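For intuition about what such a table does (mapping a connection key to its state under concurrent access), here is a generic bucketed design with per-bucket locks; it is a minimal sketch, not Laconic’s actual data structure or synchronization scheme.

```cpp
// Generic sketch of a concurrent hash table with per-bucket locks.
#include <array>
#include <functional>
#include <list>
#include <mutex>
#include <optional>
#include <utility>

template <typename K, typename V, std::size_t kBuckets = 4096>
class ConcurrentHashTable {
    struct Bucket {
        std::mutex lock;                      // per-bucket lock limits contention
        std::list<std::pair<K, V>> entries;   // chaining for collisions
    };
    std::array<Bucket, kBuckets> buckets_;

    Bucket &bucket_for(const K &key) {
        return buckets_[std::hash<K>{}(key) % kBuckets];
    }

public:
    void insert(const K &key, V value) {
        Bucket &b = bucket_for(key);
        std::lock_guard<std::mutex> g(b.lock);
        b.entries.emplace_back(key, std::move(value));
    }

    std::optional<V> lookup(const K &key) {
        Bucket &b = bucket_for(key);
        std::lock_guard<std::mutex> g(b.lock);
        for (const auto &[k, v] : b.entries)
            if (k == key) return v;
        return std::nullopt;
    }
};
```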
4.5 Real-world Workload

To evaluate Laconic under a real-world workload, we adopt the workload from the CONGA work on datacenter traffic load balancing [9]. We extract the flow size distribution from their enterprise workload and use it to generate files of the corresponding sizes on the backend servers. On the client side, we use a customized Lua script for wrk to generate requests according to the distribution.
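As an illustration of how requests can be drawn from such a distribution (our actual generator is a Lua script driving wrk), the sketch below samples response sizes from an empirical CDF; the CdfPoint values would come from the CONGA enterprise trace and are not reproduced here.

```cpp
// Illustrative sampler: draw response sizes from an empirical CDF.
#include <cstdint>
#include <random>
#include <vector>

struct CdfPoint {
    double cum_prob;     // cumulative probability, non-decreasing, ends at 1.0
    uint64_t size_bytes; // response size associated with this CDF point
};

uint64_t sample_size(const std::vector<CdfPoint> &cdf, std::mt19937_64 &rng) {
    std::uniform_real_distribution<double> u(0.0, 1.0);
    const double p = u(rng);
    for (const auto &pt : cdf)
        if (p <= pt.cum_prob) return pt.size_bytes;
    return cdf.back().size_bytes;   // guard against floating-point edge cases
}
```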
Figure 12a shows the throughput of Laconic with the real-world workload on BlueField-2, which has only eight ARM cores. Laconic achieves 13.3x the throughput of Nginx running on BlueField-2 with a single core. Laconic reaches up to 75 Gbps with fewer cores, whereas Nginx needs more than 16 cores to achieve similar throughput.
Figure 12b shows Laconic’s performance on LiquidIO3. LIO3 has 24 ARM cores, so we run the tests on LIO3 with up to 24 cores. Nginx on LIO3 runs at a very low throughput (5.21 Gbps). Although Nginx on x86 scales linearly, it needs more than 16 cores to reach just 50 Gbps. In contrast, Laconic needs only two ARM cores on LIO3 to match the 50 Gbps NIC bandwidth.
This paper is available on arxiv under CC BY-NC-ND 4.0 DEED license.