Authors:
(1) Tianyi Cui, University of Washington ([email protected]);
(2) Chenxingyu Zhao, University of Washington ([email protected]);
(3) Wei Zhang, Microsoft ([email protected]);
(4) Kaiyuan Zhang, University of Washington ([email protected]).
Editor's note: This is Part 2 of 6 of a study detailing attempts to optimize layer-7 load balancing.
This section discusses the structure, connectivity, and performance of programmable multi-core SmartNICs that are commonly available. We also describe the load-balancing functionality we want to deploy on these SmartNICs.
We consider programmable SmartNICs equipped with multicore processors. A typical SmartNIC has onboard memory, DMA engines, and accelerators (e.g., engines for crypto, compression, and packet rewriting) in addition to the multi-core processor. Below, we discuss the two dominant categories, on-path and off-path SmartNICs [35].
On-Path SmartNICs: These are SmartNICs where the NIC processing cores sit on the data path between the network port and the host processor (see Figure 1). Consequently, every packet received or transmitted by the host is also processed by the NIC cores, and the performance of these cores is critical to the throughput and latency characteristics of the NIC. To compensate, these NICs typically augment their traditionally wimpy cores with specialized hardware support that enhances their packet-processing capabilities. For example, packet contents are prefetched into a structure similar to an L1 cache, and hardware mechanisms manage packet buffers. The SmartNIC has hardware mechanisms for delivering incoming packets to the NIC cores in a balanced manner, but no mechanisms, such as receive-side scaling (RSS), for steering specific packets to specific NIC cores. Further, the NIC cores can invoke specialized accelerators for tasks such as crypto and compression. Marvell LiquidIO [37] and Netronome NICs [40] are examples of on-path SmartNICs.
Off-Path SmartNICs: These are SmartNICs where the NIC’s processing cores are off the data path connecting the host to the network. A NIC-level switching fabric, a NIC switch, provides connectivity between the network port, the host cores, and the NIC cores. The NIC switch is a specialized hardware unit with match-action engines for selecting packet fields and rewriting them based on runtime-configurable rules. The rules route packets received from the network to the host or the NIC, or rewrite them and immediately transmit them back into the network. Mellanox Bluefield [41] and Broadcom Stingray [10] are off-path SmartNICs.
Off-path SmartNICs thus contain both general-purpose cores and packet match-action engines, and the packet-processing logic can be split across the two according to its complexity. For instance, the cores can perform complex packet processing and delegate simpler operations, such as packet steering, header rewriting, and packet forwarding, to the match-action engines, as sketched below.
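To make this division of labor concrete, the following Python sketch models a match-action table whose installed rules steer or rewrite packets directly, while table misses fall back to the NIC cores for complex processing. The class and field names are purely illustrative and do not correspond to any vendor SDK.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass(frozen=True)
class FiveTuple:
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    proto: str


@dataclass
class Action:
    kind: str  # "to_host", "to_nic_core", or "rewrite_and_forward" (illustrative labels)
    rewrite: Optional[Callable[[dict], dict]] = None  # optional header-rewrite function


class NicSwitch:
    """Toy model of an off-path NIC switch's match-action table (not a vendor API)."""

    def __init__(self) -> None:
        self.rules: Dict[FiveTuple, Action] = {}

    def install_rule(self, key: FiveTuple, action: Action) -> None:
        # Runtime-configurable entry, typically installed by software on the NIC cores.
        self.rules[key] = action

    def process(self, key: FiveTuple, headers: dict) -> Tuple[str, dict]:
        action = self.rules.get(key)
        if action is None:
            # Table miss: hand the packet to the NIC cores for complex processing.
            return "to_nic_core", headers
        if action.kind == "rewrite_and_forward" and action.rewrite is not None:
            # Simple header rewrite handled entirely by the match-action engine.
            return "to_network", action.rewrite(headers)
        return action.kind, headers
```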
Offloading to SmartNICs: Datacenter operators increasingly offload infrastructure functions to SmartNICs to reduce their usage of host cores, which can then be rented out and monetized. The ARM cores on SmartNICs are not as powerful as the host x86 cores, but they are an order of magnitude less expensive and yield substantial cost savings [3, 6]. Network virtualization, security, and storage are typical operations offloaded to SmartNICs; in this work, we seek to expand this set to include the infrastructure's load-balancing function.
We seek to offload our load-balancing capability onto both on-path and off-path SmartNICs. Others have noted that the communication latencies between the SmartNIC cores and the host cores can be significant in the case of off-path SmartNICs [35], so we target complete offloads for both types of SmartNICs, i.e., all load-balancing logic is performed on the SmartNIC cores. Further, a complete offload lets us target even headless SmartNICs, which pair an independent power supply with a carrier card to operate without a host (e.g., Broadcom Stingray PS1100R [1]), enabling substantial reductions in both hardware and operating costs.
Load balancers operating at different network layers are widely deployed inside data centers to deliver traffic to services. They fall into two categories: Layer 4 and Layer 7.
An L4 load balancer maps a virtual IP address (VIP) to a list of backend servers, each with its own dynamic IP address (DIP). As L4 load balancers operate at the transport layer, the routing decision is based solely on the transport/IP packet headers (i.e., the 5-tuple of IP addresses, ports, and protocol) and not the payload. Several L4 load balancers are deployed by cloud providers [14, 16, 43].
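The Python sketch below illustrates this mapping under assumed example addresses: the VIP is resolved to one of its DIPs by hashing the connection's 5-tuple, so the decision never inspects the payload.

```python
import hashlib

# Example VIP-to-DIP mapping; all addresses are illustrative.
VIP_TO_DIPS = {
    "10.0.0.100": ["192.168.1.10", "192.168.1.11", "192.168.1.12"],
}


def pick_dip(vip: str, src_ip: str, src_port: int, dst_port: int, proto: str = "tcp") -> str:
    """Choose a backend (DIP) from the 5-tuple alone; application data is never read."""
    dips = VIP_TO_DIPS[vip]
    key = f"{src_ip}:{src_port}->{vip}:{dst_port}/{proto}".encode()
    # Hashing the 5-tuple keeps every packet of a connection on the same DIP.
    idx = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") % len(dips)
    return dips[idx]
```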
L7 load balancers are significantly more complex and operate on content at the application layer (e.g., HTTP data). Many services inside a data center may share a common application gateway that implements the L7 load-balancing functionality. The L7 load balancer dispatches requests to the corresponding backend servers based on the service requested; services are commonly differentiated by the URL [23]. To do so, the load balancer reassembles the TCP stream, extracts the requested URL, and matches it against various patterns to route the request to the corresponding service.
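As a rough illustration, the sketch below shows URL-based dispatch once the request has been reassembled and parsed; the path prefixes and pool names are hypothetical.

```python
# URL-prefix routing table; prefixes and pool names are illustrative.
URL_ROUTES = [
    ("/api/search/", "search-backends"),
    ("/api/cart/", "cart-backends"),
    ("/static/", "static-backends"),
]


def route_request(path: str, default_pool: str = "default-backends") -> str:
    """Match the request's URL path against per-service patterns to pick a backend pool."""
    for prefix, pool in URL_ROUTES:
        if path.startswith(prefix):
            return pool
    return default_pool


# e.g., route_request("/api/cart/items") returns "cart-backends"
```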
It is common for an L7 load balancer to modify the streamed application data, e.g., insert an x-forwarded-for header to inform the backend server of the real IP address of the client. The load balancer may also modify the reply from the server to inject a cookie into the response. Future requests from the same client can then be routed to the same backend server based on the cookie.
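The sketch below illustrates both rewrites under assumed names (the lb_backend cookie is hypothetical): the request gains an x-forwarded-for header, the server's reply gains an affinity cookie, and later requests carrying that cookie are pinned to the same backend.

```python
from typing import Dict, Optional


def add_x_forwarded_for(request_headers: Dict[str, str], client_ip: str) -> Dict[str, str]:
    """Insert (or append to) x-forwarded-for so the backend sees the client's real IP."""
    headers = dict(request_headers)
    existing = headers.get("x-forwarded-for")
    headers["x-forwarded-for"] = f"{existing}, {client_ip}" if existing else client_ip
    return headers


def inject_affinity_cookie(response_headers: Dict[str, str], backend_id: str) -> Dict[str, str]:
    """Rewrite the server's reply to carry an affinity cookie (name is illustrative)."""
    headers = dict(response_headers)
    headers["set-cookie"] = f"lb_backend={backend_id}; Path=/; HttpOnly"
    return headers


def backend_from_cookie(request_headers: Dict[str, str]) -> Optional[str]:
    """On later requests, the cookie (if present) pins the client to the same backend."""
    for part in request_headers.get("cookie", "").split(";"):
        name, _, value = part.strip().partition("=")
        if name == "lb_backend":
            return value or None
    return None
```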
Due to the complexity of the functionality an L7 load balancer provides, it is typically implemented as a dedicated application on top of the kernel networking stack, e.g., Nginx [15], Envoy [18], or HAProxy [24]. Given the overheads of the OS's networking stack and application-level load balancing, the performance of L7 load balancers is an order of magnitude lower than that of L4 load balancers, with 50% to 90% of the processing time spent inside the kernel [25, 31].
This paper is available on arxiv under CC BY-NC-ND 4.0 DEED license.