Your Internet Traffic Is Slower Than It Should Be—Laconic Fixes That

by Load Balancer Technologies, February 14th, 2025

Too Long; Didn't Read

Researchers have optimized Layer-7 load balancing using programmable SmartNICs to improve efficiency and reduce cost and energy use in cloud data centers.

Authors:

(1) Tianyi Cui, University of Washington ([email protected]);

(2) Chenxingyu Zhao, University of Washington ([email protected]);

(3) Wei Zhang, Microsoft ([email protected]);

(4) Kaiyuan Zhang, University of Washington ([email protected]).

Editor's note: This is Part 6 of 6 of a study detailing attempts to optimize Layer-7 load balancing. Read the rest below.

6 Conclusion

We propose Laconic, which explores the potential of offloading L7 load balancing onto SmartNICs and addresses the associated challenges. In Laconic, we employ a lightweight network stack to avoid the overhead of a heavyweight, feature-rich stack. To minimize the cost of synchronization operations, we design lightweight synchronization mechanisms that enable higher concurrency and scalability. To further reduce the burden on the SmartNIC's general-purpose cores, we offload packet processing to the hardware acceleration engines available on SmartNICs. This paper does not raise any ethical issues.
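Laconic's actual implementation is not reproduced in this excerpt, but the synchronization idea can be sketched. The minimal C sketch below shows one common way to get the kind of lock-free fast path the conclusion describes: shard per-flow state by core so the owning core reads and writes it without locks, and fall back to a relaxed atomic only for cross-core counters. All names and structures here (flow_state, owning_core, and so on) are illustrative assumptions, not Laconic's code.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_CORES 8
#define SLOTS_PER_CORE 1024

/* Hypothetical per-flow state; the field names are illustrative. */
struct flow_state {
    uint64_t key;        /* hash of the flow's 5-tuple */
    uint32_t backend_id; /* backend chosen for this connection */
    uint32_t bytes_seen; /* response bytes observed so far */
};

/* Each core owns a private shard, so fast-path lookups and updates
 * need no locks at all. */
static struct {
    struct flow_state slots[SLOTS_PER_CORE];
} shards[NUM_CORES];

/* Cross-core statistics use one relaxed atomic instead of a mutex. */
static _Atomic uint64_t total_connections;

/* Steer a flow to its owning core the way RSS hashing would. */
static unsigned owning_core(uint64_t flow_hash) {
    return (unsigned)(flow_hash % NUM_CORES);
}

/* Direct-mapped lookup; a real table would handle collisions and
 * the all-zero key, omitted here for brevity. */
static struct flow_state *lookup_or_insert(unsigned core, uint64_t flow_hash) {
    struct flow_state *f = &shards[core].slots[flow_hash % SLOTS_PER_CORE];
    if (f->key != flow_hash) {   /* first packet of a new flow */
        f->key = flow_hash;
        f->bytes_seen = 0;
        atomic_fetch_add_explicit(&total_connections, 1,
                                  memory_order_relaxed);
    }
    return f;
}

int main(void) {
    /* Two packets of one flow arrive on the flow's owning core. */
    uint64_t h = 0x9e3779b97f4a7c15ULL;
    unsigned core = owning_core(h);
    struct flow_state *f = lookup_or_insert(core, h);
    f->bytes_seen += 1500;
    f = lookup_or_insert(core, h);
    f->bytes_seen += 1500;
    printf("core %u: %u bytes seen, %llu connections total\n",
           core, f->bytes_seen,
           (unsigned long long)atomic_load(&total_connections));
    return 0;
}
```

The point of the pattern is that the per-packet fast path touches only core-local memory, so adding NIC cores scales throughput without lock contention.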


A Appendix


Figure 13: Throughput of Laconic on BF-2 with All Cores




Figure 14: Throughput of Laconic on LIO-3 with All Cores


A.1 Throughput of Laconic using all cores

Figure 13 shows the throughput of Laconic on BlueField-2 with seven ARM cores [1]. The BlueField-2 is equipped with 2x100Gbps ports. We enable both ports together to further stress the processing power of the ARM cores. For the baseline of Nginx running on commodity x86 servers, we use the 2x100Gbps ports of a ConnectX-5 NIC to match the link speed. However, although the ConnectX-5 has 2x100Gbps ports, the PCIe 3.0 bottleneck on our testbed servers limits it to about 120 Gbps. As shown in Figure 13, Laconic with only seven NIC cores achieves line rate[2] for large message sizes. For small message sizes, Laconic achieves performance comparable to Nginx running on x86. BlueField-2 has wimpy ARM cores, and Nginx running on ARM performs poorly, as expected. There is a clear increase in throughput for messages larger than 1M because Laconic offloads packet processing to the hardware engine for response messages larger than 1M. Other than the first few packets, all remaining packets in such a flow are processed solely by the hardware packet processing engine. This mitigates the head-of-line blocking caused by limited processing power and also frees CPU resources for small or future responses. In Section 4.4, we show experiments with varying offloading thresholds.
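To make the offloading decision concrete, here is a hedged C sketch of the threshold logic described above: once the response length is known from header parsing in software, a flow whose response exceeds the threshold gets a hardware rule installed so its remaining packets bypass the ARM cores. The constant, struct fields, and install_hw_rule() are illustrative stand-ins; Laconic's real interface to the packet-processing engine is not shown in this excerpt.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* The text offloads flows whose responses exceed 1M; this constant
 * is an illustrative assumption. */
#define OFFLOAD_THRESHOLD (1u << 20)   /* 1 MiB */

struct flow {
    uint64_t id;
    uint64_t response_len;   /* parsed from the response headers */
    bool     hw_offloaded;   /* later packets bypass the ARM cores */
};

/* Stand-in for programming the NIC's hardware packet-processing
 * engine, e.g. installing a match/action rule for this flow. */
static void install_hw_rule(struct flow *f) {
    f->hw_offloaded = true;
    printf("flow %llu: remaining packets handled in hardware\n",
           (unsigned long long)f->id);
}

/* Called once the response length is known in software. Only the
 * first few packets of a large response touch the ARM cores. */
static void maybe_offload(struct flow *f) {
    if (!f->hw_offloaded && f->response_len > OFFLOAD_THRESHOLD)
        install_hw_rule(f);
}

int main(void) {
    struct flow small = { .id = 1, .response_len = 64u * 1024 };
    struct flow big   = { .id = 2, .response_len = 4ull << 20 };
    maybe_offload(&small);   /* small response: stays in software */
    maybe_offload(&big);     /* large response: offloaded */
    return 0;
}
```

This split matches the behavior seen in Figure 13: throughput jumps for messages above the threshold because the ARM cores stop being the bottleneck for those flows.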


Figure 14 shows the throughput of Laconic on the LiquidIO3 50G NIC with 24 ARM cores. Since the LiquidIO3 only supports a 50 Gbps link speed, we also cap the x86 baselines at 50 Gbps for a fair comparison across all the experiments involving LiquidIO3. Nginx on the ARM cores of the SmartNIC (bar "LIO3 Nginx") has low throughput, plateauing at around 5 Gbps even with 24 cores. Laconic easily reaches the full NIC bandwidth with a 10K response size, whereas Nginx running on an x86 server needs 32 beefy cores to achieve comparable performance.


[1] BlueField-2 has eight available ARM cores in total. We reserve one core for the necessary background tasks.


[2] Since the same port on the SmartNIC receives both the requests from the clients and the responses from the backend servers, the theoretical upper limit for the load balancer's goodput is lower than the 2x100Gbps line rate.
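To make this bound concrete (our back-of-the-envelope reading, not an equation from the paper): the same 2x100Gbps receive path must carry both the client requests, at some rate R, and the backend responses that become the goodput G, so

G + R <= 200 Gbps, hence G < 200 Gbps whenever R > 0.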


This paper is available on arXiv under a CC BY-NC-ND 4.0 license.