The Path to DPDK Speeds for AF XDP

AF_XDP is a new socket type for raw frames that was introduced in 4.18 (in linux-next at the time of writing). The current code base delivers throughput north of 20 Mpps per application core for 64-byte packets, but a number of optimizations can be applied to increase throughput further. This paper is about the performance optimizations AF_XDP needs in order to perform as fast as DPDK.
We divide the optimizations into two broad categories: those that are transparent to the application, and those that require additions to the uapi. In the first category we examine the following:

Relaxing the requirement to have an XDP program. If the user does not need an XDP program and only a single AF_XDP socket is bound to a specific queue, no XDP program is required. This shaves a fair number of cycles off the Rx path.
Busy polling from user space. If the application writer is using epoll() and friends, this can eliminate the split between the Rx (NAPI) core and the application core, since all the work is now done on a single core. This should improve performance for many use cases. In this context, it might also be worth revisiting the old idea of threaded NAPI.

Optimizing instruction-cache usage through batching, as in Cisco's VPP stack and Edward Cree's net-next RFC "Handle multiple received packets at each stage".

In the uapi extension category, we examine the following optimizations:

Supporting a new mode for NICs with in-order Tx completions. In this mode the completion ring is not used; instead, the application simply examines a pointer in the Tx ring to see whether a packet has completed. No backpressure is needed between the completion ring and the Tx ring, and nothing has to be populated or published in the completion ring, since it is not used. This should significantly improve Tx performance on in-order NICs.
Introducing a "type-writer" model in which each chunk can contain multiple packets. This is the model Chelsio uses in its NICs, but experiments show that it can also give regular NICs better performance, because fewer transactions are needed on the queues. It requires a new flag in the options field of the descriptor.

With these optimizations, we believe the goal of close to 40 Mpps of throughput for 64-byte packets in zero-copy mode is achievable. The final paper will include a full analysis of the performance data.


1. The Path to DPDK Speeds for AF XDP
Magnus Karlsson, magnus.karlsson@intel.com
Björn Töpel, bjorn.topel@intel.com
Linux Plumbers Conference, Vancouver, 2018

2. Legal Disclaimer
• Intel technologies may require enabled hardware, specific software, or services activation. Check with your system manufacturer or retailer.
• No computer system can be absolutely secure.
• Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.
• Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.
• All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.
• No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
• Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.
• Intel, the Intel logo, and other Intel product and solution names in this presentation are trademarks of Intel.
• Other names and brands may be claimed as the property of others.
• © 2018 Intel Corporation.

3. XDP 101
[Diagram: the Linux networking stack, from network hardware through the device driver (where XDP runs and can drop a frame, build an sk_buff, or redirect), the TC BPF layer, queueing and forwarding, the IP and TCP/UDP layers, up to AF_INET, AF_PACKET, and AF_XDP sockets serving applications, VMs, containers, and control-plane software via BPF maps. "Kernel diagram" by Toke Høiland-Jørgensen, licensed under CC BY-SA.]

4. AF XDP 101
• Ingress
  • Userspace XDP packet sink
  • XDP_REDIRECT to socket via XSKMAP
• Egress
  • No XDP program
• Register userspace packet buffer memory to kernel (UMEM)
• Pass packet buffer ownership via descriptor rings

5. AF XDP 101
[Diagram: application and kernel exchanging ownership of packet buffers through four rings: fill, Rx, Tx, and completion.]
• Fill ring (to kernel) / Rx ring (from kernel)
• Tx ring (to kernel) / Completion ring (from kernel)
• Copy mode (DMA to/from kernel-allocated frames, copy data to user)
• Zero-copy mode (DMA to/from user-allocated frames)

6. Baseline and optimization strategy
• Baseline
  • Linux 4.20
  • 64B @ ~15-22 Mpps
• Strategy
  • Do less (instructions)
  • Talk less (coherency traffic)
  • Do more at the same time (batching, i$)
  • Land of Spectres: fewer retpolines, fewer retpolines, fewer retpolines

7. Experimental Setup
• Broadwell E5-2660 @ 2.7 GHz
  • 2 cores used for run-to-completion benchmarks
  • 1 core used for busy-poll benchmarks
• 2 i40e 40 Gbit/s NICs, 2 AF XDP sockets
• Ixia load generator blasting at full 40 Gbit/s per NIC

8. Ingress
• XDP_ATTACH and bpf_xsk_redirect: attach at most one socket per netdev queue, load built-in XDP program, 2-level hierarchy
• Remove indirect call: bpf_prog_run_xdp
• Remove indirect call: XDP actions switch-statement (>= 5 cases ⇒ jump table)
• Driver optimizations (batching, code restructure)
• bpf_prog_run_xdp, xdp_do_redirect and xdp_do_flush_map: per-CPU struct bpf_redirect_info + struct xdp_buff + struct xdp_rxq_info vs explicit, stack-based context

9. Ingress, results, data not touched (rxdrop, Mpps)
• Baseline: 15.1
• XDP_ATTACH: 17.1
• Remove indirect call in XDP path: 23.4
• Replace switch in driver: 31.5
• Various driver opts: 36.8
• Explicit context in XDP path: 39.3
Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance/datacenter.

10. Egress
• Tx performance capped per HW queue ⇒ multiple Tx sockets per UMEM
• Larger/more batching, larger descriptor rings
• Dedicated AF XDP HW Tx queues
• In-order completion: setsockopt XDP_INORDER_COMPLETION
[Diagram: standard case vs in-order completion; Tx and completion rings with head/tail pointers and descriptors.]

11. Egress, results, data not touched (txpush, Mpps)
• Baseline: 25.3
• Multiple Tx queues: 25.9
• Batch size and descriptor ring changes: 54.1
• Optimized cleanup not shared with XDP: 58.0
• In-order completion: 68.0

12. Busy poll() vs run-to-completion
[Diagram: with busy poll(), the application and the Rx/Tx poll() processing share core 1; with run-to-completion, the application runs on core 2 while Rx/Tx softirq processing runs on core 1.]

13. Busy poll() vs run-to-completion, results (Mpps)
• rxdrop: run-to-completion 39.3, poll() 30.4
• txpush: run-to-completion 68.0, poll() 51.1
• l2fwd: run-to-completion 22.4, poll() 16.4

14. Comparison with DPDK
• Userspace, vectorized drivers
• "Learning from the DPDK": http://vger.kernel.org/netconf2018_files/StephenHemminger_netconf2018.pdf

15. Comparison with DPDK, results (Mpps)
• rxdrop: AF_XDP run-to-completion 39.3, AF_XDP poll() 30.4
• txpush: AF_XDP run-to-completion 68.0, AF_XDP poll() 51.1
• l2fwd: AF_XDP run-to-completion 22.4, AF_XDP poll() 16.4, DPDK scalar driver 20.0, DPDK vectorized driver 22.5
• DPDK on the no-touch benchmarks (rxdrop and txpush): scalar and vectorized results of 52.8, 64.2, 73.0, and 73.7

16. Next steps
Upstream!
• XDP: switch-statement
• Rx/Tx: drivers
• Rx: XDP_ATTACH and bpf_xsk_redirect
• libbpf AF XDP support
• Tx: multiple Tx sockets per UMEM
• selftest, samples

17. Future work
• Hugepage support, less fill ring traffic (get_user_pages)
• fd.io/VPP work vectors (i$, explicit batching in function calls)
• "XDP first" drivers
• Collaborate/share code with RDMA (e.g. get_user_pages)
• Type-writer model (currently not planned)

18. Summary
• Rx: 15.1 to 39.3 Mpps (2.6x)
• Tx: 25.3 to 68.0 Mpps (2.7x)
• Busy poll() promising
• DPDK still faster for "no-touch", but AF XDP on par when data is touched
• Drivers need to change when the skb is not the only consumer

19. Thanks!
• Ilias Apalodimas
• Daniel Borkmann
• Jesper Dangaard Brouer
• Willem De Bruijn
• Eric Dumazet
• Alexander Duyck
• Mykyta Iziumtsev
• Jakub Kicinski
• Song Liu
• David S. Miller
• Pavel Odintsov
• Sridhar Samudrala
• Yonghong Song
• Alexei Starovoitov
• William Tu
• Anil Vasudevan
• Jingjing Wu
• Qi Zhang
