XDP Acceleration using NIC Metadata

This talk is a continuation of the original XDP hardware hints work presented at Netdev 2.1 in Seoul, South Korea.
It will begin by showcasing a new prototype that allows XDP programs to request the HW-generated metadata hints they need from the NIC. The talk will show how the NIC generates hints, along with the performance characteristics of various XDP applications. We also want to demonstrate how such metadata can help applications that use AF_XDP sockets.
It will then discuss the ideas planned for upstreaming, and we look forward to further discussion with a wider community audience around implementation details, the programming flow, and more.


1. Neerav Parikh, PJ Waskiewicz (Intel Corporation, Networking Division), Saeed Mahameed (Mellanox). Linux Plumbers Conference, Nov. 2018, Vancouver, BC, Canada

2. Overview
• XDP Acceleration – Netdev 2.1 Recap
• XDP Performance Results
  § L4 Load Balancer
  § xdp_tx_ip_tunnel
• XDP NIC Rx Metadata Requirements
• XDP NIC Rx Metadata Programming Model
• Next Steps

3. XDP Acceleration – Netdev 2.1 Recap

What can present-day NIC HW do to help XDP programs with packet processing?
§ Accelerate what is being done in the XDP program
§ Offset some of the CPU cycles used for packet parsing
• Keep it consistent with the XDP philosophy
  § Avoid kernel changes as much as possible
  § Keep it HW agnostic as much as possible
  § Best-effort acceleration
  § A framework that can change with the changing needs of packet processing
• Expose the flexibility provided by a programmable packet processing pipeline to adapt to XDP program needs
• Help design the next generation of hardware to take full advantage of XDP and the kernel framework

Open questions:
• How do you dynamically program the hardware so it gives the XDP program the right kind of packet parsing help?
• How do you pass the packet parsing/map lookup hints that the HW provides with every packet into the XDP program so that it can benefit from them?

4. Netdev 2.1 Recap – Performance Data
• XDP1: Linux kernel sample; parses the packet to identify the protocol, then counts and drops
• XDP3: zero packet parsing (best-case scenario); just drops all packets
• XDP_HINTS: uses the packet type (IPv4/v6, TCP/UDP, etc.) provided by the driver as metadata; no packet parsing, count and drop (sketched below)
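To make the XDP_HINTS variant concrete, here is a minimal sketch of such a program: it trusts a driver-supplied packet type in the metadata area instead of parsing headers, then counts and drops. The struct rx_hints layout is an assumption for illustration; the experimental driver patches define their own format.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Hypothetical hint layout; real drivers define their own format. */
struct rx_hints {
	__u16 ptype;	/* HW-classified packet type (IPv4/v6, TCP/UDP, ...) */
};

struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, 256);
	__type(key, __u32);
	__type(value, __u64);
} ptype_count SEC(".maps");

SEC("xdp")
int xdp_hints_drop(struct xdp_md *ctx)
{
	void *data_meta = (void *)(long)ctx->data_meta;
	void *data = (void *)(long)ctx->data;
	struct rx_hints *hints = data_meta;
	__u64 *cnt;
	__u32 key;

	/* No hints present: the driver left data_meta == data. */
	if ((void *)(hints + 1) > data)
		return XDP_DROP;

	key = hints->ptype & 0xff;	/* bound the key for the verifier */
	cnt = bpf_map_lookup_elem(&ptype_count, &key);
	if (cnt)
		(*cnt)++;

	return XDP_DROP;	/* count and drop, as in the benchmark */
}

char _license[] SEC("license") = "GPL";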

5. L4 Load Balancer Performance
• L4 LB: L4 load balancer sample application with multiple virtual IP tunnels, forwarding packets to a destination based on hash calculations and lookup
• Hints Type 1: protocol type (IPv4/v6, TCP or UDP, etc.)
• Hints Type 2: the Type 1 hints plus packet data such as source/destination IP addresses, source/destination ports, and the packet hash index (RSS) generated by hardware

[Chart: "XDP L4 LB – with no state tracking", packets/s (0 to 16,000,000), comparing XDP LB No Hints, Hints Type 1, and Hints Type 2, each with 1 and 4 Rx queues]

6. L4 Load Balancer Performance

No visible advantage in performance from packet parsing hints alone when the XDP application is doing state tracking and connection management.

[Chart: "XDP L4 LB – with state tracking", packets/s (0 to 10,000,000), comparing XDP LB No Hints, Hints Type 1, and Hints Type 2, each with 1 and 4 Rx queues]

https://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue.git/log/?h=XDP-hints-EXPERIMENTAL

7. L4 Load Balancer Performance Analysis (Projected)

[Charts: projected packets/s for "XDP L4 LB – with no state tracking" and "XDP L4 LB – with state tracking" (1Q), comparing PPS without any hints, the % improvement in PPS with HW hints, and the % change in PPS with SW inline (driver-) generated hints. HW hints project gains of roughly +6% to +77% depending on hint type and workload, while SW-generated inline hints cost roughly 5-8%.]

8. xdp_tx_ip_tunnel with HW Flow Mark
• Modified xdp_tx_iptunnel kernel sample
• Needs an extra map, flow2tnl, similar to vip2tnl
• Set up a TC rule to mark packets matching the well-known VIP (dst IP, protocol, and dst port) with a unique flow mark
• The XDP Rx metadata includes a flow_mark used to fetch the tunnel from the flow2tnl map (see the sketch below)

* Saeed Mahameed (Mellanox)
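A sketch of how the map-lookup side of this could look, assuming a metadata layout with a single flow_mark field and a trimmed-down tunnel value type (both hypothetical; the patches linked later define the real structures):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Assumed Rx metadata layout carrying the HW-reported flow mark. */
struct rx_meta {
	__u32 flow_mark;
};

/* Trimmed stand-in for the sample's tunnel descriptor. */
struct iptnl_info {
	__u32 saddr;
	__u32 daddr;
	__u16 family;
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 256);
	__type(key, __u32);		/* flow mark set by the TC rule */
	__type(value, struct iptnl_info);
} flow2tnl SEC(".maps");

SEC("xdp")
int xdp_tx_iptunnel_marked(struct xdp_md *ctx)
{
	void *data_meta = (void *)(long)ctx->data_meta;
	void *data = (void *)(long)ctx->data;
	struct rx_meta *meta = data_meta;
	struct iptnl_info *tnl;

	/* No HW mark delivered: fall back to the normal parsing path. */
	if ((void *)(meta + 1) > data)
		return XDP_PASS;

	tnl = bpf_map_lookup_elem(&flow2tnl, &meta->flow_mark);
	if (!tnl)
		return XDP_PASS;

	/* ...encapsulate using *tnl and transmit, as in xdp_tx_iptunnel... */
	return XDP_TX;
}

char _license[] SEC("license") = "GPL";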

9. XDP and Rx Metadata Requirements

XDP program Rx metadata type selection:
§ Legacy NICs: fixed, vendor-specific metadata structures provided in Rx descriptors or completions – Intel 82599 (ixgbe), 7xx Series (i40e)
§ Programmable NICs: flexible Rx descriptors allow customization of the Rx metadata based on the use case – Intel E800 Series (ice)

Association of Rx metadata type with Rx queues:
§ XDP programs should run regardless of whether Rx metadata is enabled – legacy programs should run without requiring metadata (a fallback sketch follows below)
§ Granularity of configuration:
  – All Rx queues: the same fixed or flexible metadata format
  – Per Rx queue: fixed or flexible metadata per queue; for example, an XDP program and an AF_XDP-based application on a given Rx queue may each need different Rx metadata
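A minimal sketch of that first requirement, reusing the same hypothetical rx_hints layout as above: prefer the HW hint when the driver delivered one, otherwise fall back to parsing the packet, so the same program runs whether or not metadata is enabled. (How a hint ptype maps onto protocol values is driver-defined and glossed over here.)

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

struct rx_hints {
	__u16 ptype;	/* hypothetical HW packet-type hint */
};

/* Returns a protocol identifier: from HW metadata when present,
 * from a SW parse of the Ethernet header otherwise.
 */
static __always_inline __u16 pkt_proto(struct xdp_md *ctx)
{
	void *data_meta = (void *)(long)ctx->data_meta;
	void *data = (void *)(long)ctx->data;
	void *data_end = (void *)(long)ctx->data_end;
	struct rx_hints *hints = data_meta;
	struct ethhdr *eth = data;

	if ((void *)(hints + 1) <= data)
		return hints->ptype;		/* HW hint available */

	if ((void *)(eth + 1) > data_end)
		return 0;			/* runt frame */
	return bpf_ntohs(eth->h_proto);		/* SW fallback */
}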

10. XDP Metadata Programming Model
• Need a mechanism that allows metadata type, or generic type, information to be exchanged between the SW driver and XDP programs
• The supported XDP metadata is configured either at XDP program load time or at compile time

[Diagram: Netdev 2.1 proposal]

11. XDP Metadata Programming Model – Solution Options

Option #1 (Fields Offset Array):
• Well-known XDP metadata types, defined by the kernel
• A program can request any subset of the well-known metadata fields from the driver
• Offset array (a fuller access sketch follows below):
  – The driver fills the metadata buffer in a pre-defined order according to the requested metadata fields (ascending order by the field enum)
  – The user program accesses a specific field via the pre-defined (calculated) offset array:
    flow_mark = xdp->data_meta + offset_array[XDP_META_FLOW_MARK];

Option #2 (BTF):
• BTF support, added in 4.15+ by Facebook, provides eBPF program and map metadata descriptions
• 2(a): extend that to NIC metadata programming – describe metadata formats via the driver's ndo_bpf() callback to determine whether the HW can offload/provide such metadata; during sys_bpf(prog_load) the kernel checks the supplied BTF
• 2(b): optionally, driver + firmware keep the metadata layout in BTF format; a user can query the driver and generate a normal C header file based on the BTF in the given NIC
• Every NIC can have its own metadata layout with its own meaning for the fields; standardize at least a few common fields, like hash

* Inputs from Saeed Mahameed (Mellanox)
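A slightly fuller sketch of the Option #1 access pattern, with made-up field enums and offsets (in a real flow the offset array would be computed at load time from the subset of fields the driver agreed to deliver):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Illustrative well-known field enum; the kernel would define this. */
enum xdp_meta_field {
	XDP_META_PTYPE,
	XDP_META_HASH,
	XDP_META_FLOW_MARK,
	XDP_META_MAX,
};

/* Offsets in ascending field-enum order, as the driver fills them.
 * Hard-coded here; in practice derived from the negotiated field set.
 */
static const __u16 offset_array[XDP_META_MAX] = {
	[XDP_META_PTYPE]     = 0,	/* __u16 */
	[XDP_META_HASH]      = 2,	/* __u32 */
	[XDP_META_FLOW_MARK] = 6,	/* __u32 */
};

SEC("xdp")
int xdp_meta_offsets(struct xdp_md *ctx)
{
	void *data_meta = (void *)(long)ctx->data_meta;
	void *data = (void *)(long)ctx->data;
	__u32 flow_mark;

	/* One bounds check covering the furthest field we will read. */
	if (data_meta + offset_array[XDP_META_FLOW_MARK] + sizeof(__u32) > data)
		return XDP_PASS;

	flow_mark = *(__u32 *)(data_meta + offset_array[XDP_META_FLOW_MARK]);

	return flow_mark ? XDP_PASS : XDP_DROP;	/* placeholder use */
}

char _license[] SEC("license") = "GPL";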

12. XDP Metadata Programming Model – Pros vs. Cons of Option #2 (BTF) Compared to Option #1 (Fields Offset Array)

Pros:
• Allows vendor-defined or vendor-specific offloads to be enabled without requiring kernel support
• The metadata layout is known to the BPF program at load time, so it doesn't need to use offsets at run time (see the sketch below)

Cons:
• The XDP program has to be compiled/recompiled with the correct metadata type for a given SW+FW+HW combination
• Standardizing some fields comes down to naming conventions between different NIC vendors, and overlap of these fields across vendors may create issues

* Input from Saeed Mahameed (Mellanox)
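For contrast, a sketch of what the 2(b) flow might produce: a C header generated from the driver's BTF, giving named, load-time-known fields at the cost of being tied to one SW+FW+HW combination. The struct and its fields are invented for illustration.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Imagined output of "query driver BTF, emit C header" for one NIC.
 * Another NIC (or firmware) could emit a different layout, which is
 * exactly the recompile cost listed under Cons.
 */
struct nic_xdp_rx_meta {
	__u32 flow_mark;
	__u32 rx_hash;		/* a candidate common/standardized field */
	__u16 ptype;
	__u16 vlan_tci;
};

SEC("xdp")
int xdp_meta_btf(struct xdp_md *ctx)
{
	void *data_meta = (void *)(long)ctx->data_meta;
	void *data = (void *)(long)ctx->data;
	struct nic_xdp_rx_meta *meta = data_meta;

	if ((void *)(meta + 1) > data)
		return XDP_PASS;

	/* Direct named access, no run-time offset lookups. */
	return meta->flow_mark ? XDP_PASS : XDP_DROP;
}

char _license[] SEC("license") = "GPL";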

13. XDP Acceleration Using NIC HW: Current Status
• Rx metadata WIP/RFC-level patches:
  • Intel (WIP):
    • https://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue.git/commit/?h=XDP-hints-EXPERIMENTAL
  • Mellanox:
    • [RFC bpf-next 0/6] XDP RX device meta data acceleration (WIP): https://www.spinics.net/lists/netdev/msg509814.html
    • [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure: https://www.spinics.net/lists/netdev/msg509820.html
    • https://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git/commit/?h=topic/xdp_metadata&id=5f2908515bf64d72684b2bf902acb1a8d9af2d44
• Alexei and Daniel's proposal on the netdev mailing list:
  • https://www.spinics.net/lists/netdev/msg509820.html

14. XDP Acceleration Using NIC HW: Next Steps
• The community needs to agree on an approach to the Rx metadata programming model that gives users flexibility across various use cases and applications
• Chaining and metadata placement in the xdp buffer:
  • Chaining can easily be achieved by calling the bpf_xdp_adjust_meta helper from the chained programs (see the sketch below)
  • Having the metadata fields sit immediately before the actual packet buffer (xdp->data) is OK, BUT:
    • When bpf_xdp_adjust_head is required (header rewrite) and the metadata buffer is filled, a memmove() of the metadata is required (performance hit)
    • Invalidating the metadata once consumed would break chaining
    • Placing the metadata starting at xdp_buff.data_hard_start is complicated

* Input from Saeed Mahameed (Mellanox)
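The chaining point can be illustrated with the existing bpf_xdp_adjust_meta() helper (available since kernel 4.15): a program in the chain grows the metadata area in front of xdp->data and leaves a field for the next program. The stage1_meta struct is illustrative.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Illustrative per-stage metadata handed to the next program. */
struct stage1_meta {
	__u32 verdict_hint;
};

SEC("xdp")
int xdp_stage1(struct xdp_md *ctx)
{
	struct stage1_meta *meta;
	void *data;

	/* A negative delta grows the metadata area in front of xdp->data. */
	if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(*meta)))
		return XDP_ABORTED;

	/* Pointers must be re-loaded and re-checked after the adjust. */
	meta = (void *)(long)ctx->data_meta;
	data = (void *)(long)ctx->data;
	if ((void *)(meta + 1) > data)
		return XDP_ABORTED;

	meta->verdict_hint = 1;	/* consumed by the next program in the chain */
	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";

Note that this area directly in front of xdp->data is exactly what a later bpf_xdp_adjust_head() would have to memmove out of the way, which is the performance concern listed above.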

15. XDP Acceleration Using NIC HW: Next Steps
• Tx metadata and processing hints:
  • As with Rx, we need a way to configure/consume Tx metadata flowing from applications to the HW via SW drivers
  • Provide hints to take advantage of HW offloads/accelerations such as checksums, packet processing/forwarding, QoS, etc.
• Programming rules in NIC HW to accelerate flow lookups and actions:
  – Advantage of taking actions prior to Rx in software (e.g. drop, or forwarding to an Rx queue)
  – Currently a tc u32/flower or ethtool-based model for enabling HW offloads and match-action rules; that programming model is not suitable for XDP
  – Not all NICs have eBPF map-table-like semantics

* Input from Saeed Mahameed (Mellanox)

16. Questions?

17. Backup

18. Performance Improvements
• Internal testing yielded promising results
• Test setup:
  • Target: Intel Xeon E5-2697 v2 (Ivy Bridge)
  • Kernel: 4.14.0-rc1+ (net-next)
  • Network device: XXV710 25GbE NIC, driver version 2.1.14-k
  • Configuration: single Rx queue, pinned interrupt
• XDP3: zero packet parsing (best-case scenario)
• XDP_HINTS: uses the ptype provided by the driver; no packet parsing

19. HW Hints

Parsing hints:
• Packet Type (u16): a unique numeric value that identifies the ordered chain of headers that the HW discovered in a given packet
• Header Offset (u16): location of the start of a particular header in a given packet, e.g. the start of the innermost L3 header
• Extracted Field (variable size): e.g. the innermost IPv6 address value
• Match/Map Offload (u32): match a packet on certain fields and values; provide a SW marker as a hint if the packet matches the rule

Processing hints:
• Packet Checksum (u32): a total packet checksum
• Packet Hash (u32): hash value calculated over specified fields and a given key for a given packet type
• Ingress Timestamp (u64): packet timestamp on arrival

(A possible C encoding of these hints follows below.)
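As a rough illustration only, the fixed-size hints above could be packed into a C struct like the following; field names and ordering are invented, and the variable-size Extracted Field would need a separate length/descriptor mechanism that is omitted here.

#include <linux/types.h>

/* Fixed-size hints from the table; the variable-length extracted
 * field is left out for simplicity.
 */
struct hw_rx_hints {
	/* parsing hints */
	__u16 packet_type;	/* id of the HW-discovered header chain */
	__u16 hdr_offset;	/* e.g. start of the innermost L3 header */
	__u32 match_mark;	/* SW marker from a match/map-offload rule */
	/* processing hints */
	__u32 csum;		/* total packet checksum */
	__u32 rx_hash;		/* hash over selected fields and a key */
	__u64 ingress_ts;	/* arrival timestamp */
};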
