XDP Acceleration Using NIC Metadata

This document is a follow-up to the initial XDP hardware-based hints work presented at Netdev 2.1 in Seoul, South Korea.


1. Neerav Parikh, PJ Waskiewicz (Intel Corporation, Networking Division); Saeed Mahameed (Mellanox). Linux Plumbers Conference, Nov. 2018, Vancouver, BC, Canada. Network Division

2. Overview
• XDP Acceleration – Netdev 2.1 Recap
• XDP Performance Results
• L4 Load Balancer
• xdp_tx_ip_tunnel
• XDP NIC Rx Metadata Requirements
• XDP NIC Rx Metadata Programming Model
• Next Steps

3. XDP Acceleration – Netdev 2.1 Recap
What can present-day NIC HW do to help?
• Accelerate what is being done in XDP programs in terms of packet processing
• Offset some of the CPU cycles used for packet parsing
• Keep it consistent with the XDP philosophy:
  § Avoid kernel changes as much as possible
  § Keep it HW-agnostic as much as possible
  § Best-effort acceleration
  § A framework that can change with the changing needs of packet processing
• Expose the flexibility provided by a programmable packet processing pipeline to adapt to XDP program needs
• Help design the next generation of hardware to take full advantage of XDP and the kernel framework
Open questions:
• How do you dynamically program the hardware to give the XDP program the right kind of packet parsing help?
• How do you pass the packet parsing/map lookup hints that the HW provides with every packet into the XDP program so that it can benefit from them?

4. Netdev 2.1 Recap – Performance Data
• XDP1: Linux kernel sample; parses the packet to identify the protocol, then counts and drops
• XDP3: Zero packet parsing (best-case scenario); just drops all packets
• XDP_HINTS: Uses the packet type (IPv4/v6, TCP/UDP, etc.) provided by the driver as metadata; no packet parsing, count and drop

5. L4 Load Balancer Performance
• L4 LB: L4 load balancer sample application with multiple virtual IP tunnels, forwarding packets to a destination based on hash calculations and lookup
• Hints Type 1: Protocol type (IPv4/v6, TCP or UDP, etc.)
• Hints Type 2: Hints from Type 1 plus additional packet data such as source/destination IP addresses, source/destination ports, and the packet hash index (RSS) generated by hardware
[Chart: XDP L4 LB with no state tracking, packets/s (y-axis up to ~16,000,000), comparing No Hints, Hints Type 1, and Hints Type 2 at 1 and 4 Rx queues]

6. L4 Load Balancer Performance
No visible advantage in performance with just packet parsing hints when the XDP application is doing state tracking and connection management.
[Chart: XDP L4 LB with state tracking, packets/s (y-axis up to ~10,000,000), comparing No Hints, Hints Type 1, and Hints Type 2 at 1 and 4 Rx queues]
https://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue.git/log/?h=XDP-hints-EXPERIMENTAL

7. L4 Load Balancer Performance Analysis (Projected)
[Charts: projected XDP L4 LB PPS with and without state tracking (1Q), comparing PPS without any hints, % improvement in PPS with HW hints (driver), and % change in PPS with SW inline generated hints; data labels range from -8% to +77%]

8. xdp_tx_ip_tunnel with HW Flow Mark
• Modified xdp_tx_iptunnel kernel sample
• Needs an extra map, flow2tnl, similar to vip2tnl
• Set up a TC rule to mark packets matching the well-known VIP (dst IP, protocol, and dst port) with a unique flow mark
• XDP Rx metadata includes a flow_mark used to fetch the tunnel from the flow2tnl map
* Saeed Mahameed (Mellanox)

9. XDP and Rx Metadata Requirements
XDP program to Rx metadata type selection:
§ Legacy NICs: Fixed, vendor-specific metadata structures provided as Rx descriptors or completions – Intel 82599 (ixgbe), 7xx Series (i40e)
§ Programmable NICs: Flexible Rx descriptors allow customization of Rx metadata based on use case – Intel E800 Series (ice)
Association of Rx metadata type to Rx queues:
§ XDP programs should run regardless of whether Rx metadata is enabled – legacy programs should run without requiring metadata
§ Granularity of configuration:
  – All Rx queues: the same fixed or flexible format metadata
  – Per Rx queue: fixed or flexible metadata per queue; for example, an XDP program and an AF_XDP-based application on a given Rx queue may need different information in terms of Rx metadata

10. XDP Metadata Programming Model
• Need a mechanism to allow metadata type or generic type information exchange between the SW driver and XDP programs
• Supported XDP metadata is configured for an XDP program either at load time or at compile time
(Netdev 2.1 proposal)

11. XDP Metadata Programming Model – Solution Options
Option #1 (Fields Offset Array)
• Well-known XDP metadata types, defined by the kernel
• A program can request any subset of the well-known metadata fields from the driver
• The driver fills the metadata buffer in a pre-defined order according to the requested metadata fields (ascending order by field enum)
• The user program accesses a specific field via a pre-defined (calculated) offset array:
  flow_mark = xdp->data_meta + offset_array[XDP_META_FLOW_MARK];
Option #2 (BTF)
• BTF support was added in 4.15+ by Facebook to provide eBPF program and map metadata descriptions
• Extend it to NIC metadata programming: describe metadata formats through the driver's ndo_bpf() callback to determine whether the HW can offload/provide such metadata or not
• Optionally, driver + firmware keep the layout of the metadata in BTF format; a user can query the driver and generate a normal C header file based on the BTF in the given NIC
• During sys_bpf(prog_load) the kernel checks the layout via the supplied BTF
• Every NIC can have its own metadata layout and its own meaning of the fields; standardize at least a few common fields, like hash
* Inputs from Saeed Mahameed (Mellanox)

12. XDP Metadata Programming Model – Pros vs. Cons of Option #2 (BTF) Compared to Option #1 (Fields Offset Array)
Pros
• Allows vendor-defined or vendor-specific offloads to be enabled without requiring kernel support
• The metadata layout is known to the BPF program at load time, so it does not need to use offsets at run time
Cons
• The XDP program has to be compiled/recompiled with the correct metadata type for a given SW+FW+HW combination
• Standardizing some fields depends on naming conventions across NIC vendors, and overlap of these fields across vendors may create issues
* Input from Saeed Mahameed (Mellanox)

13. XDP Acceleration Using NIC HW: Current Status
• Rx metadata WIP/RFC-level patches:
• Intel (WIP):
  • https://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue.git/commit/?h=XDP-hints-EXPERIMENTAL
• Mellanox:
  • [RFC bpf-next 0/6] XDP RX device meta data acceleration (WIP): https://www.spinics.net/lists/netdev/msg509814.html
  • [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure: https://www.spinics.net/lists/netdev/msg509820.html
  • https://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git/commit/?h=topic/xdp_metadata&id=5f2908515bf64d72684b2bf902acb1a8d9af2d44
• Alexei and Daniel's proposal on the netdev mailing list:
  • https://www.spinics.net/lists/netdev/msg509820.html

14. XDP Acceleration Using NIC HW: Next Steps
• The community needs to agree on an approach to the Rx metadata programming model that provides flexibility for users across various use cases and applications
• Chaining and metadata placement in the xdp buffer:
  • Chaining can be easily achieved by calling the bpf_xdp_adjust_meta helper from the chained programs
  • Having the metadata fields sit immediately before the actual packet buffer (xdp->data) is OK, BUT:
    • When bpf_xdp_adjust_head is required (header rewrite) and the metadata buffer is filled, a memmove() of the metadata is required (performance hit)
    • Invalidating the metadata once consumed would break chaining
    • Placing metadata starting at xdp_buff.data_hard_start is complicated
* Input from Saeed Mahameed (Mellanox)

15. XDP Acceleration Using NIC HW: Next Steps
• Tx metadata and processing hints:
  – As with Rx, need a way to configure/consume Tx metadata from applications to the HW via SW drivers
  – Provide hints to take advantage of HW offloads/accelerations such as checksums, packet processing/forwarding, QoS, etc.
• Programming rules in NIC HW to accelerate flow lookups and actions:
  – Advantage of taking actions prior to Rx in software (e.g. drop, or forwarding to a Rx queue)
  – Currently a tc u32/flower or ethtool based model for enabling HW offloads and match-action rules; this programming model is not suitable for XDP
  – Not all NICs have eBPF map-table-like semantics

16. Questions?

17. Backup

18. Performance Improvements
• Internal testing yielded promising results
• Test setup:
  – Target: Intel Xeon E5-2697 v2 (Ivy Bridge)
  – Kernel: 4.14.0-rc1+ (net-next)
  – Network device: XXV710, 25GbE NIC, driver version 2.1.14-k
  – Configuration: single Rx queue, pinned interrupt
• XDP3: Zero packet parsing (best-case scenario)
• XDP_HINTS: Uses the ptype provided by the driver; no packet parsing

19. HW Hints

Parsing hints:
• Packet Type (u16) – A unique numeric value that identifies an ordered chain of headers that were discovered by the HW in a given packet.
• Header Offset (u16) – Location of the start of a particular header in a given packet, e.g. the start of the innermost L3 header.
• Extracted Field (variable size) – e.g. the innermost IPv6 address value.

Match-map offload:
• Match (u32) – Match a packet on certain fields and values; provide a SW marker as a hint if the packet matches the rule.

Processing hints:
• Packet Checksum (u32) – A total packet checksum.
• Packet Hash (u32) – Hash value calculated over specified fields and a given key for a given packet type.
• Ingress Timestamp (u64) – Packet timestamp as it arrives.