Achieve stable high performance DPDK Application on modern CPU


1. Achieve stable high performance DPDK Application on modern CPU
   Tao Yang, Xuekun Hu, Intel

2. Agenda
   1. Modern CPU Architecture
   2. Performance impact on shared resources
      1. Shared EU and L1/L2 Cache
      2. Shared L3 Cache
      3. Shared Core Power
   3. Summary

3. Server CPU in Hot Chips 2017
   • https://www.hotchips.org/wp-content/uploads/hc_archives/hc29/HC29.22-Tuesday-Pub/HC29.22.90-Server-Pub/HC29.22.921-EPYC-Lepak-AMD-v2.pdf
   • https://www.hotchips.org/wp-content/uploads/hc_archives/hc29/HC29.22-Tuesday-Pub/HC29.22.90-Server-Pub/HC29.22.930-Xeon-Skylake-sp-Kumar-Intel.pdf
   • https://www.hotchips.org/wp-content/uploads/hc_archives/hc29/HC29.22-Tuesday-Pub/HC29.22.90-Server-Pub/HC29.22.942-Centriq-2400-Wolford-Qualcomm%20Final%20Submission%20corrected.pdf

4. Shared Execution Engine and L1/L2 Cache
   • With Intel® Hyper-Threading Technology, the two hardware threads of a physical core share its execution engine and its L1/L2 caches.
   • Reference: Intel® 64 and IA-32 Architectures Software Developer Manuals

5. Linux tool for hyper-thread
   • Hyper-threads and cores in the system:
     [root@wolfpass-6230n ~]# lscpu
     CPU(s):              80
     Thread(s) per core:  2
     Core(s) per socket:  20
     Socket(s):           2
     NUMA node(s):        2
     [root@wolfpass-6230n ~]# lscpu -e
     CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE  MAXMHZ    MINMHZ
     0    0    0      0    0:0:0:0       yes    3900.0000 800.0000
     1    0    0      1    1:1:1:0       yes    3900.0000 800.0000
     ......
     40   0    0      0    0:0:0:0       yes    3900.0000 800.0000
     ......
   • Binding an application to a core:
     taskset -c CORE-ID DPDK-APP
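Before pinning, the logical CPUs that share a physical core can also be read straight from sysfs; a minimal sketch (standard Linux sysfs topology files, not from the slides):

     # Print each CPU's hyper-thread sibling list, e.g. "0,40" on the
     # 6230N system above (CPU 0 and CPU 40 share one physical core).
     for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
         echo "${cpu##*/}: $(cat "$cpu"/topology/thread_siblings_list)"
     done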

6. Hyper-thread performance impact
   • DPDK l3fwd on 1 core/1 thread (Intel Xeon 6230N, 2.30 GHz) with 2 x 10G ports
   • Stress workload running on the other hyper-thread of the same core
   • Testbed: IXIA* traffic generator -> 10GbE NIC ports 1/2 -> l3fwd on thread 1, stress-ng on thread 2 of the same Xeon 6230N core
   • [Chart: DPDK l3fwd throughput relative to the no-hyper-thread-workload baseline; with compute, L1 cache, L2 cache, socket, and io stress-ng workloads on the sibling thread, relative throughput drops to values between 0.90 and 0.77]
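A hedged sketch of how such a noisy-sibling experiment can be reproduced; the l3fwd arguments and the stress-ng stressor below are illustrative, not the exact ones used in the test:

     # Pin DPDK l3fwd to core 1, then pin a cache stressor to its
     # hyper-thread sibling (core 41 on the system above).
     ./dpdk-l3fwd -l 1 -n 4 -- -p 0x3 --config="(0,0,1),(1,0,1)" &
     taskset -c 41 stress-ng --cache 1 --timeout 60s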

7. Shared L3 Cache
   • Reference: Intel® 64 and IA-32 Architectures Optimization Reference Manual

8. Shared Memory Bandwidth
   • https://www.hotchips.org/wp-content/uploads/hc_archives/hc29/HC29.22-Tuesday-Pub/HC29.22.90-Server-Pub/HC29.22.930-Xeon-Skylake-sp-Kumar-Intel.pdf

9. Intel® Resource Director Technology
   • Intel® RDT provides monitoring and control of shared resources, including Cache Allocation Technology (CAT) for the L3 cache and Memory Bandwidth Allocation (MBA), both used in the tests below.
   • Intel® 64 and IA-32 Architectures Software Developer Manuals
   • https://www.intel.com/content/www/us/en/architecture-and-technology/resource-director-technology.html

10. Linux Tools for Resource Control
   • Intel RDT kernel interface documentation: https://www.kernel.org/doc/html/latest/x86/resctrl_ui.html
   • Intel RDT reference software package (direct access to CPU registers): https://github.com/intel/intel-cmt-cat
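For comparison with the pqos tool used below, a minimal sketch of the kernel resctrl interface (the group name and masks here are illustrative assumptions):

     # Mount the resctrl filesystem and create a resource group.
     mount -t resctrl resctrl /sys/fs/resctrl
     mkdir /sys/fs/resctrl/dpdk
     # Allocate 7 of the 11 L3 ways and full memory bandwidth on both sockets.
     echo "L3:0=7f0;1=7f0" > /sys/fs/resctrl/dpdk/schemata
     echo "MB:0=100;1=100" > /sys/fs/resctrl/dpdk/schemata
     # Bind the DPDK cores to the group.
     echo "21-24" > /sys/fs/resctrl/dpdk/cpus_list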

11. RDT Test configuration: "CAT with Memory Bandwidth Aggressors" case
   • Testbed: CentOS 7.6 on Intel Xeon 6230N; Open vSwitch with DPDK switching between two Intel 710 10GbE NICs (ports 1/2 each) and an IXIA* traffic generator; two VPP vRouter VMs (VM1, VM2) as systems under test and two stress-ng VMs (VM3, VM4) as noisy neighbors.
   • Cache allocation scheme (11-bit Capacity Bit Mask, CBM, bits 10..0) and memory bandwidth allocation scheme:

     Process/VM            Core                CoS   CBM     Memory Bandwidth
     Other App             33,35,37            3     0xC     10%
     ovs-vswitchd          20                  3     0xC     10%
     OVS-DPDK PMD          21,26,27,28,29,36   1     0x7F0   100%
     VM1 - SUT             22,23,24            1     0x7F0   100%
     VM2 - SUT             30,31,32            1     0x7F0   100%
     VM3 - Noisy Neighbor  25,34,65,74         2     0x3     10%
     VM4 - Noisy Neighbor  38,39,78,79         2     0x3     10%
     OS                    0-19,40-59          0     0x7FF   100%

     (Hyper-thread siblings, CPUs 60-79, are associated with the same classes of service via the last pqos command below.)

   • pqos commands:
     pqos -e "llc:0=0x7ff;llc:1=0x7f0;llc:2=0x3;llc:3=0xc;"
     pqos -e "mba:0=100;mba:1=100;mba:2=10;mba:3=10;"
     pqos -a "llc:0=0-19,40-59"
     pqos -a "llc:1=21,26,27,28,29,36,22,23,24,30,31,32;llc:2=25,34,38,39;llc:3=20,33,35,37"
     pqos -a "llc:1=61,66,67,68,69,76,62,63,64,70,71,72;llc:2=65,74,78,79;llc:3=60,73,75,77"
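While such a test runs, the allocation and its effect can be checked with pqos itself (the core list here is illustrative; monitoring requires CMT/MBM support):

     pqos -s                          # show the current CAT/MBA configuration
     pqos -m "all:21,26,27,28,29,36"  # LLC occupancy and memory bandwidth of the PMD cores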

12. RDT Test data
   • [Chart-only slide; the measurements are not recoverable from this transcript]

13. Performance data with RDT
   • [Chart: OVS-DPDK/VPP vRouter throughput in Mpps at packet sizes 64, 128, 256, 512, 1024, 1280 and 1518 bytes, comparing three cases: no stress, memory stress, and memory stress + CAT + MBA; percentage labels range from 9.8% to 28.2%]

14. Shared Power
   • Intel® Speed Select Technology - Base Frequency (Intel® SST-BF)
   • https://builders.intel.com/docs/networkbuilders/intel-speed-select-technology-base-frequency-enhancing-performance.pdf

15. Intel® SST-BF Enabled CPU SKUs

    Processor                           Base Configuration        Intel SST-BF High Priority   Intel SST-BF Low Priority
                                        Cores / SSE Base (GHz)    Cores / SSE Base (GHz)       Cores / SSE Base (GHz)
    Intel® Xeon® Gold 6252N Processor   24 / 2.3                  8 / 2.8                      16 / 2.1
    Intel® Xeon® Gold 6230N Processor   20 / 2.3                  6 / 2.7                      14 / 2.1
    Intel® Xeon® Gold 5218N Processor   16 / 2.3                  4 / 2.7                      12 / 2.1

16. Linux tool for Intel SST-BF
   • Enable the Intel® SST-BF feature in the BIOS.
   • The OS can determine high-priority cores for scheduling purposes by enumerating each core's ACPI _CPC object "guaranteed performance" value.
   • Linux kernel v5.0.8+ exposes /sys/devices/system/cpu/cpu*/cpufreq/base_frequency
   • User-space scripts to configure high/low priority cores: https://github.com/intel/CommsPowerManagement
   (a sketch for enumerating the high-priority cores follows)
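A minimal sketch, assuming the kernel exposes the sysfs file named above, to list per-core base frequencies and spot the high-priority cores:

     # Cores reporting the higher base_frequency (2700000 kHz on a 6230N
     # with SST-BF enabled) are high priority; the rest report 2100000.
     for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
         echo "${cpu##*/}: $(cat "$cpu"/cpufreq/base_frequency)"
     done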

17. SR-IOV Performance with Intel SST-BF
   • [Chart: DPDK l3fwd SR-IOV throughput; 6230N with SST-BF (1 core @ 2.7 GHz) delivers 9.5% more throughput than without SST-BF (1 core @ 2.3 GHz)]
   • 1 core of a Xeon 6230N, 4 x 10G ports, SR-IOV passthrough, DPDK l3fwd in a VM, 64B packet size

18. OVS-DPDK Performance with Intel SST-BF: test configuration
   • 2 x Intel Xeon 6230N + 6 x 10G ports in a system; only 16 cores of CPU 1 are used
   • 6 cores for the OVS-DPDK data plane
     Low priority: ovs-vswitchd on core 20
     High priority: ovs-pmd on cores 21,26,27,33,34,36
   • 3 cores for every VPP vRouter VM (low priority)
     VM 1: 22,23,24; VM 2: 28,29,30; VM 3: 32,35,37
   • VPP VM core configuration: VM core 0 for the control plane, VM cores 1,2 for the VPP data plane
   (an example of the corresponding OVS-DPDK core pinning follows)
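A hedged sketch of the OVS-DPDK pinning for this split; the other_config keys are standard OVS-DPDK options, and the masks are computed from the core IDs above:

     # ovs-vswitchd housekeeping on low-priority core 20 (bit 20 set).
     ovs-vsctl set Open_vSwitch . other_config:dpdk-lcore-mask=0x100000
     # PMD threads on high-priority cores 21,26,27,33,34,36.
     ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x160C200000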

19. OVS-DPDK Performance with Intel SST-BF
   • SST-BF enabled (VPP vRouter cores at 2.1 GHz, OVS-DPDK cores at 2.7 GHz) vs. SST-BF disabled (all cores at 2.3 GHz)
   • [Chart: OVS-DPDK/VPP vRouter throughput in Mpps at 64B-1518B packet sizes; enabling SST-BF improves throughput by 6.2% to 9.5%; series: 6230N w/o BF (20C, 2.3 GHz) with 3 VMs, 16 cores total vs. 6230N w/ BF (20C, 2.7/2.1 GHz) with 3 VMs, 16 cores total]

20. Summary
   • Many resources are shared between the cores of a modern multi-core CPU.
   • Applications running on different cores compete for these shared resources, which makes performance unstable.
   • Partitioning the shared resources (hyper-thread placement, CAT/MBA, SST-BF) reduces the competition and achieves stable high performance.

21. Thank You!