Achieve stable high performance DPDK Application on modern CPU
1. Achieve Stable High Performance DPDK Application on Modern CPU
   Tao Yang, Xuekun Hu, Intel
2. Agenda
   1. Modern CPU architecture
   2. Performance impact of shared resources
      1. Shared execution units (EU) and L1/L2 cache
      2. Shared L3 cache
      3. Shared core power
   3. Summary
3. Server CPUs at Hot Chips 2017
   • https://www.hotchips.org/wp-content/uploads/hc_archives/hc29/HC29.22-Tuesday-Pub/HC29.22.90-Server-Pub/HC29.22.921-EPYC-Lepak-AMD-v2.pdf
   • https://www.hotchips.org/wp-content/uploads/hc_archives/hc29/HC29.22-Tuesday-Pub/HC29.22.90-Server-Pub/HC29.22.930-Xeon-Skylake-sp-Kumar-Intel.pdf
   • https://www.hotchips.org/wp-content/uploads/hc_archives/hc29/HC29.22-Tuesday-Pub/HC29.22.90-Server-Pub/HC29.22.942-Centriq-2400-Wolford-Qualcomm%20Final%20Submission%20corrected.pdf
4. Shared Execution Engine and L1/L2 Cache
   • Hyper-Threading Technology: two hardware threads share one physical core's execution units and its L1/L2 caches.
   • Reference: Intel® 64 and IA-32 Architectures Software Developer Manuals
5. Linux Tools for Hyper-Threading
   • Hyper-threads and cores in the system:

     [root@wolfpass-6230n ~]# lscpu
     CPU(s):              80
     Thread(s) per core:  2
     Core(s) per socket:  20
     Socket(s):           2
     NUMA node(s):        2

     [root@wolfpass-6230n ~]# lscpu -e
     CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ    MINMHZ
     0   0    0      0    0:0:0:0       yes    3900.0000 800.0000
     1   0    0      1    1:1:1:0       yes    3900.0000 800.0000
     ......
     40  0    0      0    0:0:0:0       yes    3900.0000 800.0000
     ......

     Note that CPU 0 and CPU 40 map to the same physical core (CORE 0): they are hyper-thread siblings.

   • Binding an application to a core:
     taskset -c CORE-ID DPDK-APP
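A minimal sketch of discovering sibling hyper-threads from sysfs and pinning a DPDK application so that its sibling stays idle (the l3fwd arguments are illustrative, not from the deck):

    #!/bin/bash
    # Print each logical CPU together with its hyper-thread sibling(s),
    # from the same topology information that lscpu reads.
    for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
        id=${cpu##*cpu}
        echo "CPU $id shares a physical core with CPUs: $(cat "$cpu/topology/thread_siblings_list")"
    done

    # Pin DPDK l3fwd to CPU 1 and leave its sibling (CPU 41 on this system)
    # unused, so no other workload shares the execution units and L1/L2 cache.
    taskset -c 1 ./dpdk-l3fwd -l 1 -n 4 -- -p 0x3 --config="(0,0,1),(1,0,1)"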
6. Hyper-Thread Performance Impact
   • DPDK L3fwd on 1 core/1 thread (Intel Xeon 6230N, 2.30 GHz) with 2x 10GbE NIC ports, traffic from an IXIA* traffic generator
   • A stress-ng workload running on the other hyper-thread of the same core
   • Chart: DPDK L3fwd relative throughput vs. workload type on the sibling hyper-thread. With no workload on the sibling thread as the 1.0 baseline, the compute, L1 cache, L2 cache, socket, and io stress workloads reduce relative throughput to between 0.90 and 0.77.
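The sibling-thread aggressors in this experiment can be reproduced with stress-ng; a sketch, assuming the DPDK application is pinned to CPU 0 and CPU 40 is its hyper-thread sibling as in the lscpu output above:

    # Compute-bound aggressor on the sibling hyper-thread
    taskset -c 40 stress-ng --cpu 1 --timeout 60s

    # Cache-thrashing aggressor
    taskset -c 40 stress-ng --cache 1 --timeout 60s

    # I/O aggressor
    taskset -c 40 stress-ng --io 1 --timeout 60s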
7. Shared L3 Cache
   • All cores on a socket share the last-level (L3) cache.
   • Reference: Intel® 64 and IA-32 Architectures Optimization Reference Manual
8. Shared Memory Bandwidth
   • https://www.hotchips.org/wp-content/uploads/hc_archives/hc29/HC29.22-Tuesday-Pub/HC29.22.90-Server-Pub/HC29.22.930-Xeon-Skylake-sp-Kumar-Intel.pdf
9. Intel® Resource Director Technology (RDT)
   • Monitors and controls shared resources, including L3 cache (Cache Allocation Technology, CAT) and memory bandwidth (Memory Bandwidth Allocation, MBA).
   • Reference: Intel® 64 and IA-32 Architectures Software Developer Manuals
   • https://www.intel.com/content/www/us/en/architecture-and-technology/resource-director-technology.html
10. Linux Tools for Resource Control
    • Intel RDT kernel interface (resctrl) documentation: https://www.kernel.org/doc/html/latest/x86/resctrl_ui.html (see the sketch below)
    • Intel RDT reference software package (direct access to CPU registers): https://github.com/intel/intel-cmt-cat
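A minimal sketch of the resctrl kernel interface named above (the group name "dpdk" and the mask/core values are illustrative):

    # Mount the resctrl filesystem (requires a CAT/MBA-capable CPU and kernel)
    mount -t resctrl resctrl /sys/fs/resctrl

    # Create a resource group for the DPDK workload
    mkdir /sys/fs/resctrl/dpdk

    # Give the group L3 ways 4-10 (mask 0x7f0) on both sockets and
    # unthrottled memory bandwidth; resources not listed stay unchanged.
    echo "L3:0=7f0;1=7f0" > /sys/fs/resctrl/dpdk/schemata
    echo "MB:0=100;1=100" > /sys/fs/resctrl/dpdk/schemata

    # Move the DPDK cores into the group
    echo "21,26,27,28,29,36" > /sys/fs/resctrl/dpdk/cpus_list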
11. RDT Test Configuration: "CAT with Aggressors" Case
    • System: 2x Intel Xeon 6230N, CentOS 7.6, Open vSwitch with DPDK, 2x Intel 710 10GbE NICs, IXIA* traffic generator
    • Workloads: VPP vRouter in VM1/VM2 (systems under test), stress-ng in VM3/VM4 (noisy neighbors)
    • Cache and memory-bandwidth allocation scheme:

      Process/VM            Cores                  CoS  Capacity Bit Mask (CBM, 11-bit)  Memory Bandwidth
      OS on CPU 0           0-19,40-59             0    0x7FF                            100%
      OVS-DPDK PMD          21,26,27,28,29,36      1    0x7F0                            100%
      VM1 - SUT             22,23,24               1    0x7F0                            100%
      VM2 - SUT             30,31,32               1    0x7F0                            100%
      VM3 - Noisy Neighbor  25,34,65,74            2    0x3                              10%
      VM4 - Noisy Neighbor  38,39,78,79            2    0x3                              10%
      ovs-vswitchd          20                     3    0xC                              10%
      Other App             33,35,37,60,73,75,77   3    0xC                              10%

    • pqos configuration:

      pqos -e "llc:0=0x7ff;llc:1=0x7f0;llc:2=0x3;llc:3=0xc"
      pqos -e "mba:0=100;mba:1=100;mba:2=10;mba:3=10"
      pqos -a "llc:0=0-19,40-59"
      pqos -a "llc:1=21,26,27,28,29,36,22,23,24,30,31,32;llc:2=25,34,38,39;llc:3=20,33,35,37"
      pqos -a "llc:1=61,66,67,68,69,76,62,63,64,70,71,72;llc:2=65,74,78,79;llc:3=60,73,75,77"
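The resulting allocation can be checked and watched with the same pqos tool from intel-cmt-cat (core lists shortened for illustration):

    # Show the current allocation configuration
    pqos -s

    # Monitor LLC occupancy and memory bandwidth on the PMD and noisy-neighbor cores
    pqos -m "all:21,26,27,28,29,36,25,34,38,39"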
12. RDT Test Data
13. Performance Data with RDT
    • Chart: OVS-DPDK/VPP vRouter throughput (Mpps) at packet sizes 64/128/256/512/1024/1280/1518 bytes for three cases: no stress, memory stress, and memory stress + CAT + MBA. The annotated differences between cases range from 9.8% to 28.2%, largest at small packet sizes.
14. Shared Power
    • Intel® Speed Select Technology - Base Frequency (Intel SST-BF)
    • https://builders.intel.com/docs/networkbuilders/intel-speed-select-technology-base-frequency-enhancing-performance.pdf
15. Intel® SST-BF Enabled CPU SKUs

    Processor                           Base Configuration    SST-BF High Priority    SST-BF Low Priority
                                        Cores  Base (GHz)     Cores  Base (GHz)       Cores  Base (GHz)
    Intel® Xeon® Gold 6252N Processor   24     2.3            8      2.8              16     2.1
    Intel® Xeon® Gold 6230N Processor   20     2.3            6      2.7              14     2.1
    Intel® Xeon® Gold 5218N Processor   16     2.3            4      2.7              12     2.1

    (All frequencies are SSE base frequencies.)
16. Linux Tools for Intel SST-BF
    • Enable the Intel® SST-BF feature in the BIOS.
    • The OS can determine the high-priority cores for scheduling purposes by enumerating the ACPI _CPC object's "guaranteed performance" value for each core.
    • Linux kernel v5.0.8+ exposes /sys/devices/system/cpu/cpu*/cpufreq/base_frequency
    • A user-space script to configure high/low priority cores is available at https://github.com/intel/CommsPowerManagement (see the sketch below)
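A minimal sketch of identifying SST-BF high-priority cores from the sysfs attribute above (the 2.7 GHz threshold is illustrative and matches the 6230N SKU table, where high-priority cores report 2.7 GHz against 2.1 GHz for low-priority cores):

    #!/bin/bash
    # List cores whose per-core base frequency marks them as SST-BF high priority.
    for f in /sys/devices/system/cpu/cpu[0-9]*/cpufreq/base_frequency; do
        cpu=${f%/cpufreq*}; cpu=${cpu##*cpu}
        khz=$(cat "$f")
        if [ "$khz" -ge 2700000 ]; then
            echo "CPU $cpu is high priority (base frequency ${khz} kHz)"
        fi
    done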
17. SR-IOV Performance with Intel SST-BF
    • Setup: 1 core on a Xeon 6230N, 4x 10G ports, SR-IOV passthrough, DPDK l3fwd in a VM, 64B packets
    • Chart: DPDK L3fwd/SR-IOV throughput, 6230N without SST-BF (1C, 2.3 GHz) vs. with SST-BF (1C, 2.7 GHz): a 9.5% throughput gain with SST-BF.
18. OVS-DPDK Performance with Intel SST-BF
    • 2x Intel Xeon 6230N + 6x 10G ports in one system
    • Only 16 cores of CPU 1 are used:
      - OVS-DPDK data plane (6 cores): ovs-pmd on high-priority cores 21,26,27,33,34,36; ovs-vswitchd on low-priority core 20
      - 3 low-priority cores per VPP vRouter VM: VM1: 22,23,24; VM2: 28,29,30; VM3: 32,35,37
    • VPP VM core configuration: VM core 0 for the control plane, VM cores 1 and 2 for the VPP data plane
    • The corresponding PMD pinning is sketched below.
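A sketch of pinning the OVS-DPDK PMD threads to the high-priority cores listed above (the mask value is the bit mask for cores 21,26,27,33,34,36):

    # Set the PMD CPU mask: bits 21, 26, 27, 33, 34, and 36 set
    ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x160C200000

    # Verify which PMD thread handles which port/queue
    ovs-appctl dpif-netdev/pmd-rxq-show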
19. OVS-DPDK Performance with Intel SST-BF
    • Intel SST-BF enabled (VPP vRouter at 2.1 GHz, OVS-DPDK at 2.7 GHz) vs. disabled (all cores at 2.3 GHz)
    • Chart: OVS-DPDK/VPP vRouter throughput (Mpps) at packet sizes 64B to 1518B, 6230N without SST-BF (20C, 2.3 GHz) vs. with SST-BF (20C, 2.7/2.1 GHz), 3 VMs on 16 cores total. Annotated gains range from 6.2% to 9.5%.
20. Summary
    • Many resources are shared in a multi-core CPU.
    • Applications running on different cores compete for these shared resources.
    • Partitioning the shared resources reduces the competition and achieves stable, high performance.
21. Thank You!