Vhost and Virtio on Armv8: Performance Tuning and Optimization


1. DPDK Virtio Performance Analysis and Tuning on Armv8 (Joyce Kong / Gavin Hu, Arm)

2. Agenda
• Test Scenario
• Testbed and Baseline Performance
  • NUMA balancing
  • VHE
• Analysis and Tuning
  • Weak-Memory Model
  • Loop unrolling
  • Prefetch
  • Inline function
• Performance Data Result

3. Test Scenario
• Traffic generator: IXIA
• Physical port: 40Gbps NIC
• Packet flow: IXIA Port A -> NIC Port 0 -> Vhost-user -> Virtio -> Vhost-user -> NIC Port 0 -> IXIA Port A
• Test case: RFC2544 zero packet loss test
• Setup: host testpmd with vhost-user ports; Ubuntu 18.04 VM (QEMU-KVM) running DPDK testpmd with the virtio PMD
[Diagram: DPDK PVP test setup, IXIA traffic generator <-> NIC ports P0/P1 <-> host vhost-user <-> VM virtio PMD]


5. NUMA Balancing
• NUMA balancing
  • Moves tasks (threads/processes) closer to the memory they are accessing
  • Moves application data to memory closer to the tasks that reference it
• Automatic NUMA balancing internals
  • Periodic NUMA unmapping of process memory
  • NUMA hinting page faults
  • Migrate-on-Fault (MoF): move memory to where the program using it runs
  • task_numa_placement: move running programs closer to their memory
  • Unmapping of memory, NUMA faults, migration and NUMA placement all incur overhead
• Configuration
  • # numactl --hardware        (shows multiple nodes)
  • # echo 0 > /proc/sys/kernel/numa_balancing        (disables automatic NUMA balancing)
[Chart: PVP benchmark, AggTx Rate (Mpps) on Soc#1 and Soc#2 (2 NUMA nodes each), NUMA balancing enabled vs disabled]

6. VHE (Virtualization Host Extensions)
• KVM/ARM without VHE: the host kernel and VM kernel run at EL1, with the hypervisor split into a Highvisor (EL1) and a Lowvisor (EL2). Every switch between host and VM traps to EL2 and performs a full context switch: configure the VGIC and virtual timer, set up or disable the stage-2 translation registers, and enable or disable traps.
• KVM/ARM with VHE: the host kernel itself runs at EL2, so the hypervisor is reached with a plain function call and the Highvisor/Lowvisor split, with its trap and context-switch overhead, disappears.
[Diagram: exception-level layout (EL0/EL1/EL2) of KVM/ARM with and without VHE]
[Chart: PVP benchmark, AggTx Rate (Mpps), Soc#1 (without VHE) vs Soc#2 (with VHE)]


8. Weak-Memory Model
• Strong-memory order: all reads and writes complete in program order.
• Weak-memory order: reads and writes are arbitrarily re-ordered, subject only to data dependencies and explicit memory barrier instructions.
• Hardware re-ordering improves performance:
  • Multiple issue of instructions
  • Out-of-order execution
  • Speculation and speculative loads
  • Load and store combining
  • External memory systems
  • Cache-coherent multi-core processing
  • Optimizing compilers
• Certain situations require stronger ordering rules, enforced by software-inserted barriers:
  • Prevent unsafe optimizations from occurring
  • Enforce a specific memory ordering
• Memory barriers degrade performance, so software must judge:
  • Whether a barrier is necessary in a specific situation
  • Which barrier is the correct one to use

9. Full Barriers
• The Arm architecture includes barrier instructions that force access ordering and access completion at a specific point.
• ISB (Instruction Synchronization Barrier)
  • Flushes the pipeline and refetches the instructions from the cache (or memory)
  • Effects of any completed context-changing operation before the ISB are visible to instructions after the ISB
  • Context-changing operations after the ISB only take effect after the ISB has executed
• DMB (Data Memory Barrier)
  • Prevents re-ordering of data access instructions across the barrier instruction
  • Data accesses (loads/stores) before the DMB are visible before any data access after the DMB
  • Data/unified cache maintenance operations before the DMB are visible to explicit data accesses after the DMB
• DSB (Data Synchronization Barrier)
  • More restrictive than a DMB: no further instructions (not just loads/stores) can be observed until the DSB completes

10. "One-way" barrier optimization
• AArch64 adds new load/store instructions with implicit barrier semantics:
  • Load-Acquire (LDAR): all accesses after the LDAR are observed after the LDAR; accesses before the LDAR are not affected
  • Store-Release (STLR): all accesses before the STLR are observed before the STLR; accesses after the STLR are not affected
• Accesses can cross the barrier in one direction but not the other.
[Diagram: LDR/STR accesses crossing an LDAR downward only, and an STLR upward only]

11. "One-way" barrier optimization (cont.)
• LDAR and STLR may be used as a pair
  • To protect a critical section of code, e.g. enqueue to and dequeue from the virtio vring
  • May have lower performance impact than a full DMB
  • No ordering is enforced within the critical section
• Scope
  • DMB/DSB take a qualifier to control which shareability domains see the effect
  • LDAR/STLR use the attributes of the address accessed

12. "One-way" barrier optimization: virtio example

drivers/net/virtio/virtqueue.h:

	static inline void
	vq_update_avail_idx(struct virtqueue *vq)
	{
	-	virtio_wmb(vq->hw->weak_barriers);
	-	vq->vq_split.ring.avail->idx = vq->vq_avail_idx;
	+	__atomic_store_n(&vq->vq_split.ring.avail->idx,
	+			 vq->vq_avail_idx, __ATOMIC_RELEASE);
	}

Relaxing the memory ordering to a release store saves a DMB operation.
[Chart: PVP benchmark, AggTx Rate (Mpps), baseline vs with "one-way" barrier]

13. Loop unrolling
Loop unrolling attempts to optimize a program's execution speed by eliminating the instructions that control the loop, a space-time trade-off: the loop body is re-written as a repeated sequence of similar independent statements.
Advantages:
• Branch penalty is minimized
• The statements can potentially execute in parallel if they are independent of each other
• Can be implemented dynamically if the number of array elements is unknown at compile time
[Figure: Cortex-A72 pipeline]

14. Loop unrolling example and benefit
[Code diff: i40e_tx_free_bufs() in the i40e PMD. The mbuf-free loop "for (i = 1; i < n; i++)" is unrolled to process two entries per iteration: each step prefetches the next entry with rte_prefetch0(&txep[i].mbuf->next), frees the mbuf with rte_pktmbuf_prefree_seg(), batches mbufs from the same pool into free[] and flushes them with rte_mempool_put_bulk(free[0]->pool, (void *)free, nb_free); a separate step handles the final element when one is left over.]
[Chart: PVP benchmark, AggTx Rate (Mpps), baseline vs with loop unrolling]

15. Prefetch
• Prefetching loads a resource into the cache before it is required, decreasing the time spent waiting for it; a cache miss otherwise incurs a large penalty.

16. Prefetch example and benefit

i40e_tx_free_bufs(struct i40e_tx_queue *txq)
	...
	txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)];
+	rte_prefetch0(&txep[0].mbuf->next);
	m = rte_pktmbuf_prefree_seg(txep[0].mbuf);
	if (likely(m != NULL)) {
		free[0] = m;
		nb_free = 1;
		for (i = 1; i < n; i++) {
+			rte_prefetch0(&txep[i].mbuf->next);
			m = rte_pktmbuf_prefree_seg(txep[i].mbuf);
			if (likely(m != NULL)) {
				if (likely(m->pool == free[0]->pool)) {
	...
		rte_mempool_put_bulk(free[0]->pool, (void **)free, nb_free);
	} else {
		for (i = 1; i < n; i++) {
+			rte_prefetch0(&txep[i].mbuf->next);
			m = rte_pktmbuf_prefree_seg(txep[i].mbuf);
			if (m != NULL)
				rte_mempool_put(m->pool, m);
	...

[Chart: PVP benchmark, AggTx Rate (Mpps), baseline vs with prefetching]

17. Inline function
• A function call requires slow stack operations: building a frame and saving and restoring registers. Inlining a function lets the compiler substitute the body at the call site and saves this call cost.
[Chart: PVP benchmark, AggTx Rate (Mpps), baseline vs with inline function]


19. Performance Data Result
[Chart: PVP benchmark on an AArch64 SoC, AggTx Rate (Mpps), cumulative: Baseline -> apply i40e patches -> apply testpmd patches -> apply vhost/virtio patches -> Final, a 24% gain over baseline]

20. Thanks
Joyce Kong joyce.kong@arm.com
Gavin Hu gavin.hu@arm.com

21. Misc

	rte_mbuf_refcnt_read(const struct rte_mbuf *m)
	{
		return (uint16_t)(rte_atomic16_read(&m->refcnt_atomic));
	}

	rte_atomic16_read(const rte_atomic16_t *v)
	{
		return v->cnt;
	}

rte_atomic16_read() is just a plain read of v->cnt, so the reference count can be compared directly instead of going through the atomic wrapper.