Accelerating NVMe I/Os in VMs via SPDK vhost

In this presentation, we introduce the SPDK user-space vhost solution (i.e., vhost-scsi/blk/NVMe), which can be used together with QEMU/KVM to accelerate virtio-scsi, virtio-blk, and even an emulated NVMe controller inside the guest OS of a VM. Compared with existing solutions (e.g., QEMU's original device emulation or the kernel vhost solution), the SPDK vhost solution greatly improves I/O performance inside VMs, with higher IOPS and lower I/O latency. Moreover, the SPDK vhost solution has been adopted by several cloud service providers (e.g., Alibaba).

1. Accelerating NVMe I/Os in Virtual Machines via SPDK vhost
Ziye Yang, Changpeng Liu (Senior Software Engineers, Intel)

2. Notices & Disclaimers Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. For more complete information about performance and benchmark results, visit . Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit . Benchmark results were obtained prior to implementation of recent software patches and firmware updates intended to address exploits referred to as "Spectre" and "Meltdown." Implementation of these updates may make these results inapplicable to your device or system. Intel® Advanced Vector Extensions (Intel® AVX)* provides higher throughput to certain processor operations. Due to varying processor power characteristics, utilizing AVX instructions may cause a) some parts to operate at less than the rated frequency and b) some parts with Intel® Turbo Boost Technology 2.0 to not achieve any or maximum turbo frequencies. Performance varies depending on hardware, software, and system configuration and you can learn more at . Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors.
These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction. Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate. © 2018 Intel Corporation. Intel, the Intel logo, and Intel Xeon are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as property of others.

3. Agenda
• Background
• SPDK vhost solution
• Experiments
• Conclusion


5. NVMe & virtualization
• The NVMe specification enables highly optimized drives (e.g., NVMe SSDs)
  – For example, multiple I/O queues allow lockless submission from CPU cores in parallel
• However, even the best kernel-mode drivers have non-trivial software overhead
  – Long I/O stack in the kernel, with resource contention
• Virtualization adds additional overhead
  – Long I/O stacks in both the guest OS kernel and the host OS kernel
  – Context switch overhead (e.g., VM_EXIT caused by I/O interrupts in the guest OS)

6. What is in QEMU's solution?
• QEMU virtualizes an NVMe device in three ways:
  – Virtio virtualization: virtio SCSI/block controllers
  – NVMe controller virtualization: a QEMU-emulated NVMe device (file-based NVMe backend), or the QEMU NVMe block driver based on VFIO (exclusive access by QEMU)
  – Hardware-assisted virtualization

7. Background: what is virtio?
• A paravirtualized driver specification (Linux*, Windows*, FreeBSD*, etc.)
• Common mechanisms and layouts for device discovery, I/O queues, etc.
• virtio device types include: virtio-net, virtio-blk, virtio-scsi, virtio-gpu, virtio-rng, virtio-crypto
[Diagram: virtio front-end drivers in the guest VM communicate through virtqueues with virtio back-end drivers and device emulation in the hypervisor (i.e., QEMU/KVM)]

8. Accelerate virtio via vhost target
• A separate process handles I/O processing
• The vhost protocol communicates guest VM parameters:
  – memory
  – number of virtqueues
  – virtqueue locations
[Diagram: virtio front-end drivers in the guest VM share virtqueues with a vhost target (kernel or userspace) that performs the device emulation outside the hypervisor (i.e., QEMU/KVM)]

9.SPDK vhost solution

10. What is SPDK?
Storage Performance Development Kit: Intel® Platform Storage Reference Architecture
• Optimized for Intel platform characteristics
• Open-source building blocks (BSD licensed)
• Available via or
• Scalable and efficient software ingredients: user-space, lockless, polled-mode components
• Up to millions of IOPS per core
• Designed for Intel Optane™ technology latencies

11. SPDK architecture
[Architecture diagram, with features tagged by release (18.01, 18.04, 18.07): storage protocols (NVMe-oF* target with RDMA and TCP transports, iSCSI target, vhost-scsi, vhost-blk, vhost-NVMe, Linux nbd); storage services (Block Device Abstraction (BDEV), QoS, logical volumes, snapshots/clones, encryption, GPT, Blobstore/BlobFS); drivers (NVMe* PCIe driver, NVMe-oF* initiator, OCSSD, virtio-scsi/blk initiators, Linux AIO, Ceph RBD, PMDK, Intel® QuickData Technology); plus an application framework and integrations with QEMU, Cinder, RocksDB, Ceph, VPP TCP/IP, and DPDK]

12. Combine virtio and NVMe to form a uniform SPDK vhost solution
[Diagram: two configurations side by side. Left: a QEMU guest VM with a virtio controller whose virtqueues live in shared guest memory; the SPDK vhost target (DPDK vhost) connects via a UNIX domain socket and eventfd. Right: the same structure with an NVMe controller, whose submission/completion queues (sq/cq) live in shared guest memory.]

13. Virtio vs. NVMe
• Both use ring data structures for I/O
[Diagram: the virtio available ring with its available index, alongside the NVMe submission queue with its tail pointer]

14. Virtio-SCSI and NVMe protocol format comparison
[Diagram: a virtio-SCSI request occupies a chain of three 16-byte descriptors (addr, len, flags, next) pointing to the SCSI_Req, Data, and SCSI_Rsp buffers, i.e. (16 * 3 + SCSI_Req + SCSI_Rsp + Data) bytes in total, while an NVMe request needs only (NVMe_Req + Data + NVMe_Rsp) bytes]

15. SPDK vhost architecture
[Diagram: three guests served by one SPDK vhost target. Guest 1 uses a virtio-SCSI controller (vhost-SCSI driver, released QEMU); Guest 2 uses a virtio-BLK controller (vhost-BLK driver, released QEMU); Guest 3 uses an NVMe controller (vhost-NVMe driver, separate patch for QEMU). The SPDK vhost target provides SCSI, BLK, and NVMe layers over BDEV; KVM sits in the host kernel.]

16. Comparison of known solutions

Solution                  | QEMU emulated NVMe device | QEMU VFIO-based solution | SPDK vhost-SCSI | SPDK vhost-BLK                | SPDK vhost-NVMe
Guest OS driver interface | NVMe                      | NVMe                     | Virtio SCSI     | Virtio BLK                    | NVMe
Backend device sharing    | Y                         | N                        | Y               | Y                             | Y
Application transparent   | Y                         | Y                        | Y               | N (command set is very small) | Y
Live migration support    | Y                         | N                        | Y               | Y                             | N
VFIO dependency           | N                         | Y                        | N               | N                             | N
QEMU change               | No modification           | Upstream done            | Upstream done   | Upstream done                 | Upstream in process

17.SPDK vhost NVMe implementation details

18. vhost NVMe implementation details
[Diagram: the guest NVMe driver's admin queue and I/O queues (sq/cq) live in shared guest VM memory; QEMU forwards admin commands to the SPDK vhost-NVMe target over a UNIX domain socket (DPDK vhost), while an NVMe I/O queue poller services the I/O queues directly and dispatches to namespaces (NS1, NS2, ...) backed by BDEVs; KVM sits in the host kernel]

19. Create I/O queue
[Flow: the guest submits a Create I/O Submission Queue admin command (QSIZE, QID, CQID, QPRIO, PC, PRP1) and writes the admin doorbell; QEMU picks up the admin command and forwards it over the UNIX domain socket; SPDK translates the guest memory behind PRP1 and creates the queue, after which both the guest and SPDK see the same I/O queue]

20. New feature to address guest NVMe performance issue
• Submitting a new I/O triggers an MMIO write to the submission queue's doorbell register, which causes a VM_EXIT
• NVMe 1.3 new feature: optional admin command support for Doorbell Buffer Config, used only by emulated NVMe controllers
• The guest can update a shadow doorbell buffer in memory instead of the submission queue's doorbell register

21. Shadow doorbell buffer

Start | End | Description
00h   | 03h | Submission Queue 0 Tail Doorbell or EventIdx (Admin)
04h   | 07h | Completion Queue 0 Head Doorbell or EventIdx (Admin)
08h   | 0Bh | Submission Queue 1 Tail Doorbell or EventIdx
0Ch   | 0Fh | Completion Queue 1 Head Doorbell or EventIdx

Doorbell Buffer Config command:
PRP1 | Shadow doorbell memory address, updated by the guest NVMe driver
PRP2 | EventIdx memory address, updated by the SPDK vhost target


23. 1 VM with 1 NVMe SSD
[Charts: IOPS (K); guest/host user and system CPU utilization (%); KVM event counts. Compared: QEMU-NVMe, Vhost-SCSI, Vhost-BLK, Vhost-NVMe]
System configuration: 2 x Intel Xeon E5-2699 v4 @ 2.2 GHz; 128 GB DDR4-2667, 6 memory channels; SSD: Intel Optane™ P4800X, FW E2010324, 375 GiB; BIOS: HT disabled, Turbo disabled; OS: Fedora 25, kernel 4.16.0. 1 VM; VM config: 4 vCPUs, 4 GB memory, 4 I/O queues; VM OS: Fedora 27, kernel 4.16.5-200, blk-mq enabled; software: QEMU 2.12.0 with SPDK vhost-NVMe driver patch; I/O distribution: 1 vhost core for SPDK; FIO 3.3, io depth=32, numjobs=4, direct=1, block size=4k, total tested data size=400 GiB

24. 8 VMs with 4 NVMe SSDs
[Charts: IOPS (K) and latency (us) for randread. Compared: Vhost-SCSI, Vhost-BLK, Vhost-NVMe]
• The Linux kernel NVMe driver polls the completion queue when submitting a new request, which helps decrease interrupt counts and vm_exit events.
System configuration: 2 x Intel Xeon E5-2699 v4 @ 2.2 GHz; 256 GB DDR4-2667, 6 memory channels; SSD: Intel DC P4510, FW VDV10110, 2 TiB; BIOS: HT disabled, Turbo disabled; host OS: CentOS 7, kernel 4.16.7. 8 VMs; VM config: 4 vCPUs, 4 GB memory, 4 I/O queues; guest OS: Fedora 27, kernel 4.16.5-200, blk-mq enabled; software: QEMU 2.12.0 with SPDK vhost-NVMe driver patch; I/O distribution: 2 vhost cores for SPDK; FIO 3.3, io depth=128, numjobs=4, direct=1, block size=4k, runtime=300s, ramp_time=60s; SSDs well preconditioned with 2 hours of random writes before the randread tests.


26. Conclusion & future work
• Conclusion
  – We introduced the SPDK vhost solution (i.e., vhost SCSI/BLK/NVMe) to accelerate NVMe I/Os in virtual machines
• Future work
  – VM live migration support for the whole SPDK vhost solution (i.e., vhost SCSI/BLK/NVMe)
  – Upstream the QEMU vhost-NVMe driver
• Promotion
  – Welcome to evaluate and use the SPDK vhost target!
  – Welcome to contribute to the SPDK community!