Storage Performance Tuning for FAST! Virtual Machines

A virtual machine can be configured in many ways, and only some of the options affect its I/O performance. Which ones matter? What does each option mean, and how do they relate to each other? What are the newcomers in the family, and how can they help? In this talk, Fam Zheng walks through the configuration stack of virtual storage devices, deciphers the parameters, and suggests how to tune them for the best performance on your systems.

1. Storage Performance Tuning for FAST! Virtual Machines
Fam Zheng, Senior Software Engineer
LC3-2018

2. Outline
• Virtual storage provisioning
• NUMA pinning
• VM configuration options
• Summary
• Appendix

3. Virtual storage provisioning

4. Provisioning virtual disks
• Virtual storage provisioning exposes host persistent storage to the guest for applications’ use.
• A device of a certain type is presented on a system bus.
• The guest uses a corresponding driver to do I/O.
• The disk space is allocated from the storage available on the host.
[Diagram: applications in the virtual machine sit on a guest driver (virtual_block_device_driver.ko); below it, the host side is marked "???" next to KVM.]

5. QEMU emulated devices
• Device types: virtio-blk, virtio-scsi, IDE, NVMe, ...
• QEMU block features
  • qcow2, live snapshot
  • throttling
  • block migration
  • incremental backup
  • …
• Easy and flexible backend configuration
  • Wide range of protocols: local file, NBD, iSCSI, NFS, Gluster, Ceph, ...
  • Image formats: qcow2, raw, LUKS, …
• Pushed hard for performance
  • IOThread polling; userspace driver; multiqueue block layer (WIP)

6. QEMU emulated device I/O (file backed)
I/O request lifecycle:
  Guest virtio driver
  ↓ KVM ioeventfd
  ↓ vdev vring handler
  ↓ QEMU block layer
  ↓ LinuxAIO/POSIX syscall
  ↓ Host VFS/block/SCSI layer
  ↓ Host device driver
  ↓ Hardware
[Diagram: QEMU (vCPU, main thread) on top of KVM; host stack of file system, block layer, SCSI and device driver.]

7. QEMU virtio IOThread
● A dedicated thread to handle virtio vrings
● Now fully supports QEMU block layer features (previously known as x-data-plane of virtio-blk, which was limited to raw format and had no block jobs)
● Currently one IOThread per device
  ● Multi-queue support is being worked on
● Adaptive polling enabled
  ● Optimizes away the event notifiers from the critical path (Linux-aio, vring, ...)
  ● Reduces latency by up to 20%
[Diagram: QEMU (main thread, vCPU, IOThread), virtio queue, KVM, host storage.]

8. QEMU userspace NVMe driver
• With the help of VFIO, QEMU accesses the host controller’s submission and completion queues without doing any syscall.
• If adaptive polling of the completion queues finds nothing, the MSI/IRQ is delivered to the IOThread through an eventfd.
• No host file system, block layer or SCSI. The data path is shortened.
• The QEMU process uses the controller exclusively.
• (New in QEMU 2.12)
[Diagram: vCPU, IOThread with NVMe driver, KVM, vfio-pci.ko.]
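
As a rough sketch of how this driver is selected: QEMU addresses the controller with an nvme:// URL built from the controller's PCI address and the namespace number, once the device has been bound to vfio-pci on the host. The PCI address 0000:01:00.0, the namespace 1, the IDs and the helper name below are illustrative assumptions, and the script only prints a command line rather than starting QEMU.

```shell
#!/bin/sh
# Build the -drive argument for QEMU's userspace NVMe driver.
# $1: PCI address of the controller (assumed example: 0000:01:00.0)
# $2: NVMe namespace number (assumed example: 1)
nvme_drive_arg() {
    printf 'file=nvme://%s/%s,if=none,id=drive0\n' "$1" "$2"
}

# Print (do not run) a command line that puts the drive behind virtio-blk
# with an IOThread. The controller must already be bound to vfio-pci.
echo qemu-system-x86_64 \
    -object iothread,id=iothread0 \
    -drive "$(nvme_drive_arg 0000:01:00.0 1)" \
    -device virtio-blk-pci,drive=drive0,iothread=iothread0
```

Because the QEMU process owns the controller exclusively, no other host process or guest can use the same device while it runs.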

9. SPDK vhost-user
• Virtio queues are handled by a separate process, SPDK vhost, which is built on top of DPDK and has a userspace poll mode NVMe driver.
• The QEMU IOThread and the host kernel are out of the data path. Latency is greatly reduced by busy polling.
• No QEMU block features. No migration (w/ NVMe pmd).
[Diagram: QEMU (main thread, vCPU) and SPDK vhost (nvme pmd) connected through hugepage shared memory holding the virtqueues; KVM below.]

10. vfio-pci device assignment
• Highly efficient. Guest driver accesses device queues directly without VMEXIT.
• No block features of host system or QEMU. Cannot do migration.
[Diagram: QEMU (vCPU, main thread), guest nvme.ko, KVM, host vfio-pci.ko.]

11. Provisioning virtual disks

Type               Configuration            QEMU block features  Migration  Special requirements         Supported in current RHEL/RHV
QEMU emulated      IDE                      ✓                    ✓          -                            ✓
QEMU emulated      NVMe                     ✓                    ✓          -                            ✗
QEMU emulated      virtio-blk, virtio-scsi  ✓                    ✓          -                            ✓
vhost              vhost-scsi               ✗                    ✗          -                            ✗
vhost              SPDK vhost-user          ✗                    ✓          Hugepages                    ✗
Device assignment  vfio-pci                 ✗                    ✗          Exclusive device assignment  ✓

Sometimes higher performance means less flexibility

12. fio randread bs=4k iodepth=1 numjobs=1
[Bar chart: IOPS (axis 0 to 12000), one bar each for: host /dev/nvme0n1; vfio-pci; vhost-user-blk (SPDK) (**); virtio-blk w/ iothread, userspace driver; virtio-blk w/ iothread; virtio-scsi w/ iothread; ahci]
Backend: NVMe, Intel® SSD DC P3700 Series 400G
Host: Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz, Fedora 28
Guest: Q35, 1 vCPU, Fedora 28
QEMU: 8e36d27c5a
(**): SPDK poll mode driver threads take 100% of dedicated host CPU cores
[*]: numbers are collected for relative comparison, not representative as a formal benchmarking result
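
For readers who want to run a comparable measurement, the chart title maps onto fio options roughly as follows. Only rw, bs, iodepth and numjobs come from the slide; the target /dev/vdb, the libaio engine, direct I/O and the 30-second runtime are assumptions made for this sketch, and the helper prints the command instead of running it.

```shell
#!/bin/sh
# Print a fio command using the parameters from the chart title.
# $1: device or file to test inside the guest (assumed example: /dev/vdb)
# ioengine, direct, runtime and time_based are assumptions, not from the slides.
fio_randread_cmd() {
    echo "fio --name=randread --filename=$1 --rw=randread --bs=4k" \
         "--iodepth=1 --numjobs=1 --ioengine=libaio --direct=1" \
         "--runtime=30 --time_based"
}

fio_randread_cmd /dev/vdb
```

With iodepth=1 and a single job, the result is dominated by per-request latency, which is why the differences between the virtual device types show up so clearly.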

13. NUMA Pinning

14. NUMA (Non-uniform memory access)
[Diagram: vCPU, IOThread with NVMe driver, KVM and vfio-pci.ko spread over NUMA node 0 and NUMA node 1.]
Goal: put vCPU, IOThread and virtual memory on the same NUMA node with the host device that undertakes I/O
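
To apply this goal, the NUMA node of the host storage controller has to be known first. A minimal sketch, assuming the controller is a PCI device: Linux exposes the node in the device's sysfs directory. The PCI address 0000:01:00.0 and the helper name are placeholders chosen for illustration.

```shell
#!/bin/sh
# Print the NUMA node of a PCI device as reported by sysfs.
# $1: PCI address (assumed example: 0000:01:00.0)
# $2: sysfs root, defaults to /sys (overridable, e.g. for testing)
pci_numa_node() {
    cat "${2:-/sys}/bus/pci/devices/$1/numa_node"
}
```

Calling `pci_numa_node 0000:01:00.0` prints the node number, or -1 when the platform reports no NUMA affinity for the device. `numactl --hardware` then lists which CPUs and how much memory belong to each node.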

15. Automatic NUMA balancing
• Kernel feature to achieve good NUMA locality
  • Periodic NUMA unmapping of process memory
  • NUMA hinting fault
  • Migrate on fault - moves memory to where the program using it runs
  • Task NUMA placement - moves running programs closer to their memory
• Enabled by default in RHEL:
    $ cat /proc/sys/kernel/numa_balancing
    1
• Decent performance in most cases
• Disable it if using manual pinning
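
The last bullet can be sketched as two small helpers around the same kernel knob. The function names and the NUMA_BALANCING override variable are assumptions made for illustration; writing to the file requires root.

```shell
#!/bin/sh
# Path of the kernel knob shown above; overridable, e.g. for testing.
NUMA_BALANCING=${NUMA_BALANCING:-/proc/sys/kernel/numa_balancing}

# Succeeds if automatic NUMA balancing is currently enabled.
numa_balancing_enabled() {
    [ -r "$NUMA_BALANCING" ] && [ "$(cat "$NUMA_BALANCING")" != "0" ]
}

# Turn automatic balancing off before applying manual pinning (needs root).
disable_numa_balancing() {
    echo 0 > "$NUMA_BALANCING"
}
```

A typical use is `numa_balancing_enabled && disable_numa_balancing`. The change does not survive a reboot; setting kernel.numa_balancing=0 through sysctl configuration is the usual way to make it persistent.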

16. Manual NUMA pinning
• Option 1: Allocate all vCPUs and virtual memory on the optimal NUMA node
    $ numactl -N 1 -m 1 qemu-system-x86_64 …
  • Or use Libvirt (*)
  • Restrictive on resource allocation:
    • Cannot use all host cores
    • NUMA-local memory is limited
• Option 2: Create a guest NUMA topology matching the host, pin IOThread to host storage controller’s NUMA node
  • Libvirt is your friend! (*)
  • Relies on the guest to do the right NUMA tuning
* See appendix for Libvirt XML examples
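
Without Libvirt, the guest topology of Option 2 can be approximated on the QEMU command line. The sketch below prints one memory backend bound to each host node and a matching guest NUMA node. The helper name, the two-node layout, three vCPUs per node and 2G per node are assumptions for illustration, and pinning the vCPU and IOThread threads themselves is not covered here.

```shell
#!/bin/sh
# Print QEMU options for a guest NUMA topology that mirrors the host.
# $1: number of NUMA nodes, $2: vCPUs per node, $3: memory per node (e.g. 2G)
guest_numa_args() (
    nodes=$1 vcpus=$2 mem=$3
    node=0
    while [ "$node" -lt "$nodes" ]; do
        first=$(( $node * $vcpus ))
        last=$(( $first + $vcpus - 1 ))
        # Guest RAM for this node, allocated only from the same host node.
        printf '%s\n' "-object memory-backend-ram,id=ram$node,size=$mem,host-nodes=$node,policy=bind"
        # Guest NUMA node using that RAM and a contiguous range of vCPUs.
        printf '%s\n' "-numa node,nodeid=$node,cpus=$first-$last,memdev=ram$node"
        node=$(( $node + 1 ))
    done
)

guest_numa_args 2 3 2G
```

The printed options would be combined with a matching `-smp 6 -m 4G`. Thread placement still has to be done separately, for example with Libvirt's vcpupin and iothreadpin settings.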

17. fio randread bs=4k iodepth=1 numjobs=1
[Bar chart: IOPS (axis 0 to 18000) with and without NUMA pinning; NUMA pinning is +5%]
Backend: Intel® SSD DC P3700 Series
Host: Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz, 2 sockets w/ NUMA, Fedora 28
Guest: Q35, 6 vCPU, 1 socket, Fedora 28, NUMA balancing disabled. Virtual device: virtio-blk w/ IOThread
QEMU: 8e36d27c5a

18. VM Configuration Options

19. Raw block device vs image file
• Image file is more flexible, but slower
• Raw block device has better performance, but is harder to manage
• Note: snapshot is supported with raw block device, by creating a qcow2 image directly on the device. E.g.:
    $ qemu-img create -f qcow2 -b /path/to/base/image.qcow2 \
        /dev/sdc

20. QEMU emulated device I/O (block device backed)
Using raw block device may improve performance: no file system in host.
[Diagram: vCPU, IOThread, KVM, /dev/nvme0n1, nvme.ko.]

21. Middle ground: use LVM
LVM is much more flexible and easier to manage than raw block or partitions, and has good performance
[Bar chart: fio randrw bs=4k iodepth=1 numjobs=1, IOPS (axis 0 to 14000) for: raw file (xfs); lvm; block dev]
Backend: Intel® SSD DC P3700 Series
Host: Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz, 2 sockets w/ NUMA, Fedora 28
Guest: Q35, 6 vCPU, 1 socket, Fedora 28, NUMA pinning. Virtual device: virtio-blk w/ IOThread
QEMU: 8e36d27c5a

22. Using QEMU VirtIO IOThread
• When using virtio, it’s recommended to enable IOThread:
    qemu-system-x86_64 … \
      -object iothread,id=iothread0 \
      -device virtio-blk-pci,iothread=iothread0,id=… \
      -device virtio-scsi-pci,iothread=iothread0,id=…
• Or in Libvirt...

23. Using QEMU VirtIO IOThread (Libvirt)
    <domain>
      ...
      <iothreads>1</iothreads>
      <devices>
        <disk type='file' device='disk'>
          <driver name='qemu' type='raw' cache='none' iothread='1'/>
          <target dev='vda' bus='virtio'/>
          …
        </disk>
        <controller type='scsi' index='0' model='virtio-scsi'>
          <driver iothread='1'/>
          ...
        </controller>
      </devices>
    </domain>

24. virtio-blk with and without enabling IOThread
[Bar chart: fio randread bs=4k iodepth=1 numjobs=1, IOPS (axis 0 to 10000), with IOThread vs without IOThread]
Backend: Intel® SSD DC P3700 Series
Host: Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz, 2 sockets w/ NUMA, Fedora 28
Guest: Q35, 6 vCPU, 1 socket, Fedora 28, NUMA pinning. Virtual device: virtio-blk
QEMU: 8e36d27c5a

25. virtio-blk vs virtio-scsi
• Use virtio-scsi for many disks, or for full SCSI support (e.g. unmap, write same, SCSI pass-through)
  • virtio-blk DISCARD and WRITE ZEROES are being worked on
• Use virtio-blk for best performance
[Bar chart: fio blocksize=4k numjobs=1, IOPS (axis 0 to 45000), virtio-scsi vs virtio-blk for each of: iodepth=4 randrw; iodepth=1 randrw; iodepth=4 randread; iodepth=1 randread]
Backend: Intel® SSD DC P3700 Series; QEMU userspace driver (nvme://)
Host: Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz, 2 sockets w/ NUMA, Fedora 28
Guest: Q35, 6 vCPU, 1 socket, Fedora 28, NUMA pinning. IOThread enabled.
QEMU: 8e36d27c5a

26. Raw vs qcow2
• Don’t like the trade-off between features and performance?
• Try increasing qcow2 run-time cache size
    qemu-system-x86_64 … \
      -drive file=my.qcow2,if=none,id=drive0,aio=native,cache=none,cache-size=16M \
      ...
• Or increase the cluster_size when creating qcow2 images
    qemu-img create -f qcow2 -o cluster_size=2M my.qcow2 100G
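
Both suggestions follow from the same arithmetic. In qcow2, each 8-byte L2 table entry maps one cluster, so the L2 cache needed to cover a whole image is roughly disk_size * 8 / cluster_size. The helper name below is an assumption for this sketch; the 100G size and the 64k and 2M cluster sizes are the ones used on these slides.

```shell
#!/bin/sh
# L2 cache bytes needed to map an entire qcow2 image:
# every cluster needs one 8-byte L2 entry.
# $1: virtual disk size in bytes, $2: cluster size in bytes
qcow2_l2_cache_bytes() {
    echo $(( $1 * 8 / $2 ))
}

GiB=$(( 1024 * 1024 * 1024 ))
qcow2_l2_cache_bytes $(( 100 * GiB )) $(( 64 * 1024 ))        # 13107200 (12.5 MiB)
qcow2_l2_cache_bytes $(( 100 * GiB )) $(( 2 * 1024 * 1024 ))  # 409600 (400 KiB)
```

A 100G image with 64k clusters therefore needs about 12.5 MiB of L2 cache, which a 16M cache-size covers, while 2M clusters shrink the requirement to about 400 KiB.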

27. Raw vs qcow2
[Bar chart: fio blocksize=4k numjobs=1 iodepth=1, IOPS (axis 0 to 12000), for randread and randrw each with: raw; qcow2 (2M cluster); qcow2 (64k cluster, 16M cache); qcow2 (64k cluster)]
Backend: Intel® SSD DC P3700 Series, formatted as xfs; Virtual disk size: 100G; Preallocation: full
Host: Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz, 2 sockets w/ NUMA, Fedora 28
Guest: Q35, 6 vCPU, 1 socket, Fedora 28, NUMA pinning. Virtual device: virtio-blk w/ IOThread
QEMU: 8e36d27c5a

28. AIO: native vs threads
• aio=native is usually better than aio=threads
• May depend on file system and workload
  • ext4 native is slower because io_submit is not implemented async
[Bar chart: fio 4k randread numjobs=1 iodepth=16, IOPS (axis 0 to 120000) for: xfs, threads; xfs, native; ext4, threads; ext4, native; nvme, threads; nvme, native]
Backend: Intel® SSD DC P3700 Series
Host: Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz, 2 sockets w/ NUMA, Fedora 28
Guest: Q35, 6 vCPU, 1 socket, Fedora 28, NUMA pinning. Virtual device: virtio-blk w/ IOThread
QEMU: 8e36d27c5a

29. Image preallocation
• Reserve space on file system for user data or metadata:
    $ qemu-img create -f $fmt -o preallocation=$mode test.img 100G
• Common modes for raw and qcow2:
  • off: no preallocation
  • falloc: use posix_fallocate() to reserve space
  • full: reserve by writing zeros
• qcow2 specific mode:
  • metadata: fully create L1/L2/refcnt tables and pre-calculate cluster offsets, but don’t allocate space for clusters
• Consider enabling preallocation when disk space is not a concern (it may defeat the purpose of thin provisioning)
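
The format and mode combinations listed above can be captured in a small helper that rejects an unsupported pairing before anything is created. This is only a sketch: the function name is made up for illustration, and it prints the qemu-img command rather than running it.

```shell
#!/bin/sh
# Print a qemu-img command for a given format and preallocation mode.
# $1: format (raw or qcow2), $2: mode, $3: image path, $4: virtual size
prealloc_create_cmd() {
    case "$1:$2" in
        raw:off|raw:falloc|raw:full) ;;
        qcow2:off|qcow2:metadata|qcow2:falloc|qcow2:full) ;;
        *) echo "unsupported format/mode: $1/$2" >&2; return 1 ;;
    esac
    echo "qemu-img create -f $1 -o preallocation=$2 $3 $4"
}

prealloc_create_cmd qcow2 metadata test.img 100G
```

Running the printed command and then comparing the virtual size with the disk size reported by `qemu-img info test.img` shows how much space each mode actually reserves.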