Intel® Scalable I/O Virtualization

Intel® Scalable I/O Virtualization (Intel® Scalable IOV) is a new approach from Intel to hardware-based I/O virtualization that enables highly scalable, high-performance sharing of I/O devices across isolated domains (traditional VMs, containers, or application processes) while reducing cost and complexity. Kevin will first introduce the concept of Intel® Scalable IOV, specifically its hybrid approach, which combines innovations in hardware and software to achieve scalability, performance, and composability together. An overview of the Intel® Scalable IOV reference architecture in Linux follows, based on extensions to the VFIO mediated device framework and the IOMMU subsystem.

1. Intel® Scalable I/O Virtualization
Kevin Tian, Principal Engineer, Intel

2. Legal Disclaimer
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.
This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.
The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request.
Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by visiting …
Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others.
© Intel Corporation.

3. Hardware-Assisted I/O Virtualization
• Pursued for two classes of devices
  – High-performance devices where a software method imposes large overhead
    • E.g. NICs, RDMA devices, NVMe, etc.
  – Complex devices where virtualizing the device entirely in software is not practical
    • E.g. GPU, FPGA, etc.
• Today SR-IOV is the standard framework for PCI Express® devices

4. PCI Express® SR-IOV
• PCIe® Single Root I/O Virtualization (SR-IOV)
  – Physical Function (PF)
  – Virtual Function (VF)
• VF directly assignable to
  – Traditional Virtual Machine (VM)
  – Bare metal container/process
  – VM container
[Figure: a device exposing a PF and VF1 … VFn, each with its own BAR, config space, and queues (Q) on top of the device's backend resources; the functions are shown assigned to a VM and a container.]

5. New Requirements
• Hyper-scale environment
  – Scale to 1000+ VMs/containers
• Dynamic resource management
  – User-defined sharing granularity, over-provisioning, etc.
• Composability
  – VM live migration, snapshot, generational compatibility, etc.
Major limitations observed with SR-IOV!

6. Intel® Scalable I/O Virtualization (Intel® Scalable IOV)
• A hardware-assisted mediated pass-through architecture
  – Slow-path operations emulated by software
  – Fast-path resources dynamically provisioned for direct access
  – Hardware-enforced DMA isolation between fast-path resources
• Finer-grained device sharing than SR-IOV
  – E.g. each TX/RX queue pair can now be assigned individually
• Utilizes existing PCIe® capabilities
  – E.g. Process Address Space ID (PASID)
• Supports any type of device
  – E.g. NIC, storage, GPU, accelerators, … (integrated or discrete)
• Supports both VM and bare-metal usages

7. Intel® Scalable IOV Concept
• Device: Assignable Device Interfaces (ADI)
  – Queues, queue pairs, contexts
  – Meet isolation criteria to be ‘assignable’
  – Tagged with unique PASID
• Platform: PASID-granular DMA isolation
  – Through Intel® VT-d extensions
• Software: Compose ADIs into Virtual Device (VDEV)
  – Software managed resource remapping between VDEV and ADI
  – Slow-path emulation & fast-path pass-through
[Figure: a VM sees VDEVs; software resource remapping logic maps each VDEV onto ADIs of the device (queues Q behind a shared PF BAR and PF config), each ADI tagged with a PASID; DMA reaches the IOMMU tagged with BDF:PASID.]
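The split between slow-path emulation and fast-path pass-through can be pictured with a small model. This is only an illustrative sketch: the class names `ADI` and `VDEV`, the register layout (guest page 0 emulated, guest pages 1..n backed by ADIs), and the 4 KiB page size are assumptions made for the example, not part of the Scalable IOV specification or of any driver.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 4096  # assumed system page size for this sketch


@dataclass
class ADI:
    """Hypothetical model of one Assignable Device Interface."""
    pasid: int      # every DMA issued by this ADI carries this PASID
    mmio_base: int  # page-aligned host offset of the ADI's register page


@dataclass
class VDEV:
    """Hypothetical virtual device that software composes from ADIs."""
    adis: list
    emulated_regs: dict = field(default_factory=dict)  # slow-path state

    def access(self, offset, value=None):
        """Classify a guest MMIO access and resolve where it lands."""
        page, within = divmod(offset, PAGE_SIZE)
        if 1 <= page <= len(self.adis):
            # Fast path: guest page N is remapped onto the register page of
            # ADI N-1. Real hardware access would not trap at all; this
            # branch only shows the address remapping.
            return ("fast", self.adis[page - 1].mmio_base + within)
        # Slow path: everything else is emulated in software.
        if value is not None:
            self.emulated_regs[offset] = value
        return ("slow", self.emulated_regs.get(offset, 0))


vdev = VDEV(adis=[ADI(pasid=100, mmio_base=0x40000),
                  ADI(pasid=101, mmio_base=0x7A000)])
```

In this model a guest write to offset 0x10 is stored in software state, while an access at guest offset 0x1008 resolves to host offset 0x40008 inside the first ADI.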

8. Benefits
[Figure: four panels. Scalability: VM1 … VMn, each with its own VDEV (VDEV1 … VDEVn), over the queues (Q) of one device. Flexibility: processes (via syscall), a VM, and a container (via VDEV1 and VDEV2) over the queues of one device. Over-provisioning: VDEV1 and VDEV2 over the queues of one device. Composability: VM live migration, with a VDEV on each of two devices.]

9. Assignable Device Interfaces (ADIs)
• Smallest granularity of sharing a device
  – No PCI config space registers; ADIs share a common BDF
  – Identified by PASID
• For an ADI to be ‘assignable’
  – Functional isolation between ADIs
  – ADI MMIO registers in separate system page size regions
  – All DMAs tagged with PASID
  – Independently resettable
  – Scalable Interrupt Message Storage (IMS)
  – …
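One of the criteria above, ADI MMIO registers living in separate system-page-size regions, is easy to state as a check. The helper below is a hypothetical sketch (the function name, the `(start, length)` representation, and the 4 KiB default are assumptions); it only illustrates why page granularity matters: a page is the smallest unit that can be mapped into a guest, so two ADIs must never share one.

```python
def adis_page_isolated(regions, page_size=4096):
    """Return True if no two ADI MMIO regions touch the same page.

    `regions` is a list of (start, length) byte ranges, one per ADI,
    with length > 0.
    """
    seen = set()
    for start, length in regions:
        first = start // page_size
        last = (start + length - 1) // page_size
        pages = set(range(first, last + 1))
        if pages & seen:
            return False  # this ADI shares a page with an earlier one
        seen |= pages
    return True
```

For example, two 256-byte register blocks at offsets 0x0 and 0x800 fail the check because both fall in page 0, while blocks at 0x0 and 0x1000 pass.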

10. Enumeration of Intel® Scalable IOV Capability
• Designated Vendor Specific Extended Capability (DVSEC) to discover Intel® Scalable IOV capability
  – A simplified subset of SR-IOV capability

Byte offset   Bits 31:20                Bits 19:16      Bits 15:0
00h           Next Capability Offset    Cap Version     PCI Express Extended Capability ID = 0x23
04h           DVSEC Length = 0x18       DVSEC rev = 0   DVSEC Vendor ID = 8086
08h           Flags (RO) and Function Dependency Link (RO) in bits 31:16; DVSEC ID for Scalable IOV = XXX in bits 15:0
0Ch           Supported Page Sizes (RO)
10h           System Page Size (RW)
14h           Capabilities (RO)
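As a rough illustration of how software might decode the first three dwords of this capability, the sketch below splits them using the standard PCIe extended capability and DVSEC header layouts. The function names are hypothetical, and the DVSEC ID value is deliberately not checked because the slide leaves it as XXX.

```python
PCI_EXT_CAP_ID_DVSEC = 0x23  # extended capability ID shown on the slide
INTEL_VENDOR_ID = 0x8086


def parse_dvsec_header(dw0, dw1, dw2):
    """Split the dwords at offsets 00h, 04h and 08h into named fields."""
    return {
        "cap_id": dw0 & 0xFFFF,                  # bits 15:0
        "cap_version": (dw0 >> 16) & 0xF,        # bits 19:16
        "next_cap_offset": (dw0 >> 20) & 0xFFF,  # bits 31:20
        "dvsec_vendor_id": dw1 & 0xFFFF,         # bits 15:0
        "dvsec_rev": (dw1 >> 16) & 0xF,          # bits 19:16
        "dvsec_length": (dw1 >> 20) & 0xFFF,     # bits 31:20
        "dvsec_id": dw2 & 0xFFFF,                # bits 15:0
    }


def is_intel_dvsec(hdr):
    """True for any Intel DVSEC; the caller must still compare dvsec_id."""
    return (hdr["cap_id"] == PCI_EXT_CAP_ID_DVSEC
            and hdr["dvsec_vendor_id"] == INTEL_VENDOR_ID)
```

A caller would walk the extended capability list, apply `is_intel_dvsec` to each entry, and then compare `dvsec_id` against the value given in the Scalable IOV specification.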

11. Intel® VT-d Enhancement
• Scalable mode DMA remapping
  – PASID-granular 1st-level, 2nd-level, nested and pass-through translation
  – PASID table is now a two-level structure
  – Covers both Scalable IOV and SVM usages
• Extended Context (ECS) is deprecated
• Access/Dirty (A/D) bits in 2nd-level
  – Assist dirty memory tracking in live migration
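A two-level PASID table means a PASID is split into a directory index and a table index. The sketch below assumes a 20-bit PASID whose low 6 bits index the second level (64 entries per table); that split and all names here are assumptions of the example, and the VT-d specification remains the authority on the actual layout. Dictionaries stand in for the in-memory tables.

```python
PASID_BITS = 20
PASID_TABLE_BITS = 6  # assumption: 64 entries per second-level table


def split_pasid(pasid):
    """Return (directory index, table index) for a PASID."""
    if not 0 <= pasid < (1 << PASID_BITS):
        raise ValueError("PASID out of range")
    return pasid >> PASID_TABLE_BITS, pasid & ((1 << PASID_TABLE_BITS) - 1)


def lookup(pasid_dir, pasid):
    """Walk a sparse two-level structure; None means 'not present'."""
    dir_idx, tbl_idx = split_pasid(pasid)
    table = pasid_dir.get(dir_idx)
    if table is None:
        return None
    return table.get(tbl_idx)
```

The practical benefit of the two levels is sparseness: only the second-level tables that actually contain PASIDs in use need to be allocated.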

12. Extended Context Mode (Deprecated)

13. Scalable Mode (New)
Key difference: PASID is a global ID space shared by all VMs. All page-table pointers are moved to the PASID-granular table.

14. Software Composition
• Virtual Device Composition Module (VDCM)
  – Compose ADIs into Virtual Device (VDEV)
  – Emulate slow-path operations
• Need a framework to connect VDCM for
  – Managing VDEV life-cycle
  – Setting up access policy on VDEV resources
  – Serving slow-path operations from guest
• In Linux it’s the VFIO mediated device framework!
  – “mdev” == “VDEV” in concept

15. VFIO Mediated Device Framework
• Mdev core
  – Connect VFIO and VDCM
• User interfaces
  – Used by libvirt, qemu, etc.
• IOMMU map/unmap
• DMA isolation for mdev
  – Purely in software, or
  – In vendor specific way
[Figure: block diagram with the VFIO user interfaces (life-cycle mgmt., resource enumeration, run-time emulation) on top; below them the VFIO bus drivers (platform, pci, mdev, …) and the IOMMU interfaces; the mdev core with its bus driver interface, mdev bus, and device driver interface; the host driver containing the VDCM, which provides map/unmap callbacks; and the IOMMU driver.]

16. Extensions for Intel® Scalable IOV
• IOMMU-capable mdev
  – Link to iommu_domain (tagged by PASID)
  – Allow PASID-granular iommu map/unmap
  – Opt-in by VDCM
• Finer-grained resource management
  – Specify any number of ADIs to compose a mdev
• Unified framework for VM and bare metal usages
  – Mdev composition can be usage specific, e.g. no PCI emulation in bare metal usage
[Figure: the same block diagram as the previous slide, with the changes marked: finer-grained resource enumeration in the VFIO user interfaces, an IOMMU-capable mdev on the mdev bus, map/unmap per PASID through the IOMMU interfaces, and a scalable mode IOMMU driver; the VDCM remains in the host driver.]
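PASID-granular map/unmap can be pictured as one translation table per (BDF, PASID) pair instead of one per BDF. The toy model below is a sketch only: `PasidDomains` and its methods are hypothetical names that do not correspond to the Linux IOMMU API, and a flat dictionary stands in for real page tables.

```python
class PasidDomains:
    """Toy model: one IOVA -> physical map per (BDF, PASID) pair."""

    def __init__(self, page_size=4096):
        self.page_size = page_size
        self._maps = {}

    def map(self, bdf, pasid, iova, paddr, size):
        ps = self.page_size
        if iova % ps or paddr % ps or size % ps:
            raise ValueError("iova, paddr and size must be page aligned")
        table = self._maps.setdefault((bdf, pasid), {})
        for off in range(0, size, ps):
            table[iova + off] = paddr + off

    def unmap(self, bdf, pasid, iova, size):
        table = self._maps.get((bdf, pasid), {})
        for off in range(0, size, self.page_size):
            table.pop(iova + off, None)

    def translate(self, bdf, pasid, addr):
        """Translate a DMA address tagged with BDF:PASID, or return None."""
        table = self._maps.get((bdf, pasid), {})
        base = addr - addr % self.page_size
        if base not in table:
            return None  # would be a DMA fault on real hardware
        return table[base] + addr % self.page_size
```

Two mdevs built from ADIs of the same physical function share a BDF but use different PASIDs, so the same IOVA can resolve to different physical pages for each of them, and a DMA tagged with an unknown PASID is rejected.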

17. Main Linux Enabling Tasks
• To enable basic ADI assignment
  – Support new scalable mode
  – Need system-wide PASID space
  – Introduce iommu-capable mdev
  – Device specific VDCM in host driver
• To support vIOMMU/vSVM with ADI
  – Emulate new scalable mode on vIOMMU
  – Enlightened PASID management scheme
  – Maintain compatible APIs between PF/VF and ADI
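The need for a system-wide PASID space follows from ADIs of different VMs sharing one BDF: if two guests picked the same PASID independently, their DMA would be indistinguishable. A minimal allocator sketch is shown below; the class name, the owner bookkeeping, and the choice to start at PASID 1 are assumptions for illustration and do not describe the kernel's allocator.

```python
class PasidAllocator:
    """Hypothetical host-wide allocator handing out unique PASIDs."""

    def __init__(self, first=1, last=(1 << 20) - 1):
        self._next = first  # assumption: PASID 0 is left unused
        self._last = last
        self._free = []     # PASIDs returned by free(), reused first
        self._owner = {}    # pasid -> owner label (VM, container, process)

    def alloc(self, owner):
        if self._free:
            pasid = self._free.pop()
        elif self._next <= self._last:
            pasid = self._next
            self._next += 1
        else:
            raise RuntimeError("PASID space exhausted")
        self._owner[pasid] = owner
        return pasid

    def free(self, pasid):
        del self._owner[pasid]  # raises KeyError if the PASID is not in use
        self._free.append(pasid)
```

Every consumer, whether a VM, a container, or a bare-metal process, draws from the same allocator, so a PASID in use by one of them can never be handed to another.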

18. Summary of Architecture Changes
• Device Support
  – Support Assignable Device Interfaces (ADIs)
  – Support direct fast-path access from VMs
• Platform Support
  – Extend Intel® VT-d to use PASID/BDF to identify upstream DMA accesses
• Software Support
  – Move infrequent (slow-path) accesses from the device to software without affecting performance

19. Documentation
• Intel® VT-d specification update (Rev 3.0)
  – Documents Intel® VT-d (IOMMU) support for PASID-granular address translation
• Intel® Scalable I/O Virtualization Technical Specification (Rev 1.0)
  – Documents the Scalable IOV architecture blueprint and operation, including DVSEC
  – Addresses architecture requirements for devices and drivers
  – Agnostic of type of device or specific implementation
  – Openly published to enable broad device and software ecosystem