Lecture 13: Virtual Machines #1

Virtual Machines #1
Disco: Running Commodity Operating Systems on Scalable Multiprocessors
Xen and the Art of Virtualization

EECS 262a Advanced Topics in Computer Systems
Lecture 13, October 4th, 2018
John Kubiatowicz
Electrical Engineering and Computer Sciences, University of California, Berkeley
http://www.eecs.berkeley.edu/~kubitron/cs262
10/04/2018 Cs262a-F18 Lecture-13

Today's Papers
• "Disco: Running Commodity Operating Systems on Scalable Multiprocessors". Edouard Bugnion, Scott Devine, Kinshuk Govil, Mendel Rosenblum. ACM Transactions on Computer Systems 15 (4), November 1997.
• "Xen and the Art of Virtualization". P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield. Appears in the Symposium on Operating Systems Principles (SOSP), 2003.
• Thoughts?

Why Virtualize?
• Consolidate machines
  – Huge energy, maintenance, and management savings
• Isolate performance, security, and configuration
  – Stronger than process-based isolation
• Stay flexible
  – Rapid provisioning of new services
  – Easy failure/disaster recovery (when used with data replication)
• Cloud computing
  – Huge economies of scale from multiple tenants in large datacenters
  – Savings on management, networking, power, maintenance, and purchase costs
• Corporate employees
  – Can choose their own devices

Virtual Machines Background
• Observation: instruction-set architectures (ISAs) form some of the relatively few well-documented complex interfaces we have in the world
  – The machine interface includes the meaning of interrupt numbers, programmed I/O, DMA, etc.
• Anything that implements this interface can execute the software for that platform
• A virtual machine is a software implementation of this interface (often using the same underlying ISA, but not always)
  – Original paper on whether or not a machine is virtualizable: Gerald J. Popek and Robert P. Goldberg (1974). "Formal Requirements for Virtualizable Third Generation Architectures". Communications of the ACM 17 (7): 412-421.

Many VM Examples
• IBM initiated the VM idea to support legacy binary code
  – Support an old machine's OS on a newer machine (e.g., CP-67 on System 360/67)
  – Later supported multiple OSes on one machine (System 370)
• Apple's Rosetta ran old PowerPC apps on newer x86 Macs
• MAME is an emulator for old arcade games (5800+ games!!)
  – Actually executes the game code straight from a ROM image
• Modern VM research started with Stanford's Disco project
  – Ran multiple VMs on a large shared-memory multiprocessor (since normal OSes couldn't scale well to lots of CPUs)
• VMware (founded by the Disco creators):
  – Customer support (many variations/versions on one PC, using a VM for each)
  – Web and app hosting (host many independent low-utilization servers on one machine: "server consolidation")
[Diagram: guest OSes (Linux, WinXP, ...) running on a Virtual Machine Monitor over the hardware]

VM Basics
• The real master is no longer the OS but the "Virtual Machine Monitor" (VMM) or "Hypervisor" (hypervisor > supervisor)
• The OS no longer runs in the most privileged mode (reserved for the VMM)
  – x86 has four privilege "rings", with ring 0 having full access
  – VMM = ring 0, OS = ring 1, app = ring 3
  – x86 rings come from Multics (as does the x86 segment model)
    » (Newer x86 has a fifth "ring" for the hypervisor, but it was not available at the time of the paper)
• But the OS thinks it is running in the most privileged mode and still issues those instructions?
  – Ideally, such instructions should cause traps, and the VMM then emulates the instruction to keep the OS happy
  – But in (old) x86, some such instructions fail silently!
  – Four solutions: SW emulation (Disco), dynamic binary code rewriting (VMware), slightly rewriting the OS (Xen), and hardware virtualization (IBM System/370, IBM LPAR, Intel VT-x, AMD-V, SPARC T-series)

Virtualization Approaches
• Disco/VMware/IBM: complete virtualization, which runs unmodified OSes and applications
  – Use software emulation to shadow system data structures,
  – Dynamic binary rewriting of OS code that modifies system structures, or
  – Hardware virtualization support
• Denali introduced the idea of "paravirtualization": change the interface some to improve VMM performance/simplicity
  – Must change the OS and some apps (e.g., those using segmentation); easy for Linux, hard for MS (requires their help!)
  – But can support 1,000s of VMs on one machine...
  – Great for web hosting
• Xen: change the OS but not applications; supports the full Application Binary Interface (ABI)
  – Faster than a full VM; supports ~100 VMs per machine
  – Moving to a paravirtual VM is essentially porting the software to a very similar machine

How to Build a VMM 1: SW Emulation (Disco)
[Diagram: an emulator process provides "physical" memory, a virtual MMU, virtual system calls, and a virtual CPU to a guest kernel and guest apps, all running on a normal OS over the hardware]

Disco
• Extend a modern OS to run efficiently on shared-memory multiprocessors with minimal OS changes
• A VMM built to run multiple copies of the Silicon Graphics IRIX operating system on the Stanford FLASH multiprocessor
  – IRIX: a Unix-based OS
  – Stanford FLASH: cache-coherent NUMA

Disco Architecture
[Architecture diagram omitted in extraction]

Disco's Interface
• Processors
  – MIPS R10000 processor: emulates all instructions, the MMU, and the trap architecture
  – Extensions to support common processor operations
    » Enabling/disabling interrupts, accessing privileged registers
• Physical memory
  – Contiguous, starting at address 0
• I/O devices
  – Virtualizes devices like I/O and disks
  – Physical devices multiplexed by Disco
  – Special abstractions for SCSI disks and network interfaces
    » Virtual disks for VMs
    » Virtual subnet across all virtual machines

Disco Implementation
• Multi-threaded shared-memory program
• Attention to NUMA memory placement, cache-aware data structures, and IPC patterns
• Disco code copied to each FLASH processor
• Communicate using shared memory

Virtual CPUs
• Direct execution on the real CPU:
  – Set the real CPU registers to those of the virtual CPU, and jump to the current PC of the VCPU
  – Privileged instructions must be emulated, since we won't run the OS in privileged mode
  – Disco runs privileged, the OS runs in supervisor mode, and apps run in user mode
• An OS privileged instruction causes a trap, which causes Disco to emulate the intended instruction
• Maintains a data structure for each virtual CPU for trap emulation
  – Trap examples: page faults, system calls, bus errors
• A scheduler multiplexes virtual CPUs on the real processors

Virtual Physical Memory
• Extra level of indirection: physical-to-machine address mappings
  – The VM sees contiguous physical addresses starting from 0
  – Disco maps physical addresses to the 40-bit machine addresses used by FLASH
  – When the OS inserts a virtual-to-physical mapping into the TLB, Disco translates the physical address into the corresponding machine address
  – To quickly compute the corrected TLB entry, Disco keeps a per-VM pmap data structure that contains one entry per VM physical page
  – Each entry in the TLB is tagged with an address-space identifier (ASID) to avoid flushing the TLB on MMU context switches (within a VM)
[Diagram: a traditional OS TLB maps virtual address → physical address; under Disco, the pmap maps physical address → machine address, so the real TLB maps virtual address → machine address]
• Must flush the real TLB on a VM switch
• Somewhat slower:
  – The OS now has TLB misses (not direct mapped)
  – TLB flushes are frequent
  – TLB instructions are now emulated
• Disco maintains a second-level cache of TLB entries:
  – This makes the VTLB seem larger than a regular R10000 TLB
  – Disco can absorb many TLB faults without passing them through to the real OS
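The pmap indirection above can be sketched in a few lines. This is a hypothetical toy (the class, field names, and machine-page values are invented); the real Disco rewrites hardware TLB entries rather than manipulating Python dictionaries.

```python
# Sketch of Disco's extra level of indirection: the guest installs
# virtual->physical TLB entries, and the monitor rewrites them to
# virtual->machine entries using the per-VM pmap.

class VirtualMachine:
    def __init__(self, pmap):
        self.pmap = pmap      # per-VM: physical page -> machine page
        self.real_tlb = {}    # what actually lands in the hardware TLB

    def guest_tlb_write(self, vpn, ppn):
        """Emulate a trapped TLB write: correct ppn to a machine page."""
        mfn = self.pmap[ppn]          # physical -> machine lookup
        self.real_tlb[vpn] = mfn      # install the corrected entry

vm = VirtualMachine(pmap={0: 0x2A0, 1: 0x117})  # made-up machine pages
vm.guest_tlb_write(vpn=5, ppn=0)   # guest believes physical page 0 exists
assert vm.real_tlb[5] == 0x2A0     # hardware only ever sees machine pages
```

The guest never observes machine addresses, which is what lets Disco migrate or replicate pages behind its back.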

NUMA Memory Management
• Dynamic page migration and replication
  – Pages frequently accessed by one node are migrated
  – Read-shared pages are replicated among all nodes
  – Write-shared pages are not moved, since maintaining consistency requires remote access anyway
  – The migration and replication policy is driven by the cache-miss-counting facility provided by the FLASH hardware

Transparent Page Replication
1. Two different virtual processors of the same virtual machine logically read-share the same physical page, but each virtual processor accesses a local copy
2. memmap maintains an entry for each machine page that records which virtual pages reference it; used during TLB shootdown*
*Processors flush their TLBs when another processor restricts access to a shared page.

I/O (disk and network)
• Emulate all programmed I/O instructions
• Can also use special Disco-aware device drivers (simpler)
• Main task: translate all I/O instructions from using physical-memory addresses to machine-memory addresses
• Optimizations:
  – Larger TLB
  – Copy-on-write disk blocks
    » Track which blocks are already in memory
    » When possible, reuse these pages by marking all versions read-only and using copy-on-write if they are modified
    » → shared OS pages and shared executables can really be shared
• Zero-copy networking along a fake "subnet" that connects VMs within an SMP
  – Sender and receiver can use the same buffer (copy-on-write)

Transparent Page Sharing
• A global buffer cache that can be transparently shared between virtual machines
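A toy sketch of the copy-on-write disk-block optimization described above (all names are hypothetical): identical blocks read by different VMs map to a single read-only machine page, and a write produces a private copy for the writer.

```python
# Transparent page sharing, reduced to its essentials: identical disk
# blocks share one in-memory page; writes are satisfied with a copy.

class SharedBlockCache:
    def __init__(self):
        self.by_block = {}   # disk block id -> shared page contents

    def read_block(self, block_id, data):
        # Reuse the in-memory copy if this block was already read.
        if block_id not in self.by_block:
            self.by_block[block_id] = list(data)
        return self.by_block[block_id]   # shared, treated as read-only

    def write_page(self, page):
        return list(page)    # copy-on-write: private copy for the writer

cache = SharedBlockCache()
p1 = cache.read_block(7, b"hello")   # VM 1 reads block 7
p2 = cache.read_block(7, b"hello")   # VM 2 reads the same block
assert p1 is p2                      # both VMs share one machine page
p3 = cache.write_page(p1)            # VM 1 now wants to modify the page
p3[0] = 0
assert p1 is not p3                  # writer got a copy; sharers unaffected
```

This is why shared OS pages and shared executables across VMs cost machine memory only once.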

Transparent Page Sharing over NFS
1. The monitor's networking device remaps the data page from the source's machine address to the destination's
2. The monitor remaps the data page from the driver's buffer to the client's buffer cache

Impact
• Revived VMs for the next 20 years
  – Now VMs are commodity: every cloud provider, and virtually every enterprise, uses VMs today
• VMware: a successful commercialization of this work
  – Founded by the authors of Disco in 1998; $6B revenue today
  – Initially targeted developers
  – Killer application: workload consolidation and infrastructure management in enterprises

Summary
• The Disco VMM hides NUMA-ness from a non-NUMA-aware OS
• The Disco VMM is low effort
  – Only 13K LoC
• Moderate overhead due to virtualization
  – Only 16% overhead for uniprocessor workloads
  – A system with eight virtual machines can run some workloads 40% faster

Is this a good paper?
• What were the authors' goals?
• What about the evaluation/metrics?
• Did they convince you that this was a good system/approach?
• Were there any red flags?
• What mistakes did they make?
• Does the system/approach meet the "Test of Time" challenge?
• How would you review this paper today?

How to Build a VMM 2: Trap and Emulate
[Diagram: the guest app and guest kernel run inside an emulator process that provides the "physical" memory, virtual MMU, and virtual system calls, on a normal OS over the hardware. An unprivileged instruction such as `add %eax, %ebx` executes directly; a privileged one such as `outb %al` or `sysenter` BREAKs into the emulator, which runs a handler like handle_sysenter]
• Example: a guest kernel loop that updates page-table entries

    for (i = 0; i < 256; i++)
        mangle_pagetable_entry(&ptes[i]);

  – 256 traps into the emulator
  – Severe performance penalty
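The cost of trap-and-emulate on the loop above can be modeled by counting VMM entries. This is a deliberately toy model, with the "trap" reduced to a counter increment and the page-table "mangling" to setting a bit.

```python
# Toy model of why the loop is slow under trap-and-emulate:
# every privileged page-table write exits to the emulator.

trap_count = 0

def mangle_pagetable_entry(ptes, i):
    global trap_count
    trap_count += 1           # each privileged write traps into the VMM
    ptes[i] = ptes[i] | 0x1   # the emulator performs the actual update

ptes = [i << 12 for i in range(256)]   # made-up page-table entries
for i in range(256):
    mangle_pagetable_entry(ptes, i)

assert trap_count == 256      # one full VMM entry/exit per entry written
```

Each of those crossings costs hundreds to thousands of cycles on real hardware, which is the "severe performance penalty" the slide refers to.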

How to Build a VMM 3: Dynamic Binary Translation (VMware)
[Diagram: a translator process rewrites the guest kernel; the rewritten guest app and kernel run over the virtual MMU and virtual system calls on a normal OS over the hardware]
• The same loop can be rewritten to batch the page-table updates:

    pte_t new_ptes[256];
    for (i = 0; i < 256; i++)
        new_ptes[i] = mangled_entry(&ptes[i]);
    register_new_ptes(new_ptes, 256);

• But when is this a safe alteration?

How to Build a VMM 4: Paravirtualization (Xen)
• Q: But when is this a safe alteration?
  A: Let the humans worry about it
  – Manually hack the OS: "paravirtualization"
[Diagram: under full virtualization, user applications run in ring 3 and the guest OS in ring 1 under binary translation, with the VMM in ring 0; under paravirtualization, user apps run in ring 3, the guest OS in ring 1, and Xen in ring 0, with Dom0 hosting the control plane]
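The batched rewrite can be modeled the same way as the trap-counting sketch: all 256 updates now cross into the VMM once. The function names mirror the slide's pseudocode, but their semantics here are invented for illustration.

```python
# Toy model of the batched update: compute the new entries with plain
# (untrapped) instructions, then make a single registration call.

trap_count = 0

def register_new_ptes(table, new_ptes):
    global trap_count
    trap_count += 1                  # one entry into the VMM, total
    table[:len(new_ptes)] = new_ptes

def mangled_entry(pte):
    return pte | 0x1                 # same transformation as before

ptes = [i << 12 for i in range(256)]
new_ptes = [mangled_entry(p) for p in ptes]  # no traps: ordinary code
register_new_ptes(ptes, new_ptes)

assert trap_count == 1               # 256 updates, one VMM crossing
```

Whether the translator may legally make this transformation is exactly the safety question the slide raises; paravirtualization answers it by having the OS author do the batching explicitly.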

CPU
• x86 supports 4 privilege levels
  – 0 for the OS, and 3 for applications
  – Xen downgrades the OS to level 1
  – System-call and page-fault handlers are registered with Xen
  – "Fast handlers" for most exceptions don't involve Xen

Xen: Founding Principles
• Key idea: minimally alter the guest OS to make VMs simpler and higher performance
  – Called paravirtualization (term due to the Denali project)
• Don't disguise multiplexing
• Execute faster than the competition
• Note: VMware does this too, as "guest additions" are basically paravirtualization through specialized drivers (disk, I/O, video, ...)

Xen: Emulate x86 (mostly)
• Xen paravirtualization:
  – Required less than 2% of the total lines of code to be modified
  – Pros: better performance on x86, some simplifications in the VM implementation, and the OS might want to know that it is virtualized! (e.g., real-time clocks)
  – Cons: must modify the guest OS (but not its applications!)
• Aims for performance isolation (why is this hard?)
• Philosophy:
  – Divide up resources and let each OS manage its own
  – Ensures that real costs are correctly accounted to each OS (essentially zero shared costs, e.g., no shared buffers, no shared network stack, etc.)

x86 Virtualization
• x86 is harder to virtualize than MIPS (as in Disco):
  – The MMU uses hardware page tables
  – Some privileged instructions fail silently rather than faulting
    » VMware fixed this using binary rewriting
    » Xen fixed it by modifying the OS to avoid such instructions
• Step 1: reduce the privilege of the OS
  – The "hypervisor" runs with full privilege instead (ring 0); the OS runs in ring 1, apps in ring 3
  – Xen must intercept interrupts and convert them to events posted to a shared region with the OS
  – Needs both real and virtual time (and wall-clock time)

Virtualizing Virtual Memory
• Unlike MIPS, x86 does not have a software TLB
  – Good performance requires that all valid translations be in the hardware page table
  – The TLB is not "tagged", which means an address-space switch must flush the TLB
1) Map Xen into the top 64MB of all address spaces (with guest OS access limited) to avoid a TLB flush
2) The guest OS manages the hardware page table(s), but entries must be validated by Xen on update; the guest OS has read-only access to its own page table
• Page frame types:
  – PD = page directory, PT = page table, LDT = local descriptor table, GDT = global descriptor table, RW = writable page
  – The type system allows Xen to make sure that only validated pages are used for the hardware page table
• Each guest OS gets a dedicated set of pages, although the size can grow/shrink over time
• Physical page numbers (those used by the guest OS) can differ from the actual hardware numbers
  – Xen has a table to map hardware pages to physical pages
  – Each guest OS has a physical-to-hardware map
  – This enables the illusion of physically contiguous pages

Network
• Model:
  – Each guest OS has a virtual network interface connected to a virtual firewall/router (VFR)
  – The VFR both limits the guest OS and also ensures correct incoming packet dispatch
• Exchange pages on packet receipt (to avoid copying)
  – No frame available → dropped packet
• Rules enforce no IP spoofing by the guest OS
• Bandwidth is round robin (is this "isolated"?)

Disk
• Virtual block devices (VBDs): similar to SCSI disks
• Management of partitions, etc. is done via Domain 0
• Could also use NFS or network-attached storage instead

Domain 0 (dom0)
• Nice idea: run the VMM management at user level
  – Given special access to the control interface for platform management
  – Has the back-end device drivers
• Much easier to debug a user-level process than an OS
• The narrow hypercall API and its checks can catch potential errors
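A minimal sketch of a validated page-table update, assuming a made-up Monitor class: the guest proposes a mapping, and the monitor installs it only if the target frame belongs to that domain. The real Xen mmu_update interface validates far more (page-frame types, reference counts), so treat this purely as the shape of the idea.

```python
# Xen-style validated updates: the guest's page table is read-only to
# the guest; all writes go through the monitor, which checks ownership.

class Monitor:
    def __init__(self, owned_frames):
        self.owned = owned_frames   # machine frames allocated to this domain
        self.page_table = {}        # guest has a read-only view of this

    def mmu_update(self, vpn, frame):
        # Refuse to map a frame the domain does not own.
        if frame not in self.owned:
            raise PermissionError("frame not owned by domain")
        self.page_table[vpn] = frame

xen = Monitor(owned_frames={0x10, 0x11})
xen.mmu_update(vpn=3, frame=0x10)        # valid: frame is owned
assert xen.page_table[3] == 0x10
try:
    xen.mmu_update(vpn=4, frame=0x99)    # invalid: another domain's frame
    assert False, "update should have been rejected"
except PermissionError:
    pass
```

Because every update passes this check, a buggy or malicious guest cannot map memory it was never given, which is the basis of Xen's isolation.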

CPU Scheduling: Borrowed Virtual Time
• Fair-sharing scheduler: effective isolation between executing domains
• However, allows temporary violations of fair sharing to favor recently-woken domains
  – Reduces wake-up latency → improves interactivity

Times and Timers
• Times:
  – Real time since machine boot: always advances, regardless of the executing domain
  – Virtual time: time that only advances within the context of the domain
  – Wall-clock time
• Each guest OS can program timers for both:
  – Real time
  – Virtual time

Control Transfer: Hypercalls and Events
• Hypercalls: synchronous calls from a domain to Xen
  – Allow domains to perform privileged operations via software traps
  – Similar to system calls
• Events: asynchronous notifications from Xen to domains
  – Replace device interrupts

Exceptions
• Memory faults and software traps
  – Virtualized through Xen's event handler
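A highly simplified sketch of the borrowed-virtual-time idea above: dispatch the domain with the smallest effective virtual time, where a freshly woken domain temporarily "warps" its virtual time backwards so it is scheduled quickly. Real BVT also charges domains for time run and bounds how much can be borrowed; those details are omitted here.

```python
# Borrowed Virtual Time, reduced to the dispatch decision: pick the
# domain with the smallest (virtual_time - warp).

def pick_next(domains):
    """domains: name -> (virtual_time, warp); warp > 0 only when woken."""
    return min(domains, key=lambda d: domains[d][0] - domains[d][1])

domains = {"domA": (100, 0), "domB": (105, 0)}
assert pick_next(domains) == "domA"   # plain fair sharing: lowest vtime

domains["domB"] = (105, 50)           # domB wakes and borrows virtual time
assert pick_next(domains) == "domB"   # favored now; fairness catches up
```

The "borrowing" is repaid because the woken domain's real virtual time keeps advancing as it runs, so long-run shares stay fair.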

Memory
• Physical memory
  – Reserved at domain creation time
  – Statically partitioned among domains
• Virtual memory:
  – The OS allocates a page table from its own reservation and registers it with Xen
  – The OS gives up all direct write privileges on the page table to Xen
  – All subsequent updates to the page table must be validated by Xen
  – Guest OSes typically batch updates to amortize hypervisor calls

Data Transfer: I/O Rings
• Zero-copy semantics
• Data is transferred to/from domains via Xen through a buffer ring

Benchmark Performance
[Figure: relative scores (0.0-1.1) for SPEC INT2000 (score), Linux build time (s), OSDB-OLTP (tup/s), and SPEC WEB99 (score); benchmark suite running on Linux (L), Xen (X), VMware Workstation (V), and UML (U)]
• Benchmarks
  – SPEC INT2000: compute-intensive workload
  – Linux build time: extensive file I/O, scheduling, memory management
  – OSDB-OLTP: transaction-processing workload, extensive synchronous disk I/O
  – SPEC WEB99: web-like workload (file and network traffic)
• Fair and reasonable comparisons?

I/O Performance
• Environments
  – L: Linux
  – IO-S: Xen using IO-Space access
  – IDD: Xen using isolated device drivers
• Benchmarks
  – Linux build time: file I/O, scheduling, memory management
  – PM: file system benchmark
  – OSDB-OLTP: transaction-processing workload, extensive synchronous disk I/O
  – httperf: static document retrieval
  – SpecWeb99: web-like workload (file and network traffic)
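The buffer ring above can be modeled as a fixed-size circular buffer with independent producer and consumer indices, shared between a guest frontend and the backend. All names here are hypothetical, and the real Xen rings pair separate request and response producer/consumer pointers; this sketch shows only the request direction.

```python
# Toy descriptor ring: the producer posts request descriptors, the
# consumer (the backend) drains them in FIFO order. Indices only ever
# increase; positions are taken modulo the ring size.

class IORing:
    def __init__(self, size=8):
        self.ring = [None] * size
        self.prod = 0            # advanced by the request producer
        self.cons = 0            # advanced by the consumer

    def push(self, desc):
        assert self.prod - self.cons < len(self.ring), "ring full"
        self.ring[self.prod % len(self.ring)] = desc
        self.prod += 1

    def pop(self):
        assert self.cons < self.prod, "ring empty"
        desc = self.ring[self.cons % len(self.ring)]
        self.cons += 1
        return desc

ring = IORing()
ring.push(("read", 0x10))             # guest posts a request descriptor
ring.push(("write", 0x11))
assert ring.pop() == ("read", 0x10)   # backend consumes in FIFO order
```

Descriptors name buffer pages rather than carrying data, which is what makes the zero-copy semantics possible.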

Xen Summary
• Performance overhead of only 2-5%
• Available as open source, but owned by Citrix since 2007
• Widely used today:
  – Linux supports dom0 in the mainline kernel
  – AWS EC2 is based on Xen
• Many security benefits
  – Multiplexes physical resources with performance isolation across OS instances
  – The hypervisor can isolate/contain OS security vulnerabilities
  – The hypervisor has a smaller attack surface
    » Simpler API than an OS: narrow interfaces → tractable security
    » Less overall code than an OS
• BUT hypervisor vulnerabilities compromise everything…

History and Impact
• Released in 2003 (Cambridge University)
  – The authors founded XenSource, acquired by Citrix for $500M in 2007
  – A modified version of Xen powers Amazon EC2
  – Widely used by web hosting companies

Is this a good paper?
• What were the authors' goals?
• What about the evaluation/metrics?
• Did they convince you that this was a good system/approach?
• Were there any red flags?
• What mistakes did they make?
• Does the system/approach meet the "Test of Time" challenge?
• How would you review this paper today?

Discussion: OSes, VMs, Containers
[Diagram comparing four structures over hardware: microkernels (apps and OS services over a microkernel), extensible OSes (apps over LibOS instances on SPIN/Exokernel), VMs (apps and bins/libs over guest OSes on a hypervisor), and containers (apps and bins/libs sharing a host OS)]