High-privilege components such as the operating system kernel and the hypervisor can be maliciously modified, so in a virtualized environment a program's data may be tampered with or leaked. Intel's SGX (Software Guard Extensions) technology is designed to provide a more secure execution environment. Working only from the published papers and Intel's public materials, researchers at MIT make bold, educated guesses about SGX's internal mechanisms. The paper ends by noting that many questions still remain open, and, writing as an independent third party, the authors call on the chip maker to disclose more of the internals so that its customers can genuinely trust the security of its chips.



Intel SGX Explained
Victor Costan and Srinivas Devadas
victor@costan.us, devadas@mit.edu
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology

ABSTRACT

Intel's Software Guard Extensions (SGX) is a set of extensions to the Intel architecture that aims to provide integrity and confidentiality guarantees to security-sensitive computation performed on a computer where all the privileged software (kernel, hypervisor, etc.) is potentially malicious.

This paper analyzes Intel SGX, based on the 3 papers [14, 78, 137] that introduced it, on the Intel Software Developer's Manual [100] (which supersedes the SGX manuals [94, 98]), on an ISCA 2015 tutorial [102], and on two patents [108, 136]. We use the papers, reference manuals, and tutorial as primary data sources, and only draw on the patents to fill in missing information.

This paper's contributions are a summary of the Intel-specific architectural and micro-architectural details needed to understand SGX, a detailed and structured presentation of the publicly available information on SGX, a series of intelligent guesses about some important but undocumented aspects of SGX, and an analysis of SGX's security properties.

Figure 1: Secure remote computation. A user relies on a remote computer, owned by an untrusted party, to perform some computation on her data. The user has some assurance of the computation's integrity and confidentiality.

1 OVERVIEW

Secure remote computation (Figure 1) is the problem of executing software on a remote computer owned and maintained by an untrusted party, with some integrity and confidentiality guarantees. In the general setting, secure remote computation is an unsolved problem. Fully Homomorphic Encryption [61] solves the problem for a limited family of computations, but has an impractical performance overhead [138].

Intel's Software Guard Extensions (SGX) is the latest iteration in a long line of trusted computing (Figure 2) designs, which aim to solve the secure remote computation problem by leveraging trusted hardware in the remote computer. The trusted hardware establishes a secure container, and the remote computation service user uploads the desired computation and data into the secure container. The trusted hardware protects the data's confidentiality and integrity while the computation is being performed on it.
SGX relies on software attestation, like its predecessors, the TPM [71] and TXT [70]. Attestation (Figure 3) proves to a user that she is communicating with a specific piece of software running in a secure container hosted by the trusted hardware. The proof is a cryptographic signature that certifies the hash of the secure container's contents. It follows that the remote computer's owner can load any software in a secure container, but the remote computation service user will refuse to load her data into a secure container whose contents' hash does not match the expected value.

The remote computation service user verifies the attestation key used to produce the signature against an endorsement certificate created by the trusted hardware's manufacturer. The certificate states that the attestation key is only known to the trusted hardware, and only used for the purpose of attestation.
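The attestation flow in Figure 3 pairs a Diffie-Hellman key exchange with an attestation signature Sign_AK(g^A, g^B, M), where M is the hash of the container's initial state. The C sketch below shows the checks a data owner might perform before trusting the channel; it is a minimal illustration of the protocol, not an SGX API, and every function name (cert_signed_by_manufacturer, signature_valid, dh_shared_secret) is a placeholder for a real cryptographic library.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define KEY_LEN  32
    #define HASH_LEN 32

    /* Placeholders standing in for a real cryptographic library. */
    extern bool cert_signed_by_manufacturer(const uint8_t *endorsement_cert, size_t cert_len);
    extern bool signature_valid(const uint8_t attestation_pub_key[KEY_LEN],
                                const uint8_t *message, size_t message_len,
                                const uint8_t *signature, size_t signature_len);
    extern void dh_shared_secret(const uint8_t my_private[KEY_LEN],
                                 const uint8_t peer_public[KEY_LEN],
                                 uint8_t shared_key[KEY_LEN]);

    /* Decide whether to trust the channel to the remote secure container.
     * Mirrors Figure 3: the container replies with g^B and Sign_AK(g^A, g^B, M),
     * where M is the hash of the container's initial state. */
    bool accept_attestation(const uint8_t *endorsement_cert, size_t cert_len,
                            const uint8_t attestation_pub_key[KEY_LEN],
                            const uint8_t g_a[KEY_LEN], const uint8_t a_private[KEY_LEN],
                            const uint8_t g_b[KEY_LEN],
                            const uint8_t measurement[HASH_LEN],
                            const uint8_t expected_measurement[HASH_LEN],
                            const uint8_t *signature, size_t signature_len,
                            uint8_t shared_key[KEY_LEN])
    {
        /* 1. The attestation key must be endorsed by the hardware manufacturer. */
        if (!cert_signed_by_manufacturer(endorsement_cert, cert_len))
            return false;

        /* 2. The signature must cover both key-exchange messages and the
         *    measurement, binding the secure channel to a specific container. */
        uint8_t signed_data[2 * KEY_LEN + HASH_LEN];
        memcpy(signed_data, g_a, KEY_LEN);
        memcpy(signed_data + KEY_LEN, g_b, KEY_LEN);
        memcpy(signed_data + 2 * KEY_LEN, measurement, HASH_LEN);
        if (!signature_valid(attestation_pub_key, signed_data, sizeof(signed_data),
                             signature, signature_len))
            return false;

        /* 3. Refuse to send data to a container whose contents' hash does not
         *    match the expected value. */
        if (memcmp(measurement, expected_measurement, HASH_LEN) != 0)
            return false;

        /* 4. Only now derive K = g^AB and use it to encrypt private code/data. */
        dh_shared_secret(a_private, g_b, shared_key);
        return true;
    }

Only after all three checks pass does the data owner derive the shared key and upload Enc_K(secret code/data), matching the refusal behavior described above.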

Figure 2: Trusted computing. The user trusts the manufacturer of a piece of hardware in the remote computer, and entrusts her data to a secure container hosted by the secure hardware.

Figure 3: Software attestation proves to a remote computer that it is communicating with a specific secure container hosted by a trusted platform. The proof is an attestation signature produced by the platform's secret attestation key. The signature covers the container's initial state, a challenge nonce produced by the remote computer, and a message produced by the container.

SGX stands out from its predecessors by the amount of code covered by the attestation, which is in the Trusted Computing Base (TCB) for the system using hardware protection. The attestations produced by the original TPM design covered all the software running on a computer, and TXT attestations covered the code inside a VMX [179] virtual machine. In SGX, an enclave (secure container) only contains the private data in a computation, and the code that operates on it.

For example, a cloud service that performs image processing on confidential medical images could be implemented by having users upload encrypted images. The users would send the encryption keys to software running inside an enclave. The enclave would contain the code for decrypting images, the image processing algorithm, and the code for encrypting the results. The code that receives the uploaded encrypted images and stores them would be left outside the enclave.

An SGX-enabled processor protects the integrity and confidentiality of the computation inside an enclave by isolating the enclave's code and data from the outside environment, including the operating system and hypervisor, and hardware devices attached to the system bus. At the same time, the SGX model remains compatible with the traditional software layering in the Intel architecture, where the OS kernel and hypervisor manage the computer's resources.

This work discusses the original version of SGX, also referred to as SGX 1. While SGX 2 brings very useful improvements for enclave authors, it is a small incremental improvement, from a design and implementation standpoint. After understanding the principles behind SGX 1 and its security properties, the reader should be well equipped to face Intel's reference documentation and learn about the changes brought by SGX 2.

1.1 SGX Lightning Tour

SGX sets aside a memory region, called the Processor Reserved Memory (PRM, § 5.1). The CPU protects the PRM from all non-enclave memory accesses, including kernel, hypervisor and SMM (§ 2.3) accesses, and DMA accesses (§ 2.9.1) from peripherals.

The PRM holds the Enclave Page Cache (EPC, § 5.1.1), which consists of 4 KB pages that store enclave code and data. The system software, which is untrusted, is in charge of assigning EPC pages to enclaves. The CPU tracks each EPC page's state in the Enclave Page Cache Metadata (EPCM, § 5.1.2), to ensure that each EPC page belongs to exactly one enclave.
The initial code and data in an enclave is loaded by untrusted system software. During the loading stage (§ 5.3), the system software asks the CPU to copy data from unprotected memory (outside PRM) into EPC pages, and assigns the pages to the enclave being set up (§ 5.1.2). It follows that the initial enclave state is known to the system software.

After all the enclave's pages are loaded into EPC, the system software asks the CPU to mark the enclave as initialized (§ 5.3), at which point application software can run the code inside the enclave. After an enclave is initialized, the loading method described above is disabled.
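As a rough illustration of this loading flow, the sketch below drives enclave setup from untrusted system software using the SGX instruction names covered later in the paper (ECREATE, EADD, EEXTEND and EINIT in § 5.3, EENTER in § 5.4). The sgx_* wrappers and their signatures are hypothetical simplifications: the real instructions are ENCLS/ENCLU leaf functions that take the exact structures defined in the SDM, and EEXTEND measures a page in 256-byte chunks rather than all at once.

    #include <stddef.h>
    #include <stdint.h>

    #define PAGE_SIZE 4096

    /* Hypothetical wrappers around the SGX instruction leaves. */
    extern void *sgx_ecreate(size_t enclave_size);                /* allocate enclave control structures */
    extern void  sgx_eadd(void *enclave, uint64_t offset,
                          const uint8_t page[PAGE_SIZE]);         /* copy one page into the EPC          */
    extern void  sgx_eextend(void *enclave, uint64_t offset);     /* fold the page into the measurement  */
    extern int   sgx_einit(void *enclave, const void *sigstruct); /* finalize the hash, mark initialized */
    extern void  sgx_eenter(void *tcs);                           /* ring 3 entry via a thread control structure */

    /* Untrusted loader: builds an enclave from an image held in normal DRAM. */
    void *load_enclave(const uint8_t *image, size_t image_size, const void *sigstruct)
    {
        void *enclave = sgx_ecreate(image_size);

        /* Every page passes through the untrusted OS, so only public code and
         * data can be loaded this way. */
        for (uint64_t offset = 0; offset < image_size; offset += PAGE_SIZE) {
            sgx_eadd(enclave, offset, image + offset);
            sgx_eextend(enclave, offset);
        }

        /* After EINIT succeeds, further EADD calls are refused for this
         * enclave, and application code may enter it with EENTER. */
        if (sgx_einit(enclave, sigstruct) != 0)
            return NULL;
        return enclave;
    }

Because the initial enclave contents travel through this untrusted loader, they are public; secrets must arrive later over a channel established by software attestation (§ 5.8).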

While an enclave is loaded, its contents are cryptographically hashed by the CPU. When the enclave is initialized, the hash is finalized, and becomes the enclave's measurement hash (§ 5.6).

A remote party can undergo a software attestation process (§ 5.8) to convince itself that it is communicating with an enclave that has a specific measurement hash, and is running in a secure environment.

Execution flow can only enter an enclave via special CPU instructions (§ 5.4), which are similar to the mechanism for switching from user mode to kernel mode. Enclave execution always happens in protected mode, at ring 3, and uses the address translation set up by the OS kernel and hypervisor.

To avoid leaking private data, a CPU that is executing enclave code does not directly service an interrupt, fault (e.g., a page fault) or VM exit. Instead, the CPU first performs an Asynchronous Enclave Exit (§ 5.4.3) to switch from enclave code to ring 3 code, and then services the interrupt, fault, or VM exit. The CPU performs an AEX by saving the CPU state into a predefined area inside the enclave and transfers control to a pre-specified instruction outside the enclave, replacing CPU registers with synthetic values.

The allocation of EPC pages to enclaves is delegated to the OS kernel (or hypervisor). The OS communicates its allocation decisions to the SGX implementation via special ring 0 CPU instructions (§ 5.3). The OS can also evict EPC pages into untrusted DRAM and later load them back, using dedicated CPU instructions. SGX uses cryptographic protections to assure the confidentiality, integrity and freshness of the evicted EPC pages while they are stored in untrusted memory.
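The paper states the required properties here (confidentiality, integrity and freshness of evicted pages) and defers the mechanism to § 5.5. Purely as a generic illustration of how those three properties can be obtained, the sketch below seals an evicted page with authenticated encryption and binds it to a per-page version counter kept on the trusted side; the structures and function names are invented for this example and do not correspond to SGX's actual data structures.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define PAGE_SIZE 4096
    #define KEY_LEN   16
    #define TAG_LEN   16
    #define NONCE_LEN 12

    /* Placeholder AEAD primitives (e.g., AES-GCM) from a crypto library. */
    extern void aead_seal(const uint8_t key[KEY_LEN], const uint8_t nonce[NONCE_LEN],
                          const uint8_t *aad, size_t aad_len,
                          const uint8_t *plain, size_t plain_len,
                          uint8_t *cipher, uint8_t tag[TAG_LEN]);
    extern bool aead_open(const uint8_t key[KEY_LEN], const uint8_t nonce[NONCE_LEN],
                          const uint8_t *aad, size_t aad_len,
                          const uint8_t *cipher, size_t cipher_len,
                          const uint8_t tag[TAG_LEN], uint8_t *plain);

    /* What lands in untrusted DRAM for one evicted page. */
    struct evicted_page {
        uint8_t  cipher[PAGE_SIZE];
        uint8_t  tag[TAG_LEN];
        uint64_t version;                 /* must match the trusted copy on reload */
    };

    /* Confidentiality and integrity come from the AEAD; freshness comes from a
     * version counter that never leaves trusted storage. */
    void evict_page(const uint8_t key[KEY_LEN], uint64_t *trusted_version,
                    const uint8_t page[PAGE_SIZE], struct evicted_page *out)
    {
        out->version = ++*trusted_version;
        uint8_t nonce[NONCE_LEN] = {0};
        memcpy(nonce, &out->version, sizeof out->version);   /* unique per seal */
        aead_seal(key, nonce, (const uint8_t *)&out->version, sizeof out->version,
                  page, PAGE_SIZE, out->cipher, out->tag);
    }

    /* Reload fails on tampering (bad tag) or replay of an older copy. */
    bool reload_page(const uint8_t key[KEY_LEN], uint64_t trusted_version,
                     const struct evicted_page *in, uint8_t page[PAGE_SIZE])
    {
        if (in->version != trusted_version)
            return false;
        uint8_t nonce[NONCE_LEN] = {0};
        memcpy(nonce, &in->version, sizeof in->version);
        return aead_open(key, nonce, (const uint8_t *)&in->version, sizeof in->version,
                         in->cipher, PAGE_SIZE, in->tag, page);
    }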
1.2 Outline and Troubling Findings

Reasoning about the security properties of Intel's SGX requires a significant amount of background information that is currently scattered across many sources. For this reason, a significant portion of this work is dedicated to summarizing this prerequisite knowledge.

Section 2 summarizes the relevant subset of the Intel architecture and the micro-architectural properties of recent Intel processors. Section 3 outlines the security landscape around trusted hardware systems, including cryptographic tools and relevant attack classes. Last, section 4 briefly describes the trusted hardware systems that make up the context in which SGX was created.

After having reviewed the background information, section 5 provides a (sometimes painstakingly) detailed description of SGX's programming model, mostly based on Intel's Software Developer's Manual.

Section 6 analyzes other public sources of information, such as Intel's SGX-related patents, to fill in some of the missing details in the SGX description. The section culminates in a detailed review of SGX's security properties that draws on information presented in the rest of the paper. This review outlines some troubling gaps in SGX's security guarantees, as well as some areas where no conclusions can be drawn without additional information from Intel.

That being said, perhaps the most troubling finding in our security analysis is that Intel added a launch control feature to SGX that forces each computer's owner to gain approval from a third party (which is currently Intel) for any enclave that the owner wishes to use on the computer. § 5.9 explains that the only publicly documented intended use for this launch control feature is a licensing mechanism that requires software developers to enter a (yet unspecified) business agreement with Intel to be able to author software that takes advantage of SGX's protections. All the official documentation carefully sidesteps this issue, and has a minimal amount of hints that lead to Intel's patents on SGX. Only these patents disclose the existence of licensing plans.

The licensing issue might not bear much relevance right now, because our security analysis reveals that the limitations in SGX's guarantees mean that a security-conscious software developer cannot in good conscience rely on SGX for secure remote computation. At the same time, should SGX ever develop better security properties, the licensing scheme described above becomes a major problem, given Intel's near-monopoly market share of desktop and server CPUs. Specifically, the licensing limitations effectively give Intel the power to choose winners and losers in industries that rely on cloud computing.

2 COMPUTER ARCHITECTURE BACKGROUND

This section attempts to summarize the general architectural principles behind Intel's most popular computer processors, as well as the peculiarities needed to reason about the security properties of a system running on these processors. Unless specified otherwise, the information here is summarized from Intel's Software Developer's Manual (SDM) [100].

Analyzing the security of a software system requires understanding the interactions between all the parts of the software's execution environment, so this section is

4.quite long. We do refrain from introducing any security software complexity at manageable levels, as it allows concepts here, so readers familiar with x86’s intricacies application and OS developers to focus on their software, can safely skip this section and refer back to it when and ignore the interactions with other software that may necessary. run on the computer. We use the terms Intel processor or Intel CPU to refer A key component of virtualization is address transla- to the server and desktop versions of Intel’s Core line- tion (§ 2.5), which is used to give software the impression up. In the interest of space and mental sanity, we ignore that it owns all the memory on the computer. Address Intel’s other processors, such as the embedded line of translation provides isolation that prevents a piece of Atom CPUs, or the failed Itanium line. Consequently, buggy or malicious software from directly damaging the terms Intel computers and Intel systems refers to other software, by modifying its memory contents. computer systems built around Intel’s Core processors. The other key component of virtualization is the soft- In this paper, the term Intel architecture refers to the ware privilege levels (§ 2.3) enforced by the CPU. Hard- x86 architecture described in Intel’s SDM. The x86 ar- ware privilege separation ensures that a piece of buggy chitecture is overly complex, mostly due to the need to or malicious software cannot damage other software indi- support executing legacy software dating back to 1990 rectly, by interfering with the system software managing directly on the CPU, without the overhead of software it. interpretation. We only cover the parts of the architecture Processes express their computing power requirements visible to modern 64-bit software, also in the interest of by creating execution threads, which are assigned by the space and mental sanity. operating system to the computer’s logical processors. The 64-bit version of the x86 architecture, covered in A thread contains an execution context (§ 2.6), which is this section, was actually invented by Advanced Micro the information necessary to perform a computation. For Devices (AMD), and is also known as AMD64, x86 64, example, an execution context stores the address of the and x64. The term “Intel architecture” highlights our next instruction that will be executed by the processor. interest in the architecture’s implementation in Intel’s Operating systems give each process the illusion that it chips, and our desire to understand the mindsets of Intel has an infinite amount of logical processors at its disposal, SGX’s designers. and multiplex the available logical processors between the threads created by each process. Modern operating 2.1 Overview systems implement preemptive multithreading, where A computer’s main resources (§ 2.2) are memory and the logical processors are rotated between all the threads processors. On Intel computers, Dynamic Random- on a system every few milliseconds. Changing the thread Access Memory (DRAM) chips (§ 2.9.1) provide the assigned to a logical processor is accomplished by an memory, and one or more CPU chips expose logical execution context switch (§ 2.6). processors (§ 2.9.4). These resources are managed by Hypervisors expose a fixed number of virtual proces- system software. An Intel computer typically runs two sors (vCPUs) to each operating system, and also use kinds of system software, namely operating systems and context switching to multiplex the logical CPUs on a hypervisors. 
computer between the vCPUs presented to the guest op- The Intel architecture was designed to support running erating systems. multiple application software instances, called processes. The execution core in a logical processor can execute An operating system (§ 2.3), allocates the computer’s re- instructions and consume data at a much faster rate than sources to the running processes. Server computers, espe- DRAM can supply them. Many of the complexities in cially in cloud environments, may run multiple operating modern computer architectures stem from the need to system instances at the same time. This is accomplished cover this speed gap. Recent Intel CPUs rely on hyper- by having a hypervisor (§ 2.3) partition the computer’s re- threading (§ 2.9.4), out-of-order execution (§ 2.10), and sources between the operating system instances running caching (§ 2.11), all of which have security implications. on the computer. An Intel processor contains many levels of interme- System software uses virtualization techniques to iso- diate memories that are much faster than DRAM, but late each piece of software that it manages (process or also orders of magnitude smaller. The fastest intermedi- operating system) from the rest of the software running ate memory is the logical processor’s register file (§ 2.2, on the computer. This isolation is a key tool for keeping § 2.4, § 2.6). The other intermediate memories are called 4

caches (§ 2.11). The Intel architecture requires application software to explicitly manage the register file, which serves as a high-speed scratch space. At the same time, caches transparently accelerate DRAM requests, and are mostly invisible to software.

Intel computers have multiple logical processors. As a consequence, they also have multiple caches distributed across the CPU chip. On multi-socket systems, the caches are distributed across multiple CPU chips. Therefore, Intel systems use a cache coherence mechanism (§ 2.11.3), ensuring that all the caches have the same view of DRAM. Thanks to cache coherence, programmers can build software that is unaware of caching, and still runs correctly in the presence of distributed caches. However, cache coherence does not cover the dedicated caches used by address translation (§ 2.11.5), and system software must take special measures to keep these caches consistent.

CPUs communicate with the outside world via I/O devices (also known as peripherals), such as network interface cards and display adapters (§ 2.9). Conceptually, the CPU communicates with the DRAM chips and the I/O devices via a system bus that connects all these components.

Software written for the Intel architecture communicates with I/O devices via the I/O address space (§ 2.4) and via the memory address space, which is primarily used to access DRAM. System software must configure the CPU's caches (§ 2.11.4) to recognize the memory address ranges used by I/O devices. Devices can notify the CPU of the occurrence of events by dispatching interrupts (§ 2.12), which cause a logical processor to stop executing its current thread, and invoke a special handler in the system software (§ 2.8.2).

Intel systems have a highly complex computer initialization sequence (§ 2.13), due to the need to support a large variety of peripherals, as well as a multitude of operating systems targeting different versions of the architecture. The initialization sequence is a challenge to any attempt to secure an Intel computer, and has facilitated many security compromises (§ 2.3).

Intel's engineers use the processor's microcode facility (§ 2.14) to implement the more complicated aspects of the Intel architecture, which greatly helps manage the hardware's complexity. The microcode is completely invisible to software developers, and its design is mostly undocumented. However, in order to evaluate the feasibility of any architectural change proposals, one must be able to distinguish changes that can be implemented in microcode from changes that can only be accomplished by modifying the hardware.

2.2 Computational Model

This section pieces together a highly simplified model for a computer that implements the Intel architecture, illustrated in Figure 4. This simplified model is intended to help the reader's intuition process the fundamental concepts used by the rest of the paper. The following sections gradually refine the simplified model into a detailed description of the Intel architecture.

Figure 4: A computer's core is its processors and memory, which are connected by a system bus. Computers also have I/O devices, such as keyboards, which are also connected to the processor via the system bus.

The building blocks for the model presented here come from [163], which introduces the key abstractions in a computer system, and then focuses on the techniques used to build software systems on top of these abstractions.

The memory is an array of storage cells, addressed using natural numbers starting from 0, and implements the abstraction depicted in Figure 5. Its salient feature is that the result of reading a memory cell at an address must equal the most recent value written to that memory cell.

WRITE(addr, value) → ∅
Store value in the storage cell identified by addr.

READ(addr) → value
Return the value argument to the most recent WRITE call referencing addr.

Figure 5: The memory abstraction
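A minimal C rendering of the Figure 5 abstraction, as a toy model rather than a description of real DRAM:

    #include <stdint.h>

    #define MEM_CELLS 65536   /* toy capacity; callers pass addr < MEM_CELLS */

    /* Toy model of Figure 5: an array of storage cells addressed by natural
     * numbers starting from 0. */
    static uint64_t cells[MEM_CELLS];

    /* WRITE(addr, value): store value in the cell identified by addr. */
    void mem_write(uint32_t addr, uint64_t value)
    {
        cells[addr] = value;
    }

    /* READ(addr): return the value passed to the most recent WRITE that
     * referenced addr -- the abstraction's salient property. */
    uint64_t mem_read(uint32_t addr)
    {
        return cells[addr];
    }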
A logical processor repeatedly reads instructions from the computer's memory and executes them, according to the flowchart in Figure 6.

The processor has an internal memory, referred to as the register file. The register file consists of Static Random Access Memory (SRAM) cells, generally known as registers, which are significantly faster than DRAM cells, but also a lot more expensive.
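The flowchart in Figure 6 reduces to the loop sketched below. This is a deliberately simplified software model, with no pipelining or out-of-order execution, and the helper functions (fetch, execute, commit, raise_fault, service_interrupt) are placeholders for the processor's internal logic.

    #include <stdbool.h>
    #include <stdint.h>

    /* Placeholders for the processor's internal logic. */
    extern uint64_t fetch(uint64_t rip);                      /* read the instruction at RIP             */
    extern bool     execute(uint64_t insn, uint64_t *result,
                            uint64_t *insn_size);             /* false if a fault occurred               */
    extern void     commit(uint64_t result);                  /* write results to the output registers   */
    extern uint64_t raise_fault(uint64_t rip);                /* returns the fault handler's address     */
    extern bool     interrupt_pending(void);
    extern uint64_t service_interrupt(uint64_t rip);          /* returns the interrupt handler's address */

    /* Simplified model of the Figure 6 loop: fetch, execute, commit, advance
     * RIP; faults and interrupts divert execution to handler code. */
    void processor_loop(uint64_t rip)
    {
        for (;;) {
            if (interrupt_pending()) {        /* interrupts are checked between instructions */
                rip = service_interrupt(rip);
                continue;
            }
            uint64_t insn = fetch(rip);
            uint64_t result, size;
            if (!execute(insn, &result, &size)) {
                rip = raise_fault(rip);       /* e.g. #DIV when the divisor is zero; no result is committed */
                continue;
            }
            commit(result);
            rip += size;                      /* variable-size encoding: size is known only after decoding */
        }
    }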

6. IP Generation Exception Handling § 2.6. Write interrupt Under normal circumstances, the processor repeatedly Interrupted? YES data to exception reads an instruction from the memory address stored in registers NO RIP, executes the instruction, and updates RIP to point to the following instruction. Unlike many RISC architec- Fetch Read the current instruction tures, the Intel architecture uses a variable-size instruc- from the memory at RIP tion encoding, so the size of an instruction is not known Decode until the instruction has been read from memory. Identify the desired operation, inputs, and outputs While executing an instruction, the processor may encounter a fault, which is a situation where the instruc- Register Read tion’s preconditions are not met. When a fault occurs, Read the current instruction’s input registers the instruction does not store a result in the output loca- tion. Instead, the instruction’s result is considered to be Execute Execute the current instruction the fault that occurred. For example, an integer division instruction DIV where the divisor is zero results in a Exception Handling Division Fault (#DIV). Write fault data to the When an instruction results in a fault, the processor Did a fault occur? YES exception registers stops its normal execution flow, and performs the fault NO handler process documented in § 2.8.2. In a nutshell, the Locate the current Commit processor first looks up the address of the code that will exception’s handler Write the execution results to handle the fault, based on the fault’s nature, and sets up the current instruction’s output Locate the handler’s the execution environment in preparation to execute the registers exception stack top fault handler. IP Generation Push RSP and RIP to The processors are connected to each other and to the Output registers the exception stack memory via a system bus, which is a broadcast network YES include RIP? that implements the abstraction in Figure 7. Write the exception NO stack top to RSP and SEND (op, addr, data) → ∅ Increment RIP by the size of Write the exception Place a message containing the operation code op, the the current instruction handler address to RIP bus address addr, and the value data on the bus. READ () → (op, addr, value) Figure 6: A processor fetches instructions from the memory and Return the message that was written on the bus at the executes them. The RIP register holds the address of the instruction beginning of this clock cycle. to be executed. Figure 7: The system bus abstraction An instruction performs a simple computation on its During each clock cycle, at most one of the devices inputs and stores the result in an output location. The connected to the system bus can send a message, which processor’s registers make up an execution context that is received by all the other devices connected to the bus. provides the inputs and stores the outputs for most in- Each device attached to the bus decodes the operation structions. For example, ADD RDX, RAX, RBX per- codes and addresses of all the messages sent on the bus forms an integer addition, where the inputs are the regis- and ignores the messages that do not require its involve- ters RAX and RBX, and the result is stored in the output ment. register RDX. 
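A toy C rendering of the Figure 7 bus abstraction and of the READ-REQUEST / READ-RESPONSE exchange described in the next paragraph; the operation-code names and message layout are invented for illustration.

    #include <stdint.h>

    /* Invented operation codes for the toy bus model. */
    enum bus_op { BUS_READ_REQUEST, BUS_READ_RESPONSE, BUS_WRITE_REQUEST };

    /* One broadcast message per clock cycle (Figure 7). */
    struct bus_message {
        enum bus_op op;
        uint64_t    addr;
        uint64_t    data;
    };

    /* SEND: place a message on the bus.  READ: return the message that was
     * broadcast at the beginning of this clock cycle. */
    extern void               bus_send(struct bus_message msg);
    extern struct bus_message bus_read(void);

    /* The memory device decodes every broadcast message and answers only the
     * ones addressed to DRAM, ignoring the rest. */
    void dram_device_step(uint64_t *dram, uint64_t dram_base, uint64_t dram_cells)
    {
        struct bus_message msg = bus_read();
        if (msg.op != BUS_READ_REQUEST)
            return;                                      /* not a read request             */
        if (msg.addr < dram_base || msg.addr >= dram_base + dram_cells)
            return;                                      /* addressed to some other device */

        struct bus_message reply = {
            .op   = BUS_READ_RESPONSE,
            .addr = msg.addr,                            /* same address as the request    */
            .data = dram[msg.addr - dram_base],          /* result of the READ operation   */
        };
        bus_send(reply);                                 /* sent during a later clock cycle */
    }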
For example, when the processor wishes to read a The registers mentioned in Figure 6 are the instruction memory location, it sends a message with the operation pointer (RIP), which stores the memory address of the code READ - REQUEST and the bus address corresponding next instruction to be executed by the processor, and the to the desired memory location. The memory sees the stack pointer (RSP), which stores the memory address message on the bus and performs the READ operation. of the topmost element in the call stack used by the At a later time, the memory responds by sending a mes- processor’s procedural programming support. The other sage with the operation code READ - RESPONSE, the same execution context registers are described in § 2.4 and address as the request, and the data value set to the result 6

7.of the READ operation. agement RAM (SMRAM), and for loading all the code The computer communicates with the outside world that needs to run in SMM mode into SMRAM. The SM- via I/O devices, such as keyboards, displays, and net- RAM enjoys special hardware protections that prevent work cards, which are connected to the system bus. De- less privileged software from accessing the SMM code. vices mostly respond to requests issued by the processor. IaaS cloud providers allow their customers to run their However, devices also have the ability to issue interrupt operating system of choice in a virtualized environment. requests that notify the processor of outside events, such Hardware virtualization [179], called Virtual Machine as the user pressing a key on a keyboard. Extensions (VMX) by Intel, adds support for a hypervi- Interrupt triggering is discussed in § 2.12. On modern sor, also called a Virtual Machine Monitor (VMM) in systems, devices send interrupt requests by issuing writes the Intel documentation. The hypervisor runs at a higher to special bus addresses. Interrupts are considered to be privilege level (VMX root mode) than the operating sys- hardware exceptions, just like faults, and are handled in tem, and is responsible for allocating hardware resources a similar manner. across multiple operating systems that share the same physical machine. The hypervisor uses the CPU’s hard- 2.3 Software Privilege Levels ware virtualization features to make each operating sys- In an Infrastructure-as-a-Service (IaaS) cloud environ- tem believe it is running in its own computer, called a ment, such as Amazon EC2, commodity CPUs run soft- virtual machine (VM). Hypervisor code generally runs ware at four different privilege levels, shown in Figure 8. at ring 0 in VMX root mode. More Privileged Hypervisors that run in VMX root mode and take ad- SMM BIOS vantage of hardware virtualization generally have better VMX performance and a smaller codebase than hypervisors Root Ring 0 Hypervisor based on binary translation [159]. Ring 1 System Software The systems research literature recommends breaking Ring 2 up an operating system into a small kernel, which runs Ring 3 at a high privilege level, known as the kernel mode or VMX supervisor mode and, in the Intel architecture, as ring 0. Non-Root Ring 0 OS Kernel The kernel allocates the computer’s resources to the other Ring 1 system components, such as device drivers and services, Ring 2 which run at lower privilege levels. However, for per- Ring 3 Application formance reasons1 , mainstream operating systems have SGX Enclave large amounts of code running at ring 0. Their monolithic Less Privileged kernels include device drivers, filesystem code, network- Figure 8: The privilege levels in the x86 architecture, and the ing stacks, and video rendering functionality. software that typically runs at each security level. Application code, such as a Web server or a game Each privilege level is strictly more powerful than the client, runs at the lowest privilege level, referred to as ones below it, so a piece of software can freely read and user mode (ring 3 in the Intel architecture). In IaaS cloud modify the code and data running at less privileged levels. environments, the virtual machine images provided by Therefore, a software module can be compromised by customers run in VMX non-root mode, so the kernel runs any piece of software running at a higher privilege level. 
in VMX non-root ring 0, and the application code runs It follows that a software module implicitly trusts all in VMX non-root ring 3. the software running at more privileged levels, and a 2.4 Address Spaces system’s security analysis must take into account the software at all privilege levels. Software written for the Intel architecture accesses the System Management Mode (SMM) is intended for use computer’s resources using four distinct physical address by the motherboard manufacturers to implement features spaces, shown in Figure 9. The address spaces overlap such as fan control and deep sleep, and/or to emulate partially, in both purpose and contents, which can lead to missing hardware. Therefore, the bootstrapping software confusion. This section gives a high-level overview of the (§ 2.13) in the computer’s firmware is responsible for 1 Calling a procedure in a different ring is much slower than calling setting up a continuous subset of DRAM as System Man- code at the same privilege level. 7

8.physical address spaces defined by the Intel architecture, 4 GB mark) are mapped to a flash memory device that with an emphasis on their purpose and the methods used holds the first stage of the code that bootstraps the com- to manage them. puter. The memory space is partitioned between devices and CPU DRAM by the computer’s firmware during the bootstrap- MSRs ping process. Sometimes, system software includes Registers (Model-Specific Registers) motherboard-specific code that modifies the memory space partitioning. The OS kernel relies on address trans- Software lation, described in § 2.5, to control the applications’ access to the memory space. The hypervisor relies on the same mechanism to control the guest OSs. System Buses The input/output (I/O) space consists of 216 I/O ad- Memory Addresses I/O Ports dresses, usually called ports. The I/O ports are used exclusively to communicate with devices. The CPU pro- vides specific instructions for reading from and writing DRAM Device Device to the I/O space. I/O ports are allocated to devices by formal or de-facto standards. For example, ports 0xCF8 Figure 9: The four physical address spaces used by an Intel CPU. The registers and MSRs are internal to the CPU, while the memory and 0xCFC are always used to access the PCI express and I/O address spaces are used to communicate with DRAM and (§ 2.9.1) configuration space. other devices via system buses. The CPU implements a mechanism for system soft- The register space consists of names that are used to ware to provide fine-grained I/O access to applications. access the CPU’s register file, which is the only memory However, all modern kernels restrict application software that operates at the CPU’s clock frequency and can be from accessing the I/O space directly, in order to limit used without any latency penalty. The register space is the damage potential of application bugs. defined by the CPU’s architecture, and documented in The Model-Specific Register (MSR) space consists of the SDM. 232 MSRs, which are used to configure the CPU’s op- Some registers, such as the Control Registers (CRs) eration. The MSR space was initially intended for the play specific roles in configuring the CPU’s operation. use of CPU model-specific firmware, but some MSRs For example, CR3 plays a central role in address trans- have been promoted to architectural MSR status, making lation (§ 2.5). These registers can only be accessed by their semantics a part of the Intel architecture. For ex- system software. The rest of the registers make up an ample, architectural MSR 0x10 holds a high-resolution application’s execution context (§ 2.6), which is essen- monotonically increasing time-stamp counter. tially a high-speed scratch space. These registers can The CPU provides instructions for reading from and be accessed at all privilege levels, and their allocation is writing to the MSR space. The instructions can only be managed by the software’s compiler. Many CPU instruc- used by system software. Some MSRs are also exposed tions only operate on data in registers, and only place by instructions accessible to applications. For example, their results in registers. applications can read the time-stamp counter via the The memory space, generally referred to as the address RDTSC and RDTSCP instructions, which are very useful space, or the physical address space, consists of 236 for benchmarking and optimizing software. (64 GB) - 240 (1 TB) addresses. 
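As a small demonstration of the statement that control registers can only be accessed by system software, the user-mode program below (Linux is assumed) attempts to read CR3; the CPU raises a general protection fault (§ 2.8.2), which the kernel reports to the process as SIGSEGV.

    #include <signal.h>
    #include <stdint.h>
    #include <unistd.h>

    static void on_sigsegv(int sig)
    {
        (void)sig;
        static const char msg[] = "reading CR3 at ring 3 faulted, as expected\n";
        write(STDOUT_FILENO, msg, sizeof msg - 1);
        _exit(0);
    }

    int main(void)
    {
        signal(SIGSEGV, on_sigsegv);

        uint64_t cr3 = 0;
        /* MOV from CR3 is privileged; executed at ring 3 it raises #GP. */
        __asm__ volatile("mov %%cr3, %0" : "=r"(cr3));

        (void)cr3;       /* never reached */
        return 1;
    }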
The memory space is primarily used to access DRAM, but it is also used to 2.5 Address Translation communicate with memory-mapped devices that read System software relies on the CPU’s address transla- memory requests off a system bus and write replies for tion mechanism for implementing isolation among less the CPU. Some CPU instructions can read their inputs privileged pieces of software (applications or operating from the memory space, or store the results using the systems). Virtually all secure architecture designs bring memory space. changes to address translation. We summarize the Intel A better-known example of memory mapping is that architecture’s address translation features that are most at computer startup, memory addresses 0xFFFFF000 - relevant when establishing a system’s security proper- 0xFFFFFFFF (the 64 KB of memory right below the ties, and refer the reader to [107] for a more general 8

9.presentation of address translation concepts and its other isolate the processes from each other, and prevent ap- uses. plication code from accessing memory-mapped devices directly. The latter two protection measures prevent an 2.5.1 Address Translation Concepts application’s bugs from impacting other applications or the OS kernel itself. Hypervisors also use address trans- From a systems perspective, address translation is a layer lation, to divide the DRAM among operating systems of indirection (shown in Figure 10) between the virtual that run concurrently, and to virtualize memory-mapped addresses, which are used by a program’s memory load devices. and store instructions, and the physical addresses, which The address translation mode used by 64-bit operating reference the physical address space (§ 2.4). The map- systems, called IA-32e by Intel’s documentation, maps ping between virtual and physical addresses is defined by 48-bit virtual addresses to physical addresses of at most page tables, which are managed by the system software. 52 bits2 . The translation process, illustrated in Figure 12, is carried out by dedicated hardware in the CPU, which is Virtual Address Physical referred to as the address translation unit or the memory Address Space Translation Address Space management unit (MMU). Virtual Physical Virtual Mapping Address Address Address 64…48 CR3 Register: System bus Must PML4 address Page match Software Tables DRAM bit 48 Page Map Level 4 (PML4) Figure 10: Virtual addresses used by software are translated into 47…39 physical memory addresses using a mapping defined by the page PML4 PML4 Entry: PDPT address tables. Index Operating systems use address translation to imple- Page-Directory-Pointer Table ment the virtual memory abstraction, illustrated by Fig- Virtual Page Number (VPN) (PDPT) ure 11. The virtual memory abstraction exposes the same 38…30 PDPTE PDPT Entry: PD address interface as the memory abstraction in § 2.2, but each Index process uses a separate virtual address space that only references the memory allocated to that process. From Page-Directory (PD) an application developer standpoint, virtual memory can 29…21 PDE PD Entry: PT address be modeled by pretending that each process runs on a Index separate computer and has its own DRAM. Page Table (PT) Process 1’s Process 2’s Process 3’s 20…12 address space address space address space PTE PT Entry: Page address Index Physical Page Number (PPN) 11…0 Page + Computer’s physical address space Offset Memory page Physical Address Figure 11: The virtual memory abstraction gives each process Figure 12: IA-32e address translation takes in a 48-bit virtual its own virtual address space. The operating system multiplexes address and outputs a 52-bit physical address. the computer’s DRAM between the processes, while application developers build software as if it owns the entire computer’s memory. The bottom 12 bits of a virtual address are not changed Address translation is used by the operating system to 2 The size of a physical address is CPU-dependent, and is 40 bits multiplex DRAM among multiple application processes, for recent desktop CPUs and 44 bits for recent high-end server CPUs. 9

10.by the translation. The top 36 bits are grouped into four The CPU’s address translation is also referred to as 9-bit indexes, which are used to index into the page “paging”, which is a shorthand for “page swapping”. tables. Despite its name, the page tables data structure closely resembles a full 512-ary search tree where nodes 2.5.2 Address Translation and Virtualization have fixed keys. Each node is represented in DRAM as Computers that take advantage of hardware virtualization an array of 512 8-byte entries that contain the physical use a hypervisor to run multiple operating systems at addresses of the next-level children as well as some flags. the same time. This creates some tension, because each The physical address of the root node is stored in the operating system was written under the assumption that it CR3 register. The arrays in the last-level nodes contain owns the entire computer’s DRAM. The tension is solved the physical addresses that are the result of the address by a second layer of address translation, illustrated in translation. Figure 14. The address translation function, which does not Virtual Address Virtual change the bottom bits of addresses, partitions the mem- Address Space ory address space into pages. A page is the set of all memory locations that only differ in the bottom bits Guest OS Page Tables Mapping Address Space which are not impacted by address translation, so all the memory addresses in a virtual page translate to corre- Guest-Physical Address sponding addresses in the same physical page. From this perspective, the address translation function can be seen Extended Page Mapping Physical as a mapping between Virtual Page Numbers (VPN) and Tables (EPT) Address Space Physical Page Numbers (PPN), as shown in Figure 13. Physical Address 63 48 47 12 11 0 must match bit 47 Virtual Page Number (VPN) Page Offset Figure 14: Virtual addresses used by software are translated into physical memory addresses using a mapping defined by the page Virtual address tables. Address Translation Unit When a hypervisor is active, the page tables set up 43 12 11 0 by an operating system map between virtual addresses Physical Page Number (PPN) Page Offset and guest-physical addresses in a guest-physical ad- Physical address dress space. The hypervisor multiplexes the computer’s DRAM between the operating systems’ guest-physical Figure 13: Address translation can be seen as a mapping between address spaces via the second layer of address transla- virtual page numbers and physical page numbers. tions, which uses extended page tables (EPT) to map In addition to isolating application processes, operat- guest-physical addresses to physical addresses. ing systems also use the address translation feature to run The EPT uses the same data structure as the page applications whose collective memory demands exceed tables, so the process of translating guest-physical ad- the amount of DRAM installed in the computer. The OS dresses to physical addresses follows the same steps as evicts infrequently used memory pages from DRAM to IA-32e address translation. The main difference is that a larger (but slower) memory, such as a hard disk drive the physical address of the data structure’s root node is (HDD) or solid-state drive (SSD). For historical reason, stored in the extended page table pointer (EPTP) field this slower memory is referred to as the disk. in the Virtual Machine Control Structure (VMCS) for The OS ability to over-commit DRAM is often called the guest OS. 
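A simplified software model of the IA-32e walk described above: four levels of 512-entry tables indexed by 9-bit slices of the virtual address, with the low 12 bits passed through as the page offset. It ignores large pages and permission bits, and the table_at helper is a placeholder, since a real kernel cannot dereference a physical address directly.

    #include <stdbool.h>
    #include <stdint.h>

    #define ENTRIES_PER_TABLE 512
    #define PAGE_SHIFT        12
    #define PTE_PRESENT       0x1ULL
    #define PTE_ADDR_MASK     0x000FFFFFFFFFF000ULL   /* bits 51..12: next-level address */

    /* Placeholder: pretend page-table memory is directly readable. */
    extern const uint64_t *table_at(uint64_t phys_addr);

    /* Walk PML4 -> PDPT -> PD -> PT, consuming four 9-bit indexes taken from
     * bits 47..12 of the virtual address, then append the untranslated low
     * 12 bits as the page offset. */
    bool translate(uint64_t cr3, uint64_t virt, uint64_t *phys_out)
    {
        uint64_t table_phys = cr3 & PTE_ADDR_MASK;           /* root node (PML4)      */

        for (int level = 3; level >= 0; level--) {
            unsigned index = (virt >> (PAGE_SHIFT + 9 * level)) & (ENTRIES_PER_TABLE - 1);
            uint64_t entry = table_at(table_phys)[index];

            if (!(entry & PTE_PRESENT))                       /* P flag is 0: #PF      */
                return false;

            table_phys = entry & PTE_ADDR_MASK;               /* next node, or the PPN */
        }

        *phys_out = table_phys | (virt & ((1ULL << PAGE_SHIFT) - 1));
        return true;
    }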
Figure 15 illustrates the address translation page swapping, for the following reason. When an ap- process in the presence of hardware virtualization. plication process attempts to access a page that has been 2.5.3 Page Table Attributes evicted, the OS “steps in” and reads the missing page back into DRAM. In order to do this, the OS might have Each page table entry contains a physical address, as to evict a different page from DRAM, effectively swap- shown in Figure 12, and some Boolean values that are ping the contents of a DRAM page with a disk page. The referred to as flags or attributes. The following attributes details behind this high-level description are covered in are used to implement page swapping and software isola- the following sections. tion. 10

11. CR3: PDPT PD PT Guest values in these registers make up an application thread’s PML4 Physical (Guest) (Guest) (Guest) (Guest) Address state, or execution context. OS kernels multiplex each logical processor (§ 2.9.4) EPTP in EPT EPT EPT EPT EPT VMCS PML4 PML4 PML4 PML4 PML4 between multiple software threads by context switching, namely saving the values of the registers that make up a EPT EPT EPT EPT EPT thread’s execution context, and replacing them with an- PDPT PDPT PDPT PDPT PDPT other thread’s previously saved context. Context switch- EPT EPT EPT EPT EPT ing also plays a part in executing code inside secure PD PD PD PD PD containers, so its design has security implications. EPT EPT EPT EPT EPT PT PT PT PT PT 64-bit integers / pointers 64-bit special-purpose registers RAX RBX RCX RDX RIP - instruction pointer Virtual PML4 PDPT PD PT Physical RSI RDI RBP RSP RSP - stack pointer Address (Physical) (Physical) (Physical) (Physical) Address R8 R9 R10 R11 RFLAGS - status / control bits Figure 15: Address translation when hardware virtualization is R12 R13 R14 R15 enabled. The kernel-managed page tables contain guest-physical segment registers addresses, so each level in the kernel’s page table requires a full walk ignored segment registers FS GS of the hypervisor’s extended page table (EPT). A translation requires CS DS ES SS 64-bit FS base 64-bit GS base up to 20 memory accesses (the bold boxes), assuming the physical address of the kernel’s PML4 is cached. Figure 16: CPU registers in the 64-bit Intel architecture. RSP can be used as a general-purpose register (GPR), e.g., in pointer arithmetic, The present (P) flag is set to 0 to indicate unused parts but it always points to the top of the program’s stack. Segment of the address space, which do not have physical memory registers are covered in § 2.7. associated with them. The system software also sets the Integers and memory addresses are stored in 16 P flag to 0 for pages that are evicted from DRAM. When general-purpose registers (GPRs). The first 8 GPRs have the address translation unit encounters a zero P flag, it historical names: RAX, RBX, RCX, RDX, RSI, RDI, aborts the translation process and issues a hardware ex- RSP, and RBP, because they are extended versions of ception, as described in § 2.8.2. This hardware exception the 32-bit Intel architecture’s GPRs. The other 8 GPRs gives system software an opportunity to step in and bring are simply known as R9-R16. RSP is designated for an evicted page back into DRAM. pointing to the top of the procedure call stack, which is The accessed (A) flag is set to 1 by the CPU whenever simply referred to as the stack. RSP and the stack that the address translation machinery reads a page table entry, it refers to are automatically read and modified by the and the dirty (D) flag is set to 1 by the CPU when an CPU instructions that implement procedure calls, such entry is accessed by a memory write operation. The as CALL and RET (return), and by specialized stack han- A and D flags give the hypervisor and kernel insight dling instructions such as PUSH and POP. into application memory access patterns and inform the All applications also use the RIP register, which con- algorithms that select the pages that get evicted from tains the address of the currently executing instruction, RAM. 
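One common way the A flag feeds the eviction decision is the "clock" (second-chance) policy sketched below. This is a generic operating-system technique, not one the paper attributes to a particular kernel; the pte_for_frame accessor is a placeholder, and a real implementation would also have to flush the corresponding address-translation caches (§ 2.11.5).

    #include <stddef.h>
    #include <stdint.h>

    #define PTE_ACCESSED 0x20ULL   /* A flag: set by the CPU whenever the page is accessed */

    /* Placeholder accessor over the kernel's page-table entries. */
    extern uint64_t *pte_for_frame(size_t frame);

    /* Sweep a clock hand over the DRAM frames, clearing the A flag of recently
     * used pages and evicting the first page whose flag is already clear
     * (i.e., not accessed since the previous sweep). */
    size_t pick_victim_frame(size_t num_frames, size_t *clock_hand)
    {
        for (;;) {
            size_t frame = *clock_hand;
            *clock_hand = (*clock_hand + 1) % num_frames;

            uint64_t *pte = pte_for_frame(frame);
            if (*pte & PTE_ACCESSED)
                *pte &= ~PTE_ACCESSED;   /* recently used: give it a second chance */
            else
                return frame;            /* cold page: evict this one              */
        }
    }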
and the RFLAGS register, whose bits (e.g., the carry flag The main attributes supporting software isolation are - CF) are individually used to store comparison results the writable (W) flag, which can be set to 0 to prohibit3 and control various instructions. writes to any memory location inside a page, the disable Software might use other registers to interact with execution (XD) flag, which can be set to 1 to prevent specific processor features, some of which are shown in instruction fetches from a page, and the supervisor (S) Table 1. flag, which can be set to 1 to prohibit any accesses from The Intel architecture provides a future-proof method application software running at ring 3. for an OS kernel to save the values of feature-specific 2.6 Execution Contexts registers used by an application. The XSAVE instruction takes in a requested-feature bitmap (RFBM), and writes Application software targeting the 64-bit Intel architec- the registers used by the features whose RFBM bits are ture uses a variety of CPU registers to interact with the set to 1 in a memory area. The memory area written by processor’s features, shown in Figure 16 and Table 1. The XSAVE can later be used by the XRSTOR instruction to 3 Writes to non-writable pages result in #GP exceptions (§ 2.8.2). load the saved values back into feature-specific registers. 11

12. Feature Registers XCR0 bit segment, which is loaded in CS, and one data segment, FPU FP0 - FP7, FSW, FTW 0 which is loaded in SS, DS and ES. The FS and GS regis- SSE MM0 - MM7, XMM0 - 1 ters store segments covering thread-local storage (TLS). XMM15, XMCSR Due to the Intel architecture’s 16-bit origins, segment AVX YMM0 - YMM15 2 registers are exposed as 16-bit values, called segment MPX BND0 - BND 3 3 selectors. The top 13 bits in a selector are an index in a MPX BNDCFGU, BNDSTATUS 4 descriptor table, and the bottom 2 bits are the selector’s AVX-512 K0 - K7 5 ring number, which is also called requested privilege AVX-512 ZMM0 H - ZMM15 H 6 level (RPL) in the Intel documentation. Also, modern AVX-512 ZMM16 - ZMM31 7 system software only uses rings 0 and 3 (see § 2.3). PK PKRU 9 Each segment register has a hidden segment descrip- tor, which consists of a base address, limit, and type Table 1: Sample feature-specific Intel architecture registers. information, such as whether the descriptor should be The memory area includes the RFBM given to XSAVE, used for executable code or data. Figure 17 shows the so XRSTOR does not require an RFBM input. effect of loading a 16-bit selector into a segment register. Application software declares the features that it plans The selector’s index is used to read a descriptor from the to use to the kernel, so the kernel knows what XSAVE descriptor table and copy it into the segment register’s bitmap to use when context-switching. When receiving hidden descriptor. the system call, the kernel sets the XCR0 register to the Input Value feature bitmap declared by the application. The CPU Index Ring GDTR generates a fault if application software attempts to use + Base Limit features that are not enabled by XCR0, so applications Descriptor Table cannot modify feature-specific registers that the kernel Base Limit Type wouldn’t take into account when context-switching. The Index Ring Base Limit Type kernel can use the CPUID instruction to learn the size of ⋮ Register Selector the XSAVE memory area for a given feature bitmap, and Base Limit Type compute how much memory it needs to allocate for the ⋮ context of each of the application’s threads. Base Limit Type 2.7 Segment Registers Base Limit Type Register Descriptor The Intel 64-bit architecture gained widespread adoption thanks to its ability to run software targeting the older 32- Figure 17: Loading a segment register. The 16-bit value loaded by bit architecture side-by-side with 64-bit software [167]. software is a selector consisting of an index and a ring number. The This ability comes at the cost of some warts. While most index selects a GDT entry, which is loaded into the descriptor part of of these warts can be ignored while reasoning about the the segment register. security of 64-bit software, the segment registers and In 64-bit mode, all segment limits are ignored. The vestigial segmentation model must be understood. base addresses in most segment registers (CS, DS, ES, The semantics of the Intel architecture’s instructions SS) are ignored. The base addresses in FS and GS are include the implicit use of a few segments which are used, in order to support thread-local storage. Figure 18 loaded into the processor’s segment registers shown in outlines the address computation in this case. The in- Figure 16. Code fetches use the code segment (CS). 
struction’s address, named logical address in the Intel Instructions that reference the stack implicitly use the documentation, is added to the base address in the seg- stack segment (SS). Memory references implicitly use the ment register’s descriptor, yielding the virtual address, data segment (DS) or the destination segment (ES). Via also named linear address. The virtual address is then segment override prefixes, instructions can be modified translated (§ 2.5) to a physical address. to use the unnamed segments FS and GS for memory Outside the special case of using FS or GS to refer- references. ence thread-local storage, the logical and virtual (linear) Modern operating systems effectively disable segmen- addresses match. Therefore, most of the time, we can get tation by covering the entire addressable space with one away with completely ignoring segmentation. In these 12

13. RSI GPRs privilege level switching (§ 2.8.2). Modern operating systems do not allow application + Linear Address Address Physical software any direct access to the I/O address space, so the (Virtual Address) Translation Address kernel sets up a single TSS that is loaded into TR during early initialization, and used to represent all applications Base Limit Type running under the OS. FS Register Descriptor 2.8 Privilege Level Switching Figure 18: Example address computation process for MOV Any architecture that has software privilege levels must FS:[RDX], 0. The segment’s base address is added to the ad- dress in RDX before address translation (§ 2.5) takes place. provide a method for less privileged software to invoke the services of more privileged software. For example, cases, we use the term “virtual address” to refer to both application software needs the OS kernel’s assistance to the virtual and the linear address. perform network or disk I/O, as that requires access to Even though CS is not used for segmentation, 64-bit privileged memory or to the I/O address space. system software needs to load a valid selector into it. The At the same time, less privileged software cannot be CPU uses the ring number in the CS selector to track the offered the ability to jump arbitrarily into more privileged current privilege level, and uses one of the type bits to code, as that would compromise the privileged software’s know whether it’s running 64-bit code, or 32-bit code in ability to enforce security and isolation invariants. In our compatibility mode. example, when an application wishes to write a file to the The DS and ES segment registers are completely ig- disk, the kernel must check if the application’s user has nored, and can have null selectors loaded in them. The access to that file. If the ring 3 code could perform an CPU loads a null selector in SS when switching privilege arbitrary jump in kernel space, it would be able to skip levels, discussed in § 2.8.2. the access check. Modern kernels only use one descriptor table, the For these reasons, the Intel architecture includes Global Descriptor Table (GDT), whose virtual address privilege-switching mechanisms used to transfer control is stored in the GDTR register. Table 2 shows a typical from less privileged software to well-defined entry points GDT layout that can be used by 64-bit kernels to run in more privileged software. As suggested above, an ar- both 32-bit and 64-bit applications. chitecture’s privilege-switching mechanisms have deep implications for the security properties of its software. Descriptor Selector Furthermore, securely executing the software inside a Null (must be unused) 0 protected container requires the same security considera- Kernel code 0x08 (index 1, ring 0) tions as privilege level switching. Kernel data 0x10 (index 2, ring 0) Due to historical factors, the Intel architecture has a User code 0x1B (index 3, ring 3) vast number of execution modes, and an intimidating User data 0x1F (index 4, ring 3) amount of transitions between them. We focus on the TSS 0x20 (index 5, ring 0) privilege level switching mechanisms used by modern 64-bit software, summarized in Figure 19. Table 2: A typical GDT layout in the 64-bit Intel Architecture. The last entry in Table 2 is a descriptor for the Task VM exit State Segment (TSS), which was designed to implement VMEXIT SYSCALL VMFUNC Fault hardware context switching, named task switching in Interrupt VMX VM the Intel documentation. 
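A user-level consequence of FS being kept for thread-local storage: a __thread variable in C is addressed relative to the FS base, so the compiler emits an FS-prefixed memory operand much like the MOV FS:[RDX] example above. The snippet assumes x86-64 Linux conventions.

    #include <pthread.h>
    #include <stdio.h>

    /* Each thread gets its own copy; on x86-64 Linux the compiler addresses it
     * relative to the FS segment base (an FS-prefixed memory operand). */
    static __thread int counter;

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 1000; i++)
            counter++;                    /* no locking needed: per-thread storage */
        printf("this thread's counter = %d\n", counter);
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, worker, NULL);
        pthread_create(&b, NULL, worker, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
    }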
The descriptor is stored in the Ring 0 Ring 3 Root exit IRET Task Register (TR), which behaves like the other segment VMLAUNCH registers described above. VMRESUME SYSRET Task switching was removed from the 64-bit architec- Figure 19: Modern privilege switching methods in the 64-bit Intel ture, but the TR segment register was preserved, and it architecture. points to a repurposed TSS data structure. The 64-bit TSS contains an I/O map, which indicates what parts of 2.8.1 System Calls the I/O address space can be accessed directly from ring On modern processors, application software uses the 3, and the Interrupt Stack Table (IST), which is used for SYSCALL instruction to invoke ring 0 code, and the ker- 13

14.nel uses SYSRET to switch the privilege level back to Field Bits ring 3. SYSCALL jumps into a predefined kernel loca- Handler RIP 64 tion, which is specified by writing to a pair of architec- Handler CS 16 tural MSRs (§ 2.4). Interrupt Stack Table (IST) index 3 All MSRs can only be read or written by ring 0 code. Table 3: The essential fields of an IDT entry in 64-bit mode. Each This is a crucial security property, because it entails that entry points to a hardware exception or interrupt handler. application software cannot modify SYSCALL’s MSRs. If that was the case, a rogue application could abuse the Interrupt Stack Table (IST), which is an array of 8 stack SYSCALL instruction to execute arbitrary kernel code, pointers stored in the TSS described in § 2.7. potentially bypassing security checks. When a hardware exception occurs, the execution state The SYSRET instruction switches the current privilege may be corrupted, and the current stack cannot be relied level from ring 0 back to ring 3, and jumps to the address on. Therefore, the CPU first uses the handler’s IDT entry in RCX, which is set by the SYSCALL instruction. The to set up a known good stack. SS is loaded with a null SYSCALL / SYSRET pair does not perform any memory descriptor, and RSP is set to the IST value to which the access, so it out-performs the Intel architecture’s previous IDT entry points. After switching to a reliable stack, privilege switching mechanisms, which saved state on the CPU pushes the snapshot in Table 4 on the stack, a stack. The design can get away without referencing a then loads the IDT entry’s values into the CS and RIP stack because kernel calls are not recursive. registers, which trigger the execution of the exception handler. 2.8.2 Faults Field Bits The processor also performs a switch from ring 3 to Exception SS 64 ring 0 when a hardware exception occurs while execut- Exception RSP 64 ing application code. Some exceptions indicate bugs in RFLAGS 64 the application, whereas other exceptions require kernel Exception CS 64 action. Exception RIP 64 A general protection fault (#GP) occurs when software Exception code 64 attempts to perform a disallowed action, such as setting the CR3 register from ring 3. Table 4: The snapshot pushed on the handler’s stack when a hard- ware exception occurs. IRET restores registers from this snapshot. A page fault (#PF) occurs when address translation encounters a page table entry whose P flag is 0, or when After the exception handler completes, it uses the the memory inside a page is accessed in way that is IRET (interrupt return) instruction to load the registers inconsistent with the access bits in the page table entry. from the on-stack snapshot and switch back to ring 3. For example, when ring 3 software accesses the memory The Intel architecture gives the fault handler complete inside a page whose S bit is set, the result of the memory control over the execution context of the software that in- access is #PF. curred the fault. This privilege is necessary for handlers When a hardware exception occurs in application code, (e.g., #GP) that must perform context switches (§ 2.6) the CPU performs a ring switch, and calls the correspond- as a consequence of terminating a thread that encoun- ing exception handler. For example, the #GP handler tered a bug. 
It follows that all fault handlers must be trusted to not leak or tamper with the information in an application's execution context.

The #GP handler typically terminates the application's process, while the #PF handler reads the swapped out page back into RAM and resumes the application's execution.

The exception handlers are a part of the OS kernel, and their locations are specified in the first 32 entries of the Interrupt Descriptor Table (IDT), whose structure is shown in Table 3. The IDT's physical address is stored in the IDTR register, which can only be accessed by ring 0 code. Kernels protect the IDT memory using page tables, so that ring 3 software cannot access it. Each IDT entry has a 3-bit index pointing into the Interrupt Stack Table (IST).

2.8.3 VMX Privilege Level Switching

Intel systems that take advantage of the hardware virtualization support to run multiple operating systems at the same time use a hypervisor that manages the VMs. The hypervisor creates a Virtual Machine Control Structure (VMCS) for each operating system instance that it wishes to run, and uses the VMENTER instruction to assign a logical processor to the VM.
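Whether a processor advertises this hardware virtualization support can be checked from ring 3 using the CPUID instruction, which reports VMX in bit 5 of ECX for leaf 1. The sketch below is merely illustrative; it assumes a GCC or Clang toolchain on x86-64 and uses the compiler-provided <cpuid.h> helper, and a set feature bit does not guarantee that the firmware has left VMX enabled.

#include <cpuid.h>   // GCC/Clang helper for the CPUID instruction
#include <iostream>

// Check whether the processor reports VMX (the hardware virtualization
// support used by hypervisors): bit 5 of ECX for CPUID leaf 1.
int main() {
  unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
  if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
    std::cout << "CPUID leaf 1 is not supported\n";
    return 1;
  }
  bool has_vmx = (ecx >> 5) & 1;
  std::cout << "VMX reported: " << (has_vmx ? "yes" : "no") << "\n";
  return 0;
}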

15. When a logical processor encounters a fault that must DRAM DRAM DRAM DRAM FLASH be handled by the hypervisor, the logical processor per- UEFI forms a VM exit. For example, if the address translation ME FW CPU CPU CPU CPU process encounters an EPT entry with the P flag set to 0, SPI the CPU performs a VM exit, and the hypervisor has an USB SATA opportunity to bring the page into RAM. PCH The VMCS shows a great application of the encapsula- CPU CPU CPU CPU ME tion principle [128], which is generally used in high-level software, to computer architecture. The Intel architecture DRAM DRAM DRAM DRAM NIC / PHY specifies that each VMCS resides in DRAM and is 4 KB in size. However, the architecture does not specify the QPI DDR PCIe DMI VMCS format, and instead requires the hypervisor to interact with the VMCS via CPU instructions such as Figure 20: The motherboard structures that are most relevant in a system security analysis. VMREAD and VMWRITE. This approach allows Intel to add VMX features that described in § 2.3, namely the SMM code, the hypervisor, require VMCS format changes, without the burden of operating systems, and application processes. The com- having to maintain backwards compatibility. This is no puter’s main memory is provided by Dynamic Random- small feat, given that huge amounts of complexity in the Access Memory (DRAM) chips. Intel architecture were introduced due to compatibility The Platform Controller Hub (PCH) houses (rela- requirements. tively) low-speed I/O controllers driving the slower buses 2.9 A Computer Map in the system, like SATA, used by storage devices, and This section outlines the hardware components that make USB, used by input peripherals. The PCH is also known up a computer system based on the Intel architecture. as the chipset. At a first approximation, the south bridge § 2.9.1 summarizes the structure of a motherboard. term in older documentation can also be considered as a This is necessary background for reasoning about the synonym for PCH. cost and impact of physical attacks against a computing Motherboards also have a non-volatile (flash) mem- system. § 2.9.2 describes Intel’s Management Engine, ory chip that hosts firmware which implements the Uni- which plays a role in the computer’s bootstrap process, fied Extensible Firmware Interface (UEFI) specifica- and has significant security implications. tion [178]. The firmware contains the boot code and § 2.9.3 presents the building blocks of an Intel proces- the code that executes in System Management Mode sor, and § 2.9.4 models an Intel execution core at a high (SMM, § 2.3). level. This is the foundation for implementing defenses The components we care about are connected by the against physical attacks. Perhaps more importantly, rea- following buses: the Quick-Path Interconnect (QPI [90]), soning about software attacks based on information leak- a network of point-to-point links that connect processors, age, such as timing attacks, requires understanding how the double data rate (DDR) bus that connects a CPU a processor’s computing resources are shared and parti- to DRAM, the Direct Media Interface (DMI) bus that tioned between mutually distrusting parties. connects a CPU to the PCH, the Peripheral Component The information in here is either contained in the SDM Interconnect Express (PCIe) bus that connects a CPU to or in Intel’s Optimization Reference Manual [95]. 
peripherals such as a Network Interface Card (NIC), and the Serial Programming Interface (SPI) used by the PCH 2.9.1 The Motherboard to communicate with the flash memory. A computer’s components are connected by a printed The PCIe bus is an extended, point-to-point version circuit board called a motherboard, shown in Figure 20, of the PCI standard, which provides a method for any which consists of sockets connected by buses. Sockets peripheral connected to the bus to perform Direct Mem- connect chip-carrying packages to the board. The Intel ory Access (DMA), transferring data to and from DRAM documentation uses the term “package” to specifically without involving an execution core and spending CPU refer to a CPU. cycles. The PCI standard includes a configuration mech- The CPU (described in § 2.9.3) hosts the execution anism that assigns a range of DRAM to each peripheral, cores that run the software stack shown in Figure 8 and but makes no provisions for restricting a peripheral’s 15

16.DRAM accesses to its assigned range. Intel PCH Network interfaces consist of a physical (PHY) mod- Intel ME Interrupt Watchdog ule that converts the analog signals on the network me- Controller Timer dia to and from digital bits, and a Media Access Con- trol (MAC) module that implements a network-level pro- Crypto I-Cache Execution Boot SPI Accelerator D-Cache Core ROM Controller tocol. Modern Intel-based motherboards forego a full- fledged NIC, and instead include an Ethernet [83] PHY Internal Bus module. SMBus HECI DRAM DMA Internal Controller Controller Access Engine SRAM 2.9.2 The Intel Management Engine (ME) Intel’s Management Engine (ME) is an embedded com- Ethernet PCIe Audio USB Integrated puter that was initially designed for remote system man- MAC Controller Controller Controller Sensor Hub agement and troubleshooting of server-class systems that are often hosted in data centers. However, all of Intel’s Ethernet PCIe Audio, MIC USB I2C SPI recent PCHs contain an ME [79], and it currently plays a PHY lanes Bluetooth PHY UART Bus crucial role in platform bootstrapping, which is described Figure 21: The Intel Management Engine (ME) is an embedded in detail in § 2.13. Most of the information in this section computer hosted in the PCH. The ME has its own execution core, is obtained from an Intel-sponsored book [160]. ROM and SRAM. The ME can access the host’s DRAM via a memory The ME is part of Intel’s Active Management Tech- controller and a DMA controller. The ME is remotely accessible over the network, as it has direct access to an Ethernet PHY via the nology (AMT), which is marketed as a convenient way SMBus. for IT administrators to troubleshoot and fix situations such as failing hardware, or a corrupted OS installation, and permanently disabled as long as it has power and without having to gain physical access to the impacted signal. computer. As the ME remains active in deep power-saving modes, The Intel ME, shown in Figure 21, remains functional its design must rely on low-power components. The ex- during most hardware failures because it is an entire ecution core is an Argonaut RISC Core (ARC) clocked embedded computer featuring its own execution core, at 200-400MHz, which is typically used in low-power bootstrap ROM, and internal RAM. The ME can be used embedded designs. On a very recent PCH [99], the inter- for troubleshooting effectively thanks to an array of abil- nal SRAM has 640KB, and is shared with the Integrated ities that include overriding the CPU’s boot vector and a Sensor Hub (ISH)’s core. The SMBus runs at 1MHz and, DMA engine that can access the computer’s DRAM. The without CPU support, the motherboard’s Ethernet PHY ME provides remote access to the computer without any runs at 10Mpbs. CPU support because it can use the System Management When the host computer is powered on, the ME’s exe- bus (SMBus) to access the motherboard’s Ethernet PHY cution core starts running code from the ME’s bootstrap or an AMT-compatible NIC [99]. ROM. The bootstrap code loads the ME’s software stack The Intel ME is connected to the motherboard’s power from the same flash chip that stores the host computer’s supply using a power rail that stays active even when the firmware. The ME accesses the flash memory chip an host computer is in the Soft Off mode [99], known as embedded SPI controller. ACPI G2/S5, where most of the computer’s components are powered off [86], including the CPU and DRAM. 
2.9.3 The Processor Die For all practical purposes, this means that the ME’s exe- An Intel processor’s die, illustrated in Figure 22, is di- cution core is active as long as the power supply is still vided into two broad areas: the core area implements the connected to a power source. instruction execution pipeline typically associated with In S5, the ME cannot access the DRAM, but it can CPUs, while the uncore provides functions that were still use its own internal memories. The ME can also still traditionally hosted on separate chips, but are currently communicate with a remote party, as it can access the integrated on the CPU die to reduce latency and power motherboard’s Ethernet PHY via SMBus. This enables consumption. applications such as AMT’s theft prevention, where a At a conceptual level, the uncore of modern proces- laptop equipped with a cellular modem can be tracked sors includes an integrated memory controller (iMC) that 16

17. NIC Platform Controller Hub Logical CPU Logical CPU L1 L1 Registers Registers Fetch PCI-X DMI I-Cache I-TLB LAPIC LAPIC Chip Package Decode Microcode IOAPIC I/O Controller Core Core Graphics Instruction Scheduler L2 L2 CPU I/O to Ring Unit Cache TLB Config L3 Cache Power L1 L1 QPI Router Home Agent INT INT INT MEM Unit D-Cache D-TLB QPI Memory FP FP SSE SSE Core Core Page Miss Handler (PMH) Packetizer Controller Execution Units QPI DDR3 Figure 23: CPU core with two logical processors. Each logical CPU DRAM processor has its own execution context and LAPIC (§ 2.12). All the other core resources are shared. Figure 22: The major components in a modern CPU package. § 2.9.3 gives an uncore overview. § 2.9.4 describes execution cores. A hyper-threaded core is exposed to system software § 2.11.3 takes a deeper look at the uncore. as two logical processors (LPs), also named hardware threads in the Intel documentation. The logical proces- interfaces with the DDR bus, an integrated I/O controller sor abstraction allows the code used to distribute work (IIO) that implements PCIe bus lanes and interacts with across processors in a multi-processor system to func- the DMI bus, and a growing number of integrated pe- tion without any change on multi-core hyper-threaded ripherals, such as a Graphics Processing Unit (GPU). processors. The uncore structure is described in some processor fam- The high level of resource sharing introduced by ily datasheets [96, 97], and in the overview sections in hyper-threading introduces a security vulnerability. Soft- Intel’s uncore performance monitoring documentation ware running on one logical processor can use the high- [37, 89, 93]. resolution performance counter (RDTSCP, § 2.4) [150] Security extensions to the Intel architecture, such as to get information about the instructions and memory ac- Trusted Execution Technology (TXT) [70] and Software cess patterns of another piece of software that is executed Guard Extensions (SGX) [14, 137], rely on the fact that on the other logical processor on the same core. the processor die includes the memory and I/O controller, That being said, the biggest downside of hyper- and thus can prevent any device from accessing protected threading might be the fact that writing about Intel pro- memory areas via Direct Memory Access (DMA) trans- cessors in a rigorous manner requires the use of the cum- fers. § 2.11.3 takes a deeper look at the uncore organiza- bersome term Logical Processor instead of the shorter tion and at the machinery used to prevent unauthorized and more intuitive “CPU core”, which can often be ab- DMA transfers. breviated to “core”. 2.9.4 The Core 2.10 Out-of-Order and Speculative Execution Virtually all modern Intel processors have core areas con- CPU cores can execute instructions orders of magni- sisting of multiple copies of the execution core circuitry, tude faster than DRAM can read data. Computer archi- each of which is called a core. At the time of this writing, tects attempt to bridge this gap by using hyper-threading desktop-class Intel CPUs have 4 cores, and server-class (§ 2.9.3), out-of-order and speculative execution, and CPUs have as many as 18 cores. caching, which is described in § 2.11. 
Most Intel CPUs feature hyper-threading, which means that a core (shown in Figure 23) has two copies of the register files backing the execution context described in § 2.6, and can execute two separate streams of instructions simultaneously. Hyper-threading reduces the impact of memory stalls on the utilization of the fetch, decode and execution units.

In CPUs that use out-of-order execution, the order in which the CPU carries out a program's instructions (execution order) is not necessarily the same as the order in which the instructions would be executed by a sequential evaluation system (program order).

An analysis of a system's information leakage must take out-of-order execution into consideration. Any CPU

18.actions observed by an attacker match the execution The Intel architecture defines a complex instruction order, so the attacker may learn some information by set (CISC). However, virtually all modern CPUs are ar- comparing the observed execution order with a known chitected following reduced instruction set (RISC) prin- program order. At the same time, attacks that try to infer ciples. This is accomplished by having the instruction a victim’s program order based on actions taken by the decode stages break down each instruction into micro- CPU must account for out-of-order execution as a source ops, which resemble RISC instructions. The other stages of noise. of the execution pipeline work exclusively with micro- This section summarizes the out-of-order and specu- ops. lative execution concepts used when reasoning about a system’s security properties. [148] and [75] cover the 2.10.1 Out-of-Order Execution concepts in great depth, while Intel’s optimization man- Different types of instructions require different logic ual [95] provides details specific to Intel CPUs. circuits, called functional units. For example, the arith- Figure 24 provides a more detailed view of the CPU metic logic unit (ALU), which performs arithmetic op- core components involved in out-of-order execution, and erations, is completely different from the load and store omits some less relevant details from Figure 23. unit, which performs memory operations. Different cir- cuits can be used at the same time, so each CPU core can Branch Instruction L1 I-Cache execute multiple micro-ops in parallel. Predictors Fetch Unit L1 I-TLB The core’s out-of-order engine receives decoded Pre-Decode Fetch Buffer micro-ops, identifies the micro-ops that can execute in Instruction Queue parallel, assigns them to functional units, and combines Microcode Complex Simple the outputs of the units so that the results are equiva- ROM Decoder Decoders lent to having the micro-ops executed sequentially in the Micro-op Micro-op Decode Queue order in which they come from the decode stages. Cache Instruction Decode For example, consider the sequence of pseudo micro- Out of Order Engine ops4 in Table 5 below. The OR uses the result of the Register Reorder Load Store Files Buffer Buffer Buffer LOAD, but the ADD does not. Therefore, a good scheduler can have the load store unit execute the LOAD and the Renamer ALU execute the ADD, all in the same clock cycle. Scheduler Reservation Station # Micro-op Meaning Port 0 Port 1 Ports 2, 3 Port 4 Port 5 Port 6 Port 7 1 LOAD RAX, RSI RAX ← DRAM[RSI] Integer ALU Shift Integer ALU LEA Load & Store Store Data Integer ALU LEA Integer ALU Shift Store Address 2 OR RDI, RDI, RAX RDI ← RDI ∨ RAX FMA FMA Address Vector Branch 3 ADD RSI, RSI, RCX RSI ← RSI + RCX FP Multiply FP Multiply Shuffle 4 SUB RBX, RSI, RDX RBX ← RSI - RDX Integer Integer Integer Vector Vector Vector Table 5: Pseudo micro-ops for the out-of-order execution example. Multiply ALU ALU Vector Logicals Vector Logicals Vector Logicals The out-of-order engine in recent Intel CPUs works Branch FP Addition roughly as follows. Micro-ops received from the decode Divide queue are written into a reorder buffer (ROB) while they Vector Shift are in-flight in the execution unit. The register allocation Execution table (RAT) matches each register with the last reorder Memory Control buffer entry that updates it. 
The renamer uses the RAT to rewrite the source and destination fields of micro-ops L1 D-Cache Fill Buffers L2 D-Cache when they are written in the ROB, as illustrated in Tables L1 D-TLB Memory 6 and 7. Note that the ROB representation makes it easy to determine the dependencies between micro-ops. Figure 24: The structures in a CPU core that are relevant to out- of-order and speculative execution. Instructions are decoded into 4 The set of micro-ops used by Intel CPUs is not publicly docu- micro-ops, which are scheduled on one of the execution unit’s ports. mented. The fictional examples in this section suffice for illustration The branch predictor enables speculative execution when a branch is purposes. encountered. 18
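To make the renaming step concrete, the toy program below walks the pseudo micro-ops of Table 5, allocates a reorder buffer entry for each one, and uses a register allocation table to rewrite any source produced by an in-flight micro-op into a ROB reference, reproducing the contents of Tables 6 and 7. The data structures and names are ours; the real RAT and ROB formats are not publicly documented.

#include <cstddef>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Toy model of the register allocation table (RAT) and reorder buffer (ROB),
// following the fictional micro-ops of Table 5.
struct MicroOp {
  std::string op;
  std::vector<std::string> sources;
  std::string destination;
};

int main() {
  std::vector<MicroOp> ops = {
      {"LOAD", {"RSI"}, "RAX"},         // RAX <- DRAM[RSI]
      {"OR",   {"RDI", "RAX"}, "RDI"},  // RDI <- RDI | RAX
      {"ADD",  {"RSI", "RCX"}, "RSI"},  // RSI <- RSI + RCX
      {"SUB",  {"RSI", "RDX"}, "RBX"},  // RBX <- RSI - RDX
  };

  std::map<std::string, int> rat;  // register name -> ROB entry that last wrote it

  for (std::size_t rob_entry = 0; rob_entry < ops.size(); ++rob_entry) {
    const MicroOp& op = ops[rob_entry];
    std::cout << "ROB #" << rob_entry + 1 << ": " << op.op;
    for (const std::string& src : op.sources) {
      // A source produced by an in-flight micro-op is renamed to a ROB reference.
      auto it = rat.find(src);
      if (it != rat.end()) {
        std::cout << " ROB#" << it->second + 1;
      } else {
        std::cout << " " << src;
      }
    }
    std::cout << " -> " << op.destination << "\n";
    rat[op.destination] = static_cast<int>(rob_entry);  // update the RAT
  }
}

Running this prints one line per ROB entry; the OR's second source becomes ROB#1 and the SUB's first source becomes ROB#3, matching Table 6, and the final RAT contents match Table 7.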

19. # Op Source 1 Source 2 Destination The most well-known branching instructions in the Intel 1 LOAD RSI ∅ RAX architecture are in the jcc family, such as je (jump if 2 OR RDI ROB #1 RSI equal). 3 ADD RSI RCX RSI Branches pose a challenge to the decode stage, because 4 SUB ROB # 3 RDX RBX the instruction that should be fetched after a branch is Table 6: Data written by the renamer into the reorder buffer (ROB), not known until the branching condition is evaluated. In for the micro-ops in Table 5. order to avoid stalling the decode stage, modern CPU Register RAX RBX RCX RDX RSI RDI designs include branch predictors that use historical in- ROB # #1 #4 ∅ ∅ #3 #2 formation to guess whether a branch will be taken or not. Table 7: Relevant entries of the register allocation table after the When the decode stage encounters a branch instruc- micro-ops in Table 5 are inserted into the ROB. tion, it asks the branch predictor for a guess as to whether The scheduler decides which micro-ops in the ROB the branch will be taken or not. The decode stage bun- get executed, and places them in the reservation station. dles the branch condition and the predictor’s guess into The reservation station has one port for each functional a branch check micro-op, and then continues decoding unit that can execute micro-ops independently. Each on the path indicated by the predictor. The micro-ops reservation station port port holds one micro-op from following the branch check are marked as speculative. the ROB. The reservation station port waits until the When the branch check micro-op is executed, the micro-op’s dependencies are satisfied and forwards the branch unit checks whether the branch predictor’s guess micro-op to the functional unit. When the functional unit was correct. If that is the case, the branch check is retired completes executing the micro-op, its result is written successfully. The scheduler handles mispredictions by back to the ROB, and forwarded to any other reservation squashing all the micro-ops following the branch check, station port that depends on it. and by signaling the instruction decoder to flush the The ROB stores the results of completed micro-ops un- micro-op decode queue and start fetching the instruc- til they are retired, meaning that the results are committed tions that follow the correct branch. to the register file and the micro-ops are removed from Modern CPUs also attempt to predict memory read pat- the ROB. Although micro-ops can be executed out-of- terns, so they can prefetch the memory locations that are order, they must be retired in program order, in order to about to be read into the cache. Prefetching minimizes handle exceptions correctly. When a micro-op causes a the latency of successfully predicted read operations, as hardware exception (§ 2.8.2), all the following micro-ops their data will already be cached. This is accomplished in the ROB are squashed, and their results are discarded. by exposing circuits called prefetchers to memory ac- In the example above, the ADD can complete before cesses and cache misses. Each prefetcher can recognize the LOAD, because it does not require a memory access. a particular access pattern, such as sequentially read- However, the ADD’s result cannot be committed before ing an array’s elements. When memory accesses match LOAD completes. 
Otherwise, if the ADD is committed and the LOAD causes a page fault, software will observe an incorrect value for the RSI register.

The ROB is tailored for discovering register dependencies between micro-ops. However, micro-ops that execute out-of-order can also have memory dependencies. For this reason, out-of-order engines have a load buffer and a store buffer that keep track of in-flight memory operations and are used to resolve memory dependencies.

2.10.2 Speculative Execution

Branch instructions, also called branches, change the instruction pointer (RIP, § 2.6), if a condition is met (the branch is taken). They implement conditional statements (if) and looping statements, such as while and for.

When memory accesses match the pattern that a prefetcher was built to recognize, the prefetcher loads the cache line corresponding to the next memory access in its pattern.

2.11 Cache Memories

At the time of this writing, CPU cores can process data ≈ 200× faster than DRAM can supply it. This gap is bridged by a hierarchy of cache memories, which are orders of magnitude smaller and an order of magnitude faster than DRAM. While caching is transparent to application software, the system software is responsible for managing and coordinating the caches that store address translation (§ 2.5) results.

Caches impact the security of a software system in two ways. First, the Intel architecture relies on system

20.software to manage address translation caches, which Look for a cache Cache line storing A Lookup becomes an issue in a threat model where the system soft- ware is untrusted. Second, caches in the Intel architecture are shared by all the software running on the computer. YES NO Look for a free cache Found? hit miss line that can store A This opens up the way for cache timing attacks, an entire class of software attacks that rely on observing the time differences between accessing a cached memory location YES Found? and an uncached memory location. NO This section summarizes the caching concepts and im- plementation details needed to reason about both classes Cache Eviction Choose a cache line of security problems mentioned above. [168], [148] and that can store A [75] provide a good background on low-level cache im- plementation concepts. § 3.8 describes cache timing NO Is the line dirty? attacks. YES 2.11.1 Caching Principles Mark the line Write the cache line At a high level, caches exploit the high locality in the available to the next level memory access patterns of most applications to hide the main memory’s (relatively) high latency. By caching Cache (storing a copy of) the most recently accessed code and Get A from the Store the data at A Fill next memory level in the free line data, these relatively small memories can be used to satisfy 90%-99% of an application’s memory accesses. In an Intel processor, the first-level (L1) cache consists Return data of a separate data cache (D-cache) and an instruction associated with A cache (I-cache). The instruction fetch and decode stage Figure 25: The steps taken by a cache memory to resolve an access is directly connected to the L1 I-cache, and uses it to read to a memory address A. A normal memory access (to cacheable the streams of instructions for the core’s logical proces- DRAM) always triggers a cache lookup. If the access misses the sors. Micro-ops that read from or write to memory are cache, a fill is required, and a write-back might be required. executed by the memory unit (MEM in Figure 23), which is connected to the L1 D-cache and forwards memory the L3 cache is in the CPU’s uncore (see Figure 22), and accesses to it. is shared by all the cores in the package. Figure 25 illustrates the steps taken by a cache when it The numbers in Table 8 suggest that cache placement receives a memory access. First, a cache lookup uses the can have a large impact on an application’s execution memory address to determine if the corresponding data time. Because of this, the Intel architecture includes exists in the cache. A cache hit occurs when the address an assortment of instructions that give performance- is found, and the cache can resolve the memory access sensitive applications some control over the caching quickly. Conversely, if the address is not found, a cache of their working sets. PREFETCH instructs the CPU’s miss occurs, and a cache fill is required to resolve the prefetcher to cache a specific memory address, in prepa- memory access. When doing a fill, the cache forwards ration for a future memory access. The memory writes the memory access to the next level of the memory hierar- performed by the MOVNT instruction family bypass the chy and caches the response. Under most circumstances, cache if a fill would be required. CLFLUSH evicts any a cache fill also triggers a cache eviction, in which some cache lines storing a specific address from the entire data is removed from the cache to make room for the cache hierarchy. 
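The cache timing attacks mentioned above reduce to measuring the latency gap between a hit and a miss, and the instructions just described are enough to observe it. The sketch below is a minimal illustration, assuming a GCC or Clang toolchain on x86-64: it times a read of a cached variable, evicts the variable with CLFLUSH, and times the read again.

#include <cstdint>
#include <iostream>
#include <x86intrin.h>  // _mm_clflush and __rdtscp (GCC/Clang, x86-64)

// Return how many time-stamp counter ticks one read of *p takes.
static std::uint64_t timed_read(volatile int* p) {
  unsigned int aux;
  std::uint64_t start = __rdtscp(&aux);  // timestamp before the access
  int value = *p;                        // the memory access being timed
  (void)value;
  std::uint64_t end = __rdtscp(&aux);    // timestamp after the access
  return end - start;
}

int main() {
  static volatile int victim = 42;

  int warm = victim;                     // touch the variable so it is cached
  (void)warm;
  std::uint64_t hit = timed_read(&victim);

  _mm_clflush(const_cast<int*>(&victim));  // evict it from the entire cache hierarchy
  std::uint64_t miss = timed_read(&victim);

  std::cout << "cached read:  " << hit << " ticks\n";
  std::cout << "flushed read: " << miss << " ticks\n";  // typically much larger
}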
data coming from the fill. If the data that is evicted has The methods mentioned above are available to soft- been modified since it was loaded in the cache, it must be ware running at all privilege levels, because they were de- written back to the next level of the memory hierarchy. signed for high-performance workloads with large work- Table 8 shows the key characteristics of the memory ing sets, which are usually executed at ring 3 (§ 2.3). For hierarchy implemented by modern Intel CPUs. Each comparison, the instructions used by system software core has its own L1 and L2 cache (see Figure 23), while to manage the address translation caches, described in 20

21. Memory Size Access Time Memory Address Core Registers 1 KB no latency Address Tag Set Index Line Offset n-1…s+l s+l-1…l l-1…0 L1 D-Cache 32 KB 4 cycles L2 Cache 256 KB 10 cycles L3 Cache 8 MB 40-75 cycles Set 0, Way 0 Set 0, Way 1 … Set 0, Way W-1 DRAM 16 GB 60 ns Set 1, Way 0 Set 1, Way 1 … Set 1, Way W-1 Table 8: Approximate sizes and access times for each level in the ⋮ ⋮ ⋱ ⋮ memory hierarchy of an Intel processor, from [125]. Memory sizes Set i, Way 0 Set i, Way 1 … Set i, Way W-1 and access times differ by orders of magnitude across the different ⋮ ⋮ ⋱ ⋮ levels of the hierarchy. This table does not cover multi-processor Set S-1, Way 0 Set S-1, Way 1 … Set S-1, Way W-1 systems. § 2.11.5 below, can only be executed at ring 0. Way 0 Way 1 … Way W-1 Tag Line Tag Line Tag Line 2.11.2 Cache Organization In the Intel architecture, caches are completely imple- mented in hardware, meaning that the software stack has Tag Comparator no direct control over the eviction process. However, software can gain some control over which data gets Matched Line evicted by understanding how the caches are organized, and by cleverly placing its data in memory. Match? Matched Word The cache line is the atomic unit of cache organization. Figure 26: Cache organization and lookup, for a W -way set- A cache line has data, a copy of a continuous range of associative cache with 2l -byte lines and S = 2s sets. The cache DRAM, and a tag, identifying the memory address that works with n-bit memory addresses. The lowest l address bits point the data comes from. Fills and evictions operate on entire to a specific byte in a cache line, the next s bytes index the set, and lines. the highest n − s − l bits are used to decide if the desired address is in one of the W lines in the indexed set. The cache line size is the size of the data, and is always a power of two. Assuming n-bit memory addresses and a 2.11.3 Cache Coherence cache line size of 2l bytes, the lowest l bits of a memory address are an offset into a cache line, and the highest The Intel architecture was designed to support applica- n − l bits determine the cache line that is used to store tion software that was not written with caches in mind. the data at the memory location. All recent processors One aspect of this support is the Total Store Order (TSO) have 64-byte cache lines. [145] memory model, which promises that all the logical The L1 and L2 caches in recent processors are multi- processors in a computer see the same order of DRAM way set-associative with direct set indexing, as shown writes. in Figure 26. A W -way set-associative cache has its The same memory location might be simultaneously memory divided into sets, where each set has W lines. A cached by different cores’ caches, or even by caches on memory location can be cached in any of the w lines in a separate chips, so providing the TSO guarantees requires specific set that is determined by the highest n − l bits a cache coherence protocol that synchronizes all the of the location’s memory address. Direct set indexing cache lines in a computer that reference the same memory means that the S sets in a cache are numbered from 0 to address. S − 1, and the memory location at address A is cached The cache coherence mechanism is not visible to in the set numbered An−1...n−l mod S. software, so it is only briefly mentioned in the SDM. 
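The address breakdown used by this lookup can be reproduced with a few lines of integer arithmetic. The helper below follows the notation used here (l line-offset bits, s set-index bits) and is only an illustration of direct set indexing; the function and field names are ours.

#include <cstdint>
#include <iostream>

// Split a memory address into the fields used by a set-associative cache
// with 2^l-byte lines and 2^s sets (direct set indexing, as in Figure 26).
struct CacheAddress {
  std::uint64_t line_offset;  // lowest l bits: byte within the cache line
  std::uint64_t set_index;    // next s bits: which set the line must go into
  std::uint64_t tag;          // remaining high bits: identifies the line within the set
};

constexpr CacheAddress split_address(std::uint64_t address, unsigned l, unsigned s) {
  return CacheAddress{
      address & ((std::uint64_t{1} << l) - 1),
      (address >> l) & ((std::uint64_t{1} << s) - 1),
      address >> (l + s),
  };
}

int main() {
  // Example: a 32 KB, 8-way L1 D-cache with 64-byte lines has 64 sets,
  // so l = 6 and s = 6.
  CacheAddress a = split_address(0x00007f8a12345678, 6, 6);
  std::cout << "offset=" << a.line_offset
            << " set=" << a.set_index
            << " tag=0x" << std::hex << a.tag << "\n";
}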
In the common case where the number of sets in a cache is a power of two, so S = 2^s, the lowest l bits in an address make up the cache line offset, and the next s bits are the set index. The highest n − s − l bits in an address are not used when selecting where a memory location will be cached. Figure 26 shows the cache structure and lookup process.

Fortunately, Intel's optimization reference [95] and the datasheets referenced in § 2.9.3 provide more information. Intel processors use variations of the MESIF [66] protocol, which is implemented in the CPU and in the protocol layer of the QPI bus.

The SDM and the CPUID instruction output indicate that the L3 cache, also known as the last-level cache

22.(LLC) is inclusive, meaning that any location cached by Core Core an L1 or L2 cache must also be cached in the LLC. This QPI Link L2 Cache L2 Cache DDR3 Channel design decision reduces complexity in many implemen- tation aspects. We estimate that the bulk of the cache CBox CBox coherence implementation is in the CPU’s uncore, thanks QPI Ring to to the fact that cache synchronization can be achieved Packetizer QPI L3 Cache L3 Cache L3 Cache Slice Slice Home Memory without having to communicate to the lower cache levels Agent Controller UBox L3 Cache L3 Cache that are inside execution cores. Ring to Slice Slice I/O Controller PCIe The QPI protocol defines cache agents, which are connected to the last-level cache in a processor, and CBox CBox home agents, which are connected to memory controllers. Cache agents make requests to home agents for cache PCIe Lanes L2 Cache L2 Cache line data on cache misses, while home agents keep track Core Core of cache line ownership, and obtain the cache line data Figure 27: The stops on the ring interconnect used for inter-core from other cache line agents, or from the memory con- and core-uncore communication. troller. The QPI routing layer supports multiple agents per socket, and each processor has its own caching agents, which is the uncore configuration controller, and con- and at least one home agent. nects the System agent to the ring. The UBox is re- Figure 27 shows that the CPU uncore has a bidirec- sponsible for reading and writing physically distributed tional ring interconnect, which is used for communi- registers across the uncore. The UBox also receives inter- cation between execution cores and the other uncore rupts from system and dispatches them to the appropriate components. The execution cores are connected to the core. ring by CBoxes, which route their LLC accesses. The On recent Intel processors, the uncore also contains at routing is static, as the LLC is divided into same-size least one memory controller. Each integrated memory slices (common slice sizes are 1.5 MB and 2.5 MB), and controller (iMC or MBox in Intel’s documentation) is an undocumented hashing scheme maps each possible connected to the ring by a home agent (HA or BBox in physical address to exactly one LLC slice. Intel’s datasheets). Each home agent contains a Target Intel’s documentation states that the hashing scheme Address Decoder (TAD), which maps each DRAM ad- mapping physical addresses to LLC slices was designed dress to an address suitable for use by the DRAM chips, to avoid having a slice become a hotspot, but stops short namely a DRAM channel, bank, rank, and a DIMM ad- of providing any technical details. Fortunately, inde- dress. The mapping in the TAD is not documented by pendent researches have reversed-engineered the hash Intel, but it has been reverse-engineered [149]. functions for recent processors [84, 133, 195]. The integration of the memory controller on the CPU The hashing scheme described above is the reason brings the ability to filter DMA transfers. Accesses from why the L3 cache is documented as having a “complex” a peripheral connected to the PCIe bus are handled by the indexing scheme, as opposed to the direct indexing used integrated I/O controller (IIO), placed on the ring inter- in the L1 and L2 caches. connect via the UBox, and then reach the iMC. 
Therefore, The number of LLC slices matches the number of on modern systems, DMA transfers go through both the cores in the CPU, and each LLC slice shares a CBox SAD and TAD, which can be configured to abort DMA with a core. The CBoxes implement the cache coherence transfers targeting protected DRAM ranges. engine, so each CBox acts as the QPI cache agent for its LLC slice. CBoxes use a Source Address Decoder (SAD) 2.11.4 Caching and Memory-Mapped Devices to route DRAM requests to the appropriate home agents. Caches rely on the assumption that the underlying mem- Conceptually, the SAD takes in a memory address and ory implements the memory abstraction in § 2.2. How- access type, and outputs a transaction type (coherent, ever, the physical addresses that map to memory-mapped non-coherent, IO) and a node ID. Each CBox contains I/O devices usually deviate from the memory abstraction. a SAD replica, and the configurations of all SADs in a For example, some devices expose command registers package are identical. that trigger certain operations when written, and always The SAD configurations are kept in sync by the UBox, return a zero value. Caching addresses that map to such 22

23.memory-mapped I/O devices will lead to incorrect be- On recent Intel processors, the cache’s behavior is havior. mainly configured by the Memory Type Range Registers Furthermore, even when the memory-mapped devices (MTRRs) and by Page Attribute Table (PAT) indices in follow the memory abstraction, caching their memory is the page tables (§ 2.5). The behavior is also impacted by sometimes undesirable. For example, caching a graphic the Cache Disable (CD) and Not-Write through (NW) unit’s framebuffer could lead to visual artifacts on the bits in Control Register 0 (CR0, § 2.4), as well as by user’s display, because of the delay between the time equivalent bits in page table entries, namely Page-level when a write is issued and the time when the correspond- Cache Disable (PCD) and Page-level Write-Through ing cache lines are evicted and written back to memory. (PWT). In order to work around these problems, the Intel archi- The MTRRs were intended to be configured by the tecture implements a few caching behaviors, described computer’s firmware during the boot sequence. Fixed below, and provides a method for partitioning the mem- MTRRs cover pre-determined ranges of memory, such ory address space (§ 2.4) into regions, and for assigning as the memory areas that had special semantics in the a desired caching behavior to each region. computers using 16-bit Intel processors. The ranges Uncacheable (UC) memory has the same semantics covered by variable MTRRs can be configured by system as the I/O address space (§ 2.4). UC memory is useful software. The representation used to specify the ranges when a device’s behavior is dependent on the order of is described below, as it has some interesting properties memory reads and writes, such as in the case of memory- that have proven useful in other systems. mapped command and data registers for a PCIe NIC Each variable memory type range is specified using (§ 2.9.1). The out-of-order execution engine (§ 2.10) a range base and a range mask. A memory address be- does not reorder UC memory accesses, and does not longs to the range if computing a bitwise AND between issue speculative reads to UC memory. the address and the range mask results in the range base. Write Combining (WC) memory addresses the spe- This verification has a low-cost hardware implementa- cific needs of framebuffers. WC memory is similar to tion, shown in Figure 28. UC memory, but the out-of-order engine may reorder memory accesses, and may perform speculative reads. MTRR mask The processor stores writes to WC memory in a write AND EQ match Physical Address combining buffer, and attempts to group multiple writes MTRR base into a (more efficient) line write bus transaction. Write Through (WT) memory is cached, but write Figure 28: The circuit for computing whether a physical address matches a memory type range. Assuming a CPU with 48-bit physical misses do not cause cache fills. This is useful for pre- addresses, the circuit uses 36 AND gates and a binary tree of 35 venting large memory-mapped device memories that are XNOR (equality test) gates. The circuit outputs 1 if the address rarely read, such as framebuffers, from taking up cache belongs to the range. The bottom 12 address bits are ignored, because memory. WT memory is covered by the cache coherence memory type ranges must be aligned to 4 KB page boundaries. engine, may receive speculative reads, and is subject to Each variable memory type range must have a size that operation reordering. 
is an integral power of two, and a starting address that DRAM is represented as Write Back (WB) memory, is a multiple of its size, so it can be described using the which is optimized under the assumption that all the base / mask representation described above. A range’s devices that need to observe the memory operations im- starting address is its base, and the range’s size is one plement the cache coherence protocol. WB memory is plus its mask. cached as described in § 2.11, receives speculative reads, Another advantage of this range representation is that and operations targeting it are subject to reordering. the base and the mask can be easily validated, as shown Write Protected (WP) memory is similar to WB mem- in Listing 1. The range is aligned with respect to its size ory, with the exception that every write is propagated if and only if the bitwise AND between the base and the to the system bus. It is intended for memory-mapped mask is zero. The range’s size is a power of two if and buffers, where the order of operations does not matter, only if the bitwise AND between the mask and one plus but the devices that need to observe the writes do not im- the mask is zero. According to the SDM, the MTRRs are plement the cache coherence protocol, in order to reduce not validated, but setting them to invalid values results in hardware costs. undefined behavior. 23
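For completeness, the membership test itself is as cheap in software as the validity checks of Listing 1. In the representation used by Listing 1, where the mask equals the range size minus one, an address falls inside the range exactly when clearing its low mask bits leaves the base; the sketch below (our code, not an Intel-provided interface) expresses that test.

#include <cstdint>

// Companion to Listing 1: with a range described by a base and a mask
// equal to (size - 1), an address belongs to the range exactly when
// clearing the low mask bits of the address leaves the base.
constexpr bool range_contains(std::uint64_t address,
                              std::uint64_t base, std::uint64_t mask) {
  return (address & ~mask) == base;
}

// Example: a 1 MB range starting at 2 MB (base = 0x200000, mask = 0xFFFFF).
static_assert(range_contains(0x2ABCD0, 0x200000, 0xFFFFF),
              "an address inside [2 MB, 3 MB) matches");
static_assert(!range_contains(0x300000, 0x200000, 0xFFFFF),
              "an address outside the range does not match");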

Memory        Entries                Access Time
L1 I-TLB      128 + 8 = 136          1 cycle
L1 D-TLB      64 + 32 + 4 = 100      1 cycle
L2 TLB        1536 + 8 = 1544        7 cycles
Page Tables   2^36 ≈ 6 · 10^10       18 cycles - 200ms

Table 9: Approximate sizes and access times for each level in the TLB hierarchy, from [4].

constexpr bool is_valid_range(size_t base, size_t mask) {
  // Base is aligned to size.
  return (base & mask) == 0 &&
         // Size is a power of two.
         (mask & (mask + 1)) == 0;
}

Listing 1: The checks that validate the base and mask of a memory-type range can be implemented very easily.

No memory type range can partially cover a 4 KB page, which implies that the range base must be a multiple of 4 KB, and the bottom 12 bits of range mask must be set. This simplifies the interactions between memory type ranges and address translation, described in § 2.11.5.

The PAT is intended to allow the operating system or hypervisor to tweak the caching behaviors specified in the MTRRs by the computer's firmware. The PAT has 8 entries that specify caching behaviors, and is stored in its entirety in a MSR. Each page table entry contains a 3-bit index that points to a PAT entry, so the system software that controls the page tables can specify caching behavior at a very fine granularity.

2.11.5 Caches and Address Translation

Modern system software relies on address translation (§ 2.5). This means that all the memory accesses issued by a CPU core use virtual addresses, which must undergo translation. Caches must know the physical address for a memory access, to handle aliasing (multiple virtual addresses pointing to the same physical address) correctly. However, address translation requires up to 20 memory accesses (see Figure 15), so it is impractical to perform a full address translation for every cache access. Instead, address translation results are cached in the translation look-aside buffer (TLB).

Table 9 shows the levels of the TLB hierarchy. Recent processors have separate L1 TLBs for instructions and data, and a shared L2 TLB. Each core has its own TLBs (see Figure 23). When a virtual address is not contained in a core's TLB, the Page Miss Handler (PMH) performs a page walk (page table / EPT traversal) to translate the virtual address, and the result is stored in the TLB.

In the Intel architecture, the PMH is implemented in hardware, so the TLB is never directly exposed to software and its implementation details are not documented. The SDM does state that each TLB entry contains the physical address associated with a virtual address, and the metadata needed to resolve a memory access. For example, the processor needs to check the writable (W) flag on every write, and issue a General Protection fault (#GP) if the write targets a read-only page. Therefore, the TLB entry for each virtual address caches the logical-and of all the relevant W flags in the page table structures leading up to the page.

The TLB is transparent to application software. However, kernels and hypervisors must make sure that the TLBs do not get out of sync with the page tables and EPTs. When changing a page table or EPT, the system software must use the INVLPG instruction to invalidate any TLB entries for the virtual address whose translation changed. Some instructions flush the TLBs, meaning that they invalidate all the TLB entries, as a side-effect.

TLB entries also cache the desired caching behavior (§ 2.11.4) for their pages. This requires system software to flush the corresponding TLB entries when changing MTRRs or page table entries. In return, the processor only needs to compute the desired caching behavior during a TLB miss, as opposed to computing the caching behavior on every memory access.

The TLB is not covered by the cache coherence mechanism described in § 2.11.3. Therefore, when modifying a page table or EPT on a multi-core / multi-processor system, the system software is responsible for performing a TLB shootdown, which consists of stopping all the logical processors that use the page table / EPT about to be changed, performing the changes, executing TLB-invalidating instructions on the stopped logical processors, and then resuming execution on the stopped logical processors.

Address translation constrains the L1 cache design. On Intel processors, the set index in an L1 cache only uses the address bits that are not impacted by address translation, so that the L1 set lookup can be done in parallel with the TLB lookup. This is critical for achieving a low latency when both the L1 TLB and the L1 cache are hit.

Given a page size P = 2^p bytes, the requirement above translates to l + s ≤ p. In the Intel architecture, p = 12, and all recent processors have 64-byte cache lines (l = 6) and 64 sets (s = 6) in the L1 caches, as

25.shown in Figure 29. The L2 and L3 caches are only grammable Interrupt Controller (IOAPIC) in the PCH, accessed if the L1 misses, so the physical address for the shown in Figure 20. memory access is known at that time, and can be used The IOAPIC routes interrupt signals to one or more for indexing. Local Advanced Programmable Interrupt Controllers (LAPICs). As shown in Figure 22, each logical CPU L1 Cache Address Breakdown has a LAPIC that can receive interrupt signals from the Address Tag Set Index Line Offset 47…12 11…6 5…0 IOAPIC. The IOAPIC routing process assigns each inter- 4KB Page Address Breakdown rupt to an 8-bit interrupt vector that is used to identify PML4E Index PDPTE Index PDE Index PTE Index Page Offset the interrupt sources, and to a 32-bit APIC ID that is used 47…39 38…30 29…21 20…12 11…0 to identify the LAPIC that receives the interrupt. L2 Cache Address Breakdown Each LAPIC uses a 256-bit Interrupt Request Regis- Address Tag Set Index Line Offset ter (IRR) to track the unserviced interrupts that it has 47…16 14…6 5…0 received, based on the interrupt vector number. When the L3 Cache Address Breakdown corresponding logical processor is available, the LAPIC Address Tag Set Index Line Offset copies the highest-priority unserviced interrupt vector 47…16 18…6 5…0 to the In-Service Register (ISR), and invokes the logical 2MB Page Address Breakdown processor’s interrupt handling process. PML4E Index PDPTE Index PDE Index Page Offset At the execution core level, interrupt handling reuses 47…39 38…30 29…21 20…0 many of the mechanisms of fault handling (§ 2.8.2). The Figure 29: Virtual addresses from the perspective of cache lookup interrupt vector number in the LAPIC’s ISR is used to and address translation. The bits used for the L1 set index and line locate an interrupt handler in the IDT, and the handler is offset are not changed by address translation, so the page tables do invoked, possibly after a privilege switch is performed. not impact L1 cache placement. The page tables do impact L2 and L3 cache placement. Using large pages (2 MB or 1 GB) is not sufficient The interrupt handler does the processing that the device to make L3 cache placement independent of the page tables, because requires, and then writes the LAPIC’s End Of Interrupt of the LLC slice hashing function (§ 2.11.3). (EOI) register to signal the fact that it has completed handling the interrupt. 2.12 Interrupts Interrupts are treated like faults, so interrupt handlers Peripherals use interrupts to signal the occurrence of have full control over the execution environment of the an event that must be handled by system software. For application being interrupted. This is used to implement example, a keyboard triggers interrupts when a key is pre-emptive multi-threading, which relies on a clock pressed or depressed. System software also relies on device that generates interrupts periodically, and on an interrupts to implement preemptive multi-threading. interrupt handler that performs context switches. Interrupts are a kind of hardware exception (§ 2.8.2). System software can cause an interrupt on any logical Receiving an interrupt causes an execution core to per- processor by writing the target processor’s APIC ID into form a privilege level switch and to start executing the the Interrupt Command Register (ICR) of the LAPIC system software’s interrupt handling code. Therefore, the associated with the logical processor that the software security concerns in § 2.8.2 also apply to interrupts, with is running on. 
These interrupts, called Inter-Processor the added twist that interrupts occur independently of the Interrupts (IPI), are needed to implement TLB shoot- instructions executed by the interrupted code, whereas downs (§ 2.11.5). most faults are triggered by the actions of the application software that incurs them. 2.13 Platform Initialization (Booting) Given the importance of interrupts when assessing When a computer is powered up, it undergoes a boot- a system’s security, this section outlines the interrupt strapping process, also called booting, for simplicity. triggering and handling processes described in the SDM. The boot process is a sequence of steps that collectively Peripherals use bus-specific protocols to signal inter- initialize all the computer’s hardware components and rupts. For example, PCIe relies on Message Signaled load the system software into DRAM. An analysis of Interrupts (MSI), which are memory writes issued to a system’s security properties must be aware of all the specially designed memory addresses. The bus-specific pieces of software executed during the boot process, and interrupt signals are received by the I/O Advanced Pro- must account for the trust relationships that are created 25

26.when a software module loads another module. itself from the temporary memory store into DRAM, and This section outlines the details of the boot process tears down the temporary storage. When the computer is needed to reason about the security of a system based powering up or rebooting, the PEI implementation is also on the Intel architecture. [91] provides a good refer- responsible for initializing all the non-volatile storage ence for many of the booting process’s low-level details. units that contain UEFI firmware and loading the next While some specifics of the boot process depend on the stage of the firmware into DRAM. motherboard and components in a computer, this sec- PEI hands off control to the Driver eXecution Envi- tion focuses on the high-level flow described by Intel’s ronment phase (DXE). In DXE, a loader locates and documentation. starts firmware drivers for the various components in the 2.13.1 The UEFI Standard computer. DXE is followed by a Boot Device Selection (BDS) phase, which is followed by a Transient System The firmware in recent computers with Intel processors Load (TSL) phase, where an EFI application loads the implements the Platform Initialization (PI) process in operating system selected in the BDS phase. Last, the the Unified Extensible Firmware Interface (UEFI) spec- OS loader passes control to the operating system’s kernel, ification [178]. The platform initialization follows the entering the Run Time (RT) phase. steps shown in Figure 30 and described below. When waking up from sleep, the PEI implementation Security (SEC) Cache-as-RAM first initializes the non-volatile storage containing the measures microcode system snapshot saved while entering the sleep state. firmware The rest of the PEI implementation may use optimized Pre-EFI Initialization (PEI) DRAM Initialized re-initialization processes, based on the snapshot con- measures tents. The DXE implementation also uses the snapshot Driver eXecution Environment (DXE) measures to restore the computer’s state, such as the DRAM con- tents, and then directly executes the operating system’s Boot Device Selection (BDS) measures wake-up handler. bootloader Transient System Load (TSL) 2.13.2 SEC on Intel Platforms measures OS Run Time (RT) Right after a computer is powered up, circuitry in the power supply and on the motherboard starts establishing Figure 30: The phases of the Platform Initialization process in the reference voltages on the power rails in a specific or- UEFI specification. der, documented as “power sequencing” [182] in chipset The computer powers up, reboots, or resumes from specifications such as [101]. The rail powering up the sleep in the Security phase (SEC). The SEC implementa- Intel ME (§ 2.9.2) in the PCH is powered up significantly tion is responsible for establishing a temporary memory before the rail that powers the CPU cores. store and loading the next stage of the firmware into it. When the ME is powered up, it starts executing the As the first piece of software that executes on the com- code in its boot ROM, which sets up the SPI bus con- puter, the SEC implementation is the system’s root of nected to the flash memory chip (§ 2.9.1) that stores both trust, and performs the first steps towards establishing the UEFI firmware and the ME’s firmware. The ME then the system’s desired security properties. loads its firmware from flash memory, which contains For example, in a measured boot system (also known the ME’s operating system and applications. 
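The measurement chain mentioned earlier in this section can be summarized as a fold: each boot stage hashes the module it is about to run and mixes that hash into the running measurement before transferring control, so the final value commits to every module in the chain. The sketch below is purely conceptual; it uses FNV-1a as a stand-in hash to stay self-contained, whereas a real measured boot uses a cryptographic hash and a hardware-protected register (§ 3.3).

#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Stand-in 64-bit FNV-1a hash; a real measured boot would use a
// cryptographic hash such as SHA-256.
std::uint64_t fnv1a(const std::string& data) {
  std::uint64_t h = 0xcbf29ce484222325ull;
  for (unsigned char c : data) {
    h ^= c;
    h *= 0x100000001b3ull;
  }
  return h;
}

int main() {
  // The measurement register starts from a known value at reset.
  std::uint64_t measurement = 0;

  // Each stage measures the next one before handing over control, so the
  // final value depends on every module in the boot chain and their order.
  std::vector<std::string> boot_modules = {
      "PEI implementation", "DXE drivers", "bootloader", "OS kernel"};
  for (const std::string& module : boot_modules) {
    measurement = fnv1a(std::to_string(measurement) + std::to_string(fnv1a(module)));
    std::cout << "after " << module << ": " << std::hex << measurement << "\n";
  }
}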
as trusted boot), all the software involved in the boot pro- After the Intel ME loads its software, it sets up some of cess is measured (cryptographically hashed, and the mea- the motherboard’s hardware, such as the PCH bus clocks, surement is made available to third parties, as described and then it kicks off the CPU’s bootstrap sequence. Most in § 3.3). In such a system, the SEC implementation of the details of the ME’s involvement in the computer’s takes the first steps in establishing the system’s measure- boot process are not publicly available, but initializing ment, namely resetting the special register that stores the the clocks is mentioned in a few public documents [5, 7, measurement result, measuring the PEI implementation, 42, 106], and is made clear in firmware bringup guides, and storing the measurement in the special register. such as the leaked confidential guide [92] documenting SEC is followed by the Pre-EFI Initialization phase firmware bringup for Intel’s Series 7 chipset. (PEI), which initializes the computer’s DRAM, copies The beginning of the CPU’s bootstrap sequence is 26

27.the SEC phase, which is implemented in the processor 0xFFFFFFFF Legacy Reset Vector 0xFFFFFFF0 circuitry. All the logical processors (LPs) on the mother- FIT Pointer 0xFFFFFFE8 board undergo hardware initialization, which invalidates Firmware Interface Table (FIT) the caches (§ 2.11) and TLBs (§ 2.11.5), performs a Built- FIT Header PEI ACM Entry In Self Test (BIST), and sets all the registers (§ 2.6) to TXT Policy Entry pre-specified values. Pre-EFI Initialization ACM After hardware initialization, the LPs perform the ACM Header Multi-Processor (MP) initialization algorithm, which Public Key results in one LP being selected as the bootstrap pro- Signature cessor (BSP), and all the other LPs being classified as PEI Implementation application processors (APs). TXT Policy Configuration According to the SDM, the details of the MP initial- DXE modules ization algorithm for recent CPUs depend on the moth- erboard and firmware. In principle, after completing Figure 31: The Firmware Interface Table (FIT) in relation to the hardware initialization, all LPs attempt to issue a spe- firmware’s memory map. cial no-op transaction on the QPI bus. A single LP will succeed in issuing the no-op, thanks to the QPI arbi- The PEI implementation is stored in an ACM listed tration mechanism, and to the UBox (§ 2.11.3) in each in the FIT. The processor loads the PEI ACM, verifies CPU package, which also serves as a ring arbiter. The the trustworthiness of the ACM’s public key, and ensures arbitration priority of each LP is based on its APIC ID that the ACM’s contents matches its signature. If the PEI (§ 2.12), which is provided by the motherboard when the passes the security checks, it is executed. Processors that system powers up. The LP that issues the no-op becomes support Intel TXT only accept Intel-signed ACMs [55, p. the BSP. Upon failing to issue the no-op, the other LPs 92]. become APs, and enter the wait-for-SIPI state. Understanding the PEI firmware loading process is 2.13.3 PEI on Intel Platforms unnecessarily complicated by the fact that the SDM de- [91] and [35] describe the initialization steps performed scribes a legacy process consisting of having the BSP set by Intel platforms during the PEI phase, from the per- its RIP register to 0xFFFFFFF0 (16 bytes below 4 GB), spective of a firmware programmer. A few steps provide where the firmware is expected to place a instruction that useful context for reasoning about threat models involv- jumps into the PEI implementation. ing the boot process. Recent processors do not support the legacy approach When the BSP starts executing PEI firmware, DRAM at all [154]. Instead, the BSP reads a word from address is not yet initialized. Therefore the PEI code starts ex- 0xFFFFFFE8 (24 bytes below 4 GB) [40, 201], and ex- ecuting in a Cache-as-RAM (CAR) mode, which only pects to find the address of a Firmware Interface Table relies on the BSP’s internal caches, at the expense of im- (FIT) in the memory address space (§ 2.4), as shown posing severe constraints on the size of the PEI’s working in Figure 31. The BSP is able to read firmware con- set. 
tents from non-volatile memory before the computer is One of the first tasks performed by the PEI implemen- initialized, because the initial SAD (§ 2.11.3) and PCH tation is enabling DRAM, which requires discovering (§ 2.9.1) configurations maps a region in the memory and initializing the DRAM chips connected to the moth- address space to the SPI flash chip (§ 2.9.1) that stores erboard, and then configuring the BSP’s memory con- the computer’s firmware. trollers (§ 2.11.3) and MTRRs (§ 2.11.4). Most firmware The FIT [151] was introduced in the context of Intel’s implementations use Intel’s Memory Reference Code Itanium architecture, and its use in Intel’s current 64- (MRC) for this task. bit architecture is described in an Intel patent [40] and After DRAM becomes available, the PEI code is briefly documented in an obscure piece of TXT-related copied into DRAM and the BSP is taken out of CAR documentation [88]. The FIT contains Authenticated mode. The BSP’s LAPIC (§ 2.12) is initialized and Code Modules (ACMs) that make up the firmware, and used to send a broadcast Startup Inter-Processor Inter- other platform-specific information, such as the TPM rupt (SIPI, § 2.12) to wake up the APs. The interrupt and TXT configuration [88]. vector in a SIPI indicates the memory address of the AP 27

The interrupt vector in a SIPI indicates the memory address of the AP initialization code in the PEI implementation.

The PEI code responsible for initializing APs is executed when the APs receive the SIPI wake-up. The AP PEI code sets up the AP's configuration registers, such as the MTRRs, to match the BSP's configuration. Next, each AP registers itself in a system-wide table, using a memory synchronization primitive, such as a semaphore, to avoid having two APs access the table at the same time. After the AP initialization completes, each AP is suspended again, and waits to receive an INIT Inter-Processor Interrupt from the OS kernel.

The BSP initialization code waits for all APs to register themselves into the system-wide table, and then proceeds to locate, load and execute the firmware module that implements DXE.

2.14 CPU Microcode

The Intel architecture features a large instruction set. Some instructions are used infrequently, and some instructions are very complex, which makes it impractical for an execution core to handle all the instructions in hardware. Intel CPUs use a microcode table to break down rare and complex instructions into sequences of simpler instructions. Architectural extensions that only require microcode changes are significantly cheaper to implement and validate than extensions that require changes in the CPU's circuitry.

It follows that a good understanding of what can be done in microcode is crucial to evaluating the cost of security features that rely on architecture extensions. Furthermore, the limitations of microcode are sometimes the reasoning behind seemingly arbitrary architecture design decisions.

The first sub-section below presents the relevant facts pertaining to microcode in Intel's optimization reference [95] and SDM. The following subsections summarize information gleaned from Intel's patents and other researchers' findings.
2.14.1 The Role of Microcode

The frequently used instructions in the Intel architecture are handled by the core's fast path, which consists of simple decoders (§ 2.10) that can emit at most 4 micro-ops per instruction. Infrequently used instructions and instructions that require more than 4 micro-ops use a slower decoding path that relies on a sequencer to read micro-ops from a microcode store ROM (MSROM).

The 4 micro-ops limitation can be used to guess intelligently whether an architectural feature is implemented in microcode. For example, it is safe to assume that XSAVE (§ 2.6), which takes over 200 micro-ops on recent CPUs [53], is most likely performed in microcode, whereas simple arithmetic and memory accesses are handled directly by hardware.

The core's execution units handle common cases in fast paths implemented in hardware. When an input cannot be handled by the fast paths, the execution unit issues a microcode assist, which points the microcode sequencer to a routine in microcode that handles the edge cases. The most commonly cited example in Intel's documentation is floating point instructions, which issue assists to handle denormalized inputs.

The REP MOVS family of instructions, also known as string instructions because of their use in strcpy-like functions, operate on variable-sized arrays. These instructions can handle small arrays in hardware, and issue microcode assists for larger arrays.

Modern Intel processors implement a microcode update facility. The SDM describes the process of applying microcode updates from the perspective of system software. Each core can be updated independently, and the updates must be reapplied on each boot cycle. A core can be updated multiple times. The latest SDM at the time of this writing states that a microcode update is up to 16 KB in size.

Processor engineers prefer to build new architectural features as microcode extensions, because microcode can be iterated on much faster than hardware, which reduces development cost [191, 192]. The update facility further increases the appeal of microcode, as some classes of bugs can be fixed after a CPU has been released.

Intel patents [108, 136] describing Software Guard Extensions (SGX) disclose that SGX is entirely implemented in microcode, except for the memory encryption engine. A description of SGX's implementation could provide great insights into Intel's microcode, but, unfortunately, the SDM chapters covering SGX do not include such a description. We therefore rely on other public information sources about the role of microcode in the security-sensitive areas covered by previous sections, namely memory management (§ 2.5, § 2.11.5), the handling of hardware exceptions (§ 2.8.2) and interrupts (§ 2.12), and platform initialization (§ 2.13).

The use of microcode assists can be measured using the Precise Event Based Sampling (PEBS) feature in recent Intel processors. PEBS provides counters for the number of micro-ops coming from MSROM, including complex instructions and assists, counters for the numbers of assists associated with some micro-op classes (SSE and AVX stores and transitions), and a counter for assists generated by all other micro-ops.

The PEBS feature itself is implemented using microcode assists (this is implied in the SDM and confirmed by [118]) when it needs to write the execution context into a PEBS record. Given the wide range of features monitored by PEBS counters, we assume that all execution units in the core can issue microcode assists, which are performed at micro-op retirement. This finding is confirmed by an Intel patent [24], and is supported by the existence of a PEBS counter for the "number of microcode assists invoked by hardware upon micro-op writeback."

Intel's optimization manual describes one more interesting assist, from a memory system perspective. SIMD masked loads (using VMASKMOV) read a series of data elements from memory into a vector register. A mask register decides whether elements are moved or ignored. If the memory address overlaps an invalid page (e.g., the P flag is 0, § 2.5), a microcode assist is issued, even if the mask indicates that no element from the invalid page should be read. The microcode checks whether the elements in the invalid page have the corresponding mask bits set, and either performs the load or issues a page fault.

The description of machine checks in the SDM mentions page assists and page faults in the same context. We assume that the page assists are issued in some cases when a TLB miss occurs (§ 2.11.5) and the PMH has to walk the page table. The following section develops this assumption and provides supporting evidence from Intel's assigned patents and published patent applications.

2.14.2 Microcode Structure

According to a 2013 Intel patent [82], the avenues considered for implementing new architectural features are a completely microcode-based implementation, using existing micro-ops, a microcode implementation with hardware support, which would use new micro-ops, and a complete hardware implementation, using finite state machines (FSMs).

The main component of the MSROM is a table of micro-ops [191, 192]. According to an example in a 2012 Intel patent [192], the table contains on the order of 20,000 micro-ops, and a micro-op has about 70 bits. On embedded processors, like the Atom, microcode may be partially compressed [191, 192].

The MSROM also contains an event ROM, which is an array of pointers to event handling code in the micro-ops table [158]. Microcode events are hardware exceptions, assists, and interrupts [24, 36, 147]. The processor described in a 1999 patent [158] has a 64-entry event table, where the first 16 entries point to hardware exception handlers and the other entries are used by assists.
The execution units can issue an assist or signal a fault by associating an event code with the result of a micro-op. When the micro-op is committed (§ 2.10), the event code causes the out-of-order scheduler to squash all the micro-ops that are in-flight in the ROB. The event code is forwarded to the microcode sequencer, which reads the micro-ops in the corresponding event handler [24, 147].

The hardware exception handling logic (§ 2.8.2) and interrupt handling logic (§ 2.12) are implemented entirely in microcode [147]. Therefore, changes to this logic are relatively inexpensive to implement on Intel processors. This is rather fortunate, as the Intel architecture's standard hardware exception handling process requires that the fault handler is trusted by the code that encounters the exception (§ 2.8.2), and this assumption cannot be satisfied by a design where the software executing inside a secure container must be isolated from the system software managing the computer's resources.

The execution units in modern Intel processors support microcode procedures, via dedicated microcode call and return micro-ops [36]. The micro-ops manage a hardware data structure that conceptually stores a stack of microcode instruction pointers, and is integrated with out-of-order execution and hardware exceptions, interrupts and assists.

Aside from special micro-ops, microcode also employs special load and store instructions, which turn into special bus cycles, to issue commands to other functional units [157]. The memory addresses in the special loads and stores encode commands and input parameters. For example, stores to a certain range of addresses flush specific TLB sets.

2.14.3 Microcode and Address Translation

Address translation (§ 2.5) is configured by CR3, which stores the physical address of the top-level page table, and by various bits in CR0 and CR4, all of which are described in the SDM. Writes to these control registers are implemented in microcode, which stores extra information in microcode-visible registers [62].

When a TLB miss (§ 2.11.5) occurs, the memory execution unit forwards the virtual address to the Page Miss Handler (PMH), which performs the page walk needed to obtain a physical address.
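The page walk itself can be made concrete with a short sketch. The Python fragment below walks the four-level x86-64 page table structure down to a 4 KB page; it is an illustration of the algorithm only, and the read_phys_u64 callback and the simplified entry handling are assumptions made for this sketch, not Intel's implementation, which also handles large pages, permission bits, and the Accessed/Dirty updates discussed later in this section.

```python
# Simplified sketch of a 4-level x86-64 page walk (the operation the PMH
# performs on a TLB miss). Illustrative only: large pages, permission
# checks, and fault reporting through microcode event codes are omitted.

PAGE_PRESENT = 1 << 0                   # P flag (bit 0) of a page table entry
ADDR_MASK = 0x000FFFFFFFFFF000          # bits 12-51 hold the next table's address

def walk_page_tables(cr3, virtual_addr, read_phys_u64):
    """Translate a 48-bit virtual address, given CR3 and a callback that
    reads a 64-bit page table entry from physical memory."""
    table = cr3 & ADDR_MASK
    # PML4, PDPT, PD, PT indices: 9 bits each, starting at bit 39.
    for shift in (39, 30, 21, 12):
        index = (virtual_addr >> shift) & 0x1FF
        entry = read_phys_u64(table + index * 8)
        if not (entry & PAGE_PRESENT):
            raise LookupError("page fault: entry not present")
        table = entry & ADDR_MASK
    # Physical address = frame base + offset within the 4 KB page.
    return table | (virtual_addr & 0xFFF)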

In order to minimize the latency of a page walk, the PMH is implemented as a Finite-State Machine (FSM) [77, 152]. Furthermore, the PMH fetches the page table entries from memory by issuing "stuffed loads", which are special micro-ops that bypass the reorder buffer (ROB) and go straight to the memory execution units (§ 2.10), thus avoiding the overhead associated with out-of-order scheduling [63, 77, 157].

The FSM in the PMH handles the fast path of the entire address translation process, which assumes no address translation fault (§ 2.8.2) occurs [63, 64, 147, 158], and no page table entry needs to be modified [63].

When the PMH FSM detects the conditions that trigger a Page Fault or a General Protection Fault, it communicates a microcode event code, corresponding to the detected fault condition, to the execution unit (§ 2.10) responsible for memory operations [63, 64, 147, 158]. In turn, the execution unit triggers the fault by associating the event code with the micro-op that caused the address translation, as described in the previous section.

The PMH FSM does not set the Accessed or Dirty attributes (§ 2.5.3) in page table entries. When it detects that a page table entry must be modified, the FSM issues a microcode event code for a page walk assist [63]. The microcode handler performs the page walk again, setting the A and D attributes on page table entries when necessary [63]. This finding was indirectly confirmed by the description for a PEBS event in the most recent SDM release.

The patents at the core of our descriptions above [24, 63, 64, 147, 158] were all issued between 1996 and 1999, which raises the concern of obsolescence. As Intel would not be able to file new patents for the same specifications, we cannot present newer patents with the information above. Fortunately, we were able to find newer patents that mention the techniques described above, proving their relevance to newer CPU models.

Two 2014 patents [77, 152] mention that the PMH is executing an FSM which issues stuffed loads to obtain page table entries. A 2009 patent [62] mentions that microcode is invoked after a PMH walk, and that the microcode can prevent the translation result produced by the PMH from being written to the TLB.

A 2013 patent [82] and a 2014 patent [153] on scatter / gather instructions disclose that the newly introduced instructions use a combination of hardware in the execution units that perform memory operations, which include the PMH. The hardware issues microcode assists for slow paths, such as gathering vector elements stored in uncacheable memory (§ 2.11.4), and operations that cause Page Faults.
A 2014 patent on APIC (§ 2.12) virtualization [166] describes a memory execution unit modification that invokes a microcode assist for certain memory accesses, based on the contents of some range registers. The patent also mentions that the range registers are checked when the TLB miss occurs and the PMH is invoked, in order to decide whether a fast hardware path can be used for APIC virtualization, or a microcode assist must be issued.

The recent patents mentioned above allow us to conclude that the PMH in recent processors still relies on an FSM and stuffed loads, and still uses microcode assists to handle infrequent and complex operations. This assumption plays a key role in estimating the implementation complexity of architectural modifications targeting the processor's address translation mechanism.

2.14.4 Microcode and Booting

The SDM states that microcode performs the Built-In Self Test (BIST, § 2.13.2), but does not provide any details on the rest of the CPU's hardware initialization.

In fact, the entire SEC implementation on Intel platforms is contained in the processor microcode [40, 41, 166]. This implementation has desirable security properties, as it is significantly more expensive for an attacker to tamper with the MSROM circuitry (§ 2.14.2) than it is to modify the contents of the flash memory chip that stores the UEFI firmware. § 3.4.3 and § 3.6 describe the broad classes of attacks that an Intel platform can be subjected to.

The microcode that implements SEC performs MP initialization (§ 2.13.2), as suggested in the SDM. The microcode then places the BSP into Cache-as-RAM (CAR) mode, looks up the PEI Authenticated Code Module (ACM) in the Firmware Interface Table (FIT), loads the PEI ACM into the cache, and verifies its signature (§ 2.13.2) [40, 41, 142, 200, 201]. Given the structure of ACM signatures, we can conclude that Intel's microcode contains implementations of RSA decryption and of a variant of SHA hashing.

The PEI ACM is executed from the CPU's cache, after it is loaded by the microcode [40, 41, 200]. This removes the possibility for an attacker with physical access to the SPI flash chip to change the firmware's contents after the microcode computes its cryptographic hash, but before it is executed.

On motherboards compatible with LaGrande Server Extensions (LT-SX, also known as Intel TXT for servers), the firmware implementing PEI verifies that each CPU connected to the motherboard supports LT-SX, and powers off the CPU sockets that don't hold processors that implement LT-SX [142]. This prevents an attacker from tampering with a TXT-protected VM by hot-plugging a CPU in a running computer that is inside TXT mode. When a hot-plugged CPU passes security tests, a hypervisor is notified that a new CPU is available. The hypervisor updates its internal state, and sends the new CPU a SIPI. The new CPU executes a SIPI handler, inside microcode, that configures the CPU's state to match the state expected by the TXT hypervisor [142]. This implies that the AP initialization described in § 2.13.2 is implemented in microcode.

2.14.5 Microcode Updates

The SDM explains that the microcode on Intel CPUs can be updated, and describes the process for applying an update. However, no detail about the contents of an update is provided. Analyzing Intel's microcode updates seems like a promising avenue towards discovering the microcode's structure. Unfortunately, the updates have so far proven to be inscrutable [32].

The microcode updates cannot be easily analyzed because they are encrypted, hashed with a cryptographic hash function like SHA-256, and signed using RSA or elliptic curve cryptography [200]. The update facility is implemented entirely in microcode, including the decryption and signature verification [200].

[74] independently used fault injection and timing analysis to conclude that each recent Intel microcode update is signed with a 2048-bit RSA key and a (possibly non-standard) 256-bit hash algorithm, which agrees with the findings above.

The microcode update implementation places the core's cache into No-Evict Mode (NEM, documented by the SDM) and copies the microcode update into the cache before verifying its signature [200]. The update facility also sets up an MTRR entry to protect the update's contents from modifications via DMA transfers [200] as it is verified and applied.

While Intel publishes the most recent microcode updates for each of its CPU models, the release notes associated with the updates are not publicly available. This is unfortunate, as the release notes could be used to confirm guesses that certain features are implemented in microcode.
However, some information can be inferred by reading through the Errata section in Intel's Specification Updates [87, 103, 105]. The phrase "it is possible for BIOS to contain a workaround for this erratum" generally means that a microcode update was issued. For example, Errata AH in [87] implies that string instructions (REP MOV) are implemented in microcode, which was confirmed by Intel [12]. (Basic Input/Output System (BIOS) is the predecessor of UEFI-based firmware. Most Intel documentation, including the SDM, still uses the term BIOS to refer to firmware.)

Errata AH43 and AH91 in [87], and AAK73 in [103], imply that address translation (§ 2.5) is at least partially implemented in microcode. Errata AAK53, AAK63, AAK70, and AAK178 in [103], and BT138 and BT210 in [105], imply that VM entries and exits (§ 2.8.2) are implemented in microcode, which is confirmed by the APIC virtualization patent [166].

3 SECURITY BACKGROUND

Most systems rely on some cryptographic primitives for security. Unfortunately, these primitives have many assumptions, and building a secure system on top of them is a highly non-trivial endeavor. It follows that a system's security analysis should be particularly interested in what cryptographic primitives are used, and how they are integrated into the system.

§ 3.1 and § 3.2 lay the foundations for such an analysis by summarizing the primitives used by the secure architectures of interest to us, and by describing the most common constructs built using these primitives. § 3.3 builds on these concepts and describes software attestation, which is the most popular method for establishing trust in a secure architecture.

Having looked at the cryptographic foundations for building secure systems, we turn our attention to the attacks that secure architectures must withstand. Aside from forming a security checklist for architecture design, these attacks build intuition for the design decisions in the architectures of interest to us.

The attacks that can be performed on a computer system are broadly classified into physical attacks and software attacks. In physical attacks, the attacker takes advantage of a system's physical implementation details to perform an operation that bypasses the limitations set by the computer system's software abstraction layers. In contrast, software attacks are performed solely by executing software on the victim computer. § 3.4 summarizes the main types of physical attacks.

The distinction between software and physical attacks is particularly relevant in cloud computing scenarios, where gaining software access to the computer running a victim's software can be accomplished with a credit card backed by modest funds [155], whereas physical access is a more difficult prospect that requires trespass, coercion, or social engineering of the cloud provider's employees.

However, the distinction between software and physical attacks is blurred by the attacks presented in § 3.6, which exploit programmable peripherals connected to the victim computer's bus in order to carry out actions that are normally associated with physical attacks.

While the vast majority of software attacks exploit a bug in a software component, there are a few attack classes that deserve attention from architecture designers. Memory mapping attacks, described in § 3.7, become a possibility on architectures where the system software is not trusted. Cache timing attacks, summarized in § 3.8, exploit microarchitectural behaviors that are completely observable in software, but dismissed by the security analyses of most systems.

3.1 Cryptographic Primitives

This section overviews the cryptosystems used by secure architectures. We are interested in cryptographic primitives that guarantee confidentiality, integrity, and freshness, and we treat these primitives as black boxes, focusing on their use in larger systems. [114] covers the mathematics behind cryptography, while [51] covers the topic of building systems out of cryptographic primitives. Tables 10 and 11 summarize the primitives covered in this section.

Table 10: Desirable security guarantees and primitives that provide them

  Guarantee        Primitive
  Confidentiality  Encryption
  Integrity        MAC / Signatures
  Freshness        Nonces + integrity

Table 11: Popular cryptographic primitives that are considered to be secure against today's adversaries

  Guarantee        Symmetric Keys        Asymmetric Keys
  Confidentiality  AES-GCM, AES-CTR      RSA with PKCS #1 v2.0
  Integrity        HMAC-SHA-2, AES-GCM   DSS-RSA, DSS-ECC
A message whose confidentiality is protected can be transmitted over an insecure medium without an adversary being able to obtain the information in the message. When integrity protection is used, the receiver is guaranteed to either obtain a message that was transmitted by the sender, or to notice that an attacker tampered with the message's content.

When multiple messages get transmitted over an untrusted medium, a freshness guarantee assures the receiver that she will obtain the latest message coming from the sender, or will notice an attack. A freshness guarantee is stronger than the equivalent integrity guarantee, because the latter does not protect against replay attacks where the attacker replaces a newer message with an older message coming from the same sender.

The following example further illustrates these concepts. Suppose Alice is a wealthy investor who wishes to either BUY or SELL an item every day. Alice cannot trade directly, and must relay her orders to her broker, Bob, over a network connection owned by Eve.

A communication system with confidentiality guarantees would prevent Eve from distinguishing between a BUY and a SELL order, as illustrated in Figure 32. Without confidentiality, Eve would know Alice's order before it is placed by Bob, so Eve would presumably gain a financial advantage at Alice's expense.

Figure 32: In a confidentiality attack, Eve sees the message sent by Alice to Bob and can understand the information inside it. In this case, Eve can tell that the message is a buy order, and not a sell order.

A system with integrity guarantees would prevent Eve from replacing Alice's message with a false order, as shown in Figure 33. In this example, without integrity guarantees, Eve could replace Alice's message with a SELL-EVERYTHING order, and buy Alice's assets at a very low price.

Last, a communication system that guarantees freshness would ensure that Eve cannot perform the replay attack pictured in Figure 34, where she would replace Alice's message with an older message. Without freshness guarantees, Eve could mount the following attack, which bypasses both confidentiality and integrity guarantees.

Over a few days, Eve would copy and store Alice's messages from the network. When an order would reach Bob, Eve would observe the market and determine if the order was BUY or SELL. After building up a database of messages labeled BUY or SELL, Eve would replace Alice's message with an old message of her choice.

Figure 33: In an integrity attack, Eve replaces Alice's message with her own. In this case, Eve sends Bob a sell-everything order.

Figure 34: In a freshness attack, Eve replaces Alice's message with a message that she sent at an earlier time. In this example, Eve builds a database of labeled messages over time, and is able to send Bob her choice of a BUY or a SELL order.

3.1.1 Cryptographic Keys

All cryptographic primitives that we describe here rely on keys, which are small pieces of information that must only be disclosed according to specific rules. A large part of a system's security analysis focuses on ensuring that the keys used by the underlying cryptographic primitives are produced and handled according to the primitives' assumptions.

Each cryptographic primitive has an associated key generation algorithm that uses random data to produce a unique key. The random data is produced by a cryptographically strong pseudo-random number generator (CSPRNG) that expands a small amount of random seed data into a much larger amount of data, which is computationally indistinguishable from true random data. The random seed must be obtained from a true source of randomness whose output cannot be predicted by an adversary, such as the least significant bits of the temperature readings coming from a hardware sensor.
Symmetric key cryptography requires that all the parties in the system establish a shared secret key, which is usually referred to as "the key". Typically, one party executes the key generation algorithm and securely transmits the resulting key to the other parties, as illustrated in Figure 35. The channel used to distribute the key must provide confidentiality and integrity guarantees, which is a non-trivial logistical burden. The symmetric key primitives mentioned here do not make any assumption about the key, so the key generation algorithm simply grabs a fixed number of bits from the CSPRNG.

Figure 35: In symmetric key cryptography, a secret key is shared by the parties that wish to communicate securely.

The defining feature of asymmetric key cryptography is that it does not require a private channel for key distribution. Each party executes the key generation algorithm, which produces a private key and a public key that are mathematically related. Each party's public key is distributed to the other parties over a channel with integrity guarantees, as shown in Figure 36. Asymmetric key primitives are more flexible than their symmetric counterparts, but are more complicated and consume more computational resources.

Figure 36: An asymmetric key generation algorithm produces a private key and an associated public key. The private key is held confidential, while the public key is given to any party who wishes to securely communicate with the private key's holder.
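As a concrete illustration of the key generation algorithms above, the following Python sketch draws a symmetric key from the operating system's CSPRNG and generates an asymmetric key pair. The 128-bit key length and the use of Ed25519 from the third-party cryptography package are illustrative choices made for this sketch, not requirements of any particular design.

```python
import secrets

# Symmetric key generation: "grab a fixed number of bits from the CSPRNG".
# The OS CSPRNG behind secrets.token_bytes() is seeded from hardware
# randomness sources, playing the role of the random seed in Figure 35.
symmetric_key = secrets.token_bytes(16)      # 128-bit key, e.g. for AES-128

# An asymmetric key generation algorithm instead derives a mathematically
# related (private, public) pair; sketched here with the third-party
# `cryptography` package (an assumption of this sketch, not of the paper).
from cryptography.hazmat.primitives.asymmetric import ed25519

private_key = ed25519.Ed25519PrivateKey.generate()
public_key = private_key.public_key()        # safe to publish over an
                                             # integrity-protected channel
```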

3.1.2 Confidentiality

Many cryptosystems that provide confidentiality guarantees are built upon block ciphers that operate on fixed-size message blocks. The sender transforms a block using an encryption algorithm, and the receiver inverts the transformation using a decryption algorithm. The encryption algorithms in block ciphers obfuscate the message block's content in the output, so that an adversary who does not have the decryption key cannot obtain the original message block from the encrypted output.

Symmetric key encryption algorithms use the same secret key for encryption and decryption, as shown in Figure 37, while asymmetric key block ciphers use the public key for encryption, and the corresponding private key for decryption, as shown in Figure 38.

Figure 37: In a symmetric key secure permutation (block cipher), the same secret key must be provided to both the encryption and the decryption algorithm.

Figure 38: In an asymmetric key block cipher, the encryption algorithm operates on a public key, and the decryption algorithm uses the corresponding private key.

The most popular block cipher based on symmetric keys at the time of this writing is the Advanced Encryption Standard (AES) [39, 139], with two variants that operate on 128-bit blocks using 128-bit keys or 256-bit keys. AES is a secure permutation function, as it can transform any 128-bit block into another 128-bit block. Recently, the United States National Security Agency (NSA) required the use of 256-bit AES keys for protecting sensitive information [141].

The most deployed asymmetric key block cipher is the Rivest-Shamir-Adelman (RSA) [156] algorithm. RSA has variable key sizes, and 3072-bit key pairs are considered to provide the same security as 128-bit AES keys [20].
A block cipher does not necessarily guarantee confidentiality, when used on its own. A noticeable issue is that in our previous example, a block cipher would generate the same encrypted output for any of Alice's BUY orders, as they all have the same content. Furthermore, each block cipher has its own assumptions that can lead to subtle vulnerabilities if the cipher is used directly.

Symmetric key block ciphers are combined with operating modes to form symmetric encryption schemes. Most operating modes require a random initialization vector (IV) to be used for each message, as shown in Figure 39. When analyzing the security of systems based on these cryptosystems, an understanding of the IV generation process is as important as ensuring the confidentiality of the encryption key.

Figure 39: Symmetric key block ciphers are combined with operating modes. Most operating modes require a random initialization vector (IV) to be generated for each encrypted message.

Counter (CTR) and Cipher Block Chaining (CBC) are examples of operating modes recommended [45] by the United States National Institute of Standards and Technology (NIST), which informs the NSA's requirements. Combining a block cipher, such as AES, with an operating mode, such as CTR, results in an encryption method, such as AES-CTR, which can be used to add confidentiality guarantees.

In the asymmetric key setting, there is no concept equivalent to operating modes. Each block cipher has its own assumptions, and requires a specialized scheme for general-purpose usage.

The RSA algorithm is used in conjunction with padding methods, the most popular of which are the methods described in the Public-Key Cryptography Standard (PKCS) #1 versions 1.5 [110] and 2.0 [111].
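The following Python sketch, built on the third-party cryptography package, shows a block cipher (AES) combined with an operating mode into a complete encryption method, with a fresh random IV/nonce generated for each message as in Figure 39. AES-GCM and the 96-bit nonce size are illustrative choices for this sketch; GCM also adds the integrity protection discussed in § 3.1.3.

```python
import secrets
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=128)    # AES-128 block cipher key
aead = AESGCM(key)

def encrypt(message: bytes) -> bytes:
    # A fresh random IV/nonce per message keeps identical plaintexts
    # (e.g. two BUY orders) from producing identical ciphertexts.
    nonce = secrets.token_bytes(12)
    return nonce + aead.encrypt(nonce, message, None)

def decrypt(blob: bytes) -> bytes:
    nonce, ciphertext = blob[:12], blob[12:]
    return aead.decrypt(nonce, ciphertext, None)   # raises if tampered with

assert decrypt(encrypt(b"BUY")) == b"BUY"
```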

A security analysis of a system that uses RSA-based encryption must take the padding method into consideration. For example, the padding in PKCS #1 v1.5 can leak the private key under certain circumstances [23]. While PKCS #1 v2.0 solves this issue, it is complex enough that some implementations have their own security issues [132].

Asymmetric encryption algorithms have much higher computational requirements than symmetric encryption algorithms. Therefore, when non-trivial quantities of data are encrypted, the sender generates a single-use secret key that is used to encrypt the data, and encrypts the secret key with the receiver's public key, as shown in Figure 40.

Figure 40: Asymmetric key encryption is generally used to bootstrap a symmetric key encryption scheme.

3.1.3 Integrity

Many cryptosystems that provide integrity guarantees are built upon secure hashing functions. These hash functions operate on an unbounded amount of input data and produce a small fixed-size output. Secure hash functions have a few guarantees, such as pre-image resistance, which states that an adversary cannot produce input data corresponding to a given hash output.

At the time of this writing, the most popular secure hashing function is the Secure Hashing Algorithm (SHA) [48]. However, due to security issues in SHA-1 [171], new software is recommended to use at least 256-bit SHA-2 [21] for secure hashing.

The SHA hash functions are members of a large family of block hash functions that consume their input in fixed-size message blocks, and use a fixed-size internal state. A block hash function is used as shown in Figure 41. An INITIALIZE algorithm is first invoked to set the internal state to its initial values. An EXTEND algorithm is executed for each message block in the input. After the entire input is consumed, a FINALIZE algorithm produces the hash output from the internal state.

Figure 41: A block hash function operates on fixed-size message blocks and uses a fixed-size internal state.

In the symmetric key setting, integrity guarantees are obtained using a Message Authentication Code (MAC) cryptosystem, illustrated in Figure 42. The sender uses a MAC algorithm that reads in a symmetric key and a variable-length message, and produces a fixed-length, short MAC tag. The receiver provides the original message, the symmetric key, and the MAC tag to a MAC verification algorithm that checks the authenticity of the message.

Figure 42: In the symmetric key setting, integrity is assured by computing a Message Authentication Code (MAC) tag and transmitting it over the network along with the message. The receiver feeds the MAC tag into a verification algorithm that checks the message's authenticity.
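The INITIALIZE/EXTEND/FINALIZE structure in Figure 41 maps directly onto the incremental hashing interface of Python's standard hashlib module, as the short sketch below shows; SHA-256 is used as the example block hash function.

```python
import hashlib

# INITIALIZE: set the fixed-size internal state to its initial values.
state = hashlib.sha256()

# EXTEND: absorb the input one chunk at a time; the internal state stays
# fixed-size no matter how much data is consumed.
for chunk in (b"first message block", b"second message block"):
    state.update(chunk)

# FINALIZE: produce the fixed-size output (32 bytes for SHA-256).
digest = state.digest()

# Hashing all the data at once yields the same output.
assert digest == hashlib.sha256(b"first message block"
                                b"second message block").digest()
```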

The key property of MAC cryptosystems is that an adversary cannot produce a MAC tag that will validate a message without the secret key.

Many MAC cryptosystems do not have a separate MAC verification algorithm. Instead, the receiver checks the authenticity of the MAC tag by running the same algorithm as the sender to compute the expected MAC tag for the received message, and compares the output with the MAC tag received from the network.

This is the case for the Hash Message Authentication Code (HMAC) [122] generic construction, whose operation is illustrated in Figure 43. HMAC can use any secure hash function, such as SHA, to build a MAC cryptosystem.

Figure 43: In the symmetric key setting, integrity is assured by computing a Hash-based Message Authentication Code (HMAC) and transmitting it over the network along with the message. The receiver re-computes the HMAC and compares it against the version received from the network.
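A minimal sketch of the HMAC flow in Figure 43, using Python's standard hmac module with SHA-256; the hard-coded key stands in for a secret that would have been distributed out of band.

```python
import hashlib
import hmac

key = b"shared secret key"            # distributed out of band in practice

def mac(message: bytes) -> bytes:
    # Sender side: compute the HMAC tag transmitted along with the message.
    return hmac.new(key, message, hashlib.sha256).digest()

def verify(message: bytes, tag: bytes) -> bool:
    # Receiver side: there is no separate verification algorithm; just
    # recompute the tag and compare (compare_digest avoids timing leaks).
    return hmac.compare_digest(mac(message), tag)

tag = mac(b"BUY")
assert verify(b"BUY", tag)
assert not verify(b"SELL EVERYTHING", tag)   # Eve cannot forge a valid tag
```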
attractive because the sender does not need to maintain At the time of this writing, the most popular choice for any state; the receiver, however, must store the nonces of guaranteeing integrity in shared secret settings is HMAC- all received messages. SHA, an HMAC function that uses SHA for hashing. Nonces are often combined with a message timestamp- Authenticated encryption, which combines a block ing and expiration scheme, as shown in Figure 46. An cipher with an operating mode that offers both confi- expiration can greatly reduce the receiver’s storage re- dentiality and integrity guarantees, is often an attractive quirement, as the nonces for expired messages can be alternative to HMAC. The most popular authenticated safely discarded. However, the scheme depends on the 36

Alternatively, nonces can be used in challenge-response protocols, in a manner that removes the storage overhead concerns. The challenger generates a nonce and embeds it in the challenge message. The response to the challenge includes an acknowledgement of the embedded nonce, so the challenger can distinguish between a fresh response and a replay attack. The nonce is only stored by the challenger, and is small in comparison to the rest of the state needed to validate the response.

3.2 Cryptographic Constructs

This section summarizes two constructs that are built on the cryptographic primitives described in § 3.1, and are used in the rest of this work.

3.2.1 Certificate Authorities

Asymmetric key cryptographic primitives assume that each party has the correct public keys for the other parties. This assumption is critical, as the entire security argument of an asymmetric key system rests on the fact that certain operations can only be performed by the owners of the private keys corresponding to the public keys. More concretely, if Eve can convince Bob that her own public key belongs to Alice, Eve can produce message signatures that seem to come from Alice.
The introductory material in § 3.1 assumed that each party transmits their public key over a channel with integrity guarantees. In practice, this is not a reasonable assumption, and the secure distribution of public keys is still an open research problem.

The most widespread solution to the public key distribution problem is the Certificate Authority (CA) system, which assumes the existence of a trusted authority whose public key is securely transmitted to all the other parties in the system.

The CA is responsible for securely obtaining the public key of each party, and for issuing a certificate that binds a party's identity (e.g., "Alice") to its public key, as shown in Figure 47.

A certificate is essentially a cryptographic signature produced by the private key of the certificate's issuer, who is generally a CA. The message signed by the issuer states that a public key belongs to a subject. The certificate message generally contains identifiers that state the intended use of the certificate, such as "the key in this certificate can only be used to sign e-mail messages". The certificate message usually also includes an identifier for the issuer's certification policy, which summarizes the means taken by the issuer to ensure the authenticity of the subject's public key.

Figure 47: A certificate is a statement signed by a certificate authority (issuer) binding the identity of a subject to a public key.

A major issue in a CA system is that there is no obvious way to revoke a certificate. A revocation mechanism is desirable to handle situations where a party's private key is accidentally exposed, to avoid having an attacker use the certificate to impersonate the compromised party. While advanced systems for certificate revocation have been developed, the first line of defense against key compromise is adding expiration dates to certificates.

In a CA system, each party presents its certificate along with its public key. Any party that trusts the CA and has obtained the CA's public key securely can verify any certificate using the process illustrated in Figure 48.

Figure 48: A certificate issued by a CA can be validated by any party that has securely obtained the CA's public key. If the certificate is valid, the subject public key contained within can be trusted to belong to the subject identified by the certificate.

One of the main drawbacks of the CA system is that the CA's private key becomes a very attractive attack target. This issue is somewhat mitigated by minimizing the use of the CA's private key, which reduces the opportunities for its compromise. The authority described above becomes the root CA, and their private key is only used to produce certificates for the intermediate CAs who, in turn, are responsible for generating certificates for the other parties in the system, as shown in Figure 49.

In hierarchical CA systems, the only public key that gets distributed securely to all the parties is the root CA's public key. Therefore, when two parties wish to interact, each party must present their own certificate, as well as the certificate of the issuing CA. For example, given the hierarchy in Figure 49, Alice would prove the authenticity of her public key to Bob by presenting her certificate, as well as the certificate of Intermediate CA 1. Bob would first use the steps in Figure 48 to validate Intermediate CA 1's certificate against the root CA's public key, which would assure him of the authenticity of Intermediate CA 1's public key. Bob would then validate Alice's certificate using Intermediate CA 1's public key, which he now trusts.
The CA system is very similar to the identity document (ID card) systems used to establish a person's identity, and a comparison between the two may help further the reader's understanding of the concepts in the CA system.

In most countries, the government issues ID cards for its citizens, and therefore acts as a certificate authority. An ID card, shown in Figure 50, is a certificate that binds a subject's identity, which is a full legal name, to the subject's physical appearance, which is used as a public key.

Each government's ID card issuing operations are regulated by laws, so an ID card's issue date can be used to track down the laws that make up its certification policy. Last, the security of ID cards does not (yet) rely on cryptographic primitives. Instead, ID cards include physical security measures designed to deter tampering and prevent counterfeiting.
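The validation walk in Figures 48 and 49 can be illustrated with a toy model. Each "certificate" below is just a subject name and a public key signed by the issuer, using Ed25519 from the third-party cryptography package; the expiration and usage checks of Figure 48 are omitted, and the format bears no relation to real certificate standards such as X.509.

```python
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric import ed25519

def raw(public_key) -> bytes:
    return public_key.public_bytes(serialization.Encoding.Raw,
                                   serialization.PublicFormat.Raw)

def issue(issuer_private, subject: str, subject_public) -> dict:
    # A certificate is the issuer's signature over "subject owns this key".
    message = subject.encode() + raw(subject_public)
    return {"subject": subject, "public_key": subject_public,
            "signature": issuer_private.sign(message)}

def validate(cert: dict, issuer_public) -> bool:
    message = cert["subject"].encode() + raw(cert["public_key"])
    try:
        issuer_public.verify(cert["signature"], message)
        return True
    except InvalidSignature:
        return False

# Hierarchy from Figure 49: root CA -> Intermediate CA 1 -> Alice.
root_key = ed25519.Ed25519PrivateKey.generate()
ca1_key = ed25519.Ed25519PrivateKey.generate()
alice_key = ed25519.Ed25519PrivateKey.generate()

ca1_cert = issue(root_key, "Intermediate CA 1", ca1_key.public_key())
alice_cert = issue(ca1_key, "Alice", alice_key.public_key())

# Bob holds only the root CA's public key; he validates CA 1's certificate
# against it, then Alice's certificate against CA 1's now-trusted key.
assert validate(ca1_cert, root_key.public_key())
assert validate(alice_cert, ca1_cert["public_key"])
```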

Figure 49: A hierarchical CA structure minimizes the usage of the root CA's private key, reducing the opportunities for it to get compromised. The root CA only signs the certificates of intermediate CAs, which sign the end users' certificates.

Figure 50: An ID card is a certificate that binds a subject's full legal name (identity) to the subject's physical appearance, which acts as a public key.

3.2.2 Key Agreement Protocols

The initial design of symmetric key primitives, introduced in § 3.1, assumed that when two parties wish to interact, one party generates a secret key and shares it with the other party using a communication channel with confidentiality and integrity guarantees. In practice, a pre-existing secure communication channel is rarely available.

Key agreement protocols are used by two parties to establish a shared secret key, and only require a communication channel with integrity guarantees. Figure 51 outlines the Diffie-Hellman Key Exchange (DKE) [43] protocol, which should give the reader an intuition for how key agreement protocols work.

This work is interested in using key agreement protocols to build larger systems, so we will neither explain the mathematical details in DKE, nor prove its correctness. We note that both Alice and Bob derive the same shared secret key, K = g^(AB) mod p, without ever transmitting K. Furthermore, the messages transmitted in DKE, namely g^A mod p and g^B mod p, are not sufficient for an eavesdropper Eve to determine K, because efficiently solving for x in g^x mod p is an open problem assumed to be very difficult.

Key agreement protocols require a communication channel with integrity guarantees. If an active adversary Eve can tamper with the messages transmitted by Alice and Bob, she can perform a man-in-the-middle (MITM) attack, as illustrated in Figure 52.

In a MITM attack, Eve intercepts Alice's first key exchange message, and sends Bob her own message. Eve then intercepts Bob's response and replaces it with her own, which she sends to Alice. Eve effectively performs key exchanges with both Alice and Bob, establishing a shared secret with each of them, with neither Bob nor Alice being aware of her presence.

Figure 51: In the Diffie-Hellman Key Exchange (DKE) protocol, Alice and Bob agree on a shared secret key K = g^(AB) mod p. An adversary who observes g^A mod p and g^B mod p cannot compute K.

Figure 52: Any key agreement protocol is vulnerable to a man-in-the-middle (MITM) attack. The active attacker performs key agreements and establishes shared secrets with both parties. The attacker can then forward messages between the victims, in order to observe their communication. The attacker can also send its own messages to either, impersonating the other victim.

After establishing shared keys with both Alice and Bob, Eve can choose to observe the communication between Alice and Bob, by forwarding messages between them. For example, when Alice transmits a message, Eve can decrypt it using K1, the shared key between herself and Alice. Eve can then encrypt the message with K2, the key established between Bob and herself. While Bob still receives Alice's message, Eve has been able to see its contents.

Furthermore, Eve can impersonate either party in the communication. For example, Eve can create a message, encrypt it with K2, and then send it to Bob. As Bob thinks that K2 is a shared secret key established between himself and Alice, he will believe that Eve's message comes from Alice.

MITM attacks on key agreement protocols can be foiled by authenticating the party who sends the last message in the protocol (in our examples, Bob) and having them sign the key agreement messages. When a CA system is in place, Bob uses his private key to sign the messages in the key agreement and also sends Alice his certificate, along with the certificates for any intermediate CAs. Alice validates Bob's certificate, ensures that the subject identified by the certificate is whom she expects (Bob), and verifies that the key agreement messages exchanged between herself and Bob match the signature provided by Bob.

In conclusion, a key agreement protocol can be used to bootstrap symmetric key primitives from an asymmetric key signing scheme, where only one party needs to be able to sign messages.
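The DKE exchange of Figure 51 can be sketched in a few lines of Python using built-in modular exponentiation. The toy parameters p = 23 and g = 5 keep the numbers readable; real deployments use standardized groups whose primes are thousands of bits long.

```python
import secrets

# Pre-established public parameters (toy values chosen for readability).
p, g = 23, 5

# Each party picks a random private exponent and publishes g^x mod p.
a = secrets.randbelow(p - 2) + 1          # Alice's secret A
b = secrets.randbelow(p - 2) + 1          # Bob's secret B
msg_from_alice = pow(g, a, p)             # g^A mod p, sent over the network
msg_from_bob = pow(g, b, p)               # g^B mod p, sent over the network

# Both sides derive the same key without ever transmitting it:
# (g^B)^A = (g^A)^B = g^(AB) mod p.
key_alice = pow(msg_from_bob, a, p)
key_bob = pow(msg_from_alice, b, p)
assert key_alice == key_bob

# An eavesdropper only sees g^A mod p and g^B mod p; recovering A or B
# would require solving the discrete logarithm problem.
```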
still receives Alice’s message, Eve has been able to see 3.3.1 Authenticated Key Agreement its contents. Furthermore, Eve can impersonate either party in the Software attestation can be combined with a key agree- communication. For example, Eve can create a message, ment protocol (§ 3.2.2), as software attestation provides encrypt it with K2, and then send it to Bob. As Bob the authentication required by the key agreement pro- thinks that K2 is a shared secret key established between tocol. The resulting protocol can assure a verifier that himself and Alice, he will believe that Eve’s message it has established a shared secret with a specific piece 40

The next paragraph outlines the augmented protocol, using Diffie-Hellman Key Exchange (DKE) [43] as an example of the key exchange protocol.

The verifier starts executing the key exchange protocol, and sends the first message, g^A, to the software inside the secure container. The software inside the container produces the second key exchange message, g^B, and asks the trusted hardware to attest the cryptographic hash of both key exchange messages, h(g^A || g^B). The verifier receives the second key exchange message and the attestation signature, and authenticates the software inside the secure container by checking all the signatures along the attestation chain of trust shown in Figure 53.

Figure 53: The chain of trust in software attestation. The root of trust is a manufacturer key, which produces an endorsement certificate for the secure processor's attestation key. The processor uses the attestation key to produce the attestation signature, which contains a cryptographic hash of the container and a message produced by the software inside the container.

The chain of trust used in software attestation is rooted at a signing key owned by the hardware manufacturer, which must be trusted by the verifier. The manufacturer acts as a Certificate Authority (CA, § 3.2.1), and provisions each secure processor that it produces with a unique attestation key, which is used to produce attestation signatures. The manufacturer also issues an endorsement certificate for each secure processor's attestation key. The certificate indicates that the key is meant to be used for software attestation. The certification policy generally states that, at the very least, the private part of the attestation key be stored in tamper-resistant hardware, and only be used to produce attestation signatures.
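The verifier-side checks described above can be summarized in a toy sketch. Everything below is a schematic stand-in: Ed25519 signatures via the third-party cryptography package, a bare signature in place of a real endorsement certificate, and byte strings in place of the DKE messages. It mirrors the structure of Figure 53, not any real attestation implementation.

```python
import hashlib
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric import ed25519

def raw(public_key) -> bytes:
    return public_key.public_bytes(serialization.Encoding.Raw,
                                   serialization.PublicFormat.Raw)

def valid(public_key, signature: bytes, message: bytes) -> bool:
    try:
        public_key.verify(signature, message)
        return True
    except InvalidSignature:
        return False

# --- Manufacturer / hardware side (normally opaque to the verifier) ---
manufacturer_key = ed25519.Ed25519PrivateKey.generate()
attestation_key = ed25519.Ed25519PrivateKey.generate()
endorsement_sig = manufacturer_key.sign(raw(attestation_key.public_key()))

container_measurement = hashlib.sha256(b"container code and data").digest()
g_a, g_b = b"g^A mod p", b"g^B mod p"        # stand-ins for the DKE messages
attestation_sig = attestation_key.sign(
    container_measurement + hashlib.sha256(g_a + g_b).digest())

# --- Verifier side: the three checks behind Figure 53 ---
# 1. The endorsement links the attestation key to the trusted manufacturer.
assert valid(manufacturer_key.public_key(), endorsement_sig,
             raw(attestation_key.public_key()))
# 2. The attestation signature covers the measurement and h(g^A || g^B).
assert valid(attestation_key.public_key(), attestation_sig,
             container_measurement + hashlib.sha256(g_a + g_b).digest())
# 3. The measurement must match the software the verifier expects.
assert container_measurement == hashlib.sha256(b"container code and data").digest()
```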
based on software attestation must assume that each con- The chain of trust used in software attestation is rooted tainer is involved in software attestation, and that the re- at a signing key owned by the hardware manufacturer, mote party will refuse to interact with a container whose which must be trusted by the verifier. The manufacturer reported measurement does not match the expected value acts as a Certificate Authority (CA, § 3.2.1), and provi- set by the distributed system’s author. sions each secure processor that it produces with a unique For example, a cloud infrastructure provider should attestation key, which is used to produce attestation sig- be able to use the secure containers provided by trusted natures. The manufacturer also issues an endorsement hardware to run any software she wishes on her com- certificate for each secure processor’s attestation key. puters. However, the provider makes money by renting The certificate indicates that the key is meant to be used her infrastructure to customers. If security savvy cus- for software attestation. The certification policy gener- tomers are only willing to rent containers provided by ally states that, at the very least, the private part of the trusted hardware, and use software attestation to authen- attestation key be stored in tamper-resistant hardware, ticate the containers that they use, the cloud provider will and only be used to produce attestation signatures. have a strong financial incentive to build the customers’ 41
42.containers according to their specifications, so that the More expensive physical attacks that still require rela- containers pass the software attestation. tively little effort target the debug ports of various periph- A container’s measurement is computed using a se- erals. The cost of these attacks is generally dominated cure hashing algorithm, so the only method of building by the expense of acquiring the development kits needed a container that matches an expected measurement is to to connect to the debug ports. For example, recent Intel follow the exact sequence of steps specified by the dis- processors include the Generic Debug eXternal Connec- tributed system’s author. The cryptographic properties of tion (GDXC) [124, 197], which collects and filters the the secure hash function guarantee that if the computer’s data transferred by the uncore’s ring bus (§ 2.11.3), and owner strays in any way from the prescribed sequence reports it to an external debugger. of steps, the measurement of the created container will The threat models of secure architectures generally not match the value expected by the distributed system’s ignore debug port attacks, under the assumption that de- author, so the container will be rejected by the software vices sold for general consumption have their debug ports attestation process. irreversibly disabled. In practice, manufacturers have Therefore, it makes sense to state that a trusted hard- strong incentives to preserve debugging ports in produc- ware design’s measurement scheme guarantees that a tion hardware, as this facilitates the diagnosis and repair property has a certain value in a secure container. The of defective units. Due to insufficient documentation precise meaning of this phrase is that the property’s value on this topic, we ignore the possibility of GDXC-based determines the data used to compute the container’s mea- attacks. surement, so an expected measurement hash effectively specifies an expected value for the property. All contain- 3.4.2 Bus Tapping Attacks ers in a distributed system that correctly uses software More complex physical attacks consist of installing a attestation will have the desired value for the given prop- device that taps a bus on the computer’s motherboard erty. (§ 2.9.1). Passive attacks are limited to monitoring the For example, the measuring scheme used by trusted bus traffic, whereas active attacks can modify the traf- hardware designed for cloud infrastructure should guar- fic, or even place new commands on the bus. Replay antee that the container’s memory was initialized using attacks are a notoriously challenging class of active at- the customer’s content, often referred to as an image. tacks, where the attacker first records the bus traffic, and 3.4 Physical Attacks then selectively replays a subset of the traffic. Replay Physical attacks are generally classified according to attacks bypass systems that rely on static signatures or their cost, which factors in the equipment needed to carry HMACs, and generally aim to double-spend a limited out the attack and the attack’s complexity. Joe Grand’s resource. DefCon presentation [69] provides a good overview with The cost of bus tapping attacks is generally dominated a large number of intuition-building figures and photos. by the cost of the equipment used to tap the bus, which The simplest type of physical attack is a denial of increases with bus speed and complexity. 
For example, service attack performed by disconnecting the victim the flash chip that stores the computer’s firmware is con- computer’s power supply or network cable. The threat nected to the PCH via an SPI bus (§ 2.9.1), which is models of most secure architectures ignore this attack, simpler and much slower than the DDR bus connecting because denial of service can also be achieved by soft- DRAM to the CPU. Consequently, tapping the SPI bus is ware attacks that compromise system software such as much cheaper than tapping the DDR bus. For this reason, the hypervisor. systems whose security relies on a cryptographic hash of the firmware will first copy the firmware into DRAM, 3.4.1 Port Attacks hash the DRAM copy of the firmware, and then execute Slightly more involved attacks rely on connecting a de- the firmware from DRAM. vice to an existing port on the victim computer’s case or Although the speed of the DDR bus makes tapping motherboard (§ 2.9.1). A simple example is a cold boot very difficult, there are well-publicized records of suc- attack, where the attacker plugs in a USB flash drive into cessful attempts. The original Xbox console’s booting the victim’s case and causes the computer to boot from process was reverse-engineered, thanks to a passive tap the flash drive, whose malicious system software receives on the DRAM bus [81], which showed that the firmware unrestricted access to the computer’s peripherals. used to boot the console was partially stored in its south- 42
43.bridge. The protection mechanisms of the PlayStation 3 the chip is operating. These attacks are orders of magni- hypervisor were subverted by an active tap on its memory tude more expensive than imaging attacks, because the bus [80] that targeted the hypervisor’s page tables. attacker must maintain the integrity of the chip’s circuitry, The Ascend secure processor (§ 4.10) shows that con- and therefore cannot de-layer the chip. cealing the addresses of the DRAM cells accessed by The simplest active attacks on a chip create or destroy a program is orders of magnitude more expensive than an electric connection between two components. For protecting the memory’s contents. Therefore, we are example, the debugging functionality in many chips is interested in analyzing attacks that tap the DRAM bus, disabled by “blowing” an e-fuse. Once this e-fuse is but only use the information on the address lines. These located, an attacker can reconnect its two ends, effec- attacks use the same equipment as normal DRAM bus tively undoing the “blowing” operation. More expensive tapping attacks, but require a significantly more involved attacks involve changing voltages across a component as analysis to learn useful information. One of the dif- the chip is operating, and are typically used to reverse- ficulties of such attacks is that the memory addresses engineer complex circuits. observed on the DRAM bus are generally very different Surprisingly, active attacks are not significantly more from the application’s memory access patterns, because expensive to carry out than passive non-destructive at- of the extensive cache hierarchies in modern processors tacks. This is because the tools used to measure the (§ 2.11). voltage across specific components are not very different We are not aware of any successful attack based on from the tools that can tamper with the chip’s electric tapping the address lines of a DRAM bus and analyzing circuits. Therefore, once an attacker develops a process the sequence of memory addresses. for accessing a module without destroying the chip’s circuitry, the attacker can use the same process for both 3.4.3 Chip Attacks passive and active attacks. The most equipment-intensive physical attacks involve At the architectural level, we cannot address physical removing a chip’s packaging and directly interacting with attacks against the CPU’s chip package. Active attacks its electrical circuits. These attacks generally take advan- on the CPU change the computer’s execution semantics, tage of equipment and techniques that were originally leaving us without any hardware that can be trusted to developed to diagnose design and manufacturing defects make security decisions. Passive attacks can read the in chips. [22] covers these techniques in depth. private data that the CPU is processing. Therefore, many The cost of chip attacks is dominated by the required secure computing architectures assume that the processor equipment, although the reverse-engineering involved chip package is invulnerable to physical attacks. is also non-trivial. This cost grows very rapidly as the Thankfully, physical attacks can be deterred by reduc- circuit components shrink. At the time of this writing, ing the value that an attacker obtains by compromising the latest Intel CPUs have a 14nm feature size, which an individual chip. As long as this value is below the cost requires ion beam microscopy. 
of carrying out the physical attack, a system’s designer The least expensive classes of chip attacks are destruc- can hope that the processor’s chip package will not be tive, and only require imaging the chip’s circuitry. These targeted by the physical attacks. attacks rely on a microscope capable of capturing the Architects can reduce the value of compromising an necessary details in each layer, and equipment for me- individual system by avoiding shared secrets, such as chanically removing each layer and exposing the layer global encryption keys. Chip designers can increase the below it to the microscope. cost of a physical attack by not storing a platform’s se- Imaging attacks generally target global secrets shared crets in hardware that is vulnerable to destructive attacks, by all the chips in a family, such as ROM masks that store such as e-fuses. global encryption keys or secret boot code. They are also used to reverse-engineer undocumented functionality, 3.4.4 Power Analysis Attacks such as debugging backdoors. E-fuses and polyfuses are An entirely different approach to physical attacks con- particularly vulnerable to imaging attacks, because of sists of indirectly measuring the power consumption of a their relatively large sizes. computer system or its components. The attacker takes Non-destructive passive chip attacks require measur- advantage of a known correlation between power con- ing the voltages across a module at specific times, while sumption and the computed data, and learns some prop- 43
44.erty of the data from the observed power consumption. [161] describes all the programmable hardware inside The earliest power analysis attacks have directly mea- Intel computers, and outlines the security implications of sured the processor chip’s power consumption. For ex- compromising the software running it. ample, [120] describes a simple power analysis (SPA) SMM, the most privileged execution level, is only used attack that exploits the correlation between the power to handle a specific kind of interrupts (§ 2.12), namely consumed by a smart card chip’s CPU and the type of System Management Interrupts (SMI). SMIs were ini- instruction it executed, and learned a DSA key that the tially designed exclusively for hardware use, and were smart card was supposed to safeguard. only triggered by asserting a dedicated pin (SMI#) in the While direct power analysis attacks necessitate some CPU’s chip package. However, in modern systems, sys- equipment, their costs are dominated by the complexity tem software can generate an SMI by using the LAPIC’s of the analysis required to learn the desired informa- IPI mechanism. This opens up the avenue for SMM- tion from the observed power trace which, in turn, is based software exploits. determined by the complexity of the processor’s circuitry. The SMM handler is stored in System Manage- Today’s smart cards contain special circuitry [177] and ment RAM (SMRAM) which, in theory, is not acces- use hardened algorithms [76] designed to frustrate power sible when the processor isn’t running in SMM. How- analysis attacks. ever, its protection mechanisms were bypassed multi- Recent work demonstrated successful power analysis ple times [44, 112, 162, 187], and SMM-based rootk- attacks against full-blown out-of-order Intel processors its [49, 184] have been demonstrated. Compromising using inexpensive off-the-shelf sensor equipment. [60] the SMM grants an attacker access to all the software on extracts an RSA key from GnuPG running on a laptop the computer, as SMM is the most privileged execution using a microphone that measures its acoustic emissions. mode. [59] and [58] extract RSA keys from power analysis- Xen [198] is a very popular representative of the fam- resistant implementations using a voltage meter and a ily of hypervisors that run in VMX root mode and use radio. All these attacks can be performed quite easily by hardware virtualization. At 150,000 lines of code [11], a disgruntled data center employee. Xen’s codebase is relatively small, especially when com- Unfortunately, power analysis attacks can be extended pared to a kernel. However, Xen still has had over 40 to displays and human input devices, which cannot be security vulnerabilities patched in each of the last three secured in any reasonable manner. For example, [180] years (2012-2014) [10]. documented a very early attack that measures the radia- [134] proposes using a very small hypervisor together tion emitted by a CRT display’s ion beam to reconstitute with Intel TXT’s dynamic root of trust for measurement the image on a computer screen in a different room. [123] (DRTM) to implement trusted execution. [181] argues extended the attack to modern LCD displays. [199] used that a dynamic root of trust mechanism, like Intel TXT, a directional microphone to measure the sound emitted is necessary to ensure a hypervisor’s integrity. Unfor- by a keyboard and learn the password that its operator tunately, the TXT design requires an implementation typed. 
[146] applied similar techniques to learn a user’s complex enough that exploitable security vulnerabilities input on a smartphone’s on-screen keyboard, based on have creeped in [188, 189]. Furthermore, any SMM data from the device’s accelerometer. attack can be used to compromise TXT [186]. In general, power attacks cannot be addressed at the The monolithic kernel design leads to many opportu- architectural level, as they rely on implementation de- nities for security vulnerabilities in kernel code. Linux tails that are decided during the manufacturing process. is by far the most popular kernel for IaaS cloud environ- Therefore, it is unsurprising that the secure computing ar- ments. Linux has 17 million lines of code [16], and has chitectures described in § 4 do not protect against power had over 100 security vulnerabilities patched in each of analysis attacks. the last three years (2012-2014) [8, 33]. 3.5 Privileged Software Attacks 3.6 Software Attacks on Peripherals The rest of this section points to successful exploits that Threat models for secure architectures generally only execute at each of the privilege levels described in § 2.3, consider software attacks that directly target other com- motivating the SGX design decision to assume that all ponents in the software stack running on the CPU. This the privileged software on the computer is malicious. assumption results in security arguments with the very 44
45.desirable property of not depending on implementation The DRAM engineers probably only thought of non- details, such as the structure of the motherboard hosting malicious software and assumed that an individual the processor chip. DRAM cell cannot be accessed too often, as repeated ac- The threat models mentioned above must classify at- cesses to the same memory address would be absorbed by tacks from other motherboard components as physical the CPU’s caches (§ 2.11). However, malicious software attacks. Unfortunately, these models would mis-classify can take advantage of the CLFLUSH instruction, which all the attacks described in this section, which can be flushes the cache line that contains a given DRAM ad- carried out solely by executing software on the victim dress. CLFLUSH is intended as a method for applications processor. The incorrect classification matters in cloud to extract more performance out of the cache hierarchy, computing scenarios, where physical attacks are signifi- and is therefore available to software running at all priv- cantly more expensive than software attacks. ilege levels. Rowhammer exploited the combination of CLFLUSH’s availability and the DRAM engineers’ in- 3.6.1 PCI Express Attacks valid assumptions, to obtain capabilities that are normally associated with an active DRAM bus attack. The PCIe bus (§ 2.9.1) allows any device connected to the bus to perform Direct Memory Access (DMA), read- 3.6.3 The Performance Monitoring Side Channel ing from and writing to the computer’s DRAM without the involvement of a CPU core. Each device is assigned Intel’s Software Development Manual (SDM) [100] and a range of DRAM addresses via a standard PCI config- Optimization Reference Manual [95] describe a vast ar- uration mechanism, but can perform DMA on DRAM ray of performance monitoring events exposed by recent addresses outside of that range. Intel processors, such as branch mispredictions (§ 2.10). Without any additional protection mechanism, an at- The SDM also describes digital temperature sensors em- tacker who compromises system software can take ad- bedded in each CPU core, whose readings are exposed vantage of programmable devices to access any DRAM using Model-Specific Registers (MSRs) (§ 2.4) that can region, yielding capabilities that were traditionally asso- be read by system software. ciated with a DRAM bus tap. For example, an early im- An attacker who compromises a computer’s system plementation of Intel TXT [70] was compromised by pro- software and gains access to the performance monitoring gramming a PCIe NIC to read TXT-reserved DRAM via events or the temperature sensors can obtain the informa- DMA transfers [188]. Recent versions have addressed tion needed to carry out a power analysis attack, which this attack by adding extra security checks in the DMA normally requires physical access to the victim computer bus arbiter. § 4.5 provides a more detailed description of and specialized equipment. Intel TXT. 3.6.4 Attacks on the Boot Firmware and Intel ME 3.6.2 DRAM Attacks Virtually all motherboards store the firmware used to boot The rowhammer DRAM bit-flipping attack [72, 117, the computer in a flash memory chip (§ 2.9.1) that can be 164] is an example of a different class of software attacks written by system software. This implementation strategy that exploit design defects in the computer’s hardware. provides an inexpensive avenue for deploying firmware Rowhammer took advantage of the fact that some mobile bug fixes. 
At the same time, an attack that compromises DRAM chips (§ 2.9.1) refreshed the DRAM’s contents the system software can subvert the firmware update slowly enough that repeatedly changing the contents of a mechanism to inject malicious code into the firmware. memory cell could impact the charge stored in a neigh- The malicious code can be used to carry out a cold boot boring cell, which resulted in changing the bit value attack, which is typically considered a physical attack. obtained from reading the cell. By carefully targeting Furthermore, malicious firmware can run code at the specific memory addresses, the attackers caused bit flips highest software privilege level, System Management in the page tables used by the CPU’s address translation Mode (SMM, § 2.3). Last, malicious firmware can mod- (§ 2.5) mechanism, and in other data structures used to ify the system software as it is loaded during the boot make security decisions. process. These avenues give the attacker capabilities The defect exploited by the rowhammer attack most that have traditionally been associated with DRAM bus likely stems from an incorrect design assumption. tapping attacks. 45
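To make the rowhammer discussion in § 3.6.2 concrete, the sketch below shows the access pattern at the heart of the attack: two addresses are read repeatedly, and CLFLUSH ensures that every read reaches DRAM instead of being absorbed by the cache hierarchy. This is a minimal illustration rather than a working exploit; the two addresses are assumed to map to different rows in the same DRAM bank, and a real attack needs careful address selection, many more iterations, and a vulnerable DRAM module.

#include <emmintrin.h>  /* _mm_clflush (SSE2) */
#include <stdint.h>

/* Repeatedly activate two DRAM rows; CLFLUSH evicts the lines so the
 * next reads are served by DRAM rather than by the caches. */
static void hammer(volatile uint8_t *addr1, volatile uint8_t *addr2,
                   uint64_t iterations) {
    for (uint64_t i = 0; i < iterations; i++) {
        (void)*addr1;                       /* read activates addr1's row  */
        (void)*addr2;                       /* read activates addr2's row  */
        _mm_clflush((const void *)addr1);   /* force the next read to DRAM */
        _mm_clflush((const void *)addr2);
    }
}

Because CLFLUSH is available at all privilege levels, this loop runs entirely in ring 3, which is what gives the attack capabilities normally associated with an active DRAM bus attack.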
46. The Intel Management Engine (ME) [160] loads its whose ME firmware contained the AMT application. firmware from the same flash memory chip as the main computer, which opens up the possibility of compromis- 3.6.5 Accounting for Software Attacks on Peripherals ing its firmware. Due to its vast management capabilities (§ 2.9.2), a compromised ME would leak most of the pow- The attacks described in this section show that a system ers that come with installing active probes on the DRAM whose threat model assumes no software attacks must bus, the PCI bus, and the System Management bus (SM- be designed with an understanding of all the system’s Bus), as well as power consumption meters. Thanks to buses, and the programmable devices that may be at- its direct access to the motherboard’s Ethernet PHY, the tached to them. The system’s security analysis must probe would be able to communicate with the attacker argue that the devices will not be used in physical-like while the computer is in the Soft-Off state, also known attacks. The argument will rely on barriers that prevent as S5, where the computer is mostly powered off, but is untrusted software running on the CPU from communi- still connected to a power source. The ME has signifi- cating with other programmable devices, and on barriers cantly less computational power than probe equipment, that prevent compromised programmable devices from however, as it uses low-power embedded components, tampering with sensitive buses or DRAM. such as a 200-400MHz execution core, and about 600KB Unfortunately, the ME, PCH and DMI are Intel- of internal RAM. proprietary and largely undocumented, so we cannot The computer and ME firmware are protected by a assess the security of the measures set in place to pro- few security measures. The first line of defense is a tect the ME from being compromised, and we cannot security check in the firmware’s update service, which reason about the impact of a compromised ME that runs only accepts firmware updates that have been digitally malicious software. signed by a manufacturer key that is hard-coded in the 3.7 Address Translation Attacks firmware. This protection can be circumvented with relative ease by foregoing the firmware’s update services, § 3.5 argues that today’s system software is virtually and instead accessing the flash memory chip directly, via guaranteed to have security vulnerabilities. This suggests the PCH’s SPI bus controller. that a cautious secure architecture should avoid having The deeper, more powerful, lines of defense against the system software in the TCB. firmware attacks are rooted in the CPU and ME’s hard- However, removing the system software from the TCB ware. The bootloader in the ME’s ROM will only load requires the architecture to provide a method for isolat- flash firmware that contains a correct signature generated ing sensitive application code from the untrusted system by a specific Intel RSA key. The ME’s boot ROM con- software. This is typically accomplished by designing tains the SHA-256 cryptographic hash of the RSA public a mechanism for loading application code in isolated key, and uses it to validate the full Intel public key stored containers whose contents can be certified via software in the signature. Similarly, the microcode bootstrap pro- attestation (§ 3.3). 
One of the more difficult problems cess in recent CPUs will only execute firmware in an these designs face is that application software relies on Authenticated Code Module (ACM, § 2.13.2) signed by the memory management services provided by the sys- an Intel key whose SHA-256 hash is hard-coded in the tem software, which is now untrusted. microcode ROM. Intel’s SGX [14, 137], leaves the system software in However, both the computer firmware security checks charge of setting up the page tables (§ 2.5) used by ad- [54, 190] and the ME security checks [176] have been dress translation, inspired by Bastion [31], but instanti- subverted in the past. While the approaches described ates access checks that prevent the system software from above are theoretically sound, the intricate details and directly accessing the isolated container’s memory. complex interactions in Intel-based systems make it very This section discusses some attacks that become rel- likely that security vulnerabilities will creep into im- evant when the application software does not trust the plementations. Further proving this point, a security system software, which is in charge of the page tables. analysis [183] found that early versions of Intel’s Active Understanding these attacks is a prerequisite to reasoning Management Technology (AMT), the flagship ME appli- about the security properties of architectures with this cation, contained an assortment of security issues that threat model. For example, many of the mechanisms in allowed an attacker to completely take over a computer SGX target a subset of the attacks described here. 46
47.3.7.1 Passive Attacks In the most straightforward setting, the malicious sys- tem software directly modifies the page tables of the System software uses the CPU’s address translation fea- application inside the container, as shown in Figure 54, ture (§ 2.5) to implement page swapping, where infre- so the virtual address intended to store the errorOut quently used memory pages are evicted from DRAM procedure is actually mapped to a DRAM page that con- to a slower storage medium. Page swapping relies the tains the disclose procedure. Without any security accessed (A) and dirty (D) page table entry attributes measures in place, when the application’s code jumps (§ 2.5.3) to identify the DRAM pages to be evicted, and to the virtual address of the errorOut procedure, the on a page fault handler (§ 2.8.2) to bring evicted pages CPU will execute the code of the disclose procedure back into DRAM when they are accessed. instead. Unfortunately, the features that support efficient page swapping turn into a security liability, when the system Application code written by Application code seen by CPU developer software managing the page tables is not trusted by the application software using the page tables. The system Security Security PASS PASS software can be prevented from reading the application’s Check Check memory directly by placing the application in an iso- FAIL FAIL lated container. However, potentially malicious system 0x41000 0x41000 errorOut(): errorOut(): software can still infer partial information about the ap- write error write error return return plication’s memory access patterns, by observing the 0x42000 disclose(): 0x42000 disclose(): application’s page faults and page table attributes. write data write data return return We consider this class of attacks to be passive attacks Virtual Page that exploit the CPU’s address translation feature. It addresses tables DRAM pages may seem that the page-level memory access patterns Figure 54: An example of an active memory mapping attack. The provided by these attacks are not very useful. However, application’s author intends to perform a security check, and only [193] describes how this attack can be carried out against call the procedure that discloses the sensitive information if the check Intel’s SGX, and implements the attack in a few practical passes. Malicious system software maps the virtual address of the settings. In one scenario, which is particularly concern- procedure that is called when the check fails, to a DRAM page that contains the disclosing procedure. ing for medical image processing, the outline of a JPEG image is inferred while the image is decompressed inside 3.7.3 Active Attacks Using Page Swapping a container protected by SGX’s isolation guarantees. The most obvious active attacks on memory mapping can be defeated by tracking the correct virtual address 3.7.2 Straightforward Active Attacks for each DRAM page that belongs to a protected con- We define active address translation attacks to be the tainer. However, a naive protection measure based on class of attacks where malicious system software modi- address tracking can be defeated by a more subtle ac- fies the page tables used by an application in a way that tive attack that relies on the architectural support for breaks the virtual memory abstraction (§ 2.5). Memory page swapping. 
Figure 55 illustrates an attack that does mapping attacks do not include scenarios where the sys- not modify the application’s page tables, but produces tem software breaks the memory abstraction by directly the same corrupted CPU view of the application as the writing to the application’s memory pages. straight-forward attack described above. We begin with an example of a straight-forward active In the swapping attack, malicious system soft- attack. In this example, the application inside a protected ware evicts the pages that contain the errorOut container performs a security check to decide whether to and disclose procedures from DRAM to a slower disclose some sensitive information. Depending on the medium, such as a hard disk. The system software ex- security check’s outcome, the enclave code either calls changes the hard disk bytes storing the two pages, and a errorOut procedure, or a disclose procedure. then brings the two pages back into DRAM. Remarkably, The simplest version of the attack assumes that each all the steps taken by this attack are indistinguishable procedure’s code starts at a page boundary, and takes up from legitimate page swapping activity, with the excep- less than a page. These assumptions are relaxed in more tion of the I/O operations that exchange the disk bytes complex versions of the attack. storing evicted pages. 47
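To make the errorOut / disclose example concrete, the toy C fragment below sketches the victim code assumed by Figure 54: each procedure is page-aligned and fits in its own page, and a security check decides which one is called. The handle_request wrapper and the alignment attribute are illustrative assumptions rather than code from any real enclave; the point is only that a malicious mapping of errorOut's page onto the page holding disclose defeats the check without touching the application's code.

#include <stdio.h>

/* Each procedure starts at a page boundary and takes up less than a
 * page, as the simplest version of the attack assumes. */
__attribute__((aligned(4096))) static void errorOut(void) {
    puts("error: access denied");
}

__attribute__((aligned(4096))) static void disclose(const char *secret) {
    puts(secret);               /* reveals the sensitive information */
}

/* Hypothetical wrapper around the security check described in the text. */
void handle_request(int check_passed, const char *secret) {
    if (check_passed)
        disclose(secret);
    else
        errorOut();             /* under the attack, this lands in disclose() */
}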
48.Page tables and DRAM before swapping Page tables and TLB before swapping DRAM Virtual Physical Contents Virtual Physical Physical Contents 0x41000 0x19000 errorOut HDD / SSD 0x41000 0x19000 0x19000 errorOut 0x42000 0x1A000 disclose errorOut 0x42000 0x1A000 0x1A000 disclose disclose Page tables and DRAM after swapping Page tables after swapping Virtual Physical Contents HDD / SSD Virtual Physical 0x41000 0x19000 disclose errorOut 0x41000 0x1A000 0x42000 0x1A000 errorOut 0x42000 0x19000 disclose Figure 55: An active memory mapping attack where the system software does not modify the page tables. Instead, two pages are Stale TLB after swapping DRAM evicted from DRAM to a slower storage medium. The malicious Virtual Physical Physical Contents system software swaps the two pages’ contents then brings them back 0x41000 0x19000 0x19000 disclose into DRAM, building the same incorrect page mapping as the direct 0x42000 0x1A000 0x1A000 errorOut attack shown in Figure 54. This attack defeats protection measures that rely on tracking the virtual and disk addresses for DRAM pages. Figure 56: An active memory mapping attack where the system software does not invalidate a core’s TLBs when it evicts two pages The subtle attack described in this section can be de- from DRAM and exchanges their locations when reading them back in. The page tables are updated correctly, but the core with stale TLB feated by cryptographically binding the contents of each entries has the same incorrect view of the protected container’s code page that is evicted from DRAM to the virtual address as in Figure 54. to which the page should be mapped. The cryptographic primitive (§ 3.1) used to perform the binding must ob- code that the application developer intended. Therefore, viously guarantee integrity. Furthermore, it must also the attack will pass any security checks that rely upon guarantee freshness, in order to foil replay attacks where cryptographic associations between page contents and the system software “undoes” an application’s writes by page table data, as long as the checks are performed by evicting one of its DRAM pages to disk and bringing in the core used to load pages back into DRAM. However, an older version of the same page. the core that executes the protected container’s code still uses the old page table data, because the system software 3.7.4 Active Attacks Based on TLBs did not invalidate its TLB entries. Assuming the TLBs Today’s multi-core architectures can be subjected to an are not subjected to any additional security checks, this even more subtle active attack, illustrated in Figure 56, attack causes the same private information leak as the which can bypass any protection measures that solely previous examples. focus on the integrity of the page tables. In order to avoid the attack described in this sec- For performance reasons, each execution core caches tion, the trusted software or hardware that implements address translation results in its own translation look- protected containers must also ensure that the system aside buffer (TLB, § 2.11.5). For simplicity, the TLBs software invalidates the relevant TLB entries on all the are not covered by the cache coherence protocol that cores when it evicts a page from a protected container to synchronizes data caches across cores. Instead, the sys- DRAM. tem software is responsible for invalidating TLB entries across all the cores when it modifies the page tables. 
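The page-binding defense described above can be illustrated with a short sketch. Assuming an OpenSSL-style HMAC-SHA256 and a per-container key held by the trusted hardware (both illustrative choices, not SGX's actual mechanism), the MAC computed at eviction time covers the page contents, the virtual address the page must be mapped back to, and a monotonic version counter that provides freshness; error handling is omitted.

#include <openssl/evp.h>
#include <openssl/hmac.h>
#include <stdint.h>
#include <string.h>

#define PAGE_BYTES 4096

/* Computed when a page is evicted, then recomputed and compared when
 * the page is loaded back; a mismatch means the contents, the mapping,
 * or the version counter were tampered with. */
static void mac_evicted_page(const uint8_t key[32],
                             uint64_t virt_addr, uint64_t version,
                             const uint8_t page[PAGE_BYTES],
                             uint8_t mac_out[32]) {
    uint8_t header[16];
    memcpy(header, &virt_addr, 8);     /* bind the page to its mapping    */
    memcpy(header + 8, &version, 8);   /* bind it to a freshness counter  */

    unsigned int mac_len = 32;
    HMAC_CTX *ctx = HMAC_CTX_new();
    HMAC_Init_ex(ctx, key, 32, EVP_sha256(), NULL);
    HMAC_Update(ctx, header, sizeof(header));
    HMAC_Update(ctx, page, PAGE_BYTES);
    HMAC_Final(ctx, mac_out, &mac_len);
    HMAC_CTX_free(ctx);
}

A real design must also protect the version counters themselves, for example by keeping them in trusted memory, so that the system software cannot roll them back together with the page.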
3.8 Cache Timing Attacks Malicious system software can take advantage of the Cache timing attacks [19] are a powerful class of soft- design decisions explained above by carrying out the fol- ware attacks that can be mounted entirely by application lowing attack. While the same software used in the previ- code running at ring 3 (§ 2.3). Cache timing attacks do ous examples is executing on a core, the system software not learn information by reading the victim’s memory, executes on a different core and evicts the errorOut so they bypass the address translation-based isolation and disclose pages from DRAM. As in the previous measures (§ 2.5) implemented in today’s kernels and attack, the system software loads the disclose code hypervisors. in the DRAM page that previously held errorOut. In 3.8.1 Theory this attack, however, the system software also updates the page tables. Cache timing attacks exploit the unfortunate dependency The core where the system software executed sees the between the location of a memory access and the time 48
49.it takes to perform the access. A cache miss requires processor’s cache coherence implementation (§ 2.11.3). at least one memory access to the next level cache, and The cache sharing requirement implies that L3 cache might require a second memory access if a write-back attacks are feasible in an IaaS environment, whereas L2 occurs. On the Intel architecture, the latency between cache attacks become a significant concern when running a cache hit and a miss can be easily measured by the sensitive software on a user’s desktop. RDTSC and RDTSCP instructions (§ 2.4), which read a Out-of-order execution (§ 2.10) can introduce noise in high-resolution time-stamp counter. These instructions cache timing attacks. First, memory accesses may not have been designed for benchmarking and optimizing be performed in program order, which can impact the software, so they are available to ring 3 software. lines selected by the cache eviction algorithms. Second, The fundamental tool of a cache timing attack is an out-of-order execution may result in cache fills that do attacker process that measures the latency of accesses to not correspond to executed instructions. For example, a carefully designated memory locations in its own address load that follows a faulting instruction may be scheduled space. The memory locations are chosen so that they and executed before the fault is detected. map to the same cache lines as those of some interesting Cache timing attacks must account for speculative ex- memory locations in a victim process, in a cache that is ecution, as mispredicted memory accesses can still cause shared between the attacker and the victim. This requires cache fills. Therefore, the attacker may observe cache in-depth knowledge of the shared cache’s organization fills that don’t correspond to instructions that were actu- (§ 2.11.2). ally executed by the victim software. Memory prefetch- Armed with the knowledge of the cache’s organization, ing adds further noise to cache timing attacks, as the the attacker process sets up the attack by accessing its attacker may observe cache fills that don’t correspond own memory in such a way that it fills up all the cache to instructions in the victim code, even when accounting sets that would hold the victim’s interesting memory lo- for speculative execution. cations. After the targeted cache sets are full, the attacker allows the victim process to execute. When the victim 3.8.3 Known Cache Timing Attacks process accesses an interesting memory location in its Despite these difficulties, cache timing attacks are known own address space, the shared cache must evict one of to retrieve cryptographic keys used by AES [25, 144], the cache lines holding the attacker’s memory locations. RSA [28], Diffie-Hellman [121], and elliptic-curve cryp- As the victim is executing, the attacker process repeat- tography [27]. edly times accesses to its own memory locations. When Early attacks required access to the victim’s CPU core, the access times indicate that a location was evicted from but more sophisticated recent attacks [129, 194] are able the cache, the attacker can conclude that the victim ac- to use the L3 cache, which is shared by all the cores on cessed an interesting memory location in its own cache. a CPU die. 
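The timing primitive underlying these attacks is simple enough to sketch. The fragment below uses the RDTSCP instruction, which is available to ring 3 code, to decide whether a probe address was served from the cache; the hit/miss threshold is an assumed value that must be calibrated per CPU, and production attacks add serializing fences and statistical filtering to cope with the out-of-order execution and prefetching noise described above.

#include <stdint.h>
#include <x86intrin.h>   /* __rdtscp */

#define HIT_THRESHOLD_CYCLES 80   /* assumed; must be calibrated per CPU */

/* Returns non-zero if the access to *probe completed quickly enough to
 * have been served by a cache, rather than by DRAM. */
static int probe_is_cached(volatile uint8_t *probe) {
    unsigned int aux;
    uint64_t start = __rdtscp(&aux);
    (void)*probe;                     /* the timed memory access */
    uint64_t end = __rdtscp(&aux);
    return (end - start) < HIT_THRESHOLD_CYCLES;
}

In a prime-and-probe style attack, the attacker calls a routine like this on each of its own carefully chosen memory locations after the victim has run, and interprets a miss as evidence that the victim touched a location mapping to the same cache set.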
L3-based attacks can be particularly dev- Over time, the attacker collects the results of many mea- astating in cloud computing scenarios, where running surements and learns a subset of the victim’s memory software on the same computer as a victim application access pattern. If the victim processes sensitive informa- only requires modest statistical analysis skills and a small tion using data-dependent memory fetches, the attacker amount of money [155]. Furthermore, cache timing at- may be able to deduce the sensitive information from the tacks were recently demonstrated using JavaScript code learned memory access pattern. in a page visited by a Web browser [143]. Given this pattern of vulnerabilities, ignoring cache 3.8.2 Practical Considerations timing attacks is dangerously similar to ignoring the Cache timing attacks require control over a software pro- string of demonstrated attacks which led to the depreca- cess that shares a cache memory with the victim process. tion of SHA-1 [3, 6, 9]. Therefore, a cache timing attack that targets the L2 cache 3.8.4 Defending against Cache Timing Attacks would have to rely on the system software to schedule a software thread on a logical processor in the same Fortunately, invalidating any of the preconditions for core as the target software, whereas an attack on the L3 cache timing attacks is sufficient for defending against cache can be performed using any logical processor on them. The easiest precondition to focus on is that the the same CPU. The latter attack relies on the fact that attacker must have access to memory locations that map the L3 cache is inclusive, which greatly simplifies the to the same sets in a cache as the victim’s memory. This 49
50.assumption can be invalidated by the judicious use of a 4.1 The IBM 4765 Secure Coprocessor cache partitioning scheme. Secure coprocessors [196] encapsulate an entire com- Performance concerns aside, the main difficulty asso- puter system, including a CPU, a cryptographic accel- ciated with cache partitioning schemes is that they must erator, caches, DRAM, and an I/O controller within a be implemented by a trusted party. When the system tamper-resistant environment. The enclosure includes software is trusted, it can (for example) use the prin- hardware that deters attacks, such as a Faraday cage, as ciples behind page coloring [115, 175] to partition the well as an array of sensors that can detect tampering caches [127] between mutually distrusting parties. This attempts. The secure coprocessor destroys the secrets comes down to setting up the page tables in such a way that it stores when an attack is detected. This approach that no two mutually distrusting software module are has good security properties against physical attacks, stored in physical pages that map to the same sets in but tamper-resistant enclosures are very expensive [15], any cache memory. However, if the system software relatively to the cost of a computer system. is not trusted, the cache partitioning scheme must be The IBM 4758 [170], and its most current-day suc- implemented in hardware. cessor, the IBM 4765 [2] (shown in Figure 57) are rep- The other interesting precondition is that the victim resentative examples of secure coprocessors. The 4758 must access its memory in a data-dependent fashion that was certified to withstand physical attacks to FIPS 140-1 allows the attacker to infer private information from the Level 4 [169], and the 4765 meets the rigors of FIPS observed memory access pattern. It becomes tempting 140-2 Level 4 [1]. to think that cache timing attacks can be prevented by Tamper-Resistant Enclosure eliminating data-dependent memory accesses from all Tamper Battery- Boot the code handling sensitive data. Detection and Backed Flash Loader NVRAM Response RAM ROM However, removing data-dependent memory accesses is difficult to accomplish in practice because instruction Hardware Access Control Logic fetches must also be taken into consideration. [113] Application Application Battery-Backed Service CPU CPU RAM CPU gives an idea of the level of effort required to remove data-dependent accesses from AES, which is a relatively System Bus simple data processing algorithm. At the time of this Random writing, we are not aware of any approach that scales to I/O Crypto Real-Time SDRAM Number Controller Accelerator Clock Generator large pieces of software. Module Interface While the focus of this section is cache timing at- tacks, we would like to point out that any shared re- PCIe I/O Controller Batteries source can lead to information leakage. A worrying PCI Express Card PCI Express Interface example is hyper-threading (§ 2.9.4), where each CPU core is represented as two logical processors, and the Figure 57: The IBM 4765 secure coprocessor consists of an entire threads executing on these two processors share execu- computer system placed inside an enclosure that can deter and de- tion units. An attacker who can run a process on a logical tect physical attacks. The application and the system use separate processors. 
Sensitive memory can only be accessed by the system processor sharing a core with a victim process can use code, thanks to access control checks implemented in the system bus’ RDTSCP [150] to learn which execution units are in use, hardware. Dedicated hardware is used to clear the platform’s secrets and infer what instructions are executed by the victim and shut down the system when a physical attack is detected. process. The 4765 relies heavily on physical isolation for its security properties. Its system software is protected from attacks by the application software by virtue of using 4 R ELATED W ORK a dedicated service processor that is completely sepa- rate from the application processor. Special-purpose bus This section describes the broader picture of trusted hard- logic prevents the application processor from accessing ware projects that SGX belongs to. Table 12 summarizes privileged resources, such as the battery-backed memory the security properties of SGX and the other trusted hard- that stores the system software’s secrets. ware presented here. The 4765 implements software attestation. The co- 50
51. Attack TrustZone TPM TPM+TXT SGX XOM Aegis Bastion Ascend, Sanctum Phantom Malicious N/A (secure N/A (The whole N/A (Does not Access checks on Identifier tag Security kernel Access checks OS separates Access checks containers (direct world is trusted) computer is one allow concurrent TLB misses checks separates on each containers on TLB misses probing) container) containers) containers memory access Malicious OS Access checks N/A (OS Host OS Access checks on OS has its own Security kernel Memory X Access checks (direct probing) on TLB misses measured and preempted during TLB misses identifier measured and encryption and on TLB misses trusted) late launch isolated HMAC Malicious Access checks N/A (Hypervisor Hypervisor Access checks on N/A (No N/A (No Hypervisor N/A (No Access checks hypervisor (direct on TLB misses measured and preempted during TLB misses hypervisor hypervisor measured and hypervisor on TLB misses probing) trusted) late launch support) support) trusted support) Malicious N/A (firmware is CPU microcode SINIT ACM signed SMM handler is N/A (Firmware N/A (Firmware Hypervisor N/A (Firmware Firmware is firmware a part of the measures PEI by Intel key and subject to TLB is not active is not active measured after is not active measured and secure world) firmware measured access checks after booting) after booting) boot after booting) trusted Malicious N/A (secure N/A (Does not N/A (Does not X X X X X Each enclave containers (cache world is trusted) allow concurrent allow concurrent its gets own timing) containers) containers) cache partition Malicious OS Secure world N/A (OS Host OS X N/A (Paging not X X X Per-enclave (page fault has own page measured and preempted during supported) page tables recording) tables trusted) late launch Malicious OS X N/A (OS Host OS X X X X X Non-enclave (cache timing) measured and preempted during software uses a 51 trusted) late launch separate cache partition DMA from On-chip bus X IOMMU bounces IOMMU bounces Equivalent to Equivalent to Equivalent to Equivalent to MC bounces malicious bounces secure DMA into TXT DMA into PRM physical DRAM physical DRAM physical DRAM physical DRAM DMA outside peripheral world accesses memory range access access access access allowed range Physical DRAM Secure world X X Undocumented DRAM DRAM DRAM DRAM X read limited to on- memory encryption encryption encryption encryption encryption chip SRAM engine Physical DRAM Secure world X X Undocumented HMAC of HMAC of Merkle tree over HMAC of X write limited to on- memory encryption address and address, data, DRAM address, data, chip SRAM engine data timestamp timestamp Physical DRAM Secure world X X Undocumented X Merkle tree Merkle tree over Merkle tree X rollback write limited to on- memory encryption over HMAC DRAM over HMAC chip SRAM engine timestamps timestamps Physical DRAM Secure world in X X X X X X ORAM X Table 12: Security features overview for the trusted hardware projects related to Intel’s SGX address reads on-chip SRAM Hardware TCB CPU chip Motherboard Motherboard CPU chip package CPU chip CPU chip CPU chip CPU chip CPU chip size package (CPU, TPM, (CPU, TPM, package package package package package DRAM, buses) DRAM, buses) Software TCB Secure world All software on SINIT ACM + VM Application module Application Application Application Application Application size (firmware, OS, the computer (OS, application) + privileged module + module + module + process + module + application) containers hypervisor security kernel hypervisor trusted OS security monitor
52.processor’s attestation key is stored in battery-backed world. ARM processor cores that include TrustZone’s memory that is only accessible to the service processor. “Security Extensions” can switch between the normal Upon reset, the service processor executes a first-stage world and the secure world when executing code. The bootloader stored in ROM, which measures and loads the address in each bus access executed by a core reflects the system software. In turn, the system software measures world in which the core is currently executing. the application code stored in NVRAM and loads it into The reset circuitry in a TrustZone processor places the DRAM chip accessible to the application processor. it in secure mode, and points it to the first-stage boot- The system software provides attestation services to the loader stored in on-chip ROM. TrustZone’s TCB includes application loaded inside the coprocessor. this bootloader, which initializes the platform, sets up 4.2 ARM TrustZone the TrustZone hardware to protect the secure container from untrusted software, and loads the normal world’s ARM’s TrustZone [13] is a collection of hardware mod- bootloader. The secure container must also implement ules that can be used to conceptually partition a system’s a monitor that performs the context switches needed to resources between a secure world, which hosts a secure transition an execution core between the two worlds. The container, and a normal world, which runs an untrusted monitor must also handle hardware exceptions, such as software stack. The TrustZone documentation [18] de- interrupts, and route them to the appropriate world. scribes semiconductor intellectual property cores (IP blocks) and ways in which they can be combined to The TrustZone design gives the secure world’s monitor achieve certain security properties, reflecting the fact that unrestricted access to the normal world, so the monitor ARM is an IP core provider, not a chip manufacturer. can implement inter-process communication (IPC) be- Therefore, the mere presence of TrustZone IP blocks in a tween the software in the two worlds. Specifically, the system is not sufficient to determine whether the system monitor can issue bus accesses using both secure and non- is secure under a specific threat model. Figure 58 illus- secure addresses. In general, the secure world’s software trates a design for a smartphone System-on-Chip (SoC) can compromise any level in the normal world’s software design that uses TrustZone IP blocks. stack. For example, the secure container’s software can jump into arbitrary locations in the normal world by flip- System-on-Chip Package ping a bit in a register. The untrusted software in the Interrupt Controller normal world can only access the secure world via an Processor instruction that jumps into a well-defined location inside with SRAM Processor 4G Modem Secure the monitor. without Extensions DMA Conceptually, each TrustZone CPU core provides sep- Secure Boot ROM TZMA Extensions Controller L2 Cache arate address translation units for the secure and normal worlds. This is implemented by two page table base AMBA AXI On-Chip Bus registers, and by having the page walker use the page L3 Cache table base corresponding to the core’s current world. The Real-Time OTP AXI to APB Clock Polyfuses Bridge physical addresses in the page table entries are extended AMBA AXI Bus to include the values of the secure bit to be issued on the APB Bus TZASC AXI bus. 
The secure world is protected from untrusted software by having the CPU core force the secure bit in Memory Memory Display Keypad ADC / DAC the address translation result to zero for normal world Controller Controller Controller Controller address translations. As the secure container manages its DRAM Flash Display Audio Keypad own page tables, its memory accesses cannot be directly observed by the untrusted OS’s page fault handler. Figure 58: Smartphone SoC design based on TrustZone. The red IP blocks are TrustZone-aware. The red connections ignore TrustZone-aware hardware modules, such as caches, the TrustZone secure bit in the bus address. Defining the system’s are trusted to use the secure address bit in each bus access security properties requires a complete understanding of all the red to enforce the isolation between worlds. For example, elements in this figure. TrustZone’s caches store the secure bit in the address TrustZone extends the address lines in the AMBA AXI tag for each cache line, which effectively provides com- system bus [17] with one signal that indicates whether pletely different views of the memory space to the soft- an access belongs to the secure or normal (non-secure) ware running in different worlds. This design assumes 52
53.that memory space is partitioned between the two worlds, line a method for implementing secure boot, which so no aliasing can occur. comes down to having the first-stage bootloader verify a The TrustZone documentation describes two TLB con- signature in the second-stage bootloader against a public figurations. If many context switches between worlds key whose cryptographic hash is burned into on-chip are expected, the TLB IP blocks can be configured to One-Time Programmable (OTP) polysilicon fuses. A include the secure bit in the address tag. Alternatively, hardware measurement root can be built on top of the the secure bit can be omitted from the TLBs, as long as same components, by storing a per-chip attestation key the monitor flushes the TLBs when switching contexts. in the polyfuses, and having the first-stage bootloader The hardware modules that do not consume Trust- measure the second-stage bootloader and store its hash Zone’s address bit are expected to be connected to the in an on-chip SRAM region allocated to the secure world. AXI bus via IP cores that implement simple partition- The polyfuses would be gated by a TZMA IP block that ing techniques. For example, the TrustZone Memory makes them accessible only to the secure world. Adapter (TZMA) can be used to partition an on-chip ROM or SRAM into a secure region and a normal region, 4.3 The XOM Architecture and the TrustZone Address Space Controller (TZASC) The execute-only memory (XOM) architecture [126] in- partitions the memory space provided by a DRAM con- troduced the approach of executing sensitive code and troller into secure and normal regions. A TrustZone- data in isolated containers managed by untrusted host aware DMA controller rejects DMA transfers from the software. XOM outlined the mechanisms needed to iso- normal world that reference secure world addresses. late a container’s data from its untrusted software envi- It follows that analyzing the security properties of a ronment, such as saving the register state to a protected TrustZone system requires a precise understanding of memory area before servicing an interrupt. the behavior and configuration of all the hardware mod- XOM supports multiple containers by tagging every ules that are attached to the AXI bus. For example, the cache line with the identifier of the container owning it, caches described in TrustZone’s documentation do not and ensures isolation by disallowing memory accesses enforce a complete separation between worlds, as they al- to cache lines that don’t match the current container’s low a world’s memory accesses to evict the other world’s identifier. The operating system and the untrusted appli- cache lines. This exposes the secure container software cations are considered to belong to a container with a to cache timing attacks from the untrusted software in the null identifier. normal world. Unfortunately, hardware manufacturers XOM also introduced the integration of encryption that license the TrustZone IP cores are reluctant to dis- and HMAC functionality in the processor’s memory con- close all the details of their designs, making it impossible troller to protect container memory from physical attacks for security researchers to reason about TrustZone-based on DRAM. The encryption and HMAC functionality is hardware. used for all cache line evictions and fetches, and the The TrustZone components do not have any counter- ECC bits in DRAM chips are repurposed to store HMAC measures for physical attacks. However, a system that values. 
follows the recommendations in the TrustZone documen- XOM’s design cannot guarantee DRAM freshness, so tation will not be exposed to physical attacks, under a the software in its containers is vulnerable to physical threat model that trusts the processor chip package. The replay attacks. Furthermore, XOM does not protect a AXI bus is designed to connect components in an SoC container’s memory access patterns, meaning that any design, so it cannot be tapped by an attacker. The Trust- piece of malicious software can perform cache timing Zone documentation recommends having all the code attacks against the software in a container. Last, XOM and data in the secure world stored in on-chip SRAM, containers are destroyed when they encounter hardware which is not subject to physical attacks. However, this ap- exceptions, such as page faults, so XOM does not support proach places significant limits on the secure container’s paging. functionality, because on-chip SRAM is many orders of XOM predates the attestation scheme described above, magnitude more expensive than a DRAM chip of the and relies on a modified software distribution scheme same capacity. instead. Each container’s contents are encrypted with TrustZone’s documentation does not describe any soft- a symmetric key, which also serves as the container’s ware attestation implementation. However, it does out- identity. The symmetric key, in turn, is encrypted with 53

XOM predates the attestation scheme described above, and relies on a modified software distribution scheme instead. Each container's contents are encrypted with a symmetric key, which also serves as the container's identity. The symmetric key, in turn, is encrypted with the public key of each CPU that is trusted to run the container. A container's author can be assured that the container is running on trusted software by embedding a secret into the encrypted container data, and using it to authenticate the container. While conceptually simpler than software attestation, this scheme does not allow the container author to vet the container's software environment.

4.4 The Trusted Platform Module (TPM)

The Trusted Platform Module (TPM) [71] introduced the software attestation model described at the beginning of this section. The TPM design does not require any hardware modifications to the CPU, and instead relies on an auxiliary tamper-resistant chip. The TPM chip is only used to store the attestation key and to perform software attestation. The TPM was widely deployed on commodity computers, because it does not rely on CPU modifications. Unfortunately, the cost of this approach is that the TPM has very weak security guarantees, as explained below.

The TPM design provides one isolation container, covering all the software running on the computer that has the TPM chip. It follows that the measurement included in an attestation signature covers the entire OS kernel and all the kernel modules, such as device drivers. However, commercial computers use a wide diversity of devices, and their system software is updated at an ever-increasing pace, so it is impossible to maintain a list of acceptable measurement hashes corresponding to a piece of trusted software. Due to this issue, the TPM's software attestation is not used in many security systems, despite its wide deployment.

The TPM design is technically not vulnerable to any software attacks, because it trusts all the software on the computer. However, a TPM-based system is vulnerable to an attacker who has physical access to the machine, as the TPM chip does not provide any isolation for the software on the computer. Furthermore, the TPM chip receives the software measurements from the CPU, so TPM-based systems are vulnerable to attackers who can tap the communication bus between the CPU and the TPM.

Last, the TPM's design relies on the software running on the CPU to report its own cryptographic hash. The TPM chip resets the measurements stored in Platform Configuration Registers (PCRs) when the computer is rebooted. Then, the TPM expects the software at each boot stage to cryptographically hash the software at the next stage, and send the hash to the TPM. The TPM updates the PCRs to incorporate the new hashes it receives, as shown in Figure 59. Most importantly, the PCR value at any point reflects all the software hashes received by the TPM up to that point. This makes it impossible for software that has been measured to "remove" itself from the measurement.

Figure 59: The measurement stored in a TPM platform configuration register (PCR). The PCR is reset when the system reboots. The software at every boot stage hashes the next boot stage, and sends the hash to the TPM. The PCR's new value incorporates both the old PCR value, and the new software hash.

For example, the firmware on most modern computers implements the platform initialization process in the Unified Extensible Firmware Interface (UEFI) specification [178]. Each platform initialization phase is responsible for verifying or measuring the firmware that implements the next phase. The SEC firmware initializes the TPM PCR, and then stores the PEI's measurement into a measurement register. In turn, the PEI implementation measures the DXE firmware and updates the measurement register that stores the PEI hash to account for the DXE hash. When the OS is booted, the hash in the measurement register accounts for all the firmware that was used to boot the computer.

Unfortunately, the security of the whole measurement scheme hinges on the requirement that the first hash sent to the TPM must reflect the software that runs in the first boot stage. The TPM threat model explicitly acknowledges this issue, and assumes that the firmware responsible for loading the first stage bootloader is securely embedded in the motherboard.
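The extend operation illustrated in Figure 59 can be sketched in C as follows. This is a behavioral model only — a real TPM performs the hashing and storage internally — and it assumes OpenSSL's SHA1 for the hash; the function names are invented for illustration.

/* Build with: cc pcr.c -lcrypto */
#include <openssl/sha.h>
#include <string.h>
#include <stdio.h>

#define PCR_SIZE SHA_DIGEST_LENGTH   /* 20 bytes for SHA-1 */

/* new_pcr = SHA-1(old_pcr || measurement): the PCR value always reflects
 * every hash that was ever sent to the TPM since the last reboot. */
static void pcr_extend(unsigned char pcr[PCR_SIZE],
                       const unsigned char measurement[PCR_SIZE]) {
    unsigned char buf[2 * PCR_SIZE];
    memcpy(buf, pcr, PCR_SIZE);
    memcpy(buf + PCR_SIZE, measurement, PCR_SIZE);
    SHA1(buf, sizeof(buf), pcr);
}

int main(void) {
    unsigned char pcr[PCR_SIZE] = {0};    /* PCR is reset to zero on reboot */
    unsigned char hash[PCR_SIZE];

    /* Each boot stage hashes the next stage and sends the hash to the TPM. */
    SHA1((const unsigned char *)"boot loader image", 17, hash);
    pcr_extend(pcr, hash);
    SHA1((const unsigned char *)"OS kernel image", 15, hash);
    pcr_extend(pcr, hash);

    for (int i = 0; i < PCR_SIZE; i++)
        printf("%02x", pcr[i]);
    printf("\n");
    return 0;
}

Because each new value depends on the previous one, software measured earlier in the boot sequence cannot later remove itself from the measurement.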

However, virtually every TPM-enabled computer stores its firmware in a flash memory chip that can be re-programmed in software (§ 2.9.1), so the TPM's measurement can be subverted by an attacker who can reflash the computer's firmware [29].

On very recent Intel processors, the attack described above can be defeated by having the initialization microcode (§ 2.14.4) hash the computer's firmware (specifically, the PEI code in UEFI [178] firmware) and communicate the hash to the TPM chip. This is marketed as the Measured Boot feature of Intel's Boot Guard [160].

Sadly, most computer manufacturers use Verified Boot (also known as "secure boot") instead of Measured Boot (also known as "trusted boot"). Verified Boot means that the processor's microcode only boots into PEI firmware that contains a signature produced by a key burned into the chip's e-fuses. Verified Boot does not impact the measurements stored on the TPM, so it does not improve the security of software attestation.

4.5 Intel's Trusted Execution Technology (TXT)

Intel's Trusted Execution Technology (TXT) [70] uses the TPM's software attestation model and auxiliary tamper-resistant chip, but reduces the software inside the secure container to a virtual machine (guest operating system and application) hosted by the CPU's hardware virtualization features (VMX [179]).

TXT isolates the software inside the container from untrusted software by ensuring that the container has exclusive control over the entire computer while it is active. This is accomplished by a secure initialization authenticated code module (SINIT ACM) that effectively performs a warm system reset before starting the container's VM.

TXT requires a TPM chip with an extended register set. The registers used by the measured boot process described in § 4.4 are considered to make up the platform's Static Root of Trust Measurement (SRTM). When a TXT VM is initialized, it updates TPM registers that make up the Dynamic Root of Trust Measurement (DRTM). While the TPM's SRTM registers only reset at the start of a boot cycle, the DRTM registers are reset by the SINIT ACM, every time a TXT VM is launched.

TXT does not implement DRAM encryption or HMACs, and therefore is vulnerable to physical DRAM attacks, just like TPM-based designs. Furthermore, early TXT implementations were vulnerable to attacks where a malicious operating system would program a device, such as a network card, to perform DMA transfers to the DRAM region used by a TXT container [186, 189]. In recent Intel CPUs, the memory controller is integrated on the CPU die, so the SINIT ACM can securely set up the memory controller to reject DMA transfers targeting TXT memory. An Intel chipset datasheet [104] documents an "Intel TXT DMA Protected Range" IIO configuration register.

Early TXT implementations did not measure the SINIT ACM. Instead, the microcode implementing the TXT launch instruction verified that the code module contained an RSA signature by a hard-coded Intel key. SINIT ACM signatures cannot be revoked if vulnerabilities are found, so TXT's software attestation had to be revised when SINIT ACM exploits [188] surfaced. Currently, the SINIT ACM's cryptographic hash is included in the attestation measurement.

Last, the warm reset performed by the SINIT ACM does not include the software running in System Management Mode (SMM). SMM was designed solely for use by firmware, and is stored in a protected memory area (SMRAM) which should not be accessible to non-SMM software. However, the SMM handler was compromised on multiple occasions [44, 49, 162, 184, 187], and an attacker who obtains SMM execution can access the memory used by TXT's container.

4.6 The Aegis Secure Processor

The Aegis secure processor [172] relies on a security kernel in the operating system to isolate containers, and includes the kernel's cryptographic hash in the measurement reported by the software attestation signature. The work in [174] argued that Physical Unclonable Functions (PUFs) [56] can be used to endow a secure processor with a tamper-resistant private key, which is required for software attestation. PUFs do not have the fabrication process drawbacks of EEPROM, and are significantly more resilient to physical attacks than e-fuses.

Aegis relies on a trusted security kernel to isolate each container from the other software on the computer by configuring the page tables used in address translation. The security kernel is a subset of a typical OS kernel, and handles virtual memory management, processes, and hardware exceptions. As the security kernel is a part of the trusted code base (TCB), its cryptographic hash is included in the software attestation measurement. The security kernel uses processor features to isolate itself from the untrusted part of the operating system, such as device drivers.

The Aegis memory controller encrypts the cache lines in one memory range, and HMACs the cache lines in one other memory range. The two memory ranges can overlap, and are configurable by the security kernel. Thanks to the two ranges, the memory controller can avoid the latency overhead of cryptographic operations for the DRAM outside containers. Aegis was the first secure processor not vulnerable to physical replay attacks, as it uses a Merkle tree construction [57] to guarantee DRAM freshness. The latency overhead of the Merkle tree is greatly reduced by augmenting the L2 cache with the tree nodes for the cache lines.

Aegis' security kernel allows the OS to page out container memory, but verifies the correctness of the paging operations. The security kernel uses the same encryption and Merkle tree algorithms as the memory controller to guarantee the confidentiality and integrity of the container pages that are swapped out from DRAM. The OS is free to page out container memory, so it can learn a container's memory access patterns, at page granularity. Aegis containers are also vulnerable to cache timing attacks.

4.7 The Bastion Architecture

The Bastion architecture [31] introduced the use of a trusted hypervisor to provide secure containers to applications running inside unmodified, untrusted operating systems. Bastion's hypervisor ensures that the operating system does not interfere with the secure containers. We only describe Bastion's virtualization extensions to architectures that use nested page tables, like Intel's VMX [179].

The hypervisor enforces the containers' desired memory mappings in the OS page tables, as follows. Each Bastion container has a Security Segment that lists the virtual addresses and permissions of all the container's pages, and the hypervisor maintains a Module State Table that stores an inverted page map, associating each physical memory page to its container and virtual address. The processor's hardware page walker is modified to invoke the hypervisor on every TLB miss, before updating the TLB with the address translation result. The hypervisor checks that the virtual address used by the translation matches the expected virtual address associated with the physical address in the Module State Table.

Bastion's cache lines are not tagged with container identifiers. Instead, only TLB entries are tagged. The hypervisor's TLB miss handler sets the container identifier for each TLB entry as it is created. Similarly to XOM and Aegis, the secure processor checks the TLB tag against the current container's identifier on every memory access.

Bastion offers the same protection against physical DRAM attacks as Aegis does, without the restriction that a container's data must be stored inside a continuous DRAM range. This is accomplished by extending cache lines and TLB entries with flags that enable memory encryption and HMACing. The hypervisor's TLB miss handler sets the flags on TLB entries, and the flags are propagated to cache lines on memory writes.

The Bastion hypervisor allows the untrusted operating system to evict secure container pages. The evicted pages are encrypted, HMACed, and covered by a Merkle tree maintained by the hypervisor. Thus, the hypervisor ensures the confidentiality, authenticity, and freshness of the swapped pages. However, the ability to freely evict container pages allows a malicious OS to learn a container's memory accesses with page granularity. Furthermore, Bastion's threat model excludes cache timing attacks.

Bastion does not trust the platform's firmware, and computes the cryptographic hash of the hypervisor after the firmware finishes playing its part in the booting process. The hypervisor's hash is included in the measurement reported by software attestation.

4.8 Intel SGX in Context

Intel's Software Guard Extensions (SGX) [14, 78, 137] implements secure containers for applications without making any modifications to the processor's critical execution path. SGX does not trust any layer in the computer's software stack (firmware, hypervisor, OS). Instead, SGX's TCB consists of the CPU's microcode and a few privileged containers. SGX introduces an approach to solving some of the issues raised by multi-core processors with a shared, coherent last-level cache.

SGX does not extend caches or TLBs with container identity bits, and does not require any security checks during normal memory accesses. As suggested in the TrustZone documentation, SGX always ensures that a core's TLBs only contain entries for the container that it is executing, which requires flushing the CPU core's TLBs when context-switching between containers and untrusted software.

SGX follows Bastion's approach of having the untrusted OS manage the page tables used by secure containers.
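Bastion's Module State Table check described above can be modeled with a few lines of C. The layout (one entry per physical page, indexed by physical page number) and the names are assumptions made for illustration; Bastion's actual hypervisor interface may differ.

#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_MASK  (~((1ULL << PAGE_SHIFT) - 1))

/* One Module State Table entry per physical page (illustrative layout). */
struct mst_entry {
    uint32_t container_id;   /* container that owns the physical page */
    uint64_t expected_va;    /* virtual address the page must be mapped at */
};

/* Invoked on every TLB miss, before the translation (va -> pa) computed
 * from the OS-managed page tables is inserted into the TLB. */
static bool tlb_fill_allowed(const struct mst_entry *mst,
                             uint64_t va, uint64_t pa,
                             uint32_t current_container) {
    const struct mst_entry *entry = &mst[pa >> PAGE_SHIFT];
    return entry->container_id == current_container &&
           entry->expected_va == (va & PAGE_MASK);
}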

The containers' security is preserved by a TLB miss handler that relies on an inverted page map (the EPCM) to reject address translations for memory that does not belong to the current container.

Like Bastion, SGX allows the untrusted operating system to evict secure container pages, in a controlled fashion. After the OS initiates a container page eviction, it must prove to the SGX implementation that it also switched the container out of all cores that were executing its code, effectively performing a very coarse-grained TLB shootdown.

SGX's microcode ensures the confidentiality, authenticity, and freshness of each container's evicted pages, like Bastion's hypervisor. However, SGX relies on a version-based Merkle tree, inspired by Aegis [172], and adds an innovative twist that allows the operating system to dynamically shape the Merkle tree. SGX also shares Bastion's and Aegis' vulnerability to memory access pattern leaks, namely a malicious OS can directly learn a container's memory accesses at page granularity, and any piece of software can perform cache timing attacks.

SGX's software attestation is implemented using Intel's Enhanced Privacy ID (EPID) group signature scheme [26], which is too complex for a microcode implementation. Therefore, SGX relies on an assortment of privileged containers that receive direct access to the SGX processor's hardware keys. The privileged containers are signed using an Intel private key whose corresponding public key is hard-coded into the SGX microcode, similarly to TXT's SINIT ACM.

As SGX does not protect against cache timing attacks, the privileged enclaves' authors cannot use data-dependent memory accesses. For example, cache attacks on the Quoting Enclave, which computes attestation signatures, would provide an attacker with a processor's EPID signing key and completely compromise SGX.

Intel's documentation states that SGX guarantees DRAM confidentiality, authentication, and freshness by virtue of a Memory Encryption Engine (MEE). The MEE is informally described in an ISCA 2015 tutorial [102], and appears to lack a formal specification. In the absence of further information, we assume that SGX provides the same protection against physical DRAM attacks that Aegis and Bastion provide.

4.9 Sanctum

Sanctum [38] introduced a straightforward software/hardware co-design that yields the same resilience against software attacks as SGX, and adds protection against memory access pattern leaks, such as page fault monitoring attacks and cache timing attacks.

Sanctum uses a conceptually simple cache partitioning scheme, where a computer's DRAM is split into equally-sized continuous DRAM regions, and each DRAM region uses distinct sets in the shared last-level cache (LLC). Each DRAM region is allocated to exactly one container, so containers are isolated in both DRAM and the LLC. Containers are isolated in the other caches by flushing on context switches.

Like XOM, Aegis, and Bastion, Sanctum also considers the hypervisor, OS, and the application software to conceptually belong to a separate container. Containers are protected from the untrusted outside software by the same measures that isolate containers from each other.

Sanctum relies on a trusted security monitor, which is the first piece of firmware executed by the processor, and has the same security properties as those of Aegis' security kernel. The monitor is measured by bootstrap code in the processor's ROM, and its cryptographic hash is included in the software attestation measurement. The monitor verifies the operating system's resource allocation decisions. For example, it ensures that no DRAM region is ever accessible to two different containers.

Each Sanctum container manages its own page tables mapping its DRAM regions, and handles its own page faults. It follows that a malicious OS cannot learn the virtual addresses that would cause a page fault in the container. Sanctum's hardware modifications work in conjunction with the security monitor to make sure that a container's page tables only reference memory inside the container's DRAM regions.

The Sanctum design focuses completely on software attacks, and does not offer protection from any physical attack. The authors expect Sanctum's hardware modifications to be combined with the physical attack protections in Aegis or Ascend.

4.10 Ascend and Phantom

The Ascend [52] and Phantom [130] secure processors introduced practical implementations of Oblivious RAM [65] techniques in the CPU's memory controller. These processors are resilient to attackers who can probe the DRAM address bus and attempt to learn a container's private information from its DRAM memory access pattern.

Implementing an ORAM scheme in a memory controller is largely orthogonal to the other secure architectures described above.

It follows, for example, that Ascend's ORAM implementation can be combined with Aegis' memory encryption and authentication, and with Sanctum's hardware extensions and security monitor, yielding a secure processor that can withstand both software attacks and physical DRAM attacks.

5 SGX PROGRAMMING MODEL

The central concept of SGX is the enclave, a protected environment that contains the code and data pertaining to a security-sensitive computation.

SGX-enabled processors provide trusted computing by isolating each enclave's environment from the untrusted software outside the enclave, and by implementing a software attestation scheme that allows a remote party to authenticate the software running inside an enclave. SGX's isolation mechanisms are intended to protect the confidentiality and integrity of the computation performed inside an enclave from attacks coming from malicious software executing on the same computer, as well as from a limited set of physical attacks.

This section summarizes the SGX concepts that make up a mental model which is sufficient for programmers to author SGX enclaves and to add SGX support to existing system software. All the information in this section is backed up by Intel's Software Developer Manual (SDM). The following section builds on the concepts introduced here to fill in some of the missing pieces in the manual, and analyzes some of SGX's security properties.

5.1 SGX Physical Memory Organization

The enclaves' code and data is stored in Processor Reserved Memory (PRM), which is a subset of DRAM that cannot be directly accessed by other software, including system software and SMM code. The CPU's integrated memory controllers (§ 2.9.3) also reject DMA transfers targeting the PRM, thus protecting it from access by other peripherals.

The PRM is a continuous range of memory whose bounds are configured using a base and a mask register with the same semantics as a variable memory type range (§ 2.11.4). Therefore, the PRM's size must be an integer power of two, and its start address must be aligned to the same power of two. Due to these restrictions, checking if an address belongs to the PRM can be done very cheaply in hardware, using the circuit outlined in § 2.11.4.

The SDM does not describe the PRM and the PRM range registers (PRMRR). These concepts are documented in the SGX manuals [94, 98] and in one of the SGX papers [137]. Therefore, the PRM is a micro-architectural detail that might change in future implementations of SGX. Our security analysis of SGX relies on implementation details surrounding the PRM, and will have to be re-evaluated for future SGX implementations.

5.1.1 The Enclave Page Cache (EPC)

The contents of enclaves and the associated data structures are stored in the Enclave Page Cache (EPC), which is a subset of the PRM, as shown in Figure 60.

Figure 60: Enclave data is stored into the EPC, which is a subset of the PRM. The PRM is a contiguous range of DRAM that cannot be accessed by system software or peripherals.

The SGX design supports having multiple enclaves on a system at the same time, which is a necessity in multi-process environments. This is achieved by having the EPC split into 4 KB pages that can be assigned to different enclaves. The EPC uses the same page size as the architecture's address translation feature (§ 2.5). This is not a coincidence, as future sections will reveal that the SGX implementation is tightly coupled with the address translation implementation.

The EPC is managed by the same system software that manages the rest of the computer's physical memory. The system software, which can be a hypervisor or an OS kernel, uses SGX instructions to allocate unused pages to enclaves, and to free previously allocated EPC pages. The system software is expected to expose enclave creation and management services to application software.

Non-enclave software cannot directly access the EPC, as it is contained in the PRM. This restriction plays a key role in SGX's enclave isolation guarantees, but creates an obstacle when the system software needs to load the initial code and data into a newly created enclave. The SGX design solves this problem by having the instructions that allocate an EPC page to an enclave also initialize the page. Most EPC pages are initialized by copying data from a non-PRM memory page.
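The base-and-mask check described in § 5.1 can be written out in a few lines of C. The concrete values below (a 128 MB PRM at 0x80000000) and the variable names are invented for illustration; the real PRMRR values are platform-specific and undocumented.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical PRM configuration: 128 MB starting at 0x80000000.
 * The mask keeps the address bits above the (power-of-two) range size. */
static const uint64_t prm_base = 0x80000000ULL;
static const uint64_t prm_mask = ~(0x8000000ULL - 1);   /* ~(128 MB - 1) */

/* Mirrors the cheap hardware check: an address falls inside the PRM iff
 * its high-order bits match the base's high-order bits. */
static bool in_prm(uint64_t phys_addr) {
    return (phys_addr & prm_mask) == (prm_base & prm_mask);
}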

5.1.2 The Enclave Page Cache Map (EPCM)

The SGX design expects the system software to allocate the EPC pages to enclaves. However, as the system software is not trusted, SGX processors check the correctness of the system software's allocation decisions, and refuse to perform any action that would compromise SGX's security guarantees. For example, if the system software attempts to allocate the same EPC page to two enclaves, the SGX instruction used to perform the allocation will fail.

In order to perform its security checks, SGX records some information about the system software's allocation decisions for each EPC page in the Enclave Page Cache Map (EPCM). The EPCM is an array with one entry per EPC page, so computing the address of a page's EPCM entry only requires a bitwise shift operation and an addition.

The EPCM's contents are only used by SGX's security checks. Under normal operation, the EPCM does not generate any software-visible behavior, and enclave authors and system software developers can mostly ignore it. Therefore, the SDM only describes the EPCM at a very high level, listing the information contained within and noting that the EPCM is "trusted memory". The SDM does not disclose the storage medium or memory layout used by the EPCM.

The EPCM uses the information in Table 13 to track the ownership of each EPC page. We defer a full discussion of the EPCM to a later section, because its contents are intimately coupled with all of SGX's features, which will be described over the next few sections.

Field        | Bits | Description
VALID        | 1    | 0 for un-allocated EPC pages
PT           | 8    | page type
ENCLAVESECS  |      | identifies the enclave owning the page

Table 13: The fields in an EPCM entry that track the ownership of pages.

The SGX instructions that allocate an EPC page set the VALID bit of the corresponding EPCM entry to 1, and refuse to operate on EPC pages whose VALID bit is already set.

The instruction used to allocate an EPC page also determines the page's intended usage, which is recorded in the page type (PT) field of the corresponding EPCM entry. The pages that store an enclave's code and data are considered to have a regular type (PT_REG in the SDM). The pages dedicated to the storage of SGX's supporting data structures are tagged with special types. For example, the PT_SECS type identifies pages that hold SGX Enclave Control Structures, which will be described in the following section. The other EPC page types will be described in future sections.

Last, a page's EPCM entry also identifies the enclave that owns the EPC page. This information is used by the mechanisms that enforce SGX's isolation guarantees to prevent an enclave from accessing another enclave's private information. As the EPCM identifies a single owning enclave for each EPC page, it is impossible for enclaves to communicate via shared memory using EPC pages. Fortunately, enclaves can share untrusted non-EPC memory, as will be discussed in § 5.2.3.

5.1.3 The SGX Enclave Control Structure (SECS)

SGX stores per-enclave metadata in an SGX Enclave Control Structure (SECS) associated with each enclave. Each SECS is stored in a dedicated EPC page with the page type PT_SECS. These pages are not intended to be mapped into any enclave's address space, and are exclusively used by the CPU's SGX implementation.

An enclave's identity is almost synonymous with its SECS. The first step in bringing an enclave to life allocates an EPC page to serve as the enclave's SECS, and the last step in destroying an enclave deallocates the page holding its SECS. The EPCM entry field identifying the enclave that owns an EPC page points to the enclave's SECS. The system software uses the virtual address of an enclave's SECS to identify the enclave when invoking SGX instructions.

All SGX instructions take virtual addresses as their inputs. Given that SGX instructions use SECS addresses to identify enclaves, the system software must create entries in its page tables pointing to the SECS of the enclaves it manages. However, the system software cannot access any SECS page, as these pages are stored in the PRM. SECS pages are not intended to be mapped inside enclaves' virtual address spaces, and SGX-enabled processors explicitly prevent enclave code from accessing SECS pages.

This seemingly arbitrary limitation is in place so that the SGX implementation can store sensitive information in the SECS, and be able to assume that no potentially malicious software will access that information. For example, the SDM states that each enclave's measurement is stored in its SECS. If software were able to modify an enclave's measurement, SGX's software attestation scheme would provide no security assurances.

The SECS is strongly coupled with many of SGX's features. Therefore, the pieces of information that make up the SECS will be gradually introduced as the different aspects of SGX are described.
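The shift-and-add lookup mentioned in § 5.1.2 can be sketched as follows. The entry layout and the base variables are assumptions made for illustration, since the SDM does not document how or where the EPCM is stored.

#include <stdint.h>

#define EPC_PAGE_SHIFT 12            /* 4 KB EPC pages */

/* Illustrative EPCM entry holding the Table 13 ownership fields. */
struct epcm_entry {
    uint8_t  valid;                  /* VALID: 0 for un-allocated pages */
    uint8_t  page_type;              /* PT: PT_REG, PT_SECS, PT_TCS, ... */
    uint64_t enclave_secs;           /* ENCLAVESECS: owning enclave's SECS */
};

/* One entry per EPC page: the lookup is essentially a shift (to obtain the
 * page index) followed by an addition (to index into the array). */
static struct epcm_entry *epcm_entry_for(struct epcm_entry *epcm,
                                         uint64_t epc_base,
                                         uint64_t page_phys_addr) {
    return &epcm[(page_phys_addr - epc_base) >> EPC_PAGE_SHIFT];
}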

5.2 The Memory Layout of an SGX Enclave

SGX was designed to minimize the effort required to convert application code to take advantage of enclaves. History suggests this is a wise decision, as a large factor in the continued dominance of the Intel architecture is its ability to maintain backward compatibility. To this end, SGX enclaves were designed to be conceptually similar to the leading software modularization construct, dynamically loaded libraries, which are packaged as .so files on Unix, and .dll files on Windows.

For simplicity, we describe the interaction between enclaves and non-enclave software assuming that each enclave is used by exactly one application process, which we shall refer to as the enclave's host process. We do note, however, that the SGX design does not explicitly prohibit multiple application processes from sharing an enclave.

5.2.1 The Enclave Linear Address Range (ELRANGE)

Each enclave designates an area in its virtual address space, called the enclave linear address range (ELRANGE), which is used to map the code and the sensitive data stored in the enclave's EPC pages. The virtual address space outside ELRANGE is mapped to access non-EPC memory via the same virtual addresses as the enclave's host process, as shown in Figure 61.

Figure 61: An enclave's EPC pages are accessed using a dedicated region in the enclave's virtual address space, called ELRANGE. The rest of the virtual address space is used to access the memory of the host process. The memory mappings are established using the page tables managed by system software.

The SGX design guarantees that the enclave's memory accesses inside ELRANGE obey the virtual memory abstraction (§ 2.5.1), while memory accesses outside ELRANGE receive no guarantees. Therefore, enclaves must store all their code and private data inside ELRANGE, and must consider the memory outside ELRANGE to be an untrusted interface to the outside world.

The word "linear" in ELRANGE references the linear addresses produced by the vestigial segmentation feature (§ 2.7) in the 64-bit Intel architecture. For most purposes, "linear" can be treated as a synonym for "virtual".

ELRANGE is specified using a base (the BASEADDR field) and a size (the SIZE field) in the enclave's SECS (§ 5.1.3). ELRANGE must meet the same constraints as a variable memory type range (§ 2.11.4) and as the PRM range (§ 5.1), namely the size must be a power of 2, and the base must be aligned to the size. These restrictions are in place so that the SGX implementation can inexpensively check whether an address belongs to an enclave's ELRANGE, in either hardware (§ 2.11.4) or software.

When an enclave represents a dynamic library, it is natural to set ELRANGE to the memory range reserved for the library by the loader. The ability to access non-enclave memory from enclave code makes it easy to reuse existing library code that expects to work with pointers to memory buffers managed by code in the host process.

Non-enclave software cannot access PRM memory. A memory access that resolves inside the PRM results in an aborted transaction, which is undefined at an architectural level. On current processors, aborted writes are ignored, and aborted reads return a value whose bits are all set to 1. This comes into play in the scenario described above, where an enclave is loaded into a host application process as a dynamically loaded library. The system software maps the enclave's code and data in ELRANGE into EPC pages. If application software attempts to access memory inside ELRANGE, it will experience the abort transaction semantics. The current semantics do not cause the application to crash (e.g., due to a Page Fault), but also guarantee that the host application will not be able to tamper with the enclave or read its private information.
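The abort-transaction semantics can be captured by a small behavioral model in C; in reality these checks happen in hardware, and the function names below are invented for illustration.

#include <stdint.h>
#include <string.h>

/* Model of a read issued by non-enclave software: an access that resolves
 * inside the PRM is aborted, and aborted reads return all-ones. */
static uint64_t non_enclave_read(uint64_t pa, const uint8_t *dram,
                                 uint64_t prm_base, uint64_t prm_mask) {
    if ((pa & prm_mask) == (prm_base & prm_mask))
        return ~0ULL;                     /* aborted read: every bit set to 1 */
    uint64_t value;
    memcpy(&value, dram + pa, sizeof(value));
    return value;
}

/* Aborted writes are silently dropped on current processors. */
static void non_enclave_write(uint64_t pa, uint64_t value, uint8_t *dram,
                              uint64_t prm_base, uint64_t prm_mask) {
    if ((pa & prm_mask) == (prm_base & prm_mask))
        return;                           /* aborted write: ignored */
    memcpy(dram + pa, &value, sizeof(value));
}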

5.2.2 SGX Enclave Attributes

The execution environment of an enclave is heavily influenced by the value of the ATTRIBUTES field in the enclave's SECS (§ 5.1.3). The rest of this work will refer to the field's sub-fields, shown in Table 14, as enclave attributes.

Field      | Bits | Description
DEBUG      | 1    | Opts into enclave debugging features.
XFRM       | 64   | The value of XCR0 (§ 2.6) while this enclave's code is executed.
MODE64BIT  | 1    | Set for 64-bit enclaves.

Table 14: An enclave's attributes are the sub-fields in the ATTRIBUTES field of the enclave's SECS. This table shows a subset of the attributes defined in the SGX documentation.

The most important attribute, from a security perspective, is the DEBUG flag. When this flag is set, it enables the use of SGX's debugging features for this enclave. These debugging features include the ability to read and modify most of the enclave's memory. Therefore, DEBUG should only be set in a development environment, as it causes the enclave to lose all the SGX security guarantees.

SGX guarantees that enclave code will always run with the XCR0 register (§ 2.6) set to the value indicated by the extended features request mask (XFRM). Enclave authors are expected to use XFRM to specify the set of architectural extensions enabled by the compiler used to produce the enclave's code. Having XFRM be explicitly specified allows Intel to design new architectural extensions that change the semantics of existing instructions, such as Memory Protection Extensions (MPX), without having to worry about the security implications on enclave code that was developed without an awareness of the new features.

The MODE64BIT flag is set to true for enclaves that use the 64-bit Intel architecture. From a security standpoint, this flag should not even exist, as supporting a secondary architecture adds unnecessary complexity to the SGX implementation, and increases the probability that security vulnerabilities will creep in. It is very likely that the 32-bit architecture support was included due to Intel's strategy of offering extensive backwards compatibility, which has paid off quite well so far.

In the interest of mental sanity, this work does not analyze the behavior of SGX for enclaves whose MODE64BIT flag is cleared. However, a security researcher who wishes to find vulnerabilities in SGX might study this area.

Last, the INIT flag is always false when the enclave's SECS is created. The flag is set to true at a certain point in the enclave lifecycle, which will be summarized in § 5.3.

5.2.3 Address Translation for SGX Enclaves

Under SGX, the operating system and hypervisor are still in full control of the page tables and EPTs, and each enclave's code uses the same address translation process and page tables (§ 2.5) as its host application. This minimizes the amount of changes required to add SGX support to existing system software. At the same time, having the page tables managed by untrusted system software opens SGX up to the address translation attacks described in § 3.7. As future sections will reveal, a good amount of the complexity in SGX's design can be attributed to the need to prevent these attacks.

SGX's defenses against active memory mapping attacks revolve around ensuring that each EPC page can only be mapped at a specific virtual address (§ 2.7). When an EPC page is allocated, its intended virtual address is recorded in the EPCM entry for the page, in the ADDRESS field.

When an address translation (§ 2.5) result is the physical address of an EPC page, the CPU ensures that the virtual address given to the address translation process matches the expected virtual address recorded in the page's EPCM entry; a mismatch triggers a general protection fault (#GP, § 2.8.2).

SGX also protects against some passive memory mapping attacks and fault injection attacks by ensuring that the access permissions of each EPC page always match the enclave author's intentions. The access permissions for each EPC page are specified when the page is allocated, and recorded in the readable (R), writable (W), and executable (X) fields in the page's EPCM entry, shown in Table 15.

Field    | Bits | Description
ADDRESS  | 48   | the virtual address used to access this page
R        | 1    | allow reads by enclave code
W        | 1    | allow writes by enclave code
X        | 1    | allow execution of code inside the page, inside enclave

Table 15: The fields in an EPCM entry that indicate the enclave's intended virtual memory layout.

When an address translation (§ 2.5) resolves into an EPC page, the corresponding EPCM entry's fields override the access permission attributes (§ 2.5.3) specified in the page tables. For example, the W field in the EPCM entry overrides the writable (W) attribute, and the X field overrides the disable execution (XD) attribute.
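The ADDRESS match and the R/W/X override can be combined into a single check, sketched below. The entry layout and the fault codes are illustrative stand-ins; the SDM defines the authoritative behavior.

#include <stdbool.h>
#include <stdint.h>

enum access_kind { ACCESS_READ, ACCESS_WRITE, ACCESS_EXEC };
enum sgx_fault   { FAULT_NONE, FAULT_PF, FAULT_GP };

/* Illustrative EPCM entry combining the fields of Tables 13 and 15. */
struct epcm_entry {
    uint8_t  valid, r, w, x;
    uint64_t address;          /* virtual address this page must be mapped at */
    uint64_t enclave_secs;     /* owning enclave */
};

/* Model of the checks applied when a translation resolves to an EPC page
 * while an enclave is executing. */
static enum sgx_fault epc_access_check(const struct epcm_entry *e,
                                       uint64_t va, uint64_t current_secs,
                                       enum access_kind kind) {
    if (!e->valid || e->enclave_secs != current_secs)
        return FAULT_PF;       /* page is free or owned by another enclave */
    if (e->address != (va & ~0xFFFULL))
        return FAULT_GP;       /* ADDRESS mismatch, as described above */
    if ((kind == ACCESS_READ  && !e->r) ||
        (kind == ACCESS_WRITE && !e->w) ||
        (kind == ACCESS_EXEC  && !e->x))
        return FAULT_PF;       /* EPCM permissions override page-table bits */
    return FAULT_NONE;
}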

It follows that an enclave author must include memory layout information along with the enclave, in such a way that the system software loading the enclave will know the expected virtual memory address and access permissions for each enclave page. In return, the SGX design guarantees to the enclave authors that the system software, which manages the page tables and EPT, will not be able to set up an enclave's virtual address space in a manner that is inconsistent with the author's expectations.

The .so and .dll file formats, which are SGX's intended enclave delivery vehicles, already have provisions for specifying the virtual addresses that a software module was designed to use, as well as the desired access permissions for each of the module's memory areas.

Last, an SGX-enabled CPU will ensure that the virtual memory inside ELRANGE (§ 5.2.1) is mapped to EPC pages. This prevents the system software from carrying out an address translation attack where it maps the enclave's entire virtual address space to DRAM pages outside the PRM, which do not trigger any of the checks above, and can be directly accessed by the system software.

5.2.4 The Thread Control Structure (TCS)

The SGX design fully embraces multi-core processors. It is possible for multiple logical processors (§ 2.9.3) to concurrently execute the same enclave's code at the same time, via different threads.

The SGX implementation uses a Thread Control Structure (TCS) for each logical processor that executes an enclave's code. It follows that an enclave's author must provision at least as many TCS instances as the maximum number of concurrent threads that the enclave is intended to support.

Each TCS is stored in a dedicated EPC page whose EPCM entry type is PT_TCS. The SDM describes the first few fields in the TCS. These fields are considered to belong to the architectural part of the structure, and therefore are guaranteed to have the same semantics on all the processors that support SGX. The rest of the TCS is not documented.

The contents of an EPC page that holds a TCS cannot be directly accessed, even by the code of the enclave that owns the TCS. This restriction is similar to the restriction on accessing EPC pages holding SECS instances. However, the architectural fields in a TCS can be read by enclave debugging instructions.

The architectural fields in the TCS lay out the context switches (§ 2.6) performed by a logical processor when it transitions between executing non-enclave and enclave code. For example, the OENTRY field specifies the value loaded in the instruction pointer (RIP) when the TCS is used to start executing enclave code, so the enclave author has strict control over the entry points available to the enclave's host application. Furthermore, the OFSBASGX and OGSBASGX fields specify the base addresses loaded in the FS and GS segment registers (§ 2.7), which typically point to Thread Local Storage (TLS).

5.2.5 The State Save Area (SSA)

When the processor encounters a hardware exception (§ 2.8.2), such as an interrupt (§ 2.12), while executing the code inside an enclave, it performs a privilege level switch (§ 2.8.2) and invokes a hardware exception handler provided by the system software. Before executing the exception handler, however, the processor needs a secure area to store the enclave code's execution context (§ 2.6), so that the information in the execution context is not revealed to the untrusted system software.

In the SGX design, the area used to store an enclave thread's execution context while a hardware exception is handled is called a State Save Area (SSA), illustrated in Figure 62. Each TCS references a contiguous sequence of SSAs. The offset of the SSA array (OSSA) field specifies the location of the first SSA in the enclave's virtual address space. The number of SSAs (NSSA) field indicates the number of available SSAs.

Each SSA starts at the beginning of an EPC page, and uses up the number of EPC pages that is specified in the SSAFRAMESIZE field of the enclave's SECS. These alignment and size restrictions most likely simplify the SGX implementation by reducing the number of special cases that it needs to handle.

An enclave thread's execution context consists of the general-purpose registers (GPRs) and the result of the XSAVE instruction (§ 2.6). Therefore, the size of the execution context depends on the requested-feature bitmap (RFBM) used by XSAVE. All the code in an enclave uses the same RFBM, which is declared in the XFRM enclave attribute (§ 5.2.2). The number of EPC pages reserved for each SSA, specified in SSAFRAMESIZE, must be large enough to fit the XSAVE output for the feature bitmap specified by XFRM; ECREATE (§ 5.3.1) fails if SSAFRAMESIZE is too small.
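The sizing rule for SSAFRAMESIZE can be expressed as a one-line check; the caller is assumed to supply the XSAVE output size implied by XFRM and the size of the GPR area defined in the SDM, both of which are outside the scope of this sketch.

#include <stdbool.h>
#include <stdint.h>

#define EPC_PAGE_SIZE 4096u

/* ECREATE-style check: the SSA frame (SSAFRAMESIZE pages) must be able to
 * hold the XSAVE output for XFRM plus the GPR area of the execution context. */
static bool ssaframesize_ok(uint32_t ssaframesize_pages,
                            uint32_t xsave_bytes_for_xfrm,
                            uint32_t gpr_area_bytes) {
    uint64_t needed = (uint64_t)xsave_bytes_for_xfrm + gpr_area_bytes;
    return (uint64_t)ssaframesize_pages * EPC_PAGE_SIZE >= needed;
}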

SSAs are stored in regular EPC pages, whose EPCM page type is PT_REG. Therefore, the SSA contents are accessible to enclave software. The SSA layout is architectural, and is completely documented in the SDM. This opens up possibilities for an enclave exception handler that is invoked by the host application after a hardware exception occurs, and acts upon the information in an SSA.

Figure 62: A possible layout of an enclave's virtual address space. Each enclave has a SECS, and one TCS per supported concurrent thread. Each TCS points to a sequence of SSAs, and specifies initial values for RIP and for the base addresses of FS and GS.

5.3 The Life Cycle of an SGX Enclave

An enclave's life cycle is deeply intertwined with resource management, specifically the allocation of EPC pages. Therefore, the instructions that transition between different life cycle states can only be executed by the system software. The system software is expected to expose the SGX instructions described below as enclave loading and teardown services.

The following subsections describe the major steps in an enclave's lifecycle, which is illustrated by Figure 63.

Figure 63: The SGX enclave life cycle management instructions and state transition diagram.

5.3.1 Creation

An enclave is born when the system software issues the ECREATE instruction, which turns a free EPC page into the SECS (§ 5.1.3) for the new enclave.

ECREATE initializes the newly created SECS using the information in a non-EPC page owned by the system software. This page specifies the values for all the SECS fields defined in the SDM, such as BASEADDR and SIZE, using an architectural layout that is guaranteed to be preserved by future implementations.

While it is very likely that the actual SECS layout used by initial SGX implementations matches the architectural layout quite closely, future implementations are free to deviate from this layout, as long as they maintain the ability to initialize the SECS using the architectural layout. Software cannot access an EPC page that holds a SECS, so it cannot become dependent on an internal SECS layout. This is a stronger version of the encapsulation used in the Virtual Machine Control Structure (VMCS, § 2.8.3).

ECREATE validates the information used to initialize the SECS, and results in a page fault (#PF, § 2.8.2) or general protection fault (#GP, § 2.8.2) if the information is not valid. For example, if the SIZE field is not a power of two, ECREATE results in #GP. This validation, combined with the fact that the SECS is not accessible by software, simplifies the implementation of the other SGX instructions, which can assume that the information inside the SECS is valid.

Last, ECREATE initializes the enclave's INIT attribute (sub-field of the ATTRIBUTES field in the enclave's SECS, § 5.2.2) to the false value. The enclave's code cannot be executed until the INIT attribute is set to true, which happens in the initialization stage that will be described in § 5.3.3.

5.3.2 Loading

ECREATE marks the newly created SECS as uninitialized. While an enclave's SECS is in this state, the system software can use EADD instructions to load the initial code and data into the enclave. EADD is used to create both TCS pages (§ 5.2.4) and regular pages.
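The validation performed by ECREATE (§ 5.3.1) can be sketched as follows. This is a behavioral model with invented names, covering only the checks named above: SIZE must be a power of two and, per § 5.2.1, BASEADDR must be aligned to SIZE.

#include <stdbool.h>
#include <stdint.h>

enum ecreate_fault { ECREATE_OK, ECREATE_GP };

/* Subset of the architectural SECS fields used by this check. */
struct secs_template {
    uint64_t baseaddr;   /* BASEADDR: start of ELRANGE */
    uint64_t size;       /* SIZE: length of ELRANGE */
};

/* Model of ECREATE's validation of the SIZE and BASEADDR fields. */
static enum ecreate_fault ecreate_validate(const struct secs_template *s) {
    bool size_is_pow2 = s->size != 0 && (s->size & (s->size - 1)) == 0;
    bool base_aligned = size_is_pow2 && (s->baseaddr & (s->size - 1)) == 0;
    return (size_is_pow2 && base_aligned) ? ECREATE_OK : ECREATE_GP;
}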

EADD reads its input data from a Page Information (PAGEINFO) structure, illustrated in Figure 64. The structure's contents are only used to communicate information to the SGX implementation, so it is entirely architectural and documented in the SDM.

Figure 64: The PAGEINFO structure supplies input data to SGX instructions such as EADD.

Currently, the PAGEINFO structure contains the virtual address of the EPC page that will be allocated (LINADDR), the virtual address of the non-EPC page whose contents will be copied into the newly allocated EPC page (SRCPGE), a virtual address that resolves to the SECS of the enclave that will own the page (SECS), and values for some of the fields of the EPCM entry associated with the newly allocated EPC page (SECINFO).

The SECINFO field in the PAGEINFO structure is actually a virtual memory address, and points to a Security Information (SECINFO) structure, some of which is also illustrated in Figure 64. The SECINFO structure contains the newly allocated EPC page's access permissions (R, W, X) and its EPCM page type (PT_REG or PT_TCS). Like PAGEINFO, the SECINFO structure is solely used to communicate data to the SGX implementation, so its contents are also entirely architectural. However, most of the structure's 64 bytes are reserved for future use.

Both the PAGEINFO and the SECINFO structures are prepared by the system software that invokes the EADD instruction, and therefore must be contained in non-EPC pages. Both structures must be aligned to their sizes – PAGEINFO is 32 bytes long, so each PAGEINFO instance must be 32-byte aligned, while SECINFO has 64 bytes, and therefore each SECINFO instance must be 64-byte aligned. The alignment requirements likely simplify the SGX implementation by reducing the number of special cases that must be handled.

EADD validates its inputs before modifying the newly allocated EPC page or its EPCM entry. Most importantly, attempting to EADD a page to an enclave whose SECS is in the initialized state will result in a #GP. Furthermore, attempting to EADD an EPC page that is already allocated (the VALID field in its EPCM entry is 1) results in a #PF. EADD also ensures that the page's virtual address falls within the enclave's ELRANGE, and that all the reserved fields in SECINFO are set to zero.

While loading an enclave, the system software will also use the EEXTEND instruction, which updates the enclave's measurement used in the software attestation process. Software attestation is discussed in § 5.8.

5.3.3 Initialization

After loading the initial code and data pages into the enclave, the system software must use a Launch Enclave (LE) to obtain an EINIT Token Structure, via an under-documented process that will be described in more detail in § 5.9.1. The token is then provided to the EINIT instruction, which marks the enclave's SECS as initialized.

The LE is a privileged enclave provided by Intel, and is a prerequisite for the use of enclaves authored by parties other than Intel. The LE is an SGX enclave, so it must be created, loaded and initialized using the processes described in this section. However, the LE is cryptographically signed (§ 3.1.3) with a special Intel key that is hard-coded into the SGX implementation, and that causes EINIT to initialize the LE without checking for a valid EINIT Token Structure.

When EINIT completes successfully, it sets the enclave's INIT attribute to true. This opens the way for ring 3 (§ 2.3) application software to execute the enclave's code, using the SGX instructions described in § 5.4. On the other hand, once INIT is set to true, EADD cannot be invoked on that enclave anymore, so the system software must load all the pages that make up the enclave's initial state before executing the EINIT instruction.

5.3.4 Teardown

After the enclave has done the computation it was designed to perform, the system software executes the EREMOVE instruction to deallocate the EPC pages used by the enclave.
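The PAGEINFO and SECINFO size and alignment rules described in § 5.3.2 can be captured with C structures and static assertions. The field order shown here is illustrative only — the SDM defines the authoritative layout — but the sizes and alignments match the text.

#include <stdint.h>
#include <assert.h>

/* Illustrative SECINFO: flags for the new EPC page, padded to 64 bytes. */
struct secinfo {
    uint64_t flags;           /* R, W, X permission bits and the page type */
    uint8_t  reserved[56];    /* most of the 64 bytes are reserved, must be zero */
} __attribute__((aligned(64)));

/* Illustrative PAGEINFO: 32 bytes of virtual addresses handed to EADD. */
struct pageinfo {
    uint64_t linaddr;         /* virtual address of the EPC page being allocated */
    uint64_t srcpge;          /* non-EPC page whose contents are copied in */
    uint64_t secinfo;         /* points to a struct secinfo */
    uint64_t secs;            /* resolves to the owning enclave's SECS */
} __attribute__((aligned(32)));

static_assert(sizeof(struct secinfo) == 64, "SECINFO must be 64 bytes");
static_assert(sizeof(struct pageinfo) == 32, "PAGEINFO must be 32 bytes");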

EREMOVE marks an EPC page as available by setting the VALID field of the page's EPCM entry to 0 (zero). Before freeing up the page, EREMOVE makes sure that there is no logical processor executing code inside the enclave that owns the page to be removed.

An enclave is completely destroyed when the EPC page holding its SECS is freed. EREMOVE refuses to deallocate a SECS page if it is referenced by any other EPCM entry's ENCLAVESECS field, so an enclave's SECS page can only be deallocated after all the enclave's pages have been deallocated.

5.4 The Life Cycle of an SGX Thread

Between the time when an enclave is initialized (§ 5.3.3) and the time when it is torn down (§ 5.3.4), the enclave's code can be executed by any application process that has the enclave's EPC pages mapped into its virtual address space.

When executing the code inside an enclave, a logical processor is said to be in enclave mode, and the code that it executes can access the regular (PT_REG, § 5.1.2) EPC pages that belong to the currently executing enclave. When a logical processor is outside enclave mode, it bounces any memory accesses inside the Processor Reserved Memory range (PRM, § 5.1), which includes the EPC.

Each logical processor that executes enclave code uses a Thread Control Structure (TCS, § 5.2.4). When a TCS is used by a logical processor, it is said to be busy, and it cannot be used by any other logical processor. Figure 65 illustrates the instructions used by a host process to execute enclave code and their interactions with the TCS that they target.

Figure 65: The stages of the life cycle of an SGX Thread Control Structure (TCS) that has two State Save Areas (SSAs).

Assuming that no hardware exception occurs, an enclave's host process uses the EENTER instruction, described in § 5.4.1, to execute enclave code. When the enclave code finishes performing its task, it uses the EEXIT instruction, covered in § 5.4.2, to return the execution control to the host process that invoked the enclave.

If a hardware exception occurs while a logical processor is in enclave mode, the processor is taken out of enclave mode using an Asynchronous Enclave Exit (AEX), summarized in § 5.4.3, before the system software's exception handler is invoked. After the system software's handler is invoked, the enclave's host process can use the ERESUME instruction, described in § 5.4.4, to re-enter the enclave and resume the computation that it was performing.

5.4.1 Synchronous Enclave Entry

At a high level, EENTER performs a controlled jump into enclave code, while performing the processor configuration that is needed by SGX's security guarantees. Going through all the configuration steps is a tedious exercise, but it is a necessary prerequisite to understanding how all data structures used by SGX work together. For this reason, EENTER and its siblings are described in much more detail than the other SGX instructions.

EENTER, illustrated in Figure 66, can only be executed by unprivileged application software running at ring 3 (§ 2.3), and results in an undefined instruction (#UD) fault if it is executed by system software.

EENTER switches the logical processor to enclave mode, but does not perform a privilege level switch (§ 2.8.2). Therefore, enclave code always executes at ring 3, with the same privileges as the application code that calls it. This makes it possible for an infrastructure owner to allow user-supplied software to create and use enclaves, while having the assurance that the OS kernel and hypervisor can still protect the infrastructure from buggy or malicious software.

EENTER takes the virtual address of a TCS as its input, and requires that the TCS is available (not busy), and that at least one State Save Area (SSA, § 5.2.5) is available in the TCS. The latter check is implemented by making sure that the current SSA index (CSSA) field in the TCS is less than the number of SSAs (NSSA) field. The SSA indicated by the CSSA, which shall be called the current SSA, is used in the event that a hardware exception occurs while enclave code is executed.
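The availability check just described amounts to a couple of comparisons. The C model below uses invented field names for the TCS state; it is a sketch, not the SDM's definition.

#include <stdbool.h>
#include <stdint.h>

/* Illustrative subset of the TCS state involved in EENTER's checks. */
struct tcs_state {
    bool     busy;     /* set while a logical processor uses this TCS */
    uint32_t cssa;     /* current SSA index */
    uint32_t nssa;     /* number of SSAs available to this TCS */
};

/* EENTER requires a free TCS with at least one available SSA. */
static bool eenter_allowed(const struct tcs_state *tcs) {
    return !tcs->busy && tcs->cssa < tcs->nssa;
}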

EENTER transitions the logical processor into enclave mode, and sets the instruction pointer (RIP) to the value indicated by the entry point offset (OENTRY) field in the TCS that it receives. EENTER is used by an untrusted caller to execute code in a protected environment, and therefore has the same security considerations as SYSCALL (§ 2.8), which is used to call into system software. Setting RIP to the value indicated by OENTRY guarantees to the enclave author that the enclave code will only be invoked at well defined points, and prevents a malicious host application from bypassing any security checks that the enclave author may perform.

Figure 66: Data flow diagram for a subset of the logic in EENTER. The figure omits the logic for disabling debugging features, such as hardware breakpoints and performance monitoring events.

EENTER also sets XCR0 (§ 2.6), the register that controls which extended architectural features are in use, to the value of the XFRM enclave attribute (§ 5.2.2). Ensuring that XCR0 is set according to the enclave author's intentions prevents a malicious operating system from bypassing an enclave's security by enabling architectural features that the enclave is not prepared to handle.

Furthermore, EENTER loads the bases of the segment registers (§ 2.7) FS and GS using values specified in the TCS. The segments' selectors and types are hard-coded to safe values for ring 3 data segments. This aspect of the SGX design makes it easy to implement per-thread Thread Local Storage (TLS). For 64-bit enclaves, this is a convenience feature rather than a security measure, as enclave code can securely load new bases into FS and GS using the WRFSBASE and WRGSBASE instructions.

The EENTER implementation backs up the old values of the registers that it modifies, so they can be restored when the enclave finishes its computation. Just like SYSCALL, EENTER saves the address of the following instruction in the RCX register.

Interestingly, the SDM states that the old values of the XCR0, FS, and GS registers are saved in new registers dedicated to the SGX implementation. However, given that they will only be used on an enclave exit, we expect that the registers are saved in DRAM, in the reserved area in the TCS.

Like SYSCALL, EENTER does not modify the stack pointer register (RSP). To avoid any security exploits, enclave code should set RSP to point to a stack area that is entirely contained in EPC pages. Multi-threaded enclaves can easily implement per-thread stack areas by setting up each thread's TLS area to include a pointer to the thread's stack, and by setting RSP to the value obtained by reading the TLS area at which the FS or GS segment points.

Last, when EENTER enters enclave mode, it suspends some of the processor's debugging features, such as hardware breakpoints and Precise Event Based Sampling (PEBS). Conceptually, a debugger attached to the host process sees the enclave's execution as one single processor instruction.

5.4.2 Synchronous Enclave Exit

EEXIT can only be executed while the logical processor is in enclave mode, and results in an undefined instruction (#UD) fault if executed in any other circumstances. In a nutshell, the instruction returns the processor to ring 3 outside enclave mode and restores the registers saved by EENTER, which were described above.

Unlike SYSRET, EEXIT sets RIP to the value read from RBX, after exiting enclave mode. This is inconsistent with EENTER, which saves the RIP value to RCX. Unless this inconsistency stems from an error in the SDM, enclave code must be sure to note the difference.

The SDM explicitly states that EEXIT does not modify most registers, so enclave authors must make sure to clear any secrets stored in the processor's registers before returning control to the host process.

67.if it doesn’t restore the stack pointer RSP and the stack Application Code int call() { frame base pointer RBP to the values that they had when prepare call arguments TCS AEX Path EENTER was called. try { CSSA It may seem unfortunate that enclave code can induce EENTER Enclave Code OENTRY void entry() { faults in its caller. For better or for worse, this perfectly RCX: AEP RBX: TCS RCX set by matches the case where an application calls into a dynam- store call results EENTER ically loaded module. More specifically, the module’s } catch (AEX e) { read ESP from code is also responsible for preserving stack-related reg- FS:TLS Resumable isters, and a buggy module might jump anywhere in the Yes PUSH RCX exception? application code of the host process. perform enclave No computation This section describes the EENTER behavior for 64- POP RBX bit enclaves. The EENTER implementation for 32-bit return ERROR; EEXIT ERESUME Synchronous enclaves is significantly more complex, due to the extra } Execution Path RCX: AEP RBX: TCS special cases introduced by the full-fledged segmentation store call results model that is still present in the 32-bit Intel architecture. } AEX As stated in the introduction, we are not interested in return SUCCESS; } such legacy aspects. Ring 0 SSA System Software Stack XSAVE 5.4.3 Asynchronous Enclave Exit (AEX) Hardware Exception Handler GPRs GPRSGX void handler() { Code AEP If a hardware exception, like a fault (§ 2.8.2) or an in- save GPRs RIP U_RBP terrupt (§ 2.12), occurs while a logical processor is ex- handle exception CS U_RSP ecuting an enclave’s code, the processor performs an restore GPRs RFLAGS Registers IRET RSP Asynchronous Enclave Exit (AEX) before invoking the cleared } SS by AEX system software’s exception handler, as shown in Fig- ure 67. Figure 67: If a hardware exception occurs during enclave execution, The AEX saves the enclave code’s execution con- the synchronous execution path is aborted, and an Asynchronous text (§ 2.6), restores the state saved by EENTER, and Enclave Exit (AEX) occurs instead. sets up the processor registers so that the system soft- ware’s hardware exception handler will return to an asyn- in § 2.8.2. Actual Intel processors may interleave the chronous exit handler in the enclave’s host process. The AEX implementation with the exception handling imple- exit handler is expected to use the ERESUME instruction mentation. However, for simplicity, this work describes to resume the enclave computation that was interrupted AEX as a separate process that is performed before any by the hardware exception. exception handling steps are taken. Asides from the behavior described in § 5.4.1, In the Intel architecture, if a hardware exception oc- EENTER also writes some information to the current curs, the application code’s execution context can be read SSA, which is only used if an AEX occurs. As shown and modified by the system software’s exception handler in Figure 66, EENTER stores the stack pointer register (§ 2.8.2). This is acceptable when the system software RSP and the stack frame base pointer register RBP into is trusted by the application software. However, under the U RSP and U RBP fields in the current SSA. Last, SGX’s threat model, the system software is not trusted EENTER stores the value in RCX in the Asynchronous by enclaves. Therefore, the AEX step erases any secrets Exit handler Pointer (AEP) field in the current SSA. 
that may exist in the execution state by resetting all its When a hardware exception occurs in enclave mode, registers to predefined values. the SGX implementation performs a sequence of steps Before the enclave’s execution state is reset, it is that takes the logical processor out of enclave mode and backed up inside the current SSA. Specifically, an AEX invokes the hardware exception handler in the system backs up the general purpose registers (GPRs, § 2.6) software. Conceptually, the SGX implementation first in the GPRSGX area in the SSA, and then performs performs an AEX to take the logical processor out of en- an XSAVE (§ 2.6) using the requested-feature bitmap clave mode, and then the hardware exception is handled (RFBM) specified in the XFRM field in the enclave’s using the standard Intel architecture’s behavior described SECS. As each SSA is entirely stored in EPC pages al- 67

68.located to the enclave, the system software cannot read Application Code int call() { or tamper with the backed up execution state. When an prepare call arguments TCS AEX Path SSA receives the enclave’s execution state, it is marked try { CSSA as used by incrementing the CSSA field in the current EENTER Enclave Code OENTRY RCX: AEP RBX: TCS void entry() { TCS. store call results RCX set by After clearing the execution context, the AEX process ERESUME } catch (AEX e) { sets RSP and RBP to the values saved by EENTER in read ESP from the current SSA, and sets RIP to the value in the current Resumable FS:TLS Yes PUSH RCX SSA’s AEP field. This way, when the system software’s exception? perform enclave hardware exception handler completes, the processor No computation will execute the asynchronous exit handler code in the return ERROR; POP RBX enclave’s host process. The SGX design makes it easy ERESUME EEXIT to set up the asynchronous handler code as an exception } RCX: AEP RBX: TCS handler in the routine that contains the EENTER instruc- store call results tion, because the RSP and RBP registers will have the Synchronous } Execution Path AEX same values as they had when EENTER was executed. return SUCCESS; } Many of the actions taken by AEX to get the logical Ring 0 SSA processor outside of enclave mode match EEXIT. The System Software Stack XSAVE Hardware Exception Handler GPRs GPRSGX segment registers FS and GS are restored to the values void handler() { Code AEP saved by EENTER, and all the debugging facilities that save GPRs RIP U_RBP were suppressed by EENTER are restored to their previ- handle exception CS U_RSP ous states. restore GPRs RFLAGS Registers IRET RSP cleared 5.4.4 Recovering from an Asynchronous Exit } SS by AEX When a hardware exception occurs inside enclave mode, Figure 68: If a hardware exception occurs during enclave execution, the processor performs an AEX before invoking the ex- the synchronous execution path is aborted, and an Asynchronous ception’s handler set up by the system software. The Enclave Exit (AEX) occurs instead. AEX sets up the execution context in such a way that when the system software finishes processing the excep- fails if CSSA is greater than or equal to NSSA. tion, it returns into an asynchronous exit handler in the When successful, ERESUME decrements the CSSA enclave’s host process. The asynchronous exception han- field of the TCS, and restores the execution context dler usually executes the ERESUME instruction, which backed up in the SSA pointed to by the CSSA field causes the logical processor to go back into enclave mode in the TCS. Specifically, the ERESUME implementation and continue the computation that was interrupted by the restores the GPRs (§ 2.6) from the GPRSGX field in hardware exception. the SSA, and performs an XRSTOR (§ 2.6) to load the ERESUME shares much of its functionality with execution state associated with the extended architectural EENTER. This is best illustrated by the similarity be- features used by the enclave. tween Figures 68 and 67. ERESUME shares the following behavior with EENTER and ERESUME receive the same inputs, EENTER (§ 5.4.1). Both instructions write the U RSP, namely a pointer to a TCS, described in § 5.4.1, and U RBP, and AEP fields in the current SSA. Both instruc- an AEP, described in § 5.4.3. 
The most common appli- tions follow the same process for backing up XCR0 and cation design will pair each EENTER instance with an the FS and GS segment registers, and set them to the asynchronous exit handler that invokes ERESUME with same values, based on the current TCS and its enclave’s exactly the same arguments. SECS. Last, both instructions disable the same subset of The main difference between ERESUME and EENTER the logical processor’s debugging features. is that the former uses an SSA that was “filled out” by An interesting edge case that ERESUME handles cor- an AEX (§ 5.4.3), whereas the latter uses an empty SSA. rectly is that it sets XCR0 to the XFRM enclave at- Therefore, ERESUME results in a #GP fault if the CSSA tribute before performing an XRSTOR. It follows that field in the provided TCS is 0 (zero), whereas EENTER ERESUME fails if the requested feature bitmap (RFBM) 68

69.in the SSA is not a subset of XFRM. This matters be- Enclave Non-PRM Memory Memory cause, while an AEX will always use the XFRM value as the RFBM, enclave code executing on another thread is free to modify the SSA contents before ERESUME is EPC EWB classical called. page HDD / SSD ELDU, swapping The correct sequencing of actions in the ERESUME im- ELDB plementation prevents a malicious application from using Disk an enclave to modify registers associated with extended architectural features that are not declared in XFRM. DRAM DRAM This would break the system software’s ability to provide Figure 69: SGX offers a method for the OS to evict EPC pages into thread-level execution context isolation. non-PRM DRAM. The OS can then use its standard paging feature to evict the pages out of DRAM. 5.5 EPC Page Eviction Modern OS kernels take advantage of address transla- The SGX design relies on symmetric key cryp- tion (§ 2.5) to implement page swapping, also referred tograpy 3.1.1 to guarantee the confidentiality and in- to as paging (§ 2.5). In a nutshell, paging allows the OS tegrity of the evicted EPC pages, and on nonces (§ 3.1.4) kernel to over-commit the computer’s DRAM by evicting to guarantee the freshness of the pages brought back rarely used memory pages to a slower storage medium into the EPC. These nonces are stored in Version Ar- called the disk. rays (VAs), covered in § 5.5.2, which are EPC pages Paging is a key contributor to utilizing a computer’s dedicated to nonce storage. resources effectively. For example, a desktop system Before an EPC page is evicted and freed up for use whose user runs multiple programs concurrently can by other enclaves, the SGX implementation must ensure evict memory pages allocated to inactive applications that no TLB has address translations associated with the without a significant degradation in user experience. evicted page, in order to avoid the TLB-based address Unfortunately, the OS cannot be allowed to evict an translation attack described in § 3.7.4. enclave’s EPC pages via the same methods that are used As explained in § 5.1.1, SGX leaves the system soft- to implement page swapping for DRAM memory outside ware in charge of managing the EPC. It naturally follows the PRM range. In the SGX threat model, enclaves do that the SGX instructions described in this section, which not trust the system software, so the SGX design offers are used to implement EPC paging, are only available to an EPC page eviction method that can defend against system software, which runs at ring 0 § 2.3. a malicious OS that attempts any of the active address In today’s software stacks (§ 2.3), only the OS ker- translation attacks described in § 3.7. nel implements page swapping in order to support the The price of the security afforded by SGX is that an over-committing of DRAM. The hypervisor is only used OS kernel that supports evicting EPC pages must use to partition the computer’s physical resources between a modified page swapping implementation that inter- operating systems. Therefore, this section is written with acts with the SGX mechanisms. Enclave authors can the expectation that the OS kernel will also take on the mostly ignore EPC evictions, similarly to how today’s responsibility of EPC page swapping. For simplicity, application developers can ignore the OS kernel’s paging we often use the term “OS kernel” instead of “system implementation. software”. 
The reader should be aware that the SGX As illustrated in Figure 69, SGX supports evicting design does not preclude a system where the hypervisor EPC pages to DRAM pages outside the PRM range. The implements its own EPC page swapping. Therefore, “OS system software is expected to use its existing page swap- kernel” should really be read as “the system software ping implementation to evict the contents of these pages that performs EPC paging”. out of DRAM and onto a disk. 5.5.1 Page Eviction and the TLBs SGX’s eviction feature revolves around the EWB in- struction, described in detail in § 5.5.4. Essentially, EWB One of the least promoted accomplishments of SGX is evicts an EPC page into a DRAM page outside the EPC that it does not add any security checks to the memory and marks the EPC page as available, by zeroing the execution units (§ 2.9.4, § 2.10). Instead, SGX’s access VALID field in the page’s EPCM entry. control checks occur after an address translation (§ 2.5) 69

70.is performed, right before the translation result is written Blocked pages are not considered accessible to en- into the TLBs (§ 2.11.5). This aspect is generally down- claves. If an address translation results in a blocked EPC played throughout the SDM, but it becomes visible when page, the SGX implementation causes the translation to explaining SGX’s EPC page eviction mechanism. result in a Page Fault (#PF, § 2.8.2). This guarantees that A full discussion of SGX’s memory access protections once a page is blocked, the CPU will not create any new checks merits its own section, and is deferred to § 6.2. TLB entries pointing to it. The EPC page eviction mechanisms can be explained Furthermore, every SGX instruction makes sure that using only two requirements from SGX’s security model. the EPC pages on which it operates are not blocked. For First, when a logical processor exits an enclave, either example, EENTER ensures that the TCS it is given is not via EEXIT (§ 5.4.2) or via an AEX (§ 5.4.3), its TLBs blocked, that its enclave’s SECS is not blocked, and that are flushed. Second, when an EPC page is deallocated every page in the current SSA is not blocked. from an enclave, all logical processors executing that In order to evict a batch of EPC pages, the OS kernel enclave’s code must be directed to exit the enclave. This must first issue EBLOCK instructions targeting them. The is sufficient to guarantee the removal of any TLB entry OS is also expected to remove the EPC page’s mapping targeting the deallocated EPC. from page tables, but is not trusted to do so. System software can cause a logical processor to exit After all the desired pages have been blocked, the OS an enclave by sending it an Inter-Processor Interrupt kernel must execute an ETRACK instruction, which di- (IPI, § 2.12), which will trigger an AEX when received. rects the SGX implementation to keep track of which log- Essentially, this is a very coarse-grained TLB shootdown. ical processors have had their TLBs flushed. ETRACK re- SGX does not trust system software. Therefore, be- quires the virtual address of an enclave’s SECS (§ 5.1.3). fore marking an EPC page’s EPCM entry as free, the If the OS wishes to evict a batch of EPC pages belonging SGX implementation must ensure that the OS kernel has to multiple enclaves, it must issue an ETRACK for each flushed all the TLBs that might contain translations for enclave. the page. Furthermore, performing IPIs and TLB flushes Following the ETRACK instructions, the OS kernel for each page eviction would add a significant overhead must induce enclave exits on all the logical processors to a paging implementation, so the SGX design allows that are executing code inside the enclaves that have been a batch of pages to be evicted using a single IPI / TLB ETRACKed. The SGX design expects that the OS will flush sequence. use IPIs to cause AEXs in the logical processors whose The TLB flush verification logic relies on a 1-bit TLBs must be flushed. EPCM entry field called BLOCKED. As shown in Fig- The EPC page eviction process is completed when the ure 70, the VALID and BLOCKED fields yield three OS executes an EWB instruction for each EPC page to be possible EPC page states. A page is free when both bits evicted. This instruction, which will be fully described are zero, in use when VALID is zero and BLOCKED is in § 5.5.4, writes an encrypted version of the EPC page one, and blocked when both bits are one. 
to be evicted into DRAM, and then frees the page by clearing the VALID and BLOCKED bits in its EPCM entry. Before carrying out its tasks, EWB ensures that the Free BLOCKED = 0 EPC page that it targets has been blocked, and checks the VALID = 0 state set up by ETRACK to make sure that all the relevant ELDU ELDB TLBs have been flushed. ECREATE, EADD, EPA An evicted page can be loaded back into the EPC via EREMOVE EWB the ELDU and ELDB instructions. Both instructions start EREMOVE up with a free EPC page and a DRAM page that has the In Use Blocked evicted contents of an EPC page, decrypt the DRAM BLOCKED = 0 EBLOCK BLOCKED = 1 VALID = 1 VALID = 1 page’s contents into the EPC page, and restore the cor- responding EPCM entry. The only difference between Figure 70: The VALID and BLOCKED bits in an EPC page’s ELDU and ELDB is that the latter sets the BLOCKED bit EPCM entry can be in one of three states. EADD and its siblings in the page’s EPCM entry, whereas the former leaves it allocate new EPC pages. EREMOVE permanently deallocates an EPC cleared. page. EBLOCK blocks an EPC page so it can be evicted using EWB. ELDB and ELDU load an evicted page back into the EPC. ELDU and ELDB resemble ECREATE and EADD, in 70

71.the sense that they populate a free EPC page. Since This section explains the need for having two values rep- the page that they operate on was free, the SGX secu- resent the same concept by comparing the two values rity model predicates that no TLB entries can possibly and their uses. target it. Therefore, these instructions do not require a The SDM states that ENCLAVESECS field in an mechanism similar to EBLOCK or ETRACK. EPCM entry is used to identify the SECS of the enclave owning the associated EPC page, but stops short of de- 5.5.2 The Version Array (VA) scribing its format. In theory, the ENCLAVESECS field When EWB evicts the contents of an EPC, it creates an can change its representation between SGX implemen- 8-byte nonce (§ 3.1.4) that Intel’s documentation calls a tations since SGX instructions never expose its value to page version. SGX’s freshness guarantees are built on the software. assumption that nonces are stored securely, so EWB stores However, we will later argue that the most plausible the nonce that it creates inside a Version Array (VA). representation of the ENCLAVESECS field is the phys- Version Arrays are EPC pages that are dedicated to ical address of the enclave’s SECS. Therefore, the EN- storing nonces generated by EWB. Each VA is divided CLAVESECS value associated with a given enclave will into slots, and each slot is exactly large enough to store change if the enclave’s SECS is evicted from the EPC one nonce. Given that the size of an EPC page is 4KB, and loaded back at a different location. It follows that the and each nonce occupies 8 bytes, it follows that each VA ENCLAVESECS value is only suitable for identifying has 512 slots. an enclave while its SECS remains in the EPC. VA pages are allocated using the EPA instruction, According to the SDM, the EID field is a 64-bit field which takes in the virtual address of a free EPC page, and stored in an enclave’s SECS. ECREATE’s pseudocode turns it into a Version Array with empty slots. VA pages in the SDM reveals that an enclave’s ID is generated are identified by the PT VA type in their EPCM entries. when the SECS is allocated, by atomically incrementing Like SECS pages, VA pages have the ENCLAVEAD- a global counter. Assuming that the counter does not roll DRESS fields in their EPCM entries set to zero, and over8 , this process guarantees that every enclave created cannot be accessed directly by any software, including during a power cycle has a unique EID. enclaves. Although the SDM does not specifically guarantee Unlike the other page types discussed so far, VA pages this, the EID field in an enclave’s SECS does not appear are not associated with any enclave. This means they to be modified by any instruction. This makes the EID’s can be deallocated via EREMOVE without any restriction. value suitable for identifying an enclave throughout its However, freeing up a VA page whose slots are in use ef- lifetime, even across evictions of its SECS page from the fectively discards the nonces in those slots, which results EPC. in losing the ability to load the corresponding evicted 5.5.4 Evicting an EPC Page pages back into the EPC. Therefore, it is unlikely that a correct OS implementation will ever call EREMOVE on a The system software evicts an EPC page using the EWB VA with non-free slots. 
instruction, which produces all the data needed to restore According to the pseudo-code for EPA and EWB in the the evicted page at a later time via the ELDU instruction, SDM, SGX uses the zero value to represent the free slots as shown in Figure 71. in a VA, implying that all the generated nonces have to EWB’s output consists of an encrypted version of the be non-zero. This also means that EPA initializes a VA evicted EPC page’s contents, a subset of the fields in simply by zeroing the underlying EPC page. However, the EPCM entry corresponding to the page, the nonce since software cannot access a VA’s contents, neither the discussed in § 5.5.2, and a message authentication use of a special value, nor the value itself is architectural. code (MAC, § 3.1.3) tag. With the exception of the nonce, EWB writes its output in DRAM outside the PRM 5.5.3 Enclave IDs area, so the system software can choose to further evict The EWB and ELDU / ELDB instructions use an en- it to disk. clave ID (EID) to identify the enclave that owns an The EPC page contents is encrypted, to protect the evicted page. The EID has the same purpose as the EN- confidentiality of the enclave’s data while the page is CLAVESECS (§ 5.1.2) field in an EPCM entry, which is 8 A 64-bit counter incremented at 4Ghz would roll over in slightly also used to identify the enclave that owns an EPC page. more than 136 years 71

72. EPCM EPC Enclave and Host Application Virtual Address Space ⋮ ⋮ SECS ELDB target metadata ELDB target page BASEADDR ⋮ ⋮ SIZE VA page metadata VA page EID PAGEINFO ⋮ ⋮ SECS EWB source metadata EWB source page ELRANGE LINADDR ⋮ ⋮ SRCPGE EPC Page PCMD VA page = Encrypted EPC Page EWB ⋮ PCMD Untrusted DRAM nonce ⋮ SECINFO EPCM Entry FLAGS ADDRESS R, W, X R, W, X Encrypted Page MAC PAGE_TYPE PT EPC Page Metadata Tag ENCLAVESECS ENCLAVEID MAC ELDU / ELDB Figure 72: The PAGEINFO structure used by the EWB and ELDU / ELDB instructions Figure 71: The EWB instruction outputs the encrypted contents of the evicted EPC page, a subset of the fields in the page’s EPCM entry, entry for the EPC page that is reloaded. a MAC tag, and a nonce. All this information is used by the ELDB or ELDU instruction to load the evicted page back into the EPC, with The metadata described above is stored unencrypted, confidentiality, integrity and freshness guarantees. so the OS has the option of using the information inside as-is for its own bookkeeping. This has no negative im- stored in the untrusted DRAM outside the PRM range. pact on security, because the metadata is not confidential. Without the use of encryption, the system software could In fact, with the exception of the enclave ID, all the meta- learn the contents of an EPC page by evicting it from the data fields are specified by the system software when EPC. ECREATE is called. The enclave ID is only useful for The page metadata is stored in a Page Informa- identifying the enclave that the EPC page belongs to, and tion (PAGEINFO) structure, illustrated in Figure 72. This the system software already has this information as well. structure is similar to the PAGEINFO structure described Asides from the metadata described above, the PCMD in § 5.3.2 and depicted in Figure 64, except that the structure also stores the MAC tag generated by EWB. SECINFO field has been replaced by a PCMD field, The MAC tag covers the authenticity of the EPC page which contains the virtual address of a Page Crypto Meta- contents, the metadata, and the nonce. The MAC tag is data (PCMD) structure. checked by ELDU and ELDB, which will only load an The LINADDR field in the PAGEINFO structure is evicted page back into the EPC if the MAC verification used to store the ADDRESS field in the EPCM entry, confirms the authenticity of the page data, metadata, and which indicates the virtual address intended for accessing nonce. This security check protects against the page the page. The PCMD structure embeds the Security Infor- swapping attacks described in § 3.7.3. mation (SECINFO) described in § 5.3.2, which is used Similarly to EREMOVE, EWB will only evict the EPC to store the page type (PT) and the access permission page holding an enclave’s SECS if there is no other flags (R, W, X) in the EPCM entry. The PCMD structure EPCM entry whose ENCLAVESECS field references also stores the enclave’s ID (EID, § 5.5.3). These fields the SECS. At the same time, as an optimization, the are later used by ELDU or ELDB to populate the EPCM SGX implementation does not perform ETRACK-related 72

73.checks when evicting a SECS. This is safe because a EPC Page Address PAGEINFO (Input) (Input/Output) SECS is only evicted if the EPC has no pages belonging LINADDR to the SECS’ enclave, which implies that there isn’t any SRCPGE EPCM entry TCS belonging to the enclave in the EPC, so no processor PCMD PCMD (Output) BLOCKED can be executing enclave code. LINADDR SECS SECINFO The pages holding Version Arrays can be evicted, just ENCLAVESECS FLAGS like any other EPC page. VA pages are never accessible R, W, X R, W, X by software, so they can’t have any TLB entries point- PT PAGE_TYPE VALID ing to them. Therefore, EWB evicts VA pages without reserved fields performing any ETRACK-related checks. The ability to zero ENCLAVEID evict VA pages has profound implications that will be MAC_HDR reserved fields (Temporary) SECS discussed in § 5.5.6. MAC EID EID EWB’s data flow, shown in detail in Figure 73, has SECINFO TRACKING an aspect that can be confusing to OS developers. The FLAGS instruction reads the virtual address of the EPC page to PAGE_TYPE be evicted from a register (RBX) and writes it to the R, W, X MAC data LINADDR field of the PAGEINFO structure that it is reserved fields provided. The separate input (RBX) could have been LINADDR MAC removed by providing the EPC page’s address in the LINADDR field. EPC Page plaintext AES-GCM counter ciphertext 5.5.5 Loading an Evicted Page Back into EPC non-EPC VA page After an EPC page belonging to an enclave is evicted, any Page Page Version attempt to access the page from enclave code will result (Generated) ⋮ VA slot address in a Page Fault (#PF, § 2.8.2). The #PF will cause the target VA slot (Input) logical processor to exit enclave mode via AEX (§ 5.4.3), ⋮ points to and then invoke the OS kernel’s page fault handler. copied to Page faults receive special handling from the AEX process. While leaving the enclave, the AEX logic specif- Figure 73: The data flow of the EWB instruction that evicts an EPC ically checks if the hardware exception that triggered the page. The page’s content is encrypted in a non-EPC RAM page. A nonce is created and saved in an empty slot inside a VA page. The AEX was #PF. If that is the case, the AEX implementa- page’s EPCM metadata and a MAC are saved in a separate area in tion clears the least significant 12 bits of the CR2 register, non-EPC memory. which stores the virtual address whose translation caused a page fault. ELDU or ELDB instruction to load the evicted page back In general, the OS kernel’s page handler needs to be into the EPC. If the outputs of EWB have been evicted able to extract the virtual page number (VPN, § 2.5.1) from DRAM to a slower storage medium, the OS kernel from CR2, so that it knows which memory page needs will have to read the outputs back into DRAM before to be loaded back into DRAM. The OS kernel may also invoking ELDU / ELDB. be able to use the 12 least significant address bits, which ELDU and ELDB verify the MAC tag produced by are not part of the VPN, to better predict the application EWB, described in § 5.5.4. This prevents the OS kernel software’s memory access patterns. However, unlike the from performing the page swapping-based active address bits that make up the VPN, the bottom 12 bits are not translation attack described in § 3.7.3. absolutely necessary for the fault handler to carry out its job. 
Therefore, SGX’s AEX implementation clears these 5.5.6 Eviction Trees 12 bits, in order to limit the amount of information that The SGX design allows VA pages to be evicted from is learned by the page fault handler. the EPC, just like enclave pages. When a VA page is When the OS page fault handler examines the address evicted from EPC, all the nonces stored by the VA slots in the CR2 register and determines that the faulting ad- become inaccessible to the processor. Therefore, the dress is inside the EPC, it is generally expected to use the evicted pages associated with these nonces cannot be 73

74.restored by ELDB until the OS loads the VA page back the EPC, it needs to load all the VA pages on the path into the EPC. from the eviction tree’s root to the leaf corresponding to In other words, an evicted page depends on the VA the enclave page. Therefore, the number of page loads page storing its nonce, and cannot be loaded back into required to satisfy a page fault inside the EPC depends the EPC until the VA page is reloaded as well. The de- on the shape of the eviction tree that contains the page. pendency graph created by this relationship is a forest The SGX design leaves the OS in complete control of eviction trees. An eviction tree, shown in Fig- of the shape of the eviction trees. This has no negative ure 74, has enclave EPC pages as leaves, and VA pages impact on security, as the tree shape only impacts the as inner nodes. A page’s parent is the VA page that holds performance of the eviction scheme, and not its correct- its nonce. Since EWB always outputs a nonce in a VA ness. page, the root node of each eviction tree is always a VA page in the EPC. 5.6 SGX Enclave Measurement SGX implements a software attestation scheme that fol- VA Page lows the general principles outlined in § 3.3. For the ⋮ purposes of this section, the most relevant principle is ⋮ that a remote party authenticates an enclave based on its measurement, which is intended to identify the soft- ware that is executing inside the enclave. The remote party compares the enclave measurement reported by Encrypted VA the trusted hardware with an expected measurement, and Page only proceeds if the two values match. ⋮ § 5.3 explains that an SGX enclave is built us- ⋮ ing the ECREATE (§ 5.3.1), EADD (§ 5.3.2) and ⋮ EEXTEND instructions. After the enclave is initialized via EINIT (§ 5.3.3), the instructions mentioned above Page MAC Metadata Tag cannot be used anymore. As the SGX measurement scheme follows the principles outlined in § 3.3.2, the measurement of an SGX enclave is obtained by com- puting a secure hash (§ 3.1.3) over the inputs to the Encrypted VA Encrypted ECREATE, EADD and EEXTEND instructions used to Page EPC Page create the enclave and load the initial code and data into ⋮ Page MAC its memory. EINIT finalizes the hash that represents the Metadata Tag enclave’s measurement. ⋮ Along with the enclave’s contents, the enclave author Page MAC is expected to specify the sequence of instructions that Metadata Tag should be used in order to create an enclave whose mea- surement will match the expected value used by the re- mote party in the software attestation process. The .so and .dll dynamically loaded library file formats, which Encrypted Encrypted are SGX’s intended enclave delivery methods, already EPC Page EPC Page include informal specifications for loading algorithms. Page MAC Page MAC Metadata Tag Metadata Tag We expect the informal loading specifications to serve as the starting points for specifications that prescribe the Figure 74: A version tree formed by evicted VA pages and enclave exact sequences of SGX instructions that should be used EPC pages. The enclave pages are leaves, and the VA pages are to create enclaves from .so and .dll files. inner nodes. The OS controls the tree’s shape, which impacts the As argued in § 3.3.2, an enclave’s measurement is performance of evictions, but not their correctness. 
computed using a secure hashing algorithm, so the sys- A straightforward inductive argument shows that when tem software can only build an enclave that matches an an OS wishes to load an evicted enclave page back into expected measurement by following the exact sequence 74

75.of instructions specified by the enclave’s author. The SGX software attestation definitely needs to cover The SGX design uses the 256-bit SHA-2 [21] secure the enclave attributes. For example, if XFRM (§ 5.2.2, hash function to compute its measurements. SHA-2 is § 5.2.5) would not be covered, a malicious enclave loader a block hash function (§ 3.1.3) that operates on 64-byte could attempt to subvert an enclave’s security checks blocks, uses a 32-byte internal state, and produces a 32- by setting XFRM to a value that enables architectural byte output. Each enclave’s measurement is stored in extensions that change the semantics of instructions used the MRENCLAVE field of the enclave’s SECS. The 32- by the enclave, but still produces an XSAVE output that byte field stores the internal state and final output of the fits in SSAFRAMESIZE. 256-bit SHA-2 secure hash function. The special treatment applied to the ATTRIBUTES SECS field seems questionable from a security stand- 5.6.1 Measuring ECREATE point, as it adds extra complexity to the software attesta- The ECREATE instruction, overviewed in § 5.3.1, first tion verifier, which translates into more opportunities for initializes the MRENCLAVE field in the newly created exploitable bugs. This decision also adds complexity to SECS using the 256-bit SHA-2 initialization algorithm, the SGX software attestation design, which is described and then extends the hash with the 64-byte block depicted in § 5.8. in Table 16. The most likely reason why the SGX design decided to go this route, despite the concerns described above, is the Offset Size Description wish to be able to use a single measurement to represent 0 8 ”ECREATE\0” an enclave that can take advantage of some architectural 8 8 SECS.SSAFRAMESIZE (§ 5.2.5) extensions, but can also perform its task without them. 16 8 SECS.SIZE (§ 5.2.1) Consider, for example, an enclave that performs image 32 8 32 zero (0) bytes processing using a library such as OpenCV, which has Table 16: 64-byte block extended into MRENCLAVE by ECREATE routines optimized for SSE and AVX, but also includes generic fallbacks for processors that do not have these The enclave’s measurement does not include the features. The enclave’s author will likely wish to allow BASEADDR field. The omission is intentional, as it an enclave loader to set bits 1 (SSE) and 2 (AVX) to allows the system software to load an enclave at any either true or false. If ATTRIBUTES (and, by extension, virtual address inside a host process that satisfies the XFRM) was a part of the enclave’s measurement, the ELRANGE restrictions (§ 5.2.1), without changing the enclave author would have to specify that the enclave has enclave’s measurement. This feature can be combined 4 valid measurements. In general, allowing n architec- with a compiler that generates position-independent en- tural extensions to be used independently will result in clave code to obtain relocatable enclaves. 2n valid measurements. The enclave’s measurement includes the SSAFRAMESIZE field, which guarantees that 5.6.3 Measuring EADD the SSAs (§ 5.2.5) created by AEX and used by The EADD instruction, described in § 5.3.2, extends the EENTER (§ 5.4.1) and ERESUME (§ 5.4.4) have the SHA-2 hash in MRENCLAVE with the 64-byte block size that is expected by the enclave’s author. Leaving shown in Table 17. 
this field out of an enclave’s measurement would allow a malicious enclave loader to attempt to attack Offset Size Description the enclave’s security checks by specifying a bigger 0 8 ”EADD\0\0\0\0” SSAFRAMESIZE than the enclave’s author intended, 8 8 ENCLAVEOFFSET which could cause the SSA contents written by an AEX 16 48 SECINFO (first 48 bytes) to overwrite the enclave’s code or data. Table 17: 64-byte block extended into MRENCLAVE by EADD. The ENCLAVEOFFSET is computed by subtracting the BASEADDR 5.6.2 Measuring Enclave Attributes in the enclave’s SECS from the LINADDR field in the PAGEINFO The enclave’s measurement does not include the en- structure. clave attributes (§ 5.2.2), which are specified in the AT- The address included in the measurement is the ad- TRIBUTES field in the SECS. Instead, it is included dress where the EADDed page is expected to be mapped directly in the information that is covered by the attesta- in the enclave’s virtual address space. This ensures that tion signature, which will be discussed in § 5.8.1. the system software sets up the enclave’s virtual memory 75

76.layout according to the enclave author’s specifications. Offset Size Description If a malicious enclave loader attempts to set up the en- 0 8 ”EEXTEND\0” clave’s layout incorrectly, perhaps in order to mount an 8 8 ENCLAVEOFFSET active address translation attack (§ 3.7.2), the loaded en- 16 48 48 zero (0) bytes clave’s measurement will differ from the measurement 64 64 bytes 0 - 64 in the chunk expected by the enclave’s author. 128 64 bytes 64 - 128 in the chunk The virtual address of the newly created page is mea- 192 64 bytes 128 - 192 in the chunk sured relatively to the start of the enclave’s ELRANGE. In other words, the value included in the measurement 256 64 bytes 192 - 256 in the chunk is LINADDR - BASEADDR. This makes the enclave’s Table 18: 64-byte blocks extended into MRENCLAVE by measurement invariant to BASEADDR changes, which EEXTEND. The ENCLAVEOFFSET is computed by subtracting the is desirable for relocatable enclaves. Measuring the rel- BASEADDR in the enclave’s SECS from the LINADDR field in the PAGEINFO structure. ative addresses still preserves all the information about the memory layout inside ELRANGE, and therefore has that SGX’s security guarantees only hold when the con- no negative security impact. tents of the enclave’s key pages is measured. For ex- EADD also measures the first 48 bytes of the SECINFO ample, EENTER (§ 5.4.1) is only guaranteed to perform structure (§ 5.3.2) provided to EADD, which contain the controlled jumps inside an enclave’s code if the contents page type (PT) and access permissions (R, W, X) field of all the Thread Control Structure (TCS, § 5.2.4) pages values used to initialize the page’s EPCM entry. By the are measured. Otherwise, a malicious enclave loader same argument as above, including these values in the can change the OENTRY field (§ 5.2.4, § 5.4.1) in a measurement guarantees that the memory layout built TCS while building the enclave, and then a malicious by the system software loading the enclave matches the OS can use the TCS to perform an arbitrary jump inside specifications of the enclave author. enclave code. By the same argument, all the enclave’s The EPCM field values mentioned above take up less code should be measured by EEXTEND. Any code frag- than one byte in the SECINFO structure, and the rest of ment that is not measured can be replaced by a malicious the bytes are reserved and expected to be initialized to enclave loader. zero. This leaves plenty of expansion room for future Given these pitfalls, it is surprising that the SGX de- SGX features. sign opted to decouple the virtual address space layout The most notable omission from Table 17 is the data measurements done by EADD from the memory content used to initialize the newly created EPC page. Therefore, measurements done by EEXTEND. the measurement data contributed by EADD guarantees At a first pass, it appears that the decoupling only has that the enclave’s memory layout will have pages allo- one benefit, which is the ability to load un-measured user cated with prescribed access permissions at the desired input into an enclave while it is being built. However, this virtual addresses. However, the measurements don’t benefit only translates into a small performance improve- cover the code or data loaded in these pages. ment, because enclaves can alternatively be designed to For example, EADD’s measurement data guarantees copy the user input from untrusted DRAM after being that an enclave’s memory layout consists of three exe- initialized. 
At the same time, the decoupling opens up cutable pages followed by five writable data pages, but it the possibility of relying on an enclave that provides no does not guarantee that any of the code pages contains meaningful security guarantees, due to not measuring all the code supplied by the enclave’s author. the important data via EEXTEND calls. However, the real reason behind the EADD / EEXTEND 5.6.4 Measuring EEXTEND separation is hinted at by the EINIT pseudo-code in the The EEXTEND instruction exists solely for the reason of SDM, which states that the instruction opens an inter- measuring data loaded inside the enclave’s EPC pages. rupt (§ 2.12) window while it performs a computationally The instruction reads in a virtual address, and extends the intensive RSA signature check. If an interrupt occurs enclave’s measurement hash with the five 64-byte blocks during the check, EINIT fails with an error code, and in Table 18, which effectively guarantee the contents of the interrupt is serviced. This very unusual approach for a 256-byte chunk of data in the enclave’s memory. a processor instruction suggests that the SGX implemen- Before examining the details of EEXTEND, we note tation was constrained in respect to how much latency its 76

77.instructions were allowed to add to the interrupt handling on the MRENCLAVE field of the enclave’s SECS. Af- process. ter EINIT, the field no longer stores the intermediate In light of the concerns above, it is reasonable to con- state of the SHA-2 algorithm, and instead stores the final clude that EEXTEND was introduced because measur- output of the secure hash function. This value remains ing an entire page using 256-bit SHA-2 is quite time- constant after EINIT completes, and is included in the consuming, and doing it in EADD would have caused the attestation signature produced by the SGX software at- instruction to exceed SGX’s latency budget. The need to testation process. hit a certain latency goal is a reasonable explanation for the seemingly arbitrary 256-byte chunk size. 5.7 SGX Enclave Versioning Support The EADD / EEXTEND separation will not cause secu- The software attestation model (§ 3.3) introduced by rity issues if enclaves are authored using the same tools the Trusted Platform Module (§ 4.4) relies on a mea- that build today’s dynamically loaded modules, which surement (§ 5.6), which is essentially a content hash, to appears to be the workflow targeted by the SGX design. identify the software inside a container. The downside In this workflow, the tools that build enclaves can easily of using content hashes for identity is that there is no identify the enclave data that needs to be measured. relation between the identities of containers that hold It is correct and meaningful, from a security perspec- different versions of the same software. tive, to have the message blocks provided by EEXTEND In practice, it is highly desirable for systems based to the hash function include the address of the 256-byte on secure containers to handle software updates without chunk, in addition to the contents of the data. If the having access to the remote party in the initial software address were not included, a malicious enclave loader attestation process. This entails having the ability to could mount the memory mapping attack described in migrate secrets between the container that has the old § 3.7.2 and illustrated in Figure 54. version of the software and the container that has the More specifically, the malicious loader would EADD updated version. This requirement translates into a need the errorOut page contents at the virtual address in- for a separate identity system that can recognize the tended for disclose, EADD the disclose page con- relationship between two versions of the same software. tents at the virtual address intended for errorOut, SGX supports the migration of secrets between en- and then EEXTEND the pages in the wrong order. If claves that represent different versions of the same soft- EEXTEND would not include the address of the data ware, as shown in Figure 75. chunk that is measured, the steps above would yield the The secret migration feature relies on a one-level cer- same measurement as the correctly constructed enclave. tificate hierarchy ( § 3.2.1), where each enclave author The last aspect of EEXTEND worth analyzing is its is a Certificate Authority, and each enclave receives a support for relocating enclaves. Similarly to EADD, certificate from its author. These certificates must be for- the virtual address measured by EEXTEND is relative matted as Signature Structures (SIGSTRUCT), which are to the enclave’s BASEADDR. Furthermore, the only described in § 5.7.1. 
The information in these certificates SGX structure whose content is expected to be mea- is the basis for an enclave identity scheme, presented in sured by EEXTEND is the TCS. The SGX design has § 5.7.2, which can recognize the relationship between carefully used relative addresses for all the TCS fields different versions of the same software. that represent enclave addresses, which are OENTRY, The EINIT instruction (§ 5.3.3) examines the target OFSBASGX and OGSBASGX. enclave’s certificate and uses the information in it to pop- ulate the SECS (§ 5.1.3) fields that describe the enclave’s 5.6.5 Measuring EINIT certificate-based identity. This process is summarized in The EINIT instruction (§ 5.3.3) concludes the enclave § 5.7.4. building process. After EINIT is successfully invoked Last, the actual secret migration process is based on on an enclave, the enclave’s contents are “sealed”, mean- the key derivation service implemented by the EGETKEY ing that the system software cannot use the EADD instruc- instruction, which is described in § 5.7.5. The sending tion to load code and data into the enclave, and cannot enclave uses the EGETKEY instruction to obtain a sym- use the EEXTEND instruction to update the enclave’s metric key (§ 3.1.1) based on its identity, encrypts its measurement. secrets with the key, and hands off the encrypted secrets EINIT uses the SHA-2 finalization algorithm (§ 3.1.3) to the untrusted system software. The receiving enclave 77

78. SIGSTRUCT A SIGSTRUCT B Enclave Contents SECS SGX EINIT SGX EINIT ATTRIBUTES Enclave A Enclave B BASEADDR SIZE SECS SECS SSAFRAMESIZE Certificate-Based Identity Certificate-Based Identity Enclave A Identity Other EPC Pages Secret Secret SGX Measurement SIGSTRUCT SGX SGX Simulation EGETKEY EGETKEY Signed Fields ENCLAVEHASH Symmetric Authenticated Authenticated Secret VENDOR zero (not Intel) Key Encryption Decryption Key AND ATTRIBUTES ATTRIBUTEMASK RFC ISVPRODID 3447 Non-volatile memory Build Toolchain ISVSVN Configuration 256-bit SHA-2 Encrypted DATE Secret RSA Signature PKCS #1 v1.5 Padding Enclave Author’s EXPONENT (3) Figure 75: SGX has a certificate-based enclave identity scheme, Public RSA Key MODULUS which can be used to migrate secrets between enclaves that contain RSA SIGNATURE different versions of the same software module. Here, enclave A’s Exponentiation secrets are migrated to enclave B. Q1 Q2 Enclave Author’s passes the sending enclave’s identity to EGETKEY, ob- Private RSA Key tains the same symmetric key as above, and uses the key to decrypt the secrets received from system software. Figure 76: An enclave’s Signature Structure (SIGSTRUCT) is The symmetric key obtained from EGETKEY can be intended to be generated by an enclave building toolchain that has access to the enclave author’s private RSA key. used in conjunction with cryptographic primitives that protect the confidentiality (§ 3.1.2) and integrity (§ 3.1.3) of an enclave’s secrets while they are migrated to another and an RSA signature that guarantees the authenticity enclave by the untrusted system software. However, sym- of the metadata, formatted as shown in Table 20. The metric keys alone cannot be used to provide freshness semantics of the fields will be revealed in the following guarantees (§ 3.1), so secret migration is subject to re- sections. play attacks. This is acceptable when the secrets being The enclave certificates must be signed by RSA signa- migrated are immutable, such as when the secrets are tures (§ 3.1.3) that follow the method described in RFC encryption keys obtained via software attestation 3447 [109], using 256-bit SHA-2 [21] as the hash func- tion that reduces the input size, and the padding method 5.7.1 Enclave Certificates described in PKCS #1 v1.5 [110], which is illustrated in The SGX design requires each enclave to have a certifi- Figure 45. cate issued by its author. This requirement is enforced by The SGX implementation only supports 3072-bit RSA EINIT (§ 5.3.3), which refuses to operate on enclaves keys whose public exponent is 3. The key size is without valid certificates. likely chosen to meet FIPS’ recommendation [20], which The SGX implementation consumes certificates for- makes SGX eligible for use in U.S. government applica- matted as Signature Structures (SIGSTRUCT), which are tions. The public exponent 3 affords a simplified signa- intended to be generated by an enclave building toolchain, ture verification algorithm, which is discussed in § 6.5. as shown in Figure 76. The simplified algorithm also requires the fields Q1 and A SIGSTRUCT certificate consists of metadata fields, Q2 in the RSA signature, which are also described in the most interesting of which are presented in Table 19, § 6.5. 78

79. Field Bytes Description An enclave author can use the same RSA key to issue ENCLAVEHASH 32 Must equal the certificates for enclaves that represent different software enclave’s measure- modules. Each module is identified by a unique Product ment (§ 5.6). ID (ISVPRODID) value. Conversely, all the enclaves ISVPRODID 32 Differentiates mod- whose certificates have the same ISVPRODID and are ules signed by the issued by the same RSA key (and therefore have the same public key. same MRENCLAVE) are assumed to represent different ISVSVN 32 Differentiates ver- versions of the same software module. Enclaves whose sions of the same certificates are signed by different keys are always as- module. sumed to contain different software modules. VENDOR 4 Differentiates Intel Enclaves that represent different versions of a module enclaves. can have different security version numbers (SVN). The ATTRIBUTES 16 Constrains the en- SGX design disallows the migration of secrets from an clave’s attributes. enclave with a higher SVN to an enclave with a lower ATTRIBUTEMASK 16 Constrains the en- SVN. This restriction is intended to assist with the distri- clave’s attributes. bution of security patches, as follows. Table 19: A subset of the metadata fields in a SIGSTRUCT enclave If a security vulnerability is discovered in an enclave, certificate the author can release a fixed version with a higher SVN. As users upgrade, SGX will facilitate the migration of Field Bytes Description secrets from the vulnerable version of the enclave to the MODULUS 384 RSA key modulus fixed version. Once a user’s secrets have migrated, the EXPONENT 4 RSA key public exponent SVN restrictions in SGX will deflect any attack based on SIGNATURE 384 RSA signature (See § 6.5) building the vulnerable enclave version and using it to Q1 384 Simplifies RSA signature read the migrated secrets. verification. (See § 6.5) Software upgrades that add functionality should not be Q2 384 Simplifies RSA signature accompanied by an SVN increase, as SGX allows secrets verification. (See § 6.5) to be migrated freely between enclaves with matching Table 20: The format of the RSA signature used in a SIGSTRUCT SVN values. As explained above, a software module’s enclave certificate SVN should only be incremented when a security vulner- ability is found. SIGSTRUCT only allocates 2 bytes to 5.7.2 Certificate-Based Enclave Identity the ISVSVN field, which translates to 65,536 possible An enclave’s identity is determined by three fields in its SVN values. This space can be exhausted if a large team certificate (§ 5.7.1): the modulus of the RSA key used (incorrectly) sets up a continuous build system to allocate to sign the certificate (MODULUS), the enclave’s prod- a new SVN for every software build that it produces, and uct ID (ISVPRODID) and the security version number each code change triggers a build. (ISVSVN). The public RSA key used to issue a certificate iden- 5.7.3 CPU Security Version Numbers tifies the enclave’s author. All RSA keys used to issue The SGX implementation itself has a security version enclave certificates must have the public exponent set to number (CPUSVN), which is used in the key derivation 3, so they are only differentiated by their moduli. SGX process implemented [136] by EGETKEY, in addition to does not use the entire modulus of a key, but rather a the enclave’s identity information. CPUSVN is a 128-bit 256-bit SHA-2 hash of the modulus. 
This is called a value that, according to the SDM, reflects the processor’s signer measurement (MRSIGNER), to parallel the name microcode update version. of enclave measurement (MRENCLAVE) for the SHA-2 The SDM does not describe the structure of CPUSVN, hash that identifies an enclave’s contents. but it states that comparing CPUSVN values using inte- The SGX implementation relies on a hard-coded MR- ger comparison is not meaningful, and that only some SIGNER value to recognize certificates issued by Intel. CPUSVN values are valid. Furthermore, CPUSVNs Enclaves that have an Intel-issued certificate can receive admit an ordering relationship that has the same seman- additional privileges, which are discussed in § 5.8. tics as the ordering relationship between enclave SVNs. 79

Specifically, an SGX implementation will consider all SGX implementations with lower SVNs to be compromised due to security vulnerabilities, and will not trust them.

An SGX patent [136] discloses that CPUSVN is a concatenation of small integers representing the SVNs of the various components that make up SGX's implementation. This structure is consistent with all the statements made in the SDM.

5.7.4 Establishing an Enclave's Identity

When the EINIT (§ 5.3.3) instruction prepares an enclave for code execution, it also sets the SECS (§ 5.1.3) fields that make up the enclave's certificate-based identity, as shown in Figure 77.

Figure 77: EINIT verifies the RSA signature in the enclave's certificate. If the certificate is valid, the information in it is used to populate the SECS fields that make up the enclave's certificate-based identity.

EINIT requires the virtual address of the SIGSTRUCT certificate issued to the enclave, and uses the information in the certificate to initialize the certificate-based identity information in the enclave's SECS. Before using the information in the certificate, EINIT first verifies its RSA signature. The SIGSTRUCT fields Q1 and Q2, along with the RSA exponent 3, facilitate a simplified verification algorithm, which is discussed in § 6.5.

If the SIGSTRUCT certificate is found to be properly signed, EINIT follows the steps discussed in the following few paragraphs to ensure that the certificate was issued to the enclave that is being initialized. Once the checks have completed, EINIT computes MRSIGNER, the 256-bit SHA-2 hash of the MODULUS field in the SIGSTRUCT, and writes it into the enclave's SECS. EINIT also copies the ISVPRODID and ISVSVN fields from SIGSTRUCT into the enclave's SECS. As explained in § 5.7.2, these fields make up the enclave's certificate-based identity.

After verifying the RSA signature in SIGSTRUCT, EINIT copies the signature's padding into the PADDING field in the enclave's SECS. The PKCS #1 v1.5 padding scheme, outlined in Figure 45, does not involve randomness, so PADDING should have the same value for all enclaves.
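As a concrete illustration of the MRSIGNER computation described above, the sketch below hashes a SIGSTRUCT's MODULUS field with SHA-256. This is a minimal sketch, not Intel's implementation; the 384-byte (3072-bit) modulus size is an assumption based on the SDM's description of SIGSTRUCT signatures, and the field is hashed exactly as it appears in the structure.

    import hashlib

    def mrsigner_from_sigstruct_modulus(modulus: bytes) -> bytes:
        """Compute MRSIGNER: the 256-bit SHA-2 hash of the SIGSTRUCT MODULUS field.

        Assumes `modulus` is the raw 384-byte field exactly as laid out in
        SIGSTRUCT (3072-bit RSA key with exponent 3 assumed).
        """
        if len(modulus) != 384:
            raise ValueError("expected the raw 384-byte MODULUS field")
        return hashlib.sha256(modulus).digest()

Under this scheme, every enclave signed with the same RSA key yields the same MRSIGNER, regardless of the enclave's contents (MRENCLAVE).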

EINIT performs a few checks to make sure that the enclave undergoing initialization was indeed authorized by the provided SIGSTRUCT certificate. The most obvious check involves making sure that the MRENCLAVE value in SIGSTRUCT equals the enclave's measurement, which is stored in the MRENCLAVE field in the enclave's SECS.

However, MRENCLAVE does not cover the enclave's attributes, which are stored in the ATTRIBUTES field of the SECS. As discussed in § 5.6.2, omitting ATTRIBUTES from MRENCLAVE facilitates writing enclaves that have optimized implementations that can use architectural extensions when present, and also have fallback implementations that work on CPUs without the extensions. Such enclaves can execute correctly when built with a variety of values in the XFRM (§ 5.2.2, § 5.2.5) attribute. At the same time, allowing system software to use arbitrary values in the ATTRIBUTES field would compromise SGX's security guarantees.

When an enclave uses software attestation (§ 3.3) to gain access to secrets, the ATTRIBUTES value used to build it is included in the SGX attestation signature (§ 5.8). This gives the remote party in the attestation the opportunity to reject an enclave built with an undesirable ATTRIBUTES value. However, when secrets are obtained using the migration process facilitated by certificate-based identities, there is no remote party that can check the enclave's attributes.

The SGX design solves this problem by having enclave authors convey the set of acceptable attribute values for an enclave in the ATTRIBUTES and ATTRIBUTEMASK fields of the SIGSTRUCT certificate issued for the enclave. EINIT will refuse to initialize an enclave using a SIGSTRUCT if the bitwise AND between the ATTRIBUTES field in the enclave's SECS and the ATTRIBUTESMASK field in the SIGSTRUCT does not equal the SIGSTRUCT's ATTRIBUTES field. This check prevents enclaves with undesirable attributes from obtaining and potentially leaking secrets using the migration process.

Any enclave author can use SIGSTRUCT to request any of the bits in an enclave's ATTRIBUTES field to be zero. However, certain bits can only be set to one for enclaves that are signed by Intel. EINIT has a mask of restricted ATTRIBUTES bits, discussed in § 5.8. The EINIT implementation contains a hard-coded MRSIGNER value that is used to identify Intel's privileged enclaves, and only allows privileged enclaves to be built with an ATTRIBUTES value that matches any of the bits in the restricted mask. This check is essential to the security of the SGX software attestation process, which is described in § 5.8.

Last, EINIT also inspects the VENDOR field in SIGSTRUCT. The SDM description of the VENDOR field in the section dedicated to SIGSTRUCT suggests that the field is essentially used to distinguish between special enclaves signed by Intel, which use a VENDOR value of 0x8086, and everyone else's enclaves, which should use a VENDOR value of zero. However, the EINIT pseudocode seems to imply that the SGX implementation only checks that VENDOR is either zero or 0x8086.

5.7.5 Enclave Key Derivation

SGX's secret migration mechanism is based on the symmetric key derivation service that is offered to enclaves by the EGETKEY instruction, illustrated in Figure 78.

Figure 78: EGETKEY implements a key derivation service that is primarily used by SGX's secret migration feature. The key derivation material is drawn from the SECS of the calling enclave, the information in a Key Request structure, and secure storage inside the CPU's hardware.

The keys produced by EGETKEY are derived based on the identity information in the current enclave's SECS and on two secrets stored in secure hardware inside the SGX-enabled processor. One of the secrets is the input to a largely undocumented series of transformations that yields the symmetric key for the cryptographic primitive underlying the key derivation process. The other secret, referred to as the CR_SEAL_FUSES in the SDM, is one of the pieces of information used in the key derivation material.

The SDM does not specify the key derivation algorithm, but the SGX patents [108, 136] disclose that the keys are derived using the method described in FIPS SP 800-108 [34] using AES-CMAC [46] as a Pseudo-Random Function (PRF). The same patents state that the secrets used for key derivation are stored in the CPU's e-fuses, which is confirmed by the ISCA 2015 SGX tutorial [102].

This additional information implies that all EGETKEY invocations that use the same key derivation material will result in the same key, even across CPU power cycles. Furthermore, it is impossible for an adversary to obtain the key produced from a specific key derivation material without access to the secret stored in the CPU's e-fuses. SGX's key hierarchy is further described in § 5.8.2.

The following paragraphs discuss the pieces of data used in the key derivation material, which are selected by the Key Request (KEYREQUEST) structure shown in Table 21.

Table 21: A subset of the fields in the KEYREQUEST structure.

  Field           Bytes  Description
  KEYNAME         2      The desired key type; secret migration uses Seal keys
  KEYPOLICY       2      The identity information (MRENCLAVE and/or MRSIGNER) used in derivation
  ISVSVN          2      The enclave SVN used in derivation
  CPUSVN          16     SGX implementation SVN used in derivation
  ATTRIBUTEMASK   16     Selects enclave attributes
  KEYID           32     Random bytes

The KEYNAME field in KEYREQUEST always participates in the key generation material. It indicates the type of the key to be generated. While the SGX design defines a few key types, the secret migration feature always uses Seal keys. The other key types are used by the SGX software attestation process, which will be outlined in § 5.8.
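The patents only name the primitives behind EGETKEY (FIPS SP 800-108 in counter mode, with AES-CMAC as the PRF); the exact layout of the derivation material is undocumented. The sketch below shows the general shape of such a KDF, using the third-party Python cryptography package for AES-CMAC. The single-block, counter-prefixed layout is an assumption for illustration, not Intel's wire format.

    # Sketch of a FIPS SP 800-108 counter-mode KDF with AES-CMAC as the PRF,
    # the construction disclosed by the SGX patents.  The derivation-material
    # layout is an assumption; Intel does not document it.
    from cryptography.hazmat.primitives import cmac
    from cryptography.hazmat.primitives.ciphers import algorithms

    def aes_cmac(key: bytes, data: bytes) -> bytes:
        mac = cmac.CMAC(algorithms.AES(key))
        mac.update(data)
        return mac.finalize()

    def kdf_sp800_108(master_key: bytes, derivation_material: bytes) -> bytes:
        # A single 128-bit PRF output block is enough for the 128-bit
        # symmetric keys produced by EGETKEY, so one invocation suffices.
        counter = (1).to_bytes(4, "big")
        return aes_cmac(master_key, counter + derivation_material)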

The KEYPOLICY field in KEYREQUEST has two flags that indicate if the MRENCLAVE and MRSIGNER fields in the enclave's SECS will be used for key derivation. Although the field admits 4 values, only two seem to make sense, as argued below.

Setting the MRENCLAVE flag in KEYPOLICY ties the derived key to the current enclave's measurement, which reflects its contents. No other enclave will be able to obtain the same key. This is useful when the derived key is used to encrypt enclave secrets so they can be stored by system software in non-volatile memory, and thus survive power cycles.

If the MRSIGNER flag in KEYPOLICY is set, the derived key is tied to the public RSA key that issued the enclave's certificate. Therefore, other enclaves issued by the same author may be able to obtain the same key, subject to the restrictions below. This is the only KEYPOLICY value that allows for secret migration.

It makes little sense to have no flag set in KEYPOLICY. In this case, the derived key has no useful security property, as it can be obtained by other enclaves that are completely unrelated to the enclave invoking EGETKEY. Conversely, setting both flags is redundant, as setting MRENCLAVE alone will cause the derived key to be tied to the current enclave, which is the strictest possible policy.

The KEYREQUEST structure specifies the enclave SVN (ISVSVN, § 5.7.2) and SGX implementation SVN (CPUSVN, § 5.7.3) that will be used in the key derivation process. However, EGETKEY will reject the derivation request and produce an error code if the desired enclave SVN is greater than the current enclave's SVN, or if the desired SGX implementation's SVN is greater than the current implementation's SVN.

The SVN restrictions prevent the migration of secrets from enclaves with higher SVNs to enclaves with lower SVNs, or from SGX implementations with higher SVNs to implementations with lower SVNs. § 5.7.2 argues that the SVN restrictions can reduce the impact of security vulnerabilities in enclaves and in SGX's implementation.

EGETKEY always uses the ISVPRODID value from the current enclave's SECS for key derivation. It follows that secrets can never flow between enclaves whose SIGSTRUCT certificates assign them different Product IDs.

Similarly, the key derivation material always includes the value of a 128-bit Owner Epoch (OWNEREPOCH) SGX configuration register. This register is intended to be set by the computer's firmware to a secret generated once and stored in non-volatile memory. Before the computer changes ownership, the old owner can clear the OWNEREPOCH from non-volatile memory, making it impossible for the new owner to decrypt any enclave secrets that may be left on the computer.

Due to the cryptographic properties of the key derivation process, outside observers cannot correlate keys derived using different OWNEREPOCH values. This makes it impossible for software developers to use the EGETKEY-derived keys described in this section to track a processor as it changes owners.

The EGETKEY derivation material also includes a 256-bit value supplied by the enclave, in the KEYID field. This makes it possible for an enclave to generate a collection of keys from EGETKEY, instead of a single key. The SDM states that KEYID should be populated with a random number, and is intended to help prevent key wear-out.

Last, the key derivation material includes the bitwise AND of the ATTRIBUTES (§ 5.2.2) field in the enclave's SECS and the ATTRIBUTESMASK field in the KEYREQUEST structure. The mask has the effect of removing some of the ATTRIBUTES bits from the key derivation material, making it possible to migrate secrets between enclaves with different attributes. § 5.6.2 and § 5.7.4 explain the need for this feature, as well as its security implications.

Before adding the masked attributes value to the key generation material, the EGETKEY implementation forces the mask bits corresponding to the INIT and DEBUG attributes (§ 5.2.2) to be set. From a practical standpoint, this means that secrets will never be migrated between debugging enclaves and production enclaves.
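The sketch below ties the preceding paragraphs together by modeling how EGETKEY might assemble the derivation material for a Seal key request: the SVN checks, the KEYPOLICY selection of MRENCLAVE and MRSIGNER, the forced DEBUG and INIT mask bits, and the OWNEREPOCH and KEYID contributions. The structure layouts, field names, and concatenation order are illustrative assumptions; only the checks and the list of ingredients come from the SDM's description.

    DEBUG, INIT = 1 << 1, 1 << 0      # hypothetical attribute bit positions

    def seal_key_material(secs, req, cur_cpusvn, owner_epoch, seal_fuses):
        """Assemble derivation material for a Seal key request (illustrative).

        `secs` and `req` are dictionaries standing in for the SECS and
        KEYREQUEST structures.
        """
        # SVN restrictions: never derive keys for SVNs newer than the caller's.
        # (Real CPUSVN ordering is not an integer comparison; integers are
        # used here only to keep the sketch short.)
        if req["isvsvn"] > secs["isvsvn"] or req["cpusvn"] > cur_cpusvn:
            raise PermissionError("requested SVN exceeds the current SVN")

        # KEYPOLICY selects which identity fields participate in derivation.
        mrenclave = secs["mrenclave"] if req["use_mrenclave"] else b"\x00" * 32
        mrsigner  = secs["mrsigner"]  if req["use_mrsigner"]  else b"\x00" * 32

        # The DEBUG and INIT mask bits are always forced on, so debugging and
        # production enclaves can never obtain each other's keys.
        masked_attributes = secs["attributes"] & (req["attributemask"] | DEBUG | INIT)

        return (mrenclave + mrsigner +
                secs["isvprodid"].to_bytes(2, "little") +
                req["isvsvn"].to_bytes(2, "little") +
                req["cpusvn"].to_bytes(16, "little") +
                masked_attributes.to_bytes(16, "little") +
                owner_epoch + seal_fuses + req["keyid"])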

Without this restriction, it would be unsafe for an enclave author to use the same RSA key to issue certificates to both debugging and production enclaves. Debugging enclaves receive no integrity guarantees from SGX, so it is possible for an attacker to modify the code inside a debugging enclave in a way that causes it to disclose any secrets that it has access to.

5.8 SGX Software Attestation

The software attestation scheme implemented by SGX follows the principles outlined in § 3.3. An SGX-enabled processor computes a measurement of the code and data that is loaded in each enclave, which is similar to the measurement computed by the TPM (§ 4.4). The software inside an enclave can start a process that results in an SGX attestation signature, which includes the enclave's measurement and an enclave message.

Figure 79: Setting up an SGX enclave and undergoing the software attestation process involves the SGX instructions EINIT and EREPORT, and two special enclaves authored by Intel, the SGX Launch Enclave and the SGX Quoting Enclave.

The cryptographic primitive used in SGX's attestation signature is too complex to be implemented in hardware, so the signing process is performed by a privileged Quoting Enclave, which is issued by Intel, and can access the SGX attestation key. This enclave is discussed in § 5.8.2.

Pushing the signing functionality into the Quoting Enclave creates the need for a secure communication path between an enclave undergoing software attestation and the Quoting Enclave. The SGX design solves this problem with a local attestation mechanism that can be used by an enclave to prove its identity to any other enclave hosted by the same SGX-enabled CPU. This scheme, described in § 5.8.1, is implemented by the EREPORT instruction.

The SGX attestation key used by the Quoting Enclave does not exist at the time SGX-enabled processors leave the factory. The attestation key is provisioned later, using a largely undocumented process that is known to involve at least one other enclave issued by Intel, and two special EGETKEY (§ 5.7.5) key types. The publicly available details of this process are summarized in § 5.8.2.

The SGX Launch Enclave and EINITTOKEN structure will be discussed in § 5.9.

5.8.1 Local Attestation

An enclave proves its identity to another target enclave via the EREPORT instruction shown in Figure 80. The SGX instruction produces an attestation Report (REPORT) that cryptographically binds a message supplied by the enclave with the enclave's measurement-based (§ 5.6) and certificate-based (§ 5.7.2) identities. The cryptographic binding is accomplished by a MAC tag (§ 3.1.3) computed using a symmetric key that is only shared between the target enclave and the SGX implementation.

The EREPORT instruction reads the current enclave's identity information from the enclave's SECS (§ 5.1.3), and uses it to populate the REPORT structure. Specifically, EREPORT copies the SECS fields indicating the enclave's measurement (MRENCLAVE), certificate-based identity (MRSIGNER, ISVPRODID, ISVSVN), and attributes (ATTRIBUTES). The attestation report also includes the SVN of the SGX implementation (CPUSVN) and a 64-byte (512-bit) message supplied by the enclave.

The target enclave that receives the attestation report can convince itself of the report's authenticity as shown in Figure 81. The report's authenticity proof is its MAC tag. The key required to verify the MAC can only be obtained by the target enclave, by asking EGETKEY (§ 5.7.5) to derive a Report key.
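The 64-byte message mentioned above is carried in the REPORT's data field and is what lets an enclave bind application-level data, such as a key-agreement message, to its identity. The sketch below shows one common convention, hashing a public key and a challenge nonce into the field; the convention is ours for illustration, not part of the SGX design, which treats the field as opaque bytes.

    import hashlib

    def make_reportdata(key_agreement_public_key: bytes, nonce: bytes) -> bytes:
        """Pack application data into the 64-byte report data field.

        Convention assumed here: the first 32 bytes are a SHA-256 hash of the
        enclave's key-agreement public key and the challenger's nonce; the
        remaining 32 bytes are zero padding.
        """
        digest = hashlib.sha256(key_agreement_public_key + nonce).digest()
        return digest + b"\x00" * 32   # 64 bytes total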

Figure 80: EREPORT data flow.

Figure 81: The authenticity of the REPORT structure created by EREPORT can and should be verified by the report's target enclave. The target's code uses EGETKEY to obtain the key used for the MAC tag embedded in the REPORT structure, and then verifies the tag.

The SDM states that the MAC tag is computed using a block cipher-based MAC (CMAC, [46]), but stops short of specifying the underlying cipher. One of the SGX papers [14] states that the CMAC is based on 128-bit AES.

The Report key returned by EGETKEY is derived from a secret embedded in the processor (§ 5.7.5), and the key material includes the target enclave's measurement. The target enclave can be assured that the MAC tag in the report was produced by the SGX implementation, for the following reasons. The cryptographic properties of the underlying key derivation and MAC algorithms ensure that only the SGX implementation can produce the MAC tag, as it is the only entity that can access the processor's secret, and it would be impossible for an attacker to derive the Report key without knowing the processor's secret. The SGX design guarantees that the key produced by EGETKEY depends on the calling enclave's measurement, so only the target enclave can obtain the key used to produce the MAC tag in the report.

EREPORT uses the same key derivation process as EGETKEY does when invoked with KEYNAME set to the value associated with Report keys. For this reason, EREPORT requires the virtual address of a Report Target Info (TARGETINFO) structure that contains the measurement-based identity and attributes of the target enclave.
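Based on the description above (AES-CMAC keyed with the target enclave's Report key), verification inside the target enclave could look like the sketch below, written with the third-party Python cryptography package. The exact set of MACed fields and their serialization are assumptions; the EGETKEY call and the constant-time comparison are the essential parts.

    import hmac
    from cryptography.hazmat.primitives import cmac
    from cryptography.hazmat.primitives.ciphers import algorithms

    def verify_report(report_body: bytes, report_mac: bytes, egetkey_report_key) -> bool:
        """Target-enclave side of local attestation (illustrative).

        `egetkey_report_key` stands in for an EGETKEY invocation with KEYNAME
        set to the Report key type; only the target enclave can obtain this
        key, because its own MRENCLAVE is part of the derivation material.
        """
        report_key = egetkey_report_key()              # 128-bit key
        mac = cmac.CMAC(algorithms.AES(report_key))
        mac.update(report_body)                        # serialized MACed fields
        expected = mac.finalize()
        # Constant-time comparison avoids leaking how many bytes matched.
        return hmac.compare_digest(expected, report_mac)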

When deriving a Report key, EGETKEY behaves slightly differently than it does in the case of Seal keys, as shown in Figure 81. The key generation material never includes the fields corresponding to the enclave's certificate-based identity (MRSIGNER, ISVPRODID, ISVSVN), and the KEYPOLICY field in the KEYREQUEST structure is ignored. It follows that the report can only be verified by the target enclave.

Furthermore, the SGX implementation's SVN (CPUSVN) value used for key generation is determined by the current CPUSVN, instead of being read from the Key Request structure. Therefore, SGX implementation upgrades that increase the CPUSVN invalidate all outstanding reports. Given that CPUSVN increases are associated with security fixes, the argument in § 5.7.2 suggests that this restriction may reduce the impact of vulnerabilities in the SGX implementation.

Last, EREPORT sets the KEYID field in the key generation material to the contents of an SGX configuration register (CR_REPORT_KEYID) that is initialized with a random value when SGX is initialized. The KEYID value is also saved in the attestation report, but it is not covered by the MAC tag.

5.8.2 Remote Attestation

The SDM paints a complete picture of the local attestation mechanism that was described in § 5.8.1. In comparison, the remote attestation process, which includes the Quoting Enclave and the underlying keys, is shrouded in mystery. This section presents the information that can be gleaned from the SDM, from one [14] of the SGX papers, and from the ISCA 2015 SGX tutorial [102].

SGX's software attestation scheme, which is illustrated in Figure 82, relies on a key generation facility and on a provisioning service, both operated by Intel.

Figure 82: SGX's software attestation is based on two secrets stored in e-fuses inside the processor's die, and on a key received from Intel's provisioning service.

During the manufacturing process, an SGX-enabled processor communicates with Intel's key generation facility, and has two secrets burned into e-fuses, which are a one-time programmable storage medium that can be economically included on a high-performance chip's die. We shall refer to the secrets stored in e-fuses as the Provisioning Secret and the Seal Secret.

The Provisioning Secret is the main input to a largely undocumented process that outputs the SGX master derivation key used by EGETKEY, which was referenced in Figures 78, 79, 80, and 81.

The Seal Secret is not exposed to software by any of the architectural mechanisms documented in the SDM. The secret is only accessed when it is included in the material used by the key derivation process implemented by EGETKEY (§ 5.7.5). The pseudocode in the SDM uses the CR_SEAL_FUSES register name to refer to the Seal Secret.

The names "Seal Secret" and "Provisioning Secret" deviate from Intel's official documents, which confusingly use the "Seal Key" and "Provisioning Key" names to refer to both secrets stored in e-fuses and keys derived by EGETKEY.

The SDM briefly describes the keys produced by EGETKEY, but no official documentation explicitly describes the secrets in e-fuses. The description below is the only interpretation of all the public information sources that is consistent with all the SDM's statements about key derivation.

The Provisioning Secret is generated at the key generation facility, where it is burned into the processor's e-fuses and stored in the database used by Intel's provisioning service. The Seal Secret is generated inside the processor chip, and therefore is not known to Intel. This approach has the benefit that an attacker who compromises Intel's facilities cannot derive most keys produced by EGETKEY, even if the attacker also compromises a victim's firmware and obtains the OWNEREPOCH (§ 5.7.5) value. These keys include the Seal keys (§ 5.7.5) and Report keys (§ 5.8.1) introduced in previous sections.

The only documented exception to the reasoning above is the Provisioning key, which is effectively a shared secret between the SGX-enabled processor and Intel's provisioning service. Intel has to be able to derive this key, so the derivation material does not include the Seal Secret or the OWNEREPOCH value, as shown in Figure 83.

Figure 83: When EGETKEY is asked to derive a Provisioning key, it does not use the Seal Secret or OWNEREPOCH. The Provisioning key does, however, depend on MRSIGNER and on the SVN of the SGX implementation.

EGETKEY derives the Provisioning key using the current enclave's certificate-based identity (MRSIGNER, ISVPRODID, ISVSVN) and the SGX implementation's SVN (CPUSVN). This approach has a few desirable security properties. First, Intel's provisioning service can be assured that it is authenticating a Provisioning Enclave signed by Intel. Second, the provisioning service can use the CPUSVN value to reject SGX implementations with known security vulnerabilities. Third, this design admits multiple mutually distrusting provisioning services.

EGETKEY only derives Provisioning keys for enclaves whose PROVISIONKEY attribute is set to true. § 5.9.3 argues that this mechanism is sufficient to protect the computer owner from a malicious software provider that attempts to use Provisioning keys to track a CPU chip across OWNEREPOCH changes.

After the Provisioning Enclave obtains a Provisioning key, it uses the key to authenticate itself to Intel's provisioning service. Once the provisioning service is convinced that it is communicating with a trusted Provisioning Enclave in the secure environment provided by an SGX-enabled processor, the service generates an Attestation Key and sends it to the Provisioning Enclave. The enclave then encrypts the Attestation Key using a Provisioning Seal key, and hands off the encrypted key to the system software for storage.

Provisioning Seal keys are the last publicly documented type of special keys derived by EGETKEY, using the process illustrated in Figure 84. As their name suggests, Provisioning Seal keys are conceptually similar to the Seal Keys (§ 5.7.5) used to migrate secrets between enclaves.

Figure 84: The derivation material used to produce Provisioning Seal keys does not include the OWNEREPOCH value, so the keys survive computer ownership changes.
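To make the contrast with Seal keys concrete, the sketch below lists the ingredients that Figure 83 shows going into a Provisioning key: no OWNEREPOCH and no Seal Secret, but the certificate-based identity and CPUSVN are included. Field names, layout, and ordering are illustrative assumptions.

    def provisioning_key_material(secs, req, cur_cpusvn):
        """Assemble derivation material for a Provisioning key (illustrative)."""
        # EGETKEY refuses this key type unless the enclave was built with the
        # PROVISIONKEY attribute set.
        if not secs["provisionkey"]:
            raise PermissionError("PROVISIONKEY attribute not set")
        if req["isvsvn"] > secs["isvsvn"] or req["cpusvn"] > cur_cpusvn:
            raise PermissionError("requested SVN exceeds the current SVN")
        # Unlike Seal keys, OWNEREPOCH and the Seal Secret (SEAL_FUSES) are
        # deliberately absent, so Intel can derive the same key on its side.
        # (The masked ATTRIBUTES term discussed in the next few paragraphs is
        # omitted for brevity.)
        return (secs["mrsigner"] +
                secs["isvprodid"].to_bytes(2, "little") +
                req["isvsvn"].to_bytes(2, "little") +
                req["cpusvn"].to_bytes(16, "little"))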

The defining feature of Provisioning Seal keys is that they are not based on the OWNEREPOCH value, so they survive computer ownership changes. Since Provisioning Seal keys can be used to track a CPU chip, their use is gated on the PROVISIONKEY attribute, which has the same semantics as for Provisioning keys.

Like Provisioning keys, Provisioning Seal keys are based on the current enclave's certificate-based identity (MRSIGNER, ISVPRODID, ISVSVN), so the Attestation Key encrypted by Intel's Provisioning Enclave can only be decrypted by another enclave signed with the same Intel RSA key. However, unlike Provisioning keys, the Provisioning Seal keys are based on the Seal Secret in the processor's e-fuses, so they cannot be derived by Intel.

When considered independently from the rest of the SGX design, Provisioning Seal keys have desirable security properties. The main benefit of these keys is that when a computer with an SGX-enabled processor exchanges owners, it does not need to undergo the provisioning process again, so Intel does not need to be aware of the ownership change. The confidentiality issue that stems from not using OWNEREPOCH was already introduced by Provisioning keys, and is mitigated using the access control scheme based on the PROVISIONKEY attribute that will be discussed in § 5.9.3.

Similarly to the Seal key derivation process, both the Provisioning and Provisioning Seal keys depend on the bitwise AND of the ATTRIBUTES (§ 5.2.2) field in the enclave's SECS and the ATTRIBUTESMASK field in the KEYREQUEST structure. While most attributes can be masked away, the DEBUG and INIT attributes are always used for key derivation.

This dependency makes it safe for Intel to use its production RSA key to issue certificates for Provisioning or Quoting Enclaves with debugging features enabled. Without the forced dependency on the DEBUG attribute, using the production Intel signing key on a single debug Provisioning or Quoting Enclave could invalidate SGX's security guarantees on all the CPU chips whose attestation-related enclaves are signed by the same key. Concretely, if the issued SIGSTRUCT were leaked, any attacker could build a debugging Provisioning or Quoting enclave, use the SGX debugging features to modify the code inside it, and extract the 128-bit Provisioning key used to authenticate the CPU to Intel's provisioning service.

After the provisioning steps above have been completed, the Quoting Enclave can be invoked to perform SGX's software attestation. This enclave receives local attestation reports (§ 5.8.1) and verifies them using the Report keys generated by EGETKEY. The Quoting Enclave then obtains the Provisioning Seal Key from EGETKEY and uses it to decrypt the Attestation Key, which is received from system software. Last, the enclave replaces the MAC in the local attestation report with an Attestation Signature produced with the Attestation Key.

The SGX patents state that the name "Quoting Enclave" was chosen as a reference to the TPM (§ 4.4)'s quoting feature, which is used to perform software attestation on a TPM-based system.

The Attestation Key uses Intel's Enhanced Privacy ID (EPID) cryptosystem [26], which is a group signature scheme that is intended to preserve the anonymity of the signers. Intel's key provisioning service is the issuer in the EPID scheme, so it publishes the Group Public Key, while securely storing the Master Issuing Key. After a Provisioning Enclave authenticates itself to the provisioning service, it generates an EPID Member Private Key, which serves as the Attestation Key, and executes the EPID Join protocol to join the group. Later, the Quoting Enclave uses the EPID Member Private Key to produce Attestation Signatures.

The Provisioning Secret stored in the e-fuses of each SGX-enabled processor can be used by Intel to trace individual chips when a Provisioning Enclave authenticates itself to the provisioning service. However, if the EPID Join protocol is blinded, Intel's provisioning service cannot trace an Attestation Signature to a specific Attestation Key, so Intel cannot trace Attestation Signatures to individual chips.

Of course, the security properties of the description above hinge on the correctness of the proofs behind the EPID scheme. Analyzing the correctness of such cryptographic schemes is beyond the scope of this work, so we defer the analysis of EPID to the crypto research community.

5.9 SGX Enclave Launch Control

The SGX design includes a launch control process, which introduces an unnecessary approval step that is required before running most enclaves on a computer. The approval decision is made by the Launch Enclave (LE), which is an enclave issued by Intel that gets to approve every other enclave before it is initialized by EINIT (§ 5.3.3). The officially documented information about this approval process is discussed in § 5.9.1.
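Before moving on to launch control, the Quoting Enclave's signing path described in § 5.8.2 can be summarized by the sketch below: verify the local attestation report, unseal the Attestation Key with a Provisioning Seal key, and replace the report's MAC with an EPID signature. Every callable is a placeholder, since Intel's actual Quoting Enclave is not public, and the EPID signing step is treated as a black box.

    def produce_quote(report_body, report_is_valid, encrypted_attestation_key,
                      egetkey_provisioning_seal_key, authenticated_decrypt,
                      epid_sign):
        """Sketch of the Quoting Enclave's main path (all callables are placeholders)."""
        # 1. Verify the local attestation report using the Report key (§ 5.8.1).
        if not report_is_valid(report_body):
            raise ValueError("local attestation report rejected")
        # 2. Unseal the EPID member private key ("Attestation Key") that the
        #    Provisioning Enclave encrypted with a Provisioning Seal key.
        seal_key = egetkey_provisioning_seal_key()
        attestation_key = authenticated_decrypt(seal_key, encrypted_attestation_key)
        # 3. Replace the report's MAC with an Attestation Signature.
        return epid_sign(attestation_key, report_body)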

The SGX patents [108, 136] disclose in no uncertain terms that the Launch Enclave was introduced to ensure that each enclave's author has a business relationship with Intel, and implements a software licensing system. § 5.9.2 briefly discusses the implications, should this turn out to be true.

The remainder of the section argues that the Launch Enclave should be removed from the SGX design. § 5.9.3 explains that the LE is not required to enforce the computer owner's launch control policy, and concludes that the LE is only meaningful if it enforces a policy that is detrimental to the computer owner. § 5.9.4 debunks the myth that an enclave can host malware, which is likely to be used to justify the LE. § 5.9.5 argues that Anti-Virus (AV) software is not fundamentally incompatible with enclaves, further disproving the theory that Intel needs to actively police the software that runs inside enclaves.

5.9.1 Enclave Attributes Access Control

The SGX design requires that all enclaves be vetted by a Launch Enclave (LE), which is only briefly mentioned in Intel's official documentation. Neither its behavior nor its interface with the system software is specified. We speculate that Intel has not been forthcoming about the LE because of its role in enforcing software licensing, which will be discussed in § 5.9.2. This section abstracts away the licensing aspect and assumes that the LE enforces a black-box Launch Control Policy.

The LE approves an enclave by issuing an EINIT Token (EINITTOKEN), using the process illustrated in Figure 85. The EINITTOKEN structure contains the approved enclave's measurement-based (§ 5.6) and certificate-based (§ 5.7.2) identities, just like a local attestation REPORT (§ 5.8.1). This token is inspected by EINIT (§ 5.3.3), which refuses to initialize enclaves with incorrect tokens.

Figure 85: The SGX Launch Enclave computes the EINITTOKEN.

While an EINIT token is handled by untrusted system software, its integrity is protected by a MAC tag (§ 3.1.3) that is computed using a Launch Key obtained from EGETKEY. The EINIT implementation follows the same key derivation process as EGETKEY to convince itself that the EINITTOKEN provided to it was indeed generated by an LE that had access to the Launch Key.

The SDM does not document the MAC algorithm used to confer integrity guarantees to the EINITTOKEN structure. However, the EINIT pseudocode verifies the token's MAC tag using the same function that the EREPORT pseudocode uses to create the REPORT structure's MAC tag. It follows that the reasoning in § 5.8.1 can be reused to conclude that EINITTOKEN structures are MACed using AES-CMAC with 128-bit keys.

The EGETKEY instruction only derives the Launch Key for enclaves that have the LAUNCHKEY attribute set to true. The Launch Key is derived using the same process as the Seal Key (§ 5.7.5). The derivation material includes the current enclave's versioning information (ISVPRODID and ISVSVN) but it does not include the main fields that convey an enclave's identity, which are MRSIGNER and MRENCLAVE.
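Under the black-box policy assumption above, an LE's token issuance reduces to filling in the EINITTOKEN fields and MACing them with the Launch Key, as sketched below with AES-CMAC from the third-party Python cryptography package. The serialization of the MACed fields is an assumption; only the choice of MAC and the Launch Key's role follow from the SDM pseudocode.

    from cryptography.hazmat.primitives import cmac
    from cryptography.hazmat.primitives.ciphers import algorithms

    def issue_einittoken(approved, launch_key: bytes, policy_allows) -> dict:
        """Sketch of a Launch Enclave issuing an EINITTOKEN.

        `approved` carries the target enclave's MRENCLAVE, MRSIGNER and
        requested ATTRIBUTES; `policy_allows` is the black-box Launch Control
        Policy; `launch_key` stands in for an EGETKEY Launch key request.
        """
        if not policy_allows(approved):
            raise PermissionError("launch control policy rejected the enclave")

        maced_fields = (b"\x01" +                  # VALID bit
                        approved["attributes"] +
                        approved["mrenclave"] +
                        approved["mrsigner"])
        mac = cmac.CMAC(algorithms.AES(launch_key))
        mac.update(maced_fields)
        return {"maced_fields": maced_fields, "mac": mac.finalize()}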

The rest of the derivation material follows the same rules as the material used for Seal Keys.

The EINITTOKEN structure contains the identities of the approved enclave (MRENCLAVE and MRSIGNER) and the approved enclave attributes (ATTRIBUTES). The token also includes the information used for the Launch Key derivation, which includes the LE's Product ID (ISVPRODIDLE), SVN (ISVSVNLE), and the bitwise AND between the LE's ATTRIBUTES and the ATTRIBUTEMASK used in the KEYREQUEST (MASKEDATTRIBUTESLE).

The EINITTOKEN information used to derive the Launch Key can also be used by EINIT for damage control, e.g. to reject tokens issued by Launch Enclaves with known security vulnerabilities. The reference pseudocode supplied in the SDM states that EINIT checks the DEBUG bit in the MASKEDATTRIBUTESLE field, and will not initialize a production enclave using a token issued by a debugging LE. It is worth noting that MASKEDATTRIBUTESLE is guaranteed to include the LE's DEBUG attribute, because EGETKEY forces the DEBUG attribute's bit in the attributes mask to 1 (§ 5.7.5).

The check described above makes it safe for Intel to supply SGX enclave developers with a debugging LE that has its DEBUG attribute set, and performs minimal or no security checks before issuing an EINITTOKEN. The DEBUG attribute disables SGX's integrity protection, so the only purpose of the security checks performed in the debug LE would be to help enclave development by mimicking its production counterpart. The debugging LE can only be used to launch enclaves with the DEBUG attribute set, so it does not undermine Intel's ability to enforce a Launch Control Policy on production enclaves.

The enclave attributes access control system described above relies on the LE to reject initialization requests that set privileged attributes such as PROVISIONKEY on unauthorized enclaves. However, the LE cannot vet itself, as there will be no LE available when the LE itself needs to be initialized. Therefore, the Launch Key access restrictions are implemented in hardware.

EINIT accepts an EINITTOKEN whose VALID bit is set to zero, if the enclave's MRSIGNER (§ 5.7.1) equals a hard-coded value that corresponds to an Intel public key. For all other enclave authors, an invalid EINIT token causes EINIT to reject the enclave and produce an error code.

This exemption to the token verification policy provides a way to bootstrap the enclave attributes access control system, namely using a zeroed out EINITTOKEN to initialize the Launch Enclave. At the same time, the cryptographic primitives behind the MRSIGNER check guarantee that only Intel-provided enclaves will be able to bypass the attribute checks. This does not change SGX's security properties because Intel is already a trusted party, as it is responsible for generating the Provisioning Keys and Attestation Keys used by software attestation (§ 5.8.2).

Curiously, the EINIT pseudocode in the SDM states that the instruction enforces an additional restriction, which is that all enclaves with the LAUNCHKEY attribute must have their certificates issued by the same Intel public key that is used to bypass the EINITTOKEN checks. This restriction appears to be redundant, as the same restriction could be enforced in the Launch Enclave.

5.9.2 Licensing

The SGX patents [108, 136] disclose that EINIT Tokens and the Launch Enclave (§ 5.9.1) were introduced to verify that the SIGSTRUCT certificates associated with production enclaves are issued by enclave authors who have a business relationship with Intel. In other words, the Launch Enclave is intended to be an enclave licensing mechanism that allows Intel to force itself as an intermediary in the distribution of all enclave software.

The SGX patents are likely to represent an early version of the SGX design, due to the lengthy timelines associated with patent application approval. In light of this consideration, we cannot make any claims about Intel's current plans. However, given that we know for sure that Intel considered enclave licensing at some point, we briefly discuss the implications of implementing such a licensing plan.

Intel has a near-monopoly on desktop and server-class processors, and being able to decide which software vendors are allowed to use SGX can effectively put Intel in a position to decide winners and losers in many software markets.

Assuming SGX reaches widespread adoption, this issue is the software security equivalent of the Net Neutrality debates that have pitted the software industry against telecommunication giants. Given that virtually all competent software development companies have argued that losing Net Neutrality will stifle innovation, it is fairly safe to assume that Intel's ability to regulate access to SGX will also stifle innovation.
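Returning to the checks in § 5.9.1, the EINIT-side token handling (the Intel-signed bypass for invalid tokens, and the rejection of production enclaves launched by a debugging LE) can be summarized by the sketch below. INTEL_MRSIGNER and the attribute bit positions are placeholders; the control flow mirrors our reading of the SDM pseudocode, not the pseudocode itself.

    INTEL_MRSIGNER = b"..."   # hard-coded hash of an Intel public key (placeholder)
    DEBUG_BIT = 1 << 1        # hypothetical position of the DEBUG attribute

    def einit_token_check(enclave, token) -> None:
        """Raise if the EINITTOKEN does not authorize this enclave launch."""
        if not token["valid"]:
            # Only enclaves signed by Intel (e.g. the LE itself) may be
            # initialized with a zeroed-out token.
            if enclave["mrsigner"] != INTEL_MRSIGNER:
                raise PermissionError("invalid token for a non-Intel enclave")
            return

        if (token["mrenclave"] != enclave["mrenclave"] or
                token["mrsigner"] != enclave["mrsigner"]):
            raise PermissionError("token was issued for a different enclave")

        # Damage control: a token minted by a debugging LE cannot be used to
        # launch a production (non-DEBUG) enclave.
        le_is_debug = bool(token["masked_attributes_le"] & DEBUG_BIT)
        enclave_is_debug = bool(enclave["attributes"] & DEBUG_BIT)
        if le_is_debug and not enclave_is_debug:
            raise PermissionError("production enclave launched by a debug LE")
        # (The MAC check against the Launch Key is omitted here; see the
        # CMAC computation in the previous sketch.)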

Furthermore, from a historical perspective, the enclave licensing scheme described in the SGX patents is very similar to Verified Boot, which was briefly discussed in § 4.4. Verified Boot has mostly received negative reactions from software developers, so it is likely that an enclave licensing scheme would meet the same fate, should the developer community become aware of it.

5.9.3 System Software Can Enforce a Launch Policy

§ 5.3 explains that the SGX instructions used to load and initialize enclaves (ECREATE, EADD, EINIT) can only be issued by privileged system software, because they manage the EPC, which is a system resource.

A consequence of the restriction that only privileged software can issue ECREATE and EADD instructions is that the system software is able to track all the public contents that is loaded into each enclave. The privilege requirements of EINIT mean that the system software can also examine each enclave's SIGSTRUCT. It follows that the system software has access to a superset of the information that the Launch Enclave might use.

Furthermore, EINIT's privileged instruction status means that the system software can perform its own policy checks before allowing application software to initialize an enclave. So, the system software can enforce a Launch Control Policy set by the computer's owner. For example, an IaaS cloud service provider may use its hypervisor to implement a Launch Control Policy that limits what enclaves its customers are allowed to execute.

Given that the system software has access to a superset of the information that the Launch Enclave might use, it is easy to see that the set of policies that can be enforced by system software is a superset of the policies that can be supported by an LE. Therefore, the only rational explanation for the existence of the LE is that it was designed to implement a Launch Control Policy that is not beneficial to the computer owner.

As an illustration of this argument, we consider the case of restricting access to EGETKEY's Provisioning keys (§ 5.8.2). The derivation material for Provisioning keys does not include OWNEREPOCH, so malicious enclaves can potentially use these keys to track a CPU chip package as it exchanges owners. For this reason, the SGX design includes a simple access control mechanism that can be used by system software to limit enclave access to Provisioning keys. EGETKEY refuses to derive Provisioning keys for enclaves whose PROVISIONKEY attribute is not set to true.

It follows that a reasonable Launch Control Policy would only allow the PROVISIONKEY attribute to be set for the enclaves that implement software attestation, such as Intel's Provisioning Enclave and Quoting Enclave. This policy can easily be implemented by system software, given its exclusive access to the EINIT instruction.

The only concern with the approach outlined above is that a malicious system software might abuse the PROVISIONKEY attribute to generate a unique identifier for the hardware that it runs on, similar to the much maligned Intel Processor Serial Number [85]. We dismiss this concern by pointing out that system software has access to many unique identifiers, such as the Media Access Control (MAC) address of the Ethernet adapter integrated into the motherboard's chipset (§ 2.9.1).

5.9.4 Enclaves Cannot Damage the Host Computer

SGX enclaves execute at the lowest privilege level (user mode / ring 3), so they are subject to the same security checks as their host application. For example, modern operating systems set up the I/O maps (§ 2.7) to prevent application software from directly accessing the I/O address space (§ 2.4), and use the supervisor (S) page table attribute (§ 2.5.3) to deny application software direct access to memory-mapped devices (§ 2.4) and to the DRAM that stores the system software. Enclave software is subject to I/O privilege checks and address translation checks, so a malicious enclave cannot directly interact with the computer's devices, and cannot tamper with the system software.

It follows that software running in an enclave has the same means to compromise the system software as its host application, which come down to exploiting a security vulnerability. The same solutions used to mitigate vulnerabilities exploited by application software (e.g., seccomp/bpf [116]) apply to enclaves.

The only remaining concern is that an enclave can perform a denial of service (DoS) attack against the system software. The rest of this section addresses the concern.

The SGX design provides system software the tools it needs to protect itself from enclaves that engage in CPU hogging and DRAM hogging. As enclaves cannot perform I/O directly, these are the only two classes of DoS attacks available to them.

An enclave that attempts to hog an LP assigned to it can be preempted by the system software via an Inter-Processor Interrupt (IPI, § 2.12) issued from another processor. This method is available as long as the system software reserves at least one LP for non-enclave computation.
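To make the launch-policy argument in § 5.9.3 concrete, the sketch below shows a check that a kernel or hypervisor could run before executing EINIT on behalf of an application, for example restricting PROVISIONKEY to a whitelist of signers. The policy, field names, and helper functions are ours; the only point is that the system software already sees SIGSTRUCT and the requested attributes.

    PROVISIONKEY = 1 << 4    # hypothetical attribute bit position

    def owner_launch_policy(sigstruct, attributes, trusted_signers) -> bool:
        """Example policy: PROVISIONKEY only for whitelisted signers."""
        if attributes & PROVISIONKEY:
            return sigstruct["mrsigner"] in trusted_signers
        return True

    def guarded_einit(secs, sigstruct, token, einit, trusted_signers):
        # The kernel already sees every SIGSTRUCT and attribute request, so it
        # can simply refuse to execute EINIT -- no Launch Enclave involved.
        if not owner_launch_policy(sigstruct, secs["attributes"], trusted_signers):
            raise PermissionError("rejected by the owner's launch policy")
        return einit(secs, sigstruct, token)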

Furthermore, most OS kernels use tick schedulers, which use a real-time clock (RTC) configured to issue periodical interrupts (ticks) to all cores. The RTC interrupt handler invokes the kernel's scheduler, which chooses the thread that will get to use the logical processor until the next RTC interrupt is received. Therefore, kernels that use tick schedulers always have the opportunity to de-schedule enclave threads, and don't need to rely on the ability to send IPIs.

In SGX, the system software can always evict an enclave's EPC pages to non-EPC memory, and then to disk. The system software can also outright deallocate an enclave's EPC pages, though this will probably cause the enclave code to encounter page faults that cannot be resolved. The only catch is that the EPC pages that hold metadata for running enclave threads cannot be evicted or removed. However, this can easily be resolved, as the system software can always preempt enclave threads, using one of the methods described above.

5.9.5 Interaction with Anti-Virus Software

Today's anti-virus (AV) systems are glorified pattern matchers. AV software simply scans all the executable files on the system and the memory of running processes, looking for bit patterns that are thought to only occur in malicious software. These patterns are somewhat pompously called "virus signatures".

SGX (and TXT, to some extent) provides a method for executing code in an isolated container that we refer to as an enclave. Enclaves are isolated from all the other software on the computer, including any AV software that might be installed.

The isolation afforded by SGX opens up the possibility for bad actors to structure their attacks as a generic loader that would end up executing a malicious payload without tripping the AV's pattern matcher. More specifically, the attack would create an enclave and initialize it with a generic loader that looks innocent to an AV. The loader inside the enclave would obtain an encrypted malicious payload, and would undergo software attestation with an Internet server to obtain the payload's encryption key. The loader would then decrypt the malicious payload and execute it inside the enclave.

In the scheme suggested here, the malicious payload only exists in a decrypted form inside an enclave's memory, which cannot be accessed by the AV. Therefore, the AV's pattern matcher will not trip.

This issue does not have a solution that maintains the status quo for the AV vendors. The attack described above would be called a protection scheme if the payload were a proprietary image processing algorithm, or a DRM scheme.

On a brighter note, enclaves do not bring the complete extinction of AV; they merely require a change in approach. Enclave code always executes at the lowest privilege mode (ring 3 / user mode), so it cannot perform any I/O without invoking the services of system software. For all intents and purposes, this effectively means that enclave software cannot perform any malicious action without the complicity of system software. Therefore, enclaves can be policed effectively by intelligent AV software that records and filters the I/O performed by software, and detects malicious software according to the actions that it performs, rather than according to bit patterns in its code.

Furthermore, SGX's enclave loading model allows the possibility of performing static analysis on the enclave's software. For simplicity, assume the existence of a standardized static analysis framework. The initial enclave contents is not encrypted, so the system software can easily perform static analysis on it. Dynamically loaded code or Just-In-Time code generation (JIT) can be handled by requiring that all enclaves that use these techniques embed the static analysis framework and use it to analyze any dynamically loaded code before it is executed. The system software can use static verification to ensure that enclaves follow these rules, and refuse to initialize any enclaves that fail verification.

In conclusion, enclaves in and of themselves don't introduce new attack vectors for malware. However, the enclave isolation mechanism is fundamentally incompatible with the approach employed by today's AV solutions. Fortunately, it is possible (though non-trivial) to develop more intelligent AV software for enclave software.
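Because the initial enclave contents are loaded in plaintext through EADD, the kind of policing suggested above can hook the enclave loading path, as in the sketch below: scan each page as it is added and refuse to run EINIT if the scanner objects. The scanner interface and the instruction wrappers are hypothetical stand-ins for kernel code.

    def load_enclave(pages, scan_page, eadd, eextend, einit):
        """Sketch: system software vets plaintext enclave pages before launch.

        `scan_page` is a hypothetical AV / static-analysis hook; `eadd`,
        `eextend` and `einit` stand in for the privileged SGX instructions.
        """
        for vaddr, contents, perms in pages:
            if not scan_page(contents, perms):
                raise PermissionError(f"enclave page at {vaddr:#x} failed scanning")
            eadd(vaddr, contents, perms)     # contents are still in plaintext here
            eextend(vaddr)                   # extend MRENCLAVE over the page
        einit()                              # only reached if every page passed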

6 SGX ANALYSIS

6.1 SGX Implementation Overview

An under-documented and overlooked feat achieved by the SGX design is that implementing it on an Intel processor has a very low impact on the chip's hardware design. SGX's modifications to the processor's execution cores (§ 2.9.4) are either very small or completely inexistent. The CPU's uncore (§ 2.9.3, § 2.11.3) receives a new module, the Memory Encryption Engine, which appears to be fairly self-contained.

The bulk of the SGX implementation is relegated to the processor's microcode (§ 2.14), which supports a much higher development speed than the chip's electrical circuitry.

6.1.1 Execution Core Modifications

At a minimum, the SGX design requires a very small modification to the processor's execution cores (§ 2.9.4), in the Page Miss Handler (PMH, § 2.11.5).

The PMH resolves TLB misses, and consists of a fast path that relies on an FSM page walker, and a microcode assist fallback that handles the edge cases (§ 2.14.3). The bulk of SGX's memory access checks, which are discussed in § 6.2, can be implemented in the microcode assist.

The only modification to the PMH hardware that is absolutely necessary to implement SGX is developing an ability to trigger the microcode assist for all address translations when a logical processor (§ 2.9.4) is in enclave mode (§ 5.4), or when the physical address produced by the page walker FSM matches the Processor Reserved Memory (PRM, § 5.1) range.

The PRM range is configured by the PRM Range Registers (§ 5.1), which have exactly the same semantics as the Memory Type Range Registers (MTRRs, § 2.11.4) used to configure a variable memory range. The page walker FSM in the PMH is already configured to issue a microcode assist when the page tables are in uncacheable memory (§ 2.11.4). Therefore, the PRMRR can be represented as an extra MTRR pair.

6.1.2 Uncore Modifications

The SDM states that DMA transactions (§ 2.9.1) that target the PRM range are aborted by the processor. The SGX patents disclose that the PRMRR protection against unauthorized DMA is implemented by having the SGX microcode set up entries in the Source Address Decoder (SAD) in the uncore CBoxes and in the Target Address Decoder (TAD) in the integrated Memory Controller (MC).

§ 2.11.3 mentions that Intel's Trusted Execution Technology (TXT) [70] already takes advantage of the integrated MC to protect a DRAM range from DMA. It is highly likely that the SGX implementation reuses the mechanisms brought by TXT, and only requires the extension of the SADs and TADs by one entry.

SGX's major hardware modification is the Memory Encryption Engine (MEE) that is added to the processor's uncore (§ 2.9.3, § 2.11.3) to protect SGX's Enclave Page Cache (EPC, § 5.1.1) against physical attacks.

The MEE is briefly described in the ISCA 2015 SGX tutorial [102]. According to the information presented there, the MEE roughly follows the approach introduced by Aegis [172] [174], which relies on a variation of Merkle trees to provide the EPC with confidentiality, integrity, and freshness guarantees (§ 3.1). Unlike Aegis, the MEE uses non-standard cryptographic primitives that include a slightly modified AES operating mode (§ 3.1.2) and a Carter-Wegman [30, 185] MAC (§ 3.1.3) construction.

Both the ISCA SGX tutorial and the patents state that the MEE is connected to the Memory Controller (MC) integrated in the CPU's uncore. However, all sources are completely silent on further implementation details. The MEE overview slide states that "the Memory Controller detects [the] address belongs to the MEE region, and routes transaction to MEE", which suggests that the MEE is fairly self-contained and has a narrow interface to the rest of the MC.

Intel's SGX patents use the name Crypto Memory Aperture (CMA) to refer to the MEE. The CMA description matches the MEE and PRM concepts, as follows. According to the patents, the CMA is used to securely store the EPC, relies on crypto controllers in the MC, and loses its keys during deep sleep. These details align perfectly with the SDM's statements regarding the MEE and PRM.

The Intel patents also disclose that the EPCM (§ 5.1.2) and other structures used by the SGX implementation are also stored in the PRM. This rules out the possibility that the EPCM requires on-chip memory resembling the last-level cache (§ 2.11, § 2.11.3).

Last, the SGX patents shine a bit of light on an area that the official Intel documentation is completely silent about, namely the implementation concerns brought by computer systems with multiple processor chips. The patents state that the MEE also protects the Quick-Path Interconnect (QPI, § 2.9.1) traffic using link-layer encryption.

6.1.3 Microcode Modifications

According to the SGX patents, all the SGX instructions are implemented in microcode. This can also be deduced by reading the SDM's pseudocode for all the instructions, and realizing that it is highly unlikely that any SGX instruction can be implemented in 4 or fewer micro-ops (§ 2.10), which is the most that can be handled by the simple decoders used in the hardware fast paths (§ 2.14.1).

The Asynchronous Enclave Exit (AEX, § 5.4.3) behavior is also implemented in microcode. § 2.14.2 draws on an assortment of Intel patents to conclude that hardware exceptions (§ 2.8.2), including both faults and interrupts, trigger microcode events (§ 2.14.2). It follows that the SGX implementation can implement AEX by modifying the hardware exception handlers in the microcode.

The SGX initialization sequence is also implemented in microcode. SGX is initialized in two phases. First, it is very likely that the boot sequence in microcode (§ 2.14.4) was modified to initialize the registers associated with the SGX microcode. The ISCA SGX tutorial states that the MEE's keys are initialized during the boot process. Second, SGX instructions are enabled by setting a bit in a Model-Specific Register (MSR, § 2.4). This second phase involves enabling the MEE and configuring the SAD and TAD to protect the PRM range. Both tasks are amenable to a microcode implementation.

The SGX description in the SDM implies that the SGX implementation uses a significant number of new registers, which are only exposed to microcode. However, the SGX patents reveal that most of these registers are actually stored in DRAM.

For example, the patents state that each TCS (§ 5.2.4) has two fields that receive the values of the DR7 and IA32_DEBUGCTL registers when the processor enters enclave mode (§ 5.4.1), and are used to restore the original register values during enclave exit (§ 5.4.2). The SDM documents these fields as "internal CREGs" (CR_SAVE_DR7 and CR_SAVE_DEBUGCTL), which are stated to be "hardware specific registers".

The SGX patents document a small subset of the CREGs described in the SDM, summarized in Table 22, as microcode registers. While in general we trust official documentation over patents, in this case we use the CREG descriptions provided by the patents, because they appear to be more suitable for implementation purposes.

Table 22: The SGX implementation's registers (CREGs) that are documented in the SGX patents.

  SDM Name             Bits  Scope              Description
  CSR_SGX_OWNEREPOCH   128   CPU Chip Package   Used by EGETKEY (§ 5.7.5)
  CR_ENCLAVE_MODE      1     Logical Processor  1 when executing code inside an enclave
  CR_ACTIVE_SECS       16    Logical Processor  The index of the EPC page storing the current enclave's SECS
  CR_TCS_LA            64    Logical Processor  The virtual address of the TCS (§ 5.2.4) used to enter (§ 5.4.1) the current enclave
  CR_TCS_PH            16    Logical Processor  The index of the EPC page storing the TCS used to enter the current enclave
  CR_XSAVE_PAGE_0      16    Logical Processor  The index of the EPC page storing the first page of the current SSA (§ 5.2.5)

From a cost-performance standpoint, the cost of register memory only seems to be justified for the state used by the PMH to implement SGX's memory access checks, which will be discussed in § 6.2. The other pieces of state listed as CREGs are accessed so infrequently that storing them in dedicated SRAM would make very little sense.

The SGX patents state that SGX requires very few hardware changes, and most of the implementation is in microcode, as a positive fact. We therefore suspect that minimizing hardware changes was a high priority in the SGX design, and that any SGX modification proposals need to be aware of this priority.

6.2 SGX Memory Access Protection

SGX guarantees that the software inside an enclave is isolated from all the software outside the enclave, including the software running in other enclaves. This isolation guarantee is at the core of SGX's security model.

It is tempting to assume that the main protection mechanism in SGX is the Memory Encryption Engine (MEE) described in § 6.1.2, as it encrypts and MACs the DRAM's contents. However, the MEE sits in the processor's memory controller, which is at the edge of the on-chip memory hierarchy, below the caches (§ 2.11). Therefore, the MEE cannot protect an enclave's memory from software attacks.

The root of SGX's protections against software attacks is a series of memory access checks which prevents the currently running software from accessing memory that does not belong to it. Specifically, non-enclave software is only allowed to access memory outside the PRM range, while the code inside an enclave is allowed to access non-PRM memory, and the EPC pages owned by the enclave.

Although it is believed [50] that SGX's access checks are performed on every memory access, Intel's patents disclose that the checks are performed in the Page Miss Handler (PMH, § 2.11.5), which only handles TLB misses.

6.2.1 Functional Description

The intuition behind SGX's memory access protections can be built by considering what it would take to implement the same protections in a trusted operating system or hypervisor, solely by using the page tables that direct the CPU's address translation feature (§ 2.5).

The hypothetical trusted software proposed above can implement enclave entry (§ 5.4.1) as a system call (§ 2.8.1) that creates page table entries mapping the enclave's memory. Enclave exit (§ 5.4.2) can be a symmetric system call that removes the page table entries created during enclave entry. When modifying the page tables, the system software has to consider TLB coherence issues (§ 2.11.5) and perform TLB shootdowns when appropriate.

SGX leaves page table management under the system software's control, but it cannot trust the software to set up the page tables in any particular way. Therefore, the hypothetical design described above cannot be used by SGX as-is.
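The hypothetical trusted-OS design sketched in the last two paragraphs can be written down as a pair of system calls, as below. This is the paper's thought experiment, not SGX itself: SGX cannot rely on it precisely because page-table management stays under the control of untrusted system software. The page-table object and enclave descriptor are placeholders.

    def sys_enclave_enter(page_tables, enclave, tlb_shootdown):
        """Hypothetical trusted-OS system call: map the enclave before entry."""
        for vaddr, paddr, perms in enclave["mappings"]:
            page_tables.map(vaddr, paddr, perms)
        # Other cores may hold stale translations for these virtual addresses.
        tlb_shootdown(enclave["address_space"])

    def sys_enclave_exit(page_tables, enclave, tlb_shootdown):
        """Symmetric system call: unmap the enclave's memory on exit."""
        for vaddr, _paddr, _perms in enclave["mappings"]:
            page_tables.unmap(vaddr)
        tlb_shootdown(enclave["address_space"])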

address translation that comes out of the Page Miss Handler (PMH, § 2.11.5). The address translations that do not obey SGX's access control restrictions are rejected before they reach the TLBs.

SDM Name | Bits | Scope | Description
CSR_SGX_OWNEREPOCH | 128 | CPU Chip Package | Used by EGETKEY (§ 5.7.5)
CR_ENCLAVE_MODE | 1 | Logical Processor | 1 when executing code inside an enclave
CR_ACTIVE_SECS | 16 | Logical Processor | The index of the EPC page storing the current enclave's SECS
CR_TCS_LA | 64 | Logical Processor | The virtual address of the TCS (§ 5.2.4) used to enter (§ 5.4.1) the current enclave
CR_TCS_PH | 16 | Logical Processor | The index of the EPC page storing the TCS used to enter the current enclave
CR_XSAVE_PAGE_0 | 16 | Logical Processor | The index of the EPC page storing the first page of the current SSA (§ 5.2.5)

Table 22: The microcode registers (CREGs) used by the SGX implementation, as disclosed by Intel's patents and the SDM.

SGX's approach relies on the fact that software always references memory using virtual addresses, so all the micro-ops (§ 2.10) that reach the memory execution units (§ 2.10.1) use virtual addresses that must be resolved using the TLBs before the actual memory accesses are carried out. By contrast, the processor's microcode (§ 2.14) has the ability to issue physical memory accesses, which bypass the TLBs. Conveniently, SGX instructions are implemented in microcode (§ 6.1.3), so they can bypass the TLBs and access memory that is off limits to software, such as the EPC page holding an enclave's SECS (§ 5.1.3).

The SGX address translation checks use the information in the Enclave Page Cache Map (EPCM, § 5.1.2), which is effectively an inverted page table that covers the entire EPC. This means that each EPC page is accounted for by an EPCM entry, using the structure summarized in Table 23. The EPCM fields were described in detail in § 5.1.2, § 5.2.3, § 5.2.4, § 5.5.1, and § 5.5.2.

Field | Bits | Description
VALID | 1 | 0 for un-allocated EPC pages
BLOCKED | 1 | page is being evicted
R | 1 | enclave code can read
W | 1 | enclave code can write
X | 1 | enclave code can execute
PT | 8 | page type (Table 24)
ADDRESS | 48 | the virtual address used to access this page
ENCLAVESECS | | the EPC slot number for the SECS of the enclave owning the page

Table 23: The fields in an EPCM entry.

Type | Allocated by | Contents
PT_REG | EADD | enclave code and data
PT_SECS | ECREATE | SECS (§ 5.1.3)
PT_TCS | EADD | TCS (§ 5.2.4)
PT_VA | EPA | VA (§ 5.5.2)

Table 24: Values of the PT (page type) field in an EPCM entry.
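To make Table 23 concrete, the sketch below models an EPCM entry as a plain Python data structure. The field names follow the table; the bit widths survive only as comments, and the helper that stores the page's address relative to the enclave's ELRANGE base anticipates the representation discussed in § 6.2.2. None of this reflects Intel's actual bit-level layout.

from dataclasses import dataclass

PAGE_SHIFT = 12  # 4 KB EPC pages

@dataclass
class EpcmEntry:
    """Illustrative model of one EPCM entry (Table 23), not Intel's layout."""
    valid: bool = False       # 1 bit: 0 for un-allocated EPC pages
    blocked: bool = False     # 1 bit: page is being evicted
    r: bool = False           # 1 bit: enclave code can read
    w: bool = False           # 1 bit: enclave code can write
    x: bool = False           # 1 bit: enclave code can execute
    pt: str = "PT_REG"        # 8 bits: page type (Table 24)
    address: int = 0          # 48 bits: virtual page used to access this page
    enclavesecs: int = 0      # EPC slot number of the owning enclave's SECS

def relative_vpn(virtual_addr, elrange_base):
    # § 6.2.2: the stored page number is relative to the owning enclave's
    # ELRANGE base, which keeps the field narrow when ELRANGE is much smaller
    # than the CPU's virtual address space.
    return (virtual_addr - elrange_base) >> PAGE_SHIFT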

Conceptually, SGX adds the access control logic illustrated in Figure 86 to the PMH. SGX's security checks are performed after the page table attributes-based checks (§ 2.5.3) defined by the Intel architecture. It follows that SGX's access control logic has access to the physical address produced by the page walker FSM.

SGX's security checks depend on whether the logical processor (§ 2.9.4) is in enclave mode (§ 5.4) or not. While the processor is outside enclave mode, the PMH allows any address translation that does not target the PRM range (§ 5.1). When the processor is inside enclave mode, the PMH performs the checks described below, which provide the security guarantees described in § 5.2.3.

First, virtual addresses inside the enclave's virtual memory range (ELRANGE, § 5.2.1) must always translate into physical addresses inside the EPC. This way, an enclave is assured that all the code and data stored in ELRANGE receives SGX's confidentiality, integrity, and freshness guarantees. Since the memory outside ELRANGE does not enjoy these guarantees, the SGX design disallows having enclave code outside ELRANGE. This is most likely accomplished by setting the disable execution (XD, § 2.5.3) attribute on the TLB entry.

Second, an EPC page must only be accessed by the code of the enclave that owns the page. For the purpose of this check, each enclave is identified by the index of the EPC page that stores the enclave's SECS (§ 5.1.3). The current enclave's identifier is stored in the CR_ACTIVE_SECS microcode register during enclave entry. This register is compared against the enclave identifier stored in the EPCM entry corresponding to the EPC page targeted by the address translation.

Third, some EPC pages cannot be accessed by software. Pages that hold SGX internal structures, such as a SECS, a TCS (§ 5.2.4), or a VA (§ 5.5.2), must only be accessed by SGX's microcode, which uses physical addresses and bypasses the address translation unit, including the PMH. Therefore, the PMH rejects address translations targeting these pages.

Blocked (§ 5.5.1) EPC pages are in the process of being evicted (§ 5.5), so the PMH must not create new TLB entries targeting them.

Next, an enclave's EPC pages must always be accessed using the virtual addresses associated with them when they were allocated to the enclave. Regular EPC pages, which can be accessed by software, are allocated to enclaves using the EADD (§ 5.3.2) instruction, which reads the page's address in the enclave's virtual address space. This address is stored in the LINADDR field in the corresponding EPCM entry. Therefore, all the PMH has to do is to ensure that LINADDR in the address translation's target EPCM entry equals the virtual address that caused the TLB miss which invoked the PMH.

At this point, the PMH's security checks have completed, and the address translation result will definitely be added to the TLB. Before that happens, however, the SGX extensions to the PMH apply the access restrictions in the EPCM entry for the page to the address translation result. While the public SGX documentation we found did not describe this process, there is a straightforward implementation that fulfills SGX's security requirements. Specifically, the TLB entry bits P, W, and XD can be AND-ed with the EPCM entry bits R, W, and X.

[Figure 86: SGX adds a few security checks to the PMH. The checks ensure that all the TLB entries created by the address translation unit meet SGX's memory access restrictions.]
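The checks above can be collected into a single decision procedure. The Python sketch below mirrors the flow of Figure 86 for one address translation; the abort-page marker, the elrange tuple, the epc helper object, and the flag dictionary are assumptions made for this illustration, not the hardware's representation.

# Illustrative model of the PMH checks added by SGX (Figure 86, § 6.2.1).
# All names and data structures here are simplified assumptions.

class PageFault(Exception):
    pass

def sgx_tlb_fill(virtual_addr, phys_addr, tlb_flags, *,
                 in_enclave_mode, elrange, current_secs_index, epc):
    """Return the TLB flags to install, or "ABORT_PAGE", or raise PageFault."""
    in_prm = epc.in_prm(phys_addr)

    if not in_enclave_mode:
        return "ABORT_PAGE" if in_prm else tlb_flags

    if not (elrange[0] <= virtual_addr < elrange[1]):
        if in_prm:
            return "ABORT_PAGE"
        tlb_flags["XD"] = True            # no enclave code outside ELRANGE
        return tlb_flags

    entry = epc.epcm_entry(phys_addr)     # None if phys_addr is not an EPC page
    if entry is None or entry.blocked or entry.pt != "PT_REG":
        raise PageFault(virtual_addr)
    if entry.enclavesecs != current_secs_index:
        raise PageFault(virtual_addr)     # page owned by a different enclave
    if entry.address != virtual_addr >> 12:
        raise PageFault(virtual_addr)     # mapped at the wrong virtual address

    # Combine page-table permissions with the EPCM permissions
    # (P and R, W and W, XD or not X).
    tlb_flags["P"] = tlb_flags["P"] and entry.r
    tlb_flags["W"] = tlb_flags["W"] and entry.w
    tlb_flags["XD"] = tlb_flags["XD"] or not entry.x
    return tlb_flags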

6.2.2 EPCM Entry Representation

Most EPCM entry fields have obvious representations. The exceptions are the LINADDR and ENCLAVESECS fields, described below. These representations explain SGX's seemingly arbitrary limit on the size of an enclave's virtual address range (ELRANGE).

The SGX patents disclose that the LINADDR field in an EPCM entry stores the virtual page number (VPN, § 2.5.1) of the corresponding EPC page's expected virtual address, relative to the ELRANGE base of the enclave that owns the page.

The representation described above reduces the number of bits needed to store LINADDR, assuming that the maximum ELRANGE size is significantly smaller than the virtual address size supported by the CPU. This desire to save EPCM entry bits is the most likely motivation for specifying a processor model-specific ELRANGE size, which is reported by the CPUID instruction.

The SDM states that the ENCLAVESECS field of an EPCM entry corresponding to an EPC page indicates the SECS of the enclave that owns the page. Intel's patents reveal that the SECS address in ENCLAVESECS is represented as a physical page number (PPN, § 2.5.1) relative to the start of the EPC. Effectively, this relative PPN is the 0-based EPC page index.

The EPC page index representation saves bits in the EPCM entry, assuming that the EPCM size is significantly smaller than the physical address space supported by the CPU. The ISCA 2015 SGX tutorial slides mention an EPC size of 96 MB, which is significantly smaller than the physical addressable space on today's typical processors, which is 2^36 - 2^40 bytes.

6.2.3 PMH Hardware Modifications

The SDM describes the memory access checks performed after SGX is enabled, but does not provide any insight into their implementation. Intel's patents hint at three possible implementations that make different cost-performance tradeoffs. This section summarizes the three approaches and argues in favor of the implementation that requires the fewest hardware modifications to the PMH.

All implementations of SGX's security checks entail adding a pair of memory type range registers (MTRRs, § 2.11.4) to the PMH. These registers are named the Secure Enclave Range Registers (SERR) in Intel's patents. Enabling SGX on a logical processor initializes the SERR to the values of the Protected Memory Range Registers (PMRR, § 5.1).

Furthermore, all implementations have the same behavior when a logical processor is outside enclave mode. The memory type range described by the SERR is enabled, causing a microcode assist to trigger for every address translation that resolves inside the PRM. SGX's implementation uses the microcode assist to replace the address translation result with an address that causes memory access transactions to be aborted.

The three implementations differ in their behavior when the processor enters enclave mode (§ 5.4) and starts executing enclave code.

The alternative that requires the least amount of hardware changes sets up the PMH to trigger a microcode assist for every address translation. This can be done by setting the SERR to cover all the physical memory (e.g., by setting both the base and the mask to zero). In this approach, the microcode assist implements all the enclave mode security checks illustrated in Figure 86.

A speedier alternative adds a pair of registers to the PMH that represents the current enclave's ELRANGE and modifies the PMH so that, in addition to checking physical addresses against the SERR, it also checks the virtual addresses going into address translations against ELRANGE. When either check is true, the PMH invokes the microcode assist used by SGX to implement its memory access checks. Assuming the ELRANGE registers use the same base / mask representation as variable MTRRs, enclave exits can clear ELRANGE by zeroing both the base and the mask. This approach uses the same microcode assist implementation, minus the ELRANGE check that moves into the PMH hardware.

The second alternative described above has the benefit that the microcode assist is not invoked for enclave mode accesses outside ELRANGE. However, § 5.2.1 argues that an enclave should treat all the virtual memory addresses outside ELRANGE as untrusted storage, and only use that memory to communicate with software outside the enclave. Taking this into consideration, well-designed enclaves would spend relatively little time performing memory accesses outside ELRANGE. Therefore, this second alternative is unlikely to obtain performance gains that are worth its cost.

The last and most performant alternative would entail implementing all the access checks shown in Figure 86 in hardware. Similarly to the address translation FSM, the hardware would only invoke a microcode assist when a security check fails and a Page Fault needs to be handled.

The high-performance implementation described above avoids the cost of microcode assists for all TLB misses, assuming well-behaved system software. In this implementation, a microcode assist results in a Page Fault, which triggers an Asynchronous Enclave Exit (AEX, § 5.4.3). The cost of the AEX dominates the performance overhead of the microcode assist.

While this last implementation looks attractive, one needs to realize that TLB misses occur quite infrequently, so a large improvement in the TLB miss speed translates into a much less impressive improvement in overall enclave code execution performance. Taking this into consideration, it seems unwise to commit to extensive hardware modifications in the PMH before SGX gains

97.adoption. presented here also hold when the executed instruction 6.3 SGX Security Check Correctness sequence is considered in retirement order, for reasons that will be described below. In § 6.2.1, we argued that SGX’s security guarantees An LP will only transition between enclave mode and can be obtained by modifying the Page Miss Han- non-enclave mode at a few well-defined points, which are dler (PMH, § 2.11.5) to block undesirable address trans- EENTER (§ 5.4.1), ERESUME (§ 5.4.4), EEXIT (§ 5.4.2), lations from reaching the TLB. This section builds on the and Asynchronous Enclave Exits (AEX, § 5.4.3). Ac- result above and outlines a correctness proof for SGX’s cording to the SDM, all the transition points flush the memory access protection. TLBs and the out-of-order execution pipeline. In other Specifically, we outline a proof for the following in- words, the TLBs are guaranteed to be empty after every variant. At all times, all the TLB entries in every log- transition between enclave mode and non-enclave mode, ical processor will be consistent with SGX’s security so we can consider all these transitions to be trivial base guarantees. By the argument in § 6.2.1, the invariant cases for our induction proofs. translates into an assurance that all the memory accesses While SGX initialization is not thoroughly discussed, performed by software obey SGX’s security model. The the SDM mentions that loading some Model-Specific high-level proof structure is presented because it helps Registers (MSRs, § 2.4) triggers TLB flushes, and that understand how the SGX security checks come together. system software should flush TLBs when modifying By contrast, a detailed proof would be incredibly tedious, Memory Type Range Registers (MTRRs, § 2.11.4). and would do very little to boost the reader’s understand- Given that all the possible SGX implementations de- ing of SGX. scribed in § 6.2.3 entail adding a MTRR, it is safe to 6.3.1 Top-Level Invariant Breakdown assume that enabling SGX mode also results in a TLB We first break down the above invariant into specific flush and out-of-order pipeline flush, and can be used by cases based on whether a logical processor (LP) is ex- our induction proof as well. ecuting enclave code or not, and on whether the TLB All the base cases in the induction proofs are serializa- entries translate virtual addresses in the current enclave’s tion points for out-of-order execution, as the pipeline is ELRANGE (§ 5.2.1). When the processor is outside en- flushed during both enclave mode transitions and SGX clave mode, ELRANGE can be considered to be empty. initialization. This makes the proofs below hold when This reasoning yields the three cases outlined below. the program order instruction sequence is replaced with the retirement order sequence. 1. At all times when an LP is outside enclave mode, its The first invariant case holds because while the LP is TLB may only contain physical addresses belonging outside enclave mode, the SGX security checks added to to DRAM pages outside the PRM. the PMH (§ 6.2.1, Figure 86) reject any address transla- tion that would point into the PRM before it reaches the 2. At all times when an LP is inside enclave mode, TLBs. A key observation for proving the induction step the TLB entries for virtual addresses outside the of this invariant case is that the PRM never changes after current enclave’s ELRANGE must contain physical SGX is enabled on an LP. 
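The case split above can also be phrased as executable assertions over a toy model of one LP's TLB. The sketch below only restates the three invariant cases of § 6.3.1; the lp object, the prm tuple, and the expected_layout map are hypothetical stand-ins, and nothing here models the real microarchitecture.

# Toy restatement of the three invariant cases from § 6.3.1 (illustrative only).

def check_tlb_invariants(lp, tlb_entries, prm, expected_layout):
    """tlb_entries: iterable of (virt, phys) pairs cached for this LP.
    prm: (base, size).  expected_layout: virtual page -> physical EPC page
    demanded by the enclave author's memory layout (case 3)."""
    prm_base, prm_size = prm
    for virt, phys in tlb_entries:
        in_prm = prm_base <= phys < prm_base + prm_size
        if not lp.enclave_mode:
            assert not in_prm                      # case 1
        elif not (lp.elrange[0] <= virt < lp.elrange[1]):
            assert not in_prm                      # case 2
        else:
            assert phys == expected_layout[virt]   # case 3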
addresses belonging to DRAM pages outside the The second invariant case can be proved using a simi- PRM. lar argument. While an LP is executing an enclave’s code, 3. At all times when an LP is in enclave mode, the the SGX memory access checks added to the PMH reject TLB entries for virtual addresses inside the current any address translation that resolves to a physical address enclave’s ELRANGE must match the virtual mem- inside the PRM, if the translated virtual address falls out- ory layout specified by the enclave author. side the current enclave’s ELRANGE. The induction step for this invariant case can be proven by observing that a The first two invariant cases can be easily proven in- change in an LP’s current ELRANGE is always accom- dependently for each LP, by induction over the sequence panied by a TLB flush, which results in an empty TLB of instructions executed by the LP. For simplicity, the that trivially satisfies the invariant. This follows from the reader can assume that instructions are executed in pro- constraint that an enclave’s ELRANGE never changes gram mode. While the assumption is not true on proces- after it is established, and from the observation that the sors with out-of-order execution (§ 2.10), the arguments LP’s current enclave can only be changed by an enclave 97

98.entry, which must be preceded by an enclave exit, which EPCM entries allocated to an enclave are created triggers a TLB flush. by instructions that can only be issued before the en- The third invariant case is best handled by recognizing clave is initialized, specifically ECREATE (§ 5.3.1) and that the Enclave Page Cache Map (EPCM, § 5.1.2) is EADD (§ 5.3.2). The contents of the EPCM entries cre- an intermediate representation for the virtual memory ated by these instructions contributes to the enclave’s layout specified by the enclave authors. This suggests measurement (§ 5.6), together with the initial data loaded breaking down the case into smaller sub-invariants cen- into the corresponding EPC pages. tered around the EPCM, which will be proven in the § 3.3.2 argues that we can assume that enclaves with sub-sections below. incorrect measurements do not exist, as they will be re- jected by software attestation. Therefore, we can assume 1. At all times, each EPCM entry for a page that is that the attributes used to initialize EPCM pages match allocated to an enclave matches the virtual memory the enclave authors’ memory layout specifications. layout desired by the enclave’s author. EPCM entries can be evicted to untrusted DRAM, together with their corresponding EPC pages, by the 2. Assuming that the EPCM contents is constant, at EWB (§ 5.5.4) instruction. The ELDU / ELDB (§ 5.5) in- all times when an LP is in enclave mode, the TLB structions re-load evicted page contents and metadata entries for virtual addresses inside the current en- back into the EPC and EPCM. By induction, we can clave’s ELRANGE must match EPCM entries that assume that an EPCM entry matches the enclave au- belong to the enclave. thor’s specification when it is evicted. Therefore, if we can prove that the EPCM entry that is reloaded from 3. An EPCM entry is only modified when there is no DRAM is equivalent to the entry that was evicted, we mapping for it in any LP’s TLB. can conclude that the reloaded entry matches the author’s specification. The second and third invariant combined prove that A detailed analysis of the cryptographic primitives all the TLBs in an SGX-enabled computer always reflect used by the SGX design to protect the evicted EPC the contents of the EPCM, as the third invariant essen- page contents and its associated metadata is outside the tially covers the gaps in the second invariant. This result, scope of this work. Summarizing the description in § 5.5, in combination with the first invariant, shows that the the contents of evicted pages is encrypted using AES- EPCM is a bridge between the memory layout specifi- GMAC (§ 3.1.3), which is an authenticated encryption cations of the enclave authors and the TLB entries that mechanism. The MAC tag produced by AES-GMAC regulate what memory can be accessed by software ex- covers the EPCM metadata as well as the page data, and ecuting on the LPs. When further combined with the includes a 64-bit version that is stored in a version tree reasoning in § 6.2.1, the whole proof outlined here re- whose nodes are Version Array (VA, (§ 5.5.2) pages. sults in an end-to-end argument for the correctness of Assuming no cryptographic weaknesses, SGX’s SGX’s memory protection scheme. scheme does appear to guarantee the confidentiality, in- tegrity, and freshness of the EPC page contents and asso- 6.3.2 EPCM Entries Reflect Enclave Author Design ciated metadata while it is evicted in untrusted memory. 
This sub-section outlines the proof for the following in- It follows that EWB will only reload an EPCM entry if variant. At all times, each EPCM entry for a page that the contents is equivalent to the contents of an evicted is allocated to an enclave matches the virtual mem- entry. ory layout desired by the enclave’s author. The equivalence notion invoked here is slightly dif- A key observation, backed by the SDM pseudocode for ferent from perfect equality, in order to account for the SGX instructions, is that all the instructions that modify allowable operation of evicting an EPC page and its asso- the EPCM pages allocated to an enclave are synchro- ciated EPCM entry, and then reloading the page contents nized using a lock in the enclave’s SECS. This entails the to a different EPC page and a different EPCM entry, as existence of a time ordering of the EPCM modifications illustrated in Figure 69. Loading the contents of an EPC associated with an enclave. We prove the invariant stated page at a different physical address than it had before above using a proof by induction over this sequence of does not break the virtual memory abstraction, as long EPCM modifications. as the contents is mapped at the same virtual address 98
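To see why the page data and every EPCM field must be covered by the same tag, the sketch below binds them together under one MAC. It uses HMAC-SHA256 from the Python standard library purely as a stand-in for the AES-GMAC construction described above, and the field list and encoding are simplifications made for this sketch, not SGX's actual format.

import hashlib
import hmac
import struct

def eviction_tag(key, page_data, eid, linaddr, r, w, x, pt, nonce):
    """Stand-in for the tag computed over an evicted EPC page.
    Binding EID, LINADDR, the R/W/X flags, the page type, and the VA nonce is
    what blocks the page-reassignment, remapping, permission-tampering, and
    replay attacks enumerated in § 6.3.2."""
    metadata = struct.pack("<QQBBBBQ", eid, linaddr, r, w, x, pt, nonce)
    return hmac.new(key, metadata + page_data, hashlib.sha256).digest()

def reload_check(key, page_data, metadata_fields, expected_tag):
    # ELDU / ELDB recompute the tag and refuse to reload the page on mismatch.
    recomputed = eviction_tag(key, page_data, *metadata_fields)
    return hmac.compare_digest(recomputed, expected_tag)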

99.(the LINEARADDRESS EPCM field), and has the same to its version at t1 . access control attributes (R, W, X, PT EPCM fields) as it While replay attacks look relatively benign, they can had when it was evicted. be quite devastating when used to facilitate double spend- The rest of this section enumerates the address trans- ing. lation attacks prevented by the MAC verification that occurs in ELDU / ELDB. This is intended to help the reader develop some intuition for the reasoning behind 6.3.3 TLB Entries for ELRANGE Reflect EPCM Con- using the page data and all the EPCM fields to compute tents and verify the MAC tag. This sub-section sketches a proof for the following invari- The most obvious attack is prevented by having the ant. At all times when an LP is in enclave mode, the MAC cover the contents of the evicted EPC page, so the TLB entries for virtual addresses inside the current untrusted OS cannot modify the data in the page while it enclave’s ELRANGE must match EPCM entries that is stored in untrusted DRAM. The MAC also covers the belong to the enclave. The argument makes the assump- metadata that makes up the EPCM entry, which prevents tion that the EPCM contents is constant, which will be the more subtle attacks described below. justified in the following sub-section. The enclave ID (EID) field is covered by the MAC tag, The invariant can be proven by induction over the so the OS cannot evict an EPC page belonging to one sequence of TLB insertions that occur in the LP. This enclave, and assign the page to a different enclave when sequence is well-defined because an LP has a single it is loaded back into the EPC. If EID was not covered by PMH, so the address translation requests triggered by authenticity guarantees, a malicious OS could read any TLB misses must be serialized to be processed by the enclave’s data by evicting an EPC page belonging to the PMH. victim enclave, and loading it into a malicious enclave that would copy the page’s contents to untrusted DRAM. The proof’s induction step depends on the fact that the The virtual address (LINADDR) field is covered by TLB on hyper-threaded cores (§ 2.9.4) is dynamically the MAC tag, so the OS cannot modify the virtual mem- partitioned between the two LPs that share the core, and ory layout of an enclave by evicting an EPC page and no TLB entry is shared between the LPs. This allows specifying a different LINADDR when loading it back. our proof to consider the TLB insertions associated with If LINADDR was not covered by authenticity guarantees, one LP independently from the other LP’s insertions, a malicious OS could perform the exact attack shown in which means we don’t have to worry about the state (e.g., Figure 55 and described in § 3.7.3. enclave mode) of the other LP on the core. The page access permission flags (R, W, X) are also The proof is further simplified by observing that when covered by the MAC tag. This prevents the OS from an LP exits enclave mode, both its TLB and its out-of- changing the access permission bits in a page’s EPCM order instruction pipeline are flushed. Therefore, the entry by evicting the page and loading it back in. If enclave mode and current enclave register values used by the permission flags were not covered by authenticity address translations are guaranteed to match the values guarantees, the OS could use the ability to change EPCM obtained by performing the translations in program order. 
access permissions to facilitate exploiting vulnerabilities Having eliminated all the complexities associated with in enclave code. For example, exploiting a stack overflow hyper-threaded (§ 2.9.4) out-of-order (§ 2.10) execution vulnerability is generally easier if OS can make the stack cores, it is easy to see that the security checks outlined in pages executable. Figure 86 and § 6.2.1 ensure that TLB entries that target The nonce stored in the VA slot is also covered by EPC pages are guaranteed to reflect the constraints in the the MAC. This prevents the OS from mounting a replay corresponding EPCM entries. attack that reverts the contents of an EPC page to an Last, the SGX access checks implemented in the PMH older version. If the nonce would not be covered by reject any address translation for a virtual address in integrity guarantees, the OS could evict the target EPC ELRANGE that does not resolve to an EPC page. It page at different times t1 and t2 in the enclave’s life, and follows that memory addresses inside ELRANGE can then provide the EWB outputs at t1 to the ELDU / ELDB only map to EPC pages which, by the argument above, instruction. Without the MAC verification, this attack must follow the constraints of the corresponding EPCM would successfully revert the contents of the EPC page entries. 99

100.6.3.4 EPCM Entries are Not In TLBs When Modified verify that the relevant TLB entries have been flushed. Thus, we must base our proof on the assumption that In this sub-section, we outline a proof that an EPCM the SGX implementation produced by Intel’s engineers entry is only modified when there is no mapping for matches the claims in the SDM. In § 6.4, we propose a it in any LP’s TLB.. This proof analyzes each of the method for ensuring that EWB will only succeed when instructions that modify EPCM entries. all the LPs executing an enclave’s code at the time when For the purposes of this proof, we consider that setting ETRACK is called have exited enclave mode at least once the BLOCKED attribute does not count as a modification between the ETRACK call and the EWB call. Having to an EPCM entry, as it does not change the EPC page proven the existence of a correct algorithm by construc- that the entry is associated with, or the memory layout tion, we can only hope that the SGX implementation uses specification associated with the page. our algorithm, or a better algorithm that is still correct. The instructions that modify EPCM entries in such a way that the resulting EPCM entries have the VALID 6.4 Tracking TLB Flushes field set to true require that the EPCM entries were in- This section proposes a straightforward method that the valid before they were modified. These instructions are SGX implementation can use to verify that the system ECREATE (§ 5.3.1), EADD (§ 5.3.2), EPA (§ 5.5.2), and software plays its part correctly in the EPC page evic- ELDU / ELDB (§ 5.5). The EPCM entry targeted by any tion (§ 5.5) process. Our method meets the SDM’s spec- these instructions must have had its VALID field set to ification for EBLOCK (§ 5.5.1), ETRACK (§ 5.5.1) and false, so the invariant proved in the previous sub-section EWB (§ 5.5.4). implies that the EPCM entry had no TLB entry associ- The motivation behind this section is that, at least at ated with it. the time of this writing, there is no official SGX doc- Conversely, the instructions that modify EPCM en- umentation that contains a description of the mecha- tries and result in entries whose VALID field is false nism used by EWB to ensure that all the Logical Pro- start out with valid entries. These instructions are cessors (LPs, § 2.9.4) running an enclave’s code exit EREMOVE (§ 5.3.4) and EWB (§ 5.5.4). enclave mode (§ 5.4) between an ETRACK invocation The EPCM entries associated with EPC pages that and a EWB invocation. Knowing that there exists a cor- store Version Arrays (VA, § 5.5.2) represent a special rect mechanism that has the same interface as the SGX case for both instructions mentioned above, as these instructions described in the SDM gives us a reason to pages are not associated with any enclave. As these hope that the SGX implementation is also correct. pages can only be accessed by the microcode used to im- Our method relies on the fact that an enclave’s plement SGX, they never have TLB entries representing SECS (§ 5.1.3) is not accessible by software, and is them. Therefore, both EREMOVE and EWB can invalidate already used to store information used by the SGX mi- EPCM entries for VA pages without additional checks. crocode implementation (§ 6.1.3). We store the follow- EREMOVE only invalidates an EPCM entry associated ing fields in the SECS. tracking and done-tracking are with an enclave when there is no LP executing in enclave Boolean variables. 
tracked -threads and active-threads mode using a TCS associated with the same enclave. An are non-negative integers that start at zero and must EPCM entry can only result in TLB translations when an store numbers up to the number of LPs in the computer. LP is executing code from the entry’s enclave, and the lp-mask is an array of Boolean flags that has one mem- TLB translations are flushed when the LP exits enclave ber per LP in the computer. The fields are initialized as mode. Therefore, when EREMOVE invalidates an EPCM shown in Figure 87. entry, any associated TLB entry is guaranteed to have The active-threads SECS field tracks the number of been flushed. LPs that are currently executing the code of the enclave EWB’s correctness argument is more complex, as it who owns the SECS. The field is atomically incremented relies on the EBLOCK / ETRACK sequence described in by EENTER (§ 5.4.1) and ERESUME (§ 5.4.4) and is § 5.5.1 to ensure that any TLB entry that might have been atomically decremented by EEXIT (§ 5.4.2) and Asyn- created for an EPCM entry is flushed before the EPCM chronous Enclave Exits (AEXs, § 5.4.3). Asides from entry is invalidated. helping track TLB flushes, this field can also be used by Unfortunately, the SDM pseudocode for the instruc- EREMOVE (§ 5.3.4) to decide when it is safe to free an tions mentioned above leaves out the algorithm used to EPC page that belongs to an enclave. 100
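A minimal model of this bookkeeping state is sketched below; the class is a hypothetical illustration (the real fields live in the SECS, which software cannot access), and its defaults mirror the initialization performed by ECREATE in Figure 87.

from dataclasses import dataclass, field

MAX_LPS = 8   # assumed number of logical processors, for this sketch only

@dataclass
class TrackingState:
    """Hypothetical model of the SECS fields used for TLB flush tracking."""
    tracking: bool = False         # a tracking cycle is active
    done_tracking: bool = False    # every tracked LP has exited at least once
    active_threads: int = 0        # LPs currently running this enclave's code
    tracked_threads: int = 0       # LPs still to exit since the last ETRACK
    lp_mask: list = field(default_factory=lambda: [False] * MAX_LPS)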

ECREATE(SECS)
  ▷ Initialize the SECS state used for tracking.
  1 SECS.tracking ← FALSE
  2 SECS.done-tracking ← FALSE
  3 SECS.active-threads ← 0
  4 SECS.tracked-threads ← 0
  5 SECS.lp-mask ← 0

Figure 87: The algorithm used to initialize the SECS fields used by the TLB flush tracking method presented in this section.

As specified in the SDM, ETRACK activates TLB flush tracking for an enclave. In our method, this is accomplished by setting the tracking field to TRUE and the done-tracking field to FALSE.

When tracking is enabled, tracked-threads is the number of LPs that were executing the enclave's code when the ETRACK instruction was issued, and have not yet exited enclave mode. Therefore, executing ETRACK atomically reads active-threads and writes the result into tracked-threads. Also, lp-mask keeps track of the LPs that have exited the current enclave after the ETRACK instruction was issued. Therefore, the ETRACK implementation atomically zeroes lp-mask. The full ETRACK algorithm is listed in Figure 88.

ETRACK(SECS)
  ▷ Abort if tracking is already active.
  1 if SECS.tracking = TRUE
  2   then return SGX-PREV-TRK-INCMPL
  ▷ Activate TLB flush tracking.
  3 SECS.tracking ← TRUE
  4 SECS.done-tracking ← FALSE
  5 SECS.tracked-threads ← ATOMIC-READ(SECS.active-threads)
  6 for i ← 0 to MAX-LP-ID
  7   do ATOMIC-CLEAR(SECS.lp-mask[i])

Figure 88: The algorithm used by ETRACK to activate TLB flush tracking.

When an LP exits an enclave that has TLB flush tracking activated, we atomically test and set the current LP's flag in lp-mask. If the flag was not previously set, it means that an LP that was executing the enclave's code when ETRACK was invoked just exited enclave mode for the first time, and we atomically decrement tracked-threads to reflect this fact. In other words, lp-mask prevents us from double-counting an LP when it exits the same enclave while TLB flush tracking is active.

Once tracked-threads reaches zero, we are assured that all the LPs running the enclave's code when ETRACK was issued have exited enclave mode at least once, and can set the done-tracking flag. Figure 89 enumerates all the steps taken on enclave exit.

ENCLAVE-EXIT(SECS)
  ▷ Track an enclave exit.
  1 ATOMIC-DECREMENT(SECS.active-threads)
  2 if ATOMIC-TEST-AND-SET(SECS.lp-mask[LP-ID])
  3   then ATOMIC-DECREMENT(SECS.tracked-threads)
  4 if SECS.tracked-threads = 0
  5   then SECS.done-tracking ← TRUE

Figure 89: The algorithm that updates the TLB flush tracking state when an LP exits an enclave via EEXIT or AEX.

Without any compensating measure, the method above will incorrectly decrement tracked-threads if the LP exiting the enclave had entered it after ETRACK was issued. We compensate for this with the following trick. When an LP starts executing code inside an enclave that has TLB flush tracking activated, we set its corresponding flag in lp-mask. This is sufficient to avoid counting the LP when it exits the enclave. Figure 90 lists the steps required by our method when an LP enters an enclave.

ENCLAVE-ENTER(SECS)
  ▷ Track an enclave entry.
  1 ATOMIC-INCREMENT(SECS.active-threads)
  2 ATOMIC-SET(SECS.lp-mask[LP-ID])

Figure 90: The algorithm that updates the TLB flush tracking state when an LP enters an enclave via EENTER or ERESUME.

With these algorithms in place, EWB can simply verify that both tracking and done-tracking are TRUE. This ensures that the system software has triggered enclave exits on all the LPs that were running the enclave's code when ETRACK was executed. Figure 91 lists the algorithm used by the EWB tracking verification step.

EWB-VERIFY(virtual-addr)
  1 physical-addr ← TRANSLATE(virtual-addr)
  2 epcm-slot ← EPCM-SLOT(physical-addr)
  3 if EPCM[slot].BLOCKED = FALSE
  4   then return SGX-NOT-BLOCKED
  5 SECS ← EPCM-ADDR(EPCM[slot].ENCLAVESECS)
  ▷ Verify that the EPC page can be evicted.
  6 if SECS.tracking = FALSE
  7   then return SGX-NOT-TRACKED
  8 if SECS.done-tracking = FALSE
  9   then return SGX-NOT-TRACKED

Figure 91: The algorithm that ensures that all LPs running an enclave's code when ETRACK was executed have exited enclave mode at least once.

Last, EBLOCK marks the end of a TLB flush tracking cycle by clearing the tracking flag. This ensures that system software must go through another cycle of ETRACK and enclave exits before being able to use EWB on the page whose BLOCKED EPCM field was just set to TRUE by EBLOCK. Figure 92 shows the details.

EBLOCK(virtual-addr)
  1 physical-addr ← TRANSLATE(virtual-addr)
  2 epcm-slot ← EPCM-SLOT(physical-addr)
  3 if EPCM[slot].BLOCKED = TRUE
  4   then return SGX-BLKSTATE
  5 if SECS.tracking = TRUE
  6   then if SECS.done-tracking = FALSE
  7          then return SGX-ENTRYEPOCH-LOCKED
  8 SECS.tracking ← FALSE
  9 EPCM[slot].BLOCKED ← TRUE

Figure 92: The algorithm that marks the end of a TLB flushing cycle when EBLOCK is executed.

Our method's correctness can be easily proven by arguing that each SECS field introduced in this section has its intended value throughout enclave entries and exits.

6.5 Enclave Signature Verification

Let m be the public modulus in the enclave author's RSA key, and s be the enclave signature. Since the SGX design fixes the value of the public exponent e to 3, verifying the RSA signature amounts to computing the signed message M = s^3 mod m, checking that the value meets the PKCS v1.5 padding requirements, and comparing the 256-bit SHA-2 hash inside the message with the value obtained by hashing the relevant fields in the SIGSTRUCT supplied with the enclave.

This section describes an algorithm for computing the signed message while only using subtraction and multiplication on large non-negative integers. The algorithm admits a significantly simpler implementation than the typical RSA signature verification algorithm, by avoiding the use of long division and negative numbers. The description here is essentially the idea in [73], specialized for e = 3.

The algorithm provided here requires the signer to compute the q1 and q2 values shown below. The values can be computed from the public information in the signature, so they do not leak any additional information about the private signing key. Furthermore, the algorithm verifies the correctness of the values, so it does not open up the possibility for an attack that relies on supplying incorrect values for q1 and q2.

q1 = ⌊ s^2 / m ⌋
q2 = ⌊ (s^3 - q1 × s × m) / m ⌋

Due to the desirable properties mentioned above, it is very likely that the algorithm described here is used by the SGX implementation to verify the RSA signature in an enclave's SIGSTRUCT (§ 5.7.1).

The algorithm in Figure 93 computes the signed message M = s^3 mod m, while also verifying that the given values of q1 and q2 are correct. The latter is necessary because the SGX implementation of signature verification must handle the case where an attacker attempts to exploit the signature verification implementation by supplying invalid values for q1 and q2.

The rest of this section proves the correctness of the algorithm in Figure 93.

1. Compute u ← s × s and v ← q1 × m
2. If u < v, abort. q1 must be incorrect.
3. Compute w ← u - v
4. If w ≥ m, abort. q1 must be incorrect.
5. Compute x ← w × s and y ← q2 × m
6. If x < y, abort. q2 must be incorrect.
7. Compute z ← x - y.
8. If z ≥ m, abort. q2 must be incorrect.
9. Output z.

Figure 93: An RSA signature verification algorithm specialized for the case where the public exponent is 3. s is the RSA signature and m is the RSA key modulus. The algorithm uses two additional inputs, q1 and q2.

6.5.1 Analysis of Steps 1 - 4

Steps 1 - 4 in the algorithm check the correctness of q1 and use it to compute s^2 mod m. The key observation to understanding these steps is recognizing that q1 is the quotient of the integer division s^2 / m.

Having made this observation, we can use elementary division properties to prove that the supplied value for q1 is correct if and only if the following property holds.

0 ≤ s^2 - q1 × m < m

We observe that the first comparison, 0 ≤ s^2 - q1 × m, is equivalent to q1 × m ≤ s^2, which is precisely the check performed by step 2. We can also see that the second comparison, s^2 - q1 × m < m, corresponds to the condition verified by step 4. Therefore, if the algorithm passes step 4, it must be the case that the value supplied for q1 is correct.

We can also plug s^2, q1 and m into the integer division remainder definition to obtain the identity s^2 mod m = s^2 - q1 × m. However, according to the computations performed in steps 1 and 3, w = s^2 - q1 × m. Therefore, we can conclude that w = s^2 mod m.

6.5.2 Analysis of Steps 5 - 8

Similarly, steps 5 - 8 in the algorithm check the correctness of q2 and use it to compute w × s mod m. The key observation here is that q2 is the quotient of the integer division (w × s) / m.

We can convince ourselves of the truth of this observation by using the fact that w = s^2 mod m, which was proven above, by plugging in the definition of the remainder in integer division, and by taking advantage of the distributivity of integer multiplication with respect to addition.

⌊ (w × s) / m ⌋ = ⌊ ((s^2 mod m) × s) / m ⌋
                = ⌊ ((s^2 - ⌊ s^2 / m ⌋ × m) × s) / m ⌋
                = ⌊ (s^3 - ⌊ s^2 / m ⌋ × m × s) / m ⌋
                = ⌊ (s^3 - q1 × m × s) / m ⌋
                = ⌊ (s^3 - q1 × s × m) / m ⌋
                = q2

By the same argument used to analyze steps 1 - 4, we use elementary division properties to prove that q2 is correct if and only if the equation below is correct.

0 ≤ w × s - q2 × m < m

The equation's first comparison, 0 ≤ w × s - q2 × m, is equivalent to q2 × m ≤ w × s, which corresponds to the check performed by step 6. The second comparison, w × s - q2 × m < m, matches the condition verified by step 8. It follows that, if the algorithm passes step 8, it must be the case that the value supplied for q2 is correct.

By plugging w × s, q2 and m into the integer division remainder definition, we obtain the identity w × s mod m = w × s - q2 × m. Trivial substitution reveals that the computations in steps 5 and 7 result in z = w × s - q2 × m, which allows us to conclude that z = w × s mod m.

In the analysis for steps 1 - 4, we have proven that w = s^2 mod m. By substituting this into the above identity, we obtain the proof that the algorithm's output is indeed the desired signed message.

z = w × s mod m
  = (s^2 mod m) × s mod m
  = s^2 × s mod m
  = s^3 mod m

6.5.3 Implementation Requirements

The main advantage of the algorithm in Figure 93 is that it relies on the implementation of very few arithmetic operations on large integers. The maximum integer size that needs to be handled is twice the size of the modulus in the RSA key used to generate the signature.
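The algorithm in Figure 93 translates almost directly into code. The sketch below relies on Python's arbitrary-precision integers, so it only illustrates the control flow; an SGX-style implementation would use fixed-size big-number routines, and the PKCS #1 v1.5 padding and hash comparison that follow the modular exponentiation are omitted here. The toy modulus in the example is not a real RSA key.

def rsa3_signed_message(s, m, q1, q2):
    """Compute s^3 mod m following Figure 93, verifying q1 and q2 on the way.
    Raises ValueError when q1 or q2 is inconsistent with s and m."""
    u = s * s                      # step 1
    v = q1 * m
    if u < v:                      # step 2
        raise ValueError("q1 is incorrect")
    w = u - v                      # step 3: w = s^2 mod m if q1 is correct
    if w >= m:                     # step 4
        raise ValueError("q1 is incorrect")
    x = w * s                      # step 5
    y = q2 * m
    if x < y:                      # step 6
        raise ValueError("q2 is incorrect")
    z = x - y                      # step 7: z = w * s mod m if q2 is correct
    if z >= m:                     # step 8
        raise ValueError("q2 is incorrect")
    return z                       # step 9: z = s^3 mod m

# The signer supplies q1 = floor(s^2 / m) and q2 = floor((s^3 - q1*s*m) / m).
if __name__ == "__main__":
    m, s = 3233, 855                          # toy numbers, not a real key
    q1 = (s * s) // m
    q2 = (s ** 3 - q1 * s * m) // m
    assert rsa3_signed_message(s, m, q1, q2) == pow(s, 3, m)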

104. Steps 1 and 5 use large integer multiplication. Steps of times that the software attestation process needs to 3 and 7 use integer subtraction. Steps 2, 4, 6, and 8 use be performed in a distributed system. In fact, SGX’s large integer comparison. The checks in steps 2 and 6 software attestation process is implemented by enclaves guarantee that the results of the subtractions performed with special privileges that use the certificate-based iden- in steps 3 and 7 will be non-negative. It follows that the tity system to securely store the CPU’s attestation key in algorithm will never encounter negative numbers. untrusted memory. 6.6 SGX Security Properties 6.6.2 Physical Attacks We have summarized SGX’s programming model and the implementation details that are publicly documented We begin by discussing SGX’s resilience to the physical in Intel’s official documentation and published patents. attacks described in § 3.4. Unfortunately, this section We are now ready to bring this the information together is set to disappoint readers expecting definitive state- in an analysis of SGX’s security properties. We start ments. The lack of publicly available details around the the analysis by restating SGX’s security guarantees, and hardware implementation aspects of SGX precludes any spend the bulk of this section discussing how SGX fares rigorous analysis. However, we do know enough about when pitted against the attacks described in § 3. We SGX’s implementation to point out a few avenues for conclude the analysis with some troubling implications future exploration. of SGX’s lack of resistance to software side-channel Due to insufficient documentation, one can only hope attacks. that the SGX security model is not trivially circum- vented by a port attack (§ 3.4.1). We are particularly 6.6.1 Overview concerned about the Generic Debug eXternal Connec- Intel’s Software Guard Extensions (SGX) is Intel’s latest tion (GDXC) [124, 197], which collects and filters the iteration of a trusted hardware solution to the secure re- data transferred by the uncore’s ring bus (§ 2.11.3), and mote computation problem. The SGX design is centered reports it to an external debugger. around the ability to create an isolated container whose The SGX memory protection measures are imple- contents receives special hardware protections that are mented at the core level, in the Page Miss Han- intended to translate into confidentiality, integrity, and dler (PMH, § 2.11.5) (§ 6.2) and at the chip die level, freshness guarantees. in the memory controller (§ 6.1.2). Therefore, the code An enclave’s initial contents is loaded by the system and data inside enclaves is stored in plaintext in on-chip software on the computer, and therefore cannot contain caches (§ 2.11), which entails that the enclave contents secrets in plain text. Once initialized, an enclave is ex- travels without any cryptographic protection on the un- pected to participate in a software attestation process, core’s ring bus (§ 2.11.3). where it authenticates itself to a remote server. Upon suc- Fortunately, a recent Intel patent [165] indicates that cessful authentication, the remote server is expected to Intel engineers are tackling at least some classes of at- disclose some secrets to an enclave over a secure commu- tacks targeting debugging ports. nication channel. 
The SGX design attempts to guarantee The SDM and SGX papers discuss the most obvi- that the measurement presented during software attesta- ous class of bus tapping attacks (§ 3.4.2), which is the tion accurately represents the contents loaded into the DRAM bus tapping attack. SGX’s threat model con- enclave. siders DRAM and the bus connecting it to the CPU SGX also offers a certificate-based identity system that chip to be untrusted. Therefore, SGX’s Memory En- can be used to migrate secrets between enclaves that have cryption Engine (MEE, § 6.1.2) provides confidentiality, certificates issued by the same authority. The migration integrity and freshness guarantees to the Enclave Page process involves securing the secrets via authenticated Cache (EPC, § 5.1.1) data while it is stored in DRAM. encryption before handing them off to the untrusted sys- However, both the SGX papers and the ISCA 2015 tem software, which passes them to another enclave that tutorial on SGX admit that the MEE does not protect the can decrypt them. addresses of the DRAM locations accessed when cache The same mechanism used for secret migration can lines holding EPC data are evicted or loaded. This pro- also be used to cache the secrets obtained via software vides an opportunity for a malicious computer owner to attestation in an untrusted storage medium managed by observe an enclave’s memory access patterns by combin- system software. This caching can reduce the number ing a DRAM address line bus tap with carefully crafted 104

105.system software that creates artificial pressure on the last- key is encrypted with the GWK and transmitted to the level cache (LLC ,§ 2.11) lines that hold the enclave’s key generation server. At a later stage, the key generation EPC pages. server encrypts the key material that will be burned into On a brighter note, as mentioned in § 3.4.2, we are not the processor chip’s e-fuses with the PUF key, and trans- aware of any successful DRAM address line bus tapping mits the encrypted material to the chip. The PUF key attack. Furthermore, SGX is vulnerable to cache timing increases the cost of obtaining a chip’s fuse key material, attacks that can be carried out completely in software, so as an attacker must compromise both provisioning stages malicious computer owners do not need to bother setting in order to be able to decrypt the fuse key material. up a physical attack to obtain an enclave’s memory access As mentioned in previous sections, patents reveal de- patterns. sign possibilities considered by the SGX engineers. How- While the SGX documentation addresses DRAM bus ever, due to the length of timelines involved in patent ap- tapping attacks, it makes no mention of the System Man- plications, patents necessarily describe earlier versions of agement bus (SMBus, § 2.9.2) that connects the Intel the SGX implementation plans, which might not match Management Engine (ME, § 2.9.2) to various compo- the shipping implementation. We expect this might be nents on the computer’s motherboard. the case with the PUF provisioning patents, as it makes In § 6.6.5, we will explain that the ME needs to be little sense to include a PUF in a chip die and rely on taken into account when evaluating SGX’s memory pro- e-fuses and a GWK to store SGX’s root keys. Deriving tection guarantees. This makes us concerned about the the root keys from the PUF would be more resilient to possibility of an attack that taps the SMBus to reach into chip imaging attacks. the Intel ME. The SMBus is much more accessible than SGX’s threat model excludes power analysis at- the DRAM bus, as it has fewer wires that operate at a tacks (§ 3.4.4) and other side-channel attacks. This is significantly lower speed. Unfortunately, without more understandable, as power attacks cannot be addressed at information about the role that the Intel ME plays in a the architectural level. Defending against power attacks computer, we cannot move beyond speculation on this requires expensive countermeasures at the lowest levels topic. of hardware implementation, which can only be designed The threat model stated by the SGX design excludes by engineers who have deep expertise in both system se- physical attacks targeting the CPU chip (§ 3.4.3). Fortu- curity and Intel’s manufacturing process. It follows that nately, Intel’s patents disclose an array of countermea- defending against power analysis attacks has a very high sures aimed at increasing the cost of chip attacks. cost-to-benefit ratio. For example, the original SGX patents [108, 136] dis- 6.6.3 Privileged Software Attacks close that the Fused Seal Key and the Provisioning Key, which are stored in e-fuses (§ 5.8.2), are encrypted with The SGX threat model considers system software to be a global wrapping logic key (GWK). The GWK is a untrusted. 
This is a prerequisite for SGX to qualify as 128-bit AES key that is hard-coded in the processor’s a solution to the secure remote computation problem circuitry, and serves to increase the cost of extracting the encountered by software developers who wish to take ad- keys from an SGX-enabled processor. vantage of Infrastructure-as-a-Service (IaaS) cloud com- As explained in § 3.4.3, e-fuses have a large feature puting. size, which makes them relatively easy to “read” using a SGX’s approach is also an acknowledgement of the high-resolution microscope. In comparison, the circuitry realities of today’s software landscape, where the sys- on the latest Intel processors has a significantly smaller tem software that runs at high privilege levels (§ 2.3) feature size, and is more difficult to reverse engineer. is so complex that security researchers constantly find Unfortunately, the GWK is shared among all the chip dies vulnerabilities in it (§ 3.5). created from the same mask, so it has all the drawbacks The SGX design prevents malicious software from of global secrets explained in § 3.4.3. directly reading or from modifying the EPC pages that Newer Intel patents [67, 68] describe SGX-enabled store an enclave’s code and data. This security property processors that employ a Physical Unclonable Func- relies on two pillars in the SGX design. tion (PUF), e.g., [173], [131], which generates a symmet- First, the SGX implementation (§ 6.1) runs in the pro- ric key that is used during the provisioning process. cessor’s microcode (§ 2.14), which is effectively a higher Specifically, at an early provisioning stage, the PUF privilege level that system software does not have access 105

106.to. Along the same lines, SGX’s security checks (§ 6.2) values in the SSA used to enter the enclave, but stores are the last step performed by the PMH, so they cannot XCR0 (§ 2.6), FS and GS (§ 2.7) in the non-architectural be bypassed by any other architectural feature. area of the TCS (§ 6.1.3). At first glance, it may seem This implementation detail is only briefly mentioned elegant to remove this inconsistency and have EENTER in SGX’s official documentation, but has a large impact store the contents of the XCR0, FS, and GS registers on security. For context, Intel’s Trusted Execution Tech- in the current SSA, along with RSP and RBP. However, nology (TXT, [70]), which is the predecessor of SGX, this approach would break the Intel architecture’s guar- relied on Intel’s Virtual Machine Extensions (VMX) for antees that only system software can modify XCR0, and isolation. The approach was unsound, because software application software can only load segment registers us- running in System Management Mode (SMM, § 2.3) ing selectors that index into the GDT or LDT set up by could bypass the restrictions used by VMX to provide system software. Specifically, a malicious application isolation. could modify these privileged registers by creating an The security properties of SGX’s memory protection enclave that writes the desired values to the current SSA mechanisms are discussed in detail in § 6.6.4. locations backing up the registers, and then executes Second, SGX’s microcode is always involved when a EEXIT (§ 5.4.2). CPU transitions between enclave code and non-enclave Unfortunately, the following sections will reveal that code (§ 5.4), and therefore regulates all interactions be- while SGX offers rather thorough guarantees against tween system software and an enclave’s environment. straightforward attacks on enclaves, its guarantees are On enclave entry (§ 5.4.1), the SGX implementation almost non-existent when it comes to more sophisticated sets up the registers (§ 2.2) that make up the execution attacks, such as side-channel attacks. This section con- state (§ 2.6) of the logical processor (LP § 2.9.4), so cludes by describing what might be the most egregious a malicious OS or hypervisor cannot induce faults in side-channel vulnerability in SGX. the enclave’s software by tampering with its execution Most modern Intel processors feature hyper-threading. environment. On these CPUs, the execution units (§ 2.10) and When an LP transitions away from an enclave’s code caches (§ 2.11) on a core (§ 2.9.4) are shared by two due to a hardware exception (§ 2.8.2), the SGX imple- LPs, each of which has its own execution state. SGX mentation stashes the LP’s execution state into a State does not prevent hyper-threading, so malicious system Save Area (SSA, § 5.2.5) area inside the enclave and software can schedule a thread executing the code of a scrubs it, so the system software’s exception handler can- victim enclave on an LP that shares the core with an LP not access any enclave secrets that may be stored in the executing a snooping thread. This snooping thread can execution state. use the processor’s high-resolution performance counter The protections described above apply to the all the [150], in conjunction with microarchitectural knowledge levels of privileged software. 
SGX’s transitions between of the CPU’s execution units and out-of-order scheduler, an enclave’s code and non-enclave code place SMM to learn the instructions executed by the victim enclave, software on the same footing as the system software as well as its memory access patterns. at lower privilege levels. System Management Inter- This vulnerability can be fixed using two approaches. rupts (SMI, § 2.12, § 3.5), which cause the processor to The straightforward solution is to require cloud comput- execute SMM code, are handled using the same Asyn- ing providers to disable hyper-threading when offering chronous Enclave Exit (AEX, § 5.4.3) process as all other SGX. The SGX enclave measurement would have to hardware exceptions. be extended to include the computer’s hyper-threading Reasoning about the security properties of SGX’s tran- configuration, so the remote parties in the software at- sitions between enclave mode and non-enclave mode is testation process can be assured that their enclaves are very difficult. A correctness proof would have to take hosted by a secure environment. into account all the CPU’s features that expose registers. A more complex approach to fixing the hyper- Difficulty aside, such a proof would be very short-lived, threading vulnerability would entail having the SGX because every generation of Intel CPUs tends to intro- implementation guarantee that when an LP is executing duce new architectural features. The paragraph below an enclave’s code, the other LP sharing its core is either gives a taste of what such a proof would look like. inactive, or is executing the same enclave’s code. While EENTER (§ 5.4.1) stores the RSP and RBP register this approach is possible, its design would likely be quite 106

6.6.4 Memory Mapping Attacks

§ 5.4 explained that the code running inside an enclave uses the same address translation process (§ 2.5) and page tables as its host application. While this design approach makes it easy to retrofit SGX support into existing codebases, it also enables the address translation attacks described in § 3.7.

The SGX design protects the code inside enclaves against the active attacks described in § 3.7. These protections have been extensively discussed in prior sections, so we limit ourselves to pointing out SGX's answer to each active attack. We also explain the lack of protections against passive attacks, which can be used to learn an enclave's memory access pattern at 4KB page granularity.

SGX uses the Enclave Page Cache Map (EPCM, § 5.1.2) to store each EPC page's position in its enclave's virtual address space. The EPCM is consulted by SGX's extensions to the Page Miss Handler (PMH, § 6.2.1), which prevent straightforward active address translation attacks (§ 3.7.2) by rejecting undesirable address translations before they reach the TLB (§ 2.11.5).

SGX allows system software to evict (§ 5.5) EPC pages into untrusted DRAM, so that the EPC can be over-subscribed. The contents of the evicted pages and the associated EPCM metadata are protected by cryptographic primitives that offer confidentiality, integrity and freshness guarantees. This protects against the active attacks using page swapping described in § 3.7.3.

When system software wishes to evict EPC pages, it must follow the process described in § 5.5.1, which guarantees to the SGX implementation that all the LPs have invalidated any TLB entry associated with pages that will be evicted. This defeats the active attacks based on stale TLB entries described in § 3.7.4.

§ 6.3 outlines a correctness proof for the memory protection measures described above.

Unfortunately, SGX does not protect against passive address translation attacks (§ 3.7.1), which can be used to learn an enclave's memory access pattern at page granularity. While this appears benign, recent work [193] demonstrates the use of these passive attacks in a few practical settings, which are immediately concerning for image processing applications.

The rest of this section describes the theory behind planning a passive attack against an SGX enclave. The reader is directed to [193] for a fully working system.

Passive address translation attacks rely on the fact that memory accesses issued by SGX enclaves go through the Intel architecture's address translation process (§ 2.5), including delivering page faults (§ 2.8.2) and setting the accessed (A) and dirty (D) attributes (§ 2.5.3) on page table entries.

A malicious OS kernel or hypervisor can obtain the page-level trace of an application executing inside an enclave by setting the present (P) attribute to 0 on all the enclave's pages before starting enclave execution. While an enclave executes, the malicious system software maintains exactly one instruction page and one data page present in the enclave's address space.

When a page fault is generated, CR2 contains the virtual address of a page accessed by the enclave, and the error code indicates whether the memory access was a read or a write (bit 1) and whether the memory access is a data access or an instruction fetch access (bit 4). On a data access, the kernel tracing the enclave code's memory access pattern would set the P flag of the desired page to 1, and set the P flag of the previously accessed data page to 0. Instruction accesses can be handled in a similar manner.

For a slightly more detailed trace, the kernel can set a desired page's writable (W) attribute to 0 if the page fault's error code indicates a read access, and only set it to 1 for write accesses. Also, applications that use a page as both code and data (self-modifying code and just-in-time compiling VMs) can be handled by setting a page's disable execution (XD) flag to 0 for a data access, and by carefully accounting for the case where the last accessed data page is the same as the last accessed code page.

Leaving an enclave via an Asynchronous Enclave Exit (AEX, § 5.4.3) and re-entering the enclave via ERESUME (§ 5.4.4) causes the CPU to flush TLB entries that contain enclave addresses, so a tracing kernel would not need to worry about flushing the TLB. The tracing kernel does not need to flush the caches either, because the CPU needs to perform address translation even for cached data.

A straightforward way to reduce this attack's power is to increase the page size, so the trace contains less information. However, the attack cannot be completely prevented without removing the kernel's ability to over-subscribe the EPC, which is a major benefit of paging.
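The bookkeeping that such a tracing kernel performs on every page fault can be summarized in a few lines. The fragment below is a self-contained model rather than kernel code: page-table entries are represented by a flag array, and the faulting virtual address (which a kernel would read from CR2) and the error-code bits are supplied by the caller. The names and the number of enclave pages are hypothetical.

/* Self-contained model of the tracing kernel's page-fault bookkeeping
 * described above; it is not kernel code.  Page-table entries are
 * modeled as an array of present flags, and the faulting address and
 * error-code bits are passed in by the caller. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12
#define NUM_PAGES  1024              /* hypothetical number of enclave pages */

#define PF_WRITE   (1u << 1)         /* error code bit 1: write access      */
#define PF_IFETCH  (1u << 4)         /* error code bit 4: instruction fetch */

static bool pte_present[NUM_PAGES];  /* models the P attribute of each PTE  */
static long last_code_page = -1;     /* page currently left present for code */
static long last_data_page = -1;     /* page currently left present for data */

/* Called on every page fault taken while the enclave runs. */
static void trace_enclave_fault(uint64_t fault_va, uint32_t error_code) {
    long page = (long)((fault_va >> PAGE_SHIFT) % NUM_PAGES);
    long *last = (error_code & PF_IFETCH) ? &last_code_page : &last_data_page;

    /* One entry of the page-level trace. */
    printf("%s %s page %ld\n",
           (error_code & PF_IFETCH) ? "fetch" : "data",
           (error_code & PF_WRITE) ? "write" : "read", page);

    /* Keep exactly one code page and one data page present: unmap the
     * previously used page and map the page the enclave just touched. */
    if (*last >= 0)
        pte_present[*last] = false;
    pte_present[page] = true;
    *last = page;
}

int main(void) {
    /* Replay a hypothetical fault sequence. */
    trace_enclave_fault(0x401000, PF_IFETCH);
    trace_enclave_fault(0x603000, PF_WRITE);
    trace_enclave_fault(0x604000, 0);
    return 0;
}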

6.6.5 Software Attacks on Peripherals

Since the SGX design does not trust the system software, it must be prepared to withstand the attacks described in § 3.6, which can be carried out by the system software thanks to its ability to control peripheral devices on the computer's motherboard (§ 2.9.1). This section summarizes the security properties of SGX when faced with these attacks, based on publicly available information.

When SGX is enabled on an LP, it configures the memory controller (MC, § 2.11.3) integrated on the CPU chip die to reject any DMA transfer that falls within the Processor Reserved Memory (PRM, § 5.1) range. The PRM includes the EPC, so the enclaves' contents are protected from the PCI Express attacks described in § 3.6.1. This protection guarantee relies on the fact that the MC is integrated on the processor's chip die, so the MC configuration commands issued by SGX's microcode implementation (§ 6.1.3) are transmitted over a communication path that never leaves the CPU die, and therefore can be trusted.

SGX regards DRAM as an untrusted storage medium, and uses cryptographic primitives implemented in the MEE to guarantee the confidentiality, integrity and freshness of the EPC contents stored into DRAM. This protects against software attacks on DRAM's integrity, like the rowhammer attack described in § 3.6.2.

The SDM describes an array of measures that SGX takes to disable processor features intended for debugging when an LP starts executing an enclave's code. For example, enclave entry (§ 5.4.1) disables Precise Event Based Sampling (PEBS) for the LP, as well as any hardware breakpoints placed inside the enclave's virtual address range (ELRANGE, § 5.2.1). This addresses some of the attacks described in § 3.6.3, which take advantage of performance monitoring features to get information that typically requires access to hardware probes.

At the same time, the SDM does not mention anything about uncore PEBS counters, which can be used to learn about an enclave's LLC activity. Furthermore, the ISCA 2015 tutorial slides mention that SGX does not protect against software side-channel attacks that rely on performance counters.

This limitation in SGX's threat model leaves security-conscious enclave authors in a rather terrible situation. These authors know that SGX does not protect their enclaves against a class of software attacks. At the same time, they cannot even contemplate attempting to defeat these attacks on their own, due to lack of information. Specifically, the documentation that is publicly available from Intel does not provide enough information to model the information leakage due to performance counters.

For example, Intel does not document the mapping implemented in CBoxes (§ 2.11.3) between physical DRAM addresses and the LLC slices used to cache the addresses. This mapping impacts several uncore performance counters, and the impact is strong enough to allow security researchers to reverse-engineer the mapping [84, 133, 195]. Therefore, it is safe to assume that a malicious computer owner who knows the CBox mapping can use the uncore performance counters to learn about an enclave's memory access patterns.

The SGX papers mention that SGX's threat model includes attacks that overwrite the flash memory chip that stores the computer's firmware, which result in malicious code running in SMM. However, all the official SGX documentation is silent about the implications of an attack that compromises the firmware executed by the Intel ME.

§ 3.6.4 states that the ME's firmware is stored in the same flash memory as the boot firmware, and enumerates some of the ME's special privileges that enable it to help system administrators remotely diagnose and fix hardware and software issues. Given that the SGX design is concerned about the possibility of malicious computer firmware, it is reasonable to be concerned about malicious ME firmware.

§ 3.6.4 argues that an attacker who compromises the ME can carry out actions that are usually classified as physical attacks. An optimistic security researcher can observe that the most scary attack vector afforded by an ME takeover appears to be direct DRAM access, and SGX already assumes that the DRAM is untrusted. Therefore, an ME compromise would be equivalent to the DRAM attacks analyzed in § 6.6.2.

However, we are troubled by the lack of documentation on the ME's implementation, as certain details are critical to SGX's security analysis. For example, the ME is involved in the computer's boot process (§ 2.13, § 2.14.4), so it is unclear if it plays any part in the SGX initialization sequence. Furthermore, during the security boot stage (SEC, § 2.13.2), the bootstrap LP (BSP) is placed in Cache-As-Ram (CAR) mode so that the PEI firmware can be stored securely while it is measured. This suggests that it would be convenient for the ME to receive direct access to the CPU's caches, so that the ME's TPM implementation can measure the firmware directly. At the same time, a special access path from the ME to the CPU's caches might sidestep the MEE, allowing an attacker who has achieved ME code execution to directly read the EPC's contents.
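As a simplified model of the DMA filtering described at the beginning of this section, the check below captures what the memory controller conceptually does once SGX configures it: any transfer that overlaps the PRM range is rejected. The base and size values are placeholders, and the function name is ours; on real hardware the range is programmed by SGX's microcode implementation.

/* Simplified model of the check the memory controller performs on DMA
 * transfers once SGX is enabled: any transfer overlapping the Processor
 * Reserved Memory range is rejected.  The base and size values below
 * are placeholders, not real register contents. */
#include <stdbool.h>
#include <stdint.h>

static const uint64_t prm_base = 0x80000000ULL;   /* placeholder base    */
static const uint64_t prm_size = 0x08000000ULL;   /* placeholder: 128 MB */

static bool dma_transfer_allowed(uint64_t addr, uint64_t len) {
    uint64_t prm_end = prm_base + prm_size;       /* exclusive bound */
    uint64_t end = addr + len;                    /* exclusive bound */
    bool overlaps_prm = (addr < prm_end) && (end > prm_base);
    return !overlaps_prm;
}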

6.6.6 Cache Timing Attacks

The SGX threat model excludes the cache timing attacks described in § 3.8. The SGX documentation bundles these attacks together with other side-channel attacks and summarily dismisses them as complex physical attacks. However, cache timing attacks can be mounted entirely by unprivileged software running at ring 3. This section describes the implications of SGX's environment and threat model on cache timing attacks.

The main difference between SGX and a standard architecture is that SGX's threat model considers the system software to be untrusted. As explained earlier, this accurately captures the situation in remote computation scenarios, such as cloud computing. SGX's threat model implies that the system software can be carrying out a cache timing attack on the software inside an enclave.

Malicious system software translates into significantly more powerful cache timing attacks, compared to those described in § 3.8. The system software is in charge of scheduling threads on LPs, and also in charge of setting up the page tables used by address translation (§ 2.5), which control cache placement (§ 2.11.5).

For example, the malicious kernel that sets out to trace an enclave's memory access patterns, described in § 6.6.4, can improve the accuracy of a cache timing attack by using page coloring [115] principles to partition [127] the cache targeted by the attack. In a nutshell, the kernel divides the cache's sets (§ 2.11.2) into two regions, as shown in Figure 94.

Figure 94: A malicious OS can partition a cache between the software running inside an enclave and its own malicious code. Both the OS and the enclave software have cache sets dedicated to them. When allocating DRAM to itself and to the enclave software, the malicious OS is careful to only use DRAM regions that map to the appropriate cache sets. On a system with an Intel CPU, the OS can partition the L2 cache by manipulating the page tables in a way that is completely oblivious to the enclave's software.

The system software stores all the victim enclave's code and data in DRAM addresses that map to the cache sets in one of the regions, and stores its own code and data in DRAM addresses that map to the other region's cache sets. The snooping thread's code is assumed to be a part of the OS. For example, in a typical 256 KB (per-core) L2 cache organized as 512 8-way sets of 64-byte lines, the tracing kernel could allocate lines 0-63 for the enclave's code page, lines 64-127 for the enclave's data page, and use lines 128-511 for its own pages.

To the best of our knowledge, there is no minor modification to SGX that would provably defend against cache timing attacks. However, the SGX design could take a few steps to increase the cost of cache timing attacks. For example, SGX's enclave entry implementation could flush the core's private caches, which would prevent cache timing attacks from targeting them. This measure would defeat the cache timing attacks described below, and would only be vulnerable to more sophisticated attacks that target the shared LLC, such as [129, 194]. The description above assumes that multi-threading has been disabled, for the reasons explained in § 6.6.3.

Barring the additional protection measures described above, a tracing kernel can extend the attack described in § 6.6.4 with the steps outlined below to take advantage of cache timing and narrow down the addresses in an application's memory access trace to cache line granularity.

Right before entering an enclave via EENTER or ERESUME, the kernel would issue CLFLUSH instructions to flush the enclave's code page and data page from the cache. The enclave could have accessed a single code page and a single data page, so flushing the cache should be reasonably efficient. The tracing kernel then uses 16 bogus pages (8 for the enclave's code page, and 8 for the enclave's data page) to load all the 8 ways in the 128 cache sets allocated to enclave pages. After an AEX gives control back to the tracing kernel, it can read the 16 bogus pages, and exploit the time difference between an L2 cache hit and a miss to see which cache lines were evicted and replaced by the enclave's memory accesses.
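The partitioning arithmetic and the prime and probe steps described above can be expressed compactly. For the example 256 KB, 512-set, 8-way L2 cache with 64-byte lines, a physical address's set index is given by bits 6 through 14, so a 4 KB page covers 64 consecutive sets and its color is given by bits 12 through 14. The sketch below shows the color computation a tracing kernel could use when segregating DRAM, together with a prime and probe pass over sets 0-127. It assumes the buffer is physically contiguous and starts at a color-0 page, which a kernel can arrange, and the miss threshold is a hypothetical value.

/* Sketch of the partitioning arithmetic and the prime/probe steps for the
 * 256 KB, 512-set, 8-way, 64-byte-line L2 cache used as an example above.
 * The buffer is assumed to be physically contiguous and to start at a
 * physical address whose page color is 0; a kernel can arrange both. */
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>          /* __rdtscp */

#define LINE_SIZE  64
#define NUM_SETS   512
#define NUM_WAYS   8
#define PAGE_SIZE  4096
#define STRIDE     (NUM_SETS * LINE_SIZE)          /* 32 KB: colors repeat  */
#define COLORS     (STRIDE / PAGE_SIZE)            /* 8 page colors         */
#define ENCLAVE_SETS (2 * PAGE_SIZE / LINE_SIZE)   /* sets 0-127            */
#define MISS_THRESHOLD 100                         /* cycles, hypothetical  */

/* Which L2 set and page color a physical address maps to. */
static inline unsigned l2_set(uint64_t paddr)     { return (paddr / LINE_SIZE) % NUM_SETS; }
static inline unsigned page_color(uint64_t paddr) { return (paddr / PAGE_SIZE) % COLORS;   }

/* 8 x 32 KB: within each 32 KB stride, the first two 4 KB pages have
 * colors 0 and 1, so they map to sets 0-127, the sets reserved for the
 * enclave's code and data pages in the partitioning described above. */
static uint8_t bogus[NUM_WAYS * STRIDE] __attribute__((aligned(PAGE_SIZE)));

/* Prime: load all 8 ways of sets 0-127 with the kernel's own lines.
 * (Right before EENTER/ERESUME, the kernel would also CLFLUSH the
 * enclave's currently mapped code and data page.) */
static void prime(void) {
    for (unsigned way = 0; way < NUM_WAYS; way++)
        for (unsigned off = 0; off < 2 * PAGE_SIZE; off += LINE_SIZE)
            (void)*(volatile uint8_t *)&bogus[way * STRIDE + off];
}

/* Probe (after the AEX): re-read the same lines; a reload that takes
 * longer than the threshold means the enclave touched that set. */
static void probe(unsigned touched[ENCLAVE_SETS]) {
    unsigned aux;
    for (unsigned way = 0; way < NUM_WAYS; way++)
        for (unsigned off = 0; off < 2 * PAGE_SIZE; off += LINE_SIZE) {
            uint64_t t0 = __rdtscp(&aux);
            (void)*(volatile uint8_t *)&bogus[way * STRIDE + off];
            uint64_t t1 = __rdtscp(&aux);
            if (t1 - t0 > MISS_THRESHOLD)
                touched[off / LINE_SIZE]++;
        }
}

int main(void) {
    static unsigned touched[ENCLAVE_SETS];
    prime();
    /* ... the kernel would EENTER/ERESUME the victim enclave here ... */
    probe(touched);
    for (unsigned s = 0; s < ENCLAVE_SETS; s++)
        if (touched[s])
            printf("set %u touched (%u slow ways)\n", s, touched[s]);
    return 0;
}

Measurement noise handling and the serialization of the timing reads are omitted; a real attack would average over many enclave entries.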

An extreme approach that can provably defeat cache timing attacks is disabling caching for the PRM range, which contains the EPC. The SDM is almost completely silent about the PRM, but the SGX manuals that it is based on state that the allowable caching behaviors (§ 2.11.4) for the PRM range are uncacheable (UC) and write-back (WB). This could become useful if the SGX implementation would make sure that the PRM's caching behavior cannot be changed while SGX is enabled, and if the selected behavior would be captured by the enclave's measurement (§ 5.6).

6.6.7 Software Side-Channel Attacks and SGX

The SGX design reuses a few terms from the Trusted Platform Module (TPM, § 4.4) design. This helps software developers familiar with TPM understand SGX faster. At the same time, the term reuse invites the assumption that SGX's software attestation is implemented in tamper-resistant hardware, similarly to the TPM design.

§ 5.8 explains that, in fact, the SGX design delegates the creation of attestation signatures to software that runs inside a Quoting Enclave with special privileges that allow it to access the processor's attestation key. Restated, SGX includes an enclave whose software reads the attestation key and produces attestation signatures.

Creating the Quoting Enclave is a very elegant way of reducing the complexity of the hardware implementation of SGX, assuming that the isolation guarantees provided by SGX are sufficient to protect the attestation key. However, the security analysis in § 6.6 reveals that enclaves are vulnerable to a vast array of software side-channel attacks, which have been demonstrated effective in extracting a variety of secrets from isolated environments.

The gaps in the security guarantees provided to enclaves place a large amount of pressure on Intel's software developers, as they must attempt to implement the EPID signing scheme used by software attestation without leaking any information. Intel's ISCA 2015 SGX tutorial slides suggest that the SGX designers will advise developers to write their code in a way that avoids data-dependent memory accesses, as suggested in § 3.8.4, and perhaps provide analysis tools that detect code that performs data-dependent memory accesses.

The main drawback of the approach described above is that it is extremely cumbersome. § 3.8.4 describes that, while it may be possible to write simple pieces of software in such a way that they do not require data-dependent memory accesses, there is no known process that can scale this to large software systems. For example, each virtual method call in an object-oriented language results in data-dependent code fetches.

The ISCA 2015 SGX tutorial slides also suggest that the efforts of removing data-dependent memory accesses should focus on cryptographic algorithm implementations, in order to protect the keys that they handle. This is a terribly misguided suggestion, because cryptographic key material has no intrinsic value. Attackers derive benefits from obtaining the data that is protected by the keys, such as medical and financial records.

Some security researchers focus on protecting cryptographic keys because they are the target of today's attacks. Unfortunately, it is easy to lose track of the fact that keys are being attacked simply because they are the lowest hanging fruit. A system that can only protect the keys will have a very small positive impact, as the attackers will simply shift their focus to the algorithms that process the valuable information, and use the same software side-channel attacks to obtain that information directly.

The second drawback of the approach described towards the beginning of this section is that while eliminating data-dependent memory accesses should thwart the attacks described in § 6.6.4 and § 6.6.6, the measure may not be sufficient to prevent the hyper-threading attacks described in § 6.6.3. The level of sharing between the two logical processors (LP, § 2.9.4) on the same CPU core is so high that it is possible that a snooping LP can learn more than the memory access pattern from the other LP on the same core.

For example, if the number of cycles taken by an integer ALU to execute a multiplication or division micro-op (§ 2.10) depends on its inputs, the snooping LP could learn some information about the numbers multiplied or divided by the other LP. While this may be a simple example, it is safe to assume that the Quoting Enclave will be studied by many motivated attackers, and that any information leak will be exploited.
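To illustrate the coding style that "avoids data-dependent memory accesses", the sketch below contrasts a secret-indexed table lookup, whose page- and cache-level footprint reveals the index to the attacks of § 6.6.4 and § 6.6.6, with a data-oblivious variant that touches every entry and selects the desired one with a branch-free mask. The function names are ours. The same concern applies to indirect calls through function-pointer tables, the C analogue of the virtual method calls mentioned above, and, as noted, this style only hides the memory access trace; it does not address the hyper-threading leakage of § 6.6.3.

/* Illustration of the data-oblivious style discussed above.  The memory
 * access pattern of leaky_lookup depends on the secret index, so a
 * page-level or cache-level trace reveals it.  oblivious_lookup reads
 * every table entry and selects the wanted one with a branch-free mask,
 * so its access pattern is the same for every secret. */
#include <stdint.h>

#define TABLE_SIZE 256

/* Access pattern depends on the secret: leaks through paging and caches. */
uint32_t leaky_lookup(const uint32_t table[TABLE_SIZE], uint8_t secret) {
    return table[secret];
}

/* Access pattern is independent of the secret. */
uint32_t oblivious_lookup(const uint32_t table[TABLE_SIZE], uint8_t secret) {
    uint32_t result = 0;
    for (uint32_t i = 0; i < TABLE_SIZE; i++) {
        /* mask is all ones when i == secret, all zeros otherwise. */
        uint32_t mask = (uint32_t)((((uint64_t)(i ^ secret)) - 1) >> 32);
        result |= table[i] & mask;
    }
    return result;
}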

7 CONCLUSION

Shortly after we learned about Intel's Software Guard Extensions (SGX) initiative, we set out to study it in the hope of finding a practical solution to its vulnerability to cache timing attacks. After reading the official SGX manuals, we were left with more questions than when we started. The SGX patents filled some of the gaps in the official documentation, but also revealed Intel's enclave licensing scheme, which has troubling implications.

After learning about the SGX implementation and inferring its design constraints, we discarded our draft proposals for defending enclave software against cache timing attacks. We concluded that it would be impossible to claim to provide this kind of guarantee given the design constraints and all the unknowns surrounding the SGX implementation. Instead, we applied the knowledge that we gained to design Sanctum [38], which is briefly described in § 4.9.

This paper describes our findings while studying SGX. We hope that it will help fellow researchers understand the breadth of issues that need to be considered before accepting a trusted hardware design as secure. We also hope that our work will prompt the research community to expect more openness from the vendors who ask us to trust their hardware.

8 ACKNOWLEDGEMENTS

Funding for this research was partially provided by the National Science Foundation under contract number CNS-1413920.

REFERENCES

[1] FIPS 140-2 Consolidated Validation Certificate No. 0003. 2011.
[2] IBM 4765 Cryptographic Coprocessor Security Module - Security Policy. Dec 2012.
[3] Sha1 deprecation policy. http://blogs.technet.com/b/pki/archive/2013/11/12/sha1-deprecation-policy.aspx, 2013. [Online; accessed 4-May-2015].
[4] 7-zip lzma benchmark: Intel haswell. http://www.7-cpu.com/cpu/Haswell.html, 2014. [Online; accessed 10-February-2015].
[5] Bios freedom status. https://puri.sm/posts/bios-freedom-status/, Nov 2014. [Online; accessed 2-Dec-2015].
[6] Gradually sunsetting sha-1. http://googleonlinesecurity.blogspot.com/2014/09/gradually-sunsetting-sha-1.html, 2014. [Online; accessed 4-May-2015].
[7] Ipc2 hardware specification. http://fit-pc.com/download/intense-pc2/documents/ipc2-hw-specification.pdf, Sep 2014. [Online; accessed 2-Dec-2015].
[8] Linux kernel: Cve security vulnerabilities, versions and detailed reports. http://www.cvedetails.com/product/47/Linux-Linux-Kernel.html?vendor_id=33, 2014. [Online; accessed 27-April-2015].
[9] Nist's policy on hash functions. http://csrc.nist.gov/groups/ST/hash/policy.html, 2014. [Online; accessed 4-May-2015].
[10] Xen: Cve security vulnerabilities, versions and detailed reports. http://www.cvedetails.com/product/23463/XEN-XEN.html?vendor_id=6276, 2014. [Online; accessed 27-April-2015].
[11] Xen project software overview. http://wiki.xen.org/wiki/Xen_Project_Software_Overview, 2015. [Online; accessed 27-April-2015].
[12] Seth Abraham. Time to revisit rep;movs - comment. https://software.intel.com/en-us/forums/topic/275765, Aug 2006. [Online; accessed 23-January-2015].
[13] Tiago Alves and Don Felton. Trustzone: Integrated hardware and software security. Information Quarterly, 3(4):18-24, 2004.
[14] Ittai Anati, Shay Gueron, Simon P Johnson, and Vincent R Scarlata. Innovative technology for cpu based attestation and sealing. In Proceedings of the 2nd International Workshop on Hardware and Architectural Support for Security and Privacy, HASP, volume 13, 2013.
[15] Ross Anderson. Security engineering: A guide to building dependable distributed systems. Wiley, 2001.
[16] Sebastian Anthony. Who actually develops linux? the answer might surprise you. http://www.extremetech.com/computing/175919-who-actually-develops-linux, 2014. [Online; accessed 27-April-2015].
[17] ARM Limited. AMBA AXI Protocol, Mar 2004. Reference no. IHI 0022B, IHI 0024B, AR500-DA-10004.
[18] ARM Limited. ARM Security Technology Building a Secure System using TrustZone Technology, Apr 2009. Reference no. PRD29-GENC-009492C.
[19] Sebastian Banescu. Cache timing attacks. 2011. [Online; accessed 26-January-2014].
[20] Elaine Barker, William Barker, William Burr, William Polk, and Miles Smid. Recommendation for key management part 1: General (revision 3). Federal Information Processing Standards (FIPS) Special Publications (SP), 800-57, Jul 2012.
[21] Elaine Barker, William Barker, William Burr, William Polk, and Miles Smid. Secure hash standard (shs). Federal Information Processing Standards (FIPS) Publications (PUBS), 180-4, Aug 2015.
[22] Friedrich Beck. Integrated Circuit Failure Analysis: a Guide to Preparation Techniques. John Wiley & Sons, 1998.
[23] Daniel Bleichenbacher. Chosen ciphertext attacks against protocols based on the rsa encryption standard

112. pkcs# 1. In Advances in Cryptology CRYPTO’98, ware isolation. Cryptology ePrint Archive, Report pages 1–12. Springer, 1998. 2015/564, 2015. http://eprint.iacr.org/. [24] D.D. Boggs and S.D. Rodgers. Microprocessor with [39] J. Daemen and V. Rijmen. Aes proposal: Rijndael, aes novel instruction for signaling event occurrence and algorithm submission, Sep 1999. for providing event handling information in response [40] S.M. Datta and M.J. Kumar. Technique for providing thereto, 1997. US Patent 5,625,788. secure firmware, 2013. US Patent 8,429,418. [25] Joseph Bonneau and Ilya Mironov. Cache-collision [41] S.M. Datta, V.J. Zimmer, and M.A. Rothman. System timing attacks against aes. In Cryptographic Hardware and method for trusted early boot flow, 2010. US Patent and Embedded Systems-CHES 2006, pages 201–215. 7,752,428. Springer, 2006. [42] Pete Dice. Booting an intel architecture system, part [26] Ernie Brickell and Jiangtao Li. Enhanced privacy id i: Early initialization. Dr. Dobb’s, Dec 2011. [Online; from bilinear pairing. IACR Cryptology ePrint Archive, accessed 2-Dec-2015]. 2009. [43] Whitfield Diffie and Martin E Hellman. New directions [27] Billy Bob Brumley and Nicola Tuveri. Remote tim- in cryptography. Information Theory, IEEE Transac- ing attacks are still practical. In Computer Security– tions on, 22(6):644–654, 1976. ESORICS 2011, pages 355–371. Springer, 2011. [44] Lo¨ıc Duflot, Daniel Etiemble, and Olivier Grumelard. [28] David Brumley and Dan Boneh. Remote timing at- Using cpu system management mode to circumvent op- tacks are practical. Computer Networks, 48(5):701–716, erating system security functions. CanSecWest/core06, 2005. 2006. [29] John Butterworth, Corey Kallenberg, Xeno Kovah, and [45] Morris Dworkin. Recommendation for block cipher Amy Herzog. Bios chronomancy: Fixing the core root modes of operation: Methods and techniques. Fed- of trust for measurement. In Proceedings of the 2013 eral Information Processing Standards (FIPS) Special ACM SIGSAC conference on Computer & Communica- Publications (SP), 800-38A, Dec 2001. tions Security, pages 25–36. ACM, 2013. [46] Morris Dworkin. Recommendation for block cipher [30] J Lawrence Carter and Mark N Wegman. Universal modes of operation: The cmac mode for authentica- classes of hash functions. In Proceedings of the 9th an- tion. Federal Information Processing Standards (FIPS) nual ACM Symposium on Theory of Computing, pages Special Publications (SP), 800-38B, May 2005. 106–112. ACM, 1977. [47] Morris Dworkin. Recommendation for block cipher [31] David Champagne and Ruby B Lee. Scalable architec- modes of operation: Galois/counter mode (gcm) and tural support for trusted software. In High Performance gmac. Federal Information Processing Standards Computer Architecture (HPCA), 2010 IEEE 16th Inter- (FIPS) Special Publications (SP), 800-38D, Nov 2007. national Symposium on, pages 1–12. IEEE, 2010. [48] D. Eastlake and P. Jones. RFC 3174: US Secure Hash [32] Daming D Chen and Gail-Joon Ahn. Security analysis Algorithm 1 (SHA1). Internet RFCs, 2001. of x86 processor microcode. 2014. [Online; accessed [49] Shawn Embleton, Sherri Sparks, and Cliff C Zou. Smm 7-January-2015]. rootkit: a new breed of os independent malware. Secu- [33] Haogang Chen, Yandong Mao, Xi Wang, Dong Zhou, rity and Communication Networks, 2010. Nickolai Zeldovich, and M Frans Kaashoek. 
Linux [50] Dmitry Evtyushkin, Jesse Elwell, Meltem Ozsoy, kernel vulnerabilities: State-of-the-art defenses and Dmitry Ponomarev, Nael Abu Ghazaleh, and Ryan open problems. In Proceedings of the Second Asia- Riley. Iso-x: A flexible architecture for hardware- Pacific Workshop on Systems, page 5. ACM, 2011. managed isolated execution. In Microarchitecture (MI- [34] Lily Chen. Recommendation for key derivation using CRO), 2014 47th annual IEEE/ACM International Sym- pseudorandom functions. Federal Information Pro- posium on, pages 190–202. IEEE, 2014. cessing Standards (FIPS) Special Publications (SP), [51] Niels Ferguson, Bruce Schneier, and Tadayoshi Kohno. 800-108, Oct 2009. Cryptography Engineering: Design Principles and [35] Coreboot. Developer manual, Sep 2014. [Online; ac- Practical Applications. John Wiley & Sons, 2011. cessed 4-March-2015]. [52] Christopher W Fletcher, Marten van Dijk, and Srinivas [36] M.P. Cornaby and B. Chaffin. Microinstruction pointer Devadas. A secure processor architecture for encrypted stack including speculative pointers for out-of-order computation on untrusted programs. In Proceedings execution, 2007. US Patent 7,231,511. of the Seventh ACM Workshop on Scalable Trusted [37] Intel Corporation. Intel R Xeon R Processor E5 v3 Computing, pages 3–8. ACM, 2012. Family Uncore Performance Monitoring Reference [53] Agner Fog. Instruction tables - lists of instruction laten- Manual, Sep 2014. Reference no. 331051-001. cies, throughputs and micro-operation breakdowns for [38] Victor Costan, Ilia Lebedev, and Srinivas Devadas. intel, amd and via cpus. Dec 2014. [Online; accessed Sanctum: Minimal hardware extensions for strong soft- 23-January-2015]. 112

113.[54] Andrew Furtak, Yuriy Bulygin, Oleksandr Bazhaniuk, [68] K.C. Gotze, J. Li, and G.M. Iovino. Fuse attestation to John Loucaides, Alexander Matrosov, and Mikhail secure the provisioning of secret keys during integrated Gorobets. Bios and secure boot attacks uncovered. circuit manufacturing, 2014. US Patent 8,885,819. The 10th ekoparty Security Conference, 2014. [Online; [69] Joe Grand. Advanced hardware hacking techniques, Jul accessed 22-October-2015]. 2004. [55] William Futral and James Greene. Intel R Trusted [70] David Grawrock. Dynamics of a Trusted Platform: A Execution Technology for Server Platforms. Apress building block approach. Intel Press, 2009. Open, 2013. [71] Trusted Computing Group. Tpm [56] Blaise Gassend, Dwaine Clarke, Marten Van Dijk, and main specification. http://www. Srinivas Devadas. Silicon physical random functions. trustedcomputinggroup.org/resources/ In Proceedings of the 9th ACM Conference on Com- tpm_main_specification, 2003. puter and Communications Security, pages 148–160. [72] Daniel Gruss, Cl´ementine Maurice, and Stefan Man- ACM, 2002. gard. Rowhammer. js: A remote software-induced fault [57] Blaise Gassend, G Edward Suh, Dwaine Clarke, Marten attack in javascript. CoRR, abs/1507.06955, 2015. Van Dijk, and Srinivas Devadas. Caches and hash [73] Shay Gueron. Quick verification of rsa signatures. In trees for efficient memory integrity verification. In 8th International Conference on Information Technol- Proceedings of the 9th International Symposium on ogy: New Generations (ITNG), pages 382–386. IEEE, High-Performance Computer Architecture, pages 295– 2011. 306. IEEE, 2003. [74] Ben Hawkes. Security analysis of x86 processor mi- [58] Daniel Genkin, Lev Pachmanov, Itamar Pipman, and crocode. 2012. [Online; accessed 7-January-2015]. Eran Tromer. Stealing keys from pcs using a radio: [75] John L Hennessy and David A Patterson. Computer Cheap electromagnetic attacks on windowed exponen- Architecture - a Quantitative Approach (5 ed.). Mogran tiation. Cryptology ePrint Archive, Report 2015/170, Kaufmann, 2012. 2015. [76] Christoph Herbst, Elisabeth Oswald, and Stefan Man- [59] Daniel Genkin, Itamar Pipman, and Eran Tromer. Get gard. An aes smart card implementation resistant to your hands off my laptop: Physical side-channel key- power analysis attacks. In Applied cryptography and extraction attacks on pcs. Cryptology ePrint Archive, Network security, pages 239–252. Springer, 2006. Report 2014/626, 2014. [77] G. Hildesheim, I. Anati, H. Shafi, S. Raikin, G. Gerzon, [60] Daniel Genkin, Adi Shamir, and Eran Tromer. Rsa key U.R. Savagaonkar, C.V. Rozas, F.X. McKeen, M.A. extraction via low-bandwidth acoustic cryptanalysis. Goldsmith, and D. Prashant. Apparatus and method Cryptology ePrint Archive, Report 2013/857, 2013. for page walk extension for enhanced security checks, [61] Craig Gentry. A fully homomorphic encryption scheme. 2014. US Patent App. 13/730,563. PhD thesis, Stanford University, 2009. [78] Matthew Hoekstra, Reshma Lal, Pradeep Pappachan, [62] R.T. George, J.W. Brandt, K.S. Venkatraman, and S.P. Vinay Phegade, and Juan Del Cuvillo. Using innovative Kim. Dynamically partitioning pipeline resources, instructions to create trustworthy software solutions. 2009. US Patent 7,552,255. In Proceedings of the 2nd International Workshop on [63] A. Glew, G. Hinton, and H. Akkary. Method and ap- Hardware and Architectural Support for Security and paratus for performing page table walks in a micropro- Privacy, HASP, volume 13, 2013. 
cessor capable of processing speculative instructions, [79] Gael Hofemeier. Intel manageability firmware recovery 1997. US Patent 5,680,565. agent. Mar 2013. [Online; accessed 2-Dec-2015]. [64] A.F. Glew, H. Akkary, R.P. Colwell, G.J. Hinton, D.B. [80] George Hotz. Ps3 glitch hack. 2010. [Online; accessed Papworth, and M.A. Fetterman. Method and apparatus 7-January-2015]. for implementing a non-blocking translation lookaside [81] Andrew Huang. Hacking the Xbox: an Introduction to buffer, 1996. US Patent 5,564,111. Reverse Engineering. No Starch Press, 2003. [65] Oded Goldreich. Towards a theory of software protec- [82] C.J. Hughes, Y.K. Chen, M. Bomb, J.W. Brandt, M.J. tion and simulation by oblivious rams. In Proceedings Buxton, M.J. Charney, S. Chennupaty, J. Corbal, M.G. of the 19th annual ACM symposium on Theory of Com- Dixon, M.B. Girkar, et al. Gathering and scattering puting, pages 182–194. ACM, 1987. multiple data elements, 2013. US Patent 8,447,962. [66] J.R. Goodman and H.H.J. Hum. Mesif: A two-hop [83] IEEE Computer Society. IEEE Standard for Ethernet, cache coherency protocol for point-to-point intercon- Dec 2012. IEEE Std. 802.3-2012. nects. 2009. [84] Mehmet Sinan Inci, Berk Gulmezoglu, Gorka Irazoqui, [67] K.C. Gotze, G.M. Iovino, and J. Li. Secure provisioning Thomas Eisenbarth, and Berk Sunar. Seriously, get off of secret keys during integrated circuit manufacturing, my cloud! cross-vm rsa key recovery in a public cloud. 2014. US Patent App. 13/631,512. Cryptology ePrint Archive, Report 2015/898, 2015. 113

114. [85] Intel Corporation. Intel R Processor Serial Number, ries - Specification Update, 2 2015. Reference no. Mar 1999. Order no. 245125-001. 321324-018US. [86] Intel Corporation. Intel R architecture Platform Basics, [104] Intel Corporation. Intel R Xeon R Processor E5-1600, Sep 2010. Reference no. 324377. E5-2400, and E5-2600 v3 Product Family Datasheet - [87] Intel Corporation. Intel R Core 2 Duo and Intel R Core Volume Two, Jan 2015. Reference no. 330784-002. 2 Solo Processor for Intel R Centrino R Duo Processor [105] Intel Corporation. Intel R Xeon R Processor E5 Prod- Technology Intel R Celeron R Processor 500 Series - uct Family - Specification Update, Jan 2015. Reference Specification Update, Dec 2010. Reference no. 314079- no. 326150-018. 026. [106] Intel Corporation. Mobile 4th Generation Intel R [88] Intel Corporation. Intel R Trusted Execution Technol- Core R Processor Family I/O Datasheet, Feb 2015. ogy (Intel R TXT) LAB Handout, 2010. [Online; ac- Reference no. 329003-003. cessed 2-July-2015]. [107] Bruce Jacob and Trevor Mudge. Virtual memory: Is- [89] Intel Corporation. Intel R Xeon R Processor 7500 Se- sues of implementation. Computer, 31(6):33–43, 1998. ries Uncore Programming Guide, Mar 2010. Reference [108] Simon P Johnson, Uday R Savagaonkar, Vincent R no. 323535-001. Scarlata, Francis X McKeen, and Carlos V Rozas. Tech- [90] Intel Corporation. An Introduction to the Intel R Quick- nique for supporting multiple secure enclaves, Dec Path Interconnect, Mar 2010. Reference no. 323535- 2010. US Patent 8,972,746. 001. [109] Jakob Jonsson and Burt Kaliski. RFC 3447: Public-Key [91] Intel Corporation. Minimal Intel R Architecture Boot Cryptography Standards (PKCS) #1: RSA Cryptogra- LoaderBare Bones Functionality Required for Booting phy Specifications Version 2.1. Internet RFCs, Feb an Intel R Architecture Platform, Jan 2010. Reference 2003. no. 323246. [110] Burt Kaliski. RFC 2313: PKCS #1: RSA Encryption [92] Intel Corporation. Intel R 7 Series Family - Intel R Version 1.5. Internet RFCs, Mar 1998. Management Engine Firmware 8.1 - 1.5MB Firmware [111] Burt Kaliski and Jessica Staddon. RFC 2437: PKCS Bring Up Guide, Jul 2012. Revision 8.1.0.1248 - PV #1: RSA Encryption Version 2.0. Internet RFCs, Oct Release. 1998. [93] Intel Corporation. Intel R Xeon R Processor E5-2600 [112] Corey Kallenberg, Xeno Kovah, John Butterworth, and Product Family Uncore Performance Monitoring Guide, Sam Cornwell. Extreme privilege escalation on win- Mar 2012. Reference no. 327043-001. dows 8/uefi systems, 2014. [94] Intel Corporation. Software Guard Extensions Program- [113] Emilia K¨asper and Peter Schwabe. Faster and timing- ming Reference, 2013. Reference no. 329298-001US. attack resistant aes-gcm. In Cryptographic Hard- [95] Intel Corporation. Intel R 64 and IA-32 Architectures ware and Embedded Systems-CHES 2009, pages 1–17. Optimization Reference Manual, Sep 2014. Reference Springer, 2009. no. 248966-030. [114] Jonathan Katz and Yehuda Lindell. Introduction to [96] Intel Corporation. Intel R Xeon R Processor 7500 Se- modern cryptography. CRC Press, 2014. ries Datasheet - Volume Two, Mar 2014. Reference no. [115] Richard E Kessler and Mark D Hill. Page placement 329595-002. algorithms for large real-indexed caches. ACM Trans- [97] Intel Corporation. Intel R Xeon R Processor E7 v2 actions on Computer Systems (TOCS), 10(4):338–359, 2800/4800/8800 Product Family Datasheet - Volume 1992. Two, Mar 2014. Reference no. 329595-002. [116] Taesoo Kim and Nickolai Zeldovich. 
Practical and [98] Intel Corporation. Software Guard Extensions Program- effective sandboxing for non-root users. In USENIX ming Reference, 2014. Reference no. 329298-002US. Annual Technical Conference, pages 139–144, 2013. [99] Intel Corporation. Intel R 100 Series Chipset Family [117] Yoongu Kim, Ross Daly, Jeremie Kim, Chris Fallin, Platform Controller Hub (PCH) Datasheet - Volume Ji Hye Lee, Donghyuk Lee, Chris Wilkerson, Konrad One, Aug 2015. Reference no. 332690-001EN. Lai, and Onur Mutlu. Flipping bits in memory with- [100] Intel Corporation. Intel R 64 and IA-32 Architectures out accessing them: An experimental study of dram Software Developer’s Manual, Sep 2015. Reference no. disturbance errors. In Proceeding of the 41st annual 325462-056US. International Symposium on Computer Architecuture, [101] Intel Corporation. Intel R C610 Series Chipset and pages 361–372. IEEE Press, 2014. Intel R X99 Chipset Platform Controller Hub (PCH) [118] L.A. Knauth and P.J. Irelan. Apparatus and method Datasheet, Oct 2015. Reference no. 330788-003. for providing eventing ip and source data address in [102] Intel Corporation. Intel R Software Guard Extensions a statistical sampling infrastructure, 2014. US Patent (Intel R SGX), Jun 2015. Reference no. 332680-002. App. 13/976,613. [103] Intel Corporation. Intel R Xeon R Processor 5500 Se- [119] N. Koblitz. Elliptic curve cryptosystems. Mathematics 114

115. of Computation, 48(177):203–209, 1987. [132] James Manger. A chosen ciphertext attack on rsa op- [120] Paul Kocher, Joshua Jaffe, and Benjamin Jun. Dif- timal asymmetric encryption padding (oaep) as stan- ferential power analysis. In Advances in Cryptology dardized in pkcs# 1 v2.0. In Advances in Cryptology (CRYPTO), pages 388–397. Springer, 1999. CRYPTO 2001, pages 230–238. Springer, 2001. [121] Paul C Kocher. Timing attacks on implementations of [133] Clmentine Maurice, Nicolas Le Scouarnec, Christoph diffie-hellman, rsa, dss, and other systems. In Advances Neumann, Olivier Heen, and Aurlien Francillon. Re- in CryptologyCRYPTO96, pages 104–113. Springer, verse engineering intel last-level cache complex ad- 1996. dressing using performance counters. In Proceedings [122] Hugo Krawczyk, Ran Canetti, and Mihir Bellare. of the 18th International Symposium on Research in Hmac: Keyed-hashing for message authentication. Attacks, Intrusions and Defenses (RAID), 2015. 1997. [134] Jonathan M McCune, Yanlin Li, Ning Qu, Zongwei [123] Markus G Kuhn. Electromagnetic eavesdropping risks Zhou, Anupam Datta, Virgil Gligor, and Adrian Perrig. of flat-panel displays. In Privacy Enhancing Technolo- Trustvisor: Efficient tcb reduction and attestation. In gies, pages 88–107. Springer, 2005. Security and Privacy (SP), 2010 IEEE Symposium on, [124] Tsvika Kurts, Guillermo Savransky, Jason Ratner, Eilon pages 143–158. IEEE, 2010. Hazan, Daniel Skaba, Sharon Elmosnino, and Gee- [135] David McGrew and John Viega. The galois/counter yarpuram N Santhanakrishnan. Generic debug external mode of operation (gcm). 2004. [Online; accessed connection (gdxc) for high integration integrated cir- 28-December-2015]. cuits, 2011. US Patent 8,074,131. [136] Francis X McKeen, Carlos V Rozas, Uday R Sava- [125] David Levinthal. Performance analysis guide for gaonkar, Simon P Johnson, Vincent Scarlata, Michael A intel R core i7 processor and intel R xeon 5500 Goldsmith, Ernie Brickell, Jiang Tao Li, Howard C Her- processors. https://software.intel.com/ bert, Prashant Dewan, et al. Method and apparatus to sites/products/collateral/hpc/vtune/ provide secure application execution, Dec 2009. US performance_analysis_guide.pdf, 2010. Patent 9,087,200. [Online; accessed 26-January-2015]. [137] Frank McKeen, Ilya Alexandrovich, Alex Berenzon, [126] David Lie, Chandramohan Thekkath, Mark Mitchell, Carlos V Rozas, Hisham Shafi, Vedvyas Shanbhogue, Patrick Lincoln, Dan Boneh, John Mitchell, and Mark and Uday R Savagaonkar. Innovative instructions and Horowitz. Architectural support for copy and tamper software model for isolated execution. HASP, 13:10, resistant software. ACM SIGPLAN Notices, 35(11):168– 2013. 177, 2000. [138] Michael Naehrig, Kristin Lauter, and Vinod Vaikun- [127] Jiang Lin, Qingda Lu, Xiaoning Ding, Zhao Zhang, tanathan. Can homomorphic encryption be practical? Xiaodong Zhang, and P Sadayappan. Gaining in- In Proceedings of the 3rd ACM workshop on Cloud sights into multicore cache partitioning: Bridging the computing security workshop, pages 113–124. ACM, gap between simulation and real systems. In 14th In- 2011. ternational IEEE Symposium on High Performance [139] National Institute of Standards and Technology (NIST). Computer Architecture (HPCA), pages 367–378. IEEE, The advanced encryption standard (aes). Federal In- 2008. formation Processing Standards (FIPS) Publications [128] Barbara Liskov and Stephen Zilles. Programming with (PUBS), 197, Nov 2001. abstract data types. 
In ACM Sigplan Notices, volume 9, [140] National Institute of Standards and Technology (NIST). pages 50–59. ACM, 1974. The digital signature standard (dss). Federal Informa- [129] Fangfei Liu, Yuval Yarom, Qian Ge, Gernot Heiser, and tion Processing Standards (FIPS) Processing Standards Ruby B Lee. Last-level cache side-channel attacks are Publications (PUBS), 186-4, Jul 2013. practical. In Security and Privacy (SP), 2015 IEEE [141] National Security Agency (NSA) Central Security Ser- Symposium on, pages 143–158. IEEE, 2015. vice (CSS). Cryptography today on suite b phase- [130] Martin Maas, Eric Love, Emil Stefanov, Mohit Tiwari, out. https://www.nsa.gov/ia/programs/ Elaine Shi, Krste Asanovic, John Kubiatowicz, and suiteb_cryptography/, Aug 2015. [Online; ac- Dawn Song. Phantom: Practical oblivious computation cessed 28-December-2015]. in a secure processor. In Proceedings of the 2013 ACM [142] M.S. Natu, S. Datta, J. Wiedemeier, J.R. Vash, S. Kotta- SIGSAC conference on Computer & communications palli, S.P. Bobholz, and A. Baum. Supporting advanced security, pages 311–324. ACM, 2013. ras features in a secured computing system, 2012. US [131] R. Maes, P. Tuyls, and I. Verbauwhede. Low-Overhead Patent 8,301,907. Implementation of a Soft Decision Helper Data Algo- [143] Yossef Oren, Vasileios P Kemerlis, Simha Sethumadha- rithm for SRAM PUFs. In Cryptographic Hardware van, and Angelos D Keromytis. The spy in the sandbox and Embedded Systems (CHES), pages 332–347, 2009. – practical cache attacks in javascript. arXiv preprint 115

116. arXiv:1502.07373, 2015. erations based upon the addresses of microinstructions, [144] Dag Arne Osvik, Adi Shamir, and Eran Tromer. Cache 1997. US Patent 5,636,374. attacks and countermeasures: the case of aes. In Topics [158] S.D. Rodgers, R. Vidwans, J. Huang, M.A. Fetterman, in Cryptology–CT-RSA 2006, pages 1–20. Springer, and K. Huck. Method and apparatus for generating 2006. event handler vectors based on both operating mode [145] Scott Owens, Susmit Sarkar, and Peter Sewell. A better and event type, 1999. US Patent 5,889,982. x86 memory model: x86-tso (extended version). Uni- [159] M. Rosenblum and T. Garfinkel. Virtual machine mon- versity of Cambridge, Computer Laboratory, Technical itors: current technology and future trends. Computer, Report, (UCAM-CL-TR-745), 2009. 38(5):39–47, May 2005. [146] Emmanuel Owusu, Jun Han, Sauvik Das, Adrian Perrig, [160] Xiaoyu Ruan. Platform Embedded Security Technology and Joy Zhang. Accessory: password inference using Revealed. Apress, 2014. accelerometers on smartphones. In Proceedings of the [161] Joanna Rutkowska. Intel x86 considered harmful. Oct Twelfth Workshop on Mobile Computing Systems & 2015. [Online; accessed 2-Nov-2015]. Applications, page 9. ACM, 2012. [162] Joanna Rutkowska and Rafał Wojtczuk. Preventing [147] D.B. Papworth, G.J. Hinton, M.A. Fetterman, R.P. Col- and detecting xen hypervisor subversions. Blackhat well, and A.F. Glew. Exception handling in a processor Briefings USA, 2008. that performs speculative out-of-order instruction exe- [163] Jerome H Saltzer and M Frans Kaashoek. Principles cution, 1999. US Patent 5,987,600. of Computer System Design: An Introduction. Morgan [148] David A Patterson and John L Hennessy. Computer Kaufmann, 2009. Organization and Design: the hardware/software inter- [164] Mark Seaborn and Thomas Dullien. Exploit- face. Morgan Kaufmann, 2013. ing the dram rowhammer bug to gain kernel [149] P. Pessl, D. Gruss, C. Maurice, M. Schwarz, and S. Man- privileges. http://googleprojectzero. gard. Reverse engineering intel dram addressing and blogspot.com/2015/03/ exploitation. ArXiv e-prints, Nov 2015. exploiting-dram-rowhammer-bug-to-gain. [150] Stefan M Petters and Georg Farber. Making worst case html, Mar 2015. [Online; accessed 9-March-2015]. execution time analysis for hard real-time tasks on state [165] V. Shanbhogue, J.W. Brandt, and J. Wiedemeier. Pro- of the art processors feasible. In Sixth International tecting information processing system secrets from de- Conference on Real-Time Computing Systems and Ap- bug attacks, 2015. US Patent 8,955,144. plications, pages 442–449. IEEE, 1999. [166] V. Shanbhogue and S.J. Robinson. Enabling virtu- [151] S.A. Qureshi and M.O. Nicholes. System and method alization of a processor resource, 2014. US Patent for using a firmware interface table to dynamically load 8,806,104. an acpi ssdt, 2006. US Patent 6,990,576. [167] Stephen Shankland. Itanium: A cautionary tale. Dec [152] S. Raikin, O. Hamama, R.S. Chappell, C.B. Rust, H.S. 2005. [Online; accessed 11-February-2015]. Luu, L.A. Ong, and G. Hildesheim. Apparatus and [168] Alan Jay Smith. Cache memories. ACM Computing method for a multiple page size translation lookaside Surveys (CSUR), 14(3):473–530, 1982. buffer (tlb), 2014. US Patent App. 13/730,411. [169] Sean W Smith, Ron Perez, Steve Weingart, and Vernon [153] S. Raikin and R. Valentine. Gather cache architecture, Austel. Validating a high-performance, programmable 2014. US Patent 8,688,962. secure coprocessor. 
In 22nd National Information Sys- [154] Stefan Reinauer. x86 intel: Add firmware interface tems Security Conference. IBM Thomas J. Watson Re- table support. http://review.coreboot.org/ search Division, 1999. #/c/2642/, 2013. [Online; accessed 2-July-2015]. [170] Sean W Smith and Steve Weingart. Building a high- [155] Thomas Ristenpart, Eran Tromer, Hovav Shacham, and performance, programmable secure coprocessor. Com- Stefan Savage. Hey, you, get off of my cloud: Exploring puter Networks, 31(8):831–860, 1999. information leakage in third-party compute clouds. In [171] Marc Stevens, Pierre Karpman, and Thomas Peyrin. Proceedings of the 16th ACM Conference on Computer Free-start collision on full sha-1. Cryptology ePrint and Communications Security, pages 199–212. ACM, Archive, Report 2015/967, 2015. 2009. [172] G Edward Suh, Dwaine Clarke, Blaise Gassend, Marten [156] RL Rivest, A. Shamir, and L. Adleman. A method for Van Dijk, and Srinivas Devadas. Aegis: architecture for obtaining digital signatures and public-key cryptosys- tamper-evident and tamper-resistant processing. In Pro- tems. Communications of the ACM, 21(2):120–126, ceedings of the 17th annual international conference 1978. on Supercomputing, pages 160–171. ACM, 2003. [157] S.D. Rodgers, K.K. Tiruvallur, M.W. Rhodehamel, K.G. [173] G Edward Suh and Srinivas Devadas. Physical unclon- Konigsfeld, A.F. Glew, H. Akkary, M.A. Karnik, and able functions for device authentication and secret key J.A. Brayton. Method and apparatus for performing op- generation. In Proceedings of the 44th annual Design 116

117. Automation Conference, pages 9–14. ACM, 2007. [189] Rafal Wojtczuk, Joanna Rutkowska, and Alexander [174] G. Edward Suh, Charles W. O’Donnell, Ishan Sachdev, Tereshkin. Another way to circumvent intel R trusted and Srinivas Devadas. Design and Implementation of execution technology. Invisible Things Lab, 2009. the AEGIS Single-Chip Secure Processor Using Phys- [190] Rafal Wojtczuk and Alexander Tereshkin. Attacking ical Random Functions. In Proceedings of the 32nd intel R bios. Invisible Things Lab, 2010. ISCA’05. ACM, June 2005. [191] Y. Wu and M. Breternitz. Genetic algorithm for mi- [175] George Taylor, Peter Davies, and Michael Farmwald. crocode compression, 2008. US Patent 7,451,121. The tlb slice - a low-cost high-speed address translation [192] Y. Wu, S. Kim, M. Breternitz, and H. Hum. Compress- mechanism. SIGARCH Computer Architecture News, ing and accessing a microcode rom, 2012. US Patent 18(2SI):355–363, 1990. 8,099,587. [176] Alexander Tereshkin and Rafal Wojtczuk. Introducing [193] Yuanzhong Xu, Weidong Cui, and Marcus Peinado. ring-3 rootkits. Master’s thesis, 2009. Controlled-channel attacks: Deterministic side chan- [177] Kris Tiri, Moonmoon Akmal, and Ingrid Verbauwhede. nels for untrusted operating systems. In Proceedings A dynamic and differential cmos logic with signal in- of the 36th IEEE Symposium on Security and Privacy dependent power consumption to withstand differential (Oakland). IEEE Institute of Electrical and Electronics power analysis on smart cards. In Proceedings of the Engineers, May 2015. 28th European Solid-State Circuits Conference (ESS- [194] Yuval Yarom and Katrina E Falkner. Flush+ reload: a CIRC), pages 403–406. IEEE, 2002. high resolution, low noise, l3 cache side-channel attack. [178] UEFI Forum. Unified Extensible Firmware Interface IACR Cryptology ePrint Archive, 2013:448, 2013. Specification, Version 2.5, 2015. [Online; accessed [195] Yuval Yarom, Qian Ge, Fangfei Liu, Ruby B. Lee, and 1-Jul-2015]. Gernot Heiser. Mapping the intel last-level cache. Cryp- [179] Rich Uhlig, Gil Neiger, Dion Rodgers, Amy L Santoni, tology ePrint Archive, Report 2015/905, 2015. Fernando CM Martins, Andrew V Anderson, Steven M [196] Bennet Yee. Using secure coprocessors. PhD thesis, Bennett, Alain Kagi, Felix H Leung, and Larry Smith. Carnegie Mellon University, 1994. Intel virtualization technology. Computer, 38(5):48–56, [197] Marcelo Yuffe, Ernest Knoll, Moty Mehalel, Joseph 2005. Shor, and Tsvika Kurts. A fully integrated multi-cpu, [180] Wim Van Eck. Electromagnetic radiation from video gpu and memory controller 32nm processor. In Solid- display units: an eavesdropping risk? Computers & State Circuits Conference Digest of Technical Papers Security, 4(4):269–286, 1985. (ISSCC), 2011 IEEE International, pages 264–266. [181] Amit Vasudevan, Jonathan M McCune, Ning Qu, Leen- IEEE, 2011. dert Van Doorn, and Adrian Perrig. Requirements for [198] Xiantao Zhang and Yaozu Dong. Optimizing xen vmm an integrity-protected hypervisor on the x86 hardware based on intel R virtualization technology. In Inter- virtualized architecture. In Trust and Trustworthy Com- net Computing in Science and Engineering, 2008. ICI- puting, pages 141–165. Springer, 2010. CSE’08. International Conference on, pages 367–374. [182] Sathish Venkataramani. Advanced Board Bring Up - IEEE, 2008. Power Sequencing Guide for Embedded Intel Archi- [199] Li Zhuang, Feng Zhou, and J Doug Tygar. Keyboard tecture. Intel Corporation, Apr 2011. Reference no. acoustic emanations revisited. 
ACM Transactions on 325268. Information and System Security (TISSEC), 13(1):3, [183] Vassilios Ververis. Security evaluation of intel’s active 2009. management technology. 2010. [200] V.J. Zimmer and S.H. Robinson. Methods and systems [184] Filip Wecherowski. A real smm rootkit: Reversing and for microcode patching, 2012. US Patent 8,296,528. hooking bios smi handlers. Phrack Magazine, 13(66), [201] V.J. Zimmer and J. Yao. Method and apparatus for 2009. sequential hypervisor invocation, 2012. US Patent [185] Mark N Wegman and J Lawrence Carter. New hash 8,321,931. functions and their use in authentication and set equality. Journal of Computer and System Sciences, 22(3):265– 279, 1981. [186] Rafal Wojtczuk and Joanna Rutkowska. Attacking intel trusted execution technology. Black Hat DC, 2009. [187] Rafal Wojtczuk and Joanna Rutkowska. Attacking smm memory via intel cpu cache poisoning. Invisible Things Lab, 2009. [188] Rafal Wojtczuk and Joanna Rutkowska. Attacking intel txt via sinit code execution hijacking, 2011. 117