Eliminating Boundaries in Cloud Storage with Anna
Chenggang Wu, Vikram Sreekanti, Joseph M. Hellerstein
UC Berkeley
{cgwu, vikrams, hellerstein}@berkeley.edu
arXiv:1809.00089v1 [cs.DB] 1 Sep 2018

ABSTRACT

In this paper, we describe how we extended a distributed key-value store called Anna into an elastic, multi-tier service for the cloud. In its extended form, Anna is designed to overcome the narrow cost-performance limitations typical of current cloud storage systems. We describe three key aspects of Anna's new design: multi-master selective replication of hot keys, a vertical tiering of storage layers with different cost-performance tradeoffs, and horizontal elasticity of each tier to add and remove nodes in response to load dynamics. Anna's policy engine uses these mechanisms to balance service-level objectives around cost, latency, and fault tolerance. Experimental results explore the behavior of Anna's mechanisms and policy, exhibiting orders-of-magnitude efficiency improvements over both commodity cloud KVS services and research systems.

1. INTRODUCTION

As public infrastructure cloud providers have matured in the last decade, the number of storage services they offer has soared. Popular cloud providers like Amazon Web Services (AWS) [9], Microsoft Azure [10], and Google Cloud Platform (GCP) [21] each have at least seven storage options. These services span the spectrum of cost-performance tradeoffs: AWS ElastiCache, for example, is an expensive, memory-speed service, while AWS Glacier is extremely high-latency and low-cost. In between, there are a variety of services such as the Elastic Block Store (EBS), the Elastic File System (EFS), and the Simple Storage Service (S3). Azure and GCP both offer a similar range of storage solutions.

Each one of these services is tuned to a unique point in that design space, making it well-suited to certain performance goals. Application developers, however, typically deal with a non-uniform distribution of performance requirements. For example, many applications generate a skewed access distribution, in which some data is "hot" while other data is "cold". This is why traditional storage is assembled hierarchically: hot data is kept in fast, expensive cache while cold data is kept in slow, cheap storage. These access distributions have become more complex in modern settings, because they can change dramatically over time. Realistic workloads spike by orders of magnitude, and hot sets shift and resize. These large-scale variations in workload motivate an "elastic" service design, but most cloud storage services today are inelastic and unable to respond to these dynamics.
The narrow performance goals of cloud storage services result in poor cost-performance tradeoffs for applications. To improve performance, developers often take matters into their own hands by addressing storage limitations in custom application logic. This introduces significant complexity and increases the likelihood of application-level errors. Developers are inhibited by two key types of boundaries when building applications with non-uniform workload distributions:

Cost-Performance Boundaries. Each of the systems discussed above—ElastiCache, EBS, S3, etc.—offers a different, fixed tradeoff of cost, capacity, latency, and bandwidth. These tradeoffs echo traditional memory hierarchies built from RAM, flash, and magnetic disk arrays. To balance performance and cost, data should ideally move adaptively across storage tiers, matching workload skew and shifting hotspots. However, current cloud services are largely unaware of each other, so software developers and DevOps engineers must cobble together ad hoc memory hierarchies. Applications must explicitly move and track data and requests across storage system boundaries in their business logic. This task is further complicated by the heterogeneity of storage services in terms of deployment, APIs, and consistency guarantees. For example, single-replica systems like ElastiCache are linearizable, while replicated systems like DynamoDB are eventually consistent.

Static Deployment Boundaries. Cloud providers offer very few truly elastic storage services; most such systems have hard boundaries on the number and type of nodes deployed. In AWS, for example, high-performance tiers like ElastiCache are surprisingly inelastic, requiring system administrators to allocate and deallocate instances manually. Two of the lower storage tiers—S3 and DynamoDB—are elastic, but are insufficient for many needs. S3 autoscales to match data volume but ignores workload; it is designed for "cold" storage, offering good bandwidth but high latency. DynamoDB offers workload-based autoscaling but is prohibitively expensive to scale to a memory-speed service. This motivates the use of ElastiCache over DynamoDB, which again requires an administrator to monitor load and usage statistics and manually adjust resource allocation.

In an earlier paper, we presented the initial architecture of a key-value storage system called Anna [51]. The focus of the initial paper was on a design to provide excellent performance across orders of magnitude in scale. In this work, we extend Anna to remove its cost-performance and static deployment boundaries, enabling Anna to dynamically adjust configuration and match resources to workloads.

While our previous work's evaluation focused on raw performance, here we are also interested in efficiency: the ratio of performance to cost. For various cost points, we show that Anna outperforms in-memory systems like AWS ElastiCache and Masstree [35] by up to an order of magnitude. Anna also outperforms elastic databases like DynamoDB by more than two orders of magnitude in efficiency.

In Section 2, we briefly describe the contributions of our prior work on Anna [51] and preview our approach to making the system adapt across boundaries. In Section 3, we describe the mechanisms that Anna uses to respond to mixed and changing workloads. Section 4 introduces the architecture of Anna including the implementation of these mechanisms, and Section 5 describes Anna's policy engine. In Section 6, we present an evaluation of Anna's mechanisms and policies, and we describe how they fare in comparison to the state of the art. Section 7 discusses related work, and we conclude with future work in Section 8.

In the remainder of this paper, we use AWS as the public cloud provider underlying Anna. The design principles and lessons learned here are naturally transferable to other cloud providers with similar offerings.
2. BACKGROUND

The first paper on Anna [51] presented a distributed key-value store based on a fully shared-nothing, thread-per-core architecture with background gossip across cores and nodes. Anna threads have no shared data structures in memory beyond message queues, enabling each core to spend most of its time doing useful work. Experiments on high-contention workloads showed Anna spending over 90% of its compute cycles serving put and get requests, while state-of-the-art competing systems were achieving less than 10%. The vast majority of the other systems' time was spent trying to execute atomic processor instructions on shared data structures. As a result, Anna outperformed the competition by orders of magnitude at many scale points. Relative to state-of-the-art distributed KVSes, Anna's initial design also enabled an unprecedented richness of coordination-free consistency levels. The basic insight was that the full variety of coordination-free consistency and transactional isolation levels taxonomized by Bailis et al. [11] can be achieved by the monotone composition of simple lattice structures, as suggested by Conway et al. [16]. The original paper maps out a wide range of key-value and NoSQL systems against Bailis' taxonomy of consistency levels.

The first version of Anna focused on performing well at both single-node and distributed scales. This showed that eventual consistency combined with a coordination-free shared-nothing architecture makes data management easy in the face of deployment changes, and it hinted at the potential to remove deployment boundaries. However, the initial architecture lacked the mechanisms to monitor and respond to usage and workloads. Another notable weakness of the initial work was its need to aggressively replicate the entire database across the main memory of many machines to achieve high performance. This gave the system an unattractive cost-performance tradeoff and made its resource allocation very rigid. As a result, although a benchmark-beater, Anna's first version suffered from the problems highlighted above: it was expensive and inflexible for large datasets with non-uniform access distributions.
2.1 Anna Without Boundaries

In this paper, we extend the initial version of Anna to span the cost-performance design space more flexibly, enabling it to adapt dynamically to workload variation in a cloud-native setting. The architecture presented here removes the cost-performance and static deployment boundaries discussed in Section 1. To that end, we add three key mechanisms: (1) horizontal elasticity to adaptively scale deployments; (2) vertical data movement in a storage hierarchy to reduce cost by demoting cold data to cheap storage; and (3) multi-master selective replication of hot keys across nodes and cores to efficiently scale request handling for non-uniform access patterns. The resulting architecture is simplified by deploying a single storage kernel across many tiers, by entirely avoiding coordination, and by reusing the storage engine to store and manipulate system metadata. The additions to Anna described in this work enable system operators to specify high-level goals such as fault tolerance and cost-performance objectives, without needing to manually configure the number of nodes and the replication factors of keys. A new policy engine automatically responds to workload shifts using the mechanisms mentioned above to meet these service-level objectives (SLOs).

3. DISTRIBUTIONS AND MECHANISMS

In this section, we first classify and describe common workload distributions across data and time. We then discuss the mechanisms that Anna uses to respond to the workload properties and changes.

We believe that an ideal cloud storage service should gracefully adapt to three aspects of workload distributions and their dynamics in time:

A. Volume. As overall workload grows, the aggregate throughput of the system must grow. During growth periods, the system should automatically increase resource allocation and thereby cost. When workload decreases, resource usage and cost should decrease correspondingly as well.

B. Skewness. Even at a fixed volume, skewness of access distributions can affect performance dramatically. A highly skewed workload will make many requests to a small subset of keys. A uniform workload of similar volume will make a few requests to each key. Different skews warrant different responses, to ensure that the resources devoted to serving each key are proportional to its popularity.

C. Shifting Hotspots. Workloads that are static in both skew and volume can still exhibit changes in distribution over time: hot data may become cold and vice versa. The system must be able to detect changes in workload hotspots and respond accordingly by prioritizing data in the new hot set and demoting data in the old one.

We address these three workload variations with three mechanisms in Anna, which we describe next.

1. Horizontal Elasticity. In order to adapt to variation in workload volume, each storage tier in Anna must scale elastically and independently, both in terms of storage and request handling. Anna needs the storage capacity of many nodes to store large amounts of data, and it needs the compute and networking capacity of many nodes to serve large numbers of requests. This is accomplished by partitioning (sharding) data across all the nodes in a given tier. When workload volume increases, Anna can respond by automatically adding nodes and repartitioning a subset of data. When the volume decreases, Anna can remove nodes and repartition data among the remaining nodes.

2. Multi-Master Selective Replication. When workloads are highly skewed, simply adding shards to the system will not alleviate pressure. The small hot set will be concentrated on a few nodes that will be receiving a large majority of the requests, while the remaining nodes lie idle. The only solution is to replicate the hot set onto many machines. However, we do not want to repeat the mistakes of our first iteration of Anna's design, replicating cold keys as well—this simply wastes space and increases overhead. Instead, replication must be selective, with hot keys replicated more than cold keys. Thus, Anna must accurately track which data is hot and which is cold, as well as the replication factors and current replica locations for each key.

3. Vertical Tiering. As in a traditional memory hierarchy, hot data should reside in a fast, memory-speed tier for efficient access; significant cost savings are available by demoting data that is not frequently accessed to cold storage. Again, Anna must correctly classify hot and cold data in order to promote or demote appropriately. While the previous two mechanisms are aimed at improving performance, this one primarily attempts to minimize cost without compromising performance.

3.1 Summary

Table 1: The mechanisms used by Anna to deal with various aspects of workload distributions.

    Workload Dynamics | Relevant Mechanisms
    ------------------+---------------------
    Volume            | Elasticity
    Skew              | Replication, Tiering
    Hotspot           | Replication, Tiering

Table 1 shows which mechanisms respond to which properties of workload distributions. There is a direct mapping between an increase (or decrease) in volume—with other factors held constant—and a requirement to elastically add (or remove) nodes. Changes in workload skew require a response to the new hot set size via promotion or demotion, as well as appropriate selective replication. Similarly, a change in hotspot location requires correct promotion and demotion across tiers, in addition to shifts in per-key replication factors. We describe how Anna implements each of these mechanisms in Sections 4 and 5. In Section 6, we evaluate how well Anna responds to these dynamics.
4. ANNA ARCHITECTURE

In this section, we introduce Anna's architecture and illustrate how the mechanisms discussed in Section 3 are implemented. We present an overview of the core subsystems and then discuss each component in turn. As mentioned in Section 1, Anna is built on AWS components. In our initial implementation and evaluation, we validate this architecture over two storage tiers: one providing RAM cost-performance and another providing flash disk cost-performance. Anna's memory tier stores data in RAM attached to AWS EC2 nodes. The flash tier leverages the Elastic Block Store (EBS), a fault-tolerant block storage service that masquerades as a mounted disk volume on an EC2 node. There is nothing intrinsic in our choice of layers. We could easily add a third layer (e.g., S3) and a fourth (e.g., Glacier), but demoting data to cold storage in these tiers operates on much longer timescales that are beyond the scope of this work.

4.1 Overview

[Figure 1: The Anna architecture.]

Figure 1 presents an overview of Anna, with each rectangle representing a node. In the original Anna paper [51], we described an extremely performant, coordination-free key-value store that provided a rich variety of consistency levels. In that work, we demonstrated how a KVS could scale from multicore to distributed settings while gracefully tolerating the natural messaging delays that arise in distributed systems. To enable the mechanisms described in Section 3, we first extended the storage kernel to support multiple storage media and then designed three new subsystems: a monitoring system/policy engine, a routing service, and a cluster management system. Each subsystem is bootstrapped on top of the key-value storage component in Anna, storing and modifying its metadata as keys and values in the system.

The monitoring system and policy engine are the internal services responsible for responding to workload dynamics and meeting SLOs. Importantly, these services are stateless and thus are not concerned with fault tolerance and scaling; they rely on the storage service for these features.

The routing service is a stateless client-facing API that provides a stable abstraction above the internal dynamics of the system. The resource allocation of each tier may be in flux—and whole tiers may be added or removed from the system at workload extremes—but clients are isolated from these changes. The routing service consistently returns a correct endpoint that will answer client requests. Finally, the cluster management system is another stateless service that executes resource allocation changes based on decisions reached by the policy engine.
4.2 Storage System

[Figure 2: The architecture of a storage node.]

Figure 2 shows the architecture of an Anna storage node. Each node has many worker threads, and each thread interacts with a thread-local storage medium (a memory buffer or disk volume), processes client requests, and sends/receives multicasts to/from other Anna workers.

The Anna storage kernel is deployed across many storage tiers. The only difference between tiers is the procedure for translating data for persistence (serialization/deserialization, a.k.a. "serde"). Memory-tier nodes read from and write to local memory buffers, while disk-tier nodes serialize data into files that are stored on EBS volumes. Anna's uniformity across storage tiers makes adding additional tiers very simple: we set the serde mode and adjust the number of worker threads based on the underlying hardware. For instance, the total number of threads for memory nodes matches the number of CPU cores to fully utilize computing resources and to avoid costly preemption of threads. However, in other storage tiers where the performance bottleneck lies in serializing the key-value pairs to and from persistent storage, the optimal strategy for resource allocation is different. Our EBS tier allocates 4x as many threads per node (4) as we have physical CPUs (1).

Anna uses consistent hashing [25] to partition and replicate keys. For performance and fault tolerance (discussed further in Sections 5 and 6), each key may be replicated onto many nodes in each tier and multiple threads in each node. Following the model of early distributed hash tables, we use virtual nodes [37] in our consistent hashing algorithm. Each physical node (or thread) handles traffic for many virtual nodes (or threads) on the hash ring to ensure an even distribution. Virtual nodes also enable us to add heterogeneous nodes in the future by allocating more virtual nodes to more powerful physical machines.
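To make the partitioning scheme concrete, the following is a minimal Python sketch of consistent hashing with virtual nodes. It is illustrative only—Anna's storage kernel is not written this way, and all names here are ours—but it shows how virtual nodes even out the key distribution and how a higher weight gives a more powerful machine a larger share of the ring.

```python
import bisect
import hashlib

def stable_hash(s: str) -> int:
    # Any stable hash works here; Anna's actual choice may differ.
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class HashRing:
    """Consistent hash ring with virtual nodes (illustrative sketch)."""

    def __init__(self, vnodes_per_weight: int = 100):
        self.vnodes_per_weight = vnodes_per_weight
        self.ring = []  # sorted (hash point, node) pairs

    def add_node(self, node: str, weight: int = 1) -> None:
        # A more powerful machine gets a higher weight, i.e., more
        # virtual nodes and hence a larger share of the keyspace.
        for v in range(self.vnodes_per_weight * weight):
            bisect.insort(self.ring, (stable_hash(f"{node}#{v}"), node))

    def remove_node(self, node: str) -> None:
        self.ring = [(p, n) for (p, n) in self.ring if n != node]

    def lookup(self, key: str, replicas: int = 1) -> list:
        """Walk clockwise from the key's hash point, collecting the
        first `replicas` distinct physical nodes."""
        found = []
        if not self.ring:
            return found
        i = bisect.bisect_left(self.ring, (stable_hash(key), ""))
        for j in range(len(self.ring)):
            node = self.ring[(i + j) % len(self.ring)][1]
            if node not in found:
                found.append(node)
                if len(found) == replicas:
                    break
        return found
```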
In the following section, we present a brief overview of the storage kernel's design points that enable it to achieve high-performance coordination-free execution and replica consistency. This overview is a brief summary of the initial design of Anna presented in [51].

4.2.1 Storage Kernel

Recent work has demonstrated that shared-memory coordination mechanisms like locking and atomic "lock-free" instructions slow down low-level memory access performance on a single node by orders of magnitude [20, 13]. Across nodes, consensus algorithms are well-known to cause dramatic latency and availability problems [14, 1, 12]. Anna's coordination-free execution model avoids these issues entirely in pursuit of excellent performance and scalability. It gives each worker thread on every node a private memory buffer to store the data it manages. Data is multi-mastered: each thread or node processes both reads and writes locally regardless of replication. Each thread periodically runs a background task to multicast ("gossip") recent updates to other workers that maintain replicas of these keys. This shared-nothing, asynchronous messaging scheme eliminates thread synchronization and asynchronously resolves conflicting updates to replicas. The resulting code exploits multi-core parallelism within a single machine and smoothly scales out across distributed nodes. Our earlier work shows dramatic benefits from this design, including record performance based on extremely high (90%) CPU utilization in useful work with low processor cache miss rates.

While Anna eliminates contention, consistency becomes tricky: the same set of updates may arrive at different replicas in different orders. Naïvely applying these updates can cause replicas to diverge and lead to inconsistent state. Another contribution of [51] is achieving a wide range of consistency models by encapsulating state into monotone compositions of simple CRDT-style [39] lattices, inspired by the Bloom language [16]. Lattices tolerate message reordering and duplication while guaranteeing eventual convergence of replicas. By default, Anna stores data in last-writer-wins lattices, which resolve divergent updates by picking the update with the most recent timestamp. However, Anna's lattices can be composed to offer the full range of coordination-free consistency guarantees including causal consistency, item cut isolation, and read-committed transactions [11].
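The last-writer-wins lattice can be sketched in a few lines; the point is that the merge function is commutative, associative, and idempotent, so gossip messages may be reordered or duplicated without affecting the converged state. The sketch below is ours, not Anna's C++ implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LWWPair:
    """A last-writer-wins lattice element: a timestamped value.
    merge() is commutative, associative, and idempotent, so replicas
    converge no matter how gossip is ordered or duplicated."""
    timestamp: int
    value: bytes

    def merge(self, other: "LWWPair") -> "LWWPair":
        # Pick the update with the larger timestamp; break ties on the
        # value so the choice is deterministic across replicas.
        mine = (self.timestamp, self.value)
        theirs = (other.timestamp, other.value)
        return self if mine >= theirs else other

# Two replicas applying the same updates in different orders converge:
r1 = LWWPair(1, b"a").merge(LWWPair(3, b"c")).merge(LWWPair(2, b"b"))
r2 = LWWPair(2, b"b").merge(LWWPair(1, b"a")).merge(LWWPair(3, b"c"))
assert r1 == r2 == LWWPair(3, b"c")
```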
4.3 Metadata Management

Anna maintains certain metadata to efficiently support the mechanisms discussed in Section 3 and to help the policy engine adapt to changing workloads. In this section, we introduce the types of metadata managed by Anna and how they are stored and used by various system components.

4.3.1 Types of Metadata

Anna manages three distinct kinds of metadata. First, every storage tier has two hash rings. A global hash ring, G, determines which nodes in a tier are responsible for storing each key. A local hash ring, L, determines the set of worker threads within a single node that are responsible for a key.

Second, each individual key K has a replication vector of the form [<R_1, ..., R_n>, <T_1, ..., T_n>]. R_i represents the number of nodes in tier i storing key K, and T_i represents the number of threads per node in tier i storing key K. During client-request handling and multicast, both hash rings and key K's replication vector are required to determine the set of threads responsible for the key. For every tier i that maintains a replica of K, we first hash K against G_i to determine which nodes are responsible for K. We then look at L_i, tier i's local hash ring, to determine which worker threads are responsible for the key.

Lastly, Anna also tracks monitoring statistics, such as the access frequency of each key and the storage consumption of each node. This information is analyzed by the policy engine to trigger actions in response to variations in workload. Currently, we store 16 bytes of metadata per key and about 10 KB of metadata per worker thread.
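Reusing the HashRing sketch above, the two-level lookup can be illustrated as follows; the ReplicationVector class and function names are ours, and the default vector shown matches the two-tier prototype with k = 2.

```python
from dataclasses import dataclass, field

@dataclass
class ReplicationVector:
    """[<R_1, ..., R_n>, <T_1, ..., T_n>] for one key: per-tier node
    counts (R) and per-node thread counts (T). The default below is
    one memory-tier replica and k = 2 EBS-tier replicas."""
    R: dict = field(default_factory=lambda: {"memory": 1, "ebs": 2})
    T: dict = field(default_factory=lambda: {"memory": 1, "ebs": 1})

def responsible_threads(key, rep, global_rings, local_rings):
    """Resolve `key` to (tier, node, thread) targets: hash against the
    tier's global ring G_i to pick R_i nodes, then against the local
    ring L_i to pick T_i worker threads on each of those nodes."""
    targets = []
    for tier, r in rep.R.items():
        if r == 0:
            continue  # no replicas of this key in this tier
        for node in global_rings[tier].lookup(key, replicas=r):
            threads = local_rings[tier].lookup(key, replicas=rep.T[tier])
            targets.extend((tier, node, t) for t in threads)
    return targets
```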
4.3.2 Metadata Storage

Clearly, the availability and consistency of metadata is as important as that of regular data—otherwise, Anna would be unable to determine a key's location or get an accurate estimate of workload characteristics and the system's resource usage. In many systems [40, 42, 29, 48], metadata is enmeshed in the implementation of "master nodes" or stateful services like ZooKeeper [23]. Anna simply stores metadata in the storage system. Our metadata automatically derives all the benefits of our storage system, including performance guarantees, fault tolerance, and consistency. Anna employs last-writer-wins consistency to resolve conflicts among metadata replicas. Due to the eventual consistency model, worker threads may have stale views of hash rings and replication vectors. This can cause threads to disagree on the location of a key and can potentially cause multiple rounds of request redirection. However, since the metadata will eventually converge, threads will agree on the key's location, and requests will reach the correct destination. Note that multicast is performed every few seconds, while cluster state changes on the order of minutes, so cluster state metadata is guaranteed to converge before it undergoes further changes. In summary, storing metadata in Anna both simplifies system design by reducing external software dependencies and improves performance by relaxing the required consistency model.

4.3.3 Enabling Mechanisms

Interestingly, manipulating two of these types of metadata (hash rings and replication vectors) is the key to enabling the mechanisms described earlier in Section 3. In this section, we discuss only the implementation of each mechanism. When and why each action is executed is a matter of policy and will differ based on system configuration and workload characteristics—we save this discussion for Section 5.

Elasticity. When a new node joins a storage tier, it queries the storage system to retrieve the hash ring, updates the ring to include itself, and broadcasts its presence to all nodes in the system—storage, monitoring, and routing. Each existing node updates its copy of the hash ring, determines if it stores any keys that the new node is now responsible for, and gossips those keys to the new node. Similarly, when a node is removed, it removes itself from the hash ring and broadcasts its departure to all nodes. It then determines which nodes are now responsible for its data and gossips its keys to those nodes. Once all data has been broadcast, the node goes offline and its resources are deallocated.

Key migration overheads can be significant (see Section 6.3). To address this challenge, Anna interleaves key migration with client request handling to prevent system downtime. This is possible due to Anna's support for coordination-free consistency: the client may retrieve stale data during the key migration phase, but it can maintain a client-side cache and merge future retrieved results with the cached value. Anna's lattice-based conflict resolution guarantees that the state of the cached data is monotonically growing.

Selective Replication & Cross-Tier Data Movement. Both these mechanisms are implemented via updates to replication vectors. Each key in our two-tier implementation has a default replication vector of the form [<1, k>, <1, 1>], meaning that it has one memory-tier replica and k EBS-tier replicas. Here, k is the number of replica faults per key the administrator is willing to tolerate (discussed further in Sections 4.7 and 5). By default, keys are not replicated across threads within a single node. Anna induces cross-tier data movement by simply manipulating metadata: it increments the replication factor of one tier and decrements that of the other tier; as a result, gossip migrates data across tiers. Similarly, selective replication is achieved by adjusting the replication factor in each tier, under the fault tolerance constraint. After updating the replication vector, Anna synchronizes the metadata across replicas via asynchronous multicast.
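Building on the ReplicationVector sketch above, promotion and demotion reduce to metadata edits followed by a metadata multicast; the data itself follows via gossip. This is a sketch under our naming, with k = 2 as in the prototype.

```python
K = 2  # fault-tolerance SLO: keep at least K + 1 live replicas per key

def broadcast_metadata(rep) -> None:
    """Stand-in for Anna's asynchronous multicast of updated metadata;
    background gossip then moves data to match the new vector."""

def promote(rep) -> None:
    # A hot, EBS-only key gains one memory-tier replica; the EBS
    # factor drops so the total stays at the minimum of K + 1.
    if rep.R["memory"] == 0:
        rep.R["memory"], rep.R["ebs"] = 1, K
        broadcast_metadata(rep)

def demote(rep) -> None:
    # A cold key loses its memory replicas; EBS is restored to K + 1
    # and local (per-node) replication resets to a single thread.
    rep.R["memory"], rep.R["ebs"] = 0, K + 1
    rep.T["memory"], rep.T["ebs"] = 1, 1
    broadcast_metadata(rep)
```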
4.4 Monitoring System & Policy Engine

[Figure 3: Monitoring node architecture.]

In this section, we discuss the design of the monitoring system and how it interacts with the policy engine. As shown in Figure 3, each monitoring node has a monitoring thread, a statistics buffer, and a policy engine. The monitoring thread is stateless and periodically retrieves the stored statistics from the storage engine and triggers the policy engine. The policy engine in turn analyzes these statistics and issues actions to meet its SLOs. Anna currently supports three types of actions: elasticity change, hot-key replication, and cross-tier data movement. The implementation of these actions is covered above, in Section 4.3.3. We discuss when each of these actions is triggered and describe the end-to-end policy algorithm in Section 5.

4.5 Routing Service

The routing service isolates clients from the underlying storage system: a client asks where to find a key and is returned the set of all valid addresses for that key. Anna's routing service only maintains soft state. Each routing node caches the storage tiers' hash rings and key replication vector metadata to respond to clients' key address requests. If a key has any memory-tier replicas, the routing service only returns memory-tier addresses to maximize performance. The client caches these addresses locally to reduce request latency and load on the routing service.

When a client's cached address set becomes invalid because of a change in cluster configuration, a storage server will tell the client to invalidate cache entries for keys that have moved. The routing service will refresh its cached cluster state and give the client a new set of addresses, which will again be cached until they are invalidated.
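The client-side caching and invalidation flow can be sketched as below. The RPC layer, error type, and the routing service's addresses() method are stand-ins we introduce for illustration, not Anna's actual API.

```python
class StaleAddressError(Exception):
    """Raised when a storage server reports that a key has moved."""

def send_request(addresses, key):
    raise NotImplementedError  # stand-in for the actual RPC layer

class AnnaClient:
    """Illustrative client-side address cache over the routing tier."""

    def __init__(self, routing_service):
        self.routing = routing_service  # assumed: .addresses(key)
        self.cache = {}                 # key -> list of endpoints

    def request(self, key):
        if key not in self.cache:
            # The routing service prefers memory-tier addresses.
            self.cache[key] = self.routing.addresses(key)
        try:
            return send_request(self.cache[key], key)
        except StaleAddressError:
            # Cluster configuration changed: drop the stale entry and
            # re-resolve against the routing service's refreshed state.
            del self.cache[key]
            return self.request(key)
```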
4.6 Cluster Management

Anna uses Kubernetes [27] as a cluster management tool. Kubernetes is responsible for allocating and deallocating nodes, ensuring that nodes are alive, and rebooting failed nodes. An Anna deployment has four kinds of nodes: storage nodes, routing nodes, monitoring nodes, and a single, stateless "cluster management" node described below.

A "pod" is the atomic unit of a Kubernetes application and is a collection of one or more Docker [19] containers. All containers within a pod have access to the same resources but are isolated from each other. Each node in Anna is instantiated in a separate Kubernetes pod, and each pod contains only one instance of an Anna node. Storage system and routing service pods are pinned on separate EC2 instances for resource isolation purposes. The monitoring system is less resource intensive and can tolerate preemption, so it is not isolated. Finally, Anna maintains a singleton cluster management pod, whose role is to issue requests to add or remove nodes to the Kubernetes cluster. A simple, stateless Python server in this pod receives REST requests from the policy engine and uses bash scripts to add or remove nodes.
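A minimal stand-in for that server, using only the Python standard library, might look like the following. The endpoint scheme and script names are hypothetical; the paper does not specify them.

```python
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

class ClusterManagerHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for the cluster management pod: the policy
    engine POSTs /add/<tier>/<n> or /remove/<tier>/<n>, and we shell
    out to scripts that drive Kubernetes."""

    def do_POST(self):
        try:
            action, tier, count = self.path.strip("/").split("/")
            assert action in ("add", "remove")
            assert tier in ("memory", "ebs", "routing")
            int(count)  # validate that the count is a number
        except (ValueError, AssertionError):
            self.send_error(400)
            return
        # e.g., ./add_node.sh memory 4  (illustrative script name)
        subprocess.run(["./%s_node.sh" % action, tier, count], check=True)
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), ClusterManagerHandler).serve_forever()
```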
4.7 Fault Tolerance

Anna guarantees k-fault tolerance by ensuring k + 1 replicas are live at all times. The choice of k determines a trade-off between resilience and cost. The k + 1 replicas of each key can be spread across tiers arbitrarily, according to hotness.

When a storage node fails, other nodes detect the failure via a timeout and remove the node from the hash ring. When such a timeout happens, Anna automatically repartitions data using the updated hash ring. The cluster management pod then issues a request to spawn a new node, which enters the join protocol discussed in Section 4.3.3. Anna does not rely on the persistence of EBS volumes for fault tolerance in the disk tier. Similar to nodes in the memory tier, these nodes lose their state when they crash—this is desirable because it allows all tiers to be symmetric, regardless of the durability of the underlying storage medium. Both routing nodes and monitoring nodes only store soft state and do not require any recovery mechanisms. If a routing node fails, it queries other routing nodes for up-to-date cluster information, and if a monitoring node fails, it retrieves system statistics from the storage service.

When the cluster management pod fails, Kubernetes automatically revives it. No recovery is necessary as it does not manage any state. The state of the cluster will not change while the pod is down, since it is the actor responsible for modifying resource allocation. As a result, the policy engine will re-detect any issue requiring an elasticity change before the crash and re-issue the request upon revival.

In summary, Anna consists of a stateful storage kernel that is partitioned and selectively replicated for performance and fault tolerance with multi-master updates. Every other component is either stateless or caches only soft state that is easily recreated. As a result, the only single point of failure in Anna is the Kubernetes master. Kubernetes offers high-availability features to mitigate this problem [28]. We also note that Kubernetes is not integral to the design of Anna; we rely on it primarily to reduce the engineering burden of mundane tasks such as receiving heartbeats, allocating VMs, and deploying containers.
5. POLICY ENGINE

Anna supports three kinds of SLOs: an average request latency (L_obj) in milliseconds, a cost budget (B) in dollars/hour, and a fault tolerance (k) in number of replicas. The fault tolerance indicates the allowed number of replica failures, k. The latency objective, L_obj, is the average expected request latency. The budget, B, is the maximum cost per hour that will be spent on Anna.

As discussed in Section 4.7, Anna ensures there will never be fewer than k + 1 replicas of each key to achieve the fault tolerance goal. The latency objective and cost budget goals, however, are conflicting. The cheapest configuration of Anna is to have k + 1 EBS nodes and 1 memory node (for metadata). Clearly, this configuration will not be very performant. If we increase performance by adding memory nodes to the system, we might exceed our budget. Conversely, if we strictly enforce the budget, we might not be able to achieve the latency objective. The Anna administrator therefore specifies only one of the two goals. If a latency SLO is specified, Anna minimizes cost while meeting the latency goal. If the budget is specified instead, Anna uses no more than $B per hour while maximizing performance.

In Sections 5.1, 5.2, and 5.3, we describe heuristics to trigger each policy action—data movement, hot-key replication, and elasticity. In Section 5.4, we present Anna's complete policy algorithm, which combines these heuristics to achieve the SLO. Throughout this section, we represent each key's replication vector as [<R_M, R_E>, <T_M, T_E>], since our initial prototype only uses two tiers—M for memory and E for EBS. We have included pseudocode for the algorithms in this section in the Appendix.

5.1 Cross-Tier Data Movement

Anna's policy engine uses its monitoring statistics to calculate how frequently each key was accessed in the past T seconds, where T is an internal parameter. If a key's access frequency exceeds a configurable threshold, P, and all replicas currently reside in the EBS tier, Anna promotes a single replica to the memory tier. If the key's access frequency falls below a separate internal threshold, D, and the key has one or more memory replicas, all replicas are demoted to the EBS tier. The EBS replication factor is set to k + 1, and the local replication factors are restored to 1. Note that in Anna, all metadata is stored in the memory tier, is never demoted, and has a constant replication factor. If the aggregate storage capacity of a tier is full, Anna adds nodes (Section 5.3) to increase capacity before performing data movement. If the budget does not allow for more nodes, Anna employs a least-recently-used caching policy to demote keys.
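One pass of this heuristic can be sketched as follows, reusing the promote() and demote() metadata edits from Section 4.3.3. The paper's actual pseudocode is in its Appendix; the value of D below is a placeholder, since its default is not listed in Table 2.

```python
T = 30  # monitoring report period in seconds (Table 2 default)
P = 2   # promotion threshold: accesses within the last T seconds
D = 1   # demotion threshold -- internal; default not stated above

def cross_tier_data_movement(access_counts, vectors):
    """Promote keys that are hot but EBS-only; demote keys that have
    gone cold. `access_counts[key]` is the key's access count over
    the last T seconds; `vectors[key]` is its replication vector."""
    for key, count in access_counts.items():
        rep = vectors[key]
        if count > P and rep.R["memory"] == 0:
            promote(rep)  # bring one replica up to the memory tier
        elif count < D and rep.R["memory"] > 0:
            demote(rep)   # push all replicas down to the EBS tier
```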
5.2 Hot-Key Replication

When the access frequency of a key stored in the memory tier increases, hot-key replication increases the number of memory-tier replicas of that key. In our initial implementation, we configure only the memory tier to replicate hot keys. Because the EBS tier is not intended to be as performant, a hot key in that tier will first be promoted to the memory tier before being replicated. This policy will likely vary for a different storage hierarchy.

The policy engine classifies a key as "hot" if its access frequency exceeds an internal threshold, H, which is s standard deviations above the mean access frequency. Because Anna is a shared-nothing system, we can replicate hot keys both across cores in a single node and across nodes. Replicating across nodes seems preferable, because network ports are a typical bottleneck in distributed systems, so replicating across nodes multiplies the aggregate network bandwidth to the key. However, replicating across cores within a node can also be beneficial, as we will see in Section 6.1. Therefore, hot keys are first replicated across more nodes before being replicated across threads within a node.

The policy engine computes the target replication factor, R_M_ideal, using the ratio between the observed latency for the key and the latency objective. Cross-node replication is only possible if the current number of memory replicas, R_M, is less than the number of memory-tier nodes in the cluster, N_M. If so, we increment the key's memory replication factor to min(R_M_ideal, N_M). Otherwise, we increment the key's local replication factor on memory-tier machines up to the maximum number of worker threads (N_T_memory) using the same ratio. Finally, if the access frequency of a previously-hot key drops below a threshold, L, its replication vector is restored to the default: R_M, T_M, and T_E are all set to 1 and R_E is set to k.
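A sketch of this heuristic, using the Table 2 defaults; the exact arithmetic for R_M_ideal is our reading of the description above, and the `stats` interface is assumed.

```python
import math

S_DEV = 3    # hotness threshold in standard deviations (Table 2)
L_OBJ = 2.5  # latency objective in milliseconds (Table 2 default)

def replicate_hot_key(key, stats, rep, n_memory_nodes, threads_per_node):
    """Increase a hot key's memory-tier replication, preferring
    cross-node replication over intra-node replication."""
    if stats.freq[key] <= stats.mean_freq + S_DEV * stats.std_freq:
        return  # not hot: leave the replication vector alone
    ratio = stats.latency[key] / L_OBJ
    rm_ideal = math.ceil(rep.R["memory"] * ratio)
    if rep.R["memory"] < n_memory_nodes:
        # Cross-node replication first: it multiplies the aggregate
        # network bandwidth available to the key (Section 6.1).
        rep.R["memory"] = min(rm_ideal, n_memory_nodes)
    else:
        # Then intra-node replication, up to the thread count.
        rep.T["memory"] = min(math.ceil(rep.T["memory"] * ratio),
                              threads_per_node)
```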
5.3 Elasticity

Node Addition. Anna adds nodes when there is insufficient storage or compute capacity. When a tier has insufficient storage capacity, the policy engine computes the number of nodes required based on data size, subject to cost constraints, and instructs the cluster management service to allocate new nodes to that tier.

Node addition due to insufficient compute capacity only happens in the memory tier, because the EBS tier is not designed for performance. Compute pressure on the EBS tier is alleviated by promoting data to the memory tier, since a memory node can support 15x the requests at 4x the cost. The policy engine uses the ratio between the observed latency and the latency objective to compute the number of memory nodes to add. This ratio is bounded by a system parameter, c, to avoid overly aggressive allocation.

Node Removal. Anna requires a minimum of one memory node (for system metadata) and k + 1 EBS nodes (to meet the k-fault SLO when all data is demoted). The policy engine respects these lower bounds. We first check if any key's replication factor will exceed the total number of storage nodes in any tier after node removal. Those keys' replication factors are decremented to match the number of nodes at each tier before the nodes are removed. Anna currently only scales down the memory tier based on compute consumption and not based on storage consumption. This is because selective replication can significantly increase compute consumption without increasing storage consumption. Nonetheless, this may lead to wasteful spending under adversarial workloads; we elaborate in the next section.

Grace Periods. When resource allocation is modified, data is redistributed across each tier, briefly increasing request latency (see Section 6.3). Due to this increase, as well as data location changes, key access frequency decreases. To prevent over-correction during key redistribution, we apply a grace period to allow the system to stabilize. Key demotion, hot-key replication, and elasticity actions are all delayed until after the grace period.
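The node-addition computation is described only at a high level; one plausible reading, with the latency ratio capped at c = 1.5 per Table 2, is sketched below.

```python
import math

C = 1.5  # cap on the latency ratio, to avoid aggressive allocation

def memory_nodes_to_add(observed_latency, latency_objective, current_nodes):
    """Grow the memory tier in proportion to the latency overshoot,
    with the ratio bounded by the system parameter c. The exact
    formula is not given in the text; this is one interpretation."""
    ratio = min(observed_latency / latency_objective, C)
    target = math.ceil(current_nodes * ratio)
    return max(target - current_nodes, 0)

# e.g., 12 nodes at 2x the latency objective -> ratio capped at 1.5x,
# so the policy would request 6 additional memory-tier nodes.
assert memory_nodes_to_add(6.6, 3.3, 12) == 6
```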
5.4 End-to-End Policy

In this section, we discuss how Anna's policy engine combines the above heuristics to meet its SLOs. If the average storage consumption of all nodes in a particular tier has violated configurable upper or lower thresholds (S_upper and S_lower), nodes are added or removed respectively. We then invoke the data movement heuristic from Section 5.1 to promote and demote data across tiers. Next, the policy engine checks the average latency reported by clients. If the latency exceeds a fraction, f_upper (defaulting to 0.75), of the latency SLO and the memory tier's compute consumption exceeds a threshold, C_upper, nodes are added to the memory tier. However, if not all nodes are occupied, hot keys are replicated in the memory tier, as per Section 5.2. Finally, if the observed latency is a fraction, f_lower (defaulting to 0.5), below the objective and the compute occupancy is below C_lower, we invoke the node removal heuristic to check if nodes can be removed to save cost.

The compute threshold, C_upper, is set to 0.20. Consistent with our previous work [51], each storage node saturates its network bandwidth well before its compute capacity. We use the compute occupancy as a proxy metric for the saturation of the underlying network connection. This threshold varies significantly based on the hardware configuration; we found that 20% was optimal for our experimental setup (see Section 6).

5.4.1 Discussion

Storage Node Saturation. There are two possible causes for saturation. If all nodes are busy processing client requests, Anna must add more nodes to alleviate the load. Performing hot-key replication is not productive: since all nodes are busy, replicating hot keys to a busy node will, in fact, decrease performance due to additional gossip overhead. The other cause is a skewed access distribution, in which most client requests are sent to a small set of nodes serving the hot keys while most nodes are free. The optimal solution is to replicate the hot keys onto unsaturated nodes. If we add nodes to the cluster, the hot keys' replication factors will not change, and clients will continue to query the few nodes storing those keys. Meanwhile, the newly added nodes will idle. As discussed in Section 5.4 (Lines 8 and 10 of Algorithm 5 in the Appendix), Anna's policy engine is able to differentiate the two causes of node saturation and take the appropriate action.

Policy Limitations. There are cases in which our policy engine fails to meet the latency objective and/or wastes money. Due to current cloud infrastructure limitations, for example, it takes about five minutes to allocate a new node. An adversary could easily abuse this limitation: a short workload spike to trigger elasticity, followed by an immediate decrease, would lead Anna to allocate unnecessary nodes. These nodes will be under-utilized, but will only be removed if the observed latency drops below f_lower * L_obj. Unfortunately, removing this constraint would make Anna susceptible to reducing resource allocation during network outages, which is also undesirable. We discuss potential solutions to these issues in future work.

Knobs. There are a small number of configuration variables mentioned in this section, which are summarized in Table 2. We distinguish variables that are part of the external SLO Spec from the internal parameters of our current policy. In our evaluation, our parameters were tuned by hand to match the characteristics of the AWS services we use. There has been interesting work recently on autotuning database system configuration knobs [45]; our setting has many fewer knobs than those systems. As an alternative to auto-tuning our current knobs, we are exploring the idea of replacing the current threshold-based policy entirely with a dynamic Reinforcement Learning policy that maps directly and dynamically from performance metrics to decisions about system configuration changes. These changes to the policy engine are easy to implement, but tuning the policy is beyond the scope of this paper: it involves extensive empirical work on multiple deployment configurations.

Table 2: A summary of all variables mentioned in Section 5.

    Name                | Meaning                        | Default Value                 | Type
    --------------------+--------------------------------+-------------------------------+------------
    L_obj               | Latency objective              | 2.5ms                         | SLO Spec
    B                   | Cost budget                    | N/A (user-specified)          | SLO Spec
    k                   | Fault tolerance                | 2                             | SLO Spec
    T                   | Monitoring report period       | 30 seconds                    | Policy Knob
    H                   | Key hotness threshold          | 3 standard deviations above   | Policy Knob
                        |                                | the mean key access frequency |
    L                   | Key coldness threshold         | The mean key access frequency | Policy Knob
    P                   | Key promotion threshold        | 2 accesses in T seconds       | Policy Knob
    [S_lower, S_upper]  | Storage consumption thresholds | Memory: [0.3, 0.6];           | Policy Knob
                        |                                | EBS: [0.5, 0.75]              |
    [f_lower, f_upper]  | Latency thresholds             | [0.5, 0.75]                   | Policy Knob
    [C_lower, C_upper]  | Compute occupancy thresholds   | [0.05, 0.20]                  | Policy Knob
    c                   | Upper bound for latency ratio  | 1.5                           | Policy Knob
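Putting the pieces together, one monitoring epoch of the end-to-end policy might look like the sketch below—a paraphrase of the description above (and of Algorithm 5 in the paper's Appendix), not a copy. It reuses the cross_tier_data_movement() and replicate_hot_key() sketches from Sections 5.1 and 5.2; `stats` and `cluster` are assumed interfaces.

```python
S_LOWER = {"memory": 0.3, "ebs": 0.5}   # storage thresholds (Table 2)
S_UPPER = {"memory": 0.6, "ebs": 0.75}
F_LOWER, F_UPPER = 0.5, 0.75            # latency fractions of L_obj
C_LOWER, C_UPPER = 0.05, 0.20           # compute occupancy thresholds
L_OBJ = 2.5                             # latency objective (ms)

def policy_tick(stats, cluster):
    # 1. Storage-driven elasticity, per tier.
    for tier in ("memory", "ebs"):
        if stats.storage_occupancy[tier] > S_UPPER[tier]:
            cluster.add_nodes(tier)
        elif stats.storage_occupancy[tier] < S_LOWER[tier]:
            cluster.remove_nodes(tier)
    # 2. Promote and demote data across tiers (Section 5.1).
    cross_tier_data_movement(stats.access_counts, stats.vectors)
    # 3. Latency-driven actions, distinguishing true saturation
    #    (add nodes) from skew (replicate hot keys).
    latency = stats.avg_client_latency
    if latency > F_UPPER * L_OBJ:
        if stats.compute_occupancy["memory"] > C_UPPER:
            cluster.add_nodes("memory")      # all nodes saturated
        else:
            for key in stats.hot_keys:       # skew: spread the hot set
                replicate_hot_key(key, stats, stats.vectors[key],
                                  cluster.n_memory_nodes,
                                  cluster.threads_per_node)
    elif (latency < F_LOWER * L_OBJ
          and stats.compute_occupancy["memory"] < C_LOWER):
        cluster.remove_nodes("memory")       # over-provisioned
```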
6. EVALUATION

In this section, we present an evaluation of Anna. We first explore the advantage of different replica placement strategies in Section 6.1. We then show the benefit of selective replication in Section 6.2. We demonstrate Anna's ability to detect and adapt to variation in workload volume, skew, and hotspots in Sections 6.3 and 6.4. Finally, Section 6.5 evaluates Anna's ability to trade off performance and cost according to its SLO.

When selecting the appropriate instance type, we measured the best combination of memory, CPU, and network bandwidth for an average workload; due to space constraints, we do not include an evaluation here. Anna uses r4.2xlarge instances for memory-tier nodes and r4.large instances for EBS-tier nodes. Each node has 4 worker threads; at peak capacity, they can handle a workload that saturates the network link of the node. r4.2xlarge memory nodes have 61GB of memory, which is equally divided among all worker threads. Each thread in an EBS node has access to its own 64GB EBS volume. In our experiments, Anna uses two m4.large instances for the routing nodes and one m4.large instance for the monitoring node. We include these nodes in all cost calculations below. Unless otherwise specified, all experiments are run on a database with 1 million key-value pairs. Keys and values are 8 bytes and 256KB long, respectively. We set the k-fault tolerance goal to k = 2; there are 3 total replicas of each key. This leads to a total dataset size of about 750GB: 1M keys x 3 replicas x 256KB values.

Our workload is a YCSB-style read-modify-write of a single key chosen from a Zipfian distribution. We adjust the Zipfian coefficient to create different contention levels—a higher coefficient means a more skewed workload. The clients were run on r4.16xlarge machines, with 8 threads each. Unless stated otherwise, experiments used 40 client machines for a total of 320 concurrent, single-threaded clients.
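Such a workload driver is easy to sketch: the generator below draws key ranks with weight proportional to 1/rank^coefficient. This is our illustration of the stated distribution, not the benchmark harness actually used in the experiments.

```python
import itertools
import random

def zipf_keys(n_keys, coefficient, n_ops, seed=0):
    """Yield one key rank per operation, drawn Zipfian: rank r is
    chosen with weight 1 / r**coefficient. A higher coefficient
    concentrates load on fewer keys (e.g., 0.5 vs. 2 below)."""
    rng = random.Random(seed)
    ranks = list(range(1, n_keys + 1))
    cum = list(itertools.accumulate(1.0 / r**coefficient for r in ranks))
    for _ in range(n_ops):
        yield rng.choices(ranks, cum_weights=cum, k=1)[0]

# Fraction of 10,000 requests hitting the 100 hottest of 1M keys:
hot = sum(k <= 100 for k in zipf_keys(1_000_000, 2.0, 10_000))
```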
6.1 Replica Placement

We first compare the benefits of intra-node vs. cross-node replication; for brevity, no charts are shown for this topic. On 12 memory-tier nodes, we run a highly skewed workload with the Zipfian coefficient set to 2. With a single replica per key, we observe a maximum throughput of just above 2,000 operations per second (ops). In the case of cross-node replication, four nodes each have one thread responsible for each replicated key; in the intra-node case, we have only one node with four threads responsible for each key. Cross-node replication improves performance by a factor of four to 8,000 ops, while intra-node replication only improves performance by a factor of two to 4,000 ops. This is because the four threads on a single node all compete for the same network bandwidth, while the single threads on four separate nodes have access to four times the aggregate bandwidth. Hence, as discussed in Section 5.2, we prioritize cross-node replication over intra-node replication whenever possible, but also take advantage of intra-node replication.

6.2 Selective Replication

A key weakness of our initial work [51] (referred to as Anna v0) is that all keys are assigned a uniform replication factor. A poor choice of replication factor can lead to significant performance degradation. Increasing the replication factor boosts performance for skewed workloads, as requests to hot keys can be processed in parallel on different replicas. However, a uniform replication factor means that cold keys are also replicated, which increases gossip overhead (slowing down the system) and storage utilization (making the system more expensive). By contrast, Anna selectively replicates hot keys to achieve high performance, without paying a storage cost for replicating cold keys.

This experiment explores the benefits of selective replication by comparing Anna's memory tier against Anna v0, AWS ElastiCache (using managed Memcached), and a leading research system, Masstree [35], at various cost points. We hand-tune Anna v0's single replication factor to the optimal value for each Zipfian setting and each cost point. This experiment uses a database of 100,000 keys across all cost points; we use a smaller database since the data must fit on one node, corresponding to the minimum cost point. We configure keys in Anna to have a default replication factor of 1, since neither ElastiCache nor Masstree supports replication of any kind. To measure performance for a fixed price, we also disabled Anna's elasticity mechanism.

[Figure 4: Cost-effectiveness comparison between Anna, Anna v0, ElastiCache, and Masstree. (a) Low contention (zipf coefficient = 0.5); (b) High contention (zipf coefficient = 2). Both panels plot throughput (ops/sec) against cost (dollar/hour).]

Figure 4(a) shows that Anna consistently outperforms both Masstree and ElastiCache under low contention. As discussed in our previous work, this is because Anna's thread-per-core coordination-free execution model efficiently exploits multi-core parallelism, while other systems suffer from thread synchronization overhead through the use of locks or atomic instructions. Neither Anna nor Anna v0 replicates data in this experiment, so they deliver identical performance.

Under high contention (Figure 4(b)), Anna's throughput increases linearly with cost, while both ElastiCache and Masstree plateau. Anna selectively replicates hot keys across nodes and threads to spread the load, enabling this linear scaling; the other two systems do not have this capability. Anna v0 replicates the entire database across all nodes. While Anna v0's performance scales, the absolute throughput is worse than Anna's because naively replicating the entire database increases multicast overhead for cold keys. Furthermore, Anna v0's storage consumption is significantly higher: at $7.80/hour (14 memory nodes), Anna v0's constant replication generates 13x the original data size, while Anna incurs <1% extra storage overhead.
6.3 Dynamic Workload Skew & Volume

We now combine selective replication and elasticity to react to changes in workload skew and volume. In this experiment, we start with 12 memory-tier nodes and a latency objective of 3.3ms—about 33% above our unsaturated latency. All servers serve a light load at time 0. At the 3-minute point, we start a high contention workload with a Zipfian coefficient of 2. We see in Figure 5(a) that after a brief spike in latency, Anna replicates the highly contended keys and meets the latency SLO (the dashed red line). At minute 13, we reduce the Zipfian coefficient to 0.5 to switch to a low contention workload. Simultaneously, we increase the load volume by a factor of 4. Detecting these changes, the policy engine reduces the replication factors of the previously-hot keys. It finds that all nodes are occupied with client requests and issues a request to add four more nodes to the cluster. We see a corresponding increase in the system cost in Figure 5(b).

[Figure 5: Anna's response to changing workload. (a) Latency over time, against the latency SLO; (b) Throughput and system cost over time.]

It takes 5 minutes for the new nodes to join the cluster. Throughput increases to the saturation point of all nodes (the first plateau in Figure 5(b)), and the latency spikes to the SLO maximum from minutes 13 to 18. At minute 18, the new nodes come online and trigger a round of data repartitioning, seen in the brief latency spike and throughput dip. Anna then further increases throughput and meets the latency SLO. At the 28-minute point, we reduce the load, and Anna removes nodes to save cost.

Throughout the 32-minute experiment, the latency SLO is satisfied 97% of the time. We first violate the SLO during hot-key replication by 4x for 15 seconds. Moreover, the latency spikes to 7x the SLO during redistribution for about 30 seconds. Data redistribution causes multicast overhead on the storage servers and address cache invalidation on the clients. The latency effects are actually not terrible: as a point of comparison, TCP link latencies in data centers are documented to tolerate delays of up to 40x [5].

From minutes 13 to 18, we meet our SLO of 3.3ms exactly. With a larger load spike or lower initial resource allocation, Anna could have easily violated its SLO during that period, putting SLO satisfaction at 83%, a much less impressive figure. In any reactive policy, large enough workload variations can cause significant objective violations. As a result, it is common among cloud providers to develop client-specific service level agreements (SLAs) that reflect access patterns and latency expectations. In practice, these SLAs allow for significantly more leeway than a service's internal SLO might [38].
6.5 Cost-Performance Tradeoffs

Finally, we assess how well Anna is able to meet its SLOs. We study the Pareto efficiency of our policy: how well does it find a frontier of cost-performance tradeoffs? We sweep the SLO parameter on one of the two axes of cost and latency and observe the outcome on the other. Here, Anna uses both storage tiers and enables all policy actions. We generate workloads with three contention levels: Zipfian coefficients of 0.5 (about uniform), 0.8, and 1.0 (moderately skewed). For a database of 1M keys with a fault tolerance metric k = 2, Anna needs four EBS nodes to store all data and one memory node for metadata; this amounts to a minimum deployment cost of $2.06 per hour. To ensure that each point was representative and stable, we wait for Anna to achieve steady state, meaning that nodes are not being added or removed and latency is stable.

In Figure 7(a), we plot Anna's steady-state latency for each cost SLO, measuring average request latency over 30 seconds. At $2.10/hour (4 EBS nodes and 1 memory node), only a small fraction of hot data is stored in the memory tier due to limited storage capacity, and the observed latency ranges from 50ms to 250ms across contention levels. Requests under the high-contention workload are more likely to hit the small set of hot data in the memory tier, while the majority of requests for the low-contention workloads hit EBS nodes. As we increase the budget, latency improves for all contention levels: more memory nodes are added, and a larger fraction of the data is memory-resident. At $4.40/hour, Anna can promote at least one replica of all keys to the memory tier. From here on, latency is under 10ms across all contention levels, and performance differences between the contention levels are negligible thanks to hot-key replication.

[Figure 7: Varying contention, we measure (a) Anna latency per cost budget; (b) Anna cost per latency objective. Curves are shown for zipf=0.5, 0.8, and 1.0.]

We also compare the throughput of Anna and DynamoDB at each cost budget. Note that in this experiment, DynamoDB is configured to provide the same eventual consistency guarantees and fault tolerance metric (k = 2) as Anna. As shown in Table 3, Anna outperforms DynamoDB by 36× under a low-cost regime and by as much as 355× at higher costs. Our observed DynamoDB performance is actually somewhat better than AWS's advertised performance [6], which gives us confidence that this result is a reasonable assessment of DynamoDB's efficiency.

Table 3: Throughput comparison between Anna and DynamoDB at different cost budgets.

    Cost budget    Anna throughput    DynamoDB throughput
    $2.50/hour     1271 ops/s         35 ops/s
    $3.60/hour     3352 ops/s         55 ops/s
    $4.40/hour     23017 ops/s        71 ops/s
    $5.50/hour     33548 ops/s        90 ops/s
    $6.50/hour     38790 ops/s        108 ops/s
    $7.60/hour     43354 ops/s        122 ops/s

Lastly, we set Anna to minimize cost for a stated latency objective (Figure 7(b)). Once more, when the system reaches steady state, we measure its resource cost. To achieve sub-5ms latency (the left side of Figure 7(b)), Anna requires $9-11 per hour depending on the contention level; this latency requires at least one replica of all keys to be in the memory tier. Between 5 and 200ms, higher-contention workloads are cheaper to serve, as hot data can be concentrated on a few memory nodes; for the same latency range, lower-contention workloads require more memory and are thus more expensive. Above 200ms, most data resides on the EBS tier, and Anna meets the latency objective at about $2 an hour.
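The Zipfian workloads used throughout this section can be reproduced with a simple inverse-CDF sampler over a finite key universe, sketched below. A hand-rolled sampler is needed because common library routines such as numpy.random.zipf only support exponents strictly greater than 1, whereas these experiments use coefficients as low as 0.5; everything else here (names, seed handling) is our own illustrative choice.

    import bisect
    import random

    def zipf_sampler(n, s, seed=0):
        """Sample 1-indexed keys with probability proportional to 1 / i**s."""
        rng = random.Random(seed)
        cdf, total = [], 0.0
        for i in range(1, n + 1):
            total += 1.0 / (i ** s)
            cdf.append(total)
        def sample():
            # Invert the CDF: find the first key whose cumulative weight
            # exceeds a uniform draw in [0, total).
            return bisect.bisect_left(cdf, rng.random() * total) + 1
        return sample

    # s = 0.5 is near-uniform; s = 2.0 concentrates almost all accesses
    # on a handful of keys, as in the high-contention experiments.
    sample = zipf_sampler(n=1_000_000, s=1.0)
    print([sample() for _ in range(5)])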
7. RELATED WORK

As a key-value store, Anna builds on prior systems from both the database and distributed systems literature. Nonetheless, it is differentiated in the way it leverages and combines these ideas to achieve new levels of efficiency and automation.

Elastic Cloud Storage. A small number of cloud-based elastic file systems have considered workload responsiveness and elasticity. Sierra [44] and Rabbit [7] are single-master systems that handle the problem of read and write offloading: when a node is inactive or overloaded, requests to blocks at that node must be offloaded to alternative nodes. This is particularly important for writes to blocks mastered at the inactive node. SpringFS [52] optimizes this work by finding the minimum number of machines needed for offloading. By contrast, Anna supports multi-master updates and selective key replication. When nodes go down or become slow in Anna, writes are simply retried at any existing replica, and new replicas are spawned as needed by the policy.

ElastMan [4] is a "bolt-on" elasticity manager for cloud KVSes that responds dynamically to changing workload volume. Anna, on the other hand, manages the dynamics of skew and hotspots in addition to volume. An interesting aspect of ElastMan is its proactive policy for anticipating workload changes such as diurnal patterns; we return to this point when discussing future work.

Key-Value Stores. There has been a wide range of work on key-value stores for both multicore and distributed systems, more than we have room to survey here. Our earlier work [51] offers a recent snapshot overview of that domain. In this paper, our focus is not on the KVS kernel, but on the mechanisms to adapt to workload distributions and trade-offs in performance and cost.

Selective Key Replication. The concept of selectively replicating data for performance has a long history, dating back to the Bubba database system [17]. More recently, ecStore [47], Scarlett [8], and E2FS [15] perform single-master selective replication, generating read-only replicas of hot data to speed up read performance. Content delivery network (CDN) providers such as Google Cloud CDN [21], Swarmify [43], and Akamai [3] use similar techniques to replicate content close to the edge to speed up content delivery. In comparison, Anna's multi-master selective replication improves both read and write performance, achieving general workload scaling. Conflicts introduced by concurrent writes to different replicas are resolved asynchronously using our lattices' merge logic [51].

Selective replication requires maintaining metadata to track hot keys. ecStore uses histograms to keep the hot-key metadata compact, whereas Anna currently maintains access frequencies for the full key set. We are exploring two traditional optimizations to reduce this overhead: heavy-hitter sketches rather than histograms [31], and distributed aggregation architectures for computing sketches in parallel with minimal bandwidth [34].

Another effort to address workload skew is Blowfish [26], which combines replication and compression to trade off storage and performance under time-varying workloads. Adding compression to Anna to achieve a fine-grained performance-cost trade-off is an interesting future direction.
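To make the metadata optimization mentioned above concrete, the sketch below implements the classic Space-Saving heavy-hitter summary. It is not the cluster-preserving sketch of [31] and is not Anna's code; it simply illustrates how exact per-key access counters over the full key set could be replaced by O(capacity) state that still surfaces the keys worth replicating or promoting.

    class SpaceSaving:
        def __init__(self, capacity):
            self.capacity = capacity
            self.counts = {}  # key -> estimated access count

        def observe(self, key):
            if key in self.counts:
                self.counts[key] += 1
            elif len(self.counts) < self.capacity:
                self.counts[key] = 1
            else:
                # Evict the current minimum; the newcomer inherits its count
                # plus one, so estimates can overshoot but never undershoot
                # a key's true frequency.
                victim = min(self.counts, key=self.counts.get)
                self.counts[key] = self.counts.pop(victim) + 1

        def hot_keys(self, threshold):
            """Keys whose estimated access count meets the hot-key threshold."""
            return [k for k, c in self.counts.items() if c >= threshold]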
Tiered Storage. Beyond textbook caching, there are many interesting multi-tier storage systems in the literature. A classic example in the file systems domain is the HP AutoRAID system [49]. Databases also considered tertiary storage during the era of WORM devices and storage robots [41, 33]. Broadcast Disks envisioned using multiple broadcast frequencies to construct arbitrary hierarchies of virtual storage [2]. More recently, there has been interest in filesystem caching for analytics workloads: OctopusFS [24] is a tiered file system in this vein, and Tachyon [32] is another recent system that serves as a memory cache for analytics working sets, backing a file system interface. Our considerations are rather different from this prior work: the size of each tier in Anna can change due to elasticity, and the overall volume of data to be stored can change due to dynamic replication.

8. CONCLUSION AND FUTURE WORK

Anna provides a simple, unified API for efficient key-value storage in the cloud. Unlike popular cloud storage systems today, it supports non-trivial distributions of access patterns by eliminating common boundaries of static deployment and fixed cost-performance tradeoffs. Developers simply declare their desired tradeoffs, without managing a custom mix of heterogeneous services.

Behind its simple API, Anna uses three core mechanisms to meet SLOs efficiently: horizontal elasticity to right-size the service by adding and removing nodes dynamically, vertical data movement across tiers to reduce cost by demoting cold data, and multi-master selective replication to scale request handling at a fine granularity. The primary contribution of Anna is its integration of these three features into an efficient, elastic system representing a new design point for cloud storage. These features are implemented by a policy engine that monitors workloads and responds to them by manipulating metadata stored in Anna's high-performance storage engine.

Our evaluation shows that Anna is extremely efficient. In many cases, Anna is orders of magnitude more cost-effective than popular cloud storage services and prior research systems. Anna is also unique in its ability to adapt automatically to variable workloads.

Although Anna's design addresses the main weaknesses of modern cloud storage that we set out to study, it also raises a number of interesting avenues for research.

Proactive Policy Design. Our current policy design is entirely reactive, taking action based on current state. To improve this, we are interested in proactive policies that anticipate upcoming load spikes and allocate additional resources in advance. Using larger workload traces and more advanced predictive techniques, we suspect one could dynamically tune Anna more intelligently to respond to and anticipate changing workloads.
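As a purely speculative sketch of what such a proactive trigger might look like, the fragment below forecasts the request rate with Holt's double exponential smoothing and pre-allocates nodes whenever the forecast outruns current capacity. The smoothing constants, the per-node throughput figure, and all names are invented for illustration; a production policy would fit these to real traces.

    import math

    def holt_forecast(rates, alpha=0.5, beta=0.3, horizon=5):
        """Forecast the request rate `horizon` steps ahead (Holt's method)."""
        level, trend = rates[0], 0.0
        for r in rates[1:]:
            prev_level = level
            level = alpha * r + (1 - alpha) * (level + trend)
            trend = beta * (level - prev_level) + (1 - beta) * trend
        return level + horizon * trend

    OPS_PER_NODE = 4000  # assumed per-node saturation throughput

    def nodes_to_preallocate(rates, current_nodes):
        """Nodes to start now so capacity is ready when the spike lands."""
        needed = math.ceil(holt_forecast(rates) / OPS_PER_NODE)
        return max(needed - current_nodes, 0)

    # A steadily ramping load: the 5-step-ahead forecast (~51K ops/s) exceeds
    # 12 nodes' capacity (48K ops/s), so one extra node is started before the
    # spike arrives rather than minutes after it.
    print(nodes_to_preallocate([10_000, 14_000, 18_500, 23_000, 28_000, 33_500], 12))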
Defining SLOs & SLAs. Currently, the system administrator defines a single latency objective corresponding to an overall average. For any system configuration, there are adversarial workloads that can defeat this SLO. For example, in Section 6.3, a larger load spike could have forced Anna above its stated SLO for a long period. SLOs, SLAs and policies can be designed for both expected- and worst-case scenarios, using pricing and incentives.

A fundamental issue is that users with large working sets require more resources at the memory tier to hit a given SLO. This is clear in Figure 7: if each workload corresponds to a user, the user with the lower Zipfian parameter costs more to service at a given SLO. SLAs should be designed to account for costs that vary across users.

Reducing Elasticity Overhead. The 5-minute delay for node addition noted in Section 6.3 is a significant problem. It limits the effectiveness of any elasticity policy, since feedback from allocating a new node is delayed for an eternity in compute time. Anecdotally, our colleagues building elastic services at major cloud providers tell us they contend with these same issues. A standard solution today is to maintain a standby pool of "warm" nodes that are partially prepared for use. To make this cost-effective, these nodes have to run alternative containers that monetize the idle resources. An alternative solution is to make "cold" container startup much faster than we experience today. This is a well-studied problem for desktop operating systems [36] and VM research [50, 18, 30], and we believe it should be more widely available in public cloud settings.

Evaluating Other Tiers. Currently, Anna is implemented over only two tiers, but cloud providers like AWS offer a much wider array of price-performance regimes. There is an opportunity to add services at both ends of the price-performance spectrum that can leverage Anna's elastic scaling and coordination-free execution. As mentioned earlier, our storage kernel requires very little modification to support new storage layers, and our policy naturally supports more than two tiers. However, our current thresholds in Section 5 are the result of significant empirical measurement and tuning; these parameters will need to be adjusted to the underlying storage hardware. This manual effort could be replaced by auto-tuning approaches that learn models of configurations, workloads, and parameter settings. There has been recent work on analogous auto-tuning problems [46, 22, 45].

9. REFERENCES

[1] D. Abadi. Consistency tradeoffs in modern distributed database system design: CAP is only part of the story. Computer, 45(2):37–42, Feb. 2012.
[2] S. Acharya, R. Alonso, M. Franklin, and S. Zdonik. Broadcast disks: Data management for asymmetric communication environments. In Mobile Computing, pages 331–361. Springer, 1995.
[3] Akamai. https://www.akamai.com.
[4] A. Al-Shishtawy and V. Vlassov. ElastMan: Elasticity manager for elastic key-value stores in the cloud. In Proceedings of the 2013 ACM Cloud and Autonomic Computing Conference, CAC ’13, pages 7:1–7:10, New York, NY, USA, 2013. ACM.
[5] M. Alizadeh, A. Greenberg, D. A. Maltz, J. Padhye, P. Patel, B. Prabhakar, S. Sengupta, and M. Sridharan. Data center TCP (DCTCP). In Proceedings of the ACM SIGCOMM 2010 Conference, SIGCOMM ’10, pages 63–74, New York, NY, USA, 2010. ACM.
[6] Amazon Web Services. Amazon DynamoDB developer guide (API version 2012-08-10), Aug. 2012. https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.ProvisionedThroughput.html. Accessed May 3, 2018.
[7] H. Amur, J. Cipar, V. Gupta, G. R. Ganger, M. A. Kozuch, and K. Schwan. Robust and flexible power-proportional storage. In Proceedings of the 1st ACM Symposium on Cloud Computing, SoCC ’10, pages 217–228, New York, NY, USA, 2010. ACM.
[8] G. Ananthanarayanan, S. Agarwal, S. Kandula, A. Greenberg, I. Stoica, D. Harlan, and E. Harris. Scarlett: Coping with skewed content popularity in MapReduce clusters. In Proceedings of the Sixth Conference on Computer Systems, EuroSys ’11, pages 287–300, New York, NY, USA, 2011. ACM.
[9] Amazon Web Services. https://aws.amazon.com.
[10] Microsoft Azure cloud computing platform. http://azure.microsoft.com.
[11] P. Bailis, A. Davidson, A. Fekete, A. Ghodsi, J. M. Hellerstein, and I. Stoica. Highly available transactions: Virtues and limitations. Proc. VLDB Endow., 7(3):181–192, Nov. 2013.
[12] K. Birman, G. Chockler, and R. van Renesse. Toward a cloud computing research agenda. ACM SIGACT News, 40(2):68–80, 2009.
[13] S. Boyd-Wickizer, M. F. Kaashoek, R. Morris, and N. Zeldovich. Non-scalable locks are dangerous.
[14] E. Brewer. A certain freedom: Thoughts on the CAP theorem. In Proceedings of the 29th ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing, PODC ’10, pages 335–335, New York, NY, USA, 2010. ACM.
[15] L. Chen, M. Qiu, J. Song, Z. Xiong, and H. Hassan. E2FS: An elastic storage system for cloud computing. The Journal of Supercomputing, 74(3):1045–1060, Mar. 2018.
[16] N. Conway, W. R. Marczak, P. Alvaro, J. M. Hellerstein, and D. Maier. Logic and lattices for distributed programming. In Proceedings of the Third ACM Symposium on Cloud Computing, SoCC ’12, pages 1:1–1:14, New York, NY, USA, 2012. ACM.
[17] G. Copeland, W. Alexander, E. Boughter, and T. Keller. Data placement in Bubba. In ACM SIGMOD Record, volume 17, pages 99–108. ACM, 1988.
[18] B. Cully, G. Lefebvre, D. Meyer, M. Feeley, N. Hutchinson, and A. Warfield. Remus: High availability via asynchronous virtual machine replication. In Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation, pages 161–174, San Francisco, 2008.
[19] Docker - build, ship, and run any app, anywhere. https://www.docker.com.
[20] J. M. Faleiro and D. J. Abadi. Latch-free synchronization in database systems: Silver bullet or fool's gold? In Proceedings of the 8th Biennial Conference on Innovative Data Systems Research, CIDR ’17, 2017.
[21] Google Cloud Platform. https://cloud.google.com.
[22] H. Herodotou, H. Lim, G. Luo, N. Borisov, L. Dong, F. B. Cetin, and S. Babu. Starfish: A self-tuning system for big data analytics. In CIDR, volume 11, pages 261–272, 2011.
[23] P. Hunt, M. Konar, F. P. Junqueira, and B. Reed. ZooKeeper: Wait-free coordination for internet-scale systems. In USENIX Annual Technical Conference, volume 8, Boston, MA, USA, 2010.
[24] E. Kakoulli and H. Herodotou. OctopusFS: A distributed file system with tiered storage management. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD ’17, pages 65–78, New York, NY, USA, 2017. ACM.
[25] D. Karger, E. Lehman, T. Leighton, R. Panigrahy, M. Levine, and D. Lewin. Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the World Wide Web. In Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing, STOC ’97, pages 654–663, New York, NY, USA, 1997. ACM.
[26] A. Khandelwal, R. Agarwal, and I. Stoica. Blowfish: Dynamic storage-performance tradeoff in data stores. In 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI 16), pages 485–500, Santa Clara, CA, 2016. USENIX Association.
[27] Kubernetes: Production-grade container orchestration. http://kubernetes.io.
[28] Kubernetes. Set up high-availability Kubernetes masters. https://kubernetes.io/docs/tasks/administer-cluster/highly-available-master/. Accessed May 3, 2018.
[29] S. Kulkarni, N. Bhagat, M. Fu, V. Kedigehalli, C. Kellogg, S. Mittal, J. M. Patel, K. Ramasamy, and S. Taneja. Twitter Heron: Stream processing at scale. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD ’15, pages 239–250, New York, NY, USA, 2015. ACM.
[30] H. A. Lagar-Cavilla, J. A. Whitney, A. M. Scannell, P. Patchin, S. M. Rumble, E. De Lara, M. Brudno, and M. Satyanarayanan. SnowFlock: Rapid virtual machine cloning for cloud computing. In Proceedings of the 4th ACM European Conference on Computer Systems, pages 1–12. ACM, 2009.
[31] K. G. Larsen, J. Nelson, H. L. Nguyen, and M. Thorup. Heavy hitters via cluster-preserving clustering. CoRR, abs/1604.01357, 2016.
[32] H. Li, A. Ghodsi, M. Zaharia, S. Shenker, and I. Stoica. Tachyon: Reliable, memory speed storage for cluster computing frameworks. In Proceedings of the ACM Symposium on Cloud Computing, SoCC ’14, pages 6:1–6:15, New York, NY, USA, 2014. ACM.
[33] D. Lomet and B. Salzberg. Access methods for multiversion data. SIGMOD Rec., 18(2):315–324, June 1989.
[34] A. Manjhi, S. Nath, and P. B. Gibbons. Tributaries and deltas: Efficient and robust aggregation in sensor network streams. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, SIGMOD ’05, pages 287–298, New York, NY, USA, 2005. ACM.
[35] Y. Mao, E. Kohler, and R. T. Morris. Cache craftiness for fast multicore key-value storage. In Proceedings of the 7th ACM European Conference on Computer Systems, pages 183–196. ACM, 2012.
[36] Microsoft Corp. Delivering a great startup and shutdown experience, May 2017. https://docs.microsoft.com/en-us/windows-hardware/test/weg/delivering-a-great-startup-and-shutdown-experience. Accessed May 3, 2018.
[37] A. Rao, K. Lakshminarayanan, S. Surana, R. Karp, and I. Stoica. Load balancing in structured P2P systems. In International Workshop on Peer-to-Peer Systems, pages 68–79. Springer, 2003.
[38] A. Ross, A. Hilton, and D. Rensin. SLOs, SLIs, SLAs, oh my - CRE life lessons, Jan. 2017. https://cloudplatform.googleblog.com/2017/01/availability-part-deux--CRE-life-lessons.html. Accessed May 3, 2018.
[39] M. Shapiro, N. Preguiça, C. Baquero, and M. Zawirski. Conflict-free replicated data types. In X. Défago, F. Petit, and V. Villain, editors, Stabilization, Safety, and Security of Distributed Systems, pages 386–400, Berlin, Heidelberg, 2011. Springer Berlin Heidelberg.
[40] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The Hadoop distributed file system. In Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), MSST ’10, pages 1–10, Washington, DC, USA, 2010. IEEE Computer Society.
[41] M. Stonebraker. The design of the POSTGRES storage system. In Proceedings of the 13th International Conference on Very Large Data Bases, VLDB ’87, pages 289–300, San Francisco, CA, USA, 1987. Morgan Kaufmann Publishers Inc.
[42] Storm. https://github.com/apache/storm.
[43] Swarmify. https://swarmify.com.
[44] E. Thereska, A. Donnelly, and D. Narayanan. Sierra: Practical power-proportionality for data center storage. In Proceedings of the Sixth Conference on Computer Systems, EuroSys ’11, pages 169–182, New York, NY, USA, 2011. ACM.
[45] D. Van Aken, A. Pavlo, G. J. Gordon, and B. Zhang. Automatic database management system tuning through large-scale machine learning. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD ’17, pages 1009–1024, New York, NY, USA, 2017. ACM.
[46] D. Van Aken, A. Pavlo, G. J. Gordon, and B. Zhang. Automatic database management system tuning through large-scale machine learning. In Proceedings of the 2017 ACM International Conference on Management of Data, pages 1009–1024. ACM, 2017.
[47] H. T. Vo, C. Chen, and B. C. Ooi. Towards elastic transactional cloud storage with range query support. Proceedings of the VLDB Endowment, 3(1-2):506–514, 2010.
[48] F. M. Waas. Beyond conventional data warehousing: Massively parallel data processing with Greenplum Database (invited talk). In U. Dayal, M. Castellanos, and T. Sellis, editors, Business Intelligence for the Real-Time Enterprise - Second International Workshop, BIRTE 2008, Auckland, New Zealand, August 24, 2008, Revised Selected Papers, pages 89–96, Aug. 2008.
[49] J. Wilkes, R. Golding, C. Staelin, and T. Sullivan. The HP AutoRAID hierarchical storage system. ACM Trans. Comput. Syst., 14(1):108–136, Feb. 1996.
[50] T. Wood, P. J. Shenoy, A. Venkataramani, M. S. Yousif, et al. Black-box and gray-box strategies for virtual machine migration. In NSDI, volume 7, pages 17–17, 2007.
[51] C. Wu, J. M. Faleiro, Y. Lin, and J. M. Hellerstein. Anna: A KVS for any scale. In 2018 IEEE 34th International Conference on Data Engineering (ICDE), 2018.
[52] L. Xu, J. Cipar, E. Krevat, A. Tumanov, N. Gupta, M. A. Kozuch, and G. R. Ganger. SpringFS: Bridging agility and performance in elastic distributed storage. In Proceedings of the 12th USENIX FAST, pages 243–255. USENIX, 2014.

APPENDIX

We include pseudocode for the algorithms described in Section 5 here. Note that some of these algorithms rely on a latency objective, which may or may not be specified. When no latency objective is specified, Anna aspires to its unsaturated request latency (2.5ms) to provide the best possible performance, but caps spending at the specified budget.

Algorithm 1 DataMovement
Input: Key, [<RM, RE>, <TM, TE>]
1: if access(Key, T) > P and RM = 0 then
2:   adjust(Key, RM + 1, RE - 1, TM, TE)
3: else if access(Key, T) < D and RM > 0 then
4:   adjust(Key, 0, k + 1, 1, 1)

Algorithm 2 HotKeyReplication
Input: Key, [<RM, RE>, <TM, TE>]
1: if access(Key, T) > H and RM < NM then
2:   SET RM_ideal = RM * Lobs/Lobj
3:   SET RM = min(RM_ideal, NM)
4:   adjust(Key, RM, RE, TM, TE)
5: else if access(Key, T) > H and RM = NM then
6:   SET TM_ideal = TM * Lobs/Lobj
7:   SET TM = min(TM_ideal, NT_memory)
8:   adjust(Key, RM, RE, TM, TE)
9: else if access(Key, T) < L and (RM > 1 or TM > 1) then
10:  adjust(Key, 1, k, 1, 1)

Algorithm 3 NodeAddition
Input: tier, mode
1: if mode = storage then
2:   SET N_target = required_storage(tier)
3:   if Cost_target > Budget then
4:     SET N_target = adjust()
5:   add_node(tier, N_target - N_tier_current)
6: else if mode = compute and tier = M then
7:   SET N_target = NM_current * min(Lobs/Lobj, c)
8:   if Cost_target > Budget then
9:     SET N_target = adjust()
10:  add_node(M, N_target - NM_current)

Algorithm 4 NodeRemoval
Input: tier, mode
1: if mode = storage and tier = E then
2:   SET N_target = max(required_storage(E), k + 1)
3:   reduce_replication()
4:   remove_node(E, NE_current - N_target)
5: else if mode = compute and tier = M then
6:   if NM_current > 1 then
7:     reduce_replication()
8:     remove_node(M, 1)

Algorithm 5 AnnaPolicy
Input: tiers = {M, E}, keys
1: for tier in tiers do
2:   if storage(tier) > S_upper then
3:     NodeAddition(tier, storage)
4:   else if storage(tier) < S_lower then
5:     NodeRemoval(tier, storage)
6: for key in keys do
7:   DataMovement(key)
8: if Lobs > f_upper * Lobj and compute(M) > C_upper then
9:   NodeAddition(M, compute)
10: else if Lobs > f_upper * Lobj and compute(M) <= C_upper then
11:  for key in keys_memory do
12:    HotKeyReplication(key)
13: else if Lobs < f_lower * Lobj and compute(M) < C_lower then
14:  NodeRemoval(M, compute)
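For readers who prefer an executable rendering, the following is a direct paraphrase of Algorithm 5 with Anna's mechanisms stubbed behind a hypothetical cluster object. The threshold constants are placeholders for the tuned values described in Section 5, and the method names are ours, not Anna's actual interfaces.

    S_UPPER, S_LOWER = 0.9, 0.3  # storage utilization band per tier
    C_UPPER, C_LOWER = 0.9, 0.3  # memory-tier compute occupancy band
    F_UPPER, F_LOWER = 1.0, 0.5  # latency band as fractions of the objective

    def anna_policy(cluster, keys, latency_objective_ms):
        # Step 1 (Algorithm 5, lines 1-5): keep each tier's storage
        # utilization inside its band by adding or removing nodes.
        for tier in ("memory", "ebs"):
            if cluster.storage_util(tier) > S_UPPER:
                cluster.add_nodes(tier, mode="storage")
            elif cluster.storage_util(tier) < S_LOWER:
                cluster.remove_nodes(tier, mode="storage")
        # Step 2 (lines 6-7): promote or demote individual keys across tiers.
        for key in keys:
            cluster.data_movement(key)
        # Step 3 (lines 8-14): react to the latency objective. If latency is
        # high and the memory tier is saturated, grow it; if latency is high
        # but capacity remains, replicate hot keys instead; if latency is
        # comfortably low, shrink the memory tier to save cost.
        latency = cluster.observed_latency()
        if latency > F_UPPER * latency_objective_ms:
            if cluster.compute_util("memory") > C_UPPER:
                cluster.add_nodes("memory", mode="compute")
            else:
                for key in cluster.memory_tier_keys():
                    cluster.hot_key_replication(key)
        elif (latency < F_LOWER * latency_objective_ms
              and cluster.compute_util("memory") < C_LOWER):
            cluster.remove_nodes("memory", mode="compute")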