
COVER FEATURE

CAP Twelve Years Later: How the "Rules" Have Changed

Eric Brewer, University of California, Berkeley

The CAP theorem asserts that any networked shared-data system can have only two of three desirable properties. However, by explicitly handling partitions, designers can optimize consistency and availability, thereby achieving some tradeoff of all three.

In the decade since its introduction, designers and researchers have used (and sometimes abused) the CAP theorem as a reason to explore a wide variety of novel distributed systems. The NoSQL movement also has applied it as an argument against traditional databases.

The CAP theorem states that any networked shared-data system can have at most two of three desirable properties:

• consistency (C) equivalent to having a single up-to-date copy of the data;
• high availability (A) of that data (for updates); and
• tolerance to network partitions (P).

This expression of CAP served its purpose, which was to open the minds of designers to a wider range of systems and tradeoffs; indeed, in the past decade, a vast range of new systems has emerged, as well as much debate on the relative merits of consistency and availability. The "2 of 3" formulation was always misleading because it tended to oversimplify the tensions among properties. Now such nuances matter. CAP prohibits only a tiny part of the design space: perfect availability and consistency in the presence of partitions, which are rare.

Although designers still need to choose between consistency and availability when partitions are present, there is an incredible range of flexibility for handling partitions and recovering from them. The modern CAP goal should be to maximize combinations of consistency and availability that make sense for the specific application. Such an approach incorporates plans for operation during a partition and for recovery afterward, thus helping designers think about CAP beyond its historically perceived limitations.

WHY "2 OF 3" IS MISLEADING

The easiest way to understand CAP is to think of two nodes on opposite sides of a partition. Allowing at least one node to update state will cause the nodes to become inconsistent, thus forfeiting C. Likewise, if the choice is to preserve consistency, one side of the partition must act as if it is unavailable, thus forfeiting A. Only when nodes communicate is it possible to preserve both consistency and availability, thereby forfeiting P. The general belief is that for wide-area systems, designers cannot forfeit P and therefore have a difficult choice between C and A. In some sense, the NoSQL movement is about creating choices that focus on availability first and consistency second; databases that adhere to ACID properties (atomicity, consistency, isolation, and durability) do the opposite. The "ACID, BASE, and CAP" sidebar explains this difference in more detail.

0018-9162/12/$31.00 © 2012 IEEE. Published by the IEEE Computer Society. February 2012.
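The two-node argument above can be made concrete with a toy sketch. This is purely illustrative; the `Replica` class and `write` function are invented for this example and do not come from the article. During a partition, accepting a write on one side forfeits C, while refusing it forfeits A.

```python
class Replica:
    def __init__(self, name, value):
        self.name = name
        self.value = value

def write(replica, value, partitioned, prefer_availability):
    """During a partition, a write forces the choice between C and A."""
    if partitioned and not prefer_availability:
        # One side acts as if it is unavailable, preserving consistency.
        raise RuntimeError(f"{replica.name}: refusing write (unavailable)")
    replica.value = value  # applied locally; the peer cannot see it

a, b = Replica("A", 0), Replica("B", 0)

# Choosing availability: the write succeeds, but the replicas diverge (C is lost).
write(a, 42, partitioned=True, prefer_availability=True)
assert a.value != b.value

# Choosing consistency: the write is refused (A is lost).
try:
    write(b, 99, partitioned=True, prefer_availability=False)
except RuntimeError:
    pass
assert b.value == 0
```

Only when the two replicas can communicate again does the dichotomy disappear, which is the "forfeiting P" case in the text.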

ACID, BASE, AND CAP

ACID and BASE represent two design philosophies at opposite ends of the consistency-availability spectrum. The ACID properties focus on consistency and are the traditional approach of databases. My colleagues and I created BASE in the late 1990s to capture the emerging design approaches for high availability and to make explicit both the choice and the spectrum. Modern large-scale wide-area systems, including the cloud, use a mix of both approaches.

Although both terms are more mnemonic than precise, the BASE acronym (being second) is a bit more awkward: Basically Available, Soft state, Eventually consistent. Soft state and eventual consistency are techniques that work well in the presence of partitions and thus promote availability.

The relationship between CAP and ACID is more complex and often misunderstood, in part because the C and A in ACID represent different concepts than the same letters in CAP, and in part because choosing availability affects only some of the ACID guarantees. The four ACID properties are:

Atomicity (A). All systems benefit from atomic operations. When the focus is availability, both sides of a partition should still use atomic operations. Moreover, higher-level atomic operations (the kind that ACID implies) actually simplify recovery.

Consistency (C). In ACID, the C means that a transaction preserves all the database rules, such as unique keys. In contrast, the C in CAP refers only to single-copy consistency, a strict subset of ACID consistency. ACID consistency also cannot be maintained across partitions—partition recovery will need to restore ACID consistency. More generally, maintaining invariants during partitions might be impossible, thus the need for careful thought about which operations to disallow and how to restore invariants during recovery.

Isolation (I). Isolation is at the core of the CAP theorem: if the system requires ACID isolation, it can operate on at most one side during a partition. Serializability requires communication in general and thus fails across partitions. Weaker definitions of correctness are viable across partitions via compensation during partition recovery.

Durability (D). As with atomicity, there is no reason to forfeit durability, although the developer might choose to avoid needing it via soft state (in the style of BASE) due to its expense. A subtle point is that, during partition recovery, it is possible to reverse durable operations that unknowingly violated an invariant during the operation. However, at the time of recovery, given a durable history from both sides, such operations can be detected and corrected. In general, running ACID transactions on each side of a partition makes recovery easier and enables a framework for compensating transactions that can be used for recovery from a partition.

In fact, this exact discussion led to the CAP theorem. In the mid-1990s, my colleagues and I were building a variety of cluster-based wide-area systems (essentially early cloud computing), including search engines, proxy caches, and content distribution systems [1]. Because of both revenue goals and contract specifications, system availability was at a premium, so we found ourselves regularly choosing to optimize availability through strategies such as employing caches or logging updates for later reconciliation. Although these strategies did increase availability, the gain came at the cost of decreased consistency.

The first version of this consistency-versus-availability argument appeared as ACID versus BASE [2], which was not well received at the time, primarily because people love the ACID properties and are hesitant to give them up. The CAP theorem's aim was to justify the need to explore a wider design space—hence the "2 of 3" formulation. The theorem first appeared in fall 1998. It was published in 1999 [3] and in the keynote address at the 2000 Symposium on Principles of Distributed Computing [4], which led to its proof.

As the "CAP Confusion" sidebar explains, the "2 of 3" view is misleading on several fronts. First, because partitions are rare, there is little reason to forfeit C or A when the system is not partitioned. Second, the choice between C and A can occur many times within the same system at very fine granularity; not only can subsystems make different choices, but the choice can change according to the operation or even the specific data or user involved. Finally, all three properties are more continuous than binary. Availability is obviously continuous from 0 to 100 percent, but there are also many levels of consistency, and even partitions have nuances, including disagreement within the system about whether a partition exists.

Exploring these nuances requires pushing the traditional way of dealing with partitions, which is the fundamental challenge. Because partitions are rare, CAP should allow perfect C and A most of the time, but when partitions are present or perceived, a strategy that detects partitions and explicitly accounts for them is in order. This strategy should have three steps: detect partitions, enter an explicit partition mode that can limit some operations, and initiate a recovery process to restore consistency and compensate for mistakes made during a partition.

CAP-LATENCY CONNECTION

In its classic interpretation, the CAP theorem ignores latency, although in practice, latency and partitions are deeply related. Operationally, the essence of CAP takes place during a timeout, a period when the program must make a fundamental decision—the partition decision:

• cancel the operation and thus decrease availability, or
• proceed with the operation and thus risk inconsistency.

Retrying communication to achieve consistency, for example, via Paxos or a two-phase commit, just delays the decision. At some point the program must make the decision; retrying communication indefinitely is in essence choosing C over A.
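The partition decision can be sketched as a write that waits a bounded time for a remote acknowledgment and then either cancels or proceeds. This is a minimal simulation, not a real replication protocol; the names (`replicate`, `TIMEOUT`, the return strings) are invented for illustration, and a deliberate sleep stands in for a partitioned link.

```python
import queue
import threading
import time

TIMEOUT = 0.05  # the time bound that, operationally, defines a partition

def replicate(update, acks):
    """Stand-in for sending an update to a remote replica over the network."""
    time.sleep(0.2)  # simulate a link that is slow or partitioned
    acks.put(update)

def write(update, choose_availability):
    acks = queue.Queue()
    threading.Thread(target=replicate, args=(update, acks), daemon=True).start()
    try:
        acks.get(timeout=TIMEOUT)  # wait for the remote acknowledgment
        return "consistent commit"
    except queue.Empty:
        # Timeout reached: the partition decision must be made now.
        if choose_availability:
            return "local commit (A over C: risks inconsistency)"
        return "canceled (C over A: operation unavailable)"

assert write("x=1", choose_availability=True) == "local commit (A over C: risks inconsistency)"
assert write("x=2", choose_availability=False) == "canceled (C over A: operation unavailable)"
```

Retrying inside `write` instead of returning would simply postpone the same decision, which is the point the text makes about Paxos and two-phase commit.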

Thus, pragmatically, a partition is a time bound on communication. Failing to achieve consistency within the time bound implies a partition and thus a choice between C and A for this operation. These concepts capture the core design issue with regard to latency: are two sides moving forward without communication?

This pragmatic view gives rise to several important consequences. The first is that there is no global notion of a partition, since some nodes might detect a partition, and others might not. The second consequence is that nodes can detect a partition and enter a partition mode—a central part of optimizing C and A.

Finally, this view means that designers can set time bounds intentionally according to target response times; systems with tighter bounds will likely enter partition mode more often and at times when the network is merely slow and not actually partitioned.

Sometimes it makes sense to forfeit strong C to avoid the high latency of maintaining consistency over a wide area. Yahoo's PNUTS system incurs inconsistency by maintaining remote copies asynchronously [5]. However, it makes the master copy local, which decreases latency. This strategy works well in practice because single-user data is naturally partitioned according to the user's (normal) location. Ideally, each user's data master is nearby.

Facebook uses the opposite strategy [6]: the master copy is always in one location, so a remote user typically has a closer but potentially stale copy. However, when users update their pages, the update goes to the master copy directly, as do all the user's reads for a short time, despite higher latency. After 20 seconds, the user's traffic reverts to the closer copy, which by that time should reflect the update.

CAP CONFUSION

Aspects of the CAP theorem are often misunderstood, particularly the scope of availability and consistency, which can lead to undesirable results. If users cannot reach the service at all, there is no choice between C and A except when part of the service runs on the client. This exception, commonly known as disconnected operation or offline mode [1], is becoming increasingly important. Some HTML5 features—in particular, on-client persistent storage—make disconnected operation easier going forward. These systems normally choose A over C and thus must recover from long partitions.

Scope of consistency reflects the idea that, within some boundary, state is consistent, but outside that boundary all bets are off. For example, within a primary partition, it is possible to ensure complete consistency and availability, while outside the partition, service is not available. Paxos and atomic multicast systems typically match this scenario [2]. In Google, the primary partition usually resides within one datacenter; however, Paxos is used on the wide area to ensure global consensus, as in Chubby [3], and highly available durable storage, as in Megastore [4].

Independent, self-consistent subsets can make forward progress while partitioned, although it is not possible to ensure global invariants. For example, with sharding, in which designers prepartition data across nodes, it is highly likely that each shard can make some progress during a partition. Conversely, if the relevant state is split across a partition or global invariants are necessary, then at best only one side can make progress and at worst no progress is possible.

Does choosing consistency and availability (CA) as the "2 of 3" make sense? As some researchers correctly point out, exactly what it means to forfeit P is unclear [5, 6]. Can a designer choose not to have partitions? If the choice is CA, and then there is a partition, the choice must revert to C or A. It is best to think about this probabilistically: choosing CA should mean that the probability of a partition is far less than that of other systemic failures, such as disasters or multiple simultaneous faults. Such a view makes sense because real systems lose both C and A under some sets of faults, so all three properties are a matter of degree.

In practice, most groups assume that a datacenter (single site) has no partitions within, and thus design for CA within a single site; such designs, including traditional databases, are the pre-CAP default. However, although partitions are less likely within a datacenter, they are indeed possible, which makes a CA goal problematic. Finally, given the high latency across the wide area, it is relatively common to forfeit perfect consistency across the wide area for better performance.

Another aspect of CAP confusion is the hidden cost of forfeiting consistency, which is the need to know the system's invariants. The subtle beauty of a consistent system is that the invariants tend to hold even when the designer does not know what they are. Consequently, a wide range of reasonable invariants will work just fine. Conversely, when designers choose A, which requires restoring invariants after a partition, they must be explicit about all the invariants, which is both challenging and prone to error. At the core, this is the same concurrent updates problem that makes multithreading harder than sequential programming.

References
1. J. Kistler and M. Satyanarayanan, "Disconnected Operation in the Coda File System," ACM Trans. Computer Systems, Feb. 1992, pp. 3-25.
2. K. Birman, Q. Huang, and D. Freedman, "Overcoming the 'D' in CAP: Using Isis2 to Build Locally Responsive Cloud Services," Computer, Feb. 2011, pp. 50-58.
3. M. Burrows, "The Chubby Lock Service for Loosely-Coupled Distributed Systems," Proc. Symp. Operating Systems Design and Implementation (OSDI 06), Usenix, 2006, pp. 335-350.
4. J. Baker et al., "Megastore: Providing Scalable, Highly Available Storage for Interactive Services," Proc. 5th Biennial Conf. Innovative Data Systems Research (CIDR 11), ACM, 2011, pp. 223-234.
5. D. Abadi, "Problems with CAP, and Yahoo's Little Known NoSQL System," DBMS Musings, blog, 23 Apr. 2010; http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html.
6. C. Hale, "You Can't Sacrifice Partition Tolerance," 7 Oct. 2010; http://codahale.com/you-cant-sacrifice-partition-tolerance.
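The three-step strategy introduced earlier (detect the partition, operate in an explicit partition mode, recover when communication returns) can be sketched as a small state machine. All class and method names here are invented for illustration; a real system would hang detection off its timeouts and plug a real merge procedure into recovery.

```python
from enum import Enum

class Mode(Enum):
    NORMAL = 1
    PARTITION = 2

class Node:
    def __init__(self):
        self.mode = Mode.NORMAL
        self.log = []  # extra information recorded for partition recovery

    def on_timeout(self):
        # Step 1: a timeout is, operationally, the start of a partition.
        self.mode = Mode.PARTITION

    def execute(self, op, risky=False):
        # Step 2: in partition mode, limit risky operations and log the rest.
        if self.mode is Mode.PARTITION:
            if risky:
                self.log.append(("deferred", op))  # record intent, run later
                return None
            self.log.append(("applied", op))
        return f"done:{op}"

    def on_reconnect(self, merge):
        # Step 3: initiate partition recovery when communication is restored.
        merge(self.log)  # restore consistency; compensate for mistakes
        self.mode = Mode.NORMAL
        self.log = []

n = Node()
n.on_timeout()
assert n.execute("charge-card", risky=True) is None  # delayed, not lost
assert n.execute("add-item") == "done:add-item"      # proceeds, but is logged
n.on_reconnect(lambda log: None)                     # placeholder merge
assert n.mode is Mode.NORMAL
```

Note that the risky operation is deferred rather than refused outright, mirroring the article's point that forfeiting A during a partition can be done in ways users barely perceive.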

MANAGING PARTITIONS

The challenging case for designers is to mitigate a partition's effects on consistency and availability. The key idea is to manage partitions very explicitly, including not only detection, but also a specific recovery process and a plan for all of the invariants that might be violated during a partition. This management approach has three steps:

• detect the start of a partition,
• enter an explicit partition mode that may limit some operations, and
• initiate partition recovery when communication is restored.

The last step aims to restore consistency and compensate for mistakes the program made while the system was partitioned.

Figure 1 shows a partition's evolution. Normal operation is a sequence of atomic operations, and thus partitions always start between operations. Once the system times out, it detects a partition, and the detecting side enters partition mode. If a partition does indeed exist, both sides enter this mode, but one-sided partitions are possible. In such cases, the other side communicates as needed and either this side responds correctly or no communication was required; either way, operations remain consistent. However, because the detecting side could have inconsistent operations, it must enter partition mode. Systems that use a quorum are an example of this one-sided partitioning. One side will have a quorum and can proceed, but the other cannot. Systems that support disconnected operation clearly have a notion of partition mode, as do some atomic multicast systems, such as Java's JGroups.

[Figure 1. The state starts out consistent and remains so until a partition starts. To stay available, both sides enter partition mode and continue to execute operations, creating concurrent states S1 and S2, which are inconsistent. When the partition ends, the truth becomes clear and partition recovery starts. During recovery, the system merges S1 and S2 into a consistent state S' and also compensates for any mistakes made during the partition.]

Once the system enters partition mode, two strategies are possible. The first is to limit some operations, thereby reducing availability. The second is to record extra information about the operations that will be helpful during partition recovery. Continuing to attempt communication will enable the system to discern when the partition ends.

Which operations should proceed?

Deciding which operations to limit depends primarily on the invariants that the system must maintain. Given a set of invariants, the designer must decide whether or not to maintain a particular invariant during partition mode or risk violating it with the intent of restoring it during recovery. For example, for the invariant that keys in a table are unique, designers typically decide to risk that invariant and allow duplicate keys during a partition. Duplicate keys are easy to detect during recovery, and, assuming that they can be merged, the designer can easily restore the invariant.

For an invariant that must be maintained during a partition, however, the designer must prohibit or modify operations that might violate it. (In general, there is no way to tell if the operation will actually violate the invariant, since the state of the other side is not knowable.) Externalized events, such as charging a credit card, often work this way. In this case, the strategy is to record the intent and execute it after the recovery. Such transactions are typically part of a larger workflow that has an explicit order-processing state, and there is little downside to delaying the operation until the partition ends. The designer forfeits A in a way that users do not see. The users know only that they placed an order and that the system will execute it later.

More generally, partition mode gives rise to a fundamental user-interface challenge, which is to communicate that tasks are in progress but not complete. Researchers have explored this problem in some detail for disconnected operation, which is just a long partition. Bayou's calendar application, for example, shows potentially inconsistent (tentative) entries in a different color [7]. Such notifications are regularly visible both in workflow applications, such as commerce with e-mail notifications, and in cloud services with an offline mode, such as Google Docs.

One reason to focus on explicit atomic operations, rather than just reads and writes, is that it is vastly easier to analyze the impact of higher-level operations on invariants. Essentially, the designer must build a table that looks at the cross product of all operations and all invariants and decide for each entry if that operation could violate the invariant. If so, the designer must decide whether to prohibit, delay, or modify the operation. In practice, these decisions can also depend on the known state, on the arguments, or on both. For example, in systems with a home node for certain data [5], operations can typically proceed on the home node but not on other nodes.

The best way to track the history of operations on both sides is to use version vectors, which capture the causal dependencies among operations. The vector's elements are a pair (node, logical time), with one entry for every node that has updated the object and the time of its last update. Given two versions of an object, A and B, A is newer than B if, for every node in common in their vectors, A's times are greater than or equal to B's and at least one of A's times is greater. If it is impossible to order the vectors, then the updates were concurrent and possibly inconsistent. Thus, given the version vector history of both sides, the system can easily tell which operations are already in a known order and which executed concurrently. Recent work [8] proved that this kind of causal consistency is the best possible outcome in general if the designer chooses to focus on availability.

Partition recovery

At some point, communication resumes and the partition ends. During the partition, each side was available and thus making forward progress, but partitioning has delayed some operations and violated some invariants. At this point, the system knows the state and history of both sides because it kept a careful log during partition mode. The state is less useful than the history, from which the system can deduce which operations actually violated invariants and what results were externalized, including the responses sent to the user. The designer must solve two hard problems during recovery:

• the state on both sides must become consistent, and
• there must be compensation for the mistakes made during partition mode.

It is generally easier to fix the current state by starting from the state at the time of the partition and rolling forward both sets of operations in some manner, maintaining consistent state along the way. Bayou did this explicitly by rolling back the database to a correct time and replaying the full set of operations in a well-defined, deterministic order so that all nodes reached the same state [9]. Similarly, source-code control systems such as the Concurrent Versioning System (CVS) start from a shared consistent point and roll forward updates to merge branches.

Most systems cannot always merge conflicts. For example, CVS occasionally has conflicts that the user must resolve manually, and wiki systems with offline mode typically leave conflicts in the resulting document that require manual editing [10].

Conversely, some systems can always merge conflicts by choosing certain operations. A case in point is text editing in Google Docs [11], which limits operations to applying a style and adding or deleting text. Thus, although the general problem of conflict resolution is not solvable, in practice, designers can choose to constrain the use of certain operations during partitioning so that the system can automatically merge state during recovery. Delaying risky operations is one relatively easy implementation of this strategy.

Using commutative operations is the closest approach to a general framework for automatic state convergence. The system concatenates logs, sorts them into some order, and then executes them. Commutativity implies the ability to rearrange operations into a preferred consistent global order. Unfortunately, using only commutative operations is harder than it appears; for example, addition is commutative, but addition with a bounds check is not (a zero balance, for example).

Recent work by Marc Shapiro and colleagues at INRIA [12, 13] has greatly improved the use of commutative operations for state convergence. The team has developed commutative replicated data types (CRDTs), a class of data structures that provably converge after a partition, and describes how to use these structures to

• ensure that all operations during a partition are commutative, or
• represent values on a lattice and ensure that all operations during a partition are monotonically increasing with respect to that lattice.

The latter approach converges state by moving to the maximum of each side's values. It is a formalization and improvement of what Amazon does with its shopping cart [14]: after a partition, the converged value is the union of the two carts, with union being a monotonic set operation. The consequence of this choice is that deleted items may reappear.

However, CRDTs can also implement partition-tolerant sets that both add and delete items. The essence of this approach is to maintain two sets: one each for the added and deleted items, with the difference being the set's membership. Each simplified set converges, and thus so does the difference. At some point, the system can clean things up simply by removing the deleted items from both sets. However, such cleanup generally is possible only while the system is not partitioned. In other words, the designer must prohibit or postpone some operations during a partition, but these are cleanup operations that do not limit perceived availability. Thus, by implementing state through CRDTs, a designer can choose A and still ensure that state converges automatically after a partition.

6. C OV ER F E AT U RE ally customer service will compensate those passengers COMPENSATION ISSUES IN AN in some way. AUTOMATED TELLER MACHINE The airplane example also exhibits an externalized mistake: if the airline had not said that the passenger had I a seat, fixing the problem would be much easier. This is n the design of an automated teller machine (ATM), strong consistency would appear to be the logical choice, but in practice, another reason to delay risky operations: at the time of A trumps C. The reason is straightforward enough: higher availa- recovery, the truth is known. The idea of compensation is bility means higher revenue. Regardless, ATM design serves as a really at the core of fixing such mistakes; designers must good context for reviewing some of the challenges involved in create compensating operations that both restore an in- compensating for invariant violations during a partition. variant and more broadly correct an externalized mistake. The essential ATM operations are deposit, withdraw, and check balance. The key invariant is that the balance should be zero or Technically, CRDTs allow only locally verifiable invariants higher. Because only withdraw can violate the invariant, it will need —a limitation that makes compensation unnecessary but special treatment, but the other two operations can always that somewhat decreases the approach’s power. However, a execute. solution that uses CRDTs for state convergence could allow The ATM system designer could choose to prohibit withdrawals the temporary violation of a global invariant, converge during a partition, since it is impossible to know the true balance at that time, but that would compromise availability. Instead, using the state after the partition, and then execute any needed stand-in mode (partition mode), modern ATMs limit the net with- compensations. drawal to at most k, where k might be $200. 
Below this limit, Recovering from externalized mistakes typically re- withdrawals work completely; when the balance reaches the limit, quires some history about externalized outputs. Consider the system denies withdrawals. Thus, the ATM chooses a sophisti- the drunk “dialing” scenario, in which a person does not cated limit on availability that permits withdrawals but bounds the risk. remember making various telephone calls while intoxi- When the partition ends, there must be some way to both cated the previous night. That person’s state in the light of restore consistency and compensate for mistakes made while the day might be sound, but the log still shows a list of calls, system was partitioned. Restoring state is easy because the opera- some of which might have been mistakes. The calls are the tions are commutative, but compensation can take several forms. A external effects of the person’s state (intoxication). Because final balance below zero violates the invariant. In the normal case, the ATM dispensed the money, which caused the mistake to the person failed to remember the calls, it could be hard to become external. The bank compensates by charging a fee and compensate for any trouble they have caused. expecting repayment. Given that the risk is bounded, the problem In a machine context, a computer could execute orders is not severe. However, suppose that the balance was below zero at twice during a partition. If the system can distinguish two some point during the partition (unknown to the ATM), but that a intentional orders from two duplicate orders, it can cancel later deposit brought it back up. In this case, the bank might still charge an overdraft fee retroactively, or it might ignore the viola- one of the duplicates. If externalized, one compensation tion, since the customer has already made the necessary payment. 
strategy would be to autogenerate an e-mail to the cus- In general, because of communication delays, the banking tomer explaining that the system accidentally executed system depends not on consistency for correctness, but rather on the order twice but that the mistake has been fixed and auditing and compensation. Another example of this is “check to attach a coupon for a discount on the next order. With- kiting,” in which a customer withdraws money from multiple branches before they can communicate and then flees. The over- out the proper history, however, the burden of catching the draft will be caught later, perhaps leading to compensation in the mistake is on the customer. form of legal action. Some researchers have formally explored compensating transactions as a way to deal with long-lived transac- tions.15,16 Long-running transactions face a variation of during partitioning. The tracking and limitation of partition- the partition decision: is it better to hold locks for a long mode operations ensures the knowledge of which invari- time to ensure consistency, or release them early and ants could have been violated, which in turn enables the expose uncommitted data to other transactions but allow designer to create a restoration strategy for each such higher concurrency? A typical example is trying to update invariant. Typically, the system discovers the violation all employee records as a single transaction. Serializing during recovery and must implement any fix at that time. this transaction in the normal way locks all records and There are various ways to fix the invariants, includ- prevents concurrency. Compensating transactions take ing trivial ways such as “last writer wins” (which ignores a different approach by breaking the large transaction some updates), smarter approaches that merge opera- into a saga, which consists of multiple subtransactions, tions, and human escalation. An example of the latter each of which commits along the way. 
Thus, to abort the is airplane overbooking: boarding the plane is in some larger transaction, the system must undo each already sense partition recovery with the invariant that there committed subtransaction by issuing a new transaction must be at least as many seats as passengers. If there are that corrects for its effects—the compensating transaction. too many passengers, some will lose their seats, and ide- In general, the goal is to avoid aborting other trans- 28 COMPUTER

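The saga mechanism described above (each subtransaction commits along the way; to abort the larger transaction, the system issues compensating transactions for the already-committed steps in reverse order) can be sketched in a few lines of Python. This sketch is illustrative only, with invented step names; it is not code from the article or from any particular framework.

```python
# Minimal saga sketch (illustrative; names are invented). Each step pairs a
# subtransaction with a compensating transaction that semantically undoes it.
def run_saga(steps):
    """steps: list of (action, compensation) pairs. Each action commits
    immediately; on failure, already-committed steps are compensated in
    reverse order, rather than holding locks for the whole saga."""
    done = []
    for action, compensate in steps:
        try:
            action()
            done.append(compensate)
        except Exception:
            for comp in reversed(done):
                comp()  # issue the compensating transaction
            return "compensated"
    return "committed"

log = []
steps = [
    (lambda: log.append("charge card"), lambda: log.append("refund card")),
    (lambda: log.append("ship order"),  lambda: log.append("recall order")),
    (lambda: (_ for _ in ()).throw(RuntimeError("out of stock")),
     lambda: None),  # third subtransaction fails
]
print(run_saga(steps))  # -> "compensated"
print(log)  # charge and ship committed, then recall and refund in reverse
```

As in the article's refund example, the compensations do not restore the exact prior state (the charge and refund both happened); they aim for a state that is arguably equivalent.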
In general, the goal is to avoid aborting other transactions that used the incorrectly committed data (no cascading aborts). The correctness of this approach depends not on serializability or isolation, but rather on the net effect of the transaction sequence on state and outputs. That is, after compensations, does the database essentially end up in a place equivalent to where it would have been had the subtransactions never executed? The equivalence must include externalized actions; for example, refunding a duplicate purchase is hardly the same as not charging that customer in the first place, but it is arguably equivalent. The same idea holds in partition recovery. A service or product provider cannot always undo mistakes directly, but it aims to admit them and take new, compensating actions. How best to apply these ideas to partition recovery is an open problem. The "Compensation Issues in an Automated Teller Machine" sidebar describes some of the concerns in just one application area.

System designers should not blindly sacrifice consistency or availability when partitions exist. Using the proposed approach, they can optimize both properties through careful management of invariants during partitions. As newer techniques, such as version vectors and CRDTs, move into frameworks that simplify their use, this kind of optimization should become more widespread. However, unlike ACID transactions, this approach requires more thoughtful deployment relative to past strategies, and the best solutions will depend heavily on details about the service's invariants and operations.

Acknowledgments
I thank Mike Dahlin, Hank Korth, Marc Shapiro, Justin Sheehy, Amin Vahdat, Ben Zhao, and the IEEE Computer Society volunteers for their helpful feedback on this work.

References
1. E. Brewer, "Lessons from Giant-Scale Services," IEEE Internet Computing, July/Aug. 2001, pp. 46-55.
2. A. Fox et al., "Cluster-Based Scalable Network Services," Proc. 16th ACM Symp. Operating Systems Principles (SOSP 97), ACM, 1997, pp. 78-91.
3. A. Fox and E.A. Brewer, "Harvest, Yield and Scalable Tolerant Systems," Proc. 7th Workshop Hot Topics in Operating Systems (HotOS 99), IEEE CS, 1999, pp. 174-178.
4. E. Brewer, "Towards Robust Distributed Systems," Proc. 19th Ann. ACM Symp. Principles of Distributed Computing (PODC 00), ACM, 2000, pp. 7-10; www.cs.berkeley.edu/~brewer/PODC2000.pdf.
5. B. Cooper et al., "PNUTS: Yahoo!'s Hosted Data Serving Platform," Proc. VLDB Endowment (VLDB 08), ACM, 2008, pp. 1277-1288.
6. J. Sobel, "Scaling Out," Facebook Engineering Notes, 20 Aug. 2008; www.facebook.com/note.php?note_id=23844338919&id=9445547199.
7. W.K. Edwards et al., "Designing and Implementing Asynchronous Collaborative Applications with Bayou," Proc. 10th Ann. ACM Symp. User Interface Software and Technology (UIST 97), ACM, 1997, pp. 119-128.
8. P. Mahajan, L. Alvisi, and M. Dahlin, Consistency, Availability, and Convergence, tech. report UTCS TR-11-22, Univ. of Texas at Austin, 2011.
9. D.B. Terry et al., "Managing Update Conflicts in Bayou, a Weakly Connected Replicated Storage System," Proc. 15th ACM Symp. Operating Systems Principles (SOSP 95), ACM, 1995, pp. 172-182.
10. B. Du and E.A. Brewer, "DTWiki: A Disconnection and Intermittency Tolerant Wiki," Proc. 17th Int'l Conf. World Wide Web (WWW 08), ACM, 2008, pp. 945-952.
11. "What's Different about the New Google Docs: Conflict Resolution," blog; http://googledocs.blogspot.com/2010/09/whats-different-about-new-google-docs_22.html.
12. M. Shapiro et al., "Conflict-Free Replicated Data Types," Proc. 13th Int'l Conf. Stabilization, Safety, and Security of Distributed Systems (SSS 11), ACM, 2011, pp. 386-400.
13. M. Shapiro et al., "Convergent and Commutative Replicated Data Types," Bulletin of the EATCS, no. 104, June 2011, pp. 67-88.
14. G. DeCandia et al., "Dynamo: Amazon's Highly Available Key-Value Store," Proc. 21st ACM SIGOPS Symp. Operating Systems Principles (SOSP 07), ACM, 2007, pp. 205-220.
15. H. Garcia-Molina and K. Salem, "SAGAS," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD 87), ACM, 1987, pp. 249-259.
16. H. Korth, E. Levy, and A. Silberschatz, "A Formal Approach to Recovery by Compensating Transactions," Proc. VLDB Endowment (VLDB 90), ACM, 1990, pp. 95-106.

Eric Brewer is a professor of computer science at the University of California, Berkeley, and vice president of infrastructure at Google. His research interests include cloud computing, scalable servers, sensor networks, and technology for developing regions. He also helped create USA.gov, the official portal of the federal government. Brewer received a PhD in electrical engineering and computer science from MIT. He is a member of the National Academy of Engineering. Contact him at brewer@cs.berkeley.edu.

COMPUTER, FEBRUARY 2012
