Task Scheduling for Highly Concurrent Analytical and Transactional Main-Memory Workloads

Iraklis Psaroudakis (École Polytechnique Fédérale de Lausanne) iraklis.psaroudakis@epfl.ch
Tobias Scheuer (SAP AG) tobias.scheuer@sap.com
Norman May (SAP AG) norman.may@sap.com
Anastasia Ailamaki (École Polytechnique Fédérale de Lausanne) anastasia.ailamaki@epfl.ch

ABSTRACT

Task scheduling typically employs a worker thread per hardware context to process a dynamically changing set of tasks. It is an appealing solution to exploit modern multi-core processors, as it eases parallelization and avoids unnecessary context switches and their associated costs. Naïvely bundling DBMS operations into tasks, however, can result in sub-optimal usage of CPU resources: highly contending transactional workloads involve blocking tasks. Moreover, analytical queries assume they can use all available resources while issuing tasks, resulting in an excessive number of tasks and an unnecessary associated scheduling overhead.

In this paper, we show how to overcome these problems and exploit the performance benefits of task scheduling for main-memory DBMS. Firstly, we use application knowledge about blocking tasks to dynamically adapt the number of workers and aid the OS scheduler to saturate CPU resources. In addition, we show that analytical queries should issue a low number of tasks in cases of high concurrency, to avoid excessive synchronization, communication and scheduling costs. To achieve that, we maintain a concurrency hint, reflecting recent CPU availability, that partitionable analytical operations can use as a limit while adjusting their task granularity. We integrate our scheduler into a commercial main-memory column-store, and show that it improves the performance of mixed workloads, by up to 12.5% for analytical queries and 370% for transactional queries.

1. INTRODUCTION

In the era of big data, the key evaluation criterion for database management systems (DBMS) is their performance when evaluating an ever-increasing number of on-line analytical processing (OLAP) and on-line transactional processing (OLTP) queries. The challenge for DBMS is to leverage the increase of the processing power of modern multi-socket multi-core processors to maximize performance and efficiently service incoming queries over growing datasets.

Typically, the execution engine of a DBMS uses a single logical thread for short-lived latency-sensitive transactional queries. Long-running operations, such as complex transactional queries or analytical queries, may be parallelized using more logical threads. For highly concurrent workloads, simply issuing logical threads and leaving scheduling to the operating system (OS) leads to high creation costs and numerous context switches. The latter are incurred by the OS time sharing policy that balances the usage of a limited number of available hardware contexts among a higher number of threads using time slices [2, 25].

For resource management, including CPU, DBMS typically employ a query admission control to limit the number of processed queries. A query admission control, however, is a mechanism that operates on a per-query level, and can only indirectly avoid an excessive number of threads. It does not control resource utilization of queries after they have been admitted. Task scheduling [2, 6, 10, 13, 21] is a more appealing solution, as it uses a number of threads to process all operations for the whole run-time of an application. Tasks, which encapsulate operations, are stored in task pools, and worker threads are employed by the task scheduler to process the tasks. Task scheduling can be a complementary solution to query admission control, or even an alternative if tasks can be prioritized (see Section 4).

Moreover, task scheduling is well-suited for the recent wave of main-memory DBMS that forfeit disk-based storage in favor of performance. By removing I/O bottlenecks, main-memory DBMS can focus completely on optimizing CPU and memory utilization. Task scheduling can prove a powerful tool for main-memory DBMS, as it can automate the efficient usage of CPU resources, especially of modern shared-memory multi-core processors [2, 10, 13, 32], and can help developers easily parallelize database operations.

Recent popular task scheduling frameworks include the OpenMP API [3, 10] and Intel Thread Building Blocks (TBB) [2]. Their main advantage is that developers express partitionable operations, that can be parallelized with a variable number of tasks, using a high level of abstraction such as data parallelism. The high level of abstraction helps to automatically adjust the task granularity of analytical parti-
2 .tionable operations, such as aggregations or hash-joins. The a commercial main-memory DBMS, and how we apply our high level of abstraction, however, also turns out to be a task scheduler to SAP HANA. Next, we present the general disadvantage, as it cannot be used straightforwardly by al- architecture of our scheduler in Section 4, how we handle ready developed applications. Integration into a commercial blocking tasks using a flexible concurrency level in Section DBMS would require a re-write of large portions of code, 5, and how we use concurrency hints to aid task creators of which is a process with significant cost and time consider- partitionable operations adapt their task granularity in Sec- ations. Moreover, partitionable operations in commercial tion 6. In Section 7, we show our experimental evaluation. DBMS typically define their task granularity independently, Finally, in Section 8, we present our conclusions. without the use of a central mechanism for data parallelism. This is the common case, as optimizing the granularity of a 2. RELATED WORK single DBMS paritionable operation alone involves consider- Task scheduling for parallel programs and parallel systems able research effort (see [12], for example, for partitioning in is a broad field of research. Early related work focuses on hash-joins). In this paper, we show how to adjust task gran- static scheduling [19, 23], which is typically done at com- ularity in a main-memory DBMS, in a non-intrusive manner, pile time and assumes that basic information about tasks without the need of a high level of abstraction. We supply (such as processing times, dependencies, synchronization, partitionable operations with a hint reflecting recent CPU and communication costs) and the target machine environ- availability, that can be used to adjust their task granu- ment are known in advance. Given perfect information, a larity. Our experiments show that when partitionable op- static scheduling algorithm attempts to produce the opti- erations use this concurrency hint, overall performance for mal assignment of tasks to processors, that ideally balances analytical and mixed workloads is significantly improved. their loads and minimizes scheduling overheads and mem- Furthermore, recent task schedulers, e.g. Intel TBB, use a ory referencing delays [20]. Perfect information, however, fixed number of worker threads, equal to the number of hard- is hard to obtain for modern shared-memory multiproces- ware contexts. This is a standard technique to avoid over- sors and modern applications where tasks may be generated commitment of CPU resources. The fixed concurrency level, dynamically and at a fast pace. For example, queries in however, is only suited for CPU-intensive tasks that rarely a DBMS arrive dynamically, and information about them block [2]. We show that tasks in DBMS often block due can only be estimated, often with high relative errors. More to synchronization, especially in heavily contending OLTP recent related work focuses on dynamic scheduling, which workloads. Thus, the fixed concurrency level can result in is done on-the-fly at run-time [23, 35, 38]. Dynamic task under-utilization of CPU resources in DBMS workloads. In scheduling does not require information about supplied tasks this paper, we show how the task scheduler can detect the a priori, has less overhead than static scheduling, and pro- inactivity periods of tasks and dynamically adapt its con- vides automatic load-balancing and improved portability be- currency level. 
Our scheduler gives control to the OS of tween different hardware architectures [27]. Dynamic task additional workers when needed to saturate CPU resources. scheduling may also use run-time measurement to re-adapt Contributions. We apply task scheduling to a commer- scheduling decisions [30, 33, 38] for better data locality [38] cial main-memory DBMS. Our experiments show the ben- or NUMA (non-uniform memory access) awareness [11, 34]. efits of using task scheduling for scaling up main-memory Our work studies the practical application and evaluation DBMS over modern multi-socket multi-core processors, to of dynamic task scheduling for the specific case of a main- efficiently evaluate highly concurrent analytical and trans- memory DBMS on a single shared-memory multiprocessor actional workloads. Our main contributions are: machine. We focus on the minimization of context switches, handling blocked tasks, and adjusting task granularity for • We show that a fixed concurrency level for task schedul- highly concurrent workloads, and leave optimizations for ing is not suitable for a DBMS, as tasks often block, data locality and NUMA awareness for future work. especially in highly contending OLTP workloads. Our The common design of recent dynamic task schedulers scheduler adapts its concurrency level, by detecting involves task pools where tasks are submitted dynamically blocked tasks, and giving control to the OS of addi- at run-time, and the scheduler employs a set of threads to tional worker threads to saturate CPU resources. work on the tasks [21]. There are two main categories of • We show that using a hint reflecting recent CPU avail- task schedulers [16]: breadth-first schedulers [31] and work- ability helps to adjust the task granularity of parti- first schedulers [13] with various work-stealing techniques [6, tionable analytical operations. The concurrency hint 28, 32, 34] for load-balancing and improved predictability in improves overall performance significantly in cases of real-time applications [27]. In breadth-first schedulers, when high concurrency, by reducing costs related to commu- a task is created, it is placed in the task pools and the par- nication, synchronization and book-keeping. ent task continues execution. In work-first schedulers, the thread of the parent task switches to execute child tasks, • We show how we integrate our task scheduler into SAP for potentially better data locality [13]. Our scheduler com- HANA [18], a commercial main-memory DBMS. We bines both approaches: when a task generates a group of new show that our task scheduler improves the performance tasks, the parent thread takes upon one of the new tasks, of highly concurrent analytical workloads (TPC-H [5]) following the work-first approach, while the rest of the tasks by up to 16%, and of highly concurrent mixed work- are dispatched to the task pools, following the breadth-first loads by up to 12.5% for analytical queries (TPC-H) approach, for load-balancing. and 370% for transactional queries (TPC-C [4]). Hoffmann et al [21] provide a survey on different task pool implementations. Distributed task pools with stealing Paper organization. In Section 2, we present related achieve best performance, as they minimize synchronization work. In Section 3, we give an overview of SAP HANA, overheads and stealing amends load-balancing issues. We 2
3 .follow a similar approach for our scheduler. Johnson et al query execution plans. Scalability on every node is achieved [22] decouple contention management from scheduling and through multi-threaded algorithms that exploit data paral- load management, to combine the advantages of spin and lelism inherent in many database operations. These algo- blocking locks. Our scheduler is not concerned with how rithms are implemented with a special focus on hardware- locks are used, but uses the information about blocked tasks conscious design and high scalability on modern multi-core to dynamically adjust its concurrency level. processors. Furthermore, quick response times for short- In highly concurrent analytical workloads with a large running queries is provided by an efficient session manage- number of partitionable operations issuing tasks, granularity ment and client interface [24]. of tasks plays an important role in communication, synchro- Although the architecture of SAP HANA enables the loose nization, and scheduling costs [1, 14, 15, 26, 28]. Our ex- coupling of all components and their selective leverage, the perimental results corroborate these observations for main- components are independent as far as execution is concerned. memory DBMS. While we address workloads with long- Some components use their own thread pools, while they running analytical queries as well as short-running OLTP- assume they can fully exploit system resources, unaware of queries, we cannot rely on special implementations to avoid concurrent operations of other components. The three ma- locking as proposed in [39]. Recent task scheduling frame- jor thread pools are the following. (1) The Dispatcher is a works such as OpenMP [3, 10] and Intel Thread Building simple task graph scheduler used typically for parallelizing Blocks [2] regulate task granularity by requiring the devel- partitionable analytical operations. (2) The Executor is a oper to express parallelism in a higher-level abstract manner task graph scheduler that processes plans of operations that and use this information. We assume that there is no cen- can potentially be distributed across machines. Plan nodes tral mechanism for data parallelism, and that partitionable can use the Dispatcher for parallelizing on one machine. (3) operations define their task granularity independently. We The Receivers are threads that process received network re- use a concurrency hint, reflecting recent CPU availability, quests. Short-running transactions are typically completely to adjust the task granularity of partitionable operations. executed within a Receiver. For more complex transactions, longer analytical or distributed queries, the Receiver uses the Executor, which in turn uses the Dispatcher. 3. OVERVIEW OF SAP HANA The independence of these three thread pools poses sev- SAP HANA is a commercial main-memory DBMS, that eral problems. Firstly, when all thread pools are active, the aims to efficiently support OLTP and OLAP workloads [18]. number of logical threads may surpass available hardware It supports a multitude of data formats and provides facil- contexts, over-committing CPU resources and resulting in ities for an extensive spectrum of enterprise applications, context switching costs. Secondly, DBMS administrators under a flexible and componentized architecture [17]. can be confused while configuring how many threads each The DBMS architecture is depicted in Figure 1, (see [18, pool should use. 
A good configuration is highly dependent 36] for an overview). SAP HANA provides four main-memory on the workload. Thirdly, developers need to learn three dif- storage engines: a column-store, suited for OLAP-dominant ferent thread implementations, and battle with parallelizing and mixed workloads, a row-store, suited for OLTP-dominant their operations across each one of them. We design our new workloads [37], as well as a graph engine and a text engine scheduler to address the aforementioned issues, and better [18]. The transaction manager uses multi-version concur- scale up on a single shared-memory multiprocessor system rency control (MVCC) [36]. Further components provide by integrating the different thread pools of SAP HANA. extensions for enterprise applications, and various applica- Our scheduler constitutes a new component in the gen- tion interfaces such as SQLScript and calculation models. eral architecture of a DBMS (see Figure 1), and is orthog- In this architecture realizing parallelism is the key to good onal to other components such as the Persistency Layer or scalability. For scaling out, SAP HANA supports distributed the Transaction Manager. It is important to note that our scheduler does not compromise any transactional correctness or persistency semantics. Connection and Network Receivers Session management Various access interfaces (SQL, SQL Script, etc.) Transaction 4. SCHEDULER ARCHITECTURE Manager To support fast scheduling of all heterogeneous general- Calculation engine purpose tasks, we opt for a dynamic task scheduler that Authorization does not require or process any a priori execution informa- Optimizer and Plan Generator tion about the tasks, except for potentially a directed acyclic Execution Metadata graph (DAG) defining their correlations and ultimately their Executor Dispatcher Manager order of execution. The DAG can take any form, with the engine only restriction of having a single root node. This does not Scheduler (NEW) Row- Column- Graph Text prevent the creation of single-node graphs. Each node in the store store engine engine task graph can contain any piece of code. A node can po- tentially spawn a new task graph, or issue itself again to the Persistence Layer (Logging, Recovery, Page Management) scheduler. The developer is free to co-ordinate synchroniza- tion among tasks, since we take care to maintain a flexible concurrency level (see Section 5). Optionally the developer Figure 1: The general database architecture of SAP can assign a priority for the task graph, which results in a HANA. Our new scheduler integrates the three decreased or increased probability of being chosen for exe- main thread pools of SAP HANA. cution. The developer then dispatches the root node to the 3
4 . Task graph Priorities Task pools Workers general query execution is CPU-bound or memory-bound. Heavy I/O operations, such as savepoints, are only done pe- Root Node Max riodically and in the background to minimize the disruption of the general performance of the database [18]. Thus, I/O- Normal bound operations are traced mainly inside the persistence or network layer. Non-root Low The most significant difficulties we encountered were prop- ... node agating exceptions in the task graph to the creator thread that may wait for the task graphs associated with a query, Figure 2: The data structures used by the scheduler. and inheriting thread-local storage of the creator thread to the workers handling nodes of the task graph. Thread-local storage in SAP HANA is used to store the transactional scheduler, and he can wait for execution of the task graph MVCC details of the query, which are used by the Transac- or continue without waiting. tion Manager (see Figure 1). The scheduler maintains two sets of queues, depicted in Scheduling policies. The design with the two sets of Figure 2. The first set contains one queue per priority and queues allows us to provide different scheduling policies. Our holds the root nodes of the submitted graphs that have not default policy dictates that a free worker thread should re- yet been initiated for execution. The second set contains trieve its next task from the task pools, if they are not empty, queues that hold the non-root nodes to be executed. The with a high probability. In our experiments, this probabil- second set actually constitutes the main distributed task ity is set to maximum (1.0). Thus, new graphs are initi- pools for our scheduler. The task pools can further be sorted ated after older graphs have completed. This policy favours by node depth, in order to favour execution of deep-running throughput over latency, and takes care to finish earlier ini- graphs, or by the timestamp of the owning query, in order tiated graphs that may hold resources, such as memory, be- to favour execution of earliest queries. For our experiments, fore initiating new graphs. The administrator, however, may we sort the task pools by node depth, because resource- want to be fair to the latencies of all incoming queries, and intensive queries tend to create deep-running graphs, and thus we provide him with a setting to decrease the aforemen- we take care to finish these queries early, in order to free up tioned probability, so that free worker threads can execute the resources such as the memory.We use distributed task a root node even if the task pools are not empty. Even pools to reduce synchronization contention. Currently, we though this policy is more fair for all incoming queries, gen- create as many task pools as the number of sockets. More eral performance and throughput is hurt, due to increased task pools can also be created if the number of hardware contention from many concurrent queries. Our default pol- contexts in one socket is high and results in synchronization icy provides semantics similar to a light-weight admission contention for the task pool. Each worker thread is assigned control manager for CPU resources. We use the default pol- to a specific task pool in a round-robin fashion according to icy for our experiments, as it provides best results. its ordinal identifier. If the worker thread finds its assigned task pool empty, it starts querying other task pools, in a Watchdog thread. 
To control workers, but also to mon- round-robin fashion, and steals tasks [21]. itor the state of execution, we reserve an additional watch- When the task pools are empty, a free worker retrieves dog thread. The watchdog typically sleeps, but wakes up a root node from the queues of priorities, with a probabil- periodically to gather information and potentially control ity that favours prioritized root nodes. This probability is worker threads, similar to the notion of centralized schedul- configurable, in order to prevent starvation of root nodes ing [20]. We use light-weight mechanisms for monitoring, with low priorities. We note that in our experiments of Sec- based on statistical counters, such as the number of waiting tion 7, all tasks have the same priority. After executing the and executing tasks, how many tasks each worker thread root node, the worker thread continues executing the first has executed etc. These counters are updated using atomic descendant for better data locality, while the rest of the instructions by each worker thread and the watchdog. descendants are dispatched randomly to the task pools for load-balancing. When the task pools are not empty, a free worker retrieves his next task from the task pools. When a 5. DYNAMIC ADJUSTMENT OF CONCUR- non-root node is executed, the worker checks which descen- RENCY LEVEL dants are ready for execution, takes upon the first of them Typical task schedulers employ a number of worker threads and dispatches the rest to the task pools. equal to the number of hardware contexts. This level of Integration. Our simple design allows a fast integration of concurrency is suitable for task schedulers whose aim is to all three main thread pools of SAP HANA into our scheduler handle CPU-intensive tasks that do not block frequently [2]. (see Section 3). Without having to worry about synchroniza- Our aim, however, is to integrate already-developed general- tion (see Section 5), we quickly bundle old tasks and generic purpose code into tasks. We need to handle tasks that can blocks of code into tasks for our new scheduler. As is stan- include heavy usage of synchronization primitives and locks. dard for task schedulers [2], we do not bundle I/O-bound A representative example are highly concurrent and con- operations into tasks. These operations are executed by sep- tending transactional workloads, such as TPC-C [4]. When arate threads that are handled by the underlying OS sched- tasks are inactive, our scheduler takes care to overlap inac- uler. Since these threads do not reserve any worker threads tivity periods with additional worker threads and saturate from our scheduler, we can keep the system busy with CPU- CPU resources. Next, we describe the different states of in- intensive tasks while there are I/O operations. It is easy activity that a task can be in, and how our scheduler handles to detect I/O-bound operations in main-memory DBMS, as them, by adapting its concurrency level at run-time. 4
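To make the dispatching rules of the scheduler architecture in Section 4 concrete, the following C++ sketch shows a minimal version of a worker's task-retrieval logic over per-socket task pools. It is an illustration under assumptions, not SAP HANA's actual code: the names (Scheduler, Pool, nextTask) are invented for this example, the probabilistic bias towards prioritized root nodes and the depth-based ordering of the task pools are simplified to strict ordering, and the work-first handling of descendants is omitted.

// Minimal sketch of the described policy: a worker prefers its own per-socket task
// pool, then steals from the other pools in round-robin order, and only initiates a
// new task graph (root node) when all task pools are empty. Names are illustrative.
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <optional>
#include <vector>

using Task = std::function<void()>;

class Scheduler {
public:
    Scheduler(std::size_t numPools, std::size_t numPriorities)
        : pools_(numPools), roots_(numPriorities) {}

    // Root node of a new task graph; lower index = higher priority.
    void submitRoot(Task t, std::size_t priority) {
        std::lock_guard<std::mutex> g(rootMutex_);
        roots_[priority].push_back(std::move(t));
    }
    // Non-root node, dispatched to one of the distributed task pools.
    void submit(Task t, std::size_t pool) {
        std::lock_guard<std::mutex> g(pools_[pool].m);
        pools_[pool].q.push_back(std::move(t));
    }

    // Called by the worker with ordinal identifier `self`.
    std::optional<Task> nextTask(std::size_t self) {
        if (auto t = pop(self % pools_.size())) return t;      // own pool first
        for (std::size_t i = 1; i < pools_.size(); ++i)        // then steal, round-robin
            if (auto t = pop((self + i) % pools_.size())) return t;
        std::lock_guard<std::mutex> g(rootMutex_);             // pools empty: initiate a
        for (auto& q : roots_) {                               // new graph, favouring
            if (q.empty()) continue;                           // higher priorities
            Task t = std::move(q.front());
            q.pop_front();
            return t;
        }
        return std::nullopt;                                   // nothing to do right now
    }

private:
    struct Pool { std::mutex m; std::deque<Task> q; };

    std::optional<Task> pop(std::size_t i) {
        std::lock_guard<std::mutex> g(pools_[i].m);
        if (pools_[i].q.empty()) return std::nullopt;
        Task t = std::move(pools_[i].q.front());
        pools_[i].q.pop_front();
        return t;
    }

    std::vector<Pool> pools_;             // one task pool per socket
    std::vector<std::deque<Task>> roots_; // one queue of root nodes per priority
    std::mutex rootMutex_;
};

After executing a node, a worker would, as described above, run the first ready descendant itself (work-first) and dispatch the remaining descendants to the task pools (breadth-first); that step is left out of the sketch for brevity.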
Figure 3: The scheduler's types of worker threads (blocked in a syscall, inactive by user, waiting for a task graph, parked, active workers, the watchdog, and other threads).

Blocked workers. The OS scheduler is the first to know when a thread blocks after a system call for a synchronization primitive. It then cedes the CPU to another thread waiting for its time slice. If we set a fixed number of worker threads equal to the number of hardware contexts, blocked threads will not be overlapped by other threads, as the OS scheduler does not have knowledge of any other working threads in our application. This results in under-utilization of CPU resources, as the OS scheduler could potentially schedule another worker thread while a worker thread blocks. Additionally, since we do not know how the developer synchronizes tasks, a fixed concurrency level can lead to potential deadlocks, if the interdependency edges between nodes are not correctly used. For example, if a node in a task graph requires a conditional variable from another node at the same level of the task graph, the latter node may not be scheduled in time if the nodes in the level are more than the available hardware contexts. Deadlocks can also happen if code synchronizes heavily between different task graphs.

To avoid deadlocks and under-utilization of CPU resources, we argue that a scheduler handling general-purpose tasks should not use a fixed concurrency level. Our watchdog periodically checks for blocked worker threads, and activates additional worker threads, that get scheduled by the OS immediately and overlap the inactivity period. Thus, we co-operate with the OS by voluntarily adjusting the concurrency level, and giving control to the OS of additional worker threads when needed to saturate CPU resources. We exploit both the advantages of task scheduling and the OS scheduler: task scheduling ensures that the number of working threads is small enough so that costly context switches are avoided. By dynamically adjusting the concurrency level, we exploit the capability of the OS scheduler to quickly cede the CPU of a blocked thread to a new worker thread.

To detect blocks efficiently, we do not use OS synchronization primitives directly. DBMS typically encapsulate these in user-level platform-independent data structures. For example, SAP HANA on Linux uses a user-level semaphore based on atomic instructions, that calls the futex facilities of Linux when needed. We leverage these user-level synchronization primitives to know when a worker thread is about to call a potential system call that could block.

Active concurrency level. We define:

concurrency level = total number of worker threads

The concurrency level is variable. There can be a number of inactive workers, such as blocked threads, and a number of active workers. We are mainly interested, however, in keeping the total number of active workers as close as possible to the number of hardware contexts, in order to saturate CPU resources. For this reason, we define:

active concurrency level = concurrency level − inactive workers

When threads resume from inactivity, they are considered again in the active concurrency level, which can at times be higher than the number of hardware contexts.

Parked threads. In order to fix a high active concurrency level, the scheduler gets the chance to pre-empt a worker when it finishes a task. We cannot pre-empt a worker in the middle of a generic task, as it can be in a critical section and the consequences are unpredictable. Instead of ending the thread, we keep it suspended, in a parked state. The watchdog is responsible for monitoring if the active concurrency level gets low and waking up parked threads. Parked threads overcome the costs of creating logical threads, which include the expensive allocation of their stacks.

Other inactive threads. Apart from blocked and parked threads, we define two additional states of inactivity. Firstly, there can be tasks that wait for another task graph. This inactivity state is comparable to OpenMP's suspend/resume points (e.g. taskwait) [10], or to TBB's wait methods (e.g. wait_for_all) [2]. Secondly, we give the developer the opportunity to explicitly define a region of code as inactive. This is useful for code regions that are not CPU-intensive, such as the I/O-bound commit part of a transaction that needs to write in the redo log on disk. We note that in our experiments, we do not specify these I/O-bound parts of a transactional task as inactive. For both these cases of inactive threads, a new worker thread is activated immediately if allowed by the active concurrency level, instead of being activated by the watchdog. We note that while a worker thread is blocked, parked, or waiting for another task graph, it is also considered inactive by the OS scheduler. A code region, however, that is defined by the developer as inactive, pertains only to our scheduler's accounting for its active concurrency level, while the OS considers the relevant worker thread as runnable and schedules it.

All the aforementioned types of workers are shown in Figure 3. The total number of inactive workers is defined as:

inactive workers = blocked workers + inactive by user + workers waiting for a task graph + parked workers

Avoiding too many active threads. We note that activating additional worker threads in place of inactive workers may not always be optimal. If the inactivity period of a worker is short, and the newly activated worker begins executing a large task, then when the first worker returns from inactivity, there will be two worker threads active. If this situation is repeated many times, the active concurrency level can get much higher than available hardware contexts, leading to context switching costs from the OS scheduler. For this reason, it is important to handle inactivity states carefully. The inactivity states where the developer specifies a code region as inactive, and where a task waits upon another task, are typically not too short. These inactivity states lower the active concurrency level, and immediately activate additional worker threads that increase the active concurrency level up to the number of available hardware contexts. The duration of the inactivity state of blocked threads, however, is generally unknown, and can be too short. Thus, blocked tasks are handled differently: they only lower the active concurrency level, and do not immediately spawn additional workers. The active concurrency level is increased only when the watchdog checks it periodically and attempts to fix it by activating additional worker threads, or in case another inactive worker thread resumes activity in the meanwhile and thus increases the active concurrency level and is allowed to continue working on next tasks. Thus, too short block periods are typically hidden between the intervals of the watchdog and of active tasks. Even in the bad case that the active concurrency level gets too high, this is quickly fixed when active workers finish their current tasks and are pre-empted and parked in order to fix the active concurrency level.
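As a rough illustration of the accounting above, the sketch below tracks the concurrency level and the active concurrency level with two atomic counters. It is a simplified model under assumptions, not the product implementation: the names (ConcurrencyState, onBlock, watchdogTick) are invented, and unparking versus spawning a worker is collapsed into a single hook.

// Sketch of the concurrency-level bookkeeping described in this section (illustrative
// names; not SAP HANA code). Blocked workers only lower the active concurrency level;
// the periodic watchdog, or a developer-declared inactive region, activates replacements.
#include <atomic>

class ConcurrencyState {
public:
    explicit ConcurrencyState(int hwContexts) : hwContexts_(hwContexts) {}

    // Called by the wrapped user-level synchronization primitives right before a
    // potentially blocking system call, and again after the thread resumes.
    void onBlock()  { inactive_.fetch_add(1, std::memory_order_relaxed); }
    void onResume() { inactive_.fetch_sub(1, std::memory_order_relaxed); }

    // Waiting for another task graph or a developer-declared inactive region also
    // counts as inactive, but activates a replacement worker immediately.
    void onDeclaredInactive() {
        inactive_.fetch_add(1, std::memory_order_relaxed);
        if (activeConcurrency() < hwContexts_) activateWorker();
    }
    void onDeclaredActiveAgain() { inactive_.fetch_sub(1, std::memory_order_relaxed); }

    int concurrencyLevel() const { return workers_.load(std::memory_order_relaxed); }
    int activeConcurrency() const {               // active = concurrency level - inactive
        return concurrencyLevel() - inactive_.load(std::memory_order_relaxed);
    }

    // Periodic watchdog pass (20 ms in the paper's experiments): overlap blocked
    // workers with additional ones until the active concurrency level is saturated.
    void watchdogTick() {
        while (activeConcurrency() < hwContexts_) activateWorker();
    }
    // A worker checks this when it finishes a task; if true, it parks itself
    // (and then counts itself as inactive) instead of fetching the next task.
    bool shouldParkAfterTask() const { return activeConcurrency() > hwContexts_; }

private:
    void activateWorker() {
        // In the real scheduler this would unpark a parked worker (lowering inactive_)
        // or create a new one; the sketch only models the creation path.
        workers_.fetch_add(1, std::memory_order_relaxed);
    }
    const int hwContexts_;
    std::atomic<int> workers_{0};   // concurrency level: total number of worker threads
    std::atomic<int> inactive_{0};  // blocked + inactive by user + waiting + parked
};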
To support our intuition, our experiments with SAP HANA have an active concurrency level that is most of the time equal to the number of hardware contexts (see Section 7). We note that the watchdog interval we use in our experiments is 20ms. We have experimented with larger intervals as well, but have not noticed significant differences in the active concurrency level. Nevertheless, in order to avoid worst-case scenarios, the watchdog can check periodically if the active concurrency level gets much higher than the number of hardware contexts.

6. DYNAMIC ADJUSTMENT OF TASK GRANULARITY

Partitionable operations can be parallelized using a variable number of tasks. Many analytical operations in a DBMS fall into this category, e.g. aggregations and hashing. If a column needs to be aggregated, it can be split into several parts which can be processed in parallel independently. This is a classic example of data parallelism using a fork-join programming structure [9].

For this kind of partitionable operations, a number of tasks lower than the number of available hardware contexts, i.e. a coarse granularity of tasks, can under-utilize CPU resources. A higher number of tasks (up to the number of available hardware contexts) means that the partitionable operation can potentially use more CPU resources and decrease its latency. Using a fine granularity, however, can potentially introduce additional costs for communication, synchronization and scheduling [1, 14, 26]. Thus, a balance is required for the task granularity.

Task schedulers like Intel TBB [2] can greatly help in case of partitionable operations. As the developer expresses partitionable operations through higher-level algorithmic structures and data parallelism, the framework employs a centralized mechanism for adjusting task granularity. In commercial DBMS, however, the majority of partitionable operations do not use a central mechanism for data parallelism. This is the common case because optimizing the granularity of a single DBMS partitionable operation alone involves considerable research effort (see [12], for example, for partitioning in hash-joins).

There are typically distinct components for partitionable operations that handle data parallelism and granularity independently. In SAP HANA, for example, each partitionable operation employs heuristics to find the right task granularity, based on factors such as data size, communication costs, and the number of hardware contexts of the system. We should note that the degree of parallelism does not affect the choice of a query plan of the optimizer in SAP HANA.

In this paper, we are not concerned with how each component calculates task granularity, but with how task granularity affects performance when numerous concurrent queries, possibly with other partitionable operations, are being processed. The problem we notice is that partitionable operations calculate the number of tasks irrespective of other concurrent tasks. In the worst case, every operation can dispatch a number of tasks equal to the number of hardware contexts. Our experiments show that this practice results in a myriad of tasks in cases of high concurrency, and increased book-keeping and scheduling costs. Thus, task creators for partitionable operations need to adjust their task granularity by taking into consideration other concurrent tasks. To solve this problem, our scheduler provides information about the state of execution to partitionable operations that they can use for the calculation of task granularity.

Concurrency hint. If a partitionable operation issues more tasks than can be handled by free worker threads at the moment, there is little to no benefit for parallelism, and redundant scheduling costs. The intuition is that in cases of high concurrency, when the system is fully loaded and free worker threads are scarce, partitionable operations should opt for a very coarse granularity in order to minimize the number of tasks to be processed.

Our scheduler can provide task creators with information about the current availability of computing resources, as it has knowledge of the active concurrency level of the whole DBMS. Thus, it can give a hint to task creators about the maximum number of tasks they should create at the moment. The watchdog is responsible for calculating the concurrency hint, which is an exponential moving average of the free worker threads in the recent past. The free worker threads and the concurrency hint are defined as:

free worker threads = max{0, number of hardware contexts − active concurrency level}

concurrency hint = a ∗ free worker threads + (1.0 − a) ∗ previous concurrency hint, where 0 ≤ a ≤ 1.0

Due to the dynamic nature of our workers, which can change status often and quickly, an average can give better results than an absolute value. For our experiments, we use an exponential moving average, with equal weight for the free worker threads of the previous observations and the currently observed number of free worker threads (i.e. a = 0.5). The sampling rate is configured at 50ms in our experiments, which provides reasonable smoothing over the recent past, and also quickly captures changes in the number of free worker threads. Due to the fact that the exponential moving average captures all past observations, we take care to reset it to the number of hardware contexts when it surpasses a predefined threshold. This threshold is set to 90% of the number of hardware contexts for our experiments.
Splitting. It can happen that a partitionable operation gets a low hint. If resources are freed up later on, it will under-utilize CPU resources. To correct this behaviour, we follow a strategy similar to the data parallelism concept of splitting ranges in Intel TBB or the lazy task creation [28]. Each task needs to check the concurrency hint periodically. If the hint gets high, the task should decide if it should split into two more nodes. Those two nodes can recursively split again. At the moment, we have implemented splitting for the simple cases of aggregations and calculations. Nevertheless, since we are interested in highly-concurrent workloads, when the system is fully loaded, the absence of splitting for the rest of partitionable operations does not affect performance significantly in our experiments.

7. EXPERIMENTAL EVALUATION

We compare the following variations of SAP HANA:

• Multiple-Pools, which is the original version of SAP HANA, with the different thread pools shown in Section 1. This serves as our baseline.

• Single-Pool-NoSys, which integrates the different thread pools of Multiple-Pools into tasks for our new scheduler. This variation assumes workers blocked on synchronization primitives as working, and includes them in the active concurrency level of the scheduler. Also, this variation defines the concurrency hint as the number of hardware contexts, to simulate the old behaviour.

• Single-Pool, which is like Single-Pool-NoSys, but uses a flexible concurrency level by assuming workers blocked on synchronization primitives as inactive.

• Single-Hints, which is like Single-Pool, but with the concurrency hint following the exponential moving average of free workers in the recent past. Partitionable analytical operations adjust their task granularity according to the concurrency hint. This variation is the best version of our new scheduler for SAP HANA.

Our server has eight ten-core processors Intel Xeon E7-8870 at 2.40 GHz, with hyper-threading enabled, and 1 TB of RAM. Each core has a 32KB L1 cache and a 256KB L2 cache. Each processor has a 30MB L3 cache, shared by all its cores. The OS is a 64-bit SMP Linux (SuSE), with a 2.6.32 kernel. Unless stated otherwise, every data point in our graphs is an average of multiple iterations with a standard deviation less than 10%. Our measurements for context switches and CPU times are gathered from Linux. The total number of instructions retired is gathered from Intel Performance Counter Monitor. For all experiments, we warm up the DBMS first and there are no thinking times. We use transaction level snapshot isolation with repeatable reads. We make sure that all queries and clients are admitted, and we disable query caching because our aim is to evaluate the execution of the queries and not query caching.

7.1 Analytical workload

We use the TPC-H benchmark [5] with a scaling factor 10, stored in a column-store. We measure performance by varying the number of concurrent queries, and measuring the response time of each variation from the moment we issue the queries until the last query returns successfully. Queries are instantiated from the 22 TPC-H query templates in a round-robin fashion, with the same parameters for each query template for stable results, but without query caching. We start measuring from 32 concurrent queries, to include all query templates, up to 1024. The results are shown in Figure 4. In Figure 5, we include performance measurements for the case of 1024 queries. Our scheduler does not affect cache-related behaviour significantly and cache miss rate in all variations stays at similar levels.

Figure 4: Experiment with TPC-H queries (response time in seconds versus number of concurrent queries, for Multiple-Pools, Single-Pool-NoSys, Single-Pool, and Single-Hints).

Figure 5: Measurements for the case of 1024 TPC-H concurrent queries (number of tasks — unknown for Multiple-Pools — instructions retired, context switches, and average CPU% split into idle, system, and user time, per variation).

Single-Pool-NoSys improves performance of Multiple-Pools by only 3% for high concurrency. The main improvement comes from reducing lock contention, as shown by the reduction in system CPU time. The number of context switches has increased, even though we integrate all thread pools of Multiple-Pools into a single task scheduler. We attribute this to the fixed concurrency level. When workers block on synchronization primitives, the OS does not have knowledge of any additional workers to schedule. It replaces the time slice of a blocked worker with any non-CPU-intensive thread, outside the scheduler, with a small time slice. This is also reflected in the increased idle CPU time.

Single-Pool, which has a flexible concurrency level, overcomes this problem and improves performance of Multiple-Pools by 7%. When many worker threads block, the watchdog issues more worker threads and gives the chance to the OS scheduler to schedule CPU-intensive worker threads with new tasks and full time slices.
That is reflected in the decreased idle CPU time, and the fewer context switches.

Single-Hints results in the best performance improvement of Multiple-Pools by 16%. The coarser task granularity leads to a reduction of the total number of tasks by 86%. We achieve a significant reduction in unnecessary book-keeping and scheduling costs, which is reflected in the 16% reduction of the total number of instructions retired. Furthermore, we corroborate previous related work that a coarser granularity results in less costs for synchronization and communication [1, 14, 26], since system CPU time is further decreased.

We notice the effect of not splitting tasks (see Section 6) for the case of 64 and 128 concurrent queries of Figure 4, where the standard deviation for Single-Hints is up to 30%. In a few iterations, a partitionable operation that got a low hint was left in the end alone, under-utilizing CPU resources and prolonging response times. We remind that we have enabled splitting for aggregations and calculations in SAP HANA, and we plan to enable splitting for more partitionable operations. In our experiments with Single-Hints, however, all partitionable operations use the concurrency hint as a limit in order to show the effect of hints for high concurrency when the effect of not splitting is not obvious.

To better understand the effect of the flexible concurrency level and hints throughout the whole experiment, we show the timelines for Single-Pool-NoSys and Single-Hints for the case of 1024 queries in Figure 6. For Single-Pool-NoSys, we notice the effect of bursts of too many tasks being issued to the scheduler. The redundant scheduling, communication, and book-keeping costs of these bursts of numerous tasks result in erratic behaviour of the CPU utilization. In contrast, the timeline for Single-Hints presents a much smoother run-time. The majority of the tasks are issued by all queries in the beginning of the experiment, and they are gradually scheduled until the end of the experiment. The CPU utilization line is more stable. Additionally, the flexible concurrency level results in a few more worker threads that raise the CPU utilization average in comparison to Single-Pool-NoSys. Nevertheless, we still note that CPU utilization is not fully saturated, and we still have idle time.

Figure 6: Timelines for Single-Pool-NoSys (above) and Single-Hints (below), for the case of 1024 concurrent TPC-H queries (each timeline plots non-idle CPU, active workers, blocked workers, the concurrency hint, and the number of waiting tasks over time, against the number of hardware contexts).

Thus, we note that the standard scheduling technique of having an active number of workers equal to the number of hardware contexts does not fully saturate CPU resources, since idle CPU time is significant in all variations. A similar observation about idle time has been done in a recent evaluation of modern column-stores [8]. This is attributed to the fact that DBMS operations are not purely computationally intensive [7], and the OS can exploit a few more threads to overlap stalls (e.g. for memory). We plan to investigate how to dynamically raise the active concurrency level, to minimize idle CPU time, while not deteriorating context switching costs and cache miss rate.

7.2 Mixed workload

We measure the performance of the different variations by running a throughput experiment for 15 minutes for TPC-H and TPC-C [4] concurrently on disjoint datasets. This use case can happen if a hot transactional dataset is copied on the same server for analytics. Our intention is to see how each workload behaves when the two co-exist on the same server. For TPC-C, we measure the total average successful new-order transactions per minute (tpmC). Our TPC-C driver is based on a previous project [29]. Before every experiment, we re-load our initially generated TPC-C database, in order for all experiments to start with the same columnar data and have no data in their delta storages [36].

Figure 7: Throughput experiments with a mixed workload, consisting of TPC-H and TPC-C clients on disjoint datasets. Each experiment involves a TPC-H throughput (bar with solid color) and a TPC-C throughput (bar with pattern).
We turn off merge operations for the column-store [36], in order to avoid unpredictable periods of inactivity due to the TPC-C tables being merged and locked, and achieve a stable behaviour for all variations. Due to the absence of merge operations, we note that the tpmC slowly degrades in every experiment, but as this behaviour is stable for all variations, and because each variation starts the experiment with the same data and conditions, this does not prevent us from comparing the total average tpmC. For TPC-H, we use a scaling factor of 10 and vary the number of concurrent clients, in order to see the effect that long-running analytical tasks have on the short-lived tasks of TPC-C. Clients issue queries from the TPC-H Q1-Q22 templates in a global (not per client) round-robin fashion, as in Section 7.1. We report the average of TPC-H queries completed per minute. The results are shown in Figure 7. We note that CPU resources are completely saturated for the case of 32 and 64 concurrent TPC-H clients, and that standard deviation for some cases of very low TPC-C throughput is up to 50%. In Figure 8, we show measurements for the case of 64 TPC-H clients.

Figure 8: Measurements for 64 TPC-H clients (total number of tasks, instructions retired — unknown for Multiple-Pools — context switches, and average CPU% split into idle, system, and user time, per variation).

Overall, we observe that increasing the number of concurrent TPC-H clients overshadows the TPC-C clients. This is due to the fact that we do not change the number of TPC-C clients in this experiment, and overwhelm the database with an increasing number of resource-intensive TPC-H clients. Also, TPC-C queries are typically executed in a single task, while TPC-H queries usually involve a task graph with multiple tasks. This results in many more tasks for TPC-H than TPC-C. This trend serves as motivation to find ways to give the DBMS administrator the possibility of favouring either transactional or analytical queries. Note that we cannot do this simply with the queues of priorities of our scheduler, since Receivers handle all incoming queries in the same way and bundle them into tasks for the scheduler.

We observe that Single-Pool-NoSys results in the worst performance, due to its fixed concurrency level. Idle CPU time reaches 70%, and redundant context switches occur similar to Section 7.1. This is due to the fact that in TPC-C, many clients contend for modifying common data. A lot of workers are blocked, and the OS scheduler cannot overlap them with other tasks. The flexible concurrency level of Single-Pool corrects this behaviour, processes more tasks, and can achieve similar or better performance than Multiple-Pools. This also shows that our light-weight implementation does not hurt the performance of short-lived transactions. It also reduces the number of context switches in comparison to Single-Pool-NoSys, but their number is still higher than that of Multiple-Pools, as the additional worker threads created due to inactive tasks are more than the threads of Multiple-Pools, due to the flexible concurrency level.

The concurrency hint of Single-Hints results in a significant reduction of the total number of TPC-H tasks, giving room to TPC-C tasks to be queued up faster in the scheduler. For 64 concurrent TPC-H clients, Single-Hints improves TPC-H throughput by 12.5% and TPC-C throughput by 370% in comparison to Multiple-Pools, with approximately the same number of instructions (since the system is fully loaded for all variations for the majority of the duration of the experiment). Note that hints alone do not affect TPC-C queries, as they are executed in a single task.

8. CONCLUSIONS

We show how to efficiently employ task scheduling for general-purpose tasks, without a central mechanism for data parallelism, in order to handle both short-lived transactions and long-running analytical tasks in highly concurrent workloads in a main-memory DBMS. Our results are backed up by an evaluation on a commercial DBMS: SAP HANA. As transactional tasks can heavily use synchronization primitives, we show that the concurrency level of the task scheduler should not be fixed, but be flexible. When worker threads block, more worker threads should be issued, giving control to the OS scheduler of additional tasks to saturate CPU resources. Furthermore, for analytical partitionable operations, we observe that task granularity significantly affects communication, synchronization, and scheduling costs in cases of high concurrency. For this reason, our scheduler gives a hint to the task creators of partitionable operations, reflecting the level of CPU contention. Using this hint, partitionable operations re-adjust their task granularity, to avoid excessive communication, synchronization, and scheduling costs for high concurrency, and avoid under-utilization of CPU resources in cases of low concurrency.

9. REFERENCES

[1] Intel articles - Granularity and Parallel Performance. http://software.intel.com/en-us/articles/granularity-and-parallel-performance.
[2] Intel Thread Building Blocks – Documentation – User Guide – The Task Scheduler – Task-based Programming. http://threadingbuildingblocks.org/documentation.
[3] OpenMP API for Parallel Programming. http://www.openmp.org/.
[4] TPC-C Benchmark: Standard Specification, v. 5.11. http://www.tpc.org/tpcc/.
[5] TPC-H Benchmark: Standard Specification, v. 2.14.3. http://www.tpc.org/tpch/.
[6] U. A. Acar, G. E. Blelloch, and R. D. Blumofe. The data locality of work stealing. In Proc. of the 12th annual ACM Symposium on Parallel Algorithms and Architectures, pages 1–12, 2000.
[7] A. Ailamaki, D. J. DeWitt, M. D. Hill, and D. A. Wood. DBMSs on a Modern Processor: Where Does Time Go? In Proc. of the 25th Int'l Conf. on Very Large Data Bases, pages 266–277, 1999.
[8] I. Alagiannis, M. Athanassoulis, and A. Ailamaki. Scaling up analytical queries with column-stores. In Proc. of the 6th Int'l Workshop on Testing Database Systems, 2013.
[9] S.-l. Au and S. P. Dandamudi. The Impact of Program Structure on the Performance of Scheduling Policies in Multiprocessor Systems. Int'l Journal of Computers and Their Applications, 3(1):17–30, 1996.
[10] E. Ayguadé, N. Copty, A. Duran, J. Hoeflinger, Y. Lin, F. Massaioli, E. Su, P. Unnikrishnan, and G. Zhang. A Proposal for Task Parallelism in OpenMP. In Proc. of the 3rd Int'l Workshop on OpenMP: a Practical Programming Model for the Multi-Core Era, pages 1–12, 2008.
[11] S. Blagodurov, S. Zhuravlev, A. Fedorova, and A. Kamali. A case for NUMA-aware contention management on multicore systems. In Proc. of the 19th Int'l Conf. on Parallel Architectures and Compilation Techniques, pages 557–558, 2010.
[12] S. Blanas, Y. Li, and J. M. Patel. Design and evaluation of main memory hash join algorithms for multi-core CPUs. In Proc. of the 2011 ACM SIGMOD Int'l Conf. on Management of Data, pages 37–48, 2011.
[13] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: an efficient multithreaded runtime system. In Proc. of the 5th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 207–216, 1995.
[14] D.-K. Chen, H.-M. Su, and P.-C. Yew. The impact of synchronization and granularity on parallel systems. In Proc. of the 17th annual Int'l Symposium on Computer Architecture, pages 239–248, 1990.
[15] J. Cieslewicz and K. A. Ross. Data partitioning on chip multiprocessors. In Proc. of the 4th Int'l Workshop on Data Management on New Hardware, pages 25–34, 2008.
[16] A. Duran, J. Corbalán, and E. Ayguadé. Evaluation of OpenMP task scheduling strategies. In Proc. of the 4th Int'l Conf. on OpenMP in a New Era of Parallelism, pages 100–110, 2008.
[17] F. Färber, S. K. Cha, J. Primsch, C. Bornhövd, S. Sigg, and W. Lehner. SAP HANA database: data management for modern business applications. SIGMOD Rec., 40(4):45–51, Jan. 2012.
[18] F. Färber, N. May, W. Lehner, P. Große, I. Müller, H. Rauhe, and J. Dees. The SAP HANA Database – An Architecture Overview. IEEE Data Engineering Bulletin, 35(1):28–33, 2012.
[19] R. L. Graham, E. L. Lawler, J. K. Lenstra, and A. H. G. Rinnooy Kan. Optimization and approximation in deterministic sequencing and scheduling: a survey. Annals of Discrete Mathematics, 4:287–326, 1979.
[20] B. Hamidzadeh and D. Lilja. Dynamic scheduling strategies for shared-memory multiprocessors. In Proc. of the 16th Int'l Conf. on Distributed Computing Systems, pages 208–215, 1996.
[21] R. Hoffmann, M. Korch, and T. Rauber. Performance Evaluation of Task Pools Based on Hardware Synchronization. In Proc. of the 2004 ACM/IEEE Conf. on Supercomputing, 2004.
[22] F. R. Johnson, R. Stoica, A. Ailamaki, and T. C. Mowry. Decoupling contention management from scheduling. In Proc. of the 15th edition of Architectural Support for Programming Languages and Operating Systems, pages 117–128, 2010.
[23] Y.-K. Kwok and I. Ahmad. Static scheduling algorithms for allocating directed task graphs to multiprocessors. ACM Computing Surveys, 31(4):406–471, 1999.
[24] J. Lee, Y. S. Kwon, F. Färber, M. Muehle, C. Lee, C. Bensberg, J. Y. Lee, A. Lee, and W. Lehner. SAP HANA distributed in-memory database system: Transaction, session, and metadata management. In IEEE 29th Int'l Conf. on Data Engineering, pages 1165–1173, 2013.
[25] C. Li, C. Ding, and K. Shen. Quantifying the cost of context switch. In Proc. of the 2007 Workshop on Experimental Computer Science, 2007.
[26] H.-W. Loidl and K. Hammond. On the granularity of divide-and-conquer parallelism. In Proc. of the 1995 Int'l Conf. on Functional Programming, pages 135–144, 1995.
[27] S. Mattheis, T. Schuele, A. Raabe, T. Henties, and U. Gleim. Work stealing strategies for parallel stream processing in soft real-time systems. In Proc. of the 25th Int'l Conf. on Architecture of Computing Systems, pages 172–183, 2012.
[28] E. Mohr, D. A. Kranz, and R. H. Halstead, Jr. Lazy task creation: a technique for increasing the granularity of parallel programs. In Proc. of the 1990 ACM Conf. on LISP and Functional Programming, pages 185–197, 1990.
[29] A. Morf. Snapshot Isolation in Distributed Column-Stores. Master's thesis, ETH Zurich, 2011.
[30] R. Motwani, S. Phillips, and E. Torng. Non-clairvoyant scheduling. In Proc. of the 4th annual ACM-SIAM Symposium on Discrete Algorithms, pages 422–431, 1993.
[31] G. J. Narlikar. Scheduling Threads for Low Space Requirement and Good Locality. In Proc. of the 11th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 83–95, 1999.
[32] D. Neill and A. Wierman. On the Benefits of Work Stealing in Shared-Memory Multiprocessors, 2011. Project Report at Carnegie Mellon University. http://www.cs.cmu.edu/~acw/15740/paper.pdf.
[33] T. D. Nguyen, R. Vaswani, and J. Zahorjan. Using Runtime Measured Workload Characteristics in Parallel Processor Scheduling. In Proc. of the Workshop on Job Scheduling Strategies for Parallel Processing, pages 155–174, 1996.
[34] S. L. Olivier, A. K. Porterfield, K. B. Wheeler, M. Spiegel, and J. F. Prins. OpenMP task scheduling strategies for multicore NUMA systems. Int'l Journal of High Performance Computing Applications, 26(2):110–124, 2012.
[35] M. A. Palis, J.-C. Liou, and D. S. L. Wei. Task clustering and scheduling for distributed memory parallel architectures. IEEE Transactions on Parallel and Distributed Systems, 7(1):46–55, 1996.
[36] V. Sikka, F. Färber, W. Lehner, S. K. Cha, T. Peh, and C. Bornhövd. Efficient transaction processing in SAP HANA database: the end of a column store myth. In Proc. of the 2012 ACM SIGMOD Int'l Conf. on Management of Data, pages 731–742, 2012.
[37] M. Stonebraker and U. Cetintemel. "One Size Fits All": An Idea Whose Time Has Come and Gone. In Proc. of the 21st Int'l Conf. on Data Engineering, pages 2–11, 2005.
[38] S. Zhuravlev, J. C. Saez, S. Blagodurov, A. Fedorova, and M. Prieto. Survey of scheduling techniques for addressing shared resources in multicore processors. ACM Computing Surveys, 45(1):4:1–4:28, 2012.
[39] M. Stonebraker, S. Madden, D. J. Abadi, S. Harizopoulos, N. Hachem, and P. Helland. The end of an architectural era: (it's time for a complete rewrite). In Proc. VLDB, pages 1150–1160, 2007.
Ross. Data partitioning on Stealing in Shared-Memory Multiprocessors, 2011. chip multiprocessors. In Proc. of the 4th Int’l Project Report at Carnegie Mellon University. Workshop on Data Management on New Hardware, http://www.cs.cmu.edu/~acw/15740/paper.pdf. pages 25–34, 2008. [33] T. D. Nguyen, R. Vaswani, and J. Zahorjan. Using [16] A. Duran, J. Corbal´ an, and E. Ayguad´e. Evaluation of Runtime Measured Workload Characteristics in OpenMP task scheduling strategies. In Proc. of the Parallel Processor Scheduling. In Proc. of the 4th Int’l Conf. on OpenMP in a New Era of Workshop on Job Scheduling Strategies for Parallel Parallelism, pages 100–110, 2008. Processing, pages 155–174, 1996. [17] F. F¨ arber, S. K. Cha, J. Primsch, C. Bornh¨ovd, [34] S. L. Olivier, A. K. Porterfield, K. B. Wheeler, S. Sigg, and W. Lehner. SAP HANA database: data M. Spiegel, and J. F. Prins. OpenMP task scheduling management for modern business applications. strategies for multicore NUMA systems. Int’l Journal SIGMOD Rec., 40(4):45–51, Jan. 2012. of High Performance Computing Applications, [18] F. F¨ arber, N. May, W. Lehner, P. Große, I. M¨ uller, 26(2):110–124, 2012. H. Rauhe, and J. Dees. The SAP HANA Database – [35] M. A. Palis, J.-C. Liou, and D. S. L. Wei. Task An Architecture Overview. IEEE Data Engineering clustering and scheduling for distributed memory Bulletin, 35(1):28–33, 2012. parallel architectures. IEEE Transactions on Parallel [19] R. L. Graham, E. L. Lawler, J. K. Lenstra, and and Distributed Systems, 7(1):46–55, 1996. A. H. G. Rinnooy Kan. Optimization and [36] V. Sikka, F. F¨arber, W. Lehner, S. K. Cha, T. Peh, approximation in deterministic sequencing and and C. Bornh¨ ovd. Efficient transaction processing in scheduling: a survey. Annals of Discrete Mathematics, sap hana database: the end of a column store myth. 4:287–326, 1979. In Proc. of the 2012 ACM SIGMOD Int’l Conf. on [20] B. Hamidzadeh and D. Lilja. Dynamic scheduling Management of Data, pages 731–742, 2012. strategies for shared-memory multiprocessors. In Proc. [37] M. Stonebraker and U. Cetintemel. ”One Size Fits of the 16th Int’l Conf. on Distributed Computing All”: An Idea Whose Time Has Come and Gone. In Systems, pages 208–215, 1996. Proc. of the 21st Int’l Conf. on Data Engineering, [21] R. Hoffmann, M. Korch, and T. Rauber. Performance pages 2–11, 2005. Evaluation of Task Pools Based on Hardware [38] S. Zhuravlev, J. C. Saez, S. Blagodurov, A. Fedorova, Synchronization. In Proc. of the 2004 ACM/IEEE and M. Prieto. Survey of scheduling techniques for Conf. on Supercomputing, 2004. addressing shared resources in multicore processors. [22] F. R. Johnson, R. Stoica, A. Ailamaki, and T. C. ACM Computing Surveys, 45(1):4:1–4:28, 2012. Mowry. Decoupling contention management from [39] M. Stonebraker, S. Madden, D. J. Abadi, S. scheduling. In Proc. of the 15th edition of Harizopoulos, N. Hachem, and P. Helland, The end of Architectural Support for Programming Languages and an architectural era: (it’s time for a complete rewrite). Operating Systems, pages 117–128, 2010. Proc. VLDB, 1150–1160, 2007. 10