05 Locking In Systems/Transactions

Locking In Systems/Transactions Experience with Processes and Monitors in Mesa Principles of Transaction-Oriented Database Recovery

1. Today’s Papers EECS 262a • Experience with Processes and Monitors in Mesa Advanced Topics in Computer Systems Butler Lampson and David Redell, Appears in Lecture 5 Communications of the ACM 23, 2 (Feb. 1980), pp 105-117. • Principles of Transaction-Oriented Database Mesa/Transactions Recovery, Theo Haerder and Andreas Reuter. September 6th, 2018 Appears in Journal of the ACM Computing Surveys (CSUR), Vol 15, No 4 (Dec. 1983), pp287-317 John Kubiatowicz Electrical Engineering and Computer Sciences • Thoughts? University of California, Berkeley http://www.eecs.berkeley.edu/~kubitron/cs262 9/6/2018 Cs262a-F18 Lecture-05 2 Mesa Motivation Mesa History • Putting theory to practice – building Pilot OS • 2nd system Xerox Star – followed the Alto • Focus of this paper: lightweight processes • Planned to build a large system using (threads in today’s terminology) and how they many programmers synchronize with each other – Some thoughts about commercializing • Advent of things like server machines and networking introduced applications that are heavy users of concurrency 9/6/2018 Cs262a-F18 Lecture-05 3 9/6/2018 Cs262a-F18 Lecture-05 4

2. Mesa History (cont’d) Programming Models for IPC • Chose to build a single address space system: • Two Inter-Process Communication models: – Single user system, so protection not an issue – Shared memory (monitors) vs. – Safety was to come from the language – Message passing – Wanted global resource sharing • Needham & Lauer claimed the two models • Large system, many programmers, many applications: are duals of each other – Module-based programming with information hiding • Mesa developers chose shared memory • Clean sheet design: model because they thought they could more – Can integrate the hardware, the runtime software, and the language with each other naturally fit it into Mesa as a language construct • Java language considers Mesa to be a predecessor 9/6/2018 Cs262a-F18 Lecture-05 5 9/6/2018 Cs262a-F18 Lecture-05 6 How to Synchronize Processes? Lightweight Processes (LWPs) • Non-preemptive scheduler: results in very delicate • Easy forking and synchronization systems – Why? – Have to know whether or not a yield might be called for every procedure you call – this violates information hiding • Shared address space – Prohibits multiprocessor systems – Need a separate preemptive mechanism for I/O anyway • Fast performance for creation, switching, and – Can’t do multiprogramming across page faults synchronization; low storage overhead • Simple locking (e.g., semaphores): – Too little structuring discipline, e.g., no guarantee that locks will be • Today we call LWPs, “threads” released on every code path – Wanted something that could be integrated into a Mesa language construct • Chose preemptive scheduling of lightweight processes and monitors 9/6/2018 Cs262a-F18 Lecture-05 7 9/6/2018 Cs262a-F18 Lecture-05 8

3. Recap: Synchronization Goals Recap: Synchronization Primitives • Mutual exclusion: • Locks: Implement mutual exclusion – Arbitrate access to critical section (e.g., shared data) – Lock.Acquire(): acquire lock before entering critical section; wait if lock – Only a single LWP in critical section at a given time not free » If one LWP in critical section  all other LWPs that want to – Lock.Release(): release lock after leaving critical section; wake up threads waiting for lock enter the critical section need to wait LWP 1 • Semaphores: Like integers with restricted interface • Scheduling constraint: – P(): Wait if zero; decrement when becomes non-zero – A LWP waiting for an event to happen in another thread LWP 2 – V(): Increment and wake a sleeping task (if exists) signal – Use a semaphore for each scheduling constraint and mutex • Wait instruction: • Monitors: A lock plus one or more condition variables – Don’t want busy-waiting, so sleep() – Condition variable: a queue of LWPs waiting inside critical section for an wait – Waiting LWPs are woken up when the condition they are waiting on event to happen becomes FALSE – Use condition variables to implement sched. constraints – Three Operations: Wait(), Signal(), and Broadcast() 9/6/2018 Cs262a-F18 Lecture-05 9 9/6/2018 Cs262a-F18 Lecture-05 10 Recap: Monitor with Condition Variables Recap: Monitors • Monitors represent the logic of the program – Wait if necessary – Signal when change something so any waiting LWPs can proceed • Basic structure of monitor-based program: lock.Acquire() Check and/or update while (need to wait) { condvar.wait(&lock); state variables } Wait if necessary lock.Release() (release lock when waiting) • Lock: the lock provides mutual exclusion to shared data do something so no need to wait – Always acquire before accessing shared data structure – Always release after finishing with shared data lock.Acquire() – Lock initially free Check and/or update • Condition Variable: a queue of LWPs waiting for something condvar.signal(); state variables inside a critical section lock.Release() – Key idea: make it possible to go to sleep inside critical section by atomically releasing lock at time we go to sleep 9/6/2018 Cs262a-F18 Lecture-05 11 9/6/2018 Cs262a-F18 Lecture-05 12

4. Mesa Monitors Design Choices and Implementation Issues • Monitor lock (for synchronization) • 3 types of procedures in a monitor module: • Tied to module structure of the language – makes it – Entry (acquires and releases lock) clear what’s being monitored – Internal (no locking done): can’t be called from outside the module. – External (no locking done): externally callable. Why is this useful? • Language automatically acquires and releases the » Allows grouping of related things into a module lock » Allows doing some of the work outside the monitor lock • Tied to a particular invariant, which helps users think » Allows controlled release and reacquisition of monitor lock about the program • Choices for notify semantics: – Invariant holds on entry and must be maintained before exit or wait – (Hoare monitors) Immediately cede CPU and lock to waking process • Condition variable (for scheduling) – when to wait » Causes many context switches but why would this approach be desirable? • Dangling references problem similar to pointers (Waiting process knows the condition it was waiting on is – There are also language-based solutions that would prohibit these guaranteed to hold) kinds of errors, such as do-across, which is just a parallel control » Also, doesn’t work in the presence of priorities structure – (Mesa monitors) Notifier keeps lock, wakes process with no guarantees – Do-across eliminates dangling processes because the syntax => waking process must recheck its condition defines the point of the fork and the join 9/6/2018 Cs262a-F18 Lecture-05 13 9/6/2018 Cs262a-F18 Lecture-05 14 Mesa Monitor: Why “while()”? Readers/Writers Problem • Why do we use “while()” instead of “if() with Mesa monitors? W – Example illustrating what happens if we use “if()”, e.g., if (queue.isEmpty()) { dataready.wait(&lock); // If nothing, sleep R } R R • Synchronized (infinite) queue example AddToQueue(item) { RemoveFromQueue() { lock.Acquire(); lock.Acquire(); queue.enqueue(item); if (queue.isEmpty()) { • Motivation: Consider a shared database dataready.signal(); dataready.wait(&lock); } – Two classes of users: lock.Release(); Hoare item = queue.dequeue(); » Readers – never modify database } lock.Release(); » Writers – read and modify database Mesa: Replace “while” return(item); with “if” – Is using a single lock on the whole database sufficient? } » Like to have many readers at the same time » Only one writer at a time 9/6/2018 Cs262a-F18 Lecture-05 15 9/6/2018 Cs262a-F18 Lecture-05 16

5. Basic Readers/Writers Solution Code for a Reader Reader() { • Correctness Constraints: // First check self into system – Readers can access database when no writers lock.Acquire(); – Writers can access database when no readers or writers – Only one thread manipulates state variables at a time while ((AW + WW) > 0) { // Is it safe to read? • Basic structure of a solution: WR++; // No. Writers exist – Reader() okToRead.wait(&lock); // Sleep on cond var Wait until no writers WR--; // No longer waiting Access data base } Check out – wake up a waiting writer AR++; // Now we are active! – Writer() Wait until no active readers or writers lock.release(); Access database // Perform actual read-only access Check out – wake up waiting readers or writer AccessDatabase(ReadOnly); – State variables (Protected by a lock called “lock”): » int AR: Number of active readers; initially = 0 // Now, check out of system » int WR: Number of waiting readers; initially = 0 lock.Acquire(); » int AW: Number of active writers; initially = 0 AR--; // No longer active » int WW: Number of waiting writers; initially = 0 if (AR == 0 && WW > 0) // No other active readers » Condition okToRead = NIL okToWrite.signal(); // Wake up one writer » Conditioin okToWrite = NIL lock.Release(); } 9/6/2018 Cs262a-F18 Lecture-05 17 9/6/2018 Cs262a-F18 Lecture-05 18 Code for a Writer Other Kinds of Notifications Writer() { // First check self into system • Timeouts lock.Acquire(); while ((AW + AR) > 0) { // Is it safe to write? WW++; // No. Active users exist • Broadcasts okToWrite.wait(&lock); // Sleep on cond var WW--; // No longer waiting } AW++; // Now we are active! • Aborts lock.release(); // Perform actual read/write access AccessDatabase(ReadWrite); • Deadlocks: // Now, check out of system – Wait only releases the lock of the current monitor, not any nested lock.Acquire(); calling monitors AW--; // No longer active – General problem with modular systems and synchronization: if (WW > 0){ // Give priority to writers » Synchronization requires global knowledge about locks, which okToWrite.signal(); // Wake up one writer violates information hiding paradigm of modular programming } else if (WR > 0) { // Otherwise, wake reader okToRead.broadcast(); // Wake all readers } lock.Release(); } 9/6/2018 Cs262a-F18 Lecture-05 19 9/6/2018 Cs262a-F18 Lecture-05 20

6. Four Requirements for Deadlock Deadlock • Mutual exclusion • Why is monitor deadlock less onerous than the yield – Only one thread at a time can use a resource. problem for non-preemptive schedulers? • Hold and wait – Want to generally insert as many yields as possible to provide – Thread holding at least one resource is waiting to acquire additional increased concurrency; only use locks when you want to resources held by other threads synchronize – Yield bugs are difficult to find (symptoms may appear far after the • No preemption bogus yield) – Resources are released only voluntarily by the thread holding the resource, after thread is finished with it • Circular wait • Basic deadlock rule: no recursion, direct or mutual – There exists a set {T1, …, Tn} of waiting threads – Alternatives? Impose ordering on acquisition » T1 is waiting for a resource that is held by T2 – “It is unreasonable to blame the tool when poorly chosen » T2 is waiting for a resource that is held by T3 constraints lead to deadlock” » … » Tn is waiting for a resource that is held by T1 • Lock granularity for concurrent access to objects – Introduced monitored records so that the same monitor code could handle multiple instances of something in parallel 9/6/2018 Cs262a-F18 Lecture-05 21 9/6/2018 Cs262a-F18 Lecture-05 22 Interrupts Priority Inversion • Interrupt handler can’t afford to wait to acquire a • High-priority processes may block on lower-priority monitor lock processes • Introduced naked notifies: notifies done without • A solution: holding the monitor lock – Temporarily increase the priority of the holder of the monitor to that of the highest priority blocked process – Somewhat tricky – what happens when that high-priority process • Had to worry about a timing race: finishes with the monitor? – The notify could occur between a monitor’s condition check and its » You have to know the priority of the next highest one – keep call on Wait them sorted or scan the list on exit – Added a wakeup-waiting flag to condition variables • What happens with active messages that need to acquire a lock? (move handler to its own thread) 9/6/2018 Cs262a-F18 Lecture-05 23 9/6/2018 Cs262a-F18 Lecture-05 24

7. The Mars Pathfinder Mission • Widely proclaimed as “flawless” in the early days after its July 4th, 1997 landing on the Martian surface • Successes included: – Its unconventional “landing” – bouncing onto the Martian surface surrounded by airbags BREAK – Deploying the Sojourner rover – Gathering and transmitting voluminous data back to Earth, including the panoramic pictures that were such a hit on the Web – 20Mhz PowerPC processor and 128MB of DRAM • A few days later, just after Pathfinder started gathering meteorological data… – The spacecraft began experiencing total system resets, each resulting in losses of data – The press reported these failures in terms such as “software glitches” and “the computer was trying to do too many things at once.” 9/6/2018 Cs262a-F18 Lecture-05 26 Three Mars Pathfinder Tasks Priority Inversion • Bus management task • Very infrequently an interrupt would occur that caused the – Ran frequently with high priority to move certain kinds of data in and out of (medium priority) communications task to be scheduled the VME information bus during the short interval while the (high priority) information – Access to bus synchronized with mutual exclusion locks (mutexes) bus thread was blocked waiting for the (low priority) meteorological data thread • Meteorological data gathering task (ASI/MET) – In this case, the long-running communications task, having higher priority – Infrequent, low priority thread that used info bus to publish its data than the meteorological task, would prevent it from running, consequently preventing the blocked information bus task from running – When publishing its data, it would acquire a mutex, do writes to the bus, and release the mutex – If an interrupt caused bus thread to be scheduled while this mutex was • After some time had passed, a watchdog timer would go held, and if the bus thread then attempted to acquire this same mutex in off, notice that the data bus task had not been executed for order to retrieve published data, this would cause it to block on the mutex, waiting until the meteo. thread released the mutex before it could continue some time, conclude that something had gone drastically wrong, and initiate a total system reset • Communications task that ran with medium priority • This scenario is a classic case of priority inversion • Most of the time this combination worked fine… 9/6/2018 Cs262a-F18 Lecture-05 27 9/6/2018 Cs262a-F18 Lecture-05 28

8. Really Remote Debugging! Remote Bug Fixing • Pathfinder used VxWorks • When created, a VxWorks mutex object accepts a – VxWorks can be run in a tracing mode boolean parameter that indicates whether priority – Records a total trace of all interesting system events, including inheritance should be performed by the mutex context switches, uses of synchronization objects, and interrupts – The mutex in question had been initialized with the parameter off; had it been on, the low-priority meteorological thread would have inherited the priority of the high-priority data bus thread blocked on • Reproducing bug locally it while it held the mutex, causing it be scheduled with higher – After the failure, JPL engineers spent hours and hours running the priority than the medium-priority communications task, thus system on the exact spacecraft replica in their lab with tracing preventing the priority inversion turned on, attempting to replicate the precise conditions under – Once diagnosed, it was clear to the JPL engineers that using which they believed that the reset occurred priority inheritance would prevent the resets they were seeing – Early in the morning, after all but one engineer had gone home, the – By coding convention, the initialization parameter for the mutex in engineer finally reproduced a system reset on the replica question (and those for two others which could have caused the same problem) were stored in global variables • Analysis of the trace revealed the priority inversion • How to fix remotely? – VxWorks debugging mode! 9/6/2018 Cs262a-F18 Lecture-05 29 9/6/2018 Cs262a-F18 Lecture-05 30 Temporal Logic Assertions for the VxWorks Debugging Mode Detection of Priority Inversion • VxWorks debugging mode • Temporal logic – system of rules and symbolism for – Has a C language interpreter intended to allow developers to type in C representing, and reasoning about, propositions qualified in expressions and functions to be executed on the fly during system terms of time debugging – Can express requirements – e.g., whenever a request is made, access to a resource is eventually granted, but it is never granted to two requestors simultaneously • JPL engineers fortuitously decided to launch the spacecraft with this feature still enabled • TLA assertions were written as comments in the Pathfinder – Addresses of initialization parameters for mutexes stored in symbol code tables included in the launch software, and available to the C interpreter • Temporal-Rover software generated code that announces success and/or failure of any assertion during testing • A short C program was uploaded to the spacecraft, – T-R compared the actual program's behavior with the formal specification which when interpreted, changed the values of these – T-R captured all possible “golden” executions of the program variables from FALSE to TRUE • Interestingly enough, the JPL engineers actually created a priority inversion situation during testing • No more system resets occurred! – 1-2 system resets during months of pre-flight testing – One month planned mission lasted for three months instead! – Not reproducible or explainable, so “was probably caused by a hardware glitch” 9/6/2018 Cs262a-F18 Lecture-05 31 9/6/2018 Cs262a-F18 Lecture-05 32

9. TLA (cont’d) TLA (cont’d) • TLA captures time and order, so let: • Engineers did not manage to analyze their recorded data – HIGHPriorityTaskBlocked() represent a situation where the info bus thread is blocked by the low priority meteorological data gathering task well enough to conclude that priority inversion is indeed a – HIGHPriorityTaskInMutex() represent a situation where the information bus bug in their system thread is in Mutex – In other words, their test runs were sufficient, but their analysis tools were not – LOWPriorityTaskInMutex() represent a situation where the meteorological thread is in Mutex – MEDPriorityTaskRunning() represent a situation where the communications • Part of it was the engineers’ focus – extremely focused on task is running ensuring the quality and flawless operation of the landing • Assertions: software – Not Eventually ({HIGHPriorityTaskBlocked()} AND – Should it have failed, the mission would have been lost {MEDPriorityTaskRunning()}) – This formula specifies that never should it be that HIGHPriorityTaskBlocked() – Also, first “low-cost” NASA mission And MEDPriorityTaskRunning(). – Always({LOWPriorityTaskInMutex()} Implies Not {MEDPriorityTaskRunning()} Until {HIGHPriorityTaskInMutex()} ) • Entirely understandable for the engineers to discount – This formula specifies that always, if LOWPriorityTaskInMutex() then occasional glitches in the less-critical land-mission SW MEDPriorityTaskRunning() does not occur until a later time when – A spacecraft reset was a viable recovery strategy at that phase of the mission HIGHPriorityTaskInMutex(). 9/6/2018 Cs262a-F18 Lecture-05 33 9/6/2018 Cs262a-F18 Lecture-05 34 Exceptions Hints vs. Guarantees • Must restore monitor invariant as you unwind the • Notify is only a hint stack – Don’t have to wake up the right process – But, requires explicit UNWIND handlers (RETURN WITH – Don’t have to change the notifier if we slightly change the wait ERROR[args]), otherwise lock is not released condition (the two are decoupled) – Easier to implement, because it’s always OK to wake up too many processes. If we get lost, we could even wake up everybody • Failure to handle exceptions results in debugger (broadcast) invocation » Can we use broadcast everywhere there is a notify? Yes – “not much comfort, however, when a system is in operational use” » Can we use notify everywhere there is a broadcast? No, might not have satisfied OK to proceed for A, have satisfied it for B • What does Java do? • Enables timeouts and aborts – Release lock, no UNWIND primitive • General Principle: use hints for performance that have little or better yet no effect on the correctness – Many commercial systems use hints for fault tolerance: if the hint is wrong, things timeout and use a backup strategy » Performance hit for incorrect hint, but no errors 9/6/2018 Cs262a-F18 Lecture-05 35 9/6/2018 Cs262a-F18 Lecture-05 36

10. Performance 3 Key Features about the Paper • Assumes simple machine architecture • Describes the experiences designers had with – Single execution, non-pipelined – what about multi-processors? designing, building and using a large system that • Context switch is very fast: 2 procedure calls (60 ticks) aggressively relies on lightweight processes and • Ended up not mattering much for performance: monitor facilities for all its software concurrency needs – Ran only on uniprocessor systems – Concurrency mostly used for clean structuring purposes • Procedure calls are slow: 30 instructions (RISC proc. • Describes various subtle issues of implementing a calls are 10x faster); Why? threads-with-monitors design in real life for a large – Due to heap allocated procedure frames. Why did they do this? system » Didn’t want to worry about colliding process stacks – Mental model was “any procedure call might be a fork”: transfer was basic control transfer primitive • Discusses the performance and overheads of various primitives and presents three representative • Process creation: ~ 1100 instructions applications, but doesn’t give a big picture of how – Good enough most of the time important various decisions and features turned out – Fast-fork package implemented later that keeps around a pool or “available” processes to be 9/6/2018 Cs262a-F18 Lecture-05 37 9/6/2018 Cs262a-F18 Lecture-05 38 Some Flaws Is this a good paper? • Gloss over how hard it is to program with locks and • What were the authors’ goals? exceptions sometimes – not clear if there are better ways • What about the evaluation / metrics? • Did they convince you that this was a good system • Performance discussion doesn’t give the big picture /approach? – Tries to be machine-independent (ticks), but assumes particular model • Were there any red-flags? • A takeaway lesson: The lightweight threads-with- • What mistakes did they make? monitors programming paradigm can be used to • Does the system/approach meet the “Test of Time” successfully build large systems, but there are subtle challenge? points that have to be correct in the design and • How would you review this paper today? implementation in order to do so 9/6/2018 Cs262a-F18 Lecture-05 39 9/6/2018 Cs262a-F18 Lecture-05 40

11. Transactions (Brief intro before ARES) ACID Semantics • Second paper really a summary paper • Full Transactional support includes: – Ideas behind transactions were stabilizing at this time (1983) – Atomicity: It must be an all-or-nothing commitment – Gray, Berstein, Goodman, Codd, … – Consistency: After commit, a transaction will preserve the • Point of reading this paper: Getting transaction concept consistency of the data base into your brain » Example: for banking app, net amount of money constant, – Transactions are Atomic actions that are all-or-nothing even after moving money from account to account – They either complete entirely or they do not even start – Isolation: Events within a transaction hidden from other transactions Begin_Transaction() » Writes from uncommitted transactions invisible to other Do a bunch of things – read and write data transactions End_Transaction() – Durability: Once a transaction has been completed and committed its results, results will survive even system crashes – Until End_Transaction() completes, transaction may be aborted and effects will be nullified • Much of focus is around techniques for achieving – After End_Transaction() completes, transaction will be durable and results these properties should survive even system crashes – Use of Log to permit transactions to abort/be durable – Transaction said to “Commit” after End_Transaction() 9/6/2018 Cs262a-F18 Lecture-05 41 9/6/2018 Cs262a-F18 Lecture-05 42 Recovery actions: The Log Other things in paper • Log entries: • Lots of discussion of implementation techniques – REDO: (forward) enough information to perform update to the data – ATOMIC, STEAL, FORCE, … – UNDO: (backward) enough information to move backward\ – BEGIN: Start of transaction • ARIES paper for next time will show “BEST IN – COMMIT: End of transaction CLASS” combination of features – ABORT: Transaction was aborted • Once transaction committed in log, transaction durable – Updates may not be reflected in permanent state – If crash happens before COMMIT is placed into log, should be as if nothing ever happened: may need to undo state • Checkpoints: – Flush log out to well defined point/discard old log entries – Good point to recover from • Ways of updating permanent state – Update in place: use log to undo state if necessary – Shadow pages: keep old state around and update to new pages – Other ideas: keep different views of DB, “Old and new” copies, etc 9/6/2018 Cs262a-F18 Lecture-05 43 9/6/2018 Cs262a-F18 Lecture-05 44

12. Is this a good paper? • What were the authors’ goals? • What about the evaluation / metrics? • Did they convince you that this was a good system /approach? • Were there any red-flags? • What mistakes did they make? • Does the system/approach meet the “Test of Time” challenge? • How would you review this paper today? 9/6/2018 Cs262a-F18 Lecture-05 45