03 File Systems

File Systems A Fast File System for UNIX Analysis and Evolution of Journaling File Systems

1. Kernel Device Structure The System Call Interface EECS 262a Advanced Topics in Computer Systems Process Memory Device Filesystems Networking Lecture 3 Management Management Control Concurrency, Virtual Files and dirs: TTYs and Connectivity multitasking memory the VFS device access Filesystems File System Types Network August 30th, 2018 Architecture Subsystem Memory Device Dependent John Kubiatowicz Manager Block Control IF drivers Code Electrical Engineering and Computer Sciences Devices University of California, Berkeley http://www.eecs.berkeley.edu/~kubitron/cs262 8/30/2018 cs262a-F18 Lecture-03 2 Today’s Papers Review: Magnetic Disk Characteristic Track Sector • A Fast File System for UNIX • Cylinder: all the tracks under the Marshall Kirk McKusick, William N. Joy, Samuel J. Leffler and Robert head at a given point on all surface S. Fabry. Appears in ACM Transactions on Computer Systems Head (TOCS), Vol. 2, No. 3, August 1984, pp 181-197 Cylinder • Analysis and Evolution of Journaling File Systems • Read/write data is a three-stage process: Platter Vijayan Prabhakaran, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau, Appears in Proceedings of the Annual Conference on – Seek time: position the head/arm over the proper track (into proper cylinder) USENIX Annual Technical Conference (ATEC '05), 2005 – Rotational latency: wait for the desired sector to rotate under the read/write head – Transfer time: transfer a block of bits (sector) under the read-write head • System design paper and system analysis paper • Disk Latency = Queueing Time + Controller time + • Thoughts? Seek Time + Rotation Time + Xfer Time Software Controller Hardware Request Result Media Time Queue (Seek+Rot+Xfer) (Device Driver) • Highest Bandwidth: – Transfer large group of blocks sequentially from one track 8/30/2018 cs262a-F18 Lecture-03 3 8/30/2018 cs262a-F18 Lecture-03 4

2. Historical Perspective Disk History • 1956 IBM Ramac — early 1970s Winchester – Developed for mainframe computers, proprietary interfaces – Steady shrink in form factor: 27 in. to 14 in. Data • Form factor and capacity drives market more than performance density • 1970s developments Mbit/sq. in. – 5.25 inch floppy disk formfactor (microcode into mainframe) – Emergence of industry standard disk interfaces Capacity of • Early 1980s: PCs and first generation workstations Unit Shown • Mid 1980s: Client/server computing Megabytes – Centralized storage on file server » accelerates disk downsizing: 8 inch to 5.25 – Mass market disk drives become a reality 1973: 1979: » industry standards: SCSI, IPI, IDE 1. 7 Mbit/sq. in 7. 7 Mbit/sq. in » 5.25 inch to 3.5 inch drives for PCs, End of proprietary interfaces 140 MBytes 2,300 MBytes • 1900s: Laptops => 2.5 inch drives • 2000s: Shift to perpendicular recording – 2007: Seagate introduces 1TB drive source: New York Times, 2/23/98, page C3, – 2009: Seagate/WD introduces 2TB drive “Makers of disk drives crowd even mroe data into even smaller spaces” • 2018: Seagate announces 14TB drives 8/30/2018 cs262a-F18 Lecture-03 5 8/30/2018 cs262a-F18 Lecture-03 6 Disk History Recent: Seagate Enterprise (2014) • 8TB! > 1.33 Tb/in2 (announced Nov 2015) • 6 (3.5”) platters, 2 heads each • Perpendicular recording (not SMR!) • 7200 RPM, 4.16ms latency • 237MB/sec sustained transfer speed • 256MB cache • Error Characteristics: 1989: 1997: 1997: – MBTF: 2 x 106 hours 63 Mbit/sq. in 1450 Mbit/sq. in 3090 Mbit/sq. in 60,000 MBytes 2300 MBytes 8100 MBytes – Bit error rate: 10-15 • Special considerations: – Normally need special “bios” (EFI): Bigger than easily handled by 32-bit OSes. source: New York Times, 2/23/98, page C3, – Seagate provides special “Disk Wizard” software that virtualizes “Makers of disk drives crowd even mroe data into even smaller spaces” drive into multiple chunks that makes it bootable on these OSes. 8/30/2018 cs262a-F18 Lecture-03 7 8/30/2018 cs262a-F18 Lecture-03 8

3. Storage Performance & Price (2014) Contrarian View Bandwidth Cost/GB Size • FFS doesn’t matter anymore! (sequential R/W) HHD 50-100 MB/s $0.05-0.1/GB 2-8 TB SSD1 200-500 MB/s $1.5-5/GB 200GB-1TB (SATA) 6 GB/s (PCI) DRAM 10-16 GB/s $5-10/GB 64GB-256GB 1http://www.fastestssd.com/featured/ssd-rankings-the-fastest-solid-state-drives/ BW: SSD up to x10 than HDD, DRAM > x10 than SSD Price: HDD x30 less than SSD, SSD x4 less than DRAM    10 • What about Journaling? Is it still relevant? 8/30/2018 cs262a-F18 Lecture-03 9 8/30/2018 cs262a-F18 Lecture-03 10 Filesystems Background A Fast File System for UNIX (4.2 BSD) • i-node: structure for per-file • Original UNIX FS was simple and elegant, but slow metadata (unique per file) • Could only achieve about 20 KB/sec/arm; ~2% of 1982 disk – contains: ownership, permissions, bandwidth timestamps, about 10 data-block pointers • Problems: – i-nodes form an array, indexed by “i-number” – so each i-node has a – Blocks too small unique i-number » 512 bytes (matched sector size) – Array is explicit for FFS, implicit for – Consecutive blocks of files not close together LFS (its i-node map is cache of » Yields random placement for mature file systems i-nodes indexed by i-number) – i-nodes far from data • Indirect blocks: » All i-nodes at the beginning of the disk, all data after that – i-node only holds a small number of data block pointers (direct pointers) – i-nodes of directory not close together – For larger files, i-node points to an indirect block containing – no read-ahead 1024 4-byte entries in a 4K block » Useful when sequentially reading large sections of a file – Each indirect block entry points to a data block – Can have multiple levels of indirect blocks for even larger files 8/30/2018 cs262a-F18 Lecture-03 11 8/30/2018 cs262a-F18 Lecture-03 12

4. FFS Changes Attack of the Rotational Delay • Aspects of new file system: • Problem: Missing blocks due to rotational delay – Issue: Read one block, do processing, and read next block. In meantime, – 4096 or 8192 byte block size (why not larger?) disk has continued turning: missed next block! Need 1 revolution/block! – large blocks and small fragments Skip Sector – disk divided into cylinder groups – each contains superblock, i-nodes, bitmap of free blocks, usage summary info Track Buffer – Note that i-nodes are now spread across the disk: (Holds complete track) » Keep i-node near file, i-nodes of a directory together (shared fate) – Cylinder groups ~ 16 cylinders, or 7.5 MB – Solution1: Skip sector positioning (“interleaving”) » Place the blocks from one file on every other block of a track: give – Cylinder headers spread around so not all on one platter time for processing to overlap rotation – Solution2: Read ahead: read next block right after first, even if • Two techniques for locality: application hasn’t asked for it yet. » This can be done either by OS (read ahead) – Lie – don’t let disk fill up (in any one area) » By disk itself (track buffers). Many disk controllers have internal – Paradox: to achieve locality, must spread unrelated things far apart RAM that allows them to read a complete track – Note: new file system got 175KB/sec because free list contained • Important Aside: Modern disks+controllers do many sequential blocks (it did generate locality), but an old system has complex things “under the covers” randomly ordered blocks and only got 30 KB/sec (fragmentation) – Track buffers, elevator algorithms, bad block filtering 8/30/2018 cs262a-F18 Lecture-03 13 8/30/2018 cs262a-F18 Lecture-03 14 Where are inodes stored? Where are inodes stored? • In early UNIX and DOS/Windows’ FAT file system, • Later versions of UNIX moved the header information headers stored in special array in outermost to be closer to the data blocks – Often, inode for file stored in same “cylinder group” as parent cylinders directory of the file (makes an ls of that directory run fast). – Header not stored anywhere near the data blocks. To read a small – Pros: file, seek to get header, seek back to data. » UNIX BSD 4.2 puts a portion of the file header array on each of – Fixed size, set when disk is formatted. At formatting time, a fixed many cylinders. For small directories, can fit all data, file number of inodes were created (They were each given a unique headers, etc. in same cylinder  no seeks! number, called an “inumber”) » File headers much smaller than whole block (a few hundred bytes), so multiple headers fetched from disk at same time » Reliability: whatever happens to the disk, you can find many of the files (even if directories disconnected) – Part of the Fast File System (FFS) » General optimization to avoid seeksbv 8/30/2018 cs262a-F18 Lecture-03 15 8/30/2018 cs262a-F18 Lecture-03 16

5. 4.2 BSD Locality: Block Groups FFS First Fit Block Allocation • File system volume is divided into a set of block groups – Close set of tracks • Data blocks, metadata, and free space interleaved within block group – Avoid huge seeks between user data and system structure • Put directory and its files in common block group • First-Free allocation of new file blocks – To expand file, first try successive blocks in bitmap, then choose new range of blocks – Few little holes at start, big sequential runs at end of group – Avoids fragmentation • Fills in the small holes at the start of block group – Sequential layout for big files • Avoids fragmentation, leaves contiguous free space • Important: keep 10% or more free! at end – Reserve space in the BG 8/30/2018 cs262a-F18 Lecture-03 17 8/30/2018 cs262a-F18 Lecture-03 18 FFS Locality Techniques (Summary) FFS Results • Goals • 20-40% of disk bandwidth for large reads/writes – Keep directory within a cylinder group, spread out different directories – Allocate runs of blocks within a cylinder group, every once in a • 10-20x original UNIX speeds while switch to a new cylinder group (jump at 1MB) • Size: 3800 lines of code vs. 2700 in old system • Layout policy: global and local – Global policy allocates files & directories to cylinder groups – picks “optimal” next block for block allocation • 10% of total disk space unusable (except at 50% – Local allocation routines handle specific block requests – select from a sequence of alternative if need to performance price) • Could have done more; later versions do 8/30/2018 cs262a-F18 Lecture-03 19 8/30/2018 cs262a-F18 Lecture-03 20

6. FFS System Interface Enhancements FFS Summary • Really a second mini-paper! • 3 key features: • Long file names (14  255 characters) – Parameterize FS implementation for the hardware it’s running on • Advisory file locks (shared or exclusive) – Measurement-driven design decisions – Process id of holder stored with lock => can reclaim the lock if process is – Locality “wins” no longer around • Major flaws: • Symbolic links (contrast to hard links) – Measurements derived from a single installation • Atomic rename capability – Ignored technology trends – The only atomic read-modify-write operation, before this there was none • Disk quotas • A lesson for the future: don’t ignore underlying • Could probably have gotten copy-on-write to work to avoid copying hardware characteristics data from user  kernel (would need to copies only for parts that are not page aligned) • Over-allocation would save time; return unused allocation later • Contrasting research approaches: improve what Advantages: you’ve got vs. design something new – 1) less overhead for allocation – 2) more likely to get sequential blocks 8/30/2018 cs262a-F18 Lecture-03 21 8/30/2018 cs262a-F18 Lecture-03 22 Is this a good paper? • What were the authors’ goals? • What about the evaluation / metrics? • Did they convince you that this was a good system /approach? • Were there any red-flags? BREAK • What mistakes did they make? • Does the system/approach meet the “Test of Time” challenge? • How would you review this paper today? 8/30/2018 cs262a-F18 Lecture-03 23

7. Quick Aside: Transactional File Systems Log-Structured/Journaling File System • Better reliability through use of log – All changes are treated as transactions • Radically different file system design – A transaction is committed once it is written to the log • Technology motivations: » Data forced to disk for reliability – CPUs outpacing disks: I/O becoming more-and-more of a » Process can be accelerated with NVRAM bottleneck – Although File system may not be updated immediately, data preserved in the log – Large RAM: file caches work well, making most disk traffic writes • Difference between “Log Structured” and “Journaled” • Problems with (then) current file systems: – In a Log Structured filesystem, data stays in log form – Lots of little writes – In a Journaled filesystem, Log used for recovery – Synchronous: wait for disk in too many places – makes it hard to • Journaling File System win much from RAIDs, too little concurrency – Applies updates to system metadata using transactions (using logs, etc.) – 5 seeks to create a new file: (rough order) – Updates to non-directory files (i.e., user stuff) can be done in place 1. file i-node (create) (without logs), full logging optional 2. file data – Ex: NTFS, Apple HFS+, Linux XFS, JFS, ext3, ext4 3. directory entry • Full Logging File System 4. file i-node (finalize) 5. directory i-node (modification time) – All updates to disk are done in transactions 8/30/2018 cs262a-F18 Lecture-03 25 8/30/2018 cs262a-F18 Lecture-03 26 LFS Basic Idea LFS Log Retrieval • Log all data and metadata with efficient, large, sequential • Keep same basic file structure as UNIX (inode, writes indirect blocks, data) • Treat the log as the truth, but keep an index on its contents • Retrieval is just a question of finding a file’s inode – Anti-locality for reads! • UNIX inodes kept in one or a few big arrays, LFS – Great locality for writes (including random writes) inodes must float to avoid update-in- place • Rely on a large memory to provide fast access through • Solution: an inode map that tells where each inode is caching (Also keeps other stuff: version number, last access • Data layout on disk has “temporal locality” (good for writing), time, free/allocated) rather than “logical locality” (good for reading) • inode map gets written to log like everything else – Why is this a better? Because caching helps reads but not writes! • Map of inode map gets written in special checkpoint • Two potential problems: location on disk; used in crash recovery – Log retrieval on cache misses – Wrap-around: what happens when end of disk is reached? » No longer any big, empty runs available » How to prevent fragmentation? 8/30/2018 cs262a-F18 Lecture-03 27 8/30/2018 cs262a-F18 Lecture-03 28

8. LFS Disk Wrap-Around LFS Segment Cleaning • Compact live info to open up large runs of free space • Which segments to clean? – Problem: long-lived information gets copied over-and-over – Keep estimate of free space in each segment to help find segments with lowest utilization • Thread log through free spaces – Always start by looking for segment with utilization=0, since those – Problem: disk fragments, causing I/O to become inefficient again are trivial to clean… – If utilization of segments being cleaned is U: » write cost = (total bytes read & written)/(new data written) = • Solution: segmented log 2/(1-U) (unless U is 0) – Divide disk into large, fixed-size segments » write cost increases as U increases: U = .9 => cost = 20! – Do compaction within a segment; thread between segments » Need a cost of less than 4 to 10; => U of less than .75 to .45 – When writing, use only clean segments (i.e. no live data) – Occasionally clean segments: read in several, write out live data in compacted form, leaving some fragments free • How to clean a segment? – Try to collect long-lived info into segments that never need to be cleaned – Segment summary block contains map of the segment – Note there is not free list or bit map (as in FFS), only a list of clean – Must list every i-node and file block segments – For file blocks you need {i-number, block #} 8/30/2018 cs262a-F18 Lecture-03 29 8/30/2018 cs262a-F18 Lecture-03 30 Analysis and Evolution of Journaling File Systems Three modes for a JFS • Write-ahead logging: commit data by writing it to log, • Writeback mode: synchronously and sequentially – Journal only metadata • Unlike LFS, then later moved data to its normal – Write back data and metadata independently (FFS-like) location – this write is called checkpointing – Metadata may thus have dangling references after a crash (if metadata written before the data with a crash in between) and like segment cleaning, it makes room in the (circular) journal • Ordered mode: • Better for random writes, slightly worse for big – Journal only metadata, but always write data blocks before their sequential writes referring metadata is journaled • All reads go the the fixed location blocks, not the – This mode generally makes the most sense and is used by Windows NTFS and IBM’s JFS journal, which is only read for crash recovery and checkpointing • Data journaling mode: • Much better than FFS (fsck) for crash recovery – Write both data and metadata to the journal (covered below) because it is much faster – Huge increase in journal traffic; plus have to write most blocks • Ext3/ReiserFS/Ext4 filesystems are the main ones in twice, once to the journal and once for checkpointing (why not all?) Linux 8/30/2018 cs262a-F18 Lecture-03 31 8/30/2018 cs262a-F18 Lecture-03 32

9. JFS Crash Recovery Logging File Systems • Instead of modifying data structures on disk directly, • Load superblock to find the tail/head of the log write changes to a journal/log – Intention list: set of changes we intend to make – Log/Journal is append-only • Scan log to detect whole committed transactions – Single commit record commits transaction (they have a commit record) • Once changes are in the log, it is safe to apply changes to data structures on disk – Recovery can read log to see what changes were intended • Replay log entries to bring in-memory data structures – Can take our time making the changes up to date » As long as new requests consult the log first – This is called “redo logging” and entries must be “idempotent” • Once changes are copied, safe to remove log • But, … • Playback is oldest to newest; tail of the log is the – If the last atomic action is not done … poof … all gone place where checkpointing stopped • Basic assumption: – Updates to sectors are atomic and ordered – Not necessarily true unless very careful, but key assumption • How to find the head of the log? 8/30/2018 cs262a-F18 Lecture-03 33 8/30/2018 cs262a-F18 Lecture-03 34 Redo Logging Example: Creating a file • Find free data block(s) • Prepare • Recovery • Find free inode entry – Write all changes (in – Read log • Find dirent insertion point transaction) to log – Redo any operations Free Space -------------------------- • Commit for committed map transactions • Write map (i.e., mark used) … – Single disk write to Data blocks make transaction – Garbage collect log • Write inode entry to point to durable block(s) Inode table • Write dirent to point to inode • Redo Directory – Copy changes to disk entries • Garbage collection – Reclaim space in log 8/30/2018 cs262a-F18 Lecture-03 35 8/30/2018 cs262a-F18 Lecture-03 36

10. Ex: Creating a file (as a transaction) ReDo log • Find free data block(s) • Find free inode entry • After Commit • Find dirent insertion point • All access to file system first Free Space looks in log Free Space -------------------------- map • Eventually copy changes to map • Write map (used) disk … … Data blocks Data blocks • Write inode entry to point to block(s) Inode table Inode table • Write dirent to point to inode Directory Directory entries entries tail head tail tail tail tail tail head commit commit done pending done start start Log in non-volatile storage (Flash or on Disk) Log in non-volatile storage (Flash) pending 8/30/2018 cs262a-F18 Lecture-03 37 8/30/2018 cs262a-F18 Lecture-03 38 Crash during logging - Recover Recovery After Commit • Upon recovery scan the long • Scan log, find start • Detect transaction start with • Find matching commit no commit • Redo it as usual Free Space Free Space • Discard log entries map – Or just let it happen later map • Disk remains unchanged … … Data blocks Data blocks Inode table Inode table Directory Directory entries entries tail head tail head commit done pending done pending start start Log in non-volatile storage (Flash or on Disk) Log in non-volatile storage (Flash or on Disk) 8/30/2018 cs262a-F18 Lecture-03 39 8/30/2018 cs262a-F18 Lecture-03 40

11. What if had already started writing back the What if the uncommitted transaction transaction ? was discarded on recovery? • Idempotent – the result does not change if the • Do it again from scratch operation is repeat several times. • Nothing on disk was changed • Just write them again during recovery 8/30/2018 cs262a-F18 Lecture-03 41 8/30/2018 cs262a-F18 Lecture-03 42 What if we crash again during recovery? Redo Logging • Idempotent • Prepare • Recovery • Just redo whatever part of the log hasn’t been – Write all changes (in – Read log garbage collected transaction) to log – Redo any operations • Commit for committed transactions – Single disk write to make transaction – Ignore uncommitted durable ones • Redo – Garbage collect log – Copy changes to disk • Garbage collection – Reclaim space in log 8/30/2018 cs262a-F18 Lecture-03 43 8/30/2018 cs262a-F18 Lecture-03 44

12. Can we interleave transactions in the log? Some Fine Points tail head • Can group transactions together: fewer syncs and fewer writes, since hot metadata may changes several times within one commit commit start start pending transaction • This is a very subtle question • Need to write a commit record, so that you can tell that all of the compound transaction made it to disk • The answer is “if they are serializable” – i.e., would be possible to reorder them in series without violating • ext3 logs whole metadata blocks (physical logging); JFS and any dependences NTFS log logical records instead, which means less journal • Deep theory around consistency, serializability, and traffic memory models in the OS, Database, and Architecture fields, respectively – A bit more later --- and in the graduate course… 8/30/2018 cs262a-F18 Lecture-03 45 8/30/2018 cs262a-F18 Lecture-03 46 Some Fine Points Linux Example: Ext2/3 Disk Layout • Head of line blocking: • Disk divided into block – Compound transactions can link together concurrent streams (e.g., groups from different apps) and hinder asynchronous apps performance – Provides locality (Figure 6) – Each group has two block- – This is like having no left turn lane and waiting on the car in front of sized bitmaps (free you to turn left, when you just want to go straight blocks/inodes) – Block sizes settable at format time: • Distinguish 1K, 2K, 4K, 8K… – Between ordering of writes and durability/persistence – careful ordering means that after a crash the file system can be recovered to a consistent • Actual Inode structure past state. similar to 4.2BSD – But that state could be far in the past in the case of JFS – with 12 direct pointers – 30 seconds behind is more typical for ext3 – if you really want something to be durable you must flush the log synchronously • Ext3: Ext2 w/Journaling – Several degrees of protection with more or less cost • Example: create a file1.dat 8/30/2018 cs262a-F18 Lecture-03 47 8/30/2018 under /dir1/ in Ext3 cs262a-F18 Lecture-03 48

13. Semantic Block-level Analysis (SBA) Semantic Trace Playback (STP) • Nice idea: interpose special disk driver between the • Uses two kinds of interposition: file system and the real disk driver – 1) SBA driver that produces a trace, and – 2) user-level library that fits between the app and the real filesystem • Pros: simple, captures ALL disk traffic, can use with a black-box filesystem (no source code needed and • User-level library traces dirty blocks and app calls to fsync can even use via VMWare for another OS), can be • Playback: more insightful than just a performance benchmark – Given the two traces, STP generates a timed set of commands to the raw disk device – this sequence can be timed to understand performance implications • Cons: must have some understanding of the disk layout, which differs for each filesystem, requires a • Claim: great deal of inference; really only useful for writes – Faster to modify the trace than to modify the filesystem and simpler and less error-prone than building a simulator • To use well, drive filesystem with smart applications that test certain features of the filesystem (to make • Limited to simple FS changes the inference easier) • Best example usage: – Showing that dynamically switching between ordered mode and data journaling mode actually gets the best overall performance (Use data journaling for random writes) 8/30/2018 cs262a-F18 Lecture-03 49 8/30/2018 cs262a-F18 Lecture-03 50 Is this a good paper? Extra Slides on LFS • What were the authors’ goals? • What about the evaluation/metrics? • Did they convince you that this was a good system/approach? • Were there any red-flags? • What mistakes did they make? • Does the system/approach meet the “Test of Time” challenge? • How would you review this paper today? 8/30/2018 cs262a-F18 Lecture-03 51 8/30/2018 cs262a-F18 Lecture-03 52

14. LFS i-node and Block Cleaning Simulation of LFS Cleaning • To clean an i-node: • Initial model: Uniform random distribution of references; – Just check to see if it is the current version (from i-node map) greedy algorithm for segment- to-clean selection – If not, skip it; if so, write to head of log and update i-node map • To clean a file block, must figure out it if is still live • Why does the simulation do better than the formula? – First check the UID, which only tells you if this file is current (UID only – Because of variance in segment utilizations changes when is deleted or has length zero) – Note that UID does not change every time the file is modified (since you would have to update the UIDs of all of its blocks) • Added locality (i.e., 90% of references go to 10% of data) – Next, walk through the i-node and any indirect blocks to get to the data block pointer for this block number and things got worse! » If it points to this block, then move the block to the head of the log 8/30/2018 cs262a-F18 Lecture-03 53 8/30/2018 cs262a-F18 Lecture-03 54 LFS Cleaning Solution #1 LFS Cleaning Solution #2 • First solution: Write out cleaned data ordered by age to • Second Solution: obtain hot and cold segments – It’s worth paying more to clean cold segments because you get to keep – What prog. language feature does this remind you of? (Generational GC) the free space longer – Only helped a little • Better way to think about this: • Problem: – Don’t clean segments that have a high d-free/dt (first derivative of – Even cold segments eventually have to reach the cleaning point, but they utilization) drift down slowly. tying up lots of free space – If you ignore them, they clean themselves! – Do you believe that’s true? – LFS uses age as an approximation of d-free/dt, because the latter is hard to track directly • New selection function: – MAX(T*(1-U)/(1+U)) – Resulted in the desired bi-modal utilization function – LFS stays below write cost of 4 up to a disk utilization of 80% 8/30/2018 cs262a-F18 Lecture-03 55 8/30/2018 cs262a-F18 Lecture-03 56

15. LFS Recovery Techniques LFS Checkpoints • LFS Checkpoints: • Three techniques: – Just an optimization to roll forward – Checkpoints – Reduces recovery time – Crash Recovery – Directory Operation Log • Checkpoint contains: pointers to i-node map and segment usage table, current segment, timestamp, checksum (?) • Before writing a checkpoint make sure to flush i-node map and segment usage table • Uses “version vector” approach: – Write checkpoints to alternating locations with timestamps and checksums – On recovery, use the latest (valid) one 8/30/2018 cs262a-F18 Lecture-03 57 8/30/2018 cs262a-F18 Lecture-03 58 LFS Crash Recovery LFS Directory Operation Log • Unix must read entire disk to reconstruct meta data • Example of “intent + action”: – Write the intent as a “directory operation log” – Then write the actual operations (create, link, unlink, rename) • LFS reads checkpoint and rolls forward through log from checkpoint state • This makes them atomic • Result: recovery time measured in seconds instead of minutes to hours • On recovery, if you see the operation log entry, then you can REDO the operation to complete it (For new file create with no data, you UNDO it instead) • Directory operation log == log intent to achieve atomicity, then redo during recovery, (undo for new files with no data, since you can’t redo it) • => “logical” REDO logging 8/30/2018 cs262a-F18 Lecture-03 59 8/30/2018 cs262a-F18 Lecture-03 60

16. LFS Summary LFS Observations • Key features of paper: • An interesting point: – CPUs outpacing disk speeds; implies that I/O is becoming more- – LFS’ efficiency isn’t derived from knowing the details of disk and-more of a bottleneck geometry; implies it can survive changing disk technologies (such – Write FS information to a log and treat the log as the truth; rely on variable number of sectors/track) better in-memory caching to obtain speed – Hard problem: finding/creating long runs of disk space to (sequentially) write log records to • A Lesson: » Solution: clean live data from segments, picking segments to – Rethink your basic assumptions about what’s primary and what’s clean based on a cost/benefit function secondary in a design – In this case, they made the log become the truth instead of just a recovery aid • Some flaws: – Assumes that files get written in their entirety; else would get intra- file fragmentation in LFS – If small files “get bigger” then how would LFS compare to UNIX? 8/30/2018 cs262a-F18 Lecture-03 61 8/30/2018 cs262a-F18 Lecture-03 62