04 File System Reliability: Bottom Up/Top Down

File System Reliability: Bottom Up/Top Down The HP AutoRAID Hierarchical Storage System (2-up version) Finding a needle in Haystack: Facebook’s photo storage

1. Today’s Papers EECS 262a • The HP AutoRAID Hierarchical Storage System (2-up version), Advanced Topics in Computer Systems John Wilkes, Richard Golding, Carl Staelin, and Tim Sullivan. Appears in ACM Transactions on Computer Systems, Vol. 14, No, Lecture 4 1, February 1996, Pages 108-136. • Finding a needle in Haystack: Facebook’s photo storage,Doug Beaver, Sanjeev Kumar, Harry C. Li, Jason Sobel, Peter Vajgel. Filesystems (Con’t) Appears in Proceedings of the USENIX conference in Operating Systems Design and Implementation (OSDI), 2010 September 4th, 2018 John Kubiatowicz • System design paper and system analysis paper Electrical Engineering and Computer Sciences • Thoughts? University of California, Berkeley http://www.eecs.berkeley.edu/~kubitron/cs262 9/04/2018 Cs262a-F18 Lecture-04 2 Array Reliability RAID Basics (Two optional papers) • Levels of RAID (those in RED are actually used): • Reliability of N disks = Reliability of 1 Disk ÷ N – RAID 0 (JBOD): striping with no parity (just bandwidth) – RAID 1: Mirroring (simple, fast, but requires 2x storage) » 1/n space, reads faster (1 to Nx), writes slower (1x) – why? 50,000 Hours ÷ 70 disks = 700 hours – RAID 2: bit-level interleaving with Hamming error-correcting codes (ECC) – RAID 3: byte-level striping with dedicated parity disk Disk system MTTF: Drops from 6 years to 1 month! » Dedicated parity disk is write bottleneck, since every write also writes parity – RAID 4: block-level striping with dedicated parity disk • Arrays (without redundancy) too unreliable to be useful! » Same bottleneck problems as RAID 3 – RAID 5: block-level striping with rotating parity disk » Most popular; spreads out parity load; space 1-1/N, read/write (N-1)x – RAID 6: RAID 5 with two parity blocks (tolerates two drive failures) Hot spares support reconstruction in parallel with • Use RAID 6 with today’s drive sizes! Why? access: very high media availability can be – Correlated drive failures (2x expected in 10hr recovery) achieved [Schroeder and Gibson, FAST07] – Failures during multi-hour/day rebuild in high-stress environments 9/04/2018 Cs262a-F18 Lecture-04 3 9/04/2018 Cs262a-F18 Lecture-04 4

2. Redundant Arrays of Disks Redundant Arrays of Disks RAID 5+: High RAID 1: Disk Mirroring/Shadowing I/O Rate Parity recovery group D0 D1 D2 D3 P Increasing A logical write Logical becomes four Disk physical I/Os D4 D5 D6 P D7 Addresses • Each disk is fully duplicated onto its "shadow" Independent writes D8 D9 P D10 D11 Very high availability can be achieved possible because of interleaved parity • Bandwidth sacrifice on write: D12 P D13 D14 D15 Logical write = two physical writes Reed-Solomon Stripe Codes ("Q") for P D16 D17 D18 D19 • Reads may be optimized protection during Stripe reconstruction Unit • Most expensive solution: 100% capacity overhead D20 D21 D22 D23 P Targeted for mixed applications . . . . . Targeted for high I/O rate , high availability environments . . . Disk Columns. . . . . . . 9/04/2018 Cs262a-F18 Lecture-04 5 9/04/2018 Cs262a-F18 Lecture-04 6 Problems of Disk Arrays: Small Writes System Availability: Orthogonal RAIDs String RAID-5: Small Write Algorithm Controller . . . 1 Logical Write = 2 Physical Reads + 2 Physical Writes String Controller . . . D0' D0 D1 D2 D3 P String Array Controller . . . new old (1. Read) old (2. Read) Controller String data data parity Controller . . . + XOR String Controller . . . + XOR String Controller . . . (3. Write) (4. Write) Data Recovery Group: unit of data redundancy D0' D1 D2 D3 Redundant Support Components: fans, power supplies, controller, cables P' End to End Data Integrity: internal parity protected data paths 9/04/2018 Cs262a-F18 Lecture-04 7 9/04/2018 Cs262a-F18 Lecture-04 8

3. System-Level Availability How to get to “RAID 6”? host host • One option: Reed-Solomon codes (Non-systematic): I/O Controller Fully dual redundant I/O Controller – Use of Galois Fields (finite element equivalent of real numbers) – Data as coefficients, code space as values of polynomial: – P(x)=a0+a1x1+… a4x4 Array Controller Array Controller – Coded: P(1),P(2)….,P(6),P(7) ... ... – Advantage: can add as much redundancy as you like: 5 disks? ... • Problems with Reed-Solomon codes: decoding gets Goal: No Single complex quickly – even to add a second disk Points of Failure • Alternates: lot of them – I’ve posted one possibility ... – Idea: Use prime number of columns, diagonal as well as straight XOR ... Recovery . with duplicated paths, higher performance can be Group . obtained when there are no failures 9/04/2018 . Cs262a-F18 Lecture-04 9 9/04/2018 Cs262a-F18 Lecture-04 10 HP AutoRAID – Motivation HP AutoRAID – Key Ideas • Goals: automate the efficient replication of data in a RAID • Key idea: mirror active data (hot), RAID 5 for cold data – RAIDs are hard to setup and optimize – Assumes only part of data in active – Mix fast mirroring (2 copies) with slower, more space-efficient parity disks use at one time – Automate the migration between these two levels – Working set changes slowly (to allow migration) • RAID small-write problem: • How to implement this idea? – to overwrite part of a block required 2 reads and 2 writes! – Sys-admin – read data, read parity, write data, write parity » make a human move around the • Each kind of replication has a narrow range of workloads for files.... BAD. painful and error prone which it is best... – File system – Mistake ⇒ 1) poor performance, 2) changing layout is expensive and error » best choice, but hard to implement/ prone deploy; can’t work with existing systems – Also difficult to add storage: new disk ⇒ change layout and rearrange data... – Smart array controller: (magic disk) block-level device interface » Easy to deploy because there is a well-defined abstraction » Enables easy use of NVRAM (why?) 9/04/2018 Cs262a-F18 Lecture-04 11 9/04/2018 Cs262a-F18 Lecture-04 12

4. HP AutoRaid – Features AutoRAID Details • Block Map – level of indirection so that blocks can be moved around among the disks – implies you only need one “zero block” (all zeroes), a variation of copy on write – in fact could generalize this to have one real block for each unique block • Mirroring of active blocks – RAID 5 for inactive blocks or large sequential writes (why?) – Start out fully mirrored, then move to 10% mirrored as disks fill • Promote/demote in 64K chunks (8-16 blocks) – Hot swap disks, etc. (A hot swap is just a controlled failure.) • PEX (Physical Extent): 1MB chunk of disk space – Add storage easily (goes into the mirror pool) • PEG (Physical Extent Group): Size depends on # Disks – A group of PEXes assigned to one storage class – useful to allow different size disks (why?) • Stripe: Size depends # Disks • No need for an active hot spare (per se); – One row of parity and data segments in a RAID 5 storage class – just keep enough working space around • Segment: 128 KB • Log-structured RAID 5 writes – Strip unit (RAID 5) or half of a mirroring unit – Nice big streams, no need to read old parity for partial writes • Relocation Block (RB): 64KB – Client visible space unit 9/04/2018 Cs262a-F18 Lecture-04 13 9/04/2018 Cs262a-F18 Lecture-04 14 Closer Look: Questions • When to demote? When there is too much mirrored storage (>10%) – Demotion leaves a hole (64KB). What happens to it? Moved to free list and reused – Demoted RBs are written to the RAID5 log, one write for data, a second for parity • Why log RAID5 better than update in place? – Update of data requires reading all the old data to recalculate parity. – Log ignores old data (which becomes garbage) and writes only new data/parity stripes • When to promote? When a RAID5 block is written... – Just write it to mirrored and the old version becomes garbage. • How big should an RB be? – Bigger ⇒ Less mapping information, fewer seeks – Smaller ⇒ fine grained mapping information • How do you find where an RB is? – Convert addresses to (LUN, offset) and then lookup RB in a table from this pair – Map size = Number of RBs and must be proportional to size of total storage • How to handle thrashing (too much active write data)? – Automatically revert to directly writing RBs to RAID 5! 9/04/2018 Cs262a-F18 Lecture-04 15 9/04/2018 Cs262a-F18 Lecture-04 16

5. Issues Is this a good paper? • Disks writes go to two disks (since newly written data is “hot”). • What were the authors’ goals? – Must wait for both to complete -- why? – Does the host have to wait for both? No, just for NVRAM • What about the performance metrics? • Controller uses cache for reads • Did they convince you that this was a good system? • Controller uses NVRAM for fast commit, then moves data to disks – What if NVRAM is full? Block until NVRAM flushed to disk, then write to NVRAM • Were there any red-flags? • What happens in the background? • What mistakes did they make? – 1) compaction, 2) migration, 3) balancing • Does the system meet the “Test of Time” challenge? • Compaction: clean RAID5 and plug holes in the mirrored disks. – Do mirrored disks get cleaned? Yes, when a PEG is needed for RAID5; i.e., pick a • How would you review this paper today? disks with lots of holes and move its used RBs to other disks. Resulting empty PEG is now usable by RAID5 – What if there aren’t enough holes? Write the excess RBs to RAID5, then reclaim the PEG • Migration: which RBs to demote? Least-recently-written (not LRU) • Balancing: make sure data evenly spread across the disks. (Most important when you add a new disk) 9/04/2018 Cs262a-F18 Lecture-04 17 9/04/2018 Cs262a-F18 Lecture-04 18 Finding a Needle in Haystack Old Solution: NFS • This is a systems level solution: • Issues with this design? – Takes into account specific application (Photo Sharing) • Long Tail  Caching does not » Large files!, Many files! work for most photos » 260 Billion images, 20 PetaBytes (1015 bytes!) – Every access to back end storage must be » One billion new photos a week (60 TeraBytes) fast without benefit of caching! » Each photo scaled to 4 sizes and replicated (3x) • Linear Directory scheme works – Takes into account environment (Presence of Content Delivery Network, CDN) badly for many photos/directory » High cost for NAS and CDN – Many disk operations to find even a – Takes into account usage patterns: single photo (10 I/Os!) » New photos accessed a lot (caching well) – Directory’s block map too big to cache in memory » Old photos accessed little, but – “Fixed” by reducing directory size, however still not great (10  3 I/Os) likely to be requested at any time  NEEDLES • FFS metadata requires ≥ 3 disk accesses per lookup (dir, inode, pic) • Cumulative graph of accesses – Caching all inodes in memory might help, but inodes are big as function of age • Fundamentally, Photo Storage different from other storage: – Normal file systems fine for developers, databases, etc. 9/04/2018 Cs262a-F18 Lecture-04 19 9/04/2018 Cs262a-F18 Lecture-04 20

6. Solution: Finding a needle (old photo) in What about these results? Haystack • Differentiate between old and new photos – How? By looking at “Writeable” vs “Read-only” volumes – New Photos go to Writeable volumes • Directory: Help locate photos – Name (URL) of photo has embedded volume and photo ID • Let CDN or Haystack Cache Serve new photos – rather than forwarding them to • Workloads: Writeable volumes – A: Random reads to 64KB images – 85% of raw throughput, 17% higher latency • Haystack Store: Multiple “Physical Volumes” – B: Same as A but 70% of reds are 8KB images – Physical volume is large file (100 GB) which stores millions of photos – C, D, E: Write throughput with 1, 4, 16 writes batched (30 and 78% throughput – Data Accessed by Volume ID with offset into file gain) – Since Physical Volumes are large files, use XFS which is optimized for large files – F, G: Mixed workloads (98% R/2% MW, 96% R/4% MW of 16 image MW) – DRAM usage per photo: 40 bytes vs 536 inode • Are these good benchmarks? Why or why not? • Cheaper/Faster: ~28% less expensive, ~4x reads/s than NAS • Are these good results? Why or why not? 9/04/2018 Cs262a-F18 Lecture-04 21 9/04/2018 Cs262a-F18 Lecture-04 22 Discussion of Haystack Is this a good paper? • Did their design address their goals? • What were the authors’ goals? – Why or why not • What about the performance metrics? • Were they successful? • Did they convince you that this was a good system? – Is this a different question? • Were there any red-flags? • What about the benchmarking? – Good performance metrics? • What mistakes did they make? – Did they convince you that this was a good system? • Does the system meet the “Test of Time” challenge? • Were there any red-flags? • How would you review this paper today? • What mistakes did they make? • Will this system meet the “Test of Time” challenge? 9/04/2018 Cs262a-F18 Lecture-04 23 9/04/2018 Cs262a-F18 Lecture-04 24