1. CalvinFS: Consistent WAN Replication and Scalable Metadata Management for Distributed File Systems Slide credits: Thomas Kao
2. Background • Scalable solutions exist for data storage, so why not for file systems?
3. Motivation • Often bottlenecked by the metadata management layer • Availability susceptible to data center outages • Still provides expected file system semantics
4. Key Contributions • Distributed database system for scalable metadata management • Strongly consistent geo-replication of file system state
5. Calvin: Log • Many front-end servers • Asynchronously-replicated distributed block store • Small number of "metadata" log servers • Transaction requests are replicated and appended, in order, to the "meta log"
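A minimal sketch of how an ordered meta log might look, assuming a single in-process log for illustration. The names (MetaLog, TxnRequest, append, read_from) are hypothetical; the real system batches requests on front-end servers and replicates the log across the log servers.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TxnRequest:
    txn_type: str   # e.g. "CreateFile", "Write"
    args: tuple     # transaction arguments


@dataclass
class MetaLog:
    entries: List[TxnRequest] = field(default_factory=list)

    def append(self, batch: List[TxnRequest]) -> int:
        """Append a batch of requests; their log position fixes the global order."""
        start = len(self.entries)
        self.entries.extend(batch)
        return start

    def read_from(self, offset: int) -> List[TxnRequest]:
        """Schedulers replay the log from a known offset to execute transactions."""
        return self.entries[offset:]


log = MetaLog()
log.append([TxnRequest("CreateFile", ("/home/a",))])
log.append([TxnRequest("Write", ("/home/a", 0, "blk-1", 0, 128))])
print([t.txn_type for t in log.read_from(0)])  # ['CreateFile', 'Write']
```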
6. Calvin: Storage Layer • Knowledge of physical data store organization and actual transaction semantics • Read/write primitives that execute on one node • Placement manager • Multiversion key-value store at each node, plus a consistent hashing mechanism
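A minimal sketch of consistent-hashing placement across storage nodes, assuming virtual nodes on a hash ring. The names (HashRing, node_for) are hypothetical; CalvinFS's placement manager is more elaborate, but the idea of mapping a key to the node responsible for it is the same.

```python
import hashlib
from bisect import bisect_right


class HashRing:
    def __init__(self, nodes, vnodes=64):
        # Sorted list of (hash, node); each node appears at vnodes positions.
        self.ring = []
        for node in nodes:
            for i in range(vnodes):
                h = int(hashlib.md5(f"{node}#{i}".encode()).hexdigest(), 16)
                self.ring.append((h, node))
        self.ring.sort()

    def node_for(self, key: str) -> str:
        """Return the storage node responsible for this key."""
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        idx = bisect_right(self.ring, (h, "")) % len(self.ring)
        return self.ring[idx][1]


ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("/home/user/file.txt"))
```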
7. Calvin: Scheduler • Drives local transaction execution • Fully examines transaction before execution • Deterministic locking • Transaction protocol: perform local reads → serve remote reads → collect remote read results → execute transaction to completion • No distributed commit protocol
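A minimal sketch of the per-transaction protocol a participating scheduler might run, assuming a toy message channel and an `owns()` predicate for data ownership; these names are hypothetical. The point is the fixed phase order and the absence of a distributed commit protocol: the log order already decides that the transaction commits.

```python
def execute_transaction(txn, local_store, owns, channel):
    # 1. Perform local reads for keys this node owns.
    local_reads = {k: local_store.get(k) for k in txn["read_set"] if owns(k)}

    # 2. Serve remote reads: push local results to the other participants.
    channel.broadcast(txn["id"], local_reads)

    # 3. Collect remote read results until the full read set is available.
    all_reads = dict(local_reads)
    while len(all_reads) < len(txn["read_set"]):
        all_reads.update(channel.receive(txn["id"]))

    # 4. Execute to completion, applying only the writes this node owns.
    for key, value in txn["logic"](all_reads).items():
        if owns(key):
            local_store[key] = value


class LocalChannel:
    """Toy channel for a single-node demo; a real deployment uses RPC."""
    def broadcast(self, txn_id, reads): pass
    def receive(self, txn_id): return {}


store = {"/a": "old"}
txn = {"id": 1, "read_set": ["/a"], "write_set": ["/a"],
       "logic": lambda reads: {"/a": reads["/a"] + "+new"}}
execute_transaction(txn, store, owns=lambda k: True, channel=LocalChannel())
print(store)  # {'/a': 'old+new'}
```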
8. CalvinFS Architecture • Design Principles: main-memory metadata store; potentially many small files; scalable read/write throughput; tolerate slow writes; linearizable and snapshot reads; hash-partitioned metadata; optimize for single-file operations • Components: block store, Calvin database, client library
9. CalvinFS Block Store • Variable-size immutable blocks – 1 byte to 10 megabytes • Block storage and placement – Unique ID – Block "buckets" – Global Paxos-replicated config file – Compacts small blocks
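A minimal sketch of block naming and placement, assuming IDs come from a counter and are hashed into a fixed number of buckets whose server assignment would live in the Paxos-replicated config. All names here are hypothetical; the real block store also compacts small blocks.

```python
import itertools

NUM_BUCKETS = 16
# In CalvinFS this mapping is a small, Paxos-replicated config file.
bucket_to_servers = {b: [f"server-{b % 4}"] for b in range(NUM_BUCKETS)}
_next_id = itertools.count(1)


def store_block(data: bytes, storage: dict) -> int:
    """Assign a unique ID to an immutable block and place it in its bucket."""
    assert 1 <= len(data) <= 10 * 1024 * 1024, "blocks are 1 byte to 10 MB"
    block_id = next(_next_id)
    bucket = block_id % NUM_BUCKETS
    for server in bucket_to_servers[bucket]:
        storage.setdefault(server, {})[block_id] = data  # immutable: never rewritten
    return block_id


storage = {}
bid = store_block(b"hello, calvinfs", storage)
print(bid, bucket_to_servers[bid % NUM_BUCKETS])
```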
10. CalvinFS Metadata Management • Key-value store – Key: absolute path of file/directory – Value: entry type, permissions, contents
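A minimal sketch of a metadata entry keyed by absolute path, assuming a plain dict as the key-value store. The field names follow the slide (entry type, permissions, contents); the concrete encoding in CalvinFS differs, and the contents shown for a file (block references) are an assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MetadataEntry:
    entry_type: str   # "file" or "dir"
    permissions: int  # e.g. 0o644
    # For a file: (block_id, block_offset, num_bytes) triples making up its contents.
    # For a directory: names of its direct children.
    contents: List = field(default_factory=list)


metadata = {
    "/home": MetadataEntry("dir", 0o755, ["user"]),
    "/home/user": MetadataEntry("dir", 0o755, ["notes.txt"]),
    "/home/user/notes.txt": MetadataEntry("file", 0o644, [(42, 0, 1024)]),
}
print(metadata["/home/user/notes.txt"].entry_type)  # file
```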
11. Metadata Storage Layer • Six transaction types: – Read(path) – Create{File, Dir}(path) – Resize(path, size) – Write(path, file_offset, source, source_offset, num_bytes) – Delete(path) – Edit permissions(path, permissions)
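A minimal sketch of the six transaction types as methods over the path-keyed store sketched above. Only the signatures come from the slide; the bodies are illustrative assumptions (e.g. that Write records a reference to bytes in a source block rather than copying data).

```python
class MetadataStore:
    def __init__(self):
        self.entries = {}  # path -> {"type", "permissions", "contents"}

    def read(self, path):
        return self.entries[path]

    def create_file(self, path):
        self.entries[path] = {"type": "file", "permissions": 0o644, "contents": []}

    def create_dir(self, path):
        self.entries[path] = {"type": "dir", "permissions": 0o755, "contents": []}

    def resize(self, path, size):
        self.entries[path]["size"] = size

    def write(self, path, file_offset, source, source_offset, num_bytes):
        # Record which bytes of which source block back this region of the file.
        self.entries[path]["contents"].append(
            (file_offset, source, source_offset, num_bytes))

    def delete(self, path):
        del self.entries[path]

    def edit_permissions(self, path, permissions):
        self.entries[path]["permissions"] = permissions
```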
12. Recursive Operations on Directories • Use OLLP (Optimistic Lock Location Prediction) • Analyze phase – determines affected entries and the read/write set • Run phase – checks that the read/write set has not grown
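A minimal sketch of OLLP for a recursive operation such as a subtree permission change, assuming a hypothetical `list_subtree()` helper that enumerates paths under a directory. The analyze pass predicts the read/write set without locks; the run pass re-derives it and restarts if the set has grown.

```python
def recursive_edit_permissions(store, root, permissions, list_subtree):
    while True:
        # Analyze phase: predict the read/write set (reconnaissance, no locks held).
        predicted = set(list_subtree(root))

        # ...the transaction would now be appended to the log with this set...

        # Run phase: re-derive the set under locks and check it has not grown.
        actual = set(list_subtree(root))
        if not actual <= predicted:
            continue  # new entries appeared; restart with the larger set

        for path in actual:
            store[path]["permissions"] = permissions
        return


store = {"/d": {"permissions": 0o755}, "/d/f": {"permissions": 0o644}}
recursive_edit_permissions(
    store, "/d", 0o700,
    list_subtree=lambda r: [p for p in store if p == r or p.startswith(r + "/")])
print(store)
```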
13. Performance: File Counts and Memory Usage • 10 million files of varying size per machine • Far less memory used per machine • Handles many more files than HDFS
14. Performance: Throughput • [Plots: two workloads scale linearly, one scales sub-linearly]
15. Performance: Latency • Write/append latency dominated by WAN replication
16. Performance: Fault Tolerance • Able to tolerate outages with little to no hit to availability
17. Discussion • Pros: fast metadata management; deployments scale well on large clusters; huge storage capabilities; high throughput of reads and updates; resistant to datacenter outages • Cons: file creation is a distributed transaction and doesn't scale; metadata operations have to recursively modify all entries in the affected subtree; file fragmentation is addressed with a mechanism that entirely rewrites files
18. Discussion Questions • Unlimited number of files? • What about larger files?