File Systems 2 and MMAP

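This deck covers file systems and the MMAP system call. It asks what a "real" file system should provide, covering the UNIX file system, file attribute metadata, and data storage. It then looks at real file systems and related concepts such as Ext2, file links, and the NTFS on-disk format to deepen the understanding of basic file system concepts, and finally leads into how the mmap system call works.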

1. CS162 Operating Systems and Systems Programming
Lecture 19: File Systems (Cont'd), MMAP
November 1st, 2017
Prof. Ion Stoica
http://cs162.eecs.Berkeley.edu

2. So What About a “Real” File System?
• Meet the inode
[Figure: the inode array is indexed by file_number; each inode holds file metadata, direct pointers, an indirect pointer, a double indirect pointer, and a triple indirect pointer, which lead through indirect blocks to the data blocks]
11/1/17 CS162 © UCB Fall 2017 Lec 19.2

3. An “Almost Real” File System
• Pintos: src/filesys/file.c, inode.c
[Figure: inode array indexed by file_number; each inode holds file metadata, direct pointers, and indirect, double indirect, and triple indirect pointers leading to data blocks]

4. Unix File System (1/2)
• Original inode format appeared in BSD 4.1
– Berkeley Software Distribution Unix
– Part of your heritage!
– Similar structure for Linux Ext2/3
• File number is index into inode array
• Multi-level index structure
– Great for little and large files
– Asymmetric tree with fixed-size blocks

5. Unix File System (2/2)
• Metadata associated with the file
– Rather than in the directory that points to it
• UNIX Fast File System (FFS), BSD 4.2, locality heuristics:
– Block group placement
– Reserve space
• Scalable directory structure

6. File Attributes
• inode metadata
– User
– Group
– 9 basic access control bits: UGO x RWX
– Setuid bit: execute at owner permissions rather than user
– Setgid bit: execute at group’s permissions
[Figure: inode diagram; the attributes above are stored in the inode's file metadata field]
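The attribute bits listed on this slide can be inspected from user space. The following is a minimal sketch, assuming a POSIX-style `st_mode` as exposed by Python's `os.stat`; the helper name `describe_mode` is hypothetical and only for illustration.

```python
import os
import stat

# The nine basic access-control bits, in user/group/other x read/write/execute order.
PERMISSION_BITS = [
    stat.S_IRUSR, stat.S_IWUSR, stat.S_IXUSR,
    stat.S_IRGRP, stat.S_IWGRP, stat.S_IXGRP,
    stat.S_IROTH, stat.S_IWOTH, stat.S_IXOTH,
]


def describe_mode(mode):
    """Decode the permission-related bits of an inode's mode field.

    Hypothetical helper: returns the UGO x RWX bits as a 9-character string
    plus the setuid and setgid flags.
    """
    rwx = "".join(
        ch if mode & bit else "-"
        for bit, ch in zip(PERMISSION_BITS, "rwxrwxrwx")
    )
    return {
        "ugo_rwx": rwx,
        "setuid": bool(mode & stat.S_ISUID),
        "setgid": bool(mode & stat.S_ISGID),
    }


if __name__ == "__main__":
    info = os.stat(".")  # owner, group, and mode come from the inode
    print(info.st_uid, info.st_gid, describe_mode(info.st_mode))
```

Passing a literal mode such as `0o4755` shows a setuid executable with `rwxr-xr-x` permissions without touching the disk.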

7. Data Storage
• Small files: 12 pointers direct to data blocks
– Direct pointers
– 4 KB blocks => sufficient for files up to 48 KB
[Figure: inode diagram; the direct pointers lead straight to data blocks]

8. Data Storage
• Large files: 1-, 2-, and 3-level indirect pointers
– Indirect pointers point to a disk block containing only pointers
– 4 KB blocks => 1024 pointers per block (4-byte pointers)
=> 4 MB @ level 2 (single indirect)
=> 4 GB @ level 3 (double indirect)
=> 4 TB @ level 4 (triple indirect)
[Figure: inode diagram annotated with capacity: 48 KB through the direct pointers, +4 MB through the indirect pointer, +4 GB through the double indirect pointer, +4 TB through the triple indirect pointer]
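The capacity figures above follow from simple arithmetic. The sketch below reproduces them and shows how a logical block index could be mapped to a pointer path. It assumes 4-byte block pointers (implied by 1024 pointers in a 4 KB block); the function names are hypothetical.

```python
BLOCK_SIZE = 4096                             # 4 KB blocks, as on the slide
POINTER_SIZE = 4                              # assumed: 4-byte block pointers
NUM_DIRECT = 12                               # direct pointers in the inode
PTRS_PER_BLOCK = BLOCK_SIZE // POINTER_SIZE   # 1024 pointers per indirect block


def capacity_by_level():
    """Bytes reachable via direct, single, double, and triple indirect pointers."""
    blocks = [NUM_DIRECT, PTRS_PER_BLOCK, PTRS_PER_BLOCK ** 2, PTRS_PER_BLOCK ** 3]
    return [n * BLOCK_SIZE for n in blocks]


def locate(block_index):
    """Map a logical block index to (pointer kind, index path).

    The path lists the slot to follow at each level, starting at the inode's
    direct array or at the first indirect block.
    """
    if block_index < 0:
        raise ValueError("block index must be non-negative")
    if block_index < NUM_DIRECT:
        return ("direct", [block_index])
    block_index -= NUM_DIRECT
    kinds = ("indirect", "double indirect", "triple indirect")
    for depth, kind in enumerate(kinds, start=1):
        span = PTRS_PER_BLOCK ** depth        # data blocks covered by this pointer
        if block_index < span:
            path = []
            for _ in range(depth):
                path.append(block_index % PTRS_PER_BLOCK)
                block_index //= PTRS_PER_BLOCK
            return (kind, path[::-1])
        block_index -= span
    raise ValueError("block index beyond the triple indirect range")
```

For example, logical block 12 is the first block reached through the single indirect pointer, which is why files just over 48 KB pay for one extra pointer block.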

9. UNIX BSD 4.2 (1984) (1/2)
• Same as BSD 4.1 (same file header and triply indirect blocks), except incorporated ideas from Cray Operating System:
– Uses bitmap allocation in place of freelist
– Attempt to allocate files contiguously
– 10% reserved disk space
– Skip-sector positioning (mentioned later)

10. UNIX BSD 4.2 (1984) (2/2)
• Problem: when creating a file, you don’t know how big it will become (in UNIX, most writes are by appending)
– How much contiguous space do you allocate for a file?
– In BSD 4.2, just find some range of free blocks
» Put each new file at the front of a different range
» To expand a file, first try successive blocks in bitmap, then choose new range of blocks
– Also in BSD 4.2: store files from same directory near each other
• Fast File System (FFS)
– Allocation and placement policies for BSD 4.2

11. Attack of the Rotational Delay
• Problem 2: Missing blocks due to rotational delay
– Issue: Read one block, do processing, and read next block. In meantime, disk has continued turning: missed next block! Need 1 revolution/block!
[Figure: disk track illustrating skip-sector placement and a track buffer that holds a complete track]
– Solution 1: Skip-sector positioning (“interleaving”)
» Place the blocks from one file on every other block of a track: give time for processing to overlap rotation
» Can be done by OS or in modern drives by the disk controller
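Skip-sector positioning can be illustrated by computing which logical block lands in which physical sector of a track. This is a toy sketch, not how any particular drive or OS implements it; `interleaved_layout` and its `skip` parameter are hypothetical names.

```python
def interleaved_layout(num_sectors, skip=1):
    """Return a list where layout[physical_sector] is the logical block stored there.

    Consecutive logical blocks are placed `skip` sectors apart, so the disk
    rotates past `skip` sectors while the previous block is being processed.
    """
    layout = [None] * num_sectors
    pos = 0
    for logical in range(num_sectors):
        # If the target sector is already used, slide forward to the next free one.
        while layout[pos] is not None:
            pos = (pos + 1) % num_sectors
        layout[pos] = logical
        pos = (pos + skip + 1) % num_sectors
    return layout


if __name__ == "__main__":
    # With skip=1, a file's blocks occupy every other sector of the track.
    print(interleaved_layout(8, skip=1))  # [0, 4, 1, 5, 2, 6, 3, 7]
```

With `skip=1`, reading the whole 8-sector track in logical order takes two revolutions instead of eight, assuming processing one block takes less than one sector time.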

12. Attack of the Rotational Delay
• Problem 2: Missing blocks due to rotational delay
– Issue: Read one block, do processing, and read next block. In meantime, disk has continued turning: missed next block! Need 1 revolution/block!
[Figure: disk track illustrating skip-sector placement and a track buffer that holds a complete track]
– Solution 2: Read ahead: read next block right after first, even if application hasn’t asked for it yet
» This can be done either by OS (read ahead)
» Or by disk itself (track buffers): many disk controllers have internal RAM that allows them to read a complete track
• Note: Modern disks + controllers do many things “under the covers”
– Track buffers, elevator algorithms, bad block filtering

13. Where are inodes Stored?
• In early UNIX and DOS/Windows’ FAT file system, headers stored in special array in outermost cylinders
• Header not stored anywhere near the data blocks
– To read a small file, seek to get header, seek back to data
• Fixed size, set when disk is formatted
– At formatting time, a fixed number of inodes are created
– Each is given a unique number, called an “inumber”

14. Where are inodes Stored?
• Later versions of UNIX moved the header information to be closer to the data blocks
– Often, inode for file stored in same “cylinder group” as parent directory of the file (makes an ls of that directory run fast)
• Pros:
– UNIX BSD 4.2 puts bit of file header array on many cylinders
– For small directories, can fit all data, file headers, etc. in same cylinder => no seeks!
– File headers much smaller than whole block (a few hundred bytes), so multiple headers fetched from disk at same time
– Reliability: whatever happens to the disk, you can find many of the files (even if directories disconnected)
• Part of the Fast File System (FFS)
– General optimization to avoid seeks

15. 4.2 BSD Locality: Block Groups
• File system volume is divided into a set of block groups
– Close set of tracks
• Data blocks, metadata, and free space interleaved within block group
– Avoid huge seeks between user data and system structure
• Put directory and its files in common block group

16. 4.2 BSD Locality: Block Groups
• First-free allocation of new file blocks
– To expand file, first try successive blocks in bitmap, then choose new range of blocks
– Few little holes at start, big sequential runs at end of group
– Avoids fragmentation
– Sequential layout for big files
• Important: keep 10% or more free!
– Reserve space in the block group

17. UNIX 4.2 BSD FFS First Fit Block Allocation
• Fills in the small holes at the start of block group
• Avoids fragmentation, leaves contiguous free space at end
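The first-fit policy can be sketched over a free-block bitmap for a single block group. This is a simplified illustration under the assumption that the bitmap is a plain list of booleans (`True` means in use); the names `first_free` and `extend_file` are hypothetical, and real FFS adds cylinder-group selection and the reserved-space check.

```python
def first_free(bitmap, start=0):
    """Return the index of the first free block at or after `start`, or None."""
    for i in range(start, len(bitmap)):
        if not bitmap[i]:
            return i
    return None


def extend_file(bitmap, file_blocks):
    """Allocate one more block for a file and record it in `file_blocks`.

    First try the block right after the file's last block (keeps the file
    sequential); otherwise fall back to the first free block in the group,
    which fills small holes near the start.
    """
    choice = None
    if file_blocks:
        successor = file_blocks[-1] + 1
        if successor < len(bitmap) and not bitmap[successor]:
            choice = successor
    if choice is None:
        choice = first_free(bitmap)
    if choice is None:
        raise RuntimeError("block group is full")
    bitmap[choice] = True
    file_blocks.append(choice)
    return choice


if __name__ == "__main__":
    group = [True, False, True, True, False, False, False, False]
    blocks = []
    for _ in range(3):
        extend_file(group, blocks)
    print(blocks)  # the hole at block 1 is filled, then the file continues at 4, 5
```

In this toy run the file first fills the one-block hole, then moves to the contiguous free space at the end of the group, which is the behavior the slide describes.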

18. UNIX 4.2 BSD FFS
• Pros
– Efficient storage for both small and large files
– Locality for both small and large files
– Locality for metadata and data
– No defragmentation necessary!
• Cons
– Inefficient for tiny files (a 1-byte file requires both an inode and a data block)
– Inefficient encoding when file is mostly contiguous on disk
– Need to reserve 10-20% of free space to prevent fragmentation

19. BREAK

20. Linux Example: Ext2/3 Disk Layout
• Disk divided into block groups
– Provides locality
– Each group has two block-sized bitmaps (free blocks/inodes)
– Block sizes settable at format time: 1K, 2K, 4K, 8K…
• Actual inode structure similar to 4.2 BSD
– with 12 direct pointers
• Ext3: Ext2 with journaling
– Several degrees of protection with comparable overhead
• Example: create a file1.dat under /dir1/ in Ext3
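Because each block group's free-block bitmap occupies exactly one block, the block size bounds the size of a group. The sketch below works through that arithmetic and the usual mapping from an inode number to its block group. It assumes inode numbers start at 1 and that the inodes-per-group count is read from the superblock; the value 8192 used in the example is only an illustrative assumption.

```python
def blocks_per_group(block_size):
    """A one-block bitmap has 8 bits per byte, so it can track 8 * block_size blocks."""
    return 8 * block_size


def group_span_bytes(block_size):
    """Maximum amount of disk covered by one block group."""
    return blocks_per_group(block_size) * block_size


def inode_location(inode_number, inodes_per_group):
    """Return (block group, index within that group's inode table).

    Assumes 1-based inode numbers; `inodes_per_group` is a format-time value.
    """
    if inode_number < 1:
        raise ValueError("inode numbers start at 1")
    return divmod(inode_number - 1, inodes_per_group)


if __name__ == "__main__":
    for size in (1024, 2048, 4096):
        print(size, blocks_per_group(size), group_span_bytes(size) // 2**20, "MiB")
    print(inode_location(8193, inodes_per_group=8192))
```

With 4 KB blocks a group spans at most 128 MiB, which keeps an inode, its bitmaps, and its data blocks within a short seek of each other.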

21. A bit more on directories
• Stored in files, can be read, but typically aren’t
– System calls to access directories
– open / creat traverse the structure
– mkdir / rmdir add/remove entries
– link / unlink (rm)
» Link existing file to a directory (not in FAT!)
» Forms a DAG
• When can file be deleted?
– Maintain ref-count of links to the file
– Delete after the last reference is gone
• libc support
– DIR * opendir (const char *dirname)
– struct dirent * readdir (DIR *dirstream)
– int readdir_r (DIR *dirstream, struct dirent *entry, struct dirent **result)
[Figure: directory DAG with /usr, /usr/lib, and /usr/lib4.3, where the same file is reachable as both /usr/lib/foo and /usr/lib4.3/foo]
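A directory is a mapping from names to file numbers, and iterating it is what an opendir/readdir loop does. As a rough Python analogue (a sketch using `os.scandir` rather than the libc calls named on the slide; `list_directory` is a hypothetical helper):

```python
import os
import tempfile


def list_directory(path):
    """Return {entry name: inode number} for a directory.

    Note: unlike readdir, os.scandir does not report the "." and ".." entries.
    """
    with os.scandir(path) as entries:
        return {entry.name: entry.inode() for entry in entries}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        os.mkdir(os.path.join(workdir, "lib"))            # mkdir adds an entry
        with open(os.path.join(workdir, "foo"), "w") as f:
            f.write("hello")                              # creating a file adds an entry
        print(list_directory(workdir))
        os.rmdir(os.path.join(workdir, "lib"))            # rmdir removes an entry
        print(list_directory(workdir))
```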

22. Links
• Hard link
– Sets another directory entry to contain the file number for the file
– Creates another name (path) for the file
– Each is “first class”
• Soft link, symbolic link, or shortcut
– Directory entry contains the path and name of the file
– Maps one name to another name
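The difference between the two kinds of link shows up once the original name is removed. The following sketch assumes a POSIX file system that supports both hard links and symbolic links; the file names and the `link_demo` helper are hypothetical.

```python
import os
import tempfile


def link_demo(workdir):
    """Create a file, a hard link, and a symbolic link, then unlink the original."""
    target = os.path.join(workdir, "foo")
    hard = os.path.join(workdir, "foo_hard")
    soft = os.path.join(workdir, "foo_soft")
    with open(target, "w") as f:
        f.write("data")

    os.link(target, hard)      # second directory entry with the same file number
    os.symlink(target, soft)   # entry that stores the path of the target

    result = {
        "same_inode": os.stat(target).st_ino == os.stat(hard).st_ino,
        "link_count": os.stat(target).st_nlink,   # ref-count of hard links
        "soft_is_symlink": os.path.islink(soft),
        "soft_points_to": os.readlink(soft),
    }

    os.unlink(target)          # drops one reference; the data survives via `hard`
    with open(hard) as f:
        result["hard_still_readable"] = f.read() == "data"
    # os.path.exists follows the symlink, so a dangling link reports False.
    result["soft_dangling"] = not os.path.exists(soft)
    return result


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(link_demo(d))
```

Both hard-link names are equally "first class": the file's blocks are freed only when the last of them is unlinked, whereas the symbolic link simply stops resolving.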

23. Large Directories: B-Trees (dirhash)
in FreeBSD, NetBSD, OpenBSD
[Figure: B+tree lookup. The search for hash(“out2”) = 0x0000c194 descends from the B+tree root (keys 00ad1102, b0bf8201, ..., cff1a412, each with a child pointer) through a B+tree node (keys 0000c195, 00018201, ...) to a B+tree leaf (hashes 0000a0d1, 0000b971, ..., 0000c194, ...), whose entry pointer leads to the directory entry.]
Directory entries (name → file number): . → 36210429, .. → 983211, file1 → 239341, file2 → 231121, ..., file9841 → 243212, out1 → 841013, out2 → 841014, ..., out16341 → 324114
• “out2” is file 841014
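The idea behind the lookup is to search an index ordered by name hash instead of scanning every entry. The sketch below stands in a sorted list plus binary search for the B+tree, and uses `zlib.crc32` as a placeholder hash; both are assumptions for illustration and not the on-disk format or hash function of any BSD system.

```python
import bisect
import zlib


def name_hash(name):
    """Placeholder hash (CRC-32); the real dirhash function is not shown in the slides."""
    return zlib.crc32(name.encode("utf-8"))


class HashedDirectory:
    """Directory index ordered by name hash, searched in O(log n)."""

    def __init__(self, entries):
        # Sorted (hash, name, file number) triples play the role of B+tree leaves.
        self._index = sorted((name_hash(n), n, num) for n, num in entries.items())
        self._hashes = [h for h, _, _ in self._index]

    def lookup(self, name):
        """Return the file number for `name`, or None if it is absent."""
        h = name_hash(name)
        i = bisect.bisect_left(self._hashes, h)
        # Different names can share a hash, so compare names on a hash match.
        while i < len(self._index) and self._index[i][0] == h:
            if self._index[i][1] == name:
                return self._index[i][2]
            i += 1
        return None


if __name__ == "__main__":
    directory = HashedDirectory({"file1": 239341, "out1": 841013, "out2": 841014})
    print(directory.lookup("out2"))  # 841014
```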

24. NTFS
• New Technology File System (NTFS)
– Default on Microsoft Windows systems
• Variable-length extents
– Rather than fixed blocks
• Everything (almost) is a sequence of <attribute:value> pairs
– Metadata and data
• Mix direct and indirect freely
• Directories organized in B-tree structure by default

25. NTFS
• Master File Table
– Database with flexible 1 KB entries for metadata/data
– Variable-sized attribute records (data or metadata)
– Extend with variable-depth tree (non-resident)
• Extents: variable-length contiguous regions
– Block pointers cover runs of blocks
– Similar approach in Linux (ext4)
– File create can provide hint as to size of file
• Journaling for reliability
– Discussed later
http://ntfs.com/ntfs-mft.htm
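With extents, one (start, length) pair covers a whole run of blocks, so translating a file offset means walking a short list rather than following per-block pointers. A minimal sketch, assuming extents are stored in file order as `(start_block, length)` tuples; the function name is hypothetical and this is not the NTFS on-disk encoding.

```python
def physical_block(extents, logical_block):
    """Translate a file-relative block index to a disk block number.

    `extents` is a list of (start_block, length) runs in file order.
    """
    if logical_block < 0:
        raise IndexError("logical block must be non-negative")
    remaining = logical_block
    for start, length in extents:
        if remaining < length:
            return start + remaining
        remaining -= length
    raise IndexError("logical block beyond end of file")


if __name__ == "__main__":
    # A 6-block file stored as two contiguous runs.
    extents = [(100, 4), (500, 2)]
    print([physical_block(extents, i) for i in range(6)])
    # [100, 101, 102, 103, 500, 501]
```

A mostly contiguous file needs only a handful of such pairs, which addresses the FFS drawback of encoding every block pointer individually.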

26. NTFS Small File
[Figure: Master File Table with one MFT record for a small file. The record holds Std. Info. (create time, modify time, access time, owner id, security specifier, flags: RO, hidden, sys), File Name, a resident Data attribute that contains the file's data, and free space. The figure also labels the data attribute and the attribute list.]

27. NTFS Medium File
[Figure: Master File Table with one MFT record for a medium file. The record holds Std. Info., File Name, a nonresident Data attribute, and free space. The Data attribute lists Start + Length pairs, each pointing to a data extent.]

28. NTFS Multiple Indirect Blocks

29. [Figure: Master File Table with an MFT record for a huge or badly fragmented file. The record holds Std. Info. and a nonresident Attr. List. The attribute list is spread over several extents, each holding part of the list, and each part refers to several nonresident Data attributes.]