- 快召唤伙伴们来围观吧
- 微博 QQ QQ空间 贴吧
- 文档嵌入链接
- 复制
- 微信扫一扫分享
- 已成功复制到剪贴板
计算机存储规则
展开查看详情
1 . The Five-Minute Rule Ten Years Later, and Other Computer Storage Rules of Thumb Jim Gray, Goetz Graefe Microsoft Research, 301 Howard St. #830, SF, CA 94105 {Gray, GoetzG}@Microsoft.com Abstract: future technology ratios would move the break-even Simple economic and performance arguments sug- point to five minutes. gest appropriate lifetimes for main memory pages and suggest optimal page sizes. The fundamental The five-minute rule is based on the tradeoff between tradeoffs are the prices and bandwidths of RAMs and the cost of RAM and the cost of disk accesses. The disks. The analysis indicates that with today's tech- tradeoff is that caching pages in the extra memory nology, five minutes is a good lifetime for randomly can save disk IOs. The break-even point is met when accessed pages, one minute is a good lifetime for the rent on the extra memory for cache ($/page/sec) two-pass sequentially accessed pages, and 16 KB is a exactly matches the savings in disk accesses per sec- good size for index pages. These rules-of-thumb ond ($/disk_access/sec). The break even time is change in predictable ways as technology ratios computed as: BreakEvenReferenceInterval (seconds) = change. They also motivate the importance of the new Kaps, Maps, Scans, and $/Kaps, $/Maps, PagesPerMBofRAM x PricePerDiskDrive (1) $/TBscan metrics. AccessPerSecondPerDisk PricePerMBofDRAM 1. The Five-Minute Rule Ten Years Later The disk price includes the cost of the cabinets and controllers (typically 30% extra.) The equations in All aspects of storage performance are improving, [1] were more complex because they did not realize but different aspects are improving at different rates. that you could factor out the depreciation period. The charts in Figure 1 roughly characterize the per- formance improvements of disk systems over time. The price and performance from a recent DELL The caption describes each chart. TPC-C benchmark [2] gives the following parameters for Equation 1: In 1986, randomly accessed pages obeyed the five- PagesPerMBofRAM = 128 pages/MB (8KB pages) minute rule [1]: pages referenced every five minutes AccessesPerSecondPerDisk = 64 access/sec/disk should have been kept in memory rather than reading PricePerDiskDrive = 2000 $/disk (9GB + controller) them from disk each time. Actually, the break-even PricePerMBofDRAM = 15 $/MB_DRAM point was 100 seconds but the rule anticipated that Disk Performance vs Time Disk accesses/second Storage Price vs Time vs Time Megabytes per kilo-dollar 100 10. 100 10,000. 1,000. bandwidth: MB/s seeks per second Capacity (GB) 100. MB/k$ 10 1. 10 10. Accesses per Second 1. 1 0.1 1 0.1 1980 1990 2000 1980 1990 2000 1980 1990 2000 Year Year Year Figure 1: Performance of magnetic storage disks over time. The first two graphs show that accesses and access times improved 10x or 100x while capacity grew 100x. The third graph shows that prices improved 1,000x in the same time. We have compensated for the changing ratios among accesses, capacity, and cost by using larger RAM buffers and larger pages. That is one theme of this paper. ACM COPYRIGHT NOTICE. Copyright © 2000 by the Association for Computing Machinery, Inc. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercia l advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy othe rwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Publications Dept, ACM Inc., fax +1 (212) 869-0481, or permissions@acm .org. For definitive copy see: http://www.acm.org/pubs/citations/proceedings/mod/191839/p243-gray/. This copy is posted by permission of ACM and may not be redistributed.
2 .Evaluating Equation 1 with these values gives a ref- 1.2. Sequential Data Access: the One-Minute erence interval of 266 seconds -- about five minutes1 . Sequential Rule So, even in 1997, data referenced every five minutes The discussion so far has focused on random access should be kept in main memory. to small (8KB) pages. Sequential access to large pages has different behavior. Modern disks can Prices for the same equipment vary enormously, but transfer data at 10 MBps if accessed sequentially all the categories we have examined follow some- (Figure 1a). That is a peak value, the analysis here thing like a five-minute rule. Server hardware prices uses a more realistic 5 MB/s as a disk sequential data are often three times higher than "street prices" for rate. Disk bandwidth drops 10x (to 0.5 MBps) if the the same components. DEC Polaris RAM is half the application fetches random 8KB pages from disk. price of DELL. Recent TPC-C Compaq reports have So, it should not be surprising that sequential IO op- 3x higher RAM prices (47$/MB) and 1.5x higher erations like sort, cube, and join, have different disk prices (3129$/drive) giving a two-minute rule. RAM/disk tradeoffs. As shown below, they follow a The March 1997 SUN+Oracle TPC-C benchmark [3] one-minute-sequential rule. had prices even better than DELL (13$/MB of RAM and 1690$ per 4GB disk and controllers). These sys- If a sequential operation reads data and never refer- tems all are near the five-minute rule. Mainframes ences it, then there is no need to cache the data in are at 130$/MB for RAM, 10K$/MIPS, and RAM. In such one-pass algorithms, the system needs 12k$/disk. Thus, mainframes follow a three-minute only enough buffer memory to allow data to stream rule. from disk to main memory. Typically, two or three one-track buffers (~100 KB) are adequate. For one- One can think of the first ratio of Equation 1 (Pages- pass sequential operations, less than a megabyte of PerMBofRAM/AccessesPerSecondPerDisk) as a RAM per disk is needed to buffer disk operations and technology ratio. The second ratio of Equation 1 allow the device to stream data to the application. (PriceofDiskDrive/PriceOfMBofRAM) is an eco- nomic ratio. Looking at the trend lines in Figure 1, Many sequential operations read a large data-set and the technology ratio is shifting. Page size has in - then revisit parts of the data. Database join, cube, creased with accesses/second so the technology ratio rollup, and sort operators all behave in this way. Con- has decreased ten fold (from 512/30 = 17 to 128/64 = sider the dis k access behavior of Sort in particular. 2). Disk drive prices dropped 10x and RAM prices Sort uses sequential data access and large disk trans- dropped 200x, so that the economic ratio has ni - fers to optimize disk utilization and bandwidth. Sort creased ten fold (20k$/2k$=10 to 2k$/15$=133). The ingests the input file, reorganizes the records in consequent reference interval of equation (1) went sorted order, and then sequentially writes the output from 170 seconds (17x10) to 266 seconds (2x133). file. If the sort cannot fit the file in main memory, it produces sorted runs in a first pass and then merges These calculations indicate that the reference in- these runs into a sorted file in the second pass. terval of Equation (1) is almost unchanged, despite Hash-join has a similar one-pass two-pass behavior. these 10x, 100x, and 1,000x changes. It is still in the 1-minute to 10-minute range. The 5-minute The memory demand of a two pass sort is approxi- rule still applies to randomly accessed pages. mately given in equation 2: MemoryForTwoPassSort The original paper [1] also described the 10-byte rule ? 6 ? Buffer _ Size ? 3 ? Buffer _ Size ? File _ Size.....( 2 ) for trading CPU instructions off against DRAM. At the time one instruction cost the same as 10 bytes. Equation 2 is derived as follows. The first sort pass Today, PCs follow a 1-byte rule, mini-computers produces about File_Size/Memory_Size runs while follow a 10 byte rule, while mainframes follow a the second pass can merge Memory_Size/Buffer_Size kilobyte rule because the processors are so over- runs. Equating these two values and solving for priced. memory size gives the square root term. The con- stants (3 and 6) depend on the particular sort algo- rithm. Equation 2 is graphed in Figure 2 for file sizes from megabytes to exabytes. 1 Sort shows a clear tradeoff of memory and disk IO. The current 2 KB page-size of Microsoft SQL Server 6.5 A one-pass sort uses half the disk IO but much more gives a reference interval of 20 minutes . MS SQL is memory. When is it appropriate to use a one-pass moving to an 8 KB page size in the 1998 release.
3 .sort? This is just an application of Equation 1 to Similar comments apply to other sequential opera- compute the break-even reference interval. Use the tions (group by, rollup, cube, hash join, index build, DEC TPC-C prices [2] and components in the previ- etc…). In general, sequential operations should ous section. If sort uses to 64KB transfers then there use high-bandwidth disk transfers and they are 16 pages/MB and it gets 80 accesses per second should cache data that they will revisit the data (about 5 MB/s). within a minute. PagesPerMBofRAM = 16 pages/MB AccessesPerSecondPerDisk = 80 access/sec/disk In the limit, for large transfers, sequential access cost degenerates to the cost of the bandwidth. The tech- Using these parameters, Equation 1 yields a break- nology ratio of equation 1 becomes the reciprocal of even reference interval of 26 seconds (= (16/80) x the bandwidth (in megabytes): (2,000/15)). Actually, sort would have to write and TechnologyRatio then read the pages, so that doubles the IO cost and = (PagesPerMB)/(AccessesPerSecond) moves the balance point to 52 seconds. Anticipating = (1E6/TransferSize)/ higher bandwidths and less expensive RAM, we pre- ( DiskBandwidth/TransferSize) dict that this value will slowly grow over time. for purely sequential access = 1E6/DiskBandwidth. (3) Consequently, we recommend the one-minute- This is an interesting result. It gives rise to the as- sequential rule: hash joins, sorts, cubes, and other ymptote in Figure 3 that shows the reference interval sequential operations should use main memory to vs. page size. With current disk technology, the ref- cache data if the algorithm will revisit the data within erence interval asymptotically approaches 40 seconds a minute. as the page size grows. For example, a one-pass sort is known to run at about 5 GB/minute [4]. Such sorts use many disks and lots Reference Inverval vs Page Size of RAM but they use only half the IO bandwidth of a (DRAM vs disk or tape storage) 100,000 two-pass sort (they pass over the data only once). Applying the one-minute-sequential rule, below 5 GB 10,000 a one-pass sort is warranted. Beyond that size, a Tape two-pass sort is warranted. With 5GB of RAM a two- 1,000 robot Reference Interval pass sort can sort 100 terabytes. This covers ALL 100 disk current sorting needs. (minutes) 5 min 10 rule 1 DRAM Needed for a Two-pass Sort 1 min asymtpote Gigabytes Can Sort PetaBytes 0 1.E+12 1E+0 1E+3 1E+6 1E+9 1E+12 Page Size (bytes) DRAM size needed Figure 3: The break-even reference interval for disk 1.E+09 vs. DRAM asymptotically approaches something like one minute for current technology. The asymptote is the product of the technology ratio (which becomes 1e6/bandwidth) and the economic ratio. A later sec- tion discuses the disk-tape tradeoff. Fundamentally, 1.E+06 1.E+06 1.E+09 1.E+12 1.E+15 1.E+18 tape technology is VERY expensive to access. This Size of input file (bytes) encourages very large tape page sizes and very cold Figure 2: A two-pass sort can process 100 terabyte data on tape. The tape asymptote is approached at 10 files with a 5 GB DRAM buffer. The two pass sort GB (tape hardware is described in Table 4). balances the run length against the number of runs to merge in the second pass. If it generates a thou- 1.4. RAID and Tape sand runs of 100 MB each, it can merge them using RAID 0 (striping) spreads IO among disks and so 100 MB of merge buffers in phase 2. This is a 100 makes the transfer size smaller. Otherwise, RAID 0 GB sort. With current technology, use a 1-pass does not perturb this analysis. RAID 1 (mirroring) sort up to 5GB files. For larger files, do a 2-pass slightly decreases the cost of reads and nearly dou-
4 .bles the cost of writes. RAID 5 increases the cost of when (1) there is contention for cache space, or (2) writes by up to a factor of 4. In addition RAID5 con- the page must be checkpointed because the page has trollers usually carry a price premium. All these been dirty for a long time. The checkpoint interval is factors tend to increase the economic ratio (making typically five minutes. Checkpoint limits recovery to disks more expensive, and raise the technology ratio redoing the last five or ten minutes of the log. (lower accesses per second). Overall they tend to increase the random access reference interval by a Hot-standby and remote-disaster-recovery systems factor of 2x to 5x. reduce the need for checkpoints because they con- tinuously run recovery on their version of the data- Tape technology has moved quickly to improve ca- base and can take over within seconds. In these dis- pacity. Today the Quantum DLTstor™ is typical of aster-tolerant systems, checkpoints can be very infre- high performance robots. Table 4 presents the per- quent and almost all flushes are contention flushes. formance of this device. To implement the N-minute rule for contention Table 4: Tape robot price and performance char- flushes and evictions, the buffer manager keeps a list acteristics (source Quantum DLTstor™). of the names of all pages touched within the last N Quantum DLT Tape Robot 9,000$ price minutes. When a page is re-read from disk, if it is in Tape capacity 35 GB the N-minute list, it is given an N-minute lifetime (it Number of tapes 14 will not be evicted for N-minutes in the future). This Robot Capacity 490 GB simple algorithm assures that frequently accessed pages are kept in the pool, while pages that are not Mount time (rewind, un- 30 seconds re-referenced are aggressively evicted. mount, put, pick, mount, posi- tion) Transfer rate 5 MBps 1.6. Five-Minute Summary In summary, the five-minute rule still seems to apply to randomly accessed pages, primarily because page Accessing a random data record on a tape requires sizes have grown from 1KB to 8KB to compensate mounting it, moving to the right spot and then read- for changing technology ratios. For large (64KB ing the tape. If the next access is on another tape and pages) and two-pass sequential access, a one-minute so one must rewind the current tape, put it away, pick rule applies today. the next one, scan to the correct position, and then read. This can take several minutes, but the specifi- cations above charitably assumed it takes 30 seconds 2.How Large Should Index Pages Be? on average. The size of an internal index page determines its re- trieval cost and fan-out (EntriesPerPage). A B-tree When should you store data on tape rather than in indexing N items will have a height (in pages) of: RAM? Using Equation 1, the break-even refe rence Indexheight ~ log2(N)/log2(EntriesPerPage) pages (4). interval for a 8KB tape block is about two months (keep the page in RAM rather than tape if you will The utility of an index page measures how much revisit the page within 2 months). closer the index page brings an associative search to the destination data record. It tells how many levels Another alternative is keeping the data on disk. What of the binary-tree fit on a page. The utility is the divi- is the tradeoff of keeping data on disk rather than on sor of the Equation 4: tape? The tradeoff is that tape-space rent is 10x less IndexPageUtility = log2 (EntriesPerPage) (5) expensive but tape accesses are much more expensive (100,000x more for small accesses and 5x more for For example, if each index entry is 20 bytes, then a 2 large (1GB) accesses). The reference interval bal- KB index page that is 70% full will contain about 70 ances the lower tape rent against the higher access entries. Such a page will have a utility of 6.2, about cost. The resulting curve is plotted in Figure 3. half the utility of a 128 KB index page (see Table 6). Reading each index page costs a logical disk access 1.5. Checkpoint Strategies In Light of but each page brings us IndexPageUtility steps closer the 5-minute Rule to the answer. This cost-benefit tradeoff gives rise to Buffer managers typically use an LRU or Clock2 an optimal page size that balances the IndexPageAc- (two round clock) algorithm to manage the buffer cessCost and the IndexPageUtility of each IO. pool. In general, they flush (write to disk) pages
5 . The utility rises as the log and high fixed disk read costs. Very large pages also An have low benefit because utility grows only as the log index of the page size. The cost Utility page of the access goes up line- of the page size, but transfer cost grows linearly with arly with page sizeConse- page size. quently, for a particular Table 6 and Figure 7 indicate that for current devices, disk latency and transfer index page sizes in the range of 8 KB to 32 KB are rate, there is an optimal preferable to smaller and larger page sizes. By the index page size. The tree year 2005, disks are predicted to have 40 MB/s trans- Figure 5: The utility of an at left shows just the fer rates and so 8 KB pages will probably be too index page is the number search path (it is not bal- small. of levels of the binary tree anced because the drawing would be too cluttered). Table 6 and Figure 7 indicate that for current devices, that it traverses. index page sizes in the range of 8 KB to 32 KB are Reading a 2 KB page from a disk with a 10 ms aver- preferable to smaller and larger page sizes. By the age access time (seek and rotation) and 10 MB/s year 2005, disks are predicted to have 40 MB/s trans- transfer rate uses 10.2 ms of disk device time. So the fer rates and so 8 KB pages will probably be too read cost is 10.2 milliseconds. More generally, the small. cost of accessing an index page is either the storage cost in main memory if the page is cached there, or 3. New Storage Metrics the access cost if the page is stored on disk. If pages These discussions point out an interesting phenome- near the index root are cached in main memory, the non -- the fundamental storage metrics are changing. cache saves a constant number of IOs on average. Traditionally, disks and tapes have been rated by ca- This constant can be ignored if one is just optimizing pacity. As disk and tape capacity approach infinity the IO subsystem. The index page disk access cost is (50 GB disks and 100 GB tapes are in beta test to- IndexPageAccessCost = Disk Latency + PageSize / day), the cost/GB goes to zero and the cost/access DiskTransferRate (6) becomes the dominant performance metric. The benefit-cost ratio of a certain page size and entry size is the ratio of the two quantities. The traditional performance metrics are: IndexPageBenefit/Cost = IndexPageUtility / GB: storage capacity in gigabytes. IndexPageAccessCost. (7) $/GB: device price divided by capacity. The right column of Table 6 shows this computation Latency: time between issue of IO and start of data for various page sizes assuming 20-byte index en- transmission. tries. It indicates that 8 KB to 32 KB pages are near Bandwidth: sustained transfer rate from the device. optimal for these parameters. Figure 7 graphs the benefit/cost ratios for various The latter two are often combined as a single access entry sizes and page sizes for both current, and next - time metric (time to read a random KB from the de- generation disks. The graphs indicate that, small vice). pages have low benefit because they have low utility Kaps : kilobyte accesses per second. Table 6: Tabulation of index page utility and bene- As device capacities grow, additional metrics become fit/cost for 20 byte index entries assuming each important. Transfers become larger. Indeed, the page is 70% full and assuming a 10ms latency 10 minimum economical tape transfer is probably a one MBps transfer rate. MB object page size entries Index Index Index Page KB per page Page Page Benefit/ Increasingly, applications use a dataflow style of Fan-out Utility Access Cost (20B) programming and stream the data past the device. Cost (ms) Data mining applications and archival applications 2 68 6.1 10.2 0.60 are the most common example of this today. These 4 135 7.1 10.4 0.68 suggest the following two new storage metrics. 8 270 8.1 10.8 0.75 Maps: Megabyte accesses per second. 16 541 9.1 11.6 0.78 Scan: how long it takes to sequentially read or write 32 1081 10.1 13.2 0.76 all the data in the device? 64 2163 11.1 16.4 0.68 128 4325 12.1 22.8 0.53
6 . Index Page Utility vs Page Size Index Page Utility vs Page Size and Disk Performance and Index Elemet Size 0.90 1.00 0.90 0.80 40 MB/s 16 byte entries 0.80 Utility 0.70 10 MB/s Utility 32 byte entries 0.70 5 MB/s 0.60 64 byte entries 0.60 0.50 3 MB/s 0.50 128 byte entries 1 MB/s 0.40 2 4 8 32 64 128 0.40 2 4 8 32 64 128 40 MB/s 0.645 0.741 0.832 0.969 0.987 0.94 16 B 0.6355 0.7191 0.7843 0.7898 0.6938 0.5403 10 MB/s 0.636 0.719 0.784 0.79 0.694 0.54 32 B 0.5375 0.623 0.6919 0.7144 0.6334 0.497 5 MB/s 0.623 0.692 0.729 0.633 0.497 0.345 64 B 0.4395 0.527 0.5994 0.6391 0.573 0.4538 3 MB/s 0.511 0.56 0.576 0.457 0.339 0.224 128 B 0.3415 0.4309 0.507 0.5638 0.5126 0.4105 1 MB/s 0.405 0.439 0.444 0.334 0.24 0.155 Page Size (KB) Page Size (KB) Figure 7. (a) The left graph shows the utility of index pages versus page size for various index entry sizes using a high-performance disk (10ms latency, 10 MB/s transfer rate). (b) The graphs at right use a fixed-sized 16-byte entry and show the impact of disk performance on optimal page size. For high-performance disks, the optimum index page size grows from 8KB to 64KB. rated these differential changes. This growth pre- These metrics become price/performance metrics served the five-minute rule for randomly accessed when combined with the device rent (depreciated pages. A one- minute rule applies to pages used in over 3 years). The Scan metric becomes a measure two-pass sequential algorithms like sort. As tech- of the rent for a terabyte of the media while the media nology advances, secondary storage capacities grow is being scanned. Table 8 displays these metrics for huge. The Kaps, Maps, and Scans metrics that meas- current devices: ure access rate and price/access are becoming ni - creasingly important. Table 8: Performance Metrics of high-performance devices circa 1997. 5. Acknowledgments RAM Disk Tape robot Paul Larson, Dave Lomet, Len Seligman and Unit capacity 1GB 9 GB 14 x 35 GB Catharine Van Ingen helped us clarify our Unit price $ 15,000$ 2,000$ 10,000$ presentation of optimum index page sizes. The $/GB 15,000 $/GB 222$/GB 20 $/GB Kaps, Maps, and Scans metrics grew out of Latency (ms) 0.1 micro sec 10 milli sec 30 sec discussions with Richie Larry. Bandwidth 500 MBps 5 MBps 5 MBps Kaps 500,000 Kaps 100 Kaps .03 Kaps 6. References Maps 500 Maps 4.8 Maps .03 Maps [1] J. Gray & G. F. Putzolu, "The Five-minute Scan time 2 seconds 30 minutes 27 hours Rule for Trading Memory for Disc Accesses, and $/Kaps 0.3 nano $ 0.2 micro $ 3 milli $ the 10 Byte Rule for Trading Memory for CPU $/Maps .3 micro $ 4 micro $ 3 milli $ Time," Proceedings of SIGMOD 87, June 1987, $/TBscan .32 $ 4.23$ 296$ pp. 395-398. [2] Dell-Microsoft TPC-C Executive summary: 4. Summary http://www.tpc.org/results/individual_results/D ell/dell.6100.es.pdf The fact that disk access speeds have increased ten- [3] Sun-Oracle TPC-C Executive summary: fold in the last twenty years is impressive. But it http://www.tpc.org/results/individual_results/Su pales when compared to the hundred-fold increase in n/sun.ue6000.oracle.es.pdf disk unit capacity and the ten-thousand-fold decrease [4] Ordinal Corp. http://www.ordinal.com/ in storage costs (Figure 1). In part, growing page sizes sixteen-fold from 512 bytes to 8 KB has amelio-
7 .