- 快召唤伙伴们来围观吧
- 微博 QQ QQ空间 贴吧
- 视频嵌入链接 文档嵌入链接
- 复制
- 微信扫一扫分享
- 已成功复制到剪贴板
[External] 20210308 Low Latency Presto and Alluxio
Optimizing Latency-Sensitive Queries at Facebook with Presto & Alluxio
汪可,facebook软件工程师,曾在百度,绿盟从事大数据开发,后进入Facebook Presto部门助攻高性能低延迟大数据查询
范斌,位于硅谷的开源数据平台软件Alluxio公司的创始成员和VP of Open Source. 加入Alluxio前, 范斌在Google从事下一代大规模分布式存储系统的研究与开发. 范斌博士毕业于卡内基梅隆大学计算机系, 博士期间在分布式系统算法和系统实现等方向发表多篇包括SIGCOMM, SOSP, NSDI等顶级国际会议论文以及多篇专利。
展开查看详情
1 .Optimizing Latency-Sensitive Queries at Facebook with Presto & Alluxio Ke Wang (Facebook) Bin Fan (Alluxio) March 2021
2 . • Overview • Architecture and Problems • Re-architecture and Solutions • Performance • Alluxio Deep-dive 2
3 .Presto is Open Source
4 . Presto @ Facebook Scale 40K ~ 1 EB data > 80% Servers scan per day new ETL 4
5 . • Overview • Architecture and Problems • Re-architecture and Solution • Performance • Alluxio Deep-dive 5
6 . How Presto Works getPartitions Hive Metastore getFiles SQL Planner/ Scheduler Optimizer workload balanced openFiles HDFS Worker Driver Worker Driver result read/writeBlock Worker Driver 6
7 . • Overview • Architecture and Problems • Re-architecture and Solution • Performance • Alluxio Deep-dive 7
8 . Caching • Metadata cache at various levels • schemas • ACLs • Partitions info • HDFS • File handle caching: avoid file open calls • File stripe/footer caching: avoid multiple redundant RPC calls to HDFS • File data caching: avoid network or HDFS latency. • Compute • Plan • Partitial Result 8
9 . Data Caching • An optimization technique is to cache working dataset closer to the compute node. • Less trips to remote storage should help with latencies and IO. 9
10 . Presto with Data Caching Low-overhead getFiles Metadata Hive coordinator Caching Metastore Planner/ Scheduler KV store Optimizer file location/stats soft affinity HDFS Worker Driver openFile/footer cache Worker Driver Data Cache Worker read/writeBlock Driver Local SSD 10
11 . Affinity Scheduling • Random Node Scheduler • Best efforts to assign the same split to the same worker 11
12 . Soft Affinity • Blocked --> Secondary Preference --> Least busy 12
13 . Various Options • Facebook internal caching libraries • Open source solutions • Build our own 13
14 . File Merge Caching • Naïve solution • Copying files from remote storage on local storage • Merging files in the local storage to keep file count low 14
15 . File Merge Caching 15
16 . Learnings & Alluxio Collaboration • Segment Based data caching • Pluggable eviction policies • Configuration of various aspects like sizes, resources usage, eviction policies, etc. • A Java based OSS library • Provide detailed stats regarding cache usage. • Caching should not become a point of failure. • Asynchronous operations. • Files management at the disk level. • Flash throughput limiter to avoid endurance issues. 16
17 . • Overview • Architecture and Problems • Re-architecture w/ Presto+Alluxio • Performance • Alluxio Deep-dive 17
18 . Benchmark Configuration • Two full days worth of queries from the production cluster was shadowed to the test cluster. • Query Count: 17320 • 600 nodes cluster • 460GB per node was configured for data caching. • LRU eviction policy. • 1MB as the block size, meaning data is read, stored, and evicted in the 1 MB size. 18
19 . Benchmark Results Query Execution Time 19
20 . Benchmark Results IO Savings • Data Size read for master branch run: 582 T Bytes • Data Size read for caching branch run: 251 T Bytes • Savings in Scans: 57% 20
21 . Benchmark Results Cache hit rate 21
22 .Production
23 . • Overview • Architecture and Problems • Re-architecture and Solutions • Performance • Alluxio Deep-dive 23
24 . Alluxio Overview Translate access to optimal storage APIs over a slow network Java File API HDFS Interface S3 Interface POSIX Interface REST API Data Orchestration for the Cloud HDFS Driver Swift Driver S3 Driver NFS Driver 24
25 . Presto & Alluxio Local Cache Architecture Presto Presto Server JVM Worker HDFS API Calls Alluxio Caching File System On Cache Miss On Cache Hit External Alluxio Cache File System Manager External Local cache Storage storage 25
26 . Alluxio Cache Internals • Cache input files in pages (fix-sized file segments) • configurable, 1MB by default • Store pages off-heap • avoid using JVM memory resource but with SSDs • Highly-concurrent & thread-safe • Light-weight & fine-grained locking if cacheManager.hasPage(pageId): page = cacheManager.readPage(pageId) else: readFromExternalFS(page, offset, len) cacheManager.writePage(pageId, page) 26
27 . Basic Features • Pluggable cache replace policies: • LRU, LFU, FIFO • Pluggable cache storage options: • Local file system store: each page -> one file • Rocksdb store: each page -> one value associated with pageId (Experimental) • Async cache writes • to mitigate bursty cache write ops, queue writes in background • Comprehensive monitoring metrics • hit ratio, error types, capacity ... 27
28 . Operation at Facebook Scale • All corner cases are expected when running at Facebook Scale. • Handle Disk failure during operation • after IO syscalls timing out, gracefully fallback to no-cache mode when disks become not usable/responsible • Recover previous state from server restart • Each cache may have billions of pages to restore after server restart, 30+ minutes to fully recover previous state on restart • Entering read-only mode with async recover working in background 28
29 . Operation at Facebook Scale (Cont.) • All corner cases are expected when running at Facebook Scale. • Bursty and Possibly Highly Concurrent Query Load • Carefully handle concurrent N contenting threads reading and caching the same file, rather than failing N-1 threads on write conflict • Some tables are loaded with small files (<< 1 MB) • Carefully tune eviction policy to avoid unnecessary eviction failure b/c not enough space made due to small files being evicted 29