Petabytes, Exabytes, and Beyond - Managing Delta Lakes for Inter

Published by Spark开源社区 (Spark Open Source Community) / 3603 views

Data production continues to scale up, and the techniques for managing it need to scale too. Building pipelines that can process petabytes per day in turn creates data lakes with exabytes of historical data. At Databricks, we help our customers turn these data lakes into gold mines of valuable information using Apache Spark. This talk will cover techniques to optimize access to these data lakes using Delta Lakes, including range partitioning, file-based data skipping, multi-dimensional clustering, and read-optimized files. We’ll cover sample implementations and see examples of querying petabytes of data in seconds, not hours.
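To make file-based data skipping concrete, here is a minimal sketch in plain Python. It assumes each data file carries min/max statistics for one column, so a range query can discard files whose value range cannot overlap the predicate. The names `FileStats`, `files_to_scan`, and the `part-*.parquet` paths are hypothetical and are used only for illustration; this is not Delta Lake's API or its actual implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file statistics for a single column."""
    path: str
    min_ts: int  # smallest value of the column stored in this file
    max_ts: int  # largest value of the column stored in this file


def files_to_scan(files: List[FileStats], lo: int, hi: int) -> List[FileStats]:
    """Keep only files whose [min_ts, max_ts] range overlaps the query range [lo, hi]."""
    return [f for f in files if f.max_ts >= lo and f.min_ts <= hi]


# Assumed layout: files hold non-overlapping ranges, as range partitioning
# or clustering on this column would tend to produce.
files = [
    FileStats("part-000.parquet", 0, 99),
    FileStats("part-001.parquet", 100, 199),
    FileStats("part-002.parquet", 200, 299),
]

selected = files_to_scan(files, 150, 180)
print([f.path for f in selected])
```

Skipping of this kind pays off only when the values in each file are tightly clustered. If every file spans the whole value range, no file can be excluded, which is one reason the choice of clustering columns matters.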

We’ll also discuss tradeoffs that data engineers deal with every day, such as read speed vs. write throughput, managing storage costs, and duplicating data to support multiple query profiles, as well as combining batch with streaming to achieve the desired query performance. After this session, you will have new ideas for managing truly massive Delta Lakes.
