Optimizing Delta/Parquet Data Lakes for Apache Spark

Spark开源社区 / 8867 views
This talk outlines data lake design patterns that can yield massive performance gains for all downstream consumers. We will talk about how to optimize Parquet data lakes and the awesome additional features provided by Databricks Delta.

* Optimal file sizes in a data lake
* File compaction to fix the small file problem
* Why Spark hates globbing S3 files
* Partitioning data lakes with partitionBy
* Parquet predicate pushdown filtering
* Limitations of Parquet data lakes (files aren’t mutable!)
* Mutating Delta lakes
* Data skipping with Delta ZORDER indexes
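As a rough illustration of the file-size and compaction topics listed above, the arithmetic behind a compaction job can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the helper name `compaction_file_count` is hypothetical, and the 1 GiB default target is an illustrative choice, not a figure taken from the talk abstract.

```python
def compaction_file_count(total_bytes: int, target_file_bytes: int = 1024**3) -> int:
    """Return how many output files a compaction job should write so that
    each file is roughly target_file_bytes in size.

    The 1 GiB default is an assumption made for illustration only.
    """
    if total_bytes < 0:
        raise ValueError("total_bytes must be non-negative")
    if target_file_bytes <= 0:
        raise ValueError("target_file_bytes must be positive")
    # Integer ceiling division avoids floating-point rounding on large sizes.
    files = -(-total_bytes // target_file_bytes)
    # Always write at least one file, even for an empty or tiny dataset.
    return max(1, files)


# Hypothetical use with PySpark (paths are placeholders, not from the talk):
#   n = compaction_file_count(total_bytes_of_small_files)
#   df = spark.read.parquet("s3a://some-bucket/small-files/")
#   df.repartition(n).write.mode("overwrite").parquet("s3a://some-bucket/compacted/")
```

Repartitioning to the computed count before writing is one common way to turn many small files into a few larger ones; the compacted output goes to a separate path here so the source files are not overwritten while they are still being read.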