Smart Join Algorithms for Fighting Skew at Scale

Posted by Spark开源社区 (Spark Open Source Community) / 8351 views
Consumer apps like Yelp generate log data at huge scale, and often this is distributed according to a power law, where a small number of users, businesses, locations, or pages are associated with a disproportionately large amount of data. This kind of data skew can cause problems for distributed algorithms, especially joins, where all the rows with the same key must be processed by the same executor. Even just a single over-represented entity can cause a whole job to slow down or fail. One approach to this problem is to remove outliers before joining, and this might be fine when training a machine learning model, but sometimes you need to retain all the data. Thankfully, there are a few tricks you can use to counteract the negative effects of skew while joining, by artificially redistributing data across more machines. This talk will walk through some of them, with code examples.
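One widely used way to "artificially redistribute" a skewed key is key salting: append a random salt to each row's key on the large side, and replicate each row on the small side once per salt value, so that a single hot key is spread over several partitions. The sketch below is not taken from the talk and is not Spark code. It is a pure-Python illustration under stated assumptions: `NUM_PARTITIONS`, `NUM_SALTS`, `partition_of`, `plain_join`, and `salted_join` are hypothetical names, and adding the salt to the hash is a simplification standing in for hashing the (key, salt) pair.

```python
import random
import zlib
from collections import Counter, defaultdict

NUM_PARTITIONS = 8  # hypothetical number of executors/partitions
NUM_SALTS = 8       # hypothetical number of salt values per key


def partition_of(key, salt=0):
    """Deterministic stand-in for a hash partitioner.

    Simplification: the salt is added to the key's hash, so distinct salts
    land on distinct partitions whenever NUM_SALTS <= NUM_PARTITIONS.
    """
    return (zlib.crc32(key.encode()) + salt) % NUM_PARTITIONS


def plain_join(left, right):
    """Inner join on the raw key; also returns left rows per partition."""
    lookup = defaultdict(list)
    for key, value in right:
        lookup[key].append(value)
    load = Counter()
    rows = []
    for key, value in left:
        load[partition_of(key)] += 1  # every row of a hot key lands here
        rows.extend((key, value, r) for r in lookup[key])
    return rows, load


def salted_join(left, right, seed=0):
    """Same join, but each left row gets a random salt and each right row
    is replicated once per salt, so a hot key is split across partitions."""
    rng = random.Random(seed)
    lookup = defaultdict(list)
    for key, value in right:
        for salt in range(NUM_SALTS):  # replicate the small side
            lookup[(key, salt)].append(value)
    load = Counter()
    rows = []
    for key, value in left:
        salt = rng.randrange(NUM_SALTS)  # salt the skewed side
        load[partition_of(key, salt)] += 1
        rows.extend((key, value, r) for r in lookup[(key, salt)])
    return rows, load
```

Given a left input dominated by one key, both functions return the same joined rows, but the busiest partition in `salted_join` handles only a fraction of what it handles in `plain_join`. The cost is that the right side grows by a factor of `NUM_SALTS`, so this trade is attractive mainly when the replicated side is small or when only the known hot keys are salted.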