VipShop: Building a Near Real-Time ETL Platform on the Open-Source Components Spark and Alluxio
1. Near Real-Time ETL Platform Using Spark & Alluxio
Wanchun Wang, Chief Architect, VipShop Inc.
2. About VipShop
A leading online discount retailer for brands in China. $12.3 billion net revenue in 2018; 20M+ visitors per day.
3. Agenda
- Brief overview of the big data systems for streaming & batch processing
- The problem: real-time sales attribution
- Challenges with the existing batch platform
- Approaches to speeding up sales attribution
- Comparison with alternative options & benefits of Alluxio
- Future work
4. Overview: Big Data Systems
- Separate streaming and batch platforms with a single data pre-processing pipeline; no longer a pure Lambda architecture
- Typically, streaming data gets sinked into Hive tables every 5 minutes (a minimal sketch follows below)
- More ETL jobs are moving toward near real time
- Pipeline: Log -> Kafka -> Data Cleansing -> Kafka -> Augmentation -> Kafka -> Hive delta / Hive daily tables, with Streaming (Storm/Flink/Spark) feeding Batch ETL (Hive/Spark)
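A minimal Structured Streaming sketch of the 5-minute sink described above; the Kafka topic, columns, and output paths are assumptions, not VipShop's actual pipeline (which also uses Storm and Flink):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types._

object ClickstreamSink {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("clickstream-5min-sink").getOrCreate()
    import spark.implicits._

    // Hypothetical event schema; the real pipeline cleanses and augments upstream
    val clickSchema = StructType(Seq(
      StructField("user_id", StringType),
      StructField("page_id", StringType),
      StructField("event_time", TimestampType)))

    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka:9092")
      .option("subscribe", "augmented_page_view")
      .load()

    val events = raw.selectExpr("CAST(value AS STRING) AS json")
      .select(from_json($"json", clickSchema).as("e"))
      .select("e.*")
      .withColumn("dt", to_date($"event_time"))

    // Micro-batch every 5 minutes, appended under a date-partitioned Hive table path
    events.writeStream
      .trigger(Trigger.ProcessingTime("5 minutes"))
      .format("parquet")
      .option("path", "hdfs://namenode:8020/warehouse/ods.db/page_view_delta")
      .option("checkpointLocation", "hdfs://namenode:8020/checkpoints/page_view_delta")
      .partitionBy("dt")
      .start()
      .awaitTermination()
  }
}
```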
5. The Problem: Sales Attribution
The process of identifying the set of user actions ("events") across screens and touch points that contribute in some manner to a product sale, and then assigning value to each of these events.
Example click path: front page -> today's new -> men's special -> Product A detail -> men's special -> Product B detail -> add cart -> order -> back
6. Real-Time Sales Attribution Is a Very Complex Process
- Recompute the full day's data at each iteration: ~30 minutes, worst case 2-3 hours
- Many data sources involved: page_view, add_cart, order_with_discount, order_cookie_map, sub_order, prepay_order_goods, etc.
- Several large data sources each contain billions of records and take up 300GB-800GB on disk
- Sales path assignment is a very CPU-intensive computation
- Written by business analysts as complex SQL scripts with UDF functions (a simplified sketch follows below)
- Business expectation: updated results every 5-15 minutes
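As a rough illustration only, a last-touch style assignment could look like the sketch below; the table names, columns, and the single-touch rule are assumptions and far simpler than the analysts' actual SQL and UDF logic:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

object AttributionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("attribution-sketch").enableHiveSupport().getOrCreate()
    import spark.implicits._

    // Hypothetical inputs; the real job joins many more sources (add_cart, order_cookie_map, ...)
    val pageViews = spark.table("ods.page_view")   // user_id, goods_id, page_id, event_time
    val orders    = spark.table("ods.sub_order")   // user_id, goods_id, order_id, order_time

    // Keep only clicks that happened before the order for the same user and goods
    val joined = orders.join(pageViews, Seq("user_id", "goods_id"))
      .where($"event_time" <= $"order_time")

    // Last-touch rule: the most recent qualifying click gets all of the credit
    val w = Window.partitionBy("order_id", "goods_id").orderBy($"event_time".desc)
    joined.withColumn("rn", row_number().over(w))
      .where($"rn" === 1)
      .select($"order_id", $"goods_id", $"page_id".as("attributed_page"), lit(1.0).as("credit"))
      .write.mode("overwrite").saveAsTable("dm.sales_attribution_sketch")
  }
}
```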
7. The Hadoop + Spark Batch Platform Is Oversubscribed
- Running performance-sensitive jobs on the current batch platform is not an option
- Around 200K batch jobs executed daily on the Hadoop & Spark clusters: HDFS (1400+ nodes), SSD HDFS (50+ nodes), Spark clusters (300+ nodes)
- Cluster usage is above 80% on normal days; resources are even more saturated during monthly promotion periods
- Many issues contribute to inconsistent data access times, such as excessive NameNode RPC load and slow DataNode responses
- Scheduling overhead when running M/R jobs
8. Approaches to Speed Up Sales Attribution Processing
1. Add more compute power: too expensive, not a real option
2. Improve the ETL job to process updates incrementally
3. Create a new, relatively isolated environment with consistent computing resource allocation, intermediate data caching, and faster read/write
9. Approach 2: Improve the ETL Job to Process Updates Incrementally
- Recompute the click paths only for the users active in the current window
- Merge the active users' paths with the previous full path result (a minimal merge sketch follows below)
- Less data in each computation, but one extra read of the history data
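A minimal sketch of the merge step, assuming Parquet path names and a user_id key that are not in the deck: stale rows for the recomputed users are dropped from the previous full result, then the fresh rows are appended:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object IncrementalPathMerge {
  // prevFull: full-day path result from the previous iteration
  // activePaths: freshly recomputed paths for users active in the current window
  def merge(prevFull: DataFrame, activePaths: DataFrame): DataFrame = {
    val untouched = prevFull.join(
      activePaths.select("user_id").distinct(),
      Seq("user_id"),
      "left_anti")                      // keep users that were NOT recomputed
    untouched.unionByName(activePaths)  // overlay the fresh paths
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("incremental-path-merge").getOrCreate()
    val prevFull    = spark.read.parquet("alluxio://alluxio-master:19998/paths/full_day")
    val activePaths = spark.read.parquet("alluxio://alluxio-master:19998/paths/active_window")
    merge(prevFull, activePaths)
      .write.mode("overwrite")
      .parquet("alluxio://alluxio-master:19998/paths/full_day_next")
    spark.stop()
  }
}
```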
10. Approach 3: Create an Isolated, Localized Environment
11. New Sales Attribution Cluster Spec
- A satellite Spark + Alluxio 1.8.1 cluster with 27 nodes (48 cores, 256GB memory per node)
- Alluxio colocated with Spark: very consistent read/write I/O time across iterations
- Alluxio tiered storage: MEM + HDD (a configuration sketch follows below)
- Disable multiple copies to save space
- Leave enough memory to the OS to improve stability
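A hedged alluxio-site.properties sketch of a MEM + HDD tiered setup along these lines; the sizes, paths, and the passive-cache setting are assumptions for Alluxio 1.8, not the team's actual configuration:

```properties
# Two storage tiers per worker: RAM first, spill to HDD
alluxio.worker.memory.size=64GB
alluxio.worker.tieredstore.levels=2
alluxio.worker.tieredstore.level0.alias=MEM
alluxio.worker.tieredstore.level0.dirs.path=/mnt/ramdisk
alluxio.worker.tieredstore.level1.alias=HDD
alluxio.worker.tieredstore.level1.dirs.path=/data1/alluxio,/data2/alluxio

# Keep intermediate data only in Alluxio during iterations; persist to HDFS separately
alluxio.user.file.writetype.default=MUST_CACHE
# Avoid extra block copies when clients read from remote workers ("disable multiple copies")
alluxio.user.file.passive.cache.enabled=false
```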
12. Comparison with Alternative Options
- Remote HDFS cluster: on average 1-2x slower than Alluxio; the biggest problem is the many latency spikes during the day, especially when the Hadoop system is busy
- Local HDFS: about twice as slow as Alluxio (MEM + HDD)
- Dedicated SSD cluster: on par with Alluxio on regular days, but overall read/write latency increased 100% on busy days
- Dedicated (non-colocated) Alluxio cluster: still not as good as the colocated setup (more tests to be done)
- Spark cache: our daily views, clicks, and path results are too big to fit into the JVM; it is slow to create, much of our data is "only used twice", and multiple downstream Spark apps need to share the data
13. Lessons Learned
- Move the downstream processes closer to the data; avoid duplicating large amounts of data from Alluxio to remote HDFS
- Managing NRT jobs: a single big Spark Streaming job means too many inputs and outputs at different stages, while splitting into multiple jobs raises the question of how to coordinate multiple streaming jobs
- Our typical batch job scheduling strictly follows time-based dependencies and executes at every fixed interval; when there is a severe delay, multiple batch instances for different slots end up running at the same time
14. Scheduling Near Real Time ETL
- NRT jobs execute at a much higher frequency and are very sensitive to system hiccups and cascading effects
- The ultimate goal is to get the latest result fast; output at fixed intervals is not guaranteed
- Jobs report data readiness to a Watermark Service, which manages dependencies between loosely coupled jobs (a readiness-check sketch follows below)
- A delayed batch might consume unprocessed input blocks spanning multiple cycles
- Not all inputs are mandatory: an iteration gets kicked off even when optional input sources are not ready
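Purely as an illustration of the readiness rule above, a sketch of what a client of such a Watermark Service might check; the trait, names, and mandatory/optional handling are assumptions, not VipShop's actual implementation:

```scala
import java.time.Instant

// Hypothetical client for the internal Watermark Service
trait WatermarkClient {
  // Latest timestamp up to which a source has reported complete data, if any
  def watermark(source: String): Option[Instant]
}

object NrtScheduler {
  final case class Input(source: String, mandatory: Boolean)

  // Kick off an iteration once every mandatory input has caught up to windowEnd;
  // optional inputs that lag behind are simply skipped for this cycle.
  def readyToRun(client: WatermarkClient, inputs: Seq[Input], windowEnd: Instant): Boolean =
    inputs.filter(_.mandatory).forall { in =>
      client.watermark(in.source).exists(!_.isBefore(windowEnd))
    }

  // The subset of inputs whose data is ready for this cycle
  def inputsForCycle(client: WatermarkClient, inputs: Seq[Input], windowEnd: Instant): Seq[String] =
    inputs.collect {
      case in if client.watermark(in.source).exists(!_.isBefore(windowEnd)) => in.source
    }
}
```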
15. Benefits of Alluxio
- Easy to set up and pluggable: just a simple switch from hdfs://xxxx to alluxio://xxxx (see the sketch below)
- Within our data centers it is easier to allocate computing resources, but SSD machines are scarce
- Localization is easier to achieve together with Spark, either as a separate satellite cluster or on labeled machines in our big clusters
- Evaluating Spark and Alluxio on K8S: we need to shuffle those machines between Streaming, Spark ETL, Presto ad hoc query, and ML on different days or at different times of day
- Very stable in production: over two and a half years without any major issue. A big thank you to the Alluxio engineers!
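The switch really is just a URI scheme change on the Spark side; a minimal sketch with assumed host names and paths (the Alluxio client jar must be on Spark's classpath):

```scala
import org.apache.spark.sql.SparkSession

object PathSwitchExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("alluxio-path-switch").getOrCreate()

    // Before: reading intermediate data straight from the busy HDFS cluster
    // val views = spark.read.parquet("hdfs://namenode:8020/warehouse/page_view_delta")

    // After: same code, only the scheme and authority change
    val views = spark.read.parquet("alluxio://alluxio-master:19998/warehouse/page_view_delta")

    views.groupBy("page_id").count()
      .write.mode("overwrite")
      .parquet("alluxio://alluxio-master:19998/warehouse/page_view_counts")

    spark.stop()
  }
}
```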
16. Future Work
- Async persistence to remote HDFS; avoid duplicated writes in user code/SQL
- Put the Hadoop /tmp/ directory on Alluxio over SSD to reduce NameNode RPC and load on DataNodes
- Cache hot/warm data for Presto: heavy traffic and ad hoc queries are very sensitive to HDFS stability
17. Questions?
Alluxio Bay Area Meetup | @alluxio | alluxio.org/slack | info@alluxio.com | WeChat