Building VIPShop's Near Real-time ETL Platform on the Open-Source Components Spark and Alluxio

This is a talk given by VIPShop data architect Wanchun Wang at the Alluxio meetup in the Bay Area in March 2019. It describes the experience and practice of designing and building a near real-time data ETL platform with Apache Spark, Alluxio, and HDFS. For more use cases, see https://www.alluxio.org/community/powered-by-alluxio

1. Wanchun Wang, Chief Architect, VipShop Inc.
Near Real-time ETL Platform Using Spark & Alluxio

2. About VipShop
- A leading online discount retailer for brands in China
- $12.3 billion net revenue in 2018
- 20M+ visitors/day

3. Agenda
- Brief overview of big data systems for streaming & batch processing
- The problem: real-time sales attribution
- Challenges with the existing batch platform
- Approaches to speeding up sales attribution
- Comparison to alternative options & benefits of Alluxio
- Future work

4. Overview: Big Data Systems
- Separate streaming and batch platforms with a single data pre-processing pipeline, no longer a pure Lambda architecture
- Typically, streaming data gets sunk into Hive tables every 5 minutes (see the sketch below)
- More ETL jobs are moving toward near real time
- Pipeline: Log -> Kafka -> Data Cleansing -> Kafka -> Augmentation -> Kafka -> Hive Delta / Hive Daily tables; Streaming (Storm/Flink/Spark) feeding Batch ETL (Hive/Spark)
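A minimal sketch of the 5-minute sink stage, assuming Spark Structured Streaming as the streaming engine (the talk also mentions Storm and Flink); the Kafka topic, bootstrap servers, and HDFS paths below are illustrative, not taken from the talk:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("kafka-to-hive-delta").getOrCreate()

// Read the augmented event stream from Kafka (topic and servers are illustrative).
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "kafka-broker:9092")
  .option("subscribe", "augmented_events")
  .load()
  .selectExpr("CAST(value AS STRING) AS json", "timestamp")

// Every 5 minutes, append a micro-batch into the date-partitioned files
// backing the Hive delta table.
val query = events
  .withColumn("dt", date_format(col("timestamp"), "yyyyMMdd"))
  .writeStream
  .trigger(Trigger.ProcessingTime("5 minutes"))
  .format("parquet")
  .option("path", "hdfs://nameservice1/warehouse/hive_delta/events")
  .option("checkpointLocation", "hdfs://nameservice1/checkpoints/hive_delta_events")
  .partitionBy("dt")
  .start()

query.awaitTermination()
```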

5. The Problem: Sales Attribution
The process of identifying a set of user actions ("events") across screens and touch points that contribute in some manner to a product sale, and then assigning value to each of these events.
[Example click path: front page -> today's new -> men's special -> Product A detail -> men's special -> Product B detail -> add to cart -> order]

6. Real-time Sales Attribution Is a Very Complex Process
- Recomputing the full day's data at each iteration takes ~30 minutes, worst case 2-3 hours
- Many data sources involved: page view, add cart, order_with_discount, order_cookie_map, sub_order, prepay_order_goods, etc.
- Several large data sources each contain billions of records and take up 300GB~800GB on disk
- Sales path assignment is a very CPU-intensive computation
- Written by business analysts as complex SQL scripts with UDF functions (see the sketch below)
- Business expectation: updated results every 5-15 minutes
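The attribution rules themselves are not shown in the deck; the sketch below only illustrates the SQL-plus-UDF style it describes. The UDF `assign_last_touch`, its last-touch rule, and the table `page_view_with_orders` are invented for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("attribution-sketch").enableHiveSupport().getOrCreate()

// Hypothetical rule: credit the final page in the click path.
// (Ordering of collect_list is ignored here for brevity; the real job would sort by click time.)
spark.udf.register("assign_last_touch", (path: Seq[String]) =>
  if (path == null || path.isEmpty) "unknown" else path.last)

// Analysts express the attribution rules as SQL over the Hive tables and call the UDF.
val attributed = spark.sql(
  """SELECT order_id,
    |       assign_last_touch(collect_list(page_id)) AS credited_page
    |FROM page_view_with_orders
    |GROUP BY order_id""".stripMargin)

attributed.show()
```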

7. The Hadoop + Spark Batch Platform Is Oversubscribed
- Running performance-sensitive jobs on the current batch platform is not an option
- Around 200K batch jobs executed daily on the Hadoop & Spark clusters: HDFS (1400+ nodes), SSD HDFS (50+ nodes), Spark clusters (300+ nodes)
- Cluster usage is above 80% on normal days; resources are even more saturated during the monthly promotion period
- Many issues contribute to inconsistent data access times, such as excessive NameNode RPC load and slow DataNode responses
- Scheduling overhead when running M/R jobs

8. Approaches to Speed Up Sales Attribution Processing
1. Add more compute power: too expensive, not a real option
2. Improve the ETL job to process updates incrementally
3. Create a new, relatively isolated environment with consistent compute resource allocation, intermediate data caching, and faster read/write

9. Approach 2: Improve the ETL Job to Process Updates Incrementally
- Recompute the click paths only for users active in the current window
- Merge the active users' paths with the previous full-path result
- Less data in each computation, but one more read of the history data (see the sketch below)
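A minimal sketch of this incremental iteration, assuming hypothetical DataFrames `windowEvents` (events that arrived in the current window), `dayEvents` (the full day's events read from history), and `previousPaths` (the previous full-path result keyed by user_id); `computePaths` stands in for the real, CPU-heavy path-assignment logic:

```scala
import org.apache.spark.sql.DataFrame

// One incremental iteration: rebuild click paths only for users seen in the
// current window, then merge them back into the previous full-day result.
def incrementalIteration(windowEvents: DataFrame,
                         dayEvents: DataFrame,
                         previousPaths: DataFrame,
                         computePaths: DataFrame => DataFrame): DataFrame = {
  // Users who produced events in this window.
  val activeUsers = windowEvents.select("user_id").distinct()

  // The "one more read on history data": today's events restricted to active users.
  val activeUserEvents = dayEvents.join(activeUsers, Seq("user_id"), "left_semi")

  // Much less data than a full-day recompute.
  val activeUserPaths = computePaths(activeUserEvents)

  // Keep untouched users' paths from the previous result, replace the active users'.
  val untouchedPaths = previousPaths.join(activeUsers, Seq("user_id"), "left_anti")
  untouchedPaths.unionByName(activeUserPaths)
}
```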

10. Approach 3: Create an Isolated, Localized Environment

11. New Sales Attribution Cluster Spec
- A satellite Spark + Alluxio 1.8.1 cluster with 27 nodes (48 cores, 256GB memory each)
- Alluxio colocated with Spark: very consistent read/write I/O times across iterations
- Alluxio tiered storage: MEM + HDD (a config sketch follows below)
- Disable multiple copies to save space
- Leave enough memory to the OS to improve stability
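A minimal alluxio-site.properties sketch of the MEM + HDD tiering on each worker; the directory paths and quotas are illustrative, not the production values, and disabling passive caching is only one possible way to avoid extra block copies (the talk does not say which setting was used):

```properties
# Two storage tiers on each worker: RAM disk first, spinning disks below it.
alluxio.worker.tieredstore.levels=2
alluxio.worker.tieredstore.level0.alias=MEM
alluxio.worker.tieredstore.level0.dirs.path=/mnt/ramdisk
alluxio.worker.tieredstore.level0.dirs.quota=64GB
alluxio.worker.tieredstore.level1.alias=HDD
alluxio.worker.tieredstore.level1.dirs.path=/data1/alluxio,/data2/alluxio
alluxio.worker.tieredstore.level1.dirs.quota=2TB,2TB

# Illustrative choice: do not make extra local copies of blocks already cached elsewhere.
alluxio.user.file.passive.cache.enabled=false
```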

12. Comparison with Alternative Options
- Remote HDFS cluster: on average 1-2x slower than Alluxio; the biggest problem is the many latency spikes over the day, especially when the Hadoop system is busy
- Local HDFS: twice as slow as Alluxio (MEM + HDD)
- Dedicated SSD cluster: on par with Alluxio on regular days, but overall read/write latency increases by 100% during busy days
- Dedicated (non-colocated) Alluxio cluster: still not as good as the colocated setup (more tests to be done)
- Spark cache: our daily views, clicks, and path results are too big to fit into the JVM; the cache is slow to create, much of the data is "only used twice", and multiple downstream Spark applications need to share the data

13. Lessons Learned
- Move downstream processes closer to the data; avoid duplicating large amounts of data from Alluxio to remote HDFS
- Managing NRT jobs:
  - A single big Spark Streaming job? Too many inputs and outputs at different stages
  - Split into multiple jobs? Then how to coordinate the multiple streaming jobs?
- Our typical batch job scheduling strictly follows time-based dependencies and executes at every fixed interval; when there is a severe delay, multiple batch instances for different time slots run at the same time

14. Scheduling Near Real-Time ETL
- NRT jobs execute at a much higher frequency and are very sensitive to system hiccups and cascading effects
- Jobs report data readiness to a Watermark Service, which manages dependencies between loosely coupled jobs (a hypothetical sketch follows below)
- The ultimate goal is to get the latest result fast:
  - A delayed batch may consume unprocessed input blocks spanning multiple cycles, so output for fixed intervals is not guaranteed
  - Not all inputs are mandatory; an iteration is kicked off even when optional input sources are not ready
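The Watermark Service is an internal VipShop component whose API is not shown in the talk; the sketch below is purely hypothetical (the `WatermarkClient` trait and all names are invented) and only illustrates the scheduling behaviour described above: mandatory inputs gate an iteration, optional inputs do not, and a delayed run simply consumes everything up to the latest watermark:

```scala
import java.time.Instant

// Hypothetical client for the internal Watermark Service: upstream jobs publish how
// far their output is complete, and downstream jobs poll those watermarks.
trait WatermarkClient {
  def watermark(dataset: String): Instant            // latest "data ready up to" time
  def publish(dataset: String, upTo: Instant): Unit
}

// One NRT iteration: wait only for mandatory inputs; optional inputs are included
// if ready, otherwise the iteration runs without them.
def runIteration(wm: WatermarkClient,
                 cycleEnd: Instant,
                 mandatory: Seq[String],
                 optional: Seq[String],
                 process: (Seq[String], Instant) => Unit): Unit = {
  while (mandatory.exists(ds => wm.watermark(ds).isBefore(cycleEnd))) {
    Thread.sleep(30 * 1000)   // mandatory inputs not ready yet; wait and re-check
  }
  val readyOptional = optional.filter(ds => !wm.watermark(ds).isBefore(cycleEnd))

  // A delayed iteration consumes everything up to the latest watermark, so one run
  // may cover input blocks spanning multiple cycles.
  process(mandatory ++ readyOptional, cycleEnd)

  wm.publish("sales_attribution_result", cycleEnd)
}
```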

15. Benefits of Alluxio
- Easy to set up; pluggable, just a simple switch from hdfs://xxxx to alluxio://xxxx (see the sketch below)
- Within our data centers it is easier to allocate compute resources, but SSD machines are scarce
- Data locality is easier to achieve together with Spark, either as a separate satellite cluster or on labeled machines in our big clusters
- Evaluating Spark and Alluxio on Kubernetes: we need to shuffle those machines between streaming, Spark ETL, Presto ad hoc queries, and ML on different days or at different times of day
- Very stable in production: over two and a half years without any major issue. A big thank-you to the Alluxio engineers!
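A minimal sketch of that switch, assuming a SparkSession named `spark` and the Alluxio client jar on the Spark classpath; hostnames and paths are illustrative:

```scala
// Before: reading intermediate data from the remote HDFS cluster.
val viewsHdfs = spark.read.parquet("hdfs://nameservice1/nrt/page_views/dt=20190301")

// After: the same code against the colocated Alluxio cluster; only the URI scheme changes.
val views = spark.read.parquet("alluxio://alluxio-master:19998/nrt/page_views/dt=20190301")

views.write
  .mode("overwrite")
  .parquet("alluxio://alluxio-master:19998/nrt/active_user_paths")
```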

16. Future Work
- Asynchronously persist to remote HDFS to avoid duplicated writes in user code/SQL (a sketch follows below)
- Put the Hadoop /tmp/ directory on Alluxio over SSD to reduce NameNode RPCs and the load on DataNodes
- Cache hot/warm data for Presto; heavy-traffic and ad hoc queries are very sensitive to HDFS stability
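One way the async persistence could look from the Spark side; this is a sketch under the assumption that the Alluxio client is configured with `alluxio.user.file.writetype.default=ASYNC_THROUGH` (e.g. via alluxio-site.properties on the executors), and the table and path names are illustrative:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("async-persist-sketch").enableHiveSupport().getOrCreate()

// With ASYNC_THROUGH, the job writes once to Alluxio and Alluxio persists the files
// to the remote HDFS under-store in the background, so user code/SQL no longer needs
// a second, explicit copy to HDFS.
spark.table("sales_path_result")   // table name is illustrative
  .write
  .mode("overwrite")
  .parquet("alluxio://alluxio-master:19998/warehouse/sales_path_result")
```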

17. Questions?
Alluxio Bay Area Meetup
@alluxio | alluxio.org/slack | info@alluxio.com | WeChat