Application and Practice of Apache Flink at iQIYI (Chen Yuechen)

Challenges and evolution of real-time computing; Flink use cases at iQIYI; platform built for Flink; improvements on Flink; future work

1.Application and Practice of Apache Flink® in iQIYI Company: iQIYI Position: Senior Engineer Speaker: Chen Yuechen

2.Introduction to iQIYI

3.Introduction to iQIYI

4.Agenda Ø Challenges and evolution Ø Flink use cases @iQIYI Ø Platform built for Flink Ø Improvements on Flink Ø Future work

5.Challenges and evolution What is RTC (real-time computing)?

6.Challenges and evolution Problems: Ø Out-of-order data Ø Late data Ø Event time vs. processing time Challenges: Ø Correctness Ø Fault tolerance Ø Performance: latency, throughput, scalability Peak events per second: 11 million/s
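The out-of-order problem above is exactly why event time matters. A minimal sketch (plain Python, not Flink; the `(event_ts, payload)` tuple shape and window size are illustrative assumptions) showing that windows keyed on the timestamp carried in the event give the same result regardless of arrival order, which processing-time windows cannot guarantee:

```python
from collections import defaultdict

def event_time_windows(events, window_size=10):
    """Bucket events by the timestamp carried *in* the event (event time),
    so out-of-order arrival does not change the result."""
    windows = defaultdict(set)
    for event_ts, payload in events:
        windows[event_ts // window_size].add(payload)
    return dict(windows)

# Same events in two arrival orders (simulating out-of-order delivery)
# land in the same event-time windows.
in_order = [(3, "x"), (7, "z"), (12, "y")]
out_of_order = [(12, "y"), (3, "x"), (7, "z")]
print(event_time_windows(in_order) == event_time_windows(out_of_order))  # True
```

A real stream processor additionally needs watermarks to decide when a window is complete despite lateness; this sketch only shows the assignment step.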

7.Challenges and evolution RTC evolution @ iQIYI Ø 2013: Storm on a small scale, 3 standalone clusters Ø 2015: Introduced Spark Streaming, deployed on YARN Ø 2016: Built a platform for Spark Streaming, adopted at large scale Ø 2017: Introduced Flink, on standalone clusters and YARN Ø 2018: Co-located Flink with Kafka; built Streaming SQL and a real-time analysis platform

8.Challenges and evolution Spark Streaming problems: Ø Micro-batch model Ø Hard to support true real-time semantics (event time, late data) Ø APIs: DStream API, SQL

9.Challenges and evolution Flink and the Dataflow Model (What, Where, When, How) Ø Layered APIs: DataStream API, process functions, SQL Ø Guaranteed correctness: exactly-once, event time, late-data handling Ø Performance: low latency (ms) Ø Scalability: scale-out architecture, incremental checkpointing

10.Challenges and evolution Flink vs Spark Streaming

11.Flink use cases @iQIYI Ø Real-time ETL with massive data Ø Real-time fraud detection Ø Distributed system tracing

12.Real-time ETL with massive data Problem: Ø Split the aggregated event logs on Nginx (peak: 11 million/s) by business line Ø Write each category to its own Kafka cluster
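The split above can be sketched in plain Python (not Flink): each raw log line is tagged with a business-line category and routed to a per-category sink, standing in for the per-business-line Kafka clusters. The field layout and category rule are illustrative assumptions, not iQIYI's actual log format:

```python
from collections import defaultdict

def category_of(log_line: str) -> str:
    # Assumption: the business line is the first '|'-separated field.
    return log_line.split("|", 1)[0]

def split_logs(lines):
    """Group log lines by category, as a stand-in for producing each
    group to its own Kafka cluster."""
    sinks = defaultdict(list)
    for line in lines:
        sinks[category_of(line)].append(line)
    return dict(sinks)

logs = ["video|play|uid1", "ads|click|uid2", "video|pause|uid3"]
print(split_logs(logs))
# {'video': ['video|play|uid1', 'video|pause|uid3'], 'ads': ['ads|click|uid2']}
```

In the Flink job this routing runs continuously per record, with one sink per downstream Kafka cluster instead of in-memory lists.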

13.Real-time ETL with massive data Option 1: use the tail command. Not scalable.

14.Real-time ETL with massive data Option 2: ETL split Requirements: Ø High throughput (11 million/s) Ø Low latency (<1s)

15.Real-time ETL with massive data Flink ETLs

16.Real-time ETL with massive data IP blacklist filtering Tuning: Ø keyBy on device ID to avoid processing skew Ø Cache tuning: local LRU cache + Redis (4k/s) Latency: P99 < 500ms
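The two-tier lookup on this slide, a task-local LRU cache in front of a remote store, can be sketched as follows. This is plain Python, not Flink; the capacity, the `fetch_remote` callable (standing in for a Redis lookup), and all names are illustrative assumptions:

```python
from collections import OrderedDict

class BlacklistCache:
    """Local LRU cache in front of a remote blacklist store, so most
    lookups never leave the process (the slide cites ~4k/s to Redis)."""

    def __init__(self, fetch_remote, capacity=10000):
        self.fetch_remote = fetch_remote  # e.g. a Redis membership check
        self.capacity = capacity
        self.cache = OrderedDict()        # device_id -> bool, LRU-ordered

    def is_blacklisted(self, device_id: str) -> bool:
        if device_id in self.cache:
            self.cache.move_to_end(device_id)  # mark as recently used
            return self.cache[device_id]
        value = self.fetch_remote(device_id)   # cache miss: hit remote store
        self.cache[device_id] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)     # evict least recently used
        return value

remote = {"dev-42"}  # stand-in for the Redis blacklist set
cache = BlacklistCache(lambda d: d in remote, capacity=2)
print(cache.is_blacklisted("dev-42"))  # True, fetched remotely then cached
print(cache.is_blacklisted("dev-1"))   # False
```

Because the stream is keyed by device ID, each Flink subtask only caches the devices it owns, which keeps the local cache small and the hit rate high.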

17.Real-time ETL with massive data Resource tuning Flink: Ø 200+ machines Ø CPU-bound Ø CPU usage: 55% Kafka: Ø 100+ machines Ø IO-bound and disk-heavy Ø IO throughput: 80k Ø CPU usage: 15% Mixed deployment (Flink & Kafka): Ø 200+ machines Ø CPU usage: 70%

18.Real-time fraud detection Fraud detection: credential-stuffing account-theft attacks by bots Requirements: Ø In-stream filtering: ultra-high-frequency anomaly detection Ø Post-analysis: generate IP and device-ID blacklists

19.Real-time fraud detection CEP (complex event processing): Ø Newly registered users Ø Performing specific business actions within a short time Ultra-high-frequency anomaly detection: Ø Count each user's requests per second Ø Above a threshold, mark the user as anomalous
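The counting rule on this slide reduces to a keyed per-second counter with a threshold. A minimal sketch in plain Python (not Flink CEP); the threshold value and the `(user_id, timestamp)` event shape are illustrative assumptions:

```python
from collections import defaultdict

def detect_anomalous_users(events, threshold=100):
    """Count requests per (user, second) and flag users whose count in
    any one-second bucket exceeds the threshold."""
    counts = defaultdict(int)
    for user_id, ts in events:            # ts: event time in seconds
        counts[(user_id, int(ts))] += 1
    return {user for (user, _), n in counts.items() if n > threshold}

# 150 requests from u1 in the same second, 5 from u2.
events = [("u1", 10.2)] * 150 + [("u2", 10.5)] * 5
print(detect_anomalous_users(events))  # {'u1'}
```

In the streaming job the same logic runs incrementally over keyed one-second windows, so flagged users can be filtered in-stream rather than in a batch pass.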

20.Distributed system tracing Average duration of each call across the tracing topology

21.Distributed system tracing Path and duration of a single call

22.Distributed system tracing Architecture

23.Distributed system tracing Multiple window aggregations: Ø Level 1: build the tracing topology of apps Ø Level 2: compute average call duration, aggregated over different windows
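The second-level aggregation can be sketched as bucketing call spans into fixed (tumbling) windows and averaging duration per edge of the topology. Plain Python, not Flink; the span tuple shape `(caller, callee, ts, duration_ms)` and the 60-second window are illustrative assumptions:

```python
from collections import defaultdict

def average_duration_by_window(spans, window_size=60):
    """Average call duration per (caller, callee, window) bucket."""
    sums = defaultdict(lambda: [0.0, 0])   # key -> [duration sum, count]
    for caller, callee, ts, duration_ms in spans:
        window = int(ts) // window_size    # tumbling-window index
        acc = sums[(caller, callee, window)]
        acc[0] += duration_ms
        acc[1] += 1
    return {key: total / n for key, (total, n) in sums.items()}

spans = [("app-a", "app-b", 30, 100), ("app-a", "app-b", 40, 200)]
print(average_duration_by_window(spans))  # {('app-a', 'app-b', 0): 150.0}
```

Running the same reduction with several window sizes in parallel yields the per-minute, per-hour, etc. views of the topology without reprocessing the raw spans.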

24.Distributed system tracing Multiple window triggers

25.Platform built for Flink Overview

26.Scale Ø YARN (MapReduce/Spark/Flink): 30+ clusters, 6000+ machines Ø Streaming apps: Spark Streaming: 2000+, Flink: 300+

27.Streaming Platform Overview

28.Streaming Platform Streaming IDE

29.Streaming Platform Monitoring