Application and Practice of Apache Flink at iQIYI (Chen Yuechen)

Challenges and evolution of real-time computing; Flink use cases at iQIYI; platform built for Flink; improvements on Flink; future work

1.Application and Practice of Apache Flink® in iQIYI Company: iQIYI Position: Senior Engineer Speaker: Chen Yuechen

2.Introduction to iQIYI

3.Introduction to iQIYI

4.Agenda Ø Challenges and evolution Ø Flink use cases @iQIYI Ø Platform built for Flink Ø Improvements on Flink Ø Future work

5.Challenges and evolution What is RTC (real-time computing)?

6.Challenges and evolution Problems: Ø Out-of-order data Ø Late data Ø Event time vs. processing time Challenges: Ø Correctness Ø Fault tolerance Ø Performance: latency, throughput, scalability Peak events per second: 11 million/s
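The out-of-order problem above is exactly why event time matters. A minimal sketch (plain Python, not Flink; the `(event_ts, payload)` tuple shape and window size are illustrative assumptions) showing that windows keyed on the timestamp carried in the event give the same result regardless of arrival order, which processing-time windows cannot guarantee:

```python
from collections import defaultdict

def event_time_windows(events, window_size=10):
    """Bucket events by the timestamp carried *in* the event (event time),
    so out-of-order arrival does not change the result."""
    windows = defaultdict(set)
    for event_ts, payload in events:
        windows[event_ts // window_size].add(payload)
    return dict(windows)

# Same events in two arrival orders (simulating out-of-order delivery)
# land in the same event-time windows.
in_order = [(3, "x"), (7, "z"), (12, "y")]
out_of_order = [(12, "y"), (3, "x"), (7, "z")]
print(event_time_windows(in_order) == event_time_windows(out_of_order))  # True
```

A real stream processor additionally needs watermarks to decide when a window is complete despite lateness; this sketch only shows the assignment step.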

7.Challenges and evolution RTC evolution @ iQIYI Ø 2013: Storm on a small scale, 3 standalone clusters Ø 2015: Introduced Spark Streaming, deployed on YARN Ø 2016: Built a platform for Spark Streaming, adopted at large scale Ø 2017: Introduced Flink, on standalone clusters and YARN Ø 2018: Co-located Flink with Kafka; built Streaming SQL and a real-time analysis platform

8.Challenges and evolution Spark Streaming problems: Ø Micro-batch model Ø Hard to support true real-time semantics (event time, late data) Ø APIs: DStream API, SQL

9.Challenges and evolution Flink and the Dataflow Model (What, Where, When, How) Ø Layered APIs: DataStream API, process functions, SQL Ø Guaranteed correctness: exactly-once, event time, late-data handling Ø Performance: low latency (ms) Ø Scalability: scale-out architecture, incremental checkpointing

10.Challenges and evolution Flink vs Spark Streaming

11.Flink use cases @iQIYI Ø Real-time ETL with massive data Ø Real-time fraud detection Ø Distributed system tracing

12.Real-time ETL with massive data Problem: Ø Split the aggregated event logs on Nginx (peak: 11 million/s) by business line Ø Write each category to its own Kafka cluster
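The split above can be sketched in plain Python (not Flink): each raw log line is tagged with a business-line category and routed to a per-category sink, standing in for the per-business-line Kafka clusters. The field layout and category rule are illustrative assumptions, not iQIYI's actual log format:

```python
from collections import defaultdict

def category_of(log_line: str) -> str:
    # Assumption: the business line is the first '|'-separated field.
    return log_line.split("|", 1)[0]

def split_logs(lines):
    """Group log lines by category, as a stand-in for producing each
    group to its own Kafka cluster."""
    sinks = defaultdict(list)
    for line in lines:
        sinks[category_of(line)].append(line)
    return dict(sinks)

logs = ["video|play|uid1", "ads|click|uid2", "video|pause|uid3"]
print(split_logs(logs))
# {'video': ['video|play|uid1', 'video|pause|uid3'], 'ads': ['ads|click|uid2']}
```

In the Flink job this routing runs continuously per record, with one sink per downstream Kafka cluster instead of in-memory lists.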

13.Real-time ETL with massive data Option 1: use the tail command. Not scalable.

14.Real-time ETL with massive data Option 2: ETL split Requirements: Ø High throughput (11 million/s) Ø Low latency (<1s)

15.Real-time ETL with massive data Flink ETLs

16.Real-time ETL with massive data IP blacklist filtering Tuning: Ø keyBy on device ID to avoid processing skew Ø Cache tuning: local LRU cache + Redis (4k/s) Latency: P99 < 500ms
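The two-tier lookup on this slide, a task-local LRU cache in front of a remote store, can be sketched as follows. This is plain Python, not Flink; the capacity, the `fetch_remote` callable (standing in for a Redis lookup), and all names are illustrative assumptions:

```python
from collections import OrderedDict

class BlacklistCache:
    """Local LRU cache in front of a remote blacklist store, so most
    lookups never leave the process (the slide cites ~4k/s to Redis)."""

    def __init__(self, fetch_remote, capacity=10000):
        self.fetch_remote = fetch_remote  # e.g. a Redis membership check
        self.capacity = capacity
        self.cache = OrderedDict()        # device_id -> bool, LRU-ordered

    def is_blacklisted(self, device_id: str) -> bool:
        if device_id in self.cache:
            self.cache.move_to_end(device_id)  # mark as recently used
            return self.cache[device_id]
        value = self.fetch_remote(device_id)   # cache miss: hit remote store
        self.cache[device_id] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)     # evict least recently used
        return value

remote = {"dev-42"}  # stand-in for the Redis blacklist set
cache = BlacklistCache(lambda d: d in remote, capacity=2)
print(cache.is_blacklisted("dev-42"))  # True, fetched remotely then cached
print(cache.is_blacklisted("dev-1"))   # False
```

Because the stream is keyed by device ID, each Flink subtask only caches the devices it owns, which keeps the local cache small and the hit rate high.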

17.Real-time ETL with massive data Resource tuning Flink: Ø 200+ machines Ø CPU-bound Ø CPU usage: 55% Kafka: Ø 100+ machines Ø IO-bound and disk-heavy Ø IO throughput: 80k Ø CPU usage: 15% Mixed deployment (Flink & Kafka): Ø 200+ machines Ø CPU usage: 70%

18.Real-time fraud detection Fraud detection: credential-stuffing account-theft attacks by bots Requirements: Ø In-stream filtering: ultra-high-frequency anomaly detection Ø Post-analysis: generate IP and device-ID blacklists

19.Real-time fraud detection CEP (complex event processing): Ø Newly registered users Ø Performing specific business actions within a short time Ultra-high-frequency anomaly detection: Ø Count each user's requests per second Ø Above a threshold, mark the user as anomalous
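The counting rule on this slide reduces to a keyed per-second counter with a threshold. A minimal sketch in plain Python (not Flink CEP); the threshold value and the `(user_id, timestamp)` event shape are illustrative assumptions:

```python
from collections import defaultdict

def detect_anomalous_users(events, threshold=100):
    """Count requests per (user, second) and flag users whose count in
    any one-second bucket exceeds the threshold."""
    counts = defaultdict(int)
    for user_id, ts in events:            # ts: event time in seconds
        counts[(user_id, int(ts))] += 1
    return {user for (user, _), n in counts.items() if n > threshold}

# 150 requests from u1 in the same second, 5 from u2.
events = [("u1", 10.2)] * 150 + [("u2", 10.5)] * 5
print(detect_anomalous_users(events))  # {'u1'}
```

In the streaming job the same logic runs incrementally over keyed one-second windows, so flagged users can be filtered in-stream rather than in a batch pass.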

20.Distributed system tracing Average duration of each call across the tracing topology

21.Distributed system tracing Path and duration of a single call

22.Distributed system tracing Architecture

23.Distributed system tracing Multiple window aggregations: Ø Level 1: build the tracing topology of apps Ø Level 2: compute average call duration, aggregated over different windows
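The second-level aggregation can be sketched as bucketing call spans into fixed (tumbling) windows and averaging duration per edge of the topology. Plain Python, not Flink; the span tuple shape `(caller, callee, ts, duration_ms)` and the 60-second window are illustrative assumptions:

```python
from collections import defaultdict

def average_duration_by_window(spans, window_size=60):
    """Average call duration per (caller, callee, window) bucket."""
    sums = defaultdict(lambda: [0.0, 0])   # key -> [duration sum, count]
    for caller, callee, ts, duration_ms in spans:
        window = int(ts) // window_size    # tumbling-window index
        acc = sums[(caller, callee, window)]
        acc[0] += duration_ms
        acc[1] += 1
    return {key: total / n for key, (total, n) in sums.items()}

spans = [("app-a", "app-b", 30, 100), ("app-a", "app-b", 40, 200)]
print(average_duration_by_window(spans))  # {('app-a', 'app-b', 0): 150.0}
```

Running the same reduction with several window sizes in parallel yields the per-minute, per-hour, etc. views of the topology without reprocessing the raw spans.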

24.Distributed system tracing Multiple window triggers

25.Platform built for Flink Overview

26.Scale Ø YARN (MapReduce/Spark/Flink): 30+ clusters, 6000+ machines Ø Streaming apps: Spark Streaming: 2000+, Flink: 300+

27.Streaming Platform Overview

28.Streaming Platform Streaming IDE

29.Streaming Platform Monitoring