The Practice and Application of Meituan-Dianping's Realtime Compute Platform on Flink

1.Flink Forward China 2018 Company: Meituan-Dianping (美团点评) Title: Researcher Speaker: 鞠大升 (Ju Dasheng)

2.Flink Forward China 2018 The Practice and Application of Meituan-Dianping's Realtime Compute Platform on Flink Meituan-Dianping (美团点评) 鞠大升 (Ju Dasheng) 2018-12-10

3.Outline •  Introduction •  Practice in Platform Construction •  Realtime Applications •  Challenges and Future Work

4.Outline •  Introduction •  Practice in Platform Construction •  Realtime Applications •  Challenges and Future Work

5.About Us Brands: Meituan (美团), Dianping (大众点评), Meituan Waimai (美团外卖), Maoyan Movies (猫眼电影). Meituan-Dianping is China's leading e-commerce platform for life services.

6.Business Characteristics •  Multiple business lines with heterogeneous patterns •  Transactions are involved, with long processing chains •  Strong need for collaboration across businesses

7.The Platform Architecture Layers: Stream Ingestion (Feed in Streams), Realtime Management Platform, Realtime Applications. Components labeled in the diagram: LogCenter, DB/Binlog, Log, Kafka, Flink & SQL, Storm, Yarn, Hdfs, Result, DB, Druid, ES, State & Dim, HBase, Redis, Petra, MLX, DW. Cross-cutting layer: Realtime Metadata and Authority Management

8.The Current Status •  10 thousand jobs •  4 thousand machines •  1000 billion messages/day •  10 million messages/s at peak

9.Application Scenarios •  Risk Control and Anti-Crawling •  Realtime Traffic Analysis •  Business Monitoring •  B-Side (Merchant-Facing) Applications •  Operations Analysis

10.Outline •  Introduction •  Practice in Platform Construction •  Realtime Applications •  Challenges and Future Work

11.What We Care About •  Capabilities of the Engine (Precise Computation and State Management) •  Platformization (Multi-Tenancy, Resources and Permissions) •  Efficiency (Development, Debugging, Troubleshooting, Tuning and SQL) •  High Reliability (Disaster Tolerance and Operations) •  Scenario-Based Applications (Log Center, Petra and MLX)

12.How to be Stable •  Resource Isolation - Offline cluster / realtime cluster: physically isolated deployment - Different business lines: label-based resource isolation in Yarn. Resources shown in the diagram: CPU, Memory, Disk, Net IO, Disk IO, Storage

13.How to be Stable •  Fault Tolerance - Job Manager HA - Automatic job restart - Retry on exception for Flink Kafka •  Multi-Datacenter Disaster Recovery - Hot standby for streams
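
The slide lists "Retry on Exception for Flink Kafka" without saying how the retry is done. As a minimal sketch of the general idea, assuming a bounded retry with exponential backoff (the function name, parameters, and the `FlakyFetch` stand-in are hypothetical and are not Flink or Kafka APIs):

```python
import time


def retry_with_backoff(operation, max_attempts=3, base_delay_s=0.1, sleep=time.sleep):
    """Call operation(); on an exception, wait and retry up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                # Out of attempts: re-raise so a supervisor can restart the job.
                raise
            # Exponential backoff: base, 2x base, 4x base, ...
            sleep(base_delay_s * 2 ** (attempt - 1))


class FlakyFetch:
    """Hypothetical stand-in for a Kafka fetch that fails a few times first."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("broker unavailable")
        return ["msg-1", "msg-2"]
```

A transient failure is absorbed inside the task, while a persistent one still surfaces as an exception, which is the point at which automatic job restart would take over.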

14.How to be Stable •  Monitoring and Alarms - Job status alarm - Processing delay alarm - Custom metrics alarm
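
The slide does not say how processing delay is measured. One common proxy is consumer lag per partition, that is, the latest offset in the topic minus the offset the job has committed. A minimal sketch under that assumption (the function name and the threshold are hypothetical):

```python
def processing_delay_alarms(latest_offsets, committed_offsets, max_lag):
    """Return {partition: lag} for partitions whose consumer lag exceeds max_lag."""
    alarms = {}
    for partition, latest in latest_offsets.items():
        # A partition with no committed offset is treated as starting from 0.
        lag = latest - committed_offsets.get(partition, 0)
        if lag > max_lag:
            alarms[partition] = lag
    return alarms
```

For example, with latest offsets `{0: 1000, 1: 5000}`, committed offsets `{0: 990, 1: 1000}` and `max_lag=100`, only partition 1 is reported.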

15.Tuning & Debugging •  Unified Log Collection and Retrieval •  Unified Metrics Collection and Querying •  Configurable Metric-Based Alarms

16.Tuning & Debugging Screenshots: Log Query Conditions, Log Query Results, Job Throughput Metrics, Node Performance Metrics

17.Why Do We Need a Platform? Data Platform: Multiple Business Lines, Efficiency Improvement, Data Sharing, Business Collaboration

18.The Platform Construction Entities in the diagram: Tenants, Project Groups, People, Resources, Tasks, Data

19.The Platform Construction •  The Tenant System •  Resource Management •  Task & Data Management •  Permission Management

20.Embrace SQL Management modules: Streaming Data Management, Job Management, UDF Management, Result Management. Diagram elements: Table, SQL Job, UDF, DB, SQL Parser/Optimizer, Druid, Schematized Stream (Topic), ES, Flink Job, Yarn, Hdfs

21.Embrace SQL •  Stream Schematization and Table Management •  Richness of SQL Semantics •  UDF Extensions •  Execution Optimization

22.Outline •  Introduction •  Practice in Platform Construction •  Realtime Applications •  Challenges and Future Work

23.Petra: Realtime Metrics Aggregation Service •  Event-time •  Multi-dimension •  Compound indicators •  Exactly Once •  Low latency. Pipeline: Metrics -> Pre-Aggregation Module (Exactly Once; window aggregation based on processing time; incremental aggregation result) -> Inc-Aggregation Result -> All-Aggregation Module (Exactly Once; window aggregation based on event time; full aggregation result) -> Aggregation Result -> Sync Module -> OpenTSDB, Falcon
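
The two-stage pipeline on this slide can be illustrated in plain Python. This is only a sketch of the idea, not Petra's implementation: the tuple layout `(event_time_s, dimension, value)`, the 60-second window, and both function names are assumptions. Stage one collapses whatever arrived in a processing-time batch into partial results keyed by event-time window and dimension; stage two merges those partials into the full per-window aggregate.

```python
from collections import defaultdict

WINDOW_S = 60  # assumed event-time window size, in seconds


def pre_aggregate(batch):
    """Collapse one processing-time batch into partial (count, sum) per (window, dimension)."""
    partial = defaultdict(lambda: [0, 0.0])
    for event_time_s, dimension, value in batch:
        # Assign the record to its event-time window, whenever it arrived.
        window_start = event_time_s - event_time_s % WINDOW_S
        cell = partial[(window_start, dimension)]
        cell[0] += 1
        cell[1] += value
    return {key: tuple(cell) for key, cell in partial.items()}


def merge_partials(total, partial):
    """Fold an incremental result into the full aggregate, in place."""
    for key, (count, value_sum) in partial.items():
        old_count, old_sum = total.get(key, (0, 0.0))
        total[key] = (old_count + count, old_sum + value_sum)
    return total
```

Because counts and sums merge associatively, a record that shows up in a later batch still lands in the correct event-time window, and a compound indicator such as an average can be derived afterwards as sum divided by count.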

24.Petra •  Exactly Once Guarantee •  Data Skew •  Processing Late Data
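
The slide names data skew as a challenge but does not describe the remedy used. A widely used technique is key salting with two-stage aggregation, sketched below under that assumption (`salted_count` and `num_salts` are hypothetical names): a hot key is first spread over several salted sub-keys so no single worker receives all of its records, and a second stage strips the salt and merges the partial counts.

```python
import random
from collections import Counter


def salted_count(keys, num_salts=4, rng=None):
    """Count keys in two stages so that a hot key is split across num_salts sub-keys."""
    if rng is None:
        rng = random.Random(0)  # fixed seed keeps the sketch reproducible
    # Stage 1: partial counts per (key, salt); each sub-key could run on a different worker.
    stage1 = Counter((key, rng.randrange(num_salts)) for key in keys)
    # Stage 2: drop the salt and merge the partial counts per original key.
    stage2 = Counter()
    for (key, _salt), count in stage1.items():
        stage2[key] += count
    return stage1, stage2
```

The final counts match a direct count; only the intermediate distribution of work changes.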

25.MLX •  A Machine Learning Platform •  Near-Line Training Part Only •  Stream Join •  Large Window
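
To make "stream join" with a "large window" concrete, here is a minimal batch simulation of a time-bounded join. The function name, the `(timestamp_s, key, value)` tuple layout, and the impression/click example are assumptions for illustration; the slide does not describe MLX's actual join.

```python
def interval_join(left, right, window_s):
    """Join (timestamp_s, key, value) events with equal keys and |time difference| <= window_s."""
    # Buffer the right side by key; in a streaming job this buffer is the join state.
    by_key = {}
    for ts, key, value in right:
        by_key.setdefault(key, []).append((ts, value))
    joined = []
    for ts, key, value in left:
        for r_ts, r_value in by_key.get(key, []):
            if abs(ts - r_ts) <= window_s:
                joined.append((key, value, r_value))
    return joined


# Hypothetical usage: attach clicks to impressions within one hour.
impressions = [(0, "a", "imp-a"), (100, "b", "imp-b")]
clicks = [(30, "a", "click-a"), (5000, "b", "click-b")]
samples = interval_join(impressions, clicks, window_s=3600)
```

The buffered side must be retained for the whole window, so the state grows with the window size, which is one reason large windows are demanding in a streaming engine.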

26.Outline •  Introduction •  Practice in Platform Construction •  Realtime Applications •  Challenges and Future Work

27.Are Our Users Satisfied? •  The engine's power and expressiveness are not enough to support business requirements •  Efficiency (development, debugging, troubleshooting, hot-spot machines) is unsatisfactory •  Reliability (end-to-end monitoring, disaster tolerance) falls short of business requirements •  Data quality (loss rate, delay metrics) is low

28.Future Work •  Improve efficiency by building on the platform •  Improve the engine's power and expressiveness by building the realtime data warehouse •  Improve reliability by building out the B-side (merchant-facing) and monitoring businesses
