Zou Dan: Flink in Practice at ByteDance
1 . The Practice of Apache Flink at ByteDance
Company: ByteDance
Role: Big Data Engineering
Speaker: Zou Dan
2 . Outline
Ø Related Background
Ø Streaming Job Management Platform
Ø Production Practices
Ø Future Work
3 . Scale
Ø 5+ Yarn clusters
Ø 10k+ machines
Ø 2k+ jobs
Ø 300+ users
Ø Dozens of products
4 . Flink on Yarn
Ø Independent Yarn clusters
Ø Queues divided by group
Ø Memory and CPU isolation
Ø Business-critical jobs run on labeled machines of the cluster
5 . Streaming Job Management Platform
6 . Streaming Job Management Platform
1 Enable simple operations, e.g. start, stop, and restart.
2 Bind jobs to users (groups) for ease of management.
3 Manage code versions for easy upgrade and rollback.
4 Separate code from configuration.
7 . Streaming Job Management Platform
5 Monitor job status and restart failed jobs automatically.
6 Record operating history for easy tracing.
7 Provide automatic troubleshooting tools.
8 One-stop management.
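Feature 5 (monitor job status and restart failed jobs automatically) could be sketched as a single monitoring pass over the job list. This is only an illustration, not the platform's actual implementation: the `Job` record, the state names, the `restart` callback, and the `max_restarts` limit are all assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Job:
    """Hypothetical job record; field names are illustrative only."""
    name: str
    state: str = "RUNNING"  # assumed states: RUNNING, FAILED, STOPPED
    restarts: int = 0
    history: List[str] = field(default_factory=list)  # operating history (feature 6)


def reconcile(jobs: List[Job],
              restart: Callable[[Job], None],
              max_restarts: int = 3) -> List[str]:
    """Run one monitoring pass and return the names of jobs that were restarted."""
    restarted = []
    for job in jobs:
        if job.state != "FAILED":
            continue  # running jobs and jobs stopped by the user are left alone
        if job.restarts >= max_restarts:
            job.history.append("gave up: restart limit reached")
            continue
        restart(job)  # assumed hook that resubmits the job to the cluster
        job.restarts += 1
        job.state = "RUNNING"
        job.history.append(f"auto-restart #{job.restarts}")
        restarted.append(job.name)
    return restarted
```

Capping restarts is a design choice made for this sketch so that a job failing repeatedly ends up in the operating history for manual troubleshooting instead of looping forever.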
8 . Architecture
9 . Work Flow (Streaming Job Management Platform)
1 Coding: Maven modules are provided.
2 Release: code version management enables easy upgrade and rollback.
3 Register: register jobs and fill in the basic information.
4 Config: add runtime parameters.
5 Start: start jobs.
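The workflow above (release a code version, register the job, add runtime parameters, start) can be pictured with a small data model that keeps code versions and configuration apart, so a rollback changes the code version without touching the parameters. This is a hedged sketch: `JobDefinition`, its fields, and `launch_spec` are hypothetical names, not the platform's API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class JobDefinition:
    """Hypothetical registered job: code versions and config are stored separately."""
    name: str
    versions: List[str] = field(default_factory=list)      # released versions, oldest first
    active: int = -1                                        # index of the version in use
    config: Dict[str, str] = field(default_factory=dict)    # runtime parameters

    def release(self, version: str) -> None:
        """Record a new code version and make it the active one."""
        self.versions.append(version)
        self.active = len(self.versions) - 1

    def rollback(self) -> str:
        """Step back to the previous released version; config is untouched."""
        if self.active <= 0:
            raise ValueError("no earlier version to roll back to")
        self.active -= 1
        return self.versions[self.active]

    def launch_spec(self) -> Dict[str, object]:
        """Combine the active code version with the current runtime parameters."""
        if self.active < 0:
            raise ValueError("no version has been released yet")
        return {"job": self.name,
                "version": self.versions[self.active],
                "params": dict(self.config)}
```

For example, releasing `1.0.0` and `1.1.0`, setting a parameter, and then calling `rollback()` yields a launch spec with version `1.0.0` and the same parameters.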
10 . Flink SQL
Ø Easy to understand, with a lower barrier to entry
Ø Stable API: users do not need to modify their code after a version upgrade
Ø Strong optimization framework: users only need to focus on the business logic
11 . Flink SQL Work Flow
1 Create data sources
2 Register jobs
3 Write SQL
4 Parse and test
5 Configure parameters
6 Start
12 . Write SQL
13 . Test
14 . Job Monitoring
15 . Production Practices
16 . Problem 1: High Operation and Maintenance Pressure
Ø 2000+ jobs, 3+ engineers
Ø Strict real-time requirements
17 . Solution: Automatic Troubleshooting Tools
Troubleshooting of common exceptions (analyze Jobtrace metrics):
Ø Startup exception: TaskManager (TM) startup, JobManager (JM) startup
Ø Runtime exception: frequent restarts, high lag size
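The troubleshooting tree on this slide (startup exceptions for JobManager and TaskManager, runtime exceptions for frequent restarts and high lag) lends itself to a rule-based check. The following is a minimal sketch under stated assumptions: the metric names, the thresholds, and the `diagnose` function are invented for illustration and do not describe the real Jobtrace metrics.

```python
from typing import Dict, List


def diagnose(metrics: Dict[str, float],
             max_restarts_per_hour: float = 5,
             max_lag: float = 100_000) -> List[str]:
    """Classify a job's problems from a flat dict of (hypothetical) metrics."""
    findings = []
    # Startup checks come first: runtime symptoms are meaningless if the job never came up.
    if not metrics.get("jobmanager_started", 0):
        findings.append("startup: JobManager did not start")
    elif metrics.get("taskmanagers_started", 0) < metrics.get("taskmanagers_requested", 0):
        findings.append("startup: TaskManagers missing")
    else:
        if metrics.get("restarts_last_hour", 0) > max_restarts_per_hour:
            findings.append("runtime: frequent restarts")
        if metrics.get("consumer_lag", 0) > max_lag:
            findings.append("runtime: high lag size")
    return findings
```

An empty result means none of the four rules fired; a real tool would presumably attach the relevant logs to each finding.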
18 . Error Log
19 . Data Delay
20 . Problem 2: Hard to Manage Clusters
Ø Insufficient cluster monitoring
Ø Lack of batch operation capabilities
21 . Solution: Cluster Management Tools
Ø Switch: cluster operation switch
Ø Job migration: bulk migration
Ø Monitoring: cluster status monitoring
22 . Problem 3: Jobs Start Slowly
Ø Container allocation by Yarn is slow
Ø Starting Flink jobs is slow
23 . Solution: Job Start Speedup
Ø Yarn scheduling optimization
Ø Share public resources
Ø Start once slots meet the minimum requirement
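The last point, starting once slots meet a minimum requirement instead of waiting for every requested slot, can be expressed as a simple threshold check. This is a sketch only: the `can_start` helper and the 80% default are assumptions made here, not values from the talk.

```python
def can_start(available_slots: int, requested_slots: int, min_percent: int = 80) -> bool:
    """Return True once enough slots are allocated to begin running the job.

    min_percent is a hypothetical threshold; integer arithmetic avoids
    floating-point rounding at the boundary.
    """
    if not 0 < min_percent <= 100:
        raise ValueError("min_percent must be in (0, 100]")
    # Ceiling of requested_slots * min_percent / 100.
    needed = (requested_slots * min_percent + 99) // 100
    return available_slots >= needed
```

With 10 requested slots and the assumed 80% threshold, the job would start at 8 allocated slots; the remaining containers can join when Yarn delivers them.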
24 . Problem 4: Machine Resource Shortage
Ø Fast business growth
Ø Machine resources are limited
25 . Solution: Automatic Resource Adjustment
Ø Restart: Application level, based on the last 24 hours
Ø Runtime: Container level, based on the last 1 hour
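One way to read this slide is that resources are re-sized from recent usage, with a longer look-back window (last 24 hours) when a job restarts and a shorter one (last 1 hour) while it is running. The sketch below illustrates that idea for memory only; the sample format, the 20% headroom, and the 1024 MB floor are assumptions made for this example.

```python
from typing import List, Tuple

# (age_in_hours, used_memory_mb): a hypothetical usage sample format.
Sample = Tuple[float, int]


def recommend_mb(samples: List[Sample],
                 window_hours: float,
                 headroom_percent: int = 120,
                 floor_mb: int = 1024) -> int:
    """Recommend a memory size from the peak usage inside the look-back window."""
    recent = [used for age, used in samples if age <= window_hours]
    if not recent:
        return floor_mb  # no data in the window: fall back to the floor
    # Peak usage plus headroom, never below the floor.
    return max(floor_mb, max(recent) * headroom_percent // 100)
```

Calling `recommend_mb(samples, 24)` would model the adjustment applied at restart and `recommend_mb(samples, 1)` the one applied at runtime, under the reading described above.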
26 . Problem 5: Instability
Ø Big jobs are not stable enough
Ø Job upgrades are not smooth
27 . Solution: Job Segmentation
A big job is split into sub-job1, sub-job2, sub-job3, and sub-job4 (shares shown: 10%, 20%, 30%, 40%).
Ø Stable running
Ø Smooth upgrades
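Job segmentation can be illustrated by dividing a big job's input partitions among sub-jobs in proportion to a list of weights. The slide does not say what is being divided or which share belongs to which sub-job, so the partition-based split and the `split_partitions` helper below are assumptions made purely for illustration.

```python
from typing import List, Sequence


def split_partitions(partitions: Sequence[int], weights: Sequence[int]) -> List[List[int]]:
    """Split partitions into contiguous groups sized in proportion to weights."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    groups, start, cumulative = [], 0, 0
    for weight in weights:
        cumulative += weight
        # Cumulative boundaries guarantee every partition lands in exactly one group.
        end = len(partitions) * cumulative // total
        groups.append(list(partitions[start:end]))
        start = end
    return groups
```

With ten partitions and weights of 10, 20, 30, and 40, the four sub-jobs receive 1, 2, 3, and 4 partitions. Each sub-job can then be upgraded or restarted on its own, which is the stability and smooth-upgrade benefit the slide names.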
28 . Future Work
Ø Promote Flink SQL
Ø Enable more business scenarios
Ø Improve stability
Ø Contribute back to the community
29 . Q&A