Zou Dan: The Practice of Apache Flink at ByteDance

[Track 3, Session 01, Zou Dan] The Practice of Apache Flink at ByteDance

1. The Practice of Apache Flink at ByteDance. Company: ByteDance. Title: Big Data Engineer. Speaker: Zou Dan.

2. Outline: Related Background; Streaming Job Management Platform; Production Practices; Future Work.

3. Scale: 5+ Yarn clusters; 10k+ machines; 2k+ jobs; 300+ users; dozens of products.

4. Flink on Yarn: independent Yarn clusters; queues divided by business group; business-critical jobs run on labeled machines of the cluster; memory and CPU isolation.

5. Streaming Job Management Platform

6. Streaming Job Management Platform: (1) Simple one-click operations such as start, stop, and restart. (2) Jobs are bound to users (groups) for ease of management. (3) Code versions are managed so that upgrade and rollback are simple. (4) Code and configuration are kept separate.

7. Streaming Job Management Platform: (5) Monitor job status and restart failed jobs automatically. (6) Record operation history for easy tracing. (7) Provide automatic troubleshooting tools. (8) One-stop management.
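Feature (5) can be pictured with a small sketch. The slides do not show how the platform detects or restarts failed jobs, so everything here is hypothetical: the `JobStatus` record, the state names, the `restart` callback, and the restart budget `max_restarts` are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class JobStatus:
    job_id: str
    state: str  # hypothetical state names, e.g. "RUNNING" or "FAILED"


def restart_failed_jobs(
    statuses: List[JobStatus],
    restart: Callable[[str], None],
    restart_counts: Dict[str, int],
    max_restarts: int = 3,
) -> List[str]:
    """Restart each FAILED job that still has restart budget left.

    Returns the ids of the jobs that were restarted in this pass.
    """
    restarted = []
    for status in statuses:
        if status.state != "FAILED":
            continue
        used = restart_counts.get(status.job_id, 0)
        if used >= max_restarts:
            # Budget exhausted: leave the job for a human to investigate.
            continue
        restart(status.job_id)
        restart_counts[status.job_id] = used + 1
        restarted.append(status.job_id)
    return restarted
```

A restart budget is one way to keep a job that fails immediately on every start from looping forever; whether the platform applies such a limit is not stated in the slides.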

8. Architecture

9. Work Flow (Streaming Job Management Platform): coding (Maven modules are provided) -> release (code version management, which enables easy upgrade and rollback) -> register (register jobs and fill in the basic information) -> config (add runtime parameters) -> start (start jobs).

10. Flink SQL: Easy to understand, with a lower barrier to entry. Stable API, so users do not need to modify their code after a version upgrade. Strong optimization framework, so users only need to focus on implementing the business logic.

11. Flink SQL Work Flow: create data sources; write SQL; parse and test; register jobs; configure parameters; start.

12. Write SQL

13. Test

14. Job Monitoring

15. Production Practices

16. Problem 1: High Operation and Maintenance Pressure. 2000+ jobs, 3+ engineers. Strict real-time requirements.

17. Solution: Automatic Troubleshooting Tools. Troubleshooting covers two kinds of exception. Startup exception: TaskManager (TM) startup, JobManager (JM) startup. Runtime exception: frequent restarts, high lag size. Inputs: analyze Jobtrace, metrics.
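The troubleshooting tree on this slide can be read as a simple decision procedure. The sketch below is only an illustration of that tree: the `JobSnapshot` fields and both thresholds are hypothetical, and the slides do not describe how the actual tool gathers or evaluates its data.

```python
from dataclasses import dataclass


@dataclass
class JobSnapshot:
    # All fields are hypothetical inputs a diagnosis tool might collect.
    jobmanager_started: bool
    taskmanagers_started: bool
    restarts_last_hour: int
    lag_size: int  # number of unconsumed records


def diagnose(
    snapshot: JobSnapshot,
    max_restarts: int = 5,
    max_lag: int = 1_000_000,
) -> str:
    """Walk the tree: startup exceptions first, then runtime exceptions."""
    if not snapshot.jobmanager_started:
        return "startup exception: JobManager startup"
    if not snapshot.taskmanagers_started:
        return "startup exception: TaskManager startup"
    if snapshot.restarts_last_hour > max_restarts:
        return "runtime exception: frequent restarts"
    if snapshot.lag_size > max_lag:
        return "runtime exception: high lag size"
    return "no problem detected"
```

Startup checks come first because a job whose JobManager never started cannot produce meaningful restart or lag figures.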

18. Error Log

19. Data Delay

20. Problem 2: Hard to Manage Clusters. Insufficient cluster status monitoring. Lack of batch operation capabilities.

21. Solution: Cluster Management Tools. Operation switch: cluster operation switch. Job migration: bulk migration of jobs. Monitoring: cluster status monitoring.

22. Problem 3: Jobs Start Slowly. Container allocation by Yarn is slow. Starting Flink jobs is slow.

23. Solution: Job Start Speedup. Yarn scheduling optimization. Share public resources. Start as soon as the slots meet the minimum requirement.
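The third point, starting once the slots meet a minimum requirement, amounts to a threshold check. The slides do not say how the minimum is defined, so the sketch below assumes it is a percentage of the requested slots; the function name and the `min_percent` parameter are hypothetical.

```python
def can_start(available_slots: int, requested_slots: int, min_percent: int = 80) -> bool:
    """Return True once enough slots are allocated to start the job.

    The required count is requested_slots * min_percent / 100, rounded up.
    Integer arithmetic is used to avoid floating-point rounding surprises.
    """
    if requested_slots <= 0:
        raise ValueError("requested_slots must be positive")
    required = -(-requested_slots * min_percent // 100)  # ceiling division
    return available_slots >= required
```

With the assumed default of 80 percent, a job requesting 100 slots would start at 80 allocated slots rather than waiting for the last 20 containers.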

24. Problem 4: Machine Resources Shortage. Fast business growth. Machine resources are limited.

25. Solution: Automatic Resource Adjustment. Diagram labels: Restart, Runtime; Application, Container; Last 24 hours, Last 1 hour.
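One way to picture an automatic adjustment is to size a container from its recent usage. The slide only names the time windows (last 24 hours, last 1 hour), not the algorithm, so this sketch is an assumption throughout: it takes the peak of the usage samples in a window, adds headroom, and clamps the result. The function name, `headroom_percent`, and the bounds are hypothetical.

```python
from typing import Sequence


def recommend_memory_mb(
    usage_samples_mb: Sequence[int],
    headroom_percent: int = 20,
    min_mb: int = 1024,
    max_mb: int = 16384,
) -> int:
    """Recommend a container memory size from usage samples in one window.

    Takes the observed peak, adds headroom (rounded up), and clamps the
    result to [min_mb, max_mb].
    """
    if not usage_samples_mb:
        raise ValueError("at least one usage sample is required")
    peak = max(usage_samples_mb)
    target = -(-peak * (100 + headroom_percent) // 100)  # ceiling division
    return max(min_mb, min(max_mb, target))
```

A longer window such as 24 hours captures daily traffic peaks and suits changes applied at restart, while a 1-hour window reacts faster; how the two windows on the slide are actually used is not stated.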

26. Problem 5: Instability. Big jobs are not stable enough. Job upgrades are not smooth.

27. Solution: Job Segmentation. A big job is split into sub-job1, sub-job2, sub-job3, and sub-job4; the diagram shows shares of 10%, 40%, 20%, and 30%. Benefits: stable running; smooth upgrading.
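A split by percentage can be sketched as dividing input partitions among sub-jobs. The slides do not say what unit is split or which sub-job receives which share, so the partition-based approach, the function name, and the share mapping in the usage note are assumptions.

```python
from typing import Dict, List


def split_partitions(num_partitions: int, shares: Dict[str, int]) -> Dict[str, List[int]]:
    """Assign partitions 0..num_partitions-1 to sub-jobs in proportion to shares.

    Uses the largest-remainder method so every partition is assigned exactly once.
    """
    total = sum(shares.values())
    if total <= 0:
        raise ValueError("shares must sum to a positive number")
    # Floor of each sub-job's proportional quota.
    quotas = {name: num_partitions * share // total for name, share in shares.items()}
    leftover = num_partitions - sum(quotas.values())
    # Hand the remaining partitions to the largest fractional remainders.
    by_remainder = sorted(
        shares, key=lambda name: (num_partitions * shares[name]) % total, reverse=True
    )
    for name in by_remainder[:leftover]:
        quotas[name] += 1
    # Give each sub-job a contiguous range of partition ids.
    assignment: Dict[str, List[int]] = {}
    start = 0
    for name in shares:
        assignment[name] = list(range(start, start + quotas[name]))
        start += quotas[name]
    return assignment
```

For example, `split_partitions(10, {"sub-job1": 10, "sub-job2": 40, "sub-job3": 20, "sub-job4": 30})` gives the four sub-jobs 1, 4, 2, and 3 partitions respectively. Because each sub-job is an independent job, one can be upgraded or restarted while the others keep running, which matches the stability and smooth-upgrade benefits named on the slide.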

28. Future Work: Promote Flink SQL. Enable more business scenarios. Improve stability. Contribute back to the community.

29. Q&A