陈华曦--基于Apache Flink的搜索处理平台

陈华曦--基于Apache Flink的搜索处理平台
展开查看详情

1.ElasticFlow चԭApache Flinkጱ൤ᔱහഝ॒ቘଘ‫ݣ‬ Search offline platform based on Apache Flink ‫ݪل‬ғᴨ᯾૬૬ ᘳ֖ғṛᕆದ๞ӫਹ ᄍᦖᘏғᴯ܏เ

2.Search&Rec in Alicloud Elastic Open AI Rec Search Search (beta)

3.Search/Rec Offline Data Process RDS Offline System Open DTS Search Max Compute ODPS Elastic Search Stream Compute SLS AI Rec Full build Incremental build Rebuild all the documents from data sources with specific Synchronize the real time updates from data sources to search interval(daily). engine. Ex: RDS binlog parsed by DTS

4.Pain Points Full Inc Monitor Different Complex business Perf. and Ops datasources

5.Elastic Flow რᛔᴨ᯾൤ᔱᐶᕚଘ‫҅ݣ‬൉‫׀‬Ӟᒊୗහഝ॒ቘӱ‫̵ݎ୏ۓ‬᮱ᗟ̵ᬩᖌᚆ‫҅ێ‬ჿ᪃አ ಁਖ਼ग़໏۸හഝ॒ቘଚ੕‫ف‬൤ᔱവគ୚කጱᵱ࿢̶ • ᶎ‫ݻ‬አಁጱ୏‫ݎ‬ኴᶎ҅Ӟེ୏‫҅ݎ‬ᛔۖኞ౮‫ق‬ᰁीᰁձ‫ۓ‬ၞᑕ • ၹᰁහഝṛ௔ᚆ҅ჿ᪃‫ق‬ᰁṛ‫҅ݺރ‬ीᰁṛਫ෸ጱᥝ࿢ • ୑຅හഝრඪ೮ғ5'62'36'766/6.DINDᒵᒵ • ࢵᴬ۸ग़UHJLRQᛔۖ۸᮱ᗟ҅ग़ᖌଶፊഴಸᦄ

6.Elastic Flow Arch Elastic Flow SAROS Catalog Swift Apache Flink Maat HBase Queue Yarn Hippo HDFS Compute Storage

7.Business Graph Business Table Abstract table, was composed of full table and inc table. The two tables are sharing the same schema and business Full:RDS Inc:DTS Schema

8.RDS DTS FlinkJ FlinkJ ODPS obs HBASE obs ES SLS Sync Layer Process Layer

9.Job Generate Flink SQL Full Inc Business Join Join Graph Compute Flink SQL Job Graph Translator Full Inc Sync Sync Optimized Graph Maat DAG Publis Schedule Maat Build h Job Graph Translator Start

10.Incremental sync SQL Full sync SQL CREATE TABLE DTSSource_test_table ( CREATE TABLE MysqlSource_test_table ( `path` VARCHAR, `path` VARCHAR, `status` VARCHAR, `status` VARCHAR, `receive_id` VARCHAR `receive_id` VARCHAR ) with ( ) with ( tableFactoryClass tableFactoryClass =‘DTSExTableFactory’ =‘MySQLScanTableFactory’ ) ) INSERT INTO HbaseSink_main_table INSERT INTO HfileSink_main_table SELECT SELECT rowKeyGen(`receive_id`, 2000, 4) AS rowkey, rowKeyGen(`receive_id`, 2000, 4) AS rowkey, `path`, `path`, `status`, `status`, `receive_id` `receive_id`, FROM DTSSource_test_table FROM MysqlSource_test_table where receive_id <> ''; where receive_id <> '';

11.Optimization Ø Async dim join increase throughput while reading Hbase Ø Flink state on Niagara reduces Hbase IO Ø Blink 2.2.x new feature: Credit Based flow control, network buffer zero copy Ø Object reuse leads to less GC

12.Batch Ø Batch/Stream unified SQL boost dev efficiency Ø Bulkload mass data to Hbase though Flink batch and hfilesink Ø External Shuffle Service Ø Per unit schedule uses less resource for the same job

13.Hbase table design

14.Inside Alibaba Ø Tens of billions of docs per day, nearly 1 PB. Lazada Youku Aliexpress Ø Almost 1 million tps in incremental build. Ding Hema Flypig Talk Ø Double 11 for three years ᘸ‫ښ‬ᓒ Xianyu Ⴃਮ

15.ᥢ‫޾ښ‬᪠ᕚࢶ ੒ള(6҅ඪ೮5'6 ੒ള2SHQ6HDUFK҅ඪ ౲ᘏ2'36‫ܔ‬ᤒ‫ݶ‬ ೮ṛᕆᇇ޾ᇿՁࣳᵞᗭ ྍ̶૪ᕪ‫ل‬ၥ ጱग़໏۸ᵱ࿢̶ࢵᴬ۸ ग़‫܄‬ऒ᮱ᗟ̶ ଙ์ ଙ์ ଙ์ ଙ์ ੒ള$,5HF҅൉‫࣋׀‬ว۸ ᭐‫)ڊ‬OLQN )XFWLRQ҅ඪ ጱᓒဩૡᑕᚆ‫҅ێ‬۱ೡᇙ ೮ग़හഝᤒ॔๥॒ቘ҅ ஄‫҅ྍݶ‬ཛྷࣳᦒᕞᒵ ۱ೡ-RLQ̵)LOWHUᒵ

16.