MLSQL, A Powerful DSL for BigData and AI

Published by 祝威廉 on 2018/11/15 10:45


1. MLSQL Overview

2. About me
祝海林, senior big data architect at 丁香园 (DXY)
Tech blog:
Open source projects:

3. What are the most successful fields for AI? Self-driving; security and protection

4. And games… AlphaGo

5. How should AI be applied? Not as an atom bomb: not one algorithm, but a bunch of algorithms that empower every step and that everyone can use.

6.00 What’s MLSQL

7. Why design this? 1. BigData and AI should not be separated. 2. A language everyone can take control of. 3. AI should be an easy thing.

8. Advantages. Easy: an SQL script everyone can use. Flexible: code with Python/Scala if you want to enhance SQL. Full scenes: Batch / Stream / Crawler / AI.

9. Disadvantages. Documentation is not enough. Lack of real cases. Under active development; the API is not stable.

10. Talk (the MLSQL machine learning platform in practice)
01 Let's rock DL: Cifar10
02 Stream my data
03 Crawl the web
04 How we make ML real simple
05 Show me the case
Please keep quiet.

11.01 Let’s rock DL Cifar10

12. Load and convert images

run emptyData as ImageLoaderExt.`/Users/allwefantasy/Downloads/cifar/train` where
code='''
def apply(params:Map[String,String]) = {
  Resize(28, 28) -> MatToTensor() -> ImageFrameToSample()
}
'''
as data;

13. Extract label from path

-- convert image path to number label
select split(split(imageName,"_")[1],"\\.")[0] as labelStr, features from data as newdata;

train newdata as StringIndex.`${labelMappingPath}` where
inputCol="labelStr" and outputCol="labelIndex"
as newdata1;

predict newdata as StringIndex.`${labelMappingPath}` as newdata2;

select (cast(labelIndex as int) + 1) as label, features from newdata2 as newdata3;

14. Extract label from path (the other way)

-- convert image path to number label
register ScriptUDF.`` as extract_label options
lang="scala"
and code='''
def apply(path:String) = {
  path.split("_")(1).split("\\.").head
}
''';

select extract_label(label) as labelStr, features from data as newdata;

train newdata as StringIndex.`${labelMappingPath}` where
inputCol="labelStr" and outputCol="labelIndex"
as newdata1;

15.Cast label to float select array(cast(label as float)) as label,features from newdata3 as newdata4;

16. Train

-- train with LeNet5 model
train newdata4 as BigDLClassifyExt.`${modelPath}` where
fitParam.0.featureSize="[3,28,28]"
and fitParam.0.classNum="10"
and fitParam.0.maxEpoch="50"
and fitParam.0.code='''
def apply(params:Map[String,String]) = {
  val model = Sequential()
  model.add(Reshape(Array(3, 28, 28), inputShape = Shape(28, 28, 3)))
  ...
  model.add(Dense(params("classNum").toInt, activation = "softmax").setName("fc2"))
}
''';
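The slide elides the middle of the network. For reference, a plausible LeNet-5-style body in BigDL's Keras-like Scala API, modeled on BigDL's public LeNet example and adjusted for 3-channel Cifar10 input, might look like this (an illustrative sketch, not the author's exact code):

def apply(params:Map[String,String]) = {
  val model = Sequential()
  model.add(Reshape(Array(3, 28, 28), inputShape = Shape(28, 28, 3)))
  // two conv/pool stages as in LeNet-5 (filter counts are illustrative)
  model.add(Convolution2D(6, 5, 5, activation = "tanh").setName("conv1_5x5"))
  model.add(MaxPooling2D())
  model.add(Convolution2D(12, 5, 5, activation = "tanh").setName("conv2_5x5"))
  model.add(MaxPooling2D())
  model.add(Flatten())
  model.add(Dense(100, activation = "tanh").setName("fc1"))
  // classNum comes from fitParam.0.classNum above
  model.add(Dense(params("classNum").toInt, activation = "softmax").setName("fc2"))
}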

17. Batch predict

-- batch predict
predict newdata4 as BigDLClassifyExt.`${modelPath}` as predictdata;

18. API predict

-- deploy with api server
register BigDLClassifyExt.`/tmp/bigdl` as mnistPredict;

select vec_argmax(mnistPredict(vec_dense(features))) as predict_label, label
from data as output;

19. That's all. No deploy, no environment, no complication. People who know SQL can do deep learning now. No code if you want: it is all SQL-like.

20.02 Stream My Data

21. Set stream name

-- the stream name, should be unique.
set streamName="streamExample";

22. Load a never-ending table from Kafka

-- if you are using kafka 1.0
load kafka.`pi-content-realtime-db` options
`kafka.bootstrap.servers`="---"
as kafka_post_parquet;

-- if you are using kafka 0.8.0/0.9.0
load kafka9.`pi-content-realtime-db` options
`kafka.bootstrap.servers`="---"
as kafka_post_parquet;


24. Or we mock some data

-- mock some data.
set data='''
{"key":"yes","value":"a,b,c","topic":"test","partition":0,"offset":0,"timestamp":"2008-01-24 18:01:01.001","timestampType":0}
{"key":"yes","value":"d,f,e","topic":"test","partition":0,"offset":1,"timestamp":"2008-01-24 18:01:01.002","timestampType":0}
{"key":"yes","value":"k,d,j","topic":"test","partition":0,"offset":2,"timestamp":"2008-01-24 18:01:01.003","timestampType":0}
{"key":"yes","value":"m,d,z","topic":"test","partition":0,"offset":3,"timestamp":"2008-01-24 18:01:01.003","timestampType":0}
{"key":"yes","value":"o,d,d","topic":"test","partition":0,"offset":4,"timestamp":"2008-01-24 18:01:01.003","timestampType":0}
{"key":"yes","value":"m,m,m","topic":"test","partition":0,"offset":5,"timestamp":"2008-01-24 18:01:01.003","timestampType":0}
''';

-- load data as table
load jsonStr.`data` as datasource;

-- convert table as stream source
load mockStream.`datasource` options
stepSizeRange="0-3"
and valueFormat="csv"
and valueSchema="st(field(column1,string),field(column2,string),field(column3,string))"
as newkafkatable1;

25. Processing

-- project the parsed columns out of the stream
select column1, column2, column3, kafkaValue from newkafkatable1 as table21;
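The statement above is a plain projection. An actual aggregation over the same mock stream might look like the sketch below; column names follow the mock data above, and in practice a windowed aggregation with a watermark, as shown on the watermark slide later, is the safer pattern for streams:

-- illustrative streaming aggregation (sketch, not from the original deck)
select column1, count(*) as num
from newkafkatable1
group by column1
as table21;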

26. Save stream

save append table21
as newParquet.`/table1/hp_stat_date=${date.toString("yyyy-MM-dd")}`
options mode="Append"
and duration="30"
and checkpointLocation="/tmp/ckl1";

27. That's all? No. We also support watermark.

select ..... as table1;

-- register watermark for table1
register WaterMarkInPlace.`table1` as tmp1 options
eventTimeCol="ts"
and delayThreshold="1 seconds";

-- process table1
select count(*) as num from table1
group by window(ts,"30 minutes","10 seconds")
as table2;

save append ......

28.03 Crawl the web

29. How to load a link list from a web page

load crawlersql.`` options
matchXPath="//ul[@id='feedlist_id']//div[@class='title']//a/@href"
and fetchType="list"
and `page.type`="scroll"
and `page.num`="10"
and `page.flag`="feedlist_id"
as aritle_url_table_source;

30. More processing

-- crawl the full text and store it
select crawler_request(regexp_replace(url,"http://","https://")) as html
from aritle_url_table_source where url is not null
as aritle_list;

save overwrite aritle_list as parquet.`${resultTempStore}`;

-- parse the content
load parquet.`${resultTempStore}` as aritle_list;

select crawler_auto_extract_title(html) as title,
crawler_auto_extract_body(html) as body,
crawler_extract_xpath(html,"//main/article//span[@class='time']")
from aritle_list where html is not null
as article_table;

31.04 How we make ML real simple

32. Pain points

It's really sad for a developer who knows Scala well to convert Python code to Scala.
It's really sad for an algorithm engineer to teach developers to do the converting. One week?
It's really sad to develop a new API every time a new model is created. 1-3 days?
It's really sad when deployment takes too much time; the Python environment is complex. 1-3 days?

33. What's a model?

Data processing contains algorithm training; a model is just a black box for data processing.
BlackBox = Estimator/Transformer.

train:   Table + BlackBox -> model
predict: Table + model    -> table
run:     Table + BlackBox -> table
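Concretely, the three black-box forms map to three MLSQL statement shapes. The module names, paths and parameters below are placeholders for illustration, following the syntax used elsewhere in this deck (SomeTransformer is hypothetical):

-- train: table + Estimator -> model, persisted at the given path
train trainingTable as RandomForest.`/tmp/rf` where fitParam.0.maxDepth="5" as trainInfo;

-- predict: table + model -> table
predict testTable as RandomForest.`/tmp/rf` as predictedTable;

-- run: table + Transformer -> table, no model is produced
run inputTable as SomeTransformer.`` where someParam="value" as outputTable;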

34. Data processing in MLSQL

Data processing == MLSQL (UDF/UDAF + Model) over Data/Model.
MLSQL is the grammar; Python/Scala algorithms, modules and SQL functions are the operators.

35. What's an API service?

The process: json -> table -> transformer (e.g. a model) -> table -> json.

1. Low latency
2. The input table should be small
3. HTTP(S)/RPC protocol

36. What an API service looks like

37. How we model an API service

It's best to use functions when we deploy a Transformer as an API service:
json -> table -> API Server -> table -> json.

A register statement over a persisted model (e.g. hdfs:/models/randomforest) turns its Scala/Python code into a SQL function, so prediction becomes:

select predict(tfidf(item)) as predictRes
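In MLSQL terms, the slide's fragments combine roughly as below; the pattern mirrors the mnistPredict example on slide 18, while the table name is an assumption and tfidf is assumed to be a previously registered feature UDF (the next slides show where it comes from):

-- turn the persisted model into a SQL function
register RandomForest.`hdfs:/models/randomforest` as predict;

-- serve predictions as a plain select over the request table
select predict(tfidf(item)) as predictRes from requestTable as output;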

38. Train generates transformers

The train stage generates transformers (e.g. a model). Like Python/Scala code written in UDF/UDAF style, they become SQL functions that are automatically converted to UDFs for the predict stage.

39. Understand it deeper

Train stage: raw text -> analysis -> TfIdf -> RandomForest, and each step is registered.
Predict stage: raw text -> udf1 -> udf2 -> udf3:

select udf3(udf2(udf1(item))) as predictRes
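A minimal end-to-end sketch of the two stages. TfIdfInPlace and RandomForest are standard MLSQL modules, but the paths, columns and parameters are illustrative assumptions; the slide's extra raw-text analysis step would simply be one more registered UDF in the chain:

-- Train stage: each step persists its transformer/model
train rawTextTable as TfIdfInPlace.`/tmp/tfidf` where inputCol="content" as featurized;
train featurized as RandomForest.`/tmp/rf` as trainInfo;

-- Predict stage: register every persisted stage as a UDF, then chain them
register TfIdfInPlace.`/tmp/tfidf` as udf1;
register RandomForest.`/tmp/rf` as udf2;

select udf2(udf1(content)) as predictRes from newData as output;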

40. That's it. We never write code again for model deployment and feature engineering: they all come from the training stage.

41.05 Show me the case

42. MLSQL architecture

Analysts, algorithm engineers, R&D and business departments work through Skone (Web/API), which provides Auth, Scheduler, Job Manager and more on top of StreamingPro SQL Servers. External sources such as MySQL are registered as temp tables (for example with the statement below), the analyser generates Parquet/CarbonData, and a StreamingPro Batch layer covers Batch/Hive workloads.

connect jdbc where
driver="com.mysql.jdbc.Driver"
and url="jdbc:mysql://"
and user="root"
and password="****"
as db1;

43. Cases

Business departments create their own BI systems based on Skone, e.g. an exact-marketing system (EDM); scripts can be invoked through the Skone API.
The algorithm team has created and deployed many models without much development trouble.
Many streams have been deployed.

44. Thanks! Questions are welcome. Please make some noise.

