Explains what problem each piece of the grammar solves, why it was designed, and how to use it.

Published by William Zhu (祝威廉) on 2018/11/07


1.MLSQL grammar

2. About me: Zhu Hailin (祝海林), senior big data architect at DXY (丁香园). Tech blog: http://www.jianshu.com/u/59d5607f1400 Open-source projects: https://github.com/allwefantasy

3.00 What's MLSQL

4.Why design this? 1. Big data and AI should not be separated 2. A language everyone can take control of 3. AI should be an easy thing

5.Advantages Easy: no Spark, no Python, no Scala; just a SQL script everyone can use. Flexible: drop down into Python/Scala code when you need to. Full coverage: batch / streaming, plus crawler and AI if you want.

6.Disadvantages Documentation is not enough yet. Lack of real-world cases. Under active development, so the API is not stable.

7.Summary 01 Load/Connect statement 02 Select statement 03 Train/run/predict statement 04 Register statement 05 Powerful Set statement 06 Include statement 07 Save statement

8.01 Load/Connect

9.Why design this? There are tons of data sources in reality (JSON, MySQL, ElasticSearch, CSV, Parquet, LibSVM, ...), and many of them have their own loading mechanism. How do we unify them? Load them all as tables.

10.Grammar ('load'|'LOAD') format '.' path ('options'|'where')? expression? booleanExpression* 'as' tableName Load data with `format` from location `path`; the load options are the `expressions`. Once loaded, the source is referenced as `tableName`.

11.MySQL
load jdbc.`mysql-1.table1` options
driver="com.mysql.jdbc.Driver"
and url="jdbc:mysql://127.0.0.1:3306/...."
and user="..."
and password="...."
as table1;
How do we control the load parallelism from MySQL? (partitionColumn, lowerBound, upperBound, numPartitions)

12.HDFS file
load parquet.`path` as table1;
load csv.`path` where header="true" and inferSchema="true" as table1;
load json.`path` as table1;
Note that any value in an MLSQL options/where statement is a string.

13.Simplify load: the Connect statement
connect jdbc where truncate="true"
and driver="com.mysql.jdbc.Driver"
and url="jdbc:mysql://127.0.0.1:3306/...."
and user="..."
and password="...."
as mysql-1;
load jdbc.`mysql-1.table1` as table1;
select * from table1 as output;

14.Grammar ('connect'|'CONNECT') format 'where'? expression? booleanExpression* ('as' db)? Connect to the data source of type `format`; the connection options are the `expressions`; we refer to this connection as `db`.

15.How to load a Hive table? load hive.`db.table` as table1;

16.How to load raw text without a schema? load text.`/Users/allwefantasy/CSDNWorkSpace/tmtagsHistory` as table1;

17.How to load an ES table
connect es where `es.nodes`="esnode1"
and `es.net.http.auth.user`=""
and `es.net.http.auth.pass`=""
as es1;
load es.`es1.pi_top/dxy` as es_pi_top;
Property names may contain dots; just wrap them in backticks: `es.net.http.auth.user`=""
Multi-line values use triple quotes: `es.net.http.auth.user`='''
hello
'''

18.How to load an HBase table
load hbase.`wq_tmp_test1` options rowkey="id"
and zk="---"
and inputTableName="wq_tmp_test1"
and `field.type.field1`="FloatType"
and `field.type.field2`="TimestampType"
and `field.type.field3`="BinaryType"
as testhbase;

19.How to load a link list from a web page
load crawlersql.`https://www.csdn.net/nav/ai` options
matchXPath="//ul[@id='feedlist_id']//div[@class='title']//a/@href"
and fetchType="list"
and `page.type`="scroll"
and `page.num`="10"
and `page.flag`="feedlist_id"
as aritle_url_table_source;

20.More processing
-- Crawl the full content and store it
select crawler_request(regexp_replace(url,"http://","https://")) as html
from aritle_url_table_source where url is not null
as aritle_list;
save overwrite aritle_list as parquet.`${resultTempStore}`;
-- Parse the content
load parquet.`${resultTempStore}` as aritle_list;
select crawler_auto_extract_title(html) as title,
crawler_auto_extract_body(html) as body,
crawler_extract_xpath(html,"//main/article//span[@class='time']") as time
from aritle_list where html is not null
as article_table;

21.Load a literal as a table
set jsonStr='''
{"features":[5.1,3.5,1.4,0.2],"label":0.0}
{"features":[5.1,3.5,1.4,0.2],"label":1.0}
{"features":[5.1,3.5,1.4,0.2],"label":0.0}
{"features":[4.4,2.9,1.4,0.2],"label":0.0}
{"features":[5.1,3.5,1.4,0.2],"label":1.0}
{"features":[5.1,3.5,1.4,0.2],"label":0.0}
{"features":[5.1,3.5,1.4,0.2],"label":0.0}
{"features":[4.7,3.2,1.3,0.2],"label":1.0}
{"features":[5.1,3.5,1.4,0.2],"label":0.0}
{"features":[5.1,3.5,1.4,0.2],"label":0.0}
''';
load jsonStr.`jsonStr` as data;

22.Is that all? No...
load modelParams.`RandomForest` as output; -- check the available params of a module
load modelExample.`RandomForest` as output; -- check the doc and example of a module
load modelExplain.`/tmp/model` where alg="RandomForest" as output; -- check the params of a trained model

23.02 Select

24.Grammar ('select'|'SELECT') ~(';')* 'as' tableName The same as select in Spark SQL; we just add the suffix 'as' tableName, so the result can be referenced easily later.
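As a minimal sketch of this chaining (the table and column names here are illustrative, not from the slides):

```sql
-- assumes some table1 was loaded earlier; each select names its result
select name, count(*) as cnt from table1 group by name as name_counts;
-- the named result can be referenced by later statements
select * from name_counts where cnt > 10 as frequent_names;
```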

25.Lots of UDFs available https://github.com/allwefantasy/streamingpro/blob/master/docs/functions.md https://github.com/allwefantasy/streamingpro/blob/master/docs/crawler.md#爬虫相关udf函数 Most models can also be registered as UDFs for use in select statements.

26.03 Train/run/predict

27.Why design this? There should be a better way to do things like fit/transform:
train: Table -> BlackBox -> model
predict: Table + model -> table
run: Table -> BlackBox -> table

28.Grammar ('train'|'predict'|'run') tableName 'as' format '.' path 'where'? expression? booleanExpression* ('as' tableName)*

29.Algorithm
-- use RandomForest
train data1 as RandomForest.`/tmp/model` where
keepVersion="true"
and evaluateTable="data1"
-- specify group 0 parameters
and `fitParam.0.featuresCol`="features"
and `fitParam.0.labelCol`="label"
and `fitParam.0.maxDepth`="2"
-- specify group 1 parameters
and `fitParam.1.featuresCol`="features"
and `fitParam.1.labelCol`="label"
and `fitParam.1.maxDepth`="10";
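To round out the train/predict pair, a hedged sketch of using the model trained above (the UDF name rf_predict is hypothetical; syntax follows the grammar on the previous slide and the Register statement from the summary):

```sql
-- apply the trained model to a table (predict)
predict data1 as RandomForest.`/tmp/model`;

-- or register the model as a UDF and use it in a select
register RandomForest.`/tmp/model` as rf_predict;
select rf_predict(features) as predicted_label from data1 as predictions;
```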