This deck explains, for each piece of MLSQL syntax, what problem it solves, why it was designed, and how to use it.

Published by 祝威廉 on 2018/11/07 19:13


1.MLSQL grammar

2. About me: 祝海林, Senior Big Data Architect at 丁香园. Tech blog: http://www.jianshu.com/u/59d5607f1400 Open-source projects: https://github.com/allwefantasy

3.00 What’s MLSQL

4.Why design this? 1. Big data and AI should not be separated 2. A language everyone can take control of 3. AI should be an easy thing

5.Advantages Easy: no Spark, no Python, no Scala; just a SQL script everyone can use. Flexible: drop down to Python/Scala code when you need to. Full coverage: batch, streaming, and, if you want, crawler and AI.

6.Disadvantages Documentation is not yet sufficient. Real-world cases are still scarce. Under active development, so the API is not stable.

7.Summary Welcome to MLSQL. 01 Load/Connect statement 02 Select statement 03 Train/run/predict statement 04 Register statement 05 Powerful Set statement 06 Include statement 07 Save statement. Please keep quiet.

8.01 Load/Connect

9.Why design this? In reality there are tons of data sources (JSON, MySQL, Elasticsearch, CSV, Parquet, LibSVM, ...), and many of them have their own loading APIs. How do we unify them? Load them all as tables.

10.Grammar ('load'|'LOAD') format '.' path ('options'|'where')? expression? booleanExpression* 'as' tableName Load data of `format` from location `path`; the load options are the `expressions`. Once loaded, the source can be referenced as `tableName`.

11.MySQL load jdbc.`mysql-1.table1` options driver="com.mysql.jdbc.Driver" and url="jdbc:mysql://127.0.0.1:3306/...." and user="..." and password="...." as table1; How do we control the load parallelism from MySQL? (partitionColumn, lowerBound, upperBound, numPartitions)
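A minimal sketch (not from the slides) of answering that question by passing the standard Spark JDBC partition options through the load statement; the partition column id and the bounds are illustrative assumptions:
load jdbc.`mysql-1.table1` where partitionColumn="id"
and lowerBound="0" and upperBound="1000000"
and numPartitions="8"
as table1;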

12.HDFS-File load parquet.`path` as table1; load csv.`path` where header="true" and inferSchema="true" as table1; load json.`path` as table1; Notice that any value in an MLSQL options/where statement is a string.

13.Simplify load: the Connect statement connect jdbc where truncate="true" and driver="com.mysql.jdbc.Driver" and url="jdbc:mysql://127.0.0.1:3306/...." and user="..." and password="...." as mysql-1; load jdbc.`mysql-1.table1` as table1; select * from table1 as output;

14.Grammar ('connect'|'CONNECT') format 'where'? expression? booleanExpression* ('as' db)? Connect to the data source of `format`; the connection options are the `expressions`; we refer to this connection as `db`.

15.How to load a Hive table? load hive.`db.table` as table1;

16.How to load raw text without schema? load text.`/Users/allwefantasy/CSDNWorkSpace/tmtagsHistory` as table1;

17.How to load an ES table
connect es where `es.nodes`="esnode1" and `es.net.http.auth.user`="" and `es.net.http.auth.pass`="" as es1;
load es.`es1.pi_top/dxy` as es_pi_top;
Property names may contain dots, so you can simply write them as shown: es.net.http.auth.user="". Multi-line values use triple quotes: es.net.http.auth.user=''' hello '''

18.How to load an HBase table load hbase.`wq_tmp_test1` options rowkey="id" and zk="---" and inputTableName="wq_tmp_test1" and field.type.filed1="FloatType" and field.type.filed2="TimestampType" and field.type.filed3="BinaryType" as testhbase;

19.How to load a link list from a web page load crawlersql.`https://www.csdn.net/nav/ai` options matchXPath="//ul[@id='feedlist_id']//div[@class='title']//a/@href" and fetchType="list" and `page.type`="scroll" and `page.num`="10" and `page.flag`="feedlist_id" as aritle_url_table_source;

20.More processing
-- fetch the full page and store it
select crawler_request(regexp_replace(url,"http://","https://")) as html from aritle_url_table_source where url is not null as aritle_list;
save overwrite aritle_list as parquet.`${resultTempStore}`;
-- parse the content
load parquet.`${resultTempStore}` as aritle_list;
select crawler_auto_extract_title(html) as title, crawler_auto_extract_body(html) as body, crawler_extract_xpath(html,"//main/article//span[@class='time']") from aritle_list where html is not null as article_table;

21.Load literal as table
set jsonStr='''
{"features":[5.1,3.5,1.4,0.2],"label":0.0}
{"features":[5.1,3.5,1.4,0.2],"label":1.0}
{"features":[5.1,3.5,1.4,0.2],"label":0.0}
{"features":[4.4,2.9,1.4,0.2],"label":0.0}
{"features":[5.1,3.5,1.4,0.2],"label":1.0}
{"features":[5.1,3.5,1.4,0.2],"label":0.0}
{"features":[5.1,3.5,1.4,0.2],"label":0.0}
{"features":[4.7,3.2,1.3,0.2],"label":1.0}
{"features":[5.1,3.5,1.4,0.2],"label":0.0}
{"features":[5.1,3.5,1.4,0.2],"label":0.0}
''';
load jsonStr.`jsonStr` as data;
Usage?
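One hedged answer to that "Usage?" prompt: inline literal data is handy for quick experiments that need no external files, e.g. feeding the train statement introduced later in the deck. The model path and maxDepth value below are illustrative, not from the slides:
load jsonStr.`jsonStr` as data;
train data as RandomForest.`/tmp/model` where maxDepth="2";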

22.Is that all? No...
load modelParams.`RandomForest` as output;   -- check the available params of a module
load modelExample.`RandomForest` as output;  -- check the doc and example of a module
load modelExplain.`/tmp/model` where alg="RandomForest" as output;   -- check the params of a trained model

23.02 Select

24.Grammar ('select'|'SELECT') ~(';')* 'as' tableName The same as select in Spark SQL. We just add the suffix 'as' tableName, so the result can be referenced easily later.
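A minimal hedged example of the suffix; table and column names are illustrative:
select id, name from table1 as table2;
select count(*) as cnt from table2 as output;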

25.Lots of UDFs available https://github.com/allwefantasy/streamingpro/blob/master/docs/functions.md https://github.com/allwefantasy/streamingpro/blob/master/docs/crawler.md#爬虫相关udf函数 Most models can also be registered as UDF functions for use in SQL.

26.03 Train/run/predict

27.Why design this? There should be a better way to do things like fit/transform: train takes a Table plus a black box and produces a model; predict takes a Table plus a model and produces a table; run takes a Table plus a black box and produces a table.

28.Grammar ('train'|'predict'|'run') tableName 'as' format '.' path 'where'? expression? booleanExpression* ('as' tableName)*

29.Algorithm (train)
-- use RandomForest
train data1 as RandomForest.`/tmp/model` where keepVersion="true" and evaluateTable="data1"
-- specify group 0 parameters
and `fitParam.0.labelCol`="label" and `fitParam.0.featuresCol`="features" and `fitParam.0.maxDepth`="2"
-- specify group 1 parameters
and `fitParam.1.featuresCol`="features" and `fitParam.1.labelCol`="label" and `fitParam.1.maxDepth`="10";

30.Data processing (feature engineering): compute TF/IDF
train orginal_text_corpus as TfIdfInPlace.`/tmp/tfidfinplace` where inputCol="content" and ignoreNature="true" and dicPaths="...." and stopWordPath="/tmp/tfidf/stopwords" and priorityDicPath="/tmp/tfidf/prioritywords" and priority="5.0" and nGrams="2,3";

31.Batch predict predict data as RandomForest.`/tmp/model`;

32.Only transform, no model run lwys_corpus_final_format as RateSampler.`${traning_dir}/ratesampler` where labelCol="label" and sampleRate="0.9,0.1";

33.Bunches of modules

34.04 Register

35.Why design this? A model has three deploy modes: Batch predict (Table in, Table out), API server (single item in, single result out), and Streaming (Table/single item in, Table/single result out).

36.Why design this? (2) When we deploy a model or code as an API service (single item in, single result out), it is best to expose it as a UDF: register the model stored at hdfs:/models/randomforest, or a Scala/Python snippet, as a function, then call select predict(item) as predictRes.

37.Grammar ('register'|'REGISTER') format '.' path 'as' functionName ('options'|'where')? expression? booleanExpression*

38.Register model load libsvm.`/spark-2.2.0-bin-hadoop2.7/data/mllib/sample_libsvm_data.txt` as data; train data as RandomForest.`/tmp/model` where maxDepth="3"; register RandomForest.`/tmp/model` as rf_predict; select rf_predict(features) from data as output;

39.Register Python/Scala snippet
register ScriptUDF.`` as echoFun options lang="python" and dataType="map(string,string)" and code='''
def apply(self,m):
    return m
''';
select echoFun(map("a","b")) as res as output;
Limitation: only libs implemented in pure Python are supported; libs that rely on C/C++ (e.g. NumPy) are not.

40.All Python snippets in one file
set funs='''
def fun1(self,m):
    return m
def fun2(self,m):
    return m
''';
load script.`funs` as funsTable;
register ScriptUDF.`funsTable` as fun1 options lang="python" and methodName="fun1" and dataType="map(string,string)";
register ScriptUDF.`funsTable` as fun2 options lang="python" and methodName="fun2" and dataType="map(string,string)";

41.Conclusion Register can convert the following into UDFs: 1. Scala snippets 2. Python snippets 3. Models generated by the train statement. It makes SQL more powerful, with no compilation and no deployment: a model is just a function.

42.05 Powerful Set

43.Why design this? Duplicate strings fill the whole screen: train data PythonAlg.`/user/name/a/b/c/d/e` ....; register PythonAlg.`/user/name/a/b/c/d/e` .... So sad. There is no way to manage them in one place, and no way to dynamically generate a variable; for example, what if I want the output directory to depend on the date: save overwrite aritle_list as parquet.`/tmp/table1/date?`;

44. Grammar ('set'|'SET') setKey '=' setValue 'options'? expression? booleanExpression*

45.Normal usage set tmpPath="/tmp/dir1";

46.Evaluate from SQL set xx = `select unix_timestamp()` options type = "sql" ;

47.set multi-line text to variable set xx = ''' sentence1 sentence2 sentence3 ''';

48.Powerful date format tool set jack=''' hello today is:${date.toString("yyyyMMdd")} ''';
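Not from the slides: a hedged sketch combining this date tool with variable references in paths (slide 51) to get the date-dependent output directory that slide 43 asks for; the path and table name are illustrative:
set savePath='''/tmp/table1/${date.toString("yyyyMMdd")}''';
save overwrite table1 as parquet.`${savePath}`;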

49.Spark SQL configuration set `hive.exec.dynamic.partition`="true" options type="conf";

50.Evaluate from shell set time="date" options type="shell";

51.How to reference a variable (notice the table-name usage): select '${xx}' as t as `${tm}`; save overwrite tm as parquet.`hp_stae=${xx}`; Variables can be referenced in select statements, and in the values of properties, paths, and table names in train/predict/run/register/load. Be cautious: the include statement does not support variables.

52.06 Include

53.Why design this? Jobs accumulate many SQL snippets, plus function libs and table libs, yet there is no project concept and no way to reuse SQL across jobs; include lets one script pull in others.

54.Grammar include function.`methods.join_str`; Load the join_str method from the methods package, which lives in the function library.

55.Project Structure 1. function: reusable SQL snippets and UDF functions 2. entity_tables: output tables 3. virtual_tables: temp tables 4. job: job scripts

56.SQL snippet reuse 1. a case-when snippet lives in function.`snippet.channel_name` 2. how can we reuse it?

57.SQL snippet reuse 1. include the snippet 2. use it multiple times 3. simplify the whole script (see the sketch below)
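Not shown on the slide: a hedged sketch of the pattern, assuming the snippet stores a case-when fragment in a variable; the variable name caseWhenChannel and the tables are illustrative, not the speaker's actual code:
-- assumed content of function.`snippet.channel_name`
set caseWhenChannel='''case when channel = 1 then "web" when channel = 2 then "app" else "other" end''';
-- main script: include once, reuse many times
include function.`snippet.channel_name`;
select ${caseWhenChannel} as channel_name from table_a as result_a;
select ${caseWhenChannel} as channel_name from table_b as result_b;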

58.functions, tables

59.Input/output declaration
snippet a:
set inputTable="jack" options type="defaultParam";
set outputTable="op" options type="defaultParam";
select "$inputTable" as table as `${outputTable}`;
script b:
set inputTable="wow";
include function.`snippet.a`;
select * from ${outputTable} as output;
1. Declare the input/output in the included snippet with type="defaultParam". 2. They will be overwritten by the script that includes it.

60.07 Save

61.Why design this?
Hive SQL:
create table test (key int, value int) partitioned by (dt int) stored as parquet location '/user/me/test';
insert overwrite directory '/user/me/test/dt=1' stored as parquet select 123, 456;
MLSQL:
select 123 as a, 456 as b as data;
save overwrite data as parquet.`/user/me/test/dt=1`;
1. No need to create a table. 2. Data comes with a schema. 3. The save statement is much simpler.

62. Grammar ('save'|'SAVE') (overwrite | append | errorIfExists | ignore)* tableName 'as' format '.' path 'options'? expression? booleanExpression* ('partitionBy' col)?

63.HDFS-File save overwrite table1 as csv.`/tmp/data` options header="true"; save overwrite table1 as json.`/tmp/data`; save overwrite table1 as parquet.`/tmp/data`;
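The optional partitionBy clause from the grammar above is not demonstrated in the deck; a hedged sketch, assuming table1 has a column named dt to partition on:
save overwrite table1 as parquet.`/tmp/data` partitionBy dt;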

64.MySQL connect mysql where url="..." and driver="" and user="" and password="" as mysql_crawler_db; save append article_table as jdbc.`mysql_crawler_db.crawler_table`;

65.HBase select "a" as id, "b" as ck, 1 as jk as wq_tmp_test2; save overwrite wq_tmp_test2 as hbase.`wq_tmp_test1` options rowkey="id" and zk="127.0.0.1:2181" and `field.type.ck`="BinaryType";

66.Hive select "a" as id, "b" as ck, 1 as jk as wq_tmp_test2 ; save overwrite wq_tmp_test2 as hive.`temp.wq_tmp_test1`;

67.Kafka
-- this is good
select 1 as id union all select 2 as id as wq_tmp_test2;
-- kafka < 0.10.0
save overwrite wq_tmp_test2 as kafka8.`topic1` options `metadata.broker.list`="127.0.0.1:9092";
-- kafka >= 0.10.0
save overwrite wq_tmp_test2 as kafka.`topic1` options `kafka.bootstrap.servers`="127.0.0.1:9092";

68.08 Conclusions

69.Conclusions MLSQL can cover batch, ML, streaming, and crawler scenarios with a few simple statements. It lets many different situations be handled on the same platform.

70.Questions Please make some noise
