Xu Yang: Alink, an Algorithm Platform on Apache Flink

1. Alink: an Algorithm Platform on Apache Flink. Speaker: Xu Yang, Senior Staff Engineer, Alibaba Group

2. What is Alink?
Ø Part of the PAI algorithm platform: an algorithm platform built on Flink.
Ø Supports both batch and streaming algorithms, with more than 200 commonly used algorithms for machine learning, statistics, and related areas.
Ø Helps data analysts and application developers complete the whole workflow end to end: data exploration, model training, real-time prediction, and visualization.

3. What is Alink?
Ø The name comes from the letters shared by related names: Alibaba, Algorithm, AI, Flink, Blink.
Ø Algorithm components are chained together by "linking" them: op1.link(op2), op3.linkFrom(op1, op2)
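The linking idea on this slide can be illustrated with a minimal sketch. The `Operator` class, its `link_from` and `collect` methods, and the toy operators below are hypothetical stand-ins written for this example; they are not the Alink API, and only the `link` / `linkFrom` call shapes come from the slide.

```python
class Operator:
    """Toy operator wrapping a function over lists of rows (hypothetical, not the Alink API)."""

    def __init__(self, fn, name):
        self.fn = fn          # function applied to the outputs of the upstream operators
        self.name = name
        self.inputs = []      # upstream operators, filled in by link / link_from

    def link(self, next_op):
        """Feed this operator's output into next_op; return next_op so calls can chain."""
        next_op.link_from(self)
        return next_op

    def link_from(self, *ops):
        """Declare one or more upstream operators; return self so calls can chain."""
        self.inputs = list(ops)
        return self

    def collect(self):
        """Evaluate upstream operators recursively, then apply this operator's function."""
        upstream = [op.collect() for op in self.inputs]
        return self.fn(*upstream)


# Two sources, one single-input operator, one two-input operator.
source_a = Operator(lambda: [1, 2, 3], "source_a")
source_b = Operator(lambda: [10, 20], "source_b")
double = Operator(lambda rows: [2 * r for r in rows], "double")
union = Operator(lambda left, right: left + right, "union")

doubled = source_a.link(double)               # same shape as op1.link(op2)
merged = union.link_from(doubled, source_b)   # same shape as op3.linkFrom(op1, op2)
result = merged.collect()
print(result)
```

Returning the downstream operator from `link` is what makes long chains readable in this sketch, while `link_from` covers operators that need more than one input.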

4. Alink Architecture (layers of the diagram, top to bottom)
Ø Alink SDK, Web UI, Client, Visualization
Ø Processing for structured data: Stream Operator, Batch Operator
Ø Alink libraries, each for streaming and for batch: Alink ML, Alink Stat, ..., plus Common Libs
Ø Flink libraries: CEP (event processing) and Table (relational) on the stream side; Flink ML (machine learning), Gelly (graph) and Table (relational) on the batch side
Ø DataStream API (stream processing), DataSet API (batch processing)
Ø Runtime: distributed streaming dataflow
Ø Deployment: Local (single JVM), Cluster (Standalone, YARN), Cloud (GCE, EC2)

5. How to use? Three calling modes for different users and scenarios:
Ø Web UI: simple drag-and-drop workflow configuration and execution.
Ø PC Client: edit and run scripts, either locally or on a cluster.
Ø Console: execute Alink scripts.

19. Local, Cluster, Visualization

20. Current Flink ML Library
Ø Supervised Learning: SVM, Multiple Linear Regression, Optimization Framework
Ø Unsupervised Learning: k-Nearest Neighbors Join
Ø Data Preprocessing: Polynomial Features, Standard Scaler, MinMax Scaler
Ø Recommendation: Alternating Least Squares (ALS)
Ø Outlier Selection: Stochastic Outlier Selection (SOS)
Ø Utilities: Distance Metrics, Cross Validation

21. Alink Supported Algorithms (1/4)
Ø Regression: Multi-Linear Regression, Lasso Regression, Ridge Regression, SVM Regression, Stepwise Linear Regression, CART, GBDT, Random Forest Regression
Ø Classification: Logistic Regression, Support Vector Machine (SVM), Perceptron, Naive Bayes, K-Nearest Neighbor, Tradaboost, Random Forest, ID3, CART, C4.5
Ø Clustering: KMeans, KModes, DBSCAN, AGNES, PIC
Ø Deep Learning: TensorFlow Prediction and Training
Ø Online Learning: FTRL, KMeans, Perceptron, Passive Aggressive (PA), PA-I, PA-II
Ø Evaluation: EvalClassification, EvalClustering, EvalRegression

22. Alink Supported Algorithms (2/4)
Ø Data Processing:
Random Sampling, Stratified Sampling, Normalization, Standardization, Fill Missing Value, Type Conversion
KvToTensor, TableToTensor, TensorToTable, TensorFunction, TensorToTuples
Velocity Variable, Network Traffic Indicator, TensorExpandDim, Append ID
Split a Single Column into Multiple Columns, Select the Column After Splitting, Extract Json Values, Single Column into Multiple, Multiple Columns into Single
SqlCmd, As, Select, UnionAll, Where, GroupBy, Distinct, Intersect, Join, Minus, OrderBy
Multi-Stream Merge, LatestJoin, Lookup
Ø Feature Engineering: One-Hot Coding, Feature Scale Transformation, Feature Anomaly Smoothing, Linear Model Feature Importance Analysis

23. Alink Supported Algorithms (3/4)
Ø Basic Statistics:
Window Statistics, Full Table Statistics, Grouped Window Statistics
Count, Sum, Mean, Maximum, Minimum, Number of Missing Values, Variance, Standard Deviation, Standard Error, Kurtosis, Skewness, etc.
Largest k Values, Smallest k Values
Ø Variable Relationship: Covariance, Correlation Coefficient, Correspondence Analysis, Cross Table, Multicollinearity
Ø Data Distribution: Percentile, Frequency, Histogram, PDF, CDF, Empirical PDF, Empirical CDF, P-P Plot, Lorenz Curve
Ø Hypothesis Testing: T-Test, Chi-Square Test, AD Test, KS Test
Ø Dimensionality Reduction: Principal Component Analysis (PCA), t-SNE
Ø Time Series: ARIMA, GARCH, ArimaGarch
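Several of the basic statistics listed on this slide (count, sum, mean, variance, standard error, skewness, kurtosis) can be derived from a handful of running sums, which is one reason they fit both batch partitions and stream windows. The sketch below is an illustration under that assumption only; the `Summary` class and `summarize` helper are hypothetical names, and the slides do not say how Alink computes these values. Raw power sums are also numerically fragile for large magnitudes, so a production implementation would likely use a more stable update.

```python
import math
from dataclasses import dataclass


@dataclass
class Summary:
    """Running power sums of one numeric column (hypothetical helper, not Alink code)."""
    n: int = 0
    s1: float = 0.0   # sum of x
    s2: float = 0.0   # sum of x^2
    s3: float = 0.0   # sum of x^3
    s4: float = 0.0   # sum of x^4

    def add(self, x):
        """Fold one value into the summary."""
        self.n += 1
        self.s1 += x
        self.s2 += x ** 2
        self.s3 += x ** 3
        self.s4 += x ** 4
        return self

    def merge(self, other):
        """Combine summaries from two partitions or windows."""
        return Summary(self.n + other.n, self.s1 + other.s1, self.s2 + other.s2,
                       self.s3 + other.s3, self.s4 + other.s4)

    def mean(self):
        return self.s1 / self.n

    def central_moment(self, k):
        """Population central moment of order 2, 3 or 4, expanded from the power sums."""
        m, n = self.mean(), self.n
        if k == 2:
            return self.s2 / n - m ** 2
        if k == 3:
            return self.s3 / n - 3 * m * self.s2 / n + 2 * m ** 3
        if k == 4:
            return self.s4 / n - 4 * m * self.s3 / n + 6 * m ** 2 * self.s2 / n - 3 * m ** 4
        raise ValueError("only orders 2, 3 and 4 are supported")

    def variance(self):
        """Sample variance (divides by n - 1)."""
        return self.central_moment(2) * self.n / (self.n - 1)

    def std_error(self):
        """Standard error of the mean."""
        return math.sqrt(self.variance() / self.n)

    def skewness(self):
        return self.central_moment(3) / self.central_moment(2) ** 1.5

    def kurtosis(self):
        """Excess kurtosis (0 for a normal distribution)."""
        return self.central_moment(4) / self.central_moment(2) ** 2 - 3


def summarize(values):
    summary = Summary()
    for v in values:
        summary.add(v)
    return summary


# Two partitions summarized independently, then merged.
part_a = summarize([1, 2, 3])
part_b = summarize([4, 5, 6, 7, 8])
merged = part_a.merge(part_b)
full = summarize([1, 2, 3, 4, 5, 6, 7, 8])
print(merged.mean(), merged.variance(), merged.skewness(), merged.kurtosis())
```

Because `merge` only adds counters, the merged summary matches a single pass over all the data, so the same code path could serve a full-table job and a windowed one.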

24. Alink Supported Algorithms (4/4)
Ø Outlier Detection: SOS, K-Sigma, AVF, Boxplot, AGD, One Class SVM, SMA, EWMA, CDM, G Test, UriNumberDetection, GroupDetection, GroupMFIDetection, BigGraphGeneration
Ø Recommendation: ALS, SimRank, FM, ItemCF
Ø Text Analysis:
Word Count, Word Segmentation, Stop Word Filtering, Tokenizer, New Word Recognition, TF-IDF, Text Feature Generation
Word2Vec, Text Sensitive Number Capture, Bank Card Information Parsing, ID Card Information Parsing, Word Sequence to ID Sequence, String Similarity, Semantic Vector Distance, SimHash
Ø Graph: Single Source Shortest Path, Community Detection, Label Propagation, PageRank, HITS, Tree Depth, Connected Graph, Modularity, K-Core, Triangle Count, Second-Degree Neighbor Lookup

25. Demo for Statistics and Visualization • IJCAI-17 Dataset • https://tianchi.aliyun.com/datalab/index.htm • Trading amounts and locations of Alipay users • 19.6 million users, 67 million trades

27. Classification Demo DataSet • https://archive.ics.uci.edu/ml/datasets/adult • Predict whether income exceeds $50K/yr based on census data. • 48842 instances, 6 continuous attributes, 8 discrete attributes.
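The task on this slide, predicting a binary income label from a mix of continuous and discrete attributes, can be sketched in a few lines. Everything below is an assumption made for illustration: the rows are invented toy data (not taken from the Adult dataset), the column choice is hypothetical, and logistic regression is used only because it appears in the list of supported classifiers; the slides do not show which algorithm or preprocessing the demo uses.

```python
import math

# Toy rows invented for this sketch (NOT real Adult dataset records):
# (age, hours_per_week, education, label), where label 1 means income > $50K.
rows = [
    (25, 20, "HS", 0),
    (30, 35, "HS", 0),
    (28, 40, "Bachelors", 0),
    (45, 50, "Bachelors", 1),
    (50, 45, "Masters", 1),
    (38, 60, "Masters", 1),
]


def standardize(values):
    """Scale a continuous column to zero mean and unit variance."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values)) or 1.0
    return [(v - mean) / std for v in values]


def one_hot(values):
    """Encode a discrete column as one 0/1 indicator per category."""
    categories = sorted(set(values))
    return [[1.0 if v == c else 0.0 for c in categories] for v in values]


def build_features(data):
    ages = standardize([r[0] for r in data])
    hours = standardize([r[1] for r in data])
    education = one_hot([r[2] for r in data])
    return [[a, h] + e for a, h, e in zip(ages, hours, education)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, b, x):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train(X, y, lr=0.5, epochs=500):
    """Batch gradient descent on the logistic loss."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


X = build_features(rows)
y = [r[3] for r in rows]
w, b = train(X, y)
predictions = [1 if predict_proba(w, b, xi) >= 0.5 else 0 for xi in X]
accuracy = sum(p == t for p, t in zip(predictions, y)) / len(y)
print(accuracy)
```

The accuracy here is measured on the training rows only, so it says nothing about generalization; a real run on the 48842 instances would hold out a test split and use an evaluation step such as the classification evaluation listed among the supported algorithms.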

28. Classification Demo

29. Classification Demo