Apache DolphinScheduler 3.2.0 features and integration with OceanBase-Kyle
1 .Apache DolphinScheduler 3.2 preview and OceanBase Integration
Kyle Zhike Chen, Data Engineer Manager at GoTo Financial
Twitter: _kk17_, GitHub: kk17
Aug 21, 2023
2 .Agenda
• Introduction of DolphinScheduler
• 3.1 & 3.2.0 feature preview
• Use case: OceanBase integration
3 .What is DolphinScheduler?
4 .What is DolphinScheduler
A distributed and extensible workflow scheduler platform with powerful DAG visual interfaces
Workflow Platform
• Complex Dependence
• Concurrency & Limitations
• Retry & Alert & Backfill
• Monitoring & Metric
Drag and Drop First
• Drag & Drop First
• Python API & Open API
• YAML Definition
High Availability & Performance
• Decentralization
• Native HA Queue
• Fault Tolerance
Plug-in Based Design
• Tasks: 40+
• DataSources: 11+
• Alerts: 10
• Registration: 3
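To make the "DAG with dependencies and retry" idea concrete, here is a minimal sketch in plain Python. It is not DolphinScheduler's implementation: the `run_workflow` helper, the task names, and the retry policy are all hypothetical, and it assumes Python 3.9+ for the standard-library `graphlib` module.

```python
from graphlib import TopologicalSorter


def run_workflow(dependencies, actions, max_retries=1):
    """Run each task after its upstream tasks, retrying failures.

    dependencies maps a task name to the set of task names it depends on.
    actions maps a task name to a zero-argument callable.
    Returns the task names in the order they completed.
    """
    completed = []
    # static_order() yields every task after all of its predecessors.
    for task in TopologicalSorter(dependencies).static_order():
        for attempt in range(max_retries + 1):
            try:
                actions[task]()
                completed.append(task)
                break
            except Exception:
                if attempt == max_retries:
                    raise  # retries exhausted: fail the whole run


    return completed


# Hypothetical three-step pipeline: extract -> transform -> load.
dependencies = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}
calls = {"transform": 0}


def flaky_transform():
    """Fail on the first call only, to exercise the retry path."""
    calls["transform"] += 1
    if calls["transform"] == 1:
        raise RuntimeError("transient failure")


actions = {"extract": lambda: None, "transform": flaky_transform, "load": lambda: None}
result = run_workflow(dependencies, actions)
print(result)  # ['extract', 'transform', 'load']
```

A real scheduler adds distribution, persistence, alerting and backfill on top of this ordering-plus-retry core.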
5 .DolphinScheduler History
6 .DolphinScheduler 3.1 Features
Simple & WYSIWYG workflow
• Drag & Drop to create workflow
• DAG Graph run-time management
• Open API to support others
Cloud native & Extensibility
• Support User-defined Task
• Elastic Master & Worker dynamic on-line & off-line
High Reliability
• Decentralized multi-Masters and multi-Worker
• High performance (support 1m+ Task in production env)
• High Reliability
Rich Workflow Functions
• Support pause & resume workflow
• Condition and subworkflow
• Support projects, multi-tenant
• Support 30+ Task types: Spark, Hive, MR, Python, Sub-Process, Shell, EMR, S3
ML Orchestration
• DataPreparation + MLOps
• MLflow, SageMaker, DVC Support
• Jupyter, PyTorch
• OpenMLDB
Realtime Support
• Flink, Spark Streaming
• Data Stream workflow
• Data Stream Management
Python, YAML Workflow Support
• Python generate Workflow
• YAML generate Workflow
• Code Review & Deployment Support
Kubernetes Support
• K8S Operator
• K8S Task
7 .User-friendly for non-developers Unlike most current workflow scheduling systems, DolphinScheduler provides not only a user-friendly Web UI for creating workflows but also programmatic and YAML definitions. It is a good choice if your company has both developers and non-developers who want to create and manage workflows.
8 .Batch & Streaming As data timeliness requirements increase, we have many streaming workflows. Many companies use two separate tools to handle streaming and batch workflows. Tools that do not sit at the same level cause problems when workflows need to be connected and lineage needs to be traced.
9 .Both MLOps and Data Warehouse MLOps: deploy and maintain machine learning models in production. Some teams use different tools for the data warehouse and for MLOps, but data preparation and model training should sit at the same level and be handled by a single tool.
11 .Data Quality
● There are always data quality issues in source systems, while a workflow is controlled only by time and dependencies.
● The data quality table is always the last job in the workflow. DQ rules can be managed in, or imported into, DolphinScheduler.
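As an illustration of what a data quality rule at the end of a workflow might look like, here is a minimal null-ratio check in plain Python. It does not use DolphinScheduler's data quality task. The function name `null_ratio_check`, the sample rows, and the 10% threshold are all assumptions made for this sketch.

```python
def null_ratio_check(rows, column, max_null_ratio):
    """Return (passed, ratio) for the share of rows where `column` is None."""
    if not rows:
        return True, 0.0  # nothing to check, so treat an empty input as passing
    nulls = sum(1 for row in rows if row.get(column) is None)
    ratio = nulls / len(rows)
    return ratio <= max_null_ratio, ratio


# Hypothetical sample: one of four rows is missing user_id.
rows = [{"user_id": 1}, {"user_id": None}, {"user_id": 3}, {"user_id": 4}]
passed, ratio = null_ratio_check(rows, "user_id", max_null_ratio=0.1)
print(passed, ratio)  # False 0.25
```

A failed check like this one could be used to raise an alert or block downstream jobs, so that bad source data is caught by the workflow rather than by time or dependency triggers alone.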
12 .DolphinScheduler vs Airflow

| | DolphinScheduler | Airflow |
|---|---|---|
| Programming language | Written in Java | Written in Python |
| DAG definition | Drag and Drop First | Python DAG files |
| DAG versioning | Built-in version control | Integrates with git-sync |
| Component integration | Rich integration of big data and ML components | Supports general components |
| Backfill | UI support | Command line |
| Multi-tenancy | Yes | Partially supported |
| Streaming job | Streaming task support | No streaming task support |
| Data Quality | Yes | No |
13 .DolphinScheduler 3.2 Features
• Add default tenant
• Add support for multiple data sources, such as Snowflake, Databend, Kyuubi, OceanBase, and more
• Add new task types
• Add caching support for tasks
• Enhance existing task types (e.g. Sqoop, SQL, etc.)
• Enhance architecture (e.g. Alert supports HA, SSO support, etc.)
• Specify workflow execution forward and backward when re-running tasks
• Add support for remote logs from OSS, GCS, S3
• Get real-time logs from Kubernetes pods
• Enhance task parameters
• Add support for Alibaba Cloud OSS in the resource center
• Enhance the RESTful API
• Add support for ETCD and JDBC registry centers
14 .How to Create Your Very First Workflow? OceanBase integration example
15 .Via Web UI Create project and initialize workflow
16 .Via Web UI Drag & drop to create and define tasks, and set dependencies
17 .Via Web UI Save, publish and trigger the workflow
18 .Via Web UI Recap
• Create project
• Initialize workflow
• Drag & drop to create and define tasks, and set dependencies
• Save, publish and trigger the workflow
19 .Via Web UI Check workflow instance status and view logs
20 .Via Python API PyDolphinScheduler is the Python API for Apache DolphinScheduler. It lets you define your workflow in Python code, also known as workflow-as-code.
21 .Via YAML definition YAML definition is currently a feature built into PyDolphinScheduler; it converts a YAML file into a workflow.
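A workflow-as-code definition might look like the sketch below. It assumes a PyDolphinScheduler release that exposes `Workflow` (the 4.x line; older releases used `ProcessDefinition` instead), a reachable DolphinScheduler API server, and client configuration already in place. The workflow name, task names, and shell commands are hypothetical placeholders, not taken from the talk.

```python
def build_workflow():
    """Define a two-task workflow and submit it to DolphinScheduler."""
    # Imported lazily so this file can be loaded without the package installed.
    from pydolphinscheduler.core.workflow import Workflow
    from pydolphinscheduler.tasks.shell import Shell

    with Workflow(name="oceanbase_demo") as workflow:
        # Two placeholder shell tasks; real tasks would do the actual work.
        extract = Shell(name="extract", command="echo extract")
        load = Shell(name="load", command="echo load")

        # ">>" declares that load runs after extract.
        extract >> load

        # Push the definition to the server without triggering a run.
        workflow.submit()


# Call build_workflow() once the package is installed and the API server is up.
```

Keeping the definition in a Python file means it can go through the same code review and deployment process as the rest of a codebase.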
22 .Community
23 .Users & Integrations Some of Our Users Some of Our Integrations
24 .Contact Us
Troubleshooting
• User Mail List: users@dolphinscheduler.apache.org
• Slack: https://asf-dolphinscheduler.slack.com/channels/troubleshooting
Bug Reports, Feature Requests & Feature Discussion
• Developer Mail List: dev@dolphinscheduler.apache.org
• GitHub Issue: https://github.com/apache/dolphinscheduler/issues
• GitHub Pull Requests: https://github.com/apache/dolphinscheduler/pulls
Announcements
• Mail List: both the dev and users lists mentioned above
• Twitter: https://twitter.com/dolphinschedule
• Slack: https://asf-dolphinscheduler.slack.com/channels/announcements