Databricks + Snowflake: Catalyzing Data and AI Initiatives

“Combining Databricks, the unified analytics platform, with Snowflake, the data warehouse built for the cloud, is a powerful combination. Databricks can reliably process large amounts of data and supports developing scalable AI projects. Snowflake offers the elasticity of a cloud-based data warehouse that centralizes access to data. Databricks is built on a mature distributed engine for big data processing and AI, and it integrates with nearly every technology, from message queues (e.g. Kafka) to databases (e.g. Snowflake) to object stores (e.g. S3) and AI tools (e.g. TensorFlow). Key takeaways: how Databricks and Snowflake work; why they are so powerful; how Databricks + Snowflake together catalyze analytics and AI initiatives.”

1. WiFi SSID: SparkAISummit | Password: UnifiedAnalytics

2. Databricks + Snowflake: Catalyzing Data and AI
Garren Staubli, Solutions Architect | @gstaubli
#UnifiedAnalytics #SparkAISummit | Slides & Resources:

3. Agenda: Introductions, Scenario, Challenges, Solutions, Demo

4. Introductions - Me
Timeline, 2011 to 2019: MySQL, AWS, Python, Hadoop, Scala, Python & Java, Ruby, Pig & Hive, Linux, NoSQL, Apache Spark & ML

5. Introductions - Databricks
Accelerate innovation by unifying data science and engineering
Diagram labels: Databricks Workspace (Collaborative Notebooks, Production Jobs); Databricks Runtime; Databricks Delta (Transactions, Indexing); ML Lifecycle; ML Frameworks; Data Engineering; Data Science; Cloud

6. Introductions - Snowflake

7. Forget Oil. Data is worth more than Gold

8. Scenario

9. Scenario - Annotated
Diagram labels: Data Mining, Data Science, ML Engineering, Production, Delivery*, DevOps, QA
* not DiGiorno

10. Scenario - Reality

11. Challenges
Sources: LAKES, STREAMS, WAREHOUSES, NOSQL
Data-Driven Production: APIs, Apps, BI

12. Challenges - Reality
Partitions: 20 | Rows per second: 10,000 | Format: JSON
Pipeline: Extract, Transform, Load
Insights: ML, Analysis
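
The stream parameters on slide 12 imply a per-partition load that is worth working out. A minimal sketch of that arithmetic, assuming rows are spread evenly across the 20 partitions and the rate is constant (the slide states neither); the function names are illustrative.

```python
# Stream parameters as stated on the slide.
ROWS_PER_SECOND = 10_000
PARTITIONS = 20
SECONDS_PER_DAY = 24 * 60 * 60


def rows_per_partition_per_second(rows_per_second: int, partitions: int) -> float:
    """Average load on one partition, assuming an even spread (an assumption)."""
    return rows_per_second / partitions


def rows_per_day(rows_per_second: int) -> int:
    """Total rows produced in 24 hours at a constant rate (an assumption)."""
    return rows_per_second * SECONDS_PER_DAY


if __name__ == "__main__":
    print(rows_per_partition_per_second(ROWS_PER_SECOND, PARTITIONS))  # 500.0
    print(rows_per_day(ROWS_PER_SECOND))  # 864000000
```

Under those assumptions each partition carries about 500 rows per second, and the pipeline sees 864 million rows per day.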

13. Challenges & Solutions - ETL
Partitions: 20 | Rows per second: 10,000 | Format: JSON
Challenge: Solution
Sources: Flat files, RDBMS, streams, etc.
Syntax: Unified batch & stream APIs
Scale: Autoscaling with usage
Languages: Python, Scala, SQL, R & Java
Performance: JVM w/ optimization
Expressiveness: Multilevel APIs (SQL + RDDs)
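
The "unified batch & stream APIs" point on slide 13 means the transformation is written once and applied to either kind of input. The sketch below illustrates that idea in plain Python rather than with the Spark API: a list stands in for a batch source and a generator for a stream. The field names `amount_cents` and `amount_usd` and the sample rows are hypothetical.

```python
import json
from typing import Iterable, Iterator, List


def transform(lines: Iterable[str]) -> Iterator[dict]:
    """Parse JSON lines and add a derived field; works on any iterable."""
    for line in lines:
        record = json.loads(line)
        # "amount_cents" and "amount_usd" are hypothetical field names.
        record["amount_usd"] = record["amount_cents"] / 100
        yield record


# Hypothetical sample rows, used by both sources below.
RAW = ['{"id": 1, "amount_cents": 1250}', '{"id": 2, "amount_cents": 300}']


def batch_source() -> List[str]:
    """Batch: all rows are available up front."""
    return list(RAW)


def stream_source() -> Iterator[str]:
    """Stream: rows arrive one at a time (simulated with a generator)."""
    for line in RAW:
        yield line


# The same transform runs unchanged over both inputs.
batch_result = list(transform(batch_source()))
stream_result = list(transform(stream_source()))
```

Because `transform` only depends on iteration, switching a job from batch to streaming changes the source, not the logic.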

14. Challenges & Solutions - ETL
Partitions: 20 | Rows per second: 10,000 | Format: JSON
Challenge: Solution
Malformed Records: Ignore/infer + log records
Errors: Handle + retry w/ checkpoint
Changing Fields: Schema Evolution
Writes - Performance: Partitioned + optimized files
Writes - Semantics: Exactly once
Writes - Reliability: ACID transactions
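
Three of the ideas on slide 14 can be sketched in a few lines: logging malformed records instead of failing, letting the schema grow when new fields appear, and using a checkpoint so a retried write does not duplicate rows. This is a plain-Python illustration of the concepts, not how Spark or Delta implement them; `parse_records`, `evolve_schema`, and `CheckpointedSink` are hypothetical names.

```python
import json
from typing import Dict, List, Tuple


def parse_records(lines: List[str]) -> Tuple[List[Tuple[int, dict]], List[dict]]:
    """Split raw JSON lines into (offset, record) pairs and a log of bad lines."""
    good, bad = [], []
    for offset, line in enumerate(lines):
        try:
            record = json.loads(line)
            if not isinstance(record, dict):
                raise ValueError("expected a JSON object")
            good.append((offset, record))
        except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
            bad.append({"offset": offset, "raw": line, "error": str(exc)})
    return good, bad


def evolve_schema(schema: Dict[str, str], record: dict) -> Dict[str, str]:
    """Add fields seen for the first time; known fields keep their recorded type."""
    merged = dict(schema)
    for field, value in record.items():
        merged.setdefault(field, type(value).__name__)
    return merged


class CheckpointedSink:
    """In-memory sink that remembers the last offset it wrote."""

    def __init__(self) -> None:
        self.rows: List[dict] = []
        self.last_offset = -1  # checkpoint: highest offset already written

    def write(self, batch: List[Tuple[int, dict]]) -> None:
        for offset, record in batch:
            if offset <= self.last_offset:
                continue  # already written by an earlier attempt
            self.rows.append(record)
            self.last_offset = offset


lines = ['{"id": 1}', "not json", '{"id": 2, "country": "US"}', "[1, 2]"]
good, bad = parse_records(lines)

schema: Dict[str, str] = {}
for _, rec in good:
    schema = evolve_schema(schema, rec)

sink = CheckpointedSink()
sink.write(good)
sink.write(good)  # simulated retry: the checkpoint makes this a no-op
```

After the retry the sink still holds two rows, the two bad lines are kept in `bad` for inspection, and `schema` has picked up the new `country` field.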

15. Challenges & Solutions - ML
Partitions: 20 | Rows per second: 10,000 | Format: JSON
Challenge: Solution
Data Access: Apache Spark + Delta
Syntax: Koalas
Collaboration: Databricks Notebooks
Models - Iteration: Tracking
Models - Reproducibility: Projects
Models - Deployment: Models

16. Challenges & Solutions - Analysis
Partitions: 20 | Rows per second: 10,000 | Format: JSON
Challenge: Solution
Time to value: Interactive queries
Intermittent demand: Instant scaling
Language: SQL
Common Tooling: Tableau, PowerBI, etc.
Ease of Use: Optimized DWaaS
Cost control: Decoupled storage + compute
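
The link between intermittent demand and cost control on slide 16 comes down to simple arithmetic: when compute is decoupled from storage, compute is paid for only while queries run. A rough sketch follows; every number in it (hourly rate, active hours, storage cost) is a hypothetical placeholder, not a real price from any vendor.

```python
HOURS_IN_MONTH = 30 * 24  # 720, assuming a 30-day month


def monthly_cost(compute_rate_per_hour: float,
                 compute_hours: float,
                 storage_cost: float) -> float:
    """Compute billed by the hour plus a flat storage charge."""
    return compute_rate_per_hour * compute_hours + storage_cost


# Hypothetical inputs for illustration only.
RATE_PER_HOUR = 4.0    # cost of one compute cluster per hour
ACTIVE_HOURS = 120     # hours per month analysts actually run queries
STORAGE_COST = 50.0    # flat monthly storage charge

# Coupled: compute must stay up all month to keep the data available.
coupled = monthly_cost(RATE_PER_HOUR, HOURS_IN_MONTH, STORAGE_COST)

# Decoupled: compute is suspended outside the active hours.
decoupled = monthly_cost(RATE_PER_HOUR, ACTIVE_HOURS, STORAGE_COST)

if __name__ == "__main__":
    print(coupled)    # 2930.0
    print(decoupled)  # 530.0
```

With these placeholder inputs the storage bill is identical in both cases and the whole difference comes from idle compute hours.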

17. Final Solution Architecture
Partitions: 20 | Rows per second: 10,000 | Format: JSON
Diagram labels: Machine Learning, BI, Reporting, Dashboards

18. Demo

19. Review: Introductions, Scenario, Challenges, Solutions, Demo

20. Solution
Diagram columns: Sources, Processing, Persistence, Integration
Diagram labels: LAKES, STREAMS, DELTA, WAREHOUSES, NOSQL, ML, BI, APIs, Apps


