Databricks + Snowflake: Catalyzing Data and AI Initiatives

“Combining Databricks, the unified analytics platform, with Snowflake, the data warehouse built for the cloud, is a powerful combination. Databricks can reliably process large amounts of data and supports developing scalable AI projects. Snowflake offers the elasticity of a cloud-based data warehouse that centralizes access to data. Databricks is built on a mature, AI-enabled distributed big data processing engine and integrates with nearly every technology, from message queues (e.g. Kafka) to databases (e.g. Snowflake) to object stores (e.g. S3) and AI tools (e.g. TensorFlow). Key takeaways: how Databricks and Snowflake work; why they are so powerful; how Databricks + Snowflake symbiotically catalyze analytics and AI initiatives.”
2. Databricks + Snowflake: Catalyzing Data and AI
Garren Staubli, Solutions Architect
garren@databricks.com | @gstaubli
#UnifiedAnalytics #SparkAISummit | Slides & Resources: garrens.com/DataSnowCat

3. Agenda: Introductions, Scenario, Challenges, Solutions, Demo

4. Introductions - Me
[Timeline, 2011 to 2019: MySQL; AWS; Python; Hadoop; Scala, Python & Java; Ruby; Pig & Hive; Linux; NoSQL; Apache Spark & ML]

5. Introductions - Databricks
Accelerate innovation by unifying data science and engineering
[Diagram labels: Databricks Workspace; Collaborative Notebooks, Production Jobs; Data & Runtime; Databricks ML; Databricks Delta; Lifecycle; ML Frameworks; Transactions; Indexing; Data Engineering; Data Science; Cloud]

6. Introductions - Snowflake

7. Forget Oil. Data is worth more than Gold

8. Scenario

9. Scenario - Annotated
[Diagram labels: Data Mining, Data Science, ML Engineering, Production, Delivery*, DevOps, QA]
* not DiGiorno

10. Scenario - Reality

11. Challenges
Sources: Lakes, Streams, Warehouses, NoSQL
Data-Driven Production: APIs, Apps, BI

12. Challenges - Reality
Partitions: 20 | Rows per second: 10,000 | Format: JSON
Extract, Transform, Load
Insights: ML, Analysis
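For a rough sense of scale, the arithmetic below spreads the stated rate over the stated partitions. It is only a sketch: it assumes rows are spread evenly across partitions and that the rate is sustained for a full day, neither of which the slide states.

```python
# Back-of-the-envelope sizing for the workload shown on the slide.
# Assumptions (not from the slides): rows are spread evenly across
# partitions and the rate is sustained for a full day.
PARTITIONS = 20
ROWS_PER_SECOND = 10_000
SECONDS_PER_DAY = 24 * 60 * 60

rows_per_partition_per_second = ROWS_PER_SECOND // PARTITIONS
rows_per_day = ROWS_PER_SECOND * SECONDS_PER_DAY

print(rows_per_partition_per_second)  # 500
print(rows_per_day)                   # 864000000
```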

13. Challenges & Solutions - ETL
Partitions: 20 | Rows per second: 10,000 | Format: JSON
Sources: Flat, RDBMS, Streams, etc.
Syntax: Unified batch & stream APIs
Scale: Autoscaling with usage
Languages: Python, Scala, SQL, R & Java
Performance: JVM w/ optimization
Expressiveness: Multilevel APIs (SQL + RDDs)
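The "unified batch & stream APIs" point can be illustrated with a plain-Python analogy: one transformation is written once and applied to both a finite batch and an incrementally produced stream. This is not Spark's API; the field names `user` and `event` and the helper names are hypothetical, chosen only for the sketch.

```python
from typing import Iterable, Iterator


def clean(rows: Iterable[dict]) -> Iterator[dict]:
    """Drop rows without a user and lower-case the event name."""
    for row in rows:
        if row.get("user") is None:
            continue
        yield {**row, "event": str(row.get("event", "")).lower()}


def event_stream() -> Iterator[dict]:
    """Hypothetical stream: yields rows one at a time."""
    yield {"user": "a", "event": "CLICK"}
    yield {"user": None, "event": "VIEW"}
    yield {"user": "b", "event": "View"}


# Hypothetical batch: the same rows, already materialised in a list.
batch = [
    {"user": "a", "event": "CLICK"},
    {"user": None, "event": "VIEW"},
    {"user": "b", "event": "View"},
]

# The same function handles both inputs without modification.
batch_result = list(clean(batch))
stream_result = list(clean(event_stream()))
```

Because `clean` only relies on iteration, the batch and stream paths share one definition of the business logic, which is the property the slide attributes to a unified API.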

14. Challenges & Solutions - ETL
Partitions: 20 | Rows per second: 10,000 | Format: JSON
Malformed Records: Ignore/infer + log records
Errors: Handle + retry w/ checkpoint
Changing Fields: Schema Evolution
Writes
- Performance: Partitioned + optimized files
- Semantics: Exactly once
- Reliability: ACID transactions
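Two of the ideas on this slide, logging and skipping malformed records and letting the schema grow as new fields appear, can be sketched in plain Python. This is an illustration of the concepts only, not how Spark or Delta implement them; the function names, the logger name, and the sample records are hypothetical.

```python
import json
import logging

logger = logging.getLogger("etl_sketch")  # hypothetical logger name


def parse_lines(lines):
    """Return (good, bad): parsed JSON objects and the raw malformed lines."""
    good, bad = [], []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            record = None
        if isinstance(record, dict):
            good.append(record)
        else:
            # Log the bad record and carry on instead of failing the job.
            logger.warning("skipping malformed record: %r", line)
            bad.append(line)
    return good, bad


def evolve_schema(schema, record):
    """Return a new schema that also covers fields first seen in record."""
    merged = dict(schema)
    for field, value in record.items():
        # Existing fields keep their type; new fields are appended.
        merged.setdefault(field, type(value).__name__)
    return merged


# Hypothetical input: two valid records (the second adds a field)
# and one truncated line.
raw = [
    '{"id": 1, "amount": 9.5}',
    '{"id": 2, "amount": 3.0, "coupon": "SPRING"}',
    '{"id": 3, "amount": ',
]

records, rejected = parse_lines(raw)
schema = {}
for rec in records:
    schema = evolve_schema(schema, rec)
```

Keeping the rejected lines alongside the log message means they can be inspected or replayed later, rather than being silently lost.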

15. Challenges & Solutions - ML
Partitions: 20 | Rows per second: 10,000 | Format: JSON
Data Access: Apache Spark + Delta
Syntax: Koalas
Collaboration: Databricks Notebooks
Models
- Iteration: Tracking
- Reproducibility: Projects
- Deployment: Models

16. Challenges & Solutions - Analysis
Partitions: 20 | Rows per second: 10,000 | Format: JSON
Time to value: Interactive queries
Intermittent demand: Instant Scaling
Language: SQL
Common Tooling: Tableau, PowerBI, etc.
Ease of Use: Optimized DWaaS
Cost control: Decoupled storage + compute

17. Final Solution Architecture
Partitions: 20 | Rows per second: 10,000 | Format: JSON
[Diagram labels: Machine Learning; BI Reporting Dashboards]

18. Demo

19. Review: Introductions, Scenario, Challenges, Solutions, Demo

20. Solution
Sources: Lakes, Streams, Warehouses, NoSQL
Processing
Persistence: Lakes, Delta, Warehouses, NoSQL
Integration: ML, BI, APIs, Apps
