Power Your Delta Lake with Streaming Transactional Changes

Organizations are adopting data digitization and data-driven decision making is at the heart of this transformation. Cloud Data Lakes and Datawarehouses provide great flexibility to proto-type and roll out applications continuously at much lower costs.

Transactional databases are optimized for processing huge volumes of transactions in real-time, whereas the cloud data lake needs to be optimized for analyzing huge volumes of data quickly. This brings about a challenge in creating a streamlined data flow process from capturing realtime transactions into a cloud datawarehouse to drive realtime insights in a scalable and cost effective manner.

In this session, we’ll show how organizations can easily overcome that challenge by adopting a robust platform with StreamSets and Delta Lake. StreamSets provides a no-code framework to automate ingestion of transactional data and data processing on Spark, while Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing.


1.WIFI SSID:Spark+AISummit | Password: UnifiedDataAnalytics

2.Power your Delta Lake with streaming transactional changes Rupal Shah Director, Cloud Services StreamSets

3.NEW DATA CONTINUOUSLY GENERATED User Behavior Data Reporting Click Streams Dashboards Sensor data (IoT) DATA LAKE Alerting Video/Speech Usage/Billing data Machine Telemetry Commerce Data …

4. Performance Data Integrity Reporting Dashboards NEW DATA DATA LAKE Alerting Cost Infrastructure

5. Delta Lake 2. Query Performance 1. Data Reliability Fast at Scale (10-100x Faster) Cheaper to Operate Indexing & Caching ACID Compliant Transactions Schema Enforcement & Evolution LOTS OF NEW DATA Reporting User Behavior Data Click Streams Dashboards Sensor data (IoT) Alerting Video/Speech Machine Learning Usage/Billing data Machine Telemetry 3. Simplified Architecture Commerce Data Unify batch & streaming … Early data availability for analytics

6.StreamSets DataOps platform powers continuous data by operationalizing the full data flow lifecycle Monitoring Design Deploy Operate Smart Data Pipelines Automation • Single tool for all data, all use • Performance and security • Continuous delivery cases SLAs across lifecycle • Built-in drift handling • End-to-end views • Supports all data platforms to avoid lock-in

7. End to End Data Integration Data Stores • Bulk Data Consumers • Micro-Batch • CDC Integrate Prep Analyze Raw Curated Consumer Internal & External Message Request Databases Queues Ready Ingest APIs ELT Distribute Sources Events ETL ML • Streaming Sensors Machines Logs • Edge Data Shipping • Micro-batch Collaborative Development, Continuous Design Data Platforms Deployment Flexibility: Choose, Change On-premise, private cloud or public cloud


9.Change Data Capture

10. Slowly Changing Dimension

11.Take Aways • No code • Powers Delta Lake with Fast Data • Handles complex data integration logic (CDC, SCD, …) with ease • Become a DataOps champion!

12.Next Steps… • Visit booth #88 for further information • Get started with StreamSets for powering your Delta Lakes: https://streamsets.com/download/ • Get slack’ing with StreamSets ninjas https://streamsetters-slack.herokuapp.com/


由Apache Spark PMC & Committers发起。致力于发布与传播Apache Spark + AI技术,生态,最佳实践,前沿信息。