Interacting with customers in the moment and in a relevant, meaningful way can be challenging to organizations faced with hundreds of various data sources at the edge, on-premises, and in multiple clouds.
To capitalize on real-time customer data, you need a data management infrastructure that allows you to do three things:
1) Sense-Capture event data and stream data from a source, e.g. social media, web logs, machine logs, IoT sensors.
2) Reason-Automatically combine and process this data with existing data for context.
3) Act-Respond appropriately in a reliable, timely, consistent way. In this session we’ll describe and demo an AI powered streaming solution that can tackle the entire end-to-end sense-reason-act process at any latency (real-time, streaming, and batch) using Spark Structured Streaming.
The solution uses AI (e.g. A* and NLP for data structure inference and machine learning algorithms for ETL transform recommendations) and metadata to automate data management processes (e.g. parse, ingest, integrate, and cleanse dynamic and complex structured and unstructured data) and guide user behavior for real-time streaming analytics. It’s built on Spark Structured Streaming to take advantage of unified API’s, multi-latency and event time-based processing, out-of-order data delivery, and other capabilities.
You will gain a clear understanding of how to use Spark Structured Streaming for data engineering using an intelligent data streaming solution that unifies fast-lane data streaming and batch lane data processing to deliver in-the-moment next best actions that improve customer experience.
1.WIFI SSID:Spark+AISummit | Password: UnifiedDataAnalytics
2.AI-Powered Streaming Analytics for Real-Time Customer Experience John YourHaddad Name,&Your Vishwa Belur Organization #UnifiedDataAnalytics #SparkAISummit
3.Sense Reason Act
4.Streaming Data Management Sense Reason Act Streaming Data Streaming Data Streaming Pipeline Ingestion Enrichment Operationalization Collect streaming data from Enrich and distribute streaming Operationalize actions streaming and IoT endpoints data in real time for business based on insights from and ingest it onto the lake or user consumption streaming data messaging hub
5.Intelligent Streaming Analytics Real-time Offer Predictive Maintenance Clinical Research Real-time Fraud Management and Smart Factory Optimization Detection and Alerting Combine web searches Identify stress signals coming Collect and process bedside Run real time fraud and camera feeds to from devices and act on them monitor data for clinical detection machine- before its too late researchers to more effectively learning model on the identify the customer and understand and detect disease transaction data roll out real-time offers
6. Case Study: Multi-latency data management at a large global Media & Entertainment company Objective: Enhance the customer experience with chatbot on mobile app in cloud • How did Informatica help? – Real-time ingestion and analytics on chatbot responses – Provide real time feedback to IBM Watson on the chatbot response quality Data Engineering – Run batch analytics on the data for operational reporting Data Lake Mobile phone Mobile App (Line) Data Engineering Amazon Kinesis
7. Case Study: Streaming Analytics at OVO, a leading Indonesian financial services platform Objective: Create better customer engagement with targeted real-time campaigns • How did Informatica help? – Stream data from millions of customers for analysis of various combinations of customer behaviors and segments – Integrate customer transaction and interaction data to deliver personalized campaigns in less than 15 seconds Data Engineering Data Lake Mobile App – Top up, payments, purchases notificatons, promo codes, deals Customer analytics and attributes
8.Event-centric Data Processing – Methodology The value of most events is multiplied by the context Dynamic Context Enterprise Data External Data Models, Rules, History, Patterns, States Event Sense Reason Act Action
9. Real-time Streaming Analytics — Customer Expectations Sense Reason Act Versatile Edge Schema Streaming Serverless Connectivity Processing Drift Analytics & ML Variety of sources and Simple edge processing Support “intent- Support “event time Real time enrichment source protocol (MQTT, to handle bad records driven ingestion” based processing” with transformation OPC, HTTP, etc.) Add metadata to the Support for Support for late Serverless streaming Structured/Unstructured message for dynamically evolving arrival of IoT events Real time ML model better analytics schema Cloud data lakes and operationalization messaging hubs
10. Informatica Enterprise Streaming and Ingestion Sense Reason Act Parse Filter Transform Aggregate Enrich Change Changes Data Capture & Real-time/Batch Processing & Analytics Publish Azure Amazon Real time Real time Relational Event Hub Kinesis offer alert dashboard Systems Message Hub Sensor Data Trigger business processes Machine Data / IoT Real-time AWS/Azure/Google Web Ingestion Logs Not Only SQL Social Media Capture and Ingest Enrich, Process, and Analyze Persist /Data Lake/Data Warehouse
11.Cloud Streaming Ingestion Service Sense Provides streaming ingestion capabilities as part of IICS Data Ingestion service Ingest streaming data: Logs, clickstream, social media, Kafka Kinesis, S3, ADLS, Firehose, etc. Real-time monitoring of Sensor ingestion jobs with lifecycle Data management and alerting in Messaging Real time Machine Systems analytics case of issues Data / IOT WebLogs Orchestrate streaming data Social Data Lake Consumption ingestion in hybrid/cloud as Media managed and secure service Messaging Systems
12.Decipher Data With CLAIRE™ Reason Automatic Structure Automatic Model Deploy on Cloud Detection Development or on-premise E E p x p x o o r r t t • Machine learning algorithm • View data in a visual • Parser is automatically recognizes the file structure structure created • Relational structure • Clearly see which elements • Intelligent Parsers can be generated on the fly are connected to real data used in run-time to transform similar data files for • Refine data: continuous processing – Normalize – Exclude – Rename element
13.Intelligent Structure Discovery in Action • Data Fluctuation and Data Drifting – Different formats of the same semantics: 01/01/2019 and 01-01-2019 and Jan-01-2019 – Record changes within file: If some records contain 10 fields other contain 8 – Some changes in file format: for example additional fields Original Log New version Log New fields that are not in the New date format Added spaces model are mapped to is handled are handled unassigned ports correctly correctly
14.Spark Structured Streaming—Overview Reason Scalable and fault-tolerant stream processing engine • What is Structured Streaming? – High level streaming API built on DataFrames/DataSets; treats stream as an infinite table – Unifies streaming, interactive and batch queries and provides a structured abstraction • How does it help? STRUCTURED – Handle streaming data based on event time instead of processing STREAMING (Spark) time – Address otherwise impossible—out of ordered delivery of streaming data with watermarking – Support for output mode in Streaming target—append, complete and update https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
15.Motivation for Structured Streaming • Streaming and batch needed separate implementation – Streaming (Dstream) managed by StreamingContext – Batch (DataFrame) managed by SqlContext • Spark unified API to leverage the functionalities offered by DataFrame API’s – Common implementation for batch & streaming – Provides enhanced Stateful operations and gives further control to spark application developers – Moving towards operating batch as stream and stream as batch
16.Spark Structured Streaming Support What are we doing? • First vendor to support Spark Structured Streaming with Spark 2.3.1 • Windowing based on source event time • Ability to define watermark for late event handling How does it help? • Aggregate streaming data based on event time instead of processing time • Handle “out of order” data from the source and deliver “in order” to target
17.Spark Structured Streaming vs Dstream Event EventTime Spark Computing Time Actual Event Rec1 01/01/2019 00:00:00 01/01/2019 00:00:01 LogDateTime ( 01/01/2019 00:00:00 ) , LogLevel - ERROR Rec 2 01/01/2019 00:00:01 01/01/2019 00:01:01 Rec2 – LogDateTime ( 01/01/2019 00:00:01 ) , LogLevel - ERROR Rec 3 01/01/2019 00:00:02 01/01/2019 00:00:03 Rec3 – LogDateTime ( 01/01/2019 00:00:02 ) , LogLevel - ERROR Rec 4 01/01/2019 00:00:03 01/01/2019 00:00:21 – LogDateTime ( 01/01/2019 00:00:03 ) , LogLevel - ERROR Rec 5 01/01/2019 00:00:04 01/01/2019 00:00:23 – LogDateTime ( 01/01/2019 00:00:04 ) , LogLevel – ERROR • Batch Size : 20 seconds • Window Size : 40 seconds • Watermark Interval : 5 minutes (ignored in Dstream , applicable only for structured streaming).
18.Processing with Dstream • Dstream’s window is based on the processing time- time of arrival of data into spark’s window computation • User has no choice to pick any data or metadata which is part of the incoming data for windowing batch-(n+1) batch-(n+2) batch-(n+3) batch-(n+4) Rec1 ….. Rec3 Rec4 …… Rec2 Rec5 window -1 window -2 Count(Rec1, Rec3, Rec4, Rec5) = 4 count(Rec2) = 1 Late Arrival is not addressed and source event time based ordered processing is not possible
19.Processing with Structured Streaming State Store maintained by Spark Window Count(Aggregate) [01/01/2019 00:00:00 – 01/01/2019 00:00:40] 5 batch-(n+1) batch-(n+2) batch-(n+3) batch-(n+4) …Recx (last data to arrive with Rec1 ….. Rec3 Rec4 …… Rec2 TS 31/12/2018 11:59:20) in this batch. Rec5 last record in batch-(n+1) : Rec 3 Empty batch will be ignored. Last record in batch-(n+2) : Rec5 WaterMarkDelay = 31/12/2018 11:59:20 WaterMarkDelay = 01/01/2019 Water Mark Delay = 01/01/2019 – 5 minutes = 31/12/2018 11:54:20 00:00:02 – 5 minutes = 00:00:04 – 5 minutes = 31/12/2018 11:55:02 31/12/2018 11:55:04 . Rec1 and Rec3 falls within 01/01/2019 Rec2 falls within 01/01/2019 00:00:00 – 01/01/2019 00:00:40 and Rec4 and Rec5 falls within 00:00:00 – 01/01/2019 00:00:40 watermark is not greater than end 01/01/2019 00:00:00 – 01/01/2019 and watermark is not greater than interval. 00:00:40 and watermark is not end interval. So, spark waits and aggregate for greater than end interval. records arriving for this window. Structured Streaming aggregates the late arrival of events in the same window [01/01/2019 00:00:00 – 01/01/2019 00:00:40]
20.How Informatica adopted Structured Streaming? 20
21.Data Science & Machine Learning Act What are we doing? • Python Tx in batch and streaming mapping How does it help? • Helps to operationalize ML models in streaming flows • Solves Data science use cases
22.Enterprise Streaming & Ingestion - Reference Architecture Machine Learning-based Data MDM Recommendation Warehouse Hub Engine Real-time Streaming & Batch Visualization Analytics Target Topic Alerts THINGS Stream & IoT Parse Integrate Data Lakes & Warehouses Customer Ingest Deliver Service Representative Cleanse Match CDC CRM ERP Ingest Enrich Mask Databases
23. Demo 23
24. Innovative Retail Corp. – Use case • Innovative Retail Corp - big retail corporation with offline and online presence • Business Challenges • Loyal customers visiting the shop are not buying products they browsed online • Customers browsing products online, but not purchasing • What do they want to do? • Track customer online searches • Ingest camera feed & use image recognition to identify customers visiting the store • Ingest Beacon feed data to identify the real-time location of the customer in shop • Push real time offers on customer mobile phones based on their search history & the section they are in the store 24
25. Demo Scenario Sense Reason Act www…com JSON Lookup Analytics Real time Ingestion offer Alert SMS Searc h weblogs history JSON Ingestion Camera Feed Beacon Real time dashboard data Capture and Transport Enrich and Process
26.AI-Powered Enterprise Streaming Solution High Level Roadmap Enterprise Class Multi-Cloud, Streaming Hybrid Connectivity Analytics & Streaming CLAIRE • Cloud native streaming data • Serverless computing • CLAIRE & Confluent Schema ingestion and stream processing • Streaming reference architecture Registry support for parsing & • Enterprise readiness – with Databricks schema drift distributed deployment, IoT • Extensive Streaming & IoT • De-duplication and Continuous deployment, dynamic mapping source & target connectivity streaming
27. Summary Streaming in Multi Streaming Simplified Complex Data Parsing Latency Platform Connectivity UI & Schema Drift • Multi latency data • IoT & Streaming • Easy to use experience for • AI-driven complex data management platform source connectivity data ingestion & integration parsing with CLAIRETM • End-to-end streaming for • Cloud data lakes • Unified UI for streaming and • Address dynamically Cloud/Hybrid eco-systems and messaging hub batch evolving schema connectivity • Integrated deployment & real-time monitoring
29.DON’T FORGET TO RATE AND REVIEW THE SESSIONS SEARCH SPARK + AI SUMMIT