Journey to Creating a 360 View of the Customer: Implementing Big Data Strategies
2 .Big Data Journey to Create the 360 View of the Consumer: Data Driven Strategies with Data Lake and Databricks
Jyoti P. Mohapatra, Altria
Ramesh Ketha, Capgemini
#UnifiedAnalytics #SparkAISummit
3 .About Altria
Altria's companies have a strong American heritage stretching back more than 180 years. Altria Group holds diversified positions across tobacco, alcohol and cannabis. Through our wholly-owned subsidiaries and strategic investments in other companies, we seek to provide category-leading choices to adult consumers, while returning maximum value to shareholders through dividends and growth. We are a FORTUNE 200 company, proud to call Richmond, Virginia our home. Our people and companies address tough industry issues, like reducing the health effects of tobacco use and preventing underage tobacco use. And we focus on strengthening the communities where we live and work.
Our Mission & Values
Our Mission is to own and develop financially disciplined businesses that are leaders in responsibly providing adult tobacco and wine consumers with superior branded products.
Our Values guide our behavior as we pursue our Mission and our business strategies: Integrity, Trust and Respect; Passion to Succeed; Executing with Quality; Driving Creativity into Everything We Do; Sharing with Others.
4 .Context
• Data is a competitive advantage for Altria
  – Adult Consumer Database
  – Marketplace Information
  – Trade Program
• Access to and use of new adult consumer information and sources of data are increasing
• Very competitive and regulated market
• Growth impact seen at companies that inject analytics into their operations
• Building up and connecting data will drive better insights and continued advantage for Altria
5 .Mission
Connected Data
• Owned data on ATC 21+ who are age verified, registered and opted in
• 3rd party data (e.g. public data: census, economic data etc.)
• Marketplace POS Scan data
• Other Altria operational data
Analytics Roadmap
• Analytical tools
• People
• Process
Analytics Insights and Synthesis; Marketing & External Execution
• ATC Understanding
• Precise Value and Equity Delivery
• Enable Salesforce
• Product Innovation and Regulatory Approval
• External Engagement
6 .Business Initiatives
• Digital Transformation
  – Adult Consumer 360 and Personalization
  – Marlboro Rewards Launch
  – Market Basket Analysis
  – Precise Value and Equity Delivery
  – Product Innovation and Regulatory Approval
• Data Velocity and Volume
  – Growth in POS Scan Data
  – Trade Program Management
  – Competitive Products
  – Trade Payments
• Sales Application Cloud Migration
  – Reduce the Data footprint On-Premise
  – Data Interfaces, Pipelines and Process rebuild
  – Applications Transactional sync
• Data Governance and Stewardship and Unified Access
7 .Challenges
• Data content stored in disparate sources
• Limited integrated view of adult consumers and cross-channel activities
• Cumbersome, slow data access
• Asynchronous data exchange with suppliers and adult consumer touchpoints (e.g., Email, SMS)
• Limited analytics capabilities, e.g., real-time personalization, coupon optimization, cross-channel harmonization, experimentation
• Siloed architecture that limits cross-channel experiences and scalability
8 .Data Landscape
[Diagram. Labels recoverable from the slide:]
• Data Lake (standardized storage of all sources representing Sales Order->Consumer)
• Sources: Clickstream, Email/SMS/DM, Retail STARS, POS Scan, Loyalty, Fulfillment & eCom.
• Data stores: ATC PII Data, Consumer Engagement Data, Sales Transactions & Cache, ILD (AGDC AZURE)
• Foundation Data Mart (KPIs, Standard Reporting and Analytics): 'Standardized' data access across multiple departments
• Discovery (Advanced Analytics): tailored set for use in model discovery
• Consumers: Digital Execution, CMI, Sales & Operation
• Flows: Std. KPIs, Model Scores, Model Automation (key to enabling 'Data Driven' execution/agility)
9 .Enterprise Data Lake Journey
Stages: Ingress Data -> Store Data -> Build Marts -> Do Analytics & BI
[Diagram. Labels recoverable from the slide: Secure, Authorized & Monitored; Secure, Governed and Automated; BI, Discovery & External Feeds; Azure Databricks (Discovery/Models, Interactive Data Science Cluster); ADLA Batch (U-SQL, ETL Marts); Altria Enterprise Data Lake; Data Services; Reporting Tools; Business Analysts. Sources: POS, Sales, STARs, Channel Property, Consumer Activity, Consumer Loyalty, IRI Unify.]
Data Lake Services
- Data Enrollment Program (DEP): automated new data providers, notifications and monitoring
- Data Analytics Services (DAS): data marts, modeled, visualization
- Data Governance Service (DGS): audit ingress, stage and egress
- Infrastructure Data Service (IDS): archive/stage only
- Pre-modeled Data Host Services (PDHS): stage pre-modeled/aggregated data for visualization
10 .Altria Design Principles
▪ PaaS first solution
▪ Security & Governance in line with Enterprise Architectural guidelines
▪ Secured Azure Cloud environment, 'Private Peer' Express Route only from On-Premise to Cloud
▪ PaaS service provider must have
  - Identity Management (AAD)
  - Approved Networking & Security
  - Vulnerability & Audit reports
▪ Consolidated datasets in central location (without Personally Identifiable Information)
▪ Data resides within Altria Subscription
▪ Enable a 'Single Source of Truth'
▪ Enable analytics
  - Quick and easy access to information
  - Leverage power of cloud computing to enable machine learning / advanced analytics
▪ Governed by Information and Insights Initiative Data Principles
11 .Azure PaaS Reference Architecture
[Network diagram. Labels recoverable from the slide:]
• Zones: Managed Service Providers, Altria Azure Cloud, Altria On Premise (Corporate Network)
• Public services: Data Factory, Event Hub, API Mgmt, Storage, Azure Active Directory, Data Lake, Event Grid, Power BI Service
• Altria Cloud VNets shared by Managed Services: Service End Points, UDR, NVA, Data Gateway, SSIS Integration Runtime, Cluster
• Connectivity: Express Route (Private Peer only), MS backbone, public integration traffic, control plane traffic, telemetry
• Ports: TCP 443, TCP 443/80, TCP 1433, SSH (22)
• Other labels: FQDNs, certificate related assets, service tags
12 .Data Lake Data Flow Strategy: Multi-Zone Implementation
Phases
• Data Acquisition (Data Engineers & Data Operation)
• Data Preparation (Data Engineers & Data Scientists)
• Data Reporting (Business Analysts & Data Scientists)
Zones
• Landing
  1. DEP & Ingress process
  2. Checksum validated
  3. Schema validated
• Raw
  1. Raw data files
  2. Decompressed
  3. UTF-8 converted
• Archive
  1. Source file archival
  2. Cool storage
  3. Similar to Raw zone layout
• Refined
  1. Schema harmonization
  2. Single version of truth
  3. Active, cleansed, partitioned
  4. Detail-level access through Databricks
• Discovery
  1. Sandbox
  2. Exploration
  3. Mining
• Reporting
  1. Business ready
  2. Data marts
  3. Aggregates
  4. Access through relational database and visualization tools
Flows shown on the slide: Data Mining; Data Mining on Aggregates; Batch processing for Marts & Reporting views
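The Landing and Raw zone checks listed on this slide (checksum validated, decompressed, UTF-8 converted, schema validated) can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline Altria built: the function names, the SHA-256 checksum, gzip compression, the Latin-1 source encoding, the header-row schema check and the order of the steps are all assumptions made for the example.

```python
import gzip
import hashlib
from typing import Sequence


def sha256_hex(payload: bytes) -> str:
    """Checksum the bytes exactly as they arrived in the Landing zone."""
    return hashlib.sha256(payload).hexdigest()


def land_to_raw(payload: bytes, expected_sha256: str,
                source_encoding: str = "latin-1") -> bytes:
    """Validate, decompress and re-encode one landed file (hypothetical helper)."""
    # 1. Checksum validation: reject files damaged in transfer.
    if sha256_hex(payload) != expected_sha256:
        raise ValueError("checksum mismatch: file rejected at Landing")
    # 2. Decompress if the payload starts with the gzip magic bytes.
    if payload[:2] == b"\x1f\x8b":
        payload = gzip.decompress(payload)
    # 3. Convert to UTF-8 (the source encoding is assumed to be known per provider).
    return payload.decode(source_encoding).encode("utf-8")


def schema_matches(raw_utf8: bytes, expected_columns: Sequence[str],
                   delimiter: str = ",") -> bool:
    """Schema validation reduced to comparing the header row with expected columns."""
    lines = raw_utf8.decode("utf-8").splitlines()
    return bool(lines) and lines[0].split(delimiter) == list(expected_columns)
```

A file that fails either check would stay out of the Raw zone, so the Refined and Reporting zones only see files that passed both.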
13 .Why Databricks on Azure
• Self-service cluster management
• Easy to configure, with all backend services managed by Databricks
• Integration with Azure Identity Management
• Easy integration with Azure Data Lake, Storage, SQL Database and other Azure native cloud services
• Major contributor to open source Spark
• Excellent support for data science development languages like Python, R, Scala etc.
• Speed to market on technology & secured implementation for hybrid access
• Collaborative notebooks and fewer code rewrites
• Full suite of data transformation and ML capabilities including MLflow
[Diagram. Labels recoverable from the slide: Sources (POS, Sales, STARS, Channel Property, Consumer Activity, Consumer Loyalty); Storage; Train & Prepare (ETL, Machine Learning, Discovery, Streaming); Marts; Reporting; Azure Data Lake Analytics Batch Engine.]
14 .Implementation Challenges
• Solution involves hybrid PaaS offerings
• New era of Altria Cloud VNets being shared by service providers for managed services
• Routing trust for managed services without firewall appliances
• Hosting public IPs in Altria Cloud VNets with no Express Route
• Multiple key stakeholders involved in the security, firewall and networking landscape
• Altria networking landscape is evolving, with many moving parts
• Subject matter experts new to Azure
• New tools still maturing (Single Sign-On, security and networking, evolving Azure Storage Gen2)
• Legacy SQL users transitioning to notebooks and learning new skills for working with cloud tools
15 .Success Measure
• Data Lake which includes structured and unstructured data to create a consumer 360
• Variety of data storage types to pre-process data from the Data Lake to support faster and more efficient data access
• Synchronous data exchange with suppliers, retailers, and digital properties via RESTful APIs
• Robust unified analytics platform using Databricks to support new capabilities, e.g., advanced analytics, optimization, and experimentation
• Single data repository and engine that has enormous processing power and the ability to handle concurrent tasks
16 .Success Measure – Model Scoring (Data Fusion)
• Aggregates and consolidates, to a single view, all of the known data about an ATC 21+ to model the unknown
• Main component of Model Engine
[Diagram. Labels recoverable from the slide: Value; Response; Web Channel Clickstream; Adult Consumer Profile; Consumer Activity; Survey Responses; Publicly Available Data; Data Fusion; Full ATC21+ view.]
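The consolidation idea behind Data Fusion can be illustrated with a small Python sketch that merges records from several sources into one profile per consumer. It is only a sketch under stated assumptions: the `fuse` function, the `consumer_id` key and the source and field names are hypothetical, and the slides do not describe how the real system joins or reconciles its data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping

Record = Mapping[str, object]


def fuse(sources: Mapping[str, Iterable[Record]],
         key: str = "consumer_id") -> Dict[object, Dict[str, object]]:
    """Merge per-source records into one flat view per consumer (hypothetical helper)."""
    fused: Dict[object, Dict[str, object]] = defaultdict(dict)
    for source_name, records in sources.items():
        for record in records:
            profile = fused[record[key]]
            for field, value in record.items():
                if field != key:
                    # Prefix each field with its source so columns never collide.
                    # If a source repeats a consumer, the later record wins.
                    profile[f"{source_name}.{field}"] = value
    return dict(fused)
```

Usage, with made-up records:

```python
view = fuse({
    "profile": [{"consumer_id": 1, "state": "VA"}],
    "clickstream": [{"consumer_id": 1, "visits": 3}],
})
# view[1] holds both "profile.state" and "clickstream.visits"
```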
17 .Success Measure / Data Builder & Manager
▪ System integrated with Data Fusion to manage all aspects of the model life cycle
  - Model Documentation
  - Model Run Schedule
  - Model Scoring and Validation
▪ R, Spark SQL and PySpark interface to Data Fusion that greatly simplifies creating custom models
  - Generates 1,000+ time specific variables for modeling
  - Takes care of all data munging and cleaning
  - Creates first pass XGBoost and elastic net models
  - Standardizes variable creation for consistent use across Altria and within vendor network
[Diagram. Model life cycle labels recoverable from the slide: Model Created; Model Validated; Model Documented; Added to Run Schedule; Model Runs in Production; Model Used.]
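One common way to produce "time specific variables" is to count events in trailing windows relative to a scoring date. The sketch below shows that pattern in plain Python; the window lengths, the feature naming scheme and the `windowed_counts` function are assumptions for illustration, and the slides do not say which variables the Data Builder actually generates.

```python
from datetime import date, timedelta
from typing import Dict, Iterable, Sequence


def windowed_counts(event_dates: Iterable[date], as_of: date,
                    windows_days: Sequence[int] = (30, 90, 365),
                    prefix: str = "events") -> Dict[str, int]:
    """Count events in trailing windows ending at the scoring date (hypothetical helper)."""
    dates = list(event_dates)
    features: Dict[str, int] = {}
    for days in windows_days:
        start = as_of - timedelta(days=days)
        # Window is (start, as_of]: events after the scoring date are ignored
        # so the variable never leaks future information into the model.
        features[f"{prefix}_last_{days}d"] = sum(1 for d in dates if start < d <= as_of)
    return features
```

Applying the same function to many event types and window lengths is how a handful of raw activity tables can expand into a large, consistently named variable set.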
18 .Success Measure - Altria Model Engine
• In-house model management solution
• Rapidly build and install new models
• Technology agnostic (aligns with IS/Digital Machine Infrastructure)
• Formalized Model Documentation process
• QC all data in one place
• Flexibility to add new data as available
19 .Learnings…
• Connectivity issues can cause on-premise job failures and leave data in an irreversible state, since Data Lake Gen1 does not support atomic operations. Build in failover support and monitoring.
• Data Lake folder names are case sensitive.
• Watch Data Lake folder permission inheritance when using Service Principals.
• ADLA (U-SQL) supported only UTF-8 encoding. It evolved to support zipping and unzipping files and schema validations, but make sure it runs on a single node.
• ADLA has node limits and limits on concurrent jobs, and needs care when working with compressed Parquet files.
• ADLA / U-SQL has no built-in capability to extract files that have different schemas; custom extractors are needed.
• File transfers moved from Logic Apps to Data Factory. Logic Apps supports files up to 1 GB.
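The first learning (build failover and monitoring around non-atomic transfers) can be sketched as a retry wrapper that verifies each transfer and cleans up partial output before trying again. This is a generic Python illustration, not Altria's implementation: `upload` and `cleanup` are hypothetical callables standing in for whatever client performs the transfer, and the byte-count check and exponential backoff are assumptions.

```python
import time
from typing import Callable, Optional


def transfer_with_retry(upload: Callable[[], int], expected_bytes: int,
                        cleanup: Callable[[], None], attempts: int = 3,
                        base_delay_s: float = 1.0,
                        sleep: Callable[[float], None] = time.sleep) -> int:
    """Run upload() until it writes expected_bytes; return the attempt number that succeeded."""
    last_error: Optional[Exception] = None
    for attempt in range(1, attempts + 1):
        try:
            written = upload()
            if written == expected_bytes:
                return attempt
            last_error = IOError(f"partial write: {written} of {expected_bytes} bytes")
        except OSError as exc:  # includes ConnectionError and TimeoutError
            last_error = exc
        # Without atomic writes a failed attempt can leave a partial file behind,
        # so remove it before retrying or giving up.
        cleanup()
        if attempt < attempts:
            sleep(base_delay_s * 2 ** (attempt - 1))  # exponential backoff
    raise RuntimeError(f"transfer failed after {attempts} attempts") from last_error
```

The final `RuntimeError` is the natural hook for monitoring: a scheduler or alerting job can catch it and notify operations instead of leaving a half-written file unnoticed.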
20 .Learnings…
• Logic Apps triggers behave inconsistently depending on the number and size of files.
• Pipeline deployments with PowerShell: make sure the ADF V2 and PowerShell versions are in sync for deployments.
• ADF V2, Databricks and parameterized notebooks: worked with Microsoft and closed the issues.
• Event Hub triggers can only be handled through Function Apps, and there are issues with long-running jobs.
• Event Hub integrates only with Azure Blob and Data Lake, and only in Avro format.
• SQL Managed Instance has no integration with PolyBase and Data Lake.
• Databricks: driver node limitation (only 2 GB) with Pandas or native R. Move to SparkR or sparklyr.
• Databricks: Data Lake access started with mount points and moved to passthrough and session-scoped access.
21 .We are reducing bureaucracy, decentralizing decision-making and more effectively using data analytics to drive strategy