2.Creating an Omni-Channel Customer Experience (Spark / ML)
Todd Dube – CarMax Technology
#UnifiedAnalytics #SparkAISummit
3.About CarMax
• Original used car industry disruptor (25 years)
• Nation’s largest retailer of used cars (over $18B revenue)
• Sold 5+ million wholesale vehicles
• 200+ stores in 41 states
• Top 10 used car loan originator
• #174 on 2018 Fortune 500 List
• 25,000+ associates nationwide
• FORTUNE 100 Best Companies to Work For 15 years in a row (Feb 2019)
4.CarMax and Omni-Channel
• Omni-Channel’s focus is on the Customer – Convenience, seamlessness and personalization
• We customize the experience for how the customer wants to buy a vehicle
• Online, In Store, Express Pickup, and Delivery
• Data Science and its enablement are part of our growth strategy
5.CarMax and Data Science – history
• Data Science (DS) has grown tremendously Year after Year
• Key Foundational DS Assets are now critical to our current and future growth
• Data Scientists create / repurpose code SQL/Python on our Data Warehouse or their Laptops
• Limited Compute / Space on Laptops / On-Prem Limitations
• Ad-Hoc datasets everywhere = governance and truth in data issues
• Models have to be rewritten for integration w/ other apps/services
6.Real Example of prior work involved
[Diagram: Teradata (SQL), CSV, Recommender Generator, MMT Matrix, Vehicle Recommendations (Stock to Stock), Cosmos DB, Recommender Service, CarMax.com API, CarMax.com; several components labeled .net]
• Recommender flow involved manually pulling prior data
• Work done on Datasets via local Laptop, exported CSV for import
• C#/.net Application with Logic and Coefficients for Model Service
• Data could change but Model couldn’t change without planning and effort (once a Month if planned)
• Need for streaming and real-time ingestion of vehicle information
7.We had to Define Goals
• CarMax needs a set of tools and a platform for Data Science and ML
– Model Development, Testing, and Deployment
– Data Accessibility, Research and Development
– 3rd party datasets (Acxiom, NuStar, LiveRamp, Adobe, etc)
– Scalable / Affordable Storage
• Develop richer faster changing models (Real-Time)
• Drive Enhanced Customer and Associate Experience
– Omni-Channel
– CarMax.com
– Key Business Areas (Marketing, Finance, Pricing, etc)
[Diagram: Model Lifecycle – Data (Raw), Data Prep, Develop, Train, Test, Evaluate, Governance]
8.We had to define a Data Scientist
We had to define new roles for Technology and Business – You need both types:
Business:
• Data Scientist Type-A (Analyst): produces meaningful insights from the data. Best suited for statisticians with engineering knowledge.
Technology:
• Data Scientist Type-B (Build): implements production models that interact directly with users. Best suited for engineers with statistics knowledge.
9.Set Technology Goals and Use Case
• Enable our Data Scientists:
– Enable CarMax Data Scientists to more autonomously build, test, and deploy models
– Leverage data of varied structures for research, on-demand, self-service machine learning
– Support for familiar data science tools and libraries
– Support common Python, Spark, Jupyter, etc and packages/frameworks
– Spend less time wrangling data
• Support Key Use Case to Prove out Platform and Value to CarMax
– 1st - Recommender System, then Bidding and Others…
10.CarMax Technology Requirements
• Centralized Hosted Data Lake Storage
• “Catalog” for Managing Data Assets
• Defined Ingest and Management Patterns
• Performant and Easy Management of Compute
– Support for Tools and Technologies new and emerging
• Support for Spark, Python, Scala, Python DS/ML Packages
• Managed Platform in Cloud utilizing PaaS/SaaS Resources
• Architecture to Support Batch and Real-Time Model Build/Test/Deployments
– Real Time Model Serving and A/B Testing for Data Scientists
11.DataLake Zones – Curation and Flow
Pipelines move data through Production data lake zones
[Diagram: RAW → VALID → REFINED, with an ERROR zone]
Raw
• Data is loaded in natural state without applying transformations
• “Landing zone” of the data lake
Valid
• Converted to standardized file format to reduce storage and improve processing
• Metadata validation to confirm data is in expected format
Refined
• Aggregation and/or consolidation of one or multiple valid data sets for use as input to model
• Enrichment of data
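The zone flow on this slide can be illustrated with a minimal Python sketch. It is not CarMax's pipeline: the schema, the field names (`stock_number`, `make`, `model`, `price`), and the choice of an average price as the "refined" aggregate are all assumptions made for illustration.

```python
import json
from collections import defaultdict

# Hypothetical expected metadata for a vehicle record (field name -> allowed types).
EXPECTED_SCHEMA = {
    "stock_number": str,
    "make": str,
    "model": str,
    "price": (int, float),
}


def validate(record, schema=EXPECTED_SCHEMA):
    """Return True when the record has every expected field with an expected type."""
    return all(isinstance(record.get(name), kind) for name, kind in schema.items())


def raw_to_valid(raw_lines):
    """Parse raw JSON lines; route conforming records to VALID and the rest to ERROR."""
    valid, error = [], []
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            error.append(line)
            continue
        if isinstance(record, dict) and validate(record):
            valid.append(record)
        else:
            error.append(line)
    return valid, error


def valid_to_refined(valid_records):
    """Aggregate valid records into an average price per (make, model)."""
    prices = defaultdict(list)
    for record in valid_records:
        prices[(record["make"], record["model"])].append(record["price"])
    return {key: sum(values) / len(values) for key, values in prices.items()}
```

Raw lines stay untouched until the validation step, so a rejected record can be inspected in the error list exactly as it arrived.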
12.DS / ML Platform Phase 1
Phase 1: 4 Months Starting in July 2018 - DONE
Modernize batch Recommender:
– Architecture and Solution POC (Evaluated - Knime, Dataiku, Databricks, AzureML Studio, H2O.AI)
– New Daily Batch Recommender System – model refresh any time
• Batch Daily Based on prior history of click, sales, and other relevant data sources
– Pure Agile Approach w/ 2 week Sprints
– Utilize Vendor partner to bring expertise in Data Lake, Spark, Azure and Data Science
– Framework for Metadata Driven Data Ingestion
• November 2018 Deploy Batch Recommender !!! 5 MONTHS!
Timeline:
• July: Requirements, Design, Vendor Candidate
• August: Vendor POC Selection, Infrastructure
• September: Framework, Build Ingest, Replatform Model
• October: Finalize Ingest, Catalog, Model Testing and Refinement
• November: Finalize Model, API, and Measurement
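The slide does not describe the recommender model itself. As a rough illustration of a daily batch job built from click history, here is a minimal item-to-item co-occurrence sketch in Python. The technique, the session format, and the function names are assumptions for illustration, not the model CarMax deployed.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence(sessions):
    """Count how often two vehicles were viewed in the same session."""
    counts = defaultdict(Counter)
    for viewed in sessions:
        # De-duplicate within a session so repeat views count once per pair.
        for a, b in combinations(sorted(set(viewed)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def recommend(counts, stock_number, top_n=3):
    """Return up to top_n vehicles most often co-viewed with stock_number."""
    neighbours = counts.get(stock_number, {})
    # Sort by descending count, then by identifier so ties are deterministic.
    ranked = sorted(neighbours.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]
```

A batch run would rebuild `counts` from the latest history each day and publish the top lists, which is why the model and the data can refresh together.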
13.DS / ML Platform Phase 2 / 3
Phase 2: 3-4 Months / Phase 3: 2 Months
Real-Time Recommender Model and Architecture
– Model Development, Deployment, and Testing in Real-Time
– High SLA for Web/Mobile
– Prove out Architecture for Hosting, Testing, and Deployment of Models
– Real-Time Streaming of Input Data
– Real-Time Serving of Recommender Model Request/Responses
Broader Business Unit Support for other Models
– Bidding, Propensity, Lead, and other models deployed in Phase 2
– User Adoption and Roll Out to Other Data Scientist Teams
14.Wait, you did what, and how? Really?
15.CarMax Technologies Chosen on Azure
• Azure Data Lake Storage (ADLS) – Hadoop-compatible scalable storage for big data analytic workloads.
• Azure Data Factory – Data Pipeline service to orchestrate and automate data movement and data transformation.
• Azure Data Catalog – Metadata service for registration and discovery of enterprise data assets.
• Databricks – Apache Spark-based analytics platform as PaaS w/ Full Support For: Python, PySpark, SparkSQL, etc
• Azure Functions – “Serverless” compute service that can run code on-demand without having to explicitly provision or manage infrastructure
• MLFlow – End to End ML Lifecycle (easy to use)
• Azure Event Hubs – Data streaming and event ingestion service, capable of receiving and processing millions of events per second
• Azure ML – Open Source Python/.Net/Java for building, deploying, and monitoring Models
• Azure Kubernetes Service – Build, Test, Deploy, Monitor Models
16.Unified Data Processing and ML: Batch Recommender
17.Real-Time Architecture Proposed
18.Outcomes and Results
• Batch Recommender created 10%+ more engagement with recommendations
• Model / Data and Recommendations now updated Daily
• We had to tune our model on Vehicle Inventory status in Real-Time (Changed ingestion / not model)
• Refined model and ingest numerous times without outage or issues
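One way to read the inventory-status point is that fresh status events are ingested and applied to the recommendations while the model stays unchanged. The Python sketch below illustrates that idea only. The event shape, the status values ("available", "sold"), and the function names are hypothetical and do not come from the talk.

```python
def apply_inventory_events(inventory, events):
    """Apply (stock_number, status) events in order; the latest status wins."""
    for stock_number, status in events:
        inventory[stock_number] = status
    return inventory


def filter_available(recommendations, inventory, available_status="available"):
    """Keep only recommended vehicles whose latest known status is available."""
    return [s for s in recommendations if inventory.get(s) == available_status]
```

Because the filter runs on the model's output, the ranking logic does not need to be retrained when a vehicle sells. Only the ingested status changes.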
19.Why Databricks and Spark
Data Scientists
• Manage Code and Python Notebooks similar to Jupyter
• Model Management w/ MLFlow
• Move way beyond prior limitations (Compute, storage, datasets)
• Centralized place for everyone, from casual exploration to hardcore DS/ML Spark Development
• Full Support for All Python Libraries
• Deep ML (Horovod, TensorFlow, anything...)
• Collaboration / Sharing of Notebooks with others
• Detractor – hard to get the “old guard” up to speed on all new things
20.Why Databricks and Spark – cont’d
Technology
• Easily Manage Scalable Compute (no Hadoop/cluster skills)
• Spark is a Go Forward Platform
– Databricks Far and Away is biggest committer to Spark project
– Product Reflects their knowledge and enablement of Platform
• Spark is complicated but Databricks helps make it very easy
• Easily Fits in Azure Architecture (ADLS, AAD, ADF, etc)
• Orchestration of Pipelines in Notebooks
21.Things about me
• Apple/Mac Purist
• 25+ Years Technology
• iPad Pro – Python, C#, Others?
• Dedicate Time to learning and reading as part of job…
• Favorite Podcast – MPU Mac Power Users
Reading / Learning:
• Gartner – yes really
• OCDevel.com
• Realpython.com – Podcasts!
• thetalkingmachines.com
• YouTube
– Deep Learning SIMPLIFIED
– Azure Everything
22.Connect with me on LinkedIn – search ‘Todd Dube’
https://www.linkedin.com/in/tdube/
Questions?
WE ARE HIRING – jobs.carmax.com!