Azure HDInsight Customer Deck 101

1.What Is Hadoop And Why Deploy It In the Cloud?

2.Agenda: What Is Hadoop? Why Deploy To the Cloud? Microsoft’s Solution. How Do I Get Started?

3.Breaking points of traditional approach. Diagram: source systems (OLTP, ERP, CRM, LOB), staging, ETL, data warehouse, BI & analytics (dashboards, reporting). Breaking point: (1) increasing data volumes. 50x data growth 2010-2020; 40 ZB digital universe in 2020; 1 trillion web pages.

4.Breaking points of traditional approach. Diagram: source systems (OLTP, ERP, CRM, LOB), staging, ETL, data warehouse, BI & analytics (dashboards, reporting). Breaking points: (1) increasing data volumes; (2) real-time data. 204M emails sent every minute; 340M tweets sent every day; 231B US e-commerce in 2012.

5.Breaking points of traditional approach. Diagram: source systems (OLTP, ERP, CRM, LOB) and new data (devices, web, sensors, social), staging, ETL, data warehouse, BI & analytics (dashboards, reporting). Breaking points: (1) increasing data volumes; (2) real-time data; (3) new data types. 15x machine-generated data, 2020; 1.3M hours on Skype per hour; 2.4M pieces of Facebook content per minute.

6.Breaking points of traditional approach. Diagram: source systems (OLTP, ERP, CRM, LOB) and new data (devices, web, sensors, social), staging, ETL, data warehouse, BI & analytics (dashboards, reporting). Breaking points: (1) increasing data volumes; (2) real-time data; (3) new data types; (4) cloud-born data. $100B spend on cloud; 50% of large orgs have hybrid by 2017; 40% of CRM sold is SaaS.

7.What if you could handle big data? Chart: data complexity (variety and velocity) against data size (megabytes, gigabytes, terabytes, petabytes). ERP/CRM: payables, payroll, inventory, contacts, deal tracking, sales pipeline. Web 2.0: web logs, digital marketing, search marketing, recommendations, advertising, mobile, collaboration, eCommerce. Big Data: log files, spatial & GPS coordinates, data market feeds, eGov feeds, weather, text/image, click stream, wikis/blogs, sensors/RFID/devices, social sentiment, audio/video.

8.Introducing Apache Hadoop. Data volumes, data variety, data velocity. Apache open source project. Highly scalable distributed file system (HDFS). Distributed processing on data nodes.

9.Data volume. Hadoop stores files in a distributed file system. Storage and computation are distributed across many servers. Files can be spread out over multiple nodes. Hadoop can store very large amounts of data. The combined storage resource can grow with demand from a few nodes to thousands of nodes. Scales out linearly. Very large files are supported, including those larger than the capacity of a single node.
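
The idea that a file is cut into pieces and spread over several nodes can be sketched roughly as follows. The 128 MB block size, the node names, and the round-robin placement are assumptions made for illustration; HDFS's actual placement and replication policy is more involved.

```python
def split_into_blocks(file_size, block_size):
    """Return the sizes of the blocks a file of file_size is cut into."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])

def place_blocks(blocks, nodes):
    """Assign block indexes to nodes round-robin (illustrative policy only)."""
    placement = {node: [] for node in nodes}
    for index in range(len(blocks)):
        placement[nodes[index % len(nodes)]].append(index)
    return placement

# Hypothetical numbers: a 1,000 MB file, 128 MB blocks, three nodes.
blocks = split_into_blocks(1000, 128)
placement = place_blocks(blocks, ["node-1", "node-2", "node-3"])

# Largest amount of this file held by any single node, in MB.
max_per_node = max(sum(blocks[i] for i in idx) for idx in placement.values())
print(len(blocks), placement, max_per_node)
```

In this sketch no node holds the whole file, which is why a file can be larger than the capacity of any single node.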

10.Data variety. Hadoop stores files (non-relational store). Files could have a variety of semi-structured or unstructured data. Previously, these files may not have been seen as providing value or insights. Today, new business questions and insights are being uncovered through data science. Sentiment: understand how your customers feel about your brand and products, right now. Clickstream: capture and analyze website visitors’ data trails and optimize your website. Sensors: discover patterns in data streaming automatically from remote sensors and machines. Geographic: analyze location-based data to manage operations where they occur. Server logs: research logs to diagnose process failures and prevent security breaches. Unstructured: understand patterns in files across millions of web pages, emails, and documents.

11.Data velocity. Hadoop can stream live data and process it in real time. Hadoop can act as a scalable event stream ingestion layer. Hadoop can do near-real-time in-stream processing. Diagram: applications and devices (HTTP), incoming data input, event broker, stream processing, outgoing.
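
The data input, event broker, stream processing, outgoing flow on this slide can be sketched in a few lines. This is a toy, in-memory illustration only: the `EventBroker` class, the device names, and the alert threshold are hypothetical and do not correspond to any HDInsight or Storm API.

```python
from collections import defaultdict, deque

class EventBroker:
    """In-memory stand-in for an event broker: buffers incoming events."""

    def __init__(self):
        self._events = deque()

    def publish(self, event):
        self._events.append(event)

    def consume(self):
        # Hand events to the processor one at a time, oldest first.
        while self._events:
            yield self._events.popleft()

def process_stream(events, threshold):
    """Count readings per device and emit an alert for readings above threshold."""
    seen = defaultdict(int)
    for device, reading in events:
        seen[device] += 1
        if reading > threshold:
            yield {"device": device, "reading": reading, "seen": seen[device]}

# Incoming: hypothetical sensor readings published by devices.
broker = EventBroker()
for event in [("sensor-a", 20), ("sensor-b", 75), ("sensor-a", 90)]:
    broker.publish(event)

# Outgoing: alerts produced as each event passes through the processor.
alerts = list(process_stream(broker.consume(), threshold=70))
print(alerts)
```

Each event is handled as it arrives rather than after a batch has been collected, which is the distinction the slide draws between stream and batch processing.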

12.Hadoop is a platform with a portfolio of projects. Governed by the Apache Software Foundation (ASF). Comprises core services of MapReduce, HDFS, and YARN. In addition to the core, includes functions across: data services, which allow you to manipulate and move data (Hive, HBase, Pig, Flume, Sqoop); operational services, which help manage the cluster (Ambari, Falcon, and Oozie). Architecture diagram: Governance and integration (data workflow, lifecycle, and governance): Falcon, Sqoop, Flume, NFS, WebHDFS. Data access: batch (MapReduce), script (Pig), SQL (Hive/Tez, HCatalog), NoSQL (HBase, Accumulo), stream (Storm), search (Solr), others (Spark, in-memory, ISV engines), on YARN (data operating system). Data management: HDFS (Hadoop Distributed File System), nodes 1 to N. Security (authentication, authorization, accounting, data protection): storage: HDFS; resources: YARN; access: Hive, …; pipeline: Falcon; cluster: Knox. Operations: provision, manage, and monitor (Ambari, Zookeeper); scheduling (Oozie).

13.A Hadoop distribution is a package of projects, tested for consistency across the entire package. Categories: data management, data access, governance and integration, operations, security. Projects: Knox, Tez, Pig, Hive and HCatalog, Phoenix, Accumulo, Storm, Mahout, Solr, Falcon, Sqoop, Flume, Ambari, Oozie, Zookeeper, HBase, Hadoop and YARN. Version numbers as listed on the slide: HDP 2.0 (October 2013): 2.2.0, 0.12.0, 0.12.0, 0.96.1, 0.8.0, 1.4.4, 1.3.0, 1.4.4, 3.3.2, 3.4.5, .0.4.0. HDP 1.3 (May 2013): 1.1.2, 0.11.0, 0.11.0, 0.94.6, 0.7.0, 1.4.3, 1.3.1, 1.2.5, 3.3.2, 3.4.5, .0.4.0. HDP 2.1 (April 2014): 0.4.0, 0.12.1, 0.13.0, 0.98.0, 4.0.0, 1.5.1, 0.9.1, 0.9.0, 4.7.2, 0.5.0, 1.4.4, 1.4.0, 1.5.1, 4.0.0, 3.4.5, .0.4.0, 2.4.0.

14.With many contributors: 80 committers to the Hadoop core project.

15.Business applications of Hadoop. Retail: 360° view of the customer; analyze brand sentiment; localized, personalized promotions; website optimization; optimal store layout. Financial services: new account risk screens; fraud prevention; trading risk; maximize deposit spread; insurance underwriting; accelerate loan processing. Telecom: call detail records (CDRs); infrastructure investment; next product to buy (NPTB); real-time bandwidth allocation; new product development. Utilities, oil, and gas: smart meter stream analysis; slow oil well decline curves; optimize lease bidding; compliance reporting; proactive equipment repair; seismic image processing. Public sector: analyze public sentiment; protect critical networks; prevent fraud and waste; crowd-source reporting for repairs to infrastructure; fulfill open records requests. Manufacturing: supplier consolidation; supply chain and logistics; assembly line quality assurance; proactive maintenance; crowd-source quality assurance. Healthcare: genomic data for medical trials; monitor patient vitals; reduce re-admittance rates; store medical research data; recruit cohorts for pharmaceutical trials.

16.New analytic applications from new data. Table columns: industry; use case; sentiment and web; clickstream and behavior; machine and sensor; geographic; server logs; structured and unstructured. Financial services: new account risk screens ✔ ✔; trading risk ✔; insurance underwriting ✔ ✔ ✔. Telecom: call detail records (CDR) ✔ ✔; infrastructure investment ✔ ✔; real-time bandwidth allocation ✔ ✔ ✔. Retail: 360° view of the customer ✔ ✔ ✔; localized, personalized promotions ✔; website optimization ✔. Manufacturing: supply chain and logistics ✔; assembly line quality assurance ✔; crowd-sourced quality assurance ✔. Healthcare: use genomic data in medical trials ✔ ✔ ✔; monitor patient vitals in real time. Pharmaceuticals: recruit and retain patients for drug trials ✔ ✔; improve prescription adherence ✔ ✔ ✔ ✔. Oil and gas: unify exploration and production data ✔ ✔ ✔ ✔; monitor rig safety in real time ✔ ✔ ✔. Government: ETL offload/federal budgetary pressures ✔ ✔; sentiment analysis for government programs ✔.

17.Challenges with implementing Hadoop: up-front HW costs; capacity planning; Hadoop expertise.

18.Why Hadoop in the Cloud? Benefits of cloud: unlimited elastic scale; auto geo-redundancy; no hardware costs ($0); pay only for what you need; deployed in minutes.

19.Scenarios For Deploying Hadoop As Hybrid. On-premises: Hadoop software; appliances. Cloud: cloud develop/POC; cloud bursting; cloud backup/archive.

20.Agenda: What Is Hadoop? Why Deploy To the Cloud? Microsoft’s Solution. How Do I Get Started?

21.Introducing Azure HDInsight: 100% Apache Hadoop; powered by the cloud; immersive insights.

22.Microsoft contributions to Hadoop: Hadoop 2.2 and 2.4; 80% data compression with ORC; Hadoop on Windows; Hive 100x query speed-up; 30,000+ code line contributions; HDFS in cloud (Azure); REEF for machine learning; 10,000+ engineering hours; committers to Hadoop.

23.Introducing HDInsight: Microsoft’s cloud Hadoop offering. 100% open source Apache Hadoop. Built on the latest releases across Hadoop. Up and running in minutes with no hardware to deploy. Harness existing .NET and Java skills to write MapReduce. Utilize familiar BI tools for analysis, including Microsoft Excel.


25.Microsoft + Hortonworks: Promoting Open Hadoop. Engineering alignment; corporate alignment; field alignment.

26.HDInsight Built for Windows or Linux. Customer choice. Managed & supported by Microsoft. Familiarity of Windows. Re-use common tools, documentation, and samples from the Hadoop/Linux ecosystem. Add Hadoop projects that were authored on Linux to HDInsight. Easier transition from on-premises to cloud.

27.HDInsight Supports Hive: SQL-like queries on Hadoop data in HDInsight. HDInsight provides an easy-to-use graphical query interface for Hive. HiveQL is a SQL-like language (subset of SQL). Hive structures include well-understood database concepts such as tables, rows, columns, and partitions. Queries are compiled into MapReduce jobs that are executed on Hadoop. Dramatic performance gains with Stinger/Tez: Stinger is a Microsoft, Hortonworks, and OSS-driven initiative to bring interactive queries to Hive. It brings query execution engine technology from Microsoft SQL Server to Hive, with performance gains up to 100x. Microsoft contribution to Apache code. Chart (sample query, Hadoop 2.0): Hive 10: 1400s; HDP 1.3 / Hive 11: 44.3s (32x speedup); HDP 2.0: 35.1s (40x speedup); HDP 2.1: 15s (100x speedup).
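
To give a feel for what "compiled into MapReduce jobs" means, here is a rough sketch of how an aggregate query such as `SELECT country, SUM(views) FROM page_views GROUP BY country` breaks down into map, shuffle, and reduce steps. The table name, columns, and rows are hypothetical, and the plan Hive actually generates is more elaborate than this.

```python
from collections import defaultdict

# Hypothetical rows of a "page_views" table with columns (country, views).
rows = [("US", 3), ("DE", 1), ("US", 2), ("FR", 4), ("DE", 5)]

def map_phase(rows):
    """Emit (key, value) pairs: the GROUP BY column becomes the key."""
    for country, views in rows:
        yield country, views

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Apply the aggregate function (SUM here) to each group of values."""
    return {key: sum(values) for key, values in groups.items()}

result = reduce_phase(shuffle(map_phase(rows)))
print(result)
```

Because each key's values can be reduced independently, the map and reduce steps can run on different nodes in parallel.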

28.HDInsight Supports HBase: NoSQL database on data in HDInsight. Columnar, NoSQL database. Runs on top of the Hadoop Distributed File System (HDFS). Provides flexibility in that new columns can be added to column families at any time. Diagram labels: Name Node, Job Tracker, HMaster, Coordination; four each of Data Node, Task Tracker, and Region Server.
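
The point that new columns can be added to a column family at any time can be illustrated with a toy data model. `SketchTable` is a hypothetical class written for this example, not the HBase client API; it only mimics the row key, column family, column qualifier layout, with families declared up front and columns added freely.

```python
class SketchTable:
    """Toy wide-column table: row key -> column family -> qualifier -> value."""

    def __init__(self, families):
        self.families = set(families)  # declared when the table is created
        self.rows = {}

    def put(self, row_key, family, qualifier, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        # Any qualifier is accepted: a column needs no prior declaration.
        self.rows.setdefault(row_key, {}).setdefault(family, {})[qualifier] = value

    def get(self, row_key, family):
        return self.rows.get(row_key, {}).get(family, {})

table = SketchTable(["profile"])
table.put("user1", "profile", "name", "Ada")
table.put("user2", "profile", "name", "Lin")
# A new column appears on one row only, without changing any schema.
table.put("user2", "profile", "twitter", "@lin")

print(table.get("user1", "profile"))
print(table.get("user2", "profile"))
```

Rows in the same family can therefore carry different sets of columns, which is the flexibility the slide refers to.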

29.HDInsight Supports Mahout: machine learning library. A library of machine learning algorithms to execute on data in HDFS. Algorithms are not dependent on the size of the data and can scale with large datasets. Library includes: collaborative filtering, classification, clustering, dimensionality reduction, topic models.
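
As an illustration of the first algorithm family listed, collaborative filtering, here is a minimal single-machine sketch that recommends items based on how often they co-occur in users' histories. It does not use Mahout and is not how Mahout implements it; the users, items, and scoring rule are assumptions chosen to keep the example small.

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical purchase history: user -> set of items.
history = {
    "ann": {"book", "lamp"},
    "bob": {"book", "lamp", "desk"},
    "cat": {"book", "desk"},
    "dan": {"lamp"},
}

def cooccurrence(history):
    """Count how often each ordered pair of items appears for the same user."""
    counts = defaultdict(int)
    for items in history.values():
        for a, b in permutations(sorted(items), 2):
            counts[(a, b)] += 1
    return counts

def recommend(user, history, counts):
    """Rank items the user lacks by co-occurrence with items the user has."""
    owned = history[user]
    scores = defaultdict(int)
    for (a, b), n in counts.items():
        if a in owned and b not in owned:
            scores[b] += n
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda item: (-scores[item], item))

counts = cooccurrence(history)
print(recommend("dan", history, counts))
print(recommend("ann", history, counts))
```

The co-occurrence counts are simple per-user sums, which is the kind of computation that can be distributed across nodes when the history is too large for one machine.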