
Posted by Alex2 on 2018/06/19

2.Hadoop For Windows Rohit Bakhshi DBI-B335

3.Speaker Rohit Bakhshi Product Manager Hortonworks

4.Agenda
- Modern Data Architecture
- Hadoop for Windows
- Hortonworks Data Platform under the covers
- Q&A

5.Modern Data Architecture

6.What Makes Up Big Data?
Data volume: Megabytes, Gigabytes, Terabytes, Petabytes
Tiers: ERP, CRM, WEB, BIG DATA
Examples: Purchase detail; Purchase record; Payment record; Offer details; Support Contacts; Customer Touches; Segmentation; Web logs; Offer history; A/B testing; Dynamic Pricing; Affiliate Networks; Search Marketing; Behavioral Targeting; Dynamic Funnels; User Generated Content; Mobile Web; SMS/MMS; Sentiment; External Demographics; HD Video, Audio, Images; Speech to Text; Product/Service Logs; Social Interactions & Feeds; Business Data Feeds; User Click Stream; Sensors / RFID / Devices; Spatial & GPS Coordinates
Increasing Data Variety and Complexity
Transactions + Interactions + Observations = BIG DATA

7.A data architecture under pressure from new data
APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications
DATA SYSTEM: REPOSITORIES (RDBMS, EDW, MPP)
SOURCES: Existing Sources (CRM, ERP, Clickstream, Logs)
Data types: OLTP, ERP, CRM Systems; Unstructured documents, emails; Clickstream; Server logs; Sentiment, Web Data; Sensor, Machine Data; Geolocation
Source: IDC. 2.8 ZB in 2012; 85% from New Data Types; 15x Machine Data by 2020; 40 ZB by 2020

8.Hadoop within an emerging Modern Data Architecture
APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications
DEV & DATA TOOLS: Build & Test
OPERATIONS TOOLS: Provision, Manage & Monitor
DATA SYSTEM: REPOSITORIES (RDBMS, EDW, MPP); Governance & Integration, Security, Operations, Data Access, Data Management
SOURCES: OLTP, ERP, CRM Systems; Documents, Emails; Web Logs, Click Streams; Social Networks; Machine Generated; Sensor Data; Geolocation Data
Data types: OLTP, ERP, CRM Systems; Unstructured documents, emails; Clickstream; Server logs; Sentiment, Web Data; Sensor, Machine Data; Geolocation

9.Hadoop for Windows

10.HDP for Windows
Hortonworks Data Platform (HDP)
- The Only Completely Open Distribution for Apache Hadoop
- Fundamentally Versatile and Comprehensive enterprise capabilities
- Wholly Integrated for deep ecosystem interoperability
Hortonworks Data Platform 2.2
- DATA MANAGEMENT: HDFS (Hadoop Distributed File System); YARN: Data Operating System (Cluster Resource Management)
- BATCH, INTERACTIVE & REAL-TIME DATA ACCESS: Script (Pig); SQL (Hive); Java, Scala (Cascading); Stream (Storm); Search (Solr); NoSQL (HBase); In-Memory (Spark); Others (ISV Engines); Tez; Slider
- GOVERNANCE: Data Workflow, Lifecycle & Governance (Falcon, Sqoop, Flume, WebHDFS)
- SECURITY: Authentication, Authorization, Accounting, Data Protection. Storage: HDFS; Resources: YARN; Access: Hive, …; Pipeline: Falcon; Cluster: Knox; Cluster: Ranger
- OPERATIONS: Provision, Manage & Monitor (Ambari, Zookeeper); Scheduling (Oozie)
- Deployment Choice: Linux, Windows, On-Premises, Cloud

11.HDP: Enterprise Data Platform
HDP certifies the most recent & stable community innovation
Releases: HDP 2.0 (October 2013), HDP 2.1 (April 2014), HDP 2.2 (October 2014)
Categories: Data Management, Data Access, Governance & Integration, Security, Operations
Components and versions (chart, Hortonworks Data Platform 2.2): Hadoop & YARN Pig Hive & HCatalog HBase Sqoop Oozie Zookeeper Ambari Storm Flume Knox Phoenix 2.2.0 0.12.0 0.12.0 2.4.0 0.12.1 0.13.0 0.96.1 0.98.0 0.9.1 1.4.4 1.3.1 1.4.0 1.4.4 1.5.1 3.3.2 4.0.0 3.4.5 0.4.0 4.0.0 Falcon 0.5.0 Ranger Spark 0.14.0 0.14.0 0.98.4 4.2 0.9.3 1.2.0 0.6.0 1.4.5 1.5.0 1.7.0 4.1.0 0.5.0 0.4.0 2.6.0 3.4.5 Tez 0.4.0 Slider 0.60 Solr 4.7.2 4.10.0 0.5.1
* version numbers are targets and subject to change at time of general availability in accordance with ASF release process

12.Seamless Interoperability
Integrations with Microsoft tools for native big data analysis
Layers: SOURCES, APPLICATIONS, OPERATIONAL TOOLS, DEV & DATA TOOLS, INFRASTRUCTURE, DATA SYSTEM
Highlighted: HDInsight, Azure, Power BI (New!)

13.HDP: Powered by Apache Hadoop HDP certifies the most recent & stable community innovation

14.Apache Hadoop
Open Source Data Management
- Scalable: Linearly scale to store petabytes of data
- Reliable: Redundant storage protects against node failures
- Flexible: Store all types of data, apply flexible schemas for analysis and sharing
- Economical: Utilize cost-efficient commodity hardware; achieve high cluster utilization
Storage: HDFS
- Distributed across “nodes”
- Natively redundant
- Single File System
Processing: YARN
- Cluster Resource Manager
- Built-in Fault Tolerance
- High Cluster Utilization
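
The "natively redundant" and "linearly scale" points can be made concrete with a little arithmetic. The following is a minimal sketch, assuming HDFS's default replication factor of 3 (the slide does not state a factor); the node counts and per-node disk size are hypothetical.

```python
def usable_capacity_tb(nodes: int, disk_tb_per_node: float, replication: int = 3) -> float:
    """Approximate usable HDFS capacity: raw capacity divided by the replication factor."""
    if replication < 1:
        raise ValueError("replication must be >= 1")
    return nodes * disk_tb_per_node / replication


# Hypothetical cluster sizes: doubling the node count doubles usable capacity.
for nodes in (10, 20, 40):
    print(nodes, "nodes ->", usable_capacity_tb(nodes, disk_tb_per_node=12.0), "TB usable")
```

The estimate ignores space reserved for the operating system and intermediate data, so real clusters would come in somewhat lower.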

15.YARN: Data Operating System
[Diagram: a ResourceManager (Scheduler) coordinates twelve NodeManagers that host containers for three workloads. Batch: map 1.1, map 1.2, reduce 1.1. Interactive SQL: vertex 1.1.1, vertex 1.1.2, vertex 1.2.1, vertex 1.2.2. Real-Time: nimbus 0, nimbus 1, nimbus 2.]

16.Right Tool for the Right Usage
Systems along the SCALE (storage & processing) axis: Traditional Database, EDW, MPP Analytics, NoSQL, Hadoop Platform
Attribute: Traditional Database / Hadoop Platform
- schema: Required on write / Required on read
- speed: Reads are fast / Writes are fast
- governance: Standards and structured / Loosely structured
- processing: Limited, no data processing / Processing coupled with data
- data types: Structured / Multi and unstructured
- best fit use: Interactive OLAP Analytics, Complex ACID Transactions, Operational Data Store / Data Discovery, Processing unstructured data, Massive Storage/Processing

17.Maximize Hadoop Deployment Choice
- Hortonworks Data Platform (HDP) for Windows: 100% Apache open source Hadoop software for Windows Server
- Microsoft Azure HDInsight: Hadoop-based managed service in the cloud via Microsoft Azure
- Microsoft Analytics Platform System (APS): Scale-out appliance with data warehousing and Hadoop in one box
All offerings co-engineered by Hortonworks and Microsoft
Enjoy seamless interoperability across on-premises and cloud

18.HDP under the covers

19.Data Operating System of Hadoop
HDP 2.2: Core Platform
- Single Cluster, Shared Data Set, Multiple Workloads
- Support a range of access patterns
- Shared operational services
DATA ACCESS: Batch (Map Reduce); Script (Pig); SQL (Hive/Tez, HCatalog); NoSQL (HBase, Accumulo); Stream (Storm); Search (Solr); Others (In-Memory Analytics, ISV engines)
DATA MANAGEMENT: YARN: Data Operating System; HDFS (Hadoop Distributed File System)

20.Flexible Ingest into HDP
Ingest paths into Hortonworks Data Platform (HDP) for Windows: Sqoop; Flume; RPC; REST (HTTP); C (LibHDFS)
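
To illustrate the REST (HTTP) ingest path, here is a minimal sketch that builds WebHDFS request URLs. It assumes the standard `/webhdfs/v1` path prefix and `op=` query parameter of the WebHDFS REST API; the host name, port, and file path are hypothetical, and nothing is sent over the network.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host: str, port: int, path: str, op: str, **params: str) -> str:
    """Build a WebHDFS URL of the form http://host:port/webhdfs/v1/<path>?op=<OP>&..."""
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute")
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# Hypothetical NameNode address and file path.
print(webhdfs_url("namenode.example.com", 50070, "/data/raw/events.log", "OPEN"))
print(webhdfs_url("namenode.example.com", 50070, "/data/raw/new.log", "CREATE", overwrite="true"))
```

An HTTP client would then issue a GET for `OPEN` or a PUT for `CREATE` against these URLs.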

21.SQL Access: Stinger Initiative
Stinger Initiative: Next generation SQL based interactive query in Hadoop
- Speed: Interactive Hive query response
- Scale: Queries that scale from TB to PB
- SQL: Broadest range of SQL semantics for analytic applications
Stack: Business Analytics, Custom Apps; Apache Hive (SQL); Apache Tez, Apache MapReduce; Apache YARN; HDFS (Hadoop Distributed File System)
Apache Hive Contribution… an Open Community at its finest: 1,672 Jira Tickets Closed; 145 Developers; 44 Companies; ~390,000 Lines Of Code Added… (2x); 13 Months

22.Apache Tez (“Speed”)
- Replaces MapReduce as primitive for Hive, Pig, etc.
- Task with pluggable Input, Processor and Output
- Tez Task - <Input, Processor, Output>
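
The <Input, Processor, Output> idea can be sketched in a few lines. This is an illustration of the pluggable-task pattern only, not the Tez API: the `Task` class, `count_words`, and the in-memory input and output are all hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List


class Task:
    """Hypothetical stand-in for a Tez-style task: <Input, Processor, Output>."""

    def __init__(
        self,
        read_input: Callable[[], Iterable[str]],
        processor: Callable[[Iterable[str]], Dict[str, int]],
        write_output: Callable[[Dict[str, int]], None],
    ) -> None:
        self.read_input = read_input
        self.processor = processor
        self.write_output = write_output

    def run(self) -> None:
        # Each stage is pluggable: swap any one without touching the others.
        self.write_output(self.processor(self.read_input()))


def count_words(lines: Iterable[str]) -> Dict[str, int]:
    """Example processor: word counts over the input lines."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.split())
    return dict(counts)


sink: List[Dict[str, int]] = []  # in-memory output, standing in for a real sink
task = Task(lambda: ["a b", "b c"], count_words, sink.append)
task.run()
print(sink[0])
```

Because input, processor, and output are passed in separately, the same processor could read from a file or a previous task's output without modification.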

23.Hive with Tez as execution engine
Query: SELECT a.state, COUNT(*), AVERAGE(c.price) FROM a JOIN b ON (a.id = b.id) JOIN c ON (a.itemId = c.itemId) GROUP BY a.state
[Diagram: two plans for the same query. Hive – MR runs the operators (SELECT a.state; SELECT c.price; JOIN(a, c); SELECT b.id; JOIN(a, b); GROUP BY a.state, COUNT(*), AVERAGE(c.price)) as a chain of map (M) and reduce (R) jobs with HDFS between them. Hive – Tez runs them (SELECT a.state, c.itemId; JOIN(a, c); SELECT b.id; JOIN(a, b); GROUP BY a.state, COUNT(*), AVERAGE(c.price)) as a single graph of M and R stages.]
Tez avoids unneeded writes to HDFS
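
To make the query on this slide easier to follow, here is a minimal in-memory sketch of what it computes: two inner joins followed by a per-state count and average price. The sample rows are hypothetical, and the code illustrates the query's semantics only, not how Hive or Tez executes it.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

# Hypothetical sample rows for tables a, b and c.
a = [
    {"id": 1, "itemId": 10, "state": "CA"},
    {"id": 2, "itemId": 11, "state": "CA"},
    {"id": 3, "itemId": 10, "state": "NY"},
    {"id": 4, "itemId": 12, "state": "TX"},
]
b = [{"id": 1}, {"id": 2}, {"id": 3}]
c = [{"itemId": 10, "price": 4.0}, {"itemId": 11, "price": 6.0}]


def state_summary(a: List[dict], b: List[dict], c: List[dict]) -> Dict[str, Tuple[int, float]]:
    """Return {state: (row count, average price)} for a JOIN b JOIN c GROUP BY a.state."""
    b_matches = Counter(row["id"] for row in b)          # JOIN b ON a.id = b.id
    prices_by_item = defaultdict(list)                   # JOIN c ON a.itemId = c.itemId
    for row in c:
        prices_by_item[row["itemId"]].append(row["price"])

    grouped: Dict[str, List[float]] = defaultdict(list)  # GROUP BY a.state
    for row in a:
        for _ in range(b_matches[row["id"]]):
            for price in prices_by_item.get(row["itemId"], []):
                grouped[row["state"]].append(price)
    return {state: (len(p), sum(p) / len(p)) for state, p in grouped.items()}


print(state_summary(a, b, c))
```

In a MapReduce plan each join and the aggregation would typically be separate jobs whose intermediate results are written to HDFS; the slide's point is that Tez can run them as one graph.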

24.Hive: Enhanced SQL Semantics
SQL Compliance: Hive provides a wide array of SQL datatypes and semantics so your existing tools integrate more seamlessly with Hadoop
Hive SQL Datatypes: INT; TINYINT/SMALLINT/BIGINT; BOOLEAN; FLOAT; DOUBLE; STRING; TIMESTAMP; BINARY; DECIMAL; ARRAY, MAP, STRUCT, UNION; DATE; VARCHAR; CHAR
Hive SQL Semantics: SELECT, INSERT; GROUP BY, ORDER BY, SORT BY; JOIN on explicit join key; Inner, outer, cross and semi joins; Sub-queries in FROM clause; ROLLUP and CUBE; UNION; Windowing Functions (OVER, RANK, etc.); Custom Java UDFs; Standard Aggregation (SUM, AVG, etc.); Advanced UDFs (ngram, Xpath, URL); Sub-queries for IN/NOT IN, HAVING; Expanded JOIN Syntax; INTERSECT / EXCEPT
Legend: Hive 0.11; Hive 0.12 (HDP 2.0); Hive 0.13 (HDP 2.1)

25.Stream Processing
Apache Storm
- Real-time event processing for sensor and business activity monitoring
- Scale: Ingest millions of events per second. Fast query on petabytes of data
- Implement new real time business cases with your Hadoop platform
http://storm.incubator.apache.org/

26.NoSQL Database
NoSQL: HBase
- Store and Process Petabytes of Data
- Scale out on Commodity Servers
- High Performance
- Highly Available
- Integrated with YARN
- SQL Interface
[Diagram: HBase on YARN: Data Operating System, over HDFS (Permanent Data Storage).]

27.HDP Search
Apache Solr: High performance indexing and simple UI for advanced search applications
[Diagram: Raw Files (HTML, PDF, Word, XML, Logs, …) in HDFS (Hadoop Distributed File System) pass through a MapReduce Indexing Job to become Indexed Documents in Solr (Lucene); a Search Web App sends a Query and receives a Response.]

28.All Processing on Shared Infrastructure
Engines: Script (Pig); SQL (Hive); Java, Scala (Cascading); Stream (Storm); Search (Solr); NoSQL (HBase, Accumulo); In-Memory (Spark); Kafka; Others (ISV Engines)
Also shown: Tez, Slider
YARN: Data Operating System (Cluster Resource Management)
HDFS (Hadoop Distributed File System)

29.YARN: Next Generation Hadoop
1st Gen of Hadoop: Single Use System (Batch Apps)
- MapReduce (cluster resource management & data processing)
- HDFS (redundant, reliable storage)
2nd Gen of Hadoop: Multi Use Data Platform (Batch, Interactive, Online, Streaming, …)
- Flexible Data Processing: Classic Hadoop Apps; Hive, Pig, others…; Batch (MapReduce); Batch & Interactive (Tez); Online Data Processing (HBase, Accumulo); Stream Processing (Storm); others…
- Efficient Cluster Resource Management & Shared Services (YARN)
- Redundant, Reliable Storage (HDFS)

30.Data Governance & Integration
Apache Falcon: Simplified Data Governance for Enterprise Hadoop
Provides key governance framework for:
- Acquisition & processing of data sets
- Replication & Retention of datasets
- Redirect datasets to non-Hadoop extensions
Provides audit trail & lineage

31.Apache Falcon
- Define sophisticated Workflows and DLM Policies
- Enable audit, compliance, and data re-processing
Retention by stage: Staged Data (Retain 5 Years); Cleansed Data (Retain 3 Years); Conformed Data (Retain 3 Years); Presented Data (Retain Last Copy Only)
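
The retention periods on this slide can be read as a simple policy table. The sketch below is plain Python, not Falcon's entity definition format: the function name and sample dates are hypothetical, and a year is approximated as 365 days.

```python
from datetime import date, timedelta
from typing import List

# Retention periods from the slide, in years, per pipeline stage.
RETENTION_YEARS = {"staged": 5, "cleansed": 3, "conformed": 3}


def expired_copies(stage: str, created_dates: List[date], today: date) -> List[date]:
    """Return the creation dates of copies that the stage's retention rule would drop."""
    if stage == "presented":
        # "Retain Last Copy Only": everything except the newest copy expires.
        return sorted(created_dates)[:-1]
    limit = timedelta(days=365 * RETENTION_YEARS[stage])  # simplification: 365-day years
    return [d for d in created_dates if today - d > limit]


today = date(2014, 10, 28)  # hypothetical evaluation date
print(expired_copies("staged", [date(2008, 1, 1), date(2012, 1, 1)], today))
print(expired_copies("presented", [date(2014, 1, 1), date(2014, 6, 1)], today))
```

A real deployment would express the same limits declaratively in the data lifecycle policy instead of in application code.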

32.Apache Falcon Disaster Recovery and Backup between environments Publishing data between environments for Discovery Site to Site Site to Cloud

33.Extend with the Cloud
Hybrid = On-premises + Cloud
Constraints of on-premises:
- Scale constrained to on-premise procurement
- Capex up front costs
- Expertise for tuning and deployment
Benefits of Cloud:
- Unlimited elastic scale
- Auto geo redundancy
- No hardware costs
- Pay only for what you need
Options: Cloud (Cloud Hadoop: HDInsight); On-premises (Hadoop Software; Appliances: APS)

34.Central Security Administration HDP Advanced Security Single Pane of Glass Centralizes administration of security policy across entire HDP Project: Apache Ranger

35.Perimeter Security
Apache Knox
- A common place to perform authentication across Hadoop and all related projects
- Integrated with LDAP and AD
- Secure interfaces for: WebHDFS, WebHCAT, Oozie, Hive & HBase
- Broad community effort, incubated with Microsoft, broad set of developers involved

36.Apache Knox: Perimeter Security
[Diagram: REST Client, JDBC Client and Browser connect through a Firewall to the Knox Gateway (GW) in the DMZ, which uses Enterprise Identity Providers (LDAP/AD); behind a second Firewall sit HDP Cluster 1 and HDP Hadoop Cluster 2, each with Masters (JT, NN), Web HCat, Oozie, YARN, HBase, Hive, and DN / TT nodes.]
- A stateless reverse proxy instance deployed in DMZ
- Requests streamed through GW to Hadoop services after auth.
- URLs rewritten to refer to gateway
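
The "URLs rewritten to refer to gateway" point can be illustrated with a small sketch. It assumes the gateway exposes services under a `/gateway/<cluster>/` path prefix; the host names, port, and cluster name are hypothetical, and the function is an illustration of the idea, not Knox's rewrite engine.

```python
from urllib.parse import urlsplit


def to_gateway_url(internal_url: str, gateway_base: str, cluster: str) -> str:
    """Rewrite an internal service URL so that it points at the gateway instead."""
    parts = urlsplit(internal_url)
    url = f"{gateway_base.rstrip('/')}/gateway/{cluster}{parts.path}"
    if parts.query:
        url += "?" + parts.query
    return url


# Hypothetical internal NameNode URL and gateway address.
print(to_gateway_url(
    "http://nn1.internal:50070/webhdfs/v1/tmp?op=LISTSTATUS",
    "https://knox.example.com:8443",
    "cluster1",
))
```

Clients only ever see the gateway address, so internal host names and ports stay hidden behind the DMZ.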

37.Operating Enterprise Hadoop
Ambari: Deploy, Manage, Monitor
[Diagram: AMBARI WEB and REST APIs in front of the AMBARI SERVER (PROVISION | MANAGE | MONITOR), which provisions, manages and monitors the cluster's compute & storage nodes.]
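
As an illustration of the REST APIs box on this slide, the sketch below prepares (but does not send) a request to list clusters. It assumes an `/api/v1/clusters` endpoint, HTTP basic authentication, and port 8080; the server name and credentials are hypothetical placeholders.

```python
import base64
from urllib.request import Request


def ambari_request(server: str, path: str, user: str, password: str) -> Request:
    """Prepare a GET request against the Ambari REST API without sending it."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return Request(
        f"http://{server}:8080/api/v1/{path.lstrip('/')}",
        headers={
            "Authorization": f"Basic {token}",
            "X-Requested-By": "ambari",  # header Ambari expects on modifying calls
        },
    )


# Hypothetical server and credentials; pass `req` to urllib.request.urlopen to send it.
req = ambari_request("ambari.example.com", "clusters", "admin", "admin")
print(req.get_method(), req.full_url)
```

The same helper could target other resources under `/api/v1/`, such as a specific cluster's hosts or services.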

38.Ambari: Deploy on Windows

39.Ambari: Deploy on Windows

40.Ambari: Manage on Windows

41.Ambari: Monitor on Windows

42.Ambari SCOM
- Enables Microsoft System Center Operations Manager (SCOM) to monitor Hadoop
- Ambari SCOM Management Pack gives insight into the performance and health of Hadoop
- Ambari SCOM leverages the Ambari framework to aggregate and expose Hadoop metrics
[Diagram: HADOOP (Storage & Process at Scale); Ambari SCOM Server aggregates + exposes Hadoop metrics; Ambari SCOM Mgmt Pack; Ambari SCOM monitors health + alerts in case of problems.]

43.For More Information
Web: hortonworks.com/products/hdp-windows/ ; hortonworks.com/labs/microsoft/ ; microsoft.com/bigdata
Training: hortonworks.com/hadoop-training/hadoop-on-windows/
Online documentation: docs.hortonworks.com
Forums: hortonworks.com/community/forums/

44.Questions?

45.DBI Track resources
- 27 Hands on Labs + 8 Instructor Led Labs in Hall 7
- Free SQL Server 2014 Technical Overview e-book: microsoft.com/sqlserver and Amazon Kindle Store
- Free online training at Microsoft Virtual Academy: microsoftvirtualacademy.com
- Try new Azure data services previews! Azure Machine Learning, DocumentDB, and Stream Analytics

46.Resources
- Learning: Microsoft Certification & Training Resources, www.microsoft.com/learning
- Developer Network: http://developer.microsoft.com
- TechNet: Resources for IT Professionals, http://microsoft.com/technet
- Sessions on Demand: http://channel9.msdn.com/Events/TechEd

47.Submit your TechEd evaluations
- TechEd Mobile app for session evaluations is currently offline
- Fill out an evaluation via CommNet Station/PC: Schedule Builder
- LogIn: europe.msteched.com/catalog
We value your feedback!

48.© 2014 Microsoft Corporation. All rights reserved. Microsoft, Windows, and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.