- 微博 QQ QQ空间 贴吧
Scaling 30 TB’s of Data Lake with Apache HBase at Production
1 . hosted by Scaling 30 TB’s of Data Lake with Apache HBase and Scala DSL at Production Chetan Khatri
2 . hosted by Who Am I Lead - Data Science, Technology Evangelist @ Accion labs India Pvt. Ltd. Contributor @ Apache Spark, Apache HBase, Elixir Lang, Spark HBase Connectors. Co-Authored University Curriculum @ University of Kachchh, India. Data Engineering @: Nazara Games, Eccella Corporation. Advisor - Data Science Lab, University of Kachchh, India. M.Sc. - Computer Science from University of Kachchh, India.
3 . hosted by Agenda 01 What is Apache HBase 02 Why Apache HBase 03 Apache Spark and Scala 04 Apache Spark HBase Connector 05 Case Study: Retail Analytics Architecturing Fast Data Processing Platform to Scale 30 TB Data in Production
4 . hosted by What is Apache HBase ● Column-oriented NoSQL ● Non-relational ● Distributed database build on top of HDFS. ● Modeled after Google’s BigTable. Add the title ● Built for fault-tolerant application with billions/ Modules trillions of• rows do not limit and millions of the size, columns. ● Very low latency number, andinterval, can be adjusted near real-time random reads and random according to need writes. Source: https://hbase.apache.org/ ● Replication, end-to-end checksums, Modules do with automatic• rebalancing not limit the size, HDFS. ● Compressionnumber, interval, can be adjusted ● Bloom filters according to need ● MapReduce over HBase data. ● Best at fetching • Modules rowsdobynotkey, limitscanning the size, ranges of rows number, interval, can with ordered be adjusted partitioning. according to need
5 . hosted by What is Apache Spark ? Source: https://spark.apache.org/ Structured Data / SQL - Graph Processing - Spark SQL GraphX Machine Learning - Streaming - Spark Streaming, Structured Streaming MLlib
6 . hosted by What is Scala ● Scala is a modern multi-paradigm programming language designed to express common programming patterns in a concise, elegant, and type-safe way. ● Scala is object-oriented ● Scala is functional ● Strongly typed, Type Inference Source: www.scala-lang.org ● Higher Order Functions ● Lazy Computation
7 . hosted by Case Story: Retail Analytics Architecting Fast Data Processing Platform to Scale 30 TB of Data in Production Use cases in Retail Analytics: Business: explain the who, what, when, where, why and how they are doing Retailing. ● What is selling as compared to what was being ordered. ● Effective promotions - right promotions at right outlet and right time. ● What types of Cigarette consumers are shopping in your outlets ? ○ Gives smoking patterns in specific geography, predict demand on supply. ● What are the purchasing patterns of your consumers ? ○ are they purchasing Pizza and Ice cream together ? ○ are they purchasing multiple Instant food products with soda together ? ● Time Series problem - year, month, day of year, week of year to Identify which brands are not getting sold at specific geography, so it can be swap to other store.
8 . hosted by Case Story: Retail Analytics - Scale Challenges ² Weekly Data refresh, Daily Data refresh batch Spark / Hadoop job execution failures with unutilized Spark Cluster. ² Scalability of massive data: ○ ~4.6 Billion events on every weekly data refresh. ○ Processing historical data: one time bulk load ~30 TB of data / ~17 billion transactional records. ² Linear / sequential execution mode with broken data pipelines. ² Joining ~17 billion transactional records with skewed data nature. ² Data deduplication - outlet, item at retail.
9 . hosted by Using HBase as a MDM System MDM - Master Data Management 1. HBase Driven Data Deduplication Algorithms Example, ² Outlet Matching ² Item Matching ² Address Matching ² Brand Matching 2. Abbreviation Standardization Example, UOM Quantity ² UOM Standardization PACK 2 ² Outlet Name, Address Standardization ² UPC Standardization 2PACK NA 2PK
10 . hosted by Using HBase as a MDM System Data Deduplication Problem Retail Analytics ! Examples, ● You may find Item with same UPC code. ● You may find Outlet with same Outlet number. ● What if UPC Code gets upgraded from 10 Digits to 14 Digits. (Update everywhere, needs faster update.)
11 . hosted by HBase - NoSQL, Denormalized columnar schema model For example, Table: transactional_line_item Column Family: f Column Family: o Column Family: t created_datetime Outlet_id Transaction_id file_id Outlet_name Item_id transaction_id Outlet_state Promo_code manufacturer_operator_submitter_id Outlet_city Gross_price Outlet_address1 Discount Outlet_address2 Quantity Outlet_region Upc Outlet_country Uom Outlet_owner_name State_gst Outlet_status Country_gst Outlet_zipcode Delivery_charges Outlet_started_date Packing_charges outlet_sub_chain_name Vendor_discount Partner_discount Corporate_discount reward_point_discount
12 . hosted by Case Story: Retail Analytics - Scale - 5x performance improvements by re-engineering entire data lake to analytical engine pipeline’s. - Proposed highly concurrent, elastic, non-blocking, asynchronous architecture to save customer’s ~22 hours runtime (~8 hours from 30 hours) for 4.6 Billion events. - 10x performance improvements on historical load by under the hood algorithms optimization on 17 Billion events (Benchmarks ~1 hour execution time only) - Master Data Management (MDM) - Deduplication and fuzzy logic matching on retails data(Item, Outlet) improved elastic performance. - Using HBase as a Master Data Management (MDM) System.
13 . hosted by Case Story: Retail Analytics How ?
14 . hosted by Data Platform - Infrastructure Architecture Staging Aggregation HBase Hive Hive Layer Read Rec by Rec Elastic and do match Search Oracle NodeJS Akka-HTTP Spark  maprcli Index Kafka HBase Streaming and setup replication Real time query POS Files Hive / Spark / HDFS Presto  NodeJS HBase https://www.npmjs.com/package/  Asynchronous HBase: node-thrift2-hbase https://github.com/OpenTSDB/asynchbase https://www.npmjs.com/package/  Index MapR-DB Data into Elasticsearch async https://community.mapr.com/community/exchange/blog/ Spark TensorFlow https://www.npmjs.com/package/ 2016/12/12/how-to-index-mapr-db-data-into-elasticsearch-on-aws MLLib thrift
15 . hosted by AsyncHBase Build.sbt with Akka HTTP
16 . hosted by Data Processing Infrastructure ● MapR Distribution - http://archive.mapr.com/releases/ ecosystem-5.x/redhat/ ● Data Lake - Apache HBase 0.98.12 ● EDW / Analytical Data store - Apache Hive 1.2.1 ● Unified Execution Engine - Apache Spark 2.0.1 ● Distributed File storage - MapR-FS ● Queueing mechanism - Apache Kafka ● Streaming - HTTP Akka + scala 2.11 , Spark Streaming ● Reporting Database - PostgreSQL 9.x ● Legacy Database - Oracle 9 ● Workflow Management tool - BMC Control M
17 . hosted by Rethink - Fast Data Architectures. Unify, Simplify. UNIFIED fast data processing engine that provides: The The The SCALE RELIABILITY & PERFORMANCE LOW LATENCY of data lake of data warehouse of streaming
18 . hosted by Spark HBase Connector. https://github.com/nerdammer/spark-hbase- connector Credit: Contributors
19 . hosted by Spark HBase Connector. It’s a Spark package connector written on top of Java HBase API. A simple and elegant way to write Spark - HBase Jobs. Powerful Functional Scala DSL integrated for Apache Spark. Supports: ● Scala > 2.10 ● Spark > 1.6 ● HBase > 1.0
20 . hosted by Spark HBase Connector. Dependency in build.sbt - libraryDependencies += "it.nerdammer.bigdata" % "spark-hbase-connector_2.10" % "1.0.3" Setting the HBase Host
21 . hosted by Writing to HBase
22 . hosted by Reading from HBase
23 . hosted by Filtering
24 . hosted by Manage Empty Columns with Option[T]
25 . hosted by Custom Mapping with Case Classes
26 . hosted by Custom Mapping with Case Classes ...
27 . hosted by Implicit Reader
28 . hosted by Implicit Reader ... Do not forget to override the columns method.
29 . hosted by HBase Read table Data in Spark DataFrame