Scaling 30 TB’s of Data Lake with Apache HBase at Production

下载 9

中国HBase技术社区

发布于

2232

人观看

#信息技术

用Scala DSL+HBase的方案，解决生产环境30TB规模数据湖的管理，来自印度的工程师介绍了如何利用Spark SQL +Hbase解决零售行业数据分析。

展开查看详情

1 . hosted by Scaling 30 TB’s of Data Lake with Apache HBase and Scala DSL at Production Chetan Khatri

2 . hosted by Who Am I Lead - Data Science, Technology Evangelist @ Accion labs India Pvt. Ltd. Contributor @ Apache Spark, Apache HBase, Elixir Lang, Spark HBase Connectors. Co-Authored University Curriculum @ University of Kachchh, India. Data Engineering @: Nazara Games, Eccella Corporation. Advisor - Data Science Lab, University of Kachchh, India. M.Sc. - Computer Science from University of Kachchh, India.

3 . hosted by Agenda 01 What is Apache HBase 02 Why Apache HBase 03 Apache Spark and Scala 04 Apache Spark HBase Connector 05 Case Study: Retail Analytics Architecturing Fast Data Processing Platform to Scale 30 TB Data in Production

4 . hosted by What is Apache HBase ●  Column-oriented NoSQL ●  Non-relational ●  Distributed database build on top of HDFS. ●  Modeled after Google’s BigTable. Add the title ●  Built for fault-tolerant application with billions/ Modules trillions of•  rows do not limit and millions of the size, columns. ●  Very low latency number, andinterval, can be adjusted near real-time random reads and random according to need writes. Source: https://hbase.apache.org/ ●  Replication, end-to-end checksums, Modules do with automatic•  rebalancing not limit the size, HDFS. ●  Compressionnumber, interval, can be adjusted ●  Bloom filters according to need ●  MapReduce over HBase data. ●  Best at fetching •  Modules rowsdobynotkey, limitscanning the size, ranges of rows number, interval, can with ordered be adjusted partitioning. according to need

5 . hosted by What is Apache Spark ? Source: https://spark.apache.org/ Structured Data / SQL - Graph Processing - Spark SQL GraphX Machine Learning - Streaming - Spark Streaming, Structured Streaming MLlib

6 . hosted by What is Scala ●  Scala is a modern multi-paradigm programming language designed to express common programming patterns in a concise, elegant, and type-safe way. ●  Scala is object-oriented ●  Scala is functional ●  Strongly typed, Type Inference Source: www.scala-lang.org ●  Higher Order Functions ●  Lazy Computation

7 . hosted by Case Story: Retail Analytics Architecting Fast Data Processing Platform to Scale 30 TB of Data in Production Use cases in Retail Analytics: Business: explain the who, what, when, where, why and how they are doing Retailing. ●  What is selling as compared to what was being ordered. ●  Effective promotions - right promotions at right outlet and right time. ●  What types of Cigarette consumers are shopping in your outlets ? ○  Gives smoking patterns in specific geography, predict demand on supply. ●  What are the purchasing patterns of your consumers ? ○  are they purchasing Pizza and Ice cream together ? ○  are they purchasing multiple Instant food products with soda together ? ●  Time Series problem - year, month, day of year, week of year to Identify which brands are not getting sold at specific geography, so it can be swap to other store.

8 . hosted by Case Story: Retail Analytics - Scale Challenges ²  Weekly Data refresh, Daily Data refresh batch Spark / Hadoop job execution failures with unutilized Spark Cluster. ²  Scalability of massive data: ○  ~4.6 Billion events on every weekly data refresh. ○  Processing historical data: one time bulk load ~30 TB of data / ~17 billion transactional records. ²  Linear / sequential execution mode with broken data pipelines. ²  Joining ~17 billion transactional records with skewed data nature. ²  Data deduplication - outlet, item at retail.

9 . hosted by Using HBase as a MDM System MDM - Master Data Management 1. HBase Driven Data Deduplication Algorithms Example, ²  Outlet Matching ²  Item Matching ²  Address Matching ²  Brand Matching 2. Abbreviation Standardization Example, UOM Quantity ²  UOM Standardization PACK 2 ²  Outlet Name, Address Standardization ²  UPC Standardization 2PACK NA 2PK

10 . hosted by Using HBase as a MDM System Data Deduplication Problem Retail Analytics ! Examples, ●  You may find Item with same UPC code. ●  You may find Outlet with same Outlet number. ●  What if UPC Code gets upgraded from 10 Digits to 14 Digits. (Update everywhere, needs faster update.)

11 . HBase - NoSQL, Denormalized columnar schema model For example, Table: transactio Column Family: f Column Family: o created_datetime Outlet_id file_id Outlet_name transaction_id Outlet_state manufacturer_operator_submitter_id Outlet_city Outlet_address1 Outlet_address2 Outlet_region Outlet_country Outlet_owner_name Outlet_status Outlet_zipcode Outlet_started_date outlet_sub_chain_name hosted by nal_line_item Column Family: t Transaction_id Item_id Promo_code Gross_price Discount Quantity Upc Uom State_gst Country_gst Delivery_charges Packing_charges Vendor_discount Partner_discount Corporate_discount reward_point_discount

12 . hosted by Case Story: Retail Analytics - Scale -  5x performance improvements by re-engineering entire data lake to analytical engine pipeline’s. -  Proposed highly concurrent, elastic, non-blocking, asynchronous architecture to save customer’s ~22 hours runtime (~8 hours from 30 hours) for 4.6 Billion events. -  10x performance improvements on historical load by under the hood algorithms optimization on 17 Billion events (Benchmarks ~1 hour execution time only) -  Master Data Management (MDM) - Deduplication and fuzzy logic matching on retails data(Item, Outlet) improved elastic performance. -  Using HBase as a Master Data Management (MDM) System.

13 . hosted by Case Story: Retail Analytics How ?

14 . hosted by Data Platform - Infrastructure Architecture Staging Aggregation HBase Hive Hive Layer Read Rec by Rec Elastic and do match Search Oracle NodeJS Akka-HTTP Spark [2] maprcli Index Kafka HBase Streaming and setup replication Real time query POS Files Hive / Spark / HDFS Presto [3] NodeJS HBase https://www.npmjs.com/package/ [1] Asynchronous HBase: node-thrift2-hbase https://github.com/OpenTSDB/asynchbase https://www.npmjs.com/package/ [2] Index MapR-DB Data into Elasticsearch async https://community.mapr.com/community/exchange/blog/ Spark TensorFlow https://www.npmjs.com/package/ 2016/12/12/how-to-index-mapr-db-data-into-elasticsearch-on-aws MLLib thrift

15 . hosted by AsyncHBase Build.sbt with Akka HTTP

16 . hosted by Data Processing Infrastructure ●  MapR Distribution - http://archive.mapr.com/releases/ ecosystem-5.x/redhat/ ●  Data Lake - Apache HBase 0.98.12 ●  EDW / Analytical Data store - Apache Hive 1.2.1 ●  Unified Execution Engine - Apache Spark 2.0.1 ●  Distributed File storage - MapR-FS ●  Queueing mechanism - Apache Kafka ●  Streaming - HTTP Akka + scala 2.11 , Spark Streaming ●  Reporting Database - PostgreSQL 9.x ●  Legacy Database - Oracle 9 ●  Workflow Management tool - BMC Control M

17 . hosted by Rethink - Fast Data Architectures. Unify, Simplify. UNIFIED fast data processing engine that provides: The The The SCALE RELIABILITY & PERFORMANCE LOW LATENCY of data lake of data warehouse of streaming

18 . hosted by Spark HBase Connector. https://github.com/nerdammer/spark-hbase- connector Credit: Contributors

19 . hosted by Spark HBase Connector. It’s a Spark package connector written on top of Java HBase API. A simple and elegant way to write Spark - HBase Jobs. Powerful Functional Scala DSL integrated for Apache Spark. Supports: ●  Scala > 2.10 ●  Spark > 1.6 ●  HBase > 1.0

20 . hosted by Spark HBase Connector. Dependency in build.sbt - libraryDependencies += "it.nerdammer.bigdata" % "spark-hbase-connector_2.10" % "1.0.3" Setting the HBase Host

21 . hosted by Writing to HBase

22 . hosted by Reading from HBase

23 . hosted by Filtering

24 . hosted by Manage Empty Columns with Option[T]

25 . hosted by Custom Mapping with Case Classes

26 . hosted by Custom Mapping with Case Classes ...

27 . hosted by Implicit Reader

28 . hosted by Implicit Reader ... Do not forget to override the columns method.

29 . hosted by HBase Read table Data in Spark DataFrame

10点赞

1收藏

9下载