用Scala DSL+HBase的方案，解决生产环境30TB规模数据湖的管理，来自印度的工程师介绍了如何利用Spark SQL
1. hosted by Scaling 30 TB’s of Data Lake with Apache HBase and Scala DSL at Production Chetan Khatri
2. hosted by Who Am I Lead - Data Science, Technology Evangelist @ Accion labs India Pvt. Ltd. Contributor @ Apache Spark, Apache HBase, Elixir Lang, Spark HBase Connectors. Co-Authored University Curriculum @ University of Kachchh, India. Data Engineering @: Nazara Games, Eccella Corporation. Advisor - Data Science Lab, University of Kachchh, India. M.Sc. - Computer Science from University of Kachchh, India.
3. hosted by Agenda 01 What is Apache HBase 02 Why Apache HBase 03 Apache Spark and Scala 04 Apache Spark HBase Connector 05 Case Study: Retail Analytics Architecturing Fast Data Processing Platform to Scale 30 TB Data in Production
4. hosted by What is Apache HBase ● Column-oriented NoSQL ● Non-relational ● Distributed database build on top of HDFS. ● Modeled after Google’s BigTable. Add the title ● Built for fault-tolerant application with billions/ Modules trillions of• rows do not limit and millions of the size, columns. ● Very low latency number, andinterval, can be adjusted near real-time random reads and random according to need writes. Source: https://hbase.apache.org/ ● Replication, end-to-end checksums, Modules do with automatic• rebalancing not limit the size, HDFS. ● Compressionnumber, interval, can be adjusted ● Bloom filters according to need ● MapReduce over HBase data. ● Best at fetching • Modules rowsdobynotkey, limitscanning the size, ranges of rows number, interval, can with ordered be adjusted partitioning. according to need
5. hosted by What is Apache Spark ? Source: https://spark.apache.org/ Structured Data / SQL - Graph Processing - Spark SQL GraphX Machine Learning - Streaming - Spark Streaming, Structured Streaming MLlib
6. hosted by What is Scala ● Scala is a modern multi-paradigm programming language designed to express common programming patterns in a concise, elegant, and type-safe way. ● Scala is object-oriented ● Scala is functional ● Strongly typed, Type Inference Source: www.scala-lang.org ● Higher Order Functions ● Lazy Computation
7. hosted by Case Story: Retail Analytics Architecting Fast Data Processing Platform to Scale 30 TB of Data in Production Use cases in Retail Analytics: Business: explain the who, what, when, where, why and how they are doing Retailing. ● What is selling as compared to what was being ordered. ● Effective promotions - right promotions at right outlet and right time. ● What types of Cigarette consumers are shopping in your outlets ? ○ Gives smoking patterns in specific geography, predict demand on supply. ● What are the purchasing patterns of your consumers ? ○ are they purchasing Pizza and Ice cream together ? ○ are they purchasing multiple Instant food products with soda together ? ● Time Series problem - year, month, day of year, week of year to Identify which brands are not getting sold at specific geography, so it can be swap to other store.
8. hosted by Case Story: Retail Analytics - Scale Challenges ² Weekly Data refresh, Daily Data refresh batch Spark / Hadoop job execution failures with unutilized Spark Cluster. ² Scalability of massive data: ○ ~4.6 Billion events on every weekly data refresh. ○ Processing historical data: one time bulk load ~30 TB of data / ~17 billion transactional records. ² Linear / sequential execution mode with broken data pipelines. ² Joining ~17 billion transactional records with skewed data nature. ² Data deduplication - outlet, item at retail.
9. hosted by Using HBase as a MDM System MDM - Master Data Management 1. HBase Driven Data Deduplication Algorithms Example, ² Outlet Matching ² Item Matching ² Address Matching ² Brand Matching 2. Abbreviation Standardization Example, UOM Quantity ² UOM Standardization PACK 2 ² Outlet Name, Address Standardization ² UPC Standardization 2PACK NA 2PK
10. hosted by Using HBase as a MDM System Data Deduplication Problem Retail Analytics ! Examples, ● You may find Item with same UPC code. ● You may find Outlet with same Outlet number. ● What if UPC Code gets upgraded from 10 Digits to 14 Digits. (Update everywhere, needs faster update.)
11. hosted by HBase - NoSQL, Denormalized columnar schema model For example, Table: transactional_line_item Column Family: f Column Family: o Column Family: t created_datetime Outlet_id Transaction_id file_id Outlet_name Item_id transaction_id Outlet_state Promo_code manufacturer_operator_submitter_id Outlet_city Gross_price Outlet_address1 Discount Outlet_address2 Quantity Outlet_region Upc Outlet_country Uom Outlet_owner_name State_gst Outlet_status Country_gst Outlet_zipcode Delivery_charges Outlet_started_date Packing_charges outlet_sub_chain_name Vendor_discount Partner_discount Corporate_discount reward_point_discount
12. hosted by Case Story: Retail Analytics - Scale - 5x performance improvements by re-engineering entire data lake to analytical engine pipeline’s. - Proposed highly concurrent, elastic, non-blocking, asynchronous architecture to save customer’s ~22 hours runtime (~8 hours from 30 hours) for 4.6 Billion events. - 10x performance improvements on historical load by under the hood algorithms optimization on 17 Billion events (Benchmarks ~1 hour execution time only) - Master Data Management (MDM) - Deduplication and fuzzy logic matching on retails data(Item, Outlet) improved elastic performance. - Using HBase as a Master Data Management (MDM) System.
13. hosted by Case Story: Retail Analytics How ?
14. hosted by Data Platform - Infrastructure Architecture Staging Aggregation HBase Hive Hive Layer Read Rec by Rec Elastic and do match Search Oracle NodeJS Akka-HTTP Spark  maprcli Index Kafka HBase Streaming and setup replication Real time query POS Files Hive / Spark / HDFS Presto  NodeJS HBase https://www.npmjs.com/package/  Asynchronous HBase: node-thrift2-hbase https://github.com/OpenTSDB/asynchbase https://www.npmjs.com/package/  Index MapR-DB Data into Elasticsearch async https://community.mapr.com/community/exchange/blog/ Spark TensorFlow https://www.npmjs.com/package/ 2016/12/12/how-to-index-mapr-db-data-into-elasticsearch-on-aws MLLib thrift
15. hosted by AsyncHBase Build.sbt with Akka HTTP
16. hosted by Data Processing Infrastructure ● MapR Distribution - http://archive.mapr.com/releases/ ecosystem-5.x/redhat/ ● Data Lake - Apache HBase 0.98.12 ● EDW / Analytical Data store - Apache Hive 1.2.1 ● Unified Execution Engine - Apache Spark 2.0.1 ● Distributed File storage - MapR-FS ● Queueing mechanism - Apache Kafka ● Streaming - HTTP Akka + scala 2.11 , Spark Streaming ● Reporting Database - PostgreSQL 9.x ● Legacy Database - Oracle 9 ● Workflow Management tool - BMC Control M
17. hosted by Rethink - Fast Data Architectures. Unify, Simplify. UNIFIED fast data processing engine that provides: The The The SCALE RELIABILITY & PERFORMANCE LOW LATENCY of data lake of data warehouse of streaming
18. hosted by Spark HBase Connector. https://github.com/nerdammer/spark-hbase- connector Credit: Contributors
19. hosted by Spark HBase Connector. It’s a Spark package connector written on top of Java HBase API. A simple and elegant way to write Spark - HBase Jobs. Powerful Functional Scala DSL integrated for Apache Spark. Supports: ● Scala > 2.10 ● Spark > 1.6 ● HBase > 1.0
20. hosted by Spark HBase Connector. Dependency in build.sbt - libraryDependencies += "it.nerdammer.bigdata" % "spark-hbase-connector_2.10" % "1.0.3" Setting the HBase Host
21. hosted by Writing to HBase
22. hosted by Reading from HBase
23. hosted by Filtering
24. hosted by Manage Empty Columns with Option[T]
25. hosted by Custom Mapping with Case Classes
26. hosted by Custom Mapping with Case Classes ...
27. hosted by Implicit Reader
28. hosted by Implicit Reader ... Do not forget to override the columns method.
29. hosted by HBase Read table Data in Spark DataFrame