Big Data NoSQL System:Apsara DB Hbase and Spark

来自阿里巴巴的 Wei Li 介绍了基于阿里巴巴云 HBase 构建的融合了计算、存储和检索以及在线和离线的的大数据中台解决方案,同时结合云上的弹性伸缩能力,节省成本。针对扫描大表,造成在线的 HBase 服务不稳定的问题,他们做了一个工作是把在线存储和离线分析使用的数据分离开来,通过一键归档把离线的数据转成列存的格式,带来性能十倍以上的提升,同时也不会影响 HBase 在线服务的稳定性,列存的方式是把源数据通过WAL同步到Spark 集群,存储成列的方式。数据归档完成之后,处理完的数据还需要写回到 HBase,这些数据的具体细节没有说明,可能跟业务有关,猜测是一些经过处理之后的聚合类数据等。他们没有通过传统的使用 HBase API 的方式,而是直接加载 HFile.最后一点是成本,使用云端数据库能带来两个方面的成本节省。一个是计算资源,一个是存储。计算资源是因为不同的业务有不同的波峰和波谷;存储是因为可以利用云上的廉价存储。最后他根据具体的几个 case 详细讲述了这套方案的案例。



2.BigData NoSQL System:ApsaraDB HBase and Spark Wei Li ApsaraDB HBase X-Pack team


4. BigData processing scenario New New Finance E- manufactu Game Social News retail commerce ring Safety control Personalized Statistical Analysis Time&space timing Feeds recommendation ⽤户画像 ⽤户⾏为分析 监控数据 维表和结果表 海量帖⼦、⽂章 爬⾍抓取信息 ⽤户画像 轨迹、设备数据 离线分析 聊天、评论 反欺诈系统 推荐引擎 地理信息 海量实时数据存储 海量实时数据处理 订单数据 海量实时数据处理 区域分布统计

5. System iterations and challenges Centralized database Distributed database Hadoop HBase X-Pack

6. Architecture & Implementation

7. ApsaraDB HBase X-Pack Architecture One-stop shop for big data processing : Storage & Search& Computing, Scalability & Real-Time & Flexibility Two integrations •  Storage and Search integration •  Online and offline integration One cost reduction •  Complex computing flexibility

8. ApsaraDB HBase X-Pack Deployment Spark SparkStreaming BDS Phoenix Search Index Parquet HBase + Solr Streaming Compute Online storage& Search Complex analysis

9. Spark analysis HBase Data Spark on HBase Spark on HBase Spark Parser Performance DataSource API •  distributed scan; Required •  sql optimize like partition pruning column pruning predicate ! GetPartition! Columns PrunedFilter Schema ! pushdown ! •  direct reading hifles Mapping! Filter •  auto transform to column based storage filter Filter PhoenixInputFormat! Get Scan Snapshot ! Phoenix! put/create Get API Scan API newAPIHadoopRDD API TableSnapshot HBase ! Multi get Range Scan ! InPutFormat ! ! region Snapshot

10. Spark analysis HBase Data One-click archiving •  Row-Oriented To Column-Oriented •  Performance improvement 20 times •  HBase Cluster more stable start/stop config HBase BDS /tmp/20190606/00/15/ Spark /tmp/20190606/00/30/ hlog! …… Driver hlog! hlog /tmp/20190606/23/45/ ! ! /tmp/20190607/00/00/ Executor0 /data/20190606 ! ! region Executor1 …. ! ! ! ! ! hfile BulkLoad ExecutorX

11. ApsaraDB X-Pack Spark expand HBase ecology OSS POLARDB RDS Redis Kafka HBase Kafka X-Pack Phoenix4.x Spark LogHub Phoenix5.0 schema DataHub ADS4PG BulkGet ODPS TableStore

12. ApsaraDB X-Pack Spark cost & dynamic ECS Master1 Master2 •  Calculate resource elasticity(1 times lower) •  OSS storage resource flexibility(3 times lower) Core Elastic Elastic ElasticNode! Core HDFS Elastic Elastic 300! 240! (node*h)! 180! 00:00-05:00 Core Elastic Elastic 5 node 10 node 120! 5 node 60! 0! OSS&DFS ElasticNode

13. ApsaraDB X-Pack Spark data desktop Scheduling, relying, interactive

14. ApsaraDB X-Pack Spark data desktop Scheduling, relying, interactive


16. HBase X-Pack Product recommendation platform Scenario: With the increasing number of users accumulating in the APP, the customer is ready to launch the product recommendation function, which requires real-time ETL analysis, storage and model calculation of the user behavior log.

17. HBase X-Pack: Integrated data processing platform Pain points values •  Online HBase and offline analysis shared clusters affect •  HBase data is asynchronously archived to Spark number online HBase query performance warehouse, which has no effect on online •  Spark&hive sql directly reads and writes HBase in bulk, •  After Spark analysis, the result data is transferred by the bulkload affecting HBase stability method, which does not affect the online business.

18. HBase X-Pack: Big data risk control platform ( ) + + Spark Spark Streaming SQL MLlib Load ( HDFS) Kafka HBase Parquet (HDFS OSS) •  Real-time news: Kafka accepts real-time collected messages, and can do simple things with smoke streaming •  Archive by day increment: Data increments that are streamed to the storage service each day are archived to the spark offline warehouse •  Offline data warehouse: used to store the full amount of data, the data is stored in HDFS on the column. •  Full training model: spark supports complex computing, mlib, python is suitable for data computing training model •  Model data Load: The new model Loaded to the model service for offline service to provide external control decision •  Risk control simulation: When adding a heart rule or a new model at the training center, verify its good or bad, you can use the full amount of data to do training in the spark offline warehouse.

19. HBase X-Pack: Game log processing platform values •  Support high performance offline computing and real-time computing; •  Manage data job scheduling; •  Support for elastic scaling calculations (cost savings) •  Support hot and cold storage (cost saving) •  Meet data lake scenarios and support high-throughput mass storage structured and unstructured data;

20. HBase X-Pack:Real-time scene values •  Pre-computation generates a common indicator layer, and uses HBase&Solr's real-time analysis and processing capabilities to meet real-time report calculations of different services. •  Pre-calculation is to use spark streaming, the delay is less than 10s •  Spark streaming can be used with hbase to do de-weighting, correlation dimension table

21. HBase X-Pack:offline data warehouse PolarDB RDS ADB HBase Mongo Redis Spark Spark ( Parquet HIVEMeta) Spark ! Streaming - - ! ! ! ! ! ! ! ! Spark ! ! ! ! ! ! ! ! •  Operational data layer: The most primitive data in the message middleware is similar to Kafka, LogHUB, or in online databases such as PolarDB, RDS, Mongo, HBase, etc. •  Detail wide surface layer: Use the Spark batch ETL or Spark Streaming table to build a detailed wide table •  Public summary wide surface layer: Classification and modeling in Spark according to certain business themes, such as daily/monthly reports, model training, etc. •  Public dimension surface: static dimension table •  Data application layer: high-level summary data processed by offline number bins is stored in the online library for query service.

22.We are hiring! Ø  If you are interested in the online sql analysis engine Ø  If you are interested in the spark kernel and ecosystem ApsaraDB HBase X-Pack: