How Tencent Makes the Most of Apache Hadoop Ozone

Hadoop Ozone, the Hadoop community's converged object and file storage system, has drawn broad attention in recent years. Tencent Big Data has invested heavily in Ozone community development and production deployment since 2019, helping drive the Ozone GA release. This talk covers how Tencent uses Ozone as the unified storage behind its machine learning platform and data warehouse platform, how it pushed Ozone's performance and datanode scale limits, and best practices for configuration and monitoring. It closes with an outlook on Ozone beyond the 1.0.0 release.


1. How Tencent Makes the Most of Apache Hadoop Ozone — Cheng Li, Tencent Big Data Engineer, licheng@

2. Agenda
- Overview of Hadoop Ozone
- Tencent contributions to Ozone
- Tencent playbook on Hadoop Ozone
- Upcoming milestones for Hadoop Ozone

3. Ozone Overall Architecture
[Diagram: the Ozone Client talks to the Ozone Manager (namespace) and the Storage Container Manager (container/block layer); data lives on Datanodes, which replicate via Apache Ratis]

4. Versatility of Hadoop Ozone
[Diagram: one Ozone storage layer (Ozone Manager, SCM, Datanodes) serves big-data apps on YARN (RM/NM) through OzoneFS, apps on Kubernetes through the Ozone CSI driver (volumes mounted into Pods via the Kubelet), and other apps through the S3 interface]

5. Ozone Motivation
[Diagram: HDFS architecture]
HDFS challenges:
1. The NameNode has scalability concerns
2. File operation concurrency
3. Block report handling
4. Slow NameNode startup
Ozone motivation:
1. Scale both the namespace and the block layer
2. Scale to trillions of objects
3. Optimize for small files

6. Tencent Contributions to Ozone
- Rack awareness and topology
- Multi-Raft for data pipelines
- Ozone scaling out

7. Rack Awareness and Topology
[Diagram: topology tree Root → Data Center 1..X → Rack 1..Y → Node Group 1..M → Datanode 1..N]
Pipelines and containers are allocated based on topology, so that data replication is rack-aware.
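The rack-aware idea can be sketched as follows. This is a minimal illustration of "one replica per rack first, then wrap around", not Ozone's actual placement policy; the function name and the rack map are made up for the example.

```python
from collections import defaultdict

def pick_replica_nodes(node_racks, replicas=3):
    """Pick `replicas` datanodes, preferring distinct racks.

    node_racks maps datanode name -> rack id. A sketch of the
    rack-aware idea only, not Ozone's real placement policy.
    """
    by_rack = defaultdict(list)
    for node, rack in sorted(node_racks.items()):
        by_rack[rack].append(node)
    racks = list(by_rack.values())
    chosen = []
    round_idx = 0
    while len(chosen) < replicas:
        progressed = False
        # Take the round_idx-th node from each rack: the first
        # pass puts every replica on a different rack, later
        # passes wrap around when racks run out.
        for nodes in racks:
            if round_idx < len(nodes):
                chosen.append(nodes[round_idx])
                progressed = True
                if len(chosen) == replicas:
                    break
        if not progressed:
            break  # fewer datanodes than requested replicas
        round_idx += 1
    return chosen
```

With four nodes spread over three racks, the three chosen replicas land on three different racks, which is the failure-domain property the topology tree above exists to provide.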

8. Multi-Raft on Data Pipelines
[Diagrams: the Ozone write path, and Multi-Raft, where each datanode participates in several Raft pipelines at once]
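Because a datanode now belongs to several pipelines, new containers can be spread across them instead of funneling every write through one Raft group. The sketch below shows one plausible spreading policy (always pick the least-loaded pipeline); it is illustrative only, since the real decision is made inside SCM and weighs much more state.

```python
from collections import Counter

def assign_containers(pipeline_ids, n_containers):
    """Spread new containers across a datanode's pipelines by
    always picking the currently least-loaded one.

    Illustrative load-balancing only, not Ozone's actual
    pipeline-selection logic.
    """
    load = Counter({p: 0 for p in pipeline_ids})
    for _ in range(n_containers):
        target = min(load, key=load.get)  # least-loaded pipeline
        load[target] += 1
    return dict(load)
```

Even this greedy policy keeps pipeline load uniform, which is why the disk-usage charts on the next slide flatten out under Multi-Raft.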

9. Multi-Raft Helps Datanode Write Performance
[Charts: datanode disk usage under single-Raft vs multi-Raft]

10. Multi-Raft Helps Datanode Write Performance
[Charts: small-files benchmark (100 KB files; 1000 / 20000 / 60000 files) captured from a production cluster, comparing total latency of single-Raft (1 pipeline) vs multi-Raft (5 pipelines), plus the production latency distribution: with multi-Raft about 97.90% of requests complete in under 0.2 s, vs about 68.20% with single-Raft]
Cloudera Blog: Multi-Raft – Boost up write performance for Apache Hadoop-Ozone

11. Scale a Single Cluster Out to 1000 Datanodes
[Chart: cluster throughput (TB) over 35 days, before vs after the fixes, growing week over week to 451 TB]
- SCM & OM on one physical machine; datanodes deployed via k8s; all HDD disks.
- Stable reads/writes of over 100 TB every week for over a month.
- Fixed SCM & OM CPU issues and Ratis stability issues.

12. SCM & OM CPU Usage Enhancement
[Charts: SCM and OM CPU usage (%) over 35 days, before vs after the fixes; both drop from several hundred percent to a few tens of percent]

13. A Bunch of Fixes and Improvements

Ozone stability:
- HDDS-3933. Fix memory leak caused by too many Datanode State Machine threads
- HDDS-3630. Merge RocksDB instances in the datanode
- HDDS-3514. Fix memory leak of RaftServerImpl
- HDDS-3041. Fix memory leak of s3g by releasing the connection resource
- RATIS-935. Fix memory leak by unregistering metrics
- RATIS-925. Fix memory leak of RaftServerImpl: no removal from the static RaftServerMetrics::metricsMap
- RATIS-845. Fix memory leak of RaftServerImpl: no unregister from the reporter
- RATIS-840. Fix memory leak of the log appender

Ratis group stability:
- RATIS-995. Leader balance in multi-Raft
- RATIS-993. Pre-vote before requesting votes
- RATIS-987. Fix infinite install-snapshot
- RATIS-983. Check follower state before asking for votes
- RATIS-982. Fix RaftServerImpl illegal transition from RUNNING to RUNNING
- RATIS-980. Fix leader election happening too fast
- RATIS-989. Avoid state change from CLOSING to EXCEPTION in LogAppender
- RATIS-977. Fix gRPC failure to read message
- RATIS-821. Fix high processor load for ScheduledThreadPoolExecutor with 0 core threads

Performance optimization:
- HDDS-3223. Improve s3g 1 GB object read efficiency by 100x
- HDDS-3745. Improve OM and SCM performance by 64% by avoiding collecting datanode information to s3g
- HDDS-3240. Improve write efficiency by creating containers in parallel
- HDDS-3244. Improve write efficiency by opening RocksDB only once
- HDDS-3168. Improve read efficiency by merging many getContainerWithPipeline RPC calls into one
- HDDS-3770. Improve getPipelines performance
- HDDS-3737. Avoid serialization between UUID and String
- HDDS-3481. Fix SCM asking too many datanodes to replicate the same container
- HDDS-3743. Avoid NetUtils#normalize when getting DatanodeDetails from proto
- HDDS-3742. Improve OM performance by 5.29% by avoiding stream.collect
- HDDS-3734. Improve SCM performance by 3.86% by avoiding TreeSet.addAll

Thanks to runzhiwang@!

14. How Tencent Uses Ozone
- Storage solution for the DataLake House.
- Optimized for various types of machines.
- Storage pool for both object and file system workloads.

15. Base of the DataLake House
Ozone compatibility:
1. Object interface
2. S3 RESTful APIs
3. HCFS
4. Mount point
Storage for the DataLake House:
- Hive clusters read/write to Ozone via the file system interface.
- The AI platform, managed by k8s, reads/writes to Ozone via mount points.
- The Notebook app reads/writes to Ozone via the S3 APIs.
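The same object is reachable through several of these interfaces. As a rough illustration of how the addresses relate (the hostnames and key are made up; the o3fs URI layout and the path-style S3 gateway URL follow Ozone's documented schemes, but check them against your deployed version):

```python
def ozone_uris(volume, bucket, key,
               om_host="om.example.com",
               s3g="s3g.example.com:9878"):
    """Render one Ozone object's address for the HCFS (o3fs)
    interface and the S3 gateway.

    Hostnames are placeholders. The o3fs authority is
    bucket.volume.om-host, and the S3 gateway serves path-style
    bucket/key URLs, per the Ozone docs.
    """
    return {
        "o3fs": f"o3fs://{bucket}.{volume}.{om_host}/{key}",
        "s3": f"http://{s3g}/{bucket}/{key}",
    }
```

So a Hive table file written through OzoneFS can be listed by a notebook through the S3 gateway without copying data, which is the point of the shared storage pool above.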

16. Ozone on High-Density Machines
High-density cluster for a private cloud user:
Master/Datanode: 60 * 16 TB HDD disks, 48 * 4 CPUs, 480 GB memory, 25 Gb (2-port) network.
Key points:
1. 60 ~ 110 16 TB disks on each node; each node is expected to store close to a petabyte.
2. Ozone cluster size is relatively small; each cluster won't have many datanodes.
3. Performance issues.
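The near-PB-per-node claim follows directly from the disk counts; a quick check (decimal TB, raw capacity before replication overhead):

```python
def raw_node_capacity_tb(num_disks, disk_tb=16):
    """Raw (pre-replication) capacity of one high-density node."""
    return num_disks * disk_tb

# 60 disks -> 960 TB, 110 disks -> 1760 TB: each node holds
# roughly a petabyte of raw capacity, before 3x replication.
low, high = raw_node_capacity_tb(60), raw_node_capacity_tb(110)
```

This is also why a small datanode count still yields a large cluster, and why per-node performance issues matter so much in this configuration.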

17. HDFS on Ozone
[Diagram: the HDFS NameNode serves file system requests from FS clients, while the Ozone Manager serves object interface requests from Ozone clients; both share the HDDS layer, with SCM as the control plane over the datanodes]

18. Ozone 1.0.0 Is Officially Released!
Release Manager: Sammi Chen (Hadoop PMC/Committer)
“This is a generally available (GA) release. It represents a point of API stability and quality that we consider production-ready.”

19. Ozone Recon

Configs on the Recon node:

<property>
  <name>ozone.recon.address</name>
  <value>reconAddress:9891</value>
</property>
<property>
  <name>ozone.recon.http-address</name>
  <value>reconAddress:80</value>
</property>
<property>
  <name>ozone.recon.db.dir</name>
  <value>recon-db-dir</value>
</property>
<property>
  <name>recon.om.snapshot.task.interval.delay</name>
  <value>1m</value>
</property>
<property>
  <name>ozone.recon.om.db.dir</name>
  <value>recon-om-db-dir</value>
</property>

Configs on each datanode:

<property>
  <name>ozone.recon.address</name>
  <value>reconAddress:9891</value>
</property>

20. Grafana on Ozone
[Screenshots: Grafana dashboards monitoring Ozone metrics]

21. Hadoop Ozone Milestones
- Ozone HA support
- Storage class support
- Erasure coding support
- Atomic delete and rename

22. Start Contributing to Ozone
https://hadoop.apache.org/ozone
https://cwiki.apache.org/confluence/display/HADOOP/Ozone+Contributor+Guide
Welcome to join the Tencent BigData team!

23.Thanks —licheng@