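1. Junping Du (堵俊平) - Open Source Big Data: Development and Trends v1.0
Hadoop is one of the best-known open source infrastructure projects under the Apache Software Foundation. Since its creation in 2006, it has been the most important foundational component for storing and processing massive volumes of data, and a rich technology ecosystem has grown up around it.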
2. Open Source Big Data Technology: Development Path and Trends
Junping Du (堵俊平)
ASF Member, Apache Hadoop Committer & PMC
Board Member, LF AI & DATA Foundation
TOC Chair, OpenAtom Foundation
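3. Contents
1. Hadoop meetups: a review of past years
2. Hadoop: its history and where it stands today
3. The development path of open source big data
4. New technology trends worth watching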
4. Apache Hadoop Meetup: a review of past years
Community Activity
• Before 2019: every 3-6 months @ SF Bay Area
• 2019.8 Beijing
• 2020.9 Shanghai
• 2021.10 Beijing
• 2022.9 Shanghai
• …
5. Is Hadoop Dead?
Report - July 2022 - Apache Hadoop
Community discussion activity has declined somewhat:
• dev@hadoop.apache.org had a 100% decrease in traffic in the past quarter (0 emails compared to 6)
• mapreduce-issues@hadoop.apache.org had a 59% decrease in traffic in the past quarter (87 emails compared to 208)
• user@hadoop.apache.org had a 50% decrease in traffic in the past quarter (26 emails compared to 51)
• user-zh@hadoop.apache.org had a 100% increase in traffic in the past quarter (6 emails compared to 3)
• yarn-dev@hadoop.apache.org had a 26% decrease in traffic in the past quarter (349 emails compared to 466)
• yarn-issues@hadoop.apache.org had a 39% decrease in traffic in the past quarter (751 emails compared to 1227)
Community contribution activity has risen somewhat:
• 356 issues opened in JIRA, past quarter (21% increase)
• 267 issues closed in JIRA, past quarter (20% increase)
• 437 commits in the past quarter (-5% change)
• 88 code contributors in the past quarter (4% increase)
• 374 PRs opened on GitHub, past quarter (25% increase)
• 299 PRs closed on GitHub, past quarter (13% increase)
The Hadoop user base and contributor community remain relatively stable.
6. Hadoop Community Update
Release Update
• 3.2.X
  - 3.2.3 -> 2022-03-20
  - 3.2.4 was released on 2022-07-22
• 3.3.X
  - 3.3.2 was released on 2022-03-02
  - 3.3.3 was released on 2022-05-17
  - 3.3.4 was released on 2022-08-08
• 2.10.X
  - 2.10.2 was released on 2022-05-31
Community Growing
• New PMC
  - Sun Chao was added to the PMC on 2022-03-08
• New Committer
  - Gautham Banasandra was added as committer on 2021-11-04
  - Benjamin Teke was added as committer on 2022-03-24
  - András Győri was added as committer on 2022-02-15
  - Tao Li was added as committer on 2022-04-22
  - Mehakmeet Singh was added as committer on 2022-07-27
Notable Features
• HADOOP-17124 Support LZO using aircompressor
• HADOOP-18055 Async Profiler endpoint for Hadoop daemons
• HADOOP-17979 Interface EtagSource to allow FileStatus subclasses to provide etags
• MAPREDUCE-7341 Add a task-manifest output committer for Azure and GCS.
• YARN-10496 [Umbrella] Support Flexible Auto Queue Creation in CapacityScheduler
• YARN-8849 (DynoYARN: A simulation and testing infrastructure for YARN clusters).
• YARN-9698 & YARN-10843 [Umbrella] Tools to help migration from Fair Scheduler to Capacity Scheduler.
• YARN-11025 Implement distributed decommissioning
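7. "Hadoop" in the broad sense: the open source big data ecosystem
• After 16 years of development (2006-2022), the big data technology revolution set off by Hadoop shows no sign of stalling; it looks more like a "Cambrian explosion".
• Beyond application demand, open source and SaaS cloud services have become the two major drivers of data technology.
• SQL engines still attract the most attention.
• Lakehouse architecture (LakeHouse), data governance, unified batch and stream processing, and data + AI convergence are all innovation hotspots.
• Fragmentation of projects and communities is very pronounced, to the point that some "Yet Another XXX" projects have appeared.
Figure: technology landscape of the Hadoop open source ecosystem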
8. The new generation of data processing architecture is built from open source platforms and SaaS services
https://future.a16z.com/emerging-architectures-modern-data-infrastructure/
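9. Trend 1: Data governance
Demand for data keeps growing explosively, while data regulation is becoming regionalized
● In the information age, it is not only markets and technology that change rapidly, but regulatory requirements as well.
● The wave of "data sovereignty" has spread quickly across countries and regions.
● With the legal order being rebuilt, global multinational enterprises must quickly build a global data management system that is sound, compliant, and efficient for the business, so as to balance business growth against legal compliance.
● Unstructured data accounts for most data growth, so governance of massive unstructured data is increasingly important.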
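10. "East Data, West Computing" (东数西算) becomes a national strategy
East Data, West Computing reshapes China's "data map"
In February 2022, construction of national computing hub nodes was launched in eight locations (Beijing-Tianjin-Hebei, the Yangtze River Delta, the Guangdong-Hong Kong-Macao Greater Bay Area, Chengdu-Chongqing, Inner Mongolia, Guizhou, Gansu, and Ningxia), and 10 national data center clusters were planned.
● 2022-2023: deploy workloads with loose latency requirements to western data centers, achieving "east data, west storage"
● 2023-2024: improve high-speed backbone networks across regions, achieving basic east-data-west-computing
● After 2025: strengthen network interconnection capability, achieving real-time computing capacity that can be flexibly scheduled across regions
● Network capability, scheduling capability, privacy protection and data governance capability, data "virtualization" and real-time computing capability, and data observability all face major technical challenges.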
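11. Data governance: from architecture to methodology
● The best approach to data governance is hotly debated, and consensus is unlikely in the short term.
● How it is adopted and implemented depends on the size of the organization and the scale of its data and business.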
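12. Data governance: open source projects and services
● Data maps and metadata management
● Data quality and observability
● Data security
● Data workflows and compliance auditing
Ref: https://www.notion.so/atlanhq/The-Ultimate-Repository-of-Data-Discovery-Solutions-149b0ea2a2ed401d84f2b71681c5a369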
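13. Trend 2: Integration and acceleration across multiple compute frameworks
Arrow has become the "de facto standard" for compute and storage interaction across frameworks in the data field
Ref: https://engineering.fb.com/2022/08/31/open-source/velox/
● Multiple frameworks integrate at the storage layer (in memory) and at the compute layer
● More and more native engines: Velox, Arrow DataFusion, Databricks Photon, Huawei OmniRuntime, and others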
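To illustrate what integrating at the in-memory storage layer can mean, here is a minimal sketch using only the Python standard library. It is not Arrow's API: the `produce_column` and `column_sum` functions are hypothetical stand-ins for two engines, and a typed `array` stands in for a columnar buffer. The point is only that a consumer can read a producer's contiguous column through a view, without copying or re-serializing it.

```python
from array import array

# Hypothetical "producer engine": keeps one column as a contiguous buffer
# of 64-bit signed integers (a stand-in for a columnar in-memory format).
def produce_column() -> array:
    return array("q", [3, 1, 4, 1, 5, 9, 2, 6])

# Hypothetical "consumer engine": reads the column through a memoryview,
# so no bytes are copied and no serialization step is needed.
def column_sum(buf: memoryview) -> int:
    return sum(buf)

column = produce_column()
view = memoryview(column)   # zero-copy view over the producer's buffer
total = column_sum(view)    # computed directly on the shared memory

# Slicing a memoryview yields another view onto the same memory.
head = view[:3]
column[0] = 30              # the producer updates the value in place
# head[0] now reflects the update, because both names share one buffer.
```

Real engines exchange far richer structures (nested types, validity bitmaps, dictionary encoding), which is what a shared format specification has to pin down; the sketch only shows the zero-copy idea.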
14. Trend 3: New data-silo problems brought by multi-cloud
[Diagram: data services scattered across public clouds and private clouds / data centers. Services named include ClickHouse, Hive, Spark, HBase, Redis, Impala, RDS, Kafka, Ozone, Kudu, ES, Kinesis, Athena, Bigtable, Spanner, S3, Glue, RedShift, GCS, BigQuery, Pub/Sub, Cosmos DB, Hadoop, Iceberg, and Hudi, along with other storage, pipeline, catalog, and machine learning services.]
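15. The multi-cloud era places new requirements on data infra
Generations shown on the slide:
• 1st generation: local data warehouse (Appliance, On Premise)
• 2nd generation: cloud data warehouse (Managed Cloud)
• 2.1th generation: serverless query engine (Serverless)
• 2.2th generation: cloud native database (Elasticity)
• 3rd generation: multi-cloud Analytical + AI
Products named on the slide: Amazon RedShift, AWS Athena, Google Big Query, databricks, and "??" for the 3rd generation
New requirements:
• Unified view to all data
• Unified way to access all data
• Unified engine for Analytic & ML
Deployment progression: on-premises -> cloud -> multi-cloud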
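One way to picture "Unified way to access all data" is a single entry point that routes a dataset URI to whichever store holds it. The sketch below is purely illustrative: the `BACKENDS` registry, the `resolve` function, and the example URIs are hypothetical and do not correspond to any product on the slide.

```python
from urllib.parse import urlparse

# Hypothetical registry mapping URI schemes to storage backends.
# The entries are illustrative only.
BACKENDS = {
    "s3": "AWS S3",
    "gs": "Google Cloud Storage",
    "hdfs": "Hadoop HDFS",
}

def resolve(uri: str) -> dict:
    """Map a dataset URI to a backend-neutral description of its location."""
    parts = urlparse(uri)
    if parts.scheme not in BACKENDS:
        raise ValueError(f"no backend registered for scheme {parts.scheme!r}")
    return {
        "backend": BACKENDS[parts.scheme],
        "container": parts.netloc,       # bucket name or namenode address
        "path": parts.path.lstrip("/"),  # object key or file path
    }

# Callers use one function regardless of where the data lives.
s3_location = resolve("s3://sales/2022/q3.parquet")
hdfs_location = resolve("hdfs://namenode:8020/warehouse/orders")
```

A real multi-cloud layer would also have to unify credentials, metadata, and consistency semantics, which a scheme lookup like this does not address.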