Data Orchestration: The Evolution Path of Enterprise Big Data Management


1. Open Source Data Orchestration for AI, Big Data, and Cloud
Haoyuan (H.Y.) Li | Founder & Chairman & CTO | haoyuan@alluxio.com
2019-06-22 @ Beijing

2. Open Source
• Started From UC Berkeley AMPLab
• Apache 2.0 Licensed
• 1000+ contributors & growing
• 4000+ Git Stars
• GitHub’s Top 100 Most Valuable Repositories Out of 96 Million
• Join the conversation on Slack: slackin.alluxio.io

3. Companies Using Alluxio - Read More (Including 8 of the Top 10 Internet Companies in China)

4. 4 big trends driving the need for a new architecture
• Separation of Compute & Storage
• Hybrid – Multi cloud environments
• Rise of the object store
• Self-service data across the enterprise

5. Data Ecosystem - Beta vs. Data Ecosystem 1.0
[Diagram: two panels, each showing a COMPUTE layer above a STORAGE layer]

6. Data stack journey and innovation paths
Co-located: compute & HDFS on the same cluster (MR / Hive on HDFS)
• Typically compute-bound clusters over 100% capacity
• Compute & I/O need to be scaled together even when not needed
Disaggregated: compute & HDFS on the same cluster (Hive on HDFS)
• Compute & I/O can be scaled independently, but I/O is still needed on HDFS, which is expensive
Innovation paths:
• Support more frameworks: support Presto, Spark across DCs without app changes
• Enable Hybrid Cloud: burst compute in the cloud, public or private, using on-premise (e.g. HDFS) data
• Transition to Public Cloud / Object store: enable & accelerate machine learning & analytics on object stores

7. Independent scaling of compute & storage
Client-side interfaces: Java File API, HDFS Interface, S3 Interface, POSIX Interface, REST API
Data Orchestration for the Cloud
Storage drivers: HDFS Driver, Swift Driver, S3 Driver, NFS Driver

8. Challenges with running Big Data & AI workloads on S3 & Alluxio Solution
Accelerate big data frameworks on the public cloud: compute caching for S3
Challenges:
• S3 performance is variable and consistent query SLAs are hard to achieve
• S3 metadata operations are expensive, making workloads run longer
• S3 egress costs add up, making the solution expensive
• S3 is eventually consistent, making it hard to predict query results
[Diagram: Spark and Alluxio deployed together on the same instance / container]

9. Challenges with Hybrid Cloud & Alluxio Solution
Burst big data workloads in hybrid cloud environments: HDFS for Hybrid Cloud
Challenges:
• Accessing data over WAN is too slow
• Copying data to the compute cloud is time consuming and complex
• Using another storage system like S3 means expensive application changes
• Using S3 via the HDFS connector leads to extremely low performance
Solution Benefits:
• Same performance as local
• Same end-user experience
• 100% of I/O is offloaded
[Diagram: Hive and Alluxio deployed together on the same instance / container]

10. Challenges with supporting more frameworks & Alluxio Solution
Support more frameworks: enable big data on object stores across single or multiple clouds
Challenges:
• Running new frameworks on an existing HDFS cluster can dramatically affect performance of existing workloads
• In a disaggregated environment, copying data to multiple compute clouds is time consuming and error prone
• Migrating applications to new storage systems is complex & time consuming
• Storing and managing multiple copies of the data becomes expensive
[Diagram: Spark and Presto on Alluxio in the same data center / region, backed by any object store or HDFS]

11. Challenges running AI & Big Data on Object Stores & Alluxio Solution
Dramatically speed up big data on object stores on premise: Transition to Object store
Challenges:
• Object store performance for big data workloads can be very poor
• No native support for popular frameworks
• Expensive metadata operations reduce performance even more
• No support for hybrid environments directly
Solution Benefits:
• Same performance as HDFS
• Uses HDFS APIs
• Same end-user experience
• Storage at a fraction of the cost of HDFS
[Diagram: Presto and Alluxio deployed together on the same container / machine]

12. Advanced Use Cases
• Any Cloud / Multi Cloud: enable big data on object stores across single or multiple clouds
• Alluxio Standalone: orchestrate data frameworks on the public cloud
[Diagram labels: Spark, Hive, Presto on Alluxio; same data center / region; any public / private cloud]

13. Alluxio – Key innovations
• Data Locality with Intelligent Multi-tiering: accelerate big data workloads with transparent tiered local data
• Data Accessibility for popular APIs & API translation: run Spark, Hive, Presto, ML workloads on your data located anywhere
• Data Elasticity with a unified namespace: abstract data silos & storage systems to independently scale data on-demand with compute

14. Data Locality with Intelligent Multi-tiering
Local performance from remote data using multi-tier storage
• Tiers: RAM (Hot), SSD (Warm), HDD (Cold)
• Read & Write Buffering
• Transparent to App
• Policies for pinning, promotion/demotion, TTL
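The promotion/demotion and pinning policies on this slide can be illustrated with a toy model. The sketch below is not Alluxio's implementation: the class `TieredStore`, its method names, and the LRU demotion rule are all assumptions made for illustration, and TTL is left out to keep it short.

```python
from collections import OrderedDict


class TieredStore:
    """Toy hot/warm/cold block store (hypothetical, not Alluxio's actual code)."""

    def __init__(self, capacities):
        # capacities: mapping of tier name -> max number of blocks,
        # fastest tier first, e.g. {"RAM": 2, "SSD": 2, "HDD": 4}.
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities.items()]
        self.pinned = set()

    def pin(self, block_id):
        """Pinned blocks are never chosen for demotion."""
        self.pinned.add(block_id)

    def tier_of(self, block_id):
        for name, _, blocks in self.tiers:
            if block_id in blocks:
                return name
        return None

    def _remove(self, block_id):
        for _, _, blocks in self.tiers:
            if block_id in blocks:
                return blocks.pop(block_id)
        return None

    def _insert(self, level, block_id, data):
        if level >= len(self.tiers):
            return  # fell off the coldest tier: the block leaves the cache
        _, cap, blocks = self.tiers[level]
        blocks[block_id] = data
        blocks.move_to_end(block_id)
        while len(blocks) > cap:
            # Demote the least recently used unpinned block one tier down.
            victim = next((b for b in blocks if b not in self.pinned), None)
            if victim is None:
                break  # everything here is pinned; let the tier overflow
            self._insert(level + 1, victim, blocks.pop(victim))

    def write(self, block_id, data):
        """New data lands in the hottest tier (write buffering)."""
        self._remove(block_id)
        self._insert(0, block_id, data)

    def read(self, block_id):
        """Return the block and promote it to the hottest tier; None on a miss."""
        data = self._remove(block_id)
        if data is None:
            return None
        self._insert(0, block_id, data)
        return data


store = TieredStore({"RAM": 2, "SSD": 2, "HDD": 4})
for i in range(5):
    store.write(f"block-{i}", b"x")
cold_before = store.tier_of("block-0")   # oldest block has been demoted
store.read("block-0")                    # reading promotes it again
hot_after = store.tier_of("block-0")
```

Writing five blocks into a two-slot hot tier pushes the oldest ones down tier by tier, and a later read pulls a cold block back up, which is the behaviour the slide summarises as hot, warm, and cold data.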

15. Data Accessibility via popular APIs and API Translation
Convert from Client-side Interface to native Storage Interface
• Client-side interfaces: Java File API, HDFS Interface, S3 Interface, POSIX Interface, REST API
• Storage drivers: HDFS Driver, S3 Driver, Swift Driver, NFS Driver
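One way to picture the translation from a single client-side interface to native storage interfaces is a dispatcher keyed on the URI scheme. This is a minimal sketch under stated assumptions: `FileApi`, `UnderStoreDriver`, and `InMemoryDriver` are hypothetical names, and the in-memory dictionaries stand in for real HDFS or S3 drivers.

```python
from urllib.parse import urlparse


class UnderStoreDriver:
    """Hypothetical driver interface: one native read call per storage system."""

    def read(self, location):
        raise NotImplementedError


class InMemoryDriver(UnderStoreDriver):
    """Stand-in for a real HDFS / S3 / Swift / NFS driver, backed by a dict."""

    def __init__(self, objects):
        self.objects = objects

    def read(self, location):
        return self.objects[location]


class FileApi:
    """Single client-side interface that forwards each call to the right driver."""

    def __init__(self):
        self.drivers = {}

    def register(self, scheme, driver):
        self.drivers[scheme] = driver

    def read(self, uri):
        parsed = urlparse(uri)
        driver = self.drivers.get(parsed.scheme)
        if driver is None:
            raise ValueError(f"no driver registered for scheme {parsed.scheme!r}")
        # Hand the driver the authority plus path, e.g. "bucket/logs/a.txt".
        return driver.read(parsed.netloc + parsed.path)


api = FileApi()
api.register("s3", InMemoryDriver({"bucket/logs/a.txt": b"from s3"}))
api.register("hdfs", InMemoryDriver({"namenode:8020/data/b.txt": b"from hdfs"}))

s3_bytes = api.read("s3://bucket/logs/a.txt")
hdfs_bytes = api.read("hdfs://namenode:8020/data/b.txt")
```

The application only ever calls `api.read`; adding a new storage system means registering another driver rather than changing application code.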

16. Data Elasticity via Unified Namespace
Enables effective data management across different under stores by mounting them with transparent naming

17. Unified Namespace: Global Data Accessibility
Transparent access to understorage makes all enterprise data available locally
SUPPORTS:
• HDFS
• NFS
• OpenStack
• Ceph
• Amazon S3
• Azure
• Google Cloud
IT OPS FRIENDLY:
• Storage mounted into Alluxio by central IT
• Security in Alluxio mirrors source data
• Authentication through LDAP/AD
• Wireline encryption
[Diagram: HDFS #1, Object Store, NFS, and HDFS #2 mounted into one namespace]
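Mounting several under stores into one namespace can be sketched as a mount table with longest-prefix resolution. The `MountTable` class, the mount points, and the host and bucket names below are hypothetical, chosen only to illustrate the idea; this is not Alluxio's implementation.

```python
class MountTable:
    """Toy unified namespace: maps namespace paths to under store URIs."""

    def __init__(self):
        self.mounts = {}  # mount point in the namespace -> under store URI

    def mount(self, mount_point, ufs_uri):
        self.mounts[mount_point.rstrip("/") or "/"] = ufs_uri.rstrip("/")

    def resolve(self, path):
        """Translate a namespace path into the URI of the backing store."""
        best = None
        for point in self.mounts:
            # Match whole path components only, so "/warehouse" does not
            # accidentally cover "/warehouse-old".
            prefix = point if point.endswith("/") else point + "/"
            if path == point or path.startswith(prefix):
                if best is None or len(point) > len(best):
                    best = point  # keep the most specific (longest) mount
        if best is None:
            raise KeyError(f"no mount point covers {path}")
        suffix = path[len(best):].lstrip("/")
        base = self.mounts[best]
        return f"{base}/{suffix}" if suffix else base


table = MountTable()
table.mount("/warehouse", "hdfs://namenode-1:8020/warehouse")
table.mount("/warehouse/archive", "s3://my-bucket/archive")
table.mount("/home", "hdfs://namenode-2:8020/users")

recent = table.resolve("/warehouse/sales/2019.parquet")
archived = table.resolve("/warehouse/archive/2018.parquet")
```

Applications see one directory tree, while the nested mount sends archived data to the object store and everything else under `/warehouse` to the first HDFS cluster.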

18. APIs to Interact with data in Alluxio
Applications have great flexibility to read / write data with many options:
Spark: > rdd = sc.textFile("alluxio://localhost:19998/myInput")
Presto: CREATE SCHEMA hive.web WITH (location = 'alluxio://master:port/my-table/')
POSIX: $ cat /mnt/alluxio/myInput
Java: FileSystem fs = FileSystem.Factory.get(); FileInStream in = fs.openFile(new AlluxioURI("/myInput"));

19. Alluxio Reference Architecture
[Diagram: each Application embeds an Alluxio Client and talks to an Alluxio Worker (RAM / SSD / HDD); workers reach Under Store 1 and Under Store 2 over the WAN; an Alluxio Master with a Standby Master is coordinated through Zookeeper / RAFT]

20. Two Sigma
Fastest growing big hedge fund managing $50 billion for investors
Use case | Cloud bursting on-premise data
• Compute scales elastically independent of storage
• Faster time to insights with seamless data orchestration
• Accelerated workloads with memory-first data approach
[Diagram: Spark in the public cloud reading on-premise HDFS through the data orchestration layer]
https://www.alluxio.io/resources/case-studies/two-sigma-case-study-cloud-bursting-with-spark-for-on-premise-hadoop/

21. China Unicom
Leading Chinese Telco serving 320 million subscribers
Use case | Data orchestration for agility
• Single namespace to access & address all data
• Data local to compute accelerates workloads
[Diagram: Spark, Spark on Kubernetes, and Spark ETL on the data orchestration layer, backed by HDFS, object storage, and HBase]
https://www.alluxio.io/blog/china-unicom-uses-alluxio-and-spark-to-build-new-computing-platform-to-serve-mobile-users/

22. NetEase Games
Leading Online Game Company in China
Use Case | On-premise Caching for Presto
• Before Alluxio: large query variance during peak hours
• Alluxio brings data local to Presto to reduce latency during peak hours
[Diagram: Presto on HDFS compared with Presto on Alluxio on HDFS]
https://www.alluxio.io/blog/presto-on-alluxio-how-netease-games-leveraged-alluxio-to-boost-ad-hoc-sql-on-hdfs/

23. Architecture: Colocate Alluxio with Presto
• Black/Red line – Large query variance without Alluxio
• Green line – Stable query time with Alluxio

24. Questions?
Welcome to join the Alluxio Open Source Community!
www.alluxio.io | @alluxio | slackin.alluxio.io

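Alluxio is the world's first middleware to unify disaggregated, heterogeneous storage into a single platform while providing near-memory access speed. It is widely used to accelerate business data analytics in enterprise and hybrid cloud environments.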