Alluxio+Fluid: Building a Cloud-Native AI Acceleration Platform

Speaker: platform researcher at Unisound AI Labs, focusing on AI infrastructure, deep learning, and high-performance computing; responsible for the overall architecture design of the Atlas AI computing platform, which provides training acceleration and inference optimization environments for algorithm scientists. Also conducts computer vision research, mainly on multimodal video generation. Currently a member of the Kubeflow, Fluid, and Istio open-source communities.


1. Speeding Up the Atlas Deep Learning Platform with Alluxio+Fluid
Yuandong Xie, Platform Researcher, Unisound AI Labs

2. Agenda
• Introduction to Unisound & the Atlas AI platform
• Challenges & solutions
• Test scenarios & results
• Analysis of the speed-up
• Contribution to Fluid
• Conclusion & next steps

3. Unisound AI Services
• Cloud AI service provider: applications in medical, education, home, security, and interactive systems
• AI chip provider: applications in car, financial, and home scenarios
• Atlas supports all of these applications and the expansion into new business

4. Atlas AI Platform
• Application: speech processing, image processing, semantic processing, recommendation systems, data mining, text analysis
• Computing: data preprocessing, feature extraction, statistical analysis, machine learning models, deep learning models
• Management: job scheduler, model registry, user management, controller manager, logging and monitoring, image registry, resource management
• Infrastructure: CPU cluster, GPU cluster, distributed storage cluster, 100G InfiniBand high-performance network

5. Platform Storage: Lustre
• Distributed file system providing a single large namespace (PB scale)
• Multiple Lustre clusters on different networks (InfiniBand and Ethernet)
• Data verification and replication ensure data safety
• Supports online expansion and version upgrades

6. Challenges with the Current Architecture
• Storage I/O pressure
• Small files (<1 MB) increase pressure on the MDT and OSTs
• Single-node multi-task I/O preemption
• Data duplication

7. Impacts of the Challenges
• I/O pressure -> hardware damage
• Small files -> low QoS
• I/O preemption -> wasted resources
• Duplicate data -> wasted storage

8. Current Solutions for the Challenges
• Limit client I/O (Lustre TBF rate limiting):
  $ lctl set_param ost.OSS.ost_io.nrs_policies="tbf nid"
  $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="start other_clients nid={client@o2ib} rate=80"
• Pack small files into larger archives
• Schedule high-priority jobs to idle nodes
• Check for duplicate data
These mitigations are not effective enough.

9. New Architecture of Atlas
• Client (atlasctl): users submit compute jobs and manage Dataset, Runtime, and DataLoad resources
• Cache layer (Alluxio + Fluid): Alluxio master and worker pods cache data in RAM/SSD/HDD; jobs read through alluxio-fuse with short-circuit access to the local cache
• Access control: node-level permissions follow the underlying UGO+RWX model
• Storage layer (Lustre): mounted at /mnt/$group/$user
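The cache layer above is declared through Fluid's custom resources: a Dataset points at the underlying storage path and an AlluxioRuntime describes the cache. A minimal sketch, assuming illustrative names, paths, and quotas (not the platform's actual configuration):

```yaml
# Hypothetical example: expose a Lustre directory as a Fluid Dataset
# backed by an in-memory Alluxio cache. All names and sizes are placeholders.
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: demo
spec:
  mounts:
    - mountPoint: local:///mnt/demo-group/demo-user   # underlying Lustre path (placeholder)
      name: demo
---
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: demo            # must match the Dataset name
spec:
  replicas: 2           # number of Alluxio worker pods
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 2Gi      # per-worker cache capacity
```

Applying these resources brings up the Alluxio master and worker pods and binds the Dataset to the runtime, after which jobs can mount the cached data.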

10. Alluxio is all we need & Why
• Customized storage for Atlas: good performance, but limited scalability and expensive
• Current architecture (direct Lustre access): poor performance and low QoS
• Alluxio for Atlas: good performance, good scalability, and cheap!

11. Fluid is all we need & Why
• Applications: speech processing, computer vision, natural language processing, ……
• Data types: massive small files, medium-size files, large files
• Open questions:
  • How to manage different users' datasets?
  • How to speed up different types of datasets?
  • How to auto-scale the Alluxio engine?
  • How to schedule jobs with cache locality?
• Fluid components: fluid-controller-manager, fluid-scheduler, Fluid functions, ……
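For the cache-locality question, jobs consume a dataset through the PersistentVolumeClaim that Fluid creates with the same name as the Dataset, which lets Fluid's scheduling components place the pod near the cached data. A hypothetical pod spec (image, names, and command are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: train-job
spec:
  containers:
    - name: train
      image: demo-train:latest                # placeholder training image
      command: ["python", "train.py", "--data", "/data"]
      volumeMounts:
        - mountPath: /data
          name: demo
  volumes:
    - name: demo
      persistentVolumeClaim:
        claimName: demo                       # PVC created by Fluid for a Dataset named "demo"
```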

12. Test Scenarios
• Datasets and scenarios:
  • Massive small files (<10 MB): noise reduction
  • Medium-size files (>100 MB, <1 GB): image classification
  • Large files (>100 GB): optical character recognition (OCR)
• Comparative tests:
  • Read directly from Lustre
  • Read from the cache (Alluxio engine)

13. Experiment #1: Massive Small Files
• Test details:
  • Noise reduction application
  • 500,000 files, 183 GB total
  • Cached in memory
• Three experiments:
  • Load raw WAV files from Lustre
  • Load clean data from Alluxio (cold), noise data from Lustre
  • Load all data from Alluxio (warm)
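The warm-cache runs rely on pre-loading data into Alluxio before the job starts. In Fluid this can be expressed declaratively with a DataLoad resource that references an existing Dataset by name; a minimal sketch with illustrative names:

```yaml
# Hypothetical example: pre-warm the cache for a Dataset named "demo".
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: demo-load
spec:
  dataset:
    name: demo            # placeholder Dataset name
    namespace: default
```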

14. Massive Small Files: Speed Comparison
• Test result: 10x speed-up

15. Massive Small Files: Bandwidth and GPU Usage Comparison
• GPU usage: up 10%
• Lustre I/O bandwidth: 230 Mb/s -> 0 (reads served entirely from the cache)

16. Experiment #2: Medium-Size Files
• Test details:
  • Image classification application
  • ImageNet TFRecord shards (average size: 138 MB)
  • Cached in memory
• Comparative tests:
  • Run a 10-GPU job (exclusive) on a single node
  • Run a 7-GPU job (with preemption) on the same node
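The bandwidth figures reported in these experiments can be reproduced with a simple sequential-read probe run once against the Lustre mount and once against the Alluxio-fuse mount. A minimal Python sketch (not the platform's actual tooling; the function name and paths are illustrative):

```python
import time

def read_bandwidth(path, block_size=4 << 20):
    """Sequentially read the file at `path` in 4 MiB blocks;
    return (bytes_read, elapsed_seconds, throughput_mb_per_s)."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        # Read until EOF, counting bytes as we go.
        while chunk := f.read(block_size):
            total += len(chunk)
    elapsed = time.perf_counter() - start
    # Guard against a zero-length timing window on tiny files.
    return total, elapsed, total / max(elapsed, 1e-9) / (1 << 20)
```

Comparing the throughput of the same file on, say, /mnt/lustre/... versus the Fluid/Alluxio mount point (after a warm-up pass) gives a quick per-node view of the cache's effect.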

17. Medium-Size Files
• Test summary:

                                     Lustre    Alluxio(warm)   Memory    Theoretical peak
  (Preemption) 2500 steps,
  images/s per GPU                   236.9     601.8           N/A       768.9
  (Exclusive) 4000 steps,
  images/s per GPU                   247.2     702.6           699.1     765.9
  E2E time                           50 min    20 min          15 min    -

  Speed-up: 2.5x
• Notes:
  • Lustre: load data directly from the distributed storage system
  • Alluxio (warm): data pre-warmed into the cache
  • Memory: load data directly from memory
  • Theoretical peak: load synthetic data

18. Experiment #3: Large Files
• Test details:
  • Optical character recognition (OCR) application
  • LMDB size: 125 GB
  • Cached in memory
• Three experiments:
  • Read the LMDB from Lustre
  • Read the LMDB from Alluxio without pre-warming
  • Read the LMDB from Alluxio with pre-warming

19. Large Files: Speed Comparison
• Test result: 30x speed-up with all data in memory

20. Large Files: Bandwidth and GPU Usage Comparison
• GPU usage: up 31.5%
• Lustre I/O bandwidth: 1300 Mb/s -> 0 and 15.6 Mb/s -> 0 across the comparison runs (reads served from the cache)

21. Results Summary
• Fluid with the Alluxio engine greatly speeds up different types of datasets
• Fluid with the Alluxio engine increases the GPU usage of tasks
• Fluid with the Alluxio engine significantly reduces the pressure on the Lustre distributed file system

22. Contributions to Fluid

23. Conclusion
• Alluxio: high-performance architecture
• Fluid customizes and optimizes Alluxio parameters for the datasets of different scenarios
• Why Alluxio integrates well:
  • Data is immutable throughout the deep learning life cycle
  • Deep learning needs deterministic jobs
• Why Fluid integrates well:
  • Locality-based scheduling is needed on top of Lustre
  • Cache sharing between different tasks is needed

24. Next Steps
• More scenarios for Fluid with the Alluxio engine
• More best-practice sharing
• Continuous content contributions & community activities
• ……

25. Just for Smart Life
Thanks