Seamless and Scalable Hardware Offloads in the Cloud

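In this presentation, we demonstrate how to seamlessly accelerate Spark jobs using hardware acceleration in Azure, with RDMA (Remote Direct Memory Access) support in VMs. We present benchmark and real-application use cases that achieve impressive performance improvements with minimal configuration.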

1. Accelerated Spark on Azure: Seamless and Scalable Hardware Offloads in the Cloud
Yuval Degani, Mellanox Technologies
Evan Burness, Microsoft Azure
#HWCSAIS18

2. • End-to-end designer and supplier of interconnect solutions: network adapters, switches, system-on-a-chip, cables, silicon and software
• 10-400 Gb/s Ethernet and InfiniBand
[Figure: Virtual Protocol Interconnect. Server/Compute and Storage Front/Backend connect through a Switch/Gateway over 56/100/200G InfiniBand or 10/25/40/50/100/200/400GbE.]

3. • RDMA-capable network, powered by Mellanox
• H-series (Intel CPUs with FDR InfiniBand)
• NC-series (Nvidia GPUs with FDR InfiniBand)
• Only major cloud provider with RDMA
• Run simulation and AI workloads at large scale
• Dozens of RDMA clusters around the world

4. Why are we here?
• Azure hardware accelerated networks will soon support general-purpose RDMA (on top of SR-IOV)
• SparkRDMA Shuffle Plugin (appeared at Spark Summit Europe 2017) can now be used in the cloud, providing instant speedups for Spark jobs
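As a rough illustration of what using the shuffle plugin involves, the sketch below assembles `spark-submit` arguments that swap in an RDMA shuffle manager. None of it comes from the slides: the class name `org.apache.spark.shuffle.rdma.RdmaShuffleManager` and the two classpath properties follow the SparkRDMA project README as best recalled, and the jar path and application name are placeholders. Verify all of them against the plugin release you actually deploy.

```python
import shlex

# Placeholder path (assumption): point this at the SparkRDMA jar you deployed.
SPARK_RDMA_JAR = "/opt/spark-rdma/spark-rdma.jar"


def rdma_shuffle_conf(jar_path):
    """Return Spark properties that select an RDMA shuffle manager.

    Assumption: the class name and classpath keys follow the SparkRDMA README.
    """
    return {
        "spark.driver.extraClassPath": jar_path,
        "spark.executor.extraClassPath": jar_path,
        "spark.shuffle.manager": "org.apache.spark.shuffle.rdma.RdmaShuffleManager",
    }


def spark_submit_command(app, conf):
    """Render a spark-submit command line with one --conf flag per property."""
    args = ["spark-submit"]
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    args.append(app)
    return " ".join(shlex.quote(a) for a in args)


if __name__ == "__main__":
    # "my_job.py" is a hypothetical application name.
    print(spark_submit_command("my_job.py", rdma_shuffle_conf(SPARK_RDMA_JAR)))
```

The application code itself stays unchanged in this sketch; only the shuffle manager property and the classpath differ from a default submission.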

5. What’s RDMA?
• Remote Direct Memory Access: read/write from/to remote memory locations
• Zero-copy
• Direct hardware interface: bypasses the OS kernel and TCP/IP in the IO path
• Flow control and reliability are offloaded in hardware
• Sub-microsecond latency
• Supported on almost all mid-range/high-end network adapters
[Figure: a Java app buffer reaching the network adapter by two paths. The socket path crosses a context switch, sockets, TCP/IP and the driver; the RDMA path goes directly to the network adapter.]

6. RDMA on Azure
• No need to buy expensive hardware
• Lowest latency in the cloud (~2.5 µs)
• Pre-built OS images for easy deployment
• K80, P100, and V100 GPUs with InfiniBand
• Other use cases for RDMA on Azure:

7. RDMA on Azure
Azure accelerated networking is built on top of SR-IOV (Single Root Input/Output Virtualization) hardware support provided by Mellanox ConnectX network cards

8. Under the hood: Spark’s Shuffle Internals

9-18. Spark’s Shuffle Basics
[Figure, built up across ten animation frames: the input is split across map tasks; each map task writes its map output to a file; the driver sits between the map and reduce stages; each reduce task fetches blocks from the map output files.]
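The data flow in the Shuffle Basics figure can be mimicked in a few lines of plain Python. This is a toy model, not Spark code: the function names, the in-memory dicts standing in for map output files, and the byte-sum partitioner are all assumptions made for illustration.

```python
from collections import defaultdict


def partition_for(key, num_reducers):
    """Toy partitioner: deterministic for string keys, unlike built-in hash()."""
    return sum(key.encode("utf-8")) % num_reducers


def run_map_task(records, num_reducers):
    """Map side: bucket (key, value) records into one block per reduce partition.

    The returned dict stands in for a map output file.
    """
    blocks = defaultdict(list)
    for key, value in records:
        blocks[partition_for(key, num_reducers)].append((key, value))
    return dict(blocks)


def run_reduce_task(reduce_id, map_outputs):
    """Reduce side: fetch this partition's block from every map output, sum per key."""
    totals = defaultdict(int)
    for output in map_outputs:
        for key, value in output.get(reduce_id, []):
            totals[key] += value
    return dict(totals)


def shuffle(input_splits, num_reducers):
    """Run all map tasks, then all reduce tasks, and merge the reduce results."""
    map_outputs = [run_map_task(split, num_reducers) for split in input_splits]
    results = {}
    for reduce_id in range(num_reducers):
        results.update(run_reduce_task(reduce_id, map_outputs))
    return results


if __name__ == "__main__":
    splits = [[("a", 1), ("b", 1)], [("a", 1), ("c", 1)], [("b", 1), ("a", 1)]]
    print(shuffle(splits, num_reducers=2))
```

In this model every reduce task visits every map output, so M map tasks and R reduce tasks imply up to M × R block fetches.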

19-28. Spark’s Shuffle Read Protocol
[Figure, built up across ten animation frames: Driver, Reader and Writer exchanging the numbered steps below during a shuffle read.]
(1) Reader: request Map Statuses from the driver
(2) Driver: send back Map Statuses
(3) Reader: group block locations by writer
(4) Reader: request blocks from writers
(5) Writer: locate blocks, and set up as a stream
(6) Reader: request blocks from the stream, one by one
(7) Writer: locate block, send back
(8) Reader: block data is now ready
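The eight numbered steps of the shuffle read protocol can be traced with a small in-process model. It is a sketch under stated assumptions, not Spark's implementation: the `Driver`, `Writer` and `shuffle_read` names, the method signatures, and the use of direct method calls in place of network messages are all hypothetical.

```python
from collections import defaultdict


class Driver:
    """Holds map statuses: for each shuffle, which writer owns which block."""

    def __init__(self):
        self.map_statuses = defaultdict(list)  # shuffle_id -> [(writer_id, block_id)]

    def register(self, shuffle_id, writer_id, block_id):
        self.map_statuses[shuffle_id].append((writer_id, block_id))

    def get_map_statuses(self, shuffle_id):
        # Step 2: send back map statuses.
        return list(self.map_statuses[shuffle_id])


class Writer:
    """Serves shuffle blocks it wrote earlier."""

    def __init__(self, blocks):
        self.blocks = blocks  # block_id -> bytes
        self.streams = {}     # stream_id -> ordered block ids

    def open_stream(self, block_ids):
        # Step 5: locate the requested blocks and set them up as a stream.
        stream_id = len(self.streams)
        self.streams[stream_id] = list(block_ids)
        return stream_id

    def fetch_chunk(self, stream_id, index):
        # Step 7: locate one block of the stream and send it back.
        return self.blocks[self.streams[stream_id][index]]


def shuffle_read(shuffle_id, driver, writers):
    """Reader side of the protocol; returns block_id -> block data."""
    statuses = driver.get_map_statuses(shuffle_id)  # steps 1-2
    by_writer = defaultdict(list)                   # step 3: group by writer
    for writer_id, block_id in statuses:
        by_writer[writer_id].append(block_id)
    fetched = {}
    for writer_id, block_ids in by_writer.items():
        writer = writers[writer_id]
        stream_id = writer.open_stream(block_ids)   # steps 4-5
        for index, block_id in enumerate(block_ids):
            # Steps 6-7: one request and one response per block.
            fetched[block_id] = writer.fetch_chunk(stream_id, index)
    return fetched                                  # step 8: block data is ready


if __name__ == "__main__":
    driver = Driver()
    writers = {"w0": Writer({"b0": b"x", "b1": b"y"}), "w1": Writer({"b2": b"z"})}
    for writer_id, block_id in [("w0", "b0"), ("w1", "b2"), ("w0", "b1")]:
        driver.register(7, writer_id, block_id)
    print(shuffle_read(7, driver, writers))
```

The inner loop makes the per-block request/response pairs of steps 6 and 7 visible: in this model the reader makes one call to the driver, one stream setup per writer, and one further round trip per block.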

29. The Cost of Shuffling
• Shuffling is very expensive in terms of CPU, RAM, disk and network I/O
• Spark users try to avoid shuffles as much as they can
• Speedy shuffles can relieve developers of such concerns, and simplify applications
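One common way developers work around shuffle cost is to aggregate on the map side before data crosses the network (in Spark, `reduceByKey` combines map-side where `groupByKey` does not). The toy count below is an illustration added here, not something from the slides; the function names and the word-count workload are assumptions.

```python
from collections import Counter


def shuffled_records_without_combine(partitions):
    """Every (word, 1) pair crosses the shuffle: one record per input word."""
    return sum(len(partition) for partition in partitions)


def shuffled_records_with_combine(partitions):
    """Each partition pre-aggregates, so it emits one record per distinct word."""
    return sum(len(Counter(partition)) for partition in partitions)


if __name__ == "__main__":
    partitions = [["a", "a", "b"], ["a", "a", "a"]]
    print(shuffled_records_without_combine(partitions))
    print(shuffled_records_with_combine(partitions))
```

This kind of restructuring is the application-level effort that a faster shuffle would make less necessary.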