Migration and Optimization of Deep Learning Models to Intel Habana Accelerator Cards
- Overview of the Habana AI software solution
- PyTorch model migration examples
- PyTorch mixed-precision training examples
- Multi-card/multi-node scaled training examples
- Performance optimization

Letian Kang, AI Software Solutions Engineer at Intel, working on high-performance computing, heterogeneous computing, and AI solutions.
1. Training with Habana Accelerator Card
Letian Kang
2. Contents
❑ General overview
❑ Migrating PyTorch Models to HPU
❑ PyTorch Mixed Precision Training on HPU
❑ PyTorch Distributed Training with HPU
❑ Profiling & Performance Optimization
4. General overview – SynapseAI® Software Suite
❑ Install with apt-get install or yum install:
❑ habanalabs-dkms – installs the PCIe driver
❑ habanalabs-thunk
❑ habanalabs-firmware
❑ habanalabs-graph
❑ habanalabs-firmware-tools
❑ habanalabs-container-runtime
❑ habanalabs-qual
❑ Or get via Docker image:
❑ docker pull vault.habana.ai/gaudi-docker/1.6.1/{$OS}/habanalabs/tensorflow-installer-tf-cpu-${TF_VERSION}:latest
❑ docker pull vault.habana.ai/gaudi-docker/1.6.1/{$OS}/habanalabs/pytorch-installer-1.12.0:latest
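After installing either way, a quick sanity check is to query the device from Python; a minimal sketch, assuming the habana_frameworks package from the steps above is present:

import habana_frameworks.torch as ht

print(ht.hpu.is_available())   # True once the driver and SynapseAI stack are set up
print(ht.hpu.device_count())   # number of visible Gaudi devices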
5. General overview – Supported Frameworks
❑ TensorFlow
https://github.com/HabanaAI/Setup_and_Install/blob/main/installation_scripts/TensorFlow/tensorflow_installation.sh
❑ PyTorch
https://github.com/HabanaAI/Setup_and_Install/blob/main/installation_scripts/PyTorch/pytorch_installation.sh
❑ PyTorch Lightning
https://pytorch-lightning.readthedocs.io/en/latest/accelerators/hpu.html
❑ Hugging Face Optimum-Habana
https://huggingface.co/docs/optimum/habana_index
6. General overview – Orchestration Solutions
❑ Docker
❑ Kubernetes
❑ OpenShift (OCP)
❑ VMware Tanzu
❑ AWS
https://docs.habana.ai/en/latest/Orchestration/index.html
8. Migrating PyTorch Models to HPU – Key Concepts
❑ Eager mode – op-by-op execution, as in standard PyTorch eager-mode scripts.
❑ Lazy mode – deferred execution of ops: ops delivered from the script accumulate into a graph until a "mark_step" is reached, then the graph is built and launched as a whole.
[Figure: timeline comparing the two modes. In eager mode, each op A–H pays a separate launch latency on the CPU before running on the HPU. In lazy mode, ops are batched into graphs that are built and launched together, saving time. Graph execution is triggered by habana_frameworks.torch.core.mark_step(), by torch.tensor.to("cpu"), etc.]
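A minimal sketch of a lazy-mode training step (the model, shapes, and hyperparameters are illustrative, not from the slides): ops accumulate until mark_step() flushes them to the HPU.

import torch
import habana_frameworks.torch.core as htcore  # registers the "hpu" device, provides mark_step()

model = torch.nn.Linear(128, 10).to("hpu")
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(32, 128, device="hpu")
y = torch.randint(0, 10, (32,), device="hpu")

loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()
htcore.mark_step()   # flush the accumulated backward graph
optimizer.step()
htcore.mark_step()   # flush the optimizer update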
9. Migrating PyTorch Models to HPU – Base Code
https://github.com/pytorch/examples/tree/main/imagenet
https://github.com/JoursBleu/resnet_torch_habana
10. Migrating PyTorch Models to HPU – Eager Run
https://github.com/pytorch/examples/tree/main/imagenet
https://github.com/JoursBleu/resnet_torch_habana
11. Migrating PyTorch Models to HPU – Lazy Run
https://github.com/pytorch/examples/tree/main/imagenet
https://github.com/JoursBleu/resnet_torch_habana
12. Migrating PyTorch Models to HPU – Summary
1. import habana_frameworks.torch
2. model.to("hpu") and input.to("hpu")
3. (Lazy mode) add mark_step
14. PyTorch Mixed Precision Training on HPU – hmp.convert
❑ Turn on mixed-precision training support:
from habana_frameworks.torch.hpex import hmp
hmp.convert(opt_level="O1", bf16_file_path="", fp32_file_path="", isVerbose=False)
❑ Conversion rules:
❑ Two lists are maintained: (i) BF16, (ii) FP32.
❑ Any op not in these two lists runs with the precision type of its first input.
❑ For ops with multiple tensor inputs, all inputs are cast to the widest precision type among the input precision types.
❑ For in-place ops, all inputs are cast to the precision type of the first input.
15. PyTorch Mixed Precision Training on HPU – hmp.convert
❑ Turn on mixed-precision training support:
from habana_frameworks.torch.hpex import hmp
hmp.convert(opt_level="O1", bf16_file_path="", fp32_file_path="", isVerbose=False)
❑ opt_level
❑ O1 (default):
Default BF16 list = [addmm, bmm, conv1d, conv2d, conv3d, dot, mm, mv]
Default FP32 list = [batch_norm, cross_entropy, log_softmax, softmax, nll_loss, topk]
❑ O2: only GEMM and convolution type ops (e.g. conv1d, conv2d, conv3d, addmm, mm, bmm, mv, dot) run in BF16; all other ops run in FP32.
16. PyTorch Mixed Precision Training on HPU – hmp.convert
❑ Turn on mixed-precision training support:
from habana_frameworks.torch.hpex import hmp
hmp.convert(opt_level="O1", bf16_file_path="", fp32_file_path="", isVerbose=False)
❑ bf16_file_path=<.txt> and fp32_file_path=<.txt> – paths to text files with custom BF16/FP32 op lists
❑ isVerbose – enable verbose logs
https://github.com/JoursBleu/resnet_torch_habana
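A minimal sketch of enabling HMP with custom op lists; the file names ops_bf16.txt and ops_fp32.txt are hypothetical examples, each containing one op name per line:

import torch
import habana_frameworks.torch  # registers the "hpu" device
from habana_frameworks.torch.hpex import hmp

hmp.convert(opt_level="O1",
            bf16_file_path="ops_bf16.txt",   # hypothetical file: ops forced to BF16
            fp32_file_path="ops_fp32.txt",   # hypothetical file: ops kept in FP32
            isVerbose=True)                  # log every cast decision

model = torch.nn.Linear(128, 64).to("hpu")
out = model(torch.randn(32, 128, device="hpu"))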
17. PyTorch Mixed Precision Training on HPU – hmp.disable_casts
❑ Any segment of the script (e.g. the optimizer) in which you want to avoid mixed precision should be kept under the following Python context:
from habana_frameworks.torch.hpex import hmp
with hmp.disable_casts():
    # code line 1
    # code line 2
    # ...
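A hedged sketch of this pattern, keeping the optimizer update out of mixed precision; the model and data are illustrative:

import torch
import habana_frameworks.torch  # registers the "hpu" device
from habana_frameworks.torch.hpex import hmp

hmp.convert()  # default O1 mixed precision

model = torch.nn.Linear(128, 10).to("hpu")
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(32, 128, device="hpu")
y = torch.randint(0, 10, (32,), device="hpu")

loss = torch.nn.functional.cross_entropy(model(x), y)  # runs under mixed precision
loss.backward()

with hmp.disable_casts():
    optimizer.step()  # executed without BF16 casts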
18. PyTorch Mixed Precision Training on HPU
https://github.com/pytorch/examples/tree/main/imagenet
https://github.com/JoursBleu/resnet_torch_habana
19. PyTorch Mixed Precision Training on HPU – Summary
1. from habana_frameworks.torch.hpex import hmp
2. hmp.convert
3. (Optional) Add bf16_file_path & fp32_file_path
4. (Optional) hmp.disable_casts
21. PyTorch Distributed Training with HPU
Step 1: follow the official PyTorch docs to make the model distributed.
https://pytorch.org/docs/stable/notes/ddp.html
Step 2: use the hccl backend.
import habana_frameworks.torch.distributed.hccl
torch.distributed.init_process_group(backend='hccl', rank=rank, world_size=world_size)
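Putting the two steps together, a minimal sketch; it assumes a launcher (e.g. mpirun or torchrun) has set RANK and WORLD_SIZE in the environment:

import os
import torch
import torch.distributed as dist
import habana_frameworks.torch.distributed.hccl  # registers the "hccl" backend

dist.init_process_group(backend="hccl",
                        rank=int(os.environ["RANK"]),
                        world_size=int(os.environ["WORLD_SIZE"]))

model = torch.nn.Linear(128, 10).to("hpu")
model = torch.nn.parallel.DistributedDataParallel(model)  # gradients sync over HCCL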
22. Summary
Single card – FP32
1. import habana_frameworks.torch
2. to("hpu")
3. (Lazy mode) add mark_step
Single card – BF16
1. from habana_frameworks.torch.hpex import hmp
2. hmp.convert
3. (Optional) hmp.disable_casts
4. (Optional) Add bf16_file_path & fp32_file_path
Distributed
1. import habana_frameworks.torch.distributed.hccl
2. Change the backend of init_process_group to "hccl"
24. Profiling
❑ High-level profiling
Users familiar with TensorBoard who develop with TensorFlow or PyTorch can use the respective framework profiler, much as they would when profiling on CPU (MKL), GPU, or TPU.
❑ Low-level profiling
Users interested in low-level profiling, as well as those writing models directly against SynapseAI® without a framework, can use the SynapseAI Profiling Subsystem.
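For the high-level path, the standard torch.profiler flow applies; a hedged sketch with CPU activity only, since whether an HPU-specific activity is exposed depends on the SynapseAI release:

import torch
from torch.profiler import profile, ProfilerActivity, tensorboard_trace_handler

with profile(activities=[ProfilerActivity.CPU],
             on_trace_ready=tensorboard_trace_handler("./tb_logs")) as prof:
    x = torch.randn(64, 128)
    y = torch.nn.Linear(128, 10)(x)

# Inspect with: tensorboard --logdir ./tb_logs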
25. Performance Optimization
❑ Batch size
❑ PyTorch mixed precision
❑ Gradient buckets in multi-card/multi-node training
https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html
❑ Pinning memory for the DataLoader (see the sketch after this list)
https://pytorch.org/docs/stable/data.html#memory-pinning
❑ Use fused operators
https://docs.habana.ai/en/latest/PyTorch/Model_Optimization_PyTorch/Custom_Ops_PyTorch.html#custom-operators
❑ Use HPU Graph APIs
https://docs.habana.ai/en/latest/PyTorch/PyTorch_User_Guide/Python_Packages.html#hpu-graph-apis
https://docs.habana.ai/en/latest/PyTorch/Model_Optimization_PyTorch/index.html
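A short sketch of two of the knobs above, memory pinning and gradient buckets; the dataset and sizes are illustrative:

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))

# pin_memory=True keeps batches in page-locked host memory so the
# host-to-device copy can run asynchronously.
loader = DataLoader(dataset, batch_size=256, num_workers=4, pin_memory=True)

# In multi-card training, bucket_cap_mb controls how gradients are grouped
# into all-reduce buckets (see the DistributedDataParallel docs linked above):
# model = torch.nn.parallel.DistributedDataParallel(model, bucket_cap_mb=100)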
26. Performance Optimization – HPU Graph
❑ HPU Graph APIs:
❑ capture_begin() – begins capturing HPU work on the current stream.
❑ capture_end() – ends capturing HPU work on the current stream.
❑ replay() – replays the HPU work captured by this graph.

import torch
import habana_frameworks.torch as ht

g = ht.hpu.HPUGraph()
s = ht.hpu.Stream()
with ht.hpu.stream(s):
    g.capture_begin()
    a = torch.full((100,), 1, device="hpu")
    b = a
    b = b + 1
    g.capture_end()
g.replay()
ht.hpu.synchronize()

https://github.com/pytorch/examples/tree/main/imagenet
https://github.com/JoursBleu/resnet_torch_habana
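A follow-up usage sketch under an explicit assumption: like CUDA Graphs, an HPU Graph is assumed here to replay on the same captured tensors, so new inputs are written into the captured tensor in place before replay. This capture-once/replay pattern is not taken from the slides:

import torch
import habana_frameworks.torch as ht

static_in = torch.zeros(100, device="hpu")  # captured input, reused across replays
g = ht.hpu.HPUGraph()
s = ht.hpu.Stream()
with ht.hpu.stream(s):
    g.capture_begin()
    result = static_in * 2 + 1
    g.capture_end()

static_in.copy_(torch.randn(100, device="hpu"))  # assumption: refresh the captured input in place
g.replay()                                       # recompute "result" with the new values
ht.hpu.synchronize()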
27. Performance Optimization – HPU Graph
[Figure: timeline comparing per-iteration graph build and launch with HPU Graph replay. Without HPU Graphs, each iteration rebuilds and relaunches Graph 1 and Graph 2 on the CPU; with HPU Graphs, the graphs are built once and subsequent iterations only pay the launch (replay) cost, saving CPU time.]
28. The End
Thank you for your time!