Improving TensorFlow* Deep Learning Inference Performance on Intel® Xeon® Scalable Platforms

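This course covers how Intel® Optimization for TensorFlow* works, practical methods for optimizing AI inference, performance improvement case studies, and how to obtain and install Intel® Optimization for TensorFlow*.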

1. Zhang, Jianyu, AI Technical Consulting Engineer, Intel Architecture, Graphics and Software

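2. Legal Notices
Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software, or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer, or learn more at intel.cn.
Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.cn/content/www/cn/zh/benchmarks/intel-product-performance.html.
Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information, visit http://www.intel.com/content/www/cn/zh/benchmarks/intel-product-performance.html.
Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.
This document contains information on products, services, and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications, and roadmaps.
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
Statements in this document that refer to Intel's plans and expectations for the quarter, the year, and the future are forward-looking statements that involve a number of risks and uncertainties. A detailed discussion of the factors that could affect Intel's results and plans is included in Intel's SEC filings, including the annual report on Form 10-K.
All products, computer systems, dates, and figures specified are preliminary based on current expectations and are subject to change without notice. The products described may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.
© Intel Corporation. Intel, the Intel logo, the Intel Inside mark and logo, Arria, Cyclone, Intel Xeon, MAX, Quartus, and Stratix are trademarks of Intel Corporation in the U.S. and/or other countries.
Altera, Arria, Cyclone, Enpirion, Max, Megcore, Nios, Quartus, and Stratix words and logos are trademarks of Altera and are registered in the U.S. Patent and Trademark Office and in other countries.
Refer to software.intel.com/articles/optimization-notice for more information regarding performance & optimization choices in Intel software products. Copyright © , Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.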

3. Agenda
• Optimization of TensorFlow* for Intel® Xeon® Scalable Platforms
• Inference Performance Optimization on TensorFlow*
• Getting and Installing Intel® Optimization for TensorFlow*

4. Software Optimization on Intel® Platforms
Application: Application of AI
AI framework: Intel® Optimization for TensorFlow*
Tool & Lib: ICC, Intel® OpenMP*, Intel® TBB, oneMKL, oneDNN
Hardware (CPU): Intrinsics: Intel® SSE, Intel® AVX, Intel® AVX2, Intel® AVX-512; Multiple Cores

5. Optimization of TensorFlow* for Intel® Xeon® Scalable Platforms
Multiple Threads
▪ Multiple cores
Vectorization
▪ AVX-512
Graph Optimization
▪ Example: Fuse: Conv2d + Relu, Conv2d + Batch Normalization
TensorFlow* can be powered by Intel's highly optimized math routines for deep learning tasks. This primitives library is called Intel® oneAPI Deep Neural Network Library (Intel® oneDNN).
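The Conv2d + Batch Normalization fusion listed above rests on a standard algebraic identity: at inference time batch normalization is a fixed per-channel affine transform, so it can be folded into the weights and bias of the preceding layer. The sketch below checks that identity for a single output channel with scalar weights. The function names and numbers are illustrative assumptions, and it does not describe how oneDNN or TensorFlow* implements the fusion internally.

```python
import math

def conv_then_bn(x, w, b, gamma, beta, mean, var, eps=1e-3):
    """Unfused reference: a 1x1 'convolution' (w*x + b) followed by batch norm."""
    y = w * x + b
    return gamma * (y - mean) / math.sqrt(var + eps) + beta

def fold_bn(w, b, gamma, beta, mean, var, eps=1e-3):
    """Fold the batch-norm statistics into the weight and bias."""
    scale = gamma / math.sqrt(var + eps)
    return w * scale, (b - mean) * scale + beta

# Illustrative values only (not taken from the slides).
w, b = 0.8, 0.1
gamma, beta, mean, var = 1.5, -0.2, 0.3, 4.0

w_f, b_f = fold_bn(w, b, gamma, beta, mean, var)
for x in (-1.0, 0.0, 2.5):
    unfused = conv_then_bn(x, w, b, gamma, beta, mean, var)
    fused = w_f * x + b_f  # one multiply-add instead of two passes over the data
    assert math.isclose(unfused, fused, rel_tol=1e-9, abs_tol=1e-12)
```

Because the fused layer does one multiply-add where the unfused pair does two passes, the intermediate tensor never has to be written to and re-read from memory.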

6. Intel® oneDNN
Deep Learning Ecosystem
Intel® oneDNN: Open Source (C, C++ API)
• Support AI ecosystem on Intel Architecture (IA)
• Max Intel HW performance with AVX-2/AVX-512 instructions
• https://github.com/intel/mkl-dnn
Workloads:
Image recognition: AlexNet, VGG, GoogleNet, ResNet, MobileNet
Image segmentation: FCN, SegNet, MaskRCNN, U-Net
Volumetric segmentation: 3D-Unet
Object detection: SSD, Faster R-CNN, Yolo
Neural Machine Translation (experimental): GNMT
Speech Recognition (experimental): DeepSpeech
Adversarial Networks: DCGAN, 3DGAN
Reinforcement Learning: A3C
[Figure: Deep Learning Ecosystem diagram with the labels BigDL, Intel® oneDNN, and Intel Processors]
Intel libraries as path to bring optimized ML/DL frameworks to Intel hardware
Copyright © 2019, Intel Corporation.

7. TensorFlow* Optimized for Intel® Xeon® Scalable Platforms: Standard TensorFlow* API
• Native API
• Slim API
• Keras API
https://www.intel.com/content/www/us/en/artificial-intelligence/posts/improving-tensorflow-inference-performance-on-intel-xeon-processors.html?wapkw=tensorflow

8. Inference Performance Optimization on TensorFlow*: Maximum Throughput and Latency
• Throughput
Process as many images per second as possible by passing in batches of size > 1. Exercise all the physical cores on a socket in a parallel, vectorized fashion.
• Latency
Process a single image as fast as possible. Avoid penalties from excessive thread launching and from orchestration between concurrent processes.
[Figure: two panels labeled "Latency Good" and "Throughput Good"]
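To make the two metrics concrete, the following minimal sketch times a callable and reports both numbers for a given batch size. `fake_inference` is a hypothetical stand-in for a real model call (a fixed overhead plus a per-image cost), and its timings are invented for illustration only.

```python
import time

def measure(run_batch, batch_size, iterations=20):
    """Return (mean latency per batch in seconds, throughput in images/second)."""
    start = time.perf_counter()
    for _ in range(iterations):
        run_batch(batch_size)
    elapsed = time.perf_counter() - start
    latency = elapsed / iterations
    throughput = batch_size * iterations / elapsed
    return latency, throughput

def fake_inference(batch_size):
    """Hypothetical stand-in for a model call: fixed overhead plus per-image cost."""
    time.sleep(0.002 + 0.0005 * batch_size)

for bs in (1, 8, 32):
    latency, throughput = measure(fake_inference, bs, iterations=5)
    print(f"batch={bs:3d}  latency={latency * 1e3:7.2f} ms  throughput={throughput:8.1f} img/s")
```

With a real model, `run_batch` would wrap the inference call, and a few warm-up iterations would normally be discarded before timing.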

9. Inference Performance Optimization on TensorFlow*: Runtime Environment Settings
• intra_op_parallelism_threads
• inter_op_parallelism_threads
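As a hedged illustration of these two settings: `intra_op_parallelism_threads` bounds the threads used inside a single operation, and `inter_op_parallelism_threads` bounds how many independent operations run concurrently. The helper names below are hypothetical. `apply_to_tensorflow` assumes TensorFlow 2.x and its `tf.config.threading` functions, and it is defined but not called, so the sketch runs without TensorFlow installed. In TensorFlow 1.x the same two fields are set on `tf.ConfigProto` instead.

```python
def thread_settings(physical_cores, inter_op=1):
    """Build a dict of the two TensorFlow threading knobs named on the slide."""
    if physical_cores < 1 or inter_op < 1:
        raise ValueError("thread counts must be positive")
    return {
        "intra_op_parallelism_threads": physical_cores,
        "inter_op_parallelism_threads": inter_op,
    }

def apply_to_tensorflow(settings):
    """Apply the settings; assumes TensorFlow 2.x is installed (not run here)."""
    import tensorflow as tf  # imported lazily so the sketch runs without TensorFlow
    tf.config.threading.set_intra_op_parallelism_threads(
        settings["intra_op_parallelism_threads"])
    tf.config.threading.set_inter_op_parallelism_threads(
        settings["inter_op_parallelism_threads"])

# Example values chosen for illustration: 56 intra-op threads, 2 inter-op threads.
settings = thread_settings(physical_cores=56, inter_op=2)
print(settings)
```

The best values depend on the model and the machine, so they are worth sweeping rather than fixing up front.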

10. Inference Performance Optimization on TensorFlow*: Non-Uniform Memory Access (NUMA)
numactl parameters: -C core-range -m N (N = 0, 1, ... socket)
For example:
numactl -C 0-25 -m 0 python
(Use CPU cores 0, 1, 2, ..., 25 and bind memory to the socket 0 CPU)
intra_op_parallelism_threads = number of local cores in the numactl setting
Multiple streams
For example (HW: 2 sockets and 26 cores per CPU):
numactl -C 0-25 -m 0 python &
numactl -C 26-51 -m 1 python &
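The multiple-stream pattern above can be generated programmatically. This sketch builds one `numactl` argument list per socket. It assumes, as in the slide's example, that core IDs are numbered contiguously within each socket; on a real machine the layout should be checked first (for example with `lscpu` or `numactl --hardware`). The function name is hypothetical.

```python
def numactl_commands(sockets, cores_per_socket, program=("python",)):
    """Return one numactl argument list per socket (cores and memory kept local)."""
    commands = []
    for socket in range(sockets):
        first = socket * cores_per_socket
        last = first + cores_per_socket - 1
        commands.append(
            ["numactl", "-C", f"{first}-{last}", "-m", str(socket), *program])
    return commands

# Reproduces the slide's example: 2 sockets, 26 cores per CPU.
for cmd in numactl_commands(sockets=2, cores_per_socket=26):
    print(" ".join(cmd), "&")
```

Each list can be passed directly to `subprocess.Popen` to launch the streams side by side.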

11. Performance Optimization on TensorFlow*: Tuning the oneDNN Runtime Environment
Intel oneDNN utilizes OpenMP on Intel architecture. OpenMP environment variables:
• KMP_AFFINITY
Recommend: export KMP_AFFINITY=granularity=fine,compact,1,0
• KMP_BLOCKTIME
Recommend: export KMP_BLOCKTIME=0 (or 1)
• OMP_NUM_THREADS
Recommend: export OMP_NUM_THREADS=<number of physical cores>
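The same recommendations can be applied from inside a Python launcher instead of the shell. In this sketch the helper name is hypothetical and the core count of 28 is only an example value. Environment variables like these are typically read when the OpenMP runtime initializes, so the assumption here is that they are set before TensorFlow* is imported.

```python
import os

def onednn_openmp_env(physical_cores, blocktime=0):
    """Return the OpenMP variables recommended on the slide as a dict of strings."""
    return {
        "KMP_AFFINITY": "granularity=fine,compact,1,0",
        "KMP_BLOCKTIME": str(blocktime),
        "OMP_NUM_THREADS": str(physical_cores),
    }

# Set the variables before TensorFlow is imported so the OpenMP runtime sees them.
os.environ.update(onednn_openmp_env(physical_cores=28))
print({k: os.environ[k] for k in ("KMP_AFFINITY", "KMP_BLOCKTIME", "OMP_NUM_THREADS")})
```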

12. Performance Optimization on TensorFlow*: Tuning the Execution Strategy
• Batch size
Small batch size = Low latency
Big batch size = High throughput
• Multiple Streams
Concurrent execution on multiple cores.
Best performance depends on the model, batch size, core allocation, and memory locality parameters.
[Figure: timelines T1 (Batch Size = 1), T2 (Batch Size = 2), and T3 (Stream Num = 2)]
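One way to picture the core allocation behind multiple streams is to divide a socket's cores into contiguous ranges, one per stream. The sketch below is an illustrative assumption about how such a split could be computed; the function name and the 26-core socket are hypothetical, and whether a given split performs best still has to be measured.

```python
def split_cores(total_cores, num_streams):
    """Split core IDs 0..total_cores-1 into contiguous, near-equal ranges."""
    if not 1 <= num_streams <= total_cores:
        raise ValueError("need 1 <= num_streams <= total_cores")
    base, extra = divmod(total_cores, num_streams)
    ranges, start = [], 0
    for i in range(num_streams):
        size = base + (1 if i < extra else 0)  # spread any remainder over the first streams
        ranges.append((start, start + size - 1))
        start += size
    return ranges

# Stream Num = 2 on a hypothetical 26-core socket.
for first, last in split_cores(26, 2):
    print(f"stream on cores {first}-{last}: intra_op_parallelism_threads={last - first + 1}")
```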

13. Performance Optimization on TensorFlow*: Acceleration with Quantization
                       FP32 (32 bits)   INT8 (8 bits)   Benefit (theoretical)
Memory and bandwidth   4                1               4x
Computing (MulAdd)     4                1               4x
Precision              100              <100            99% (example)
Intel® Deep Learning Boost: VNNI on Cascade Lake Xeon CPU
Limitations:
• Precision reduction
• Need calibration dataset
• Conversion between FP32 and INT8
• Depends on the model
[Figure: operator chains that mix FP32 and INT8 operations]
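The role of the calibration dataset and the source of the precision loss can be seen in a small worked example. The sketch below uses symmetric linear quantization, one common scheme chosen here as an assumption; it is not claimed to be the scheme used by Intel's tools or by oneDNN. All names and data values are illustrative.

```python
def calibrate_scale(calibration_values):
    """Pick a symmetric INT8 scale from the largest magnitude seen in calibration."""
    max_abs = max(abs(v) for v in calibration_values)
    if max_abs == 0:
        raise ValueError("calibration data is all zeros")
    return max_abs / 127.0

def quantize(values, scale):
    """FP32 -> INT8: round to the nearest step and clamp to [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]

def dequantize(q_values, scale):
    """INT8 -> FP32 approximation."""
    return [q * scale for q in q_values]

calibration = [-2.0, -0.5, 0.0, 0.75, 1.27]   # illustrative data, not from the slides
scale = calibrate_scale(calibration)
q = quantize(calibration, scale)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(calibration, restored))
print(q, f"max round-trip error = {max_error:.5f}")
assert max_error <= scale / 2 + 1e-12  # in-range values lose at most half a step
```

Values outside the calibrated range are clamped, which is why a calibration set that represents real inputs matters.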

14. Inference Performance Optimization on TensorFlow*: Reference Documents
Maximize TensorFlow* Performance on CPU: Considerations and Recommendations for Inference Workloads
Link: https://software.intel.com/en-us/articles/maximize-tensorflow-performance-on-cpu-considerations-and-recommendations-for-inference
GeneralBestPractices.md
Link: https://github.com/IntelAI/models/blob/master/docs/general/tensorflow/GeneralBestPractices.md

15. TensorFlow* Optimized for Intel® Xeon® Platforms
Unoptimized TensorFlow* may not exploit the best performance from Intel CPUs.
[Figure: two panels titled "TRAINING THROUGHPUT" and "INFERENCE THROUGHPUT"; headline figures as extracted, in slide order: 14X, 3.2X, Up to 198x]
Intel Optimization for TensorFlow* ResNet50 training performance compared to default TensorFlow* for CPU
Intel Optimization for TensorFlow* InceptionV3 inference throughput compared to default TensorFlow* for CPU
System configuration:
CPU: Thread(s) per core: 2; Core(s) per socket: 28; Socket(s): 2; NUMA node(s): 2; CPU family: 6; Model: 85; Model name: Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz; Stepping: 4; HyperThreading: ON; Turbo: ON
Memory: 376GB (12 x 32GB), 24 slots, 12 occupied, 2666 MHz
Disks: Intel RS3WC080 x 3 (800GB, 1.6TB, 6TB)
BIOS: SE5C620.86B.00.01.0004.071220170215
OS: Centos Linux 7.4.1708 (Core), Kernel 3.10.0-693.11.6.el7.x86_64
TensorFlow* source: https://github.com/tensorflow/tensorflow
TensorFlow* commit ID: 926fc13f7378d14fa7980963c4fe774e5922e336
TensorFlow* benchmarks: https://github.com/tensorflow/benchmarks
Model         Data_format   Intra_op   Inter_op   OMP_NUM_THREADS   KMP_BLOCKTIME
VGG16         NCHW          56         1          56                1
InceptionV3   NCHW          56         2          56                1
ResNet50      NCHW          56         2          56                1
For more complete information on performance tests, visit http://www.intel.com/performance. Copyright © 2018, Intel Corporation.

16. Getting and Installing Intel® Optimization for TensorFlow*
• Pip install
pip install intel-tensorflow
• Anaconda
Linux: conda install tensorflow or conda install tensorflow -c intel
MacOS: conda install tensorflow
Windows: conda install tensorflow-mkl
• Docker image
docker pull gcr.io/deeplearning-platform-release/tf-cpu.1-14
docker pull docker.io/intelaipg/intel-optimized-tensorflow
• Build from Source Code

17. Getting and Installing Intel® Optimization for TensorFlow*: Reference Documents
Intel® Optimization for TensorFlow* Installation Guide
Link: https://software.intel.com/en-us/articles/intel-optimization-for-tensorflow-installation-guide

18. Q&A