Ampere AI Acceleration Solution and Practice Sharing – Bin Huang
Bin Huang – Senior Platform Application Engineer, Ampere Computing
Senior Platform Application Engineer at Ampere Computing, focused on optimizing workload performance on Ampere Arm servers.
Talk overview:
Ampere AI is a software solution for accelerating AI and ML on Ampere processors. This talk introduces how to use Ampere AI to quickly improve the performance of CPU-based inference workloads across a variety of use cases, with examples of the resulting gains.
1. Ampere AI Acceleration Solution and Practice Sharing – Bin Huang, Platform Application Engineering
2. Targeted AI Workloads
• Computer Vision
• Natural Language Processing (NLP)
• Generative AI (latest)
• Recommender Engines
3. Training vs. Inference
• AI training is the process of creating an AI model by feeding it a large amount of data and adjusting its parameters until it can make accurate predictions (train model → monitor model → validate model → fine-tune model → deploy model).
• Inference is the process of using that trained AI model to make predictions on new data.
Where AI silicon is deployed:
• The vast majority, ~85%, of AI workloads will be inferencing, either in a data center or at the edge (~15% training; the remainder split roughly evenly, ~40% and ~45%, between data-center and edge inference).*
• Ampere AI brings the best price/performance for GPU-free inferencing and better system efficiency alongside a GPU, making it a perfect fit for inference workloads.
• AI training is better served by high-performance GPU products.
* https://www.techspot.com/news/98879-ai-chip-market-landscape-choose-battles-carefully.html
4. Build the Most Cost-Effective AI Inference with Ampere CPUs
Right-sizing AI compute (performance: higher is better; TCO: lower is better): AI training server – Ampere + A/H100 GPU; AI inference server – Ampere + T4/L4/A10 GPU; GPU-free AI inference – Ampere CPU only.
• Higher efficiency: higher processor utilization, especially for small-batch processing
• More flexible: more flexible software design and deployment across platforms; no dependency on third-party hardware and software
• Lower complexity: simpler OS, drivers, and runtimes
• Higher scalability: seamless integration with other software stacks and microservices architecture paradigms for horizontal scalability
• Higher perf/$: AI model algorithms continue to innovate on CPU, providing throughput close to that of GPUs at a much lower cost
5. Layered Design
• Framework Integration Layer: provides full compatibility with popular developer frameworks. The software works with trained networks "as is"; no conversions or approximations are required.
• Model Optimization Layer: implements techniques such as structural network enhancements, changes to the processing order for efficiency, and data-flow optimizations, without accuracy degradation.
• Hardware Acceleration Layer: includes a just-in-time optimizing compiler that uses a small number of microkernels tuned for Ampere processors. This approach allows the inference engine to deliver high performance and support multiple frameworks.
6. Framework Integration Layer
• Wide variety of frameworks supported: TF, TFLite, ONNX, PyTorch
• Memory management
• Operation fusion
• Graph dispatch
7. Ampere Optimized ONNX Runtime
There are two execution providers for CPU-based inference on Ampere CPUs:
• AIOExecutionProvider – accelerated by the AIO library
• CPUExecutionProvider – default for operations not supported by AIO
Ampere Optimized ONNX Runtime includes:
• A Python wheel installer for the ONNX Runtime framework
• libampere-aio.so, installed via the libampere-aio installer package for either CentOS or Ubuntu
(Software stack: ONNX Runtime framework → AIO EP → AIO dynamic library (libampere-aio.so) → Ampere Altra / Altra Max.)
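A minimal sketch of selecting the providers from Python, assuming the Ampere Optimized ONNX Runtime build in which AIOExecutionProvider is registered (the model file name is a placeholder; otherwise only standard onnxruntime API calls are used):

# Minimal sketch: run an ONNX model with the AIO execution provider first,
# falling back to the default CPU provider for unsupported operations.
# Assumes the Ampere Optimized ONNX Runtime build; "model.onnx" is a placeholder.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "model.onnx",
    providers=["AIOExecutionProvider", "CPUExecutionProvider"],
)

inp = session.get_inputs()[0]
# Build a dummy input from the model's reported shape (dynamic dims -> 1).
shape = [d if isinstance(d, int) else 1 for d in inp.shape]
dummy = np.random.rand(*shape).astype(np.float32)

outputs = session.run(None, {inp.name: dummy})
print(outputs[0].shape)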
8. Ampere Optimized PyTorch
• The AIO graph fuser combines ops supported by AIO into an "AIO FusionGroup". The interpreter delegates the computation of AIO FusionGroups to the AIO engine, running on the CPU. Non-AIO nodes are executed by the default CPU kernels.
Ampere Optimized PyTorch Runtime includes:
• A Python wheel installer for the PyTorch framework
• libampere-aio.so, installed via the libampere-aio installer package for either CentOS or Ubuntu
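Because the fusion happens on the TorchScript graph, ordinary scripted or traced inference code is expected to pick up the acceleration transparently. A minimal sketch using only standard PyTorch/torchvision APIs (the ResNet-50 model is just a sample; the AIO FusionGroup rewriting described above is assumed to happen inside the Ampere-optimized build):

# Minimal sketch: TorchScript inference on CPU. With Ampere Optimized PyTorch,
# the AIO graph fuser is assumed to rewrite supported ops into AIO FusionGroups
# behind the scenes; the user-facing code is plain PyTorch.
import torch
import torchvision.models as models

model = models.resnet50(weights=None).eval()   # sample model, randomly initialized
example = torch.randn(1, 3, 224, 224)

# Trace to TorchScript so the whole graph is visible to the fuser.
scripted = torch.jit.trace(model, example)
scripted = torch.jit.freeze(scripted)

with torch.no_grad():
    out = scripted(example)
print(out.shape)  # torch.Size([1, 1000])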
9. Model Optimization Layer / Hardware Acceleration Layer
Model Optimization Layer:
• Graph
• Tensor shape
• Memory
Hardware Acceleration Layer:
• Kernels tuned for Ampere CPUs
• JIT compilation
• Mixed-precision support (FP16, INT8, BF16 and more) with automatic on-the-fly data type conversion
• Parallelization and vectorization
10. Ampere Optimized Frameworks
The quickest way to use the Ampere Optimized frameworks is via the Docker images:
• PyTorch: https://hub.docker.com/r/amperecomputingai/pytorch
  – Documentation: https://ampereaidevelopus.s3.amazonaws.com/releases/1.8.0/Ampere+Optimized+PyTorch+Documentation+v1.8.0.pdf
• TensorFlow: https://hub.docker.com/r/amperecomputingai/tensorflow
  – Documentation: https://ampereaidevelopus.s3.amazonaws.com/releases/1.8.0/Ampere+Optimized+Tensorflow+Documentation+v.1.8.0.pdf
• ONNX Runtime: https://hub.docker.com/r/amperecomputingai/onnxruntime
  – Documentation: https://ampereaidevelopus.s3.amazonaws.com/releases/1.8.0/Ampere+Optimized+ONNXRuntime+Documentation+v1.8.0.pdf
Binary and package installers: RPM and DEB installers for Ubuntu/CentOS/Oracle Linux, plus TF/PyTorch/ONNX Runtime Python wheel files, are provided to clients as needed.
11. Ampere Model Library
Many models for computer vision, NLP, recommendation, speech recognition, and generative AI are available to run off-the-shelf for evaluation: https://github.com/AmpereComputingAI/ampere_model_library
12. Set Up the Ampere Model Library (AML)
Install Docker and pull the Ampere ONNX Runtime image:
$apt update && apt install docker.io
$docker pull amperecomputingai/onnxruntime:1.8.0
$docker image ls -a
Get the AML source and launch the Ampere Optimized ONNX Runtime Docker container:
$git clone --recursive https://github.com/AmpereComputingAI/ampere_model_library.git
$cd ampere_model_library
$docker run --privileged=true --name my-ampere-onnxruntime -v $PWD/:/aml -it amperecomputingai/onnxruntime:1.8.0
Set up AML inside the container:
$cd /aml
$bash setup_deb.sh
$source set_env_variables.sh
13. Running Computer Vision with AML
Set environment variables:
$cd computer_vision/classification/resnet_50_v1
$export PYTHONPATH=/aml
$export AIO_NUM_THREADS=16
Download the ONNX model:
$wget https://zenodo.org/record/2592612/files/resnet50_v1.onnx
Download images and labels to separate sub-folders (https://www.image-net.org/):
$export IMAGENET_IMG_PATH=/path/to/images
$export IMAGENET_LABELS_PATH=/path/to/labels
Run inference with AIO disabled:
$AIO_PROCESS_MODE=0 OMP_NUM_THREADS=16 python3 run.py -m resnet50_v1.onnx -p fp32 -f ort
Run inference with AIO (FP32):
$AIO_NUM_THREADS=16 python3 run.py -m resnet50_v1.onnx -p fp32 -f ort
Run inference with AIO (implicit FP16):
$AIO_NUM_THREADS=16 AIO_IMPLICIT_FP16_TRANSFORM_FILTER=".*" python3 run.py -m resnet50_v1.onnx -p fp32 -f ort
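The same switches can also be driven from Python rather than the shell. A rough sketch (the environment variable names come from the commands above; exactly when the AIO library reads them is an assumption, so they are set before onnxruntime is imported to be safe):

# Rough sketch: toggling the AIO settings from Python before loading onnxruntime.
# AIO_NUM_THREADS / AIO_IMPLICIT_FP16_TRANSFORM_FILTER are the variables used in
# the shell commands above; setting them before import is a conservative
# assumption about when the AIO library reads them.
import os
import time

os.environ["AIO_NUM_THREADS"] = "16"
os.environ["AIO_IMPLICIT_FP16_TRANSFORM_FILTER"] = ".*"  # comment out for plain FP32

import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("resnet50_v1.onnx")
inp = session.get_inputs()[0]
shape = [d if isinstance(d, int) else 1 for d in inp.shape]
batch = np.random.rand(*shape).astype(np.float32)

for _ in range(5):                       # warm-up
    session.run(None, {inp.name: batch})

runs = 50
start = time.perf_counter()
for _ in range(runs):
    session.run(None, {inp.name: batch})
print(f"mean latency: {(time.perf_counter() - start) / runs * 1000:.2f} ms")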
14. Running Computer Vision with AML – Inference Output Without / With AIO
Accuracy figures calculated on the basis of 100 images.

                                  AIO disabled   AIO (FP32)   AIO (implicit FP16)
Top-1 accuracy                    0.800          0.800        0.800
Top-5 accuracy                    0.900          0.900        0.900
Latency, mean [ms]                32.40          14.42        7.64
Latency, median [ms]              32.12          14.33        7.64
Latency, p90 [ms]                 32.22          14.41        7.69
Latency, p99 [ms]                 40.22          14.74        7.89
Latency, p99.9 [ms]               46.18          21.97        7.89
Throughput, observed [samples/s]  30.86          69.33        130.87
Throughput, inverted [ms]         32.40          14.42        7.64

• Ampere Optimized ONNX Runtime accelerates the inference performance of ONNX models in both FP32 and FP16 formats.
• It is very easy to use: ONNX models can be accelerated off-the-shelf.
• Performance scales easily across the CPU cores of Ampere CPUs.
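As a quick sanity check, the relative speedups implied by the mean latencies above work out as follows (plain arithmetic on the reported numbers):

# Quick arithmetic on the mean latencies reported above (ResNet-50, 16 threads).
baseline_ms = 32.40   # AIO disabled
aio_fp32_ms = 14.42   # AIO, FP32
aio_fp16_ms = 7.64    # AIO, implicit FP16

print(f"AIO FP32 speedup over baseline: {baseline_ms / aio_fp32_ms:.2f}x")  # ~2.25x
print(f"AIO FP16 speedup over baseline: {baseline_ms / aio_fp16_ms:.2f}x")  # ~4.24x
print(f"FP16 speedup over AIO FP32:     {aio_fp32_ms / aio_fp16_ms:.2f}x")  # ~1.89x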
15. Classification Models
• Typical models from PaddleClas for inference on CPU:
  – ResNet50_vd: first introduced in 2015 by Microsoft researchers, commonly used for image classification tasks and trained on the ImageNet dataset.
  – PP-LCNetV2: proposed by Baidu, targeting high-performance models on CPU platforms.
  – MobileNetV3: a network launched by Google, targeting mobile and embedded devices.
• Tests driven by the Paddle FastDeploy framework.

Top-1 accuracy:
Model name                      Native ONNXRT   AIO_FP32   AIO_FP16
ResNet50_vd_infer               79.120%         79.120%    79.114%
MobileNetV3_large_x1_0_infer    78.960%         78.960%    78.944%
PPLCNetV2_base_infer            77.042%         77.042%    77.056%

(Charts on the slide: "Performance of Ampere Altra-based Dpsv5 Virtual Machines on Azure" and "Price-Performance of Ampere Altra-based Dpsv5 Virtual Machines on Azure", comparing D48s_v5, D48as_v5, and D48ps_v5 (FP16) on ResNet50_vd_infer, MobileNetV3_large_x1_0_infer, and PPLCNetV2_base_infer.)
16. YOLOv5
Inference of yolov5s on a single Altra Max M128-30.
Initial status without AIO:
Streams           1        4        10
Inference time    120 ms   600 ms   1000 ms
Results after applying AIO (FP16):
Streams              1          4          8          16         32
Threads per stream   128        32         16         8          4
Inference time       14.60 ms   20.38 ms   30.69 ms   53.73 ms   105.98 ms
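As a quick read of the FP16 table, aggregate throughput can be estimated from the per-stream latency (this assumes each inference processes one frame and the streams run fully in parallel; the 128 threads correspond to the 128 cores of the Altra Max M128-30):

# Rough arithmetic on the AIO FP16 table above: splitting 128 cores across more
# streams raises per-stream latency but increases aggregate throughput,
# estimated as frames/s = streams * 1000 / latency_ms.
results_ms = {1: 14.60, 4: 20.38, 8: 30.69, 16: 53.73, 32: 105.98}
total_threads = 128

for streams, latency_ms in results_ms.items():
    threads_per_stream = total_threads // streams
    fps = streams * 1000.0 / latency_ms
    print(f"{streams:>2} streams x {threads_per_stream:>3} threads: "
          f"{latency_ms:6.2f} ms per frame, ~{fps:5.1f} frames/s aggregate")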
17. Follow Us
• On WeChat
• Ampere AI website: visit the Ampere AI website for the latest information: https://amperecomputing.com/solutions/ampere-ai