Ray and Large Language Models: A One-Stop Solution for Pretraining, Finetuning, and Deployment - 汪愈舟
汪愈舟 - AI R&D Manager, Intel
An AI R&D manager at Intel, currently focused on the development and performance optimization of large language models on Intel GPUs, CPUs, and accelerators. In the big data and AI space, his team has developed and open-sourced several innovative projects, including RayDP, the Spark adaptive execution engine, Intel OAP, and HiBench.
Talk introduction:
Large language models, with their enormous parameter counts and heavy demand for compute, pose unprecedented challenges for AI infrastructure. Handling these challenges efficiently across pretraining, finetuning, and deployment has become critical. In this talk, we examine the characteristics of the Ray framework and show its unique advantages for large language models. We also present a Ray-based workflow tailored to large language models. With this workflow, researchers and engineers can pretrain, finetune, and deploy large language models more efficiently, substantially lowering the technical barrier and the cost involved.
1. Ray and Large Language Models: A One-Stop Solution for Pretraining, Finetuning, and Deployment (汪愈舟, AI R&D Manager, Intel)
2. Ray: Unified Framework for Scaling AI Workloads
[Stack diagram: an end-to-end Python application built on the Ray libraries (Ray Data, Ray Train, RLlib, Ray Tune, Ray Serve), all running on Ray Core.]
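To make the diagram concrete, here is a minimal sketch of Ray's Python-native programming model; the function and data are placeholders, but `ray.init`, `@ray.remote`, and `ray.get` are the core Ray API.

```python
import ray

ray.init()  # connect to an existing Ray cluster, or start a local one

@ray.remote(num_cpus=1)
def preprocess(shard):
    # stand-in for any Python function; Ray schedules it across the cluster
    return len(shard)

# fan out work as tasks and gather the results
futures = [preprocess.remote(s) for s in [["a", "b"], ["c"], ["d", "e", "f"]]]
print(ray.get(futures))  # [2, 1, 3]
```

Libraries such as Ray Train and Ray Serve build on this same core, which is why a single Python application can cover data, training, tuning, and serving end to end.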
3. Ray Adoption is Accelerating
[Chart: adoption growth of Ray compared with Apache Spark, Kafka, MLFlow, and KubeFlow.]
4. Ray Is Everywhere
• OpenAI: trained ChatGPT on Ray.
• EleutherAI: trained GPT-J on Ray.
• Cohere: trained their models on Ray.
• Ant Group: large-scale production on top of Ray (over 1 million CPU cores).
• ByteDance: scaled offline inference with multi-modal LLMs on Ray.
• Uber: moved its ML platform Michelangelo from Spark-based to Ray-based.
• Shopify: built its ML platform Merlin on top of Ray.
……
5. Why OpenAI Uses Ray
Challenges OpenAI faced:
• Inefficient infrastructure for LLM training.
• They started with the open-source ecosystem (Kubernetes, Terraform) to build the infra, but it lacked an application scheduling layer, especially for multi-node training.
• The home-built solution carried a lot of complexity and took a significant amount of work to maintain and manage; more importantly, it was not a core competence.
Ray was used to train OpenAI's largest models:
• Able to scale up to unprecedented scale without failing.
• Ray owns the whole flexible application scheduling layer.
• Beyond just training the LLM, they extended it to data processing and scheduling across machines.
"We're using it to train our largest models. It's been very helpful for us to scale up to an unprecedented scale." (Greg Brockman, OpenAI)
6. Ensure Reliability and Robust Fault Recovery in LLM Training
• Meta's OPT-175B model was trained on 992 80GB A100 GPUs over roughly 33 days.
• There were 53 or 54 restarts during training, caused by issues such as lost GPUs, CUDA errors, job hangs, NCCL errors, job slowdowns, and code bugs.
• Ray's automatic recovery feature makes recovering from training failures easier, ensuring continuous and reliable model training; a hedged configuration sketch follows.
[Chart: training progress over time; each color represents a restart.]
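As a sketch of how this looks with Ray Train: the training loop below is a stub, and the values are illustrative, but `FailureConfig` and `CheckpointConfig` are real Ray 2.x APIs for exactly this kind of retry-and-resume behavior.

```python
from ray.train import (CheckpointConfig, FailureConfig, RunConfig,
                       ScalingConfig)
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    # user-defined training step; checkpoints are reported via ray.train.report
    pass

trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=8, use_gpu=True),
    run_config=RunConfig(
        # retry up to 3 worker/node failures, resuming from the latest checkpoint
        failure_config=FailureConfig(max_failures=3),
        checkpoint_config=CheckpointConfig(num_to_keep=2),
    ),
)
result = trainer.fit()
```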
7. Reduce Cost with Ray
• Ray's cluster launcher, autoscaler, and automatic fault recovery make it practical to use cloud spot instances to cut cost.
• Use the most cost-effective hardware for each stage of the application and scale every stage independently.
• No expensive SerDe between stages.
• Examples:
  • Vicuna: 3.3-3.6x cost reduction for finetuning by using spot instances.
  • Samsara: 50% cost reduction for model serving.
[Chart: Vicuna cost reduction via spot instances; cost in $ for on-demand vs. spot, roughly 3.3x lower for Vicuna-7B and 3.6x lower for Vicuna-13B.]
8. Our Work in the Ray Community
• RayDP: Spark on Ray
• Intel GPU & Gaudi support in Ray
• LLM-on-Ray workflow
9. LLM-on-Ray Workflow Introduction
Build and serve your own LLMs on Intel platforms:
• RecDP-LLM data preparation
• LLM pretraining on Ray
• LLM finetuning on Ray
• LLM serving on Ray
[Diagram: customer proprietary data and open-source models flow through RecDP-LLM into the pretraining, finetuning, and serving stages of the LLM-on-Ray workflow, all built on Intel LLM optimizations; the customer application, driven by a config, sends queries to the serving stage and receives responses.]
https://github.com/intel/llm-on-ray (to be open sourced)
https://github.com/intel/e2eAIOK/blob/main/RecDP/pyrecdp/LLM/README.md
10. RecDP-LLM: LLM Data Preparation Utility
11. RecDP-LLM: LLM Data Preparation Utility
RecDP provides LLM data pipelines for three scenarios:
• Foundation model pretraining: terabytes of multi-source data (wiki, papers, code bases, web pages, etc.) go through document extraction, non-text conversion, license filtering, profanity filtering, length filtering, PII removal, deduplication, toxicity analysis, and more, producing a pretrain dataset. The result is a base model trained on clean, license-approved data with no legal or ethical liability.
• Finetuning: labeled, domain-specific data (methodologies, regulations, a new domain, language, or style) goes through prompt-template standardization, ensembling, filtering, deduplication, diversity analysis, and quality control (perplexity, toxicity & bias, ROUGE similarity, etc.), producing a finetune dataset. The result is an AI generator with improved response structure and form, consistency, logic reasoning, and domain awareness.
• Retrieval-augmented generation (RAG): dynamic, entity-specific data (e-mails, wikis, financial press releases, white papers, specifications, Salesforce) goes through document extraction, non-text conversion, filtering, entity-specific deduplication, chunk splitting, tokenization, and embedding into a vector DB, keeping the system current with updated and entity-specific information.
12. RecDP-LLM Architecture

Data process operations:

| Type | Description | Supports |
|---|---|---|
| DocumentExtract | Extract text from unstructured formats | jpg, png, pdf, docx, ... |
| Reader | Read data from a directory | jsonl, parquet |
| Converter | Read and convert unstructured data to a unified format | html, document, image, pdf, ... |
| Text Fixer | Clean repeated formatting in html, latex, code | html, latex, code |
| Language Identify | Identify the major language of a document | en, zh, fr, de, ... 25 languages in total |
| Fuzzy Deduplicator | Detect and reduce duplication based on document context | minHashLSH |
| Global Deduplicator | Detect and reduce duplication of exactly identical content | sha256 hash |
| Rouge Score Deduplicator | Remove similar data by computing the ROUGE score | Rouge-L |
| Repetition Removal | Detect and reduce repeated context within the same document | |
| Document Splitter | Split a document into multiple sub-documents | chapter_based, length_based |
| PII Removal | Detect and replace personal information in documents | email, phone, ip, username, password |
| User Defined Transform | Plug in a user-defined map function | parallel with Ray or Spark |
| User Defined Filter | Plug in a user-defined filter function | parallel with Ray or Spark |
| Writer | Write data to a directory | jsonl, parquet |
| ClassifyWriter | Classify and write data into sub-buckets | meta fields, language |
| Prompt Enhancement | Create high-complexity instructions from existing instruct-tuned LLMs | PromptSource, self-instruct, evol-instruct (WizardLM) |
| Tokenization | Tokenize with the LLAMA2 tokenizer and save in Megatron format | LLAMA2 tokenizer |

Scoring system: GPT-3 quality scoring, diversity distribution, toxicity probability, and perplexity distribution, each with visualization.

Components architecture: the developer workflow takes raw data through an optimal pipeline (e.g. filter, fuzzy dedup with method: alpaca and num_perm: 128, TextFix, pii_remove, document_split, ...) to high-quality data, built on a ResumableTextPipeline with low-code/no-code settings, plot/export/profile, Ray/Spark context management, and fault-tolerant resumable pipelines; RecDP LLM operations (reader, writer, filtering, deduplication, textfixer, language_id, pii_removal, quality scorer, diversity indicator, etc.); third-party libraries (PromptSource, unstructured, LangChain, data-juicer, ...); and Spark or Ray as the execution engine. A hedged usage sketch follows.
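In the sketch below, `ResumableTextPipeline` is named on the slide, but the operation class names and arguments are assumptions modeled on the table above; consult the RecDP README for the exact API.

```python
from pyrecdp.LLM import ResumableTextPipeline
# operation names are illustrative, mirroring the table above
from pyrecdp.primitives.operations import (
    JsonlReader, LanguageIdentify, FuzzyDeduplicate,
    GlobalDeduplicate, PIIRemoval, PerfileParquetWriter,
)

pipeline = ResumableTextPipeline()
pipeline.add_operations([
    JsonlReader("raw_data/"),              # Reader: load jsonl documents
    LanguageIdentify(),                    # keep documents in target languages
    FuzzyDeduplicate(num_perm=128),        # minHashLSH near-duplicate removal
    GlobalDeduplicate(),                   # sha256-hash exact-duplicate removal
    PIIRemoval(),                          # strip emails, phones, IPs, credentials
    PerfileParquetWriter("clean_data/"),   # Writer: persist cleaned shards
])
pipeline.execute()  # resumable: reruns pick up from completed files
```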
13. LLM-on-Ray Pretraining Workflow
14. LLM-on-Ray Pretraining Pipeline
The pipeline has three steps (a hedged config sketch follows):
• Prepare the dataset: build a Megatron dataset (document extraction, PII removal, deduplication, ......).
• Configure parameters: megatron_config: ......, deepspeed_config: ......
• Start training: python pretrain/megatron_deepspeed_pretrain.py --config_path ……
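To make the configuration step concrete, here is a minimal sketch of what such a config might contain; every key and value below is an illustrative assumption rather than the workflow's actual schema.

```python
# illustrative config layout for megatron_deepspeed_pretrain.py (keys assumed)
pretrain_config = {
    "megatron_config": {
        "num_layers": 32,
        "hidden_size": 4096,
        "seq_length": 2048,
        "data_path": ["tokenized/corpus_text_document"],
    },
    "deepspeed_config": {
        "train_micro_batch_size_per_gpu": 4,
        "zero_optimization": {"stage": 2},
    },
}
```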
15. LLM-on-Ray Pretraining Software Stack
• Tools for distributed training: Megatron-LM, Deepspeed, Megatron-Deepspeed
• Frameworks: PyTorch, Transformers, Intel Extension
• Low-level libraries: oneDNN, oneCCL, HCCL
16. Megatron-Deepspeed & Ray Integration
• Several necessary changes have been merged into Megatron-Deepspeed.
• The Megatron dataset, the Megatron-Deepspeed trainer, and checkpoint recovery have all been integrated with Ray.
• Supports multiple training methods, including ZeRO, data parallelism, pipeline parallelism, and tensor parallelism. One plausible shape of this integration is sketched below.
[Diagram: ZeRO combined with data/tensor/pipeline parallelism on Ray.]
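A sketch of the integration using Ray Train's `TorchTrainer`: the Ray Train API is real, but the Megatron-Deepspeed call is stubbed out, and how the workflow actually wires the two together is an assumption here.

```python
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def megatron_train_loop(config):
    # each Ray worker picks up the rank/world-size environment that Ray Train
    # sets up, then runs Megatron-Deepspeed's usual ZeRO + data/tensor/pipeline
    # parallel training loop (the call into Megatron-Deepspeed is omitted)
    pass

trainer = TorchTrainer(
    megatron_train_loop,
    train_loop_config={"tensor_parallel": 2, "pipeline_parallel": 2},  # illustrative
    scaling_config=ScalingConfig(num_workers=16, use_gpu=True),
)
trainer.fit()
```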
17. Fault Recovery Improvement for Pretraining
• If a node becomes unavailable, reschedule all training workers on available nodes, restart training, and resume from a checkpoint.
• If a GPU or accelerator fails, avoid scheduling training actors on the affected node, ensuring adherence to tensor/pipeline parallelism requirements.
[Diagram: hosts with 8 GPUs each; when one GPU or node fails, workers are rescheduled onto additional hosts and training resumes from the checkpoint.]
18. LLM-on-Ray Finetuning Workflow
19. Why Might Finetuning Be Promising
• Finetuned Llama2-7B and 13B models outperform the 70B-chat and GPT-4 models on specific use cases.
Source: https://www.anyscale.com/blog/fine-tuning-llama-2-a-comprehensive-case-study-for-tailoring-models-to-unique-applications
20. LLM-on-Ray Finetuning Pipeline
The pipeline has three steps (a hedged config sketch follows):
• Prepare the dataset, e.g.:
```
{"instruction": "Why can camels survive for long without water?",
 "context": "",
 "response": "Camels use the fat in their humps to keep them filled with energy and hydration for long periods of time."}
```
• Configure parameters: base_model, lora_config, train_file, optimizer, batch_size, epochs, learning_rate, num_training_workers, resources_per_worker, ......
• Start finetuning, either from the command line (python finetune/finetune.py --config_path finetune/finetune.conf) or from the UI.
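A sketch of a config using the keys listed on the slide; the values are illustrative assumptions.

```python
# finetune/finetune.conf, sketched as a Python dict (values assumed)
finetune_config = {
    "base_model": "meta-llama/Llama-2-7b-hf",
    "train_file": "data/train.jsonl",
    "lora_config": {"r": 8, "lora_alpha": 32, "lora_dropout": 0.1},
    "optimizer": "AdamW",
    "batch_size": 4,
    "epochs": 3,
    "learning_rate": 1e-5,
    "num_training_workers": 2,
    "resources_per_worker": {"CPU": 32},
}
```

With the config in place, finetuning starts with `python finetune/finetune.py --config_path finetune/finetune.conf`.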
21. LLM-on-Ray Finetuning Software Stack
• Base models: LLAMA2, MPT, ……
• Parameter-efficient finetuning: PEFT, Deltatuner
• Tools for distributed training: PyTorch DDP/FSDP, Deepspeed, Accelerate
• Frameworks: PyTorch, Transformers, Intel Extension
• Low-level libraries: oneDNN, oneCCL, HCCL
22. LLM-on-Ray Finetuning: LoRA Support
• LoRA support in the finetuning workflow, configured as follows:
```
"lora_config": {
    "task_type": "CAUSAL_LM",
    "r": 8,
    "lora_alpha": 32,
    "lora_dropout": 0.1
}
```
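Since the finetuning stack on slide 21 lists PEFT, the `lora_config` block above presumably maps onto a PEFT `LoraConfig`; a sketch of that mapping (the model name is illustrative):

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# mirrors the lora_config block above
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,              # rank of the low-rank update matrices
    lora_alpha=32,    # scaling factor applied to the update
    lora_dropout=0.1,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights train
```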
23. LLM-on-Ray Finetuning Optimization: Deltatuner
• Key components:
  • Parameter-efficient finetuning algorithms: LoRA, Scaling and Shifting (SSF).
  • De-NAS: automatically construct compact and optimal delta layers in a train-free and hardware-aware mode:
    • Step 1: generate the search space for delta layers.
    • Step 2: the search algorithm populates delta layers for the LM.
    • Step 3: a train-free score evaluates the LM with adaptive delta layers.
• Features:
  • Easy to use: enable Deltatuner via the LLM-on-Ray configuration (an illustrative snippet follows).
  • Auto-tuning: automatically select the best algorithm and delta structure for the finetuning model.
• Values:
  • Save resources: reduce trainable parameters, finetuning time, and memory consumption.
  • Get improved accuracy.
[Diagram: overall architecture; a pre-trained LM plus delta algorithms (LoRA, AdaLoRA, SSF, ...) with hardware-adaptive De-NAS producing adaptive delta layers for models such as MPT-7B and Llama-7B.]
https://github.com/intel/e2eAIOK/tree/main/e2eAIOK/deltatuner
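Because the slide says Deltatuner is enabled through the LLM-on-Ray configuration, here is a purely illustrative guess at what that could look like; the key names are assumptions, so see the linked repo for the actual schema.

```python
# hypothetical: extending the finetuning config sketched earlier
finetune_config = {
    # ... base_model, train_file, lora_config, etc. as before ...
    "deltatuner_config": {
        "algo": "auto",   # let Deltatuner pick LoRA/SSF/etc. automatically
        "denas": True,    # use De-NAS to search compact delta layers
    },
}
```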
24. LLM-on-Ray Serving Workflow
25. Ray Serve Introduction
Framework-agnostic | Model composition | Scalability | LLM support
• A scalable, flexible, and efficient model-serving library built on top of Ray.
• Simplifies the deployment of AI applications by expressing a complex application as a single Python program.
• Optimized for LLMs, with features such as response streaming and dynamic request batching. A minimal deployment sketch follows.
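In the sketch below, the `@serve.deployment` / `bind` / `serve.run` pattern is the real Ray Serve API, while the model logic is a placeholder.

```python
from ray import serve
from starlette.requests import Request

@serve.deployment(num_replicas=2, ray_actor_options={"num_cpus": 4})
class LLMDeployment:
    def __init__(self):
        # load the tokenizer and model here (omitted in this sketch)
        pass

    async def __call__(self, request: Request) -> dict:
        prompt = (await request.json())["prompt"]
        # placeholder for model.generate(...), streaming, batching, etc.
        return {"generated_text": f"echo: {prompt}"}

serve.run(LLMDeployment.bind())  # serves on http://localhost:8000 by default
```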
26. Efficient LLM Serving Is a Full-Stack Problem
[Diagram: optimizations span the infra layer (autoscaling, cross-region & cloud), the serving layer (multi-accelerator serving, continuous batching), and the model layer (single-accelerator optimizations, speculative decoding, modeling optimizations).]
27. LLM-on-Ray Serving Workflow
• Deploy popular open-source LLMs with a single command, or a single click in the UI.
• Autoscale on a Ray cluster.
• Expose an OpenAI-like RESTful API (a client sketch follows).
• Support serving LLMs on Intel CPUs, GPUs, and Gaudi2.
• Integrate various Intel optimizations, such as Intel Extension for PyTorch/Deepspeed.
• More optimizations are being added.
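Because the API is OpenAI-like, a client could look roughly like this; the endpoint path and model name are assumptions, not the workflow's documented values.

```python
import requests

# hypothetical endpoint exposed by an LLM-on-Ray serving deployment
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "llama-2-7b",
        "messages": [{"role": "user", "content": "What is Ray?"}],
    },
)
print(resp.json())
```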
28. LLM-on-Ray RAG Workflow
• Provides a workflow and tools that make it easier to build retrieval-augmented generation (RAG) applications. A minimal sketch of the flow follows.
[Diagram: a query goes to the Retriever, which looks up the knowledge base; the retrieved context and the query go to the LLM, which produces the response.]
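A self-contained sketch of the query -> Retriever -> LLM flow above; a real workflow would use embeddings and a vector store, and `llm_generate` is a hypothetical call into the served model.

```python
def retrieve(query: str, knowledge_base: list[str], k: int = 2) -> list[str]:
    # toy keyword-overlap retriever standing in for vector search
    score = lambda doc: sum(word in doc for word in query.split())
    return sorted(knowledge_base, key=score, reverse=True)[:k]

def answer(query: str, knowledge_base: list[str]) -> str:
    context = "\n".join(retrieve(query, knowledge_base))
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return llm_generate(prompt)  # hypothetical: call the served LLM endpoint

knowledge_base = [
    "Ray is a unified framework for scaling Python and AI workloads.",
    "LoRA is a parameter-efficient finetuning method.",
]
# answer("What is Ray?", knowledge_base)
```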
29. LLM Inference Performance on Intel Xeon