Making ML more useful to more people


1. Making ML more useful to more people
Markus Weimer, Markus.Weimer@Microsoft.com

2. Why Machine Learning? “Programming the UnProgrammable”
Instead of hand-writing f(x) { … }, learn f from examples — e.g., generating product descriptions like “It has exquisite buttons … with long sleeves … works for casual as well as business settings”

3. Models are Software / Training data needs management
Models are software:
• Built as software, just with different tools
• Deployed and updated as software
• Tested as software
• Debugged like software
Training data needs management:
• Data is private and increasingly regulated
• Data is dynamic (CRUD, retention policies, …)
• Best managed as part of the data estate
• Training and deployment of models needs to respect data governance
Point of view: Data Science is Software Engineering with Data

4. ML.NET: https://dot.net/ml

5. Brought to you by (amongst others) Zeeshan Ahmed (Microsoft) zeahmed@microsoft.com, Saeed Amizadeh (Microsoft) <saamizad@microsoft.com>, Mikhail Bilenko (Yandex) <mbilenko@yandex-team.ru>, Rogan Carr (Microsoft) <rocarr@microsoft.com>, Wei-Sheng Chin (Microsoft) <WeiSheng.Chin@microsoft.com>, Yael Dekel (Microsoft) <yaeld@microsoft.com>, Xavier Dupre (Microsoft) <xadupre@microsoft.com>, Vadim Eksarevskiy (Microsoft) <Vadim.Eksarevskiy@microsoft.com>, Senja Filipi (Microsoft) <sefilipi@microsoft.com>, Tom Finley (Microsoft) <tfinley@microsoft.com>, Abhishek Goswami (Microsoft) <agoswami@microsoft.com>, Monte Hoover (Microsoft) <Monte.Hoover@microsoft.com>, Scott Inglis (Microsoft) <singlis@microsoft.com>, Matteo Interlandi (Microsoft) <mainterl@microsoft.com>, Najeeb Kazmi (Microsoft) <nakazmi@microsoft.com>, Gleb Krivosheev (Microsoft) <gleb.krivosheev@skype.net>, Pete Luferenko (Microsoft) <Pete.Luferenko@microsoft.com>, Ivan Matantsev (Microsoft) <ivmatan@microsoft.com>, Sergiy Matusevych (Microsoft) <sergiym@microsoft.com>, Shahab Moradi (Microsoft) <shmoradi@microsoft.com>, Gani Nazirov (Microsoft) <ganaziro@microsoft.com>, Justin Ormont (Microsoft) <Justin.Ormont@microsoft.com>, Gal Oshri (Microsoft) <gaoshri@microsoft.com>, Artidoro Pagnoni (Microsoft) <Artidoro.Pagnoni@microsoft.com>, Jignesh Parmar (Microsoft) <jignparm@microsoft.com>, Prabhat Roy (Microsoft) <Prabhat.Roy@microsoft.com>, Zeeshan Siddiqui (Microsoft) <mzs@microsoft.com>, Markus Weimer (Microsoft) <mweimer@microsoft.com>, Shauheen Zahirazami (Microsoft) <shzahira@microsoft.com>, Yiwen Zhu (Microsoft) <zhu.yiwen@microsoft.com>, …

6. About .NET
• .NET has cool stuff ML people care about
  • C#: Like Java, but from the future
  • F#: Like Python, but with static types and multithreading
  • Almost-free calls into native code
• .NET is OSS and cross-platform
  • Windows (surprise!), Linux, macOS
  • Phones via Xamarin: Android, iOS
  • Interesting HW: Xbox, IoT devices, …
• Lots of developers build important stuff in .NET
  • 4M active; 450k added each month
  • 15% growth MoM in https://github.com/dotnet
  • Half the top-10k websites are built in .NET

7. Machine Learning made for .NET
ML.NET: An open source and cross-platform machine learning framework
• Developers: covers many developer scenarios; available in C#, F# and VB.NET
• Open source and cross-platform: Windows, Linux, Mac; x64, x86 (some), ARM (some)
• Proven and extensible: development started ~10 years ago; received contributions (and scrutiny) from all over Microsoft

8. This tool designed most of the slides I used today ☺

9. ML.NET is used in many products
• Many Microsoft products use ML.NET (previously known internally as TLC)
• You have likely used ML.NET today ☺
• Why is that?
  • Many products are written in (ASP).NET
  • Using ML.NET is just like using any other .NET API

10. Using a model is just like using code
The model is a resource shipped with the app; ML.NET is a standard software dependency.

    var model = mlContext.Model.Load("mymodel.zip");
    var predFunc = model.MakePredictionFunction<T_IN, T_OUT>(mlContext);
    var result = predFunc.Predict(x);

Training: Think sklearn, but with a statically typed language
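The sklearn analogy can be made concrete with a minimal, dependency-free Python sketch of the same pattern: training produces a serialized artifact, and the application later loads it like any other resource and calls predict. The `MeanModel` class and file name here are purely illustrative, not ML.NET or sklearn API.

```python
import os
import pickle
import tempfile

class MeanModel:
    """Toy 'model': predicts the mean of the training targets."""
    def fit(self, ys):
        self.mean = sum(ys) / len(ys)
        return self
    def predict(self, x):
        # Ignores the input; this is only a placeholder model.
        return self.mean

# Training time: fit the model and serialize it as an artifact.
model = MeanModel().fit([1.0, 2.0, 3.0])
path = os.path.join(tempfile.mkdtemp(), "mymodel.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Application time: the model is just a resource shipped with the app.
with open(path, "rb") as f:
    loaded = pickle.load(f)
print(loaded.predict(x=None))  # -> 2.0
```

The point of the slide is exactly this shape: loading and calling a model is ordinary application code, no ML infrastructure required at inference time.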

11. ML.NET captures end-to-end Machine Learning pipelines
• Data Ingestion: Text, SQL, In Memory, Time Series, …
• Featurization and Transforms: text & image featurization; pre-trained DNNs in ONNX, TensorFlow; feature transforms (normalization, pruning, …); …
• Learning Algorithms: Supervised (linear, trees, Factorization Machines, …); Unsupervised (PCA, LDA, K-Means, …); …

12. ML.NET is fast & good
• Core infrastructure: IDataView
  • Carefully designed to avoid memory allocations
  • Only required data is lazily materialized
• Carefully tuned defaults
  • Many ML tasks are more alike than we’d like to admit ☺
[Figure: GBDT experiments done on Criteo, using default parameters]
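The lazy-materialization idea behind IDataView can be sketched in a few lines of Python (this is a conceptual illustration, not the actual ML.NET interface): rows flow through the pipeline on demand, so a consumer that reads only one column never forces the other columns to be materialized.

```python
def data_view(rows):
    """Lazily yield rows; nothing is computed until someone iterates."""
    for row in rows:
        yield row

def select(view, column):
    """Project a single column, streaming, without copying the rest."""
    for row in view:
        yield row[column]

# The source is itself a generator: rows exist only while streaming.
rows = ({"label": i % 2, "text": "x" * i} for i in range(5))
labels = select(data_view(rows), "label")
print(list(labels))  # -> [0, 1, 0, 1, 0]
```

Composing generators this way keeps memory use proportional to one row, which mirrors the allocation-avoidance goal the slide describes.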

13. ML.NET’s journey to OSS
• Developed for almost a decade as an internal tool
• Open sourced in May 2018 (at //build)
• MIT License, .NET Foundation
• Monthly releases ever since; 1.0 RC1 this Tuesday
• Please check it out, and leave feedback

14. Other efforts not discussed today
• Pretzel
  • Model compiler
  • Especially good at the many models → one program problem
  • http://www.markusweimer.com/publication/2018/10/23/pretzel/
• TorchSharp
  • PyTorch – Python + .NET
  • https://github.com/xamarin/TorchSharp

15. Distributed Machine Learning where the Data is

16. Resource Managers
• One cluster used by all workloads (interactive, batch, streaming, …)
• Resources are handed out as containers
  • A container is a slice of a machine
  • Fixed RAM, CPU, I/O, …
• Examples:
  • Azure Batch
  • Apache Hadoop YARN
  • Apache Mesos
  • Google Borg

17. Challenges
• Fault tolerance
• Pre-emption
• Elasticity

18. Machine learning
• ML thrives with gang scheduling
  • Iterative
  • Fixed data sets
• Gangs are undesirable on shared clusters
  • Utilization is paramount
• MPI: Wait …
• MapReduce: Do the work slowly on fewer machines
• Let’s do better than that

19. Approach I: Elastic ML (NeurIPS ’14)

20. Elastic ML
• Our solution: ramp up the workload with the allocations
  • In each iteration, add machines and data
• First iteration

21. Elastic ML
• Our solution: ramp up the workload with the allocations
  • In each iteration, add machines and data
• Second iteration

22. Elastic ML
• Our solution: ramp up the workload with the allocations
  • In each iteration, add machines and data
• End state
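The ramp-up idea across these three build slides can be sketched as a toy Python loop (hypothetical scheduler and toy "model"; not the actual system): training starts on whatever containers are currently allocated, and each iteration folds in newly granted machines together with their data partitions.

```python
def elastic_training(partitions, allocations):
    """partitions: data split per machine; allocations[t]: machines granted at iteration t."""
    active = 0    # machines (and their partitions) folded in so far; only grows
    model = 0.0   # toy model: the running mean over all data seen so far
    for t, available in enumerate(allocations):
        active = max(active, min(available, len(partitions)))
        # One 'iteration' over the data held by the currently active machines.
        data = [x for part in partitions[:active] for x in part]
        model = sum(data) / len(data)
        print(f"iter {t}: {active} machines, {len(data)} examples, model={model:.2f}")
    return model

parts = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
elastic_training(parts, allocations=[1, 2, 3])  # ramp up from 1 machine to 3
```

The key property the slides illustrate survives in the sketch: no iteration ever waits for a full gang of machines, and the final iteration trains on the full data set once the allocation has caught up.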

23. Is it any good?

24. Approach II: Coded Computing
Yaoqing Yang (CMU), Matteo Interlandi, Saeed Amizadeh
NeurIPS ’18, ongoing work

25. Coded Computing

Container:  1     2     3     4                  5                  6
X:          X[1]  X[2]  X[3]  X[1]+2X[2]+3X[3]   X[1]+4X[2]+9X[3]   X[1]+8X[2]+27X[3]
Y:          Y[1]  Y[2]  Y[3]  Y[1]+2Y[2]+3Y[3]   Y[1]+4Y[2]+9Y[3]   Y[1]+8Y[2]+27Y[3]

Containers 1–3 hold the original data; containers 4–6 hold coded data.
• Encode 3 splits into 6 splits
• Any 3 row blocks out of 6 are sufficient
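A small pure-Python sketch of the (6, 3) code on the slide: the generator matrix stacks the identity (containers 1–3, the original splits) on the Vandermonde-style rows with coefficients (1, 2, 3), (1, 4, 9), (1, 8, 27). Recovering the originals from any 3 surviving blocks is then just solving a 3×3 linear system. Real coded computing encodes whole data blocks; scalars keep the example short, and exact rational arithmetic stands in for what a production system would do more carefully.

```python
from fractions import Fraction

# Generator matrix: rows 0-2 are the original splits (identity),
# rows 3-5 are the coded combinations shown on the slide.
G = [[1, 0, 0], [0, 1, 0], [0, 0, 1],
     [1, 2, 3], [1, 4, 9], [1, 8, 27]]

def encode(x):
    """Turn 3 splits into 6 blocks, one per container."""
    return [sum(g * xi for g, xi in zip(row, x)) for row in G]

def decode(survivors):
    """survivors: any 3 (container_index, block_value) pairs. Solves A x = b."""
    A = [[Fraction(v) for v in G[i]] for i, _ in survivors]
    b = [Fraction(v) for _, v in survivors]
    # Gauss-Jordan elimination with partial pivoting on the 3x3 system.
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(3):
            if r != col and A[r][col] != 0:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * p for a, p in zip(A[r], A[col])]
                b[r] -= f * b[col]
    return [b[i] / A[i][i] for i in range(3)]

blocks = encode([10, 20, 30])                 # 6 containers' worth of data
alive = [(1, blocks[1]), (3, blocks[3]), (5, blocks[5])]  # 3 containers failed
print([int(v) for v in decode(alive)])        # -> [10, 20, 30]
```

Any 3-row submatrix of G is invertible for this choice of coefficients, which is exactly the slide's claim that any 3 blocks out of 6 are sufficient.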

26. Results
• Real dataset: 100,000 samples, 3,352 features
• Distributed computation on 20 machines
• Randomly pick 10 machines and let them randomly fail during the computation

27. Models are Software / Training data needs management
Models are software:
• Built as software, just with different tools
• Deployed and updated as software
• Tested as software
• Debugged like software
Training data needs management:
• Data is private and increasingly regulated
• Data is dynamic (CRUD, retention policies, …)
• Best managed as part of the data estate
• Training and deployment of models needs to respect data governance
Point of view: Data Science is Software Engineering with Data

28. Many open questions
• For software, we have source control. For data and models we have …?
• For software, we have code reviews. For data we have …?
• For software, we have semantic versions. For data we have …?
• For software, we have debuggers. For models, we have …?
• For software, we have signing. For models, we have …?
• …

29. ML.NET is ML for .NET
• https://dot.net/ml
• https://github.com/dotnet/machinelearning
Thanks for your time! Let’s stay in touch!
You can reach me at: Markus.Weimer@Microsoft.com, @MarkusWeimer, http://markusweimer.com
Of course, we are hiring