Deep Learning on Apache Spark at CERN's Large Hadron Collider with Intel Technologies
1. Deep Learning on Apache Spark at CERN's Large Hadron Collider with Intel Technologies
24th April 2019
Matteo Migliorini, Riccardo Castellotti, Luca Canali, Viktor Khristenko, Maria Girone
2. International organisation straddling the Swiss-French border, founded 1954
• Facilities for fundamental research in particle physics
• 23 member states
• 1.1 B CHF budget (~1.1 B USD)
• 3,000 members of personnel; 15,000 associated members from 90 countries
• Curiosity: CERN has its own TLD! http://home.cern
3. Largest machine in the world: the 27 km-long Large Hadron Collider (LHC)
• Fastest racetrack on Earth: protons travel at 99.9999991% of the speed of light
• Emptiest place in the solar system: particles circulate in the highest vacuum
• Hottest spot in the galaxy: lead-ion collisions create temperatures 100,000x hotter than the heart of the sun
4. [Map of the LHC ring with its four main experiments: CMS, LHCb, ALICE and ATLAS]
5. CMS
• 22 m long, 15 m wide, weighs 14,000 tonnes
• Most powerful superconducting solenoid ever built
• 182 institutes from 42 countries
6. The trigger: filtering the data in real time, selecting potentially interesting events
• 40 million collisions per second (PB/s)
• 100,000 selections per second (TB/s)
• 1,000 selections per second (GB/s)
The detectors can generate up to a petabyte of data per second.
7. The CERN Data Centre in Numbers
• 15,000 servers, 280,000 cores
• 280 PB hot storage; 350 PB on tape (+88 PB from the LHC in 2018)
• 50,000 km fiber optics
8. Worldwide LHC Computing Grid
• Tier-0 (CERN): data recording, initial data reconstruction, data distribution
• Tier-1 (14 centres): permanent storage, re-processing, analysis
• Tier-2 (72 federations, ~150 centres): simulation, end-user analysis
• In total: 900,000 cores and 1 EB of storage
9. Hadoop and Spark Clusters at CERN
• Clusters: YARN/Hadoop and Spark on Kubernetes
• Hardware: Intel-based servers, continuous refresh and capacity expansion
• Accelerator logging (part of LHC infrastructure): 30 nodes (cores: 800, memory: 13 TB, storage: 7.5 PB)
• General purpose: 65 nodes (cores: 1.3k, memory: 20 TB, storage: 12.5 PB)
• Cloud containers: 60 nodes (cores: 240, memory: 480 GB)
• Storage: remote HDFS or EOS (for physics data)
10. Analytics Pipelines – Use Cases
• Physics: analysis of computing metadata; development of new ways to process physics data, scale-out and ML
• IT: analytics on IT monitoring data; computer security
• BE (Accelerators): NXCALS, the next-generation accelerator logging platform; industrial controls data and analytics
11. Analytics Platform Outlook
Integrating with the existing infrastructure:
• Software: HEP software
• Data: experiments storage, HDFS, personal storage
12. Analytics with SWAN
Notebooks combining text, code, monitoring and visualizations: all the required tools, software and data available in a single window!
13. Extending Spark to Read Physics Data
• Physics data is stored in the EOS system, accessible with the XRootD protocol: extended the HDFS APIs with a Hadoop-XRootD connector (Java HDFS API → JNI → C++ XRootD client → EOS storage service)
• Data is stored in ROOT format: developed a Spark DataSource
• Currently 300 PB in EOS, growing by ~90 PB per year of operation
• https://github.com/cerndb/hadoop-xrootd
• https://github.com/diana-hep/spark-root
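Combined, the two components let a Spark session read ROOT files directly out of EOS. A minimal PySpark sketch, assuming the spark-root and hadoop-xrootd packages are on the classpath (e.g. via --packages / --jars); the EOS path is an illustrative placeholder.

```python
# Minimal sketch: read ROOT files from EOS into a Spark DataFrame.
# Assumes spark-root and the hadoop-xrootd connector are available;
# the root:// path below is a placeholder, not a real dataset.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-physics-data").getOrCreate()

df = (spark.read
      .format("org.dianahep.sparkroot")                            # spark-root DataSource
      .load("root://eospublic.cern.ch//eos/path/to/data/*.root"))  # resolved via hadoop-xrootd

df.printSchema()
```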
14. Deep Learning Pipeline for Physics Data
[Diagram: a Particle Classifier assigns each collision event a probability for three classes, e.g. W+jets 36%, QCD 1%, t-t̅ 63%]
15. Data Challenges in Physics
• Proton-proton collisions in the LHC happen at 40 MHz, producing hundreds of TB/s of electrical signals that allow physicists to investigate particle collision events
• Storage is limited by bandwidth: currently only 1 in every 40,000 events is stored to disk (40 MHz / 40,000 ≈ 1 kHz of stored events, ~10 GB/s)
• LHC in 2018: 5 collisions per beam crossing; High Luminosity LHC in 2026: 400 collisions per beam crossing
16. R&D
• Improve the quality of the filtering systems: reduce the false-positive rate by moving from rule-based algorithms to Deep Learning classifiers
• Advanced analytics at the edge: avoid wasting resources in offline computing, reducing operational costs
17. Engineering Efforts to Enable Effective ML
From "Hidden Technical Debt in Machine Learning Systems", D. Sculley et al. (Google), NIPS 2015
18. Spark, Analytics Zoo & BigDL
• Apache Spark: leading tools and API for data processing at scale
• Analytics Zoo: a platform for unified analytics and AI on Apache Spark, leveraging BigDL / TensorFlow
  • For service developers: integration with infrastructure (hardware, data access, operations)
  • For users: Keras APIs to run user models, integration with Spark data structures and pipelines
• BigDL: a distributed deep learning framework for Apache Spark
19. Data Pipeline
• Data Ingestion: read physics data and perform feature engineering
• Feature Preparation: prepare the input features for the Deep Learning network
• Model Development: 1. specify the model topology; 2. tune the topology on a small dataset
• Training: train the best model
Leveraging Apache Spark and Analytics Zoo in Python notebooks.
20. The Dataset
• Software simulators generate events and calculate the detector response
• Every event is an 801x19 matrix: for every particle, momentum, position, energy, charge and particle type are given
21. Data Ingestion (see the sketch below)
• Read input files (4 TB) in a custom format
• Compute physics-motivated features
• Store in Parquet format: 54 M events, ~4 TB on the physics data storage, reduced to 750 GB stored on HDFS
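A hedged sketch of the ingestion step under the same assumptions as above: read the simulated events, derive a physics-motivated column, and persist the result as Parquet on HDFS. The column names (`px`, `py`) and output path are illustrative.

```python
from pyspark.sql import functions as F

events = (spark.read
          .format("org.dianahep.sparkroot")
          .load("root://eospublic.cern.ch//eos/path/to/simulation/*.root"))

# Illustrative physics-motivated feature: transverse momentum from px, py.
features = events.withColumn("pt", F.sqrt(F.col("px") ** 2 + F.col("py") ** 2))

# Columnar, compressed Parquet brings the ~4 TB input down to ~750 GB.
features.write.mode("overwrite").parquet("hdfs:///user/analytics/events.parquet")
```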
22. Feature Engineering
• From the 19 features recorded in the experiment, 14 more are calculated based on domain-specific knowledge: these are called High Level Features (HLF)
• A sorting metric is defined to create a sequence of particles to be fed to a sequence-based classifier
23. Feature Preparation
• All features need to be converted into a format consumable by the network
• One-hot encoding of categories
• Sorting of the particles for the sequence classifier with a UDF
• Executed in PySpark using Spark SQL and ML (see the sketch below)
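A minimal sketch of the two preparation steps, assuming each event carries an array-of-structs column `particles` with a per-particle `sort_metric` field and an integer `label` column; all names are illustrative, and the one-hot encoder uses the Spark 3.x ML API.

```python
from pyspark.sql import functions as F
from pyspark.ml.feature import OneHotEncoder

# Sort each event's particle list by the chosen metric (descending), so the
# sequence classifier always sees particles in a canonical order.
def sort_particles(particles):
    return sorted(particles, key=lambda p: p.sort_metric, reverse=True)

sort_udf = F.udf(sort_particles, df.schema["particles"].dataType)
df = df.withColumn("particles", sort_udf("particles"))

# One-hot encode the integer class label for the softmax output.
encoder = OneHotEncoder(inputCol="label", outputCol="label_ohe", dropLast=False)
df = encoder.fit(df).transform(df)
```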
24. Models Investigated
1. Fully connected feed-forward DNN with the High Level Features
2. DNN with a recurrent layer (based on GRUs)
3. Combination of (1) + (2)
In this order, both complexity and performance increase.
25. Hyper-Parameter Tuning – DNN
• Once the network topology is chosen, hyper-parameter tuning is done with scikit-learn + Keras and parallelized with Spark (see the sketch below)
*Other names and brands may be claimed as the property of others.
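A hedged sketch of such a search, using Keras' scikit-learn wrapper; one way to parallelize it with Spark, assumed here, is the drop-in GridSearchCV from the spark-sklearn package, which takes the SparkContext `sc`. The layer sizes, grid values and the small tuning sample `X_small, y_small` are illustrative.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
from spark_sklearn import GridSearchCV  # Spark-parallel drop-in for sklearn's

def build_model(units=50):
    # Small feed-forward classifier over the 14 High Level Features.
    model = Sequential([
        Dense(units, activation="relu", input_shape=(14,)),
        Dense(3, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

param_grid = {"units": [20, 50, 100], "epochs": [5, 10]}
search = GridSearchCV(sc, KerasClassifier(build_fn=build_model), param_grid)
search.fit(X_small, y_small)  # tune on a small dataset, as in the pipeline
print(search.best_params_)
```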
26. Model Development – DNN
• The model is instantiated with the Keras-compatible API provided by Analytics Zoo
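A sketch of what this instantiation looks like with Analytics Zoo's Keras-style API; the hidden-layer sizes are illustrative, not necessarily the ones used in the talk.

```python
from zoo.pipeline.api.keras.models import Sequential
from zoo.pipeline.api.keras.layers import Dense

# Feed-forward DNN over the 14 High Level Features, softmax over 3 classes.
model = Sequential()
model.add(Dense(50, activation="relu", input_shape=(14,)))
model.add(Dense(20, activation="relu"))
model.add(Dense(10, activation="relu"))
model.add(Dense(3, activation="softmax"))
```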
27. Model Development – GRU + HLF
• A more complex topology for the network, combining a GRU branch over the particle sequence with the High Level Features
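A hedged sketch of the combined topology, written here with the standard Keras functional API (the Analytics Zoo Keras-compatible API mirrors it): the 801x19 particle sequence feeds a GRU, and its encoding is concatenated with the 14 High Level Features before the softmax. All layer sizes are illustrative.

```python
from tensorflow.keras.layers import Input, Dense, GRU, Masking, concatenate
from tensorflow.keras.models import Model

seq_in = Input(shape=(801, 19))        # sorted particle sequence
x = Masking(mask_value=0.0)(seq_in)    # skip zero-padded particles
x = GRU(50)(x)                         # fixed-size encoding of the sequence

hlf_in = Input(shape=(14,))            # High Level Features
merged = concatenate([x, hlf_in])

hidden = Dense(25, activation="relu")(merged)
out = Dense(3, activation="softmax")(hidden)
model = Model(inputs=[seq_in, hlf_in], outputs=out)
```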
28. Distributed Training (see the sketch below)
• Instantiate the estimator using Analytics Zoo / BigDL
• The actual training is distributed to the Spark executors
• Store the model for later use
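A hedged sketch of those three steps with Analytics Zoo's NNEstimator, which wraps a model (here, the Analytics Zoo DNN from the sketch above) as a Spark ML estimator so that fit() runs BigDL training on the executors. The feature/label sizes, hyper-parameters and output path are illustrative.

```python
from zoo.pipeline.nnframes import NNEstimator
from bigdl.nn.criterion import CategoricalCrossEntropy

# 1. Instantiate the estimator around the Keras-style model defined earlier
#    (14 input features, 3 one-hot-encoded output classes).
estimator = (NNEstimator(model, CategoricalCrossEntropy(), [14], [3])
             .setBatchSize(4096)
             .setMaxEpoch(12)
             .setLearningRate(0.002))

# 2. fit() distributes the training over the Spark executors.
trained = estimator.fit(train_df)

# 3. Store the model for later use.
trained.model.saveModel("hdfs:///user/analytics/hlf_classifier.bigdl",
                        over_write=True)
```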
29. Performance and Scalability of Analytics Zoo & BigDL
Analytics Zoo & BigDL scale very well in the range tested.