Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Technologies

In this session, you will learn how CERN applied end-to-end deep learning and analytics pipelines on Apache Spark at scale for High Energy Physics, using the BigDL and Analytics Zoo open source software running on Intel Xeon-based distributed clusters. Technical details and development learnings are shared using an example of topology classification to improve real-time event selection at the Large Hadron Collider experiments. The classifier has demonstrated very good efficiency while also reducing the false-positive rate compared to existing methods. It could be used as a filter to improve the online event selection infrastructure of the LHC experiments, enabling a more flexible and inclusive selection strategy while reducing the amount of downstream resources wasted on processing false positives. This work is part of CERN’s research on applying deep learning and analytics with open source, industry-standard technologies as an alternative to existing customized rule-based methods. We show how distributed deep learning solutions and data pipelines can be quickly built and implemented at scale on Apache Spark using Analytics Zoo and BigDL, open source frameworks that unify analytics and AI on Spark with easy-to-use APIs and development interfaces seamlessly integrated with Big Data platforms.

1.Deep Learning on Apache Spark at CERN Large Hadron Collider with Intel Technologies 24th April 2019 Matteo Migliorini, Riccardo Castellotti, Luca Canali, Viktor Khristenko, Maria Girone

2.International organisation straddling the Swiss-French border, founded in 1954. Facilities for fundamental research in particle physics. 23 member states. 1.1 B CHF budget (~1.1 B USD). 3,000 members of personnel and 15,000 associated members from 90 countries. Curiosity: CERN has its own TLD!

3.Largest machine in the world: the 27 km-long Large Hadron Collider (LHC). Fastest racetrack on Earth: protons travel at 99.9999991% of the speed of light. Emptiest place in the solar system: particles circulate in the highest vacuum. Hottest spot in the galaxy: lead-ion collisions create temperatures 100,000x hotter than the heart of the Sun.


5.CMS: 22 m long, 15 m wide, weighs 14,000 tonnes. Most powerful superconducting solenoid ever built. 182 institutes from 42 countries.

6.From 40 million collisions per second (PB/s), the trigger selects 100,000 events per second (TB/s), then 1,000 events per second (GB/s). The collisions can generate up to a petabyte of data per second; the data is filtered in real time, selecting potentially interesting events (trigger).
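As a sanity check on the rates above, dividing the quoted petabyte per second by the 40 million collisions per second gives the implied raw size per event; the per-event figure below is derived here, not stated on the slide:

```python
# Back-of-the-envelope check of the trigger numbers: ~1 PB/s of detector
# signals at a 40 MHz collision rate. The per-event size is a derived
# figure, purely illustrative.
collisions_per_second = 40_000_000   # 40 MHz bunch-crossing rate
raw_rate_bytes = 1e15                # ~1 PB/s of electrical signals

bytes_per_event = raw_rate_bytes / collisions_per_second
print(f"~{bytes_per_event / 1e6:.0f} MB per collision event")  # ~25 MB

# The two selection stages quoted above reduce the rate by:
print(f"stage 1 keeps 1 in {collisions_per_second // 100_000}")  # 1 in 400
print(f"stage 2 keeps 1 in {100_000 // 1_000}")                  # 1 in 100
```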

7.The CERN Data Centre in Numbers: 15,000 servers with 280,000 cores; 280 PB of hot storage; 350 PB on tape (+88 PB from the LHC in 2018); 50,000 km of fiber optics.

8.Worldwide LHC Computing Grid
• Tier-0 (CERN): data recording, initial data reconstruction, data distribution
• Tier-1 (14 centres): permanent storage, re-processing, analysis
• Tier-2 (72 federations, ~150 centres): simulation, end-user analysis
In total: ~900,000 cores and ~1 EB of storage.

9.Hadoop and Spark Clusters at CERN
• Clusters: YARN/Hadoop and Spark on Kubernetes
• Hardware: Intel-based servers, continuous refresh and capacity expansion
• Accelerator logging (part of LHC infrastructure): 30 nodes (cores: 800, memory: 13 TB, storage: 7.5 PB)
• General purpose: 65 nodes (cores: 1.3k, memory: 20 TB, storage: 12.5 PB)
• Cloud containers: 60 nodes (cores: 240, memory: 480 GB)
• Storage: remote HDFS or EOS (for physics data)

10.Analytics Pipelines – Use Cases • Physics: • Analysis on computing metadata • Development of new ways to process Physics data, scale-out and ML • IT: • Analytics on IT monitoring data • Computer security • BE (Accelerators): • NXCALS – next generation accelerator logging platform • Industrial controls data and analytics

11.Analytics Platform Outlook. Integrating with existing infrastructure: software (HEP software) and data (experiment storage, HDFS, personal storage).

12.Analytics with SWAN: notebooks combining text, code, monitoring and visualizations. All the required tools, software and data available in a single window!

13.Extending Spark to Read Physics Data
• Physics data is stored in the EOS system and accessible via the XRootD protocol: the HDFS APIs were extended with a Hadoop-XRootD connector, bridging the Java Hadoop API to the C++ XRootD client via JNI
• The data is stored in ROOT format: a Spark Datasource was developed for it
• Currently ~300 PB in EOS storage, growing by ~90 PB per year of operation

14.Deep Learning Pipeline for Physics Data: a particle classifier assigns each collision event to one of three classes: W+jets, QCD, or t-t̅ (the slide shows class fractions of 36%, 63% and 1%).

15.Data challenges in physics
• Proton-proton collisions in the LHC happen at 40 MHz.
• Hundreds of TB/s of electrical signals allow physicists to investigate particle collision events.
• Storage is limited by bandwidth.
• Currently, only 1 in every 40K events is stored to disk (~10 GB/s).
• LHC in 2018: 5 collisions per beam crossing; High Luminosity-LHC in 2026: 400 collisions per beam crossing.

16.R&D • Improve the quality of filtering systems • Reduce false positives rate • From rule-based algorithms to Deep Learning classifiers • Advanced analytics at the edge • Avoid wasting resources in offline computing • Reduction of operational costs

17.Engineering Efforts to Enable Effective ML • From “Hidden Technical Debt in Machine Learning Systems”, D. Sculley et al. (Google), paper at NIPS 2015

18.Spark, Analytics Zoo & BigDL • Apache Spark • Leading tools and API for data processing at scale • Analytics Zoo is a platform for unified analytics and AI on Apache Spark leveraging BigDL / Tensorflow • For service developers: integration with infrastructure (hardware, data access, operations) • For users: Keras APIs to run user models, integration with Spark data structures and pipelines • BigDL is a distributed deep learning framework for Apache Spark

19.Data Pipeline
• Data Ingestion: read physics data and perform feature engineering
• Feature Preparation: prepare input features for deep learning
• Model Development: specify the model topology, then tune it on a small dataset
• Training: train the best model
Leveraging Apache Spark and Analytics Zoo in Python Notebooks

20.The Dataset ● Software simulators generate events and calculate the detector response ● Every event is an 801x19 matrix: for every particle, the momentum, position, energy, charge and particle type are given
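To make the event layout concrete, here is a minimal sketch (not CERN’s actual code) of one event as an 801x19 matrix. The zero-padding of unused particle slots and the random feature values are assumptions for illustration only:

```python
import numpy as np

# One simulated event: rows are particles (padded up to 801), columns are
# the 19 per-particle features (momentum, position, energy, charge,
# particle type, ...). Exact column layout is an assumption.
N_PARTICLES, N_FEATURES = 801, 19

def make_event(n_particles, rng):
    """Build one event matrix, padding unused particle slots with zeros."""
    event = np.zeros((N_PARTICLES, N_FEATURES), dtype=np.float32)
    event[:n_particles] = rng.standard_normal((n_particles, N_FEATURES))
    return event

rng = np.random.default_rng(42)
event = make_event(n_particles=120, rng=rng)
print(event.shape)                           # (801, 19)
print(int(np.count_nonzero(event.any(axis=1))))  # 120 real particles
```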

21.Data Ingestion
• Read input files (54 M events, ~4 TB) in a custom format from the physics data storage
• Compute physics-motivated features
• Store to Parquet format: 750 GB on HDFS

22.Feature Engineering
• From the 19 features recorded in the experiment, 14 more are calculated based on domain-specific knowledge: these are called High Level Features (HLF)
• A sorting metric is defined to create a sequence of particles to be fed to a sequence-based classifier
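The sorting step can be sketched as follows; the real metric is domain-specific, so this example assumes ordering by transverse momentum pT in decreasing order purely for illustration:

```python
import math

# Illustrative sketch: ordering particles with a sorting metric before
# feeding them to a sequence classifier. The assumed metric here is
# transverse momentum pT = sqrt(px^2 + py^2), largest first.
particles = [
    {"px": 3.0, "py": 4.0, "type": "muon"},      # pT = 5.0
    {"px": 6.0, "py": 8.0, "type": "electron"},  # pT = 10.0
    {"px": 0.3, "py": 0.4, "type": "photon"},    # pT = 0.5
]

def pt(p):
    """Transverse momentum of one particle."""
    return math.hypot(p["px"], p["py"])

sequence = sorted(particles, key=pt, reverse=True)
print([p["type"] for p in sequence])  # ['electron', 'muon', 'photon']
```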

23.Feature Preparation • All features need to be converted to a format consumable by the network • One Hot Encoding of categories • Sort the particles for the sequence classifier with a UDF • Executed in PySpark using Spark SQL and ML

24.Models Investigated, in order of increasing complexity and performance:
1. Fully connected feed-forward DNN with High Level Features
2. DNN with a recurrent layer (based on GRUs)
3. Combination of (1) + (2)

25.Hyper-Parameter Tuning – DNN • Once the network topology is chosen, hyper-parameter tuning is done with scikit-learn + Keras and parallelized with Spark *Other names and brands may be claimed as the property of others.
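The parallelization pattern can be sketched as follows: each hyper-parameter combination trains independently, so the grid maps naturally onto Spark tasks. The grid values and the dummy scoring function below are illustrative stand-ins for the real Keras training runs:

```python
from itertools import product

# Sketch of a parallelizable grid search. Each combination is independent,
# so with Spark this map would be sc.parallelize(combos).map(train_and_score);
# here it runs serially. Grid values and scoring are illustrative only.
param_grid = {
    "lr": [0.001, 0.01],
    "batch_size": [128, 256],
    "hidden_units": [50, 100],
}

def train_and_score(params):
    """Placeholder for fitting a Keras model and returning a validation score."""
    return 0.9 - params["lr"]  # dummy score, just to make the sketch run

combos = [dict(zip(param_grid, values))
          for values in product(*param_grid.values())]

scores = [(train_and_score(c), c) for c in combos]
best_score, best_params = max(scores, key=lambda t: t[0])
print(len(combos), best_params["lr"])  # 8 combinations; best lr = 0.001
```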

26.Model Development – DNN • Model is instantiated with the Keras- compatible API provided by Analytics Zoo
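As a sketch of what such a model might look like: the slides build the model with the Keras-compatible API of Analytics Zoo, but plain tf.keras is used below so the snippet is self-contained; the hidden-layer sizes are assumptions, with 14 HLF inputs and 3 output classes:

```python
import numpy as np
import tensorflow as tf

# Sketch of the HLF classifier (assumed layer sizes, not the published
# topology). Input: the 14 High Level Features; output: softmax over the
# 3 event classes (W+jets, QCD, t-t).
model = tf.keras.Sequential([
    tf.keras.Input(shape=(14,)),
    tf.keras.layers.Dense(50, activation="relu"),
    tf.keras.layers.Dense(20, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")

# One probability triple per input event:
probs = model.predict(np.zeros((2, 14), dtype=np.float32), verbose=0)
print(probs.shape)  # (2, 3)
```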

27.Model Development – GRU+HLF A more complex topology for the network

28.Distributed Training
• Instantiate the estimator using Analytics Zoo / BigDL
• The actual training is distributed to Spark executors
• Store the model for later use

29.Performance and Scalability of Analytics Zoo & BigDL Analytics Zoo & BigDL scales very well in the range tested