Accelerating Astronomical Discoveries with Apache Spark

Our research group is investigating how to leverage Apache Spark (batch, streaming & real-time) to analyse current and future data sets in astronomy. Among the future large experiments, the Large Synoptic Survey Telescope (LSST) will start soon collecting terabytes of data per observation night, and the efficient processing and analysis of both real-time and historical data remains a major challenge. In this talk we will expose the main challenges and explore the latest developments tailored for big data problems in astronomy.

On the one hand we designed a new Data Source API extension to natively manipulate telescope images and astronomical tables within Apache Spark. We then extended the functionalities of the Apache Spark SQL module to ease the manipulation of 3D data sets and perform efficient queries: partitioning, data sets join and cross-match, nearest neighbors search, spatial queries, and more.

On the other hand we are using the new possibilities offered by Structured Streaming APIs in recent Apache Spark versions to enable real-time decisions by rapidly accessing and analysing the alerts sent by telescopes every


1.Accelerating Astronomical Discoveries with Apache Spark Julien Peloton, CNRS #UnifiedDataAnalytics #SparkAISummit 1

2.XXIst century astronomy 2

3.How we can get different data? ~1/100,000 of sky l al w m llo ts a sh bu t p bu ee e D Hubble FoV rg La #UnifiedDataAnalytics #SparkAISummit 3

4.Large Synoptic Survey Telescope 2022-2032: Deep & large survey Non-profit corporation Site: Chile (Cerro Pachón) US led, international collaboration (1000+) #UnifiedDataAnalytics #SparkAISummit 4

5.Million pieces puzzle • LSST will deliver ~full sky map every 3 nights – 3.2 Gigapixels camera (car size!) – 15 TB/night of raw image data collected ? – 1 TB/night of alerts streamed #UnifiedDataAnalytics #SparkAISummit 5

6.Apache Spark for astronomy? We would like to be able to do at scale: • Exploring large catalogs of data • Cross-matching large catalogs • Processing telescope images • Classifying light-curves • Processing telescope alerts • ... #UnifiedDataAnalytics #SparkAISummit 6

7.FITS: astronomical data format • First (last) release: 1981 (2016). • Endorsed by NASA and the International Astronomical Union. • Multi-purposes: vectors, images, tables, ... • Backward compatible • Set of blocks.1 block: ASCII header+binary data arrays of arbitrary dimension • Support for C, C++, C#, Fortran, IDL, Java, Julia, MATLAB, Perl, Python, R, and more… #UnifiedDataAnalytics #SparkAISummit 7

8.spark-fits • FITS data source for Spark SQL and DataFrames. • Data Source V1 API. • Images + tables available. • Schema automatically inferred from the FITS header. #UnifiedDataAnalytics #SparkAISummit 8

9. spark-fits in practice • Spark 2.3.1 / Hadoop 2.8.4 • 1.1 billion rows, 153 cores • Run it 100 times (no cache). • Performances (IO throughput) comparable to other built-in Spark connectors (no attempt to optimise anything anywhere…) #UnifiedDataAnalytics #SparkAISummit 9

10.Current limitations Some limitations currently though… • Need to migrate to Apache Spark DSv2. • No column pruning, no filters at the level of the connector. • (De)Compression is not handled yet. • Scala FITS library lacks of many features. #UnifiedDataAnalytics #SparkAISummit 10

11.We live in a 3D world • Manipulating 2D data with Spark: Geotrellis, Magellan, Geospark, GeoMesa, … • Very little about 3D! • Need for e.g. astronomy, particle physics, meteorology. #UnifiedDataAnalytics #SparkAISummit 11

12. Manipulating 3D spatial data: spark3D • 3D distributed partitioning – KDTree, Octree, shells, ... • Distributed spatial queries & data mining – KNN, join, dbscan, … – Typical usage on million/billion rows • Visualisation – Client/server architecture Student: Mayur Bhosale (now at Qubole) #UnifiedDataAnalytics #SparkAISummit 12

13. On the repartitioning... Frequent as data comes unstructured, but • Repartitioning implies heavy shuffle between executors. • Complex UDF in Spark are often inefficient. #UnifiedDataAnalytics #SparkAISummit 13

14.Need for (efficient) streaming • We explored the static sky - namely what has been observed. • But what about what is happening right now? E.g. – Supernovae (star explosion) – Black hole merger counterparts (multi-messenger astronomy) – Micro-lensing (extrasolar planet search) – Earth killers! – Anomaly detection (unforeseen astronomical sources) • Correlation past/present/future? • Timescales range from seconds to months... #UnifiedDataAnalytics #SparkAISummit 14

15.Desiderata & solution We would like • To work efficiently at scale • Multi-modals analytics capability (streaming & batch) • Good integration with the current ecosystem Structured Streaming #UnifiedDataAnalytics #SparkAISummit 15

16.Introducing Fink Di str 03 ibu Fink is te • A broker system for sky alerts Collect • Based on Apache Spark 02 Fink does • Collect, enrich & distribute sky i c h 01 r alerts En #UnifiedDataAnalytics #SparkAISummit 16

17. Di 03 str ibu te On a quiet night... Collect 02 Credits: E. Bellm 01 h ric En • 10,000 Avro alerts every 30 seconds • 1TB alerts per night Observation • Parquet Database Template Difference #UnifiedDataAnalytics #SparkAISummit 17

18. Di 03 str ibu te Who’s who Collect 02 01 h ric Add values to the raw alerts En • Stream-static join • Classification (BNN) Alert database Alert Internal stream catalogs Structured Structured Streaming Streaming Alert Alert database database #UnifiedDataAnalytics #SparkAISummit 18

19. Di 03 str ibu te Joining external information Collect 02 01 h ric En Spark does all the hard work Optical Neutrino alert stream alert stream • Small delays • Record throughput • Stream position recovery Gamma ray alert stream But it cannot do everything... Join Structured output • Large delays Gravitational Streaming wave • False positives alert stream Still need humans to take decisions #UnifiedDataAnalytics #SparkAISummit 19

20. Di 03 str ibu te The Hero’s Return Collect 02 01 h ric En Processing based on Adaptive Learning (PoC) • Ranking of promising candidates • Improved classification over time Follow-up & Training New Candidates Discovery Streaming infrastructure by: Abhishek Chauhan (now at Morgan Stanley) 20

21. The fear of the shutdown! What if we miss a night? 100 minutes on 3 machines • 14 million alerts, 830 GB of data • Let Spark do the hard work again Collect alerts Broker shutdown…. Collect & write (offsets, updates...) (cache) Limiting factors • Number of machines • Network 21

22.Some lessons learned Handling stream offsets • Manual or not? Still not obvious... Schema evolution • User needs change often… Database choice is crucial Dynamic filtering • Need to adapt quickly to new situations Handling watermarks • How long shall we wait for data? Switch to post-processing. Communication • Using common communication protocols & data format... #UnifiedDataAnalytics #SparkAISummit 22

23.Thanks! You have a public/private project in mind? You want to contribute to astronomy? Come talk to me! #UnifiedDataAnalytics #SparkAISummit 23


由Apache Spark PMC & Committers发起。致力于发布与传播Apache Spark + AI技术,生态,最佳实践,前沿信息。