Our research group is investigating how to leverage Apache Spark (batch, streaming and real-time) to analyse current and future data sets in astronomy. Among the upcoming large experiments, the Large Synoptic Survey Telescope (LSST) will soon start collecting terabytes of data per observation night, and efficiently processing and analysing both real-time and historical data remains a major challenge. In this talk we expose the main challenges and explore the latest developments tailored to big data problems in astronomy.
On the one hand, we designed a new Data Source API extension to natively manipulate telescope images and astronomical tables within Apache Spark. We then extended the functionality of the Apache Spark SQL module to ease the manipulation of 3D data sets and perform efficient queries: partitioning, data set joins and cross-matching, nearest-neighbour search, spatial queries, and more.
On the other hand, we are using the possibilities offered by the Structured Streaming API in recent Apache Spark versions to enable real-time decisions by rapidly accessing and analysing the alerts sent by telescopes every night. Given the unprecedented precision of the next generation of telescopes, the alert streams will contain millions of alerts per night, and relying on Structured Streaming is a guarantee of not missing the latest black hole event in a sea of data! We will also share the active learning developments built on top to improve real-time event selection and classification for the LSST telescope.
You will walk away with an understanding of modern challenges in astronomy, an appreciation of some beautiful night skies, and a sense of how Apache Spark can help push the frontiers of science further!
3. How can we get different data?
[Figure: sky coverage comparison, showing ~1/100,000 of the sky: a wide survey field of view is large but shallow, while the Hubble FoV is deep but small.]
4. Large Synoptic Survey Telescope
2022-2032: Deep & large survey
Non-profit corporation
Site: Chile (Cerro Pachón)
US-led, international collaboration (1000+)
5. A million-piece puzzle
• LSST will deliver a ~full-sky map every 3 nights
– 3.2-gigapixel camera (the size of a car!)
– 15 TB/night of raw image data collected
– 1 TB/night of alerts streamed
6. What we would like to be able to do at scale:
• Exploring large catalogs of data
• Cross-matching large catalogs
• Processing telescope images
• Classifying light-curves
• Processing telescope alerts
• ...
Apache Spark for astronomy?
7. FITS: astronomical data format
• First (last) release: 1981 (2016).
• Endorsed by NASA and the International
Astronomical Union.
• Multi-purposes: vectors, images, tables, ...
• Backward compatible
• Set of blocks.1 block: ASCII header+binary
data arrays of arbitrary dimension
• Support for C, C++, C#, Fortran, IDL, Java,
Julia, MATLAB, Perl, Python, R, and more…
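To make the block structure concrete, here is a minimal Scala sketch (standard library only, file path is a placeholder) that prints a primary header: FITS headers come in 2880-byte blocks of 80-character ASCII cards, terminated by an END card.

  import java.nio.file.{Files, Paths}

  object FitsHeaderSketch {
    def main(args: Array[String]): Unit = {
      // Fine for a small file; a real reader would stream 2880-byte blocks.
      val bytes = Files.readAllBytes(Paths.get("/path/to/image.fits"))  // placeholder path
      bytes
        .grouped(80)                        // one header card = 80 ASCII characters
        .map(card => new String(card, "US-ASCII"))
        .takeWhile(!_.startsWith("END"))    // the header is terminated by an END card
        .foreach(println)                   // e.g. "SIMPLE  =                    T"
    }
  }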
8. spark-fits
• FITS data source for Spark SQL and DataFrames.
• Data Source V1 API.
• Images + tables available.
• Schema automatically inferred from the FITS header.
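In practice, reading FITS data with spark-fits follows its documented usage: point spark.read at the file with format "fits" and pick the HDU. The path and HDU index below are placeholders, and the spark-fits package must be on the classpath.

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("spark-fits demo").getOrCreate()

  // Read one HDU of a FITS file as a DataFrame; the schema is
  // inferred automatically from the FITS header.
  val df = spark.read
    .format("fits")
    .option("hdu", 1)                        // HDU index to read (0 = primary)
    .load("hdfs:///path/to/catalog.fits")    // placeholder path

  df.printSchema()
  df.show(5)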
9. spark-fits in practice
• Spark 2.3.1 / Hadoop 2.8.4
• 1.1 billion rows, 153 cores
• Run it 100 times (no cache).
• Performance (IO throughput) comparable to other built-in Spark connectors (with no attempt to optimise anything anywhere…)
10. Current limitations
Some limitations currently, though…
• Need to migrate to the Apache Spark Data Source V2 (DSv2) API.
• No column pruning and no filter push-down at the connector level (see the DSv2 sketch below).
• (De)compression is not handled yet.
• The Scala FITS library lacks many features.
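For reference, this is roughly what column pruning support would look like after the DSv2 migration: the scan builder implements Spark's SupportsPushDownRequiredColumns, and Spark hands it the pruned schema. The FITS-specific class names here are hypothetical; only the Spark interfaces are real.

  import org.apache.spark.sql.connector.read.{Scan, ScanBuilder, SupportsPushDownRequiredColumns}
  import org.apache.spark.sql.types.StructType

  // Hypothetical DSv2 scan builder for a FITS connector: Spark calls
  // pruneColumns() with only the columns the query needs, so the reader
  // can skip decoding the others.
  class FitsScanBuilder(fullSchema: StructType) extends ScanBuilder
      with SupportsPushDownRequiredColumns {

    private var requiredSchema: StructType = fullSchema

    override def pruneColumns(required: StructType): Unit = {
      requiredSchema = required              // remember the pruned schema
    }

    override def build(): Scan = new FitsScan(requiredSchema)
  }

  // Minimal Scan stub; a real connector would also implement toBatch().
  class FitsScan(schema: StructType) extends Scan {
    override def readSchema(): StructType = schema
  }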
11. We live in a 3D world
• Manipulating 2D data with Spark: GeoTrellis, Magellan, GeoSpark, GeoMesa, …
• Very little about 3D!
• Needed for e.g. astronomy, particle physics, meteorology.
12. Manipulating 3D spatial data: spark3D
• 3D distributed partitioning
– KDTree, Octree, shells, ...
• Distributed spatial queries & data mining
– KNN, join, DBSCAN, …
– Typical usage on millions to billions of rows
• Visualisation
– Client/server architecture
Student: Mayur Bhosale (now at Qubole)
13. On the repartitioning...
Repartitioning is frequent, as the data arrives unstructured, but:
• Repartitioning implies a heavy shuffle between executors.
• Complex UDFs in Spark are often inefficient.
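To make the partitioning idea concrete, here is an illustrative sketch in plain Spark SQL (not spark3D's actual API): derive a coarse grid cell key per row and repartition by it, so nearby points land in the same partition. The paths, column names, and cell size are all assumptions.

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.{col, concat_ws, floor}

  val spark = SparkSession.builder().appName("3d-partitioning sketch").getOrCreate()

  // Hypothetical catalog with Cartesian coordinates x, y, z.
  val points = spark.read.parquet("/path/to/points.parquet")   // placeholder path

  // Assign each point to a coarse grid cell: nearby points share a cell key,
  // so repartitioning by the key co-locates them (at the cost of one shuffle).
  val cellSize = 10.0                                          // assumed cell size
  val keyed = points.withColumn("cell",
    concat_ws("_",
      floor(col("x") / cellSize),
      floor(col("y") / cellSize),
      floor(col("z") / cellSize)))

  val partitioned = keyed.repartition(col("cell"))

An octree refines the same idea by adapting the cell size to the local point density, which keeps partitions balanced.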
14. Need for (efficient) streaming
• So far we explored the static sky, namely what has already been observed.
• But what about what is happening right now? E.g.
– Supernovae (star explosion)
– Black hole merger counterparts (multi-messenger astronomy)
– Micro-lensing (extrasolar planet search)
– Earth killers!
– Anomaly detection (unforeseen astronomical sources)
• Correlating past, present, and future?
• Timescales range from seconds to months...
15. Desiderata & solution
We would like:
• To work efficiently at scale
• Multi-modal analytics capability (streaming & batch)
• Good integration with the current ecosystem
Solution: Structured Streaming
16. Introducing Fink
Fink is
• A broker system for sky alerts
• Based on Apache Spark
Fink does
• Collect, enrich & distribute sky alerts
[Diagram: Fink's three steps: 01 Collect, 02 Enrich, 03 Distribute.]
17. On a quiet night...
• 10,000 Avro alerts every 30 seconds
• 1 TB of alerts per night
• Parquet database
[Image panels: Observation, Template, Difference. Credits: E. Bellm]
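A hedged sketch of this collect step: read the Avro alerts from Kafka with Structured Streaming, decode them with from_avro, and archive them to the Parquet database. The broker address, topic, schema file, and paths are all placeholders, and the spark-avro package must be on the classpath.

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.avro.from_avro           // Spark 2.4; moved to ...avro.functions in Spark 3
  import org.apache.spark.sql.streaming.Trigger

  val spark = SparkSession.builder().appName("collect sketch").getOrCreate()
  import spark.implicits._

  // Avro schema of the alerts (JSON format), read from a schema file.
  val alertSchema = new String(java.nio.file.Files.readAllBytes(
    java.nio.file.Paths.get("/path/to/alert.avsc")))   // placeholder path

  val raw = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  // placeholder broker
    .option("subscribe", "alerts")                     // placeholder topic
    .load()

  // Kafka delivers binary payloads; decode them with the Avro schema,
  // then flatten the struct and archive everything as Parquet.
  val query = raw
    .select(from_avro($"value", alertSchema).as("alert"))
    .select("alert.*")
    .writeStream
    .format("parquet")
    .option("path", "/path/to/alert_db")               // placeholder path
    .option("checkpointLocation", "/path/to/ckpt")     // placeholder path
    .trigger(Trigger.ProcessingTime("30 seconds"))
    .start()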
18. Who’s who
Add value to the raw alerts:
• Stream-static join
• Classification (BNN)
[Diagram: the alert stream and internal catalogs feed Structured Streaming, which writes enriched alerts to the alert database.]
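The stream-static join maps directly onto the Structured Streaming API: a streaming DataFrame can be joined with a static one without extra machinery. A minimal sketch, where the paths, schema, and the objectId join key are assumptions (a real cross-match would join on sky position):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("stream-static join sketch").getOrCreate()

  // Streaming alerts, assumed already decoded into columns.
  val alerts = spark.readStream
    .format("parquet")
    .schema("objectId STRING, ra DOUBLE, dec DOUBLE, magpsf DOUBLE")  // hypothetical schema
    .load("/path/to/incoming_alerts")                                 // placeholder path

  // Static internal catalog, loaded once (e.g. known variable stars).
  val catalog = spark.read.parquet("/path/to/internal_catalog.parquet")

  // Stream-static join: supported natively by Structured Streaming.
  val enriched = alerts.join(catalog, Seq("objectId"), "left_outer")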
19. Joining external information
[Diagram: neutrino, gamma-ray, optical, and gravitational-wave alert streams are joined by Structured Streaming into a single output.]
Spark does all the hard work:
• Small delays
• Record throughput
• Stream position recovery
But it cannot do everything...
• Large delays
• False positives
Still need humans to make decisions.
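Joining several live streams as above relies on stream-stream joins (available since Spark 2.3): each side needs an event-time watermark, and the join condition must bound the time difference so Spark knows how long to buffer state. A hedged sketch with assumed paths, schemas, and match thresholds:

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.expr

  val spark = SparkSession.builder().appName("stream-stream join sketch").getOrCreate()

  // Two alert streams with hypothetical position and event-time columns.
  val optical = spark.readStream.format("parquet")
    .schema("oId STRING, oRa DOUBLE, oDec DOUBLE, oTime TIMESTAMP")
    .load("/path/to/optical")                  // placeholder path
    .withWatermark("oTime", "1 hour")

  val neutrino = spark.readStream.format("parquet")
    .schema("nId STRING, nRa DOUBLE, nDec DOUBLE, nTime TIMESTAMP")
    .load("/path/to/neutrino")                 // placeholder path
    .withWatermark("nTime", "1 hour")

  // Coarse positional match (ignoring RA wrap-around and spherical geometry)
  // plus a bounded time window, so Spark can discard old buffered state.
  val matched = optical.join(neutrino, expr("""
    abs(oRa - nRa) < 0.5 AND
    abs(oDec - nDec) < 0.5 AND
    nTime BETWEEN oTime - INTERVAL 1 HOUR AND oTime + INTERVAL 1 HOUR
  """))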
20. The Hero’s Return
Processing based on Active Learning (PoC)
• Ranking of promising candidates
• Improved classification over time
[Diagram: active learning loop: Training → New Candidates → Follow-up & Discovery → back to Training.]
Streaming infrastructure by: Abhishek Chauhan (now at Morgan Stanley)
21. The fear of the shutdown!
What if we miss a night?
• 14 million alerts, 830 GB of data
• Let Spark do the hard work again (offsets, updates...)
[Plot: after a broker shutdown, alerts are collected (cached) and then written in ~100 minutes on 3 machines.]
Limiting factors
• Number of machines
• Network
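The recovery leans on Structured Streaming checkpointing: the consumed Kafka offsets are committed to the checkpoint directory, so restarting the same query replays the missed night from the last commit. A minimal sketch, with broker address, topic, and paths as placeholders:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("recovery sketch").getOrCreate()

  val raw = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  // placeholder broker
    .option("subscribe", "alerts")                     // placeholder topic
    .option("startingOffsets", "earliest")             // only used on the very first start
    .load()

  // The checkpoint directory persists the committed Kafka offsets:
  // restarting the same query resumes from the last committed offset.
  val query = raw.writeStream
    .format("parquet")
    .option("path", "/path/to/alert_db")               // placeholder path
    .option("checkpointLocation", "/path/to/ckpt")     // placeholder path
    .start()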
22. Some lessons learned
Handling stream offsets
• Manual or not? Still not obvious...
Schema evolution
• User needs change often… The database choice is crucial.
Dynamic filtering
• Need to adapt quickly to new situations.
Handling watermarks
• How long shall we wait for data? We switched to post-processing (see the sketch after this list).
Communication
• Use common communication protocols & data formats...
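The watermark trade-off is explicit in the API: withWatermark states how long to wait for late data before a window is finalised, and anything arriving later is left to post-processing. A short sketch, with column names, paths, and thresholds as assumptions:

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.{count, window}

  val spark = SparkSession.builder().appName("watermark sketch").getOrCreate()
  import spark.implicits._

  val alerts = spark.readStream
    .format("parquet")
    .schema("objectId STRING, alertTime TIMESTAMP")    // hypothetical columns
    .load("/path/to/incoming_alerts")                  // placeholder path

  // Wait at most 30 minutes for late alerts before a 10-minute window is
  // finalised; anything arriving later is left to offline post-processing.
  val counts = alerts
    .withWatermark("alertTime", "30 minutes")
    .groupBy(window($"alertTime", "10 minutes"))
    .agg(count("*").as("nAlerts"))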
23. Thanks!
Do you have a public or private project in mind? Do you want to contribute to astronomy?
Come talk to me!