The next decade promises to be exciting for both astronomy and computer science, with a number of large-scale astronomical surveys in preparation. One of the most important is the Large Synoptic Survey Telescope, or LSST. LSST will produce the first ‘video’ of the deep sky in history by continually scanning the visible sky, taking one 3.2-gigapixel image every 20 seconds. In this talk we describe LSST’s unique design and how its image processing pipeline produces catalogs of astronomical objects. To process and quickly cross-match catalog data we built AXS (Astronomy eXtensions for Spark), a system based on Apache Spark. We explain its design and what is behind its great cross-matching performance.
2. Petar Zečević, SV Group, University of Zagreb
Mario Jurić, DIRAC Institute, University of Washington
AXS - Astronomical Data
Processing on the LSST
Scale with Apache Spark
#UnifiedDataAnalytics #SparkAISummit
3. About us
Mario Jurić
• Prof. of Astronomy at the University of Washington
• Founding faculty of DIRAC & eScience Institute Fellow
• Fmr. lead of LSST Data Management
Petar Zečević
• CTO at SV Group, Croatia
• CS PhD student at University of Zagreb
• Visiting Fellow at the DIRAC Institute @ UW
• Author of “Spark in Action”
7. Hipparchus of Rhodes (180-125 BC)
In 129 BC, constructed one of the first star
catalogs, containing about 850 stars.
8. Galileo Galilei (1564-1642)
Researched a variety of topics in physics,
but called out here for the introduction of
the Galilean telescope.
Galileo’s telescope allowed us for the first
time to zoom in on the cosmos, and study
the individual objects in great detail.
9. The Astrophysics Two-Step
• Surveys
– Construct catalogs and maps of objects in the sky. Focus on coarse
classification and discovering targets for further follow-up.
• Large telescopes
– Acquire detailed observations of a few representative objects.
Understand the details of astrophysical processes that govern them,
and extrapolate that understanding to the entire class.
10. The Story of Astronomy:
2000 Years of being Data Poor
11. Sloan Digital Sky Survey
2.5m telescope • >14,500 deg2 • 0.1” astrometry • r<22.5 flux limit
5-band, 1% photometry for over 900M stars
Over 3M R=2000 spectra
10 years of ops: ~10 TB of imaging
12. 1,231,051,050 rows (SDSS DR10, PhotoObjAll table)
~500 columns
Facilitated the development of large databases, data-driven discovery, and the movement toward what we recognize as Data Science today.
13. Panoramic Survey Telescope and Rapid Response System
1.8m telescope • 30,000 deg2 • 50mas astrometry • r<23 flux limit
5-band, better than 1% photometry (goal)
~700 GB/night
15. First Light: 2020 • Operations: 2022
Deep (24th mag), Wide (60% of the sky), Fast (every 15 seconds)
Largest astronomical camera in the world
Will repeatedly observe the night sky over 10 years
10 million alerts each night (60-second latency)
37 billion astronomical sources, with time series
30 trillion measurements
The Large Synoptic Survey Telescope
A Public, Deep, Wide and Fast, Optical Sky Survey
16. Overview
LSST’s mission is to build a well-understood system that
provides a vast astronomical dataset for unprecedented
discovery of the deep and dynamic universe.
17. The Scale of Things to Come
Metric Amount
Number of detections 7 trillion rows
Number of objects 37 billion rows
Nightly alert rate 10 million
Nightly data rate >15 TB
Alert latency 60 seconds
Total images after 10 yrs 50 PB
Total data after 10 yrs 83 PB
Objects detected, measured, and stored in queryable catalogs (tables)
18. Catalog-driven Science
• Once a catalog is available, astronomers “ask” all kinds of questions
• The traditional paradigm:
– Subset (filter data using a catalog SQL interface online)
– Download data locally
– Analyze (usually Python)
19. Challenges (part 0)
Dataset Size
(keeping ~PBs of data in RDBMSes is not easy, or cheap)
What do you do when the dataset subset is a few ~TBs?
20. Challenges (part 1)
Better Together (joining datasets is powerful)
I Want it All (interesting science with whole-dataset operations)
Dataset Size (keeping ~TBs of data in RDBMSes is not easy)
21. Challenges (part 2)
Scalability (how do I write analysis code that will scale to petabytes of data?)
Resources (where are the resources to run this code?)
How do you scale exploratory data analysis to ~PB-sized datasets and thousands of simultaneous users?
22. Enter Spark, AXS
• AXS: Astronomy eXtensions for Spark
• The main idea:
– Spark is a proven, scalable, cloud-ready and widely-supported analytics framework with full SQL support (important for legacy code).
– Extend it to exploratory data analysis.
– Add a scalable positional cross-match operator
– Add a domain-specific Python API layer to PySpark
– Couple to S3 API for storage, Kubernetes for orchestration…
• … A scalable platform supporting an arbitrarily sized dataset and a
large number of users, deployable on either public or private cloud.
23. Key Issue: Scalable Cross-matching
[Figure: two catalogs overlaid on RA/Dec coordinates; a search perimeter around each source defines a match (a similarity measure can also be used).]
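The match criterion itself is just angular separation on the sphere. Below is a minimal, pure-Python sketch of the test applied to each candidate pair; the function names and the 2-arcsecond default are illustrative, not the AXS API.

```python
import math

def angular_sep_deg(ra1, dec1, ra2, dec2):
    """Angular separation between two sky positions (all in degrees),
    via the haversine formula for numerical stability at small angles."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    a = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    return math.degrees(2.0 * math.asin(math.sqrt(a)))

def is_match(src_a, src_b, radius_arcsec=2.0):
    """Two detections match if they lie within the search radius.
    src_a, src_b: (ra, dec) tuples in degrees."""
    sep_deg = angular_sep_deg(src_a[0], src_a[1], src_b[0], src_b[1])
    return sep_deg * 3600.0 <= radius_arcsec
```

The scalability problem is not this per-pair test but avoiding the all-pairs comparison, which is what the partitioning on the next slides addresses.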
24. AXS data partitioning
• Data partitioning is at the root of AXS' efficient cross-matching
• Based on the late Jim Gray's “zones algorithm” (Microsoft Research)
• Sky divided into horizontal “zones” of a certain height
• Adapted for distributed architectures
• Data stored in Parquet files
– bucketed by zone
– sorted by zone and ra columns
– data from zone borders duplicated to the zone below
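A single-process sketch of the zones idea follows; the zone height, duplication radius, and function names are illustrative choices of ours, not AXS defaults.

```python
import math

def zone_of(dec, zone_height_deg):
    """Zone index for a declination in degrees, counted from the south pole."""
    return int(math.floor((dec + 90.0) / zone_height_deg))

def partition_with_duplicates(rows, zone_height_deg, dup_radius_deg):
    """Assign each (ra, dec) row to a zone. Rows within the cross-match
    radius of their zone's lower border are also duplicated into the zone
    below, so a later match never has to look beyond a single zone."""
    out = []
    for ra, dec in rows:
        z = zone_of(dec, zone_height_deg)
        out.append((z, ra, dec))
        lower_edge = z * zone_height_deg - 90.0  # bottom edge of this zone
        if z > 0 and dec - lower_edge < dup_radius_deg:
            out.append((z - 1, ra, dec))  # duplicate into the zone below
    return sorted(out)  # (zone, ra) order, mirroring the Parquet layout
```

Sorting the output by (zone, ra) reproduces, in miniature, the bucketed-and-sorted Parquet layout that makes the join on the next slides cheap.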
28. Epsilon join
SELECT ... FROM TA, TB
WHERE TA.zone = TB.zone
  AND TA.ra BETWEEN TB.ra - e AND TB.ra + e
SPARK-24020: sort-merge join “inner range optimization”
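The reason bucketing and sorting matter: with both sides ordered by (zone, ra), this join can be evaluated with a forward-only sliding window rather than a rescan per row, which is essentially what the SPARK-24020 inner range optimization does inside the sort-merge join. Here is a toy in-memory version (our own illustration, not AXS or Spark code):

```python
def epsilon_join(ta, tb, eps):
    """Single-process sketch of the epsilon join.
    ta, tb: lists of (zone, ra) tuples, each sorted by (zone, ra).
    Returns pairs with equal zones and |ra_a - ra_b| <= eps."""
    pairs = []
    start = 0  # window start into tb; only ever moves forward
    for zone_a, ra_a in ta:
        # skip tb rows that can no longer match any remaining ta row
        while start < len(tb) and tb[start] < (zone_a, ra_a - eps):
            start += 1
        i = start
        while i < len(tb) and tb[i] <= (zone_a, ra_a + eps):
            zone_b, ra_b = tb[i]
            if zone_b == zone_a:  # same zone and ra within the window
                pairs.append(((zone_a, ra_a), (zone_b, ra_b)))
            i += 1
    return pairs
```

Because `start` never moves backward, the scan over each bucket is near-linear instead of quadratic, which is where the cross-matching performance comes from.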
30. AXS performance results
Gaia (1.7 B) x SDSS (800 M)
37s warm (148s cold)
Gaia (1.7 B) x ZTF (2.9 B)
39s warm (315s cold)
Left: tests on a single large
machine. An AWS deployment
scales out nearly linearly, as
long as there are sufficient
partitions in the dataset.
32. AXS - other functionalities
• crossmatch (return all or the first crossmatch candidate)
• region queries
• cone queries
• histogram
• histogram2d
• Spark array functions for handling lightcurve data
• All other Spark functions
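As an illustration of what a call like histogram2d computes, here is the underlying binning logic in plain Python; the real AXS version runs distributed over Spark, and this signature is ours, not the AXS API.

```python
def histogram2d(points, x_min, x_max, y_min, y_max, nx, ny):
    """Bin (x, y) points onto an nx-by-ny grid; points outside the
    bounds are dropped. Returns a nested list counts[ix][iy]."""
    counts = [[0] * ny for _ in range(nx)]
    dx = (x_max - x_min) / nx
    dy = (y_max - y_min) / ny
    for x, y in points:
        if not (x_min <= x < x_max and y_min <= y < y_max):
            continue  # outside the requested bounds
        ix = int((x - x_min) / dx)
        iy = int((y - y_min) / dy)
        counts[ix][iy] += 1
    return counts
```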
33. Astronomy Example: Computing Light
Curve Features with Python UDFs
This works on arbitrarily large datasets!
Cesium (Naul, 2016), Astronomy eXtensions for Spark (Zecevic+ 2018)
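A toy version of the per-object computation such a UDF performs: in AXS this would be applied to each row's magnitude array via an Arrow-backed Python UDF, and a real feature set would come from a library like Cesium. The three features here are a minimal stand-in.

```python
import math

def lightcurve_features(mags):
    """Simple per-object light-curve features, as might run inside a
    Python UDF: mean magnitude, standard deviation, and amplitude."""
    n = len(mags)
    mean = sum(mags) / n
    var = sum((m - mean) ** 2 for m in mags) / n  # population variance
    return {
        "mean_mag": mean,
        "std_mag": math.sqrt(var),
        "amplitude": (max(mags) - min(mags)) / 2.0,
    }
```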
34. Observations and experiences
• Spark scales really well!
• SQL support is fantastic for supporting legacy code
• Efficient data exchange with Python is key to having reasonable
performance (Arrow and friends)
• The language barrier is non-trivial: astronomy is in Python, little
experience with JVM/Scala
• Pushing Spark into exploratory data analysis – the challenge of
converting a batch system to support more dynamic workflows.
36. DATA INTENSIVE RESEARCH IN
ASTROPHYSICS AND COSMOLOGY
DIRAC Data Engineering Group
We’re a collaborative incubator that supports people and communities
researching and building next generations of software technologies for
astronomy.
We emphasize cross-pollination with other fields and industry, and delivering usable, community-supported projects.
37. DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT
40. EPSC-DPS Meeting 2019 • Geneva, Switzerland • September 16, 2019
Cataloging the Solar System
• Potentially Hazardous Asteroids
• Main Belt Asteroids
• Census of small bodies in the Solar
System
Exploring the Transient sky
• Variable stars, Supernovae
• Fill in the variability phase-space
• Discovery of new classes of transients
Dark Matter, Dark Energy
• Weak Lensing
• Baryon acoustic oscillations
• Supernovae, Quasars
Milky Way Structure & Formation
• Structure and evolutionary history
• Spatial maps of stellar characteristics
• Reach well into the halo
LSST Science Drivers
41. Solar System Science with LSST
Animation: SDSS Asteroids
(Alex Parker, SwRI)
~0.7 million are known
Will grow to >5 million in the next 5 years
Estimates: Lynne Jones et al.
43. Whole Dataset Operations
• Galactic structure: density/proper motion maps of the Galaxy
– => forall stars, compute distance, bin, create 5D map
• Galactic structure: dust distribution
– => forall stars, compute g-r color, bin, find blue tip edge, infer dust distribution
• Near-field cosmology: MW satellite searches
– => forall stars, compute colors, convolve with spatial filters, report any satellite-like peaks
• Variability: Bayesian classification of transients and discovery of variables
– => forall stars, get light curves, compute likelihoods, alert if interesting
• …
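Every bullet above follows the same map-then-aggregate shape. A pure-Python caricature of the first one, a stellar density map, using a deliberately naive 1/parallax distance estimate (all names and binning choices here are illustrative):

```python
import math

def distance_pc(parallax_mas):
    """Naive distance estimate in parsecs from a parallax in
    milliarcseconds (a real analysis would treat errors carefully)."""
    return 1000.0 / parallax_mas

def density_map(stars, bin_deg=10.0):
    """'forall stars' pattern: compute a per-star quantity, then bin.
    stars: iterable of (ra, dec, parallax_mas) tuples. Returns counts
    keyed by (ra_bin, dec_bin, distance_shell), a coarse 3D density map."""
    counts = {}
    for ra, dec, plx in stars:
        if plx <= 0:
            continue  # skip non-positive parallaxes
        d = distance_pc(plx)
        shell = int(math.log10(d))  # decade-wide distance shells
        key = (int(ra // bin_deg), int(dec // bin_deg), shell)
        counts[key] = counts.get(key, 0) + 1
    return counts
```

In AXS the per-star map step would run as a distributed transformation and the binning as a group-by aggregation, so the same shape scales to billions of rows.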
44. Astronomical catalogs
• Just (big!) databases
• Each row corresponds to a detection or an object
(star/galaxy/asteroid)
• Producing catalogs from images is not trivial - non-exhaustive list of
problems (for software to solve):
– background estimation
– PSF estimation
– object detection
– image co-addition
– deblending
45. AXS history: LSD by Mario Jurić
• Tool for querying, cross-matching and analysis of positionally or
temporally indexed datasets
• Inspired by Google's BigTable and MapReduce papers
• However, it had some shortcomings:
– Fixed data partitioning (significant data skew)
– Time-partitioning problematic (most queries do not slice by
time)
– Not resilient to worker failures
– Contains a lot of custom solutions for functionalities that are
common today
46. Enter Spark and AXS
• Astronomy eXtensions for Spark
• The DIRAC Institute @ UW saw the need for a next-generation astronomical analysis tool
• Efficient cross-matching
• Based on industry standards (Apache Spark)
• Provides simple (but powerful) astronomical API
extensions
• Easy to use on-premises or in the cloud