Petar Zečević, SV Group, University of Zagreb
Mario Jurić, DIRAC Institute, University of Washington
AXS - Astronomical Data Processing on the LSST Scale with Apache Spark
#UnifiedDataAnalytics #SparkAISummit
About us
Mario Jurić
• Prof. of Astronomy at the University of Washington
• Founding faculty of DIRAC & eScience Institute Fellow
• Fmr. lead of LSST Data Management
Petar Zečević
• CTO at SV Group, Croatia
• CS PhD student at University of Zagreb
• Visiting Fellow at DiRAC institute @ UW
• Author of “Spark in Action”
Context: The Large Survey
Revolution in Astronomy
Hipparchus of Rhodes (180-125 BC)
In 129 BC, constructed one of the first star
catalogs, containing about 850 stars.
Galileo Galilei (1564-1642)
Researched a variety of topics in physics,
but called out here for the introduction of
the Galilean telescope.
Galileo’s telescope allowed us for the first
time to zoom in on the cosmos, and study
the individual objects in great detail.
The Astrophysics Two-Step
• Surveys
– Construct catalogs and maps of objects in the sky. Focus on coarse
classification and discovering targets for further follow-up.
• Large telescopes
– Acquire detailed observations of a few representative objects.
Understand the details of astrophysical processes that govern them,
and extrapolate that understanding to the entire class.
The Story of Astronomy:
2000 Years of being Data Poor
Sloan Digital Sky Survey
2.5 m telescope, >14,500 deg², 0.1″ astrometry, r<22.5 flux limit
5-band, 1% photometry for over 900M stars
Over 3M R=2000 spectra
10 years of ops: ~10 TB of imaging
1,231,051,050 rows (SDSS DR10, PhotoObjAll table)
~500 columns
Facilitated the development of large databases, data-driven discovery, and motion
towards what we recognize as Data Science today.
Panoramic Survey Telescope and Rapid Response System
1.8 m telescope, 30,000 deg², 50 mas astrometry, r<23 flux limit
5-band, better than 1% photometry (goal)
~700 GB/night
https://sci.esa.int/s/wV6oG5w
Gaia DR2: 1.7 billion stars
First Light: 2020 Operations: 2022
Deep (24th mag), Wide (60% of the sky), Fast (every 15 seconds)
Largest astronomical camera in the world
Will repeatedly observe the night sky over 10 years
10 million alerts each night (within 60 seconds)
37 billion astronomical sources, with time series
30 trillion measurements
The Large Synoptic Survey Telescope
A Public, Deep, Wide and Fast, Optical Sky Survey
Overview
LSST’s mission is to build a well-understood system that
provides a vast astronomical dataset for unprecedented
discovery of the deep and dynamic universe.
The Scale of Things to Come
Metric Amount
Number of detections 7 trillion rows
Number of objects 37 billion rows
Nightly alert rate 10 million
Nightly data rate >15 TB
Alert latency 60 seconds
Total images after 10 yrs 50 PB
Total data after 10 yrs 83 PB
Objects detected, measured, and stored in queryable catalogs (tables)
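To get a feel for the numbers in the table above, a quick back-of-the-envelope calculation helps. This is only a sketch using the rounded figures from the table; the variable names are ours.

```python
# Rounded figures taken from the table above.
detections = 7e12   # rows in the detections table
objects = 37e9      # rows in the objects table

# Average number of detections (time-series points) per catalogued object.
detections_per_object = detections / objects

print(f"~{detections_per_object:.0f} detections per object on average")
```

In other words, each object row comes with a time series of a couple of hundred measurements on average.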
Catalog-driven Science
• Once a catalog is available, astronomers “ask” all kinds of questions
• The traditional paradigm:
– Subset (filter data using a catalog SQL interface online)
– Download data locally
– Analyze (usually Python)
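The subset / download / analyze workflow can be sketched in a few lines. This is an illustration only: an in-memory SQLite table stands in for the online catalog service, and the table and column names (objects, ra, dec, r_mag) are hypothetical.

```python
import sqlite3
import statistics

# Stand-in for a remote catalog service: a tiny in-memory table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE objects (id INTEGER, ra REAL, dec REAL, r_mag REAL)")
conn.executemany(
    "INSERT INTO objects VALUES (?, ?, ?, ?)",
    [
        (1, 10.1, -5.0, 18.2),
        (2, 10.4, -5.2, 21.7),
        (3, 200.0, 30.0, 17.5),
        (4, 10.3, -4.9, 19.9),
    ],
)

# Step 1: subset with SQL (a small box on the sky, brighter than r = 21).
subset = conn.execute(
    "SELECT id, ra, dec, r_mag FROM objects "
    "WHERE ra BETWEEN 10 AND 11 AND dec BETWEEN -6 AND -4 AND r_mag < 21 "
    "ORDER BY id"
).fetchall()

# Step 2: "download" here just means holding the rows locally.
# Step 3: analyze in Python.
mean_r = statistics.mean(row[3] for row in subset)
print(len(subset), "objects, mean r =", round(mean_r, 2))
```

This works well while the subset fits on one machine, which is exactly the assumption that breaks down at survey scale.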
Challenges (part 0)
Dataset Size
(keeping ~PBs of data in RDBMSes is not easy, or cheap)
What do you do when the dataset subset is a few TB?
Challenges (part 1)
Dataset Size
(keeping ~TBs of data in RDBMSes is not easy)
Better Together
(joining datasets is powerful)
I Want it All
(interesting science with whole-dataset operations)
Challenges (part 2)
Scalability
(how do I write analysis code that will scale to petabytes of data?)
Resources
(where are the resources to run this code?)
How do you scale exploratory data analysis to ~PB-sized datasets
and thousands of simultaneous users?
Enter Spark, AXS
• AXS: Astronomy eXtensions for Spark
• The main idea:
– Spark is a proven, scalable, cloud-ready and widely-supported analytics
framework with full SQL support (legacy support).
– Extend it to exploratory data analysis.
– Add a scalable positional cross-match operator
– Add a domain-specific Python API layer to PySpark
– Couple to S3 API for storage, Kubernetes for orchestration…
• … A scalable platform supporting an arbitrarily sized dataset and a
large number of users, deployable on either public or private cloud.
Key Issue: Scalable Cross-matching
DEC and RA coordinates
Search perimeter
(can also use similarity)
A match
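What a positional match means can be sketched as follows: two sources match if their angular separation is below a search radius. The brute-force loop below is for illustration only (it compares every pair, which is what a scalable cross-match has to avoid); the function names and the tuple layout are ours.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(h)))

def brute_force_crossmatch(cat_a, cat_b, radius_deg):
    """Return (id_a, id_b) pairs closer than radius_deg. O(len(a) * len(b))."""
    return [
        (id_a, id_b)
        for id_a, ra_a, dec_a in cat_a
        for id_b, ra_b, dec_b in cat_b
        if angular_separation_deg(ra_a, dec_a, ra_b, dec_b) <= radius_deg
    ]

arcsec = 1 / 3600
cat_a = [("a1", 10.0, 20.0), ("a2", 150.0, -30.0)]
cat_b = [("b1", 10.0, 20.0 + 0.5 * arcsec), ("b2", 150.1, -30.0)]
matches = brute_force_crossmatch(cat_a, cat_b, radius_deg=1 * arcsec)
print(matches)
```

Only a1 and b1 fall within one arcsecond of each other; a2 and b2 are about 0.09 degrees apart.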
AXS data partitioning
• Data partitioning is at the root of AXS' efficient cross-matching
• Based on the late Jim Gray's “zones algorithm” (Microsoft Research)
• Sky divided into horizontal “zones” of a certain height
• Adapted for distributed architectures
• Data stored in Parquet files
– bucketed by zone
– sorted by zone and ra columns
– data from zone borders duplicated to the zone below
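The zoning and border duplication described above can be sketched in plain Python. This is a simplified illustration, not the AXS code: the zone height (one arcminute), the margin, and all function names are assumptions made for the example.

```python
import math

ZONES_PER_DEGREE = 60  # assumed zone height of one arcminute

def zone_of(dec_deg):
    """Index of the horizontal strip that contains this declination."""
    return math.floor((dec_deg + 90.0) * ZONES_PER_DEGREE)

def rows_with_border_copies(rows, margin_deg):
    """Yield (zone, ra, dec, is_duplicate) for each (ra, dec) input row.

    A row within margin_deg of the lower edge of its zone is emitted a second
    time in the zone below, so a match across that border can be found while
    reading a single zone.
    """
    for ra, dec in rows:
        zone = zone_of(dec)
        yield (zone, ra, dec, False)
        lower_edge = zone / ZONES_PER_DEGREE - 90.0
        if zone > 0 and dec - lower_edge < margin_deg:
            yield (zone - 1, ra, dec, True)

rows = [(10.0, 0.0001), (10.0, 0.0100), (250.0, -45.005)]
# Sort by (zone, ra), mirroring the on-disk ordering described above.
partitioned = sorted(rows_with_border_copies(rows, margin_deg=2 / 3600))
for row in partitioned:
    print(row)
```

The first input row sits just above a zone border, so it appears twice: once in its own zone and once, flagged as a duplicate, in the zone below.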
AXS - optimal joins
Epsilon join
SELECT ... FROM TA, TB
WHERE TA.zone = TB.zone
AND TA.ra BETWEEN TB.ra - e
AND TB.ra + e
SPARK-24020: Sort-merge join “inner range optimization”
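The idea behind evaluating this join efficiently on sorted inputs can be sketched in plain Python. This is not Spark's implementation, only an illustration of a sort-merge join with a moving window; the tuple layout (zone, ra, payload) is an assumption, and RA wrap-around at 0/360 degrees is ignored.

```python
from collections import deque

def epsilon_join(left, right, eps):
    """Join rows with equal zone and |left.ra - right.ra| <= eps.

    Rows are (zone, ra, payload) tuples. Both sides are sorted by (zone, ra),
    so each side is scanned once; a small window holds the right-hand rows
    that can still match the current left-hand row.
    """
    left = sorted(left, key=lambda r: (r[0], r[1]))
    right = sorted(right, key=lambda r: (r[0], r[1]))
    window = deque()
    j = 0
    out = []
    for zone, ra, payload in left:
        # Pull in right rows up to the upper bound (zone, ra + eps).
        while j < len(right) and (right[j][0], right[j][1]) <= (zone, ra + eps):
            window.append(right[j])
            j += 1
        # Drop right rows below the lower bound (zone, ra - eps).
        while window and (window[0][0], window[0][1]) < (zone, ra - eps):
            window.popleft()
        out.extend((payload, r[2]) for r in window)
    return out

left = [(1, 10.0, "a1"), (1, 10.5, "a2"), (2, 10.0, "a3")]
right = [(1, 10.0002, "b1"), (1, 10.4, "b2"), (2, 10.0001, "b3"), (3, 10.0, "b4")]
pairs = epsilon_join(left, right, eps=0.001)
print(pairs)
```

Because both bounds only move forward as the left side advances, no right-hand row is visited more than once, in contrast to rescanning every row of a zone for each left-hand row.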
Other approaches
Other systems use HEALPix or Hierarchical Triangular Mesh (HTM)
AXS performance results
Gaia (1.7 B) x SDSS (800 M)
37s warm (148s cold)
Gaia (1.7 B) x ZTF (2.9 B)
39s warm (315s cold)
Left: tests on a single large
machine. An AWS deployment
scales out nearly linearly, as
long as there are sufficient
partitions in the dataset.
AXS API
AXS - other functionalities
• crossmatch (return all or the first crossmatch candidate)
• region queries
• cone queries
• histogram
• histogram2d
• Spark array functions for handling lightcurve data
• All other Spark functions
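To make two of the operations listed above concrete, here is what a region query and a histogram compute, written as plain-Python stand-ins. These are not the AXS methods themselves; the function signatures and the row layout are assumptions for illustration.

```python
def region(rows, ra1, dec1, ra2, dec2):
    """Keep rows whose (ra, dec) fall inside a rectangular box."""
    return [r for r in rows if ra1 <= r["ra"] <= ra2 and dec1 <= r["dec"] <= dec2]

def histogram(values, lo, hi, nbins):
    """Count values in nbins equal-width bins over [lo, hi)."""
    width = (hi - lo) / nbins
    counts = [0] * nbins
    for v in values:
        if lo <= v < hi:
            # min() guards against floating-point rounding at the top edge.
            counts[min(int((v - lo) / width), nbins - 1)] += 1
    return counts

rows = [
    {"ra": 10.1, "dec": -5.0, "r_mag": 18.2},
    {"ra": 10.4, "dec": -5.2, "r_mag": 21.7},
    {"ra": 200.0, "dec": 30.0, "r_mag": 17.5},
    {"ra": 10.3, "dec": -4.9, "r_mag": 19.9},
]
box = region(rows, ra1=10, dec1=-6, ra2=11, dec2=-4)
mag_counts = histogram([r["r_mag"] for r in box], lo=17, hi=22, nbins=5)
print(len(box), mag_counts)
```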
Astronomy Example: Computing Light
Curve Features with Python UDFs
This works on arbitrarily large datasets!
Cesium (Naul, 2016), Astronomy eXtensions for Spark (Zecevic+ 2018)
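The kind of per-object function one would wrap in a Python UDF can be sketched as below. It uses only the standard library rather than Cesium, and the feature set and names are ours; in a Spark setting each call would receive the array columns holding one object's light curve.

```python
import statistics

def light_curve_features(times, mags):
    """Simple summary features for one object's light curve."""
    return {
        "n_obs": len(mags),
        "mean_mag": statistics.fmean(mags),
        "amplitude": (max(mags) - min(mags)) / 2,
        "std_mag": statistics.pstdev(mags),
        "timespan": max(times) - min(times),
    }

# One object's light curve: observation times (days) and magnitudes.
features = light_curve_features(
    times=[0.0, 1.0, 2.0, 3.0],
    mags=[15.0, 15.2, 14.8, 15.0],
)
print(features)
```

Since the function touches one object at a time, it parallelizes trivially across objects, which is what lets the same code run on arbitrarily large catalogs.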
Observations and experiences
• Spark scales really well!
• SQL support is fantastic for supporting legacy code
• Efficient data exchange with Python is key to having reasonable
performance (Arrow and friends)
• The language barrier is non-trivial: astronomy is in Python, little
experience with JVM/Scala
• Pushing Spark into exploratory data analysis – the challenge of
converting a batch system to support more dynamic workflows.
“Astronomy 2025”
Towards a scalable
astronomical analysis
platform
DATA INTENSIVE RESEARCH IN
ASTROPHYSICS AND COSMOLOGY
DIRAC Data Engineering Group
We’re a collaborative incubator that supports people and communities
researching and building next generations of software technologies for
astronomy.
We emphasize cross-pollination with other fields, the industry, and delivering
usable, community supported, projects.
Backups
http://astro.washington.edu
EPSC-DPS Meeting 2019 • Geneva, Switzerland • September 16, 2019
LSST Science Drivers
Cataloging the Solar System
• Potentially Hazardous Asteroids
• Main Belt Asteroids
• Census of small bodies in the Solar System
Exploring the Transient sky
• Variable stars, Supernovae
• Fill in the variability phase-space
• Discovery of new classes of transients
Dark Matter, Dark Energy
• Weak Lensing
• Baryon acoustic oscillations
• Supernovae, Quasars
Milky Way Structure & Formation
• Structure and evolutionary history
• Spatial maps of stellar characteristics
• Reach well into the halo
Solar System Science with LSST
Animation: SDSS Asteroids
(Alex Parker, SwRI)
About 0.7 million are known
Will grow to >5 million in the next 5 years
Estimates: Lynne Jones et al.
Whole Dataset Operations
• Galactic structure: density/proper motion maps of
the Galaxy
– => forall stars, compute distance, bin, create 5D map
• Galactic structure: dust distribution
– => forall stars, compute g-r color, bin, find blue tip edge,
infer dust distribution
• Near-field cosmology: MW satellite searches
– => forall stars, compute colors, convolve with spatial
filters, report any satellite-like peaks
• Variability: Bayesian classification of transients and
discovery of variables
– => forall stars, get light curves, compute likelihoods,
alert if interesting
• …
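All of these follow the same "for all stars: compute, bin, summarize" pattern. A minimal single-machine sketch for the g-r color case is shown below; the grid cell size, field names, and function name are assumptions, and the later steps (finding the blue tip edge, inferring dust) are left out.

```python
from collections import defaultdict

def color_map(stars, cell_deg=1.0):
    """Bin stars on a coarse (ra, dec) grid and average g - r per cell."""
    sums = defaultdict(lambda: [0.0, 0])  # cell -> [sum of colors, count]
    for star in stars:
        cell = (int(star["ra"] // cell_deg), int(star["dec"] // cell_deg))
        acc = sums[cell]
        acc[0] += star["g"] - star["r"]
        acc[1] += 1
    return {cell: total / n for cell, (total, n) in sums.items()}

stars = [
    {"ra": 10.2, "dec": 5.5, "g": 18.0, "r": 17.5},
    {"ra": 10.8, "dec": 5.1, "g": 19.0, "r": 18.0},
    {"ra": 11.3, "dec": 5.5, "g": 17.0, "r": 16.8},
]
mean_color = color_map(stars)
print(mean_color)
```

The per-star step is a map and the per-cell averaging is a grouped aggregation, which is why this class of analysis fits a framework like Spark well.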
Astronomical catalogs
• Just (big!) databases
• Each row corresponds to a detection or an object
(star/galaxy/asteroid)
• Producing catalogs from images is not trivial - non-exhaustive list of
problems (for software to solve):
– background estimation
– PSF estimation
– object detection
– image co-addition
– deblending
AXS history: LSD (Large Survey Database) by Mario Jurić
• Tool for querying, cross-matching and analysis of positionally or
temporally indexed datasets
• Inspired by Google's BigTable and MapReduce papers
• However it has some shortcomings:
– Fixed data partitioning (significant data skew)
– Time-partitioning problematic (most queries do not slice by
time)
– Not resilient to worker failures
– Contains a lot of custom solutions for functionalities that are
common today
Enter Spark and AXS
• Astronomy eXtensions for Spark
• DiRAC institute @ UW saw the need for a next-generation
astronomical analysis tool
• Efficient cross-matching
• Based on industry standards (Apache Spark)
• Provides simple (but powerful) astronomical API
extensions
• Easy to use on-premises or in the cloud
Scaling with Spark
https://www.toptal.com/spark/introduction-to-apache-spark
+ government-sponsored private clouds (e.g., JetStream)
Meeting the Challenges
Resources
Dataset Storage
Scalable
Analysis Code
Interface

Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfadriantubila
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service OnlineCALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Onlineanilsa9823
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 

Dernier (20)

Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girlCall Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service OnlineCALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 

Astronomical Data Processing on the LSST Scale with Apache Spark

  • 1. WIFI SSID:Spark+AISummit | Password: UnifiedDataAnalytics
  • 2. Petar Zečević, SV Group, University of Zagreb Mario Jurić, DIRAC Institute, University of Washington AXS - Astronomical Data Processing on the LSST Scale with Apache Spark #UnifiedDataAnalytics #SparkAISummit
  • 3. About us Mario Jurić • Prof. of Astronomy at the University of Washington • Founding faculty of DIRAC & eScience Institute Fellow • Fmr. lead of LSST Data Management Petar Zečević • CTO at SV Group, Croatia • CS PhD student at University of Zagreb • Visiting Fellow at DiRAC institute @ UW • Author of “Spark in Action”
  • 5. Context: The Large Survey Revolution in Astronomy
  • 7. Hipparchus of Rhodes (180-125 BC) In 129 BC, constructed one of the first star catalogs, containing about 850 stars.
  • 8. Galileo Galilei (1564-1642) Researched a variety of topics in physics, but called out here for the introduction of the Galilean telescope. Galileo’s telescope allowed us for the first time to zoom in on the cosmos, and study the individual objects in great detail.
  • 9. The Astrophysics Two-Step • Surveys – Construct catalogs and maps of objects in the sky. Focus on coarse classification and discovering targets for further follow-up. • Large telescopes – Acquire detailed observations of a few representative objects. Understand the details of astrophysical processes that govern them, and extrapolate that understanding to the entire class.
  • 10. The Story of Astronomy: 2000 Years of being Data Poor
  • 11. Sloan Digital Sky Survey • 2.5m telescope • >14,500 deg² • 0.1” astrometry • r<22.5 flux limit • 5-band, 1% photometry for over 900M stars • Over 3M R=2000 spectra • 10 years of ops: ~10 TB of imaging
  • 12. 1,231,051,050 rows (SDSS DR10, PhotoObjAll table) • ~500 columns • Facilitated the development of large databases, data-driven discovery, motion towards what we recognize as Data Science today.
  • 13. Panoramic Survey Telescope and Rapid Response System • 1.8m telescope • 30,000 deg² • 50mas astrometry • r<23 flux limit • 5-band, better than 1% photometry (goal) • ~700 GB/night
  • 15. The Large Synoptic Survey Telescope: A Public, Deep, Wide and Fast, Optical Sky Survey • First Light: 2020 • Operations: 2022 • Deep (24th mag), Wide (60% of the sky), Fast (every 15 seconds) • Largest astronomical camera in the world • Will repeatedly observe the night sky over 10 years • 10 million alerts each night (60 seconds) • 37 billion astronomical sources, with time series • 30 trillion measurements
  • 16. Overview LSST’s mission is to build a well-understood system that provides a vast astronomical dataset for unprecedented discovery of the deep and dynamic universe.
  • 17. The Scale of Things to Come (metric: amount) • Number of detections: 7 trillion rows • Number of objects: 37 billion rows • Nightly alert rate: 10 million • Nightly data rate: >15 TB • Alert latency: 60 seconds • Total images after 10 yrs: 50 PB • Total data after 10 yrs: 83 PB • Objects detected, measured, and stored in queryable catalogs (tables)
  • 18. Catalog-driven Science • Once a catalog is available, astronomers “ask” all kinds of questions • The traditional paradigm: – Subset (filter data using a catalog SQL interface online) – Download data locally – Analyze (usually Python)
  • 19. Challenges (part 0) • Dataset Size (keeping ~PBs of data in RDBMSes is not easy, or cheap) • What do you do when the dataset subset is a few ~TBs?
  • 20. Challenges (part 1) • Better Together (joining datasets is powerful) • I Want it All (interesting science w. whole dataset operations) • Dataset Size (keeping ~TBs of data in RDBMSes is not easy)
  • 21. Challenges (part 2) • Scalability (how do I write an analysis code that will scale to petabytes of data?) • Resources (where are the resources to run this code?) • How do you scale exploratory data analysis to ~PB-sized datasets and thousands of simultaneous users?
  • 22. Enter Spark, AXS • AXS: Astronomy eXtensions for Spark • The main idea: – Spark is a proven, scalable, cloud-ready and widely-supported analytics framework with full SQL support (legacy support). – Extend it to exploratory data analysis. – Add a scalable positional cross-match operator – Add a domain-specific Python API layer to PySpark – Couple to S3 API for storage, Kubernetes for orchestration… • The result: a scalable platform supporting an arbitrarily sized dataset and a large number of users, deployable on either public or private cloud.
  • 23. Key Issue: Scalable Cross-matching • Figure labels: DEC and RA coordinates; Search perimeter (can also use similarity); A match
  • 24. AXS data partitioning • Data partitioning is at the root of AXS' efficient cross-matching • Based on the (late) Jim Gray's “zones algorithm” (Microsoft Research) • Sky divided into horizontal “zones” of a certain height • Adapted for distributed architectures • Data stored in Parquet files – bucketed by zone – sorted by zone and ra columns – data from zone borders duplicated to the zone below
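The partitioning scheme on this slide can be illustrated with a small, plain-Python sketch. It is not AXS code: the function names, the one-arcminute zone height, and the boolean duplicate flag are assumptions made for illustration. It assigns each row to a horizontal zone by declination, copies rows that sit close to the lower border of their zone into the zone below, and sorts the result by zone and ra.

```python
import math

# Assumed zone height (one arcminute); the slides do not state the value used.
ZONE_HEIGHT_DEG = 1.0 / 60.0


def zone_of(dec_deg, zone_height=ZONE_HEIGHT_DEG):
    """Index of the horizontal zone containing a declination in [-90, 90]."""
    return int(math.floor((dec_deg + 90.0) / zone_height))


def zones_for_row(dec_deg, match_radius_deg, zone_height=ZONE_HEIGHT_DEG):
    """Zones a row is written to: its own zone, plus the zone below when the
    row lies within match_radius_deg of its zone's lower border."""
    z = zone_of(dec_deg, zone_height)
    zones = [z]
    lower_border = z * zone_height - 90.0
    if z > 0 and dec_deg - lower_border < match_radius_deg:
        zones.append(z - 1)
    return zones


def bucket_rows(rows, match_radius_deg, zone_height=ZONE_HEIGHT_DEG):
    """Turn (ra, dec) rows into (zone, ra, dec, is_duplicate) tuples,
    sorted by zone and then ra, mirroring the on-disk ordering."""
    out = []
    for ra, dec in rows:
        for i, z in enumerate(zones_for_row(dec, match_radius_deg, zone_height)):
            out.append((z, ra, dec, i > 0))
    out.sort(key=lambda t: (t[0], t[1]))
    return out
```

With border rows duplicated this way, a cross-match only has to compare rows that share a zone value, which is what lets the join be expressed as an equality on zone plus a range condition on ra.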
  • 26. AXS - optimal joins
  • 27. AXS - optimal joins
  • 28. Epsilon join • SELECT ... FROM TA, TB WHERE TA.zone = TB.zone AND TA.ra BETWEEN TB.ra - e AND TB.ra + e • SPARK-24020: Sort-merge join “inner range optimization”
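To make the range condition in the epsilon join concrete, here is a minimal plain-Python sketch of a merge over two ra-sorted lists from the same zone. It is an illustration of the general sliding-window idea only, not the Spark implementation behind SPARK-24020; the function name `epsilon_join` is hypothetical, and ra wrap-around at 0/360 degrees is ignored.

```python
def epsilon_join(left, right, eps):
    """Return all pairs (l, r) with |l - r| <= eps.

    Both inputs must be sorted ascending; they stand for the ra values of
    two tables restricted to a single zone.
    """
    pairs = []
    start = 0  # first index in `right` that can still match some left value
    for l in left:
        # Values below l - eps cannot match this or any later (larger) l.
        while start < len(right) and right[start] < l - eps:
            start += 1
        # Scan forward only while values stay inside [l - eps, l + eps].
        j = start
        while j < len(right) and right[j] <= l + eps:
            pairs.append((l, right[j]))
            j += 1
    return pairs
```

Because the window start never moves backwards, each left row inspects only the right rows near it instead of every right row in the zone, which is the point of applying the range condition inside the sort-merge join rather than as a filter afterwards.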
  • 29. Other approaches • Other systems use HEALPix or Hierarchical Triangular Mesh (HTM)
  • 30. AXS performance results • Gaia (1.7 B) x SDSS (800 M): 37s warm (148s cold) • Gaia (1.7 B) x ZTF (2.9 B): 39s warm (315s cold) • Left: tests on a single large machine. An AWS deployment scales out nearly linearly, as long as there are sufficient partitions in the dataset.
  • 32. AXS - other functionalities • crossmatch (return all or the first crossmatch candidate) • region queries • cone queries • histogram • histogram2d • Spark array functions for handling lightcurve data • All other Spark functions
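As an illustration of what a cone query computes, the following plain-Python sketch filters (ra, dec) rows by angular distance from a center point, using the haversine formula for the great-circle separation. It does not use or reproduce the AXS API; the function names and the tuple row layout are assumptions for this example.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two sky positions
    given in degrees (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    # Clamp to 1.0 to guard against rounding slightly above the asin domain.
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def cone_query(rows, ra0, dec0, radius_deg):
    """Keep the (ra, dec) rows within radius_deg of the center (ra0, dec0)."""
    return [row for row in rows
            if angular_separation_deg(row[0], row[1], ra0, dec0) <= radius_deg]
```

A brute-force filter like this touches every row; the zone layout described earlier is what would let a real system skip most of the data before computing any distances.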
  • 33. Astronomy Example: Computing Light Curve Features with Python UDFs This works on arbitrarily large datasets! Cesium (Naul, 2016), Astronomy eXtensions for Spark (Zecevic+ 2018)
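The kind of per-object function that such a Python UDF would wrap can be sketched in plain Python. The feature set below (count, mean, population standard deviation, half peak-to-peak amplitude) is a simplified stand-in chosen for illustration, not the Cesium feature set from the slide, and the name `lightcurve_features` is hypothetical.

```python
import statistics


def lightcurve_features(mags):
    """Summary features for one object's light curve.

    `mags` is the list of magnitude measurements for a single object,
    as it might arrive from an array column. Returns None if empty.
    """
    if not mags:
        return None
    return {
        "n_obs": len(mags),
        "mean": statistics.fmean(mags),
        "std": statistics.pstdev(mags),
        "amplitude": (max(mags) - min(mags)) / 2,
    }
```

Since the function sees only one object's measurements at a time, it can be applied row by row across partitions, which is why this style of computation is not limited by total dataset size.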
  • 34. Observations and experiences • Spark scales really well! • SQL support is fantastic for supporting legacy code • Efficient data exchange with Python is key to having reasonable performance (Arrow and friends) • The language barrier is non-trivial: astronomy is in Python, little experience with JVM/Scala • Pushing Spark into exploratory data analysis – the challenge of converting a batch system to support more dynamic workflows.
  • 35. “Astronomy 2025” Towards a scalable astronomical analysis platform
  • 36. DATA INTENSIVE RESEARCH IN ASTROPHYSICS AND COSMOLOGY DIRAC Data Engineering Group We’re a collaborative incubator that supports people and communities researching and building next generations of software technologies for astronomy. We emphasize cross-pollination with other fields, the industry, and delivering usable, community supported, projects.
  • 40. LSST Science Drivers (EPSC-DPS Meeting 2019 • Geneva, Switzerland • September 16, 2019) • Cataloging the Solar System: Potentially Hazardous Asteroids; Main Belt Asteroids; Census of small bodies in the Solar System • Exploring the Transient sky: Variable stars, Supernovae; Fill in the variability phase-space; Discovery of new classes of transients • Dark Matter, Dark Energy: Weak Lensing; Baryon acoustic oscillations; Supernovae, Quasars • Milky Way Structure & Formation: Structure and evolutionary history; Spatial maps of stellar characteristics; Reach well into the halo
  • 41. Solar System Science with LSST • ~0.7 million asteroids are known • Will grow to >5 million in the next 5 years (estimates: Lynne Jones et al.) • Animation: SDSS Asteroids (Alex Parker, SwRI)
  • 43. Whole Dataset Operations • Galactic structure: density/proper motion maps of the Galaxy – => forall stars, compute distance, bin, create 5D map • Galactic structure: dust distribution – => forall stars, compute g-r color, bin, find blue tip edge, infer dust distribution • Near-field cosmology: MW satellite searches – => forall stars, compute colors, convolve with spatial filters, report any satellite-like peaks • Variability: Bayesian classification of transients and discovery of variables – => forall stars, get light curves, compute likelihoods, alert if interesting • …
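The "forall stars, compute, bin" pattern from this slide can be sketched in plain Python for the g-r color case. The names, the bin width, and the (g, r) tuple layout are assumptions for illustration; the sketch only shows why such operations parallelize: each partition builds a partial histogram, and the partial histograms are merged by addition.

```python
import math
from collections import Counter


def color_histogram(stars, bin_width=0.1):
    """Histogram of g-r color for one partition of (g_mag, r_mag) rows.
    Keys are integer bin indices: floor((g - r) / bin_width)."""
    counts = Counter()
    for g, r in stars:
        counts[math.floor((g - r) / bin_width)] += 1
    return counts


def merge_histograms(partials):
    """Combine per-partition histograms into one by summing bin counts."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total
```

The merged result is the same no matter how the stars are split across partitions, so the work can be spread over as many workers as the data layout allows.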
  • 44. Astronomical catalogs • Just (big!) databases • Each row corresponds to a detection or an object (star/galaxy/asteroid) • Producing catalogs from images is not trivial; a non-exhaustive list of problems (for software to solve): – background estimation – PSF estimation – object detection – image co-addition – deblending
  • 45. AXS history: LSD by Mario Jurić • Tool for querying, cross-matching and analysis of positionally or temporally indexed datasets • Inspired by Google's BigTable and MapReduce papers • However, it has some shortcomings: – Fixed data partitioning (significant data skew) – Time-partitioning problematic (most queries do not slice by time) – Not resilient to worker failures – Contains a lot of custom solutions for functionalities that are common today
  • 46. Enter Spark and AXS • Astronomy eXtensions for Spark • DiRAC institute @ UW saw the need for a next-generation astronomical analysis tool • Efficient cross-matching • Based on industry standards (Apache Spark) • Provides simple (but powerful) astronomical API extensions • Easy to use on-premises or in the cloud
  • 48. + government-sponsored private clouds (e.g., JetStream)
  • 49. Meeting the Challenges • Resources • Dataset Storage • Scalable Analysis Code • Interface