Julien Peloton, CNRS
Accelerating
Astronomical Discoveries
with Apache Spark
#UnifiedDataAnalytics #SparkAISummit
21st-century astronomy
How can we get different data?
~1/100,000 of sky
Large but shallow
Hubble FoV
Deep but small
Large Synoptic Survey Telescope
2022-2032: Deep & large survey
Non-profit corporation
Site: Chile (Cerro Pachón)
US-led international collaboration (1000+)
A million-piece puzzle
• LSST will deliver a near-full-sky map every 3 nights
– 3.2-gigapixel camera (the size of a car!)
– 15 TB/night of raw image data collected
– 1 TB/night of alerts streamed
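To give a sense of scale, the nightly rates above can be turned into rough survey totals. This is a back-of-the-envelope sketch: the 10-year duration comes from the 2022-2032 span quoted earlier, while observing every night of the year and using decimal units (1 PB = 1000 TB) are simplifying assumptions, so the results are upper bounds rather than official figures.

```python
# Nightly rates quoted on the slide.
RAW_TB_PER_NIGHT = 15
ALERT_TB_PER_NIGHT = 1

# Assumptions (not from the slides): observations every night, decimal units.
NIGHTS_PER_YEAR = 365
SURVEY_YEARS = 10  # 2022-2032
TB_PER_PB = 1000


def survey_total_pb(tb_per_night, years=SURVEY_YEARS, nights_per_year=NIGHTS_PER_YEAR):
    """Upper-bound data volume in petabytes over the whole survey."""
    return tb_per_night * nights_per_year * years / TB_PER_PB


raw_total_pb = survey_total_pb(RAW_TB_PER_NIGHT)
alert_total_pb = survey_total_pb(ALERT_TB_PER_NIGHT)

print(f"Raw images: up to ~{raw_total_pb:.1f} PB")
print(f"Alerts:     up to ~{alert_total_pb:.2f} PB")
```

Under these assumptions the raw images alone reach tens of petabytes, which is the motivation for distributed processing.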
Apache Spark for astronomy?
We would like to be able to do the following at scale:
• Explore large catalogs of data
• Cross-match large catalogs
• Process telescope images
• Classify light curves
• Process telescope alerts
• ...
FITS: astronomical data format
• First (last) release: 1981 (2016).
• Endorsed by NASA and the International
Astronomical Union.
• Multi-purpose: vectors, images, tables, ...
• Backward compatible
• Set of blocks. One block: ASCII header + binary data arrays of arbitrary dimension
• Support for C, C++, C#, Fortran, IDL, Java,
Julia, MATLAB, Perl, Python, R, and more…
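The "ASCII header" part of a block can be made concrete with a small sketch. In the FITS standard, what the slide calls a block is a header-data unit (HDU); its header is a sequence of 80-character keyword records ("cards") terminated by an END card and padded to a multiple of 2880 bytes. The function names `make_header` and `parse_header` are hypothetical, and the parser is deliberately simplified (it treats everything after a "/" as a comment, which would mis-handle string values containing a slash).

```python
HEADER_UNIT = 2880  # FITS headers are padded to a multiple of 2880 bytes
CARD_SIZE = 80      # each header record ("card") is 80 ASCII characters


def make_header(cards):
    """Build a padded ASCII header from (keyword, value) pairs (illustration only)."""
    lines = [f"{key:<8}= {value:>20}".ljust(CARD_SIZE) for key, value in cards]
    lines.append("END".ljust(CARD_SIZE))
    raw = "".join(lines)
    padding = (-len(raw)) % HEADER_UNIT
    return (raw + " " * padding).encode("ascii")


def parse_header(raw):
    """Read 80-character cards until END; return {keyword: value string}."""
    header = {}
    for start in range(0, len(raw), CARD_SIZE):
        card = raw[start:start + CARD_SIZE].decode("ascii")
        keyword = card[:8].strip()
        if keyword == "END":
            break
        if card[8:10] == "= ":  # value indicator in columns 9-10
            header[keyword] = card[10:].split("/")[0].strip()
    return header


raw = make_header([
    ("SIMPLE", "T"),
    ("BITPIX", "-32"),
    ("NAXIS", "2"),
    ("NAXIS1", "100"),
    ("NAXIS2", "50"),
])
header = parse_header(raw)
print(len(raw), header)
```

The binary data arrays follow the header, with their shape and type described by keywords such as BITPIX and NAXISn.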
spark-fits
• FITS data source for Spark SQL and DataFrames.
• Data Source V1 API.
• Images + tables available.
• Schema automatically inferred from the FITS header.
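As a rough illustration of how a schema can be inferred from a FITS header, the sketch below maps binary-table column keywords (TFIELDS, TTYPEn and TFORMn, which are standard FITS keywords) to Spark SQL type names. The mapping table and the function name `infer_schema` are assumptions made for illustration, not the actual spark-fits implementation, and only a handful of TFORM codes are covered.

```python
import re

# Subset of FITS binary-table TFORM codes mapped to Spark SQL type names.
# The choice of Spark types is an assumption made for this sketch.
TFORM_TO_SPARK = {
    "L": "boolean",   # logical
    "I": "smallint",  # 16-bit integer
    "J": "int",       # 32-bit integer
    "K": "bigint",    # 64-bit integer
    "E": "float",     # single-precision float
    "D": "double",    # double-precision float
    "A": "string",    # character
}


def infer_schema(header):
    """Return [(column name, Spark type name)] from a binary-table header dict."""
    schema = []
    for i in range(1, int(header["TFIELDS"]) + 1):
        name = header[f"TTYPE{i}"]
        match = re.fullmatch(r"(\d*)([A-Z])", header[f"TFORM{i}"])
        if match is None or match.group(2) not in TFORM_TO_SPARK:
            raise ValueError(f"Unsupported TFORM for column {name!r}")
        repeat, code = match.groups()
        spark_type = TFORM_TO_SPARK[code]
        # A repeat count > 1 means a vector column (for 'A' it is a string length).
        if code != "A" and repeat not in ("", "1"):
            spark_type = f"array<{spark_type}>"
        schema.append((name, spark_type))
    return schema


example_header = {
    "TFIELDS": "4",
    "TTYPE1": "RA", "TFORM1": "D",
    "TTYPE2": "DEC", "TFORM2": "1D",
    "TTYPE3": "NAME", "TFORM3": "20A",
    "TTYPE4": "FLUX", "TFORM4": "5E",
}
schema = infer_schema(example_header)
print(schema)
```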
spark-fits in practice
• Spark 2.3.1 / Hadoop 2.8.4
• 1.1 billion rows, 153 cores
• Run 100 times (no cache).
• Performance (I/O throughput) comparable to Spark's built-in connectors (with no attempt to optimise anything anywhere…)
Current limitations
Some limitations remain, though…
• Need to migrate to the Apache Spark Data Source V2 (DSv2) API.
• No column pruning and no filter pushdown at the level of the connector.
• (De)compression is not handled yet.
• The Scala FITS library lacks many features.
We live in a 3D world
• Manipulating 2D data with Spark: GeoTrellis, Magellan, GeoSpark, GeoMesa, …
• Very little about 3D!
• Needed for e.g. astronomy, particle physics, meteorology.
Manipulating 3D spatial data: spark3D
• 3D distributed partitioning
– KDTree, Octree, shells, ...
• Distributed spatial queries & data mining
– KNN, join, DBSCAN, …
– Typical usage on millions to billions of rows
• Visualisation
– Client/server architecture
Student: Mayur Bhosale (now at Qubole)
On the repartitioning...
Frequently needed, as data arrive unstructured, but:
• Repartitioning implies a heavy shuffle between executors.
• Complex UDFs in Spark are often inefficient.
Need for (efficient) streaming
• So far we explored the static sky, namely what has already been observed.
• But what about what is happening right now? E.g.
– Supernovae (stellar explosions)
– Black hole merger counterparts (multi-messenger astronomy)
– Micro-lensing (extrasolar planet search)
– Earth killers!
– Anomaly detection (unforeseen astronomical sources)
• Correlation past/present/future?
• Timescales range from seconds to months...
Desiderata & solution
We would like
• Efficient processing at scale
• Multi-modal analytics capability (streaming & batch)
• Good integration with the current ecosystem
Solution: Structured Streaming
Introducing Fink
Fink is
• A broker system for sky alerts
• Based on Apache Spark
Fink does
• Collect, enrich & distribute sky alerts
01 Collect, 02 Enrich, 03 Distribute
On a quiet night...
• 10,000 Avro alerts every 30 seconds
• 1 TB of alerts per night
• Parquet database
[Figure: observation, template and difference images. Credits: E. Bellm]
Who’s who
Add value to the raw alerts
• Stream-static join
• Classification (BNN)
[Diagram labels: alert stream, internal catalogs, Structured Streaming, alert database]
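The logic of a stream-static join can be sketched without Spark: every incoming micro-batch of alerts is matched against a fixed catalog, and alerts without a match are kept with an empty value (left outer join). This is a plain-Python illustration of the concept, not Structured Streaming code; the field names (`object_id`, `type`) and the catalog content are hypothetical.

```python
# Static side of the join: a small, fixed catalog keyed by object identifier.
STATIC_CATALOG = {
    "obj-1": {"type": "galaxy"},
    "obj-2": {"type": "variable star"},
}


def enrich_batch(alerts, catalog):
    """Left outer join of one micro-batch of alerts with the static catalog."""
    enriched = []
    for alert in alerts:
        match = catalog.get(alert["object_id"])
        enriched.append({**alert, "catalog_type": match["type"] if match else None})
    return enriched


def process_stream(batches, catalog):
    """Enrich micro-batches one by one, as they arrive."""
    for batch in batches:
        yield enrich_batch(batch, catalog)


incoming = [
    [{"object_id": "obj-1", "magnitude": 18.2}],
    [{"object_id": "obj-3", "magnitude": 20.1}, {"object_id": "obj-2", "magnitude": 15.7}],
]
results = list(process_stream(incoming, STATIC_CATALOG))
print(results)
```

Keeping unmatched alerts matters here: an alert with no known counterpart may be exactly the new source one is looking for.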
Joining external information
[Diagram labels: neutrino, gamma-ray, optical and gravitational-wave alert streams; Structured Streaming; join; output; alert database]
Spark does all the hard work
• Small delays
• Record throughput
• Stream position recovery
But it cannot do everything...
• Large delays
• False positives
Humans are still needed to make decisions
The Hero’s Return
Processing based on Adaptive Learning (PoC)
• Ranking of promising candidates
• Improved classification over time
[Diagram labels: new candidates, follow-up & discovery, training]
Streaming infrastructure by: Abhishek Chauhan (now at Morgan Stanley)
The fear of the shutdown!
What if we miss a night?
• 14 million alerts, 830 GB of data
• Let Spark do the hard work again (offsets, updates...)
[Figure annotations: broker shutdown; collect & write; 100 minutes on 3 machines; collect alerts (cache)]
Limiting factors
• Number of machines
• Network
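The figures on this slide translate into average throughputs with a little arithmetic. The sketch assumes that the 100 minutes on 3 machines covers the full missed night (14 million alerts, 830 GB), that the load is spread evenly across machines, and that units are decimal (1 GB = 1000 MB); none of these assumptions is stated explicitly on the slide.

```python
# Figures quoted on the slide.
ALERTS = 14_000_000
DATA_GB = 830
DURATION_MIN = 100
MACHINES = 3

duration_s = DURATION_MIN * 60

# Average rates over the whole catch-up run (decimal units assumed).
alerts_per_second = ALERTS / duration_s
total_mb_per_second = DATA_GB * 1000 / duration_s
per_machine_mb_per_second = total_mb_per_second / MACHINES
mean_alert_kb = DATA_GB * 1_000_000 / ALERTS

print(f"{alerts_per_second:.0f} alerts/s")
print(f"{total_mb_per_second:.1f} MB/s in total")
print(f"{per_machine_mb_per_second:.1f} MB/s per machine")
print(f"{mean_alert_kb:.1f} kB per alert on average")
```

These averages are consistent with the limiting factors listed above: adding machines or network bandwidth is the direct way to shorten the catch-up time.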
Some lessons learned
Handling stream offsets
• Manual or not? Still not obvious...
Schema evolution
• User needs change often… Database choice is crucial
Dynamic filtering
• Need to adapt quickly to new situations
Handling watermarks
• How long shall we wait for data? Switch to post-processing.
Communication
• Use common communication protocols & data formats...
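The watermark question ("how long shall we wait for data?") can be illustrated with a simplified, single-process sketch. The watermark is taken as the latest event time seen so far minus an allowed delay; events older than the watermark are set aside for post-processing instead of being handled in the live stream. This mirrors the general idea behind watermarking, but it updates the watermark per event rather than per micro-batch, and the function name `apply_watermark` is hypothetical.

```python
def apply_watermark(events, max_delay):
    """Split (event_time, payload) pairs, given in arrival order, into on-time and late.

    An event is late if its event time is older than the current watermark,
    defined here as (latest event time seen so far) - max_delay.
    """
    latest_event_time = float("-inf")
    on_time, late = [], []
    for event_time, payload in events:
        latest_event_time = max(latest_event_time, event_time)
        watermark = latest_event_time - max_delay
        if event_time >= watermark:
            on_time.append(payload)
        else:
            late.append(payload)  # candidates for post-processing
    return on_time, late


# Event times in arbitrary units; "c" arrives long after newer data were seen.
arrivals = [(10, "a"), (12, "b"), (5, "c"), (11, "d")]
on_time, late = apply_watermark(arrivals, max_delay=3)
print(on_time, late)
```

A larger `max_delay` accepts more late data at the cost of keeping state around longer, which is the trade-off behind the question on the slide.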
Thanks!
You have a public/private project in
mind? You want to contribute to
astronomy?
Come talk to me!
23#UnifiedDataAnalytics #SparkAISummit
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT
24

Contenu connexe

Tendances

Detecting solar farms with deep learning
Detecting solar farms with deep learningDetecting solar farms with deep learning
Detecting solar farms with deep learningJason Brown
 
Fundamentals of Internet and Communication Competition
Fundamentals of Internet and Communication CompetitionFundamentals of Internet and Communication Competition
Fundamentals of Internet and Communication CompetitionMarcoFasanella
 
NRP Engagement webinar - Running a 51k GPU multi-cloud burst for MMA with Ic...
 NRP Engagement webinar - Running a 51k GPU multi-cloud burst for MMA with Ic... NRP Engagement webinar - Running a 51k GPU multi-cloud burst for MMA with Ic...
NRP Engagement webinar - Running a 51k GPU multi-cloud burst for MMA with Ic...Igor Sfiligoi
 
Global Grid of Grapes
Global Grid of GrapesGlobal Grid of Grapes
Global Grid of GrapesDerek Groen
 
Burst data retrieval after 50k GPU Cloud run
Burst data retrieval after 50k GPU Cloud runBurst data retrieval after 50k GPU Cloud run
Burst data retrieval after 50k GPU Cloud runIgor Sfiligoi
 
DSD-INT 2015 - RSS Sentinel Toolbox - J. Manuel Delgado Blasco
DSD-INT 2015 - RSS Sentinel Toolbox - J. Manuel Delgado BlascoDSD-INT 2015 - RSS Sentinel Toolbox - J. Manuel Delgado Blasco
DSD-INT 2015 - RSS Sentinel Toolbox - J. Manuel Delgado BlascoDeltares
 
Data-intensive IceCube Cloud Burst
Data-intensive IceCube Cloud BurstData-intensive IceCube Cloud Burst
Data-intensive IceCube Cloud BurstIgor Sfiligoi
 
Climate data in r with the raster package
Climate data in r with the raster packageClimate data in r with the raster package
Climate data in r with the raster packageAlberto Labarga
 
GeoMesa LocationTech DC
GeoMesa LocationTech DCGeoMesa LocationTech DC
GeoMesa LocationTech DCCCRinc
 
Cluster formation over huge volatile robotic data
Cluster formation over huge volatile robotic data Cluster formation over huge volatile robotic data
Cluster formation over huge volatile robotic data Eirini Ntoutsi
 
Training and hands-on SNAP Sentinel-1 IW SLC Interferogram and Displacement
Training and hands-on SNAP Sentinel-1 IW SLC Interferogram and DisplacementTraining and hands-on SNAP Sentinel-1 IW SLC Interferogram and Displacement
Training and hands-on SNAP Sentinel-1 IW SLC Interferogram and DisplacementEmmanuel Mathot
 
Continuous and Parallel LiDAR Point-cloud Clustering
Continuous and Parallel LiDAR Point-cloud ClusteringContinuous and Parallel LiDAR Point-cloud Clustering
Continuous and Parallel LiDAR Point-cloud ClusteringHannaneh Najdataei
 
13.00 o7 j adams
13.00 o7 j adams13.00 o7 j adams
13.00 o7 j adamsNZIP
 
Using Deep Learning to Derive 3D Cities from Satellite Imagery
Using Deep Learning to Derive 3D Cities from Satellite ImageryUsing Deep Learning to Derive 3D Cities from Satellite Imagery
Using Deep Learning to Derive 3D Cities from Satellite ImageryAstraea, Inc.
 
Solar System Processing with LSST: A Status Update
Solar System Processing with LSST: A Status UpdateSolar System Processing with LSST: A Status Update
Solar System Processing with LSST: A Status UpdateMario Juric
 
Geospatial Data Visualization: WorldMap Integration by Raman Prasad
Geospatial Data Visualization: WorldMap Integration by Raman PrasadGeospatial Data Visualization: WorldMap Integration by Raman Prasad
Geospatial Data Visualization: WorldMap Integration by Raman Prasaddatascienceiqss
 
ExoNetLoopFinalShort
ExoNetLoopFinalShortExoNetLoopFinalShort
ExoNetLoopFinalShortGreg McNeill
 
Power Point Project
Power Point ProjectPower Point Project
Power Point ProjectHaganl
 

Tendances (19)

Detecting solar farms with deep learning
Detecting solar farms with deep learningDetecting solar farms with deep learning
Detecting solar farms with deep learning
 
Fundamentals of Internet and Communication Competition
Fundamentals of Internet and Communication CompetitionFundamentals of Internet and Communication Competition
Fundamentals of Internet and Communication Competition
 
NRP Engagement webinar - Running a 51k GPU multi-cloud burst for MMA with Ic...
 NRP Engagement webinar - Running a 51k GPU multi-cloud burst for MMA with Ic... NRP Engagement webinar - Running a 51k GPU multi-cloud burst for MMA with Ic...
NRP Engagement webinar - Running a 51k GPU multi-cloud burst for MMA with Ic...
 
Global Grid of Grapes
Global Grid of GrapesGlobal Grid of Grapes
Global Grid of Grapes
 
Burst data retrieval after 50k GPU Cloud run
Burst data retrieval after 50k GPU Cloud runBurst data retrieval after 50k GPU Cloud run
Burst data retrieval after 50k GPU Cloud run
 
DSD-INT 2015 - RSS Sentinel Toolbox - J. Manuel Delgado Blasco
DSD-INT 2015 - RSS Sentinel Toolbox - J. Manuel Delgado BlascoDSD-INT 2015 - RSS Sentinel Toolbox - J. Manuel Delgado Blasco
DSD-INT 2015 - RSS Sentinel Toolbox - J. Manuel Delgado Blasco
 
Data-intensive IceCube Cloud Burst
Data-intensive IceCube Cloud BurstData-intensive IceCube Cloud Burst
Data-intensive IceCube Cloud Burst
 
Climate data in r with the raster package
Climate data in r with the raster packageClimate data in r with the raster package
Climate data in r with the raster package
 
NASA's Movement Towards Cloud Computing
NASA's Movement Towards Cloud ComputingNASA's Movement Towards Cloud Computing
NASA's Movement Towards Cloud Computing
 
GeoMesa LocationTech DC
GeoMesa LocationTech DCGeoMesa LocationTech DC
GeoMesa LocationTech DC
 
Cluster formation over huge volatile robotic data
Cluster formation over huge volatile robotic data Cluster formation over huge volatile robotic data
Cluster formation over huge volatile robotic data
 
Training and hands-on SNAP Sentinel-1 IW SLC Interferogram and Displacement
Training and hands-on SNAP Sentinel-1 IW SLC Interferogram and DisplacementTraining and hands-on SNAP Sentinel-1 IW SLC Interferogram and Displacement
Training and hands-on SNAP Sentinel-1 IW SLC Interferogram and Displacement
 
Continuous and Parallel LiDAR Point-cloud Clustering
Continuous and Parallel LiDAR Point-cloud ClusteringContinuous and Parallel LiDAR Point-cloud Clustering
Continuous and Parallel LiDAR Point-cloud Clustering
 
13.00 o7 j adams
13.00 o7 j adams13.00 o7 j adams
13.00 o7 j adams
 
Using Deep Learning to Derive 3D Cities from Satellite Imagery
Using Deep Learning to Derive 3D Cities from Satellite ImageryUsing Deep Learning to Derive 3D Cities from Satellite Imagery
Using Deep Learning to Derive 3D Cities from Satellite Imagery
 
Solar System Processing with LSST: A Status Update
Solar System Processing with LSST: A Status UpdateSolar System Processing with LSST: A Status Update
Solar System Processing with LSST: A Status Update
 
Geospatial Data Visualization: WorldMap Integration by Raman Prasad
Geospatial Data Visualization: WorldMap Integration by Raman PrasadGeospatial Data Visualization: WorldMap Integration by Raman Prasad
Geospatial Data Visualization: WorldMap Integration by Raman Prasad
 
ExoNetLoopFinalShort
ExoNetLoopFinalShortExoNetLoopFinalShort
ExoNetLoopFinalShort
 
Power Point Project
Power Point ProjectPower Point Project
Power Point Project
 

Similaire à Accelerating Astronomical Discoveries with Apache Spark

Frossie Economou & Angelo Fausti [Vera C. Rubin Observatory] | How InfluxDB H...
Frossie Economou & Angelo Fausti [Vera C. Rubin Observatory] | How InfluxDB H...Frossie Economou & Angelo Fausti [Vera C. Rubin Observatory] | How InfluxDB H...
Frossie Economou & Angelo Fausti [Vera C. Rubin Observatory] | How InfluxDB H...InfluxData
 
"Building and running the cloud GPU vacuum cleaner"
"Building and running the cloud GPU vacuum cleaner""Building and running the cloud GPU vacuum cleaner"
"Building and running the cloud GPU vacuum cleaner"Frank Wuerthwein
 
Euclid & Big Data from dark space by Guillermo Buenadicha at Big Data Spain 2015
Euclid & Big Data from dark space by Guillermo Buenadicha at Big Data Spain 2015Euclid & Big Data from dark space by Guillermo Buenadicha at Big Data Spain 2015
Euclid & Big Data from dark space by Guillermo Buenadicha at Big Data Spain 2015Big Data Spain
 
Astronomical Data Processing on the LSST Scale with Apache Spark
Astronomical Data Processing on the LSST Scale with Apache SparkAstronomical Data Processing on the LSST Scale with Apache Spark
Astronomical Data Processing on the LSST Scale with Apache SparkDatabricks
 
Scalable Deep Learning in ExtremeEarth-phiweek19
Scalable Deep Learning in ExtremeEarth-phiweek19Scalable Deep Learning in ExtremeEarth-phiweek19
Scalable Deep Learning in ExtremeEarth-phiweek19ExtremeEarth
 
Project StarGate An End-to-End 10Gbps HPC to User Cyberinfrastructure ANL * C...
Project StarGate An End-to-End 10Gbps HPC to User Cyberinfrastructure ANL * C...Project StarGate An End-to-End 10Gbps HPC to User Cyberinfrastructure ANL * C...
Project StarGate An End-to-End 10Gbps HPC to User Cyberinfrastructure ANL * C...Larry Smarr
 
Running a GPU burst for Multi-Messenger Astrophysics with IceCube across all ...
Running a GPU burst for Multi-Messenger Astrophysics with IceCube across all ...Running a GPU burst for Multi-Messenger Astrophysics with IceCube across all ...
Running a GPU burst for Multi-Messenger Astrophysics with IceCube across all ...Frank Wuerthwein
 
Near Exascale Computing in the Cloud
Near Exascale Computing in the CloudNear Exascale Computing in the Cloud
Near Exascale Computing in the CloudFrank Wuerthwein
 
Bring Satellite and Drone Imagery into your Data Science Workflows
Bring Satellite and Drone Imagery into your Data Science WorkflowsBring Satellite and Drone Imagery into your Data Science Workflows
Bring Satellite and Drone Imagery into your Data Science WorkflowsDatabricks
 
Emc 2013 Big Data in Astronomy
Emc 2013 Big Data in AstronomyEmc 2013 Big Data in Astronomy
Emc 2013 Big Data in AstronomyFabio Porto
 
Data Capacitor II at Indiana University
Data Capacitor II at Indiana UniversityData Capacitor II at Indiana University
Data Capacitor II at Indiana Universityinside-BigData.com
 
Science and Cyberinfrastructure in the Data-Dominated Era
Science and Cyberinfrastructure in the Data-Dominated EraScience and Cyberinfrastructure in the Data-Dominated Era
Science and Cyberinfrastructure in the Data-Dominated EraLarry Smarr
 
ApacheCon NA 2013 VFASTR
ApacheCon NA 2013 VFASTRApacheCon NA 2013 VFASTR
ApacheCon NA 2013 VFASTRLucaCinquini
 
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache SparkThe Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache SparkKrishna Sankar
 
High Performance Cyberinfrastructure Enabling Data-Driven Science in the Biom...
High Performance Cyberinfrastructure Enabling Data-Driven Science in the Biom...High Performance Cyberinfrastructure Enabling Data-Driven Science in the Biom...
High Performance Cyberinfrastructure Enabling Data-Driven Science in the Biom...Larry Smarr
 
AstroAccelerate - GPU Accelerated Signal Processing on the Path to the Square...
AstroAccelerate - GPU Accelerated Signal Processing on the Path to the Square...AstroAccelerate - GPU Accelerated Signal Processing on the Path to the Square...
AstroAccelerate - GPU Accelerated Signal Processing on the Path to the Square...inside-BigData.com
 
Introduction to NetGuardians' Big Data Software Stack
Introduction to NetGuardians' Big Data Software StackIntroduction to NetGuardians' Big Data Software Stack
Introduction to NetGuardians' Big Data Software StackJérôme Kehrli
 
What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care? What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care? Robert Grossman
 
Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...
Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...
Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...Spark Summit
 
The Matsu Project - Open Source Software for Processing Satellite Imagery Data
The Matsu Project - Open Source Software for Processing Satellite Imagery DataThe Matsu Project - Open Source Software for Processing Satellite Imagery Data
The Matsu Project - Open Source Software for Processing Satellite Imagery DataRobert Grossman
 

Similaire à Accelerating Astronomical Discoveries with Apache Spark (20)

Frossie Economou & Angelo Fausti [Vera C. Rubin Observatory] | How InfluxDB H...
Frossie Economou & Angelo Fausti [Vera C. Rubin Observatory] | How InfluxDB H...Frossie Economou & Angelo Fausti [Vera C. Rubin Observatory] | How InfluxDB H...
Frossie Economou & Angelo Fausti [Vera C. Rubin Observatory] | How InfluxDB H...
 
"Building and running the cloud GPU vacuum cleaner"
"Building and running the cloud GPU vacuum cleaner""Building and running the cloud GPU vacuum cleaner"
"Building and running the cloud GPU vacuum cleaner"
 
Euclid & Big Data from dark space by Guillermo Buenadicha at Big Data Spain 2015
Euclid & Big Data from dark space by Guillermo Buenadicha at Big Data Spain 2015Euclid & Big Data from dark space by Guillermo Buenadicha at Big Data Spain 2015
Euclid & Big Data from dark space by Guillermo Buenadicha at Big Data Spain 2015
 
Astronomical Data Processing on the LSST Scale with Apache Spark
Astronomical Data Processing on the LSST Scale with Apache SparkAstronomical Data Processing on the LSST Scale with Apache Spark
Astronomical Data Processing on the LSST Scale with Apache Spark
 
Scalable Deep Learning in ExtremeEarth-phiweek19
Scalable Deep Learning in ExtremeEarth-phiweek19Scalable Deep Learning in ExtremeEarth-phiweek19
Scalable Deep Learning in ExtremeEarth-phiweek19
 
Project StarGate An End-to-End 10Gbps HPC to User Cyberinfrastructure ANL * C...
Project StarGate An End-to-End 10Gbps HPC to User Cyberinfrastructure ANL * C...Project StarGate An End-to-End 10Gbps HPC to User Cyberinfrastructure ANL * C...
Project StarGate An End-to-End 10Gbps HPC to User Cyberinfrastructure ANL * C...
 
Running a GPU burst for Multi-Messenger Astrophysics with IceCube across all ...
Running a GPU burst for Multi-Messenger Astrophysics with IceCube across all ...Running a GPU burst for Multi-Messenger Astrophysics with IceCube across all ...
Running a GPU burst for Multi-Messenger Astrophysics with IceCube across all ...
 
Near Exascale Computing in the Cloud
Near Exascale Computing in the CloudNear Exascale Computing in the Cloud
Near Exascale Computing in the Cloud
 
Bring Satellite and Drone Imagery into your Data Science Workflows
Bring Satellite and Drone Imagery into your Data Science WorkflowsBring Satellite and Drone Imagery into your Data Science Workflows
Bring Satellite and Drone Imagery into your Data Science Workflows
 
Emc 2013 Big Data in Astronomy
Emc 2013 Big Data in AstronomyEmc 2013 Big Data in Astronomy
Emc 2013 Big Data in Astronomy
 
Data Capacitor II at Indiana University
Data Capacitor II at Indiana UniversityData Capacitor II at Indiana University
Data Capacitor II at Indiana University
 
Science and Cyberinfrastructure in the Data-Dominated Era
Science and Cyberinfrastructure in the Data-Dominated EraScience and Cyberinfrastructure in the Data-Dominated Era
Science and Cyberinfrastructure in the Data-Dominated Era
 
ApacheCon NA 2013 VFASTR
ApacheCon NA 2013 VFASTRApacheCon NA 2013 VFASTR
ApacheCon NA 2013 VFASTR
 
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache SparkThe Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
 
High Performance Cyberinfrastructure Enabling Data-Driven Science in the Biom...
High Performance Cyberinfrastructure Enabling Data-Driven Science in the Biom...High Performance Cyberinfrastructure Enabling Data-Driven Science in the Biom...
High Performance Cyberinfrastructure Enabling Data-Driven Science in the Biom...
 
AstroAccelerate - GPU Accelerated Signal Processing on the Path to the Square...
AstroAccelerate - GPU Accelerated Signal Processing on the Path to the Square...AstroAccelerate - GPU Accelerated Signal Processing on the Path to the Square...
AstroAccelerate - GPU Accelerated Signal Processing on the Path to the Square...
 
Introduction to NetGuardians' Big Data Software Stack
Introduction to NetGuardians' Big Data Software StackIntroduction to NetGuardians' Big Data Software Stack
Introduction to NetGuardians' Big Data Software Stack
 
What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care? What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care?
 
Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...
Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...
Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...
 
The Matsu Project - Open Source Software for Processing Satellite Imagery Data
The Matsu Project - Open Source Software for Processing Satellite Imagery DataThe Matsu Project - Open Source Software for Processing Satellite Imagery Data
The Matsu Project - Open Source Software for Processing Satellite Imagery Data
 

Plus de Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

Plus de Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Accelerating Astronomical Discoveries with Apache Spark

  • 1. Accelerating Astronomical Discoveries with Apache Spark. Julien Peloton, CNRS. #UnifiedDataAnalytics #SparkAISummit
  • 3. How can we get different data? [Figure labels: "~1/100,000 of sky"; "Large but shallow"; "Hubble FoV"; "Deep but small"]
  • 4. Large Synoptic Survey Telescope
      • 2022-2032: deep & large survey
      • Non-profit corporation
      • Site: Chile (Cerro Pachón)
      • US led, international collaboration (1000+)
  • 5. Million pieces puzzle
      • LSST will deliver a ~full sky map every 3 nights
          – 3.2-gigapixel camera (car size!)
          – 15 TB/night of raw image data collected
          – 1 TB/night of alerts streamed
  • 6. We would like to be able to do at scale:
      • Exploring large catalogs of data
      • Cross-matching large catalogs
      • Processing telescope images
      • Classifying light-curves
      • Processing telescope alerts
      • ...
      Apache Spark for astronomy?
  • 7. FITS: astronomical data format
      • First (last) release: 1981 (2016).
      • Endorsed by NASA and the International Astronomical Union.
      • Multi-purpose: vectors, images, tables, ...
      • Backward compatible
      • Set of blocks; one block: ASCII header + binary data arrays of arbitrary dimension
      • Support for C, C++, C#, Fortran, IDL, Java, Julia, MATLAB, Perl, Python, R, and more…
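To make the "ASCII header + binary data" layout concrete, here is a minimal sketch in plain Python. It relies on two facts from the FITS standard (80-character header cards, 2880-byte blocks); the helper names `make_card` and `parse_header` are hypothetical, and the parser deliberately ignores quoted strings and other corner cases.

```python
# Illustrative only: build and parse one FITS header block by hand.
BLOCK_SIZE = 2880  # FITS files are organised in 2880-byte blocks
CARD_SIZE = 80     # each header record ("card") is 80 ASCII characters


def make_card(keyword: str, value: str = "") -> bytes:
    """Build one 80-byte card: keyword in columns 1-8, '= ' in 9-10, value after."""
    text = f"{keyword:<8}= {value:>20}" if value else f"{keyword:<8}"
    return text.ljust(CARD_SIZE).encode("ascii")


def parse_header(block: bytes) -> dict:
    """Collect keyword/value pairs from the cards until the END card."""
    header = {}
    for start in range(0, len(block), CARD_SIZE):
        card = block[start:start + CARD_SIZE].decode("ascii")
        keyword = card[:8].strip()
        if keyword == "END":
            break
        if card[8:10] == "= ":
            # Simplification: drop anything after '/', the comment separator.
            header[keyword] = card[10:].split("/")[0].strip()
    return header


cards = [
    make_card("SIMPLE", "T"),
    make_card("BITPIX", "-32"),   # 32-bit floating point pixels
    make_card("NAXIS", "2"),      # a 2D image
    make_card("NAXIS1", "100"),
    make_card("NAXIS2", "200"),
    make_card("END"),
]
# The header is padded with spaces up to a whole number of blocks.
header_block = b"".join(cards).ljust(BLOCK_SIZE)

header = parse_header(header_block)
# Size of the binary data array that would follow this header.
data_bytes = (abs(int(header["BITPIX"])) // 8
              * int(header["NAXIS1"]) * int(header["NAXIS2"]))
print(header, data_bytes)
```

Because the header fully describes the size and type of the data that follows, a reader can locate and split the binary payload without scanning it, which is what makes the format amenable to distributed reads.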
  • 8. spark-fits
      • FITS data source for Spark SQL and DataFrames.
      • Data Source V1 API.
      • Images + tables available.
      • Schema automatically inferred from the FITS header.
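As an illustration of what "schema inferred from the header" can mean for a binary table, the sketch below maps the standard `TTYPEn`/`TFORMn` keywords to Spark SQL type names. This is not spark-fits code: the function `infer_schema`, the chosen type mapping and the example header are assumptions, and only a subset of the FITS type codes is covered.

```python
import re

# Subset of FITS binary-table type codes. The target Spark SQL type
# names are an assumed mapping chosen for this sketch.
TFORM_TO_SPARK = {
    "L": "boolean",
    "I": "short",    # 16-bit integer
    "J": "integer",  # 32-bit integer
    "K": "long",     # 64-bit integer
    "E": "float",    # 32-bit floating point
    "D": "double",   # 64-bit floating point
    "A": "string",   # character data
}


def infer_schema(header: dict) -> list:
    """Return [(column name, Spark SQL type name), ...] from a table header."""
    schema = []
    for i in range(1, int(header["TFIELDS"]) + 1):
        name = header[f"TTYPE{i}"]
        # TFORM is '<repeat count><type code>', e.g. 'E', '1D' or '20A'.
        match = re.fullmatch(r"(\d*)([A-Z])", header[f"TFORM{i}"])
        if match is None or match.group(2) not in TFORM_TO_SPARK:
            raise ValueError(f"unsupported TFORM{i}: {header[f'TFORM{i}']}")
        repeat, code = match.groups()
        spark_type = TFORM_TO_SPARK[code]
        # A repeat count above 1 means an array, except for 'A',
        # where it is the string length.
        if code != "A" and repeat not in ("", "1"):
            spark_type = f"array<{spark_type}>"
        schema.append((name, spark_type))
    return schema


# Hypothetical catalog header with four columns.
table_header = {
    "TFIELDS": "4",
    "TTYPE1": "RA", "TFORM1": "D",
    "TTYPE2": "DEC", "TFORM2": "D",
    "TTYPE3": "OBJID", "TFORM3": "K",
    "TTYPE4": "NAME", "TFORM4": "20A",
}
schema = infer_schema(table_header)
print(schema)
```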
  • 9. spark-fits in practice
      • Spark 2.3.1 / Hadoop 2.8.4
      • 1.1 billion rows, 153 cores
      • Run 100 times (no cache).
      • Performance (I/O throughput) comparable to Spark's built-in connectors (no attempt to optimise anything anywhere…)
  • 10. Current limitations
      Some limitations remain, though…
      • Need to migrate to Apache Spark DSv2.
      • No column pruning, no filters at the level of the connector.
      • (De)compression is not handled yet.
      • The Scala FITS library lacks many features.
  • 11. We live in a 3D world
      • Manipulating 2D data with Spark: GeoTrellis, Magellan, GeoSpark, GeoMesa, …
      • Very little about 3D!
      • Needed for e.g. astronomy, particle physics, meteorology.
  • 12. Manipulating 3D spatial data: spark3D
      • 3D distributed partitioning
          – KDTree, Octree, shells, ...
      • Distributed spatial queries & data mining
          – KNN, join, DBSCAN, …
          – Typical usage on million/billion rows
      • Visualisation
          – Client/server architecture
      Student: Mayur Bhosale (now at Qubole)
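The idea behind octree partitioning can be sketched without Spark: each point gets a partition id by repeatedly halving its bounding box along the three axes. This is a simplified, fixed-depth illustration, not spark3D's implementation; the function name `octree_partition` and the sample points are hypothetical.

```python
from collections import defaultdict


def octree_partition(point, box_min, box_max, depth):
    """Return the id of the octree cell containing `point` at a fixed depth.

    Each level contributes 3 bits (one per axis), so ids range over
    0 .. 8**depth - 1.
    """
    lo, hi = list(box_min), list(box_max)
    partition_id = 0
    for _ in range(depth):
        octant = 0
        for axis in range(3):
            mid = 0.5 * (lo[axis] + hi[axis])
            if point[axis] >= mid:
                octant |= 1 << axis  # upper half along this axis
                lo[axis] = mid
            else:
                hi[axis] = mid
        partition_id = partition_id * 8 + octant
    return partition_id


# Group a few points of the unit cube into depth-1 cells (8 partitions).
points = [(0.1, 0.1, 0.1), (0.2, 0.3, 0.1), (0.9, 0.1, 0.1), (0.9, 0.9, 0.9)]
partitions = defaultdict(list)
for p in points:
    partitions[octree_partition(p, (0, 0, 0), (1, 1, 1), depth=1)].append(p)
print(dict(partitions))
```

In a distributed setting an id like this would serve as the repartitioning key, so that points close in space end up on the same executor before a spatial query such as KNN or a join.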
  • 13. On the repartitioning...
      Repartitioning is frequent because data arrives unstructured, but:
      • Repartitioning implies heavy shuffle between executors.
      • Complex UDFs in Spark are often inefficient.
  • 14. Need for (efficient) streaming
      • We explored the static sky - namely what has been observed.
      • But what about what is happening right now? E.g.
          – Supernovae (star explosion)
          – Black hole merger counterparts (multi-messenger astronomy)
          – Micro-lensing (extrasolar planet search)
          – Earth killers!
          – Anomaly detection (unforeseen astronomical sources)
      • Correlation past/present/future?
      • Timescales range from seconds to months...
  • 15. Desiderata & solution
      We would like:
      • To work efficiently at scale
      • Multi-modal analytics capability (streaming & batch)
      • Good integration with the current ecosystem
      Solution: Structured Streaming
  • 16. Introducing Fink
      Fink is
      • A broker system for sky alerts
      • Based on Apache Spark
      Fink does
      • Collect, enrich & distribute sky alerts
      [Diagram steps: Collect, Enrich, Distribute]
  • 17. On a quiet night...
      • 10,000 Avro alerts every 30 seconds
      • 1 TB of alerts per night
      • Parquet database
      [Figure panels: Observation, Template, Difference. Credits: E. Bellm]
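A quick back-of-the-envelope calculation shows what these figures imply per alert. The 10-hour observing night is an assumption introduced here (the slide does not state it), and 1 TB is taken as 10^12 bytes.

```python
# Figures from the slide.
ALERTS_PER_BATCH = 10_000
BATCH_PERIOD_S = 30
NIGHT_VOLUME_BYTES = 1e12  # 1 TB, taken as decimal terabytes

# ASSUMPTION (not in the slides): alerts flow for about 10 hours per night.
NIGHT_HOURS = 10

alerts_per_second = ALERTS_PER_BATCH / BATCH_PERIOD_S
alerts_per_night = alerts_per_second * NIGHT_HOURS * 3600
bytes_per_alert = NIGHT_VOLUME_BYTES / alerts_per_night

print(f"{alerts_per_second:.0f} alerts/s")
print(f"{alerts_per_night / 1e6:.0f} million alerts/night")
print(f"{bytes_per_alert / 1e3:.0f} kB per alert")
```

Under that assumption the stream averages a few hundred alerts per second, on the order of ten million alerts per night, with each alert weighing tens of kilobytes.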
  • 18. Who’s who
      Add value to the raw alerts
      • Stream-static join
      • Classification (BNN)
      [Diagram labels: Alert stream, Internal catalogs, Structured Streaming, Alert database]
  • 19. Joining external information
      [Diagram labels: Neutrino alert stream, Gamma ray alert stream, Optical alert stream, Gravitational wave alert stream, Structured Streaming, Join output]
      • Spark does all the hard work
          – Small delays
          – Record throughput
          – Stream position recovery
      • But it cannot do everything...
          – Large delays
          – False positives
      • Humans are still needed to make decisions
  • 20. The Hero’s Return
      Processing based on Adaptive Learning (PoC)
      • Ranking of promising candidates
      • Improved classification over time
      [Diagram labels: New Candidates, Follow-up & Discovery, Training]
      Streaming infrastructure by: Abhishek Chauhan (now at Morgan Stanley)
  • 21. The fear of the shutdown!
      What if we miss a night?
      • 14 million alerts, 830 GB of data
      • Let Spark do the hard work again (offsets, updates...)
      [Diagram labels: Broker shutdown…; Collect alerts (cache); Collect & write: 100 minutes on 3 machines]
      Limiting factors
      • Number of machines
      • Network
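The recovery figures translate into throughput as follows. This is plain arithmetic on the slide's numbers, assuming the "100 minutes on 3 machines" refers to collecting and writing the full missed night (14 million alerts, 830 GB) and that GB means 10^9 bytes.

```python
# Figures from the slide.
MISSED_ALERTS = 14e6
MISSED_BYTES = 830e9  # 830 GB, taken as decimal gigabytes
RECOVERY_MINUTES = 100
MACHINES = 3

recovery_seconds = RECOVERY_MINUTES * 60
alerts_per_second = MISSED_ALERTS / recovery_seconds
total_mb_per_second = MISSED_BYTES / recovery_seconds / 1e6
mb_per_second_per_machine = total_mb_per_second / MACHINES
bytes_per_alert = MISSED_BYTES / MISSED_ALERTS

print(f"{alerts_per_second:.0f} alerts/s during catch-up")
print(f"{total_mb_per_second:.0f} MB/s in total, "
      f"{mb_per_second_per_machine:.0f} MB/s per machine")
print(f"{bytes_per_alert / 1e3:.0f} kB per alert on average")
```

Compared with the live rate quoted earlier in the talk (10,000 alerts every 30 seconds, about 333 alerts/s), the catch-up runs roughly seven times faster than real time, which fits the slide's point that the number of machines and the network are the limiting factors.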
  • 22. Some lessons learned
      • Handling stream offsets
          – Manual or not? Still not obvious...
      • Schema evolution
          – User needs change often… Database choice is crucial
      • Dynamic filtering
          – Need to adapt quickly to new situations
      • Handling watermarks
          – How long shall we wait for data? Switch to post-processing.
      • Communication
          – Using common communication protocols & data formats...
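The watermark question ("how long shall we wait for data?") can be illustrated with a small sketch in plain Python. It is a per-event simplification, not Spark's implementation (Structured Streaming updates its watermark between micro-batches); the class name `WatermarkFilter`, the 10-second delay and the event times are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WatermarkFilter:
    """Accept events no older than (latest event time seen - max_delay_s)."""
    max_delay_s: float
    max_event_time: float = float("-inf")

    def accept(self, event_time: float) -> bool:
        watermark = self.max_event_time - self.max_delay_s
        if event_time < watermark:
            return False  # too late for the stream
        self.max_event_time = max(self.max_event_time, event_time)
        return True


# Event times in seconds, in arrival order; some arrive out of order.
arrivals = [100, 105, 98, 130, 110, 125]
wm = WatermarkFilter(max_delay_s=10)

on_time, late = [], []
for t in arrivals:
    (on_time if wm.accept(t) else late).append(t)

print("processed in stream:", on_time)
print("left for post-processing:", late)
```

A longer delay keeps more late events in the stream at the cost of holding state longer; events that miss the watermark are the ones the slide suggests handing over to post-processing.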
  • 23. Thanks! Do you have a public/private project in mind? Do you want to contribute to astronomy? Come talk to me!
  • 24. Don’t forget to rate and review the sessions. Search: Spark + AI Summit