Introductory slides to spark the discussion at the MSDSE 2017 round table on tools enabling data management and analytics of 10-100 TB catalogs, using a specific astronomy problem as a case study.
Round Table Introduction: Analytics on 100 TB+ catalogs
1. Analytics on 100 TB+ catalogs
Enabling astronomy in the era of massive survey telescopes
Mario Juric <mjuric@astro.washington.edu>
UW Astronomy | DIRAC | eScience
@mjuric
5. Spatial Extent: the Entire Sky
Example: The sky footprint of early Pan-STARRS PS1 data
6. Spatial Extent: the ~Entire Sky
Example: The sky footprint of early Pan-STARRS PS1 data
(zoomed in on a “medium deep field”)
7. New Science: the Time Component
> Time series analysis
(classification)
> Rapid identification and
alerting on “interesting”
variability
> Identification of moving
sources
Example RR Lyrae light curves from Székely et al. (2007)
8. The Wishlist: What we’re looking for in a DBMS
> Must be able to reliably store the data
> Must enable efficient batch processing
– E.g., “compute this statistic over all time series”, in ~hours
> Must enable fast extraction of individual time series
– E.g., “give me the light curve of X”, in <1 s
> Must enable fast spatial queries, fast histograms
– E.g., “give me all objects in this area on the sky”, in <1 s to start
> Must enable easy “cross matching”
– Positionally cross-match N catalogs, find neighbors
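To make the “objects in this area” query concrete, here is a naive brute-force cone search in NumPy. This is illustrative only: it shows what the query computes, not how a DBMS would make it fast (at 100 TB scale, spatial indexing/partitioning such as HEALPix or q3c is what delivers the <1 s latency).

```python
import numpy as np

def cone_search(ra, dec, ra0, dec0, radius_deg):
    """Naive brute-force cone search: return indices of catalog objects
    within radius_deg of (ra0, dec0). All angles in degrees."""
    ra, dec = np.radians(ra), np.radians(dec)
    ra0, dec0 = np.radians(ra0), np.radians(dec0)
    # angular separation via the spherical law of cosines
    cos_sep = (np.sin(dec) * np.sin(dec0)
               + np.cos(dec) * np.cos(dec0) * np.cos(ra - ra0))
    sep = np.degrees(np.arccos(np.clip(cos_sep, -1.0, 1.0)))
    return np.nonzero(sep <= radius_deg)[0]

# toy catalog: two objects near (10, 20), one far away
ra  = np.array([10.0, 10.1, 50.0])
dec = np.array([20.0, 20.05, -30.0])
print(cone_search(ra, dec, 10.0, 20.0, 0.2))  # -> [0 1]
```

The same separation computation, applied pairwise between two catalogs with a nearest-neighbor cut, is the core of positional cross-matching.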
9. The Wishlist: What we’re looking for in a DBMS
> Must support insertions of ~300M rows/night
> Must scale to ~100TB+ catalogs in ~3 years
> Efficient in multi-user mode
> Should (must) be easy to use
– Shallow learning curve, ease of install, strong Python APIs
– Ideally easily replicated and manageable by astronomers.
– SQL-like interface is a plus (declarative queries)
> Ideally, we would like to get it up and running in ~4-6 months.
10. Options We’re Looking At
> Relational Databases
– Postgres, Oracle, qserv (experimental)
– Challenging to have tables of ~100 billion rows (expectation after ~1yr)
– Slow time-series extraction
> Parquet+Spark
– Looks like it may scale.
– Not easy to set up, steep learning curve
– No native multi-user awareness
> Custom solution (“Large Survey Database”; http://lsddb.org)
– Partitioned tree of HDF5 files (“Parquet before Parquet”) + Python client
– Special snowflake, will need eternal support, no community.
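The fast time-series-extraction requirement from the wishlist is essentially a partition-pruning problem, and it is what all three options above must solve. A toy Python sketch of the idea (the `object_id % N` partitioning scheme is hypothetical, standing in for what Parquet partitioning or LSD's HDF5 tree does on disk):

```python
from collections import defaultdict

NUM_PARTITIONS = 4            # real systems use thousands of partitions
partitions = defaultdict(list)

def insert(object_id, mjd, mag):
    """Route each (object_id, time, magnitude) row to its partition."""
    partitions[object_id % NUM_PARTITIONS].append((object_id, mjd, mag))

def light_curve(object_id):
    """Scan only ONE partition instead of the full table -- this
    pruning is the essence of the '<1 s extraction' requirement."""
    part = partitions[object_id % NUM_PARTITIONS]
    return sorted((mjd, mag) for oid, mjd, mag in part if oid == object_id)

insert(42, 57000.1, 18.2)
insert(42, 57003.4, 18.5)
insert(7, 57001.0, 16.9)
print(light_curve(42))  # -> [(57000.1, 18.2), (57003.4, 18.5)]
```

Partitioning by object (or by a spatial key) optimizes per-object extraction; the batch-processing requirement is then served by scanning partitions in parallel.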
11. Discuss
Are there other areas that have to deal with
~a billion time series of 100+ measurements each?
What are the technology choices you use to
manage your data sets? What should we
be looking at?
12. A Related Problem: Telemetry Databases
> ~100+ sensors, <=10 Hz sampling
– ~500 MB/night
– ~150 GB/yr
> Slightly different slicing needs
– “Give me the data from all sensors in the following time
window”, as opposed to “give me all the data for the following
set of objects”
> Simple HDF5 may work
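A minimal sketch of the telemetry access pattern, assuming time-sorted per-sensor arrays (as they might be laid out in HDF5 datasets; plain NumPy stands in for the file layer here) and binary search to slice out a time window:

```python
import numpy as np

# Time-sorted telemetry for one sensor channel, as it might sit in an
# HDF5 dataset. At ~150 GB/yr the whole store fits on a single node.
t = np.array([0.0, 0.1, 0.2, 0.3, 0.4, 0.5])   # timestamps (s), sorted
readings = np.arange(6) * 1.0                   # sensor values

def window(t, values, t0, t1):
    """Return all readings with t0 <= t < t1 via binary search,
    so window extraction is O(log n) + the size of the slice."""
    i0, i1 = np.searchsorted(t, [t0, t1])
    return values[i0:i1]

print(window(t, readings, 0.15, 0.45))  # -> [2. 3. 4.]
```

The “all sensors in a time window” query is then just this slice repeated over each sensor's dataset, which is why a simple time-partitioned HDF5 layout may suffice here.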
13. The Next Problem (in 2022)
The Large Synoptic Survey Telescope
An automated 8.4-meter telescope that for 10 years will
image half the sky every ~3 days, generate ~50 PB of
(raw) imaging data, issue real-time alerts to any changes
in the sky (~10 million/night), measure properties of
~40 billion objects in the sky (~1000 times
each), and make the results available
in a web-accessible database.
http://lsst.org