Introductory slides to spark the discussion at the MSDSE 2017 round table on tools enabling data management and analytics of 10-100 TB catalogs, using a specific astronomy problem as a case study.
Round Table Introduction: Analytics on 100 TB+ catalogs
1. Analytics on 100 TB+ catalogs
Enabling astronomy in the era of massive survey telescopes
Mario Juric <mjuric@astro.washington.edu>
UW Astronomy | DIRAC | eScience
@mjuric
5. Spatial Extent: the Entire Sky
Example: The sky footprint of early Pan-STARRS PS1 data
6. Spatial Extent: the ~Entire Sky
Example: The sky footprint of early Pan-STARRS PS1 data
(zoomed in on a “medium deep field”)
7. New Science: the Time Component
> Time series analysis
(classification)
> Rapid identification and
alerting on “interesting”
variability
> Identification of moving
sources
Example RR Lyrae light curves from Székely et al. (2007)
8. The Wishlist: What we’re looking for in a DBMS
> Must be able to reliably store the data
> Must enable efficient batch processing
– E.g., “compute this statistic over all time series”, in ~hours
> Must enable fast extraction of individual time series
– E.g., “give me the light curve of X”, in <1 s
> Must enable fast spatial queries, fast histograms
– E.g., “give me all objects in this area on the sky”, in <1 s to start
> Must enable easy “cross matching”
– Positionally cross-match N catalogs, find neighbors
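To make the “objects in this area” query concrete, here is a naive brute-force cone search in NumPy. This is illustrative only: it shows what the query computes, not how a DBMS would make it fast (at 100 TB scale, spatial indexing/partitioning such as HEALPix or q3c is what delivers the <1 s latency).

```python
import numpy as np

def cone_search(ra, dec, ra0, dec0, radius_deg):
    """Naive brute-force cone search: return indices of catalog objects
    within radius_deg of (ra0, dec0). All angles in degrees."""
    ra, dec = np.radians(ra), np.radians(dec)
    ra0, dec0 = np.radians(ra0), np.radians(dec0)
    # angular separation via the spherical law of cosines
    cos_sep = (np.sin(dec) * np.sin(dec0)
               + np.cos(dec) * np.cos(dec0) * np.cos(ra - ra0))
    sep = np.degrees(np.arccos(np.clip(cos_sep, -1.0, 1.0)))
    return np.nonzero(sep <= radius_deg)[0]

# toy catalog: two objects near (10, 20), one far away
ra  = np.array([10.0, 10.1, 50.0])
dec = np.array([20.0, 20.05, -30.0])
print(cone_search(ra, dec, 10.0, 20.0, 0.2))  # -> [0 1]
```

The same separation computation, applied pairwise between two catalogs with a nearest-neighbor cut, is the core of positional cross-matching.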
9. The Wishlist: What we’re looking for in a DBMS
> Must support insertions of ~300M rows/night
> Must scale to ~100TB+ catalogs in ~3 years
> Efficient in multi-user mode
> Should (must) be easy to use
– Shallow learning curve, ease of install, strong Python APIs
– Ideally easily replicated and manageable by astronomers.
– SQL-like interface is a plus (declarative queries)
> Ideally, we would like to get it up and running in ~4-6 months.
10. Options We’re Looking At
> Relational Databases
– Postgres, Oracle, qserv (experimental)
– Challenging to have tables of ~100 billion rows (expectation after ~1yr)
– Slow time-series extraction
> Parquet+Spark
– Looks like it may scale.
– Not easy to set up, steep learning curve
– No native multi-user awareness
> Custom solution (“Large Survey Database”; http://lsddb.org)
– Partitioned tree of HDF5 files (“Parquet before Parquet”) + Python client
– Special snowflake, will need eternal support, no community.
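The fast time-series-extraction requirement from the wishlist is essentially a partition-pruning problem, and it is what all three options above must solve. A toy Python sketch of the idea (the `object_id % N` partitioning scheme is hypothetical, standing in for what Parquet partitioning or LSD's HDF5 tree does on disk):

```python
from collections import defaultdict

NUM_PARTITIONS = 4            # real systems use thousands of partitions
partitions = defaultdict(list)

def insert(object_id, mjd, mag):
    """Route each (object_id, time, magnitude) row to its partition."""
    partitions[object_id % NUM_PARTITIONS].append((object_id, mjd, mag))

def light_curve(object_id):
    """Scan only ONE partition instead of the full table -- this
    pruning is the essence of the '<1 s extraction' requirement."""
    part = partitions[object_id % NUM_PARTITIONS]
    return sorted((mjd, mag) for oid, mjd, mag in part if oid == object_id)

insert(42, 57000.1, 18.2)
insert(42, 57003.4, 18.5)
insert(7, 57001.0, 16.9)
print(light_curve(42))  # -> [(57000.1, 18.2), (57003.4, 18.5)]
```

Partitioning by object (or by a spatial key) optimizes per-object extraction; the batch-processing requirement is then served by scanning partitions in parallel.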
11. Discuss
Are there other areas that have to deal with
~a billion time series of 100+ measurements each?
What are the technology choices you use to
manage your data sets? What should we
be looking at?
12. A Related Problem: Telemetry Databases
> ~100+ sensors, <=10 Hz sampling
– ~500 MB/night
– ~150 GB/yr
> Slightly different slicing needs
– “Give me the data from all sensors in the following time
window”, as opposed to “give me all the data for the following
set of objects”
> Simple HDF5 may work
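A minimal sketch of the telemetry access pattern, assuming time-sorted per-sensor arrays (as they might be laid out in HDF5 datasets; plain NumPy stands in for the file layer here) and binary search to slice out a time window:

```python
import numpy as np

# Time-sorted telemetry for one sensor channel, as it might sit in an
# HDF5 dataset. At ~150 GB/yr the whole store fits on a single node.
t = np.array([0.0, 0.1, 0.2, 0.3, 0.4, 0.5])   # timestamps (s), sorted
readings = np.arange(6) * 1.0                   # sensor values

def window(t, values, t0, t1):
    """Return all readings with t0 <= t < t1 via binary search,
    so window extraction is O(log n) + the size of the slice."""
    i0, i1 = np.searchsorted(t, [t0, t1])
    return values[i0:i1]

print(window(t, readings, 0.15, 0.45))  # -> [2. 3. 4.]
```

The “all sensors in a time window” query is then just this slice repeated over each sensor's dataset, which is why a simple time-partitioned HDF5 layout may suffice here.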
13. The Next Problem (in 2022)
The Large Synoptic Survey Telescope
An automated 8.4-meter telescope that for 10 years will
image half the sky every ~3 days, generate ~50 PB of
(raw) imaging data, issue real-time alerts to any changes
in the sky (~10 million/night), measure properties of
~40 billion objects in the sky (~1000 times
each), and make the results available
in a web-accessible database.
http://lsst.org