Analyzing Extended and Scientific Metadata for Scalable Index Designs

Aleatha Parker-Wood*^, Brian A. Madden*, Michael McThrow*,
Darrell D.E. Long*, Ian F. Adams*, Avani Wildani*
*University of California Santa Cruz
^Conservatoire National des Arts et Métiers
Examining Extended and Scientific Metadata for Scalable Index Designs

What we call metadata
• Data for the system
• External to the file
• Small
• Dense
Abraham Silberschatz, Greg Gagne, and Peter Baer Galvin, "Operating System Concepts, Eighth Edition"

What everyone else calls metadata
• Data for the user
• Embedded in:
  • the file
  • the inode
  • a separate file
  • a notebook somewhere on their desk
• Wildly varying size
• Sparse
[Figure: locations of metadata, showing embedded metadata, metadata files, inode metadata, and metadata outside the system]

A scientist at work
• “Show me the data set about bears in Alaska from last fall”
• “Show me simulation results from last week for Vesuvius which used this code library, and where the pressure is higher than 500 kilopascals”
• A mix of system and scientific metadata

Our options
• Relational databases
• Column stores
• Spatial trees (e.g., Spyglass, SmartStore)
• Inverted indexes
• Bitmap indexes (e.g., FastBit)
• The choice of index depends on the data, but what does the data look like?

Outline
• The data in brief
• Dimensionality
• Sparsity
• Atomicity
• Entropy

The metadata in brief
Dataset  Discipline    Native format  Record count  Subsampled?  Sample count  Total size
Dryad    Biology       XML            31K           No           31K           400 MB
WISE     Astronomy     CSV            564M          Yes          10K           1 TB
ARGO     Oceanography  NetCDF         2B            Yes          635K          330 GB
ORNL     Climatology   CSV            1478          No           1478          154 KB

Dimensionality
            Dryad  WISE  Argo  ORNL  Total
Dimensions  44     285   108   14    451

• Much higher dimensional than POSIX data
• Curse of dimensionality concerns

Sparsity
Sparse even within a discipline (extremely sparse across all disciplines)
• CDF of sparsity
• For a randomly chosen element from X% of columns, there is a Y% chance it will be null
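
The sparsity CDF described above can be sketched in a few lines. This is a minimal illustration, not the analysis code behind the slide: the sample records, the column names, and the helper names `null_fractions` and `sparsity_cdf` are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical records: one dict per item; None marks a missing (null) value.
records: List[Dict[str, Optional[str]]] = [
    {"title": "a", "species": "bear", "depth_m": None},
    {"title": "b", "species": None, "depth_m": None},
    {"title": "c", "species": None, "depth_m": None},
    {"title": "d", "species": None, "depth_m": "12"},
]
columns = ["title", "species", "depth_m"]


def null_fractions(rows, cols) -> Dict[str, float]:
    """Fraction of rows in which each column is null or absent."""
    return {
        col: sum(1 for row in rows if row.get(col) is None) / len(rows)
        for col in cols
    }


def sparsity_cdf(fractions) -> List[Tuple[float, float]]:
    """Points (y, x): a share x of the columns has a null fraction of at most y."""
    ordered = sorted(fractions.values())
    return [(y, (i + 1) / len(ordered)) for i, y in enumerate(ordered)]


fractions = null_fractions(records, columns)
cdf = sparsity_cdf(fractions)
```

Each point in `cdf` pairs a null fraction with the share of columns that are at most that sparse, which is the X%/Y% relationship read off the plot.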

Atomicity (Dryad)
• How many times can a field be present for a single item?
• E.g., a single paper can have multiple authors
• Truncated to show detail. One study had 800 species!
Some disciplines have many field values per item.
Others have range values (e.g., May-June 2010)
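
One way to measure atomicity is to count how often each field repeats within a single item. The sketch below assumes each item is a list of (field, value) pairs; the sample items and the helper name `max_values_per_field` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical items: each item is a list of (field, value) pairs,
# so a field may legitimately appear more than once per item.
items: List[List[Tuple[str, str]]] = [
    [("title", "x"), ("author", "A"), ("author", "B")],
    [("title", "y"), ("author", "C"),
     ("species", "s1"), ("species", "s2"), ("species", "s3")],
]


def max_values_per_field(all_items) -> Dict[str, int]:
    """Largest number of times each field occurs within any single item."""
    worst: Dict[str, int] = {}
    for item in all_items:
        counts = Counter(field for field, _ in item)
        for field, n in counts.items():
            worst[field] = max(worst.get(field, 0), n)
    return worst


worst_case = max_values_per_field(items)
# Fields that occur more than once in some item are non-atomic (one to many).
non_atomic = sorted(f for f, n in worst_case.items() if n > 1)
```

A field with a worst-case count above one needs an index that can hold several values per item rather than a single cell.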

Entropy
• Row organization versus column
• How compressible is the data?
• How selective are queries?
• Plenty of compression available
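
As a rough illustration of the entropy measure, the sketch below computes the Shannon entropy of a single column's values. The sample columns and the helper name `shannon_entropy` are hypothetical, and the slides do not state exactly how entropy was computed, so treat this as one plausible reading.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def shannon_entropy(values: Sequence[Hashable]) -> float:
    """Shannon entropy, in bits per value, of the empirical value distribution."""
    if not values:
        return 0.0
    total = len(values)
    counts = Counter(values)
    # Sum p * log2(1 / p) over the distinct values.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Hypothetical columns: one constant, one evenly split, one with all-distinct values.
constant_column = ["CSV", "CSV", "CSV", "CSV"]
split_column = ["north", "north", "south", "south"]
distinct_column = ["id1", "id2", "id3", "id4"]

h_constant = shannon_entropy(constant_column)
h_split = shannon_entropy(split_column)
h_distinct = shannon_entropy(distinct_column)
```

A column with entropy near zero should compress well, while a column whose entropy approaches log2 of the row count has nearly unique values, so a point query on it is highly selective.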

Bringing it all together
• Scientific data is:
  • Sparse
  • High-dimensional
  • Compressible
  • Non-atomic (one to many)
  • A mix of cardinal, ordinal, spatial, and binary data
• Query models:
  • Spatial
  • Range and point
  • Keyword

Comparing indexes
Feature              Column stores  Row stores    Spatial trees  Inverted indexes  HDF5  FastBit
High dimensional     Yes            Yes           No             Yes               Yes   Yes
Sparse               Yes            Stores nulls  No             Yes               Yes   Stores nulls
Multiple values      Yes            Yes           No             List, not range   Yes   Yes
Non-numeric data     Yes            Yes           No             Yes               Yes   No
Range queries        Yes            Yes           Yes            No                Yes   Yes
Specialized indexes  Yes            Yes           No             No                No    No
High compression     Yes            No            No             Yes               No    Yes

Conclusions
• Currently popular approaches to file system indexing (spatial trees, RDBMS) are a poor match for scientific data
• Current approaches to scientific indexing are not a complete solution
• Column stores are a natural fit for scientific metadata and queries
• Specialized indexes based on inverted indexes, bitmaps, and spatial trees are appropriate for some data
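
To make the column-store conclusion concrete, here is a minimal sketch of a sparse, multi-valued column layout. It is an illustration under stated assumptions, not a design from the talk: the class `SparseColumnStore`, its methods, and the field names are hypothetical, and the range query scans one column linearly where a real system would add a sorted or bitmap index.

```python
from collections import defaultdict
from typing import Any, Dict, List, Set


class SparseColumnStore:
    """Toy column store: one dict per field, mapping row id to its list of values.

    Nulls are simply absent, and a row may hold several values for one field.
    """

    def __init__(self) -> None:
        self.columns: Dict[str, Dict[int, List[Any]]] = defaultdict(dict)

    def add(self, row_id: int, field: str, value: Any) -> None:
        self.columns[field].setdefault(row_id, []).append(value)

    def point_query(self, field: str, value: Any) -> Set[int]:
        column = self.columns.get(field, {})
        return {rid for rid, values in column.items() if value in values}

    def range_query(self, field: str, low: float, high: float) -> Set[int]:
        # Only the queried column is touched; other fields are never read.
        column = self.columns.get(field, {})
        return {
            rid for rid, values in column.items()
            if any(low <= v <= high for v in values)
        }


store = SparseColumnStore()
store.add(1, "site", "Vesuvius")
store.add(1, "pressure_kpa", 620.0)
store.add(1, "author", "A")
store.add(1, "author", "B")          # multiple values for one field
store.add(2, "site", "Vesuvius")
store.add(2, "pressure_kpa", 480.0)
store.add(3, "site", "Etna")         # row 3 has no pressure value at all

# Rows for Vesuvius where the pressure is higher than 500 kPa.
hits = store.point_query("site", "Vesuvius") & store.range_query(
    "pressure_kpa", 500.0, float("inf")
)
```

Because each query reads only the columns it names and missing values take no space, the layout copes with both the high dimensionality and the sparsity described earlier.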

Data types (raw and semantic)
          Dryad  WISE  Argo  ORNL  Total
String    100%   4%    62%   29%   28%
Numeric   0%     96%   38%   71%   72%
Str/Num   96%    68%   77%   72%   73%
Date      2%     4%    7%    7%    5%
Spatial   2%     9%    2%    21%   7%
Flagsets  0%     19%   14%   0%    15%

• Support for spatial search is useful
• Application hinting is needed for good search (is this a string, a location, or a flag set?)
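
The need for application hinting can be illustrated with a small sketch: a value's raw type is easy to guess from its text, but its semantic type is not. The function names, the hint table, and the field names below are hypothetical, and the date format is assumed to be ISO `YYYY-MM-DD`.

```python
from datetime import datetime
from typing import Dict


def guess_raw_type(value: str) -> str:
    """Guess a raw type from the text alone: numeric, date, or string."""
    try:
        float(value)
        return "numeric"
    except ValueError:
        pass
    try:
        datetime.strptime(value, "%Y-%m-%d")  # assumed date format
        return "date"
    except ValueError:
        return "string"


# Hypothetical hints supplied by the application, keyed by field name.
SEMANTIC_HINTS: Dict[str, str] = {
    "latitude": "spatial",
    "qc_flags": "flagset",
}


def semantic_type(field: str, value: str) -> str:
    """Prefer the application's hint; fall back to the raw guess."""
    return SEMANTIC_HINTS.get(field, guess_raw_type(value))


raw = guess_raw_type("61.2")
hinted = semantic_type("latitude", "61.2")
unhinted = semantic_type("pressure_kpa", "61.2")
```

The same text, "61.2", is numeric by inspection but spatial once the application says the field is a latitude, which is the distinction a spatial index needs.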

How can we support this?
• Search functionality which:
  • Supports these kinds of queries
  • Does not double the size of storage
  • Does not require a linear scan over petabytes of data
• The answers to queries are documents
• We rarely need an entire row
• Complex transactions and joins are less important
