Analyzing Extended and Scientific Metadata for Scalable Index Designs
1. Aleatha Parker-Wood*^, Brian A. Madden*, Michael McThrow*, Darrell D.E. Long*, Ian F. Adams*, Avani Wildani*
*University of California Santa Cruz
^Conservatoire National des Arts et Métiers
Examining Extended and Scientific Metadata for Scalable Index Designs
2. What we call metadata
• Data for the system
• External to the file
• Small
• Dense
Abraham Silberschatz, Greg Gagne, and Peter Baer Galvin, "Operating System Concepts, Eighth Edition"
3. What everyone else calls metadata
• Data for the user
• Embedded in:
• the file
• the inode
• a separate file
• a notebook somewhere on their desk
• Wildly varying size
• Sparse
[Figure: embedded metadata, metadata files, inode metadata, and metadata outside the system]
4. A scientist at work
• “Show me the data set about bears in Alaska from last fall”
• “Show me simulation results from last week for Vesuvius which used this code library, and where the pressure is higher than 500 kilopascals”
• A mix of system and scientific metadata
5. Our options
• Relational databases
• Column stores
• Spatial trees (e.g., Spyglass, SmartStore)
• Inverted indexes
• Bitmap indexes (e.g., FastBit)
• The choice of index depends on the data, but what does the data look like?
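As a rough illustration of one of the options above, here is a minimal sketch of a bitmap index in Python. It is not FastBit's implementation (production bitmap indexes compress their bitmaps, and this sketch does not); the column names and records are hypothetical, chosen to echo the "bears in Alaska" query.

```python
from collections import defaultdict

def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when values[i] == value.

    None (a null) sets no bit in any bitmap.
    """
    index = defaultdict(int)
    for row, value in enumerate(values):
        if value is not None:
            index[value] |= 1 << row
    return dict(index)

def rows_of(bitmap):
    """Expand a bitmap into a sorted list of row numbers."""
    rows, row = [], 0
    while bitmap:
        if bitmap & 1:
            rows.append(row)
        bitmap >>= 1
        row += 1
    return rows

# Hypothetical records; one column per list, one row per position.
species = ["bear", "wolf", "bear", None, "bear"]
region = ["Alaska", "Alaska", "Yukon", "Alaska", "Alaska"]

species_index = build_bitmap_index(species)
region_index = build_bitmap_index(region)

# A conjunctive query is a bitwise AND of the two bitmaps.
matches = rows_of(species_index["bear"] & region_index["Alaska"])
```

Each distinct value costs one bitmap, so this layout suits low-cardinality columns; whether it suits a given data set depends on the properties examined in the rest of the talk.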
6. Outline
• The data in brief
• Dimensionality
• Sparsity
• Atomicity
• Entropy
7. The metadata in brief
Dataset  Discipline    Native format  Record count  Subsampled?  Sample count  Total size
Dryad    Biology       XML            31K           No           31K           400 MB
WISE     Astronomy     CSV            564M          Yes          10K           1 TB
ARGO     Oceanography  NetCDF         2B            Yes          635K          330 GB
ORNL     Climatology   CSV            1478          No           1478          154 KB
8. Dimensionality
Dataset     Dryad  WISE  Argo  ORNL  Total
Dimensions  44     285   108   14    451
• Much higher dimensional than POSIX data
• Curse of dimensionality concerns
9. Sparsity
Sparse even within a discipline (extremely sparse across all disciplines)
• CDF of sparsity
• For a randomly chosen element from X% of columns, there is a Y% chance it will be null
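One way to produce a sparsity CDF of this kind is sketched below in Python. How the plot on the slide was actually built is not stated, so the reading used here (sort columns by null fraction, then pair each cumulative share of columns with the largest null fraction in that share) is an assumption, and the records and column names are hypothetical.

```python
def null_fractions(records, columns):
    """Fraction of records in which each column is missing or None."""
    total = len(records)
    return {
        col: sum(1 for rec in records if rec.get(col) is None) / total
        for col in columns
    }

def sparsity_cdf(fractions):
    """Return (share_of_columns, null_fraction) points, sorted by null fraction.

    A point (x, y) reads: the least sparse share x of the columns
    each have a null fraction of at most y.
    """
    ordered = sorted(fractions.values())
    n = len(ordered)
    return [((i + 1) / n, frac) for i, frac in enumerate(ordered)]

# Hypothetical records; an absent key and an explicit None both count as null.
records = [
    {"title": "a", "species": "bear", "depth": None},
    {"title": "b", "species": None},
    {"title": "c"},
    {"title": "d", "species": "wolf", "depth": 12.5},
]
columns = ["title", "species", "depth"]

fractions = null_fractions(records, columns)
cdf = sparsity_cdf(fractions)
```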
10. Atomicity (Dryad)
• How many times can a field be present for a single item?
• E.g., a single paper can have multiple authors
• Truncated to show detail. One study had 800 species!
Some disciplines have many field values per item.
Others have range values (e.g., May-June 2010)
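The question of how many times a field appears per item can be answered with a simple histogram. The Python sketch below assumes each item is a dictionary whose multi-valued fields are lists; that representation, and the sample items, are hypothetical rather than the Dryad format.

```python
from collections import Counter

def values_per_item(items, field):
    """Histogram: number of values for `field` -> number of items with that many."""
    histogram = Counter()
    for item in items:
        value = item.get(field, [])
        # A bare scalar counts as a single-valued field; a missing field counts as 0.
        count = len(value) if isinstance(value, list) else 1
        histogram[count] += 1
    return dict(histogram)

# Hypothetical items with one-to-many fields.
items = [
    {"author": ["A", "B"], "species": ["Ursus arctos"]},
    {"author": ["C"], "species": ["Canis lupus", "Ursus arctos", "Alces alces"]},
    {"author": "D"},
]

author_hist = values_per_item(items, "author")
species_hist = values_per_item(items, "species")
```

A long tail in such a histogram (for example, a study listing hundreds of species) is what makes a fixed one-value-per-column schema a poor fit.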
11. Entropy
• Row organization versus column
• How compressible is the data?
• How selective are queries?
• Plenty of compression available
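As a minimal sketch of the measurement behind these questions, Shannon entropy can be computed per column in Python. The slides do not say exactly how entropy was measured, so treating each column as a sequence of independent symbols is an assumption, and the sample columns are hypothetical.

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy in bits per value; None is counted as its own symbol."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical columns of four rows each.
flags = ["ok", "ok", "ok", "ok"]            # one repeated value
quality = ["good", "good", "bad", "bad"]    # two equally likely values
station = ["a", "b", "c", "d"]              # every value distinct

flags_bits = shannon_entropy(flags)
quality_bits = shannon_entropy(quality)
station_bits = shannon_entropy(station)
```

A column with low entropy compresses well when stored contiguously, but a point query on its common values selects many rows; a high-entropy column behaves the opposite way.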
12. Bringing it all together
• Scientific data is:
• Sparse
• High-dimensional
• Compressible
• Non-atomic (one to many)
• A mix of cardinal, ordinal, spatial, and binary data
• Query models:
• Spatial
• Range and point
• Keyword
13. Comparing indexes
Property             Column stores  Row stores    Spatial trees  Inverted indexes  HDF5  FastBit
High dimensional     Yes            Yes           No             Yes               Yes   Yes
Sparse               Yes            Stores nulls  No             Yes               Yes   Stores nulls
Multiple values      Yes            Yes           No             List, not range   Yes   Yes
Non-numeric data     Yes            Yes           No             Yes               Yes   No
Range queries        Yes            Yes           Yes            No                Yes   Yes
Specialized indexes  Yes            Yes           No             No                No    No
High compression     Yes            No            No             Yes               No    Yes
14. Conclusions
• Currently popular approaches to file system indexing (spatial trees, RDBMS) are a poor match for scientific data
• Current approaches to scientific indexing are not a complete solution
• Column stores are a natural fit for scientific metadata and queries
• Specialized indexes based on inverted indexes, bitmaps, and spatial trees are appropriate for some data
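To make the column-store argument concrete, here is a toy sketch in Python. It is an illustration under stated assumptions, not the authors' design: the class `SparseColumnStore`, its methods, and the field names (`volcano`, `pressure_kpa`, `library`) are all hypothetical. It stores nothing for nulls, allows several values per field, and answers a query by touching only the columns it names.

```python
class SparseColumnStore:
    """Toy column store: one dict per column, keyed by row id. Nulls are absent."""

    def __init__(self):
        self.columns = {}

    def insert(self, row_id, record):
        for name, value in record.items():
            if value is None:
                continue  # nulls are not stored
            # Every field is kept as a list so one-to-many values need no special case.
            values = value if isinstance(value, list) else [value]
            self.columns.setdefault(name, {})[row_id] = values

    def range_query(self, name, low, high):
        """Row ids with at least one value of column `name` in [low, high] (inclusive)."""
        column = self.columns.get(name, {})
        return {
            row_id
            for row_id, values in column.items()
            if any(low <= v <= high for v in values)
        }

    def point_query(self, name, wanted):
        """Row ids where column `name` contains the value `wanted`."""
        column = self.columns.get(name, {})
        return {row_id for row_id, values in column.items() if wanted in values}

store = SparseColumnStore()
store.insert(1, {"volcano": "Vesuvius", "pressure_kpa": 620.0, "library": ["libsim"]})
store.insert(2, {"volcano": "Vesuvius", "pressure_kpa": 480.0})
store.insert(3, {"volcano": "Etna", "pressure_kpa": 700.0, "library": ["libsim", "libmesh"]})

# Roughly the Vesuvius query from the scientist slide: intersect per-column results.
hits = (
    store.point_query("volcano", "Vesuvius")
    & store.range_query("pressure_kpa", 500.0, float("inf"))
    & store.point_query("library", "libsim")
)
```

Each query here scans one column linearly; a realistic system would presumably add a per-column index (inverted, bitmap, or spatial, depending on the data type) and compress each column.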
16. Data types (raw and semantic)
Type      Dryad  WISE  Argo  ORNL  Total
String    100%   4%    62%   29%   28%
Numeric   0%     96%   38%   71%   72%
Str/Num   96%    68%   77%   72%   73%
Date      2%     4%    7%    7%    5%
Spatial   2%     9%    2%    21%   7%
Flagsets  0%     19%   14%   0%    15%
• Support for spatial search is useful
• Application hinting is needed for good search (is this a string, a location, or a flag set?)
17. How can we support this?
• Search functionality which:
• Supports these kinds of queries
• Does not double the size of storage
• Does not require a linear scan over petabytes of data
• The answers to queries are documents
• We rarely need an entire row
• Complex transactions and joins are less important