The document discusses Unidata's Common Data Model (CDM) which aims to provide a standardized way of representing scientific datasets. It describes key components of the CDM including scientific data types, coordinate systems, and data access layers. The goals are to make datasets more useful and interoperable by defining common semantics, georeferencing, and specialized querying capabilities. The CDM defines abstract representations that are agnostic of specific file formats or programming interfaces.
2. Goals / Overview
• Look at the landscape of scientific datasets from a few thousand feet up.
• What semantics are needed to make these useful?
– georeferencing
– specialized subsetting
3. What’s a Data Model?
• An Abstract Data Model describes data objects and what methods you can use on them.
• An API is the interface to the Data Model for a specific programming language.
• A file format is a way to persist the objects in the Data Model.
• An Abstract Data Model removes the details of any particular API and the persistence format.
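The distinction can be sketched in Python. Everything named here (`Variable`, `subset`, the JSON and comma-separated encodings) is a hypothetical illustration and not part of the CDM or any Unidata library: one abstract object with a method, persisted in two interchangeable formats.

```python
import json
from dataclasses import dataclass


@dataclass
class Variable:
    """Abstract data object: a named array with named dimensions."""
    name: str
    dims: tuple
    values: list

    def subset(self, start, stop):
        """A method of the data model: slice along the first dimension."""
        return Variable(self.name, self.dims, self.values[start:stop])


# Two hypothetical persistence formats for the same abstract object.
def to_json(var):
    return json.dumps({"name": var.name, "dims": list(var.dims), "values": var.values})


def from_json(text):
    d = json.loads(text)
    return Variable(d["name"], tuple(d["dims"]), d["values"])


def to_csv_line(var):
    return ",".join([var.name, "|".join(var.dims)] + [repr(v) for v in var.values])


def from_csv_line(line):
    name, dims, *vals = line.split(",")
    return Variable(name, tuple(dims.split("|")), [float(v) for v in vals])


temp = Variable("temp", ("time",), [280.5, 281.0, 279.25])
```

Code written against `Variable` and `subset` does not need to know which of the two encodings the data came from.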
4. Common Data Model Layers
• Scientific Datatypes: Point, Trajectory, Radial, Grid, Station, Swath, Profile
• Coordinate Systems
• Data Access
7. I/O Service Provider Implementations
• General: NetCDF, HDF5, OPeNDAP
• Gridded: GRIB-1, GRIB-2
• Radar: NEXRAD level 2 and 3, DORADE
• Point: BUFR, ASCII
• Satellite: DMSP, GINI
• In development
– NOAA: GOES (Knapp/Nelson), many others
8. Coordinate Systems needed
• NetCDF, OPeNDAP, HDF data models do not have integrated coordinate systems
– so georeferencing is not part of the API
– Need conventions to specify them (e.g. CF-1, COARDS, etc.)
• Contrast GRIB, HDF-EOS, other specialized formats
10. Coordinate Variables
– One-dimensional variable with the same name as its dimension
– Strictly monotonic values
– No missing values
The coordinates of a point (i,j,k) are
{CV1(i), CV2(j), CV3(k)}
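A minimal sketch of these rules in plain Python. The function names and sample arrays are illustrative assumptions, and NaN or `None` stands in for "missing"; this is not the NetCDF-Java API.

```python
import math


def is_coordinate_variable(name, dims, values):
    """Check the three conditions listed above for a candidate variable."""
    if len(dims) != 1 or dims[0] != name:
        return False  # must be 1-D and share its dimension's name
    if any(v is None or math.isnan(v) for v in values):
        return False  # no missing values allowed
    pairs = list(zip(values, values[1:]))
    increasing = all(a < b for a, b in pairs)
    decreasing = all(a > b for a, b in pairs)
    return increasing or decreasing  # strictly monotonic either way


def coordinates(cvs, index):
    """Map an index tuple (i, j, k) to (CV1(i), CV2(j), CV3(k))."""
    return tuple(cv[i] for cv, i in zip(cvs, index))


time = [0.0, 6.0, 12.0]
lat = [50.0, 40.0, 30.0]   # decreasing values are still strictly monotonic
lon = [-110.0, -100.0]
point = coordinates((time, lat, lon), (1, 2, 0))
```

Each coordinate variable is indexed by exactly one of the data variable's indices, which is what makes the lookup a simple per-dimension array access.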
13. Coordinate Systems (abstract)
• A Coordinate System for a data variable is a set of Coordinate Variables such that the coordinates of the (i,j,k) data point are
{CV1(i,j,k),CV2(i,j,k),CV3(i,j,k),CV4(i,j,k)…}
(the previous definition gave {CV1(i), CV2(j), CV3(k)})
• The dimensions of each Coordinate Variable must be a subset of the dimensions of the data variable.
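The generalized definition can be sketched as follows. The dictionaries and function names are assumptions made for illustration: each coordinate variable may now depend on several of the data variable's dimensions, and the subset rule is what guarantees every lookup is well defined.

```python
def is_valid_coordinate_system(data_dims, coord_dims):
    """coord_dims maps each coordinate variable name to its dimension names."""
    return all(set(dims) <= set(data_dims) for dims in coord_dims.values())


def coordinate_of(index, coord_dims, coord_values):
    """Look up the coordinates of one data point.

    index        : dimension name -> integer index of the data point
    coord_values : coordinate name -> nested lists, indexed in the
                   order given by coord_dims[name]
    """
    result = {}
    for name, dims in coord_dims.items():
        value = coord_values[name]
        for d in dims:
            value = value[index[d]]  # descend one dimension at a time
        result[name] = value
    return result


data_dims = ("t", "y", "x")
coord_dims = {"time": ("t",), "lat": ("y", "x"), "lon": ("y", "x")}
coord_values = {
    "time": [0, 6],
    "lat": [[40.0, 40.1], [41.0, 41.1]],
    "lon": [[-105.0, -104.0], [-105.2, -104.2]],
}
where = coordinate_of({"t": 1, "y": 0, "x": 1}, coord_dims, coord_values)
```

Here `lat` and `lon` are two-dimensional, which the earlier one-dimensional definition could not express.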
16. Revised Coordinate Systems
1. Specify Coordinate Variables
2. Specify Coordinate Types (time, lat, lon, projection x, y, height, pressure, z, radial, azimuth, elevation)
3. Specify connectivity (implicit or explicit) between data points
– Implicit: Neighbors in index space are (connected) neighbors in coordinate space. Allows efficient searching.
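One way to read "efficient searching" for the simplest case, a strictly increasing 1-D coordinate, is a binary search instead of a full scan. The sketch below uses Python's standard `bisect` module; the function name and the sample latitudes are illustrative assumptions.

```python
from bisect import bisect_left


def nearest_index(coord, target):
    """Index of the value nearest to target in a strictly increasing 1-D coordinate."""
    pos = bisect_left(coord, target)
    if pos == 0:
        return 0                      # target lies before the first value
    if pos == len(coord):
        return len(coord) - 1         # target lies past the last value
    before, after = coord[pos - 1], coord[pos]
    return pos if after - target < target - before else pos - 1


lat = [30.0, 35.0, 40.0, 45.0]
i = nearest_index(lat, 41.0)
```

The search takes a number of steps logarithmic in the length of the coordinate, which only works because index order and coordinate order agree.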
17. Gridded Data
float gridData(t,z,y,x);
float time(t); // Time
float y(y); // GeoY
float x(x); // GeoX
float z(t,z,y,x); // Height or Pressure
• Cartesian coordinates
• All dimensions are connected
• Connected means: neighbors in index space are neighbors in coordinate space
19. Scientific Data Types
• Based on datasets Unidata is familiar with
– APIs are evolving
• How are data points connected?
• Intended to scale to large, multifile collections
• Intended to support “specialized queries”
– Space, Time
• Corresponding “standard” NetCDF file conventions
20. Gridded Data
• Cartesian coordinates
• All dimensions are connected
• x, y, z, time
• recently added runtime and ensemble
• refactored into GridDatatype interface
float gridData(t,z,y,x);
float time(t);
float y(y);
float x(x);
float lat(y,x);
float lon(y,x);
float height(t,z,y,x);
22. Radial Data
• Polar coordinates
• All dimensions are connected
• No separate time dimension
radialData(radial, gate) :
distance(gate)
azimuth(radial)
elevation(radial)
time(radial)
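To make the polar layout concrete, the sketch below converts each (radial, gate) pair to local Cartesian offsets from the instrument. It assumes azimuth is measured clockwise from north and uses a flat-earth approximation that ignores Earth curvature and beam refraction; the function names are hypothetical and the formula is a simplification, not what any particular radar library does.

```python
import math


def gate_to_xyz(distance, azimuth_deg, elevation_deg):
    """Local (east, north, up) offset of one gate, flat-earth approximation."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    ground = distance * math.cos(el)          # range projected onto the ground
    return (ground * math.sin(az), ground * math.cos(az), distance * math.sin(el))


def radial_coordinates(distance, azimuth, elevation):
    """Coordinates for every (radial, gate) index pair of radialData."""
    return [
        [gate_to_xyz(distance[g], azimuth[r], elevation[r]) for g in range(len(distance))]
        for r in range(len(azimuth))
    ]


grid = radial_coordinates([500.0, 1000.0], [0.0, 90.0], [0.5, 0.5])
```

Note that `distance` is indexed only by gate while `azimuth` and `elevation` are indexed only by radial, matching the declarations above.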
23. Swath
• lat/lon coordinates
• no separate time dimension
• all dimensions are connected
swathData(line,cell)
lat(line,cell)
lon(line,cell)
time(line)
z(line,cell) ??
24. Point Observation Data
• Set of measurements at the same point in space and time
• Point dimension not connected
float obs1(pt);
float obs2(pt);
float lat(pt);
float lon(pt);
float z(pt);
float time(pt);
Structure {
lat, lon, z, time;
v1, v2, ...
} obs( pt);
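A rough Python equivalent of the Structure form, with a space/time query. The class and function names are illustrative assumptions. Because the point dimension is not connected, neighboring indices say nothing about neighboring locations, so this sketch simply scans every record.

```python
from dataclasses import dataclass, field


@dataclass
class PointObs:
    lat: float
    lon: float
    z: float
    time: float
    values: dict = field(default_factory=dict)   # v1, v2, ...


def query(obs, lat_range, lon_range, time_range):
    """Return the observations inside a bounding box and time window."""
    def inside(v, rng):
        return rng[0] <= v <= rng[1]

    return [
        o for o in obs
        if inside(o.lat, lat_range) and inside(o.lon, lon_range) and inside(o.time, time_range)
    ]


obs = [
    PointObs(40.0, -105.0, 1600.0, 0.0, {"v1": 280.0}),
    PointObs(10.0, 20.0, 5.0, 3.0, {"v1": 300.0}),
    PointObs(41.0, -104.0, 1500.0, 12.0, {"v1": 275.0}),
]
hits = query(obs, (35.0, 45.0), (-110.0, -100.0), (0.0, 6.0))
```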
28. Trajectory Data
• pt dimension is connected
• Collection dimension not connected
Structure {
lat, lon, z, time;
v1, v2, ...
} obs(pt); // connected
Structure {
name;
Structure {
lat, lon, z, time;
v1, v2, ...
} obs(*); // connected
} traj(traj); // not connected
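The nesting can be mimicked with plain Python dictionaries and lists (the layout and names are assumptions for illustration). Consecutive points inside one trajectory form path segments, but no segment is ever drawn between two different trajectories, which is what "not connected" means for the outer dimension.

```python
def segments(trajectories):
    """Yield (name, point_a, point_b) for consecutive points of the same trajectory."""
    for traj in trajectories:
        pts = traj["obs"]
        for a, b in zip(pts, pts[1:]):   # inner dimension: connected
            yield traj["name"], a, b


trajectories = [                          # outer dimension: not connected
    {"name": "flight-A", "obs": [
        {"lat": 40.0, "lon": -105.0, "z": 1000.0, "time": 0.0},
        {"lat": 40.1, "lon": -104.9, "z": 1500.0, "time": 1.0},
        {"lat": 40.2, "lon": -104.8, "z": 2000.0, "time": 2.0},
    ]},
    {"name": "flight-B", "obs": [
        {"lat": 10.0, "lon": 20.0, "z": 500.0, "time": 0.0},
        {"lat": 10.1, "lon": 20.1, "z": 600.0, "time": 1.0},
    ]},
]
segs = list(segments(trajectories))
```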
29. Profiler/Sounding Station Data
Structure {
name;
lat, lon, time;
Structure {
z;
v1, v2, ...
} obs(*); // connected
} loc(nloc); // not connected
Structure {
name;
lat, lon;
Structure {
time;
Structure {
z;
v1, v2, ...
} obs(*); // connected
} time(*); // connected
} stn(stn); // not connected
30. Unstructured Grid
• Pt dimension not connected
• Looks the same as point data
• Need to specify the connectivity explicitly
float unstructGrid(t,z,pt);
float lat(pt);
float lon(pt);
float time(t);
float height(z);
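The slide does not say how explicit connectivity would be stored. One common choice, assumed here purely for illustration, is a list of cells (triangles) that name the point indices they join; a neighbor table can then be derived from it.

```python
def neighbors_from_cells(cells):
    """Build point -> set of connected points from a list of cells (tuples of point indices)."""
    adjacency = {}
    for cell in cells:
        for p in cell:
            adjacency.setdefault(p, set()).update(q for q in cell if q != p)
    return adjacency


# Two hypothetical triangles sharing the edge between points 1 and 2.
cells = [(0, 1, 2), (1, 2, 3)]
adjacency = neighbors_from_cells(cells)
```

Without such a table, points 0 and 3 look like any other pair of indices; with it, a search can tell that they are not directly connected.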
31. Data Types Summary
• Data access through a standard API
• Convenient georeferencing
• Specialized subsetting methods
– Efficiency for large datasets
32. Payoff
N + M instead of N * M things on your TODO List!
[Diagram: File Format #1, File Format #2, … File Format #N feed into the CDM; the CDM in turn serves Visualization & Analysis, NetCDF file, OPeNDAP Server, WCS Service, and Web Service.]
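The arithmetic behind the payoff, as a small sketch. The function name and the counts of 6 formats and 4 clients are made-up numbers for illustration only.

```python
def connectors_needed(n_formats, m_clients, common_model):
    """Count the reader/writer pieces that have to be implemented."""
    if common_model:
        # one reader per format into the common model, one adapter per client out of it
        return n_formats + m_clients
    # every client needs its own reader for every format
    return n_formats * m_clients


without_cdm = connectors_needed(6, 4, common_model=False)
with_cdm = connectors_needed(6, 4, common_model=True)
```

With these assumed counts the direct approach needs 24 pieces and the common-model approach needs 10, and the gap widens as either count grows.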
33. THREDDS Data Server
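[Diagram: an Application connects to hostname.edu, where an HTTP Tomcat Server hosts the THREDDS Server (Catalog.xml; services: OPeNDAP, HTTPServer, WCS). The server reads Datasets, including IDD Data, through the NetCDF-Java library.]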
34. Next: DataType Aggregation
• Work at the CDM DataType level, know (some) data semantics
• Forecast Model Collection
– Combine multiple model forecasts into single dataset with two time dimensions
– With NOAA/IOOS (Steve Hankin)
• Point/Station/Trajectory/Profile Data
– Allow space/time queries, return nested sequences
– Start from / standardize “Dapper conventions”
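A sketch of what "two time dimensions" could look like for a forecast model collection: one axis for the model run time and one for the forecast offset. The input layout, the function name, and the use of `None` for missing cells are assumptions for illustration, not the planned implementation.

```python
def aggregate_runs(runs):
    """Combine model runs into one table indexed by [run time][forecast offset].

    runs: run time (hours) -> {forecast offset (hours) -> value}
    Cells with no forecast are filled with None.
    """
    run_times = sorted(runs)
    offsets = sorted({o for forecasts in runs.values() for o in forecasts})
    table = [[runs[rt].get(o) for o in offsets] for rt in run_times]
    return run_times, offsets, table


runs = {
    0: {0: 1.0, 6: 2.0},
    6: {0: 2.5, 6: 3.0, 12: 4.0},
}
run_times, offsets, table = aggregate_runs(runs)
```

The valid time of any cell is the run time plus the offset, so the same valid time can appear in several runs.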
36. Conclusion
• Standardized Data Access in good shape
– HDF5, NetCDF, OPeNDAP
– Write an IOSP for proprietary formats (Java)
• But that’s not good enough!
• To do:
– Standard representations of coordinate systems
– Classifications of data types, standard services for them
Editor's notes
Diversity of formats:
Appropriate design decision for General formats
Need more dynamic system for real time and very large datasets.
Catalog is a file, but these are services, that is, code.
Show IDD Server catalog – show satellite DQC, then show radar DQC