Slides from our talk at the Docker Cambridge meetup on Oct 10, 2018. We discuss our product, Splitgraph, demo an early prototype, and talk about the challenges of applying the Docker model to data.
https://splitgraph.com, twitter.com/@splitgraph
Artjoms Iskovs: twitter.com/@mildbyte, mildbyte.xyz
Miles Richardson: milesrichardson.com
4. Why is it so hard to build and maintain data sets?
• Sourcing data is not composable
• Why can’t I query multiple data sets at once?
• Wrangling and cleaning data is not maintainable
• Why can’t I keep my data sets up to date?
• Running ad-hoc queries is not reproducible
• Why can’t I share my data sets?
5. What do we mean by data?
Sources
• Open Data
• Internal Data
• Licensed Data
Types
• SQL Databases
• NoSQL Databases
• CSV Files...
6. The journey of a dataset: Scenario
• Two publishers:
• NOAA publishes climate data
• USDA publishes corn yields
• Consumer wants to merge both data sets
• Let’s follow the climate data...
14. SGFiles: Dockerfiles for data
• Like Dockerfiles.
• Image: state of a database schema
• Layers with deterministic hashes; the cache is invalidated if:
• Previous layer changes
• Command changes
• Commands:
• FROM – base the image on something else
• IMPORT – import tables from another image
• SQL – run SQL against the image
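The caching rule on this slide can be sketched in Python. This is a minimal illustration under stated assumptions, not Splitgraph's actual implementation: the function names `layer_hashes` and `reusable_layers`, and the choice of SHA-256 over the parent hash plus the command text, are hypothetical.

```python
import hashlib


def layer_hashes(commands, base=""):
    """Return one hash per command (hypothetical scheme).

    Each hash depends on the previous layer's hash and on the command
    text, so changing either one changes every hash after it.
    """
    hashes = []
    parent = base
    for command in commands:
        payload = (parent + "\n" + command).encode("utf-8")
        digest = hashlib.sha256(payload).hexdigest()
        hashes.append(digest)
        parent = digest
    return hashes


def reusable_layers(old_commands, new_commands):
    """Count how many leading layers of a previous build can be reused."""
    count = 0
    old_hashes = layer_hashes(old_commands)
    new_hashes = layer_hashes(new_commands)
    for old, new in zip(old_hashes, new_hashes):
        if old != new:
            break  # cache is invalid from this layer onwards
        count += 1
    return count
```

Because each hash chains on its parent, editing one command invalidates that layer and every layer built on top of it, while earlier layers stay cached.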
15. Consumption: Demo
FROM usda/yields IMPORT crop_yields
FROM noaa/climate:latest IMPORT rainfall
SQL CREATE TABLE rainfall_yields AS
SELECT * FROM rainfall JOIN crop_yields ...
19. The journey of a dataset 4: Updating
• Suppose Puerto Rico becomes a US state
• NOAA wants to revise its climate data
• Can the consumer get just the changes?
20. Delta compression
• Only care about changes
• Need to efficiently:
• Create diffs (→ commit, push)
• Apply diffs (→ checkout, pull)
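The two operations above can be sketched in Python for a single table. This is only an illustration of the idea: it assumes a table snapshot is a dict mapping a primary key to a row tuple, and the names `create_diff` and `apply_diff` are hypothetical, not part of any Splitgraph API.

```python
def create_diff(old, new):
    """Compare two snapshots of a table (dict: primary key -> row tuple)."""
    return {
        "added": {k: new[k] for k in new if k not in old},
        "removed": [k for k in old if k not in new],
        "updated": {k: new[k] for k in new if k in old and old[k] != new[k]},
    }


def apply_diff(old, diff):
    """Rebuild the new snapshot from the old one plus a diff."""
    result = dict(old)  # copy, so the old snapshot stays intact
    for key in diff["removed"]:
        del result[key]
    result.update(diff["added"])
    result.update(diff["updated"])
    return result
```

Under these assumptions, a commit or push only has to store or send the diff, and a checkout or pull replays it on top of the version the consumer already has.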
30. The journey of a dataset 5: Maintenance
• Can we update it?
• Where did this dataset come from?
• Build context fully encapsulated within the metadata
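One way to picture build context stored in metadata is the sketch below. It assumes, hypothetically, that each image keeps the list of commands that produced it; the `ImageMetadata` record and the helpers `sources` and `rebuild_script` are invented for illustration and do not describe Splitgraph's real metadata format.

```python
from dataclasses import dataclass, field


@dataclass
class ImageMetadata:
    """Hypothetical metadata record stored alongside an image."""
    image_hash: str
    # The command that produced each layer, in build order.
    commands: list = field(default_factory=list)


def sources(meta):
    """List the images this one was built from, based on its FROM commands."""
    found = []
    for command in meta.commands:
        parts = command.split()
        if len(parts) >= 2 and parts[0] == "FROM":
            found.append(parts[1])
    return found


def rebuild_script(meta):
    """Reconstruct the build file text from the recorded commands."""
    return "\n".join(meta.commands)
```

With such a record, "where did this dataset come from?" is answered by `sources`, and "can we update it?" by re-running the text that `rebuild_script` returns against newer upstream images.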