Docker for Data: Splitgraph Provides Composable, Versioned Datasets


Slides from our talk at the Oct 10, 2018 Docker Cambridge meetup, where we discussed our product, Splitgraph, demoed an early prototype and talked about the challenges of applying the Docker model to data.
https://splitgraph.com, twitter.com/@splitgraph
Artjoms Iskovs: twitter.com/@mildbyte, mildbyte.xyz
Miles Richardson: milesrichardson.com

Splitgraph
"Docker for Data"
Artjoms Iškovs, Miles Richardson

"B.D." Building Packages Before Docker
The Dark Ages
• Sourcing packages
• Rebuilding, reconfiguring, rebuilding...
• Googling, rage inducing

Data preparation accounts for about 80% of the work of data scientists.

Why so hard to build and maintain data sets?
• Sourcing data is not composable
  • Why can’t I query multiple data sets at once?
• Wrangling and cleaning data is not maintainable
  • Why can’t I keep my data sets up to date?
• Running ad-hoc queries is not reproducible
  • Why can’t I share my data sets?

What do we mean by data?
Sources
• Open Data
• Internal Data
• Licensed Data
Types
• SQL Databases
• NoSQL Databases
• CSV Files...

The journey of a dataset: Scenario
• Two publishers:
  • NOAA publishes climate data
  • USDA publishes corn yields
• Consumer wants to merge both data sets
• Let’s follow the climate data...

The journey of a dataset 1: Creation
Ingesting data from another DB via CLI
$ sgr mount -t mongo_fdw me:pwd@my_db:27017 '{
    "rainfall": {
      "db": "observations",
      "coll": "rainfall",
      "schema": {
        "timestamp": "timestamp",
        "state": "varchar",
        "rainfall": "numeric"
      }
    }
  }' staging
$ sgr import staging 'SELECT timestamp, state, rainfall FROM rainfall' noaa/climate rainfall
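
The JSON options passed to the mount command are easy to break by hand (a single missing quote makes the whole argument invalid). A minimal sketch, assuming the argument order shown on the slide for the early prototype (it may differ in later versions of sgr), builds the same command line from a Python dictionary so that the JSON and the shell quoting are generated rather than typed. The helper name mount_command is hypothetical and not part of Splitgraph.

```python
import json
import shlex


def mount_command(handler, connection, options, mountpoint):
    """Assemble the mount command line in the order used on the slide.

    Hypothetical helper: it only builds a string and does not run sgr.
    """
    options_json = json.dumps(options)  # always well-formed JSON
    parts = ["sgr", "mount", "-t", handler, connection, options_json, mountpoint]
    return " ".join(shlex.quote(part) for part in parts)


# Same table definition as on the slide.
options = {
    "rainfall": {
        "db": "observations",
        "coll": "rainfall",
        "schema": {
            "timestamp": "timestamp",
            "state": "varchar",
            "rainfall": "numeric",
        },
    }
}

command = mount_command("mongo_fdw", "me:pwd@my_db:27017", options, "staging")
print(command)
```

Splitting the printed command with shlex.split and parsing the sixth argument with json.loads gives back the original dictionary, which is a quick way to check the quoting before running anything.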

The journey of a dataset 2: Publication
Committing and Publishing Data via CLI
$ sgr publish noaa/climate data.splitgraph.com

The journey of a dataset 3: Usage
SGFiles: Dockerfiles for data
• Like Dockerfiles.
• Image: state of a database schema
• Layers w/ deterministic hashes and cache invalidation if:
  • Previous layer changes
  • Command changes
• Commands:
  • FROM – base the image on something else
  • IMPORT – import tables from another image
  • SQL – run SQL against the image
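
The caching rule above (a layer is rebuilt if the previous layer or the command changes) can be sketched by deriving each layer's hash from its parent's hash plus the command text. This is an illustration of the general idea, not Splitgraph's actual hashing scheme: the function names, the use of SHA-256, the third command and the v2 tag are all assumptions. A real implementation would presumably also resolve a tag such as latest to a concrete image hash before hashing.

```python
import hashlib


def layer_hash(parent_hash, command):
    """Deterministic hash: same parent and same command give the same layer."""
    payload = f"{parent_hash}\n{command}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def build(commands, cache):
    """Return (command, cache_hit) pairs; 'executing' a command is simulated."""
    results = []
    parent = ""  # assumption: empty string stands for "no base layer"
    for command in commands:
        digest = layer_hash(parent, command)
        hit = digest in cache
        if not hit:
            cache[digest] = command  # stand-in for running the command
        results.append((command, hit))
        parent = digest
    return results


cache = {}
sgfile = [
    "FROM usda/yields IMPORT crop_yields",
    "FROM noaa/climate:latest IMPORT rainfall",
    "SQL CREATE TABLE summary AS SELECT 1",  # hypothetical command
]

first_build = build(sgfile, cache)   # nothing cached yet
second_build = build(sgfile, cache)  # every layer is reused

# Changing the second command invalidates it and everything after it.
edited = [sgfile[0], "FROM noaa/climate:v2 IMPORT rainfall", sgfile[2]]
third_build = build(edited, cache)

for command, hit in third_build:
    print("cached " if hit else "rebuilt", command)
```

The third command is unchanged in the edited file but is still rebuilt, because its parent hash changed.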

Consumption: Demo
FROM usda/yields IMPORT crop_yields
FROM noaa/climate:latest IMPORT rainfall
SQL CREATE TABLE rainfall_yields AS
SELECT * FROM rainfall JOIN crop_yields ...

The journey of a dataset 4: Updating
• Puerto Rico is now a US state
• NOAA wants to revise its climate data
• Can the consumer get just the changes?

Delta compression
• Only care about changes
• Need to efficiently:
  • Create diffs (→ commit, push)
  • Apply diffs (→ checkout, pull)
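
The two operations above can be sketched on tables held in memory as dictionaries keyed by primary key. This is only an illustration of row-level diffing under simplifying assumptions (whole-row comparison, one primary-key column); the function names and the rainfall values are made up for the example and do not come from the talk.

```python
def create_diff(old, new):
    """Compare two versions of a table (primary key -> row) row by row."""
    changes = []
    for pk in old.keys() - new.keys():
        changes.append(("delete", pk, None))
    for pk in new.keys() - old.keys():
        changes.append(("insert", pk, new[pk]))
    for pk in old.keys() & new.keys():
        if old[pk] != new[pk]:
            changes.append(("update", pk, new[pk]))
    # Sort so that the diff is deterministic regardless of set ordering.
    return sorted(changes, key=lambda change: (change[0], str(change[1])))


def apply_diff(table, diff):
    """Return a new table with the changes applied; the input is not modified."""
    result = dict(table)
    for action, pk, row in diff:
        if action == "delete":
            result.pop(pk)
        else:  # insert and update both store the new row
            result[pk] = row
    return result


# Made-up sample data: state -> (state, rainfall)
v1 = {"IA": ("IA", 34.0), "NE": ("NE", 23.5)}
v2 = {"IA": ("IA", 34.0), "NE": ("NE", 24.0), "PR": ("PR", 56.3)}

diff = create_diff(v1, v2)
print(diff)
print(apply_diff(v1, diff) == v2)
```

Only the changed and added rows appear in the diff, so a consumer holding v1 needs to fetch two rows rather than the whole table.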

Delta compression
Docker
• Files
• Custom FS
Git
• Lines
• diff
Splitgraph
• Rows
• Audit triggers
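
Audit triggers record each row change as it happens, so a diff does not have to be computed by comparing two full copies of a table. The sketch below shows the general technique with SQLite from the Python standard library as a stand-in database; the table layout, trigger names and rainfall values are assumptions made for the example and are not Splitgraph's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE rainfall (state TEXT PRIMARY KEY, rainfall REAL);

-- Every change to rainfall is appended here by the triggers below.
CREATE TABLE audit_log (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    action TEXT,
    state TEXT,
    rainfall REAL
);

CREATE TRIGGER rainfall_ins AFTER INSERT ON rainfall BEGIN
    INSERT INTO audit_log (action, state, rainfall)
    VALUES ('insert', NEW.state, NEW.rainfall);
END;

CREATE TRIGGER rainfall_upd AFTER UPDATE ON rainfall BEGIN
    INSERT INTO audit_log (action, state, rainfall)
    VALUES ('update', NEW.state, NEW.rainfall);
END;

CREATE TRIGGER rainfall_del AFTER DELETE ON rainfall BEGIN
    INSERT INTO audit_log (action, state, rainfall)
    VALUES ('delete', OLD.state, NULL);
END;
""")

# Load the initial version, then empty the log to mark a "commit" point.
conn.executemany("INSERT INTO rainfall VALUES (?, ?)", [("IA", 34.0), ("NE", 23.5)])
conn.execute("DELETE FROM audit_log")

# Changes made after the commit point are captured row by row.
conn.execute("UPDATE rainfall SET rainfall = 24.0 WHERE state = 'NE'")
conn.execute("INSERT INTO rainfall VALUES ('PR', 56.3)")

pending = conn.execute(
    "SELECT action, state, rainfall FROM audit_log ORDER BY id"
).fetchall()
print(pending)
```

The contents of audit_log since the last commit point are the diff, at the cost of some extra work on every write.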

The journey of a dataset 5: Maintenance
• Can we update it?
• Where did this dataset come from?
• Build context fully encapsulated within the metadata
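
One way to read the last point: if each image stores the commands that built it and the exact source images it imported from, then both questions above can be answered from metadata alone. The sketch below is a guess at what such metadata could look like; the class, field names, repository name consumer/rainfall_yields and the short hashes are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ImageMetadata:
    repository: str
    commands: List[str]
    # Source repository -> hash of the image that was actually imported.
    sources: Dict[str, str] = field(default_factory=dict)


def provenance(image: ImageMetadata) -> List[Tuple[str, str]]:
    """Where did this dataset come from?"""
    return sorted(image.sources.items())


def stale_sources(image: ImageMetadata, latest: Dict[str, str]) -> List[str]:
    """Can we update it? List the sources whose latest image has moved on."""
    return sorted(
        repo
        for repo, used in image.sources.items()
        if latest.get(repo, used) != used
    )


image = ImageMetadata(
    repository="consumer/rainfall_yields",
    commands=[
        "FROM usda/yields IMPORT crop_yields",
        "FROM noaa/climate:latest IMPORT rainfall",
    ],
    sources={"usda/yields": "a1b2", "noaa/climate": "c3d4"},
)

# Hypothetical state of the upstream repositories after NOAA's revision.
latest = {"usda/yields": "a1b2", "noaa/climate": "e5f6"}

print(provenance(image))
print(stale_sources(image, latest))
```

Because the commands are stored alongside the sources, an image whose sources are stale could in principle be rebuilt by replaying the same commands against the newer source images.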

Q&A
twitter.com/splitgraph · splitgraph.com
