Keep: Open Source Content-Addressed Storage
How We Turned a Big Hot Mess of Perl Into a Sweet Go Ride
Overview 
โ— The problem: large-scale data management is 
hard. 
โ— The solution: content-addressed storage and 
federation. 
โ— History of Warehouse and Keep 
โ— Keep: motivations and design goals 
โ— What we learned from Go
The problem: 
Data management is hard.
The problem 
Managing data for scientific research is hard. 
It's too easy to lose data: 
rm result<tab> ---OH WAIT CRAP NO 
./generate_results.py -o results1.csv ---OH WAIT CRAP NO 
Or lose track of how we got it. 
$ ls -l results/ 
-rw-r----- 1 twp twp 859458786 Sep 2 15:39 results1.csv 
-rw-r----- 1 twp twp 758489475 Sep 3 15:51 results2.csv 
-rw-r----- 1 twp twp 958747348 Sep 4 11:46 results3.csv 
-rw-r----- 1 twp twp 795984373 Sep 5 17:12 results4.csv 
-rw-r----- 1 twp twp 833857373 Sep 6 9:38 results5.csv 
-rw-r----- 1 twp twp 894847636 Sep 7 12:46 results6.csv 
-rw-r----- 1 twp twp 847476854 Oct 2 12:17 results_umm.csv 
-rw-r----- 1 twp twp 766845784 Sep 12 19:08 results_wednesday_i_think.csv 
-rw-r----- 1 twp twp 932875738 Sep 8 18:32 results_whatever.csv
The problem 
Why is federation important? 
Because the alternative is snail-mailing hard drives 
of data all over the world.
The solution: Keep
Keep: open source content-addressed storage.
Design goals:
● Gracefully handle data sets measured in the terabytes and petabytes.
● Multi-tenant architecture
● Lightweight permissions system
● Minimize external dependencies
● Data federation
What is Content-Addressed Storage? 
A very large key/value store, in which the address of a data object is the hash of its 
content. Example: 
$ head -c 10000000 /dev/urandom > /tmp/stuff 
$ md5sum /tmp/stuff 
c54a33209b03905476bf971e722c683d /tmp/stuff 
This file can only be stored under the address c54a33209b03905476bf971e722c683d. 
PUT /c54a33209b03905476bf971e722c683d 
-> HTTP/1.1 200 OK 
c54a33209b03905476bf971e722c683d+10000000 
Attempting to store it under a different name yields an HTTP 4xx error. 
PUT /ffffffffffffffffffffffffffffffff 
-> HTTP/1.1 422 Hash mismatch in request 
With CAS, if you store the same data blob twice, you always get the same key back. 
Best known example: Git's object store, where each blob, tree, and commit is addressed by the hash of its content.
Why is CAS useful? 
Permanent storage. 
โ— Blobs cannot be overwritten. 
โ— Even on purpose! 
Determine quickly whether a large data blob is present in the store. 
These characteristics make content-addressed storage extremely 
well suited to any system which demands accountability on very 
large data sets, such as: 
โ— Scientific computing 
โ— Accounting data management 
โ— Photo retouching
Existing alternatives 
CAS systems typically have names like EMC, NetApp, IBM.
Cost tends to be on the order of $2,000-$10,000 per terabyte.
Not open source, therefore poopy.
A few open source alternatives exist, notably Camlistore, but some features are missing or immature:
● Multi-tenant support
● Blob permissions
Keep architecture 
Data is written in blocks up to 64MB. 
Smart client, dumb server. Client is responsible for: 
โ— Structured data (directories, folders, names, etc) 
โ— Replication 
Default implementation on top of POSIX filesystem. Cheap, easy to deploy. 
-rw------- 1 keep keep 14551778 Oct 10 17:06 /keep/87b/87b0f2f2eb0c1f90c6da46309a799cc0 
-rw------- 1 keep keep 8154404 Oct 10 17:06 /keep/cc5/cc57ebe000aed447e1e481569e1a8abd 
-rw------- 1 keep keep 7239086 Oct 10 17:06 /keep/ae8/ae8a8b29d9fb6325ee93d951cdae896f 
-rw------- 1 keep keep 14455989 Oct 10 17:06 /keep/e92/e928a4d4b5c3ea903914d178bbfdb035 
Keep Volumes can be implemented on top of any backend service: 
โ— RAID 
โ— Amazon S3 
โ— Google Cloud Storage
Keep permissions 
Goals: 
โ— Require permission to read blocks 
โ— No hard dependencies on external authentication services 
Solution: permission hints. 
Block requests are accompanied by a timestamped signature, e.g.: 
87b0f2f2eb0c1f90c6da46309a799cc0+14551778 + Abcf33732294c3e1fe16e39cea3114c9461274645 @ 5438550a
\-------- block locator string ---------/   \----------- SHA-1 signature -----------/   timestamp
Signatures are derived from the block hash, the user's OAuth token, the expiration timestamp, and a server-side signing secret.
Permission hints can be generated by the authentication server.
The Keep server can verify a valid permission signature instantly, without even having to contact any other service.
From Perl to Go 
The original Keep implementation, "Warehouse", was written in Perl for the Harvard Personal Genome Project.
Many drawbacks:
● Perl
● Eats ALL the memory
● Not multithreaded (see also: Perl)
● Slooooooooooooow. Slow slow slow.
● Slow.
● Did I mention Perl?
What did Go give us? 
So we rewrote Warehouse in Go. 
โ— 3 weeks to working prototype (supports GET and 
PUT) 
โ— 6 weeks to production (including permissions) 
Advantages: 
โ— Easier dependency management 
โ— Clean concurrency architecture 
โ— Rapid develop/build/test/deploy cycle
Dependency management: Perl 
Welcome to dependency hell. 
Package: libwarehouse-perl 
Architecture: all 
Depends: ${perl:Depends}, ${misc:Depends}, libdbi-perl, libwww-perl, 
libio-stringy-perl, libtimedate-perl, libgnupg-interface-perl, 
libunix-syslog-perl, libbsd-resource-perl, libio-compress-zlib-perl, 
libdigest-sha-perl, bioperl, gcc, g++, libstdc++6, bison, perlmagick, 
imagemagick, gnuplot, bzip2, libbz2-dev, libfftw3-3, libfftw3-dev, 
libxml-simple-perl, ghostscript, xsltproc, libyaml-perl, 
libjson-perl, realpath, psmisc 
Recommends: libhttp-ghttp-perl, libwhisker2-perl, libfuse-perl 
Description: Warehouse -- Client and Server library for the storage warehouse. 
Warehouse -- Client and Server library for the Free Factory storage warehouse. 
Deployment dependencies include MogileFS, MySQL, PGP, others. 
The stuff of nightmares.
Dependency Management: Go 
import ( 
"bufio" 
"bytes" 
"container/list" 
"crypto/md5" 
"encoding/json" 
"fmt" 
"github.com/gorilla/mux" 
"io" 
"log" 
"net/http" 
"os" 
"regexp" 
"runtime" 
"strconv" 
"strings" 
"syscall" 
"time" 
) 
Exactly one third-party dependency: github.com/gorilla/mux (HTTP request routing).
Concurrency: Perl
Concurrency: Keep 
Go implements concurrency in "goroutines": small tasks that run independently of each other and may be run on other threads.
The Go runtime is responsible for managing threads and scheduling tasks.
Writing concurrent code is as simple as:
func master() {
	go startSlave()
	// The slave may continue to run after master() returns.
}
func startSlave() {
	// Go has no "while" keyword; a bare "for" with a condition plays that role.
	for workToDo() {
		// ...
	}
}
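One caveat with the fire-and-forget pattern above: a Go program exits when `main` returns, even if goroutines are still running. When the caller needs the results, a `sync.WaitGroup` is a common way to wait. A minimal sketch, with `processAll` as a hypothetical helper that squares each input in its own goroutine:

```go
package main

import (
	"fmt"
	"sync"
)

// processAll starts one goroutine per job and blocks until all have finished.
// Each goroutine writes to its own slice element, so no locking is needed.
func processAll(jobs []int) []int {
	results := make([]int, len(jobs))
	var wg sync.WaitGroup
	for i, j := range jobs {
		wg.Add(1)
		go func(i, j int) {
			defer wg.Done()
			results[i] = j * j
		}(i, j)
	}
	wg.Wait()
	return results
}

func main() {
	fmt.Println(processAll([]int{1, 2, 3})) // prints [1 4 9]
}
```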
Concurrency: Keep 
The standard Go libraries provide a rich set of tools for writing concurrent applications.
This is a complete implementation of a working multithreaded HTTP server in Go.
The HTTP library's http.Server type handles each request in its own goroutine.
package main
import (
	"log"
	"net/http"
)
func main() {
	// Route /hello to helloHandler, then serve on port 8080 until a fatal error.
	http.HandleFunc("/hello", helloHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
func helloHandler(w http.ResponseWriter, r *http.Request) {
	w.Write([]byte("Hello!\n"))
}
Rapid development cycles 
Very fast build cycles: 
hitchcock:/home/twp/arvados/services/keepstore% wc handlers.go keepstore.go perms.go volume.go work_queue.go 
... 
1515 6183 45084 total 
hitchcock:/home/twp/arvados/services/keepstore% time go build 
go build 1.01s user 0.15s system 99% cpu 1.154 total 
Testing: 
hitchcock:/home/twp/arvados/services/keepstore% wc *_test.go 
... 
1830 6233 50798 total 
hitchcock:/home/twp/arvados/services/keepstore% go test 
... 
PASS 
ok _/home/twp/arvados/services/keepstore 1.018s
Rapid development cycles 
PASS 
ok _/home/twp/arvados/services/keepstore 1.018s 
Why does the test take that long? 
Oh right. 
// Sleep for 1s, then put the block again. The volume 
// should report a more recent mtime. 
// 
// TODO(twp): this would be better handled with a mock Time object. 
// Alternatively, set the mtime manually to some moment in the past 
// (maybe a v.SetMtime method?) 
// 
time.Sleep(time.Second)
Lessons from Go 
You don't have to choose between performance and rapid prototyping.
Extremely fast design and test cycle.
Where Perl celebrates "laziness, impatience, and hubris," Go makes it easy to do the right thing.
Go makes laziness awkward.
Where do we go from here? 
Keep is open source (AGPLv3) 
Keep source code is at github.com/curoverse/arvados/tree/master/services/keepstore
The full source code repository: github.com/curoverse/arvados
We are eager for contributors and new ideas for how to use Keep!
Thank you! 
Tim Pierce 
Senior Software Engineer, Curoverse 
twp@curoverse.com 
http://curoverse.com/ 
@qwrrty

More Related Content

Recently uploaded

Call for Papers - International Journal of Intelligent Systems and Applicatio...
Call for Papers - International Journal of Intelligent Systems and Applicatio...Call for Papers - International Journal of Intelligent Systems and Applicatio...
Call for Papers - International Journal of Intelligent Systems and Applicatio...Christo Ananth
ย 
Call Now โ‰ฝ 9953056974 โ‰ผ๐Ÿ” Call Girls In New Ashok Nagar โ‰ผ๐Ÿ” Delhi door step de...
Call Now โ‰ฝ 9953056974 โ‰ผ๐Ÿ” Call Girls In New Ashok Nagar  โ‰ผ๐Ÿ” Delhi door step de...Call Now โ‰ฝ 9953056974 โ‰ผ๐Ÿ” Call Girls In New Ashok Nagar  โ‰ผ๐Ÿ” Delhi door step de...
Call Now โ‰ฝ 9953056974 โ‰ผ๐Ÿ” Call Girls In New Ashok Nagar โ‰ผ๐Ÿ” Delhi door step de...9953056974 Low Rate Call Girls In Saket, Delhi NCR
ย 
NFPA 5000 2024 standard .
NFPA 5000 2024 standard                                  .NFPA 5000 2024 standard                                  .
NFPA 5000 2024 standard .DerechoLaboralIndivi
ย 
chapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineeringchapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineeringmulugeta48
ย 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VDineshKumar4165
ย 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . pptDineshKumar4165
ย 
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELL
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELLPVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELL
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELLManishPatel169454
ย 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Christo Ananth
ย 
Double Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torqueDouble Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torqueBhangaleSonal
ย 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...ranjana rawat
ย 
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...tanu pandey
ย 
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Bookingroncy bisnoi
ย 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTbhaskargani46
ย 
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Bookingdharasingh5698
ย 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...ranjana rawat
ย 
Thermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptThermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptDineshKumar4165
ย 
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxBSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxfenichawla
ย 
Unit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdfUnit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdfRagavanV2
ย 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdfKamal Acharya
ย 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlysanyuktamishra911
ย 

Recently uploaded (20)

Call for Papers - International Journal of Intelligent Systems and Applicatio...
Call for Papers - International Journal of Intelligent Systems and Applicatio...Call for Papers - International Journal of Intelligent Systems and Applicatio...
Call for Papers - International Journal of Intelligent Systems and Applicatio...
ย 
Call Now โ‰ฝ 9953056974 โ‰ผ๐Ÿ” Call Girls In New Ashok Nagar โ‰ผ๐Ÿ” Delhi door step de...
Call Now โ‰ฝ 9953056974 โ‰ผ๐Ÿ” Call Girls In New Ashok Nagar  โ‰ผ๐Ÿ” Delhi door step de...Call Now โ‰ฝ 9953056974 โ‰ผ๐Ÿ” Call Girls In New Ashok Nagar  โ‰ผ๐Ÿ” Delhi door step de...
Call Now โ‰ฝ 9953056974 โ‰ผ๐Ÿ” Call Girls In New Ashok Nagar โ‰ผ๐Ÿ” Delhi door step de...
ย 
NFPA 5000 2024 standard .
NFPA 5000 2024 standard                                  .NFPA 5000 2024 standard                                  .
NFPA 5000 2024 standard .
ย 
chapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineeringchapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineering
ย 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - V
ย 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . ppt
ย 
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELL
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELLPVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELL
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELL
ย 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
ย 
Double Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torqueDouble Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torque
ย 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
ย 
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
ย 
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
ย 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPT
ย 
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
ย 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
ย 
Thermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptThermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.ppt
ย 
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxBSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
ย 
Unit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdfUnit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdf
ย 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdf
ย 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
ย 

Featured

2024 State of Marketing Report โ€“ by Hubspot
2024 State of Marketing Report โ€“ by Hubspot2024 State of Marketing Report โ€“ by Hubspot
2024 State of Marketing Report โ€“ by HubspotMarius Sescu
ย 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTExpeed Software
ย 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsPixeldarts
ย 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthThinkNow
ย 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfmarketingartwork
ย 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024Neil Kimberley
ย 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)contently
ย 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024Albert Qian
ย 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsKurio // The Social Media Age(ncy)
ย 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Search Engine Journal
ย 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summarySpeakerHub
ย 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd Clark Boyd
ย 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next Tessa Mero
ย 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentLily Ray
ย 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceChristy Abraham Joy
ย 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best PracticesVit Horky
ย 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project managementMindGenius
ย 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...RachelPearson36
ย 

Featured (20)

2024 State of Marketing Report โ€“ by Hubspot
2024 State of Marketing Report โ€“ by Hubspot2024 State of Marketing Report โ€“ by Hubspot
2024 State of Marketing Report โ€“ by Hubspot
ย 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPT
ย 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage Engineerings
ย 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
ย 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
ย 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
ย 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
ย 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
ย 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
ย 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
ย 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
ย 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
ย 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
ย 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
ย 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
ย 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
ย 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
ย 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
ย 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
ย 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
ย 

Keep: Open Source Content-Addressed Storage

  • 1. Keep: Open Source Content-Addressed Storage How We Turned a Big Hot Mess of Perl Into a Sweet Go Ride
  • 2. Overview โ— The problem: large-scale data management is hard. โ— The solution: content-addressed storage and federation. โ— History of Warehouse and Keep โ— Keep: motivations and design goals โ— What we learned from Go
  • 3. The problem: Data management is hard.
  • 4. The problem Managing data for scientific research is hard. It's too easy to lose data: rm result<tab> ---OH WAIT CRAP NO ./generate_results.py -o results1.csv ---OH WAIT CRAP NO Or lose track of how we got it. $ ls -l results/ -rw-r----- 1 twp twp 859458786 Sep 2 15:39 results1.csv -rw-r----- 1 twp twp 758489475 Sep 3 15:51 results2.csv -rw-r----- 1 twp twp 958747348 Sep 4 11:46 results3.csv -rw-r----- 1 twp twp 795984373 Sep 5 17:12 results4.csv -rw-r----- 1 twp twp 833857373 Sep 6 9:38 results5.csv -rw-r----- 1 twp twp 894847636 Sep 7 12:46 results6.csv -rw-r----- 1 twp twp 847476854 Oct 2 12:17 results_umm.csv -rw-r----- 1 twp twp 766845784 Sep 12 19:08 results_wednesday_i_think.csv -rw-r----- 1 twp twp 932875738 Sep 8 18:32 results_whatever.csv
  • 5. The problem Why is federation important? Because the alternative is snail-mailing hard drives of data all over the world.
  • 7. Keep: open source content-addressed storage. Design goals: โ— Gracefully handle data sets measured in the terabytes and petabytes. โ— Multi-tenant architecture โ— Lightweight permissions system โ— Minimize external dependencies โ— Data federation
  • 8. What is Content-Addressed Storage? A very large key/value store, in which the address of a data object is the hash of its content. Example: $ head -c 10000000 /dev/urandom > /tmp/stuff $ md5sum /tmp/stuff c54a33209b03905476bf971e722c683d /tmp/stuff This file can only be stored under the address c54a33209b03905476bf971e722c683d. PUT /c54a33209b03905476bf971e722c683d -> HTTP/1.1 200 OK c54a33209b03905476bf971e722c683d+10000000 Attempting to store it under a different name yields an HTTP 4xx error. PUT /ffffffffffffffffffffffffffffffff -> HTTP/1.1 422 Hash mismatch in request With CAS, if you store the same data blob twice, you always get the same key back. Best known example: Git refs.
  • 9. Why is CAS useful? Permanent storage. โ— Blobs cannot be overwritten. โ— Even on purpose! Determine quickly whether a large data blob is present in the store. These characteristics make content-addressed storage extremely well suited to any system which demands accountability on very large data sets, such as: โ— Scientific computing โ— Accounting data management โ— Photo retouching
  • 10. Existing alternatives CAS systems typically have names like: EMC, NetApp, IBM. Cost tends to be on the order of $2,000-$10,000 per terabyte. Not open source, therefore, poopy. A few open source alternatives exist, notably Camlistore. But some missing or immature features: โ— Multi-tenant support โ— Blob permissions
  • 11. Keep architecture Data is written in blocks up to 64MB. Smart client, dumb server. Client is responsible for: โ— Structured data (directories, folders, names, etc) โ— Replication Default implementation on top of POSIX filesystem. Cheap, easy to deploy. -rw------- 1 keep keep 14551778 Oct 10 17:06 /keep/87b/87b0f2f2eb0c1f90c6da46309a799cc0 -rw------- 1 keep keep 8154404 Oct 10 17:06 /keep/cc5/cc57ebe000aed447e1e481569e1a8abd -rw------- 1 keep keep 7239086 Oct 10 17:06 /keep/ae8/ae8a8b29d9fb6325ee93d951cdae896f -rw------- 1 keep keep 14455989 Oct 10 17:06 /keep/e92/e928a4d4b5c3ea903914d178bbfdb035 Keep Volumes can be implemented on top of any backend service: โ— RAID โ— Amazon S3 โ— Google Cloud Storage
  • 12. Keep permissions Goals: โ— Require permission to read blocks โ— No hard dependencies on external authentication services Solution: permission hints. Block requests are accompanied by a timestamped signature, e.g.: 87b0f2f2eb0c1f90c6da46309a799cc0+14551778 + Abcf33732294c3e1fe16e39cea3114c9461274645 @ 5438550a --------- block locator string --------/ ------------ SHA-1 signature ----------/ timestamp Signatures are derived from the block hash, the user's OAuth token, the expiration timestamp, and a server-side signing secret. Permission hints can be generated by the authentication server. The Keep server can verify a valid permission signature instantly, without even having to contact any other service.
  • 13. From Perl to Go Original Keep implementation, "Warehouse", written in Perl for the Harvard Personal Genome Project. Many drawbacks: โ— Perl โ— Eats ALL the memory โ— Not multithreaded (see also: Perl) โ— Slooooooooooooow. Slow slow slow. โ— Slow. โ— Did I mention Perl?
  • 14. What did Go give us?
    So we rewrote Warehouse in Go.
    ● 3 weeks to working prototype (supports GET and PUT)
    ● 6 weeks to production (including permissions)
    Advantages:
    ● Easier dependency management
    ● Clean concurrency architecture
    ● Rapid develop/build/test/deploy cycle
  • 15. Dependency management: Perl
    Welcome to dependency hell.
    Package: libwarehouse-perl
    Architecture: all
    Depends: ${perl:Depends}, ${misc:Depends}, libdbi-perl, libwww-perl, libio-stringy-perl, libtimedate-perl, libgnupg-interface-perl, libunix-syslog-perl, libbsd-resource-perl, libio-compress-zlib-perl, libdigest-sha-perl, bioperl, gcc, g++, libstdc++6, bison, perlmagick, imagemagick, gnuplot, bzip2, libbz2-dev, libfftw3-3, libfftw3-dev, libxml-simple-perl, ghostscript, xsltproc, libyaml-perl, libjson-perl, realpath, psmisc
    Recommends: libhttp-ghttp-perl, libwhisker2-perl, libfuse-perl
    Description: Warehouse -- Client and Server library for the storage warehouse.
     Warehouse -- Client and Server library for the Free Factory storage warehouse.
    Deployment dependencies include MogileFS, MySQL, PGP, others. The stuff of nightmares.
  • 16. Dependency Management: Go
    import (
        "bufio"
        "bytes"
        "container/list"
        "crypto/md5"
        "encoding/json"
        "fmt"
        "github.com/gorilla/mux"
        "io"
        "log"
        "net/http"
        "os"
        "regexp"
        "runtime"
        "strconv"
        "strings"
        "syscall"
        "time"
    )
    Exactly one third-party dependency: github.com/gorilla/mux (HTTP request routing).
  • 18. Concurrency: Keep
    Go implements concurrency in "goroutines" -- small tasks that run independently of each other and may be run on other threads. The Go runtime is responsible for managing threads and scheduling tasks.
    Writing concurrent code is as simple as:
    func master() {
        go start_slave()
        // The slave may continue to run after master() returns.
    }

    func start_slave() {
        // Go has no "while" keyword; a conditional loop is written with "for".
        for work_to_do() {
            ...
        }
    }
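The slide's snippet elides the actual work, so here is a small, self-contained sketch of the same idea that can be run as-is. The names `sumSquares` and `results` are illustrative only; the sketch uses `sync.WaitGroup` and a buffered channel from the standard library to wait for the goroutines and collect their output.

```go
package main

import (
	"fmt"
	"sync"
)

// sumSquares starts one goroutine per input value and adds up the
// squares they report back over a channel.
func sumSquares(nums []int) int {
	// Buffered so that no goroutine blocks while sending its result.
	results := make(chan int, len(nums))
	var wg sync.WaitGroup

	for _, n := range nums {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			results <- n * n
		}(n)
	}

	wg.Wait()      // every goroutine has sent its result
	close(results) // lets the range loop below terminate

	total := 0
	for r := range results {
		total += r
	}
	return total
}

func main() {
	fmt.Println(sumSquares([]int{1, 2, 3})) // prints 14
}
```

Unlike the fire-and-forget call on the slide, this version waits for its goroutines, which is usually what a caller wants when it needs their results.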
  • 19. Concurrency: Keep
    The standard Go libraries provide a rich set of tools for writing concurrent applications.
    This is a complete implementation of a working multithreaded HTTP server in Go. The HTTP library's http.Server type handles each request in its own goroutine.
    package main

    import (
        "log"
        "net/http"
    )

    func main() {
        http.HandleFunc("/hello", helloHandler)
        log.Fatal(http.ListenAndServe(":8080", nil))
    }

    func helloHandler(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("Hello!\n"))
    }
  • 20. Rapid development cycles
    Very fast build cycles:
    hitchcock:/home/twp/arvados/services/keepstore% wc handlers.go keepstore.go perms.go volume.go work_queue.go
    ...
    1515 6183 45084 total
    hitchcock:/home/twp/arvados/services/keepstore% time go build
    go build 1.01s user 0.15s system 99% cpu 1.154 total
    Testing:
    hitchcock:/home/twp/arvados/services/keepstore% wc *_test.go
    ...
    1830 6233 50798 total
    hitchcock:/home/twp/arvados/services/keepstore% go test
    ...
    PASS
    ok _/home/twp/arvados/services/keepstore 1.018s
  • 21. Rapid development cycles
    PASS
    ok _/home/twp/arvados/services/keepstore 1.018s
    Why does the test take that long? Oh right.
    // Sleep for 1s, then put the block again. The volume
    // should report a more recent mtime.
    //
    // TODO(twp): this would be better handled with a mock Time object.
    // Alternatively, set the mtime manually to some moment in the past
    // (maybe a v.SetMtime method?)
    //
    time.Sleep(time.Second)
  • 22. Lessons from Go You don't have to choose between performance and rapid prototyping. Extremely fast design and test cycle. Where Perl celebrates "laziness, impatience, and hubris," Go makes it easy to do the right thing. Go makes laziness awkward.
  • 23. Where do we go from here?
    Keep is open source (AGPLv3).
    Keep source code is at github.com/curoverse/arvados/tree/master/services/keepstore
    The full source code repository: github.com/curoverse/arvados
    We are eager for contributors and new ideas for how to use Keep!
  • 24. Thank you! Tim Pierce Senior Software Engineer, Curoverse twp@curoverse.com http://curoverse.com/ @qwrrty