Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2HOCiRw.
Nate Kupp shares some of Thumbtack’s key learnings on their journey to scale: from a PHP/PostgreSQL monolith with a self-managed Hadoop cluster, to Dockerized microservices paired with managed/serverless data infrastructure, and their future with fully-managed systems. Filmed at qconsf.com.
Nate Kupp leads the Technical Infrastructure team at Thumbtack where he has worked on scaling both engineering teams and infrastructure to support Thumbtack's rapidly growing marketplaces. Before joining Thumbtack, he spent a few years at Apple, where he built out data infrastructure for iOS and watchOS battery life analytics.
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
Scaling Marketplaces at Thumbtack
1. Scaling Marketplaces at Thumbtack
QCon SF 2017
Nate Kupp
Technical Infrastructure
Data Eng, Experimentation, Platform
Infrastructure, Security, Dev Tools
2. InfoQ.com: News & Community Site
• Over 1,000,000 software developers, architects and CTOs read the site world-
wide every month
• 250,000 senior developers subscribe to our weekly newsletter
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• 2 dedicated podcast channels: The InfoQ Podcast, with a focus on
Architecture and The Engineering Culture Podcast, with a focus on building
• 96 deep dives on innovative topics packed as downloadable emags and
minibooks
• Over 40 new content items per week
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/
thumbtack-scalability
3. Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Presented at QCon San Francisco
www.qconsf.com
8. ME: HARDWARE SOFTWARE
Semiconductor manufacturing research
Apple iOS battery life
Data Infrastructure @ Thumbtack
Technical Infrastructure @ Thumbtack
Data, core platform, A/B testing & dev tools
Yup, those are working transistors
9.
10.
11. Growth Hacking with Team Philippines
Customer
Pro A
Pro B
Pro C
I need a
plumber!
Is this a spam
request?
SPAM DETECTION
Can we automate
quoting?
“QUOTING SERVICE”
Which pros do we
invite to quote?
MATCHING
14. “It has long been an axiom of mine
that the little things are infinitely
the most important.”
- Sherlock Holmes
“Small data”
CSV files Upload Web UI
Analysts
“Big data”
Hive/Impala
15. Change doesn’t happen in a vacuum.
Give people a way to get things done while
you build or migrate.
16. Late 2015
Starting to hit
“real” scale
Growth @ Thumbtack
2009
Thumbtack
founded
2014
Raise
$130mm
Dec. 2014
I join
Thumbtack
Now
Continued
growth
17. Mid-2015
Early data infrastructure
2009 - 2014
The monolith
2015
Finally on AWS! Started
using DynamoDB, Golang
microservices
The Evolution of Thumbtack’s Infrastructure
Go servicesWebsite Website
Sqoop ETL
HDFS
Event Logging
18. Today
Mixed clouds,
fully-containerized
infrastructure, terraform
The Evolution of Thumbtack’s Infrastructure
Production Serving Infrastructure Data Infrastructure
Application Layer
(Docker on ECS)
PHP, Go, Scala
Storage Layer
PostgreSQL, DynamoDB,
Elasticsearch
Processing
Scala/Spark on Dataproc
Storage
GCS
SQL
BigQuery
20. Application Layer
Storage
Cloud
Storage
Ingest
Cloud
Pub/Sub
Analytics
BigQuery
Data Infrastructure
Spark ETL
Cloud Dataproc
Airflow
Thrift/JSON
events
Search
Indexing
Pipelines
Pipeline
Orchestration
Full table
Sqoop ETL
Batch ETL
Cloud Dataproc
Job-scoped clusters on
Dataproc 1.1 / Spark 2.0.2,
Scala
Data Retention
Data retained forever
on GCS / BigQueryCustom DynamoDB ETL
Internal DynamoDB ETL
pipelines, scales up/down
read capacity
Sqoop
Cloud Dataproc
Pipelines
ML Pipelines
21. And it was great!
- Costs comparable to monolithic cluster
- Creating and destroying > 600 clusters per day
- > 12,000 BigQuery queries per workday
22. But, getting there took some real work
HDFS
Storage
Cloud
Storage
Analytics
BigQuery
Code Changes
- Add retries everywhere (many 500/503 error codes)
- Rewrite HDFS utilities
- Change all Parquet to Avro
- Educate team (100s of users) on new SQL dialect
and web interface
23. And, still needed to build interfaces to managed services
Ingest
Cloud
Pub/Sub
Go services
services
Website
Analytics
BigQuery
STREAMING PIPELINEDATA
INGEST
Spark Streaming
Cloud Dataproc
Spark Streaming
Wrote custom Spark receiver for Cloud
Pub/Sub, checkpointing WAL to local
HDFS, no backpressure (yet)
End-to-end Durability
Built internal solutions; achieving end-to-end
durability guarantees here is hard
BigQuery Streaming APIs
Streaming writes of O(100s of millions)
of events / day into BigQuery tables
24. + Uptime: GCP uptime has actually been great
- Unexpected BigQuery API Changes: Google occasionally makes production changes
which break us. When these happen, we can’t really do anything until fixed by Google or a
workaround is identified.
“Here be dragons”
April 2017
BigQuery Avro Ingest API Changes
Previously, a field marked as required by the Avro
schema could be loaded into a table with the field
marked nullable; this started failing.
October 2017
BigQuery Sharded Export Changes
Noticed many hung Dataproc clusters. Team identified
workaround to disable BQ sharded export by setting
mapred.bq.input.sharded.export.enable to false.
25. Managed services make some things easier…
But only some things.
Go in with eyes wide open.
28. GCP Migration & BigQuery
Why? A few reasons...
1. Fully-managed, serverless, petabyte-scale data
warehouse
2. Nested/complex data types
3. Security & access controls
4. Self-serve interface to Google Sheets
29. Surprise dependencies
Exception: BigQuery job failed. Final error was:
{
'reason': 'invalid',
'message':"Could not parse '.' as int for field category_id
(position 0) starting at location 340 ",
'location': '/gdrive/home/Category taxonomy - master file'
}
30. When you make it really easy to deploy infrastructure...
...you get A LOT of random infrastructure