15. Why?
“There are only two hard problems in distributed systems:
2. Exactly-once delivery
1. Guaranteed order of messages
2. Exactly-once delivery”
● Mathias Verraes
23. Org Objectives
Distributed System Objectives
● Performance
● Security
● Stability
● Cost
These are cross-cutting objectives that are crucial in an SOA.
24. Org Objectives
No problem …
● Performance is important!
● Security is important!
● Stability is important!
● Cost is important!
… all done right?
30. Service level objectives
● Contracts are cool
● Performance is cool
● Uptime is cool
● Keeping cost low is cool
… So write these things down in a Contract and page
owners when we violate them?
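One way to make "write these things down" concrete is a machine-readable SLO contract per service. This is a minimal sketch of what such a contract could look like; the format, field names, and thresholds are all hypothetical, not Yelp's actual tooling:

```yaml
# Hypothetical per-service SLO contract (illustrative only).
service: example_service
owner_team: example-team
objectives:
  latency_p99_ms: 250       # performance
  uptime_percent: 99.9      # stability
  monthly_cost_usd: 5000    # cost
paging:
  notify: example-team-oncall   # who gets paged
  page_on_violation: true       # page owners when an objective is breached
```

Keeping the contract in config rather than prose means monitoring can alert on it automatically, and ownership of each objective is unambiguous.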
37. Ditching Libraries
● Libraries can be pretty terrible
– Break tests
– Deploy 20 versions
– Function calls work only in ${language}
– Bugs can take weeks to fix
– Tests can take a long time
40. Ditching Libraries
● Libraries can be pretty awesome
– Break tests, not websites
– Deploy 20 versions
– Function calls are wicked fast
– Have weeks to fix a bug
– Unit tests are fast
42. Everybody does Ops?
● Not all Devs can (want to) do Ops
● Not all Ops can (want to) do Dev
● What about?
○ DBAs
○ Security Engineers
○ Designers
43. What to Aim For Instead?
1. Encourage cooperation
2. Acknowledge your engineers have varied skills: {Ops, Dev, Security, Databases, Design, Frontend, API design, Performance, etc.}
3. Try to build teams that have a wide range of skills
46. Image Citations
● Deep dive: https://en.wikipedia.org/wiki/Deep_diving#/media/File:Trevor_Jackson_returns_from_SS_Kyogle.jpg
● Map: https://commons.wikimedia.org/wiki/File:Carta_Marina_AB_stitched.jpg
● AWS Total Cost of Ownership: https://aws.amazon.com/blogs/aws/the-new-aws-tco-calculator/
● Sharing milkshake: https://commons.wikimedia.org/wiki/File:Children_sharing_a_milkshake.jpg
● Field of flowers: TODO
Editor's notes
Approx. 89 million UMVs via mobile
More than 90 million reviews contributed since inception
Approx. 71% of all searches on Yelp came from mobile (mobile web & app)
Yelp is present across 32 countries
Here’s a cool picture showing exponential increase in complexity over time
What does this have to do with services, you might ask?
In the beginning, there were zero lines of code in yelp-main
In 2016, there are about three million
The problem is, it’s hard to scale up our release process as we keep adding code and developers
What is our release process?
Once your branch is code reviewed, you submit your branch as a push request
Three times a day, a push master grabs around 20 branches and pushes that code out to production
So at most around 60 branches get released per day
We needed an alternative approach...
This was our first production service
It didn’t do very much :)
But it was a very useful testing ground for service technologies, as well as deployment, monitoring etc.
We generalized it to become v1 of our service template
Which then begot PaaSTA, our Platform as a Service
In five years we saw an explosion of over 150 services
Maybe we overshot the mark a little? :)
Joey is going to talk more about this in a bit
In order to get good at deploying services we’ve had to make lots of changes to the org
It used to take several weeks to deploy a service, now it takes an hour or two
We spread out operations responsibilities to minimize queuing
This is a specific case of a more general one of distributing knowledge
Programming the monolith is hard
Programming a service oriented architecture is very hard
A few weeks ago we had an issue in our task queues due to a kafka issue
This caused massive duplication of some tasks e.g. 50x for some
These duplicate tasks caused duplicate photos to appear in timelines :(
Great example of why knowing about idempotency is important
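The incident above can be sketched in a few lines: if the consumer tracks which task ids it has already applied, redelivered tasks become no-ops instead of duplicate photos. This is a minimal in-memory sketch; function and variable names are hypothetical, and real systems would keep the seen-id set in durable storage.

```python
# Sketch of an idempotent task consumer (hypothetical names).
# If the queue redelivers a task (e.g. 50x after a Kafka issue),
# reprocessing it must not duplicate side effects such as adding
# the same photo to a timeline twice.

processed = set()   # in production this would live in durable storage
timeline = []

def add_photo(task_id, photo):
    """Apply the task only if its id has not been seen before."""
    if task_id in processed:
        return False          # duplicate delivery: no-op
    timeline.append(photo)
    processed.add(task_id)
    return True

# Simulate massive duplication of the same task:
for _ in range(50):
    add_photo("task-123", "photo.jpg")

# Despite 50 deliveries, the photo appears in the timeline once.
```

The key design choice is keying deduplication on a stable task id assigned by the producer, so retries and redeliveries all carry the same id.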
Service principles document
Outlines what we think are the important things wrt design and operations
Technology agnostic
Service tutorial
We use a cool program called dexy to script incremental service creation and display the output
“Here’s the diff, here’s the output of the service when you apply the diff”
Deputy programs
There are some processes where you can cause a lot of damage if you do them wrong
e.g. Making puppet changes, setting up new services
So we really don’t want to hand the keys to new developers
Solution: take one or two more senior engineers from each team and train them to do these things
Every week we hold office hours
Anyone from across the org can drop in and ask questions about services
Deep dives
Every Monday we have an engineering all-hands meeting
As part of this, we have a deep dive where an engineer discusses something they’ve been working on
Periodically use this to talk about some aspect of services
Service Creation Form (SCF) documents the basics of your service
Reviewed by a small group of more experienced engineers
It’s a balancing act wrt process (goldilocks)
In general, we’ve tried to disperse knowledge across the organisation instead
Examples of areas covered by SCF: Load balancing, failure modes, caching
Review process?
In the monolith, you usually have just one language, one ORM, one database technology, one caching technology
When we first went to services, everybody did their own thing: Clojure, Redis, Thrift, CouchDB
Person-SPOF: a single person becomes a single point of failure for a technology
This is today’s map of the world
Yours will probably look different
Common set of ‘safe’, well-supported technologies
You don’t *have* to use these, but if you don’t then you’re on your own...
One thing that we have standardized on is HTTP/JSON
Interface definition!
Many (not all) services are using Swagger to define their interfaces
Here’s an example of a partial swagger definition
Especially successful for our internalapi service
Previously: anything goes anywhere
Now: Swagger spec for every new endpoint, all spec changes go out to reviewboard group
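The notes refer to a partial Swagger definition without reproducing it. As a stand-in, here is a minimal Swagger 2.0 fragment for a single HTTP/JSON endpoint; the path, names, and fields are hypothetical and only illustrate the shape of such a spec:

```yaml
# Hypothetical partial Swagger 2.0 definition (illustrative only).
swagger: "2.0"
info:
  title: example_service
  version: "1.0"
paths:
  /business/{business_id}:
    get:
      summary: Fetch one business by id
      parameters:
        - name: business_id
          in: path
          required: true
          type: string
      responses:
        200:
          description: The business record
```

Because the spec is a reviewable text artifact, changes to it can be routed through a review group before any endpoint change ships.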
Every single service has a per-service endpoint
Not just the website
This is a service’s uptime + reliability, not the website’s
Each team owns their own services
Why? It’s a lot easier to assign responsibility if ownership is clear
e.g. upgrade this library
Ideally >= 2 people know about a service on a team
Some services do effectively become unowned
We use a JIRA project to track ongoing incidents
Once resolved, enters into the postmortem status
All postmortems go to all developers
I like postmortems, but they do take quite a lot of work.
Luckily Yelp is very supportive of these efforts
Initially some of this was a bit of a struggle for teams not used to operations
So we had to spread some of the operations best practices across the org
Oncall: not everyone wants to be oncall
Teams need Ops
No dedicated DevOps teams, rather empower your existing developers to become a DevOps or a SecOps or a DevSec, but don’t expect them to be everything