Slides from my Planning to Fail talk given at PHP North East conference 2013. This is a slightly longer version of the same talk given at the PHP UK conference. The talk was on how you can build resilient systems by embracing failure.
18. • Launched in London, November 2011
• Now in 5 cities in 3 countries (30%+ growth every month)
• A Hailo hail is accepted around the world every 5 seconds
19. “.. Brooks [1] reveals that the complexity of a software project grows as the square of the number of engineers and Leveson [17] cites evidence that most failures in complex systems result from unexpected inter-component interaction rather than intra-component bugs, we conclude that less machinery is (quadratically) better.”
http://lab.mscs.mu.edu/Dist2012/lectures/HarvestYield.pdf
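One way to read "quadratically" in the quote above: if any two components can interact, the number of possible pairwise interactions grows with the square of the component count. A rough illustration (the counts 10 and 20 are illustrative, not Hailo figures):

```latex
\[
  I(n) = \binom{n}{2} = \frac{n(n-1)}{2},
  \qquad I(10) = 45,
  \qquad I(20) = 190
\]
```

Doubling the number of components roughly quadruples the interactions that can go wrong.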
20. • SOA (10+ services)
• AWS (3 regions, 9 AZs, lots of instances)
• 10+ engineers building services, and you? (Hailo is hiring)
34. Separating concerns
MySQL: CRUD, locking, search, analytics, ID generation (also queuing…)
35. At Hailo we look for technologies that are:
• Distributed: run on more than one machine
• Homogeneous: all nodes look the same
• Resilient: can cope with the loss of node(s) with no loss of data
36. “There is no such thing as standby infrastructure: there is stuff you always use and stuff that won’t work when you need it.”
http://blog.b3k.us/2012/01/24/some-rules.html
37. Cassandra
• Highly performant, scalable and resilient data store
• Underpins much of what we do at Hailo
• Makes multi-DC easy!
38. ZooKeeper
• Highly reliable distributed coordination
• We implement locking and leadership election on top of ZK and use it sparingly
39. Elasticsearch
• Distributed, RESTful search engine built on top of Apache Lucene
• Replaced basic foo LIKE ‘%bar%’ queries (so much better)
40. NSQ
• Realtime message processing system designed to handle billions of messages per day
• Fault tolerant and highly available, with a reliable message delivery guarantee
41. Acunu Analytics
• Real-time incremental analytics platform, backed by Apache Cassandra
• Powerful SQL-like interface
• Scalable and highly available
43. • All these technologies have similar properties of distribution and resilience
• They are designed to cope with failure
• They are not broken by design
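47. class HailoMemcacheService {
    private $mc = null;
    private $servers; // list of [host, port] pairs

    public function __construct(array $servers) {
        $this->servers = $servers;
    }
    // Any method call triggers the lazy connection
    public function __call($method, $args) {
        $mc = $this->getInstance();
        // do stuff
    }
    private function getInstance() {
        // Create the client on first use only
        if ($this->mc === null) {
            $this->mc = new Memcached;
            $this->mc->addServers($this->servers);
        }
        return $this->mc;
    }
}
Lazy-init instances; connect on use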
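A minimal, runnable sketch of the same lazy-init idea, written so it does not need the Memcached extension. `LazyClient` and `FakeCache` are hypothetical names invented for this illustration, not Hailo code; the point is only that the real client is not constructed until the first method call.

```php
<?php
// Hypothetical wrapper: builds the real client only when a method is first called.
class LazyClient
{
    private $factory;
    private $instance = null;

    public function __construct(callable $factory)
    {
        $this->factory = $factory;
    }

    // Forward every method call to the lazily created client.
    public function __call($method, $args)
    {
        return call_user_func_array([$this->getInstance(), $method], $args);
    }

    private function getInstance()
    {
        if ($this->instance === null) {
            $this->instance = call_user_func($this->factory);
        }
        return $this->instance;
    }
}

// Stand-in for a real client (an assumption for the demo); counts constructions.
class FakeCache
{
    public static $connections = 0;
    private $data = [];

    public function __construct()
    {
        self::$connections++;
    }

    public function set($key, $value)
    {
        $this->data[$key] = $value;
        return true;
    }

    public function get($key)
    {
        return $this->data[$key] ?? false;
    }
}

$cache = new LazyClient(function () {
    return new FakeCache();
});
$before = FakeCache::$connections;   // wrapper exists, client not built yet
$cache->set('greeting', 'hello');    // first use constructs the client
$value = $cache->get('greeting');    // reuses the same client
$after = FakeCache::$connections;
```

Because construction is deferred, a page that never touches the cache never pays for (or fails on) the connection.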
51. “Fail Fast: Set aggressive timeouts such that failing components don’t make the entire system crawl to a halt.”
http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html
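As a rough sketch of what setting timeouts can look like in plain PHP, the helper below bounds both the connect and the read on a TCP stream. `readWithTimeout` is a hypothetical function written for this illustration; the demo server that accepts connections but never replies is likewise an assumption, standing in for a dependency that has stopped responding.

```php
<?php
// Hypothetical helper: read one reply from a TCP service, giving up quickly
// instead of hanging when the service stops responding.
function readWithTimeout(string $host, int $port, float $connectTimeout, float $readTimeout): ?string
{
    $errno = 0;
    $errstr = '';
    // Connect timeout: bounds how long the TCP connect may take.
    $conn = @stream_socket_client("tcp://$host:$port", $errno, $errstr, $connectTimeout);
    if ($conn === false) {
        return null; // fail fast: could not connect in time
    }

    // Read timeout: bounds each subsequent blocking read on this stream.
    $seconds = (int) floor($readTimeout);
    $micros = (int) (($readTimeout - $seconds) * 1000000);
    stream_set_timeout($conn, $seconds, $micros);

    $data = fread($conn, 1024);
    $meta = stream_get_meta_data($conn);
    fclose($conn);

    if ($data === false || $data === '' || $meta['timed_out']) {
        return null; // no reply within the deadline
    }
    return $data;
}

// Demo: a local server that listens but never sends anything back.
$server = stream_socket_server('tcp://127.0.0.1:0', $srvErrno, $srvErrstr);
$address = stream_socket_get_name($server, false); // "127.0.0.1:<port>"
$port = (int) substr(strrchr($address, ':'), 1);

$start = microtime(true);
$reply = readWithTimeout('127.0.0.1', $port, 0.5, 0.2);
$elapsed = microtime(true) - $start;
fclose($server);
```

The caller gets `null` after roughly the read timeout rather than blocking indefinitely, and can then fall back or report an error.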
63. [Diagram: a service connects over AMQP (port 5672) to a RabbitMQ HA cluster of three nodes]
64. $ iptables -A INPUT -i eth0 \
    -p tcp --dport 5672 \
    -m state --state ESTABLISHED \
    -j DROP
$ php test-rabbitmq.php
Fantastic! Block AMQP port, client times out
87. Thanks
Software used at Hailo
http://cassandra.apache.org/
http://zookeeper.apache.org/
http://www.elasticsearch.org/
http://www.acunu.com/acunu-analytics.html
https://github.com/bitly/nsq
https://github.com/davegardnerisme/cruftflake
https://github.com/davegardnerisme/nsqphp
Plus a load of other things I’ve not mentioned.
88. Further reading
Hystrix: Latency and Fault Tolerance for Distributed Systems
https://github.com/Netflix/Hystrix
Timelike: a network simulator
http://aphyr.com/posts/277-timelike-a-network-simulator
Notes on distributed systems for young bloods
http://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/
Stream de-duplication (relevant to NSQ)
http://www.davegardner.me.uk/blog/2012/11/06/stream-de-duplication/
ID generation in distributed systems
http://www.slideshare.net/davegardnerisme/unique-id-generation-in-distributed-systems
Editor's Notes
I’m dave!
I work at Hailo. This presentation draws on my experiences building Hailo into one of the world’s leading taxi companies.
The title of my talk is “planning to fail”
First PHP conf; tempting fate. Thought about this title, but sounds more like monitoring.
This talk is more proactive than that. I'm talking about my experiences at Hailo building reliable web services by continually failing.
Why do we care about reliability?
Advantages
But first, let’s rewind to the beginning
The pure joy of inserting a php tag in the middle of an HTML table
My website still follows this pattern. I’d like to think my website is quite reliable.
My website is reliable, but simple. Doesn’t change very often.
Hailo is complex!
Hailo is growing.
Key quote: less machinery is quadratically better.
Hailo have a lot of machinery!
Enter the chaos monkey… If you want to be good at something, practice often!
How about the “reliable” VPC that runs my website?
But not resilient; my website would not cope well with the chaos monkey approach.
We have to choose our stack appropriately if we are going to go down the chaos monkey route.
Hailo didn’t start out this way; but the PHP component did
Splitting into an SOA. Makes it much easier to change bits of code, since each service does less, has fewer lines of code and changes less frequently. Also makes it easier to work in larger teams.
Advantages
Here’s one of our services… is this reliable?
But Hailo is going global
At Hailo we are splitting out the features of MySQL and using different technologies where appropriate
Don’t pick things that are broken by design
We remove services from the critical path using lazy-init pattern
We want to define timeouts so that under failure conditions we don’t hang forever