2. • Messaging tech lead at Workiva
• Platform infrastructure
• Distributed systems
• bravenewgeek.com
@tyler_treat
tyler.treat@workiva.com
ABOUT THE SPEAKER
3. • Embracing the reality of complex
systems
• Using simplicity to your advantage
• Why NATS?
• How Workiva uses NATS
ABOUT THIS TALK
4. There are a lot of parallels between
real-world systems and
distributed software systems.
6. …and the database is just
an optimization.[1]
[1] https://christophermeiklejohn.com/lasp/erlang/2015/10/27/tendency.html
7. “There will be no further print editions
[of the Merck Manual]. Publishing a
printed book every five years and
sending reams of paper around the
world on trucks, planes, and boats is
no longer the optimal way to provide
medical information.”
Dr. Robert S. Porter
Editor-in-Chief, The Merck Manuals
14. …but the natural tendency is to build
distributed systems as if they aren’t
distributed at all because it’s
easy to reason about.
strong consistency - reliable messaging - predictability
15. • Complicated algorithms
• Transaction managers
• Coordination services
• Distributed locking
What’s in a guarantee?
17. • Message handed to the transport layer?
• Enqueued in the recipient’s mailbox?
• Recipient started processing it?
• Recipient finished processing it?
What’s a delivery guarantee?
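One way to make the question concrete is to treat each bullet as a separate acknowledgement point. The following is a minimal sketch, not taken from the talk; the names `Stage`, `deliver`, and `is_delivered` are hypothetical. It simulates a receiver that crashes partway through and shows that "delivered" depends on which point the sender waits for.

```python
from enum import IntEnum


class Stage(IntEnum):
    """Points at which a sender could consider a message 'delivered'."""
    HANDED_TO_TRANSPORT = 1
    ENQUEUED_IN_MAILBOX = 2
    PROCESSING_STARTED = 3
    PROCESSING_FINISHED = 4


def deliver(crash_before=None):
    """Return the stages reached before the simulated receiver crashes.

    crash_before: the first Stage that is never reached, or None for no crash.
    """
    reached = []
    for stage in Stage:
        if crash_before is not None and stage >= crash_before:
            break
        reached.append(stage)
    return reached


def is_delivered(reached, ack_point):
    """A message counts as delivered only if the chosen ack point was reached."""
    return ack_point in reached


# The receiver crashes after the message is enqueued but before processing starts.
reached = deliver(crash_before=Stage.PROCESSING_STARTED)
print(is_delivered(reached, Stage.ENQUEUED_IN_MAILBOX))   # True
print(is_delivered(reached, Stage.PROCESSING_FINISHED))   # False
```

The same message is "delivered" under one definition and "lost" under another, which is why the choice of acknowledgement point has to be explicit.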
18. Each of these has a very different set of
conditions, constraints, and costs.
35. If you need business-level
guarantees, build them into
the business layer.
37. We can always build
stronger guarantees on top,
but we can’t always remove
them from below.
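As an illustration of building a stronger guarantee on top of a weaker one, here is a minimal sketch that layers retries and receiver-side deduplication over a transport that may drop either the message or its acknowledgement. Everything here (`LossyTransport`, `IdempotentReceiver`, `send_with_retry`, the drop rate) is an assumption made for the example, not code from the talk or from NATS.

```python
import random


class LossyTransport:
    """At-most-once transport: the request or its ack may be silently lost."""

    def __init__(self, drop_rate, seed=0):
        self.drop_rate = drop_rate
        self.rng = random.Random(seed)

    def send(self, receiver, msg_id, payload):
        if self.rng.random() < self.drop_rate:
            return False              # request lost before reaching the receiver
        acked = receiver.handle(msg_id, payload)
        if self.rng.random() < self.drop_rate:
            return False              # ack lost; the sender cannot tell the difference
        return acked


class IdempotentReceiver:
    """Business-layer receiver that applies each message ID at most once."""

    def __init__(self):
        self.seen = set()
        self.applied = []
        self.deliveries = 0

    def handle(self, msg_id, payload):
        self.deliveries += 1
        if msg_id not in self.seen:
            self.seen.add(msg_id)
            self.applied.append(payload)
        return True                   # ack duplicates too, so the sender stops retrying


def send_with_retry(transport, receiver, msg_id, payload, max_attempts=50):
    """Resend until an ack arrives; returns the number of attempts used."""
    for attempt in range(1, max_attempts + 1):
        if transport.send(receiver, msg_id, payload):
            return attempt
    raise RuntimeError("gave up after %d attempts" % max_attempts)


transport = LossyTransport(drop_rate=0.3, seed=42)
receiver = IdempotentReceiver()
for i in range(100):
    send_with_retry(transport, receiver, msg_id=i, payload="event-%d" % i)

print(len(receiver.applied))                       # each message applied once
print(receiver.deliveries >= len(receiver.applied))  # redeliveries were absorbed
```

The transport stays simple and fast; the retry policy and the deduplication key live in the layer that knows what a duplicate means for the business.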
38. End-to-end system semantics matter
much more than the semantics of an
individual building block[4].
[4] http://web.mit.edu/Saltzer/www/publications/endtoend/endtoend.pdf
43. Simple doesn’t mean easy.[5]
[5] https://blog.wearewizards.io/some-a-priori-good-qualities-of-software-development
44. “Simple can be harder than complex.
You have to work hard to get your thinking
clean to make it simple. But it’s worth it in
the end because once you get there, you
can move mountains.”
45. • Wdesk: platform for enterprises to collect, manage,
and report critical business data in real time
• Increasing amounts of data and complexity of
formats
• Cloud solution:
- Data accuracy
- Secure
- Highly available
- Scalable
- Mobile-enabled
About Workiva
48. • First solution built on Google App Engine
• Scaling new solutions requires a service-oriented
approach
• Scaling new services requires a low-latency
communication backplane
About Workiva
51. • Always on, always available
• Protects itself at all costs—no compromises on
performance
• Disconnects slow consumers and lazy listeners
• Clients have automatic failover and reconnect logic
• Clients buffer messages while temporarily
partitioned
Availability over Everything
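The "protects itself" behavior can be pictured with a toy model. This is a minimal sketch under stated assumptions: the classes `Server` and `Subscriber` and the `max_pending` message-count limit are hypothetical and do not reflect how the NATS server is implemented internally.

```python
from collections import deque


class Subscriber:
    def __init__(self, name, max_pending):
        self.name = name
        self.max_pending = max_pending   # hypothetical per-subscriber limit
        self.pending = deque()
        self.connected = True

    def drain(self, n):
        """Consume up to n pending messages and return them."""
        taken = []
        while self.pending and len(taken) < n:
            taken.append(self.pending.popleft())
        return taken


class Server:
    def __init__(self):
        self.subscribers = []

    def publish(self, msg):
        for sub in self.subscribers:
            if not sub.connected:
                continue
            if len(sub.pending) >= sub.max_pending:
                # Protect the server: cut off the slow consumer instead of
                # blocking publishers or letting its backlog grow without bound.
                sub.connected = False
                sub.pending.clear()
                continue
            sub.pending.append(msg)


server = Server()
fast = Subscriber("fast", max_pending=3)
slow = Subscriber("slow", max_pending=3)
server.subscribers += [fast, slow]

for i in range(10):
    server.publish(i)
    fast.drain(1)        # the fast subscriber keeps up; the slow one never drains

print(fast.connected, slow.connected)   # True False
```

The trade-off is deliberate: one misbehaving client loses its connection so that every other client keeps predictable latency.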
57. • Fast, predictable performance at scale and at the tail (high-percentile latency)
• ~8 million messages per second
• Auto-pruning of the interest graph allows efficient
routing
• When SLAs matter, it’s hard to beat NATS
Fast as Hell
67. “Just send this thing containing these fields
serialized in this way using that encoding to
this topic!”
68. “Just subscribe to this topic and decode
using that encoding then deserialize in
this way and extract these fields from
this thing!”
70. Pub/Sub is meant to decouple services
but often ends up coupling the teams
developing them.
71. How do we evolve services in isolation
and reduce development overhead?
72. • Extension of Apache Thrift
• IDL and cross-language, code-generated pub/sub
APIs
• Allows developers to think in terms of services and
APIs rather than opaque messages and topics
• Allows APIs to evolve while maintaining compatibility
• Transports are pluggable (we use NATS)
Frugal RPC
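To show what "services and APIs rather than opaque messages and topics" could look like, here is a hand-written sketch of a typed pub/sub wrapper. It is not Frugal's generated code or API: `InMemoryBus`, `AlbumCreated`, the publisher and subscriber classes, the topic name, and the JSON encoding are all assumptions chosen for illustration.

```python
import json
from dataclasses import dataclass, asdict


class InMemoryBus:
    """Stand-in for a pluggable transport such as NATS."""

    def __init__(self):
        self.handlers = {}

    def subscribe(self, topic, handler):
        self.handlers.setdefault(topic, []).append(handler)

    def publish(self, topic, data):
        for handler in self.handlers.get(topic, []):
            handler(data)


@dataclass
class AlbumCreated:
    """Hypothetical event type; with an IDL this would be a declared struct."""
    album_id: str
    title: str
    artist: str = ""   # added later; the default keeps older payloads decodable


class AlbumEventsPublisher:
    TOPIC = "album.created"   # topic and encoding are hidden from callers

    def __init__(self, bus):
        self.bus = bus

    def publish_album_created(self, event):
        self.bus.publish(self.TOPIC, json.dumps(asdict(event)).encode("utf-8"))


class AlbumEventsSubscriber:
    def __init__(self, bus):
        self.bus = bus

    def subscribe_album_created(self, callback):
        def decode(data):
            fields = json.loads(data.decode("utf-8"))
            # Ignore unknown fields so newer publishers do not break this consumer.
            known = {k: v for k, v in fields.items()
                     if k in AlbumCreated.__dataclass_fields__}
            callback(AlbumCreated(**known))
        self.bus.subscribe(AlbumEventsPublisher.TOPIC, decode)


bus = InMemoryBus()
received = []
AlbumEventsSubscriber(bus).subscribe_album_created(received.append)
AlbumEventsPublisher(bus).publish_album_created(AlbumCreated("a1", "Kind of Blue"))

# A payload from a different publisher version: no `artist`, one extra field.
bus.publish("album.created",
            b'{"album_id": "a2", "title": "Blue Train", "label": "Blue Note"}')

print(received[0].title, received[1].title)
```

Callers deal only with `AlbumCreated` and two method names; the topic string, serialization, and compatibility rules live in one place, which is the role generated code would play.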
74. • Service instances form a queue group
• Client “connects” to instance by publishing a message to the service
queue group
• Serving instance sets up an inbox for the client and sends it back in the
response
• Client sends requests to the inbox
• Connecting is cheap—no service discovery and no sockets to create, just
a request/response
• Heartbeats used to check health of server and client
• Very early prototype code: https://github.com/workiva/thrift-nats
RPC over NATS
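The handshake described above can be sketched with an in-memory stand-in for NATS. This is not the thrift-nats code: `Broker`, `ServiceInstance`, `Client`, the `_INBOX.n` naming, and the service name `calc` are assumptions for the example, and heartbeats are left out to keep it short.

```python
import itertools
import random


class Broker:
    """Tiny in-memory stand-in for NATS subjects, queue groups, and inboxes."""

    def __init__(self, seed=0):
        self.subs = {}      # subject -> list of handlers
        self.queues = {}    # (subject, group) -> list of handlers
        self.rng = random.Random(seed)
        self.ids = itertools.count(1)

    def new_inbox(self):
        return "_INBOX.%d" % next(self.ids)

    def subscribe(self, subject, handler, queue=None):
        if queue is None:
            self.subs.setdefault(subject, []).append(handler)
        else:
            self.queues.setdefault((subject, queue), []).append(handler)

    def publish(self, subject, data, reply=None):
        for handler in list(self.subs.get(subject, [])):
            handler(data, reply)
        for (subj, _group), members in list(self.queues.items()):
            if subj == subject and members:
                # Exactly one member of each queue group receives the message.
                self.rng.choice(members)(data, reply)

    def request(self, subject, data):
        """Synchronous request/response using a one-off reply inbox."""
        inbox, replies = self.new_inbox(), []
        self.subscribe(inbox, lambda d, _reply: replies.append(d))
        self.publish(subject, data, reply=inbox)
        del self.subs[inbox]
        return replies[0] if replies else None


class ServiceInstance:
    def __init__(self, broker, service, name):
        self.broker, self.name = broker, name
        # All instances of a service join the same queue group.
        broker.subscribe(service, self.on_connect, queue=service + "-workers")

    def on_connect(self, _data, reply):
        # Set up an inbox for this client and send it back in the response.
        inbox = self.broker.new_inbox()
        self.broker.subscribe(inbox, self.on_request)
        self.broker.publish(reply, inbox)

    def on_request(self, data, reply):
        self.broker.publish(reply, "%s handled %s" % (self.name, data))


class Client:
    def __init__(self, broker, service):
        self.broker = broker
        # "Connecting" is a single request/response against the queue group.
        self.server_inbox = broker.request(service, "connect")

    def call(self, payload):
        return self.broker.request(self.server_inbox, payload)


broker = Broker()
for suffix in ("a", "b", "c"):
    ServiceInstance(broker, "calc", "instance-" + suffix)

client = Client(broker, "calc")
first = client.call("req-1")
second = client.call("req-2")
print(first)
print(second)
```

Both calls are answered by the same instance, because after the handshake the client addresses that instance's inbox directly rather than the queue group.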
76. • Store JSON containing cluster membership in S3
• Container reads the JSON on startup and creates
routes with the correct credentials
• Services only talk to the NATS daemon on their VM
via localhost
• Don’t have to worry about encryption between
services and NATS, only between NATS peers
NATS per VM
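A rough sketch of the startup step follows. The JSON layout (`user`, `password`, `members`), the host addresses, and the helper name `build_routes` are hypothetical; fetching the document from S3 is omitted and an inline string is used instead. The output assumes the `nats-route://user:password@host:port` URL form used for NATS server routes.

```python
import json
from urllib.parse import quote

# Hypothetical membership document; the real schema is not shown in the talk.
MEMBERSHIP_JSON = """
{
  "user": "route_user",
  "password": "s3cr3t/pw",
  "members": [
    {"host": "10.0.0.1", "port": 6222},
    {"host": "10.0.0.2", "port": 6222},
    {"host": "10.0.0.3", "port": 6222}
  ]
}
"""


def build_routes(membership_json, self_host):
    """Return route URLs for every cluster member except this host."""
    cluster = json.loads(membership_json)
    # Percent-encode credentials so characters like '/' or '@' stay URL-safe.
    user = quote(cluster["user"], safe="")
    password = quote(cluster["password"], safe="")
    routes = []
    for member in cluster["members"]:
        if member["host"] == self_host:
            continue   # a daemon does not need a route to itself
        routes.append("nats-route://%s:%s@%s:%d"
                      % (user, password, member["host"], member["port"]))
    return routes


routes = build_routes(MEMBERSHIP_JSON, self_host="10.0.0.2")
print(",".join(routes))
```

The resulting comma-separated list is the kind of value a container entrypoint could hand to the local daemon when it starts.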
77. • Only messages intended for a process on another
host go over the network, since the NATS cluster
maintains an interest graph
• Greatly reduces network hops (usually 0 vs. 2-3)
• If local NATS daemon goes down, restart it
automatically
NATS per VM
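The effect of the interest graph on network traffic can be shown with a toy model. This is a sketch under simplifying assumptions, not NATS internals: the `Daemon` class is hypothetical, peers must be connected before subscribing, and each cross-host forward is counted as one network send.

```python
class Daemon:
    """One NATS-like daemon per host, tracking local and remote interest."""

    def __init__(self, host):
        self.host = host
        self.local = {}       # subject -> list of local handlers
        self.remote = {}      # subject -> set of peer daemons with interest
        self.peers = []
        self.network_sends = 0

    def connect(self, other):
        self.peers.append(other)
        other.peers.append(self)

    def subscribe(self, subject, handler):
        self.local.setdefault(subject, []).append(handler)
        for peer in self.peers:
            # Advertise interest so peers know to forward this subject here.
            peer.remote.setdefault(subject, set()).add(self)

    def unsubscribe(self, subject, handler):
        self.local[subject].remove(handler)
        if not self.local[subject]:
            # Prune: no local interest remains, so peers stop forwarding.
            for peer in self.peers:
                peer.remote.get(subject, set()).discard(self)

    def publish(self, subject, msg):
        self._deliver_local(subject, msg)
        for peer in self.remote.get(subject, ()):
            self.network_sends += 1      # only now does a message leave the host
            peer._deliver_local(subject, msg)

    def _deliver_local(self, subject, msg):
        for handler in self.local.get(subject, []):
            handler(msg)


vm_a, vm_b = Daemon("vm-a"), Daemon("vm-b")
vm_a.connect(vm_b)

got_a, got_b = [], []
handler_b = got_b.append

vm_a.subscribe("orders", got_a.append)
vm_a.publish("orders", "m1")             # local subscriber only: stays on the host

vm_b.subscribe("orders", handler_b)
vm_a.publish("orders", "m2")             # vm-b is interested: one network send

vm_b.unsubscribe("orders", handler_b)
vm_a.publish("orders", "m3")             # interest pruned: local again

print(vm_a.network_sends)                # 1
```

Three messages were published, but only the one with a remote subscriber crossed the network.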
78. • Doesn’t scale to a large number of VMs
• Fairly easy to transition to a floating NATS cluster or
to running on a subset of machines per AZ
• NATS communication abstracted from service
• Send messages to services without thinking about
routing or service discovery
• Queue groups provide service load balancing
NATS per VM
79. • We’re a SaaS company, not an infrastructure company
• High availability
• Operational simplicity
• Performance
• First-party clients: Go, Java, C, C#, Python, Ruby, Elixir, Node.js
NATS as a Messaging Backplane
80. “Every solution to every problem is simple…
It’s the distance between the two where the mystery lies.”
–Derek Landy, Skulduggery Pleasant