Simple Solutions for Complex Problems

Tyler Treat / Workiva
Bay Area NATS Meetup 3/22/2016

ABOUT THE SPEAKER
• Messaging tech lead at Workiva
• Platform infrastructure
• Distributed systems
• bravenewgeek.com
@tyler_treat
tyler.treat@workiva.com

ABOUT THIS TALK
• Embracing the reality of complex systems
• Using simplicity to your advantage
• Why NATS?
• How Workiva uses NATS

There are a lot of parallels between
real-world systems and 
distributed software systems.

The world is eventually consistent…

…and the database is just
an optimization.[1]
[1] https://christophermeiklejohn.com/lasp/erlang/2015/10/27/tendency.html

“There will be no further print editions
[of the Merck Manual]. Publishing a
printed book every five years and
sending reams of paper around the
world on trucks, planes, and boats is
no longer the optimal way to provide
medical information.”
Dr. Robert S. Porter 
Editor-in-Chief, The Merck Manuals

Programmers find asynchrony hard
to reason about, but the truth is…

What does this mean for us as
programmers?

[Chart: complexity over time. Timesharing, monoliths, SOA, virtualization, microservices, ???]
Complicated made complex…

Distributed computation is 
inherently asynchronous 
and the network is 
inherently unreliable[2]…
[2] http://queue.acm.org/detail.cfm?id=2655736

…but the natural tendency is to build
distributed systems as if they aren’t
distributed at all because it’s 
easy to reason about.
strong consistency - reliable messaging - predictability

What’s in a guarantee?
• Complicated algorithms
• Transaction managers
• Coordination services
• Distributed locking

What’s a delivery guarantee?
• Message handed to the transport layer?
• Enqueued in the recipient’s mailbox?
• Recipient started processing it?
• Recipient finished processing it?

Each of these has a very different set of
conditions, constraints, and costs.

Guaranteed, ordered,
exactly-once delivery
is expensive (if not impossible[3]).
[3] http://bravenewgeek.com/you-cannot-have-exactly-once-delivery/

Difficult to deploy & operate

At large scale, guarantees will give out.

0.1% failure at scale is huge.

Replayable > Guaranteed
Idempotent > Exactly-once
Commutative > Ordered
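
A minimal Go sketch of the idea, assuming every message carries a unique ID. The `Message` and `Ledger` types are hypothetical names for illustration, not part of any library. Applying a message twice changes nothing, so the sender is free to replay it, and because the operation is addition, arrival order does not matter either.

```go
package main

import "fmt"

// Message is a hypothetical event carrying a unique ID and an amount to add.
type Message struct {
	ID     string
	Amount int
}

// Ledger applies messages idempotently: each ID is counted at most once.
type Ledger struct {
	seen  map[string]bool
	total int
}

func NewLedger() *Ledger {
	return &Ledger{seen: make(map[string]bool)}
}

// Apply returns true if the message changed state and false if it was a
// duplicate that had already been applied.
func (l *Ledger) Apply(m Message) bool {
	if l.seen[m.ID] {
		return false
	}
	l.seen[m.ID] = true
	l.total += m.Amount // addition is commutative, so order is irrelevant
	return true
}

func (l *Ledger) Total() int { return l.total }

func main() {
	// Message "a" is delivered twice, as a replaying sender might do.
	msgs := []Message{{"a", 5}, {"b", 7}, {"a", 5}}
	l := NewLedger()
	for _, m := range msgs {
		l.Apply(m)
	}
	fmt.Println(l.Total())
}
```

The `seen` set grows without bound here; a real system would likely expire old IDs or derive idempotency from the data itself.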

Also, what does it even mean to
“process” a message?

It depends on the 
business context!

If you need business-level
guarantees, build them into 
the business layer.

We can always build 
stronger guarantees on top, 
but we can’t always remove 
them from below.
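
As a rough illustration of building upward, the Go sketch below layers at-least-once delivery on top of a best-effort send. The `Transport` interface, `SendAtLeastOnce`, and the `flaky` stand-in are all hypothetical; timeouts and backoff are left out to keep it short.

```go
package main

import (
	"errors"
	"fmt"
)

// Transport is a hypothetical best-effort transport. Send reports whether an
// acknowledgement came back; it never retries on its own.
type Transport interface {
	Send(payload string) (acked bool)
}

// ErrNoAck is returned when every attempt went unacknowledged.
var ErrNoAck = errors.New("no acknowledgement received")

// SendAtLeastOnce retries until the transport reports an ack or the attempt
// budget is used up. It returns the number of attempts made.
func SendAtLeastOnce(t Transport, payload string, maxAttempts int) (int, error) {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if t.Send(payload) {
			return attempt, nil
		}
	}
	return maxAttempts, ErrNoAck
}

// flaky drops the first `drops` sends, standing in for an unreliable network.
type flaky struct{ drops int }

func (f *flaky) Send(string) bool {
	if f.drops > 0 {
		f.drops--
		return false
	}
	return true
}

func main() {
	attempts, err := SendAtLeastOnce(&flaky{drops: 2}, "hello", 5)
	fmt.Println(attempts, err)
}
```

A lost acknowledgement looks the same as a lost message, so retries can produce duplicates; the receiver still needs to be idempotent.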

End-to-end system semantics matter
much more than the semantics of an 
individual building block[4].
[4] http://web.mit.edu/Saltzer/www/publications/endtoend/endtoend.pdf

“Simplicity is the ultimate sophistication.”

EMBRACING THE CHAOS MEANS 
LOOKING AT THE NEGATIVE SPACE.

A simple technology 
in a sea of complexity.

Simple doesn’t mean easy.[5]
[5] https://blog.wearewizards.io/some-a-priori-good-qualities-of-software-development

“Simple can be harder than complex.
You have to work hard to get your thinking
clean to make it simple. But it’s worth it in
the end because once you get there, you
can move mountains.”

About Workiva
• Wdesk: platform for enterprises to collect, manage, and report critical business data in real time
• Increasing amounts of data and complexity of formats
• Cloud solution:
- Data accuracy
- Secure
- Highly available
- Scalable
- Mobile-enabled

About Workiva
• First solution built on Google App Engine
• Scaling new solutions requires service-oriented approach
• Scaling new services requires a low-latency communication backplane

Availability 
over 
everything.

Availability over Everything
• Always on, always available
• Protects itself at all costs: no compromises on performance
• Disconnects slow consumers and lazy listeners
• Clients have automatic failover and reconnect logic
• Clients buffer messages while temporarily partitioned
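
The slow-consumer rule can be pictured with a toy in-process broker in Go. This is only an analogy built on bounded channels, not how the NATS server is implemented, and every name in it is hypothetical. Publishing never blocks: a subscriber whose buffer is full is cut off so that it cannot hold back everyone else.

```go
package main

import "fmt"

// Broker is a toy fan-out broker. It is not safe for concurrent use.
type Broker struct {
	subs map[string]chan string // subscriber name -> bounded pending buffer
}

func NewBroker() *Broker {
	return &Broker{subs: make(map[string]chan string)}
}

// Subscribe registers a consumer with a pending buffer of the given size.
func (b *Broker) Subscribe(name string, buffer int) <-chan string {
	ch := make(chan string, buffer)
	b.subs[name] = ch
	return ch
}

// Publish delivers msg to every subscriber without blocking. Subscribers that
// cannot keep up are disconnected (channel closed) and their names returned.
func (b *Broker) Publish(msg string) (dropped []string) {
	for name, ch := range b.subs {
		select {
		case ch <- msg:
			// Delivered into the subscriber's buffer.
		default:
			// Buffer full: protect the broker and drop the slow consumer.
			close(ch)
			delete(b.subs, name)
			dropped = append(dropped, name)
		}
	}
	return dropped
}

func main() {
	b := NewBroker()
	b.Subscribe("fast", 8)
	b.Subscribe("slow", 1) // never drains its buffer in this example
	for _, m := range []string{"m1", "m2", "m3"} {
		if dropped := b.Publish(m); len(dropped) > 0 {
			fmt.Println("disconnected:", dropped)
		}
	}
}
```

The choice here favors the health of the broker over any single consumer, which is the trade-off the slide describes.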

Simplicity as a Feature
• Single, lightweight binary
• Embraces the “negative space”:
- Simplicity -> high performance
- No complicated configuration or external dependencies (e.g. ZooKeeper)
- No fragile guarantees -> face complexity head-on, encourage async
• Simple pub/sub semantics provide a versatile primitive:
- Fan-in
- Fan-out
- Request/response
- Distributed queueing
• Simple text-based wire protocol
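
To give a feel for how small the text protocol is, here is a Go sketch that builds two client frames by hand. The frame layouts follow my reading of the NATS client protocol documentation (`PUB <subject> [reply-to] <#bytes>` and `SUB <subject> [queue group] <sid>`, each terminated by CRLF); the function names `EncodePub` and `EncodeSub` are my own and do not come from any client library.

```go
package main

import (
	"fmt"
	"strings"
)

// EncodePub builds a PUB frame:
//   PUB <subject> [reply-to] <#bytes>\r\n[payload]\r\n
func EncodePub(subject, reply string, payload []byte) string {
	var sb strings.Builder
	sb.WriteString("PUB " + subject)
	if reply != "" {
		sb.WriteString(" " + reply)
	}
	fmt.Fprintf(&sb, " %d\r\n", len(payload))
	sb.Write(payload)
	sb.WriteString("\r\n")
	return sb.String()
}

// EncodeSub builds a SUB frame:
//   SUB <subject> [queue group] <sid>\r\n
func EncodeSub(subject, queue, sid string) string {
	if queue != "" {
		return fmt.Sprintf("SUB %s %s %s\r\n", subject, queue, sid)
	}
	return fmt.Sprintf("SUB %s %s\r\n", subject, sid)
}

func main() {
	// %q makes the CRLF terminators visible.
	fmt.Printf("%q\n", EncodeSub("events.created", "workers", "1"))
	fmt.Printf("%q\n", EncodePub("events.created", "", []byte("hello")))
}
```

Because the frames are plain text, they can also be typed into a telnet session against a server, which makes the protocol easy to debug.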

[6] http://bravenewgeek.com/benchmarking-message-queue-latency/

Fast as Hell
• Fast, predictable performance at scale and at tail
• ~8 million messages per second
• Auto-pruning of interest graph allows efficient routing
• When SLAs matter, it’s hard to beat NATS

How We Use NATS
• Low-latency service bus
• Pub/Sub
• RPC

[Architecture diagrams: Web Clients, Service Gateway, NATS, Services]

“Just send this thing containing these fields
serialized in this way using that encoding to
this topic!”

“Just subscribe to this topic and decode
using that encoding then deserialize in 
this way and extract these fields from
this thing!”

Pub/Sub is meant to decouple services
but often ends up coupling the teams
developing them.

How do we evolve services in isolation
and reduce development overhead?

Frugal RPC
• Extension of Apache Thrift
• IDL and cross-language, code-generated pub/sub APIs
• Allows developers to think in terms of services and APIs rather than opaque messages and topics
• Allows APIs to evolve while maintaining compatibility
• Transports are pluggable (we use NATS)

    struct Event {
        1: i64 id,
        2: string message,
        3: i64 timestamp,
    }

    scope Events prefix {user} {
        EventCreated: Event
        EventUpdated: Event
        EventDeleted: Event
    }

generated:

    subscriber.SubscribeEventCreated(
        "user-1", func(e *event.Event) {
            fmt.Println(e)
        },
    )
    . . .
    publisher.PublishEventCreated(
        "user-1", event.NewEvent())

RPC over NATS
• Service instances form a queue group
• Client “connects” to instance by publishing a message to the service queue group
• Serving instance sets up an inbox for the client and sends it back in the response
• Client sends requests to the inbox
• Connecting is cheap: no service discovery and no sockets to create, just a request/response
• Heartbeats used to check health of server and client
• Very early prototype code: https://github.com/workiva/thrift-nats
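
The handshake can be modeled in-process with Go channels. This is a simplified sketch, not the thrift-nats code: a shared channel stands in for the service queue group (only one instance receives each connect message), a per-client channel stands in for the inbox, and heartbeats are omitted. All type and function names are hypothetical.

```go
package main

import "fmt"

// request is one RPC call sent to an instance's per-client inbox.
type request struct {
	body  string
	reply chan string
}

// connectReq is the "connect" message published to the queue group. The
// instance that picks it up answers with a freshly created inbox.
type connectReq struct {
	reply chan chan<- request
}

// serve runs one service instance. Several instances reading from the same
// channel behave like a queue group: each connect goes to exactly one.
func serve(name string, group <-chan connectReq) {
	for c := range group {
		inbox := make(chan request)
		c.reply <- inbox // send the inbox back in the response
		go func() {
			for r := range inbox {
				r.reply <- name + ": " + r.body
			}
		}()
	}
}

// connect performs the handshake and returns the inbox to send requests to.
func connect(group chan<- connectReq) chan<- request {
	c := connectReq{reply: make(chan chan<- request, 1)}
	group <- c
	return <-c.reply
}

// call sends one request to the inbox and waits for the response.
func call(inbox chan<- request, body string) string {
	r := request{body: body, reply: make(chan string, 1)}
	inbox <- r
	return <-r.reply
}

func main() {
	group := make(chan connectReq)
	go serve("instance-1", group)
	go serve("instance-2", group)

	inbox := connect(group) // whichever instance wins owns this client
	fmt.Println(call(inbox, "ping"))
	fmt.Println(call(inbox, "ping again"))
}
```

Which instance answers is not deterministic, but once connected, every request on that inbox reaches the same instance.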

NATS per VM
• Store JSON containing cluster membership in S3
• Container reads JSON on startup and creates routes with correct credentials
• Services only talk to the NATS daemon on their VM via localhost
• Don’t have to worry about encryption between services and NATS, only between NATS peers
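
A sketch of the startup step in Go, under stated assumptions: the JSON layout (`user`, `password`, `members`) is invented for illustration since the talk does not show it, the S3 download is left out and the document is passed in as bytes, and the `nats-route://` URL scheme is what I understand the NATS server to accept for cluster routes.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
)

// Member and Membership describe a hypothetical membership document.
type Member struct {
	Host string `json:"host"`
	Port int    `json:"port"`
}

type Membership struct {
	User     string   `json:"user"`
	Password string   `json:"password"`
	Members  []Member `json:"members"`
}

// RouteURLs parses the membership document and returns one route URL per
// peer, skipping the entry for this host.
func RouteURLs(doc []byte, self string) ([]string, error) {
	var m Membership
	if err := json.Unmarshal(doc, &m); err != nil {
		return nil, err
	}
	var routes []string
	for _, peer := range m.Members {
		if peer.Host == self {
			continue
		}
		u := url.URL{
			Scheme: "nats-route",
			User:   url.UserPassword(m.User, m.Password),
			Host:   fmt.Sprintf("%s:%d", peer.Host, peer.Port),
		}
		routes = append(routes, u.String())
	}
	return routes, nil
}

func main() {
	doc := []byte(`{
		"user": "ruser", "password": "secret",
		"members": [
			{"host": "10.0.0.1", "port": 6222},
			{"host": "10.0.0.2", "port": 6222}
		]
	}`)
	routes, err := RouteURLs(doc, "10.0.0.1")
	if err != nil {
		panic(err)
	}
	fmt.Println(routes)
}
```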

NATS per VM
• Only messages intended for a process on another host go over the network since NATS cluster maintains interest graph
• Greatly reduces network hops (usually 0 vs. 2-3)
• If local NATS daemon goes down, restart it automatically
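
The routing decision can be pictured with a toy interest table in Go. This is a conceptual model only, with hypothetical names throughout; it is not how the NATS server stores or propagates interest. A message is forwarded to a peer only if that peer has registered interest in the subject, so purely local traffic never leaves the host.

```go
package main

import (
	"fmt"
	"sort"
)

// Node models one daemon's view: its own subscribers plus which peers have
// subscribers for each subject.
type Node struct {
	local  map[string]int             // subject -> local subscriber count
	remote map[string]map[string]bool // subject -> peers with interest
}

func NewNode() *Node {
	return &Node{local: map[string]int{}, remote: map[string]map[string]bool{}}
}

func (n *Node) SubscribeLocal(subject string) { n.local[subject]++ }

// AddInterest records that a peer has at least one subscriber on subject.
func (n *Node) AddInterest(subject, peer string) {
	if n.remote[subject] == nil {
		n.remote[subject] = map[string]bool{}
	}
	n.remote[subject][peer] = true
}

// PruneInterest removes a peer once its last subscriber has gone away.
func (n *Node) PruneInterest(subject, peer string) {
	delete(n.remote[subject], peer)
	if len(n.remote[subject]) == 0 {
		delete(n.remote, subject)
	}
}

// Route reports whether to deliver locally and which peers need a copy.
func (n *Node) Route(subject string) (deliverLocal bool, forwardTo []string) {
	for peer := range n.remote[subject] {
		forwardTo = append(forwardTo, peer)
	}
	sort.Strings(forwardTo) // stable order for readability
	return n.local[subject] > 0, forwardTo
}

func main() {
	n := NewNode()
	n.SubscribeLocal("orders")
	n.AddInterest("billing", "host-2")

	fmt.Println(n.Route("orders"))  // local only, no network hop
	fmt.Println(n.Route("billing")) // forwarded to the interested peer
}
```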

NATS per VM
• Doesn’t scale to large number of VMs
• Fairly easy to transition to floating NATS cluster or running on a subset of machines per AZ
• NATS communication abstracted from service
• Send messages to services without thinking about routing or service discovery
• Queue groups provide service load balancing
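
Queue-group semantics can be shown with a small in-memory broker in Go. Everything here is a hypothetical model rather than the NATS client API: plain subscribers each get a copy of every message, while each queue group hands a message to exactly one member. The sketch picks members round-robin to keep the output predictable, which is an assumption of the example and not a claim about how NATS selects a member.

```go
package main

import "fmt"

// Handler processes one message.
type Handler func(msg string)

// group holds the members of one queue group and a round-robin cursor.
type group struct {
	members []Handler
	next    int
}

// Broker is a toy, single-goroutine pub/sub broker.
type Broker struct {
	subs   map[string][]Handler         // subject -> plain subscribers
	groups map[string]map[string]*group // subject -> queue name -> group
}

func NewBroker() *Broker {
	return &Broker{
		subs:   map[string][]Handler{},
		groups: map[string]map[string]*group{},
	}
}

// Subscribe registers a plain subscriber that sees every message.
func (b *Broker) Subscribe(subject string, h Handler) {
	b.subs[subject] = append(b.subs[subject], h)
}

// QueueSubscribe adds h to the named queue group on subject.
func (b *Broker) QueueSubscribe(subject, queue string, h Handler) {
	if b.groups[subject] == nil {
		b.groups[subject] = map[string]*group{}
	}
	g := b.groups[subject][queue]
	if g == nil {
		g = &group{}
		b.groups[subject][queue] = g
	}
	g.members = append(g.members, h)
}

// Publish fans out to plain subscribers and to one member per queue group.
func (b *Broker) Publish(subject, msg string) {
	for _, h := range b.subs[subject] {
		h(msg)
	}
	for _, g := range b.groups[subject] {
		g.members[g.next%len(g.members)](msg)
		g.next++
	}
}

func main() {
	b := NewBroker()
	b.Subscribe("jobs", func(m string) { fmt.Println("audit saw", m) })
	b.QueueSubscribe("jobs", "workers", func(m string) { fmt.Println("worker 1 got", m) })
	b.QueueSubscribe("jobs", "workers", func(m string) { fmt.Println("worker 2 got", m) })
	for i := 1; i <= 4; i++ {
		b.Publish("jobs", fmt.Sprintf("job-%d", i))
	}
}
```

Adding another service instance is then just another queue subscription; publishers do not change.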

NATS as a Messaging Backplane
• We’re a SaaS company, not an infrastructure company
• High availability
• Operational simplicity
• Performance
• First-party clients: Go, Java, C, C#, Python, Ruby, Elixir, Node.js

“Every solution to every problem is simple…
It’s the distance between the two where the mystery lies.”
–Derek Landy, Skulduggery Pleasant

Thanks!
@tyler_treat
github.com/tylertreat
bravenewgeek.com
