Designing a system that can extract immediate insights from large amounts of data in real time requires a special way of thinking. This talk presents a “reactive” approach to designing real-time, responsive, and scalable data applications that continuously compute analytics on the fly. It also walks through a case study as an example of reactive design in action.
2. Disclaimer
September 13, 2018
This document is being distributed for informational and educational purposes only and is not an offer to sell or the solicitation of an offer to buy
any securities or other instruments. The information contained herein is not intended to provide, and should not be relied upon for, investment
advice. The views expressed herein are not necessarily the views of Two Sigma Investments, LP or any of its affiliates (collectively, “Two
Sigma”). Such views reflect the assumptions of the author(s) of the document and are subject to change without notice. The document may
employ data derived from third-party sources. No representation is made by Two Sigma as to the accuracy of such information and the use of
such information in no way implies an endorsement of the source of such information or its validity.
The copyrights and/or trademarks in some of the images, logos or other material used herein may be owned by entities other than Two Sigma. If
so, such copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for identification and
comment as fair use under international copyright and/or trademark laws. Use of such image, copyright or trademark does not imply any
association with such organization (or endorsement of such organization) by Two Sigma, nor vice versa.
3. About Me
Engineer at Two Sigma
Lead a team that builds analytics engines and data dashboard
platforms that provide real-time monitoring
5. Agenda
What is streaming analytics?
Reactive principles: Framework for building real-time analytics
Case Study: Real-time data analytics engine
6. Data at Rest vs. Data in Motion
Data at Rest: Analytics done after the data-creating events have occurred
Data in Motion: Analytics happens in real time as events take place
7. Batch Oriented vs. Stream Oriented
Batch Oriented: Data captured in data warehouses & processed some time later in a scheduled batch job
Stream Oriented: Continuous computation & extract information as soon as data arrives
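The contrast between the two styles can be sketched in a few lines of Python. This is an illustration only and is not taken from the talk; the names `batch_mean`, `StreamingMean`, and `run_stream` are hypothetical. The batch function needs the whole data set before it can answer, while the streaming accumulator updates its answer as each value arrives and should end up at the same result.

```python
from typing import Iterable, List


def batch_mean(values: List[float]) -> float:
    """Batch style: all data is stored first, then processed in one job."""
    return sum(values) / len(values)


class StreamingMean:
    """Stream style: the result is updated as soon as each value arrives."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0

    def on_event(self, value: float) -> float:
        # Incremental update: past values do not need to be kept around.
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean


def run_stream(values: Iterable[float]) -> List[float]:
    """Feed values one at a time and record the running mean after each."""
    acc = StreamingMean()
    return [acc.on_event(v) for v in values]
```

Calling `run_stream` on a list yields one intermediate result per event, and the last of them should agree with `batch_mean` on the same list (up to floating-point rounding).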
8. Real-time analytics is valuable to use cases in many fields…
Monitor financial markets and trading systems
Detect fraudulent credit card activity as it happens
Identify anomalies in telemetry collected from home automation systems
9. Key Considerations
Fast
Respond instantly (or near instantly) to new information
Scalable
Able to handle varying incoming workloads
Resilient
Able to handle various failure conditions gracefully
Responsive
Respond to users in a timely fashion
10. Agenda
What is streaming analytics?
Reactive principles: Framework for building real-time analytics
Case Study: Real-time data analytics engine
16. Key Considerations
Fast (React To Events)
Respond instantly (or near instantly) to new information
Scalable (React To Load)
Able to handle varying incoming workloads
Resilient (React To Failures)
Able to handle various failure conditions gracefully
Responsive (React To Users)
Respond to users in a timely fashion
17. Actor Model
A model of concurrent computation
Provides an abstraction for supporting reactive principles
19. How do Actors communicate?
A Real-life analogy
Send to a friend …
20. How do Actors communicate?
A Real-life analogy
The communication is asynchronous
21. Use messages to communicate
[Diagram: Actor A sends message M to Actor B]
Decouples the sending and receiving of messages
Actor B may or may not have to respond to Actor A
Non-blocking response
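A minimal sketch of this messaging style, written with Python's standard `asyncio` rather than Akka (the toolkit the case study uses). The `Actor` class, its `tell` method, and the `"ping"`/`"pong"`/`"stop"` messages are hypothetical names chosen for illustration. Each actor owns a mailbox, `tell` returns immediately without waiting for the receiver, and a reply is optional.

```python
import asyncio
from typing import List, Optional


class Actor:
    """A tiny actor: private state, a mailbox, one message handled at a time."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.mailbox: asyncio.Queue = asyncio.Queue()
        self.received: List[str] = []

    def tell(self, message: str, sender: Optional["Actor"] = None) -> None:
        # Fire-and-forget: enqueue the message and return immediately.
        self.mailbox.put_nowait((message, sender))

    async def run(self) -> None:
        while True:
            message, sender = await self.mailbox.get()
            if message == "stop":
                break
            self.received.append(message)
            # Replying is optional; this sketch only replies to "ping".
            if sender is not None and message == "ping":
                sender.tell("pong")


async def demo() -> List[str]:
    a, b = Actor("A"), Actor("B")
    task_a = asyncio.create_task(a.run())
    task_b = asyncio.create_task(b.run())
    b.tell("ping", sender=a)  # A does not block while B processes this
    b.tell("stop")
    await task_b              # B handles "ping", replies, then stops
    a.tell("stop")
    await task_a
    return a.received
```

Running `asyncio.run(demo())` returns the messages Actor A received. The sender and receiver never call each other directly; the mailbox is the only coupling between them.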
22. Reactive Key Traits
Data-flow Focused: Data flows respond automatically to propagating changes
Event-based: Availability of new information drives the logic forward
Non-blocking: Emphasizes asynchronous techniques & non-blocking execution
23. Agenda
What is streaming analytics?
Reactive principles: Framework for building real-time analytics
Case Study: Real-time data analytics engine
28. Design Considerations
Real-time & Throughput Guarantees: Minimize latency between new information and output of results, even under high loads
Correctness Guarantees: Streaming analysis must be accurate and consistent with results as if processed in batch
Complex Transformations: Business-specific analytics functions & handle different data formats
Handle out-of-order or late data: Keep track of late-arriving data and manage the ordering correctly
Reliability: Resilient to failures, including problems of upstream data source
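The talk does not say how the engine handles out-of-order or late data. One common technique, shown here purely as an assumption, is a reorder buffer with a watermark: events are held briefly, released in event-time order once the watermark passes them, and anything that arrives behind the watermark is set aside as too late. The names `ReorderBuffer`, `allowed_lateness`, and `replay` are hypothetical.

```python
import heapq
from typing import Iterable, List, Tuple

Event = Tuple[int, str]  # (event_time, payload)


class ReorderBuffer:
    """Holds events briefly so they can be released in event-time order."""

    def __init__(self, allowed_lateness: int) -> None:
        self.allowed_lateness = allowed_lateness
        self.watermark = float("-inf")  # nothing older than this is accepted
        self.max_seen = float("-inf")
        self.heap: List[Event] = []
        self.too_late: List[Event] = []

    def on_event(self, event: Event) -> List[Event]:
        event_time, _ = event
        if event_time < self.watermark:
            # Arrived after its slot was already released: keep it aside.
            self.too_late.append(event)
            return []
        heapq.heappush(self.heap, event)
        self.max_seen = max(self.max_seen, event_time)
        self.watermark = max(self.watermark,
                             self.max_seen - self.allowed_lateness)
        ready: List[Event] = []
        while self.heap and self.heap[0][0] <= self.watermark:
            ready.append(heapq.heappop(self.heap))
        return ready

    def flush(self) -> List[Event]:
        """Release whatever is still buffered, in event-time order."""
        return [heapq.heappop(self.heap) for _ in range(len(self.heap))]


def replay(events: Iterable[Event],
           allowed_lateness: int) -> Tuple[List[Event], List[Event]]:
    """Run events through the buffer; return (ordered output, too-late events)."""
    buf = ReorderBuffer(allowed_lateness)
    ordered: List[Event] = []
    for event in events:
        ordered.extend(buf.on_event(event))
    ordered.extend(buf.flush())
    return ordered, buf.too_late
```

The trade-off in this sketch is the `allowed_lateness` setting: a larger value tolerates more disorder but delays results, which pulls against the latency goal listed above.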
29. Implementation
• Uses Akka, a toolkit that supports building actor systems on the JVM
• Clean separation between “plumbing and wiring” and data transformation logic
• Allows us to focus more on the functionality and analytics & less on the low-level wiring of asynchronous programming
30. Example Data Flow
Input: Real-time Data
Sources: Trade Data Publisher Actor A, Market Data Publisher Actor A, Trade Data Publisher Actor B
Transformations & Analysis: Join Function Actor, Aggregation Function Actor, Bespoke Analysis Actor, Filter Actor
Sinks: In-Memory Cache Actor, MMapped Cache Actor, DB Writer Actor
31. Sources
Data can come from many sources
Could be unbounded flows of data
32. Sources, Transformations & Analysis
New information flows through the system as messages between actors
Continuously calculates statistics and metrics on the fly from live streams of data
33. Transformations & Analysis
Analysis decomposed into multiple discrete steps, each represented by an actor
Composable Workflows: Chain together a composition of functions to form a data analysis pipeline
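The idea of chaining discrete steps into a pipeline can be sketched as follows. In the talk each step is an actor; this illustration uses plain Python callables instead, and the stage names (`filter_stage`, `notional_stage`, `RunningTotal`), the `compose` helper, and the `qty`/`price` fields are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Optional

Event = Dict[str, float]
Stage = Callable[[Event], Optional[Event]]  # returning None drops the event


def filter_stage(min_qty: float) -> Stage:
    """Filter step: pass only events whose quantity is large enough."""
    def stage(event: Event) -> Optional[Event]:
        return event if event["qty"] >= min_qty else None
    return stage


def notional_stage() -> Stage:
    """Transformation step: add a derived field to each event."""
    def stage(event: Event) -> Optional[Event]:
        return {**event, "notional": event["qty"] * event["price"]}
    return stage


class RunningTotal:
    """Aggregation step: keeps state across events, like an actor would."""

    def __init__(self, field: str) -> None:
        self.field = field
        self.total = 0.0

    def __call__(self, event: Event) -> Optional[Event]:
        self.total += event[self.field]
        return {**event, "running_" + self.field: self.total}


def compose(*stages: Stage) -> Stage:
    """Chain stages left to right; stop early if a stage drops the event."""
    def pipeline(event: Event) -> Optional[Event]:
        current: Optional[Event] = event
        for stage in stages:
            current = stage(current)
            if current is None:
                return None
        return current
    return pipeline


def run_pipeline(pipeline: Stage, events: Iterable[Event]) -> List[Event]:
    """Push events through one at a time and collect the survivors."""
    results = (pipeline(e) for e in events)
    return [r for r in results if r is not None]
```

For example, `compose(filter_stage(50), notional_stage(), RunningTotal("notional"))` builds a three-step pipeline. Because each step only sees an event and returns an event, steps can be reordered or swapped without touching the others.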
34. Transformations & Analysis
A vocabulary of reusable functional transformations offers solutions to most analytics problems
Allows custom logic, encapsulated in an actor construct, to solve problems that are more business-specific
35. Sources, Transformations & Analysis, Sinks
Sinks: In-Memory Cache Actor, MMapped Cache Actor, DB Writer Actor, …
The results can have many destinations
Dashboard & Visualization
Data Storage
36. Analytics Engine Capabilities and Performance
Hardware and configurations: One VM with 15 vCPUs, 96 GB Memory, Linux Debian Wheezy OS
Metric: Sizes and units
Typical load: 4k-20k events per second
Peak capability: 150k events per second
Number of Actors: 7,000+
Typical time between data arrival and processing: Milliseconds under typical load; seconds under high load
In case you haven't heard of us, Two Sigma is a New York City-based tech company set on redefining the investment management domain by harnessing the power of technology, data, and math to systematically derive insights from data.
It was founded by a statistician and a computer scientist in 2001 with the goal of applying leading-edge technology to the data-rich world of finance. Although we do a lot of things, at our core we are a company that harnesses data (Two Sigma prides itself on having a huge array of data sources)
“…and the power of cutting-edge technology…”, turns it into models of how the world works…
“…to make more rational decisions in the field of investments.”
The ability to make sense of large amounts of data from disparate sources in real time is valuable to us.
At Two Sigma, we have many critical use cases that require continuous real-time computation of statistics and metrics from high volumes of streaming data from disparate sources.