Reactive Systems that focus on High Availability with Lerna
- 1. © 2021 TIS Inc.
Reactive Summit 2021
Reactive Systems that focus on High Availability with Lerna
2021.11.2
Yugo Maede
@yugolf
- 2. © 2021 TIS Inc. 2
About me
Yugo Maede ( Twitter: @yugolf )
TIS Inc. Technology & Innovation SBU
Technology & Engineering Center (a technology-focused organization)
• current mission
Product owner of Lerna. I develop Lerna and support projects that adopt it.
Lerna makes it possible to build highly available, high-throughput systems quickly and inexpensively.
• translated book
Akka in Action Japanese version
• web media contributions
ThinkIT: Learn about reactive systems, the paradigm of the many-core era
• speaking at events
Scala Matsuri, JJUG CCC, etc.
- 3. © 2021 TIS Inc. 3
Motivation
Non-functional requirements for mission-critical systems: high availability & high throughput
• high-availability servers
– costly
– high throughput is even more costly
• distributed systems
– complex and difficult
– time is spent on non-business logic
We want a mechanism that achieves high availability and high throughput quickly and at low cost!
- 4. © 2021 TIS Inc. 4
Solution
• Message Driven, Actor Model
• Stateful Application
• Distributed System, Cluster
• Event Sourcing, Distributed DB
• CQRS
difficult
adapt repeatedly, continue to operate the system,
refine the architecture, and nurture engineers
OSS
『Lerna』
building Reactive Systems with Akka
complexity
- 5. © 2021 TIS Inc. 5
Building high availability systems
Software stacks that focus on high availability and on building reactive systems.
We provide a single package that is ready to use:
– libraries (with Akka Typed support)
– developer guides
– reference code
– learning content
and everything else you need to build a highly available system.
- 6. © 2021 TIS Inc. 6
Overview of Lerna
Terraform scripts are executed on VMs to create environments for highly available systems
https://fintan.jp/?p=5948&lang=en
- 7. © 2021 TIS Inc. 7
Availability levels enabled by Lerna
• availability
– logical availability calculated from MTBF and MTTR
(Note: this is not the availability of the service itself.)
• the numeric value of "Design for Failure"
– How many seconds does it take for your application to recover from a failure?
– failures always occur, so minimize the time to repair
• focus on minimizing the MTTR (Mean Time To Repair)
https://en.wikipedia.org/wiki/Availability
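The relation between availability, MTBF, and MTTR referred to above can be sketched in a few lines of Python. This is only an illustration of the commonly used form availability = MTBF / (MTBF + MTTR); the function name and the sample numbers are assumptions made for the example, not values from the talk.

```python
def availability(mtbf_sec: float, mttr_sec: float) -> float:
    """Availability as the fraction of time the system is up.

    mtbf_sec: mean time between failures, in seconds
    mttr_sec: mean time to repair, in seconds
    """
    if mtbf_sec <= 0 or mttr_sec < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_sec / (mtbf_sec + mttr_sec)


# Hypothetical comparison: one failure per year,
# repaired in 10 seconds versus in 1 hour.
YEAR_SEC = 365 * 24 * 60 * 60
fast_repair = availability(YEAR_SEC, 10)
slow_repair = availability(YEAR_SEC, 3600)
print(f"10 s repair: {fast_repair:.6%}")
print(f"1 h repair:  {slow_repair:.6%}")
```

With the failure rate held fixed, shrinking the repair time is the only lever left, which is why the slide concentrates on MTTR.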
- 8. © 2021 TIS Inc. 8
Measurement conditions
A failure is simulated and MTTR is measured under the following conditions:
– a payment service built on AWS
– a CQRS + Event Sourcing architecture
• command-side APIs persist events to Cassandra in real time
• events are propagated asynchronously to the query side (MariaDB)
– the measurement target is the command-side API
– requests are sent from Gatling at 150 TPS
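One way to turn such a load test into a recovery time is to look for the window between the first failed request and the next successful one. The talk does not say how the request logs were actually processed, so the following Python sketch is an assumption: the `(timestamp, succeeded)` record format and the function name are hypothetical.

```python
from typing import Iterable, Optional, Tuple


def longest_outage_sec(results: Iterable[Tuple[float, bool]]) -> float:
    """Return the longest outage window in seconds.

    results: (timestamp_sec, succeeded) pairs sorted by timestamp.
    An outage starts at the first failed request and ends at the next
    successful one. An outage that never recovers is not counted.
    """
    longest = 0.0
    outage_start: Optional[float] = None
    for timestamp, succeeded in results:
        if not succeeded and outage_start is None:
            outage_start = timestamp  # first failure opens the window
        elif succeeded and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None  # service is back
    return longest


# Hypothetical request log: failures between t=1.0 and t=3.5.
sample = [(0.0, True), (1.0, False), (2.0, False), (3.5, True), (4.0, True)]
print(longest_outage_sec(sample))
```

At a steady 150 TPS the gap between consecutive requests is under 7 ms, so the resolution of such a measurement is fine enough for recovery times measured in seconds.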
- 9. © 2021 TIS Inc. 9
Lerna's definition of MTTR
• "MTTR in single failure": for each point of failure, the measured time during which at least one service user cannot use the service
• assuming one failure per server per year, the MTTR of each layer is weighted by the number of servers and the failure impact range (percentage of the service that is unavailable):
MTTR = MTTR in single failure x number of servers x number of failures per year x failure impact range
• downtime per year = total MTTR for all failure points over one year
Failed layer | MTTR in single failure | Number of servers | Number of failures per year | Failure impact range
Load Balancer (Keepalived) | 2.78 sec | 1 | 1 | 1
Load Balancer (HAProxy) | 3.32 sec | 3 | 1 | 1/3
Application (Akka Cluster) | 5.92 sec | 9 | 1 | 1/9
Command Side DB (Cassandra) | 0.00 sec | 6 | 1 | 1/6
Query Side DB (MariaDB) | 1.14 sec | 6 | 1 | 1/6
DC Failure (network partition) | 8.02 sec | 1 | 1 | 1
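The MTTR definition on this slide can be worked through as a short calculation. The sketch below uses the per-layer figures from the table; the variable and function names are hypothetical, and `Fraction` is used only so that weights such as 3 x 1/3 come out as exactly 1.

```python
from fractions import Fraction

# (failed layer, MTTR in single failure [sec], number of servers,
#  number of failures per year, failure impact range)
LAYERS = [
    ("Load Balancer (Keepalived)", 2.78, 1, 1, Fraction(1)),
    ("Load Balancer (HAProxy)", 3.32, 3, 1, Fraction(1, 3)),
    ("Application (Akka Cluster)", 5.92, 9, 1, Fraction(1, 9)),
    ("Command Side DB (Cassandra)", 0.00, 6, 1, Fraction(1, 6)),
    ("Query Side DB (MariaDB)", 1.14, 6, 1, Fraction(1, 6)),
    ("DC Failure (network partition)", 8.02, 1, 1, Fraction(1)),
]


def layer_mttr(single_sec, servers, failures_per_year, impact_range):
    """MTTR = single-failure MTTR x servers x failures per year x impact range."""
    weight = servers * failures_per_year * impact_range  # exact Fraction
    return single_sec * float(weight)


per_layer = {name: layer_mttr(s, n, f, r) for name, s, n, f, r in LAYERS}
downtime_per_year = sum(per_layer.values())

for name, mttr in per_layer.items():
    print(f"{name}: {mttr:.2f} sec")
print(f"downtime per year: {downtime_per_year:.2f} sec")
```

In every row the number of servers and the impact range cancel out, so each layer contributes exactly its single-failure MTTR to the yearly total.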
- 10. © 2021 TIS Inc. 10
Measurement result
• to minimize impact, it is more important to isolate points of failure instantaneously than to recover them
• all layers are scalable and can be healed to their original state
• the application layer is implemented with Lerna's own library, akka-entity-replication, which uses Raft
Failed layer | MTTR in single failure | Number of servers | Number of failures per year | Failure impact range | MTTR
Load Balancer (Keepalived) | 2.78 sec | 1 | 1 | 1 | 2.78 sec
Load Balancer (HAProxy) | 3.32 sec | 3 | 1 | 1/3 | 3.32 sec
Application (Akka Cluster) | 5.92 sec | 9 | 1 | 1/9 | 5.92 sec
Command Side DB (Cassandra) | 0.00 sec | 6 | 1 | 1/6 | 0.00 sec
Query Side DB (MariaDB) | 1.14 sec | 6 | 1 | 1/6 | 1.14 sec
DC Failure (network partition) | 8.02 sec | 1 | 1 | 1 | 8.02 sec
downtime per year = total MTTR for all faults
All layers recovered within 10 seconds.
- 11. © 2021 TIS Inc. 11
akka-entity-replication
https://github.com/lerna-stack/akka-entity-replication#akka-entity-replication
Requests recover (turn green) immediately even when a failure occurs (a node is killed)
- 12. © 2021 TIS Inc. 12
Availability: 99.9999%
Failed layer | MTTR in single failure | Number of servers | Number of failures per year | Failure impact range | MTTR
Load Balancer (Keepalived) | 2.78 sec | 1 | 1 | 1 | 2.78 sec
Load Balancer (HAProxy) | 3.32 sec | 3 | 1 | 1/3 | 3.32 sec
Application (Akka Cluster) | 5.92 sec | 9 | 1 | 1/9 | 5.92 sec
Command Side DB (Cassandra) | 0.00 sec | 6 | 1 | 1/6 | 0.00 sec
Query Side DB (MariaDB) | 1.14 sec | 6 | 1 | 1/6 | 1.14 sec
DC Failure (network partition) | 8.02 sec | 1 | 1 | 1 | 8.02 sec
Downtime per year: 21.18 sec
https://www.eventhelix.com/fault-handling/reliability-availability-basics/
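The step from 21.18 seconds of downtime per year to the 99.9999% figure can be checked with a short calculation. This is a minimal sketch that assumes a 365-day year; the constant and function names are hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 (leap years ignored)


def availability_from_downtime(downtime_sec_per_year: float) -> float:
    """Fraction of the year during which the service is available."""
    return 1.0 - downtime_sec_per_year / SECONDS_PER_YEAR


result = availability_from_downtime(21.18)
print(f"availability: {result:.6%}")

# "Six nines" (99.9999%) allows at most about 31.5 seconds of downtime per year.
six_nines_budget_sec = SECONDS_PER_YEAR * (1 - 0.999999)
print(f"six-nines budget: {six_nines_budget_sec:.1f} sec")
```

Since 21.18 seconds is below the roughly 31.5-second budget, the measured total is consistent with six nines of logical availability.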
- 13. © 2021 TIS Inc. 13
Lerna is Elastic
Lerna's architecture is elastic, so 1,000 TPS can be achieved by adding nodes
(because the architecture is elastic, this is not an upper bound)
- 14. © 2021 TIS Inc. 14
Lerna is Responsive
Lerna's architecture is responsive: even under high load (1,000 TPS) it responds
within 100 ms (tested with payment transactions persisting events to Cassandra)
- 15. © 2021 TIS Inc. 15
More information
• availability and performance of Lerna
https://fintan.jp/?p=7256
• getting started with Lerna
https://fintan.jp/?p=5946
• our technical site
https://fintan.jp/?lang=en
- 16. © 2021 TIS Inc. 16
Summary
• Lerna High Availability Software Stack
– achieves the non-functional requirements of mission-critical systems
– not only libraries but also the other items needed for system development are available as OSS
– lowers the barriers to building complex and difficult distributed systems
• the numeric value of "Design for Failure"
– logical availability calculated from measured MTTR: 99.9999%
– all layers, including the application layer, recover from failure in less than 10 seconds
- 17. THANK YOU
If you have any questions, please mention or DM me on Twitter.
Twitter ID : @yugolf