Break Me If You Can
Practical Guide to Building Fault-tolerant Systems
Devoxx Belgium, November 15, 2018
Alex Borysov, Software Engineer @ Google
Mykyta Protsenko, Software Engineer @ Netflix
Who are we?
Alex Borysov
Software Engineer @Google
Mykyta Protsenko
Software Engineer @Netflix
@aiborisov
@mykyta_p
Fault-Tolerance?
Fault vs Error vs Failure
Fault
incorrect internal state
Picture by Bob McMillan. Public domain. See slide #180 for details.
Error
visibly incorrect behaviour
Picture by David Goehring. CC BY 2.0. See slide #180 for details.
Failure
main functionality is broken
Picture by Camerafiend. CC BY-SA 3.0. See slide #180 for details.
RMS Titanic vs Miracle on the Hudson
Willy Stöwer. Public domain. See slide #180 for details.
By Greg Lam Pak Ng. CC BY 2.0. See slide #181 for details.
RMS Titanic
Fault: Hitting an iceberg
Error: Water in the hull
Failure: Sinking
Miracle on the Hudson
Fault: Hitting geese at 859 m
Error: Engines shut down
No Failure!
Fault → Error → Failure
Fault Tolerance
Code and Design Patterns
Product-Driven Decisions
Communication
Dodging Geese
Dodging Geese Architecture
TOP-5
Geese Service
Clouds Service
Leaderboard Service
API Gateway
Leaderboard API (REST)
/players/<username>/score
{"name": "Jane", "score": 100}
/leaderboard/top/<n>
[{"name": "Jane", "score": 100}, {"name": "John", "score": 50}, ...]
gRPC Service Definitions
service GeeseService {
  // Return next line of geese.
  rpc GetGeese (GetGeeseRequest) returns (GeeseResponse);
}
service CloudsService {
  // Return next line of clouds.
  rpc GetClouds (GetCloudsRequest) returns (CloudsResponse);
}
gRPC Gateway Service
service FixtureService {
  // Return next line of geese and clouds.
  rpc GetFixture (GetFixtureRequest) returns (FixtureResponse);
}
Gateway Fixture Service
public class FixtureService extends FixtureServiceImplBase {
API Gateway
Geese Service
Clouds Service
Fixture Latency = Geese Latency + Clouds Latency
Non-Blocking Calls
Don’t block
Send requests in parallel
Combine results when ready
Gateway Service Implementation
public class FixtureService extends FixtureServiceImplBase {
  private final GeeseServiceFutureStub geeseClient = ...;
  private final CloudsServiceFutureStub cloudsClient = ...;
  @Override
  public void getFixture(GetFixtureRequest request, StreamObserver<FixtureResponse> response) {
    ListenableFuture<GeeseResponse> geese = geeseClient.getGeese(toGeese(request));
    ListenableFuture<CloudsResponse> clouds = cloudsClient.getClouds(toClouds(request));
    ListenableFuture<List<GeneratedMessageV3>> geeseAndClouds = Futures.allAsList(geese, clouds);
    ...
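The non-blocking idea on these slides (start both calls, then combine the results when both are ready) can be sketched with the JDK's CompletableFuture instead of the gRPC future stubs and Guava futures the deck uses. This is only an illustration: fetchGeese and fetchClouds are hypothetical stand-ins for the two service calls, and the string results are placeholders.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical stand-ins for the Geese and Clouds services; names are illustrative only.
final class ParallelFixtureSketch {

    static CompletableFuture<String> fetchGeese() {
        return CompletableFuture.supplyAsync(() -> "geese");
    }

    static CompletableFuture<String> fetchClouds() {
        return CompletableFuture.supplyAsync(() -> "clouds");
    }

    // Both calls are started before either result is awaited, so the combined
    // latency is roughly max(geese, clouds) rather than geese + clouds.
    static CompletableFuture<String> fetchFixture() {
        CompletableFuture<String> geese = fetchGeese();
        CompletableFuture<String> clouds = fetchClouds();
        return geese.thenCombine(clouds, (g, c) -> g + "+" + c);
    }

    public static void main(String[] args) {
        System.out.println(fetchFixture().join()); // prints "geese+clouds"
    }
}
```

Calling join() appears only in main for demonstration; inside a request handler the combined future would normally be chained further rather than blocked on.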
Slow dependencies
Slow upstream services
Timeouts
Guaranteed latency
for integration points
Gateway Service Implementation
public class FixtureService extends FixtureServiceImplBase {
  ...
  @Override
  public void getFixture(GetFixtureRequest request, StreamObserver<FixtureResponse> response) {
    ListenableFuture<GeeseResponse> geese =
        geeseClient.withDeadlineAfter(500, MILLISECONDS).getGeese(toGeeseRequest(request));
    ListenableFuture<CloudsResponse> clouds =
        cloudsClient.withDeadlineAfter(500, MILLISECONDS).getClouds(toCloudsRequest(request));
    ListenableFuture<List<GeneratedMessageV3>> geeseAndClouds = Futures.allAsList(geese, clouds);
    ...
REST: Non-Blocking Calls with Timeout
CompletableFuture<List<LeaderboardEntry>> leaderboard =
    httpClient
        .get().uri("/top/5")
        .exchange()
        .timeout(Duration.ofMillis(500))
        .flatMap(cr -> cr.bodyToMono(...))
        .toFuture();
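For code that is not built on gRPC stubs or a reactive HTTP client, a similar upper bound on waiting time can be sketched with CompletableFuture.orTimeout (Java 9+). The slowCall helper below is a hypothetical dependency that answers after a configurable delay; it is not part of the demo project.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class TimeoutSketch {

    // Hypothetical slow dependency: completes with a value after delayMillis.
    static CompletableFuture<String> slowCall(long delayMillis) {
        return CompletableFuture.supplyAsync(
            () -> "leaderboard",
            CompletableFuture.delayedExecutor(delayMillis, TimeUnit.MILLISECONDS));
    }

    // Bounds the wait: the future fails with TimeoutException if the call
    // has not completed within timeoutMillis.
    static CompletableFuture<String> withTimeout(long delayMillis, long timeoutMillis) {
        return slowCall(delayMillis).orTimeout(timeoutMillis, TimeUnit.MILLISECONDS);
    }

    // Helper for the demo: true if the future failed because of the timeout.
    static boolean timedOut(CompletableFuture<String> future) {
        try {
            future.join();
            return false;
        } catch (CompletionException e) {
            return e.getCause() instanceof TimeoutException;
        }
    }

    public static void main(String[] args) {
        System.out.println(timedOut(withTimeout(10, 2000)));  // fast call, generous timeout
        System.out.println(timedOut(withTimeout(2000, 50)));  // slow call, tight timeout
    }
}
```

Note that orTimeout only stops the caller from waiting; it does not cancel work already running in the dependency, which is one reason the deck later turns to deadline propagation.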
Demo
No Geese
No Clouds
Blinking Leaderboard
Observability
Monitoring:
QPS, latency, errors, ...
Observability: gRPC
// OpenCensus
RpcViews.registerAllViews();
Tracing: gRPC
GrpcTracing grpcTracing = GrpcTracing.create(...);
ManagedChannelBuilder
    ...
    .intercept(grpcTracing.newClientInterceptor())
    .build();
ServerBuilder.forPort(8080)
    ...
    .intercept(grpcTracing.newServerInterceptor())
    .build();
Tracing: REST
build.gradle:
dependencies {
  compile '...:spring-cloud-sleuth-zipkin'
  compile '...:spring-cloud-starter-sleuth'
  ...
}
application.properties:
spring.zipkin.baseUrl=http://zipkin:9411/
spring.sleuth.sampler.probability=1.0
spring.sleuth.web.enabled=true
Demo
Clouds are slow
Geese are fast
Entire call fails
Partial Degradation
ListenableFuture<GeeseResponse> geese = geeseClient.getGeese(toGeese(request));
ListenableFuture<CloudsResponse> clouds = cloudsClient.getClouds(toClouds(request));
// Was Futures.allAsList(geese, clouds), which fails if either call fails.
ListenableFuture<List<GeneratedMessageV3>> geeseAndClouds = Futures.successfulAsList(geese, clouds);
...
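Guava's Futures.successfulAsList yields null in place of any failed input instead of failing the whole list. A rough JDK-only equivalent, offered as a sketch rather than as Guava's implementation, might look like this (the class and method names here are made up for the example):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

final class PartialDegradationSketch {

    // Waits for every future; a failed future contributes null instead of
    // failing the combined result, so callers can render whatever succeeded.
    static <T> CompletableFuture<List<T>> successfulAsList(List<CompletableFuture<T>> futures) {
        List<CompletableFuture<T>> tolerant = new ArrayList<>();
        for (CompletableFuture<T> future : futures) {
            tolerant.add(future.exceptionally(error -> null));
        }
        return CompletableFuture
            .allOf(tolerant.toArray(new CompletableFuture[0]))
            .thenApply(ignored -> {
                List<T> results = new ArrayList<>();
                for (CompletableFuture<T> future : tolerant) {
                    results.add(future.join()); // already complete, never throws here
                }
                return results;
            });
    }
}
```

The caller then has to treat null as "this part is missing", for example by drawing geese without clouds.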
Some L-board calls fail
L-board latency is low
Scores disappear
Retries: REST
CompletableFuture<List<Leaderboard>> request() {
  return httpClient
      .get().uri("/top/5").exchange()
      .timeout(Duration.ofMillis(500))
      .flatMap(...).toFuture();
}
RetryPolicy RETRY_POLICY = new RetryPolicy()
    .retryOn(IOException.class)
    .withMaxRetries(MAX_RETRIES);
CompletableFuture<List<Leaderboard>> top5 = Failsafe.with(RETRY_POLICY)
    ...
    .future(this::httpRequest);
Demo
Retry slow calls?
Retry failed calls?
Retry network faults?
Retry Storm
Clouds Service
API Gateway
Exponential Backoffs
new RetryPolicy()
    .withBackoff(
        MIN_DELAY,
        MAX_DELAY,
        TimeUnit.MILLISECONDS, 100.0)
    ...
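To make the backoff idea concrete, here is a small hand-rolled sketch of capped exponential delays and a retry loop that uses them. It is not how Failsafe implements withBackoff; the class, method names, and parameters are assumptions chosen for the example.

```java
import java.util.function.Supplier;

final class BackoffSketch {

    // Delay before retry number `attempt` (0-based): min * factor^attempt, capped at max.
    static long delayMillis(int attempt, long minDelayMillis, long maxDelayMillis, double factor) {
        double delay = minDelayMillis * Math.pow(factor, attempt);
        return (long) Math.min(delay, (double) maxDelayMillis);
    }

    // Retries a failing call up to maxRetries times, sleeping longer after each failure
    // so that a struggling dependency is not hit by an immediate wave of retries.
    static <T> T callWithRetries(Supplier<T> call, int maxRetries,
                                 long minDelayMillis, long maxDelayMillis, double factor) {
        for (int attempt = 0; ; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                if (attempt >= maxRetries) {
                    throw e; // retries exhausted: surface the last failure
                }
                try {
                    Thread.sleep(delayMillis(attempt, minDelayMillis, maxDelayMillis, factor));
                } catch (InterruptedException interrupted) {
                    Thread.currentThread().interrupt();
                    throw e;
                }
            }
        }
    }
}
```

With a minimum of 100 ms, a maximum of 5000 ms, and a factor of 2.0, the delays grow as 100, 200, 400, 800 ms and so on until they reach the cap. Adding random jitter to each delay is a common further step so that many clients do not retry in lockstep.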
Fallbacks
Failsafe
    .with(RETRY_POLICY)
    .withFallback(
        () -> emptyLeaderboard())
    ...
Failsafe
    .with(RETRY_POLICY)
    .withFallback(
        () -> cachedLeaderboard())
    ...
On Error
Retry
Fallback
Fail Fast
High 99%ile latency
100 requests
Error probability:
1 – 0.99^100 ≈ 63%
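The 63% figure can be reproduced as follows, assuming the 100 requests are independent and each has a 1% chance of landing above the 99th-percentile latency:

```latex
\[
P(\text{at least one slow call}) = 1 - (1 - 0.01)^{100} = 1 - 0.99^{100} \approx 0.634
\]
```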
Tail-Tolerance
Request
200 ms deadline
↓ 100 ms
Request
Fastest Response
Request Hedging
High 99%ile latency
100 requests
Error probability:
63% × 0.01 < 1%
Hedging in gRPC (soon)
Channel geeseChannel = ManagedChannelBuilder
    .forAddress(geeseHost, geesePort)
    .enableRetry()
    .maxHedgedAttempts(MAX_HEDGES)
    .build();
GeeseServiceFutureStub geeseStub = GeeseServiceGrpc
    .newFutureStub(geeseChannel);
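The tail-tolerance picture above (send a request, wait a while, send a second one, take the fastest response) can be approximated by hand with CompletableFuture. The sketch below is not gRPC's hedging implementation; the hedged helper and its parameters are hypothetical, and it assumes the call is safe to send twice.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

final class HedgingSketch {

    // Sends the primary request immediately. If it has not completed after
    // hedgeDelayMillis, sends one backup request and returns whichever finishes first.
    // Simplification: if the first future to complete has failed, the result fails too.
    static <T> CompletableFuture<T> hedged(Supplier<CompletableFuture<T>> call, long hedgeDelayMillis) {
        CompletableFuture<T> primary = call.get();
        CompletableFuture<T> backup = new CompletableFuture<>();
        Executor delayed = CompletableFuture.delayedExecutor(hedgeDelayMillis, TimeUnit.MILLISECONDS);
        delayed.execute(() -> {
            if (primary.isDone()) {
                return; // primary answered in time, no hedge needed
            }
            call.get().whenComplete((value, error) -> {
                if (error != null) {
                    backup.completeExceptionally(error);
                } else {
                    backup.complete(value);
                }
            });
        });
        return primary.applyToEither(backup, value -> value);
    }
}
```

Hedging trades extra load for lower tail latency, so the hedge delay is usually set near a high latency percentile so that only a small fraction of requests are duplicated.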
High mean latency
100 requests
Error probability:
1 – 0.50^100 = 99.99...%
Circuit Breaker
CircuitBreaker CIRCUIT_BREAKER =
    new CircuitBreaker()
        .withFailureThreshold(3, 5);
CompletableFuture<...> top5 = Failsafe
    .with(CIRCUIT_BREAKER)
    .with(RETRY_POLICY)
    ...
    .future(this::httpRequest);
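As an illustration of what a threshold such as "3 failures out of the last 5 executions" means, here is a minimal count-based circuit breaker. It is a simplified sketch, not Failsafe's CircuitBreaker: the class name, the openMillis parameter, and the injected clock are assumptions made for the example, and the half-open state is reduced to "allow calls again after the delay and decide on the first recorded result".

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.LongSupplier;

final class CircuitBreakerSketch {
    private final int failureThreshold;  // e.g. 3
    private final int windowSize;        // e.g. 5
    private final long openMillis;       // how long to reject calls once open
    private final LongSupplier clock;    // injected so tests can control time
    private final Deque<Boolean> window = new ArrayDeque<>(); // true = failure
    private long openedAt = -1;          // -1 means the circuit is closed

    CircuitBreakerSketch(int failureThreshold, int windowSize, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.windowSize = windowSize;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    // True if a call may proceed: the circuit is closed, or it has been open
    // long enough that a trial call is allowed.
    synchronized boolean allowRequest() {
        return openedAt < 0 || clock.getAsLong() - openedAt >= openMillis;
    }

    synchronized void recordSuccess() {
        if (openedAt >= 0) {   // trial call succeeded: close and start fresh
            openedAt = -1;
            window.clear();
            return;
        }
        record(false);
    }

    synchronized void recordFailure() {
        if (openedAt >= 0) {   // trial call failed: stay open for another period
            openedAt = clock.getAsLong();
            return;
        }
        record(true);
        long failures = window.stream().filter(Boolean::booleanValue).count();
        if (failures >= failureThreshold) {
            openedAt = clock.getAsLong(); // fail fast from now on
            window.clear();
        }
    }

    private void record(boolean failure) {
        window.addLast(failure);
        if (window.size() > windowSize) {
            window.removeFirst();
        }
    }
}
```

A caller would check allowRequest() before each call, skip straight to the fallback when it returns false, and report the outcome with recordSuccess() or recordFailure() otherwise.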
Error Handling
100% Error: Fail Fast
Intermittent Slow: Hedging
Intermittent Fast: Retry + Fallback
Client-driven deadline
Don’t process failed calls
Deadlines
API Gateway
Deadline 200 ms → Spent 120 ms → Spent 90 ms → X
See slides ##180, 181 for licensing details.
Deadlines Propagation
API Gateway
Deadline 200 ms → Spent 120 ms → Deadline 80 ms → Spent 90 ms → Deadline -10 ms → X
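The arithmetic behind deadline propagation in the diagram (200 ms at the gateway, 120 ms spent, 80 ms passed on, then 90 ms spent downstream) can be sketched as below. The helper names are hypothetical and the sketch works on plain millisecond budgets rather than on a real gRPC deadline object.

```java
final class DeadlineSketch {

    // Budget left for downstream calls: the incoming deadline minus time already spent.
    static long remainingMillis(long deadlineMillis, long spentMillis) {
        return deadlineMillis - spentMillis;
    }

    // A non-positive budget means the caller has already given up,
    // so further work on this request can be skipped.
    static boolean isExpired(long remainingMillis) {
        return remainingMillis <= 0;
    }

    public static void main(String[] args) {
        long atGateway = remainingMillis(200, 120);      // 80 ms passed downstream
        long atService = remainingMillis(atGateway, 90); // -10 ms: deadline exceeded
        System.out.println(atGateway + " " + atService + " " + isExpired(atService));
        // prints "80 -10 true"
    }
}
```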
Throughput has limits
Exceeding limits?
REST
new ConcurrencyLimitServletFilter(
    new ServletLimiterBuilder()
        .partitionByHeader("GEESE_TYPE",
            c -> c.assign("premium", 0.9)
                  .assign("free", 0.1))
        .limiter(l -> l.limit(
            newBuilder()
                .initialLimit(1000)...);
var limiter =
    new GrpcServerLimiterBuilder()
        .partitionByHeader(GEESE_TYPE)
        .partition("premium", 0.9...
new GrpcClientLimiterBuilder()
    .limit(
        newBuilder()
            .initialLimit(1000).build())
    .blockOnLimit(false)...
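The partitioned limits on these slides (90% of capacity for "premium", 10% for "free") can be illustrated with a much simpler, static version built on semaphores. This is only a sketch under stated assumptions: it does not adapt the limit at runtime or let partitions borrow unused capacity, and the class and method names are invented for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Semaphore;

final class PartitionedLimiterSketch {
    private final Map<String, Semaphore> partitions = new HashMap<>();

    // shares maps a partition name to its fraction of totalLimit,
    // for example premium = 0.9 and free = 0.1.
    PartitionedLimiterSketch(int totalLimit, Map<String, Double> shares) {
        shares.forEach((name, share) ->
            partitions.put(name, new Semaphore((int) Math.round(totalLimit * share))));
    }

    // Non-blocking: returns false when the partition is at its limit (or unknown),
    // so the caller can reject the request instead of queueing it.
    boolean tryAcquire(String partition) {
        Semaphore permits = partitions.get(partition);
        return permits != null && permits.tryAcquire();
    }

    // Must be called once for every successful tryAcquire, after the request finishes.
    void release(String partition) {
        Semaphore permits = partitions.get(partition);
        if (permits != null) {
            permits.release();
        }
    }
}
```

Because each partition has its own permits, a burst of free-tier traffic exhausts only the free share and premium requests keep being admitted.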
Demo
Monitoring
APM
Service metrics
Distributed tracing
Business metrics
Picture by A...
Code and Design
Timeouts / Deadline Propagation
Retries / Hedging
Proper Fallbac...
Request for each response
Requests don’t change
Redundant Requests
GeeseRequest
GeeseResponse
GeeseRequest
GeeseResponse
GeeseRe...
Streaming
GeeseRequest
GeeseResponse
GeeseResponse
GeeseResponse
service GeeseService {
  rpc GetGeese (GetGeeseRequest)
      returns (GeeseResponse);
}
service CloudsServic...
service GeeseService {
  rpc GetGeese (GetGeeseRequest)
      returns (stream GeeseResponse);
}
service Cloud...
Server faster than client
Client cannot keep up
Too Many Streaming Responses
GeeseRequest
X
Flow Control
GeeseRequest
5
3
Decouple producer and consumer
Decouple failures
Reactive Systems
Message-driven
Elastic
Responsive
Resilient
Per instance limits
Door Capacity
Why didn’t Rose make room for Jack on the door?
Willy Stöwer. Publ...
“ The answer is ve...
Capacity
Autoscaling
Prescaling
See slides ##180, 182 for licensing details.
Services break each other
Free and Premium?
Free
Premium
Free and Premium Outage
Bulkheads
Free
Premium
By Request Type
By Client Priority
By Region
By Availability Zone
etc
Demo
Bad user experience
Metrics are not enough
Prober
TOP-5
API Gateway
See slides ##180, 182 for licensing details.
Availability
Latency SLO
Response verification
CloudProber.org
Technical solutions are not enough
Communication
Communication Channels
GEESE at 270
Postmortems
Blameless
Constructive
Social
See slides ##189, 182, 183 for licensi...
Postmortems
Timeline
Causes
Remedies
Learn from Failure
Blameless postmortems
Alert playbooks
Incident knowledge base
Libraries and Tools
Demo: github.com/break-me-if-you-can
Failsafe: github.com/jh...
Demo UI
Yevgen Golubenko
Twitter: @HalloGene_
github.com/HalloGene
Picture by Yevgen Golu...
Books
Fault-Tolerance
Code & Design Patterns
Product decisions
Communication culture
Please Break Me!
If you can
Rate Us
If you enjoyed the talk
Or give feedback
5 STARS!
Images and Licensing
Images of geese, clouds, pilots, plane, arrows, cup, airport traffic control tow...
Slides ##8, 10, 13: www.flickr.com/photos/22608787@N00/3200086900
- Picture by Gr...
Slide #140: piq.codeus.net/picture/254492/CVsantahat
- Santa hat for CommanderVi...
Slides #166, 167: piq.codeus.net/picture/444498/Beer-Bottle
- Beer Bottle by jac...

Break me if you can: practical guide to building fault-tolerant systems (with examples from REST and gRPC polyglot stacks)

Devoxx 2018 "Break me if you can: practical guide to building fault-tolerant systems" slides.


  1. 1. Break Me If You Can Practical Guide to Building Fault-tolerant Systems Devoxx Belgium, November 15, 2018 Alex Borysov, Software Engineer @ Google Mykyta Protsenko, Software Engineer @ Netflix
  2. 2. Who are we? Alex Borysov Software Engineer @Google Mykyta Protsenko Software Engineer @Netflix @aiborisov @mykyta_p
  3. 3. Fault-Tolerance? @aiborisov @mykyta_p
  4. 4. Fault vs Error vs Failure @aiborisov @mykyta_p
  5. 5. @aiborisov @mykyta_p Fault @aiborisov @mykyta_p incorrect internal state Picture by Bob McMillan. Public domain. See slide #180 for details.
  6. 6. @aiborisov @mykyta_p Error @aiborisov @mykyta_p visibly incorrect behaviour Picture by David Goehring. CC BY 2.0. See slide #180 for details.
  7. 7. @aiborisov @mykyta_p Failure @aiborisov @mykyta_p main functionality is broken Picture by Camerafiend. CC BY-SA 3.0. See slide #180 for details.
  8. 8. @aiborisov @mykyta_p RMS Titanic vs Miracle on the Hudson @aiborisov @mykyta_p Willy Stöwer. Public domain. See slide #180 for details. By Greg Lam Pak Ng. CC BY 2.0. See slide #181 for details.
  9. 9. @aiborisov @mykyta_p RMS Titanic @aiborisov @mykyta_p Fault: Hitting an iceberg Error: Water in the hull Failure: Sinking Willy Stöwer. Public domain. See slide #180 for details.
  10. 10. @aiborisov @mykyta_p Miracle on the Hudson @aiborisov @mykyta_p Fault: Hitting geese at 859 m Error: Engines shut down No Failure! By Greg Lam Pak Ng. CC BY 2.0. See slide #181 for details.
  11. 11. Fault Error Failure @aiborisov @mykyta_p → →
  12. 12. Fault Error Failure @aiborisov @mykyta_p → →
  13. 13. @aiborisov @mykyta_p Fault Tolerance @aiborisov @mykyta_p Code and Design Patterns Product-Driven Decisions Communication By Greg Lam Pak Ng. CC BY 2.0. See slide #181 for details.
  14. 14. Dodging Geese @aiborisov @mykyta_p
  15. 15. @aiborisov @mykyta_p Dodging Geese Architecture TOP-5 Geese Service Clouds Service Leaderboard Service API Gateway @aiborisov @mykyta_p See slides ##180, 181 for licensing details.
  16. 16. @aiborisov @mykyta_p Dodging Geese Architecture TOP-5 Geese Service Clouds Service Leaderboard Service API Gateway @aiborisov @mykyta_p
  17. 17. @aiborisov @mykyta_p Dodging Geese Architecture TOP-5 Geese Service Leaderboard Service API Gateway @aiborisov @mykyta_p Clouds Service
  18. 18. @aiborisov @mykyta_p Dodging Geese Architecture TOP-5 Leaderboard Service API Gateway @aiborisov @mykyta_p Clouds Service Geese Service
  19. 19. Dodging Geese Architecture Geese Service Clouds Service API Gateway TOP-5 Leaderboard Service
  20. 20. @aiborisov @mykyta_p Dodging Geese Architecture TOP-5 Geese Service Clouds Service Leaderboard Service API Gateway @aiborisov @mykyta_p
  21. 21. @aiborisov @mykyta_p Dodging Geese Architecture TOP-5 Geese Service Clouds Service Leaderboard Service API Gateway @aiborisov @mykyta_p
  22. 22. @aiborisov @mykyta_p Dodging Geese Architecture TOP-5 Geese Service Clouds Service Leaderboard Service API Gateway @aiborisov @mykyta_p
  23. 23. @aiborisov @mykyta_p Leaderboard API (REST) /players/<username>/score {"name": "Jane", "score": 100} /leaderboard/top/<n> [{"name": "Jane", "score": 100}, {"name": "John", "score": 50}, ...] @aiborisov @mykyta_p
  24. 24. @aiborisov @mykyta_p gRPC Service Definitions @aiborisov @mykyta_p service GeeseService { // Return next line of geese. rpc GetGeese (GetGeeseRequest) returns (GeeseResponse); }
  25. 25. @aiborisov @mykyta_p gRPC Service Definitions @aiborisov @mykyta_p service GeeseService { // Return next line of geese. rpc GetGeese (GetGeeseRequest) returns (GeeseResponse); } service CloudsService { // Return next line of clouds. rpc GetClouds (GetCloudsRequest) returns (CloudsResponse); }
  26. 26. @aiborisov @mykyta_p service FixtureService { // Return next line of geese and clouds. rpc GetFixture (GetFixtureRequest) returns (FixtureResponse); } gRPC Gateway Service @aiborisov @mykyta_p
  27. 27. @aiborisov @mykyta_p service FixtureService { // Return next line of geese and clouds. rpc GetFixture (GetFixtureRequest) returns (FixtureResponse); } + = Fixture gRPC Gateway Service @aiborisov @mykyta_p
  28. 28. @aiborisov @mykyta_p public class FixtureService extends FixtureServiceImplBase { Gateway Fixture Service @aiborisov @mykyta_p
  29. 29. Gateway Fixture Service Geese Service Clouds Service API Gateway
  30. 30. Gateway Fixture Service Clouds Service API Gateway Geese Service
  31. 31. Gateway Fixture Service Clouds Service API Gateway Geese Service
  32. 32. @aiborisov @mykyta_p Gateway Fixture Service API Gateway @aiborisov @mykyta_p Geese Service Clouds Service
  33. 33. @aiborisov @mykyta_p Gateway Fixture Service API Gateway @aiborisov @mykyta_p Geese Service Clouds Service
  34. 34. @aiborisov @mykyta_p @aiborisov @mykyta_p Fixture Latency = Geese Latency + Clouds Latency
  35. 35. @aiborisov @mykyta_p @aiborisov @mykyta_p Non-Blocking Calls Don’t block Send requests in parallel Combine results when ready
  36. 36. @aiborisov @mykyta_p public class FixtureService extends FixtureServiceImplBase { Gateway Service Implementation @aiborisov @mykyta_p private final GeeseServiceFutureStub geeseClient = ...; private final CloudsServiceFutureStub cloudsClient = ...;
  37. 37. @aiborisov @mykyta_p public class FixtureService extends FixtureServiceImplBase { Gateway Service Implementation @aiborisov @mykyta_p private final GeeseServiceFutureStub geeseClient = ...; private final CloudsServiceFutureStub cloudsClient = ...; @Override public void getFixture(GetFixtureRequest request, StreamObserver<FixtureResponse> response) { ListenableFuture<GeeseResponse> geese = geeseClient.getGeese(toGeese(request)); ListenableFuture<CloudsResponse> clouds = cloudsClient.getClouds(toClouds(request)); ListenableFuture<List<GeneratedMessageV3>> geeseAndClouds = Futures.allAsList(geese, clouds); ...
  38. 38. @aiborisov @mykyta_p public class FixtureService extends FixtureServiceImplBase { Gateway Service Implementation @aiborisov @mykyta_p private final GeeseServiceFutureStub geeseClient = ...; private final CloudsServiceFutureStub cloudsClient = ...; @Override public void getFixture(GetFixtureRequest request, StreamObserver<FixtureResponse> response) { ListenableFuture<GeeseResponse> geese = geeseClient.getGeese(toGeese(request)); ListenableFuture<CloudsResponse> clouds = cloudsClient.getClouds(toClouds(request)); ListenableFuture<List<GeneratedMessageV3>> geeseAndClouds = Futures.allAsList(geese, clouds); ...
  39. 39. @aiborisov @mykyta_p
  40. 40. @aiborisov @mykyta_p @aiborisov @mykyta_p Slow dependencies Slow upstream services
  41. 41. @aiborisov @mykyta_p @aiborisov @mykyta_p Timeouts Guaranteed latency for integration points
  42. 42. @aiborisov @mykyta_p public class FixtureService extends FixtureServiceImplBase { ... Gateway Service Implementation @aiborisov @mykyta_p @Override public void getFixture(GetFixtureRequest request, StreamObserver<FixtureResponse> response) { ListenableFuture<GeeseResponse> geese = geeseClient.getGeese(toGeese(request)); ListenableFuture<CloudsResponse> clouds = cloudsClient.getClouds(toClouds(request)); ListenableFuture<List<GeneratedMessageV3>> geeseAndClouds = Futures.allAsList(geese, clouds); ...
  43. 43. @aiborisov @mykyta_p public class FixtureService extends FixtureServiceImplBase { ... Gateway Service Implementation @aiborisov @mykyta_p @Override public void getFixture(GetFixtureRequest request, StreamObserver<FixtureResponse> response) { ListenableFuture<GeeseResponse> geese = geeseClient.withDeadlineAfter(500, MILLISECONDS).getGeese(toGeeseRequest(request)); ListenableFuture<CloudsResponse> clouds = cloudsClient.withDeadlineAfter(500, MILLISECONDS).getClouds(toCloudsRequest(request)); ListenableFuture<List<GeneratedMessageV3>> geeseAndClouds = Futures.allAsList(geese, clouds); ...
  44. 44. @aiborisov @mykyta_p @Override public void getFixture(GetFixtureRequest request, StreamObserver<FixtureResponse> response) { ListenableFuture<GeeseResponse> geese = geeseClient.withDeadlineAfter(500, MILLISECONDS).getGeese(toGeeseRequest(request)); ListenableFuture<CloudsResponse> clouds = cloudsClient.withDeadlineAfter(500, MILLISECONDS).getClouds(toCloudsRequest(request)); ListenableFuture<List<GeneratedMessageV3>> geeseAndClouds = Futures.allAsList(geese, clouds); ... public class FixtureService extends FixtureServiceImplBase { ... Gateway Service Implementation @aiborisov @mykyta_p
  45. 45. @aiborisov @mykyta_p REST: Non-Blocking Calls CompletableFuture<List<LeaderboardEntry>> leaderboard = httpClient .get().uri("/top/5") .exchange() .timeout(Duration.ofMillis(500)) .flatMap(cr -> cr.bodyToMono(...)) .toFuture(); @aiborisov @mykyta_p
  46. 46. @aiborisov @mykyta_p REST: Non-Blocking Calls with Timeout CompletableFuture<List<LeaderboardEntry>> leaderboard = httpClient .get().uri("/top/5") .exchange() .timeout(Duration.ofMillis(500)) .flatMap(cr -> cr.bodyToMono(...)) .toFuture(); @aiborisov @mykyta_p
  47. 47. @aiborisov @mykyta_p
  48. 48. Demo @aiborisov @mykyta_p
  49. 49. @aiborisov @mykyta_p @aiborisov @mykyta_p No Geese No Clouds Blinking Leaderboard
  50. 50. @aiborisov @mykyta_p @aiborisov @mykyta_p Observability Monitoring: QPS, latency, errors, ...
  51. 51. @aiborisov @mykyta_p @aiborisov @mykyta_p Observability: gRPC Monitoring: QPS, latency, errors, ... // OpenCensus RpcViews.registerAllViews();
  52. 52. @aiborisov @mykyta_p @aiborisov @mykyta_p Tracing: gRPC GrpcTracing grpcTracing = GrpcTracing.create(...); ManagedChannelBuilder ... .intercept(grpcTracing.newClientInterceptor()) .build() ; ServerBuilder.forPort(8080) ... .intercept(grpcTracing.newServerInterceptor()) .build();
  53. 53. @aiborisov @mykyta_p @aiborisov @mykyta_p Tracing: gRPC GrpcTracing grpcTracing = GrpcTracing.create(...); ManagedChannelBuilder ... .intercept(grpcTracing.newClientInterceptor()) .build(); ServerBuilder.forPort(8080) ... .intercept(grpcTracing.newServerInterceptor()) .build();
  54. 54. @aiborisov @mykyta_p @aiborisov @mykyta_p Tracing: REST build.gradle: dependencies { compile '...:spring-cloud-sleuth-zipkin' compile '...:spring-cloud-starter-sleuth' ... } application.properties: spring.zipkin.baseUrl=http://zipkin:9411/ spring.sleuth.sampler.probability=1.0 spring.sleuth.web.enabled=true
  55. 55. @aiborisov @mykyta_p
  56. 56. Demo @aiborisov @mykyta_p
  57. 57. @aiborisov @mykyta_p @aiborisov @mykyta_p Clouds are slow Geese are fast Entire call fails
  58. 58. Partial Degradation ListenableFuture<GeeseResponse> geese = geeseClient.getGeese(toGeese(request)); ListenableFuture<CloudsResponse> clouds = cloudsClient.getClouds(toClouds(request)); ListenableFuture<List<GeneratedMessageV3>> geeseAndClouds = Futures.allAsList(geese, clouds); ...
  59. 59. Partial Degradation ListenableFuture<GeeseResponse> geese = geeseClient.getGeese(toGeese(request)); ListenableFuture<CloudsResponse> clouds = cloudsClient.getClouds(toClouds(request)); ListenableFuture<List<GeneratedMessageV3>> geeseAndClouds = Futures.successfulAsList(geese, clouds); ...
  60. 60. @aiborisov @mykyta_p
  61. 61. @aiborisov @mykyta_p @aiborisov @mykyta_p Some L-board calls fail L-board latency is low Scores disappear
  62. 62. @aiborisov @mykyta_p CompletableFuture<List<Leaderboard>> request() { return httpClient .get().uri("/top/5").exchange() .timeout(Duration.ofMillis(500)) .flatMap(...).toFuture(); } @aiborisov @mykyta_p Retries: REST
  63. 63. @aiborisov @mykyta_p CompletableFuture<List<Leaderboard>> request() { return httpClient .get().uri("/top/5").exchange() .timeout(Duration.ofMillis(500)) .flatMap(...).toFuture(); } RetryPolicy RETRY_POLICY = new RetryPolicy() .retryOn(IOException.class) .withMaxRetries(MAX_RETRIES); CompletableFuture<List<Leaderboard>> top5 = Failsafe.with(RETRY_POLICY) ... .future(this::httpRequest); @aiborisov @mykyta_p Retries: REST
  64. 64. @aiborisov @mykyta_p
  65. 65. Demo @aiborisov @mykyta_p
  66. 66. @aiborisov @mykyta_p @aiborisov @mykyta_p Retry slow calls? Retry failed calls? Retry network faults?
  67. 67. Retry Storm Clouds Service API Gateway
  68. 68. @aiborisov @mykyta_p new RetryPolicy() .withBackoff( MIN_DELAY, MAX_DELAY, TimeUnit.MILLISECONDS, 100.0) ... ... @aiborisov @mykyta_p Exponential Backoffs
  69. 69. @aiborisov @mykyta_p Failsafe .with(RETRY_POLICY) .withFallback( () -> emptyLeaderboard()) ... @aiborisov @mykyta_p Fallbacks
  70. 70. @aiborisov @mykyta_p Failsafe .with(RETRY_POLICY) .withFallback( () -> cachedLeaderboard()) ... @aiborisov @mykyta_p Fallbacks
  71. 71. @aiborisov @mykyta_p Retry Fallback Fail Fast @aiborisov @mykyta_p On Error
  72. 72. @aiborisov @mykyta_p
  73. 73. @aiborisov @mykyta_p @aiborisov @mykyta_p
  74. 74. @aiborisov @mykyta_p @aiborisov @mykyta_p High 99%ile latency 100 requests Error probability?
  75. 75. @aiborisov @mykyta_p @aiborisov @mykyta_p High 99%ile latency 100 requests Error probability: 1 – 0.99^100 = 63%
  76. 76. @aiborisov @mykyta_p Tail-Tolerance @aiborisov @mykyta_p Request 200 ms deadline
  77. 77. @aiborisov @mykyta_p Tail-Tolerance @aiborisov @mykyta_p Request 200 ms deadline ↓ 100 ms
  78. 78. @aiborisov @mykyta_p Tail-Tolerance @aiborisov @mykyta_p Request 200 ms deadline ↓ 100 ms Request
  79. 79. @aiborisov @mykyta_p Tail-Tolerance @aiborisov @mykyta_p Request 200 ms deadline ↓ 100 ms Request Fastest Response
  80. 80. @aiborisov @mykyta_p High 99%ile latency 100 requests @aiborisov @mykyta_p Request Hedging
  81. 81. @aiborisov @mykyta_p High 99%ile latency 100 requests Error probability: 63% x 0.01 < 1% @aiborisov @mykyta_p Request Hedging
  82. 82. @aiborisov @mykyta_p Channel geeseChannel = ManagedChannelBuilder .forAddress(geeseHost, geesePort) .enableRetry() .maxHedgedAttempts(MAX_HEDGES) .build(); GeeseServiceFutureStub geeseStub = GeeseServiceGrpc .newFutureStub(geeseChannel); @aiborisov @mykyta_p Hedging in gRPC (soon)
  83. 83. @aiborisov @mykyta_p Channel geeseChannel = ManagedChannelBuilder .forAddress(geeseHost, geesePort) .enableRetry() .maxHedgedAttempts(MAX_HEDGES) .build(); GeeseServiceFutureStub geeseStub = GeeseServiceGrpc .newFutureStub(geeseChannel); @aiborisov @mykyta_p Hedging in gRPC (soon)
  84. 84. @aiborisov @mykyta_p
  85. 85. @aiborisov @mykyta_p @aiborisov @mykyta_p
  86. 86. @aiborisov @mykyta_p @aiborisov @mykyta_p High mean latency 100 requests Error probability?
  87. 87. @aiborisov @mykyta_p @aiborisov @mykyta_p High mean latency 100 requests Error probability: 1 – 0.50^100 = 99.99...%
  88. 88. @aiborisov @mykyta_p CircuitBreaker CIRCUIT_BREAKER = new CircuitBreaker() .withFailureThreshold(3, 5); CompletableFuture<...> top5 = Failsafe .with(CIRCUIT_BREAKER) .with(RETRY_POLICY) ... .future(this::httpRequest); @aiborisov @mykyta_p Circuit Breaker
  89. Error Handling:
        100% Error: Fail Fast
        Intermittent, Slow: Hedging
        Intermittent, Fast: Retry
        ✚ Fallback
  92. Client-driven deadline. Don’t process failed calls
  93. Deadlines: API Gateway. See slides ##180, 181 for licensing details.
  94. Deadlines: API Gateway. Deadline 200 ms →
  95. Deadlines: API Gateway. Deadline 200 ms → Spent 120 ms →
  96. Deadlines: API Gateway. Deadline 200 ms → Spent 120 ms → Spent 90 ms. X
  97. Deadlines: API Gateway. Deadline 200 ms → Spent 120 ms → Spent 90 ms. X →
  98. Deadlines Propagation: API Gateway. Deadline 200 ms →
  99. Deadlines Propagation: API Gateway. Deadline 200 ms → Spent 120 ms → Deadline 80 ms
  100. Deadlines Propagation: API Gateway. Deadline 200 ms → Spent 120 ms → Deadline 80 ms → Spent 90 ms. X
  101. Deadlines Propagation: API Gateway. Deadline 200 ms → Spent 120 ms → Deadline 80 ms → Spent 90 ms → Deadline -10 ms. X
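The bookkeeping behind deadline propagation is small enough to sketch directly. The class below is hypothetical and only models the arithmetic from the slides (200 ms at the gateway, 120 ms and 90 ms spent on the way down); in a real system the remaining time would travel with the request, for example in call metadata, rather than in a local object.

```java
final class DeadlineBudget {
    private final long remainingMs;

    DeadlineBudget(long remainingMs) {
        this.remainingMs = remainingMs;
    }

    /** Budget to hand to the next hop after this hop has used spentMs. */
    DeadlineBudget afterSpending(long spentMs) {
        return new DeadlineBudget(remainingMs - spentMs);
    }

    long remainingMs() {
        return remainingMs;
    }

    /** An expired budget means the caller has already given up: skip the work. */
    boolean isExpired() {
        return remainingMs <= 0;
    }

    public static void main(String[] args) {
        DeadlineBudget atGateway = new DeadlineBudget(200);
        DeadlineBudget secondHop = atGateway.afterSpending(120); // 80 ms left
        DeadlineBudget thirdHop = secondHop.afterSpending(90);   // -10 ms left
        System.out.println(secondHop.remainingMs() + " ms, expired: " + secondHop.isExpired());
        System.out.println(thirdHop.remainingMs() + " ms, expired: " + thirdHop.isExpired());
    }
}
```

A service that checks `isExpired()` before starting work avoids processing calls whose client has already timed out, which is the "don't process failed calls" point made earlier.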
  103. Throughput has limits. Exceeding limits?
  104. REST:
         new ConcurrencyLimitServletFilter(
             new ServletLimiterBuilder()
                 .partitionByHeader("GEESE_TYPE",
                     c -> c.assign("premium", 0.9)
                           .assign("free", 0.1))
                 .limiter(l -> l.limit(
                     newBuilder()
                         .initialLimit(1000)...);
  106. gRPC: Server
         var limiter = new GrpcServerLimiterBuilder()
             .partitionByHeader(GEESE_TYPE)
             .partition("premium", 0.9)
             .partition("free", 0.1)
             .limiter(l -> l.limit(
                 newBuilder()
                     .initialLimit(1000)...);
         ConcurrencyLimitServerInterceptor
             .newBuilder(limiter).build();
  108. gRPC: Client
         new GrpcClientLimiterBuilder()
             .limit(
                 newBuilder()
                     .initialLimit(1000).build())
             .blockOnLimit(false) // fail-fast
             .build();
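The limiter slides rely on the Netflix concurrency-limits library. The underlying idea, a cap on in-flight requests that is split between partitions and fails fast when exhausted, can be sketched with plain semaphores. This is a simplification under stated assumptions: `PartitionedLimiter` is a hypothetical class, the limit is fixed rather than adapted at runtime, and partitions cannot borrow unused capacity from each other.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

final class PartitionedLimiter {
    private final Map<String, Semaphore> permits = new HashMap<>();

    /** Splits totalLimit between partitions, e.g. premium = 0.9 and free = 0.1. */
    PartitionedLimiter(int totalLimit, Map<String, Double> shares) {
        shares.forEach((name, share) ->
            permits.put(name, new Semaphore((int) Math.round(totalLimit * share))));
    }

    /** Runs the action if the partition has capacity; otherwise fails fast with an empty result. */
    <T> Optional<T> tryCall(String partition, Supplier<T> action) {
        Semaphore semaphore = permits.get(partition);
        if (semaphore == null || !semaphore.tryAcquire()) {
            return Optional.empty();     // over the limit: reject instead of queueing
        }
        try {
            return Optional.ofNullable(action.get());
        } finally {
            semaphore.release();         // the permit covers the call's whole duration
        }
    }
}
```

Rejecting immediately corresponds to the `blockOnLimit(false)` choice on the client slide: excess requests fail fast instead of piling up in a queue and adding latency.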
  110. Demo
  112. Monitoring: APM, Service metrics, Distributed tracing, Business metrics. Picture by Alex Borysov. CC BY 2.0. See slide #180 for details.
  113. Code and Design: Timeouts / Deadline Propagation, Retries / Hedging, Proper Fallbacks, Concurrency Limits, Load Shedding, Observability
  114. Request for each response. Requests don’t change
  115. Redundant Requests: GeeseRequest → GeeseResponse, GeeseRequest → GeeseResponse, GeeseRequest → GeeseResponse
  117. Streaming: GeeseRequest → GeeseResponse, GeeseResponse, GeeseResponse
  118. gRPC Streaming (before):
         service GeeseService {
           rpc GetGeese (GetGeeseRequest) returns (GeeseResponse);
         }
         service CloudsService {
           rpc GetClouds (GetCloudsRequest) returns (CloudsResponse);
         }
  119. gRPC Streaming (after):
         service GeeseService {
           rpc GetGeese (GetGeeseRequest) returns (stream GeeseResponse);
         }
         service CloudsService {
           rpc GetClouds (GetCloudsRequest) returns (stream CloudsResponse);
         }
  121. Server faster than client. Client cannot keep up
  122. Too Many Streaming Responses: GeeseRequest
  123. Too Many Streaming Responses: GeeseRequest. X
  124. Flow Control: GeeseRequest
  125. Flow Control: GeeseRequest, 5
  127. Flow Control: GeeseRequest, 5, 3
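Flow control means the server sends only as many responses as the client has asked for. The sketch below models that with the JDK's `java.util.concurrent.Flow` interfaces rather than with gRPC itself. The class names are hypothetical, reading the 5 and 3 on the slides as successive demand signals from the client is an assumption, and the subscription is deliberately simplified (single-threaded, and the subscriber must not call `request` from inside `onNext`).

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.Flow;

final class FlowControlDemo {
    /** Producer side: emits items only while the subscriber has outstanding demand. */
    static final class DemandDrivenSubscription<T> implements Flow.Subscription {
        private final Iterator<T> source;
        private final Flow.Subscriber<? super T> subscriber;
        private boolean done;

        DemandDrivenSubscription(Iterator<T> source, Flow.Subscriber<? super T> subscriber) {
            this.source = source;
            this.subscriber = subscriber;
        }

        @Override
        public void request(long n) {
            // Deliver at most n items, then stop until the consumer asks again.
            for (long i = 0; i < n && !done; i++) {
                if (source.hasNext()) {
                    subscriber.onNext(source.next());
                } else {
                    done = true;
                    subscriber.onComplete();
                }
            }
        }

        @Override
        public void cancel() {
            done = true;
        }
    }

    /** Consumer side: stores what it receives and exposes the subscription for demand signals. */
    static final class CollectingSubscriber<T> implements Flow.Subscriber<T> {
        final List<T> received = new ArrayList<>();
        Flow.Subscription subscription;

        @Override public void onSubscribe(Flow.Subscription s) { subscription = s; }
        @Override public void onNext(T item) { received.add(item); }
        @Override public void onError(Throwable t) { }
        @Override public void onComplete() { }
    }
}
```

Even with an unbounded source, calling `request(5)` and later `request(3)` delivers exactly eight items, so a slow client is never handed more than it asked for.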
  130. Decouple producer and consumer. Decouple failures
  131. Reactive Systems: Message-driven, Elastic, Responsive, Resilient
  133. Per instance limits
  134. Door Capacity: Why didn’t Rose make room for Jack on the door? Willy Stöwer. Public domain. See slide #180 for details.
  135. Door Capacity: Why didn’t Rose make room for Jack on the door? “The answer is very simple because it says on page 147 that Jack dies.” (James Cameron) Willy Stöwer. Public domain. See slide #180 for details.
  136. Capacity
  138. Autoscaling
  139. Prescaling
  140. Prescaling. See slides ##180, 182 for licensing details.
  142. Services break each other
  143. Free and Premium?
  144. Free and Premium Outage
  145. Bulkheads: Free, Premium
  147. Bulkheads: By Request Type, By Client Priority, By Region, By Availability Zone, etc.
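A bulkhead keeps one class of traffic from exhausting resources that another class needs. One way to sketch this in Java is a separate, bounded thread pool per partition, so that a saturated "free" pool rejects work while "premium" work still runs. The `Bulkheads` class, the partition names and the pool sizes are assumptions for illustration; the same idea applies per request type, region or availability zone.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

final class Bulkheads {
    private final Map<String, ExecutorService> pools = new HashMap<>();

    /** Creates one fixed-size pool without a queue for each partition. */
    Bulkheads(Map<String, Integer> threadsPerPartition) {
        threadsPerPartition.forEach((name, threads) -> pools.put(name,
            new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
                new SynchronousQueue<>())));
    }

    /** Runs the task in its partition's pool; a full pool yields a failed future. */
    <T> CompletableFuture<T> submit(String partition, Supplier<T> task) {
        try {
            return CompletableFuture.supplyAsync(task, pools.get(partition));
        } catch (RejectedExecutionException e) {
            // The partition is saturated: reject here, other partitions are unaffected.
            return CompletableFuture.failedFuture(e);
        }
    }

    void shutdown() {
        pools.values().forEach(ExecutorService::shutdownNow);
    }
}
```

Because each partition owns its threads, a stuck or overloaded free tier can only use up the capacity assigned to it.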
  149. Demo
  150. Bad user experience. Metrics are not enough
  151. Prober: TOP-5, API Gateway
  152. Prober: TOP-5, API Gateway. See slides ##180, 182 for licensing details.
  153. Prober: Availability, Latency SLO, Response verification
  154. Prober: Availability, Latency SLO, Response verification. CloudProber.org
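A prober exercises the system the way a user would and checks three things: did the call succeed, was it fast enough, and was the answer right. The sketch below shows that check in isolation. It is not how Cloudprober works internally; `Prober`, `Outcome` and the `Supplier<String>` standing in for a real request to the API gateway are all hypothetical.

```java
import java.time.Duration;
import java.util.function.Predicate;
import java.util.function.Supplier;

final class Prober {
    enum Outcome { OK, UNAVAILABLE, TOO_SLOW, BAD_RESPONSE }

    /**
     * Issues one probe against the target and classifies the result:
     * availability first, then the latency SLO, then response verification.
     */
    static Outcome probe(Supplier<String> target, Duration latencySlo, Predicate<String> verifier) {
        long start = System.nanoTime();
        String response;
        try {
            response = target.get();
        } catch (RuntimeException e) {
            return Outcome.UNAVAILABLE;
        }
        Duration elapsed = Duration.ofNanos(System.nanoTime() - start);
        if (elapsed.compareTo(latencySlo) > 0) {
            return Outcome.TOO_SLOW;
        }
        return verifier.test(response) ? Outcome.OK : Outcome.BAD_RESPONSE;
    }
}
```

Running such a probe on a schedule and alerting on anything other than `OK` catches failures that service-side metrics can miss, such as a response that is fast and successful but wrong.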
  158. Technical solutions are not enough
  159. Communication
  161. Communication Channels: GEESE at 270
  165. Postmortems: Blameless, Constructive
  166. Postmortems: Blameless, Constructive, Social. See slides ##180, 182, 183 for licensing details.
  167. Postmortems: Timeline, Causes, Remedies
  168. Learn from Failure: Blameless postmortems, Alert playbooks, Incident knowledge base
  170. Libraries and Tools
         Demo: github.com/break-me-if-you-can
         Failsafe: github.com/jhalterman/failsafe
         Observability: opencensus.io, opentracing.io
         Prober: cloudprober.org
         Concurrency Limits: github.com/Netflix/concurrency-limits
  171. Demo UI: Yevgen Golubenko. Twitter: @HalloGene_, github.com/HalloGene. Picture by Yevgen Golubenko. Also see slide #183 for licensing details.
  172. Books
  173. Fault-Tolerance: Code & Design Patterns, Product decisions, Communication culture
  178. Please Break Me! If you can. Rate Us: 5 STARS! If you enjoyed the talk. Or give feedback
  180. Images and Licensing
         Images of geese, clouds, pilots, plane, arrows, cup, airport traffic control tower are property of Mykyta Protsenko and Alex Borysov, if not stated otherwise (see below). All Rights Reserved.
         Other images used:
         Slide #5: commons.wikimedia.org/wiki/File:FEMA_-_16381_-_Photograph_by_Bob_McMillan_taken_on_09-28-2005_in_Texas.jpg - Picture by Bob McMillan, the US federal government work, public domain
         Slide #6: www.flickr.com/photos/carbonnyc/3290528875 - Picture by David Goehring. Attribution 2.0 Generic (CC BY 2.0): creativecommons.org/licenses/by/2.0 - changes were made
         Slide #7: www.flickr.com/photos/carbonnyc/3290528875 - Picture by Camerafiend. Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0): creativecommons.org/licenses/by-sa/3.0/deed.en - no changes were made
         Slides ##8, 9, 134, 135: commons.wikimedia.org/wiki/File:Titanic_sinking,_painting_by_Willy_St%C3%B6wer.jpg - Willy Stöwer. Public domain work of art
  181. Images and Licensing
         Slides ##8, 10, 13: www.flickr.com/photos/22608787@N00/3200086900 - Picture by Greg Lam Pak Ng. Attribution 2.0 Generic (CC BY 2.0): creativecommons.org/licenses/by/2.0 - no changes were made
         Slides ##15-22, 29-33, 67, 76-79, 93-101, 115-117, 122-128, 136-140, 143-146, 151-152: Blue Game Boy Color by kure: piq.codeus.net/picture/31994/Blue-Game-Boy-Color - Attribution 3.0 Unported (CC BY 3.0): creativecommons.org/licenses/by/3.0 - changes were made
         Slides ##93-101: The Sun by Vinicius615: piq.codeus.net/picture/191706/The-Sun - Attribution 3.0 Unported (CC BY 3.0): creativecommons.org/licenses/by/3.0 - changes were made
         Slide #112: Picture by Alex Borysov. Attribution 2.0 Generic (CC BY 2.0): creativecommons.org/licenses/by/2.0
  182. Images and Licensing
         Slide #140: piq.codeus.net/picture/254492/CVsantahat - Santa hat for CommanderVideo, CVsantahat by anonymous - Attribution 3.0 Unported (CC BY 3.0): creativecommons.org/licenses/by/3.0 - no changes were made
         Slide #152: piq.codeus.net/picture/423109/UFO - UFO by anonymous - Attribution 3.0 Unported (CC BY 3.0): creativecommons.org/licenses/by/3.0 - no changes were made
         Slides #166, 167: piq.codeus.net/picture/334023/beer - beer by Investa - Attribution 3.0 Unported (CC BY 3.0): creativecommons.org/licenses/by/3.0 - changes were made
  183. Images and Licensing
         Slides #166, 167: piq.codeus.net/picture/444498/Beer-Bottle - Beer Bottle by jacklrj - Attribution 3.0 Unported (CC BY 3.0): creativecommons.org/licenses/by/3.0 - changes were made
         Slide #171: https://piq.codeus.net/picture/330338/Deal-With-It - Deal With It by Shiro - Attribution 3.0 Unported (CC BY 3.0): creativecommons.org/licenses/by/3.0 - changes were made
