Beyond the Buzzwords - Duncan Winn, Keith Strini, Sean Keery
Originally delivered at Cloud Foundry Summit Europe 2017 Basel Switzerland October 11, 2017
1. Beyond The Buzzwords
Duncan Winn | Platform Engineering | @duncwinn
Sean Keery | Minister of Chaos | @zgrinch
Keith Strini | Federal Practice Lead | @pivotal
9. verb
Estimate the monetary worth of (something): Hard ROI
• Removing Spend
• Hardware / Middleware / OS Reduction
• Automation
What is Value
noun
The importance, worth, or usefulness of something
• Faster Time to Market
• Innovation
• Delighted Customers
10. Effective Use of CAPEX
Eliminating technical debt earlier,
validated features,
continuous product evolution
reflecting changing user base
● Data driven decisions
● Higher customer spend ratio
per investment dollar
● Lower overall subscription
churn
● Less “restarts” more evolution
Continuous Experimentation
Reducing the risk of building the
wrong thing while nimbly
changing direction
● Distributed Tracing/Shared
Context (Fast Feedback)
● Identify & test assumptions
● Direct feedback to
Design/CFO/CEO
● Lower CAPEX per hypothesis
Cloud Native
Enablement
Cloud Native Org
PRACTICES PRACTICES
Waste Reduction
Leveraging a Platform with
cloud-ready workloads to
remove delivery constraints
● Paired Programming
● CI/CD and better QA/TDD
● Rel-Eng Intelligence
● Automated Resilient Ops
Operations + App Transformation
PRACTICES
Cloud Native ROI Continuum
12. Code Deploy Prod Support
Work Flow
Value Stream
Mapping
Request Delivery
13. Muda Type I
Non-value added activity, necessary for end customer
Muda Type II
Non-value added activity, unnecessary for end customer
What is muda - 無駄
Any process that consumes more resources than needed
28. SLIs and SLOs are crucial elements in the control loops used to identify
systemic value:
Monitor and measure the system’s SLIs.
Compare the SLIs to the SLOs, and decide whether or not action is needed.
If action is needed, figure out what needs to happen in order to meet the
target.
Take that action.
Review SLOs.
Continuous Experimentation
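The control loop on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the speakers' implementation: the `Slo` class, the `read_sli` callback, the metric names, and the targets are all hypothetical, and the readings are made up.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Slo:
    """Hypothetical SLO: the SLI must stay at or below `target`."""
    sli_name: str
    target: float


def control_loop_step(slos: List[Slo], read_sli: Callable[[str], float]) -> Dict[str, str]:
    """Run one pass of the loop: measure each SLI, compare it, decide."""
    decisions = {}
    for slo in slos:
        value = read_sli(slo.sli_name)      # monitor and measure the SLI
        if value <= slo.target:             # compare the SLI to the SLO
            decisions[slo.sli_name] = "ok"
        else:
            decisions[slo.sli_name] = "action needed"
    return decisions


# Made-up readings standing in for a real metrics source.
sample_readings = {"request_latency_ms": 120.0, "error_rate": 0.001}
slos = [Slo("request_latency_ms", 100.0), Slo("error_rate", 0.01)]
decisions = control_loop_step(slos, sample_readings.__getitem__)
print(decisions)
```

Working out and taking the action, and reviewing the SLOs afterwards, stay with the team; the sketch only covers the measure, compare, and decide steps.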
31. Our Service Level Agreement will be “Real Time Readiness of
the Platform.”
Our Service Level Indicators and Objectives:
Cell Rep Time Synch < 5m
BBS Time to Run LRP Convergence > 10m
Auctioneer App Instances Placement Failures > 0.5
Auctioneer Task Placement Failures > 0.5
Waiting or Delays
Latency
32. Our SLA will be “Proactive
Security Mitigation.”
Our Service Level Indicators and Objectives:
Number of Authn Errors > 10 attempts
Number of Failed Logons > 4 attempts
Number of Forbidden SSH Sessions > 2
Defects
Errors
33. Our SLA will be “Proactive Scaling of the Platform.”
Our SLIs and SLOs:
Unhealthy Cells = 0
Remaining Memory - Cell Memory Chunks Available > 4
Remaining Memory - Overall Memory Available > 4
Over-production or Extra Features
Saturation
34. Transportation or Handoffs
Traffic
Our SLA will be “Proactive Scaling of our Apps.”
Our SLIs and SLOs:
Router Throughput > 10000 rps
# of Requests per Application Instance > 1000 rps
# of Requests per Application Function > 100 rps
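One way to read the per-instance threshold on this slide is as a capacity rule: if each application instance should stay at or below 1000 requests per second, the instance count follows from the observed traffic. The sketch below assumes that reading. `instances_needed` and the traffic figure are hypothetical, and this is not the Cloud Foundry autoscaler API.

```python
import math


def instances_needed(total_rps: float, max_rps_per_instance: float = 1000.0) -> int:
    """Smallest instance count that keeps each instance at or below the limit."""
    if total_rps <= 0:
        return 1  # keep at least one instance running
    return max(1, math.ceil(total_rps / max_rps_per_instance))


# Hypothetical observed traffic for one application.
needed = instances_needed(4500)
print(needed)
```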
39. Company Objective 1 - Release Money: Acquire + Retain Customers
- Key Result 1 - 40% Redux in OPEX
- Key Result 2 - 20% more efficient use of CAPEX for new customer acquisition
Company Objective 2 - Capture and Retain New Market Share
- Key Result 1 - 3 new revenue generating products / quarter
- Key Result 2 - 10% lower churn in new user base vs existing product churn
What If?
WHAT IF {CLICK}
For everything you build….. {CLICK}
You can measure the VALUE. And by value we don’t mean indicative value (such as simply believing this feature is important) {CLICK}
we mean quantitative value - the feature has produced this much revenue.
{CLICK}
This talk will unpack how,
through using Cloud Foundry and
BUILDING a continuous feedback loop to measure the correct set of Service Level Indicators,
You can start to model your software delivery practices around MEASURABLE value.
{CLICK}
To backup a little:
The title of this talk, “buzzword bingo”, came about because my colleagues and I have spent a considerable portion of our careers explaining buzzwords to people.
By buzzwords, for anyone who’s not heard my previous buzzword talk, I don’t just mean the literal definition. I mean
actually taking the time to understand the value intended behind these various trends in IT.
For example, Agile is more than sprints and stand-ups; it’s about getting features into the hands of end users quickly.
DevOps being an actual culture and not just a tool chain.
And to that point, as some of you know - I wrote a book on Cloud Foundry and how CF supports these various trends
{CLICK}
Once you have all these buzzwords down: You’ve
chosen your IaaS layer and built a hardened CF environment
You’ve defined your container strategy and leveraged your app architecture - probably involving some level of microservices
You’ve structured your teams to adopt a DevOps culture with Agile delivery - underpinned by a centralised platform operations team
You use pipelines to deploy EVERYTHING
At this point you are probably feeling pretty good about yourself, as you’ve obtained Cloud Native Jedi status
{CLICK}
But to get here comes at a cost - both in terms of time and resource.
And at this point folks often start to question the value.
{CLICK}
Now the first question to ask here is…
{CLICK}
WHY do we care - why quantify the value?
This is a question you should ask yourselves daily in whatever you do. Why should people care about the value you deliver? If you don’t know the answer, or at least can’t find it out, maybe you should stop doing what you are doing.
{CLICK}
You’ve built out your Cloud Native Ecosystem
{CLICK}
As a platform operator and platform champion you are really proud of this environment, so you start to tell other LOBs about it, and then many, many developers like it and start to use it.
{CLICK}
At this point the senior execs tend to take notice, and they start to question whether this is a strategy that is good only for select LOBs or whether they should roll it out to all the other LOBs -
and to make that decision they often want to know “the value”.
{CLICK}
So how do you start to measure and quantify the value behind this decision?
Before we jump into measuring and quantifying the value behind your Cloud Native decision,
we need to spend a bit of time really unpacking what we mean by value.
{CLICK}
When we talk about Value it means different things to different people so let’s break it down:
{CLICK}
There is indicative value - the noun - which is harder to measure and quantify - and because it can be intangible - it may be harder to realise or quantify monetary gain as a result.
{CLICK}
The noun is where most Cloud Foundry talks spend their time - these amazing bold statements that what used to take 10 months can now be done in 10 weeks.
{CLICK}
Then there is the verb - the doing - the realisation of value.
{CLICK}
The verb is the Hard ROI - for example the removal of fixed costs
For this talk we are going to look at the progression of realising value, which is really the Cloud Native ROI Continuum
SEAN :
It starts with Waste Reduction
The first is waste reduction (both OPEX and CAPEX reduction)
So we start our value journey with waste reduction.
This is the most fundamental task to tackle
{CLICK}
The best way to quantify waste reduction is through an activity called value stream mapping.
Value stream mapping is a lean-management method for analyzing the current state and then designing a future state.
{CLICK}
A value stream is defined as the flow of work from a request to the DELIVERY.
{CLICK}
There is an ethos in LEAN around waste reduction.
VSM is primarily concerned with reducing as much waste as possible. The idea stems from the Japanese word muda (pronounced moo-da).
Example - platform upgrades (for a CVE) / packaging a release.
LEAN theory started off in manufacturing and went on to become popular within DevOps communities. And it’s easy to see why:
each value stream typically crosses multiple functions within an organisation.
{CLICK}
It’s easy to see that by aligning to a DevOps culture you can begin to reduce handoffs and eliminate waste.
So the adoption of DevOps is one way of reducing waste.
But couple DevOps with using CF in the correct way and you achieve a significant amount of waste reduction.
And that’s because CF offers a unified platform strategy:
a single point of convergence that everyone rallies behind to obtain a distinct set of benefits.
{CLICK}
You get increased speed through reducing waste in setting up new environments.
I have a question for the audience: without using CF how long does it take to provision a VM?
You ask an operator and typically they’ll say something in the order of minutes to hours.
You ask a developer and they will typically tell you a matter of weeks.
The worst timeline for provisioning a VM - I kid you not - is 18 months.
With Cloud Foundry your environment is already set up - no need to raise a ticket and get a VM and middleware set up. This dynamic provisioning of environments reduces operational work and saves considerable time.
{CLICK}
For stability - BOSH and CF are self-healing - and further - you can easily leverage deployment patterns such as blue/green and feature flags. This saves operational resource.
{CLICK}
You can auto scale the platform and apps based on set criteria, and you remove the need for manual tasks like route configuration for every new app.
{CLICK}
And finally but arguably most importantly - the security posture gets significantly strengthened.
So what that means for reducing waste is that, through a mix of increased automation and self-service, Cloud Foundry provides
{CLICK}
better resource consolidation, resulting in hardware and software reduction and more efficient use of operators’ time and skills.
Value stream mapping assesses the events required to take a feature from inception to production.
So what else can be done?
Remove all non-value-add activities
Make lead time = process time
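The goal "lead time = process time" can be expressed as a ratio, sometimes called flow efficiency: total process time divided by total lead time. A minimal sketch follows; the step names and hours for this VM request are made up for illustration and do not come from a real value stream map.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    name: str
    process_hours: float   # time someone actually works on the item
    wait_hours: float      # time the item sits in a queue before this step


def flow_efficiency(steps: List[Step]) -> float:
    """Process time as a fraction of lead time (1.0 means no waiting)."""
    process = sum(s.process_hours for s in steps)
    lead = sum(s.process_hours + s.wait_hours for s in steps)
    return process / lead if lead else 0.0


# Illustrative numbers only.
vm_request = [
    Step("raise ticket", 0.5, 0.0),
    Step("approval", 0.5, 40.0),
    Step("provision VM", 1.0, 16.0),
    Step("install middleware", 2.0, 24.0),
]
efficiency = flow_efficiency(vm_request)
print(f"{efficiency:.1%}")
```

With these numbers, 4 hours of work sit inside 84 hours of lead time, so removing the waits matters far more than speeding up the work itself.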
The best way to understand where problems start is by performing an activity called value stream mapping.
Every organization has many value streams, defined as the flow of work from a customer request to the fulfillment of that request.
Each value stream will cross multiple functions within an organization
Value stream mapping is a lean-management method for analyzing the current state and designing a future state for the series of events that take a product or service from its beginning through to the customer.
And it’s important to map out the entire flow from request to delivery for every Value stream that feeds into software delivery.
We’ve done this for some of our customers.
When you actually sit down and map out the effort that goes into something simple like creating a VM - the results can be staggering.
Introduce yourself
Now you’ve become fully cloud native - what about realizing the value?
Going Cloud Foundry is not free - it takes time, effort, and change - all things people often associate with pain.
So what is the value?
Buzzwords
Google SRE
Golden Signals?
Resilient operations
Bottom three roll up to test-driven operations
Add scaling events to muda
Example : Number of people in the room
Most services consider request latency—how long it takes to return a response to a request—as a key SLI. Other common SLIs include the error rate, often expressed as a fraction of all requests received, and system throughput, typically measured in requests per second. The measurements are often aggregated: i.e., raw data is collected over a measurement window and then turned into a rate, average, or percentile.
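The aggregation step described above (raw data over a measurement window turned into a rate, average, or percentile) might look like the sketch below. The latency samples, the window length, and the nearest-rank `percentile` helper are assumptions made for illustration.

```python
import math
import statistics
from typing import List


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical request latencies (ms) collected during one window.
window_ms = [12, 15, 11, 240, 14, 13, 18, 16, 12, 17]
window_seconds = 10

summary = {
    "rate_rps": len(window_ms) / window_seconds,
    "mean_ms": statistics.mean(window_ms),
    "p50_ms": percentile(window_ms, 50),
    "p90_ms": percentile(window_ms, 90),
}
print(summary)
```

A single slow request pulls the mean well above the median here, which is one reason percentiles are often preferred over averages for latency SLIs.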
Ideally, the SLI directly measures a service level of interest, but sometimes only a proxy is available because the desired measure may be hard to obtain or interpret. For example, client-side latency is often the more user-relevant metric, but it might only be possible to measure latency at the server.
Another kind of SLI important to SREs is availability, or the fraction of the time that a service is usable. It is often defined in terms of the fraction of well-formed requests that succeed, sometimes called yield. (Durability—the likelihood that data will be retained over a long period of time—is equally important for data storage systems.) Although 100% availability is impossible, near-100% availability is often readily achievable, and the industry commonly expresses high-availability values in terms of the number of “nines” in the availability percentage. For example, availabilities of 99% and 99.999% can be referred to as “2 nines” and “5 nines” availability, respectively, and the current published target for Google Compute Engine availability is “three and a half nines”—99.95% availability.
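As a small worked example of the definitions above, availability can be computed as yield, and the number of "nines" as the negative base-10 logarithm of the unavailable fraction. The function names and request counts are hypothetical.

```python
import math


def availability(successful: int, total: int) -> float:
    """Fraction of well-formed requests that succeeded (yield)."""
    if total == 0:
        raise ValueError("no requests in the measurement window")
    return successful / total


def nines(avail: float) -> float:
    """Number of nines, e.g. 0.99 -> 2.0 (rounded to 2 decimals)."""
    if not 0.0 <= avail < 1.0:
        raise ValueError("availability must be in [0, 1)")
    return round(-math.log10(1.0 - avail), 2)


print(nines(0.99))
print(nines(0.99999))
print(nines(availability(9995, 10000)))
```

Note that 99.95% comes out at about 3.3 on this logarithmic scale; "three and a half nines" is the informal label for a figure halfway between 99.9% and 99.99%.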
Example: Duncan thought we could get 40. Stretch goal was 50
A natural structure for SLOs is thus SLI ≤ target, or lower bound ≤ SLI ≤ upper bound. For example, we might decide that we will return Shakespeare search results “quickly,” adopting an SLO that our average search request latency should be less than 100 milliseconds.
Alert on these.
Choosing an appropriate SLO is complex. To begin with, you don’t always get to choose its value! For incoming HTTP requests from the outside world to your service, the queries per second (QPS) metric is essentially determined by the desires of your users, and you can’t really set an SLO for that.
On the other hand, you can say that you want the average latency per request to be under 100 milliseconds, and setting such a goal could in turn motivate you to write your frontend with low-latency behaviors of various kinds or to buy certain kinds of low-latency equipment. (100 milliseconds is obviously an arbitrary value, but in general lower latency numbers are good. There are excellent reasons to believe that fast is better than slow, and that user-experienced latency above certain values actually drives people away— see “Speed Matters” [Bru09] for more details.)
Again, this is more subtle than it might at first appear, in that those two SLIs—QPS and latency—might be connected behind the scenes: higher QPS often leads to larger latencies, and it’s common for services to have a performance cliff beyond some load threshold.
Choosing and publishing SLOs to users sets expectations about how a service will perform. This strategy can reduce unfounded complaints to service owners about, for example, the service being slow. Without an explicit SLO, users often develop their own beliefs about desired performance, which may be unrelated to the beliefs held by the people designing and operating the service. This dynamic can lead to both over-reliance on the service, when users incorrectly believe that a service will be more available than it actually is (as happened with Chubby: see “The Global Chubby Planned Outage”), and under-reliance, when prospective users believe a system is flakier and less reliable than it actually is.
Example: if we didn’t get 50 people signed up, Sean will trim his beard
The consequences are most easily recognized when they are financial—a rebate or a penalty—but they can take other forms. An easy way to tell the difference between an SLO and an SLA is to ask “what happens if the SLOs aren’t met?”: if there is no explicit consequence, then you are almost certainly looking at an SLO.
Find your biggest waste, eliminate it.
Hypothesize on your next one. Validate through metrics.
Fix it.
Re-evaluate your objectives periodically.
Inspired by the balanced scorecard
We’ll map the five measurable value components to our SLAs, SLIs, and SLOs:
“VC1.1.1.1: Functionality”, “VC1.1.1.2: Reliability”, “VC1.1.1.3: Usability”, “VC1.1.1.4: Maintainability” and “VC1.1.1.5: Portability”
Value perspective (VP) Value aspect (VA) Subvalue aspect (SVA) Value component (VC)
An excerpt from the Software Value Map (Khurum et al., 2012)
Example value lost through waste: Can’t push my bug fix for app
People waiting for information:
• Waiting for complete requirements from the customer
• Waiting for months for project approval
• Waiting for resources to be assigned
• Waiting for the whole system to be done before one can get the key features that he/she really needs
Information waiting for people:
• Detailed requirements specification created upfront
Latency in your systems - try some distributed tracing to find your bottlenecks
Cycle time
Whenever goods are not in transport or being processed, they are waiting. In traditional processes, a large part of an individual product's life is spent waiting to be worked on.
Metric: Time to deliver platform updates; downtime; lag between updates and upgrades of platforms (stemcells, buildpacks, versions).
o Cell Rep Time Synch < 5m
o BBS Time to Run LRP Convergence > 10m
o Auctioneer App Instances Placement Failures > 0.5
o Auctioneer Task Placement Failures > 0.5
o Cloud Controller and Diego in Synch
o Router Server Error
o Router Error: 502 Bad Gateway
Example: $300k AWS bill
• Leaving testing towards the end
• Not finding defects as early as possible
• Lack of disciplined reviews, tests, verification
Whenever defects occur, extra costs are incurred reworking the part, rescheduling production, etc. This results in labor costs and more time spent as work-in-progress. Defects in practice can sometimes double the cost of one single product. This should not be passed on to the consumer and should be taken as a loss.
Metric: Failed BOSH deployments; Failed application starts; Failed stagings; Application errors; Routes not found; Failing smoke & acceptance tests.
- If SLA is Proactive Security Mitigation
- SLIs are
o Number of Authn errors > _attempts
o Number of failed logons > _attempts
o Number of forbidden SSH sessions > _attempts
Example: Black Friday crash - Target
Overproduction occurs when more product is produced than is required at that time by your customers.
Tie this to capacity planning in Dickerson hierarchy
Functionality that is not required by the customer
Features for which markets are not ready
Do you need 18 foundations or five availability zones?
Metric: Cell capacity used
Creation of unnecessary data and information
One common practice that leads to this muda is the production of large batches, as often consumer needs change over the long times large batches require. Overproduction is considered the worst muda [8] because it hides and/or generates all the others. Overproduction leads to excess inventory, which then requires the expenditure of resources on storage space and preservation, activities that do not benefit the customer.
Router Latency > 50_ms
o VM Health
o VM Memory Used > 80%
o VM CPU utilization > 85%
Example - Self-inflicted DDoS through a retry storm without a circuit breaker or exponential backoff
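To make the retry storm example concrete, here is a minimal sketch of exponential backoff with jitter, one of the two mitigations named above (a circuit breaker is not shown). The function name, base delay, and cap are assumptions chosen for illustration.

```python
import random
from typing import Iterator, Optional


def backoff_delays(attempts: int, base: float = 0.5, cap: float = 30.0,
                   rng: Optional[random.Random] = None) -> Iterator[float]:
    """Yield one jittered delay (in seconds) per retry attempt.

    Each delay is drawn uniformly from [0, min(cap, base * 2**attempt)],
    so clients that failed together do not all retry at the same moment.
    """
    rng = rng or random.Random()
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng.uniform(0.0, ceiling)


# Seeded only so the example is repeatable.
delays = list(backoff_delays(8, rng=random.Random(42)))
print([round(d, 2) for d in delays])
```

The growing ceiling spreads retries out over time, and the random jitter keeps many clients from hitting the router in lockstep.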
• Difficulty transferring tacit knowledge (for example, design decisions and rationale)
• Incompatible information types (drawings vs. digital descriptions)
• Incompatible software systems or tools
• Lack of availability, knowledge, or training in conversion and linking systems
• Lack of access
• Information silos
Each time a product is moved it stands the risk of being damaged, lost, delayed, etc. as well as being a cost for no added value. Transportation does not make any transformation to the product that the consumer is willing to pay for.
I’ve bounded my discussion of value stream mapping to platform operations. Keith is going to talk about how this fits into your product development value stream.
OPEX reduction is an excellent opening line, but it isn’t the true promise of the conversation.
We need to move the conversation forward from the reduction in software capitalization to the effective use of CAPEX.
This is achieved by making hypothesis market testing a first class citizen in feedback loops to CFO/CEO decision making.
This means observing users dealing with a problem, creating assumptions, continuously validating those assumptions, and estimating the value of a feature that users recognize will solve the problem and that they are willing to pay or subscribe for.
This means features that don't resonate get sunset quicker, which reduces technical debt sooner while increasing stickiness through continued budget for development and evolution.
The savings from pruning technical debt early are quickly reallocated to new hypotheses.
Product development can remain objective and focused on what the users are telling them rather than guessing product directions based on unvalidated assumptions. This allows them to turn what the instrumentation has discovered into actionable insights.
It liberates the product and the company, enabling them to evolve with customers’ changing values and preferences.
Even after product features have been delivered, the granularity of microservices design promises the clean modularity needed to keep collaborating with, and learning from, your evolving user base.
Circumstances change; products built this way can remain relevant long after the initial benefits.
A/B and B/G testing tell us yes/no but not the why. In conjunction with microservices and distributed tracing, they can be used as primitives to build up complex hypotheses that are objectively instrumented. That lets us verify what we believed going into each release, which behavior was validated, and which behavior was invalidated.
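The "yes/no" answer from an A/B test is typically a significance check on conversion rates. One common way to do that is a two-proportion z-test, sketched below. The visitor and conversion counts are made up, and the function names are hypothetical.

```python
import math


def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """z statistic for the difference in conversion rate, B minus A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    # Standard error under the null hypothesis that both rates are equal.
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se


def p_value_two_sided(z: float) -> float:
    """Two-sided p-value from the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


# Made-up experiment: variant A converts 200 of 5000, variant B 260 of 5000.
z = two_proportion_z(200, 5000, 260, 5000)
p = p_value_two_sided(z)
print(round(z, 2), round(p, 4))
```

This only says whether B outperformed A. The why still has to come from the instrumentation, such as tracing individual user journeys through the services involved.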
Marketing shifts from telling users how to use the product to responding to how users want to use the product, creating a stickier user base and attracting new subscribers.
Jobs Theory alludes to this aspect in its postulation that users don’t simply use products; they hire products to complete a job for them. The job is the progress that a person is trying to make in a particular circumstance.
Sean’s live demo
New World of Customer Understanding
Instrumenting highest bounce rate services with gamification to gather motivation and intent
Clearly understand abandonment points in your funnel
Incentivize exit survey behaviors
Clearly understand what is most valuable to your product
B/G deployments help isolate the attraction and understand how to expand the attraction to other areas in the product
UI/UX Heat Maps combined with A/B with Navigation and Design Layout Variations
Is it a UI/UX problem or usability problem
A/B with entire workflows
Understand which threshold signifies the conversion tipping point
Measure impact through the entire funnel not just the application as a whole
All of this is now measurable with the granularity of a cloud native/microservices architecture and the fact that platforms like PCF offload the technical complexity of managing them.
This type of agility to respond to a market is the panacea of all of this buzz.
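Measuring impact through the entire funnel, and finding the abandonment point, comes down to step-to-step conversion rates. A minimal sketch follows; the stage names and counts are invented for illustration and would in practice come from per-service instrumentation.

```python
from typing import List, Tuple

Stage = Tuple[str, int]


def funnel_report(stages: List[Stage]) -> List[Tuple[str, float]]:
    """Step-to-step conversion rate for each transition in the funnel."""
    report = []
    for (prev_name, prev_count), (name, count) in zip(stages, stages[1:]):
        rate = count / prev_count if prev_count else 0.0
        report.append((f"{prev_name} -> {name}", rate))
    return report


def biggest_drop(stages: List[Stage]) -> str:
    """Transition with the lowest conversion rate (main abandonment point)."""
    return min(funnel_report(stages), key=lambda item: item[1])[0]


# Invented counts of users reaching each stage.
stages = [("landing", 10000), ("signup", 2500), ("trial", 2000), ("subscribe", 400)]
for transition, rate in funnel_report(stages):
    print(f"{transition}: {rate:.0%}")
worst = biggest_drop(stages)
print(worst)
```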