SlideShare une entreprise Scribd logo
1  sur  47
NICTA Copyright 2012 From imagination to impact
Supporting Operations
Personnel: A Software
Engineering
Perspective
Len Bass
NICTA Copyright 2012 From imagination to impact
2
About NICTA
National ICT Australia
• Federal and state funded research
company established in 2002
• Largest ICT research resource in
Australia
• National impact is an important
success metric
• ~700 staff/students working in 5 labs
across major capital cities
• 7 university partners
• Providing R&D services, knowledge
transfer to Australian (and global) ICT
industry
NICTA technology is
in over 1 billion mobile
phones
NICTA Copyright 2012 From imagination to impact
Traditional View from Software Engineers
3
Application
Cloud
Environment
Traditionally, the software engineering community
has viewed systems as being developed for users
and existing in an environment. The motivating
questions have been: With this world view: how can
development costs be reduced and run time quality
improved?
End users
Developers
NICTA Copyright 2012 From imagination to impact
A Broader View
4
Application
Cloud
Environment
Applications are not only affected by the behavior of the
end users but also by actions of operators who control
the environment for a consumer’s application.
Consumer
Operator
End users
Developers
NICTA Copyright 2012 From imagination to impact
My Message: Consider the Operator in this
Picture
5
Application
Cloud
Environment
Consumer
Operator
End users
Developers
Computer operations is a domain that impacts every
application that operates in an enterprise environment. As
such, Software Engineers need to be aware of how
actions of operators can affect their application and how
actions of their application can simplify life for operators..
NICTA Copyright 2012 From imagination to impact
Business Context
“Through 2015, 80% of outages impacting mission-critical
services will be caused by people and process issues, and
more than 50% of those outages will be caused by
change/configuration/release integration and hand-off
issues.”
Change/configuration/release integration and hand off are
all operations issues.
Gartner - http://www.rbiassets.com/getfile.ashx/42112626510
"I&O [Infrastructure and operations] represents
approximately 60 percent of total IT spending worldwide, "
http://www.gartner.com/it/page.jsp?id=1807615
6
NICTA Copyright 2012 From imagination to impact
Outline
• Overview of operations domain
– What do operators do?
– What can go wrong with what they do?
• Some results NICTA has achieved or activities
we have ongoing
7
NICTA Copyright 2012 From imagination to impact
What Do Operators Do?
8
Akamai’s NOC in Cambridge, Massachusetts
• Monitor and control data center/network/system
activity
– Install new/upgraded
applications/middleware/configurations/hardware
• Support business continuity through back ups
and disaster recovery
NICTA Copyright 2012 From imagination to impact
Monitor and Control
• Data Center
– Total number and type of resources (may be virtual)
• Processors
• Storage
• Network
• Network
– Intrusion detection
– Routing
– Loading
• System
– Allocation to resources
– Install/uninstall
– Configure 9
NICTA Copyright 2012 From imagination to impact
What can go Wrong with Monitor and
Control?
Everything that was on previous slide.
• Failure
• Installations can fail
• Resources fail and must be replaced
• Overload
– Resources are over/under loaded and must be
supplemented/removed
– Networks get overloaded and routing must be changed
• Error
– Routing may be incorrectly specified
– Allocation of systems to resources may be incorrect
– Configurations can be incorrectly specified
10
NICTA Copyright 2012 From imagination to impact
Install New/Upgraded Applications
• Specifying configuration for applications
• Synchronizing state for upgraded applications
• Testing new/upgraded applications in target
environment
• Allocating resources for new version
11
NICTA Copyright 2012 From imagination to impact
What Can go Wrong with Installation?
• Again its everything.
– Configuration can be misspecified
– Cut over to new version may leave inconsistent state
– Upgrade to level N of the stack may break software in
level >N of the stack
– Testing environment may not appropriately mirror real
environment
– Configuration of one level of the stack may be
inconsistent with requirements of another level.
12
NICTA Copyright 2012 From imagination to impact
Supporting Business Continuity
• Disasters happen – natural or human causes
• Backing up data provides recovery possibility
– Lag between last version backed up and when
disaster happens
– In the Cloud, backing up large amounts of data to
different geographic regions takes time.
13
NICTA Copyright 2012 From imagination to impact
Hand Offs
• Problems can arise when a shift changes
– What problems did old shift deal with?
– What problems were totally solved?
– What problems were partially solved?
– What operations activities are currently ongoing?
14
NICTA Copyright 2012 From imagination to impact
Operations is a Target Rich Environment
• There are many existing tools. Operation of data
centers would not work without tools
• Much room for improvement (see Gartner quote)
• Some general approaches for improvement
– Make software systems operations and tools process and
incident aware. E.g. make them aware of upgrade or shift
change
– Model operations processes and systems using a single model.
• Model analysis will provide opportunities for detecting trade offs between
human and automated activities.
• Model might also enable smoother error detection
15
NICTA Copyright 2012 From imagination to impact
Outline
• Overview of operations domain
• Some results we have achieved or activities
we have ongoing
– Disaster Recovery product
– Upgrade
– Operator undo
– Installation process.
16
NICTA Copyright 2012 From imagination to impact
Disaster Recovery
• Clouds fail – Amazon had three outages in 2011
that affected whole availability zones or regions.
• NICTA has a subsidiary (Yuruware) with a non-
intrusive disaster recovery product (Bolt).
• Bolt copies data periodically to a back up region.
• Bolt utilizes sophisticated data movement
techniques to reduce time required to back up
• This is an insurance policy.
17
NICTA Copyright 2012 From imagination to impact
Next Problem – Upgrade
• Upgrades are a very common occurrence
• Upgrade frequency of some common systems
• Some systems have multiple releases per day,
driven by developers – continuous deployment
18
Application Average release interval
Facebook (platform) < 7 days
Google Docs <50 days
Media Wiki 21 days
Joomla 30 days
NICTA Copyright 2012 From imagination to impact
Various Upgrade Strategies
• How many at once?
– One at a time (rolling upgrade)
– Groups at a time (staged upgrade, e.g. canaries. This
is using production environment for testing)
– All at once (big flip)
• How long are new versions tested to determine
correctness?
– Period based – for some period of time
– Load based – under some utilization assumptions
• What happens to old versions?
– Replaced en masse
– Maintained for some period for compatibility purposes
19
NICTA Copyright 2012 From imagination to impact
Having Multiple Versions Simulaneously
Active May Lead to Mixed Version Race
Condition
20
Server 2 (new
version
3
4
X ERROR
Initial request
Client (browser)
Server 1 (old
version
1
2
5
Start rolling
upgrade
HTTP reply with
embedded JavaScript
AJAX callback
NICTA Copyright 2012 From imagination to impact
One Method for Preventing Mixed Version Race Condition
is to Make Load Balancers Version Aware
Client may
request
particular version
of a service
External facing
Router (wrt to
cloud)
Internal Router
Server for
Version A
Server for
Version A
Server for
Version B
Internal Router
Server for
Version A
Server for
Version B
21
At each level of the routing hierarchy
there are two possibilities for each
request
• Request is neutral with respect to
version
• Request specifies version
Routing must
• Be fast to ensure rapid response
• Satisfy “goodness” criteria for
scheduling
• Conform to client request wrt
version.
In addition:
• Servers are being
upgraded to a later
version while servicing
client requests
• Load variation may
trigger elasticity rules
NICTA Copyright 2012 From imagination to impact
What is Criterion for Measuring Load
Balancer Scheduling?
• What is “goodness” with respect to routing
decisions within the constraints of scheduling
strategy and version awareness?
– Uniform distribution of requests?
– Keeping utilization within bounds?
– Utilizing wide variety of clients?
– Other?
• Main result so far. Version awareness is
incompatible with any of the above “goodness”
criteria for the staged upgrade strategy.
22
NICTA Copyright 2012 From imagination to impact
Canary or Staged Strategy
• Upgrade one or several servers to new version
and leave them for some time.
• Formulation:
– Staged upgrade
• M version A servers (constant number)
• N version B servers (constant number)
• Fixed number of clients
– Version aware
• Once a client has had a request serviced by a version B
server it cannot subsequently have any requests serviced by
any version A server.
23
NICTA Copyright 2012 From imagination to impact
Bifurcation of Clients
• Clients are bifurcated into version A clients and
version B clients after some time
– Intuitively, for each client, either it is serviced by a
server with version B and consequently never served
by any server with version A or never served by a
server with version B. So each client ends in up the
Server A class or the Server B class but not both.
• We call clients that end up being serviced by
services with version A, class A clients.
Similarly, for class B clients.
• Allowing additional clients does not
fundamentally change result.
24
NICTA Copyright 2012 From imagination to impact
Bifurcation of Clients Implies
• Cannot control for utilization unless create new instances
of version B in response to demand
– There are a fixed number of clients sending requests to a fixed
number of servers with version B. Cannot vary the number of
servers to reflect the load generated by the fixed set of clients.
Consequently cannot control the utilization by servers with
version B.
• Cannot control for uniform distribution.
– Uniform distribution means that every request has an equal
change of being sent to any server. If a client is in class A, then it
has 0% chance of being sent to a server with Version B.
• Difficult to control for wide variety of clients.
– Variations among the clients must be mirrored within class A and
class B clients since the classes are fixed after the bifurcation.
This is difficult to accomplish since types of variations that are
important are usually not known.
25
NICTA Copyright 2012 From imagination to impact
Questions to Answer.
– How long does it take to reach bifurcated state under
what assumptions?
– How can the goals of staged upgrade be achieved
within the constraints of version awareness?
26
NICTA Copyright 2012 From imagination to impact
Next Problem
• Operators use scripts to perform actions such as
update
• Scripts may fail
– May be result of API failure (more on this later)
– May be desire to set up testing environment
– May be result of failure of underlying virtual machine.
• When a script fails, the operator may wish to
return to a known state (undo several
operations)
27
NICTA Copyright 2012 From imagination to impact
Operator Undo
• Not always that straight-forward:
– Attaching volume is no problem while the instance is
running, detaching might be problematic
– Creating / changing auto-scaling rules has effect on
number of running instances
• Cannot terminate additional instances, as the rule would
create new ones!
– Deleted / terminated / released resources are gone!
28
NICTA Copyright 2012 From imagination to impact
Undo for System Operators
29
+ commit
+ pseudo-delete
begin-
transaction
rollback
do
do
do
Administrator
NICTA Copyright 2012 From imagination to impact
Approach
30
begin-
transaction
rollback
do
do
do
Sense cloud
resources states
Sense cloud
resources states
Administrator
Undo System
NICTA Copyright 2012 From imagination to impact
Approach
31
begin-
transaction
rollback
do
do
do
Sense cloud
resources states
Sense cloud
resources states
Administrator
Undo System
Goal
state
Goal
state
Initial
state
Initial
state
NICTA Copyright 2012 From imagination to impact
begin-
transaction
rollback
do
do
do
Sense cloud
resources states
Sense cloud
resources states
PlanGenerate codeExecute
Administrator
Undo System
Goal
state
Goal
state
Initial
state
Initial
state
Set of
actions
Set of
actions
Approach
32
NICTA Copyright 2012 From imagination to impact
What about API Failures?
• Operator scripts make heavy use of checking or
controlling state of resources
– Start/stop VM
– Is VM active?
• These scripts becomes calls to the cloud
provider’s API.
• Calls may fail
– Underlying VM has failed
– Eventual consistency.
33
NICTA Copyright 2012 From imagination to impact
We Have Performed an Empirical Study of
API Failures in EC2
• 922 cases out of 1109 reported API-related
cases in the EC2 forum from 2010 to 2012 are
API failures (rather than feature requests or
general inquiries).
• We classified the extracted API failures into four
types of failures:
– content failures,
– late timing failures,
– halt failures, and
– erratic failures.
34
NICTA Copyright 2012 From imagination to impact
Results
• A majority (60%) of the cases of API failure are related to
stuck API calls or unresponsive API calls.
• A large portion (12%) of the cases are about slow
responsive API calls.
• 19% of the cases are related to the output issues of API
calls, including failed calls with unclear error
messages, as well as missing output, wrong output, and
unexpected output of API calls.
• 9% of the cases reported that their calls were pending
for a certain time and then returned to the original state
without informing the caller properly or the calls were
reported to be successful first but failed later.
35
NICTA Copyright 2012 From imagination to impact
Next Problem - Operations Processes
• We are looking at the process of installing new
software
– Error Prone
– Potential process improvements.
36
NICTA Copyright 2012 From imagination to impact
Motivating Scenario
• You change the operating environment for an
application
– Configuration change
– Version change
– Hardware change
• Result is degraded performance
• When the software stack is deep with portions
from different suppliers, the result is frequently:
37
NICTA Copyright 2012 From imagination to impact
Why is Installation Error Prone?
• Installation is complicated.
– Installation guides for SAS 9.3 Intelligence, IBM i, Oracle 11g for
Linux are ~250 pages each
– Apache description of addresses and ports (one out of 16
descriptions) has following elements:
• Choosing and specifying ports for the server to listen to
• IPv4 and IPv6
• Protocols
• Virtual Hosts
– The number of configuration options that must be set can be
large
• Hadoop has 206 options
• HBase has 64
– Many dependencies are not visible until execution
38
NICTA Copyright 2012 From imagination to impact
Installation Processes
• Processes may be
– Undocumented
– Out of date
– Insufficiently detailed
• Our goal is to build process model including
error recovery mechanisms
39
NICTA Copyright 2012 From imagination to impact
Our Activities
40
• Create up to date process models for installation
processes. Information sources are
– Process discovery from logs
– Process formalization from existing written
descriptions.
• Process descriptions can be used to
– Make trade offs
– Make recommendations in real time to operations
staff
– Recommend setting checkpoints for potential later
undo, before a risky part of a process is entered
– Assist in the detection of errors
NICTA Copyright 2012 From imagination to impact
Hard Problems
41
• Creating accurate process models
– Exception handling mechanisms are not well
documented
– Labor intensive.
– Our approach
• Top down modeling using process modeling formalism
• Bottom up process mining from error logs
• Diagnosing errors
NICTA Copyright 2012 From imagination to impact
Why is Error Diagnosis Hard?
In a distributed computing
environment, when an error
occurs during operations, it is
difficult and time consuming to
diagnosis it.
Diagnosis involves correlating
messages from
• different distributed servers
• different portions of the
software stack
and determining the root
cause of the error.
The root cause, in turn, may
be within a portion of the stack
that is different from where the
error is observed.
NICTA Copyright 2012 From imagination to impact
Test Bed
43
Our current test bed is the Hbase stack
NICTA Copyright 2012 From imagination to impact
Currently Performing Analysis of
Configuration Errors
44
• Cross stack errors may take hours to diagnose
– Log files are inconsistent
– Error message may not give context necessary to
determine root cause.
NICTA Copyright 2012 From imagination to impact
Where to Find Information about Operations
Domain?
• Every open source program requires a variety of
configuration parameters.
• Every modern application depends on a variety
of middleware so cross domain examples should
be readily available.
• Most organizations have extensive processes for
their operations personnel. Use these processes
as a framework for investigating process/product
interactions.
45
NICTA Copyright 2012 From imagination to impact
Summary
• Operations problems will account for the majority
of outages and IT costs in the next several
years.
• The operations space is a rich source of
research problems that has been insufficiently
mined.
• Best way to determine what problems to attack
is to monitor or interview operators
46
NICTA Copyright 2012 From imagination to impact
NICTA Team
• Anna Liu
• Alan Fekete
• Min Fu
• Jim Zhanwen Li
• Qinghua Lu
• Hiroshi Wada
• Ingo Weber
• Xiwei Xu
• Liming Zhu
47

Contenu connexe

Tendances

Software/System Development Life Cycle
Software/System Development Life CycleSoftware/System Development Life Cycle
Software/System Development Life CycleHem Pokhrel
 
Engineering Systems For The Cloud
Engineering Systems For The CloudEngineering Systems For The Cloud
Engineering Systems For The CloudTrevor Warren
 
IT Ops Mgmt in the New Virtualized, Software-defined World
IT Ops Mgmt in the New Virtualized, Software-defined WorldIT Ops Mgmt in the New Virtualized, Software-defined World
IT Ops Mgmt in the New Virtualized, Software-defined WorldEMC
 
Virtual Private Data Center Solution Overview
Virtual Private Data Center Solution OverviewVirtual Private Data Center Solution Overview
Virtual Private Data Center Solution OverviewAngela Chavez
 
DevOps for Enterprise Systems : Innovate like a Startup
DevOps for Enterprise Systems : Innovate like a StartupDevOps for Enterprise Systems : Innovate like a Startup
DevOps for Enterprise Systems : Innovate like a StartupDevOps for Enterprise Systems
 
Chuck_Roden_Resume
Chuck_Roden_ResumeChuck_Roden_Resume
Chuck_Roden_ResumeChuck Roden
 
Beit 381 se lec 2 - 27 - 12 feb08
Beit 381 se lec 2 - 27 - 12 feb08Beit 381 se lec 2 - 27 - 12 feb08
Beit 381 se lec 2 - 27 - 12 feb08babak danyal
 
The Changing Role of IT: From Service Managers to Advisors
The Changing Role of IT:From Service Managers to AdvisorsThe Changing Role of IT:From Service Managers to Advisors
The Changing Role of IT: From Service Managers to AdvisorsJesse Stockall
 
Fifteen Years of DevOps -- LISA 2012 keynote
Fifteen Years of DevOps -- LISA 2012 keynoteFifteen Years of DevOps -- LISA 2012 keynote
Fifteen Years of DevOps -- LISA 2012 keynoteGeoff Halprin
 
Federating Subversion and Git
Federating Subversion and GitFederating Subversion and Git
Federating Subversion and GitCollabNet
 
Netpod - The Merging of NPM & APM
Netpod - The Merging of NPM & APMNetpod - The Merging of NPM & APM
Netpod - The Merging of NPM & APMBoni Bruno
 
Real Cost of Software Remediation
Real Cost of Software RemediationReal Cost of Software Remediation
Real Cost of Software RemediationDenim Group
 
Building The Agile Enterprise - LSSC '12
Building The Agile Enterprise - LSSC '12Building The Agile Enterprise - LSSC '12
Building The Agile Enterprise - LSSC '12Gil Irizarry
 
Chuck_Roden_Resume
Chuck_Roden_ResumeChuck_Roden_Resume
Chuck_Roden_ResumeChuck Roden
 
Four Essential Steps for Removing Risk and Downtime from POWER9 Migration
Four Essential Steps for Removing Risk and Downtime from POWER9 MigrationFour Essential Steps for Removing Risk and Downtime from POWER9 Migration
Four Essential Steps for Removing Risk and Downtime from POWER9 MigrationPrecisely
 
From monolith to resilient microservices
From monolith to resilient microservicesFrom monolith to resilient microservices
From monolith to resilient microservicesRehanvanderMerwe1
 

Tendances (20)

All About Jazz Team Server Technology
All About Jazz Team Server TechnologyAll About Jazz Team Server Technology
All About Jazz Team Server Technology
 
RRC RUP
RRC RUPRRC RUP
RRC RUP
 
Software/System Development Life Cycle
Software/System Development Life CycleSoftware/System Development Life Cycle
Software/System Development Life Cycle
 
Engineering Systems For The Cloud
Engineering Systems For The CloudEngineering Systems For The Cloud
Engineering Systems For The Cloud
 
IT Ops Mgmt in the New Virtualized, Software-defined World
IT Ops Mgmt in the New Virtualized, Software-defined WorldIT Ops Mgmt in the New Virtualized, Software-defined World
IT Ops Mgmt in the New Virtualized, Software-defined World
 
Virtual Private Data Center Solution Overview
Virtual Private Data Center Solution OverviewVirtual Private Data Center Solution Overview
Virtual Private Data Center Solution Overview
 
Software Lifecycle
Software LifecycleSoftware Lifecycle
Software Lifecycle
 
DevOps for Enterprise Systems : Innovate like a Startup
DevOps for Enterprise Systems : Innovate like a StartupDevOps for Enterprise Systems : Innovate like a Startup
DevOps for Enterprise Systems : Innovate like a Startup
 
Chuck_Roden_Resume
Chuck_Roden_ResumeChuck_Roden_Resume
Chuck_Roden_Resume
 
Beit 381 se lec 2 - 27 - 12 feb08
Beit 381 se lec 2 - 27 - 12 feb08Beit 381 se lec 2 - 27 - 12 feb08
Beit 381 se lec 2 - 27 - 12 feb08
 
The Changing Role of IT: From Service Managers to Advisors
The Changing Role of IT:From Service Managers to AdvisorsThe Changing Role of IT:From Service Managers to Advisors
The Changing Role of IT: From Service Managers to Advisors
 
Fifteen Years of DevOps -- LISA 2012 keynote
Fifteen Years of DevOps -- LISA 2012 keynoteFifteen Years of DevOps -- LISA 2012 keynote
Fifteen Years of DevOps -- LISA 2012 keynote
 
Federating Subversion and Git
Federating Subversion and GitFederating Subversion and Git
Federating Subversion and Git
 
SIG-NOC Tools Survey 2015
SIG-NOC Tools Survey 2015SIG-NOC Tools Survey 2015
SIG-NOC Tools Survey 2015
 
Netpod - The Merging of NPM & APM
Netpod - The Merging of NPM & APMNetpod - The Merging of NPM & APM
Netpod - The Merging of NPM & APM
 
Real Cost of Software Remediation
Real Cost of Software RemediationReal Cost of Software Remediation
Real Cost of Software Remediation
 
Building The Agile Enterprise - LSSC '12
Building The Agile Enterprise - LSSC '12Building The Agile Enterprise - LSSC '12
Building The Agile Enterprise - LSSC '12
 
Chuck_Roden_Resume
Chuck_Roden_ResumeChuck_Roden_Resume
Chuck_Roden_Resume
 
Four Essential Steps for Removing Risk and Downtime from POWER9 Migration
Four Essential Steps for Removing Risk and Downtime from POWER9 MigrationFour Essential Steps for Removing Risk and Downtime from POWER9 Migration
Four Essential Steps for Removing Risk and Downtime from POWER9 Migration
 
From monolith to resilient microservices
From monolith to resilient microservicesFrom monolith to resilient microservices
From monolith to resilient microservices
 

En vedette

Architecting for the cloud scability-availability
Architecting for the cloud scability-availabilityArchitecting for the cloud scability-availability
Architecting for the cloud scability-availabilityLen Bass
 
Architecting for the cloud cloud providers
Architecting for the cloud cloud providersArchitecting for the cloud cloud providers
Architecting for the cloud cloud providersLen Bass
 
Error in hadoop
Error in hadoopError in hadoop
Error in hadoopLen Bass
 
Architecture patterns for continuous deployment
Architecture patterns for continuous deploymentArchitecture patterns for continuous deployment
Architecture patterns for continuous deploymentLen Bass
 
WICSA 2012 tutorial
WICSA 2012 tutorialWICSA 2012 tutorial
WICSA 2012 tutorialLen Bass
 
Architecture for the cloud deployment case study future
Architecture for the cloud deployment case study futureArchitecture for the cloud deployment case study future
Architecture for the cloud deployment case study futureLen Bass
 
My first deployment pipeline
My first deployment pipelineMy first deployment pipeline
My first deployment pipelineLen Bass
 
Introduction to dev ops
Introduction to dev opsIntroduction to dev ops
Introduction to dev opsLen Bass
 
Deployability
DeployabilityDeployability
DeployabilityLen Bass
 
Architecting for the cloud elasticity security
Architecting for the cloud elasticity securityArchitecting for the cloud elasticity security
Architecting for the cloud elasticity securityLen Bass
 
Architecting for the cloud intro, virtualization, iaa s
Architecting for the cloud   intro, virtualization, iaa sArchitecting for the cloud   intro, virtualization, iaa s
Architecting for the cloud intro, virtualization, iaa sLen Bass
 
Architecting for the cloud storage build test
Architecting for the cloud storage build testArchitecting for the cloud storage build test
Architecting for the cloud storage build testLen Bass
 
Architectural Tactics for Large Scale Systems
Architectural Tactics for Large Scale SystemsArchitectural Tactics for Large Scale Systems
Architectural Tactics for Large Scale SystemsLen Bass
 
Packaging tool options
Packaging tool optionsPackaging tool options
Packaging tool optionsLen Bass
 
Architecting for the cloud storage misc topics
Architecting for the cloud storage misc topicsArchitecting for the cloud storage misc topics
Architecting for the cloud storage misc topicsLen Bass
 
Principles of software architecture design
Principles of software architecture designPrinciples of software architecture design
Principles of software architecture designLen Bass
 

En vedette (16)

Architecting for the cloud scability-availability
Architecting for the cloud scability-availabilityArchitecting for the cloud scability-availability
Architecting for the cloud scability-availability
 
Architecting for the cloud cloud providers
Architecting for the cloud cloud providersArchitecting for the cloud cloud providers
Architecting for the cloud cloud providers
 
Error in hadoop
Error in hadoopError in hadoop
Error in hadoop
 
Architecture patterns for continuous deployment
Architecture patterns for continuous deploymentArchitecture patterns for continuous deployment
Architecture patterns for continuous deployment
 
WICSA 2012 tutorial
WICSA 2012 tutorialWICSA 2012 tutorial
WICSA 2012 tutorial
 
Architecture for the cloud deployment case study future
Architecture for the cloud deployment case study futureArchitecture for the cloud deployment case study future
Architecture for the cloud deployment case study future
 
My first deployment pipeline
My first deployment pipelineMy first deployment pipeline
My first deployment pipeline
 
Introduction to dev ops
Introduction to dev opsIntroduction to dev ops
Introduction to dev ops
 
Deployability
DeployabilityDeployability
Deployability
 
Architecting for the cloud elasticity security
Architecting for the cloud elasticity securityArchitecting for the cloud elasticity security
Architecting for the cloud elasticity security
 
Architecting for the cloud intro, virtualization, iaa s
Architecting for the cloud   intro, virtualization, iaa sArchitecting for the cloud   intro, virtualization, iaa s
Architecting for the cloud intro, virtualization, iaa s
 
Architecting for the cloud storage build test
Architecting for the cloud storage build testArchitecting for the cloud storage build test
Architecting for the cloud storage build test
 
Architectural Tactics for Large Scale Systems
Architectural Tactics for Large Scale SystemsArchitectural Tactics for Large Scale Systems
Architectural Tactics for Large Scale Systems
 
Packaging tool options
Packaging tool optionsPackaging tool options
Packaging tool options
 
Architecting for the cloud storage misc topics
Architecting for the cloud storage misc topicsArchitecting for the cloud storage misc topics
Architecting for the cloud storage misc topics
 
Principles of software architecture design
Principles of software architecture designPrinciples of software architecture design
Principles of software architecture design
 

Similaire à Supporting operations personnel a software engineers perspective

Are your cloud applications performing? How Application Performance Managemen...
Are your cloud applications performing? How Application Performance Managemen...Are your cloud applications performing? How Application Performance Managemen...
Are your cloud applications performing? How Application Performance Managemen...DevOps.com
 
Technology insights: Decision Science Platform
Technology insights: Decision Science PlatformTechnology insights: Decision Science Platform
Technology insights: Decision Science PlatformDecision Science Community
 
Novelty in Non-Greenfield
Novelty in Non-GreenfieldNovelty in Non-Greenfield
Novelty in Non-GreenfieldJustin Lovell
 
App Modernization with .NET Core: How Travelers Insurance is Going Cloud-Native
App Modernization with .NET Core: How Travelers Insurance is Going Cloud-NativeApp Modernization with .NET Core: How Travelers Insurance is Going Cloud-Native
App Modernization with .NET Core: How Travelers Insurance is Going Cloud-NativeVMware Tanzu
 
Aitp presentation ed holub - october 23 2010
Aitp presentation   ed holub - october 23 2010Aitp presentation   ed holub - october 23 2010
Aitp presentation ed holub - october 23 2010AITPHouston
 
A DevOps adoption playbook- achieving business value at scale
A DevOps adoption playbook- achieving business value at scaleA DevOps adoption playbook- achieving business value at scale
A DevOps adoption playbook- achieving business value at scaleSanjeev Sharma
 
Visualizing Your Network Health - Know your Network
Visualizing Your Network Health - Know your NetworkVisualizing Your Network Health - Know your Network
Visualizing Your Network Health - Know your NetworkDellNMS
 
Getting Started with ThousandEyes Proof of Concepts
Getting Started with ThousandEyes Proof of ConceptsGetting Started with ThousandEyes Proof of Concepts
Getting Started with ThousandEyes Proof of ConceptsThousandEyes
 
SolarWinds Online Federal User Group
SolarWinds Online Federal User GroupSolarWinds Online Federal User Group
SolarWinds Online Federal User GroupSolarWinds
 
Making Your Apps Cloudy - Migrating to Microservices
Making Your Apps Cloudy - Migrating to MicroservicesMaking Your Apps Cloudy - Migrating to Microservices
Making Your Apps Cloudy - Migrating to MicroservicesCloudify Community
 
Cloud-Native Data: What data questions to ask when building cloud-native apps
Cloud-Native Data: What data questions to ask when building cloud-native appsCloud-Native Data: What data questions to ask when building cloud-native apps
Cloud-Native Data: What data questions to ask when building cloud-native appsVMware Tanzu
 
Reactive applications tools of the trade huff po
Reactive applications   tools of the trade huff poReactive applications   tools of the trade huff po
Reactive applications tools of the trade huff poshinolajla
 
Preparing for Neo - Singapore OutSystems User Group October 2022 Meetup
Preparing for Neo - Singapore OutSystems User Group October 2022 MeetupPreparing for Neo - Singapore OutSystems User Group October 2022 Meetup
Preparing for Neo - Singapore OutSystems User Group October 2022 MeetupYashrajNayak4
 
Enable business continuity and high availability through active active techno...
Enable business continuity and high availability through active active techno...Enable business continuity and high availability through active active techno...
Enable business continuity and high availability through active active techno...Qian Li Jin
 
Operating a Highly Available Cloud Service
Operating a Highly Available Cloud ServiceOperating a Highly Available Cloud Service
Operating a Highly Available Cloud ServiceDepankar Neogi
 
Il paradigma DevOps e Continuous Delivery Automation
Il paradigma DevOps e Continuous Delivery Automation Il paradigma DevOps e Continuous Delivery Automation
Il paradigma DevOps e Continuous Delivery Automation HP Enterprise Italia
 

Similaire à Supporting operations personnel a software engineers perspective (20)

Are your cloud applications performing? How Application Performance Managemen...
Are your cloud applications performing? How Application Performance Managemen...Are your cloud applications performing? How Application Performance Managemen...
Are your cloud applications performing? How Application Performance Managemen...
 
Technology insights: Decision Science Platform
Technology insights: Decision Science PlatformTechnology insights: Decision Science Platform
Technology insights: Decision Science Platform
 
Designing Scalable Applications
Designing Scalable ApplicationsDesigning Scalable Applications
Designing Scalable Applications
 
Novelty in Non-Greenfield
Novelty in Non-GreenfieldNovelty in Non-Greenfield
Novelty in Non-Greenfield
 
App Modernization with .NET Core: How Travelers Insurance is Going Cloud-Native
App Modernization with .NET Core: How Travelers Insurance is Going Cloud-NativeApp Modernization with .NET Core: How Travelers Insurance is Going Cloud-Native
App Modernization with .NET Core: How Travelers Insurance is Going Cloud-Native
 
Aitp presentation ed holub - october 23 2010
Aitp presentation   ed holub - october 23 2010Aitp presentation   ed holub - october 23 2010
Aitp presentation ed holub - october 23 2010
 
A DevOps adoption playbook- achieving business value at scale
A DevOps adoption playbook- achieving business value at scaleA DevOps adoption playbook- achieving business value at scale
A DevOps adoption playbook- achieving business value at scale
 
Developer want change Ops want control - devops
Developer want change Ops want control - devopsDeveloper want change Ops want control - devops
Developer want change Ops want control - devops
 
Visualizing Your Network Health - Know your Network
Visualizing Your Network Health - Know your NetworkVisualizing Your Network Health - Know your Network
Visualizing Your Network Health - Know your Network
 
Adopting DevOps for 2-Speed IT
Adopting DevOps for 2-Speed ITAdopting DevOps for 2-Speed IT
Adopting DevOps for 2-Speed IT
 
Getting Started with ThousandEyes Proof of Concepts
Getting Started with ThousandEyes Proof of ConceptsGetting Started with ThousandEyes Proof of Concepts
Getting Started with ThousandEyes Proof of Concepts
 
SolarWinds Online Federal User Group
SolarWinds Online Federal User GroupSolarWinds Online Federal User Group
SolarWinds Online Federal User Group
 
Making Your Apps Cloudy - Migrating to Microservices
Making Your Apps Cloudy - Migrating to MicroservicesMaking Your Apps Cloudy - Migrating to Microservices
Making Your Apps Cloudy - Migrating to Microservices
 
Cloud-Native Data: What data questions to ask when building cloud-native apps
Cloud-Native Data: What data questions to ask when building cloud-native appsCloud-Native Data: What data questions to ask when building cloud-native apps
Cloud-Native Data: What data questions to ask when building cloud-native apps
 
Reactive applications tools of the trade huff po
Reactive applications   tools of the trade huff poReactive applications   tools of the trade huff po
Reactive applications tools of the trade huff po
 
Cloud Accounting
Cloud AccountingCloud Accounting
Cloud Accounting
 
Preparing for Neo - Singapore OutSystems User Group October 2022 Meetup
Preparing for Neo - Singapore OutSystems User Group October 2022 MeetupPreparing for Neo - Singapore OutSystems User Group October 2022 Meetup
Preparing for Neo - Singapore OutSystems User Group October 2022 Meetup
 
Enable business continuity and high availability through active active techno...
Enable business continuity and high availability through active active techno...Enable business continuity and high availability through active active techno...
Enable business continuity and high availability through active active techno...
 
Operating a Highly Available Cloud Service
Operating a Highly Available Cloud ServiceOperating a Highly Available Cloud Service
Operating a Highly Available Cloud Service
 
Il paradigma DevOps e Continuous Delivery Automation
Il paradigma DevOps e Continuous Delivery Automation Il paradigma DevOps e Continuous Delivery Automation
Il paradigma DevOps e Continuous Delivery Automation
 

Plus de Len Bass

Devops syllabus
Devops syllabusDevops syllabus
Devops syllabusLen Bass
 
DevOps Syllabus summer 2020
DevOps Syllabus summer 2020DevOps Syllabus summer 2020
DevOps Syllabus summer 2020Len Bass
 
11 secure development
11  secure development 11  secure development
11 secure development Len Bass
 
10 disaster recovery
10 disaster recovery  10 disaster recovery
10 disaster recovery Len Bass
 
9 postproduction
9 postproduction 9 postproduction
9 postproduction Len Bass
 
8 pipeline
8 pipeline 8 pipeline
8 pipeline Len Bass
 
7 configuration management
7 configuration management 7 configuration management
7 configuration management Len Bass
 
6 microservice architecture
6 microservice architecture6 microservice architecture
6 microservice architectureLen Bass
 
5 infrastructure security
5 infrastructure security5 infrastructure security
5 infrastructure securityLen Bass
 
4 container management
4  container management4  container management
4 container managementLen Bass
 
3 the cloud
3 the cloud 3 the cloud
3 the cloud Len Bass
 
1 virtual machines
1 virtual machines1 virtual machines
1 virtual machinesLen Bass
 
2 networking
2 networking2 networking
2 networkingLen Bass
 
Quantum talk
Quantum talkQuantum talk
Quantum talkLen Bass
 
Icsa2018 blockchain tutorial
Icsa2018 blockchain tutorialIcsa2018 blockchain tutorial
Icsa2018 blockchain tutorialLen Bass
 
Experience in teaching devops
Experience in teaching devopsExperience in teaching devops
Experience in teaching devopsLen Bass
 
Understanding blockchains
Understanding blockchainsUnderstanding blockchains
Understanding blockchainsLen Bass
 
What is a blockchain
What is a blockchainWhat is a blockchain
What is a blockchainLen Bass
 
Dev ops and safety critical systems
Dev ops and safety critical systemsDev ops and safety critical systems
Dev ops and safety critical systemsLen Bass
 
Securing deployment pipeline
Securing deployment pipelineSecuring deployment pipeline
Securing deployment pipelineLen Bass
 

Plus de Len Bass (20)

Devops syllabus
Devops syllabusDevops syllabus
Devops syllabus
 
DevOps Syllabus summer 2020
DevOps Syllabus summer 2020DevOps Syllabus summer 2020
DevOps Syllabus summer 2020
 
11 secure development
11  secure development 11  secure development
11 secure development
 
10 disaster recovery
10 disaster recovery  10 disaster recovery
10 disaster recovery
 
9 postproduction
9 postproduction 9 postproduction
9 postproduction
 
8 pipeline
8 pipeline 8 pipeline
8 pipeline
 
7 configuration management
7 configuration management 7 configuration management
7 configuration management
 
6 microservice architecture
6 microservice architecture6 microservice architecture
6 microservice architecture
 
5 infrastructure security
5 infrastructure security5 infrastructure security
5 infrastructure security
 
4 container management
4  container management4  container management
4 container management
 
3 the cloud
3 the cloud 3 the cloud
3 the cloud
 
1 virtual machines
1 virtual machines1 virtual machines
1 virtual machines
 
2 networking
2 networking2 networking
2 networking
 
Quantum talk
Quantum talkQuantum talk
Quantum talk
 
Icsa2018 blockchain tutorial
Icsa2018 blockchain tutorialIcsa2018 blockchain tutorial
Icsa2018 blockchain tutorial
 
Experience in teaching devops
Experience in teaching devopsExperience in teaching devops
Experience in teaching devops
 
Understanding blockchains
Understanding blockchainsUnderstanding blockchains
Understanding blockchains
 
What is a blockchain
What is a blockchainWhat is a blockchain
What is a blockchain
 
Dev ops and safety critical systems
Dev ops and safety critical systemsDev ops and safety critical systems
Dev ops and safety critical systems
 
Securing deployment pipeline
Securing deployment pipelineSecuring deployment pipeline
Securing deployment pipeline
 

Dernier

Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 

Dernier (20)

Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 

Supporting operations personnel a software engineers perspective

  • 1. NICTA Copyright 2012 From imagination to impact Supporting Operations Personnel: A Software Engineering Perspective Len Bass
  • 2. NICTA Copyright 2012 From imagination to impact 2 About NICTA National ICT Australia • Federal and state funded research company established in 2002 • Largest ICT research resource in Australia • National impact is an important success metric • ~700 staff/students working in 5 labs across major capital cities • 7 university partners • Providing R&D services, knowledge transfer to Australian (and global) ICT industry NICTA technology is in over 1 billion mobile phones
  • 3. NICTA Copyright 2012 From imagination to impact Traditional View from Software Engineers 3 Application Cloud Environment Traditionally, the software engineering community has viewed systems as being developed for users and existing in an environment. The motivating questions have been: With this world view: how can development costs be reduced and run time quality improved? End users Developers
  • 4. NICTA Copyright 2012 From imagination to impact A Broader View 4 Application Cloud Environment Applications are not only affected by the behavior of the end users but also by actions of operators who control the environment for a consumer’s application. Consumer Operator End users Developers
  • 5. NICTA Copyright 2012 From imagination to impact My Message: Consider the Operator in this Picture 5 Application Cloud Environment Consumer Operator End users Developers Computer operations is a domain that impacts every application that operates in an enterprise environment. As such, Software Engineers need to be aware of how actions of operators can affect their application and how actions of their application can simplify life for operators..
  • 6. NICTA Copyright 2012 From imagination to impact Business Context “Through 2015, 80% of outages impacting mission-critical services will be caused by people and process issues, and more than 50% of those outages will be caused by change/configuration/release integration and hand-off issues.” Change/configuration/release integration and hand off are all operations issues. Gartner - http://www.rbiassets.com/getfile.ashx/42112626510 "I&O [Infrastructure and operations] represents approximately 60 percent of total IT spending worldwide, " http://www.gartner.com/it/page.jsp?id=1807615 6
  • 7. NICTA Copyright 2012 From imagination to impact Outline • Overview of operations domain – What do operators do? – What can go wrong with what they do? • Some results NICTA has achieved or activities we have ongoing 7
  • 8. NICTA Copyright 2012 From imagination to impact What Do Operators Do? 8 Akamai’s NOC in Cambridge, Massachusetts • Monitor and control data center/network/system activity – Install new/upgraded applications/middleware/configurations/hardware • Support business continuity through back ups and disaster recovery
  • 9. NICTA Copyright 2012 From imagination to impact Monitor and Control • Data Center – Total number and type of resources (may be virtual) • Processors • Storage • Network • Network – Intrusion detection – Routing – Loading • System – Allocation to resources – Install/uninstall – Configure 9
  • 10. NICTA Copyright 2012 From imagination to impact What can go Wrong with Monitor and Control? Everything that was on previous slide. • Failure • Installations can fail • Resources fail and must be replaced • Overload – Resources are over/under loaded and must be supplemented/removed – Networks get overloaded and routing must be changed • Error – Routing may be incorrectly specified – Allocation of systems to resources may be incorrect – Configurations can be incorrectly specified 10
  • 11. NICTA Copyright 2012 From imagination to impact Install New/Upgraded Applications • Specifying configuration for applications • Synchronizing state for upgraded applications • Testing new/upgraded applications in target environment • Allocating resources for new version 11
  • 12. NICTA Copyright 2012 From imagination to impact What Can go Wrong with Installation? • Again its everything. – Configuration can be misspecified – Cut over to new version may leave inconsistent state – Upgrade to level N of the stack may break software in level >N of the stack – Testing environment may not appropriately mirror real environment – Configuration of one level of the stack may be inconsistent with requirements of another level. 12
  • 13. NICTA Copyright 2012 From imagination to impact Supporting Business Continuity • Disasters happen – natural or human causes • Backing up data provides recovery possibility – Lag between last version backed up and when disaster happens – In the Cloud, backing up large amounts of data to different geographic regions takes time. 13
  • 14. NICTA Copyright 2012 From imagination to impact Hand Offs • Problems can arise when a shift changes – What problems did old shift deal with? – What problems were totally solved? – What problems were partially solved? – What operations activities are currently ongoing? 14
  • 15. NICTA Copyright 2012 From imagination to impact Operations is a Target Rich Environment • There are many existing tools. Operation of data centers would not work without tools • Much room for improvement (see Gartner quote) • Some general approaches for improvement – Make software systems operations and tools process and incident aware. E.g. make them aware of upgrade or shift change – Model operations processes and systems using a single model. • Model analysis will provide opportunities for detecting trade offs between human and automated activities. • Model might also enable smoother error detection 15
  • 16. NICTA Copyright 2012 From imagination to impact Outline • Overview of operations domain • Some results we have achieved or activities we have ongoing – Disaster Recovery product – Upgrade – Operator undo – Installation process. 16
  • 17. NICTA Copyright 2012 From imagination to impact Disaster Recovery • Clouds fail – Amazon had three outages in 2011 that affected whole availability zones or regions. • NICTA has a subsidiary (Yuruware) with a non- intrusive disaster recovery product (Bolt). • Bolt copies data periodically to a back up region. • Bolt utilizes sophisticated data movement techniques to reduce time required to back up • This is an insurance policy. 17
  • 18. NICTA Copyright 2012 From imagination to impact Next Problem – Upgrade • Upgrades are a very common occurrence • Upgrade frequency of some common systems • Some systems have multiple releases per day, driven by developers – continuous deployment 18 Application Average release interval Facebook (platform) < 7 days Google Docs <50 days Media Wiki 21 days Joomla 30 days
  • 19. NICTA Copyright 2012 From imagination to impact Various Upgrade Strategies • How many at once? – One at a time (rolling upgrade) – Groups at a time (staged upgrade, e.g. canaries. This is using production environment for testing) – All at once (big flip) • How long are new versions tested to determine correctness? – Period based – for some period of time – Load based – under some utilization assumptions • What happens to old versions? – Replaced en masse – Maintained for some period for compatibility purposes 19
  • 20. NICTA Copyright 2012 From imagination to impact Having Multiple Versions Simulaneously Active May Lead to Mixed Version Race Condition 20 Server 2 (new version 3 4 X ERROR Initial request Client (browser) Server 1 (old version 1 2 5 Start rolling upgrade HTTP reply with embedded JavaScript AJAX callback
  • 21. NICTA Copyright 2012 From imagination to impact One Method for Preventing Mixed Version Race Condition is to Make Load Balancers Version Aware Client may request particular version of a service External facing Router (wrt to cloud) Internal Router Server for Version A Server for Version A Server for Version B Internal Router Server for Version A Server for Version B 21 At each level of the routing hierarchy there are two possibilities for each request • Request is neutral with respect to version • Request specifies version Routing must • Be fast to ensure rapid response • Satisfy “goodness” criteria for scheduling • Conform to client request wrt version. In addition: • Servers are being upgraded to a later version while servicing client requests • Load variation may trigger elasticity rules
  • 22. NICTA Copyright 2012 From imagination to impact What is Criterion for Measuring Load Balancer Scheduling? • What is “goodness” with respect to routing decisions within the constraints of scheduling strategy and version awareness? – Uniform distribution of requests? – Keeping utilization within bounds? – Utilizing wide variety of clients? – Other? • Main result so far. Version awareness is incompatible with any of the above “goodness” criteria for the staged upgrade strategy. 22
  • 23. NICTA Copyright 2012 From imagination to impact Canary or Staged Strategy • Upgrade one or several servers to new version and leave them for some time. • Formulation: – Staged upgrade • M version A servers (constant number) • N version B servers (constant number) • Fixed number of clients – Version aware • Once a client has had a request serviced by a version B server it cannot subsequently have any requests serviced by any version A server. 23
  • 24. NICTA Copyright 2012 From imagination to impact Bifurcation of Clients • Clients are bifurcated into version A clients and version B clients after some time – Intuitively, for each client, either it is serviced by a server with version B and consequently never served by any server with version A or never served by a server with version B. So each client ends in up the Server A class or the Server B class but not both. • We call clients that end up being serviced by services with version A, class A clients. Similarly, for class B clients. • Allowing additional clients does not fundamentally change result. 24
  • 25. NICTA Copyright 2012 From imagination to impact Bifurcation of Clients Implies • Cannot control for utilization unless create new instances of version B in response to demand – There are a fixed number of clients sending requests to a fixed number of servers with version B. Cannot vary the number of servers to reflect the load generated by the fixed set of clients. Consequently cannot control the utilization by servers with version B. • Cannot control for uniform distribution. – Uniform distribution means that every request has an equal change of being sent to any server. If a client is in class A, then it has 0% chance of being sent to a server with Version B. • Difficult to control for wide variety of clients. – Variations among the clients must be mirrored within class A and class B clients since the classes are fixed after the bifurcation. This is difficult to accomplish since types of variations that are important are usually not known. 25
  • 26. NICTA Copyright 2012 From imagination to impact Questions to Answer. – How long does it take to reach bifurcated state under what assumptions? – How can the goals of staged upgrade be achieved within the constraints of version awareness? 26
  • 27. NICTA Copyright 2012 From imagination to impact Next Problem • Operators use scripts to perform actions such as update • Scripts may fail – May be result of API failure (more on this later) – May be desire to set up testing environment – May be result of failure of underlying virtual machine. • When a script fails, the operator may wish to return to a known state (undo several operations) 27
  • 28. NICTA Copyright 2012 From imagination to impact Operator Undo • Not always that straight-forward: – Attaching volume is no problem while the instance is running, detaching might be problematic – Creating / changing auto-scaling rules has effect on number of running instances • Cannot terminate additional instances, as the rule would create new ones! – Deleted / terminated / released resources are gone! 28
  • 29. NICTA Copyright 2012 From imagination to impact Undo for System Operators 29 + commit + pseudo-delete begin- transaction rollback do do do Administrator
  • 30. NICTA Copyright 2012 From imagination to impact Approach 30 begin- transaction rollback do do do Sense cloud resources states Sense cloud resources states Administrator Undo System
  • 31. NICTA Copyright 2012 From imagination to impact Approach 31 begin- transaction rollback do do do Sense cloud resources states Sense cloud resources states Administrator Undo System Goal state Goal state Initial state Initial state
  • 32. NICTA Copyright 2012 From imagination to impact begin- transaction rollback do do do Sense cloud resources states Sense cloud resources states PlanGenerate codeExecute Administrator Undo System Goal state Goal state Initial state Initial state Set of actions Set of actions Approach 32
  • 33. NICTA Copyright 2012 From imagination to impact What about API Failures? • Operator scripts make heavy use of checking or controlling state of resources – Start/stop VM – Is VM active? • These scripts becomes calls to the cloud provider’s API. • Calls may fail – Underlying VM has failed – Eventual consistency. 33
  • 34. NICTA Copyright 2012 From imagination to impact We Have Performed an Empirical Study of API Failures in EC2 • 922 cases out of 1109 reported API-related cases in the EC2 forum from 2010 to 2012 are API failures (rather than feature requests or general inquiries). • We classified the extracted API failures into four types of failures: – content failures, – late timing failures, – halt failures, and – erratic failures. 34
  • 35. NICTA Copyright 2012 From imagination to impact Results • A majority (60%) of the cases of API failure are related to stuck API calls or unresponsive API calls. • A large portion (12%) of the cases are about slow responsive API calls. • 19% of the cases are related to the output issues of API calls, including failed calls with unclear error messages, as well as missing output, wrong output, and unexpected output of API calls. • 9% of the cases reported that their calls were pending for a certain time and then returned to the original state without informing the caller properly or the calls were reported to be successful first but failed later. 35
  • 36. NICTA Copyright 2012 From imagination to impact Next Problem - Operations Processes • We are looking at the process of installing new software – Error Prone – Potential process improvements. 36
  • 37. NICTA Copyright 2012 From imagination to impact Motivating Scenario • You change the operating environment for an application – Configuration change – Version change – Hardware change • Result is degraded performance • When the software stack is deep with portions from different suppliers, the result is frequently: 37
  • 38. NICTA Copyright 2012 From imagination to impact Why is Installation Error Prone? • Installation is complicated. – Installation guides for SAS 9.3 Intelligence, IBM i, Oracle 11g for Linux are ~250 pages each – Apache description of addresses and ports (one out of 16 descriptions) has following elements: • Choosing and specifying ports for the server to listen to • IPv4 and IPv6 • Protocols • Virtual Hosts – The number of configuration options that must be set can be large • Hadoop has 206 options • HBase has 64 – Many dependencies are not visible until execution 38
  • 39. NICTA Copyright 2012 From imagination to impact Installation Processes • Processes may be – Undocumented – Out of date – Insufficiently detailed • Our goal is to build process model including error recovery mechanisms 39
  • 40. NICTA Copyright 2012 From imagination to impact Our Activities 40 • Create up to date process models for installation processes. Information sources are – Process discovery from logs – Process formalization from existing written descriptions. • Process descriptions can be used to – Make trade offs – Make recommendations in real time to operations staff – Recommend setting checkpoints for potential later undo, before a risky part of a process is entered – Assist in the detection of errors
  • 41. NICTA Copyright 2012 From imagination to impact Hard Problems 41 • Creating accurate process models – Exception handling mechanisms are not well documented – Labor intensive. – Our approach • Top down modeling using process modeling formalism • Bottom up process mining from error logs • Diagnosing errors
  • 42. NICTA Copyright 2012 From imagination to impact Why is Error Diagnosis Hard? In a distributed computing environment, when an error occurs during operations, it is difficult and time consuming to diagnosis it. Diagnosis involves correlating messages from • different distributed servers • different portions of the software stack and determining the root cause of the error. The root cause, in turn, may be within a portion of the stack that is different from where the error is observed.
  • 43. NICTA Copyright 2012 From imagination to impact Test Bed 43 Our current test bed is the Hbase stack
  • 44. NICTA Copyright 2012 From imagination to impact Currently Performing Analysis of Configuration Errors 44 • Cross stack errors may take hours to diagnose – Log files are inconsistent – Error message may not give context necessary to determine root cause.
  • 45. NICTA Copyright 2012 From imagination to impact Where to Find Information about Operations Domain? • Every open source program requires a variety of configuration parameters. • Every modern application depends on a variety of middleware so cross domain examples should be readily available. • Most organizations have extensive processes for their operations personnel. Use these processes as a framework for investigating process/product interactions. 45
  • 46. NICTA Copyright 2012 From imagination to impact Summary • Operations problems will account for the majority of outages and IT costs in the next several years. • The operations space is a rich source of research problems that has been insufficiently mined. • Best way to determine what problems to attack is to monitor or interview operators 46
  • 47. NICTA Copyright 2012 From imagination to impact NICTA Team • Anna Liu • Alan Fekete • Min Fu • Jim Zhanwen Li • Qinghua Lu • Hiroshi Wada • Ingo Weber • Xiwei Xu • Liming Zhu 47