This document discusses building trust within an organization through a DevOps approach. It introduces the role of a DevOps person to deliver features, mediate between devs and ops, and address non-functional requirements. It outlines steps taken such as listening to stakeholders, gathering requirements, and prioritizing non-functional needs. Tools are proposed for logging, metrics, and testing to provide transparency and shared understanding across teams. Results seen include improved support, proactive issue fixing, and better product performance through data and testing collaboration.
2. What’s the role of a DevOp(s)?
• Deliver
• Be the bridge of trust between DEVs and SysOPs
• Stop the “throw the ball over the fence” game
• Mediate
• Drive non-functional requirements
… DevOp or DevOps, when talking about a single person?
3. Introduce a DevOp(s)
• In ‘txtr, starting as a QA Manager specialised
in backend systems seems to have worked
• Other organizations tend to call it Site
Reliability Engineer / Site Reliability Operations
• But… QA != Testing, not strictly at least
– Testing should be only a subset of QA, but that is
not how it is normally perceived
– Non-functional requirements did not seem to fit in
4. Non-functional requirements?
• Functional requirements == features
• Non-functional requirements == everything that
OPS would need to run the service, or even
things that Product Owners would want but have
not thought of at design time
– Logging
• What kind of information?
• How?
– Health checks / Load Balancer required URL
– Live sales report / Dashboard / Charting
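The health check / load balancer URL mentioned above can be sketched with the JDK's built-in HTTP server. This is a minimal illustration, not the deck's actual implementation; class name, port and path are assumptions:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Minimal load-balancer health check URL: answer 200 "OK" while the
// service considers itself healthy.
public class HealthEndpoint {
    public static HttpServer create(int port) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/health", exchange -> {
            byte[] body = "OK".getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        return server;
    }

    public static void main(String[] args) throws Exception {
        create(8080).start(); // the URL a load balancer would then probe
    }
}
```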
5. Steps that worked so far
• Listen …to OPS, to PMs, to QA, to R&D
• See how people have solved their specific
needs when trying to gather information
• Match all the tools that have been built
• Try to gather the essence of those tools, and
come up with non-functional requirements
• Discuss those with the R&D organization and
push them at Product level to be prioritized
over features
6. TRUST
Means…
• Not having to duplicate work
– wrongly testing the backend to see if it is
answering
– or testing to measure the response times
– or creating tests again, when there are plenty of
them that are simply not shared and/or broadly
understood
7. The answer is 42?
…no, the answer is DATA!
• By creating a single point of data collection
and graphing, people are gaining trust in the
backend
• Logs need to be shared too
• Tests need to be commonly understood
9. Tools
• Logging
– Slf4j > Log4j / JUL > GELF > GrayLog2
• Logging to syslog from a Java-based backend is pretty
bad: the stack trace becomes very hard to fetch
and report in a ticket. Instead, one link and a
screenshot, or a cut&paste of a complete stack trace
from a web interface, is much easier to digest
• GELF is a notification format, encapsulating the full
stacktrace as a message
• GrayLog2 is a ruby/MongoDB FIFO queue with a nice
web interface, and an alerting email system
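The key point about GELF, keeping the whole stack trace in one message, can be sketched as follows. This is only an illustration of the message shape (GELF 1.1 uses JSON fields such as `short_message` and `full_message`); real setups use a GELF appender for Log4j/JUL, and the naive string-based JSON escaping here is for demonstration only:

```java
import java.io.PrintWriter;
import java.io.StringWriter;

// Sketch: wrap an exception, stack trace included, into a single
// GELF-style JSON message instead of scattering it across syslog lines.
public class GelfSketch {
    static String toGelf(String host, String shortMsg, Throwable t) {
        StringWriter sw = new StringWriter();
        t.printStackTrace(new PrintWriter(sw));
        // Naive escaping, for illustration only; a real appender does this properly.
        String full = sw.toString().replace("\\", "\\\\")
                                   .replace("\"", "\\\"")
                                   .replace("\n", "\\n");
        return String.format(
            "{\"version\":\"1.1\",\"host\":\"%s\",\"short_message\":\"%s\",\"full_message\":\"%s\"}",
            host, shortMsg, full);
    }

    public static void main(String[] args) {
        try {
            throw new IllegalStateException("boom");
        } catch (Exception e) {
            System.out.println(toGelf("backend-01", "boom", e));
        }
    }
}
```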
10. Why?
• Slf4j
– It is an abstraction layer on logging facilities
• I’ll not explain why an “abstraction layer” is good
• Log4j or JUL, at your choice
– They are the most commonly used
• Means: their code is maintained
• GELF
– It keeps a full stacktrace in a single message. There is no need
to reconstruct it from syslog, where it is spread over multiple lines
with additional garbage/timestamps
• GrayLog2
– We have an in-house developer, and it is working pretty well
– Has threshold based alerting per streams of events (regexp)
13. Results seen so far
• 1st level support team is gaining trust in the
application.
– Logs are getting more and more readable
– Events can be correlated much more easily
• 2nd level support (OPS) can set alert
thresholds and react promptly, having alerts tied
to real traffic data and not “one time probes”
• I have a better feeling of the trend of issues in
production, and I don’t have to dig for logs
15. Tools
• Instrumented metrics
– JMX > Jolokia > JSON > Graphite
• MIN / MAX / AVG response time of each API
• Worst response times with related API parameters
• Success / failure counters
• All the above aggregated over the last 5 / 15 minutes, 1
hour, 24 hours
• Plus all the standard JConsole / JMX info already exposed
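The MIN/MAX/AVG counters above can be sketched with plain JMX, which is built into the JVM. The class and metric names are illustrative, and registering the stats object as an MBean (one `registerMBean()` call against a public `*MBean` interface, which Jolokia could then read as JSON) is noted but omitted:

```java
import java.lang.management.ManagementFactory;
import javax.management.ObjectName;

// Sketch: keep MIN / MAX / AVG response times of an API in plain counters,
// ready to be exposed over the platform MBeanServer. No profiling agent needed.
public class ResponseStats {
    private long count, totalMs, maxMs;
    private long minMs = Long.MAX_VALUE;

    public synchronized void record(long millis) {
        count++;
        totalMs += millis;
        if (millis < minMs) minMs = millis;
        if (millis > maxMs) maxMs = millis;
    }

    public synchronized long min() { return count == 0 ? 0 : minMs; }
    public synchronized long max() { return maxMs; }
    public synchronized long avg() { return count == 0 ? 0 : totalMs / count; }

    public static void main(String[] args) throws Exception {
        ResponseStats api = new ResponseStats();
        api.record(12); api.record(48); api.record(30); // as requests are served
        System.out.println(api.min() + "/" + api.max() + "/" + api.avg());

        // The "standard JConsole / JMX info" comes for free from the JVM, e.g.:
        Object uptime = ManagementFactory.getPlatformMBeanServer()
                .getAttribute(new ObjectName("java.lang:type=Runtime"), "Uptime");
        System.out.println("JVM uptime ms: " + uptime);
        // Exposing ResponseStats itself would be one registerMBean() call
        // against a public ApiStatsMBean-style interface (omitted here).
    }
}
```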
16. Why?
• JMX
– It is built into Java, and it is non-invasive
• R&D loves it, because it does not need an invasive agent,
unlike many profiling agents that are normally used in such cases.
Standard profiling agents tend to interfere with the
application and decrease the overall performance.
– It is a standard, so there are many tools that plug into
it natively
• Jolokia
– It is a standard tool that plugs into JMX and exposes it
in a JSON-encoded format
17. Why?
• Graphite
– It can correlate data from many sources
– Gives me the freedom of structuring graphs as I
want, directly from the web interface
• This is a definitive WIN over Munin or Cacti
– It lets me select specific timeframes
• Essential for outage investigation, something that is not
possible with Munin
– Can create dashboards
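Feeding Graphite is simple: its plaintext listener (conventionally TCP port 2003) takes one `<metric.path> <value> <unix-timestamp>` line per datapoint. A minimal sketch, with metric name and host as assumptions:

```java
// Sketch: build one datapoint line for Graphite's plaintext protocol.
// Format: "<metric.path> <value> <unix-ts>\n", one line per datapoint.
public class GraphiteLine {
    static String format(String metric, double value, long epochSeconds) {
        return metric + " " + value + " " + epochSeconds + "\n";
    }

    public static void main(String[] args) {
        String line = format("backend.api.checkout.tx_per_5min", 100.0,
                System.currentTimeMillis() / 1000);
        System.out.print(line);
        // To actually ship it (assuming a Graphite host named "graphite"):
        // try (java.net.Socket s = new java.net.Socket("graphite", 2003)) {
        //     s.getOutputStream().write(line.getBytes("UTF-8"));
        // }
    }
}
```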
18. Data are in transactions “per 5 minutes” in this graph
…you can see this specific service is currently being used
19. 100 transactions per second
uhmm… at 7a.m., ok 11a.m. in India
someone is testing…
20. Results seen so far
• No need for load and performance testing
– Apart from specific cases, to try to reproduce the
issue so DEVs can work on it.
– Producing a proper load test is problematic, and
can lead to false assumptions about the product.
Being able to watch what the business
logic is doing in production is the best load test.
• DEVs are proactively watching and fixing
performance issues on their own. The overall
product gets better and better.
22. Tools
• Testing
– BDD / Cucumber-Nagios executed by Jenkins
• Cover all the fast HTTP actions via Watir
• API calls via JsonRPC or Soap4r
• Javascript based UI via Selenium / Capybara
• These tests are actually very valuable at
deployment time, since there is no need for
manual testing. Everything is in the hands of
whoever follows the deployment.
23. Why?
• BDD
– Not everyone wants to read your code
– Not everyone is a coder
– You don’t want to have to explain your test again and
again and again, and you hate documenting
• Cucumber-Nagios / Ruby
– It is off-the-shelf, it works.
– It generates standard JUnit XML report
• Means: it directly integrates with Jenkins ( ex Hudson )
– It generates an awesome HTML report
– It can be extended pretty easily
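The BDD point above, tests that non-coders can read, can be sketched as a Cucumber feature file. The exact step wording depends on the step definitions loaded (cucumber-nagios ships its own HTTP steps); the URL and steps here are illustrative:

```gherkin
Feature: Shop backend availability
  So that support can verify a deployment without reading code

  Scenario: The health check answers quickly
    When I visit "http://shop.example.com/health"
    Then the request should succeed
    And the elapsed time should be less than 2 seconds
```

The same file doubles as documentation: nobody has to explain the test again, because the scenario is the explanation.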
24. Why?
• Watir
– It is the default HTTP client in Cucumber-Nagios
• BUT: it has tons of bugs… I have a long backlog to fix
– It is fast
• Soap4r
– Pretty easy SOAP ruby gem/library
• JsonRPC
– Very simple and basic JSON RPC gem/library
• BUT: it does not support proxy settings
25. Why?
• Selenium
– Because it is the only one?
– It supports Javascript
– It supports clustering of testing nodes
– It is supposed to be easy to integrate with
Cucumber (it is NOT …I’m working on it)
27. Upcoming…
• Health checks (normally used for load
balancing purposes) based on historical
business-logic data from within the
instrumented metrics
• Continuous integration
– Configuration management
• Data mining
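The upcoming metrics-based health check could look like this: instead of a one-time probe, pass only while current traffic is plausible against historical data. The thresholds, names and band are assumptions, a crude sketch of the idea:

```java
// Sketch: a health check driven by instrumented metrics rather than a probe.
// Healthy = current throughput within a sanity band around the historical
// average for this time slot. Band limits below are illustrative.
public class TrafficHealthCheck {
    static boolean healthy(double currentPerMin, double historicalAvg) {
        if (historicalAvg <= 0) return true;   // no history yet: stay in rotation
        double ratio = currentPerMin / historicalAvg;
        return ratio >= 0.2 && ratio <= 5.0;   // crude plausibility band
    }

    public static void main(String[] args) {
        System.out.println(healthy(90, 100));  // normal load
        System.out.println(healthy(0, 100));   // traffic vanished: unhealthy
    }
}
```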