Presentation by Splunk's Matt Davies and Vertu's Rob Charlton at Computing's DevOps Summit in London.
Digital Transformation: The role of machine data in DevOps: increase velocity, improve quality and drive impact
Find out how UK luxury mobile device manufacturer Vertu use machine data for smarter DevOps
Hear how to improve software quality by measuring the metrics that matter
Understand how effective DevOps helps Vertu improve their customers’ experience
15. VERTU TIMELINE
1998: Founded by Nokia
2002: Signature – First phone
2010: First Smartphone
2012: Private Equity
2013: Vertu Ti (Android)
2014: Vertu Aster
2015 to 2016: Signature Touch; Private Owner
16. YOUR PRESENTERS
• Rob Charlton
• A Cloud DevOps Architect currently managing an Amazon cloud-based consumer services platform for Vertu using leading-edge technology. Prior to this he founded and worked for a series of startups as CTO and Chief Architect.
18. TRANSFORMATION: CLOUD
2011: Back in 2011 we worked with multiple managed service providers in multiple data centres. We started the process of automation early though, adopting Puppet even at this stage.
2012: In 2012 we consolidated and migrated everything to a single VMware private cloud. We used automation and built tools to ensure customers didn’t even notice.
2015:
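import boto  # boto 2.x API

# Connect to EC2 using credentials from the environment or boto config
ec2 = boto.connect_ec2()

# Launch an instance from the given AMI with the named key pair
reservation = ec2.run_instances(
    image_id='ami-bb709dd2',
    key_name='ec2-sample-key')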
In 2015 we completed a full migration from VMware to Amazon Web Services – using EC2, VPC, RDS, ELB and allowing us access to all their features from Python!
19. TRANSFORMATION: CULTURE & DEVOPS
We’re climbing up the pyramid:
• IaaS: Amazon!
• Repeatability: Ansible!
• Tooling: Jenkins, Packer, Consul.io
• The Pinnacle? http://wp.me/p6k7pa-l
20. TRANSFORMATION: DATA & IOT - CHALLENGE
How to become more data led when productizing a phone?
Hundreds of mobile devices under test with individuals
Who has tested what, for how long?
How many phone / modem / system crashes?
Can we launch?
21. HOW TO MAKE AN ANDROID MOBILE PHONE
Drivers: Qualcomm provide a layer of drivers to work with their hardware. Power Management, Camera, Modem, Security, Sensors etc.
Linux: The Linux layer consists of the kernel as well as boot code.
“Android” - AOSP: Google’s Android actually only makes up this part – the “Android Open Source Project” and “Google Mobile Services”.
Apps: Vertu adds its own Apps, to support the services which come with our phones. Other Apps from the Play Store form this layer too.
System Integration Partner: As with most Android phone manufacturers, we use a System Integration Partner to help us make all these layers of software work optimally together. This involves tuning settings, making custom changes and applying thousands of patches.
Apps layer: Tackling quality issues in this layer is well supported and understood. Splunk’s MINT can help here! App providers will fix their own apps.
Lower layers: This is where the big challenge lies! There is a lot of software here, with many parties working on it. If it goes wrong it means your phone resets, runs out of battery, misses calls, takes fuzzy photos etc. The phone can reset silently too!
22. TRANSFORMATION: DATA & IOT – OUR SOLUTION
Metrics collation agent: During the productization phase, we run an agent on our phones to collect metrics: uptime, crashes, battery stats and other product health information.
The phones regularly upload metrics to a web service running in our Amazon cloud.
A Splunk Forwarder feeds the data into our Splunk Enterprise cluster.
Splunk will send out email alerts to our crash analysis team whenever a phone reset is detected.
The project management team will use Splunk dashboards to assess the state of the software.
23. CRASH ALERT!
• Cause of the crash
• Which tester has the phone
• When the crash happened
• What phone software version
24. SUMMARY
• Vertu has undergone a Digital Transformation on 3 fronts
• Cloud – from physical to virtual to cloud, without any customer noticing
• DevOps – from zero to DevOps focussing on automation with Ansible
• Data – taking a data-driven approach to product quality with Splunk
• The future
• Serverless & NoOps -> AWS Lambda and API Gateway
• Splunk latest features, Splunk Cloud?
• If you are interested in finding out more, please get in touch!
However, DevOps is complex and consists of ‘loosely connected’ tools, especially new solutions for Continuous Integration (CI) and Continuous Delivery (CD), that automate the various aspects of the application lifecycle, from application planning through project management, code management, build automation, test automation, provisioning, configuration, release, and monitoring.
Similarly in Ops, the environment is becoming much more complex and disjointed, demanding that Site Reliability Engineers understand what is happening in a massively complex ecosystem: on-site and legacy data center systems, cloud and SaaS services, network and storage infrastructures including SDN, SDS, and SDDC, security and compliance posture, and an increasing number of third-party and internal services accessed solely through APIs.
All the while, Devs, Ops, and the rest of the delivery team are being told they must ‘align with the business’, without having any real visibility, let alone understanding, of how a DevOps-oriented delivery lifecycle directly impacts business goals like user signups, cart fulfillment, customer satisfaction, social sentiment, or revenue.
This complexity of the DevOps build pipeline (tool chain) impacts both IT and the business:
- Slower rate of releases and updates
- Long troubleshooting times
- Applications released with defects, creating efficiency, stability, revenue, satisfaction, and security/audit risk
- Limited insight into the business impact of new applications and code, and slow reaction times
- Lack of reporting on application security and compliance implications
Because of these issues, Gartner currently lists DevOps at the peak of inflated expectations (based on its 2015 application services report).
I promised you FIVE. Check it out – this is SIX.
The answer to complexity is to use Splunk to mine machine data. You can collect, index and correlate data in real time from across the entire delivery lifecycle. The data you need to deliver new products and services to your customers might live in many places: in log files, behind API endpoints, in relational databases, in containers, even in wire data! Once your data is in Splunk, you can quickly and easily search, explore and visualize it.
Vertu is a British manufacturer and retailer of handmade luxury mobile phones.
The phones are made on site in Hampshire using luxury materials like titanium, hand-stitched leather and sapphire crystal.
Each one is built and signed by a single craftsman.
We sell phones globally in 600 stores and 70 of our own boutiques.
The phones come with a range of exclusive services such as our Concierge service, which will put you in touch with a 24x7 lifestyle manager who can arrange whatever you need: flights, opera tickets, a table at an exclusive restaurant that’s fully booked.
Over the last 5 years Vertu has undergone a Digital Transformation on three main fronts:
In infrastructure we’ve migrated from a collection of on-premises and managed IT data centres to running all of our services in Amazon AWS.
In ops we’ve gone from having no ops function at all, to climbing the path towards DevOps – Culture, Automation, Lean, Metrics, Sharing.
In data we’ve gone from using gut feel, instinct and experience alone to combining that with a data-led approach using Splunk Enterprise.
Our cloud transformation started from a fairly common position – a heterogeneous collection of different data centres and managed services which was complex, expensive and slow to change. We looked to the future though and started introducing automation early using a tool called Puppet, as we knew where we wanted to go.
In 2012, triggered by the need to separate our IT systems from Nokia, we consolidated all of the data centres together and migrated them to a VMware private cloud. We built custom monitoring and migration tools and employed automation via Puppet to ensure the process was as seamless as possible. Our customers and most of Vertu didn’t even notice we had migrated – which is exactly what you want from a migration!
Early in 2015, we migrated again. Our private cloud hardware was reaching end of life and needed replacing. We didn’t have a way to scale up for experimentation or scale down to reduce costs. So we looked to the future once again and moved everything over to Amazon AWS. We evaluated cloud providers and chose Amazon because of the breadth of solutions in their catalog and their pace of innovation. We can now stop and start machines, and create networks and load balancers, all from Python, as this code snippet shows.
Our second transformation axis involved our approach to IT operations.
You’re probably familiar with Maslow’s Hierarchy of Needs – not the 1943 original but this updated one. I was reading an excellent blog post last year by Space Ape Games which described their journey to DevOps as climbing a ‘Hierarchy of DevOps needs’. We’re not as far along as they are, but our journey has been similar. It all starts with IaaS, which we’ve now adopted with Amazon.
The next stage is to be able to deploy your infrastructure in a repeatable manner. We first used Puppet for this but last year moved to a newer tool called Ansible. All our infrastructure is deployed using Ansible, and all the Ansible code is versioned in Git. Our infrastructure is code.
We’re just starting to ascend to the next level, which Space Ape suggests is using tools that deploy and manage your infrastructure for you. We’re using Jenkins right now, and are evaluating Packer and Consul.io, which we intend to combine into a world-class Continuous Delivery solution.
The 3rd transformation we’ve undergone was in data. We make mobile phones and we have very discerning customers. When we are productizing a phone we typically have hundreds of phones under test by individuals in the UK and around the world. The programme manager responsible for the phone has some very basic questions:
Who has tested what, and for how long?
How many crashes or errors have there been?
Together, the answers to these two questions give an industry-standard figure called Mean Time Between Failures (MTBF): total test hours divided by the number of failures. For a reliable phone you would expect it to reach several hundred hours. Ultimately the big question is:
Can we launch?
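As a rough sketch of the MTBF arithmetic, assuming it is simply total test hours divided by the number of failures: the function name and the figures below are made up for illustration and are not Vertu's real data or tooling.

```python
def mtbf_hours(test_hours_per_phone, failure_count):
    """Mean Time Between Failures: total test hours / number of failures.

    Returns None when no failures were recorded, because the ratio is
    undefined in that case.
    """
    total_hours = sum(test_hours_per_phone)
    if failure_count == 0:
        return None
    return total_hours / failure_count


# Hypothetical figures: four phones under test and two crashes between them.
hours = [120.0, 95.5, 210.0, 74.5]
failures = 2

result = mtbf_hours(hours, failures)
print(result)  # 250.0
```

With figures like these the programme manager can compare the result against the several-hundred-hour target before deciding on launch.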
So why is this a challenge?
To answer that, I need to make a small digression to explain what is involved in making an Android mobile phone.
It all starts with the silicon vendor – as well as supplying the chipset, Qualcomm provide us with a whole package of drivers to support all the different devices in the phone: camera, sensors, security subsystem, modem, power management etc.
Next we have the Linux layer – this is the same kernel that’s in a Linux server – as well as the boot system.
Then we have Google’s contribution – this is the bit of the Android phone that is actually Android. The Android Open Source Project (AOSP) contains most of the Android framework and some of the applications. The rest, which you may be familiar with (Play Store, Chrome, Gmail etc.), makes up “Google Mobile Services” (GMS), the closed-source portion of Android.
Finally at the top we have Apps. Vertu adds some of our own apps in here: client applications for our suite of services like Concierge, but this layer also includes the apps you might download from the Play store, like Facebook, Twitter, LinkedIn and Angry Birds.
There’s a lot of work to bring all these layers together into a high quality phone. Like many phone manufacturers we use a System Integration Partner to help us here. They make changes at all the levels – tuning, configuring and applying thousands of patches.
So where do the quality issues occur?
Well, at the Apps layer it is very well understood. There are lots of tools to help and Splunk have an offering called MINT specifically to tackle this. App vendors will of course be fixing problems in their own applications, so if Facebook crashes then Facebook will address that.
But these lower layers are the big challenge for us and other manufacturers. There is a _lot_ of software here and issues at this level will do more than just make an App exit. The phone can reboot (sometimes silently, in your bag), it could drop calls or lose signal, take fuzzy photos or run out of battery too quickly.
So, how did we tackle this?
This is our solution. During the phone productization phase, when we are developing and testing and fixing the phone, we run a metrics collection agent on all the handsets. This will collect an array of different product health information: how long the phone has been on, battery level, crash details etc.
Periodically, the agent will upload these metrics to a web service running in our Amazon cloud.
A Splunk Forwarder feeds the metrics into our Splunk Enterprise Cluster where they can be analysed.
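To make the data flow concrete, here is a minimal sketch of what one uploaded metrics record might look like once the web service has written it out for the forwarder to pick up. Everything in it is an assumption for illustration: the field names, the helper functions and the one-JSON-object-per-line layout are hypothetical, not the actual agent or service.

```python
import json


def build_metrics_event(timestamp, device_id, tester, software_version,
                        uptime_seconds, battery_percent, crash_cause=None):
    """Build one flat health record for a handset (hypothetical schema)."""
    return {
        "timestamp": timestamp,
        "device_id": device_id,
        "tester": tester,
        "software_version": software_version,
        "uptime_seconds": uptime_seconds,
        "battery_percent": battery_percent,
        # None means no crash was recorded in this reporting period.
        "crash_cause": crash_cause,
    }


def to_log_line(event):
    """Serialise an event as a single JSON line, one event per line."""
    return json.dumps(event, sort_keys=True)


event = build_metrics_event(
    timestamp=1456790400,
    device_id="handset-001",
    tester="tester-a",
    software_version="1.0.0",
    uptime_seconds=7200,
    battery_percent=81,
)
line = to_log_line(event)
print(line)
```

Keeping each record flat and on a single line is a design choice for this sketch: it makes the events easy to append to a file and easy to search by field later.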
We use Splunk Alerts to send out emails when crashes are detected. These go to our crash analysis team who can find the problematic phone and ask the tester what they were doing, take logs etc. The project management team responsible for the phone will use Splunk Dashboards to assess the state of the software.
I’ve got some examples to show you…
This is a sample crash alert email. This was sent because the phone I was testing crashed earlier this year. It went to the crash analysis team who contacted me asking for more information.
The mail gives the cause of the crash, as a coded set of numbers. This is important because a reboot just looks like a reboot but could be for any number of reasons. It is very easy for human testers to all agree they’ve suffered from the same crash (“Yeah I had that too!”) which can send us chasing our tails. Being able to cluster errors using data prevents this.
You can also see whose phone crashed, when and what software version they were using.
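The clustering idea can be sketched in a few lines of Python. This is only an illustration of grouping crash reports by their coded cause; the report fields and cause codes are invented, and in practice we do this with Splunk searches rather than a script.

```python
from collections import defaultdict

# Hypothetical crash reports; field names and cause codes are made up.
crash_reports = [
    {"tester": "tester-a", "cause": "0x1001", "version": "1.0.0"},
    {"tester": "tester-b", "cause": "0x2002", "version": "1.0.0"},
    {"tester": "tester-c", "cause": "0x1001", "version": "1.0.1"},
]


def cluster_by_cause(reports):
    """Group crash reports by cause code so that reboots which look the
    same to a tester are only counted together when the data agrees."""
    clusters = defaultdict(list)
    for report in reports:
        clusters[report["cause"]].append(report)
    return dict(clusters)


clusters = cluster_by_cause(crash_reports)

# Print the largest clusters first, with the testers affected by each.
for cause, members in sorted(clusters.items(), key=lambda kv: -len(kv[1])):
    testers = sorted({m["tester"] for m in members})
    print(cause, len(members), testers)
```

Here two testers share cause 0x1001 while the third crash has a different cause, even though all three would have looked like "the phone rebooted" to the people holding them.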
When using Splunk to mine machine data, our customers and prospects can:
1) INCREASE APP DELIVERY VELOCITY
2) IMPROVE CODE QUALITY
3) INCREASE BUSINESS IMPACT OF APPLICATION DELIVERY
The best part is that Splunk is really easy to try and deploy.
We have multiple options for getting started:
- Try out Splunk Enterprise, Splunk Cloud, or Splunk Light with our free downloads or online trials.
- The free Splunk Enterprise download is the same product that scales to ingest petabytes of data per day.
- Already running with Amazon Cloud deployments? AMIs for Splunk Enterprise and Hunk make it easy to get up and running.