Slides from my talk at the Edge Engineering Meetup on June 9, 2016 at Netflix HQ, Los Gatos, CA. The talk covers why developer productivity is important for the Netflix experience-based API system and takes a look at the kinds of problems we attempt to solve for Netflix developers.
This talk was part of a series of talks about the Netflix Edge.
http://www.slideshare.net/danieljacobson/netflix-edge-engineering-open-house-presentations-june-9-2016
12. $ newt auto-deploy -d
NeWT: Local Container Development
(diagram: a local nodeJS project is built and run via docker build / run inside a local container on a Docker Machine, with a file watcher / live reload trigger, a file watcher agent, and a node-inspector debugger attached)
13. $ newt auto-deploy -d
NeWT: Local Container Development
(diagram: the local container talks to cloud microservices; a cloud proxy terminates security, and a discovery agent bridges service discovery between the local system and the cloud)
15. Mantis - Stream Processing Platform
• Low latency, high throughput, highly efficient
• Handles bursty or large-scale loads
• Extensible programming model
600 jobs in production, 8M messages/sec at peak, 100Gbps network throughput
28. • Scaling developer productivity with business growth
• Providing a fully managed PaaS experience to client developers
• Shift-left insights to power smart development
• Curated, blended visualizations that simplify devops
In conclusion...
At the Netflix Edge Developer Experience team, we are all about translating developer productivity into Netflix customer delight.
Wait, what developer experience?
Let’s get a show of hands -- how many of you are developers who write code and ship applications in your daily life?
Good, so you know how important developer experience is to being productive at your work.
But who are these developers that we are talking about?
The Netflix Edge is all about an experience based API -- Netflix client application developers write an API that creates the best experience possible for their device.
Check out http://techblog.netflix.com/2013/01/optimizing-netflix-api.html, http://techblog.netflix.com/2014/03/the-netflix-dynamic-scripting-platform.html
We are talking about these internal Netflix client application developers
The innovation velocity for these client applications is very high - there are nearly 700 client adaptor applications deployed today, deploying dozens of times a day.
They are authored by 15+ client teams, totaling ~200 developers.
For a service tier that is funneling billions of requests a day, there is a large appetite for high-velocity changes.
We would like those developers to be able to develop rapidly, deploy reliably and operate their application effectively.
And given Netflix’s experimentation driven culture, those applications are constantly evolving based on AB tests, from which they learn and do further development.
While a client developer does get to have a lot of fun creating cool new UI experiences for customers, they are also the final feature integration point, for both client and server code.
The slightest friction causes a lot of pain, missed deadlines and suboptimal features
So developer productivity at Edge leads to faster, more reliable innovation of product, which in turn helps keep our 81M subscriber base happy and growing.
Our strategy to achieve developer productivity is to invest in tools, insights and automation and grow their value as our service grows.
Let’s dive a little deeper.
Let’s take a look at innovations we are making in the areas of app development and management
This is today’s awesome dynamic scripting API server, where apps run on the JVM
At the demo stations you will get to learn about Primer, our dynamic app delivery and deployment system. With Primer a developer can push one of these apps to production globally, effecting change for customers within five minutes.
However, given our future scale, there is a developer ergonomics challenge with this architecture, that we would like to solve.
First, there is a tech stack mismatch -- most Netflix UIs are JS, and the Groovy stack at API makes for an unnatural fit
The API JVM is large and complex, which means that devs cannot debug their apps by running the server locally
Also a complex, overloaded and changing application profile makes it hard to provide guarantees about performance of any individual script in production
With the edge rearchitecture, we are separating client app scripts into their own process-isolated services implemented as Docker containers
The ergonomics story improves tremendously.
But remember, UI developers typically do not operate services. They just want to write JS, not operate a service tier.
Stakes are very high - the criticality of the component means that developers have to manage a lot of concerns.
What starts out as developmental concerns, quickly grows into various aspects of managing a server application at scale.
What developers need is a Platform as a Service Solution
We are excited about something we are working on towards this, called NeWT, or the Netflix Workflow Toolkit.
NeWT brings Docker-container-based, managed application development concepts to a developer’s hands
NeWT itself is a command line tool, but it represents wrapping all of the platform facilities underneath to simplify app development and operations.
A NeWT project gets all of these subsystems initialized and wrapped. It’s also about the backend systems, maintaining them on behalf of the application developer
Our goal is that developers have to just bring their javascript code!
Here’s a different view of the various systems that NeWT abstracts
You might see some familiar open source systems there, and a few others that are Netflix specific telemetry and container cloud systems.
The PaaS experience is not just about a CLI but also about the corresponding UI experience
This is a preview of our Edge PaaS UI, which provides user / team personalized access to apps with integrations into other platform systems, much like its CLI counterpart.
It also has deep integration into operational insights systems, which we will talk about shortly.
NeWT is also our main container tooling wrapper
Recall that today’s API platform prevents effective debugging of client application code
With NeWT and containers we are looking to turn that around
How many of you love live reload debugging?
Lots - oh cool, so you will love this
None - well, I hope I can entice you towards using live reload debugging by the end of the day
<walk through>
newt auto-deploy takes your nodeJS project (or pulls an image from production)
Provisions a Docker Machine, then builds and runs your app inside a container on that machine
Installs a file watcher agent that monitors your code
And as you make edits, pushes the changes to your container and respins the Node process
A debugger connection is also established seamlessly via the Docker Machine, allowing you to debug as you edit.
Our client apps typically expect to terminate security at our proxy layer, so when doing local development, it would be cumbersome to run the proxy locally too.
Instead, NeWT will launch a network agent that creates a cloud-like service discovery and registration setup
And traffic can flow seamlessly from the proxy in the cloud, to the local container, and then on to downstream cloud systems
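In essence, the discovery agent registers the local container under the same scheme the cloud uses so the proxy can route to it; a minimal sketch, assuming illustrative field names rather than the actual discovery schema:

```javascript
// Hypothetical registration record a local discovery agent might publish
// so cloud routing can reach the local container. Field names are made up
// for illustration; this is not the real discovery schema.
function buildLocalRegistration(appName, localIp, port) {
  return {
    app: appName,
    instanceId: `${localIp}:${appName}:${port}`,
    ipAddr: localIp,
    port: port,
    status: 'UP',
    // Tag the instance so routing can tell it apart from cloud instances.
    metadata: { environment: 'local' },
  };
}
```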
Let’s switch gears a little bit. Now that client application developers run services, we need to extend devops workflows to them
They need to be able to operate their deployed code effectively and/or understand client application behavior quickly.
We have numerous curated insights tools, we don’t have time to cover them all, but let’s look at a few of them
Before we look at our actual solutions, I would like to tip my hat towards Mantis which my colleagues in the Edge Realtime Events team work on
So, what is Mantis? A low-latency, high-throughput stream processing platform
It can handle bursty or large-scale loads because it is sharded and auto-scalable
It is highly efficient because queries are evaluated at source, and you stream only what matches a query
You can chain jobs, with a variety of sources and sinks
And the numbers speak for themselves…
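The "queries evaluated at source" idea can be sketched in a few lines; the event shape here is an assumption for illustration:

```javascript
// Sketch of source-side query evaluation: each instance applies the
// query predicate locally, so only matching events cross the network.
// The event shape ({ durationMs }) is illustrative.
function* matchingEvents(events, predicate) {
  for (const event of events) {
    if (predicate(event)) {
      yield event; // non-matching events never leave the source
    }
  }
}
```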
Mantis powers a lot of the insights tools you will be seeing next
Once an app is deployed, “How does one know the aggregate health of the application, say, globally?”
For this purpose, we created application specific dashboards, with critical health metrics laid out together so that an engineer can draw correlations.
It blends historical metric data with real-time visualizations
It also blends contextual information, such as API pushes or server pushes, to point to the source of problems
Let’s say you have found a latency problem
How do you surgically analyze those slow requests? You will want to collect samples
Enter our queryable real-time data explorer
A user can pick a streaming data source, enter a set of conditions, in this specific case, say requests that take > 5 seconds
Hit Submit
And they get aggregated results from all the cloud instances that matched their query
Here you see all the slow requests being listed
We are talking about JS developers here - they can use our real-time JavaScript mapper to filter, map, and reduce that stream into an actionable dataset. In this example, they choose to further ignore slow requests from Mexico...
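A mapper of this kind might look like the following sketch; the event fields (country, deviceType) are illustrative, not the actual stream schema:

```javascript
// Sketch of a user-written mapper over the slow-request stream: drop
// requests from Mexico, then reduce to a count per device type.
// Event field names are assumptions for illustration.
function slowRequestSummary(events) {
  return events
    .filter(e => e.country !== 'MX') // further ignore Mexico
    .reduce((acc, e) => {
      acc[e.deviceType] = (acc[e.deviceType] || 0) + 1;
      return acc;
    }, {});
}
```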
They can also turn this data stream into a numeric metric, and then plot graphs or create alerts on deviations of that metric’s value
This creates a nimble system for transient, on-demand metrics. You no longer have to code up all metrics ahead of time in your source
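Turning such a stream into a numeric metric with deviation alerts could look like this minimal sketch, assuming a simple trailing-mean threshold (the factor of 2 is an arbitrary illustrative choice):

```javascript
// Sketch of alerting on a stream-derived metric: compare the current
// interval's value against the trailing mean of recent intervals.
// The threshold factor is an illustrative choice, not the real logic.
function deviates(history, current, factor = 2) {
  const mean = history.reduce((sum, v) => sum + v, 0) / history.length;
  return current > mean * factor;
}
```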
Maybe you have identified a few devices exhibiting those slow requests
At this point a developer can use our session tracing tools to get a view of their device session
Here is an example
Here you see the client’s view of requests over time, plotted on a timeline that you can zoom in and out of
Maybe you spotted one or two specific slow requests in that session
You can drill down and get a server side call graph / trace for that specific request
We try to highlight hotspots in the call graph, as well as annotate each node with rich node specific insights data when available
And let’s say you identify that the hotspot is within your service
You can then get a method level execution profile within your request
Surgical insights and alerting are possible when you know the specific dimension that has issues and are trying to debug an issue after it has happened.
But in reality, our applications have numerous dimensions and long tail characteristics -- a device could be having issues only in a certain country, and only for a given title.
But the cardinality of the data can be really high for each of those dimensions
What if we could automatically analyze all combinations of a set of known dimensions - say country / title, device / title, or UI version / title - and alert about anomalies in real time? Not only that, but provide relevant debug data right next to the alert
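Such a dimension-combination check could be sketched as follows, assuming error rates keyed by a "country|title" string; the multiplier and rate floor are illustrative choices, not the actual algorithm:

```javascript
// Sketch of the dimension-combination analysis: given current and
// historical error rates keyed by a dimension pair (e.g. "country|title"),
// flag buckets whose rate jumped well past the historical rate.
// The factor and floor values are illustrative assumptions.
function anomalousBuckets(current, historical, factor = 3, floor = 0.01) {
  const anomalies = [];
  for (const [bucket, rate] of Object.entries(current)) {
    const baseline = historical[bucket] || 0;
    // Require both a relative jump and a minimum absolute rate,
    // so tiny noisy buckets don't fire alerts.
    if (rate > Math.max(baseline * factor, floor)) anomalies.push(bucket);
  }
  return anomalies;
}
```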
We have just such a system, based on Mantis
Here you see a specific title starting to show an increase in errors relative to historical values
Future work here is looking towards auto-triaging and enriching the alert signal with a set of correlated data
Thus you can send a more targeted alert
In conclusion, you have gotten a flavor for ideas around
Providing a managed PaaS experience
Shift-left insights for powering smart development
Curated, blended insights for simplifying devops
For the tech fans out here, here is a short but by no means comprehensive set of technologies we employ in these solutions
Come talk to us at the demo stations to learn more or if you have a great idea, come tell us what we could be doing differently!