VMware’s Common SaaS Platform (CSP) is a brand new offering designed to enhance the productivity of developers and cloud providers by equipping them with a set of common and configurable capabilities (such as Identity, Telemetry, Account Management, Billing etc.), thus enabling them to focus on their core businesses.
But enough with the product pitch.
CSP is distributed to numerous cloud providers around the globe and is used by developers and IT alike to empower their services and better answer the business needs of their customers.
Please join us and see how we take continuous delivery to the next step, where the target environment is sometimes not under our control, and still seamlessly manage and deliver our unique collection of capabilities, packaged as a platform for ease of use, using the best and shiniest tools the frogs can provide.
2. Agenda
• What is the Common SaaS Platform (CSP)
• CI/CD processes for CSP
• Upgrading CSP
• Xenon - Distributed Control Plane (If we have the time)
3. Who are we?
Kiril Nesenko
DevOps Lead
knesenko@vmware.com
Gilad Garon
Architect
ggaron@vmware.com, Twitter @giladgaron
4. VMware’s SaaS Transition
• VMware is developing many SaaS offerings
• Many services have the same common requirements (Billing, Identity, etc.)
• Like other good engineers, we like to reuse code wherever possible
• VMware’s Common SaaS Platform (CSP) is a platform that internal SaaS offerings use to leverage existing internal components
5. Designing a SaaS platform
Design Principles
• Cloud Agnostic
• Highly Available
• Scalable
• Great Public APIs
• Modular
• Ease of operability / development
In Practice
• Infrastructure needs to support containers
• Dynamic, Stateful and Distributed cluster
• Tunable consistency helps to achieve availability & scalability
• No internal APIs
• Capabilities as libraries, coupling is done with APIs
• Single JAR, limited set of classpath dependencies
6. Deployment Architecture. Yep, that’s it.
[Diagram: four containers, each running a single JAR that hosts a Xenon Host, all deployed at Some Cloud Provider Inc.]
14. Jenkins Job Builder
• Developed by OpenStack folks
• Configuration as code (yaml format)
• Easy to review changes
• Configuration de-duplication
• Include shell/groovy/python… scripts
• Test before deploying
• Easier to organize (per directory, per file)
• Serves as backup (easy to replicate to another jenkins)
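The de-duplication point can be illustrated without Jenkins Job Builder itself. Below is a minimal Python sketch of the idea behind templated job definitions: one template is written once and expanded per project. The template fields, the project names, and the `expand_jobs` helper are all hypothetical; this is neither JJB's API nor the team's actual configuration.

```python
from string import Template

# Hypothetical job template: one definition shared by every project.
GATING_TEMPLATE = {
    "name": "${project}-gating",
    "trigger": "patchset-created",
    "shell": "cd ${project} && ./gradlew build test",
}


def expand_jobs(template, projects):
    """Return one concrete job definition per project name."""
    jobs = []
    for project in projects:
        job = {key: Template(value).substitute(project=project)
               for key, value in template.items()}
        jobs.append(job)
    return jobs


if __name__ == "__main__":
    # Illustrative project names only.
    for job in expand_jobs(GATING_TEMPLATE, ["billing", "identity"]):
        print(job["name"], "->", job["shell"])
```

Because the expanded definitions are plain data, a change to the template shows up in review as a single small diff rather than one edit per job.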
21. Jenkins Job Types
• Gating – listen for patchset-created events on Gerrit
• Build – for building purposes (Gradle, Docker, etc.)
• Listeners – listen for change-merged events on Gerrit (orchestrators for the pipelines)
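The split between gating jobs and listeners comes down to routing on the Gerrit event type. A minimal Python sketch of that routing follows. The event type strings and the `change.project` field follow Gerrit's stream-events JSON; the job naming scheme and the `job_for_event` helper are assumptions made for illustration.

```python
# Event type -> job type. The job type names are hypothetical labels.
JOB_TYPE_BY_EVENT = {
    "patchset-created": "gating",
    "change-merged": "listener",
}


def job_for_event(event):
    """Map a Gerrit-style event dict to the job name to trigger, or None."""
    job_type = JOB_TYPE_BY_EVENT.get(event.get("type"))
    if job_type is None:
        return None  # events we do not react to, e.g. comment-added
    project = event["change"]["project"]
    return f"{project}-{job_type}"


if __name__ == "__main__":
    sample = {"type": "change-merged", "change": {"project": "csp-core"}}
    print(job_for_event(sample))
```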
22. Gating Jobs
• For each patch we run a gating job
• Each git project has its own gating job
• Build + test + post results to gerrit
23. Gating Jobs
Developer sends a patch → Run build and tests (gating) → Post results to Gerrit → Merge? → Start build pipeline (listener)
27. Gerrit Failure
Gerrit hooks
• Executed on the server side
• Execute per event type
• Various checks: commit message style, trailing whitespace, etc.
• Integrations with external systems: Bugzilla, Jira, etc.
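As an illustration of the kind of checks a server-side hook might run, here is a short Python sketch. The specific rules (a 72-character subject limit and a blank second line) are assumptions chosen for the example, not the rules the team enforces, and the function names are hypothetical.

```python
def check_commit_message(message, max_subject=72):
    """Return a list of style problems found in a commit message."""
    problems = []
    lines = message.splitlines()
    if not lines or not lines[0].strip():
        problems.append("empty subject line")
        return problems
    if len(lines[0]) > max_subject:
        problems.append(f"subject longer than {max_subject} characters")
    if len(lines) > 1 and lines[1].strip():
        problems.append("second line must be blank")
    return problems


def find_trailing_whitespace(text):
    """Return 1-based line numbers that end with spaces or tabs."""
    return [number
            for number, line in enumerate(text.splitlines(), start=1)
            if line != line.rstrip(" \t")]


if __name__ == "__main__":
    print(check_commit_message("Fix billing retry\nno blank line here"))
    print(find_trailing_whitespace("clean\ndirty \n"))
```

A real hook would exit with a non-zero status when either function reports a problem, so that the server rejects the change.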
30. Listener Jobs
• Executed on change-merged event
• Orchestrating the build and delivery pipeline dynamically
• Orchestration done via the BuildFlow plugin (groovy)
• All listeners run the same code base
• On failure, the user is notified on a Slack channel
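The listener behaviour described above (run the stages in order, stop at the first failure, notify someone) can be sketched in a few lines. The real orchestration uses the BuildFlow plugin with Groovy; the Python below is only an illustration, and the `run_pipeline` helper, the stage names, and the `notify` callback are hypothetical.

```python
def run_pipeline(stages, notify):
    """Run (name, callable) stages in order; stop and notify on first failure.

    Returns the list of stage names that completed successfully.
    """
    completed = []
    for name, stage in stages:
        try:
            stage()
        except Exception as error:
            # In the talk this notification goes to a Slack channel.
            notify(f"pipeline failed at '{name}': {error}")
            break
        completed.append(name)
    return completed


if __name__ == "__main__":
    def failing_tests():
        raise RuntimeError("tests red")

    stages = [("build", lambda: None), ("test", failing_tests),
              ("publish", lambda: None)]
    print(run_pipeline(stages, notify=print))
```

Passing the notifier in as a callback keeps the orchestration logic independent of the chat system used.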
37. Upgrading a Stateful platform
Goals:
• Minimal service interruptions
• Support schema changes
Challenges:
• Symmetrical cluster: can’t refactor / add API paths
• State & business logic in the same tier: can’t separate schema upgrade from business logic changes
38. Upgrading a Stateful platform
Design:
• Work in cycles, get meaningful metrics per cycle
• Each cycle migrates and transforms state
• Use a Threshold to determine progress and cutoff point
• Smartly queue external traffic
• Reroute traffic to new cluster
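The cycle-and-threshold design can be sketched as a small simulation. This is a simplified Python model under stated assumptions: clusters are plain dicts of `key -> (version, document)`, the per-cycle metric is the number of documents moved (a stand-in for the cycle telemetry, whose exact form the slides do not specify), and `live_traffic` is a hypothetical callback that simulates writes arriving while a cycle runs.

```python
def run_cycle(source, target, transform):
    """Migrate documents that are new or newer in source; return how many moved."""
    moved = 0
    for key, (version, document) in source.items():
        if key not in target or target[key][0] < version:
            target[key] = (version, transform(document))
            moved += 1
    return moved


def upgrade(source, target, transform, threshold, live_traffic, max_cycles=10):
    """Run migration cycles until one is small enough, then do the final cycle.

    Returns (per-cycle moved counts, documents moved in the final cycle).
    """
    history = []
    for _ in range(max_cycles):
        moved = run_cycle(source, target, transform)
        live_traffic(source)  # the platform is still live: state keeps changing
        history.append(moved)
        if moved <= threshold:
            # Cutoff point: external traffic is queued from here on, so the
            # final cycle copies the remaining delta with no new writes.
            final = run_cycle(source, target, transform)
            return history, final  # the caller can now reroute traffic
    raise RuntimeError("threshold not reached within max_cycles")


if __name__ == "__main__":
    old = {f"doc{i}": (1, {"name": f"n{i}"}) for i in range(5)}
    new = {}

    def one_write(cluster):
        version, document = cluster["doc0"]
        cluster["doc0"] = (version + 1, document)

    # The transform renames a field, the schema change the talk mentions.
    print(upgrade(old, new, lambda d: {"displayName": d["name"]}, 1, one_write))
```

Each cycle picks up from where the previous one left off because only documents with a newer version in the source are copied again.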
40. Xenon – Distributed Control Plane
• A design pattern and runtime for scalable orchestration and management logic
• A runtime powering tiny REST services
• IO Pipeline integrates key building blocks within each service operation
• Production ready code with continuous integration tests, design documents
https://github.com/vmware/xenon
41. The Popular Way
Stand up N nodes for each of:
• Orchestration code & container (Spring Boot)
• Your HA persistency layer (Cassandra, Mongo)
• Your translation layer (ORM)
• Your arbitration/leader election (ZK, etcd, consul)
• Your UI server (node.js, tomcat, apache)
• Your cache layer (Redis, memcached)
• Your message bus, event broker
42. The Xenon Way
Stand up N nodes running Xenon services:
• Orchestration as stateless or stateful REST endpoints
• Persist, replicate state independently
• Manage concurrency with a single JVM and one thread per core across ALL
services
• Provide per operation owner selection (leader)
• Pub / Sub
• Stats
• UI
• Tracing
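To make "per operation owner selection" concrete, here is a Python sketch that picks one owner node per document path. The slides do not describe Xenon's actual selection algorithm, so this uses rendezvous (highest-random-weight) hashing as a stand-in technique; the node ids and the `select_owner` helper are hypothetical.

```python
import hashlib


def select_owner(nodes, document_path):
    """Pick the node with the highest hash score for this document path."""
    def score(node_id):
        digest = hashlib.sha256(f"{node_id}|{document_path}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    # Sorting first makes the result independent of the input order.
    return max(sorted(nodes), key=score)


if __name__ == "__main__":
    cluster = ["node-a", "node-b", "node-c", "node-d"]
    for path in ["/tenants/1", "/tenants/2", "/billing/42"]:
        print(path, "->", select_owner(cluster, path))
```

Every node computes the same answer without coordination, and removing a node that is not the owner of a path leaves that path's owner unchanged, which suits a symmetric cluster.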
45. Decentralized Model
• Scalable to lots of nodes
– SWIM node discovery and maintenance
– Replication with Eventual OR Strong Consistency (choose!)
• Every node in a node group has the same core services
– Operational simplicity
46. Indexing/Queries
• Multi version, fully indexed, replicated document store
– Lucene!
• Query services with rich document query support modeled as tasks
– Real time or historical
• Collections are just queries
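The phrase "collections are just queries" can be illustrated with a toy multi-version store. The Python sketch below is in-memory and unindexed, so it is not how Xenon's Lucene-backed store works; the `VersionedStore` class and its methods are hypothetical and only model the behaviour described on the slide.

```python
class VersionedStore:
    """Toy document store that keeps every version of every document."""

    def __init__(self):
        self._versions = {}  # path -> list of documents, oldest first

    def put(self, path, document):
        """Append a new version and return its version number."""
        history = self._versions.setdefault(path, [])
        history.append(dict(document))
        return len(history) - 1

    def get(self, path, version=None):
        """Return the latest version, or a historical one if requested."""
        history = self._versions[path]
        return history[-1 if version is None else version]

    def query(self, predicate, prefix=""):
        """Return latest documents under prefix that match predicate."""
        return {path: history[-1]
                for path, history in sorted(self._versions.items())
                if path.startswith(prefix) and predicate(history[-1])}


if __name__ == "__main__":
    store = VersionedStore()
    store.put("/tenants/1", {"plan": "free"})
    store.put("/tenants/1", {"plan": "paid"})
    store.put("/tenants/2", {"plan": "free"})
    # A "collection" is simply a query over a path prefix.
    print(store.query(lambda doc: True, prefix="/tenants/"))
    print(store.get("/tenants/1", version=0))
```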
47. Programming Model
• Isolated, asynchronous components listening on URIs
• Each service instance represents a “living” document
– All side effects happen through REST actions on document
– Replication, consensus, notifications all leveraging symmetric model
• Stateless handlers are offered latest state and request body
• Developer declares requirements through Service options
– Replication with Strong (Eager) or Eventual consistency
– Scale out (Owner selection)
– Instrumentation
– Persistence (with deep indexing)
– And more …
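The "living document" model, where a handler is offered the latest state plus the request body and every side effect goes through an action on the document, can be sketched as follows. This is a synchronous Python toy, whereas Xenon itself is an asynchronous Java runtime; the `Host` class, its methods, and the counter service are hypothetical and do not mirror Xenon's API.

```python
class Host:
    """Toy host: routes actions to services listening on URIs."""

    def __init__(self):
        self._handlers = {}  # uri -> handler(action, state, body) -> new state
        self._state = {}     # uri -> latest document for that service

    def start_service(self, uri, handler, initial_state):
        self._handlers[uri] = handler
        self._state[uri] = dict(initial_state)

    def send(self, uri, action, body=None):
        """Apply an action to the document at uri and return its latest state."""
        if action != "GET":
            # The handler receives a copy of the latest state and the body.
            new_state = self._handlers[uri](action, dict(self._state[uri]), body)
            self._state[uri] = new_state
        return dict(self._state[uri])


def counter_handler(action, state, body):
    """Hypothetical service: PATCH increments the counter document."""
    if action == "PATCH":
        state["count"] += body.get("increment", 1)
        return state
    raise ValueError(f"unsupported action: {action}")


if __name__ == "__main__":
    host = Host()
    host.start_service("/counters/a", counter_handler, {"count": 0})
    host.send("/counters/a", "PATCH", {"increment": 2})
    print(host.send("/counters/a", "GET"))
```

Keeping the handler a pure function of state and body is what lets the runtime, rather than the developer, take care of concerns such as replication and persistence.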
Editor's notes
Hi,
My name is Gilad, and here with me is Kiril. We are part of VMware’s CPSBU, or Cloud Provider Software Business Unit, which is a fancy way of saying that we build software for cloud providers.
VMware is transitioning from a product-based company to a services-based company.
More and more teams are developing services and need to interact with internal back-office systems such as identity and billing.
As development moved forward, we noticed two things:
No one likes to write integrations with billing or identity. Developers prefer to write services, not integrations.
Every service implements its integrations in its own way, and when different services want to share an integration, most of the time it is too domain-specific.
Like all good engineers, we want to share code and not waste time reinventing the wheel.
So, our main goal with CSP is to create a platform that accelerates internal service development and standardizes the way a service interacts with the various integrations.
How do you design such a platform? When designing CSP, we decided on a set of design principles:
1. Run on any infrastructure
2. High availability – self explanatory
3. Scalable – support N nodes
4. Public APIs dogfooding – we believe that a good API experience is only achievable when you consume your own APIs
5. Modular – add capabilities to the platform easily and be able to not use certain capabilities
6. Ease of operability / development – try to limit the tech zoo, and be able to run the platform with a single “click”
How does it look in practice?
Our lowest common denominator is container support. If a provider can support containers, we can run on it.
Our platform is distributed and stateful. We use tunable consistency, in which most of our data is eventually consistent.
To be scalable and highly available, we use gossip or, to be more precise, the SWIM protocol.
No internal APIs: if you don’t have them, you need to consume the public ones.
Our capabilities, or modules, are just JARs on the classpath. Coupling between modules is done at the public API level.
Our executable is a JAR, not a web or application server, which is easy on development and operations. We limited our tech zoo to technologies that are aligned with our design principles.
Most of these principles are provided by VMware’s own Xenon framework, a distributed control plane. More on Xenon in a few seconds.
When we stuck to our guns with the design principles (and it wasn’t easy), we had a big win:
When deployed in production, CSP looks like this (also in dev). The number of nodes can scale. A lot.
How did we achieve this? VMware’s Xenon framework.
So how do we upgrade our customer environments?
Upgrading services to a new version is not a new concept. All of us are familiar with the popular strategies:
Rolling upgrade inside an existing cluster
Blue/Green
Even hybrid solutions exist
We had two main goals when designing the upgrade mechanism, other than the obvious one of actually upgrading the code base:
We must support schema transformation (such as renaming fields); adding or removing fields is already free in Xenon.
The other goal is that the customer should not experience service interruptions.
CSP has some challenges that needed to be addressed when we designed our upgrade mechanism:
CSP is stateful, and the state and the business logic reside together in the same tier. This causes a challenge when considering a rolling upgrade.
You can’t separate the schema changes from the business logic changes, since they both reside in the same JAR. And you can’t modify API paths or logic, since our cluster is symmetrical.
So what did we do?
Since rolling upgrades are not easily achievable for now, we went with a blue/green strategy.
Our goal here is to migrate most of the data while the platform is live. Once the migration is almost done, we queue the incoming traffic, copy the remaining data, and then reroute the traffic to the new cluster.
In order to achieve that, we run in cycles. When a cycle is finished, we examine its telemetry and pass it to a threshold mechanism.
The purpose of the threshold mechanism is to determine whether it is safe to queue the external traffic and migrate the remaining data. If the last cycle took too long, we start a new cycle, picking up from where the last cycle finished in terms of state. (The platform is still live, so data is modified at runtime and we need to address these changes.)
So, we migrate, check, and repeat until we’ve crossed a certain threshold. Once the threshold is crossed, we queue the traffic, perform a final cycle, and reroute the traffic.
Let’s see an example.
What is Xenon?
Xenon is a framework for writing small REST-based services. (Some people call them microservices.) The runtime is implemented in Java and acts as the host for the lightweight, asynchronous services. The programming model is language agnostic (it does not rely on Java-specific constructs), so implementations in other languages are encouraged. The services can run on a set of distributed nodes. Xenon provides replication, synchronization, ordering, and consistency for the state of the services. Because of the distributed nature of Xenon, the services scale well and are highly available.
Xenon is a "batteries included" framework. Unlike some frameworks that provide just consistent data replication or just a microservice framework, Xenon provides both. Xenon services have REST-based APIs and are backed by a consistent, replicated document store.
When you build a modern service today you’ll probably need the following checklist:
Orchestration code and container – you’ll probably go with Spring Boot
HA Distributed DB – Cassandra / Mongo
And an ORM layer to go with it
A way to keep your cluster in sync – ZooKeeper / etcd
UI serving – Node.js / Apache
You’ll want to go stateful at some point for performance / throughput reasons – Redis
And some message bus / pipeline – Kafka?
In my opinion, this checklist looks good. All of the technologies listed here work and are industry standard. But, you have to admit, it is a bit complex to manage and deploy. You have to deploy and bootstrap in a certain order, wait for things to settle… You get it.
But, there’s another way:
Each Xenon runtime provides the following abilities:
An orchestration and RESTful layer
A persistence and replication layer
Fully asynchronous processing with a single thread per core
Tunable consistency per service, with leader election
Publish / subscribe mechanisms
And UI services, telemetry data, tracing and more…