dotCloud was a PaaS provider that built Docker to automate the deployment of applications in containers.
Docker containers use an execution environment called libcontainer, which is an interface to various Linux kernel isolation features, like namespaces and cgroups. Docker gives you this level of abstraction.
Namespaces and cgroups are the two main kernel technologies underpinning the recent trend toward software containerization that Docker rides on. To put it simply, cgroups are a metering and limiting mechanism: they control how much of a system resource (CPU, memory) a process can use. Namespaces, on the other hand, limit what a process can see. Thanks to namespaces, processes have their own view of the system’s resources.
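The division of labor between the two can be seen on any Linux host, no Docker required. A minimal sketch (the /proc paths are standard on modern Linux kernels):

```shell
# Every Linux process already lives inside a set of namespaces and cgroups.
# Inspect them for the current shell:
ls -l /proc/self/ns/     # one entry per namespace type: pid, net, mnt, uts, ipc, ...
cat /proc/self/cgroup    # the cgroup hierarchies metering this process
```

A container is essentially a process whose namespace links differ from the host's and whose cgroup limits are set by the container runtime.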
This architecture allows multiple containers to run in complete isolation from one another while sharing the same Linux kernel. Because a Docker container instance doesn’t require a dedicated OS, it is much more portable and lightweight than a virtual machine.
I would like to spend a few minutes discussing what Docker is (most of you will have at least heard of it) and why it is important.
An image is the build component of a container. It is a read-only template from which one or more container instances can be launched. Conceptually, it’s similar to an AMI.
Registries are used to store images. Registries can be local or remote. When we launch a container, Docker first searches the local registry for the image. If it’s not found locally, Docker then searches a public remote registry called Docker Hub.
Finally, a container is a running instance of an image. Docker uses containers to run the software packaged in the image.
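The image/registry/container relationship can be sketched with a few CLI commands. This assumes a running Docker daemon; busybox is just a small public image chosen for illustration:

```shell
docker pull busybox            # fetch the image (local registry first, then Docker Hub)
docker images                  # list the read-only image templates stored locally
docker run busybox echo hello  # launch a container instance from the image
docker ps -a                   # list container instances, running and exited
```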
Here is an example Dockerfile, which contains all of the instructions for building a Docker image. Take the time to get this right from the beginning.
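No Dockerfile is reproduced here, so the sketch below is purely illustrative: the base image, package, file names, and port are all assumptions, not part of the original.

```dockerfile
FROM ubuntu:14.04                  # start from a base image (hypothetical choice)
RUN apt-get update && \
    apt-get install -y python      # bake dependencies into the image
COPY app.py /opt/app/app.py        # add the application code (illustrative path)
EXPOSE 8080                        # document the port the app listens on
CMD ["python", "/opt/app/app.py"]  # default command when a container starts
```

Each instruction creates a read-only layer; `docker build -t myapp .` turns the file into an image.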
Developers can add new application features more quickly by taking advantage of automated building, testing, integration, and packaging - at the speed of containers.
Idle containers take up virtually no CPU, memory, or I/O resources.
You can move workload between private and public clouds more quickly. Instead of moving gigabytes between clouds, you can move megabytes.
Containerized applications can boot and restart in seconds, compared to minutes for virtual machines.
Instead of building one application (monolithic architecture), developers build a suite of components, called microservices, which come together over the network. Each component is written in the best programming language for the task, and each component can be deployed and scaled independently of one another.
At the core of the application is the business logic, which is implemented by modules that define services, domain objects, and events. Surrounding the core are adapters that interface with the external world. Examples of adapters include database access components, messaging components that produce and consume messages, and web components that either expose APIs or implement a UI.
Despite having a logically modular architecture, the application is packaged and deployed as a monolith.
Many organizations, such as eBay and Netflix, have adopted the Microservices architecture pattern. Instead of building a single, monolithic application, the idea is to split your application into a set of smaller, interconnected services.
Each microservice is a mini-application that has its own architecture consisting of business logic along with various adapters. Some microservices would expose an API that’s consumed by other microservices or by the application’s clients. Other microservices might implement a web UI. At runtime, each instance is often a cloud VM or a Docker container.
Looking at the evolution of deployment and applications: from 1 day, to 15 minutes, to 10 seconds. Only one host OS to manage. Small learning curve.
Rise of the container between 2013 and 2015, spearheaded by Docker.
A typical DSE node runs the following processes on a single instance within the cluster:
A single core DSE JVM – including Apache Cassandra, integrated DSE Search, and Spark Master (for HA)
One or more Spark executor processes
A single Spark Worker process
Multiple processes for the integrated Hadoop stack
Multiple processes which may be started in an adhoc manner (e.g. Spark Job server, SparkSQL CLI, etc.)
A single OpsCenter agent responsible for monitoring all processes on that DSE instance
Container 2 - All the JVMs running on a single DSE node (uniformly deployed across each machine within the cluster)
The OpsCenter daemon is (logically) separate from the cluster, and there is usually one instance for the entire deployment.
To provide cluster specific configuration, the following environment variables should be provided via the Docker run command:
a. CLUSTER_NAME: the name of the cluster to create/connect to
b. SEEDS: the comma-separated list of seed IP addresses,
e.g. SEEDS=127.0.0.2,127.0.0.3
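Putting the two variables together, a container launch might look like the following (the image name `dse` is a placeholder, not an official image name):

```shell
docker run -d \
  -e CLUSTER_NAME=my_cluster \
  -e SEEDS=127.0.0.2,127.0.0.3 \
  dse
```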
DSE uses mlockall to prevent swapping and page faults, which can fail inside a container unless the locked-memory limit is raised. The simplest workaround is to add -XX:+AlwaysPreTouch to the JVM arguments and disable swap on the host OS.
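As a sketch, the two workarounds look like this (JVM_OPTS follows the convention used by cassandra-env.sh; adapt to your own startup scripts):

```shell
# On the host OS: turn off swap entirely.
sudo swapoff --all
# In the JVM arguments: touch every heap page at startup instead of
# relying on mlockall succeeding inside the container.
JVM_OPTS="$JVM_OPTS -XX:+AlwaysPreTouch"
```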
All containers inherit ulimits from the Docker daemon by default. DSE containers should have them set to unlimited or reasonably high values (e.g. max locked memory and max memory size).
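Rather than raising the daemon-wide defaults, limits can also be set per container with docker run's --ulimit flag. A hedged example (the `dse` image name and the nofile value are illustrative):

```shell
# memlock=-1:-1 means unlimited locked memory (soft:hard limits).
docker run --ulimit memlock=-1:-1 --ulimit nofile=100000:100000 dse
```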
Docker’s default networking (via a Linux bridge) is not recommended for production use, as it slows networking considerably, by up to 50%. Development and testing benefit from running DSE clusters on a single Docker host, and for such scenarios the default networking is just fine.
Instead, use host networking (docker run --net=host) or a plugin that can manage IP ranges across clusters of hosts. Host networking limits the number of DSE nodes per Docker host to one, but it is the recommended configuration for production. Using Docker doesn’t mean putting it all on one host: think about the disks!
Use pipework or Weave if consistent IP address allocation is needed.
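The recommended production setup is therefore host networking, one DSE node per host (again, `dse` is a placeholder image name):

```shell
docker run -d --net=host dse   # container shares the host's network stack and IP
```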
Data volumes are required for the commitlog, saved_caches, and data directories (everything in /var/lib/cassandra). The data volume must use a supported file system (usually xfs or ext4).
A data volume is a specially-designated directory within one or more containers that bypasses the Union filesystem.
Volumes are initialized when a container is created. If the container’s base image contains data at the specified mount point, that existing data is copied into the new volume upon volume initialization.
Data volumes can be shared and reused among containers.
Changes to a data volume are made directly.
Changes to a data volume will not be included when you update an image.
Data volumes persist even if the container itself is deleted.
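A minimal sketch of the volume setup, assuming a host directory /mnt/cassandra on xfs or ext4 (the paths and the `dse` image name are illustrative):

```shell
# Bind-mount the host directory over /var/lib/cassandra so the commitlog,
# saved_caches, and data directories bypass the Union filesystem and
# survive container deletion.
docker run -d -v /mnt/cassandra:/var/lib/cassandra dse
```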
All of this works great for test/dev/prod environments.
Deploying DSE within Docker isn’t trivial, but with adequate guidance and pre-production validation, it’s not that difficult. As the container ecosystem evolves, it is expected that future DSE releases will have additional guidelines to make the most of DSE installations under Docker. Some future areas that DataStax is investigating are:
Further splitting of DSE processes into separate containers (e.g. running the Spark executors and the DSE core JVM within a single container, and all other DSE processes within separate containers)
Integration of container based deployment with workload management infrastructure components such as Kubernetes, Mesos, etc.
Enabling the deployment model on a variety of public and private clouds
Using volumes for data storage is a must for durability and performance.
Avoid bridge/NAT networking and run containers with --net=host. This is the simplest way to connect to the outside world and guarantees a stable IP address for the guest. Host networking also has the lowest overhead performance-wise, so your cluster should perform nearly as well as it does on bare metal.
DataStax acknowledges that containers have rapidly become one of the building blocks of modern infrastructure; the guidelines and examples above should reduce the amount of time required to run DSE in Docker.