Apache Samza - New features in the upcoming Samza release 0.10.0
1. Apache Samza 0.10.0
What’s coming up in the next Samza release
LinkedIn
Navina R
Committer @ Apache Samza
2. New Features in Samza 0.10.0
Dynamic Configuration & Control
◦ Coordinator Stream
◦ Broadcast Stream
Host affinity in Samza
New Consumer: Kinesis
New Producers: Kinesis, HDFS, ElasticSearch
Upgraded RocksDB
4. How does Config work today?
[Diagram: job config is submitted to the Resource Manager (RM); the RM passes config via the command line to the Application Master (AM); the AM passes config via the command line to containers C0, C1, C2, alongside the Checkpoint Stream.]
Job deployment in Yarn:
◦ The job is localized to the Resource Manager (RM)
◦ The RM allocates a container for the Application Master (AM) and passes the config parameters as command-line arguments to the run-am script
◦ Similarly, the AM passes the config to the containers on allocation
5. Problems
[Diagram: same deployment flow as slide 4 – config passed via the command line from the RM to the AM and on to containers C0, C1, C2, alongside the Checkpoint Stream.]
◦ Escaping / unescaping quotes is cumbersome (SAMZA-700)
◦ Limits the number of arguments that can be set through the shell command line (SAMZA-337, SAMZA-333)
◦ Dynamic config changes are not possible; every config change requires a job re-submission (restart) (SAMZA-348)
◦ System config such as checkpoints is handled differently than user-defined config (SAMZA-348)
6. Solution: Coordinator Stream
[Diagram: job submitted to the RM; the AM hosts the Job Coordinator (JC); containers C0, C1, C2 request config via HTTP; the JC bootstraps config from the Coordinator Stream.]
Coordinator Stream (CS):
◦ Single partition
◦ Log-compacted
◦ Each job has its own CS
Job Coordinator (JC):
◦ Exposes an HTTP endpoint for containers to query for the Job Model
◦ Bootstraps from the CS and then continues consumption from the CS
Samza job deployment using the Job Coordinator & Coordinator Stream (config is bootstrapped from the stream)
7. Data in Coordinator Stream
Coordinator Stream (CS) contains:
◦ Checkpoints for the input streams – containers periodically write checkpoints to the CS, instead of to a separate checkpoint topic
◦ Task-to-changelog partition mapping
◦ Container locality info (required for Host Affinity) – containers write their location (machine name) to the CS
◦ User-defined configuration – the entire configuration is written to the CS when the job is started
◦ Migration-related messages
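As an illustration, each entry in the coordinator stream is a keyed JSON record that tags the message type alongside its payload. The shape below is a simplified sketch of a user-config entry (field names are illustrative, not the exact on-the-wire format):

```json
{
  "key": ["set-config", "task.window.ms"],
  "message": {
    "source": "job-runner",
    "timestamp": 1445000000000,
    "values": { "value": "60000" }
  }
}
```

Because the stream is log-compacted on the key, only the latest value for each config key and checkpoint is retained.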
8. Coordinator Stream: Benefits
[Diagram: same deployment flow as slide 6 – containers request config from the JC via HTTP; the JC bootstraps from the Coordinator Stream.]
◦ Config can be easily serialized / deserialized
◦ Checkpoints & user-defined configs are stored similarly
◦ Config changes can be made by writing to the CS*
◦ The JC can be used to coordinate job execution*
* Work In Progress
Samza job deployment using the Job Coordinator & Coordinator Stream (config is bootstrapped from the stream)
9. Coordinator Stream: Tools / Migration
Tools:
◦ Command-line tool to write config changes to the coordinator stream
Migration:
◦ JobRunner in 0.10.0 automatically migrates checkpoints and changelog mappings from 0.9.1 to the Coordinator Stream in 0.10.0
17. Fault Tolerance in a Stateful Job
[Diagram: input partitions P0–P3 assigned to Task-0 through Task-3, running in containers on Host-A, Host-B, and Host-C, with a Changelog Stream.]
Task-0 & Task-1, running on the container on Host-A, fail.
18. Fault Tolerance in a Stateful Job
[Diagram: same layout, with the failed container's tasks now placed on Host-E instead of Host-A.]
Yarn allocates the tasks to a container on a different host!
19. Fault Tolerance in a Stateful Job
[Diagram: same layout; the changelog partitions are replayed from offset 0 up to the latest offsets (159 and 82).]
Local state restored by consuming the changelog from the earliest offset!
20. Fault Tolerance in a Stateful Job
[Diagram: tasks back on Host-E, Host-B, and Host-C with state restored.]
After the state is restored, the job continues with input processing – back to stable state!
21. Problems
[Diagram: same task/host layout as before.]
State stores are not persisted if the container fails
◦ Tasks need to restore the state stores from the changelog before continuing with input processing
The Samza AppMaster is not aware of host locality for a container
◦ The container gets relocated to a new host
Excessive start-up times when a job is restarted
22. Motivation
During upgrades and job failures:
◦ Local state built in the task is lost
◦ Samza is not aware of the container locality
◦ Job start-up time is large (hours)
The job is no longer “near-realtime”
Multiple stateful jobs starting up at the same time will DDoS Kafka, saturating the Kafka clusters
23. Solution: Host Affinity in Samza
Host Affinity – the ability of Samza to allocate a container to the same machine across job restarts/deployments
Host affinity is best-effort:
◦ Cluster load may vary
◦ The machine may be non-responsive
◦ The container should shut down cleanly
28. Host Affinity in Samza
[Diagram: same task/host layout; the Coordinator Stream records container locality: Container-0 -> Host-E, Container-1 -> Host-B, Container-2 -> Host-C.]
State store does not have to be restored from the earliest offset!
29. Host Affinity in Samza
[Diagram: job back to the same task/host layout.]
Job back to stable state pretty quickly!
30. Host Affinity in Samza
Enable host-affinity
◦ yarn.samza.host-affinity.enabled=true
Enable continuous scheduling in Yarn
Useful for stateful jobs
Does not affect stateless jobs
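Putting the two settings together – the Samza job property from the slide, plus the Fair Scheduler flag that enables continuous scheduling on the Yarn cluster (the Yarn property name below is from the Fair Scheduler documentation; verify it against your Yarn version):

```properties
# In the Samza job config: ask the AM to request each container's
# previous host on restart/redeploy (best-effort)
yarn.samza.host-affinity.enabled=true

# In yarn-site.xml on the cluster (Fair Scheduler): allow the RM to
# schedule continuously, so host-specific requests can be honored
# outside the node-heartbeat cycle
yarn.scheduler.fair.continuous-scheduling-enabled=true
```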
32. Upgraded RocksDB
The new RocksDB JNI 3.13.1+ version supports TTL
Impact:
◦ Removes the need to write customized code to delete expired records
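With TTL support, expiry can be configured per store. The sketch below assumes a store named `my-store` and follows the `stores.<store-name>.*` convention – treat the exact key name as an assumption and check the 0.10.0 configuration docs:

```properties
# Hypothetical store "my-store" backed by RocksDB
stores.my-store.factory=org.apache.samza.storage.kv.RocksDbKeyValueStorageEngineFactory

# Records older than one hour are lazily deleted by RocksDB
# during compaction -- no custom cleanup code needed
stores.my-store.rocksdb.ttl.ms=3600000
```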
33. New Features in Samza 0.10.0
Dynamic Configuration & Control
◦ Coordinator Stream
◦ Broadcast Stream
Host affinity in Samza
New Consumer: Kinesis
New Producers: Kinesis, HDFS, ElasticSearch
Upgraded RocksDB
34. Thanks!
Expected release date – Nov 2015
Thanks to all the contributors!
Contact Us:
◦ Mailing List – dev@samza.apache.org
◦ Twitter - #samza, @samzastream
We have come up with a way to dynamically configure and control your Samza jobs. The 2 features relevant to this are coordinator stream and broadcast stream. I will discuss the motivation and design for these in the next few slides.
We will also illustrate what we mean by host-affinity in Samza and why it is important.
There have been many new contributors since the last release, and they have added significant value to our codebase, such as new system producers and consumers.
We have verified and merged producers for 2 systems – HDFS (Eli Reisman) and ElasticSearch (Dan Harvey).
A Kinesis producer/consumer is a very popular ask among Samza users who are predominantly based on AWS. It was recently prototyped as part of Google Summer of Code. We are eagerly looking forward to the patch and are planning to release it as a beta version in 0.10.0.
0.10.0 will use an upgraded version of RocksDB.
I will be discussing the design details of the Coordinator Stream, Broadcast Stream, and Host Affinity. The rest of the feature additions should be straightforward from the website docs!
Let’s look at how a Samza job configuration works today.
When a Samza job is deployed on Yarn, we submit an application request to the RM.
The job tarball, which includes the config, is localized on the RM. The RM passes the config to the AM when executing “run-am” on AM container start-up.
Similarly, the AM starts each container using the “run-container.sh” command with the config included in the command line.
Passing the config as part of the command line has certain drawbacks.
* Escaping / unescaping quotes becomes tedious, preventing us from using any complicated config values.
* Size limit on arguments when Yarn exports configuration – when Yarn launches a container using launch_container.sh, it exports all environment variables, including the Samza config, as variables on the command line. There is a limit to the total argument length on the machine, usually ~128KB. This is problematic for jobs with large configurations.
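The limit described above is easy to inspect: `getconf ARG_MAX` reports the kernel's cap on the combined size of arguments and environment for a single exec (the exact value varies by OS):

```shell
# Upper bound on the total size of command-line arguments plus
# environment variables for one exec() call. A job whose serialized
# config approaches this bound cannot be launched via command-line
# config passing.
getconf ARG_MAX
```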
No support for dynamic configuration changes
– Config is immutable once the job starts; Features such as auto-scaling require dynamic reconfiguration of the job.
User-defined and programmatic configuration are handled differently – checkpoint configuration is in a stream and can be overwritten, whereas user-defined config is actionable only during job start-up.
Lack of persistent configuration between job executions
– Cannot validate a configuration for a job without persistent configuration. Certain changes to job configuration may be equivalent to resetting the job itself.
The Coordinator Stream is basically a single-partition, log-compacted stream that acts as a “config log”. Each job has its own CS.
Job Coordinator – a component that reads the entire config from the bootstrap stream and exposes the config to the containers through an HTTP endpoint.
The term “Coordinator Stream” is somewhat overloaded, in the sense that it carries a lot of job- and system-related configuration and can potentially be used by the JC to make smarter decisions about container execution/placement.
For example, it will now contain checkpoint information. Containers write checkpoints directly to the coordinator stream. When a container comes up, it queries the JC for the “JobModel”, which defines the hierarchy of the job execution. A JobModel is composed of one or more ContainerModels, and each ContainerModel is composed of a set of TaskModels. Each TaskModel contains the checkpoint information for the input streams being processed by that task instance. In this way, the JC exposes the job topology using a uniform data model.
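Since the JC serves the JobModel to containers over HTTP as JSON, the hierarchy can be pictured as a nested document. The field names below are simplified stand-ins for illustration, not Samza's exact serialized form:

```json
{
  "config": { "job.name": "my-job" },
  "containers": {
    "0": {
      "tasks": {
        "Task-0": {
          "input-partitions": ["input-stream.0"],
          "checkpoint-offsets": { "input-stream.0": "159" }
        }
      }
    }
  }
}
```

Each container fetches this document at start-up and picks out its own ContainerModel to learn which tasks, partitions, and checkpoints it owns.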
So now, when you deploy a job, the RM brings up the AM container, which has the JC embedded within it. The JC bootstraps from the CS and builds a JobModel.
Callout -> Container locality information -> will be explained as part of host affinity.
Serialized/deserialized – can support more complex config definitions; currently, we use a JSON serde. This gives more flexibility in terms of parsing configs.
No distinction between system- and job-related configuration.
Make a config change by writing directly to the CS. We already have a command-line tool for that. -> Dynamic config changes for the job.
In the future, we want to enable the JC to control the container life-cycle and the job execution instead of the AM.
CALL OUT:
Migration works for Kafka-based systems ONLY! If there are Samza jobs that use a different stream system for checkpoints, then the checkpoint/changelog migration has to be performed manually; also, remove the “task.checkpoint.factory” configuration before restarting the job with 0.10.0.
Trident in Storm allows you to perform a broadcast function, where every tuple is replicated to all target partitions. Broadcast streams in Samza are analogous to the broadcast function in Storm. Here, we allow a stream partition to be consumed by all task instances in the job.
Use Cases
- Change the algorithm or tests that are run in the Samza job such as PMML
- Acts as a custom control channel for an application - Trigger global behavior change in a job
Today, in Samza, we have modeled tasks such that each stream partition is consumed by only one task instance in the job. This ensures that you don’t process the same partition’s messages multiple times.
Explain the diagram & config statement
Explain the diagram and config
Call Out -> Can broadcast more than 1 partition in the stream
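The config statement referred to above declares broadcast inputs in the job config. The sketch below assumes a Kafka system named `kafka` and a stream named `control`; the `#` suffix selects the partition(s) to broadcast (check the 0.10.0 docs for the exact syntax):

```properties
# Partition 0 of kafka.control is delivered to every task instance
task.broadcast.inputs=kafka.control#0

# More than one partition can be broadcast, e.g. a range:
# task.broadcast.inputs=kafka.control#[0-2]
```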
Deployment Modes
Local – Single container process which handles all input partitions, running on a machine
Standalone* - Developer/deployment tool starts the containers in a set of machines
Yarn – SamzaAppMaster interacts with the Resource Manager (RM) and the Node Manager (NM) in order to manage resource-allocation and provide fault tolerance
Before getting into the details of host affinity, I want to provide an overview of how stateful jobs behave and how fault tolerance is handled using Yarn.
Explain the diagram
Call Out ->
* Container allocated on a different host
* State needs to be restored
This impact is amplified not only when a container is lost on a host or is pre-empted from a host by Yarn, but also during job upgrades.
The job is no longer “near-realtime” -> since it needs to catch up with the large backlog of input messages that accumulated while the state was being restored.
Multiple stateful jobs starting up at the same time (let’s say you just upgraded Yarn, and that bounced all the running jobs) will DDoS Kafka, saturating the Kafka clusters. The ATC job at LinkedIn recently faced this issue.
If the container does not shut down cleanly, the OFFSET file with its checksum is not generated, and hence the local store state will not get reused.
Containers write task checkpoints to Coordinator Stream directly
Additionally, when the container starts up, it writes the machine name to the CS.
Call out ->
* Local state remains in the machine (for a period of time)
* JC has a global view of the locality of the container
AM knows that it has to try placing the container on Host-E before defaulting to some other available host.
Now, the AM knows to specifically ask for Host-E from the RM. As long as the RM can allocate a container on the same host, host affinity is successful.
In scenarios where the machine cannot allocate the requested number of resources, the AM brings it up on any other free host returned by the RM.
Note:
New container again writes the locality to the coordinator stream
An “OFFSET” file is stored on the local disk with a checksum to ensure that the data is not corrupted.
In Yarn, we leverage the continuous scheduling feature in the Fair Scheduler to make this work. This requires some configuration on the Yarn cluster.