2. 2
ABOUT ME
• Apache Storm Committer and PMC member
• Member of Yahoo’s low-latency team
Data processing solutions with low latency
• Graduate student @ University of Illinois, Urbana-Champaign
Research emphasis in distributed systems and stream processing
• Contact:
jerrypeng@yahoo-inc.com
4. 4
OVERVIEW
• Apache Storm is an open source distributed real-time data stream processing
platform
Real-time analytics
Online machine learning
Continuous computation
Distributed RPC
ETL
5. 5
STORM TOPOLOGY
• Processing can be represented as a directed graph
• Spouts are sources of information
• Bolts are operators that process data
6. 6
DEFINITIONS OF STORM TERMS
• Stream
an unbounded sequence of tuples.
• Component
A processing operator in a Storm
topology that is either a Bolt or Spout
• Executors
Threads that are spawned in worker
processes that execute the logic of
components
• Worker Process
A process spawned by Storm that may
run one or more executors.
9. 9
OVERVIEW OF SCHEDULING IN STORM
• Default Scheduling Strategy
Naïve round robin scheduler
Naïve load limiter (Worker Slots)
• Multitenant Scheduler
Default Scheduler with multitenant capabilities (supported by
security)
Can allocate a set of isolated nodes for topology (Soft
Partitioning)
• Resource Aware Scheduler
10. 10
RUNNING STORM AT YAHOO - CHALLENGES
• Increasing heterogeneous clusters
Isolation Scheduler – handing out dedicated machines
• Low cluster overall resource utilization
Users not utilizing their isolated allocation very well
• Unbalanced resource usage
Some machines not used, others over used
• Per topology scheduling strategy
Different topologies have different scheduling needs (e.g. constraint-based
scheduling)
11. 11
RUNNING STORM AT YAHOO – SCALE
[Chart: Total Nodes Running Storm at Yahoo, by year (2012–2016). Two series:
Total Nodes, growing from roughly 600 to 3500 nodes; Largest Cluster Size,
growing from roughly 120 to 680 nodes.]
12. 12
RESOURCE AWARE SCHEDULING IN STORM
• Scheduling in Storm that takes into account resource availability on
machines and resource requirement of workloads when scheduling
the topology
Fine-grained resource control
Resource Aware Scheduler (RAS) implements this function
- Includes many nice multi-tenant features
• Built on top of:
Peng, Boyang, Mohammad Hosseini, Zhihao Hong, Reza Farivar,
and Roy Campbell. "R-storm: Resource-aware scheduling in
storm." In Proceedings of the 16th Annual Middleware Conference,
pp. 149-161. ACM, 2015
13. 13
RAS API
• Fine-grained resource control
Allows users to specify resource requirements for each component (Spout or Bolt) in a Storm topology:
API to set component memory requirement:
public T setMemoryLoad(Number onHeap, Number offHeap)
API to set component CPU requirement:
public T setCPULoad(Number amount)
Example of usage:
SpoutDeclarer s1 = builder.setSpout("word", new TestWordSpout(), 10);
s1.setMemoryLoad(1024.0, 512.0);
builder.setBolt("exclaim1", new ExclamationBolt(), 3)
.shuffleGrouping("word").setCPULoad(100.0);
15. 15
RAS FEATURES – PLUGGABLE PER TOPOLOGY
SCHEDULING STRATEGIES
• Allows users to specify which scheduling strategy to use
• Default Strategy
- Based on:
• Peng, Boyang, Mohammad Hosseini, Zhihao Hong, Reza Farivar, and Roy Campbell. "R-storm: Resource-
aware scheduling in storm." In Proceedings of the 16th Annual Middleware Conference, pp. 149-161. ACM,
2015.
- Enhancements have been made (e.g. limiting max heap size per worker, better rack selection algorithm, etc.)
- Aims to pack the topology as tightly as possible onto machines to reduce communication latency and increase
utilization
- Collocates components that communicate with each other (operator chaining)
• Constraint Based Scheduling Strategy
Uses a constraint-satisfaction problem (CSP) solver
conf.setTopologyStrategy(DefaultResourceAwareStrategy.class);
16. 16
RAS FEATURES – RESOURCE ISOLATION VIA
CGROUPS (LINUX PLATFORMS ONLY*)
• Replaces resource isolation via isolated nodes
• Resource quotas enforced on a per worker basis
• Each worker should not go over its allocated resource quota
• Guarantees QoS and topology isolation
• Documentation:
https://storm.apache.org/releases/2.0.0-
SNAPSHOT/cgroups_in_storm.html
*RHEL 7 or higher. Potential critical bugs in older RHEL versions.
17. 17
RAS FEATURES – PER USER RESOURCE
GUARANTEES
• Configurable per user resource guarantees
18. 18
RAS FEATURE – TOPOLOGY PRIORITY
• Users can set the priority of a topology to indicate its importance
• Topology priorities range from 0 to 29. The priorities are partitioned into
several priority levels, each covering a range of priorities
conf.setTopologyPriority(int priority)
PRODUCTION => 0 – 9
STAGING => 10 – 19
DEV => 20 – 29
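The partitioning above can be sketched as a small helper. This is a hypothetical illustration of the 0–29 priority-level mapping, not part of the Storm API:

```java
public class PriorityLevels {
    // Maps a topology priority (0-29) to its level, mirroring the
    // PRODUCTION/STAGING/DEV partitioning on this slide.
    public static String levelOf(int priority) {
        if (priority < 0 || priority > 29) {
            throw new IllegalArgumentException("priority must be between 0 and 29");
        }
        if (priority <= 9) return "PRODUCTION";
        if (priority <= 19) return "STAGING";
        return "DEV";
    }

    public static void main(String[] args) {
        System.out.println(levelOf(5));   // PRODUCTION
        System.out.println(levelOf(15));  // STAGING
        System.out.println(levelOf(25));  // DEV
    }
}
```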
19. 19
RAS FEATURES – PLUGGABLE TOPOLOGY
PRIORITY
• Topology Priority Strategy
Which topology should be scheduled first?
Cluster wide configuration set in storm.yaml
Default Topology Priority Strategy
- Takes into account resource guarantees and topology priority
- Schedules topologies from the user who is furthest under his or her resource
guarantee
- Each user’s topologies are sorted by priority
- More details:
https://storm.apache.org/releases/2.0.0-
SNAPSHOT/Resource_Aware_Scheduler_overview.html
20. 20
RAS FEATURES – PLUGGABLE TOPOLOGY
EVICTION STRATEGIES
• Topology Eviction Strategy
When there are not enough resources, which topology from which user should be evicted?
Cluster wide configuration set in storm.yaml
Default Eviction Strategy
- Based on how much a user’s guarantee has been satisfied
- Priority of the topology
FIFO Eviction Strategy
- Used on our staging clusters.
- Ad hoc use
More details:
https://storm.apache.org/releases/2.0.0-
SNAPSHOT/Resource_Aware_Scheduler_overview.html
25. 25
CONCLUDING REMARKS AND FUTURE WORK
• In Summary
Built a resource-aware scheduler
• Migration Process
In the process of migrating from the MultitenantScheduler to RAS
Working through bugs with cgroups, Java, and the Linux kernel
• Future Work
Improved Scheduling Strategies
Real-time resource monitoring
Elasticity
29. 29
PROBLEM FORMULATION
• Targeting 3 types of resources
CPU, Memory, and Network
• Limited resource budget for each node
• Specific resource needs for each task
Goal:
Improve throughput by maximizing
utilization and minimizing network
latency
30. 30
PROBLEM FORMULATION
• Set of all tasks Ƭ = {τ1, τ2, τ3, …}; each task τi has resource demands:
CPU requirement c_τi
Network bandwidth requirement b_τi
Memory requirement m_τi
• Set of all nodes N = {θ1, θ2, θ3, …}; each node has:
Total available CPU budget W1
Total available bandwidth budget W2
Total available memory budget W3
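The per-node budgets above imply capacity constraints on any assignment. A sketch in LaTeX, introducing an assignment indicator x_ij that is not in the original slides:

```latex
% x_{ij} = 1 if task \tau_i is assigned to node \theta_j, else 0
\sum_{\tau_i \in \mathcal{T}} x_{ij}\, c_{\tau_i} \le W_1, \qquad
\sum_{\tau_i \in \mathcal{T}} x_{ij}\, b_{\tau_i} \le W_2, \qquad
\sum_{\tau_i \in \mathcal{T}} x_{ij}\, m_{\tau_i} \le W_3
\qquad \text{for every node } \theta_j \in N
```

As a later slide notes, only the memory constraint is treated as hard; the CPU and bandwidth budgets are soft and degrade gracefully when oversubscribed.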
31. 31
PROBLEM FORMULATION
• Qi : Throughput contribution of each node
• Assign tasks to a subset of nodes N′ ⊆ N that minimizes the total resource waste
32. 32
PROBLEM FORMULATION
Quadratic Multiple 3D Knapsack Problem
We call it QM3DKP!
NP-Hard!
• Computing optimal or even approximate solutions may be hard and time-consuming
• Real-time systems need fast scheduling
Scheduling must be re-computed when failures occur
33. 33
SOFT CONSTRAINTS VS HARD CONSTRAINTS
• Soft Constraints
CPU and Network Resources
Graceful performance degradation under oversubscription
• Hard Constraints
Memory
Oversubscribe -> Game over
34. 34
OBSERVATIONS ON NETWORK LATENCY
1. Inter-rack communication is the slowest
2. Inter-node communication is slow
3. Inter-process communication is faster
4. Intra-process communication is the fastest
35. 35
HEURISTIC ALGORITHM
• Greedy approach
• Designing a 3D resource space
Each resource maps to an axis
Can be generalized to nD resource space
Trivial overhead!
• Based on:
min (Euclidean distance)
Satisfy hard constraints
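The node-selection step of this greedy heuristic can be sketched in a few lines of plain Java. This is an illustrative sketch, not the actual R-Storm implementation; the class and method names are hypothetical, and resource vectors are assumed to be [cpu, bandwidth, memory]:

```java
import java.util.List;

public class GreedyPlacement {
    // Picks the node whose availability vector is closest (minimum Euclidean
    // distance in the 3D resource space) to the task's demand vector, while
    // never violating the hard memory constraint (axis index 2).
    public static int pickNode(double[] demand, List<double[]> available) {
        int best = -1;
        double bestDist = Double.MAX_VALUE;
        for (int j = 0; j < available.size(); j++) {
            double[] avail = available.get(j);
            // Hard constraint: memory must never be oversubscribed.
            if (avail[2] < demand[2]) continue;
            double d = 0;
            for (int k = 0; k < 3; k++) {
                double diff = avail[k] - demand[k];
                d += diff * diff;
            }
            d = Math.sqrt(d);
            if (d < bestDist) { bestDist = d; best = j; }
        }
        return best; // -1 if no node satisfies the hard constraint
    }
}
```

Minimizing the distance between demand and availability favors nodes whose remaining capacity closely fits the task, which reduces resource waste while the memory check enforces the hard constraint.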
38. 38
HEURISTIC ALGORITHM
• Our proposed heuristic algorithm has the following properties:
1) Tasks of components that communicate with each other have the highest priority to be scheduled in close network proximity
to each other.
2) No hard resource constraint is violated.
3) Resource waste on nodes is minimized.
Editor's Notes
Good afternoon, My name is Boyang Jerry Peng and I am here to present Resource Aware Scheduling in Apache Storm.
A little about me: I am an Apache Storm committer and PMC member.
I am currently a part of the low-latency team at Yahoo.
Our team primarily works on projects that provide low-latency data processing solutions to Yahoo, and Apache Storm is one of the platforms we work on.
Prior to joining Yahoo, I was a graduate student at the University of Illinois, Urbana-Champaign, with a research emphasis in distributed systems.
First, going to provide a brief overview of Apache Storm
Then, I will discuss the problems and challenges of running apache storm at yahoo.
Next, I will get to the core of this presentation and talk about resource aware scheduling in Storm. Define what it is and how to use it and how it helps us overcome the problems and challenges I have mentioned
Lastly, I will present some results.
Apache Storm is a popular open source distributed data stream processing platform used by many companies in industry
There are many use cases for Apache Storm such as:
Real-time analytics , Online machine learning , Continuous computation , Distributed RPC , and ETL operations
In apache storm, an application or workload is called a Storm topology. A storm topology, like applications in other stream processing systems, can be represented as a directed graph
In which each edge represents a flow of data and each vertex a location where processing data occurs.
In Storm, there are two types of operators or component.
First type is called a spout. Spouts are sources of information and are responsible for injecting data into the storm topology
Second type is called a bolt. Bolts consume streams of data, conduct any user defined processing, and potentially emit new streams of data downstream to be processed by other
bolts
Briefly go over some definitions in Storm
Two types of nodes in a Storm cluster
A master node that runs a daemon called Nimbus. The master node and the Nimbus daemon are responsible (with the help of Apache Zookeeper) for maintaining the active membership of the storm cluster. The Nimbus node is also responsible for computing schedulings of topologies in the Storm cluster.
A worker node in Storm is a node that runs a daemon called supervisor that is responsible for retrieving schedulings from nimbus via zookeeper and launching the necessary processes according to the scheduling to realize the computation of the topology
Let me also talk about the difference between logical and physical connections in Storm.
The diagram on the left is an example of a storm topology where executors are organized by component.
And each line connecting two executors represents a logical connection.
In The diagram on your right, executors are organized by the physical machines they are scheduled on and each line represents a physical connection.
As you can see logical connections can vary quite a bit from the physical connections that need to be made in a topology
This is where the scheduler can play an important part. How the topology is scheduled can have major impacts on performance of the topology.
Let me talk about how scheduling is done in storm
Default scheduler schedules executors in a round robin fashion
Uses the concept of worker slots to limit the computation load on a single machine. Can only Launch as many worker processes as worker slots.
Each worker can run any number of executors that requires any amount of resources to run.
Because the scheduler is not resource aware, customers want isolated nodes
Not very effective
Not resource aware.
Executors use any arbitrary amount of resources.
We see some nodes overloaded and some nodes empty
Let me talk about some challenges of running storm at yahoo
Our clusters have become increasingly heterogeneous. Made up of older nodes and new nodes that have different hardware specs
Handing out dedicated nodes in a heterogeneous cluster means sometimes nodes of one size, sometimes another.
Customers are not utilizing resources well. They use more nodes than they need, because they do not think carefully about resource requirements. Nothing else can run on those isolated nodes.
Fine grain resource control
Deprecates the notion of using worker slots to limit load and removes the need to use isolated nodes. Resource isolation via cgroups
Let me go over the some of the core API for scheduling with resource aware scheduler
Allows users to specify the resource requirements for each component…
Cluster admins can specify how much of each resource is available for user on each worker machine
Let me talk about some features Resource Aware Scheduler provides
One of them is have pluggable per topology scheduling strategies.
We have identified that different topologies might have different scheduling needs
Constraint based scheduling strategy:
An internal user has some scheduling requirements in which
Users can describe these constraints and the strategy will attempt to find a scheduling that satisfies these constraints
One neat feature we developed to support RAS is resource isolation via cgroups.
Gets rid of handing out isolated nodes, which was killing our utilization.
RHEL 7 cgroups and Java memory accounting do not play well together; there are bugs in the kernel.
Taken into account in the scheduling priority and eviction strategies I will mention later
Taken into account in scheduling priority and eviction strategies
pluggable
In what order should the topologies be scheduled
Pluggable
Different clusters should have different eviction policies (Production vs Staging)
How much over his or her resource guarantee a user is
Not enough resources or sudden failure
Still in the process of migration.
The average amount of assigned memory has decreased, which implies that topologies are becoming more resource-efficient to run.
Using less memory to run
Run more topologies
Working out the kinks. Cgroup and memory. Complete migration, beta quality
For each task, with a resource vector that represents its resource requirements, we attempt to find the node whose resource-availability vector is closest, based on min (Euclidean distance), while not violating hard constraints.
Based on min (Euclidean distance) while not violating hard constraints