3. Building distributed systems is hard.
Load balancing
Responding to node entry and exit
Alerting based on metrics
Managing data replicas
Supporting event listeners
4. Helix abstracts away the problems distributed systems need to solve.
Load balancing
Responding to node entry and exit
Alerting based on metrics
Managing data replicas
Supporting event listeners
5. System Lifecycle
Single Node → Multi-Node → Fault Tolerance → Cluster Expansion
Multi-Node: Partitioning, Discovery, Co-Location, Replication
Fault Tolerance: Fault Detection, Recovery
Cluster Expansion: Redistribute data, Throttle movement
10. Resource Assignment Problem
Making it Work: Take 1 (ZooKeeper)
(Diagram: applications talk to Helix, which sits on top of ZooKeeper as the consensus system.)
ZooKeeper provides low-level primitives: a file system, locks, and ephemeral nodes.
We need high-level primitives: nodes, partitions, replicas, states, and transitions.
11. Resource Assignment Problem
Making it Work: Take 2 (Decisions by Nodes)
(Diagram: each service S running on a node watches the consensus system for config changes, node changes, and node updates, and decides its own assignments.)
12. Resource Assignment Problem
Making it Work: Take 2 (Decisions by Nodes)
Problems with this approach: multiple brains, app-specific logic on every node, and unscalable traffic against the consensus system.
14. Resource Assignment Problem
Making it Work: Take 3 (Single Brain)
(Diagram: a single controller watches the consensus system for config changes and node changes, and sends node updates to each service S.)
Node logic is drastically simplified!
16. Resource Assignment Problem
Helix View
(Diagram: controllers manage RESOURCES hosted on NODES (participants); spectators observe the cluster.)
Question: How do we make this controller generic enough to work for different resources?
20. Helix Concepts
Constraints: Augmenting the State Model
State Constraints
MASTER: [1, 1]
SLAVE: [0, R]
Special Constraint Values
R: Replica count per partition
N: Number of participants
Transition Constraints
Cluster scope: OFFLINE-SLAVE: 3 concurrent
Resource R1 scope: SLAVE-MASTER: 1 concurrent
Participant P4 scope: OFFLINE-SLAVE: 2 concurrent
States and transitions are ordered by priority when computing replica states.
Transition constraints can be restricted to cluster, resource, and participant scopes. The most restrictive constraint is used.
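Transition constraints like these are applied as message constraints on the cluster. The sketch below shows how the cluster-scope OFFLINE-SLAVE throttle could be registered, assuming Helix's ConstraintItemBuilder and message-constraint API (attribute names follow the Helix throttling tutorial; the ZooKeeper address, cluster name, and constraint id are illustrative):

import org.apache.helix.HelixAdmin;
import org.apache.helix.manager.zk.ZKHelixAdmin;
import org.apache.helix.model.ClusterConstraints.ConstraintType;
import org.apache.helix.model.builder.ConstraintItemBuilder;

public class TransitionThrottleExample {
  public static void main(String[] args) {
    HelixAdmin admin = new ZKHelixAdmin("localhost:2181");
    // Allow at most 3 concurrent OFFLINE-SLAVE transitions across the whole cluster.
    ConstraintItemBuilder builder = new ConstraintItemBuilder();
    builder.addConstraintAttribute("MESSAGE_TYPE", "STATE_TRANSITION")
           .addConstraintAttribute("TRANSITION", "OFFLINE-SLAVE")
           .addConstraintAttribute("CONSTRAINT_VALUE", "3");
    admin.setConstraint("MyCluster", ConstraintType.MESSAGE_CONSTRAINT,
        "OfflineToSlaveClusterLimit", builder.build());
  }
}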
23. Helix Concepts
Resources and the Augmented State Model
(Diagram: a resource is divided into partitions; each partition has replicas in states such as master, slave, and offline.)
All partitions can be replicated.
Each replica is in a state governed by the augmented state model.
24. Helix Concepts
Objectives
Partition Placement
Distribution policy for partitions and replicas
Making effective use of the cluster and the resource
Failure and Expansion Semantics
Creating new replicas and assigning states
Changing existing replica states
26. Rebalancing Strategies
Meeting Objectives within Constraints
Mode          Replica Placement    Replica State
Full-Auto     Helix                Helix
Semi-Auto     App                  Helix
Customized    App                  App
User-Defined  App code plugged into the Helix controller (both placement and state)
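The rebalance mode is a per-resource setting stored in the ideal state. A minimal sketch (cluster, resource, and ZooKeeper names are illustrative) of switching a resource to Full-Auto through the Helix admin API:

import org.apache.helix.HelixAdmin;
import org.apache.helix.manager.zk.ZKHelixAdmin;
import org.apache.helix.model.IdealState;
import org.apache.helix.model.IdealState.RebalanceMode;

public class RebalanceModeExample {
  public static void main(String[] args) {
    HelixAdmin admin = new ZKHelixAdmin("localhost:2181");
    // Read the resource's ideal state, switch the mode, and write it back.
    IdealState idealState = admin.getResourceIdealState("MyCluster", "MyDB");
    idealState.setRebalanceMode(RebalanceMode.FULL_AUTO);
    admin.setResourceIdealState("MyCluster", "MyDB", idealState);
  }
}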
27. Rebalancing Strategies
Full-Auto
(Diagram: Node 1 hosts P1: M and P2: S, Node 2 hosts P2: M and P3: S, Node 3 hosts P3: M and P1: S.)
By default, Helix optimizes for minimal movement and even distribution of partitions and states.
29. Rebalancing Strategies
Full-Auto
(Diagram: Node 3 goes down; Helix promotes a surviving P3 slave to master and redistributes the lost replicas across Node 1 and Node 2.)
By default, Helix optimizes for minimal movement and even distribution of partitions and states.
30. Rebalancing Strategies
Semi-Auto
(Diagram: Node 1 hosts P1: M and P2: S, Node 2 hosts P2: M and P3: S, Node 3 hosts P3: M and P1: S.)
Semi-Auto mode maintains the location of the replicas, but allows Helix to adjust the states to follow the state constraints. This is ideal for resources that are expensive to move.
32. Rebalancing Strategies
Semi-Auto
(Diagram: Node 3 goes down; the P3 slave on Node 2 is promoted to master, but no replicas are moved.)
Semi-Auto mode maintains the location of the replicas, but allows Helix to adjust the states to follow the state constraints. This is ideal for resources that are expensive to move.
34. Rebalancing Strategies
Customized
The app specifies the location and state of each replica. Helix still ensures that transitions are fired according to constraints.
Need to respond to node changes? Use the Helix custom code invoker to run on one participant, or...
35. Rebalancing Strategies
User-Defined
Node joins or leaves the cluster → Helix controller invokes code plugged in by the app → Rebalancer implemented by the app computes replica placement and state → Helix fires transitions without violating constraints.
The rebalancer receives a full snapshot of the current cluster state, as well as access to the backing data store. Helix rebalancers implement the same interface.
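The exact rebalancer interface depends on the Helix version, so the class below is only a sketch of the computation a user-defined rebalancer performs: given the live instances, produce a partition-to-replica-state mapping that Helix then applies under the constraints. All names here are hypothetical.

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class RoundRobinPlacement {
  // Compute partition -> (instance -> state), giving each partition one MASTER
  // and (numReplicas - 1) SLAVEs spread round-robin over the live instances.
  public Map<String, Map<String, String>> computeMapping(
      String resource, int numPartitions, int numReplicas, List<String> liveInstances) {
    Map<String, Map<String, String>> mapping = new TreeMap<>();
    for (int p = 0; p < numPartitions; p++) {
      Map<String, String> states = new LinkedHashMap<>();
      for (int r = 0; r < Math.min(numReplicas, liveInstances.size()); r++) {
        String instance = liveInstances.get((p + r) % liveInstances.size());
        states.put(instance, r == 0 ? "MASTER" : "SLAVE");
      }
      mapping.put(resource + "_" + p, states);
    }
    return mapping;
  }
}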
43. ZooKeeper View
Ideal State
(Diagram: replica placement and state; P1 → N1: M, N2: S; P2 → N2: M, N1: S.)
{
  "id" : "SampleResource",
  "simpleFields" : {
    "REBALANCE_MODE" : "USER_DEFINED",
    "NUM_PARTITIONS" : "2",
    "REPLICAS" : "2",
    "STATE_MODEL_DEF_REF" : "MasterSlave",
    "STATE_MODEL_FACTORY_NAME" : "DEFAULT"
  },
  "mapFields" : {
    "SampleResource_0" : {
      "node1_12918" : "MASTER",
      "node2_12918" : "SLAVE"
    }
    ...
  },
  "listFields" : {}
}
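The mapFields above are what a Customized or User-Defined rebalancer ultimately writes. A minimal sketch (illustrative cluster name and ZooKeeper address) of setting the same placement programmatically through the ideal state record:

import java.util.LinkedHashMap;
import java.util.Map;
import org.apache.helix.HelixAdmin;
import org.apache.helix.manager.zk.ZKHelixAdmin;
import org.apache.helix.model.IdealState;

public class IdealStateUpdateExample {
  public static void main(String[] args) {
    HelixAdmin admin = new ZKHelixAdmin("localhost:2181");
    IdealState idealState = admin.getResourceIdealState("MyCluster", "SampleResource");
    // mapFields hold the per-partition replica placement and state.
    Map<String, String> replicaStates = new LinkedHashMap<>();
    replicaStates.put("node1_12918", "MASTER");
    replicaStates.put("node2_12918", "SLAVE");
    idealState.getRecord().setMapField("SampleResource_0", replicaStates);
    admin.setResourceIdealState("MyCluster", "SampleResource", idealState);
  }
}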
44. ZooKeeper View
Current State and External View
Current State (per node): N1 reports P1: MASTER, P2: MASTER; N2 reports P1: OFFLINE, P2: OFFLINE.
External View (per partition): P1 → N1: M, N2: O; P2 → N1: M, N2: O.
Helix's responsibility is to make the external view match the ideal state as closely as possible.
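Both records live in ZooKeeper, so convergence can be checked by reading them back. A minimal sketch (illustrative names) that prints the ideal placement next to what the external view actually reports:

import org.apache.helix.HelixAdmin;
import org.apache.helix.manager.zk.ZKHelixAdmin;
import org.apache.helix.model.ExternalView;
import org.apache.helix.model.IdealState;

public class ConvergenceCheckExample {
  public static void main(String[] args) {
    HelixAdmin admin = new ZKHelixAdmin("localhost:2181");
    IdealState ideal = admin.getResourceIdealState("MyCluster", "SampleResource");
    ExternalView view = admin.getResourceExternalView("MyCluster", "SampleResource");
    for (String partition : view.getPartitionSet()) {
      System.out.println(partition
          + " ideal=" + ideal.getRecord().getMapField(partition)
          + " actual=" + view.getStateMap(partition));
    }
  }
}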
47. Example: Distributed Data Store
(Diagram: 12 partitions P.1-P.12 spread across Node 1, Node 2, and Node 3; each node holds master replicas for some partitions and slave replicas for others.)
Partition Management
• multiple replicas
• 1 master
• even distribution
Fault Tolerance
• fault detection
• promote slave to master
• even distribution
• no SPOF
Elasticity
• minimize downtime
• minimize data movement
• throttle movement
48. Example: Distributed Data Store
Helix-Based Solution
Define: state model, state transitions
Configure: create cluster, add nodes, add resource, configure rebalancer
Run: start controller, start participants
49. Example: Distributed Data Store
State Model Definition: Master-Slave
States: all possible states, with priority
Transitions: legal transitions, with priority
Applicable to each partition of a resource
(State diagram: Master, Slave, Offline)
50. Example: Distributed Data Store
State Model Definition: Master-Slave

StateModelDefinition.Builder builder = new StateModelDefinition.Builder("MasterSlave");

// add states and their ranks to indicate priority
builder.addState(MASTER, 1);
builder.addState(SLAVE, 2);
builder.addState(OFFLINE);

// set the initial state when participant starts
builder.initialState(OFFLINE);

// add transitions
builder.addTransition(OFFLINE, SLAVE);
builder.addTransition(SLAVE, OFFLINE);
builder.addTransition(SLAVE, MASTER);
builder.addTransition(MASTER, SLAVE);
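Once built, the definition is registered with the cluster so the controller can use it. A short sketch continuing the snippet above, assuming the admin API shown later in the deck (cluster name and ZooKeeper address are illustrative):

// Build the definition and register it with the cluster.
StateModelDefinition masterSlave = builder.build();
HelixAdmin admin = new ZKHelixAdmin("localhost:2181");
admin.addStateModelDef("MyCluster", "MasterSlave", masterSlave);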
51. Example: Distributed Data Store
Defining Constraints
Scope       State   Transition
Partition   Y       Y
Resource    -       Y
Node        Y       Y
Cluster     -       Y
(State diagram: Master with StateCount=1, Slave with StateCount=2, Offline)
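The state counts shown here map onto the same builder used on the previous slide; a short sketch using the builder's bound methods (MASTER and SLAVE are the constants from that snippet):

// At most one MASTER per partition; up to R SLAVEs, where R is the replica count.
builder.upperBound(MASTER, 1);
builder.dynamicUpperBound(SLAVE, "R");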
53. Example: Distributed Data Store
Participant Plug-In Code

@StateModelInfo(initialState = "OFFLINE", states = {"OFFLINE", "SLAVE", "MASTER"})
class DistributedDataStoreModel extends StateModel {

  @Transition(from = "OFFLINE", to = "SLAVE")
  public void fromOfflineToSlave(Message m, NotificationContext ctx) {
    // bootstrap data, set up replication, etc.
  }

  @Transition(from = "SLAVE", to = "MASTER")
  public void fromSlaveToMaster(Message m, NotificationContext ctx) {
    // catch up with the previous master, enable writes, etc.
  }

  ...
}
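To put this code to work, each node starts a participant that registers a factory producing the model above. A minimal sketch; MyStateModelFactory is a hypothetical app-provided StateModelFactory whose exact signature varies across Helix versions:

import org.apache.helix.HelixManager;
import org.apache.helix.HelixManagerFactory;
import org.apache.helix.InstanceType;

public class ParticipantLauncher {
  public static void main(String[] args) throws Exception {
    HelixManager manager = HelixManagerFactory.getZKHelixManager(
        "MyCluster", "localhost_12000", InstanceType.PARTICIPANT, "localhost:2181");
    // MyStateModelFactory (hypothetical) returns a DistributedDataStoreModel
    // instance for each partition assigned to this node.
    manager.getStateMachineEngine().registerStateModelFactory("MasterSlave", new MyStateModelFactory());
    manager.connect();
    Thread.currentThread().join(); // keep the participant alive
  }
}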
54. Example: Distributed Data Store
Configure and Run

HelixAdmin --zkSvr <zk-address>
Create Cluster:        --addCluster MyCluster
Add Participants:      --addNode MyCluster localhost_12000 ...
Add Resource:          --addResource MyCluster MyDB 16 MasterSlave SEMI_AUTO
Configure Rebalancer:  --rebalance MyCluster MyDB 3
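The same setup can be done programmatically. A minimal sketch using ZKHelixAdmin (the ZooKeeper address is illustrative), mirroring the CLI commands above:

import org.apache.helix.HelixAdmin;
import org.apache.helix.manager.zk.ZKHelixAdmin;
import org.apache.helix.model.InstanceConfig;

public class ClusterSetup {
  public static void main(String[] args) {
    HelixAdmin admin = new ZKHelixAdmin("localhost:2181");
    admin.addCluster("MyCluster");
    admin.addInstance("MyCluster", new InstanceConfig("localhost_12000"));
    // 16 partitions, MasterSlave state model, semi-auto rebalancing
    admin.addResource("MyCluster", "MyDB", 16, "MasterSlave", "SEMI_AUTO");
    // 3 replicas per partition
    admin.rebalance("MyCluster", "MyDB", 3);
  }
}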
55. Example: Distributed Data Store
Spectator Plug-In Code

class RoutingLogic {

  public void write(Request request) {
    partition = getPartition(request.key);
    List<Node> nodes = routingTableProvider.getInstance(partition, "MASTER");
    nodes.get(0).write(request);
  }

  public void read(Request request) {
    partition = getPartition(request.key);
    List<Node> nodes = routingTableProvider.getInstance(partition);
    random(nodes).read(request);
  }
}
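The routingTableProvider used above is kept up to date by connecting as a spectator and listening to external view changes. A minimal sketch (instance name and ZooKeeper address are illustrative):

import org.apache.helix.HelixManager;
import org.apache.helix.HelixManagerFactory;
import org.apache.helix.InstanceType;
import org.apache.helix.spectator.RoutingTableProvider;

public class SpectatorLauncher {
  public static void main(String[] args) throws Exception {
    HelixManager manager = HelixManagerFactory.getZKHelixManager(
        "MyCluster", "router_1", InstanceType.SPECTATOR, "localhost:2181");
    manager.connect();
    // The provider caches the external view and answers partition -> instance lookups.
    RoutingTableProvider routingTableProvider = new RoutingTableProvider();
    manager.addExternalViewChangeListener(routingTableProvider);
  }
}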
56. Example: Distributed Data Store
Where is the Code?
(Diagram: participant plug-in code runs on each participant; spectator plug-in code runs on each spectator; the controller watches config changes and node changes in the consensus system and sends node updates to the participants.)
57. Example: Distributed Search
(Diagram: replicas of index shards P.1-P.6 distributed across Node 1, Node 2, and Node 3.)
Partition Management
• multiple replicas
• rack-aware placement
• even distribution
Fault Tolerance
• fault detection
• auto create replicas
• controlled creation of replicas
Elasticity
• redistribute partitions
• minimize data movement
• throttle movement
58. Example: Distributed Search
State Model Definition: Bootstrap
(State diagram: Idle, Offline, Bootstrap, Online, and Error states. Transitions include setup node, recover, cleanup, consume data to build index, stop consuming data, and stop indexing and serving. Online replicas can serve requests; StateCount constraints, e.g. StateCount=3 and StateCount=5, bound how many replicas may occupy a state.)
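The Bootstrap model is defined with the same builder as Master-Slave. A sketch under the assumption that the transitions follow the diagram above; the state priorities and the exact transition set are inferred here, not taken from the talk:

import org.apache.helix.model.StateModelDefinition;

public class BootstrapStateModel {
  public static StateModelDefinition build() {
    StateModelDefinition.Builder builder = new StateModelDefinition.Builder("Bootstrap");
    builder.addState("ONLINE", 1);
    builder.addState("BOOTSTRAP", 2);
    builder.addState("OFFLINE", 3);
    builder.addState("IDLE", 4);
    builder.addState("ERROR");
    builder.initialState("IDLE");
    builder.addTransition("IDLE", "OFFLINE");      // setup node
    builder.addTransition("OFFLINE", "BOOTSTRAP"); // recover: consume data to build index
    builder.addTransition("BOOTSTRAP", "ONLINE");  // can serve requests
    builder.addTransition("ONLINE", "OFFLINE");    // stop indexing and serving
    builder.addTransition("OFFLINE", "IDLE");      // cleanup
    return builder.build();
  }
}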
65. Plugins
Data-Driven Testing and Debugging
Instrument ZK, controller, and participant logs
Simulate execution with Chaos Monkey
Analyze invariants like state and transition constraints
The exact sequence of events can be replayed: debugging made easy!
67. Plugins
Data-Driven Testing and Debugging: Count Aggregation

Time   State    Slave Count  Participant
42632  OFFLINE  0            10.117.58.247_12918
42796  SLAVE    1            10.117.58.247_12918
43124  OFFLINE  1            10.202.187.155_12918
43131  OFFLINE  1            10.220.225.153_12918
43275  SLAVE    2            10.220.225.153_12918
43323  SLAVE    3            10.202.187.155_12918
85795  MASTER   2            10.220.225.153_12918

Error! The state constraint for SLAVE has an upper bound of 2.
68. Plugins
Data-Driven Testing and Debugging: Time Aggregation

Slave Count   Time       Percentage
0             1082319    0.5
1             35578388   16.46
2             179417802  82.99
3             118863     0.05

Master Count  Time       Percentage
0             1082319    0.5
1             35578388   16.46

83% of the time, there were 2 slaves to a partition.
93% of the time, there was 1 master to a partition.
We can see for exactly how long the cluster was out of whack.
71. Coming Up Next
New APIs
Automatic scaling with YARN
Non-JVM participants
72. Summary
• Helix: a generic framework for building distributed systems
• Abstraction and modularity allow for modifying and enhancing system behavior
• Simple programming model: declarative state machine