October 2014 HUG : Oozie HA
1. Oozie High Availability
Hadoop User Group meetup 10/15/14
Robert Kanter (Cloudera)
Ryota Egashira (Yahoo!)
2. Agenda
● What is High Availability?
● Architectural Overview
● Security
● Authentication
● HCatalog Integration in HA
● Other Challenges in HA
● Future Work
3. What is High Availability?
● A system with no unplanned downtime when partial
failures occur
o Typically achieved by adding redundancy and
removing single points of failure
● Our Goals
o Don’t change the API or usage patterns
o User doesn’t even have to know it’s HA
4. Architectural Overview: Database
● Oozie stores most of its state in a database
o (submitted jobs, workflow definitions, etc.)
● Instead of a failover model, we want to run many Oozie
servers against the same database
o Active-Active HA
o Also provides horizontal scalability
● ZooKeeper for coordination between the Oozie servers
6. Architectural Overview: Access
● Users and client programs need a single address to
connect to
o Web UI, REST/Java API,
JobTracker/ResourceManager callbacks, etc
● Load balancer, Virtual IP, or DNS round-robin can be
used to provide a single entry point to the Oozie servers
o The entry point itself also needs to be HA, or it
becomes a new single point of failure
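As a rough illustration of the single entry point, the sketch below rotates requests across Oozie servers and skips any server marked as down. It is a toy model, not how a real load balancer, virtual IP, or DNS round-robin is configured; the class name and host names are hypothetical.

```python
from itertools import cycle


class RoundRobinEntryPoint:
    """Toy single entry point that rotates requests across Oozie servers."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._ring = cycle(self.servers)

    def mark_down(self, server):
        # Stop routing to a server that failed a health check.
        self.healthy.discard(server)

    def pick(self):
        # Look at each server at most once per call, skipping unhealthy ones.
        for _ in range(len(self.servers)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy Oozie server available")


# Hypothetical host names, for illustration only.
entry = RoundRobinEntryPoint(["oozie-1:11000", "oozie-2:11000"])
first, second = entry.pick(), entry.pick()
entry.mark_down("oozie-1:11000")
after_failure = [entry.pick() for _ in range(3)]
```

Clients keep using one address; after `oozie-1` is marked down, every request goes to `oozie-2`.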
8. Architectural Overview: Log Streaming
● Oozie’s log files are not in the database
o Each Oozie server only has access to its own logs
● Jobs are not assigned to a specific Oozie server
● What if the user asks Oozie Server 1 for logs for a job
processed by Oozie Server 2?
o Oozie Server 1 can ask Oozie Server 2 for its logs
● Caveat: If an Oozie Server goes down, any logs from it
will be unavailable
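The log-streaming idea above can be sketched as follows: the server that receives the request collects matching log lines from every server and merges them by timestamp, and logs held by a server that is down are simply missing. This is a minimal in-memory model under stated assumptions; `OozieServer`, `stream_job_logs`, and the log tuple layout are hypothetical names, not Oozie's actual classes.

```python
class OozieServer:
    """Hypothetical stand-in for one Oozie server and its local log file."""

    def __init__(self, name, log_lines, up=True):
        self.name = name
        self.log_lines = log_lines  # list of (timestamp, job_id, message)
        self.up = up

    def local_logs(self, job_id):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        return [line for line in self.log_lines if line[1] == job_id]


def stream_job_logs(servers, job_id):
    """Collect a job's log lines from all servers, ordered by timestamp."""
    collected, unreachable = [], []
    for server in servers:
        try:
            collected.extend(server.local_logs(job_id))
        except ConnectionError:
            # Caveat from the slide: logs on a downed server are unavailable.
            unreachable.append(server.name)
    return sorted(collected), unreachable


server_1 = OozieServer("oozie-1", [(1, "job-A", "submitted"), (4, "job-B", "submitted")])
server_2 = OozieServer("oozie-2", [(2, "job-A", "action started"), (3, "job-A", "action ended")])

lines, missing = stream_job_logs([server_1, server_2], "job-A")

server_2.up = False
partial_lines, partial_missing = stream_job_logs([server_1, server_2], "job-A")
```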
10. Security
● Existing security features continue to work
● Authentication tokens
o Signed cookies authenticate users to the Oozie server
o Each Oozie server uses its own randomly generated secret
Problem: a server won’t accept cookies signed by other Oozie
servers
● Hadoop-auth in Hadoop 2.6.0 will add support for pluggable secret
providers
o Includes a ZooKeeper-backed implementation that
synchronizes a rolling, randomly generated secret across
multiple servers
No locking required!
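The cookie problem and its fix can be illustrated with a small sketch. It assumes HMAC-SHA256 signatures and a `value&s=signature` cookie layout purely for illustration; it is not the hadoop-auth implementation, and the function names are hypothetical. Accepting both the current and the previous secret is one plausible way a rolling secret stays usable across a rollover.

```python
import hashlib
import hmac
import secrets


def sign(value: str, secret: bytes) -> str:
    """Append an HMAC signature to the cookie value."""
    mac = hmac.new(secret, value.encode(), hashlib.sha256).hexdigest()
    return f"{value}&s={mac}"


def verify(cookie: str, accepted_secrets) -> bool:
    """Return True if the cookie was signed with any accepted secret."""
    value, sep, mac = cookie.rpartition("&s=")
    if not sep:
        return False
    return any(
        hmac.compare_digest(
            mac, hmac.new(s, value.encode(), hashlib.sha256).hexdigest()
        )
        for s in accepted_secrets
    )


# Without a shared secret: each server generates its own.
secret_1 = secrets.token_bytes(32)
secret_2 = secrets.token_bytes(32)
cookie = sign("u=alice", secret_1)
rejected_by_server_2 = not verify(cookie, [secret_2])
tampered_rejected = not verify(cookie.replace("alice", "mallory"), [secret_1])

# With a synchronized rolling secret: all servers hold the same
# current and previous secrets, so cookies survive a rollover.
shared_previous = secrets.token_bytes(32)
shared_current = secrets.token_bytes(32)
old_cookie = sign("u=alice", shared_previous)
accepted_after_roll = verify(old_cookie, [shared_current, shared_previous])
```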
13. Authentication
[Architecture diagram]
● Components: Load Balancer, Oozie Server 1, Oozie Server 2,
Oracle DB, ZooKeeper (via Apache Curator), Hadoop Cluster
● User submits a request
● Load-balanced request redirection
● Inter-server communication for log streaming, sharelib, etc.
● ZooKeeper for locks and management
14. Authentication
[Architecture diagram with security annotations]
● Components: Load Balancer, Oozie Server 1, Oozie Server 2,
Oracle DB, ZooKeeper (via Apache Curator), Hadoop Cluster
● User submits a request
o Security: HTTPS + Kerberos / cookie-based auth
● Load-balanced request redirection
o Security: HTTPS + Kerberos / cookie-based auth
● Inter-server communication for log streaming, sharelib, etc.
o Security: HTTPS + Kerberos
● ZooKeeper for locks and management
o Security: Kerberos
● One further connection is also labeled Security: Kerberos
15. HCatalog Integration in HA
• HCatalog: metadata management service for HDFS datasets
– Oozie receives notifications from HCatalog through JMS
(e.g., ActiveMQ)
– Oozie can start a job immediately after its data becomes ready
[Diagram: Oozie 1, Oozie 2, HCatalog, JMS (e.g., ActiveMQ), Job]
1. Query/Poll Partition
2. Register Topic
3. Push notification <New Partition>
4. Notify New Partition
16. HCatalog Integration in HA
• To support HA
– Keep the in-memory data structure consistent
• It stores the list of jobs waiting for a data partition
– Create and clean up JMS topic listeners
[Diagram: Oozie 1, Oozie 2, HCatalog, JMS (e.g., ActiveMQ), Job]
1. Query/Poll Partition
2. Register Topic
3. Push notification <New Partition>
4. Notify New Partition
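A minimal sketch of the data structure mentioned above: a map from data partition to the jobs waiting on it, with a topic listener registered when the first job waits on a topic and cleaned up when none remain. This models a single server only; how Oozie keeps the structure consistent across servers is not described in these slides, and `PartitionWaitList` and its methods are hypothetical names.

```python
from collections import defaultdict


class PartitionWaitList:
    """Hypothetical in-memory list of jobs waiting for data partitions."""

    def __init__(self):
        self.waiting = defaultdict(set)  # (topic, partition) -> job ids
        self.listening_topics = set()    # topics with an active JMS listener

    def add_waiting_job(self, topic, partition, job_id):
        self.waiting[(topic, partition)].add(job_id)
        # Register a topic listener the first time a job waits on the topic.
        self.listening_topics.add(topic)

    def on_partition_available(self, topic, partition):
        """Handle a 'new partition' notification; return jobs now ready."""
        ready = self.waiting.pop((topic, partition), set())
        # Clean up the listener once no job is waiting on this topic.
        if not any(t == topic for t, _ in self.waiting):
            self.listening_topics.discard(topic)
        return sorted(ready)


waits = PartitionWaitList()
waits.add_waiting_job("db.table", "dt=20141015", "job-1")
waits.add_waiting_job("db.table", "dt=20141015", "job-2")
waits.add_waiting_job("db.table", "dt=20141016", "job-3")

ready = waits.on_partition_available("db.table", "dt=20141015")
still_listening = "db.table" in waits.listening_topics
ready_later = waits.on_partition_available("db.table", "dt=20141016")
```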
17. Other Challenges in HA
• SLA support
– Oozie has an in-memory data structure to track SLA status for
each job (start/duration/end met/miss and notifications)
– Add a check of SLA status against the database
– Use a ZK lock to synchronize updates to the same job from
multiple servers
• Distributed Locks
– Reentrant distributed lock using Apache Curator +
ZooKeeper
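To show what "reentrant" means for a per-job lock, here is a small in-memory sketch: the same owner may acquire a lock it already holds, and the lock is freed only after a matching number of releases. It stands in for a ZooKeeper-backed lock and is not Curator's API (Curator ships reentrant lock recipes such as `InterProcessMutex`; which recipe Oozie uses is not stated here). The class and owner names are hypothetical.

```python
class ReentrantLockTable:
    """In-memory stand-in for reentrant locks keyed by job id."""

    def __init__(self):
        self._holders = {}  # lock name -> [owner, hold count]

    def acquire(self, name, owner):
        """Try to take the lock; return True on success, False if held by another owner."""
        holder = self._holders.get(name)
        if holder is None:
            self._holders[name] = [owner, 1]
            return True
        if holder[0] == owner:
            holder[1] += 1  # reentrant: same owner may lock again
            return True
        return False

    def release(self, name, owner):
        holder = self._holders.get(name)
        if holder is None or holder[0] != owner:
            raise RuntimeError(f"{owner} does not hold lock {name}")
        holder[1] -= 1
        if holder[1] == 0:
            del self._holders[name]  # fully released, others may acquire


locks = ReentrantLockTable()
a1 = locks.acquire("job-1", "oozie-1")
a2 = locks.acquire("job-1", "oozie-1")   # same owner re-enters
b1 = locks.acquire("job-1", "oozie-2")   # other server is refused
locks.release("job-1", "oozie-1")
b2 = locks.acquire("job-1", "oozie-2")   # still held once by oozie-1
locks.release("job-1", "oozie-1")
b3 = locks.acquire("job-1", "oozie-2")   # now free
```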
18. Other Challenges in HA
● Distributed Job ID
o Maintain a distributed sequence number for Job IDs
using Apache Curator + ZooKeeper
● ZooKeeper Failure Handling
o Oozie servers shut down automatically when
ZooKeeper is down
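The distributed sequence number for Job IDs can be sketched as an optimistic compare-and-set loop on a shared counter, so that no two servers hand out the same number. The counter below is an in-memory stand-in for a value kept in ZooKeeper, threads stand in for Oozie servers, and the ID format and names are illustrative assumptions rather than Oozie's actual scheme.

```python
import threading


class SharedCounter:
    """In-memory stand-in for a counter stored in ZooKeeper."""

    def __init__(self):
        self._value = 0
        self._guard = threading.Lock()

    def get(self):
        with self._guard:
            return self._value

    def compare_and_set(self, expected, new):
        """Update only if nobody changed the value since it was read."""
        with self._guard:
            if self._value != expected:
                return False
            self._value = new
            return True


def next_job_id(counter, suffix="example-W"):
    # Retry until this caller wins the race for the current number.
    while True:
        current = counter.get()
        if counter.compare_and_set(current, current + 1):
            return f"{current:07d}-{suffix}"


counter = SharedCounter()
ids = []


def submit(n):
    for _ in range(n):
        ids.append(next_job_id(counter))


# Four threads play the role of four Oozie servers submitting jobs.
threads = [threading.Thread(target=submit, args=(50,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```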
19. Future Work
• Improve stability based on operational experience
– At Yahoo!, HA has been running on non-production grids for
more than a month, with production deployment targeted for Q4
• Faster job fail-over
– Currently, fail-over waits for a background thread (the
Recovery Service) to pick up non-progressing jobs every few minutes
– An Oozie server should notice immediately when another
server goes down and fail over its jobs (e.g., using a ZK
watcher)
• Improve log streaming
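For the fail-over item above, the current polling approach can be pictured as a periodic scan for jobs that have not progressed for some time. This is a rough sketch under stated assumptions: the record fields, status value, and threshold are hypothetical, and the real Recovery Service is not shown in these slides.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    job_id: str
    status: str
    last_modified: float  # seconds since some epoch


def find_stalled_jobs(jobs, now, older_than_seconds):
    """Return ids of running jobs with no progress for longer than the threshold."""
    return [
        job.job_id
        for job in jobs
        if job.status == "RUNNING" and now - job.last_modified > older_than_seconds
    ]


jobs = [
    JobRecord("job-1", "RUNNING", last_modified=100.0),
    JobRecord("job-2", "RUNNING", last_modified=580.0),
    JobRecord("job-3", "SUCCEEDED", last_modified=50.0),
]

# A scan at t=600 with a 5-minute threshold picks up only job-1.
stalled = find_stalled_jobs(jobs, now=600.0, older_than_seconds=300.0)
```

Because the scan runs only every few minutes, a job owned by a failed server can sit idle until the next pass, which is the delay a watcher-based approach would aim to remove.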