9. Why Scheduling?
Multiple “tasks” to schedule
The processes on a single-core OS
The tasks of a Hadoop job
The tasks of multiple Hadoop jobs
Limited resources that these tasks require
Processor(s)
Memory
(Less contentious) disk, network
Scheduling goals
1. Good throughput or response time for tasks (or jobs)
2. High utilization of resources
10. Single Processor Scheduling
[Figure: a single processor with three queued tasks — Task 1 (length 10), Task 2 (length 5), Task 3 (length 3), arriving at times 0, 6, and 8]
Task  Length  Arrival
1     10      0
2     5       6
3     3       8
Which tasks run when?
11. FIFO Scheduling (First-In First-Out)/FCFS
[Timeline: Task 1 runs 0–10, Task 2 runs 10–15, Task 3 runs 15–18]
Task  Length  Arrival
1     10      0
2     5       6
3     3       8
• Maintain tasks in a queue in order of arrival
• When processor free, dequeue head and schedule it
12. FIFO/FCFS Performance
Average completion time may be high
For our example on previous slides,
Average completion time of FIFO/FCFS
= (completion time of Task 1 + Task 2 + Task 3)/3
= (10 + 15 + 18)/3
= 43/3
≈ 14.33
13. STF Scheduling (Shortest Task First)
[Timeline: Task 3 runs 0–3, Task 2 runs 3–8, Task 1 runs 8–18]
Task  Length  Arrival
1     10      0
2     5       0
3     3       0
• Maintain all tasks in a queue, in increasing order of running time
• When processor free, dequeue head and schedule
14. STF Is Optimal!
Average completion time of STF is the shortest among all scheduling approaches!
For our example on previous slides,
Average completion time of STF
= (completion time of Task 1 + Task 2 + Task 3)/3
= (18 + 8 + 3)/3
= 29/3
≈ 9.66
(versus 14.33 for FIFO/FCFS)
In general, STF is a special case of priority scheduling
Instead of using time as priority, scheduler could use user-provided
priority
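To make the FIFO/STF numbers above concrete, here is a small illustrative Java sketch (not from the slides) that replays both policies on the example task sets:

// Minimal sketch: replay FIFO and STF on the example task sets.
public class SchedulingDemo {
    // Runs tasks non-preemptively in the given order and returns the average
    // completion time. lengths[i]/arrivals[i] describe the i-th task in
    // execution order.
    static double averageCompletion(int[] lengths, int[] arrivals) {
        double clock = 0, total = 0;
        for (int i = 0; i < lengths.length; i++) {
            clock = Math.max(clock, arrivals[i]); // wait for the task to arrive
            clock += lengths[i];                  // run it to completion
            total += clock;                       // record its completion time
        }
        return total / lengths.length;
    }

    public static void main(String[] args) {
        // FIFO example: lengths 10, 5, 3 arriving at 0, 6, 8 -> (10+15+18)/3
        System.out.println("FIFO: " + averageCompletion(
                new int[]{10, 5, 3}, new int[]{0, 6, 8}));  // 14.33
        // STF example: same lengths, all available at time 0, shortest first
        System.out.println("STF:  " + averageCompletion(
                new int[]{3, 5, 10}, new int[]{0, 0, 0}));  // (3+8+18)/3 = 9.66
    }
}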
15. Round-Robin Scheduling
Task  Length  Arrival
1     10      0
2     5       6
3     3       8
• Use a quantum (say 1 time unit) to run portion of task at queue head
• Pre-empts processes by saving their state, and resuming later
• After pre-empting, add to end of queue
[Timeline: Task 1 runs alone until time 6; tasks then interleave in 1-unit quanta; marker at time 15 (Task 3 done)]
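A minimal illustrative sketch of this loop (assumed quantum = 1 time unit; exact finish times depend on how ties between new arrivals and the pre-empted task are broken, so this sketch need not reproduce the slide's timeline exactly):

import java.util.ArrayDeque;
import java.util.Deque;

// Round-robin sketch: run the queue head for one 1-unit quantum, then
// pre-empt it and re-add it to the end of the queue.
public class RoundRobinDemo {
    public static void main(String[] args) {
        int[] remaining = {10, 5, 3};   // task lengths
        int[] arrival   = {0, 6, 8};    // arrival times
        boolean[] queued = new boolean[remaining.length];
        Deque<Integer> queue = new ArrayDeque<>();
        int time = 0, done = 0;
        while (done < remaining.length) {
            for (int t = 0; t < arrival.length; t++)   // admit arrived tasks
                if (!queued[t] && arrival[t] <= time) { queue.add(t); queued[t] = true; }
            int t = queue.poll();                      // run head for one quantum
            time++;
            if (--remaining[t] == 0) {
                done++;
                System.out.println("Task " + (t + 1) + " done at time " + time);
            } else {
                for (int u = 0; u < arrival.length; u++)  // assumed: arrivals enqueue
                    if (!queued[u] && arrival[u] <= time) { queue.add(u); queued[u] = true; }
                queue.add(t);                             // ...before the pre-empted task
            }
        }
    }
}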
16. Round-Robin vs. STF/FIFO
Round-Robin preferable for
Interactive applications
User needs quick responses from system
FIFO/STF preferable for Batch applications
User submits jobs, goes away, comes back to get result
17. Summary
Single processor scheduling algorithms
FIFO/FCFS
Shortest task first (optimal!)
Priority
Round-robin
Many other scheduling algorithms out there!
What about cloud scheduling?
Next!
18. Hadoop Scheduling
A Hadoop job consists of Map tasks and Reduce tasks
Only one job in entire cluster => it occupies cluster
Multiple customers with multiple jobs
Users/jobs = “tenants”
Multi-tenant system
=> Need a way to schedule all these jobs (and their
constituent tasks)
=> Need to be fair across the different tenants
Hadoop YARN has two popular schedulers
Hadoop Capacity Scheduler
Hadoop Fair Scheduler
19. Hadoop Capacity Scheduler
Contains multiple queues
Each queue contains multiple jobs
Each queue guaranteed some portion of the cluster capacity
E.g.,
Queue 1 is given 80% of cluster
Queue 2 is given 20% of cluster
Higher-priority jobs go to Queue 1
For jobs within same queue, FIFO typically used
Administrators can configure queues
Source: http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html
20. Elasticity in HCS
Administrators can configure each queue with limits
Soft limit: how much % of cluster is the queue guaranteed to occupy
(Optional) Hard limit: max % of cluster given to the queue
Elasticity
A queue is allowed to occupy more of the cluster if resources are free
But if other queues that were below their capacity limits now fill up, resources must be given back to those queues
Pre-emption not allowed!
Cannot stop a task part-way through
When reducing a queue's % of the cluster, wait until some tasks of that queue have finished (see the sketch below)
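A hypothetical Java sketch of this soft-limit/hard-limit rule (names invented for illustration; the real scheduler is configured through XML queue-capacity properties, not an API like this):

// Sketch: a queue may exceed its soft (guaranteed) share only when spare
// cluster capacity exists, never its hard limit; over-capacity queues are
// not pre-empted -- the scheduler just stops granting them new resources.
public class CapacityCheck {
    static class Queue {
        final double softLimit;  // guaranteed fraction of cluster, e.g. 0.8
        final double hardLimit;  // optional max fraction, e.g. 1.0
        double used;             // fraction currently occupied
        Queue(double soft, double hard) { softLimit = soft; hardLimit = hard; }
    }

    // May queue q be granted another `delta` fraction of the cluster?
    static boolean mayAllocate(Queue q, double delta, double clusterFree) {
        if (q.used + delta > q.hardLimit) return false;  // hard cap
        if (q.used + delta <= q.softLimit) return true;  // within its guarantee
        return clusterFree >= delta;                     // elasticity: spare capacity only
    }

    public static void main(String[] args) {
        Queue q1 = new Queue(0.8, 1.0);  // e.g., Queue 1 guaranteed 80%
        q1.used = 0.85;                  // already stretched past its guarantee
        System.out.println(mayAllocate(q1, 0.05, 0.15)); // true: cluster has slack
        System.out.println(mayAllocate(q1, 0.05, 0.0));  // false: must shrink by attrition
    }
}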
21. Other HCS Features
Queues can be hierarchical
May contain child sub-queues, which may contain child sub-queues, and so
on
Child sub-queues can share resources equally
Scheduling can take memory requirements into account
(memory specified by user)
22. Hadoop Fair Scheduler
Goal: all jobs get equal share of resources
When only one job present, occupies entire cluster
As other jobs arrive, each job given equal % of cluster
E.g., Each job might be given equal number of cluster-wide YARN
containers
Each container == 1 task of job
Source: http://hadoop.apache.org/docs/r1.2.1/fair_scheduler.html
23. Hadoop Fair Scheduler (2)
Divides cluster into pools
Typically one pool per user
Resources divided equally among pools
Gives each user fair share of cluster
Within each pool, can use either
Fair share scheduling, or
FIFO/FCFS
(Configurable)
24. Pre-emption in HFS
Some pools may have minimum shares
Minimum % of cluster that a pool is guaranteed
When a pool's minimum share is not met for a while:
Take resources away from other pools
By pre-empting jobs in those other pools
By killing the currently-running tasks of those jobs
Tasks can be re-started later
OK since tasks are idempotent!
To kill, the scheduler picks the most-recently-started tasks (see the sketch below)
Minimizes wasted work
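A small hypothetical sketch of that kill-selection rule — newest tasks first, until enough containers are freed (illustrative only, not the actual HFS code):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch: pick most-recently-started tasks to kill, minimizing wasted work.
// Killed tasks can simply be re-run later because tasks are idempotent.
public class PreemptionDemo {
    record Task(String id, long startTime, int containers) {}

    static List<Task> pickVictims(List<Task> running, int containersNeeded) {
        running.sort(Comparator.comparingLong(Task::startTime).reversed()); // newest first
        List<Task> victims = new ArrayList<>();
        int freed = 0;
        for (Task t : running) {
            if (freed >= containersNeeded) break;
            victims.add(t);
            freed += t.containers();
        }
        return victims;
    }

    public static void main(String[] args) {
        List<Task> running = new ArrayList<>(List.of(
                new Task("t1", 100, 2), new Task("t2", 300, 1), new Task("t3", 200, 2)));
        System.out.println(pickVictims(running, 3)); // kills t2 (newest), then t3
    }
}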
25. HFS Features
Can also set limits on
Number of concurrent jobs per user
Number of concurrent jobs per pool
Number of concurrent tasks per pool
Prevents cluster from being hogged by one user/job
26. Estimating Task Lengths
HCS/HFS use FIFO
May not be optimal (as we know!)
Why not use shortest-task-first instead? It's optimal (as we know!)
Challenge: Hard to know expected running time of task (before it's completed)
Solution: Estimate length of task
Some approaches (sketched in code below):
Within a job: calculate running time of a task as proportional to the size of its input
Across tasks: calculate running time of a task in a given job as the average of other tasks in that job (weighted by input size)
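Both approaches fit in a few lines; the following is a simplified illustration (invented names, not from any actual scheduler):

// Sketch of the two estimators above.
public class TaskLengthEstimator {
    // Within a job: running time proportional to input size, given an
    // observed processing rate.
    static double estimateFromInput(long inputBytes, double bytesPerSecond) {
        return inputBytes / bytesPerSecond;
    }

    // Across tasks: derive the rate from completed tasks of the same job
    // (weighting by input size falls out of summing bytes and seconds),
    // then apply it to the new task's input size.
    static double estimateFromSiblings(long[] doneBytes, double[] doneSeconds,
                                       long inputBytes) {
        double totalBytes = 0, totalSeconds = 0;
        for (int i = 0; i < doneBytes.length; i++) {
            totalBytes += doneBytes[i];
            totalSeconds += doneSeconds[i];
        }
        return inputBytes / (totalBytes / totalSeconds);
    }

    public static void main(String[] args) {
        // Two finished tasks (128 MB in 40 s, 256 MB in 85 s); new task: 64 MB
        System.out.println(estimateFromSiblings(
                new long[]{128L << 20, 256L << 20},
                new double[]{40.0, 85.0},
                64L << 20) + " seconds (estimated)");
    }
}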
Lots of recent research results in this area!
27. Summary
Hadoop Scheduling in YARN
Hadoop Capacity Scheduler
Hadoop Fair Scheduler
Yet, so far we've talked of only one kind of resource
How about multi-resource requirements?
Next!
28. Challenge
What about scheduling VMs in a cloud (cluster)?
Jobs may have multi-resource requirements
Job 1's tasks: 2 CPUs, 8 GB
Job 2's tasks: 6 CPUs, 2 GB
How do you schedule these jobs in a “fair” manner?
That is, how many tasks of each job do you allow the
system to run concurrently?
What does fairness even mean?
29. Dominant Resource Fairness (DRF)
Proposed by researchers from U. California Berkeley
Proposes notion of fairness across jobs with multi-
resource requirements
They showed that DRF is
Fair for multi-tenant systems
Strategy-proof: tenant can't benefit by lying
Envy-free: tenant can't envy another tenant's allocations
30. Where is DRF Useful?
DRF is
Usable in scheduling VMs in a cluster
Usable in scheduling Hadoop in a cluster
DRF used in Mesos, an OS intended for cloud environments
DRF-like strategies are also used in some cloud computing companies' distributed OSes
32. How DRF Works (2)
Our example
Job 1's tasks: 2 CPUs, 8 GB
=> Job 1's resource vector = <2 CPUs, 8 GB>
Job 2's tasks: 6 CPUs, 2 GB
=> Job 2's resource vector = <6 CPUs, 2 GB>
Consider a cloud with <18 CPUs, 36 GB RAM>
Each Job 1 task consumes % of total CPUs = 2/18 = 1/9
Each Job 1 task consumes % of total RAM = 8/36 = 2/9
1/9 < 2/9
=> Job 1's dominant resource is RAM, i.e., Job 1 is more memory-intensive than it is CPU-intensive
33. How DRF Works (3)
Our example
Job 1's tasks: 2 CPUs, 8 GB
=> Job 1's resource vector = <2 CPUs, 8 GB>
Job 2's tasks: 6 CPUs, 2 GB
=> Job 2's resource vector = <6 CPUs, 2 GB>
Consider a cloud with <18 CPUs, 36 GB RAM>
Each Job 2 task consumes % of total CPUs = 6/18 = 1/3
Each Job 2 task consumes % of total RAM = 2/36 = 1/18
1/3 > 1/18
=> Job 2's dominant resource is CPU, i.e., Job 2 is more CPU-intensive than it is memory-intensive
34. DRF Fairness
DRF equalizes, across jobs, the cluster-wide % each job receives of its own dominant resource type
Job 1's % of RAM = Job 2's % of CPU
Can be written as linear equations, and solved
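For our example they can be written out explicitly (a sketch; x = number of Job 1 tasks, y = number of Job 2 tasks, integrality ignored):

\[
\max\; x \quad \text{s.t.} \quad 2x + 6y \le 18 \;(\text{CPUs}), \qquad 8x + 2y \le 36 \;(\text{GB RAM}), \qquad \frac{8x}{36} = \frac{6y}{18}
\]

The equality gives y = 2x/3; substituting into the CPU constraint gives 6x ≤ 18, so x = 3 and y = 2 — the solution on the next slide, with each job at 2/3 of its dominant resource.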
35. DRF Solution, For our Example
DRF Ensures
Job 1's % of RAM = Job 2's % of CPU
Solution for our example:
Job 1 gets 3 tasks, each with <2 CPUs, 8 GB>
Job 2 gets 2 tasks, each with <6 CPUs, 2 GB>
• Job 1's % of RAM
= Number of tasks * RAM per task / Total cluster RAM
= 3*8/36 = 2/3
• Job 2's % of CPU
= Number of tasks * CPUs per task / Total cluster CPUs
= 2*6/18 = 2/3
36. Other DRF Details
DRF generalizes to multiple jobs
DRF also generalizes to more than 2 resource types
CPU, RAM, Network, Disk, etc.
DRF ensures that each job gets a fair share of that
type of resource which the job desires the most
Hence fairness
37. Summary: Scheduling
Scheduling is a very important problem in cloud computing
Limited resources, lots of jobs requiring access to these resources
Single-processor scheduling
FIFO/FCFS, STF, Priority, Round-Robin
Hadoop scheduling
Capacity scheduler, Fair scheduler
Dominant-Resources Fairness
40. History of the World, Part 1
Relational Databases – mainstay of business
Web-based applications caused spikes
Especially true for public-facing e-Commerce sites
Developers begin to front RDBMS with memcache or integrate other caching mechanisms within the application (e.g., Ehcache)
41. Scaling Up
Issues with scaling up when the dataset is just too big
RDBMS were not designed to be distributed
Began to look at multi-node database solutions
Known as 'scaling out' or 'horizontal scaling'
Different approaches include:
Master-slave
Sharding
42. Scaling RDBMS – Master/Slave
Master-Slave
All writes are written to the master. All
reads performed against the replicated
slave databases
Critical reads may be incorrect as writes
may not have been propagated down
Large data sets can pose problems as
master needs to duplicate data to slaves
43. Scaling RDBMS – Sharding
Partition or sharding
Scales well for both reads and writes
Not transparent, application needs to be
partition-aware
Can no longer have relationships/joins
across partitions
Loss of referential integrity across shards
44. Other ways to scale RDBMS
Multi-Master replication
INSERT only, not UPDATES/DELETES
No JOINs, thereby reducing query time
This involves de-normalizing data
In-memory databases
45. What is NoSQL?
Stands for Not Only SQL
Class of non-relational data storage
systems
Usually do not require a fixed table
schema nor do they use the concept of
joins
All NoSQL offerings relax one or more of
the ACID properties (will talk about the
CAP theorem)
46. Why NoSQL?
For data storage, an RDBMS cannot be the
be-all/end-all
Just as there are different programming
languages, need to have other data
storage tools in the toolbox
A NoSQL solution is more acceptable to a
client now than even a year ago
Think about proposing a Ruby/Rails or
Groovy/Grails solution now versus a couple
of years ago
47. How did we get here?
Explosion of social media sites (Facebook, Twitter)
with large data needs
Rise of cloud-based solutions such as Amazon S3 (Simple Storage Service)
Just as moving to dynamically-typed languages
(Ruby/Groovy), a shift to dynamically-typed data
with frequent schema changes
Open-source community
48. Dynamo and BigTable
Three major papers were the seeds of the
NoSQL movement
BigTable (Google)
Dynamo (Amazon)
Gossip protocol (discovery and error
detection)
Distributed key-value data store
Eventual consistency
CAP Theorem (discuss in a sec ..)
49. The Perfect Storm
Large datasets, acceptance of alternatives, and dynamically-typed data have come together in a perfect storm
Not a backlash/rebellion against RDBMS
SQL is a rich query language that cannot be
rivaled by the current list of NoSQL offerings
50. CAP Theorem
Three properties of a system: consistency, availability, and partition tolerance
You can have at most two of these three
properties for any shared-data system
To scale out, you have to partition. That
leaves either consistency or availability to
choose from
In almost all cases, you would choose
availability over consistency
52. CAP Theorem
[Venn diagram: Consistency, Availability, Partition tolerance — Consistency: once a writer has written, all readers will see that write]
53. Consistency
Two kinds of consistency:
strong consistency – ACID (Atomicity, Consistency, Isolation, Durability)
weak consistency – BASE (Basically Available, Soft-state, Eventual consistency)
54. ACID Transactions
A DBMS is expected to support “ACID transactions,”
processes that are:
Atomic : Either the whole process is done or none is.
Consistent : Database constraints are preserved.
Isolated : It appears to the user as if only one process executes at
a time.
Durable : Effects of a process do not get lost if the system
crashes.
55. Atomicity
A real-world event either happens or does
not happen
Student either registers or does not register
Similarly, the system must ensure that either
the corresponding transaction runs to
completion or, if not, it has no effect at all
Not true of ordinary programs. A crash could
leave files partially updated on recovery
56. Commit and Abort
If the transaction successfully completes it
is said to commit
The system is responsible for ensuring that all
changes to the database have been saved
If the transaction does not successfully
complete, it is said to abort
The system is responsible for undoing, or rolling
back, all changes the transaction has made
57. Database Consistency
Enterprise (Business) Rules limit the
occurrence of certain real-world events
Student cannot register for a course if the current
number of registrants equals the maximum allowed
Correspondingly, allowable database states
are restricted
cur_reg <= max_reg
These limitations are called (static) integrity
constraints: assertions that must be satisfied
by all database states (state invariants).
58. Database Consistency (state invariants)
Other static consistency requirements are
related to the fact that the database might
store the same information in different ways
cur_reg = |list_of_registered_students|
Such limitations are also expressed as integrity
constraints
Database is consistent if all static integrity
constraints are satisfied
59. Transaction Consistency
A consistent database state does not necessarily
model the actual state of the enterprise
A deposit transaction that increments the balance by the wrong amount maintains the integrity constraint balance ≥ 0, but does not maintain the relation between the enterprise and database states
A consistent transaction maintains database
consistency and the correspondence between the
database state and the enterprise state (implements
its specification)
Specification of deposit transaction includes
balance′ = balance + amt_deposit
(balance′ is the next value of balance)
60. Dynamic Integrity Constraints (transition invariants)
Some constraints restrict allowable state
transitions
A transaction might transform the database
from one consistent state to another, but the
transition might not be permissible
Example: A letter grade in a course (A, B, C, D,
F) cannot be changed to an incomplete (I)
Dynamic constraints cannot be checked
by examining the database state
61. Transaction Consistency
Consistent transaction: if DB is in consistent
state initially, when the transaction completes:
All static integrity constraints are satisfied (but
constraints might be violated in intermediate states)
Can be checked by examining snapshot of database
New state satisfies specifications of transaction
Cannot be checked from database snapshot
No dynamic constraints have been violated
Cannot be checked from database snapshot
62. Isolation
Serial Execution: transactions execute in sequence
Each one starts after the previous one completes.
Execution of one transaction is not affected by the
operations of another since they do not overlap in time
The execution of each transaction is isolated from
all others.
If the initial database state and all transactions are
consistent, then the final database state will be
consistent and will accurately reflect the real-world
state, but
Serial execution is inadequate from a performance
perspective
63. Isolation
Concurrent execution offers performance benefits:
A computer system has multiple resources capable of
executing independently (e.g., cpu’s, I/O devices), but
A transaction typically uses only one resource at a time
Hence, only concurrently executing transactions can
make effective use of the system
Concurrently executing transactions yield interleaved
schedules
64. Concurrent Execution
[Figure: transactions T1 (op1,1 op1,2) and T2 (op2,1 op2,2) each perform local computation with local variables and output a sequence of DB operations, bracketed by begin trans ... commit; the DBMS receives the interleaved sequence op1,1 op2,1 op2,2 op1,2]
65. Durability
The system must ensure that once a transaction
commits, its effect on the database state is not
lost in spite of subsequent failures
Not true of ordinary programs. A media failure after a
program successfully terminates could cause the file
system to be restored to a state that preceded the
program’s execution
66. Implementing Durability
Database stored redundantly on mass storage
devices to protect against media failure
Architecture of mass storage devices affects
type of media failures that can be tolerated
Related to Availability: extent to which a
(possibly distributed) system can provide
service despite failure
Non-stop DBMS (mirrored disks)
Recovery based DBMS (log)
67. Consistency Model
A consistency model determines rules for visibility and
apparent order of updates.
For example:
Row X is replicated on nodes M and N
Client A writes row X to node N
Some period of time t elapses.
Client B reads row X from node M
Does client B see the write from client A?
Consistency is a continuum with tradeoffs
For NoSQL, the answer would be: maybe
CAP Theorem states: Strict Consistency can't be achieved at
the same time as availability and partition-tolerance.
68. Eventual Consistency
When no updates occur for a long
period of time, eventually all updates
will propagate through the system and
all the nodes will be consistent
For a given accepted update and a
given node, eventually either the
update reaches the node or the node
is removed from service
Known as BASE (Basically Available,
Soft state, Eventual consistency), as
opposed to ACID
69. The CAP Theorem
[Venn diagram: Consistency, Availability, Partition tolerance — Availability: the system is available during software and hardware upgrades and node failures]
70. Availability
Traditionally, thought of as the server/process being available five 9's (99.999%).
However, for a large node system, at almost any point in time there's a good chance that a node is either down or there is a network disruption among the nodes.
Want a system that is resilient in the face of network
disruption
71. The CAP Theorem
[Venn diagram: Consistency, Availability, Partition tolerance — Partition tolerance: the system can continue to operate in the presence of network partitions]
72. The CAP Theorem
Theorem: You can have at most two of these properties for any shared-data system
[Venn diagram: Consistency, Availability, Partition tolerance]
73. What kinds of NoSQL
NoSQL solutions fall into two major areas:
Key/Value or 'the big hash table'.
Amazon S3 (Dynamo)
Voldemort
Scalaris
Memcached (in-memory key/value store)
Redis
Schema-less which comes in multiple flavors, column-based,
document-based or graph-based.
Cassandra (column-based)
CouchDB (document-based)
MongoDB (document-based)
Neo4J (graph-based)
HBase (column-based)
74. Key/Value
Pros:
very fast
very scalable
simple model
able to distribute horizontally
Cons:
- many data structures (objects) can't be
easily modeled as key value pairs
75. Schema-Less
Pros:
- Schema-less data model is richer than key/value
pairs
- eventual consistency
- many are distributed
- still provide excellent performance and scalability
Cons:
- typically no ACID transactions or joins
76. Common Advantages
Cheap, easy to implement (open source)
Data are replicated to multiple nodes (therefore identical and fault-tolerant)
and can be partitioned
Down nodes easily replaced
No single point of failure
Easy to distribute
Don't require a schema
Can scale up and down
Relax the data consistency requirement (CAP)
77. What am I giving up?
joins
group by
order by
ACID transactions
SQL as a sometimes frustrating but still
powerful query language
easy integration with other applications
that support SQL
79. Types of DBMS
Hierarchical database
The model resembles a tree structure, similar to a folder architecture in your computer system. The relationships between records are pre-defined in a one-to-one manner, between 'parent and child' nodes. They require the user to traverse the hierarchy in order to access needed data. Due to these limitations, such databases may be confined to specific uses.
Network database
These models also have a hierarchical structure. However, instead of using a single-parent tree hierarchy, this model supports many-to-many relationships, as child tables can have more than one parent.
NoSQL or non-relational databases
A popular alternative to relational databases, NoSQL databases take a variety of forms and allow you to store and manipulate large amounts of unstructured and semi-structured data. Examples include key-value stores, document stores and graph databases.
Flat file database
A flat file database stores data in a plain text file, with each line of text typically holding one record. Delimiters such as commas or tabs separate fields. A flat file database uses a simple structure and, unlike a relational database, cannot contain multiple tables and relations.
Object-oriented database systems
In object-oriented databases, the information is represented as objects, with different types of relationships possible between two or more objects. Such databases use an object-oriented programming language for development.
85. What is NoSQL?
Stands for Not Only SQL
An RDBMS's search mechanism is tuple-wise
When to use SQL: schema, consistency, and transactions
When to use NoSQL: speed, scalability, flexibility
Types of NoSQL: column-oriented, document, key-value store, graph-oriented
Class of non-relational data storage systems
Usually do not require a fixed table schema nor do they use the concept of joins
All NoSQL offerings relax one or more of the ACID properties (will talk about the CAP theorem)
Example: a user who uses both Samsung and iPhone (if you want individual-specific data then SQL is preferred, but if you want data in bulk then NoSQL is preferred)
HBASE: column-oriented
94. HBASE
As Hadoop can also process the dataset, why do we require HBASE?
Hadoop uses batch processing and sequential data access. To search for a small amount of data with specific information, we cannot go through one trillion tuples at once.
It is a NoSQL column-oriented / column-family-oriented store
HBASE instead offers random data access. There is no need to search the dataset with the batch processing that Hadoop does.
HA, replication, fault tolerance: as it is installed on Hadoop, it provides all Hadoop features
When to use HBASE?
When the database is small (MBs or GBs), use an RDBMS with SQL. But if the database is in TBs/petabytes, use HBASE.
When you do not require transactions, a rigid schema, big queries, or complex joins, but need speed, scalability, and flexibility
Who is using HBASE?
Pinterest, Facebook, Adobe, Yahoo
95. HBASE
A distributed data store that can scale horizontally to
1,000s of commodity servers and petabytes of indexed
storage.
Designed to operate on top of the Hadoop distributed file
system (HDFS) or Kosmos File System (KFS, aka Cloudstore)
for scalability, fault tolerance, and high availability.
Distributed storage
Table-like in data structure
multi-dimensional map
High scalability
High availability
High performance
96. HBASE
Started by Chad Walters and Jim
2006.11  Google releases paper on BigTable
2007.2   Initial HBase prototype created as Hadoop contrib
2007.10  First usable HBase
2008.1   Hadoop becomes an Apache top-level project and HBase becomes a subproject
2008.10~ HBase 0.18, 0.19 released
97. HBASE is not a…
Tables have one primary index, the row key.
No join operators.
Limited atomicity and transaction support.
HBase supports multiple batched mutations of single rows only.
Data is unstructured and untyped.
Not accessed or manipulated via SQL.
Programmatic access via Java, REST, or Thrift APIs.
Scripting via JRuby.
Scans and queries can select a subset of available columns, perhaps by using a
wildcard.
There are three types of lookups:
Fast lookup using row key and optional timestamp.
Full table scan
Range scan from region start to end.
98. HBASE Advantages
No real indexes
Automatic partitioning
Scale linearly and automatically with new
nodes
Commodity hardware
Fault tolerance
Batch processing
99. HBASE Data model
Tables are sorted by row
Table schema only defines its column families
Each family consists of any number of columns
Each column consists of any number of versions
Columns only exist when inserted; NULLs are free
Columns within a family are sorted and stored together
Everything except table names is byte[]
(Row, Family:Column, Timestamp) → Value
[Figure: a cell is addressed by row key, column family, and timestamp, mapping to a value]
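As a concrete illustration of this map, a minimal Java client sketch (note: this uses the Put/Get client API introduced after the 0.18/0.19 releases mentioned earlier; table 'test' and column family 'data' follow the shell example later in the deck):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// Writing and reading one cell of the (Row, Family:Column, Timestamp) -> Value map.
public class DataModelDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "test");

        Put put = new Put(Bytes.toBytes("row1"));            // row key
        put.add(Bytes.toBytes("data"), Bytes.toBytes("1"),   // family, qualifier
                Bytes.toBytes("value1"));                    // value (timestamp implicit)
        table.put(put);

        Get get = new Get(Bytes.toBytes("row1"));
        Result result = table.get(get);
        byte[] value = result.getValue(Bytes.toBytes("data"), Bytes.toBytes("1"));
        System.out.println(Bytes.toString(value));           // prints "value1"
        table.close();
    }
}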
100. Members
Master
Responsible for monitoring region servers
Load balancing for regions
Redirect client to correct region servers
The current SPOF
Region server slaves
Serving client requests (Write/Read/Scan)
Send HeartBeat to Master
Throughput and Region numbers are scalable by region
servers
102. HBASE Architecture
Region default size is 256 MB; once a region is full, a new region is created. Why not have one region store all data? It would degrade performance.
Each region has write memory and read memory.
103. HBASE Architecture
A Region Server handles multiple regions. Each region has column families, and each region can serve a different table, like employee, students, products.
Region default size is 256 MB; once a region is full, a new region is created. Why not have one region store all data? It would degrade performance.
Data is written to the Memstore / Write-Ahead Log. The WAL is a file every region server maintains, for recovery purposes in case of data loss.
The Memstore is the write buffer. Its default size is 100 MB. Once it is full, it flushes the data, segmented into very small HFiles (KBs) stored on disk.
All these HFiles zipped together by the admin is called major compaction. It is generally done in non-peak hours.
A few files zipped together by the admin is called minor compaction.
Each region has write memory (Memstore) and read memory (block cache).
104. HBASE Architecture
HMaster Functions
Create, delete, update operations
Region assignment to region servers
Reassigning regions after load balancing
Manages region server failure (when a region server fails, recovery is also done by the HMaster)
105. HBASE
ZooKeeper Functions
The active/inactive HMaster and region servers ping / send heartbeat signals to ZooKeeper.
If the active HMaster crashes, it stops sending the heartbeat signal, and ZooKeeper then activates the inactive HMaster.
Root table and meta tables are handled by ZooKeeper.
Complete cluster management is under ZooKeeper.
There is only one root table, and there can be more meta tables (which data is where, in which region; Memstore, block cache).
112. HBASE
Testing (4)
$ hbase shell
> create 'test', 'data'
0 row(s) in 4.3066 seconds
> list
test
1 row(s) in 0.1485 seconds
> put 'test', 'row1', 'data:1', 'value1'
0 row(s) in 0.0454 seconds
> put 'test', 'row2', 'data:2', 'value2'
0 row(s) in 0.0035 seconds
> put 'test', 'row3', 'data:3', 'value3'
0 row(s) in 0.0090 seconds
> scan 'test'
ROW COLUMN+CELL
row1 column=data:1, timestamp=1240148026198, value=value1
row2 column=data:2, timestamp=1240148040035, value=value2
row3 column=data:3, timestamp=1240148047497, value=value3
3 row(s) in 0.0825 seconds
> disable 'test'
09/04/19 06:40:13 INFO client.HBaseAdmin: Disabled test
0 row(s) in 6.0426 seconds
> drop 'test'
09/04/19 06:40:17 INFO client.HBaseAdmin: Deleted test
0 row(s) in 0.0210 seconds
> list
0 row(s) in 2.0645 seconds
113. HBASE
Connecting to HBase
Java client
get(byte [] row, byte [] column, long timestamp, int versions);
Non-Java clients
Thrift server hosting HBase client instance
Sample ruby, c++, & java (via thrift) clients
REST server hosts HBase client
TableInput/OutputFormat for MapReduce
HBase as MR source or sink
HBase Shell
JRuby IRB with “DSL” to add get, scan, and admin
./bin/hbase shell YOUR_SCRIPT
114. HBASE
Thrift
A software framework for scalable cross-language services development.
By Facebook
Works seamlessly between C++, Java, Python, PHP, and Ruby.
Starting it will start the server instance, by default on port 9090
The other similar project is "rest"
$ hbase-daemon.sh start thrift
$ hbase-daemon.sh stop thrift
115. Hive
What is Hive?
It is a data warehouse package built on top of Hadoop, used for data visualization and analysis.
Users with a SQL background use Hive
No need for Java familiarity
History?
Facebook daily generates 78 TB of data, 150K (1.5 lakh) queries, and 300M images
Facebook was using a backup strategy, and import was done using a scheduled job (cron job)
ETL (Extract, Transform and Load) using Python
Oracle DBMS and MS SQL Server were being used, which caused a lot of problems
Facebook had SQL programmers, so they developed Hive, compatible with SQL, with a language called HQL
Features
Tables can be created
JDBC/ODBC drivers are available
Data is only stored on Hadoop
Uses Hadoop for fault tolerance, as Hadoop provides fault tolerance for all of Pig, Hive, HBase
Features
Data mining
Document indexing (Facebook image indexing)
Video indexing
Predictive modeling
116. Hive
Need for High-Level Languages
Hadoop is great for large-data processing!
But writing Java programs for everything is verbose and slow
Not everyone wants to (or can) write Java code
Solution: develop higher-level data processing languages
Hive: HQL is like SQL
Pig: Pig Latin is a bit like Perl
117. Hive
Hive: data warehousing application in Hadoop
Query language is HQL, variant of SQL
Tables stored on HDFS as flat files
Developed by Facebook, now open source
Pig: large-scale data processing system
Scripts are written in Pig Latin, a dataflow language
Developed by Yahoo!, now open source
Roughly 1/3 of all Yahoo! internal jobs
Common idea:
Provide higher-level language to facilitate large-data
processing
Higher-level language “compiles down” to Hadoop jobs
118. Hive
Hive: Example
Hive looks similar to an SQL database
Relational join on two tables:
Table of word counts from Shakespeare collection
Table of word counts from the bible
Source: Material drawn from Cloudera training VM
SELECT s.word, s.freq, k.freq FROM shakespeare s
JOIN bible k ON (s.word = k.word) WHERE s.freq >= 1 AND k.freq >= 1
ORDER BY s.freq DESC LIMIT 10;
the 25848 62394
I 23031 8854
and 19671 38985
to 18038 13526
of 16700 34654
a 14170 8057
you 12702 2720
my 11297 4135
in 10797 12445
is 8882 6884
119. Hive
Hive: Behind the Scenes
SELECT s.word, s.freq, k.freq FROM shakespeare s
JOIN bible k ON (s.word = k.word) WHERE s.freq >= 1 AND k.freq >= 1
ORDER BY s.freq DESC LIMIT 10;
(TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF shakespeare s) (TOK_TABREF bible k) (= (. (TOK_TABLE_OR_COL s) word) (.
(TOK_TABLE_OR_COL k) word)))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (.
(TOK_TABLE_OR_COL s) word)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL s) freq)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL k)
freq))) (TOK_WHERE (AND (>= (. (TOK_TABLE_OR_COL s) freq) 1) (>= (. (TOK_TABLE_OR_COL k) freq) 1))) (TOK_ORDERBY
(TOK_TABSORTCOLNAMEDESC (. (TOK_TABLE_OR_COL s) freq))) (TOK_LIMIT 10)))
(one or more of MapReduce jobs)
(Abstract Syntax Tree)
120. Hive: Behind the Scenes
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-2 depends on stages: Stage-1
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
s
TableScan
alias: s
Filter Operator
predicate:
expr: (freq >= 1)
type: boolean
Reduce Output Operator
key expressions:
expr: word
type: string
sort order: +
Map-reduce partition columns:
expr: word
type: string
tag: 0
value expressions:
expr: freq
type: int
expr: word
type: string
k
TableScan
alias: k
Filter Operator
predicate:
expr: (freq >= 1)
type: boolean
Reduce Output Operator
key expressions:
expr: word
type: string
sort order: +
Map-reduce partition columns:
expr: word
type: string
tag: 1
value expressions:
expr: freq
type: int
Reduce Operator Tree:
Join Operator
condition map:
Inner Join 0 to 1
condition expressions:
0 {VALUE._col0} {VALUE._col1}
1 {VALUE._col0}
outputColumnNames: _col0, _col1, _col2
Filter Operator
predicate:
expr: ((_col0 >= 1) and (_col2 >= 1))
type: boolean
Select Operator
expressions:
expr: _col1
type: string
expr: _col0
type: int
expr: _col2
type: int
outputColumnNames: _col0, _col1, _col2
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Stage: Stage-2
Map Reduce
Alias -> Map Operator Tree:
hdfs://localhost:8022/tmp/hive-training/364214370/10002
Reduce Output Operator
key expressions:
expr: _col1
type: int
sort order: -
tag: -1
value expressions:
expr: _col0
type: string
expr: _col1
type: int
expr: _col2
type: int
Reduce Operator Tree:
Extract
Limit
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
limit: 10
121. Hive
Example Data Analysis Task
Find users who tend to visit "good" pages.

Visits:
user  url                time
Amy   www.cnn.com        8:00
Amy   www.crap.com       8:05
Amy   www.myblog.com     10:00
Amy   www.flickr.com     10:05
Fred  cnn.com/index.htm  12:00
...

Pages:
url             pagerank
www.cnn.com     0.9
www.flickr.com  0.9
www.myblog.com  0.7
www.crap.com    0.2
...

Pig Slides adapted from Olston et al.
122. Hive
System-Level Dataflow
[Figure: load Visits and load Pages in parallel; canonicalize Visits URLs; join by url; group by user; compute average pagerank; filter; the answer]
Pig Slides adapted from Olston et al.
123. Hive: MapReduce Code
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class MRExample {
    public static class LoadPages extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable k, Text val,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // Pull the key out
            String line = val.toString();
            int firstComma = line.indexOf(',');
            String key = line.substring(0, firstComma);
            String value = line.substring(firstComma + 1);
            Text outKey = new Text(key);
            // Prepend an index to the value so we know which file
            // it came from.
            Text outVal = new Text("1" + value);
            oc.collect(outKey, outVal);
        }
    }
    public static class LoadAndFilterUsers extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable k, Text val,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // Pull the key out
            String line = val.toString();
            int firstComma = line.indexOf(',');
            String value = line.substring(firstComma + 1);
            int age = Integer.parseInt(value);
            if (age < 18 || age > 25) return;
            String key = line.substring(0, firstComma);
            Text outKey = new Text(key);
            // Prepend an index to the value so we know which file
            // it came from.
            Text outVal = new Text("2" + value);
            oc.collect(outKey, outVal);
        }
    }
    public static class Join extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text key,
                Iterator<Text> iter,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // For each value, figure out which file it's from and
            // store it accordingly.
            List<String> first = new ArrayList<String>();
            List<String> second = new ArrayList<String>();
            while (iter.hasNext()) {
                Text t = iter.next();
                String value = t.toString();
                if (value.charAt(0) == '1')
                    first.add(value.substring(1));
                else second.add(value.substring(1));
                reporter.setStatus("OK");
            }
            // Do the cross product and collect the values
            for (String s1 : first) {
                for (String s2 : second) {
                    String outval = key + "," + s1 + "," + s2;
                    oc.collect(null, new Text(outval));
                    reporter.setStatus("OK");
                }
            }
        }
    }
    public static class LoadJoined extends MapReduceBase
        implements Mapper<Text, Text, Text, LongWritable> {
        public void map(
                Text k,
                Text val,
                OutputCollector<Text, LongWritable> oc,
                Reporter reporter) throws IOException {
            // Find the url
            String line = val.toString();
            int firstComma = line.indexOf(',');
            int secondComma = line.indexOf(',', firstComma);
            String key = line.substring(firstComma, secondComma);
            // drop the rest of the record, I don't need it anymore,
            // just pass a 1 for the combiner/reducer to sum instead.
            Text outKey = new Text(key);
            oc.collect(outKey, new LongWritable(1L));
        }
    }
    public static class ReduceUrls extends MapReduceBase
        implements Reducer<Text, LongWritable, WritableComparable, Writable> {
        public void reduce(
                Text key,
                Iterator<LongWritable> iter,
                OutputCollector<WritableComparable, Writable> oc,
                Reporter reporter) throws IOException {
            // Add up all the values we see
            long sum = 0;
            while (iter.hasNext()) {
                sum += iter.next().get();
                reporter.setStatus("OK");
            }
            oc.collect(key, new LongWritable(sum));
        }
    }
    public static class LoadClicks extends MapReduceBase
        implements Mapper<WritableComparable, Writable, LongWritable, Text> {
        public void map(
                WritableComparable key,
                Writable val,
                OutputCollector<LongWritable, Text> oc,
                Reporter reporter) throws IOException {
            oc.collect((LongWritable)val, (Text)key);
        }
    }
    public static class LimitClicks extends MapReduceBase
        implements Reducer<LongWritable, Text, LongWritable, Text> {
        int count = 0;
        public void reduce(
                LongWritable key,
                Iterator<Text> iter,
                OutputCollector<LongWritable, Text> oc,
                Reporter reporter) throws IOException {
            // Only output the first 100 records
            while (count < 100 && iter.hasNext()) {
                oc.collect(key, iter.next());
                count++;
            }
        }
    }
    public static void main(String[] args) throws IOException {
        JobConf lp = new JobConf(MRExample.class);
        lp.setJobName("Load Pages");
        lp.setInputFormat(TextInputFormat.class);
        lp.setOutputKeyClass(Text.class);
        lp.setOutputValueClass(Text.class);
        lp.setMapperClass(LoadPages.class);
        FileInputFormat.addInputPath(lp, new Path("/user/gates/pages"));
        FileOutputFormat.setOutputPath(lp,
            new Path("/user/gates/tmp/indexed_pages"));
        lp.setNumReduceTasks(0);
        Job loadPages = new Job(lp);
        JobConf lfu = new JobConf(MRExample.class);
        lfu.setJobName("Load and Filter Users");
        lfu.setInputFormat(TextInputFormat.class);
        lfu.setOutputKeyClass(Text.class);
        lfu.setOutputValueClass(Text.class);
        lfu.setMapperClass(LoadAndFilterUsers.class);
        FileInputFormat.addInputPath(lfu, new Path("/user/gates/users"));
        FileOutputFormat.setOutputPath(lfu,
            new Path("/user/gates/tmp/filtered_users"));
        lfu.setNumReduceTasks(0);
        Job loadUsers = new Job(lfu);
        JobConf join = new JobConf(MRExample.class);
        join.setJobName("Join Users and Pages");
        join.setInputFormat(KeyValueTextInputFormat.class);
        join.setOutputKeyClass(Text.class);
        join.setOutputValueClass(Text.class);
        join.setMapperClass(IdentityMapper.class);
        join.setReducerClass(Join.class);
        FileInputFormat.addInputPath(join, new Path("/user/gates/tmp/indexed_pages"));
        FileInputFormat.addInputPath(join, new Path("/user/gates/tmp/filtered_users"));
        FileOutputFormat.setOutputPath(join, new Path("/user/gates/tmp/joined"));
        join.setNumReduceTasks(50);
        Job joinJob = new Job(join);
        joinJob.addDependingJob(loadPages);
        joinJob.addDependingJob(loadUsers);
        JobConf group = new JobConf(MRExample.class);
        group.setJobName("Group URLs");
        group.setInputFormat(KeyValueTextInputFormat.class);
        group.setOutputKeyClass(Text.class);
        group.setOutputValueClass(LongWritable.class);
        group.setOutputFormat(SequenceFileOutputFormat.class);
        group.setMapperClass(LoadJoined.class);
        group.setCombinerClass(ReduceUrls.class);
        group.setReducerClass(ReduceUrls.class);
        FileInputFormat.addInputPath(group, new Path("/user/gates/tmp/joined"));
        FileOutputFormat.setOutputPath(group, new Path("/user/gates/tmp/grouped"));
        group.setNumReduceTasks(50);
        Job groupJob = new Job(group);
        groupJob.addDependingJob(joinJob);
        JobConf top100 = new JobConf(MRExample.class);
        top100.setJobName("Top 100 sites");
        top100.setInputFormat(SequenceFileInputFormat.class);
        top100.setOutputKeyClass(LongWritable.class);
        top100.setOutputValueClass(Text.class);
        top100.setOutputFormat(SequenceFileOutputFormat.class);
        top100.setMapperClass(LoadClicks.class);
        top100.setCombinerClass(LimitClicks.class);
        top100.setReducerClass(LimitClicks.class);
        FileInputFormat.addInputPath(top100, new Path("/user/gates/tmp/grouped"));
        FileOutputFormat.setOutputPath(top100,
            new Path("/user/gates/top100sitesforusers18to25"));
        top100.setNumReduceTasks(1);
        Job limit = new Job(top100);
        limit.addDependingJob(groupJob);
        JobControl jc = new JobControl("Find top 100 sites for users 18 to 25");
        jc.addJob(loadPages);
        jc.addJob(loadUsers);
        jc.addJob(joinJob);
        jc.addJob(groupJob);
        jc.addJob(limit);
        jc.run();
    }
}
124. Hive
Data Flows
Moving HBase data
[Figure: the production HBase cluster is read in parallel by a CopyTable MR job and imported in parallel into the MR-side HBase]
* HBase replication currently only works for a single slave cluster; in our case HBase replicates to a backup cluster.
125. Hive Architecture
The command line interface, Hive web interface, and Thrift server are used to access Hive and fire queries
If you want to access Hive from another machine, you can do so through the Thrift server using C, C++, or Java; it is a cross-language interface.
Metadata of tables and Hive metadata are stored in the metastore
Metastore types: embedded metastore (Driver – Metastore – Derby), local metastore (Driver – MySQL), remote metastore (Driver – MySQL)
It is a data warehouse package built on top of Hadoop, used for data visualization and analysis.
Users with a SQL background use Hive
No need for Java familiarity
Limitations: do not use for row-level updates; latency of Hive queries is high; not designed for OLTP (insert, update, delete)
128. Hive Data Model
Database: a folder is created at the path user/warehouse/hive
Table: when a table employee is created, a folder is created inside the database folder
Partition: date-wise partitioning is created under the table folder, so searching becomes faster
Buckets or clusters: similar data allocated together depending on the hash value
Types of tables: internal and external
129. PIG
Developed by Yahoo!
An abstraction for processing large datasets.
Why Pig?
No need for Java.
Code reducibility
Multi-query approach
Provides nested data types
130. PIG
Pig Latin Script
Visits = load '/data/visits' as (user, url, time);
Visits = foreach Visits generate user, Canonicalize(url), time;
Pages = load '/data/pages' as (url, pagerank);
VP = join Visits by url, Pages by url;
UserVisits = group VP by user;
UserPageranks = foreach UserVisits generate user, AVG(VP.pagerank) as avgpr;
GoodUsers = filter UserPageranks by avgpr > '0.5';
store GoodUsers into '/data/good_users';
131. PIG
Java vs. Pig Latin
[Bar charts comparing Hadoop (Java) and Pig: 1/20 the lines of code, 1/16 the development time in minutes]
Performance on par with raw Hadoop!
132. PIG
Pig takes care of…
Schema and type checking
Translating into efficient physical dataflow
(i.e., sequence of one or more MapReduce jobs)
Exploiting data reduction opportunities
(e.g., early partial aggregation via a combiner)
Executing the system-level dataflow
(i.e., running the MapReduce jobs)
Tracking progress, errors, etc.
133. PIG
Integration
Reasons to use Hive on HBase:
A lot of data sitting in HBase due to its usage in a real-time environment, but never used for analysis
Give access to data in HBase usually only queried through MapReduce to people that don't code (business analysts)
When needing a more flexible storage solution, so that rows can be updated live by either a Hive job or an application and the changes are seen immediately by the other
Reasons not to do it:
Run SQL queries on HBase to answer live user requests (it's still a MR job)
Hoping to see interoperability with other SQL analytics systems
134. PIG
Integration
How it works:
Hive can use tables that already exist in HBase or manage
its own ones, but they still all reside in the same HBase
instance
[Figure: Hive table definitions point into a single HBase instance — one points to an existing table, another manages its table from Hive]
135. PIG
Integration
How it works:
When using an already existing table, defined as EXTERNAL, you can create multiple
Hive tables that point to it
[Figure: several Hive table definitions point at one HBase table — one points to some columns, another to other columns, under different names]
136. PIG
Integration
How it works:
Columns are mapped however you want, changing names and giving types
[Figure: Hive table definition "people" mapped onto HBase table "persons" —
name STRING -> d:fullname
age INT -> d:age
siblings MAP<string, string> -> f:
(d:address exists in HBase but is not mapped)]
137. PIG
Integration
Drawbacks (that can be fixed with brain juice):
Binary keys and values (like integers represented on 4 bytes) aren't supported since Hive prefers string representations, HIVE-1634
Compound row keys aren't supported; there's no way of using multiple parts of a key as different "fields"
This means that concatenated binary row keys are completely unusable, which is what people often use for HBase
Filters are done at Hive level instead of being pushed to the region servers
Partitions aren't supported
138. PIG
Data Flows
Data is being generated all over the place:
Apache logs
Application logs
MySQL clusters
HBase clusters
139. PIG
Data Flows
Moving application log files
[Figure: a log file is tail'ed continuously, parsed into HBase format, and inserted into HBase; it is also read nightly, transformed in format, and dumped into HDFS]
140. PIG
Data Flows
Moving MySQL data
[Figure: MySQL is dumped nightly with CSV import into HDFS; the Tungsten replicator parses changes into HBase format and inserts them into HBase]
141. PIG
Use Cases
Front-end engineers
They need some statistics regarding their latest product
Research engineers
Ad-hoc queries on user data to validate some assumptions
Generating statistics about recommendation quality
Business analysts
Statistics on growth and activity
Effectiveness of advertiser campaigns
Users' behavior vs. past activities to determine, for example, why certain groups react better to email communications
Ad-hoc queries on stumbling behaviors of slices of the user base
142. 142
HIVE + HBASE
142
Use Cases
Using a simple table in HBase:
CREATE EXTERNAL TABLE blocked_users(
userid INT,
blockee INT,
blocker INT,
created BIGINT)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" =
":key,f:blockee,f:blocker,f:created")
TBLPROPERTIES("hbase.table.name" = "m2h_repl-userdb.stumble.blocked_users");
HBase is a special case here: its unique row key is mapped with :key
Not all the columns in the table need to be mapped
143. 143
HIVE + HBASE
143
Use Cases
Using a complicated table in HBase:
CREATE EXTERNAL TABLE ratings_hbase(
userid INT,
created BIGINT,
urlid INT,
rating INT,
topic INT,
modified BIGINT)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" =
":key#b@0,:key#b@1,:key#b@2,default:rating#b,default:topic#b,default:modified#b")
TBLPROPERTIES("hbase.table.name" = "ratings_by_userid");
#b means binary, @ means position in the composite key (a StumbleUpon-specific hack)
144. 144
PIG Architecture
144
Grunt shell / Pig server: use the Grunt shell interactively; to
access Pig from a program, you go through the Pig server.
The code first goes to the parser for syntax checking. If it is
error-free, the parser produces a logical plan: a DAG (directed
acyclic graph) of logical operators.
This logical plan is forwarded to the optimizer.
The optimized plan is then sent to the compiler.
The compiler's output is a series of MapReduce jobs, which are
handed to the execution engine.
The execution engine takes care of executing the jobs on MapReduce.
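A toy Python model of this pipeline; the stage names follow the description above, and nothing here is Pig's actual code:

def parse(script):
    # Syntax check; on success return a logical plan (DAG of operators).
    return ["LOAD", "FILTER", "GROUP", "FOREACH", "STORE"]

def optimize(logical_plan):
    # e.g. push filters early, insert combiners where possible.
    return logical_plan

def compile_to_mr(logical_plan):
    # A logical plan typically becomes one or more MapReduce jobs;
    # here we pretend each GROUP boundary starts a new job.
    jobs, current = [], []
    for op in logical_plan:
        current.append(op)
        if op == "GROUP":
            jobs.append(current)
            current = []
    if current:
        jobs.append(current)
    return jobs

def execute(jobs):
    for i, job in enumerate(jobs, 1):
        print(f"running MR job {i}: {job}")

execute(compile_to_mr(optimize(parse("users = LOAD ..."))))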
145. 145
Data Model
145
A table in Bigtable is a sparse, distributed, persistent
multidimensional sorted map
Map indexed by a row key, column key, and a timestamp
(row:string, column:string, time:int64) -> uninterpreted byte array
Supports lookups, inserts, deletes
Single row transactions only
Image Source: Chang et al., OSDI 2006
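A minimal Python sketch of this data model as a dict keyed by (row, column, timestamp); the row and column names echo the Bigtable paper's web-table example and are illustrative only:

table = {}

def put(row, column, ts, value: bytes):
    table[(row, column, ts)] = value

def lookup(row, column):
    # Return the most recent version of this cell, if any.
    versions = [(ts, v) for (r, c, ts), v in table.items()
                if r == row and c == column]
    return max(versions)[1] if versions else None

def delete(row, column, ts):
    table.pop((row, column, ts), None)

put("com.cnn.www", "contents:", 1, b"<html>v1</html>")
put("com.cnn.www", "contents:", 2, b"<html>v2</html>")
print(lookup("com.cnn.www", "contents:"))  # b'<html>v2</html>'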
146. 146
Rows and Columns
146
Rows maintained in sorted lexicographic order
Applications can exploit this property for efficient row
scans
Row ranges dynamically partitioned into tablets
Columns grouped into column families
Column key = family:qualifier
Column families provide locality hints
Unbounded number of columns
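A short Python sketch of why sorted row keys make range scans cheap (two binary searches plus one contiguous slice); the reversed-domain keys are an assumed example, not required by Bigtable:

import bisect

rows = sorted(["com.cnn.www/a", "com.cnn.www/b", "com.cnn.sports",
               "org.apache.hadoop", "org.apache.hbase"])

def scan(start, end):
    # All rows in [start, end): binary-search both ends, slice once.
    lo = bisect.bisect_left(rows, start)
    hi = bisect.bisect_left(rows, end)
    return rows[lo:hi]

# Reversed-domain row keys cluster pages of one site together:
print(scan("com.cnn.", "com.cnn/"))  # only the com.cnn.* rows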
148. 148
SSTable
148
Basic building block of Bigtable
Persistent, ordered immutable map from keys to values
Stored in GFS
Sequence of blocks on disk plus an index for block lookup
Can be completely mapped into memory
Supported operations:
Look up value associated with key
Iterate key/value pairs within a key range
[Diagram: an SSTable as a sequence of 64K blocks on disk plus an
index for block lookup]
Source: Graphic from slides by Erik Paulson
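A toy Python SSTable, assuming the block-plus-index layout above; a lookup binary-searches the in-memory index, then reads only one block. A hedged sketch, not Bigtable's real format:

import bisect

blocks = [                       # each block is ~64K in real life
    [("aardvark", 1), ("ant", 2)],
    [("apple", 3), ("bat", 4)],
    [("boat", 5), ("cat", 6)],
]
index = [blk[0][0] for blk in blocks]   # first key of each block

def lookup(key):
    i = bisect.bisect_right(index, key) - 1
    if i < 0:
        return None
    for k, v in blocks[i]:              # scan within the one block
        if k == key:
            return v
    return None

def scan(start, end):
    # Iterate key/value pairs in [start, end) across blocks.
    for blk in blocks:
        for k, v in blk:
            if start <= k < end:
                yield k, v

print(lookup("bat"))                # 4
print(list(scan("ant", "boat")))    # [('ant', 2), ('apple', 3), ('bat', 4)]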
149. 149
Tablet
149
Dynamically partitioned range of rows
Built from multiple SSTables
[Diagram: a tablet covering the row range Start: aardvark to
End: apple, built from two SSTables, each a set of 64K blocks plus
an index]
Source: Graphic from slides by Erik Paulson
150. 150
Table
150
Multiple tablets make up the table
SSTables can be shared
[Diagram: a table made of multiple tablets (aardvark..apple,
apple_two_E..boat) that share SSTables]
Source: Graphic from slides by Erik Paulson
152. 152
Bigtable Master
152
Assigns tablets to tablet servers
Detects addition and expiration of tablet servers
Balances tablet server load
Handles garbage collection
Handles schema changes
153. 153
Bigtable Tablet Servers
153
Each tablet server manages a set of tablets
Typically between ten and a thousand tablets
Each 100-200 MB by default
Handles read and write requests to the tablets
Splits tablets that have grown too large
155. 155
Tablet Assignment
155
Master keeps track of:
Set of live tablet servers
Assignment of tablets to tablet servers
Unassigned tablets
Each tablet is assigned to one tablet server at a time
Tablet server maintains an exclusive lock on a file in Chubby
Master monitors tablet servers and handles assignment
Changes to tablet structure
Table creation/deletion (master initiated)
Tablet merging (master initiated)
Tablet splitting (tablet server initiated)
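A hedged Python sketch of the master's bookkeeping, assuming a simple least-loaded placement policy; the real assignment logic, including the Chubby locking, is more involved:

live_servers = {"ts1", "ts2", "ts3"}
assignment = {}                      # tablet -> server
unassigned = ["t-aardvark", "t-apple", "t-boat", "t-cat"]

def assign_pending():
    while unassigned:
        tablet = unassigned.pop()
        # Pick the live server currently holding the fewest tablets.
        target = min(live_servers,
                     key=lambda s: sum(1 for v in assignment.values() if v == s))
        assignment[tablet] = target

def server_expired(server):
    # e.g. its lock in Chubby lapsed: reassign all of its tablets.
    live_servers.discard(server)
    for tablet, s in list(assignment.items()):
        if s == server:
            del assignment[tablet]
            unassigned.append(tablet)
    assign_pending()

assign_pending()
server_expired("ts2")
print(assignment)   # every tablet now lives on ts1 or ts3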
157. 157
Compactions
157
Minor compaction
Converts the memtable into an SSTable
Reduces memory usage and log traffic on restart
Merging compaction
Reads the contents of a few SSTables and the memtable,
and writes out a new SSTable
Reduces number of SSTables
Major compaction
Merging compaction that results in only one SSTable
No deletion records, only live data
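A Python sketch of the three compaction flavors on toy sorted maps, with None standing in for a deletion record; illustrative only:

def merge(*sstables, major=False):
    # Later inputs win; a major compaction also drops tombstones.
    out = {}
    for sst in sstables:          # oldest first, newest last
        out.update(sst)
    if major:
        out = {k: v for k, v in out.items() if v is not None}
    return dict(sorted(out.items()))

memtable = {"b": 2, "c": None}        # c was deleted
sst_old = {"a": 1, "c": 3}
sst_new = {"b": 1}

minor = merge(memtable)                        # memtable -> new SSTable
merging = merge(sst_new, minor)                # a few SSTables + memtable
major = merge(sst_old, sst_new, minor, major=True)
print(major)                                   # {'a': 1, 'b': 2}: no tombstones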
159. 159
Cassandra
159
Why Cassandra?
Lots of data
Copies of messages, reverse indices of messages, per user
data.
Many incoming requests resulting in a lot of random
reads and random writes.
No existing production-ready solution in the market met these
requirements.
160. 160
Cassandra
160
Design Goals
High availability
Eventual consistency
trade off strong consistency in favor of high availability
Incremental scalability
Optimistic Replication
“Knobs” to tune tradeoffs between consistency,
durability and latency
Low total cost of ownership
Minimal administration
161. 161
Cassandra
161
Innovation at scale
Google Bigtable (2006)
consistency model: strong
data model: sparse map
clones: HBase, Hypertable
Amazon Dynamo (2007)
O(1) DHT
consistency model: client tune-able
clones: Riak, Voldemort
Cassandra ~= Bigtable + Dynamo
162. 162
Cassandra
162
Proven
Facebook stores 150 TB of data on 150 nodes
Web 2.0
used at Twitter, Rackspace, Mahalo, Reddit, Cloudkick, Cisco, Digg,
SimpleGeo, Ooyala, OpenX, and others
163. 163
Cassandra
163
Data Model
[Diagram: a row KEY maps to several column families]
ColumnFamily1 (Name: MailList, Type: Simple, Sort: Name):
columns tid1..tid4, each with a Name, a binary Value, and a Timestamp
ColumnFamily2 (Name: WordList, Type: Super, Sort: Time):
SuperColumns (e.g. "aloha", "dude"), each holding its own list of
(Name, Value, Timestamp) columns
ColumnFamily3 (Name: System, Type: Super, Sort: Name):
SuperColumns hint1..hint4, each with a column list
Column families are declared upfront
Columns are added and modified dynamically
SuperColumns are added and modified dynamically
164. 164
Cassandra
164
Write Operations
A client issues a write request to a random node in the
Cassandra cluster.
The “Partitioner” determines the nodes responsible for
the data.
Locally, write operations are logged and then applied to
an in-memory version.
Commit log is stored on a dedicated disk local to the
machine.
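A hedged Python sketch of this local write path: append to the commit log first for durability, then apply to the in-memory memtable; recovery replays the log. Not Cassandra's actual code:

commit_log = []                 # stand-in for the dedicated-disk log
memtable = {}                   # in-memory; per column family in reality

def write(key, column, value):
    commit_log.append((key, column, value))       # sequential append first
    memtable.setdefault(key, {})[column] = value  # then in-memory apply

def recover():
    # On restart, replay the log to rebuild the memtable.
    memtable.clear()
    for key, column, value in commit_log:
        memtable.setdefault(key, {})[column] = value

write("user42", "name", "alice")
write("user42", "email", "a@example.com")
recover()
print(memtable)   # {'user42': {'name': 'alice', 'email': 'a@example.com'}}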
166. 166
Cassandra
166
Write (cont'd)
[Diagram: the local write path for Key (CF1, CF2, CF3)]
Commit log: binary serialized keys, on a dedicated disk
Memtables: one per column family (CF1, CF2, ...); flushed based on
data size, number of objects, and lifetime
Data file on disk: a sequence of
<key name><size of key data><index of columns/supercolumns><serialized column family>
entries
Block index: <key name> -> offset entries (K128 offset, K256 offset,
K384 offset, ...)
Bloom filter: an index kept in memory
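A Python sketch of the read-path role of that in-memory Bloom filter: lookups for absent keys usually return without touching disk. The two-hash filter here is a deliberately tiny stand-in, not a production design:

def hashes(key, size=64):
    # Two cheap hash positions; real filters use more, better hashes.
    return [hash(key) % size, hash(key + "#salt") % size]

bits = [False] * 64
on_disk = {"K128": "...", "K256": "...", "K384": "..."}

for k in on_disk:
    for h in hashes(k):
        bits[h] = True

def lookup(key):
    if not all(bits[h] for h in hashes(key)):
        return None              # definitely absent: no disk read
    return on_disk.get(key)      # maybe present: go to the block index

print(lookup("K256") is not None)   # True
print(lookup("K999"))               # usually None, without any disk read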
167. 167
Cassandra
167
Write Properties
No locks in the critical path
Sequential disk access
Behaves like a write-back cache
Append support without read ahead
Atomicity guarantee for a key
"Always Writable"
accepts writes even during failure scenarios
170. 170
Cassandra
170
Cluster Membership and Failure Detection
Gossip protocol is used for cluster membership.
Super lightweight with mathematically provable properties.
State disseminated in O(logN) rounds where N is the number
of nodes in the cluster.
Every T seconds each member increments its heartbeat
counter and selects one other member to send its list to.
A member merges the received list with its own list.
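A Python sketch of one gossip round as described above: bump your own heartbeat, send your list to one random peer, and let the peer keep the larger heartbeat per member:

import random

def gossip_round(views, me):
    views[me][me] += 1                       # increment own heartbeat
    peer = random.choice([n for n in views if n != me])
    merge(views[peer], views[me])            # send my list to one peer

def merge(mine, theirs):
    for node, hb in theirs.items():
        if hb > mine.get(node, -1):          # larger heartbeat wins
            mine[node] = hb

nodes = ["A", "B", "C"]
views = {n: {m: 0 for m in nodes} for n in nodes}
for _ in range(10):
    for n in nodes:
        gossip_round(views, n)
print(views["A"])   # A's view of everyone's latest heartbeat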
175. 175
Cassandra
175
Accrual Failure Detector
Valuable for system management, replication, load
balancing etc.
Defined as a failure detector that outputs a value,
PHI, associated with each process.
Also known as Adaptive Failure detectors - designed
to adapt to changing network conditions.
The value output, PHI, represents a suspicion level.
Applications set an appropriate threshold, trigger
suspicions and perform appropriate actions.
In Cassandra the average time taken to detect a failure is 10-15
seconds, with the PHI threshold set at 5.
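A hedged Python sketch of an accrual failure detector, assuming exponentially distributed heartbeat inter-arrivals for simplicity (Cassandra's actual estimator is more sophisticated); note how PHI crosses a threshold of 5 only after a long silence:

import math

class PhiDetector:
    def __init__(self):
        self.intervals = []      # observed gaps between heartbeats
        self.last = None

    def heartbeat(self, now):
        if self.last is not None:
            self.intervals.append(now - self.last)
        self.last = now

    def phi(self, now):
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        # P(next heartbeat still pending) ~ exp(-t/mean);
        # PHI is -log10 of that probability.
        t = now - self.last
        return (t / mean) * math.log10(math.e)

d = PhiDetector()
for t in (0, 1, 2, 3):           # heartbeats every 1 s
    d.heartbeat(t)
print(d.phi(3.5))                # low: heartbeat only slightly late
print(d.phi(15) > 5)             # True: suspect after a long silence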
177. 177
Cassandra
177
Performance Benchmark
Loading of data - limited by network bandwidth.
Read performance for Inbox Search in production:
            Search Interactions   Term Search
Min         7.69 ms               7.78 ms
Median      15.69 ms              18.27 ms
Average     26.13 ms              44.41 ms
178. 178
Cassandra
178
MySQL Comparison
MySQL, > 50 GB of data:
Writes average: ~300 ms
Reads average: ~350 ms
Cassandra, > 50 GB of data:
Writes average: 0.12 ms
Reads average: 15 ms
179. 179
Cassandra
179
Lessons Learnt
Add fancy features only when absolutely required.
Many types of failures are possible.
Big systems need proper systems-level monitoring.
Value simple designs
180. 180
Graph Databases
180
NEO4J (Graphbase)
• A graph is a collection of nodes (things) and edges (relationships)
that connect pairs of nodes.
• Attach properties (key-value pairs) to nodes and relationships.
• Relationships connect two nodes, and both nodes and relationships
can hold an arbitrary number of key-value pairs.
• A graph database can be thought of as a key-value store with full
support for relationships.
• http://neo4j.org/
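A tiny property-graph sketch in Python matching the model above, where nodes and relationships both carry key-value properties; this is illustrative, not Neo4j's API:

nodes = {
    1: {"name": "Alice"},
    2: {"name": "Bob"},
    3: {"name": "Carol"},
}
# Relationships as (from, to, type, properties):
rels = [
    (1, 2, "KNOWS", {"since": 2009}),
    (2, 3, "KNOWS", {"since": 2012}),
]

def neighbors(node_id, rel_type):
    return [dst for src, dst, t, _ in rels
            if src == node_id and t == rel_type]

def traverse(start, rel_type, depth):
    # Breadth-first walk up to `depth` hops along one relationship type.
    frontier, seen = {start}, {start}
    for _ in range(depth):
        frontier = {n for f in frontier for n in neighbors(f, rel_type)} - seen
        seen |= frontier
    return [nodes[n]["name"] for n in sorted(seen)]

print(traverse(1, "KNOWS", 2))   # ['Alice', 'Bob', 'Carol']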
187. 187
Graph Databases
187
NEO4J Features
• Dual license: open source and commercial
• Well suited for many web use cases such as tagging, metadata
annotations, social networks, wikis and other network-shaped or
hierarchical data sets
• Intuitive graph-oriented model for data representation. Instead of
static and rigid tables, rows and columns, you work with a flexible
graph network consisting of nodes, relationships and properties.
• Neo4j offers performance improvements on the order of 1000x or
more compared to relational DBs.
• A disk-based, native storage manager completely optimized for
storing graph structures for maximum performance and scalability
• Massive scalability: Neo4j can handle graphs of several billion
nodes/relationships/properties on a single machine and can be
sharded to scale out across multiple machines
• Fully transactional, like a real database
• Neo4j traverses depths of 1000 levels and beyond at millisecond
speed (many orders of magnitude faster than relational systems)