Relational databases are useful for their transactional and query capabilities but do not scale linearly. Distributed "shared nothing" architectures like Hadoop MapReduce are more scalable by making work independent and parallelizable. However, real-world problems often require sharing data between processes. A common pattern is to use MapReduce for bulk processing and loading data, then use databases or other technologies for interactive queries and iterative jobs. Choosing the right data storage technology depends on an application's specific needs.
3. Relational Databases are Awesome
Atomic, transactional updates
Guaranteed consistency
Declarative queries
Easy to reason about
Long track record of success
18. Quiz: which one is scalable?
1000-node Hadoop cluster where jobs depend on a common process
1000 Windows ME machines running independent Excel macros
25. “Shared Nothing” architectures are the most scalable…
…but most real-world problems require us to share something…
…so our designs usually have a parallel part and a serial part
26. The key is to make sure the vast majority of our work in the cloud is independent and parallelizable.
27. Amdahl’s Law
S(N) = 1 / ((1 − P) + P/N)
S: speed improvement
P: ratio of the problem that can be parallelized
N: number of processors
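A worked example (the numbers here are mine, not the deck's): even when 95% of a job is parallelizable, the serial 5% caps the speedup far below linear.

```latex
S(N) = \frac{1}{(1-P) + \frac{P}{N}}, \qquad
S(1000)\big|_{P=0.95} = \frac{1}{0.05 + \frac{0.95}{1000}} \approx 19.6
```

So 1000 processors yield less than a 20x speedup, and S(N) can never exceed 1/(1 − P) = 20 no matter how many processors we add.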
28. MapReduce Primer
(Diagram: input data is divided into splits 1…N, each processed by a mapper in the Map phase; the Shuffle phase routes mapper output by key to reducers 1…N in the Reduce phase.)
29. MapReduce Example: Word Count
(Diagram: in the Map phase each mapper counts words per book; the Shuffle phase routes counts by key range so that reducers sum words A–C, D–E, …, W–Z.)
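To make the flow concrete, here is a sketch of word count in the classic Hadoop Java API (class names and the combiner choice are my additions, not from the deck):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reduce phase: the shuffle has grouped all counts for a word together,
  // so summing them here is the (much smaller) serial part of the problem.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(SumReducer.class); // pre-aggregate on the map side
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```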
30. Notice there is still a serial part of the problem: the output of the reducers must be combined
…but this is much smaller, and can be handled by a single process
32. Also notice that the network is a shared resource when processing big data
So rather than moving data to computation, we move computation to data.
34. MapReduce Data Locality
(Diagram: the same flow as the primer, but each split and its mapper sit on the same physical machine, so the Map phase reads local data; only the shuffle to the reducers crosses the network.)
36. Data locality is only guaranteed in the Map phase
So the most data-intensive work should be done in the map, with smaller data sets sent to the reducer
Some Map/Reduce jobs have no reducer at all!
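Since the reducer is optional, a map-only job is just a matter of setting the reduce count to zero. A minimal sketch of the job wiring (FilterJob and FilterMapper are hypothetical names):

```java
// With zero reduce tasks there is no shuffle: each mapper's output is
// written straight to HDFS. FilterJob and FilterMapper are hypothetical.
Job job = Job.getInstance(new Configuration(), "map-only filter");
job.setJarByClass(FilterJob.class);
job.setMapperClass(FilterMapper.class);
job.setNumReduceTasks(0);   // disables the reduce phase entirely
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
```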
38. MapReduce Gone Wrong
(Diagram: the word-count flow again, but each reducer calls out to a remote “Word Addition Service” for every word instead of summing locally.)
39. Even if our Word Addition Service is scalable, we’d need to scale it to the size of the largest Map/Reduce job that will ever use it
40. So for data processing, prefer embedded libraries over remote services
Use remote services for configuration, to prime caches, etc. – just not for every data element!
42. Joining a billion records
Word counts are great, but many real-world problems mean bringing together multiple datasets.
So how do we “join” with MapReduce?
43. Map-Side Joins
When joining one big input to a small one, simply copy the small data set to each mapper
(Diagram: each mapper processes one split of Data Set 1 and receives a full copy of Data Set 2, so the join completes in the Map phase; the reducers just collect the joined output.)
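A hedged sketch of the mapper for such a join (the file name, tab-separated layout, and class name are assumptions): the small data set is loaded into memory once per mapper in setup(), then probed for every record of the big input.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-side join sketch: the small data set is assumed to have been shipped
// to every node (e.g. via the distributed cache) as key<TAB>value lines.
public class MapSideJoinMapper extends Mapper<Object, Text, Text, Text> {
  private final Map<String, String> smallTable = new HashMap<>();

  @Override
  protected void setup(Context context) throws IOException {
    // Load the full small data set into memory once per mapper.
    try (BufferedReader reader =
             new BufferedReader(new FileReader("small_dataset.txt"))) {
      String line;
      while ((line = reader.readLine()) != null) {
        String[] parts = line.split("\t", 2);
        if (parts.length == 2) {
          smallTable.put(parts[0], parts[1]);
        }
      }
    }
  }

  @Override
  protected void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] parts = value.toString().split("\t", 2);
    if (parts.length < 2) {
      return;
    }
    String match = smallTable.get(parts[0]); // probe the in-memory copy
    if (match != null) {
      context.write(new Text(parts[0]), new Text(parts[1] + "\t" + match));
    }
  }
}
```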
44. Merge in Reducer
Route common items to the same reducer
(Diagram: splits of both Data Set 1 and Data Set 2 are grouped by key in the Map phase; the shuffle routes records with the same key to the same reducer, which merges them.)
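A sketch of the reducer for this pattern (the "1:"/"2:" source tags and tab-separated records are my own convention, not the deck's): mappers emit records keyed by the join key and tagged with their source, and the reducer separates the two sides and emits matches.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reduce-side join sketch: the shuffle delivers both data sets' records
// for a given key to one reducer, which performs the merge.
public class JoinReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    List<String> left = new ArrayList<>();   // records tagged "1:" (Data Set 1)
    List<String> right = new ArrayList<>();  // records tagged "2:" (Data Set 2)
    for (Text value : values) {
      String v = value.toString();
      if (v.startsWith("1:")) {
        left.add(v.substring(2));
      } else {
        right.add(v.substring(2));
      }
    }
    // Emit the cross product of matching records: an inner join on the key.
    for (String l : left) {
      for (String r : right) {
        context.write(key, new Text(l + "\t" + r));
      }
    }
  }
}
```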
45. Higher-Level Constructs
MapReduce is a primitive operation for higher-level constructs
Hive, Pig, Cascading, and Crunch all compile into MapReduce
Use one!
Crunch!
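As an illustration, word count in Crunch collapses to a few lines. This is a sketch against the classic Crunch Java API; the driver class name is mine and paths come from the command line:

```java
import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.types.writable.Writables;
import org.apache.hadoop.conf.Configuration;

public class CrunchWordCount {
  public static void main(String[] args) {
    // The pipeline plans and compiles down to MapReduce jobs for us.
    Pipeline pipeline = new MRPipeline(CrunchWordCount.class, new Configuration());
    PCollection<String> lines = pipeline.readTextFile(args[0]);

    // Split lines into words; this becomes the map phase.
    PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
      @Override
      public void process(String line, Emitter<String> emitter) {
        for (String word : line.split("\\s+")) {
          emitter.emit(word);
        }
      }
    }, Writables.strings());

    // count() handles the shuffle and reduce phases.
    PTable<String, Long> counts = words.count();
    pipeline.writeTextFile(counts, args[1]);
    pipeline.done();
  }
}
```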
47. MapReduce vs. MPP Databases

MapReduce                                    MPP Databases
---------                                    -------------
Data in a distributed filesystem             Data in sharded relational databases
Oriented towards unstructured or             Oriented towards structured data
semi-structured data
Java or Domain-Specific Languages            SQL
(e.g., Pig and Hive)
Poor support for iterative operations        Good support for iterative operations
Arbitrarily complex programs                 SQL and User-Defined Functions
running next to data                         running next to data
Poor interactive query support               Good interactive query support
54. MapReduce and MPP Databases …are complementary!
Use Map/Reduce to clean, normalize, reconcile, and codify data, then load it into an MPP system for interactive analysis
57. Hadoop Distributed Filesystem
Scales to many petabytes
Splits all files into blocks and spreads them across data nodes
The name node keeps track of what blocks belong to what file
All blocks written in triplicate
Write and append only – no random updates!
61. HDFS Writes
(Diagram: the client asks the name node to look up a data node, writes a block to that data node, and the block is then replicated across additional data nodes.)
62. HDFS Reads
(Diagram: the client asks the name node for block locations, then reads the blocks directly from the data nodes that hold them.)
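That flow maps directly onto the Java client API. A minimal sketch (the path and payload are illustrative): the FileSystem handle consults the name node, while the streams carry block data to and from the data nodes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf); // talks to the name node

    // Write: blocks are spread (and replicated) across data nodes.
    try (FSDataOutputStream out = fs.create(new Path("/data/example.txt"))) {
      out.writeUTF("hello, hdfs");
    }

    // Read: the name node tells the client where each block lives.
    try (FSDataInputStream in = fs.open(new Path("/data/example.txt"))) {
      System.out.println(in.readUTF());
    }
  }
}
```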
63. HDFS Shortcomings
No random reads
No random writes
Doesn’t deal well with many small files
Enter HBase: “Random Access To Your Planet-Size Data”
65. HBase
Emulates random I/O with a Write Ahead Log (WAL)
Periodically flushes the log to sorted files
Files accessible as tables, split across many regions, hosted by region servers
Preserves the scalability, data locality, and Map/Reduce features of Hadoop
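A hedged sketch of random reads and writes through the classic HBase client API (the table name, column family, and values are made-up examples):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "patients"); // illustrative table name

    // Random write: goes to the WAL first, then the in-memory store.
    Put put = new Put(Bytes.toBytes("patient-42"));
    put.add(Bytes.toBytes("obs"), Bytes.toBytes("heart_rate"), Bytes.toBytes("72"));
    table.put(put);

    // Random read: served from memory and the sorted files on HDFS.
    Result result = table.get(new Get(Bytes.toBytes("patient-42")));
    byte[] value = result.getValue(Bytes.toBytes("obs"), Bytes.toBytes("heart_rate"));
    System.out.println(Bytes.toString(value));

    table.close();
  }
}
```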
69. Use HBase when:
You have noisy, semi-structured data
You want to apply massively parallel processing to your problem
You need to handle huge write loads
You need a scalable key/value store
72. But there are drawbacks:
Limited schema support
Limited atomicity guarantees
No built-in secondary indexes
HBase is a great tool for many jobs, but not every job
73. The data store should align
with the needs of the application
74. So a pattern is emerging:
(Diagram: Collection → Aggregation → Processing → Storage. Sources such as Millennium, CCDs, Claims, and HL7 are aggregated, processed by Hadoop MapReduce jobs with HBase, and loaded into MPP, relational, and document stores, as well as HBase.)
75. But we have a potential bottleneck
(Same diagram: the bottleneck is the load from the processing layer into the storage systems.)
76. Direct inserts are designed for online updates, not massively parallel data loads
So shift the work into MapReduce, and pre-build files for bulk import:
Oracle Loader for Hadoop
HBase HFile import
Bulk loads for MPP
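For the HBase case, a sketch of what pre-building HFiles looks like (job wiring only; MyHFileMapper, the driver class, and the table name are hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Bulk load sketch: a MapReduce job writes HFiles directly, which are then
// imported into the table, bypassing per-row inserts through the WAL.
public class BulkLoadDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "hfile bulk load");
    job.setJarByClass(BulkLoadDriver.class);
    job.setMapperClass(MyHFileMapper.class);  // hypothetical: emits (row key, Put)
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);

    HTable table = new HTable(conf, "patients");  // table name is illustrative
    // Partitions and sorts the output so HFiles line up with the table's regions.
    HFileOutputFormat.configureIncrementalLoad(job, table);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    if (job.waitForCompletion(true)) {
      // Move the generated HFiles into the table
      // (equivalent to the completebulkload command-line tool).
      new LoadIncrementalHFiles(conf).doBulkLoad(new Path(args[1]), table);
    }
  }
}
```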
77. And we’re missing an important piece:
(Same batch-only pipeline diagram as before.)
78. And we’re missing an important piece:
(Diagram: the same pipeline with a Realtime Processing component added alongside the batch Hadoop Map/Reduce jobs.)
79. How do we make it fast?
(Diagram: a Speed Layer sitting on top of a Batch Layer.)
http://www.slideshare.net/nathanmarz/the-secrets-of-building-realtime-big-data-systems
80. How do we make it fast?
Speed Layer: move data to computation; hours of data; low latency (seconds to process); incremental updates
Batch Layer: move computation to data; years of data; high latency (minutes or hours to process); bulk loads
81. How do we make it fast?
Speed Layer: Complex Event Processing – Storm
Batch Layer: Hadoop MapReduce
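As an illustration of the speed layer, wiring a minimal Storm topology might look like this (EventSpout and CountBolt are hypothetical stand-ins, and the classic backtype.storm API is assumed):

```java
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

public class SpeedLayerTopology {
  public static void main(String[] args) {
    TopologyBuilder builder = new TopologyBuilder();
    // Spout streams incoming events as they arrive (hypothetical class).
    builder.setSpout("events", new EventSpout(), 2);
    // Bolt applies incremental updates; fields grouping sends the same
    // key to the same task (assumes the spout declares a "key" field).
    builder.setBolt("counts", new CountBolt(), 4)
           .fieldsGrouping("events", new Fields("key"));

    Config conf = new Config();
    LocalCluster cluster = new LocalCluster();  // in-process cluster for testing
    cluster.submitTopology("speed-layer", conf, builder.createTopology());
  }
}
```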
84. Quickly create new data models
Fast iteration cycles mean fast innovation
Process all data overnight
Simple correction of any bugs
Much easier to understand and work with