The document discusses using MapReduce with NoSQL databases such as MongoDB and Accumulo to analyze large datasets, allowing distributed processing and incremental updates where traditional analytical systems fall short. It gives examples of running MapReduce over MongoDB and Accumulo to perform analytics and maintain running aggregates, and covers the tradeoffs between approaches and best practices for performance when combining MapReduce with NoSQL databases.
13. Analysis Challenges
Analytical Latency
Data is always old
Answers can take a long time
Serving up analytical results
Higher cost, complexity
Incremental Updates
14. Analysis Challenges
Word Counts:
a: 5342
aardvark: 13
an: 4553
anteater: 27
...
yellow: 302
zebra: 19
New Document: “The red aardvarks live in holes.”
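The slide's challenge, folding a new document into an existing word-count table without recomputing everything, can be sketched in a few lines (the counts and sentence are the slide's own; the tokenizer is a deliberately naive assumption):

```python
from collections import Counter

def update_word_counts(counts, document):
    """Fold one new document's words into the running counts in place."""
    words = document.lower().replace(".", "").split()  # naive tokenizer
    counts.update(words)
    return counts

# Existing aggregate from the slide
counts = Counter({"a": 5342, "aardvark": 13, "an": 4553, "zebra": 19})
update_word_counts(counts, "The red aardvarks live in holes.")
# "aardvarks" becomes a new key; untouched counts stay as they were
```

The point of the example is that only the new document's words are touched; the cost of the update is independent of the size of the existing table.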
15. Analysis Challenges
HDFS log files:
sources/routers
sources/apps
sources/webservers
MapReduce over data from all sources for the week of Jan 13th
20. Performance Profiles
[Chart: MapReduce and NoSQL rated Good/Bad on Throughput, Bulk Update, Latency, and Seek]
21. Performance Profiles
[Chart: the same comparison with a MapReduce on NoSQL column added]
22. Performance Profiles
[Chart: same comparison as slide 21]
23. Performance Profiles
[Chart: MapReduce on NoSQL alone, rated on Throughput, Bulk Update, Latency, and Seek]
24. Best Practices
Use a NoSQL db that has good throughput - it helps to keep communication local
Isolate MapReduce workers to a subset of your NoSQL nodes so that some remain available for fast queries
If MR output is written back to the NoSQL db, it is immediately available for query
25. THE INTERLLECTIVE
Concept-Based Search
36. Config
[Diagram: a single job running against a sharded MongoDB cluster - shards with primaries (P) and replicas, app servers routing through mongos]
37. MongoDB
Mappers read directly from a single mongod process, not through mongos - reads tend to be local
Balancer can be turned off to avoid the potential for reading data twice
38. MongoReduce
Only MongoDb primaries do writes. Schedule mappers on secondaries
Intermediate output goes to HDFS
39. MongoReduce
Final output can go to HDFS or MongoDb
40. MongoReduce
Mappers can just write to global MongoDb through mongos
41. What’s Going On?
[Diagram: map tasks emit to reducers r1-r3; with an identity reducer, output flows through mongos to the shard primaries (P)]
42. MongoReduce
Instead of specifying an HDFS directory for input, can submit MongoDb query and select statements:
q = {article_source: {$in: ['nytimes.com', 'wsj.com']}}
s = {authors: true}
Queries use indexes!
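The slide's query (q) and select (s) statements translate directly into the filter and projection documents a Mongo driver takes. A sketch assuming the pymongo driver; the `news.articles` database and collection names are made up for illustration:

```python
# The slide's q and s as Python dicts: select articles from either
# source, projecting only the authors field.
q = {"article_source": {"$in": ["nytimes.com", "wsj.com"]}}
s = {"authors": True}

# With pymongo (assumed driver), the same input selection would be:
#   from pymongo import MongoClient
#   docs = MongoClient().news.articles.find(q, s)
```

Because the filter runs server-side, an index on article_source lets the database skip non-matching documents instead of scanning everything - that is the "Queries use indexes!" point.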
43. MongoReduce
If outputting to MongoDb, new collections are automatically sharded, pre-split, and balanced
Can choose the shard key
Reducers can choose to call update()
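A reducer that calls update() rather than overwriting can fold its partial result into an existing document with an upsert-increment. A sketch of building such an update spec; the helper name and the word_counts collection are assumptions, not part of MongoReduce:

```python
def count_update(word, n):
    """Build the (filter, update) pair a reducer might pass to update():
    upsert the document for `word` and add this pass's partial sum."""
    return ({"_id": word}, {"$inc": {"count": n}})

# With pymongo (assumed driver), a reducer would apply it as:
#   filt, spec = count_update("aardvark", 3)
#   db.word_counts.update_one(filt, spec, upsert=True)
```

Using $inc keeps the stored value a running aggregate, so repeated reduce passes accumulate instead of clobbering each other.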
51. Accumulo
Based on Google's BigTable design
Uses Apache Hadoop, Zookeeper, and Thrift
Features a few novel improvements on the BigTable design:
cell-level access labels
a server-side programming mechanism called Iterators
53. MapReduce and Accumulo
Can do regular ol’ MapReduce just like w/ MongoDb
But can use Iterators to achieve a kind of ‘continual MapReduce’
60. Iterators
row : column family : column qualifier : ts -> value
can specify which key elements are unique, e.g.
row : column family
can specify a function to execute on values of identical key-portions, e.g.
sum(), max(), min()
61. Key to performance
When the functions are run
Rather than atomic increment (lock, read, +1, write - SLOW):
write all values, sum at
read time
minor compaction time
major compaction time
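The write-all-values, aggregate-later pattern above can be modeled in a few lines. This is a toy sketch, not the real Accumulo Iterator API: writes only append, and the function (sum here) collapses duplicate key-portions lazily at read or compaction time:

```python
from collections import defaultdict

class SumAtReadTable:
    """Toy model of Accumulo-style lazy aggregation: writes append raw
    values (no lock/read/+1/write cycle); an aggregation function
    collapses duplicates at read time or at compaction time."""

    def __init__(self, fn=sum):
        self.cells = defaultdict(list)  # key-portion -> list of raw values
        self.fn = fn

    def write(self, key, value):
        self.cells[key].append(value)   # cheap: append only, no locking

    def read(self, key):
        return self.fn(self.cells[key])  # aggregate on the way out

    def compact(self):
        # Like a minor/major compaction: collapse stored duplicates
        for key, vals in self.cells.items():
            self.cells[key] = [self.fn(vals)]

t = SumAtReadTable()
for v in (1, 1, 3):
    t.write("row:fam", v)
t.compact()
print(t.read("row:fam"))  # 5
```

The design point is that aggregation happens when the table is already reading or rewriting the data anyway, so increments never pay for a round trip.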
64. Reduce’ (prime)
Because a function has not seen all values for a given key - another may show up
More like writing a MapReduce combiner function
65. ‘Continuous’ MapReduce
Can maintain huge result sets that are always available for query
Update graph edge weights
Update feature vector weights
Statistical counts
normalize after query to get probabilities
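Keeping raw counts in the table and normalizing after the query is what lets the stored aggregates stay cheap incremental sums. A minimal sketch of that query-time step:

```python
def normalize(counts):
    """Turn raw statistical counts into probabilities at query time,
    so the stored values remain simple incrementable sums."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

print(normalize({"red": 2, "blue": 6}))  # {'red': 0.25, 'blue': 0.75}
```

Normalizing at read time avoids ever rewriting the whole distribution when one count changes.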
69. Google Percolator
A system for incrementally processing updates to a large data set
Used to create the Google web search index.
Reduced the average age of documents in Google search results by 50%.
70. Google Percolator
A novel, proprietary system of Distributed Transactions and Notifications built on top of BigTable
71. Solution Space
Incremental update, multi-row consistency: Percolator
Results can’t be broken down (sort): MapReduce
No multi-row updates: BigTable
Computation is small: Traditional DBMS