2. /usr/bin/whoami
• Russell Smith
• Consultant for UKD1 Limited
• I specialise in helping companies going through rapid growth:
• Code, architecture, infrastructure, devops, sysops, capacity planning, etc.
• <3 Gearman, MongoDB, Neo4j, MySQL, Riak, Kohana, PHP, Debian, Puppet, etc...
3. What is MongoDB
• A scalable, high-performance, open source, document-oriented
database.
• Stores JSON-like documents
• Indexable on any attribute (like MySQL)
• Built-in MapReduce
4. Requirements
• A running MongoDB server
http://www.mongodb.org/downloads
• Basic knowledge of MongoDB
• Basic JavaScript
5. What is Map Reduce
• Allows aggregating data in parallel
• Some built-in aggregation functions exist: distinct, count
• If you need anything more, use either a query or MapReduce
6. How does it work?
• You write two functions
• You write them in JavaScript (currently)
• Map function:
Called once per document - emits a key + a value via emit()
• Reduce function:
Called per emitted key, with an array of that key's values. It may be called more than once for the same key, so its output must also be valid input
• Optional finalize function for a final pass over the reduced data (e.g. rounding)
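To make the flow concrete, here is a minimal in-memory sketch of the map, reduce and optional finalize steps in plain JavaScript. It is not how MongoDB implements mapReduce; `runMapReduce`, the sample documents and their `state` / `workers` fields are all invented for illustration.

```javascript
// Toy, single-process stand-in for the server side of mapReduce.
var emitted;                          // key -> array of emitted values

// map functions call emit(key, value), just as they do inside MongoDB
function emit(key, value) {
    if (!Object.prototype.hasOwnProperty.call(emitted, key)) {
        emitted[key] = [];
    }
    emitted[key].push(value);
}

function runMapReduce(docs, map, reduce, finalize) {
    emitted = {};
    // Map phase: `this` inside map is the current document
    docs.forEach(function (doc) { map.call(doc); });

    var out = {};
    Object.keys(emitted).forEach(function (key) {
        var values = emitted[key];
        // A key with a single value is passed through without calling reduce
        var value = values.length > 1 ? reduce(key, values) : values[0];
        out[key] = finalize ? finalize(key, value) : value;
    });
    return out;
}

// Hypothetical sample documents
var docs = [
    {state: 'CA', workers: 2},
    {state: 'NY', workers: 1},
    {state: 'CA', workers: 3}
];

var m = function () { emit(this.state, this.workers); };
var r = function (k, v_arr) {
    var total = 0;
    for (var i = 0; i < v_arr.length; i++) { total += v_arr[i]; }
    return total;
};

var result = runMapReduce(docs, m, r);   // CA totals 5, NY totals 1
```

The real server also splits the work into batches and may feed reduce output back into reduce, which this sketch leaves out.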
7. Some example data
• I downloaded the H1B (US temporary work visa) data
http://www.flcdatacenter.com/CaseH1B.aspx
• Imported the CSV data using the mongoimport command
• Total imported documents ~335k
9. What can we do with the data?
• Work out the:
• Applications per state
• Applications by status per state
• Average time from submission to decision, by status
10. Applications by State
• Key will be LCA_CASE_EMPLOYER_STATE
• Assume (wrongly) one person per document
11. Map
• this is equal to the current document
• emit a value of 1, as we are assuming a single H1B app per document
m = function () {
    emit(this.LCA_CASE_EMPLOYER_STATE, 1);
}
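12. Reduce
• Return a value: the length of the array
• This works as each value in the array is 1
• Caveat: this only holds if reduce runs once per key; MongoDB may re-reduce partial results, where summing the values is safer
r = function (k, v_arr) {
    return v_arr.length;
}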
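One thing to watch with a length-based reduce: MongoDB can call reduce again on values that reduce itself returned. The sketch below imitates that in plain JavaScript and compares counting by length with summing. `reduceInBatches` and the batch size of 3 are made up for the demonstration; the real batching is internal to the server.

```javascript
// Reduce as on the slide: counts the values it is handed
var countByLength = function (k, v_arr) {
    return v_arr.length;
};

// Variant that sums the emitted 1s instead
var countBySum = function (k, v_arr) {
    var total = 0;
    for (var i = 0; i < v_arr.length; i++) { total += v_arr[i]; }
    return total;
};

// Hypothetical helper: reduce the values in batches, then reduce the partial results
function reduceInBatches(reduce, key, values, batchSize) {
    var partials = [];
    for (var i = 0; i < values.length; i += batchSize) {
        partials.push(reduce(key, values.slice(i, i + batchSize)));
    }
    return reduce(key, partials);
}

var ones = [1, 1, 1, 1, 1];           // five documents, each emitting 1
var byLength = reduceInBatches(countByLength, 'CA', ones, 3);  // 2: counts the two partials
var bySum = reduceInBatches(countBySum, 'CA', ones, 3);        // 5: 3 + 2
```

Summing gives the same answer however the values are batched, which is why it is the safer way to count.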
13. Executing
• This will execute the map/reduce
• Output goes to a collection named workers_by_state
db.text2010.mapReduce(m, r,
    {out: 'workers_by_state',
     keeptemp: true, verbose: true})
15. A more complex Map!
• The last example assumed one worker per document...which is wrong.
• We now emit a numeric value per state
m = function () {
    emit(this.LCA_CASE_EMPLOYER_STATE,
         this.TOTAL_WORKERS);
}
16. Reduce
• As the array now contains values other than 1, we have to iterate over it
• This is standard JavaScript
r = function (k, v_arr) {
    var total = 0;
    var len = v_arr.length;
    for (var i = 0; i < len; i++)
    {
        total = total + v_arr[i];
    }
    return total;
}
17. Average wage by VISA Class and Application Status
• Assumptions:
• People work ~40 hour weeks
• Weekly wages are paid every week rather than only the weeks worked
• 'Select Pay Range' seems to be the default option...
m = function () {
    var k = this.VISA_CLASS + ' ' + this.STATUS;
    switch (this.LCA_CASE_WAGE_RATE_UNIT)
    {
        case 'Year':
            emit(k, this.LCA_CASE_WAGE_RATE_FROM);
            break;
        case 'Month':
            emit(k, this.LCA_CASE_WAGE_RATE_FROM * 12);
            break;
        case 'Bi-Weekly':
            emit(k, this.LCA_CASE_WAGE_RATE_FROM * 26);
            break;
        case 'Week':
            emit(k, this.LCA_CASE_WAGE_RATE_FROM * 52);
            break;
        case 'Hour':
            emit(k, this.LCA_CASE_WAGE_RATE_FROM * 40 * 52);
            break;
        default:
            emit(k, 0);
    }
}
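The unit conversion can be checked outside MongoDB by pulling it into a plain function. This is a sketch under the same assumptions as the slide (40-hour weeks, 52 paid weeks); `annualWage` is a hypothetical helper name, and like the map function it returns 0 for units it does not recognise, such as 'Select Pay Range'.

```javascript
// Convert a wage to a yearly figure using the slide's assumptions
function annualWage(rate, unit) {
    switch (unit) {
        case 'Year':
            return rate;
        case 'Month':
            return rate * 12;
        case 'Bi-Weekly':
            return rate * 26;
        case 'Week':
            return rate * 52;
        case 'Hour':
            return rate * 40 * 52;   // 40 hours a week, 52 weeks a year
        default:
            return 0;                // unknown unit, e.g. 'Select Pay Range'
    }
}

var hourly = annualWage(25, 'Hour');        // 25 * 40 * 52 = 52000
var monthly = annualWage(5000, 'Month');    // 5000 * 12 = 60000
```

Note that the zeros emitted for unknown units still count towards each key's average and pull it down; skipping the emit for those documents would be one alternative.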
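18. Reduce
• Work out the average for each key
• Add each of the elements up
• Average them
• Caveat: if MongoDB re-reduces partial results, this averages averages; carrying a sum and a count and dividing in finalize avoids that
r = function (k, v_arr) {
    var tot = 0;
    var len = v_arr.length;
    for (var i = 0; i < len; i++)
    {
        tot += v_arr[i];
    }
    return tot / len;
}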
19. Finalize
• A finalize function may be run after reduction.
• Called a single time per output object (key), unlike reduce
• The finalize function takes a key and a value, and returns a finalized value.
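Averages are a typical use for finalize, because reduce may run several times for one key and an average of averages is not the overall average. A minimal sketch, assuming the map function emits `{sum: wage, count: 1}` per document instead of a bare number; the key string and the wage figures below are invented.

```javascript
// Reduce keeps a running sum and count, so its output can be reduced again
var r = function (k, v_arr) {
    var out = {sum: 0, count: 0};
    for (var i = 0; i < v_arr.length; i++) {
        out.sum += v_arr[i].sum;
        out.count += v_arr[i].count;
    }
    return out;
};

// Finalize runs once per key and turns the totals into an average
var f = function (k, v) {
    return v.count > 0 ? v.sum / v.count : 0;
};

// Simulate a re-reduce by hand with three hypothetical emitted values
var key = 'H-1B CERTIFIED';
var values = [
    {sum: 60000, count: 1},
    {sum: 80000, count: 1},
    {sum: 100000, count: 1}
];
var partial = r(key, values.slice(0, 2));     // {sum: 140000, count: 2}
var combined = r(key, [partial, values[2]]);  // {sum: 240000, count: 3}
var average = f(key, combined);               // 80000
```

Averaging inside reduce would have given (70000 + 100000) / 2 = 85000 for the same batches. In the shell, the finalize function is passed through the options object, e.g. `{out: ..., finalize: f}`.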
20. Options
• Persist the output
• Filtering input documents
• Sorting input documents
• JavaScript scope - allows you to pass in extra variables (cannot be changed at runtime?)
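As a rough illustration, the options above map onto the `out`, `query`, `sort` and `scope` fields of the options object passed to mapReduce. The job itself (summing annualised hourly wages per state) and the output collection name are invented for the example; the field names come from the H1B data used earlier, and the `typeof db` guard is only there so the snippet can also be pasted outside the mongo shell.

```javascript
var m = function () {
    // hoursPerWeek is not defined here; it is supplied through the scope option
    emit(this.LCA_CASE_EMPLOYER_STATE,
         this.LCA_CASE_WAGE_RATE_FROM * hoursPerWeek * 52);
};

var r = function (k, v_arr) {
    var total = 0;
    for (var i = 0; i < v_arr.length; i++) { total += v_arr[i]; }
    return total;
};

var options = {
    out: 'hourly_wage_total_by_state',          // persist the output (hypothetical name)
    query: {LCA_CASE_WAGE_RATE_UNIT: 'Hour'},   // filter the input documents
    sort: {LCA_CASE_EMPLOYER_STATE: 1},         // sort the input documents
    scope: {hoursPerWeek: 40}                   // extra variables visible to map/reduce/finalize
};

// Only runs inside the mongo shell, where `db` exists
if (typeof db !== 'undefined') {
    db.text2010.mapReduce(m, r, options);
}
```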
21. Current limitations / Watch for
• Single threaded per node (which sucks)
https://jira.mongodb.org/browse/SERVER-463
• Language is restricted to JavaScript (which sucks)
https://jira.mongodb.org/browse/SERVER-699
• Does not use secondaries in replica sets
• From 1.7.3 on, you can reduce into an existing collection
22. ...
• Doesn't allow creation of full documents (which can be a pain for
perm MR collections if using libraries)
https://jira.mongodb.org/browse/SERVER-2517
• Slow: ~20-30x slower than Hadoop as of MongoDB 1.8
https://jira.mongodb.org/browse/SERVER-3055
23. Using MongoDB with Hadoop
• https://github.com/mongodb/mongo-hadoop
• Open source
• Requires knowledge of Java
• Working Input and Output adapters for MongoDB are provided
• Alpha quality from what I can tell
26. > 2.0
• Multi-threaded
• Alternative languages
https://jira.mongodb.org/browse/SERVER-699
• ~2.2 native aggregation framework
• JS-only mode is faster for lighter jobs
https://jira.mongodb.org/browse/SERVER-2976
27. Further reading
• I’ve only touched on the details, but this should be enough to get you
interested / started with MongoDB Map Reduce. Some of the missing
stuff:
• Finalize functions - http://bit.ly/gEfKOr
• Some more examples - http://bit.ly/ig1Yfj