An Introduction to
MapReduce with MongoDB
        Russell Smith
/usr/bin/whoami

•   Russell Smith

•   Consultant for UKD1 Limited

•   I specialise in helping companies going through rapid growth:

•   Code, architecture, infrastructure, devops, sysops, capacity planning, etc

•   <3 Gearman, MongoDB, Neo4j, MySQL, Riak, Kohana, PHP, Debian, Puppet, etc...
What is MongoDB

•   A scalable, high-performance, open source, document-oriented
    database.

•   Stores JSON like documents

•   Indexable on any attribute (like MySQL)

•   Built in MapReduce
Requirements

•   A running MongoDB server
    http://www.mongodb.org/downloads


•   Basic knowledge of MongoDB

•   Basic Javascript
What is Map Reduce

•   Allows aggregating data in parallel

•   Some built-in aggregation functions exist: distinct, count

•   If you need to do something more, either query and aggregate client-side, or use MapReduce
How does it work?
•   You write two functions

•   You write them in Javascript (currently)
•   Map function:
    Called once per document; emits a key and a value

•   Reduce function:
    Called per emitted key, with an array of the values emitted for that key

•   Optional finalize function, for a last pass over the reduced data (rounding, for example)
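
To make the flow above concrete, here is a rough in-memory model of it in plain JavaScript. It is only a sketch: `runMapReduce` is a hypothetical helper, not part of MongoDB, and `emit` is passed to the map function as an argument, whereas MongoDB provides it as a global. It does mirror one documented behaviour: reduce is skipped for keys that have only a single value.

```javascript
// Illustrative model of map -> group by key -> reduce -> finalize.
// runMapReduce is a hypothetical helper; MongoDB does not work exactly like this.
function runMapReduce(docs, map, reduce, finalize) {
  const groups = new Map();

  // Every emit(key, value) call appends the value to that key's bucket.
  const emit = (key, value) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };

  // Map phase: call map once per document, with `this` bound to the document.
  for (const doc of docs) map.call(doc, emit);

  // Reduce phase: collapse each key's array of values into one value.
  const results = [];
  for (const [key, values] of groups) {
    let value = values.length === 1 ? values[0] : reduce(key, values);
    if (finalize) value = finalize(key, value);
    results.push({ _id: key, value: value });
  }
  return results;
}

// Made-up sample documents, for illustration only.
const docs = [
  { state: "TX", workers: 1 },
  { state: "TX", workers: 3 },
  { state: "NY", workers: 2 },
];

const out = runMapReduce(
  docs,
  function (emit) { emit(this.state, this.workers); },
  (key, values) => values.reduce((a, b) => a + b, 0)
);
console.log(out);
```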
Some example data

•   I downloaded the H1B data (US temporary work visas)
    http://www.flcdatacenter.com/CaseH1B.aspx


•   Imported the CSV data using the mongoimport command

•   Total imported documents ~335k
What do the documents look like?
    {
        "_id" : ObjectId("4db7c981e243a6e23725570f"),
        "LCA_CASE_NUMBER" : "I-200-09132-243675",
        "STATUS" : "CERTIFIED",
        "LCA_CASE_SUBMIT" : "7/14/2010 9:06:36",
        "VISA_CLASS" : "H-1B",
        "LCA_CASE_EMPLOYMENT_START_DATE" : "12/15/2010 0:00:00",
        "LCA_CASE_EMPLOYMENT_END_DATE" : "12/15/2013 0:00:00",
        "LCA_CASE_EMPLOYER_NAME" : "BRITISH SCHOOL OF AMERICA, LLC",
        "LCA_CASE_EMPLOYER_ADDRESS" : "4211 WATONGA BLVD.",
        "LCA_CASE_EMPLOYER_CITY" : "HOUSTON",
        "LCA_CASE_EMPLOYER_STATE" : "TX",
        "LCA_CASE_EMPLOYER_POSTAL_CODE" : 77092,
        "LCA_CASE_SOC_CODE" : "25-2022.00",
        "LCA_CASE_SOC_NAME" : "Middle School Teachers, Except Special and Vocatio",
        "LCA_CASE_JOB_TITLE" : "MIDDLE SCHOOL TEACHER/IB COORDINATOR",
        "LCA_CASE_WAGE_RATE_FROM" : 51577.63,
        "LCA_CASE_WAGE_RATE_UNIT" : "Year",
        "FULL_TIME_POS" : "Y",
        "TOTAL_WORKERS" : 1,
        "LCA_CASE_WORKLOC1_CITY" : "HOUSTON",
        "LCA_CASE_WORKLOC1_STATE" : "TX",
        "PW_1" : 47827,
        "PW_UNIT_1" : "Year",
        "PW_SOURCE_1" : "OES",
        "OTHER_WAGE_SOURCE_1" : "OFLC ONLINE DATA CENTER",
        "YR_SOURCE_PUB_1" : 2010,
        "LCA_CASE_NAICS_CODE" : 611110,
        "Decision_Date" : "7/20/2010 0:00:00r"
    }

Fields of interest:

•   LCA_CASE_EMPLOYER_STATE

•   STATUS

•   LCA_CASE_SUBMIT / Decision_Date

•   LCA_CASE_WAGE_RATE_FROM
What we can do with the data?

•   Work out:

•   Applications per state

•   Applications by status per state

•   Average time from submission to decision, by status
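
The deck does not work through the submission-to-decision timing, so the following is only a sketch of the date arithmetic such a map function would need. It assumes the `LCA_CASE_SUBMIT` and `Decision_Date` strings always follow the `M/D/YYYY H:MM:SS` shape seen in the sample document, and it treats them as UTC, which is an assumption made to keep the arithmetic free of daylight-saving effects. `parseLcaDate` and `daysToDecision` are hypothetical helper names.

```javascript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Parse "M/D/YYYY H:MM:SS" into a UTC timestamp in milliseconds.
// Anything after the seconds (such as a stray trailing character left
// over from the CSV import) is ignored. Returns null if the text does not match.
function parseLcaDate(text) {
  const m = /^(\d{1,2})\/(\d{1,2})\/(\d{4}) (\d{1,2}):(\d{2}):(\d{2})/.exec(text);
  if (!m) return null;
  const [, month, day, year, hour, minute, second] = m.map(Number);
  return Date.UTC(year, month - 1, day, hour, minute, second);
}

// Days between submission and decision for one document, or null if
// either date cannot be parsed.
function daysToDecision(doc) {
  const submitted = parseLcaDate(doc.LCA_CASE_SUBMIT);
  const decided = parseLcaDate(doc.Decision_Date);
  if (submitted === null || decided === null) return null;
  return (decided - submitted) / MS_PER_DAY;
}

// The two dates from the sample document.
const sample = {
  LCA_CASE_SUBMIT: "7/14/2010 9:06:36",
  Decision_Date: "7/20/2010 0:00:00r",
};
console.log(daysToDecision(sample).toFixed(2)); // 5.62
```

A map function could emit this number keyed by `STATUS`, leaving the averaging to reduce and finalize.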
Applications by State


•   Key will be LCA_CASE_EMPLOYER_STATE

•   Assume (wrongly) one person per document
Map


•   this is equal to the current document

•   emit a value of 1; as we are assuming a single H1B app per document

    m = function () {
        emit(this.LCA_CASE_EMPLOYER_STATE, 1);
    }
Reduce


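•   Return a value; the length of the array

•   This works as each value in the array is 1

•   Caveat: MongoDB may call reduce more than once for the same key, feeding earlier results back in as values; counting the array length is only safe if each key is reduced once, so summing the values is the more robust form

    r = function (k, v_arr) {
        return v_arr.length;
    }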
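
The "length of the array" shortcut relies on every element being a raw emitted 1. MongoDB's documentation notes that reduce can be invoked more than once for the same key, with earlier reduce results passed back in as values. The sketch below imitates that in plain JavaScript; `reduceInBatches` and the batch size of 3 are hypothetical, chosen only to force a second reduce pass.

```javascript
// Two candidate reduce functions for a map that emits 1 per document.
const countByLength = (key, values) => values.length;
const countBySum = (key, values) => values.reduce((total, v) => total + v, 0);

// Reduce the values in batches, then reduce the partial results again,
// imitating a re-reduce. The batching itself is made up for illustration.
function reduceInBatches(reduce, key, values, batchSize) {
  const partials = [];
  for (let i = 0; i < values.length; i += batchSize) {
    partials.push(reduce(key, values.slice(i, i + batchSize)));
  }
  return reduce(key, partials);
}

const ones = [1, 1, 1, 1, 1, 1]; // six documents emitted for one state

console.log(reduceInBatches(countByLength, "TX", ones, 3)); // 2, which is wrong
console.log(reduceInBatches(countBySum, "TX", ones, 3));    // 6
```

Summing gives the same answer however the values are batched, which is the property a reduce function needs.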
Executing


•   This will execute the map/reduce

•   Output goes to a collection named workers_by_state

    db.text2010.mapReduce(m, r, {out: 'workers_by_state', keeptemp: true, verbose: true})
Result

    { "_id" : "NEW YORK", "value" : 512 }
    { "_id" : "IOWA", "value" : 15 }
    { "_id" : "KANSAS", "value" : 54 }
    ...
A more complex Map!

•   The last example assumed one worker per document... which is wrong.

•   We now emit each document's TOTAL_WORKERS count, keyed by state

    m = function () {
        emit(this.LCA_CASE_EMPLOYER_STATE, this.TOTAL_WORKERS);
    }
Reduce
•   As the array now contains values other than 1, we have to iterate over it

•   This is standard Javascript

    r = function (k, v_arr) {
        var total = 0;
        var len = v_arr.length;

        // Sum every emitted worker count for this state
        for (var i = 0; i < len; i++)
        {
            total = total + v_arr[i];
        }
        return total;
    }
Average wage by VISA class and application status

•   Assumptions:

•   People work ~40 hour weeks

•   Weekly wages are paid every week rather than only the weeks worked

•   'Select Pay Range' seems to be the default option...

    m = function () {
        var k = this.VISA_CLASS + ' ' + this.STATUS;

        // Convert every wage to a yearly figure before emitting it
        switch (this.LCA_CASE_WAGE_RATE_UNIT)
        {
            case 'Year':
                emit(k, this.LCA_CASE_WAGE_RATE_FROM);
                break;

            case 'Month':
                emit(k, this.LCA_CASE_WAGE_RATE_FROM * 12);
                break;

            case 'Bi-Weekly':
                emit(k, this.LCA_CASE_WAGE_RATE_FROM * 26);
                break;

            case 'Week':
                emit(k, this.LCA_CASE_WAGE_RATE_FROM * 52);
                break;

            case 'Hour':
                emit(k, this.LCA_CASE_WAGE_RATE_FROM * 40 * 52);
                break;

            default:
                emit(k, 0);
        }
    }
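
The unit conversion in the map function can also be pulled out into a small lookup table, which makes the assumptions easier to see and test outside MongoDB. This is only a sketch: `PER_YEAR` and `annualWage` are hypothetical names, and unlike the slide's `default` branch (which emits 0), it returns `null` for unknown units such as 'Select Pay Range' so that a caller could skip them rather than pull the average down.

```javascript
// Multipliers that convert a wage in the given unit to a yearly figure,
// using the slide's assumptions (40-hour weeks, paid all 52 weeks).
const PER_YEAR = {
  Year: 1,
  Month: 12,
  "Bi-Weekly": 26,
  Week: 52,
  Hour: 40 * 52,
};

// Returns the yearly wage, or null when the unit is not recognised.
function annualWage(rate, unit) {
  if (!Object.prototype.hasOwnProperty.call(PER_YEAR, unit)) return null;
  return rate * PER_YEAR[unit];
}

console.log(annualWage(51577.63, "Year"));      // 51577.63
console.log(annualWage(25, "Hour"));            // 52000
console.log(annualWage(1, "Select Pay Range")); // null
```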
Reduce
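•   Work out the average for each key

•   Add each of the elements up

•   Average them

•   Caveat: averaging inside reduce is only correct if each key is reduced exactly once; if MongoDB re-reduces partial results, the output becomes an average of averages

    r = function (k, v_arr) {
        var tot = 0;
        var len = v_arr.length;

        for (var i = 0; i < len; i++)
        {
            tot += v_arr[i];
        }

        return tot / len;
    }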
Finalize

•   A finalize function may be run after reduction.

•   Called once per output document (that is, once per key)

•   The finalize function takes a key and a value, and returns a finalized
    value.
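
One common use of finalize is computing an average safely. The sketch below assumes the map function emits `{sum, count}` pairs instead of bare wages (unit conversion is left out for brevity); reduce then only adds pairs together, and the division happens once in finalize. The function names `mapWage`, `reduceWage` and `finalizeWage` are hypothetical, and `emit` is the global that MongoDB supplies, so `mapWage` is defined here but not called.

```javascript
// Map: emit a {sum, count} pair per document (emit is provided by MongoDB).
var mapWage = function () {
  emit(this.VISA_CLASS + " " + this.STATUS,
       { sum: this.LCA_CASE_WAGE_RATE_FROM, count: 1 });
};

// Reduce: add up the pairs. The output has the same shape as the input,
// so the result is unchanged if partial results are reduced again.
var reduceWage = function (key, values) {
  var out = { sum: 0, count: 0 };
  for (var i = 0; i < values.length; i++) {
    out.sum += values[i].sum;
    out.count += values[i].count;
  }
  return out;
};

// Finalize: turn the accumulated pair into the average, once per key.
var finalizeWage = function (key, value) {
  return value.count === 0 ? 0 : value.sum / value.count;
};

// Made-up values for one key, reduced in a single pass and in two passes.
var values = [
  { sum: 100, count: 1 },
  { sum: 200, count: 1 },
  { sum: 600, count: 1 },
];
var onePass = reduceWage("k", values);
var twoPass = reduceWage("k", [reduceWage("k", values.slice(0, 2)), values[2]]);

console.log(finalizeWage("k", onePass)); // 300
console.log(finalizeWage("k", twoPass)); // 300
```

Averaging the same values in two passes without the pair would give (150 + 600) / 2 = 375 instead of 300.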
Options

•   Persist the output

•   Filtering input documents

•   Sorting input documents

•   Javascript scope - allows you to pass in extra variables (cannot be
    changed at runtime?)
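
As a rough illustration of how these options fit together, the object below uses the `out`, `query`, `sort` and `scope` option names from MongoDB's mapReduce; the values (the `CERTIFIED` filter, the sort key, and the `minWorkers` variable) are made up for the example. Variables passed in `scope` are visible inside map, reduce and finalize as if they were globals.

```javascript
// Option names are MongoDB's; every value here is a hypothetical example.
var options = {
  out: "workers_by_state",              // persist the output in this collection
  query: { STATUS: "CERTIFIED" },       // only matching documents are passed to map
  sort: { LCA_CASE_EMPLOYER_STATE: 1 }, // order the input documents before mapping
  scope: { minWorkers: 1 }              // extra variables visible to the functions
};

// The map function reads the scoped variable as if it were a global.
var m = function () {
  if (this.TOTAL_WORKERS >= minWorkers) {
    emit(this.LCA_CASE_EMPLOYER_STATE, this.TOTAL_WORKERS);
  }
};

// In the mongo shell this would be run as:
//   db.text2010.mapReduce(m, r, options);
console.log(Object.keys(options).join(", "));
```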
Current limitations / Watch for

•   Single threaded per node (which sucks)
    https://jira.mongodb.org/browse/SERVER-463


•   Language is restricted to Javascript (which sucks)
    https://jira.mongodb.org/browse/SERVER-699


•   Does not use secondaries in replica sets

•   From 1.7.3 on, you can reduce into an existing collection


•   Doesn't allow creation of full documents (which can be a pain for
    perm MR collections if using libraries)
    https://jira.mongodb.org/browse/SERVER-2517


•   Slow; roughly 20-30x slower than Hadoop as of 1.8
    https://jira.mongodb.org/browse/SERVER-3055
Using MongoDB with Hadoop

•   https://github.com/mongodb/mongo-hadoop

•   Open source

•   Requires knowledge of Java

•   Working Input and Output adapters for MongoDB are provided

•   Alpha quality from what I can tell
The future
1.9 / 2.0

•   V8 is replacing SpiderMonkey

•   Recent Hadoop provider

•   Sharded output collections

•   Improved yielding (concurrency)
> 2.0

•   Multi-threaded

•   Alternative languages
    https://jira.mongodb.org/browse/SERVER-699


•   ~2.2 native aggregation framework

•   JS-only mode is faster for lighter jobs
    https://jira.mongodb.org/browse/SERVER-2976
Further reading
•   I’ve only touched on the details, but this should be enough to get you
    interested in, and started with, MongoDB MapReduce. Some of the things
    I skipped:

•   Finalize functions - http://bit.ly/gEfKOr

•   Some more examples - http://bit.ly/ig1Yfj

Contenu connexe

Tendances

Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan EwenAdvanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewenconfluent
 
Data platform architecture
Data platform architectureData platform architecture
Data platform architectureSudheer Kondla
 
Modernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureModernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureDatabricks
 
Non relational databases-no sql
Non relational databases-no sqlNon relational databases-no sql
Non relational databases-no sqlRam kumar
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
Streaming data for real time analysis
Streaming data for real time analysisStreaming data for real time analysis
Streaming data for real time analysisAmazon Web Services
 
Microsoft Azure Databricks
Microsoft Azure DatabricksMicrosoft Azure Databricks
Microsoft Azure DatabricksSascha Dittmann
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...Databricks
 
Demystifying data engineering
Demystifying data engineeringDemystifying data engineering
Demystifying data engineeringThang Bui (Bob)
 
Considerations for Data Access in the Lakehouse
Considerations for Data Access in the LakehouseConsiderations for Data Access in the Lakehouse
Considerations for Data Access in the LakehouseDatabricks
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architectureAdam Doyle
 
Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data EngineeringDurga Gadiraju
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data EngineeringC4Media
 
Relational and non relational database 7
Relational and non relational database 7Relational and non relational database 7
Relational and non relational database 7abdulrahmanhelan
 
Batch Processing vs Stream Processing Difference
Batch Processing vs Stream Processing DifferenceBatch Processing vs Stream Processing Difference
Batch Processing vs Stream Processing Differencejeetendra mandal
 

Tendances (20)

Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan EwenAdvanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
 
How to build a successful Data Lake
How to build a successful Data LakeHow to build a successful Data Lake
How to build a successful Data Lake
 
Data platform architecture
Data platform architectureData platform architecture
Data platform architecture
 
Modernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureModernizing to a Cloud Data Architecture
Modernizing to a Cloud Data Architecture
 
Non relational databases-no sql
Non relational databases-no sqlNon relational databases-no sql
Non relational databases-no sql
 
Key-Value NoSQL Database
Key-Value NoSQL DatabaseKey-Value NoSQL Database
Key-Value NoSQL Database
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Streaming data for real time analysis
Streaming data for real time analysisStreaming data for real time analysis
Streaming data for real time analysis
 
Microsoft Azure Databricks
Microsoft Azure DatabricksMicrosoft Azure Databricks
Microsoft Azure Databricks
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
 
Demystifying data engineering
Demystifying data engineeringDemystifying data engineering
Demystifying data engineering
 
Introduction to ETL and Data Integration
Introduction to ETL and Data IntegrationIntroduction to ETL and Data Integration
Introduction to ETL and Data Integration
 
Considerations for Data Access in the Lakehouse
Considerations for Data Access in the LakehouseConsiderations for Data Access in the Lakehouse
Considerations for Data Access in the Lakehouse
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architecture
 
Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data Engineering
 
Integrating Apache Spark and NiFi for Data Lakes
Integrating Apache Spark and NiFi for Data LakesIntegrating Apache Spark and NiFi for Data Lakes
Integrating Apache Spark and NiFi for Data Lakes
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
 
Relational and non relational database 7
Relational and non relational database 7Relational and non relational database 7
Relational and non relational database 7
 
Batch Processing vs Stream Processing Difference
Batch Processing vs Stream Processing DifferenceBatch Processing vs Stream Processing Difference
Batch Processing vs Stream Processing Difference
 
Real time analytics
Real time analyticsReal time analytics
Real time analytics
 

En vedette

Introduction to Map-Reduce
Introduction to Map-ReduceIntroduction to Map-Reduce
Introduction to Map-ReduceBrendan Tierney
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reducerantav
 
MongoDB: Replication,Sharding,MapReduce
MongoDB: Replication,Sharding,MapReduceMongoDB: Replication,Sharding,MapReduce
MongoDB: Replication,Sharding,MapReduceTakahiro Inoue
 
Aggregation Framework
Aggregation FrameworkAggregation Framework
Aggregation FrameworkMongoDB
 
Webinar: Data Processing and Aggregation Options
Webinar: Data Processing and Aggregation OptionsWebinar: Data Processing and Aggregation Options
Webinar: Data Processing and Aggregation OptionsMongoDB
 
Introduction to MongoDB and Workshop
Introduction to MongoDB and WorkshopIntroduction to MongoDB and Workshop
Introduction to MongoDB and WorkshopAhmedabadJavaMeetup
 
MongoDB Aggregation Framework
MongoDB Aggregation FrameworkMongoDB Aggregation Framework
MongoDB Aggregation FrameworkTyler Brock
 
Introduction to MongoDB with PHP
Introduction to MongoDB with PHPIntroduction to MongoDB with PHP
Introduction to MongoDB with PHPfwso
 
Introduction to map reduce
Introduction to map reduceIntroduction to map reduce
Introduction to map reduceM Baddar
 
MapReduce Scheduling Algorithms
MapReduce Scheduling AlgorithmsMapReduce Scheduling Algorithms
MapReduce Scheduling AlgorithmsLeila panahi
 
How to leverage MongoDB for Big Data Analysis and Operations with MongoDB's A...
How to leverage MongoDB for Big Data Analysis and Operations with MongoDB's A...How to leverage MongoDB for Big Data Analysis and Operations with MongoDB's A...
How to leverage MongoDB for Big Data Analysis and Operations with MongoDB's A...Gianfranco Palumbo
 
Dan Sullivan - Data Analytics and Text Mining with MongoDB - NoSQL matters Du...
Dan Sullivan - Data Analytics and Text Mining with MongoDB - NoSQL matters Du...Dan Sullivan - Data Analytics and Text Mining with MongoDB - NoSQL matters Du...
Dan Sullivan - Data Analytics and Text Mining with MongoDB - NoSQL matters Du...NoSQLmatters
 
Data Processing and Aggregation with MongoDB
Data Processing and Aggregation with MongoDB Data Processing and Aggregation with MongoDB
Data Processing and Aggregation with MongoDB MongoDB
 
Justin J. Dunne Resume
Justin J. Dunne ResumeJustin J. Dunne Resume
Justin J. Dunne ResumeJustin Dunne
 
shared-ownership-21_FINAL
shared-ownership-21_FINALshared-ownership-21_FINAL
shared-ownership-21_FINALChristoph Sinn
 
apprenticeship-levy-summary-5may2016 (1)
apprenticeship-levy-summary-5may2016 (1)apprenticeship-levy-summary-5may2016 (1)
apprenticeship-levy-summary-5may2016 (1)David Ritchie
 
Introduction to map reduce
Introduction to map reduceIntroduction to map reduce
Introduction to map reduceBhupesh Chawda
 

En vedette (20)

An Introduction To Map-Reduce
An Introduction To Map-ReduceAn Introduction To Map-Reduce
An Introduction To Map-Reduce
 
Introduction to Map-Reduce
Introduction to Map-ReduceIntroduction to Map-Reduce
Introduction to Map-Reduce
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reduce
 
MongoDB: Replication,Sharding,MapReduce
MongoDB: Replication,Sharding,MapReduceMongoDB: Replication,Sharding,MapReduce
MongoDB: Replication,Sharding,MapReduce
 
Aggregation Framework
Aggregation FrameworkAggregation Framework
Aggregation Framework
 
Webinar: Data Processing and Aggregation Options
Webinar: Data Processing and Aggregation OptionsWebinar: Data Processing and Aggregation Options
Webinar: Data Processing and Aggregation Options
 
Introduction to MongoDB and Workshop
Introduction to MongoDB and WorkshopIntroduction to MongoDB and Workshop
Introduction to MongoDB and Workshop
 
MongoDB - Ekino PHP
MongoDB - Ekino PHPMongoDB - Ekino PHP
MongoDB - Ekino PHP
 
MongoDB Aggregation Framework
MongoDB Aggregation FrameworkMongoDB Aggregation Framework
MongoDB Aggregation Framework
 
MongoDB
MongoDBMongoDB
MongoDB
 
Introduction to MongoDB with PHP
Introduction to MongoDB with PHPIntroduction to MongoDB with PHP
Introduction to MongoDB with PHP
 
Introduction to map reduce
Introduction to map reduceIntroduction to map reduce
Introduction to map reduce
 
MapReduce Scheduling Algorithms
MapReduce Scheduling AlgorithmsMapReduce Scheduling Algorithms
MapReduce Scheduling Algorithms
 
How to leverage MongoDB for Big Data Analysis and Operations with MongoDB's A...
How to leverage MongoDB for Big Data Analysis and Operations with MongoDB's A...How to leverage MongoDB for Big Data Analysis and Operations with MongoDB's A...
How to leverage MongoDB for Big Data Analysis and Operations with MongoDB's A...
 
Dan Sullivan - Data Analytics and Text Mining with MongoDB - NoSQL matters Du...
Dan Sullivan - Data Analytics and Text Mining with MongoDB - NoSQL matters Du...Dan Sullivan - Data Analytics and Text Mining with MongoDB - NoSQL matters Du...
Dan Sullivan - Data Analytics and Text Mining with MongoDB - NoSQL matters Du...
 
Data Processing and Aggregation with MongoDB
Data Processing and Aggregation with MongoDB Data Processing and Aggregation with MongoDB
Data Processing and Aggregation with MongoDB
 
Justin J. Dunne Resume
Justin J. Dunne ResumeJustin J. Dunne Resume
Justin J. Dunne Resume
 
shared-ownership-21_FINAL
shared-ownership-21_FINALshared-ownership-21_FINAL
shared-ownership-21_FINAL
 
apprenticeship-levy-summary-5may2016 (1)
apprenticeship-levy-summary-5may2016 (1)apprenticeship-levy-summary-5may2016 (1)
apprenticeship-levy-summary-5may2016 (1)
 
Introduction to map reduce
Introduction to map reduceIntroduction to map reduce
Introduction to map reduce
 

Similaire à An Introduction to Map/Reduce with MongoDB

GraphQL, Redux, and React
GraphQL, Redux, and ReactGraphQL, Redux, and React
GraphQL, Redux, and ReactKeon Kim
 
Beyond Map/Reduce: Getting Creative With Parallel Processing
Beyond Map/Reduce: Getting Creative With Parallel ProcessingBeyond Map/Reduce: Getting Creative With Parallel Processing
Beyond Map/Reduce: Getting Creative With Parallel ProcessingEd Kohlwey
 
The Evolution of Streaming Expressions - Joel Bernstein, Alfresco & Dennis Go...
The Evolution of Streaming Expressions - Joel Bernstein, Alfresco & Dennis Go...The Evolution of Streaming Expressions - Joel Bernstein, Alfresco & Dennis Go...
The Evolution of Streaming Expressions - Joel Bernstein, Alfresco & Dennis Go...Lucidworks
 
CouchDB at JAOO Århus 2009
CouchDB at JAOO Århus 2009CouchDB at JAOO Århus 2009
CouchDB at JAOO Århus 2009Jason Davies
 
Joins and Other MongoDB 3.2 Aggregation Enhancements
Joins and Other MongoDB 3.2 Aggregation EnhancementsJoins and Other MongoDB 3.2 Aggregation Enhancements
Joins and Other MongoDB 3.2 Aggregation EnhancementsAndrew Morgan
 
Practical Ruby Projects with MongoDB - Ruby Kaigi 2010
Practical Ruby Projects with MongoDB - Ruby Kaigi 2010Practical Ruby Projects with MongoDB - Ruby Kaigi 2010
Practical Ruby Projects with MongoDB - Ruby Kaigi 2010Alex Sharp
 
"An introduction to object-oriented programming for those who have never done...
"An introduction to object-oriented programming for those who have never done..."An introduction to object-oriented programming for those who have never done...
"An introduction to object-oriented programming for those who have never done...Fwdays
 
JavaScript Fundamentals & JQuery
JavaScript Fundamentals & JQueryJavaScript Fundamentals & JQuery
JavaScript Fundamentals & JQueryJamshid Hashimi
 
Practical AngularJS
Practical AngularJSPractical AngularJS
Practical AngularJSWei Ru
 
kissy-past-now-future
kissy-past-now-futurekissy-past-now-future
kissy-past-now-futureyiming he
 
KISSY 的昨天、今天与明天
KISSY 的昨天、今天与明天KISSY 的昨天、今天与明天
KISSY 的昨天、今天与明天tblanlan
 
Altitude SF 2017: Fastly GSLB: Scaling your microservice and multi-cloud envi...
Altitude SF 2017: Fastly GSLB: Scaling your microservice and multi-cloud envi...Altitude SF 2017: Fastly GSLB: Scaling your microservice and multi-cloud envi...
Altitude SF 2017: Fastly GSLB: Scaling your microservice and multi-cloud envi...Fastly
 
Everything That Is Really Useful in Oracle Database 12c for Application Devel...
Everything That Is Really Useful in Oracle Database 12c for Application Devel...Everything That Is Really Useful in Oracle Database 12c for Application Devel...
Everything That Is Really Useful in Oracle Database 12c for Application Devel...Lucas Jellema
 
前后端mvc经验 - webrebuild 2011 session
前后端mvc经验 - webrebuild 2011 session前后端mvc经验 - webrebuild 2011 session
前后端mvc经验 - webrebuild 2011 sessionRANK LIU
 
Converting a Rails application to Node.js
Converting a Rails application to Node.jsConverting a Rails application to Node.js
Converting a Rails application to Node.jsMatt Sergeant
 
Using Multiple Persistence Layers in Spark to Build a Scalable Prediction Eng...
Using Multiple Persistence Layers in Spark to Build a Scalable Prediction Eng...Using Multiple Persistence Layers in Spark to Build a Scalable Prediction Eng...
Using Multiple Persistence Layers in Spark to Build a Scalable Prediction Eng...StampedeCon
 
Building your first Java Application with MongoDB
Building your first Java Application with MongoDBBuilding your first Java Application with MongoDB
Building your first Java Application with MongoDBMongoDB
 
JavaScript- Functions and arrays.pptx
JavaScript- Functions and arrays.pptxJavaScript- Functions and arrays.pptx
JavaScript- Functions and arrays.pptxMegha V
 
ITB2019 10 in 50: Ten Coldbox Modules You Should be Using in Every App - Jon ...
ITB2019 10 in 50: Ten Coldbox Modules You Should be Using in Every App - Jon ...ITB2019 10 in 50: Ten Coldbox Modules You Should be Using in Every App - Jon ...
ITB2019 10 in 50: Ten Coldbox Modules You Should be Using in Every App - Jon ...Ortus Solutions, Corp
 
Programming the Physical World with Device Shadows and Rules Engine
Programming the Physical World with Device Shadows and Rules EngineProgramming the Physical World with Device Shadows and Rules Engine
Programming the Physical World with Device Shadows and Rules EngineAmazon Web Services
 

Similaire à An Introduction to Map/Reduce with MongoDB (20)

GraphQL, Redux, and React
GraphQL, Redux, and ReactGraphQL, Redux, and React
GraphQL, Redux, and React
 
Beyond Map/Reduce: Getting Creative With Parallel Processing
Beyond Map/Reduce: Getting Creative With Parallel ProcessingBeyond Map/Reduce: Getting Creative With Parallel Processing
Beyond Map/Reduce: Getting Creative With Parallel Processing
 
The Evolution of Streaming Expressions - Joel Bernstein, Alfresco & Dennis Go...
The Evolution of Streaming Expressions - Joel Bernstein, Alfresco & Dennis Go...The Evolution of Streaming Expressions - Joel Bernstein, Alfresco & Dennis Go...
The Evolution of Streaming Expressions - Joel Bernstein, Alfresco & Dennis Go...
 
CouchDB at JAOO Århus 2009
CouchDB at JAOO Århus 2009CouchDB at JAOO Århus 2009
CouchDB at JAOO Århus 2009
 
Joins and Other MongoDB 3.2 Aggregation Enhancements
Joins and Other MongoDB 3.2 Aggregation EnhancementsJoins and Other MongoDB 3.2 Aggregation Enhancements
Joins and Other MongoDB 3.2 Aggregation Enhancements
 
Practical Ruby Projects with MongoDB - Ruby Kaigi 2010
Practical Ruby Projects with MongoDB - Ruby Kaigi 2010Practical Ruby Projects with MongoDB - Ruby Kaigi 2010
Practical Ruby Projects with MongoDB - Ruby Kaigi 2010
 
"An introduction to object-oriented programming for those who have never done...
"An introduction to object-oriented programming for those who have never done..."An introduction to object-oriented programming for those who have never done...
"An introduction to object-oriented programming for those who have never done...
 
JavaScript Fundamentals & JQuery
JavaScript Fundamentals & JQueryJavaScript Fundamentals & JQuery
JavaScript Fundamentals & JQuery
 
Practical AngularJS
Practical AngularJSPractical AngularJS
Practical AngularJS
 
kissy-past-now-future
kissy-past-now-futurekissy-past-now-future
kissy-past-now-future
 
KISSY 的昨天、今天与明天
KISSY 的昨天、今天与明天KISSY 的昨天、今天与明天
KISSY 的昨天、今天与明天
 
Altitude SF 2017: Fastly GSLB: Scaling your microservice and multi-cloud envi...
Altitude SF 2017: Fastly GSLB: Scaling your microservice and multi-cloud envi...Altitude SF 2017: Fastly GSLB: Scaling your microservice and multi-cloud envi...
Altitude SF 2017: Fastly GSLB: Scaling your microservice and multi-cloud envi...
 
Everything That Is Really Useful in Oracle Database 12c for Application Devel...
Everything That Is Really Useful in Oracle Database 12c for Application Devel...Everything That Is Really Useful in Oracle Database 12c for Application Devel...
Everything That Is Really Useful in Oracle Database 12c for Application Devel...
 
前后端mvc经验 - webrebuild 2011 session
前后端mvc经验 - webrebuild 2011 session前后端mvc经验 - webrebuild 2011 session
前后端mvc经验 - webrebuild 2011 session
 
Converting a Rails application to Node.js
Converting a Rails application to Node.jsConverting a Rails application to Node.js
Converting a Rails application to Node.js
 
Using Multiple Persistence Layers in Spark to Build a Scalable Prediction Eng...
Using Multiple Persistence Layers in Spark to Build a Scalable Prediction Eng...Using Multiple Persistence Layers in Spark to Build a Scalable Prediction Eng...
Using Multiple Persistence Layers in Spark to Build a Scalable Prediction Eng...
 
Building your first Java Application with MongoDB
Building your first Java Application with MongoDBBuilding your first Java Application with MongoDB
Building your first Java Application with MongoDB
 
JavaScript- Functions and arrays.pptx
JavaScript- Functions and arrays.pptxJavaScript- Functions and arrays.pptx
JavaScript- Functions and arrays.pptx
 
ITB2019 10 in 50: Ten Coldbox Modules You Should be Using in Every App - Jon ...
ITB2019 10 in 50: Ten Coldbox Modules You Should be Using in Every App - Jon ...ITB2019 10 in 50: Ten Coldbox Modules You Should be Using in Every App - Jon ...
ITB2019 10 in 50: Ten Coldbox Modules You Should be Using in Every App - Jon ...
 
Programming the Physical World with Device Shadows and Rules Engine
Programming the Physical World with Device Shadows and Rules EngineProgramming the Physical World with Device Shadows and Rules Engine
Programming the Physical World with Device Shadows and Rules Engine
 


An Introduction to Map/Reduce with MongoDB

  • 1. An Introduction to MapReduce with MongoDB Russell Smith
  • 2. /usr/bin/whoami • Russell Smith • Consultant for UKD1 Limited • I specialise in helping companies going through rapid growth; • Code, architecture, infrastructure, devops, sysops, capacity planning, etc • <3 Gearman, MongoDB, Neo4j, MySQL, Riak, Kohana, PHP, Debian, Puppet, etc...
  • 3. What is MongoDB • A scalable, high-performance, open source, document-oriented database. • Stores JSON-like documents • Indexable on any attribute (like MySQL) • Built-in MapReduce
  • 4. Requirements • A running MongoDB server http://www.mongodb.org/downloads • Basic knowledge of MongoDB • Basic JavaScript
  • 5. What is Map Reduce • Allows aggregating data in parallel • Some built-in aggregation functions exist: distinct, count • If you need to do something more, either query or MapReduce
  • 6. How does it work? • You write two functions • You write them in JavaScript (currently) • Map function: Called once per document - emits a key + a value • Reduce function: Called once per key emitted, with an array of values • Optional finalize function, allowing e.g. rounding of the reduced data
  • 7. Some example data • I downloaded the H1B (US temporary work visa) data http://www.flcdatacenter.com/CaseH1B.aspx • Imported the CSV data using the mongoimport command • Total imported documents ~335k
  • 8. What do the documents look like? • LCA_CASE_EMPLOYER_STATE • STATUS • LCA_CASE_SUBMIT / Decision_Date • LCA_CASE_WAGE_RATE_FROM
      {
        "_id" : ObjectId("4db7c981e243a6e23725570f"),
        "LCA_CASE_NUMBER" : "I-200-09132-243675",
        "STATUS" : "CERTIFIED",
        "LCA_CASE_SUBMIT" : "7/14/2010 9:06:36",
        "VISA_CLASS" : "H-1B",
        "LCA_CASE_EMPLOYMENT_START_DATE" : "12/15/2010 0:00:00",
        "LCA_CASE_EMPLOYMENT_END_DATE" : "12/15/2013 0:00:00",
        "LCA_CASE_EMPLOYER_NAME" : "BRITISH SCHOOL OF AMERICA, LLC",
        "LCA_CASE_EMPLOYER_ADDRESS" : "4211 WATONGA BLVD.",
        "LCA_CASE_EMPLOYER_CITY" : "HOUSTON",
        "LCA_CASE_EMPLOYER_STATE" : "TX",
        "LCA_CASE_EMPLOYER_POSTAL_CODE" : 77092,
        "LCA_CASE_SOC_CODE" : "25-2022.00",
        "LCA_CASE_SOC_NAME" : "Middle School Teachers, Except Special and Vocatio",
        "LCA_CASE_JOB_TITLE" : "MIDDLE SCHOOL TEACHER/IB COORDINATOR",
        "LCA_CASE_WAGE_RATE_FROM" : 51577.63,
        "LCA_CASE_WAGE_RATE_UNIT" : "Year",
        "FULL_TIME_POS" : "Y",
        "TOTAL_WORKERS" : 1,
        "LCA_CASE_WORKLOC1_CITY" : "HOUSTON",
        "LCA_CASE_WORKLOC1_STATE" : "TX",
        "PW_1" : 47827,
        "PW_UNIT_1" : "Year",
        "PW_SOURCE_1" : "OES",
        "OTHER_WAGE_SOURCE_1" : "OFLC ONLINE DATA CENTER",
        "YR_SOURCE_PUB_1" : 2010,
        "LCA_CASE_NAICS_CODE" : 611110,
        "Decision_Date" : "7/20/2010 0:00:00r"
      }
  • 9. What can we do with the data? • Work out the: • Applications per state • Applications by status per state • Average time from submission to decision, by status
  • 10. Applications by State • Key will be LCA_CASE_EMPLOYER_STATE • Assume (wrongly) one person per document
  • 11. Map • 'this' is equal to the current document • emit a value of 1, as we are assuming a single H1B app per document
      m = function () {
        emit(this.LCA_CASE_EMPLOYER_STATE, 1);
      }
  • 12. Reduce • Return a value; the length of the array • This works as each value in the array is 1
      r = function (k, v_arr) {
        return v_arr.length;
      }
  • 13. Executing • This will execute the map/reduce • Output goes to a collection named workers_by_state
      db.text2010.mapReduce(m, r, {out: 'workers_by_state', keeptemp: true, verbose: true})
  • 15. A more complex Map! • The last example assumed one worker per document...which is wrong. • We now emit each document's TOTAL_WORKERS, keyed by state
      m = function () {
        emit(this.LCA_CASE_EMPLOYER_STATE, this.TOTAL_WORKERS);
      }
  • 16. Reduce • As the array now contains values other than 1, we have to iterate over it • This is standard JavaScript
      r = function (k, v_arr) {
        var total = 0;
        var len = v_arr.length;
        for (var i = 0; i < len; i++) {
          total = total + v_arr[i];
        }
        return total;
      }
  • 17. VISA Class by Application Status by Average wage • Assumptions: • People work ~40 hour weeks • Weekly wages are paid every week rather than only the weeks worked • 'Select Pay Range' seems to be the default option...
      m = function () {
        var k = this.VISA_CLASS + ' ' + this.STATUS;
        switch (this.LCA_CASE_WAGE_RATE_UNIT) {
          case 'Year':
            emit(k, this.LCA_CASE_WAGE_RATE_FROM);
            break;
          case 'Month':
            emit(k, this.LCA_CASE_WAGE_RATE_FROM * 12);
            break;
          case 'Bi-Weekly':
            emit(k, this.LCA_CASE_WAGE_RATE_FROM * 26);
            break;
          case 'Week':
            emit(k, this.LCA_CASE_WAGE_RATE_FROM * 52);
            break;
          case 'Hour':
            emit(k, this.LCA_CASE_WAGE_RATE_FROM * 40 * 52);
            break;
          default:
            emit(k, 0);
        }
      }
  • 18. Reduce • Work out the average for each key • Add each of the elements up • Average them
      r = function (k, v_arr) {
        var tot = 0;
        var len = v_arr.length;
        for (var i = 0; i < len; i++) {
          tot += v_arr[i];
        }
        return tot / len;
      }
  • 19. Finalize • A finalize function may be run after reduction. • Called a single time per object • The finalize function takes a key and a value, and returns a finalized value.
  • 20. Options • Persist the output • Filtering input documents • Sorting input documents • JavaScript scope - allows you to pass in extra variables (cannot be changed at runtime?)
  • 21. Current limitations / Watch for • Single threaded per node (which sucks) https://jira.mongodb.org/browse/SERVER-463 • Language is restricted to JavaScript (which sucks) https://jira.mongodb.org/browse/SERVER-699 • Does not use secondaries in replica sets • From 1.7.3 on, you can reduce into an existing collection
  • 22. ... • Doesn't allow creation of full documents (which can be a pain for permanent MR collections if using libraries) https://jira.mongodb.org/browse/SERVER-2517 • Slow; roughly 20-30x slower than Hadoop with 1.8 https://jira.mongodb.org/browse/SERVER-3055
  • 23. Using MongoDB with Hadoop • https://github.com/mongodb/mongo-hadoop • Open source • Requires knowledge of Java • Working Input and Output adapters for MongoDB are provided • Alpha quality from what I can tell
  • 25. 1.9 / 2.0 • V8 is replacing SpiderMonkey • Recent Hadoop provider • Sharded output collections • Improved yielding (concurrency)
  • 26. > 2.0 • Multi-threaded • Alternative languages https://jira.mongodb.org/browse/SERVER-699 • ~2.2 native aggregation framework • JS-only mode is faster for lighter jobs https://jira.mongodb.org/browse/SERVER-2976
  • 27. Further reading • I’ve only touched on the details, but this should be enough to get you interested / started with MongoDB Map Reduce. Some of the missing stuff: • Finalize functions - http://bit.ly/gEfKOr • Some more examples - http://bit.ly/ig1Yfj
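
The flow described on slide 6 (map once per document, group by key, reduce per key, optional finalize) can be sketched in plain JavaScript without a server. This is only an illustration: runMapReduce and the three sample documents are hypothetical, not part of MongoDB, and a real server may call reduce several times for one key or skip it for a key with a single value.

```javascript
// Plain-JavaScript stand-in for the map -> group -> reduce -> finalize flow.
// runMapReduce and sampleDocs are hypothetical; this is not MongoDB's API.
function runMapReduce(docs, map, reduce, finalize) {
  const groups = new Map();
  // MongoDB supplies emit() to the map function; here we define it ourselves.
  globalThis.emit = function (key, value) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };
  // Map phase: called once per document, with `this` bound to the document.
  for (const doc of docs) map.call(doc);
  delete globalThis.emit;

  // Reduce phase: called once per key with the array of emitted values.
  const results = {};
  for (const [key, values] of groups) {
    const reduced = reduce(key, values);
    results[key] = finalize ? finalize(key, reduced) : reduced;
  }
  return results;
}

const sampleDocs = [
  { LCA_CASE_EMPLOYER_STATE: "TX", TOTAL_WORKERS: 1 },
  { LCA_CASE_EMPLOYER_STATE: "TX", TOTAL_WORKERS: 3 },
  { LCA_CASE_EMPLOYER_STATE: "CA", TOTAL_WORKERS: 2 },
];

// Same shape as the map and reduce functions on slides 15 and 16.
const m = function () {
  emit(this.LCA_CASE_EMPLOYER_STATE, this.TOTAL_WORKERS);
};

const r = function (k, v_arr) {
  let total = 0;
  for (let i = 0; i < v_arr.length; i++) total += v_arr[i];
  return total;
};

const workersByState = runMapReduce(sampleDocs, m, r);
console.log(workersByState); // { TX: 4, CA: 2 }
```

Running the functions locally like this is a cheap way to check them before pointing them at ~335k documents.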
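
Slides 12 and 16 show two ways to total the values for a key. MongoDB may invoke reduce more than once for the same key, feeding earlier reduce outputs back in as values, so the two behave differently on large keys. The sketch below imitates that by hand; the batch contents and the helper reduceInBatches are made up for illustration.

```javascript
// Two candidate reduce functions for counting per state.
const countByLength = function (k, v_arr) {
  return v_arr.length; // only correct if every value is a raw 1 from map
};

const sumValues = function (k, v_arr) {
  let total = 0;
  for (let i = 0; i < v_arr.length; i++) total += v_arr[i];
  return total;
};

// Hypothetical emitted values for one key, reduced in two batches and then
// re-reduced, which is how a server may process a key with many values.
const batchA = [1, 1, 1];
const batchB = [1, 1];

function reduceInBatches(reduce) {
  const partials = [reduce("TX", batchA), reduce("TX", batchB)];
  return reduce("TX", partials);
}

const oneShot = sumValues("TX", batchA.concat(batchB));
const summedInBatches = reduceInBatches(sumValues);
const countedInBatches = reduceInBatches(countByLength);

console.log(oneShot);          // 5
console.log(summedInBatches);  // 5
console.log(countedInBatches); // 2
```

Summing gives the same answer however the values are batched, which is why it is the safer default even when every emitted value is 1.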
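
The switch on slide 17 converts every wage to a yearly figure. One possible way to make that conversion easier to test is a small lookup table; the helper name annualWage and the sample rates are hypothetical, while the multipliers are the ones from the slide's assumptions.

```javascript
// Multipliers taken from the assumptions on slide 17 (40-hour weeks, 52 paid weeks).
const PERIODS_PER_YEAR = new Map([
  ["Year", 1],
  ["Month", 12],
  ["Bi-Weekly", 26],
  ["Week", 52],
  ["Hour", 40 * 52],
]);

// annualWage is a hypothetical helper; it returns 0 for unknown units such as
// 'Select Pay Range', matching the default branch of the original switch.
function annualWage(rate, unit) {
  const multiplier = PERIODS_PER_YEAR.get(unit);
  return multiplier === undefined ? 0 : rate * multiplier;
}

console.log(annualWage(25, "Hour"));              // 52000
console.log(annualWage(2000, "Bi-Weekly"));       // 52000
console.log(annualWage(100, "Select Pay Range")); // 0
```

Note that a 0 emitted for an unknown unit still counts as a value, so it pulls the average for that key down; skipping the emit for such documents may be preferable.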
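
The average on slide 18 divides inside reduce, which only holds if reduce sees all values for a key in a single call. A common alternative, sketched here under that concern, is to carry a running sum and count through reduce and divide once in the finalize function from slide 19. The wages, the key string and the simplified map (wage annualisation omitted) are illustrative assumptions.

```javascript
// Map would emit a {sum, count} pair instead of a bare wage (not called here,
// because emit() only exists inside MongoDB).
const m = function () {
  emit(this.VISA_CLASS + " " + this.STATUS,
       { sum: this.LCA_CASE_WAGE_RATE_FROM, count: 1 });
};

// Reduce adds sums and counts; its output has the same shape as its input,
// so it can safely be fed back into another reduce call.
const r = function (k, v_arr) {
  const out = { sum: 0, count: 0 };
  for (let i = 0; i < v_arr.length; i++) {
    out.sum += v_arr[i].sum;
    out.count += v_arr[i].count;
  }
  return out;
};

// Finalize runs once per key after reduction and does the division.
const f = function (k, v) {
  return v.count === 0 ? 0 : v.sum / v.count;
};

// Hypothetical annual wages for one key, wrapped the way map would emit them.
const emitted = [60000, 80000, 100000].map(function (w) {
  return { sum: w, count: 1 };
});

// Reduce in two batches, then re-reduce the partial results.
const partial1 = r("H-1B CERTIFIED", emitted.slice(0, 2));
const partial2 = r("H-1B CERTIFIED", emitted.slice(2));
const combined = r("H-1B CERTIFIED", [partial1, partial2]);
const average = f("H-1B CERTIFIED", combined);

console.log(average); // 80000
```

Averaging the two partial averages instead (70000 and 100000) would give 85000, which is the error this arrangement avoids.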
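
The options listed on slide 20 map onto fields of the third argument to mapReduce. A rough sketch of such an options object follows; the field names (out, query, sort, scope, finalize) are standard mapReduce options, but every value here is a made-up example for the H1B collection.

```javascript
// Example option values are hypothetical; only the field names come from MongoDB.
const options = {
  out: "workers_by_state",              // persist the output to this collection
  query: { STATUS: "CERTIFIED" },       // filter the input documents
  sort: { LCA_CASE_EMPLOYER_STATE: 1 }, // sort the input documents
  scope: { hoursPerWeek: 40 },          // exposed as a global inside map/reduce/finalize
  finalize: function (k, v) {           // optional post-processing per key
    return Math.round(v);
  },
};

// In the mongo shell this object would be passed as the third argument:
//   db.text2010.mapReduce(m, r, options)
console.log(Object.keys(options).join(", "));
```

With a scope like this, the map function could multiply hourly wages by hoursPerWeek * 52 rather than hard-coding 40.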
