SlideShare une entreprise Scribd logo
1  sur  35
Paris 
Tugdual Grall 
Technical Evangelist 
tug@mongodb.com 
@tgrall
MongoDB & Hadoop 
Tugdual Grall 
Technical Evangelist 
tug@mongodb.com 
@tgrall
Agenda 
Evolving Data Landscape 
MongoDB & Hadoop Use Cases 
MongoDB Connector Features 
Demo
Evolving Data Landscape
Hadoop 
“The Apache Hadoop software library is a framework 
that allows for the distributed processing of large 
data sets across clusters of computers using simple 
programming models.” 
• Terabyte and Petabyte datasets 
• Data warehousing 
• Advanced analytics 
http://hadoop.apache.org
‹#› 
Enterprise IT Stack
‹#› 
Operational vs. Analytical: Enrichment 
Applications, Interactions Warehouse, Analytics
Operational: MongoDB 
First-Level 
Analytics 
Internet of Things 
Mobile Apps 
Social 
Product/Asset 
Catalog 
Security & Fraud 
Customer Data 
Management 
Single View 
Churn Analysis 
Risk Modeling 
Trade 
Surveillance 
Sentiment 
Analysis 
Recommender 
Warehouse & ETL 
Predictive 
Analytics 
Ad Targeting
Analytical: Hadoop 
First-Level 
Analytics 
Internet of Things 
Mobile Apps 
Social 
Product/Asset 
Catalog 
Security & Fraud 
Customer Data 
Management 
Single View 
Churn Analysis 
Risk Modeling 
Trade 
Surveillance 
Sentiment 
Analysis 
Recommender 
Warehouse & ETL 
Predictive 
Analytics 
Ad Targeting
Operational & Analytical: Lifecycle 
First-Level 
Analytics 
Internet of Things 
Mobile Apps 
Social 
Product/Asset 
Catalog 
Security & Fraud 
Customer Data 
Management 
Single View 
Churn Analysis 
Risk Modeling 
Trade 
Surveillance 
Sentiment 
Analysis 
Recommender 
Warehouse & ETL 
Predictive 
Analytics 
Ad Targeting
MongoDB & Hadoop Use Cases
Commerce 
Applications 
powered by 
Analysis 
powered by 
Products & Inventory 
Recommended products 
Customer profile 
Session management 
Elastic pricing 
Recommendation models 
Predictive analytics 
Clickstream history 
MongoDB Connector 
for Hadoop
Insurance 
Applications 
powered by 
Analysis 
powered by 
Customer profiles 
Insurance policies 
Session data 
Call center data 
Customer action analysis 
Churn analysis 
Churn prediction 
Policy rates 
MongoDB Connector 
for Hadoop
Fraud Detection 
Payments Nightly Analysis 
MongoDB Connector 
for Hadoop 
3rd Party 
Data Sources 
Results Cache 
Fraud 
Detection 
Query Only 
Query Only
MongoDB Connector for Hadoop
‹#› 
Connector Overview 
DATA 
• Read/Write MongoDB 
• Read/Write BSON 
TOOLS 
• MapReduce 
• Pig 
• Hive 
• Spark 
PLATFORMS 
• Apache Hadoop 
• Cloudera CDH 
• Hortonworks HDP 
• MapR 
• Amazon EMR
‹#› 
Connector Features and Functionality 
• Computes splits to read data 
• Single Node, Replica Sets, Sharded Clusters 
• Mappings for Pig and Hive 
• MongoDB as a standard data source/destination 
• Support for 
• Filtering data with MongoDB queries 
• Authentication 
• Reading from Replica Set tags 
• Appending to existing collections
‹#› 
MapReduce Configuration 
• MongoDB input/output 
mongo.job.input.format = com.mongodb.hadoop.MongoInputFormat 
mongo.input.uri = mongodb://mydb:27017/db1.collection1 
mongo.job.output.format = com.mongodb.hadoop.MongoOutputFormat 
mongo.output.uri = mongodb://mydb:27017/db1.collection2 
• BSON input/output 
mongo.job.input.format = com.hadoop.BSONFileInputFormat 
mapred.input.dir = hdfs:///tmp/database.bson 
mongo.job.output.format = com.hadoop.BSONFileOutputFormat 
mapred.output.dir = hdfs:///tmp/output.bson
‹#› 
Pig Mappings 
• Input: BSONLoader and MongoLoader 
data = LOAD ‘mongodb://mydb:27017/db.collection’ 
using com.mongodb.hadoop.pig.MongoLoader 
• Output: BSONStorage and MongoInsertStorage 
STORE records INTO ‘hdfs:///output.bson’ 
using com.mongodb.hadoop.pig.BSONStorage
‹#› 
Hive Support 
• Access collections as Hive tables 
• Use with MongoStorageHandler or BSONStorageHandler 
CREATE TABLE mongo_users (id int, name string, age int) 
STORED BY "com.mongodb.hadoop.hive.MongoStorageHandler" 
WITH SERDEPROPERTIES("mongo.columns.mapping” = "_id,name,age”) 
TBLPROPERTIES("mongo.uri" = "mongodb://host:27017/test.users”)
‹#› 
Spark 
• Use with MapReduce input/output 
formats 
• Create Configuration objects with 
input/output formats and data URI 
• Load/save data using SparkContext 
Hadoop file API
‹#› 
Data Movement 
Dynamic queries to MongoDB vs. BSON snapshots in HDFS 
Dynamic queries with most 
recent data 
Puts load on operational 
database 
Snapshots move load to 
Hadoop 
Snapshots add predictable 
load to MongoDB
Demo : Recommendation Platform
‹#› 
Movie Web
‹#› 
MovieWeb Web Application 
• Browse 
- Top movies by ratings count 
- Top genres by movie count 
• Log in to 
- See My Ratings 
- Rate movies 
• Recommendations 
- Movies You May Like 
- Recommendations
‹#› 
MovieWeb Components 
• MovieLens dataset 
– 10M ratings, 10K movies, 70K users 
– http://grouplens.org/datasets/movielens/ 
• Python web app to browse movies, recommendations 
– Flask, PyMongo 
• Spark app computes recommendations 
– MLLib collaborative filter 
• Predicted ratings are exposed in web app 
– New predictions collection
‹#› 
Spark Recommender 
• Apache Hadoop (2.3) 
- HDFS & YARN 
- Top genres by movie count 
• Spark (1.0) 
- Execute within YARN 
- Assign executor resources 
• Data 
- From HDFS, MongoDB 
- To MongoDB
‹#› 
MovieWeb Workflow 
Snapshot db 
as BSON 
Predict ratings for 
all pairings 
Write Prediction to 
MongoDB 
collection 
Store BSON 
in HDFS 
Read BSON 
into Spark App 
Create user movie 
pairing 
Web Application 
exposes 
recommendations 
Train Model from 
existing ratings 
Repeat Process
‹#› 
Execution 
$ spark-submit --master local  
--driver-memory 2G --executor-memory 2G  
--jars mongo-hadoop-core.jar,mongo-java-driver.jar  
--class com.mongodb.workshop.SparkExercise  
./target/spark-1.0-SNAPSHOT.jar  
hdfs://localhost:9000  
mongodb://127.0.0.1:27017/movielens  
predictions
Should I use MongoDB or Hadoop?
‹#› 
Business First! 
First-Level 
Analytics 
Internet of 
Things 
Mobile Apps 
Social 
What/Why How 
Product/Asse 
t Catalog 
Security & 
Fraud 
Customer 
Data 
Management 
Single View 
Churn 
Analysis 
Risk 
Modeling 
Trade 
Surveillance 
Sentiment 
Analysis 
Recommend 
er 
Warehouse 
& ETL 
Predictive 
Analytics 
Ad Targeting
‹#› 
The good tool for the task 
• Dataset size 
• Data processing complexity 
• Continuous improvement 
V1.0
‹#› 
The good tool for the task 
• Dataset size 
• Data processing complexity 
• Continuous improvement 
V2.0
‹#› 
Resources / Questions 
• MongoDB Connector for Hadoop 
- http://github.com/mongodb/mongo-hadoop 
• Getting Started with MongoDB and Hadoop 
- http://docs.mongodb.org/ecosystem/tutorial/getting-started- 
with-hadoop/ 
• MongoDB-Spark Demo 
- https://github.com/crcsmnky/mongodb-hadoop-workshop
MongoDB & Hadoop 
Tugdual Grall 
Technical Evangelist 
tug@mongodb.com 
@tgrall

Contenu connexe

Tendances

MongoDB Evenings Dallas: What's the Scoop on MongoDB & Hadoop
MongoDB Evenings Dallas: What's the Scoop on MongoDB & HadoopMongoDB Evenings Dallas: What's the Scoop on MongoDB & Hadoop
MongoDB Evenings Dallas: What's the Scoop on MongoDB & HadoopMongoDB
 
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...MongoDB
 
Using MongoDB + Hadoop Together
Using MongoDB + Hadoop TogetherUsing MongoDB + Hadoop Together
Using MongoDB + Hadoop TogetherMongoDB
 
Real Time Data Analytics with MongoDB and Fluentd at Wish
Real Time Data Analytics with MongoDB and Fluentd at WishReal Time Data Analytics with MongoDB and Fluentd at Wish
Real Time Data Analytics with MongoDB and Fluentd at WishMongoDB
 
HBaseCon 2017: Community-Driven Graph with JanusGraph (updated)
HBaseCon 2017: Community-Driven Graph with JanusGraph (updated)HBaseCon 2017: Community-Driven Graph with JanusGraph (updated)
HBaseCon 2017: Community-Driven Graph with JanusGraph (updated)Jing Chen (Jerry) He
 
Webinar: The Anatomy of the Cloudant Data Layer
Webinar: The Anatomy of the Cloudant Data LayerWebinar: The Anatomy of the Cloudant Data Layer
Webinar: The Anatomy of the Cloudant Data LayerIBM Cloud Data Services
 
Augmenting Mongo DB with treasure data
Augmenting Mongo DB with treasure dataAugmenting Mongo DB with treasure data
Augmenting Mongo DB with treasure dataTreasure Data, Inc.
 
MongoDB Evenings DC: Get MEAN and Lean with Docker and Kubernetes
MongoDB Evenings DC: Get MEAN and Lean with Docker and KubernetesMongoDB Evenings DC: Get MEAN and Lean with Docker and Kubernetes
MongoDB Evenings DC: Get MEAN and Lean with Docker and KubernetesMongoDB
 
Key note big data analytics ecosystem strategy
Key note   big data analytics ecosystem strategyKey note   big data analytics ecosystem strategy
Key note big data analytics ecosystem strategyIBM Sverige
 
Data persistence using pouchdb and couchdb
Data persistence using pouchdb and couchdbData persistence using pouchdb and couchdb
Data persistence using pouchdb and couchdbDimgba Kalu
 
MongoDB Evenings Houston: What's the Scoop on MongoDB and Hadoop? by Jake Ang...
MongoDB Evenings Houston: What's the Scoop on MongoDB and Hadoop? by Jake Ang...MongoDB Evenings Houston: What's the Scoop on MongoDB and Hadoop? by Jake Ang...
MongoDB Evenings Houston: What's the Scoop on MongoDB and Hadoop? by Jake Ang...MongoDB
 
Webinar: Elevate Your Enterprise Architecture with In-Memory Computing
Webinar: Elevate Your Enterprise Architecture with In-Memory ComputingWebinar: Elevate Your Enterprise Architecture with In-Memory Computing
Webinar: Elevate Your Enterprise Architecture with In-Memory ComputingMongoDB
 
MongoDB and Azure Databricks
MongoDB and Azure DatabricksMongoDB and Azure Databricks
MongoDB and Azure DatabricksMongoDB
 
MongoDB Launchpad 2016: MongoDB 3.4: Your Database Evolved
MongoDB Launchpad 2016: MongoDB 3.4: Your Database EvolvedMongoDB Launchpad 2016: MongoDB 3.4: Your Database Evolved
MongoDB Launchpad 2016: MongoDB 3.4: Your Database EvolvedMongoDB
 
MongoDB World 2019: Raiders of the Anti-patterns: A Journey Towards Fixing Sc...
MongoDB World 2019: Raiders of the Anti-patterns: A Journey Towards Fixing Sc...MongoDB World 2019: Raiders of the Anti-patterns: A Journey Towards Fixing Sc...
MongoDB World 2019: Raiders of the Anti-patterns: A Journey Towards Fixing Sc...MongoDB
 
MongoDB World 2019: MongoDB in Data Science: How to Build a Scalable Product ...
MongoDB World 2019: MongoDB in Data Science: How to Build a Scalable Product ...MongoDB World 2019: MongoDB in Data Science: How to Build a Scalable Product ...
MongoDB World 2019: MongoDB in Data Science: How to Build a Scalable Product ...MongoDB
 
Joins and Other MongoDB 3.2 Aggregation Enhancements
Joins and Other MongoDB 3.2 Aggregation EnhancementsJoins and Other MongoDB 3.2 Aggregation Enhancements
Joins and Other MongoDB 3.2 Aggregation EnhancementsAndrew Morgan
 

Tendances (20)

MongoDB and Spark
MongoDB and SparkMongoDB and Spark
MongoDB and Spark
 
MongoDB Evenings Dallas: What's the Scoop on MongoDB & Hadoop
MongoDB Evenings Dallas: What's the Scoop on MongoDB & HadoopMongoDB Evenings Dallas: What's the Scoop on MongoDB & Hadoop
MongoDB Evenings Dallas: What's the Scoop on MongoDB & Hadoop
 
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
 
Using MongoDB + Hadoop Together
Using MongoDB + Hadoop TogetherUsing MongoDB + Hadoop Together
Using MongoDB + Hadoop Together
 
Real Time Data Analytics with MongoDB and Fluentd at Wish
Real Time Data Analytics with MongoDB and Fluentd at WishReal Time Data Analytics with MongoDB and Fluentd at Wish
Real Time Data Analytics with MongoDB and Fluentd at Wish
 
HBaseCon 2017: Community-Driven Graph with JanusGraph (updated)
HBaseCon 2017: Community-Driven Graph with JanusGraph (updated)HBaseCon 2017: Community-Driven Graph with JanusGraph (updated)
HBaseCon 2017: Community-Driven Graph with JanusGraph (updated)
 
Webinar: The Anatomy of the Cloudant Data Layer
Webinar: The Anatomy of the Cloudant Data LayerWebinar: The Anatomy of the Cloudant Data Layer
Webinar: The Anatomy of the Cloudant Data Layer
 
Augmenting Mongo DB with treasure data
Augmenting Mongo DB with treasure dataAugmenting Mongo DB with treasure data
Augmenting Mongo DB with treasure data
 
MongoDB Evenings DC: Get MEAN and Lean with Docker and Kubernetes
MongoDB Evenings DC: Get MEAN and Lean with Docker and KubernetesMongoDB Evenings DC: Get MEAN and Lean with Docker and Kubernetes
MongoDB Evenings DC: Get MEAN and Lean with Docker and Kubernetes
 
Key note big data analytics ecosystem strategy
Key note   big data analytics ecosystem strategyKey note   big data analytics ecosystem strategy
Key note big data analytics ecosystem strategy
 
MongoDB on Azure
MongoDB on AzureMongoDB on Azure
MongoDB on Azure
 
Data persistence using pouchdb and couchdb
Data persistence using pouchdb and couchdbData persistence using pouchdb and couchdb
Data persistence using pouchdb and couchdb
 
MongoDB Evenings Houston: What's the Scoop on MongoDB and Hadoop? by Jake Ang...
MongoDB Evenings Houston: What's the Scoop on MongoDB and Hadoop? by Jake Ang...MongoDB Evenings Houston: What's the Scoop on MongoDB and Hadoop? by Jake Ang...
MongoDB Evenings Houston: What's the Scoop on MongoDB and Hadoop? by Jake Ang...
 
Webinar: Elevate Your Enterprise Architecture with In-Memory Computing
Webinar: Elevate Your Enterprise Architecture with In-Memory ComputingWebinar: Elevate Your Enterprise Architecture with In-Memory Computing
Webinar: Elevate Your Enterprise Architecture with In-Memory Computing
 
MongoDB and Azure Databricks
MongoDB and Azure DatabricksMongoDB and Azure Databricks
MongoDB and Azure Databricks
 
MongoDB Launchpad 2016: MongoDB 3.4: Your Database Evolved
MongoDB Launchpad 2016: MongoDB 3.4: Your Database EvolvedMongoDB Launchpad 2016: MongoDB 3.4: Your Database Evolved
MongoDB Launchpad 2016: MongoDB 3.4: Your Database Evolved
 
MongoDB + Spring
MongoDB + SpringMongoDB + Spring
MongoDB + Spring
 
MongoDB World 2019: Raiders of the Anti-patterns: A Journey Towards Fixing Sc...
MongoDB World 2019: Raiders of the Anti-patterns: A Journey Towards Fixing Sc...MongoDB World 2019: Raiders of the Anti-patterns: A Journey Towards Fixing Sc...
MongoDB World 2019: Raiders of the Anti-patterns: A Journey Towards Fixing Sc...
 
MongoDB World 2019: MongoDB in Data Science: How to Build a Scalable Product ...
MongoDB World 2019: MongoDB in Data Science: How to Build a Scalable Product ...MongoDB World 2019: MongoDB in Data Science: How to Build a Scalable Product ...
MongoDB World 2019: MongoDB in Data Science: How to Build a Scalable Product ...
 
Joins and Other MongoDB 3.2 Aggregation Enhancements
Joins and Other MongoDB 3.2 Aggregation EnhancementsJoins and Other MongoDB 3.2 Aggregation Enhancements
Joins and Other MongoDB 3.2 Aggregation Enhancements
 

Similaire à MongoDB and Hadoop

Webinar: MongoDB and Hadoop - Working Together to provide Business Insights
Webinar: MongoDB and Hadoop - Working Together to provide Business InsightsWebinar: MongoDB and Hadoop - Working Together to provide Business Insights
Webinar: MongoDB and Hadoop - Working Together to provide Business InsightsMongoDB
 
Webinar: Data Streaming with Apache Kafka & MongoDB
Webinar: Data Streaming with Apache Kafka & MongoDBWebinar: Data Streaming with Apache Kafka & MongoDB
Webinar: Data Streaming with Apache Kafka & MongoDBMongoDB
 
Webinar: Managing Real Time Risk Analytics with MongoDB
Webinar: Managing Real Time Risk Analytics with MongoDB Webinar: Managing Real Time Risk Analytics with MongoDB
Webinar: Managing Real Time Risk Analytics with MongoDB MongoDB
 
Java Persistence Frameworks for MongoDB
Java Persistence Frameworks for MongoDBJava Persistence Frameworks for MongoDB
Java Persistence Frameworks for MongoDBMongoDB
 
MongoDB Days Germany: Data Processing with MongoDB
MongoDB Days Germany: Data Processing with MongoDBMongoDB Days Germany: Data Processing with MongoDB
MongoDB Days Germany: Data Processing with MongoDBMongoDB
 
Java Persistence Frameworks for MongoDB
Java Persistence Frameworks for MongoDBJava Persistence Frameworks for MongoDB
Java Persistence Frameworks for MongoDBTobias Trelle
 
MongoDB SoCal 2020: Migrate Anything* to MongoDB Atlas
MongoDB SoCal 2020: Migrate Anything* to MongoDB AtlasMongoDB SoCal 2020: Migrate Anything* to MongoDB Atlas
MongoDB SoCal 2020: Migrate Anything* to MongoDB AtlasMongoDB
 
MongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDB
MongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDBMongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDB
MongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDBMongoDB
 
MongoDB World 2016: Get MEAN and Lean with MongoDB and Kubernetes
MongoDB World 2016: Get MEAN and Lean with MongoDB and KubernetesMongoDB World 2016: Get MEAN and Lean with MongoDB and Kubernetes
MongoDB World 2016: Get MEAN and Lean with MongoDB and KubernetesMongoDB
 
MongoDB & Hadoop, Sittin' in a Tree
MongoDB & Hadoop, Sittin' in a TreeMongoDB & Hadoop, Sittin' in a Tree
MongoDB & Hadoop, Sittin' in a TreeMongoDB
 
How sitecore depends on mongo db for scalability and performance, and what it...
How sitecore depends on mongo db for scalability and performance, and what it...How sitecore depends on mongo db for scalability and performance, and what it...
How sitecore depends on mongo db for scalability and performance, and what it...Antonios Giannopoulos
 
MongoDB Introduction, Installation & Execution
MongoDB Introduction, Installation & ExecutionMongoDB Introduction, Installation & Execution
MongoDB Introduction, Installation & ExecutioniFour Technolab Pvt. Ltd.
 
Unlocking Operational Intelligence from the Data Lake
Unlocking Operational Intelligence from the Data LakeUnlocking Operational Intelligence from the Data Lake
Unlocking Operational Intelligence from the Data LakeMongoDB
 
MongoDB.local Sydney: An Introduction to Document Databases with MongoDB
MongoDB.local Sydney: An Introduction to Document Databases with MongoDBMongoDB.local Sydney: An Introduction to Document Databases with MongoDB
MongoDB.local Sydney: An Introduction to Document Databases with MongoDBMongoDB
 
MongoDB_Spark
MongoDB_SparkMongoDB_Spark
MongoDB_SparkMat Keep
 
Machine Learning on Google Cloud with H2O
Machine Learning on Google Cloud with H2OMachine Learning on Google Cloud with H2O
Machine Learning on Google Cloud with H2OSri Ambati
 
Using MongoDB For BigData in 20 Minutes
Using MongoDB For BigData in 20 MinutesUsing MongoDB For BigData in 20 Minutes
Using MongoDB For BigData in 20 MinutesAndrás Fehér
 
Data Streaming with Apache Kafka & MongoDB
Data Streaming with Apache Kafka & MongoDBData Streaming with Apache Kafka & MongoDB
Data Streaming with Apache Kafka & MongoDBconfluent
 
Benchmarking Hadoop and Big Data
Benchmarking Hadoop and Big DataBenchmarking Hadoop and Big Data
Benchmarking Hadoop and Big DataNicolas Poggi
 

Similaire à MongoDB and Hadoop (20)

Webinar: MongoDB and Hadoop - Working Together to provide Business Insights
Webinar: MongoDB and Hadoop - Working Together to provide Business InsightsWebinar: MongoDB and Hadoop - Working Together to provide Business Insights
Webinar: MongoDB and Hadoop - Working Together to provide Business Insights
 
Webinar: Data Streaming with Apache Kafka & MongoDB
Webinar: Data Streaming with Apache Kafka & MongoDBWebinar: Data Streaming with Apache Kafka & MongoDB
Webinar: Data Streaming with Apache Kafka & MongoDB
 
Webinar: Managing Real Time Risk Analytics with MongoDB
Webinar: Managing Real Time Risk Analytics with MongoDB Webinar: Managing Real Time Risk Analytics with MongoDB
Webinar: Managing Real Time Risk Analytics with MongoDB
 
Java Persistence Frameworks for MongoDB
Java Persistence Frameworks for MongoDBJava Persistence Frameworks for MongoDB
Java Persistence Frameworks for MongoDB
 
MongoDB Days Germany: Data Processing with MongoDB
MongoDB Days Germany: Data Processing with MongoDBMongoDB Days Germany: Data Processing with MongoDB
MongoDB Days Germany: Data Processing with MongoDB
 
Java Persistence Frameworks for MongoDB
Java Persistence Frameworks for MongoDBJava Persistence Frameworks for MongoDB
Java Persistence Frameworks for MongoDB
 
MongoDB SoCal 2020: Migrate Anything* to MongoDB Atlas
MongoDB SoCal 2020: Migrate Anything* to MongoDB AtlasMongoDB SoCal 2020: Migrate Anything* to MongoDB Atlas
MongoDB SoCal 2020: Migrate Anything* to MongoDB Atlas
 
MongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDB
MongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDBMongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDB
MongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDB
 
MongoDB World 2016: Get MEAN and Lean with MongoDB and Kubernetes
MongoDB World 2016: Get MEAN and Lean with MongoDB and KubernetesMongoDB World 2016: Get MEAN and Lean with MongoDB and Kubernetes
MongoDB World 2016: Get MEAN and Lean with MongoDB and Kubernetes
 
MongoDB & Hadoop, Sittin' in a Tree
MongoDB & Hadoop, Sittin' in a TreeMongoDB & Hadoop, Sittin' in a Tree
MongoDB & Hadoop, Sittin' in a Tree
 
How sitecore depends on mongo db for scalability and performance, and what it...
How sitecore depends on mongo db for scalability and performance, and what it...How sitecore depends on mongo db for scalability and performance, and what it...
How sitecore depends on mongo db for scalability and performance, and what it...
 
MongoDB Introduction, Installation & Execution
MongoDB Introduction, Installation & ExecutionMongoDB Introduction, Installation & Execution
MongoDB Introduction, Installation & Execution
 
Unlocking Operational Intelligence from the Data Lake
Unlocking Operational Intelligence from the Data LakeUnlocking Operational Intelligence from the Data Lake
Unlocking Operational Intelligence from the Data Lake
 
MongoDB.local Sydney: An Introduction to Document Databases with MongoDB
MongoDB.local Sydney: An Introduction to Document Databases with MongoDBMongoDB.local Sydney: An Introduction to Document Databases with MongoDB
MongoDB.local Sydney: An Introduction to Document Databases with MongoDB
 
MongoDB 3.4 webinar
MongoDB 3.4 webinarMongoDB 3.4 webinar
MongoDB 3.4 webinar
 
MongoDB_Spark
MongoDB_SparkMongoDB_Spark
MongoDB_Spark
 
Machine Learning on Google Cloud with H2O
Machine Learning on Google Cloud with H2OMachine Learning on Google Cloud with H2O
Machine Learning on Google Cloud with H2O
 
Using MongoDB For BigData in 20 Minutes
Using MongoDB For BigData in 20 MinutesUsing MongoDB For BigData in 20 Minutes
Using MongoDB For BigData in 20 Minutes
 
Data Streaming with Apache Kafka & MongoDB
Data Streaming with Apache Kafka & MongoDBData Streaming with Apache Kafka & MongoDB
Data Streaming with Apache Kafka & MongoDB
 
Benchmarking Hadoop and Big Data
Benchmarking Hadoop and Big DataBenchmarking Hadoop and Big Data
Benchmarking Hadoop and Big Data
 

Plus de Tugdual Grall

Introduction to Streaming with Apache Flink
Introduction to Streaming with Apache FlinkIntroduction to Streaming with Apache Flink
Introduction to Streaming with Apache FlinkTugdual Grall
 
Introduction to Streaming with Apache Flink
Introduction to Streaming with Apache FlinkIntroduction to Streaming with Apache Flink
Introduction to Streaming with Apache FlinkTugdual Grall
 
Fast Cars, Big Data - How Streaming Can Help Formula 1
Fast Cars, Big Data - How Streaming Can Help Formula 1Fast Cars, Big Data - How Streaming Can Help Formula 1
Fast Cars, Big Data - How Streaming Can Help Formula 1Tugdual Grall
 
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!Tugdual Grall
 
Proud to be Polyglot - Riviera Dev 2015
Proud to be Polyglot - Riviera Dev 2015Proud to be Polyglot - Riviera Dev 2015
Proud to be Polyglot - Riviera Dev 2015Tugdual Grall
 
Introduction to NoSQL with MongoDB - SQLi Workshop
Introduction to NoSQL with MongoDB - SQLi WorkshopIntroduction to NoSQL with MongoDB - SQLi Workshop
Introduction to NoSQL with MongoDB - SQLi WorkshopTugdual Grall
 
Enabling Telco to Build and Run Modern Applications
Enabling Telco to Build and Run Modern Applications Enabling Telco to Build and Run Modern Applications
Enabling Telco to Build and Run Modern Applications Tugdual Grall
 
Proud to be polyglot
Proud to be polyglotProud to be polyglot
Proud to be polyglotTugdual Grall
 
Drop your table ! MongoDB Schema Design
Drop your table ! MongoDB Schema DesignDrop your table ! MongoDB Schema Design
Drop your table ! MongoDB Schema DesignTugdual Grall
 
Devoxx 2014 : Atelier MongoDB - Decouverte de MongoDB 2.6
Devoxx 2014 : Atelier MongoDB - Decouverte de MongoDB 2.6Devoxx 2014 : Atelier MongoDB - Decouverte de MongoDB 2.6
Devoxx 2014 : Atelier MongoDB - Decouverte de MongoDB 2.6Tugdual Grall
 
Some cool features of MongoDB
Some cool features of MongoDBSome cool features of MongoDB
Some cool features of MongoDBTugdual Grall
 
Building Your First MongoDB Application
Building Your First MongoDB ApplicationBuilding Your First MongoDB Application
Building Your First MongoDB ApplicationTugdual Grall
 
Opensourceday 2014-iot
Opensourceday 2014-iotOpensourceday 2014-iot
Opensourceday 2014-iotTugdual Grall
 
Softshake 2013: Introduction to NoSQL with Couchbase
Softshake 2013: Introduction to NoSQL with CouchbaseSoftshake 2013: Introduction to NoSQL with Couchbase
Softshake 2013: Introduction to NoSQL with CouchbaseTugdual Grall
 
Introduction to NoSQL with Couchbase
Introduction to NoSQL with CouchbaseIntroduction to NoSQL with Couchbase
Introduction to NoSQL with CouchbaseTugdual Grall
 
Why and How to integrate Hadoop and NoSQL?
Why and How to integrate Hadoop and NoSQL?Why and How to integrate Hadoop and NoSQL?
Why and How to integrate Hadoop and NoSQL?Tugdual Grall
 
NoSQL Matters 2013 - Introduction to Map Reduce with Couchbase 2.0
NoSQL Matters 2013 - Introduction to Map Reduce with Couchbase 2.0NoSQL Matters 2013 - Introduction to Map Reduce with Couchbase 2.0
NoSQL Matters 2013 - Introduction to Map Reduce with Couchbase 2.0Tugdual Grall
 
Big Data Paris : Hadoop and NoSQL
Big Data Paris : Hadoop and NoSQLBig Data Paris : Hadoop and NoSQL
Big Data Paris : Hadoop and NoSQLTugdual Grall
 

Plus de Tugdual Grall (20)

Introduction to Streaming with Apache Flink
Introduction to Streaming with Apache FlinkIntroduction to Streaming with Apache Flink
Introduction to Streaming with Apache Flink
 
Introduction to Streaming with Apache Flink
Introduction to Streaming with Apache FlinkIntroduction to Streaming with Apache Flink
Introduction to Streaming with Apache Flink
 
Fast Cars, Big Data - How Streaming Can Help Formula 1
Fast Cars, Big Data - How Streaming Can Help Formula 1Fast Cars, Big Data - How Streaming Can Help Formula 1
Fast Cars, Big Data - How Streaming Can Help Formula 1
 
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!
 
Big Data Journey
Big Data JourneyBig Data Journey
Big Data Journey
 
Proud to be Polyglot - Riviera Dev 2015
Proud to be Polyglot - Riviera Dev 2015Proud to be Polyglot - Riviera Dev 2015
Proud to be Polyglot - Riviera Dev 2015
 
Introduction to NoSQL with MongoDB - SQLi Workshop
Introduction to NoSQL with MongoDB - SQLi WorkshopIntroduction to NoSQL with MongoDB - SQLi Workshop
Introduction to NoSQL with MongoDB - SQLi Workshop
 
Enabling Telco to Build and Run Modern Applications
Enabling Telco to Build and Run Modern Applications Enabling Telco to Build and Run Modern Applications
Enabling Telco to Build and Run Modern Applications
 
Proud to be polyglot
Proud to be polyglotProud to be polyglot
Proud to be polyglot
 
Drop your table ! MongoDB Schema Design
Drop your table ! MongoDB Schema DesignDrop your table ! MongoDB Schema Design
Drop your table ! MongoDB Schema Design
 
Devoxx 2014 : Atelier MongoDB - Decouverte de MongoDB 2.6
Devoxx 2014 : Atelier MongoDB - Decouverte de MongoDB 2.6Devoxx 2014 : Atelier MongoDB - Decouverte de MongoDB 2.6
Devoxx 2014 : Atelier MongoDB - Decouverte de MongoDB 2.6
 
Some cool features of MongoDB
Some cool features of MongoDBSome cool features of MongoDB
Some cool features of MongoDB
 
Building Your First MongoDB Application
Building Your First MongoDB ApplicationBuilding Your First MongoDB Application
Building Your First MongoDB Application
 
Opensourceday 2014-iot
Opensourceday 2014-iotOpensourceday 2014-iot
Opensourceday 2014-iot
 
Neotys conference
Neotys conferenceNeotys conference
Neotys conference
 
Softshake 2013: Introduction to NoSQL with Couchbase
Softshake 2013: Introduction to NoSQL with CouchbaseSoftshake 2013: Introduction to NoSQL with Couchbase
Softshake 2013: Introduction to NoSQL with Couchbase
 
Introduction to NoSQL with Couchbase
Introduction to NoSQL with CouchbaseIntroduction to NoSQL with Couchbase
Introduction to NoSQL with Couchbase
 
Why and How to integrate Hadoop and NoSQL?
Why and How to integrate Hadoop and NoSQL?Why and How to integrate Hadoop and NoSQL?
Why and How to integrate Hadoop and NoSQL?
 
NoSQL Matters 2013 - Introduction to Map Reduce with Couchbase 2.0
NoSQL Matters 2013 - Introduction to Map Reduce with Couchbase 2.0NoSQL Matters 2013 - Introduction to Map Reduce with Couchbase 2.0
NoSQL Matters 2013 - Introduction to Map Reduce with Couchbase 2.0
 
Big Data Paris : Hadoop and NoSQL
Big Data Paris : Hadoop and NoSQLBig Data Paris : Hadoop and NoSQL
Big Data Paris : Hadoop and NoSQL
 

Dernier

The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 

Dernier (20)

The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 

MongoDB and Hadoop

  • 1. Paris Tugdual Grall Technical Evangelist tug@mongodb.com @tgrall
  • 2. MongoDB & Hadoop Tugdual Grall Technical Evangelist tug@mongodb.com @tgrall
  • 3. Agenda Evolving Data Landscape MongoDB & Hadoop Use Cases MongoDB Connector Features Demo
  • 5. Hadoop “The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.” • Terabyte and Petabyte datasets • Data warehousing • Advanced analytics http://hadoop.apache.org
  • 7. ‹#› Operational vs. Analytical: Enrichment Applications, Interactions Warehouse, Analytics
  • 8. Operational: MongoDB First-Level Analytics Internet of Things Mobile Apps Social Product/Asset Catalog Security & Fraud Customer Data Management Single View Churn Analysis Risk Modeling Trade Surveillance Sentiment Analysis Recommender Warehouse & ETL Predictive Analytics Ad Targeting
  • 9. Analytical: Hadoop First-Level Analytics Internet of Things Mobile Apps Social Product/Asset Catalog Security & Fraud Customer Data Management Single View Churn Analysis Risk Modeling Trade Surveillance Sentiment Analysis Recommender Warehouse & ETL Predictive Analytics Ad Targeting
  • 10. Operational & Analytical: Lifecycle First-Level Analytics Internet of Things Mobile Apps Social Product/Asset Catalog Security & Fraud Customer Data Management Single View Churn Analysis Risk Modeling Trade Surveillance Sentiment Analysis Recommender Warehouse & ETL Predictive Analytics Ad Targeting
  • 11. MongoDB & Hadoop Use Cases
  • 12. Commerce Applications powered by Analysis powered by Products & Inventory Recommended products Customer profile Session management Elastic pricing Recommendation models Predictive analytics Clickstream history MongoDB Connector for Hadoop
  • 13. Insurance Applications powered by Analysis powered by Customer profiles Insurance policies Session data Call center data Customer action analysis Churn analysis Churn prediction Policy rates MongoDB Connector for Hadoop
  • 14. Fraud Detection Payments Nightly Analysis MongoDB Connector for Hadoop 3rd Party Data Sources Results Cache Fraud Detection Query Only Query Only
  • 16. ‹#› Connector Overview DATA • Read/Write MongoDB • Read/Write BSON TOOLS • MapReduce • Pig • Hive • Spark PLATFORMS • Apache Hadoop • Cloudera CDH • Hortonworks HDP • MapR • Amazon EMR
  • 17. ‹#› Connector Features and Functionality • Computes splits to read data • Single Node, Replica Sets, Sharded Clusters • Mappings for Pig and Hive • MongoDB as a standard data source/destination • Support for • Filtering data with MongoDB queries • Authentication • Reading from Replica Set tags • Appending to existing collections
  • 18. ‹#› MapReduce Configuration • MongoDB input/output mongo.job.input.format = com.mongodb.hadoop.MongoInputFormat mongo.input.uri = mongodb://mydb:27017/db1.collection1 mongo.job.output.format = com.mongodb.hadoop.MongoOutputFormat mongo.output.uri = mongodb://mydb:27017/db1.collection2 • BSON input/output mongo.job.input.format = com.hadoop.BSONFileInputFormat mapred.input.dir = hdfs:///tmp/database.bson mongo.job.output.format = com.hadoop.BSONFileOutputFormat mapred.output.dir = hdfs:///tmp/output.bson
  • 19. ‹#› Pig Mappings • Input: BSONLoader and MongoLoader data = LOAD ‘mongodb://mydb:27017/db.collection’ using com.mongodb.hadoop.pig.MongoLoader • Output: BSONStorage and MongoInsertStorage STORE records INTO ‘hdfs:///output.bson’ using com.mongodb.hadoop.pig.BSONStorage
  • 20. ‹#› Hive Support • Access collections as Hive tables • Use with MongoStorageHandler or BSONStorageHandler CREATE TABLE mongo_users (id int, name string, age int) STORED BY "com.mongodb.hadoop.hive.MongoStorageHandler" WITH SERDEPROPERTIES("mongo.columns.mapping” = "_id,name,age”) TBLPROPERTIES("mongo.uri" = "mongodb://host:27017/test.users”)
  • 21. ‹#› Spark • Use with MapReduce input/output formats • Create Configuration objects with input/output formats and data URI • Load/save data using SparkContext Hadoop file API
  • 22. ‹#› Data Movement Dynamic queries to MongoDB vs. BSON snapshots in HDFS Dynamic queries with most recent data Puts load on operational database Snapshots move load to Hadoop Snapshots add predictable load to MongoDB
  • 25. ‹#› MovieWeb Web Application • Browse - Top movies by ratings count - Top genres by movie count • Log in to - See My Ratings - Rate movies • Recommendations - Movies You May Like - Recommendations
  • 26. ‹#› MovieWeb Components • MovieLens dataset – 10M ratings, 10K movies, 70K users – http://grouplens.org/datasets/movielens/ • Python web app to browse movies, recommendations – Flask, PyMongo • Spark app computes recommendations – MLLib collaborative filter • Predicted ratings are exposed in web app – New predictions collection
  • 27. ‹#› Spark Recommender • Apache Hadoop (2.3) - HDFS & YARN - Top genres by movie count • Spark (1.0) - Execute within YARN - Assign executor resources • Data - From HDFS, MongoDB - To MongoDB
  • 28. ‹#› MovieWeb Workflow Snapshot db as BSON Predict ratings for all pairings Write Prediction to MongoDB collection Store BSON in HDFS Read BSON into Spark App Create user movie pairing Web Application exposes recommendations Train Model from existing ratings Repeat Process
  • 29. ‹#› Execution $ spark-submit --master local --driver-memory 2G --executor-memory 2G --jars mongo-hadoop-core.jar,mongo-java-driver.jar --class com.mongodb.workshop.SparkExercise ./target/spark-1.0-SNAPSHOT.jar hdfs://localhost:9000 mongodb://127.0.0.1:27017/movielens predictions
  • 30. Should I use MongoDB or Hadoop?
  • 31. ‹#› Business First! First-Level Analytics Internet of Things Mobile Apps Social What/Why How Product/Asse t Catalog Security & Fraud Customer Data Management Single View Churn Analysis Risk Modeling Trade Surveillance Sentiment Analysis Recommend er Warehouse & ETL Predictive Analytics Ad Targeting
  • 32. ‹#› The good tool for the task • Dataset size • Data processing complexity • Continuous improvement V1.0
  • 33. ‹#› The good tool for the task • Dataset size • Data processing complexity • Continuous improvement V2.0
  • 34. ‹#› Resources / Questions • MongoDB Connector for Hadoop - http://github.com/mongodb/mongo-hadoop • Getting Started with MongoDB and Hadoop - http://docs.mongodb.org/ecosystem/tutorial/getting-started- with-hadoop/ • MongoDB-Spark Demo - https://github.com/crcsmnky/mongodb-hadoop-workshop
  • 35. MongoDB & Hadoop Tugdual Grall Technical Evangelist tug@mongodb.com @tgrall

Notes de l'éditeur

  1. Apache def, a framework to enable many things Distributed File system one of the core component is MapReduce Now it is more YARN, that is resource manager, and MR is just one type of jobs you can manage Mongo DB : GB and Terabytes Hadoop : Tb and Pb
  2. You have 2 places where you deal with data
  3. You have to think about “enrichment” MongoDB is here to enrich data that are in Hadoop Hadoop is here to enrich data that are in Mongodb Let’s look at the different uses cases between Operational and Analytics
  4. First level could be done in MongoDB “what is your application is talking to?”
  5. Hadoop will be there to analyze a bigger problem and do some treatment We are talking of hadoop when it is Pb of data
  6. We are trying to solve the bigger problem, by connecting the 2 technologies when it makes sense
  7. Split the data when reading data (Mapper) But also filtering queries, for example to take data from a specific timestamp To reduce the load on your cluster read from replicaset tags a new feature that people asked for, is adding result to the existing collection
  8. Spark in a new data processing that happens most in memory Take all the power of the connector Open a new “hadoop file” that is loaded in RDD ( Resilient Distributed Dataset )
  9. Load Data Read Users from MongoDB (user collection) Read Movies from BSON (HDFS) Read Ratings from MongoDB (ratings collection) Data Processing Generate (user/movies) pairs users.cartesian(movies) Train : Collaborative Filter ALS.train(ratings.rdd(), 10, 10, 0.01); Predict/Recommend Save data into MongoDB prediction collection