#MongoDB DC 
MongoDB and Hadoop 
Driving Business Insights 
Justin Lee 
Software Engineer, MongoDB
Agenda 
• Evolving Data Landscape 
• MongoDB & Hadoop Use Cases 
• MongoDB Connector Features 
• Demo
Evolving Data Landscape
Hadoop 
The Apache Hadoop software library is a framework that 
allows for the distributed processing of large data sets 
across clusters of computers using simple programming 
models. 
• Terabyte and petabyte datasets
• Data warehousing 
• Advanced analytics
Enterprise IT Stack 
Operational Analytical 
EDW 
Management & Monitoring 
Security & Auditing 
Applications 
CRM, ERP, Collaboration, Mobile, BI 
Data Management 
RDBMS 
RDBMS 
Infrastructure 
OS & Virtualization, Compute, Storage, Network
Operational vs. Analytical: Enrichment 
Applications, Interactions Warehouse, Analytics
Operational: MongoDB
• First-level Analytics
• Product/Asset Catalogs
• Security & Fraud
• Internet of Things
• Mobile Apps
• Customer Data Mgmt
• Single View
• Social
• Churn Analysis
• Recommender
• Warehouse & ETL
• Risk Modeling
• Trade Surveillance
• Predictive Analytics
• Ad Targeting
• Sentiment Analysis
Analytical: Hadoop
• First-level Analytics
• Product/Asset Catalogs
• Security & Fraud
• Internet of Things
• Mobile Apps
• Customer Data Mgmt
• Single View
• Social
• Churn Analysis
• Recommender
• Warehouse & ETL
• Risk Modeling
• Trade Surveillance
• Predictive Analytics
• Ad Targeting
• Sentiment Analysis
Operational vs. Analytical: Lifecycle
• First-level Analytics
• Product/Asset Catalogs
• Security & Fraud
• Internet of Things
• Mobile Apps
• Customer Data Mgmt
• Single View
• Social
• Churn Analysis
• Recommender
• Warehouse & ETL
• Risk Modeling
• Trade Surveillance
• Predictive Analytics
• Ad Targeting
• Sentiment Analysis
MongoDB & Hadoop Use Cases
Commerce
Applications powered by MongoDB
• Products & Inventory
• Recommended products
• Customer profile
• Session management
Analysis powered by Hadoop
• Elastic pricing
• Recommendation models
• Predictive analytics
• Clickstream history
Linked by the MongoDB Connector for Hadoop
Insurance
Applications powered by MongoDB
• Customer profiles
• Insurance policies
• Session data
• Call center data
Analysis powered by Hadoop
• Customer action analysis
• Churn analysis
• Churn prediction
• Policy rates
Linked by the MongoDB Connector for Hadoop
Fraud Detection
[Diagram labels: Payments, Online payments processing, Fraud Detection, 3rd Party Data Sources, Results Cache, Nightly Analysis, Fraud modeling, MongoDB Connector for Hadoop; two connections are marked "query only"]
MongoDB Connector for Hadoop
Connector Overview
• Data
– Read/Write MongoDB
– Read/Write BSON
• Tools
– MapReduce
– Pig
– Hive
– Spark
• Platforms
– Apache Hadoop
– Cloudera CDH
– Hortonworks HDP
– Amazon EMR
Connector Features and Functionality 
• Computes splits to read data 
– Single Node, Replica Sets, Sharded Clusters 
• Mappings for Pig and Hive 
– MongoDB as a standard data source/destination 
• Support for 
– Filtering data with MongoDB queries 
– Authentication 
– Reading from Replica Set tags 
– Appending to existing collections
MapReduce Configuration
• MongoDB input
– mongo.job.input.format = com.mongodb.hadoop.MongoInputFormat
– mongo.input.uri = mongodb://mydb:27017/db1.collection1
• MongoDB output
– mongo.job.output.format = com.mongodb.hadoop.MongoOutputFormat
– mongo.output.uri = mongodb://mydb:27017/db1.collection2
• BSON input/output
– mongo.job.input.format = com.mongodb.hadoop.BSONFileInputFormat
– mapred.input.dir = hdfs:///tmp/database.bson
– mongo.job.output.format = com.mongodb.hadoop.BSONFileOutputFormat
– mapred.output.dir = hdfs:///tmp/output.bson
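The properties above can be assembled programmatically before a job is submitted. The following is a minimal sketch rather than part of the connector: the helper names mongo_job_properties and as_hadoop_args are hypothetical, and the optional mongo.input.query property is assumed to carry the filter as a JSON string.

```python
import json


def mongo_job_properties(input_uri, output_uri, query=None):
    """Build connector properties for a job that reads from and writes to MongoDB.

    Hypothetical helper: the property names come from the slide above,
    plus mongo.input.query for the optional read filter.
    """
    props = {
        "mongo.job.input.format": "com.mongodb.hadoop.MongoInputFormat",
        "mongo.input.uri": input_uri,
        "mongo.job.output.format": "com.mongodb.hadoop.MongoOutputFormat",
        "mongo.output.uri": output_uri,
    }
    if query is not None:
        # Assumption: the filter is passed as a JSON-encoded query document.
        props["mongo.input.query"] = json.dumps(query)
    return props


def as_hadoop_args(props):
    """Render the properties as -Dkey=value arguments, sorted for stable output."""
    return ["-D{}={}".format(key, value) for key, value in sorted(props.items())]


if __name__ == "__main__":
    props = mongo_job_properties(
        "mongodb://mydb:27017/db1.collection1",
        "mongodb://mydb:27017/db1.collection2",
        query={"age": {"$gt": 30}},
    )
    for arg in as_hadoop_args(props):
        print(arg)
```

Keeping the properties in one dictionary makes it easy to reuse the same settings for a MapReduce job or a Spark Configuration object.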
Pig Mappings
• Input: BSONLoader and MongoLoader
data = LOAD 'mongodb://mydb:27017/db.collection' USING com.mongodb.hadoop.pig.MongoLoader;
• Output: BSONStorage and MongoInsertStorage
STORE records INTO 'hdfs:///output.bson' USING com.mongodb.hadoop.pig.BSONStorage;
Hive Support
CREATE TABLE mongo_users (id int, name string, age int)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES('mongo.columns.mapping' = '{"id":"_id","name":"name","age":"age"}')
TBLPROPERTIES('mongo.uri' = 'mongodb://host:27017/test.users');
• Access collections as Hive tables
• Use with MongoStorageHandler or BSONStorageHandler
Spark Usage
• Use with MapReduce input/output formats
• Create Configuration objects with input/output formats and data URI
• Load/save data using SparkContext Hadoop file API
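A rough PySpark sketch of these three steps follows. It assumes a PySpark version that exposes SparkContext.newAPIHadoopRDD and that the connector and MongoDB Java driver jars are on the Spark classpath; the function name load_mongo_collection and the key/value class choices (Text, MapWritable) are assumptions for illustration, not taken from the slides.

```python
def load_mongo_collection(sc, input_uri):
    """Load a MongoDB collection as an RDD through the connector's input format.

    sc is an existing SparkContext. The key and value classes below are an
    assumption about how documents are exposed to Python.
    """
    # Configuration object: here just the data URI for the input collection.
    conf = {"mongo.input.uri": input_uri}
    return sc.newAPIHadoopRDD(
        inputFormatClass="com.mongodb.hadoop.MongoInputFormat",
        keyClass="org.apache.hadoop.io.Text",
        valueClass="org.apache.hadoop.io.MapWritable",
        conf=conf,
    )


# Usage, given a live SparkContext named sc:
#   ratings = load_mongo_collection(sc, "mongodb://mydb:27017/db1.collection1")
#   print(ratings.count())
```

Passing the SparkContext in as a parameter keeps the function free of a hard PySpark import, so it can be exercised with a stand-in object in tests.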
Data Movement
Dynamic queries to MongoDB vs. BSON snapshots in HDFS
• Dynamic queries
– Work with the most recent data
– Put load on the operational database
• BSON snapshots
– Move load to Hadoop
– Add predictable load to MongoDB
Demo
MovieWeb
MovieWeb Components 
• MovieLens dataset 
– 10M ratings, 10K movies, 70K users 
• Python web app to browse movies, recommendations 
– Flask, PyMongo 
• Spark app computes recommendations 
– MLLib collaborative filter 
• Predicted ratings are exposed in web app 
– New predictions collection
MovieWeb Web Application 
• Browse 
– Top movies by ratings count 
– Top genres by movie count 
• Log in to 
– See My Ratings 
– Rate movies 
• What’s missing? 
– Movies You May Like 
– Recommendations
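The "top movies by ratings count" view could be backed by a MongoDB aggregation pipeline along these lines. This is only a sketch: the field name movieid and the collection name ratings are assumptions, since the slides do not show the demo's schema.

```python
def top_movies_pipeline(limit=10):
    """Aggregation pipeline that counts ratings per movie and keeps the top N.

    Assumes each rating document has a "movieid" field (hypothetical name).
    """
    return [
        # One group per movie, counting its rating documents.
        {"$group": {"_id": "$movieid", "count": {"$sum": 1}}},
        # Most-rated movies first.
        {"$sort": {"count": -1}},
        {"$limit": limit},
    ]


# Usage with PyMongo, given a database handle db (collection name assumed):
#   for row in db.ratings.aggregate(top_movies_pipeline(5)):
#       print(row["_id"], row["count"])
```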
Spark Recommender 
• Apache Hadoop 2.3.0 
– HDFS and YARN 
• Spark 1.0 
– Execute within YARN 
– Assign executor resources
• Data 
– From HDFS, MongoDB 
– To MongoDB
MovieWeb Workflow
• Snapshot database as BSON
• Store BSON in HDFS
• Read BSON into Spark app
• Train model from existing ratings
• Create user-movie pairings
• Predict ratings for all pairings
• Write predictions to MongoDB collection
• Web application exposes recommendations
• Repeat the process weekly
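The "create user-movie pairings" step can be illustrated in plain Python. This is a simplified sketch, not the demo's Spark code: it assumes ratings arrive as (user_id, movie_id, rating) tuples and that only movies a user has not yet rated need a prediction; the function name unrated_pairings is hypothetical.

```python
def unrated_pairings(ratings, movie_ids):
    """Return (user_id, movie_id) pairs for movies each user has not rated.

    ratings is an iterable of (user_id, movie_id, rating) tuples.
    """
    rated_by_user = {}
    for user_id, movie_id, _rating in ratings:
        rated_by_user.setdefault(user_id, set()).add(movie_id)

    pairs = []
    for user_id in sorted(rated_by_user):
        for movie_id in sorted(movie_ids):
            if movie_id not in rated_by_user[user_id]:
                pairs.append((user_id, movie_id))
    return pairs


if __name__ == "__main__":
    sample = [(1, 10, 4.0), (1, 11, 3.0), (2, 10, 5.0)]
    print(unrated_pairings(sample, [10, 11, 12]))
```

In the Spark application the same idea would be expressed over distributed collections; the pairs are what the trained model is then asked to score.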
Execution
$ export SPARK_JAR=spark-assembly-1.0.0-hadoop2.4.0.jar
$ export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
$ bin/spark-submit \
    --master yarn-cluster \
    --class com.mongodb.hadoop.demo.Recommender \
    --jars mongo-java-2.12.3.jar,mongo-hadoop-1.3.0.jar \
    --driver-memory 1G \
    --executor-memory 2G \
    --num-executors 4 \
    demo-1.0.jar
Questions? 
• MongoDB Connector for Hadoop 
– http://github.com/mongodb/mongo-hadoop 
• Getting Started with MongoDB and Hadoop 
– http://docs.mongodb.org/ecosystem/tutorial/getting-started-with-hadoop/
• MongoDB-Spark Demo 
– http://github.com/crcsmnky/mongodb-spark-demo
