MongoDB and Hadoop: Driving Business Insights

#MongoDB DC
MongoDB and Hadoop
Driving Business Insights
Justin Lee
Software Engineer, MongoDB

Agenda
• Evolving Data Landscape
• MongoDB & Hadoop Use Cases
• MongoDB Connector Features
• Demo

Hadoop
The Apache Hadoop software library is a framework that
allows for the distributed processing of large data sets
across clusters of computers using simple programming
models.
• Terabyte and Petabyte datasets
• Data warehousing
• Advanced analytics

Enterprise IT Stack
• Applications: CRM, ERP, Collaboration, Mobile, BI
• Data Management
– Operational: RDBMS
– Analytical: RDBMS, EDW
• Infrastructure: OS & Virtualization, Compute, Storage, Network
• Management & Monitoring
• Security & Auditing

Operational vs. Analytical: Enrichment
• Operational: Applications, Interactions
• Analytical: Warehouse, Analytics

Operational: MongoDB
• First-level Analytics
• Product/Asset Catalogs
• Security & Fraud
• Internet of Things
• Mobile Apps
• Customer Data Mgmt
• Single View
• Social
• Churn Analysis
• Recommender
• Warehouse & ETL
• Risk Modeling
• Trade Surveillance
• Predictive Analytics
• Ad Targeting
• Sentiment Analysis

Analytical: Hadoop
• First-level Analytics
• Product/Asset Catalogs
• Security & Fraud
• Internet of Things
• Mobile Apps
• Customer Data Mgmt
• Single View
• Social
• Churn Analysis
• Recommender
• Warehouse & ETL
• Risk Modeling
• Trade Surveillance
• Predictive Analytics
• Ad Targeting
• Sentiment Analysis

Operational vs. Analytical: Lifecycle
• First-level Analytics
• Product/Asset Catalogs
• Security & Fraud
• Internet of Things
• Mobile Apps
• Customer Data Mgmt
• Single View
• Social
• Churn Analysis
• Recommender
• Warehouse & ETL
• Risk Modeling
• Trade Surveillance
• Predictive Analytics
• Ad Targeting
• Sentiment Analysis

Commerce
Applications (powered by MongoDB)
• Products & Inventory
• Recommended products
• Customer profile
• Session management
Analysis (powered by Hadoop)
• Elastic pricing
• Recommendation models
• Predictive analytics
• Clickstream history
Connected through the MongoDB Connector for Hadoop

Insurance
Applications (powered by MongoDB)
• Customer profiles
• Insurance policies
• Session data
• Call center data
Analysis (powered by Hadoop)
• Customer action analysis
• Churn analysis
• Churn prediction
• Policy rates
Connected through the MongoDB Connector for Hadoop

Fraud Detection
• Online payments processing
• Payments
• 3rd Party Data Sources
• Nightly Analysis: Fraud modeling (MongoDB Connector for Hadoop)
• Results Cache
• Fraud Detection (query only)

Connector Overview
• Data: MongoDB (Read/Write), BSON (Read/Write)
• Tools: MapReduce, Pig, Hive, Spark
• Platforms: Apache Hadoop, Cloudera CDH, Hortonworks HDP, Amazon EMR

Connector Features and Functionality
• Computes splits to read data
– Single Node, Replica Sets, Sharded Clusters
• Mappings for Pig and Hive
– MongoDB as a standard data source/destination
• Support for
– Filtering data with MongoDB queries
– Authentication
– Reading from Replica Set tags
– Appending to existing collections
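As a rough illustration of how filtering, authentication, and tag-aware reads might be combined into one input configuration, here is a minimal Python sketch. The helper `build_input_config` and its parameters are hypothetical. `mongo.input.query` is the connector's query property; the `readPreference` and `readPreferenceTags` options are standard MongoDB connection-string options, and whether a given connector version honors them on the input URI is an assumption to verify.

```python
import json
from urllib.parse import quote_plus


def build_input_config(host, database, collection, query=None,
                       user=None, password=None, tag_set=None):
    """Return connector input properties as a dict (hypothetical helper)."""
    credentials = ""
    if user is not None:
        # Credentials are percent-encoded and embedded in the URI.
        credentials = f"{quote_plus(user)}:{quote_plus(password or '')}@"

    uri = f"mongodb://{credentials}{host}/{database}.{collection}"

    if tag_set:
        # Assumption: tag-aware reads are requested via connection-string options.
        tags = ",".join(f"{key}:{value}" for key, value in tag_set.items())
        uri += f"?readPreference=secondary&readPreferenceTags={tags}"

    config = {
        "mongo.job.input.format": "com.mongodb.hadoop.MongoInputFormat",
        "mongo.input.uri": uri,
    }
    if query is not None:
        # The filter is passed as a JSON-encoded MongoDB query document.
        config["mongo.input.query"] = json.dumps(query, sort_keys=True)
    return config


config = build_input_config(
    "mydb:27017", "db1", "collection1",
    query={"age": {"$gte": 21}},   # example filter, not from the slides
    tag_set={"dc": "east"},        # example tag, not from the slides
)
for name, value in config.items():
    print(f"{name} = {value}")
```

Pushing the filter into MongoDB means only matching documents are shipped to Hadoop, which keeps the extra load on the operational database smaller.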

MapReduce Configuration
• MongoDB input
– mongo.job.input.format = com.mongodb.hadoop.MongoInputFormat
– mongo.input.uri = mongodb://mydb:27017/db1.collection1
• MongoDB output
– mongo.job.output.format = com.mongodb.hadoop.MongoOutputFormat
– mongo.output.uri = mongodb://mydb:27017/db1.collection2
• BSON input/output
– mongo.job.input.format = com.mongodb.hadoop.BSONFileInputFormat
– mapred.input.dir = hdfs:///tmp/database.bson
– mongo.job.output.format = com.mongodb.hadoop.BSONFileOutputFormat
– mapred.output.dir = hdfs:///tmp/output.bson
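The MongoDB input and output properties above can be kept in a standard Hadoop configuration file. The following Python sketch, offered as one possible approach rather than the connector's own tooling, renders them in the usual `<configuration>/<property>` XML layout. The function name `to_hadoop_xml` is hypothetical; the property names and values come from the slide.

```python
import xml.etree.ElementTree as ET


def to_hadoop_xml(properties):
    """Render a dict of properties as Hadoop configuration XML."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Read from one collection and write results to another.
MONGO_TO_MONGO = {
    "mongo.job.input.format": "com.mongodb.hadoop.MongoInputFormat",
    "mongo.input.uri": "mongodb://mydb:27017/db1.collection1",
    "mongo.job.output.format": "com.mongodb.hadoop.MongoOutputFormat",
    "mongo.output.uri": "mongodb://mydb:27017/db1.collection2",
}

xml_text = to_hadoop_xml(MONGO_TO_MONGO)
print(xml_text)
```

Keeping the job's input and output in a single dict makes it easy to swap the MongoDB URIs for BSON files in HDFS without touching the job code.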

Pig Mappings
• Input: BSONLoader and MongoLoader
data = LOAD 'mongodb://mydb:27017/db.collection' using com.mongodb.hadoop.pig.MongoLoader;
• Output: BSONStorage and MongoInsertStorage
STORE records INTO 'hdfs:///output.bson' using com.mongodb.hadoop.pig.BSONStorage;

Hive Support
CREATE TABLE mongo_users (id int, name string, age int)
STORED BY "com.mongodb.hadoop.hive.MongoStorageHandler"
WITH SERDEPROPERTIES("mongo.columns.mapping" = "_id,name,age")
TBLPROPERTIES("mongo.uri" = "mongodb://host:27017/test.users")
• Access collections as Hive tables
• Use with MongoStorageHandler or BSONStorageHandler

Spark Usage
• Use with MapReduce input/output formats
• Create Configuration objects with input/output formats and data URI
• Load/save data using SparkContext Hadoop file API
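The three steps above might look like the following in PySpark. This is a sketch under several assumptions: a PySpark release that provides `newAPIHadoopRDD` and `saveAsNewAPIHadoopFile` (the Python API gained these after Spark 1.0), the connector and driver jars on the classpath, and `Text` / `MapWritable` as workable key and value classes. The function names are hypothetical, and the `SparkContext` is passed in rather than created here.

```python
INPUT_FORMAT = "com.mongodb.hadoop.MongoInputFormat"
OUTPUT_FORMAT = "com.mongodb.hadoop.MongoOutputFormat"
# Assumption: these Writable classes are acceptable for key/value conversion.
KEY_CLASS = "org.apache.hadoop.io.Text"
VALUE_CLASS = "org.apache.hadoop.io.MapWritable"


def load_collection(sc, input_uri):
    """Load a MongoDB collection as an RDD through the Hadoop input format."""
    conf = {"mongo.input.uri": input_uri}
    return sc.newAPIHadoopRDD(INPUT_FORMAT, KEY_CLASS, VALUE_CLASS, conf=conf)


def save_collection(rdd, output_uri):
    """Save an RDD of key/value pairs to MongoDB through the output format."""
    conf = {"mongo.output.uri": output_uri}
    # The path is a placeholder required by the API; the target collection
    # is taken from mongo.output.uri instead.
    rdd.saveAsNewAPIHadoopFile("file:///unused", OUTPUT_FORMAT, conf=conf)
```

With a live context this would be called as `load_collection(sc, "mongodb://mydb:27017/db1.collection1")`, after which ordinary RDD transformations apply.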

Data Movement
Dynamic queries to MongoDB vs. BSON snapshots in HDFS
• Dynamic queries work with the most recent data
• Dynamic queries put load on the operational database
• Snapshots move load to Hadoop
• Snapshots add predictable load to MongoDB

MovieWeb Components
• MovieLens dataset
– 10M ratings, 10K movies, 70K users
• Python web app to browse movies, recommendations
– Flask, PyMongo
• Spark app computes recommendations
– MLlib collaborative filter
• Predicted ratings are exposed in web app
– New predictions collection

MovieWeb Web Application
• Browse
– Top movies by ratings count
– Top genres by movie count
• Log in to
– See My Ratings
– Rate movies
• What’s missing?
– Movies You May Like
– Recommendations

Spark Recommender
• Apache Hadoop 2.3.0
– HDFS and YARN
• Spark 1.0
– Execute within YARN
– Assign executor resources
• Data
– From HDFS, MongoDB
– To MongoDB

MovieWeb Workflow
• Snapshot database as BSON
• Store BSON in HDFS
• Read BSON into Spark app
• Train model from existing ratings
• Create user-movie pairings
• Predict ratings for all pairings
• Write predictions to MongoDB collection
• Web application exposes recommendations
• Repeat the process weekly
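The middle of this workflow (train, create pairings, predict) can be sketched in plain Python. This is not the demo's code: the demo uses the MLlib collaborative filter on Spark, while the sketch below swaps in a much simpler baseline predictor (global mean plus per-user and per-movie offsets) purely to show the data flow. The field names `userid`, `movieid`, and `rating`, the sample ratings, and the 0.5 to 5.0 clamp are all assumptions.

```python
from collections import defaultdict
from itertools import product


def train_baseline(ratings):
    """Fit a global mean plus average per-user and per-movie offsets."""
    global_mean = sum(rating for _, _, rating in ratings) / len(ratings)
    user_dev, movie_dev = defaultdict(list), defaultdict(list)
    for user, movie, rating in ratings:
        user_dev[user].append(rating - global_mean)
        movie_dev[movie].append(rating - global_mean)
    user_bias = {u: sum(d) / len(d) for u, d in user_dev.items()}
    movie_bias = {m: sum(d) / len(d) for m, d in movie_dev.items()}
    return global_mean, user_bias, movie_bias


def all_pairings(ratings):
    """Create every user-movie pairing from the users and movies seen."""
    users = sorted({user for user, _, _ in ratings})
    movies = sorted({movie for _, movie, _ in ratings})
    return list(product(users, movies))


def predict(model, pairings):
    """Build one prediction document per pairing, ready to be inserted."""
    global_mean, user_bias, movie_bias = model
    docs = []
    for user, movie in pairings:
        score = global_mean + user_bias.get(user, 0.0) + movie_bias.get(movie, 0.0)
        # Assumption: ratings are clamped to a 0.5 to 5.0 scale.
        docs.append({"userid": user, "movieid": movie,
                     "rating": min(5.0, max(0.5, score))})
    return docs


# (user, movie, rating) triples standing in for the ratings snapshot.
ratings = [(1, 10, 5.0), (1, 20, 3.0), (2, 10, 4.0)]
predictions = predict(train_baseline(ratings), all_pairings(ratings))
for doc in predictions:
    print(doc)
```

Because every pairing is scored, the output grows with users times movies, which is one reason the demo runs this step as a distributed Spark job rather than inside the web application.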

Execution
$ export SPARK_JAR=spark-assembly-1.0.0-hadoop2.4.0.jar
$ export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
$ bin/spark-submit \
    --master yarn-cluster \
    --class com.mongodb.hadoop.demo.Recommender \
    --jars mongo-java-2.12.3.jar,mongo-hadoop-1.3.0.jar \
    --driver-memory 1G \
    --executor-memory 2G \
    --num-executors 4 \
    demo-1.0.jar

Questions?
• MongoDB Connector for Hadoop
– http://github.com/mongodb/mongo-hadoop
• Getting Started with MongoDB and Hadoop
– http://docs.mongodb.org/ecosystem/tutorial/getting-started-with-hadoop/
• MongoDB-Spark Demo
– http://github.com/crcsmnky/mongodb-spark-demo
