MongoDB and Hadoop can work together to solve the big data problems facing today's enterprises. This talk looks at how the two technologies complement each other to support complex analyses and greater intelligence. It then takes a deep dive into the MongoDB Connector for Hadoop, shows how it can be applied to enable new business insights with MapReduce, Pig, and Hive, and demos a Spark application that drives product recommendations.
4. Hadoop
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
• Terabyte and petabyte datasets
• Data warehousing
• Advanced analytics
5. Enterprise IT Stack
• Applications: CRM, ERP, Collaboration, Mobile, BI
• Data Management
– Operational: RDBMS
– Analytical: RDBMS, EDW
• Infrastructure: OS & Virtualization, Compute, Storage, Network
• Management & Monitoring
• Security & Auditing
7. Operational: MongoDB
• First-level Analytics
• Product/Asset Catalogs
• Security & Fraud
• Internet of Things
• Mobile Apps
• Customer Data Mgmt
• Single View
• Social
• Churn Analysis
• Recommender
• Warehouse & ETL
• Risk Modeling
• Trade Surveillance
• Predictive Analytics
• Ad Targeting
• Sentiment Analysis
8. Analytical: Hadoop
• First-level Analytics
• Product/Asset Catalogs
• Security & Fraud
• Internet of Things
• Mobile Apps
• Customer Data Mgmt
• Single View
• Social
• Churn Analysis
• Recommender
• Warehouse & ETL
• Risk Modeling
• Trade Surveillance
• Predictive Analytics
• Ad Targeting
• Sentiment Analysis
9. Operational vs. Analytical: Lifecycle
• First-level Analytics
• Product/Asset Catalogs
• Security & Fraud
• Internet of Things
• Mobile Apps
• Customer Data Mgmt
• Single View
• Social
• Churn Analysis
• Recommender
• Warehouse & ETL
• Risk Modeling
• Trade Surveillance
• Predictive Analytics
• Ad Targeting
• Sentiment Analysis
16. Connector Features and Functionality
• Computes splits to read data
– Single Node, Replica Sets, Sharded Clusters
• Mappings for Pig and Hive
– MongoDB as a standard data source/destination
• Support for
– Filtering data with MongoDB queries
– Authentication
– Reading from Replica Set tags
– Appending to existing collections
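As a rough illustration of how input filtering is configured, the sketch below builds the job properties the connector reads. The property names `mongo.input.uri`, `mongo.output.uri`, and `mongo.input.query` are the connector's; the helper `mongo_job_conf`, the `test.results` output collection, and the `age` filter are hypothetical and exist only for this example.

```python
import json


def mongo_job_conf(input_uri, output_uri, query=None):
    """Build a dict of MongoDB Connector for Hadoop job properties.

    `mongo_job_conf` is a hypothetical helper, not part of the connector.
    """
    conf = {
        "mongo.input.uri": input_uri,    # collection to read from
        "mongo.output.uri": output_uri,  # collection to write to
    }
    if query is not None:
        # The filter is passed to the connector as a JSON string.
        conf["mongo.input.query"] = json.dumps(query)
    return conf


# Example: only read users aged 21 or older (field name is an assumption).
conf = mongo_job_conf(
    "mongodb://host:27017/test.users",
    "mongodb://host:27017/test.results",
    query={"age": {"$gte": 21}},
)
```

The resulting dictionary can be handed to whichever job runner is in use; the filter is applied by MongoDB, so only matching documents are sent to Hadoop.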
18. Pig Mappings
• Input: BSONLoader and MongoLoader
    data = LOAD 'mongodb://mydb:27017/db.collection' USING com.mongodb.hadoop.pig.MongoLoader;
• Output: BSONStorage and MongoInsertStorage
    STORE records INTO 'hdfs:///output.bson' USING com.mongodb.hadoop.pig.BSONStorage;
19. Hive Support
CREATE TABLE mongo_users (id int, name string, age int)
STORED BY "com.mongodb.hadoop.hive.MongoStorageHandler"
WITH SERDEPROPERTIES("mongo.columns.mapping" = "_id,name,age")
TBLPROPERTIES("mongo.uri" = "mongodb://host:27017/test.users");
• Access collections as Hive tables
• Use with MongoStorageHandler or BSONStorageHandler
20. Spark Usage
• Use with MapReduce input/output formats
• Create Configuration objects with input/output formats and data URI
• Load/save data using SparkContext Hadoop file API
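The three bullets above can be sketched in PySpark roughly as follows. This is a sketch under assumptions, not the talk's code: it assumes a PySpark version that provides `newAPIHadoopRDD` and `saveAsNewAPIHadoopFile`, the connector jar on the Spark classpath, and the key/value class names shown, which may need adjusting or converters depending on the connector version. The helper names `load_collection` and `save_collection` are hypothetical.

```python
def load_collection(sc, uri):
    """Read a MongoDB collection as an RDD of (key, document) pairs.

    `sc` is an existing SparkContext; `uri` looks like
    mongodb://host:27017/db.collection.
    """
    return sc.newAPIHadoopRDD(
        inputFormatClass="com.mongodb.hadoop.MongoInputFormat",
        keyClass="org.apache.hadoop.io.Text",          # assumed key type
        valueClass="org.apache.hadoop.io.MapWritable",  # assumed value type
        conf={"mongo.input.uri": uri},
    )


def save_collection(rdd, uri):
    """Write an RDD of (key, document) pairs to a MongoDB collection."""
    rdd.saveAsNewAPIHadoopFile(
        # The Hadoop file API requires a path; the data itself goes to
        # the collection named in mongo.output.uri.
        path="file:///unused",
        outputFormatClass="com.mongodb.hadoop.MongoOutputFormat",
        keyClass="org.apache.hadoop.io.Text",
        valueClass="org.apache.hadoop.io.MapWritable",
        conf={"mongo.output.uri": uri},
    )
```

Passing the SparkContext and RDD in as arguments keeps the configuration visible in one place: the input or output format class plus a single URI property is all each call needs.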
21. Data Movement
Dynamic queries to MongoDB vs. BSON snapshots in HDFS
• Dynamic queries
– Work with the most recent data
– Put load on the operational database
• Snapshots
– Move load to Hadoop
– Add predictable load to MongoDB
24. MovieWeb Components
• MovieLens dataset
– 10M ratings, 10K movies, 70K users
• Python web app to browse movies, recommendations
– Flask, PyMongo
• Spark app computes recommendations
– MLlib collaborative filter
• Predicted ratings are exposed in web app
– New predictions collection
25. MovieWeb Web Application
• Browse
– Top movies by ratings count
– Top genres by movie count
• Log in to
– See My Ratings
– Rate movies
• What’s missing?
– Movies You May Like
– Recommendations
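The "top movies by ratings count" view could be backed by a MongoDB aggregation along these lines. This is only a sketch: the `movieid` field name and the helper functions are assumptions, and the code assumes a PyMongo version in which `aggregate` returns an iterable cursor.

```python
def top_movies_pipeline(limit=10):
    """Aggregation pipeline: count ratings per movie, most-rated first."""
    return [
        # One output document per movie, counting its ratings.
        {"$group": {"_id": "$movieid", "count": {"$sum": 1}}},
        {"$sort": {"count": -1}},
        {"$limit": limit},
    ]


def top_movies(ratings, limit=10):
    """Run the pipeline against `ratings`, a PyMongo collection."""
    return list(ratings.aggregate(top_movies_pipeline(limit)))
```

A Flask view would call `top_movies(db.ratings)` and render the result; "top genres by movie count" follows the same group, sort, and limit shape over the movies collection.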
26. Spark Recommender
• Apache Hadoop 2.3.0
– HDFS and YARN
• Spark 1.0
– Execute within YARN
– Assign executor resources
• Data
– From HDFS, MongoDB
– To MongoDB
27. MovieWeb Workflow
• Snapshot database as BSON
• Store BSON in HDFS
• Read BSON into Spark app
• Train model from existing ratings
• Create user-movie pairings
• Predict ratings for all pairings
• Write predictions to MongoDB collection
• Web application exposes recommendations
• Repeat the process weekly
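The pairing and prediction steps of the workflow can be illustrated in plain Python. This is a small-scale sketch, not the demo's Spark code: the `predict` callable stands in for the trained MLlib model, skipping pairs that already have a rating is an assumption, and the field names `userid`, `movieid`, and `rating` are hypothetical.

```python
from itertools import product


def unrated_pairs(ratings):
    """Return (user, movie) pairs that have no rating yet.

    `ratings` is an iterable of (userid, movieid, rating) tuples.
    """
    rated = {(u, m) for u, m, _ in ratings}
    users = sorted({u for u, _ in rated})
    movies = sorted({m for _, m in rated})
    # Every user-movie combination, minus the ones already rated.
    return [(u, m) for u, m in product(users, movies) if (u, m) not in rated]


def prediction_docs(pairs, predict):
    """Shape predictions as documents for a predictions collection.

    `predict(userid, movieid)` is a stand-in for the trained model.
    """
    return [
        {"userid": u, "movieid": m, "rating": predict(u, m)}
        for u, m in pairs
    ]
```

With roughly 70K users and 10K movies there are on the order of 700 million pairings, which is why the demo runs this step in Spark rather than in a single Python process.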