MongoDB and Hadoop: Driving Business Insights

Partner Technical Solutions, MongoDB
Sandeep Parikh
#MongoDBWorld

Agenda
• Evolving Data Landscape
• MongoDB & Hadoop Use Cases
• MongoDB Connector Features
• Demo

Hadoop
The Apache Hadoop software library is a framework
that allows for the distributed processing of large data
sets across clusters of computers using simple
programming models.
• Terabyte and Petabyte datasets
• Data warehousing
• Advanced analytics

Enterprise IT Stack
• Applications: CRM, ERP, Collaboration, Mobile, BI
• Data Management (Operational, Analytical): RDBMS, EDW
• Infrastructure: OS & Virtualization, Compute, Storage, Network
• Management & Monitoring
• Security & Auditing

Operational vs. Analytical: Enrichment
• Operational: Applications, Interactions
• Analytical: Warehouse, Analytics

Operational: MongoDB
• Real-Time Analytics
• Product/Asset Catalogs
• Security & Fraud
• Internet of Things
• Mobile Apps
• Customer Data Mgmt
• Single View
• Social
• Churn Analysis
• Recommender
• Warehouse & ETL
• Risk Modeling
• Trade Surveillance
• Predictive Analytics
• Ad Targeting
• Sentiment Analysis

Analytical: Hadoop
• Real-Time Analytics
• Product/Asset Catalogs
• Security & Fraud
• Internet of Things
• Mobile Apps
• Customer Data Mgmt
• Single View
• Social
• Warehouse & ETL
• Risk Modeling
• Trade Surveillance
• Predictive Analytics
• Ad Targeting
• Sentiment Analysis

Operational vs. Analytical: Lifecycle
• Real-Time Analytics
• Product/Asset Catalogs
• Security & Fraud
• Internet of Things
• Mobile Apps
• Customer Data Mgmt
• Single View
• Social
• Warehouse & ETL
• Risk Modeling
• Trade Surveillance
• Predictive Analytics
• Ad Targeting
• Sentiment Analysis

Commerce
Applications powered by MongoDB
• Products & Inventory
• Recommended products
• Customer profile
• Session management
Analysis powered by Hadoop
• Elastic pricing
• Recommendation models
• Predictive analytics
• Clickstream history
Connected by the MongoDB Connector for Hadoop

Insurance
Applications powered by MongoDB
• Customer profiles
• Insurance policies
• Session data
• Call center data
Analysis powered by Hadoop
• Customer action analysis
• Churn analysis
• Churn prediction
• Policy rates
Connected by the MongoDB Connector for Hadoop

Fraud Detection
Architecture diagram; labels recovered from the figure:
• Payments
• Online payments processing
• Fraud Detection
• Results Cache
• Fraud modeling
• Nightly Analysis
• 3rd Party Data Sources
• MongoDB Connector for Hadoop
• Two connections labeled "query only"

Connector Overview
• Data: MongoDB (read/write), BSON (read/write)
• Tools: MapReduce, Pig, Hive, Spark
• Platforms: Apache Hadoop, Cloudera CDH, Hortonworks HDP, Amazon EMR

Connector Features and Functionality
• Computes splits to read data
– Single Node, Replica Sets, Sharded Clusters
• Mappings for Pig and Hive
– MongoDB as a standard data source/destination
• Support for
– Filtering data with MongoDB queries
– Authentication
– Reading from Replica Set tags
– Appending to existing collections

MapReduce Configuration
• MongoDB input
– mongo.job.input.format = com.mongodb.hadoop.MongoInputFormat
– mongo.input.uri = mongodb://mydb:27017/db1.collection1
• MongoDB output
– mongo.job.output.format = com.mongodb.hadoop.MongoOutputFormat
– mongo.output.uri = mongodb://mydb:27017/db1.collection2
• BSON input/output
– mongo.job.input.format = com.mongodb.hadoop.BSONFileInputFormat
– mapred.input.dir = hdfs:///tmp/database.bson
– mongo.job.output.format = com.mongodb.hadoop.BSONFileOutputFormat
– mapred.output.dir = hdfs:///tmp/output.bson
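As a rough illustration of how these properties fit together, the Python sketch below assembles them into `-D key=value` arguments of the kind Hadoop's generic option parser accepts. The helper name `mongo_job_args` is hypothetical, the class names assume the connector's `com.mongodb.hadoop` package, and `mongo.input.query` is included only to show the query-filtering feature; verify each property against the connector version in use.

```python
import json


def mongo_job_args(input_uri, output_uri, query=None):
    """Build Hadoop -D arguments for a job that reads from and writes to MongoDB."""
    props = {
        "mongo.job.input.format": "com.mongodb.hadoop.MongoInputFormat",
        "mongo.input.uri": input_uri,
        "mongo.job.output.format": "com.mongodb.hadoop.MongoOutputFormat",
        "mongo.output.uri": output_uri,
    }
    if query is not None:
        # Optional filter, serialized as compact JSON (assumed property name).
        props["mongo.input.query"] = json.dumps(query, separators=(",", ":"))

    args = []
    for key, value in props.items():
        args += ["-D", f"{key}={value}"]
    return args


if __name__ == "__main__":
    for arg in mongo_job_args(
        "mongodb://mydb:27017/db1.collection1",
        "mongodb://mydb:27017/db1.collection2",
        query={"age": {"$gt": 30}},
    ):
        print(arg)
```

Returning a list rather than one joined string leaves shell quoting to the caller, which matters once the query contains braces or `$` characters.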

Pig Mappings
• Input: BSONLoader and MongoLoader
data = LOAD 'mongodb://mydb:27017/db.collection'
using com.mongodb.hadoop.pig.MongoLoader;
• Output: BSONStorage and MongoInsertStorage
STORE records INTO 'hdfs:///output.bson'
using com.mongodb.hadoop.pig.BSONStorage;

Hive Support
CREATE TABLE mongo_users (id int, name string, age int)
STORED BY "com.mongodb.hadoop.hive.MongoStorageHandler"
WITH SERDEPROPERTIES("mongo.columns.mapping" = "_id,name,age")
TBLPROPERTIES("mongo.uri" = "mongodb://host:27017/test.users");
• Access collections as Hive tables
• Use with MongoStorageHandler or BSONStorageHandler

Spark Usage
• Use with MapReduce input/output formats
• Create Configuration objects with input/output formats and data URI
• Load/save data using the SparkContext Hadoop file API

Data Movement
Dynamic queries to MongoDB vs. BSON snapshots in HDFS
• Dynamic queries
– Work with the most recent data
– Put load on the operational database
• Snapshots
– Move load to Hadoop
– Add predictable load to MongoDB
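In configuration terms, the choice between the two approaches comes down to which input format and location a job is given. The sketch below is one way to express that switch; the function name `input_properties` is hypothetical, and the class names assume the connector's `com.mongodb.hadoop` package.

```python
def input_properties(source):
    """Pick connector input settings based on where the data lives.

    A mongodb:// URI means a dynamic query against the live database;
    anything else is treated as a path to a BSON snapshot (for example in HDFS).
    """
    if source.startswith("mongodb://"):
        return {
            "mongo.job.input.format": "com.mongodb.hadoop.MongoInputFormat",
            "mongo.input.uri": source,
        }
    return {
        "mongo.job.input.format": "com.mongodb.hadoop.BSONFileInputFormat",
        "mapred.input.dir": source,
    }


if __name__ == "__main__":
    print(input_properties("mongodb://mydb:27017/db1.collection1"))
    print(input_properties("hdfs:///tmp/database.bson"))
```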

MovieWeb Components
• MovieLens dataset
– 10M ratings, 10K movies, 70K users
• Python web app to browse movies and recommendations
– Flask, PyMongo
• Spark app computes recommendations
– MLlib collaborative filter
• Predicted ratings are exposed in web app
– New predictions collection

MovieWeb Web Application
• Browse
– Top movies by ratings count
– Top genres by movie count
• Log in to
– See My Ratings
– Rate movies
• What’s missing?
– Movies You May Like
– Recommendations
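The "top movies by ratings count" view could plausibly be backed by a MongoDB aggregation. The slides do not show the schema, so the field name `movieid` and the output field `ratings_count` below are assumptions, as is the helper name `top_movies_pipeline`.

```python
def top_movies_pipeline(limit=10, movie_field="movieid"):
    """Return an aggregation pipeline ranking movies by number of ratings."""
    return [
        # One output document per movie, counting its rating documents.
        {"$group": {"_id": "$" + movie_field, "ratings_count": {"$sum": 1}}},
        # Most-rated movies first.
        {"$sort": {"ratings_count": -1}},
        {"$limit": limit},
    ]


if __name__ == "__main__":
    for stage in top_movies_pipeline(limit=5):
        print(stage)
```

With PyMongo, the list would be passed to a collection's `aggregate()` method, for example `db.ratings.aggregate(top_movies_pipeline())`, where `ratings` is an assumed collection name.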

Spark Recommender
• Apache Hadoop 2.3.0
– HDFS and YARN
• Spark 1.0
– Execute within YARN
– Assign executor resources
• Data
– From HDFS, MongoDB
– To MongoDB

MovieWeb Workflow
• Snapshot database as BSON
• Store BSON in HDFS
• Read BSON into Spark app
• Train model from existing ratings
• Create user-movie pairings
• Predict ratings for all pairings
• Write predictions to MongoDB collection
• Web application exposes recommendations
• Repeat the process weekly
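The pairing and prediction steps can be pictured in plain Python, independent of Spark. This is only a sketch of the idea, not the demo's code: the field names `userid` and `movieid`, both function names, and the choice to skip pairs that already have a rating are all assumptions.

```python
from itertools import product


def unrated_pairs(ratings, user_ids, movie_ids):
    """Yield (user, movie) pairs that have no existing rating."""
    rated = {(r["userid"], r["movieid"]) for r in ratings}
    for pair in product(user_ids, movie_ids):
        if pair not in rated:
            yield pair


def top_predictions(predictions, n=10):
    """Keep the n highest predicted ratings per user.

    predictions: iterable of (user, movie, predicted_rating) tuples.
    """
    by_user = {}
    for user, movie, rating in predictions:
        by_user.setdefault(user, []).append((movie, rating))
    return {
        user: sorted(items, key=lambda mr: mr[1], reverse=True)[:n]
        for user, items in by_user.items()
    }


if __name__ == "__main__":
    existing = [{"userid": 1, "movieid": 10, "rating": 4.0}]
    print(list(unrated_pairs(existing, [1, 2], [10, 20])))
```

At the dataset's scale (roughly 70K users and 10K movies) the full cross product is hundreds of millions of pairs, which is why the deck runs this step in Spark rather than in a single process.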

Execution
$ export SPARK_JAR=spark-assembly-1.0.0-hadoop2.3.0.jar
$ export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
# spark-submit options must precede the application JAR;
# anything after it is passed to the application as arguments.
$ bin/spark-submit \
    --master yarn-cluster \
    --class com.mongodb.hadoop.demo.Recommender \
    --jars mongo-java-2.12.2.jar,mongo-hadoop-1.2.1.jar \
    --driver-memory 1G \
    --executor-memory 2G \
    --num-executors 4 \
    demo-1.0.jar
