4. Hadoop
The Apache Hadoop software library is a framework
that allows for the distributed processing of large data
sets across clusters of computers using simple
programming models.
• Terabyte and petabyte datasets
• Data warehousing
• Advanced analytics
9. Operational vs. Analytical: Lifecycle
Use cases across the operational and analytical lifecycle:
• Real-Time Analytics
• Product/Asset Catalogs
• Security & Fraud
• Internet of Things
• Mobile Apps
• Customer Data Mgmt
• Single View
• Social
• Churn Analysis
• Recommender
• Warehouse & ETL
• Risk Modeling
• Trade Surveillance
• Predictive Analytics
• Ad Targeting
• Sentiment Analysis
16. Connector Features and Functionality
• Computes splits to read data
– Single Node, Replica Sets, Sharded Clusters
• Mappings for Pig and Hive
– MongoDB as a standard data source/destination
• Support for
– Filtering data with MongoDB queries
– Authentication
– Reading from Replica Set tags
– Appending to existing collections
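One way to picture split computation: the key space of a collection is divided into contiguous ranges, and each range becomes one unit of work for a Hadoop task. The sketch below is a simplified illustration in plain Python, not the connector's actual algorithm. `compute_splits` and `split_to_query` are hypothetical names, and the sorted in-memory key list stands in for whatever range information a real deployment provides.

```python
from typing import List, Optional, Tuple

# A split is a half-open key range [lower, upper); None means unbounded.
Split = Tuple[Optional[int], Optional[int]]


def compute_splits(sorted_keys: List[int], docs_per_split: int) -> List[Split]:
    """Divide a sorted list of keys into contiguous half-open ranges.

    Hypothetical helper, for illustration only.
    """
    if docs_per_split < 1:
        raise ValueError("docs_per_split must be at least 1")
    # Every docs_per_split-th key becomes a boundary between two splits.
    boundaries = sorted_keys[docs_per_split::docs_per_split]
    lowers = [None] + boundaries
    uppers = boundaries + [None]
    return list(zip(lowers, uppers))


def split_to_query(split: Split) -> dict:
    """Express a split as a MongoDB-style range filter on _id."""
    lower, upper = split
    bounds = {}
    if lower is not None:
        bounds["$gte"] = lower
    if upper is not None:
        bounds["$lt"] = upper
    return {"_id": bounds} if bounds else {}


# Ten documents with keys 0..9, at most four documents per split.
splits = compute_splits(list(range(10)), docs_per_split=4)
queries = [split_to_query(s) for s in splits]
```

Because the first and last ranges are left unbounded, every document falls into exactly one split, including documents inserted outside the sampled key range.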
18. Pig Mappings
• Input: BSONLoader and MongoLoader
data = LOAD 'mongodb://mydb:27017/db.collection'
using com.mongodb.hadoop.pig.MongoLoader;
• Output: BSONStorage and MongoInsertStorage
STORE records INTO 'hdfs:///output.bson'
using com.mongodb.hadoop.pig.BSONStorage;
19. Hive Support
CREATE TABLE mongo_users (id int, name string, age int)
STORED BY "com.mongodb.hadoop.hive.MongoStorageHandler"
WITH SERDEPROPERTIES("mongo.columns.mapping" = "_id,name,age")
TBLPROPERTIES("mongo.uri" = "mongodb://host:27017/test.users");
• Access collections as Hive tables
• Use with MongoStorageHandler or BSONStorageHandler
20. Spark Usage
• Use with MapReduce input/output formats
• Create Configuration objects with input/output formats and data URI
• Load/save data using SparkContext Hadoop file API
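The three steps above might look like the following in PySpark. This is a sketch under assumptions: the property name `mongo.input.uri`, the input format class `com.mongodb.hadoop.MongoInputFormat`, and the key/value classes reflect common connector usage and should be checked against the connector version in use. `mongo_input_conf` and `load_collection` are hypothetical helper names. `load_collection` needs a live SparkContext with the connector jar on the classpath, so it is defined here but not called.

```python
from typing import Any, Dict

# Assumed connector class names; verify against your connector version.
MONGO_INPUT_FORMAT = "com.mongodb.hadoop.MongoInputFormat"
KEY_CLASS = "org.apache.hadoop.io.Text"
VALUE_CLASS = "org.apache.hadoop.io.MapWritable"


def mongo_input_conf(host: str, database: str, collection: str,
                     port: int = 27017) -> Dict[str, str]:
    """Build the configuration entries that point at one collection."""
    uri = f"mongodb://{host}:{port}/{database}.{collection}"
    # "mongo.input.uri" is the assumed name of the data URI property.
    return {"mongo.input.uri": uri}


def load_collection(sc: Any, conf: Dict[str, str]) -> Any:
    """Load the collection as an RDD through the Hadoop input format.

    sc is an existing pyspark SparkContext.
    """
    return sc.newAPIHadoopRDD(
        inputFormatClass=MONGO_INPUT_FORMAT,
        keyClass=KEY_CLASS,
        valueClass=VALUE_CLASS,
        conf=conf,
    )


# Same host and namespace as the Hive example: database "test", collection "users".
conf = mongo_input_conf("host", "test", "users")
```

With a running cluster, `load_collection(sc, conf)` would return an RDD of key/value pairs, one per document.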
21. Data Movement
Dynamic queries to MongoDB vs. BSON snapshots in HDFS
• Dynamic queries
– Work with the most recent data
– Put load on the operational database
• Snapshots
– Move load to Hadoop
– Add predictable load to MongoDB
24. MovieWeb Components
• MovieLens dataset
– 10M ratings, 10K movies, 70K users
• Python web app to browse movies, recommendations
– Flask, PyMongo
• Spark app computes recommendations
– MLlib collaborative filter
• Predicted ratings are exposed in web app
– New predictions collection
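To make the collaborative filter concrete, here is a toy matrix factorization in plain Python. It is a stand-in for illustration only: MLlib's collaborative filtering uses alternating least squares on a cluster, while this sketch uses stochastic gradient descent on a handful of made-up ratings. All function names, hyperparameters, and the sample data are hypothetical.

```python
import random
from typing import Dict, List, Tuple

Rating = Tuple[int, int, float]  # (user_id, movie_id, rating)
Factors = Dict[int, List[float]]


def train_factors(ratings: List[Rating], rank: int = 2, epochs: int = 300,
                  lr: float = 0.02, reg: float = 0.01,
                  seed: int = 0) -> Tuple[Factors, Factors]:
    """Learn one latent vector per user and per movie with SGD."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    movies = sorted({m for _, m, _ in ratings})
    user_f = {u: [rng.uniform(0.5, 1.0) for _ in range(rank)] for u in users}
    movie_f = {m: [rng.uniform(0.5, 1.0) for _ in range(rank)] for m in movies}
    for _ in range(epochs):
        for u, m, r in ratings:
            pu, qm = user_f[u], movie_f[m]
            err = r - sum(a * b for a, b in zip(pu, qm))
            for k in range(rank):
                pu_k = pu[k]  # keep the old value so both updates use it
                pu[k] += lr * (err * qm[k] - reg * pu_k)
                qm[k] += lr * (err * pu_k - reg * qm[k])
    return user_f, movie_f


def predict(user_f: Factors, movie_f: Factors, user: int, movie: int) -> float:
    """Predicted rating is the dot product of the two latent vectors."""
    return sum(a * b for a, b in zip(user_f[user], movie_f[movie]))


def mean_squared_error(ratings: List[Rating], user_f: Factors,
                       movie_f: Factors) -> float:
    return sum((r - predict(user_f, movie_f, u, m)) ** 2
               for u, m, r in ratings) / len(ratings)


# Made-up ratings: three users, three movies.
sample = [(1, 10, 5.0), (1, 11, 3.0), (2, 10, 4.0),
          (2, 12, 1.0), (3, 11, 2.0), (3, 12, 5.0)]
mse_before = mean_squared_error(sample, *train_factors(sample, epochs=0))
user_f, movie_f = train_factors(sample)
mse_after = mean_squared_error(sample, user_f, movie_f)
```

Once trained, `predict(user_f, movie_f, 1, 12)` gives a rating for a movie that user 1 has not rated, which is the kind of value the predictions collection would hold.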
25. MovieWeb Web Application
• Browse
– Top movies by ratings count
– Top genres by movie count
• Log in to
– See My Ratings
– Rate movies
• What’s missing?
– Movies You May Like
– Recommendations
26. Spark Recommender
• Apache Hadoop 2.3.0
– HDFS and YARN
• Spark 1.0
– Execute within YARN
– Assign executor resources
• Data
– From HDFS, MongoDB
– To MongoDB
27. MovieWeb Workflow
• Snapshot database as BSON
• Store BSON in HDFS
• Read BSON into Spark app
• Train model from existing ratings
• Create user-movie pairings
• Predict ratings for all pairings
• Write predictions to MongoDB collection
• Web application exposes recommendations
• Repeat the process weekly
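The pairing and prediction steps of the workflow can be sketched in plain Python as follows. This is an illustration under assumptions, not the MovieWeb code: the function names and the document field names (`userid`, `movieid`, `rating`) are hypothetical, and the constant predictor stands in for a trained model.

```python
from itertools import product
from typing import Callable, Dict, Iterable, List, Tuple

Pairing = Tuple[int, int]  # (user_id, movie_id)


def all_pairings(user_ids: Iterable[int],
                 movie_ids: Iterable[int]) -> List[Pairing]:
    """Cross every user with every movie."""
    return list(product(user_ids, movie_ids))


def prediction_documents(pairings: Iterable[Pairing],
                         predict: Callable[[int, int], float]) -> List[Dict]:
    """Build one document per pairing, ready to write to a collection.

    Field names are hypothetical.
    """
    return [{"userid": u, "movieid": m, "rating": predict(u, m)}
            for u, m in pairings]


pairs = all_pairings([1, 2], [10, 11])
# A constant predictor stands in for the trained model.
docs = prediction_documents(pairs, lambda user, movie: 3.0)
```

At the dataset size quoted earlier (70K users and 10K movies) the full cross product is about 700 million pairings, so a distributed engine is the natural place for this step.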