What's the Scoop on MongoDB & Hadoop
Jake Angerman, Sr. Solutions Architect, MongoDB
MongoDB Evenings Dallas
March 30, 2016 at the Addison Treehouse, Dallas, TX
7. Hadoop
A framework for distributed processing of large data sets
• Terabyte and petabyte datasets
• Data warehousing
• Advanced analytics
• Not a database
• No indexes
• Batch processing
11. Commerce Use Case
Applications powered by MongoDB
• Products & Inventory
• Recommended products
• Customer profile
• Session management
Analysis powered by Hadoop
• Elastic pricing
• Recommendation models
• Predictive analytics
• Clickstream history
MongoDB Connector for Hadoop
12. Insurance Use Case
Applications powered by MongoDB
• Customer profiles
• Insurance policies
• Session data
• Call center data
Analysis powered by Hadoop
• Customer action analysis
• Churn analysis
• Churn prediction
• Policy rates
MongoDB Connector for Hadoop
13. Fraud Detection Use Case
Diagram labels:
• Payments
• Fraud modeling
• Nightly Analysis
• MongoDB Connector for Hadoop
• Results Cache
• Online payments processing
• 3rd Party Data Sources
• Fraud Detection
• Two connections marked "query only"
17. MongoDB Connector for Hadoop
Applications
• Low latency
• Rich fast querying
• Flexible indexing
• Aggregations in database
• Known data relationships
• Great for any subset of data
Distributed Analytics
• Longer jobs
• Batch analytics
• Highly parallel processing
• Unknown data relationships
• Great for looking at all data or large subsets
19. MongoDB Data Operations Spectrum
• Document Retrieval – 1 ms if in cache, ~10 ms from spinning disk
• .find() – per-document cost similar to single document
– _id range
– any secondary index range, can be composite key
– intersect two indexes
– covered indexes even faster
• .count(), .distinct(), .group() – fast, may be covered
• .aggregate() – retrieval cost like find, plus pipeline operations
– $match, $group
– $project, $redact
• .mapReduce() – in-database JavaScript
• Hadoop Connector
– mongo.input.query for indexed partial scan
– full scan
Faster → Slower
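The difference between the two Hadoop Connector modes comes down to whether a query is passed to the connector. The connector reads its input settings from Hadoop configuration properties, and the sketch below assembles `mongo.input.uri` and `mongo.input.query` as `-D` arguments in Python. The helper name `connector_args` and the `deviceId` filter are illustrative assumptions and do not come from the talk; the URI reuses the `sensor.logs` namespace from the Pig example.

```python
import json


def connector_args(uri, query=None):
    """Build Hadoop -D arguments for the connector's input settings.

    Hypothetical helper for illustration. With a query, the job reads only
    matching documents (a partial scan that can use an index if one exists
    on the filtered field). Without one, the job scans the whole collection.
    """
    args = ["-D", f"mongo.input.uri={uri}"]
    if query is not None:
        # The filter is passed as a JSON document.
        args += ["-D", f"mongo.input.query={json.dumps(query)}"]
    return args


# Assumed filter: readings for devices 100 through 199 only.
partial = connector_args(
    "mongodb://127.0.0.1:27017/sensor.logs",
    {"deviceId": {"$gte": 100, "$lt": 200}},
)
full = connector_args("mongodb://127.0.0.1:27017/sensor.logs")

print(partial)
print(full)
```

The returned list could be spliced into a `hadoop jar ...` command line; the job jar and class names are left out because they depend on the job being run.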
23. MetLife – Single View
Applications: Single CSR Application, Unified Customer Portal, Operational Reporting, …
Source silos, each labeled Cards: Silo 1, Silo 2, … Silo N
Pub-sub/ETL
Operational Data Layer
• Insurance policies
• Demographic data
• Customer web data
• Call center data
DW/Data Lake
• Churn prediction algorithms
• Customer Clustering
• Churn Analysis
• Predictive analytics
• …
MongoDB Connector for Hadoop
24. Foursquare
• k-nearest neighbor problems
– similarity of venues, people, or brands
• MongoDB data has advantages when used with MapReduce
– log files can be stale
– log files may not contain as much information
– you can scan much less data
BSON dump
MongoDB Connector for Hadoop
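As a rough illustration of the k-nearest neighbor idea named on this slide, the sketch below ranks venues by cosine similarity of feature vectors. It is not Foursquare's method: the venue names, the three-element vectors, and the choice of cosine similarity are all assumptions made for the example, and it runs in memory rather than as a MapReduce job.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def k_nearest(target, candidates, k):
    """Return the names of the k candidates most similar to target."""
    ranked = sorted(
        candidates.items(),
        key=lambda item: cosine(target, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Made-up feature vectors, e.g. visit counts from three user segments.
venues = {
    "cafe_a": [9, 1, 0],
    "cafe_b": [8, 2, 1],
    "gym": [0, 1, 9],
    "bar": [1, 9, 2],
}

target = venues.pop("cafe_a")
neighbors = k_nearest(target, venues, k=2)
print(neighbors)
```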
32. Pig
• High-level platform for creating MapReduce programs
• Pig Latin abstracts Java into easier-to-use notation
• Executed as a series of MapReduce applications
• Supports user-defined functions (UDFs)
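33.
-- Assumes the mongo-hadoop and MongoDB Java driver jars are already REGISTERed.
-- Load sensor readings from MongoDB with an explicit schema.
samples = LOAD 'mongodb://127.0.0.1:27017/sensor.logs'
    USING com.mongodb.hadoop.pig.MongoLoader('deviceId:int,value:double');
-- Group the readings by device and compute the mean value per device.
grouped = GROUP samples BY deviceId;
sample_stats = FOREACH grouped {
    mean = AVG(samples.value);
    GENERATE group AS deviceId, mean AS mean;
}
-- Write the per-device means back to MongoDB.
STORE sample_stats INTO 'mongodb://127.0.0.1:27017/sensor.stats'
    USING com.mongodb.hadoop.pig.MongoStorage;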
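For readers unfamiliar with Pig Latin, the same group-and-average logic can be sketched in plain Python over in-memory records. This only mirrors what the script computes; it does not touch MongoDB or Hadoop, and the sample readings and the function name `mean_by_device` are made up for illustration.

```python
from collections import defaultdict


def mean_by_device(samples):
    """Group readings by deviceId and return the mean value per device."""
    grouped = defaultdict(list)
    for doc in samples:
        grouped[doc["deviceId"]].append(doc["value"])
    return [
        {"deviceId": device_id, "mean": sum(values) / len(values)}
        for device_id, values in sorted(grouped.items())
    ]


# Made-up readings shaped like the deviceId:int, value:double schema.
samples = [
    {"deviceId": 1, "value": 20.0},
    {"deviceId": 1, "value": 22.0},
    {"deviceId": 2, "value": 5.0},
]

sample_stats = mean_by_device(samples)
print(sample_stats)
```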
34. Hive
• Data warehouse infrastructure built on top of Hadoop
• Provides data summarization, query, and analysis
• HiveQL is a SQL-like query language
• Support for user-defined functions (UDFs)
38. Spark
An engine for processing Hadoop data. Can perform MapReduce in addition to streaming, interactive queries, and machine learning.
• Powerful built-in transformations and actions
– transformations: map, reduceByKey, union, distinct, sample, intersection, and more
– actions: foreach, count, collect, take, and many more
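To show what a map followed by reduceByKey computes, here is a plain-Python imitation on a small list. It does not use the Spark API and runs on one machine; the helper `reduce_by_key` and the event names are assumptions for illustration. In Spark itself, transformations are evaluated lazily and only an action such as count or collect triggers the computation.

```python
from functools import reduce
from itertools import groupby


def reduce_by_key(pairs, func):
    """Combine the values that share a key using func.

    Imitates the result of reduceByKey; func should be associative and
    commutative, as the real operation requires.
    """
    ordered = sorted(pairs, key=lambda kv: kv[0])
    return [
        (key, reduce(func, (value for _, value in group)))
        for key, group in groupby(ordered, key=lambda kv: kv[0])
    ]


# Made-up clickstream events.
events = ["view", "click", "view", "buy", "view", "click"]

# "map": turn each event into a (key, 1) pair.
pairs = [(event, 1) for event in events]

# "reduceByKey": sum the ones per event type.
counts = reduce_by_key(pairs, lambda a, b: a + b)

# "count": number of input records.
total = len(events)

print(counts, total)
```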
47. More Complete EDM Architecture & Data Lake
Diagram labels:
• Inputs: siloed source databases, external feeds (batch), streams
• Data processing pipeline: pub-sub, ETL, file imports; stream processing
• Operational Applications & Reporting (Drivers & Stacks): Single CSR Application, Unified Digital Apps, Operational Reporting, …
• Distributed Processing (Data Lake): Analytic Reporting, Customer Clustering, Churn Analysis, Predictive Analytics, …
• Downstream Systems
Callouts:
• Optimal location for providing operational response times & slices
• Governance to choose where to load and process data
• Can run processing on all data or slices
Stream icon from: https://en.wikipedia.org/wiki/File:Activity_Streams_icon.png
48. Code “JakeAngerman” gets 25% off
Super Early Bird Registration Ends March 25, 2016
June 28 - 29, 2016
New York, NY
www.mongodbworld.com