MongoDB and Hadoop

Published in: Technology

  1. Paris. Tugdual Grall, Technical Evangelist, tug@mongodb.com, @tgrall
  2. MongoDB & Hadoop. Tugdual Grall, Technical Evangelist, tug@mongodb.com, @tgrall
  3. Agenda
      • Evolving Data Landscape
      • MongoDB & Hadoop Use Cases
      • MongoDB Connector Features
      • Demo
  4. Evolving Data Landscape
  5. Hadoop: “The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.” (http://hadoop.apache.org)
      • Terabyte and petabyte datasets
      • Data warehousing
      • Advanced analytics
  6. Enterprise IT Stack
  7. Operational vs. Analytical: Enrichment
      • Operational: Applications, Interactions
      • Analytical: Warehouse, Analytics
  8. Operational: MongoDB. First-Level Analytics, Internet of Things, Mobile Apps, Social, Product/Asset Catalog, Security & Fraud, Customer Data Management, Single View, Churn Analysis, Risk Modeling, Trade Surveillance, Sentiment Analysis, Recommender, Warehouse & ETL, Predictive Analytics, Ad Targeting
  9. Analytical: Hadoop. First-Level Analytics, Internet of Things, Mobile Apps, Social, Product/Asset Catalog, Security & Fraud, Customer Data Management, Single View, Churn Analysis, Risk Modeling, Trade Surveillance, Sentiment Analysis, Recommender, Warehouse & ETL, Predictive Analytics, Ad Targeting
  10. Operational & Analytical: Lifecycle. First-Level Analytics, Internet of Things, Mobile Apps, Social, Product/Asset Catalog, Security & Fraud, Customer Data Management, Single View, Churn Analysis, Risk Modeling, Trade Surveillance, Sentiment Analysis, Recommender, Warehouse & ETL, Predictive Analytics, Ad Targeting
  11. MongoDB & Hadoop Use Cases
  12. Commerce
      • Applications (powered by MongoDB): Products & Inventory, Recommended products, Customer profile, Session management
      • Analysis (powered by Hadoop): Elastic pricing, Recommendation models, Predictive analytics, Clickstream history
      • Linked by the MongoDB Connector for Hadoop
  13. Insurance
      • Applications (powered by MongoDB): Customer profiles, Insurance policies, Session data, Call center data
      • Analysis (powered by Hadoop): Customer action analysis, Churn analysis, Churn prediction, Policy rates
      • Linked by the MongoDB Connector for Hadoop
  14. Fraud Detection: Payments, Nightly Analysis, MongoDB Connector for Hadoop, 3rd Party Data Sources, Results Cache, Fraud Detection, Query Only, Query Only
  15. MongoDB Connector for Hadoop
  16. Connector Overview
      • Data: Read/Write MongoDB, Read/Write BSON
      • Tools: MapReduce, Pig, Hive, Spark
      • Platforms: Apache Hadoop, Cloudera CDH, Hortonworks HDP, MapR, Amazon EMR
  17. Connector Features and Functionality
      • Computes splits to read data
      • Single Node, Replica Sets, Sharded Clusters
      • Mappings for Pig and Hive
      • MongoDB as a standard data source/destination
      • Support for
          - Filtering data with MongoDB queries
          - Authentication
          - Reading from Replica Set tags
          - Appending to existing collections
  18. MapReduce Configuration
      • MongoDB input/output
          mongo.job.input.format = com.mongodb.hadoop.MongoInputFormat
          mongo.input.uri = mongodb://mydb:27017/db1.collection1
          mongo.job.output.format = com.mongodb.hadoop.MongoOutputFormat
          mongo.output.uri = mongodb://mydb:27017/db1.collection2
      • BSON input/output
          mongo.job.input.format = com.mongodb.hadoop.BSONFileInputFormat
          mapred.input.dir = hdfs:///tmp/database.bson
          mongo.job.output.format = com.mongodb.hadoop.BSONFileOutputFormat
          mapred.output.dir = hdfs:///tmp/output.bson
  19. Pig Mappings
      • Input: BSONLoader and MongoLoader
          data = LOAD 'mongodb://mydb:27017/db.collection' USING com.mongodb.hadoop.pig.MongoLoader;
      • Output: BSONStorage and MongoInsertStorage
          STORE records INTO 'hdfs:///output.bson' USING com.mongodb.hadoop.pig.BSONStorage;
  20. Hive Support
      • Access collections as Hive tables
      • Use with MongoStorageHandler or BSONStorageHandler
          CREATE TABLE mongo_users (id int, name string, age int)
          STORED BY "com.mongodb.hadoop.hive.MongoStorageHandler"
          WITH SERDEPROPERTIES("mongo.columns.mapping" = "_id,name,age")
          TBLPROPERTIES("mongo.uri" = "mongodb://host:27017/test.users");
  21. Spark
      • Use with MapReduce input/output formats
      • Create Configuration objects with input/output formats and data URI
      • Load/save data using SparkContext Hadoop file API
  22. Data Movement: dynamic queries to MongoDB vs. BSON snapshots in HDFS
      • Dynamic queries work with the most recent data
      • Dynamic queries put load on the operational database
      • Snapshots move load to Hadoop
      • Snapshots add predictable load to MongoDB
  23. Demo: Recommendation Platform
  24. Movie Web
  25. MovieWeb Web Application
      • Browse
          - Top movies by ratings count
          - Top genres by movie count
      • Log in to
          - See My Ratings
          - Rate movies
      • Recommendations
          - Movies You May Like
          - Recommendations
  26. MovieWeb Components
      • MovieLens dataset
          - 10M ratings, 10K movies, 70K users
          - http://grouplens.org/datasets/movielens/
      • Python web app to browse movies, recommendations
          - Flask, PyMongo
      • Spark app computes recommendations
          - MLlib collaborative filter
      • Predicted ratings are exposed in web app
          - New predictions collection
  27. Spark Recommender
      • Apache Hadoop (2.3)
          - HDFS & YARN
          - Top genres by movie count
      • Spark (1.0)
          - Execute within YARN
          - Assign executor resources
      • Data
          - From HDFS, MongoDB
          - To MongoDB
  28. MovieWeb Workflow
      • Snapshot db as BSON
      • Store BSON in HDFS
      • Read BSON into Spark app
      • Train model from existing ratings
      • Create user-movie pairings
      • Predict ratings for all pairings
      • Write predictions to MongoDB collection
      • Web application exposes recommendations
      • Repeat process
  29. Execution
      $ spark-submit --master local \
          --driver-memory 2G --executor-memory 2G \
          --jars mongo-hadoop-core.jar,mongo-java-driver.jar \
          --class com.mongodb.workshop.SparkExercise \
          ./target/spark-1.0-SNAPSHOT.jar \
          hdfs://localhost:9000 mongodb://127.0.0.1:27017/movielens predictions
  30. Should I use MongoDB or Hadoop?
  31. Business First! (What/Why, How): First-Level Analytics, Internet of Things, Mobile Apps, Social, Product/Asset Catalog, Security & Fraud, Customer Data Management, Single View, Churn Analysis, Risk Modeling, Trade Surveillance, Sentiment Analysis, Recommender, Warehouse & ETL, Predictive Analytics, Ad Targeting
  32. The right tool for the task (V1.0)
      • Dataset size
      • Data processing complexity
      • Continuous improvement
  33. The right tool for the task (V2.0)
      • Dataset size
      • Data processing complexity
      • Continuous improvement
  34. Resources / Questions
      • MongoDB Connector for Hadoop
          - http://github.com/mongodb/mongo-hadoop
      • Getting Started with MongoDB and Hadoop
          - http://docs.mongodb.org/ecosystem/tutorial/getting-started-with-hadoop/
      • MongoDB-Spark Demo
          - https://github.com/crcsmnky/mongodb-hadoop-workshop
  35. MongoDB & Hadoop. Tugdual Grall, Technical Evangelist, tug@mongodb.com, @tgrall
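
The MongoDB input/output settings on slide 18 are plain key/value properties, so one way to keep them together is a small helper that builds them and renders them as Hadoop generic `-D key=value` options. This is only a sketch: the property names and class names are copied from the slide, while the helper names `mongo_job_properties` and `to_hadoop_args` are hypothetical, and it assumes the job is launched through a driver that honors generic options.

```python
# Sketch only: property and class names come from slide 18;
# mongo_job_properties and to_hadoop_args are hypothetical helpers.

def mongo_job_properties(input_uri, output_uri):
    """Return the MongoDB input/output properties for a MapReduce job."""
    return {
        "mongo.job.input.format": "com.mongodb.hadoop.MongoInputFormat",
        "mongo.input.uri": input_uri,
        "mongo.job.output.format": "com.mongodb.hadoop.MongoOutputFormat",
        "mongo.output.uri": output_uri,
    }


def to_hadoop_args(properties):
    """Render properties as a flat list of '-D', 'key=value' arguments."""
    args = []
    for key, value in properties.items():
        args.extend(["-D", f"{key}={value}"])
    return args


props = mongo_job_properties(
    "mongodb://mydb:27017/db1.collection1",
    "mongodb://mydb:27017/db1.collection2",
)
hadoop_args = to_hadoop_args(props)
print(" ".join(hadoop_args))
```

Keeping the URIs as parameters makes it easy to point the same job at a different collection without touching the format class names.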
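
The middle of the MovieWeb workflow on slide 28 (train a model, create user-movie pairings, predict a rating for every pairing) can be illustrated without a cluster. The sketch below is not the demo's code: the demo uses an MLlib collaborative filter, whereas this stand-in simply predicts each movie's mean rating. The sample ratings and the `userid`, `movieid`, and `rating` field names are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product

# Toy (user_id, movie_id, rating) triples; the demo reads MovieLens ratings.
ratings = [(1, 10, 4.0), (1, 20, 2.0), (2, 10, 5.0), (3, 20, 3.0)]


def train_movie_means(ratings):
    """Stand-in 'model': the mean rating per movie (not collaborative filtering)."""
    totals = defaultdict(lambda: [0.0, 0])
    for _, movie, rating in ratings:
        totals[movie][0] += rating
        totals[movie][1] += 1
    return {movie: total / count for movie, (total, count) in totals.items()}


def user_movie_pairings(ratings):
    """Pair every known user with every known movie."""
    users = sorted({user for user, _, _ in ratings})
    movies = sorted({movie for _, movie, _ in ratings})
    return list(product(users, movies))


model = train_movie_means(ratings)

# One document per pairing, shaped like rows for a predictions collection
# (field names are hypothetical).
predictions = [
    {"userid": user, "movieid": movie, "rating": model[movie]}
    for user, movie in user_movie_pairings(ratings)
]
print(len(predictions), predictions[0])
```

In the demo's workflow, documents like these would then be written to the predictions collection that the web application reads.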
