Blazing Fast Analytics with MongoDB & Spark


  1. Blazing Fast Analytics with MongoDB & Spark
  2. Muthu Chinnasamy, Senior Solutions Architect, muthu@mongodb.com, Twitter: @MuthuMongo
  3. Agenda: The data challenge • Spark • Use Cases • Connectors • Demo
  4. "Every two days now we create as much information as we did from the dawn of civilization up until 2003." Eric Schmidt, 2010
  5. "Apache Spark is the Taylor Swift of big data software." Derrick Harris, Fortune
  6. What is Spark? A fast and general computing engine for clusters • Makes it easy and fast to process large datasets • APIs in Java, Scala, Python, R • Libraries for SQL, streaming, machine learning, graph processing • It's fundamentally different from what's come before
  7. Why not just use Hadoop? • Spark is FAST – Faster to write – Faster to run • Up to 100x faster than Hadoop in memory • 10x faster on disk
  8. A visual comparison: Hadoop, Spark
  9. RDD Operations • Transformations: map, filter, flatMap, mapPartitions, sample, union, join, groupByKey, reduceByKey • Actions: reduce, collect, count, save, lookupKey, take, foreach
  10. Spark higher-level libraries, on top of Spark: Spark SQL, Spark Streaming, MLlib, GraphX
  11. Spark + MongoDB
  12. Data Management • OLTP: applications, fine-grained operations, low latency • Offline Processing: analytics, data warehousing, high throughput
  13. Spark + MongoDB top use cases: – Business Intelligence – Data Warehousing – Recommendation – Log processing – User Facing Services – Fraud detection
  14. MongoDB and Spark
  15. Spark reading directly from MongoDB
  16. Aggregation pipeline to pre-filter • Aggregation pipeline filter: $match
  17. Spark writing directly to MongoDB
  18. Fraud Detection: "I'm so in love!" "Me, too <3" "Now send me your CC number?" "Ok, XXXX-123-zzz" $$$
  19. Fraud Detection
  20. Sharing Workloads (diagram labels): Chat App: Login, User Profile, Contacts, Messages, … • HDFS: Archiving • Spark: Data Crunching, Fraud Detection, Segmentation, Recommendations
  21. MongoDB + Spark Connector
  22. MongoDB Spark Connector: https://spark-packages.org/?q=official+mongodb
  23. MongoDB Spark Connector (diagram labels: MongoDB Shard, MongoDB Spark Connector, Spark): https://github.com/mongodb/mongo-spark
  24. Spark Streaming
  25. Spark Streaming: Twitter Feed, Spark
  26. Spark Streaming: Twitter Feed { "statuses": [ { "coordinates": null, "favorited": false, "truncated": false, "created_at": "Mon Sep 24 03:35:21 +0000 2012", "id_str": "250075927172759552", "entities": { "urls": [ ], "hashtags": [ { "text": "freebandnames", "indices": [ 20, 34 ] } ], "user_mentions": [] } } ] }
  27. Spark Streaming: { "statuses": [ { "coordinates": null, "favorited": false, "truncated": false, "created_at": "Mon Sep 24 03:35:21 +0000 2012", "id_str": "250075927172759552", "entities": { "urls": [ ], "hashtags": [ { "text": "freebandnames", "indices": [ 20, 34 ] } ], "user_mentions": [] } } ] } { "time": "Mon Sep 24 03:35", "freebandnames": 1 } { "statuses": [ { "coordinates": null, "favorited": false, "truncated": false, "created_at": "Mon Sep 24 03:35:21 +0000 2012", "id_str": "250075927172759552", "entities": { "urls": [ ], "hashtags": [ { "text": "freebandnames", "indices": [ 20, 34 ] } ], "user_mentions": [] } } ] } { "statuses": [ { "coordinates": null, "favorited": false, "truncated": false, "created_at": "Mon Sep 24 03:35:21 +0000 2012", "id_str": "250075927172759552", "entities": { "urls": [ ], "hashtags": [ { "text": "freebandnames", "indices": [ 20, 34 ] } ], "user_mentions": [] } } ] } { "statuses": [ { "coordinates": null, "favorited": false, "truncated": false, "created_at": "Mon Sep 24 03:35:21 +0000 2012", "id_str": "250075927172759552", "entities": { "urls": [ ], "hashtags": [ { "text": "freebandnames", "indices": [ 20, 34 ] } ], "user_mentions": [] } } ] } { "time": "Mon Sep 24 03:35", "freebandnames": 4 } Spark
  28. MongoDB and Spark Streaming feature: Capped Collection { "time": "Mon Sep 24 03:35", "freebandnames": 4 } { "time": "Mon Nov 5 09:40", "mongoDBLondon": 400 } { "time": "Mon Nov 5 11:50", "spark": 7556 } { "time": "Mon Nov 24 12:50", "itshappening": 100 } Tailable Cursor
  29. MongoDB + Spark MLlib Demo
  30. Collaborative Filtering • Two parts • Collaborative: using rating preferences from several users • Filtering: recommend preferences • Movie ratings as a matrix (UserId / MovieId; columns: Star Wars, Toy Story, Frozen): Buzz 4 4 5; Woody 5 4; Jessie 5 ?
  31. MLlib ALS • Approximate the ratings matrix into user and movie latent factor matrices • Ratings (UserId / MovieId; columns: Frozen, Toy Story, Star Wars): Buzz 4 4 5; Woody 5 4; Jessie 5 • User factors: Buzz (x, y), Woody (x, y), Jessie (x, y) • Movie factors: Star Wars (x, y), Toy Story (x, y), Frozen (x, y) • Labels: f(i), f(j), r_ij
  32. Prediction Process • Load movie ratings data from MongoDB • Reflect and infer the input formats for the ALS algorithm • Split the data – 80% for training and 20% for validating the model • Calculate the best model using the ALS algorithm – Build/train a User Movie matrix model • Combine the data with user preferences and retrain the model
  33. Explore as a Databricks Notebook: http://cdn2.hubspot.net/hubfs/438089/notebooks/MongoDB_guest_blog/Using_MongoDB_Connector_for_Spark.html
  34. MongoDB + Spark Case Study
  35. China Eastern Airlines – Fare Engine: 130K seats, 180 million fares & 1.6 billion daily searches
  36. Spark and MongoDB • An extremely powerful combination • Many possible use cases • Some operations are actually faster if performed using the Aggregation Framework • Evolving all the time
  37. Questions? Muthu Chinnasamy, muthu@mongodb.com, @muthumongo
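
Slides 26 to 28 show tweet payloads being reduced to one summary document per minute, such as { "time": "Mon Sep 24 03:35", "freebandnames": 4 }. The deck does not include the streaming job itself, so the following is only a minimal sketch of that reduction in plain Python, not Spark Streaming code. The function names `minute_bucket` and `count_hashtags` are hypothetical, and the sample payload assumes the tweet shape from the slides with only the fields the sketch needs.

```python
import json
from collections import Counter
from datetime import datetime

# Timestamp layout used by "created_at" in the slide's tweet payload.
TWEET_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def minute_bucket(created_at):
    """Truncate a tweet timestamp to the minute, e.g. 'Mon Sep 24 03:35'."""
    parsed = datetime.strptime(created_at, TWEET_TIME_FORMAT)
    return parsed.strftime("%a %b %d %H:%M")


def count_hashtags(payloads):
    """Return one summary document per minute with a count per hashtag.

    Hypothetical helper: a hashtag literally named "time" would collide
    with the "time" key, which this sketch does not handle.
    """
    counts = {}
    for payload in payloads:
        for status in json.loads(payload)["statuses"]:
            bucket = counts.setdefault(minute_bucket(status["created_at"]), Counter())
            for tag in status["entities"]["hashtags"]:
                bucket[tag["text"]] += 1
    return [{"time": minute, **dict(tags)} for minute, tags in counts.items()]


# Assumed sample: the slide's tweet, reduced to the fields used above.
sample = json.dumps({
    "statuses": [{
        "created_at": "Mon Sep 24 03:35:21 +0000 2012",
        "entities": {"hashtags": [{"text": "freebandnames", "indices": [20, 34]}]},
    }]
})

summary = count_hashtags([sample] * 4)
print(summary)
```

In the architecture the slides describe, documents of this shape would be written to a capped collection and read back with a tailable cursor; that part is not sketched here.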
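
Slides 30 to 32 describe approximating the ratings matrix with user and movie latent factors using ALS. To make the alternating update concrete, here is a toy rank-1 alternating least squares in plain Python. It is a sketch under stated assumptions, not MLlib's implementation: the rating placements for Woody and Jessie are assumed (the slide's table layout did not survive extraction), the regularization value is arbitrary, and `als_rank1` and `objective` are hypothetical names.

```python
REG = 0.1  # assumed regularization strength, chosen only for illustration

# Assumed placements: the original table layout is not recoverable.
ratings = {
    ("Buzz", "Star Wars"): 4.0,
    ("Buzz", "Toy Story"): 4.0,
    ("Buzz", "Frozen"): 5.0,
    ("Woody", "Star Wars"): 5.0,
    ("Woody", "Toy Story"): 4.0,
    ("Jessie", "Frozen"): 5.0,
}


def objective(ratings, x, y, reg=REG):
    """Regularized squared error over the observed ratings."""
    squared = sum((r - x[u] * y[m]) ** 2 for (u, m), r in ratings.items())
    penalty = sum(v * v for v in x.values()) + sum(v * v for v in y.values())
    return squared + reg * penalty


def als_rank1(ratings, n_iters=20, reg=REG):
    """Alternate closed-form updates of one factor per user and per movie."""
    users = sorted({u for u, _ in ratings})
    movies = sorted({m for _, m in ratings})
    x = {u: 1.0 for u in users}   # user factors
    y = {m: 1.0 for m in movies}  # movie factors
    for _ in range(n_iters):
        # Fix movie factors, solve each user's factor exactly.
        for u in users:
            seen = [(m, r) for (uu, m), r in ratings.items() if uu == u]
            x[u] = sum(r * y[m] for m, r in seen) / (sum(y[m] ** 2 for m, _ in seen) + reg)
        # Fix user factors, solve each movie's factor exactly.
        for m in movies:
            seen = [(u, r) for (u, mm), r in ratings.items() if mm == m]
            y[m] = sum(r * x[u] for u, r in seen) / (sum(x[u] ** 2 for u, _ in seen) + reg)
    return x, y


start_x = {u: 1.0 for u, _ in ratings}
start_y = {m: 1.0 for _, m in ratings}
before = objective(ratings, start_x, start_y)

x, y = als_rank1(ratings)
after = objective(ratings, x, y)

# Predicted rating for a pair that has no observed rating.
prediction = x["Jessie"] * y["Toy Story"]
print(before, after, prediction)
```

Each update minimizes the objective over one block of factors while the other is held fixed, so the objective never increases between iterations. A real run would use more than one latent factor and hold out part of the data for validation, as slide 32 describes.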
