Big Data: Guidelines and Examples for the Enterprise Decision Maker

This presentation covers how to use MongoDB with Hadoop to leverage big data within your company.

  1. Big Data: Examples and Guidelines for the Enterprise Decision Maker
     Buzz Moschetti, Solutions Architect, MongoDB
     buzz.moschetti@mongodb.com
     #MongoDB
  2. Who is your Presenter?
     • Yes, I use “Buzz” on my business cards
     • Former Investment Bank Chief Architect at JPMorganChase and Bear Stearns before that
     • Over 25 years of designing and building systems
       • Big and small
       • Super-specialized to broadly useful in any vertical
       • “Traditional” to completely disruptive
     • Advocate of language leverage and strong factoring
     • Still programming – using emacs, of course
  3. Agenda
     • (Occasionally) Brutal Truths about Big Data
     • Review of Directed Content Business Architecture
     • A Simple Technical Implementation
  4. Truths
     • Clear definition of Big Data still maturing
     • Efficiently operationalizing Big Data is non-trivial
       • Developing, debugging, understanding MapReduce
       • Cluster monitoring & management, job scheduling/recovery
       • If you thought regular ETL Hell was bad…
     • Big Data is not about math/set accuracy
       • The last 25,000 items in a 25,497,612 set “don’t matter”
     • Big Data questions are best asked periodically
       • “Are we there yet?”
     • Realtime means … realtime
  5. It’s About The Functions, not the Terms
     DON’T ASK:
     • Is this an operations or an analytics problem?
     • Is this online or offline?
     • What query language should we use?
     • What is my integration strategy across tools?
     ASK INSTEAD:
     • Am I incrementally addressing data (esp. writes)?
     • Am I computing a precise answer or a trend?
     • Do I need to operate on this data in realtime?
     • What is my holistic architecture?
  6. What We’re Going to “Build” today
     Realtime Directed Content System
     • Based on what users click, “recommended” content is returned in addition to the target
     • The example is sector-neutral (manufacturing, financial services, retail)
     • System dynamically updates behavior in response to user activity
  7. The Participants and Their Roles
     Directed Content System
     • Customers
     • Content Creators: Generate and tag content from a known domain of tags
     • Management/Strategy: Make decisions based on trends and other summarized data
     • Analysts/Data Scientists: Operate on data to identify trends and develop tag domains
     • Developers/ProdOps: Bring it all together (apps, SDLC, integration, etc.)
  8. Priority #1: Maximizing User value
     Considerations/Requirements
     • Maximize realtime user value and experience
     • Provide management reporting and trend analysis
     • Engineer for Day 2 agility on recommendation engine
     • Provide scrubbed click history for customer
     • Permit low-cost horizontal scaling
     • Minimize technical integration
     • Minimize technical footprint
     • Use conventional and/or approved tools
     • Provide a RESTful service layer
     • …
  9. The Architecture
     [Diagram: mongoDB, Hadoop, App(s), MapReduce]
  10. Complementary Strengths
      [Diagram: mongoDB, Hadoop, App(s), MapReduce]
      mongoDB and App(s):
      • Standard design paradigm (objects, tools, 3rd party products, IDEs, test drivers, skill pool, etc.)
      • Language flexibility (Java, C#, C++, Python, Scala, …)
      • Webscale deployment model
        • appservers, DMZ, monitoring
      • High performance rich shape CRUD
      Hadoop and MapReduce:
      • MapReduce design paradigm
      • Node deployment model
      • Very large set operations
        • Computationally intensive, longer duration
        • Read-dominated workload
  11. “Legacy” Approach: Somewhat unidirectional
      [Diagram: mongoDB, Hadoop, App(s), MapReduce]
      • Extract data from mongoDB and other sources nightly (or weekly)
      • Run analytics
      • Generate reports for people to read
      • Where’s the feedback?
  12. Somewhat better approach
      [Diagram: mongoDB, Hadoop, App(s), MapReduce]
      • Extract data from mongoDB and other sources nightly (or weekly)
      • Run analytics
      • Generate reports for people to read
      • Move important summary data back to mongoDB for consumption by apps
  13. …but the overall problem remains:
      • How do we integrate and operate in realtime upon both periodically generated data and realtime current data?
      • Lackluster integration between OLTP and Hadoop
      • It’s not just about the database: you need a realtime profile and profile update function
  14. The legacy problem in pseudocode

          onContentClick() {
              String[] tags = content.getTags();
              Resource[] r = f1(database, tags);
          }

      • Realtime intraday state not well-handled
      • Baselining is a different problem than click handling
  15. The Right Approach
      • Users have a specific Profile entity
      • The Profile captures trend analytics as baselining information
      • The Profile has per-tag “counters” that are updated with each interaction / click
      • Counters plus baselining are passed to fetch function
      • The fetch function itself could be dynamic!
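The bullets on this slide can be sketched in plain Java. This is only an illustration under stated assumptions, not the presenter's implementation: the names `ProfileSketch`, `FetchFunction` and `DirectedContent` are hypothetical, in-memory maps stand in for the Profile document stored in mongoDB, and the baseline is modeled as a bare array of numbers.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory stand-in for the per-user Profile entity. */
final class ProfileSketch {
    final String user;
    /** Per-tag baseline, written by the periodic (batch) analytics. */
    final Map<String, double[]> baselines = new HashMap<>();
    /** Per-tag counters, updated on every click (the realtime side). */
    final Map<String, Integer> counters = new HashMap<>();

    ProfileSketch(String user) {
        this.user = user;
    }

    void recordClick(String tag) {
        counters.merge(tag, 1, Integer::sum);
    }

    int count(String tag) {
        return counters.getOrDefault(tag, 0);
    }
}

/** The fetch function sees counters and baselining through the Profile. */
interface FetchFunction {
    List<String> fetch(ProfileSketch profile, List<String> tags);
}

final class DirectedContent {
    // Held in a field so it can be replaced at runtime (the "dynamic" part).
    private FetchFunction fetchFunction;

    DirectedContent(FetchFunction fetchFunction) {
        this.fetchFunction = fetchFunction;
    }

    void setFetchFunction(FetchFunction fetchFunction) {
        this.fetchFunction = fetchFunction;
    }

    /** Updates the counters first, then asks the fetch function for content. */
    List<String> onContentClick(ProfileSketch profile, List<String> contentTags) {
        for (String tag : contentTags) {
            profile.recordClick(tag);
        }
        return fetchFunction.fetch(profile, contentTags);
    }
}
```

Compared with the legacy pseudocode, the only structural change is that the Profile travels with the click, so the fetch function can use both the intraday counters and the last batch baseline.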
  16. 24 hours in the life of The System
      • Assume some content has been created and tagged
      • Two systematized tags: Pets & PowerTools
  17. Monday, 1:30AM EST
      [Diagram: mongoDB, Hadoop, App(s), MapReduce]
      • Fetch all user Profiles from mongoDB; load into Hadoop
      • Or skip if using the mongoDB-Hadoop connector!
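  18. mongoDB-Hadoop MapReduce Example

          import java.io.IOException;
          import java.util.Date;
          import java.util.List;
          import java.util.Set;

          import org.apache.hadoop.io.IntWritable;
          import org.apache.hadoop.mapreduce.Mapper;
          import org.bson.BSONObject;

          public class ProfileMapper
                  extends Mapper<Object, BSONObject, IntWritable, IntWritable> {
              @Override
              public void map(final Object pKey, final BSONObject pValue, final Context pContext)
                      throws IOException, InterruptedException {
                  // Called once per user Profile document.
                  String user = (String) pValue.get("user");      // read for illustration; unused below
                  Date d1 = (Date) pValue.get("lastUpdate");      // read for illustration; unused below
                  BSONObject tags = (BSONObject) pValue.get("tags");
                  Set<String> keys = tags.keySet();
                  if (keys.isEmpty()) {
                      return; // nothing to count; also avoids dividing by zero
                  }
                  int count = 0;
                  for (String tag : keys) {
                      // Each tag holds a "hist" array of click entries.
                      BSONObject tagData = (BSONObject) tags.get(tag);
                      count += ((List<?>) tagData.get("hist")).size();
                  }
                  int avg = count / keys.size();
                  pContext.write(new IntWritable(count), new IntWritable(avg));
              }
          }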
  19. Monday, 1:45AM EST
      [Diagram: mongoDB, Hadoop, App(s), MapReduce]
      • Grind through all content data and user Profile data to produce:
        • Tags based on feature extraction (vs. creator-applied tags)
        • Trend baseline per user for tags Pets and PowerTools
      • Load Profiles with new baseline back into mongoDB
      • Or skip if using the mongoDB-Hadoop connector!
  20. Monday, 8AM EST
      [Diagram: mongoDB, Hadoop, App(s), MapReduce]
      • User Bob logs in and his Profile is retrieved from mongoDB
      • Bob clicks on Content X which is already tagged as “Pets”
      • Bob has clicked on Pets tagged content many times
      • Adjust Profile for tag “Pets” and save back to mongoDB
      • Analysis = f(Profile)
      • Analysis can be “anything”; it is simply a result. It could trigger an ad, a compliance alert, etc.
  21. Monday, 8:02AM EST
      [Diagram: mongoDB, Hadoop, App(s), MapReduce]
      • Bob clicks on Content Y which is already tagged as “Spices”
      • “Spices” is a new tag type for Bob
      • Adjust Profile for tag “Spices” and save back to mongoDB
      • Analysis = f(Profile)
  22. Profile in Detail

          {
            user: "Bob",
            personalData: {
              zip: "10024",
              gender: "M"
            },
            tags: {
              PETS: {
                algo: "A4",
                baseline: [0, 0, 10, 4, 1322, 44, 23, … ],
                hist: [
                  { ts: datetime1, url: url1 },
                  { ts: datetime2, url: url2 }
                  // 100 more
                ]
              },
              SPICE: {
                hist: [
                  { ts: datetime3, url: url3 }
                ]
              }
            }
          }
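The click handling from the Monday 8AM and 8:02AM slides, applied to a Profile of this shape, might look like the following. It is a sketch under assumptions: the field names `hist`, `ts` and `url` come from the slide, but the classes `Click`, `TagState` and `ClickProfile` are hypothetical, and the Profile lives in memory here rather than being read from and saved back to mongoDB.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** One entry of a tag's "hist" array: { ts, url }. */
final class Click {
    final Instant ts;
    final String url;

    Click(Instant ts, String url) {
        this.ts = ts;
        this.url = url;
    }
}

/** Per-tag state; only the click history is modeled in this sketch. */
final class TagState {
    final List<Click> hist = new ArrayList<>();
}

/** Hypothetical in-memory version of the Profile document. */
final class ClickProfile {
    final String user;
    final Map<String, TagState> tags = new LinkedHashMap<>();

    ClickProfile(String user) {
        this.user = user;
    }

    /**
     * Appends a click to the tag's history, creating the tag entry on first use.
     * Returns true when the tag is new for this user (the "Spices" case).
     */
    boolean recordClick(String tag, String url, Instant ts) {
        boolean isNewTag = !tags.containsKey(tag);
        tags.computeIfAbsent(tag, t -> new TagState()).hist.add(new Click(ts, url));
        return isNewTag;
    }
}
```

A new tag such as SPICE starts with a history and nothing else, which matches the slide: it has no `algo` and no `baseline` until the next batch run supplies them.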
  23. Tag-based algorithm detail

          getRecommendedContent(profile, ["PETS", other]) {
              if algo for a tag available {
                  filter = algo(profile, tag);
              }
              fetch N recommendations (filter);
          }

          A4(profile, tag) {
              weight = get tag ("PETS") global weighting;
              adjustForPersonalBaseline(weight, "PETS" baseline);
              if "PETS" clicked more than 2 times in past 10 mins then weight += 10;
              if "PETS" clicked more than 10 times in past 2 days then weight += 3;
              return new filter({"PETS", weight}, globals)
          }
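The two click-window rules in A4 can be made concrete as follows. This is a partial sketch, assuming the tag's click timestamps are available as a list and the global weighting is passed in as an integer. The slide does not define `adjustForPersonalBaseline` or the filter object, so both are left out; `A4Sketch` and its method names are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

final class A4Sketch {
    /** Counts clicks with cutoff < ts <= now, where cutoff = now - window. */
    static long clicksWithin(List<Instant> clicks, Instant now, Duration window) {
        Instant cutoff = now.minus(window);
        return clicks.stream()
                .filter(ts -> ts.isAfter(cutoff) && !ts.isAfter(now))
                .count();
    }

    /** Applies only the two click-window rules from the slide's A4 pseudocode. */
    static int weight(int globalWeight, List<Instant> clicks, Instant now) {
        int weight = globalWeight;
        // adjustForPersonalBaseline(...) would go here; it is not specified on the slide.
        if (clicksWithin(clicks, now, Duration.ofMinutes(10)) > 2) {
            weight += 10; // burst of interest in the last 10 minutes
        }
        if (clicksWithin(clicks, now, Duration.ofDays(2)) > 10) {
            weight += 3;  // sustained interest over the last 2 days
        }
        return weight;
    }
}
```

Both rules can fire on the same call, so a user with a recent burst on top of sustained interest gets the combined boost of 13.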
  24. Tuesday, 1AM EST
      [Diagram: mongoDB, Hadoop, App(s), MapReduce]
      • Fetch all user Profiles from mongoDB; load into Hadoop
      • Or skip if using the mongoDB-Hadoop connector!
  25. Tuesday, 1:30AM EST
      [Diagram: mongoDB, Hadoop, App(s), MapReduce]
      • Grind through all content data and user profile data to produce:
        • Tags based on feature extraction (vs. creator-applied tags)
        • Trend baseline for Pets and PowerTools and Spice
        • Data can be specific to individual or by group
      • Load baseline back into mongoDB
      • Or skip if using the mongoDB-Hadoop connector!
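The slides do not say what the numbers in a `baseline` array mean. Purely as an assumption for illustration, the sketch below treats a baseline as clicks per day for one tag, oldest day first, computed from the click timestamps. The class `BaselineSketch` and its parameters are hypothetical, and a real job would run this logic inside the MapReduce step rather than in a standalone method.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.temporal.ChronoUnit;
import java.util.List;

final class BaselineSketch {
    /**
     * Counts clicks per UTC day over the `days` days ending at `lastDay`.
     * Index 0 is the oldest day; the last index is `lastDay` itself.
     * ASSUMPTION: "baseline" means daily click counts; the slides do not define it.
     */
    static int[] dailyCounts(List<Instant> clicks, LocalDate lastDay, int days) {
        int[] baseline = new int[days];
        for (Instant ts : clicks) {
            LocalDate day = ts.atZone(ZoneOffset.UTC).toLocalDate();
            long age = ChronoUnit.DAYS.between(day, lastDay); // 0 means lastDay
            if (age >= 0 && age < days) {
                baseline[(int) (days - 1 - age)]++;
            }
        }
        return baseline;
    }
}
```

Under this reading, a tag that is new to the user, like Spice, gets a short baseline after its first batch run.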
  26. New Profile in Detail

          {
            user: "Bob",
            personalData: {
              zip: "10024",
              gender: "M"
            },
            tags: {
              PETS: {
                algo: "A4",
                baseline: [0, 0, 10, 4, 1322, 44, 23, … ],
                hist: [
                  { ts: datetime1, url: url1 },
                  { ts: datetime2, url: url2 }
                  // 100 more
                ]
              },
              SPICE: {
                baseline: [0],
                hist: [
                  { ts: datetime3, url: url3 }
                ]
              }
            }
          }
  27. Tuesday, 1:35AM EST
      [Diagram: mongoDB, Hadoop, App(s), MapReduce]
      • Perform maintenance on user Profiles
        • Click history trimming (variety of algorithms)
        • “Dead tag” removal
        • Update of auxiliary reference data
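Two of these maintenance steps can be sketched as follows. The slide mentions a variety of trimming algorithms without naming one, so this example picks the simplest, keeping only the newest N clicks, and treats a tag as "dead" when its latest click is older than a cutoff. Both rules, and the class `ProfileMaintenance`, are assumptions made for illustration.

```java
import java.time.Instant;
import java.util.List;
import java.util.Map;

final class ProfileMaintenance {
    /**
     * Keeps only the newest maxEntries timestamps.
     * Assumes hist is mutable and sorted oldest-first, as an append-only history would be.
     */
    static void trimHistory(List<Instant> hist, int maxEntries) {
        int excess = hist.size() - maxEntries;
        if (excess > 0) {
            hist.subList(0, excess).clear(); // drop the oldest entries in place
        }
    }

    /** Removes tags with no clicks at all or whose latest click is before the cutoff. */
    static void removeDeadTags(Map<String, List<Instant>> tags, Instant cutoff) {
        tags.entrySet().removeIf(entry -> {
            List<Instant> hist = entry.getValue();
            return hist.isEmpty() || hist.get(hist.size() - 1).isBefore(cutoff);
        });
    }
}
```

Trimming by count bounds the size of each Profile document, which keeps the realtime read and write of a Profile cheap regardless of how long the user has been active.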
  28. New Profile in Detail

          {
            user: "Bob",
            personalData: {
              zip: "10022",
              gender: "M"
            },
            tags: {
              PETS: {
                algo: "A4",
                baseline: [ 1322, 44, 23, … ],
                hist: [
                  { ts: datetime1, url: url1 }
                  // 50 more
                ]
              },
              SPICE: {
                algo: "Z1",
                baseline: [0],
                hist: [
                  { ts: datetime3, url: url3 }
                ]
              }
            }
          }
  29. Feel free to run the baselining more frequently … but avoid “Are We There Yet?”
      [Diagram: mongoDB, Hadoop, App(s), MapReduce]
  30. Nearterm / Realtime Questions & Actions
      With respect to the Customer:
      • What has Bob done over the past 24 hours?
      • Given an input, make a logic decision in 100ms or less
      With respect to the Provider:
      • What are all current users doing or looking at?
      • Can we nearterm correlate single events to shifts in behavior?
  31. Longterm / Not Realtime Questions & Actions
      With respect to the Customer:
      • Any way to explain historic performance / actions?
      • What are recommendations for the future?
      With respect to the Provider:
      • Can we correlate multiple events from multiple sources over a long period of time to identify trends?
      • What is my entire customer base doing over 2 years?
      • Show me a time vs. aggregate tag hit chart
      • Slice and dice and aggregate tags vs. XYZ
      • What tags are trending up or down?
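As one possible reading of "What tags are trending up or down?", the sketch below compares the sum of the most recent `window` points of a per-tag series with the sum of the `window` points before them. The series is assumed to be an aggregate hit count per period, oldest first. The slides do not prescribe any trend method, so this comparison and the name `TrendSketch` are assumptions.

```java
final class TrendSketch {
    enum Trend { UP, DOWN, FLAT }

    /**
     * Compares the last `window` points with the `window` points before them.
     * ASSUMPTION: `series` holds aggregate tag hits per period, oldest first.
     */
    static Trend trend(int[] series, int window) {
        if (window <= 0 || series.length < 2 * window) {
            throw new IllegalArgumentException("need at least 2 * window points");
        }
        long recent = 0;
        long previous = 0;
        int n = series.length;
        for (int i = 0; i < window; i++) {
            recent += series[n - 1 - i];
            previous += series[n - 1 - window - i];
        }
        if (recent > previous) {
            return Trend.UP;
        }
        if (recent < previous) {
            return Trend.DOWN;
        }
        return Trend.FLAT;
    }
}
```

This kind of question fits the batch side of the system: it scans long histories across many users and does not need an answer within the 100ms budget of the realtime path.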
  32. The Key To Success: It is One System
      [Diagram: mongoDB, Hadoop, App(s), MapReduce]
  33. Webex Q&A
  34. Thank You
      Buzz Moschetti
      buzz.moschetti@mongodb.com
      #MongoDB
