Infinit.e: An Open Analytic Platform
  Driven by MongoDB & Hadoop
Agenda
• Who we are
• What Infinit.e is
• Architecture
  –   Use of Open Source
  –   Elasticsearch / MongoDB / Hadoop combo
  –   Focus on MongoDB
  –   Focus on Hadoop
• Demo
• Questions
Who we are
IKANOW (ikanow.com)
• Our vision is to enable agile intelligence
  through open analytics
• Our engineering vision is to use the best OSS
  technologies to build a document analysis
  platform that will enable this and then Open
  Source it back to the community
  – https://github.com/IKANOW/Infinit.e
  – http://bit.ly/ikanow-oss
What Infinit.e is
Infinit.e is a scalable framework for:
• Collecting,
• Storing,
• Enriching,
• Retrieving,
• Analyzing, and
• Visualizing
unstructured documents and structured records
What Infinit.e is - Overview
What Infinit.e is - Documents

20% Structured
• Log files
• Databases
• Apps

80% Unstructured
• Documents
• Presentations
• Spreadsheets
• Meeting notes
• Email
• IM chats
• Reports
• Social

Unstructured and Structured Data
• Entities
• Events
• Facts
• Sentiment
• Geospatial
• Temporal
• Themes
What Infinit.e is - Documents
• Duke and Progress announced merger plans in January 2012
• Bernanke, 57 said in his testimony price increases “have begun to moderate” after a jump in oil costs earlier this year
• Tablet ownership levels hit 18% in China, the UK and US versus 3% in November 2010
• <Incident>
    <uid>20101043423</uid>
    <subject>1 person killed in armed attack by suspected Boko Haram in Maiduguri, Borno, Nigeria</subject>
    <multipleDays>No</multipleDays>
    <eventDate>06/04/2011</eventDate>
  </Incident>

Who: people, organizations, facilities, company
What: events, summaries, facts, themes
When: past, present, future dates
Where: city, state, country, coordinate
What Infinit.e is - Framework
What Infinit.e is - Visualization
Architecture
Use of Open Source
Architecture
The 3 Key Elements
Architecture
Focus on MongoDB

3 key areas of benefit:
• Development
• Integration
• Deployment
MongoDB Development
Document analysis – lots of complex generic logic written in Java
• The “records” are all complex objects
  – BSON/JSON is a perfect representation
• Usually code maintainability is most important
  – BSON → “Plain Old Java Object”
    (we use Gson, though Jackson is probably better; the Gson extensions
    for MongoDB types like dates and ObjectIds worked nicely)
• Sometimes performance is most important
  – Option to stay in BSON
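MongoDB – Dev Examples
   Converting to “POJO”
// Build a query-by-example POJO, serialize it to BSON for the lookup,
// then deserialize the matching BSON document back into a POJO
DocumentPojo docIn = new DocumentPojo();
docIn.setId(new ObjectId(idStr));
DocumentPojo docOut = DocumentPojo.fromDb(
    DbManager.getDocument().getMetadata().findOne(docIn.toDb()));

   Hybrid
// Build the query directly in BSON, using the POJO's field-name constants
BasicDBObject query = new BasicDBObject(DocumentPojo.communityId_,
    new BasicDBObject(MongoDbManager.in_, communityIdList));
// (then as above)

   Working in BSON only
// f is presumably a BSON document already fetched from the database
BasicDBList l = (BasicDBList)(f.get(DocumentPojo.entities_));
for(Iterator<?> e0 = l.iterator(); e0.hasNext();){
    BasicDBObject e = (BasicDBObject)e0.next();
    // (loop body not shown on the slide)
}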
MongoDB – Changing Data Model
Standard requirement, particularly for an evolving project based on
whatever functionality can be derived from the latest technologies…
• Example
  – We have sentiment as a property of an entity (person/place/organization)
  – An association links 2 entity objects via a verb
  – New capability: the NLP engine can now provide directed sentiment
    from one entity to another!
• Often requires no extra dev effort at all...
  – Adding fields, e.g. just add sentiment to the association above
• Otherwise, the built-in JSON format makes data model migrations easy
  – Have performed 2 major data model changes in 18 months, both
    via simple map/reduce scripts, with backwards compatibility
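A minimal sketch of the "just add a field" case, under stated assumptions: a plain java.util.Map stands in for the BSON/JSON association object, and the class, method, and field names (AssociationMigrationSketch, "entity1", "verb", "entity2", "sentiment") are hypothetical, not Infinit.e's actual schema. It is plain Java rather than one of the map/reduce scripts mentioned above; it only illustrates why old documents stay readable after the change.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in: a plain Map plays the role of a BSON/JSON association object.
final class AssociationMigrationSketch {

    // Illustrative field name (an assumption, not the real Infinit.e schema).
    static final String SENTIMENT = "sentiment";

    // Build an association that links two entities via a verb.
    static Map<String, Object> association(String entity1, String verb, String entity2) {
        Map<String, Object> a = new HashMap<>();
        a.put("entity1", entity1);
        a.put("verb", verb);
        a.put("entity2", entity2);
        return a;
    }

    // Adding a field needs no schema change: older documents simply lack the key.
    static void addSentiment(Map<String, Object> association, double sentiment) {
        association.put(SENTIMENT, sentiment);
    }

    // Backwards-compatible read: associations written before the change yield the default.
    static double sentimentOrDefault(Map<String, Object> association, double defaultValue) {
        Object value = association.get(SENTIMENT);
        return (value instanceof Number) ? ((Number) value).doubleValue() : defaultValue;
    }

    public static void main(String[] args) {
        Map<String, Object> oldStyle = association("Duke", "merge", "Progress");
        Map<String, Object> newStyle = association("Duke", "merge", "Progress");
        addSentiment(newStyle, 0.4);
        System.out.println(sentimentOrDefault(oldStyle, 0.0));
        System.out.println(sentimentOrDefault(newStyle, 0.0));
    }
}
```

Readers that tolerate a missing key are what make the change backwards compatible: nothing has to be rewritten until a migration script chooses to backfill the field.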
MongoDB Integration
Infinit.e is based on NoSQL and web 2.0 technologies
• Elasticsearch – JSON engine
• JavaScript/ActionScript – JSON a key component
• NLP SaaS engines – JSON-based
A key component of the custom ingest/enrichment is the
ability to tag arbitrary source-specific metadata onto
documents
• Allows custom search / analytics / visualization
• “Best of both worlds” in conjunction with generic data model
• Schema-less storage is essential
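A rough sketch of the metadata-tagging idea, assuming a document that carries one fixed generic field plus a free-form metadata map. The class and method names (SourceMetadataSketch, Doc, tag, titlesWhere) are hypothetical and the Map is only a stand-in for schema-less storage; the field names in main are borrowed from the incident example earlier in the deck.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical illustration: generic fields plus arbitrary source-specific metadata.
final class SourceMetadataSketch {

    static final class Doc {
        final String title;                                           // generic data model
        final Map<String, Object> metadata = new LinkedHashMap<>();   // source-specific, no fixed schema

        Doc(String title) { this.title = title; }

        // Tag any key/value onto the document; returns this so calls can be chained.
        Doc tag(String key, Object value) {
            metadata.put(key, value);
            return this;
        }
    }

    // Custom search over the source-specific metadata: titles of docs whose key equals the value.
    static List<String> titlesWhere(List<Doc> docs, String key, Object expected) {
        return docs.stream()
                .filter(d -> expected.equals(d.metadata.get(key)))
                .map(d -> d.title)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Doc> docs = new ArrayList<>();
        docs.add(new Doc("Incident 20101043423")
                .tag("multipleDays", "No")
                .tag("eventDate", "06/04/2011"));
        docs.add(new Doc("Tablet ownership report").tag("region", "China"));
        System.out.println(titlesWhere(docs, "multipleDays", "No"));
    }
}
```

Documents from different sources can carry entirely different metadata keys while still sharing the generic fields, which is the "best of both worlds" point above.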
MongoDB Deployment
Need to scale in many directions:
• Writes due to new documents
• Reads for queries
• The ability to scale execution of domain-specific logic
  – On ingest
  – Batch analytics
Infinit.e is designed to use platforms like EC2 to scale
MongoDB Deployment
MongoDB scalability
• Works!
  – Scales to arbitrary sizes in both read/write dimensions
• Sophisticated sharding keys provide powerful/flexible balancing
• Downsides:
  – Building an initial cluster is quite complex
  – Managing cluster changes is quite fiddly
• For Infinit.e we used CloudFormation templates and (RPM-based)
  install scripts to manage the cluster
  – Works OK; a graphical tool and some more robustness would be nice
    (on our roadmap, but not very close!)
MongoDB Deployment
MongoDB/EC2 integration
• m1.xlarge works best for our needs (m1.large is fine for ~0.5M docs)
  – 4 cores, 15GB
  – 4 × 500GB ephemeral disks that we RAID-0 together
    (without that, performance dropped off a cliff at >1M docs)
Architecture
Focus on Hadoop
Why Hadoop?
• Queries/aggregation/visualization is an excellent first step
  for document analysis, and is often all that's required
• More complex analytics requires
  – Access to all of the data, not pre-aggregated or selected
  – A high-level programming language, mature libraries etc
• Hadoop is becoming the de facto standard for data analytics
  – Open Source, very customizable
  – Proven scalability
  – Java libraries
  – Mahout project (machine learning libraries for Hadoop)
  – Amazon elastic cloud
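To make "access to all of the data" concrete, here is a small map/reduce-style sketch in plain Java streams. It deliberately does not use the Hadoop API; the class name EntityCountSketch and the sample entity names are assumptions for illustration. The "map" step emits every entity name from every document, and the "reduce" step sums the occurrences per name.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical illustration of a map/reduce-shaped batch job, without Hadoop itself.
final class EntityCountSketch {

    // Map: flatten each document into its entity names. Reduce: count occurrences per name.
    static Map<String, Long> countEntities(List<List<String>> entitiesPerDocument) {
        return entitiesPerDocument.stream()
                .flatMap(List::stream)
                .collect(Collectors.groupingBy(name -> name, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("Bernanke", "oil"),
                Arrays.asList("Duke", "Progress"),
                Arrays.asList("Bernanke"));
        System.out.println(countEntities(docs));
    }
}
```

In a real Hadoop job the same two steps would be written as Mapper and Reducer classes so the work can be spread across the cluster; the shape of the computation is the same.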
Architecture
MongoDB / Hadoop
Infinit.e Demonstration
Thank You!!!


              Alex Piggott
    Director of Product Engineering
        apiggott@ikanow.com

MongoDC - Ikanow April 2012 Meetup
