MongoDB, Hadoop & Humongous Data
Tuesday, December 11, 12
Talking about
      What is Humongous Data
      Humongous Data & You
      MongoDB & Data processing
      Future of Humongous Data
What is humongous data?
2000
                Google Inc
                Today announced it has released
                the largest search engine on the
                Internet.

                Google’s new index, comprising
                more than 1 billion URLs

2008
                Our indexing system for processing
                links indicates that
                we now count 1 trillion unique URLs

                (and the number of individual web
                pages out there is growing by
                several billion pages per day).


An unprecedented
            amount of data is
            being created and is
            accessible

Data Growth (bar chart, billions of URLs)

                Year    Billions of URLs
                2000        1
                2001        4
                2002       10
                2003       24
                2004       55
                2005      120
                2006      250
                2007      500
                2008    1,000
Truly Exponential Growth
                Is hard for people to grasp


                A BBC reporter recently: "Your current PC
                is more powerful than the computer they
                had on board the first flight to the moon".




Moore’s Law
                Applies to more than just CPUs


                Boiled down: things double at regular intervals

                It's exponential growth, and it applies to big data
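
To make the doubling idea concrete, here is a small Python sketch that estimates a doubling time from the two Google index figures quoted earlier (1 billion URLs in 2000, 1 trillion in 2008). The function name and the assumption of a constant growth rate between the two points are illustrative and not from the talk.

```python
import math

def doubling_time_years(start_year, start_count, end_year, end_count):
    """Years per doubling, assuming steady exponential growth between two points."""
    doublings = math.log2(end_count / start_count)
    return (end_year - start_year) / doublings

# Figures quoted on the 2000 and 2008 slides.
DOUBLING_TIME = doubling_time_years(2000, 1e9, 2008, 1e12)
print("URLs doubled roughly every %.1f years" % DOUBLING_TIME)
```

Under that assumption the index doubles roughly every ten months.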


How BIG is it?

[Figure: 2008 shown alongside 2001 through 2007 for scale]
Why all this talk about BIG Data now?

In the past few years, open source software emerged, enabling ‘us’ to handle BIG Data

The Big Data Story

Is actually two stories

Doers & Tellers talking about different things
                http://www.slideshare.net/siliconangle/trendconnect-big-data-report-september

Tellers

Doers

Doers talk a lot more about actual solutions
They know it’s a two-sided story
                Storage
                Processing

Takeaways
                MongoDB and Hadoop
                MongoDB for storage & operations
                Hadoop for processing & analytics
MongoDB & Data Processing

Applications have complex needs
           MongoDB is an ideal operational database
           MongoDB is ideal for BIG data
           It is not a data processing engine, but it provides processing functionality

Many options for Processing Data
               • Process in MongoDB using Map Reduce
               • Process in MongoDB using the Aggregation Framework
               • Process outside MongoDB (using Hadoop)
MongoDB Map Reduce
    MongoDB Data
      -> Map()
           emit(k, v)
           map iterates on documents
           Document is this
           1 at a time per shard
      -> Group(k)
      -> Sort(k)
      -> Reduce(k, values)
           output: k, v
           Input matches output
           Can run multiple times
      -> Finalize(k, v)
           output: k, v
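
The flow above can be sketched in plain Python, with no MongoDB involved. This is only an illustration of the map, group, sort, reduce, finalize sequence; the function names, the `hashtags` field, and the sample documents are made up for the example. The reduce step is deliberately run on partial results first, to show why its output must have the same shape as its input.

```python
from collections import defaultdict

def map_doc(doc):
    """Map: emit one (key, value) pair per hashtag in a document."""
    for tag in doc.get("hashtags", []):
        yield tag, 1

def reduce_values(key, values):
    """Reduce: the result looks like a single input value, so it can be re-reduced."""
    return sum(values)

def finalize(key, value):
    """Finalize: shape the reduced value into the output document."""
    return {"_id": key, "count": value}

def map_reduce(docs):
    groups = defaultdict(list)
    for doc in docs:
        for key, value in map_doc(doc):
            groups[key].append(value)            # Group(k)
    results = []
    for key in sorted(groups):                   # Sort(k)
        values = groups[key]
        half = len(values) // 2
        # Reduce two partial batches, then reduce the partial results again.
        partials = [reduce_values(key, values[:half]),
                    reduce_values(key, values[half:])]
        results.append(finalize(key, reduce_values(key, partials)))
    return results

# Hypothetical sample input.
SAMPLE_DOCS = [
    {"hashtags": ["mongodb", "hadoop"]},
    {"hashtags": ["mongodb"]},
    {},
]
RESULT = map_reduce(SAMPLE_DOCS)
print(RESULT)
```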
MongoDB Map Reduce
       MongoDB map reduce is quite capable... but with limits
       - JavaScript is not the best language for processing map reduce
       - JavaScript is limited in external data processing libraries
       - Adds load to the data store

MongoDB Aggregation
      Most uses of MongoDB Map Reduce were for aggregation

      Aggregation Framework optimized for aggregate queries

      Realtime aggregation similar to SQL GROUP BY
MongoDB & Hadoop
    MongoDB (single server or sharded cluster)
      -> InputFormat
           Creates a list of Input Splits
           same as Mongo's shard chunks (64 MB)
      -> RecordReader (for each split)
      -> Map(k1, v1, ctx)
           Many map operations
           1 at a time per input split
           ctx.write(k2, v2)
      -> Combiner(k2, values2)
           Runs on same thread as map
           output: k2, v3
      -> Partitioner(k2)
      -> Sort(k2)
      -> Reduce(k2, values3)
           Reducer threads
           Runs once per key
           output: kf, vf
      -> Output Format
      -> MongoDB
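
As a rough illustration of the stages in this diagram, the following Python sketch simulates a job in a single process. It is not the mongo-hadoop connector or the Hadoop API: every function name, the toy partitioner, and the sample splits are assumptions made for the example.

```python
from collections import defaultdict

def mapper(doc, write):
    """Map(k1, v1, ctx): call write(k2, v2) once per hashtag."""
    for tag in doc.get("hashtags", []):
        write(tag, 1)

def combiner(key, values):
    """Combiner: pre-aggregate the map output of one split."""
    return sum(values)

def partitioner(key, num_reducers):
    """Partitioner: deterministic toy hash that picks a reducer for a key."""
    return sum(key.encode("utf8")) % num_reducers

def reducer(key, values):
    """Reduce: runs once per key and returns the final (kf, vf) pair."""
    return key, sum(values)

def run_job(splits, num_reducers=2):
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:                          # one map task per input split
        local = defaultdict(list)
        for doc in split:
            mapper(doc, lambda k, v: local[k].append(v))
        for key, values in local.items():
            target = partitions[partitioner(key, num_reducers)]
            target[key].append(combiner(key, values))
    output = {}
    for partition in partitions:
        for key in sorted(partition):             # Sort(k2)
            final_key, final_value = reducer(key, partition[key])
            output[final_key] = final_value
    return output

# Hypothetical input: two splits of tweet-like documents.
SPLITS = [
    [{"hashtags": ["mongodb", "hadoop"]}, {"hashtags": ["mongodb"]}],
    [{"hashtags": ["hadoop"]}, {}],
]
JOB_OUTPUT = run_job(SPLITS)
print(JOB_OUTPUT)
```

The combiner only shrinks what each split sends onward; the totals are the same with or without it.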
DEMO TIME

DEMO
  Install Hadoop MongoDB Plugin
  Import tweets from Twitter
  Write mapper in Python using Hadoop streaming
  Write reducer in Python using Hadoop streaming
  Call myself a data scientist
Installing mongo-hadoop
                           https://gist.github.com/1887726

   # Hadoop version to build against, and the Homebrew install path
   hadoop_version='0.23'
   hadoop_path="/usr/local/Cellar/hadoop/$hadoop_version.0/libexec/lib"

   git clone git://github.com/mongodb/mongo-hadoop.git
   cd mongo-hadoop
   # BSD/macOS sed syntax: set the Hadoop version in build.sbt
   sed -i '' "s/default/$hadoop_version/g" build.sbt
   cd streaming
   ./build.sh
Grokking Twitter

   curl https://stream.twitter.com/1/statuses/sample.json \
       -u<login>:<password> \
       | mongoimport -d test -c live

                           ... let it run for about 2 hours
DEMO 1
Map Hashtags in Python
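#!/usr/bin/env python
# Hadoop streaming mapper: emits {'_id': hashtag, 'count': 1} per hashtag.

import sys
sys.path.append(".")

from pymongo_hadoop import BSONMapper

def mapper(documents):
    for doc in documents:
        # Stream messages that are not tweets (e.g. delete notices)
        # have no 'entities', so fall back to an empty list.
        for hashtag in doc.get('entities', {}).get('hashtags', []):
            yield {'_id': hashtag['text'], 'count': 1}

BSONMapper(mapper)
sys.stderr.write("Done Mapping.\n")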
Reduce hashtags in Python
#!/usr/bin/env python
# Hadoop streaming reducer: sums the counts emitted for each hashtag.

import sys
sys.path.append(".")

from pymongo_hadoop import BSONReducer

def reducer(key, values):
    sys.stderr.write("Hashtag %s\n" % key.encode('utf8'))
    _count = 0
    for v in values:
        _count += v['count']
    return {'_id': key.encode('utf8'), 'count': _count}

BSONReducer(reducer)
All together

  hadoop jar target/mongo-hadoop-streaming-assembly-1.0.0-rc0.jar \
      -mapper examples/twitter/twit_hashtag_map.py \
      -reducer examples/twitter/twit_hashtag_reduce.py \
      -inputURI mongodb://127.0.0.1/test.live \
      -outputURI mongodb://127.0.0.1/test.twit_reduction \
      -file examples/twitter/twit_hashtag_map.py \
      -file examples/twitter/twit_hashtag_reduce.py
Popular Hash Tags
  db.twit_hashtags.find().sort( {'count' : -1 })

  { "_id" : "YouKnowYoureInLoveIf", "count" : 287 }
  { "_id" : "teamfollowback", "count" : 200 }
  { "_id" : "RT", "count" : 150 }
  { "_id" : "Arsenal", "count" : 148 }
  { "_id" : "milars", "count" : 145 }
  { "_id" : "sanremo", "count" : 145 }
  { "_id" : "LoseMyNumberIf", "count" : 139 }
  { "_id" : "RelationshipsShould", "count" : 137 }
  { "_id" : "oomf", "count" : 117 }
  { "_id" : "TeamFollowBack", "count" : 105 }
  { "_id" : "WhyDoPeopleThink", "count" : 102 }
  { "_id" : "np", "count" : 100 }
DEMO 2
Aggregation in Mongo 2.1
     db.live.aggregate(
         { $unwind : "$entities.hashtags" },
         { $match : { "entities.hashtags.text" : { $exists : true } } },
         { $group : { _id : "$entities.hashtags.text", count : { $sum : 1 } } },
         { $sort : { count : -1 } },
         { $limit : 10 }
     )
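
For readers who do not have a MongoDB instance at hand, the same pipeline can be approximated in plain Python. This is a sketch of what each stage does, not how MongoDB implements it; the function name and the sample tweets are assumptions for the example.

```python
from collections import Counter

def top_hashtags(tweets, limit=10):
    counts = Counter()
    for tweet in tweets:
        # $unwind: one record per hashtag entry; tweets without hashtags yield none.
        for hashtag in tweet.get("entities", {}).get("hashtags", []):
            # $match: keep only entries that have a text field.
            if "text" in hashtag:
                # $group with $sum: count occurrences per hashtag text.
                counts[hashtag["text"]] += 1
    # $sort by count descending, then $limit.
    ranked = sorted(counts.items(), key=lambda item: -item[1])
    return [{"_id": tag, "count": n} for tag, n in ranked[:limit]]

# Hypothetical sample input shaped like documents from the Twitter stream.
SAMPLE_TWEETS = [
    {"entities": {"hashtags": [{"text": "mongodb"}, {"text": "hadoop"}]}},
    {"entities": {"hashtags": [{"text": "mongodb"}]}},
    {"entities": {"hashtags": []}},
    {"delete": {}},
]
TOP = top_hashtags(SAMPLE_TWEETS)
print(TOP)
```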
Popular Hash Tags
    db.twit_hashtags.aggregate(a)
    {
      "result" : [
        { "_id" : "YouKnowYoureInLoveIf", "count" : 287 },
        { "_id" : "teamfollowback", "count" : 200 },
        { "_id" : "RT", "count" : 150 },
        { "_id" : "Arsenal", "count" : 148 },
        { "_id" : "milars", "count" : 145 },
        { "_id" : "sanremo", "count" : 145 },
        { "_id" : "LoseMyNumberIf", "count" : 139 },
        { "_id" : "RelationshipsShould", "count" : 137 }
      ],
      "ok" : 1
    }
The Future of humongous data

What is BIG?
      BIG today is normal tomorrow
Data Growth (bar chart, billions of URLs)

                Year    Billions of URLs
                2000        1
                2001        4
                2002       10
                2003       24
                2004       55
                2005      120
                2006      250
                2007      500
                2008    1,000
                2009    2,150
                2010    4,400
                2011    9,000
2012
                Generating over 250 million tweets per day

MongoDB enables us to scale with the redefinition of BIG.
      New processing tools like Hadoop & Storm are enabling us to process the new BIG.

Hadoop is our first step

MongoDB is committed to working with the best data tools, including Hadoop, Storm, Disco, Spark & more
Questions?
                           http://spf13.com
                           http://github.com/spf13
                           @spf13

                           download at
                           github.com/mongodb/mongo-hadoop
 
MongoDB SoCal 2020: MongoDB Atlas Jump Start
 MongoDB SoCal 2020: MongoDB Atlas Jump Start MongoDB SoCal 2020: MongoDB Atlas Jump Start
MongoDB SoCal 2020: MongoDB Atlas Jump Start
 
MongoDB .local San Francisco 2020: Powering the new age data demands [Infosys]
MongoDB .local San Francisco 2020: Powering the new age data demands [Infosys]MongoDB .local San Francisco 2020: Powering the new age data demands [Infosys]
MongoDB .local San Francisco 2020: Powering the new age data demands [Infosys]
 
MongoDB .local San Francisco 2020: Using Client Side Encryption in MongoDB 4.2
MongoDB .local San Francisco 2020: Using Client Side Encryption in MongoDB 4.2MongoDB .local San Francisco 2020: Using Client Side Encryption in MongoDB 4.2
MongoDB .local San Francisco 2020: Using Client Side Encryption in MongoDB 4.2
 
MongoDB .local San Francisco 2020: Using MongoDB Services in Kubernetes: any ...
MongoDB .local San Francisco 2020: Using MongoDB Services in Kubernetes: any ...MongoDB .local San Francisco 2020: Using MongoDB Services in Kubernetes: any ...
MongoDB .local San Francisco 2020: Using MongoDB Services in Kubernetes: any ...
 
MongoDB .local San Francisco 2020: Go on a Data Safari with MongoDB Charts!
MongoDB .local San Francisco 2020: Go on a Data Safari with MongoDB Charts!MongoDB .local San Francisco 2020: Go on a Data Safari with MongoDB Charts!
MongoDB .local San Francisco 2020: Go on a Data Safari with MongoDB Charts!
 
MongoDB .local San Francisco 2020: From SQL to NoSQL -- Changing Your Mindset
MongoDB .local San Francisco 2020: From SQL to NoSQL -- Changing Your MindsetMongoDB .local San Francisco 2020: From SQL to NoSQL -- Changing Your Mindset
MongoDB .local San Francisco 2020: From SQL to NoSQL -- Changing Your Mindset
 
MongoDB .local San Francisco 2020: MongoDB Atlas Jumpstart
MongoDB .local San Francisco 2020: MongoDB Atlas JumpstartMongoDB .local San Francisco 2020: MongoDB Atlas Jumpstart
MongoDB .local San Francisco 2020: MongoDB Atlas Jumpstart
 
MongoDB .local San Francisco 2020: Tips and Tricks++ for Querying and Indexin...
MongoDB .local San Francisco 2020: Tips and Tricks++ for Querying and Indexin...MongoDB .local San Francisco 2020: Tips and Tricks++ for Querying and Indexin...
MongoDB .local San Francisco 2020: Tips and Tricks++ for Querying and Indexin...
 
MongoDB .local San Francisco 2020: Aggregation Pipeline Power++
MongoDB .local San Francisco 2020: Aggregation Pipeline Power++MongoDB .local San Francisco 2020: Aggregation Pipeline Power++
MongoDB .local San Francisco 2020: Aggregation Pipeline Power++
 
MongoDB .local San Francisco 2020: A Complete Methodology of Data Modeling fo...
MongoDB .local San Francisco 2020: A Complete Methodology of Data Modeling fo...MongoDB .local San Francisco 2020: A Complete Methodology of Data Modeling fo...
MongoDB .local San Francisco 2020: A Complete Methodology of Data Modeling fo...
 
MongoDB .local San Francisco 2020: MongoDB Atlas Data Lake Technical Deep Dive
MongoDB .local San Francisco 2020: MongoDB Atlas Data Lake Technical Deep DiveMongoDB .local San Francisco 2020: MongoDB Atlas Data Lake Technical Deep Dive
MongoDB .local San Francisco 2020: MongoDB Atlas Data Lake Technical Deep Dive
 
MongoDB .local San Francisco 2020: Developing Alexa Skills with MongoDB & Golang
MongoDB .local San Francisco 2020: Developing Alexa Skills with MongoDB & GolangMongoDB .local San Francisco 2020: Developing Alexa Skills with MongoDB & Golang
MongoDB .local San Francisco 2020: Developing Alexa Skills with MongoDB & Golang
 
MongoDB .local Paris 2020: Realm : l'ingrédient secret pour de meilleures app...
MongoDB .local Paris 2020: Realm : l'ingrédient secret pour de meilleures app...MongoDB .local Paris 2020: Realm : l'ingrédient secret pour de meilleures app...
MongoDB .local Paris 2020: Realm : l'ingrédient secret pour de meilleures app...
 
MongoDB .local Paris 2020: Upply @MongoDB : Upply : Quand le Machine Learning...
MongoDB .local Paris 2020: Upply @MongoDB : Upply : Quand le Machine Learning...MongoDB .local Paris 2020: Upply @MongoDB : Upply : Quand le Machine Learning...
MongoDB .local Paris 2020: Upply @MongoDB : Upply : Quand le Machine Learning...
 

MongoDB Hadoop and Humongous Data

  • 1. MongoDB Hadoop & humongous data Tuesday, December 11, 12
  • 2. Talking about: What is Humongous Data; Humongous Data & You; MongoDB & Data processing; Future of Humongous Data
  • 3. What is humongous data?
  • 4. 2000: Google Inc. today announced it has released the largest search engine on the Internet. Google’s new index, comprising more than 1 billion URLs
  • 5. 2008: Our indexing system for processing links indicates that we now count 1 trillion unique URLs (and the number of individual web pages out there is growing by several billion pages per day).
  • 6. An unprecedented amount of data is being created and is accessible
  • 7. Data Growth (chart, axis labeled "Millions of URLs"): 2000: 1; 2001: 4; 2002: 10; 2003: 24; 2004: 55; 2005: 120; 2006: 250; 2007: 500; 2008: 1,000
  • 8. Truly exponential growth is hard for people to grasp. A BBC reporter recently: "Your current PC is more powerful than the computer they had on board the first flight to the moon".
  • 9. Moore’s Law applies to more than just CPUs. Boiled down, it says that things double at regular intervals. It’s exponential growth, and it applies to big data.
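To make "things double at regular intervals" concrete, here is a minimal sketch that estimates the doubling interval implied by the Data Growth chart. It assumes steady exponential growth between the first and last chart values (1 in 2000, 1,000 in 2008, same unit); the function name `doubling_time` is illustrative and not from the talk.

```python
import math


def doubling_time(start_value, end_value, years):
    """Years needed for a quantity to double, assuming steady exponential growth."""
    doublings = math.log2(end_value / start_value)
    return years / doublings


# Chart endpoints: 1 (2000) and 1,000 (2008), in the chart's own unit.
t = doubling_time(1, 1000, 2008 - 2000)
print(round(t, 2))  # roughly 0.8 years per doubling
```

Under that assumption the indexed web doubled roughly every ten months, which is faster than the 18 to 24 months usually quoted for Moore's Law.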
  • 10. How BIG is it?
  • 11. How BIG is it? 2008
  • 12. How BIG is it? (figure with year labels 2001 through 2008)
  • 13. Why all this talk about BIG Data now?
  • 14. In the past few years, open source software has emerged, enabling ‘us’ to handle BIG Data
  • 15. The Big Data Story
  • 16. Is actually two stories
  • 17. Doers & Tellers talking about different things http://www.slideshare.net/siliconangle/trendconnect-big-data-report-september
  • 20. Doers talk a lot more about actual solutions
  • 21. They know it’s a two-sided story: Storage and Processing
  • 22. Takeaways: MongoDB and Hadoop. MongoDB for storage & operations; Hadoop for processing & analytics
  • 23. MongoDB & Data Processing
  • 24. Applications have complex needs. MongoDB is an ideal operational database. MongoDB is ideal for BIG data. It is not a data processing engine, but it provides processing functionality
  • 25. Many options for processing data:
        • Process in MongoDB using Map Reduce
        • Process in MongoDB using Aggregation Framework
        • Process outside MongoDB (using Hadoop)
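  • 26. MongoDB Map Reduce (flow diagram): MongoDB Data → Map() → Group(k) → Sort(k) → Reduce(k,values) → Finalize(k,v)
        • Map(): iterates on documents, 1 at a time per shard; the current document is "this"; results are passed on with emit(k,v)
        • Reduce(k,values): returns k,v; input matches output, so it can run multiple times
        • Finalize(k,v): returns the final k,v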
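The flow on this slide can be sketched in plain Python. This is a toy in-memory model, not MongoDB's implementation: `map_reduce`, `map_tags` and `reduce_counts` are hypothetical names, and Python generators stand in for the JavaScript `emit(k, v)` call.

```python
from collections import defaultdict


def map_reduce(documents, map_fn, reduce_fn, finalize_fn=None):
    """Toy model of the map -> group -> sort -> reduce -> finalize flow."""
    groups = defaultdict(list)
    for doc in documents:                # map iterates over documents
        for key, value in map_fn(doc):   # each yield plays the role of emit(k, v)
            groups[key].append(value)    # group emitted values by key
    results = {}
    for key in sorted(groups):           # sort by key
        value = reduce_fn(key, groups[key])
        if finalize_fn is not None:      # optional last pass per key
            value = finalize_fn(key, value)
        results[key] = value
    return results


def map_tags(doc):
    for tag in doc.get("tags", []):
        yield tag, {"count": 1}


def reduce_counts(key, values):
    # The output has the same shape as each input value,
    # which is why reduce can safely be applied more than once.
    return {"count": sum(v["count"] for v in values)}


docs = [{"tags": ["mongodb", "hadoop"]}, {"tags": ["mongodb"]}, {}]
counts = map_reduce(docs, map_tags, reduce_counts)
print(counts)
```

The "input matches output" constraint is the important design point: a reduce function that returned a bare number here would break as soon as it was fed its own earlier output.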
  • 27. MongoDB Map Reduce: MongoDB map reduce is quite capable... but with limits
        - JavaScript is not the best language for processing map reduce
        - JavaScript is limited in external data processing libraries
        - Adds load to the data store
  • 28. MongoDB Aggregation: Most uses of MongoDB Map Reduce were for aggregation. The Aggregation Framework is optimized for aggregate queries. Realtime aggregation, similar to SQL GROUP BY
  • 29. MongoDB & Hadoop (flow diagram): MongoDB → InputFormat → RecordReader → Map → Combiner → Partitioner → Sort → Reduce → Output Format → MongoDB
        • MongoDB: single server or sharded cluster
        • InputFormat: creates a list of Input Splits; same as Mongo's shard chunks (64 MB)
        • RecordReader: one for each split
        • Map(k1, v1, ctx): many map operations, 1 at a time per input split; writes results with ctx.write(k2, v2)
        • Combiner(k2, values2): runs on the same thread as map; produces k2, v3
        • Partitioner(k2), then Sort(k2)
        • Reducer threads: Reduce(k2, values3) runs once per key; produces kf, vf
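The stages in this diagram can be walked through with a small single-process Python sketch. It is a simplified model under stated assumptions, not Hadoop's actual scheduling: `run_job`, `partition`, `word_mapper` and `total` are hypothetical names, the combiner is applied once per split, and the partitioner is a deterministic stand-in for a hash partitioner.

```python
from collections import defaultdict
from itertools import groupby


def partition(key, num_reducers):
    """Deterministic stand-in for a hash partitioner (illustrative only)."""
    return sum(key.encode("utf8")) % num_reducers


def run_job(splits, mapper, combiner, reducer, num_reducers=2):
    partitions = [[] for _ in range(num_reducers)]
    for split in splits:                       # one map task per input split
        local = defaultdict(list)
        for record in split:
            for k2, v2 in mapper(record):      # plays the role of ctx.write(k2, v2)
                local[k2].append(v2)
        for k2, values2 in local.items():      # combiner pre-aggregates map output
            v3 = combiner(k2, values2)
            partitions[partition(k2, num_reducers)].append((k2, v3))
    output = {}
    for part in partitions:                    # each partition feeds one reducer
        part.sort(key=lambda kv: kv[0])        # sort by key before reducing
        for k2, group in groupby(part, key=lambda kv: kv[0]):
            output[k2] = reducer(k2, [v3 for _, v3 in group])  # once per key
    return output


def word_mapper(line):
    for word in line.split():
        yield word, 1


def total(key, values):
    return sum(values)


splits = [["a b", "a"], ["b c"]]
result = run_job(splits, word_mapper, combiner=total, reducer=total)
print(sorted(result.items()))  # [('a', 2), ('b', 2), ('c', 1)]
```

Because the partitioner sends every occurrence of a key to the same partition, each key is reduced exactly once, which matches the "runs once per key" note on the slide.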
  • 30. DEMO TIME
  • 31. DEMO: Install Hadoop MongoDB Plugin; Import tweets from Twitter; Write mapper in Python using Hadoop streaming; Write reducer in Python using Hadoop streaming; Call myself a data scientist
  • 32. Installing Mongo-hadoop https://gist.github.com/1887726
        hadoop_version='0.23'
        hadoop_path="/usr/local/Cellar/hadoop/$hadoop_version.0/libexec/lib"
        git clone git://github.com/mongodb/mongo-hadoop.git
        cd mongo-hadoop
        # sed -i '' is the BSD/macOS form of in-place editing
        sed -i '' "s/default/$hadoop_version/g" build.sbt
        cd streaming
        ./build.sh
  • 33. Grokking Twitter
        curl https://stream.twitter.com/1/statuses/sample.json -u<login>:<password> | mongoimport -d test -c live
        ... let it run for about 2 hours
  • 35. Map Hashtags in Python
        #!/usr/bin/env python
        import sys
        sys.path.append(".")
        from pymongo_hadoop import BSONMapper
        def mapper(documents):
            # Emit one {'_id': <hashtag>, 'count': 1} document per hashtag occurrence
            for doc in documents:
                # Some stream records carry no 'entities' field, so fall back to empty
                for hashtag in doc.get('entities', {}).get('hashtags', []):
                    yield {'_id': hashtag['text'], 'count': 1}
        BSONMapper(mapper)
        sys.stderr.write("Done Mapping.\n")
  • 36. Reduce hashtags in Python
        #!/usr/bin/env python
        # Written for Python 2, as on the slide: keys arrive as unicode strings
        import sys
        sys.path.append(".")
        from pymongo_hadoop import BSONReducer
        def reducer(key, values):
            # Sum the counts emitted by the mapper for one hashtag
            print >> sys.stderr, "Hashtag %s" % key.encode('utf8')
            _count = 0
            for v in values:
                _count += v['count']
            return {'_id': key.encode('utf8'), 'count': _count}
        BSONReducer(reducer)
  • 37. All together
        hadoop jar target/mongo-hadoop-streaming-assembly-1.0.0-rc0.jar \
          -mapper examples/twitter/twit_hashtag_map.py \
          -reducer examples/twitter/twit_hashtag_reduce.py \
          -inputURI mongodb://127.0.0.1/test.live \
          -outputURI mongodb://127.0.0.1/test.twit_reduction \
          -file examples/twitter/twit_hashtag_map.py \
          -file examples/twitter/twit_hashtag_reduce.py
  • 38. Popular Hash Tags
        db.twit_hashtags.find().sort( { 'count' : -1 } )
        { "_id" : "YouKnowYoureInLoveIf", "count" : 287 }
        { "_id" : "teamfollowback", "count" : 200 }
        { "_id" : "RT", "count" : 150 }
        { "_id" : "Arsenal", "count" : 148 }
        { "_id" : "milars", "count" : 145 }
        { "_id" : "sanremo", "count" : 145 }
        { "_id" : "LoseMyNumberIf", "count" : 139 }
        { "_id" : "RelationshipsShould", "count" : 137 }
        { "_id" : "oomf", "count" : 117 }
        { "_id" : "TeamFollowBack", "count" : 105 }
        { "_id" : "WhyDoPeopleThink", "count" : 102 }
        { "_id" : "np", "count" : 100 }
  • 40. Aggregation in Mongo 2.1
        db.live.aggregate(
            { $unwind : "$entities.hashtags" },
            { $match : { "entities.hashtags.text" : { $exists : true } } },
            { $group : { _id : "$entities.hashtags.text", count : { $sum : 1 } } },
            { $sort : { count : -1 } },
            { $limit : 10 }
        )
  • 41. Popular Hash Tags
        db.twit_hashtags.aggregate(a)
        { "result" : [
            { "_id" : "YouKnowYoureInLoveIf", "count" : 287 },
            { "_id" : "teamfollowback", "count" : 200 },
            { "_id" : "RT", "count" : 150 },
            { "_id" : "Arsenal", "count" : 148 },
            { "_id" : "milars", "count" : 145 },
            { "_id" : "sanremo", "count" : 145 },
            { "_id" : "LoseMyNumberIf", "count" : 139 },
            { "_id" : "RelationshipsShould", "count" : 137 }
          ],
          "ok" : 1 }
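For readers driving this from Python rather than the mongo shell, the same pipeline can be written as a list of dicts. This is a sketch that assumes a recent PyMongo release, where `Collection.aggregate` takes the pipeline as a list and returns an iterable cursor rather than the `{ "result" : [...] }` document shown on the slide; `PIPELINE` and `top_hashtags` are illustrative names.

```python
# Same stages as the shell pipeline, expressed as Python dicts.
PIPELINE = [
    {"$unwind": "$entities.hashtags"},
    {"$match": {"entities.hashtags.text": {"$exists": True}}},
    {"$group": {"_id": "$entities.hashtags.text", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
    {"$limit": 10},
]


def top_hashtags(collection):
    """Return the ten most frequent hashtags as a list of {_id, count} dicts."""
    return list(collection.aggregate(PIPELINE))


# Usage against a local server (assumes the tweets were imported into test.live):
#   from pymongo import MongoClient
#   print(top_hashtags(MongoClient()["test"]["live"]))
```

Passing the collection in as an argument keeps the function independent of connection details and easy to exercise with a stand-in object.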
  • 42. The Future of humongous data
  • 43. What is BIG? BIG today is normal tomorrow
  • 44. Data Growth (chart, axis labeled "Millions of URLs"): 2000: 1; 2001: 4; 2002: 10; 2003: 24; 2004: 55; 2005: 120; 2006: 250; 2007: 500; 2008: 1,000; 2009: 2,150; 2010: 4,400; 2011: 9,000
  • 46. 2012: Generating over 250 million tweets per day
  • 47. MongoDB enables us to scale with the redefinition of BIG. New processing tools like Hadoop & Storm are enabling us to process the new BIG.
  • 48. Hadoop is our first step
  • 49. MongoDB is committed to working with the best data tools, including Hadoop, Storm, Disco, Spark & more
  • 50. Questions? http://spf13.com http://github.com/spf13 @spf13 Download at github.com/mongodb/mongo-hadoop