2. Talking about
What is Humongous Data
Humongous Data & You
MongoDB & Data processing
Future of Humongous Data
Tuesday, December 11, 12
3. What is humongous data?
4. 2000
Google Inc. today announced it has released
the largest search engine on the Internet.
Google's new index comprises
more than 1 billion URLs.
5. 2008
Our indexing system for processing
links indicates that
we now count 1 trillion unique URLs
(and the number of individual web
pages out there is growing by
several billion pages per day).
6. An unprecedented
amount of data is
being created and is
accessible
7. Data Growth (Millions of URLs)

Year: 2000 2001 2002 2003 2004 2005 2006 2007 2008
URLs:    1    4   10   24   55  120  250  500 1000
8. Truly Exponential
Growth
Is hard for people to grasp
A BBC reporter recently said: "Your current PC
is more powerful than the computer on board
the first flight to the moon."
9. Moore’s Law
Applies to more than just CPUs
Boiled down: things double at
regular intervals
It's exponential growth... and it applies to
big data
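As a back-of-the-envelope sketch of that doubling (the function name and the starting figures are illustrative, not from the slides), regular doubling takes Google's 2000-era index of 1 billion URLs past the 2008 figure of 1 trillion in just ten doublings:

```python
def doublings_needed(start, target):
    """Count how many doublings take `start` to at least `target`."""
    count = 0
    while start < target:
        start *= 2
        count += 1
    return count

# 1 billion URLs (2000) -> 1 trillion URLs (2008): 2**10 = 1024 >= 1000
print(doublings_needed(10**9, 10**12))  # 10
```

Ten doublings in eight years works out to a doubling interval of under a year, which is why the growth curve above looks flat for years and then explodes.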
16. Is actually
two stories
17. Doers & Tellers talking about
different things
http://www.slideshare.net/siliconangle/trendconnect-big-data-report-september
20. Doers talk a lot more about
actual solutions
21. They know it's a two-sided story
Storage
Processing
22. Takeaways
MongoDB and Hadoop
MongoDB for storage &
operations
Hadoop for processing &
analytics
23. MongoDB
& Data Processing
24. Applications have
complex needs
MongoDB is an ideal operational
database
MongoDB is ideal for BIG data
It is not a data processing engine, but it
provides processing functionality
25. Many options for
Processing Data
• Process in MongoDB using Map Reduce
• Process in MongoDB using the Aggregation Framework
• Process outside MongoDB (using Hadoop)
26. MongoDB Map Reduce

MongoDB Data -> Map() -> emit(k,v) -> Group(k) -> Sort(k)
-> Reduce(k,values) -> Finalize(k,v)

map iterates on documents; the document is $this
Group(k) / Sort(k): 1 at a time per shard
Reduce(k,values) emits (k,v); input matches output,
so it can run multiple times
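A minimal pure-Python sketch of that map/group/reduce flow (not pymongo; the word-count functions and sample documents are illustrative assumptions). Note how the reduce step's output has the same shape as its input values, which is what lets MongoDB run it multiple times per key:

```python
from collections import defaultdict

def map_reduce(docs, map_fn, reduce_fn, finalize_fn=None):
    # Map: iterate over documents, grouping emitted (k, v) pairs by key.
    grouped = defaultdict(list)
    for doc in docs:
        for k, v in map_fn(doc):
            grouped[k].append(v)
    # Reduce (keys processed in sorted order, as after Sort(k)):
    # output shape matches input values, so reduce could be re-applied.
    out = {}
    for k in sorted(grouped):
        out[k] = reduce_fn(k, grouped[k])
        if finalize_fn:
            out[k] = finalize_fn(k, out[k])
    return out

def word_count_map(doc):
    for word in doc["text"].split():
        yield word.lower(), 1

def word_count_reduce(key, values):
    return sum(values)

docs = [{"text": "big data"}, {"text": "Big deal"}]
print(map_reduce(docs, word_count_map, word_count_reduce))
# {'big': 2, 'data': 1, 'deal': 1}
```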
27. MongoDB Map Reduce
MongoDB map reduce is quite capable... but it has
limits:
- JavaScript is not the best language for map
reduce processing
- JavaScript has limited external data processing
libraries
- It adds load to the data store
28. MongoDB
Aggregation
Most uses of MongoDB Map Reduce were for
aggregation
The Aggregation Framework is optimized for aggregate
queries
Real-time aggregation, similar to SQL GROUP BY
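A sketch of what such a GROUP-BY-style pipeline looks like (counting tweets per user, with the field names assumed), paired with a pure-Python equivalent of the `$group`/`$sum` stage so the example runs without a MongoDB server:

```python
from collections import Counter

# With pymongo, this pipeline would run as: db.live.aggregate(pipeline)
pipeline = [
    {"$group": {"_id": "$user.screen_name", "tweets": {"$sum": 1}}},
    {"$sort": {"tweets": -1}},
]

# Pure-Python equivalent of the $group/$sum/$sort stages:
docs = [
    {"user": {"screen_name": "alice"}},
    {"user": {"screen_name": "bob"}},
    {"user": {"screen_name": "alice"}},
]
counts = Counter(d["user"]["screen_name"] for d in docs)
print(counts.most_common())  # [('alice', 2), ('bob', 1)]
```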
29. MongoDB & Hadoop

MongoDB (single server or sharded cluster)
-> InputFormat: creates a list of Input Splits
   (input splits match MongoDB's shard chunks, 64MB)
-> RecordReader: reads each split
-> Map(k1,v1,ctx): many map operations, 1 at a time per
   input split; emits via ctx.write(k2,v2)
-> Combiner(k2,values2): runs on the same thread as map;
   emits (k2,v3)
-> Partitioner(k2)
-> Sort(k2)
-> Reduce(k2,values3): reducer threads, runs once per key;
   emits (kf,vf)
-> Output Format
31. DEMO
Install Hadoop MongoDB Plugin
Import tweets from twitter
Write mapper in Python using Hadoop
streaming
Write reducer in Python using Hadoop
streaming
Call myself a data scientist
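A hedged sketch of the streaming mapper/reducer pair the demo describes, here counting hashtags in the imported tweets (the hashtag task and function names are illustrative). Hadoop streaming feeds records on stdin and expects tab-separated key/value lines on stdout, delivering mapper output to the reducer sorted by key:

```python
def mapper(lines):
    # Emit ("#tag", 1) for every hashtag in every tweet line.
    for line in lines:
        for word in line.split():
            if word.startswith("#"):
                yield word.lower(), 1

def reducer(pairs):
    # Equal keys arrive adjacent (streaming sorts mapper output),
    # so a running total per key is enough.
    current, total = None, 0
    for key, value in pairs:
        if key != current:
            if current is not None:
                yield current, total
            current, total = key, 0
        total += value
    if current is not None:
        yield current, total

tweets = ["loving #mongodb and #hadoop", "#mongodb at scale"]
print(dict(reducer(sorted(mapper(tweets)))))
# {'#hadoop': 1, '#mongodb': 2}
```

In the real demo each function would live in its own script reading `sys.stdin` and writing `key\tvalue` lines to stdout, passed to Hadoop via the streaming jar's `-mapper`/`-reducer` options.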
32. Installing Mongo-hadoop

https://gist.github.com/1887726

hadoop_version='0.23'
hadoop_path="/usr/local/Cellar/hadoop/$hadoop_version.0/libexec/lib"

git clone git://github.com/mongodb/mongo-hadoop.git
cd mongo-hadoop
sed -i '' "s/default/$hadoop_version/g" build.sbt
cd streaming
./build.sh
33. Grokking Twitter

curl https://stream.twitter.com/1/statuses/sample.json \
  -u<login>:<password> \
  | mongoimport -d test -c live

... let it run for about 2 hours
42. The Future of humongous data
43. What is BIG?
BIG today is
normal tomorrow
44. Data Growth (Millions of URLs)

Year: 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011
URLs:    1    4   10   24   55  120  250  500 1000 2150 4400 9000
46. 2012
Generating over
250 million
tweets per day
47. MongoDB enables us to scale
with the redefinition of BIG.
New processing tools like
Hadoop & Storm are enabling
us to process the new BIG.