2. Talking about
What is Humongous Data
Humongous Data & You
MongoDB & Data processing
Future of Humongous Data
Tuesday, December 11, 12
3. What is humongous data?
4. 2000
Google Inc. today announced it has released
the largest search engine on the Internet.
Google's new index comprises
more than 1 billion URLs.
5. 2008
Our indexing system for processing
links indicates that
we now count 1 trillion unique URLs
(and the number of individual web
pages out there is growing by
several billion pages per day).
6. An unprecedented
amount of data is
being created and is
accessible
7. Data Growth (Millions of URLs)

Year: 2000 2001 2002 2003 2004 2005 2006 2007 2008
URLs:    1    4   10   24   55  120  250  500 1000
8. Truly Exponential
Growth
Is hard for people to grasp
A BBC reporter recently said: "Your current PC
is more powerful than the computer on board
the first flight to the moon."
9. Moore’s Law
Applies to more than just CPUs
Boiled down: things double at
regular intervals
It's exponential growth... and it applies to
big data
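As a back-of-the-envelope sketch of that doubling (the function name and the starting figures are illustrative, not from the slides), regular doubling takes Google's 2000-era index of 1 billion URLs past the 2008 figure of 1 trillion in just ten doublings:

```python
def doublings_needed(start, target):
    """Count how many doublings take `start` to at least `target`."""
    count = 0
    while start < target:
        start *= 2
        count += 1
    return count

# 1 billion URLs (2000) -> 1 trillion URLs (2008): 2**10 = 1024 >= 1000
print(doublings_needed(10**9, 10**12))  # 10
```

Ten doublings in eight years works out to a doubling interval of under a year, which is why the growth curve above looks flat for years and then explodes.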
16. Is actually
two stories
17. Doers & Tellers talking about
different things
http://www.slideshare.net/siliconangle/trendconnect-big-data-report-september
20. Doers talk a lot more about
actual solutions
21. They know it's a two-sided story
Storage
Processing
22. Takeaways
MongoDB and Hadoop
MongoDB for storage &
operations
Hadoop for processing &
analytics
23. MongoDB
& Data Processing
24. Applications have
complex needs
MongoDB is an ideal operational
database
MongoDB is ideal for BIG data
It is not a data processing engine, but it
provides processing functionality
25. Many options for
Processing Data
• Process in MongoDB using Map Reduce
• Process in MongoDB using the Aggregation Framework
• Process outside MongoDB (using Hadoop)
26. MongoDB Map Reduce

MongoDB Data -> Map() -> emit(k,v) -> Group(k) -> Sort(k)
-> Reduce(k,values) -> Finalize(k,v)

map iterates on documents; the document is $this
Group(k) / Sort(k): 1 at a time per shard
Reduce(k,values) emits (k,v); input matches output,
so it can run multiple times
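A minimal pure-Python sketch of that map/group/reduce flow (not pymongo; the word-count functions and sample documents are illustrative assumptions). Note how the reduce step's output has the same shape as its input values, which is what lets MongoDB run it multiple times per key:

```python
from collections import defaultdict

def map_reduce(docs, map_fn, reduce_fn, finalize_fn=None):
    # Map: iterate over documents, grouping emitted (k, v) pairs by key.
    grouped = defaultdict(list)
    for doc in docs:
        for k, v in map_fn(doc):
            grouped[k].append(v)
    # Reduce (keys processed in sorted order, as after Sort(k)):
    # output shape matches input values, so reduce could be re-applied.
    out = {}
    for k in sorted(grouped):
        out[k] = reduce_fn(k, grouped[k])
        if finalize_fn:
            out[k] = finalize_fn(k, out[k])
    return out

def word_count_map(doc):
    for word in doc["text"].split():
        yield word.lower(), 1

def word_count_reduce(key, values):
    return sum(values)

docs = [{"text": "big data"}, {"text": "Big deal"}]
print(map_reduce(docs, word_count_map, word_count_reduce))
# {'big': 2, 'data': 1, 'deal': 1}
```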
27. MongoDB Map Reduce
MongoDB map reduce is quite capable... but it has
limits:
- JavaScript is not the best language for map
reduce processing
- JavaScript has limited external data processing
libraries
- It adds load to the data store
28. MongoDB
Aggregation
Most uses of MongoDB Map Reduce were for
aggregation
The Aggregation Framework is optimized for aggregate
queries
Real-time aggregation, similar to SQL GROUP BY
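A sketch of what such a GROUP-BY-style pipeline looks like (counting tweets per user, with the field names assumed), paired with a pure-Python equivalent of the `$group`/`$sum` stage so the example runs without a MongoDB server:

```python
from collections import Counter

# With pymongo, this pipeline would run as: db.live.aggregate(pipeline)
pipeline = [
    {"$group": {"_id": "$user.screen_name", "tweets": {"$sum": 1}}},
    {"$sort": {"tweets": -1}},
]

# Pure-Python equivalent of the $group/$sum/$sort stages:
docs = [
    {"user": {"screen_name": "alice"}},
    {"user": {"screen_name": "bob"}},
    {"user": {"screen_name": "alice"}},
]
counts = Counter(d["user"]["screen_name"] for d in docs)
print(counts.most_common())  # [('alice', 2), ('bob', 1)]
```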
29. MongoDB & Hadoop

MongoDB (single server or sharded cluster)
-> InputFormat: creates a list of Input Splits
   (input splits match MongoDB's shard chunks, 64MB)
-> RecordReader: reads each split
-> Map(k1,v1,ctx): many map operations, 1 at a time per
   input split; emits via ctx.write(k2,v2)
-> Combiner(k2,values2): runs on the same thread as map;
   emits (k2,v3)
-> Partitioner(k2)
-> Sort(k2)
-> Reduce(k2,values3): reducer threads, runs once per key;
   emits (kf,vf)
-> Output Format
31. DEMO
Install Hadoop MongoDB Plugin
Import tweets from twitter
Write mapper in Python using Hadoop
streaming
Write reducer in Python using Hadoop
streaming
Call myself a data scientist
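A hedged sketch of the streaming mapper/reducer pair the demo describes, here counting hashtags in the imported tweets (the hashtag task and function names are illustrative). Hadoop streaming feeds records on stdin and expects tab-separated key/value lines on stdout, delivering mapper output to the reducer sorted by key:

```python
def mapper(lines):
    # Emit ("#tag", 1) for every hashtag in every tweet line.
    for line in lines:
        for word in line.split():
            if word.startswith("#"):
                yield word.lower(), 1

def reducer(pairs):
    # Equal keys arrive adjacent (streaming sorts mapper output),
    # so a running total per key is enough.
    current, total = None, 0
    for key, value in pairs:
        if key != current:
            if current is not None:
                yield current, total
            current, total = key, 0
        total += value
    if current is not None:
        yield current, total

tweets = ["loving #mongodb and #hadoop", "#mongodb at scale"]
print(dict(reducer(sorted(mapper(tweets)))))
# {'#hadoop': 1, '#mongodb': 2}
```

In the real demo each function would live in its own script reading `sys.stdin` and writing `key\tvalue` lines to stdout, passed to Hadoop via the streaming jar's `-mapper`/`-reducer` options.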
32. Installing Mongo-hadoop

https://gist.github.com/1887726

hadoop_version='0.23'
hadoop_path="/usr/local/Cellar/hadoop/$hadoop_version.0/libexec/lib"

git clone git://github.com/mongodb/mongo-hadoop.git
cd mongo-hadoop
sed -i '' "s/default/$hadoop_version/g" build.sbt
cd streaming
./build.sh
33. Grokking Twitter

curl https://stream.twitter.com/1/statuses/sample.json \
  -u<login>:<password> \
  | mongoimport -d test -c live

... let it run for about 2 hours
42. The Future of humongous data
43. What is BIG?
BIG today is
normal tomorrow
44. Data Growth (Millions of URLs)

Year: 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011
URLs:    1    4   10   24   55  120  250  500 1000 2150 4400 9000
46. 2012
Generating over
250 million
tweets per day
47. MongoDB enables us to scale
with the redefinition of BIG.
New processing tools like
Hadoop & Storm are enabling
us to process the new BIG.