Codemotion Milano 2013

Data Processing and
Aggregation
Massimo Brignoli
Solutions Architect, MongoDB Inc.

Except where otherwise noted, this work is licensed under http://creativecommons.org/licenses/by-nc-sa/3.0/
Who Am I?
• Solutions Architect/Evangelist at MongoDB Inc.
• 20 years of experience in databases
• Former MySQL employee

• Previous life: web, web, web

Big Data

What is Big Data?
• Big Data is like teenage sex:
• everyone talks about it
• nobody really knows how to do it

• everyone thinks everyone else is doing it
• so everyone claims they are doing it…

Understanding Big Data – It’s Not Very “Big”

64% - Ingest diverse, new data in real-time
15% - More than 100TB of data
20% - Less than 100TB (average of all? <20TB)
from Big Data Executive Summary – 50+ top executives from Government and F500 firms
For over a decade

Big Data == Custom
Software

Lots of Great Innovations Since 1970
Including the Relational Database
RDBMS Makes Development Hard

[Diagram: Application (Code, XML Config, DB Schema), Object Relational Mapping, Relational Database]
And Even Harder To Iterate
[Diagram: a schema with Name, Pet, Phone and Email fields, where each change means a New Table or a New Column; "3 months later…"]
From Complexity to Simplicity
RDBMS
MongoDB
{
  _id : ObjectId("4c4ba5e5e8aabf3"),
  employee_name: "Dunham, Justin",
  department : "Marketing",
  title : "Product Manager, Web",
  report_up: "Neray, Graham",
  pay_band: "C",
  benefits : [
    { type : "Health", plan : "PPO Plus" },
    { type : "Dental", plan : "Standard" }
  ]
}
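To make the contrast concrete, here is a minimal Python sketch of what the same record looks like once it is split for a relational store. The table layout, the `id` and `employee_id` columns, and the function name are assumptions made for illustration, not something taken from the slides.

```python
# The employee record from the slide as one nested document.
employee = {
    "employee_name": "Dunham, Justin",
    "department": "Marketing",
    "title": "Product Manager, Web",
    "report_up": "Neray, Graham",
    "pay_band": "C",
    "benefits": [
        {"type": "Health", "plan": "PPO Plus"},
        {"type": "Dental", "plan": "Standard"},
    ],
}


def to_relational_rows(doc, employee_id):
    """Split one nested document into rows for two hypothetical tables."""
    # Everything except the embedded array becomes the "employees" row.
    employee_row = {k: v for k, v in doc.items() if k != "benefits"}
    employee_row["id"] = employee_id
    # Each embedded benefit becomes a row that points back at the employee.
    benefit_rows = [
        {"employee_id": employee_id, "type": b["type"], "plan": b["plan"]}
        for b in doc.get("benefits", [])
    ]
    return employee_row, benefit_rows


employee_row, benefit_rows = to_relational_rows(employee, employee_id=1)
print(employee_row)
print(benefit_rows)
```

Reading the record back from the relational form needs a join across both tables, whereas the document form is fetched as a single unit.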
In the past few years
Open source software has
emerged enabling the rest of
us to handle Big Data

Use Popular, Well-Known Technologies

Source: Silicon Angle, 2012
Enterprise Big Data Stack
• Applications: CRM, ERP, Collaboration, Mobile, BI
• Data Management
− Online Data: RDBMS
− Offline Data: Hadoop, EDW
• Infrastructure: OS & Virtualization, Compute, Storage, Network
• Security & Auditing
• Management & Monitoring
Consideration – Online vs. Offline
Online

• Real-time
• Low-latency
• High availability

vs.

Offline

• Long-running
• High-Latency
• Availability is lower priority
How MongoDB Meets Our
Requirements
• MongoDB is an operational database
• MongoDB provides high performance for storage and retrieval at large scale
• MongoDB has a robust query interface permitting intelligent operations
• MongoDB is not a data processing engine, but provides processing functionality

MongoDB data processing options
http://www.flickr.com/photos/torek/4444673930/
Getting Example Data

The “hello world” of
MapReduce is counting words
in a paragraph of text.
Let’s try something a little
more interesting…
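For reference, the word-count "hello world" mentioned above can be sketched in plain Python. This is only an illustration of the map and reduce steps; the function names and the sample sentence are made up here, and no MapReduce engine is involved.

```python
from collections import defaultdict


def map_words(paragraph):
    """Map step: emit a (word, 1) pair for every word in the text."""
    for word in paragraph.lower().split():
        yield word.strip(".,!?;:\"'"), 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted values for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        if key:  # skip tokens that were only punctuation
            totals[key] += value
    return dict(totals)


word_counts = reduce_counts(map_words("The quick fox and the lazy dog. The end!"))
print(word_counts)
```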

What is the most popular pub
name?

Open Street Map Data
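#!/usr/bin/env python
# Data Source
# http://www.overpass-api.de/api/xapi?*[amenity=pub][bbox=-10.5,49.78,1.78,59]
import re
import sys

from imposm.parser import OSMParser
import pymongo

# Assumption: a local mongod; the names match the demo.pubs namespace used later.
collection = pymongo.MongoClient()["demo"]["pubs"]
node_points = {}  # coordinates of unnamed nodes, keyed by OSM id


class Handler(object):
    def nodes(self, nodes):
        if not nodes:
            return
        docs = []
        for node in nodes:
            osm_id, doc, (lon, lat) = node
            if "name" not in doc:
                node_points[osm_id] = (lon, lat)
                continue
            # Title-case the name, drop a leading "The " and turn "And" into "&".
            # A prefix check is used because str.lstrip("The ") strips a set of
            # characters rather than the prefix.
            name = doc["name"].title()
            if name.startswith("The "):
                name = name[len("The "):]
            doc["name"] = name.replace("And", "&")
            doc["_id"] = osm_id
            doc["location"] = {"type": "Point", "coordinates": [lon, lat]}
            docs.append(doc)
        if docs:  # insert_many rejects an empty list
            collection.insert_many(docs)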
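A note on the name clean-up step: `str.lstrip("The ")` removes any leading run of the characters `T`, `h`, `e` and space rather than the literal prefix, so a prefix check is the safer way to drop a leading "The ". The sketch below isolates that step; the function name `normalise_name` is an assumption introduced for illustration.

```python
def normalise_name(raw_name):
    """Title-case a pub name, drop a leading "The ", and turn "And" into "&"."""
    name = raw_name.title()
    if name.startswith("The "):
        name = name[len("The "):]
    return name.replace("And", "&")


# lstrip treats its argument as a character set, so it eats into the next word.
print("The Three Tuns".lstrip("The "))
print(normalise_name("the three tuns"))
print(normalise_name("rose and crown"))
```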

Example Pub Data
{
"_id" : 451152,
"amenity" : "pub",
"name" : "The Dignity",
"addr:housenumber" : "363",
"addr:street" : "Regents Park Road",
"addr:city" : "London",
"addr:postcode" : "N3 1DH",
"toilets" : "yes",
"toilets:access" : "customers",
"location" : {
"type" : "Point",
"coordinates" : [-0.1945732, 51.6008172]
}
}

MapReduce

MongoDB MapReduce
[Diagram: data in MongoDB passes through map, reduce and finalize]
Map Function
> var map = function() {
    emit(this.name, 1);
}
Reduce Function
> var reduce = function (key, values) {
    var sum = 0;
    values.forEach( function (val) {sum += val;} );
    return sum;
}
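How the two functions fit together can be sketched in plain Python. This is a simplified model rather than MongoDB's implementation: it groups the emitted values by key and reduces each key once, whereas MongoDB may call reduce several times for the same key. It does mirror one documented behaviour, namely that reduce is not called for a key with a single emitted value. The sample documents are made up.

```python
from collections import defaultdict


def map_fn(doc):
    """Mirror of the map function: emit (name, 1) for one document."""
    yield doc["name"], 1


def reduce_fn(key, values):
    """Mirror of the reduce function: sum the emitted values."""
    return sum(values)


def map_reduce(docs):
    """Group emitted values by key, then reduce keys that have several values."""
    emitted = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            emitted[key].append(value)
    reduce_calls = 0
    output = {}
    for key, values in emitted.items():
        if len(values) == 1:
            output[key] = values[0]  # single value: reduce is skipped
        else:
            output[key] = reduce_fn(key, values)
            reduce_calls += 1
    return output, reduce_calls


sample_pubs = [  # made-up sample documents
    {"name": "The Red Lion"},
    {"name": "The Crown"},
    {"name": "The Red Lion"},
]
pub_counts, reduce_calls = map_reduce(sample_pubs)
print(pub_counts, reduce_calls)
```

This is why the number of reduce calls reported by a mapReduce run can be much smaller than the number of output documents.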

Results
> db.pub_names.find().sort({value: -1}).limit(10)
{ "_id" : "The Red Lion", "value" : 407 }
{ "_id" : "The Royal Oak", "value" : 328 }
{ "_id" : "The Crown", "value" : 242 }
{ "_id" : "The White Hart", "value" : 214 }
{ "_id" : "The White Horse", "value" : 200 }
{ "_id" : "The New Inn", "value" : 187 }
{ "_id" : "The Plough", "value" : 185 }
{ "_id" : "The Rose & Crown", "value" : 164 }
{ "_id" : "The Wheatsheaf", "value" : 147 }
{ "_id" : "The Swan", "value" : 140 }

Pub Names in the Center of London
> db.pubs.mapReduce(map, reduce, { out: "pub_names",
query: {
location: {
$within: { $centerSphere: [[-0.12, 51.516], 2 / 3959] }
}}
})

{
"result" : "pub_names",
"timeMillis" : 116,
"counts" : {
"input" : 643,
"emit" : 643,
"reduce" : 54,
"output" : 537
},
"ok" : 1
}
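The `2 / 3959` in the query is a radius in radians: 2 miles divided by the Earth's radius of roughly 3959 miles. The sketch below reproduces that arithmetic and approximates the "within" test with the haversine formula. It is an illustration under that assumption, not MongoDB's own geometry code, and the helper names are invented here.

```python
import math

EARTH_RADIUS_MILES = 3959  # the divisor used in the query above


def miles_to_radians(miles):
    """Convert a distance on the Earth's surface to an angle in radians."""
    return miles / EARTH_RADIUS_MILES


def central_angle(lon1, lat1, lon2, lat2):
    """Haversine central angle, in radians, between two lon/lat points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * math.asin(math.sqrt(a))


def within_center_sphere(point, center, radius_radians):
    """Approximate $centerSphere: is the point inside the spherical cap?"""
    return central_angle(point[0], point[1], center[0], center[1]) <= radius_radians


center = [-0.12, 51.516]
radius = miles_to_radians(2)
# The example pub document shown earlier lies several miles north of the centre.
print(within_center_sphere([-0.1945732, 51.6008172], center, radius))
print(within_center_sphere([-0.12, 51.52], center, radius))
```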
Results
> db.pub_names.find().sort({value: -1}).limit(10)
{ "_id" : "All Bar One", "value" : 11 }
{ "_id" : "The Slug & Lettuce", "value" : 7 }
{ "_id" : "The Coach & Horses", "value" : 6 }
{ "_id" : "The Green Man", "value" : 5 }
{ "_id" : "The Kings Arms", "value" : 5 }
{ "_id" : "The Red Lion", "value" : 5 }
{ "_id" : "Corney & Barrow", "value" : 4 }
{ "_id" : "O'Neills", "value" : 4 }
{ "_id" : "Pitcher & Piano", "value" : 4 }
{ "_id" : "The Crown", "value" : 4 }

MongoDB MapReduce
• Real-time
• Output directly to document or collection
• Runs inside MongoDB on local data

− Adds load to your DB
− In Javascript – debugging can be a challenge
− Translating in and out of C++

Aggregation Framework

Aggregation Framework
[Diagram: data in MongoDB passes through a pipeline of operators op1, op2, ... opN]
Aggregation Framework in 60
Seconds

Aggregation Framework Operators
• $project
• $match
• $limit

• $skip
• $sort
• $unwind
• $group

$match
• Filter documents
• Uses existing query syntax
• If using $geoNear it has to be first in pipeline

• $where is not supported

Matching Field Values
Input documents:
{
  "_id" : 271421,
  "amenity" : "pub",
  "name" : "Sir Walter Tyrrell",
  "location" : {
    "type" : "Point",
    "coordinates" : [
      -1.6192422,
      50.9131996
    ]
  }
}
{
  "_id" : 271466,
  "amenity" : "pub",
  "name" : "The Red Lion",
  "location" : {
    "type" : "Point",
    "coordinates" : [
      -1.5494749,
      50.7837119
    ]
  }
}
Pipeline stage:
{ "$match": {
  "name": "The Red Lion"
}}
Result:
{
  "_id" : 271466,
  "amenity" : "pub",
  "name" : "The Red Lion",
  "location" : {
    "type" : "Point",
    "coordinates" : [
      -1.5494749,
      50.7837119
    ]
  }
}
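The effect of this stage can be imitated in a few lines of Python. The sketch only covers equality on top-level fields, which is all the example needs; real `$match` accepts the full query syntax. The function name `match` is an assumption.

```python
def match(docs, query):
    """Keep documents whose top-level fields equal every value in the query."""
    return [d for d in docs if all(d.get(k) == v for k, v in query.items())]


pubs = [
    {"_id": 271421, "amenity": "pub", "name": "Sir Walter Tyrrell"},
    {"_id": 271466, "amenity": "pub", "name": "The Red Lion"},
]
matched = match(pubs, {"name": "The Red Lion"})
print(matched)
```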

$project
• Reshape documents
• Include, exclude or rename fields
• Inject computed fields

• Create sub-document fields

Including and Excluding Fields
Input document:
{
  "_id" : 271466,
  "amenity" : "pub",
  "name" : "The Red Lion",
  "location" : {
    "type" : "Point",
    "coordinates" : [
      -1.5494749,
      50.7837119
    ]
  }
}
Pipeline stage:
{ "$project": {
  "_id": 0,
  "amenity": 1,
  "name": 1
}}
Result:
{
  "amenity" : "pub",
  "name" : "The Red Lion"
}
Reformatting Documents
Input document:
{
  "_id" : 271466,
  "amenity" : "pub",
  "name" : "The Red Lion",
  "location" : {
    "type" : "Point",
    "coordinates" : [
      -1.5494749,
      50.7837119
    ]
  }
}
Pipeline stage:
{ "$project": {
  "_id": 0,
  "name": 1,
  "meta": {
    "type": "$amenity"}
}}
Result:
{
  "name" : "The Red Lion",
  "meta" : {
    "type" : "pub"
}}
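The reshaping rules used on these two slides can be modelled with a small recursive function. This is a sketch of a subset of `$project` only: `1` includes a field, a `"$field"` string copies another field's value, a nested object builds a sub-document, and anything else (such as `"_id": 0`) is left out. Unlike MongoDB, the sketch does not include `_id` by default. The function name `project` is an assumption.

```python
def project(doc, spec):
    """Apply a small subset of $project to one document."""
    out = {}
    for key, rule in spec.items():
        if isinstance(rule, dict):
            out[key] = project(doc, rule)       # build a sub-document
        elif isinstance(rule, str) and rule.startswith("$"):
            if rule[1:] in doc:
                out[key] = doc[rule[1:]]        # copy another field's value
        elif rule == 1:
            if key in doc:
                out[key] = doc[key]             # plain inclusion
        # any other rule, such as 0, excludes the field
    return out


pub = {"_id": 271466, "amenity": "pub", "name": "The Red Lion"}
print(project(pub, {"_id": 0, "amenity": 1, "name": 1}))
print(project(pub, {"_id": 0, "name": 1, "meta": {"type": "$amenity"}}))
```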

Dealing with Arrays
Input document:
{
  "_id" : 271466,
  "amenity" : "pub",
  "name" : "The Red Lion",
  "facilities" : [
    "toilets",
    "food"
  ]
}
Pipeline stages:
{ "$project": {
  "_id": 0,
  "name": 1,
  "facility": "$facilities"
}},
{ "$unwind": "$facility" }
Result:
{ "name" : "The Red Lion",
  "facility" : "toilets" },
{ "name" : "The Red Lion",
  "facility" : "food" }
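What `$unwind` does to an array field can be sketched as a Python generator: each input document produces one output document per array element, and documents where the array is missing or empty produce nothing. The function name and sample data are assumptions for illustration.

```python
def unwind(docs, field):
    """Yield one copy of each document per element of the array in `field`."""
    for doc in docs:
        for item in doc.get(field, []):
            copy = dict(doc)
            copy[field] = item  # replace the array with a single element
            yield copy


projected = [
    {"name": "The Red Lion", "facility": ["toilets", "food"]},
    {"name": "The Crown"},  # no array, so it yields nothing
]
unwound = list(unwind(projected, "facility"))
print(unwound)
```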
$group
• Group documents by an ID
• Field reference, object, constant
• Other output fields are computed

$max, $min, $avg, $sum
$addToSet, $push, $first, $last
• Processes all data in memory
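As a rough mental model, `$group` buckets documents by the `_id` expression and then computes each output field from the bucket. The Python sketch below does the same with callables standing in for accumulators. Like the stage it imitates, it holds every bucket in memory. The function name, the `city` field and the sample data are assumptions.

```python
from collections import defaultdict


def group(docs, key, accumulators):
    """Group documents by `key` and compute named accumulators per group."""
    buckets = defaultdict(list)
    for doc in docs:
        buckets[doc[key]].append(doc)
    results = []
    for group_id, members in buckets.items():
        row = {"_id": group_id}
        for out_field, fn in accumulators.items():
            row[out_field] = fn(members)
        results.append(row)
    return results


sample = [  # made-up sample documents
    {"name": "The Crown", "city": "London"},
    {"name": "The Crown", "city": "Leeds"},
    {"name": "The Swan", "city": "London"},
]
grouped = group(sample, "name", {
    "value": len,                                           # like {$sum: 1}
    "cities": lambda ms: sorted({m["city"] for m in ms}),   # like $addToSet
})
print(grouped)
```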

Back to the pub!

http://www.offwestend.com/index.php/theatres/pastshows/71

Popular Pub Names
> var popular_pub_names = [
  { $match : { location:
      { $within: { $centerSphere:
          [[-0.12, 51.516], 2 / 3959]}}}
  },
  { $group :
      { _id: "$name",
        value: {$sum: 1} }
  },
  { $sort : {value: -1} },
  { $limit : 10 }
]

Results
> db.pubs.aggregate(popular_pub_names)
{
"result" : [
{ "_id" : "All Bar One", "value" : 11 },
{ "_id" : "The Slug & Lettuce", "value" : 7 },
{ "_id" : "The Coach & Horses", "value" : 6 },
{ "_id" : "The Green Man", "value" : 5 },
{ "_id" : "The Kings Arms", "value" : 5 },
{ "_id" : "The Red Lion", "value" : 5 },
{ "_id" : "Corney & Barrow", "value" : 4 },
{ "_id" : "O'Neills", "value" : 4 },
{ "_id" : "Pitcher & Piano", "value" : 4 },
{ "_id" : "The Crown", "value" : 4 }
],
"ok" : 1
}
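Several names in this list tie on `value`. Sorting on `value` alone does not pin down the order of tied entries, so which names make the cut at the limit can vary. A second sort key, for example `{ $sort : {value: -1, _id: 1} }`, makes the result repeatable. The Python sketch below shows the same idea on a few of the rows above; the function name is an assumption.

```python
def top_n(rows, n):
    """Sort by descending value, then by _id so ties come out in a fixed order."""
    return sorted(rows, key=lambda r: (-r["value"], r["_id"]))[:n]


rows = [
    {"_id": "The Red Lion", "value": 5},
    {"_id": "The Crown", "value": 4},
    {"_id": "The Kings Arms", "value": 5},
    {"_id": "All Bar One", "value": 11},
    {"_id": "The Green Man", "value": 5},
]
top_four = top_n(rows, 4)
print([r["_id"] for r in top_four])
```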
Aggregation Framework Benefits
• Real-time
• Simple yet powerful interface
• Declared in JSON, executes in C++

• Runs inside MongoDB on local data

− Adds load to your DB
− Limited Operators
− Data output is limited

Analyzing MongoDB Data in
External Systems

MongoDB with Hadoop
[Diagram: MongoDB]
MongoDB with Hadoop
[Diagram: MongoDB, warehouse]
MongoDB with Hadoop
[Diagram: ETL, MongoDB]
Map Pub Names in Python
#!/usr/bin/env python
import sys

from pymongo_hadoop import BSONMapper


def mapper(documents):
    # get_bounds() and get_geo() are helpers that are not shown on the slide.
    bounds = get_bounds()  # ~2 mile polygon
    for doc in documents:
        geo = get_geo(doc["location"])  # Convert the geo type
        if not geo:
            continue
        if bounds.intersects(geo):
            yield {'_id': doc['name'], 'count': 1}


BSONMapper(mapper)
sys.stderr.write("Done Mapping.\n")

Reduce Pub Names in Python
#!/usr/bin/env python

from pymongo_hadoop import BSONReducer


def reducer(key, values):
    _count = 0
    for v in values:
        _count += v['count']
    return {'_id': key, 'value': _count}


BSONReducer(reducer)
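The mapper and reducer logic can be tried out without a Hadoop cluster by imitating the map, sort and reduce phases locally. The sketch below is an assumption-laden simplification: it does not use `pymongo_hadoop`, the mapper omits the geo filter, and `run_locally` is a made-up helper that stands in for the shuffle step Hadoop performs between the two scripts.

```python
from itertools import groupby
from operator import itemgetter


def mapper(documents):
    """Simplified mapper: emit every name (the geo filter is left out)."""
    for doc in documents:
        yield {"_id": doc["name"], "count": 1}


def reducer(key, values):
    """Sum the 'count' fields of all values emitted for one key."""
    return {"_id": key, "value": sum(v["count"] for v in values)}


def run_locally(documents):
    """Map, sort by key (standing in for the shuffle), then reduce each group."""
    mapped = sorted(mapper(documents), key=itemgetter("_id"))
    return [
        reducer(key, list(values))
        for key, values in groupby(mapped, key=itemgetter("_id"))
    ]


sample_docs = [  # made-up sample documents
    {"name": "The Crown"},
    {"name": "All Bar One"},
    {"name": "The Crown"},
]
local_result = run_locally(sample_docs)
print(local_result)
```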
Execute MapReduce
hadoop jar target/mongo-hadoop-streaming-assembly-1.1.0-rc0.jar \
  -mapper examples/pub/map.py \
  -reducer examples/pub/reduce.py \
  -mongo mongodb://127.0.0.1/demo.pubs \
  -outputURI mongodb://127.0.0.1/demo.pub_names

Popular Pub Names Nearby
> db.pub_names.find().sort({value: -1}).limit(10)
{ "_id" : "All Bar One", "value" : 11 }
{ "_id" : "The Slug & Lettuce", "value" : 7 }
{ "_id" : "The Coach & Horses", "value" : 6 }
{ "_id" : "The Kings Arms", "value" : 5 }
{ "_id" : "Corney & Barrow", "value" : 4 }
{ "_id" : "O'Neills", "value" : 4 }
{ "_id" : "Pitcher & Piano", "value" : 4 }
{ "_id" : "The Crown", "value" : 4 }
{ "_id" : "The George", "value" : 4 }
{ "_id" : "The Green Man", "value" : 4 }

MongoDB and Hadoop
• Away from data store
• Can leverage existing data processing infrastructure
• Can horizontally scale your data processing
- Offline batch processing
- Requires synchronisation between store & processor
- Infrastructure is much more complex

The Future of Big Data and
MongoDB

What is Big Data?
Big Data today will be
normal tomorrow

Exponential Data Growth
Billions of URLs indexed by Google
[Chart: y-axis 0 to 10000, x-axis 2000 to 2012]
MongoDB enables you to
scale big

MongoDB is evolving

so you can process the
big

Data Processing with MongoDB
• Process in MongoDB using Map/Reduce
• Process in MongoDB using Aggregation Framework
• Process outside MongoDB using Hadoop and other external tools

Questions?

Codemotion Milano

Thanks!
Massimo Brignoli
Solutions Architect, MongoDB Inc.


 

Dernier (20)

Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 


  • 1. Codemotion Milano 2013 Data Processing and Aggregation Massimo Brignoli Solutions Architect, MongoDB Inc. Except where otherwise noted, this work is licensed under http://createivecommons.org/licenses/by-nc-sa/3.0/
  • 2. Who Am I? • Solutions Architect/Evangelist at MongoDB Inc. • 20 years of experience in databases • Former MySQL employee • Previous life: web, web, web
  • 3. Big Data
  • 4. What is Big Data? • Big Data is like teenage sex: • everyone talks about it • nobody really knows how to do it • everyone thinks everyone else is doing it • so everyone claims they are doing it…
  • 5. Understanding Big Data – It’s Not Very “Big” • 64% - Ingest diverse, new data in real-time • 15% - More than 100TB of data • 20% - Less than 100TB (average of all? <20TB) • Source: Big Data Executive Summary – 50+ top executives from Government and F500 firms
  • 6. For over a decade: Big Data == Custom Software
  • 7. Lots of Great Innovations Since 1970
  • 9. RDBMS Makes Development Hard (diagram labels: Code, XML Config, DB Schema, Application, Object Relational Mapping, Relational Database)
  • 10. And Even Harder To Iterate (diagram labels: New Table, New Column, New Table, Name, Pet, Phone, New Column, Email; 3 months later…)
  • 11. From Complexity to Simplicity: RDBMS vs. MongoDB
      {
        _id : ObjectId("4c4ba5e5e8aabf3"),
        employee_name : "Dunham, Justin",
        department : "Marketing",
        title : "Product Manager, Web",
        report_up : "Neray, Graham",
        pay_band : "C",
        benefits : [
          { type : "Health", plan : "PPO Plus" },
          { type : "Dental", plan : "Standard" }
        ]
      }
  • 12. In the past few years open source software has emerged, enabling the rest of us to handle Big Data
  • 13. Use Popular, Well-Known Technologies (Source: Silicon Angle, 2012)
  • 14. Enterprise Big Data Stack (diagram labels: Applications: CRM, ERP, Collaboration, Mobile, BI; Data Management: Online Data (RDBMS, RDBMS), Offline Data (Hadoop, EDW); Infrastructure: OS & Virtualization, Compute, Storage, Network; Security & Auditing; Management & Monitoring)
  • 15. Consideration – Online vs. Offline • Online: real-time, low-latency, high availability • Offline: long-running, high-latency, availability is lower priority
  • 16. How MongoDB Meets Our Requirements • MongoDB is an operational database • MongoDB provides high performance for storage and retrieval at large scale • MongoDB has a robust query interface permitting intelligent operations • MongoDB is not a data processing engine, but provides processing functionality
  • 17. MongoDB data processing options (photo: http://www.flickr.com/photos/torek/4444673930/)
  • 18. Getting Example Data
  • 19. The “hello world” of MapReduce is counting words in a paragraph of text. Let’s try something a little more interesting…
  • 20. What is the most popular pub name?
  • 21. Open Street Map Data
      #!/usr/bin/env python
      # Data Source
      # http://www.overpass-api.de/api/xapi?*[amenity=pub][bbox=-10.5,49.78,1.78,59]
      import re
      import sys
      from imposm.parser import OSMParser
      import pymongo

      # Excerpt: node_points and collection are not defined on this slide.
      class Handler(object):
          def nodes(self, nodes):
              if not nodes:
                  return
              docs = []
              for node in nodes:
                  osm_id, doc, (lon, lat) = node
                  if "name" not in doc:
                      # Unnamed nodes only contribute their coordinates.
                      node_points[osm_id] = (lon, lat)
                      continue
                  doc["name"] = doc["name"].title().lstrip("The ").replace("And", "&")
                  doc["_id"] = osm_id
                  # Store the position as a GeoJSON point.
                  doc["location"] = {"type": "Point", "coordinates": [lon, lat]}
                  docs.append(doc)
              collection.insert(docs)
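One caveat about the name clean-up on slide 21: in Python, `str.lstrip("The ")` removes any leading run of the characters `T`, `h`, `e` and space rather than the literal prefix, and `replace("And", "&")` also rewrites names that merely start with "And". The sketch below shows the effect and one possible alternative. It assumes the intent is to drop a single leading "The " and to replace only the standalone word "And"; the function name `normalize_pub_name` is made up for this illustration and is not part of the original script.

```python
import re


def normalize_pub_name(raw_name):
    """Title-case a name, drop one leading 'The ', and swap the word 'And' for '&'.

    Hypothetical helper: the intended clean-up rules are an assumption.
    """
    name = raw_name.title()
    if name.startswith("The "):
        name = name[len("The "):]
    # \b keeps words such as "Andover" intact.
    return re.sub(r"\bAnd\b", "&", name)


# str.lstrip treats its argument as a set of characters, not as a prefix.
print("The Three Tuns".lstrip("The "))       # ree Tuns
print(normalize_pub_name("the three tuns"))  # Three Tuns
print(normalize_pub_name("rose and crown"))  # Rose & Crown
```

Whether the prefix should be dropped at all is a data-modelling choice; the point here is only that a prefix check and a word-boundary pattern avoid mangling names.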
  • 22. Example Pub Data
      {
        "_id" : 451152,
        "amenity" : "pub",
        "name" : "The Dignity",
        "addr:housenumber" : "363",
        "addr:street" : "Regents Park Road",
        "addr:city" : "London",
        "addr:postcode" : "N3 1DH",
        "toilets" : "yes",
        "toilets:access" : "customers",
        "location" : { "type" : "Point", "coordinates" : [-0.1945732, 51.6008172] }
      }
  • 23. MapReduce
  • 24. MongoDB MapReduce (diagram: map, reduce and finalize stages running inside MongoDB)
  • 25. Map Function
      > var map = function() {
          emit(this.name, 1);
        }
  • 26. Reduce Function
      > var reduce = function (key, values) {
          var sum = 0;
          values.forEach( function (val) { sum += val; } );
          return sum;
        }
  • 27. Results
      > db.pub_names.find().sort({value: -1}).limit(10)
      { "_id" : "The Red Lion", "value" : 407 }
      { "_id" : "The Royal Oak", "value" : 328 }
      { "_id" : "The Crown", "value" : 242 }
      { "_id" : "The White Hart", "value" : 214 }
      { "_id" : "The White Horse", "value" : 200 }
      { "_id" : "The New Inn", "value" : 187 }
      { "_id" : "The Plough", "value" : 185 }
      { "_id" : "The Rose & Crown", "value" : 164 }
      { "_id" : "The Wheatsheaf", "value" : 147 }
      { "_id" : "The Swan", "value" : 140 }
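To make the map and reduce steps above concrete without a running server, here is a minimal pure-Python sketch of the same counting job. It is a simulation, not MongoDB's implementation: `map_fn`, `reduce_fn`, `map_reduce` and the sample names are illustrative. It mirrors one documented behaviour, namely that MongoDB does not call the reduce function for a key that has only a single emitted value.

```python
from collections import defaultdict


def map_fn(doc):
    # Mirrors emit(this.name, 1) from the map slide.
    yield doc["name"], 1


def reduce_fn(key, values):
    # Mirrors the JavaScript reduce: sum all emitted values for one key.
    return sum(values)


def map_reduce(docs, map_fn, reduce_fn):
    """Run map, group by key, then reduce; also track simple counters."""
    emitted = defaultdict(list)
    counts = {"input": 0, "emit": 0, "reduce": 0, "output": 0}
    for doc in docs:
        counts["input"] += 1
        for key, value in map_fn(doc):
            emitted[key].append(value)
            counts["emit"] += 1
    results = []
    for key, values in emitted.items():
        if len(values) > 1:
            value = reduce_fn(key, values)
            counts["reduce"] += 1
        else:
            # A key with a single value is passed through without reducing.
            value = values[0]
        results.append({"_id": key, "value": value})
    counts["output"] = len(results)
    # Equivalent of .sort({value: -1}), with the name as a tie-breaker.
    results.sort(key=lambda r: (-r["value"], r["_id"]))
    return results, counts


sample = [{"name": n} for n in [
    "The Red Lion", "The Crown", "The Red Lion",
    "The Swan", "The Red Lion", "The Crown",
]]
results, counts = map_reduce(sample, map_fn, reduce_fn)
print(results[0])  # {'_id': 'The Red Lion', 'value': 3}
print(counts)      # {'input': 6, 'emit': 6, 'reduce': 2, 'output': 3}
```

The counters show why the number of reduce calls can be much smaller than the number of emits: only names that occur more than once need reducing.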
  • 29. Pub Names in the Center of London
      > db.pubs.mapReduce(map, reduce, {
          out: "pub_names",
          query: { location: { $within: { $centerSphere: [[-0.12, 51.516], 2 / 3959] } } }
        })
      {
        "result" : "pub_names",
        "timeMillis" : 116,
        "counts" : { "input" : 643, "emit" : 643, "reduce" : 54, "output" : 537 },
        "ok" : 1
      }
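The `2 / 3959` in the query above is a radius in radians: `$centerSphere` takes an angular radius, so a distance of 2 miles is divided by the Earth's radius of roughly 3959 miles. The sketch below reproduces that conversion and an equivalent point-in-circle test with the haversine formula. It is an approximation written for illustration (function names are made up, and the Earth is treated as a perfect sphere), not the server's geospatial code.

```python
import math

EARTH_RADIUS_MILES = 3959.0  # same constant as in the query above


def central_angle(lon1, lat1, lon2, lat2):
    """Great-circle angle in radians between two lon/lat points (haversine)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * math.asin(math.sqrt(a))


def within_center_sphere(point, center, radius_radians):
    """True if point ([lon, lat]) lies within the angular radius of center."""
    return central_angle(point[0], point[1], center[0], center[1]) <= radius_radians


center = [-0.12, 51.516]                 # [longitude, latitude], as in the query
radius = 2 / EARTH_RADIUS_MILES          # 2 miles expressed in radians

the_dignity = [-0.1945732, 51.6008172]   # example pub from slide 22
half_mile_north = [-0.12, 51.516 + math.degrees(0.5 / EARTH_RADIUS_MILES)]

print(within_center_sphere(the_dignity, center, radius))      # False
print(within_center_sphere(half_mile_north, center, radius))  # True
```

Note the coordinate order: GeoJSON points and the query both use longitude first, then latitude.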
  • 30. Results
      > db.pub_names.find().sort({value: -1}).limit(10)
      { "_id" : "All Bar One", "value" : 11 }
      { "_id" : "The Slug & Lettuce", "value" : 7 }
      { "_id" : "The Coach & Horses", "value" : 6 }
      { "_id" : "The Green Man", "value" : 5 }
      { "_id" : "The Kings Arms", "value" : 5 }
      { "_id" : "The Red Lion", "value" : 5 }
      { "_id" : "Corney & Barrow", "value" : 4 }
      { "_id" : "O'Neills", "value" : 4 }
      { "_id" : "Pitcher & Piano", "value" : 4 }
      { "_id" : "The Crown", "value" : 4 }
  • 31. MongoDB MapReduce • Real-time • Output directly to document or collection • Runs inside MongoDB on local data − Adds load to your DB − In JavaScript: debugging can be a challenge − Translating in and out of C++
  • 32. Aggregation Framework
  • 33. Aggregation Framework (diagram: pipeline of operators op1, op2, opN inside MongoDB)
  • 34. Aggregation Framework in 60 Seconds
  • 35. Aggregation Framework Operators • $project • $match • $limit • $skip • $sort • $unwind • $group
  • 36. $match • Filter documents • Uses existing query syntax • If using $geoNear it has to be first in pipeline • $where is not supported
  • 37. Matching Field Values
      Input documents:
      { "_id" : 271421, "amenity" : "pub", "name" : "Sir Walter Tyrrell", "location" : { "type" : "Point", "coordinates" : [ -1.6192422, 50.9131996 ] } }
      { "_id" : 271466, "amenity" : "pub", "name" : "The Red Lion", "location" : { "type" : "Point", "coordinates" : [ -1.5494749, 50.7837119 ] } }
      Stage:
      { "$match": { "name": "The Red Lion" }}
      Output:
      { "_id" : 271466, "amenity" : "pub", "name" : "The Red Lion", "location" : { "type" : "Point", "coordinates" : [ -1.5494749, 50.7837119 ] } }
  • 38. $project • Reshape documents • Include, exclude or rename fields • Inject computed fields • Create sub-document fields
  • 39. Including and Excluding Fields
      Input document:
      { "_id" : 271466, "amenity" : "pub", "name" : "The Red Lion", "location" : { "type" : "Point", "coordinates" : [ -1.5494749, 50.7837119 ] } }
      Stage:
      { "$project": { "_id": 0, "amenity": 1, "name": 1 }}
      Output:
      { "amenity" : "pub", "name" : "The Red Lion" }
  • 40. Reformatting Documents
      Input document:
      { "_id" : 271466, "amenity" : "pub", "name" : "The Red Lion", "location" : { "type" : "Point", "coordinates" : [ -1.5494749, 50.7837119 ] } }
      Stage:
      { "$project": { "_id": 0, "name": 1, "meta": { "type": "$amenity" } }}
      Output:
      { "name" : "The Red Lion", "meta" : { "type" : "pub" } }
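The two `$project` examples above (including fields, and building a sub-document from a `"$field"` reference) can be imitated with a few lines of Python. This is a deliberately small sketch of the idea, not the server's semantics: it only handles top-level field references, inclusion with `1`, and the `_id: 0` exclusion shown on the slides, and the names `resolve` and `project` are made up for this illustration.

```python
def resolve(spec, doc):
    """Evaluate a projection value: "$field" references, nested dicts, literals."""
    if isinstance(spec, str) and spec.startswith("$"):
        return doc.get(spec[1:])
    if isinstance(spec, dict):
        return {key: resolve(value, doc) for key, value in spec.items()}
    return spec


def project(doc, spec):
    """Apply a simplified $project specification to one document."""
    out = {}
    # _id is kept unless it is explicitly set to 0.
    if spec.get("_id", 1) and "_id" in doc:
        out["_id"] = doc["_id"]
    for field, rule in spec.items():
        if field == "_id":
            continue
        if rule == 1:
            if field in doc:
                out[field] = doc[field]
        elif rule == 0:
            continue
        else:
            # Computed field: a reference or a sub-document of references.
            out[field] = resolve(rule, doc)
    return out


pub = {
    "_id": 271466,
    "amenity": "pub",
    "name": "The Red Lion",
    "location": {"type": "Point", "coordinates": [-1.5494749, 50.7837119]},
}

print(project(pub, {"_id": 0, "amenity": 1, "name": 1}))
# {'amenity': 'pub', 'name': 'The Red Lion'}
print(project(pub, {"_id": 0, "name": 1, "meta": {"type": "$amenity"}}))
# {'name': 'The Red Lion', 'meta': {'type': 'pub'}}
```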
  • 41. Dealing with Arrays
      Input document:
      { "_id" : 271466, "amenity" : "pub", "name" : "The Red Lion", "facilities" : [ "toilets", "food" ] }
      Stages:
      { "$project": { "_id": 0, "name": 1, "facility": "$facilities" }}
      { "$unwind": "$facility" }
      Output:
      { "name" : "The Red Lion", "facility" : "toilets" }
      { "name" : "The Red Lion", "facility" : "food" }
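`$unwind` emits one output document per array element, copying the other fields unchanged. A minimal Python sketch of that behaviour follows. The generator name `unwind` is illustrative, the field is assumed to hold a list, and documents where the field is missing or empty are assumed to produce no output, which matches the stage's default behaviour.

```python
def unwind(docs, field):
    """Yield one copy of each document per element of doc[field]."""
    for doc in docs:
        values = doc.get(field)
        if not values:
            # Missing or empty arrays produce no output documents.
            continue
        for value in values:
            out = dict(doc)     # shallow copy so the input is not modified
            out[field] = value
            yield out


docs = [
    {"name": "The Red Lion", "facility": ["toilets", "food"]},
    {"name": "The Crown"},  # no array: dropped
]
for doc in unwind(docs, "facility"):
    print(doc)
# {'name': 'The Red Lion', 'facility': 'toilets'}
# {'name': 'The Red Lion', 'facility': 'food'}
```

Unwinding is usually followed by `$group`, for example to count how many pubs offer each facility.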
  • 42. $group • Group documents by an ID • Field reference, object, constant • Other output fields are computed: $max, $min, $avg, $sum, $addToSet, $push, $first, $last • Processes all data in memory
  • 43. Back to the pub! • http://www.offwestend.com/index.php/theatres/pastshows/71
  • 44. Popular Pub Names
      > var popular_pub_names = [
          { $match : { location: { $within: { $centerSphere: [[-0.12, 51.516], 2 / 3959] } } } },
          { $group : { _id: "$name", value: { $sum: 1 } } },
          { $sort : { value: -1 } },
          { $limit : 10 }
        ]
  • 45. Results
      > db.pubs.aggregate(popular_pub_names)
      {
        "result" : [
          { "_id" : "All Bar One", "value" : 11 },
          { "_id" : "The Slug & Lettuce", "value" : 7 },
          { "_id" : "The Coach & Horses", "value" : 6 },
          { "_id" : "The Green Man", "value" : 5 },
          { "_id" : "The Kings Arms", "value" : 5 },
          { "_id" : "The Red Lion", "value" : 5 },
          { "_id" : "Corney & Barrow", "value" : 4 },
          { "_id" : "O'Neills", "value" : 4 },
          { "_id" : "Pitcher & Piano", "value" : 4 },
          { "_id" : "The Crown", "value" : 4 }
        ],
        "ok" : 1
      }
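The pipeline above is a chain of stages, each consuming the documents produced by the previous one. The following Python sketch imitates that chaining for the four stages used here (match, group with a count, sort, limit). It is an in-memory illustration only: the stage functions, the `central` flag standing in for the geospatial filter, and the sample data are all made up.

```python
from collections import Counter


def match(docs, predicate):
    """$match: keep only documents for which predicate is true."""
    return [d for d in docs if predicate(d)]


def group_count(docs, field):
    """$group with {$sum: 1}: count documents per distinct field value."""
    counts = Counter(d[field] for d in docs)
    return [{"_id": key, "value": n} for key, n in counts.items()]


def sort_by_value_desc(docs):
    """$sort on {value: -1}, with _id as a deterministic tie-breaker."""
    return sorted(docs, key=lambda d: (-d["value"], d["_id"]))


def limit(docs, n):
    """$limit: keep the first n documents."""
    return docs[:n]


pubs = [
    {"name": "All Bar One", "central": True},
    {"name": "All Bar One", "central": True},
    {"name": "All Bar One", "central": True},
    {"name": "The Crown", "central": True},
    {"name": "The Crown", "central": True},
    {"name": "The Red Lion", "central": True},
    {"name": "The Swan", "central": False},
    {"name": "The Swan", "central": False},
]

# Each stage feeds the next, like the array passed to aggregate().
stage1 = match(pubs, lambda d: d["central"])
stage2 = group_count(stage1, "name")
stage3 = sort_by_value_desc(stage2)
top = limit(stage3, 2)
print(top)
# [{'_id': 'All Bar One', 'value': 3}, {'_id': 'The Crown', 'value': 2}]
```

Putting the filter first matters in the real pipeline as well, since later stages then only see the documents that survived it.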
  • 46. Aggregation Framework Benefits • Real-time • Simple yet powerful interface • Declared in JSON, executes in C++ • Runs inside MongoDB on local data − Adds load to your DB − Limited operators − Data output is limited
  • 47. Analyzing MongoDB Data in External Systems
  • 48. MongoDB with Hadoop (diagram label: MongoDB)
  • 49. MongoDB with Hadoop (diagram labels: MongoDB, warehouse)
  • 50. MongoDB with Hadoop (diagram labels: ETL, MongoDB)
  • 51. Map Pub Names in Python
      #!/usr/bin/env python
      import sys
      from pymongo_hadoop import BSONMapper

      # get_bounds() and get_geo() are helpers that are not shown on this slide.
      def mapper(documents):
          bounds = get_bounds()  # ~2 mile polygon
          for doc in documents:
              geo = get_geo(doc["location"])  # Convert the geo type
              if not geo:
                  continue
              if bounds.intersects(geo):
                  # Emit one count per pub inside the polygon.
                  yield {'_id': doc['name'], 'count': 1}

      BSONMapper(mapper)
      sys.stderr.write("Done Mapping.\n")
  • 52. Reduce Pub Names in Python
      #!/usr/bin/env python
      from pymongo_hadoop import BSONReducer

      def reducer(key, values):
          # Sum the counts emitted by the mapper for one pub name.
          _count = 0
          for v in values:
              _count += v['count']
          return {'_id': key, 'value': _count}

      BSONReducer(reducer)
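Between the mapper and the reducer above, Hadoop sorts the mapper output and groups it by key before calling the reducer once per key. The sketch below imitates that flow locally with `itertools.groupby`, which can help when testing the mapper and reducer logic without a cluster. It is an illustration under stated assumptions: `run_local` and the `in_area` predicate (a stand-in for the polygon test on slide 51) are hypothetical and not part of the mongo-hadoop connector.

```python
from itertools import groupby
from operator import itemgetter


def mapper(documents, in_area):
    """Emit one count per document that passes the area test."""
    for doc in documents:
        if in_area(doc):
            yield {"_id": doc["name"], "count": 1}


def reducer(key, values):
    """Sum the counts for one key, as on the reducer slide."""
    return {"_id": key, "value": sum(v["count"] for v in values)}


def run_local(documents, in_area):
    """Simulate the shuffle: sort mapper output by key, then reduce per group."""
    mapped = sorted(mapper(documents, in_area), key=itemgetter("_id"))
    return [reducer(key, group)
            for key, group in groupby(mapped, key=itemgetter("_id"))]


pubs = [
    {"name": "The Crown", "nearby": True},
    {"name": "All Bar One", "nearby": True},
    {"name": "The Crown", "nearby": True},
    {"name": "The Swan", "nearby": False},
]
output = run_local(pubs, in_area=lambda d: d["nearby"])
print(output)
# [{'_id': 'All Bar One', 'value': 1}, {'_id': 'The Crown', 'value': 2}]
```

The sort step is essential: `groupby` only merges adjacent items, which is the same reason Hadoop sorts by key before reducing.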
  • 53. Execute MapReduce
      hadoop jar target/mongo-hadoop-streaming-assembly-1.1.0-rc0.jar \
          -mapper examples/pub/map.py \
          -reducer examples/pub/reduce.py \
          -mongo mongodb://127.0.0.1/demo.pubs \
          -outputURI mongodb://127.0.0.1/demo.pub_names
  • 54. Popular Pub Names Nearby
      > db.pub_names.find().sort({value: -1}).limit(10)
      { "_id" : "All Bar One", "value" : 11 }
      { "_id" : "The Slug & Lettuce", "value" : 7 }
      { "_id" : "The Coach & Horses", "value" : 6 }
      { "_id" : "The Kings Arms", "value" : 5 }
      { "_id" : "Corney & Barrow", "value" : 4 }
      { "_id" : "O'Neills", "value" : 4 }
      { "_id" : "Pitcher & Piano", "value" : 4 }
      { "_id" : "The Crown", "value" : 4 }
      { "_id" : "The George", "value" : 4 }
      { "_id" : "The Green Man", "value" : 4 }
  • 55. MongoDB and Hadoop • Away from data store • Can leverage existing data processing infrastructure • Can horizontally scale your data processing − Offline batch processing − Requires synchronisation between store & processor − Infrastructure is much more complex
  • 56. The Future of Big Data and MongoDB
  • 57. What is Big Data? Big Data today will be normal tomorrow
  • 58. Exponential Data Growth (chart: billions of URLs indexed by Google, 2000 to 2012; vertical axis from 0 to 10000)
  • 59. MongoDB enables you to scale big
  • 60. MongoDB is evolving so you can process the big
  • 61. Data Processing with MongoDB • Process in MongoDB using Map/Reduce • Process in MongoDB using Aggregation Framework • Process outside MongoDB using Hadoop and other external tools
  • 62. Questions?
  • 63. Codemotion Milano • Thanks! • Massimo Brignoli, Solutions Architect, MongoDB Inc.

Editor's notes

  1. IBM designed IMS with Rockwell and Caterpillar starting in 1966 for the Apollo program. IMS's challenge was to inventory the very large bill of materials (BOM) for the Saturn V moon rocket and Apollo space vehicle.
  2. This is helpful because as much as 95% of enterprise information is unstructured and doesn’t fit neatly into tidy rows and columns. NoSQL and Hadoop allow for dynamic schema.
  3. The industry is talking about Hadoop and MongoDB for Big Data. So should you.
  4. This is where MongoDB fits into the existing enterprise IT stack. MongoDB is an operational data store used for online data, in the same way that Oracle is an operational data store. It supports applications that ingest, store, manage and even analyze data in real-time. (Compare this with Hadoop and data warehouses, which are used for offline, batch analytical workloads.)
  5. Another common use case we see is warehousing of data. Again, the connector allows you to utilize existing libraries via Hadoop.
  6. The third most common use case is an ETL (extract, transform, load) function, followed by putting the aggregated data into MongoDB for further analysis.