Ken W. Alger – Developer Advocate, MongoDB
Exploring your MongoDB Data with Pirates (R) and
Snakes (Python)
@kenwalger
Ken W. Alger
Developer Advocate
Overview
§ The Document Model
§ Data Frames
§ R vs. Python
§ MongoDB to Data Frames
§ Array Consumption
§ The Power of MongoDB
The Document Model
Document Model Features
Naturally maps objects to
code using JSON
Represent data of any
structure. Our data model is
very flexible.
Strongly typed for ease of
processing. We support
over twenty binary
encoded JSON data types.
Document Model for Analytics
Flexibility helps with
feature engineering by
allowing for
experimentation and the
picking of features
iteratively.
For Deep Learning the
flexibility allows for faster
iteration.
Pre-filtering of data with
aggregation framework.
Flexibility is Great if You're a Python or a Pirate…
Pirate Ships
Island Packet 31
Götheborg & Batavia
Sail Data
{
"Name": "Götheborg",
"Year Completed": 1738,
"Sail Area": {
"Lateen mizzen": 160,
"Mizzen topsail": 2500,
…
},
…
}
{
"Name": "Batavia",
"Year Completed": 1628,
"Sail Area": {
"Lateen mizzen": 250,
"Main topsail": 3500,
…
},
…
}
…but some tasks require data that's rigidly structured.
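One common way to bridge the two worlds is to flatten nested sub-documents into dotted column names before building a data frame. The sketch below is a minimal pure-Python illustration; the helper name `flatten` and the dotted-key convention are assumptions made for this example, not something shown in the talk. pandas offers `pandas.json_normalize` for a similar job.

```python
def flatten(document, parent_key="", sep="."):
    """Return a flat dict; nested keys are joined with `sep` (hypothetical helper)."""
    flat = {}
    for key, value in document.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into sub-documents such as "Sail Area".
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Values taken from the Sail Data slide; the elided fields are left out.
gotheborg = {
    "Name": "Götheborg",
    "Year Completed": 1738,
    "Sail Area": {"Lateen mizzen": 160, "Mizzen topsail": 2500},
}

row = flatten(gotheborg)
print(row)
```

Each flattened document becomes one candidate row; fields that a ship lacks simply do not appear in its row.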
Data Frames
Data Frame
A data frame is a list of vectors, factors,
and/or matrices all having the same length
(number of rows in the case of matrices).
Used for storing data tables.
Data Frame Data
              Length (ft)  Year Completed  Displacement  Sail Area (sq. ft)
Batavia       186          1628            1200          33000
Cutty Sark    280          1869            2100          32000
Götheborg     190          1738            NaN           21140
HMS Endeavor  97.75        1764            NaN           29889
Kruzenshtern  375          1926            3064          NaN
HMS Victory   227.5        1765            3500          58556
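To make the "same length" constraint concrete, here is a small sketch that stores the table above as plain Python lists, one per column. Using `None` for the missing entries (shown as NaN on the slide) and the helper `mean_ignoring_missing` are assumptions of this example.

```python
ships = ["Batavia", "Cutty Sark", "Götheborg",
         "HMS Endeavor", "Kruzenshtern", "HMS Victory"]

# One list per column; None marks a missing value.
columns = {
    "Length (ft)": [186, 280, 190, 97.75, 375, 227.5],
    "Year Completed": [1628, 1869, 1738, 1764, 1926, 1765],
    "Displacement": [1200, 2100, None, None, 3064, 3500],
    "Sail Area (sq. ft)": [33000, 32000, 21140, 29889, None, 58556],
}

# The defining constraint: every column has exactly one entry per row.
same_length = all(len(values) == len(ships) for values in columns.values())


def mean_ignoring_missing(values):
    """Average the present values only (hypothetical helper)."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present)


mean_displacement = mean_ignoring_missing(columns["Displacement"])
print(same_length, mean_displacement)
```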
Distributed Data Frames
Big data distributed across clusters.
Use tools like:
R vs. Python
R Data Frame
The data frame is more or less built into the
R language.
More functional than Python.
More statistical support in general.
Python Data Frame
More object-oriented.
Relies on packages (pandas, NumPy, scikit-learn).
As a language, it's great for additional tasks
alongside analytics.
Language Usage
MongoDB to Data Frames
Sail Data
{
"Lateen mizzen": 120,
"Mizzen topsail": 400,
"Mainsail": 2500,
…
}
{
"Lateen mizzen": 120,
"Main topgallant": 600,
"Mainsail": 3500,
…
}
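library(mongolite)
# Connect to the "sails" collection in the "ships" database
connection <- mongo(collection = "sails",
                    db = "ships",
                    url = "mongodb://localhost")
# Fetch every document and convert the result to a data frame
sails <- data.frame(connection$find())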
MongoDB Data to Data Frames
import pandas as pd
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client.ships
# Materialize the cursor as a list of dicts, then build the data frame
df = pd.DataFrame(list(db.sails.find()))
Data Frame
Python (pandas):
   Lateen mizzen  Main topgallant  Mainsail  Mizzen topsail  _id
0  120.0          NaN              2500      400.0           5cf53129984ae2b600701611
1  120.0          600.0            3500      NaN             5cf53141984ae2b600701612
R:
   Lateen mizzen  Mizzen topsail  Mainsail  Main topgallant
1  120            400             2500      NA
2  120            NA              3500      600
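The missing-value cells above come from the way a list of documents is turned into a table: the columns are the union of all keys, and a document that lacks a key gets an empty cell. The following pure-Python sketch mimics that idea; it is an illustration of the concept, not how pandas or mongolite are implemented, and it leaves out `_id` and the elided fields.

```python
documents = [
    {"Lateen mizzen": 120, "Mizzen topsail": 400, "Mainsail": 2500},
    {"Lateen mizzen": 120, "Main topgallant": 600, "Mainsail": 3500},
]

# Column set = union of all keys, kept in first-seen order.
columns = []
for doc in documents:
    for key in doc:
        if key not in columns:
            columns.append(key)

# One row per document; absent fields become None (NaN / NA in the tables).
rows = [[doc.get(col) for col in columns] for doc in documents]

print(columns)
for row in rows:
    print(row)
```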
Array Consumption
Array Data
Pattern 1 – Arrays of Arrays
sail_area = [
[160, 2500, 1600, 450],
[250, 3500, 2800, 575],
[120, 3500, 295]
]
     0     1     2    3
0  160  2500  1600  450
1  250  3500  2800  575
2  120  3500   295  NaN
Resulting Data Frame
pd.DataFrame(sail_area)
-or-
data.frame(sail_area)
pd.DataFrame(list(db.sail_area.find()))
-or-
data.frame(connection$find())
Array Data
Pattern 2
[
{"lateen mizzen": 160, "main topsail": 2500, "mainsail": 1600, "main topgallant": 450},
{"lateen mizzen": 250, "main topsail": 3500, "mainsail": 2800, "main topgallant": 575},
{"lateen mizzen": 120, "mainsail": 3500, "jib": 450}
]
   _id                       jib    lateen mizzen  main topsail  mainsail  main topgallant
0  5cf5607b984ae2b600701613  NaN    160.0          2500.0        1600.0    450.0
1  5cf5609f984ae2b600701614  NaN    250.0          3500.0        2800.0    575.0
2  5cf560d4984ae2b600701615  450.0  120.0          NaN           3500.0    NaN
Resulting Data Frame
Array Data
Pattern 3
[
{"area": [160, 2500, 1600, 450]},
{"area": [250, 3500, 2800, 575]},
{"area": [120, 3500, 295]}
]
   _id                       area
0  5cf57574984ae2b600701623  [160.0, 2500.0, 1600.0, 450.0]
1  5cf57584984ae2b600701624  [250.0, 3500.0, 2800.0, 575.0]
2  5cf5758f984ae2b600701625  [120.0, 3500.0, 295.0]
Resulting Data Frame
Are our hopes lost?
Moving Data from MongoDB Arrays
Array Data
[
{"name": "Batavia", "area": [160, 2500, 1600, 450]},
{"name": "Götheborg", "area": [250, 3500, 2800, 575]},
{"name": "HMS Endeavor", "area": [120, 3500, 295]}
]
library(mongolite)
connection <- mongo(collection = "sails",
                    db = "ships",
                    url = "mongodb://localhost")
sails <- data.frame(connection$find())
Working with MongoDB Arrays
import pandas as pd
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client.ships

# Collect each ship's sail-area array as one row
values = []
for ship in db.sailareas.find():
    values.append(ship['area'])
print(pd.DataFrame(values))
Array Data
[
{"name": "Batavia", "area": [160, 2500, 1600, 450]},
{"name": "Götheborg", "area": [250, 3500, 2800, 575]},
{"name": "HMS Endeavor", "area": [120, 3500, 295]}
]
Resulting Data Frame
       0       1       2      3
0  160.0  2500.0  1600.0  450.0
1  250.0  3500.0  2800.0  575.0
2  120.0  3500.0   295.0    NaN
library(mongolite)
connection <- mongo(collection = "sails",
                    db = "ships",
                    url = "mongodb://localhost")
sails <- data.frame(connection$find())
Working with MongoDB Arrays
import pandas as pd
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client.ships

values = []
seriesLabels = []
for ship in db.sailareas.find():
    values.append(ship['area'])
    # Use the ship name as the row label
    seriesLabels.append(ship['name'])
print(pd.DataFrame(values, index=seriesLabels))
Array Data
[
{"name": "Batavia", "area": [160, 2500, 1600, 450]},
{"name": "Götheborg", "area": [250, 3500, 2800, 575]},
{"name": "HMS Endeavor", "area": [120, 3500, 295]}
]
Resulting Data Frame
                  0       1       2      3
Batavia       160.0  2500.0  1600.0  450.0
Götheborg     250.0  3500.0  2800.0  575.0
HMS Endeavor  120.0  3500.0   295.0    NaN
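The trailing NaN appears because the arrays have different lengths and every row of a data frame needs the same number of cells. A minimal sketch of that padding step, without pandas or a database, might look like this; the in-memory `ships` list stands in for the query result and `None` stands in for NaN.

```python
ships = [
    {"name": "Batavia", "area": [160, 2500, 1600, 450]},
    {"name": "Götheborg", "area": [250, 3500, 2800, 575]},
    {"name": "HMS Endeavor", "area": [120, 3500, 295]},
]

# The widest array decides how many columns the table needs.
width = max(len(ship["area"]) for ship in ships)

# Pad shorter arrays on the right and label each row with the ship name.
table = {
    ship["name"]: ship["area"] + [None] * (width - len(ship["area"]))
    for ship in ships
}

for name, row in table.items():
    print(name, row)
```

Note that positional columns only line up meaningfully if every array lists its sails in the same order.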
The Power of MongoDB
Aggregation Framework
Aggregation Framework
• Pre-filter and/or pre-aggregate data on the server before
moving it across the network.
• Reduces the amount of data in the data frame.
• Improves performance.
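As a rough illustration of the second bullet, the sketch below compares the serialized size of two full ship documents with the size of the summaries a server-side aggregation could return instead. JSON length is only a proxy for what actually travels over the wire, and the field names (`country_of_origin`, `sail_areas`, `total_area`) are assumptions borrowed from the pipeline slides.

```python
import json

# Two ships whose sail-area arrays are given in full in the sample data.
full_documents = [
    {"name": "HMS Endeavor", "country_of_origin": "GBR",
     "sail_areas": [1060, 2089, 1101, 420, 2320, 2245]},
    {"name": "Kruzenshtern", "country_of_origin": "DEU",
     "sail_areas": [1476, 1352, 2383, 1100, 1807, 448, 2415]},
]

# What a pre-aggregated result might look like: one number per ship.
summaries = [
    {"name": doc["name"], "total_area": sum(doc["sail_areas"])}
    for doc in full_documents
]

full_size = len(json.dumps(full_documents))
summary_size = len(json.dumps(summaries))
print(full_size, summary_size)
```

The gap widens as the arrays grow, since the summary stays one value per document.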
Sample Data
              Country  Year Completed  Displacement  Individual Sail Areas (sq. ft)
Batavia       NLD      1628            1200          [292, 2012, 990, 550, 403, 642, 1056, ...]
Cutty Sark    GBR      1869            2100          [2408, 866, 155, 2041, 518, 1675, …]
Götheborg     SWE      1738            NaN           [315, 614, 314, 2451, 2096, 2477, …]
HMS Endeavor  GBR      1764            NaN           [1060, 2089, 1101, 420, 2320, 2245]
Kruzenshtern  DEU      1926            3064          [1476, 1352, 2383, 1100, 1807, 448, 2415]
HMS Victory   GBR      1765            3500          [1310, 2445, 1327, 1668, 2098, 2179, …]
from datetime import datetime, timezone
values = []
seriesLabels = []
for ship in db.ships.aggregate([
{
'$match': {
'year_completed': {
'$gte': datetime(1571, 1, 1, 0, 0, 0, tzinfo=timezone.utc)
}
}
},
Aggregation Pipeline 1
{
'$match': {
'year_completed': {
'$lt': datetime(1862, 1, 1, 0, 0, 0, tzinfo=timezone.utc)
}
}
}, {
'$match': {
'country_of_origin': {
'$ne': 'USA'
}
}
},
Aggregation Pipeline 2
{
'$project': {
'name': 1,
'country_of_origin': 1,
'sail_areas': 1,
'total_sails': {
'$cond': {
'if': {
'$isArray': '$sail_areas'
},
'then': {
'$size': '$sail_areas'
},
'else': 'NA'
}
}}},
Aggregation Pipeline 3
{
'$project': {
'name': 1,
'country_of_origin': 1,
'total_sails': 1,
'total_area': {
'$sum': '$sail_areas'
}
}
}
]):
    values.append(ship['total_area'])
    seriesLabels.append(ship['name'])
dataframe = pd.DataFrame(values, index=seriesLabels)
Aggregation Pipeline 4
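For readers who want the whole pipeline in one place, here is one possible consolidated form as a plain Python list. Merging the three `$match` stages and the two `$project` stages is a rearrangement made for this sketch, not the layout shown on the slides, and it assumes the array field is named `sail_areas` and that `year_completed` is stored as a date.

```python
from datetime import datetime, timezone

pipeline = [
    {
        # One $match with all three conditions from the slides.
        "$match": {
            "year_completed": {
                "$gte": datetime(1571, 1, 1, tzinfo=timezone.utc),
                "$lt": datetime(1862, 1, 1, tzinfo=timezone.utc),
            },
            "country_of_origin": {"$ne": "USA"},
        }
    },
    {
        # One $project computing both derived fields.
        "$project": {
            "name": 1,
            "country_of_origin": 1,
            "total_sails": {
                "$cond": {
                    "if": {"$isArray": "$sail_areas"},
                    "then": {"$size": "$sail_areas"},
                    "else": "NA",
                }
            },
            "total_area": {"$sum": "$sail_areas"},
        }
    },
]

# Usage, assuming `db` is a pymongo Database as on the earlier slides:
#   for ship in db.ships.aggregate(pipeline):
#       print(ship["name"], ship["total_area"])
print(len(pipeline))
```

Keeping the pipeline in a variable also makes it easy to inspect or reuse before sending it to the server.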
Results
                     0
La Amistad       19335
Batavia          31246
Götheborg        18464
HMS Endeavour     9235
Golden Hind      11749
Grand Turk        8405
Kalmar Nyckel     3710
Lady Nelson      24938
Pallada          20785
Shtandart         9363
HMS Sultana      35061
HMS Surprise     26272
HMS Trincomalee  21744
HMS Victory      14740
Other Sessions
Today
1:00pm Real-time Clinical Decision Support System – Prem Timisina & Arash Kia
2:00 Analytics with MongoDB – Stuart Shiell & Mark Clancy
2:00 A Complete Methodology to Data Modeling for MongoDB – Daniel Coupal
3:15 Unleash the Power of the MongoDB Aggregation Framework – Abhishek Bagga
Tomorrow
9:00am Best Practices for Working with IoT and Time-series Data – Robert Walters
3:00pm MongoDB in Data Science – Vigen Sahakyan
Takeaways
MongoDB's flexible data model is very powerful for data analytics.
Some analytic tools require a more structured approach.
The schema design you choose when modeling your data can have a huge
impact on analytics.
Use MongoDB's Aggregation Framework to improve performance.
Thank You!
Ken W. Alger - @kenwalger
MongoDB World 2019: Exploring your MongoDB Data with Pirates (R) and Snakes (Python)

MongoDB World 2019: Exploring your MongoDB Data with Pirates (R) and Snakes (Python)

  • 1. Ken W. Alger – Developer Advocate, MongoDB Exploring your MongoDB Data with Pirates (R) and Snakes (Python) @kenwalger
  • 3. Overview § The Document Model § Data Frames § R vs. Python § MongoDB to Data Frames § Array Consumption § The Power of MongoDB
  • 5. Document Model Features Naturally maps objects to code using JSON Represent data of any structure. Our data model is very flexible. Strongly typed for ease of processing. We support over twenty binary encoded JSON data types.
  • 6. Document Model for Analytics Flexibility helps with feature engineering by allowing for experimentation and the picking of features iteratively. For Deep Learning the flexibility allows for faster iteration. Pre-filtering of data with aggregation framework.
  • 7. Flexibility is Great if You're a Python or a Pirate…
  • 11. Sail Data { "Name": "Götheborg", "Year Completed": 1738, "Sail Area": { "Lateen mizzen": 160, "Mizzen topsail": 2500, … }, … } { "Name": "Batavia", "Year Completed": 1628, "Sail Area": { "Lateen mizzen": 250, ”Main topsail": 3500, … }, … }
  • 12. …but some tasks require data that's rigidly structured.
  • 14. Data Frame A data frame is a list of vectors, factors, and/or matrices all having the same length (number of rows in the case of matrices). Used for storing data tables.
  • 15. Data Frame Data Length (ft) Year Completed Displacement Sail Area (sq. ft) Batavia 186 1628 1200 33000 Cutty Sark 280 1869 2100 32000 Götheborg 190 1738 NaN 21140 HMS Endeavor 97.75 1764 NaN 29889 Kruzenshtern 375 1926 3064 NaN HMS Victory 227.5 1765 3500 58556
  • 16. Distributed Data Frames Big data distributed across clusters. Use tools like:
  • 18. R Data Frame R dataframe is more or less built into the language. More functional than Python. More statistical support in general.
  • 19. Python Data Frame More object-oriented. Relies on packages (pandas, numpy, scikit- learn) As a language it’s great for additional tasks along side of analytics.
  • 21. MongoDB to Data Frames
  • 22. Sail Data { "Lateen mizzen": 120, "Mizzen topsail": 400, "Mainsail": 2500, … } { "Lateen mizzen": 120, "Main topgallant": 600, "Mainsail": 3500, … }
  • 23. library(mongolite) connection <- mongo(collection = "sails", db="ships", url="mongodb://localhost”) sails <- data.frame( connection$find() ) MongoDB Data to Data Frames import pandas as pd from pymongo import MongoClient client = MongoClient('localhost', 27017) db = client.ships df = pd.DataFrame(list (db.sails.find()) )
  • 24. Data Frame Lateen mizzen Main topgallant Mainsail Mizzen topsail _id 0 120.0 NaN 2500 400.0 5cf53129984ae2b600701611 1 120.0 600.0 3500 NaN 5cf53141984ae2b600701612 Lateen mizzen Mizzen topsail Mainsail Main topgallant 1 120 400 2500 NA 2 120 NA 3500 600
  • 26. Array Data Pattern 1 – Arrays of Arrays sail_area = [ [160, 2500, 1600, 450], [250, 3500, 2800, 575], [120, 3500, 295] ] 0 1 2 3 0 160 2500 1600 450 1 250 3500 2800 575 2 120 3500 295 NaN Resulting Data Frame pd.DataFrame(sail_area) -or- data.frame(sail_area) pd.DataFrame(list(db.sail_area.find())) -or- data.frame(connection$find())
  • 27. Array Data Pattern 2 [ {"lateen mizzen": 160, "mail topsail": 2500, "mainsail": 1600, "main topgallant": 450}, {"lateen mizzen": 250, "mail topsail": 3500, "mainsail": 2800, "main topgallant": 575}, {"lateen mizzen": 120, "mailsail": 3500, "jib": 450}, ] _id jib lateen mizzen mail topsail mainsail main topgallant 0 5cf5607b984ae2b600701613 NaN 160.0 2500.0 1600.0 450.0 1 5cf5609f984ae2b600701614 NaN 250.0 3500.0 2800.0 575.0 2 5cf560d4984ae2b600701615 450.0 120.0 NaN 3500.0 NaN Resulting Data Frame
  • 28. Array Data Pattern 3
      [
        {"area": [160, 2500, 1600, 450]},
        {"area": [250, 3500, 2800, 575]},
        {"area": [120, 3500, 295]}
      ]
      Resulting Data Frame
                                _id                            area
        0  5cf57574984ae2b600701623  [160.0, 2500.0, 1600.0, 450.0]
        1  5cf57584984ae2b600701624  [250.0, 3500.0, 2800.0, 575.0]
        2  5cf5758f984ae2b600701625          [120.0, 3500.0, 295.0]
  • 29. Are our hopes lost?
  • 30. Moving Data from MongoDB Arrays
  • 31. Array Data [ {"name": "Batavia", "area": [160, 2500, 1600, 450]}, {"name": "Götheborg", "area": [250, 3500, 2800, 575]}, {"name": "HMS Endeavor", "area": [120, 3500, 295]} ]
  • 32. Working with MongoDB Arrays
      R:
        library(mongolite)
        connection <- mongo(collection = "sails", db = "ships", url = "mongodb://localhost")
        sails <- data.frame(connection$find())
      Python:
        import pandas as pd
        from pymongo import MongoClient
        client = MongoClient('localhost', 27017)
        db = client.ships
        # Collect each ship's sail-area array as one row of the data frame.
        values = []
        for ship in db.sailareas.find():
            values.append(ship['area'])
        print(pd.DataFrame(values))
  • 33. Array Data
      [
        {"name": "Batavia", "area": [160, 2500, 1600, 450]},
        {"name": "Götheborg", "area": [250, 3500, 2800, 575]},
        {"name": "HMS Endeavor", "area": [120, 3500, 295]}
      ]
      Resulting Data Frame
               0       1       2      3
        0  160.0  2500.0  1600.0  450.0
        1  250.0  3500.0  2800.0  575.0
        2  120.0  3500.0   295.0    NaN
  • 34. Working with MongoDB Arrays
      R:
        library(mongolite)
        connection <- mongo(collection = "sails", db = "ships", url = "mongodb://localhost")
        sails <- data.frame(connection$find())
      Python:
        import pandas as pd
        from pymongo import MongoClient
        client = MongoClient('localhost', 27017)
        db = client.ships
        # Collect the sail-area arrays as rows and the ship names as row labels.
        values = []
        seriesLabels = []
        for ship in db.sailareas.find():
            values.append(ship['area'])
            seriesLabels.append(ship['name'])
        print(pd.DataFrame(values, index=seriesLabels))
  • 35. Array Data
      [
        {"name": "Batavia", "area": [160, 2500, 1600, 450]},
        {"name": "Götheborg", "area": [250, 3500, 2800, 575]},
        {"name": "HMS Endeavor", "area": [120, 3500, 295]}
      ]
      Resulting Data Frame
                          0       1       2      3
        Batavia       160.0  2500.0  1600.0  450.0
        Götheborg     250.0  3500.0  2800.0  575.0
        HMS Endeavor  120.0  3500.0   295.0    NaN
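The loop on the preceding slides can also be written with pandas alone once the documents are in memory. The following is a sketch under assumptions rather than the speaker's method: the list `documents` stands in for `list(db.sailareas.find())`, and the names `raw`, `areas`, and `totals` are hypothetical. It starts from the Pattern 3 shape (an `area` column holding lists) and expands it into one column per sail, labeled by ship name.

```python
import pandas as pd

# Hypothetical stand-in for list(db.sailareas.find()), using the slide's sample data.
documents = [
    {"name": "Batavia", "area": [160, 2500, 1600, 450]},
    {"name": "Götheborg", "area": [250, 3500, 2800, 575]},
    {"name": "HMS Endeavor", "area": [120, 3500, 295]},
]

# Pattern 3 shape: the 'area' column holds whole Python lists.
raw = pd.DataFrame(documents)

# Expand each list into its own columns; shorter lists are padded with NaN.
areas = pd.DataFrame(raw["area"].tolist(), index=raw["name"])

# Row-wise total sail area per ship (NaN entries are skipped by default).
totals = areas.sum(axis=1)

print(areas)
print(totals)
```

Keeping the ship names as the index means later operations, such as the row sums here, stay labeled by ship without a separate list of labels.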
  • 37. Aggregation Framework • Pre-filter and/or pre-aggregate data on the server before moving it across the network. • Reduces the amount of data in the data frame. • Improves performance.
  • 38. Sample Data
                      Country  Year Completed  Displacement  Individual Sail Areas (sq. ft)
        Batavia       NLD      1628            1200          [292, 2012, 990, 550, 403, 642, 1056, ...]
        Cutty Sark    GBR      1869            2100          [2408, 866, 155, 2041, 518, 1675, …]
        Götheborg     SWE      1738            NaN           [315, 614, 314, 2451, 2096, 2477, …]
        HMS Endeavor  GBR      1764            NaN           [1060, 2089, 1101, 420, 2320, 2245]
        Kruzenshtern  DEU      1926            3064          [1476, 1352, 2383, 1100, 1807, 448, 2415]
        HMS Victory   GBR      1765            3500          [1310, 2445, 1327, 1668, 2098, 2179, …]
  • 39. Aggregation Pipeline 1
      from datetime import datetime, timezone
      values = []
      seriesLabels = []
      for ship in db.ships.aggregate([
          { '$match': { 'year_completed': { '$gte': datetime(1571, 1, 1, 0, 0, 0, tzinfo=timezone.utc) } } },
  • 40. Aggregation Pipeline 2
          { '$match': { 'year_completed': { '$lt': datetime(1862, 1, 1, 0, 0, 0, tzinfo=timezone.utc) } } },
          { '$match': { 'country_of_origin': { '$ne': 'USA' } } },
  • 41. Aggregation Pipeline 3
          { '$project': {
              'name': 1,
              'country_of_origin': 1,
              'sail_areas': 1,
              'total_sails': { '$cond': { 'if': { '$isArray': '$sail_areas' }, 'then': { '$size': '$sail_areas' }, 'else': 'NA' } }
          }},
  • 42. Aggregation Pipeline 4
          { '$project': { 'name': 1, 'country_of_origin': 1, 'total_sails': 1, 'total_area': { '$sum': '$sail_areas' } } }
      ]):
          # Each result document carries the computed total, not the raw array.
          values.append(ship['total_area'])
          seriesLabels.append(ship['name'])
      dataframe = pd.DataFrame(values, index=seriesLabels)
  • 43. Results
                             0
        La Amistad       19335
        Batavia          31246
        Götheborg        18464
        HMS Endeavour     9235
        Golden Hind      11749
        Grand Turk        8405
        Kalmar Nyckel     3710
        Lady Nelson      24938
        Pallada          20785
        Shtandart         9363
        HMS Sultana      35061
        HMS Surprise     26272
        HMS Trincomalee  21744
        HMS Victory      14740
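The pipeline spread across the four slides above can be assembled as an ordinary Python list before it is handed to `aggregate`. The sketch below is one possible arrangement, not the version from the talk: the helper name `build_pipeline` and its parameters are hypothetical, the three `$match` stages are merged into one, and the two `$project` stages are merged into one. The field names (`year_completed`, `country_of_origin`, `sail_areas`) are taken from the slides.

```python
from datetime import datetime, timezone


def build_pipeline(start, end, excluded_country):
    """Return an aggregation pipeline as a list of stage documents.

    Hypothetical helper: keeps ships completed in [start, end) that are not
    from excluded_country, then computes per-ship sail count and total area.
    """
    return [
        # A single $match combining the date range and the country filter.
        {"$match": {
            "year_completed": {"$gte": start, "$lt": end},
            "country_of_origin": {"$ne": excluded_country},
        }},
        # A single $project computing both derived fields from sail_areas.
        {"$project": {
            "name": 1,
            "country_of_origin": 1,
            "total_sails": {"$cond": {
                "if": {"$isArray": "$sail_areas"},
                "then": {"$size": "$sail_areas"},
                "else": "NA",
            }},
            "total_area": {"$sum": "$sail_areas"},
        }},
    ]


pipeline = build_pipeline(
    datetime(1571, 1, 1, tzinfo=timezone.utc),
    datetime(1862, 1, 1, tzinfo=timezone.utc),
    "USA",
)
print(pipeline)
```

With a live connection, the list would be passed as `db.ships.aggregate(pipeline)` and the returned documents read through their `name` and `total_area` fields, as in the loop on the slides.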
  • 44. Other Sessions
      Today
        1:00pm Real-time Clinical Decision Support System – Prem Timisina & Arash Kia
        2:00 Analytics with MongoDB – Stuart Shiell & Mark Clancy
        2:00 A Complete Methodology to Data Modeling for MongoDB – Daniel Coupal
        3:15 Unleash the Power of the MongoDB Aggregation Framework – Abhishek Bagga
      Tomorrow
        9:00am Best Practices for Working with IoT and Time-series Data – Robert Walters
        3:00pm MongoDB in Data Science – Vigen Sahakyan
  • 45. Takeaways MongoDB's flexible data model is very powerful for data analytics. Some analytic tools require more rigidly structured data. The schema design you choose when shaping your data can have a huge impact on analytics. Use MongoDB's Aggregation Framework to pre-filter and pre-aggregate data on the server and improve performance.
  • 46. Thank You! Ken W. Alger - @kenwalger