Managing Data Analytics and Text Mining with MongoDB
Dan Sullivan, Principal
DS Applied Technologies
NoSQL Matters 2015
Dublin, Ireland
June 4, 2015
My Background
 Data Architect / Engineer
 NoSQL and relational data
modeler
 Big data
 Analytics, machine learning
and text mining
 Cloud computing

 Author
 NoSQL for Mere Mortals
 Contributor to TechTarget
 SearchDataManagement
 SearchCloudComputing
 SearchAWS
Overview
 Quick Intro to Data and Text Mining
 Need for Data Management in Data and Text
Mining
 Relational or NoSQL?
 Document Database Design Patterns
 MongoDB (Document Database) Model
 Questions
*
* 3 Key Components
* Data
* Representation scheme
* Algorithms
* Data
* Positive examples – Examples from representative
corpus
* Negative examples – Randomly selected from same
publications
* Representation
* Feature Vector
* Distributed Neural Network
* Algorithms - Supervised learning
* SVMs
* Ridge Classifier
* Perceptrons
* kNN
* SGD Classifier
* Naïve Bayes
* Random Forest
* AdaBoost
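To make the "feature vector" representation concrete, here is a minimal bag-of-words sketch in plain Python. The function names and the sample sentences are illustrative assumptions, not taken from the talk; any of the supervised learners listed above could consume vectors of this shape.

```python
from collections import Counter

def build_vocabulary(texts):
    """Map each distinct lowercase token to a fixed column index."""
    tokens = sorted({tok for text in texts for tok in text.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}

def to_feature_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())
    # Dicts preserve insertion order, so columns follow the sorted vocabulary.
    return [counts.get(tok, 0) for tok in vocabulary]

# Hypothetical positive and negative examples (label 1 = relevant, 0 = not).
examples = [("greek debt talks stall", 1), ("local team wins final", 0)]
vocabulary = build_vocabulary(text for text, _ in examples)
training_set = [(to_feature_vector(text, vocabulary), label) for text, label in examples]
```

Terms outside the vocabulary are simply ignored here, which is one common choice among several.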
*
*
Source: Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python:
Analyzing Text with Natural Language Toolkit. http://www.nltk.org/book/
*
[Figure: learning curves plotting error rate (0 to 0.5) against training instances (0 to 10,000), with training error and validation error series; panel label "All"]
[Figure: news articles labeled with extracted topic terms: "Debt, Law, Graduation"; "Debt, EU, Greece, Euro"; "EU, Greece, Negotiations, Varoufakis"]
Source: http://www.nytimes.com/pages/business/index.html April 27, 2015
*
*Large volumes of
accessible and relevant
texts:
*Social media
*Email
*Patents and research
*Customer
communications
* Use Cases
*Market research
*Brand monitoring
*e-Discovery
*Intellectual property
management
Manual procedures are time
consuming and costly
Volume of literature continues
to grow
Commonly used search
techniques, such as keyword,
similarity searching, metadata
filtering, etc. can still yield
volumes of literature that are
difficult to analyze manually
Some success with popular tools
but limitations
*
* Collect
* Data
* Documents
* Extract and Pre-processing
* Normalization
* Data Cleansing
* Case conversion
* Punctuation removal
* Stemming
* Analysis
* Classification Models
* Predictive Analytics
* Term Frequency – Inverse Document Frequency
* Conditional Probabilities and Topic Models
* Error Evaluation
* Integration
* Link to Structured Data
* Deploy predictive models
* Utilization
* Improve information retrieval
* Identify brand perception problems
* Assess likelihood of customer churn
* Predict likelihood of …
[Diagram: Collect → Extract & Pre-Process → Analyze → Integrate → Utilize]
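The pre-processing and TF-IDF steps in the pipeline can be sketched as follows. This is a minimal illustration under stated assumptions: the stemmer is a crude placeholder that only strips a trailing "s", and the weighting uses one common TF-IDF variant, tf * log(N / df); the talk does not specify either.

```python
import math
import string
from collections import Counter

def preprocess(text):
    """Case conversion, punctuation removal, and a placeholder for stemming."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    # Placeholder stemmer (assumption): strips one trailing "s" from longer words.
    # A real project would use a proper stemmer such as Porter's.
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]

def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [preprocess(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append(
            {t: (c / len(tokens)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return weights
```

Terms that occur in every document get a weight of zero under this variant, which is why very common words drop out of the ranking.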
*
*Experiments
*Type
*Data sets
*Algorithms
*Type
*Hyper-parameters
*Implementation software
*Results
*Model generation
*Error evaluation
* Raw Data
* Pre-processing steps
Image: http://content.timesjobs.com/data-mining-specialist-will-lead-demand-bpo-sector/
*
 Pragmatic
 Widely applicable
 Many options
 Modeling
 Reduce risk of data
anomalies.
 Separate logical
and physical
models
Features
 JSON/XML structures
 Fields vary between docs
 No predefined schema
 Documents analogous to
rows
 Collections analogous to
tables
 Query capabilities
Limitations
No joins
No referential integrity
checks
Object-based query language
{
id : <value>,
<key> : <value>,
<key> : <embedded document>,
<key> : <array>
}
Schema-less <> Model-less
 Schema-less Document
Databases
 No fixed schema
 Polymorphic documents

 ...however, not a Design
Free-for-All
 Queries drive organization
 Performance Considerations
 Long-term Maintenance

 Middle Ground: Data
Model Patterns
 Reusable methods for
organizing data
 Model is implicit in
document structures
Relational:
Requirements known at start
of project
Entities described by common
attributes
Compliance and audit issues
Need normalization
Acceptable performance on
small number of servers
Need server side joins

Key value:
Caching
Few attributes
Document databases:
Varying attributes
Integrate diverse data
types
Use denormalized
data
key1 : value1
key2 : value2
key3 : value3
{
id : <value>,
<key> : <value>,
<key> : <embedded document>,
<key> : <array>
}
*
Pattern 1: One-to-Many
 Embed Documents
 Multiple documents
embedded
 “Many” attributes stored
with “One” document
 Pros
 Single fetch returns
primary and related data
 Might improve
performance
 Simplifies application
code
 Cons
 Increases document size
 Might degrade
performance
{
OrderID: 1837373,
customer: { Name: 'Jane Lox',
Addr: '123 Main St',
City: 'Boston',
State: 'MA' },
orderItems: [
{ Sku: 38383838, Descr: 'Black chair' },
{ Sku: 2872636, Descr: 'Glass desk' },
{ Sku: 4747433, Descr: 'USB Drive 32GB' }
]
}
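A rough Python rendering of the embedded order shows the "single fetch" advantage. It assumes the order items are held in one array (called `orderItems` here, a name chosen for this sketch) and uses a plain dict in place of a database read; with pymongo, the dict would typically come from a single `find_one` call.

```python
# Stand-in for one document returned by a single read.
order = {
    "OrderID": 1837373,
    "customer": {"Name": "Jane Lox", "Addr": "123 Main St",
                 "City": "Boston", "State": "MA"},
    "orderItems": [
        {"Sku": 38383838, "Descr": "Black chair"},
        {"Sku": 2872636, "Descr": "Glass desk"},
        {"Sku": 4747433, "Descr": "USB Drive 32GB"},
    ],
}

def describe_order(doc):
    """Build a summary from the one document; no second lookup is needed."""
    items = ", ".join(item["Descr"] for item in doc["orderItems"])
    return f"Order {doc['OrderID']} for {doc['customer']['Name']}: {items}"

print(describe_order(order))
```

The trade-off noted above still applies: every item added to the array makes the order document larger.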
One-to-Many Considerations
 Query attributes in
embedded documents?
 Support for indexing
embedded documents?
 Potential for arbitrary
growth after record
created?
 Need for atomic writes?
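On the first two questions: MongoDB addresses fields inside embedded documents with dot notation (for example a filter such as `{"customer.City": "Boston"}`), and indexes can be declared on such paths. The sketch below only imitates that lookup on plain dicts so it runs without a server; `get_path` and `matches` are hypothetical helpers, not MongoDB or pymongo functions, and they do not cover arrays.

```python
def get_path(doc, dotted_path):
    """Follow a dotted path such as 'customer.City' through nested dicts."""
    current = doc
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None  # path does not exist in this document
        current = current[key]
    return current

def matches(doc, criteria):
    """True when every dotted path in criteria equals the expected value."""
    return all(get_path(doc, path) == expected for path, expected in criteria.items())

order = {"OrderID": 1837373, "customer": {"Name": "Jane Lox", "City": "Boston"}}
in_boston = matches(order, {"customer.City": "Boston"})
```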
Pattern 2: Many-to-Many
Employees
({empID: 1783,
fname: "Michelle",
lname: "Jones",
projects: [487, 973, 287]},
{empID: 9872,
fname: "Bob",
lname: "Williams",
projects: [487, 973, 121]})
Projects
({projID: 121,
projName: 'NoSQL Pilot',
team: [9872, 2431]},
{projID: 487,
projName: 'Customer Churn Analysis',
team: [1783, 9872]})
References
 Minimizes redundancy
 Preserves integrity
 Reduces document growth
 Requires multiple reads
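The "requires multiple reads" cost of references can be seen in a small in-memory sketch. The two dicts stand in for the Employees and Projects collections, keyed by ID; the data is a trimmed, illustrative subset, and each dict lookup represents what would be a separate database read.

```python
# Stand-ins for two collections, keyed by their IDs (illustrative subset).
employees = {
    1783: {"fname": "Michelle", "lname": "Jones", "projects": [487]},
    9872: {"fname": "Bob", "lname": "Williams", "projects": [487, 121]},
}
projects = {
    121: {"projName": "NoSQL Pilot", "team": [9872]},
    487: {"projName": "Customer Churn Analysis", "team": [1783, 9872]},
}

def project_names_for(emp_id):
    """Resolve an employee's project references into project names."""
    employee = employees[emp_id]  # first read: the employee document
    # Further reads: one lookup per referenced project.
    return [projects[pid]["projName"] for pid in employee["projects"] if pid in projects]

print(project_names_for(9872))
```

Because each project name is stored once, renaming a project needs a single update, which is the integrity benefit listed above.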
Pattern 2: Many-to-Many
Employee
{empID: 1783,
fname: "Michelle",
lname: "Jones",
projects: [
{projID: 121,
projName: 'NoSQL Pilot'},
{projID: 487,
projName: 'Customer Churn Analysis'}
]}
Project
{projID: 121,
projName: 'NoSQL Pilot',
team: [
{empID: 1783,
fname: "Michelle",
lname: "Jones"},
{empID: 9872,
fname: "Bob",
lname: "Williams"}
]}
Embedded Documents
 Captures point in time data
 One document read retrieves
data
 Increases document growth
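"Captures point in time data" means the embedded copy keeps the values it had when it was written. A minimal sketch, assuming a hypothetical later rename of the project (the new name is invented for illustration):

```python
import copy

project = {"projID": 121, "projName": "NoSQL Pilot"}

def embed_project(employee, project_doc):
    """Store a snapshot of the project inside the employee document."""
    employee.setdefault("projects", []).append(copy.deepcopy(project_doc))
    return employee

employee = embed_project({"empID": 1783, "fname": "Michelle", "lname": "Jones"}, project)

# Hypothetical later change to the source project document.
project["projName"] = "NoSQL Rollout"

# The embedded copy still holds the name from the time it was embedded.
embedded_name = employee["projects"][0]["projName"]
```

Whether the stale name is a feature (an audit trail) or a bug (inconsistent data) depends on the queries the application needs to answer.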
Many-to-Many Considerations
 References
 Minimizes redundancy
 Preserves integrity
 Reduces document growth
 Requires multiple reads
 Embedded Documents
 Captures point in time data
 One document read retrieves
data
 Increases document growth
Pattern 3: Trees with Parent & Child
References
 Trees
 Single root
document
 At most one parent
 No cycles
 Multiple Types
 Is-A
 Part-of
Pattern 3: Trees with References
Children Refs.
({orgUnitID: 178,
orgUnitType: "Primary",
orgUnitName: "P1",
children: [179, 180]},
{orgUnitID: 179,
orgUnitType: "Branch",
orgUnitName: "B1",
children: [181, 182]},
{orgUnitID: 180,
orgUnitType: "Branch",
orgUnitName: "B2",
children: [183, 184]})
Parent Refs.
({orgUnitID: 178,
orgUnitType: "Primary",
orgUnitName: "P1",
parent: 177},
{orgUnitID: 179,
orgUnitType: "Branch",
orgUnitName: "B1",
parent: 178},
{orgUnitID: 180,
orgUnitType: "Branch",
orgUnitName: "B2",
parent: 178})
Tree Considerations
 Child references allow for
top-down navigation
 Parent references allow for
bottom-up navigation
 Combining both allows for
bottom-up and top-down
navigation
 Avoid large arrays
 Consider need for point in
time data
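The two navigation directions can be sketched with an in-memory stand-in for the org-unit collection. The tree below is a simplified, illustrative variant of the slide data that stores both `parent` and `children`, so both traversals work.

```python
# Stand-in for a collection keyed by orgUnitID (simplified sample data).
org_units = {
    178: {"orgUnitName": "P1", "parent": None, "children": [179, 180]},
    179: {"orgUnitName": "B1", "parent": 178, "children": [181]},
    180: {"orgUnitName": "B2", "parent": 178, "children": []},
    181: {"orgUnitName": "B1a", "parent": 179, "children": []},
}

def descendants(unit_id):
    """Top-down navigation: follow child references in pre-order."""
    found = []
    stack = list(reversed(org_units[unit_id]["children"]))
    while stack:
        current = stack.pop()
        found.append(current)
        stack.extend(reversed(org_units[current]["children"]))
    return found

def ancestors(unit_id):
    """Bottom-up navigation: follow parent references up to the root."""
    path = []
    parent = org_units[unit_id]["parent"]
    while parent is not None:
        path.append(parent)
        parent = org_units[parent]["parent"]
    return path
```

Each step in either loop corresponds to one document read, so deep trees mean many reads whichever reference style is used.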
Anti-Patterns
 Large arrays
 Significant growth in
document size
 Fetching more data than
needed
 Fear of data duplication
 Thinking SQL, using
NoSQL
 Normalizing without need
*
*
[Diagram: Corpus 1:M Experiment Corpus 1:M Experiment]
*
Corpus : {
corpus_id : ObjectID,
name : string,
descr : string,
create_date : date,
version : string,
contents : [ { id, text } ],
contents_uri : string
}
Experiment_Corpus : {
exp_corpus_id : ObjectID,
name : string,
type : string,
corpus_id : ObjectID,
descr_stats : {
count : integer,
min_len : integer,
max_len : integer,
mean_len : integer,
std_dev : float },
pre_process_opers : {
lowercase : boolean,
nopunct : boolean,
stem : boolean,
normal : boolean
},
contents : [ { id, text } ],
contents_uri : string
}
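One way the `descr_stats` sub-document might be filled in is sketched below. It assumes that "length" means the number of whitespace-separated tokens per text and that the population standard deviation is wanted; the slides do not say which definitions are intended, and the sample texts are invented.

```python
import statistics

def describe_contents(contents):
    """Compute descr_stats for a list of {id, text} entries (token counts assumed)."""
    lengths = [len(item["text"].split()) for item in contents]
    return {
        "count": len(lengths),
        "min_len": min(lengths),
        "max_len": max(lengths),
        "mean_len": round(statistics.mean(lengths)),  # schema lists an integer
        "std_dev": statistics.pstdev(lengths),
    }

contents = [
    {"id": 1, "text": "greek debt"},
    {"id": 2, "text": "eu talks continue today"},
]

# Hypothetical Experiment_Corpus document; the ID fields are omitted here
# because they would normally be ObjectIDs assigned by the database.
experiment_corpus = {
    "name": "business-news-lowercased",
    "type": "training",
    "descr_stats": describe_contents(contents),
    "pre_process_opers": {"lowercase": True, "nopunct": True,
                          "stem": False, "normal": False},
    "contents": contents,
}
```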
*
Experiment : {
exp_id : ObjectID,
type : string,
exp_corpus_id : ObjectID,
algorithm : {
type : string,
hyperparams : [ { param, val } ],
implementation : [
{ software : string,
version : string,
code_uri : string } ]
},
model_file : string,
results : [ { metric, val } ],
model_gen_log : string,
error_evaluation : [
{ training_size,
training_error,
validation_error } ]
}
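Storing `error_evaluation` as an array of learning-curve points makes simple follow-up questions easy to answer from a single document. The sketch below uses made-up placeholder numbers (not results from the talk) and hypothetical helper names.

```python
# Hypothetical Experiment document; all values are placeholders for illustration.
experiment = {
    "type": "classification",
    "algorithm": {"type": "SVM", "hyperparams": [{"param": "C", "val": 1.0}]},
    "error_evaluation": [
        {"training_size": 2000, "training_error": 0.05, "validation_error": 0.30},
        {"training_size": 6000, "training_error": 0.10, "validation_error": 0.20},
        {"training_size": 10000, "training_error": 0.12, "validation_error": 0.15},
    ],
}

def best_training_size(exp):
    """Training size whose learning-curve point has the lowest validation error."""
    best = min(exp["error_evaluation"], key=lambda p: p["validation_error"])
    return best["training_size"]

def generalization_gaps(exp):
    """Validation error minus training error at each training size."""
    return {p["training_size"]: p["validation_error"] - p["training_error"]
            for p in exp["error_evaluation"]}
```

A shrinking gap as training size grows is the pattern the learning-curve slide illustrates.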
*
* Data and text mining processes are multi-faceted
* Well suited to advantages of document
database models
*Design patterns provide building blocks of
models
* Query patterns determine choice among
patterns
Questions?

Editor's Notes

  1. Free, high-quality RDBMSs are available, e.g. MySQL and PostgreSQL, along with many commercial options. There is a mature set of tools, such as IDEs for database developers, and many resources and best practices. From a more theoretical perspective, the relational model reduces the risk of data anomalies (i.e. insert, delete, and update anomalies). It also separates the logical model (what we see as database users) from the physical model (e.g. how data is actually stored on disk or other persistent storage media). There are some performance disadvantages due to the need for joins: gathering related information stored in separate tables, and therefore on different parts of the disk.
  2. JSON/BSON or XML storage