Dan Sullivan, Principal
DS Applied Technologies
NoSQL Matters 2015
Dublin, Ireland
June 4, 2015
Managing Data
Analytics and Text
Mining with
MongoDB
My Background
 Data Architect / Engineer
 NoSQL and relational data
modeler
 Big data
 Analytics, machine learning
and text mining
 Cloud computing

 Author
 NoSQL for Mere Mortals
 Contributor to TechTarget
 SearchDataManagement
 SearchCloudComputing
 SearchAWS
Overview
 Quick Intro to Data and Text Mining
 Need for Data Management in Data and Text
Mining
 Relational or NoSQL?
 Document Database Design Patterns
 MongoDB (Document Database) Model
 Questions
*
* 3 Key Components
* Data
* Representation scheme
* Algorithms
* Data
* Positive examples – Examples from representative
corpus
* Negative examples – Randomly selected from same
publications
* Representation
* Feature Vector
* Distributed Neural Network
* Algorithms - Supervised learning
* SVMs
* Ridge Classifier
* Perceptrons
* kNN
* SGD Classifier
* Naïve Bayes
* Random Forest
* AdaBoost
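As a rough illustration of the three components above (data, representation scheme, algorithm), the following sketch turns a few made-up positive and negative example sentences into bag-of-words feature vectors and trains a perceptron on them. The sentences, function names, and epoch count are assumptions for illustration and are not taken from the talk.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def build_vocabulary(texts):
    """Map every distinct token to a column index in the feature vector."""
    terms = sorted({token for text in texts for token in tokenize(text)})
    return {term: index for index, term in enumerate(terms)}


def to_feature_vector(text, vocab):
    """Represent a text as term counts over the vocabulary."""
    vector = [0] * len(vocab)
    for token, count in Counter(tokenize(text)).items():
        if token in vocab:
            vector[vocab[token]] = count
    return vector


def predict(vector, weights, bias):
    activation = sum(w * x for w, x in zip(weights, vector)) + bias
    return 1 if activation > 0 else -1


def train_perceptron(vectors, labels, epochs=10):
    """Classic perceptron rule: adjust weights only on misclassified examples."""
    weights, bias = [0.0] * len(vectors[0]), 0.0
    for _ in range(epochs):
        for vector, label in zip(vectors, labels):
            if predict(vector, weights, bias) != label:
                weights = [w + label * x for w, x in zip(weights, vector)]
                bias += label
    return weights, bias


# Hypothetical examples: +1 = from the representative corpus, -1 = random text.
texts = [
    "bacteria infect host cells",
    "toxin damages host tissue",
    "market prices rose today",
    "the team won the match",
]
labels = [1, 1, -1, -1]

vocab = build_vocabulary(texts)
vectors = [to_feature_vector(text, vocab) for text in texts]
weights, bias = train_perceptron(vectors, labels)
```

Any of the other listed algorithms could be swapped in for the perceptron; the feature vectors stay the same.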
*
*
Source: Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python:
Analyzing Text with Natural Language Toolkit. http://www.nltk.org/book/
*
[Figure: learning curve ("All"): Error Rate (0 to 0.5) vs. Training Instances (0 to 10,000), showing Training Error and Validation Error]
Debt, Law, Graduation
Debt, EU, Greece, Euro
EU, Greece, Negotiations, Varoufakis
Source: http://www.nytimes.com/pages/business/index.html April 27, 2015
*
*Large volumes of
accessible and relevant
texts:
*Social media
*Email
*Patents and research
*Customer
communications
* Use Cases
*Market research
*Brand monitoring
*e-Discovery
*Intellectual property
management
Manual procedures are time
consuming and costly
Volume of literature continues
to grow
Commonly used search
techniques, such as keyword,
similarity searching, metadata
filtering, etc. can still yield
volumes of literature that are
difficult to analyze manually
Some success with popular tools
but limitations
*
* Collect
* Data
* Documents
* Extract and Pre-processing
* Normalization
* Data Cleansing
* Case conversion
* Punctuation removal
* Stemming
* Analysis
* Classification Models
* Predictive Analytics
* Term Frequency – Inverse Document Frequency
* Conditional Probabilities and Topic Models
* Error Evaluation
* Integration
* Link to Structured Data
* Deploy predictive models
* Utilization
* Improve information retrieval
* Identify brand perception problems
* Assess likelihood of customer churn
* Predict likelihood of …
Collect -> Extract & Pre-Process -> Analyze -> Integrate -> Utilize
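The pre-processing steps listed above (case conversion, punctuation removal, stemming) and the TF-IDF analysis step can be sketched in plain Python as follows. The suffix-stripping `crude_stem` function is a simplistic stand-in for a real stemmer, and the sample sentences are invented; the flag names mirror the `pre_process_opers` fields that appear later in the deck.

```python
import math
import string
from collections import Counter


def crude_stem(token):
    """Strip a few common suffixes. A stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def preprocess(text, lowercase=True, nopunct=True, stem=True):
    """Apply the selected pre-processing operations and return tokens."""
    if lowercase:
        text = text.lower()
    if nopunct:
        text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    if stem:
        tokens = [crude_stem(token) for token in tokens]
    return tokens


def tf_idf(tokenized_docs):
    """Score each term as (term frequency) * log(N / document frequency)."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    scores = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Invented customer-communication snippets, for illustration only.
raw_docs = [
    "Customers are churning!",
    "Customers praised the brand.",
    "The brand launched.",
]
tokenized = [preprocess(doc) for doc in raw_docs]
scores = tf_idf(tokenized)
```

Terms that occur in fewer documents score higher, so in the first snippet "churn" outranks "customer", which also appears in the second.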
*
*Experiments
*Type
*Data sets
*Algorithms
*Type
*Hyper-parameters
*Implementation software
*Results
*Model generation
*Error evaluation
* Raw Data
* Pre-processing steps
Image: http://content.timesjobs.com/data-mining-specialist-will-lead-demand-bpo-sector/
*
 Pragmatic
 Widely applicable
 Many options
 Modeling
 Reduce risk of data
anomalies.
 Separate logical
and physical
models
Features
 JSON/XML structures
 Fields vary between docs
 No predefined schema
 Documents analogous to
rows
 Collections analogous to
tables
 Query capabilities
Limitations
No joins
No referential integrity
checks
Object-based query language
{
id : <value>,
<key> : <value>,
<key> : <embedded
document>,
<key> : <array>
}
Schema-less <> Model-less
 Schema-less Document
Databases
 No fixed schema
 Polymorphic documents

 ...however, not a Design
Free-for-All
 Queries drive organization
 Performance Considerations
 Long-term Maintenance

 Middle Ground: Data
Model Patterns
 Reusable methods for
organizing data
 Model is implicit in
document structures
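One way to read "schema-less is not model-less": the database accepts polymorphic documents, but the application still carries an implicit model that it can check. The sketch below is a hypothetical example of that idea; the document types and field names (`doc_type`, `handle`, `sender`, and so on) are assumptions chosen to match the text sources mentioned earlier, not a structure given in the talk.

```python
# Fields every document is expected to carry, regardless of type (assumed).
REQUIRED_FIELDS = {"doc_type", "text"}

# Extra fields allowed per document type (assumed for illustration).
FIELDS_BY_TYPE = {
    "tweet": {"handle"},
    "email": {"sender", "subject"},
    "patent": {"patent_no", "claims"},
}


def model_violations(doc):
    """Return a list of ways a document departs from the implicit model."""
    problems = [
        f"missing field: {name}"
        for name in sorted(REQUIRED_FIELDS - set(doc))
    ]
    doc_type = doc.get("doc_type")
    if doc_type is not None and doc_type not in FIELDS_BY_TYPE:
        problems.append(f"unknown doc_type: {doc_type}")
    else:
        allowed = REQUIRED_FIELDS | FIELDS_BY_TYPE.get(doc_type, set())
        problems += [
            f"unexpected field: {name}"
            for name in sorted(set(doc) - allowed)
        ]
    return problems


# Polymorphic documents that could live in one collection.
tweet = {"doc_type": "tweet", "text": "Great service!", "handle": "@example"}
email = {"doc_type": "email", "text": "Please cancel.", "sender": "a@example.com",
         "subject": "Account", "priority": "high"}
```

Here `tweet` conforms to the implicit model, while `email` is flagged for the unplanned `priority` field, even though a document database would store both without complaint.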
Relational:
Requirements known at start
of project
Entities described by common
attributes
Compliance and audit issues
Need normalization
Acceptable performance on
small number of servers
Need server side joins

Key value:
Caching
Few attributes
Document databases:
Varying attributes
Integrate diverse data
types
Use denormalized
data
key1 : value1
key2 : value2
key3 : value3
{
id : <value>,
<key> : <value>,
<key> : <embedded
document>,
<key> : <array>
}
*
Pattern 1: One-to-Many
 Embed Documents
 Multiple documents
embedded
 “Many” attributes stored
with “One” document
 Pros
 Single fetch returns
primary and related data
 Might improve
performance
 Simplifies application
code
 Cons
 Increases document size
 Might degrade
performance
{
OrderID: 1837373,
customer: { Name: 'Jane Lox',
Addr: '123 Main St',
City: 'Boston',
State: 'MA' },
orderItems: [
{ Sku: 38383838, Descr: 'Black chair' },
{ Sku: 2872636, Descr: 'Glass desk' },
{ Sku: 4747433, Descr: 'USB Drive 32GB' }
]
}
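A minimal sketch of the embedded one-to-many pattern, using a Python dictionary in place of a stored document. The order items are held in an array (named `orderItems` here as an assumption) so that each item keeps its own entry. The `get_path` helper is hypothetical; it loosely imitates the dot notation that MongoDB uses to address fields inside embedded documents and arrays, and is not the database's own implementation.

```python
order = {
    "OrderID": 1837373,
    "customer": {
        "Name": "Jane Lox",
        "Addr": "123 Main St",
        "City": "Boston",
        "State": "MA",
    },
    "orderItems": [
        {"Sku": 38383838, "Descr": "Black chair"},
        {"Sku": 2872636, "Descr": "Glass desk"},
        {"Sku": 4747433, "Descr": "USB Drive 32GB"},
    ],
}


def get_path(doc, dotted_path):
    """Collect the values reachable through a dotted path such as 'customer.City'.

    Lists are treated like embedded arrays: the path is applied to each element.
    """
    current = [doc]
    for key in dotted_path.split("."):
        next_level = []
        for node in current:
            candidates = node if isinstance(node, list) else [node]
            for candidate in candidates:
                if isinstance(candidate, dict) and key in candidate:
                    next_level.append(candidate[key])
        current = next_level
    return current


city = get_path(order, "customer.City")
skus = get_path(order, "orderItems.Sku")
```

Because the customer and the items travel with the order, one fetch of the order is enough to answer both lookups.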
One-to-Many Considerations
 Query attributes in
embedded documents?
 Support for indexing
embedded documents?
 Potential for arbitrary
growth after record
created?
 Need for atomic writes?
Pattern 2: Many-to-Many
Employees
({empID: 1783,
fname: "Michelle",
lname: "Jones",
projects: [487, 973, 287]},
{empID: 9872,
fname: "Bob",
lname: "Williams",
projects: [487, 973, 121]})
Projects
({projID: 121,
projName: 'NoSQL Pilot',
team: [9872, 2431]},
{projID: 487,
projName: 'Customer Churn Analysis',
team: [1783, 9872]})
References
 Minimizes redundancy
 Preserves integrity
 Reduces document growth
 Requires multiple reads
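The "requires multiple reads" trade-off of the reference approach can be made concrete with a small in-memory sketch. `InMemoryCollection` is a hypothetical stand-in for a database collection that simply counts document fetches; the sample documents are adapted from the slide, with field names normalized.

```python
class InMemoryCollection:
    """Hypothetical stand-in for a collection; counts how many reads occur."""

    def __init__(self, docs, key):
        self._docs = {doc[key]: doc for doc in docs}
        self.reads = 0

    def fetch(self, doc_id):
        self.reads += 1
        return self._docs.get(doc_id)


employees = InMemoryCollection(
    [
        {"empID": 1783, "fname": "Michelle", "lname": "Jones",
         "projects": [487, 973, 287]},
        {"empID": 9872, "fname": "Bob", "lname": "Williams",
         "projects": [487, 973, 121]},
    ],
    key="empID",
)
projects = InMemoryCollection(
    [
        {"projID": 121, "projName": "NoSQL Pilot", "team": [9872, 2431]},
        {"projID": 487, "projName": "Customer Churn Analysis",
         "team": [1783, 9872]},
    ],
    key="projID",
)


def team_names(proj_id):
    """Resolve a project's team: one read for the project, one per member."""
    project = projects.fetch(proj_id)
    if project is None:
        return []
    members = (employees.fetch(emp_id) for emp_id in project["team"])
    # A reference may point at a missing document; nothing enforces integrity.
    return [member["fname"] for member in members if member is not None]
```

Resolving the two-person team of project 487 costs three reads (one project, two employees). Project 121 references employee 2431, which is absent from this sample, so the application has to tolerate the dangling reference itself.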
Pattern 2: Many-to-Many
Employee
{empID: 1783,
fname: "Michelle",
lname: "Jones",
projects: [
{projID: 121,
projName: 'NoSQL Pilot'},
{projID: 487,
projName: 'Customer Churn Analysis'}
]}
Project
{projID: 121,
projName: 'NoSQL Pilot',
team: [
{empID: 1783,
fname: "Michelle",
lname: "Jones"},
{empID: 9872,
fname: "Bob",
lname: "Williams"}
]}
Embedded Documents
 Captures point in time data
 One document read retrieves
data
 Increases document growth
Many-to-Many Considerations
 References
 Minimizes redundancy
 Preserves integrity
 Reduces document growth
 Requires multiple reads
 Embedded Documents
 Captures point in time data
 One document read retrieves
data
 Increases document growth
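The "point in time" property of embedded documents can be shown with a short sketch: an embedded copy keeps the value it had when it was written unless the application updates every copy. The helper names and the new project name below are hypothetical.

```python
def embed_project(employee, project):
    """Copy the project's fields into the employee document."""
    employee.setdefault("projects", []).append(
        {"projID": project["projID"], "projName": project["projName"]}
    )


def rename_project(project, new_name, employees=()):
    """Rename a project and propagate the change to the given employees only."""
    project["projName"] = new_name
    for employee in employees:
        for embedded in employee.get("projects", []):
            if embedded["projID"] == project["projID"]:
                embedded["projName"] = new_name


project = {"projID": 121, "projName": "NoSQL Pilot"}
michelle = {"empID": 1783, "fname": "Michelle", "lname": "Jones"}
bob = {"empID": 9872, "fname": "Bob", "lname": "Williams"}

embed_project(michelle, project)
embed_project(bob, project)

# Only Bob's document is passed along, so Michelle's copy keeps the old name.
rename_project(project, "NoSQL Rollout", employees=[bob])
```

Whether the stale copy in Michelle's document is a bug or a useful historical record depends on the queries the application needs to answer.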
Pattern 3: Trees with Parent & Child
References
 Trees
 Single root
document
 At most one parent
 No cycles
 Multiple Types
 Is-A
 Part-of
Pattern 3: Trees with References
Children Refs.
({orgUnitID: 178,
orgUnitType: "Primary",
orgUnitName: "P1",
children: [179, 180]},
{orgUnitID: 179,
orgUnitType: "Branch",
orgUnitName: "B1",
children: [181, 182]},
{orgUnitID: 180,
orgUnitType: "Branch",
orgUnitName: "B2",
children: [183, 184]})
Parent Refs.
({orgUnitID: 178,
orgUnitType: "Primary",
orgUnitName: "P1",
parent: 177},
{orgUnitID: 179,
orgUnitType: "Branch",
orgUnitName: "B1",
parent: 178},
{orgUnitID: 180,
orgUnitType: "Branch",
orgUnitName: "B2",
parent: 178})
Tree Considerations
 Child references allow for
top-down navigation
 Parent references allow for
bottom-up navigation
 Combination allows for
bottom-up and top-down
navigation
 Avoid large arrays
 Consider need for point in
time data
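The navigation points above can be sketched with documents that carry both kinds of reference. The unit IDs follow the slide; the leaf units 181 to 184, their names, and the choice to give the root a parent of `None` are assumptions made so the example is self-contained.

```python
units = {
    178: {"orgUnitName": "P1", "parent": None, "children": [179, 180]},
    179: {"orgUnitName": "B1", "parent": 178, "children": [181, 182]},
    180: {"orgUnitName": "B2", "parent": 178, "children": [183, 184]},
}
# Hypothetical leaf units so that every child reference resolves.
for leaf_id, parent_id in [(181, 179), (182, 179), (183, 180), (184, 180)]:
    units[leaf_id] = {"orgUnitName": f"L{leaf_id}", "parent": parent_id,
                      "children": []}


def descendants(unit_id):
    """Top-down navigation: follow child references, depth first."""
    found = []
    stack = list(reversed(units[unit_id]["children"]))
    while stack:
        current = stack.pop()
        found.append(current)
        stack.extend(reversed(units[current]["children"]))
    return found


def ancestors(unit_id):
    """Bottom-up navigation: follow parent references until the root."""
    path = []
    parent = units[unit_id]["parent"]
    while parent is not None:
        path.append(parent)
        parent = units[parent]["parent"]
    return path
```

With only child references, `ancestors` would need a scan of every document; with only parent references, `descendants` would. Storing both avoids the scans at the cost of keeping two fields consistent on every change.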
Anti-Patterns
 Large arrays
 Significant growth in
document size
 Fetching more data than
needed
 Fear of data duplication
 Thinking SQL, using
NoSQL
 Normalizing without need
*
Corpus --1:M--> Experiment Corpus --1:M--> Experiment
*
Corpus : {
corpus_id : ObjectID,
name : string,
descr : string,
create_date : date,
version : string,
contents : [ { id, text } ],
contents_uri : string
}
Experiment_Corpus : {
exp_corpus_id : ObjectID,
name : string,
type : string,
corpus_id : ObjectID,
descr_stats : {
count : integer,
min_len : integer,
max_len : integer,
mean_len : integer,
std_dev : float },
pre_process_opers : {
lowercase : boolean,
nopunct : boolean,
stem : boolean,
normal : boolean
},
contents : [ { id, text } ],
contents_uri : string
}
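As a sketch of how an Experiment_Corpus document might be derived from a Corpus document, the function below applies the selected pre-processing operations and fills in `descr_stats`. Several things here are assumptions: lengths are measured in whitespace-separated tokens, only the `lowercase` and `nopunct` flags are applied (the `stem` and `normal` flags are stored but not acted on), the `type` value is made up, and IDs are plain strings rather than ObjectIDs.

```python
import statistics
import string


def apply_opers(text, opers):
    """Apply the pre-processing operations this sketch supports."""
    if opers.get("lowercase"):
        text = text.lower()
    if opers.get("nopunct"):
        text = text.translate(str.maketrans("", "", string.punctuation))
    return text


def make_experiment_corpus(corpus, name, opers):
    """Build an Experiment_Corpus-style dict from a non-empty Corpus dict."""
    contents = [
        {"id": item["id"], "text": apply_opers(item["text"], opers)}
        for item in corpus["contents"]
    ]
    lengths = [len(item["text"].split()) for item in contents]  # token counts
    return {
        "name": name,
        "type": "derived",  # hypothetical value
        "corpus_id": corpus["corpus_id"],  # reference back to the source corpus
        "descr_stats": {
            "count": len(lengths),
            "min_len": min(lengths),
            "max_len": max(lengths),
            "mean_len": round(statistics.mean(lengths)),
            "std_dev": statistics.pstdev(lengths),
        },
        "pre_process_opers": dict(opers),
        "contents": contents,
    }


# Invented sample corpus for illustration.
corpus = {
    "corpus_id": "corpus-001",
    "name": "business-headlines",
    "contents": [
        {"id": 1, "text": "Debt talks stall."},
        {"id": 2, "text": "Greece, EU resume negotiations today!"},
    ],
}
opers = {"lowercase": True, "nopunct": True, "stem": False, "normal": False}
exp_corpus = make_experiment_corpus(corpus, "headlines-lower-nopunct", opers)
```

Keeping `corpus_id` as a reference while embedding the processed `contents` follows the 1:M relationship between a corpus and its experiment corpora.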
*
Experiment : {
exp_id : ObjectID,
type : string,
exp_corpus_id : ObjectID,
algorithm : {
type : string,
hyperparams : [ { param, val } ],
implementation : [
{ software : string,
version : string,
code_uri : string } ]
},
model_file : string,
results : [ { metric, val } ],
model_gen_log : string,
error_evaluation : [
{ training_size,
training_error,
validation_error } ]
}
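The `error_evaluation` array holds the points of a learning curve, one entry per training size. A minimal sketch of building and querying such an Experiment document follows; the helper names, the algorithm and hyper-parameter values, and the error figures are all illustrative placeholders rather than results from the talk.

```python
def new_experiment(exp_corpus_id, algorithm_type, hyperparams):
    """Create an Experiment-style dict with an empty learning curve."""
    return {
        "type": "classification",  # placeholder value
        "exp_corpus_id": exp_corpus_id,
        "algorithm": {
            "type": algorithm_type,
            "hyperparams": [
                {"param": name, "val": value}
                for name, value in hyperparams.items()
            ],
        },
        "results": [],
        "error_evaluation": [],
    }


def record_evaluation(experiment, training_size, training_error,
                      validation_error):
    """Append one learning-curve point to the experiment document."""
    experiment["error_evaluation"].append({
        "training_size": training_size,
        "training_error": training_error,
        "validation_error": validation_error,
    })


def best_evaluation(experiment):
    """Return the point with the lowest validation error."""
    return min(experiment["error_evaluation"],
               key=lambda point: point["validation_error"])


def generalization_gaps(experiment):
    """Return (training_size, validation_error - training_error) pairs."""
    return [
        (p["training_size"], p["validation_error"] - p["training_error"])
        for p in experiment["error_evaluation"]
    ]


# Placeholder numbers, chosen only to exercise the functions.
experiment = new_experiment("exp-corpus-001", "SVM", {"C": 1.0})
record_evaluation(experiment, 1000, 0.05, 0.30)
record_evaluation(experiment, 5000, 0.10, 0.20)
record_evaluation(experiment, 10000, 0.12, 0.15)
```

Embedding the curve in the experiment document means a single read returns the algorithm settings together with the evidence for how the model behaved as training data grew.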
*
* Data and text mining processes are multifaceted
* Well suited to advantages of document database models
* Design patterns provide building blocks of models
* Query patterns determine choice among patterns
Questions?


Editor's Notes

  1. Free, high-quality RDBMSs are available, e.g. MySQL and PostgreSQL, along with many commercial options. There is a mature set of tools, such as IDEs for database developers, and many resources and best practices are available. From a more theoretical perspective, the relational model reduces the risk of data anomalies (i.e. insert, delete, and update anomalies). It also separates the logical model (what we see as database users) from the physical model (e.g. how data is actually stored on disk or other persistent storage media). There are some performance disadvantages due to the need for joins, which gather related information stored in separate tables and therefore on different parts of the disk.
  2. JSON/BSON or XML storage