Managing Data Analytics and Text Mining with MongoDB

Dan Sullivan, Principal
DS Applied Technologies
NoSQL Matters 2015
Dublin, Ireland
June 4, 2015

My Background
* Data Architect / Engineer
* NoSQL and relational data modeler
* Big data
* Analytics, machine learning and text mining
* Cloud computing

* Author
  * NoSQL for Mere Mortals
  * Contributor to TechTarget
  * SearchDataManagement
  * SearchCloudComputing
  * SearchAWS

Overview
* Quick Intro to Data and Text Mining
* Need for Data Management in Data and Text Mining
* Relational or NoSQL?
* Document Database Design Patterns
* MongoDB (Document Database) Model
* Questions

* 3 Key Components
  * Data
  * Representation scheme
  * Algorithms
* Data
  * Positive examples: examples from a representative corpus
  * Negative examples: randomly selected from the same publications
* Representation
  * Feature Vector
  * Distributed Neural Network
* Algorithms: supervised learning
  * SVMs
  * Ridge Classifier
  * Perceptrons
  * kNN
  * SGD Classifier
  * Naïve Bayes
  * Random Forest
  * AdaBoost
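As an illustration of the feature-vector representation listed above, here is a minimal bag-of-words sketch. The example texts, labels, and function names are hypothetical and not taken from the talk; real pipelines usually add the pre-processing and weighting steps covered later.

```python
from collections import Counter

def build_vocabulary(texts):
    """Map each distinct lowercase token to a column index, in sorted order."""
    tokens = sorted({tok for text in texts for tok in text.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}

def to_feature_vector(text, vocabulary):
    """Return term counts for `text`, aligned to the vocabulary's column order."""
    counts = Counter(text.lower().split())
    # Dicts preserve insertion order, so iteration follows the column indexes.
    return [counts.get(tok, 0) for tok in vocabulary]

# Hypothetical labeled examples: 1 = positive, 0 = negative.
examples = [("greek debt talks stall", 1), ("local team wins match", 0)]
vocab = build_vocabulary(text for text, _ in examples)
vectors = [to_feature_vector(text, vocab) for text, _ in examples]
labels = [label for _, label in examples]
```

Each vector has one column per vocabulary term, so the (vector, label) pairs could be handed to any of the supervised learners named on the slide.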

Source: Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. http://www.nltk.org/book/

[Figure: learning curve for "All"; x-axis: Training Instances (0 to 10000); y-axis: Error Rate (0 to 0.5); series: Training Error, Validation Error]

[Figure: business news page annotated with topic terms: "Debt, Law, Graduation"; "Debt, EU, Greece, Euro"; "EU, Greece, Negotiations, Varoufakis"]
Source: http://www.nytimes.com/pages/business/index.html April 27, 2015

* Large volumes of accessible and relevant texts:
  * Social media
  * Email
  * Patents and research
  * Customer communications
* Use Cases
  * Market research
  * Brand monitoring
  * e-Discovery
  * Intellectual property management

* Manual procedures are time consuming and costly
* Volume of literature continues to grow
* Commonly used search techniques, such as keyword search, similarity searching, and metadata filtering, can still yield volumes of literature that are difficult to analyze manually
* Some success with popular tools, but with limitations

* Collect
  * Data
  * Documents
* Extract and Pre-process
  * Normalization
  * Data Cleansing
  * Case conversion
  * Punctuation removal
  * Stemming
* Analysis
  * Classification Models
  * Predictive Analytics
  * Term Frequency – Inverse Document Frequency
  * Conditional Probabilities and Topic Models
  * Error Evaluation
* Integration
  * Link to Structured Data
  * Deploy predictive models
* Utilization
  * Improve information retrieval
  * Identify brand perception problems
  * Assess likelihood of customer churn
  * Predict likelihood of …
[Diagram: Collect, Extract & Pre-Process, Analyze, Integrate, Utilize]
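The pre-processing and TF-IDF steps in the pipeline above can be sketched as follows. This is a minimal illustration under stated assumptions: it covers case conversion and punctuation removal only (no stemming), uses the plain tf * log(N / df) weighting, and the sample documents and function names are hypothetical.

```python
import math
import string
from collections import Counter

def preprocess(text):
    """Lowercase, strip ASCII punctuation, and split on whitespace (no stemming)."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [preprocess(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical sample documents.
docs = ["Greece, EU resume debt talks.", "EU debt rules tighten!", "Brand monitoring report"]
scores = tf_idf(docs)
```

Terms that appear in fewer documents receive larger weights, which is why "greece" outranks "debt" in the first sample document.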

* Experiments
  * Type
  * Data sets
  * Algorithms
    * Type
    * Hyper-parameters
    * Implementation software
  * Results
    * Model generation
    * Error evaluation
* Raw Data
* Pre-processing steps
Image: http://content.timesjobs.com/data-mining-specialist-will-lead-demand-bpo-sector/

* Pragmatic
  * Widely applicable
  * Many options
* Modeling
  * Reduce risk of data anomalies
  * Separate logical and physical models

Features
* JSON/XML structures
* Fields vary between docs
* No predefined schema
* Documents analogous to rows
* Collections analogous to tables
* Query capabilities
Limitations
* No joins
* No referential integrity checks
* Object-based query language
{
  id : <value>,
  <key> : <value>,
  <key> : <embedded document>,
  <key> : <array>
}

Schema-less <> Model-less
* Schema-less Document Databases
  * No fixed schema
  * Polymorphic documents

* ...however, not a Design Free-for-All
  * Queries drive organization
  * Performance considerations
  * Long-term maintenance

* Middle Ground: Data Model Patterns
  * Reusable methods for organizing data
  * Model is implicit in document structures
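To make "polymorphic documents" concrete, here is a small sketch in which documents with different fields share one collection and a query simply skips documents that lack the queried field. The collection is a plain Python list standing in for a database, and the field names and `find` helper are hypothetical.

```python
# Hypothetical documents in one collection; field names are illustrative only.
texts = [
    {"id": 1, "type": "email", "sender": "a@example.com", "body": "Renewal question"},
    {"id": 2, "type": "patent", "title": "Glass desk hinge", "claims": ["claim 1", "claim 2"]},
    {"id": 3, "type": "email", "sender": "b@example.com", "body": "Cancel my plan"},
]

def find(collection, **criteria):
    """Return documents whose fields equal all given criteria; missing fields never match."""
    missing = object()  # sentinel that compares unequal to every real value
    return [
        doc for doc in collection
        if all(doc.get(key, missing) == value for key, value in criteria.items())
    ]

emails = find(texts, type="email")
patents_by_title = find(texts, title="Glass desk hinge")
```

No schema change is needed to store the patent next to the emails, but the queries an application runs still dictate which fields it should keep consistent.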

Relational:
* Requirements known at start of project
* Entities described by common attributes
* Compliance and audit issues
* Need normalization
* Acceptable performance on small number of servers
* Need server side joins

Key value:
* Caching
* Few attributes
[Diagram: key1 → value1, key2 → value2, key3 → value3]
Document databases:
* Varying attributes
* Integrate diverse data types
* Use denormalized data
{
  id : <value>,
  <key> : <value>,
  <key> : <embedded document>,
  <key> : <array>
}

Pattern 1: One-to-Many
* Embed Documents
  * Multiple documents embedded
  * "Many" attributes stored with "One" document
* Pros
  * Single fetch returns primary and related data
  * Might improve performance
  * Simplifies application code
* Cons
  * Increases document size
  * Might degrade performance
{
  OrderID: 1837373,
  customer: { Name: 'Jane Lox',
              Addr: '123 Main St',
              City: 'Boston',
              State: 'MA' },
  orderItem: [ { Sku: 38383838, Descr: 'Black chair' },
               { Descr: 'Glass desk' },
               { Descr: 'USB Drive 32GB' } ]
}
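The trade-off in the embedding pattern can be sketched with plain Python dicts standing in for a collection. The single lookup returns the customer and the items together, while each added item makes the document larger. The JSON length is only a rough proxy for stored size, and the extra "Desk lamp" item is hypothetical.

```python
import json

order = {
    "OrderID": 1837373,
    "customer": {"Name": "Jane Lox", "Addr": "123 Main St", "City": "Boston", "State": "MA"},
    "orderItem": [{"Descr": "Black chair"}, {"Descr": "Glass desk"}, {"Descr": "USB Drive 32GB"}],
}
orders = {order["OrderID"]: order}  # in-memory stand-in for an orders collection

def fetch_order(order_id):
    """One lookup returns the order together with its embedded customer and items."""
    return orders[order_id]

def serialized_size(doc):
    """Approximate document size as the length of its JSON encoding, in bytes."""
    return len(json.dumps(doc).encode("utf-8"))

size_before = serialized_size(order)
fetched = fetch_order(1837373)
descriptions = [item["Descr"] for item in fetched["orderItem"]]

# Hypothetical extra item: embedding it grows the parent document.
fetched["orderItem"].append({"Descr": "Desk lamp"})
size_after = serialized_size(order)
```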

One-to-Many Considerations
* Query attributes in embedded documents?
* Support for indexing embedded documents?
* Potential for arbitrary growth after record created?
* Need for atomic writes?

Pattern 2: Many-to-Many
Employees
({empID: 1783,
  fname: "Michelle",
  lname: "Jones",
  projects: [487, 973, 287]},
 {empID: 9872,
  fname: "Bob",
  lname: "Williams",
  projects: [487, 973, 121]})
Projects
({projID: 121,
  projName: 'NoSQL Pilot',
  team: [9872, 2431]},
 {projID: 487,
  projName: 'Customer Churn Analysis',
  team: [1783, 9872]})
References
* Minimizes redundancy
* Preserves integrity
* Reduces document growth
* Requires multiple reads
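The "requires multiple reads" cost of references can be sketched as follows, again with dicts standing in for the two collections. The data is a trimmed, hypothetical subset of the slide's example so that every reference resolves, and the read counter is only an illustration of round trips.

```python
employees = {
    1783: {"empID": 1783, "fname": "Michelle", "lname": "Jones", "projects": [487]},
    9872: {"empID": 9872, "fname": "Bob", "lname": "Williams", "projects": [487, 121]},
}
projects = {
    121: {"projID": 121, "projName": "NoSQL Pilot", "team": [9872]},
    487: {"projID": 487, "projName": "Customer Churn Analysis", "team": [1783, 9872]},
}

def project_names_for(emp_id):
    """Resolve an employee's project references; returns (names, number_of_reads)."""
    reads = 1  # first read: the employee document
    employee = employees[emp_id]
    names = []
    for proj_id in employee["projects"]:
        reads += 1  # one additional read per referenced project
        names.append(projects[proj_id]["projName"])
    return names, reads

names, reads = project_names_for(9872)
```

Each project name is stored once, so a rename touches a single document; the price is the extra lookups, which a real database could batch into one query.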

Pattern 2: Many-to-Many
Employee
{empID: 1783,
  fname: "Michelle",
  lname: "Jones",
  projects: [
    {projID: 121,
     projName: 'NoSQL Pilot'},
    {projID: 487,
     projName: 'Customer Churn Analysis'}
  ]}
Project
{projID: 121,
  projName: 'NoSQL Pilot',
  team: [
    {empID: 1783,
     fname: "Michelle",
     lname: "Jones"},
    {empID: 9872,
     fname: "Bob",
     lname: "Williams"}
  ]}
Embedded Documents
* Captures point in time data
* One document read retrieves data
* Increases document growth
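"Captures point in time data" means an embedded copy does not follow later changes to the original. A minimal sketch, assuming a hypothetical rename of the project after it was embedded in the employee document:

```python
import copy

project = {"projID": 121, "projName": "NoSQL Pilot"}
employee = {
    "empID": 1783,
    "fname": "Michelle",
    "lname": "Jones",
    # Embed a copy of the project as it looked at assignment time.
    "projects": [copy.deepcopy(project)],
}

# Hypothetical later rename in the project's own document.
project["projName"] = "NoSQL Rollout"

def stale_projects(emp, current_projects):
    """Return projIDs whose embedded name differs from the current project name."""
    return [
        p["projID"] for p in emp["projects"]
        if current_projects[p["projID"]]["projName"] != p["projName"]
    ]

stale = stale_projects(employee, {121: project})
```

Whether the stale copy is a bug or a feature depends on the query: a historical report wants the old name, while a current roster would need every embedded copy updated.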

Many-to-Many Considerations
* References
  * Minimizes redundancy
  * Preserves integrity
  * Reduces document growth
  * Requires multiple reads
* Embedded Documents
  * Captures point in time data
  * One document read retrieves data
  * Increases document growth

Pattern 3: Trees with Parent & Child References
* Trees
  * Single root document
  * At most one parent
  * No cycles
* Multiple Types
  * Is-A
  * Part-of

Pattern 3: Trees with References
Children Refs.
({orgUnitID: 178,
  orgUnitType: "Primary",
  orgUnitName: "P1",
  children: [179, 180]},
 {orgUnitID: 179,
  orgUnitType: "Branch",
  orgUnitName: "B1",
  children: [181, 182]},
 {orgUnitID: 180,
  children: [183, 184]})
Parent Refs.
({orgUnitID: 178,
  orgUnitType: "Primary",
  orgUnitName: "P1",
  parent: 177},
 {orgUnitID: 179,
  parent: 178},
 {orgUnitID: 180,
  parent: 178})

Tree Considerations
* Child references allow for top-down navigation
* Parent references allow for bottom-up navigation
* A combination allows for both bottom-up and top-down navigation
* Avoid large arrays
* Consider need for point in time data
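Both navigation directions can be sketched over documents that carry parent and child references together. The tree below is a hypothetical, closed variant of the slide's org-unit example (unit 178 is made the root and the leaves have no children), held in a dict that stands in for a collection.

```python
org_units = {
    178: {"orgUnitID": 178, "parent": None, "children": [179, 180]},
    179: {"orgUnitID": 179, "parent": 178, "children": [181, 182]},
    180: {"orgUnitID": 180, "parent": 178, "children": []},
    181: {"orgUnitID": 181, "parent": 179, "children": []},
    182: {"orgUnitID": 182, "parent": 179, "children": []},
}

def descendants(unit_id):
    """Top-down: follow child references depth-first and return all descendant IDs."""
    found = []
    for child_id in org_units[unit_id]["children"]:
        found.append(child_id)
        found.extend(descendants(child_id))
    return found

def ancestors(unit_id):
    """Bottom-up: follow parent references to the root and return the path upward."""
    path = []
    parent_id = org_units[unit_id]["parent"]
    while parent_id is not None:
        path.append(parent_id)
        parent_id = org_units[parent_id]["parent"]
    return path
```

With only child references, `ancestors` would need a scan of every document; with only parent references, `descendants` would. Storing both costs an extra write whenever a unit moves.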

Anti-Patterns
* Large arrays
* Significant growth in document size
* Fetching more data than needed
* Fear of data duplication
* Thinking SQL, using NoSQL
* Normalizing without need

[Diagram: Corpus 1:M Experiment Corpus; Experiment Corpus 1:M Experiment]

Corpus : {
  corpus_id : ObjectID,
  name : string,
  descr : string,
  create_date : date,
  version : string,
  contents : [ { id, text } ],
  contents_uri : string
}
Experiment_Corpus : {
  exp_corpus_id : ObjectID,
  name : string,
  type : string,
  corpus_id : ObjectID,
  descr_stats : {
    count : integer,
    min_len : integer,
    max_len : integer,
    mean_len : integer,
    std_dev : float },
  pre_process_opers : {
    lowercase : boolean,
    nopunct : boolean,
    stem : boolean,
    normal : boolean
  },
  contents : [ { id, text } ],
  contents_uri : string
}
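One way an Experiment_Corpus document might be derived from a Corpus document is sketched below. The assumptions are loud ones: lengths are measured in characters (the slide does not say), only the lowercase and nopunct operations are implemented, the corpus is non-empty, and the sample corpus and the `make_experiment_corpus` helper are hypothetical.

```python
import statistics
import string

def make_experiment_corpus(corpus, name, lowercase=True, nopunct=True):
    """Derive an Experiment_Corpus-style dict from a non-empty Corpus-style dict."""
    contents = []
    for item in corpus["contents"]:
        text = item["text"]
        if lowercase:
            text = text.lower()
        if nopunct:
            text = text.translate(str.maketrans("", "", string.punctuation))
        contents.append({"id": item["id"], "text": text})
    lengths = [len(c["text"]) for c in contents]  # assumption: length in characters
    return {
        "name": name,
        "corpus_id": corpus["corpus_id"],
        "descr_stats": {
            "count": len(lengths),
            "min_len": min(lengths),
            "max_len": max(lengths),
            "mean_len": round(statistics.mean(lengths)),
            "std_dev": statistics.pstdev(lengths),
        },
        # Record which operations were applied; stem and normal are not implemented here.
        "pre_process_opers": {"lowercase": lowercase, "nopunct": nopunct,
                              "stem": False, "normal": False},
        "contents": contents,
    }

# Hypothetical two-text corpus.
corpus = {"corpus_id": "c1", "name": "demo",
          "contents": [{"id": 1, "text": "Debt, EU!"}, {"id": 2, "text": "Greek"}]}
exp_corpus = make_experiment_corpus(corpus, "demo-lower-nopunct")
```

Storing the applied operations next to the derived texts keeps each experiment reproducible from the document alone.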

Experiment : {
  exp_id : ObjectID,
  type : string,
  exp_corpus_id : ObjectID,
  algorithm : {
    type : string,
    hyperparams : [ { param, val } ],
    implementation : [
      { software : string,
        version : string,
        code_uri : string } ]
  },
  model_file : string,
  results : [ { metric, val } ],
  model_gen_log : string,
  error_evaluation : [
    { training_size,
      training_error,
      validation_error } ]
}
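Because each Experiment document embeds its error_evaluation points, experiments can be compared without touching any other collection. The sketch below picks the experiment with the lowest validation error at its largest training size. All error values are made-up placeholders for illustration, not results from the talk, and the helper names are hypothetical.

```python
# Placeholder documents; the error values are illustrative, not measured results.
experiments = [
    {"exp_id": "e1", "algorithm": {"type": "SVM"},
     "error_evaluation": [
         {"training_size": 2000, "training_error": 0.05, "validation_error": 0.30},
         {"training_size": 10000, "training_error": 0.10, "validation_error": 0.20}]},
    {"exp_id": "e2", "algorithm": {"type": "Naive Bayes"},
     "error_evaluation": [
         {"training_size": 2000, "training_error": 0.15, "validation_error": 0.28},
         {"training_size": 10000, "training_error": 0.18, "validation_error": 0.25}]},
]

def final_point(exp):
    """Return the error_evaluation entry with the largest training_size."""
    return max(exp["error_evaluation"], key=lambda p: p["training_size"])

def best_experiment(exps):
    """Pick the experiment with the lowest validation error at its largest training size."""
    return min(exps, key=lambda e: final_point(e)["validation_error"])

best = best_experiment(experiments)
# Gap between validation and training error at the largest training size.
gap = final_point(best)["validation_error"] - final_point(best)["training_error"]
```

The same embedded array holds the points needed to redraw a learning curve for any stored experiment.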

* Data and text mining processes are multi-faceted
* Well suited to advantages of document database models
* Design patterns provide building blocks of models
* Query patterns determine choice among patterns
