Data analysis is an exploratory process that requires a variety of tools and a flexible data store. Data analysis projects are easy to start but quickly become difficult to manage and error-prone when they depend on file-based data storage. Relational databases are poorly equipped to accommodate the dynamic demands of complex analysis. This talk describes best practices for using MongoDB for analytics projects. Examples will be drawn from a large-scale text mining project (approximately 25 million documents) that applies machine learning (neural networks and support vector machines) and statistical analysis. Tools discussed include R, Spark, the Python scientific stack, and custom pre-processing scripts, but the focus is on using these with the document database.
Managing Data Analytics and Text Mining with MongoDB
1. Dan Sullivan, Principal
DS Applied Technologies
NoSQL Matters 2015
Dublin, Ireland
June 4, 2015
Managing Data Analytics and Text Mining with MongoDB
3. My Background
Data Architect / Engineer
NoSQL and relational data modeler
Big data
Analytics, machine learning, and text mining
Cloud computing
Author
NoSQL for Mere Mortals
Contributor to TechTarget
SearchDataManagement
SearchCloudComputing
SearchAWS
4. Overview
Quick Intro to Data and Text Mining
Need for Data Management in Data and Text Mining
Relational or NoSQL?
Document Database Design Patterns
MongoDB (Document Database) Model
Questions
6. * 3 Key Components
* Data
* Representation scheme
* Algorithms
* Data
* Positive examples – examples from a representative corpus
* Negative examples – randomly selected from the same publications
* Representation
* Feature vector
* Distributed neural network
* Algorithms – supervised learning (see the sketch below)
* SVMs
* Ridge Classifier
* Perceptrons
* kNN
* SGD Classifier
* Naïve Bayes
* Random Forest
* AdaBoost
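A minimal sketch of this setup using scikit-learn, assuming TF-IDF feature vectors and placeholder training data; the corpus, labels, and classifier parameters are illustrative, not taken from the talk:

# Sketch: fit several of the classifiers named above on TF-IDF
# feature vectors. The two-document corpus is a stand-in for the
# real positive/negative examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier, SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

docs = ["text of a positive example", "text of a negative example"]
labels = [1, 0]  # 1 = positive example, 0 = randomly selected negative

# Represent each document as a TF-IDF feature vector
X = TfidfVectorizer().fit_transform(docs)

for clf in (LinearSVC(), RidgeClassifier(), SGDClassifier(), MultinomialNB()):
    clf.fit(X, labels)
    print(type(clf).__name__, clf.score(X, labels))

In practice the positive and negative examples would come from the roughly 25-million-document corpus described in the abstract, with a held-out set for error evaluation.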
7. Source: Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. http://www.nltk.org/book/
10. * Large volumes of accessible and relevant texts:
* Social media
* Email
* Patents and research
* Customer communications
* Use cases:
* Market research
* Brand monitoring
* e-Discovery
* Intellectual property management
11. Manual procedures are time-consuming and costly
Volume of literature continues to grow
Commonly used search techniques, such as keyword search, similarity searching, and metadata filtering, can still yield volumes of literature that are difficult to analyze manually
Some success with popular tools, but with limitations
12. * Collect
* Data
* Documents
* Extract and pre-process (sketch below)
* Normalization
* Data cleansing
* Case conversion
* Punctuation removal
* Stemming
* Analyze
* Classification models
* Predictive analytics
* Term frequency–inverse document frequency (TF-IDF)
* Conditional probabilities and topic models
* Error evaluation
* Integrate
* Link to structured data
* Deploy predictive models
* Utilize
* Improve information retrieval
* Identify brand perception problems
* Assess likelihood of customer churn
* Predict likelihood of …
[Pipeline diagram: Collect → Extract & Pre-Process → Analyze → Integrate → Utilize]
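A hedged sketch of the normalization steps named above, using NLTK (which the deck cites on slide 7); the Porter stemmer and word_tokenize are my assumptions, not tools named in the talk:

# Sketch: case conversion, punctuation removal, and stemming.
# Requires NLTK and its 'punkt' tokenizer data (nltk.download('punkt')).
import string
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()

def preprocess(text):
    text = text.lower()                                                # case conversion
    text = text.translate(str.maketrans("", "", string.punctuation))   # punctuation removal
    return [stemmer.stem(tok) for tok in word_tokenize(text)]          # stemming

print(preprocess("Classifiers PREDICT churn, brand perception, and more!"))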
15. Pragmatic
Widely applicable
Many options
Modeling
Reduces risk of data anomalies
Separates logical and physical models
16. Features
JSON/XML structures
Fields vary between documents
No predefined schema
Documents analogous to rows
Collections analogous to tables
Query capabilities
Limitations
No joins
No referential integrity checks
Object-based query language

{
  _id : <value>,
  <key> : <value>,
  <key> : <embedded document>,
  <key> : <array>
}
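A brief pymongo sketch of this model, assuming a local MongoDB instance and illustrative database and collection names: documents with different fields coexist in one collection, and any field is queryable.

from pymongo import MongoClient

col = MongoClient()["textmining"]["documents"]  # assumed names

# No predefined schema: documents with different fields share a collection
col.insert_one({"title": "Doc A", "terms": ["mongodb", "nosql"]})
col.insert_one({"title": "Doc B", "source": "patent", "year": 2015})

# Query by any field, including array members
for doc in col.find({"terms": "mongodb"}):
    print(doc["title"])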
17. Schema-less ≠ Model-less
Schema-less document databases:
No fixed schema
Polymorphic documents
...however, not a design free-for-all:
Queries drive organization
Performance considerations
Long-term maintenance
Middle ground: data model patterns
Reusable methods for organizing data
Model is implicit in document structures
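As one illustration of queries driving physical organization, a pymongo sketch (field names are assumptions) that indexes the fields expected queries filter and sort on:

from pymongo import MongoClient

col = MongoClient()["textmining"]["documents"]

# Queries drive organization: index the fields that the application's
# queries filter and sort on (field names here are assumed)
col.create_index([("source", 1), ("year", -1)])
print(col.index_information())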
18. Relational:
Requirements known at start of project
Entities described by common attributes
Compliance and audit issues
Need normalization
Acceptable performance on a small number of servers
Need server-side joins
19. Key-value:
Caching
Few attributes
Document databases:
Varying attributes
Integrate diverse data types
Use denormalized data

[Key-value diagram: key1 → value1, key2 → value2, key3 → value3]

{
  _id : <value>,
  <key> : <value>,
  <key> : <embedded document>,
  <key> : <array>
}
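A small illustration of the contrast, using made-up data: a key-value store sees only an opaque value per key, while a document database holds denormalized documents that integrate diverse data types.

# Key-value store: an opaque value retrieved by key (illustrative)
cache = {"session:42": '{"user": "analyst1", "ttl": 3600}'}

# Document database: one denormalized document integrating diverse data
article = {
    "_id": 1,
    "title": "Text Mining with MongoDB",
    "authors": [{"name": "Dan Sullivan"}],            # embedded documents
    "terms": ["mongodb", "text mining"],              # array attribute
    "source": {"conference": "NoSQL Matters", "year": 2015},
}
print(article["source"]["year"])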
22. One-to-Many Considerations
Query attributes in embedded documents?
Support for indexing embedded documents?
Potential for arbitrary growth after record is created?
Need for atomic writes?
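A pymongo sketch of one answer to these questions, embedding the many side; the collection and field names are assumptions. Embedded documents are queryable and indexable with dot notation, and a write to a single document is atomic.

from pymongo import MongoClient

docs = MongoClient()["textmining"]["articles"]

# Embedding the many side: an article and its extracted terms live in
# one document, so a write touching both is atomic
docs.insert_one({
    "title": "Gene expression in yeast",
    "terms": [{"term": "gene", "tfidf": 0.42}, {"term": "yeast", "tfidf": 0.31}],
})

# Embedded fields are queryable and indexable with dot notation
docs.create_index("terms.term")
print(docs.find_one({"terms.term": "yeast"})["title"])

Embedding fits poorly when the array can grow arbitrarily after the record is created; in that case a reference pattern is safer.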
28. Tree Considerations
Child references allow for top-down navigation
Parent references allow for bottom-up navigation
A combination allows for both bottom-up and top-down navigation
Avoid large arrays
Consider need for point-in-time data
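A sketch of combined parent and child references on a small, assumed topic hierarchy (collection and field names are illustrative):

from pymongo import MongoClient

topics = MongoClient()["textmining"]["topics"]

# Parent references (bottom-up) plus child-reference arrays (top-down)
topics.insert_many([
    {"_id": "science", "parent": None, "children": ["biology"]},
    {"_id": "biology", "parent": "science", "children": ["genomics"]},
    {"_id": "genomics", "parent": "biology", "children": []},
])

# Bottom-up: walk parent references from a leaf to the root
node = topics.find_one({"_id": "genomics"})
while node["parent"] is not None:
    node = topics.find_one({"_id": node["parent"]})
print(node["_id"])  # -> science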
29. Anti-Patterns
Large arrays
Significant growth in document size
Fetching more data than needed (example below)
Fear of data duplication
Thinking SQL, using NoSQL
Normalizing without need
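For instance, "fetching more data than needed" is avoided with projections; a pymongo sketch with assumed collection and field names:

from pymongo import MongoClient

col = MongoClient()["textmining"]["documents"]
col.insert_one({"title": "Doc A", "body": "large full text here", "year": 2015})

# Projection returns only title and year, not the large body field
print(col.find_one({"year": 2015}, {"title": 1, "year": 1, "_id": 0}))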
34. * Data and text mining processes are multi-faceted
* Well suited to the advantages of document database models
* Design patterns provide the building blocks of models
* Query patterns determine the choice among patterns
Free, high-quality RDBMSs are available (e.g., MySQL, PostgreSQL), along with many commercial options.
Mature set of tools, such as IDEs for database developers; many resources and best practices are available.
From a more theoretical perspective, the relational model reduces the risk of data anomalies (insert, delete, and update anomalies).
It also separates the logical model (what we see as database users) from the physical model (e.g., how data is actually stored on disk or other persistent storage media).
Some performance disadvantages follow from the need for joins: gathering related information stored in separate tables, and therefore on different parts of the disk.