5. A Typical “Big Data” Analytics Setup
Data Aggregation and Analytics Applications
Commodity Linux Platforms and/or High Performance Computing Clusters
Repositories: RDBMS, Column Store, Data W/H, Hadoop, Graph DB, Object DB, Doc DB, K-V Store
Data: Structured, Semi-Structured, Unstructured
7. Not Only SQL – a group of 4 primary technologies
•
Users choose between four different primary technologies for different
purposes:
–
Key-Value Stores
–
“Big Table” Clones
–
Document Databases
–
Object and Graph databases (including InfiniteGraph)
•
Many implementations trade strong consistency (ACID transactions) for
performance, offering eventual consistency instead (see the CAP theorem).
•
Technologies such as Objectivity/DB and InfiniteGraph offer ACID
transactions, with consistency and performance.
9. Key-Value Stores
“Dynamo: Amazon’s Highly Available Key-value Store” [2007]
•
Data model:
–
Global key-value mapping
–
Scalable (sharded) HashMap: KEY → VALUE
–
Highly fault tolerant (typically)
•
Examples:
–
Riak, Redis and Voldemort
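The "scalable (sharded) HashMap" idea can be sketched in a few lines of Python. This is only an illustration, not how Riak, Redis or Voldemort are implemented: the class name ShardedKeyValueStore, the shard count and the use of MD5 for key placement are all assumptions made for the example, and each shard is just an in-memory dict standing in for a separate node.

```python
import hashlib


class ShardedKeyValueStore:
    """Toy key-value store that spreads keys across in-memory shards."""

    def __init__(self, num_shards=4):
        # Each shard is an ordinary dict standing in for a separate node.
        self.shards = [{} for _ in range(num_shards)]

    def _shard_for(self, key):
        # Hash the key so the same key always lands on the same shard.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def put(self, key, value):
        self._shard_for(key)[key] = value

    def get(self, key, default=None):
        return self._shard_for(key).get(key, default)


store = ShardedKeyValueStore(num_shards=4)
store.put("user:42", {"name": "Ada"})
print(store.get("user:42"))
```

Lookups only ever go by key, which is what makes the model easy to scale out and also why it is a poor fit for interconnected data.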
10. Key-Value Stores: Pros & Cons
•
Strengths:
–
Simple data model
–
Great at scaling out horizontally
–
Scalable
–
Available
•
Weaknesses:
–
Simplistic data model
–
Poor for complex data
–
Unsuited for interconnected data
11. Big Table Clones – Column Family
•
Google’s “Bigtable: A Distributed Storage System for
Structured Data” [2006]
•
Column-family stores are essentially Bigtable clones.
•
Data Model:
–
Each column entry holds KEY, Column Name, Value and Date/Time (timestamp).
–
A big table, with column families.
–
Map-reduce for parallel query/processing.
•
Examples:
–
HBase, Hypertable and Cassandra.
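As a rough illustration of the column-family data model (KEY, Column Name, Value, Date/Time), the following Python sketch keeps timestamped versions of each column value. The class ColumnFamilyTable and its put/get methods are hypothetical names chosen for this example; they do not mirror the API of HBase, Hypertable or Cassandra.

```python
from collections import defaultdict


class ColumnFamilyTable:
    """Toy table: row key -> column family -> column name -> versions."""

    def __init__(self):
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, column, value, timestamp):
        # Keep every version of a value, ordered by its timestamp.
        versions = self.rows[row_key][family].setdefault(column, [])
        versions.append((timestamp, value))
        versions.sort(key=lambda version: version[0])

    def get(self, row_key, family, column):
        # Return the most recent value, or None if the column is absent.
        versions = self.rows[row_key][family].get(column)
        return versions[-1][1] if versions else None


table = ColumnFamilyTable()
table.put("user:42", "profile", "city", "Paris", timestamp=1)
table.put("user:42", "profile", "city", "Berlin", timestamp=2)
print(table.get("user:42", "profile", "city"))
```

Rows need not share the same columns, which is how the model accommodates semi-structured data.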
12. Big Table Clones – Pros & Cons
•
Strengths:
–
Data model supports semi-structured data
–
Naturally indexed (columns)
–
Good at scaling out horizontally
•
Weaknesses:
–
Complex data model
–
Unsuited for highly interconnected data
13. Document Databases
•
Data Model:
–
A collection of unstructured or semi-structured documents.
–
Each document is referenced using a key-value pair.
–
The “value” can range from unstructured text to a collection of
key-value pairs or a group of XML objects.
–
Index-centric to support queries based on content.
•
Examples:
–
CouchDB and MongoDB.
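A minimal Python sketch of the document model: documents are stored under a key, and a secondary index on chosen fields supports queries by content. The DocumentCollection class and its method names are assumptions made for illustration and are not the CouchDB or MongoDB API; the sketch also ignores updates and deletes.

```python
from collections import defaultdict


class DocumentCollection:
    """Toy document store with simple secondary indexes on chosen fields."""

    def __init__(self, indexed_fields):
        self.documents = {}
        # One index per field: field value -> set of document keys.
        self.indexes = {name: defaultdict(set) for name in indexed_fields}

    def insert(self, key, document):
        self.documents[key] = document
        for name, index in self.indexes.items():
            if name in document:
                index[document[name]].add(key)

    def find(self, field_name, value):
        # Answer the query from the index instead of scanning every document.
        keys = self.indexes[field_name].get(value, set())
        return [self.documents[key] for key in sorted(keys)]


posts = DocumentCollection(indexed_fields=["author"])
posts.insert("post:1", {"author": "ada", "title": "Graphs"})
posts.insert("post:2", {"author": "bob", "title": "Tables", "tags": ["sql"]})
posts.insert("post:3", {"author": "ada", "title": "Documents"})
print([doc["title"] for doc in posts.find("author", "ada")])
```

Documents in the same collection may carry different fields (only one post has "tags" here), but queries are limited to keys and indexed fields.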
14. Document Databases – Pros & Cons
•
Strengths:
–
Simple, powerful data model
–
Good scalability if sharding is supported
•
Weaknesses:
–
Unsuited for interconnected data
–
Query model is limited to keys and indexes
–
Generally uses Map-Reduce (designed for batch operations) for
larger queries
15. Object Databases
•
Data Model [ODMG'93]:
–
Objects have a Class (type) and a group of Values
–
Each Object instance has a unique Object Identifier [OID]
–
Connections use Object Identifiers for efficiency
–
Supports class inheritance and polymorphism
•
Examples:
–
Objectivity/DB and db4objects
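The object model above (a class, a unique OID per instance, connections held as OIDs, inheritance and polymorphism) can be sketched in Python as follows. ObjectStore, Person and Employee are hypothetical names invented for this example; the sketch is not the Objectivity/DB or db4objects API and keeps everything in memory.

```python
import itertools


class ObjectStore:
    """Toy object store that assigns a unique OID to every stored object."""

    def __init__(self):
        self._next_oid = itertools.count(1)
        self._objects = {}

    def add(self, obj):
        obj.oid = next(self._next_oid)
        self._objects[obj.oid] = obj
        return obj.oid

    def get(self, oid):
        # Direct lookup by OID: no join table is involved.
        return self._objects[oid]


class Person:
    def __init__(self, name):
        self.name = name
        self.oid = None
        self.friend_oids = []  # connections are stored as OIDs

    def describe(self):
        return f"Person {self.name}"


class Employee(Person):  # inheritance
    def __init__(self, name, company):
        super().__init__(name)
        self.company = company

    def describe(self):  # polymorphic override
        return f"Employee {self.name} at {self.company}"


store = ObjectStore()
alice = Person("Alice")
bob = Employee("Bob", "Acme")
store.add(alice)
store.add(bob)
alice.friend_oids.append(bob.oid)
descriptions = [store.get(oid).describe() for oid in alice.friend_oids]
print(descriptions)
```

Following a connection is a single lookup by OID, which is the navigational access pattern the slide refers to.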
16. Object Databases – Pros & Cons
•
Strengths:
–
Simple, powerful data model that includes inheritance and
polymorphism
–
Every object has a class (type) and a unique Object Identifier
–
Good scalability if sharding is supported
–
Uses Object Identifiers instead of JOIN tables to support very fast
navigational operations
•
Weaknesses:
–
The query language never became a standard
–
Supports standard object-oriented languages, but is not supported by
as wide a range of third-party tools as SQL is.
17. Graph Databases
•
Data model:
–
Node (Vertex) and Relationship (Edge) objects
–
Directed
–
May be a hypergraph (edges with multiple endpoints)
•
Examples:
–
InfiniteGraph, Neo4j, OrientDB, AllegroGraph, TitanDB and Dex
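A minimal Python sketch of the directed vertex/edge data model. The Vertex and Edge classes and the connect and neighbours helpers are assumptions made for this illustration; they are not the API of InfiniteGraph, Neo4j or any other product listed, and hyperedges are not modelled.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Vertex:
    name: str
    out_edges: list = field(default_factory=list)


@dataclass(eq=False)
class Edge:
    label: str
    source: Vertex
    target: Vertex


def connect(source, target, label):
    """Create a directed edge and register it on its source vertex."""
    edge = Edge(label, source, target)
    source.out_edges.append(edge)
    return edge


def neighbours(vertex, label=None):
    """Names of vertices reachable over one outgoing edge, optionally by label."""
    return [
        edge.target.name
        for edge in vertex.out_edges
        if label is None or edge.label == label
    ]


alice, bob, acme = Vertex("Alice"), Vertex("Bob"), Vertex("Acme")
connect(alice, bob, "KNOWS")
connect(alice, acme, "WORKS_AT")
print(neighbours(alice))
print(neighbours(alice, label="WORKS_AT"))
```

Because edges are first-class objects, they can carry their own data (here just a label), and querying is a matter of navigating from vertex to vertex.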
18. Graph Databases – Pros & Cons
•
Strengths:
–
Extremely fast for connected data
–
Scales out, typically
–
Easy to query (navigation)
–
Simple data model
•
Weaknesses:
–
May not support distribution or sharding
–
Requires conceptual shift... a different way of thinking
20. Typical “Big Data” Analytics Phases
Front-End Processing → Repository → Analytics and Visualization Tools
The strategic competitors are all moving in the same direction
21. Incremental Improvements Aren’t Enough
All current solutions use the same basic architectural model
• None of the current solutions can store connections between
entities in different silos
• Most analytic technology focuses on the content of the data nodes,
rather than the many kinds of connections between the nodes and the
data in those connections
• Why? Because relational and most NoSQL solutions are bad at handling
relationships.
• Object and Graph databases can efficiently store, manage and query the
many kinds of relationships hidden in the data.
26. Example 4 - Ad Placement Networks
Smartphone Ad placement - based on the user’s profile and location data
captured by opt-in applications.
• The location data can be stored and distilled in a key-value and column store
hybrid database, such as Cassandra
• The locations are matched with geospatial data to deduce user interests.
• As Ad placement orders arrive, an application built on a graph database, such
as InfiniteGraph, matches groups of users with Ads:
• Maximizes relevance for the user.
• Yields maximum value for the advertiser and the placer.
27. Example 5 - Healthcare Informatics
Problem: Physicians need better electronic records to manage patient data on a global
basis and to match symptoms, causes, treatments and interdependencies, improving
diagnoses and outcomes.
• Solution: Create a database that leverages the existing architecture, using NoSQL tools
such as Objectivity/DB and InfiniteGraph that can handle data capture, symptoms,
diagnoses, treatments, reactions to medications, interactions and progress.
• Result: It works:
• Diagnosis is faster and more accurate
• The knowledge base tracks similar medical cases.
• Treatment success rates have improved.
28. Relationship (Connection) Analytics...
Relational Database
Think about the SQL query for finding all links between the two “blue” rows... Good luck!
Table_A Table_B Table_C Table_D Table_E Table_F Table_G
Relational databases aren’t good at handling complex relationships!
29. Relationship (Connection) Analytics...
Objectivity/DB or InfiniteGraph - The solution (all links between the two rows, A3 and G4) can be found with a few lines of code
32. Lesson 1 – The Repository Matters A Lot
NEED                         | RDBMS | Key-Value | Column Family | Document Database | ODBMS | Graph Database
OLTP                         | YES   | No        | Maybe         | No                | Maybe | No
Text Handling                | No    | No        | No            | YES               | Maybe | No
Multimedia                   | No    | Maybe     | No            | Maybe             | YES   | Maybe
Engineering/Scientific       | No    | No        | No            | No                | YES   | Maybe
Business Intelligence        | YES   | No        | Maybe         | No                | Maybe | Maybe
Log Processing               | Maybe | No        | Maybe         | No                | YES   | Maybe
Connection Handling/Analysis | No    | No        | No            | No                | Maybe | YES
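The suitability ratings on this slide can also be encoded as a small lookup, which is one way to make the "repository matters" lesson concrete. The ratings below are copied from the slide; the names REPOSITORIES, RATINGS and candidates are assumptions made for this sketch.

```python
REPOSITORIES = [
    "RDBMS", "Key-Value", "Column Family",
    "Document Database", "ODBMS", "Graph Database",
]

# One rating per repository, in the order listed in REPOSITORIES.
RATINGS = {
    "OLTP": ["YES", "No", "Maybe", "No", "Maybe", "No"],
    "Text Handling": ["No", "No", "No", "YES", "Maybe", "No"],
    "Multimedia": ["No", "Maybe", "No", "Maybe", "YES", "Maybe"],
    "Engineering/Scientific": ["No", "No", "No", "No", "YES", "Maybe"],
    "Business Intelligence": ["YES", "No", "Maybe", "No", "Maybe", "Maybe"],
    "Log Processing": ["Maybe", "No", "Maybe", "No", "YES", "Maybe"],
    "Connection Handling/Analysis": ["No", "No", "No", "No", "Maybe", "YES"],
}


def candidates(need, include_maybe=False):
    """List the repositories rated YES (and optionally Maybe) for a need."""
    accepted = {"YES", "Maybe"} if include_maybe else {"YES"}
    return [
        repository
        for repository, rating in zip(REPOSITORIES, RATINGS[need])
        if rating in accepted
    ]


print(candidates("Connection Handling/Analysis"))
print(candidates("OLTP", include_maybe=True))
```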
33. Lesson 2 – Languages and Tools Matter Too
NEED                         | Repository        | Language           | BI Tools | Visual Analytics
OLTP                         | RDBMS             | SQL, Java          | YES      | Maybe
Text                         | Document Database | Java, XML          | No       | Maybe
Multimedia                   | ODBMS             | Java, C++          | No       | Maybe
Eng/Science                  | ODBMS             | C, C++, R, Fortran | Maybe    | YES
Business Intelligence        | RDBMS             | Java, SQL, R       | YES      | YES
Log Processing               | NoSQL, ODBMS      | C++, R, Java, SQL  | Maybe    | YES
Connection Handling/Analysis | Graph Database    | Java, C++, SPARQL  | Maybe    | YES
34. SUMMARY: A Polyglot Approach Works Best...
[Diagram: the PROBLEM sits at the center, surrounded by LANGUAGE, REPOSITORY and ANALYTICS (BI tools, graph tools, visual analytics)]
38. InfiniteGraph - The Enterprise Graph Database
• A high performance distributed database engine that supports analyst-time decision
support and actionable intelligence
• Cost effective link analysis – flexible deployment on commodity resources (hardware
and OS).
• Efficient, scalable, risk averse technology – enterprise proven.
• High Speed parallel ingest to load graph data quickly.
• Parallel, distributed queries
• Flexible plugin architecture
• Complementary technology
• Fast proof of concept – easy to use Graph API.
39. Objectivity/DB
A distributed, object database built for handling data with many complex relationships.
• Reliable - Deployed in process control, telecom and medical equipment, Big Science,
complex financial, defense and Intelligence Community applications.
• Provably scalable - used to build the world’s first petabyte+ database at the Stanford
Linear Accelerator Center in the year 2000.
• Advanced query capabilities - Parallel Query Engine
• Interoperable - across languages and platforms
–
C++, C#, Java, Python and SQL++
–
Linux, Mac OS X and Windows (32 and 64-bit)
40. The Big Data Connection Platform
[Diagram: technology stack with vendor logos, several annotated as since acquired (“Now HP”, “Now IBM”, “Now EMC”, “Now Teradata”, “Now SAP”, “Now Oracle”)]
Layers: Data Visualization & Analytics; Big Data Connection Platform; Processing Platform; Connectors / Integration; Servers / File Storage
42. Thank You!
Please take a look at objectivity.com
For Online Demos, White Papers, Free Downloads,
Samples & Tutorials
You Can Also See Us At NoSQL Now!
In San Jose, CA on August 22
Editor's notes
Thinking we should be less about Objy in the last bullet… possibly Object oriented and graph databases… ?
Note Object Oriented Databases as NOSQL here.
With a polyglot approach, one can keep using the existing SQL-based architecture and databases while still gaining the competitive advantage that the latest NoSQL technologies provide. One example of this polyglot approach is shown here. The technologies used depend on the use case.
This section seems out of place.
With a scalable, distributed platform that can manage connections between all types of disparate data, an enterprise can easily capitalize on the best tools for the job at hand.