In the analytics world, you often need to process many millions or billions of documents to generate a single report. Novel techniques have been developed to exploit modern processor architecture (larger on-chip caches, SIMD and vector processing, compression, columnar storage). This technology is now available for processing large JSON data sets. This talk discusses analysis of JSON data using advanced data warehousing techniques and shows how to make it simple and seamless for the application or tool developer.
NoSQL Analytics: JSON Data Analysis and Acceleration in MongoDB World
1. NoSQL Analytics: JSON Data Analysis and Acceleration in MongoDB World
Ajaykumar Gupte
IBM
2. Agenda
Basic overview of JSON data management.
Overview of IBM in-memory accelerator
Performance using in-memory accelerator with JSON data
3. [Diagram: Internet of Data and Internet of Things use cases and the data models they rely on]
Use cases: explosion of mobile devices (gaming and social apps); advertising (serving ads and real-time bidding); social networking and online communities; e-commerce and social commerce; machine data and real-time operational decisions; smart devices.
Data model labels: SQL; SQL, {JSON}, Spatial; {JSON}, TimeSeries; SQL, {JSON}; Simple, {JSON}, TimeSeries.
4. NoSQL Compared to “Traditional” DBMS
Schema-less app development lowers costs
– Simplicity and agility to develop apps quickly
– Interoperate with modern applications, especially mobile
– Applications address schema migrations
– Sometimes referred to as “flexible schema” development
A fixed schema is still required for traditional relational workloads
– Transaction volumes growing
– Operational data, analytics, static records
– Creates greater value through a “360 degree” view of the business
Demand is growing in both areas
These approaches are complementary
5. Real Time Analytics
• Customer Issues
– Several different models of data (SQL, NoSQL, TimeSeries/Sensor)
– NoSQL is not strong at building relations between collections
– The most valuable analytics combine the results of all data models
– Most prominent analytic systems are written using standard SQL
7. IBM Informix database 12.1
Relational, Embeddable
Real-time Analytics
Fast, Always-on Transactions
NoSQL capability
Multi-Tenancy
Sensor data management
High availability
Easy to Use
8. The IoT Architecture
[Diagram: four-tier IoT architecture]
Tier 1: Devices/Sensors (each identified by MAC address)
Tier 2: Smart Gateway / Aggregator
Tier 3: Sensor Operational Analytics Zone
Tier 4: Deep Analytics Zone
Component labels: Informix TS, IWA, Streams, Cloudant, MessageSight, Lightweight Analytics, Context, Logic, dist. engine, Simple Customer Portal, Time Series, REST (https)
IBM has opportunity in multiple tiers of IoT
9. The Hybrid Solution: Informix has the Best of Both Worlds
• Relational and non-relational data in one system
• NoSQL/MongoDB apps can access Informix relational tables
• Distributed queries
• Multi-statement transactions
• Enterprise-proven reliability
• Enterprise scalability
• Enterprise-level availability
Informix provides the capability to leverage the abilities of both relational DBMS and document store systems.
10. Informix 12.1 & MongoDB Clients
• New wire protocol listener supports existing MongoDB drivers
• Connect to MongoDB or Informix with the same application!
[Diagram: a MongoDB native client, a MongoDB web browser client, and mobile applications connect through a MongoDB driver over the MongoDB wire protocol to Informix 12.1]
11. Informix JSON Store Benefits
• Row locking on the individual JSON document
• Large documents, up to 2GB maximum size
• Ability to compress documents
• Ability to intelligently cache commonly used documents
• Use existing storage options and management tools
12. Two New Data Types JSON and BSON
• Native JSON and BSON data types
• Index support for NoSQL data types
• Native operators and comparator functions allow for direct manipulation of the BSON data type
• Database server seamlessly converts to and from
– JSON and BSON
– Character data and JSON
13. Informix: All Together Now!
SQL Tables
JSON Collections
TimeSeries
MQ Series
SQL APIs
JDBC, ODBC
Informix
IWA – BLU ACCELERATION
GENBSON: SQL to {BSON}
MongoDB
Drivers
TEXT SEARCH
SPATIAL
TIME SERIES {BSON}
REST API
14. Hybrid Access: SQL, JSON, Timeseries & Spatial
SQL API
– Relational table: standard ODBC, JDBC, .NET, OData, etc. Language: SQL.
– JSON: direct SQL access. Dynamic views, row types.
– Timeseries: standard SQL/ext, JDBC/ODBC, JSON support
– Spatial/Text: standard SQL, JDBC/ODBC, JSON support
Mongo API (NoSQL)
– Relational table: Mongo APIs for Java, Javascript, C++, C#, ...
– JSON: Mongo APIs for Java, Javascript, C++, C#, ...
– Timeseries: virtual table JSON support
– Spatial/Text: JSON support
15. Benefits of Hybrid Power
Access consistent data from its source
Avoid ETL, continuous data sync and conflicts.
Exploit the power of SQL and MongoAPI seamlessly
Exploit the power of RDBMS technologies in MongoAPI:
– Informix Warehouse Accelerator
– Cost-based optimizer & power of SQL
– R-tree indices for spatial, Lucene text indexes, and more.
Access all your data through any interface: MongoAPI & SQL
Store data in one place and efficiently transform and use it on demand.
Existing SQL-based tools and APIs can access new data in JSON
16. How to Convert Relational Data as JSON Documents
• Relational data can be treated as structured JSON documents; each column name-value pair becomes a key-value pair.
• select partner, pnum, country from partners;
partner  pnum  Country
Acme     1748  Australia
Vernco   1746  USA
Baker    1472  Spain
Contrex  1742  France
{partner: "Acme", pnum: 1748, Country: "Australia"}
{partner: "Vernco", pnum: 1746, Country: "USA"}
{partner: "Baker", pnum: 1472, Country: "Spain"}
{partner: "Contrex", pnum: 1742, Country: "France"}
• GENBSON function: a method for transforming existing SQL data into a JSON or BSON document store format
• select GENBSON( partners )::JSON from partners;
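The column-to-key mapping described on this slide can be illustrated outside the database. The following is a minimal Python sketch, not the GENBSON implementation: the helper names `row_to_document` and `rows_to_json` are hypothetical, and the rows are the four partners rows shown on the slide.

```python
import json

# Column names and rows copied from the partners example on the slide.
COLUMNS = ("partner", "pnum", "Country")
ROWS = [
    ("Acme", 1748, "Australia"),
    ("Vernco", 1746, "USA"),
    ("Baker", 1472, "Spain"),
    ("Contrex", 1742, "France"),
]


def row_to_document(columns, row):
    """Pair each column name with its value (hypothetical helper)."""
    return dict(zip(columns, row))


def rows_to_json(columns, rows):
    """Serialize every row as one JSON document string."""
    return [json.dumps(row_to_document(columns, r)) for r in rows]


if __name__ == "__main__":
    for doc in rows_to_json(COLUMNS, ROWS):
        print(doc)
```

Each printed line is one document per relational row, which is the shape the slide attributes to `GENBSON( partners )::JSON`.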
17. Indexing
• Supports B-Tree indexes on any key-value pairs.
• Typed indices can be created on simple basic types (int, decimal)
• Type-less indices can be created on BSON and use BSON type comparison
• Informix translates ensureIndex() to CREATE INDEX
• Informix translates dropIndex() to DROP INDEX
Mongo operation:
db.customers.ensureIndex({orderDate:1, zip:-1})
SQL operation:
CREATE INDEX IF NOT EXISTS v_customer_2 ON customer (bson_get(data,'orderDate') ASC, bson_get(data,'zip') DESC) USING BSON
CREATE UNIQUE INDEX IF NOT EXISTS v_customer_3 ON customer (bson_get(data,'orderDate') ASC) USING BSON
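To make the ensureIndex() to CREATE INDEX mapping concrete, here is a minimal Python sketch that builds SQL text in the shape shown on this slide. It is an illustration only, not the wire listener's actual code: the function name `index_spec_to_sql` is hypothetical, the column name `data` is taken from the slide's examples, and keys are not escaped or validated.

```python
def index_spec_to_sql(table, index_name, spec, unique=False):
    """Build a CREATE INDEX statement shaped like the slide's examples.

    spec maps a document key to 1 (ascending) or -1 (descending),
    mirroring the argument of a Mongo ensureIndex() call.
    """
    parts = []
    for key, direction in spec.items():
        if direction not in (1, -1):
            raise ValueError(f"unsupported direction for {key!r}: {direction}")
        order = "ASC" if direction == 1 else "DESC"
        # Assumption: documents live in a BSON column named "data".
        parts.append(f"bson_get(data,'{key}') {order}")
    kind = "CREATE UNIQUE INDEX" if unique else "CREATE INDEX"
    column_list = ", ".join(parts)
    return f"{kind} IF NOT EXISTS {index_name} ON {table} ({column_list}) USING BSON"


if __name__ == "__main__":
    print(index_spec_to_sql("customer", "v_customer_2", {"orderDate": 1, "zip": -1}))
```

The direction values 1 and -1 become ASC and DESC, and each key is wrapped in `bson_get` so that a regular B-Tree index can be built over a value extracted from the document.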
18. Flexible Grid + Sharding
Informix NoSQL Cluster Architecture Overview
[Diagram: four Informix shards (Shard 1 to Shard 4), each paired with secondary servers (Secondary 1 to Secondary 4) that can be disk-based or diskless]
Scaling in both directions
Secondary server(s) provide HA and scaling
Allow write on Secondary
19. MongoAPI Accessing Both NoSQL and Relational Tables
[Diagram: a Mongo application connects through the IBM Wire Listener to Informix Dynamic Server, which holds JSON collections and relational tables with their indexes and logs, and supports distributed queries and Enterprise Replication + Flexible Grid + Sharding]
Access JSON (Customer collection):
db.customer.find({state:"MO"})
SELECT bson_new(bson, '{}') FROM customer WHERE data.state::varchar(128) = "MO"
Access relational (partners table):
db.partners.find({state:"CA"})
SELECT * FROM partners WHERE state = "CA"
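The two find() calls on this slide map to different SQL depending on whether the target is a JSON collection or a relational table. The sketch below reproduces that mapping for flat string-equality filters only. It is a hedged illustration, not the wire listener's translation logic: `find_to_sql` and its `is_collection` flag are hypothetical, and the SQL shapes are copied from the slide.

```python
def find_to_sql(table, query, is_collection):
    """Translate a flat equality filter into SQL text like the slide's examples.

    Only string values are handled; anything else is rejected rather than guessed.
    """
    conditions = []
    for key, value in query.items():
        if not isinstance(value, str):
            raise TypeError("this sketch only handles string equality")
        if '"' in value:
            raise ValueError("quoting of embedded double quotes is not handled")
        literal = '"' + value + '"'
        if is_collection:
            # JSON collection: reach into the document and cast the field.
            conditions.append(f"data.{key}::varchar(128) = {literal}")
        else:
            # Relational table: the key is an ordinary column.
            conditions.append(f"{key} = {literal}")
    select = "SELECT bson_new(bson, '{}')" if is_collection else "SELECT *"
    sql = f"{select} FROM {table}"
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql


if __name__ == "__main__":
    print(find_to_sql("customer", {"state": "MO"}, is_collection=True))
    print(find_to_sql("partners", {"state": "CA"}, is_collection=False))
```

The application issues the same kind of find() in both cases; only the server-side SQL differs.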
20. MongoAPI Accessing Both NoSQL and Relational Tables
• Typically NoSQL does not involve transactions
– In many cases, a document update is atomic, but the application statement is not (example: 7 documents are targeted for deletion, but only 4 are removed)
• Informix-NoSQL provides transactions on all application statements
– Each server operation (INSERT, UPDATE, DELETE, SELECT) is automatically committed after the operation.
• Default isolation level is DIRTY READ
• All standard isolation levels are supported
• $sql operator – execute SQL commands within the Informix database
db.getCollection("system.sql").find({ "$sql": "select c.customer_num, p.customer_num as p_cust from customer c left outer join partners p on c.customer_num = p.customer_num order by 1" })
23. You can use IWA's In-Memory Analytics to speed up queries on…
Local or remote views
HA Clusters
24. IWA Overview and Seamless Integration with Informix/IDS
• Before IWA…
Informix 12.1:
– Receives analytic query (SQL) from client
– Spends some time doing intensive I/O
– Returns results back to the client
25. Informix/IWA Setup and Workflow
• Using IWA: the process is transparent to the Informix client
Informix 12.1 (query router):
– Receives analytic query (SQL) from client
– If the query uses data matching an IWA data mart and can be accelerated, routes/offloads it to IWA
– If the query is not based on an IWA data mart or cannot be accelerated, Informix resolves it
– Returns results back to the client
The Accelerator (query processor, Linux on Intel/AMD 64-bit, connected over TCP/IP):
– Processes the routed SQL query extremely fast and returns the answer back to Informix
– Bulk loader fills a compressed database partition
Accelerator techniques:
– Compression
– In-memory columnar storage
– Frequency partitioning
– Parallelism
– Predicate evaluation on compressed data
– Multi-core and vector-optimized algorithms (SIMD)
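The routing decision on this slide can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Informix's optimizer logic: "data matching an IWA data mart" is modelled as "every table the query touches belongs to one mart", and the names `DataMart` and `route_query` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataMart:
    name: str
    tables: frozenset  # tables or views loaded into the accelerator


def route_query(query_tables, acceleratable, marts):
    """Return ("IWA", mart_name) when the query can be offloaded,
    otherwise ("Informix", None) so the server resolves it itself."""
    needed = frozenset(query_tables)
    if acceleratable:
        for mart in marts:
            # Assumption: a match means the mart covers every referenced table.
            if needed <= mart.tables:
                return ("IWA", mart.name)
    return ("Informix", None)


if __name__ == "__main__":
    marts = [DataMart("noSQL_mart", frozenset({"vcomments", "vusers"}))]
    print(route_query(["vcomments", "vusers"], True, marts))
    print(route_query(["vcomments", "orders"], True, marts))
```

Because the fallback path always returns the query to Informix, the client sees one server either way, which is what makes the process transparent.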
26. [Diagram: Informix Dynamic Server with the Informix Warehouse Accelerator (In-Memory Query Engine)]
Data in the server: relational tables and views (partners, Inventory), JSON collections ({Customer}, {Orders}), Timeseries tables, text index (BTS), spatial indices
Clients: SQL & BI applications and SQL apps/tools over ODBC, JDBC connections; NoSQL apps/tools over MongoDB drivers
27. [Diagram: setup workflow connecting BI applications, Informix Database Server, Informix Warehouse Accelerator, and IBM Smart Analytics Studio]
Step 1. Install, configure, start Informix
Step 2. Install, configure, start Accelerator
Step 3. Connect Studio to Informix & add accelerator
Step 4. Design, validate, deploy data mart
Step 5. Load data to accelerator
Ready for queries
IWA – BLU ACCELERATION
28. JSON Data Acceleration
All NoSQL marts (all views based on JSON collections)
Hybrid marts (subset of views based on JSON collections)
TPC-DS 10GB workload
web_returns fact table along with 13 dimension tables
Total memory on the machine: 250GB
Total nodes: 5
Coordinator nodes: 1
Coordinator node memory: 20000 MB
Worker nodes: 4
Worker nodes memory: 100000 MB
IWA DRDA interface: eth1 (IWA running on a different machine than the IDS server)
29. JSON Data Acceleration
Using genbson, each JSON collection (for example web_returns) can be created with just two statements and with good performance
create table json_web_returns_coll(c1 serial, c2 bson);
insert into json_web_returns_coll select 0, genbson( web_returns_ext ) from web_returns_ext;
719964 row(s) inserted.
create table json_customer_address_coll(c1 serial, c2 bson);
insert into json_customer_address_coll select 0, genbson( customer_address_ext ) from customer_address_ext;
250000 row(s) inserted.
create table json_date_dim_coll(c1 serial, c2 bson);
insert into json_date_dim_coll select 0, genbson( date_dim_ext ) from date_dim_ext;
73049 row(s) inserted.
create table json_time_dim_coll(c1 serial, c2 bson);
insert into json_time_dim_coll select 0, genbson( time_dim_ext ) from time_dim_ext;
86400 row(s) inserted.
33. Create the SQL Views & analyze workload
demo_database –
create view vcomments(uid,pid,comment) AS SELECT
data.uid::INT,
data.pid::INT,
data.comment::VARCHAR(128)
from comments;
create view vusers(uid,name) AS SELECT
data.uid::int,
data.name::varchar(128)
from users;
34. Create the SQL Views & analyze workload
set environment use_dwa 'probe cleanup';
set environment use_dwa 'probe start';
select {+ avoid_execute} * from vcomments c,vusers u
where c.uid=u.uid;
set environment use_dwa 'probe stop';
execute procedure ifx_probe2mart('demo_database','noSQL_mart');
execute function ifx_createmart('demo_dwa','noSQL_mart');
execute function ifx_loadmart('demo_dwa','noSQL_mart','NONE');
35. Deploy NoSQL data mart & Issue queries
set environment use_dwa 'accelerate on';
select c.uid,name,comment from vcomments c,vusers u
where c.uid=u.uid and pid=444;
uid 12345
name john
comment first
uid 99999
name mia
comment third
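The views vcomments and vusers project typed columns out of JSON documents so that an ordinary SQL join can run over them. The Python sketch below mimics that projection and the join with `pid=444` in plain code. The sample documents are hypothetical and were chosen only so that the output matches the two result rows on the slide; the function names are illustrative.

```python
# Hypothetical documents; only the two result rows on the slide come from the source.
comments = [
    {"uid": 12345, "pid": 444, "comment": "first"},
    {"uid": 12345, "pid": 555, "comment": "second"},
    {"uid": 99999, "pid": 444, "comment": "third"},
]
users = [
    {"uid": 12345, "name": "john"},
    {"uid": 99999, "name": "mia"},
]


def vcomments(docs):
    """Mimic the vcomments view: typed (uid, pid, comment) tuples."""
    return [(int(d["uid"]), int(d["pid"]), str(d["comment"])) for d in docs]


def vusers(docs):
    """Mimic the vusers view: typed (uid, name) tuples."""
    return [(int(d["uid"]), str(d["name"])) for d in docs]


def comments_for_pid(comment_docs, user_docs, pid):
    """Join the two views on uid and keep rows for one pid."""
    names = dict(vusers(user_docs))
    return [
        (uid, names[uid], comment)
        for uid, c_pid, comment in vcomments(comment_docs)
        if c_pid == pid and uid in names
    ]


if __name__ == "__main__":
    for row in comments_for_pid(comments, users, 444):
        print(row)
```

Once the documents are exposed as typed columns, the join is an ordinary relational operation, which is what allows the data mart to be defined over the views.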
Informix (left side):
Object-Relational Database for OLTP & OLAP
Provides extreme performance for transactions
Best database for Time-stamped (sensor) data
Best in market Cluster / HA and Grids / Data Replication technology
Support on Cloud and Virtual environments
Hybrid SQL and NoSQL database - Big Data and IoT ready platform
Easy to use and administer (GUI, commands, SQL functions)
Enterprise-class Autonomics and Embeddable Database
IWA (right side):
In-Memory compressed parallel columnar Database software
Combines multiple state-of-the-art IMDB (in-memory database) technologies for OLAP speed
Plugs-in to an Informix database server via TCP/IP
Leverages existing Informix database environment and schema
Keeps an in-memory columnar copy of Informix data relevant for analytics
Works behind Informix database, tightly integrated, transparent to users
Provides Extreme Performance for I/O intensive and Analytic queries
Uses low-cost commodity hardware and O/S: Linux on Intel x86_64
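Two of the IWA properties listed above, compressed columnar storage and predicate evaluation on compressed data, can be illustrated with a generic dictionary-encoded column. The Python sketch below shows that general technique only. It is not IWA's actual storage format or its frequency-partitioning scheme, and the function names are hypothetical.

```python
def encode_column(values):
    """Dictionary-encode a column: each distinct value gets a small integer code."""
    dictionary = sorted(set(values))
    code_of = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [code_of[v] for v in values]


def rows_equal_to(dictionary, codes, target):
    """Evaluate `column == target` by comparing integer codes,
    without decoding any row back to its original value."""
    try:
        target_code = dictionary.index(target)
    except ValueError:
        return []  # the value never occurs in this column
    return [row for row, code in enumerate(codes) if code == target_code]


if __name__ == "__main__":
    dictionary, codes = encode_column(["USA", "Spain", "USA", "France"])
    print(dictionary, codes)
    print(rows_equal_to(dictionary, codes, "USA"))
```

The predicate is translated to a code once and then compared against a dense array of small integers, a pattern that suits SIMD-style scanning of a single column.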