1. Building An Elastic Real Time NoSQL Platform
Creating a platform for unlimited elastic
computation power and storage
2. Motivation
Complete elastic solution stack
Applications that need massive “strategic” storage (disk-
based NoSQL) and a real time (“tactical”) component
Horizontally and vertically scalable
Highly available
Self healing
Fault tolerant: suitable for commodity h/w strategy
Simplified management and monitoring, vs
conventional, multi-product solutions
® Copyright 2011 Gigaspaces Ltd. All Rights Reserved
3. What Is Real-Time?
It’s all relative
In this context, it means “really fast”.
How fast is really fast? Reads as low as 5 μs, and typically
under 1 ms for a fully replicated write.
Source: http://blog.gigaspaces.com/2010/12/06/possible-impossibility-the-race-to-zero-latency/
4. Two Layer Approach
Advantage: Minimal “impedance mismatch” between layers.
– Both NoSQL cluster technologies, with similar advantages
Grid layer serves as an in memory cache for interactive requests.
Grid layer serves as a real time computation fabric for CEP, and
limited (to allocated memory) real time map/reduce capability.
[Diagram labels: Raw Event Streams; Real Time Events; In Memory Compute Cluster (SCALE); Reporting Engine; Raw And Derived Events; NoSQL Cluster (SCALE)]
5. Two Layer Approach (continued)
Grid layer doing CEP can act as a filter, as many raw events
get converted to semantic/business events, reducing
meaningless data verbosity
Grid layer provides scalable messaging
NoSQL layer provides unlimited cheap storage on commodity
hardware
NoSQL layer provides virtually unlimited scale processing
power
6. Basics Of In Memory DataGrid Technology
An In Memory Data Grid (IMDG) is a data store
Grid just means “cluster”
Data can be partitioned across cluster nodes
Processing power near data storage
Distributed hash table
Application optimized data model denormalization
Nodes are typically configured with one or more replicas
(sound familiar yet?)
Not a “cache”: a system of record, but can be used as a
cache, or both
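The partitioning and replica ideas above can be sketched in a few lines. This is a minimal illustration, not any product's API: `TinyGrid`, `partition_for`, and `fail_over` are hypothetical names, and replication is simplified to a synchronous copy into one backup per partition.

```python
import hashlib


def partition_for(key: str, partitions: int) -> int:
    """Map a key to a partition using a stable hash (same result in every process)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % partitions


class TinyGrid:
    """Hypothetical in-memory grid: every partition has one primary and one backup."""

    def __init__(self, partitions: int = 4):
        self.partitions = partitions
        self.primary = [dict() for _ in range(partitions)]
        self.backup = [dict() for _ in range(partitions)]

    def write(self, key: str, value):
        p = partition_for(key, self.partitions)
        self.primary[p][key] = value
        self.backup[p][key] = value  # assumption: synchronous replication

    def read(self, key: str):
        return self.primary[partition_for(key, self.partitions)].get(key)

    def fail_over(self, p: int):
        """Promote the backup of partition p after its primary is lost."""
        self.primary[p] = self.backup[p]
        self.backup[p] = dict(self.primary[p])  # re-create a backup copy
```

A key always hashes to the same partition, so any client can locate the data without a central lookup, which is the distributed hash table property listed above.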
7. Advanced Capabilities
Business logic (code) co-resident with data shards
Scalable messaging
Dynamic code execution across cluster
Multi-language support
Object-oriented
Document-oriented/schema free
Multi-level indexing
SQL Queries
Full ACID transaction support
Elastic scaling (automatic and manual)
Write-behind persistence
8. Features: IMDG vs NoSQL
[Diagram comparing Data Grid and Disk Based NoSQL; feature labels, in extracted order:]
Low Latency
Eventual/Tunable Consistency
Horizontally Scalable
Code co-location
Service remoting
Parallel Execution
Unlimited scale
Fault Tolerant
Cloud enabled
Hadoop tools
Transactional
Highly Available
Elastic
Messaging
Platform Independent
Complex Event Processing
Flexible Schema
9. Vive La Difference
The IMDG complements a NoSQL store:
– Can serve as a short term request cache (side cache or inline)
– Can serve as a cache for MR results
– Enables event driven architectures / CEP
– In memory map/reduce
– Very fast writes, regardless of NoSQL store
– Transactional layer: can essentially turn “eventual” consistency into
pure transactional persistency without a performance hit
– Highly available and independently scalable
10. A Complete Scalable Application Platform
[Diagram labels: Raw Event Streams; Real Time Events; In Memory Compute Cluster (SCALE); Reporting Engine; Raw And Derived Events; NoSQL Cluster (SCALE)]
11. Key Implementation Issues
Grid must support reliable asynchronous persistence
– If not reliable: in-flight data is at risk. Ideally tunable to accommodate
differing risk tolerance.
– If not asynchronous: too slow
– If not persistent: obviously nothing gets sent to disk
To do more than a distributed cache, grid must support code
and data partitioning
– Ideally, code is collocated in memory with data partition
– Needed to support CEP, application, and service remoting capabilities
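Code and data partitioning can be illustrated with a toy model in which a task runs only against the data of the partition it is routed to. This is a sketch under stated assumptions: `PARTITIONS`, `route`, `execute_on_partition`, and `execute_everywhere` are hypothetical names, and a real grid would ship the task to the node that owns the partition rather than call a local function.

```python
from collections import defaultdict

PARTITIONS = 3  # assumed cluster size for this sketch

# partition number -> payments held in that partition's memory
partitions = defaultdict(list)


def route(customer_id: int) -> int:
    """Routing rule: all data for one customer lands in one partition."""
    return customer_id % PARTITIONS


def store(payment: dict) -> None:
    partitions[route(payment["customer_id"])].append(payment)


def execute_on_partition(p: int, task):
    """Run the task against partition p's local data only (no data leaves the partition)."""
    return task(partitions[p])


def execute_everywhere(task, reducer):
    """Map the task over every partition, then reduce the partial results."""
    return reducer(execute_on_partition(p, task) for p in range(PARTITIONS))


def total_paid(local_payments) -> int:
    """Example business logic that only ever sees co-located data."""
    return sum(payment["amount"] for payment in local_payments)
```

Because only the small partial results cross partition boundaries, the same pattern also serves as the limited in memory map/reduce mentioned earlier in the deck.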
12. Key Implementation Issues (continued)
Grid ideally supports FIFO entry ordering
– Key to using grid as a queue
– Key to scaling messaging without an additional tier
– Combined with co-located business logic, operates at memory speeds
Write speed on the NoSQL layer
– Grid is, in effect, queuing entries to the NoSQL layer
– If the NoSQL layer cannot keep up, in memory grid backs up
– This behavior is an asset, unless an unanticipated, sustained flood
occurs.
– The faster the write speed the better
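The queuing behavior described above can be sketched as a FIFO write-behind buffer with a bounded backlog. The names (`WriteBehindQueue`, `store_write`, `max_backlog`, `drain`) are hypothetical, and draining is done by an explicit call here to keep the example deterministic; a real grid would drain asynchronously in the background.

```python
from collections import deque


class WriteBehindQueue:
    """Hypothetical write-behind buffer between the grid and the NoSQL layer."""

    def __init__(self, store_write, max_backlog: int = 1000):
        self.store_write = store_write  # callback that writes one entry to the store
        self.max_backlog = max_backlog  # assumed memory budget, counted in entries
        self.pending = deque()          # FIFO: oldest entry is flushed first

    def write(self, key, value) -> None:
        """Fast path: accept the write in memory and return immediately."""
        if len(self.pending) >= self.max_backlog:
            raise OverflowError("backlog full: the store is not keeping up")
        self.pending.append((key, value))

    def drain(self, batch_size: int) -> int:
        """Flush up to batch_size entries in arrival order; return how many were flushed."""
        flushed = 0
        while self.pending and flushed < batch_size:
            key, value = self.pending[0]
            self.store_write(key, value)  # if this raises, the entry stays queued
            self.pending.popleft()
            flushed += 1
        return flushed
```

An entry is removed only after the store write returns, so a failed write leaves it queued for retry. The backlog limit is where a sustained flood shows up: once the store falls behind by more than `max_backlog` entries, writers see the failure instead of the grid running out of memory.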
13. Use Case 1 – Event Cloud
Complex event processing
Collect events in real time:
– Interactions
– Orders
– Bills
– Payments
– Activations
– …
Transform into decision factors:
– Good customer
– Pays 3-6 days early
– Decreasing usage
– Missed payment
– Unusual bill
– App usage
Original events, possibly scrubbed or annotated, are passed
through
“Synthetic events” derived by business logic are constructed from
the raw event stream. Possible rule engine integration (e.g.
Drools).
Derived events and analytics passed on to NoSQL layer
Other events forwarded to external listeners, systems
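As a rough illustration of deriving a synthetic event from raw events, the sketch below passes every raw event through and adds a derived event when a payment arrives 3 to 6 days before its bill is due. The event shape (dicts with `type`, `bill_id`, `customer`, `due`, `paid`) and the name `derive_events` are assumptions made for the example, not part of any product.

```python
from datetime import date


def derive_events(raw_events):
    """Yield each raw event unchanged, followed by any event derived from it."""
    due_dates = {}  # bill_id -> due date; state kept next to the stream
    for event in raw_events:
        yield event  # original events are passed through
        if event["type"] == "bill":
            due_dates[event["bill_id"]] = event["due"]
        elif event["type"] == "payment" and event["bill_id"] in due_dates:
            days_early = (due_dates[event["bill_id"]] - event["paid"]).days
            if 3 <= days_early <= 6:
                yield {
                    "type": "derived",
                    "name": "pays_3_6_days_early",
                    "customer": event["customer"],
                }
```

In a fuller setup the hard-coded rule would presumably be replaced by a rule engine, and the derived events would be written on to the NoSQL layer.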
14. Use Case 2 – Time Bounded
Time Bounded – suited to operations with daily business cycle
(e.g. trading)
Current day (or other time period that will fit in memory) held
in memory, along with related application state, caching, etc.
Operations are still streamed to the underlying NoSQL platform, or
held for an end of day flush if the back end can’t write fast enough.
Supports application hosting, messaging, and complex event
processing.
External clients are aware of “current day” store, vs archival.
Large scale reports/analytics run in background on NoSQL
archive.
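The end of day flush variant can be sketched as follows. `CurrentDayStore`, `archive_write`, and `archive_read` are hypothetical names; the archive callbacks stand in for the NoSQL layer, and the rollover is triggered by the first entry of a new day rather than by a scheduler.

```python
from datetime import date


class CurrentDayStore:
    """Hypothetical time-bounded store: one day in memory, older days in the archive."""

    def __init__(self, archive_write, today: date):
        self.archive_write = archive_write  # callback into the archival layer
        self.today = today
        self.entries = []

    def record(self, day: date, entry) -> None:
        if day != self.today:
            self.roll_over(day)
        self.entries.append(entry)

    def roll_over(self, new_day: date) -> None:
        """End of day flush: push everything held in memory to the archive."""
        for entry in self.entries:
            self.archive_write(self.today, entry)
        self.entries = []
        self.today = new_day

    def query(self, day: date, archive_read):
        """Clients pick the current-day store or the archive, depending on the day."""
        if day == self.today:
            return list(self.entries)
        return archive_read(day)
```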
15. Use Case 3 – LRU
Grid holds a subset of NoSQL store, and supports an LRU
caching model.
In line or side-cache.
Appropriate only in cases where, like any cache, usage
pattern does not generate many cache misses.
Still supports CEP, messaging, and computation scaling
(provided grid product supports it).
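A minimal sketch of the LRU model, assuming the grid holds at most `capacity` entries and falls back to the NoSQL store on a miss. `LruSideCache` and `load_from_store` are hypothetical names; the miss counter is there only to make the cost of a poor usage pattern visible.

```python
from collections import OrderedDict


class LruSideCache:
    """Hypothetical LRU cache in front of a slower backing store."""

    def __init__(self, load_from_store, capacity: int):
        self.load_from_store = load_from_store  # callback that reads the backing store
        self.capacity = capacity
        self.entries = OrderedDict()  # ordered from least to most recently used
        self.misses = 0

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            return self.entries[key]
        self.misses += 1
        value = self.load_from_store(key)  # slow path: go to the store
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry
        return value
```

Every miss pays the full latency of the backing store, which is why this model only suits usage patterns with few misses.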
16. Wishlist
This platform concept is still at an early stage
For Gigaspaces, integrations already exist for Cassandra and
MongoDB.
Customers are currently implementing solutions
Stuff I’d like to see:
– Unified management and scaling. Shared infrastructure.
– Grid/NoSQL aware Hive façade that can run MR jobs on both. Perhaps
integration with other Hadoop tools
– Deeper integration. To further optimize write speed/capacity, and
perhaps offload some in-memory aspects of underlying NoSQL
platform to minimize duplication and possibly optimize elasticity.
17. Conclusion
Two shared nothing “NoSQL” architectures complementing
each other
Fully elastic/scalable
Ultra high performance/low latency combined with unlimited
scale.
Full application stack
Highly reliable and self-healing
Scalable complex event handling
Multi-language
Simple. Two products.