Big data stores
1. Introduction to Big Data stores:
Key Value stores:
Cassandra:
• First developed at Facebook (powered the Inbox Search)
• Uses decentralized clustered nodes
• Considered one of the most scalable NoSQL systems
• Very high availability – no single point of failure
• Flexible data storage (structured/un-structured)
• Relatively easy to configure
• Designed for high transaction rates
• Java-based; available under the Apache License 2.0
2. Key Value NOSQL Databases
DynamoDB:
• Amazon DynamoDB stores data on Solid State Drives (SSDs)
• DynamoDB implements cryptographic methods to authenticate
users and prevent unauthorized data access.
• Optional strongly consistent reads return the latest values; atomic
counters are also supported.
• Takes the overhead of scaling and replication away from developers.
• Synchronous replication across multiple AWS Availability Zones in
a single Region.
• Combined with other AWS services such as Amazon EMR and AWS Data
Pipeline, DynamoDB supports complex analytics and data movement,
respectively.
3. Key Value NOSQL Databases
Riak:
• Riak adopts a masterless peer-to-peer architecture
• Written in Erlang & C, some JavaScript.
• Distributes data and performs replication across nodes with consistent
hashing.
• Riak uses HTTP/REST or a custom binary protocol to exchange data with
the cluster/nodes.
• Riak replication has two modes of operation: fullsync (synchronization
occurs every 6 hours) and real-time (requires a synchronization trigger).
• When new nodes are added to cluster, data is rebalanced across nodes
with no downtime.
• Used by 25% of Fortune 50 companies; users include AT&T, AOL, Ask.com,
Best Buy, Boeing and Comcast.
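The consistent-hashing and rebalancing bullets above can be illustrated with a small sketch. This is a generic hash ring, not Riak's actual ring implementation: the `HashRing` class, the MD5 hash, the node names and the virtual-node count are all assumptions chosen for illustration. It shows that adding a node moves only the keys the new node takes over.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a point on the ring (MD5 is an arbitrary choice here)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring with several virtual points per node."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes
        self._points = []    # sorted ring positions
        self._owner_of = {}  # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            bisect.insort(self._points, point)
            self._owner_of[point] = node

    def owner(self, key: str) -> str:
        # First ring position clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner_of[self._points[idx]]


keys = [f"user:{n}" for n in range(1000)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {k: ring.owner(k) for k in keys}

ring.add_node("node-d")  # hypothetical new node joining the cluster
after = {k: ring.owner(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys changed owner")
```

Every key in `moved` ends up on `node-d`; the remaining keys keep their previous owner, which is why a cluster built this way can rebalance without touching most of its data.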
4. Key Value NOSQL Databases
Redis:
• Redis adopts a master-slave architecture
• Slaves are allowed to communicate with each other.
• Redis is written in ANSI C and is best suited for rapidly changing
data of predictable size, e.g., stock analysis.
• By default, latency monitoring is disabled; the user can enable it by
setting a threshold value in the "latency-monitor-threshold" setting.
• Redis is designed to be accessed by trusted-users within trusted
environment.
• Performs hash or range partitioning (mapping a range of objects to a
specific Redis instance)
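The two partitioning schemes named in the last bullet can be sketched as follows. This is client-side routing logic only and makes no calls to Redis; the instance names, the ID ranges and the CRC32 hash are assumptions for illustration.

```python
import zlib

INSTANCES = ["redis-0", "redis-1", "redis-2"]  # hypothetical instance names


def hash_partition(key: str) -> str:
    """Hash partitioning: a stable hash of the key modulo the instance count."""
    return INSTANCES[zlib.crc32(key.encode("utf-8")) % len(INSTANCES)]


# Range partitioning: each instance owns a contiguous range of numeric IDs.
# The boundaries below are made up for the example.
RANGES = [
    (0, 10_000, "redis-0"),
    (10_000, 20_000, "redis-1"),
    (20_000, 30_000, "redis-2"),
]


def range_partition(object_id: int) -> str:
    """Return the instance whose [low, high) range contains object_id."""
    for low, high, instance in RANGES:
        if low <= object_id < high:
            return instance
    raise KeyError(f"no instance owns id {object_id}")


print(range_partition(15_000))
print(hash_partition("user:15000") in INSTANCES)
```

Range partitioning needs a table of ranges to be maintained, while hash partitioning works for any key but changes most assignments when the instance count changes.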
5. Key Value NOSQL Databases
CouchDB:
• Written in Erlang.
• Instead of locks, CouchDB uses Multi-Version Concurrency Control
(MVCC) to manage concurrent access to the database.
• CouchDB achieves eventual consistency between multiple
databases by using incremental replication.
• Validates documents using JavaScript functions, which approve or deny
the document update.
• CouchDB supports both pull replication (node acts as target) and
push replication (node acts as source).
• CouchDB is best suited for data that changes occasionally.
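The lock-free update model and the validation step described above can be approximated in a few lines. This is a simplified in-memory sketch, not CouchDB's API: `TinyStore`, `validate` and the "must have a title" rule are hypothetical, and the validation function is written in Python here rather than JavaScript. A writer must present the revision it read; a stale revision is rejected instead of being blocked by a lock.

```python
import uuid


class Conflict(Exception):
    """Raised when an update is based on a stale revision."""


class Forbidden(Exception):
    """Raised when the validation function rejects a document."""


def validate(new_doc: dict) -> None:
    # Hypothetical rule standing in for a document validation function.
    if "title" not in new_doc:
        raise Forbidden("document must have a title")


class TinyStore:
    def __init__(self):
        self._docs = {}  # doc_id -> (revision, body)

    def get(self, doc_id):
        rev, body = self._docs[doc_id]
        return rev, dict(body)

    def put(self, doc_id, body, rev=None):
        validate(body)
        current = self._docs.get(doc_id)
        current_rev = current[0] if current else None
        if rev != current_rev:
            raise Conflict(f"expected revision {current_rev}, got {rev}")
        generation = int(current_rev.split("-")[0]) + 1 if current_rev else 1
        new_rev = f"{generation}-{uuid.uuid4().hex}"
        self._docs[doc_id] = (new_rev, dict(body))
        return new_rev


store = TinyStore()
rev1 = store.put("post", {"title": "draft"})
rev2 = store.put("post", {"title": "final"}, rev=rev1)   # based on latest revision
try:
    store.put("post", {"title": "late edit"}, rev=rev1)  # based on a stale revision
except Conflict as exc:
    print("conflict:", exc)
```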
6. Key Value NOSQL Databases
Azure Table Storage:
• Maximum data size is 200 TB per table.
• Azure Table returns a maximum of 1,000 rows per query.
• Azure Table Storage provides ACID transactions that guarantee CRUD
operations for a single entity in a table.
• Storage access in Azure Table Storage has a three-layered architecture:
o Front-End (FE) layer - authenticates and authorizes the request.
o Partition layer - partitions the object data and performs load balancing.
o Distributed and replicated File System (DFS) layer - distributes and
replicates data across many clusters.
• Azure Table Storage does not provide a way to represent relationships between
data.
• To provide fault tolerance, the stored data is replicated three times within
the region and an additional three times in another region.
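Because of the 1,000-row retrieval cap mentioned above, a client that wants a larger result has to read it page by page. The sketch below simulates that loop against an in-memory list; it does not call any Azure SDK. The function names and the integer continuation token are assumptions (a real service would hand back an opaque token).

```python
PAGE_LIMIT = 1000  # cap taken from the slide


def query_page(entities, continuation=0, limit=PAGE_LIMIT):
    """Return one page plus a continuation token (None when exhausted)."""
    page = entities[continuation:continuation + limit]
    next_token = continuation + limit
    return page, (next_token if next_token < len(entities) else None)


def query_all(entities):
    """Keep requesting pages until no continuation token is returned."""
    token = 0
    while token is not None:
        page, token = query_page(entities, token)
        yield from page


rows = list(range(2500))  # stand-in for stored entities
first_page, token = query_page(rows)
print(len(first_page), token)
print(len(list(query_all(rows))))
```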
7. Key Value NOSQL Databases
BerkeleyDB:
• Berkeley DB is an embedded database engine and is suitable for storing
key/value data.
• Key and data items are stored in simple structures called DBTs (DBT is an
acronym for database thang) that contain a reference to memory and a length.
• Berkeley DB supports concurrent access from multiple threads, even for large databases.
• The program accessing Berkeley DB determines how data is to be stored in records.
• Berkeley DB has three different products:
o Berkeley DB - contains database implementations and is written in C
o Berkeley DB Java Edition - Log structured storage architecture and
coded in Pure Java.
o Berkeley DB XML - specializes in the storage of XML documents
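The idea of an embedded key/value store where the application decides the record layout can be sketched with Python's standard library. Note the stand-in: `dbm.dumb` is not Berkeley DB, it is used here only because it ships with Python and exposes the same bytes-in, bytes-out style of interface. The record format and the helper names are assumptions.

```python
import dbm.dumb
import os
import struct
import tempfile

# The application, not the store, defines the record layout:
# here a record is a 4-byte unsigned id followed by an 8-byte float balance.
RECORD = struct.Struct("<Id")


def save_account(db, name: str, account_id: int, balance: float) -> None:
    """Store one record; keys and values are opaque bytes to the store."""
    db[name.encode("utf-8")] = RECORD.pack(account_id, balance)


def load_account(db, name: str):
    """Read the bytes back and decode them with the same layout."""
    return RECORD.unpack(db[name.encode("utf-8")])


path = os.path.join(tempfile.mkdtemp(), "accounts")
with dbm.dumb.open(path, "c") as db:   # runs in-process, no server involved
    save_account(db, "alice", 7, 125.50)

with dbm.dumb.open(path, "r") as db:
    print(load_account(db, "alice"))
```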
8. Column-Family NOSQL databases:
HBase:
• First developed at Powerset (to power natural language
search)
• Distributed, column-oriented database on top of
Hadoop/HDFS.
• Continuous access to data - Multiple master nodes.
• Linear and modular scalability.
• Provides interactive commands for manipulating database
• Single row atomic operations and row level exclusive locks.
• Multiple client interfaces: its native Java library, Thrift, and REST
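The single-row atomicity bullet can be illustrated with a toy table that takes an exclusive lock per row for each read-modify-write. This is an in-process sketch with threads, not HBase client code; `RowLockedTable` and its methods are hypothetical names.

```python
import threading
from collections import defaultdict


class RowLockedTable:
    """Toy table: each row has its own exclusive lock."""

    def __init__(self):
        self._rows = defaultdict(dict)              # row key -> {column: value}
        self._locks = defaultdict(threading.Lock)   # row key -> row lock
        self._guard = threading.Lock()              # protects the two dicts

    def _slot(self, row):
        with self._guard:
            return self._rows[row], self._locks[row]

    def increment(self, row, column, amount=1):
        """Atomic read-modify-write on one row while holding its lock."""
        cells, lock = self._slot(row)
        with lock:
            cells[column] = cells.get(column, 0) + amount
            return cells[column]

    def get(self, row, column):
        cells, lock = self._slot(row)
        with lock:
            return cells.get(column)


table = RowLockedTable()


def worker():
    for _ in range(1000):
        table.increment("row1", "hits")


threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(table.get("row1", "hits"))
```

Operations on different rows do not contend with each other, because each row has its own lock.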
9. Column-Family NOSQL databases:
BigTable:
• First developed at Google (for structured data).
• Sparse, distributed, persistent multidimensional sorted
map.
• Self-managing (servers can be added/removed
dynamically; servers adjust to load imbalance).
• Fault tolerant & Persistent.
• Designed to scale into the petabyte range.
• Tables are optimized for GFS (Google File System) by being
split into multiple tablets.
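The "sparse, distributed, persistent multidimensional sorted map" bullet can be made concrete with a small in-memory model. It covers only the sorted, sparse and multidimensional parts (no distribution or persistence); `SparseTable` and its methods are hypothetical names, and the row and column values are sample data.

```python
import bisect


class SparseTable:
    """Toy model of a sorted map keyed by (row, column, timestamp)."""

    def __init__(self):
        self._keys = []    # sorted list of (row, column, -timestamp)
        self._cells = {}   # key tuple -> value; absent cells take no space

    def put(self, row, column, timestamp, value):
        key = (row, column, -timestamp)  # negated so the newest sorts first
        if key not in self._cells:
            bisect.insort(self._keys, key)
        self._cells[key] = value

    def get(self, row, column):
        """Return the newest (timestamp, value) for a cell, or None if absent."""
        idx = bisect.bisect_left(self._keys, (row, column, float("-inf")))
        if idx < len(self._keys) and self._keys[idx][:2] == (row, column):
            key = self._keys[idx]
            return -key[2], self._cells[key]
        return None

    def scan(self, start_row, end_row):
        """Yield (row, column, timestamp, value) for start_row <= row < end_row."""
        idx = bisect.bisect_left(self._keys, (start_row,))
        while idx < len(self._keys) and self._keys[idx][0] < end_row:
            row, column, neg_ts = self._keys[idx]
            yield row, column, -neg_ts, self._cells[self._keys[idx]]
            idx += 1


table = SparseTable()
table.put("com.cnn.www", "contents:", 3, "<html>v1")
table.put("com.cnn.www", "contents:", 5, "<html>v2")
table.put("com.cnn.www", "anchor:cnnsi.com", 9, "CNN")
table.put("org.example", "contents:", 4, "<html>")

print(table.get("com.cnn.www", "contents:"))
print([r[:3] for r in table.scan("com", "com.z")])
```

Because keys are kept in sorted order, a contiguous row range can be read with one scan, which is also what makes it natural to split a table into row-range pieces such as tablets.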
10. Column-Family NOSQL databases:
HyperTable:
• Developed as in-house software at Zvents.
• Manages massive sparse tables with timestamped cell
versions.
• Maximum efficiency (Less hardware, power, datacenter).
• Good fit for wide range of applications.
• Clean semantics.
• High performance.
11. Graph NOSQL databases:
Neo4j:
• Developed by Neo Technology
• Highly scalable, robust.
• Graph structures with nodes, edges and properties to
store data.
• Provides index-free adjacency
• Neo4j is schema free – Data does not have to adhere to
any convention
• ACID – atomic, consistent, isolated and durable for logical
units of work
• Easy to get started and use.
• Support for a wide variety of languages (Java, Python, Perl,
Scala, Cypher, etc.)
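The property-graph model and index-free adjacency described above can be sketched in plain Python. This is an illustration of the idea, not Neo4j's storage engine or API: the `Node` and `Edge` classes, the `KNOWS` relationship and the sample people are assumptions. Each node holds direct references to its edges, so a traversal follows pointers instead of consulting a global index.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    labels: set
    properties: dict
    outgoing: list = field(default_factory=list)  # direct references to edges


@dataclass(eq=False)
class Edge:
    kind: str
    start: Node
    end: Node
    properties: dict = field(default_factory=dict)


def connect(start, kind, end, **properties):
    """Create an edge and attach it to its start node."""
    edge = Edge(kind, start, end, properties)
    start.outgoing.append(edge)
    return edge


def neighbours(node, kind):
    """Follow stored references; no global index lookup is involved."""
    return [e.end for e in node.outgoing if e.kind == kind]


def friends_of_friends(person):
    direct = neighbours(person, "KNOWS")
    seen = {id(person)} | {id(n) for n in direct}
    result = []
    for friend in direct:
        for candidate in neighbours(friend, "KNOWS"):
            if id(candidate) not in seen:
                seen.add(id(candidate))
                result.append(candidate)
    return result


alice = Node({"Person"}, {"name": "Alice"})
bob = Node({"Person"}, {"name": "Bob"})
carol = Node({"Person"}, {"name": "Carol"})
connect(alice, "KNOWS", bob, since=2010)
connect(bob, "KNOWS", carol)
connect(bob, "KNOWS", alice)

print([n.properties["name"] for n in friends_of_friends(alice)])
```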
12. Document NOSQL databases:
MongoDB:
• Developed by the software company 10gen as a service
product, later shifted to open source.
• Document Oriented Database.
• Implemented in C++ for best performance. (built for
speed).
• Super low latency access to your data (Very little CPU
overhead).
• Auto Sharding for easy scalability.
• Map/Reduce for Aggregation.
• Full index support for high performance.
• Language drivers for Ruby/Ruby on Rails, Java, C#,
JavaScript, Python, Perl, Erlang, etc.
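The map/reduce style of aggregation mentioned above can be shown without a database. The sketch below runs a map step and a reduce step over a list of dictionaries standing in for documents; it does not use a MongoDB driver, and the `orders` data and function names are assumptions.

```python
from collections import defaultdict

orders = [  # hypothetical documents
    {"customer": "ann", "total": 25},
    {"customer": "bob", "total": 10},
    {"customer": "ann", "total": 40},
]


def map_order(doc):
    """Map step: emit (key, value) pairs for one document."""
    yield doc["customer"], doc["total"]


def reduce_totals(key, values):
    """Reduce step: fold all values emitted for one key."""
    return sum(values)


def map_reduce(documents, mapper, reducer):
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in mapper(doc):
            grouped[key].append(value)
    return {key: reducer(key, values) for key, values in grouped.items()}


totals = map_reduce(orders, map_order, reduce_totals)
print(totals)
```

Because the map step looks at one document at a time and the reduce step at one key at a time, the work can be split across shards and the partial results combined afterwards.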
Editor's notes
Cassandra (an Apache project) is a NOSQL key-value distributed storage system designed for storing and managing huge amounts of structured or unstructured data over many nodes. Cassandra was first developed at Facebook and has been an Apache top-level project since 2010. Like many other NOSQL systems, Cassandra is designed to run on cheap commodity hardware. Cassandra runs over many decentralized clustered nodes and offers very elastic scalability: capacity can be increased and put online on the fly. This makes Cassandra an 'always on' solution. Also, because of its distributed architecture, Cassandra has no single point of failure. Cassandra is designed never to go down. Ever.
Some design aspects of Cassandra resemble a traditional database management system. Some of the terminology will look recognizable to SQL/DDL database developers. However, Cassandra (like most other NOSQL solutions) does not support a normalized data model.
Cassandra is hugely popular and is generally considered the most implemented of the NOSQL databases. Most users like the low complexity of Cassandra. Many consider it an easy and simple solution for cloud data storage. Its simplicity and elegant design make it a natural choice for many organizations.