Cassandra tutorial


  1. Cassandra Tutorial
     Apache Cassandra is a free, open-source, distributed database management system. It is highly scalable and designed to manage very large amounts of structured data. It provides high availability with no single point of failure.
  2. NoSQL Database
     • A NoSQL database (sometimes read as "Not Only SQL") provides a mechanism to store and retrieve data that is modeled in ways other than the tabular relations used in relational databases. These databases are typically schema-free, support easy replication, expose a simple API, are eventually consistent, and can handle huge amounts of data.
     • The primary objectives of a NoSQL database are:
       • simplicity of design,
       • horizontal scaling,
       • finer control over availability.
     • NoSQL databases use different data structures from relational databases, which makes some operations faster. The suitability of a given NoSQL database depends on the problem it must solve.
  3. • Apache Cassandra is an open-source distributed database system designed for storing and managing large amounts of data across commodity servers. Cassandra can serve both as a real-time operational data store for online transactional applications and as a read-intensive database for large-scale business intelligence systems.
     • Originally created at Facebook, Cassandra is built from symmetric peer-to-peer nodes rather than master or named nodes, so that there is no single point of failure.
     • Cassandra automatically partitions data across all the nodes in the cluster, but the administrator decides which data is replicated and how many copies of it are created.
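The partitioning and replication idea on this slide can be sketched as a toy token ring. This is only an illustration under stated assumptions, not Cassandra's actual partitioner: the `Ring` class, the `token` helper, the node names, and the use of MD5 with one token per node are all hypothetical choices made for the example.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a ring position (MD5 is an illustrative choice)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy peer-to-peer ring: every node is equal and owns one token."""

    def __init__(self, nodes, replication_factor):
        if replication_factor > len(nodes):
            raise ValueError("replication factor cannot exceed the node count")
        self.replication_factor = replication_factor
        # Sort nodes by token so lookups can use binary search.
        self.entries = sorted((token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.entries]

    def replicas(self, row_key):
        """Return the owner of row_key plus the next nodes clockwise."""
        start = bisect.bisect_left(self.tokens, token(row_key)) % len(self.entries)
        return [
            self.entries[(start + i) % len(self.entries)][1]
            for i in range(self.replication_factor)
        ]


nodes = ["node-a", "node-b", "node-c", "node-d"]
ring = Ring(nodes, replication_factor=3)
placement = ring.replicas("user:42")
print(placement)  # three distinct nodes; which ones depends on the hash
```

Because the placement depends only on the row key and the set of nodes, any node can compute it, which is what lets every peer act as an entry point without a master.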
  4. Features of Cassandra
     • Elastic scalability: Cassandra is highly scalable; you can add more hardware to accommodate more customers and more data as required.
     • Always-on architecture: Cassandra has no single point of failure and is continuously available for business-critical applications that cannot afford downtime.
     • Fast linear-scale performance: Cassandra is linearly scalable, i.e., throughput increases as you add nodes to the cluster, which helps it maintain a quick response time.
     • Flexible data storage: Cassandra accommodates structured, semi-structured, and unstructured data, and it can accommodate changes to your data structures as your needs change.
     • Easy data distribution: Cassandra gives you the flexibility to place data where you need it by replicating data across multiple data centers.
     • Transaction support: Cassandra does not provide full ACID transactions in the relational sense. Writes are atomic, isolated, and durable at the level of a single row (partition), and consistency is tunable per operation.
     • Fast writes: Cassandra was designed to run on cheap commodity hardware. It performs very fast writes and can store hundreds of terabytes of data without sacrificing read efficiency.
  5. Components of Cassandra
     • Cassandra uses the Gossip Protocol in the background to let the nodes communicate with each other and detect faulty nodes in the cluster.
     • The key components of Cassandra are as follows:
       • Node: the place where data is stored.
       • Data center: a collection of related nodes.
       • Cluster: a component that contains one or more data centers.
       • Commit log: the crash-recovery mechanism in Cassandra. Every write operation is written to the commit log.
       • Mem-table: a memory-resident data structure. After the commit log, data is written to the mem-table. A single column family can sometimes have multiple mem-tables.
       • SSTable: a disk file to which data is flushed from the mem-table when its contents reach a threshold value.
       • Bloom filter: a quick, probabilistic structure for testing whether an element is a member of a set. It can report false positives but never false negatives, and it acts as a special kind of cache. Bloom filters are consulted on reads so that SSTables that cannot contain the requested key are skipped.
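To make the Bloom filter entry above concrete, here is a minimal sketch of the general technique. It is not Cassandra's implementation: the `BloomFilter` class, its default sizes, and the SHA-256-based hashing are assumptions chosen for readability.

```python
import hashlib


class BloomFilter:
    """Probabilistic set: 'no' is always correct, 'yes' may be a false positive."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive num_hashes positions by salting the key with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


sstable_filter = BloomFilter()
for row_key in ("row-1", "row-2"):
    sstable_filter.add(row_key)

print(sstable_filter.might_contain("row-1"))  # True: added keys are never missed
```

A read path built on this idea checks the filter first and only touches the disk file when `might_contain` returns `True`.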
  6. Apache Cassandra data types
     • Apache Cassandra supports the most common data types, including ascii, bigint, blob, boolean, counter, decimal, double, float, int, text, timestamp, uuid, varchar, and varint.
     • Cassandra's data model offers the convenience of column indexes with the performance of log-structured updates, strong support for denormalization and materialized views, and built-in caching.
     • Data access is performed using Cassandra Query Language (CQL), which resembles SQL.
  7. Cassandra Query Language
     • Users access Cassandra through its nodes using Cassandra Query Language (CQL). CQL treats the database (keyspace) as a container of tables. Programmers work with CQL either through cqlsh, an interactive prompt, or through drivers for their application language.
     • Clients can contact any node for their read and write operations. That node (the coordinator) acts as a proxy between the client and the nodes holding the data.
  8. • Data storage in Cassandra is row-oriented, meaning that all contents of a row are serialized together on disk. Every row of columns has its own unique key. Each row can hold up to 2 billion columns. Furthermore, each row must fit onto a single server, because data is partitioned solely by row key.
     • To understand why databases like Cassandra, HBase, and BigTable (I'll call them DSS, Distributed Storage Services, from now on) were designed the way they are, we first have to understand what they were built to be used for.
  9. • DSS (Distributed Storage Services, as defined on the previous slide, not decision support systems) were designed to handle enormous amounts of data, stored in billions of rows on large clusters. Relational databases incorporate many features that make it hard to distribute them efficiently over multiple machines. DSS simply remove some or all of these ties. No operations are allowed that require scanning extensive parts of the dataset, meaning no JOINs and no rich queries.
     • Cassandra is a NoSQL column-family implementation that supports the BigTable data model and uses the architectural ideas introduced by Amazon Dynamo.
  10. Column family
     • Cassandra consists of many storage nodes and stores each row within a single storage node. Within each row, Cassandra always stores columns sorted by their column names. Using this sort order, Cassandra supports slice queries: given a row, users can retrieve the subset of its columns whose names fall within a given range. For example, a slice query with the range tag0 to tag9999 returns all the columns whose names fall between tag0 and tag9999.
     • Keyspace: a group of many column families. It is only a logical grouping of column families and provides an isolated scope for names.
     • Finally, super columns reside within a column family and group several columns under one key.
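The slice query described above relies only on columns being kept in sorted order, so it can be sketched with a binary search. The `Row` class and the sample column names are hypothetical, and plain string comparison is an assumed stand-in for whatever comparator the column family uses.

```python
import bisect


class Row:
    """One row whose columns are always kept sorted by column name."""

    def __init__(self):
        self._names = []    # column names in sorted order
        self._values = {}   # column name -> value

    def put(self, name, value):
        if name not in self._values:
            bisect.insort(self._names, name)
        self._values[name] = value

    def slice(self, start, end):
        """Return (name, value) pairs with start <= name <= end."""
        lo = bisect.bisect_left(self._names, start)
        hi = bisect.bisect_right(self._names, end)
        return [(n, self._values[n]) for n in self._names[lo:hi]]


row = Row()
for name, value in [("user", "ann"), ("tag3", "c"), ("age", 30),
                    ("tag1", "a"), ("tag20", "b")]:
    row.put(name, value)

print(row.slice("tag0", "tag9999"))
# [('tag1', 'a'), ('tag20', 'b'), ('tag3', 'c')]
```

Note that names compare as strings here, so `tag20` sorts before `tag3`; the range is defined by the sort order, not by the numeric suffix.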
  11. • Cassandra provides very fast writes, which are actually faster than its reads; it can transfer data at roughly 80-360 MB/sec per node. It achieves this using two techniques.
     • First, Cassandra keeps most of the data in memory at the responsible node; updates are applied in memory and written to persistent storage (the file system) lazily.
     • Second, to avoid losing data, Cassandra writes all transactions to a commit log on disk. Unlike in-place updates of data items on disk, writes to the commit log are append-only and therefore avoid rotational delay while writing to the disk.
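The two techniques on this slide (in-memory updates with lazy flushing, plus an append-only commit log) can be sketched as follows. This is a simplified model, not Cassandra's storage engine: the `StorageNode` class, its `flush_threshold`, and the use of Python lists and dicts in place of files are all assumptions.

```python
class StorageNode:
    """Toy write path: commit log first, then memtable, then lazy flush."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold
        self.commit_log = []   # stands in for an append-only file on disk
        self.memtable = {}     # in-memory view of recent writes
        self.sstables = []     # immutable flushed tables, newest last

    def write(self, key, value):
        self.commit_log.append((key, value))  # sequential append, no seek
        self.memtable[key] = value            # update happens in memory
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Persist the memtable as a sorted, immutable table.
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()  # flushed writes no longer need replay

    def recover(self):
        """Rebuild the memtable from the commit log after a crash."""
        self.memtable = {}
        for key, value in self.commit_log:
            self.memtable[key] = value

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest table wins
            if key in table:
                return table[key]
        return None


node = StorageNode(flush_threshold=3)
for k, v in [("a", 1), ("b", 2), ("c", 3), ("d", 4)]:
    node.write(k, v)

node.memtable = {}   # simulate losing in-memory state in a crash
node.recover()
print(node.read("a"), node.read("d"))  # 1 4
```

The write itself never rewrites existing data on disk; the only synchronous disk work is the append to the log, which is the property the slide credits for write speed.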
  12. • Unless a write requests full consistency, Cassandra writes the data to enough nodes to satisfy the request without resolving any inconsistencies between replicas; it resolves them only at the first read. This process is called "read repair."
     • Healing from failure is manual. If a node in a Cassandra cluster has failed, the cluster continues to work as long as you have replicas. Full recovery, which means redistributing data and compensating for missing replicas, is a manual operation performed with a command-line tool called nodetool. Also, while the manual operation runs, the system will be unavailable.
     • It remembers deletes. Cassandra is designed to keep working without a problem even if a node goes down (or gets disconnected) and comes back later. One consequence is that this complicates data deletion. For example, assume a node is down and, while it is down, a data item is deleted on the other replicas. When the unavailable node comes back, it would reintroduce the deleted item during synchronization unless Cassandra remembers that the item has been deleted. Cassandra does this by recording a deletion marker, called a tombstone, instead of removing the data immediately.
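Read repair and remembered deletes can be illustrated together with a last-write-wins sketch. This is a simplified model under assumptions, not Cassandra's code: the `Replica` class, the `TOMBSTONE` sentinel, the integer timestamps, and the helper functions are hypothetical.

```python
TOMBSTONE = object()  # marker meaning "this key was deleted"


class Replica:
    def __init__(self, name):
        self.name = name
        self.cells = {}  # key -> (timestamp, value or TOMBSTONE)

    def apply(self, key, timestamp, value):
        """Keep the incoming cell only if it is newer than what is stored."""
        current = self.cells.get(key)
        if current is None or timestamp > current[0]:
            self.cells[key] = (timestamp, value)


def delete(replicas, key, timestamp):
    # A delete is just a write of a tombstone.
    for replica in replicas:
        replica.apply(key, timestamp, TOMBSTONE)


def read_with_repair(replicas, key):
    """Return the newest value and push it to replicas that are stale."""
    versions = [r.cells[key] for r in replicas if key in r.cells]
    if not versions:
        return None
    newest = max(versions, key=lambda cell: cell[0])
    for replica in replicas:
        replica.apply(key, *newest)
    return None if newest[1] is TOMBSTONE else newest[1]


a, b, c = Replica("a"), Replica("b"), Replica("c")
for replica in (a, b, c):
    replica.apply("item", 1, "v1")

delete([a, b], "item", 2)   # replica c is down and misses the delete

# c is back and still holds the old value; the newer tombstone wins.
print(read_with_repair([a, b, c], "item"))  # None
print(c.cells["item"][1] is TOMBSTONE)      # True
```

If the delete had simply removed the cell on `a` and `b`, the stale `v1` on `c` would have been the newest surviving version and the read would have brought the item back.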
