Introduction to NoSQL



  1. 1. INTRODUCTION TO NOSQL By Anuja G. Gunale
  2. 2. WHAT IS DBMS? A database management system (DBMS) is essentially a computerized data-keeping system. DBMS software functions primarily as an interface between the end user and the database, while managing the data, the database engine, and the database schema, in order to facilitate the organization and manipulation of data. Users of the system are given facilities to perform several kinds of operations, either to manipulate the data in the database or to manage the database structure itself. DBMSs are categorized according to their data structures or types; common types include inverted list, hierarchical, network, and relational databases.
  3. 3. WHAT IS RDBMS? RDBMS stands for Relational Database Management System. The relational model is the basis for SQL and for modern database systems such as MS SQL Server, IBM DB2, Oracle, MySQL, and Microsoft Access. The software used to store, manage, query, and retrieve data stored in a relational database is called an RDBMS. The RDBMS provides an interface between users and applications and the database, as well as administrative functions for managing data storage, access, and performance. Data in an RDBMS is stored in database objects called tables. A table is a collection of related data entries and consists of columns and rows.
  4. 4. LIMITATIONS OF RELATIONAL DATABASES 1. In a relational database we need to define the structure and schema of the data first, and only then can we process the data. 2. Relational database systems provide consistency and integrity of data by enforcing ACID properties (Atomicity, Consistency, Isolation and Durability). There are scenarios where this is essential, such as a banking system. In many other cases, however, these properties impose a significant performance overhead and can make database responses very slow. 3. Many applications store their data in JSON format, and an RDBMS doesn't provide a natural way of performing operations such as create, insert, update and delete on this data. NoSQL stores, on the other hand, keep data in JSON-like formats, which are compatible with most of today's applications.
  5. 5. JSON is the data structure of the Web. It's a simple data format that allows programmers to store and communicate sets of values, lists, and key-value mappings across systems. As JSON adoption has grown, database vendors have sprung up offering JSON-centric document databases.
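The three JSON building blocks named above (sets of values, lists, and key-value mappings) can be sketched with Python's standard `json` module; the product document here is purely illustrative:

```python
import json

# A hypothetical product document combining nested key-value mappings,
# a list, and scalar values -- the building blocks of JSON.
product = {
    "name": "wireless mouse",
    "price": 24.99,
    "tags": ["electronics", "accessories"],
    "stock": {"warehouse_a": 12, "warehouse_b": 3},
}

# Serialize to a JSON string, as it would be stored or sent over the wire...
text = json.dumps(product)

# ...and parse it back into native data structures on the other side.
restored = json.loads(text)
assert restored == product
```

A document database stores and indexes exactly this kind of structure natively, instead of forcing it into flat rows and columns.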
  6. 6. WHAT IS SQL? SQL (Structured Query Language) is a standardized programming language that's used to manage relational databases and perform various operations on the data in them. Initially created in the 1970s, SQL is regularly used not only by database administrators, but also by developers writing data integration scripts and by data analysts looking to set up and run analytical queries.
  7. 7. The uses of SQL include modifying database table and index structures; adding, updating and deleting rows of data; and retrieving subsets of information from within a database for transaction processing and analytics applications. Queries and other SQL operations take the form of commands written as statements. Commonly used SQL statements include SELECT, INSERT, UPDATE, DELETE, CREATE, ALTER and TRUNCATE. SQL is a domain-specific language used in programming and designed for managing data held in a relational database management system, or for stream processing in a relational data stream management system.
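A minimal runnable sketch of these common statements, using Python's built-in `sqlite3` module and a made-up `users` table (table and column names are assumptions for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# CREATE: define the table structure.
cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")

# INSERT: add rows of data.
cur.executemany("INSERT INTO users (name, age) VALUES (?, ?)",
                [("Alice", 30), ("Bob", 25)])

# UPDATE: modify existing rows.
cur.execute("UPDATE users SET age = 26 WHERE name = 'Bob'")

# SELECT: retrieve a subset of the data.
rows = cur.execute("SELECT name, age FROM users ORDER BY name").fetchall()

# DELETE: remove rows.
cur.execute("DELETE FROM users WHERE name = 'Alice'")
remaining = cur.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```

After this runs, `rows` holds the two users (with Bob's updated age) and `remaining` is 1, since Alice's row was deleted.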
  8. 8. NOSQL
  9. 9. A database management system provides the mechanism to store and retrieve data. There are different kinds of database management systems: 1. RDBMS (Relational Database Management Systems) 2. OLAP (Online Analytical Processing) 3. NoSQL (Not only SQL)
  10. 10. BRIEF HISTORY OF NOSQL DATABASES •1998- Carlo Strozzi used the term NoSQL for his lightweight, open-source relational database •2000- Graph database Neo4j is launched •2004- Google BigTable is launched •2005- CouchDB is launched •2007- The research paper on Amazon Dynamo is released •2008- Facebook open-sources the Cassandra project •2009- The term NoSQL is reintroduced
  11. 11. WHAT IS NOSQL? NoSQL, short for Not only SQL, is a class of next-generation databases that provide a mechanism for storage and retrieval of data without requiring the relational model. Most NoSQL databases, such as MongoDB, have a distributed architecture and are open source. They offer horizontal scalability, which means the capacity of a cluster can be increased by adding commodity machines. They are schema-free: there is no requirement to design tables up front before pushing data into them. NoSQL provides easy replication with very little manual intervention; once replication is configured, the system automatically takes care of failovers.
  12. 12. The crucial factor about NoSQL is that it can handle huge amounts of data and can achieve performance by adding more machines to your clusters, implemented on commodity hardware. There are close to 150 NoSQL databases on the market, which makes it difficult to choose the right one for your system. One of the advantages of NoSQL databases is that they are really easy to scale and much faster for most types of operations performed on a database. There are certain situations where you would prefer a relational database over NoSQL; however, when you are dealing with a huge amount of data, a NoSQL database is often the better choice.
  13. 13. WHY NOSQL? Today, data is becoming easier to access and capture through third parties such as Facebook, Google+ and others. Personal user information, social graphs, geolocation data, user-generated content and machine logging data are just a few examples where the data has been increasing exponentially. Serving these applications properly requires processing huge amounts of data, something SQL databases were never designed for. NoSQL databases evolved to handle this data properly.
  14. 14. NoSQL Is: •More than rows in tables — NoSQL systems store and retrieve data in many formats: key-value stores, graph databases, column-family (Bigtable) stores, document stores, and even rows in tables. •Free of joins — NoSQL systems allow you to extract your data using simple interfaces, without joins. •Schema free — NoSQL systems allow you to drag and drop your data into a folder and then query it without creating an entity-relational model. •Compatible with many processors — NoSQL systems allow you to store your database on multiple processors and maintain high-speed performance. •Usable on shared-nothing commodity computers — Most (but not all) NoSQL systems leverage low-cost commodity processors that have separate RAM and disk. •Supportive of linear scalability — NoSQL supports linear scalability; when you add more processors you get a consistent increase in performance. •Innovative — NoSQL offers options beyond a single way of storing, retrieving and manipulating data. NoSQL supporters (also known as NoSQLers) have an inclusive attitude and recognize SQL solutions as viable options. To the NoSQL community, NoSQL means not only SQL.
  15. 15. NoSQL Is Not: •About the SQL language — NoSQL is not defined as an application that uses a language other than SQL. SQL, as well as other query languages, is used with NoSQL databases. •Only open source — Although many NoSQL systems have an open source model, commercial products use NoSQL concepts as well. You can still have an innovative approach to problem solving with a commercial product. •Only Big Data — Many, but not all, NoSQL applications are driven by the inability of a current application to efficiently scale when Big Data is an issue. While volume and velocity are important, NoSQL also focuses on variability and agility. •About cloud computing — Many NoSQL systems reside in the cloud to take advantage of its ability to rapidly scale when situations dictate, but NoSQL systems can run in the cloud as well as in your corporate data center. •About a clever use of RAM and SSD — Many NoSQL systems focus on the efficient use of RAM or solid-state disks to increase performance, but they can also run on standard hardware. •An elite group of products — NoSQL is not an exclusive club with a few products. There are no membership dues or tests required to join.
  16. 16. FEATURES OF NOSQL: NoSQL has the following features: 1) Non-Relational 2) Schema free 3) Simple API 4) Distributed
  17. 17. 1) Non-relational NoSQL databases never follow the relational model and never provide tables with flat fixed-column records. They work with self-contained aggregates or BLOBs. (A binary large object, or BLOB, is a collection of binary data stored as a single entity; blobs are typically images, audio or other multimedia objects, though sometimes binary executable code is stored as a blob.) They don't require object-relational mapping or data normalization, and have no complex features like query languages, query planners, referential-integrity joins, or ACID.
  18. 18. 2) Schema-free NoSQL databases are either schema-free or have relaxed schemas. They do not require any definition of the schema of the data, and offer heterogeneous structures of data in the same domain.
  19. 19. 3) Simple API NoSQL databases offer easy-to-use interfaces for storing and querying data. The APIs provided allow low-level data manipulation and selection methods. Text-based protocols are mostly used, typically HTTP REST with JSON. There is mostly no standards-based NoSQL query language. They are web-enabled databases running as internet-facing services.
  20. 20. 4) Distributed •Multiple NoSQL databases can be executed in a distributed fashion •They offer auto-scaling and fail-over capabilities •The ACID concept is often sacrificed for scalability and throughput •There is mostly no synchronous replication between distributed nodes; instead asynchronous master replication, peer-to-peer replication, or HDFS-style replication is used •They often provide only eventual consistency •Shared-nothing architecture, which enables less coordination and higher distribution
  21. 21. WHEN TO GO FOR NOSQL When you would want to choose NoSQL over a relational database: 1.When you want to store and retrieve huge amounts of data. 2.The relationships between the data you store are not that important. 3.The data is unstructured and changes over time. 4.Constraint and join support is not required at the database level. 5.The data is growing continuously and you need to scale the database regularly to handle it.
  23. 23. NoSQL is a database technology driven by Cloud Computing, the Web, Big Data and big users. NoSQL now leads the way for popular internet companies such as LinkedIn, Google, Amazon, and Facebook to overcome the drawbacks of the 40-year-old RDBMS. A NoSQL database, also known as "Not Only SQL", is an alternative to a SQL database that does not require any kind of fixed table schema. NoSQL generally scales horizontally and avoids major join operations on the data. A NoSQL database can be referred to as structured storage which includes the relational database as a subset. NoSQL covers a multitude of databases, each with a different data storage model. The most popular types are Graph, Key-Value, Columnar and Document.
  25. 25. WHAT IS MEAN BY BUSINESS DRIVERS? Business drivers are the key inputs and activities that drive the operational and financial results of a business. Common examples of business drivers are salespeople, number of stores, website traffic, number and price of products sold, units of production, etc.
  27. 27. 1) VOLUME Without a doubt, the key factor pushing organizations to look at alternatives to their current RDBMSs is the need to query Big Data using clusters of commodity processors. Until around 2005, performance concerns were resolved by purchasing faster processors. In time, however, the ability to increase processing speed was no longer an option. As chip density increased, heat could no longer dissipate fast enough without chips overheating. This phenomenon, known as the power wall, forced systems designers to shift their focus from increasing speed on a single chip to using more processors working together. The need to scale out (also known as horizontal scaling), rather than scale up (faster processors), moved organizations from serial to parallel processing, where data problems are split into separate paths and sent to separate processors to divide and conquer the work.
  28. 28. 2) VELOCITY While Big Data problems are a consideration for many organizations moving away from RDBMS systems, the ability of a single-processor system to rapidly read and write data is also key. Many single-processor RDBMS systems are unable to keep up with the demands of real-time inserts and online queries made by public-facing websites. RDBMS systems frequently index many columns of every new row, a process that decreases system performance. When single-processor RDBMSs are used as the back end to a web storefront, random bursts in web traffic slow down response for everyone, and tuning these systems can be costly when both high read and write throughput is desired.
  29. 29. 3) VARIABILITY Companies that want to capture and report on exception data struggle when attempting to use rigid database schema structures imposed by RDBMS systems. For example, if a business unit wants to capture a few custom fields for a particular customer, all customer rows within the database need to store this information even though it doesn't apply.  Adding new columns to an RDBMS requires the system to be shut down and ALTER TABLE commands to be run.  When a database is large, this process can impact system availability, losing time and money in the process.
  30. 30. 4) AGILITY The most complex part of building applications using RDBMSs is the process of putting data into and getting data out of the database. If your data has nested and repeated subgroups of data structures you need to include an object-relational mapping layer. The responsibility of this layer is to generate the correct combination of INSERT, UPDATE, DELETE and SELECT SQL statements to move object data to and from the RDBMS persistence layer. This process is not simple and is associated with the largest barrier to rapid change when developing new or modifying existing applications.
  31. 31. CAP THEOREM (BREWER'S THEOREM) You must understand the CAP theorem when you talk about NoSQL databases, or in fact when designing any distributed system. The CAP theorem states that there are three basic requirements which exist in a special relationship when designing applications for a distributed architecture.
  32. 32. Consistency - This means that the data in the database remains consistent after the execution of an operation. For example, after an update operation all clients see the same data. Availability - This means that the system is always on (service guarantee availability), with no downtime. Partition Tolerance - This means that the system continues to function even if the communication among the servers is unreliable, i.e. the servers may be partitioned into multiple groups that cannot communicate with one another.
  33. 33. Theoretically it is impossible to fulfill all three requirements, so CAP says a distributed system can guarantee only two of the three. Therefore all current NoSQL databases follow different combinations of C, A and P from the CAP theorem. Here is a brief description of the three combinations CA, CP, AP: CA - Single-site cluster, therefore all nodes are always in contact. When a partition occurs, the system blocks. CP - Some data may not be accessible, but the rest is still consistent/accurate. AP - The system is still available under partitioning, but some of the data returned may be inaccurate.
  34. 34. NOSQL PROS/CONS Advantages: •High scalability •Distributed computing •Lower cost •Schema flexibility, semi-structured data •No complicated relationships Disadvantages: •No standardization •Limited query capabilities (so far) •Eventual consistency is not intuitive to program for
  35. 35. ADVANTAGES OF NOSQL: •Can be used as a primary or analytic data source •Big Data capability •No single point of failure •Easy replication •No need for a separate caching layer •Provides fast performance and horizontal scalability •Can handle structured, semi-structured, and unstructured data with equal effect •Object-oriented programming which is easy to use and flexible •NoSQL databases don't need a dedicated high-performance server •Support key developer languages and platforms •Simpler to implement than an RDBMS •Can serve as the primary data source for online applications •Handles big data, managing data velocity, variety, volume, and complexity •Excels at distributed database and multi-data-center operations •Offers a flexible schema design which can easily be altered without downtime or service disruption
  36. 36. DISADVANTAGES OF NOSQL •No standardization rules •Limited query capabilities •RDBMS databases and tools are comparatively mature •Does not offer traditional database capabilities, like consistency when multiple transactions are performed simultaneously •When the volume of data increases, it becomes difficult to maintain unique keys •Doesn't work as well with relational data •The learning curve is steep for new developers •Open source options are not so popular with enterprises
  37. 37. SQL V/S NOSQL:
  39. 39. TYPES OF NOSQL DATABASES: NoSQL databases are mainly categorized into four types: key-value pair, column-oriented, graph-based and document-oriented. Every category has its unique attributes and limitations. No single category is better for all problems; users should select a database based on their product needs. •Key-value pair based •Column-oriented •Graph based •Document-oriented
  40. 40. 1) KEY VALUE PAIR BASED Data is stored in key/value pairs. This model is designed to handle lots of data and heavy load. Key-value pair storage databases store data as a hash table where each key is unique, and the value can be JSON, a BLOB (Binary Large Object), a string, etc. For example, a key-value pair may contain a key like "Website" associated with a value like "Guru99". It is one of the most basic types of NoSQL database.
Key-value stores help the developer to store schema-less data. They work best for shopping cart contents. Redis, Dynamo and Riak are some NoSQL examples of key-value store databases. They are all based on Amazon's Dynamo paper.
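The key-value model above can be sketched as a hash table behind a tiny put/get/delete interface. This is a hypothetical in-memory class, not the API of Redis, Dynamo or Riak; the key names are illustrative:

```python
# A toy in-memory key-value store: a hash table mapping each unique key
# to an opaque value (string, JSON-like dict, bytes blob, ...).
class KeyValueStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value          # keys are unique: put overwrites

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

store = KeyValueStore()
store.put("Website", "Guru99")                          # the slide's example
store.put("cart:42", {"items": ["mouse", "keyboard"]})  # schema-less value
assert store.get("Website") == "Guru99"
```

Because all access goes through a single key, this model shards and scales easily, but it cannot answer queries over the values themselves.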
  42. 42. 2) COLUMN-BASED Column-oriented databases work on columns and are based on the BigTable paper by Google. Every column is treated separately. Values of a single column are stored contiguously. They deliver high performance on aggregation queries like SUM, COUNT, AVG, MIN etc., as the data is readily available in a column.
Column-based NoSQL databases are widely used to manage data warehouses, business intelligence, CRM, and library card catalogs. HBase, Cassandra and Hypertable are NoSQL examples of column-based databases.
  44. 44. AGGREGATE FUNCTIONS IN DBMS 1. SQL provides a number of built-in functions to perform operations on data; these functions are very useful for performing mathematical calculations on table data. 2. Aggregate functions return a single value after performing calculations on a set of values. Here we will discuss the five frequently used aggregate functions provided by SQL. 3. These aggregate functions are used with the SELECT statement, and the function is applied to one column at a time.
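The five common aggregates can be shown in a runnable form with Python's built-in `sqlite3`; the `sales` table and its values are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sales (region TEXT, amount REAL)")
cur.executemany("INSERT INTO sales VALUES (?, ?)",
                [("east", 100.0), ("east", 200.0), ("west", 50.0)])

# Each aggregate collapses the whole amount column into a single value.
total, count, average, lowest, highest = cur.execute(
    "SELECT SUM(amount), COUNT(amount), AVG(amount), MIN(amount), MAX(amount) "
    "FROM sales").fetchone()
# total=350.0, count=3, average=116.66..., lowest=50.0, highest=200.0
```

A column-oriented store answers exactly this kind of query quickly, because all `amount` values sit contiguously on disk.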
  45. 45. BIGTABLE PAPER BY GOOGLE: Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high- performance solution for all of these Google products.  In this paper we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable.
  46. 46. 3) DOCUMENT-ORIENTED: A document-oriented NoSQL DB stores and retrieves data as a key-value pair, but the value part is stored as a document. The document is stored in JSON or XML format. In the diagram, on the left we have rows and columns, and on the right we have a document database with a structure similar to JSON. For a relational database, you have to know in advance what columns you have, and so on. For a document database, however, you store data as a JSON-like object; you do not have to define a schema up front, which makes it flexible. The value is understood by the DB and can be queried.
  47. 47. The document type is mostly used for CMS systems, blogging platforms, real-time analytics and e-commerce applications. It should not be used for complex transactions which require multiple operations, or for queries against varying aggregate structures. Amazon SimpleDB, CouchDB, MongoDB, Riak and Lotus Notes are popular document-oriented DBMS systems.
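The idea that "the value is understood by the DB and can be queried" can be sketched as a list of schema-free dicts plus a field-matching query function. This is a hypothetical sketch, not MongoDB's or CouchDB's actual API; all names and documents are invented:

```python
# A toy document collection: each document is a schema-free JSON-like
# dict, and documents in the same collection may have different fields.
collection = [
    {"_id": 1, "name": "Alice", "city": "Pune", "tags": ["admin"]},
    {"_id": 2, "name": "Bob", "city": "Mumbai"},           # no "tags" field
    {"_id": 3, "name": "Carol", "city": "Pune", "age": 31},
]

def find(coll, **criteria):
    """Return documents whose fields match all the given criteria."""
    return [doc for doc in coll
            if all(doc.get(field) == value for field, value in criteria.items())]

pune_users = find(collection, city="Pune")   # matches Alice and Carol
```

Unlike a relational row, each document carries its own structure, so adding a new field to one document never requires altering the others.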
  48. 48. A content management system, often abbreviated as CMS, is software that helps users create, manage, and modify content on a website without the need for specialized technical knowledge. In simpler language, a content management system is a tool that helps you build a website without needing to write all the code from scratch (or even know how to code at all). Instead of building your own system for creating web pages, storing images, and other functions, the content management system handles all that basic infrastructure stuff for you so that you can focus on more forward-facing parts of your website.
  49. 49. 4) GRAPH-BASED A graph-type database stores entities as well as the relations amongst those entities. An entity is stored as a node, with relationships as edges. An edge gives a relationship between nodes. Every node and edge has a unique identifier. Compared to a relational database, where tables are loosely connected, a graph database is multi-relational in nature. Traversing relationships is fast, as they are already captured in the DB and there is no need to calculate them at query time.
  50. 50. Graph-based databases are mostly used for social networks, logistics and spatial data. Neo4J, Infinite Graph, OrientDB and FlockDB are some popular graph-based databases.
  51. 51. A graph database is a database that is based on graph theory. It consists of a set of objects, which can be a node or an edge. •Nodes represent entities or instances such as people, businesses, accounts, or any other item to be tracked. They are roughly the equivalent of a record, relation, or row in a relational database, or a document in a document-store database. •Edges, also termed graphs or relationships, are the lines that connect nodes to other nodes; representing the relationship between them. Meaningful patterns emerge when examining the connections and interconnections of nodes, properties and edges. The edges can either be directed or undirected.
  52. 52. •In an undirected graph, an edge connecting two nodes has a single meaning. •In a directed graph, the edges connecting two different nodes have different meanings, depending on their direction. •Edges are the key concept in graph databases, representing an abstraction that is not directly implemented in a relational model. •Properties are information associated with nodes. •For example, if Wikipedia were one of the nodes, it might be tied to properties such as website, reference material, or words that start with the letter w, depending on which aspects of Wikipedia are germane to a given database.
  54. 54. Now let's look at how NoSQL systems, with their inherently horizontal scale-out architectures, are ideal for tackling big data problems. We'll look at several strategies that NoSQL systems use to scale horizontally on commodity hardware. We'll see how NoSQL systems move queries to the data, not data to the queries. We'll see how they use hash rings to evenly distribute data on a cluster and use replication to scale reads. All these strategies allow NoSQL systems to distribute the workload evenly and eliminate performance bottlenecks. By moving queries to the data, using hash rings to distribute the load, using replication to scale your reads, and allowing the database to distribute queries evenly to your data nodes, you can manage your data and keep your systems running fast.
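The hash-ring strategy mentioned above can be sketched in a few lines: hash each node onto a ring of positions, then assign every key to the first node clockwise from the key's own hash. This is a minimal sketch under assumed node names, not any particular system's partitioner:

```python
import bisect
import hashlib

def _position(value):
    # Hash a string to a stable integer position on the ring.
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, node_names):
        self._ring = sorted((_position(n), n) for n in node_names)
        self._positions = [pos for pos, _ in self._ring]

    def node_for(self, key):
        # First node at or after the key's position, wrapping around.
        idx = bisect.bisect(self._positions, _position(key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("user:1001")   # every key maps to exactly one node
```

Production systems add virtual nodes (many positions per server) to even out the distribution, but the lookup logic is the same.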
  56. 56. So what exactly is a big data problem? A big data class problem is any business problem that's so large that it can't be easily managed using a single processor. Big data problems force you to move away from a single-processor environment toward the more complex world of distributed computing. Though great for solving big data problems, distributed computing environments come with their own set of challenges.
  59. 59. One of the core concepts in big data is linear scaling. When a system has linear scaling, you automatically get a proportional performance gain each time you add a new processor to your cluster, as shown in fig. Scaling independent transformations — Many big data problems are driven by discrete transformations on individual items without interaction among the items. These types of problems tend to be the easiest to solve: simply add a new node to your cluster. Image transformation is a good example of this. Scaling availability — Duplicate the writes onto multiple servers in data centers in distinct geographic regions. If one data center experiences an outage, the other data centers can supply the data. Scaling availability keeps replica copies in sync and automates the switchover if one system fails.
  61. 61. As we mentioned earlier, linear scalability is the ability to get a consistent amount of performance improvement as you add additional processors to your cluster. Expressivity is the ability to perform fine-grained queries on individual elements of your dataset. Understanding how well each NoSQL technology performs in terms of scalability and expressivity is necessary when you’re selecting a NoSQL solution. To select the right system, you’ll need to identify the scalability and expressivity requirements of your system and then make sure the system that you select meets both of these criteria. Scalability and expressivity can be difficult to quantify, and vendor claims may not match actual performance for a particular business problem.
  63. 63. There are many types of big data problems, each requiring a different combination of NoSQL systems. After you’ve categorized your data and determined its type, you’ll find there are different solutions. How you build your own big data classification system might be different from this example, but the process of differentiating data types should be similar. Read-mostly — Read-mostly data is the most common classification. It includes data that’s created once and rarely altered. This type of data is typically found in data warehouse applications but is also identified as a set of non-RDBMS items like images or video, event-logging data, published documents, or graph data. Event data includes things like retail sales events, hits on a website, system logging data, or real-time sensor data.
  64. 64. Full-text documents — This category of data includes any document that contains natural-language text like the English language. An important aspect of document stores is that you can query the entire contents of your office document in the same way you would query rows in your SQL system. This means that you can create new reports that combine traditional data in RDBMSs as well as the data within your office documents. For example, you could create a single query that extracted all the authors of titles of PowerPoint slides that contained the keywords NoSQL or big data.
  66. 66. There are three ways that resources can be shared between computer systems: shared RAM, shared disk, and shared-nothing. The figure compares these three distributed computing architectures. The left panel shows a shared-RAM architecture, where many CPUs access a single shared RAM over a high-speed bus; this system is ideal for large graph traversal. The middle panel shows a shared-disk system, where processors have independent RAM but share disk using a storage area network (SAN). The right panel shows the architecture used in big data solutions: cache-friendly, using low-cost commodity hardware, and shared-nothing.
  67. 67. Of the architectural data patterns we've discussed so far (row store, key-value store, graph store, document store, and Bigtable store), only two (key-value store and document store) lend themselves to cache-friendliness. Bigtable stores scale well on shared-nothing architectures because their row-column identifiers are similar to key-value stores. But row stores and graph stores aren't cache-friendly, since they don't allow a large BLOB to be referenced by a short key that can be stored in the cache.
  68. 68. For graph traversals to be fast, the entire graph should be in main memory. This is why graph stores work most efficiently when you have enough RAM to hold the graph. If you can’t keep your graph in RAM, graph stores will try to swap the data to disk, which will decrease graph query performance by a factor of 1,000.  The only way to combat the problem is to move to a shared-memory architecture, where multiple threads all access a large RAM structure without the graph data moving outside of the shared RAM.
  69. 69. 4. CHOOSING DISTRIBUTION MODELS: MASTER-SLAVE VERSUS PEER-TO-PEER From a distribution perspective, there are two main models: master-slave and peer-to-peer. Distribution models determine the responsibility for processing data when a request is made.
  70. 70. Master-slave versus peer-to-peer—the panel on the left illustrates a master-slave configuration where all incoming database requests (reads or writes) are sent to a single master node and redistributed from there. The master node is called the NameNode in Hadoop. This node keeps a database of all the other nodes in the cluster and the rules for distributing requests to each node. The panel on the right shows how the peer-to-peer model stores all the information about the cluster on each node in the cluster. If any node crashes, the other nodes can take over and processing can continue.
  71. 71. With a master-slave distribution model, the role of managing the cluster is done on a single master node. This node can run on specialized hardware such as RAID drives to lower the probability that it crashes. The cluster can also be configured with a standby master that’s continually updated from the master node. The challenge with this option is that it’s difficult to test the standby master without jeopardizing the health of the cluster. Failure of the standby master to take over from the master node is a real concern for high-availability operations.
  72. 72. The initial versions of Hadoop (frequently referred to as the 1.x versions) were designed to use a master-slave architecture with the NameNode of a cluster being responsible for managing the status of the cluster. NameNodes usually don’t deal with any MapReduce data themselves. Their job is to manage and distribute queries to the correct nodes on the cluster. Hadoop 2.x versions are designed to remove single points of failure from a Hadoop cluster.
  73. 73. 5. MAPREDUCE AND DISTRIBUTED FILESYSTEMS One of the strengths of a Hadoop system is that it’s designed to work directly with a filesystem that supports big data problems. As you’ll see, Hadoop makes big data processing easier by using a filesystem structure that’s different from a traditional system. The Hadoop Distributed File System (HDFS) provides many of the supporting features that MapReduce transforms need to be efficient and reliable. Unlike an ordinary filesystem, it’s customized for transparent, reliable, write-once, read-many operations. You can think of HDFS as a fault-tolerant, distributed, key-value store tuned to work with large files. HDFS is different: it uses a large (64 megabytes by default) block size to handle data. The figure shows how large HDFS blocks are compared to a typical filesystem block.
  74. 74. The size difference between a filesystem block size on a typical desktop or UNIX operating system (4 KB) and the logical block size within the Apache Hadoop Distributed File System (64 MB), which is optimized for big data transforms. The default block size defines a unit of work for the filesystem. The fewer blocks used in a transfer, the more efficient the transfer process. The downside of using large blocks is that if data doesn’t fill an entire physical block, the empty section of the block can’t be used.
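As a rough illustration of why block size matters, here is a sketch (plain arithmetic, not tied to any HDFS API) comparing how many blocks a 1 GB file occupies under each block size:

```python
import math

FILE_SIZE = 1 * 1024**3       # a 1 GB file, in bytes
DESKTOP_BLOCK = 4 * 1024      # typical desktop/UNIX block size: 4 KB
HDFS_BLOCK = 64 * 1024**2     # HDFS default logical block size: 64 MB

# Each block is a unit of work; fewer blocks means fewer transfers to manage.
desktop_blocks = math.ceil(FILE_SIZE / DESKTOP_BLOCK)
hdfs_blocks = math.ceil(FILE_SIZE / HDFS_BLOCK)

print(desktop_blocks)  # 262144 blocks with 4 KB blocks
print(hdfs_blocks)     # 16 blocks with 64 MB blocks
```

The same file needs over sixteen thousand times fewer block transfers under HDFS, which is why large blocks suit big data transforms despite wasting the tail of a partially filled block.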
  75. 75. 6. HOW MAPREDUCE ALLOWS EFFICIENT TRANSFORMATION OF BIG DATA PROBLEMS: MapReduce is a core component in many big data solutions. Figure provides a detailed look at the internal components of a MapReduce job.
  76. 76.  The basics of how the map and reduce functions work together to gain linear scalability over big data transforms. The map operation takes input data and creates a uniform set of key-value pairs.  In the shuffle phase, which is done automatically by the MapReduce framework, key-value pairs are automatically distributed to the correct reduce node based on the value of the key. The reduce operation takes the key-value pairs and returns consolidated values for each key. It’s the job of the MapReduce framework to get the right keys to the right reduce nodes.
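The map, shuffle, and reduce phases described above can be sketched as a single-process word-count simulation (an illustration only, not the Hadoop API — in a real cluster the shuffle routes key-value pairs across nodes):

```python
from collections import defaultdict

def map_phase(lines):
    """Map: turn input data into a uniform set of (word, 1) key-value pairs."""
    return [(word, 1) for line in lines for word in line.split()]

def shuffle_phase(pairs):
    """Shuffle: group all values by key, as the framework does automatically."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: consolidate the values for each key into a single result."""
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle_phase(map_phase(["big data", "big table"])))
print(counts)  # {'big': 2, 'data': 1, 'table': 1}
```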
  77. 77. 7. USING REPLICATION TO SCALE READS How you can replicate data to speed read performance in NoSQL systems. All incoming client requests enter from the left. All reads can be directed to any node, either a primary read/write node or a replica node. All write transactions can be sent to a central read/write node that will update the data and then automatically send the updates to replica nodes. The time between the write to the primary and the time the update arrives on the replica nodes determines how long it takes for reads to return consistent results.
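A hypothetical sketch of this read-scaling pattern: all writes go to a single primary node, which pushes each update to its replicas, and reads may then be served by any replica. (For simplicity this sketch replicates synchronously; in a real system the propagation delay is the window during which replicas can return stale data.)

```python
class Replica:
    """A read-only copy of the data; any read can be directed here."""
    def __init__(self):
        self.data = {}

    def read(self, key):
        return self.data.get(key)

class Primary:
    """The central read/write node; it fans every write out to the replicas."""
    def __init__(self, replicas):
        self.data = {}
        self.replicas = replicas

    def write(self, key, value):
        self.data[key] = value
        for replica in self.replicas:   # push the update to every replica
            replica.data[key] = value

replicas = [Replica(), Replica()]
primary = Primary(replicas)
primary.write("user:1", "Anuja")

# Reads can now be load-balanced across any replica.
print(replicas[0].read("user:1"))  # Anuja
print(replicas[1].read("user:1"))  # Anuja
```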
  79. 79. Since 1970, RDBMS has been the solution for data storage and maintenance related problems. After the advent of big data, companies realized the benefit of processing big data and started opting for solutions like Hadoop. Hadoop uses a distributed file system for storing big data, and MapReduce to process it. Hadoop excels in storing and processing huge data of various formats, from arbitrary and semi-structured to unstructured. Limitations of Hadoop: Hadoop can perform only batch processing, and data will be accessed only in a sequential manner. That means one has to search the entire dataset even for the simplest of jobs. A huge dataset, when processed, results in another huge dataset, which should also be processed sequentially. At this point, a new solution is needed to access any point of data in a single unit of time (random access).
  80. 80. WHAT IS HBASE? HBase is an open-source, sorted map datastore built on Hadoop. It is column-oriented and horizontally scalable. It is based on Google's Bigtable. It has a set of tables which keep data in key-value format. HBase is well suited for sparse data sets, which are very common in big data use cases. HBase provides APIs enabling development in practically any programming language. It is a part of the Hadoop ecosystem that provides random real-time read/write access to data in the Hadoop File System.
  81. 81. WHY HBASE? •RDBMSs get exponentially slow as the data becomes large. •They expect data to be highly structured, i.e. to fit in a well-defined schema. •Any change in schema might require a downtime. •For sparse datasets, there is too much overhead in maintaining NULL values.
  82. 82. FEATURES OF HBASE: •Horizontally scalable: You can add any number of columns anytime. •Automatic Failover: Automatic failover is a resource that allows a system administrator to automatically switch data handling to a standby system in the event of system compromise. •Integration with the Map/Reduce framework: All the commands and Java code internally implement Map/Reduce to do the task, and it is built over the Hadoop Distributed File System. •It is a sparse, distributed, persistent, multidimensional sorted map, which is indexed by row-key, column-key and timestamp. •Often referred to as a key-value store or column family-oriented database, or as storing versioned maps of maps. •Fundamentally, it's a platform for storing and retrieving data with random access. •It doesn't care about datatypes (storing an integer in one row and a string in another for the same column). •It doesn't enforce relationships within your data. •It is designed to run on a cluster of computers, built using commodity hardware.
  83. 83. It is a part of the Hadoop ecosystem that provides random real-time read/write access to data in the Hadoop File System. One can store the data in HDFS either directly or through HBase. Data consumer reads/accesses the data in HDFS randomly using HBase. HBase sits on top of the Hadoop File System and provides read and write access.
  84. 84. WHERE TO USE HBASE? •Apache HBase is used to have random, real-time read/write access to Big Data. •It hosts very large tables on top of clusters of commodity hardware. •Apache HBase is a non-relational database modelled after Google's Bigtable. •Just as Bigtable acts upon the Google File System, Apache HBase works on top of Hadoop and HDFS.
  85. 85. HBASE V/S RDBMS:
  86. 86. HBASE V/S HDFS:
  88. 88. HBase is a column-oriented NoSQL database. Although it looks similar to a relational database, which contains rows and columns, it is not a relational database. Relational databases are row-oriented while HBase is column-oriented. So, let us first understand the difference between column-oriented and row-oriented databases:
  89. 89. 1. ROW-ORIENTED NOSQL: •Row-oriented databases store table records in a sequence of rows. •To better understand it, let us take an example and consider the table below. •If this table is stored in a row-oriented database, it will store the records as shown below: 1, Paul Walker, US, 231, Gallardo, 2, Vin Diesel, Brazil, 520, Mustang In row-oriented databases, data is stored on the basis of rows or tuples, as you can see above.
  90. 90. COLUMN-ORIENTED NOSQL: Column-oriented databases store table records in a sequence of columns, i.e. the entries in a column are stored in contiguous locations on disk. In a column-oriented database, all the column values are stored together: the first column's values are stored together, then the second column's values, and the data in the other columns is stored in a similar manner. The column-oriented databases store this data as: 1, 2, Paul Walker, Vin Diesel, US, Brazil, 231, 520, Gallardo, Mustang
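The two layouts above can be reproduced with a short sketch, using the same example table; the only difference is whether we walk the table row by row or column by column:

```python
# The example table: (id, name, country, number, car) per record.
rows = [
    ("1", "Paul Walker", "US", "231", "Gallardo"),
    ("2", "Vin Diesel", "Brazil", "520", "Mustang"),
]

# Row-oriented layout: each record's fields are stored contiguously.
row_layout = [value for row in rows for value in row]

# Column-oriented layout: each column's values are stored contiguously
# (zip(*rows) transposes the table from rows into columns).
column_layout = [value for column in zip(*rows) for value in column]

print(row_layout)
print(column_layout)
```

Running this prints exactly the two serializations shown on the slides, which makes it easy to see why scanning one column touches far less data in the column layout.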
  91. 91. When the amount of data is very huge, like in terms of petabytes or exabytes, we use the column-oriented approach, because the data of a single column is stored together and can be accessed faster. The row-oriented approach comparatively handles a smaller number of rows and columns efficiently, as a row-oriented database stores data in a structured format. When we need to process and analyze a large set of semi-structured or unstructured data, we use the column-oriented approach, such as in applications dealing with Online Analytical Processing: data mining, data warehousing, applications involving analytics, etc. Whereas Online Transactional Processing, such as in banking and finance domains which handle structured data and require transactional properties (ACID properties), uses the row-oriented approach.
  93. 93. •Tables: Data is stored in a table format in HBase. But here tables are in column-oriented format. •Row Key: Row keys are used to search records, which makes searches fast. •Column Families: Various columns are combined in a column family. These column families are stored together, so data belonging to the same column family can be accessed together in a single seek. •Column Qualifiers: Each column's name is known as its column qualifier. •Cell: Data is stored in cells. The data is dumped into cells, which are specifically identified by row-key and column qualifiers. •Timestamp: Timestamp is a combination of date and time. Whenever data is stored, it is stored with its timestamp. This makes it easy to retrieve a particular version of data.
  94. 94. In a more simple and understandable way, we can say HBase consists of: •A set of tables •Each table with column families and rows •The row key acts as a primary key in HBase. •Any access to HBase tables uses this primary key. •Each column qualifier present in HBase denotes an attribute corresponding to the object which resides in the cell.
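One way to picture this data model is as a nested, versioned map, where a cell is addressed by row key, "family:qualifier" column, and timestamp. The sketch below is a hypothetical in-memory illustration, not the HBase client API:

```python
import time
from collections import defaultdict

# row key -> "family:qualifier" -> timestamp -> value
table = defaultdict(lambda: defaultdict(dict))

def put(row_key, column, value, ts=None):
    """Store a value in a cell; every write is kept as a timestamped version."""
    table[row_key][column][ts if ts is not None else time.time()] = value

def get(row_key, column):
    """Return the most recent version of a cell (highest timestamp wins)."""
    versions = table[row_key][column]
    return versions[max(versions)] if versions else None

put("row1", "personal:name", "Paul Walker", ts=1)
put("row1", "personal:name", "Vin Diesel", ts=2)  # a newer version of the same cell
print(get("row1", "personal:name"))  # Vin Diesel
```

Because older versions are kept under their timestamps, retrieving a particular version of the data is just a lookup with a different timestamp key.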
  97. 97. HBase architecture consists mainly of five components: •1) HMaster •2) HRegionserver •3) HRegions •4) Zookeeper •5) HDFS
  98. 98. 1) HMASTER: HMaster in HBase is the implementation of a Master server in HBase architecture. It acts as a monitoring agent to monitor all Region Server instances present in the cluster and acts as an interface for all the metadata changes. In a distributed cluster environment, Master runs on NameNode. Master runs several background threads.
  99. 99. The following are important roles performed by HMaster in HBase. 1. HMaster plays a vital role in terms of performance and maintaining nodes in the cluster. 2. HMaster provides admin functions and distributes services to different region servers. 3. HMaster assigns regions to region servers. 4. HMaster has features like controlling load balancing and failover to handle the load over the nodes present in the cluster. 5. When a client wants to change any schema or perform metadata operations, HMaster takes responsibility for these operations.
  100. 100. Some of the methods exposed by the HMaster interface are primarily metadata-oriented methods: •Table (createTable, removeTable, enable, disable) •ColumnFamily (add Column, modify Column) •Region (move, assign) The client communicates in a bi-directional way with both HMaster and ZooKeeper. For read and write operations, it directly contacts HRegion servers. HMaster assigns regions to region servers and, in turn, checks the status of region servers. In the entire architecture, we have multiple region servers. The HLog present in the region servers stores all the log files.
  101. 101. 2) HBASE REGION SERVERS When an HBase Region Server receives write and read requests from the client, it assigns the request to a specific region, where the actual column family resides. The client can directly contact HRegion servers; there is no need for mandatory HMaster permission for the client to communicate with HRegion servers. The client requires HMaster's help when operations related to metadata and schema changes are required. HRegionServer is the Region Server implementation. It is responsible for serving and managing regions or data present in a distributed cluster. The region servers run on Data Nodes present in the Hadoop cluster. HMaster can get into contact with multiple HRegion servers and performs the following functions: •Hosting and managing regions •Splitting regions automatically •Handling read and write requests
  102. 102. Components of Region Server:
  103. 103. A Region Server maintains various regions running on top of HDFS. Components of a Region Server are: •WAL: As you can conclude from the above image, the Write Ahead Log (WAL) is a file attached to every Region Server inside the distributed environment. The WAL stores new data that hasn't yet been persisted or committed to permanent storage. It is used in case of failure to recover the data. •Block Cache: From the above image, it is clearly visible that the Block Cache resides at the top of the Region Server. It stores frequently read data in memory. If the data in the Block Cache is least recently used, that data is removed from the Block Cache. •MemStore: It is the write cache. It stores all the incoming data before committing it to the disk or permanent memory. There is one MemStore for each column family, so there are multiple MemStores for a region, because each region contains multiple column families. The data is sorted in lexicographical order before committing it to the disk. •HFile: From the above figure you can see that HFiles are stored on HDFS. Thus they store the actual cells on the disk. The MemStore commits the data to an HFile when the size of the MemStore exceeds the threshold.
  104. 104. 3) HBASE REGIONS HRegions are the basic building elements of an HBase cluster; they consist of the distribution of tables and are comprised of column families. A region contains multiple stores, one for each column family, and consists of mainly two components: MemStore and HFile. So, concluding in a simpler way: •A table can be divided into a number of regions. A region is a sorted range of rows storing data between a start key and an end key. •A region has a default size of 256 MB, which can be configured according to need. •A group of regions is served to the clients by a Region Server. •A Region Server can serve approximately 1000 regions to the client.
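Because each region covers a sorted range of row keys, locating the region responsible for a given key is essentially a sorted-range lookup. A minimal sketch, with hypothetical start keys and server names, using Python's bisect:

```python
import bisect

# Each region covers the row-key range [start_key, next_start_key).
region_start_keys = ["a", "g", "n", "t"]        # sorted region start keys
region_servers = ["rs1", "rs2", "rs3", "rs4"]   # the server hosting each region

def locate(row_key):
    """Find the region (and its server) whose key range contains row_key."""
    index = bisect.bisect_right(region_start_keys, row_key) - 1
    return region_servers[index]

print(locate("paul"))  # rs3: "paul" falls in the ["n", "t") range
print(locate("big"))   # rs1: "big" falls in the ["a", "g") range
```

This is also, in spirit, what the .META lookup described later does: map a row key to the Region Server that holds its range.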
  105. 105. 4) ZOOKEEPER: HBase ZooKeeper is a centralized monitoring server which maintains configuration information and provides distributed synchronization. Distributed synchronization means accessing the distributed applications running across the cluster, with the responsibility of providing coordination services between nodes. If the client wants to communicate with regions, the client has to approach ZooKeeper first. It is an open source project, and it provides many important services.
  106. 106. •Zookeeper acts like a coordinator inside the HBase distributed environment. It helps in maintaining server state inside the cluster by communicating through sessions. •Every Region Server, along with the HMaster Server, sends a continuous heartbeat at regular intervals to Zookeeper, and it checks which server is alive and available, as mentioned in the above image. It also provides server failure notifications so that recovery measures can be executed. •Referring to the above image you can see there is an inactive server, which acts as a backup for the active server. If the active server fails, it comes to the rescue.
  107. 107. •The active HMaster sends heartbeats to the Zookeeper while the inactive HMaster listens for the notification sent by the active HMaster. If the active HMaster fails to send a heartbeat, the session is deleted and the inactive HMaster becomes active. •If a Region Server fails to send a heartbeat, the session is expired and all listeners are notified about it. Then HMaster performs suitable recovery actions, which we will discuss later in this blog. •Zookeeper also maintains the .META Server's path, which helps any client in searching for any region. The client first has to check with the .META Server in which Region Server a region belongs, and it gets the path of that Region Server.
  108. 108. META TABLE:
  109. 109. •The META table is a special HBase catalog table. It maintains a list of all the Region Servers in the HBase storage system, as you can see in the above image. •Looking at the figure you can see that the .META file maintains the table in the form of keys and values. The key represents the start key of the region and its id, whereas the value contains the path of the Region Server.
  110. 110. Services provided by ZooKeeper •Maintains configuration information •Provides distributed synchronization •Establishes client communication with region servers •Provides ephemeral nodes, which represent different region servers •Master servers use ephemeral nodes to discover available servers in the cluster •Tracks server failures and network partitions
  111. 111. 5) HDFS HDFS is the Hadoop Distributed File System; as the name implies, it provides a distributed environment for storage, and it is a file system designed to run on commodity hardware. It stores each file in multiple blocks and, to maintain fault tolerance, the blocks are replicated across the Hadoop cluster. HDFS provides a high degree of fault tolerance and runs on cheap commodity hardware. By adding nodes to the cluster and performing processing and storage using cheap commodity hardware, it gives the client better results as compared to the existing setup. HDFS works with the HBase components and stores a large amount of data in a distributed manner.
  113. 113. HBase is a column-oriented database and data is stored in tables. The tables are sorted by RowId. As shown above, HBase has a RowId, which is the collection of several column families that are present in the table. The column families that are present in the schema are key-value pairs. If we observe in detail, each column family has multiple columns. The column values are stored on disk. Each cell of the table has its own metadata, like timestamp and other information. Coming to HBase, the following are the key terms representing the table schema: •Table: Collection of rows present. •Row: Collection of column families. •Column Family: Collection of columns. •Column: Collection of key-value pairs. •Namespace: Logical grouping of tables.
  116. 116. 3 mechanisms are followed to handle the requests in Hbase Architecture: 1. Commence the Search in HBase Architecture 2. Write Mechanism in HBase Architecture 3. Read Mechanism in HBase Architecture
  117. 117. 1. COMMENCE THE SEARCH IN HBASE ARCHITECTURE The steps to initialize the search are: 1.The user retrieves the Meta table from ZooKeeper and then requests for the location of the relevant Region Server. 2.Then the user will request the exact data from the Region Server with the help of RowKey.
  119. 119. The write mechanism goes through the following process sequentially (refer to the above image): Step 1: Whenever the client has a write request, the client writes the data to the WAL (Write Ahead Log). •The edits are then appended at the end of the WAL file. •This WAL file is maintained in every Region Server and helps the Region Server recover data which is not committed to the disk. Step 2: Once data is written to the WAL, it is copied to the MemStore. Step 3: Once the data is placed in the MemStore, the client receives the acknowledgment. Step 4: When the MemStore reaches the threshold, it dumps or commits the data into an HFile.
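The four steps above can be sketched as a toy write path (an illustration of the flow, not real HBase internals; the flush threshold here is a made-up count rather than a byte size):

```python
class RegionServer:
    def __init__(self, flush_threshold=3):
        self.wal = []          # Write Ahead Log: every edit lands here first
        self.memstore = {}     # write cache, sorted when flushed
        self.hfiles = []       # immutable files "on disk"
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.wal.append((key, value))          # Step 1: append the edit to the WAL
        self.memstore[key] = value             # Step 2: copy the data to the MemStore
        ack = True                             # Step 3: acknowledge the client
        if len(self.memstore) >= self.flush_threshold:
            self.flush()                       # Step 4: dump the MemStore to an HFile
        return ack

    def flush(self):
        # Data is sorted lexicographically before committing it to disk.
        self.hfiles.append(dict(sorted(self.memstore.items())))
        self.memstore = {}

rs = RegionServer()
for k in ["c", "a", "b"]:
    rs.write(k, k.upper())
print(len(rs.hfiles), rs.hfiles[0])  # 1 {'a': 'A', 'b': 'B', 'c': 'C'}
```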
  120. 120. HBase Write Mechanism - MemStore •The MemStore always updates the data stored in it in lexicographical order (sequentially, in a dictionary manner) as sorted key-values. There is one MemStore for each column family, and thus the updates are stored in a sorted manner for each column family. •When the MemStore reaches the threshold, it dumps all the data into a new HFile in a sorted manner. This HFile is stored in HDFS. HBase contains multiple HFiles for each column family. •Over time, the number of HFiles grows as the MemStore dumps the data. •The MemStore also saves the last written sequence number, so the Master Server and MemStore both know what is committed so far and where to start from. When a region starts up, the last sequence number is read, and from that number, new edits start.
  121. 121. HBase Architecture: HBase Write Mechanism - HFile •The writes are placed sequentially on the disk, so the movement of the disk's read-write head is very small. This makes the write and search mechanism very fast. •The HFile indexes are loaded in memory whenever an HFile is opened. This helps in finding a record in a single seek. •The trailer is a pointer which points to the HFile's meta block; it is written at the end of the committed file. It contains information about the timestamps and the bloom filters. •A Bloom Filter helps in searching key-value pairs: it skips the files which do not contain the required rowkey. The timestamp also helps in searching for a version of the file; it helps in skipping the data.
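A Bloom filter answers "definitely not present" or "maybe present" without reading the file, which is what lets the read path skip whole HFiles. A minimal sketch with two hash positions (illustrative only; real filters size the bit array and choose several independent hashes based on the expected key count):

```python
class BloomFilter:
    def __init__(self, size=64):
        self.size = size
        self.bits = [False] * size

    def _positions(self, key):
        # Two simple hash positions; real filters use several independent hashes.
        return [hash(key) % self.size, hash(key + "salt") % self.size]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False means the key is definitely absent (the file can be skipped);
        # True means "maybe present" (the file must actually be read).
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("row42")
print(bf.might_contain("row42"))  # True
```

Note the asymmetry: a Bloom filter can report false positives ("maybe present" for an absent key) but never false negatives, which is why skipping a file on a False answer is always safe.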
  122. 122. 3. READ MECHANISM IN HBASE ARCHITECTURE To read any data, the user will first have to access the relevant Region Server. Once the Region Server is known, the other process includes: 1.The first scan is made at the read cache, which is the Block cache. 2.The next scan location is MemStore, which is the write cache. 3.If the data is not found in block cache or MemStore, the scanner will retrieve the data from HFile.
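The three scan locations can be sketched as a lookup chain: the Block Cache first, then the MemStore, then the HFiles on disk, caching whatever is fetched from disk for future reads (a simplified illustration, not the real scanner):

```python
def read(key, block_cache, memstore, hfiles):
    """Scan the read cache, then the write cache, then the files on disk."""
    if key in block_cache:                    # 1. Block Cache (read cache)
        return block_cache[key]
    if key in memstore:                       # 2. MemStore (write cache)
        return memstore[key]
    for hfile in hfiles:                      # 3. HFiles on disk
        if key in hfile:
            block_cache[key] = hfile[key]     # cache the value for future reads
            return hfile[key]
    return None                               # key not present anywhere

cache, mem = {}, {"b": "fresh"}
files = [{"a": "old"}]
print(read("b", cache, mem, files))  # fresh (found in the MemStore)
print(read("a", cache, mem, files))  # old (fetched from an HFile, now cached)
```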