3. Introduction to BigData
What is Hadoop?
What is Hadoop used for, and what is it not?
Top-level Hadoop Projects
Differences between RDBMS and HBase.
Facebook server model.
4. BigData - The Data Age
Big data is a collection of datasets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
The challenges include capture, curation, storage,
search, sharing, transfer, analysis, and visualization.
The data generated by different companies has inherent value and can be used for a range of analytics and prediction use cases.
5. A new approach
Moore's Law has held for the past 40 years:
1) Processing power doubles roughly every two years.
2) Processing speed is no longer the problem.
Getting the data to the processors becomes the bottleneck: transferring 100 GB of data takes about 22 minutes at a disk transfer rate of 75 MB/sec.
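To make the bottleneck concrete, here is a quick back-of-envelope check in Python (using decimal units, 1 GB = 1000 MB):

    # Time to stream 100 GB off a single disk at 75 MB/sec
    size_mb = 100 * 1000          # 100 GB expressed in MB
    rate_mb_per_sec = 75          # disk transfer rate
    seconds = size_mb / rate_mb_per_sec
    print(round(seconds / 60), "minutes")   # ~22 minutes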
So the new approach is to move processing to the data side in a distributed way, while satisfying requirements such as data recoverability, component recovery, consistency, reliability, and scalability.
The answer is Google's File System (GFS) and MapReduce, which in Hadoop are called HDFS and MapReduce.
6. What Hadoop is used for
Hadoop is recommended to coexist with your RDBMS as a data warehouse; it is not a replacement for any RDBMS.
Processing TBs and PBs of data can take hours with traditional methods; with Hadoop and its ecosystem it takes only minutes, thanks to the power of distribution.
Many related tools integrate with Hadoop:
Data analysis
Data visualization
Database integration
Workflow management
Cluster management
7. Hadoop and EcoSystem
➲ Distributed file system and parallel processing for large-scale data operations using HDFS and MapReduce (a MapReduce word-count sketch follows below).
➲ Plus the infrastructure needed to make them work, including:
Filesystem utilities
Job scheduling and monitoring
Web UI
Many other projects run around the core components of Hadoop: Pig, Hive, HBase, Flume, Oozie, Sqoop, etc., collectively called the Ecosystem.
A set of machines running HDFS and MapReduce is known as a Hadoop cluster.
Individual machines are known as nodes. A cluster can have as few as one node or as many as several thousand; it is horizontally scalable.
More nodes = better performance!
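As a minimal sketch of the MapReduce model itself, here is a word count written for Hadoop Streaming in Python (script names and paths are illustrative): the mapper emits (word, 1) pairs and the reducer sums them per word.

    # mapper.py - emit (word, 1) for every word read from stdin
    import sys
    for line in sys.stdin:
        for word in line.split():
            print(word + "\t1")

    # reducer.py - input arrives sorted by key, so sum runs of equal words
    import sys
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(current + "\t" + str(total))
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(current + "\t" + str(total))

Hadoop Streaming runs these across the cluster, e.g. hadoop jar hadoop-streaming.jar -input /in -output /out -mapper mapper.py -reducer reducer.py (the jar location varies by installation).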
9. Hadoop Components: HDFS
HDFS, the Hadoop Distributed File System, is responsible for storing data on the cluster; it sits on top of a native file system such as ext3, ext4, or XFS.
HDFS is a filesystem designed for storing very large files with streaming data access (write once, read many times), running on clusters of commodity hardware.
Data is split into blocks and distributed across multiple nodes in the cluster.
Each block is typically 64 MB or 128 MB in size.
Each block is replicated multiple times; the default is to replicate each block three times, with replicas stored on different nodes.
This ensures both reliability and availability.
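Under these defaults the layout of a large file is easy to work out; a quick illustration in Python:

    # Block layout of a 1 TB file with 128 MB blocks and 3x replication
    file_size_mb = 1024 * 1024                  # 1 TB in MB
    block_size_mb = 128
    replication = 3
    blocks = -(-file_size_mb // block_size_mb)  # ceiling division -> 8192 blocks
    print(blocks, "blocks,", blocks * replication, "replicas stored cluster-wide")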
14. HDFS Access
• WebHDFS – REST API
• Fuse DFS – mounting HDFS as a normal drive
• Direct access – direct HDFS access through the native client
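A minimal sketch of the WebHDFS route, using only the Python standard library (the NameNode host and file paths are assumptions; the WebHDFS port is typically 9870 on Hadoop 3 and 50070 on Hadoop 2):

    import json
    import urllib.request

    NAMENODE = "http://namenode.example.com:9870"   # hypothetical host

    # LISTSTATUS: list a directory over REST
    with urllib.request.urlopen(NAMENODE + "/webhdfs/v1/user/demo?op=LISTSTATUS") as resp:
        listing = json.load(resp)
    for status in listing["FileStatuses"]["FileStatus"]:
        print(status["pathSuffix"], status["length"])

    # OPEN: read a file; the NameNode redirects the client to a DataNode
    with urllib.request.urlopen(NAMENODE + "/webhdfs/v1/user/demo/part-00000?op=OPEN") as resp:
        print(resp.read()[:80])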
15. Hive and Pig
Hive provides a powerful SQL-like language (though not fully SQL-compliant) that can be used to perform joins over datasets in HDFS.
It is used for large batch programs; at the back end, Hive runs MapReduce jobs.
Pig is a powerful scripting layer built on top of MapReduce jobs; its language is called Pig Latin.
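A hedged sketch of a Hive join from Python, assuming the third-party pyhive package and a reachable HiveServer2 endpoint (host, port, and table names are illustrative):

    from pyhive import hive

    conn = hive.Connection(host="hiveserver.example.com", port=10000)
    cursor = conn.cursor()
    # Hive compiles this join down to MapReduce jobs behind the scenes
    cursor.execute("""
        SELECT u.name, COUNT(*) AS orders
        FROM users u JOIN orders o ON u.id = o.user_id
        GROUP BY u.name
    """)
    for row in cursor.fetchall():
        print(row)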
16. HBASE
The most powerful NoSQL database on earth.
Supports an active-active master setup and is based on Google's BigTable.
Supports columns and column families; its data model can hold many billions of rows and many millions of columns.
An architectural masterpiece as far as scalability is concerned.
A NoSQL database that supports transactions and very fast reads/writes, typically millions of queries per second.
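A small sketch of the column-family data model, assuming the third-party happybase client and an HBase Thrift gateway (host, table, and row names are illustrative):

    import happybase

    conn = happybase.Connection("hbase-thrift.example.com")  # hypothetical Thrift server
    table = conn.table("users")

    # Cells are addressed as columnfamily:qualifier
    table.put(b"row-001", {b"info:name": b"alice", b"stats:logins": b"42"})

    row = table.row(b"row-001")
    print(row[b"info:name"])   # b'alice'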
18. ZooKeeper, Mahout
ZooKeeper is a distributed coordinator; it can also be used as an independent package for managing any set of distributed servers.
Mahout is a machine learning tool useful for various data science techniques, e.g. data clustering, classification, and recommender systems using supervised and unsupervised learning.
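A brief sketch of ZooKeeper used as a standalone coordinator, assuming the third-party kazoo client (the ensemble address and znode names are illustrative):

    from kazoo.client import KazooClient

    zk = KazooClient(hosts="zk1.example.com:2181")   # hypothetical ensemble
    zk.start()

    # znodes form a small consistent tree; ephemeral nodes vanish when
    # the session that created them dies, giving liveness tracking for free
    zk.ensure_path("/services")
    zk.create("/services/worker-1", b"10.0.0.5:8080", ephemeral=True)
    print(zk.get_children("/services"))   # ['worker-1']

    zk.stop()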
19. Flume
Flume is a real-time data collection mechanism that writes into a data mart.
Flume can move large volumes of streaming data into HDFS, where it can be used for further analysis.
Apart from this, real-time analysis of web-log data is also possible with Flume: the logs of a group of web servers can be written to HDFS using Flume.
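A sketch of a Flume agent definition in Flume's properties format, tailing a web-server log into HDFS (agent, host, and path names are illustrative):

    # weblog-agent.properties
    agent.sources  = weblog
    agent.channels = mem
    agent.sinks    = hdfs-out

    agent.sources.weblog.type     = exec
    agent.sources.weblog.command  = tail -F /var/log/httpd/access_log
    agent.sources.weblog.channels = mem

    agent.channels.mem.type = memory

    agent.sinks.hdfs-out.type      = hdfs
    agent.sinks.hdfs-out.channel   = mem
    agent.sinks.hdfs-out.hdfs.path = hdfs://namenode/flume/weblogs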
20. Sqoop and Oozie
Sqoop is a data import and export mechanism between an RDBMS and HDFS or Hive, in either direction.
There are many free connectors, prepared by various vendors for different RDBMSs, which have made data transfer very fast, since Sqoop supports parallel transfers.
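A hedged sketch of a parallel Sqoop import (connection string, credentials, and table names are illustrative):

    sqoop import \
      --connect jdbc:mysql://dbhost.example.com/sales \
      --username etl --password-file /user/etl/.pw \
      --table orders \
      --target-dir /data/orders \
      -m 4    # four parallel map tasks perform the transfer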
Oozie is a workflow mechanism for executing a large sequence of MapReduce, Hive, Pig, and HBase jobs, as well as other Java programs. Oozie also has an email action that can send notifications from a workflow.
21. RDBMS vs HBASE
A typical RDBMS scaling story runs this way:
Initial public launch.
Service becomes popular; too many reads hitting the database.
Service continues to grow in popularity; too many writes hitting the database.
New features increase query complexity; now we have too many joins.
Rising popularity swamps the server; things are too slow.
Some queries are still too slow.
Reads are OK, but writes are getting slower and slower.
22. With HBase
Enter HBase, which has the following characteristics:
No real indexes
Automatic partitioning/sharding
Linear, automatic scaling with new nodes
Commodity hardware
Fault tolerance
Batch processing