This deck is oriented towards Hadoop/Hive installation experience and ecosystem concepts. The content of these slides is derived from a book under publication, Fundamentals of Big Data.
Fundamentals of Big Data with Hadoop and Hive
1. A real-time experience
Fundamentals of Big Data
Sharjeel Imtiaz | PhD Data Science – last stage | University of East London, UK
2. BIG DATA CHARACTERISTICS
• In 2001, Doug Laney detailed that Big Data is characterized by three traits:
• Volume (consisting of enormous quantities of data)
• Velocity (created in real-time)
• Variety (being structured, semi-structured and unstructured).
3. Big Data Definition
Exhaustive (an entire system is captured, rather than being sampled) (Mayer-Schonberger and Cukier, 2013).
Fine-grained (in resolution) and uniquely indexical (in identification) (Dodge and Kitchin, 2005).
Relationality (containing common fields that enable the conjoining of different datasets) (Boyd and Crawford, 2012).
Extensionality (can add/change new fields easily) and scalability (can expand in size rapidly) (Marz
and Warren, 2012).
Veracity (the data can be messy, noisy and contain uncertainty and error) (Marr, 2014).
Value (many insights can be extracted and the data repurposed) (Marr, 2014).
Variability (data whose meaning can be constantly shifting in relation to the context in which
they are generated) (McNulty, 2014).
4. How to process Big Data
• Ecosystem components are interrelated.
• Data is processed with MapReduce and stored in HDFS.
• HBase and Hive provide a SQL-like interface to easily manage data stored in HDFS in file formats such as plain text and comma-separated values (CSV).
• Sqoop is used for bulk transfer of data into the HDFS file system and HBase (big tables).
5. How to process Big Data
• Flume is typically used to stream data into HDFS.
• Solr indexes the data from HDFS, Hive, and HBase for fast retrieval.
• Solr stores the data in a disk file system, but with indexing.
6. HADOOP AND HADOOP COMPONENTS
• Apache Hadoop is a software framework that enables distributed processing of petabytes of data on large clusters.
• Hadoop Distributed File System (HDFS): stores data at scale and handles failures by replicating blocks to various places in the cluster.
• Frameworks for parallel processing of
data: MapReduce, Hive, Mahout, Spark
7. HADOOP AND HADOOP COMPONENTS
• YARN: manages the resources (CPU and memory) of the cluster nodes.
• Files in HDFS are divided into large blocks, typically 128 MB in size, that are distributed across the cluster.
• File1 fits in a single block (A1) because its size (100 MB) is less than the block size used in this example (150 MB). Block1 (A1) is written to the first node (worker Node 1), and Node 1 then replicates it to worker Node 2.
8. HADOOP AND HADOOP COMPONENTS
• File2 is divided into two blocks because its size (250 MB) is greater than the block size; block2 (B) and block3 (C) are each replicated across the nodes.
• The blocks' metadata (file name, blocks, location, date created, and size) is stored in the NameNode.
• Clusters run Hadoop's open-source distributed processing software on two kinds of nodes:
• Master
• Slave
9. HADOOP CLUSTER
• The JobTracker schedules map tasks close to the data being processed, ideally on the same DataNode that holds the required block.
• Both HDFS and MapReduce contribute master and slave components: the NameNode and DataNode come from HDFS, while the JobTracker and TaskTracker come from the MapReduce paradigm.
10. Big data Process
Collect the data first.
Prepare and clean the data.
Explore the data with plots, correlation, and regression.
Apply a model such as regression or k-means.
Visualize the results in a dashboard and, finally, deliver the product (see the sketch below).
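A minimal sketch of these steps in plain R, using the built-in iris data set as a stand-in for collected data (the column and cluster choices are illustrative):
# Prepare and clean: drop incomplete rows
clean <- na.omit(iris)
# Explore: plots, correlation, and a simple regression
plot(clean$Petal.Length, clean$Petal.Width)
cor(clean$Petal.Length, clean$Petal.Width)
summary(lm(Petal.Width ~ Petal.Length, data = clean))
# Model: k-means on the numeric columns
fit <- kmeans(clean[, 1:4], centers = 3)
# Visualize the clusters (a dashboard would present the same output)
plot(clean$Petal.Length, clean$Petal.Width, col = fit$cluster)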
11. How to process Big Data
• Creating Project in R
• Start RStudio: Under the File menu, click
on New Project. Choose New Directory,
then New Project.
• Enter a name for this new folder (or
“directory”), and choose a convenient location
for it. This will be your working directory for
the rest of the day (e.g., ~/data-carpentry).
• Click on Create Project.
• (Optional) In RStudio's Preferences, set "Save workspace to .RData on exit" to "Never".
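The same setup can also be scripted instead of using the menus; a small sketch, reusing the example folder name above:
# Create the project folder and make it the working directory
dir.create("~/data-carpentry", showWarnings = FALSE)
setwd("~/data-carpentry")
getwd()  # confirm the working directory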
12. HADOOP SCOPE OF COURSE
Hadoop installation and data analytics with basic numeric problems.
Please follow Appendix A for installation; the 64-bit version 3.4.3 is required.
R has been used for statistical analysis, machine learning, visualization, and data operations.
R on its own will not load big data, but with the help of Hadoop one can process the data.
13. HADOOP SCOPE OF COURSE
R will handle the data analysis operations with its built-in functions, such as data loading, exploration, analysis, and visualization.
Hadoop will handle parallel data storage and processing, providing computation power alongside the distributed data.
14. HADOOP SCOPE OF COURSE
The middleware between R and Hadoop is RHive, which provides fast data access through a SQL interface and aids the development and execution of Hadoop MapReduce programs.
15. HADOOP SCOPE OF COURSE
The NameNode web interface is available at http://localhost:50070.
NameNode: this node acts as the master; it maintains the directories and files and tracks the blocks that reside on the DataNodes.
DataNode: this node acts as a slave and is deployed on every machine intended for storage; its main responsibility is providing read and write data services to clients.
16. Installation Hadoop
• Follow the link and install the framework:
• https://github.com/Sharjeel1234/HADOOP--HIVE--READY-INSTALLATION/
17. Installation Hadoop
The RHadoop project has three different R packages: rhdfs, rmr, and rhive.
rhdfs: an R package that provides distributed file system (HDFS) management within R.
rmr: an R package that helps develop Map/Reduce programs.
rhive: an R package that provides a SQL interface to the data for fast retrieval and processing.
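A small sketch of rhdfs in use, assuming HADOOP_CMD points at your hadoop binary (the paths below are illustrative):
Sys.setenv(HADOOP_CMD = "/usr/local/hadoop/bin/hadoop")  # assumed install location
library(rhdfs)
hdfs.init()                                   # connect R to HDFS
hdfs.ls("/")                                  # list the HDFS root directory
hdfs.put("data.csv", "/user/data/data.csv")   # copy a local file into HDFS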
18. Program Steps for HIVE
• Install the three packages and other required packages in R.
• Set the environment variables.
• Load the libraries.
• Connect to the Hive database.
• Start putting data into HDFS and querying it using Hive for fast retrieval.
• Create a table and populate it with data from a CSV file at an HDFS location.
• Run any other required queries and display the data using ggplot and k-means (see the sketch after this list).
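A minimal sketch of these steps with RHive; the install paths, table name, and CSV location are illustrative assumptions:
Sys.setenv(HIVE_HOME = "/usr/local/hive", HADOOP_HOME = "/usr/local/hadoop")  # assumed paths
library(RHive)
rhive.init()
rhive.connect(host = "localhost")             # connect to the Hive server
# Create a table and populate it from a CSV file already placed in HDFS
rhive.query("CREATE TABLE IF NOT EXISTS sales (id INT, amount DOUBLE)
             ROW FORMAT DELIMITED FIELDS TERMINATED BY ','")
rhive.query("LOAD DATA INPATH '/user/data/sales.csv' INTO TABLE sales")
res <- rhive.query("SELECT * FROM sales LIMIT 10")  # results come back as a data frame
head(res)                                     # ready for ggplot or kmeans()
rhive.close()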
21. Map - Reduce
The MapReduce paradigm is divided into two phases:
• Map and Reduce
• Map and Reduce primarily deal with key/value pairs.
• The output of the Map phase becomes the input of the Reduce phase.
22. Map - Reduce
• Map input preparation: the input data is read row-wise and turned into key/value pairs.
• Input: list (k2, v2)
• Run the given Map() code.
• Output: list (k3, v3)
• The Map output is shuffled for the reducers: similar keys are grouped together and sent to the same reducer.
• Run the given Reduce() code: the output is the reduced key/value pairs.
• Reduce input: (k3, list (v3))
• Reduce output: (k4, v4)
• Final output: the master node collects all key/value pairs, combines them, and writes them to the output.
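To make the key/value flow concrete, here is a small word-count illustration in plain local R (not a Hadoop job, just the same Map, shuffle/group, and Reduce steps):
lines <- c("big data", "big hadoop")
# Map: each input row returns (word, 1) key/value pairs
mapped <- unlist(lapply(lines, function(row) {
  words <- strsplit(row, " ")[[1]]
  setNames(rep(1, length(words)), words)
}))
# Shuffle: group values by key; Reduce: sum each group
reduced <- tapply(mapped, names(mapped), sum)
reduced  # big = 2, data = 1, hadoop = 1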
23. RMR Package Functions
• The categories of the functions are as follows:
• For storing and retrieving data:
• to.dfs: This is used to write R objects to the HDFS filesystem.
• small.ints = to.dfs(1:10)
• from.dfs: This is used to read R objects from the HDFS filesystem, where they are stored in a binary encoded format.
• from.dfs('/tmp/RtmpRMIXzb/file2bda3fa07850')
24. RMR Package Functions
• mapreduce: This is used for defining and executing the MapReduce job.
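A minimal sketch of mapreduce() from the rmr package family (rmr2 in recent releases), squaring the integers written to HDFS with to.dfs() above:
library(rmr2)
small.ints <- to.dfs(1:10)                     # write the input to HDFS
result <- mapreduce(
  input = small.ints,
  map = function(k, v) keyval(v, v^2)          # emit (value, value squared) pairs
)
from.dfs(result)                               # read the key/value pairs back into R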
25. Thank you!
• Any questions? Please submit them to the blog:
• https://sharjeel1978.blogspot.com/