This deck is oriented towards Hadoop/Hive installation experience and ecosystem concepts. The content of these slides is derived from a book under publication, Fundamentals of Big Data.
Fundamentals of Big Data with Hadoop and Hive
1. A real-time experience
Fundamentals of Big Data
Sharjeel Imtiaz | PhD Data Science – last stage | University of East London, UK
2. BIG DATA CHARACTERISTICS
• In 2001, Doug Laney detailed that Big Data is characterized by three traits:
• Volume (consisting of enormous quantities of data)
• Velocity (created in real-time)
• Variety (being structured, semi-structured and unstructured).
3. Big Data Definition
Exhaustive (an entire system is captured, rather than being sampled) (Mayer-Schonberger and Cukier, 2013).
Fine-grained (in resolution) and uniquely indexical (in identification) (Dodge and Kitchin, 2005).
Relationality (containing common fields that enable the conjoining of different datasets) (Boyd and Crawford, 2012).
Extensionality (can add/change new fields easily) and scalability (can expand in size rapidly) (Marz
and Warren, 2012).
Veracity (the data can be messy, noisy and contain uncertainty and error) (Marr, 2014).
Value (many insights can be extracted and the data repurposed) (Marr, 2014).
Variability (data whose meaning can be constantly shifting in relation to the context in which
they are generated) (McNulty, 2014).
4. How to process Big Data
• Ecosystem components are interrelated.
• Data is processed with MapReduce and stored in HDFS.
• HBase and Hive provide a SQL-like interface to easily manage data stored in HDFS in file formats such as plain text and comma-separated values (CSV).
• Sqoop is used for bulk transfer of data into the HDFS file system and HBase (big tables).
5. How to process Big Data
• Flume is typically used to stream data into HDFS.
• Solr indexes the data from HDFS, Hive, and HBase for fast retrieval.
• Solr stores the data in a disk file system, but with indexing.
6. HADOOP AND HADOOP COMPONENTS
• Apache Hadoop is a software framework that enables distributed processing of petabytes of data on large clusters.
• Hadoop Distributed File System (HDFS): stores data at scale and handles failures by replicating blocks to various places in the cluster.
• Frameworks for parallel processing of
data: MapReduce, Hive, Mahout, Spark
7. HADOOP AND HADOOP COMPONENTS
• YARN: manages the resources (CPU and memory) of the cluster nodes.
• Files in HDFS are divided into large blocks, typically 128 MB in size, that are distributed across the cluster.
• File1 fits in a single block (A1) because its size (100 MB) is less than the block size used in this example (150 MB). Block1 (A1) is written to the first node (worker Node 1), and Node 1 then replicates it to worker Node 2.
8. HADOOP AND HADOOP COMPONENTS
• File2 is divided into two blocks because its size (250 MB) is greater than the block size; block2 (B) and block3 (C) are each replicated across the nodes.
• The blocks' metadata (file name, blocks, location, date created, and size) is stored in the NameNode.
• Clusters run Hadoop's open-source distributed processing software on two kinds of nodes:
• Master
• Slave
9. HADOOP CLUSTER
• The JobTracker schedules map tasks close to the data being processed, ideally on the same DataNode that holds the required block.
• Both HDFS and MapReduce contribute master and slave components: the NameNode and DataNode come from HDFS, while the JobTracker and TaskTracker come from the MapReduce paradigm.
10. Big data Process
Collect the data first.
Prepare and clean the data.
Explore the data with plots, correlation, and regression.
Apply a model such as regression or k-means.
Visualize the results in a dashboard and, finally, deliver the product (see the sketch below).
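A minimal sketch of these steps in plain R, using the built-in iris data set as a stand-in for collected data (the column and cluster choices are illustrative):
# Prepare and clean: drop incomplete rows
clean <- na.omit(iris)
# Explore: plots, correlation, and a simple regression
plot(clean$Petal.Length, clean$Petal.Width)
cor(clean$Petal.Length, clean$Petal.Width)
summary(lm(Petal.Width ~ Petal.Length, data = clean))
# Model: k-means on the numeric columns
fit <- kmeans(clean[, 1:4], centers = 3)
# Visualize the clusters (a dashboard would present the same output)
plot(clean$Petal.Length, clean$Petal.Width, col = fit$cluster)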
11. How to process Big Data
• Creating Project in R
• Start RStudio: Under the File menu, click
on New Project. Choose New Directory,
then New Project.
• Enter a name for this new folder (or
“directory”), and choose a convenient location
for it. This will be your working directory for
the rest of the day (e.g., ~/data-carpentry).
• Click on Create Project.
• (Optional) In RStudio's Preferences, set "Save workspace to .RData on exit" to "Never".
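The same setup can also be scripted instead of using the menus; a small sketch, reusing the example folder name above:
# Create the project folder and make it the working directory
dir.create("~/data-carpentry", showWarnings = FALSE)
setwd("~/data-carpentry")
getwd()  # confirm the working directory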
12. HADOOP SCOPE OF COURSE
Hadoop installation and data analytics with basic numeric problems.
Please follow Appendix A for installation; the 64-bit version 3.4.3 is required.
R has been used for statistical analysis, machine learning, visualization, and data operations.
R on its own will not load big data, but with the help of Hadoop one can process the data.
13. HADOOP SCOPE OF COURSE
R will handle the data analysis operations with its built-in functions, such as data loading, exploration, analysis, and visualization.
Hadoop will handle parallel data storage and processing, providing computation power alongside the distributed data.
14. HADOOP SCOPE OF COURSE
The middleware between R and Hadoop is RHive, which provides fast data access through a SQL interface and aids the development and execution of Hadoop MapReduce programs.
15. HADOOP SCOPE OF COURSE
The NameNode web interface is available at http://localhost:50070.
NameNode: this node acts as the master; it maintains the directories and files and tracks the blocks that reside on the DataNodes.
DataNode: this node acts as a slave and is deployed on every machine intended for storage; its main responsibility is providing read and write data services to clients.
16. Installation Hadoop
• Follow the link and install the framework:
• https://github.com/Sharjeel1234/HADOOP--HIVE--READY-INSTALLATION/
17. Installation Hadoop
The RHadoop project has three different R packages: rhdfs, rmr, and rhive.
rhdfs: an R package that provides distributed file system (HDFS) management within R.
rmr: an R package that helps develop Map/Reduce programs.
rhive: an R package that provides a SQL interface to the data for fast retrieval and processing.
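A small sketch of rhdfs in use, assuming HADOOP_CMD points at your hadoop binary (the paths below are illustrative):
Sys.setenv(HADOOP_CMD = "/usr/local/hadoop/bin/hadoop")  # assumed install location
library(rhdfs)
hdfs.init()                                   # connect R to HDFS
hdfs.ls("/")                                  # list the HDFS root directory
hdfs.put("data.csv", "/user/data/data.csv")   # copy a local file into HDFS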
18. Program Steps for HIVE
• Install the three packages and other required packages in R.
• Set the environment variables.
• Load the libraries.
• Connect to the Hive database.
• Start putting data into HDFS and querying it using Hive for fast retrieval.
• Create a table and populate it with data from a CSV file at an HDFS location.
• Run any other required queries and display the data using ggplot and k-means (see the sketch after this list).
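A minimal sketch of these steps with RHive; the install paths, table name, and CSV location are illustrative assumptions:
Sys.setenv(HIVE_HOME = "/usr/local/hive", HADOOP_HOME = "/usr/local/hadoop")  # assumed paths
library(RHive)
rhive.init()
rhive.connect(host = "localhost")             # connect to the Hive server
# Create a table and populate it from a CSV file already placed in HDFS
rhive.query("CREATE TABLE IF NOT EXISTS sales (id INT, amount DOUBLE)
             ROW FORMAT DELIMITED FIELDS TERMINATED BY ','")
rhive.query("LOAD DATA INPATH '/user/data/sales.csv' INTO TABLE sales")
res <- rhive.query("SELECT * FROM sales LIMIT 10")  # results come back as a data frame
head(res)                                     # ready for ggplot or kmeans()
rhive.close()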
21. Map - Reduce
The MapReduce paradigm is divided into two phases:
• Map and Reduce
• Map and Reduce primarily deal with key/value pairs.
• The output of the Map phase becomes the input of the Reduce phase.
22. Map - Reduce
• Map input preparation: the input data is read row-wise and turned into key/value pairs.
• Input: list (k2, v2)
• Run the given Map() code.
• Output: list (k3, v3)
• The Map output is shuffled for the reducers: similar keys are grouped together and sent to the same reducer.
• Run the given Reduce() code: the output is the reduced key/value pairs.
• Reduce input: (k3, list (v3))
• Reduce output: (k4, v4)
• Final output: the master node collects all key/value pairs, combines them, and writes them to the output.
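To make the key/value flow concrete, here is a small word-count illustration in plain local R (not a Hadoop job, just the same Map, shuffle/group, and Reduce steps):
lines <- c("big data", "big hadoop")
# Map: each input row returns (word, 1) key/value pairs
mapped <- unlist(lapply(lines, function(row) {
  words <- strsplit(row, " ")[[1]]
  setNames(rep(1, length(words)), words)
}))
# Shuffle: group values by key; Reduce: sum each group
reduced <- tapply(mapped, names(mapped), sum)
reduced  # big = 2, data = 1, hadoop = 1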
23. RMR Package Functions
• The categories of the functions are as follows:
• For storing and retrieving data:
• to.dfs: This is used to write R objects to the HDFS filesystem.
• small.ints = to.dfs(1:10)
• from.dfs: This is used to read R objects from the HDFS filesystem, where they are stored in a binary encoded format.
• from.dfs('/tmp/RtmpRMIXzb/file2bda3fa07850')
24. RMR Package Functions
• mapreduce: This is used for defining and executing the MapReduce job.
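A minimal sketch of mapreduce() from the rmr package family (rmr2 in recent releases), squaring the integers written to HDFS with to.dfs() above:
library(rmr2)
small.ints <- to.dfs(1:10)                     # write the input to HDFS
result <- mapreduce(
  input = small.ints,
  map = function(k, v) keyval(v, v^2)          # emit (value, value squared) pairs
)
from.dfs(result)                               # read the key/value pairs back into R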
25. Thank you!
• Any questions? Please submit them to the blog:
• https://sharjeel1978.blogspot.com/