A real-time experience.
Fundamental of Big Data
Sharjeel Imtiaz | PhD Data Science – last stage | University of East London, UK
BIG DATA CHARACTERISTICS
• In 2001, Doug Laney proposed that Big Data is characterized by three traits:
• Volume (consisting of enormous quantities of data)
• Velocity (created in real-time)
• Variety (being structured, semi-structured and unstructured).
Big Data Definition
Exhaustive (an entire system is captured, rather than being sampled) (Mayer-Schonberger and Cukier, 2013).
Fine-grained (in resolution) and uniquely indexical (in identification) (Dodge and Kitchin, 2005).
Relational (containing common fields that enable the conjoining of different datasets) (Boyd and Crawford, 2012).
Extensional (new fields can be added or changed easily) and scalable (can expand in size rapidly) (Marz and Warren, 2012).
Veracity (the data can be messy, noisy, and contain uncertainty and error) (Marr, 2014).
Value (many insights can be extracted and the data repurposed) (Marr, 2014).
Variability (the meaning of data can shift constantly in relation to the context in which it is generated) (McNulty, 2014).
How to process Big Data
• The ecosystem components are interrelated: data is processed with MapReduce and stored in HDFS.
• HBase and Hive provide a SQL-like interface to manage data stored in HDFS in file formats such as plain text and comma-separated values (CSV).
• Between the HDFS file system and HBase (big tables) and external databases, Sqoop is used for bulk transfer.
How to process Big Data
• Flume is typically used to stream data into HDFS.
• Solr indexes the data from HDFS, Hive, and HBase for fast retrieval.
• Solr stores the data in a disk file system, but with indexing.
HADOOP AND HADOOP COMPONENT
• Apache Hadoop is a software framework that enables distributed processing of petabytes of data on large clusters.
• Hadoop Distributed File System (HDFS): stores data comprehensively and handles it by replicating blocks into various places.
• Frameworks for parallel processing of data: MapReduce, Hive, Mahout, Spark.
HADOOP AND HADOOP COMPONENT
• YARN: manages the size and memory of the cluster nodes.
• Files in HDFS are divided into large blocks, typically 128 MB in size, which are distributed across the cluster.
• In the worked example here the block size is 150 MB: File1 occupies a single block (A1) because its size (100 MB) is less than the block size, and the block is replicated on the Node 1 and Node 2 workers. Block1 (A1) is written to the first node (Node 1), and Node 1 then replicates it to the Node 2 worker.
HADOOP AND HADOOP COMPONENT
• File2 is divided into two blocks because its size (250 MB) is greater than the block size, and block2 (B) and block3 (C) are each replicated on a node (see the sketch after this list).
• Block metadata (file name, blocks, location, date created, and size) is stored in the NameNode.
• Clusters run Hadoop's open-source distributed processing software on two kinds of node:
• Master
• Slave
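As a quick illustration of the block arithmetic above, here is a minimal R sketch; the 150 MB block size and the two file sizes are the slide's example values, not HDFS defaults:

block_size_mb <- 150                      # example block size from the slide
file_sizes_mb <- c(File1 = 100, File2 = 250)

# Number of HDFS blocks a file occupies: ceiling(file size / block size)
n_blocks <- ceiling(file_sizes_mb / block_size_mb)
n_blocks
# File1 File2
#     1     2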
HADOOP CLUSTER
• The Job Tracker schedules map jobs close to the data being processed, so a task runs on the same DataNode that holds the required block.
• Both HDFS and MapReduce have master and slave components: the NameNode and DataNode come from HDFS, while the Job Tracker and Task Tracker come from the MapReduce paradigm.
Big data Process
 Data collection first
 Data preparation and cleaning
 Data exploration with plots, correlation, and regression
 Model fitting, for example regression or k-means
 Visualization of the results in a dashboard, and finally the product (a minimal sketch of the pipeline follows below)
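A minimal R sketch of this pipeline using the built-in iris dataset (the dataset stands in for a real collection stage; column names are from the standard iris data):

# 1. Collect: here we simply load a built-in dataset
data(iris)

# 2. Prepare and clean: drop incomplete rows
clean <- na.omit(iris)

# 3. Explore: plots, correlation, regression
plot(clean$Petal.Length, clean$Petal.Width)
cor(clean$Petal.Length, clean$Petal.Width)
fit <- lm(Petal.Width ~ Petal.Length, data = clean)
summary(fit)

# 4. Model: k-means on the numeric columns
km <- kmeans(clean[, 1:4], centers = 3)

# 5. Visualize the results
plot(clean[, 1:2], col = km$cluster)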
How to process Big Data
• Creating Project in R
• Start RStudio: Under the File menu, click
on New Project. Choose New Directory,
then New Project.
• Enter a name for this new folder (or
“directory”), and choose a convenient location
for it. This will be your working directory for
the rest of the day (e.g., ~/data-carpentry).
• Click on Create Project.
• (Optional) Set Preferences to ‘Never’ save
workspace in RStudio.
HADOOP SCOPE OF COURSE
Hadoop installation and data analytics with basic numeric problems.
Please follow Appendix A for installation; the 64-bit build of version 3.4.3 is required.
R has been used for statistical analysis, machine learning, visualization, and data operations.
R will not load big data by itself, but with the help of Hadoop one can process the data.
HADOOP SCOPE OF COURSE
 R will handle data analysis operations
with the initial functions, such as data
loading, exploration, analysis, and
visualization.
 Hadoop will handle parallel data storage and processing, supplying computational power alongside the distributed data.
HADOOP SCOPE OF COURSE
 The middleware between R and Hadoop is RHIVE, which provides fast streaming through a SQL interface and aids the development and execution of Hadoop MapReduce programs.
HADOOP SCOPE OF COURSE
The NameNode web UI is available at http://localhost:50070.
NameNode: this node acts as the master; it maintains the directories and files and tracks the blocks that reside on the DataNodes.
DataNode: the DataNode acts as a slave and is deployed on every machine intended for storage. Its main responsibility is to provide read-and-write data services to clients.
Installation Hadoop
• Follow the link and install the framework:
• https://github.com/Sharjeel1234/HADOOP--HIVE--READY-INSTALLATION/
Installation Hadoop
The RHadoop project has three different R packages: rhdfs, rmr, and rhive.
rhdfs: an R package that provides distributed file system (HDFS) management within R.
rmr: an R package that helps to develop MapReduce programs.
rhive: an R package that provides a SQL interface to the data for fast retrieval and processing.
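A minimal rhdfs sketch, assuming Hadoop and the rhdfs package are already installed; the HADOOP_CMD path, file name, and HDFS directory are illustrative:

Sys.setenv(HADOOP_CMD = "/usr/local/hadoop/bin/hadoop")  # assumed install path
library(rhdfs)
hdfs.init()                                  # connect to HDFS
hdfs.put("iris.csv", "/user/data/iris.csv")  # copy a local file into HDFS
hdfs.ls("/user/data")                        # list the directory to confirm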
Program Steps for HIVE
• Install the three packages, and other required packages, in R
• Set the environment variables
• Load the libraries
• Connect to the Hive database
• Start putting data into HDFS and querying it with Hive for fast retrieval
• Create a table and populate it with data from a CSV at an HDFS location
• Query other required functions and display the data using ggplot and k-means (a sketch of these steps follows below)
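A minimal RHive sketch of these steps, assuming a running Hive server and an iris.csv already uploaded to HDFS; the host, HDFS path, table name, and column names are illustrative:

library(RHive)
rhive.init()                          # set up the RHive environment
rhive.connect(host = "localhost")     # connect to the Hive server

# Create a table over comma-separated data and load the CSV from HDFS
rhive.query("CREATE TABLE iris_tbl (sepal_length DOUBLE, sepal_width DOUBLE,
             petal_length DOUBLE, petal_width DOUBLE, species STRING)
             ROW FORMAT DELIMITED FIELDS TERMINATED BY ','")
rhive.query("LOAD DATA INPATH '/user/data/iris.csv' INTO TABLE iris_tbl")

# Query back into an R data frame for fast retrieval
res <- rhive.query("SELECT species, AVG(petal_length) FROM iris_tbl GROUP BY species")
rhive.close()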
(Slide: display example of the iris dataset)
(Slide: display each species according to length; a plotting sketch follows below)
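A minimal ggplot2 sketch of such a display, assuming the standard iris columns; whether the original slide plotted petal or sepal length is not recoverable, so petal length is an assumption:

library(ggplot2)
# One density curve of petal length per species, coloured by species
ggplot(iris, aes(x = Petal.Length, fill = Species)) +
  geom_density(alpha = 0.5) +
  labs(title = "Iris petal length by species")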
Map - Reduce
The MapReduce paradigm is divided into two phases:
• Map and Reduce
• Map and Reduce primarily deal with key/value pairs
• The output of the Map phase becomes the input for the Reduce phase.
Map - Reduce
• Map input preparation: the input data is read row-wise and turned into key/value pairs.
• Map input: list (k1, v1)
• Run the given Map() code
• Map output: list (k2, v2)
• The Map output is shuffled to the reducers: records with the same key are grouped together and sent to the same reducer.
• Run the given Reduce() code: the output is the reduced key/value pairs.
• Reduce input: (k2, list (v2))
• Reduce output: (k3, v3)
• Final output: the master node collects all key/value pairs, combines them, and writes the result (see the sketch after this list).
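A tiny base-R simulation of this map-shuffle-reduce flow, counting words; it is purely illustrative, with no Hadoop involved:

docs <- c("big data", "big hadoop", "data data")

# Map: each row emits (word, 1) key/value pairs
words <- unlist(strsplit(docs, " "))
pairs <- data.frame(key = words, value = 1)

# Shuffle: group the values of identical keys together
grouped <- split(pairs$value, pairs$key)

# Reduce: sum the values for each key
reduced <- sapply(grouped, sum)
reduced
#  big   data hadoop
#    2      3      1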
RMR Package Functions
• The categories of the functions are as follows:
• For storing and retrieving data:
• to.dfs: This is used to write R objects to the filesystem.
• small.ints = to.dfs(1:10)
• from.dfs: This is used to read R objects back from the HDFS filesystem, where they are stored in a binary encoded format.
• from.dfs('/tmp/RtmpRMIXzb/file2bda3fa07850')
RMR Package Functions
• mapreduce: This is used for defining and executing the MapReduce job.
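A minimal rmr2 sketch tying these functions together, assuming rmr2 is installed and Hadoop is configured; this is the common small-integers example, squaring each value:

library(rmr2)
small.ints <- to.dfs(1:10)                # write the input to HDFS
result <- mapreduce(
  input = small.ints,
  map   = function(k, v) keyval(v, v^2))  # emit (value, value squared) pairs
from.dfs(result)                          # read the key/value pairs back into R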
• Thank you!
• Any questions? Please submit them to the blog.
• https://sharjeel1978.blogspot.com/