UNDERSTANDING HADOOP
REX RAMOS
rexjasonramos@yahoo.com
2021-04-20
Abstract
This document focuses on application development in the Hadoop environment to help solve real-world business problems.
Table of Contents
What is Hadoop
HDFS Overview
MapReduce – Overview and Its Significance
   How it distributes processing
YARN - Overview
Handling Failure
Simple Application of MapReduce
Programming Hadoop with Pig
What is Spark and its Significance
Relational Data and How it Works with Hadoop
   Hive
   Sqoop
Non-Relational Database and How it Works with Hadoop
   NoSQL
   HBase
   Cassandra
   MongoDB
Feeding Data into Cluster
   Kafka
   Flume
Processing Streaming Data
   Spark Streaming
Designing a System
   Understand the Requirements
   An Example of Application
Table of Figures
References
What is Hadoop
Hadoop is an open-source framework that allows for the distributed processing of large data sets
across clusters of computers using simple programming models. It is designed to scale up from single
servers to thousands of machines, each offering local computation and storage. Rather than relying on
hardware to deliver high availability, the library itself is designed to detect and handle failures at the
application layer, delivering a highly available service on top of a cluster of computers, each of which
may be prone to failure. (Apache Hadoop, n.d.)
Hadoop can be divided into four layers: (Kaplarevic, 2020)
1.) Distributed storage layer – the Hadoop Distributed File System (HDFS)
2.) Cluster resource management – Yet Another Resource Negotiator (YARN)
3.) Processing framework layer – MapReduce, Spark
4.) Hadoop common utilities such as Pig, Hive, Sqoop, HBase, etc.
The first three can be considered the core of the Hadoop ecosystem.
Figure 1: Hadoop Ecosystem (Sinha, 2020)
HDFS Overview
The Hadoop Distributed File System (HDFS) is a distributed file system that provides high-throughput
access to application data (Apache Hadoop, n.d.). Each node in a Hadoop cluster has its own disk
space, memory, bandwidth, and processing power. The design assumes that any individual node can
fail; thus, HDFS stores three copies of each dataset across the cluster. (Kaplarevic, 2020)
Figure 2: HDFS Architecture (Bakshi, 2020)
MapReduce – Overview and Its Significance
MapReduce is one of the core pieces of Hadoop, provided along with HDFS and YARN. It distributes
the processing of your data across your cluster: its job is to divide all your data into partitions that can
be processed in parallel, and it manages how that is done and how failure is handled. It maps and
reduces data. Mapping organizes the information as it arrives, transforming the data into key-value
pairs. Each mapper works on its own chunk of input without regard to the others, which is what makes
the operation parallel. The shuffle stage then gathers all the values for each unique key and sorts
them. The reducer aggregates the values for each key. MapReduce also provides a mechanism for
handling failures, such as a particular node going bad or going down.
Figure 3: MapReduce Architecture (Geeksforgeeks, 2020)
How it distributes processing
Depending on how large the data is, Hadoop might distribute the processing of multiple tasks on the
same computer, or it might distribute them among multiple computers. In a nutshell, the input data is
partitioned and assigned to multiple nodes for processing. The mapping stage can be split across
different computers, with each computer receiving a different chunk of the input data, and after the
shuffle stage different computers can be responsible for different sets of keys during the reduce stage.
YARN - Overview
Figure 4: YARN Architecture (Murthy, 2012)
One of the main issues with the previous Hadoop version was scalability. MapReduce version 1
performed both task processing and resource management itself, so a single master handled
everything; that master became a bottleneck, which caused the scalability problem. The introduction
of YARN (Yet Another Resource Negotiator) solves this: the main idea is to separate resource
management and job scheduling from MapReduce, instead of having a single master control
everything. With YARN, job scheduling and resource management are handled by this component.
Handling Failure
Hadoop handles failure very well. The application master monitors tasks for errors or hanging; it is
smart enough to restart tasks as needed or reallocate them to other nodes. What if the application
master itself fails? YARN will restart it, and if the restart limit is reached and the application master
still fails, the resource manager will start a new instance of the application master on a different node.
The single point of failure is the resource manager. What happens if the resource manager fails? With
proper design this can be mitigated: during the design phase, it is imperative to set up a standby
resource manager so that if the main resource manager fails, the standby kicks in and recovers from
the last backup point. In this way, operations can still continue. (adarsh, 2017)
Simple Application of MapReduce
For this exercise, we will use the MovieLens dataset, with 10 million movie ratings applied to 10,000
movies by 72,000 users. (MovieLens, n.d.)
Link: http://files.grouplens.org/datasets/movielens/ml-10m.zip (Grouplens, 2009)
We will access Hadoop through a command line interface using PuTTY.
Figure 5: Hortonworks Hadoop after Startup from a VM
Figure 6: Accessing Hadoop Using PuTTY
• maria_dev – the default username for the Hadoop sandbox we will use
• 127.0.0.1 – the address used to access Hadoop, over port 2222
Figure 7: Accessing Hadoop file system
• Use maria_dev as the password
• Create a directory to store the dataset inside the Hadoop file system [hadoop fs -mkdir <folderName>]
• Get the dataset from the source http://files.grouplens.org/datasets/movielens/ml-10m.zip
(Grouplens, 2009)
Figure 8: Unzip movie lens dataset
• The unzip utility is already present on this Linux distribution, so unzip the file using the command
[unzip <zipFile>]
Figure 9: Transfer file from Local source to HDFS
• Transfer the file from the local source to an HDFS folder
[hadoop fs -copyFromLocal <sourceFile> <destinationFile>]
• Check that the transfer succeeded by listing the destination
[hadoop fs -ls <destinationFile>]
Figure 10: Python script to run count of ratings
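The figure's script is shown as an image. As an illustrative sketch (assuming the mrjob library is
installed; class and method names are hypothetical), a script that counts how often each rating value
occurs could look like this:

    from mrjob.job import MRJob
    from mrjob.step import MRStep

    class RatingsBreakdown(MRJob):
        def steps(self):
            return [
                MRStep(mapper=self.mapper_get_ratings,
                       reducer=self.reducer_count_ratings)
            ]

        def mapper_get_ratings(self, _, line):
            # ratings.dat fields: UserID::MovieID::Rating::Timestamp
            (userID, movieID, rating, timestamp) = line.split('::')
            # Emit each rating value with a count of 1
            yield rating, 1

        def reducer_count_ratings(self, key, values):
            # Sum how many times each rating value was given
            yield key, sum(values)

    if __name__ == '__main__':
        RatingsBreakdown.run()

Such a script can be run locally as "python RatingsBreakdown.py ratings.dat", or submitted to the
cluster with mrjob's -r hadoop runner.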
Figure 11: Command to run the python script to count ratings from <ratings.dat> file
Figure 12: Results after running the script
Another example is a script that determines which movies are rated the most, in terms of the count
of ratings each has received. Using this Python script:
Figure 13: MoviesBreakdown.py Script
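The figure's script is an image. As a hedged sketch of the same idea (again assuming mrjob; names are
illustrative), a two-stage job can count the ratings per movie and then sort by that count:

    from mrjob.job import MRJob
    from mrjob.step import MRStep

    class MoviesBreakdown(MRJob):
        def steps(self):
            return [
                MRStep(mapper=self.mapper_get_ratings,
                       reducer=self.reducer_count_ratings),
                MRStep(reducer=self.reducer_sorted_output)
            ]

        def mapper_get_ratings(self, _, line):
            (userID, movieID, rating, timestamp) = line.split('::')
            yield movieID, 1

        def reducer_count_ratings(self, movieID, values):
            # Zero-pad the count and emit it as the key, so the
            # second stage's shuffle sorts movies by rating count
            yield str(sum(values)).zfill(5), movieID

        def reducer_sorted_output(self, count, movieIDs):
            for movieID in movieIDs:
                yield movieID, count

    if __name__ == '__main__':
        MoviesBreakdown.run()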
Figure 14: Commands to run the script and the ratings.dat file
Running the script against ratings.dat produces the movies with the highest rating counts.
Figure 15: Sorted ratings count result of the script
Based on the counts, movieID 296 (Pulp Fiction) is the most rated, with 34,864 ratings, followed by
movieID 356 (Forrest Gump) with 34,457.
Programming Hadoop with Pig
What is Pig? According to Apache, it is a platform for analyzing large datasets that consists of a
high-level language for expressing data analysis programs, coupled with infrastructure for evaluating
these programs. (The Apache Software Foundation, 2021)
What makes this an upgrade over writing raw MapReduce scripts is its ease of programming: it lets you
write SQL-like syntax to define the map and reduce steps, and it is highly extensible with user-defined
functions.
It is much easier to execute Pig from Ambari, which provides a UI for managing the overall Hadoop
environment – more of a one-stop shop for handling Hadoop-related tasks.
Using the Hortonworks Docker sandbox, log into the Ambari UI by browsing to 127.0.0.1:8080, with
username maria_dev and password maria_dev.
Figure 16: Ambari Log-in
To upload data files, follow figure 17:
Figure 17: Uploading datafiles (part1)
Figure 18: Uploading datafiles (part2)
Figure 19: Uploading datafiles (part3)
To start or create a Pig script, follow figure 20.
Figure 20: Going to Pig View UI
Figure 21: Pig View UI
Using the same scenario as the MapReduce example, we will determine which movie has the most
ratings. The PigStorage function was not used to handle the delimiter, since it only supports
single-character delimiters; one workaround for the double-colon delimiter is to split on a regex.
Using this script:
Figure 22: Pig script to count movies with most ratings
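The script itself appears as an image. A minimal Pig Latin sketch of the approach it describes (the
path and aliases are illustrative) could load whole lines and split them on the two-character
delimiter:

    -- Load raw lines, since PigStorage() cannot handle '::'
    lines = LOAD '/user/maria_dev/ml-10m/ratings.dat' AS (line:chararray);
    -- Split each line on the regex '::' into its four fields
    ratings = FOREACH lines GENERATE FLATTEN(STRSPLIT(line, '::'))
              AS (userID:chararray, movieID:chararray, rating:chararray, rtime:chararray);
    -- Count the ratings per movie and order by that count
    grouped = GROUP ratings BY movieID;
    counts = FOREACH grouped GENERATE group AS movieID, COUNT(ratings) AS ratingCount;
    result = ORDER counts BY ratingCount DESC;
    DUMP result;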
The result of the script, the same as figure 15, shows that Pulp Fiction and Forrest Gump have the
highest counts of ratings.
Figure 23: Result of Pig script
What is Spark and its Significance
Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop
clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra,
Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to
MapReduce) and new workloads like streaming, interactive queries, and machine learning. (Apache
Software Foundation, 2018)
Spark gives you a lot of flexibility: you can write real scripts in real programming languages such as
Java, Scala, or Python to perform complex manipulations, transformations, and analysis of data. What
sets it apart from technologies like Pig is the rich ecosystem built on top of Spark that lets you do all
sorts of complicated things, such as machine learning, data mining, graph analysis, and streaming
data.
What makes it fast compared to MapReduce is that it caches data in RAM instead of on disk, and
accessing RAM is much quicker than retrieving data from disk. Spark can run under the Hadoop YARN,
Kubernetes, or Apache Mesos cluster managers, or with its own standalone cluster manager. (Apache
Software Foundation, 2018)
Spark is built around the concept of the Resilient Distributed Dataset (RDD), an object that represents
a dataset. You apply functions to an RDD to transform it, in whatever way you want, into a new RDD.
Figure 24: Spark Architecture (Apache Software Foundation, 2018)
Spark runs version 1 by default unless the user specifies which version should run the scripts. From
the CLI, use the command shown in Figure 25.
Figure 25: Declaring Spark Version to Use
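On the Hortonworks sandbox this is typically done through an environment variable rather than a
command-line flag; for example, the following (version number illustrative) selects Spark 2 for the
session:

    export SPARK_MAJOR_VERSION=2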
Returning to the scenario from the MapReduce example, we will determine which movie is rated the
most. The script was written in PySpark.
Figure 26: Import needed Classes
Create functions to extract only the needed data from the datasets. The function loadMovieNames
builds a dictionary mapping each movieID to its corresponding movie name. The next function,
parseInput(line), creates a row with movieID and rating, to be used for the DataFrame.
Figure 27: Function for row and dictionary
Figure 28: Script to Get Movies with highest Count of Rate
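Figures 26 through 28 show the script as images. Assembled into one hedged sketch (file paths and
names are illustrative, assuming the directory layout created earlier), the PySpark code could look
like this:

    from pyspark.sql import SparkSession, Row

    def loadMovieNames():
        # Build a movieID -> movie name dictionary from movies.dat
        movieNames = {}
        with open("ml-10M100K/movies.dat") as f:
            for line in f:
                fields = line.split("::")
                movieNames[int(fields[0])] = fields[1]
        return movieNames

    def parseInput(line):
        # ratings.dat fields: UserID::MovieID::Rating::Timestamp
        fields = line.split("::")
        return Row(movieID=int(fields[1]), rating=float(fields[2]))

    if __name__ == "__main__":
        spark = SparkSession.builder.appName("PopularMovies").getOrCreate()
        movieNames = loadMovieNames()

        lines = spark.sparkContext.textFile("hdfs:///user/maria_dev/ml-10m/ratings.dat")
        ratings = spark.createDataFrame(lines.map(parseInput))

        # Count the ratings per movie, highest count first
        topMovies = ratings.groupBy("movieID").count().orderBy("count", ascending=False)

        for row in topMovies.take(10):
            print("%s: %d" % (movieNames[row['movieID']], row['count']))

        spark.stop()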
To run a Spark script, use the command "spark-submit <script-file>".
Figure 29: Command to Run Spark Script
The result shows Pulp Fiction with the highest count of ratings per movie, followed by Forrest Gump.
Figure 30: Result of Spark Script
The Spark stack also includes Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX
(graph processing). This stack is not covered in this document; below is a snapshot of the components
that can be used with Spark.
Figure 31: Spark Stack (Madaka, 2019)
Relational Data and How it Works with Hadoop
Hive
Hive provides a SQL-like way to work with data in Hadoop. It translates SQL into MapReduce or Tez
commands and runs them on the YARN cluster manager to execute the query, making it a nice tool for
users who are adept at SQL (Abeywardana, 2020). Hive is highly scalable, working with big data on a
cluster, and it is well suited to data warehouse applications and Online Analytical Processing (OLAP).
In a nutshell, Hive is a layer on top of HDFS that provides a schema-like structure and a simple way to
query data across the Hadoop cluster using SQL; since execution is batch-oriented, it is suitable for
large analytic queries.
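For example, the most-rated-movies question from the earlier sections could be answered in Hive with
an ordinary SQL query (assuming a ratings table has been defined over the MovieLens data):

    SELECT movieID, COUNT(*) AS ratingCount
    FROM ratings
    GROUP BY movieID
    ORDER BY ratingCount DESC
    LIMIT 10;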
Figure 32: How Hive Works (Hortonworks Inc., n.d.)
Sqoop
Sqoop is a tool designed for efficiently transferring bulk data between Hadoop and structured
datastores such as relational databases (Apache Software Foundation, 2019). Since data is transferred
in parallel, it is efficient at moving large datasets. Sqoop can also do incremental imports from an
RDBMS into the Hadoop system: it can transfer only the rows added since the last import, based on a
timestamp or on a column value greater than a given value.
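As a hedged illustration (the connection string, credentials, table, and check column are examples),
a full import followed by an incremental import might look like:

    sqoop import --connect jdbc:mysql://localhost/movielens \
        --username maria_dev -P --table ratings -m 1

    sqoop import --connect jdbc:mysql://localhost/movielens \
        --username maria_dev -P --table ratings -m 1 \
        --incremental append --check-column rtimestamp --last-value 1234567890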
Figure 33: How Sqoop import or export data from RDBMS (Hortonworks Inc., n.d.)
Non-Relational Database and How it Works with Hadoop
NoSQL
NoSQL databases store data in a schema-less, free-form fashion, and the data is denormalized. NoSQL
is useful for quick access to data, with an emphasis on speed and simplicity of access over reliable
transactions or consistency, and for storing large volumes of data without a schema. Migrating
unstructured data from multiple sources is also fast, since the data is kept in its original form for
flexibility (Yegulalp, 2017).
HBase
HBase is a column-oriented, non-relational database management system that runs on top of HDFS.
It provides a fault-tolerant way of storing sparse datasets and is well suited to real-time data
processing and random read/write access to large volumes of data. HBase doesn't support SQL, but it
can be used with Hive, a SQL-like query engine for batch processing of big data (IBM, n.d.).
HBase allows for many attributes to be grouped together into column families, such that the elements
of a column family are all stored together. This is different from a row-oriented relational database,
where all the columns of a given row are stored together. With HBase you must predefine the table
schema and specify the column families. However, new columns can be added to families at any time,
making the schema flexible and able to adapt to changing application requirements (IBM, n.d.). HBase
is modeled after Google’s Bigtable concept.
Cassandra
The Apache Cassandra database is the right choice when you need scalability and high availability
without compromising performance. Linear scalability and proven fault-tolerance on commodity
hardware or cloud infrastructure make it the perfect platform for mission-critical data. Cassandra's
support for replicating across multiple datacenters is best-in-class, providing lower latency for your
users and the peace of mind of knowing that you can survive regional outages (Apache Software
Foundation, 2016). Its main advantage is its best-in-class fault tolerance.
MongoDB
MongoDB is a schema-less database and stores data as JSON-like documents (binary JSON). This
provides flexibility and agility in the type of records that can be stored, and even the fields can vary
from one document to the other. Also, the embedded documents provide support for faster queries
through indexes and vastly reduce the I/O overload generally associated with database systems. Along
with this is support for schema on write and dynamic schema for easy evolving data structures (Klein,
2020).
Feeding Data into Cluster
Streaming data is a continuous transfer of data at a steady, high-speed rate (Contributor, 2019). In
today's world it makes sense to analyze streaming data so that data-driven decisions can be made in
real time. Imagine buying stocks on an exchange: you need up-to-date stock data to make a timely
decision about whether to buy. There are several ways to ingest streaming data into a Hadoop
cluster.
Kafka
Apache Kafka is an open-source distributed event streaming platform used by thousands of companies
for high-performance data pipelines, streaming analytics, data integration, and mission-critical
applications (Apache Software Foundation, n.d.). According to the Apache Kafka website, 80% of
Fortune 100 companies use Kafka to ingest their streaming data, because of its high throughput, its
scalable, fault-tolerant clusters, and its high availability.
Figure 34: Kafka Architecture (Boyer, Osowski, & Almaraz, n.d.)
The Kafka cluster sits at the center of this world of streaming data; it represents many processes
running on many servers that handle Kafka storage and Kafka processing. Producers generate the data
and publish messages to topics; on the other end, consumers link in the Kafka client libraries to
receive that data as it comes in, reading and processing it.
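As a quick illustration of this producer/consumer model (the topic name and address are examples),
Kafka ships with console clients:

    # Publish messages typed on stdin to a topic
    kafka-console-producer.sh --broker-list localhost:9092 --topic test-topic

    # Read the topic's messages from the beginning
    kafka-console-consumer.sh --bootstrap-server localhost:9092 \
        --topic test-topic --from-beginning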
Flume
Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and
moving large amounts of log data. Its main goal is to deliver data from applications to Apache Hadoop's
HDFS. It has a simple and flexible architecture based on streaming data flows. It is robust and fault
tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a
simple extensible data model that allows for online analytic applications (Alten-Lorenz, 2015).
Figure 35: Flume Architecture
Flume is a tool for collecting log data from distributed web servers; the collected data goes to HDFS
for analysis. It guarantees data delivery because both the sending and receiving agents wrap the
transfer in transactions.
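A minimal agent definition gives a feel for the source/channel/sink model. This hedged sketch (agent
name and paths are illustrative) listens on a local TCP port, buffers events in memory, and writes
them to HDFS:

    # Name this agent's source, channel, and sink
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    # Source: listen for lines of text on a TCP port
    a1.sources.r1.type = netcat
    a1.sources.r1.bind = localhost
    a1.sources.r1.port = 44444

    # Channel: buffer events in memory between source and sink
    a1.channels.c1.type = memory

    # Sink: deliver the events to a directory in HDFS
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = hdfs:///user/maria_dev/flume/events

    # Wire the source and sink to the channel
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1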
Processing Streaming Data
What do we do with the data once it has been ingested? We process it into meaningful information.
There are several ways to process streaming data in the Hadoop ecosystem: Spark Streaming, Apache
Storm, Apache Flink, and others. This document will focus only on Spark Streaming.
Spark Streaming
Apache Spark provides a streaming API to analyze streaming data in pretty much the same way we
work with batch data. Apache Spark Structured Streaming is built on top of the Spark-SQL API to
leverage its optimization. Spark Streaming is a processing engine to process data in real-time from
sources and output data to external storage systems (Bhadani, 2021).
Figure 36: Spark streaming Architecture
Spark Streaming receives live input data streams and divides the data into batches, which are then
processed by the Spark engine to generate the final stream of results in batches. The continuous
stream of data is represented as a sequence of RDDs (Resilient Distributed Datasets).
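As a minimal sketch (the socket source and port are illustrative), the classic word-count example
shows the micro-batch model: every second, the words received in that interval are counted:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="StreamingWordCount")
    ssc = StreamingContext(sc, 1)  # 1-second batch interval

    # Treat text arriving on a socket as a stream of lines
    lines = ssc.socketTextStream("localhost", 9999)
    counts = (lines.flatMap(lambda line: line.split(" "))
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()  # Print each batch's counts

    ssc.start()
    ssc.awaitTermination()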
Designing a System
There is no perfect formula for designing a system. The important thing to remember is to understand
the requirements and whether the end product (the system) will add value to the business. By keeping
the system as simple as possible, it can be deployed in a timely manner.
Understand the Requirements
Designing a system means defining what inputs to use in order to meet the specified requirements. In
the case of building a desktop computer, you first need to understand what it will be used for: a
gaming setup has different requirements from a desktop that will be used for word processing. Many
factors need to be considered: who the users of the system are, what legacy infrastructure must be
incorporated, whether the system needs to be available at all times with no downtime because it is
mission-critical for the business, whether the company has the in-house capability to maintain the
system or it must be administered by a third-party vendor, and so on.
An Example of Application
Business scenario: a big company wants to design a system that tracks the top X best-selling items in
its online store, to be offered to customers based on their purchase habits or the contents of their
online carts. The online store generates millions of visits every month and handles thousands of
queries per second. Customers do not want a slow website, so the web servers have to be quick; page
latency is important.
Based on the information above, a distributed non-relational database can be used in this system to
vend data to the web services, since access to it is simple: given a category, return the top X
best-selling items to the customer. The information can be boiled down to a relatively small dataset
distributed through a fast distributed database with hourly updates. Customers don't really care if a
hot new item isn't reflected in the top sellers instantaneously, which means consistency is not hugely
important as new sales come in. But the website must not go down when big traffic comes to the
site, so availability and partition tolerance are important.
NoSQL with Cassandra might be the better choice, as this database prioritizes availability. Spark
Streaming can be used to process the streaming data, because computing a metric over a sliding
window of time, updated at each increment, is exactly what Spark Streaming does. How will the input
stream into the data processor? Kafka or Flume can be used, as both are compatible with Spark
Streaming.
Table of Figures
Figure 1: Hadoop Ecosystem (Sinha, 2020)
Figure 2: HDFS Architecture (Bakshi, 2020)
Figure 3: MapReduce Architecture (Geeksforgeeks, 2020)
Figure 4: YARN Architecture (Murthy, 2012)
Figure 5: Hortonworks Hadoop after Startup from a VM
Figure 6: Accessing Hadoop Using PuTTY
Figure 7: Accessing Hadoop file system
Figure 8: Unzip movie lens dataset
Figure 9: Transfer file from Local source to HDFS
Figure 10: Python script to run count of ratings
Figure 11: Command to run the python script to count ratings from <ratings.dat> file
Figure 12: Results after running the script
Figure 13: MoviesBreakdown.py Script
Figure 14: Commands to run the script and the ratings.dat file
Figure 15: Sorted ratings count result of the script
Figure 16: Ambari Log-in
Figure 17: Uploading datafiles (part1)
Figure 18: Uploading datafiles (part2)
Figure 19: Uploading datafiles (part3)
Figure 20: Going to Pig View UI
Figure 21: Pig View UI
Figure 22: Pig script to count movies with most ratings
Figure 23: Result of Pig script
Figure 24: Spark Architecture (Apache Software Foundation, 2018)
Figure 25: Declaring Spark Version to Use
Figure 26: Import needed Classes
Figure 27: Function for row and dictionary
Figure 28: Script to Get Movies with highest Count of Rate
Figure 29: Command to Run Spark Script
Figure 30: Result of Spark Script
Figure 31: Spark Stack (Madaka, 2019)
Figure 32: How Hive Works (Hortonworks Inc., n.d.)
Figure 33: How Sqoop import or export data from RDBMS (Hortonworks Inc., n.d.)
Figure 34: Kafka Architecture (Boyer, Osowski, & Almaraz, n.d.)
Figure 35: Flume Architecture
Figure 36: Spark streaming Architecture
References
Abeywardana, P. (2020, August 9). The Touch of Relational Databases on Hadoop. Retrieved from https://towardsdatascience.com/the-touch-of-relational-databases-on-hadoop-1a968cc16b61

adarsh. (2017, July). Handling failures in Hadoop, MapReduce and YARN. Retrieved from https://timepasstechies.com/handling-failures-hadoopmapreduce-yarn/

Alten-Lorenz, A. (2015, June 3). Welcome to the Apache Flume wiki! Retrieved from https://cwiki.apache.org/confluence/display/FLUME

Apache Hadoop. (n.d.). Apache Hadoop. Retrieved from https://hadoop.apache.org/

Apache Software Foundation. (2016). Apache Cassandra. Retrieved from https://cassandra.apache.org/doc/latest/architecture/overview.html

Apache Software Foundation. (2018). Apache Spark FAQ. Retrieved from https://spark.apache.org/faq.html

Apache Software Foundation. (2019, January 18). Apache Sqoop. Retrieved from https://sqoop.apache.org/

Apache Software Foundation. (n.d.). Apache Kafka. Retrieved from https://kafka.apache.org/

Bakshi, A. (2020, November 25). Apache Hadoop HDFS Architecture. Retrieved from https://www.edureka.co/blog/apache-hadoop-hdfs-architecture/

Bhadani, N. (2021, January 28). Apache Spark Structured Streaming. Retrieved from https://medium.com/expedia-group-tech/apache-spark-structured-streaming-first-streaming-example-1-of-6-e8f3219748ef

Boyer, J., Osowski, R., & Almaraz, J. (n.d.). Kafka Overview. Retrieved from https://ibm-cloud-architecture.github.io/refarch-eda/technology/kafka-overview/#architecture

Contributor, T. (2019, March). Data streaming. Retrieved from https://searchnetworking.techtarget.com/definition/data-streaming

Geeksforgeeks. (2020, September 10). MapReduce Architecture. Retrieved from https://www.geeksforgeeks.org/mapreduce-architecture/

Grouplens. (2009, January). MovieLens 10M Dataset. Retrieved from https://grouplens.org/datasets/movielens/10m/

Hortonworks Inc. (n.d.). Data migration to Apache Hive. Retrieved from https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.0.1/migrating-data/content/hive_data_migration.html

IBM. (n.d.). Apache HBase. Retrieved from https://www.ibm.com/analytics/hadoop/hbase

Kaplarevic, V. (2020, May 25). Apache Hadoop Architecture Explained. Retrieved from https://phoenixnap.com/kb/apache-hadoop-architecture-explained

Klein, E. (2020, March 19). Cassandra vs. MongoDB vs. Hbase: A Comparison of NoSQL Databases. Retrieved from https://logz.io/blog/nosql-database-comparison/

Madaka, Y. (2019, June 27). Building an ML application using MLlib in Pyspark. Retrieved from https://towardsdatascience.com/building-an-ml-application-with-mllib-in-pyspark-part-1-ac13f01606e2

MovieLens. (n.d.). MovieLens 10M Dataset. Retrieved from https://grouplens.org/datasets/movielens/10m/

Murthy, A. (2012, August 15). Apache Hadoop YARN – Concepts and Applications. Retrieved from https://blog.cloudera.com/apache-hadoop-yarn-concepts-and-applications/

Sinha, S. (2020, November 25). Hadoop Ecosystem: Hadoop Tools for Crunching Big Data. Retrieved from https://www.edureka.co/blog/hadoop-ecosystem

The Apache Software Foundation. (2021, February 21). Welcome to Apache Pig! Retrieved from https://pig.apache.org/

Yegulalp, S. (2017, December 7). What is NoSQL? Databases for a cloud-scale future. Retrieved from https://www.infoworld.com/article/3240644/what-is-nosql-databases-for-a-cloud-scale-future.html

More Related Content

What's hot

Survey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization MethodsSurvey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization Methods
paperpublications3
 
A sql implementation on the map reduce framework
A sql implementation on the map reduce frameworkA sql implementation on the map reduce framework
A sql implementation on the map reduce framework
eldariof
 
Application of MapReduce in Cloud Computing
Application of MapReduce in Cloud ComputingApplication of MapReduce in Cloud Computing
Application of MapReduce in Cloud Computing
Mohammad Mustaqeem
 

What's hot (19)

Hadoop Seminar Report
Hadoop Seminar ReportHadoop Seminar Report
Hadoop Seminar Report
 
A data aware caching 2415
A data aware caching 2415A data aware caching 2415
A data aware caching 2415
 
Hadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoopHadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoop
 
Analysing of big data using map reduce
Analysing of big data using map reduceAnalysing of big data using map reduce
Analysing of big data using map reduce
 
Survey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization MethodsSurvey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization Methods
 
Hadoop Cluster Analysis and Assessment
Hadoop Cluster Analysis and AssessmentHadoop Cluster Analysis and Assessment
Hadoop Cluster Analysis and Assessment
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop Seminar Report
Hadoop Seminar ReportHadoop Seminar Report
Hadoop Seminar Report
 
Hadoop by kamran khan
Hadoop by kamran khanHadoop by kamran khan
Hadoop by kamran khan
 
B04 06 0918
B04 06 0918B04 06 0918
B04 06 0918
 
Seminar Report Vaibhav
Seminar Report VaibhavSeminar Report Vaibhav
Seminar Report Vaibhav
 
A sql implementation on the map reduce framework
A sql implementation on the map reduce frameworkA sql implementation on the map reduce framework
A sql implementation on the map reduce framework
 
Harnessing Hadoop and Big Data to Reduce Execution Times
Harnessing Hadoop and Big Data to Reduce Execution TimesHarnessing Hadoop and Big Data to Reduce Execution Times
Harnessing Hadoop and Big Data to Reduce Execution Times
 
Hadoop Distributed file system.pdf
Hadoop Distributed file system.pdfHadoop Distributed file system.pdf
Hadoop Distributed file system.pdf
 
Hadoop seminar
Hadoop seminarHadoop seminar
Hadoop seminar
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map Reduce
 
Application of MapReduce in Cloud Computing
Application of MapReduce in Cloud ComputingApplication of MapReduce in Cloud Computing
Application of MapReduce in Cloud Computing
 
MapReduce in Cloud Computing
MapReduce in Cloud ComputingMapReduce in Cloud Computing
MapReduce in Cloud Computing
 
002 Introduction to hadoop v3
002   Introduction to hadoop v3002   Introduction to hadoop v3
002 Introduction to hadoop v3
 

Similar to Understanding hadoop

Hadoop technology
Hadoop technologyHadoop technology
Hadoop technology
tipanagiriharika
 
Hadoop Mapreduce Performance Enhancement Using In-Node Combiners
Hadoop Mapreduce Performance Enhancement Using In-Node CombinersHadoop Mapreduce Performance Enhancement Using In-Node Combiners
Hadoop Mapreduce Performance Enhancement Using In-Node Combiners
ijcsit
 

Similar to Understanding hadoop (20)

B04 06 0918
B04 06 0918B04 06 0918
B04 06 0918
 
Distributed Systems Hadoop.pptx
Distributed Systems Hadoop.pptxDistributed Systems Hadoop.pptx
Distributed Systems Hadoop.pptx
 
Design Issues and Challenges of Peer-to-Peer Video on Demand System
Design Issues and Challenges of Peer-to-Peer Video on Demand System Design Issues and Challenges of Peer-to-Peer Video on Demand System
Design Issues and Challenges of Peer-to-Peer Video on Demand System
 
project report on hadoop
project report on hadoopproject report on hadoop
project report on hadoop
 
Cppt Hadoop
Cppt HadoopCppt Hadoop
Cppt Hadoop
 
Cppt
CpptCppt
Cppt
 
Cppt
CpptCppt
Cppt
 
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
 
2.1-HADOOP.pdf
2.1-HADOOP.pdf2.1-HADOOP.pdf
2.1-HADOOP.pdf
 
International Journal of Distributed and Parallel systems (IJDPS)
International Journal of Distributed and Parallel systems (IJDPS)International Journal of Distributed and Parallel systems (IJDPS)
International Journal of Distributed and Parallel systems (IJDPS)
 
PERFORMANCE EVALUATION OF BIG DATA PROCESSING OF CLOAK-REDUCE
PERFORMANCE EVALUATION OF BIG DATA PROCESSING OF CLOAK-REDUCEPERFORMANCE EVALUATION OF BIG DATA PROCESSING OF CLOAK-REDUCE
PERFORMANCE EVALUATION OF BIG DATA PROCESSING OF CLOAK-REDUCE
 
Hadoop technology
Hadoop technologyHadoop technology
Hadoop technology
 
Hadoop live online training
Hadoop live online trainingHadoop live online training
Hadoop live online training
 
Hadoop training in bangalore
Hadoop training in bangaloreHadoop training in bangalore
Hadoop training in bangalore
 
A Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis TechniquesA Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis Techniques
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component
 
Learn what is Hadoop-and-BigData
Learn  what is Hadoop-and-BigDataLearn  what is Hadoop-and-BigData
Learn what is Hadoop-and-BigData
 
PPT on Hadoop
PPT on HadoopPPT on Hadoop
PPT on Hadoop
 
Hadoop Mapreduce Performance Enhancement Using In-Node Combiners
Hadoop Mapreduce Performance Enhancement Using In-Node CombinersHadoop Mapreduce Performance Enhancement Using In-Node Combiners
Hadoop Mapreduce Performance Enhancement Using In-Node Combiners
 
Big Data Analysis and Its Scheduling Policy – Hadoop
Big Data Analysis and Its Scheduling Policy – HadoopBig Data Analysis and Its Scheduling Policy – Hadoop
Big Data Analysis and Its Scheduling Policy – Hadoop
 

Recently uploaded

原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查
原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查
原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查
ydyuyu
 
Russian Call girls in Abu Dhabi 0508644382 Abu Dhabi Call girls
Russian Call girls in Abu Dhabi 0508644382 Abu Dhabi Call girlsRussian Call girls in Abu Dhabi 0508644382 Abu Dhabi Call girls
Russian Call girls in Abu Dhabi 0508644382 Abu Dhabi Call girls
Monica Sydney
 
Abu Dhabi Escorts Service 0508644382 Escorts in Abu Dhabi
Abu Dhabi Escorts Service 0508644382 Escorts in Abu DhabiAbu Dhabi Escorts Service 0508644382 Escorts in Abu Dhabi
Abu Dhabi Escorts Service 0508644382 Escorts in Abu Dhabi
Monica Sydney
 
哪里办理美国迈阿密大学毕业证(本硕)umiami在读证明存档可查
哪里办理美国迈阿密大学毕业证(本硕)umiami在读证明存档可查哪里办理美国迈阿密大学毕业证(本硕)umiami在读证明存档可查
哪里办理美国迈阿密大学毕业证(本硕)umiami在读证明存档可查
ydyuyu
 
在线制作约克大学毕业证(yu毕业证)在读证明认证可查
在线制作约克大学毕业证(yu毕业证)在读证明认证可查在线制作约克大学毕业证(yu毕业证)在读证明认证可查
在线制作约克大学毕业证(yu毕业证)在读证明认证可查
ydyuyu
 
一比一原版(Offer)康考迪亚大学毕业证学位证靠谱定制
一比一原版(Offer)康考迪亚大学毕业证学位证靠谱定制一比一原版(Offer)康考迪亚大学毕业证学位证靠谱定制
一比一原版(Offer)康考迪亚大学毕业证学位证靠谱定制
pxcywzqs
 
pdfcoffee.com_business-ethics-q3m7-pdf-free.pdf
pdfcoffee.com_business-ethics-q3m7-pdf-free.pdfpdfcoffee.com_business-ethics-q3m7-pdf-free.pdf
pdfcoffee.com_business-ethics-q3m7-pdf-free.pdf
JOHNBEBONYAP1
 
一比一原版(Flinders毕业证书)弗林德斯大学毕业证原件一模一样
一比一原版(Flinders毕业证书)弗林德斯大学毕业证原件一模一样一比一原版(Flinders毕业证书)弗林德斯大学毕业证原件一模一样
一比一原版(Flinders毕业证书)弗林德斯大学毕业证原件一模一样
ayvbos
 
Russian Escort Abu Dhabi 0503464457 Abu DHabi Escorts
Russian Escort Abu Dhabi 0503464457 Abu DHabi EscortsRussian Escort Abu Dhabi 0503464457 Abu DHabi Escorts
Russian Escort Abu Dhabi 0503464457 Abu DHabi Escorts
Monica Sydney
 
Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...
gajnagarg
 

Recently uploaded (20)

原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查
原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查
原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查
 
Nagercoil Escorts Service Girl ^ 9332606886, WhatsApp Anytime Nagercoil
Nagercoil Escorts Service Girl ^ 9332606886, WhatsApp Anytime NagercoilNagercoil Escorts Service Girl ^ 9332606886, WhatsApp Anytime Nagercoil
Nagercoil Escorts Service Girl ^ 9332606886, WhatsApp Anytime Nagercoil
 
Vip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac Room
Vip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac RoomVip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac Room
Vip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac Room
 
Real Men Wear Diapers T Shirts sweatshirt
Real Men Wear Diapers T Shirts sweatshirtReal Men Wear Diapers T Shirts sweatshirt
Real Men Wear Diapers T Shirts sweatshirt
 
Russian Call girls in Abu Dhabi 0508644382 Abu Dhabi Call girls
Russian Call girls in Abu Dhabi 0508644382 Abu Dhabi Call girlsRussian Call girls in Abu Dhabi 0508644382 Abu Dhabi Call girls
Russian Call girls in Abu Dhabi 0508644382 Abu Dhabi Call girls
 
Abu Dhabi Escorts Service 0508644382 Escorts in Abu Dhabi
Abu Dhabi Escorts Service 0508644382 Escorts in Abu DhabiAbu Dhabi Escorts Service 0508644382 Escorts in Abu Dhabi
Abu Dhabi Escorts Service 0508644382 Escorts in Abu Dhabi
 
哪里办理美国迈阿密大学毕业证(本硕)umiami在读证明存档可查
哪里办理美国迈阿密大学毕业证(本硕)umiami在读证明存档可查哪里办理美国迈阿密大学毕业证(本硕)umiami在读证明存档可查
哪里办理美国迈阿密大学毕业证(本硕)umiami在读证明存档可查
 
20240510 QFM016 Irresponsible AI Reading List April 2024.pdf
20240510 QFM016 Irresponsible AI Reading List April 2024.pdf20240510 QFM016 Irresponsible AI Reading List April 2024.pdf
20240510 QFM016 Irresponsible AI Reading List April 2024.pdf
 
"Boost Your Digital Presence: Partner with a Leading SEO Agency"
"Boost Your Digital Presence: Partner with a Leading SEO Agency""Boost Your Digital Presence: Partner with a Leading SEO Agency"
"Boost Your Digital Presence: Partner with a Leading SEO Agency"
 
Meaning of On page SEO & its process in detail.
Meaning of On page SEO & its process in detail.Meaning of On page SEO & its process in detail.
Meaning of On page SEO & its process in detail.
 
在线制作约克大学毕业证(yu毕业证)在读证明认证可查
在线制作约克大学毕业证(yu毕业证)在读证明认证可查在线制作约克大学毕业证(yu毕业证)在读证明认证可查
在线制作约克大学毕业证(yu毕业证)在读证明认证可查
 
20240507 QFM013 Machine Intelligence Reading List April 2024.pdf
20240507 QFM013 Machine Intelligence Reading List April 2024.pdf20240507 QFM013 Machine Intelligence Reading List April 2024.pdf
20240507 QFM013 Machine Intelligence Reading List April 2024.pdf
 
一比一原版(Offer)康考迪亚大学毕业证学位证靠谱定制
一比一原版(Offer)康考迪亚大学毕业证学位证靠谱定制一比一原版(Offer)康考迪亚大学毕业证学位证靠谱定制
一比一原版(Offer)康考迪亚大学毕业证学位证靠谱定制
 
pdfcoffee.com_business-ethics-q3m7-pdf-free.pdf
pdfcoffee.com_business-ethics-q3m7-pdf-free.pdfpdfcoffee.com_business-ethics-q3m7-pdf-free.pdf
pdfcoffee.com_business-ethics-q3m7-pdf-free.pdf
 
一比一原版(Flinders毕业证书)弗林德斯大学毕业证原件一模一样
一比一原版(Flinders毕业证书)弗林德斯大学毕业证原件一模一样一比一原版(Flinders毕业证书)弗林德斯大学毕业证原件一模一样
一比一原版(Flinders毕业证书)弗林德斯大学毕业证原件一模一样
 
20240509 QFM015 Engineering Leadership Reading List April 2024.pdf
20240509 QFM015 Engineering Leadership Reading List April 2024.pdf20240509 QFM015 Engineering Leadership Reading List April 2024.pdf
20240509 QFM015 Engineering Leadership Reading List April 2024.pdf
 
Russian Escort Abu Dhabi 0503464457 Abu DHabi Escorts
Russian Escort Abu Dhabi 0503464457 Abu DHabi EscortsRussian Escort Abu Dhabi 0503464457 Abu DHabi Escorts
Russian Escort Abu Dhabi 0503464457 Abu DHabi Escorts
 
20240508 QFM014 Elixir Reading List April 2024.pdf
20240508 QFM014 Elixir Reading List April 2024.pdf20240508 QFM014 Elixir Reading List April 2024.pdf
20240508 QFM014 Elixir Reading List April 2024.pdf
 
APNIC Updates presented by Paul Wilson at ARIN 53
APNIC Updates presented by Paul Wilson at ARIN 53APNIC Updates presented by Paul Wilson at ARIN 53
APNIC Updates presented by Paul Wilson at ARIN 53
 
Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...
 

Understanding hadoop

  • 1. UNDERSTANDING HADOOP REX RAMOS rexjasonramos@yahoo.com 2021-04-20 Abstract This document will focus on application development using Hadoop environment to help solve some of real-world business problems
  • 2. 1 | P a g e Table of Contents Table of Contents....................................................................................................................................1 What is Hadoop ......................................................................................................................................2 HDFS Overview........................................................................................................................................2 MapReduce – Overview and Its Significance..........................................................................................3 How it distributes processing..............................................................................................................4 YARN - Overview.....................................................................................................................................4 Handling Failure ......................................................................................................................................5 Simple Application of MapReduce..........................................................................................................5 Programming Hadoop with Pig...............................................................................................................9 What is Spark and its Significance.........................................................................................................13 Relational Data and How it Works with Hadoop..................................................................................16 Hive ...................................................................................................................................................16 Sqoop ................................................................................................................................................17 Non-Relational Database and How it Works with Hadoop...................................................................17 No SQL...............................................................................................................................................17 HBase ................................................................................................................................................17 Cassandra..........................................................................................................................................18 MongoDB ..........................................................................................................................................18 Feeding Data into Cluster......................................................................................................................18 Kafka..................................................................................................................................................18 Flume ................................................................................................................................................19 Processing Streaming Data ...................................................................................................................19 Spark Streaming................................................................................................................................19 Designing a System ...............................................................................................................................20 Understand 
the Requirements .........................................................................................................20 An Example of Application................................................................................................................20 Table of Figures.....................................................................................................................................22 References ............................................................................................................................................23
  • 3. 2 | P a g e What is Hadoop Hadoop is an open-source framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather relying on a hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of may prone to failures. (Apache Hadoop, n.d.) Hadoop can be divided into four layers: (Kaplarevic, 2020) 1.) Distributer storage layer which is the Hadoop Distributed File System (HDFS) 2.) Cluster resource management which is the Yet Another Resource Negotiator (YARN) 3.) Processing framework layer – MapReduce, Spark 4.) Hadoop common utilities such as Pig, Hive, Sqoop, HBase, etc. First three items can be considered the main core of Hadoop ecosystem. Figure 1 Hadoop Ecosystem (Sinha, 2020) HDFS Overview Hadoop distributed file system is a distributed file system that provides high-throughput access to application data (Apache Hadoop, n.d.). Individual nodes in a Hadoop cluster has its own disk space, memory bandwidth, and processing. The design itself thinks that each node is unreliable. Thus, HDFS stores three copies of each dataset throughout the cluster. (Kaplarevic, 2020)
  • 4. 3 | P a g e Figure 2: HDFS Architecture (Bakshi, 2020) MapReduce – Overview and Its Significance MapReduce is one of the core pieces of Hadoop that's provided along with HDFS and yarn. It provides for distributing the processing of your data on your cluster. Its job is to divide all your data up into partitions that can be processed in parallel across your cluster and it manages how that's done and how failure is handled. It maps and reduces data. Mapping organizes the information as it comes and transform data into key-value pair. Mapping won’t care for other chunks of input data and in these sense, parallel operation is being done. Shuffle will put all together all the information for each unique key and sort those values associated with the key. Reducer is aggregating data together using the key from the pair. MapReduce also provides a mechanism for handling failures such as particular node going bad or down, etc. Figure 3: MapReduce Architecture (Geeksforgeeks, 2020)
  • 5. 4 | P a g e How it distributes processing Depending on how big data is, it might choose to distribute the processing of multiple tasks on the same computer and the same computer where it might distribute them amongst multiple computers. In a nutshell, input data will be partitioned and will be assigned to multiple nodes to do the processing. Mapping stage can be split up across different computers where each computer receives a different chunks of input data, and after shuffle stage you can have different computer responsible for different sets of keys while reducing the input data. YARN - Overview Figure 4: YARN Architecture (Murthy, 2012) One of the main issues for previous Hadoop version was its scalability. Because of MapReduce Ver.1 use to perform both task process and resource management by itself, resulting into a single master to handle everything. Here comes the bottleneck issue which causes scalability issue. With the introduction of YARN (Yet Another Resource Negotiator), It solves this problem. The main idea for this is separating MapReduce from resource management and job scheduling instead of a single master to control everything. With YARN, job scheduling and resources management by this component.
  • 6. 5 | P a g e Handling Failure Hadoop handles failure very well. Application master monitors tasks for errors or hanging. It is smart enough to restart nodes as needed or allocate it into another nodes. What if the application master fails? YARN will restart it and if it reaches the max limit and application master still fails, the resource manager will start a new instance of application master in which it will run on different nodes. The single point of failure is in the resource manager. What will happen if resource manager fails? With proper design this can be negated. During the design phase, it is imperative to setup a standby resource manager wherein if the main resource manager fails, this will kick in and will recover from the last backup point. In this way, the operations will still continue. (adarsh, 2017) Simple Application of MapReduce For this exercise, we will be using dataset from movie lens with 10M movie ratings, applied to 10,000 movies by 72,000 users. (MovieLens, n.d.) Link: http://files.grouplens.org/datasets/movielens/ml-10m.zip (Grouplens, 2009) We will access Hadoop using command line interface through Putty Figure 5: Hortonworks Hadoop after Startup from a VM
  • 7. 6 | P a g e Figure 6:Accessing Hadoop Using Putty • maria_dev – is the default user name for the Hadoop sandbox we will use • 127.0.0.1 – address to access Hadoop using Port “2222” Figure 7: Accessing Hadoop file system • Use maria_dev as password • Create a directory to store the dataset inside Hadoop file system [-mkdir <folderName>] • Get the dataset from source http://files.grouplens.org/datasets/movielens/ml-10m.zip (Grouplens, 2009) Password: maria_dev
  • 8. 7 | P a g e Figure 8: Unzip movie lens dataset • Unzip is already inherent on this linux distribution, so unzip file the using cmd [unzip <zipFile>] Figure 9:Transfer file from Local source to HDFS • Transfer file from local source to HDFS file folder [hadoop fs -copyFromLocal <sourceFile> <destinationFile>] • Check if transfer of file is successful by checking the destination of the file [hadoop fs -ls <destinationFile>] Figure 10: Python script to run count of ratings
Figure 11: Command to run the Python script to count ratings from the ratings.dat file

Figure 12: Figures after running the script

Another example is a script that determines which movies are rated most often, in terms of the count of ratings each has received. The following Python script is used (a hedged sketch appears after the figure caption):

Figure 13: MoviesBreakdown.py Script
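The MoviesBreakdown.py script in Figure 13 is also an image; a hedged sketch of a two-step mrjob job that counts ratings per movie and then sorts by count might look like this (names are assumptions):

    from mrjob.job import MRJob
    from mrjob.step import MRStep

    class MoviesBreakdown(MRJob):
        def steps(self):
            # Step 1 counts ratings per movie; step 2 sorts the counts.
            return [
                MRStep(mapper=self.mapper_get_ratings,
                       reducer=self.reducer_count_ratings),
                MRStep(reducer=self.reducer_sort_by_count),
            ]

        def mapper_get_ratings(self, _, line):
            (userID, movieID, rating, timestamp) = line.split("::")
            yield movieID, 1

        def reducer_count_ratings(self, movieID, counts):
            # Funnel every (count, movieID) pair to one key so step 2 can sort.
            yield None, (sum(counts), movieID)

        def reducer_sort_by_count(self, _, count_movie_pairs):
            for count, movieID in sorted(count_movie_pairs):
                yield movieID, count

    if __name__ == "__main__":
        MoviesBreakdown.run()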
Figure 14: Commands to run the script on the ratings.dat file

Running the script against ratings.dat produces the movies with the highest ratings counts.

Figure 15: Sorted ratings count result of the script

Based on the counts, movieID 296 (Pulp Fiction) is the most rated with 34,864 ratings, followed by movieID 356 (Forrest Gump) with 34,457.

Programming Hadoop with Pig

What is Pig? According to Apache, it is a platform for analyzing large datasets, consisting of a high-level language for expressing data analysis programs coupled with infrastructure for evaluating those programs (The Apache Software Foundation, 2021). What makes it an upgrade over writing raw MapReduce scripts is ease of programming: it lets you write SQL-like syntax to define the map and reduce steps, and it is highly extensible with user-defined functions.

It is much easier to execute Pig from Ambari. Ambari provides a UI to manage the overall Hadoop environment; it is a one-stop shop for handling Hadoop-related tasks. Using the Hortonworks Docker sandbox, log into the Ambari UI by browsing to 127.0.0.1:8080, with username maria_dev and password maria_dev.
Figure 16: Ambari Log-in

To upload data files, follow Figure 17.

Figure 17: Uploading datafiles (part 1)
Figure 18: Uploading datafiles (part 2)

Figure 19: Uploading datafiles (part 3)

Go to the user folder and then your assigned folder, click Upload, and drop the data files there.
To start or create a Pig script, follow Figure 20: click the menu icon, go to Pig View, and use the + button to create or re-run saved Pig scripts.

Figure 20: Going to Pig View UI

Figure 21: Pig View UI

Using the same scenario as the MapReduce example, we will determine which movie has the highest count of ratings. The PigStorage() function was not used to handle the delimiter, since it only supports single-character delimiters; one workaround for the double-colon ('::') delimiter is to use a regex to split each line. (If the delimiter were a single character, PigStorage() could handle it directly.) The script used is shown below; a hedged reconstruction follows the figure caption.

Figure 22: Pig script to count movies with most ratings
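The script in Figure 22 is an image; under the regex workaround described above, a Pig Latin sketch of it might look like the following (the HDFS path and alias names are assumptions):

    -- Load raw lines, then split them on '::' with STRSPLIT,
    -- since PigStorage() only supports single-character delimiters.
    lines = LOAD '/user/maria_dev/ml-10M/ratings.dat'
            USING TextLoader() AS (line:chararray);
    ratings = FOREACH lines GENERATE
            FLATTEN(STRSPLIT(line, '::', 4))
            AS (userID:chararray, movieID:chararray,
                rating:chararray, ratingTime:chararray);
    grouped = GROUP ratings BY movieID;
    ratingCounts = FOREACH grouped GENERATE
            group AS movieID, COUNT(ratings) AS countRating;
    sortedCounts = ORDER ratingCounts BY countRating DESC;
    DUMP sortedCounts;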
The result of the script is the same as Figure 15, showing that Pulp Fiction and Forrest Gump have the highest counts of ratings.

Figure 23: Result of Pig script

What is Spark and its Significance

Spark is a fast, general-purpose processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or in Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and newer workloads such as streaming, interactive queries, and machine learning (Apache Software Foundation, 2018).

Spark gives you a lot of flexibility: you can write real scripts in real programming languages such as Java, Scala, or Python to do complex manipulation, transformation, and analysis of data. What sets it apart from technologies like Pig is the rich ecosystem on top of Spark that supports complicated workloads such as machine learning, data mining, graph analysis, and streaming data. What makes it fast compared to MapReduce is that it caches working data in RAM instead of going to disk; as we know, accessing RAM is much quicker than retrieving data from disk. Its cluster manager support includes Hadoop YARN, Kubernetes, and Apache Mesos, and it can also run its own standalone cluster manager (Apache Software Foundation, 2018).

Spark is built around the concept of the Resilient Distributed Dataset (RDD), an object that represents a dataset. You apply functions to an RDD to transform it, in whatever way you want, into a new RDD.
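As a small illustration (a sketch, assuming the PySpark shell, where sc, the SparkContext, is predefined), transformations build new RDDs lazily and an action such as collect() triggers the computation:

    rdd = sc.parallelize([1, 2, 3, 4, 5])        # an RDD from a local list
    squared = rdd.map(lambda x: x * x)           # transformation -> new RDD
    evens = squared.filter(lambda x: x % 2 == 0) # another transformation
    print(evens.collect())                       # action: [4, 16]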
Figure 24: Spark Architecture (Apache Software Foundation, 2018)

Spark runs as version 1 by default unless the user specifies which version should run the script. From the CLI, use the command shown in Figure 25 (on the Hortonworks sandbox this is typically done by setting the SPARK_MAJOR_VERSION environment variable).

Figure 25: Declaring Spark Version to Use

Returning to the earlier MapReduce scenario, we will determine which movie is the most rated. The script is written in PySpark.

Figure 26: Import needed Classes

First, create functions to extract only the needed data from the datasets. The function loadMovieName builds a dictionary mapping each movieID to its corresponding movie name. The function parseInput(line) creates a row with the movieID and rating, to be used in the DataFrame.

Figure 27: Function for row and dictionary
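Figures 26 through 28 show the script as images; a hedged reconstruction of the whole PySpark program, with assumed file paths and an assumed application name, might look like this:

    from pyspark.sql import SparkSession, Row

    def loadMovieName():
        # Build a movieID -> movie name dictionary from movies.dat.
        movieNames = {}
        with open("ml-10M/movies.dat") as f:
            for line in f:
                fields = line.split("::")
                movieNames[int(fields[0])] = fields[1]
        return movieNames

    def parseInput(line):
        # ratings.dat line: userID::movieID::rating::timestamp
        fields = line.split("::")
        return Row(movieID=int(fields[1]), rating=float(fields[2]))

    if __name__ == "__main__":
        spark = SparkSession.builder.appName("MostRatedMovies").getOrCreate()
        movieNames = loadMovieName()
        lines = spark.sparkContext.textFile("hdfs:///user/maria_dev/ml-10M/ratings.dat")
        ratings = spark.createDataFrame(lines.map(parseInput))
        # Count ratings per movie and take the top 10 by count.
        topMovies = ratings.groupBy("movieID").count().orderBy("count", ascending=False)
        for row in topMovies.take(10):
            print(movieNames[row["movieID"]], row["count"])
        spark.stop()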
Figure 28: Script to Get Movies with Highest Count of Ratings

To run the Spark script, use the command spark-submit <script-file>.

Figure 29: Command to Run Spark Script

The result shows Pulp Fiction with the highest count of ratings, followed by Forrest Gump.

Figure 30: Result of Spark Script
Spark's suite of components includes Spark SQL, Spark Streaming, MLlib (used for machine learning), and GraphX (graph processing and visualization). This stack is not covered in detail in this document; below is a snapshot of the components that can be used on top of Spark.

Figure 31: Spark Stack (Madaka, 2019)

Relational Data and How it Works with Hadoop

Hive

Hive gives Hadoop a SQL-like way of working with data. It translates SQL into MapReduce or Tez commands and runs them on the YARN cluster manager to execute the query, which makes it a nice tool for users who are adept at SQL queries (Abeywardana, 2020). Hive is highly scalable, working with big data across a cluster, and it is well suited to data warehouse applications and Online Analytical Processing (OLAP). In a nutshell, Hive is a layer on top of HDFS that provides a schema-like structure and a simple way to query data across the Hadoop cluster using SQL; since execution is batch-oriented, it is suitable for large analytic queries.

Figure 32: How Hive Works (Hortonworks Inc., n.d.)
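As an illustration of Hive's SQL-like interface, here is a sketch using the pyhive client (an assumption; any HiveServer2 client would do) against a hypothetical ratings table, with HiveServer2 on its usual default port:

    from pyhive import hive

    # Connect to HiveServer2 on the sandbox (port 10000 is the usual default).
    conn = hive.connect(host="127.0.0.1", port=10000, username="maria_dev")
    cursor = conn.cursor()

    # Standard SQL, translated by Hive into MapReduce or Tez jobs on YARN.
    cursor.execute("""
        SELECT movieID, COUNT(*) AS ratingCount
        FROM ratings
        GROUP BY movieID
        ORDER BY ratingCount DESC
        LIMIT 10
    """)
    for row in cursor.fetchall():
        print(row)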
Sqoop

Sqoop is a tool designed for efficiently transferring bulk data between Hadoop and structured datastores such as relational databases (Apache Software Foundation, 2019). Because the transfer runs in parallel, it is efficient at moving large datasets. Sqoop can also do incremental imports from an RDBMS into Hadoop: it can transfer only the data modified since the last timestamp at which it moved data, or only the rows whose check-column value is greater than a given value.

Figure 33: How Sqoop imports or exports data from an RDBMS (Hortonworks Inc., n.d.)

Non-Relational Database and How it Works with Hadoop

No SQL

NoSQL stores data in a schema-less or free-form fashion, and the data is denormalized. NoSQL is useful for quick access to data, with an emphasis on speed and simplicity of access over transactional reliability or consistency, and for storing large volumes of data without a fixed schema. Migration of unstructured data from multiple sources is also fast, since the data is kept in its original form for flexibility (Yegulalp, 2017).

HBase

HBase is a column-oriented, non-relational database management system that runs on top of HDFS. It provides a fault-tolerant way of storing sparse datasets, and it is well suited to real-time data processing and random read/write access to large volumes of data. HBase doesn't support SQL, but it can be used with Hive, a SQL-like query engine for batch processing of big data (IBM, n.d.). HBase allows many attributes to be grouped together into column families, such that the elements of a column family are all stored together. This differs from a row-oriented relational database, where all the columns of a given row are stored together. With HBase you must predefine the table schema and specify the column families. However, new columns can be added to families at any time,
making the schema flexible and able to adapt to changing application requirements (IBM, n.d.). HBase is modeled after Google's Bigtable.

Cassandra

The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault tolerance on commodity hardware or cloud infrastructure make it a strong platform for mission-critical data. Cassandra's support for replicating across multiple datacenters is best in class, providing lower latency for your users and the peace of mind of knowing you can survive regional outages (Apache Software Foundation, 2016). Its main advantage is best-in-class fault tolerance.

MongoDB

MongoDB is a schema-less database that stores data as JSON-like documents (binary JSON, or BSON). This provides flexibility and agility in the types of records that can be stored, and the fields can even vary from one document to another. Embedded documents also support faster queries through indexes and vastly reduce the I/O overhead generally associated with database systems. Along with this comes support for schema-on-write and dynamic schemas for easily evolving data structures (Klein, 2020).

Feeding Data into Cluster

Streaming data is a continuous transfer of data at a steady, high-speed rate (Contributor, 2019). In today's world it makes sense to analyze streaming data so that data-driven decisions can be made in real time. Imagine buying stocks on a stock exchange: you need up-to-date stock data to make a timely decision on whether to buy. There are several ways to ingest streaming data into a Hadoop cluster.

Kafka

Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications (Apache Software Foundation, n.d.). According to the Apache Kafka website, 80% of Fortune 100 companies use Kafka to ingest their streaming data because of its high throughput, scalable and fault-tolerant clusters, and high availability.

Figure 34: Kafka Architecture (Boyer, Osowski, & Almaraz, n.d.)
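To make the producer and consumer roles concrete, here is a minimal sketch using the kafka-python client (an assumption; many client libraries exist), against a broker on localhost:9092 and a hypothetical ratings topic:

    from kafka import KafkaProducer, KafkaConsumer

    # Producer: publish one MovieLens-style record to the 'ratings' topic.
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("ratings", b"1::122::5::838985046")
    producer.flush()

    # Consumer: read records from the same topic as they arrive.
    consumer = KafkaConsumer("ratings",
                             bootstrap_servers="localhost:9092",
                             auto_offset_reset="earliest")
    for message in consumer:
        print(message.value)
        break  # stop after the first record in this sketch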
The Kafka cluster sits at the center of this world of streaming data: it represents many processes running on many servers that together provide Kafka's storage and processing. Producers generate data and publish messages to topics; on the other end, consumers receive that data as it arrives, linking in the Kafka client libraries to read and process it.

Flume

Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Its main goal is to deliver data from applications to Apache Hadoop's HDFS. It has a simple, flexible architecture based on streaming data flows. It is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms, and it uses a simple, extensible data model that allows for online analytic applications (Alten-Lorenz, 2015).

Figure 35: Flume Architecture

Flume is a tool for collecting log data from distributed web servers; the collected data goes to HDFS for analysis. It ensures guaranteed data delivery because both the sending and receiving agents invoke a transaction.

Processing Streaming Data

What do we do with the data once it is ingested? We process it into meaningful information. There are several ways to process streaming data in the Hadoop ecosystem: Spark Streaming, Apache Storm, Apache Flink, and others. This document focuses only on Spark Streaming.

Spark Streaming

Apache Spark provides a streaming API for analyzing streaming data in much the same way we work with batch data. Apache Spark Structured Streaming is built on top of the Spark SQL API to leverage its optimizations. Spark Streaming is a processing engine that processes data in real time from sources and outputs it to external storage systems (Bhadani, 2021). A minimal sketch follows.
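The sketch below uses the classic DStream API (Structured Streaming offers a DataFrame-based alternative) to count ratings per movie in each micro-batch, reading from an assumed socket test source on localhost:9999:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", "RatingStream")
    ssc = StreamingContext(sc, 10)                    # 10-second micro-batches

    lines = ssc.socketTextStream("localhost", 9999)   # assumed test source
    counts = (lines.map(lambda line: (line.split("::")[1], 1))
                   .reduceByKey(lambda a, b: a + b))  # ratings per movie, per batch
    counts.pprint()                                   # print each batch's counts

    ssc.start()
    ssc.awaitTermination()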
Figure 36: Spark Streaming Architecture

Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results, also in batches. The continuous stream of data is represented by RDDs (Resilient Distributed Datasets).

Designing a System

There is no perfect formula for designing a system. The important thing is to understand the requirements and whether the end product will add value to the business. By keeping the system as simple as possible, it can be deployed in a timely manner.

Understand the Requirements

Designing a system means defining what inputs are needed to meet the specified requirements. In the case of building a desktop computer, you first need to understand what it will be used for: a gaming desktop has different requirements from a desktop used for word processing. Many factors must be considered: who the users of the system are, what legacy infrastructure will be incorporated, whether the system needs to be available at all times without downtime because it is mission-critical for the business, whether the company has in-house capability to maintain the system or whether it must be administered by a third-party vendor, and so on.

An Example Application

Business scenario: a large company wants to design a system that tracks its top X best-selling items in its online store, to be offered to customers based on their purchase habits or the contents of their online carts. The online store generates millions of visits every month and handles thousands of queries per second. Customers do not tolerate a slow website, so the web servers have to be quick and page latency matters.

Based on this information, a distributed non-relational database can be used in this system: it vends data to the web services, and access to it is simple, since serving the top X best-selling items in a category is all the customer needs. The information can be boiled down to a relatively small dataset distributed through a fast distributed database with hourly updates. Customers don't really care whether a hot new item is reflected in the top sellers instantaneously, which means consistency is not hugely important as new sales come in. But the website must not go down under heavy traffic, so partition tolerance is important.
NoSQL with Cassandra might be the better choice here, as that database prioritizes availability. Spark Streaming can be used to process the streaming data, because computing something over a sliding window of time, updated incrementally, is exactly what Spark Streaming does. How will the input stream into the data processor? Kafka or Flume can be used, as both are compatible with Spark Streaming.
Table of Figures

Figure 1: Hadoop Ecosystem (Sinha, 2020)
Figure 2: HDFS Architecture (Bakshi, 2020)
Figure 3: MapReduce Architecture (Geeksforgeeks, 2020)
Figure 4: YARN Architecture (Murthy, 2012)
Figure 5: Hortonworks Hadoop after Startup from a VM
Figure 6: Accessing Hadoop Using PuTTY
Figure 7: Accessing Hadoop file system
Figure 8: Unzip MovieLens dataset
Figure 9: Transfer file from local source to HDFS
Figure 10: Python script to run count of ratings
Figure 11: Command to run the Python script to count ratings from the ratings.dat file
Figure 12: Figures after running the script
Figure 13: MoviesBreakdown.py Script
Figure 14: Commands to run the script on the ratings.dat file
Figure 15: Sorted ratings count result of the script
Figure 16: Ambari Log-in
Figure 17: Uploading datafiles (part 1)
Figure 18: Uploading datafiles (part 2)
Figure 19: Uploading datafiles (part 3)
Figure 20: Going to Pig View UI
Figure 21: Pig View UI
Figure 22: Pig script to count movies with most ratings
Figure 23: Result of Pig script
Figure 24: Spark Architecture (Apache Software Foundation, 2018)
Figure 25: Declaring Spark Version to Use
Figure 26: Import needed Classes
Figure 27: Function for row and dictionary
Figure 28: Script to Get Movies with Highest Count of Ratings
Figure 29: Command to Run Spark Script
Figure 30: Result of Spark Script
Figure 31: Spark Stack (Madaka, 2019)
Figure 32: How Hive Works (Hortonworks Inc., n.d.)
Figure 33: How Sqoop imports or exports data from an RDBMS (Hortonworks Inc., n.d.)
Figure 34: Kafka Architecture (Boyer, Osowski, & Almaraz, n.d.)
Figure 35: Flume Architecture
Figure 36: Spark Streaming Architecture
References

Abeywardana, P. (2020, August 9). The Touch of Relational Databases on Hadoop. Retrieved from https://towardsdatascience.com/the-touch-of-relational-databases-on-hadoop-1a968cc16b61

adarsh. (2017, July). Handling failures in Hadoop, MapReduce and YARN. Retrieved from https://timepasstechies.com/handling-failures-hadoopmapreduce-yarn/

Alten-Lorenz, A. (2015, June 3). Welcome to the Apache Flume wiki! Retrieved from https://cwiki.apache.org/confluence/display/FLUME

Apache Hadoop. (n.d.). Apache Hadoop. Retrieved from https://hadoop.apache.org/

Apache Software Foundation. (2016). Apache Cassandra. Retrieved from https://cassandra.apache.org/doc/latest/architecture/overview.html

Apache Software Foundation. (2018). Apache Spark FAQ. Retrieved from https://spark.apache.org/faq.html

Apache Software Foundation. (2019, January 18). Apache Sqoop. Retrieved from https://sqoop.apache.org/

Apache Software Foundation. (n.d.). Apache Kafka. Retrieved from https://kafka.apache.org/

Bakshi, A. (2020, November 25). Apache Hadoop HDFS Architecture. Retrieved from https://www.edureka.co/blog/apache-hadoop-hdfs-architecture/

Bhadani, N. (2021, January 28). Apache Spark Structured Streaming. Retrieved from https://medium.com/expedia-group-tech/apache-spark-structured-streaming-first-streaming-example-1-of-6-e8f3219748ef

Boyer, J., Osowski, R., & Almaraz, J. (n.d.). Kafka Overview. Retrieved from https://ibm-cloud-architecture.github.io/refarch-eda/technology/kafka-overview/#architecture

Contributor, T. (2019, March). Data streaming. Retrieved from https://searchnetworking.techtarget.com/definition/data-streaming

Geeksforgeeks. (2020, September 10). MapReduce Architecture. Retrieved from https://www.geeksforgeeks.org/mapreduce-architecture/

Grouplens. (2009, January). MovieLens 10M Dataset. Retrieved from https://grouplens.org/datasets/movielens/10m/

Hortonworks Inc. (n.d.). Data migration to Apache Hive. Retrieved from https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.0.1/migrating-data/content/hive_data_migration.html

IBM. (n.d.). Apache HBase. Retrieved from https://www.ibm.com/analytics/hadoop/hbase

Kaplarevic, V. (2020, May 25). Apache Hadoop Architecture Explained. Retrieved from https://phoenixnap.com/kb/apache-hadoop-architecture-explained

Klein, E. (2020, March 19). Cassandra vs. MongoDB vs. Hbase: A Comparison of NoSQL Databases. Retrieved from https://logz.io/blog/nosql-database-comparison/

Madaka, Y. (2019, June 27). Building an ML application using MLlib in Pyspark. Retrieved from https://towardsdatascience.com/building-an-ml-application-with-mllib-in-pyspark-part-1-ac13f01606e2

MovieLens. (n.d.). MovieLens 10M Dataset. Retrieved from https://grouplens.org/datasets/movielens/10m/

Murthy, A. (2012, August 15). Apache Hadoop YARN – Concepts and Applications. Retrieved from https://blog.cloudera.com/apache-hadoop-yarn-concepts-and-applications/

Sinha, S. (2020, November 25). Hadoop Ecosystem: Hadoop Tools for Crunching Big Data. Retrieved from https://www.edureka.co/blog/hadoop-ecosystem

The Apache Software Foundation. (2021, February 21). Welcome to Apache Pig! Retrieved from https://pig.apache.org/

Yegulalp, S. (2017, December 7). What is NoSQL? Databases for a cloud-scale future. Retrieved from https://www.infoworld.com/article/3240644/what-is-nosql-databases-for-a-cloud-scale-future.html