Table of Contents
What is Hadoop
HDFS Overview
MapReduce – Overview and Its Significance
    How it distributes processing
YARN - Overview
Handling Failure
Simple Application of MapReduce
Programming Hadoop with Pig
What is Spark and its Significance
Relational Data and How it Works with Hadoop
    Hive
    Sqoop
Non-Relational Database and How it Works with Hadoop
    NoSQL
    HBase
    Cassandra
    MongoDB
Feeding Data into Cluster
    Kafka
    Flume
Processing Streaming Data
    Spark Streaming
Designing a System
    Understand the Requirements
    An Example of Application
Table of Figures
References
What is Hadoop
Hadoop is an open-source framework that allows for the distributed processing of large data sets
across clusters of computers using simple programming models. It is designed to scale up from single
servers to thousands of machines, each offering local computation and storage. Rather than relying on
hardware to deliver high availability, the library itself is designed to detect and handle failures at the
application layer, delivering a highly available service on top of a cluster of computers, each of which
may be prone to failures. (Apache Hadoop, n.d.)
Hadoop can be divided into four layers: (Kaplarevic, 2020)
1.) Distributed storage layer – the Hadoop Distributed File System (HDFS)
2.) Cluster resource management – Yet Another Resource Negotiator (YARN)
3.) Processing framework layer – MapReduce, Spark
4.) Hadoop common utilities such as Pig, Hive, Sqoop, HBase, etc.
The first three items can be considered the core of the Hadoop ecosystem.
Figure 1: Hadoop Ecosystem (Sinha, 2020)
HDFS Overview
The Hadoop Distributed File System (HDFS) is a distributed file system that provides high-throughput
access to application data (Apache Hadoop, n.d.). Each node in a Hadoop cluster has its own disk
space, memory, bandwidth, and processing power. The design assumes that each node is unreliable;
thus, HDFS stores three copies of each data block throughout the cluster by default. (Kaplarevic, 2020)
Figure 2: HDFS Architecture (Bakshi, 2020)
MapReduce – Overview and Its Significance
MapReduce is one of the core pieces of Hadoop, provided along with HDFS and YARN. It distributes
the processing of your data across your cluster: its job is to divide all your data into partitions that
can be processed in parallel, and it manages how that is done and how failures are handled. As the
name suggests, it maps and reduces data. The map phase organizes the input as it comes in and
transforms it into key-value pairs; each mapper works on its own chunk of input without caring about
the others, which is what makes the operation parallel. The shuffle phase then gathers all the values
belonging to each unique key and sorts them. Finally, the reduce phase aggregates the values
associated with each key. MapReduce also provides mechanisms for handling failures, such as a
particular node going bad or down.
Figure 3: MapReduce Architecture (Geeksforgeeks, 2020)
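The map, shuffle, and reduce phases described above can be illustrated with a small, self-contained Python sketch. This is a local simulation of the flow, not Hadoop itself; it counts how many times each rating value appears, assuming the `userID::movieID::rating::timestamp` record format used later in this document:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: turn each input record into a (key, value) pair.
    # Assumed record format: userID::movieID::rating::timestamp
    for line in lines:
        user_id, movie_id, rating, ts = line.split("::")
        yield rating, 1

def shuffle_phase(pairs):
    # Shuffle: gather all values for each unique key, with keys sorted.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_phase(groups):
    # Reduce: aggregate the values associated with each key.
    return {key: sum(values) for key, values in groups.items()}

lines = ["1::31::3::1260759144", "1::1029::3::1260759179", "2::31::4::835355493"]
print(reduce_phase(shuffle_phase(map_phase(lines))))  # {'3': 2, '4': 1}
```

In real MapReduce the mappers and reducers run on different nodes and the framework performs the shuffle over the network; the data flow, however, is exactly this chain.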
How it distributes processing
Depending on how big the data is, MapReduce might distribute the processing of multiple tasks on
the same computer, or it might distribute them amongst multiple computers. In a nutshell, the input
data is partitioned and assigned to multiple nodes for processing. The map stage can be split across
different computers, with each computer receiving a different chunk of the input data, and after the
shuffle stage different computers can be responsible for reducing different sets of keys.
YARN - Overview
Figure 4: YARN Architecture (Murthy, 2012)
One of the main issues with the previous Hadoop version was scalability. MapReduce version 1
performed both task processing and resource management by itself, leaving a single master to handle
everything; that master became a bottleneck, which in turn limited scalability. The introduction of
YARN (Yet Another Resource Negotiator) solved this problem. The main idea is to separate resource
management and job scheduling from MapReduce, instead of having a single master control
everything. With YARN, job scheduling and resource management are handled by this dedicated
component.
Handling Failure
Hadoop handles failure very well. The application master monitors tasks for errors or hangs, and it is
smart enough to restart tasks as needed or reallocate them to other nodes. What if the application
master itself fails? YARN will restart it, and if it keeps failing up to the maximum retry limit, the
resource manager will start a new instance of the application master on a different node. The single
point of failure is the resource manager itself. What happens if the resource manager fails? With
proper design this can be mitigated: during the design phase, it is imperative to set up a standby
resource manager that kicks in if the main resource manager fails and recovers from the last
checkpoint. In this way, operations can continue. (adarsh, 2017)
Simple Application of MapReduce
For this exercise, we will be using the MovieLens dataset with 10 million movie ratings, applied to
10,000 movies by 72,000 users. (MovieLens, n.d.)
Link: http://files.grouplens.org/datasets/movielens/ml-10m.zip (Grouplens, 2009)
We will access Hadoop through a command-line interface using PuTTY.
Figure 5: Hortonworks Hadoop after Startup from a VM
Figure 6: Accessing Hadoop Using PuTTY
• maria_dev – the default username for the Hadoop sandbox we will use
• 127.0.0.1 – the local address used to access Hadoop, over port 2222
Figure 7: Accessing Hadoop file system
• Use maria_dev as the password
• Create a directory to store the dataset inside the Hadoop file system [hadoop fs -mkdir <folderName>]
• Get the dataset from the source http://files.grouplens.org/datasets/movielens/ml-10m.zip
(Grouplens, 2009)
Figure 8: Unzip movie lens dataset
• The unzip utility is already present on this Linux distribution, so unzip the file using the command
[unzip <zipFile>]
Figure 9: Transfer file from Local source to HDFS
• Transfer file from local source to HDFS file folder
[hadoop fs -copyFromLocal <sourceFile> <destinationFile>]
• Check if transfer of file is successful by checking the destination of the file
[hadoop fs -ls <destinationFile>]
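Put together, the command-line steps above look roughly like the following (the directory names are illustrative, not copied from the figures; adjust them to your own HDFS home directory):

```shell
# Create a working directory in HDFS to hold the dataset
hadoop fs -mkdir ml-10m

# Copy ratings.dat from the local filesystem into HDFS
hadoop fs -copyFromLocal ratings.dat ml-10m/ratings.dat

# Verify the transfer by listing the destination
hadoop fs -ls ml-10m
```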
Figure 10: Python script to run count of ratings
Figure 11: Command to run the python script to count ratings from <ratings.dat> file
Figure 12: Figures after running the script
Another example is a script that determines which movies are the most rated, in terms of the count
of ratings each has received, using this Python script:
Figure 13: MoviesBreakdown.py Script
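The script in Figure 13 is shown only as an image; below is a hedged, standalone Python sketch of the same counting logic (a local stand-in, not the actual script from the figure), assuming the ml-10m ratings.dat format `userID::movieID::rating::timestamp`:

```python
from collections import Counter

def count_ratings_per_movie(lines):
    # Count how many ratings each movieID has received.
    counts = Counter()
    for line in lines:
        user_id, movie_id, rating, ts = line.split("::")
        counts[movie_id] += 1
    return counts

# Tiny sample; in the exercise the input is the full ratings.dat file.
lines = ["1::296::5::1260759144", "2::296::4::835355493", "3::356::5::835355681"]
for movie_id, n in count_ratings_per_movie(lines).most_common():
    print(movie_id, n)  # movie 296 first (2 ratings), then 356 (1 rating)
```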
Figure 14: Commands to run the script and the ratings.dat file
Running the script against ratings.dat produces the movies with the highest ratings counts.
Figure 15: Sorted ratings count result of the script
Based on the counts, movieID 296 (Pulp Fiction) is the most rated with 34,864 rating counts followed
by movieID 356 (Forrest Gump) with a rating count of 34,457.
Programming Hadoop with Pig
What is Pig? According to Apache, it is a platform for analyzing large data sets that consists of a
high-level language for expressing data analysis programs, coupled with infrastructure for evaluating
those programs. (The Apache Software Foundation, 2021)
What makes this an upgrade over writing MapReduce scripts by hand is the ease of programming: it
lets you write SQL-like syntax to define the map and reduce steps, and it is highly extensible with
user-defined functions.
It is much easier to execute Pig from Ambari. Ambari provides a UI to manage the overall Hadoop
environment; it is more of a one-stop shop for handling Hadoop-related tasks.
Using the Hortonworks Docker Sandbox, log into the Ambari UI by pointing a browser at
127.0.0.1:8080, using username maria_dev and password maria_dev.
Figure 16: Ambari Log-in
To upload data files, follow the steps shown in figure 17:
Figure 17: Uploading datafiles (part1)
Figure 18: Uploading datafiles (part2)
Figure 19: Uploading datafiles (part3)
To start or create a PIG script, follow figure 20.
Figure 20: Going to Pig View UI
Figure 21: Pig View UI
Using the same scenario as the MapReduce example, we will determine which movie has the highest
count of ratings. The PigStorage function was not used as the delimiter handler, since it only handles
single-character delimiters; one workaround for the double-colon delimiter is to use a regex, as in
this script:
Figure 22: Pig script to count movies with most ratings
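Since the script in Figure 22 is an image, a sketch of what such a Pig script might look like is given below. The HDFS path and the exact regex are assumptions for illustration, not a copy of the figure:

```pig
-- Load each line whole, then split the '::'-delimited fields with a regex,
-- since PigStorage() only supports single-character delimiters.
ratings = LOAD '/user/maria_dev/ml-10m/ratings.dat' AS (line:chararray);
parsed = FOREACH ratings GENERATE FLATTEN(
    REGEX_EXTRACT_ALL(line, '(\\d+)::(\\d+)::(\\d+\\.?\\d*)::(\\d+)'))
    AS (userID:int, movieID:int, rating:double, rtime:long);

-- Group by movie and count the ratings each one received.
grouped = GROUP parsed BY movieID;
counts  = FOREACH grouped GENERATE group AS movieID, COUNT(parsed) AS cnt;

-- Sort by count, highest first, and print the result.
sorted_counts = ORDER counts BY cnt DESC;
DUMP sorted_counts;
```

If the delimiter were a single character, the LOAD could instead use PigStorage() directly and skip the regex step.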
The result of the script, the same as in figure 15, shows that Pulp Fiction and Forrest Gump have the
highest counts of ratings.
Figure 23: Result of Pig script
What is Spark and its Significance
Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop
clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra,
Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to
MapReduce) and new workloads like streaming, interactive queries, and machine learning. (Apache
Software Foundation, 2018)
It gives you a lot of flexibility: you can write real scripts in real programming languages like Java,
Scala, or Python to do complex manipulation, transformation, and analysis of data. What sets it apart
from other technologies like Pig, for example, is the rich ecosystem on top of Spark that lets you do
all sorts of complicated things like machine learning, data mining, graph analysis, and streaming.
What makes it fast compared to MapReduce is that it caches data in RAM instead of on disk; as we
know, accessing RAM is much quicker than retrieving data from disk. Its supported cluster managers
include Hadoop YARN, Kubernetes, and Apache Mesos, and it can also run its own standalone cluster
manager. (Apache Software Foundation, 2018)
Spark is built around the concept of the Resilient Distributed Dataset (RDD), an object that represents
a dataset. You apply various functions to an RDD to transform it, in whatever way you want, into a
new RDD.
Figure 24: Spark Architecture (Apache Software Foundation, 2018)
By default, Spark will run version 1 unless the user specifies which version should run the scripts.
From the CLI, use the command shown in Figure 25.
Figure 25: Declaring Spark Version to Use
Going back to the earlier MapReduce scenario, we will determine which movie is the most rated. The
script was written using PySpark.
Figure 26: Import needed Classes
Create functions to extract only the needed data from the datasets. The function loadMovieName
builds a dictionary mapping each movieID to its corresponding movie name. The function
parseInput(line) creates a row with movieID and rating, to be used for the DataFrame.
Figure 27: Function for row and dictionary
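The functions in Figure 27 are shown only as an image; below is a hedged reconstruction in plain Python. A namedtuple stands in for pyspark.sql.Row so the sketch runs without Spark, and the field positions assume the ml-10m '::' formats (movies.dat: movieID::title::genres, ratings.dat: userID::movieID::rating::timestamp):

```python
from collections import namedtuple

# Local stand-in for pyspark.sql.Row in this sketch.
Row = namedtuple("Row", ["movieID", "rating"])

def loadMovieName(path="movies.dat"):
    # Build a movieID -> movie name dictionary from movies.dat.
    movieNames = {}
    with open(path, encoding="ISO-8859-1") as f:
        for line in f:
            fields = line.split("::")
            movieNames[int(fields[0])] = fields[1]
    return movieNames

def parseInput(line):
    # Turn one ratings.dat line into a row holding just movieID and rating.
    fields = line.split("::")
    return Row(movieID=int(fields[1]), rating=float(fields[2]))

print(parseInput("1::296::5::1260759144"))  # Row(movieID=296, rating=5.0)
```

In the actual PySpark script, parseInput is mapped over the ratings RDD and the resulting rows are turned into a DataFrame for the grouping and counting step.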
Figure 28: Script to Get Movies with highest Count of Rate
To run Spark script, use command “spark-submit <script-file>”.
Figure 29: Command to Run Spark Script
The result shows Pulp Fiction with the highest count of ratings by movie, followed by Forrest Gump.
Figure 30: Result of Spark Script
Spark's stack includes Spark SQL, Spark Streaming, MLlib (used for machine learning), and GraphX
(graph processing). This stack won't be covered in this document. Below is a snapshot of the
components that can be used on top of Spark.
Figure 31: Spark Stack (Madaka, 2019)
Relational Data and How it Works with Hadoop
Hive
Hive provides a SQL-like way to work with data in Hadoop. It translates SQL into MapReduce or Tez
commands and runs on the YARN cluster manager to execute queries, which makes it a nice tool for
users who are adept in SQL (Abeywardana, 2020). Hive is highly scalable, working with big data
across a cluster, and it is well suited to data warehouse applications and Online Analytical Processing
(OLAP). In a nutshell, Hive is a layer on top of HDFS that provides a schema-like structure and a
simple way to query data across the Hadoop cluster using SQL, and since execution is done in
batches, it is suitable for large analytic queries.
Figure 32: How Hive Works (Hortonworks Inc., n.d.)
Sqoop
Sqoop is a tool designed for efficiently transferring bulk data between Hadoop and structured
datastores such as relational databases (Apache Software Foundation, 2019). Since data is
transferred in parallel, it is efficient at moving large datasets. Sqoop can also do incremental imports
from an RDBMS into Hadoop: it can transfer only the rows added since the last import, based on a
timestamp or on a column whose value is greater than a given value.
Figure 33: How Sqoop import or export data from RDBMS (Hortonworks Inc., n.d.)
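A typical incremental import command might look like the following sketch; the connection string, table, and column names are placeholders for illustration, not values from this document:

```shell
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --table orders \
  --incremental lastmodified \
  --check-column updated_at \
  --last-value "2021-01-01 00:00:00" \
  --target-dir /user/maria_dev/orders
```

Here --check-column names the column Sqoop compares against, and --last-value records where the previous import stopped, so only newer rows are transferred.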
Non-Relational Database and How it Works with Hadoop
NoSQL
NoSQL databases store data in a schema-less or free-form fashion, and the data is de-normalized.
NoSQL is useful for quick access to data, with an emphasis on speed and simplicity of access rather
than reliable transactions or strict consistency, and for storing large volumes of data without a fixed
schema. Also, migration of unstructured data from multiple sources is fast, as the data is kept in its
original form for flexibility (Yegulalp, 2017).
HBase
HBase is a column-oriented, non-relational database management system that runs on top of HDFS.
It provides a fault-tolerant way of storing sparse datasets. It is well suited for real-time data
processing and random read/write access to large volumes of data. HBase doesn't support SQL, but
it can be used with Hive, a SQL-like query engine for batch processing of big data (IBM, n.d.).
HBase allows many attributes to be grouped together into column families, such that the elements
of a column family are all stored together. This is different from a row-oriented relational database,
where all the columns of a given row are stored together. With HBase you must predefine the table
schema and specify the column families; however, new columns can be added to families at any time,
making the schema flexible and able to adapt to changing application requirements (IBM, n.d.). HBase
is modeled after Google's Bigtable concept.
Cassandra
The Apache Cassandra database is the right choice when you need scalability and high availability
without compromising performance. Linear scalability and proven fault-tolerance on commodity
hardware or cloud infrastructure make it the perfect platform for mission-critical data. Cassandra's
support for replicating across multiple datacenters is best-in-class, providing lower latency for your
users and the peace of mind of knowing that you can survive regional outages (Apache Software
Foundation, 2016). Its main advantage is its best-in-class fault tolerance.
MongoDB
MongoDB is a schema-less database and stores data as JSON-like documents (binary JSON). This
provides flexibility and agility in the type of records that can be stored, and even the fields can vary
from one document to the other. Also, the embedded documents provide support for faster queries
through indexes and vastly reduce the I/O overload generally associated with database systems. Along
with this is support for schema on write and dynamic schema for easy evolving data structures (Klein,
2020).
Feeding Data into Cluster
Streaming data is a continuous transfer of data at a steady, high-speed rate (Contributor, 2019). In
today's world it makes sense to analyze streaming data so that data-driven decisions can be made in
real time. Imagine buying stocks on a stock exchange: you need up-to-date stock data to decide in a
timely manner whether to buy. There are several ways to ingest streaming data into a Hadoop
cluster.
Kafka
Apache Kafka is an open-source distributed event streaming platform used by thousands of companies
for high-performance data pipelines, streaming analytics, data integration, and mission-critical
applications (Apache Software Foundation, n.d.). According to the Apache Kafka website, 80% of
Fortune 100 companies use Kafka to ingest their streaming data because of its high throughput,
scalability, fault-tolerant clustering, and high availability.
Figure 34: Kafka Architecture (Boyer, Osowski, & Almaraz, n.d.)
The Kafka cluster sits at the center of this entire world of streaming data; it represents many
processes running on many servers that distribute Kafka's storage and processing. Producers
generate the data and publish messages to topics, while on the other end consumers link in the
Kafka libraries to subscribe to those topics, receiving and processing the data as it comes in.
Flume
Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and
moving large amounts of log data. Its main goal is to deliver data from applications to Apache Hadoop's
HDFS. It has a simple and flexible architecture based on streaming data flows. It is robust and fault
tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a
simple extensible data model that allows for online analytic applications (Alten-Lorenz, 2015).
Figure 35: Flume Architecture
Flume is a tool to collect log data from distributed web servers and deliver it to HDFS for analysis. It
ensures guaranteed data delivery because both the sending and receiving agents wrap the transfer
in a transaction.
Processing Streaming Data
What do we do with the data once it has been ingested? We process it into meaningful information.
There are several ways to process streaming data in the Hadoop ecosystem: Spark Streaming,
Apache Storm, Apache Flink, and others. This document will focus only on Spark Streaming.
Spark Streaming
Apache Spark provides a streaming API to analyze streaming data in pretty much the same way we
work with batch data. Apache Spark Structured Streaming is built on top of the Spark-SQL API to
leverage its optimization. Spark Streaming is a processing engine to process data in real-time from
sources and output data to external storage systems (Bhadani, 2021).
Figure 36: Spark streaming Architecture
Spark Streaming receives live input data streams and divides the data into batches, which are then
processed by the Spark engine to generate the final stream of results in batches. The continuous
stream of data is represented as a sequence of RDDs (Resilient Distributed Datasets).
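The micro-batch idea can be simulated in a few lines of plain Python. This is a local illustration only, not the Spark Streaming API: incoming records are collected into small fixed-size batches, and each batch is processed as a unit:

```python
def micro_batches(stream, batch_size):
    # Divide a continuous stream of records into fixed-size batches,
    # the way Spark Streaming slices a live stream into RDDs.
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

# Each batch is processed as a unit, producing a stream of results.
stream = iter(range(7))
results = [sum(batch) for batch in micro_batches(stream, 3)]
print(results)  # [3, 12, 6]
```

Spark Streaming does the slicing by time interval rather than by record count, but the resulting processing model (a sequence of small datasets handled one after another) is the same.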
Designing a System
There is no perfect formula for designing a system. The important things to remember are to
understand the requirements and to ask whether the end product will add value to the business. By
keeping the system as simple as possible, it can be deployed in a timely manner.
Understand the Requirements
Designing a system means defining what inputs are needed to meet the specified requirements. In
the case of building a desktop computer, you first need to understand what it will be used for: a
gaming setup has different requirements from a desktop that will be used for word processing.
There are many factors to consider: who the users of the system are, what legacy infrastructure
must be incorporated, whether the system needs to be available all the time without downtime
because it is mission-critical for the business, whether the company has the in-house capability to
maintain the system or needs a third-party vendor to administer it, and so on.
An Example of Application
Business scenario: a big company wants to design a system that tracks its top X best-selling items in
its online store, to be offered to customers based on their purchase habits or the contents of their
online carts. The online store generates millions of visits every month and thousands of queries per
second. Customers do not want a slow website, so the web servers have to be quick; page latency is
important.
Based on the information above, a distributed non-relational database can be used to vend the data
to the web services, since access to it is simple: given a category, return the top X best-selling items
to the customer. The information can be boiled down to a relatively small dataset that can be
distributed through some quick distributed database with hourly updates. Customers do not really
care if a new hot item is not reflected in the top sellers instantaneously, which means consistency is
not hugely important as new sales come in. But the website must not go down when big traffic
comes to the site, so partition tolerance is important.
NoSQL using Cassandra might be the better choice, as this database prioritizes availability. Spark
Streaming can be used to process the streaming data, because computing something over a sliding
window of time, updated incrementally, is exactly what Spark Streaming can do. How will the input
stream into the data processor? Kafka or Flume can be used, as both are compatible with Spark
Streaming.
Table of Figures
Figure 1: Hadoop Ecosystem (Sinha, 2020)
Figure 2: HDFS Architecture (Bakshi, 2020)
Figure 3: MapReduce Architecture (Geeksforgeeks, 2020)
Figure 4: YARN Architecture (Murthy, 2012)
Figure 5: Hortonworks Hadoop after Startup from a VM
Figure 6: Accessing Hadoop Using PuTTY
Figure 7: Accessing Hadoop file system
Figure 8: Unzip movie lens dataset
Figure 9: Transfer file from Local source to HDFS
Figure 10: Python script to run count of ratings
Figure 11: Command to run the python script to count ratings from <ratings.dat> file
Figure 12: Figures after running the script
Figure 13: MoviesBreakdown.py Script
Figure 14: Commands to run the script and the ratings.dat file
Figure 15: Sorted ratings count result of the script
Figure 16: Ambari Log-in
Figure 17: Uploading datafiles (part1)
Figure 18: Uploading datafiles (part2)
Figure 19: Uploading datafiles (part3)
Figure 20: Going to Pig View UI
Figure 21: Pig View UI
Figure 22: Pig script to count movies with most ratings
Figure 23: Result of Pig script
Figure 24: Spark Architecture (Apache Software Foundation, 2018)
Figure 25: Declaring Spark Version to Use
Figure 26: Import needed Classes
Figure 27: Function for row and dictionary
Figure 28: Script to Get Movies with highest Count of Rate
Figure 29: Command to Run Spark Script
Figure 30: Result of Spark Script
Figure 31: Spark Stack (Madaka, 2019)
Figure 32: How Hive Works (Hortonworks Inc., n.d.)
Figure 33: How Sqoop import or export data from RDBMS (Hortonworks Inc., n.d.)
Figure 34: Kafka Architecture (Boyer, Osowski, & Almaraz, n.d.)
Figure 35: Flume Architecture
Figure 36: Spark streaming Architecture
References
Abeywardana, P. (2020, August 9). The Touch of Relational Databases on Hadoop. Retrieved from https://towardsdatascience.com/the-touch-of-relational-databases-on-hadoop-1a968cc16b61
adarsh. (2017, July). Handling failures in Hadoop, MapReduce and YARN. Retrieved from https://timepasstechies.com/handling-failures-hadoopmapreduce-yarn/
Alten-Lorenz, A. (2015, June 3). Welcome to the Apache Flume wiki! Retrieved from https://cwiki.apache.org/confluence/display/FLUME
Apache Hadoop. (n.d.). Apache Hadoop. Retrieved from https://hadoop.apache.org/
Apache Software Foundation. (2016). Apache Cassandra. Retrieved from https://cassandra.apache.org/doc/latest/architecture/overview.html
Apache Software Foundation. (2018). Apache Spark FAQ. Retrieved from https://spark.apache.org/faq.html
Apache Software Foundation. (2019, January 18). Apache Sqoop. Retrieved from https://sqoop.apache.org/
Apache Software Foundation. (n.d.). Apache Kafka. Retrieved from https://kafka.apache.org/
Bakshi, A. (2020, November 25). Apache Hadoop HDFS Architecture. Retrieved from https://www.edureka.co/blog/apache-hadoop-hdfs-architecture/
Bhadani, N. (2021, January 28). Apache Spark Structured Streaming. Retrieved from https://medium.com/expedia-group-tech/apache-spark-structured-streaming-first-streaming-example-1-of-6-e8f3219748ef
Boyer, J., Osowski, R., & Almaraz, J. (n.d.). Kafka Overview. Retrieved from https://ibm-cloud-architecture.github.io/refarch-eda/technology/kafka-overview/#architecture
Contributor, T. (2019, March). Data streaming. Retrieved from https://searchnetworking.techtarget.com/definition/data-streaming
Geeksforgeeks. (2020, September 10). MapReduce Architecture. Retrieved from https://www.geeksforgeeks.org/mapreduce-architecture/
Grouplens. (2009, January). MovieLens 10M Dataset. Retrieved from https://grouplens.org/datasets/movielens/10m/
Hortonworks Inc. (n.d.). Data migration to Apache Hive. Retrieved from https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.0.1/migrating-data/content/hive_data_migration.html
IBM. (n.d.). Apache HBase. Retrieved from https://www.ibm.com/analytics/hadoop/hbase
Kaplarevic, V. (2020, May 25). Apache Hadoop Architecture Explained. Retrieved from https://phoenixnap.com/kb/apache-hadoop-architecture-explained
Klein, E. (2020, March 19). Cassandra vs. MongoDB vs. HBase: A Comparison of NoSQL Databases. Retrieved from https://logz.io/blog/nosql-database-comparison/
Madaka, Y. (2019, June 27). Building an ML application using MLlib in Pyspark. Retrieved from https://towardsdatascience.com/building-an-ml-application-with-mllib-in-pyspark-part-1-ac13f01606e2
MovieLens. (n.d.). MovieLens 10M Dataset. Retrieved from https://grouplens.org/datasets/movielens/10m/
Murthy, A. (2012, August 15). Apache Hadoop YARN – Concepts and Applications. Retrieved from https://blog.cloudera.com/apache-hadoop-yarn-concepts-and-applications/
Sinha, S. (2020, November 25). Hadoop Ecosystem: Hadoop Tools for Crunching Big Data. Retrieved from https://www.edureka.co/blog/hadoop-ecosystem
The Apache Software Foundation. (2021, February 21). Welcome to Apache Pig! Retrieved from https://pig.apache.org/
Yegulalp, S. (2017, December 7). What is NoSQL? Databases for a cloud-scale future. Retrieved from https://www.infoworld.com/article/3240644/what-is-nosql-databases-for-a-cloud-scale-future.html