What is Big Data? What is Hadoop? What is MapReduce? How do components such as Oozie, Hue, Hive, and Impala work? What are the main Hadoop distributions? What is Spark? What are the differences between batch and streaming processing? What are some Business Intelligence solutions, illustrated through business cases?
Big Data Warsaw v4 | "The Role of Hadoop Ecosystem in Advanced Analytics" - Robert Brunet
1. The Role of Hadoop Ecosystem in Advanced Analytics
Robert Brunet, PhD
Big Data Engineer
Warsaw, Poland
DXC Technology
4. Big Data Engineer Skills & Talents
• Software Engineering
• Mathematics
• Database architecture
• Extract-Transform-Load
• Distributed Computing
• Predictive Modelling
• Visualization
• Cloud Tools
Programs & Languages
• Hadoop | Spark
• Linux | Hue
• Azure | AWS
• Cloudera | Hortonworks
• SQL | Hive | Impala
• Python | Scala | R | Java
• Oozie | Airflow | XML
Big Data Analytics is an interdisciplinary field that uses scientific methods to extract insights from data.
5. Big Data Analytics
[Diagram: Database, Big Data, Data Preparation, Analytics, Business Intelligence]
6. Why Big Data?
There are currently up to 8 Vs, but let's focus on 3:
1. Volume
Terabytes and petabytes of data in the storage system.
2. Velocity
Near real-time processing, with update windows of fractions of a second.
3. Variety
Data is often not in a traditional format; it may come as video, SMS, PDF, etc.
7. Data Ownership
• Once we have a business problem to solve, we have to identify the data that will lead us to the solution.
• In most business projects, the data is owned by the company.
• In other cases, the data is bought from an external company.
• In some cases, data can be found in open-source repositories.
Own Data | Buy Data | Open Source
8. Data Structures
structured
Data resides in fixed fields within a record. This includes data contained in relational databases and spreadsheets.
semi-structured
A cross between structured and unstructured data. It is a subtype of structured data but lacks a strict data model: JSON, XML.
unstructured
Things that cannot be readily classified: images, maps, videos, etc.
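To make the semi-structured case concrete, here is a minimal Python sketch; the record fields are invented for illustration. Both records are valid JSON, but they do not share a fixed schema:

```python
import json

# Two records from the same feed with different fields: typical semi-structured data.
# All field names here are hypothetical.
records = [
    '{"id": 1, "name": "sensor-a", "temp": 21.5}',
    '{"id": 2, "name": "sensor-b", "tags": ["roof", "north"]}',
]

for raw in records:
    doc = json.loads(raw)              # parse JSON text into a Python dict
    print(doc["id"], doc.get("temp"))  # .get() tolerates the missing field
```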
9. Data Storage
Local directory based
MS Excel, MS Access, txt, and XML files stored on a workstation
Network based
Your organization’s database server connected to the intranet (SQL Server, SAP)
Cloud based
Data-as-a-Service (Hadoop on Azure, AWS or Google Cloud)
11. The Hadoop Ecosystem
The Hadoop Ecosystem refers to the various components of the
Apache Hadoop software library
12. Hadoop
• In the early 2000s, Doug Cutting was attempting to build an open-source search engine called Nutch.
• After Google published its paper on MapReduce in 2004, Doug Cutting developed the distributed computing part of Hadoop.
• The name Hadoop comes from Cutting's son's yellow toy elephant.
• Nowadays, Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
• It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
13. HDFS
• Hadoop Cluster contains a lot of data and this data has to be stored somewhere.
• Hadoop Distributed File System (HDFS) is the standard platform for data
storage.
• HDFS is a fault-tolerant, distributed file system written entirely in Java.
• The core benefit of HDFS is in its ability to store large files across multiple machines.
• HDFS is a durable, scalable and low-cost data storage.
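As a quick illustration of reading from HDFS, here is a minimal PySpark sketch; the namenode address and file path are placeholders for your cluster:

```python
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session; assumes a working Spark + Hadoop setup.
spark = SparkSession.builder.appName("hdfs-read-demo").getOrCreate()

# "namenode:8020" and the path below are placeholder values.
df = spark.read.text("hdfs://namenode:8020/data/events/2024/01/log.txt")
print(df.count())  # number of lines in the file
```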
14. MapReduce
• MapReduce was introduced by Google in its published paper “MapReduce: Simplified
Data Processing on Large Clusters” in 2004.
• A MapReduce program is composed of a map procedure, which performs filtering and sorting,
and a reduce method, which performs a summary operation.
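The canonical example is word count. Production Hadoop MapReduce jobs are usually written in Java; as a shorter sketch of the same map-and-reduce idea, here is a PySpark version (the input path is a placeholder):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-demo").getOrCreate()
lines = spark.sparkContext.textFile("hdfs://namenode:8020/data/input.txt")

counts = (
    lines.flatMap(lambda line: line.split())   # map phase: emit one record per word
         .map(lambda word: (word, 1))          # key-value pairs (word, 1)
         .reduceByKey(lambda a, b: a + b)      # reduce phase: sum counts per word
)
for word, n in counts.take(10):
    print(word, n)
```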
15. YARN
• Yet Another Resource Negotiator (YARN) is the resource management and job
scheduling technology in the open source Hadoop distributed processing framework.
• YARN is responsible for allocating system resources to the various applications running
in a Hadoop Cluster and scheduling tasks to be executed on different cluster nodes.
• The technology, released by the Apache Software Foundation in 2012, was one of the key features added in Hadoop 2.0.
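In practice, an application asks YARN for containers rather than managing nodes itself. Here is a minimal sketch of starting Spark on a YARN-managed cluster from Python, assuming the script runs where HADOOP_CONF_DIR points at the cluster configuration; the resource sizes are illustrative:

```python
from pyspark.sql import SparkSession

# Assumes HADOOP_CONF_DIR is set, so Spark can find the YARN ResourceManager.
spark = (
    SparkSession.builder
    .appName("yarn-demo")
    .master("yarn")                           # let YARN allocate executors
    .config("spark.executor.instances", "4")  # illustrative sizing
    .config("spark.executor.memory", "2g")
    .getOrCreate()
)
print(spark.sparkContext.master)  # prints "yarn"
```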
16. The Big Data Interfaces (Part 1)
Hue
• Hue is an open source Analytics Workbench for browsing,
querying and visualizing data.
Shell
• The Linux console provides a way for the user to send text input to the kernel and receive text output.
17. The Big Data Interfaces (Part 2)
Databricks
• A notebook is a web-based interface to a document that contains runnable code, visualizations, and narrative text.
Airflow
• The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies (see the DAG sketch below).
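A minimal Airflow DAG sketch, assuming Airflow 2.4 or later; the DAG id, schedule, and task commands are invented placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical two-step pipeline: extract, then load.
with DAG(
    dag_id="daily_ingest",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",   # run once per day
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    load = BashOperator(task_id="load", bash_command="echo loading")

    extract >> load  # dependency: load runs only after extract succeeds
```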
18. Workflows
Oozie
• Oozie is a workflow scheduler system to manage Hadoop jobs.
• Oozie workflow jobs are Directed Acyclic Graphs (DAGs) of actions.
• Oozie supports several types of Hadoop jobs, such as Java map-reduce, streaming map-reduce, Pig, Hive, Sqoop, and DistCp.
19. SQL
Hive
Hive is used for querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and to query it using a SQL-like language called HiveQL. Use Hive to create tables and run complex operations on data.
Impala
Impala circumvents MapReduce to access the data directly through a specialized distributed query engine that is very similar to those found in commercial parallel RDBMSs. Use Impala to query data faster.
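For a flavor of HiveQL, here is a minimal PySpark sketch that runs Hive-style SQL; the table and column names are invented, and Hive support in Spark is assumed:

```python
from pyspark.sql import SparkSession

# enableHiveSupport lets Spark use a Hive metastore if one is configured.
spark = (
    SparkSession.builder
    .appName("hiveql-demo")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("CREATE TABLE IF NOT EXISTS sales (id INT, region STRING, amount DOUBLE)")

top = spark.sql("""
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
    ORDER BY total DESC
""")
top.show()
```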
20. Transfer data
Sqoop
Sqoop is a tool designed for efficiently transferring bulk data between structured datastores, such as relational databases, and Apache Hadoop.
JDBC
Java Database Connectivity (JDBC) is an application programming interface (API) for the Java programming language that defines how a client may access a database.
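A minimal sketch of pulling a relational table into Spark over JDBC; the URL, table name, and credentials are placeholders, and the matching JDBC driver jar must be available to Spark:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-demo").getOrCreate()

# All connection details below are illustrative placeholders.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://dbhost:5432/shop")
    .option("dbtable", "public.orders")
    .option("user", "etl_user")
    .option("password", "secret")
    .load()
)
df.show(5)
```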
21. Data Files
CSV
CSV is simple and ubiquitous. Many tools, like Excel, Google Sheets, and a host of others, can generate CSV files.
Parquet
Apache Parquet is column-oriented and designed to bring efficient columnar storage of data compared to row-based files like CSV. Parquet is built to support very efficient compression and encoding schemes.
JSON
JavaScript Object Notation (JSON) is an open-standard file format that uses human-readable text to transmit data objects consisting of attribute-value pairs and array data types (or any other serializable value).
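A minimal PySpark sketch of converting between these three formats; the paths are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("formats-demo").getOrCreate()

# Read a CSV file with a header row, letting Spark infer column types.
df = spark.read.csv("/data/orders.csv", header=True, inferSchema=True)

# Write the same data as Parquet (columnar, compressed) and as JSON lines.
df.write.mode("overwrite").parquet("/data/orders_parquet")
df.write.mode("overwrite").json("/data/orders_json")

# Reading Parquet back preserves the schema without inference.
print(spark.read.parquet("/data/orders_parquet").schema)
```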
22. Distributions
Cloudera
Cloudera is a software company that provides a software platform for data engineering, data warehousing, machine learning, and analytics that runs in the cloud or on premises. CDH is Cloudera's open-source platform distribution, including Apache Hadoop and Apache Spark.
Databricks
Databricks is a company founded by the original creators of Apache Spark, the first unified analytics engine, and aims to help clients with cloud-based big data processing and machine learning. Databricks develops a web-based platform for working with Spark that provides automated cluster management and IPython-style notebooks.
MapR
MapR provides access to a variety of data sources from a single computer cluster, including big data workloads such as Apache Hadoop and Apache Spark.
24. Other Components
Zookeeper
ZooKeeper is essentially a centralized service for distributed systems offering a hierarchical key-value store, which is used to provide a distributed configuration service, synchronization service, and naming registry for large distributed systems.
Kafka
Kafka is a messaging system widely used in two ways: (i) queuing, where queue consumers act as a worker group; and (ii) publish-subscribe, where each subscriber gets a copy of each message, so it acts like a notification system (see the sketch after this list).
Solr
Solr's major features include full-text search, hit highlighting, faceted search, real-time indexing, dynamic clustering, database integration, NoSQL features, and rich document (e.g., Word, PDF) handling.
Pig
Pig Latin is a high-level data flow language. Apache Pig uses a multi-query approach, which reduces the length of the code by up to 20 times and hence shortens the development period by almost 16 times.
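A minimal Kafka producer and consumer sketch using the kafka-python client; the broker address, topic, and group id are placeholders:

```python
from kafka import KafkaProducer, KafkaConsumer

# Producer: send a message to a hypothetical "events" topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("events", b"user signed up")
producer.flush()  # block until the message is actually delivered

# Consumer: read messages from the same topic as part of a worker group.
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    group_id="workers",            # consumers sharing a group_id split the work
    auto_offset_reset="earliest",  # start from the beginning if no offset is stored
)
for message in consumer:  # this loop blocks, waiting for new messages
    print(message.value)
```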
25. Spark
• Developed by Matei Zaharia at UC Berkeley; Spark became a top-level Apache project in 2014.
• Hadoop shares data through the file system (disk), while Spark shares data through memory, which is faster and gives lower latency.
• Apache Spark has as its architectural foundation the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines that is maintained in a fault-tolerant way.
• Spark and its RDDs were developed in response to limitations of the MapReduce paradigm, which reads input data from disk, maps a function across the data, reduces the results of the map, and stores the reduction results on disk. Spark's RDDs function as a working set for distributed programs and offer a restricted form of distributed shared memory.
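A minimal PySpark RDD sketch of the working-set idea; the data is generated inline, so nothing external is assumed:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# An RDD is an immutable, partitioned collection distributed over the cluster.
numbers = sc.parallelize(range(1_000_000), numSlices=8)

squares = numbers.map(lambda x: x * x)  # transformation: recorded lazily, not run yet
squares.cache()                         # keep the working set in cluster memory

# Actions trigger computation; the second one reuses the cached partitions.
print(squares.sum())
print(squares.count())
```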
26. Programming: Scala/PySpark
Scala
Scala is the core language for Spark and allows the parallel programming to be abstracted away. You will want to learn Scala if you want to extend Spark.
PySpark
PySpark is the Python library for Spark programming; it does not always achieve the same efficiency, but it is much easier to learn.
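The two APIs mirror each other closely. Here is a one-line aggregation in PySpark, with the near-identical Scala form noted in a comment; the data and column names are illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("api-demo").getOrCreate()
df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])

# Scala equivalent: df.groupBy("key").agg(sum("value")) -- same method names.
df.groupBy("key").agg(F.sum("value").alias("total")).show()
```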
27. Big Data Monitoring
Grafana
Grafana is free software, licensed under the Apache license, that allows the visualization and formatting of metric data. It lets you create dashboards and charts from multiple sources, including time-series databases such as Graphite, InfluxDB, and OpenTSDB.
Arcadia
Arcadia Data is an analytics and BI platform built for big data and data lakes. Unlike a traditional BI deployment, the platform promotes greater agility for faster time-to-insight, delivers faster responses and higher user concurrency on larger data volumes, and avoids middleware, further reducing IT overhead and complexity.
30. The Role of Hadoop Ecosystem in Advanced Analytics
Robert Brunet, PhD
Big Data Engineer
Warsaw, Poland
DXC Technology