DXC Proprietary and Confidential
The Role of Hadoop Ecosystem
in Advance Analytics
Robert Brunet, PhD
Big Data Engineer
Warsaw, Poland
DXC Technology
Welcome to the Zoo…
… the Zoo of Big Data
Big Data Engineer Skills & Talents
• Software Engineering
• Mathematics
• Database architecture
• Extract-Transform-Load
• Distributed Computing
• Predictive Modelling
• Visualization
• Cloud Tools
Programs & Languages
• Hadoop | Spark
• Linux | Hue
• Azure | AWS
• Cloudera | Hortonworks
• SQL | Hive | Impala
• Python | Scala | R | Java
• Oozie | Airflow | XML
Big Data Analytics is an interdisciplinary field that uses scientific methods to extract
insights from data.
[Diagram labels: Big Data, Data Preparation, Analytics, Business Intelligence, Database, Big Data Analytics]
Why Big Data?
Currently there are 8 Vs, but let’s focus on 3 Vs:
1. Volume
Terabytes to petabytes of stored data
2. Velocity
Near real-time data, with update windows of
fractions of a second
3. Variety
Data is not always in a traditional format. It
may come as video, SMS, PDF, etc.
Data Ownership
• Once we have a business problem to solve, we have to start to identify the
data that will bring us to the solution.
• In most business projects, the data is owned by the company.
• In other cases, the data is bought from an external company.
• In some cases, the data can be found in open-source repositories.
[Diagram labels: Own Data, Buy Data, Open Source]
Data Structures
structured
Data resides in fixed fields within a record.
This includes data contained in relational databases and spreadsheets.
semi-structured
Cross between structured and unstructured data.
It is a subtype of structured data, but lacks a strict data model: JSON, XML.
unstructured
Things that cannot be readily classified: images, maps, videos, etc.
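The three categories above can be illustrated with a minimal Python sketch. The sample records and the field_names helper are made up for illustration; they are not part of the original slides.

```python
import csv
import io
import json

# Structured: every record has the same fixed fields, as in a table or spreadsheet.
structured_text = "id,name,city\n1,Anna,Warsaw\n2,Piotr,Krakow\n"
rows = list(csv.DictReader(io.StringIO(structured_text)))

# Semi-structured: records describe themselves, but fields may be missing or nested.
semi_structured_text = '[{"id": 1, "name": "Anna", "tags": ["vip"]}, {"id": 2, "city": "Krakow"}]'
records = json.loads(semi_structured_text)

# Unstructured: raw bytes with no field layout at all (a stand-in for an image or video).
unstructured_blob = bytes([137, 80, 78, 71])


def field_names(record_list):
    """Return the sorted field names of each record."""
    return [sorted(record.keys()) for record in record_list]


print(field_names(rows))     # the same fields in every row
print(field_names(records))  # the fields differ from record to record
```

The structured rows all share one set of fields, while the JSON records do not, which is what "lacks a strict data model" means in practice.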
Data Storage
Local directory based
MS Excel, MS Access, txt, and XML files stored on a workstation
Network based
Your organization’s database server connected to the intranet (SQL Server, SAP)
Cloud based
Data-as-a-Service (Hadoop on Azure, AWS or Google Cloud)
Databases
Relational Databases
• Data is typically held in two-dimensional tables (rows and columns) that relate to each other through keys
Non-Relational Databases
• Typically referred to as NoSQL databases
The Hadoop Ecosystem
The Hadoop Ecosystem refers to the various components of the
Apache Hadoop software library
Hadoop
• In the early 2000s, Doug Cutting was attempting to build an open-source search
engine.
• After Google published its MapReduce paper in 2004, Doug Cutting
developed the distributed computing part, which became Hadoop.
• The name Hadoop comes from a yellow toy elephant that belonged to Cutting’s child.
• Nowadays, Hadoop is a framework that allows for the distributed
processing of large data sets across clusters of computers using
simple programming models.
• It is designed to scale up from single servers to thousands of machines, each
offering local computation and storage.
HDFS
• A Hadoop cluster contains a lot of data, and this data has to be stored somewhere.
• Hadoop Distributed File System (HDFS) is the standard platform for data
storage.
• HDFS is a fault-tolerant, distributed file system written in Java.
• The core benefit of HDFS is its ability to store large files across multiple machines.
• HDFS provides durable, scalable and low-cost data storage.
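As a rough illustration of storing one large file across several machines with fault tolerance, the sketch below splits a file into blocks and places copies of each block on different nodes. The function names, node names, tiny block size and round-robin placement are all assumptions made for readability; real HDFS uses far larger blocks and a smarter placement policy.

```python
def split_into_blocks(data, block_size):
    """Cut a file's bytes into fixed-size blocks (the last block may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication):
    """Assign every block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def survives_failure(placement, failed_node):
    """True if every block still has at least one replica on a healthy node."""
    return all(
        any(node != failed_node for node in holders)
        for holders in placement.values()
    )


# Hypothetical file and cluster, with tiny sizes so the result is easy to read.
blocks = split_into_blocks(b"0123456789", block_size=4)
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"], replication=3)
print([len(block) for block in blocks])
print(placement)
print(survives_failure(placement, "node2"))
```

Because each block lives on several nodes, losing any single node still leaves every block readable.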
MapReduce
• MapReduce was introduced by Google in its published paper “MapReduce: Simplified
Data Processing on Large Clusters” in 2004.
• A MapReduce program is composed of a map procedure, which performs filtering and sorting,
and a reduce method, which performs a summary operation.
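The map and reduce steps can be sketched in plain Python with the classic word count. This runs on a single machine and only mirrors the shape of a MapReduce job; the function names are illustrative and not a Hadoop API.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, with keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(key, values):
    """Reduce: summarise the values of one key, here by summing them."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce over a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data big clusters", "data on clusters"])
print(counts)  # {'big': 2, 'clusters': 2, 'data': 2, 'on': 1}
```

In a real cluster the map calls run in parallel on the nodes that hold the input blocks, and the shuffle moves each key's values to the node that reduces it.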
YARN
• Yet Another Resource Negotiator (YARN) is the resource management and job
scheduling technology in the open source Hadoop distributed processing framework.
• YARN is responsible for allocating system resources to the various applications running
in a Hadoop Cluster and scheduling tasks to be executed on different cluster nodes.
• The technology, released by the Apache Software Foundation in 2012, was one of the key
features added in Hadoop 2.0.
The Big Data Interfaces (Part 1)
Hue
• Hue is an open source Analytics Workbench for browsing,
querying and visualizing data.
Shell
• The Linux console provides a way for the kernel to receive
text input from the user and send text output.
The Big Data Interfaces (Part 2)
Databricks
• A notebook is a web-based interface to a document
that contains runnable code, visualizations, and
narrative text.
Airflow
• Airflow scheduler executes your tasks on an array
of workers while following the specified
dependencies.
Workflows
• Oozie is a workflow scheduler system to manage Hadoop jobs.
• Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions.
• Oozie supports several types of Hadoop jobs, such as Java map-reduce, Streaming
map-reduce, Pig, Hive, Sqoop and Distcp.
Oozie
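To show what a workflow as a DAG of actions means for execution order, here is a small sketch using Python's standard graphlib module. The action names and dependencies are hypothetical, and this is not how an Oozie workflow is actually declared; it only shows that a scheduler must run every action after the actions it depends on.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each key is an action, each value the set of actions it depends on.
workflow = {
    "import_with_sqoop": set(),
    "clean_with_pig": {"import_with_sqoop"},
    "load_hive_table": {"clean_with_pig"},
    "build_report": {"load_hive_table"},
    "archive_with_distcp": {"import_with_sqoop"},
}


def execution_order(dag):
    """Return one valid order in which every action runs after its dependencies."""
    return list(TopologicalSorter(dag).static_order())


order = execution_order(workflow)
print(order)
```

Several orders can be valid, because independent actions such as the archive step may run at any point after the import. If the dependencies contained a cycle, TopologicalSorter would raise graphlib.CycleError, which is why workflows have to be acyclic.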
SQL
Hive
Hive is used for querying and managing large datasets
residing in distributed storage. Hive provides a
mechanism to project structure onto this data and
query the data using a SQL-like language called HiveQL.
Use Hive to create tables and run complex operations on
data.
Impala
Impala circumvents MapReduce to directly access the
data through a specialized distributed query engine that
is very similar to those found in commercial parallel
RDBMSs.
Use Impala to query data faster.
Transfer data
Sqoop
Sqoop is a tool designed for efficiently transferring bulk
data between structured datastores such as relational
databases and Apache Hadoop.
JDBC
Java Database Connectivity (JDBC) is
an application programming interface (API) for the
programming language Java, which defines how a
client may access a database.
Data Files
CSV
CSV is simple and ubiquitous.
Many tools like Excel, Google
Sheets and a host of others can
generate CSV files.
Parquet
Apache Parquet is column-oriented
and designed to bring
efficient columnar storage of data
compared to row-based files like
CSV. Parquet is built to support
very efficient compression and
encoding schemes.
JSON
JavaScript Object Notation (JSON)
is an open-standard file format
that uses human-readable text to
transmit data objects consisting of
attribute–value pairs and array
data types (or any other
serializable value).
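The difference between row-based and column-oriented layouts can be sketched with the standard library alone. This is not the Parquet format itself, which adds encoding and compression on top; the sample data and the to_columns helper are assumptions for illustration.

```python
import csv
import io
import json

csv_text = "city,temperature\nWarsaw,21\nKrakow,19\nGdansk,17\n"

# Row-oriented: each record is kept together, as in a CSV file.
rows = list(csv.DictReader(io.StringIO(csv_text)))


def to_columns(records):
    """Pivot row records into one list per column (a column-oriented layout)."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# A query on one column only has to touch that column's list.
temperatures = [int(value) for value in columns["temperature"]]
average = sum(temperatures) / len(temperatures)

# JSON carries the same records as attribute-value pairs inside an array.
as_json = json.dumps(rows)

print(columns)
print(average)
print(as_json)
```

Keeping the values of one column together is also what makes compression effective, since neighbouring values share a type and often repeat.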
Distributions
Cloudera
Cloudera is a software company
that provides a software platform
for data engineering, data
warehousing, machine
learning and analytics that runs in
the cloud or on premises.
CDH is Cloudera’s open source
platform distribution, including
Apache Hadoop and Apache Spark.
Databricks
Databricks is a company founded
by the original creators of Apache
Spark, the first unified analytics
engine, that aims to help clients
with cloud-based big data
processing and machine learning.
Databricks develops a web-based
platform for working with Spark
that provides automated cluster
management and IPython-style
notebooks.
MapR
MapR provides access to a variety
of data sources from a
single computer cluster,
including big data workloads such
as Apache Hadoop and Apache
Spark.
Cloud Services
Zookeeper
ZooKeeper is essentially a centralized service for distributed systems, built on a hierarchical key-value store, which is used to provide
a distributed configuration service, synchronization service, and naming registry for large distributed systems.
Kafka
Kafka is a messaging system widely used in two ways: i) Queuing: queue consumers act as a worker group. ii) Publish-Subscribe:
each subscriber gets a copy of each message. It acts like a notification system.
Solr
Solr’s major features include full-text search, hit highlighting, faceted search, real-time indexing, dynamic
clustering, database integration, NoSQL features and rich document (e.g., Word, PDF) handling.
Pig
Pig Latin is a high-level data flow language. Apache Pig uses a multi-query approach, which is claimed to reduce the length of
the code by about 20 times and, in turn, the development period by almost 16 times compared with writing MapReduce jobs directly.
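The two ways of using Kafka described above, queuing and publish-subscribe, can be contrasted with a small in-memory simulation. The deliver function and the consumer names are hypothetical, and real Kafka spreads partitions rather than individual messages across the consumers of a group; the sketch only shows who ends up seeing which message.

```python
from collections import defaultdict
from itertools import cycle


def deliver(messages, groups):
    """Deliver messages to consumer groups.

    `groups` maps a group name to its list of consumers. Every group receives
    every message (publish-subscribe across groups); inside a group, messages
    are spread over the consumers (queuing within a group).
    """
    inbox = defaultdict(list)
    for consumers in groups.values():
        turn = cycle(consumers)
        for message in messages:
            inbox[next(turn)].append(message)
    return dict(inbox)


messages = ["m1", "m2", "m3", "m4"]
groups = {
    "workers": ["worker-a", "worker-b"],  # queue: the two workers share the load
    "audit": ["auditor"],                 # subscriber: gets its own copy of everything
}
inbox = deliver(messages, groups)
print(inbox)
```

The two workers split the messages between them, while the auditor, alone in its group, receives a copy of every message.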
Other Components
Spark
• Started by Matei Zaharia at UC Berkeley; an Apache top-level project since 2014.
• Hadoop MapReduce shares data through the file system (disk); Spark keeps data in memory, which is faster and gives lower latency.
• Apache Spark has as its architectural foundation the resilient distributed dataset (RDD), a read-only multiset
of data items distributed over a cluster of machines, that is maintained in a fault-tolerant way.
• Spark and its RDDs were developed in response to limitations in the MapReduce paradigm, in which programs read input
data from disk, map a function across the data, reduce the results of the map, and store the reduction
results on disk. Spark's RDDs function as a working set for distributed programs that offers a
restricted form of distributed shared memory.
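The idea of a read-only, partitioned dataset that is rebuilt from its recorded transformations can be sketched in a few lines of plain Python. MiniRDD is a made-up toy class, not the Spark or PySpark API, and it runs on one machine; it only mirrors the concepts of immutability, partitions and lineage.

```python
import functools


class MiniRDD:
    """A toy, read-only dataset split into partitions, with lazy transformations."""

    def __init__(self, partitions, lineage=()):
        self._partitions = tuple(tuple(p) for p in partitions)  # immutable source data
        self._lineage = tuple(lineage)                          # recorded transformations

    def map(self, fn):
        # A transformation returns a new dataset; nothing is computed yet.
        return MiniRDD(self._partitions, self._lineage + (("map", fn),))

    def filter(self, fn):
        return MiniRDD(self._partitions, self._lineage + (("filter", fn),))

    def _compute(self, partition):
        # Replaying the lineage rebuilds a partition from its source data,
        # which is how a lost partition could be recovered.
        items = list(partition)
        for kind, fn in self._lineage:
            if kind == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        return items

    def collect(self):
        """Action: compute every partition and gather the results."""
        return [x for p in self._partitions for x in self._compute(p)]

    def reduce(self, fn):
        """Action: combine all computed elements with a binary function."""
        return functools.reduce(fn, self.collect())


numbers = MiniRDD([[1, 2, 3], [4, 5, 6]])
squares_of_even = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(squares_of_even.collect())                   # [4, 16, 36]
print(squares_of_even.reduce(lambda a, b: a + b))  # 56
print(numbers.collect())                           # the source is unchanged
```

Because a transformation never modifies its input, the original dataset stays available for other computations without being read from disk again.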
Programming: Scala/PySpark
Scala
Scala is the core language for Spark and allows the parallel
programming to be abstracted.
You will want to learn Scala if you want to extend Spark.
PySpark
PySpark is the Python library for programming Spark. It does not always
achieve the same efficiency, but it is much easier to learn.
Big Data Monitoring
Grafana
Grafana is free software, originally released under the Apache
license, for visualizing and formatting metric data. It
allows you to create dashboards and charts from multiple
sources, including time series databases such as Graphite,
InfluxDB and OpenTSDB.
Arcadia
Arcadia Data is an analytics and BI platform built for big
data and data lakes. Unlike a traditional BI deployment, the
platform promotes greater agility for faster time-to-insight,
delivers faster responses and higher user concurrency on
larger data volumes, and avoids middleware, which further
reduces IT overhead and complexity.
Your actions design your life
Big Data Cycle
“Life is something more than just data”
Robert Brunet

Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - Robert Brunet

  • 1. DXC Proprietary and Confidential The Role of Hadoop Ecosystem in Advance Analytics Robert Brunet, PhD Big Data Engineer Warsaw, Poland DXC Technology
  • 2. Welcome to the Zoo…
  • 3. … the Zoo of Big Data
  • 4. Big Data Engineer Skills & Talents • Software Engineering • Mathematics • Database architecture • Extract-Transform-Load • Distributed Computing • Predictive Modelling • Visualization • Cloud Tools Programs & Languages • Hadoop | Spark • Linux | Hue • Azure | AWS • Cloudera | Hortonworks • SQL | Hive | Impala • Python | Scala | R | Java • Oozie | Airflow | XML Big Data Analytics interdisciplinary field uses scientific methods to extract insights from data
  • 5. Big Data Data Preparation Analytics Business Intelligence Database Big Data Analytics
  • 6. Why Big Data? Currently 8Vs but let’s focus on 3Vs: 1. Volume Terabytes and Petabytes of the storage system 2. Velocity Almost real time and the update window fractions of seconds 3. Variety Sometimes data not in the traditional format. It may be in the form of video, SMS, pdf, etc.
  • 7. Data Ownership • Once we have a business problem to solve, we have to start to identify the data that will bring us to the solution. • Most of business projects the data is owned by Company. • Other cases the data is bought to external company. • Some case data can be find on open-source repositories. Own Data Open SourceBuy Data
  • 8. Data Structures structured Data resides in fixed field within a record. This includes data contained in relational databases and spreadsheets. semi-structured Cross between structured and unstructured data. It is a subtype of structured data, but lacks a strict data model: JSON, xml. unstructured Things that cannot be readily classified: images, maps, videos, etc.
  • 9. Data Storage Local directory based MS Excel, MS Access, txt, and XML files stored on a workstation Network based Your organization’s database server connected to the intranet (SQL Server, SAP) Cloud based Data-as-a-Service (Hadoop on Azure, AWS or Google Cloud)
  • 10. Databases Relational Databases • Relationships are typically two-dimensional Non-Relational Databases • Typically referred to as NoSQL databases
  • 11. The Hadoop Ecosystem The Hadoop Ecosystem refers to the various components of the Apache Hadoop software library
  • 12. Hadoop • Early 2000’s Dough Cutting was attempting to build an Open Source Search engine called. • After Google published its papers on MapReduce in 2004, Dough Cutting developed the distributed computing part Hadoop. • The name Hadoop comes from Cutting’s kid yellow elephant toy. • Nowadays, Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. • It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
  • 13. HDFS • Hadoop Cluster contains a lot of data and this data has to be stored somewhere. • Hadoop Distributed File System (HDFS) is the standard platform for data storage. • HDFS is a fault-tolerant, distributed file system written entirely in Java. • The core benefit of HDFS is in its ability to store large files across multiple machines. • HDFS is a durable, scalable and low-cost data storage.
  • 14. MapReduce • MapReduce was introduced by Google in its published paper “MapReduce: Simplified Data Processing on Large Clusters” in 2004. • A MapReduce program is composed of a map procedure, which performs filtering and sorting, and a reduce method, which performs a summary operation.
  • 15. YARN • Yet Another Resource Negotiator (YARN) is the resource management and job scheduling technology in the open source Hadoop distributed processing framework. • YARN is responsible for allocating system resources to the various applications running in a Hadoop Cluster and scheduling tasks to be executed on different cluster nodes. • The technology release by the Apache Software Foundation in 2012 was one of the key features added in Hadoop 2.0.
  • 16. The Big Data Interfaces (Part1) Hue • Hue is an open source Analytics Workbench for browsing, querying and visualizing data. Shell • Linux console provides a way for the kernel to receive text input from the user and send text output.
  • 17. The Big Data Interfaces (Part 2) Databricks • A notebook is a web-based interface to a document that contains runnable code, visualizations, and narrative text. Airflow • The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies.
  • 18. Workflows Oozie • Oozie is a workflow scheduler system to manage Hadoop jobs. • Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions. • Oozie supports several types of Hadoop jobs, such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and Distcp.
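A workflow expressed as a DAG of actions can be illustrated with Python's standard library. This sketch is not Oozie or Airflow configuration; the action names are hypothetical, and `graphlib.TopologicalSorter` (Python 3.9+) stands in for the scheduler's job of finding an execution order that respects every dependency.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key is an action; its value is the set of actions it depends on.
workflow = {
    "import_with_sqoop": set(),
    "load_lookup_table": set(),
    "clean_with_pig": {"import_with_sqoop"},
    "aggregate_with_hive": {"clean_with_pig", "load_lookup_table"},
    "export_report": {"aggregate_with_hive"},
}

# static_order() yields every action after all of its dependencies.
order = list(TopologicalSorter(workflow).static_order())
```

Actions with no dependency between them, such as the two initial loads here, may run in either order or in parallel; a graph containing a cycle is rejected, which is why workflows must be acyclic.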
  • 19. SQL Hive • Hive is used for querying and managing large datasets residing in distributed storage. • Hive provides a mechanism to project structure onto this data and query it using a SQL-like language called HiveQL. • Use Hive to create tables and run complex operations on data. Impala • Impala circumvents MapReduce to directly access the data through a specialized distributed query engine that is very similar to those found in commercial parallel RDBMSs. • Use Impala to query data faster.
  • 20. Transfer data Sqoop • Sqoop is a tool designed for efficiently transferring bulk data between structured datastores, such as relational databases, and Apache Hadoop. JDBC • Java Database Connectivity (JDBC) is an application programming interface (API) for the programming language Java, which defines how a client may access a database.
  • 21. Data Files Parquet • Apache Parquet is column-oriented and designed to bring efficient columnar storage of data compared to row-based files like CSV. • Parquet is built to support very efficient compression and encoding schemes. CSV • CSV is simple and ubiquitous. • Many tools, such as Excel, Google Sheets and a host of others, can generate CSV files. JSON • JavaScript Object Notation (JSON) is an open-standard file format that uses human-readable text to transmit data objects consisting of attribute–value pairs and array data types (or any other serializable value).
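The difference between these layouts can be shown with the same two records in Python. The sample records are invented for illustration, and the `columns` dictionary only mimics the column-oriented idea; it is not the Parquet file format, which additionally applies compression and encoding.

```python
import csv
import io
import json

records = [
    {"id": 1, "city": "Warsaw", "temp": 21.5},
    {"id": 2, "city": "Krakow", "temp": 19.0},
]

# Row-oriented text: one record per line (CSV).
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "city", "temp"], lineterminator="\n")
writer.writeheader()
writer.writerows(records)
csv_text = buf.getvalue()

# Attribute-value pairs in human-readable text (JSON).
json_text = json.dumps(records)

# Column-oriented layout: all values of one field are stored together,
# so a query on a single field does not need to read the other fields.
columns = {name: [r[name] for r in records] for name in records[0]}
```

Reading only `columns["temp"]` touches one list, whereas the CSV text has to be parsed row by row to extract the same values.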
  • 22. Distributions Cloudera • Cloudera is a software company that provides a software platform for data engineering, data warehousing, machine learning and analytics that runs in the cloud or on premises. • CDH is Cloudera’s open source platform distribution, which includes Apache Hadoop and Apache Spark. Databricks • Databricks is a company founded by the original creators of Apache Spark that aims to help clients with cloud-based big data processing and machine learning. • Databricks develops a web-based platform for working with Spark that provides automated cluster management and IPython-style notebooks. MapR • MapR provides access to a variety of data sources from a single computer cluster, including big data workloads such as Apache Hadoop and Apache Spark.
  • 24. Other Components ZooKeeper • ZooKeeper is essentially a centralized service for distributed systems, built on a hierarchical key-value store, which is used to provide a distributed configuration service, synchronization service, and naming registry for large distributed systems. Kafka • Kafka is a messaging system widely used in two ways: i) Queuing: the consumers of a queue act as a worker group and share the messages. ii) Publish-Subscribe: each subscriber gets a copy of each message, so it acts like a notification system. Solr • Solr's major features include full-text search, hit highlighting, faceted search, real-time indexing, dynamic clustering, database integration, NoSQL features and rich document (e.g., Word, PDF) handling. Pig • Pig Latin is a high-level data flow language. • Apache Pig uses a multi-query approach, which is reported to reduce code length by about 20 times and development time by almost 16 times.
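The two messaging modes can be contrasted with a small in-memory simulation in Python. This is a toy model, not the Kafka client API: the function `deliver` and the consumer names are hypothetical, and messages are spread round-robin within a group as a simplification (Kafka itself assigns topic partitions to the consumers of a group).

```python
from collections import defaultdict


def deliver(messages, groups):
    """Deliver each message exactly once per consumer group.

    `groups` maps a group name to its list of consumers. Within a group the
    messages are shared out (queuing); across groups every group receives
    every message (publish-subscribe).
    """
    inbox = defaultdict(list)
    for i, msg in enumerate(messages):
        for consumers in groups.values():
            inbox[consumers[i % len(consumers)]].append(msg)
    return dict(inbox)


messages = ["m0", "m1", "m2", "m3"]
groups = {
    "workers": ["w1", "w2"],  # queue: w1 and w2 split the messages
    "audit": ["a1"],          # separate subscriber: receives a full copy
}
inbox = deliver(messages, groups)
```

Here `w1` and `w2` each process half of the messages, while `a1` sees all four, which matches the worker-group and notification behaviours described on the slide.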
  • 25. Spark • Started by Matei Zaharia at UC Berkeley in 2009; became a top-level Apache project in 2014. • Hadoop MapReduce shares data through the file system (disk); Spark shares data in memory, which makes it faster and lowers latency. • Apache Spark has as its architectural foundation the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines that is maintained in a fault-tolerant way. • Spark and its RDDs were developed in response to limitations in the MapReduce paradigm, in which programs read input data from disk, map a function across the data, reduce the results of the map, and store the reduction results on disk. • Spark's RDDs function as a working set for distributed programs that offers a restricted form of distributed shared memory.
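The read-only, fault-tolerant nature of an RDD can be hinted at with a toy class in plain Python. This is a single-machine sketch and not Spark's API: the class `MiniRDD` is hypothetical, and it only models two ideas, namely that transformations return a new dataset instead of changing the old one, and that a result can always be recomputed from the source by replaying the recorded transformations.

```python
class MiniRDD:
    """Toy read-only dataset that records its lineage instead of mutating data."""

    def __init__(self, source, lineage=()):
        self._source = tuple(source)  # immutable input data
        self._lineage = lineage       # transformations recorded so far

    def map(self, fn):
        return MiniRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, pred):
        return MiniRDD(self._source, self._lineage + (("filter", pred),))

    def collect(self):
        """Replay the lineage over the source; nothing runs before this call."""
        data = list(self._source)
        for kind, fn in self._lineage:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data


numbers = MiniRDD(range(1, 6))
squares_of_even = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
result = squares_of_even.collect()
```

Because `numbers` is never modified, it can be reused for other computations, and a lost result can be rebuilt by calling `collect()` again.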
  • 26. Programming: Scala/PySpark Scala • Scala is the core language for Spark and allows the parallel programming to be abstracted. • You will want to learn Scala if you want to extend Spark. PySpark • PySpark is the Python library for programming Spark; it does not always achieve the same efficiency as Scala, but it is much easier to learn.
  • 27. Big Data Monitoring Grafana • Grafana is free software released under the Apache license that visualizes and formats metric data. • It allows you to create dashboards and charts from multiple sources, including time series databases such as Graphite, InfluxDB and OpenTSDB. Arcadia • Arcadia Data is an analytics and BI platform built for big data and data lakes. • Unlike a traditional BI deployment, the platform promotes greater agility for faster time-to-insight, delivers faster responses and higher user concurrency on larger data volumes, and avoids middleware, which further reduces IT overhead and complexity.
  • 28. Your actions design your life Big Data Cycle
  • 29. “Life is something more than just data” Robert Brunet