Hadoop Ecosystem
Tilani Gunawardena
PhD(UNIBAS), BSc.Eng(Pera), FHEA(UK), AMIE(SL)
2018/08/07
Content
• What is Big Data & Hadoop
• Core Hadoop
• Hadoop Ecosystem
• Use Cases
Big Data Everywhere!
• Lots of data is being collected and warehoused
– Web data, e-commerce
– Purchases at department/grocery stores
– Bank/credit card transactions
– Social networks
– Sensor data
– IoT data
Data sets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze
Revolution in the Marketplace: The Shift
What is Big Data?
Big Data
• Exabytes and zettabytes of data
• Big Data is not about the size of the data; it is about the value within it
Data in an Enterprise
• Existing OLTP Databases
• User-generated data
• Logs
• System-generated data
The Structure of Big Data
• Structured
– Most traditional data sources
• Semi-structured
– XML, JSON
• Unstructured
– Facebook logs, web chats, YouTube
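To make the three categories concrete, here is a small Python sketch. The sample records and field names are made up for illustration; only the categories themselves come from the slide.

```python
import csv
import io
import json

# Structured: fixed columns, every row has the same schema (hypothetical sample).
structured = io.StringIO("id,name,amount\n1,alice,30\n2,bob,45\n")
rows = list(csv.DictReader(structured))

# Semi-structured: self-describing JSON; fields can be nested or missing.
semi_structured = json.loads('{"id": 3, "name": "carol", "tags": ["vip"]}')

# Unstructured: free text with no schema; meaning has to be extracted.
unstructured = "carol: hi, my order arrived late again"

total_amount = sum(int(r["amount"]) for r in rows)   # sum over a known column
missing_field = semi_structured.get("amount")        # None: no fixed schema
mentions_late = "late" in unstructured.split()       # crude keyword check on raw text
print(total_amount, missing_field, mentions_late)
```

The point of the sketch is only that the amount of work needed before analysis grows as the structure weakens.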
What to do with these data?
• Aggregation and Statistics
– Data warehouse and OLAP
• Indexing, Searching, and Querying
– Keyword based search
– Pattern matching (XML/RDF)
• Knowledge discovery
– Data Mining
– Statistical Modeling
Challenges with Big Data
• Data quality: the fourth V, veracity
• Discovery: finding insights in Big Data is like finding a needle in a haystack
• Storage
– Where to store it?
– Need to scale up or down on demand
• Analytics
– We are often unaware of the kind of data we are dealing with, which makes analyzing it even more difficult
• Security
• Lack of talent
Scale up vs Scale out
• Scaling up is harder and more expensive
• Systems are typically scaled up (not scaled out) by getting bigger, more powerful hardware
Apache Hadoop
• The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of commodity hardware.
• Concept: Big Data
• Technique: MapReduce
• Hadoop is an ecosystem framework developed in Java.
Hadoop Characteristics
• Open source
• Distributed processing
• Distributed storage
• Reliable
• Economical
• Flexible
Why Hadoop
• An open-source project to manage “Big Data”
• Not just a single project, but a set of projects that work together
• Deals with the 4 V’s
• Traditional data stores are expensive to scale and, by design, difficult to distribute
• Transforms commodity hardware into a coherent storage service that lets you store petabytes of data
• Provides a coherent processing service to process data efficiently
Big Data
Hadoop History
• In 2003, Doug Cutting launched the Nutch project to handle billions of searches and index millions of web pages.
• In October 2003, Google published its paper on GFS (Google File System).
• In December 2004, Google published its paper on MapReduce.
• In 2005, Nutch used its own implementations of the GFS and MapReduce designs to perform its operations.
• In 2006, Yahoo created Hadoop, based on GFS and MapReduce, with Doug Cutting and team.
• In 2007, Yahoo started using Hadoop on a 1,000-node cluster.
Evolution of Hadoop
MapReduce
• A processing framework
• Java-based
• Designed for batch processing
• A high-performance, fault-tolerant data processing system
MapReduce in 41 words
Goal: count the number of books in the library.
• Map:
– You count up shelf #1,
– I count up shelf #2.
(The more people we get, the faster this part goes)
• Reduce:
We all get together and add up our individual counts.
(Cf. http://www.chrisstucchio.com/blog/2011/mapreduce_explained.html)
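The library analogy can be sketched in a few lines of Python. This is an illustration only: the shelves and titles are invented, and a thread pool stands in for "more people"; it is not how Hadoop itself schedules work.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical library: each shelf is a list of book titles.
shelves = [
    ["Dune", "Emma", "Ulysses"],
    ["Hamlet", "Beloved"],
    ["Walden", "Ivanhoe", "Persuasion", "Dracula"],
]

def count_shelf(shelf):
    """Map step: one person counts the books on one shelf."""
    return len(shelf)

def add_counts(counts):
    """Reduce step: everyone's individual counts are added up."""
    return sum(counts)

# More workers let the map step finish sooner; the reduce step runs once.
with ThreadPoolExecutor(max_workers=3) as pool:
    per_shelf = list(pool.map(count_shelf, shelves))

total_books = add_counts(per_shelf)
print(per_shelf, total_books)
```

Each shelf is counted independently, which is why the map step parallelizes well, while the reduce step needs all the partial counts before it can finish.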
MapReduce
• MapReduce is a programming model for processing and generating large data sets.
• MapReduce was used to completely regenerate Google's index of the World Wide Web.
• Hadoop allows applications to run using the MapReduce algorithm.
• Users implement an interface of two functions:
– Map
– Reduce
• Map(in_key, in_value) → list of (out_key, intermediate_value)
• Reduce(out_key, list of intermediate_value) → list of out_value
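The two signatures above can be illustrated with the classic word count. The following is a minimal single-process sketch in plain Python, not Hadoop's Java API; the names `map_fn`, `reduce_fn`, and `run_mapreduce` and the sample documents are assumptions made for the example.

```python
from collections import defaultdict

def map_fn(in_key, in_value):
    """Map(in_key, in_value) -> list of (out_key, intermediate_value)."""
    return [(word.lower(), 1) for word in in_value.split()]

def reduce_fn(out_key, intermediate_values):
    """Reduce(out_key, list of intermediate_value) -> list of out_value."""
    return [sum(intermediate_values)]

def run_mapreduce(records, map_fn, reduce_fn):
    """Tiny driver: map every record, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for in_key, in_value in records:
        for out_key, value in map_fn(in_key, in_value):
            groups[out_key].append(value)
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}

# Hypothetical input: (document name, document text) pairs.
documents = [("doc1", "big data big value"), ("doc2", "big cluster")]
word_counts = run_mapreduce(documents, map_fn, reduce_fn)
print(word_counts)
```

The grouping done inside `run_mapreduce` corresponds to the step a real framework performs between the map and reduce phases; the user only supplies the two functions.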
MapReduce Workflow
Apache Hadoop Modules
• Hadoop Common: contains libraries and other modules
• HDFS: Hadoop Distributed File System
• Hadoop YARN: Yet Another Resource Negotiator
• Hadoop MapReduce: a programming model for large-scale data processing
Who Uses Hadoop?
Hadoop Vendor Distribution
• Cloudera
• MapR
• Hortonworks
• Apache Bigtop
• Greenplum
Hadoop Architecture
YARN
• New processing framework
• High availability
• YARN supports multiple processing models in addition to MapReduce
• With YARN we can run non-MapReduce jobs
Server Types
• OLTP (Online Transaction Processing): data keeps changing
• OLAP (Online Analytical Processing)
– Facebook, Google, Twitter, LinkedIn, e-commerce sites
Hadoop Ecosystem
• Acquire: where to get data
• Arrange data: HDFS, NoSQL
• Analyze/process
• Decide
Acquire Data
• Where to get data
– Sources, e.g., databases and the web
– Tools: Flume, Sqoop, Kafka
Apache Flume
• Flume is a framework for populating Hadoop
with data.
• Flume is a distributed, reliable, and available
service for efficiently collecting, aggregating,
and moving large amounts of log data.
Sqoop
• Apache Sqoop is a connectivity tool designed
for efficiently transferring bulk data between
Apache Hadoop and structured data stores
such as relational databases.
Kafka
Kafka® is used for building real-time data
pipelines and streaming apps. It is horizontally
scalable, fault-tolerant, very fast, and runs in
production in thousands of companies.
Arrange Data
• Hadoop Distributed File System (HDFS)
• NOSQL HBase, MongoDB, Cassandra
• NOSQL is adding transactional behavior to
data (OLTP behavior)
39
Arrange Data
• HDFS is a distributed file system designed to
run on commodity hardware.
• Anything you save in HDFS is stored as a file.
Data Analysis/Processing
• MapReduce
• Spark
• Pig
• Hive
• Impala
Spark
• In-memory processing
• Live stream processing
• Machine learning
Pig
• Initially developed by Yahoo
• A platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs
• Its infrastructure compiles that language into a sequence of MapReduce programs
Pig in the Hadoop Ecosystem
Twitter
• Twitter moved to Apache Pig for analysis. As a result, the following became easier and simpler:
– joining data sets
– grouping them
– sorting them
– retrieving data
• The slide's image shows how Twitter used Apache Pig to analyze its large data set.
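The four operations listed above can be sketched in plain Python. This is not Pig Latin and not Twitter's actual pipeline; the `users` and `tweets` data sets are hypothetical and only show what a join, group, sort, and retrieve sequence does to the data.

```python
from collections import defaultdict

# Hypothetical data sets: (user id, name) and (user id, tweet text).
users = [(1, "alice"), (2, "bob"), (3, "carol")]
tweets = [(1, "hello"), (2, "hadoop"), (1, "pig"), (1, "hive"), (3, "spark")]

# Join: attach the user name to each tweet via the shared user id.
names = dict(users)
joined = [(names[user_id], text) for user_id, text in tweets if user_id in names]

# Group: collect tweets per user name.
grouped = defaultdict(list)
for name, text in joined:
    grouped[name].append(text)

# Sort: order users by tweet count, highest first (name breaks ties).
ranked = sorted(
    ((name, len(texts)) for name, texts in grouped.items()),
    key=lambda pair: (-pair[1], pair[0]),
)

# Retrieve: take the most active user.
top_user = ranked[0]
print(ranked, top_user)
```

In Pig, each of these steps would be one statement in a data-flow script, which is then compiled into MapReduce programs as described on the Pig slide.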
Apache Hive
• Apache Hive is a data warehouse system built on
top of Hadoop and is used for analyzing structured
and semi-structured data.
• Compiles SQL-like queries into MapReduce programs
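To give a feel for how a SQL-like query corresponds to map and reduce steps, here is a conceptual Python sketch. It is an assumption-laden illustration: SQLite stands in for the query engine, the `page_views` table is invented, and the hand-written map and reduce functions are not what Hive's compiler actually generates.

```python
import sqlite3
from collections import defaultdict

# Hypothetical table of page views: (page, visitor).
page_views = [("home", "u1"), ("docs", "u2"), ("home", "u3"), ("home", "u1")]

# The SQL-like query a user would write (run with SQLite as a stand-in, not Hive).
query = "SELECT page, COUNT(*) FROM page_views GROUP BY page ORDER BY page"
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (page TEXT, visitor TEXT)")
conn.executemany("INSERT INTO page_views VALUES (?, ?)", page_views)
sql_result = conn.execute(query).fetchall()
conn.close()

# The same aggregation written by hand as a map step and a reduce step.
def map_row(row):
    page, _visitor = row
    return (page, 1)              # emit the GROUP BY key with a count of 1

def reduce_counts(page, ones):
    return (page, sum(ones))      # COUNT(*) for one group

shuffled = defaultdict(list)
for key, value in map(map_row, page_views):
    shuffled[key].append(value)
mr_result = [reduce_counts(page, ones) for page, ones in sorted(shuffled.items())]

print(sql_result, mr_result)
```

Both paths produce the same per-page counts; the appeal of a SQL-like layer is that the one-line query replaces the hand-written map, grouping, and reduce code.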
Story of Hive – From Facebook to Apache
• Challenges at Facebook: exponential growth of data
• The Hive project was open-sourced by Facebook in August 2008 and is freely available as Apache Hive today.
SQL + Hadoop MapReduce = HiveQL
NASA Case Study: Regional Climate Model Evaluation System (RCMES)
• MySQL database with 6 billion tuples of the form (latitude, longitude, time, data point value, height)
• Even after dividing the whole table into smaller subsets, the system generated huge overhead while processing the data.
How can Apache Hive solve the problem?
HBase
• HBase is an open-source, multidimensional, distributed, scalable NoSQL database written in Java.
• The Facebook messaging platform shifted from Apache Cassandra to HBase in November 2010.
• Facebook Messenger combines messages, email, chat, and SMS into a real-time conversation.
Apache Mahout
• Machine learning library for building scalable machine learning algorithms, implemented on top of MapReduce.
Decide
• Data visualization
– Dashboards, graphs, charts
• Supports business decisions
• BI
• Hue
• Tabview
• clickview
• MS Excel
HUE
• Hue is an open-source Web interface that
supports Apache Hadoop and its ecosystem
Summary
Thank You!
