3. Big Data Everywhere!
• Lots of data is being collected and warehoused
– Web data, e-commerce
– Purchases at department/grocery stores
– Bank/credit-card transactions
– Social networks
– Sensor data
– IoT data
9. Big Data
• Exabytes and zettabytes of data
• Big Data is not about the size of the data;
it's about the value within it
10. Data in an Enterprise
• Existing OLTP databases
• User-generated data
• Logs
• System-generated data
11. The Structure of Big Data
• Structured
– Most traditional data sources
• Semi-structured
– XML, JSON
• Unstructured
– Facebook logs, web chats, YouTube videos
12. What to Do with This Data?
• Aggregation and statistics
– Data warehousing and OLAP
• Indexing, searching, and querying
– Keyword-based search
– Pattern matching (XML/RDF)
• Knowledge discovery
– Data mining
– Statistical modeling
13. Challenges with Big Data
• Data quality: the 4th V, i.e. veracity
• Discovery: finding insights in Big Data is like finding a
needle in a haystack
• Storage:
– "Where do we store it?"
– Need to scale up or down on demand
• Analytics
– We are often unaware of the kind of data we are dealing
with, which makes analyzing it even more difficult
• Security
• Lack of talent
15. Scale Up vs Scale Out
• Traditional systems are typically "scaled up" (not scaled
out) by getting bigger/more powerful hardware
• Scaling up is harder and more expensive than scaling out
16. Apache Hadoop
• The Apache Hadoop software library is a framework
that allows for the distributed processing of large data
sets across clusters of commodity hardware.
• Concept: Big Data
• Technique: MapReduce
• Hadoop is an ecosystem framework developed in Java.
18. Why Hadoop
• An open-source project to manage "Big Data"
• Not just a single project, but a set of projects
that work together
• Deals with the 4 V's
• Traditional data stores are expensive to scale and,
by design, difficult to distribute
• Transforms commodity hardware into a coherent
storage service that lets you store petabytes of data,
and a coherent processing service to process that data
efficiently
20. Hadoop History
• In 2003, Doug Cutting launches the Nutch project to
handle billions of searches and index millions of
web pages.
• Later, in Oct 2003, Google releases its paper on GFS
(the Google File System).
• In Dec 2004, Google releases its paper on
MapReduce.
• In 2005, Nutch uses GFS and MapReduce to perform
its operations.
• In 2006, Yahoo creates Hadoop, based on GFS and
MapReduce, with Doug Cutting and team.
• In 2007, Yahoo starts using Hadoop on a 1000-node
cluster.
22. MapReduce
• A processing framework
• Java-based
• Designed for batch processing
• A high-performance, fault-tolerant data-processing
system
23. MapReduce in 41 words
Goal: count the number of books in the library.
• Map:
– You count up shelf #1,
– I count up shelf #2.
(The more people we get, the faster this part goes)
• Reduce:
We all get together and add up our individual
counts.
(Cf. http://www.chrisstucchio.com/blog/2011/mapreduce_explained.html)
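The library analogy above can be sketched in a few lines of plain Python (the shelf contents are invented for illustration):

```python
from functools import reduce

# Each shelf is counted independently -- the "map" phase.
# The more people (or machines) counting shelves, the faster this goes.
shelves = [["book"] * 12, ["book"] * 7, ["book"] * 20]  # hypothetical shelves
per_shelf_counts = list(map(len, shelves))  # [12, 7, 20]

# Everyone gets together and adds up their counts -- the "reduce" phase.
total = reduce(lambda a, b: a + b, per_shelf_counts)
print(total)  # 39
```

The map step parallelizes across workers; only the small per-shelf counts need to be brought together at the end.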
24. MapReduce
• MapReduce is a programming model for processing and
generating large data sets.
• MapReduce was used to completely regenerate Google's
index of the World Wide Web.
• Hadoop allows applications to run using the
MapReduce algorithm.
• Users implement an interface of 2 functions:
– Map
– Reduce
• Map(in-key, in-value) → (out-key, intermediate-value) list
• Reduce(out-key, intermediate-value list) → out-value list
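A minimal word-count sketch of these two interfaces in plain Python (this is not the Hadoop Java API; the function and variable names are illustrative, and the shuffle that Hadoop performs between the phases is simulated by hand):

```python
from collections import defaultdict

def map_fn(in_key, in_value):
    """Map(in-key, in-value) -> list of (out-key, intermediate-value)."""
    return [(word, 1) for word in in_value.split()]

def reduce_fn(out_key, intermediate_values):
    """Reduce(out-key, intermediate-value list) -> out-value list."""
    return [sum(intermediate_values)]

# Hypothetical input: document name -> document text.
documents = {"doc1": "big data big value", "doc2": "big cluster"}

# Shuffle: group all intermediate values by their out-key.
grouped = defaultdict(list)
for key, value in documents.items():
    for out_key, inter in map_fn(key, value):
        grouped[out_key].append(inter)

result = {k: reduce_fn(k, vs)[0] for k, vs in grouped.items()}
print(result)  # {'big': 3, 'data': 1, 'value': 1, 'cluster': 1}
```

In Hadoop, the map calls run in parallel across the cluster, the framework handles the grouping (shuffle and sort), and the reduce calls also run in parallel, one per key partition.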
27. Apache Hadoop Modules
• Hadoop Common: contains libraries and other modules
• HDFS: the Hadoop Distributed File System
• Hadoop YARN: Yet Another Resource Negotiator
• Hadoop MapReduce: a programming model for large-scale
data processing
31. YARN
• A newer processing framework
• High availability
• YARN supports multiple processing models in
addition to MapReduce
• With YARN we can run non-MapReduce jobs
32. Server Types
• OLTP (Online Transaction Processing): data
keeps on changing
• OLAP (Online Analytical Processing)
– Facebook, Google, Twitter, LinkedIn, e-commerce
sites
36. Apache Flume
• Flume is a framework for populating Hadoop
with data.
• Flume is a distributed, reliable, and available
service for efficiently collecting, aggregating,
and moving large amounts of log data.
37. Sqoop
• Apache Sqoop is a connectivity tool designed
for efficiently transferring bulk data between
Apache Hadoop and structured data stores
such as relational databases.
38. Kafka
Kafka® is used for building real-time data
pipelines and streaming apps. It is horizontally
scalable, fault-tolerant, very fast, and runs in
production in thousands of companies.
39. Arrange Data
• Hadoop Distributed File System (HDFS)
• NoSQL: HBase, MongoDB, Cassandra
• NoSQL adds transactional behavior to
data (OLTP behavior)
40. Arrange Data
• HDFS is a distributed file system designed to
run on commodity hardware.
• HDFS: anything you save in HDFS is a file
42. Spark
• In-memory processing
• Live stream processing
• Machine learning
43. Pig
• Initially developed by Yahoo
• A platform for analyzing large data sets,
consisting of a high-level language for expressing
data analysis programs
• Its infrastructure compiles the language into a
sequence of MapReduce programs
45. Twitter
• Twitter moved to Apache Pig for analysis. Now,
– joining data sets,
– grouping them,
– sorting them, and
– retrieving data
become easier and simpler.
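What Pig's grouping and sorting operators accomplish can be sketched in plain Python (the records below are invented; real Pig scripts are written in Pig Latin and compiled to MapReduce jobs):

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical (user, tweet_count) records, like a small Twitter data set.
records = [("alice", 3), ("bob", 5), ("alice", 2), ("carol", 1), ("bob", 4)]

# GROUP BY user, then SUM per group -- what Pig's GROUP/FOREACH...SUM do.
records.sort(key=itemgetter(0))  # groupby needs sorted input
totals = {user: sum(n for _, n in rows)
          for user, rows in groupby(records, key=itemgetter(0))}

# ORDER BY total, descending -- what Pig's ORDER...BY...DESC does.
ranked = sorted(totals.items(), key=itemgetter(1), reverse=True)
print(ranked)  # [('bob', 9), ('alice', 5), ('carol', 1)]
```

On a cluster, Pig spreads the same group/sum/sort pattern across many machines instead of one in-memory list.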
47. Apache Hive
• Apache Hive is a data warehouse system built on
top of Hadoop and is used for analyzing structured
and semi-structured data.
• Compiles SQL-like queries into MapReduce
programs
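As a rough illustration of the kind of SQL-like aggregation Hive compiles into MapReduce jobs, here is an equivalent query run against an in-memory SQLite table (SQLite is only a local stand-in; Hive's own dialect is HiveQL, and the table and data are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user TEXT, url TEXT)")
conn.executemany("INSERT INTO page_views VALUES (?, ?)", [
    ("alice", "/home"), ("bob", "/home"), ("alice", "/about"),
])

# Hive would compile a query like this into MapReduce:
# map emits (url, 1), reduce sums the counts per url.
rows = conn.execute(
    "SELECT url, COUNT(*) FROM page_views GROUP BY url ORDER BY url"
).fetchall()
print(rows)  # [('/about', 1), ('/home', 2)]
```

The point is that the analyst writes a declarative query; Hive decides how to turn it into distributed map and reduce stages.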
48. Story of Hive – From Facebook to Apache
• Challenges at Facebook: exponential growth
of data
• The Hive project was open sourced in August
2008 by Facebook and is freely available as
Apache Hive today.
51. NASA Case Study: Regional Climate
Model Evaluation System (RCMES)
• A MySQL database with 6 billion tuples of the form
(latitude, longitude, time, data point value, height)
• Even after dividing the whole table into smaller subsets,
the system generated huge overhead while processing the data.
53. HBase
• HBase is an open-source, multidimensional,
distributed, scalable NoSQL database
written in Java.
• The Facebook Messaging Platform shifted from
Apache Cassandra to HBase in November
2010.
• Facebook Messenger combines messages,
email, chat, and SMS into a real-time
conversation.
54. Apache Mahout
• A machine-learning library for building scalable
machine-learning algorithms, implemented on
top of MapReduce.
55. Decide
• Data visualization
– Dashboards, graphs, charts
• Enables taking business decisions
• BI tools:
– HUE
– Tableau
– QlikView
– MS Excel
56. HUE
• Hue is an open-source Web interface that
supports Apache Hadoop and its ecosystem