Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera
1. What is Hadoop, and When Should I Consider Using It?
Houston HUG, June 6th, 2011
Vikram Oberoi, Cloudera
Copyright 2011 Cloudera Inc. All rights reserved
2. About me
• Data engineer at Cloudera, present
  • Using data and Hadoop to enable more responsive support
• Data engineer at Meebo, Aug ’09 – Nov ’10
  • Data infrastructure, analytics
• CS at Stanford, ’09
  • Senior project: ext3 and XFS under Hadoop MapReduce workloads
• Data engineer at Meebo, ’08
  • Built an A/B testing system
• SDE Intern at Amazon, ’07
  • R&D on item-to-item similarities
3. What will I talk about?
• What is Hadoop?
• Typical Hadoop-able problems and use cases
• Cloudera overview
4. What is Hadoop?
5. Big Data Problem: Exploding Data Volumes
• Online
  • Web-ready devices
  • Social media
  • Digital content
• Enterprise
  • Transactions
  • R&D data
  • Operational (control) data
  • Open data initiatives
(Chart labels: Complex, Unstructured; Relational)
• 2,500 exabytes of new information in 2012 with Internet as primary driver
• Digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 “zettabytes” this year
Source: An IDC White Paper - sponsored by EMC. As the Economy Contracts, the Digital Universe Expands. May 2009.
6. Big Data Problem: Data Economics
• Return on Byte = value to be extracted from that byte / cost of storing that byte
• If ROB is < 1 then it will be buried into tape wasteland, thus we need cheaper active storage.
(Diagram labels: High ROB, Low ROB)
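The Return on Byte ratio above can be worked through as a small calculation. This is only a sketch: the function names `return_on_byte` and `storage_tier` and the dollar figures are made up for illustration and do not come from the talk.

```python
def return_on_byte(value_extracted: float, storage_cost: float) -> float:
    """Return on Byte = value extracted from the data / cost of storing it."""
    if storage_cost <= 0:
        raise ValueError("storage_cost must be positive")
    return value_extracted / storage_cost


def storage_tier(rob: float) -> str:
    """ROB < 1 means the data costs more to keep active than it returns."""
    return "active" if rob >= 1 else "archive"


# Hypothetical numbers: the same dataset, worth $500, under two storage costs.
expensive = return_on_byte(value_extracted=500.0, storage_cost=2000.0)
cheap = return_on_byte(value_extracted=500.0, storage_cost=250.0)

print(expensive, storage_tier(expensive))
print(cheap, storage_tier(cheap))
```

With the value held fixed, lowering the storage cost is what moves the same data from the archive tier to the active tier, which is the slide's argument for cheaper active storage.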
7. Hadoop: A Data Platform with Unique Benefits
• Consolidates Everything
  • Move complex and relational data into a single repository
• Stores Inexpensively
  • Keep raw data always available
  • Use commodity hardware
• Processes at the Source
  • Eliminate ETL bottlenecks
  • Mine data first, govern later
(Diagram labels: MapReduce; Hadoop Distributed File System (HDFS))
8. Hadoop Distributed File System (HDFS)
“How is data stored?”
• Based on design of Google’s GFS
• Data stored in large files
• Files can contain any data
• Files separated into blocks
  • 64MB up to 256MB per block (tunable)
• Each block replicated across a cluster (tunable, usually 3 replicas across the cluster)
  • This buys you: fault tolerance, parallelizable disk reads
• Store whatever you want in it
  • This buys you: flexibility
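The block and replica arithmetic above can be sketched in a few lines. This is an illustration under stated assumptions, not HDFS code: `split_into_blocks` and `place_replicas` are hypothetical helpers, the node names are invented, and the round-robin placement stands in for HDFS's real placement policy.

```python
def split_into_blocks(file_size_mb: int, block_size_mb: int = 64) -> list[int]:
    """Return the size in MB of each block; the last block may be smaller."""
    if file_size_mb < 0 or block_size_mb <= 0:
        raise ValueError("sizes must be non-negative and block size positive")
    full, rest = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full + ([rest] if rest else [])


def place_replicas(num_blocks: int, nodes: list[str],
                   replication: int = 3) -> dict[int, list[str]]:
    """Assign each block to `replication` distinct nodes.

    Round-robin placement is an assumption made for this sketch only.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


nodes = ["node1", "node2", "node3", "node4", "node5"]
blocks = split_into_blocks(300)            # a 300MB file with 64MB blocks
placement = place_replicas(len(blocks), nodes)

# Fault tolerance: if one node is lost, every block still has live replicas.
survivors = {b: [n for n in holders if n != "node1"]
             for b, holders in placement.items()}

print(blocks)
print(placement)
```

Because each block lives on several nodes, different blocks of the same file can also be read from different disks at the same time.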
9. MapReduce
“How is data processed?”
• Framework designed for parallel processing of large disk bound batch jobs
• Data processed at the source
  • File ‘foo’ has 5 blocks, processing happens on 5 nodes
  • Parallelized disk reads -> remove disk bottleneck
• Way to express algorithms such that they are parallelizable
• Two functions at the core of every job:
  • Map function (group by)
  • Reduce function (perform action on group)
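The two functions above can be illustrated with a word count run in a single process. This is a minimal sketch of the programming model only: `run_job`, `map_fn`, and `reduce_fn` are names chosen for this example, and nothing here uses the Hadoop API or runs in parallel.

```python
from collections import defaultdict
from typing import Callable, Iterable, Iterator


def map_fn(line: str) -> Iterator[tuple[str, int]]:
    """Map: emit a (key, value) pair per word; the key decides the group."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: perform an action on one group, here summing the counts."""
    return key, sum(values)


def run_job(records: Iterable[str],
            mapper: Callable[[str], Iterator[tuple[str, int]]],
            reducer: Callable[[str, list[int]], tuple[str, int]]) -> dict[str, int]:
    groups: dict[str, list[int]] = defaultdict(list)
    # Map phase. On a cluster this would run on the nodes holding each block.
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)   # shuffle: collect values by key
    # Reduce phase: one call per key.
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


lines = ["hadoop stores data", "hadoop processes data"]
counts = run_job(lines, map_fn, reduce_fn)
print(counts)
```

Since each map call looks at one record and each reduce call looks at one key, both phases can be spread across machines without changing the two functions.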
10. What is Hadoop?
• A scalable fault-tolerant distributed system for data storage and processing (open source under the Apache license)
• Scalable data processing engine
  • Hadoop Distributed File System (HDFS): self-healing high-bandwidth clustered storage
  • MapReduce: fault-tolerant distributed processing
• Key value
  • Flexible -> store data without a schema and add it later as needed
  • Affordable -> cost / TB at a fraction of traditional options
  • Broadly adopted -> a large and active ecosystem
  • Proven at scale -> dozens of petabyte+ implementations in production today
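The "store data without a schema and add it later" point can be illustrated by applying a schema when the data is read rather than when it is written. This is a sketch under assumptions: the records, the field names, and the `read_with_schema` helper are invented for this example and are not part of Hadoop.

```python
import json
from typing import Callable, Iterable, Iterator

# Raw records are stored as-is; nothing is validated at write time.
raw_lines = [
    '{"user": "alice", "action": "login", "ms": 120}',
    '{"user": "bob", "action": "search", "query": "hadoop"}',
    'not json at all',
]


def read_with_schema(lines: Iterable[str],
                     schema: dict[str, Callable]) -> Iterator[dict]:
    """Apply a schema at read time: keep the requested fields, fill gaps with None."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue                      # skip lines that do not parse
        if not isinstance(record, dict):
            continue
        yield {field: cast(record[field]) if field in record else None
               for field, cast in schema.items()}


# Two different schemas applied to the same stored bytes.
rows_v1 = list(read_with_schema(raw_lines, {"user": str, "action": str}))
rows_v2 = list(read_with_schema(raw_lines, {"user": str, "ms": int}))
print(rows_v1)
print(rows_v2)
```

The raw lines never change; a new question only needs a new schema at read time.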
11. Cloudera’s Distribution Including Apache Hadoop
The Industry’s Leading Hadoop Distribution
(Diagram components: Hue, Hue SDK, Oozie, Hive, Pig/Hive, Flume, Sqoop, HBase, Zookeeper)
• Open source – 100% Apache licensed and free for download
• Simplified – Component versions & dependencies managed for you
• Integrated – All components & functions interoperate through standard APIs
• Reliable – Patched with fixes from future releases to improve stability
• Supported – Employs project founders and committers for >90% of components
13. What is common across Hadoop-able problems?
Nature of the data
• Complex data
• Multiple data sources
• Lots of it
Nature of the analysis
• Batch processing
• Parallelizable
Copyright 2010 Cloudera Inc. All rights reserved
14. What kinds of analyses are possible with Hadoop?
• Text mining
• Collaborative filtering
• Index building
• Prediction models
• Graph creation and analysis
• Sentiment analysis
• Risk assessment
• Pattern recognition
15. Top 10 Hadoop-able Problems
See archived webinar on cloudera.com
  1. Modeling True Risk
  2. Customer Churn Analysis
  3. Recommendation engines
  4. Ad Targeting
  5. Point Of Sale Transaction Analysis
  6. Analysing Network Data To Predict Failure
  7. Threat Analysis/Fraud Detection
  8. Trade Surveillance
  9. Search Quality
  10. Data “Sandbox”
21. Example: Recommendation Engine
Solution with Hadoop
• Batch processing framework
  • Allow execution in parallel over large datasets
• Collaborative filtering
  • Collecting ‘taste’ information from many users
  • Utilizing information to predict what similar users like
Typical Industry
• Ecommerce, Manufacturing, Retail
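The collaborative filtering idea above, collecting taste information and using it to predict what similar users like, can be sketched on a single machine. The users, items, and the choice of Jaccard similarity are assumptions made for this example; the talk does not specify a similarity measure, and a Hadoop version would express the same scoring as batch jobs over far more users.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Similarity of two users: shared likes divided by all likes."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target: str, likes: dict[str, set[str]], top_n: int = 3) -> list[str]:
    """Rank items the target has not seen by the similarity of the users who like them."""
    seen = likes[target]
    scores: dict[str, float] = {}
    for user, items in likes.items():
        if user == target:
            continue
        similarity = jaccard(seen, items)
        if similarity == 0:
            continue                      # ignore users with no taste overlap
        for item in items - seen:
            scores[item] = scores.get(item, 0.0) + similarity
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]


# Hypothetical 'taste' data collected from four users.
likes = {
    "ann": {"book", "lamp", "desk"},
    "ben": {"book", "lamp", "chair"},
    "cat": {"book", "rug"},
    "dan": {"tent"},
}
suggestions = recommend("ann", likes)
print(suggestions)
```

Here "ben" shares two likes with "ann" and "cat" shares one, so the item only "ben" likes ranks above the item only "cat" likes, and "dan" contributes nothing.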
23. Example: Analyzing Network Data to Predict Failure
Solution with Hadoop
• Take the computation to the data
• Expand the range of indexing techniques from simple scans to more complex data mining
• Better understand how the network reacts to fluctuations
  • How previously thought discrete anomalies may, in fact, be interconnected
• Identify leading indicators of component failure
Typical Industry
• Utilities, Telecommunications, Data Centers
24. Example: Supporting Hadoop at Cloudera
• Collect data from customer clusters
  • OS configs, Hadoop configs, command outputs, logs
  • Data served by HBase, used by supporters
• Consolidate data about Hadoop in HDFS
  • Mailing lists, issue trackers, wiki pages, IRC, books
  • Customer cluster data
• Analyze many data sources to understand Hadoop issues and deployments
• Build tools to enable easier diagnosis or proactive support
26. Cloudera Offerings
Enabling the Enterprise Adoption of Apache Hadoop
• PLATFORM & APPLICATIONS
• SUPPORT
• PROFESSIONAL SERVICES
• TRAINING
27. Contact/Resources/Questions
• vikram@cloudera.com
• irc.freenode.net #cloudera #hadoop
• @cloudera
• Cloudera Groups: http://groups.cloudera.org
• Hadoop: The Definitive Guide
• 10 Hadoop-able problems on Slideshare
• Questions?
(P.S. We’re hiring SAs in Houston!)