Big Data Day LA 2015 - Always-on Ingestion for Data at Scale by Arvind Prabhakar of Streamsets

Always-on Ingestion
Continuous Ingestion for Data at Scale

© 2015 StreamSets Inc., All rights reserved

Arvind Prabhakar

Big Data Day LA, June 2015

About me

❏  Founder/CTO

Apache Software Foundation
❏  Flume - PMC Chair
❏  Sqoop - PMC Chair
❏  Storm - PMC, Committer
❏  MetaModel - Mentor
❏  Sentry - Mentor
❏  NiFi - Mentor
❏  ASF Member

Previously...
❏  Cloudera
❏  Informatica

@aprabhakar

Some Background
What is Data Ingestion?
❏  Acquiring data from various sources
❏  Storing acquired data where it can be processed
Why do we need Data Ingestion?
❏  Data is consumed away from where it is produced
❏  Consuming systems are often distributed and remote
How is Data Ingestion Implemented?
❏  Manually, with scripts, with rudimentary automation
❏  Higher-level frameworks like Flume, Kafka, etc.
[Diagram: data flowing from sources into the Enterprise Data Infrastructure]
Sources: Logs, Files, Click Streams, Sensors, Devices, Database Logs, Social Data Streams, Feeds, Other
Destinations: Raw Storage (HDFS, S3); EDW, NoSQL (Hive, Impala, HBase, Cassandra, Redshift); Search (Solr, ElasticSearch)

Data Ingestion Challenge
Ever-increasing data volumes and rates...
Data sources are physically distributed and transient...
Data structures and semantics are constantly changing...

Lot more than moving data!
Data Ingest should be agile
❏  Welcome new data sources as they emerge
❏  Incorporate changes to existing sources as needed
Data Ingest should be safe and reliable
❏  Protect your downstream from silent data corruption
❏  Ensure that there is no data loss in your infrastructure
Data Ingest should scale as needed
❏  Data ingest must never become a bottleneck
❏  Data ingest must scale without significant cost or effort
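One way to read the "safe and reliable" bullets is that every record is checked before it moves downstream, and a record that fails the check is set aside rather than dropped. The sketch below is illustrative only: the schema, the `validate` rules, and the `ingest` helper are hypothetical and do not come from the talk.

```python
from typing import Any, Dict, Iterable, List, Tuple

Record = Dict[str, Any]

# Hypothetical schema: field name -> expected type (an assumption for illustration).
EXPECTED_FIELDS = {"id": int, "event": str, "ts": float}


def validate(record: Record) -> List[str]:
    """Return a list of problems; an empty list means the record is clean."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"bad type for {name}")
    return problems


def ingest(records: Iterable[Record]) -> Tuple[List[Record], List[Tuple[Record, List[str]]]]:
    """Split records into clean and quarantined lists; no record is discarded."""
    clean, quarantined = [], []
    for record in records:
        problems = validate(record)
        if problems:
            quarantined.append((record, problems))
        else:
            clean.append(record)
    return clean, quarantined


batch = [
    {"id": 1, "event": "click", "ts": 1433116800.0},
    {"id": "2", "event": "click", "ts": 1433116801.0},  # id has the wrong type
    {"id": 3, "event": "view"},                          # ts is missing
]
clean, quarantined = ingest(batch)
print(len(clean), "clean,", len(quarantined), "quarantined")
```

Keeping the quarantined records together with their reasons means the counts always add up to the input size, which gives a simple check against silent loss.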
RELIABLE
» Design Wisely
» Operate Cautiously
» Update Liberally

What can you do?
●  Pick the right technology and toolset
●  Instrument and monitor mercilessly
●  Anticipate and understand the changes in your environment

Here is how...

Picking the right technology
Ingest Mode | Description | Example
Manual/Scripted | File copying using CLI or GUI interface | Cloudera HUE, Hadoop FS client
Batch Transport | Bulk data transport using tools | Sqoop, DistCp
Micro-batching | Transport of small batches of data | Sqoop/Sqoop2 (Storm, etc...)
Pipelining | Flow-like transport of event streams | Flume, Scribe
Message-Queue | Publish-Subscribe like transport of events | Kafka, Kinesis
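The difference between pipelining and micro-batching in the modes listed here is largely a matter of how many events travel together. A minimal sketch of micro-batching, assuming an in-memory event source; `micro_batches` and `batch_size` are hypothetical names and not part of any tool named in this deck.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def micro_batches(events: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Group a stream of events into lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # source exhausted
        yield batch


batches = list(micro_batches(range(7), batch_size=3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

With `batch_size=1` this behaves like event-at-a-time pipelining, while a very large batch size approaches batch transport.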

Sqoop
Overview
❏  CLI Tool
❏  Oriented towards structured data stores
❏  Runs map-only job to transport data
Advantages
❏  Propagates metadata
❏  Cluster based parallel scaling capability
❏  Simple and easy to understand/operate
❏  Rich set of connectors available for use
❏  Supports popular formats like Avro, sequence file etc.
Disadvantages
❏  Not a service
❏  Direct access to production data stores from cluster
❏  Requires access to data store credentials
❏  Connector functionality is not consistent between different connectors
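The parallel scaling of a map-only transport job comes from dividing the source data among parallel tasks. The sketch below shows one way to divide an integer key range into contiguous splits. It illustrates the idea under the assumption of a numeric key and is not Sqoop's actual implementation; `split_ranges` is a hypothetical helper.

```python
from typing import List, Tuple


def split_ranges(min_key: int, max_key: int, num_splits: int) -> List[Tuple[int, int]]:
    """Divide [min_key, max_key] into up to num_splits contiguous inclusive ranges."""
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    if max_key < min_key:
        return []  # empty source range, nothing to split
    total = max_key - min_key + 1
    num_splits = min(num_splits, total)  # never create empty splits
    base, extra = divmod(total, num_splits)
    ranges = []
    start = min_key
    for i in range(num_splits):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first splits
        ranges.append((start, start + size - 1))
        start += size
    return ranges


for low, high in split_ranges(1, 10, 4):
    # Each range could back one parallel task, e.g. a query bounded by low and high.
    print(low, high)
```

Each task then reads only its own range, which is also why every task needs direct access to the production data store and its credentials.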

Sqoop 2
Overview
❏  Sqoop Service with CLI and JSON/REST interface
❏  Oriented towards structured data stores
❏  Runs chained Map-Reduce jobs for data transport and conversion
Advantages
❏  Propagates metadata
❏  Cluster based parallel scaling capability
❏  Simple and easy to understand/operate
❏  Supports popular formats like Avro, sequence file etc.
❏  Consistent functionality across connectors
❏  Secure handling of credentials with RBAC security
Disadvantages
❏  Considered pre-production quality before 2.0.0 release. Currently at 1.99.6.
❏  May not have connectivity at par with Sqoop 1.

Flume
Overview
❏  Distributed pipeline system for efficient transport of large volumes of data
❏  Built in support for contextual routing, filtering, replication and multiplexing
Advantages
❏  Guaranteed delivery semantics
❏  Low-latency reliable data transfer
❏  Declarative configuration with no coding necessary for common use-cases
❏  Fully extendable and customizable
❏  Integrates with most commonly used end-points
Disadvantages
❏  Non-trivial configuration
❏  Complex topology configuration can be hard to build and maintain
❏  Custom end-point implementation requires significant code complexity
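Contextual routing, replication and multiplexing can be pictured as a function from an event's headers to a set of channels. The sketch below is a toy in-memory model of that idea, not Flume's API or configuration format; the `route` function, the `type` header, and the channel names are all hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List

Event = Dict[str, Any]


def route(events: Iterable[Event], header: str,
          mapping: Dict[str, List[str]], default: List[str]) -> Dict[str, List[Event]]:
    """Send each event to the channels mapped to its header value.

    A value mapped to several channels is replicated to all of them;
    an unmapped or missing value falls back to the default channels.
    """
    channels: Dict[str, List[Event]] = defaultdict(list)
    for event in events:
        value = event.get("headers", {}).get(header)
        for name in mapping.get(value, default):
            channels[name].append(event)
    return dict(channels)


events = [
    {"headers": {"type": "error"}, "body": "disk full"},
    {"headers": {"type": "click"}, "body": "/home"},
    {"headers": {}, "body": "unlabelled"},
]
mapping = {"error": ["alerts", "archive"], "click": ["analytics"]}
routed = route(events, "type", mapping, default=["archive"])
for name in sorted(routed):
    print(name, [e["body"] for e in routed[name]])
```

Even in this toy form the mapping has to be kept in step with every header value the sources can emit, which hints at why larger topologies become hard to maintain.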

Kafka
Overview
❏  Distributed and efficient publish-subscribe messaging system
❏  Used for democratization of data between applications
Advantages
❏  Strong retention and ordering semantics
❏  Dynamic cluster based scalability and throughput
❏  Low-level APIs for building consumers and producers
❏  Variety of open source producers and consumers available on GitHub
❏  Allows reprocessing of consumed data
Disadvantages
❏  Delivery guarantee owned by producers and consumers
❏  Opaque pub-sub design can cause applications to be highly coupled
❏  Minimal metadata support
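Retention, ordering and reprocessing follow from treating the message stream as an append-only log in which each consumer tracks its own read position. The sketch below is a toy in-memory model of that idea and not Kafka's client API; `ToyLog` and its methods are hypothetical names.

```python
from typing import Dict, List


class ToyLog:
    """Append-only log; messages are retained after being read."""

    def __init__(self) -> None:
        self._messages: List[str] = []
        self._offsets: Dict[str, int] = {}  # consumer name -> next offset to read

    def publish(self, message: str) -> int:
        """Append a message and return its offset."""
        self._messages.append(message)
        return len(self._messages) - 1

    def poll(self, consumer: str, max_messages: int = 10) -> List[str]:
        """Return the next messages for this consumer, in publish order."""
        start = self._offsets.get(consumer, 0)
        batch = self._messages[start:start + max_messages]
        self._offsets[consumer] = start + len(batch)
        return batch

    def seek(self, consumer: str, offset: int) -> None:
        """Move a consumer's position, e.g. back to 0 to reprocess."""
        if not 0 <= offset <= len(self._messages):
            raise ValueError("offset out of range")
        self._offsets[consumer] = offset


log = ToyLog()
for message in ["a", "b", "c"]:
    log.publish(message)

first_read = log.poll("reporting")
log.seek("reporting", 0)          # rewind to reprocess everything
replayed = log.poll("reporting")
other_read = log.poll("search")   # an independent consumer sees the same data
print(first_read, replayed, other_read)
```

The log never tracks whether a consumer actually finished its work; in this model, as in the first disadvantage listed for Kafka, that guarantee is left to the producers and consumers.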

Typical Examples
For Structured Data
Simple
❏  Sqoop for batch transport
❏  Sqoop 2 for micro-batch transport
Intermediate
❏  Flume for Directory Spooling
Advanced
❏  Custom Database Log Shipping implementation

For Streaming Event Data
Simple
❏  Flume based aggregation
❏  Kafka based pub-sub for applications
Intermediate
❏  Flume + Kafka based aggregation and pub-sub
Advanced
❏  Kafka + Storm for pub-sub and preparation


For more information...
❏  Apache Sqoop: http://sqoop.apache.org
❏  Current Version: Sqoop1 - 1.4.6; Sqoop2 - 1.99.6
❏  Apache Flume: http://flume.apache.org
❏  Current Version: Flume 1.6.0
❏  Apache Kafka: http://kafka.apache.org
❏  Current Version: Kafka 0.8.2.1

My Contact Information:
●  Email: arvind at streamsets dot com
●  Twitter: @aprabhakar
●  Website: www.streamsets.com

Thank You!
