We (Concurrent) conducted a survey of Cascading users. The Cascading community is one of the most mature Hadoop development communities, with the largest group of respondents reporting over 3 years of Hadoop experience. See what they are using, why they are using it, and what future challenges they anticipate.
2. Confidential
WHAT’S BEHIND THE RISE OF CASCADING?
Enterprise IT teams designing their big data platforms must choose from a daunting array of development frameworks and compute fabrics. On the one hand, they want a development framework that leverages existing skill sets. At the same time, they want the flexibility to benefit from the performance gains of the latest, greatest compute fabrics.
Cascading is a robust framework with over 10,000 known production deployments and over 275,000 downloads per month. Twitter, Airbnb, Climate Corp, Apple, eBay, and Netflix are a few of the enterprises that have built their Hadoop practices with Cascading. The Cascading user group is a diverse, self-supporting community that is helping advance Cascading’s scalability, portability, performance and value. In addition, the large number of open source projects contributed by mainstream enterprises such as Netflix, Commonwealth Bank of Australia, and Expedia attests to the vibrancy of the Cascading ecosystem.
In this paper, we'll reveal what’s behind Cascading's growth by digging into the results of a new Cascading user survey. In general, Cascading users turn out to be extremely concerned about reliability and performance at scale. Many experimented with early Hadoop frameworks like Hive and Pig, but found Cascading to be a more scalable approach. And lately, the easy portability of Cascading applications between compute fabrics has generated a lot of excitement in the community.
CASCADING IS MOST POPULAR AMONG BUILDERS AND MANAGERS OF BIG DATA APPLICATIONS
Chart: What title best describes your role? (N=121; horizontal axis 0 to 70). Categories: Head/VP of IT; Head of IT Infrastructure; Application Manager/Director; BI/EDW Manager/Director; CIO/SVP of IT; IT Specialist; Architect; IT Manager or Director; Developer/Engineer.
Photo: Liverpool Street station crowd blur, by David Sim.
CASCADING COMMUNITY MEMBERS ARE MATURE, PRODUCTION USERS
Chart: How long have you been using Hadoop? (N=69)
0-12 months: 8%
12-24 months: 26%
24-36 months: 25%
Over 3 years: 41%
The largest group of respondents (41%) has been using Hadoop for over 3 years. Assuming the sample is representative, the Cascading community largely consists of early Hadoop adopters.
Furthermore, the Cascading community isn’t just dabbling: over 84% have already put their Cascading applications into production or plan to do so. As for why, many likely found out the hard way that developing directly on Hadoop was painful, tedious and poorly suited to scale.
Chart: What challenges did you have that made you look for an application development framework? (horizontal axis 0 to 45). Categories: Other; Poor integration into existing IT infrastructure; Lack of scalability; Lack of portability across compute fabrics; Difficult to integrate to existing systems; Poor troubleshooting capabilities; Lack of skilled Hadoop resources; High cost of development in existing platform; Slow development in existing platform.
THE PATH TO CASCADING: HIVE, PIG, AND GUI TOOLS
N=69
Given the maturity of Cascading users, it’s no surprise that many explored alternatives before settling on Cascading. Hive and Pig, both of which were early abstraction layers for MapReduce, together account for the majority (51%) of the alternatives explored. Today, many Pig applications run alongside Cascading and many Hive applications run within Cascading.
Why didn’t they stick with Hive and Pig? Most organizations determined they could not scale with Hive and Pig. Typically that was because Hive and Pig required scarce technical resources and because development in those frameworks was slow. Those who opted for other API frameworks found them not yet ready for the enterprise.
A smaller group experimented with GUI-based ETL tools. While these tools made it easy to leverage existing resources and skill sets, their capabilities were too limited. They also required building special scripts to achieve complex functionality, which negated the benefits of simplicity. Additionally, many users did not like being locked into a single-vendor solution.
Chart: Before selecting Cascading, what alternative solutions did you explore? (select all that apply)
Pig: 26%
Hive: 25%
Other API frameworks (Spark, Crunch): 22%
GUI-based ETL tools (Talend, Informatica, Pentaho): 19%
No other alternatives were explored: 8%
PORTABILITY ACROSS FABRICS
Chart: Which compute fabric(s) are you using or planning to use in the next 18 months? (N=69; horizontal axis 0 to 60). Categories: Other; Flink; Tez; Storm; Kafka; MapReduce; Spark.
New compute fabrics appear all the time, though not all are production-ready. The responses reflect high interest in Spark and a desire for true streaming (not micro-batches). MapReduce isn’t going away any time soon, especially where reliability is a requirement. Still, many are experimenting with other compute fabrics. Because each fabric offers application-specific advantages, most organizations will likely wind up running multiple fabrics.
Cascading 3.0 supports Tez, MapReduce, and local/in-memory, so users can port applications from MapReduce to Tez simply by changing a few lines of code. Easy portability makes Cascading an ideal platform for moving from MapReduce to Tez without incurring the cost of rewriting applications. Soon, Cascading will support the same portability for Spark and Flink (for Flink, support will be community contributed).
CASCADING BRIDGES OTHER DEVELOPMENT FRAMEWORKS
N=69
Despite their shortcomings, MapReduce, Hive and Pig are still widely in use as development frameworks, largely because many early Hadoop applications were built through these interfaces. No surprise that we see a lot of excitement about Spark as a new development framework as well; many users are experimenting with developing directly in the Spark API.
Cascading will support Spark in a future WIP, adding an important framework option for Spark developers. Developers who build in Cascading will be able to port their applications from MapReduce to Spark without having to rewrite them in the Spark API.
In summary, there is no one-size-fits-all framework. Flexibility is key as organizations build out their big data strategies and platforms.
Chart: What data application development framework do you use? (horizontal axis 0 to 60). Categories: Cascalog; Scalding; Pig; Hive; MapReduce; Cascading; Spark.
“[Cascading] Best Hadoop API for enterprise data-intensive apps.” – Architect, Fortune 500 Healthcare Payer
COMMON USE CASES: ETL, ANALYTICS & DATA INTEGRATION
N=69
Most organizations rely on Hadoop for heavy processing steps within ETL, analytics or data integration flows. Some have moved their entire ETL processing to Hadoop, while others have moved only portions of their workflows.
For example, Airbnb uses Cascading for complicated infrastructure tasks such as data normalization and cleansing. Airbnb also leverages Cascading for reconstructing corrupted files and merging data. In combination with Cascading, Pig and Hive are used by analysts to run batch scripts to perform ad hoc analysis. With these tools, analysts are able to more easily study crucial metrics like click-through rates, page statistics, and drop-off rates.
Chart: What best describes the projects where you are using Cascading? (horizontal axis 0 to 50). Categories: Other; Search Optimization; Recommendation Engines; Data Quality; Machine Learning and Scoring; Data Integration; Analytics; ETL.
45%: Offloading ETL to Hadoop
40%: To Support Analytics/BI Projects
33%: Data Integration Projects
WHY THEY LOVE CASCADING: TDD, JAVA API, PORTABILITY
Chart: How likely is it that you would recommend Cascading to a friend or colleague? (N=79)
Extremely likely - 10: 23%
9: 10%
8: 20%
7: 19%
6: 11%
5: 6%
4: 1%
3: 3%
2: 4%
Not at all likely - 0: 3%
Top 3 Most Impactful Capabilities
• Test Driven Development (49%) - Efficiently test code and process local files before you deploy on a cluster with Cascading’s local or in-memory mode. Incorporate inline data assertions to define results at any point in your pipeline. Failed assertions are easily visible and available for analysis.
• Java API (44%) - Cascading is a Java library and does not require installation. Cascading fits directly into a standard development process; all you have to do is code to the API.
• Application Portability (43%) - When you compile a Cascading job, it automatically creates a run-time executable for your specified compute fabric. Simply by changing a few lines of code, you can test your application on multiple fabrics and choose the best for your needs.
53% of respondents are promoters (score of 8/10 or higher)
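The 53% promoter figure can be checked against the recommendation scores listed on this slide. The following is a minimal sketch in Python, not part of the original survey; the dictionary and function names are illustrative. It assumes the percentages pair with the scores as listed, and it also shows the share under the conventional Net Promoter cut-off of 9 to 10 for comparison.

```python
# Share of respondents (in percent) per recommendation score, as listed on the slide.
# Score 1 is not listed on the slide, so it is left out here.
score_share = {10: 23, 9: 10, 8: 20, 7: 19, 6: 11, 5: 6, 4: 1, 3: 3, 2: 4, 0: 3}


def share_at_or_above(shares, threshold):
    """Sum the percentage of respondents whose score is >= threshold."""
    return sum(pct for score, pct in shares.items() if score >= threshold)


# Cut-off used on the slide: scores of 8, 9 and 10 count as promoters.
promoters_slide = share_at_or_above(score_share, 8)

# Conventional Net Promoter cut-off (9 and 10 only), shown for comparison.
promoters_conventional = share_at_or_above(score_share, 9)

print(f"Promoters (8-10): {promoters_slide}%")
print(f"Promoters (9-10): {promoters_conventional}%")
```

The listed shares sum to 100%, so the slide's figure corresponds to counting scores of 8 and above as promoters.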
CASCADING SLASHES TIME TO MARKET
N=79
Most improved time to market by at least 40%
Chart: What percentage would you estimate your time to market has improved?
Over 300%: 5%
Over 100%: 17%
80%-100%: 12%
60%-80%: 18%
40%-60%: 17%
20%-40%: 18%
Less than 20%: 13%
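The headline claim that most respondents improved time to market by at least 40% can be reproduced from the chart values. The sketch below is illustrative only and assumes that the percentages pair with the legend entries in the order they appear on the slide; the variable names are not from the survey.

```python
# (improvement bracket, share of respondents in percent), paired in slide order.
# The pairing of values to brackets is an assumption based on that order.
time_to_market = [
    ("Over 300%", 5),
    ("Over 100%", 17),
    ("80%-100%", 12),
    ("60%-80%", 18),
    ("40%-60%", 17),
    ("20%-40%", 18),
    ("Less than 20%", 13),
]

# Brackets that represent an improvement of at least 40%.
at_least_40 = {"Over 300%", "Over 100%", "80%-100%", "60%-80%", "40%-60%"}

share_at_least_40 = sum(pct for label, pct in time_to_market if label in at_least_40)
total = sum(pct for _, pct in time_to_market)

print(f"Improved by at least 40%: {share_at_least_40}% of respondents (total {total}%)")
```

Under that pairing, the brackets from 40% upward add up to roughly two thirds of respondents, which is consistent with the slide's "most".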
THE FUTURE: BETTER PERFORMANCE, DATA PIPELINE VISIBILITY
Chart: What future challenges do you anticipate in managing your data applications? (N=69; horizontal axis 0 to 60). Categories: Other; Supporting chargeback models; Forecasting big data infrastructure needs; Monitoring SLAs for Hadoop applications; Identify and resolve Hadoop application issues faster; Optimizing application performance.
Application performance management is a top-of-mind concern for most respondents. While performance tuning happens on the operations side, optimizing applications to meet service-level commitments is usually a collaborative effort between development and operations teams. Developers need better tools to visualize data pipelines and detect undesirable behavior before they promote applications to production. Operations teams need better tools to monitor, manage and optimize data delivery.
An important, though secondary, concern is tracking the rate of Hadoop resource consumption so clusters can be right-sized and costs distributed across divisions. This is particularly true as more of an organization’s departments/teams build and rely on big data applications, transforming their Hadoop cluster from a side project into core production IT infrastructure.
With new application performance management tools such as Driven, teams can visualize data pipelines and identify unwanted behavior more effectively. Tools like Driven also arm teams with the data necessary to pinpoint issues quickly and resolve them collaboratively.
DISTRIBUTIONS
Chart: Distributions (N=69; horizontal axis 0 to 40, count of respondents). Categories: Other (please specify); MapR; Hortonworks; Apache Hadoop; Amazon EMR; Cloudera.
NUMBER OF APPLICATIONS AND VOLUME
Over 100 60-100 30-60 15-30 5-15 1-5
Less than 250 pipelines 4 5 4 26
500 - 1,000 pipelines 2 2 1 1 2
250 - 500 pipelines 1 3 5
2,500 - 5,000 pipelines 1 1
1,000 - 2,500 pipelines 2 3 1
Over 5,000 pipelines 1
Over 10,000 pipelines 1 1 2
Chart: Average Number of Cascading Applications and Pipelines (N=69; axis 0 to 40)
PRODUCTION STATUS
Chart: Are you using your Cascading data applications in a production environment? (N=69; horizontal axis 0 to 50). Categories: No and not planned; Not yet but planned; Yes.