Building a Big Data Analytics Function for Long-Term Success
Himanshu Bari - https://www.linkedin.com/in/himanshubari
The last three years of my career have been in the big data space. It started at the ground zero of the big data revolution, at Hortonworks, one of the leading Hadoop distributions. As a product manager, I have had the privilege of working closely with marketing, customer success, pre-sales and post-sales teams across industry verticals to make our internal customer champions successful in their quest to formulate & execute their big data strategies. My view has spanned all phases of implementation: from early use case selection to POC & pilot execution, post-pilot production, operationalization and, finally, evangelism. Having an inside view of the evolution of the big data market has been an extremely rewarding experience, and I learned a lot.

While working with big data solution owners, I was often reminded of my time in the central technology strategy group at Lehman Brothers. There I had the opportunity to build and drive adoption of a homegrown application performance monitoring solution across the entire company, and I overcame some of the same organizational & process hurdles now faced by the big data early adopters.
Who is this report for?
The pioneers in the big data space have battle scars and have learned many of the lessons in this report the hard way. But if you are a general manager just embarking on the big data journey, you should now have what they call the 'second mover advantage'. My hope is that this report helps you better leverage that advantage.
What is this report NOT?
- This is NOT meant to be a technical recipe book for building big data systems. There is no shortage of those; just look through any of the vendor websites. Or ping me, and I would be happy to talk tech!
- This is NOT a big data project plan or a budgeting primer. There are too many organizational and situational specifics needed to create those. But my hope is that the content in this report will serve as a guidepost & input into those efforts.
What is the FOCUS of this report?
The goal here is to shed some light on the people & process issues in building a central big data analytics function.
The rest of this report is organized around four key pillars: getting to 'complete data', asking the right questions to drive the platform roadmap, building a self-service data platform, and creating organizational glue. For each area, I will discuss:
1. Common problems
2. Some best practices
3. A getting started plan
Getting to 'Complete Data'
The data platform is only as good as the data in it. Most big data projects simply assume that a lot of data is available and that having 'more' data will magically result in better insights. While that is loosely true in some cases (like machine learning), the most successful organizations pay close attention to truly understanding the nature of the data available.
Common Operational Issues in getting to 'Complete Data'
1. Ownership split across teams, creating 'silos' – Happens naturally as the various internal products & systems evolve organically or inorganically.
2. Format & quality issues – Inherently introduced by the silos and variety of systems. As a result, there are often very different views & interpretations of the same asset.
3. Data has 'inertia' – It cannot be easily moved around.
4. Merging 'event' & transactional data is challenging.
5. Access requirements vary by users & workloads. E.g. reporting & analytics use cases have different data prep needs than data science use cases.
6. Data ingestion and distribution into the platform become the 'sole' responsibility of the big data team. They just become data movement monkeys.
Some best practices
1. Store as much as possible: Data that is not valuable to you right now may still be valuable in the future.
2. Capture at source: Capture at the lowest granularity; store pre-aggregated data at varying granularities.
3. Make data consumption APIs 'flexible': Make data easy to discover, understand & consume. Only then will the folks who were not doing anything with data be able to play around with it and come up with insights that no one was thinking of.
4. Build an on-demand data fusion capability (to break the silos): You cannot have all possible fusions stored all the time, and you cannot 'guess' the data fusions ahead of time.
5. Self-service for data ingestion: The big data team cannot be the 'gatekeeper' for all data ingestion pipelines.
6. Invest early in metadata, lineage & security: Focus on data quality from day one. If people lose faith in quality, they will go back to the old ways of doing things. Maintain data quality through continuous audits that cross-check results.
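The 'discover, understand & consume' idea in practice amounts to a thin catalog layer over the datasets. Here is a minimal sketch in Python; the `Dataset` and `DataCatalog` names and their fields are my own illustrative assumptions, not a prescribed design:

```python
from dataclasses import dataclass, field

@dataclass
class Dataset:
    """Catalog entry that makes a dataset discoverable and self-describing."""
    name: str
    description: str
    owner: str
    schema: dict                           # column name -> type
    tags: list = field(default_factory=list)

class DataCatalog:
    """Minimal discover/understand interface over registered datasets."""
    def __init__(self):
        self._datasets = {}

    def register(self, ds: Dataset):
        self._datasets[ds.name] = ds

    def search(self, keyword: str):
        """Discover: find datasets whose name, description, or tags match."""
        kw = keyword.lower()
        return [d for d in self._datasets.values()
                if kw in d.name.lower()
                or kw in d.description.lower()
                or any(kw in t.lower() for t in d.tags)]

    def describe(self, name: str) -> dict:
        """Understand: expose schema and ownership before consumption."""
        d = self._datasets[name]
        return {"owner": d.owner, "schema": d.schema, "tags": d.tags}
```

Even a registry this small lets people who have never touched the data browse what exists and what it means before writing a single query.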
Getting Started Plan
1. Catalog your current data (metadata management). Start engaging cross-functionally to understand the type/format/meaning of the data.
2. Investigate which data is being thrown away and how you can increase the 'granularity' of data capture.
3. Figure out what it would take to capture data as close to the source systems as possible.
4. Plan retention & security policies from the inception.
5. Accounting for steps 1-4 above, start estimating how much data you have today & at what rate it will grow.
6. Classify ingestion requirements for bulk, incremental, change & streaming data.
7. Analyze the impact on existing enterprise products and plan data collection integrations to:
   a. Minimize friction and keep collection processes decoupled
   b. Enable self-service
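Point 6, classifying ingestion requirements, can be made concrete with a rough decision rule per source system. The flags and the 60-second latency threshold below are illustrative assumptions only, not fixed rules:

```python
from enum import Enum

class IngestionMode(Enum):
    BULK = "bulk"                # one-time or periodic full loads
    INCREMENTAL = "incremental"  # append-only deltas since the last run
    CHANGE = "change"            # change-data-capture of updates/deletes
    STREAMING = "streaming"      # continuous low-latency events

def classify_source(update_frequency_s: float,
                    has_updates: bool,
                    append_only: bool) -> IngestionMode:
    """Heuristic classification of one source system's ingestion requirement."""
    if update_frequency_s < 60:          # data arrives continuously
        return IngestionMode.STREAMING
    if has_updates and not append_only:  # rows change in place -> need CDC
        return IngestionMode.CHANGE
    if append_only:                      # new rows only -> pull the deltas
        return IngestionMode.INCREMENTAL
    return IngestionMode.BULK            # otherwise, periodic full loads
```

Running every known source through a rule like this gives you a first-cut inventory of which pipeline types (and which tools) the platform must support.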
Right Questions – Driving the Platform Roadmap
It is easy to get sucked into the new-tech frenzy surrounding the big data market. This is especially true when the buyers are centralized IT teams looking to build the next generation of data processing platforms. But the most successful big data projects always start with the right questions, without getting sucked into analysis paralysis. While this is well-known wisdom, here are some common problems faced in putting it into practice.
Common Operational Issues
1. Extreme approaches: Either extremely narrow, business-driven use cases justified under the 'quick wins' bucket, or a boil-the-ocean, centralized-IT-driven data lake.
2. Useless and unrealistic science projects hiding under 'visionary statements': The economics of storing and processing data at scale have improved so significantly that no problem seems unachievable, so it is often easy to come up with something radical and completely ignore the 'Why NOW?' question. Even the best innovations are useless if they are introduced ahead of their time.
3. Overestimating the benefits: E.g. a common use case you will hear is ETL modernization. While there is truth to the fact that you can do any ETL (or rather ELT) in Hadoop, if you are looking to do ETL in Hadoop for the wrong reasons, you will crash and burn.
4. Overemphasis on net 'new problems': 'Why fix something that ain't broken' is the popular belief, and then there is also the need to 'minimize impact'. This forces many organizations to look for net-new problems. These often mean higher risk and a less clear understanding of success. Just because a problem is 'new' doesn't mean it is more important and should take precedence over improving existing solutions.
Some best practices
1. Don't boil the ocean in the first use case, but start with a problem that spans some business silos and forces people to collaborate. This will give you a limited preview of the collaboration, political & technology hurdles that will need to be overcome if you want to create a big data platform that works in the long term.
2. Create the right incentive structure so that the right, and not necessarily the 'sexiest', problems get attention first.
3. Paint the vision but ground the execution: A vision is useless if it starts paying off only when it is 'fully realized'. Even the first milestone needs to have a tangible, or intangible but measurable, benefit.
4. Do it for the right reasons. Keep asking 'so what' until you arrive at a meaningful outcome that will have a direct impact on the business. Going through this process will also help you sell the idea at ALL levels of the organization.
Getting Started Plan
1. Engage cross-functionally to create a simple use case analysis grid that has the following information for every use case:
   a) Name & description of the 'what'
   b) Category (net-new addition or improvement of an existing solution)
   c) Overall benefits expected from the use case over the next 12 months and long term
   d) Data needed to address the use case (what is available & what is missing)
   e) Three milestones (outcomes) to be hit over the next 12 months
   f) Measures of success for each milestone
   g) Which BUs/product areas will be involved per milestone
2. Prioritize – The exercise above should give you enough raw data to prioritize the use cases.
3. Get to the next level of detail – Pick the top three use cases and start expanding the milestones into first-level requirements. Break the requirements into 'must haves' & 'stretch goals' across three phases: 'crawl', 'walk' & 'run'. This process will give you some clarity of thought & expose holes or unrealistic assumptions.
4. Start evangelizing internally – Start evangelizing 'intent'. At a bare minimum, target the stakeholders across the product/functional areas benefiting from the first target use cases. Incorporate their feedback. You should now have enough to start thinking about the 'how', i.e. the key functional components of the platform.
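The use case analysis grid above can be captured in a lightweight structure so that the prioritization in step 2 falls out mechanically. The field names and the scoring weights below are hypothetical choices for illustration, not part of the original grid:

```python
from dataclasses import dataclass, field

@dataclass
class UseCase:
    """One row of the use case analysis grid (fields a-e and g above)."""
    name: str                    # a) name & description of the 'what'
    category: str                # b) 'net new' or 'improvement'
    benefit_12mo: int            # c) expected 12-month benefit, scored 1-10
    benefit_long_term: int       # c) long-term benefit, scored 1-10
    data_missing: list = field(default_factory=list)    # d) data gaps
    milestones: list = field(default_factory=list)      # e) 12-month outcomes
    business_units: list = field(default_factory=list)  # g) BUs involved

def prioritize(use_cases):
    """Rank use cases: favor near-term benefit, penalize missing data.
    The 2:1 weighting of near-term over long-term benefit is an assumption."""
    def score(uc):
        return 2 * uc.benefit_12mo + uc.benefit_long_term - len(uc.data_missing)
    return sorted(use_cases, key=score, reverse=True)
```

Scoring this way is deliberately crude; its value is forcing every use case to state its benefits and data gaps in comparable terms before anyone argues about ranking.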
Self-Service Data Platform
The goal should be to build a future-proof platform without boiling the ocean on day one and introducing every possible big data technology early on.
Common Operational Issues
1. Useless pilots – Success criteria so generic that the results don't mean much for a production implementation. These pilots end up being simple training exercises for employees and lead to heavy fudging and influence by vendors and internal political interests. Ironically, the big data pilot gets evaluated by anything but solid criteria founded in data.
2. Visualization & BI tools need data in their own islands – If your reporting & BI use cases start needing data to be extracted out of the central store, you are setting yourself up for long-term disaster.
3. 'Batch' thinking – Many of the early adopters made extensive investments in 'batch processing' in Hadoop. Now they are struggling to evolve those investments to support near real-time stream processing so they can take action at the right time and create a feedback loop in their analytical pipelines. To be fair, this was not a mistake but just a side effect of being the first mover. Now there are better options available than MapReduce.
4. Not making good use of 'professional services' – Professional services (PS) revenue accounts for a large chunk of all Hadoop distributions' revenue. PS teams are essential given the skills gap in this market. But too many organizations struggle after the PS team has left the premises. This is especially true when the charter of the PS team was to help 'get started' with things like cluster set-up, implementing a sample application, etc.
5. Being too cagey about your big data successes – Many organizations overestimate the value of 'secrecy'. While some use cases do warrant secrecy, in many others the value of evangelizing your success externally in the community far outweighs any downsides. Remember, the hard part is being successful in your big data project. You can very safely assume all your competitors are trying many of the same things as you are.
Some best practices
1. Keep proofs of concept (POCs) and pilots separate – POCs are meant for the team to get familiar with the technology. Pilots need to be real: the scope and deliverable need to be such that at the end of the pilot, you have something that you can easily migrate to production.
2. The very first milestone's output should be something that will get used every day in 'production'. This will force you to think of important operational issues early on.
3. Be ready to pay for your pilots – You get what you pay for. It is true that you can get business-hungry big data vendors to do pilots for free. But willingness to pay even just a little bit will put you high up on their priority list. It will also get you their best people and, more importantly, get the vendor to be forthcoming as a true partner in your success, rather than forcing them to constantly be in 'sell mode'.
4. Plan to minimize data movement out of the Hadoop cluster.
5. Think carefully when involving professional services teams – For parts of the platform that are not core to your big data strategy, you might want to permanently outsource operations & maintenance. If you need assistance in building a piece of the solution, be absolutely sure that the outside PS team is pairing with your internal developers so there can be a successful handoff.
Getting Started Plan
1. Infrastructure evaluation – Based on your understanding of the data and the use case roadmap, start charting out the broad storage & compute hardware requirements. Do a gap analysis to figure out what is missing. As part of planning to address the gaps, consider running the platform, or parts of it, in the cloud vs. on-premise.
2. Software functional evaluation – Before getting into the technology, it is important to understand the 'data access' pattern requirements (e.g. search, ad-hoc reporting, fast key-value look-ups, real-time, batch, machine learning, etc.). Model these as 'services' of the broader platform rather than as islands of data. Consider the data ingestion & distribution requirements as part of these functions. This should give you a sense of the 'gaps' in your current environment and also expose all the integrations needed. Based on that, you should move on to a build vs. buy analysis.
3. Platform operations evaluation – This part is often neglected.
4. Skills evaluation – See the last section on 'organizational glue' for more on this.
5. For the production roll-out phase, plan to 'fix a ship in flight'. This would require a period of running your new system in parallel with the old and doing a phased end-of-life. Here is an example of a typical analytics adoption/product integration cycle (Offline = analytics done offline in batch & not directly integrated with core products; Online = analytics done in real time and integrated with enterprise products):

   Analytics Stage        | Short Term | Medium/Long Term
   Descriptive analytics  | Offline    | Online
   Predictive analytics   | Offline    | Online
   Prescriptive analytics | Online     | Online
6. Documentation is important and can't slip low on the priority list (even if the products are internal & not customer-facing).
7. Create an evangelism plan (blogs on the website, industry event talks, internal lunch-n-learns, meet-ups, social media campaigns & webinars).
8. Run it like a 'startup': This will force hard prioritizations & introduce a much-needed sense of urgency without drowning in too many processes. It will enable you to be 'scrappy & resourceful' within the organization. The need to produce quick output, fail fast & iterate will require agile development practices. Above all, it will help attract the right talent!
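The 'services, not islands of data' idea from step 2 of the plan above can be sketched as multiple access patterns exposed over one shared central store. The class names and the in-memory dictionary standing in for the store are purely illustrative assumptions:

```python
from abc import ABC, abstractmethod

class DataService(ABC):
    """An access pattern modeled as a service of the shared platform,
    not an island holding its own extracted copy of the data."""
    @abstractmethod
    def query(self, request: dict): ...

class KeyValueLookup(DataService):
    """Fast key-value look-ups served directly from the central store."""
    def __init__(self, store: dict):
        self._store = store  # shared, not copied

    def query(self, request: dict):
        return self._store.get(request["key"])

class BatchScan(DataService):
    """Batch/ad-hoc scans over the very same central store."""
    def __init__(self, store: dict):
        self._store = store  # shared, not copied

    def query(self, request: dict):
        predicate = request.get("filter", lambda record: True)
        return [v for v in self._store.values() if predicate(v)]
```

Because both services wrap the same store, a write lands once and is immediately visible to every access pattern; no extraction pipeline ever ships data out to a tool-specific island.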
Organizational Glue
The scarcity of big data skills in the market gets a lot of attention. While it is true that 'data scientist' is the sexiest job of the 21st century, even the smartest data scientists and the best technology will not be successful unless you have the organizational glue in place to bring all the pieces together.
Common Problems & Some Best Practices
Skillset shortage & imbalance is the most common problem with big data projects. There is a tendency to hire Hadoop developers and data scientists – the two most in-demand jobs. However, if you look at any big data implementation, it spans various technologies and also needs a heavy operations focus. It is hard, and I would argue unnecessary, to plan to hire a team that can own every piece of it in house. The better approach is to seek out the right development APIs that can enable your existing talent to leverage big data technologies. Outsource the aspects of the solution that are not key differentiators for your business. Open source the pieces of your stack that add value but are not key differentiators for your business. There is a reason why large tech companies like Netflix and Facebook open source so many projects: they want to find community support so they can hire easily from the community and get free bug fixes as more and more developers fix parts of the open source projects they use.

The big data function should be run by a leader who understands the technology and the operational issues well, but at the same time has the caliber to get a firm grasp of the business priorities. This will allow them to gain credibility & respect across all functions of the organization and with customers.
Getting Started Plan
1. Establish the lay of the land of the organization, functionally as well as politically (know where to go for what).
2. Form planning & execution teams that include product, engineering & operations functional liaisons from the big data team as well as from the products/business units that the features on the roadmap impact.
3. Map out the team composition:
   a. Product team
      1. Hire data product managers who can liaise with the various enterprise product components to ensure that the right offline & online integration capabilities are exposed. The main charter is to make the big data platform truly useful across all the different product lines.
      2. Data scientists should be on the product team – experimental work plus models that drive offline & online data platform product capabilities.
      3. Reporting and analytics (including its product management) should roll under the big data team and continue its current responsibilities in the short term.
      4. Shared resource/dotted line – UX, based on how you decide to evolve the user interface pieces of the data platform and products.
   b. Engineering team
      i. A Silicon Valley presence is essential.
      ii. You need a big data architect to design the end-to-end system, one who understands the technical challenges of piecing the various big data tools together.
      iii. Hire based on the platform use case requirements to create a mix of generalist big data engineers & experts in a functional area, e.g. NoSQL or search specialists.
4. Rest of the organization: Should demand self-service from the platform and not rely on the big data team all the time.
5. Leadership – drive a data culture:
   a. Needs to simply ask these questions for EVERY decision: show me the data, its source, the analysis and your confidence.
   b. Avoid FAKING it. (Many people use the outcomes of reports to support preconceived conclusions.)
6. Hiring: Look to universities for fresh talent in the stats & machine learning area & pair them with experienced business analysts.
7. Invest in ongoing cross-training, skill development & retention: Run training courses around data consumption. Evaluate the skillsets of existing team members and their current career goals, evaluate them w.r.t. the requirements of the data platform product roadmap, and make training goals part of performance reviews.
If you have been reading this far and found the content useful, I am glad I could help! If you have your own experiences to share, or you think any of this doesn't make sense, I would love to hear your comments! Thank you!