Building a Big Data Analytics Function for Long-Term Success
Himanshu Bari - https://www.linkedin.com/in/himanshubari
The last three years of my career have been in the big data space. It started at the ground zero of the big data revolution, at Hortonworks, one of the leading Hadoop distributions. As a product manager, I have had the privilege of working closely with marketing, customer success, pre-sales and post-sales teams across industry verticals to make our internal customer champions successful in their quest to formulate & execute their big data strategies. My view has spanned all phases of implementation: from early use case selection to POC & pilot execution, post-pilot production, operationalization and, finally, evangelism. Having an inside view of the evolution of the big data market has been an extremely rewarding experience, and I learned a lot.

While working with big data solution owners, I was often reminded of my time in the central technology strategy group at Lehman Brothers. There I had the opportunity to build and drive adoption of a homegrown application performance monitoring solution across the entire company, and I overcame some of the same organizational & process hurdles now faced by the big data early adopters.
Who is this report for?
The pioneers in the big data space have battle scars and have learned many of the lessons in this report the hard way. But if you are a general manager just embarking on the big data journey, you should now have what they call the 'second mover advantage'. My hope is that this report helps you better leverage that advantage.
What is this report NOT?
- This is NOT meant to be a technical recipe book for building big data systems. There is no shortage of those; just look through any of the vendor websites. Or ping me, and I would be happy to talk tech!
- This is NOT a big data project plan or a budgeting primer. There are too many organizational and situational specifics needed to create those. But my hope is that the content in this report will serve as a guidepost & input into those efforts.
What is the FOCUS of this report?
The goal here is to shed some light on the people & process issues in building a central big data analytics function.
The rest of this report is organized around four key pillars: getting to 'complete data', asking the right questions to drive the platform roadmap, building a self-service data platform, and creating organizational glue. For each area, I will discuss:
1. Common problems
2. Some best practices
3. A getting started plan
Getting to 'Complete Data'
The data platform is only as good as the data in it. Most big data projects simply assume that a lot of data is available and that having 'more' data will magically result in better insights. While that is loosely true in some cases (like machine learning), the most successful organizations pay close attention to truly understanding the nature of the data available.
Common Operational Issues in getting to 'Complete Data'
1. Ownership split across teams, creating 'silos' – Happens naturally as the various internal products & systems evolve organically or inorganically.
2. Format & quality issues – Inherently introduced by the silos and variety of systems. As a result, there are often very different views & interpretations of the same asset.
3. Data has 'inertia' – It cannot be easily moved around.
4. Merging 'event' & transactional data is challenging.
5. Access requirements vary by users & workloads. E.g. reporting & analytics use cases have different data prep needs than data science use cases.
6. Data ingestion and distribution into the platform become the 'sole' responsibility of the big data team. They just become data movement monkeys.
Some best practices
1. Store as much as possible: Data that is not valuable to you right now may still be valuable in the future.
2. Capture at source: Capture at the lowest granularity; store pre-aggregated data at varying granularities.
3. Make data consumption APIs 'flexible': Make data easy to discover, understand & consume. Only then will the folks who were not doing anything with data be able to play around with it and come up with insights that no one was thinking of.
4. Build an on-demand data fusion capability (to break the silos): You cannot have all possible fusions stored all the time, and you cannot 'guess' the data fusions ahead of time.
5. Self-service for data ingestion: The big data team cannot be the 'gatekeeper' for all data ingestion pipelines.
6. Invest early in metadata, lineage & security: Focus on data quality from day one. If people lose faith in quality, they will go back to the old ways of doing things. Maintain data quality through continuous audits that cross-check results.
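The 'discover, understand & consume' idea in practice amounts to a thin catalog layer over the datasets. Here is a minimal sketch in Python; the `Dataset` and `DataCatalog` names and their fields are my own illustrative assumptions, not a prescribed design:

```python
from dataclasses import dataclass, field

@dataclass
class Dataset:
    """Catalog entry that makes a dataset discoverable and self-describing."""
    name: str
    description: str
    owner: str
    schema: dict                           # column name -> type
    tags: list = field(default_factory=list)

class DataCatalog:
    """Minimal discover/understand interface over registered datasets."""
    def __init__(self):
        self._datasets = {}

    def register(self, ds: Dataset):
        self._datasets[ds.name] = ds

    def search(self, keyword: str):
        """Discover: find datasets whose name, description, or tags match."""
        kw = keyword.lower()
        return [d for d in self._datasets.values()
                if kw in d.name.lower()
                or kw in d.description.lower()
                or any(kw in t.lower() for t in d.tags)]

    def describe(self, name: str) -> dict:
        """Understand: expose schema and ownership before consumption."""
        d = self._datasets[name]
        return {"owner": d.owner, "schema": d.schema, "tags": d.tags}
```

Even a registry this small lets people who have never touched the data browse what exists and what it means before writing a single query.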
Getting Started Plan
1. Catalog your current data (metadata management). Start engaging cross-functionally to understand the type/format/meaning of the data.
2. Investigate which data is being thrown away and how you can increase the 'granularity' of data capture.
3. Figure out what it would take to capture data as close to the source systems as possible.
4. Plan retention & security policies from the inception.
5. Accounting for steps 1-4 above, start estimating how much data you have today & at what rate it will grow.
6. Classify ingestion requirements for bulk, incremental, change & streaming data.
7. Analyze the impact on existing enterprise products and plan data collection integrations to:
   a. Minimize friction and keep collection processes decoupled
   b. Enable self-service
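Point 6, classifying ingestion requirements, can be made concrete with a rough decision rule per source system. The flags and the 60-second latency threshold below are illustrative assumptions only, not fixed rules:

```python
from enum import Enum

class IngestionMode(Enum):
    BULK = "bulk"                # one-time or periodic full loads
    INCREMENTAL = "incremental"  # append-only deltas since the last run
    CHANGE = "change"            # change-data-capture of updates/deletes
    STREAMING = "streaming"      # continuous low-latency events

def classify_source(update_frequency_s: float,
                    has_updates: bool,
                    append_only: bool) -> IngestionMode:
    """Heuristic classification of one source system's ingestion requirement."""
    if update_frequency_s < 60:          # data arrives continuously
        return IngestionMode.STREAMING
    if has_updates and not append_only:  # rows change in place -> need CDC
        return IngestionMode.CHANGE
    if append_only:                      # new rows only -> pull the deltas
        return IngestionMode.INCREMENTAL
    return IngestionMode.BULK            # otherwise, periodic full loads
```

Running every known source through a rule like this gives you a first-cut inventory of which pipeline types (and which tools) the platform must support.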
Right Questions – Driving the Platform Roadmap
It is easy to get sucked into the new-tech frenzy surrounding the big data market. This is especially true when the buyers are centralized IT teams looking to build the next generation of data processing platforms. But the most successful big data projects always start with the right questions, without getting sucked into analysis paralysis. While this is well-known wisdom, here are some common problems faced in putting it into practice.
Common Operational Issues
1. Extreme approaches: Either extremely narrow, business-driven use cases justified under the 'quick wins' bucket, or a boil-the-ocean, centralized-IT-driven data lake.
2. Useless and unrealistic science projects hiding under 'visionary statements': The economics of storing and processing data at scale have improved so significantly that no problem seems unachievable, so it is often easy to come up with something radical and completely ignore the 'Why NOW?' question. Even the best innovations are useless if they are introduced ahead of their time.
3. Overestimating the benefits: E.g. a common use case you will hear is ETL modernization. While there is truth to the fact that you can do any ETL (or rather ELT) in Hadoop, if you are looking to do ETL in Hadoop for the wrong reasons, you will crash and burn.
4. Overemphasis on net 'new problems': 'Why fix something that ain't broken' is the popular belief, and then there is also the need to 'minimize impact'. This forces many organizations to look for net-new problems. These often mean higher risk and a less clear understanding of success. Just because a problem is 'new' doesn't mean it is more important and should take precedence over improving existing solutions.
Some best practices
1. Don't boil the ocean in the first use case, but start with a problem that spans some business silos and forces people to collaborate. This will give you a limited preview of the collaboration, political & technology hurdles that will need to be overcome if you want to create a big data platform that works in the long term.
2. Create the right incentive structure so that the right, and not necessarily the 'sexiest', problems get attention first.
3. Paint the vision but ground the execution: A vision is useless if it starts paying off only when it is 'fully realized'. Even the first milestone needs to have a tangible, or intangible but measurable, benefit.
4. Do it for the right reasons. Keep asking 'so what' until you arrive at a meaningful outcome that will have a direct impact on the business. Going through this process will also help you sell the idea at ALL levels of the organization.
Getting Started Plan
1. Engage cross-functionally to create a simple use case analysis grid that has the following information for every use case:
   a) Name & description of the 'what'
   b) Category (net-new addition or improvement of an existing solution)
   c) Overall benefits expected from the use case over the next 12 months and long term
   d) Data needed to address the use case (what is available & what is missing)
   e) Three milestones (outcomes) to be hit over the next 12 months
   f) Measures of success for each milestone
   g) Which BUs/product areas will be involved per milestone
2. Prioritize – The exercise above should give you enough raw data to prioritize the use cases.
3. Get to the next level of detail – Pick the top three use cases and start expanding the milestones into first-level requirements. Break the requirements into 'must haves' & 'stretch goals' across three phases: 'crawl', 'walk' & 'run'. This process will give you some clarity of thought & expose holes or unrealistic assumptions.
4. Start evangelizing internally – Start evangelizing 'intent'. At a bare minimum, target the stakeholders across the product/functional areas benefiting from the first target use cases. Incorporate their feedback. You should now have enough to start thinking about the 'how', i.e. the key functional components of the platform.
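The use case analysis grid above can be captured in a lightweight structure so that the prioritization in step 2 falls out mechanically. The field names and the scoring weights below are hypothetical choices for illustration, not part of the original grid:

```python
from dataclasses import dataclass, field

@dataclass
class UseCase:
    """One row of the use case analysis grid (fields a-e and g above)."""
    name: str                    # a) name & description of the 'what'
    category: str                # b) 'net new' or 'improvement'
    benefit_12mo: int            # c) expected 12-month benefit, scored 1-10
    benefit_long_term: int       # c) long-term benefit, scored 1-10
    data_missing: list = field(default_factory=list)    # d) data gaps
    milestones: list = field(default_factory=list)      # e) 12-month outcomes
    business_units: list = field(default_factory=list)  # g) BUs involved

def prioritize(use_cases):
    """Rank use cases: favor near-term benefit, penalize missing data.
    The 2:1 weighting of near-term over long-term benefit is an assumption."""
    def score(uc):
        return 2 * uc.benefit_12mo + uc.benefit_long_term - len(uc.data_missing)
    return sorted(use_cases, key=score, reverse=True)
```

Scoring this way is deliberately crude; its value is forcing every use case to state its benefits and data gaps in comparable terms before anyone argues about ranking.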
Self-Service Data Platform
The goal should be to build a future-proof platform without boiling the ocean on day one and introducing every possible big data technology early on.
Common Operational Issues
1. Useless pilots – Success criteria so generic that the results don't mean much for a production implementation. These pilots end up being simple training exercises for employees and lead to heavy fudging and influence by vendors and internal political interests. Ironically, the big data pilot gets evaluated by anything but solid criteria founded in data.
2. Visualization & BI tools need data in their own islands – If your reporting & BI use cases start needing data to be extracted out of the central store, you are setting yourself up for long-term disaster.
3. 'Batch' thinking – Many of the early adopters made extensive investments in 'batch processing' in Hadoop. Now they are struggling to evolve those investments to support near real-time stream processing so they can take action at the right time and create a feedback loop in their analytical pipelines. To be fair, this was not a mistake but just a side effect of being the first mover. Now there are better options available than MapReduce.
4. Not making good use of 'professional services' – Professional services (PS) revenue accounts for a large chunk of all Hadoop distributions' revenue. PS teams are essential given the skills gap in this market. But too many organizations struggle after the PS team has left the premises. This is especially true when the charter of the PS team was to help 'get started' with things like cluster set-up, implementing a sample application, etc.
5. Being too cagey about your big data successes – Many organizations overestimate the value of 'secrecy'. While some use cases do warrant secrecy, in many others the value of evangelizing your success externally in the community far outweighs any downsides. Remember, the hard part is being successful in your big data project. You can very safely assume all your competitors are trying many of the same things as you are.
Some best practices
1. Keep proofs of concept (POCs) and pilots separate – POCs are meant for the team to get familiar with the technology. Pilots need to be real: the scope and deliverable need to be such that at the end of the pilot, you have something that you can easily migrate to production.
2. The very first milestone's output should be something that will get used every day in 'production'. This will force you to think of important operational issues early on.
3. Be ready to pay for your pilots – You get what you pay for. It is true that you can get business-hungry big data vendors to do pilots for free. But willingness to pay even just a little bit will put you high up on their priority list. It will also get you their best people and, more importantly, get the vendor to be forthcoming as a true partner in your success, rather than forcing them to constantly be in 'sell mode'.
4. Plan to minimize data movement out of the Hadoop cluster.
5. Think carefully when involving professional services teams – For parts of the platform that are not core to your big data strategy, you might want to permanently outsource operations & maintenance. If you need assistance in building a piece of the solution, be absolutely sure that the outside PS team is pairing with your internal developers so there can be a successful handoff.
Getting Started Plan
1. Infrastructure evaluation – Based on your understanding of the data and the use case roadmap, start charting out the broad storage & compute hardware requirements. Do a gap analysis to figure out what is missing. As part of planning to address the gaps, consider running the platform, or parts of it, in the cloud vs. on-premise.
2. Software functional evaluation – Before getting into the technology, it is important to understand the 'data access' pattern requirements (e.g. search, ad-hoc reporting, fast key-value look-ups, real-time, batch, machine learning, etc.). Model these as 'services' of the broader platform rather than as islands of data. Consider the data ingestion & distribution requirements as part of these functions. This should give you a sense of the 'gaps' in your current environment and also expose all the integrations needed. Based on that, you should move on to a build vs. buy analysis.
3. Platform operations evaluation – This part is often neglected.
4. Skills evaluation – See the last section on 'organizational glue' for more on this.
5. For the production roll-out phase, plan to 'fix a ship in flight'. This would require a period of running your new system in parallel with the old and doing a phased end-of-life. Here is an example of a typical analytics adoption/product integration cycle (Offline = analytics done offline in batch & not directly integrated with core products; Online = analytics done in real time and integrated with enterprise products):

   Analytics Stage        | Short Term | Medium/Long Term
   Descriptive analytics  | Offline    | Online
   Predictive analytics   | Offline    | Online
   Prescriptive analytics | Online     | Online
6. Documentation is important and can't slip low on the priority list (even if the products are internal & not customer-facing).
7. Create an evangelism plan (blogs on the website, industry event talks, internal lunch-n-learns, meet-ups, social media campaigns & webinars).
8. Run it like a 'startup': This will force hard prioritizations & introduce a much-needed sense of urgency without drowning in too many processes. It will enable you to be 'scrappy & resourceful' within the organization. The need to produce quick output, fail fast & iterate will require agile development practices. Above all, it will help attract the right talent!
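The 'services, not islands of data' idea from step 2 of the plan above can be sketched as multiple access patterns exposed over one shared central store. The class names and the in-memory dictionary standing in for the store are purely illustrative assumptions:

```python
from abc import ABC, abstractmethod

class DataService(ABC):
    """An access pattern modeled as a service of the shared platform,
    not an island holding its own extracted copy of the data."""
    @abstractmethod
    def query(self, request: dict): ...

class KeyValueLookup(DataService):
    """Fast key-value look-ups served directly from the central store."""
    def __init__(self, store: dict):
        self._store = store  # shared, not copied

    def query(self, request: dict):
        return self._store.get(request["key"])

class BatchScan(DataService):
    """Batch/ad-hoc scans over the very same central store."""
    def __init__(self, store: dict):
        self._store = store  # shared, not copied

    def query(self, request: dict):
        predicate = request.get("filter", lambda record: True)
        return [v for v in self._store.values() if predicate(v)]
```

Because both services wrap the same store, a write lands once and is immediately visible to every access pattern; no extraction pipeline ever ships data out to a tool-specific island.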
Organizational Glue
The scarcity of big data skills in the market gets a lot of attention. While it is true that 'data scientist' is the sexiest job of the 21st century, even the smartest data scientists and the best technology will not be successful unless you have the organizational glue in place to bring all the pieces together.
Common Problems & Some Best Practices
Skillset shortage & imbalance is the most common problem with big data projects. There is a tendency to hire Hadoop developers and data scientists – the two most in-demand jobs. However, if you look at any big data implementation, it spans various technologies and also needs a heavy operations focus. It is hard, and I would argue unnecessary, to plan to hire a team that can own every piece of it in house. The better approach is to seek out the right development APIs that can enable your existing talent to leverage big data technologies. Outsource the aspects of the solution that are not key differentiators for your business. Open source the pieces of your stack that add value but are not key differentiators for your business. There is a reason why large tech companies like Netflix and Facebook open source so many projects: they want to find community support so they can hire easily from the community and get free bug fixes as more and more developers fix parts of the open source projects they use.

The big data function should be run by a leader who understands the technology and the operational issues well, but at the same time has the caliber to get a firm grasp of the business priorities. This will allow them to gain credibility & respect across all functions of the organization and with customers.
Getting Started Plan
1. Establish the lay of the land of the organization, functionally as well as politically (know where to go for what).
2. Form planning & execution teams that include product, engineering & operations functional liaisons from the big data team as well as from the products/business units that the features on the roadmap impact.
3. Map out the team composition:
   a. Product team
      1. Hire data product managers who can liaise with the various enterprise product components to ensure that the right offline & online integration capabilities are exposed. The main charter is to make the big data platform truly useful across all the different product lines.
      2. Data scientists should be on the product team – experimental work plus models that drive offline & online data platform product capabilities.
      3. Reporting and analytics (including its product management) should roll under the big data team and continue its current responsibilities in the short term.
      4. Shared resource/dotted line – UX, based on how you decide to evolve the user interface pieces of the data platform and products.
   b. Engineering team
      i. A Silicon Valley presence is essential.
      ii. You need a big data architect to design the end-to-end system, one who understands the technical challenges of piecing the various big data tools together.
      iii. Hire based on the platform use case requirements to create a mix of generalist big data engineers & experts in a functional area, e.g. NoSQL or search specialists.
4. Rest of the organization: Should demand self-service from the platform and not rely on the big data team all the time.
5. Leadership – drive a data culture:
   a. Needs to simply ask these questions for EVERY decision: show me the data, its source, the analysis and your confidence.
   b. Avoid FAKING it. (Many people use the outcomes of reports to support preconceived conclusions.)
6. Hiring: Look to universities for fresh talent in the stats & machine learning area & pair them with experienced business analysts.
7. Invest in ongoing cross-training, skill development & retention: Run training courses around data consumption. Evaluate the skillsets of existing team members and their current career goals, evaluate them w.r.t. the requirements of the data platform product roadmap, and make training goals part of performance reviews.
If you have been reading this far and found the content useful, I am glad I could help! If you have your own experiences to share, or you think any of this doesn't make sense, I would love to hear your comments! Thank you!