 
Building a Big Data Analytics Function For Long Term Success

Himanshu Bari - https://www.linkedin.com/in/himanshubari
  
	
  
	
   	
  
The last three years of my career have been in the big data space. It started at the ground zero of the big data revolution at Hortonworks - one of the leading Hadoop distributions. As a product manager, I have had the privilege of working closely with marketing, customer success, pre-sales as well as post-sales teams across industry verticals to make our internal customer champions successful in their quest to formulate & execute their big data strategies. My view has spanned all phases of implementation - from early use case selection to POC & pilot execution, post-pilot production, operationalization & finally evangelism. Having an inside view of the evolution of the big data market has been an extremely rewarding experience and I learnt a lot. While working with big data solution owners, many instances brought back memories of when I was part of the central technology strategy group at Lehman Brothers. There I had the opportunity to build and drive adoption of a homegrown application performance monitoring solution across the entire company, and overcame many of the same organizational & process hurdles faced by today's big data early adopters.
  
	
  
Who is this report for?
The pioneers in the big data space have battle scars and have learnt many of the lessons in this report the hard way. But if you are a general manager just embarking on the big data journey, you should now have what they call the 'second mover advantage'. My hope is that this report helps you better leverage that advantage.
  
	
  
What	
  this	
  report	
  is	
  NOT?	
  
-­‐ This	
  is	
  NOT	
  meant	
  to	
  be	
  a	
  technical	
  recipe	
  book	
  for	
  building	
  big	
  data	
  systems.	
  
There	
  is	
  no	
  shortage	
  of	
  those.	
  Just	
  look	
  through	
  any	
  of	
  the	
  vendor	
  websites.	
  Or	
  
ping	
  me	
  and	
  I	
  would	
  be	
  happy	
  to	
  talk	
  tech!	
  
-­‐ This	
  is	
  not	
  a	
  big	
  data	
  project	
  plan	
  or	
  a	
  budgeting	
  primer.	
  There	
  are	
  too	
  many	
  
organizational	
  and	
  situational	
  specifics	
  needed	
  for	
  creating	
  those.	
  But	
  my	
  hope	
  is	
  
that	
  the	
  content	
  in	
  this	
  report	
  will	
  server	
  as	
  a	
  guiding	
  post	
  &	
  input	
  into	
  those	
  
efforts	
  
	
  
What is the FOCUS of this report?
The goal here is to shed some light on the people & process issues in building a central big data analytics function.
  
	
  
	
  
 
The rest of this report is organized around the four key pillars as shown on the left. For each area, I will discuss:

1. Common problems
2. Some best practices
3. Getting started plan
  
	
  
Getting to 'Complete Data'
The data platform is only as good as the data in it. Most big data projects just assume that there is a lot of data available and that having 'more' data will magically result in better insights. While that is loosely true in some cases (like machine learning), the most successful organizations pay a lot of attention to truly understanding the nature of the data available.
  	
  
	
  
Common Operational Issues in getting to 'Complete Data'
1. Ownership split across teams creating 'silos' – Happens naturally as the various internal products & systems evolved organically or inorganically.
2. Format & quality issues - Inherently introduced by the silos and variety of systems. As a result there are often very different views & interpretations of the same asset.
3. Data has 'inertia' – It cannot be easily moved around.
4. Merging 'event' & transactional data is challenging.
5. Access requirements vary by users, workloads & stages. E.g. reporting & analytics use cases have different data prep needs than data science use cases.
6. Data ingestion and distribution into the platform become the 'sole' responsibility of the big data team. They just become data movement monkeys.
  
	
  
Some best practices
1. Store as much as possible: Data that is not valuable to you right now may still be valuable in the future.
2. Capture at source: Capture at the lowest granularity; store pre-aggregated data at varying granularities.
3. Make data consumption APIs 'flexible': Make it easy to discover, understand & consume the data. Only then will folks who were not doing anything with data be able to play around with it and come up with insights that no-one was thinking of.
4. Build an on-demand data fusion capability (to break the silos): You cannot have all possible fusions stored all the time, and you cannot 'guess' the needed data fusions ahead of time.
5. Self-service for data ingestion: The big data team cannot be the 'gatekeeper' for all data ingestion pipelines.
6. Invest early in metadata, lineage & security: Focus on data quality from day 1. If people lose faith in quality they will go back to the old ways of doing things. Maintain data quality with continuous audits through cross-checks of results.
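To make the 'discover, understand & consume' idea in best practice 3 concrete, here is a minimal sketch in Python. The class and method names are purely illustrative assumptions, not a reference to any specific product's API:

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """Illustrative catalog record backing a consumption API."""
    name: str
    description: str
    schema: dict                      # column name -> type
    tags: list = field(default_factory=list)


class DataCatalog:
    """Toy 'flexible consumption' facade: discover, understand, consume."""

    def __init__(self):
        self._datasets = {}
        self._rows = {}

    def register(self, ds: Dataset, rows):
        self._datasets[ds.name] = ds
        self._rows[ds.name] = rows

    def discover(self, tag: str):
        """Discover: let new users browse datasets by tag."""
        return [d.name for d in self._datasets.values() if tag in d.tags]

    def describe(self, name: str):
        """Understand: expose schema & description, not just raw bytes."""
        d = self._datasets[name]
        return {"description": d.description, "schema": d.schema}

    def consume(self, name: str, where=None):
        """Consume: simple filtered read over the registered rows."""
        rows = self._rows[name]
        return [r for r in rows if where is None or where(r)]


catalog = DataCatalog()
catalog.register(
    Dataset("clicks", "Web click events", {"user": "str", "ts": "int"}, tags=["web"]),
    [{"user": "a", "ts": 1}, {"user": "b", "ts": 2}],
)
```

The point of the sketch is the shape of the interface: anyone can find data by tag, see what it means before reading it, and pull a filtered subset without going through the big data team.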
Getting Started Plan
1. Catalog your current data (metadata management). Start engaging cross-functionally to understand the type/format/meaning of the data.
2. Investigate which data is being thrown away and how you can increase the 'granularity' of data capture.
3. Figure out what it would take to capture data as close to the source systems as possible.
4. Plan retention & security policies from inception.
5. Accounting for steps 1, 2, 3 & 4 above, start estimating how much data you have today & at what rate it will grow.
6. Classify ingestion requirements for bulk, incremental, change & streaming data.
7. Analyze the impact on existing enterprise products and plan data collection integrations to
a. Minimize friction and keep collection processes de-coupled
b. Enable self service
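As an illustration of steps 1, 4, 5 and 6 of the plan, the sketch below shows the kind of record a first-pass metadata catalog could hold. The field names and example values are my own assumptions, not a standard:

```python
from dataclasses import dataclass
from enum import Enum


class IngestionMode(Enum):
    """Step 6: classify ingestion requirements."""
    BULK = "bulk"
    INCREMENTAL = "incremental"
    CHANGE = "change"            # change-data-capture
    STREAMING = "streaming"


@dataclass
class CatalogEntry:
    """One row of a first-pass data catalog (step 1 of the plan)."""
    name: str
    source_system: str
    fmt: str                     # e.g. "csv", "avro", "json"
    owner: str
    granularity: str             # e.g. "per event", "daily aggregate"
    retention_days: int          # step 4: retention policy from inception
    est_size_gb: float           # step 5: sizing input
    ingestion: IngestionMode


entries = [
    CatalogEntry("web_clicks", "cdn_logs", "json", "web-team",
                 "per event", 365, 1200.0, IngestionMode.STREAMING),
    CatalogEntry("orders", "erp", "csv", "finance",
                 "per transaction", 2555, 80.0, IngestionMode.INCREMENTAL),
]

# Step 5: rough sizing from the catalog itself.
total_gb = sum(e.est_size_gb for e in entries)

# Step 6: the catalog also answers ingestion-planning questions.
streaming = [e.name for e in entries if e.ingestion is IngestionMode.STREAMING]
```

Even a spreadsheet with these columns is enough to start; the value is that sizing and ingestion planning fall out of the same record rather than being separate exercises.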
  
	
  
Right Questions – Roadmap Driving The Platform

It is easy to get sucked into the new tech frenzy surrounding the big data market. This is especially true when the buyers are centralized IT teams looking to build the next generation of data processing platforms. But the most successful big data projects always start with the right questions, without getting sucked into analysis paralysis. While this is well-known wisdom, here are some common problems faced in putting it into practice.
  
	
  
Common Operational Issues
1. Extreme approaches: Either extremely narrow business-driven use cases justified under the 'quick wins' bucket, or a boil-the-ocean centralized IT-driven data lake.
2. Useless and unrealistic science projects hiding under 'visionary statements': The economics of storing and processing data at scale have improved so significantly that no problem seems unachievable, so it is often easy to come up with something radical and completely ignore the 'Why NOW?' question. Even the best innovations are useless if they are introduced ahead of their time.
3. Overestimating the benefits - E.g. a common use case you will hear is ETL modernization... if you are looking to do ETL in Hadoop for the wrong reasons you will crash and burn. While it is true that you can do any ETL, or rather ELT, in Hadoop, the benefits are often overestimated.
4. Overemphasis on net 'new problems' - 'Why fix something that ain't broken' is the popular belief. Then there is also that need to 'minimize impact'. This forces many organizations to look for net new problems. These often mean higher risk and a less clear understanding of success. Just because a problem is 'new' doesn't mean it is more important and should take precedence over improving some existing solutions.
  
	
  
	
  
	
  
Some best practices
1. Don't boil the ocean in the first use case, but start with a problem that spans some business silos and forces collaboration of people. This will give you a limited preview of the collaboration, political & technology hurdles that will need to be overcome if you want to create a big data platform that works in the long term.
2. Create the right incentive structure so that the right, and not necessarily the 'sexiest', problems get attention first.
3. Paint the vision but ground the execution - A vision is useless if it starts paying off only when it is 'fully realized'. Even the first milestone needs to have a tangible, or intangible but measurable, benefit.
4. Do it for the right reasons. Keep asking 'so what' until you arrive at a meaningful outcome that will have a direct impact on the business. Going through this process will also help you sell the idea at ALL levels in the organization.
  
	
  
Getting Started Plan
1. Engage cross-functionally to create a simple use case analysis grid that has the following information for every use case:
a) Name & description of the 'what'
b) Category (net new addition or improvement of an existing solution)
c) Overall benefits expected from the use case over the next 12 months and long term
d) Data needed to address the use case (what is available & what is missing)
e) Three milestones (outcomes) to be hit over the next 12 months
f) Measures of success for each milestone
g) Which BUs/product areas will be involved per milestone
2. Prioritize – The exercise above should give you enough raw data to prioritize the use cases.
3. Get to the next level of detail – Pick the top three use cases and start expanding the milestones into first-level requirements. Break the requirements into 'must have' & 'stretch goals' across three phases: 'crawl', 'walk' & 'run'. This process will give you some clarity of thought & expose holes or unrealistic assumptions.
4. Start evangelizing internally – Start evangelizing 'intent'. At a bare minimum, target the stakeholders across the product/functional areas benefiting from the first target use cases. Incorporate their feedback.
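The use case analysis grid described above can be sketched as a simple record type, with prioritization (step 2) as a scoring function over it. The fields mirror items a-g; the scoring weights are entirely my own assumptions, shown only to make the exercise concrete:

```python
from dataclasses import dataclass, field


@dataclass
class UseCase:
    """One row of the use case analysis grid (items a-g above)."""
    name: str                                        # a) name of the 'what'
    category: str                                    # b) "net new" or "improvement"
    benefit_score: int                               # c) expected benefit, 1-5
    data_available: bool                             # d) is the needed data there?
    milestones: list = field(default_factory=list)   # e) next 12 months
    success_measures: list = field(default_factory=list)  # f) per milestone
    business_units: list = field(default_factory=list)    # g) BUs involved


def priority(uc: UseCase) -> int:
    """Step 2: naive prioritization - favor high benefit, available data,
    and improvements over risky net-new problems (weights are assumptions)."""
    score = uc.benefit_score * 2
    score += 2 if uc.data_available else 0
    score += 1 if uc.category == "improvement" else 0
    return score


cases = [
    UseCase("churn-model", "net new", 4, False),
    UseCase("etl-modernization", "improvement", 3, True),
]
ranked = sorted(cases, key=priority, reverse=True)
```

Note how the scoring encodes the earlier warnings: a flashy net-new use case with missing data can rank below a modest improvement whose data already exists.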
  
	
  
You should now have enough to start thinking about the 'How', i.e. the key functional components of the platform.
  	
  
	
  
	
   	
  
Self-Service Data Platform
The goal should be to build a future-proof platform without boiling the ocean on day one and introducing every possible big data technology early on.
  
	
  
Common Operational Issues
1. Useless pilots - Success criteria so generic that the results don't mean much for a production implementation. They end up being simple training exercises for employees and lead to heavy fudging and influence by vendors and internal political interests. And ironically the big data pilot gets evaluated by anything but solid criteria founded in data.
2. Visualization & BI tools need data in their own islands – If your reporting & BI use cases start needing data to be extracted out of the central store then you are setting yourself up for long-term disaster.
3. 'Batch' thinking – Many of the early adopters made extensive investments in 'batch processing' in Hadoop. Now they are struggling to evolve those investments to support near real-time stream processing so they can really take action at the right time and create a feedback loop in their analytical pipelines. To be fair, this was not a mistake but just a side effect of being the first mover. Now there are better options available than 'mapreduce'.
4. Not making good use of 'professional services' – Professional services (PS) revenue accounts for a large chunk of all Hadoop distributions' revenue. They are essential given the skills gap in this market. But too many organizations struggle after the PS team has left the premises. This is especially true when the charter of the PS team was to help 'get started' with things like cluster set-up, implementing a sample application etc.
5. Being too cagey about your big data successes - Many organizations overestimate the value of 'secrecy'. While some use cases do warrant secrecy, in many others the value of evangelizing your success externally in the community far outweighs any downsides. Remember the hard part is being successful in your big data project. You can very safely assume all your competitors are trying many of the same things as you are.
  
	
  
Some best practices
1. Keep Proof of Concepts (POCs) and pilots separate - POCs are meant for the team to get familiar with the technology. Pilots need to be real. The scope and deliverables need to be such that at the end of the pilot you have something that you can easily migrate to production.
2. The very first milestone's output should be something that will get used every day in 'production'. This will force you to think of important operations issues early on.
3. Be ready to pay for your pilots - You get what you pay for. It is true that you can get business-hungry big data vendors to do pilots for free. But willingness to pay just a little bit will put you high up on their priority list. It will also get you their best people and, more importantly, it will get the vendor to be more forthcoming in being a true partner in your success, rather than forcing them to constantly be in 'sell mode'.
4. Plan to minimize data movement out of the Hadoop cluster.
5. Think carefully when involving 'professional services' teams: For parts of the platform that are not core to your big data strategy, you might want to permanently outsource their operations & maintenance. If you need assistance in building a piece of the solution, be absolutely sure that the outside PS team is pairing with your internal developers so there can be a successful handoff.
  
	
  
Getting Started Plan

1. Infrastructure evaluation - Based on the understanding of the data and use case roadmap, start charting out the broad storage & compute hardware requirements. Do a gap analysis to figure out what is missing. As part of planning to address the gaps, consider running the platform or parts of it in the cloud vs. on-premise.
2. Software functional evaluation – Before getting into the technology, it is important to understand the 'data access' pattern requirements here (e.g. search, ad-hoc reporting, fast key-value look-ups, real-time, batch, machine learning etc.). Model these as 'services' of the broader platform rather than as islands of data. Consider the data ingestion & distribution requirements as part of the functions. This should give a sense of the 'gaps' in your current environment and also expose all the integrations needed. Based on that, you should move on to do a build vs. buy analysis.
	
  
3. Platform operations evaluation – This part is often neglected.
  	
  
4. Skills evaluation – See the last section on 'Organizational glue' for more on this.
5. For the production roll-out phase, plan to 'fix a ship in flight'. This will require a period of running your new system in parallel with the old and doing a phased end-of-life.
  
Here is an example of a typical analytics adoption/product integration cycle.
Offline = Analytics done offline in batch & not directly integrated with core products
Online = Analytics done in real-time and integrated with enterprise products

Analytics Stage          Short Term    Medium/Long Term
Descriptive analytics    Offline       Online
Predictive analytics     Offline       Online
Prescriptive analytics   Online        Online
  
	
  
6. Documentation is important and can't slip low on the priority list (even if the products are internal & not customer-facing).
7. Create an evangelism plan (blogs on the website, industry event talks, internal lunch-n-learns, meet-ups, social media campaigns & webinars).
8. Run it like a 'startup': This will force hard prioritizations & introduce a much-needed sense of urgency without drowning in too many processes. It will enable you to be 'scrappy & resourceful' within the organization. The need to produce quick output, fail fast & iterate will require agile development practices. Above all, it will help attract the right talent!
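Step 2 of this plan (modeling access patterns as platform services, then finding gaps) amounts to a simple mapping exercise. A minimal sketch, where every pattern and service name is an illustrative assumption:

```python
# Hypothetical inventory: each required data-access pattern is mapped to
# the platform 'service' expected to serve it (all names illustrative).
required = {
    "ad-hoc reporting": "sql-on-hadoop",
    "fast key-value look-ups": "nosql-store",
    "search": "search-index",
    "batch": "batch-compute",
}

# Services assumed to already exist in the current environment.
current = {"batch-compute", "sql-on-hadoop"}

# Gap analysis: which access patterns have no backing service yet.
# These are the candidates for the build vs. buy analysis.
gaps = sorted(p for p, svc in required.items() if svc not in current)
```

Even at this toy scale, the exercise forces you to name each pattern as a service of one platform rather than letting each tool become its own island of data.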
  	
  
	
  
Organizational Glue
The scarcity of big data skillsets in the market gets a lot of attention. While it is true that 'data scientist' is the sexiest job of the 21st century, even the smartest data scientists and the best technology will not be successful unless you have the organizational glue in place to bring all the pieces together.
  
	
  
Common Problems & Some best practices
Skillset shortage & imbalance is the most common problem with big data projects. There is a tendency to hire Hadoop developers and data scientists – the two most in-demand jobs. However, if you look at any big data implementation, it spans various technologies and also needs a heavy operations focus. It is hard, and I would argue unnecessary, to plan to hire a team that can own every piece of it in house. The better approach is to seek the right development APIs that can enable your existing talent to leverage big data technologies. Outsource the aspects of the solution that are not key differentiators for your business. Open source the pieces of your stack that add value but are not key differentiators for your business. There is a reason why large tech companies like Netflix and Facebook open source so many projects. They want to find community support so they can hire easily from the community and get free bug fixes as more and more developers fix parts of the open source projects they use.
  	
  
The big data function should be run by a leader who understands the technology and the operational issues well, but at the same time has the caliber to get a firm grasp of the business priorities. This will allow them to gain credibility & respect across all functions of the organization and with customers.
	
  
Getting Started Plan
1. Establish the lay of the land (functionally as well as politically) of the organization (know where to go for what).
2. Form planning & execution teams to include product, engineering & operations functional liaisons from the big data team as well as the products/business units that the features on the roadmap impact.
3. Map out team composition
a. Product team
1. Hire data product managers who can liaise with the various enterprise product components to ensure that the right offline & online integration capabilities are exposed. The main charter is to make the big data platform truly useful across all the different product lines.
2. Data scientists should be on the product team - experimental work + models that drive offline & online data platform product capabilities.
3. Reporting and analytics - Rolled under the big data team, continuing current responsibilities in the short term.
4. Shared resource/dotted line - UX - based on how you decide to evolve the user interface pieces of the data platform and products.
b. Engineering team
i. Silicon Valley presence is essential.
ii. Need a big data architect to design the end-to-end system, who understands the technical challenges in piecing the various big data tools together.
iii. Hire based on platform use case requirements to create a mix of generalist big data engineers & experts in a functional area, e.g. NoSQL or search specialists.
4. Rest of the organization: Should demand self-service from the platform and not rely on the big data team all the time.
5. Leadership – drive a data culture
a. Needs to simply ask these questions for EVERY decision - Show me the data, its source, the analysis and your confidence.
b. Avoid FAKING it. (Many people use the outcomes of reports to support preconceived conclusions.)
6. Hiring: Look at universities for fresh talent in the stats & machine learning area & pair them with experienced business analysts.
7. Invest in ongoing cross-training, skill development & retention: Run training courses around data consumption. Evaluate the skillsets of existing team members and their current career goals. Evaluate them w.r.t. the requirements of the data platform product roadmap. Make training goals part of performance reviews.
  
	
  
	
  
	
  
If you have been reading so far and found the content useful, I am glad I could help! If you have your own experiences to share, or you think any of this doesn't make sense, I would love to hear your comments!

Thank You!
  
	
  

Big Data Platform Operational Strategy

  • 1. Building a Big Data Analytics Function For Long Term Success. Himanshu Bari - https://www.linkedin.com/in/himanshubari
  • 2. The last three years of my career have been in the big data space. It started at the ground zero of the big data revolution at Hortonworks, one of the leading Hadoop distributions. As a product manager, I have had the privilege of working closely with marketing, customer success, pre-sales and post-sales teams across industry verticals to make our internal customer champions successful in their quest to formulate & execute their big data strategies. My view has spanned all phases of implementation, from early use case selection to POC & pilot execution, post-pilot production, operationalization & finally evangelism. Having an inside view of the evolution of the big data market has been an extremely rewarding experience and I learned a lot. Working with big data solution owners often brought back memories of when I was part of the central technology strategy group at Lehman Brothers. There I had the opportunity to build and drive adoption of a home-grown application performance monitoring solution across the entire company, and I overcame some of the same organizational & process hurdles faced by the big data early adopters.

Who is this report for?
The pioneers in the big data space have battle scars and have learned many of the lessons in this report the hard way. But if you are a general manager just embarking on the big data journey, you should now have what they call the 'second mover advantage'. My hope is that this report helps you better leverage your second mover advantage.

What this report is NOT?
- This is NOT meant to be a technical recipe book for building big data systems. There is no shortage of those; just look through any of the vendor websites, or ping me and I would be happy to talk tech!
- This is NOT a big data project plan or a budgeting primer. There are too many organizational and situational specifics needed to create those. But my hope is that the content in this report will serve as a guide post & input into those efforts.

What is the FOCUS of this report?
The goal here is to shed some light on the people & process issues in building a central big data analytics function.
  • 3. The rest of this report is organized around the four key pillars as shown on the left. For each area, I will discuss:
1. Common problems
2. Some best practices
3. A getting started plan

Getting to 'Complete Data'
The data platform is only as good as the data in it. Most big data projects just assume that there is a lot of data available and that having 'more' data will magically result in better insights. While that is loosely true in some cases (like machine learning), the most successful organizations pay a lot of attention to truly understanding the nature of the data available.

Common Operational Issues in getting to 'Complete Data'
1. Ownership split across teams, creating 'silos'. This happens naturally as the various internal products & systems evolve organically or inorganically.
2. Format & quality issues, inherently introduced by the silos and variety of systems. As a result there are often very different views & interpretations of the same asset.
3. Data has 'inertia' and cannot be easily moved around.
4. Merging 'event' & transactional data is challenging.
5. Access requirements vary by users, workloads & stages. E.g. reporting & analytics use cases have different data prep needs than data science use cases.
6. Data ingestion and distribution into the platform become the 'sole' responsibility of the big data team. They just become data movement monkeys.

Some best practices
1. Store as much as possible: not valuable to you right now doesn't mean it won't be valuable in the future.
2. Capture at source, at the lowest granularity. Store pre-aggregated data at varying granularity.
3. Make data consumption APIs 'flexible': make data easy to discover, understand & consume. Only then will folks who were not doing anything with data be able to play around with it and come up with insights that no one was thinking of.
4. Build an on-demand data fusion capability (to break the silos). You cannot have all possible fusions stored all the time, and you cannot 'guess' the data fusions ahead of time.
  • 4. 5. Self-service for data ingestion: the big data team cannot be the 'gatekeeper' for all data ingestion pipelines.
6. Invest early in metadata, lineage & security. Focus on data quality from day 1; if people lose faith in quality they will go back to the old ways of doing things. Maintain data quality through continuous audits & cross-checks of results.

Getting Started Plan
1. Catalog your current data (metadata management). Start engaging cross-functionally to understand the type/format/meaning of the data.
2. Investigate which data is being thrown away and how you can increase the 'granularity' of data capture.
3. Figure out what it would take to capture data as close to the source systems as possible.
4. Plan retention & security policies from the inception.
5. Accounting for steps 1-4 above, start estimating how much data you have today & at what rate it will grow.
6. Classify ingestion requirements for bulk, incremental, change & streaming data.
7. Analyze the impact on existing enterprise products and plan data collection integrations to
   a. Minimize friction and keep collection processes de-coupled
   b. Enable self-service

Right Questions – Roadmap Driving The Platform
It is easy to get sucked into the new tech frenzy surrounding the big data market. This is especially true when the buyers are centralized IT teams looking to build the next generation data processing platforms. But the most successful big data projects always start with the right questions, without getting sucked into analysis paralysis. While this is well-known wisdom, here are some common problems faced in putting it into practice.

Common Operational Issues
1.
Extreme approaches: either extremely narrow, business-driven use cases justified under the 'quick wins' bucket, or a boil-the-ocean, centralized-IT-driven data lake.
2. Useless and unrealistic science projects hiding under 'visionary statements'. The economics of storing and processing data at scale have improved so significantly that no problem seems unachievable, so it is often easy to come up with something radical and completely ignore the 'Why NOW?' question. Even the best innovations are useless if they are introduced ahead of their time.
3. Overestimating the benefits. E.g. a common use case you will hear is ETL modernization; if you are looking to do ETL in Hadoop for the wrong reasons
  • 5. you  will  crash  and  burn.  While  there  is  truth  to  the  fact  that  you  can  do  any  ETL   or  rather  ELT  in  Hadoop,     4. Overemphasis  on  net  'new  problems’  -­‐  ‘Why  fix  something  that  ain’t  broken’  is   the  popular  belief.  Then  there  is  also  that  need  to  ‘minimize  impact’  This  forces   many  organizations  to  look  for  net  new  problems.  These  often  mean  higher  risk   and  less  clear  understanding  of  success.  Just  because  a  problem  is  ‘new’  doesn’t   mean   it   is   more   important   and   should   take   precedence   over   improving   some   existing  solutions         Some  best  practices   1. Don't   boil   the   ocean   in   the   first   use   case   but   start   with   a   problem   that   spans   across  some  business  silos  and  forces  collaboration  of  people.  This  will  give  you   a   limited   preview   of   the  collaboration,   political   &  technology   hurdles   that   will   need  to  be  overcome  if  you  want  to  create  a  big  data  platform  that  works  in  the   long  term.   2. Create   the   right   incentive   structure   for   so   the   right   and   not   necessarily   the   ‘sexiest’  problems  get  attention  first.   3. Paint  the  vision  but  ground  the  execution-­‐  A  vision  is  useless  if  it  starts  paying   off   only   when   it   is   'fully   realized'.   Even   the   first   milestone   needs   to   have   a   tangible  or  intangible  but  measurable  benefit.     4. Do  it  for  the  right  reasons.  Keep  asking  'so  what'  until  you  arrive  at  a  meaningful   outcome   that   will   have   a   direct   impact   on   the   business.   Going   through   this   process  will  also  help  you  sell  the  idea  at  ALL  levels  in  the  organization     Getting  Started  Plan   1. 
Engage  cross  functionally  to  create  a  simple  use  case  analysis  grid  that  has  the   following  information  for  every  use  case   a) Name  &  description  of  ‘what’     b) Category  (net  new  addition  or  improvement  of  existing  solution)   c) Overall   benefits   expected   from   the   use   over   the   next   12   months   and   long   term   d) Data  needed  to  address  the  use  case  (What  is  available  &  what  is  missing)   e) Three  milestones  (outcomes)  to  be  hit  over  the  next  12  months   f) Measures  of  success  of  each  milestone   g) Which  BUs/product  areas  will  be  involved  per  milestone   2.  Prioritize  –  The  exercise  above  should  give  you  enough  raw  data  to  prioritize  the   use  cases.   3.  Get  to  the  next  level  of  detail  –  Pick  the  top  three  use  cases  and  start  expanding   the   milestones   into   first   level   requirements.   Break   the   requirements   into   ‘Must   have’  &  ‘stretch  goals’  in  three  phases  ‘crawl’,  ‘walk’  &  ‘run’.  This  process  will  give   you  some  clarity  of  thought  &  expose  holes  or  unrealistic  assumptions  
  • 6. 4.   Start   evangelizing   internally   –   Start   evangelizing   ‘intent’.   At   a   bare   minimum,   target  the  stakeholders  across  the  product/functional  areas  benefiting  from  the  first   target  use  cases.  Incorporate  their  feedback.     You  should  now  have  enough  to  start  thinking  about  the  ‘How’  i.e  the  key  functional   components  of  the  platform.          
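The use case analysis grid and the prioritization step above can live in a spreadsheet, but even a tiny script makes the ranking repeatable as the grid grows. Here is a minimal sketch; the field names, the 1-5 scoring scale and the weights in priority() are illustrative assumptions on my part, not prescriptions from the plan itself:

```python
from dataclasses import dataclass, field

@dataclass
class UseCase:
    # Columns of the use case analysis grid (names are illustrative)
    name: str
    category: str                 # "net-new" or "improvement"
    benefit_12mo: int             # expected 12-month benefit, scored 1 (low) to 5 (high)
    data_gap: int                 # missing data, scored 1 (little missing) to 5 (most missing)
    milestones: list = field(default_factory=list)
    business_units: list = field(default_factory=list)

def priority(uc: UseCase) -> float:
    """Toy score: reward near-term benefit and broad BU involvement, penalize missing data."""
    return uc.benefit_12mo - 0.5 * uc.data_gap + 0.25 * len(uc.business_units)

grid = [
    UseCase("ETL modernization", "improvement", benefit_12mo=4, data_gap=1,
            milestones=["mirror one feed", "parallel run", "cut over"],
            business_units=["finance", "ops"]),
    UseCase("Real-time recommendations", "net-new", benefit_12mo=5, data_gap=4,
            milestones=["collect clickstream"],
            business_units=["web"]),
]

# Rank the grid: the highest-scoring use case becomes the first pilot candidate
for uc in sorted(grid, key=priority, reverse=True):
    print(f"{uc.name}: {priority(uc):.2f}")
```

The weights simply encode the advice above: favor measurable benefit in the next 12 months, discount use cases whose data is still missing, and give a small boost to problems that span business units and force collaboration.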
Self-Service Data Platform

The goal should be to build a future-proof platform without boiling the ocean on day one and introducing every possible big data technology early on.

Common Operational Issues
1. Useless pilots: success criteria so generic that the results don't mean much for a production implementation. These pilots end up being simple training exercises for employees and invite heavy fudging and influence by vendors and internal political interests. Ironically, the big data pilot then gets evaluated by anything but solid criteria founded in data.
2. Visualization & BI tools that need data in their own islands: if your reporting & BI use cases start needing data to be extracted out of the central store, you are setting yourself up for long-term disaster.
3. 'Batch' thinking: many of the early adopters made extensive investments in 'batch processing' in Hadoop. Now they are struggling to evolve those investments to support near-real-time stream processing so they can take action at the right time and create a feedback loop in their analytical pipelines. To be fair, this was not a mistake but a side effect of being the first mover; there are now better options available than MapReduce.
4. Not making good use of professional services: professional services (PS) revenue accounts for a large chunk of every Hadoop distribution's revenue, and PS teams are essential given the skills gap in this market. But too many organizations struggle after the PS team has left the premises. This is especially true when the charter of the PS team was to help 'get started' with things like cluster set-up or implementing a sample application.
5. Being too cagey about your big data successes: many organizations overestimate the value of 'secrecy'. While some use cases do warrant secrecy, in many others the value of evangelizing your success externally in the community far outweighs any downsides. Remember, the hard part is being successful in your big data project; you can safely assume all your competitors are trying many of the same things you are.

Some best practices
1. Keep proofs of concept (POCs) and pilots separate: POCs are meant for the team to get familiar with the technology. Pilots need to be real. The scope and deliverables need to be such that at the end of the pilot you have something you can easily migrate to production.
2. The very first milestone's output should be something that gets used every day in 'production'. This will force you to think of important operational issues early on.
3. Be ready to pay for your pilots; you get what you pay for. It is true that you can get business-hungry big data vendors to do pilots for free, but willingness to pay even a little will put you high up their priority list. It will also get you their best people, and more importantly, it frees the vendor to be a true partner in your success instead of being constantly in 'sell mode'.
4. Plan to minimize data movement out of the Hadoop cluster.
5. Think carefully when involving professional services teams. For parts of the platform that are not core to your big data strategy, you might want to permanently outsource operations & maintenance. If you need assistance building a piece of the solution, be absolutely sure that the outside PS team pairs with your internal developers so there can be a successful handoff.

Getting Started Plan
1. Infrastructure evaluation: based on your understanding of the data and the use case roadmap, start charting out the broad storage & compute hardware requirements. Do a gap analysis to figure out what is missing. As part of planning to address the gaps, consider running the platform, or parts of it, in the cloud vs. on-premise.
2. Software functional evaluation: before getting into the technology, it is important to understand the 'data access' pattern requirements (e.g. search, ad-hoc reporting, fast key-value look-ups, real-time, batch, machine learning). Model these as 'services' of the broader platform rather than as islands of data. Consider the data ingestion & distribution requirements as part of these functions. This should give you a sense of the 'gaps' in your current environment and expose all the integrations needed. Based on that, move on to a build-vs-buy analysis.
3. Platform operations evaluation: this part is often neglected.
4. Skills evaluation: see the last section on 'Organizational Glue' for more on this.
5. For the production roll-out phase, plan to 'fix a ship in flight'. This will require a period of running your new system in parallel with the old and doing a phased end-of-life.

   Here is an example of a typical analytics adoption/product integration cycle.
   Offline = analytics done offline in batch & not directly integrated with core products
   Online = analytics done in real time and integrated with enterprise products

   Analytics Stage          Short Term    Medium/Long Term
   Descriptive analytics    Offline       Online
   Predictive analytics     Offline       Online
   Prescriptive analytics   Online        Online

6. Documentation is important and can't slip low in the priority list (even if the products are internal & not customer-facing).
7. Create an evangelism plan (blogs on the website, industry event talks, internal lunch-n-learns, meet-ups, social media campaigns & webinars).
8. Run it like a 'startup': this will force hard prioritizations & introduce a much-needed sense of urgency without drowning in too many processes. It will enable you to be 'scrappy & resourceful' within the organization. The need to produce quick output, fail fast & iterate will require agile development practices. Above all, it will help attract the right talent!

Organizational Glue

The scarcity of big data skills in the market gets a lot of attention. While it is true that 'data scientist' is the sexiest job of the 21st century, even the smartest data scientists and the best technology will not be successful unless you have the organizational glue in place to bring all the pieces together.

Common Problems & Some Best Practices

Skillset shortage & imbalance is the most common problem with big data projects. There is a tendency to hire Hadoop developers and data scientists, the two most in-demand jobs. However, any big data implementation spans various technologies and also needs a heavy operations focus. It is hard, and I would argue unnecessary, to hire a team that can own every piece of it in-house. The better approach is to seek the right development APIs that enable your existing talent to leverage big data technologies. Outsource the aspects of the solution that are not key differentiators for your business. Open-source the pieces of your stack that add value but are not key differentiators for your business. There is a reason why large tech companies like Netflix and Facebook open source so many projects.
They want to build community support so they can hire easily from the community and get free bug fixes as more and more developers fix parts of the open source projects they use.

The big data function should be run by a leader who understands the technology and the operational issues well, but at the same time has the caliber to get a firm grasp of the business priorities. This will allow them to gain credibility & respect across all functions of the organization, and with customers.

Getting Started Plan
1. Establish the lay of the land, functionally as well as politically, of the organization (know where to go for what).
2. Form planning & execution teams that include product, engineering & operations functional liaisons from the big data team as well as from the products/business units that the features on the roadmap impact.
3. Map out team composition:
   a. Product team
      1. Hire data product managers who can liaise with the various enterprise product components to ensure that the right offline & online integration capabilities are exposed. The main charter is to make the big data platform truly useful across all the different product lines.
      2. Data scientists should be on the product team: experimental work plus models that drive offline & online data platform product capabilities.
      3. Reporting & analytics product management: rolled under the big data team, continuing current responsibilities in the short term.
      4. Shared resource/dotted line, UX: depends on how the user interface pieces of the data platform and products evolve.
   b. Engineering team
      i. A Silicon Valley presence is essential.
      ii. Need a big data architect to design the end-to-end system, one who understands the technical challenges in piecing the various big data tools together.
      iii. Hire based on platform use case requirements to create a mix of generalist big data engineers & experts in a functional area, e.g. NoSQL or search specialists.
4. Rest of the organization: should demand self-service from the platform and not rely on the big data team all the time.
5. Leadership: drive a data culture.
   a. Simply ask these questions for EVERY decision: show me the data, its source, the analysis and your confidence.
   b. Avoid FAKING it (many people use the outcomes of reports to support preconceived conclusions).
6. Hiring: look to universities for fresh talent in the stats & machine learning area & pair them with experienced business analysts.
7. Invest in ongoing cross-training, skill development & retention: run training courses around data consumption; evaluate the skillsets of existing team members and their current career goals against the requirements of the data platform product roadmap; make training goals part of performance reviews.

If you have been reading this far and found the content useful, I am glad I could help! If you have your own experiences to share, or you think any of this doesn't make sense, I would love to hear your comments!

Thank You!