© 2010 – 2015 Cloudera, Inc. All Rights Reserved

Introduction to Apache Hadoop and its Ecosystem

Mark Grover | Intro to Cloud Computing, Carnegie Mellon SV

github.com/markgrover/hadoop-intro-fast

© Copyright 2010-2014 Cloudera, Inc. All rights reserved.
  
About Me

•  Committer on Apache Bigtop, committer and PPMC member on Apache Sentry (incubating).
•  Contributor to Apache Hadoop, Hive, Spark, Sqoop, Flume.
•  Software developer at Cloudera
•  @mark_grover
•  www.linkedin.com/in/grovermark
  
Co-author O’Reilly book

•  @hadooparchbook
•  hadooparchitecturebook.com
•  To be released early 2015
  
About the Presentation…

•  What’s ahead
    •  Fundamental Concepts
    •  HDFS: The Hadoop Distributed File System
    •  Data Processing with MapReduce
    •  Demo
    •  Conclusion + Q&A
  
Fundamental Concepts

Why the World Needs Hadoop
  
What’s the craze about Hadoop?

•  Volume
    •  More and more data being generated
    •  Machine generated data increasing
•  Velocity
    •  Data coming in at higher speed
•  Variety
    •  Audio, video, images, log files, web pages, social network connections, etc.
  
We Need a System that Scales

•  Too much data for traditional tools
•  Two key problems
    •  How to reliably store this data at a reasonable cost
    •  How do we process all the data we’ve stored
  
What is Apache Hadoop?

•  Scalable data storage and processing
    •  Distributed and fault-tolerant
    •  Runs on standard hardware
•  Two main components
    •  Storage: Hadoop Distributed File System (HDFS)
    •  Processing: MapReduce
•  Hadoop clusters are composed of computers called nodes
    •  Clusters range from a single node up to several thousand nodes
  
How Did Apache Hadoop Originate?

•  Heavily influenced by Google’s architecture
    •  Notably, the Google Filesystem and MapReduce papers
•  Other Web companies quickly saw the benefits
    •  Early adoption by Yahoo, Facebook and others
  
[Timeline, 2002 to 2006]
2002: Nutch spun off from Lucene
2003: Google publishes GFS paper
2004: Google publishes MapReduce paper
2005: Nutch rewritten for MapReduce
2006: Hadoop becomes Lucene subproject
  
Comparing Hadoop to Other Systems

•  Monolithic systems don’t scale
•  Modern high-performance computing systems are distributed
    •  They spread computations across many machines in parallel
    •  Widely used for scientific applications
    •  Let’s examine how a typical HPC system works
  
Architecture of a Typical HPC System

[Diagram: a Storage System and Compute Nodes connected by a Fast Network]
Step 1: Copy input data
Step 2: Process the data
Step 3: Copy output data
  
You Don’t Just Need Speed…

•  The problem is that we have way more data than code

$ du -ks code/
1,087
$ du -ks data/
854,632,947,314
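
To make the imbalance concrete, here is a rough back-of-the-envelope sketch in Python 3. It reads the two du -ks figures above as kilobyte counts (1,024 bytes each) and asks how long each would take to cross a single network link. The 10 Gbit/s link speed and all names in the code are assumptions for illustration, not figures from the talk.

```python
# Sizes reported by `du -ks` on the slide, in kilobytes (1 KB = 1024 bytes).
CODE_KB = 1087
DATA_KB = 854632947314

# Assumption for illustration only: one 10 Gbit/s link, fully utilized.
LINK_BITS_PER_SECOND = 10 * 10**9


def transfer_seconds(size_kb, bits_per_second=LINK_BITS_PER_SECOND):
    """Return the seconds needed to push size_kb kilobytes over the link."""
    size_bits = size_kb * 1024 * 8
    return size_bits / bits_per_second


code_seconds = transfer_seconds(CODE_KB)
data_seconds = transfer_seconds(DATA_KB)
data_days = data_seconds / 86400

print("code: %.6f seconds" % code_seconds)
print("data: %.1f days" % data_days)
print("data is about %d times larger than code" % (DATA_KB // CODE_KB))
```

Under these assumptions the code crosses the link in under a millisecond, while the data needs roughly eight days, so moving the code is by far the cheaper option.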
  
You Need Speed At Scale

[Diagram: Storage System and Compute Nodes, with the connection between them labeled Bottleneck]
  
Hadoop Design Fundamental: Data Locality

•  This is a hallmark of Hadoop’s design
    •  Don’t bring the data to the computation
    •  Bring the computation to the data
•  Hadoop uses the same machines for storage and processing
    •  Significantly reduces need to transfer data across network
  
Other Hadoop Design Fundamentals

•  Machine failure is unavoidable – embrace it
    •  Build reliability into the system
•  “More” is usually better than “faster”
    •  Throughput matters more than latency
  
The Hadoop Distributed Filesystem

HDFS
  
HDFS: Hadoop Distributed File System

•  Inspired by the Google File System
•  Reliable, low-cost storage for massive amounts of data
•  Similar to a UNIX filesystem in some ways
    •  Hierarchical
    •  UNIX-style paths (e.g., /sales/alice.txt)
    •  UNIX-style file ownership and permissions
  
HDFS: Hadoop Distributed File System

•  There are also some major deviations from UNIX filesystems
    •  Highly optimized for processing data with MapReduce
    •  Designed for sequential access to large files
    •  Cannot modify file content once written
    •  It’s actually a user-space Java process
    •  Accessed using special commands or APIs
    •  No concept of a current working directory
  
Copying Local Data To and From HDFS

•  Remember that HDFS is distinct from your local filesystem
    •  hadoop fs -put copies local files to HDFS
    •  hadoop fs -get fetches a local copy of a file from HDFS

[Diagram: a Client Machine and a Hadoop Cluster]
$ hadoop fs -put sales.txt /reports
$ hadoop fs -get /reports/sales.txt
  
HDFS Demo

•  I will now demonstrate the following
    1.  How to list the contents of a directory
    2.  How to create a directory in HDFS
    3.  How to copy a local file to HDFS
    4.  How to display the contents of a file in HDFS
    5.  How to remove a file from HDFS
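
For readers without the live demo, the five steps correspond to hadoop fs subcommands. The Python sketch below only builds the command lines and prints them; running them for real requires a machine with the hadoop client installed. The directory /reports and the file sales.txt are borrowed from the previous slide, and the helper names hdfs_command and run_demo are made up for this example.

```python
import subprocess


def hdfs_command(*args):
    """Build the argument list for one `hadoop fs` invocation."""
    return ["hadoop", "fs"] + list(args)


demo_steps = [
    hdfs_command("-ls", "/"),                       # 1. list a directory
    hdfs_command("-mkdir", "/reports"),             # 2. create a directory
    hdfs_command("-put", "sales.txt", "/reports"),  # 3. copy a local file to HDFS
    hdfs_command("-cat", "/reports/sales.txt"),     # 4. display a file's contents
    hdfs_command("-rm", "/reports/sales.txt"),      # 5. remove a file
]


def run_demo(steps, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for step in steps:
        print("$ " + " ".join(step))
        if not dry_run:
            subprocess.check_call(step)


run_demo(demo_steps)
```

Calling run_demo(demo_steps, dry_run=False) would execute the commands in order, stopping at the first one that fails.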
  
A Scalable Data Processing Framework

Data Processing with MapReduce
  
What is MapReduce?

•  MapReduce is a programming model
    •  It’s a way of processing data
    •  You can implement MapReduce in any language
  
Understanding Map and Reduce

•  You supply two functions to process data: Map and Reduce
    •  Map: typically used to transform, parse, or filter data
    •  Reduce: typically used to summarize results
•  The Map function always runs first
    •  The Reduce function runs afterwards, but is optional
•  Each piece is simple, but can be powerful when combined
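
As a plain-Python illustration of the two roles (this is not Hadoop code), the sketch below uses a map step that parses and filters records and a reduce step that summarizes them. The sample records and the function names map_record and reduce_values are invented for this example.

```python
def map_record(record):
    """Map: parse one 'name,amount' record; drop records that do not parse."""
    name, _, amount = record.partition(",")
    if amount.strip().isdigit():
        yield name.strip(), int(amount)


def reduce_values(key, values):
    """Reduce: summarize all values seen for one key."""
    return key, sum(values)


records = ["alice,10", "bob,5", "garbage", "alice,7"]

# The Map function always runs first, one record at a time.
mapped = [pair for record in records for pair in map_record(record)]

# Group values by key, then run Reduce once per key.
grouped = {}
for key, value in mapped:
    grouped.setdefault(key, []).append(value)
totals = [reduce_values(key, values) for key, values in sorted(grouped.items())]

print(mapped)
print(totals)
```

Stopping after the mapped list is built corresponds to a map-only job, which is what "Reduce is optional" means in practice.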
  
MapReduce Benefits

•  Scalability
    •  Hadoop divides the processing job into individual tasks
    •  Tasks execute in parallel (independently) across cluster
•  Simplicity
    •  Processes one record at a time
•  Ease of use
    •  Hadoop provides job scheduling and other infrastructure
    •  Far simpler for developers than typical distributed computing
  
MapReduce in Hadoop

•  MapReduce processing in Hadoop is batch-oriented
•  A MapReduce job is broken down into smaller tasks
    •  Tasks run concurrently
    •  Each processes a small amount of overall input
•  MapReduce code for Hadoop is usually written in Java
    •  This uses Hadoop’s API directly
•  You can do basic MapReduce in other languages
    •  Using the Hadoop Streaming wrapper program
    •  Some advanced features require Java code
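
The idea of breaking a job into small, concurrent tasks can be imitated on one machine. This sketch is only an analogy: it splits a list of lines into chunks and counts the words in each chunk with a thread pool, whereas Hadoop schedules tasks across cluster nodes. The chunk size, the sample lines, and the names split_into_chunks and count_words are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(lines, chunk_size):
    """Divide the overall input into small pieces, one per task."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def count_words(chunk):
    """One task: process only its own chunk of the input."""
    return sum(len(line.split()) for line in chunk)


lines = ["a b c", "d e", "f", "g h i j", "k"]
chunks = split_into_chunks(lines, chunk_size=2)

# Tasks run concurrently; each sees a small part of the overall input.
with ThreadPoolExecutor(max_workers=3) as pool:
    per_task_counts = list(pool.map(count_words, chunks))

print(per_task_counts)
print(sum(per_task_counts))
```

Because no task depends on another task's chunk, the tasks can run in any order or in parallel, and the per-task results are combined afterwards.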
  
MapReduce Example in Python

•  The following example uses Python
    •  Via Hadoop Streaming
•  It processes log files and summarizes events by type
    •  I’ll explain both the data flow and the code
  
Job Input

•  Here’s the job input

2013-06-29 22:16:49.391 CDT INFO "This can wait"
2013-06-29 22:16:52.143 CDT INFO "Blah blah blah"
2013-06-29 22:16:54.276 CDT WARN "This seems bad"
2013-06-29 22:16:57.471 CDT INFO "More blather"
2013-06-29 22:17:01.290 CDT WARN "Not looking good"
2013-06-29 22:17:03.812 CDT INFO "Fairly unimportant"
2013-06-29 22:17:05.362 CDT ERROR "Out of memory!"

•  Each map task gets a chunk of this data to process
    •  Typically corresponds to a single block in HDFS
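
Because a map task typically handles one HDFS block, the number of map tasks for a file can be estimated by dividing the file size by the block size. The sketch below does that arithmetic. The 128 MB block size is an assumption (block size is configurable per cluster), and the 1 GB file is a made-up example.

```python
import math

# Assumption: 128 MB blocks; the real value depends on cluster configuration.
BLOCK_SIZE_BYTES = 128 * 1024 * 1024


def estimate_map_tasks(file_size_bytes, block_size_bytes=BLOCK_SIZE_BYTES):
    """Estimate map tasks as one per block, counting a partial last block."""
    return int(math.ceil(file_size_bytes / float(block_size_bytes)))


one_gigabyte = 1024 * 1024 * 1024
tasks_exact = estimate_map_tasks(one_gigabyte)         # file fills whole blocks
tasks_spill = estimate_map_tasks(one_gigabyte + 1)     # one extra byte spills over

print(tasks_exact)
print(tasks_spill)
```

This is only an estimate: the actual number of map tasks is decided by the job's input format, which may split or combine input differently.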
  
Python Code for Map Function

#!/usr/bin/env python
import sys

# Define list of known log levels
levels = ['TRACE', 'DEBUG', 'INFO',
          'WARN', 'ERROR', 'FATAL']

# Read records from standard input.
for line in sys.stdin:
    # Use whitespace to split into fields.
    fields = line.split()
    # Extract "level" field and convert to uppercase for consistency.
    level = fields[3].upper()
    # If it matches a known level, print it, a tab separator, and the
    # literal value 1 (since the level can only occur once per line)
    if level in levels:
        print("%s\t1" % level)
  
Output of Map Function

•  The map function produces key/value pairs as output
  
INFO 1
INFO 1
WARN 1
INFO 1
WARN 1
INFO 1
ERROR 1
  
The “Shuffle and Sort”

•  Hadoop automatically merges, sorts, and groups map output
    •  The result is passed as input to the reduce function
    •  More on this later…

Map Output     (Shuffle and Sort)     Reduce Input
INFO 1                                ERROR 1
INFO 1                                INFO 1
WARN 1                                INFO 1
INFO 1                                INFO 1
WARN 1                                INFO 1
INFO 1                                WARN 1
ERROR 1                               WARN 1
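
The effect of this step can be imitated in a few lines of Python. The sketch below is a single-process stand-in, assuming the map output shown on the slide: it sorts the pairs by key and groups the values that share a key. A real shuffle also moves data between cluster nodes, which is not modeled here.

```python
from itertools import groupby
from operator import itemgetter

# Map output from the slide, as (key, value) pairs in arrival order.
map_output = [
    ("INFO", 1), ("INFO", 1), ("WARN", 1), ("INFO", 1),
    ("WARN", 1), ("INFO", 1), ("ERROR", 1),
]

# Sort by key so identical keys become adjacent.
sorted_pairs = sorted(map_output, key=itemgetter(0))

# Group adjacent pairs that share a key.
reduce_input = [
    (key, [value for _, value in group])
    for key, group in groupby(sorted_pairs, key=itemgetter(0))
]

for key, values in reduce_input:
    print(key, values)
```

Sorting first matters because groupby only merges neighbouring items; the reduce code later in the deck relies on the same property when it compares each key with the previous one.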
  
Input to Reduce Function

•  Reduce function receives a key and all values for that key
•  Keys are always passed to reducers in sorted order
    •  Although not obvious here, values are unordered
  
ERROR 1
INFO 1
INFO 1
INFO 1
INFO 1
WARN 1
WARN 1
  
Python Code for Reduce Function

#!/usr/bin/env python
import sys

# Initialize loop variables
previous_key = None
total = 0

for line in sys.stdin:
    # Extract the key and value passed via standard input
    key, value = line.split()
    if key == previous_key:
        # If key unchanged, increment the count
        total = total + int(value)
    else:
        # If key changed, print data for old level
        if previous_key:
            print('%s\t%i' % (previous_key, total))
        # Start tracking data for the new record
        previous_key = key
        total = int(value)

# Print data for the final key
if previous_key:
    print('%s\t%i' % (previous_key, total))
  
Output of Reduce Function

•  Its output is a sum for each level
  
ERROR 1
INFO 4
WARN 2
  
Recap of Data Flow

Map input
2013-06-29 22:16:49.391 CDT INFO "This can wait"
2013-06-29 22:16:52.143 CDT INFO "Blah blah blah"
2013-06-29 22:16:54.276 CDT WARN "This seems bad"
2013-06-29 22:16:57.471 CDT INFO "More blather"
2013-06-29 22:17:01.290 CDT WARN "Not looking good"
2013-06-29 22:17:03.812 CDT INFO "Fairly unimportant"
2013-06-29 22:17:05.362 CDT ERROR "Out of memory!"

Map output
INFO 1
INFO 1
WARN 1
INFO 1
WARN 1
INFO 1
ERROR 1

Shuffle and sort

Reduce input
ERROR 1
INFO 1
INFO 1
INFO 1
INFO 1
WARN 1
WARN 1

Reduce output
ERROR 1
INFO 4
WARN 2
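
The whole flow can be reproduced without a cluster. The sketch below is a local simulation, not the deck's Streaming scripts: map_lines and reduce_lines are hypothetical function versions of the map and reduce logic, and Python's sorted stands in for Hadoop's shuffle and sort.

```python
LOG_LINES = [
    '2013-06-29 22:16:49.391 CDT INFO "This can wait"',
    '2013-06-29 22:16:52.143 CDT INFO "Blah blah blah"',
    '2013-06-29 22:16:54.276 CDT WARN "This seems bad"',
    '2013-06-29 22:16:57.471 CDT INFO "More blather"',
    '2013-06-29 22:17:01.290 CDT WARN "Not looking good"',
    '2013-06-29 22:17:03.812 CDT INFO "Fairly unimportant"',
    '2013-06-29 22:17:05.362 CDT ERROR "Out of memory!"',
]
LEVELS = ['TRACE', 'DEBUG', 'INFO', 'WARN', 'ERROR', 'FATAL']


def map_lines(lines):
    """Emit 'LEVEL<tab>1' for each line whose fourth field is a known level."""
    for line in lines:
        fields = line.split()
        if len(fields) > 3 and fields[3].upper() in LEVELS:
            yield "%s\t1" % fields[3].upper()


def reduce_lines(sorted_lines):
    """Sum the values of consecutive lines that share a key."""
    previous_key, total = None, 0
    for line in sorted_lines:
        key, value = line.split()
        if key != previous_key:
            if previous_key is not None:
                yield "%s\t%i" % (previous_key, total)
            previous_key, total = key, 0
        total += int(value)
    if previous_key is not None:
        yield "%s\t%i" % (previous_key, total)


map_output = list(map_lines(LOG_LINES))
reduce_input = sorted(map_output)  # stands in for shuffle and sort
reduce_output = list(reduce_lines(reduce_input))

for line in reduce_output:
    print(line)
```

The three intermediate lists correspond to the map output, reduce input, and reduce output stages shown in the recap.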
  
How to Run a Hadoop Streaming Job

•  I’ll demonstrate this now…
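
Since the live demo is not captured in the slides, here is a hedged sketch of what launching a Streaming job generally looks like. It only assembles and prints the command. The jar path, the HDFS input and output directories, and the script names mapper.py and reducer.py are placeholders chosen for this example; the location of the streaming jar in particular differs between installations.

```python
# Placeholders for illustration; adjust to the actual installation and data.
STREAMING_JAR = "/path/to/hadoop-streaming.jar"
INPUT_DIR = "/logs/input"
OUTPUT_DIR = "/logs/output"


def streaming_command(jar, input_dir, output_dir, mapper, reducer):
    """Build the argument list for a Hadoop Streaming job."""
    return [
        "hadoop", "jar", jar,
        # Generic option: ship the scripts to the cluster nodes.
        "-files", "%s,%s" % (mapper, reducer),
        "-input", input_dir,
        "-output", output_dir,
        "-mapper", mapper,
        "-reducer", reducer,
    ]


command = streaming_command(
    STREAMING_JAR, INPUT_DIR, OUTPUT_DIR, "mapper.py", "reducer.py"
)
print(" ".join(command))
```

Before submitting a job, the same scripts can be checked locally with a shell pipeline such as cat sample.log | ./mapper.py | sort | ./reducer.py, where sample.log is a hypothetical local file and sort plays the role of the shuffle and sort step.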
  
	
  
Open Source Tools that Complement Hadoop

The Hadoop Ecosystem
  
The Hadoop Ecosystem

•  "Core Hadoop" consists of HDFS and MapReduce
    •  These are the kernel of a much broader platform
•  Hadoop has many related projects
    •  Some help you integrate Hadoop with other systems
    •  Others help you analyze your data
•  These are not considered “core Hadoop”
    •  Rather, they’re part of the Hadoop ecosystem
    •  Many are also open source Apache projects
  
Visual Overview of a Complete Workflow

[Diagram: workflow steps around a Hadoop Cluster with Impala]
•  Import Transaction Data from RDBMS
•  Sessionize Web Log Data with Pig
•  Sentiment Analysis on Social Media with Hive
•  Analyst uses Impala for business intelligence
•  Generate Nightly Reports using Pig, Hive, or Impala
•  Build product recommendations for Web site
  
Key Points

•  We’re generating massive volumes of data
    •  This data can be extremely valuable
    •  Companies can now analyze what they previously discarded
•  Hadoop supports large-scale data storage and processing
    •  Heavily influenced by Google's architecture
    •  Already in production by thousands of organizations
    •  HDFS is Hadoop's storage layer
    •  MapReduce is Hadoop's processing framework
•  Many ecosystem projects complement Hadoop
    •  Some help you to integrate Hadoop with existing systems
    •  Others help you analyze the data you’ve stored
  
Highly Recommended Books

Author: Tom White
ISBN: 1-449-31152-0

Author: Eric Sammer
ISBN: 1-449-32705-2
  
Questions?

•  Thank you for attending!
•  I’ll be happy to answer any additional questions now…
•  Demo and slides at github.com/markgrover/hadoop-intro-fast
•  Twitter: mark_grover
•  Survey page: tiny.cloudera.com/mark
  

 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
Introduction to Impala
Introduction to ImpalaIntroduction to Impala
Introduction to Impala
 

Dernier

Standard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayStandard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power Play
Epec Engineered Technologies
 
Hospital management system project report.pdf
Hospital management system project report.pdfHospital management system project report.pdf
Hospital management system project report.pdf
Kamal Acharya
 
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
AldoGarca30
 
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills KuwaitKuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
jaanualu31
 
DeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakesDeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakes
MayuraD1
 

Dernier (20)

DC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equationDC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equation
 
AIRCANVAS[1].pdf mini project for btech students
AIRCANVAS[1].pdf mini project for btech studentsAIRCANVAS[1].pdf mini project for btech students
AIRCANVAS[1].pdf mini project for btech students
 
Standard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayStandard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power Play
 
Hospital management system project report.pdf
Hospital management system project report.pdfHospital management system project report.pdf
Hospital management system project report.pdf
 
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...
 
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
 
kiln thermal load.pptx kiln tgermal load
kiln thermal load.pptx kiln tgermal loadkiln thermal load.pptx kiln tgermal load
kiln thermal load.pptx kiln tgermal load
 
Work-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxWork-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptx
 
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
 
Design For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startDesign For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the start
 
School management system project Report.pdf
School management system project Report.pdfSchool management system project Report.pdf
School management system project Report.pdf
 
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptxHOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
 
Unleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapUnleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leap
 
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills KuwaitKuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
 
Tamil Call Girls Bhayandar WhatsApp +91-9930687706, Best Service
Tamil Call Girls Bhayandar WhatsApp +91-9930687706, Best ServiceTamil Call Girls Bhayandar WhatsApp +91-9930687706, Best Service
Tamil Call Girls Bhayandar WhatsApp +91-9930687706, Best Service
 
Moment Distribution Method For Btech Civil
Moment Distribution Method For Btech CivilMoment Distribution Method For Btech Civil
Moment Distribution Method For Btech Civil
 
DeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakesDeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakes
 
Orlando’s Arnold Palmer Hospital Layout Strategy-1.pptx
Orlando’s Arnold Palmer Hospital Layout Strategy-1.pptxOrlando’s Arnold Palmer Hospital Layout Strategy-1.pptx
Orlando’s Arnold Palmer Hospital Layout Strategy-1.pptx
 
Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...
Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...
Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...
 
Engineering Drawing focus on projection of planes
Engineering Drawing focus on projection of planesEngineering Drawing focus on projection of planes
Engineering Drawing focus on projection of planes
 

Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley

  • 1. Introduction to Apache Hadoop and its Ecosystem
      Mark Grover | Intro to Cloud Computing, Carnegie Mellon SV
      github.com/markgrover/hadoop-intro-fast
      © 2010 – 2015 Cloudera, Inc. All Rights Reserved
  • 2. About Me
      • Committer on Apache Bigtop, committer and PPMC member on Apache Sentry (incubating).
      • Contributor to Apache Hadoop, Hive, Spark, Sqoop, Flume.
      • Software developer at Cloudera
      • @mark_grover
      • www.linkedin.com/in/grovermark
  • 3. Co-author O’Reilly book
      • @hadooparchbook
      • hadooparchitecturebook.com
      • To be released early 2015
  • 4. About the Presentation…
      • What’s ahead
          • Fundamental Concepts
          • HDFS: The Hadoop Distributed File System
          • Data Processing with MapReduce
          • Demo
          • Conclusion + Q&A
  • 5. Fundamental Concepts: Why the World Needs Hadoop
  • 6. What’s the craze about Hadoop?
      • Volume
          • More and more data being generated
          • Machine-generated data increasing
      • Velocity
          • Data coming in at higher speed
      • Variety
          • Audio, video, images, log files, web pages, social network connections, etc.
  • 7. We Need a System that Scales
      • Too much data for traditional tools
      • Two key problems
          • How to reliably store this data at a reasonable cost
          • How do we process all the data we’ve stored
  • 8. What is Apache Hadoop?
      • Scalable data storage and processing
          • Distributed and fault-tolerant
          • Runs on standard hardware
      • Two main components
          • Storage: Hadoop Distributed File System (HDFS)
          • Processing: MapReduce
      • Hadoop clusters are composed of computers called nodes
          • Clusters range from a single node up to several thousand nodes
  • 9. How Did Apache Hadoop Originate?
      • Heavily influenced by Google’s architecture
          • Notably, the Google Filesystem and MapReduce papers
      • Other Web companies quickly saw the benefits
          • Early adoption by Yahoo, Facebook and others
      • Timeline
          • 2002: Nutch spun off from Lucene
          • 2003: Google publishes GFS paper
          • 2004: Google publishes MapReduce paper
          • 2005: Nutch rewritten for MapReduce
          • 2006: Hadoop becomes Lucene subproject
  • 10. Comparing Hadoop to Other Systems
      • Monolithic systems don’t scale
      • Modern high-performance computing systems are distributed
          • They spread computations across many machines in parallel
          • Widely used for scientific applications
          • Let’s examine how a typical HPC system works
  • 11. Architecture of a Typical HPC System
      [Diagram: Storage System and Compute Nodes, connected by a Fast Network]
  • 12. Architecture of a Typical HPC System
      [Diagram: Storage System and Compute Nodes, connected by a Fast Network. Step 1: Copy input data]
  • 13. Architecture of a Typical HPC System
      [Diagram: Storage System and Compute Nodes, connected by a Fast Network. Step 2: Process the data]
  • 14. Architecture of a Typical HPC System
      [Diagram: Storage System and Compute Nodes, connected by a Fast Network. Step 3: Copy output data]
  • 15. You Don’t Just Need Speed…
      • The problem is that we have way more data than code
          $ du -ks code/
          1,087
          $ du -ks data/
          854,632,947,314
  • 16. You Need Speed At Scale
      [Diagram: Storage System and Compute Nodes, labeled "Bottleneck"]
  • 17. Hadoop Design Fundamental: Data Locality
      • This is a hallmark of Hadoop’s design
          • Don’t bring the data to the computation
          • Bring the computation to the data
      • Hadoop uses the same machines for storage and processing
          • Significantly reduces need to transfer data across network
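The data-locality idea on slide 17 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `pick_node`, the node names, and the block IDs are hypothetical and are not part of Hadoop's scheduler API. The sketch prefers a free node that already stores the block and falls back to a remote node only when no such node is free.

```python
# Illustrative sketch only: pick_node and the node/block names below are
# hypothetical, not Hadoop's real scheduler API.

def pick_node(block_id, block_locations, free_nodes):
    """Return (node, is_local), preferring a free node that stores the block."""
    for node in block_locations.get(block_id, []):
        if node in free_nodes:
            return node, True  # the computation moves to the data
    # No node holding the block is free: use any free node, which means the
    # block has to be copied across the network (assumes free_nodes is non-empty).
    return sorted(free_nodes)[0], False


# Hypothetical cluster state: which nodes hold a copy of each block.
block_locations = {"blk_1": ["node1", "node3"], "blk_2": ["node2", "node4"]}

print(pick_node("blk_1", block_locations, {"node3", "node5"}))
print(pick_node("blk_2", block_locations, {"node5"}))
```

The first call finds a local node (`node3`); the second has to fall back to `node5`, which is the case Hadoop's design tries to keep rare.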
  • 18. Other Hadoop Design Fundamentals
      • Machine failure is unavoidable: embrace it
          • Build reliability into the system
      • “More” is usually better than “faster”
          • Throughput matters more than latency
  • 19. The Hadoop Distributed Filesystem: HDFS
  • 20. HDFS: Hadoop Distributed File System
      • Inspired by the Google File System
      • Reliable, low-cost storage for massive amounts of data
      • Similar to a UNIX filesystem in some ways
          • Hierarchical
          • UNIX-style paths (e.g., /sales/alice.txt)
          • UNIX-style file ownership and permissions
  • 21. HDFS: Hadoop Distributed File System
      • There are also some major deviations from UNIX filesystems
          • Highly optimized for processing data with MapReduce
              • Designed for sequential access to large files
              • Cannot modify file content once written
          • It’s actually a user-space Java process
              • Accessed using special commands or APIs
          • No concept of a current working directory
  • 22. Copying Local Data To and From HDFS
      • Remember that HDFS is distinct from your local filesystem
          • hadoop fs -put copies local files to HDFS
          • hadoop fs -get fetches a local copy of a file from HDFS
      [Diagram: Client Machine and Hadoop Cluster]
          $ hadoop fs -put sales.txt /reports
          $ hadoop fs -get /reports/sales.txt
  • 23. HDFS Demo
      • I will now demonstrate the following
          1. How to list the contents of a directory
          2. How to create a directory in HDFS
          3. How to copy a local file to HDFS
          4. How to display the contents of a file in HDFS
          5. How to remove a file from HDFS
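The five demo steps map onto standard `hadoop fs` subcommands (`-ls`, `-mkdir`, `-put`, `-cat`, `-rm`). The sketch below only builds and prints the command lines, so it runs without a cluster. The helper name `hdfs_demo_commands` and the choice to list the root directory are assumptions for illustration; the file and directory names are borrowed from slide 22.

```python
import shlex


def hdfs_demo_commands(local_file, hdfs_dir):
    """Build one `hadoop fs` command (as an argv list) per demo step."""
    # Path the file will have in HDFS after the copy.
    target = hdfs_dir.rstrip("/") + "/" + local_file.split("/")[-1]
    return [
        ["hadoop", "fs", "-ls", "/"],                    # 1. list a directory
        ["hadoop", "fs", "-mkdir", hdfs_dir],            # 2. create a directory
        ["hadoop", "fs", "-put", local_file, hdfs_dir],  # 3. copy a local file in
        ["hadoop", "fs", "-cat", target],                # 4. display file contents
        ["hadoop", "fs", "-rm", target],                 # 5. remove the file
    ]


for argv in hdfs_demo_commands("sales.txt", "/reports"):
    print("$ " + " ".join(shlex.quote(arg) for arg in argv))
```

On a machine with a configured Hadoop client, each argv list could be passed to `subprocess.run` instead of being printed.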
  • 24. A Scalable Data Processing Framework: Data Processing with MapReduce
  • 25. What is MapReduce?
      • MapReduce is a programming model
          • It’s a way of processing data
          • You can implement MapReduce in any language
  • 26. Understanding Map and Reduce
      • You supply two functions to process data: Map and Reduce
          • Map: typically used to transform, parse, or filter data
          • Reduce: typically used to summarize results
      • The Map function always runs first
          • The Reduce function runs afterwards, but is optional
      • Each piece is simple, but can be powerful when combined
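Because MapReduce is a programming model rather than a Hadoop-specific API, the contract described on slide 26 can be mimicked in plain Python. This is a single-process sketch, not how Hadoop executes jobs; `run_mapreduce`, `word_map`, and `count_reduce` are hypothetical names chosen for this example.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn=None):
    """Apply map_fn to every record, then reduce_fn once per key (if given)."""
    pairs = []
    for record in records:
        pairs.extend(map_fn(record))  # Map always runs first
    if reduce_fn is None:
        return pairs                  # Reduce is optional: a map-only job
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)     # collect all values for each key
    return [reduce_fn(key, groups[key]) for key in sorted(groups)]


def word_map(line):
    # Map: parse a line and emit a (word, 1) pair for each word.
    return [(word, 1) for word in line.split()]


def count_reduce(key, values):
    # Reduce: summarize all values seen for one key.
    return (key, sum(values))


print(run_mapreduce(["a b", "b c b"], word_map, count_reduce))
# [('a', 1), ('b', 3), ('c', 1)]
```

Leaving out `reduce_fn` returns the raw map output, which mirrors the "Reduce is optional" point above.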
  • 27. MapReduce Benefits
      • Scalability
          • Hadoop divides the processing job into individual tasks
          • Tasks execute in parallel (independently) across cluster
      • Simplicity
          • Processes one record at a time
      • Ease of use
          • Hadoop provides job scheduling and other infrastructure
          • Far simpler for developers than typical distributed computing
  • 28. MapReduce in Hadoop
      • MapReduce processing in Hadoop is batch-oriented
      • A MapReduce job is broken down into smaller tasks
          • Tasks run concurrently
          • Each processes a small amount of overall input
      • MapReduce code for Hadoop is usually written in Java
          • This uses Hadoop’s API directly
      • You can do basic MapReduce in other languages
          • Using the Hadoop Streaming wrapper program
          • Some advanced features require Java code
  • 29. MapReduce Example in Python
      • The following example uses Python
          • Via Hadoop Streaming
      • It processes log files and summarizes events by type
          • I’ll explain both the data flow and the code
  • 30. Job Input
      • Here’s the job input
      • Each map task gets a chunk of this data to process
          • Typically corresponds to a single block in HDFS
          2013-06-29 22:16:49.391 CDT INFO "This can wait"
          2013-06-29 22:16:52.143 CDT INFO "Blah blah blah"
          2013-06-29 22:16:54.276 CDT WARN "This seems bad"
          2013-06-29 22:16:57.471 CDT INFO "More blather"
          2013-06-29 22:17:01.290 CDT WARN "Not looking good"
          2013-06-29 22:17:03.812 CDT INFO "Fairly unimportant"
          2013-06-29 22:17:05.362 CDT ERROR "Out of memory!"
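Since each map task typically handles one HDFS block, a rough task count is the file size divided by the block size, rounded up. The sketch below assumes a 128 MB block size, which is a common default but is configurable per cluster; the real number of tasks also depends on the input format and job settings, so treat this as back-of-the-envelope arithmetic. `estimate_map_tasks` is a hypothetical helper name.

```python
def estimate_map_tasks(file_size_bytes, block_size_bytes=128 * 1024 * 1024):
    """Rough estimate: one map task per HDFS block (assumed 128 MB blocks)."""
    if file_size_bytes <= 0:
        return 0
    # Integer ceiling division: a partially filled last block still needs a task.
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes


one_gib = 1024 ** 3
print(estimate_map_tasks(one_gib))      # 8
print(estimate_map_tasks(one_gib + 1))  # 9
```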
  • 31. Python Code for Map Function
      #!/usr/bin/env python
      import sys

      # Define list of known log levels
      levels = ['TRACE', 'DEBUG', 'INFO', 'WARN', 'ERROR', 'FATAL']

      # Read records from standard input; use whitespace to split into fields
      for line in sys.stdin:
          fields = line.split()
          if len(fields) < 4:
              continue  # skip lines too short to contain a level field
          # Extract "level" field and convert to uppercase for consistency
          level = fields[3].upper()
          # If it matches a known level, print it, a tab separator, and the
          # literal value 1 (since the level can only occur once per line)
          if level in levels:
              print("%s\t1" % level)
  • 32. Output of Map Function
      • The map function produces key/value pairs as output
          INFO   1
          INFO   1
          WARN   1
          INFO   1
          WARN   1
          INFO   1
          ERROR  1
  • 33. The “Shuffle and Sort”
      • Hadoop automatically merges, sorts, and groups map output
      • The result is passed as input to the reduce function
      • More on this later…
          Map Output      Reduce Input (after Shuffle and Sort)
          INFO   1        ERROR  1
          INFO   1        INFO   1
          WARN   1        INFO   1
          INFO   1        INFO   1
          WARN   1        INFO   1
          INFO   1        WARN   1
          ERROR  1        WARN   1
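What the shuffle and sort does to the map output can be imitated locally with `sorted` and `itertools.groupby`. This is a single-machine sketch of the effect, not Hadoop's distributed implementation; `shuffle_and_sort` is a hypothetical name, and the input pairs are the map output from slide 32.

```python
from itertools import groupby
from operator import itemgetter

# Map output from slide 32, as (key, value) pairs.
map_output = [
    ("INFO", 1), ("INFO", 1), ("WARN", 1), ("INFO", 1),
    ("WARN", 1), ("INFO", 1), ("ERROR", 1),
]


def shuffle_and_sort(pairs):
    """Sort pairs by key, then group the values that share a key."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [
        (key, [value for _, value in group])
        for key, group in groupby(ordered, key=itemgetter(0))
    ]


print(shuffle_and_sort(map_output))
# [('ERROR', [1]), ('INFO', [1, 1, 1, 1]), ('WARN', [1, 1])]
```

`groupby` only merges adjacent items, which is why the sort has to come first; the same reasoning explains why the streaming reducer can rely on seeing each key's lines consecutively.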
  • 34. Input to Reduce Function
      • Reduce function receives a key and all values for that key
      • Keys are always passed to reducers in sorted order
          • Although not obvious here, values are unordered
          ERROR  1
          INFO   1
          INFO   1
          INFO   1
          INFO   1
          WARN   1
          WARN   1
  • 35. Python Code for Reduce Function
      #!/usr/bin/env python
      import sys

      # Initialize loop variables
      previous_key = None
      total = 0

      for line in sys.stdin:
          # Extract the key and value passed via standard input
          key, value = line.split()
          if key == previous_key:
              # If key unchanged, increment the count
              total = total + int(value)
          # continued on next slide
  • 36. Python Code for Reduce Function
          # continued from previous slide
          else:
              # If key changed, print data for old level
              if previous_key:
                  print('%s\t%i' % (previous_key, total))
              # Start tracking data for the new record
              previous_key = key
              total = int(value)

      # Print data for the final key (skipped when there was no input)
      if previous_key:
          print('%s\t%i' % (previous_key, total))
  • 37. Output of Reduce Function
      • Its output is a sum for each level
          ERROR  1
          INFO   4
          WARN   2
  • 38. Recap of Data Flow
      Map input:
          2013-06-29 22:16:49.391 CDT INFO "This can wait"
          2013-06-29 22:16:52.143 CDT INFO "Blah blah blah"
          2013-06-29 22:16:54.276 CDT WARN "This seems bad"
          2013-06-29 22:16:57.471 CDT INFO "More blather"
          2013-06-29 22:17:01.290 CDT WARN "Not looking good"
          2013-06-29 22:17:03.812 CDT INFO "Fairly unimportant"
          2013-06-29 22:17:05.362 CDT ERROR "Out of memory!"
      Map output:
          INFO 1, INFO 1, WARN 1, INFO 1, WARN 1, INFO 1, ERROR 1
      Shuffle and sort
      Reduce input:
          ERROR 1, INFO 1, INFO 1, INFO 1, INFO 1, WARN 1, WARN 1
      Reduce output:
          ERROR 1, INFO 4, WARN 2
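The whole data flow can be reproduced on a laptop without a cluster. The sketch below wraps the same map and reduce logic in generator functions and uses `sorted` in place of the shuffle and sort. The function names (`map_lines`, `reduce_lines`, `run_job`) are hypothetical, and this is a local simulation rather than a real Hadoop job.

```python
LEVELS = {"TRACE", "DEBUG", "INFO", "WARN", "ERROR", "FATAL"}

LOG_LINES = [
    '2013-06-29 22:16:49.391 CDT INFO "This can wait"',
    '2013-06-29 22:16:52.143 CDT INFO "Blah blah blah"',
    '2013-06-29 22:16:54.276 CDT WARN "This seems bad"',
    '2013-06-29 22:16:57.471 CDT INFO "More blather"',
    '2013-06-29 22:17:01.290 CDT WARN "Not looking good"',
    '2013-06-29 22:17:03.812 CDT INFO "Fairly unimportant"',
    '2013-06-29 22:17:05.362 CDT ERROR "Out of memory!"',
]


def map_lines(lines):
    """Map: emit 'LEVEL<TAB>1' for every line with a known log level."""
    for line in lines:
        fields = line.split()
        if len(fields) > 3 and fields[3].upper() in LEVELS:
            yield "%s\t1" % fields[3].upper()


def reduce_lines(sorted_lines):
    """Reduce: sum the values of consecutive lines that share a key."""
    previous_key, total = None, 0
    for line in sorted_lines:
        key, value = line.split()
        if key == previous_key:
            total += int(value)
        else:
            if previous_key is not None:
                yield "%s\t%i" % (previous_key, total)
            previous_key, total = key, int(value)
    if previous_key is not None:
        yield "%s\t%i" % (previous_key, total)


def run_job(lines):
    # sorted() stands in for Hadoop's shuffle and sort between the two phases.
    return list(reduce_lines(sorted(map_lines(lines))))


print("\n".join(run_job(LOG_LINES)))
```

This prints one tab-separated line each for ERROR (1), INFO (4), and WARN (2), matching the reduce output on slide 37.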
  • 39. How to Run a Hadoop Streaming Job
      • I’ll demonstrate this now…
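The live demo is not captured in the transcript, so the following is only a sketch of what a typical Hadoop Streaming invocation looks like. The jar location, the HDFS paths, and the script names `mapper.py` and `reducer.py` are all assumptions; the location of the streaming jar in particular varies by distribution and version. The helper just builds and prints the command line.

```python
def streaming_command(streaming_jar, input_path, output_path,
                      mapper="mapper.py", reducer="reducer.py"):
    """Build the argv for a Hadoop Streaming job (all paths are placeholders)."""
    return [
        "hadoop", "jar", streaming_jar,
        # Generic option: ship the scripts to the cluster nodes.
        "-files", "%s,%s" % (mapper, reducer),
        "-input", input_path,    # HDFS input file or directory
        "-output", output_path,  # HDFS output directory
        "-mapper", mapper,
        "-reducer", reducer,
    ]


argv = streaming_command("/path/to/hadoop-streaming.jar", "/logs", "/logs-summary")
print(" ".join(argv))
```

The output directory normally has to not exist before the job starts, and both scripts need to be executable with a valid interpreter line.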
  • 40. Open Source Tools that Complement Hadoop: The Hadoop Ecosystem
  • 41. The Hadoop Ecosystem
      • "Core Hadoop" consists of HDFS and MapReduce
          • These are the kernel of a much broader platform
      • Hadoop has many related projects
          • Some help you integrate Hadoop with other systems
          • Others help you analyze your data
      • These are not considered “core Hadoop”
          • Rather, they’re part of the Hadoop ecosystem
          • Many are also open source Apache projects
  • 42. Visual Overview of a Complete Workflow
      [Diagram: Hadoop Cluster with Impala, surrounded by the following steps]
          • Import Transaction Data from RDBMS
          • Sessionize Web Log Data with Pig
          • Sentiment Analysis on Social Media with Hive
          • Analyst uses Impala for business intelligence
          • Generate Nightly Reports using Pig, Hive, or Impala
          • Build product recommendations for Web site
  • 43. Key Points
      • We’re generating massive volumes of data
          • This data can be extremely valuable
          • Companies can now analyze what they previously discarded
      • Hadoop supports large-scale data storage and processing
          • Heavily influenced by Google's architecture
          • Already in production by thousands of organizations
          • HDFS is Hadoop's storage layer
          • MapReduce is Hadoop's processing framework
      • Many ecosystem projects complement Hadoop
          • Some help you to integrate Hadoop with existing systems
          • Others help you analyze the data you’ve stored
  • 44. Highly Recommended Books
      • Author: Tom White, ISBN: 1-449-31152-0
      • Author: Eric Sammer, ISBN: 1-449-32705-2
  • 45. Questions?
      • Thank you for attending!
      • I’ll be happy to answer any additional questions now…
      • Demo and slides at github.com/markgrover/hadoop-intro-fast
      • Twitter: mark_grover
      • Survey page: tiny.cloudera.com/mark