Introduction to Spark
Ben White – Systems Engineer, Cloudera
  
But first… how did we get here?
  
What does Hadoop look like?
[Diagram: a cluster of worker nodes, each running an HDFS worker ("DN") and an MR worker ("TT"), alongside an HDFS master ("NN"), an MR master ("JT"), and a standby master]
  
But I want MORE!
[Diagram: worker nodes each running an HDFS worker, with MapReduce as the processing layer, plus an HDFS master ("NN"), an MR master ("JT"), and a standby master]
  
Hadoop as an Architecture

The Old Way
$30,000+ per TB
Expensive & Unattainable
•  Hard to scale
•  Network is a bottleneck
•  Only handles relational data
•  Difficult to add new fields & data types
Expensive, Special purpose, "Reliable" Servers
Expensive Licensed Software
[Diagram: Compute (RDBMS, EDW) connected over the Network to Data Storage (SAN, NAS)]

The Hadoop Way
$300-$1,000 per TB
Affordable & Attainable
•  Scales out forever
•  No bottlenecks
•  Easy to ingest any data
•  Agile data access
Commodity "Unreliable" Servers
Hybrid Open Source Software
[Diagram: each node combines Compute (CPU), Memory, and Storage (Disk)]
  
CDH: the App Store for Hadoop
[Diagram labels: Integration; Storage; Resource Management; Metadata; NoSQL DBMS; …; Analytic MPP DBMS; Search Engine; In-Memory; Batch Processing (MapReduce); Machine Learning; System Management; Data Management; Support; Security]

Introduction to Apache Spark
Credits:
•  Todd Lipcon
•  Ted Malaska
•  Jairam Ranganathan
•  Jayant Shekhar
•  Sandy Ryza
  
Can we improve on MR?
•  Problems with MR:
  •  Very low-level: requires a lot of code to do simple things
  •  Very constrained: everything must be described as "map" and "reduce". Powerful, but sometimes difficult to think in these terms.
  
Can we improve on MR?
•  Two approaches to improve on MapReduce:

1.  Special purpose systems to solve one problem domain well.
  •  Giraph / GraphLab (graph processing)
  •  Storm (stream processing)

2.  Generalize the capabilities of MapReduce to provide a richer foundation to solve problems.
  •  Tez, MPI, Hama/Pregel (BSP), Dryad (arbitrary DAGs)

Both are viable strategies depending on the problem!
  
What is Apache Spark?
Spark is a general purpose computational framework

Retains the advantages of MapReduce:
•  Linear scalability
•  Fault-tolerance
•  Data Locality based computations

…but offers so much more:
•  Leverages distributed memory for better performance
•  Supports iterative algorithms that are not feasible in MR
•  Improved developer experience
•  Full Directed Graph expressions for data parallel computations
•  Comes with libraries for machine learning, graph analysis, etc.
  
Getting started with Spark
•  Java API
•  Interactive shells:
  •  Scala (spark-shell)
  •  Python (pyspark)
  
Execution modes
•  Standalone Mode
  •  Dedicated master and worker daemons
•  YARN Client Mode
  •  Launches a YARN application with the driver program running locally
•  YARN Cluster Mode
  •  Launches a YARN application with the driver program running in the YARN ApplicationMaster

[Callouts: "Dynamic resource management between Spark, MR, Impala…" and "Dedicated Spark runtime with static resource limits"]
  
Spark Concepts
  
Parallelized Collections

    scala> val data = 1 to 5
    data: Range.Inclusive = Range(1, 2, 3, 4, 5)

    scala> val distData = sc.parallelize(data)
    distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0]

Now I can apply parallel operations to this array:

    scala> distData.reduce(_ + _)
    [… Adding task set 0.0 with 56 tasks …]
    res0: Int = 15

What just happened?!
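
A rough way to picture what happened: the collection was cut into partitions, the reduce ran as one task per partition, and the partial results were combined (the 56 tasks in the log presumably correspond to 56 partitions, most of them empty for five elements). The sketch below imitates that in plain Python, without Spark. `parallelize` and `reduce_partitions` are hypothetical helpers written for this illustration, not the Spark API, and the slicing rule is an assumption.

```python
from functools import reduce
from operator import add


def parallelize(data, num_slices):
    """Split a local collection into num_slices contiguous partitions."""
    items = list(data)
    n = len(items)
    return [items[i * n // num_slices:(i + 1) * n // num_slices]
            for i in range(num_slices)]


def reduce_partitions(partitions, func):
    """Reduce each non-empty partition on its own, then combine the partials."""
    partials = [reduce(func, part) for part in partitions if part]
    return reduce(func, partials)


partitions = parallelize(range(1, 6), num_slices=3)
total = reduce_partitions(partitions, add)
print(partitions)  # [[1], [2, 3], [4, 5]]
print(total)       # 15
```

Because each partition is reduced independently, the function passed to the reduce has to be associative and commutative for the combined result to be well defined.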
  
RDD – Resilient Distributed Dataset
•  Collections of objects partitioned across a cluster
•  Stored in RAM or on Disk
•  You can control persistence and partitioning
•  Created by:
  •  Distributing local collection objects
  •  Transformation of data in storage
  •  Transformation of RDDs
•  Automatically rebuilt on failure (resilient)
  •  Contains lineage to compute from storage
  •  Lazy materialization
  
RDD transformations

Operations on RDDs

Transformations lazily transform an RDD to a new RDD
•  map
•  flatMap
•  filter
•  sample
•  join
•  sort
•  reduceByKey
•  …

Actions run computation to return a value
•  collect
•  reduce(func)
•  foreach(func)
•  count
•  first, take(n)
•  saveAs
•  …
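
The split between lazy transformations and eager actions can be sketched in plain Python. `LazyDataset` below is a hypothetical toy class, not Spark's RDD: `map` and `filter` only record what to do, and nothing runs until an action such as `collect` or `count` is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, steps=()):
        self._source = source        # zero-argument callable returning an iterable
        self._steps = tuple(steps)   # recorded (kind, func) transformations

    # Transformations: return a new dataset, compute nothing.
    def map(self, func):
        return LazyDataset(self._source, self._steps + (("map", func),))

    def filter(self, func):
        return LazyDataset(self._source, self._steps + (("filter", func),))

    def _compute(self):
        data = self._source()
        for kind, func in self._steps:
            data = map(func, data) if kind == "map" else filter(func, data)
        return data

    # Actions: run the recorded steps and return a value.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())


calls = []


def square(x):
    calls.append(x)  # record that the function really ran
    return x * x


squares = LazyDataset(lambda: range(1, 6)).map(square).filter(lambda x: x % 2 == 1)
calls_before_action = len(calls)   # still 0: nothing has been computed yet
result = squares.collect()         # the action triggers the computation
print(calls_before_action, result)
```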
  
Fault Tolerance
•  RDDs contain lineage.
•  Lineage: source location and list of transformations
•  Lost partitions can be re-computed from source data
  
    msgs = textFile.filter(lambda s: s.startswith("ERROR")).map(lambda s: s.split("\t")[2])

[Diagram: HDFS File → filter (func = startswith(…)) → Filtered RDD → map (func = split(…)) → Mapped RDD]
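
Recomputing a lost partition from lineage can be illustrated without Spark. In this sketch the source data, the partition layout, and the names `SOURCE`, `LINEAGE`, and `compute_partition` are all hypothetical; the lineage is the same filter-then-map chain as in the example above.

```python
# Stands in for blocks of a file in HDFS, keyed by partition index (made-up data).
SOURCE = {
    0: ["ERROR\tdisk\tsda failed", "INFO\tboot\tok"],
    1: ["ERROR\tnet\teth0 down", "WARN\tcpu\thot"],
}

# Lineage: the list of transformations applied to the source.
LINEAGE = [
    ("filter", lambda s: s.startswith("ERROR")),
    ("map", lambda s: s.split("\t")[2]),
]


def compute_partition(index):
    """Rebuild one partition by replaying the lineage over its source block."""
    data = SOURCE[index]
    for kind, func in LINEAGE:
        if kind == "filter":
            data = [x for x in data if func(x)]
        else:
            data = [func(x) for x in data]
    return data


cache = {i: compute_partition(i) for i in SOURCE}  # materialized partitions
del cache[1]                                       # simulate losing a node
cache[1] = compute_partition(1)                    # rebuild only what was lost
print(cache)
```

Only the lost partition is replayed; partition 0 stays untouched in the cache.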
  
Examples

Word Count in MapReduce

    package org.myorg;

    import java.io.IOException;
    import java.util.*;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.conf.*;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapreduce.*;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class WordCount {

      public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          String line = value.toString();
          StringTokenizer tokenizer = new StringTokenizer(line);
          while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
          }
        }
      }

      public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Job job = new Job(conf, "wordcount");

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
      }

    }
  
Word Count in Spark

    sc.textFile("words")
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _).collect()
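
To make the three steps of the Spark version concrete, here is the same pipeline in plain Python over an in-memory list. This is only an illustration of what flatMap, map, and reduceByKey each contribute; the input lines are made up and nothing here is distributed.

```python
from collections import defaultdict
from itertools import chain

lines = ["to be or", "not to be"]  # made-up stand-in for the "words" file

# flatMap: one line becomes many words, flattened into a single stream
words = chain.from_iterable(line.split(" ") for line in lines)

# map: each word becomes a (word, 1) pair
pairs = ((word, 1) for word in words)

# reduceByKey(_ + _): add up the values that share a key
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

result = sorted(counts.items())
print(result)  # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```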
Logistic Regression
•  Read two sets of points
•  Look for a plane W that separates them
•  Perform gradient descent:
  •  Start with random W
  •  On each iteration, sum a function of W over the data
  •  Move W in a direction that improves it
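
The steps above can be sketched in plain Python. This is a minimal single-machine illustration under stated assumptions, not the code from the slides: labels are +1 or -1, the loss is the logistic loss, and the step size, iteration count, seed, and sample points are all arbitrary choices made for the example. In Spark, the inner sum over the data is the part that would run as a map followed by a reduce.

```python
import math
import random


def gradient(w, x, y):
    """Gradient of log(1 + exp(-y * w.x)) with respect to w, for one point."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    scale = (1.0 / (1.0 + math.exp(-margin)) - 1.0) * y
    return [scale * xi for xi in x]


def train(points, iterations=50, step=0.1, seed=0):
    rng = random.Random(seed)
    w = [rng.uniform(-1, 1) for _ in range(len(points[0][0]))]  # start with random W
    for _ in range(iterations):
        total = [0.0] * len(w)
        for x, y in points:                    # sum a function of W over the data
            g = gradient(w, x, y)
            total = [t + gi for t, gi in zip(total, g)]
        w = [wi - step * ti for wi, ti in zip(w, total)]  # move W downhill
    return w


# Two made-up sets of points, separable by a plane through the origin.
points = [((2.0, 1.5), 1), ((1.0, 2.5), 1), ((1.5, 1.0), 1),
          ((-2.0, -1.0), -1), ((-1.0, -2.0), -1), ((-1.5, -2.5), -1)]
w = train(points)
predictions = [1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1
               for x, _ in points]
print(w, predictions)
```

Every iteration reads the same points again, which is why keeping them in memory between iterations matters for this kind of algorithm.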
  
Intuition

Logistic Regression

Logistic Regression Performance
  
Spark and Hadoop: a Framework within a Framework
[Diagram labels: Integration; Storage; Resource Management; Metadata; HBase, …, Impala, Solr, Spark, MapReduce; System Management; Data Management; Support; Security]
  
Spark Streaming
•  Takes the concept of RDDs and extends it to DStreams
  •  Fault-tolerant like RDDs
  •  Transformable like RDDs
•  Adds new "rolling window" operations
  •  Rolling averages, etc.
•  But keeps everything else!
  •  Regular Spark code works in Spark Streaming
  •  Can still access HDFS data, etc.
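
A rolling-window operation over micro-batches can be illustrated in plain Python. The sketch assumes the stream arrives as a sequence of small batches and that the window is measured in number of batches; Spark Streaming defines windows by time durations instead, and `rolling_average` is a hypothetical helper, not part of its API.

```python
from collections import deque


def rolling_average(batches, window):
    """After each new micro-batch, yield the average over the last `window` batches."""
    recent = deque(maxlen=window)  # older batches fall out automatically
    for batch in batches:
        recent.append(batch)
        values = [v for b in recent for v in b]
        yield sum(values) / len(values) if values else None


micro_batches = [[1, 2, 3], [4, 5], [6], [7, 8, 9, 10]]  # made-up input
averages = list(rolling_average(micro_batches, window=2))
print(averages)  # [2.0, 3.0, 5.0, 8.0]
```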
  
Micro-batching for on-the-fly ETL

Fault recovery
How fast can the system recover?
  
Fault Recovery
•  RDDs store dependency graph
•  Because RDDs are deterministic: missing RDDs are rebuilt in parallel on other nodes
•  Stateful RDDs can have infinite lineage
•  Periodic checkpoints to disk clear lineage
•  Faster recovery times
•  Better handling of stragglers vs row-by-row streaming
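
Why checkpoints bound the lineage of stateful data can be shown with a toy running total. `RunningTotal` is a hypothetical class for this illustration only: state is recovered by replaying every batch since the last checkpoint, so without checkpoints the replay list would grow forever.

```python
class RunningTotal:
    """Toy stateful stream: recover state by replaying batches since a checkpoint."""

    def __init__(self, checkpoint_every):
        self.checkpoint_every = checkpoint_every
        self.checkpoint = 0   # state as of the last checkpoint (would live on disk)
        self.lineage = []     # batches applied since that checkpoint

    def update(self, batch):
        self.lineage.append(list(batch))
        if len(self.lineage) >= self.checkpoint_every:
            self.checkpoint = self.recover()  # persist the current state
            self.lineage.clear()              # lineage before this point is no longer needed

    def recover(self):
        """Rebuild the current state from the checkpoint plus the remaining lineage."""
        total = self.checkpoint
        for batch in self.lineage:
            total += sum(batch)
        return total


state = RunningTotal(checkpoint_every=3)
for batch in [[1], [2], [3], [4], [5]]:
    state.update(batch)
print(state.recover(), len(state.lineage))  # 15 2
```

Recovery here replays at most two batches instead of all five, which is the sense in which checkpointing gives faster recovery times.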
  
Summary

Why Spark?
•  Flexible like MapReduce
•  High performance
•  Machine learning, iterative algorithms
•  Interactive data explorations
•  Concise, easy API for developer productivity
  
Spark
http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/spark.html

http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM5/latest/Cloudera-Manager-Installation-Guide/cm5ig_install_spark.html
  
	
  
	
  
A Brief History
[Timeline: 2002 to 2014]
•  Doug Cutting launches Nutch project
•  Google releases GFS paper
•  Google releases MapReduce paper
•  MapReduce implemented in Nutch
•  Nutch adds distributed file system
•  Hadoop spun out of Nutch project
•  Hadoop breaks Terasort world record
•  Cloudera founded
•  CDH and CDH2 released
•  CDH3 released
•  CDH4 released adding HA
•  Impala (SQL on Hadoop) launched
•  Sentry and Search launched
•  CDH5
•  Cloudera Manager released
•  HBase, ZooKeeper, Flume and more added to CDH
  
What is Apache Hadoop?
•  An open-source implementation of Google's GFS and MapReduce papers
•  An Apache Software Foundation top-level project
•  Good at storing and processing all kinds of data
•  Reliable storage at terabyte/petabyte-scale on unreliable (cheap) hardware
•  A distributed system for counting words :)
  
What is Apache Hadoop?
Apache Hadoop is an open source platform for data storage and processing that is…
✓  Scalable
✓  Fault tolerant
✓  Distributed

CORE HADOOP SYSTEM COMPONENTS
•  Hadoop Distributed File System (HDFS): Self-Healing, High Bandwidth Clustered Storage
•  MapReduce: Distributed Computing Framework

Has the Flexibility to Store and Mine Any Type of Data
•  Ask questions across structured and unstructured data that were previously impossible to ask or solve
•  Not bound by a single schema

Excels at Processing Complex Data
•  Scale-out architecture divides workloads across multiple nodes
•  Flexible file system eliminates ETL bottlenecks

Scales Economically
•  Can be deployed on industry standard hardware
•  Open source platform guards against vendor lock-in
  

Contenu connexe

Tendances

Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014cdmaxime
 
Real time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesosReal time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesosRahul Kumar
 
Low-Latency Analytics with NoSQL – Introduction to Storm and Cassandra
Low-Latency Analytics with NoSQL – Introduction to Storm and CassandraLow-Latency Analytics with NoSQL – Introduction to Storm and Cassandra
Low-Latency Analytics with NoSQL – Introduction to Storm and CassandraCaserta
 
Spark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and FutureSpark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and FutureRussell Spitzer
 
How We Used Cassandra/Solr to Build Real-Time Analytics Platform
How We Used Cassandra/Solr to Build Real-Time Analytics PlatformHow We Used Cassandra/Solr to Build Real-Time Analytics Platform
How We Used Cassandra/Solr to Build Real-Time Analytics PlatformDataStax Academy
 
Large partition in Cassandra
Large partition in CassandraLarge partition in Cassandra
Large partition in CassandraShogo Hoshii
 
BDM25 - Spark runtime internal
BDM25 - Spark runtime internalBDM25 - Spark runtime internal
BDM25 - Spark runtime internalDavid Lauzon
 
Cassandra for Sysadmins
Cassandra for SysadminsCassandra for Sysadmins
Cassandra for SysadminsNathan Milford
 
Apache Drill @ PJUG, Jan 15, 2013
Apache Drill @ PJUG, Jan 15, 2013Apache Drill @ PJUG, Jan 15, 2013
Apache Drill @ PJUG, Jan 15, 2013Gera Shegalov
 
Using Spark to Load Oracle Data into Cassandra
Using Spark to Load Oracle Data into CassandraUsing Spark to Load Oracle Data into Cassandra
Using Spark to Load Oracle Data into CassandraJim Hatcher
 
Analyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and CassandraAnalyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and CassandraPatrick McFadin
 
Feeding Cassandra with Spark-Streaming and Kafka
Feeding Cassandra with Spark-Streaming and KafkaFeeding Cassandra with Spark-Streaming and Kafka
Feeding Cassandra with Spark-Streaming and KafkaDataStax Academy
 
Understanding Cassandra internals to solve real-world problems
Understanding Cassandra internals to solve real-world problemsUnderstanding Cassandra internals to solve real-world problems
Understanding Cassandra internals to solve real-world problemsAcunu
 
Elassandra: Elasticsearch as a Cassandra Secondary Index (Rémi Trouville, Vin...
Elassandra: Elasticsearch as a Cassandra Secondary Index (Rémi Trouville, Vin...Elassandra: Elasticsearch as a Cassandra Secondary Index (Rémi Trouville, Vin...
Elassandra: Elasticsearch as a Cassandra Secondary Index (Rémi Trouville, Vin...DataStax
 
Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016
Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016
Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016DataStax
 
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...CloudxLab
 
Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016
Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016
Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016DataStax
 

Tendances (20)

Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014
 
Real time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesosReal time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesos
 
Apache Drill
Apache DrillApache Drill
Apache Drill
 
R for hadoopers
R for hadoopersR for hadoopers
R for hadoopers
 
Low-Latency Analytics with NoSQL – Introduction to Storm and Cassandra
Low-Latency Analytics with NoSQL – Introduction to Storm and CassandraLow-Latency Analytics with NoSQL – Introduction to Storm and Cassandra
Low-Latency Analytics with NoSQL – Introduction to Storm and Cassandra
 
Spark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and FutureSpark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and Future
 
How We Used Cassandra/Solr to Build Real-Time Analytics Platform
How We Used Cassandra/Solr to Build Real-Time Analytics PlatformHow We Used Cassandra/Solr to Build Real-Time Analytics Platform
How We Used Cassandra/Solr to Build Real-Time Analytics Platform
 
Large partition in Cassandra
Large partition in CassandraLarge partition in Cassandra
Large partition in Cassandra
 
BDM25 - Spark runtime internal
BDM25 - Spark runtime internalBDM25 - Spark runtime internal
BDM25 - Spark runtime internal
 
Cassandra for Sysadmins
Cassandra for SysadminsCassandra for Sysadmins
Cassandra for Sysadmins
 
Apache Drill @ PJUG, Jan 15, 2013
Apache Drill @ PJUG, Jan 15, 2013Apache Drill @ PJUG, Jan 15, 2013
Apache Drill @ PJUG, Jan 15, 2013
 
Hadoop and Spark
Hadoop and SparkHadoop and Spark
Hadoop and Spark
 
Using Spark to Load Oracle Data into Cassandra
Using Spark to Load Oracle Data into CassandraUsing Spark to Load Oracle Data into Cassandra
Using Spark to Load Oracle Data into Cassandra
 
Analyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and CassandraAnalyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and Cassandra
 
Feeding Cassandra with Spark-Streaming and Kafka
Feeding Cassandra with Spark-Streaming and KafkaFeeding Cassandra with Spark-Streaming and Kafka
Feeding Cassandra with Spark-Streaming and Kafka
 
Understanding Cassandra internals to solve real-world problems
Understanding Cassandra internals to solve real-world problemsUnderstanding Cassandra internals to solve real-world problems
Understanding Cassandra internals to solve real-world problems
 
Elassandra: Elasticsearch as a Cassandra Secondary Index (Rémi Trouville, Vin...
Elassandra: Elasticsearch as a Cassandra Secondary Index (Rémi Trouville, Vin...Elassandra: Elasticsearch as a Cassandra Secondary Index (Rémi Trouville, Vin...
Elassandra: Elasticsearch as a Cassandra Secondary Index (Rémi Trouville, Vin...
 
Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016
Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016
Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016
 
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
 
Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016
Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016
Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016
 

En vedette

Big Data Day LA 2015 - HBase at Factual: Real time and Batch Uses by Molly O'...
Big Data Day LA 2015 - HBase at Factual: Real time and Batch Uses by Molly O'...Big Data Day LA 2015 - HBase at Factual: Real time and Batch Uses by Molly O'...
Big Data Day LA 2015 - HBase at Factual: Real time and Batch Uses by Molly O'...Data Con LA
 
Big Data Day LA 2015 - NoSQL: Doing it wrong before getting it right by Lawre...
Big Data Day LA 2015 - NoSQL: Doing it wrong before getting it right by Lawre...Big Data Day LA 2015 - NoSQL: Doing it wrong before getting it right by Lawre...
Big Data Day LA 2015 - NoSQL: Doing it wrong before getting it right by Lawre...Data Con LA
 
Big Data Day LA 2015 - Solr Search with Spark for Big Data Analytics in Actio...
Big Data Day LA 2015 - Solr Search with Spark for Big Data Analytics in Actio...Big Data Day LA 2015 - Solr Search with Spark for Big Data Analytics in Actio...
Big Data Day LA 2015 - Solr Search with Spark for Big Data Analytics in Actio...Data Con LA
 
La big datacamp2014_vikram_dixit
La big datacamp2014_vikram_dixitLa big datacamp2014_vikram_dixit
La big datacamp2014_vikram_dixitData Con LA
 
Ag big datacampla-06-14-2014-ajay_gopal
Ag big datacampla-06-14-2014-ajay_gopalAg big datacampla-06-14-2014-ajay_gopal
Ag big datacampla-06-14-2014-ajay_gopalData Con LA
 
Big datacamp june14_alex_liu
Big datacamp june14_alex_liuBig datacamp june14_alex_liu
Big datacamp june14_alex_liuData Con LA
 
Kiji cassandra la june 2014 - v02 clint-kelly
Kiji cassandra la   june 2014 - v02 clint-kellyKiji cassandra la   june 2014 - v02 clint-kelly
Kiji cassandra la june 2014 - v02 clint-kellyData Con LA
 
140614 bigdatacamp-la-keynote-jon hsieh
140614 bigdatacamp-la-keynote-jon hsieh140614 bigdatacamp-la-keynote-jon hsieh
140614 bigdatacamp-la-keynote-jon hsiehData Con LA
 
2014 bigdatacamp asya_kamsky
2014 bigdatacamp asya_kamsky2014 bigdatacamp asya_kamsky
2014 bigdatacamp asya_kamskyData Con LA
 
Aziksa hadoop for buisness users2 santosh jha
Aziksa hadoop for buisness users2 santosh jhaAziksa hadoop for buisness users2 santosh jha
Aziksa hadoop for buisness users2 santosh jhaData Con LA
 
Summit v4 dave wolcott
Summit v4 dave wolcottSummit v4 dave wolcott
Summit v4 dave wolcottData Con LA
 
Yarn cloudera-kathleenting061414 kate-ting
Yarn cloudera-kathleenting061414 kate-tingYarn cloudera-kathleenting061414 kate-ting
Yarn cloudera-kathleenting061414 kate-tingData Con LA
 
Hadoop and NoSQL joining forces by Dale Kim of MapR
Hadoop and NoSQL joining forces by Dale Kim of MapRHadoop and NoSQL joining forces by Dale Kim of MapR
Hadoop and NoSQL joining forces by Dale Kim of MapRData Con LA
 
Big Data Day LA 2015 - Lessons Learned from Designing Data Ingest Systems by ...
Big Data Day LA 2015 - Lessons Learned from Designing Data Ingest Systems by ...Big Data Day LA 2015 - Lessons Learned from Designing Data Ingest Systems by ...
Big Data Day LA 2015 - Lessons Learned from Designing Data Ingest Systems by ...Data Con LA
 
Hadoop Innovation Summit 2014
Hadoop Innovation Summit 2014Hadoop Innovation Summit 2014
Hadoop Innovation Summit 2014Data Con LA
 
Big Data Day LA 2015 - Introducing N1QL: SQL for Documents by Jeff Morris of ...
Big Data Day LA 2015 - Introducing N1QL: SQL for Documents by Jeff Morris of ...Big Data Day LA 2015 - Introducing N1QL: SQL for Documents by Jeff Morris of ...
Big Data Day LA 2015 - Introducing N1QL: SQL for Documents by Jeff Morris of ...Data Con LA
 
Big Data Day LA 2015 - Deep Learning Human Vocalized Animal Sounds by Sabri S...
Big Data Day LA 2015 - Deep Learning Human Vocalized Animal Sounds by Sabri S...Big Data Day LA 2015 - Deep Learning Human Vocalized Animal Sounds by Sabri S...
Big Data Day LA 2015 - Deep Learning Human Vocalized Animal Sounds by Sabri S...Data Con LA
 
Big Data Day LA 2016/ Data Science Track - Decision Making and Lambda Archite...
Big Data Day LA 2016/ Data Science Track - Decision Making and Lambda Archite...Big Data Day LA 2016/ Data Science Track - Decision Making and Lambda Archite...
Big Data Day LA 2016/ Data Science Track - Decision Making and Lambda Archite...Data Con LA
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Introduction to Kafka - Je...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Introduction to Kafka - Je...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Introduction to Kafka - Je...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Introduction to Kafka - Je...Data Con LA
 
Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...
Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...
Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...Data Con LA
 

En vedette (20)

Big Data Day LA 2015 - HBase at Factual: Real time and Batch Uses by Molly O'...
Big Data Day LA 2015 - HBase at Factual: Real time and Batch Uses by Molly O'...Big Data Day LA 2015 - HBase at Factual: Real time and Batch Uses by Molly O'...
Big Data Day LA 2015 - HBase at Factual: Real time and Batch Uses by Molly O'...
 
Big Data Day LA 2015 - NoSQL: Doing it wrong before getting it right by Lawre...
Big Data Day LA 2015 - NoSQL: Doing it wrong before getting it right by Lawre...Big Data Day LA 2015 - NoSQL: Doing it wrong before getting it right by Lawre...
Big Data Day LA 2015 - NoSQL: Doing it wrong before getting it right by Lawre...
 
Big Data Day LA 2015 - Solr Search with Spark for Big Data Analytics in Actio...
20140614 introduction to spark-ben white

  • 1. CONFIDENTIAL - RESTRICTED. Introduction to Spark. Ben White, Systems Engineer, Cloudera
  • 2. But first… how did we get here?
  • 3. What does Hadoop look like? [Diagram: a row of worker nodes, each running an HDFS worker ("DN") and an MR worker ("TT"); a master node running the HDFS master ("NN") and the MR master ("JT"); and a standby master.]
  • 4. But I want MORE! [Diagram: the same cluster, with each worker node labelled only as an HDFS worker and "MapReduce" shown as a single label; the master node runs the HDFS master ("NN") and the MR master ("JT"), next to a standby master.]
  • 5. Hadoop as an Architecture
      The Old Way: $30,000+ per TB. Expensive & Unattainable
        •  Hard to scale
        •  Network is a bottleneck
        •  Only handles relational data
        •  Difficult to add new fields & data types
        Expensive, special-purpose, "reliable" servers; expensive licensed software.
        [Diagram: Data Storage (SAN, NAS) connected over the Network to Compute (RDBMS, EDW)]
      The Hadoop Way: $300-$1,000 per TB. Affordable & Attainable
        •  Scales out forever
        •  No bottlenecks
        •  Easy to ingest any data
        •  Agile data access
        Commodity "unreliable" servers; hybrid open source software.
        [Diagram: servers that each combine Compute (CPU), Memory, and Storage (Disk)]
  • 6. CDH: the App Store for Hadoop [Diagram labels: Integration; Storage; Resource Management; Metadata; NoSQL DBMS; …; Analytic MPP DBMS; Search Engine; In-Memory; Batch Processing; System Management; Data Management; Support; Security; Machine Learning; MapReduce]
  • 7. Introduction to Apache Spark
      Credits:
        •  Todd Lipcon
        •  Ted Malaska
        •  Jairam Ranganathan
        •  Jayant Shekhar
        •  Sandy Ryza
  • 8. Can we improve on MR?
      Problems with MR:
        •  Very low-level: requires a lot of code to do simple things
        •  Very constrained: everything must be described as "map" and "reduce". Powerful, but sometimes difficult to think in these terms.
  • 9. Can we improve on MR?
      Two approaches to improve on MapReduce:
      1. Special-purpose systems to solve one problem domain well.
         •  Giraph / GraphLab (graph processing)
         •  Storm (stream processing)
      2. Generalize the capabilities of MapReduce to provide a richer foundation to solve problems.
         •  Tez, MPI, Hama/Pregel (BSP), Dryad (arbitrary DAGs)
      Both are viable strategies depending on the problem!
  • 10. What is Apache Spark?
      Spark is a general-purpose computational framework.
      Retains the advantages of MapReduce:
        •  Linear scalability
        •  Fault tolerance
        •  Data-locality-based computations
      …but offers so much more:
        •  Leverages distributed memory for better performance
        •  Supports iterative algorithms that are not feasible in MR
        •  Improved developer experience
        •  Full directed graph expressions for data-parallel computations
        •  Comes with libraries for machine learning, graph analysis, etc.
  • 11. Getting started with Spark
        •  Java API
        •  Interactive shells:
           •  Scala (spark-shell)
           •  Python (pyspark)
  • 12. Execution modes
        •  Standalone Mode
           •  Dedicated master and worker daemons
           [Callout: dedicated Spark runtime with static resource limits]
        •  YARN Client Mode
           •  Launches a YARN application with the driver program running locally
        •  YARN Cluster Mode
           •  Launches a YARN application with the driver program running in the YARN ApplicationMaster
           [Callout for the YARN modes: dynamic resource management between Spark, MR, Impala…]
  • 14. Parallelized Collections
      scala> val data = 1 to 5
      data: Range.Inclusive = Range(1, 2, 3, 4, 5)
      scala> val distData = sc.parallelize(data)
      distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0]
      Now I can apply parallel operations to this array:
      scala> distData.reduce(_ + _)
      [… Adding task set 0.0 with 56 tasks …]
      res0: Int = 15
      What just happened?!
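To make "what just happened" concrete, here is a minimal single-process sketch in plain Python of the idea behind `parallelize` followed by `reduce`: the collection is split into partitions, each partition is reduced on its own (one task per partition, which is why the log line mentions a task count), and the partial results are combined. The helper names `parallelize` and `distributed_reduce` below are hypothetical illustrations, not Spark's API, and the sketch assumes the reduce function is associative.

```python
from functools import reduce
from operator import add


def parallelize(data, num_partitions):
    """Split a local collection into contiguous, roughly equal partitions."""
    items = list(data)
    size, extra = divmod(len(items), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `extra` partitions take one additional element.
        end = start + size + (1 if i < extra else 0)
        partitions.append(items[start:end])
        start = end
    return partitions


def distributed_reduce(partitions, func):
    """Reduce each non-empty partition (one task each), then combine the partials."""
    partials = [reduce(func, part) for part in partitions if part]
    return reduce(func, partials)


dist_data = parallelize(range(1, 6), num_partitions=4)
total = distributed_reduce(dist_data, add)
print(dist_data)  # four partitions holding the values 1..5
print(total)      # same result as summing the original range
```

With more partitions than elements (for example 56 partitions for 5 values), most partitions are empty and contribute nothing, but the result is unchanged.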
  • 15. RDD – Resilient Distributed Dataset
        •  Collections of objects partitioned across a cluster
        •  Stored in RAM or on disk
        •  You can control persistence and partitioning
        •  Created by:
           •  Distributing local collection objects
           •  Transformation of data in storage
           •  Transformation of RDDs
        •  Automatically rebuilt on failure (resilient)
        •  Contains lineage to compute from storage
        •  Lazy materialization
  • 17. Operations on RDDs
      Transformations lazily transform an RDD into a new RDD:
        •  map
        •  flatMap
        •  filter
        •  sample
        •  join
        •  sort
        •  reduceByKey
        •  …
      Actions run a computation to return a value:
        •  collect
        •  reduce(func)
        •  foreach(func)
        •  count
        •  first, take(n)
        •  saveAs
        •  …
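The split between lazy transformations and eager actions can be sketched with a toy class in plain Python. `ToyRDD` is a hypothetical, single-process stand-in written for this illustration; it is not Spark's implementation and covers only `map`, `filter`, `collect`, and `count`. Transformations only record a step, and nothing runs until an action replays the recorded steps over the source.

```python
class ToyRDD:
    """A toy, single-process stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(self, source, lineage=()):
        self._source = source            # zero-argument callable producing the input records
        self._lineage = tuple(lineage)   # recorded (kind, func) steps, not yet executed

    # Transformations: return a new ToyRDD with one more recorded step; no work is done.
    def map(self, func):
        return ToyRDD(self._source, self._lineage + (("map", func),))

    def filter(self, func):
        return ToyRDD(self._source, self._lineage + (("filter", func),))

    def _compute(self):
        """Replay the recorded steps over the source as a chain of iterators."""
        records = iter(self._source())
        for kind, func in self._lineage:
            records = map(func, records) if kind == "map" else filter(func, records)
        return records

    # Actions: trigger the computation and return a value.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())


calls = []  # records every invocation of `double`, to show when work happens


def double(x):
    calls.append(x)
    return x * 2


rdd = ToyRDD(lambda: range(1, 6)).map(double).filter(lambda x: x > 4)
calls_before = len(calls)   # still zero: the transformations above did no work
result = rdd.collect()      # the action runs map and filter over the source
print(calls_before, result, len(calls))
```

Because this toy class keeps no cache, every action recomputes from the source; that is the behaviour that explicit persistence avoids in a real RDD.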
  • 18. Fault Tolerance
        •  RDDs contain lineage.
        •  Lineage: source location and list of transformations
        •  Lost partitions can be re-computed from source data
      msgs = textFile.filter(lambda s: s.startswith("ERROR"))
                     .map(lambda s: s.split("\t")[2])
      [Diagram: HDFS File → filter (func = startswith(…)) → Filtered RDD → map (func = split(...)) → Mapped RDD]
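Lineage-based recovery can be illustrated without a cluster. In the plain-Python sketch below, the sample log lines, `SOURCE_BLOCKS`, `LINEAGE`, and `compute_partition` are all hypothetical names and data invented for the example; the point is only that a lost partition is rebuilt by replaying the recorded filter and map steps over its source block, while surviving partitions are left alone.

```python
# Hypothetical stand-in for an HDFS file: two blocks of tab-separated log lines.
SOURCE_BLOCKS = [
    ["ERROR\t10:01\tdisk full", "INFO\t10:02\tstarted"],
    ["ERROR\t10:03\ttimeout", "WARN\t10:04\tslow"],
]

# Lineage: the ordered transformations from the slide, applied per partition.
LINEAGE = [
    lambda lines: [s for s in lines if s.startswith("ERROR")],  # filter
    lambda lines: [s.split("\t")[2] for s in lines],            # map
]


def compute_partition(index):
    """Rebuild one partition by replaying the lineage over its source block."""
    records = SOURCE_BLOCKS[index]
    for step in LINEAGE:
        records = step(records)
    return records


# Materialize every partition, as if held in memory on different nodes.
cached = {i: compute_partition(i) for i in range(len(SOURCE_BLOCKS))}

# Simulate losing the node that held partition 1.
del cached[1]

# Recompute only the missing partition; partition 0 is reused as-is.
recovered = {
    i: cached[i] if i in cached else compute_partition(i)
    for i in range(len(SOURCE_BLOCKS))
}
print(recovered)
```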
  • 20. Word Count in MapReduce
      package org.myorg;

      import java.io.IOException;
      import java.util.*;

      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.conf.*;
      import org.apache.hadoop.io.*;
      import org.apache.hadoop.mapreduce.*;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
      import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

      public class WordCount {

        // Mapper: emit (word, 1) for every token in the input line.
        public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
          private final static IntWritable one = new IntWritable(1);
          private Text word = new Text();

          public void map(LongWritable key, Text value, Context context)
              throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
              word.set(tokenizer.nextToken());
              context.write(word, one);
            }
          }
        }

        // Reducer: sum the counts emitted for each word.
        public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
          public void reduce(Text key, Iterable<IntWritable> values, Context context)
              throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
              sum += val.get();
            }
            context.write(key, new IntWritable(sum));
          }
        }

        // Driver: configure the job; args[0] is the input path, args[1] the output path.
        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();

          Job job = new Job(conf, "wordcount");

          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);

          job.setMapperClass(Map.class);
          job.setReducerClass(Reduce.class);

          job.setInputFormatClass(TextInputFormat.class);
          job.setOutputFormatClass(TextOutputFormat.class);

          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));

          job.waitForCompletion(true);
        }
      }
  • 21. Word Count in Spark
      sc.textFile("words")
        .flatMap(line => line.split(" "))
        .map(word => (word, 1))
        .reduceByKey(_ + _)
        .collect()
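For readers without a Spark shell at hand, the same three steps can be traced in plain Python. This is only an analogy on a single machine: `word_count` is a hypothetical function, and the comments mark which line plays the role of `flatMap`, `map`, and `reduceByKey`.

```python
from collections import defaultdict
from itertools import chain


def word_count(lines):
    """Single-machine analogy of the Spark pipeline (hypothetical helper)."""
    # flatMap: one line becomes many words, flattened into a single stream
    words = chain.from_iterable(line.split(" ") for line in lines)
    # map: pair every word with the count 1
    pairs = ((word, 1) for word in words)
    # reduceByKey(_ + _): add up the counts that share a key
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


counts = word_count(["to be or", "not to be"])
print(counts)
```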
  • 22. Logistic Regression
      •  Read two sets of points
      •  Look for a plane W that separates them
      •  Perform gradient descent:
          •  Start with random W
          •  On each iteration, sum a function of W over the data
          •  Move W in a direction that improves it
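The gradient-descent loop described above can be sketched in plain Python. The function name `train`, the learning rate, the iteration count, and the sample points are all assumptions chosen for illustration; the summed term is the standard gradient of the logistic loss for labels in {-1, +1}. In Spark the inner sum over the data is the part that would run in parallel across partitions.

```python
import math
import random


def train(points, iterations=200, learning_rate=0.05, seed=0):
    """points: list of (features, label) pairs with label in {-1, +1}."""
    rng = random.Random(seed)
    dim = len(points[0][0])
    w = [rng.uniform(-1, 1) for _ in range(dim)]  # start with random W
    for _ in range(iterations):
        gradient = [0.0] * dim
        for x, y in points:  # sum a function of W over the data
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            scale = (1.0 / (1.0 + math.exp(-margin)) - 1.0) * y
            for j in range(dim):
                gradient[j] += scale * x[j]
        # move W in the direction that reduces the loss
        w = [wi - learning_rate * gj for wi, gj in zip(w, gradient)]
    return w


# Made-up sample data: two separable clusters; the last feature is a constant bias term.
points = [
    ((1.0, 2.0, 1.0), 1), ((1.5, 1.8, 1.0), 1), ((2.0, 2.2, 1.0), 1),
    ((-1.0, -2.0, 1.0), -1), ((-1.5, -1.6, 1.0), -1), ((-2.0, -1.9, 1.0), -1),
]
w = train(points)
correct = all(y * sum(wi * xi for wi, xi in zip(w, x)) > 0 for x, y in points)
print(w, correct)
```

Because every iteration rereads the same points, keeping them in memory between iterations is where an in-memory engine has an advantage over rereading from disk each pass.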
  • 26. Spark and Hadoop: a Framework within a Framework
  • 27. [Platform stack diagram: Integration; Storage; Resource Management; Metadata; engines (HBase, Impala, Solr, Spark, MapReduce, …); System Management; Data Management; Support; Security]
  • 30. [Platform stack diagram: Integration; Storage; Resource Management; Metadata; engines (HBase, Impala, Solr, Spark, MapReduce, …); System Management; Data Management; Support; Security]
  • 31. Spark Streaming
      •  Takes the concept of RDDs and extends it to DStreams
          •  Fault-tolerant like RDDs
          •  Transformable like RDDs
      •  Adds new "rolling window" operations
          •  Rolling averages, etc.
      •  But keeps everything else!
          •  Regular Spark code works in Spark Streaming
          •  Can still access HDFS data, etc.
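A rolling-window operation over micro-batches can be illustrated in plain Python. This is a sketch of the concept only, not the Spark Streaming API: `rolling_averages` is a hypothetical function, each inner list stands for one micro-batch, and the window covers the most recent `window_size` batches.

```python
from collections import deque


def rolling_averages(batches, window_size):
    """Yield the average of all values in the last `window_size` micro-batches."""
    window = deque(maxlen=window_size)  # old batches fall out automatically
    for batch in batches:
        window.append(batch)
        values = [v for b in window for v in b]
        yield sum(values) / len(values) if values else None


batches = [[1, 2, 3], [4, 5], [6], [10, 20]]  # made-up sample stream
averages = list(rolling_averages(batches, window_size=2))
print(averages)
```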
  • 32. Micro-batching for on-the-fly ETL
  • 33. Fault recovery: How fast can the system recover?
  • 34. Fault Recovery
      •  RDDs store dependency graph
      •  Because RDDs are deterministic:
          •  Missing RDDs are rebuilt in parallel on other nodes
      •  Stateful RDDs can have infinite lineage
      •  Periodic checkpoints to disk clear lineage
      •  Faster recovery times
      •  Better handling of stragglers vs row-by-row streaming
  • 36. Why Spark?
      •  Flexible like MapReduce
      •  High performance
          •  Machine learning, iterative algorithms
          •  Interactive data explorations
      •  Concise, easy API for developer productivity
  • 38. Spark
      http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/spark.html
      http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM5/latest/Cloudera-Manager-Installation-Guide/cm5ig_install_spark.html
  • 39. A Brief History
      Timeline spanning 2002 to 2014; milestones shown:
      •  Doug Cutting launches Nutch project
      •  Google releases GFS paper
      •  Google releases MapReduce paper
      •  MapReduce implemented in Nutch
      •  Nutch adds distributed file system
      •  Hadoop spun out of Nutch project
      •  Hadoop breaks Terasort world record
      •  Cloudera founded
      •  CDH and CDH2 released
      •  CDH3 released
      •  CDH4 released, adding HA
      •  Impala (SQL on Hadoop) launched
      •  Sentry and Search launched
      •  CDH5
      •  Cloudera Manager released
      •  HBase, Zookeeper, Flume and more added to CDH
  • 40. What is Apache Hadoop?
      •  An open-source implementation of Google's GFS and MapReduce papers
      •  An Apache Software Foundation top-level project
      •  Good at storing and processing all kinds of data
      •  Reliable storage at terabyte/petabyte-scale on unreliable (cheap) hardware
      •  A distributed system for counting words :)
  • 41. What is Apache Hadoop?
      Apache Hadoop is an open source platform for data storage and processing that is…
      ✓  Scalable
      ✓  Fault tolerant
      ✓  Distributed
      CORE HADOOP SYSTEM COMPONENTS
      •  Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage
      •  MapReduce: distributed computing framework
      Has the Flexibility to Store and Mine Any Type of Data
      •  Ask questions across structured and unstructured data that were previously impossible to ask or solve
      •  Not bound by a single schema
      Excels at Processing Complex Data
      •  Scale-out architecture divides workloads across multiple nodes
      •  Flexible file system eliminates ETL bottlenecks
      Scales Economically
      •  Can be deployed on industry standard hardware
      •  Open source platform guards against vendor lock-in