Hadoop and Spark

Shravan (Sean) Pabba

1
About Me

•  Diverse roles/languages and platforms.
•  Middleware space in recent years.
•  Worked for IBM/Grid Dynamics/GigaSpaces.
•  Working as a Systems Engineer for Cloudera since last July.
•  Work with and educate clients/prospects.

2
  
Agenda

•  Introduction to Spark
  –  MapReduce Review
  –  Why Spark
  –  Architecture (Stand-alone AND Cloudera)
•  Concepts
•  Examples/Use Cases
•  Spark Streaming
•  Shark
  –  Shark vs. Impala
•  Demo

3
  
Have you done any of these?

•  Programming languages (Java/Python/Scala)
•  Written multi-threaded or distributed programs
•  Numerical programming/statistical computing (R, MATLAB)
•  Hadoop

4
  
INTRODUCTION TO SPARK

5
  
A brief review of MapReduce

Map  Map  Map  Map  Map  Map  Map  Map  Map  Map  Map  Map

Reduce  Reduce  Reduce  Reduce

Key advances by MapReduce:

•  Data locality: automatic split computation and launch of mappers appropriately
•  Fault tolerance: writing intermediate results plus restartable mappers means the ability to run on commodity hardware
•  Linear scalability: a combination of locality + a programming model that forces developers to write generally scalable solutions to problems

6
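The map and reduce phases reviewed above can be sketched in plain Python — a toy, in-process simulation (no Hadoop, no distribution, no disk I/O; all names here are illustrative, not Hadoop APIs):

```python
from collections import defaultdict

def map_phase(records):
    """Mapper: emit a (word, 1) pair for every word, word-count style."""
    for line in records:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    """Group intermediate (key, value) pairs by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["error warn error", "warn info"]
counts = reduce_phase(shuffle(map_phase(lines)))
```

The framework's real value is running many mappers and reducers in parallel across machines; the data flow, however, is exactly this map → shuffle → reduce pipeline.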
  
MapReduce is sufficient for many classes of problems

MapReduce

Hive  Pig  Mahout  Crunch  Solr

A bit like Haiku:

•  Limited expressivity
•  But can be used to approach diverse problem domains

7
  
BUT… Can we do better?

Areas ripe for improvement:

•  Launching mappers/reducers takes time
•  Having to write to disk (replicated) between each step
•  Reading data back from disk in the next step
•  Each Map/Reduce step has to go back into the queue and get its resources
•  Not in memory
•  Cannot iterate fast

8
  
What is Spark?

Spark is a general-purpose computational framework with more flexibility than MapReduce. It is an implementation of a 2010 Berkeley paper [1].

Key properties:
•  Leverages distributed memory
•  Full directed-graph expressions for data-parallel computations
•  Improved developer experience

Yet retains: linear scalability, fault tolerance, and data-locality-based computation

1 - http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf

9
  
Spark: Easy and Fast Big Data

•  Easy to develop
  –  Rich APIs in Java, Scala, Python
  –  Interactive shell
•  Fast to run
  –  General execution graphs
  –  In-memory storage

2-5× less code; up to 10× faster on disk, 100× in memory

10
  
Easy: Get Started Immediately

•  Multi-language support
•  Interactive shell

Python

lines = sc.textFile(...)
lines.filter(lambda s: "ERROR" in s).count()

Scala

val lines = sc.textFile(...)
lines.filter(s => s.contains("ERROR")).count()

Java

JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) {
    return s.contains("ERROR");
  }
}).count();

11
  
Spark Ecosystem

http://www.databricks.com/spark/#sparkhadoop

12
  
Spring Framework

http://docs.spring.io/spring/docs/1.2.9/reference/introduction.html

13
  
Spark in Cloudera EDH

CLOUDERA'S ENTERPRISE DATA HUB

3RD PARTY APPS

BATCH PROCESSING: MAPREDUCE, SPARK
ANALYTIC SQL: IMPALA
SEARCH ENGINE: SOLR
MACHINE LEARNING: SPARK
STREAM PROCESSING: SPARK STREAMING

WORKLOAD MANAGEMENT: YARN

STORAGE FOR ANY TYPE OF DATA — UNIFIED, ELASTIC, RESILIENT, SECURE
FILESYSTEM: HDFS
ONLINE NOSQL: HBASE

DATA MANAGEMENT: CLOUDERA NAVIGATOR
SYSTEM MANAGEMENT: CLOUDERA MANAGER
SECURITY: SENTRY

14
  
Adoption

•  Supporting:
  –  Databricks
•  Contributing:
  –  UC Berkeley, Databricks, Yahoo, etc.
•  Well-known use cases:
  –  Conviva, Quantifind, Bizo

15
  
CONCEPTS

16
  
Spark Concepts - Overview

•  Driver & Workers
•  RDD - Resilient Distributed Dataset
•  Transformations
•  Actions
•  Caching

17
  
Driver and Workers

Driver

Worker (Data / RAM)    Worker (Data / RAM)    Worker (Data / RAM)

18
  
RDD - Resilient Distributed Dataset

•  Read-only partitioned collection of records
•  Created through:
  –  Transformation of data in storage
  –  Transformation of RDDs
•  Contains lineage to compute from storage
•  Lazy materialization
•  Users control persistence and partitioning

19
  
Operations

Transformations
•  Map
•  Filter
•  Sample
•  Join

Actions
•  Reduce
•  Count
•  First, Take
•  SaveAs

20
  
Operations

•  Transformations create a new RDD from an existing one
•  Actions run a computation on an RDD and return a value
•  Transformations are lazy.
•  Actions materialize RDDs by computing transformations.
•  RDDs can be cached to avoid re-computing.

21
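The lazy-transformation / eager-action split above can be sketched in a few lines of plain Python. `LazyData` is a hypothetical toy class, not Spark's RDD API — transformations only record work; the action runs it:

```python
class LazyData:
    def __init__(self, source, steps=()):
        self.source = source   # where the data comes from
        self.steps = steps     # recorded transformations, not yet executed

    # Transformations: record the function, return a new dataset, compute nothing.
    def map(self, fn):
        return LazyData(self.source, self.steps + (("map", fn),))

    def filter(self, fn):
        return LazyData(self.source, self.steps + (("filter", fn),))

    # Action: only now is the recorded pipeline actually run.
    def count(self):
        data = list(self.source)
        for kind, fn in self.steps:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return len(data)

ds = LazyData(["ERROR a", "INFO b", "ERROR c"])
errors = ds.filter(lambda s: s.startswith("ERROR"))  # nothing computed yet
n = errors.count()                                   # pipeline runs here
```

Because nothing runs until an action, the engine sees the whole pipeline at once and can plan (and cache) accordingly.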
  
Fault Tolerance

•  RDDs contain lineage.
•  Lineage - source location and list of transformations
•  Lost partitions can be re-computed from source data

msgs = textFile.filter(lambda s: s.startsWith("ERROR"))
               .map(lambda s: s.split("\t")[2])

HDFS File --[filter (func = startsWith(…))]--> Filtered RDD --[map (func = split(...))]--> Mapped RDD

22
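Lineage-based recovery can be sketched in plain Python: a partition is nothing more than a slice of source data plus the recorded transformations, so a lost partition can simply be re-derived. The names below are illustrative, not Spark internals:

```python
def build_partition(source_slice, lineage):
    """Re-derive a partition by replaying recorded transformations on source data."""
    data = list(source_slice)
    for fn in lineage:
        data = [fn(x) for x in data]
    return data

source = ["ERROR\ta\tdisk", "ERROR\tb\tnet"]      # one partition's slice of the file
lineage = [lambda s: s.split("\t")[2]]            # the map() recorded on the RDD

partition = build_partition(source, lineage)      # first computation
partition = None                                  # simulate losing the partition
partition = build_partition(source, lineage)      # rebuilt from lineage, no replication
```

This is why Spark can keep data in memory without replicating it: the recipe, not the result, is the durable thing.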
  
Caching

•  persist() and cache() mark data
•  An RDD is cached after the first action
•  Fault tolerant - lost partitions will re-compute
•  If there is not enough memory, some partitions will not be cached
•  Future actions are performed on cached partitions
•  So they are much faster

Use caching for iterative algorithms

23
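The "cached after the first action" behavior can be sketched with a hypothetical `CachedData` wrapper (toy code, not the Spark API) — the first action materializes, later actions reuse:

```python
class CachedData:
    def __init__(self, compute):
        self.compute = compute   # how to (re)build the data if needed
        self.cached = None       # filled in by the first action

    def materialize(self):
        if self.cached is None:          # first action: compute and keep in memory
            self.cached = self.compute()
        return self.cached               # subsequent actions hit the cache

    def count(self):
        return len(self.materialize())

calls = []
msgs = CachedData(lambda: calls.append(1) or ["foo bar", "foo baz"])
msgs.count()   # computes and caches
msgs.count()   # served from cache; the expensive compute ran only once
```

For an iterative algorithm that touches the same dataset on every pass, this is the difference between reading from disk N times and reading once.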
  
Caching - Storage Levels

•  MEMORY_ONLY
•  MEMORY_AND_DISK
•  MEMORY_ONLY_SER
•  MEMORY_AND_DISK_SER
•  DISK_ONLY
•  MEMORY_ONLY_2, MEMORY_AND_DISK_2…

24
  
SPARK EXAMPLES

25
  
Easy: Example - Word Count

•  Hadoop MapReduce

public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

•  Spark

val spark = new SparkContext(master, appName, [sparkHome], [jars])
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

26
  
  
Spark Word Count in Java

JavaSparkContext sc = new JavaSparkContext(...);
JavaRDD<String> lines = sc.textFile("hdfs://...");

JavaRDD<String> words = lines.flatMap(
  new FlatMapFunction<String, String>() {
    public Iterable<String> call(String s) {
      return Arrays.asList(s.split(" "));
    }
  }
);

JavaPairRDD<String, Integer> ones = words.mapToPair(
  new PairFunction<String, String, Integer>() {
    public Tuple2<String, Integer> call(String s) {
      return new Tuple2<String, Integer>(s, 1);
    }
  }
);

JavaPairRDD<String, Integer> counts = ones.reduceByKey(
  new Function2<Integer, Integer, Integer>() {
    public Integer call(Integer i1, Integer i2) {
      return i1 + i2;
    }
  }
);

The same program with Java 8 lambda expressions [1]:

JavaRDD<String> lines = sc.textFile("hdfs://log.txt");

JavaRDD<String> words =
  lines.flatMap(line -> Arrays.asList(line.split(" ")));

JavaPairRDD<String, Integer> ones =
  words.mapToPair(w -> new Tuple2<String, Integer>(w, 1));

JavaPairRDD<String, Integer> counts =
  ones.reduceByKey((x, y) -> x + y);

1 - http://docs.oracle.com/javase/tutorial/java/javaOO/lambdaexpressions.html

28
  
Log Mining

•  Load error messages from a log into memory
•  Interactively search for patterns

29
  
Log Mining

lines = sparkContext.textFile("hdfs://…")
errors = lines.filter(_.startsWith("ERROR"))        // Base RDD -> Transformed RDD
messages = errors.map(_.split('\t')(2))

cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count          // Action
cachedMsgs.filter(_.contains("bar")).count
…

30
  
Logistic Regression

•  Read two sets of points
•  Look for a plane W that separates them
•  Perform gradient descent:
  –  Start with a random W
  –  On each iteration, sum a function of W over the data
  –  Move W in a direction that improves it

31
  
Intuition

32
  
Logistic Regression

val points = spark.textFile(…).map(parsePoint).cache()

val w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  val gradient = points.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
println("Final separating plane: " + w)

33
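The gradient-descent loop on this slide can be written out in plain Python for a one-dimensional "plane" to see the mechanics without Spark. The data and learning rate below are toy values, not from the talk; the per-point gradient term matches the slide's formula:

```python
import math

# (x, label) pairs with labels in {-1, +1}; illustrative toy data.
points = [(-2.0, -1), (-1.0, -1), (1.0, 1), (2.0, 1)]
w = 0.5

for _ in range(100):
    # Sum the logistic-loss gradient (1/(1+exp(-y*w*x)) - 1) * y * x over all
    # points -- the same function the Spark map/reduce computes -- then move w.
    gradient = sum((1.0 / (1.0 + math.exp(-y * w * x)) - 1.0) * y * x
                   for x, y in points)
    w -= 0.1 * gradient
```

In Spark, `points` is a cached RDD, the per-point term is the `map`, and the sum is the `reduce` — caching matters because the same dataset is scanned every iteration.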
  
Conviva Use Case [1]

•  Monitor online video consumption
•  Analyze trends

Need to run tens of queries like this a day:

SELECT videoName, COUNT(1)
FROM summaries
WHERE date='2011_12_12' AND customer='XYZ'
GROUP BY videoName;

1 - http://www.conviva.com/using-spark-and-hive-to-process-bigdata-at-conviva/

34
  
Conviva	
  With	
  Spark	
  
val	
  sessions	
  =	
  sparkContext.sequenceFile[SessionSummary,NullWritable]
(pathToSessionSummaryOnHdfs)	
  
	
  
val	
  cachedSessions	
  =	
  sessions.filter(whereCondiLonToFilterSessions).cache	
  
	
  
val	
  mapFn	
  :	
  SessionSummary	
  =>	
  (String,	
  Long)	
  =	
  {	
  s	
  =>	
  (s.videoName,	
  1)	
  }	
  
val	
  reduceFn	
  :	
  (Long,	
  Long)	
  =>	
  Long	
  =	
  {	
  (a,b)	
  =>	
  a+b	
  }	
  
	
  
val	
  results	
  =	
  
cachedSessions.map(mapFn).reduceByKey(reduceFn).collectAsMap	
  
	
  
35	
  
SPARK STREAMING

36
  
Large-Scale Stream Processing

Requires:
•  Fault tolerance - for crashes and stragglers
•  Efficiency

•  Row-by-row (continuous-operator) systems do not handle straggler nodes
•  Batch processing provides fault tolerance efficiently: the job is divided into deterministic tasks

37
  
Key Question

•  How fast can the system recover?

38
  
Spark Streaming

http://spark.apache.org/docs/latest/streaming-programming-guide.html

39
  
Spark Streaming

–  Run continuous processing of data using Spark's core API.
–  Extends the Spark concept of RDDs to DStreams (Discretized Streams), which are fault-tolerant, transformable streams. Users can re-use existing code for batch/offline processing.
–  Adds "rolling window" operations, e.g. compute rolling averages or counts for data over the last five minutes.
–  Example use cases:
  •  "On-the-fly" ETL as data is ingested into Hadoop/HDFS.
  •  Detecting anomalous behavior and triggering alerts.
  •  Continuous reporting of summary metrics for incoming data.

40
  
"Micro-batch" Architecture

val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")

tweets DStream:     batch @ t     batch @ t+1     batch @ t+2
                    | flatMap     | flatMap       | flatMap
hashTags DStream:   | save        | save          | save

The stream is composed of small (1-10s) batch computations.

41
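The micro-batch idea on this slide can be sketched in plain Python: treat the stream as a sequence of small batches and run the same batch logic on each one unchanged. Toy data and names — `get_tags` here is a stand-in for the slide's `getTags`:

```python
from collections import Counter

def get_tags(status):
    """Batch-style logic: pull hashtags out of one tweet's text."""
    return [w for w in status.split() if w.startswith("#")]

stream = [                      # each element = one 1-10s micro-batch of tweets
    ["fun with #spark", "#spark streaming"],
    ["#hadoop and #spark"],
]

per_batch_counts = []
for batch in stream:            # the streaming engine feeds one batch at a time
    tags = [t for status in batch for t in get_tags(status)]
    per_batch_counts.append(Counter(tags))   # same code a batch job would run
```

Because each batch is an ordinary deterministic computation, a lost or straggling batch can simply be recomputed — the recovery story is the same as for batch RDDs.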
  
SHARK

42
  
Shark Architecture

•  Identical to Hive:
  –  Same CLI, JDBC, SQL parser, metastore
•  Replaced the optimizer, plan generator and the execution engine
•  Added a cache manager
•  Generates Spark code instead of MapReduce

43
  
Hive Compatibility

•  MetaStore
•  HQL
•  UDF / UDAF
•  SerDes
•  Scripts

44
  
Shark vs. Impala

•  Shark inherits Hive's limitations, while Impala is purpose-built for SQL.
•  Impala is significantly faster per our tests.
•  Shark lacks security, audit/lineage, support for high concurrency, and operational tooling for config/monitoring/reporting/debugging.
•  Interactive SQL is needed for connecting BI tools; Shark is not certified by any BI vendor.

45
  
DEMO

46

SUMMARY

47
  
Why Spark?

•  Flexible like MapReduce
•  High performance
•  Machine learning, iterative algorithms
•  Interactive data exploration
•  Developer productivity

48
  
How Spark Works

•  RDDs - resilient distributed data
•  Lazy transformations
•  Caching
•  Fault tolerance by storing lineage
•  Streams - micro-batches of RDDs
•  Shark - Hive + Spark

49

Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Scala and spark
Scala and sparkScala and spark
Scala and spark
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
Spark: The State of the Art Engine for Big Data Processing
Spark: The State of the Art Engine for Big Data ProcessingSpark: The State of the Art Engine for Big Data Processing
Spark: The State of the Art Engine for Big Data Processing
 
Apache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.pptApache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.ppt
 
Introduction to Spark - Phoenix Meetup 08-19-2014
Introduction to Spark - Phoenix Meetup 08-19-2014Introduction to Spark - Phoenix Meetup 08-19-2014
Introduction to Spark - Phoenix Meetup 08-19-2014
 

Dernier

Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 

Dernier (20)

Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 

Hadoop and Spark

  • 8. BUT… Can we do better? Areas ripe for improvement:
    – Launching mappers/reducers takes time
    – Having to write to disk (replicated) between each step
    – Reading data back from disk in the next step
    – Each Map/Reduce step has to go back into the queue and get its resources
    – Not in memory
    – Cannot iterate fast
  • 9. What is Spark?
    Spark is a general-purpose computational framework with more flexibility than MapReduce.
    It is an implementation of a 2010 Berkeley paper [1].
    Key properties:
    – Leverages distributed memory
    – Full directed-graph expressions for data-parallel computations
    – Improved developer experience
    Yet retains: linear scalability, fault tolerance, and data-locality-based computations.
    1 - http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf
  • 10. Spark: Easy and Fast Big Data
    – Easy to develop: rich APIs in Java, Scala, Python; interactive shell
    – Fast to run: general execution graphs; in-memory storage
    2-5× less code; up to 10× faster on disk, 100× in memory.
  • 11. Easy: Get Started Immediately
    – Multi-language support
    – Interactive shell
    Python:
      lines = sc.textFile(...)
      lines.filter(lambda s: "ERROR" in s).count()
    Scala:
      val lines = sc.textFile(...)
      lines.filter(s => s.contains("ERROR")).count()
    Java:
      JavaRDD<String> lines = sc.textFile(...);
      lines.filter(new Function<String, Boolean>() {
        Boolean call(String s) { return s.contains("ERROR"); }
      }).count();
  • 14. Spark in Cloudera EDH
    (Diagram: Cloudera's Enterprise Data Hub – Spark sits alongside MapReduce for batch
    processing, Impala for analytic SQL, Solr for search, Spark for machine learning, and
    Spark Streaming for stream processing; all run under YARN workload management on the
    HDFS filesystem and the HBase online NoSQL store, with Cloudera Navigator for data
    management, Cloudera Manager for system management, and Sentry for security – storage
    for any type of data, unified, elastic, resilient, secure, plus 3rd-party apps.)
  • 15. Adoption
    – Supporting: Databricks
    – Contributing: UC Berkeley, Databricks, Yahoo, etc.
    – Well-known use cases: Conviva, Quantifind, Bizo
  • 17. Spark Concepts – Overview
    – Driver & Workers
    – RDD – Resilient Distributed Dataset
    – Transformations
    – Actions
    – Caching
  • 18. Driver and Workers
    (Diagram: a Driver program coordinates three Workers, each holding Data and RAM.)
  • 19. RDD – Resilient Distributed Dataset
    – Read-only partitioned collection of records
    – Created through transformation of data in storage, or transformation of other RDDs
    – Contains lineage to compute from storage
    – Lazy materialization
    – Users control persistence and partitioning
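The lazy-materialization idea can be sketched in a few lines of plain Python. This is a hypothetical MiniRDD, not the real Spark API: transformations only record lineage, and an action replays that lineage against the source.

```python
# Hypothetical mini-RDD: transformations are recorded lazily as lineage;
# only an action (collect/count) walks the lineage and computes results.
class MiniRDD:
    def __init__(self, source, lineage=()):
        self.source = source          # original data (stand-in for storage)
        self.lineage = lineage        # ordered tuple of recorded transformations

    def map(self, f):                 # transformation: new RDD, no work done yet
        return MiniRDD(self.source, self.lineage + (("map", f),))

    def filter(self, f):              # transformation: also lazy
        return MiniRDD(self.source, self.lineage + (("filter", f),))

    def collect(self):                # action: replay the lineage over the source
        data = list(self.source)
        for kind, f in self.lineage:
            data = [f(x) for x in data] if kind == "map" else [x for x in data if f(x)]
        return data

    def count(self):                  # action built on collect
        return len(self.collect())

rdd = MiniRDD(["ERROR a", "INFO b", "ERROR c"])
errors = rdd.filter(lambda s: s.startswith("ERROR"))   # nothing computed yet
print(errors.count())                                  # lineage replayed here -> 2
```

Because the lineage is kept rather than the computed data, a lost partition can always be rebuilt by replaying the same transformations, which is exactly the fault-tolerance story of the next slides.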
  • 20. Operations
    Transformations: map, filter, sample, join
    Actions: reduce, count, first, take, saveAs
  • 21. Operations
    – Transformations create a new RDD from an existing one
    – Actions run a computation on an RDD and return a value
    – Transformations are lazy
    – Actions materialize RDDs by computing their transformations
    – RDDs can be cached to avoid re-computing
  • 22. Fault Tolerance
    – RDDs contain lineage.
    – Lineage – source location and list of transformations
    – Lost partitions can be re-computed from source data
      msgs = textFile.filter(lambda s: s.startswith("ERROR"))
                     .map(lambda s: s.split("\t")[2])
    (Diagram: HDFS File → filter(func = startsWith(…)) → Filtered RDD
     → map(func = split(...)) → Mapped RDD)
  • 23. Caching
    – persist() and cache() mark data for caching
    – An RDD is cached after the first action
    – Fault tolerant – lost partitions will be re-computed
    – If there is not enough memory, some partitions will not be cached
    – Future actions are performed on the cached partitions, so they are much faster
    Use caching for iterative algorithms.
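A minimal sketch, in plain Python with a hypothetical CachedSeq class, of what cache() buys you: the first action computes and stores the data, and later actions reuse it instead of recomputing.

```python
# Hypothetical sketch of cache(): the first action materializes the data and
# later actions reuse the stored result instead of recomputing it.
class CachedSeq:
    def __init__(self, compute):
        self.compute = compute        # function that (re)computes the data
        self.calls = 0                # how many times we actually computed
        self._cached = None
        self._use_cache = False

    def cache(self):                  # mark for caching (like rdd.cache())
        self._use_cache = True
        return self

    def materialize(self):            # stand-in for running an action
        if self._use_cache and self._cached is not None:
            return self._cached       # fast path: reuse the cached result
        self.calls += 1
        data = self.compute()
        if self._use_cache:
            self._cached = data       # store after the first action
        return data

expensive = CachedSeq(lambda: [x * x for x in range(5)]).cache()
expensive.materialize()               # first action: computes and caches
expensive.materialize()               # second action: served from cache
print(expensive.calls)                # -> 1 (computed only once)
```

This is why caching pays off for iterative algorithms: every iteration after the first reads the cached data rather than re-running the whole lineage.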
  • 24. Caching – Storage Levels
    – MEMORY_ONLY
    – MEMORY_AND_DISK
    – MEMORY_ONLY_SER
    – MEMORY_AND_DISK_SER
    – DISK_ONLY
    – MEMORY_ONLY_2, MEMORY_AND_DISK_2…
  • 26. Easy: Example – Word Count
    Hadoop MapReduce:
      public static class WordCountMapClass extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
          String line = value.toString();
          StringTokenizer itr = new StringTokenizer(line);
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            output.collect(word, one);
          }
        }
      }
      public static class WordCountReduce extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
          int sum = 0;
          while (values.hasNext()) {
            sum += values.next().get();
          }
          output.collect(key, new IntWritable(sum));
        }
      }
    Spark:
      val spark = new SparkContext(master, appName, [sparkHome], [jars])
      val file = spark.textFile("hdfs://...")
      val counts = file.flatMap(line => line.split(" "))
                       .map(word => (word, 1))
                       .reduceByKey(_ + _)
      counts.saveAsTextFile("hdfs://...")
  • 28. Spark Word Count in Java
    Java 7:
      JavaSparkContext sc = new JavaSparkContext(...);
      JavaRDD<String> lines = sc.textFile("hdfs://...");
      JavaRDD<String> words = lines.flatMap(
        new FlatMapFunction<String, String>() {
          public Iterable<String> call(String s) {
            return Arrays.asList(s.split(" "));
          }
        }
      );
      JavaPairRDD<String, Integer> ones = words.map(
        new PairFunction<String, String, Integer>() {
          public Tuple2<String, Integer> call(String s) {
            return new Tuple2(s, 1);
          }
        }
      );
      JavaPairRDD<String, Integer> counts = ones.reduceByKey(
        new Function2<Integer, Integer, Integer>() {
          public Integer call(Integer i1, Integer i2) {
            return i1 + i2;
          }
        }
      );
    Java 8 lambda expressions [1]:
      JavaRDD<String> lines = sc.textFile("hdfs://log.txt");
      JavaRDD<String> words =
        lines.flatMap(line -> Arrays.asList(line.split(" ")));
      JavaPairRDD<String, Integer> ones =
        words.mapToPair(w -> new Tuple2<String, Integer>(w, 1));
      JavaPairRDD<String, Integer> counts =
        ones.reduceByKey((x, y) -> x + y);
    1 - http://docs.oracle.com/javase/tutorial/java/javaOO/lambdaexpressions.html
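The same flatMap → map → reduceByKey pipeline can be re-expressed in plain Python (no cluster; the input lines are hypothetical inline data), which makes the shape of the Spark versions above easy to follow:

```python
# Word count as a plain-Python pipeline mirroring the Spark stages:
# flatMap (split lines into words) -> map (pair each word with 1)
# -> reduceByKey (sum the 1s per word).
from collections import Counter

lines = ["to be or not to be", "to do"]          # stand-in for the input file

words = [w for line in lines for w in line.split(" ")]   # flatMap
pairs = [(w, 1) for w in words]                          # map to (word, 1)

counts = Counter()                                       # reduceByKey(_ + _)
for word, one in pairs:
    counts[word] += one

print(counts["to"])   # -> 3
```

In Spark the reduceByKey step additionally shuffles pairs so that all counts for a given word land on the same partition; the local Counter plays that role here.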
  • 29. Log Mining
    – Load error messages from a log into memory
    – Interactively search for patterns
  • 30. Log Mining
    lines = sparkContext.textFile("hdfs://...")       // base RDD
    errors = lines.filter(_.startsWith("ERROR"))      // transformed RDD
    messages = errors.map(_.split('\t')(2))
    cachedMsgs = messages.cache()
    cachedMsgs.filter(_.contains("foo")).count        // action
    cachedMsgs.filter(_.contains("bar")).count
    …
  • 31. Logistic Regression
    – Read two sets of points
    – Look for a plane w that separates them
    – Perform gradient descent:
      – Start with a random w
      – On each iteration, sum a function of w over the data
      – Move w in a direction that improves it
  • 33. Logistic Regression
    val points = spark.textFile(...).map(parsePoint).cache()
    var w = Vector.random(D)
    for (i <- 1 to ITERATIONS) {
      val gradient = points.map(p =>
        (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
      ).reduce(_ + _)
      w -= gradient
    }
    println("Final separating plane: " + w)
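A self-contained, pure-Python version of the loop above on a tiny hypothetical data set (the Vector/RDD machinery replaced by plain lists), following the same per-point gradient formula:

```python
# Toy logistic-regression gradient descent: 2-D points with labels +1/-1,
# gradient summed over all points each iteration, then w moved against it.
import math
import random

random.seed(0)
points = [((1.0, 2.0), 1.0), ((2.0, 3.0), 1.0),        # label +1
          ((-1.0, -2.0), -1.0), ((-2.0, -1.0), -1.0)]  # label -1
w = [random.random(), random.random()]                  # random starting plane

for _ in range(100):
    grad = [0.0, 0.0]
    for (x, y) in points:
        dot = w[0] * x[0] + w[1] * x[1]                 # w dot p.x
        # same per-point term as the slide: (1/(1+exp(-y*(w.x))) - 1) * y * x
        scale = (1.0 / (1.0 + math.exp(-y * dot)) - 1.0) * y
        grad[0] += scale * x[0]
        grad[1] += scale * x[1]
    w = [w[0] - grad[0], w[1] - grad[1]]                # w -= gradient

# after training, sign(w dot x) should match each point's label
print(all((w[0] * x[0] + w[1] * x[1] > 0) == (y > 0) for (x, y) in points))
```

Spark's advantage over this single-machine loop is that points.cache() keeps the data set in distributed memory, so each of the hundred passes is a fast in-memory map/reduce rather than a re-read from disk.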
  • 34. Conviva Use Case [1]
    – Monitor online video consumption
    – Analyze trends
    Need to run tens of queries like this a day:
      SELECT videoName, COUNT(1)
      FROM summaries
      WHERE date='2011_12_12' AND customer='XYZ'
      GROUP BY videoName;
    1 - http://www.conviva.com/using-spark-and-hive-to-process-bigdata-at-conviva/
  • 35. Conviva With Spark
    val sessions = sparkContext.sequenceFile[SessionSummary, NullWritable](pathToSessionSummaryOnHdfs)
    val cachedSessions = sessions.filter(whereConditionToFilterSessions).cache
    val mapFn : SessionSummary => (String, Long) = { s => (s.videoName, 1) }
    val reduceFn : (Long, Long) => Long = { (a, b) => a + b }
    val results = cachedSessions.map(mapFn).reduceByKey(reduceFn).collectAsMap
  • 37. Large-Scale Stream Processing
    Requires:
    – Fault tolerance – for crashes and stragglers
    – Efficiency
    Row-by-row (continuous operator) systems do not handle straggler nodes.
    Batch processing provides fault tolerance efficiently: the job is divided into
    deterministic tasks.
  • 38. Key Question
    – How fast can the system recover?
  • 40. Spark Streaming
    – Run continuous processing of data using Spark's core API.
    – Extends Spark's concept of RDDs to DStreams (Discretized Streams), which are
      fault-tolerant, transformable streams. Users can re-use existing code for
      batch/offline processing.
    – Adds "rolling window" operations, e.g. compute rolling averages or counts
      for the data of the last five minutes.
    – Example use cases:
      – "On-the-fly" ETL as data is ingested into Hadoop/HDFS.
      – Detecting anomalous behavior and triggering alerts.
      – Continuous reporting of summary metrics for incoming data.
  • 41. "Micro-batch" Architecture
    val tweets = ssc.twitterStream()
    val hashTags = tweets.flatMap(status => getTags(status))
    hashTags.saveAsHadoopFiles("hdfs://...")
    (Diagram: the tweets DStream is a stream composed of small (1-10s) batch
    computations; each batch @ t, t+1, t+2 is flatMapped into the hashTags
    DStream and saved.)
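The micro-batch idea can be sketched in plain Python (hypothetical data and a stand-in for getTags): chop the stream into small batches and run an ordinary batch computation on each, folding the results into a running state.

```python
# Micro-batch sketch: each "tick" delivers a small batch of statuses; each
# batch is processed like a tiny RDD (flatMap to hashtags), and the result
# is folded into a running count, echoing the hashtag example above.
from collections import Counter

def get_tags(status):                       # stand-in for getTags(status)
    return [w for w in status.split() if w.startswith("#")]

stream = [                                  # three micro-batches: t, t+1, t+2
    ["loving #spark today", "#spark #hadoop"],
    ["more #spark"],
    ["#hadoop again"],
]

running = Counter()
for batch in stream:                        # one small batch computation per tick
    tags = [t for status in batch for t in get_tags(status)]   # flatMap
    running.update(tags)                    # the "save"/fold step

print(running["#spark"], running["#hadoop"])   # -> 3 2
```

Because each batch is a deterministic computation over recorded input, a failed or slow batch can simply be re-run, which is how the micro-batch design gets fault tolerance and straggler handling from the batch model.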
  • 43. Shark Architecture
    – Identical to Hive: same CLI, JDBC, SQL parser, metastore
    – Replaced the optimizer, plan generator and the execution engine
    – Added a cache manager
    – Generates Spark code instead of MapReduce
  • 44. Hive Compatibility
    – MetaStore
    – HQL
    – UDF / UDAF
    – SerDes
    – Scripts
  • 45. Shark Vs Impala
    – Shark inherits Hive's limitations, while Impala is purpose-built for SQL.
    – Impala is significantly faster per our tests.
    – Shark lacks security, audit/lineage, support for high concurrency, and
      operational tooling for config/monitoring/reporting/debugging.
    – Interactive SQL is needed for connecting BI tools; Shark is not certified
      by any BI vendor.
  • 48. Why Spark?
    – Flexible like MapReduce
    – High performance
    – Machine learning, iterative algorithms
    – Interactive data exploration
    – Developer productivity
  • 49. How Spark Works
    – RDDs – resilient distributed datasets
    – Lazy transformations
    – Caching
    – Fault tolerance by storing lineage
    – Streams – micro-batches of RDDs
    – Shark – Hive + Spark

Editor's notes

  1. * MapReduce struggles with performance optimization for individual systems because of its design. * Google has used both techniques in-house quite a bit, and the future will contain both.
  2. Spark's storage levels are meant to provide different tradeoffs between memory usage and CPU efficiency. We recommend going through the following process to select one:
    – If your RDDs fit comfortably with the default storage level (MEMORY_ONLY), leave them that way. This is the most CPU-efficient option, allowing operations on the RDDs to run as fast as possible.
    – If not, try using MEMORY_ONLY_SER and selecting a fast serialization library to make the objects much more space-efficient, but still reasonably fast to access.
    – Don't spill to disk unless the functions that computed your datasets are expensive, or they filter a large amount of the data. Otherwise, recomputing a partition is about as fast as reading it from disk.
    – Use the replicated storage levels if you want fast fault recovery (e.g. if using Spark to serve requests from a web application). All the storage levels provide full fault tolerance by recomputing lost data, but the replicated ones let you continue running tasks on the RDD without waiting to recompute a lost partition.
    – If you want to define your own storage level (say, with a replication factor of 3 instead of 2), use the factory method apply() of the StorageLevel singleton object.