SlideShare une entreprise Scribd logo
1  sur  45
CONFIDENTIAL - RESTRICTED
Introduction to Spark
SDBigData Meetup #6
January 14th 2015
Maxime Dumas
Systems Engineer, Cloudera
Thirty Seconds About Max
• Systems Engineer
• aka Sales Engineer
• SoCal, AZ, NV
• former coder of PHP
• teaches meditation + yoga
• avid cyclist
• from Montreal, Canada
2
What Does Cloudera Do?
• product
• distribution of Hadoop components, Apache licensed
• enterprise tooling
• support
• training
• services (aka consulting)
• community
3
4
Quick and dirty, for context.
The Apache Hadoop Ecosystem
©2014 Cloudera, Inc. All rights
reserved.
• Scalability
• Simply scales just by adding nodes
• Local processing to avoid network bottlenecks
• Efficiency
• Cost efficiency (<$1k/TB) on commodity hardware
• Unified storage, metadata, security (no duplication or
synchronization)
• Flexibility
• All kinds of data (blobs, documents, records, etc)
• In all forms (structured, semi-structured, unstructured)
• Store anything then later analyze what you need
Why Hadoop?
Why “Ecosystem?”
• In the beginning, just Hadoop
• HDFS
• MapReduce
• Today, dozens of interrelated components
• I/O
• Processing
• Specialty Applications
• Configuration
• Workflow
6
HDFS
• Distributed, highly fault-tolerant filesystem
• Optimized for large streaming access to data
• Based on Google File System
• http://research.google.com/archive/gfs.html
7
Lots of Commodity Machines
8
Image:Yahoo! Hadoop cluster [ OSCON ’07 ]
MapReduce (MR)
• Programming paradigm
• Batch oriented, not realtime
• Works well with distributed computing
• Lots of Java, but other languages supported
• Based on Google’s paper
• http://research.google.com/archive/mapreduce.html
9
Apache Hive
• Abstraction of Hadoop’s Java API
• HiveQL “compiles” down to MR
• a “SQL-like” language
• Eases analysis using MapReduce
10
CDH: the App Store for Hadoop
11
Integration
Storage
Resource Management
Metadata
NoSQL
DBMS
…
Analytic
MPP
DBMS
Search
Engine
In-
Memory
Batch
Processing
System
Management
Data
Management
Support
Security
Machine
Learning
MapReduce
12
Introduction to Apache Spark
Credits:
• Ben White
• Todd Lipcon
• Ted Malaska
• Jairam Ranganathan
• Jayant Shekhar
• Sandy Ryza
Can we improve on MR?
• Problems with MR:
• Very low-level: requires a lot of code to do simple
things
• Very constrained: everything must be described as
“map” and “reduce”. Powerful but sometimes
difficult to think in these terms.
13
Can we improve on MR?
• Two approaches to improve on MapReduce:
1. Special purpose systems to solve one problem domain
well.
• Giraph / Graphlab (graph processing)
• Storm (stream processing)
• Impala (real-time SQL)
2. Generalize the capabilities of MapReduce to
provide a richer foundation to solve problems.
• Tez, MPI, Hama/Pregel (BSP), Dryad (arbitrary DAGs)
Both are viable strategies depending on the problem!
14
What is Apache Spark?
Spark is a general purpose computational framework
Retains the advantages of MapReduce:
• Linear scalability
• Fault-tolerance
• Data Locality based computations
…but offers so much more:
• Leverages distributed memory for better performance
• Supports iterative algorithms that are not feasible in MR
• Improved developer experience
• Full Directed Graph expressions for data parallel computations
• Comes with libraries for machine learning, graph analysis, etc.
15
What is Apache Spark?
Run programs up to 100x faster than Hadoop
MapReduce in memory, or 10x faster on disk.
One of the largest open source projects in big data:
• 170+ developers contributing
• 30+ companies contributing
• 400+ discussions per month on the mailing list
16
Popular project
17
Getting started with Spark
• Java API
• Interactive shells:
• Scala (spark-shell)
• Python (pyspark)
18
Execution modes
19
Execution modes
• Standalone Mode
• Dedicated master and worker daemons
• YARN Client Mode
• Launches a YARN application with the
driver program running locally
• YARN Cluster Mode
• Launches a YARN application with the
driver program running in the YARN
ApplicationMaster
20
Dynamic resource
management
between Spark,
MR, Impala…
Dedicated Spark
runtime with static
resource limits
Spark Concepts
21
RDD – Resilient Distributed Dataset
• Collections of objects partitioned across a cluster
• Stored in RAM or on Disk
• You can control persistence and partitioning
• Created by:
• Distributing local collection objects
• Transformation of data in storage
• Transformation of RDDs
• Automatically rebuilt on failure (resilient)
• Contains lineage to compute from storage
• Lazy materialization
22
RDD transformations
23
Operations on RDDs
Transformations lazily transform a RDD
to a new RDD
• map
• flatMap
• filter
• sample
• join
• sort
• reduceByKey
• …
Actions run computation to return a
value
• collect
• reduce(func)
• foreach(func)
• count
• first, take(n)
• saveAs
• …
24
Fault Tolerance
• RDDs contain lineage.
• Lineage – source location and list of transformations
• Lost partitions can be re-computed from source data
25
msgs = textFile.filter(lambda s: s.startsWith(“ERROR”))
.map(lambda s: s.split(“t”)[2])
HDFS File Filtered RDD Mapped RDD
filter
(func = startsWith(…))
map
(func = split(...))
26
Examples
Word Count in MapReduce
27
package org.myorg;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class WordCount {
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value, Context context) throws IOException,
InterruptedException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
context.write(word, one);
}
}
}
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
context.write(key, new IntWritable(sum));
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
}
}
Word Count in Spark
sc.textFile(“words”)
.flatMap(line => line.split(" "))
.map(word=>(word,1))
.reduceByKey(_+_).collect()
28
Logistic Regression
• Read two sets of points
• Looks for a plane W that separates them
• Perform gradient descent:
• Start with random W
• On each iteration, sum a function of W over the data
• Move W in a direction that improves it
29
Intuition
30
Logistic Regression
31
Logistic Regression Performance
32
33
Spark and Hadoop:
a Framework within a Framework
34
35
Integration
Storage
Resource Management
Metadata
HBase …Impala Solr Spark
Map
Reduce
System
Management
Data
Management
Support
Security
Spark Streaming
• Takes the concept of RDDs and extends it to DStreams
• Fault-tolerant like RDDs
• Transformable like RDDs
• Adds new “rolling window” operations
• Rolling averages, etc.
• But keeps everything else!
• Regular Spark code works in Spark Streaming
• Can still access HDFS data, etc.
• Example use cases:
• “On-the-fly” ETL as data is ingested into Hadoop/HDFS.
• Detecting anomalous behavior and triggering alerts.
• Continuous reporting of summary metrics for incoming data.
36
Micro-batching for on the fly ETL
37
What about SQL?
38
http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html
http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
Fault Recovery Recap
• RDDs store dependency graph
• Because RDDs are deterministic:
Missing RDDs are rebuilt in parallel on other nodes
• Stateful RDDs can have infinite lineage
• Periodic checkpoints to disk clears lineage
• Faster recovery times
• Better handling of stragglers vs row-by-row streaming
39
Why Spark?
• Flexible like MapReduce
• High performance
• Machine learning,
iterative algorithms
• Interactive data
explorations
• Concise, easy API for
developer productivity
40
41
Demo Time!
• Log file Analysis
• Machine Learning
• Spark Streaming
What’s Next?
• Download Hadoop!
• CDH available at www.cloudera.com
• Try it online: Cloudera Live
• Cloudera provides pre-loaded VMs
• http://tiny.cloudera.com/quickstartvm
42
43
Preferably related to the talk… or not.
Questions?
44
Thank You!
Maxime Dumas
mdumas@cloudera.com
We’re hiring.
45

Contenu connexe

Tendances

Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsAnton Kirillov
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to sparkDuyhai Doan
 
Spark overview
Spark overviewSpark overview
Spark overviewLisa Hua
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveSachin Aggarwal
 
DTCC '14 Spark Runtime Internals
DTCC '14 Spark Runtime InternalsDTCC '14 Spark Runtime Internals
DTCC '14 Spark Runtime InternalsCheng Lian
 
Introduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemIntroduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemBojan Babic
 
BDM25 - Spark runtime internal
BDM25 - Spark runtime internalBDM25 - Spark runtime internal
BDM25 - Spark runtime internalDavid Lauzon
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overviewDataArt
 
What is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkWhat is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkAndy Petrella
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Xuan-Chao Huang
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapePaco Nathan
 
20140614 introduction to spark-ben white
20140614 introduction to spark-ben white20140614 introduction to spark-ben white
20140614 introduction to spark-ben whiteData Con LA
 
Ten tools for ten big data areas 03_Apache Spark
Ten tools for ten big data areas 03_Apache SparkTen tools for ten big data areas 03_Apache Spark
Ten tools for ten big data areas 03_Apache SparkWill Du
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDsDean Chen
 
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...DataStax
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkDatio Big Data
 

Tendances (20)

Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
Hadoop and Spark
Hadoop and SparkHadoop and Spark
Hadoop and Spark
 
Map reduce vs spark
Map reduce vs sparkMap reduce vs spark
Map reduce vs spark
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
 
Spark overview
Spark overviewSpark overview
Spark overview
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
 
DTCC '14 Spark Runtime Internals
DTCC '14 Spark Runtime InternalsDTCC '14 Spark Runtime Internals
DTCC '14 Spark Runtime Internals
 
Introduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemIntroduction to Apache Spark Ecosystem
Introduction to Apache Spark Ecosystem
 
BDM25 - Spark runtime internal
BDM25 - Spark runtime internalBDM25 - Spark runtime internal
BDM25 - Spark runtime internal
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
 
What is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkWhat is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache Spark
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscape
 
20140614 introduction to spark-ben white
20140614 introduction to spark-ben white20140614 introduction to spark-ben white
20140614 introduction to spark-ben white
 
Ten tools for ten big data areas 03_Apache Spark
Ten tools for ten big data areas 03_Apache SparkTen tools for ten big data areas 03_Apache Spark
Ten tools for ten big data areas 03_Apache Spark
 
Scala and spark
Scala and sparkScala and spark
Scala and spark
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDs
 
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 

Similaire à Apache Spark - San Diego Big Data Meetup Jan 14th 2015

Introduction to Spark - Phoenix Meetup 08-19-2014
Introduction to Spark - Phoenix Meetup 08-19-2014Introduction to Spark - Phoenix Meetup 08-19-2014
Introduction to Spark - Phoenix Meetup 08-19-2014cdmaxime
 
Why Functional Programming Is Important in Big Data Era
Why Functional Programming Is Important in Big Data EraWhy Functional Programming Is Important in Big Data Era
Why Functional Programming Is Important in Big Data EraHandaru Sakti
 
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"IT Event
 
Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)Steve Min
 
Dive into spark2
Dive into spark2Dive into spark2
Dive into spark2Gal Marder
 
Apache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.pptApache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.pptbhargavi804095
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Djamel Zouaoui
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the SurfaceJosi Aranda
 
Cleveland Hadoop Users Group - Spark
Cleveland Hadoop Users Group - SparkCleveland Hadoop Users Group - Spark
Cleveland Hadoop Users Group - SparkVince Gonzalez
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache SparkAmir Sedighi
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introductionHektor Jacynycz García
 
Apache Spark II (SparkSQL)
Apache Spark II (SparkSQL)Apache Spark II (SparkSQL)
Apache Spark II (SparkSQL)Datio Big Data
 
Sumedh Wale's presentation
Sumedh Wale's presentationSumedh Wale's presentation
Sumedh Wale's presentationpunesparkmeetup
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2Fabio Fumarola
 
Spark: The State of the Art Engine for Big Data Processing
Spark: The State of the Art Engine for Big Data ProcessingSpark: The State of the Art Engine for Big Data Processing
Spark: The State of the Art Engine for Big Data ProcessingRamaninder Singh Jhajj
 
Big Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoBig Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoMark Kromer
 
Review of Calculation Paradigm and its Components
Review of Calculation Paradigm and its ComponentsReview of Calculation Paradigm and its Components
Review of Calculation Paradigm and its ComponentsNamuk Park
 

Similaire à Apache Spark - San Diego Big Data Meetup Jan 14th 2015 (20)

Introduction to Spark - Phoenix Meetup 08-19-2014
Introduction to Spark - Phoenix Meetup 08-19-2014Introduction to Spark - Phoenix Meetup 08-19-2014
Introduction to Spark - Phoenix Meetup 08-19-2014
 
Why Functional Programming Is Important in Big Data Era
Why Functional Programming Is Important in Big Data EraWhy Functional Programming Is Important in Big Data Era
Why Functional Programming Is Important in Big Data Era
 
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
 
Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)
 
Dive into spark2
Dive into spark2Dive into spark2
Dive into spark2
 
Apache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.pptApache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.ppt
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the Surface
 
Cleveland Hadoop Users Group - Spark
Cleveland Hadoop Users Group - SparkCleveland Hadoop Users Group - Spark
Cleveland Hadoop Users Group - Spark
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introduction
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
 
Apache Spark II (SparkSQL)
Apache Spark II (SparkSQL)Apache Spark II (SparkSQL)
Apache Spark II (SparkSQL)
 
Sumedh Wale's presentation
Sumedh Wale's presentationSumedh Wale's presentation
Sumedh Wale's presentation
 
Shark
SharkShark
Shark
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2
 
Spark: The State of the Art Engine for Big Data Processing
Spark: The State of the Art Engine for Big Data ProcessingSpark: The State of the Art Engine for Big Data Processing
Spark: The State of the Art Engine for Big Data Processing
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
Big Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoBig Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with Pentaho
 
Review of Calculation Paradigm and its Components
Review of Calculation Paradigm and its ComponentsReview of Calculation Paradigm and its Components
Review of Calculation Paradigm and its Components
 

Dernier

Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
SHRMPro HRMS Software Solutions Presentation
SHRMPro HRMS Software Solutions PresentationSHRMPro HRMS Software Solutions Presentation
SHRMPro HRMS Software Solutions PresentationShrmpro
 
The Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdfThe Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdfayushiqss
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrandmasabamasaba
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park masabamasaba
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension AidPhilip Schwarz
 
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyviewmasabamasaba
 
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdfPayment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdfkalichargn70th171
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrainmasabamasaba
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesVictorSzoltysek
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...harshavardhanraghave
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfonteinmasabamasaba
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Steffen Staab
 
Define the academic and professional writing..pdf
Define the academic and professional writing..pdfDefine the academic and professional writing..pdf
Define the academic and professional writing..pdfPearlKirahMaeRagusta1
 
10 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 202410 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 2024Mind IT Systems
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
Announcing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareAnnouncing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareJim McKeeth
 
%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...masabamasaba
 

Dernier (20)

Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
SHRMPro HRMS Software Solutions Presentation
SHRMPro HRMS Software Solutions PresentationSHRMPro HRMS Software Solutions Presentation
SHRMPro HRMS Software Solutions Presentation
 
The Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdfThe Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdf
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
 
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdfPayment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
Define the academic and professional writing..pdf
Define the academic and professional writing..pdfDefine the academic and professional writing..pdf
Define the academic and professional writing..pdf
 
10 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 202410 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 2024
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
Announcing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareAnnouncing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK Software
 
%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...
 

Apache Spark - San Diego Big Data Meetup Jan 14th 2015

  • 1. CONFIDENTIAL - RESTRICTED Introduction to Spark SDBigData Meetup #6 January 14th 2015 Maxime Dumas Systems Engineer, Cloudera
  • 2. Thirty Seconds About Max • Systems Engineer • aka Sales Engineer • SoCal, AZ, NV • former coder of PHP • teaches meditation + yoga • avid cyclist • from Montreal, Canada 2
  • 3. What Does Cloudera Do? • product • distribution of Hadoop components, Apache licensed • enterprise tooling • support • training • services (aka consulting) • community 3
  • 4. 4 Quick and dirty, for context. The Apache Hadoop Ecosystem
  • 5. ©2014 Cloudera, Inc. All rights reserved. • Scalability • Simply scales just by adding nodes • Local processing to avoid network bottlenecks • Efficiency • Cost efficiency (<$1k/TB) on commodity hardware • Unified storage, metadata, security (no duplication or synchronization) • Flexibility • All kinds of data (blobs, documents, records, etc) • In all forms (structured, semi-structured, unstructured) • Store anything then later analyze what you need Why Hadoop?
  • 6. Why “Ecosystem?” • In the beginning, just Hadoop • HDFS • MapReduce • Today, dozens of interrelated components • I/O • Processing • Specialty Applications • Configuration • Workflow 6
  • 7. HDFS • Distributed, highly fault-tolerant filesystem • Optimized for large streaming access to data • Based on Google File System • http://research.google.com/archive/gfs.html 7
  • 8. Lots of Commodity Machines 8 Image:Yahoo! Hadoop cluster [ OSCON ’07 ]
  • 9. MapReduce (MR) • Programming paradigm • Batch oriented, not realtime • Works well with distributed computing • Lots of Java, but other languages supported • Based on Google’s paper • http://research.google.com/archive/mapreduce.html 9
  • 10. Apache Hive • Abstraction of Hadoop’s Java API • HiveQL “compiles” down to MR • a “SQL-like” language • Eases analysis using MapReduce 10
  • 11. CDH: the App Store for Hadoop 11 Integration Storage Resource Management Metadata NoSQL DBMS … Analytic MPP DBMS Search Engine In- Memory Batch Processing System Management Data Management Support Security Machine Learning MapReduce
  • 12. 12 Introduction to Apache Spark Credits: • Ben White • Todd Lipcon • Ted Malaska • Jairam Ranganathan • Jayant Shekhar • Sandy Ryza
  • 13. Can we improve on MR? • Problems with MR: • Very low-level: requires a lot of code to do simple things • Very constrained: everything must be described as “map” and “reduce”. Powerful but sometimes difficult to think in these terms. 13
  • 14. Can we improve on MR? • Two approaches to improve on MapReduce: 1. Special purpose systems to solve one problem domain well. • Giraph / Graphlab (graph processing) • Storm (stream processing) • Impala (real-time SQL) 2. Generalize the capabilities of MapReduce to provide a richer foundation to solve problems. • Tez, MPI, Hama/Pregel (BSP), Dryad (arbitrary DAGs) Both are viable strategies depending on the problem! 14
  • 15. What is Apache Spark? Spark is a general purpose computational framework Retains the advantages of MapReduce: • Linear scalability • Fault-tolerance • Data Locality based computations …but offers so much more: • Leverages distributed memory for better performance • Supports iterative algorithms that are not feasible in MR • Improved developer experience • Full Directed Graph expressions for data parallel computations • Comes with libraries for machine learning, graph analysis, etc. 15
  • 16. What is Apache Spark? Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. One of the largest open source projects in big data: • 170+ developers contributing • 30+ companies contributing • 400+ discussions per month on the mailing list 16
  • 18. Getting started with Spark • Java API • Interactive shells: • Scala (spark-shell) • Python (pyspark) 18
  • 20. Execution modes • Standalone Mode • Dedicated master and worker daemons • YARN Client Mode • Launches a YARN application with the driver program running locally • YARN Cluster Mode • Launches a YARN application with the driver program running in the YARN ApplicationMaster 20 Dynamic resource management between Spark, MR, Impala… Dedicated Spark runtime with static resource limits
  • 22. RDD – Resilient Distributed Dataset • Collections of objects partitioned across a cluster • Stored in RAM or on Disk • You can control persistence and partitioning • Created by: • Distributing local collection objects • Transformation of data in storage • Transformation of RDDs • Automatically rebuilt on failure (resilient) • Contains lineage to compute from storage • Lazy materialization 22
  • 24. Operations on RDDs Transformations lazily transform a RDD to a new RDD • map • flatMap • filter • sample • join • sort • reduceByKey • … Actions run computation to return a value • collect • reduce(func) • foreach(func) • count • first, take(n) • saveAs • … 24
  • 25. Fault Tolerance • RDDs contain lineage. • Lineage – source location and list of transformations • Lost partitions can be re-computed from source data 25 msgs = textFile.filter(lambda s: s.startsWith(“ERROR”)) .map(lambda s: s.split(“t”)[2]) HDFS File Filtered RDD Mapped RDD filter (func = startsWith(…)) map (func = split(...))
  • 27. Word Count in MapReduce 27 package org.myorg; import java.io.IOException; import java.util.*; import org.apache.hadoop.fs.Path; import org.apache.hadoop.conf.*; import org.apache.hadoop.io.*; import org.apache.hadoop.mapreduce.*; import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; import org.apache.hadoop.mapreduce.lib.input.TextInputFormat; import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat; public class WordCount { public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { String line = value.toString(); StringTokenizer tokenizer = new StringTokenizer(line); while (tokenizer.hasMoreTokens()) { word.set(tokenizer.nextToken()); context.write(word, one); } } } public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> { public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException { int sum = 0; for (IntWritable val : values) { sum += val.get(); } context.write(key, new IntWritable(sum)); } } public static void main(String[] args) throws Exception { Configuration conf = new Configuration(); Job job = new Job(conf, "wordcount"); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); job.setMapperClass(Map.class); job.setReducerClass(Reduce.class); job.setInputFormatClass(TextInputFormat.class); job.setOutputFormatClass(TextOutputFormat.class); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); job.waitForCompletion(true); } }
  • 28. Word Count in Spark sc.textFile(“words”) .flatMap(line => line.split(" ")) .map(word=>(word,1)) .reduceByKey(_+_).collect() 28
  • 29. Logistic Regression • Read two sets of points • Looks for a plane W that separates them • Perform gradient descent: • Start with random W • On each iteration, sum a function of W over the data • Move W in a direction that improves it 29
  • 33. 33 Spark and Hadoop: a Framework within a Framework
  • 34. 34
  • 35. 35 Integration Storage Resource Management Metadata HBase …Impala Solr Spark Map Reduce System Management Data Management Support Security
  • 36. Spark Streaming • Takes the concept of RDDs and extends it to DStreams • Fault-tolerant like RDDs • Transformable like RDDs • Adds new “rolling window” operations • Rolling averages, etc. • But keeps everything else! • Regular Spark code works in Spark Streaming • Can still access HDFS data, etc. • Example use cases: • “On-the-fly” ETL as data is ingested into Hadoop/HDFS. • Detecting anomalous behavior and triggering alerts. • Continuous reporting of summary metrics for incoming data. 36
  • 37. Micro-batching for on the fly ETL 37
  • 39. Fault Recovery Recap • RDDs store dependency graph • Because RDDs are deterministic: Missing RDDs are rebuilt in parallel on other nodes • Stateful RDDs can have infinite lineage • Periodic checkpoints to disk clears lineage • Faster recovery times • Better handling of stragglers vs row-by-row streaming 39
  • 40. Why Spark? • Flexible like MapReduce • High performance • Machine learning, iterative algorithms • Interactive data explorations • Concise, easy API for developer productivity 40
  • 41. 41 Demo Time! • Log file Analysis • Machine Learning • Spark Streaming
  • 42. What’s Next? • Download Hadoop! • CDH available at www.cloudera.com • Try it online: Cloudera Live • Cloudera provides pre-loaded VMs • http://tiny.cloudera.com/quickstartvm 42
  • 43. 43 Preferably related to the talk… or not. Questions?
  • 45. 45

Notes de l'éditeur

  1. Similar to the Red Hat model. Hadoop elephant logo licensed for public use via Apache license: Apache Software Foundation, http://www.apache.org/foundation/marks/
  2. We’re going to breeze through these really quick, just to show how Search plugs in later…
  3. Lose a server, no problem. Lose a rack, no problem.