SlideShare une entreprise Scribd logo
1  sur  41
Télécharger pour lire hors ligne
CONFIDENTIAL - RESTRICTED1 CONFIDENTIAL - RESTRICTED
An introduction to Spark
By Todd Lipcon
With thanks to Ted Malaska, Jairam Ranganathan and Sandy Ryza
CONFIDENTIAL - RESTRICTED2
About the Presenter…
Todd Lipcon
Software Engineer
San Francisco Bay Area | Information Technology and Services
2009-
Current:
Software Engineer at Cloudera
Hadoop Commiter and PMC Member
HBase Committer and PMC Member
Member of the Apache Software Foundation
Past: Software Engineer at Amie Street, Inc.
Co-founder and technical lead at MyArtPlot, LLC
Software Engineering Intern at Google
CONFIDENTIAL - RESTRICTED3
Agenda
• Introduction to Crunch
• Introduction to Spark
• Spark use cases
• Changing mindsets
• Spark in CDH
• Short-term roadmap
CONFIDENTIAL - RESTRICTED4
MR is great… but can we do better?
Problems with MR:
• Very low-level: requires a lot of code to do simple
things
• Very constrained: everything must be described as
“map” and “reduce”. Powerful but sometimes difficult
to think in these terms.
CONFIDENTIAL - RESTRICTED5
MR is great… but can we do better?
Two approaches to improve on MapReduce:
1. Special purpose systems to solve one problem domain well. Ex:
Giraph / Graphlab (graph processing), Storm (stream processing)
2. Generalize the capabilities of MapReduce to provide a richer
foundation to solve problems.
Ex: MPI, Hama/Pregel (BSP), Dryad (arbitrary DAGs)
Both are viable strategies depending on problem!
CONFIDENTIAL - RESTRICTED6
A brief introduction
Apache Crunch
CONFIDENTIAL - RESTRICTED7
What is Crunch?
“The Apache Crunch Java library provides a framework
for writing, testing, and running MapReduce pipelines.
Its goal is to make pipelines that are composed of many
user-defined functions simple to write, easy to test,
and efficient to run.”
CONFIDENTIAL - RESTRICTED8
Compared to other Hadoop tools
(e.g Hive/Pig)
• Developer focused
• Java API, easy to integrate with other Java libraries
• Less boiler plate, more “fluent” API than MR
• Minimal abstractions
• Thin layer on top of MapReduce – equally fast
• Access to MapReduce APIs when you really need them
• Flexible data model
• Able to use POJOs, protocol buffers, Avro records, tuples,
etc.
CONFIDENTIAL - RESTRICTED9
Crunch concepts: Sources and Targets
• Source: a repository to load input data from
• JDBC
• Files on Hadoop (text/CSV, Avro, protobuf, binary, custom
data formats, etc)
• HBase
• Yields a PCollection<T>
• Target: a place to write job output
• Stores a PCollection back onto HDFS or some other storage
CONFIDENTIAL - RESTRICTED10
Crunch concepts: PCollection<T>
• Represents a “parallel collection” of T objects
• .parallelDo(DoFn<T, U> function)
• like a Hadoop Mapper
• Emit 0 to n records for each input record
• Output: PCollection<U>
• .filter(FilterFn<T> filter)
• FilterFn must return true or false for each input record
• Output: PCollection<T>
• .union(PCollection<T>)
• Appends two collections together
CONFIDENTIAL - RESTRICTED11
Crunch concepts: PTable<K, V>
• Sub-interface of PCollection<Pair<K, V>>
• Supports all of the PCollection methods.
• .groupByKey()
• Output: PGroupedTable<K, V>
• Like a PCollection<K, Iterable<V>>
• This is just like the “shuffle” phase of MapReduce!
• JoinStrategy.join(table1, table2, joinType)
• Simple method for SQL-style joins between two tables on their
matching keys
• Supports inner join, outer join, cartesian join, etc.
• API allows user to provide hints on the best implementation
CONFIDENTIAL - RESTRICTED12
Crunch concepts: PGroupedTable<K, V>
• The same as a PCollection<Pair<K, V>>
• Still allows PCollection functions like parallelDo(…)
• Methods for aggregating or reducing:
• combineValues(Aggregator)
• Standard aggregators include summing values, averaging,
sampling, top-k, etc.
CONFIDENTIAL - RESTRICTED13
Crunch WordCount (1)
public class WordCount extends Configured implements Tool {
public static void main(String[] args) throws Exception {
ToolRunner.run(new Configuration(), new WordCount(), args);
}
void run() {
String inputPath = args[0];
String outputPath = args[1];
// Create an object to coordinate pipeline creation and execution
Pipeline pipeline = new MRPipeline(WordCount.class, getConf());
// or... = MemPipeline.getInstance();
1
2
3
4
5
6
7
8
9
10
11
Standard Hadoop “main”
function which parses
configuration, etc
Specifies whether this
pipeline should execute in
MapReduce or locally in-
memory (for testing)
CONFIDENTIAL - RESTRICTED14
Crunch WordCount (2)
// Reference a given text file as a collection of Strings
PCollection<String> lines = pipeline.readTextFile(inputPath);
// Define a function that splits each line in a PCollection of
// Strings into a PCollection made up of the individual words
// in those lines.
PCollection<String> words = lines.parallelDo(new Tokenizer(),
Writables.strings();
PTable<String, Long> counts = words.count();
pipeline.writeTextFile(counts, outputPath);
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
Specifies how to serialize
the data in this collection.
User-defined class…
CONFIDENTIAL - RESTRICTED15
Crunch WordCount (3)
public static class Tokenizer extends DoFn<String, String> {
@Override
public void process(String line, Emitter<String> emitter) {
for (String word : line.split(“ “) {
emitter.emit(word);
}
}
}
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
CONFIDENTIAL - RESTRICTED16
Crunch advantages
• DoFns are like mappers, but many DoFns can be
chained together in a single mapper, or in the reducer
• Crunch planner constructs a pipeline with the minimum
number of MR jobs
• myTable.parallelDo(x).parallelDo(y).groupByKey().aggregate(A
ggregators.sum().parallelDo(z).writeToFile(“/foo”) is only one
MapReduce job!
• Makes it easy to write composable, re-usable, testable
code
• Can run the same pipeline in MapReduce, or in-memory
• In the future, can also run on new frameworks like Spark with
no code change!
CONFIDENTIAL - RESTRICTED17
A brief introduction
Apache Spark
CONFIDENTIAL - RESTRICTED18
What is Spark?
Spark is a general purpose computational framework
Retains the advantages of MapReduce:
• Linear scalability
• Fault-tolerance
• Data Locality based computations
…but offers so much more:
• Leverages distributed memory for better performance
• Improved developer experience
• Full Directed Graph expressions for data parallel computations
• Comes with libraries for machine learning, graph analysis, etc
CONFIDENTIAL - RESTRICTED19
Spark vs Crunch
• Similar API – higher level functional transformations
• Spark is a service of its own: not a wrapper around
MapReduce
• In fact, Crunch can compile down to Spark jobs instead of
MR jobs!
• More “batteries included”
• Machine learning, graph computation, etc
CONFIDENTIAL - RESTRICTED20
Easy: Get Started Immediately
• Multi-language support
• Interactive Shell
Python
lines = sc.textFile(...)
lines.filter(lambda s: “ERROR” in s).count()
Scala
val lines = sc.textFile(...)
lines.filter(s => s.contains(“ERROR”)).count()
Java
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
Boolean call(String s) {
return s.contains(“error”);
}
}).count();
CONFIDENTIAL - RESTRICTED21
Spark Client
(App Master)
Scheduler and
RDD Graph
Spark from a High Level
Worker
Spark Worker
RDD Objects
Task Threads
Block Manager
Rdd1.join(rdd2).
groupBy(…)
.filter(…)
Task Scheduler
Threads
Block Manager
Cluster
Manager
Trackers
MemoryStore
DiskStore
BlockInfo
ShuffleBlockManager
CONFIDENTIAL - RESTRICTED22
Resilient Distributed Datasets
• Resilient Distributed Datasets
• Collections of objects distributed across a cluster,
stored in RAM or on Disk
• Like a Crunch PCollection
• Built through parallel transformations
• Automatically rebuilt on failure (resilient)
CONFIDENTIAL - RESTRICTED23
Resilient Distributed Datasets
• Building a Story
• HadoopRDD
• JdbcRDD
• MappedRDD
• FlatMappedRDD
• FilteredRDD
• OrderedRDD
• PairRDD
• ShuffledRDD
• UnionRDD
• ….
join
filter
groupBy
B:
C:
D: E: F:
G:
Ç√
Ω
map
A:
map
take
CONFIDENTIAL - RESTRICTED24
Easy: Expressive API
• Transformations
• Lazy (like Crunch)
• Return RDDs
• Actions
• Demand action
• Return value or store data on HDFS
CONFIDENTIAL - RESTRICTED25
Easy: Expressive API
• Actions
• Collect
• Count
• First
• Take(n)
• saveAsTextFile
• foreach(func)
• reduce(func)
• …
• Transformations
• Map
• Filter
• flatMap
• Union
• Reduce
• Sort
• Join
• …
CONFIDENTIAL - RESTRICTED26
Easy: Example – Word Count (Java version)
// Create a context: specifies cluster configuration, app name, etc.
JavaSparkContext sc = new JavaSparkContext(...);
JavaRDD<String> lines = sc.textFile(“hdfs://…”);
// Create a new RDD with each word separated
JavaRDD<String> words = lines.flatMap(
new FlatMapFunction<String, String>() {
public Iterable<String> call(String s) {
return Arrays.asList(s.split(“ “));
}
});
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
CONFIDENTIAL - RESTRICTED27
Easy: Example – Word Count (Java version)
// Create an RDD with tuples like (“foo”, 1)
JavaRdd<String> ones = words.map(
new PairFunction<String, String, Integer>() {
public Tuple2<String, Integer> call(String s) {
return new Tuple2(s, 1);
}
});
// Run the “reduce” phase to count occurrences
JavaPairRDD<String, Integer> counts = ones.reduceByKey(
new Function2<Integer, Integer, Integer>() {
public Integer call(Integer i1, Integer i2) {
return i1 + i2;
}
}
);
Counts.saveAsTextFile(“/out
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
CONFIDENTIAL - RESTRICTED28
Easy: Example – Word Count (Scala)
val spark = new SparkContext(master, appName, [sparkHome], [jars])
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
1
2
3
4
5
6
Scala’s type inference and
lambda syntax make this
very concise!
CONFIDENTIAL - RESTRICTED29
Easy: Out of the Box Functionality
• Hadoop Integration
• Works with Hadoop
Data
• Runs With YARN
• Libraries
• MLlib
• Spark Streaming
• GraphX (alpha)
• Roadmap
• Language support:
• Improved Python support
• SparkR
• Java 8
• Better ML
• Sparse Data Support
• Model Evaluation Framework
• Performance Testing
CONFIDENTIAL - RESTRICTED30
Example: Logistic Regression
data = spark.textFile(...).map(parseDataPoint).cache()
w = numpy.random.rand(D)
for i in range(iterations):
gradient = data
.map(lambda p: (1 / (1 + exp(-p.y * w.dot(p.x))))
* p.y * p.x)
.reduce(lambda x, y: x + y)
w -= gradient
print “Final w: %s” % w
Spark will load the parsed
dataset into memory!
CONFIDENTIAL - RESTRICTED31
Memory management leads to greater
performance
• Hadoop cluster with 100
nodes contains 10+TB of
RAM today and will double
next year
• 1 GB RAM ~ $10-$20
Trends:
½ price every 18 months
2x bandwidth every 3 years
64-128GB RAM
16 cores
50 GB per
sec
Memory can be enabler for high performance
big data applications
CONFIDENTIAL - RESTRICTED32
Fast: Using RAM, Operator Graphs
• In-memory Caching
• Data Partitions read from
RAM instead of disk
• Operator Graphs
• Scheduling Optimizations
• Fault Tolerance
= cached partition
= RDD
join
filter
groupBy
B:
B:
C: D: E:
F:
Ç√
Ω
map
A:
map
take
CONFIDENTIAL - RESTRICTED33
Logistic Regression Performance
(data fits in memory)
0
500
1000
1500
2000
2500
3000
3500
4000
1 5 10 20 30
RunningTime(s)
Number of Iterations
Hadoop
Spark
110 s / iteration
first iteration 80 s
further iterations 1 s
CONFIDENTIAL - RESTRICTED34
Spark MLLib
• “Batteries included” library for Machine Learning
• Binary Classification (Support Vector Machines)
• Linear and logistic regression
• Collaborative filtering (recommendation engines)
• Stochastic gradient descent
• Clustering (K-means)
CONFIDENTIAL - RESTRICTED35
MLLib example (Scala)
import org.apache.spark.mllib.clustering.KMeans
// Load and parse the data
val data = sc.textFile("kmeans_data.txt")
val parsedData = data.map( _.split(' ').map(_.toDouble))
// Cluster the data into two classes using KMeans
val numIterations = 20
val numClusters = 2
val clusters = KMeans.train(parsedData, numClusters, numIterations)
// Evaluate clustering by computing Within Set Sum of Squared Errors
val WSSSE = clusters.computeCost(parsedData)
println("Within Set Sum of Squared Errors = " + WSSSE)
1
2
3
4
5
6
CONFIDENTIAL - RESTRICTED36
Spark Streaming
• Run continuous processing of data using Spark’s core API.
• Extends Spark concept of RDD’s to DStreams (Discretized Streams) which
are fault tolerant, transformable streams. Users can re-use existing code for
batch/offline processing.
• Adds “rolling window” operations. E.g. compute rolling averages or counts
for data over last five minutes.
• Example use cases:
• “On-the-fly” ETL as data is ingested into Hadoop/HDFS.
• Detecting anomalous behavior and triggering alerts.
• Continuous reporting of summary metrics for incoming data.
CONFIDENTIAL - RESTRICTED37
val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap (status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")
flatMap flatMap flatMap
save save save
batch @ t+1batch @ t batch @ t+2
tweets DStream
hashTags DStream
Stream composed of
small (1-10s) batch
computations
“Micro-batch” Architecture
CONFIDENTIAL - RESTRICTED38
User Use Case Spark’s Value
Conviva Optimize end user’s online video
experience by analyzing traffic patterns in
real-time, enabling fine-grained traffic
shaping control.
• Rapid development and prototyping
• Shared business logic for offline and
online computation
• Open machine learning algorithms
Yahoo! Speedup model training pipeline for Ad
Targeting, increasing feature extraction
phase by over 3x. Content
recommendation using Collaborative
filtering
• Reduced latency in broader data
pipeline.
• Use of iterative ML
• Efficient P2P broadcast
Anonymous (Large
Tech Company)
“Semi real-time” log aggregation and
analysis for monitoring, alerting, etc.
• Low latency, frequently running
“mini” batch jobs processing latest
data sets
Technicolor Real-time analytics for their customers
(i.e., telcos); need ability to provide
streaming results and ability to query
them in real time
● Easy to develop; just need Spark
& SparkStreaming
● Arbitrary queries on live data
Sample use cases
CONFIDENTIAL - RESTRICTED39
Building on Spark Today
• What kind of Apps?
• ETL
• Machine Learning
• Streaming
• Dashboards
Growing Number of Successful
Use Cases & 3rd Party
Applications
CONFIDENTIAL - RESTRICTED40
Current project status
• Spark promoted to Apache TLP in mid-Feb
• 100+ contributors and 25+ companies contributing
• Includes: Intel, Yahoo, Quantifind, Bizo, etc
• Dozens of production deployments
CONFIDENTIAL - RESTRICTED41
Questions?
• Thank you for attending!
• I’ll be happy to answer any additional questions
now…
• http://spark.apache.org/ has getting started guides,
tutorials, etc.

Contenu connexe

Tendances

Improving Python and Spark Performance and Interoperability with Apache Arrow...
Improving Python and Spark Performance and Interoperability with Apache Arrow...Improving Python and Spark Performance and Interoperability with Apache Arrow...
Improving Python and Spark Performance and Interoperability with Apache Arrow...Databricks
 
Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017Junping Du
 
Meet HBase 2.0 and Phoenix 5.0
Meet HBase 2.0 and Phoenix 5.0Meet HBase 2.0 and Phoenix 5.0
Meet HBase 2.0 and Phoenix 5.0DataWorks Summit
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem DataWorks Summit/Hadoop Summit
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesDataWorks Summit/Hadoop Summit
 
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...Data Con LA
 
Integrating Existing C++ Libraries into PySpark with Esther Kundin
Integrating Existing C++ Libraries into PySpark with Esther KundinIntegrating Existing C++ Libraries into PySpark with Esther Kundin
Integrating Existing C++ Libraries into PySpark with Esther KundinDatabricks
 
SQL and Machine Learning on Hadoop using HAWQ
SQL and Machine Learning on Hadoop using HAWQSQL and Machine Learning on Hadoop using HAWQ
SQL and Machine Learning on Hadoop using HAWQpivotalny
 
The Evolution of a Relational Database Layer over HBase
The Evolution of a Relational Database Layer over HBaseThe Evolution of a Relational Database Layer over HBase
The Evolution of a Relational Database Layer over HBaseDataWorks Summit
 
Innovation with Connection, The new HPCC Systems Plugins and Modules
Innovation with Connection, The new HPCC Systems Plugins and ModulesInnovation with Connection, The new HPCC Systems Plugins and Modules
Innovation with Connection, The new HPCC Systems Plugins and ModulesHPCC Systems
 
Apache Spark 2.3 boosts advanced analytics and deep learning with Python
Apache Spark 2.3 boosts advanced analytics and deep learning with PythonApache Spark 2.3 boosts advanced analytics and deep learning with Python
Apache Spark 2.3 boosts advanced analytics and deep learning with PythonDataWorks Summit
 
Functional Programming and Big Data
Functional Programming and Big DataFunctional Programming and Big Data
Functional Programming and Big DataDataWorks Summit
 
Stumbling stones when migrating from Oracle
 Stumbling stones when migrating from Oracle Stumbling stones when migrating from Oracle
Stumbling stones when migrating from OracleEDB
 
Apache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingApache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingBikas Saha
 
HBaseConEast2016: HBase and Spark, State of the Art
HBaseConEast2016: HBase and Spark, State of the ArtHBaseConEast2016: HBase and Spark, State of the Art
HBaseConEast2016: HBase and Spark, State of the ArtMichael Stack
 
There and back_again_oracle_and_big_data_16x9
There and back_again_oracle_and_big_data_16x9There and back_again_oracle_and_big_data_16x9
There and back_again_oracle_and_big_data_16x9Gleb Otochkin
 
Spark ai summit_oct_17_2019_kimhammar_jimdowling_v6
Spark ai summit_oct_17_2019_kimhammar_jimdowling_v6Spark ai summit_oct_17_2019_kimhammar_jimdowling_v6
Spark ai summit_oct_17_2019_kimhammar_jimdowling_v6Kim Hammar
 
Apache Phoenix: Use Cases and New Features
Apache Phoenix: Use Cases and New FeaturesApache Phoenix: Use Cases and New Features
Apache Phoenix: Use Cases and New FeaturesHBaseCon
 
Apache Ratis - In Search of a Usable Raft Library
Apache Ratis - In Search of a Usable Raft LibraryApache Ratis - In Search of a Usable Raft Library
Apache Ratis - In Search of a Usable Raft LibraryTsz-Wo (Nicholas) Sze
 

Tendances (20)

Improving Python and Spark Performance and Interoperability with Apache Arrow...
Improving Python and Spark Performance and Interoperability with Apache Arrow...Improving Python and Spark Performance and Interoperability with Apache Arrow...
Improving Python and Spark Performance and Interoperability with Apache Arrow...
 
Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017
 
Meet HBase 2.0 and Phoenix 5.0
Meet HBase 2.0 and Phoenix 5.0Meet HBase 2.0 and Phoenix 5.0
Meet HBase 2.0 and Phoenix 5.0
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
 
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
 
Integrating Existing C++ Libraries into PySpark with Esther Kundin
Integrating Existing C++ Libraries into PySpark with Esther KundinIntegrating Existing C++ Libraries into PySpark with Esther Kundin
Integrating Existing C++ Libraries into PySpark with Esther Kundin
 
SQL and Machine Learning on Hadoop using HAWQ
SQL and Machine Learning on Hadoop using HAWQSQL and Machine Learning on Hadoop using HAWQ
SQL and Machine Learning on Hadoop using HAWQ
 
The Evolution of a Relational Database Layer over HBase
The Evolution of a Relational Database Layer over HBaseThe Evolution of a Relational Database Layer over HBase
The Evolution of a Relational Database Layer over HBase
 
Innovation with Connection, The new HPCC Systems Plugins and Modules
Innovation with Connection, The new HPCC Systems Plugins and ModulesInnovation with Connection, The new HPCC Systems Plugins and Modules
Innovation with Connection, The new HPCC Systems Plugins and Modules
 
Empower Data-Driven Organizations with HPE and Hadoop
Empower Data-Driven Organizations with HPE and HadoopEmpower Data-Driven Organizations with HPE and Hadoop
Empower Data-Driven Organizations with HPE and Hadoop
 
Apache Spark 2.3 boosts advanced analytics and deep learning with Python
Apache Spark 2.3 boosts advanced analytics and deep learning with PythonApache Spark 2.3 boosts advanced analytics and deep learning with Python
Apache Spark 2.3 boosts advanced analytics and deep learning with Python
 
Functional Programming and Big Data
Functional Programming and Big DataFunctional Programming and Big Data
Functional Programming and Big Data
 
Stumbling stones when migrating from Oracle
 Stumbling stones when migrating from Oracle Stumbling stones when migrating from Oracle
Stumbling stones when migrating from Oracle
 
Apache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingApache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query Processing
 
HBaseConEast2016: HBase and Spark, State of the Art
HBaseConEast2016: HBase and Spark, State of the ArtHBaseConEast2016: HBase and Spark, State of the Art
HBaseConEast2016: HBase and Spark, State of the Art
 
There and back_again_oracle_and_big_data_16x9
There and back_again_oracle_and_big_data_16x9There and back_again_oracle_and_big_data_16x9
There and back_again_oracle_and_big_data_16x9
 
Spark ai summit_oct_17_2019_kimhammar_jimdowling_v6
Spark ai summit_oct_17_2019_kimhammar_jimdowling_v6Spark ai summit_oct_17_2019_kimhammar_jimdowling_v6
Spark ai summit_oct_17_2019_kimhammar_jimdowling_v6
 
Apache Phoenix: Use Cases and New Features
Apache Phoenix: Use Cases and New FeaturesApache Phoenix: Use Cases and New Features
Apache Phoenix: Use Cases and New Features
 
Apache Ratis - In Search of a Usable Raft Library
Apache Ratis - In Search of a Usable Raft LibraryApache Ratis - In Search of a Usable Raft Library
Apache Ratis - In Search of a Usable Raft Library
 

En vedette

Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache SparkBTI360
 
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on TutorialsSparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on TutorialsDatabricks
 
Introduction to Spark on Hadoop
Introduction to Spark on HadoopIntroduction to Spark on Hadoop
Introduction to Spark on HadoopCarol McDonald
 
Taking Hadoop to Enterprise Security Standards
Taking Hadoop to Enterprise Security StandardsTaking Hadoop to Enterprise Security Standards
Taking Hadoop to Enterprise Security StandardsDataWorks Summit
 
Introduction to Spark
Introduction to SparkIntroduction to Spark
Introduction to SparkDavid Smelker
 
Deep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduceDeep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduceAmazon Web Services
 
Hadoop / Spark on Malware Expression
Hadoop / Spark on Malware ExpressionHadoop / Spark on Malware Expression
Hadoop / Spark on Malware ExpressionMapR Technologies
 
IBM Spark Technology Center: Real-time Advanced Analytics and Machine Learnin...
IBM Spark Technology Center: Real-time Advanced Analytics and Machine Learnin...IBM Spark Technology Center: Real-time Advanced Analytics and Machine Learnin...
IBM Spark Technology Center: Real-time Advanced Analytics and Machine Learnin...DataStax Academy
 
Webinar: Is Spark Hadoop's Friend or Foe?
Webinar: Is Spark Hadoop's Friend or Foe? Webinar: Is Spark Hadoop's Friend or Foe?
Webinar: Is Spark Hadoop's Friend or Foe? Zaloni
 
Spark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted MalaskaSpark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted MalaskaSpark Summit
 
Introduction to Spark - DataFactZ
Introduction to Spark - DataFactZIntroduction to Spark - DataFactZ
Introduction to Spark - DataFactZDataFactZ
 
Spark and Hadoop Perfect Togeher by Arun Murthy
Spark and Hadoop Perfect Togeher by Arun MurthySpark and Hadoop Perfect Togeher by Arun Murthy
Spark and Hadoop Perfect Togeher by Arun MurthySpark Summit
 
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 Let Spark Fly: Advantages and Use Cases for Spark on Hadoop Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
Let Spark Fly: Advantages and Use Cases for Spark on HadoopMapR Technologies
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapePaco Nathan
 
Using MongoDB with Hadoop & Spark
Using MongoDB with Hadoop & SparkUsing MongoDB with Hadoop & Spark
Using MongoDB with Hadoop & SparkMongoDB
 
Big Data Processing with Spark and Scala
Big Data Processing with Spark and Scala Big Data Processing with Spark and Scala
Big Data Processing with Spark and Scala Edureka!
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache SparkAmir Sedighi
 

En vedette (20)

Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on TutorialsSparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
 
Introduction to Spark on Hadoop
Introduction to Spark on HadoopIntroduction to Spark on Hadoop
Introduction to Spark on Hadoop
 
Taking Hadoop to Enterprise Security Standards
Taking Hadoop to Enterprise Security StandardsTaking Hadoop to Enterprise Security Standards
Taking Hadoop to Enterprise Security Standards
 
Introduction to Spark
Introduction to SparkIntroduction to Spark
Introduction to Spark
 
Deep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduceDeep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduce
 
Hadoop / Spark on Malware Expression
Hadoop / Spark on Malware ExpressionHadoop / Spark on Malware Expression
Hadoop / Spark on Malware Expression
 
IBM Spark Technology Center: Real-time Advanced Analytics and Machine Learnin...
IBM Spark Technology Center: Real-time Advanced Analytics and Machine Learnin...IBM Spark Technology Center: Real-time Advanced Analytics and Machine Learnin...
IBM Spark Technology Center: Real-time Advanced Analytics and Machine Learnin...
 
Hadoop and Spark
Hadoop and SparkHadoop and Spark
Hadoop and Spark
 
Webinar: Is Spark Hadoop's Friend or Foe?
Webinar: Is Spark Hadoop's Friend or Foe? Webinar: Is Spark Hadoop's Friend or Foe?
Webinar: Is Spark Hadoop's Friend or Foe?
 
Spark Hadoop
Spark HadoopSpark Hadoop
Spark Hadoop
 
Spark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted MalaskaSpark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted Malaska
 
Introduction to Spark - DataFactZ
Introduction to Spark - DataFactZIntroduction to Spark - DataFactZ
Introduction to Spark - DataFactZ
 
Hadoop to spark-v2
Hadoop to spark-v2Hadoop to spark-v2
Hadoop to spark-v2
 
Spark and Hadoop Perfect Togeher by Arun Murthy
Spark and Hadoop Perfect Togeher by Arun MurthySpark and Hadoop Perfect Togeher by Arun Murthy
Spark and Hadoop Perfect Togeher by Arun Murthy
 
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 Let Spark Fly: Advantages and Use Cases for Spark on Hadoop Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscape
 
Using MongoDB with Hadoop & Spark
Using MongoDB with Hadoop & SparkUsing MongoDB with Hadoop & Spark
Using MongoDB with Hadoop & Spark
 
Big Data Processing with Spark and Scala
Big Data Processing with Spark and Scala Big Data Processing with Spark and Scala
Big Data Processing with Spark and Scala
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
 

Similaire à Hadoop Spark - Reuniao SouJava 12/04/2014

BDM25 - Spark runtime internal
BDM25 - Spark runtime internalBDM25 - Spark runtime internal
BDM25 - Spark runtime internalDavid Lauzon
 
Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop :
Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop : Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop :
Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop : Mark Rittman
 
Big Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI ProsBig Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI ProsAndrew Brust
 
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014cdmaxime
 
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsightEnterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsightPaco Nathan
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Djamel Zouaoui
 
New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015Robbie Strickland
 
Introduction to Spark - Phoenix Meetup 08-19-2014
Introduction to Spark - Phoenix Meetup 08-19-2014Introduction to Spark - Phoenix Meetup 08-19-2014
Introduction to Spark - Phoenix Meetup 08-19-2014cdmaxime
 
Challenges of Building a First Class SQL-on-Hadoop Engine
Challenges of Building a First Class SQL-on-Hadoop EngineChallenges of Building a First Class SQL-on-Hadoop Engine
Challenges of Building a First Class SQL-on-Hadoop EngineNicolas Morales
 
Manuel Hurtado. Couchbase paradigma4oct
Manuel Hurtado. Couchbase paradigma4octManuel Hurtado. Couchbase paradigma4oct
Manuel Hurtado. Couchbase paradigma4octParadigma Digital
 
Cascading on starfish
Cascading on starfishCascading on starfish
Cascading on starfishFei Dong
 
Challenges of Implementing an Advanced SQL Engine on Hadoop
Challenges of Implementing an Advanced SQL Engine on HadoopChallenges of Implementing an Advanced SQL Engine on Hadoop
Challenges of Implementing an Advanced SQL Engine on HadoopDataWorks Summit
 
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014cdmaxime
 
Apache Spark - San Diego Big Data Meetup Jan 14th 2015
Apache Spark - San Diego Big Data Meetup Jan 14th 2015Apache Spark - San Diego Big Data Meetup Jan 14th 2015
Apache Spark - San Diego Big Data Meetup Jan 14th 2015cdmaxime
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsDataStax Academy
 
Reactive dashboard’s using apache spark
Reactive dashboard’s using apache sparkReactive dashboard’s using apache spark
Reactive dashboard’s using apache sparkRahul Kumar
 
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UKIntroduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UKSkills Matter
 
YARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARNYARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARNHortonworks
 
Scrap Your MapReduce - Apache Spark
 Scrap Your MapReduce - Apache Spark Scrap Your MapReduce - Apache Spark
Scrap Your MapReduce - Apache SparkIndicThreads
 

Similaire à Hadoop Spark - Reuniao SouJava 12/04/2014 (20)

BDM25 - Spark runtime internal
BDM25 - Spark runtime internalBDM25 - Spark runtime internal
BDM25 - Spark runtime internal
 
Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop :
Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop : Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop :
Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop :
 
Big Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI ProsBig Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI Pros
 
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
 
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsightEnterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015
 
Introduction to Spark - Phoenix Meetup 08-19-2014
Introduction to Spark - Phoenix Meetup 08-19-2014Introduction to Spark - Phoenix Meetup 08-19-2014
Introduction to Spark - Phoenix Meetup 08-19-2014
 
Challenges of Building a First Class SQL-on-Hadoop Engine
Challenges of Building a First Class SQL-on-Hadoop EngineChallenges of Building a First Class SQL-on-Hadoop Engine
Challenges of Building a First Class SQL-on-Hadoop Engine
 
Manuel Hurtado. Couchbase paradigma4oct
Manuel Hurtado. Couchbase paradigma4octManuel Hurtado. Couchbase paradigma4oct
Manuel Hurtado. Couchbase paradigma4oct
 
Cascading on starfish
Cascading on starfishCascading on starfish
Cascading on starfish
 
Challenges of Implementing an Advanced SQL Engine on Hadoop
Challenges of Implementing an Advanced SQL Engine on HadoopChallenges of Implementing an Advanced SQL Engine on Hadoop
Challenges of Implementing an Advanced SQL Engine on Hadoop
 
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014
 
Apache Spark - San Diego Big Data Meetup Jan 14th 2015
Apache Spark - San Diego Big Data Meetup Jan 14th 2015Apache Spark - San Diego Big Data Meetup Jan 14th 2015
Apache Spark - San Diego Big Data Meetup Jan 14th 2015
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
 
October 2014 HUG : Hive On Spark
October 2014 HUG : Hive On SparkOctober 2014 HUG : Hive On Spark
October 2014 HUG : Hive On Spark
 
Reactive dashboard’s using apache spark
Reactive dashboard’s using apache sparkReactive dashboard’s using apache spark
Reactive dashboard’s using apache spark
 
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UKIntroduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
 
YARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARNYARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARN
 
Scrap Your MapReduce - Apache Spark
 Scrap Your MapReduce - Apache Spark Scrap Your MapReduce - Apache Spark
Scrap Your MapReduce - Apache Spark
 

Dernier

Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 

Dernier (20)

Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 

Hadoop Spark - Reuniao SouJava 12/04/2014

  • 1. CONFIDENTIAL - RESTRICTED1 CONFIDENTIAL - RESTRICTED An introduction to Spark By Todd Lipcon With thanks to Ted Malaska, Jairam Ranganathan and Sandy Ryza
  • 2. CONFIDENTIAL - RESTRICTED2 About the Presenter… Todd Lipcon Software Engineer San Francisco Bay Area | Information Technology and Services 2009- Current: Software Engineer at Cloudera Hadoop Commiter and PMC Member HBase Committer and PMC Member Member of the Apache Software Foundation Past: Software Engineer at Amie Street, Inc. Co-founder and technical lead at MyArtPlot, LLC Software Engineering Intern at Google
  • 3. CONFIDENTIAL - RESTRICTED3 Agenda • Introduction to Crunch • Introduction to Spark • Spark use cases • Changing mindsets • Spark in CDH • Short-term roadmap
  • 4. CONFIDENTIAL - RESTRICTED4 MR is great… but can we do better? Problems with MR: • Very low-level: requires a lot of code to do simple things • Very constrained: everything must be described as “map” and “reduce”. Powerful but sometimes difficult to think in these terms.
  • 5. CONFIDENTIAL - RESTRICTED5 MR is great… but can we do better? Two approaches to improve on MapReduce: 1. Special purpose systems to solve one problem domain well. Ex: Giraph / Graphlab (graph processing), Storm (stream processing) 2. Generalize the capabilities of MapReduce to provide a richer foundation to solve problems. Ex: MPI, Hama/Pregel (BSP), Dryad (arbitrary DAGs) Both are viable strategies depending on problem!
  • 6. CONFIDENTIAL - RESTRICTED6 A brief introduction Apache Crunch
  • 7. CONFIDENTIAL - RESTRICTED7 What is Crunch? “The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. Its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run.”
  • 8. CONFIDENTIAL - RESTRICTED8 Compared to other Hadoop tools (e.g Hive/Pig) • Developer focused • Java API, easy to integrate with other Java libraries • Less boiler plate, more “fluent” API than MR • Minimal abstractions • Thin layer on top of MapReduce – equally fast • Access to MapReduce APIs when you really need them • Flexible data model • Able to use POJOs, protocol buffers, Avro records, tuples, etc.
  • 9. CONFIDENTIAL - RESTRICTED9 Crunch concepts: Sources and Targets • Source: a repository to load input data from • JDBC • Files on Hadoop (text/CSV, Avro, protobuf, binary, custom data formats, etc) • HBase • Yields a PCollection<T> • Target: a place to write job output • Stores a PCollection back onto HDFS or some other storage
  • 10. CONFIDENTIAL - RESTRICTED10 Crunch concepts: PCollection<T> • Represents a “parallel collection” of T objects • .parallelDo(DoFn<T, U> function) • like a Hadoop Mapper • Emit 0 to n records for each input record • Output: PCollection<U> • .filter(FilterFn<T> filter) • FilterFn must return true or false for each input record • Output: PCollection<T> • .union(PCollection<T>) • Appends two collections together
  • 11. CONFIDENTIAL - RESTRICTED11 Crunch concepts: PTable<K, V> • Sub-interface of PCollection<Pair<K, V>> • Supports all of the PCollection methods. • .groupByKey() • Output: PGroupedTable<K, V> • Like a PCollection<K, Iterable<V>> • This is just like the “shuffle” phase of MapReduce! • JoinStrategy.join(table1, table2, joinType) • Simple method for SQL-style joins between two tables on their matching keys • Supports inner join, outer join, cartesian join, etc. • API allows user to provide hints on the best implementation
  • 12. CONFIDENTIAL - RESTRICTED12 Crunch concepts: PGroupedTable<K, V> • The same as a PCollection<Pair<K, V>> • Still allows PCollection functions like parallelDo(…) • Methods for aggregating or reducing: • combineValues(Aggregator) • Standard aggregators include summing values, averaging, sampling, top-k, etc.
  • 13. CONFIDENTIAL - RESTRICTED13 Crunch WordCount (1) public class WordCount extends Configured implements Tool { public static void main(String[] args) throws Exception { ToolRunner.run(new Configuration(), new WordCount(), args); } void run() { String inputPath = args[0]; String outputPath = args[1]; // Create an object to coordinate pipeline creation and execution Pipeline pipeline = new MRPipeline(WordCount.class, getConf()); // or... = MemPipeline.getInstance(); 1 2 3 4 5 6 7 8 9 10 11 Standard Hadoop “main” function which parses configuration, etc Specifies whether this pipeline should execute in MapReduce or locally in- memory (for testing)
  • 14. CONFIDENTIAL - RESTRICTED14 Crunch WordCount (2) // Reference a given text file as a collection of Strings PCollection<String> lines = pipeline.readTextFile(inputPath); // Define a function that splits each line in a PCollection of // Strings into a PCollection made up of the individual words // in those lines. PCollection<String> words = lines.parallelDo(new Tokenizer(), Writables.strings(); PTable<String, Long> counts = words.count(); pipeline.writeTextFile(counts, outputPath); } 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Specifies how to serialize the data in this collection. User-defined class…
  • 15. CONFIDENTIAL - RESTRICTED15 Crunch WordCount (3) public static class Tokenizer extends DoFn<String, String> { @Override public void process(String line, Emitter<String> emitter) { for (String word : line.split(“ “) { emitter.emit(word); } } } } 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
  • 16. CONFIDENTIAL - RESTRICTED16 Crunch advantages • DoFns are like mappers, but many DoFns can be chained together in a single mapper, or in the reducer • Crunch planner constructs a pipeline with the minimum number of MR jobs • myTable.parallelDo(x).parallelDo(y).groupByKey().aggregate(A ggregators.sum().parallelDo(z).writeToFile(“/foo”) is only one MapReduce job! • Makes it easy to write composable, re-usable, testable code • Can run the same pipeline in MapReduce, or in-memory • In the future, can also run on new frameworks like Spark with no code change!
  • 17. CONFIDENTIAL - RESTRICTED17 A brief introduction Apache Spark
  • 18. CONFIDENTIAL - RESTRICTED18 What is Spark? Spark is a general purpose computational framework Retains the advantages of MapReduce: • Linear scalability • Fault-tolerance • Data Locality based computations …but offers so much more: • Leverages distributed memory for better performance • Improved developer experience • Full Directed Graph expressions for data parallel computations • Comes with libraries for machine learning, graph analysis, etc
  • 19. CONFIDENTIAL - RESTRICTED19 Spark vs Crunch • Similar API – higher level functional transformations • Spark is a service of its own: not a wrapper around MapReduce • In fact, Crunch can compile down to Spark jobs instead of MR jobs! • More “batteries included” • Machine learning, graph computation, etc
  • 20. CONFIDENTIAL - RESTRICTED20 Easy: Get Started Immediately • Multi-language support • Interactive Shell Python lines = sc.textFile(...) lines.filter(lambda s: “ERROR” in s).count() Scala val lines = sc.textFile(...) lines.filter(s => s.contains(“ERROR”)).count() Java JavaRDD<String> lines = sc.textFile(...); lines.filter(new Function<String, Boolean>() { Boolean call(String s) { return s.contains(“error”); } }).count();
  • 21. CONFIDENTIAL - RESTRICTED21 Spark Client (App Master) Scheduler and RDD Graph Spark from a High Level Worker Spark Worker RDD Objects Task Threads Block Manager Rdd1.join(rdd2). groupBy(…) .filter(…) Task Scheduler Threads Block Manager Cluster Manager Trackers MemoryStore DiskStore BlockInfo ShuffleBlockManager
  • 22. CONFIDENTIAL - RESTRICTED22 Resilient Distributed Datasets • Resilient Distributed Datasets • Collections of objects distributed across a cluster, stored in RAM or on Disk • Like a Crunch PCollection • Built through parallel transformations • Automatically rebuilt on failure (resilient)
  • 23. CONFIDENTIAL - RESTRICTED23 Resilient Distributed Datasets • Building a Story • HadoopRDD • JdbcRDD • MappedRDD • FlatMappedRDD • FilteredRDD • OrderedRDD • PairRDD • ShuffledRDD • UnionRDD • …. join filter groupBy B: C: D: E: F: G: Ç√ Ω map A: map take
  • 24. CONFIDENTIAL - RESTRICTED24 Easy: Expressive API • Transformations • Lazy (like Crunch) • Return RDDs • Actions • Demand action • Return value or store data on HDFS
  • 25. CONFIDENTIAL - RESTRICTED25 Easy: Expressive API • Actions • Collect • Count • First • Take(n) • saveAsTextFile • foreach(func) • reduce(func) • … • Transformations • Map • Filter • flatMap • Union • Reduce • Sort • Join • …
  • 26. CONFIDENTIAL - RESTRICTED26 Easy: Example – Word Count (Java version) // Create a context: specifies cluster configuration, app name, etc. JavaSparkContext sc = new JavaSparkContext(...); JavaRDD<String> lines = sc.textFile(“hdfs://…”); // Create a new RDD with each word separated JavaRDD<String> words = lines.flatMap( new FlatMapFunction<String, String>() { public Iterable<String> call(String s) { return Arrays.asList(s.split(“ “)); } }); 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
  • 27. CONFIDENTIAL - RESTRICTED27 Easy: Example – Word Count (Java version) // Create an RDD with tuples like (“foo”, 1) JavaRdd<String> ones = words.map( new PairFunction<String, String, Integer>() { public Tuple2<String, Integer> call(String s) { return new Tuple2(s, 1); } }); // Run the “reduce” phase to count occurrences JavaPairRDD<String, Integer> counts = ones.reduceByKey( new Function2<Integer, Integer, Integer>() { public Integer call(Integer i1, Integer i2) { return i1 + i2; } } ); Counts.saveAsTextFile(“/out 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
  • 28. CONFIDENTIAL - RESTRICTED28 Easy: Example – Word Count (Scala) val spark = new SparkContext(master, appName, [sparkHome], [jars]) val file = spark.textFile("hdfs://...") val counts = file.flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey(_ + _) counts.saveAsTextFile("hdfs://...") 1 2 3 4 5 6 Scala’s type inference and lambda syntax make this very concise!
  • 29. CONFIDENTIAL - RESTRICTED29 Easy: Out of the Box Functionality • Hadoop Integration • Works with Hadoop Data • Runs With YARN • Libraries • MLlib • Spark Streaming • GraphX (alpha) • Roadmap • Language support: • Improved Python support • SparkR • Java 8 • Better ML • Sparse Data Support • Model Evaluation Framework • Performance Testing
  • 30. CONFIDENTIAL - RESTRICTED30 Example: Logistic Regression data = spark.textFile(...).map(parseDataPoint).cache() w = numpy.random.rand(D) for i in range(iterations): gradient = data .map(lambda p: (1 / (1 + exp(-p.y * w.dot(p.x)))) * p.y * p.x) .reduce(lambda x, y: x + y) w -= gradient print “Final w: %s” % w Spark will load the parsed dataset into memory!
  • 31. CONFIDENTIAL - RESTRICTED31 Memory management leads to greater performance • Hadoop cluster with 100 nodes contains 10+TB of RAM today and will double next year • 1 GB RAM ~ $10-$20 Trends: ½ price every 18 months 2x bandwidth every 3 years 64-128GB RAM 16 cores 50 GB per sec Memory can be enabler for high performance big data applications
  • 32. CONFIDENTIAL - RESTRICTED32 Fast: Using RAM, Operator Graphs • In-memory Caching • Data Partitions read from RAM instead of disk • Operator Graphs • Scheduling Optimizations • Fault Tolerance = cached partition = RDD join filter groupBy B: B: C: D: E: F: Ç√ Ω map A: map take
  • 33. CONFIDENTIAL - RESTRICTED33 Logistic Regression Performance (data fits in memory) 0 500 1000 1500 2000 2500 3000 3500 4000 1 5 10 20 30 RunningTime(s) Number of Iterations Hadoop Spark 110 s / iteration first iteration 80 s further iterations 1 s
  • 34. CONFIDENTIAL - RESTRICTED34 Spark MLLib • “Batteries included” library for Machine Learning • Binary Classification (Support Vector Machines) • Linear and logistic regression • Collaborative filtering (recommendation engines) • Stochastic gradient descent • Clustering (K-means)
  • 35. CONFIDENTIAL - RESTRICTED35 MLLib example (Scala) import org.apache.spark.mllib.clustering.KMeans // Load and parse the data val data = sc.textFile("kmeans_data.txt") val parsedData = data.map( _.split(' ').map(_.toDouble)) // Cluster the data into two classes using KMeans val numIterations = 20 val numClusters = 2 val clusters = KMeans.train(parsedData, numClusters, numIterations) // Evaluate clustering by computing Within Set Sum of Squared Errors val WSSSE = clusters.computeCost(parsedData) println("Within Set Sum of Squared Errors = " + WSSSE) 1 2 3 4 5 6
  • 36. CONFIDENTIAL - RESTRICTED36 Spark Streaming • Run continuous processing of data using Spark’s core API. • Extends Spark concept of RDD’s to DStreams (Discretized Streams) which are fault tolerant, transformable streams. Users can re-use existing code for batch/offline processing. • Adds “rolling window” operations. E.g. compute rolling averages or counts for data over last five minutes. • Example use cases: • “On-the-fly” ETL as data is ingested into Hadoop/HDFS. • Detecting anomalous behavior and triggering alerts. • Continuous reporting of summary metrics for incoming data.
  • 37. CONFIDENTIAL - RESTRICTED37 val tweets = ssc.twitterStream() val hashTags = tweets.flatMap (status => getTags(status)) hashTags.saveAsHadoopFiles("hdfs://...") flatMap flatMap flatMap save save save batch @ t+1batch @ t batch @ t+2 tweets DStream hashTags DStream Stream composed of small (1-10s) batch computations “Micro-batch” Architecture
  • 38. CONFIDENTIAL - RESTRICTED38 User Use Case Spark’s Value Conviva Optimize end user’s online video experience by analyzing traffic patterns in real-time, enabling fine-grained traffic shaping control. • Rapid development and prototyping • Shared business logic for offline and online computation • Open machine learning algorithms Yahoo! Speedup model training pipeline for Ad Targeting, increasing feature extraction phase by over 3x. Content recommendation using Collaborative filtering • Reduced latency in broader data pipeline. • Use of iterative ML • Efficient P2P broadcast Anonymous (Large Tech Company) “Semi real-time” log aggregation and analysis for monitoring, alerting, etc. • Low latency, frequently running “mini” batch jobs processing latest data sets Technicolor Real-time analytics for their customers (i.e., telcos); need ability to provide streaming results and ability to query them in real time ● Easy to develop; just need Spark & SparkStreaming ● Arbitrary queries on live data Sample use cases
  • 39. CONFIDENTIAL - RESTRICTED39 Building on Spark Today • What kind of Apps? • ETL • Machine Learning • Streaming • Dashboards Growing Number of Successful Use Cases & 3rd Party Applications
  • 40. CONFIDENTIAL - RESTRICTED40 Current project status • Spark promoted to Apache TLP in mid-Feb • 100+ contributors and 25+ companies contributing • Includes: Intel, Yahoo, Quantifind, Bizo, etc • Dozens of production deployments
  • 41. CONFIDENTIAL - RESTRICTED41 Questions? • Thank you for attending! • I’ll be happy to answer any additional questions now… • http://spark.apache.org/ has getting started guides, tutorials, etc.