SlideShare a Scribd company logo
1 of 27
Apache Spark
Lightening Fast Cluster Computing
Eric Mizell – Director, Solution Engineering
Page2 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
What is Apache Spark?
Apache Open Source Project
Distributed Compute Engine
for fast and expressive data processing
Designed for Iterative, In-Memory
computations and interactive data mining
Expressive Multi-Language APIs
for Java, Scala, Python, and R
Powerful Abstractions
Enable data workers to rapidly iterate over
data for:
• ETL, Machine Learning, SQL, Stream Processing,
and Graph Processing
Scala
Java
Python
APIs
Spark Core EngineSpark Core Engine
GraphX
Spark
SQL
Spark
Streaming
MLlib
Page3 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Why Spark?
Elegant Developer APIs
• Data Frames/SQL, Machine Learning, Graph algorithms and streaming
• Scala, Python, Java and R
• Single environment for pre-processing and Machine Learning
In-memory computation model
• Effective for iterative computations and machine learning
Machine Learning On Hadoop
• Implementation of distributed ML-algorithms
• Pipeline API (Spark ML)
Runs on Hadoop on YARN, Mesos, standalone
Page4 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Interactions with Spark
Command Line
• Scala shell – Scala/Java (./bin/spark-shell)
• Python - (./bin/pyspark)
Notebooks
• Apache Zeppelin Notebook
• Juptyer/IPython Notebook
• IRuby Notebook
ODBC/JDBC (Spark SQL only via Thrift)
• Simba driver
• DataDirect driver
Page5 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Introducing Apache Zeppelin Web-based Notebook for
interactive analytics
Features
Ad-hoc experimentation
Deeply integrated with Spark + Hadoop
Supports multiple language backends
Incubating at Apache
Use Case
Data exploration and discovery
Visualization
Interactive snippet-at-a-time experience
“Modern Data Science Studio”
Page6 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Fundamental Abstraction: Resilient Distributed Datasets
RDD
Work with distributed collections as
primitives
RDD Properties
• Immutable collections of objects spread across
a cluster
• Built through parallel transformations (map,
filter, etc.)
• Automatically rebuilt on failure
• Controllable persistence (e.g. caching in RAM)
Multiple Languages
broad developer, partner and customer
engagement
RDD
Partition 1
RDD
Partition 2
RDD
Partition 3Worker Node
Worker Node
Worker Node
RDD
LogicalSpark
Driver
sc = new SparkContext
rDD
=sc.textfile(“hdfs://…”)
rDD.filter(…)
rDD.Cache
rDD.Count
rDD.map
…
Developer
Physical
Writes
RDD
RDDs are collections of objects distributed across a cluster,
cached in RAM or on disk. They are built through parallel
transformations, automatically rebuilt on failure and immutable
(each transformation creates a new RDD).
Page7 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
What can developers do with RDDs?
RDD Operations
Transformations
• e.g. map, filter, groupBy, join
• Lazy operations to build RDDs from other
RDDs
Actions
• e.g. count, collect, save
• Return a result or write it to storage
Other primitives
• Accumulator
• Broadcast Variables
Developer
Writes
RDD
Operations
Writes
Accumulator
s
Actions
Broadcast
Variables
Transformations
Page8 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
lines = spark.textFile(“hdfs://...”)
errors = lines.filter(lambda s: s.startswith(“ERROR”))
messages = errors.map(lambda s: s.split(‘t’)[2])
messages.cache()
messages.filter(lambda s: “foo” in s).count()
messages.filter(lambda s: “bar” in s).count()
. . .
Base RDD
Transformed RDD
Action
Result: full-text search of Wikipedia in <1 sec
(vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec
(vs 170 sec for on-disk data)
Example: Mining Console Logs
Load error messages from a log into memory, then
interactively search for patterns
Page9 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
RDD
Demo
Page10 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
SQL
SQL Access and Data Frames
YARN
HDFS
Scala
Java
Python
APIs
Spark Core EngineSpark Core Engine
GraphX
Spark
Streaming
MLlib
Spark
SQL
Page11 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
YARN
HDFS
Spark SQL
Table Structure
integrated to work with tables and rows
Hive Queries via Spark
by Spark SQL Context can connect to Hive and
query Hive
Bindings
to Python, Scala, Java, and R
Data Frames
new abstractions simplifies and speeds up SQL
processing
Spark Core Engine
Spark SQL
Data Frame DSL Spark SQL
Data Frame API
Data Source API
Page12 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Storage
What are Data Frames?
Data Frames represent data in RDDs as a Table
RDD is a low level abstraction
–Think of RDD as bytecode and DataFrame as the
Java Program
Data Frame Properties
–Data Frames attach schema to RDDs
–Allows users to perform aggressive query
optimizations
–Brings the power of SQL to RDDs!
dept name age
Bio H Smith 48
CS A Turing 54
Bio B Jones 43
Phys E Witten 61
Tuple
Relational
View
Columnar Storage
ORCFile Parquet
Unstructured Data
JSON CSV
Text Avro
Custom
Weblog
Page13 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Data Frames are intuitive
RDD Example
Equivalent Data Frame Example
dept name age
Bio H Smith 48
CS A Turing 54
Bio B Jones 43
Phys E Witten 61
Find average age by department?
Page14 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
DataFrame
Demo
YARN
HDFS
Scala
Java
Python
APIs
Spark Core EngineSpark Core Engine
GraphX
Spark
Streaming
MLlib
Spark
SQL
Page15 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
MLlib
Machine Learning Library
YARN
HDFS
Scala
Java
Python
APIs
Spark Core EngineSpark Core Engine
GraphX
Spark
SQL
Spark
Streaming
MLlib
Page16 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
What is Machine Learning?
Machine learning is the study of
algorithms that learn concepts from
data.
A key aspect of learning is
generalization: how well a learning
algorithm is able to predict on unseen
examples.
Page17 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Machine Learning Primitives
Unsupervised Learning
Clustering (K-means)
Recommendation
Collaborative Filtering
- alternating least squares
Dimensionality Reductions
- Principal component analysis (PCA) and singular
value decomposition (SVD)
Supervised Learning
Classification
- Naïve Bayes, Decision Tree, Random Forest,
Gradient Boosted Trees
Regression
- linear, logistic and Support Vector Machines
(SVMs)
Page18 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
ML Workflows are complex
Q-Q
Q-A
similarit
y
Log
Parsing,
Cleanin
g
Ad
category
mapping
Query
category
mapping
Poly
Exp
(Q-A)
Feature
s
Model
Linear
Solver
train
test
Metrics
• Feature Extraction
Feature
Extraction
Ad Server
Sponsored Search Advertising Pipeline Challenges:
-> specify pipeline
-> inspect and debug
-> tune hyperparameters
-> productionize
HDFS
Page19 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
ML Pipeline makes ML workflows easier
Transformer
Transforms one dataset into another
Estimator
Fits model to data
Pipeline
Sequence of stages, consisting of estimators
or transformers
Parameters
Trait for components that take parameters
Page20 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Streaming
Real Time Stream Processing
YARN
HDFS
Scala
Java
Python
APIs
Spark Core EngineSpark Core Engine
GraphX
Spark
SQL
MLlib
Spark
Streaming
Page21 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Spark Streaming
• Spark Streaming is an extension of Spark-core API that supports scalable, high
throughput and fault-tolerant streaming applications.
• Data can be ingested from many data sources like Kafka, Flume, Twitter, ZeroMQ or
TCP sockets
• Data is processed using the now-familiar API: map, filter, reduce, join and window
• Processed data can be stored in databases, filesystems, or live dashboards
Page22 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
GraphX
Graph Processing
YARN
HDFS
Scala
Java
Python
APIs
Spark Core EngineSpark Core Engine
Spark
SQL
Spark
Streaming
MLlib GraphX
Page23 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Spark GraphX Graph API on Spark
Seamlessly work with graphs and collections
Growing library of graph algorithms
• SVD++, Connected Components, Triangle
Count, …
Iterative Graph Computations using
Pregel
Implements Valiant’s Bulk Synchronous
Parallel (BSP) model for distributing graph
algorithms.
Use Case
Social Media: Suggest new connections based
on existing relationships
Networking: Best routing through a given
network
Page24 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Part. 2
Part. 1
Vertex Table
(RDD)
B C
A D
F E
A D
Distributed Graphs as Tables (RDDs)
D
Property Graph
B C
D
E
AA
F
Edge Table
(RDD)
A B
A C
C D
B C
A E
A F
E F
E D
B
C
D
E
A
F
Routing
Table (RDD)
B
C
D
E
A
F
1
2
1 2
1 2
1
2
2D Vertex Cut Heuristic
Page25 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
How to Get Started with Spark
Page26 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Try Spark Today
Download the Hortonworks Sandbox
http://hortonworks.com/products/hortonworks-sandbox/
Go to the Apache Spark Website
http://spark.apache.org/
Learn Spark
Build a Proof of Concept
Test New Functionality
Page27 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
© Hortonworks Inc. 2013
Thank You!
Eric Mizell - Director, Solutions Engineering
emizell@hortonworks.com

More Related Content

What's hot

IMCSummit 2015 - Day 2 Developer Track - Anatomy of an In-Memory Data Fabric:...
IMCSummit 2015 - Day 2 Developer Track - Anatomy of an In-Memory Data Fabric:...IMCSummit 2015 - Day 2 Developer Track - Anatomy of an In-Memory Data Fabric:...
IMCSummit 2015 - Day 2 Developer Track - Anatomy of an In-Memory Data Fabric:...In-Memory Computing Summit
 
Developing Real-Time Data Pipelines with Apache Kafka
Developing Real-Time Data Pipelines with Apache KafkaDeveloping Real-Time Data Pipelines with Apache Kafka
Developing Real-Time Data Pipelines with Apache KafkaJoe Stein
 
Performance Comparison of Streaming Big Data Platforms
Performance Comparison of Streaming Big Data PlatformsPerformance Comparison of Streaming Big Data Platforms
Performance Comparison of Streaming Big Data PlatformsDataWorks Summit/Hadoop Summit
 
Hadoop Everywhere & Cloudbreak
Hadoop Everywhere & CloudbreakHadoop Everywhere & Cloudbreak
Hadoop Everywhere & CloudbreakSean Roberts
 
OpenStack Scale-out Networking Architecture
OpenStack Scale-out Networking ArchitectureOpenStack Scale-out Networking Architecture
OpenStack Scale-out Networking ArchitectureRandy Bias
 
IMCSummit 2015 - Day 1 Developer Track - Open-Source In-Memory Platforms: Ben...
IMCSummit 2015 - Day 1 Developer Track - Open-Source In-Memory Platforms: Ben...IMCSummit 2015 - Day 1 Developer Track - Open-Source In-Memory Platforms: Ben...
IMCSummit 2015 - Day 1 Developer Track - Open-Source In-Memory Platforms: Ben...In-Memory Computing Summit
 
High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...
High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...
High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...Spark Summit
 
20150716 introduction to apache spark v3
20150716 introduction to apache spark v3 20150716 introduction to apache spark v3
20150716 introduction to apache spark v3 Andrey Vykhodtsev
 
Bringing Real-Time to the Enterprise with Hortonworks DataFlow
Bringing Real-Time to the Enterprise with Hortonworks DataFlowBringing Real-Time to the Enterprise with Hortonworks DataFlow
Bringing Real-Time to the Enterprise with Hortonworks DataFlowDataWorks Summit
 
OGCE TeraGrid 2010 Science Gateway Tutorial Intro
OGCE TeraGrid 2010 Science Gateway Tutorial IntroOGCE TeraGrid 2010 Science Gateway Tutorial Intro
OGCE TeraGrid 2010 Science Gateway Tutorial Intromarpierc
 
OpenStack + Nano Server + Hyper-V + S2D
OpenStack + Nano Server + Hyper-V + S2DOpenStack + Nano Server + Hyper-V + S2D
OpenStack + Nano Server + Hyper-V + S2DAlessandro Pilotti
 
OpenStack in Action 4! Franz Meyer - What Use Case does Red Hat Enterprise ...
OpenStack in Action 4!   Franz Meyer - What Use Case does Red Hat Enterprise ...OpenStack in Action 4!   Franz Meyer - What Use Case does Red Hat Enterprise ...
OpenStack in Action 4! Franz Meyer - What Use Case does Red Hat Enterprise ...eNovance
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Helena Edelson
 
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinIntro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinAlex Zeltov
 
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleEvan Chan
 
Introduction to Apache NiFi And Storm
Introduction to Apache NiFi And StormIntroduction to Apache NiFi And Storm
Introduction to Apache NiFi And StormJungtaek Lim
 

What's hot (20)

IMCSummit 2015 - Day 2 Developer Track - Anatomy of an In-Memory Data Fabric:...
IMCSummit 2015 - Day 2 Developer Track - Anatomy of an In-Memory Data Fabric:...IMCSummit 2015 - Day 2 Developer Track - Anatomy of an In-Memory Data Fabric:...
IMCSummit 2015 - Day 2 Developer Track - Anatomy of an In-Memory Data Fabric:...
 
Developing Real-Time Data Pipelines with Apache Kafka
Developing Real-Time Data Pipelines with Apache KafkaDeveloping Real-Time Data Pipelines with Apache Kafka
Developing Real-Time Data Pipelines with Apache Kafka
 
Performance Comparison of Streaming Big Data Platforms
Performance Comparison of Streaming Big Data PlatformsPerformance Comparison of Streaming Big Data Platforms
Performance Comparison of Streaming Big Data Platforms
 
Hadoop Everywhere & Cloudbreak
Hadoop Everywhere & CloudbreakHadoop Everywhere & Cloudbreak
Hadoop Everywhere & Cloudbreak
 
OpenStack Scale-out Networking Architecture
OpenStack Scale-out Networking ArchitectureOpenStack Scale-out Networking Architecture
OpenStack Scale-out Networking Architecture
 
IMCSummit 2015 - Day 1 Developer Track - Open-Source In-Memory Platforms: Ben...
IMCSummit 2015 - Day 1 Developer Track - Open-Source In-Memory Platforms: Ben...IMCSummit 2015 - Day 1 Developer Track - Open-Source In-Memory Platforms: Ben...
IMCSummit 2015 - Day 1 Developer Track - Open-Source In-Memory Platforms: Ben...
 
High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...
High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...
High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...
 
Camel Riders in the Cloud
Camel Riders in the CloudCamel Riders in the Cloud
Camel Riders in the Cloud
 
Effective Spark on Multi-Tenant Clusters
Effective Spark on Multi-Tenant ClustersEffective Spark on Multi-Tenant Clusters
Effective Spark on Multi-Tenant Clusters
 
20150716 introduction to apache spark v3
20150716 introduction to apache spark v3 20150716 introduction to apache spark v3
20150716 introduction to apache spark v3
 
Bringing Real-Time to the Enterprise with Hortonworks DataFlow
Bringing Real-Time to the Enterprise with Hortonworks DataFlowBringing Real-Time to the Enterprise with Hortonworks DataFlow
Bringing Real-Time to the Enterprise with Hortonworks DataFlow
 
Cooperative Data Exploration with iPython Notebook
Cooperative Data Exploration with iPython NotebookCooperative Data Exploration with iPython Notebook
Cooperative Data Exploration with iPython Notebook
 
OGCE TeraGrid 2010 Science Gateway Tutorial Intro
OGCE TeraGrid 2010 Science Gateway Tutorial IntroOGCE TeraGrid 2010 Science Gateway Tutorial Intro
OGCE TeraGrid 2010 Science Gateway Tutorial Intro
 
OpenStack + Nano Server + Hyper-V + S2D
OpenStack + Nano Server + Hyper-V + S2DOpenStack + Nano Server + Hyper-V + S2D
OpenStack + Nano Server + Hyper-V + S2D
 
OpenStack in Action 4! Franz Meyer - What Use Case does Red Hat Enterprise ...
OpenStack in Action 4!   Franz Meyer - What Use Case does Red Hat Enterprise ...OpenStack in Action 4!   Franz Meyer - What Use Case does Red Hat Enterprise ...
OpenStack in Action 4! Franz Meyer - What Use Case does Red Hat Enterprise ...
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
 
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinIntro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
 
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
 
Streaming SQL
Streaming SQLStreaming SQL
Streaming SQL
 
Introduction to Apache NiFi And Storm
Introduction to Apache NiFi And StormIntroduction to Apache NiFi And Storm
Introduction to Apache NiFi And Storm
 

Viewers also liked

The Anti-Henry Ford: How 200 hour discoveries revolutionized the way we do bu...
The Anti-Henry Ford: How 200 hour discoveries revolutionized the way we do bu...The Anti-Henry Ford: How 200 hour discoveries revolutionized the way we do bu...
The Anti-Henry Ford: How 200 hour discoveries revolutionized the way we do bu...All Things Open
 
Marketing is not all fluff; engineering is not all math
Marketing is not all fluff; engineering is not all mathMarketing is not all fluff; engineering is not all math
Marketing is not all fluff; engineering is not all mathAll Things Open
 
Trademarks and Your Free and Open Source Software Project
Trademarks and Your Free and Open Source Software ProjectTrademarks and Your Free and Open Source Software Project
Trademarks and Your Free and Open Source Software ProjectAll Things Open
 
Giving a URL to All Objects using Beacons²
Giving a URL to All Objects using Beacons²Giving a URL to All Objects using Beacons²
Giving a URL to All Objects using Beacons²All Things Open
 
Open Source Systems Administration
Open Source Systems AdministrationOpen Source Systems Administration
Open Source Systems AdministrationAll Things Open
 
Sustainable Open Data Markets
Sustainable Open Data MarketsSustainable Open Data Markets
Sustainable Open Data MarketsAll Things Open
 
How Raleigh Became an Open Source City
How Raleigh Became an Open Source CityHow Raleigh Became an Open Source City
How Raleigh Became an Open Source CityAll Things Open
 
All Things Open Opening Keynote
All Things Open Opening KeynoteAll Things Open Opening Keynote
All Things Open Opening KeynoteAll Things Open
 
Open Sourcing the Public Library
Open Sourcing the Public LibraryOpen Sourcing the Public Library
Open Sourcing the Public LibraryAll Things Open
 
Software Development as a Civic Service
Software Development as a Civic ServiceSoftware Development as a Civic Service
Software Development as a Civic ServiceAll Things Open
 
The Ember.js Framework - Everything You Need To Know
The Ember.js Framework - Everything You Need To KnowThe Ember.js Framework - Everything You Need To Know
The Ember.js Framework - Everything You Need To KnowAll Things Open
 
Great Artists (Designers) Steal
Great Artists (Designers) StealGreat Artists (Designers) Steal
Great Artists (Designers) StealAll Things Open
 
What Academia Can Learn from Open Source
What Academia Can Learn from Open SourceWhat Academia Can Learn from Open Source
What Academia Can Learn from Open SourceAll Things Open
 
JavaScript and Internet Controlled Hardware Prototyping
JavaScript and Internet Controlled Hardware PrototypingJavaScript and Internet Controlled Hardware Prototyping
JavaScript and Internet Controlled Hardware PrototypingAll Things Open
 
Javascript - The Stack and Beyond
Javascript - The Stack and BeyondJavascript - The Stack and Beyond
Javascript - The Stack and BeyondAll Things Open
 
Open Source in Healthcare
Open Source in HealthcareOpen Source in Healthcare
Open Source in HealthcareAll Things Open
 
Choosing a Javascript Framework
Choosing a Javascript FrameworkChoosing a Javascript Framework
Choosing a Javascript FrameworkAll Things Open
 
The Gurubox Project: Open Source Troubleshooting Tools
The Gurubox Project: Open Source Troubleshooting ToolsThe Gurubox Project: Open Source Troubleshooting Tools
The Gurubox Project: Open Source Troubleshooting ToolsAll Things Open
 
Considerations for Operating an OpenStack Cloud
Considerations for Operating an OpenStack CloudConsiderations for Operating an OpenStack Cloud
Considerations for Operating an OpenStack CloudAll Things Open
 

Viewers also liked (20)

The Anti-Henry Ford: How 200 hour discoveries revolutionized the way we do bu...
The Anti-Henry Ford: How 200 hour discoveries revolutionized the way we do bu...The Anti-Henry Ford: How 200 hour discoveries revolutionized the way we do bu...
The Anti-Henry Ford: How 200 hour discoveries revolutionized the way we do bu...
 
Marketing is not all fluff; engineering is not all math
Marketing is not all fluff; engineering is not all mathMarketing is not all fluff; engineering is not all math
Marketing is not all fluff; engineering is not all math
 
Trademarks and Your Free and Open Source Software Project
Trademarks and Your Free and Open Source Software ProjectTrademarks and Your Free and Open Source Software Project
Trademarks and Your Free and Open Source Software Project
 
Women in Open Source
Women in Open SourceWomen in Open Source
Women in Open Source
 
Giving a URL to All Objects using Beacons²
Giving a URL to All Objects using Beacons²Giving a URL to All Objects using Beacons²
Giving a URL to All Objects using Beacons²
 
Open Source Systems Administration
Open Source Systems AdministrationOpen Source Systems Administration
Open Source Systems Administration
 
Sustainable Open Data Markets
Sustainable Open Data MarketsSustainable Open Data Markets
Sustainable Open Data Markets
 
How Raleigh Became an Open Source City
How Raleigh Became an Open Source CityHow Raleigh Became an Open Source City
How Raleigh Became an Open Source City
 
All Things Open Opening Keynote
All Things Open Opening KeynoteAll Things Open Opening Keynote
All Things Open Opening Keynote
 
Open Sourcing the Public Library
Open Sourcing the Public LibraryOpen Sourcing the Public Library
Open Sourcing the Public Library
 
Software Development as a Civic Service
Software Development as a Civic ServiceSoftware Development as a Civic Service
Software Development as a Civic Service
 
The Ember.js Framework - Everything You Need To Know
The Ember.js Framework - Everything You Need To KnowThe Ember.js Framework - Everything You Need To Know
The Ember.js Framework - Everything You Need To Know
 
Great Artists (Designers) Steal
Great Artists (Designers) StealGreat Artists (Designers) Steal
Great Artists (Designers) Steal
 
What Academia Can Learn from Open Source
What Academia Can Learn from Open SourceWhat Academia Can Learn from Open Source
What Academia Can Learn from Open Source
 
JavaScript and Internet Controlled Hardware Prototyping
JavaScript and Internet Controlled Hardware PrototypingJavaScript and Internet Controlled Hardware Prototyping
JavaScript and Internet Controlled Hardware Prototyping
 
Javascript - The Stack and Beyond
Javascript - The Stack and BeyondJavascript - The Stack and Beyond
Javascript - The Stack and Beyond
 
Open Source in Healthcare
Open Source in HealthcareOpen Source in Healthcare
Open Source in Healthcare
 
Choosing a Javascript Framework
Choosing a Javascript FrameworkChoosing a Javascript Framework
Choosing a Javascript Framework
 
The Gurubox Project: Open Source Troubleshooting Tools
The Gurubox Project: Open Source Troubleshooting ToolsThe Gurubox Project: Open Source Troubleshooting Tools
The Gurubox Project: Open Source Troubleshooting Tools
 
Considerations for Operating an OpenStack Cloud
Considerations for Operating an OpenStack CloudConsiderations for Operating an OpenStack Cloud
Considerations for Operating an OpenStack Cloud
 

Similar to Apache Spark: Lightning Fast Cluster Computing

Intro to Spark with Zeppelin
Intro to Spark with ZeppelinIntro to Spark with Zeppelin
Intro to Spark with ZeppelinHortonworks
 
Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin
Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin
Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin DataWorks Summit/Hadoop Summit
 
Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingHari Shreedharan
 
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015Cloudera, Inc.
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the SurfaceJosi Aranda
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...Debraj GuhaThakurta
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...Debraj GuhaThakurta
 
Apache spark with java 8
Apache spark with java 8Apache spark with java 8
Apache spark with java 8Janu Jahnavi
 
Apache spark with java 8
Apache spark with java 8Apache spark with java 8
Apache spark with java 8Janu Jahnavi
 
H2O Rains with Databricks Cloud - Parisoma SF
H2O Rains with Databricks Cloud - Parisoma SFH2O Rains with Databricks Cloud - Parisoma SF
H2O Rains with Databricks Cloud - Parisoma SFSri Ambati
 
Spark meets Spring
Spark meets SpringSpark meets Spring
Spark meets Springmark_fisher
 
H2O Rains with Databricks Cloud - NY 02.16.16
H2O Rains with Databricks Cloud - NY 02.16.16H2O Rains with Databricks Cloud - NY 02.16.16
H2O Rains with Databricks Cloud - NY 02.16.16Sri Ambati
 
Reactive dashboard’s using apache spark
Reactive dashboard’s using apache sparkReactive dashboard’s using apache spark
Reactive dashboard’s using apache sparkRahul Kumar
 
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...Alex Zeltov
 
Big Data Analytics Platforms by KTH and RISE SICS
Big Data Analytics Platforms by KTH and RISE SICSBig Data Analytics Platforms by KTH and RISE SICS
Big Data Analytics Platforms by KTH and RISE SICSBig Data Value Association
 

Similar to Apache Spark: Lightning Fast Cluster Computing (20)

Intro to Spark with Zeppelin
Intro to Spark with ZeppelinIntro to Spark with Zeppelin
Intro to Spark with Zeppelin
 
Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin
Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin
Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin
 
Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark Streaming
 
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
 
Spark 101
Spark 101Spark 101
Spark 101
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the Surface
 
Spark mhug2
Spark mhug2Spark mhug2
Spark mhug2
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
 
Apache spark with java 8
Apache spark with java 8Apache spark with java 8
Apache spark with java 8
 
Apache spark with java 8
Apache spark with java 8Apache spark with java 8
Apache spark with java 8
 
H2O Rains with Databricks Cloud - Parisoma SF
H2O Rains with Databricks Cloud - Parisoma SFH2O Rains with Databricks Cloud - Parisoma SF
H2O Rains with Databricks Cloud - Parisoma SF
 
Apache Spark on HDinsight Training
Apache Spark on HDinsight TrainingApache Spark on HDinsight Training
Apache Spark on HDinsight Training
 
Spark meets Spring
Spark meets SpringSpark meets Spring
Spark meets Spring
 
H2O Rains with Databricks Cloud - NY 02.16.16
H2O Rains with Databricks Cloud - NY 02.16.16H2O Rains with Databricks Cloud - NY 02.16.16
H2O Rains with Databricks Cloud - NY 02.16.16
 
Sparkflows.io
Sparkflows.ioSparkflows.io
Sparkflows.io
 
Reactive dashboard’s using apache spark
Reactive dashboard’s using apache sparkReactive dashboard’s using apache spark
Reactive dashboard’s using apache spark
 
Spark_Part 1
Spark_Part 1Spark_Part 1
Spark_Part 1
 
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
 
Big Data Analytics Platforms by KTH and RISE SICS
Big Data Analytics Platforms by KTH and RISE SICSBig Data Analytics Platforms by KTH and RISE SICS
Big Data Analytics Platforms by KTH and RISE SICS
 

More from All Things Open

Building Reliability - The Realities of Observability
Building Reliability - The Realities of ObservabilityBuilding Reliability - The Realities of Observability
Building Reliability - The Realities of ObservabilityAll Things Open
 
Modern Database Best Practices
Modern Database Best PracticesModern Database Best Practices
Modern Database Best PracticesAll Things Open
 
Open Source and Public Policy
Open Source and Public PolicyOpen Source and Public Policy
Open Source and Public PolicyAll Things Open
 
Weaving Microservices into a Unified GraphQL Schema with graph-quilt - Ashpak...
Weaving Microservices into a Unified GraphQL Schema with graph-quilt - Ashpak...Weaving Microservices into a Unified GraphQL Schema with graph-quilt - Ashpak...
Weaving Microservices into a Unified GraphQL Schema with graph-quilt - Ashpak...All Things Open
 
The State of Passwordless Auth on the Web - Phil Nash
The State of Passwordless Auth on the Web - Phil NashThe State of Passwordless Auth on the Web - Phil Nash
The State of Passwordless Auth on the Web - Phil NashAll Things Open
 
Total ReDoS: The dangers of regex in JavaScript
Total ReDoS: The dangers of regex in JavaScriptTotal ReDoS: The dangers of regex in JavaScript
Total ReDoS: The dangers of regex in JavaScriptAll Things Open
 
What Does Real World Mass Adoption of Decentralized Tech Look Like?
What Does Real World Mass Adoption of Decentralized Tech Look Like?What Does Real World Mass Adoption of Decentralized Tech Look Like?
What Does Real World Mass Adoption of Decentralized Tech Look Like?All Things Open
 
How to Write & Deploy a Smart Contract
How to Write & Deploy a Smart ContractHow to Write & Deploy a Smart Contract
How to Write & Deploy a Smart ContractAll Things Open
 
Spinning Your Drones with Cadence Workflows, Apache Kafka and TensorFlow
 Spinning Your Drones with Cadence Workflows, Apache Kafka and TensorFlow Spinning Your Drones with Cadence Workflows, Apache Kafka and TensorFlow
Spinning Your Drones with Cadence Workflows, Apache Kafka and TensorFlowAll Things Open
 
DEI Challenges and Success
DEI Challenges and SuccessDEI Challenges and Success
DEI Challenges and SuccessAll Things Open
 
Scaling Web Applications with Background
Scaling Web Applications with BackgroundScaling Web Applications with Background
Scaling Web Applications with BackgroundAll Things Open
 
Supercharging tutorials with WebAssembly
Supercharging tutorials with WebAssemblySupercharging tutorials with WebAssembly
Supercharging tutorials with WebAssemblyAll Things Open
 
Using SQL to Find Needles in Haystacks
Using SQL to Find Needles in HaystacksUsing SQL to Find Needles in Haystacks
Using SQL to Find Needles in HaystacksAll Things Open
 
Configuration Security as a Game of Pursuit Intercept
Configuration Security as a Game of Pursuit InterceptConfiguration Security as a Game of Pursuit Intercept
Configuration Security as a Game of Pursuit InterceptAll Things Open
 
Scaling an Open Source Sponsorship Program
Scaling an Open Source Sponsorship ProgramScaling an Open Source Sponsorship Program
Scaling an Open Source Sponsorship ProgramAll Things Open
 
Build Developer Experience Teams for Open Source
Build Developer Experience Teams for Open SourceBuild Developer Experience Teams for Open Source
Build Developer Experience Teams for Open SourceAll Things Open
 
Deploying Models at Scale with Apache Beam
Deploying Models at Scale with Apache BeamDeploying Models at Scale with Apache Beam
Deploying Models at Scale with Apache BeamAll Things Open
 
Sudo – Giving access while staying in control
Sudo – Giving access while staying in controlSudo – Giving access while staying in control
Sudo – Giving access while staying in controlAll Things Open
 
Fortifying the Future: Tackling Security Challenges in AI/ML Applications
Fortifying the Future: Tackling Security Challenges in AI/ML ApplicationsFortifying the Future: Tackling Security Challenges in AI/ML Applications
Fortifying the Future: Tackling Security Challenges in AI/ML ApplicationsAll Things Open
 
Securing Cloud Resources Deployed with Control Planes on Kubernetes using Gov...
Securing Cloud Resources Deployed with Control Planes on Kubernetes using Gov...Securing Cloud Resources Deployed with Control Planes on Kubernetes using Gov...
Securing Cloud Resources Deployed with Control Planes on Kubernetes using Gov...All Things Open
 

More from All Things Open (20)

Building Reliability - The Realities of Observability
Building Reliability - The Realities of ObservabilityBuilding Reliability - The Realities of Observability
Building Reliability - The Realities of Observability
 
Modern Database Best Practices
Modern Database Best PracticesModern Database Best Practices
Modern Database Best Practices
 
Open Source and Public Policy
Open Source and Public PolicyOpen Source and Public Policy
Open Source and Public Policy
 
Weaving Microservices into a Unified GraphQL Schema with graph-quilt - Ashpak...
Weaving Microservices into a Unified GraphQL Schema with graph-quilt - Ashpak...Weaving Microservices into a Unified GraphQL Schema with graph-quilt - Ashpak...
Weaving Microservices into a Unified GraphQL Schema with graph-quilt - Ashpak...
 
The State of Passwordless Auth on the Web - Phil Nash
The State of Passwordless Auth on the Web - Phil NashThe State of Passwordless Auth on the Web - Phil Nash
The State of Passwordless Auth on the Web - Phil Nash
 
Total ReDoS: The dangers of regex in JavaScript
Total ReDoS: The dangers of regex in JavaScriptTotal ReDoS: The dangers of regex in JavaScript
Total ReDoS: The dangers of regex in JavaScript
 
What Does Real World Mass Adoption of Decentralized Tech Look Like?
What Does Real World Mass Adoption of Decentralized Tech Look Like?What Does Real World Mass Adoption of Decentralized Tech Look Like?
What Does Real World Mass Adoption of Decentralized Tech Look Like?
 
How to Write & Deploy a Smart Contract
How to Write & Deploy a Smart ContractHow to Write & Deploy a Smart Contract
How to Write & Deploy a Smart Contract
 
Spinning Your Drones with Cadence Workflows, Apache Kafka and TensorFlow
 Spinning Your Drones with Cadence Workflows, Apache Kafka and TensorFlow Spinning Your Drones with Cadence Workflows, Apache Kafka and TensorFlow
Spinning Your Drones with Cadence Workflows, Apache Kafka and TensorFlow
 
DEI Challenges and Success
DEI Challenges and SuccessDEI Challenges and Success
DEI Challenges and Success
 
Scaling Web Applications with Background
Scaling Web Applications with BackgroundScaling Web Applications with Background
Scaling Web Applications with Background
 
Supercharging tutorials with WebAssembly
Supercharging tutorials with WebAssemblySupercharging tutorials with WebAssembly
Supercharging tutorials with WebAssembly
 
Using SQL to Find Needles in Haystacks
Using SQL to Find Needles in HaystacksUsing SQL to Find Needles in Haystacks
Using SQL to Find Needles in Haystacks
 
Configuration Security as a Game of Pursuit Intercept
Configuration Security as a Game of Pursuit InterceptConfiguration Security as a Game of Pursuit Intercept
Configuration Security as a Game of Pursuit Intercept
 
Scaling an Open Source Sponsorship Program
Scaling an Open Source Sponsorship ProgramScaling an Open Source Sponsorship Program
Scaling an Open Source Sponsorship Program
 
Build Developer Experience Teams for Open Source
Build Developer Experience Teams for Open SourceBuild Developer Experience Teams for Open Source
Build Developer Experience Teams for Open Source
 
Deploying Models at Scale with Apache Beam
Deploying Models at Scale with Apache BeamDeploying Models at Scale with Apache Beam
Deploying Models at Scale with Apache Beam
 
Sudo – Giving access while staying in control
Sudo – Giving access while staying in controlSudo – Giving access while staying in control
Sudo – Giving access while staying in control
 
Fortifying the Future: Tackling Security Challenges in AI/ML Applications
Fortifying the Future: Tackling Security Challenges in AI/ML ApplicationsFortifying the Future: Tackling Security Challenges in AI/ML Applications
Fortifying the Future: Tackling Security Challenges in AI/ML Applications
 
Securing Cloud Resources Deployed with Control Planes on Kubernetes using Gov...
Securing Cloud Resources Deployed with Control Planes on Kubernetes using Gov...Securing Cloud Resources Deployed with Control Planes on Kubernetes using Gov...
Securing Cloud Resources Deployed with Control Planes on Kubernetes using Gov...
 

Recently uploaded

Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024The Digital Insurer
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Zilliz
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 

Recently uploaded (20)

Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 

Apache Spark: Lightning Fast Cluster Computing

  • 1. Apache Spark Lightening Fast Cluster Computing Eric Mizell – Director, Solution Engineering
  • 2. Page2 © Hortonworks Inc. 2011 – 2015. All Rights Reserved What is Apache Spark? Apache Open Source Project Distributed Compute Engine for fast and expressive data processing Designed for Iterative, In-Memory computations and interactive data mining Expressive Multi-Language APIs for Java, Scala, Python, and R Powerful Abstractions Enable data workers to rapidly iterate over data for: • ETL, Machine Learning, SQL, Stream Processing, and Graph Processing Scala Java Python APIs Spark Core EngineSpark Core Engine GraphX Spark SQL Spark Streaming MLlib
  • 3. Page3 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Why Spark? Elegant Developer APIs • Data Frames/SQL, Machine Learning, Graph algorithms and streaming • Scala, Python, Java and R • Single environment for pre-processing and Machine Learning In-memory computation model • Effective for iterative computations and machine learning Machine Learning On Hadoop • Implementation of distributed ML-algorithms • Pipeline API (Spark ML) Runs on Hadoop on YARN, Mesos, standalone
  • 4. Page4 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Interactions with Spark Command Line • Scala shell – Scala/Java (./bin/spark-shell) • Python - (./bin/pyspark) Notebooks • Apache Zeppelin Notebook • Juptyer/IPython Notebook • IRuby Notebook ODBC/JDBC (Spark SQL only via Thrift) • Simba driver • DataDirect driver
  • 5. Page5 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Introducing Apache Zeppelin Web-based Notebook for interactive analytics Features Ad-hoc experimentation Deeply integrated with Spark + Hadoop Supports multiple language backends Incubating at Apache Use Case Data exploration and discovery Visualization Interactive snippet-at-a-time experience “Modern Data Science Studio”
  • 6. Page6 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Fundamental Abstraction: Resilient Distributed Datasets RDD Work with distributed collections as primitives RDD Properties • Immutable collections of objects spread across a cluster • Built through parallel transformations (map, filter, etc.) • Automatically rebuilt on failure • Controllable persistence (e.g. caching in RAM) Multiple Languages broad developer, partner and customer engagement RDD Partition 1 RDD Partition 2 RDD Partition 3Worker Node Worker Node Worker Node RDD LogicalSpark Driver sc = new SparkContext rDD =sc.textfile(“hdfs://…”) rDD.filter(…) rDD.Cache rDD.Count rDD.map … Developer Physical Writes RDD RDDs are collections of objects distributed across a cluster, cached in RAM or on disk. They are built through parallel transformations, automatically rebuilt on failure and immutable (each transformation creates a new RDD).
  • 7. Page7 © Hortonworks Inc. 2011 – 2015. All Rights Reserved What can developers do with RDDs? RDD Operations Transformations • e.g. map, filter, groupBy, join • Lazy operations to build RDDs from other RDDs Actions • e.g. count, collect, save • Return a result or write it to storage Other primitives • Accumulator • Broadcast Variables Developer Writes RDD Operations Writes Accumulator s Actions Broadcast Variables Transformations
  • 8. Page8 © Hortonworks Inc. 2011 – 2015. All Rights Reserved lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) messages = errors.map(lambda s: s.split(‘t’)[2]) messages.cache() messages.filter(lambda s: “foo” in s).count() messages.filter(lambda s: “bar” in s).count() . . . Base RDD Transformed RDD Action Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data) Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data) Example: Mining Console Logs Load error messages from a log into memory, then interactively search for patterns
  • 9. Page9 © Hortonworks Inc. 2011 – 2015. All Rights Reserved RDD Demo
  • 10. Page10 © Hortonworks Inc. 2011 – 2015. All Rights Reserved SQL SQL Access and Data Frames YARN HDFS Scala Java Python APIs Spark Core EngineSpark Core Engine GraphX Spark Streaming MLlib Spark SQL
  • 11. Page11 © Hortonworks Inc. 2011 – 2015. All Rights Reserved YARN HDFS Spark SQL Table Structure integrated to work with tables and rows Hive Queries via Spark by Spark SQL Context can connect to Hive and query Hive Bindings to Python, Scala, Java, and R Data Frames new abstractions simplifies and speeds up SQL processing Spark Core Engine Spark SQL Data Frame DSL Spark SQL Data Frame API Data Source API
  • 12. Page12 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Storage What are Data Frames? Data Frames represent data in RDDs as a Table RDD is a low level abstraction –Think of RDD as bytecode and DataFrame as the Java Program Data Frame Properties –Data Frames attach schema to RDDs –Allows users to perform aggressive query optimizations –Brings the power of SQL to RDDs! dept name age Bio H Smith 48 CS A Turing 54 Bio B Jones 43 Phys E Witten 61 Tuple Relational View Columnar Storage ORCFile Parquet Unstructured Data JSON CSV Text Avro Custom Weblog
  • 13. Page13 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Data Frames are intuitive RDD Example Equivalent Data Frame Example dept name age Bio H Smith 48 CS A Turing 54 Bio B Jones 43 Phys E Witten 61 Find average age by department?
  • 14. Page14 © Hortonworks Inc. 2011 – 2015. All Rights Reserved DataFrame Demo YARN HDFS Scala Java Python APIs Spark Core EngineSpark Core Engine GraphX Spark Streaming MLlib Spark SQL
  • 15. Page15 © Hortonworks Inc. 2011 – 2015. All Rights Reserved MLlib Machine Learning Library YARN HDFS Scala Java Python APIs Spark Core EngineSpark Core Engine GraphX Spark SQL Spark Streaming MLlib
  • 16. Page16 © Hortonworks Inc. 2011 – 2015. All Rights Reserved What is Machine Learning? Machine learning is the study of algorithms that learn concepts from data. A key aspect of learning is generalization: how well a learning algorithm is able to predict on unseen examples.
  • 17. Page17 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Machine Learning Primitives Unsupervised Learning Clustering (K-means) Recommendation Collaborative Filtering - alternating least squares Dimensionality Reductions - Principal component analysis (PCA) and singular value decomposition (SVD) Supervised Learning Classification - Naïve Bayes, Decision Tree, Random Forest, Gradient Boosted Trees Regression - linear, logistic and Support Vector Machines (SVMs)
  • 18. Page18 © Hortonworks Inc. 2011 – 2015. All Rights Reserved ML Workflows are complex Q-Q Q-A similarit y Log Parsing, Cleanin g Ad category mapping Query category mapping Poly Exp (Q-A) Feature s Model Linear Solver train test Metrics • Feature Extraction Feature Extraction Ad Server Sponsored Search Advertising Pipeline Challenges: -> specify pipeline -> inspect and debug -> tune hyperparameters -> productionize HDFS
  • 19. Page19 © Hortonworks Inc. 2011 – 2015. All Rights Reserved ML Pipeline makes ML workflows easier Transformer Transforms one dataset into another Estimator Fits model to data Pipeline Sequence of stages, consisting of estimators or transformers Parameters Trait for components that take parameters
  • 20. Page20 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Streaming Real Time Stream Processing YARN HDFS Scala Java Python APIs Spark Core EngineSpark Core Engine GraphX Spark SQL MLlib Spark Streaming
  • 21. Page21 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Spark Streaming • Spark Streaming is an extension of Spark-core API that supports scalable, high throughput and fault-tolerant streaming applications. • Data can be ingested from many data sources like Kafka, Flume, Twitter, ZeroMQ or TCP sockets • Data is processed using the now-familiar API: map, filter, reduce, join and window • Processed data can be stored in databases, filesystems, or live dashboards
  • 22. Page22 © Hortonworks Inc. 2011 – 2015. All Rights Reserved GraphX Graph Processing YARN HDFS Scala Java Python APIs Spark Core EngineSpark Core Engine Spark SQL Spark Streaming MLlib GraphX
  • 23. Page23 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Spark GraphX Graph API on Spark Seamlessly work with graphs and collections Growing library of graph algorithms • SVD++, Connected Components, Triangle Count, … Iterative Graph Computations using Pregel Implements Valiant’s Bulk Synchronous Parallel (BSP) model for distributing graph algorithms. Use Case Social Media: Suggest new connections based on existing relationships Networking: Best routing through a given network
  • 24. Page24 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Part. 2 Part. 1 Vertex Table (RDD) B C A D F E A D Distributed Graphs as Tables (RDDs) D Property Graph B C D E AA F Edge Table (RDD) A B A C C D B C A E A F E F E D B C D E A F Routing Table (RDD) B C D E A F 1 2 1 2 1 2 1 2 2D Vertex Cut Heuristic
  • 25. Page25 © Hortonworks Inc. 2011 – 2015. All Rights Reserved How to Get Started with Spark
  • 26. Page26 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Try Spark Today Download the Hortonworks Sandbox http://hortonworks.com/products/hortonworks-sandbox/ Go to the Apache Spark Website http://spark.apache.org/ Learn Spark Build a Proof of Concept Test New Functionality
  • 27. Page27 © Hortonworks Inc. 2011 – 2015. All Rights Reserved © Hortonworks Inc. 2013 Thank You! Eric Mizell - Director, Solutions Engineering emizell@hortonworks.com

Editor's Notes

  1. NEED SPEAKER NOTES
  2. NEED SPEAKER NOTES
  3. NEED SPEAKER NOTES
  4. TALK TRACK Ad-hoc experimentation Spark, Hive, Shell, Flink, Tajo, Ignite, Lens, etc Deeply integrated with Spark + Hadoop Can be managed via Ambari Stacks Supports multiple language backends Pluggable “Interpreters” Incubating at Apache 100% open source and open community [NEXT SLIDE]
  5. TALK TRACK Ensuring Spark is well integrated with YARN, Ambari, and Ranger enables enterprise to deploy Spark apps with confidence, and since HDP is available across Windows, Linux, on-premises and cloud deployment environments, it just makes it that much easier for enterprises to adopt it. [NEXT SLIDE] http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pd
  6. TALK TRACK Ensuring Spark is well integrated with YARN, Ambari, and Ranger enables enterprise to deploy Spark apps with confidence, and since HDP is available across Windows, Linux, on-premises and cloud deployment environments, it just makes it that much easier for enterprises to adopt it. [NEXT SLIDE] http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pd
  7. Key idea: add “variables” to the “functions” in functional programming
  8. NEED SPEAKER NOTES
  9. NEED SPEAKER NOTES
  10. NEED SPEAKER NOTES
  11. Spark DataFrames represent tabular Data
  12. NEED SPEAKER NOTES
  13. NEED SPEAKER NOTES
  14. NEED SPEAKER NOTES
  15. TALK TRACK Ensuring Spark is well integrated with YARN, Ambari, and Ranger enables enterprise to deploy Spark apps with confidence, and since HDP is available across Windows, Linux, on-premises and cloud deployment environments, it just makes it that much easier for enterprises to adopt it. [NEXT SLIDE]
  16. TALK TRACK [NEXT SLIDE]
  17. NEED SPEAKER NOTES
  18. NEED SPEAKER NOTES
  19. TALK TRACK Ensuring Spark is well integrated with YARN, Ambari, and Ranger enables enterprise to deploy Spark apps with confidence, and since HDP is available across Windows, Linux, on-premises and cloud deployment environments, it just makes it that much easier for enterprises to adopt it. [NEXT SLIDE] [RESOURCES] A vertex is an entity that can bring a bag of data (generally small) An edge connects vertices and can also own a bag of data https://amplab.cs.berkeley.edu/wp-content/uploads/2013/05/grades-graphx_with_fonts.pdf
  20. Takeaways Change order of interoperability slide Flush out no lock-in slide to talk about “proprietary open source”