Spark & Java
NCDevCon
Raleigh, NC
October (5+2)th 2017
Jean Georges Perrin
Software whatever since 1983
x9
@jgperrin
http://jgp.net [blog]
http://oplo.io [oplo]
Who art thou?
๏ Experience with Spark?
๏ Experience with Hadoop?
๏ Experience with Scala?
๏ Java?
๏ PHP guru?
๏ Front-end developer?
But most importantly…
๏ … who is not a developer?
Agenda
๏ What is Spark?
๏ What can I do with Spark?
๏ What is a Spark app, anyway?
๏ Install a bunch of software
๏ A first example
๏ Understand what just happened
๏ Another example, slightly more complex, because you are now ready
๏ But now, sincerely what just happened?
๏ More and more examples (time permitting!)
Caution
First time I am doing a hands-on tutorial
Tons of content
Unknown crowd
Unknown setting
Analytics Operating System
An Analytics Operating System?
[Diagram, built up in steps: on a single machine, Apps run on an OS, which runs on Hardware. On a cluster, each node has its own Hardware and OS; a Distributed OS spans the nodes, an Analytics OS sits on top of it, and Apps run on the Analytics OS.]
Use Cases
๏ NCEatery.com
๏ Restaurant analytics
๏ 1.57×10^21 datapoints analyzed
๏ (they are hiring!)
๏ General compute
๏ Distributed data transfer
๏ IBM
๏ DSX (Data Science Experience)
๏ Event Store - http://jgp.net/2017/06/22/spark-boosts-ibm-event-store/
๏ Z
๏ Data wrangling solution
What Does a Typical App Look Like?
Connect to the cluster
Load Data
Do something with the data
Share the results
Convinced?
Let's go!
http://bit.ly/spark-clego
Java Development Tools
๏ Java JDK 1.8
๏ http://bit.ly/javadk8
๏ Eclipse Oxygen
๏ http://bit.ly/eclipseo2
๏ Other nice-to-haves
๏ Maven
๏ SourceTree or git (command line)
http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
http://www.eclipse.org/downloads/eclipse-packages/
Get the C O D E
๏ GitHub
๏ http://bit.ly/SparkJavaCookbookCode
https://github.com/jgperrin/net.jgp.labs.spark
git clone https://github.com/jgperrin/net.jgp.labs.spark.git
Getting Deeper
๏ Go to net.jgp.labs.spark.l000_ingestion.l000_csv
๏ Open CsvToDatasetApp.java
๏ Right click, Run As, Java Application
Working directory = /Users/jgp/git/net.jgp.labs.spark
+---+---+
|_c0|_c1|
+---+---+
|  1|  5|
|  2| 13|
|  3| 27|
|  4| 39|
|  5| 41|
|  6| 55|
+---+---+
package net.jgp.labs.spark.l000_ingestion.l000_csv;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CsvToDatasetApp {

  public static void main(String[] args) {
    System.out.println("Working directory = " + System.getProperty("user.dir"));
    CsvToDatasetApp app = new CsvToDatasetApp();
    app.start();
  }

  private void start() {
    // Create (or reuse) a session that runs Spark locally, inside this JVM
    SparkSession spark = SparkSession.builder()
        .appName("CSV to Dataset")
        .master("local")
        .getOrCreate();

    // Read the CSV file into a dataframe; the file has no header row,
    // so the columns are named _c0, _c1 and their types are inferred
    String filename = "data/tuple-data-file.csv";
    Dataset<Row> df = spark.read().format("csv")
        .option("inferSchema", "true")
        .option("header", "false")
        .load(filename);

    // Print the first rows as a table
    df.show();
  }
}
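To make the two reader options in this example less magical, here is a rough plain-Java analogue, with no Spark dependency, of what header=false and inferSchema=true amount to on a tiny file: columns get positional names (_c0, _c1, as in the output shown earlier), and a column is treated as an integer only if every value parses as one. The class CsvSchemaSketch, its methods, and the sample rows are hypothetical and exist only for illustration; Spark's real inference covers many more types and edge cases.

```java
import java.util.ArrayList;
import java.util.List;

public class CsvSchemaSketch {

  /** Positional column names, mimicking the _c0, _c1 naming used when there is no header. */
  public static List<String> columnNames(int columnCount) {
    List<String> names = new ArrayList<>();
    for (int i = 0; i < columnCount; i++) {
      names.add("_c" + i);
    }
    return names;
  }

  /** Returns "integer" if every value in the column parses as an int, else "string". */
  public static String inferType(List<String[]> rows, int column) {
    for (String[] row : rows) {
      try {
        Integer.parseInt(row[column].trim());
      } catch (NumberFormatException e) {
        return "string";
      }
    }
    return "integer";
  }

  /** Splits raw CSV text into rows of fields (no quoting support in this sketch). */
  public static List<String[]> parse(String csv) {
    List<String[]> rows = new ArrayList<>();
    for (String line : csv.split("\n")) {
      if (!line.trim().isEmpty()) {
        rows.add(line.split(","));
      }
    }
    return rows;
  }

  public static void main(String[] args) {
    // Hypothetical sample, shaped like the first rows of the output above
    String csv = "1,5\n2,13\n3,27\n";
    List<String[]> rows = parse(csv);
    List<String> names = columnNames(rows.get(0).length);
    for (int i = 0; i < names.size(); i++) {
      System.out.println(names.get(i) + ": " + inferType(rows, i));
    }
  }
}
```

Changing one value in the sample to a word makes that column fall back to "string", which is the same kind of decision schema inference has to make on real data.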
So what happened?
Let’s try to understand a little more
Apache Spark
[Diagram: Apache Spark is made of four libraries: Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX (graph).]
[Diagram: Your Application sits on a Unified API (Spark SQL, Spark Streaming, Machine Learning (& Deep Learning), GraphX), which spans Node 1 through Node 5 and more, each node with its own OS and hardware.]
[Diagram: the same stack, now with the DataFrame shown alongside Spark SQL, Spark Streaming, Machine Learning (& Deep Learning), and GraphX.]
A bit of Analytics
But really just a bit
Basic Analytics
๏ Go to net.jgp.labs.spark.l200_join.l030_count_books
๏ Open AuthorsAndBooksCountBooksApp.java
๏ Right click, Run As, Java Application
package net.jgp.labs.spark.l200_join.l030_count_books;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class AuthorsAndBooksCountBooksApp {

  public static void main(String[] args) {
    AuthorsAndBooksCountBooksApp app = new AuthorsAndBooksCountBooksApp();
    app.start();
  }

  private void start() {
    SparkSession spark = SparkSession.builder()
        .appName("Authors and Books")
        .master("local").getOrCreate();

    // Load the authors; this file has a header row
    String filename = "data/authors.csv";
    Dataset<Row> authorsDf = spark.read()
        .format("csv")
        .option("inferSchema", "true")
        .option("header", "true")
        .load(filename);
    authorsDf.show();

    // Load the books the same way
    filename = "data/books.csv";
    Dataset<Row> booksDf = spark.read()
        .format("csv")
        .option("inferSchema", "true")
        .option("header", "true")
        .load(filename);
    booksDf.show();

    // Left join keeps every author, even one without any book,
    // then the joined rows are counted per author
    Dataset<Row> libraryDf = authorsDf
        .join(
            booksDf,
            authorsDf.col("id").equalTo(booksDf.col("authorId")),
            "left")
        .withColumn("bookId", booksDf.col("id"))
        .drop(booksDf.col("id"))
        .groupBy(
            authorsDf.col("id"),
            authorsDf.col("name"),
            authorsDf.col("link"))
        .count();

    libraryDf.show();
    libraryDf.printSchema();
  }
}
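The join-then-count pattern in this example can be traced by hand with plain Java collections. The sketch below is not Spark code and uses hypothetical names (LeftJoinCountSketch, Book, JoinedRow) and made-up ids; it assumes, as is my understanding of groupBy(...).count(), that the count is a count of joined rows. Under that assumption, an author without any book still contributes one row to the left join and therefore shows a count of 1 rather than 0.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LeftJoinCountSketch {

  /** A book, reduced to its own id and the id of its author. */
  public static final class Book {
    final int id;
    final int authorId;

    public Book(int id, int authorId) {
      this.id = id;
      this.authorId = authorId;
    }
  }

  /** One row of the join result; bookId is null when the author has no book. */
  public static final class JoinedRow {
    final int authorId;
    final Integer bookId;

    JoinedRow(int authorId, Integer bookId) {
      this.authorId = authorId;
      this.bookId = bookId;
    }
  }

  /** Left join: every author appears at least once, padded with a null book if needed. */
  public static List<JoinedRow> leftJoin(List<Integer> authorIds, List<Book> books) {
    List<JoinedRow> rows = new ArrayList<>();
    for (int authorId : authorIds) {
      boolean matched = false;
      for (Book book : books) {
        if (book.authorId == authorId) {
          rows.add(new JoinedRow(authorId, book.id));
          matched = true;
        }
      }
      if (!matched) {
        rows.add(new JoinedRow(authorId, null));
      }
    }
    return rows;
  }

  /** Group by author id and count rows (not non-null books). */
  public static Map<Integer, Long> countRows(List<JoinedRow> rows) {
    Map<Integer, Long> counts = new LinkedHashMap<>();
    for (JoinedRow row : rows) {
      counts.merge(row.authorId, 1L, Long::sum);
    }
    return counts;
  }

  public static void main(String[] args) {
    // Hypothetical data: author 1 has two books, author 2 has one, author 3 has none
    List<Integer> authorIds = Arrays.asList(1, 2, 3);
    List<Book> books = Arrays.asList(new Book(10, 1), new Book(11, 1), new Book(12, 2));
    System.out.println(countRows(leftJoin(authorIds, books))); // prints {1=2, 2=1, 3=1}
  }
}
```

If a zero is wanted for authors without books, counting the non-null book ids instead of the rows is the usual remedy.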
The Art of Delegating
[Diagram: Your app runs in the Driver. The Master (Cluster Manager) hands work to the Slaves (Workers); each Worker hosts an Executor, and each Executor runs several Tasks.]
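As a loose single-machine analogy for the delegation pictured above, the sketch below plays the role of the driver: it cuts the data into partitions, hands one task per partition to a small thread pool standing in for the executors, and combines the partial results. DelegationSketch and distributedSum are hypothetical names, and the sketch assumes a positive number of partitions and workers; a real Spark cluster adds scheduling, serialization, and fault tolerance that this ignores.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class DelegationSketch {

  /** Sums the data by running one task per partition and adding up the partial sums. */
  public static long distributedSum(List<Integer> data, int partitions, int workers) {
    // The thread pool stands in for the executors on the workers
    ExecutorService executors = Executors.newFixedThreadPool(workers);
    try {
      List<Callable<Long>> tasks = new ArrayList<>();
      int size = (data.size() + partitions - 1) / partitions; // ceiling division
      for (int start = 0; start < data.size(); start += size) {
        final List<Integer> partition =
            data.subList(start, Math.min(start + size, data.size()));
        // Each task only sees its own partition
        tasks.add(() -> {
          long sum = 0;
          for (int value : partition) {
            sum += value;
          }
          return sum;
        });
      }
      long total = 0;
      // The "driver" waits for every task and gathers the partial results
      for (Future<Long> result : executors.invokeAll(tasks)) {
        total += result.get();
      }
      return total;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for tasks", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("a task failed", e);
    } finally {
      executors.shutdown();
    }
  }

  public static void main(String[] args) {
    List<Integer> data = new ArrayList<>();
    for (int i = 1; i <= 100; i++) {
      data.add(i);
    }
    // 4 partitions, 2 worker threads
    System.out.println(distributedSum(data, 4, 2)); // prints 5050
  }
}
```

The number of partitions and the number of workers are independent here, which mirrors the picture: an executor can run several tasks, one after the other or side by side.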
Conclusion
A (Big) Data Scenario
[Diagram: Data → Ingestion → Raw Data → Data Quality → Pure Data → Transformation → Rich Data → Load/Publish → Data]
What You Learned
๏ Big Data is easier than one could think
๏ Java is the way to go (or Python)
๏ New vocabulary for using Spark
๏ You have a friend to help (ok, me)
๏ Spark is fun
Going Further
๏ Run more code from the examples (I add some weekly)
๏ Contact me @jgperrin
๏ Join the Spark User mailing list
๏ Get help from Stack Overflow
๏ Watch for my book on Spark + Java to come!
Thanks
@jgperrin

 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceCALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 

Spark hands-on tutorial (rev. 002)

  • 2. Jean Georges Perrin Software whatever since 1983 x9 @jgperrin http://jgp.net [blog] http://oplo.io [oplo]
  • 3. Who art thou? ๏ Experience with Spark? ๏ Experience with Hadoop? ๏ Experience with Scala? ๏ Java? ๏ PHP guru? ๏ Front-end developer?
  • 4. But most importantly… ๏ … who is not a developer?
  • 5. Agenda ๏ What is Spark? ๏ What can I do with Spark? ๏ What is a Spark app, anyway? ๏ Install a bunch of software ๏ A first example ๏ Understand what just happened ๏ Another example, slightly more complex, because you are now ready ๏ But now, sincerely, what just happened? ๏ More and more examples (time permitting!)
  • 6. Caution First time I am doing a hands-on tutorial Tons of content Unknown crowd Unknown setting
  • 8. An Analytics Operating System? [Diagram: Hardware / OS / Apps]
  • 9. An Analytics Operating System? [Diagram: Hardware / OS / Apps, next to Hardware, Hardware / OS, OS / Apps]
  • 10. An Analytics Operating System? [Diagram: Hardware / OS / Apps, next to Hardware, Hardware / OS, OS / Distrib. / Analytics / Apps]
  • 11. An Analytics Operating System? [Diagram: as on slide 10, plus Hardware, Hardware / OS, OS / Distributed OS / Analytics OS / Apps]
  • 12. An Analytics Operating System? [Diagram: Hardware, Hardware / OS, OS / Distributed OS / Analytics OS / Apps]
  • 13. An Analytics Operating System? [Diagram: Hardware, Hardware / OS, OS / Distributed OS / Analytics OS / Apps]
  • 14. Use Cases ๏ NCEatery.com ๏ Restaurant analytics ๏ 1.57×10^21 datapoints analyzed ๏ (they are hiring!) ๏ General compute ๏ Distributed data transfer ๏ IBM ๏ DSX (Data Science Experience) ๏ Event Store - http://jgp.net/2017/06/22/spark-boosts-ibm-event-store/ ๏ Z ๏ Data wrangling solution
  • 15. What Does a Typical App Look Like? Connect to the cluster Load Data Do something with the data Share the results
  • 18. Java Development Tools ๏ Java JDK 1.8 ๏ http://bit.ly/javadk8 ๏ Eclipse Oxygen ๏ http://bit.ly/eclipseo2 ๏ Other nice to have ๏ Maven ๏ SourceTree or git (command line) http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html http://www.eclipse.org/downloads/eclipse-packages/
  • 19. Get the CODE ๏ GitHub ๏ http://bit.ly/SparkJavaCookbookCode https://github.com/jgperrin/net.jgp.labs.spark git clone https://github.com/jgperrin/net.jgp.labs.spark.git
  • 20. Getting Deeper ๏ Go to net.jgp.labs.spark.l000_ingestion.l000_csv ๏ Open CsvToDatasetApp.java ๏ Right click, Run As, Java Application
  • 21. Working directory = /Users/jgp/git/net.jgp.labs.spark
      +---+---+
      |_c0|_c1|
      +---+---+
      |  1|  5|
      |  2| 13|
      |  3| 27|
      |  4| 39|
      |  5| 41|
      |  6| 55|
      +---+---+
  • 22.
      package net.jgp.labs.spark.l000_ingestion.l000_csv;

      import org.apache.spark.sql.Dataset;
      import org.apache.spark.sql.Row;
      import org.apache.spark.sql.SparkSession;

      public class CsvToDatasetApp {

        public static void main(String[] args) {
          System.out.println("Working directory = " + System.getProperty("user.dir"));
          CsvToDatasetApp app = new CsvToDatasetApp();
          app.start();
        }

        private void start() {
          SparkSession spark = SparkSession.builder()
              .appName("CSV to Dataset")
              .master("local")
              .getOrCreate();

          String filename = "data/tuple-data-file.csv";
          Dataset<Row> df = spark.read().format("csv")
              .option("inferSchema", "true")
              .option("header", "false")
              .load(filename);
          df.show();
        }
      }
  • 23. So what happened? Let’s try to understand a little more
  • 25. [Diagram: Your Application, on a Unified API (Spark SQL, Spark Streaming, Machine Learning (& Deep Learning), GraphX), on Node 1 to Node 5 …, each node with its own OS and Hardware]
  • 26. [Diagram: Your Application, DataFrame, Unified API (Spark SQL, Spark Streaming, Machine Learning (& Deep Learning), GraphX), Node 1 to Node 5 …]
  • 27. [Diagram: DataFrame with Spark SQL, Spark Streaming, Machine Learning (& Deep Learning), GraphX]
  • 28. A bit of Analytics But really just a bit
  • 29. Basic Analytics ๏ Go to net.jgp.labs.spark.l200_join.l030_count_books ๏ Open AuthorsAndBooksCountBooksApp.java ๏ Right click, Run As, Java Application
  • 30.
      package net.jgp.labs.spark.l200_join.l030_count_books;

      import org.apache.spark.sql.Dataset;
      import org.apache.spark.sql.Row;
      import org.apache.spark.sql.SparkSession;

      public class AuthorsAndBooksCountBooksApp {

        public static void main(String[] args) {
          AuthorsAndBooksCountBooksApp app = new AuthorsAndBooksCountBooksApp();
          app.start();
        }

        private void start() {
          SparkSession spark = SparkSession.builder()
              .appName("Authors and Books")
              .master("local").getOrCreate();

          String filename = "data/authors.csv";
          Dataset<Row> authorsDf = spark.read()
              .format("csv")
              .option("inferSchema", "true")
              .option("header", "true")
              .load(filename);
          authorsDf.show();
  • 31.
          filename = "data/books.csv";
          Dataset<Row> booksDf = spark.read()
              .format("csv")
              .option("inferSchema", "true")
              .option("header", "true")
              .load(filename);
          booksDf.show();

          Dataset<Row> libraryDf = authorsDf
              .join(
                  booksDf,
                  authorsDf.col("id").equalTo(booksDf.col("authorId")),
                  "left")
              .withColumn("bookId", booksDf.col("id"))
              .drop(booksDf.col("id"))
              .groupBy(
                  authorsDf.col("id"),
                  authorsDf.col("name"),
                  authorsDf.col("link"))
              .count();
          libraryDf.show();
          libraryDf.printSchema();
        }
      }
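The join-then-count on slides 30 and 31 can be mimicked in plain Java, without Spark, to see what the result should look like. This is only a sketch under stated assumptions: the author and book rows are invented, and `LeftJoinCountSketch`, `Book`, and the helper methods are hypothetical names that do not come from the repository. It also makes one behaviour visible: after a left join, an author with no book still contributes one row, so counting rows (which is what `groupBy(...).count()` does) reports 1 for that author, while counting non-null book ids reports 0.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LeftJoinCountSketch {

  /** Hypothetical book row: only the author id matters for counting. */
  static final class Book {
    final int id;
    final int authorId;

    Book(int id, int authorId) {
      this.id = id;
      this.authorId = authorId;
    }
  }

  /** Invented sample data: author id to name. Author C has no book. */
  static Map<Integer, String> sampleAuthors() {
    Map<Integer, String> authors = new LinkedHashMap<>();
    authors.put(1, "Author A");
    authors.put(2, "Author B");
    authors.put(3, "Author C");
    return authors;
  }

  /** Invented sample data: two books for author 1, one for author 2. */
  static List<Book> sampleBooks() {
    List<Book> books = new ArrayList<>();
    books.add(new Book(10, 1));
    books.add(new Book(11, 1));
    books.add(new Book(12, 2));
    return books;
  }

  /** Number of books whose authorId matches the given author. */
  static long matches(int authorId, List<Book> books) {
    long n = 0;
    for (Book book : books) {
      if (book.authorId == authorId) {
        n++;
      }
    }
    return n;
  }

  /** Left join, then count rows per author: an unmatched author is still one row. */
  static Map<String, Long> countJoinedRows(Map<Integer, String> authors, List<Book> books) {
    Map<String, Long> counts = new LinkedHashMap<>();
    for (Map.Entry<Integer, String> author : authors.entrySet()) {
      long n = matches(author.getKey(), books);
      counts.put(author.getValue(), Math.max(n, 1L));
    }
    return counts;
  }

  /** Left join, then count non-null book ids per author: an unmatched author gets 0. */
  static Map<String, Long> countBooks(Map<Integer, String> authors, List<Book> books) {
    Map<String, Long> counts = new LinkedHashMap<>();
    for (Map.Entry<Integer, String> author : authors.entrySet()) {
      counts.put(author.getValue(), matches(author.getKey(), books));
    }
    return counts;
  }

  public static void main(String[] args) {
    // prints {Author A=2, Author B=1, Author C=1}
    System.out.println(countJoinedRows(sampleAuthors(), sampleBooks()));
    // prints {Author A=2, Author B=1, Author C=0}
    System.out.println(countBooks(sampleAuthors(), sampleBooks()));
  }
}
```

If the real `data/authors.csv` contains an author without any book, the second behaviour is probably the one you want; in Spark that would mean aggregating with a count over the book id column rather than a plain row count.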
  • 32. The Art of Delegating
  • 33. [Diagram: Your app in the Driver; Master (Cluster Manager); two Slaves (Workers), each with an Executor running two Tasks]
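The delegation idea on slide 33 (a driver hands tasks to executors and merges what comes back) can be illustrated with a plain Java thread pool. This is an analogy, not Spark code: `DelegationSketch` and its methods are hypothetical names, the threads stand in for executors on worker nodes, and the input is the second column of the CSV output shown on slide 21.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class DelegationSketch {

  /** Splits the data into contiguous partitions; each partition becomes one task. */
  static List<List<Integer>> partition(List<Integer> data, int partitions) {
    List<List<Integer>> result = new ArrayList<>();
    int size = (data.size() + partitions - 1) / partitions; // ceiling division
    for (int start = 0; start < data.size(); start += size) {
      result.add(data.subList(start, Math.min(start + size, data.size())));
    }
    return result;
  }

  /** Plays the driver: submits one task per partition, then merges the partial sums. */
  static long sumWithWorkers(List<Integer> data, int workers, int partitions) {
    ExecutorService executors = Executors.newFixedThreadPool(workers);
    try {
      List<Future<Long>> pending = new ArrayList<>();
      for (List<Integer> part : partition(data, partitions)) {
        // The task only sees its own partition, like a Spark task
        Callable<Long> task = () -> {
          long sum = 0;
          for (int value : part) {
            sum += value;
          }
          return sum;
        };
        pending.add(executors.submit(task));
      }
      long total = 0;
      for (Future<Long> partial : pending) {
        total += partial.get(); // wait for each task to report back
      }
      return total;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted while waiting for tasks", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("A task failed", e);
    } finally {
      executors.shutdown();
    }
  }

  public static void main(String[] args) {
    List<Integer> secondColumn = Arrays.asList(5, 13, 27, 39, 41, 55);
    // 2 workers, 3 partitions; prints 180
    System.out.println(sumWithWorkers(secondColumn, 2, 3));
  }
}
```

The application code never says which worker runs which task. It only describes the work and the partitions, which is the same division of labour the slide shows between your app and the cluster.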
  • 35. A (Big) Data Scenario Data Raw Data Ingestion Data Quality Pure Data Transformation Rich Data Load/Publish Data
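The stages named on slide 35 can be sketched as four small functions chained together. This is a plain Java illustration under stated assumptions, not the talk's code: `PipelineSketch` and its method names are hypothetical, the quality rule (exactly two integer fields) and the enrichment (appending the sum of both fields) are arbitrary choices made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PipelineSketch {

  /** Ingestion: split raw text lines into fields, without judging them yet. */
  static List<String[]> ingest(List<String> rawLines) {
    List<String[]> rows = new ArrayList<>();
    for (String line : rawLines) {
      rows.add(line.split(","));
    }
    return rows;
  }

  /** Data quality: keep only rows made of exactly two integer fields. */
  static List<int[]> clean(List<String[]> rows) {
    List<int[]> pure = new ArrayList<>();
    for (String[] fields : rows) {
      if (fields.length != 2) {
        continue; // wrong number of fields: drop the row
      }
      try {
        int first = Integer.parseInt(fields[0].trim());
        int second = Integer.parseInt(fields[1].trim());
        pure.add(new int[] {first, second});
      } catch (NumberFormatException e) {
        // Not a number: drop the row
      }
    }
    return pure;
  }

  /** Transformation: enrich each row with a derived value (here, the sum of both fields). */
  static List<int[]> enrich(List<int[]> pure) {
    List<int[]> rich = new ArrayList<>();
    for (int[] row : pure) {
      rich.add(new int[] {row[0], row[1], row[0] + row[1]});
    }
    return rich;
  }

  /** Load/publish: render the enriched rows as CSV lines. */
  static List<String> publish(List<int[]> rich) {
    List<String> lines = new ArrayList<>();
    for (int[] row : rich) {
      lines.add(row[0] + "," + row[1] + "," + row[2]);
    }
    return lines;
  }

  /** Runs the four stages in order. */
  static List<String> run(List<String> rawLines) {
    return publish(enrich(clean(ingest(rawLines))));
  }

  public static void main(String[] args) {
    // Two valid rows and two malformed ones (invented input)
    for (String line : run(Arrays.asList("1,5", "2,13", "oops", "3,x"))) {
      System.out.println(line);
    }
  }
}
```

In a Spark application each stage would typically take and return a `Dataset<Row>` instead of a Java list, but the shape of the pipeline stays the same.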
  • 36. What You Learned ๏ Big Data is easier than one could think ๏ Java is the way to go (or Python) ๏ New vocabulary for using Spark ๏ You have a friend to help (ok, me) ๏ Spark is fun
  • 37. Going Further ๏ Run more code from the examples (I add some weekly) ๏ Contact me @jgperrin ๏ Join the Spark User mailing list ๏ Get help from Stack Overflow ๏ Watch for my book on Spark + Java to come!