Walaa Assy
Giza Systems
Software Developer
SPARK
LIGHTNING-FAST UNIFIED ANALYTICS ENGINE
HOW DO WE HANDLE EVER-GROWING DATA THAT HAS BECOME BIG DATA?
 Basics of Spark
 Core API
 Cluster Managers
 Spark Maintenance
 Libraries
 - SQL
 - Streaming
 - MLlib
 - GraphX
 Troubleshooting / Future of Spark
AGENDA
 Readability
 Expressiveness
 Fast
 Testability
 Interactive
 Fault Tolerant
 Unify Big Data
Spark officially set a new record in large-scale sorting. Spark can
still run computations on disk, and it also makes use of data cached
in memory.
WHY SPARK? TINIER CODE LEADS TO ..
 MapReduce has a very narrow scope, mainly batch processing
 Each new problem needed a new API to solve
EXPLOSION OF MAP REDUCE
A UNIFIED PLATFORM FOR BIG DATA
SPARK PROGRAMMING LANGUAGES
 The most basic abstraction in Spark
 Spark operations fall into two main categories:
 Transformations [lazily evaluated, only storing the intent]
 Actions [trigger the computation]
 val textFile = sc.textFile("file:///spark/README.md") // lazy: nothing is read yet
 textFile.first() // action: reads the file and returns its first line
RDD [RESILIENT DISTRIBUTED DATASET]
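The split between transformations and actions can be sketched as a small standalone program. This is a minimal sketch, not the deck's own demo: the object name LazyRddSketch, the local[*] master and the three in-memory lines are assumptions chosen so that it runs without a cluster or a README file.

```scala
import org.apache.spark.sql.SparkSession

object LazyRddSketch {
  // Returns (first line, number of lines mentioning "Spark") for a tiny in-memory dataset.
  def run(): (String, Long) = {
    // Assumption: a local[*] master, so the sketch runs without a cluster.
    val spark = SparkSession.builder().master("local[*]").appName("lazy-rdd-sketch").getOrCreate()
    try {
      val lines = spark.sparkContext.parallelize(Seq(
        "# Apache Spark",
        "Spark is a unified analytics engine",
        "for large-scale data processing"))
      // Transformation: nothing runs here, Spark only records the intent.
      val sparkLines = lines.filter(_.contains("Spark"))
      // Actions: first() and count() trigger the actual computation.
      (lines.first(), sparkLines.count())
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```

In the spark-shell the same calls work directly on the predefined sc, without building a session first.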
HELLO BIG DATA
 sudo yum install wget
 sudo wget https://downloads.lightbend.com/scala/2.13.0-M4/scala-2.13.0-M4.tgz
 tar xvf scala-2.13.0-M4.tgz
 sudo mv scala-2.13.0-M4 /usr/lib
 sudo ln -s /usr/lib/scala-2.13.0-M4 /usr/lib/scala
 export PATH=$PATH:/usr/lib/scala/bin
SCALA INSTALLATION STEPS
 sudo wget https://archive.apache.org/dist/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz
 tar xvf spark-2.3.1-bin-hadoop2.7.tgz
 ln -s spark-2.3.1-bin-hadoop2.7 spark
 export SPARK_HOME=$HOME/spark-2.3.1-bin-hadoop2.7
 export PATH=$PATH:$SPARK_HOME/bin
SPARK INSTALLATION – CENTOS 7
SPARK MECHANISM
 A collection of elements partitioned across the nodes of the
cluster that can be operated on in parallel
 From the user's point of view, a collection similar to a list or an array
 Processed in parallel to speed up computation, with built-in
fault tolerance
 An RDD is immutable
 Transformations are lazy and are recorded in a DAG (directed acyclic graph)
 Actions trigger execution of the DAG
 A DAG is a one-directional graph of tasks with no cycles
 Each action triggers a fresh execution of the graph
RDD
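The point that each action re-runs the graph can be made visible by counting how often a map function is called. The sketch below is an illustration under assumptions: the object name, the accumulator names and the local[*] master are hypothetical, and accumulator counts inside transformations are only exact when no task is retried.

```scala
import org.apache.spark.sql.SparkSession

object DagReexecutionSketch {
  // Returns how often the map function ran across two actions: (without cache, with cache).
  def run(): (Long, Long) = {
    // Assumption: a local[*] master, so the sketch runs without a cluster.
    val spark = SparkSession.builder().master("local[*]").appName("dag-sketch").getOrCreate()
    val sc = spark.sparkContext
    try {
      val uncachedCalls = sc.longAccumulator("uncachedCalls")
      val cachedCalls = sc.longAccumulator("cachedCalls")

      val uncached = sc.parallelize(1 to 5).map { x => uncachedCalls.add(1); x * x }
      uncached.count() // first action: runs the map over all 5 elements
      uncached.count() // second action: runs the whole graph again

      val cached = sc.parallelize(1 to 5).map { x => cachedCalls.add(1); x * x }.cache()
      cached.count() // computes the partitions and keeps them in memory
      cached.count() // served from the cache, so the map is not re-run

      (uncachedCalls.sum, cachedCalls.sum)
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```

On a local run without task retries or cache eviction this should report 10 calls for the uncached RDD and 5 for the cached one, which is why caching matters for data that several actions reuse.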
INPUT DATASETS TYPES
 map
 flatMap
 filter
 distinct
 sample
 union
 intersection
 subtract
 cartesian
Transformations return RDDs
TRANSFORMATIONS IN SPARK
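Most of the transformations listed above can be tried on two tiny RDDs. This is a minimal sketch with hypothetical names and values; sample is left out because its result depends on the random seed, and the results are sorted only to make them easy to compare.

```scala
import org.apache.spark.sql.SparkSession

object TransformationsSketch {
  // Applies each transformation to small sample RDDs and collects the sorted results.
  def run(): Map[String, List[Int]] = {
    // Assumption: a local[*] master, so the sketch runs without a cluster.
    val spark = SparkSession.builder().master("local[*]").appName("transformations-sketch").getOrCreate()
    val sc = spark.sparkContext
    try {
      val a = sc.parallelize(Seq(1, 2, 3, 3))
      val b = sc.parallelize(Seq(3, 4))
      // Every entry is a new RDD; nothing is computed until collect() is called below.
      Map(
        "map" -> a.map(_ * 10),
        "flatMap" -> a.flatMap(x => Seq(x, x + 100)),
        "filter" -> a.filter(_ > 1),
        "distinct" -> a.distinct(),
        "union" -> a.union(b),
        "intersection" -> a.intersection(b),
        "subtract" -> a.subtract(b),
        "cartesian" -> a.cartesian(b).map { case (x, y) => x * y }
      ).map { case (name, rdd) =>
        // Sorted because shuffle-based transformations do not guarantee an order.
        name -> rdd.collect().sorted.toList
      }
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = run().foreach(println)
}
```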
 collect()
 count()
 take(num)
 takeOrdered(num)(ordering)
 reduce(function)
 aggregate(zeroValue)(seqOp, combOp)
 foreach(function)
 Actions return different types depending on the action
saveAsObjectFile(path)
saveAsTextFile(path) // saves as a text file
External connectors
 - foreach(T => Unit) // one object at a time
 - foreachPartition(Iterator[T] => Unit) // one partition at a time
ACTIONS IN SPARK
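The return types of the main actions are easier to see side by side. The following is a sketch under assumptions: the Summary case class, the object name and the four sample numbers are hypothetical, and the save and foreach actions are left out because they only produce side effects.

```scala
import org.apache.spark.sql.SparkSession

object ActionsSketch {
  // One field per action, to show that each action returns a different type.
  case class Summary(all: List[Int], count: Long, firstTwo: List[Int],
                     smallestTwo: List[Int], sum: Int, sumAndCount: (Int, Int))

  def run(): Summary = {
    // Assumption: a local[*] master, so the sketch runs without a cluster.
    val spark = SparkSession.builder().master("local[*]").appName("actions-sketch").getOrCreate()
    try {
      val nums = spark.sparkContext.parallelize(Seq(5, 3, 8, 1))
      Summary(
        all = nums.collect().toList,             // every element, brought to the driver
        count = nums.count(),                    // number of elements
        firstTwo = nums.take(2).toList,          // first two elements in RDD order
        smallestTwo = nums.takeOrdered(2).toList, // two smallest by the implicit ordering
        sum = nums.reduce(_ + _),                // combine all elements pairwise
        // aggregate: seqOp folds elements into (sum, count), combOp merges partitions
        sumAndCount = nums.aggregate((0, 0))(
          (acc, x) => (acc._1 + x, acc._2 + 1),
          (l, r) => (l._1 + r._1, l._2 + r._2)))
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```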
 SQL-like pairing
 join
 fullOuterJoin
 leftOuterJoin
 rightOuterJoin
 Pair saving
 saveAs(NewAPI)HadoopFile
 - path
 - keyClass
 - valueClass
 - outputFormatClass
 saveAs(NewAPI)HadoopDataset
 - conf
 saveAsSequenceFile
 - saveAsHadoopFile(path, keyClass, valueClass, SequenceFileOutputFormat)
PAIR METHODS- CONTD
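The SQL-like joins on pair RDDs can be sketched with two small key-value datasets. The users and orders data and the object name are hypothetical, and the saving methods are not shown here because they need an output path and Hadoop output classes.

```scala
import org.apache.spark.sql.SparkSession

object PairJoinSketch {
  type Inner = List[(Int, (String, String))]
  type Left = List[(Int, (String, Option[String]))]
  type Right = List[(Int, (Option[String], String))]

  // Returns the inner, left outer and right outer joins, plus the keys of the full outer join.
  def run(): (Inner, Left, Right, List[Int]) = {
    // Assumption: a local[*] master, so the sketch runs without a cluster.
    val spark = SparkSession.builder().master("local[*]").appName("pair-join-sketch").getOrCreate()
    val sc = spark.sparkContext
    try {
      val users = sc.parallelize(Seq((1, "ann"), (2, "bob"), (3, "cid")))
      val orders = sc.parallelize(Seq((1, "book"), (3, "pen"), (4, "ink")))
      // Results are sorted by key because joins shuffle the data.
      val inner = users.join(orders).collect().sortBy(_._1).toList
      val left = users.leftOuterJoin(orders).collect().sortBy(_._1).toList   // missing right side is None
      val right = users.rightOuterJoin(orders).collect().sortBy(_._1).toList // missing left side is None
      val fullKeys = users.fullOuterJoin(orders).keys.collect().sorted.toList
      (inner, left, right, fullKeys)
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```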
 Works like a distributed kernel
 Spark's built-in basic cluster manager (standalone)
 Hadoop's cluster manager, YARN
 Apache Mesos
PRIMARY CLUSTER MANAGER
SPARK-SUBMIT DEMO
SPARK SQL
 Spark SQL is Apache Spark's module for working with
structured or semi-structured data.
 It is meant to be usable by people who are not big data specialists
 As Spark continues to grow, we want to enable wider
audiences beyond “Big Data” engineers to leverage the power
of distributed processing.
Databricks blog (http://bit.ly/17NM70s)
SPARK SQL
 Seamlessly mix SQL queries with Spark programs
Spark SQL lets you query structured data inside Spark programs,
using either SQL or a familiar DataFrame API
 Connect to any data source the same way.
 It executes SQL queries.
 We can read data from an existing Hive installation using
Spark SQL.
 When we run SQL from within another programming language, we
get the result back as a Dataset/DataFrame.
SPARK SQL FEATURES
DataFrames and SQL provide a common way to access a variety
of data sources, including Hive, Avro, Parquet, ORC, JSON, and
JDBC. You can even join data across these sources.
 Run SQL or HiveQL queries on existing warehouses.[Hive
Integration]
 Connect through JDBC or ODBC.[Standard Connectivity]
 It is included with Spark
DATAFRAMES
 Introduced in the Spark 1.3 release. A DataFrame is a distributed
collection of data organized into named columns. Conceptually, it is
equivalent to a table in a relational database or a data frame in R/Python.
We can create DataFrame using:
 Structured data files
 Tables in Hive
 External databases
 Using existing RDD
SPARK DATAFRAME IS
DataFrame = SchemaRDD (its name before Spark 1.3)
EXAMPLES
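One way to create a DataFrame from an existing RDD and then query it with SQL is sketched below. The Person case class, the sample rows and the view name people are assumptions made for illustration; the calls follow the Spark 2.x SparkSession API.

```scala
import org.apache.spark.sql.SparkSession

object DataFrameSketch {
  // Hypothetical record type; its fields become the named columns of the DataFrame.
  case class Person(name: String, age: Int)

  // Returns the names of everyone older than 30, queried through SQL.
  def run(): List[String] = {
    // Assumption: a local[*] master, so the sketch runs without a cluster.
    val spark = SparkSession.builder().master("local[*]").appName("dataframe-sketch").getOrCreate()
    import spark.implicits._ // enables rdd.toDF() and .as[String]
    try {
      val rdd = spark.sparkContext.parallelize(
        Seq(Person("Ann", 34), Person("Bob", 19), Person("Cid", 52)))
      val people = rdd.toDF()                      // existing RDD -> DataFrame with schema
      people.createOrReplaceTempView("people")     // make it visible to SQL
      spark.sql("SELECT name FROM people WHERE age > 30 ORDER BY name")
        .as[String].collect().toList               // the SQL result is itself a Dataset
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```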
SPARK SQL COMPETITION
 Hive
 Parquet
 JSON
 Avro
 Amazon Redshift
 CSV
 Others
It is recommended as a starting point for any Spark application,
as it adds:
 Predicate pushdown
 Column pruning
 The ability to use both SQL and RDDs
SPARK SQL DATA SOURCES
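Predicate pushdown and column pruning can be observed by reading a Parquet source with a filter and a projection. This is a sketch under assumptions: the sample rows, the temporary directory and the object name are hypothetical, and the exact text of the printed plan varies between Spark versions.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object DataSourceSketch {
  // Writes a small Parquet dataset, then reads back only the rows and column it needs.
  def run(): List[String] = {
    // Assumption: a local[*] master, so the sketch runs without a cluster.
    val spark = SparkSession.builder().master("local[*]").appName("datasource-sketch").getOrCreate()
    import spark.implicits._
    try {
      // Hypothetical sample data, written to a temporary Parquet directory.
      val path = Files.createTempDirectory("people-parquet").resolve("people").toString
      Seq(("Ann", 34), ("Bob", 19), ("Cid", 52)).toDF("name", "age").write.parquet(path)

      val adults = spark.read.parquet(path)
        .filter($"age" > 30) // a filter the Parquet reader can apply at the source
        .select("name")      // only this column is needed in the result
      adults.explain()       // prints the physical plan for inspection
      adults.as[String].collect().sorted.toList
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```

The same read, filter and select calls apply to the other sources in the list by swapping parquet for another reader such as json or csv.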
SPARK STREAMING
 Big & fast data
 Gigabytes per second
 Real time fraud detection
 Marketing
 makes it easy to build scalable fault-tolerant streaming
applications.
SPARK STREAMING
SPARK STREAMING COMPETITORS
Streaming data
• Kafka
• Flume
• Twitter
• Hadoop hdfs
• Others
• Live logs, system telemetry data, IoT device data, etc.
SPARK MLLIB
 MLlib is a standard component of Spark providing machine
learning primitives on top of Spark.
SPARK MLLIB
 MATLAB
 R
EASY TO USE BUT NOT SCALABLE
 MAHOUT
 GRAPHLAB
Scalable, but at the cost of ease of use
 org.apache.spark.mllib
RDD-based algorithms
 org.apache.spark.ml
 Pipeline API built on top of DataFrames
SPARK MLLIB COMPETITION
 Loading the data
 Extracting features
 Training the model on the data
 Testing the model
 The new pipeline API allows tuning, testing and early failure
detection
MACHINE LEARNING FLOW
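The load, extract, train and test flow maps onto the stages of a spark.ml Pipeline. The sketch below assumes a toy spam-filtering task: the four training sentences, their labels (1.0 for spam) and the stage parameters are made up for illustration and are far too small to give a meaningful model.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

object PipelineSketch {
  // Trains a tiny text classifier and returns one prediction per test row.
  def run(): Array[Double] = {
    // Assumption: a local[*] master, so the sketch runs without a cluster.
    val spark = SparkSession.builder().master("local[*]").appName("pipeline-sketch").getOrCreate()
    import spark.implicits._
    try {
      // 1. Loading the data (hypothetical examples, label 1.0 = spam).
      val training = Seq(
        (0L, "win free money now", 1.0),
        (1L, "spark cluster meeting today", 0.0),
        (2L, "free prize claim now", 1.0),
        (3L, "rdd and dataframe notes", 0.0)
      ).toDF("id", "text", "label")

      // 2. Extracting features: split text into words, then hash words into a vector.
      val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
      val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features").setNumFeatures(1000)
      // 3. Training: logistic regression as the final pipeline stage.
      val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
      val model = new Pipeline().setStages(Array(tokenizer, hashingTF, lr)).fit(training)

      // 4. Testing: the fitted pipeline applies every stage to unseen rows.
      val test = Seq((4L, "free money"), (5L, "spark notes")).toDF("id", "text")
      model.transform(test).select("prediction").as[Double].collect()
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = println(run().mkString(", "))
}
```

Keeping the stages in one Pipeline means the same tokenizing and hashing is applied at training and test time, which is where the early failure detection comes from.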
 Algorithms
Classification, e.g. naïve Bayes
Regression
Linear
Logistic
Collaborative filtering by ALS (alternating least squares)
Clustering by k-means
Dimensionality reduction by SVD (singular value decomposition)
 Feature extraction and transformations
TF-IDF: term frequency-inverse document frequency
ALGORITHMS IN MLLIB
 Spam filtering
 Fraud detection
 Recommendation analysis
 Speech recognition
PRACTICAL USE
 Word2Vec (word to vector) algorithm
 This algorithm takes an input text and outputs a set of vectors
representing a dictionary of words [to see word similarity]
 We cache the RDDs because MLlib makes multiple passes over
the same data, so the in-memory cache can reduce processing
time a lot
 Breeze is the numerical processing library used inside Spark
 It can perform mathematical operations on vectors
MLLIB DEMO
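The Word2Vec flow described above can be sketched with the RDD-based MLlib API. This is not the demo from the talk: the three sentences, the vector size of 10 and the seed are assumptions, and a corpus this small will not produce meaningful similarities.

```scala
import org.apache.spark.mllib.feature.Word2Vec
import org.apache.spark.sql.SparkSession

object Word2VecSketch {
  // Trains Word2Vec on a tiny corpus and returns each word with the length of its vector.
  def run(): Map[String, Int] = {
    // Assumption: a local[*] master, so the sketch runs without a cluster.
    val spark = SparkSession.builder().master("local[*]").appName("word2vec-sketch").getOrCreate()
    try {
      // Each sentence becomes a sequence of words; cache() because training
      // makes several passes over the same RDD.
      val sentences = spark.sparkContext.parallelize(Seq(
        "spark makes big data simple",
        "spark makes streaming simple",
        "graphs and tables are both data"
      )).map(_.split(" ").toSeq).cache()

      val model = new Word2Vec()
        .setVectorSize(10) // hypothetical size, kept small for the toy corpus
        .setMinCount(1)    // keep every word, even if it appears only once
        .setSeed(42L)
        .fit(sentences)

      model.getVectors.map { case (word, vector) => word -> vector.length }
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = run().foreach(println)
}
```

With a realistic corpus, model.findSynonyms(word, num) is the call that exposes word similarity.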
SPARK GRAPHX
 GraphX is Apache Spark's API for graphs and graph-parallel
computation.
 Page ranking
 Producing evaluations
 It can be used in genetic analysis
 ALGORITHMS
 PageRank
 Connected components
 Label propagation
 SVD++
 Strongly connected components
 Triangle count
GRAPHX - FROM A TABLE-STRUCTURED WORLD TO A GRAPH-STRUCTURED WORLD
COMPETITION
End-to-end PageRank performance (20 iterations,
3.7B edges)
 Joints are vertices, and each has a unique ID
 Each vertex can have properties of a user-defined type and store
metadata
ARCHITECTURE
 Arrows are relations, known as edges, that can store metadata;
each edge refers to its source and destination vertices by their Long IDs
 A graph is built from two RDDs, one containing the collection of
edges and the other the collection of vertices
 Another component is the edge triplet, an object that exposes
the relation between an edge and its two vertices, containing all the
information for each connection
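The architecture above (a vertex RDD, an edge RDD and edge triplets) can be sketched with the GraphX API. The three users and the follows/likes relations are hypothetical sample data.

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object GraphSketch {
  // Builds a small property graph and describes every connection through its triplet.
  def run(): List[String] = {
    // Assumption: a local[*] master, so the sketch runs without a cluster.
    val spark = SparkSession.builder().master("local[*]").appName("graphx-sketch").getOrCreate()
    val sc = spark.sparkContext
    try {
      // Vertices: a unique Long ID plus a user-defined property (here a name).
      val vertices: RDD[(VertexId, String)] =
        sc.parallelize(Seq((1L, "ann"), (2L, "bob"), (3L, "cid")))
      // Edges: source ID, destination ID and edge metadata (here a relation label).
      val edges: RDD[Edge[String]] = sc.parallelize(Seq(
        Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "likes")))

      val graph = Graph(vertices, edges) // a graph is built from the two RDDs

      // A triplet exposes the edge together with both of its vertices.
      graph.triplets
        .map(t => s"${t.srcAttr} ${t.attr} ${t.dstAttr}")
        .collect().sorted.toList
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = run().foreach(println)
}
```

The same graph value is what algorithms such as graph.pageRank or graph.connectedComponents operate on.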
WHO IS USING SPARK?
 http://spark.apache.org
 Tutorials: http://ampcamp.berkeley.edu
 Spark Summit: http://spark-summit.org
 Github: https://github.com/apache/spark
 https://data-flair.training/blogs/spark-sql-tutorial/
REFERENCES

Apache Spark - Architecture, Overview & Libraries

  • 2. HOW DO WE HANDLE EVER GROWING DATA THAT HAS BECOME BIG DATA?
  • 3. AGENDA
      - Basics of Spark
      - Core API
      - Cluster Managers
      - Spark Maintenance
      - Libraries: SQL, Streaming, MLlib, GraphX
      - Troubleshooting / Future of Spark
  • 5. WHY SPARK? TINIER CODE LEADS TO ..
      - Readability
      - Expressiveness
      - Fast
      - Testability
      - Interactive
      - Fault Tolerant
      - Unify Big Data
      - Spark officially set a new record in large-scale sorting. Spark does perform computations on disk, but it also makes use of data cached in memory.
  • 6. EXPLOSION OF MAP REDUCE
      - MapReduce has a very narrow scope, mainly batch processing
      - Each new problem needed a new API to solve it
  • 8. A UNIFIED PLATFORM FOR BIG DATA
  • 10. RDD [RESILIENT DISTRIBUTED DATASET]
      - The most basic abstraction of Spark
      - Spark operations fall into two main categories:
          - Transformations [lazily evaluated, only storing the intent]
          - Actions
      - val textFile = sc.textFile("file:///spark/README.md")
      - textFile.first // action
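The split between transformations and actions on the slide above can be sketched as follows. This is a minimal illustration, not code from the deck: the object name `LazyRddSketch`, the `local[*]` master, and the in-memory sample lines (standing in for `sc.textFile`) are all assumptions.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object LazyRddSketch {
  // Returns (first non-empty line, number of non-empty lines).
  def firstAndCount(sc: SparkContext): (String, Long) = {
    // parallelize stands in for sc.textFile(...) so that no input file is needed.
    val lines = sc.parallelize(Seq("# Apache Spark", "", "Spark is a unified analytics engine"))

    // Transformation: nothing is computed here, Spark only records the intent.
    val nonEmpty = lines.filter(_.nonEmpty)

    // Actions: each call submits a job and returns a value to the driver.
    (nonEmpty.first(), nonEmpty.count())
  }

  def main(args: Array[String]): Unit = {
    // local[*] runs Spark inside this JVM (an assumption made for the sketch).
    val spark = SparkSession.builder().master("local[*]").appName("lazy-rdd-sketch").getOrCreate()
    println(firstAndCount(spark.sparkContext)) // (# Apache Spark,2)
    spark.stop()
  }
}
```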
  • 12. SCALA INSTALLATION STEPS
      - sudo yum install wget
      - sudo wget https://downloads.lightbend.com/scala/2.13.0-M4/scala-2.13.0-M4.tgz
      - tar xvf scala-2.13.0-M4.tgz
      - sudo mv scala-2.13.0-M4 /usr/lib
      - sudo ln -s /usr/lib/scala-2.13.0-M4 /usr/lib/scala
      - export PATH=$PATH:/usr/lib/scala/bin
  • 13. SPARK INSTALLATION – CENTOS 7
      - sudo wget https://www.apache.org/dyn/closer.lua/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz
      - tar xvf spark-2.3.1-bin-hadoop2.7.tgz
      - ln -s spark-2.3.1-bin-hadoop2.7 spark
      - export SPARK_HOME=$HOME/spark-2.3.1-bin-hadoop2.7
      - export PATH=$PATH:$SPARK_HOME/bin
  • 15. RDD
      - A collection of elements partitioned across the nodes of the cluster that can be operated on in parallel
      - From the user's point of view, a collection similar to a list or an array
      - Processed in parallel to speed up computation, with fault tolerance: a lost partition can be recomputed from its lineage
      - An RDD is immutable
      - Transformations are lazy and stored in a DAG
      - Actions trigger DAGs
      - A DAG (directed acyclic graph) is a graph of tasks with no cycles
      - Each action triggers a fresh execution of the graph (unless intermediate results are cached)
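The point that each action re-executes the graph can be observed with an accumulator that counts how often a map function runs. This is a hedged sketch: `RecomputeSketch` and `mapCalls` are hypothetical names, the master is assumed to be `local[*]`, and accumulator counts inside transformations are only exact when no task is retried.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object RecomputeSketch {
  // Runs two actions on the same RDD and reports how often the map function ran.
  def mapCalls(sc: SparkContext, useCache: Boolean): Long = {
    val calls = sc.longAccumulator("map-calls")
    val squared = sc.parallelize(1 to 5).map { x => calls.add(1L); x * x }
    if (useCache) squared.cache() // keep computed partitions in memory

    squared.count() // first action: evaluates the map for all 5 elements
    squared.count() // second action: evaluates it again unless the RDD is cached
    calls.value
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("recompute-sketch").getOrCreate()
    val sc = spark.sparkContext
    // In a local run with no task retries these print 10 and 5.
    println(mapCalls(sc, useCache = false))
    println(mapCalls(sc, useCache = true))
    spark.stop()
  }
}
```

Caching trades memory for recomputation, which is why it pays off mainly for RDDs that several actions reuse.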
  • 18. TRANSFORMATIONS IN MAP REDUCE
      - map
      - flatMap
      - filter
      - distinct
      - sample
      - union
      - intersection
      - subtract
      - cartesian
      - Transformations return RDDs
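Most of the transformations listed above can be tried on two small RDDs. The sketch below is illustrative only: the object name and sample values are assumptions, `sample` is left out because its result depends on the random seed, and results are sorted on the driver because several of these operations shuffle data.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object TransformationsSketch {
  def run(sc: SparkContext): Map[String, Seq[Int]] = {
    val a = sc.parallelize(Seq(1, 2, 2, 3))
    val b = sc.parallelize(Seq(3, 4))

    // Each entry is a new RDD; nothing runs until collect() is called below.
    Map(
      "map"          -> a.map(_ * 10),
      "flatMap"      -> a.flatMap(x => Seq(x, -x)),
      "filter"       -> a.filter(_ % 2 == 0),
      "distinct"     -> a.distinct(),
      "union"        -> a.union(b),        // keeps duplicates
      "intersection" -> a.intersection(b), // removes duplicates
      "subtract"     -> a.subtract(b),
      "cartesian"    -> a.cartesian(b).map { case (x, y) => x * y }
    ).map { case (name, rdd) => name -> rdd.collect().sorted.toSeq }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("transformations-sketch").getOrCreate()
    run(spark.sparkContext).foreach { case (name, values) => println(s"$name: ${values.mkString(",")}") }
    spark.stop()
  }
}
```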
  • 24. ACTIONS IN SPARK
      - collect()
      - count()
      - take(num)
      - takeOrdered(num)(ordering)
      - reduce(function)
      - aggregate(zeroValue)(seqOp, combOp)
      - foreach(function)
      - Actions return different types according to each action
      - Saving: saveAsObjectFile(path), saveAsTextFile(path) // saves as text file
      - External connector:
          - foreach(T => Unit) // one object at a time
          - foreachPartition(Iterator[T] => Unit) // one partition at a time
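The actions above each return a different type to the driver. As a rough sketch (the object name, the `average` helper, and the sample numbers are assumptions), the following shows the common ones, including `aggregate` used to compute a mean in a single pass.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object ActionsSketch {
  // aggregate(zeroValue)(seqOp, combOp): seqOp folds elements into a (sum, count)
  // pair within a partition, combOp merges the pairs of different partitions.
  def average(sc: SparkContext, values: Seq[Int]): Double = {
    val (sum, count) = sc.parallelize(values).aggregate((0, 0))(
      (acc, x) => (acc._1 + x, acc._2 + 1),
      (l, r) => (l._1 + r._1, l._2 + r._2))
    sum.toDouble / count
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("actions-sketch").getOrCreate()
    val sc = spark.sparkContext
    val nums = sc.parallelize(Seq(5, 3, 1, 4, 2))

    println(nums.collect().mkString(","))      // 5,3,1,4,2
    println(nums.count())                      // 5
    println(nums.take(2).mkString(","))        // 5,3
    println(nums.takeOrdered(2).mkString(",")) // 1,2
    println(nums.reduce(_ + _))                // 15
    println(average(sc, Seq(5, 3, 1, 4, 2)))   // 3.0

    // Runs on the executors, once per partition; useful for batching writes to an external system.
    nums.foreachPartition(part => println(part.size))
    spark.stop()
  }
}
```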
  • 28. PAIR METHODS - CONTD
      - SQL-like pairing: join, fullOuterJoin, leftOuterJoin, rightOuterJoin
      - Pair saving:
          - saveAs(NewAPI)HadoopFile: path, keyClass, valueClass, outputFormatClass
          - saveAs(NewAPI)HadoopDataset: conf
          - saveAsSequenceFile, i.e. saveAsHadoopFile(path, keyClass, valueClass, SequenceFileOutputFormat)
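The SQL-like pair methods behave like their relational counterparts, with `Option` marking the side that may be missing. The sketch below assumes two tiny key-value RDDs (hypothetical users and orders keyed by user ID); the object and helper names are illustrative.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object PairJoinSketch {
  private def users(sc: SparkContext) = sc.parallelize(Seq((1, "ann"), (2, "bob")))
  private def orders(sc: SparkContext) = sc.parallelize(Seq((2, "book"), (3, "pen")))

  // Inner join: only keys present on both sides survive.
  def inner(sc: SparkContext): Seq[(Int, (String, String))] =
    users(sc).join(orders(sc)).collect().sortBy(_._1).toSeq

  // Left outer join: every left key survives, the right value becomes an Option.
  def leftOuter(sc: SparkContext): Seq[(Int, (String, Option[String]))] =
    users(sc).leftOuterJoin(orders(sc)).collect().sortBy(_._1).toSeq

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("pair-join-sketch").getOrCreate()
    val sc = spark.sparkContext
    println(inner(sc).mkString(", "))     // (2,(bob,book))
    println(leftOuter(sc).mkString(", ")) // (1,(ann,None)), (2,(bob,Some(book)))
    println(users(sc).rightOuterJoin(orders(sc)).collect().sortBy(_._1).mkString(", "))
    println(users(sc).fullOuterJoin(orders(sc)).collect().sortBy(_._1).mkString(", "))
    spark.stop()
  }
}
```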
  • 29. PRIMARY CLUSTER MANAGER
      - Works like a distributed kernel
      - Built-in basic Spark manager (standalone)
      - Hadoop cluster manager: YARN
      - Apache Mesos
  • 32. SPARK SQL
      - Spark SQL is Apache Spark's module for working with structured or semi-structured data.
      - It is meant to be usable by people who are not big data specialists.
      - Databricks blog (http://bit.ly/17NM70s): As Spark continues to grow, we want to enable wider audiences beyond “Big Data” engineers to leverage the power of distributed processing.
  • 33. SPARK SQL FEATURES
      - Seamlessly mix SQL queries with Spark programs: Spark SQL lets you query structured data inside Spark programs, using either SQL or a familiar DataFrame API
      - Connect to any data source the same way
      - It executes SQL queries
      - We can read data from an existing Hive installation using Spark SQL
      - When we run SQL from within another programming language, we get the result as a Dataset/DataFrame
  • 35. DATAFRAMES
      - DataFrames and SQL provide a common way to access a variety of data sources, including Hive, Avro, Parquet, ORC, JSON, and JDBC. You can even join data across these sources.
      - Run SQL or HiveQL queries on existing warehouses [Hive integration]
      - Connect through JDBC or ODBC [standard connectivity]
      - It is included with Spark
  • 36. SPARK DATAFRAME: DataFrames = SchemaRDD
      - Introduced in the Spark 1.3 release
      - A DataFrame is a distributed collection of data organized into named columns. Conceptually it is equivalent to a table in a relational database or a data frame in R/Python.
      - We can create a DataFrame from:
          - Structured data files
          - Tables in Hive
          - External databases
          - An existing RDD
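Creating a DataFrame from an existing RDD and querying it with either SQL or the DataFrame API might look like the sketch below. The `Person` case class, the `people` view name, and the sample rows are assumptions made for illustration; both queries are expected to return the same rows.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type used only for this sketch.
case class Person(name: String, age: Int)

object DataFrameSketch {
  // Returns the names of people aged 21 or over, once via SQL and once via the DataFrame API.
  def adultNames(spark: SparkSession): (Seq[String], Seq[String]) = {
    import spark.implicits._

    // An existing RDD of case-class objects becomes a DataFrame with named columns.
    val rdd = spark.sparkContext.parallelize(Seq(Person("Ann", 34), Person("Bob", 19)))
    val people = rdd.toDF()

    // Register the DataFrame so that plain SQL can refer to it by name.
    people.createOrReplaceTempView("people")
    val viaSql = spark.sql("SELECT name FROM people WHERE age >= 21")
    val viaApi = people.filter($"age" >= 21).select("name")

    (viaSql.as[String].collect().toSeq, viaApi.as[String].collect().toSeq)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("dataframe-sketch").getOrCreate()
    println(adultNames(spark))
    spark.stop()
  }
}
```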
  • 39. SPARK SQL DATA SOURCES
      - Hive
      - Parquet
      - JSON
      - Avro
      - Amazon Redshift
      - CSV
      - Others
      - Spark SQL is recommended as a starting point for any Spark application, as it adds:
          - Predicate pushdown
          - Column pruning
          - The ability to use both SQL and RDDs
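Predicate pushdown and column pruning can be seen on a Parquet source by filtering and selecting before any action runs. The sketch below writes a tiny file to a temporary directory first; the object name, file name, and sample rows are assumptions. The plan printed by `explain()` typically lists the pushed filters and the reduced read schema for the Parquet scan, though the exact text varies by Spark version.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

object DataSourceSketch {
  // Writes a small Parquet file, then reads back only what the query needs.
  def namesOlderThan(spark: SparkSession, minAge: Int): DataFrame = {
    import spark.implicits._
    val path = Files.createTempDirectory("people-").resolve("people.parquet").toString
    Seq(("Ann", 34), ("Bob", 19)).toDF("name", "age").write.parquet(path)

    spark.read.parquet(path)
      .filter($"age" > minAge) // candidate for predicate pushdown into the scan
      .select("name")          // candidate for column pruning
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("data-source-sketch").getOrCreate()
    val result = namesOlderThan(spark, 21)
    result.explain() // prints the physical plan
    result.show()
    spark.stop()
  }
}
```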
  • 41. SPARK STREAMING
      - Big & fast data
      - Gigabytes per second
      - Real-time fraud detection
      - Marketing
      - Makes it easy to build scalable, fault-tolerant streaming applications
  • 42. SPARK STREAMING COMPETITORS
      - Streaming data:
          - Kafka
          - Flume
          - Twitter
          - Hadoop HDFS
          - Others (live logs, system telemetry data, IoT device data, etc.)
  • 44. SPARK MLLIB
      - MLlib is a standard component of Spark providing machine learning primitives on top of Spark.
  • 45. SPARK MLLIB COMPETITION
      - MATLAB, R: easy to use but not scalable
      - Mahout, GraphLab: scalable, but at the cost of ease of use
      - org.apache.spark.mllib: RDD-based algorithms
      - org.apache.spark.ml: Pipeline API built on top of DataFrames
  • 46. MACHINE LEARNING FLOW
      - Loading the data
      - Extracting features
      - Training on the data
      - Testing on the data
      - The new pipeline API allows tuning, testing, and early failure detection
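The load, extract, train, test flow maps naturally onto the `org.apache.spark.ml` Pipeline API. The following is a small sketch under stated assumptions: the four training sentences, the spam/not-spam labels, and the choice of `Tokenizer`, `HashingTF`, and `LogisticRegression` as stages are illustrative, not taken from the deck.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.{DataFrame, SparkSession}

object PipelineSketch {
  // Loading: a tiny hand-made training set (label 1.0 = spam, 0.0 = not spam).
  def trainingData(spark: SparkSession): DataFrame =
    spark.createDataFrame(Seq(
      (0L, "win free money now", 1.0),
      (1L, "meeting agenda attached", 0.0),
      (2L, "free prize claim now", 1.0),
      (3L, "project status meeting", 0.0)
    )).toDF("id", "text", "label")

  // Feature extraction and training are chained as pipeline stages.
  def train(training: DataFrame): PipelineModel = {
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features").setNumFeatures(1000)
    val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.001)
    new Pipeline().setStages(Array(tokenizer, hashingTF, lr)).fit(training)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("pipeline-sketch").getOrCreate()
    val model = train(trainingData(spark))

    // Testing: the fitted model applies the same stages to unseen text.
    val test = spark.createDataFrame(Seq((4L, "free money"), (5L, "status meeting"))).toDF("id", "text")
    model.transform(test).select("text", "prediction").show()
    spark.stop()
  }
}
```

Because the stages are declared up front, a misconfigured column name tends to fail when the pipeline is fitted rather than late in a long job.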
  • 47. ALGORITHMS IN MLLIB
      - Algorithms:
          - Classification, e.g. naïve Bayes
          - Regression: linear, logistic
          - Collaborative filtering by ALS (alternating least squares)
          - Clustering by k-means
          - Dimensionality reduction by SVD (singular value decomposition)
      - Feature extraction and transformations:
          - TF-IDF: term frequency-inverse document frequency
  • 48. PRACTICAL USE
      - Spam filtering
      - Fraud detection
      - Recommendation analysis
      - Speech recognition
  • 49. MLLIB DEMO
      - Word2Vec (word-to-vector) algorithm
      - This algorithm takes an input text and outputs a set of vectors representing a dictionary of words [to see word similarity]
      - We cache the RDDs because MLlib makes multiple passes over the same data, so the in-memory cache can reduce processing time a lot
      - Breeze: numerical processing library used inside Spark
      - It can perform mathematical operations on vectors
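The demo described above can be approximated with the RDD-based `Word2Vec` in `org.apache.spark.mllib.feature`. This sketch is not the presenter's demo: the four-sentence corpus, the vector size of 10, and the seed are assumptions, and with so little text the similarities it prints are not meaningful.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}
import org.apache.spark.sql.SparkSession

object Word2VecSketch {
  def train(sc: SparkContext): Word2VecModel = {
    val sentences = Seq(
      "spark makes big data simple",
      "spark runs on a cluster",
      "big data needs a cluster",
      "simple code runs fast")

    // Cached because training makes several passes over the same RDD.
    val corpus = sc.parallelize(sentences).map(_.split(" ").toSeq).cache()

    new Word2Vec()
      .setVectorSize(10) // length of each word vector
      .setMinCount(1)    // keep words that occur only once in this tiny corpus
      .setSeed(42L)
      .fit(corpus)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("word2vec-sketch").getOrCreate()
    val model = train(spark.sparkContext)

    println(model.getVectors("spark").length) // 10
    // Each synonym comes with its cosine similarity to the query word.
    model.findSynonyms("spark", 2).foreach { case (word, similarity) => println(s"$word $similarity") }
    spark.stop()
  }
}
```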
  • 51. GRAPHX - FROM A TABLE-LIKE STRUCTURE TO A GRAPH-STRUCTURED WORLD
      - GraphX is Apache Spark's API for graphs and graph-parallel computation.
      - Uses:
          - Page ranking
          - Producing evaluations
          - It can be used in genetic analysis
      - Algorithms:
          - PageRank
          - Connected components
          - Label propagation
          - SVD++
          - Strongly connected components
          - Triangle count
  • 52. COMPETITION: End-to-end PageRank performance (20 iterations, 3.7B edges)
  • 53. ARCHITECTURE
      - Joints (vertices) each have a unique ID
      - Each vertex can have properties of a user-defined type and store metadata
  • 54.
      - Arrows are relations, known as edges, that can store metadata; an edge identifies its endpoints by vertex ID, which is a Long
      - A graph is built from two RDDs: one containing the collection of edges and one containing the collection of vertices
  • 55. Another component is the edge triplet: an object that exposes the relation between an edge and its two vertices, containing all the information for each connection
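The pieces described on the architecture slides (a vertex RDD, an edge RDD, and edge triplets) fit together roughly as in the sketch below. The three-person "follows" graph, the object name, and the helper names are assumptions for illustration; `pageRank` is included as one example of the built-in algorithms.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object GraphSketch {
  // A graph is built from two RDDs: vertices (ID plus property) and edges (source, destination, property).
  def build(sc: SparkContext): Graph[String, String] = {
    val vertices: RDD[(VertexId, String)] =
      sc.parallelize(Seq((1L, "ann"), (2L, "bob"), (3L, "cat")))
    val edges: RDD[Edge[String]] =
      sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "follows")))
    Graph(vertices, edges)
  }

  // Each triplet exposes an edge together with the properties of both of its vertices.
  def describe(graph: Graph[String, String]): Seq[String] =
    graph.triplets.map(t => s"${t.srcAttr} ${t.attr} ${t.dstAttr}").collect().sorted.toSeq

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("graph-sketch").getOrCreate()
    val graph = build(spark.sparkContext)

    describe(graph).foreach(println)
    // PageRank until the ranks change by less than the given tolerance.
    graph.pageRank(0.001).vertices.collect().sortBy(_._1).foreach(println)
    spark.stop()
  }
}
```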
  • 56. WHO IS USING SPARK?
  • 58. REFERENCES
      - http://spark.apache.org
      - Tutorials: http://ampcamp.berkeley.edu
      - Spark Summit: http://spark-summit.org
      - GitHub: https://github.com/apache/spark
      - https://data-flair.training/blogs/spark-sql-tutorial/