SlideShare une entreprise Scribd logo
1  sur  34
Apache Spark
Concepts - Spark SQL, GraphX, Streaming
Petr Zapletal Cake Solutions
Apache Spark and Big Data
1) History and market overview
2) Installation
3) MLlib and Machine Learning on Spark
4) Porting R code to Scala and Spark
5) Concepts - Spark SQL, GraphX, Streaming
6) Spark’s distributed programming model
7) Deployment
Table of contents
● Resilient Distributed Datasets
● Spark SQL
● GraphX
● Spark Streaming
● Q & A
Spark Modules
Resilient Distributed Datasets
● Immutable, distributed collection of records
● Lazy evaluation, caching option, can be persisted
● Number of operations & transformations
● Can be created from data storage or different RDD
Spark SQL
● Spark’s interface to work with structured or semistructured data
● Structured data
o known set of fields for each record - schema
● Main capabilities
o load data from variety of structured sources
o query the data with SQL
o integration between Spark (Java, Scala and Python API) and SQL
(joining RDDs and SQL tables, using SQL functionality)
More than SQL
● Unified interface for structured data
SchemaRDD
● RDD of row objects, each representing a record
● Known schema (i.e. data fields) of its rows
● Behaves like regular RDD, stored in more efficient manner
● Adds new operations, especially running SQL queries
● Can be created from
o external data sources
o results of queries
o regular RDD
● Used in ML Pipeline API
SchemaRDD
Getting Started
● Entry points:
o HiveContext
 superset functionality, Hive related
o SQLContext
● Loads input JSON file into SchemaRDD
● Uses context to execute query
Query Example
Loading and Saving Data
● Supports number of structured data sources
o Apache Hive
 data warehouse infrastructure on top of Hadoop
 summarization, querying (SQL-like interface) and analysis
o Parquet
 column-oriented storage format in Hadoop ecosystem
 efficient storage of records with nested fields
o JSON
o RDDs
o JDBC/ODBC Server
 connecting Business Intelligence tools
 remote access to Spark cluster
GraphX
● New Spark API for graphs and graph-parallel computation
● Resilient Distributed Property Graph (RDPG, extends RDD)
o directed multigraph ( -> parallel edges)
o properties attached to each vertex and edge
● Common graph operations (subgraph computation, joining vertices, ...)
● Growing collection of graph algorithms
Motivation
● Growing scale and importance of graph data
● Application of data-parallel algorithms to graph computation is inefficient
● Graph-parallel systems (Pregel, PowerGraph, ...) designed for efficient
execution of graph algorithms
o do not address graph construction & transformation
o limited fault tolerance & data mining support
Performance Comparison
Property Graph
● Directed multigraph with user defined objects to each vertex and edge
Property Graph
Triplet View
● Logical join of vertex and edge properties
Graph Operations
● Basic information (numEdges, numVertices, inDegrees, ...)
● Views (vertices, edges, triplets)
● Caching (persist, cache, ...)
● Transformation (mapVertices, mapEdges, ...)
● Structure modification (reverse, subgraph, ...)
● Neighbour aggregation (collectNeighbours, aggregations, ...)
● Pregel API
● Graph builders (various I/O operations)
● ...
Graph Algorithms
● Built-in algorithms
o PageRank, Connected Components, Triangle Count, ...
Demo
Spark Streaming
● Scalable, high-throughput, fault-tolerant stream processing
Architecture
● Streams are chopped up into batches
● Each batch is processed in Spark
● Results pushed out in batches
Streaming Word Count
Streaming Word Count
StreamingContext
● Entry point for all streaming functionality
o define input sources
o stream transformations
o output operations to DStreams
o starts & stops streaming process
● Limitations
o once started, computations cannot be added
o cannot be restarted
o one active per JVM
Discretized Streams
● Basic abstraction, represents a continuous stream of data
● DStreams
● Implemented as series of RDDs
Stateless Transformations
● Processing of each batch does not depend on previous batches
● Transformation is separately applied to every batch
o Map, flatMap, filter, reduce, groupBy, …
● Combining data from multiple DStreams
o Join, cogroup, union, ...
Stateful Transformations
● Use data or intermediate results from previous batches to compute the
result of the current batch
● Windowed operations
o act over a sliding window of time periods
● UpdateStateByKey
o maintain state while continuously updating it with new information
● Require checkpointing
Output Operations
● Specify what needs to be done with the final transformed data
● Pushing to external DB, printing, …
● If not performed, DStream is not evaluated
Input Sources
● Built-in support for a number of different data sources
● Often in additional libraries (i.e. spark-streaming-kafka)
● HDFS
● Akka Actor Stream
● Apache Kafka
● Apache Flume
● Twitter Stream
● Kinesis
● Custom Sources
● ...
Demo
Conclusion
● RDD repetition
● Spark Modules Overview
o Spark SQL
o GraphX
o Spark Streaming
Questions

Contenu connexe

Tendances

Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015 Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Sameer Farooqui
 

Tendances (20)

What is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkWhat is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache Spark
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
 
How The Weather Company Uses Apache Spark to Serve Weather Data Fast at Low Cost
How The Weather Company Uses Apache Spark to Serve Weather Data Fast at Low CostHow The Weather Company Uses Apache Spark to Serve Weather Data Fast at Low Cost
How The Weather Company Uses Apache Spark to Serve Weather Data Fast at Low Cost
 
Materialized Column: An Efficient Way to Optimize Queries on Nested Columns
Materialized Column: An Efficient Way to Optimize Queries on Nested ColumnsMaterialized Column: An Efficient Way to Optimize Queries on Nested Columns
Materialized Column: An Efficient Way to Optimize Queries on Nested Columns
 
Apache Spark sql
Apache Spark sqlApache Spark sql
Apache Spark sql
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
 
Road to Analytics
Road to AnalyticsRoad to Analytics
Road to Analytics
 
Distributed Stream Processing - Spark Summit East 2017
Distributed Stream Processing - Spark Summit East 2017Distributed Stream Processing - Spark Summit East 2017
Distributed Stream Processing - Spark Summit East 2017
 
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
 
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015 Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscape
 
Vertica And Spark: Connecting Computation And Data
Vertica And Spark: Connecting Computation And DataVertica And Spark: Connecting Computation And Data
Vertica And Spark: Connecting Computation And Data
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the Surface
 
Tachyon-2014-11-21-amp-camp5
Tachyon-2014-11-21-amp-camp5Tachyon-2014-11-21-amp-camp5
Tachyon-2014-11-21-amp-camp5
 
Operating and Supporting Delta Lake in Production
Operating and Supporting Delta Lake in ProductionOperating and Supporting Delta Lake in Production
Operating and Supporting Delta Lake in Production
 
Structuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and StreamingStructuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and Streaming
 
Enabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and REnabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and R
 
Spark etl
Spark etlSpark etl
Spark etl
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student Slides
 
Spark Sql for Training
Spark Sql for TrainingSpark Sql for Training
Spark Sql for Training
 

En vedette

GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
Paco Nathan
 
Machine Learning and GraphX
Machine Learning and GraphXMachine Learning and GraphX
Machine Learning and GraphX
Andy Petrella
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
Databricks
 

En vedette (20)

GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)
 
An excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphXAn excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphX
 
Graphs are everywhere! Distributed graph computing with Spark GraphX
Graphs are everywhere! Distributed graph computing with Spark GraphXGraphs are everywhere! Distributed graph computing with Spark GraphX
Graphs are everywhere! Distributed graph computing with Spark GraphX
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Real time and reliable processing with Apache Storm
Real time and reliable processing with Apache StormReal time and reliable processing with Apache Storm
Real time and reliable processing with Apache Storm
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
 
Machine Learning and GraphX
Machine Learning and GraphXMachine Learning and GraphX
Machine Learning and GraphX
 
Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016
 
Apache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriApache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-Ari
 
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
 
Debugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauDebugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden Karau
 
Spark on YARN
Spark on YARNSpark on YARN
Spark on YARN
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
Graph processing - Powergraph and GraphX
Graph processing - Powergraph and GraphXGraph processing - Powergraph and GraphX
Graph processing - Powergraph and GraphX
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
 
Machine Learning by Example - Apache Spark
Machine Learning by Example - Apache SparkMachine Learning by Example - Apache Spark
Machine Learning by Example - Apache Spark
 
Hadoop or Spark: is it an either-or proposition? By Slim Baltagi
Hadoop or Spark: is it an either-or proposition? By Slim BaltagiHadoop or Spark: is it an either-or proposition? By Slim Baltagi
Hadoop or Spark: is it an either-or proposition? By Slim Baltagi
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in Spark
 
Introduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingIntroduction to Apache Spark Developer Training
Introduction to Apache Spark Developer Training
 

Similaire à Spark Concepts - Spark SQL, Graphx, Streaming

Similaire à Spark Concepts - Spark SQL, Graphx, Streaming (20)

Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analytics
 
Introduction to Structured Data Processing with Spark SQL
Introduction to Structured Data Processing with Spark SQLIntroduction to Structured Data Processing with Spark SQL
Introduction to Structured Data Processing with Spark SQL
 
Big Data processing with Apache Spark
Big Data processing with Apache SparkBig Data processing with Apache Spark
Big Data processing with Apache Spark
 
Secrets of Spark's success - Deenar Toraskar, Think Reactive
Secrets of Spark's success - Deenar Toraskar, Think Reactive Secrets of Spark's success - Deenar Toraskar, Think Reactive
Secrets of Spark's success - Deenar Toraskar, Think Reactive
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
 
Introduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastIntroduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at last
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
 
New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
 
Apache Spark II (SparkSQL)
Apache Spark II (SparkSQL)Apache Spark II (SparkSQL)
Apache Spark II (SparkSQL)
 
Yet another intro to Apache Spark
Yet another intro to Apache SparkYet another intro to Apache Spark
Yet another intro to Apache Spark
 
Data processing with spark in r & python
Data processing with spark in r & pythonData processing with spark in r & python
Data processing with spark in r & python
 
Apache Spark for Beginners
Apache Spark for BeginnersApache Spark for Beginners
Apache Spark for Beginners
 
Streamsets and spark
Streamsets and sparkStreamsets and spark
Streamsets and spark
 
Apache spark its place within a big data stack
Apache spark  its place within a big data stackApache spark  its place within a big data stack
Apache spark its place within a big data stack
 
Apache spark - History and market overview
Apache spark - History and market overviewApache spark - History and market overview
Apache spark - History and market overview
 
Apache Spark e AWS Glue
Apache Spark e AWS GlueApache Spark e AWS Glue
Apache Spark e AWS Glue
 
Introduction to spark 2.0
Introduction to spark 2.0Introduction to spark 2.0
Introduction to spark 2.0
 
Spark Structured Streaming
Spark Structured StreamingSpark Structured Streaming
Spark Structured Streaming
 
Introduction to Postrges-XC
Introduction to Postrges-XCIntroduction to Postrges-XC
Introduction to Postrges-XC
 

Plus de Petr Zapletal

Plus de Petr Zapletal (11)

Change Data Capture - Scale by the Bay 2019
Change Data Capture - Scale by the Bay 2019Change Data Capture - Scale by the Bay 2019
Change Data Capture - Scale by the Bay 2019
 
Adopting GraalVM - NE Scala 2019
Adopting GraalVM - NE Scala 2019Adopting GraalVM - NE Scala 2019
Adopting GraalVM - NE Scala 2019
 
Adopting GraalVM - Scala eXchange London 2018
Adopting GraalVM - Scala eXchange London 2018Adopting GraalVM - Scala eXchange London 2018
Adopting GraalVM - Scala eXchange London 2018
 
Adopting GraalVM - Scale by the Bay 2018
Adopting GraalVM - Scale by the Bay 2018Adopting GraalVM - Scale by the Bay 2018
Adopting GraalVM - Scale by the Bay 2018
 
Real World Serverless
Real World ServerlessReal World Serverless
Real World Serverless
 
Reactive mistakes - ScalaDays Chicago 2017
Reactive mistakes -  ScalaDays Chicago 2017Reactive mistakes -  ScalaDays Chicago 2017
Reactive mistakes - ScalaDays Chicago 2017
 
Reactive mistakes reactive nyc
Reactive mistakes   reactive nycReactive mistakes   reactive nyc
Reactive mistakes reactive nyc
 
Top Mistakes When Writing Reactive Applications - Scala by the Bay 2016
Top Mistakes When Writing Reactive Applications - Scala by the Bay 2016Top Mistakes When Writing Reactive Applications - Scala by the Bay 2016
Top Mistakes When Writing Reactive Applications - Scala by the Bay 2016
 
Distributed Real-Time Stream Processing: Why and How 2.0
Distributed Real-Time Stream Processing:  Why and How 2.0Distributed Real-Time Stream Processing:  Why and How 2.0
Distributed Real-Time Stream Processing: Why and How 2.0
 
Distributed real time stream processing- why and how
Distributed real time stream processing- why and howDistributed real time stream processing- why and how
Distributed real time stream processing- why and how
 
MLlib and Machine Learning on Spark
MLlib and Machine Learning on SparkMLlib and Machine Learning on Spark
MLlib and Machine Learning on Spark
 

Dernier

TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
mohitmore19
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
Health
 
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
VishalKumarJha10
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
VictorSzoltysek
 

Dernier (20)

10 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 202410 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 2024
 
Chinsurah Escorts ☎️8617697112 Starting From 5K to 15K High Profile Escorts ...
Chinsurah Escorts ☎️8617697112  Starting From 5K to 15K High Profile Escorts ...Chinsurah Escorts ☎️8617697112  Starting From 5K to 15K High Profile Escorts ...
Chinsurah Escorts ☎️8617697112 Starting From 5K to 15K High Profile Escorts ...
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdfAzure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
 
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
 
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) SolutionIntroducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
 

Spark Concepts - Spark SQL, Graphx, Streaming

  • 1. Apache Spark Concepts - Spark SQL, GraphX, Streaming Petr Zapletal Cake Solutions
  • 2. Apache Spark and Big Data 1) History and market overview 2) Installation 3) MLlib and Machine Learning on Spark 4) Porting R code to Scala and Spark 5) Concepts - Spark SQL, GraphX, Streaming 6) Spark’s distributed programming model 7) Deployment
  • 3. Table of contents ● Resilient Distributed Datasets ● Spark SQL ● GraphX ● Spark Streaming ● Q & A
  • 5. Resilient Distributed Datasets ● Immutable, distributed collection of records ● Lazy evaluation, caching option, can be persisted ● Number of operations & transformations ● Can be created from data storage or different RDD
  • 6. Spark SQL ● Spark’s interface to work with structured or semistructured data ● Structured data o known set of fields for each record - schema ● Main capabilities o load data from variety of structured sources o query the data with SQL o integration between Spark (Java, Scala and Python API) and SQL (joining RDDs and SQL tables, using SQL functionality)
  • 7. More than SQL ● Unified interface for structured data
  • 8. SchemaRDD ● RDD of row objects, each representing a record ● Known schema (i.e. data fields) of its rows ● Behaves like regular RDD, stored in more efficient manner ● Adds new operations, especially running SQL queries ● Can be created from o external data sources o results of queries o regular RDD ● Used in ML Pipeline API
  • 10. Getting Started ● Entry points: o HiveContext  superset functionality, Hive related o SQLContext
  • 11. ● Loads input JSON file into SchemaRDD ● Uses context to execute query Query Example
  • 12. Loading and Saving Data ● Supports number of structured data sources o Apache Hive  data warehouse infrastructure on top of Hadoop  summarization, querying (SQL-like interface) and analysis o Parquet  column-oriented storage format in Hadoop ecosystem  efficient storage of records with nested fields o JSON o RDDs o JDBC/ODBC Server  connecting Business Intelligence tools  remote access to Spark cluster
  • 13. GraphX ● New Spark API for graphs and graph-parallel computation ● Resilient Distributed Property Graph (RDPG, extends RDD) o directed multigraph ( -> parallel edges) o properties attached to each vertex and edge ● Common graph operations (subgraph computation, joining vertices, ...) ● Growing collection of graph algorithms
  • 14. Motivation ● Growing scale and importance of graph data ● Application of data-parallel algorithms to graph computation is inefficient ● Graph-parallel systems (Pregel, PowerGraph, ...) designed for efficient execution of graph algorithms o do not address graph construction & transformation o limited fault tolerance & data mining support
  • 16. Property Graph ● Directed multigraph with user defined objects to each vertex and edge
  • 18. Triplet View ● Logical join of vertex and edge properties
  • 19. Graph Operations ● Basic information (numEdges, numVertices, inDegrees, ...) ● Views (vertices, edges, triplets) ● Caching (persist, cache, ...) ● Transformation (mapVertices, mapEdges, ...) ● Structure modification (reverse, subgraph, ...) ● Neighbour aggregation (collectNeighbours, aggregations, ...) ● Pregel API ● Graph builders (various I/O operations) ● ...
  • 20. Graph Algorithms ● Built-in algorithms o PageRank, Connected Components, Triangle Count, ...
  • 21. Demo
  • 22. Spark Streaming ● Scalable, high-throughput, fault-tolerant stream processing
  • 23. Architecture ● Streams are chopped up into batches ● Each batch is processed in Spark ● Results pushed out in batches
  • 26. StreamingContext ● Entry point for all streaming functionality o define input sources o stream transformations o output operations to DStreams o starts & stops streaming process ● Limitations o once started, computations cannot be added o cannot be restarted o one active per JVM
  • 27. Discretized Streams ● Basic abstraction, represents a continuous stream of data ● DStreams ● Implemented as series of RDDs
  • 28. Stateless Transformations ● Processing of each batch does not depend on previous batches ● Transformation is separately applied to every batch o Map, flatMap, filter, reduce, groupBy, … ● Combining data from multiple DStreams o Join, cogroup, union, ...
  • 29. Stateful Transformations ● Use data or intermediate results from previous batches to compute the result of the current batch ● Windowed operations o act over a sliding window of time periods ● UpdateStateByKey o maintain state while continuously updating it with new information ● Require checkpointing
  • 30. Output Operations ● Specify what needs to be done with the final transformed data ● Pushing to external DB, printing, … ● If not performed, DStream is not evaluated
  • 31. Input Sources ● Built-in support for a number of different data sources ● Often in additional libraries (i.e. spark-streaming-kafka) ● HDFS ● Akka Actor Stream ● Apache Kafka ● Apache Flume ● Twitter Stream ● Kinesis ● Custom Sources ● ...
  • 32. Demo
  • 33. Conclusion ● RDD repetition ● Spark Modules Overview o Spark SQL o GraphX o Spark Streaming

Notes de l'éditeur

  1. Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.[2] While initially developed by Facebook, Apache Hive is now used and developed by other companies such as Netflix.[3][4] Amazon maintains a software fork of Apache Hive that is included in Amazon Elastic MapReduce on Amazon Web Services.[5]
  2. Connected Components and PageRank algorithms https://amplab.cs.berkeley.edu/wp-content/uploads/2014/09/graphx.pdf For Spark we implemented the algorithms both using idiomatic dataflow operators (Naive Spark, as described in Section 3.2) and using an optimized implementation (Optimized Spark) that eliminates movement of edge data by pre-partitioning the edges to match the partitioning adopted by GraphX. We have excluded Giraph and Optimized Spark from Figure 7c because they were unable to scale to the larger web-graph in the allotted memory of the cluster. While the basic Spark implementation did not crash, it was forced to re-compute blocks from disk and exceeded 8000 seconds per iteration. We attribute the increased memory overhead to the use of edge-cut partitioning and the need to store bi-directed edges and messages for the connected components algorithm
  3. https://spark.apache.org/docs/latest/graphx-programming-guide.html
  4. cogroup - When called on DStream of (K, V) and (K, W) pairs, return a new DStream of (K, Seq[V], Seq[W]) tuples. join - When called on two DStreams of (K, V) and (K, W) pairs, return a new DStream of (K, (V, W)) pairs with all pairs of elements for each key. union - Return a new DStream that contains the union of the elements in the source DStream and otherDStream.