SlideShare une entreprise Scribd logo
1  sur  41
Spark and Spark Streaming
Eric Fu
2018-Jun-04
Agenda
• Spark
• Resilient Distributed Datasets (RDD)
• Transformations and Actions
• Implementation
• Spark SQL
• Spark Streaming
• Discretized Streams (D-Streams)
• Stateful Transformations
• Consistency: exactly-once
• Spark Structured Streaming
• System Design
MapReduce
MapReduce reuse the immediate data by writing to external storage
How to achieve fault-tolerance?
• An efficient way to put data in memory and keep it persistent
• Copy to external storage (costly)
• Replicate to several nodes (costly)
• Just recompute it (but only when data is deterministic)
Resilient Distributed Datasets (RDD)
• RDD is a read-only, partitioned collection of records
• RDD can only be created through deterministic operations
Resilient Distributed Datasets (RDD)
lines = spark.textFile("hdfs://...")
errors = lines
.filter(_.startsWith("ERROR"))
.filter(_.contains("HDFS"))
.map(_.split('t')(3))
.collect()
Fault-tolerance
• Failed partition can be recomputed
• Stragglers can be moved to other nodes
Programming Interface
• API similar to Java 8 Stream
• Driver - Master - Worker
• Driver tracks RDDs lineage
• Driver send functions to Worker
Transformations
Actions
Example: PageRank
PageRank in Spark
Better to specify partition:
Inside RDD
• Partitions
• Dependencies (parents)
• Iterator (constructor)
• Metadata
Inside RDD (cont.)
• HDFS Files
• map
• union
• sample
• join
2 kinds of dependencies
Job Execution
When user runs an action ...
1. Build lineage graph (DAG)
2. Find missing partitions
3. Schedule tasks based on locality
4. Wait until completed
Spark SQL
A Relational, Declarative API to Spark
Differences
• DataFrame API
• DataFrame = Table
• Keep track of schema
• An RDD of Row objects
• Catalyst
• SQL Optimizer
• SQL with UDF
Catalyst
row.get("x")+3
Spark Streaming
From Batch to Streaming System
Existing Streaming Systems
• Continuous operator model
• Long-running, stateful operators
• Hard to handle faults or stragglers
• Hard to perform backup & recovery – replication or upstream backup
Discretized Streams (D-Streams)
• Structure a streaming computation as a series of short, stateless,
deterministic batch computations on small time intervals
• Higher latency (100ms vs 1s)
• Higher throughput (2–5x faster than Storm)
• Easy to handle faults or stragglers (parallel recovery 1-2s)
Continuous model vs. D-Stream
Example
• Running word count
• Auto checkpoint
• Fault or straggler
Programming Interface
• Input
• Transformation
• Stateless
• Or with state across intervals
• Output operation
Stateful transformations
• Windowing
• Groups the records from a sliding window into one RDD
• Incremental aggregation
• Aggregate over a sliding window
Consistency Semantics
• Hard to provide consistency of state across nodes in streaming system
• D-Streams provide consistent "exactly-once" processing across the
cluster
Exactly-once (1/3)
Exactly-once (2/3)
Exactly-once (3/3)
State Management
• Asynchronous RDD Checkpointing
• Lineage cutoff
Spark Structured Streaming
Incremental SQL Processing
Differences
• To provide exactly-once
• Input sources must be replayable
• Output sinks must support idempotent write
• SQL and DataFrame API
• User can mark a column as denoting event time
• An additional continuous processing mode
"Incrementalize"
Window
Tumbling Window
Hopping Window
Sliding Window
Session Window
Watermarks
• It's impossible to allow arbitrarily late data
• Need to set a watermark for event time columns
• Watermarks affect when stateful operators can forget old state
System Design
Architecture
Master
• Tracks the D-Stream lineage graph
• Schedules tasks to compute new RDD partitions
Worker
• Receive and store partitions of RDD (input or computed)
• Execute tasks
Some Details
• Pipelines operators that can be grouped into a single task
• Submits next timestep before the current one finished
• Asynchronous checkpoints of RDDs and forgets lineage
• Block store manages RDD partitions in an LRU fashion
• Master recovery
Fault and Straggler
• Parallel Recovery
• Parallel across partitions of the RDDs in each timestep
• Parallel across timesteps for independent operations
• Detect stragglers (1.4× slower)
Get your hands dirty!
Thanks!
Q&A

Contenu connexe

Tendances

Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Databricks
 
Fedbench - A Benchmark Suite for Federated Semantic Data Processing
Fedbench - A Benchmark Suite for Federated Semantic Data ProcessingFedbench - A Benchmark Suite for Federated Semantic Data Processing
Fedbench - A Benchmark Suite for Federated Semantic Data Processing
Peter Haase
 
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015 Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Sameer Farooqui
 

Tendances (20)

Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
 
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
 
Spark Meetup at Uber
Spark Meetup at UberSpark Meetup at Uber
Spark Meetup at Uber
 
Strata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark StreamingStrata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark Streaming
 
Big Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and ZeppelinBig Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and Zeppelin
 
Reactive dashboard’s using apache spark
Reactive dashboard’s using apache sparkReactive dashboard’s using apache spark
Reactive dashboard’s using apache spark
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Introduction to Apache Spark
Introduction to Apache Spark Introduction to Apache Spark
Introduction to Apache Spark
 
Fedbench - A Benchmark Suite for Federated Semantic Data Processing
Fedbench - A Benchmark Suite for Federated Semantic Data ProcessingFedbench - A Benchmark Suite for Federated Semantic Data Processing
Fedbench - A Benchmark Suite for Federated Semantic Data Processing
 
Flink in Zalando's world of Microservices
Flink in Zalando's world of Microservices   Flink in Zalando's world of Microservices
Flink in Zalando's world of Microservices
 
Apache Spark 1.6 with Zeppelin - Transformations and Actions on RDDs
Apache Spark 1.6 with Zeppelin - Transformations and Actions on RDDsApache Spark 1.6 with Zeppelin - Transformations and Actions on RDDs
Apache Spark 1.6 with Zeppelin - Transformations and Actions on RDDs
 
Introduction to Apache Spark
Introduction to Apache Spark Introduction to Apache Spark
Introduction to Apache Spark
 
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
 
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and DatabricksFour Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
 
Spark meetup TCHUG
Spark meetup TCHUGSpark meetup TCHUG
Spark meetup TCHUG
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015 Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
 
Spark Advanced Analytics NJ Data Science Meetup - Princeton University
Spark Advanced Analytics NJ Data Science Meetup - Princeton UniversitySpark Advanced Analytics NJ Data Science Meetup - Princeton University
Spark Advanced Analytics NJ Data Science Meetup - Princeton University
 

Similaire à Spark and Spark Streaming

Real-world consistency explained
Real-world consistency explainedReal-world consistency explained
Real-world consistency explained
Uwe Friedrichsen
 

Similaire à Spark and Spark Streaming (20)

Apache Spark Components
Apache Spark ComponentsApache Spark Components
Apache Spark Components
 
Spark 计算模型
Spark 计算模型Spark 计算模型
Spark 计算模型
 
Best Practices for Migrating Your Data Warehouse to Amazon Redshift
Best Practices for Migrating Your Data Warehouse to Amazon RedshiftBest Practices for Migrating Your Data Warehouse to Amazon Redshift
Best Practices for Migrating Your Data Warehouse to Amazon Redshift
 
Apache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming modelApache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming model
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
 
Glint with Apache Spark
Glint with Apache SparkGlint with Apache Spark
Glint with Apache Spark
 
Spark streaming high level overview
Spark streaming high level overviewSpark streaming high level overview
Spark streaming high level overview
 
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
 
Azure Cosmos DB - Technical Deep Dive
Azure Cosmos DB - Technical Deep DiveAzure Cosmos DB - Technical Deep Dive
Azure Cosmos DB - Technical Deep Dive
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2
 
Spark Overview and Performance Issues
Spark Overview and Performance IssuesSpark Overview and Performance Issues
Spark Overview and Performance Issues
 
try
trytry
try
 
Apache Spark II (SparkSQL)
Apache Spark II (SparkSQL)Apache Spark II (SparkSQL)
Apache Spark II (SparkSQL)
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Shark
SharkShark
Shark
 
Apache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.pptApache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.ppt
 
Cassandra training
Cassandra trainingCassandra training
Cassandra training
 
Real-world consistency explained
Real-world consistency explainedReal-world consistency explained
Real-world consistency explained
 
Study Notes: Apache Spark
Study Notes: Apache SparkStudy Notes: Apache Spark
Study Notes: Apache Spark
 

Plus de 宇 傅

Plus de 宇 傅 (12)

Parallel Query Execution
Parallel Query ExecutionParallel Query Execution
Parallel Query Execution
 
The Evolution of Data Systems
The Evolution of Data SystemsThe Evolution of Data Systems
The Evolution of Data Systems
 
The Volcano/Cascades Optimizer
The Volcano/Cascades OptimizerThe Volcano/Cascades Optimizer
The Volcano/Cascades Optimizer
 
PelotonDB - A self-driving database for hybrid workloads
PelotonDB - A self-driving database for hybrid workloadsPelotonDB - A self-driving database for hybrid workloads
PelotonDB - A self-driving database for hybrid workloads
 
Immutable Data Structures
Immutable Data StructuresImmutable Data Structures
Immutable Data Structures
 
The Case for Learned Index Structures
The Case for Learned Index StructuresThe Case for Learned Index Structures
The Case for Learned Index Structures
 
Functional Programming in Java 8
Functional Programming in Java 8Functional Programming in Java 8
Functional Programming in Java 8
 
第三届阿里中间件性能挑战赛冠军队伍答辩
第三届阿里中间件性能挑战赛冠军队伍答辩第三届阿里中间件性能挑战赛冠军队伍答辩
第三届阿里中间件性能挑战赛冠军队伍答辩
 
Data Streaming Algorithms
Data Streaming AlgorithmsData Streaming Algorithms
Data Streaming Algorithms
 
Golang 101
Golang 101Golang 101
Golang 101
 
Docker Container: isolation and security
Docker Container: isolation and securityDocker Container: isolation and security
Docker Container: isolation and security
 
Paxos and Raft Distributed Consensus Algorithm
Paxos and Raft Distributed Consensus AlgorithmPaxos and Raft Distributed Consensus Algorithm
Paxos and Raft Distributed Consensus Algorithm
 

Dernier

+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
Health
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
mohitmore19
 
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
masabamasaba
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
masabamasaba
 

Dernier (20)

+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
 
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park %in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
 
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
 
%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Harare%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Harare
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 
Architecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastArchitecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the past
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
 
%in Durban+277-882-255-28 abortion pills for sale in Durban
%in Durban+277-882-255-28 abortion pills for sale in Durban%in Durban+277-882-255-28 abortion pills for sale in Durban
%in Durban+277-882-255-28 abortion pills for sale in Durban
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 

Spark and Spark Streaming

Notes de l'éditeur

  1. We can also write a custom Partitioner class to group pages that link to each other together (e.g., partition the URLs by domain name).
  2. and metadata about its partitioning scheme and data placement
  3. The system tries to place both state and tasks to maximize data locality, but this underlying flexibility makes speculation and parallel recovery possible
  4. dropping data to disk if there is not enough memory