Booking Hotel, Flight, Train, Event & Rental Car
Apache Spark
Created By Josi Aranda @ Tiket.com
Apache Spark
• Apache Spark is a powerful open-source distributed querying and processing engine.
• It provides the flexibility and extensibility of MapReduce at significantly higher speeds: up to 100 times faster than Apache Hadoop when data is stored in memory, and up to 10 times faster when accessing data on disk.
Spark’s Features
Speed
Apache Spark achieves high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine.
[Figure: Logistic regression in Hadoop and Spark]
Ease of Use
Write applications quickly in Java, Scala, Python, R, and SQL. Spark offers over 80 high-level operators that make it easy to build parallel apps, and you can use it interactively from the Scala, Python, R, and SQL shells.
[Figure: Spark's Python DataFrame API, reading JSON files with automatic schema inference]
Generality
Combine SQL, streaming, and complex analytics. Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.
Runs Everywhere
Spark runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources.
Spark Execution Process
• Every Spark application starts a single driver process (which can run multiple jobs) on the master node. The driver directs executor processes (each running multiple tasks) distributed across a number of worker nodes.
• The driver determines the number and composition of the tasks sent to the executors based on the graph generated for the given job. Note that any worker node can execute tasks from a number of different jobs.
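The driver and executor roles described above can be mimicked on a single machine. The following is a minimal sketch and not Spark code: a thread pool stands in for the executors, and the names `split_into_partitions`, `run_task`, and `run_job` are hypothetical, chosen only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Driver side: cut the input into at most num_partitions chunks (one task each)."""
    size = max(1, -(-len(data) // num_partitions))  # ceiling division, never zero
    return [data[i:i + size] for i in range(0, len(data), size)]


def run_task(partition):
    """Executor side: compute a partial result for a single partition."""
    return sum(partition)


def run_job(data, num_executors=3):
    """Driver: plan the tasks, hand them to the executors, combine the partial results."""
    tasks = split_into_partitions(data, num_executors)
    with ThreadPoolExecutor(max_workers=num_executors) as executors:
        partial_sums = list(executors.map(run_task, tasks))
    return sum(partial_sums)


print(run_job(list(range(1, 101))))
```

In a real cluster the executors are separate processes on worker nodes and the driver plans tasks from the job's graph; the sketch only keeps the plan, distribute, and combine shape.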
Resilient Distributed Dataset (RDD)
• Resilient Distributed Datasets (RDDs) are distributed collections of immutable JVM objects that allow you to perform calculations very quickly. They are the backbone of Apache Spark.
• RDDs have two sets of parallel operations: transformations (which return pointers to new RDDs) and actions (which return values to the driver after running a computation).
• RDD transformations are lazy in the sense that they do not compute their results immediately. They are computed only when an action is executed and the results need to be returned to the driver.*
* An RDD is like a teenager doing chores: they won’t do them until their mom starts to check (and then they do them fast and effectively).
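Lazy evaluation can be observed without Spark. The following is a minimal plain-Python sketch in which generators stand in for transformations and `list()` stands in for an action; `lazy_map`, `lazy_filter`, and the `calls` log are hypothetical names used only for illustration.

```python
calls = []  # records every element-level operation that actually runs


def lazy_map(func, items):
    """Stand-in for a transformation: nothing runs until the result is consumed."""
    for item in items:
        calls.append(("map", item))
        yield func(item)


def lazy_filter(predicate, items):
    """Stand-in for a transformation that keeps only matching elements."""
    for item in items:
        calls.append(("filter", item))
        if predicate(item):
            yield item


# "Transformations": this only describes the pipeline.
mapped = lazy_map(lambda x: x + 1, [1, 2, 3])
evens = lazy_filter(lambda x: x % 2 == 0, mapped)
calls_before_action = len(calls)

# "Action": consuming the pipeline forces the computation.
result = list(evens)
calls_after_action = len(calls)

print(calls_before_action, result, calls_after_action)
```

No work is recorded until the pipeline is consumed, which mirrors how Spark defers transformations until an action such as `.collect()` or `.count()` is called.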
RDD (cont.)
Mostly Used RDD Operations
Transformations
• .map()
• .filter()
• .flatMap()
• .distinct()
• .sample()
• .leftOuterJoin()
• .repartition()
Actions
• .take()
• .collect()
• .reduce()
• .count()
• .saveAsTextFile()
• .foreach()
RDD (cont.)
SparkContext().textFile('order__cart.csv')
56312, paid, native_apps
56313, paid, web
56314, shopping_cart, web
56315, paid, web
[Diagram legend: CSV line, partition, RDD]
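RDD (cont.)
Input lines:
56312, paid, native_apps
56313, paid, web
56314, shopping_cart, web
56315, paid, web
Each line is a plain string, so it must be split into fields before it can be indexed:
.map(lambda line: [field.strip() for field in line.split(',')])
Keep only the paid orders:
.filter(lambda line: line[1] == 'paid')
56312, paid, native_apps
56313, paid, web
56315, paid, web
Emit one (channel, 1) pair per order:
.map(lambda line: (line[2], 1))
(native_apps,1)
(web,1)
(web,1)
Sum the counts per channel:
.reduceByKey(lambda x, y: x + y)
(native_apps,1)
(web,2)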
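The filter, map, and reduceByKey chain on this slide can be traced by hand on the four sample rows. The following is a minimal plain-Python sketch, not PySpark: `reduce_by_key` and `count_paid_by_channel` are hypothetical helpers that imitate the RDD operations on an ordinary list, and the field split is an assumed first step, since lines read from a text file are plain strings.

```python
# Sample rows copied from the slide.
LINES = [
    "56312, paid, native_apps",
    "56313, paid, web",
    "56314, shopping_cart, web",
    "56315, paid, web",
]


def reduce_by_key(pairs, func):
    """Plain-Python stand-in for RDD.reduceByKey: merge values that share a key."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


def count_paid_by_channel(lines):
    # Split each CSV line into trimmed fields (assumed preprocessing step).
    rows = [[field.strip() for field in line.split(",")] for line in lines]
    # .filter(lambda line: line[1] == 'paid')
    paid = [row for row in rows if row[1] == "paid"]
    # .map(lambda line: (line[2], 1))
    pairs = [(row[2], 1) for row in paid]
    # .reduceByKey(lambda x, y: x + y)
    return reduce_by_key(pairs, lambda x, y: x + y)


print(count_paid_by_channel(LINES))
```

In Spark the same three steps run per partition across the cluster, and only `reduceByKey` needs to move data between executors.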
Spark DataFrame
• A DataFrame is an immutable distributed collection of data organized into named columns, analogous to a table in a relational database. DataFrames were introduced as an experimental feature in Apache Spark 1.0 under the name SchemaRDD and were renamed to DataFrames in the Apache Spark 1.3 release.
• Imposing a structure on a distributed collection of data allows Spark users to query it with Spark SQL or with expression methods (instead of lambdas).
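The difference between positional lambdas and queries over named columns can be illustrated without a Spark installation. The following is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in for Spark SQL; the table name `orders` and the column names `order_id`, `status`, and `channel` are assumptions, since the slides do not name the columns.

```python
import sqlite3

# Sample rows from the RDD slides, already split into fields.
ROWS = [
    (56312, "paid", "native_apps"),
    (56313, "paid", "web"),
    (56314, "shopping_cart", "web"),
    (56315, "paid", "web"),
]


def paid_counts_with_lambdas(rows):
    """Unstructured style: fields are addressed by position inside lambdas."""
    counts = {}
    for row in filter(lambda r: r[1] == "paid", rows):
        counts[row[2]] = counts.get(row[2], 0) + 1
    return counts


def paid_counts_with_sql(rows):
    """Structured style: named columns let the same question be asked in SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_id INTEGER, status TEXT, channel TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    query = "SELECT channel, COUNT(*) FROM orders WHERE status = 'paid' GROUP BY channel"
    result = dict(conn.execute(query).fetchall())
    conn.close()
    return result


print(paid_counts_with_lambdas(ROWS))
print(paid_counts_with_sql(ROWS))
```

Both functions return the same counts. In Spark the structured form has the further advantage that the query optimizer can see which columns and predicates are involved.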
Ways to Create DataFrame
Spark SQL data sources:
• General Files
• Parquet
• ORC Files
• JSON
• Hive Tables
• JDBC
[Figure: DataFrame preview with default column names _c0, _c1, _c2, _c3]
Ways to Create DataFrame (cont.)
a) Traditional DataFrame creation. b) DataFrame creation directly with SQL. Both return the same result.
Spark Dataset
• Introduced in Apache Spark 1.6, Spark Datasets aim to provide an API that lets users easily express transformations on domain objects while keeping the performance and benefits of the robust Spark SQL execution engine. As part of the Spark 2.0 release (and as noted in the diagram above), the DataFrame API was merged into the Dataset API, unifying data processing capabilities across all libraries.
• Conceptually, a Spark DataFrame is an alias for Dataset[Row], a collection of generic objects where a Row is a generic untyped JVM object. A Dataset, by contrast, is a collection of strongly typed JVM objects, defined by a case class in Scala or a class in Java.
Performance Benchmark (kind of)
[Figure: bar chart "Calculate Quarterly Gross rev." comparing MySQL*, MapReduce, Spark RDD, Spark DataFrame, and Spark DataFrame (direct SQL); axis: execution time in seconds, 0 to 800 (lower is better)]
Year  Quarter     B2C Gross rev.
2016  1          419,563,291,996
2016  2          574,505,787,224
2016  3          537,110,941,199
2016  4          639,459,753,264
2017  1          456,482,358,961
2017  2          587,207,246,225
2017  3          660,881,531,765
2017  4          742,243,815,992
2018  1        1,124,567,178,623
• 6 worker nodes, 2 vCPUs each (12 YARN cores), 13 GB memory each (62.4 GB YARN memory)
• * MySQL: single node, 32 vCPUs, 120 GB memory

Spark from the Surface
