Apache Spark 2.1 Tutorial via
Databricks Community Edition
Yao Yao and Mooyoung Lee
MSDS7330-403-Teaching-Presentation
https://spark.apache.org/images/spark-logo-trademark.png
https://spark-summit.org/2017/wp-content/uploads/sites/16/2017/03/databricks-logo.png
Timeline of Spark Development
• Developed in 2009 at UC Berkeley’s AMPLab
• Open-sourced in 2010 under a BSD license
• Top-contributed Apache project since 2014
https://www.slideshare.net/databricks/jump-start-with-apache-spark-20-on-databricks
What is Apache Spark?
• Apache Spark is a fast, general-purpose engine for big data processing, with libraries for SQL, streaming, and advanced analytics.
[Diagram: the Spark stack]
Libraries: Spark SQL + DataFrames, Streaming, MLlib, GraphX, Library Packages
Core: Spark Core API
Languages: R, SQL, Python, Scala, Java
Cluster managers: Databricks, YARN, Mesos, Standalone
https://www.linkedin.com/pulse/future-apache-spark-rodrigo-rivera
Spark Core API
• Programming functionality, task scheduling
• Resilient distributed datasets (RDDs) offer in-memory caching across the cluster
http://www.kdnuggets.com/2016/03/top-spark-ecosystem-projects.html
Spark SQL + DataFrames
• Adds query-planning infrastructure around SQL queries so that Spark can apply more optimizations
• spark.sql("SELECT * FROM Table").show()
https://docs.databricks.com/spark/latest/spark-sql/index.html
Structured Streaming
• Combines streaming with batch and
interactive queries for building end-to-end
continuous applications
https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
Machine Learning Libraries
• Consists of ML algorithms: classification,
regression, clustering, collaborative filtering,
and dimensionality reduction
http://spark.apache.org/docs/2.2.0/ml-guide.html
GraphX
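• Graph-parallel computation on graphs of vertices and edges (GraphFrames is the related DataFrame-based graph package)
• Handles measures such as vertex degree and PageRank, for example in social network analysis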
http://mathworld.wolfram.com/VertexDegree.html http://spark.apache.org/docs/2.2.0/graphx-programming-guide.html
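The two graph measures named on this slide can be sketched in plain Python. This is a minimal illustration, not GraphX or GraphFrames code: the edge list, the function names `vertex_degrees` and `pagerank`, and the damping factor of 0.85 are assumptions chosen for the example.

```python
from collections import defaultdict

# Hypothetical directed "follows" graph: (source, destination) pairs.
EDGES = [("ann", "bob"), ("bob", "cat"), ("cat", "ann"), ("dan", "ann")]


def vertex_degrees(edges):
    """Return {vertex: (in_degree, out_degree)} for a directed edge list."""
    in_deg, out_deg = defaultdict(int), defaultdict(int)
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
    vertices = set(in_deg) | set(out_deg)
    return {v: (in_deg[v], out_deg[v]) for v in vertices}


def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank; assumes every vertex has an outgoing edge."""
    out_links = defaultdict(list)
    for src, dst in edges:
        out_links[src].append(dst)
    vertices = {v for edge in edges for v in edge}
    rank = {v: 1.0 / len(vertices) for v in vertices}
    for _ in range(iterations):
        # Every vertex keeps a small base rank, then receives shares from its in-links.
        new_rank = {v: (1.0 - damping) / len(vertices) for v in vertices}
        for src, targets in out_links.items():
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


degrees = vertex_degrees(EDGES)
ranks = pagerank(EDGES)
print(degrees["ann"])              # (in_degree, out_degree) of "ann"
print(max(ranks, key=ranks.get))   # highest-ranked vertex
```

On a cluster, each iteration's per-edge work would be spread across partitions of the graph; the loop here runs on one machine only to show the arithmetic.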
Import Custom Library Packages
• Lets you install custom libraries and make them available to Spark by attaching them to the cluster
https://databricks.com/blog/2015/07/28/using-3rd-party-libraries-in-databricks-apache-spark-packages-and-maven-libraries.html
Cluster Manager
• This tutorial uses the online Databricks cluster manager. Spark can also run on Hadoop YARN, on Mesos, or in standalone mode
http://www.agildata.com/apache-spark-cluster-managers-yarn-mesos-or-standalone/
Unified Analytics Integrations
• Can be integrated with diverse environments,
applications, and data sources
https://www.slideshare.net/databricks/jump-start-with-apache-spark-20-on-databricks
Cloud Computing on Databricks
1. Go to http://community.cloud.databricks.com/
2. Create a free community account (No computing fees)
3. Create a cluster and select the Spark version
4. Create a notebook and select programming language
5. To change languages within a cell:
– %scala for Scala
– %sql for SQL
– %r for R
– %python for Python
– %md for Markdown
• Additional clusters can be purchased to run jobs in parallel
• Running in the cloud reduces load time and frees up your local machine
• Workspace allows you to save notebooks and libraries.
– Notebooks are sets of any number of cells in which you execute commands.
  • Dashboards can be created from notebooks to display the output of cells without the code that generates it.
– Libraries are packages or modules that provide additional functionality that you need to solve your business problems. These may be custom-written Scala or Java JARs, Python eggs, or other custom-written packages.
• Data is where datasets are uploaded and stored on the Databricks File System (DBFS).
• Clusters are groups of computers that you treat as a single computer. In Databricks, this means that you can effectively treat 20 computers as you might treat one computer. (Paid subscription for more clusters)
• Jobs are scheduled to run either on an existing cluster or on a cluster of their own. A job can be a notebook, a JAR, or a Python script. (Paid subscription for scheduled jobs)
• Apps are third-party integrations with the Databricks platform, such as Tableau.
http://community.cloud.databricks.com/
Creating a Cluster
• Select a unique name for the
cluster.
• Select the Spark Version.
• Enter the number of workers
to bring up - at least 1 is
required to run Spark
commands.
• Select and manage additional
options.
http://community.cloud.databricks.com/
Demo 1: Databricks Basics
• Loading In Dataset and Library
• Revision History
• Commenting
• Dashboard
• Publishing
• Github (paid subscription)
• Collaborations (paid subscription)
• Online notebook has better visualizations than
the local install
Addressing Today’s Challenges
• CPU speed
• End-to-end applications using one engine
• Decision implementation based on real-time
data
https://www.slideshare.net/abhishekcreate/2016-spark-survey
Challenge 1: Hardware Trends
• Storage and network throughput increased roughly 10-fold
• CPU speed remained relatively the same
• Why CPU speed has stalled, and how hardware has adapted:
– We are close to the end of frequency scaling: a CPU cannot run more cycles per second without using more power and generating excessive heat.
– Hardware manufacturers have instead added multiple cores for parallel processing, which requires a programming model such as MapReduce to distribute the computation
           2010             2017
Storage    100 MB/s (HDD)   1000 MB/s (SSD)
Network    1 Gbps           10 Gbps
CPU        ~3 GHz           ~3 GHz
https://www.slideshare.net/databricks/spark-summit-east-2017-matei-zaharia-keynote-trends-for-big-data-and-apache-spark-in-2017
MapReduce
• "Map" is the transformation step: local computation on each record
• "Shuffle" is the synchronization step: intermediate results are regrouped by key across nodes
• "Reduce" is the communication step that combines the results from all the nodes in a cluster
• Jobs execute in sequence and are high-latency (slow): no subsequent job can start until the previous job has finished completely
https://www.youtube.com/watch?v=y7KQcwK2w9I http://vision.cloudera.com/mapreduce-spark/
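The three steps can be sketched as a single-machine word count in plain Python. This is an illustration of the programming model only, not Hadoop or Spark code; the sample records and the names `map_phase`, `shuffle_phase`, and `reduce_phase` are assumptions made for the example.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical input: each string stands for one record held on some node.
RECORDS = ["spark is fast", "mapreduce is slow", "spark is general"]


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in a single record."""
    return [(word, 1) for word in record.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key before any reducing starts."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine each key's values into one result."""
    return {key: reduce(lambda a, b: a + b, values) for key, values in groups.items()}


mapped = [pair for record in RECORDS for pair in map_phase(record)]
word_counts = reduce_phase(shuffle_phase(mapped))
print(word_counts["spark"], word_counts["is"])
```

Note that `reduce_phase` cannot begin until `shuffle_phase` has seen every pair, which is the synchronization barrier the slide describes.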
Spark’s Alternative to MapReduce
• Instead of records, uses column ranges
– Retains the schema model for indexing, which can be read inside MapReduce records
• Plans multi-step work as a directed acyclic graph (DAG) rather than a fixed sequence of jobs
– Mitigates slow nodes by scheduling the whole graph at once instead of step by step, which removes the synchronization barrier between jobs and lowers latency (faster)
• Supports in-memory data sharing across DAGs, so different jobs can work with the same data at very high speeds
http://vision.cloudera.com/mapreduce-spark/
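The idea of recording a graph of steps first, computing later, and sharing an in-memory result between jobs can be sketched with a toy class. This is plain Python and not the Spark API: `LazyDataset`, its `compute_count` field, and the sample data are hypothetical names introduced only for this illustration.

```python
class LazyDataset:
    """Toy lazily evaluated dataset: each instance is one node in a DAG."""

    def __init__(self, parent=None, op=None, data=None):
        self.parent = parent          # upstream node, or None for the source
        self.op = op                  # function applied to the parent's rows
        self.data = data              # only set on the source node
        self._cache_enabled = False
        self._cached = None
        self.compute_count = 0        # how many times this node was evaluated

    def map(self, fn):
        # Record the step; nothing is computed yet.
        return LazyDataset(parent=self, op=lambda rows: (fn(r) for r in rows))

    def filter(self, pred):
        return LazyDataset(parent=self, op=lambda rows: (r for r in rows if pred(r)))

    def cache(self):
        # Ask for the result to be kept in memory once it has been computed.
        self._cache_enabled = True
        return self

    def _rows(self):
        if self._cached is not None:
            return iter(self._cached)
        self.compute_count += 1
        rows = iter(self.data) if self.parent is None else self.op(self.parent._rows())
        if self._cache_enabled:
            self._cached = list(rows)
            return iter(self._cached)
        return rows

    def collect(self):
        # The "action": walks the graph and produces concrete results.
        return list(self._rows())


source = LazyDataset(data=range(1, 11))
squares = source.map(lambda x: x * x).cache()
evens = squares.filter(lambda x: x % 2 == 0)
big = squares.filter(lambda x: x > 50)

evens_result = evens.collect()   # computes squares once and caches them
big_result = big.collect()       # reuses the cached squares
print(evens_result, big_result, squares.compute_count)
```

Two downstream jobs read `squares`, but it is computed only once because the second job is served from memory.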
Project Tungsten for Spark 2.0
• Uses runtime code generation
• Removes expensive iterator calls
• Fuses multiple operators
• Stores data in a binary format and compiles field access to pointer arithmetic (the user writes high-level code while performance increases)
DataFrame code:      df.where(df("year") > 2015)
Logical expression:  GreaterThan(year#345, literal(2015))
Java bytecode:
    bool filter(Object baseObject) {
      int offset = baseOffset + bitSetWidthInBytes + 3*8L;
      int value = Platform.getInt(baseObject, offset);
      return value > 2015;
    }
https://www.slideshare.net/databricks/spark-summit-east-2017-matei-zaharia-keynote-trends-for-big-data-and-apache-spark-in-2017
http://vision.cloudera.com/mapreduce-spark/
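Operator fusion can be illustrated by comparing a chain of iterator-based operators with one hand-fused loop. This is a conceptual sketch in plain Python, not code that Tungsten generates; the sample rows and the function names are assumptions for the example.

```python
# Hypothetical rows; "year" and "sales" are made-up column names.
ROWS = [
    {"year": 2014, "sales": 10},
    {"year": 2016, "sales": 20},
    {"year": 2017, "sales": 30},
    {"year": 2015, "sales": 5},
]


def filter_op(rows, predicate):
    """Generic filter operator: one iterator hop per row."""
    for row in rows:
        if predicate(row):
            yield row


def project_op(rows, column):
    """Generic projection operator: another iterator hop per row."""
    for row in rows:
        yield row[column]


def sum_unfused(rows):
    """Operator-at-a-time plan: each row passes through a chain of iterators."""
    filtered = filter_op(rows, lambda r: r["year"] > 2015)
    projected = project_op(filtered, "sales")
    return sum(projected)


def sum_fused(rows):
    """Fused plan: filter, projection, and aggregation collapsed into one loop."""
    total = 0
    for row in rows:
        if row["year"] > 2015:
            total += row["sales"]
    return total


print(sum_unfused(ROWS), sum_fused(ROWS))
```

Both functions return the same total; the fused version simply avoids the per-row calls between operators, which is the kind of overhead the slide says is removed.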
Demo 2: CPU Speed
• sc.parallelize word count
• Linear regression
• GraphX
https://www.slideshare.net/databricks/2015-0317-scala-days
Speed Comparison between MapReduce and Spark
Spark is 3 to 25 times faster than MapReduce
Challenge 2: Specialized Engines
• More systems to install, configure, connect,
manage, and debug
• Performance suffers because it is hard to move big data across nodes and dynamically allocate resources for different computations
• Writing intermediate data to files so that another engine can analyze it slows the hand-off between systems
https://www.youtube.com/watch?v=vtxwXSGl9V8
End-to-End Applications
• Able to switch between languages, implement
specific libraries, and call on the same dataset
in the same notebook
• Data sets could be cached in RAM while
implementing different computations
• High level APIs allow for vertical integration
• Performance gains are cumulative due to the
aggregation of marginal gains
https://community.cloud.databricks.com/?o=7187633045765022#notebook/418623867444693/command/418623867444694
Diversity in Application Solutions and
Methods
https://www.slideshare.net/abhishekcreate/2016-spark-survey
Demo 3: End-to-End Applications
• Changing Languages for the best models and
applications
• Producing visualizations
• Spark SQL
• JSON Schema
https://www.youtube.com/watch?v=y7KQcwK2w9I
Challenge 3: Decisions Based on Data
Streaming
• Streamed data may not be
reliable due to:
– Node crashes
– Out-of-order data
– Data inconsistency
https://www.youtube.com/watch?v=KspReT2JjeE https://spark.apache.org/streaming/
Structured Streaming / Node Indexing
• Fault tolerance recovers from node crashes
• Special filtering handles out-of-order data
• Indexing keeps data consistent
• Ad-hoc queries on top of streaming data, static data, and batch processes allow more flexibility in real-time decision making
https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
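One common way to cope with data that arrives out of order is to group events by the time they occurred rather than the time they arrived, and to accept late events only up to a fixed delay. The sketch below shows that idea in plain Python. It is a simplification and not the Structured Streaming API: the window length, the allowed delay, the sample events, and the function name `windowed_counts` are all assumptions.

```python
from collections import defaultdict

WINDOW = 10     # window length in seconds (assumed)
MAX_DELAY = 5   # how late an event may arrive and still be counted (assumed)

# Hypothetical stream of (event_time_seconds, key), listed in ARRIVAL order.
EVENTS = [(1, "a"), (12, "a"), (8, "a"), (21, "a"), (3, "a")]


def windowed_counts(events, window=WINDOW, max_delay=MAX_DELAY):
    """Count events per (window_start, key), dropping events that are too late."""
    counts = defaultdict(int)
    dropped = []
    max_event_time = 0
    for event_time, key in events:
        # The lateness threshold trails the newest event time seen so far.
        max_event_time = max(max_event_time, event_time)
        threshold = max_event_time - max_delay
        if event_time < threshold:
            dropped.append((event_time, key))
            continue
        window_start = (event_time // window) * window
        counts[(window_start, key)] += 1
    return dict(counts), dropped


counts, dropped = windowed_counts(EVENTS)
print(counts)
print(dropped)
```

The event at time 8 arrives after the event at time 12 but is still counted in the first window, while the event at time 3 arrives too late and is dropped.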
Real-Time Solutions from the Cloud
• Personalized web results
• Automated stock trading
• Trends in news events
• Credit card fraud
prevention
• Video quality optimization
by dynamically selecting
server sources
https://www.youtube.com/watch?v=KspReT2JjeE
Demo 4: Data Streaming
• Streaming
• Library Packages
• Custom Applications
Spark Leads in Number of Contributors
https://www.slideshare.net/databricks/unified-big-data-processing-with-apache-spark-qcon-2014
Conclusion
• Diversity of organizations,
job fields, and applications
that use Spark will continue
to grow as more people
find use in its various
implementations
• Spark continues to
dominate the analytical
landscape with its efficient
solutions to CPU usage,
end-to-end applications,
and data streaming
https://www.slideshare.net/abhishekcreate/2016-spark-survey
Learn/Teach Apache Spark 2.1 via
Databricks Community Edition
Lynda:
Essential Training: https://www.lynda.com/Apache-Spark-tutorials/Apache-Spark-Essential-Training/550568-2.html
Extending Spark: https://www.lynda.com/Hadoop-tutorials/Extending-Hadoop-Data-Science-Streaming-Spark-Storm-Kafka/516574-2.html
IBM bigdatauniversity:
Spark Fundamentals I: https://courses.cognitiveclass.ai/courses/course-v1:BigDataUniversity+BD0211EN+2016/info
Spark Fundamentals II: https://courses.cognitiveclass.ai/courses/course-v1:BigDataUniversity+BD0212EN+2016/info
Apache Spark Makers Build: https://courses.cognitiveclass.ai/courses/course-v1:BigDataUniversity+TMP0105EN+2016/info
Exploring Spark's GraphX: https://courses.cognitiveclass.ai/courses/course-v1:BigDataUniversity+BD0223EN+2016/info
Analyzing Big Data in R using Apache Spark: https://courses.cognitiveclass.ai/courses/course-v1:BigDataUniversity+RP0105EN+2016/info
completed notebooks: files found from notebook root
O'Reilly:
Definitive Guide Excerpts: http://go.databricks.com/definitive-guide-apache-spark
Raw Chapters: http://shop.oreilly.com/product/0636920034957.do
Databricks:
Educational and Training material: https://docs.databricks.com/spark/latest/training/index.html
Community edition github: https://github.com/databricks/Spark-The-Definitive-Guide
Files for this project:
https://github.com/yaowser/learn-spark
https://youtu.be/IVMbSDS4q3A

Blockchain Security and DemonstrationYao Yao
 

Plus de Yao Yao (20)

Lessons after working as a data scientist for 1 year
Lessons after working as a data scientist for 1 yearLessons after working as a data scientist for 1 year
Lessons after working as a data scientist for 1 year
 
Yao Yao MSDS Alum The Job Search Interview Offer Letter Experience for Data S...
Yao Yao MSDS Alum The Job Search Interview Offer Letter Experience for Data S...Yao Yao MSDS Alum The Job Search Interview Offer Letter Experience for Data S...
Yao Yao MSDS Alum The Job Search Interview Offer Letter Experience for Data S...
 
Yelp's Review Filtering Algorithm Paper
Yelp's Review Filtering Algorithm PaperYelp's Review Filtering Algorithm Paper
Yelp's Review Filtering Algorithm Paper
 
Yelp's Review Filtering Algorithm Poster
Yelp's Review Filtering Algorithm PosterYelp's Review Filtering Algorithm Poster
Yelp's Review Filtering Algorithm Poster
 
Yelp's Review Filtering Algorithm Powerpoint
Yelp's Review Filtering Algorithm PowerpointYelp's Review Filtering Algorithm Powerpoint
Yelp's Review Filtering Algorithm Powerpoint
 
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov Model
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov ModelAudio Separation Comparison: Clustering Repeating Period and Hidden Markov Model
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov Model
 
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov Model
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov ModelAudio Separation Comparison: Clustering Repeating Period and Hidden Markov Model
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov Model
 
Estimating the initial mean number of views for videos to be on youtube's tre...
Estimating the initial mean number of views for videos to be on youtube's tre...Estimating the initial mean number of views for videos to be on youtube's tre...
Estimating the initial mean number of views for videos to be on youtube's tre...
 
Estimating the initial mean number of views for videos to be on youtube's tre...
Estimating the initial mean number of views for videos to be on youtube's tre...Estimating the initial mean number of views for videos to be on youtube's tre...
Estimating the initial mean number of views for videos to be on youtube's tre...
 
Lab 3: Attribute Visualization, Continuous Variable Correlation Heatmap, Trai...
Lab 3: Attribute Visualization, Continuous Variable Correlation Heatmap, Trai...Lab 3: Attribute Visualization, Continuous Variable Correlation Heatmap, Trai...
Lab 3: Attribute Visualization, Continuous Variable Correlation Heatmap, Trai...
 
Lab 1: Data cleaning, exploration, removal of outliers, Correlation of Contin...
Lab 1: Data cleaning, exploration, removal of outliers, Correlation of Contin...Lab 1: Data cleaning, exploration, removal of outliers, Correlation of Contin...
Lab 1: Data cleaning, exploration, removal of outliers, Correlation of Contin...
 
Lab 2: Classification and Regression Prediction Models, training and testing ...
Lab 2: Classification and Regression Prediction Models, training and testing ...Lab 2: Classification and Regression Prediction Models, training and testing ...
Lab 2: Classification and Regression Prediction Models, training and testing ...
 
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
 
Prediction of Future Employee Turnover via Logistic Regression
Prediction of Future Employee Turnover via Logistic RegressionPrediction of Future Employee Turnover via Logistic Regression
Prediction of Future Employee Turnover via Logistic Regression
 
Data Reduction and Classification for Lumosity Data
Data Reduction and Classification for Lumosity DataData Reduction and Classification for Lumosity Data
Data Reduction and Classification for Lumosity Data
 
Predicting Sales Price of Homes Using Multiple Linear Regression
Predicting Sales Price of Homes Using Multiple Linear RegressionPredicting Sales Price of Homes Using Multiple Linear Regression
Predicting Sales Price of Homes Using Multiple Linear Regression
 
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
 
Blockchain Security and Demonstration
Blockchain Security and DemonstrationBlockchain Security and Demonstration
Blockchain Security and Demonstration
 
API Python Chess: Distribution of Chess Wins based on random moves
API Python Chess: Distribution of Chess Wins based on random movesAPI Python Chess: Distribution of Chess Wins based on random moves
API Python Chess: Distribution of Chess Wins based on random moves
 
Blockchain Security and Demonstration
Blockchain Security and DemonstrationBlockchain Security and Demonstration
Blockchain Security and Demonstration
 

Dernier

Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)OPEN KNOWLEDGE GmbH
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...OnePlan Solutions
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxComplianceQuest1
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about usDynamic Netsoft
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number SystemsJheuzeDellosa
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfjoe51371421
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AIABDERRAOUF MEHENNI
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfkalichargn70th171
 
Active Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfActive Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfCionsystems
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataBradBedford3
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...soniya singh
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsAndolasoft Inc
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionSolGuruz
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 

Dernier (20)

Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about us
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number Systems
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdf
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
Active Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfActive Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdf
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 

Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform

  • 1. Apache Spark 2.1 Tutorial via Databricks Community Edition
    Yao Yao and Mooyoung Lee
    MSDS7330-403-Teaching-Presentation
    Images: https://spark.apache.org/images/spark-logo-trademark.png, https://spark-summit.org/2017/wp-content/uploads/sites/16/2017/03/databricks-logo.png
  • 2. Timeline of Spark Development
    • Developed in 2009 at UC Berkeley’s AMPLab
    • Open sourced in 2010 under a BSD license
    • Top-contributed Apache project since 2014
    Source: https://www.slideshare.net/databricks/jump-start-with-apache-spark-20-on-databricks
  • 3. What is Apache Spark?
    • Apache Spark is a fast and general engine for big data analytics processing, with libraries for SQL, streaming, and advanced analytics.
    [Diagram: the Spark stack. Languages: R, SQL, Python, Scala, Java. Libraries: Spark SQL + DataFrames, Streaming, MLlib, GraphX, Library Packages. Engine: Spark Core API. Cluster managers: Databricks, YARN, Mesos, Stand Alone.]
    Source: https://www.linkedin.com/pulse/future-apache-spark-rodrigo-rivera
  • 4. Spark Core API
    • Programming functionality, task scheduling
    • Resilient distributed datasets (RDDs) offer in-memory caching across the cluster
    Source: http://www.kdnuggets.com/2016/03/top-spark-ecosystem-projects.html
  • 5. Spark SQL + DataFrames
    • Wraps structured APIs around SQL queries, which gives the engine more opportunities for optimization
    • spark.sql("SELECT * FROM Table").show()
    Source: https://docs.databricks.com/spark/latest/spark-sql/index.html
  • 6. Structured Streaming
    • Combines streaming with batch and interactive queries for building end-to-end continuous applications
    Source: https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
  • 7. Machine Learning Libraries
    • Include ML algorithms for classification, regression, clustering, collaborative filtering, and dimensionality reduction
    Source: http://spark.apache.org/docs/2.2.0/ml-guide.html
  • 8. GraphX
    • Parallel graph computation (also available through the DataFrame-based GraphFrames package), covering measures such as vertex degree and PageRank for applications such as social networking
    Sources: http://mathworld.wolfram.com/VertexDegree.html, http://spark.apache.org/docs/2.2.0/graphx-programming-guide.html
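To make "vertex degree" on slide 8 concrete, here is a minimal sketch in plain Python that counts in-degrees and out-degrees over a small directed edge list. The edge data and the `vertex_degrees` helper are hypothetical illustrations, not part of the slides; GraphX performs this kind of counting in parallel across a cluster.

```python
from collections import Counter

# Hypothetical directed edge list (source, destination); not data from the slides.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]

def vertex_degrees(edge_list):
    """Return (in_degree, out_degree) counters for a directed edge list."""
    out_degree = Counter(src for src, _ in edge_list)
    in_degree = Counter(dst for _, dst in edge_list)
    return in_degree, out_degree

in_deg, out_deg = vertex_degrees(edges)
print(dict(in_deg))
print(dict(out_deg))
```

A vertex with a high in-degree (here `c`) is one that many other vertices point to, which is the intuition that ranking algorithms such as PageRank build on.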
  • 9. Import Custom Library Packages
    • You can install custom libraries on top of the Spark Core API by attaching them to the cluster
    Source: https://databricks.com/blog/2015/07/28/using-3rd-party-libraries-in-databricks-apache-spark-packages-and-maven-libraries.html
  • 10. Cluster Manager
    • This tutorial uses the online Databricks cluster manager. Spark can also run on Hadoop’s YARN, on Mesos, or in standalone mode
    Source: http://www.agildata.com/apache-spark-cluster-managers-yarn-mesos-or-standalone/
  • 11. Unified Analytics Integrations
    • Can be integrated with diverse environments, applications, and data sources
    Source: https://www.slideshare.net/databricks/jump-start-with-apache-spark-20-on-databricks
  • 12. Cloud Computing on Databricks
    1. Go to http://community.cloud.databricks.com/
    2. Create a free community account (no computing fees)
    3. Create a cluster and select the Spark version
    4. Create a notebook and select a programming language
    5. To change languages within a cell:
      – %scala for Scala
      – %sql for SQL
      – %r for R
      – %python for Python
      – %md for Markdown
    • Multiple clusters can be purchased for running parallel jobs
    • Mitigates load time and frees up your local machine
  • 15. Databricks components
    • Workspace allows you to save notebooks and libraries.
      – Notebooks are a set of any number of cells that allow you to execute commands. Dashboards can be created from notebooks as a way of displaying the output of cells without the code that generates them.
      – Libraries are packages or modules that provide additional functionality that you need to solve your business problems. These may be custom-written Scala or Java jars, Python eggs, or custom-written packages.
    • Data is where datasets are uploaded and stored on the Databricks File Storage.
    • Clusters are groups of computers that you treat as a single computer. In Databricks, this means that you can effectively treat 20 computers as you might treat one computer. (Paid subscription for more clusters)
    • Jobs are scheduled for execution on either an existing cluster or a cluster of their own. These can be notebooks as well as jars or Python scripts. (Paid subscription for scheduled jobs)
    • Apps are third-party integrations with the Databricks platform. These include applications such as Tableau.
    Source: http://community.cloud.databricks.com/
  • 16. Creating a Cluster
    • Select a unique name for the cluster.
    • Select the Spark version.
    • Enter the number of workers to bring up; at least 1 is required to run Spark commands.
    • Select and manage additional options.
    Source: http://community.cloud.databricks.com/
  • 17. Demo 1: Databricks Basics
    • Loading in a dataset and library
    • Revision history
    • Commenting
    • Dashboard
    • Publishing
    • GitHub (paid subscription)
    • Collaborations (paid subscription)
    • The online notebook has better visualizations than the local install
  • 18. Addressing Today’s Challenges
    • CPU speed
    • End-to-end applications using one engine
    • Decision implementation based on real-time data
    Source: https://www.slideshare.net/abhishekcreate/2016-spark-survey
  • 19. Challenge 1: Hardware Trends
    • Storage and network throughput increased 10-fold
    • CPU speed remained relatively the same
    • To compensate for stagnant CPU speed:
      – We are close to the end of frequency scaling for CPUs: clock speed cannot rise to more cycles per second without using more power and generating excessive heat.
      – Hardware manufacturers have added multiple cores for parallel processing, which requires a form of MapReduce to coordinate the distributed computation

                  2010              2017
      Storage     100 MB/s (HDD)    1000 MB/s (SSD)
      Network     1 GB/s            10 GB/s
      CPU         ~3 GHz            ~3 GHz

    Source: https://www.slideshare.net/databricks/spark-summit-east-2017-matei-zaharia-keynote-trends-for-big-data-and-apache-spark-in-2017
  • 20. MapReduce
    • "Map" is the transformation step: local computation on each record
    • "Shuffle" is the synchronization step
    • "Reduce" is the communication step that combines the results from all the nodes in a cluster
    • Jobs execute in sequence and are high-latency (slow): no subsequent job can start until the previous job has finished completely
    Sources: https://www.youtube.com/watch?v=y7KQcwK2w9I, http://vision.cloudera.com/mapreduce-spark/
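The three phases named on slide 20 can be sketched as a word count in plain Python. This is a single-process illustration under stated assumptions, not Hadoop or Spark code: the function names `map_phase`, `shuffle_phase`, and `reduce_phase` and the sample lines are hypothetical. In a real cluster each phase runs across many nodes, and the shuffle moves data over the network.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every record."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1

def shuffle_phase(pairs):
    """Shuffle: group all values that share a key, as if routing them to one node."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine the grouped values for each key into one result."""
    return {key: sum(values) for key, values in groups.items()}

# Hypothetical input records.
lines = ["spark is fast", "spark is general"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)
```

The reduce step cannot start until the shuffle has collected every pair, which mirrors the slide's point that each stage waits for the previous one to finish.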
  • 21. Spark’s Alternative to MapReduce
    • Instead of records, uses column ranges
      – Retains a schema model for indexing, which can be read inside MapReduce records
    • Uses multi-step directed acyclic graphs (DAGs) as an alternative
      – Mitigates slow nodes by executing nodes all at once rather than step by step, and eliminates the synchronization step, which lowers latency (faster)
    • Supports in-memory data sharing across DAGs, so different jobs can work with the same data at very high speeds
    Source: http://vision.cloudera.com/mapreduce-spark/
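The idea on slide 21 of planning several steps up front and sharing results in memory can be illustrated with a toy lazy pipeline in plain Python. The `LazyDataset` class below is a hypothetical stand-in, not Spark's API: it only records transformations, runs them in a single pass when `collect` is called, and keeps the result in memory once `persist` has been requested.

```python
class LazyDataset:
    """Toy lazily evaluated dataset (hypothetical; not Spark's implementation)."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)   # recorded plan: a linear chain of operations
        self._persist = False
        self._cached = None

    def map(self, fn):
        # Transformations only extend the plan; nothing is computed here.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def persist(self):
        # Ask for the result to be kept in memory after the first action.
        self._persist = True
        return self

    def collect(self):
        # Action: run every recorded step in one pass over the source.
        if self._cached is not None:
            return list(self._cached)
        out = []
        for record in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    record = fn(record)
                elif not fn(record):
                    keep = False
                    break
            if keep:
                out.append(record)
        if self._persist:
            self._cached = out
        return list(out)


calls = []

def square(x):
    calls.append(x)  # track how often the transformation actually runs
    return x * x

ds = LazyDataset(range(6)).map(square).filter(lambda x: x % 2 == 0).persist()
calls_before_action = list(calls)  # still empty: only a plan exists so far
first = ds.collect()               # runs the whole plan once
second = ds.collect()              # served from the in-memory copy
print(first, len(calls))
```

Because the second `collect` reuses the stored result, `square` is not called again, which is the toy analogue of different jobs working with the same in-memory data.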
  • 22. Project Tungsten for Spark 2.0
    • Generates code at runtime
    • Removes expensive iterator calls
    • Fuses multiple operators
    • Converts data to a binary format and compiles to pointer arithmetic (the user writes code efficiently while performance increases)
    • Example of the stages:
      – DataFrame code: df.where(df("year") > 2015)
      – Logical expression: GreaterThan(year#345, literal(2015))
      – Java bytecode:
          bool filter(Object baseObject) {
            int offset = baseOffset + bitSetWidthInBytes + 3*8L;
            int value = Platform.getInt(baseObject, offset);
            return value > 2015;
          }
    Sources: https://www.slideshare.net/databricks/spark-summit-east-2017-matei-zaharia-keynote-trends-for-big-data-and-apache-spark-in-2017, http://vision.cloudera.com/mapreduce-spark/
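The path on slide 22 from DataFrame code to a logical expression to generated code can be mimicked in plain Python. This is a rough sketch under stated assumptions: the `Column`, `Literal`, and `GreaterThan` classes and the `interpret`, `to_source`, and `compile_filter` helpers are hypothetical names chosen to echo the slide, and the code generates a Python lambda rather than Java bytecode.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Column:
    name: str

@dataclass(frozen=True)
class Literal:
    value: int

@dataclass(frozen=True)
class GreaterThan:
    left: object
    right: object

def interpret(expr, row):
    """Slow path: walk the expression tree again for every row."""
    if isinstance(expr, Column):
        return row[expr.name]
    if isinstance(expr, Literal):
        return expr.value
    if isinstance(expr, GreaterThan):
        return interpret(expr.left, row) > interpret(expr.right, row)
    raise TypeError(f"unsupported expression: {expr!r}")

def to_source(expr):
    """Translate the expression tree into Python source text, once."""
    if isinstance(expr, Column):
        return f"row[{expr.name!r}]"
    if isinstance(expr, Literal):
        return repr(expr.value)
    if isinstance(expr, GreaterThan):
        return f"({to_source(expr.left)} > {to_source(expr.right)})"
    raise TypeError(f"unsupported expression: {expr!r}")

def compile_filter(expr):
    """Fast path: generate a specialised predicate at runtime."""
    source = f"lambda row: {to_source(expr)}"
    return eval(source), source

# Logical expression for the slide's filter: year > 2015.
expr = GreaterThan(Column("year"), Literal(2015))
rows = [{"year": 2014}, {"year": 2016}, {"year": 2017}]  # hypothetical rows

predicate, source = compile_filter(expr)
interpreted = [r for r in rows if interpret(expr, r)]
generated = [r for r in rows if predicate(r)]
print(source)
```

Both paths select the same rows; the generated predicate simply avoids re-walking the tree per row, which is the kind of overhead the slide's code generation is meant to remove.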
  • 23. Demo 2: CPU Speed
    • sc.parallelize word count
    • Linear regression
    • GraphX
    [Figure: Speed comparison between MapReduce and Spark. Spark is 3 to 25 times faster than MapReduce.]
    Source: https://www.slideshare.net/databricks/2015-0317-scala-days
  • 24. Challenge 2: Specialized Engines
    • More systems to install, configure, connect, manage, and debug
    • Performance dwindles because it is hard to move big data across nodes and dynamically allocate resources for different computations
    • Writing temporary data to a file so that another engine can run its analysis slows down the hand-off between systems
    Source: https://www.youtube.com/watch?v=vtxwXSGl9V8
  • 25. End-to-End Applications
    • Able to switch between languages, use specific libraries, and call on the same dataset in the same notebook
    • Datasets can be cached in RAM while different computations run on them
    • High-level APIs allow for vertical integration
    • Performance gains are cumulative because marginal gains add up
    Source: https://community.cloud.databricks.com/?o=7187633045765022#notebook/418623867444693/command/418623867444694
  • 26. Diversity in Application Solutions and Methods
    Source: https://www.slideshare.net/abhishekcreate/2016-spark-survey
  • 27. Demo 3: End-to-End Applications
    • Changing languages for the best models and applications
    • Producing visualizations
    • Spark SQL
    • JSON schema
    Source: https://www.youtube.com/watch?v=y7KQcwK2w9I
  • 28. Challenge 3: Decisions Based on Data Streaming
    • Streamed data may not be reliable due to:
      – Node crashes
      – Out-of-order (asequential) data
      – Data inconsistency
    Sources: https://www.youtube.com/watch?v=KspReT2JjeE, https://spark.apache.org/streaming/
  • 29. Structured Streaming / Node Indexing
    • Fault tolerance handles node crashes
    • Special filtering handles out-of-order (asequential) data
    • Indexing handles data inconsistency
    • Ad hoc queries on top of streaming data, static data, and batch processes allow more flexibility in real-time decision making
    Source: https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
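One common way to filter out-of-order data, as slide 29 describes, is to group events by their event time and discard those that arrive too far behind the newest event seen. The plain-Python sketch below illustrates that idea; the `windowed_counts` function, its parameters, and the sample events are hypothetical. Structured Streaming exposes a related mechanism through `withWatermark`, whose actual semantics are more involved than this simplification.

```python
from collections import defaultdict

def windowed_counts(events, window_size, allowed_lateness):
    """Count events per event-time window, dropping events that arrive too late.

    events: iterable of (event_time, key) in arrival order, possibly out of order.
    """
    counts = defaultdict(int)
    dropped = []
    max_event_time = float("-inf")
    for event_time, key in events:
        # The watermark trails the newest event time seen so far.
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - allowed_lateness
        if event_time < watermark:
            dropped.append((event_time, key))  # too late to be trusted
            continue
        # Assign by event time, not arrival order, so late data lands in the right window.
        window_start = (event_time // window_size) * window_size
        counts[(window_start, key)] += 1
    return dict(counts), dropped

# Hypothetical arrival order: times 4 and 3 arrive after later events.
events = [(1, "click"), (12, "click"), (4, "click"), (25, "click"), (3, "click")]
counts, dropped = windowed_counts(events, window_size=10, allowed_lateness=10)
print(counts)
print(dropped)
```

The event at time 4 is late but within the allowed lateness, so it is still counted in the first window, while the event at time 3 arrives after the watermark has passed it and is dropped.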
  • 30. Real-Time Solutions from the Cloud
    • Personalized web results
    • Automated stock trading
    • Trends in news events
    • Credit card fraud prevention
    • Video quality optimization by dynamically selecting server sources
    Source: https://www.youtube.com/watch?v=KspReT2JjeE
  • 31. Demo 4: Data Streaming
    • Streaming
    • Library packages
    • Custom applications
    [Figure: Spark leads in number of contributors.]
    Source: https://www.slideshare.net/databricks/unified-big-data-processing-with-apache-spark-qcon-2014
  • 32. Conclusion
    • The diversity of organizations, job fields, and applications that use Spark will continue to grow as more people find uses for its various implementations
    • Spark continues to dominate the analytical landscape with its efficient solutions for CPU usage, end-to-end applications, and data streaming
    Source: https://www.slideshare.net/abhishekcreate/2016-spark-survey
  • 33. Learn/Teach Apache Spark 2.1 via Databricks Community Edition
    • Lynda:
      – Essential Training: https://www.lynda.com/Apache-Spark-tutorials/Apache-Spark-Essential-Training/550568-2.html
      – Extending Spark: https://www.lynda.com/Hadoop-tutorials/Extending-Hadoop-Data-Science-Streaming-Spark-Storm-Kafka/516574-2.html
    • IBM bigdatauniversity:
      – Spark Fundamentals I: https://courses.cognitiveclass.ai/courses/course-v1:BigDataUniversity+BD0211EN+2016/info
      – Spark Fundamentals II: https://courses.cognitiveclass.ai/courses/course-v1:BigDataUniversity+BD0212EN+2016/info
      – Apache Spark Makers Build: https://courses.cognitiveclass.ai/courses/course-v1:BigDataUniversity+TMP0105EN+2016/info
      – Exploring Spark's GraphX: https://courses.cognitiveclass.ai/courses/course-v1:BigDataUniversity+BD0223EN+2016/info
      – Analyzing Big Data in R using Apache Spark: https://courses.cognitiveclass.ai/courses/course-v1:BigDataUniversity+RP0105EN+2016/info
      – Completed notebooks: files found from the notebook root
    • O'Reilly:
      – Definitive Guide Excerpts: http://go.databricks.com/definitive-guide-apache-spark
      – Raw Chapters: http://shop.oreilly.com/product/0636920034957.do
    • Databricks:
      – Educational and Training material: https://docs.databricks.com/spark/latest/training/index.html
      – Community edition GitHub: https://github.com/databricks/Spark-The-Definitive-Guide
    • Files for this project: https://github.com/yaowser/learn-spark
    • Video: https://youtu.be/IVMbSDS4q3A

Editor's notes

  1. https://databricks.com/product/pricing/instance-types