Big Data with Hadoop & Spark Training: http://bit.ly/2spQIBA
This CloudxLab Introduction to Apache Spark tutorial helps you understand Spark in detail. Topics covered in this tutorial:
1) Spark Architecture
2) Why Apache Spark?
3) Shortcomings of MapReduce
4) Downloading Apache Spark
5) Starting Spark With Scala Interactive Shell
6) Starting Spark With Python Interactive Shell
7) Getting started with spark-submit
2. Introduction
• A much faster take on MapReduce
• Up to 100x faster than Hadoop MapReduce in memory,
• Up to 10x faster on disk.
• Built on a similar paradigm to MapReduce
• Integrates with Hadoop
Spark Core - A fast and general engine for large-scale data processing.
7. Introduction
Shortcomings of MapReduce
1. Batch-oriented design
a. Every MapReduce cycle reads from and writes to HDFS
b. High latency
2. Converting logic to the MapReduce paradigm is difficult
3. In-memory computing was not possible
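Point 1 above - the HDFS round trip between stages - can be sketched in plain Python. This is a toy illustration, not Hadoop or Spark code: the "disk" stage persists its output to a file and re-reads it before the next stage runs, which is what a chain of MapReduce jobs does, while the in-memory version just passes the data along.

```python
import json
import os
import tempfile

data = list(range(1_000))

def stage_via_disk(records, fn):
    """MapReduce style: write the intermediate result to 'HDFS' (a temp
    file here), then read it back before the next stage can start."""
    out = [fn(r) for r in records]
    path = os.path.join(tempfile.gettempdir(), "stage_out.json")
    with open(path, "w") as f:
        json.dump(out, f)          # write intermediate result to disk
    with open(path) as f:
        return json.load(f)        # next stage re-reads it from disk

def stage_in_memory(records, fn):
    """Spark style: the intermediate result stays in memory."""
    return [fn(r) for r in records]

# Two chained stages: add one, then square.
squared_disk = stage_via_disk(stage_via_disk(data, lambda x: x + 1),
                              lambda x: x * x)
squared_mem = stage_in_memory(stage_in_memory(data, lambda x: x + 1),
                              lambda x: x * x)

assert squared_disk == squared_mem  # same answer, very different I/O cost
```

Both pipelines compute the same result; the disk version pays a serialize-write-read-deserialize cost at every stage boundary, which is exactly the latency MapReduce incurs.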
8. Introduction
Shortcomings of MapReduce
[Diagram: each MapReduce job in a chain (Map, Reduce-1, Map, Reduce-2, ...) writes its intermediate result to disk before the next job can read it, whereas Spark keeps intermediate results in RAM - roughly 80 times faster than disk.]
See: https://gist.github.com/jboner/2841832
Latency Numbers Every Programmer Should Know
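The "80 times faster" figure in the diagram follows directly from the latency numbers in the gist above: reading 1 MB sequentially takes about 250,000 ns from memory versus about 20,000,000 ns from disk (figures from the linked gist; actual hardware varies).

```python
# Latency numbers from https://gist.github.com/jboner/2841832
read_1mb_from_memory_ns = 250_000     # ~250 microseconds
read_1mb_from_disk_ns = 20_000_000    # ~20 milliseconds

speedup = read_1mb_from_disk_ns / read_1mb_from_memory_ns
print(f"Sequential memory read is {speedup:.0f}x faster than disk")  # 80x
```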
9. Introduction
We have already installed Apache Spark on CloudxLab,
so you don't have to install anything.
Simply log in to the Web Console
and get started with the commands.
Getting Started - CloudxLab
10. Introduction
1. Find out the Hadoop version:
○ [student@hadoop1 ~]$ hadoop version
○ Hadoop 2.4.0.2.1.4.0-632
2. Go to https://spark.apache.org/downloads.html
3. Select the release matching your Hadoop version and download it
4. On servers, you can use wget
5. Every download can be run in standalone mode
6. Extract it: tar -xzvf spark*.tgz
7. Inside the extracted folder, the bin directory contains the Spark commands
Getting Started - Downloading
11. Introduction
Getting Started - Spark Programs Overview
Program Description
spark-shell Runs the Spark Scala interactive command line
pyspark Runs the Spark Python interactive command line
sparkR Runs R on Spark (/usr/spark2.6/bin/sparkR)
spark-submit Submits a jar or Python application for execution on the cluster
spark-sql Runs the Spark SQL interactive shell
12. Introduction
$ spark-shell
It is basically the Scala REPL, or interactive shell, with one extra variable: “sc”, the SparkContext.
Explore it by typing sc. and pressing Tab for completions.
Starting Spark With Scala Interactive Shell
13. Introduction
$ pyspark
It is basically the Python interactive shell with one extra variable: “sc”, the SparkContext.
Check dir(sc) or help(sc)
Starting Spark With Python Interactive Shell
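As a quick taste of what you might type at that prompt, a typical first pipeline is something like sc.parallelize(range(10)).map(lambda x: x * x).sum(). The snippet below is ordinary single-machine Python, not PySpark - it only mirrors the map-then-aggregate shape that sc would distribute across the cluster.

```python
# In pyspark you would write:
#   sc.parallelize(range(10)).map(lambda x: x * x).sum()
# The plain-Python, single-machine equivalent of that pipeline:
data = range(10)                       # sc.parallelize(range(10))
squares = map(lambda x: x * x, data)   # .map(lambda x: x * x)
total = sum(squares)                   # .sum()
print(total)  # 285
```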
14. Introduction
Getting Started - spark-submit
● To run an example:
○ spark-submit --class org.apache.spark.examples.SparkPi
/usr/hdp/current/spark-client/lib/spark-examples-*.jar 10
The example approximates π via the area of a circle of radius 1: it samples random
points in the enclosing square and counts how many fall inside the circle.
○ See https://en.wikipedia.org/wiki/Approximations_of_%CF%80#Summing_a_circle.27s_area
○ Code:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/SparkPi.scala
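SparkPi's core logic - sample random points in the unit square, count the ones that land inside the circle - can be sketched in plain Python. This is a single-machine sketch; the real example parallelizes the sampling across the cluster with sc.parallelize.

```python
import random

def estimate_pi(num_samples: int, seed: int = 42) -> float:
    """Monte Carlo estimate of pi.

    The circle of radius 1 has area pi; the enclosing 2x2 square has
    area 4, so the fraction of points inside the circle approaches pi/4.
    """
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x = rng.uniform(-1, 1)
        y = rng.uniform(-1, 1)
        if x * x + y * y <= 1:
            inside += 1
    return 4 * inside / num_samples

print(estimate_pi(100_000))  # prints an estimate near 3.14
```

The command-line argument 10 in the spark-submit example above is the number of partitions the sampling work is split into.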
17. Introduction
Getting Started - CloudxLab
To launch Spark on Hadoop,
set the environment variables pointing to the Hadoop configuration:
export YARN_CONF_DIR=/etc/hadoop/conf/
export HADOOP_CONF_DIR=/etc/hadoop/conf/
18. Introduction
Getting Started - CloudxLab
We have installed other versions too:
1. /usr/spark2.0.1/bin/spark-shell
2. /usr/spark1.6/bin/spark-shell
3. /usr/spark1.2.1/bin/spark-shell