1. Data Science Bootcamp Day-3
Presented by: Chetan Khatri, Volunteer Teaching Assistant,
Data Science lab, University of Kachchh
Guided by: Prof. Devji D. Chhanga, University of Kachchh.
2. Agenda
An Introduction to Apache Spark
Apache Spark single node configuration
MapReduce Program on Spark Cluster
An Introduction to Apache Kafka
Apache Kafka single node configuration
Create a topic and push messages to it
3. Spark Terminology
» Spark and SQL Contexts: A Spark program first creates a SparkContext object
» SparkContext tells Spark how and where to access a cluster
» The program next creates a sqlContext object
» Use sqlContext to create DataFrames
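A minimal sketch of this setup in a standalone Scala program (Spark 1.x style, where the SQLContext is built from the SparkContext; in the spark-shell both objects are pre-created as sc and sqlContext, and the app name below is a made-up example):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// The configuration tells Spark how and where to access a cluster;
// "local[*]" runs Spark locally using all available cores.
val conf = new SparkConf()
  .setAppName("bootcamp-day-3")   // hypothetical application name
  .setMaster("local[*]")

val sc = new SparkContext(conf)       // entry point to the cluster
val sqlContext = new SQLContext(sc)   // used to create DataFrames
```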
4. Review : DataFrames
The primary abstraction in Spark
» Immutable once constructed.
» Track lineage information to efficiently recompute lost data.
» Enable operations on collections of elements in parallel.
You construct DataFrames
» by parallelizing existing Scala collections (lists)
» by transforming an existing Spark DataFrame
» from files in HDFS or any other storage system
5. Review: DataFrames
Two types of operations: transformations and actions.
Transformations are lazy (not computed immediately).
A transformed DF is computed only when an action runs on it.
DFs can be persisted (cached) in memory or on disk.
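The lazy/eager split can be illustrated without a cluster using plain Scala lazy views; this is an analogy, not Spark code, but the shape is the same: building the view (like a transformation) does no work, and only forcing it (like an action) triggers computation.

```scala
// Plain-Scala analogy for Spark's lazy transformations.
var evaluated = 0

// "Transformation": defining the lazy view computes nothing yet.
val doubled = List(1, 2, 3).view.map { n =>
  evaluated += 1   // side effect lets us observe when work happens
  n * 2
}
assert(evaluated == 0)       // lazy: no elements processed yet

// "Action": forcing the view triggers the computation.
val result = doubled.toList
assert(evaluated == 3)       // now every element was processed
assert(result == List(2, 4, 6))
```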
6. Resilient Distributed Datasets
Untyped Spark abstraction underneath DataFrames:
» Immutable once constructed
» Track lineage information to efficiently recompute lost data
» Enable operations on collections of elements in parallel
You construct RDDs
» by parallelizing existing Scala collections (lists)
» by transforming an existing RDD or DataFrame
» from files in HDFS or any other storage system
7. When to use DataFrames?
Need high-level transformations and actions, and want high-level
control over your dataset.
Have typed (structured or semi-structured) data.
You want DataFrame optimization and performance benefits
» Catalyst Optimization Engine
• 75% reduction in execution time
» Project Tungsten off-heap memory management
• 75+% reduction in memory usage (less GC)
8. Apache Spark MapReduce
1) Start Apache Spark Shell
./bin/spark-shell
2) Let's read a text file
scala> val textFile = sc.textFile("file:///home/chetan306/inputfile.txt")
3) RDDs have actions, which return values, and transformations, which return pointers to new RDDs. Let’s
start with a few actions:
scala> textFile.count()
scala> textFile.first()
4) Now let’s use a transformation. We will use the filter transformation to return a new RDD with a subset
of the items in the file.
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
// Get the transformation's output.
scala> linesWithSpark.collect()
9. Apache Spark MapReduce
5) We can chain together transformations and actions:
scala> textFile.filter(line => line.contains("Spark")).count()
6) One common data flow pattern is MapReduce, as popularized by Hadoop. Spark
can implement MapReduce flows easily:
scala> val wordCounts = textFile.flatMap(line => line.split(" "))
         .map(word => (word, 1))
         .reduceByKey((a, b) => a + b)
scala> wordCounts.collect()
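The same map-reduce shape can be traced on plain Scala collections, which makes it easy to check the logic without a cluster. Below, groupBy plus a per-key sum plays the role of reduceByKey, and the input lines are made up for illustration:

```scala
// Word count on a local Scala collection, mirroring the RDD version:
// flatMap splits lines into words, map pairs each word with 1,
// then counts are summed per key.
val lines = Seq("Spark is fast", "Spark is fun")

val wordCounts = lines
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .groupBy(_._1)                                              // gather pairs by key, like the shuffle
  .map { case (word, pairs) => (word, pairs.map(_._2).sum) }  // reduce per key

assert(wordCounts("Spark") == 2)
assert(wordCounts("fast") == 1)
```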