2. ● Ganesha Yadiyala
● Big data consultant at datamantra.io
● Consultant in Spark and Scala
● ganeshayadiyala@gmail.com
3. Agenda
● Problem Statement
● Business view
● Why Spark
● Thinking REST
● Load API
● Transform API
● Machine learning
● Pipeline API
● Save API
4. Problem Statement
Build a generic solution that can be used to transform data
and then analyse it to extract useful results.
5. Business view
● This is the era of big data.
● All companies are trying to extract something useful from
their data and solve problems with it.
● Many frameworks exist in the big data space, but we need
a tool that leverages most of them and solves problems
easily.
● A general solution or tool that can solve many of these
problems would be a big plus.
6. Why we used Spark
There are many big data frameworks out there that can be
used for data analysis, but we chose Spark because of its:
● Capability to handle multiple data sources
● Easy binding with external data
● Good support for machine learning through Spark ML and
Spark MLlib
7. Thinking REST
To expose all this transformation and analysis we provide a
REST API because:
● It minimises the coupling between client and server
● Different clients can use the REST API to interact with the
tool
● We used Akka-http for the REST service
8. Akka-http
Akka-http is an actor-based toolkit for interacting with web
services and clients.
● It is also written in Scala and uses the same configuration
management library (Typesafe Config) as Spark
● It is actor and Future based
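As a flavour of the toolkit, here is a minimal Akka-http server
sketch (Akka HTTP 10.0.x-era API); the route and port are
illustrative, not the tool's actual endpoints:

import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.server.Directives._
import akka.stream.ActorMaterializer

object RestServer extends App {
  // One actor system and materializer back both routes and streams
  implicit val system = ActorSystem("rest-server")
  implicit val materializer = ActorMaterializer()

  // Illustrative health-check route, not an endpoint from the talk
  val route = path("ping") {
    get { complete("pong") }
  }

  Http().bindAndHandle(route, "0.0.0.0", 8080)
}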
10. Rest server design
● Instead of going with spark-jobserver we built our own
REST server
● Once the REST server is started, a Spark context is
created
● All configuration is passed to the Spark context through
Typesafe Config during its creation
● The same context is used for all operations
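A minimal sketch of that context creation; the Typesafe Config
key names (spark.master, spark.app-name) are assumptions, not
the tool's actual keys:

import com.typesafe.config.ConfigFactory
import org.apache.spark.{SparkConf, SparkContext}

// Load application.conf via Typesafe Config and feed it to Spark
val config = ConfigFactory.load()
val sparkConf = new SparkConf()
  .setMaster(config.getString("spark.master"))
  .setAppName(config.getString("spark.app-name"))

// Created once at server startup and shared by all requests
val sc = new SparkContext(sparkConf)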
11. Loading from different sources
We supported different types of data sources:
● CSV data source
● JSON data source
● Parquet data source
● XML data source
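A sketch of those file-format loads through Spark's
DataFrameReader (Spark 2.x style; the paths are placeholders,
and XML needs the external spark-xml package):

// Assuming an existing SparkSession named spark
val csvDf = spark.read.option("header", "true").csv("data/input.csv")
val jsonDf = spark.read.json("data/input.json")
val parquetDf = spark.read.parquet("data/input.parquet")
val xmlDf = spark.read
  .format("com.databricks.spark.xml")  // external spark-xml package
  .option("rowTag", "row")
  .load("data/input.xml")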
12. Loading from different sources
We also supported external sources such as:
● MongoDB
● Kafka
● JDBC
● Cassandra
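Most of these plug in as DataFrame sources through their
connector packages. A sketch with illustrative connection
options (Kafka is normally consumed as a stream, so it is
omitted here):

val mongoDf = spark.read
  .format("com.mongodb.spark.sql")  // mongo-spark connector
  .option("uri", "mongodb://localhost/db.collection")
  .load()
val jdbcDf = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost/db")
  .option("dbtable", "events")
  .load()
val cassandraDf = spark.read
  .format("org.apache.spark.sql.cassandra")  // spark-cassandra-connector
  .options(Map("keyspace" -> "ks", "table" -> "events"))
  .load()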
13. Transformation
In the big data world, the data coming into the system often
cannot be used as it is; we may have to transform it as
needed for the operation.
We provide REST APIs to do these transformations, which
internally call Spark DataFrame APIs (see the sketch below).
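As a sketch of how a REST call maps onto a DataFrame call,
here is a hypothetical cast endpoint; the route name, the
query parameters, and the in-scope DataFrame df are all
assumptions, not the tool's actual API:

// Hypothetical route: POST /transform/cast?column=price&to=double
val castRoute =
  path("transform" / "cast") {
    post {
      parameters("column", "to") { (column, to) =>
        // The REST layer only parses parameters; Spark does the work
        val casted = df.withColumn(column, df(column).cast(to))
        complete(s"casted column '$column' to $to")
      }
    }
  }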
14. Example
Some of the transformations we provided are:
● Cast - cast the datatype of a column
● Filter - filter based on a formula or condition
● Aggregation - max, min, sum, median, etc.
● Joins - joining two datasets
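The DataFrame calls these endpoints delegate to look roughly
like this (column names and the DataFrames df, left and right
are assumptions for illustration):

import org.apache.spark.sql.functions._

val casted = df.withColumn("price", col("price").cast("double"))  // Cast
val filtered = df.filter(col("price") > 100)                      // Filter
val aggregated = df.groupBy("category")
  .agg(max("price"), min("price"), sum("price"))                  // Aggregation
val joined = left.join(right, Seq("id"))                          // Join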
15. Machine learning - Spark ML
Spark ML provides a higher-level API which is built on top of
the DataFrame.
● We did not use MLlib because it is built on top of the
RDD.
● We provided a REST API which talks to these ML APIs.
16. Example
Some of the ML APIs we provided are:
● Linear regression
● Decision tree (regressor and classifier)
● Ridge regression
● KMeans etc...
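A minimal spark.ml linear regression sketch; the feature
columns, the input DataFrame rawDf, and the hyperparameter
values are illustrative. Ridge regression is the same
estimator with elasticNetParam set to 0 (pure L2 penalty):

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression

// spark.ml expects a single vector column of features plus a label
val assembler = new VectorAssembler()
  .setInputCols(Array("x1", "x2"))
  .setOutputCol("features")
val training = assembler.transform(rawDf).select("features", "label")

val lr = new LinearRegression().setMaxIter(10).setRegParam(0.1)
val model = lr.fit(training)
val predictions = model.transform(training)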
17. Challenges in Spark ML
● It was very difficult to write a generic API because not
all ML algorithms expect similar inputs
● Not all the APIs are documented properly
● Validating the types of the columns that can be passed to
these APIs is really difficult
18. Save API
Once the transformation is done or the ML step produces
output, the user may want to save the result. We support:
● text
● JSON
● Parquet
● MongoDB
● Cassandra etc...
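These map onto Spark's DataFrameWriter; a sketch with
placeholder output locations and connector options:

// text output expects a single string column
result.select("value").write.text("out/result.txt")
result.write.mode("overwrite").json("out/result.json")
result.write.parquet("out/result.parquet")
result.write
  .format("com.mongodb.spark.sql")
  .option("uri", "mongodb://localhost/db.results")
  .save()
result.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "ks", "table" -> "results"))
  .mode("append")  // the Cassandra connector requires append mode
  .save()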
19. Pipeline and scheduling
We also implemented a pipeline API which pipes together the
load, transform and ML APIs (a toy sketch follows).
If the user wants to run these operations at a scheduled
time, that is possible through the schedule API we have
provided.
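A toy sketch of the pipelining idea: each stage is a
DataFrame-to-DataFrame function and the stages run in order.
The real API is driven over REST, so this only illustrates
the composition:

import org.apache.spark.sql.DataFrame

type Stage = DataFrame => DataFrame

// Fold the input through the stages in order
def runPipeline(input: DataFrame, stages: Seq[Stage]): DataFrame =
  stages.foldLeft(input)((df, stage) => stage(df))

// e.g. runPipeline(csvDf, Seq(_.filter("price > 100"), _.select("id", "price")))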
20. Summary
No single solution can solve all big data problems, but we
tried to build a tool that is generic enough to let you write
your own transformations on data, analyse it, and solve many
of these problems.