DATA WAREHOUSING SOLUTION
USING APACHE SPARK
TEAM 18
AYUSH KHANDELWAL
GAURAV PARIDA
ANIL REDDY
MEHAK AGARWAL
INTRODUCTION TO DATA WAREHOUSE
A data warehouse is constructed by integrating data from multiple heterogeneous
sources. It supports analytical reporting, structured and/or ad hoc queries and decision
making.
A data warehouse is a subject oriented, integrated, time-variant, and non-volatile
collection of data. This data helps analysts to take informed decisions in an
organization.
It is kept separate from the organization's operational database. There is no frequent
updating done in a data warehouse.
It possesses consolidated historical data, which helps the organization to analyze its
business.
Image taken from wikipedia.org/datawarehouse
KEY FEATURES
Subject Oriented - A data warehouse is subject oriented because it provides information around a
subject rather than the organization's ongoing operations.
Integrated - A data warehouse is constructed by integrating data from heterogeneous sources
such as relational databases, flat files, etc. This integration enhances the effective analysis of data.
Time Variant - The data collected in a data warehouse is identified with a particular time period.
The data in a data warehouse provides information from the historical point of view.
Non-volatile - Non-volatile means the previous data is not erased when new data is added to it. A
data warehouse is kept separate from the operational database, and therefore frequent changes in
the operational database are not reflected in the data warehouse.
DATA WAREHOUSE VS OPERATIONAL DATABASE
An operational database is constructed for well-known tasks and workloads such as
searching particular records, indexing, etc. In contrast, data warehouse queries are
often complex and they present a general form of data.
Operational databases support concurrent processing of multiple transactions.
Concurrency control and recovery mechanisms are required for operational
databases to ensure robustness and consistency of the database.
An operational database query can both read and modify data, while an
OLAP query needs only read-only access to stored data.
An operational database maintains current data. On the other hand, a data
warehouse maintains historical data.
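
The contrast above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the orders table, its columns and its rows are hypothetical, and a single SQLite database stands in for both kinds of system.

```python
import sqlite3

# Hypothetical table and rows, used only to contrast the two query styles.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL, year INTEGER)"
)
conn.executemany(
    "INSERT INTO orders (customer, amount, year) VALUES (?, ?, ?)",
    [("alice", 20.0, 2014), ("bob", 35.0, 2014), ("alice", 50.0, 2015), ("bob", 15.0, 2015)],
)

# Operational style: locate one record and modify it inside a short transaction.
with conn:
    conn.execute("UPDATE orders SET amount = ? WHERE id = ?", (25.0, 1))
one_row = conn.execute("SELECT customer, amount FROM orders WHERE id = 1").fetchone()

# Warehouse style: a read-only aggregate over the whole history.
yearly_totals = conn.execute(
    "SELECT year, SUM(amount) FROM orders GROUP BY year ORDER BY year"
).fetchall()

print(one_row)
print(yearly_totals)
```

The first query touches a single row and changes it; the second only reads, but scans every row to summarize it.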
APACHE SPARK
Open Source
Alternative to MapReduce for certain applications
A low latency cluster computing system
For very large data sets
May be up to 100 times faster than MapReduce for
Iterative algorithms
Interactive data mining
Used with Hadoop / HDFS
Originally released under the BSD License, now under the Apache License 2.0
SPARK FEATURES
Uses in-memory cluster computing
Memory access is faster than disk access
Has APIs in
Scala
Java
Python
Can be accessed from Scala and Python shells
A top-level Apache project (graduated from the Apache Incubator in 2014)
Scales to very large clusters
Uses in-memory processing for increased speed
Low latency shell access
OUR DATA WAREHOUSE SOLUTION
Building a data warehouse is a task that requires a lot of data to start, combined with
immense computational resources.
This project deals with creating a data-warehouse-like system which can perform basic
queries and some analytics.
Use-cases that we are dealing with:
Ad-hoc queries such as “best movies of 2012”, “best comedy movies” etc.
Movie rating progression graph
Movie recommendation engine
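
A minimal sketch of the first use-case in plain Python follows. It is not the project's code. It assumes that "best" means highest average rating among movies with a minimum number of ratings, that the year is the release year written in parentheses at the end of the title, and that genres are separated by "|". The sample rows, best_movies and release_year are hypothetical names introduced for illustration.

```python
import re
from collections import defaultdict

# Hypothetical sample rows: (movieid, title, genres) and (userid, movieid, rating, timestamp).
movies = [
    (1, "Alpha (2012)", "Comedy|Drama"),
    (2, "Beta (2012)", "Action"),
    (3, "Gamma (2010)", "Comedy"),
]
ratings = [
    (10, 1, 4.0, 0), (11, 1, 5.0, 0),
    (10, 2, 3.0, 0), (11, 2, 3.5, 0),
    (10, 3, 5.0, 0), (12, 3, 4.5, 0),
]

YEAR_RE = re.compile(r"\((\d{4})\)\s*$")


def release_year(title):
    """Return the year in trailing parentheses, or None if the title has none."""
    match = YEAR_RE.search(title)
    return int(match.group(1)) if match else None


def best_movies(movies, ratings, year=None, genre=None, min_ratings=2, top_n=10):
    """Rank movies by average rating, optionally filtered by year and genre."""
    totals = defaultdict(lambda: [0.0, 0])  # movieid -> [sum of ratings, count]
    for _user, movie_id, rating, _ts in ratings:
        totals[movie_id][0] += rating
        totals[movie_id][1] += 1

    ranked = []
    for movie_id, title, genres in movies:
        if year is not None and release_year(title) != year:
            continue
        if genre is not None and genre not in genres.split("|"):
            continue
        total, count = totals.get(movie_id, (0.0, 0))
        if count >= min_ratings:  # ignore movies with too few ratings
            ranked.append((title, total / count))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]


print(best_movies(movies, ratings, year=2012))
print(best_movies(movies, ratings, genre="Comedy"))
```

The min_ratings threshold keeps a movie with a single 5-star rating from topping the list.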
MOVIELENS 20M DATASET
MovieLens (movielens.org) is a movie rating and recommendation site run by GroupLens, a research
group at the University of Minnesota. GroupLens provides MovieLens datasets of different sizes for free at
http://grouplens.org/datasets/movielens/
For this project, we are using the MovieLens 20M dataset, which is the largest of all the
datasets provided by MovieLens.
Statistics about the dataset:
20 million ratings
465,000 tag applications
27,000 movies
DESCRIBING THE DATA
The data contains 4 CSV files of which only 2 are useful for this project:
movies.csv - movieid, title, genres
ratings.csv - userid, movieid, rating, timestamp
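
The two files can be read into typed records with Python's csv module, as in the sketch below. It assumes each file starts with a header row, that titles containing commas are quoted, that genres are separated by "|", and that the timestamp is in seconds since the Unix epoch. The sample rows and the names Movie, Rating, parse_movies and parse_ratings are hypothetical.

```python
import csv
import io
from typing import List, NamedTuple


class Movie(NamedTuple):
    movieid: int
    title: str
    genres: List[str]


class Rating(NamedTuple):
    userid: int
    movieid: int
    rating: float
    timestamp: int  # assumed: seconds since the Unix epoch


def parse_movies(text):
    """Parse movies.csv content; csv.reader handles quoted titles with commas."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    return [Movie(int(row[0]), row[1], row[2].split("|")) for row in reader]


def parse_ratings(text):
    """Parse ratings.csv content into typed Rating records."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    return [Rating(int(r[0]), int(r[1]), float(r[2]), int(r[3])) for r in reader]


# Hypothetical sample content in the shape described above.
movies_text = 'movieid,title,genres\n1,Alpha (2012),Comedy|Drama\n2,"Example, The (2001)",Drama\n'
ratings_text = "userid,movieid,rating,timestamp\n10,1,4.5,1112486027\n"

print(parse_movies(movies_text))
print(parse_ratings(ratings_text))
```

Splitting each line on commas would break the second sample title, which is why a CSV parser is used here.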
SOME IDEAS FROM HIVE
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data
summarization, query and analysis.
Supports analysis of large datasets stored in Hadoop's HDFS and compatible file
systems such as Amazon S3 filesystem.
Provides a mechanism to project structure onto this data and query the data using a
SQL-like language called HiveQL.
FOREGROUND
Taking ideas from Apache Hive, we propose the following solution in this
project:
Dataset files are stored in HDFS.
An API has been developed using Flask instead of a graphical interface. API
rules have been defined for each query.
When the API URL is requested with the appropriate parameters, the results
are displayed in the browser window.
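
The idea of one API rule per query can be sketched without any web framework, using only urllib.parse. In the project Flask plays the routing role. The paths, parameter names, host and placeholder query functions below are all hypothetical.

```python
from urllib.parse import parse_qs, urlparse


def top_movies(year=None, genre=None):
    # Placeholder: a real query function would compute results from the dataset.
    return {"query": "top_movies", "year": int(year) if year is not None else None, "genre": genre}


def rating_progression(movieid):
    # Placeholder for the rating progression query.
    return {"query": "rating_progression", "movieid": int(movieid)}


# One rule per query: path -> (query function, allowed parameter names).
ROUTES = {
    "/top_movies": (top_movies, {"year", "genre"}),
    "/rating_progression": (rating_progression, {"movieid"}),
}


def handle(url):
    """Parse the URL, check its parameters against the rule, call the query function."""
    parsed = urlparse(url)
    if parsed.path not in ROUTES:
        return {"error": "unknown query: " + parsed.path}
    func, allowed = ROUTES[parsed.path]
    params = {key: values[0] for key, values in parse_qs(parsed.query).items()}
    unexpected = set(params) - allowed
    if unexpected:
        return {"error": "unexpected parameters: " + ", ".join(sorted(unexpected))}
    try:
        return func(**params)
    except (TypeError, ValueError):
        # Missing required parameter or a value that is not a number.
        return {"error": "bad parameters"}


print(handle("http://localhost:5000/top_movies?year=2012&genre=Comedy"))
print(handle("http://localhost:5000/rating_progression?movieid=7"))
```

Keeping the rules in one table makes it easy to see which parameters each query accepts and to reject anything else before a query runs.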
BACKGROUND
The dataset files are pushed to HDFS without any modifications, for faster access.
For each query, the files are read from HDFS and converted to Spark RDDs (Resilient
Distributed Datasets).
RDDs are a logical collection of data partitioned across machines. They can be
manipulated in parallel.
The API call is parsed for parameters, and accordingly the corresponding query
function is called.
The result of the query is handed over to Flask and displayed in the browser. GraphX
has been used for plotting the graph.
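
The partitioned processing described above can be imitated in plain Python. The sketch below is not Spark code and does not use the RDD API. It splits rating rows into partitions, computes partial sums per partition, and merges them into a yearly average for one movie, which is the kind of data a rating progression graph needs. Grouping by year, the sample rows and all function names are assumptions made for illustration.

```python
from datetime import datetime, timezone
from functools import reduce


def ts(year):
    """Helper for sample data: a timestamp in the middle of the given year (UTC)."""
    return int(datetime(year, 6, 1, tzinfo=timezone.utc).timestamp())


def partition(rows, n):
    """Split rows into n partitions, as a cluster would spread them across machines."""
    return [rows[i::n] for i in range(n)]


def partial_sums(rows, movieid):
    """Per-partition step: year -> (sum of ratings, count) for one movie."""
    acc = {}
    for _user, mid, rating, timestamp in rows:
        if mid != movieid:
            continue
        year = datetime.fromtimestamp(timestamp, tz=timezone.utc).year
        total, count = acc.get(year, (0.0, 0))
        acc[year] = (total + rating, count + 1)
    return acc


def merge(left, right):
    """Combine the partial results of two partitions."""
    out = dict(left)
    for year, (total, count) in right.items():
        t, c = out.get(year, (0.0, 0))
        out[year] = (t + total, c + count)
    return out


def yearly_average(rows, movieid, n_partitions=3):
    # Each partition is independent, so this step could run in parallel.
    partials = [partial_sums(part, movieid) for part in partition(rows, n_partitions)]
    merged = reduce(merge, partials, {})
    return sorted((year, total / count) for year, (total, count) in merged.items())


# Hypothetical rows: (userid, movieid, rating, timestamp).
rows = [
    (1, 7, 4.0, ts(2005)), (2, 7, 3.0, ts(2005)),
    (3, 7, 5.0, ts(2006)), (4, 8, 1.0, ts(2006)),
]
print(yearly_average(rows, 7))
```

Carrying (sum, count) pairs instead of averages is what lets partitions be merged in any order without changing the result.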
