This document describes a data warehousing solution using Apache Spark, developed by Team 18 for the MovieLens 20M movie rating dataset. Key aspects of the solution include storing the dataset in HDFS for faster access, developing an API interface using Flask, querying the data through Spark RDDs in response to API calls, and using GraphX to plot graphs of results such as movie rating progressions. The goal was to build a scalable data warehouse system for performing queries and basic analytics on large movie rating data.
2. INTRODUCTION TO DATA WAREHOUSE
A data warehouse is constructed by integrating data from multiple heterogeneous
sources. It supports analytical reporting, structured and/or ad hoc queries and decision
making.
A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile
collection of data. This data helps analysts in an organization make informed
decisions.
It is kept separate from the organization's operational database, and it is not
updated frequently.
It possesses consolidated historical data, which helps the organization to analyze its
business.
4. KEY FEATURES
Subject Oriented - A data warehouse is subject-oriented because it provides information organized
around a subject rather than the organization's ongoing operations.
Integrated - A data warehouse is constructed by integrating data from heterogeneous sources
such as relational databases, flat files, etc. This integration enhances the effective analysis of data.
Time Variant - The data collected in a data warehouse is identified with a particular time period.
The data in a data warehouse provides information from the historical point of view.
Non-volatile - Non-volatile means the previous data is not erased when new data is added. A
data warehouse is kept separate from the operational database, so frequent changes in the
operational database are not reflected in the data warehouse.
5. DATA WAREHOUSE VS OPERATIONAL DATABASE
An operational database is constructed for well-known tasks and workloads such as
searching for particular records, indexing, etc. In contrast, data warehouse queries are
often complex and present a general, aggregated view of the data.
Operational databases support concurrent processing of multiple transactions.
Concurrency control and recovery mechanisms are required for operational
databases to ensure robustness and consistency of the database.
An operational database query allows both read and modify operations, while an
OLAP query needs only read-only access to the stored data.
An operational database maintains current data. On the other hand, a data
warehouse maintains historical data.
6. APACHE SPARK
Open source
An alternative to MapReduce for certain applications
A low-latency cluster computing system
For very large data sets
May be 100 times faster than MapReduce for:
Iterative algorithms
Interactive data mining
Used with Hadoop / HDFS
Released under the BSD License
7. SPARK FEATURES
Uses in-memory cluster computing
Memory access is faster than disk access
Has APIs written in:
Scala
Java
Python
Can be accessed from the Scala and Python shells
Currently an Apache incubator project
Scales to very large clusters
Uses in-memory processing for increased speed
Low-latency shell access
8. OUR DATA WAREHOUSE SOLUTION
Building a data warehouse requires a large amount of data to start with, combined with
substantial computational resources.
This project deals with creating a data-warehouse-like system which can perform basic
queries and some analytics.
Use-cases that we are dealing with:
Ad-hoc queries such as “best movies of 2012”, “best comedy movies” etc.
Movie rating progression graph
Movie recommendation engine
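As an illustration of the first use-case, the aggregation behind a query like "best movies of 2012" can be sketched in plain Python (in the Spark version these loops become map, filter, and reduceByKey transformations; the sample rows and the function name are hypothetical, not the project's actual code):

```python
import csv
import io
from collections import defaultdict

# Hypothetical stand-ins for movies.csv and ratings.csv.
movies_csv = """movieId,title,genres
1,Alpha (2012),Drama
2,Beta (2011),Comedy
3,Gamma (2012),Action
"""
ratings_csv = """userId,movieId,rating,timestamp
1,1,4.0,1356998400
2,1,5.0,1356998401
1,2,3.0,1356998402
2,3,2.0,1356998403
"""

def best_movies_of_year(movies_text, ratings_text, year, top_n=10):
    # Keep only movies whose title carries the requested release year.
    titles = {row["movieId"]: row["title"]
              for row in csv.DictReader(io.StringIO(movies_text))
              if "(%d)" % year in row["title"]}
    # Accumulate per-movie rating sums and counts
    # (the reduceByKey step in the Spark version).
    sums, counts = defaultdict(float), defaultdict(int)
    for row in csv.DictReader(io.StringIO(ratings_text)):
        mid = row["movieId"]
        if mid in titles:
            sums[mid] += float(row["rating"])
            counts[mid] += 1
    # Rank by average rating, highest first.
    ranked = sorted(((sums[m] / counts[m], titles[m]) for m in sums),
                    reverse=True)
    return [(title, avg) for avg, title in ranked[:top_n]]
```

A genre query such as "best comedy movies" has the same shape, with the year filter replaced by a check on the genres column.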
9. MOVIELENS 20M DATASET
movielens.org is a movie ratings aggregator run by the GroupLens research group.
GroupLens provides MovieLens datasets of several sizes for free; they can be found at
http://grouplens.org/datasets/movielens/
For this project, we are using the MovieLens 20M dataset, which is the largest of all the
datasets provided by MovieLens.
Statistics about the dataset:
20 million ratings
465,000 tag applications
27,000 movies
10. DESCRIBING THE DATA
The data contains 4 CSV files, of which only 2 are useful for this project:
movies.csv - movieid, title, genres
ratings.csv - userid, movieid, rating, timestamp
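A minimal sketch of parsing one line of each file into typed fields (the helper names are illustrative; csv.reader is used because movie titles can contain quoted commas):

```python
import csv

def parse_movie(line):
    # movies.csv row: movieId,title,genres (genres are pipe-separated);
    # csv.reader correctly handles titles containing quoted commas.
    movie_id, title, genres = next(csv.reader([line]))
    return int(movie_id), title, genres.split("|")

def parse_rating(line):
    # ratings.csv row: userId,movieId,rating,timestamp
    user_id, movie_id, rating, ts = next(csv.reader([line]))
    return int(user_id), int(movie_id), float(rating), int(ts)
```

In the Spark pipeline, functions like these would be applied to each line with a map transformation after skipping the header row.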
11. SOME IDEAS FROM HIVE
A data warehouse infrastructure built on top of Hadoop for providing data
summarization, query, and analysis.
Supports analysis of large datasets stored in Hadoop's HDFS and compatible file
systems such as Amazon S3 filesystem.
Provides a mechanism to project structure onto this data and query the data using a
SQL-like language called HiveQL.
12. FOREGROUND
Taking ideas from Apache Hive, we have proposed the following solution in this
project:
Dataset files are stored in HDFS.
An API interface has been developed using Flask instead of a graphical interface. API
rules have been defined for each query.
On hitting the API URL with the appropriate parameters, the results
are displayed in the browser window.
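A minimal Flask sketch of such an API rule (the route, parameter, and placeholder query function are illustrative, not the project's actual code):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def best_movies(year):
    # Placeholder standing in for the Spark-backed query function.
    return [{"title": "Example (%d)" % year, "avg_rating": 4.5}]

# One API rule per query; e.g. GET /best-movies?year=2012
@app.route("/best-movies")
def best_movies_endpoint():
    year = int(request.args.get("year", 2012))
    return jsonify(best_movies(year))
```

Hitting the URL with the year parameter returns the query result as JSON, which the browser renders directly.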
13. BACKGROUND
The dataset files are pushed to HDFS for faster access without any modifications.
For each query, the files are read from HDFS and converted to Spark RDDs (Resilient
Distributed Datasets).
RDDs are a logical collection of data partitioned across machines. They can be
manipulated in parallel.
The API call is parsed for its parameters, and the corresponding query
function is called accordingly.
The result of the query is handed over to Flask and displayed in the browser. GraphX
has been used for plotting graphs.
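The rating progression query behind that graph can be sketched in plain Python (in the actual pipeline the same map and reduce steps run on Spark RDDs; the sample rows and function name are hypothetical):

```python
import csv
import io
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical ratings.csv rows for a single movie (movieId 1).
ratings_csv = """userId,movieId,rating,timestamp
1,1,3.0,946684800
2,1,4.0,946684900
3,1,5.0,978307300
"""

def rating_progression(ratings_text, movie_id):
    # Bucket the movie's ratings by the calendar year of the timestamp,
    # then average each bucket -- a map + reduceByKey pass in Spark terms.
    sums, counts = defaultdict(float), defaultdict(int)
    for row in csv.DictReader(io.StringIO(ratings_text)):
        if int(row["movieId"]) != movie_id:
            continue
        year = datetime.fromtimestamp(int(row["timestamp"]),
                                      tz=timezone.utc).year
        sums[year] += float(row["rating"])
        counts[year] += 1
    return sorted((y, sums[y] / counts[y]) for y in sums)
```

The resulting (year, average rating) pairs are the points plotted on the progression graph.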