Programming for Data Analytics Project
Property Sales Big Data Analytics using MapReduce
Frameworks
Akshay Kumar Bhushan
x17155878, MSC DA Group-A
School of Computing
Dublin 1, Dublin
x17155878@student.ncirl.ie
Abstract- Over the last decade, the amount of data generated has risen steadily, and it continues to grow with each passing day. Given the velocity and volume at which data is created, it must be stored and processed efficiently. For this paper, the dataset used is the property sales records of Allegheny County, Pennsylvania, USA, covering 2012 to the present. The dataset is large and needs to be stored, processed, and analyzed; therefore, the Hadoop ecosystem is used. The data is loaded from MySQL into HDFS via Sqoop and then moved into Pig and Hive for analysis and output generation. The output is visualized and reported using visualization tools.
Keywords: Big Data, Hadoop, Pig, Hive, Sqoop
I. INTRODUCTION
Property sales happen everywhere, and many factors are usually taken into account in sale decisions. For our dataset of property sales, factors such as property city, type of document, and whether a sale is valid or invalid are considered. Since the dataset is vast, it must be processed in parallel. Parallel processing is done using MapReduce, a technique for storing and processing data across clusters. In MapReduce, the map function takes the input and emits intermediate key-value pairs, while the reduce function takes the output of the map function and combines the results. The combined result is the final output, which is then stored in the Hadoop Distributed File System (HDFS).
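The map/reduce pattern described above can be sketched in plain Python; the records and values below are illustrative stand-ins, not rows from the actual dataset:

```python
from itertools import groupby
from operator import itemgetter

# Toy records: (city, sale price). Values are illustrative only.
sales = [("Pittsburgh", 100), ("Coraopolis", 80),
         ("Pittsburgh", 150), ("Coraopolis", 60)]

def map_fn(record):
    """Map: emit one intermediate key-value pair per input record."""
    city, price = record
    return (city, price)

def reduce_fn(key, values):
    """Reduce: combine all values sharing a key into one result."""
    return (key, sum(values))

# The shuffle/sort phase groups intermediate pairs by key,
# as the Hadoop framework would between map and reduce.
intermediate = sorted(map(map_fn, sales), key=itemgetter(0))
output = [reduce_fn(k, [v for _, v in group])
          for k, group in groupby(intermediate, key=itemgetter(0))]
print(output)  # [('Coraopolis', 140), ('Pittsburgh', 250)]
```

In a real Hadoop job, map and reduce run in parallel across cluster nodes; the in-memory sort here only mimics the framework's shuffle step.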
The aim is to identify trends from past years of property sales by analyzing the data with various distributed-framework techniques, so that users can later decide whether to sell a property and which time period is best for selling.
The dataset provides sales information for each day and month of the year, and it exhibits the three V's of Big Data: Volume, Variety, and Velocity. Therefore, the Hadoop ecosystem is used for storage, processing, and analysis. The information and knowledge gained from this data will help users make better decisions in the future about when to sell a property and which city to consider.
MapReduce is one framework for analyzing the data and answering more complex queries. In this project, Pig and Hive are two further frameworks used to analyze the data and obtain the desired output for each query.
The data is stored in a MySQL database and from there imported into HDFS using Sqoop. It is then loaded into Pig and Hive for analysis. The output is stored and visualized using R and Tableau. The following sections discuss related work, the methodology, and the results of the study.
II. RELATED WORK
MapReduce is the most common technique for large-scale data analysis, as it runs on clusters of commodity hardware [1]. Given the volume of data, it is important to process it within a limited time, which makes parallel processing essential [1]. Using a distributed file system and the MapReduce programming model, datasets of terabytes and petabytes can be processed with improved scalability, reliability, performance, and optimization. In paper [2], the MapReduce technique was used and shown to reduce the time for data access and loading by more than 50%.
The MapReduce programming model requires developers to write programs that are hard to maintain and reuse [3]. Hive, which is built on top of Hadoop, supports SQL-like queries. These queries are compiled into MapReduce jobs, which execute and produce the desired result. Facebook uses Hive for various products such as Facebook Ads. Before Hive, end users had to write a MapReduce program for each task; today Facebook runs jobs for many applications on its Hive cluster [3].
Pig was developed to simplify working with complex data and to perform MapReduce jobs without requiring knowledge of Java. Pig is a scripting platform where users write scripts in the Pig Latin language, which is SQL-like [4]. These scripts are fed into the Pig engine, which converts them into MapReduce jobs that are then run on the Hadoop cluster.
III. METHODOLOGY
A. Dataset Description
The dataset for this project was taken from https://catalog.data.gov/dataset/allegheny-county-property-sale-transactions. It is published by Allegheny County and the City of Pittsburgh and contains information on property sales in Allegheny County from 2012 to the present, including data from non-profit organizations, the public sector, and academic institutions.
The dataset contains 230,144 observations of 34 variables, with attributes such as parcel ID (Parid), property state, address, city, street name, municipality code, sale date, record date, sale code, sale type, and property price. The dataset is processed and then analyzed according to our needs. From it, we were able to find which property city is favorable to users and at what time it is suitable to sell a property. Analyzing this data will help the user make better decisions in the future.
B. Data Preprocessing
After examining the dataset, it was decided that the data needed pre-processing, which was done in R. Columns that were not relevant to the study were removed. A check for null and missing values in R found 15,968 missing values, which were cleaned. After cleaning, the dataset contained 214,176 observations of 22 variables and was saved to a new CSV file named PDA.
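The pre-processing was done in R; the same steps (dropping irrelevant columns, then rows with missing values) can be sketched in Python. The inline sample and column names below are illustrative, not the paper's actual schema:

```python
import csv
import io

# Tiny inline sample standing in for the raw CSV; columns are illustrative.
raw = """PARID,PROPERTYCITY,SALEDATE,PRICE,FAXNUMBER
A1,PITTSBURGH,2015-04-01,100000,555-0100
A2,CORAOPOLIS,2016-07-15,,555-0101
A3,PITTSBURGH,2017-01-20,85000,
"""

KEEP = ["PARID", "PROPERTYCITY", "SALEDATE", "PRICE"]  # relevant columns only

rows = list(csv.DictReader(io.StringIO(raw)))
# Project onto the relevant columns, then drop rows with any missing value.
cleaned = [{k: r[k] for k in KEEP} for r in rows]
cleaned = [r for r in cleaned if all(r[k] for k in KEEP)]
print(len(cleaned))  # 2: the row with a missing PRICE is removed
```

Filtering after the column projection matters: row A3 above survives because its only missing value was in a dropped, irrelevant column.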
C. Data Processing
1. Loading data into HDFS: To implement MapReduce, the data first needs to be loaded into Hadoop. A table named 'PDA' is created in a MySQL database, the data is loaded from the CSV file into that table, and the table is then imported into HDFS using Sqoop.
2. Loading data from MySQL to Hive: To perform MapReduce jobs through the Hive framework, the dataset is loaded into the MySQL database and then migrated to Hive using the Sqoop utility. In Hive, we do not need to create the table manually; it is created automatically during the import. Hive is preferred because its SQL-like queries are easy to write. The generated output is stored as CSV.
3. Loading data into Pig: To perform MapReduce jobs through the Pig framework, the dataset is loaded from a CSV file, and the outputs are stored in CSV files.
MapReduce Processing: Over the past few years, the Hadoop MapReduce framework has been used to handle large data for parallel processing and analysis [5]. It is scalable and fault tolerant. MapReduce has two functions, map and reduce: the map job takes the input and emits intermediate key-value pairs, and the reduce job takes the map output, combines it, and produces the final result.
IV. IMPLEMENTATION
The dataset was analyzed, and MapReduce jobs were performed through Hive and Pig.
• MapReduce Task 1
1. Property Sale Trends by Year and Quarter
The objective was to find property sale trends for the years 2015, 2016, and 2017. The dataset contains property sales for each day of each year. Daily sale transactions ranged from 1 to 356 in 2015, from 1 to 277 in 2016, and from 1 to 331 in 2017. Using these figures, each year was divided into quarters. A Hive query was applied to obtain these values, and the result was stored in a CSV file. The output helps the user analyze which quarter is favorable for selling a property.
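The quarterly grouping computed by the Hive query can be sketched in Python; the sale dates below are illustrative, not actual records:

```python
from collections import Counter
from datetime import date

# Illustrative sale dates standing in for the dataset's SALEDATE column.
sale_dates = [date(2015, 2, 10), date(2015, 5, 3), date(2015, 6, 21),
              date(2016, 4, 9), date(2016, 11, 30), date(2017, 5, 17)]

def quarter(d):
    """Map a sale date to its (year, quarter) bucket."""
    return (d.year, (d.month - 1) // 3 + 1)

# Count sales per (year, quarter), as the Hive GROUP BY would.
trend = Counter(quarter(d) for d in sale_dates)
print(sorted(trend.items()))
# [((2015, 1), 1), ((2015, 2), 2), ((2016, 2), 1), ((2016, 4), 1), ((2017, 2), 1)]
```

The `(month - 1) // 3 + 1` expression is the standard way to derive the quarter number, matching a HiveQL `floor((month(saledate)-1)/3)+1`.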
• MapReduce Task 2
2. Instrument Type for a City
The objective of Task 2 was to find the different types of instrument for a particular city and which instrument type contributes the most to that city. The dataset lists sales across many different cities. First, a query was applied to find which city has the most property sales; its output was then fed into a second query returning the description of the type of document used to record the real property transfer. The result was stored in a CSV file. Analyzing the output helps the user know the most common type of document used in that city, so the user knows in advance which document type a property transfer there is likely to use.
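The two chained queries of Task 2 can be sketched in Python; the (city, instrument) pairs below are illustrative only:

```python
from collections import Counter

# Illustrative (city, instrument type) pairs, not actual records.
sales = [("PITTSBURGH", "DEED"), ("PITTSBURGH", "DEED"),
         ("PITTSBURGH", "SHERIFF DEED"), ("CORAOPOLIS", "DEED")]

# First query: which city has the most property sales?
top_city = Counter(city for city, _ in sales).most_common(1)[0][0]

# Second query: instrument-type counts restricted to that city.
by_instrument = Counter(inst for city, inst in sales if city == top_city)
print(top_city, by_instrument.most_common(1))
# PITTSBURGH [('DEED', 2)]
```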
• MapReduce Task 3
3. Property Sale Total Price
The objective of Task 3 was to find the sum of the prices for each city; the price attribute describes the amount paid for a sale. The query was run in Pig, and the result was stored in HBase and later analyzed. From this query, the government and users can decide in which city the total amount paid for property sales has been the highest.
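The group-and-sum performed by the Pig query can be sketched in Python; the (city, price) records below are illustrative only:

```python
from collections import defaultdict

# Illustrative (city, sale price) records, not actual data.
sales = [("PITTSBURGH", 100000), ("CORAOPOLIS", 90000),
         ("PITTSBURGH", 150000), ("CORAOPOLIS", 40000)]

# GROUP BY city, SUM(price) -- the core of the Pig script.
total_by_city = defaultdict(int)
for city, price in sales:
    total_by_city[city] += price

# Rank cities by total amount paid, highest first.
ranked = sorted(total_by_city.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)  # [('PITTSBURGH', 250000), ('CORAOPOLIS', 130000)]
```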
• MapReduce Task 4
4. Instrument type with price
The objective of Task 4 was to find the sum of prices by instrument type. The price attribute describes the amount paid for a sale. The query for this MapReduce task was run in Pig.
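Task 4 follows the same group-and-sum pattern as Task 3, only keyed by instrument type; a minimal Python sketch with illustrative records:

```python
from collections import defaultdict

# Illustrative (instrument type, sale price) records, not actual data.
sales = [("DEED", 120000), ("SHERIFF DEED", 30000), ("DEED", 95000)]

# GROUP BY instrument type, SUM(price).
total_by_instrument = defaultdict(int)
for inst, price in sales:
    total_by_instrument[inst] += price

print(dict(total_by_instrument))  # {'DEED': 215000, 'SHERIFF DEED': 30000}
```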
V. RESULTS
The outputs generated in Hive and Pig are visualized in Tableau.
Output 1
Fig. 1 below shows the trends in property sales across the years 2015, 2016, and 2017, broken down by quarter. The trends show that Quarter 2 has the maximum property sales in each year. The table depicts the actual figures, and a further column is added when Tableau's forecast feature is used: it estimates the sales for Quarter 2 of 2018 based on the previous data fed into Tableau. This visualization helps the user decide in which quarter to sell a property.
Fig. 1 Property sales in different years
Output 2
Fig. 2 Type of instrument for a city
Figure 2 above shows which types of document are used to record real property transfers and which type tops a particular city. The visualization shows that a deed is the most common type of document used to record real property transfers for the city of Pittsburgh. In the future, the user will know, for a particular city, the most and least common document types used for property transfers.
Output 3
Fig. 3 Property sale total price by city
The figure above shows the different cities ranked by the total amount paid for sales. From the visualization, it is seen that Pittsburgh has the maximum total sale amount across properties; the amount here is the sum of all sale prices from 2012 to the present. The city with the second-highest total is Coraopolis. These results give insight into the cities, helping users decide in which city to buy property.
Output 4
Fig. 4 Instrument type with price
The figure above relates the type of document used to record the real property transfer to the price attribute. From the visualization, it can be deduced that a deed is the most common document type in most cities and that the amount paid for a sale is highest when the type of document is a deed.
VI. CONCLUSIONS
The large dataset chosen for this project was processed and analyzed using the Hadoop framework. It contains information about property sales, including a sale code that indicates whether a sale is valid or invalid. Pig and Hive were the two MapReduce environments used to process this large dataset.
The results will help users make better decisions related to sales, and will help the government take decisions accordingly.
In the future, more analysis can be performed on this dataset, as the data is updated in real time.
VII. REFERENCES
[1] Maitrey, S. and Jha, C.K., 2015, February.
Handling big data efficiently by using map
reduce technique. In Computational Intelligence
& Communication Technology (CICT), 2015
IEEE International Conference on (pp. 703-
708). IEEE.
[2] Panda, B., Herbach, J.S., Basu, S. and Bayardo,
R.J., 2009. Planet: massively parallel learning of
tree ensembles with mapreduce. Proceedings of
the VLDB Endowment, 2(2), pp.1426-1437.
[3] Thusoo, A., Sarma, J.S., Jain, N., Shao, Z.,
Chakka, P., Zhang, N., Antony, S., Liu, H. and
Murthy, R., 2010, March. Hive-a petabyte scale
data warehouse using hadoop. In Data
Engineering (ICDE), 2010 IEEE 26th
International Conference on (pp. 996-1005).
IEEE.
[4] Olston, C., Reed, B., Srivastava, U., Kumar, R.
and Tomkins, A., 2008, June. Pig latin: a not-so-
foreign language for data processing.
In Proceedings of the 2008 ACM SIGMOD
international conference on Management of
data (pp. 1099-1110). ACM.
[5] DeWitt, D. and Stonebraker, M., 2008.
MapReduce: A major step backwards. The
Database Column, 1, p.23.