Property Sales Big Data Analytics using MapReduce
Frameworks
Akshay Kumar Bhushan
x17155878, MSC DA Group-A
School of Computing
Dublin 1, Dublin
x17155878@student.ncirl.ie
Abstract- Over the last decade, the amount of data being generated has
risen steadily, and with each passing day both its volume and its
velocity continue to accelerate. This data therefore needs to be stored
and processed. The dataset used for this paper covers property sales in
Allegheny County, Pennsylvania, USA, from 2012 to the present. The data
is large and needs to be stored, processed, and then analyzed, so the
Hadoop ecosystem is used. The data is first loaded into MySQL and
imported into HDFS via Sqoop. It is then moved into Pig and Hive, which
analyze the data and generate the outputs. These outputs are visualized
and reported using visualization tools.
Keywords: Big Data, Hadoop, Pig, Hive, Sqoop
I. INTRODUCTION
Property sales happen everywhere, and many factors are usually taken
into account in a property sale decision. For our dataset of property
sales, factors such as the property city, the type of document, and
whether the sale is valid or invalid are taken into account. Since the
dataset is vast, it needs to be processed in parallel. Parallel
processing is done using MapReduce, a technique for storing and
processing data across clusters. In MapReduce, the map function takes
the input and transforms it into key-value pairs, while the reduce
function takes the output of the map function and combines it. The
combined result is the output, which is later stored in the Hadoop
Distributed File System (HDFS).
The motive is to identify influential factors in the dataset using
various distributed-framework techniques, by analyzing the data and
deriving trends from past years of property sales, so that in the
future users can use these trends to decide whether to sell a property
and which time period is best for a sale.
The dataset gives information on sales throughout each year, by month
and day, and exhibits the three V's of Big Data: Volume, Variety, and
Velocity. The Hadoop ecosystem is therefore used for storage,
processing, and analysis. The information and knowledge gained from
this data will help users make better decisions in the future about
when to sell a property and which city to consider.
MapReduce is one framework for analyzing the data and answering more
complex queries. In this project, Pig and Hive are two such frameworks
that have been used to analyze the data and obtain the desired output
for each query.
The data is stored in a MySQL database and from there imported into
HDFS using Sqoop. It is then loaded into Pig and Hive for analysis. The
output is stored and visualized using R and Tableau.
The following sections discuss related work, the methodology, and the
various results of the study.
II. RELATED WORK
MapReduce is the most common technique for large-scale data analysis,
as it runs on clusters of commodity hardware [1]. Given the volume of
data, it is important to process it within a limited time, which makes
parallel processing essential [1]. Using a distributed file system and
the MapReduce programming model, datasets of terabytes and petabytes
can be processed with improved scalability, reliability, performance,
and optimization. Paper [2] used the MapReduce technique and showed
that the MapReduce tool reduced the time of data access and loading by
more than 50%.
The MapReduce programming model requires developers to write programs
that are hard to maintain and reuse [3]. Hive, which is built on top of
Hadoop, supports SQL-like queries. These queries are compiled into
MapReduce jobs, which execute and return the desired result. Facebook
uses Hive for various products such as Facebook Ads; before Hive, end
users had to write a MapReduce program for each task, whereas today
Facebook runs jobs for various applications on a Hive cluster [3].
Pig was developed to simplify work with complex data and to run
MapReduce jobs without requiring knowledge of Java. Pig is a scripting
environment in which users write scripts in the Pig Latin language [4].
These scripts are SQL-like; they are fed to the Pig engine, which
converts them into MapReduce jobs that are then run on the Hadoop
cluster.
III. METHODOLOGY
A. Dataset Description
The dataset for this project was taken from
https://catalog.data.gov/dataset/allegheny-county-property-sale-transactions.
It is published by Allegheny County and the City of Pittsburgh and
contains information on property sales in Allegheny County from 2012 to
the present, covering the non-profit, public, and academic sectors.
The dataset contains 230,144 observations of 34 variables, with
attributes such as parcel ID (Parid), property state, address, city,
street name, municipality code, sale date, record date, sale code, sale
type, and property price. The dataset is processed and then analyzed
according to the needs of the study. From it, we were able to find
which property city is favorable to users and what time is suitable for
selling property; analyzing this data will help users make better
decisions in the future.
B. Data Preprocessing
After examining the dataset, it was decided that the data needed
preprocessing. The data was pre-processed in R: columns that were not
relevant to the study were removed, and code checking for null and
missing values found 15,968 missing values in the dataset. These
missing values were removed. After cleaning, the dataset had 214,176
observations of 22 variables and was written to a new CSV file named
PDA.
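A minimal R sketch of this preprocessing follows. The input file name
and the dropped columns are assumptions for illustration; the paper
does not list them.

# Read the raw export (file name assumed for illustration)
sales <- read.csv("property_sales.csv", stringsAsFactors = FALSE)

# Drop columns that are not relevant to the study (hypothetical names)
drop_cols <- c("ADDRESS_DIR", "UNITDESC", "SALECODE_DESC")
sales <- sales[, !(names(sales) %in% drop_cols)]

# Count missing values across the whole data frame (15,968 in the paper)
sum(is.na(sales))

# Remove rows containing any missing value, then write the cleaned CSV
sales <- na.omit(sales)
write.csv(sales, "PDA.csv", row.names = FALSE)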
C. Data Processing
1. Loading Data into HDFS: To implement MapReduce, the data first
needs to be loaded into Hadoop. A table named 'PDA' is created in a
MySQL database, the data is loaded into it from the CSV file, and the
table is then imported into HDFS using Sqoop.
2. Loading Data from MySQL to Hive: To perform MapReduce jobs through
the Hive framework, the dataset already loaded in MySQL is migrated to
Hive using the Sqoop utility. In Hive we do not need to create the
table; it is created automatically during the import. Hive is preferred
because its SQL-like queries are easy to write. The generated output is
stored as CSV.
3. Loading Data into Pig: To perform MapReduce jobs through the Pig
framework, the dataset is loaded from a CSV file and the outputs are
stored in CSV files.
A sketch of these loading steps is given after this list.
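The three loading steps can be sketched as follows. The database name,
credentials, paths, and abbreviated schema are assumptions for
illustration, not taken from the paper.

-- MySQL: create the table and load the cleaned CSV (schema abbreviated)
CREATE DATABASE IF NOT EXISTS propertydb;
USE propertydb;
CREATE TABLE PDA (parid VARCHAR(30), propertycity VARCHAR(50),
                  saledate DATE, instrtypdesc VARCHAR(60), price BIGINT);
LOAD DATA LOCAL INFILE 'PDA.csv' INTO TABLE PDA
  FIELDS TERMINATED BY ',' IGNORE 1 LINES;

# Sqoop: import the MySQL table into HDFS (step 1)
sqoop import --connect jdbc:mysql://localhost/propertydb \
  --username root -P --table PDA --target-dir /user/hadoop/pda -m 1

# Sqoop: import straight into Hive, which creates the Hive table itself (step 2)
sqoop import --connect jdbc:mysql://localhost/propertydb \
  --username root -P --table PDA --hive-import -m 1

-- Pig: load the CSV for the Pig jobs (step 3)
pda = LOAD '/user/hadoop/PDA.csv' USING PigStorage(',')
      AS (parid:chararray, propertycity:chararray,
          saledate:chararray, instrtypdesc:chararray, price:long);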
MapReduce Processing: Over the past few years, the Hadoop MapReduce
framework has been used to handle large data for parallel processing
and analysis [5]. It is scalable and fault tolerant. MapReduce has two
functions, map and reduce: the map job takes the input and transforms
it into key-value pairs, and the reduce job takes the map output,
combines the values for each key, and produces the final result.
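This project runs its MapReduce jobs through Hive and Pig rather than
hand-written code, but the model described above can be illustrated
with a minimal Java job. The sketch below counts sales per city; the
CSV input format and the column position of the city field are
assumptions for illustration, not the project's actual code.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SalesByCity {

  // Map: emit (city, 1) for every sale record; city assumed in column 2
  public static class SaleMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split(",");
      if (fields.length > 1) {
        ctx.write(new Text(fields[1]), ONE);
      }
    }
  }

  // Reduce: combine the key-value pairs into one total per city
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text city, Iterable<IntWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      int total = 0;
      for (IntWritable c : counts) {
        total += c.get();
      }
      ctx.write(city, new IntWritable(total));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "sales by city");
    job.setJarByClass(SalesByCity.class);
    job.setMapperClass(SaleMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}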
IV. IMPLEMENTATION
The dataset was analyzed and MapReduce jobs were performed through Hive
and Pig as follows.
• MapReduce Task 1
1. Property Sale Trends by Year and Quarter
The objective was to find property sale trends for the years 2015,
2016, and 2017. The dataset contains information on property sales for
each day of each year. The number of sale transactions on a single day
ranged from 1 to 356 in 2015, from 1 to 277 in 2016, and from 1 to 331
in 2017. Each year was then divided into quarters, and a query was
applied in Hive to obtain the quarterly values; the result was stored
in a CSV file (a sketch is given below). The output helps the user
analyze which quarter is favorable for selling property.
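The paper does not reproduce its queries; a minimal HiveQL sketch of
this task, assuming the Hive table is named pda and the sale date
column saledate is stored in Hive's yyyy-MM-dd format, might look as
follows.

-- Quarterly property sale counts for 2015-2017 (table/column names assumed)
SELECT year(saledate) AS sale_year,
       ceil(month(saledate) / 3.0) AS sale_quarter,
       COUNT(*) AS total_sales
FROM pda
WHERE year(saledate) IN (2015, 2016, 2017)
GROUP BY year(saledate), ceil(month(saledate) / 3.0)
ORDER BY sale_year, sale_quarter;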
• MapReduce Task 2
2. Instrument Type for a City
The objective of this task was to find the different instrument types
for a particular city and which instrument type contributes the most to
that city. The dataset records a city name for each sale. A first query
was applied to find which city has the most property sales, and its
output was fed into a second query that breaks sales down by the
description of the type of document used to record the real property
transfer (sketched below). The result was stored in a CSV file.
Analyzing this output helps the user know in advance the most common
type of document used in that city, and hence the likely document type
for a property sale there.
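A hedged HiveQL sketch of the two-step query, with the same assumed
table and illustrative column names propertycity and instrtypdesc:

-- Step 1: find the city with the most property sales
SELECT propertycity, COUNT(*) AS total_sales
FROM pda
GROUP BY propertycity
ORDER BY total_sales DESC
LIMIT 1;

-- Step 2: break down sales by document (instrument) type for that city
SELECT instrtypdesc, COUNT(*) AS total_sales
FROM pda
WHERE propertycity = 'PITTSBURGH'   -- the city returned by step 1
GROUP BY instrtypdesc
ORDER BY total_sales DESC;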
• MapReduce Task 3
3. Property Sale Total Price
The objective of this task was to find the sum of the sale price for
each city. The price attribute records the amount paid in the sale. The
query was written in Pig (sketched below), and the result was stored in
HBase and analyzed later. From this query, the government and users can
see in which city the total amount paid for property sales has been the
highest.
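A minimal Pig Latin sketch of this task, assuming the column layout
used above and an existing HBase table 'sales' with column family 'p'
(all names illustrative):

pda    = LOAD '/user/hadoop/PDA.csv' USING PigStorage(',')
         AS (parid:chararray, propertycity:chararray,
             saledate:chararray, instrtypdesc:chararray, price:long);
bycity = GROUP pda BY propertycity;
totals = FOREACH bycity GENERATE group AS propertycity,
                                 SUM(pda.price) AS total_price;
-- the first field (propertycity) becomes the HBase row key
STORE totals INTO 'hbase://sales'
  USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('p:total_price');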
• MapReduce Task 4
4. Instrument Type with Price
The objective of this task was to find the sum of the price by
instrument type. The price attribute records the amount paid in the
sale. The query for this MapReduce task was also written in Pig
(sketched below).
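A corresponding Pig Latin sketch, grouping on the instrument type
instead of the city (same assumed schema; output path illustrative):

pda    = LOAD '/user/hadoop/PDA.csv' USING PigStorage(',')
         AS (parid:chararray, propertycity:chararray,
             saledate:chararray, instrtypdesc:chararray, price:long);
bytype = GROUP pda BY instrtypdesc;
sums   = FOREACH bytype GENERATE group AS instrtypdesc,
                                 SUM(pda.price) AS total_price;
STORE sums INTO '/user/hadoop/output/instr_price' USING PigStorage(',');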
V. RESULTS
The outputs generated in Hive and Pig are visualized in Tableau.
Output 1
Fig. 1 below shows the trends in property sales across the years 2015,
2016, and 2017, broken down by quarter. The trends show that Quarter 2
had the maximum property sales in each year. The chart depicts the
actual figures, and an additional column appears when Tableau's
forecast feature is used: it shows the estimated sales for Quarter 2 of
2018 based on the previous data fed into Tableau. This visualization
helps the user decide in which quarter to sell a property.
Fig. 1 Property sales in different years
Output 2
Fig. 2 Type of instrument for a city
Figure 2 above shows which types of document are used to record real
property transfers and which document type tops the list for a
particular city. The visualization shows that the deed is the most
common type of document used to record real property transfers in the
city of Pittsburgh. The user can thus know, for a particular city, the
most and the least common document types used for property transfers.
Output 3
Fig. 3 Total sale amount by city
The figure above shows the different cities by the total amount paid in
sales. The visualization shows that Pittsburgh has the maximum total
sale amount across properties; the amount here is the sum of the prices
paid in all years from 2012 to the present. The city with the
second-highest total is Coraopolis. This output gives an inside view of
the cities, helping users decide in which city to buy property.
Fig. 4 Instrument type with price
The figure above shows the document types used to record real property
transfers together with the price attribute. From the visualization it
can be deduced that the deed is the most common document type in most
cities, and that the amount paid for a sale is highest when the
document type is a deed.
VI. CONCLUSIONS
The large dataset chosen for this project was processed and analyzed
using the Hadoop framework. The dataset contains information about
property sales, including a sale code that describes whether a sale is
valid or invalid. Pig and Hive were the two MapReduce environments used
to process this large dataset.
The results will help users make better decisions related to sales, and
will help the government take decisions accordingly.
In the future, more analysis can be performed on this dataset, as the
data is updated in real time.
VII. REFERENCES
[1] Maitrey, S. and Jha, C.K., 2015. Handling big data efficiently by
using MapReduce technique. In 2015 IEEE International Conference on
Computational Intelligence & Communication Technology (CICT),
pp. 703-708. IEEE.
[2] Panda, B., Herbach, J.S., Basu, S. and Bayardo, R.J., 2009. PLANET:
massively parallel learning of tree ensembles with MapReduce.
Proceedings of the VLDB Endowment, 2(2), pp. 1426-1437.
[3] Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Zhang, N.,
Antony, S., Liu, H. and Murthy, R., 2010. Hive - a petabyte scale data
warehouse using Hadoop. In 2010 IEEE 26th International Conference on
Data Engineering (ICDE), pp. 996-1005. IEEE.
[4] Olston, C., Reed, B., Srivastava, U., Kumar, R. and Tomkins, A.,
2008. Pig Latin: a not-so-foreign language for data processing. In
Proceedings of the 2008 ACM SIGMOD International Conference on
Management of Data, pp. 1099-1110. ACM.
[5] DeWitt, D. and Stonebraker, M., 2008. MapReduce: a major step
backwards. The Database Column, 1, p. 23.