M.Tech Student Research on Apache Hadoop Projects and Application of Mahout for Data Clustering

Abhijit Kumar Behera
M.Tech (CSE)
Roll No. 1350001
School of Computer Engineering
Guided by: Dr. Laxman Sahoo

Contents
 Introduction
 Apache Hadoop related projects
 Application of Mahout
 Literature Survey
 Plan of Action
 Conclusion
 References

Introduction
•K-means is one of the best-known clustering algorithms and has been
applied to a wide variety of problems.
•MapReduce, one of the most popular parallel frameworks for cloud
computing, handles massive data effectively, so K-means clustering
based on MapReduce has become a focus of research.

Components of Hadoop
HDFS
•Name Node
•Data Node
•Secondary Name Node
MapReduce
•Map()
•Combine()
•Reduce()
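YARN
•Resource Manager (takes over cluster-wide scheduling from the MapReduce 1 Job Tracker)
•Node Manager (takes over per-node task execution from the MapReduce 1 Task Tracker)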
HBase
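
The Map(), Combine() and Reduce() stages listed above can be illustrated with a small word count. This is a minimal in-memory sketch in plain Python, not Hadoop code: the function names map_fn, combine_fn, reduce_fn and run_job, and the sample input, are assumptions made for illustration only.

```python
from collections import defaultdict
from itertools import chain


def map_fn(line):
    """Map(): emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def combine_fn(pairs):
    """Combine(): pre-aggregate the pairs produced from one input split,
    so fewer pairs have to be shuffled to the reducers."""
    local = defaultdict(int)
    for word, count in pairs:
        local[word] += count
    return list(local.items())


def reduce_fn(word, counts):
    """Reduce(): sum all partial counts received for one word."""
    return word, sum(counts)


def run_job(splits):
    """Simulate one job: map and combine per split, shuffle by key, reduce."""
    combined = []
    for split in splits:
        mapped = chain.from_iterable(map_fn(line) for line in split)
        combined.extend(combine_fn(mapped))

    # Shuffle: group the partial counts by word.
    grouped = defaultdict(list)
    for word, count in combined:
        grouped[word].append(count)

    return dict(reduce_fn(word, counts) for word, counts in grouped.items())


# Hypothetical input: two splits, as if stored on two data nodes.
splits = [
    ["hadoop stores data", "hadoop processes data"],
    ["mahout runs on hadoop"],
]
result = run_job(splits)
print(result)
```

In a real cluster each split would be processed on the node that holds it; here the loop over splits only stands in for that parallelism.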

Apache Hadoop Projects
•Hadoop (HDFS and MapReduce)
•HBase
•Mahout
•Spark
•Hive
•ZooKeeper
•Sqoop
•Pig

Application of Mahout
 Collaborative Filtering
•Matrix factorization based recommenders
•User-based recommender
 Clustering
•Canopy Clustering
•K-Means Clustering
•Fuzzy K-Means
•Affinity Propagation Clustering
 Classification
•Naive Bayes
•Random forest classifier

Literature Survey
An Improved Parallel K-means Clustering Algorithm with MapReduce
Authors: Qing Liao, Fan Yang, Jingming Zhao
Journal: Communication Technology (ICCT), IEEE
Year of Publication: 2014
Parallel K-means Algorithm
1) Initial
2) Mapper
3) Reducer
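
The three steps above can be sketched as follows. This is a generic, in-memory illustration of K-means expressed as map and reduce phases, not the improved algorithm from the surveyed paper, whose details are not given here. The function names, the sample points and the starting centroids are all assumptions; math.dist needs Python 3.8 or later.

```python
import math
from collections import defaultdict


def nearest(point, centroids):
    """Return the index of the centroid closest to the point."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))


def mapper(points, centroids):
    """Mapper: emit (cluster id, (point, 1)) for every point in one split."""
    for point in points:
        yield nearest(point, centroids), (point, 1)


def reducer(cluster_id, values):
    """Reducer: average all points assigned to one cluster."""
    total, n = None, 0
    for point, count in values:
        total = list(point) if total is None else [a + b for a, b in zip(total, point)]
        n += count
    return cluster_id, tuple(x / n for x in total)


def kmeans(splits, initial_centroids, max_iter=20, tol=1e-9):
    """Initial step: start from the given centroids, then repeat
    map and reduce until no centroid moves by more than tol."""
    centroids = list(initial_centroids)
    for _ in range(max_iter):
        grouped = defaultdict(list)
        for split in splits:                      # one mapper per split
            for cluster_id, value in mapper(split, centroids):
                grouped[cluster_id].append(value)

        new_centroids = list(centroids)           # empty clusters keep their centroid
        for cluster_id, values in grouped.items():
            new_centroids[cluster_id] = reducer(cluster_id, values)[1]

        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:
            break
    return centroids


# Hypothetical data: two splits and two hand-picked starting centroids.
splits = [[(1.0, 1.0), (1.0, 2.0)], [(9.0, 9.0), (9.0, 10.0)]]
final_centroids = kmeans(splits, [(1.0, 1.0), (9.0, 9.0)])
print(final_centroids)
```

Each pass over the data corresponds to one MapReduce job, so the number of iterations directly determines how many jobs a cluster would have to run.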

Literature Survey
Clouds for Scalable Big Data Analytics
Author: Domenico Talia
Journal: IEEE Computer Society
Year of Publication: 2013
In this paper, the author describes how cloud computing enhances the development and
functionality of big data analytics when analytics is deployed in the cloud.
Cloud service models for data analytics (features and users)
Data analytics software as a service
•Features: A single and complete data mining application or task (including data sources) offered as a service
•Users: End users, analytics managers, data analysts
Data analytics platform as a service
•Features: A data analysis suite or framework for programming or developing high-level applications, hiding the cloud infrastructure and data storage
•Users: Data mining application developers, data scientists
Data analytics infrastructure as a service
•Features: A set of virtualized resources provided to a programmer or data mining researcher for developing, configuring, and running data analysis frameworks or applications
•Users: Data mining programmers, data management developers, data mining researchers

Plan of Action
•August - October 2014: Literature survey (done)
•November 2014: Problem definition formulated (done); problem-solving outline (yet to be done)
•December 2014 - January 2015: Finding an appropriate solution to the problem (yet to be formulated)
•February - May 2015: Final implementation of the solution, with results (yet to be done)

Conclusion
Large-scale data mining has become a new challenge in recent years.
Big data analytics can be accomplished using the MapReduce framework.
K-means is one of the best-known clustering algorithms, but its
processing performance usually hits a bottleneck when it is applied to
massive data. A parallel K-means algorithm implemented with MapReduce
shows a clear advantage in handling massive data.

References
[1] Walisa Romsaiyud, Wichian Premchaiswadi, "An Adaptive Machine Learning on Map-Reduce Framework for Improving Performance of Large-Scale Data Analysis on EC", Eleventh IEEE Int'l Conf. on ICT and Knowledge Engineering, 2014.
[2] Domenico Talia, "Clouds for Scalable Big Data Analytics", IEEE Computer Society, 2013.
[3] Feng Ye, Zhijan Wang, "Cloud-based Big Data Mining & Analyzing Services Platform integrating R", IEEE International Conference on Advanced Cloud and Big Data, 2013.
[4] "Apache Hadoop", http://hadoop.apache.org/#What+Is+Apache+Hadoop%3F
