Large Scale Hierarchical
Text Classification
using Hadoop MapReduce
Hammad Haleem (10-css-25)
Pankaj Kumar Sharma (10-css-46)
Department of Computer Engineering
Jamia Millia Islamia University
New Delhi, India
Table of Contents
S.No. Topic
1 Objective
2 Hierarchical Text Classification (HTC) and dataset
3 About Our Technique
4 About Hadoop and MapReduce
5 Progress Reporting
6 References
OBJECTIVE
To develop and deploy a cost-effective, near-real-time, distributed
hierarchical text document classification system that can classify
documents within a huge hierarchy of categories in real time.
What is Hierarchical Text Classification ?
▪ Documents are said to follow a hierarchical
classification if:
– a single document may be present in more than one category, and
– the categories are themselves arranged in a hierarchy, so a single
category can contain multiple documents and even multiple
subcategories.
Datasets Used
Reuters Corpus (RCV1)
• Raw data from news articles.
• 806,791 documents in 1,000+ categories.
Wikipedia Dataset
• The data is in SVM format and requires very little
preprocessing.
• The dataset has 2,400,000 documents in 325,000
categories.
OUR APPROACH - DETAILED DESCRIPTION OF THE ALGORITHM
We divided development into two phases:
1. Initial environment setup, during which we wrote helper functions
for use in later development.
2. Using these functions to perform training and testing on the
actual data.
The sections ahead describe in detail the steps performed in each
phase of the project.
WE DEVELOPED THE FOLLOWING FUNCTIONS : INITIAL PHASE
These methods are used frequently throughout the project, so we discuss them here:
● Train (TrainingDocumentSet D, CategoryTree C)
○ TrainingDocumentSet D is the set of training documents (i.e., already marked with their corresponding
categories).
○ CategoryTree C will contain a list of Categories, the parent-child relation between them.
● Classify (Document d, CategoryTree C, TrainedClassifier TC)
○ Document d would be the document to be classified.
○ CategoryTree C will contain a list of categories, the hierarchical parent-child relation between them.
○ TrainedClassifier TC is the output of the Train API call.
● Tf-Idf-Calculator (DocumentSet D)
○ DocumentSet D is the set of Text Documents for which we’ve to calculate Tf-Idf weights.
● CosineDistance(Document d1, Document d2)
○ will return the cosine distance between documents d1 and d2.
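The CosineDistance helper can be sketched in Python as follows. This is only an illustrative bag-of-words version: the tokenisation, function name, and text-based document representation are our assumptions, not the project's actual code.

```python
from collections import Counter
import math

def cosine_distance(d1: str, d2: str) -> float:
    """Cosine distance (1 - cosine similarity) between two text documents,
    represented as term-frequency vectors over whitespace tokens."""
    v1, v2 = Counter(d1.lower().split()), Counter(d2.lower().split())
    # Dot product over the shared vocabulary only.
    dot = sum(v1[w] * v2[w] for w in v1.keys() & v2.keys())
    n1 = math.sqrt(sum(c * c for c in v1.values()))
    n2 = math.sqrt(sum(c * c for c in v2.values()))
    if n1 == 0 or n2 == 0:
        return 1.0  # an empty document shares nothing with anything
    return 1.0 - dot / (n1 * n2)
```

Identical documents get distance 0 and documents with no shared words get distance 1, which is what the k-NN step of the classifier relies on.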
HIGH LEVEL VIEW OF THE WHOLE CLASSIFICATION ALGORITHM.
DEEP HIERARCHICAL CLASSIFICATION : TRAINING PHASE
▪ The following diagram gives an overview of the training technique.
▪ The training follows a lazy-learning technique and is quite efficient for this
problem.
DEEP HIERARCHICAL CLASSIFICATION : TRAINING PHASE
DCLTH is a lazy-learning algorithm, so the training phase is quite
straightforward. First, all training documents undergo preprocessing, which
mostly performs the following activities:
1. Removal of HTML tags and other noisy data from the documents.
2. Conversion of special formats such as DOC and PDF to a simple word-based format.
3. Stemming, i.e., converting words to their corresponding root words
(we use Porter stemming).
4. Stop-word removal, i.e., removing common words like “a”, “the”, etc.
5. Representation of documents in libSVM format.
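The text-cleaning steps above (HTML stripping, stemming, stop-word removal) can be sketched as follows. Note that `crude_stem` is only a stand-in for the Porter stemmer, and the stop-word set is an illustrative subset; neither reflects the project's real implementation.

```python
import re

STOP_WORDS = {"a", "an", "the", "is", "of", "to", "and"}  # illustrative subset

def strip_html(text: str) -> str:
    """Step 1: remove HTML tags (crude regex approach)."""
    return re.sub(r"<[^>]+>", " ", text)

def crude_stem(word: str) -> str:
    """Step 3 stand-in: strip a few common suffixes instead of full Porter."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text: str) -> list[str]:
    """Steps 1, 3 and 4 chained: clean, tokenise, stem, drop stop words."""
    tokens = re.findall(r"[a-z]+", strip_html(text).lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]
```

The output token lists are what would then be encoded into libSVM format (step 5) for the later stages.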
DEEP HIERARCHICAL CLASSIFICATION : TRAINING PHASE
After preprocessing, the document set is passed through the Tf-Idf calculator, and the
Tf-Idf value of each word in each document is found. We use the following definitions of
Tf, Idf and Tf-Idf:
Tf(w, d) = f(w, d) / sum(d)
where f(w, d) is the frequency of w in d and sum(d) is the sum of the frequencies of all
unique words in d.
Idf(w, d, D) = log( N / (1 + n) )
where N is the number of documents in set D and n is the number of documents in which w
appears; 1 is added to avoid division by zero. Finally,
Tf-Idf(w, d, D) = Tf(w, d) * Idf(w, d, D)
This information is stored for later use, which completes the training phase of DCLTH.
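The three formulas above translate directly into Python; this minimal sketch assumes documents are already tokenised into word lists (the function names are ours):

```python
import math
from collections import Counter

def tf(word, doc):
    """Tf(w, d) = f(w, d) / sum(d); doc is a list of tokens."""
    counts = Counter(doc)
    return counts[word] / sum(counts.values())

def idf(word, corpus):
    """Idf(w, d, D) = log(N / (1 + n)); corpus is a list of token lists.
    The +1 in the denominator avoids division by zero."""
    n = sum(1 for doc in corpus if word in doc)
    return math.log(len(corpus) / (1 + n))

def tf_idf(word, doc, corpus):
    """Tf-Idf(w, d, D) = Tf(w, d) * Idf(w, d, D)."""
    return tf(word, doc) * idf(word, corpus)
```

Note that with this exact definition, a word appearing in every document gets a slightly negative Idf because of the +1 smoothing; the relative ordering of weights is what matters for the classifier.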
DEEP HIERARCHICAL CLASSIFICATION : TESTING PHASE
❏ The novel Deep Classification approach to Hierarchical Text
Classification performs better than other classification algorithms
because, instead of applying classification directly over all the
categories, it first divides the categories into two sets, namely
related and unrelated categories, and then uses the hierarchical
information of the related categories in a top-down approach to
classify the document.
❏ The testing process generally has two phases.
DEEP HIERARCHICAL CLASSIFICATION : TESTING PHASE
LEVEL 1
The aim of this stage is to divide the large category set into two subsets, namely “related” and
“unrelated”. For any document d, the number of related categories is much smaller than the number of
unrelated categories. The point worth noting is that, from a huge set of categories, only a
small number remain on which the real classification takes place. This phase works as
follows:
1. The given document is preprocessed (in the same fashion as explained in the “Training” stage).
2. The processed document is then used to find the k most similar training documents, using
cosine distance and the k-NN algorithm.
3. From these k documents we obtain the set of related categories (called candidate categories).
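Steps 2 and 3 can be sketched as follows. The representation of training documents as (token-list, category) pairs and the function names are illustrative assumptions:

```python
import math
from collections import Counter

def cosine_sim(v1, v2):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(v1[w] * v2[w] for w in v1.keys() & v2.keys())
    n1 = math.sqrt(sum(c * c for c in v1.values()))
    n2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def candidate_categories(query_tokens, training_set, k=3):
    """k-NN over cosine similarity.

    training_set: list of (token_list, category) pairs.
    Returns the set of categories of the k most similar documents."""
    query = Counter(query_tokens)
    ranked = sorted(training_set,
                    key=lambda item: cosine_sim(query, Counter(item[0])),
                    reverse=True)
    return {category for _, category in ranked[:k]}
```

The returned set is typically tiny compared with the full hierarchy, which is exactly what makes the Level 2 classification tractable.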
DEEP HIERARCHICAL CLASSIFICATION : TESTING PHASE
LEVEL 2
The aim of the second stage is to take the candidate categories and the processed document (both obtained
from the first stage) and obtain the most probable categories from them.
This level works as follows:
1. From the given candidate categories, a pruned tree is formed.
An example of such a tree is given in the following diagram.
2. Starting from the root of this pruned tree, we apply a standard Naive Bayes classifier to prune
the tree further.
3. The algorithm stops when we reach the bottom-most level.
4. The categories remaining in the finally pruned tree are returned as the final categories.
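The top-down descent of steps 2 and 3 can be sketched on a toy pruned tree. Here the tree is a parent-to-children dict and each category has a word-count model; the Laplace smoothing and single-path descent are our simplifying assumptions, not details from the slides.

```python
import math
from collections import Counter

def nb_log_score(tokens, cat_counts, vocab_size):
    """Naive Bayes log-likelihood of the tokens under one category's
    word-count model, with add-one (Laplace) smoothing."""
    total = sum(cat_counts.values())
    return sum(math.log((cat_counts[t] + 1) / (total + vocab_size))
               for t in tokens)

def classify_top_down(tokens, tree, models, root="ROOT"):
    """Descend the pruned tree from the root, keeping the best-scoring
    child at each level, until a leaf category is reached."""
    vocab = {w for model in models.values() for w in model}
    node = root
    while tree.get(node):  # while the current node still has children
        node = max(tree[node],
                   key=lambda c: nb_log_score(tokens, models[c], len(vocab)))
    return node
```

A real implementation could keep several children per level to return multiple final categories; this sketch keeps one path for clarity.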
DID WE SAY “DISTRIBUTED” ?
❏ Most of the techniques under the above two approaches to HTC are
linear and are tested on a single system, not on any cluster.
❏ This project aims to:
❏ Deploy one of these popular approaches to run in parallel on a
Hadoop cluster,
❏ Develop our own approach, optimised to perform best in parallel.
Distributed Processing and Hadoop
▪ The Apache Hadoop software library is a framework that allows for
the distributed processing of large data sets across clusters of
computers using simple programming models.
▪ It is designed to scale up from single servers to thousands of
machines, each offering local computation and storage.
▪ Rather than relying on hardware to deliver high availability, the library
itself is designed to detect and handle failures at the application layer,
thus delivering a highly available service on top of a cluster of
computers, each of which may be prone to failure.
More from our Hadoop Cluster
1. We showcase some screenshots from the Hadoop
cluster in the next few slides:
a. From various panels of the Hadoop platform
i. NodeManager GUI
ii. DFS GUI
iii. JobTracker GUI
What is MapReduce and how did Hadoop help us ?
▪ MapReduce is a programming model for processing large data sets with
a parallel, distributed algorithm on a cluster.
▪ The model consists of two steps:
– "Map" step: The master node takes the input, divides it into smaller sub-problems, and
distributes them to worker nodes. A worker node may do this again in turn, leading to a
multi-level tree structure. The worker node processes the smaller problem, and passes
the answer back to its master node.
– "Reduce" step: The master node then collects the answers to all the subproblems and
combines them in some way to form the output – the answer to the problem it was
originally trying to solve.
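The two steps can be illustrated with the classic word-count example, simulated here in plain Python rather than on an actual Hadoop cluster (on Hadoop, the shuffle/group-by-key step between map and reduce is handled by the framework):

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc):
    # "Map" step: a worker turns its input split into (key, value) pairs.
    return [(word, 1) for word in doc.split()]

def reduce_phase(word, counts):
    # "Reduce" step: combine all partial values for one key.
    return word, sum(counts)

def map_reduce(docs):
    """Run map over every document, group the emitted pairs by key
    (the shuffle), then reduce each group to a final answer."""
    shuffled = defaultdict(list)
    for key, value in chain.from_iterable(map_phase(d) for d in docs):
        shuffled[key].append(value)
    return dict(reduce_phase(k, v) for k, v in shuffled.items())
```

In our classifier, the same pattern distributes the Tf-Idf calculation and the k-NN similarity scoring across the cluster, with documents as the map inputs.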
PROGRESS REPORT
▪ Setup of the cluster
▪ Analysis of various algorithms for training and testing
▪ Training on the dataset
▪ Testing on the dataset
▪ Implementing the whole classifier without MapReduce
▪ Implementing the classifier with MapReduce
References
1. www.kaggle.com/c/lshtc [dataset and problem description]
2. Distributed hierarchical text classification framework, US Patent US 7809723 B2.
3. www.about.reuters.com/researchandstandards/corpus
4. www.hadoop.apache.org [Hadoop docs]
5. Hadoop: The Definitive Guide.
6. DKPro TC, http://code.google.com/p/dkpro-tc/
7. RTextTools, https://github.com/timjurka/RTextTools
8. DigitalPebble TC, https://github.com/DigitalPebble/TextClassification
Techniques in addition to DCLTH:
9. Dumais, Susan, and Hao Chen. "Hierarchical classification of Web content." Proceedings of
the 23rd annual international ACM SIGIR conference on Research and development in information
retrieval. ACM, 2000.
10. Granitzer, Michael. "Hierarchical text classification using methods from machine learning."
Master's Thesis, Graz University of Technology (2003).

Contenu connexe

Tendances

ON DISTRIBUTED FUZZY DECISION TREES FOR BIG DATA
 ON DISTRIBUTED FUZZY DECISION TREES FOR BIG DATA ON DISTRIBUTED FUZZY DECISION TREES FOR BIG DATA
ON DISTRIBUTED FUZZY DECISION TREES FOR BIG DATANexgen Technology
 
Clustering on database systems rkm
Clustering on database systems rkmClustering on database systems rkm
Clustering on database systems rkmVahid Mirjalili
 
Data clustering using map reduce
Data clustering using map reduceData clustering using map reduce
Data clustering using map reduceVarad Meru
 
Artificial Intelligence
Artificial IntelligenceArtificial Intelligence
Artificial Intelligencevini89
 
Introduction to Data Structures & Algorithms
Introduction to Data Structures & AlgorithmsIntroduction to Data Structures & Algorithms
Introduction to Data Structures & AlgorithmsAfaq Mansoor Khan
 
IRE Semantic Annotation of Documents
IRE Semantic Annotation of Documents IRE Semantic Annotation of Documents
IRE Semantic Annotation of Documents Sharvil Katariya
 
K means clustering
K means clusteringK means clustering
K means clusteringkeshav goyal
 
NameNode and DataNode Coupling for a Power-proportional Hadoop Distributed F...
NameNode and DataNode Couplingfor a Power-proportional Hadoop Distributed F...NameNode and DataNode Couplingfor a Power-proportional Hadoop Distributed F...
NameNode and DataNode Coupling for a Power-proportional Hadoop Distributed F...Hanh Le Hieu
 
K-means Clustering
K-means ClusteringK-means Clustering
K-means ClusteringAnna Fensel
 
IJSETR-VOL-3-ISSUE-12-3358-3363
IJSETR-VOL-3-ISSUE-12-3358-3363IJSETR-VOL-3-ISSUE-12-3358-3363
IJSETR-VOL-3-ISSUE-12-3358-3363SHIVA REDDY
 
Deep Learning for Search
Deep Learning for SearchDeep Learning for Search
Deep Learning for SearchBhaskar Mitra
 
IRJET- Sampling Selection Strategy for Large Scale Deduplication of Synthetic...
IRJET- Sampling Selection Strategy for Large Scale Deduplication of Synthetic...IRJET- Sampling Selection Strategy for Large Scale Deduplication of Synthetic...
IRJET- Sampling Selection Strategy for Large Scale Deduplication of Synthetic...IRJET Journal
 
3.5 model based clustering
3.5 model based clustering3.5 model based clustering
3.5 model based clusteringKrish_ver2
 
Deep Learning for Search
Deep Learning for SearchDeep Learning for Search
Deep Learning for SearchBhaskar Mitra
 

Tendances (18)

ON DISTRIBUTED FUZZY DECISION TREES FOR BIG DATA
 ON DISTRIBUTED FUZZY DECISION TREES FOR BIG DATA ON DISTRIBUTED FUZZY DECISION TREES FOR BIG DATA
ON DISTRIBUTED FUZZY DECISION TREES FOR BIG DATA
 
Clustering on database systems rkm
Clustering on database systems rkmClustering on database systems rkm
Clustering on database systems rkm
 
Data clustering using map reduce
Data clustering using map reduceData clustering using map reduce
Data clustering using map reduce
 
Artificial Intelligence
Artificial IntelligenceArtificial Intelligence
Artificial Intelligence
 
Introduction to Data Structures & Algorithms
Introduction to Data Structures & AlgorithmsIntroduction to Data Structures & Algorithms
Introduction to Data Structures & Algorithms
 
How to share a secret
How to share a secretHow to share a secret
How to share a secret
 
IRE Semantic Annotation of Documents
IRE Semantic Annotation of Documents IRE Semantic Annotation of Documents
IRE Semantic Annotation of Documents
 
Chapter8
Chapter8Chapter8
Chapter8
 
Cluster Analysis for Dummies
Cluster Analysis for DummiesCluster Analysis for Dummies
Cluster Analysis for Dummies
 
K means clustering
K means clusteringK means clustering
K means clustering
 
NameNode and DataNode Coupling for a Power-proportional Hadoop Distributed F...
NameNode and DataNode Couplingfor a Power-proportional Hadoop Distributed F...NameNode and DataNode Couplingfor a Power-proportional Hadoop Distributed F...
NameNode and DataNode Coupling for a Power-proportional Hadoop Distributed F...
 
K-means Clustering
K-means ClusteringK-means Clustering
K-means Clustering
 
IJSETR-VOL-3-ISSUE-12-3358-3363
IJSETR-VOL-3-ISSUE-12-3358-3363IJSETR-VOL-3-ISSUE-12-3358-3363
IJSETR-VOL-3-ISSUE-12-3358-3363
 
Deep Learning for Search
Deep Learning for SearchDeep Learning for Search
Deep Learning for Search
 
IRJET- Sampling Selection Strategy for Large Scale Deduplication of Synthetic...
IRJET- Sampling Selection Strategy for Large Scale Deduplication of Synthetic...IRJET- Sampling Selection Strategy for Large Scale Deduplication of Synthetic...
IRJET- Sampling Selection Strategy for Large Scale Deduplication of Synthetic...
 
Kmeans
KmeansKmeans
Kmeans
 
3.5 model based clustering
3.5 model based clustering3.5 model based clustering
3.5 model based clustering
 
Deep Learning for Search
Deep Learning for SearchDeep Learning for Search
Deep Learning for Search
 

Similaire à Large Scale Hierarchical Text Classification

Big Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onBig Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onDony Riyanto
 
2. Develop a MapReduce program to calculate the frequency of a given word in ...
2. Develop a MapReduce program to calculate the frequency of a given word in ...2. Develop a MapReduce program to calculate the frequency of a given word in ...
2. Develop a MapReduce program to calculate the frequency of a given word in ...Prof. Maulik Trivedi
 
Lecture2 big data life cycle
Lecture2 big data life cycleLecture2 big data life cycle
Lecture2 big data life cyclehktripathy
 
Frameworks provide structure. The core objective of the Big Data Framework is...
Frameworks provide structure. The core objective of the Big Data Framework is...Frameworks provide structure. The core objective of the Big Data Framework is...
Frameworks provide structure. The core objective of the Big Data Framework is...RINUSATHYAN
 
Oop(object oriented programming)
Oop(object oriented programming)Oop(object oriented programming)
Oop(object oriented programming)geetika goyal
 
Dache: A Data Aware Caching for Big-Data using Map Reduce framework
Dache: A Data Aware Caching for Big-Data using Map Reduce frameworkDache: A Data Aware Caching for Big-Data using Map Reduce framework
Dache: A Data Aware Caching for Big-Data using Map Reduce frameworkSafir Shah
 
Final training course
Final training courseFinal training course
Final training courseNoor Dhiya
 
CS3114_09212011.ppt
CS3114_09212011.pptCS3114_09212011.ppt
CS3114_09212011.pptArumugam90
 
Towards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance recordsTowards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance recordsPaolo Missier
 

Similaire à Large Scale Hierarchical Text Classification (20)

Big Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onBig Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-on
 
2. Develop a MapReduce program to calculate the frequency of a given word in ...
2. Develop a MapReduce program to calculate the frequency of a given word in ...2. Develop a MapReduce program to calculate the frequency of a given word in ...
2. Develop a MapReduce program to calculate the frequency of a given word in ...
 
COMPUTER LABORATORY-4 LAB MANUAL BE COMPUTER ENGINEERING
COMPUTER LABORATORY-4 LAB MANUAL BE COMPUTER ENGINEERINGCOMPUTER LABORATORY-4 LAB MANUAL BE COMPUTER ENGINEERING
COMPUTER LABORATORY-4 LAB MANUAL BE COMPUTER ENGINEERING
 
Lecture2 big data life cycle
Lecture2 big data life cycleLecture2 big data life cycle
Lecture2 big data life cycle
 
Frameworks provide structure. The core objective of the Big Data Framework is...
Frameworks provide structure. The core objective of the Big Data Framework is...Frameworks provide structure. The core objective of the Big Data Framework is...
Frameworks provide structure. The core objective of the Big Data Framework is...
 
Oop(object oriented programming)
Oop(object oriented programming)Oop(object oriented programming)
Oop(object oriented programming)
 
Hadoop
HadoopHadoop
Hadoop
 
Dache: A Data Aware Caching for Big-Data using Map Reduce framework
Dache: A Data Aware Caching for Big-Data using Map Reduce frameworkDache: A Data Aware Caching for Big-Data using Map Reduce framework
Dache: A Data Aware Caching for Big-Data using Map Reduce framework
 
Hadoop map reduce concepts
Hadoop map reduce conceptsHadoop map reduce concepts
Hadoop map reduce concepts
 
Final training course
Final training courseFinal training course
Final training course
 
Oops ppt
Oops pptOops ppt
Oops ppt
 
Lec1
Lec1Lec1
Lec1
 
Lec1
Lec1Lec1
Lec1
 
CS3114_09212011.ppt
CS3114_09212011.pptCS3114_09212011.ppt
CS3114_09212011.ppt
 
Towards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance recordsTowards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance records
 
Lec1
Lec1Lec1
Lec1
 
Chapter 2 ds
Chapter 2 dsChapter 2 ds
Chapter 2 ds
 
Data clustring
Data clustring Data clustring
Data clustring
 
BDS_QA.pdf
BDS_QA.pdfBDS_QA.pdf
BDS_QA.pdf
 
prj exam
prj examprj exam
prj exam
 

Dernier

毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxBoston Institute of Analytics
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...GQ Research
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 

Dernier (20)

毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 

Large Scale Hierarchical Text Classification

  • 1. Large Scale Hierarchical Text Classification using Hadoop:MapReduce Hammad Haleem (10-css-25) Pankaj Kumar sharma (10-css-46) Department of Computer Engineering Jamia Millia Islamia University New Delhi India
  • 2. S.no. Topic 1 Objective 2 Hierarchical Text Classification (HTC) and dataset 3 About Our Technique 5 About Hadoop and Mapreduce 6 Progress Reporting 7 References Table of Contents
  • 3. OBJECTIVE To develop and deploy a cost-effective, near real time and Distributed Hierarchical Text Documents’ classification System which could be used to classify documents in a huge Hierarchy of Categories in real time.
  • 4. What is Hierarchical Text Classification ? ▪ Documents are said to follow a hierarchical classification if: – If a single document is present in more than one category. – And the categories are themselves in a hierarchy. So a single category can contain multiple documents and even multiple categories.
  • 5. Reuters Corpus (RCV1) • Raw data from news articles. • Approximately 806,791 documents. It has 1000+ categories. Wikipedia Dataset • The data is in the SVM format and requires a very less amount of preprocessing. • The dataset has 2,400,000 documents in 325,000 categories Used dataset
  • 6. OUR APPROACH -DETAILED DESCRIPTION OF ALGORITHM We divided the development time into two phases. 1. Where we did the initial environment setup and wrote functions which will help in further development. 2. We used these functions to perform the training and testing on actual data. The section ahead talks in detail about the various steps performed at various phases of project.
  • 7. WE DEVELOPED FOLLOWING FUNCTIONS : INITIAL PHASE These methods are quite frequently used in the project so we would like to discuss them ● Train (TrainingDocumentSet D, CategoryTree C) ○ TrainingDocumentSet D is the set of training documents (i.e., already marked with their corresponding categories). ○ CategoryTree C will contain a list of Categories, the parent-child relation between them. ● Classify (Document d, CategoryTree C, TainedClassifier TC) ○ Document d would be the document to be classified. ○ CategoryTree C will contain a list of categories, the hierarchical parent-child relation between them. ○ TrainedClassifier TC is the output of Train API call ● Tf-Idf-Calculator (DocumentSet D) ○ DocumentSet D is the set of Text Documents for which we’ve to calculate Tf-Idf weights. ● CosineDistance(Document d1, Document d2) ○ will return the cosine distance between documents d1 and d2.
  • 8. HIGH LEVEL VIEW OF THE WHOLE CLASSIFICATION ALGORITHM.
  • 9. DEEP HIERARCHICAL CLASSIFICATION : TRAINING PHASE ▪ G Following diagram gives an overview of the training technique. ▪ The training follows lazy learner technique and is quite efficient for the problem.
  • 10. DEEP HIERARCHICAL CLASSIFICATION : TRAINING PHASE The DCLTH is a Lazy learning algorithm, therefore the training phase is quite straight forward. First of all, all training documents undergo preprocessing. In this layer mostly following activities are performed: 1. Removal of HTML tags or other noisy data from the documents. 2. Converting special documents like docs, pdfs to simple word based format. 3. Stemming words, i.e., converting words to their corresponding root words. (We’ll use Potter Stemming). 4. Stop word removal. i.e., removing common words like “a”, “the”, etc. 5. Documents are represented in libSVM format.
  • 11. DEEP HIERARCHICAL CLASSIFICATION : TRAINING PHASE After preprocessing is done, the document set undergo Tf-Idf calculator and the Tf-Idf value for each word related to each document is found.We would use following definition of Tf, Idf and Tf-Tdf: Tf(w,d) = f(w, d)/sum(d) Where f(w, d) is frequency of w, sum(d) is sum of frequencies of each unique word. Idf(w,d,D) = log( N / (1+n)) Where N is number of documents in set D, and n is number of documents in which W appears. 1 is added to avoid “Division by 0 error”. Finally, Tf-Idf(w,d,D) = Tf(w, d)*Idf(w,d,D). Finally this information is stored for later usage. This completes the training phase of DCLTH.
  • 12. DEEP HIERARCHICAL CLASSIFICATION : TESTING PHASE ❏ The novel Deep Classification in Hierarchical Text Classification performs better from other classification algorithms because instead of applying classification directly over all the categories, it first divides the categories into two categories, namely related and unrelated categories and then using the hierarchical information related to the category it employes top-down approach to classify the document. ❏ Generally we have two phases for the testing process.
• 13. DEEP HIERARCHICAL CLASSIFICATION : TESTING PHASE LEVEL 1
The aim of this stage is to divide the large category set into two subsets, "related" and "unrelated". For any document d, the number of related categories is much smaller than the number of unrelated ones, so from a huge set of categories only a small number remain on which the real classification takes place. This stage works as follows:
1. The given document is preprocessed (in the same fashion as in the training stage).
2. The processed document is used to find the k most similar training documents, using cosine similarity and the k-NN algorithm.
3. From these k documents we obtain the set of related categories (called the candidate categories).
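Steps 2 and 3 can be sketched as follows, assuming documents are sparse Tf-Idf vectors stored as `{term: weight}` dicts. This is an in-memory illustration; the actual system would run the k-NN search as a distributed job.

```python
import math

def cosine_sim(u, v):
    # u, v: sparse vectors as {term: tf-idf weight} dicts.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def candidate_categories(query, training, k):
    """Return the union of categories of the k nearest training docs.

    training: list of (vector, category_set) pairs.
    """
    ranked = sorted(training, key=lambda dc: cosine_sim(query, dc[0]),
                    reverse=True)
    cats = set()
    for _, category_set in ranked[:k]:
        cats.update(category_set)
    return cats
```

A full sort is used here for clarity; a heap-based top-k selection would be the obvious optimisation at scale.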
• 14. DEEP HIERARCHICAL CLASSIFICATION : TESTING PHASE LEVEL 2
The aim of the second stage is to take the candidate categories and the processed document (both obtained from the first stage) and find the most probable categories among them. This stage works as follows:
1. From the candidate categories, a pruned tree is formed (an example of such a tree is shown in the following diagram).
2. Starting from the root of this pruned tree, a standard Naive Bayes classifier is applied at each level to prune the tree further.
3. The algorithm stops when the bottommost level is reached.
4. The categories remaining in the final pruned tree (the final categories) are assigned to the document.
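The per-level Naive Bayes decision of step 2 can be sketched as below: a multinomial model with Laplace smoothing is trained on the documents of each sibling category, and the child with the highest log-likelihood is followed. Function names and the uniform-prior assumption are ours, not from the slides.

```python
import math
from collections import Counter

def train_nb(docs_by_class):
    """docs_by_class: {category: [token lists]} -> {category: {term: P(t|c)}}."""
    vocab = {t for docs in docs_by_class.values() for d in docs for t in d}
    model = {}
    for c, docs in docs_by_class.items():
        counts = Counter(t for d in docs for t in d)
        total = sum(counts.values())
        # Laplace smoothing over the shared vocabulary.
        model[c] = {t: (counts[t] + 1) / (total + len(vocab)) for t in vocab}
    return model

def best_child(model, tokens):
    # Follow the sibling category with the highest log-likelihood;
    # uniform class priors are assumed for simplicity.
    floor = 1e-9  # illustrative floor for tokens unseen in training
    def score(c):
        probs = model[c]
        return sum(math.log(probs.get(t, floor)) for t in tokens)
    return max(model, key=score)
```

Applied top-down, `best_child` is called once per level of the pruned tree until the bottommost level is reached.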
• 15. DID WE SAY “DISTRIBUTED” ? ❏ Most of the techniques under the above two approaches to HTC are linear and are tested on a single system, not on any cluster. ❏ This project aims to: ❏ deploy one of these popular approaches to run in parallel on a Hadoop cluster, and ❏ develop our own approach, optimised for parallel execution.
• 16. Distributed Processing and Hadoop ▪ The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. ▪ It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. ▪ Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thus delivering a highly available service on top of a cluster of computers, each of which may be prone to failure.
• 17. More from our Hadoop Cluster
▪ The next few slides show screenshots from various panels of the Hadoop platform:
i. Node Manager GUI
ii. DFS GUI
iii. Job Tracker GUI
• 18–20. [Screenshots of the Hadoop cluster: Node Manager GUI, DFS GUI and Job Tracker GUI]
• 21. What is MapReduce and how did Hadoop help us? ▪ MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. ▪ The model consists of two steps: – "Map" step: the master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may repeat this in turn, leading to a multi-level tree structure. Each worker node processes its smaller problem and passes the answer back to its master node. – "Reduce" step: the master node then collects the answers to all the sub-problems and combines them in some way to form the output, the answer to the problem it was originally trying to solve.
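As an illustration of the model, the following single-process sketch mimics a MapReduce job that computes the document frequency n(w) needed by the Idf formula. The shuffle/sort phase that Hadoop performs between map and reduce is simulated here with an in-memory sort and group-by; this is a toy stand-in, not our actual Hadoop job.

```python
from itertools import groupby
from operator import itemgetter

def mapper(doc_id, tokens):
    # "Map" step: emit (term, 1) once per document a term appears in.
    for term in set(tokens):
        yield term, 1

def reducer(term, counts):
    # "Reduce" step: sum the partial counts into the document frequency.
    return term, sum(counts)

def run_job(docs):
    """Single-process stand-in for a Hadoop job over (doc_id, tokens) pairs."""
    # Simulated shuffle/sort: bring equal keys together.
    pairs = sorted(kv for doc_id, tokens in docs for kv in mapper(doc_id, tokens))
    return dict(reducer(term, (v for _, v in group))
                for term, group in groupby(pairs, key=itemgetter(0)))
```

In the real cluster, `mapper` and `reducer` would run on worker nodes over HDFS splits, with Hadoop handling partitioning, sorting, and fault tolerance.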
• 22. PROGRESS REPORT
▪ Setup of the cluster
▪ Analysis of various algorithms for training and testing
▪ Training on the dataset
▪ Testing on the dataset
▪ Implementing the whole classifier without MapReduce
▪ Implementing the classifier with MapReduce
• 23. References
1. www.kaggle.com/c/lshtc [dataset and problem description]
2. Distributed hierarchical text classification framework, US Patent US 7809723 B2.
3. www.about.reuters.com/researchandstandards/corpus [Reuters Corpus]
4. www.hadoop.apache.org [Hadoop docs]
5. Hadoop: The Definitive Guide.
6. DKPro TC: http://code.google.com/p/dkpro-tc/
7. RTextTools: https://github.com/timjurka/RTextTools
8. DigitalPebble TC: https://github.com/DigitalPebble/TextClassification
Some other techniques in addition to DCLTH:
9. Dumais, Susan, and Hao Chen. "Hierarchical classification of Web content." Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2000.
10. Granitzer, Michael. "Hierarchical text classification using methods from machine learning." Master's thesis, Graz University of Technology, 2003.