A Management and Visualisation Tool for

                    Text Mining Applications

                                          Student Peishan Mao

                         MSc Computing Science Project Report

          School of Computing Science and Information System

                    Birkbeck College, University of London 2005



                                           Status         Draft

                                      Last saved    26 Apr. 10




               1 of 93
1 TABLE OF CONTENTS

1     TABLE OF CONTENTS                             2

2     ACKNOWLEDGEMENT                               5

3     ABSTRACT                                      6

4     INTRODUCTION                                  7

5     BACKGROUND                                    8

5.1    Written Text                                  8

5.2    Natural Language Text Classification          8
  5.2.1    Text Classification                       8
  5.2.2    The Classifier                            9

5.3    Text Classifier Experimentations             12

6     HIGH-LEVEL APPLICATION DESCRIPTION            14

6.1    Description and Rationale                    14
  6.1.1    Build a Classifier                       14
  6.1.2    Evaluate and Refine the Classifier       15

6.2    Development and Technologies                 15

7     DESIGN                                        17

7.1    Functional Requirements                      17

7.2    Non-Functional Requirements                  22
  7.2.1    Usability                                22
  7.2.2    Hardware and Software Constraint         22
  7.2.3    Documentation                            23

7.3    System Framework                             23

7.4    Components in Detail                         25
  7.4.1   The Client - User Interface               25
  7.4.2   Display Manager                           26
  7.4.3   The Classifier                            26
  7.4.4   Data Manipulation and Cleansing           28
  7.4.5   Experimentation                           29
  7.4.6   Results Manager                           30
  7.4.7   Error Handling                            31


7.5     Class Diagram                                32

8      DATABASE                                      33

8.1    Entities                                      33
  8.1.1    Score Table                               33
  8.1.2    Source Table                              33
  8.1.3    Configuration Table                       33
  8.1.4    Score Functions Table                     33
  8.1.5    Match Normalisation Functions Table       34
  8.1.6    Tree Normalisation Functions Table        34
  8.1.7    Classification Condition Table            34
  8.1.8    Class Weights Table                       34
  8.1.9    Temporary Max and Min Score Table         34

8.2    Views                                         35
  8.2.1    Weighted Scores                           35
  8.2.2    Maximum and Minimum Scores                35
  8.2.3    Misclassified Documents                   35

8.3     Relation Design for the Main Tables          35

9      IMPLEMENTATION                                37

9.1     Main User Interface                          37

9.2     Display Manager                              39

9.3     Classifier Classes                           40

9.4     Results Output Classes                       41

9.5     Other Controller Classes                     43

9.6     TreeView Controller Class                    44

9.7     Error Interface                              45

10      IMPLEMENTATION SPECIFICS                     46

10.1    Generic Selection Form Class                 46

10.2    Visualisation of the Suffix Tree             48

10.3    Dynamic Sub-String Matching                  49

10.4    User Interaction Warnings                    50

11      USER GUIDE                                   53



11.1 Getting Started                           53
  11.1.1 Input Data                            53

11.2   Loading a Resource Corpus               54

11.3   Selecting a Sampling Set                57

11.4   Performing Pre-processing               61

11.5 Running N-Fold Cross-Validation           64
  11.5.1 Set Up Cross-Validation Set           64
  11.5.2 Perform experiments on the data       67
    11.5.2.1  Create the Suffix Tree           67
    11.5.2.2  Display Suffix Tree              69
    11.5.2.3  Delete Suffix Tree               71
    11.5.2.4  N-Gram Matching                  71
    11.5.2.5  Score Documents                  73
    11.5.2.6  Classify documents               74
    11.5.2.7  Add New Document to Classify     76

11.6   Creating a Classifier                   79

12     TESTING                                 81

13     CONCLUSION                              83

13.1   Evaluation                              83

13.2   Future Work                             84

14     BIBLIOGRAPHY                            86

15     APPENDIX A DATABASE                     88

16     APPENDIX B CLASS DEFINITIONS            90

17     APPENDIX C SOURCE CODE                  93




2 ACKNOWLEDGEMENT

I would like to thank the following people for their help over the course of this project:


Rajesh Pampapathi: for his spectrum of help on the project, ranging from his patient
advice on the whole area of text classification and pointing me in the right direction for
information on the topic, to being interviewed as a potential user of the proposed system
as part of the requirements collection.


Timothy Yip: for laboriously proofreading the draft of the report despite not having
much interest in information technology.




3 ABSTRACT

This report describes the design and implementation of a management and visualisation
tool for text classification applications. The system is built as a wrapper for a machine
learning classification tool. It aims to provide a flexible framework that can accommodate
future changes to the system. The system is implemented in C# .NET with a Windows
Forms front end and an Access database as an example, but should be flexible enough
to accommodate different underlying components.




4 INTRODUCTION


This report describes the project carried out to implement a management and
visualisation tool for text classification. It covers background information about the
project, the design, implementation and conclusion. The report is organised as follows:


Section 4 is this section. It describes the organisation of the report.
Section 5 takes a look at the background of the project. This section covers natural
language classification and the suffix tree data structure used in Pampapathi et al's
study.
Section 6 gives a high-level description and rationale of the system.
Section 7 describes the design of the system. It lays out the system requirements and
system framework, and describes the system components and classes.
Section 8 explains the database design and describes the database entities and table
relations.
Section 9 discusses how the system was implemented and goes into the class definitions.
Section 10 focuses on specific implementation details: the generic selection form class,
visualisation of the suffix tree, dynamic sub-string matching on documents, and user
warnings.
Section 11 is the user guide to the system.
Section 12 covers the testing of the system.
Section 13 concludes the project. This section discusses whether the system built has
met the requirements laid out at the beginning of the project. It also looks at future work.


Appendix A Database
Appendix B Class Definitions
Appendix C Source Code




5 BACKGROUND


5.1 Written Text


Writing has long been an important means of exchanging information, ideas and
concepts from one individual to another, or to a group. Indeed, it is even thought to be
the single most advantageous evolutionary adaptation for species preservation [2]. The
written text available contains a vast amount of information. The advent of the internet
and on-line documents has contributed to the proliferation of digital textual data readily
available for our perusal. Consequently, it is increasingly important to have a systematic
method of organising this corpus of information.
Tools for textual data mining are proving increasingly important for our growing mass of
text-based data. The discipline of computing science has made significant contributions
to this area by automating the data mining process. Encoding unstructured text data
into a more structured form is not a straightforward task: natural language is rich and
ambiguous, and working with free text is one of the most challenging areas in computer
science.
This project aims to investigate how computer science can help to evaluate some of the
vast amounts of textual information available to us, and how to provide a convenient way
to access this type of unstructured data. In particular, the focus will be on the data
classification aspect of data mining. The next section will explore this topic in more
depth.


5.2 Natural Language Text Classification


5.2.1    Text Classification
F Sebastiani [3] described automated text categorisation as
“The task of automatically sorting a set of documents into categories (or classes, or
topics) from a predefined set. The task, that falls at the crossroads of information
retrieval, machine learning, and (statistical) natural language processing, has witnessed
a booming interest in the last ten years from researchers and developers alike.”
Classification maps data into predefined groups or classes. Examples of classification
applications include image and pattern recognition, medical diagnosis, loan approval,
detecting faults in industry applications, and classifying financial trends. Until the late
1980s, knowledge engineering was the dominant paradigm in automated text
categorisation. Knowledge engineering consists of the manual definition, by domain
experts, of a set of rules which form part of a classifier. Although this approach has
produced results with accuracies as high as 90% [3], it is labour intensive and domain
specific. It has since been superseded by a new paradigm based on machine learning,
which addresses many of the limitations of knowledge engineering.
Machine learning encompasses a variety of methods that represent the convergence of
statistics, biological modelling, adaptive control theory, psychology, and artificial


intelligence (AI) [11]. Data classification by machine learning is a two-phase process
(Figure 1). The first phase involves a general inductive process that automatically builds
a model, using a classification algorithm, describing a predetermined set of
non-overlapping data classes. This step is referred to as supervised learning because
the classes are determined before examining the data, and the set of data used is known
as the training data set. Data in text classification comes in the form of files, and each
file is usually referred to as a document. Classification algorithms require that the
classes are defined purely on the content of the documents. They describe these
classes by looking at the characteristics of the documents in the training set already
known to belong to each class. The learned model constitutes the classifier and can be
used to categorise future corpus samples. In the second phase, the classifier
constructed in phase one is used for classification.
The machine learning approach to text classification is less labour intensive, and is
domain independent. Since the attribution of documents to categories is based purely on
the content of the documents, effort is concentrated on constructing an automatic
builder of classifiers (also known as the learner), and not the classifier itself [3]. The
automatic builder is a tool that extracts the characteristics from the training set, which
are represented by a classification model. This means that once a learner is built, new
classifiers can be automatically constructed from sets of manually classified documents.



a)  Training Set ──> Classification Algorithm ──> Classification Model

b)  Test Set / New Documents ──> Classification Model

              Figure 1.       a) Step One in Text Classification; b) Step Two in Text Classification



5.2.2    The Classifier
In general a text classifier comprises a number of basic components. As noted in the
previous section, the text classifier begins with an inductive stage. A classifier requires
some sort of text representation of documents. In order to build an internal model the
inductive step involves a set of examples used for training the classifier. This set of
examples is known as the training set and each document in the training set is assigned
to a class C = {c1, c2, … cn}. All the documents used in the training phase are
transformed into internal representations.
Currently, a dominant learning method in text classification is based on a vector space
model [5]. The Naïve Bayesian is one example and is often used as a benchmark in text


classification experiments. Bayesian classifiers are statistical classifiers. Classification
is based on the probability that a given document belongs to a particular class. The
approach is 'naïve' because it assumes that the contributions of all attributes to a given
class are independent and that each contributes equally to the classification problem. By
analysing the contribution of each 'independent' attribute, a conditional probability is
determined. Attributes in this approach are the words that appear in the documents of
the training set.
Documents are represented by a vector with dimensions equal to the number of different
words within the documents of the training set. The value of each individual entry within
the vector is set at the frequency of the corresponding word. According to this approach,
training data are used to estimate parameters of a probability distribution, and Bayes
theorem is used to estimate the probability of a class. A new document is assigned to
the class that yields the highest probability. It is important to perform pre-processing to
remove frequent words such as stop words before a training set is used in the inductive
phase.
The Naïve Bayesian approach has several advantages. Firstly, it is easy to use;
secondly, only one scan of the training data is required. It can also easily handle missing
values by simply omitting the corresponding probability when calculating the likelihood of
membership in each class. Although the Naïve Bayesian-based classifier is popular,
documents are represented as a 'bag of words' in which the words of a document have
no relationship with each other. However, words that appear in a document are usually
not independent. Furthermore, the smallest unit of representation is a word.
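The Naïve Bayesian scheme described above can be sketched in a few lines. The report's system is implemented in C#; the following Python sketch is purely illustrative, the document data in the usage example is hypothetical, and the Laplace smoothing is an assumption on my part, as the report does not specify a smoothing scheme.

```python
import math
from collections import Counter

def train_naive_bayes(docs_by_class):
    """Estimate per-class word probabilities from tokenised training documents."""
    vocab = {w for docs in docs_by_class.values() for d in docs for w in d}
    model = {}
    for cls, docs in docs_by_class.items():
        counts = Counter(w for d in docs for w in d)
        total = sum(counts.values())
        # Laplace smoothing so unseen words do not zero out the product
        model[cls] = {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}
    return model

def classify(model, doc):
    """Assign the class with the highest log-probability score."""
    scores = {cls: sum(math.log(probs.get(w, 1e-9)) for w in doc)
              for cls, probs in model.items()}
    return max(scores, key=scores.get)
```

Log probabilities are summed rather than raw probabilities multiplied; this gives the same ranking while avoiding numeric underflow on long documents.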
Research is continuously investigating how designs of text classifiers can be further
improved and Pampapathi et al [1] at Birkbeck College, London recently proposed a new
innovative approach to the internal modelling of text classifiers. They used a well known
data structure called a suffix tree [11] which allows for indexing the characteristics of
documents at a more granular level, with documents represented by substrings. The
suffix tree is a compact trie containing all the suffixes of strings represented. A trie is a
tree structure, where each node represents one character, and the root represents the
null string. Each path from the root represents a string, described by the characters
labelling the nodes traversed. All strings sharing a common prefix will branch off from a
common node. When the strings are words over the alphabet a to z, a node has at most
26 children, one for each letter (or 27, including a terminator). Suffix trees have
traditionally been used for complex string matching problems (data compression, DNA
sequencing). Pampapathi et al's research is the first to apply suffix trees to natural
language text classification.
Pampapathi et al's method of constructing the suffix tree varies slightly from the
standard one. Firstly, the tree nodes are labelled instead of the edges, in order to
associate the frequencies directly with the characters and substrings. Secondly, a special
terminal character is not used, as the focus is on the substrings and not the suffixes.
Each suffix tree has a depth, described by the maximum number of levels in the tree;
a level is defined by the number of nodes away from the root node. For example, the
suffix tree illustrated in Figure 2 has a depth of 4. Pampapathi et al set a limit on the
tree depth, and each node of the suffix tree stores a frequency and a character.
For example, constructing a suffix tree for the string S1 = "COOL" creates the tree in
Figure 2. The substrings are COOL, OOL, OL and L.



Root ─┬─ C(1) ── O(1) ── O(1) ── L(1)
      ├─ O(1) ─┬─ O(1) ── L(1)
      │        └─ L(1)
      └─ L(1)

                                  Figure 2.       Suffix Tree for String "COOL"



If a second string S2 = "FOOL" is inserted into the suffix tree, it will look like the diagram
illustrated in Figure 3. The substrings for S2 are FOOL, OOL, OL and L. Notice that the
last three substrings of S2 duplicate substrings already seen in S1, so no new nodes are
created for these repeated substrings; their frequencies are incremented instead.



Root ─┬─ F(1) ── O(1) ── O(1) ── L(1)
      ├─ C(1) ── O(1) ── O(1) ── L(1)
      ├─ O(2) ─┬─ O(2) ── L(2)
      │        └─ L(2)
      └─ L(2)

                              Figure 3.       Suffix Tree with String "FOOL" Added
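The node-labelled construction described above (frequencies stored on nodes, no terminal character, a limited depth) can be sketched as follows. This is an illustrative Python sketch, not the Birkbeck .dll implementation the tool actually links to; the depth limit of 4 simply mirrors the example in Figure 2.

```python
class Node:
    def __init__(self, char=None):
        self.char = char       # character labelling this node (None for the root)
        self.freq = 0          # how many times this path has been traversed
        self.children = {}

def insert_string(root, s, max_depth=4):
    """Insert every suffix of s, character by character, up to max_depth levels.

    Shared prefixes reuse existing nodes and only increment their frequencies,
    as in Figures 2 and 3."""
    for i in range(len(s)):
        node = root
        for ch in s[i:i + max_depth]:
            node = node.children.setdefault(ch, Node(ch))
            node.freq += 1
    return root
```

Inserting "COOL" and then "FOOL" reproduces the shape of Figure 3: the OOL, OL and L paths are shared between the two strings rather than duplicated.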



Similar to the Naïve Bayesian method, a classifier using the suffix tree for its internal
model undergoes supervised learning from a training set which contains documents that
have been pre-classified into classes. Unlike the Naïve Bayesian approach, the suffix
tree, by capturing the characteristics of documents at the character level, does not
require pre-processing of the training set. A suffix tree is built for each class and a new
document is classified by scoring it against each of the trees. The class of the highest
scoring tree is assigned to the document. Pampapathi et al's study was based on email
classification, and the result of the experiment showed that a classifier employing a suffix
tree outperformed the Naïve Bayesian method.
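Classification against per-class suffix trees, as described above, might be sketched as follows. The actual scoring functions used by Pampapathi et al (and configurable through this tool's score function tables) are more sophisticated; the frequency-sum score below is a simplified stand-in for illustration, and the example strings are hypothetical.

```python
def build_tree(docs, max_depth=4):
    """Build a frequency-labelled suffix tree as nested dicts: char -> [freq, subtree]."""
    tree = {}
    for doc in docs:
        for i in range(len(doc)):
            node = tree
            for ch in doc[i:i + max_depth]:
                if ch not in node:
                    node[ch] = [0, {}]
                node[ch][0] += 1
                node = node[ch][1]
    return tree

def score(tree, doc, max_depth=4):
    """Simplified score: sum of node frequencies along every matched substring."""
    total = 0
    for i in range(len(doc)):
        node = tree
        for ch in doc[i:i + max_depth]:
            if ch not in node:
                break
            total += node[ch][0]
            node = node[ch][1]
    return total

def classify(trees, doc):
    """Assign the class whose suffix tree gives the highest score."""
    return max(trees, key=lambda cls: score(trees[cls], doc))
```

One tree is built per class from that class's training documents, and a new document receives the class of the highest-scoring tree, exactly as described in the text.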
Solving a classification problem involves more than the classifier itself: as seen with the
Naïve Bayesian method, it is also important to pre-process the data used for training.
The next section looks at the other processes involved in text classification beyond the
classifier component.


5.3 Text Classifier Experimentations
As described in the previous sections, classification is a two-step process:
   1. Create a specific model by evaluating the training data. This step has as input
      the training data (including the category/class labels) and as output a definition of
      the model developed. The model created, which is the classifier, should classify
      the training data as accurately as possible.
   2. Apply the model developed by classifying new sets of documents.
In the research community, or for those interested in evaluating the performance of a
classifier, the second step can be more involved. First, the predictive accuracy of the
classifier is estimated. A simple yet popular technique, called the holdout method, uses
a test set of class-labelled samples. These samples are usually randomly selected, and
it is important that they are independent of the training samples; otherwise the estimate
could be optimistic, since the learned model is based on that data and therefore tends to
overfit it. The accuracy of a classifier on a given test set is the percentage of test set
samples that are correctly classified by the classifier. For each test sample, the known
class label is compared with the classifier's class prediction for that sample.
If the accuracy of the classifier model is considered acceptable, the model can be used
to classify new documents.


Corpus data ─┬─> Training Set ──> Derive Classifier ──> Estimate Accuracy
             └─> Test Set ──────────────────────────────────┘

                     Figure 4.   Estimating Classifier Accuracy with the Holdout Method
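The holdout procedure in Figure 4 amounts to a random split followed by an accuracy count. A minimal Python sketch, in which the 30% test fraction and the fixed seed are arbitrary assumptions rather than values taken from the report:

```python
import random

def holdout_split(samples, test_fraction=0.3, seed=0):
    """Randomly hold out a test set; the remainder is used for training."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def accuracy(classifier, test_set):
    """Fraction of test samples whose predicted class matches the known label."""
    correct = sum(1 for doc, label in test_set if classifier(doc) == label)
    return correct / len(test_set)
```

Because the test samples never take part in training, the resulting accuracy is an honest, if pessimistic, estimate of performance on unseen documents.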



The estimate obtained with the holdout method is pessimistic, since only a portion of the
initial data is used to derive the classifier. Another technique, called N-fold
cross-validation, is often used in research. Cross-validation is a statistical technique
which can mitigate the bias caused by a particular partition into training and test sets. It
is also useful when the amount of data is limited. The method can be used to evaluate
and estimate the performance of a classifier, and the aim is to obtain as honest an
estimate as possible of the classification accuracy of the system. N-fold cross-validation
involves partitioning the dataset (initial corpus) randomly into N equally sized
non-overlapping blocks (folds). The training-testing process is then run N times, each
time with a different test set. For example, when N = 3, we have the following training
and test sets.


                 Run        Training blocks     Test block
                 Run 1      1, 2                3
                 Run 2      1, 3                2
                 Run 3      2, 3                1

                                Figure 5.      3-Fold Cross-Validation



For each cross-validation run the user will be able to use a training set to build the
classifier.
Stratified N-fold cross-validation is a recommended method for estimating classifier
accuracy due to its low bias and variance [13]. In stratified cross-validation, the folds are
stratified so that the class distribution of the samples in each fold is approximately the
same as that of the initial training set.
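The stratified N-fold partitioning described above can be sketched as follows; dealing each class's documents round-robin across the folds keeps the class distribution of every fold close to that of the initial set. This is an illustrative Python sketch, not the tool's C# implementation.

```python
from collections import defaultdict

def stratified_folds(labelled_docs, n):
    """Partition (doc, class) pairs into n folds with near-equal class distribution."""
    by_class = defaultdict(list)
    for doc, cls in labelled_docs:
        by_class[cls].append((doc, cls))
    folds = [[] for _ in range(n)]
    for cls, docs in by_class.items():
        for i, item in enumerate(docs):
            folds[i % n].append(item)   # deal each class round-robin across folds
    return folds

def runs(folds):
    """Yield the (training set, test set) pair for each of the n runs."""
    for i, test in enumerate(folds):
        train = [d for j, fold in enumerate(folds) if j != i for d in fold]
        yield train, test
```

As in Figure 5, each block serves as the test set in exactly one run while the remaining blocks are combined for training.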
Preparing the training set data for classification using pre-processing can help improve
the accuracy, efficiency, and scalability of the evaluation of the classification. Methods
include stop word removal, punctuation removal, and stemming.
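The three pre-processing steps named above can be sketched together. The stop word list here is an illustrative subset, and the suffix-stripping stemmer is a toy stand-in for a real algorithm such as Porter's; the report does not specify which stemmer its tool uses.

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "and", "of", "to"}   # illustrative subset

def preprocess(text):
    """Lower-case, strip punctuation, drop stop words, apply crude suffix stemming."""
    words = re.findall(r"[a-z]+", text.lower())   # punctuation removal
    kept = [w for w in words if w not in STOP_WORDS]

    def stem(w):
        # toy stemmer: real systems would use e.g. the Porter algorithm
        for suffix in ("ing", "ed", "s"):
            if w.endswith(suffix) and len(w) > len(suffix) + 2:
                return w[: -len(suffix)]
        return w

    return [stem(w) for w in kept]
```

Reducing the vocabulary this way shrinks the document vectors and removes words that carry little class information, which is why it improves both accuracy and efficiency.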
Using the above techniques to prepare the data and estimate classifier accuracy
increases the overall computational time, but is valuable for evaluating a classifier and
for selecting among several classifiers.
The current project aims to build a system that is a wrapper for a text classifier and that
incorporates, as an example, the suffix tree used in the research by Pampapathi et al.
The next sections describe the project in detail.




6 HIGH-LEVEL APPLICATION DESCRIPTION

6.1 Description and Rationale
The aim of this project is to build a management and visualisation tool that provides
researchers with data manipulation support for underlying text classification
algorithms. The tool will provide a software infrastructure for a data mining system
based on machine learning. The goal is to build a flexible framework that allows
changes to the underlying components with relative ease. Functions may be added to
the system in the future, and adding new functionality should have minimal effect on the
current system.
The system will be built as a wrapper for the two-step process involved in classification.
First, a component will be built that automatically constructs a classifier given some
training data. Second, the system will provide capabilities to perform classification and to
evaluate the performance of a classifier. Additionally, the tool will provide functionality to
run data sampling and various pre-processing steps on the data.
For the researcher it is important to clearly define the training set (known in this report
as the 'resource corpus') used for training the classifier. When the resource corpus is
small, the user can choose to use the entire corpus in the study. If the resource corpus
is large, the tool gives the option of selecting sampling sets to represent it. A number of
sampling methodologies are implemented that allow the user to select a sample which
reflects the characteristics of the resource corpus from which it is drawn.
Note that a resource corpus is grouped into classes, and this structure needs to be
taken into consideration when developing the sampling mechanism. Three popular
sampling methods will be developed, although other sampling methods could be added,
such as convenience sampling, judgement sampling, quota sampling, and snowball
sampling.
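Because the corpus is grouped into classes, a sampler must respect that structure. The report does not name its three methods at this point, so as one plausible example here is a sketch of stratified random sampling, drawing proportionally from each class; the proportional-draw rule and fixed seed are assumptions for illustration.

```python
import random

def stratified_sample(corpus, fraction, seed=0):
    """Draw a proportional random sample from each class of the resource corpus.

    corpus maps class name -> list of documents; the returned sample has the
    same class structure, preserving the corpus's class proportions."""
    rng = random.Random(seed)
    sample = {}
    for cls, docs in corpus.items():
        k = max(1, round(len(docs) * fraction))   # at least one doc per class
        sample[cls] = rng.sample(docs, k)
    return sample
```

Sampling per class, rather than from the pooled corpus, is what keeps the sample's class distribution faithful to the resource corpus it is drawn from.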
Note that the user can choose to evaluate the data used to construct the classifier
before actually building it. The tool will be designed to be generic enough to analyse a
corpus of any categorisation type, e.g. automated indexing of scientific articles, email
routing, spam filtering, criminal profiling, and expertise profiling.


6.1.1    Build a Classifier
The tool allows the user to build a classifier. The current framework implements only the
suffix tree-based classifier developed at Birkbeck College, but it will be flexible enough
to incorporate other classification models in the future. The research on suffix trees
applied to classification is new, and there is currently no such application.
The learning process of the classifier follows the machine learning approach to
automated text classification, whereby the system automatically builds a classifier for the
categories of interest. From the graphical user interface (GUI), the user can select a
corpus to use as training data. The application provides links to .dll files developed by
Birkbeck College which allow the user to build a suffix tree from the selected corpus. The
internal data representation is constructed by generalising from a training set of pre-
classified documents. Once the classifier is built the user can load new documents into
the system to be classified.


6.1.2    Evaluate and Refine the Classifier
In research, once a classifier has been built it is desirable to evaluate its effectiveness.
Even before the construction of the classifier, the tool provides a platform for users to
perform a number of experiments and refinements on the source (training) data. Hence,
the second focus of the project is to provide a user-friendly front end and a base
application for testing classification algorithms.
The user can load in a text based corpus and perform standard pre-processing functions
to remove noise and prepare the data for experimentation. There is also a choice of
sampling methods to use in order to reduce the size of the initial corpus making it more
manageable.
Sebastiani [2] notes that any classifier, whether human or machine, is prone to
classification error. This is because a notion central to text classification, the
membership of a document in a class based on the characteristics of the document and
the class, is inherently subjective, since the characteristics of both the documents and
the classes cannot be formally specified. As a result, automatic text classifiers are
evaluated using a set of pre-classified documents: the classification decisions are
compared with the original categories the documents were assigned to. For
experimentation and evaluation purposes, this set of pre-classified documents is split
into two sets, a training set and a test set, not necessarily of equal sizes.
The tool implements an extra level of experimentation using N-fold cross-validation.
When employing cross-validation in classification, it must be taken into account that the
data is grouped by classes; therefore this project implements stratified cross-validation.
Once a classifier has been constructed, it is possible to perform data classification
experiments as well as other tasks such as single document analysis. For example, with
the suffix tree-based classifier the user will be able to view the structure of the suffix
tree, as well as the documents in the test sets, or to load a new document and obtain a
full matrix of output data about it. The output data is persisted in an information system
which is subsequently used to perform analysis and visualisation tasks.


6.2 Development and Technologies
Development was done in C#, using the .NET framework. The architect of the system
was designed to be an extensible platform to enable users and developers to leverage
the existing framework for future system upgrades. The tool was built from several
components and aims to be modular. There are a number of controller components to
provide functionalities for the tool. A set of libraries is used to provide the functionalities
for the suffix tree. Working closely with researchers from Birkbeck College on the
interface, these libraries for the suffix tree were provided by Birkbeck College.
The suffix tree data structure is built in memory and can become very large. One
solution to better utilise resources is to have the data structure physically stored as one
tree, although it is logically represented as individual trees for each class. Further
discussion can be found in subsequent sections.




A Windows application was built as the client. This forms the interface that the user
interacts with to gain access to the functionalities of the tool. The output data is cached
in a database.
The main targeted users for the tool are researchers in natural language text
classification, and other users who want to mine textual data.




7 DESIGN

7.1 Functional Requirements
Requirements for the application were collected from research on natural language text
classification and discussions with targeted users in the research community.
Requirements are the capabilities and conditions to which the application must conform.
The functional requirements of the system are captured using 'use cases'. Use cases
are a useful tool for describing how a user interacts with a system: they are written
stories, easy to understand, that describe the interaction between the system and the
user. Requirements can often change over the course of development, and for this
reason there was no attempt to define and freeze all requirements at the onset of the
project. The following use cases were produced; note that some use cases were added
throughout the development of the system.


Use Case Name:                Load Directory as Source Corpus
Primary Actor:                User
Pre-conditions:               The application is running
Post-conditions:              A source corpus is loaded into the application
Main Success Scenarios:
Actor Action (or Intention)                      System Responsibility
   1. The user selects a valid directory             2. The system checks the directory
      (to which the user has at least                   path for validity and access
      read access) and loads it as a                 3. Builds a tree structure of classes
      corpus into the system                            based on the sub-folders in the
                                                        directory and displays the classes
                                                        in the GUI



Use Case Name:                View a Document in Corpus
Primary Actor:                User
Pre-conditions:               A corpus is successfully loaded
Post-conditions:
Main Success Scenarios:
Actor Action (or Intention)                      System Responsibility
   1. Select the document to view                    2. Display content of document in the
                                                        GUI



Use Case Name:                Create Sampling Set


Primary Actor:                User
Preconditions:                A source corpus is successfully loaded
Postconditions:               A sampling set based on the source corpus is created. New
                              file directory created for the corpus.
Main Success Scenarios:
Actor Action (or Intention)                       System Responsibility
   1. User selects how they want to                   3. Creates a sampling set based on
      select the sampling set                            parameters given by the user
   2. User specifies location to store the            4. Creates the directory structure and
      documents/files created for the                    document/files in the location
      sampling set                                       specified by the user
                                                      5. Displays new corpus created in the
                                                         GUI



Use Case Name:                Run Pre-Processing
Primary Actor:                User
Pre-conditions:               A training set exists in the system
Post-conditions:              A new pre-processed sampling set created. New file directory
                              created for the corpus.
Main Success Scenarios:
Actor Action (or Intention)                       System Responsibility
   1. Select type of pre-processing to                4. Performs pre-processing
      perform                                         5. Creates a new pre-processed set
   2. User specifies location to store the            6. Stores the directory structure and
      documents/files created for the pre-               documents/files at the location
      processing set                                     specified by the user
   3. Run pre-processing                              7. Displays the corpus as a directory
                                                         structure in the GUI



Use Case Name:                Run N-Fold Cross-Validation
Primary Actor:                User
Preconditions:                A sampling set is successfully created
Postconditions:               N-fold cross-validation set is created virtually
Main Success Scenarios:
Actor Action (or Intention)                       System Responsibility
   1. User selects sampling set to                    2. Builds n-fold cross-validation set
      process and the number of folds                    based on parameters given by the
                                                         user, which include the n runs,
                                                         each run containing a training set
                                                         and a test set
                                                      3. Displays new cross-validation set
                                                         created in the GUI



Use Case Name:                Create Classifier (Suffix Tree)
Primary Actor:                User
Preconditions:                A cross-validation set or classification set exists
Postconditions:               Classifier created in memory
Main Success Scenarios:
Actor Action (or Intention)                       System Responsibility
   1. User activates an event to build                3. Builds classifier in memory, based
      classifier for a cross-validation set              on the corpus set selected
      or classification set                           4. Indicates in the GUI that the
   2. User chooses any additional                        classifier of the corpus has been
      conditions to apply                                created



Use Case Name:                Score Documents
Primary Actor:                User
Preconditions:                An n-fold cross-validation set is created. Classifier for the
                              corpus set is created
Postconditions:               Documents in the cross-validation set are scored and the
                              data is stored in the database
Main Success Scenarios:
Actor Action (or Intention)                       System Responsibility
   1. User selects the cross-validation               2. Scores all documents under the
      run to score                                       selected corpus set
                                                      3. Inserts score data into database



Use Case Name:                Classify Documents
Primary Actor:                User
Preconditions:                An n-fold cross-validation set is created. Classifier for the set
                              is created and the documents have been scored
Postconditions:               Misclassified documents in the cross-validation set are flagged
Main Success Scenarios:
Actor Action (or Intention)                       System Responsibility


   1. User selects the cross-validation               2. Classifies all documents under the
      run to classify                                    selected cross-validation set
                                                      3. Flags all misclassified documents
                                                         in the GUI



Use Case Name:                Create Classification Set
Primary Actor:                User
Preconditions:                A source corpus is successfully loaded
Postconditions:               A classification set is created virtually
Main Success Scenarios:
Actor Action (or Intention)                       System Responsibility
   1. User selects the corpus set they                2. Display new corpus created in the
      want to use to create a classifier                 GUI as a classification corpus set



Use Case Name:                Load New Document to Classify
Primary Actor:                User
Preconditions:                Cross-validation set or classification set exist
Postconditions:               Substring matches and related output data are stored in
                              the database
Main Success Scenarios:
Actor Action (or Intention)                       System Responsibility
   1. User decides which suffix tree to               2. Document name and relevant
      use for classification and loads in a              information is displayed in the GUI
      valid textual document as an item                  ready to be analysed
      to be classified and analysed                   3. Score and classify document
                                                      4. Stores output data in database



Use Case Name:                View a Document
Primary Actor:                User
Pre-conditions:               Document loaded into the system
Post-conditions:
Main Success Scenarios:
Actor Action (or Intention)                       System Responsibility
   1. Select the document to view                     2. Display content of document on
                                                         GUI



Use Case Name:                View n-Gram Matches in Document
Primary Actor:                User
Preconditions:                The document in question is successfully loaded and the
                              suffix tree classifier created
Postconditions:
Main Success Scenarios:
Actor Action (or Intention)                       System Responsibility
   1. User selects a string/substring in a            2. Queries the classifier to retrieve the
      document to match                                  n length substring matches
                                                      3. Displays to user the frequency for
                                                         the string/substring selected



Use Case Name:                View Statistics on Matches
Primary Actor:                User
Preconditions:                Document successfully loaded, scored and output exists in
                              database
Postconditions:               Displays information in GUI
Main Success Scenarios:
Actor Action (or Intention)                       System Responsibility
   1. User selects to view output                     2. System queries and retrieves
                                                         relevant data in the database
                                                      3. Displays the output in table form in
                                                         the GUI



Use Case Name:                Visualise Representation of Classifier (View Suffix Tree)
Primary Actor:                User
Preconditions:                Classifier was successfully built
Postconditions:               Classifier visual representation displayed on GUI
Main Success Scenarios:
Actor Action (or Intention)                       System Responsibility
   1. User selects option to display suffix           2. Builds visual representation of the
      tree                                               classifier and displays in GUI




Use Case Name:                Delete Classifier
Primary Actor:                User
Preconditions:                Classifier was successfully built
Postconditions:               Classifier is deleted
Main Success Scenarios:
Actor Action (or Intention)                       System Responsibility
   1. User selects classifier to delete               2. Removes classifier and clears the
                                                         displayed tree in the GUI



7.2 Non-Functional Requirements


The non-functional requirements for the use cases are as follows.


7.2.1    Usability
The user should have a single main user interface through which to interact with the
system. The user interface should be user-friendly, and the complexity of computation,
e.g. building an n-fold cross-validation set or scoring documents against a classification
model, should be hidden from the user.
An experimental run of the suffix tree classifier could involve as many as 126 scoring
configurations, which together could take considerable time to calculate. It therefore
makes sense to keep a store of all calculated scores, rather than calculating them
on the fly whenever they are requested. The results are cached in a data store,
implemented in this project as a database, which improves system responsiveness.
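The caching idea can be sketched as follows (a hypothetical illustration in Python; the project caches scores in a database rather than in memory, and the scoring function here is a trivial stand-in for the expensive suffix tree scoring):

```python
class ScoreStore:
    """Illustrative score cache: compute a score once per (document,
    configuration) pair and serve repeat requests from the store, mirroring
    the database-backed caching of the 126 scoring configurations."""
    def __init__(self, scorer):
        self.scorer = scorer       # the expensive scoring function
        self.cache = {}            # (doc_id, config) -> score
        self.computations = 0      # how many real computations were done

    def score(self, doc_id, config):
        key = (doc_id, config)
        if key not in self.cache:
            self.computations += 1
            self.cache[key] = self.scorer(doc_id, config)
        return self.cache[key]

# Stand-in scorer; the real system scores against the suffix tree classifier.
store = ScoreStore(lambda doc, cfg: len(doc) * 0.5)
store.score("doc1", "phi=sqrt")
store.score("doc1", "phi=sqrt")   # second request is a cache hit
print(store.computations)          # -> 1
```

The same lookup-before-compute pattern applies whether the store is an in-memory dictionary, as here, or a database table keyed by document and configuration.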
Some system requests can only be activated once a pre-condition has been satisfied
e.g. the user can only score documents when the suffix tree has been created. The
system should give informative warning messages if the user attempts to perform a task
without pre-conditions being satisfied. Where appropriate, upon a task being performed,
the system may automatically carry out pre-conditions before performing the requested
task.


7.2.2    Hardware and Software Constraint
The application should be easily extensible and scalable. Developers should be able to
add both extra functionality and expand the workload the application can handle with
relative ease.
The design should consider the future enhancement of the system and should be
reasonably easy to maintain and upgrade. Code should also be well documented.
The system should use an RDBMS to manage its data layer, but be independent of the
RDBMS it uses to manage its data.



7.2.3    Documentation
Help menus and tool tips will be available to help users interact with the system. The
application will also come with a user manual, including screen shots. The application
will be available along with written documentation for its installation and configuration.


7.3 System Framework
It was decided to build the system with a number of components. Each component has
a specialised function in the system. Figure 6 illustrates the main components and the
system boundary. The next section will describe the functions of each component in
more detail and section 7.5 contains the class diagram. By isolating system
responsibilities the following main components were identified.
        User interface
        Display Manager
        Classifier (Central Manager, STClassifier Manager, STClassifier)
        Sampling Set Generator
        Pre-processor
        Cross-validation
        Results Manager (Database Manager, OLEDB, Database)
Figure 7 shows how the system is divided into a client/server architecture. The
advantage of this set up is its ease of maintenance as the server implementation can be
an abstraction to the client. All the functionalities of the system are accessed through
the graphical user interface (GUI). The implementation is in the server, isolating users
from the system complexities not relevant to the user.
One of the main aims of the design of the system was to create a flexible framework.
The green boxes (labelled "Others...") seen in Figure 8 represent new or alternative
components that can be added to the system in the future with relative ease.




[Diagram: the Graphical User Interface, Display Manager, Central Manager, Results
Manager (Database Manager, OLEDB, Database), STClassifier Manager, STClassifier,
Sampling Set Generator (Random), Pre-processor (Stemmer), Cross-Validation, and
Utility components inside the system boundary, with Input Data entering through
the GUI]

                           Figure 6.          System Components and Boundary




[Diagram: the same components divided into client (the Graphical User Interface,
receiving the Input Data) and server (Display Manager, Central Manager, Results
Manager with Database Manager, OLEDB and Database, STClassifier Manager,
STClassifier, Sampling Set Generator, Random, Pre-processor, Stemmer,
Cross-Validation, and Utility)]

                                  Figure 7.                  Client Server Division




[Diagram: the component diagram of Figure 6 with green "Others..." boxes attached
to the Graphical User Interface, Results Manager, Database Manager, STClassifier
Manager, Sampling Set Generator (Random), and Stemmer, marking the points where
additional or alternative components can be plugged in]

                                     Figure 8.           Additional or Alternative Components



7.4 Components in Detail


7.4.1     The Client - User Interface




The user interacts with the system via a single graphical user interface which is also the
client. In this project the client is implemented as a set of Windows forms and controls in
.NET. There is one main form where users can access all the functionalities of the
system. There are a number of other dialog boxes and forms to help with the navigation
and interaction with the system. For example there is a Select Scoring Method form,
used to request from the user the scoring methodology to use when scoring a new
document. Other more generic forms such as the Select Dialog form are employed for a
number of uses and do not display specific types of information (see section 10
Implementation Specifics for further discussion).
The client is simply an event handler for each of the GUI controls; it calls the Central
Manager via the Display Manager for actual data processing. The GUI contains no
implementation but delegates to the Display Manager, thus decoupling the interface
from the implementation. There is two-way communication between the client and the
Display Manager: a user invokes an event and the related messages are passed to the
Display Manager, which in turn passes them to the Central Manager. The Central
Manager subsequently either delegates the task to other, more specialised controllers
or resolves the request itself.
The screens were designed in consultation with potential users. The user should be
able to perform all the tasks described by the use cases in the Functional
Requirements section (the functions will not be reiterated here).


For this project Windows forms were chosen for the implementation because most users
are familiar with the Windows form interface; this makes the system feel familiar on
first use and facilitates interaction with it. In particular, the .NET framework provides
a wealth of controls and functionality, which helps to build a user-friendly interface
and hides the complexity of the underlying workings from the user. Because the
different components are built as separate classes, the user interface (the client)
could be implemented using a different technology, such as a command line, as
Figure 9 illustrates.

[Diagram: the Graphical User Interface with its dialogs (Select Dialog, Select Scoring
Method), an alternative Command Line client, the Input Data, and the Display Manager
beneath]

                                  Figure 9.             Client Interface and Its Collaborating Components



7.4.2       Display Manager






The Display Manager is a layer between the User Interface on one side and the Central
Manager and the rest of the system on the other. It essentially passes messages
between these two components. The Display Manager is responsible for the information
displayed back to the user, and it also manages the input data.


[Diagram: the Display Manager sitting between the Graphical User Interface (with the
Input Data and pluggable "Others..." clients) above and the Central Manager below]
7.4.3       The Classifier
It was mentioned in the previous section that the Central Manager is part of the
classifier. Figure 10 illustrates the classifier (enclosed by the red box) and its
connecting components. The classifier comprises the Central Manager, a controller
that manages the underlying model of the classifier, and the underlying model itself.
The Central Manager is a controller that handles the communication between all the
main components in the system that communicate with the classifier. The Central
Manager should provide the following functionality:
       Select Sampling Set for a corpus
       Pre-process all documents in a corpus
       Run cross-validation on a corpus
       Create a classifier for a given corpus
       Score all documents in a corpus
       Classify all documents in a corpus
       Obtain classification results for a corpus
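The delegation pattern this list describes can be sketched as follows (an illustrative Python sketch; the actual system is C#/.NET, and the controller class names below are hypothetical stand-ins, not the report's classes):

```python
class CentralManager:
    """Sketch of the Central Manager's facade role: it owns the corpus
    location and delegates each task to a specialised controller, so the
    controllers never need to know where the data lives."""
    def __init__(self, sampler, preprocessor, cross_validator):
        self.sampler = sampler
        self.preprocessor = preprocessor
        self.cross_validator = cross_validator
        self.corpus_path = None

    def load_corpus(self, path):
        self.corpus_path = path

    def create_sampling_set(self, **params):
        # The generator is handed the location together with the request.
        return self.sampler.run(self.corpus_path, **params)

    def run_cross_validation(self, folds):
        return self.cross_validator.run(self.corpus_path, folds=folds)


class EchoController:
    """Stand-in controller that just reports what it was asked to do."""
    def run(self, path, **params):
        return (path, params)


mgr = CentralManager(EchoController(), EchoController(), EchoController())
mgr.load_corpus("/corpora/newsgroups")
print(mgr.create_sampling_set(method="random", size=100))
```

Each controller sees only the data and parameters it is handed, which is what allows sampling, pre-processing, and cross-validation to be developed and swapped independently.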


There are further controller classes called by the Central Manager to provide more
specialised functionality; these are the Output Manager, Suffix Tree Manager,
Sampling Set Generator, Pre-processor, and Cross-Validation.
When a user loads a corpus into the system, it is managed by the Central Manager. If
there is a request to create a sampling set, for example, the Central Manager knows
where the corpus is located and delegates to the Sampling Set Generator the task of
creating a sampling set based on parameters set by the user. Similarly, the Central
Manager delegates a user's request to pre-process the corpus to the Pre-processor.
The various components are designed to have specialised tasks; they do not need to
know where the data is located, as this information is passed to them when the
Central Manager invokes a request. The Sampling Set Generator does not need to
know how the Pre-processor carries out its task, nor does it need to know about the
Cross-Validation component. Each of the three components receives data and requests
from the Central Manager, performs its task, and returns any information back to the
Central Manager.
The classifier has to be connected to an internal model. In this project the suffix tree
data structure is employed to model the representation of document characteristics. As
seen in Figure 10, the classifier can be implemented with different types of models,
such as Naïve Bayes or neural networks. There is two-way communication between
the Central Manager and the STClassifier via the STClassifier Manager. The
STClassifier is a DLL library built by Birkbeck College researchers. It provides public
interfaces to:
        Build the representation of documents using the suffix tree data structure
        Train the classifier
        Score a document
        Return classification results


The STClassifier Manager controls the flow of messages between the Central Manager
and the STClassifier. Its responsibilities involve converting data to the format accepted
by the STClassifier, and converting output from the STClassifier before it is passed
back to the Central Manager. It is essentially a wrapper class for the
STClassifier.
The suffix tree is built using the contents of the documents in a training set. Once a
suffix tree is built it is cached in an ArrayList managed by the STClassifier Manager
(an ArrayList is a C# collection class implemented in .NET). The suffix tree remains
in memory until the user activates an event to delete it. As a result the system does
not need to rebuild the suffix tree for every subsequent action that references it;
only methods in the STClassifier Manager are called, and it is not necessary to call
the STClassifier's build methods again.
The classifier generates output data when a request is invoked to classify and score
documents. These two actions can be time-consuming activities. The Central
Manager decides what type of output data needs to be saved and passes the data from
the classifier to the Results Manager to handle. Figure 13 illustrates the design
of the Results Manager.


[Diagram: the classifier (red box) comprising the Central Manager, alternative model
managers (NBClassifier Manager, NNClassifier Manager, STClassifier Manager) and
their models (NBClassifier, NNClassifier, STClassifier), connected to the Graphical
User Interface / Command Line via the Display Manager, and to the Results Manager,
Sampling Set Generator, Pre-processor, and Cross-Validation]

                                   Figure 10.      The Classifier and Its Collaborating Components



7.4.4       Data Manipulation and Cleansing







A corpus loaded into the system serves as the input data. The user can create
sampling sets from the initial corpus and also prepare the data for experimentation by
performing various types of pre-processing on it. The input data is given to the
classifier, which sends it to the Sampling Set Generator to handle the generation of
sampling sets. Various sampling methodologies can be plugged into the Sampling Set
Generator; for this project the system implements random sampling and systematic
sampling. The Pre-processor provides the functionality for pre-processing data
passed to it. Similarly, various methods of pre-processing can be plugged into the
system with relative ease. Currently, the system provides stemming, stop word removal,
and punctuation removal.
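The two sampling methodologies can be sketched in a few lines (an illustrative Python sketch, not the project's C# implementation; function names are hypothetical):

```python
import random

def random_sample(docs, n, seed=0):
    """Random sampling: every document is equally likely to be chosen."""
    rng = random.Random(seed)   # fixed seed for repeatable experiments
    return rng.sample(docs, n)

def systematic_sample(docs, n):
    """Systematic sampling: take every k-th document, k = len(docs) // n."""
    k = len(docs) // n
    return [docs[i * k] for i in range(n)]

docs = [f"doc{i}" for i in range(10)]
print(systematic_sample(docs, 5))   # -> ['doc0', 'doc2', 'doc4', 'doc6', 'doc8']
print(len(random_sample(docs, 5)))  # -> 5
```

Both functions share the same signature (documents in, sample out), which is what makes it easy to plug alternative methodologies into the Sampling Set Generator.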
In order for a method to plug into the system, its class must implement an IMethod
interface, which guarantees the following:
                     A method class must have a Name property returning the name of
                     the method. This is necessary so that new methods added to the
                     system can be identified by name.
                     A method class must have a Run method, where all the work is
                     done.
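The IMethod contract might look like the following sketch (a Python stand-in for the C# interface; punctuation removal is one of the pre-processing methods the system provides, but this particular implementation is illustrative):

```python
from abc import ABC, abstractmethod
import string

class IMethod(ABC):
    """The plug-in contract: every method exposes a name (so new methods can
    be identified by name) and a run method where all the work is done."""
    @property
    @abstractmethod
    def name(self):
        ...

    @abstractmethod
    def run(self, text):
        ...

class PunctuationRemoval(IMethod):
    @property
    def name(self):
        return "Punctuation Removal"

    def run(self, text):
        # Strip every ASCII punctuation character from the text.
        return text.translate(str.maketrans("", "", string.punctuation))

method = PunctuationRemoval()
print(method.name, "->", method.run("Hello, world!"))   # -> Hello world
```

Because callers only depend on `name` and `run`, a new pre-processing or sampling method can be registered without changes to the rest of the system.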
A set of utility classes provides helper functionality such as a random number
generator, a common divisor routine, and file system helpers.


[Diagram: the Central Manager connected to the Sampling Set Generator (with
pluggable Systematic and Random methods) and the Pre-processor (with pluggable
Stop Word Removal, Punctuation Removal, Stemmer/Snowball, and others), both
supported by the Utility classes]

          Figure 11.      Data Manipulation and Cleansing Components and Their Collaborating Components



7.4.5       Experimentation






Setting up data for experimentation is the main responsibility of the Cross-Validation
class. The Central Manager passes a corpus to the Cross-Validation component, which
uses the data to build n-fold cross-validation sets. It divides the given corpus into n
blocks and builds a training set and a test set for each of the n runs. The data is
stored as an array that is passed back to the Central Manager.

The Cross-Validation class is expected to provide methods to:
          Set the number of N-folds
          Run N-fold cross-validation on a given source data
          Return the cross-validation sets in an array data structure
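The n-fold construction described above can be sketched as follows (illustrative Python; the real class is C# and returns the runs to the Central Manager as an array):

```python
def n_fold_sets(docs, n):
    """Divide the corpus into n blocks; for each of the n runs, one block is
    the test set and the remaining n-1 blocks form the training set."""
    blocks = [docs[i::n] for i in range(n)]   # n interleaved blocks
    runs = []
    for i in range(n):
        test = blocks[i]
        train = [d for j, b in enumerate(blocks) if j != i for d in b]
        runs.append((train, test))
    return runs

runs = n_fold_sets(list(range(6)), 3)
print(runs[0])   # -> ([1, 4, 2, 5], [0, 3])
```

Every document appears in exactly one test set across the n runs, so each document is classified exactly once while still being trained on in the other n-1 runs.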



[Diagram: the Central Manager communicating with the Cross-Validation component]

                      Figure 12.       Cross-Validation and Its Collaborating Components



7.4.6        Results Manager






The Results Manager handles the output of the classifier and the repository of that
output. The underlying RDBMS for this project is an Access database, which is used to
cache the data generated by the classifier. The OLEDB component is responsible for
direct communication with the database; this class needs to provide the basic
database functionalities, such as read, write, and delete, in a generic fashion. All
communication with the OLEDB library, and all data flow to and from the Results
Manager, occurs through the Database Manager object, which manages the OLEDB
component. The green boxes illustrate that the data store does not necessarily have
to be an Access database: the system is designed so that the data can be stored by
different means with relative ease, e.g. XML files, SQL Server, etc.




[Diagram: the Results Manager and Central Manager above the Database Manager and
OLEDB (backed by the Database), with an alternative XML File Manager and XML
component (backed by XML files) shown as green boxes]

                            Figure 13.   Results Manager and Its Collaborating Components



7.4.7       Error Handling
Adequate error handling is essential for an end-user application. Warnings and errors
should be handled at the higher level of the system, namely by the Display Manager,
and then displayed to the user in a reasonable fashion. Errors that occur in the other
classes should be propagated to the Display Manager. All classes apart from the User
Interface and the Display Manager are expected to implement an IErrorRecord interface.
A class that implements this interface guarantees that it has an error property which
returns the error message.
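The IErrorRecord pattern can be sketched in Python as follows. The C# interface in the project exposes an `ErrorMessage` property; everything else in this sketch (class names, the `run`/`report` methods) is illustrative, not the project's actual code.

```python
class IErrorRecord:
    """Contract: a component records its last error and exposes it."""
    @property
    def error_message(self):
        raise NotImplementedError

class Preprocessor(IErrorRecord):
    def __init__(self):
        self._error = ""
    @property
    def error_message(self):
        return self._error
    def run(self, content):
        try:
            return content.lower()      # stand-in for real preprocessing
        except Exception as exc:
            self._error = str(exc)      # record the error rather than raise,
            return None                 # so the Display Manager can report it

class DisplayManager:
    """Top-level component: checks collaborators' errors and reports them."""
    def report(self, component):
        if component.error_message:
            print("Error:", component.error_message)
```

Lower-level classes never talk to the user directly; they record failures, and the Display Manager decides how to surface them.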




7.5 Class Diagram
Figure 14 shows a class diagram of the main components of the system discussed above.


[Figure 14 is a UML class diagram of the main components: MainForm, Controllers::DisplayManager, Controllers::TreeViewNodeManager, Controllers::SampleSetGenerator, Controllers::CrossValidation, Controllers::Preprocessor, Classifier::CentralManager, Classifier::SuffixTreeManager, EMSTreeClassifier, DataMining::StopWord, Output::DatabaseManager and Output::OLEDB, together with their attributes, methods and associations.]

                                     Figure 14.                        Class Diagram



8 DATABASE

8.1 Entities
All the data in the system is stored in an Access database. The following describes the
organisation of the data that the system will store.


8.1.1   Score Table




When a user requests to score a new document or a set of documents, each document is
scored against 126 configurations for each class. The data is cached in the Scores table.


8.1.2   Source Table




The source table stores the location properties of documents. This includes the physical
pathname of the document and where it is logically located in the display tree.


8.1.3   Configuration Table




This configuration table stores the 126 combinations of scoring methods used in
Pampapathi et al.'s study. Each configuration consists of a scoring function, a match
normalisation function and a tree normalisation function.
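The configuration table is in effect the Cartesian product of the three function families. The following Python sketch shows how such a set of configurations can be enumerated; the identifiers and the number of functions per family are hypothetical placeholders (the real study has its own function sets whose combinations yield the 126 configurations).

```python
from itertools import product

# Hypothetical identifiers for each function family.
score_functions      = ["SF1", "SF2", "SF3"]
match_normalisations = ["MN1", "MN2", "MN3"]
tree_normalisations  = ["TN1", "TN2"]

# One row per (SF, MN, TN) triple, keyed by a running ConfigId,
# mirroring the Config table's SF/MN/TN foreign keys.
config_table = [
    {"ConfigId": i, "SF": sf, "MN": mn, "TN": tn}
    for i, (sf, mn, tn) in enumerate(
        product(score_functions, match_normalisations, tree_normalisations),
        start=1)
]
```

With these placeholder sizes the product has 3 × 3 × 2 = 18 rows; the same enumeration with the study's actual function sets produces the 126 configurations.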


8.1.4   Score Functions Table




This table contains the names and descriptions of the score functions.


8.1.5   Match Normalisation Functions Table




This table contains the names and descriptions of the match normalisation functions.


8.1.6   Tree Normalisation Functions Table




This table contains the names and descriptions of the tree normalisation functions.


8.1.7   Classification Condition Table




This table stores any classification conditions to be considered when classifying a
document from a particular corpus.


8.1.8   Class Weights Table




This table stores the class weights when classifying documents.


8.1.9   Temporary Max and Min Score Table




This is a temporary table used to cache the maximum and minimum scores for a class,
grouped by document and configuration.
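The grouping this table caches can be sketched as follows. The column names follow the Scores and tempMaxMinWScores tables; the function name and the Python representation are illustrative (the project computes this with Access queries).

```python
from collections import defaultdict

def max_min_scores(score_rows):
    """Cache the max and min weighted score per (SourceId, ConfigId),
    as the tempMaxMinWScores table does."""
    grouped = defaultdict(list)
    for row in score_rows:
        grouped[(row["SourceId"], row["ConfigId"])].append(row["Score"])
    return {
        key: {"MaxOfWScore": max(vals), "MinOfWScore": min(vals)}
        for key, vals in grouped.items()
    }
```

Caching these extrema once per (document, configuration) pair avoids recomputing them every time a view such as Misclassified Documents is queried.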


8.2 Views
The following are some of the main views to assist in querying the main tables for data
displayed in the user interface.


8.2.1   Weighted Scores




This view obtains the weighted scores by documents and scoring configuration.


8.2.2   Maximum and Minimum Scores




This view obtains the maximum and minimum score by document and scoring
configuration.


8.2.3   Misclassified Documents




This view obtains the misclassified documents and related data.


8.3 Relation Design for the Main Tables
The main table of the database is the Scores table. This table contains the scores for
each document, scored by the different configuration combinations (see the
Implementation section for a description of the scoring configurations). Figure 15 shows
the relationships between the main tables.


[Entity-relationship diagram of the main tables. tTreeNormalisation, tMatchNormalisation and tScoreFunction (each with PK Index and a Name field) are related 1..1 to Config (PK ConfigId; FKs SF, MN and TN; SF Name, MN Name, TN Name). tempMaxMinWScores (FKs SourceId and ConfigId; True Class, MaxOfWScore, MinOfWScore) and Scores (PK ScoreId; FKs SourceId and ConfigId; Score Class, True Class, Score) each reference Source (PK SourceId; Node Parent Path, Node Path, File Path) and Config on the many side of *..1 relations.]
                                              Figure 15.         Table Relations
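The key relationships of Figure 15 can also be written out as Access-style SQL DDL. The following is an illustrative sketch only: the diagram records table names, columns and keys, so the column types, text sizes and constraint spellings below are assumptions, not the project's actual schema definition.

```sql
-- Illustrative Access (Jet) DDL for the relations in Figure 15.
-- Types and sizes are assumed; columns and keys follow the diagram.
CREATE TABLE tTreeNormalisation  ([Index] COUNTER PRIMARY KEY, [Name] TEXT(50));
CREATE TABLE tMatchNormalisation ([Index] COUNTER PRIMARY KEY, [Name] TEXT(50));
CREATE TABLE tScoreFunction      ([Index] COUNTER PRIMARY KEY, [Name] TEXT(50));

CREATE TABLE Source (
    SourceId           COUNTER PRIMARY KEY,
    [Node Parent Path] TEXT(255),
    [Node Path]        TEXT(255),
    [File Path]        TEXT(255)
);

-- Config holds one row per experiment configuration, pointing 1..1
-- at a score function, match normalisation and tree normalisation.
CREATE TABLE Config (
    ConfigId  COUNTER PRIMARY KEY,
    SF        LONG REFERENCES tScoreFunction ([Index]),
    MN        LONG REFERENCES tMatchNormalisation ([Index]),
    TN        LONG REFERENCES tTreeNormalisation ([Index]),
    [SF Name] TEXT(50),
    [MN Name] TEXT(50),
    [TN Name] TEXT(50)
);

-- tempMaxMinWScores and Scores sit on the many side of the *..1
-- relations to Source and Config.
CREATE TABLE tempMaxMinWScores (
    SourceId     LONG REFERENCES Source (SourceId),
    ConfigId     LONG REFERENCES Config (ConfigId),
    [True Class] TEXT(50),
    MaxOfWScore  DOUBLE,
    MinOfWScore  DOUBLE
);

CREATE TABLE Scores (
    ScoreId       COUNTER PRIMARY KEY,
    SourceId      LONG REFERENCES Source (SourceId),
    ConfigId      LONG REFERENCES Config (ConfigId),
    [Score Class] TEXT(50),
    [True Class]  TEXT(50),
    Score         DOUBLE
);
```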




                                                       36 of 93
9 IMPLEMENTATION

Due to the large size of the program, this report does not cover every implementation
detail; instead, the discussion focuses on the main classes and highlights some
specific implementations. See Appendix B Class Definitions.


9.1 Main User Interface
The main form of the user interface is divided into four resizable panes, each of
which displays a different type of information to the user (see Figure 16):
       tvExplorer
       rtxtView/sTreeView
       lblSTreeDetail/listView
       rtxtInfo
The tvExplorer is a Windows Forms TreeView control which displays the different
corpora available in the system. The information is presented as a hierarchy of nodes,
much as files and folders are displayed in the left pane of Windows Explorer.
The rtxtView is implemented as a Windows Forms RichTextBox control. When the user
selects a child node in tvExplorer that represents a document, rtxtView displays the
content of that document. The rtxtView also allows users to perform dynamic n-gram
(sub-string) matching on a document (see section 10.3 Dynamic Sub-String Matching).
The sTreeView is implemented as a TreeView control. It shares the same pane as the
rtxtView control and is only made visible on the main form (with the rtxtView hidden)
when the user requests the display of a suffix tree that has been created. At the
same time the lblSTreeDetail control, implemented as a Windows Forms Label control,
displays a description of the suffix tree currently shown in the sTreeView control.
The listView is a Windows Forms ListView control which provides information related
to the current content of the rtxtView control.
The rtxtInfo is a RichTextBox control which displays a classification summary for a
document.




                                        37 of 93
[Screenshot of the main form showing the four panes: tvExplorer on the left, rtxtView/sTreeView on the right, lblSTreeDetail/listView above and rtxtInfo below.]
                                 Figure 16.   Main User Interface



The main form is implemented as a .NET class called MainForm. Figure 17 shows the
class members and class interface.
Note that there are other Windows Forms control classes which were implemented to
control the flow of user-system interaction. Section 10 Implementation Specifics
describes one of them in detail; see Appendix x for all the user interface classes.




                                         38 of 93

Management and Visualisation Tool for Text Mining

  • 1. A Management and Visualisation Tool for Text Mining Applications Student Peishan Mao MSc Computing Science Project Report School of Computing Science and Information System Birkbeck College, University of London 2005 Status Draft Last saved 26 Apr. 10 1 of 93
  • 2. 1 TABLE OF CONTENTS 1 TABLE OF CONTENTS 2 2 ACKNOWLEDGEMENT 5 3 ABSTRACT 6 4 INTRODUCTION 7 5 BACKGROUND 8 5.1 Written Text 8 5.2 Natural Language Text Classification 8 5.2.1 Text Classification 8 5.2.2 The Classifier 9 5.3 Text Classifier Experimentations 12 6 HIGH-LEVEL APPLICATION DESCRIPTION 14 6.1 Description and Rationale 14 6.1.1 Build a Classifier 14 6.1.2 Evaluate and Refine the Classifier 15 6.2 Development and Technologies 15 7 DESIGN 17 7.1 Functional Requirements 17 7.2 Non-Functional Requirements 22 7.2.1 Usability 22 7.2.2 Hardware and Software Constraint 22 7.2.3 Documentation 23 7.3 System Framework 23 7.4 Components in Detail 25 7.4.1 The Client - User Interface 25 7.4.2 Display Manager 26 7.4.3 The Classifier 26 7.4.4 Data Manipulation and Cleansing 28 7.4.5 Experimentation 29 7.4.6 Results Manager 30 7.4.7 Error Handling 31 2 of 93
  • 3. 7.5 Class Diagram 32 8 DATABASE 33 8.1 Entities 33 8.1.1 Score Table 33 8.1.2 Source Table 33 8.1.3 Configuration Table 33 8.1.4 Score Functions Table 33 8.1.5 Match Normalisation Functions Table 34 8.1.6 Tree Normalisation Functions Table 34 8.1.7 Classification Condition Table 34 8.1.8 Class Weights Table 34 8.1.9 Temporary Max and Min Score Table 34 8.2 Views 35 8.2.1 Weighted Scores 35 8.2.2 Maximum and Minimum Scores 35 8.2.3 Misclassified Documents 35 8.3 Relation Design for the Main Tables 35 9 IMPLEMENTATION 37 9.1 Main User Interface 37 9.2 Display Manager 39 9.3 Classifier Classes 40 9.4 Results Output Classes 41 9.5 Other Controller Classes 43 9.6 TreeView Controller Class 44 9.7 Error Interface 45 10 IMPLEMENTATION SPECIFICS 46 10.1 Generic Selection Form Class 46 10.2 Visualisation of the Suffix Tree 48 10.3 Dynamic Sub-String Matching 49 10.4 User Interaction Warnings 50 11 USER GUIDE 53 3 of 93
  • 4. 11.1 Getting Started 53 11.1.1 Input Data 53 11.2 Loading a Resource Corpus 54 11.3 Selecting a Sampling Set 57 11.4 Performing Pre-processing 61 11.5 Running N-Fold Cross-Validation 64 11.5.1 Set Up Cross-Validation Set 64 11.5.2 Perform experiments on the data 67 11.5.2.1 Create the Suffix Tree 67 11.5.2.2 Display Suffix Tree 69 11.5.2.3 Delete Suffix Tree 71 11.5.2.4 N-Gram Matching 71 11.5.2.5 Score Documents 73 11.5.2.6 Classify documents 74 11.5.2.7 Add New Document to Classify 76 11.6 Creating a Classifier 79 12 TESTING 81 13 CONCLUSION 83 13.1 Evaluation 83 13.2 Future Work 84 14 BIBLIOGRAPHY 86 15 APPENDIX A DATABASE 88 16 APPENDIX B CLASS DEFINITIONS 90 17 APPENDIX C SOURCE CODE 93 4 of 93
  • 5. 2 ACKNOWLEDGEMENT I would like to thank the following people for their help over the course of this project: Rajesh Pampapathi: for his spectrum of help on the project, ranging from his patient and advice on the whole area of text classification, and pointing me in the right direction for information on the topic to being interviewed as a potential user to the proposed system as part of the requirement collection. Timothy Yip: for laboriously proof reading the draft for the report despite not having much interest in information technology. 5 of 93
  • 6. 3 ABSTRACT This report describes the design and implementation of a management and visualisation tool for text classification applications. The system is built as a wrapper for machine learning classification tool. It aims to provide a flexible framework to accommodate for future changes to the system. The system is implemented in C# .Net with a Windows Forms front end and an Access Database as an example, but should be flexible enough to add different underlying components. 6 of 93
  • 7. 4 INTRODUCTION This report describes the project carried out to implement a management and visualisation tool for text classification. It covers background information about the project, the design, implementation and conclusion. The report is organised as follows: Section 4 this section. It describes the organisation of the report. Section 5 takes a look at the background of the project. This section covers discussion on natural language classification, and suffix tree data structure used in Pampapathi et al‟s study. Section 6 a high-level description and rationale of the system. Section 7 describes the design of the system. Lays out the system requirements, system framework, and describes system components and classes. Section 8 explains the database design and description of the database entities and table relations. Section 9 discusses how the system was implemented and goes into class definitions. Section 10 focuses on specific system implementations and looks at the implementation of the generic selection form class, visualisation of the suffix tree, dynamic sub-string matching on documents, and user warnings. Section 11 is the user guide to the system. Section 13 concludes the project. This section discusses whether the system built has met the requirements laid out at the beginning of the project. It also looks at future work. Appendix A Database Appendix B Class Definitions Error! Reference source not found. 7 of 93
  • 8. 5 BACKGROUND 5.1 Written Text Writing has long been an important means of exchanging information, ideas and concepts from one individual to another, or to a group. Indeed, it is even thought to be the single most advantageous evolutionary adaptation for species preservation [2]. The written text available contains a vast amount of information. The advent of the internet and on-line documents has contributed to the proliferation of digital textual data readily available for our perusal. Consequently, it is increasingly important to have a systematic method of organising this corpus of information. Tools for textual data mining are proving to be increasingly important to our growing mass of text based data. The discipline of computing science has provided significant contributions to this area by means of automating the data mining process. To encode unstructured text data into a more structured form is not a straightforward task. Natural language is rich and ambiguous. Working with free text is one of the most challenging areas in computer science. This project aims to investigate how computer science can help to evaluate some of the vast amounts of textual information available to us, and how to provide a convenient way to access this type of unstructured data. In particular, the focus will be on the data classification aspect of data mining. The next section will explore this topic in more depth. 5.2 Natural Language Text Classification 5.2.1 Text Classification F Sebastiani [3] described automated text categorisation as “The task of automatically sorting a set of documents into categories (or classes, or topics) from a predefined set. The task, that falls at the crossroads of information retrieval, machine learning, and (statistical) natural language processing, has witnessed a booming interest in the last ten years from researchers and developers alike.” Classification maps data into predefined groups or classes. 
Examples of classification applications include image and pattern recognition, medical diagnosis, loan approval, detecting faults in industry applications, and classifying financial trends. Until the late 80‟s, knowledge engineering was the dominant paradigm in automated text categorisation. Knowledge engineering consists of the manual definition of a set of rules which form part of a classifier by domain experts. Although this approach has produced results with accuracies as high as 90% [3], it is labour intensive and domain specific. The emergence of a new paradigm based on machine learning which answers many of the limitations with knowledge engineering has superseded its predecessor. Machine learning encompasses a variety of methods that represent the convergence of statistics, biological modelling, adaptive control theory, psychology, and artificial 8 of 93
  • 9. intelligence (AI) [11]. Data classification by machine learning is a two-phase process (Figure 1). The first phase involves a general inductive process to automatically build a model by using classification algorithm that describes a predetermined set of data classes which are non-overlapping. This step is referred to as supervised learning because the classes are determined before examining the data and the set of data is known as the training data set. Data in text classification comes in the form of files and each file is often described as documents. Classification algorithms require that the classes are defined based on purely the content of the documents. They describe these classes by looking at the characteristics of the documents in the training set already known to belong to the class. The learned model constitutes the classifier and can be used to categorise future corpus samples. In the second phase, the classifier constructed in the phase one is used for classification. Machine leaning approach to text classification is less labour intensive, and is domain independent. Since the attribution of documents to categories is based purely on the content of the documents effort is thus concentrated on constructing an automatic builder of classifiers (also known as the learner), and not the classifier itself [3]. The automatic builder is a tool that extracts the characteristics from the training set which is represented by a classification model. This means that once a learner is built, new classifiers can be automatically constructed from sets of manually classified documents. Training Classification Classification Set Algorithm Model a) Classification Model Test Set New Documents b) Figure 1. a) Step One in Text Classification b) Step two in text classification 5.2.2 The Classifier In general a text classifier comprises a number of basic components. As noted in the previous section, the text classifier begins with an inductive stage. 
A classifier requires some sort of text representation of documents. In order to build an internal model the inductive step involves a set of examples used for training the classifier. This set of examples is known as the training set and each document in the training set is assigned to a class C = {c1, c2, … cn}. All the documents used in the training phase are transformed into internal representations. Currently, a dominant learning method in text classification is based on a vector space model [5]. The Naïve Bayesian is one example and is often used as a benchmark in text 9 of 93
  • 10. classification experiments. Bayesian classifiers are statistical classifiers. Classification is based on the probability that a given document belongs to a particular class. The approach is „naïve‟ because it assumes that the contribution by all attributes on a given class is independent and each contributed equally to the classification problem. By analysing the contribution of each „independent‟ attribute, a conditional probability is determined. Attributes in this approach are the words that appear in the documents of the training set. Documents are represented by a vector with dimensions equal to the number of different words within the documents of the training set. The value of each individual entry within the vector is set at the frequency of the corresponding word. According to this approach, training data are used to estimate parameters of a probability distribution, and Bayes theorem is used to estimate the probability of a class. A new document is assigned to the class that yields the highest probability. It is important to perform pre-processing to remove frequent words such as stop words before a training set is used in the inductive phase. The Naïve Bayesian approach has several advantages. Firstly, it is easy to use; secondly only one scan of the training data is required. It can also easily handle missing values by simply omitting that probability when calculating the likelihoods of membership in each class. Although the Naïve Bayesian-based classifier is popular, documents are represented as a „bag-of-words‟ where words in the document have no relationships with each other. However words that appear in a document are usually not independent. Furthermore, the smallest unit of representation is a word. Research is continuously investigating how designs of text classifiers can be further improved and Pampapathi et al [1] at Birkbeck College, London recently proposed a new innovative approach to the internal modelling of text classifiers. 
They used a well-known data structure called a suffix tree [11], which allows the characteristics of documents to be indexed at a more granular level, with documents represented by their substrings. The suffix tree is a compact trie containing all the suffixes of the strings it represents. A trie is a tree structure in which each node represents one character and the root represents the null string. Each path from the root represents a string, described by the characters labelling the nodes traversed. All strings sharing a common prefix branch off from a common node. When the strings are words over the letters a to z, a node has at most 26 children, one for each letter (or 27, including a terminator). Suffix trees have traditionally been used for complex string matching problems on string sequences (data compression, DNA sequencing). Pampapathi et al's research is the first to apply suffix trees to natural language text classification. Their method of constructing the suffix tree varies slightly from the standard one. Firstly, the nodes are labelled instead of the edges, in order to associate frequencies directly with characters and substrings. Secondly, no special terminal character is used, as the focus is on the substrings rather than the suffixes. Each suffix tree has a depth, defined as the maximum number of levels in the tree, where a level is the number of nodes away from the root node. For example, the suffix tree illustrated in Figure 2 has a depth of 4. Pampapathi et al set a limit on the tree depth, and each node of the suffix tree stores a character and its frequency. For example, constructing a suffix tree for the string S1 = "COOL" produces the tree in Figure 2. The substrings are COOL, OOL, OL, and L.
Figure 2. Suffix Tree for String "COOL"

If a second string S2 = "FOOL" is inserted into the suffix tree, it will look like the diagram illustrated in Figure 3. The substrings for S2 are FOOL, OOL, OL, and L. Notice that the last three substrings of S2 duplicate substrings already seen in S1, so no new nodes are created for these repeated substrings.

Figure 3. Suffix Tree with String "FOOL" Added

Like the Naïve Bayesian method, a classifier using a suffix tree as its internal model undergoes supervised learning from a training set of documents that have been pre-classified into classes. Unlike the Naïve Bayesian approach, the suffix tree captures the characteristics of documents at the character level and therefore does not require pre-processing of the training set. A suffix tree is built for each class, and a new document is classified by scoring it against each of the trees; the class of the highest-scoring tree is assigned to the document. Pampapathi et al's study was based on email
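The node-labelled, terminator-free, depth-limited construction described above can be sketched as follows. This is an illustrative sketch, not Pampapathi et al's code: here each node's frequency counts how many times its character was visited while inserting substrings, which may differ in detail from the counts shown in the figures.

```python
def insert_string(root, s, depth_limit=4):
    """Insert every substring of s (up to depth_limit characters) into the
    tree, incrementing node frequencies.  Nodes are dicts keyed by character,
    matching the node-labelled variant with no terminal character."""
    for i in range(len(s)):                  # each suffix s[i:]
        node = root
        for ch in s[i:i + depth_limit]:      # walk/extend at most depth_limit
            if ch not in node:
                node[ch] = {"freq": 0, "children": {}}
            node[ch]["freq"] += 1
            node = node[ch]["children"]

root = {}
insert_string(root, "COOL")   # substrings COOL, OOL, OL, L
insert_string(root, "FOOL")   # FOOL is new; OOL, OL, L reuse existing nodes
```

After both insertions the root has children C, F, O and L, and shared paths such as O-O-L carry a frequency of 2 rather than new nodes.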
classification, and the results of the experiment showed that a classifier employing a suffix tree outperformed the Naïve Bayesian method. To solve a classification problem, the classifier itself is one of the central components, but, as seen with the Naïve Bayesian method, it is also important to pre-process the data used for training. The next section looks at the other processes involved in text classification beyond the classifier component itself.

5.3 Text Classifier Experimentations

As described in the previous sections, classification is a two-step process:
1. Create a specific model by evaluating the training data. This step takes the training data (including the category/class labels) as input and produces a definition of the model as output. The model created, which is the classifier, classifies the training data as accurately as possible.
2. Apply the model by classifying new sets of documents.
In the research community, or for anyone interested in evaluating the performance of a classifier, the second step can be more involved. First, the predictive accuracy of the classifier is estimated. A simple yet popular technique is the holdout method, which uses a test set of class-labelled samples. These samples are usually selected at random, and it is important that they are independent of the training samples; otherwise the estimate could be optimistic, since the learned model is based on that data and therefore tends to overfit. The accuracy of a classifier on a given test set is the percentage of test set samples that it classifies correctly: for each test sample, the known class label is compared with the classifier's predicted class for that sample. If the accuracy of the model is considered acceptable, the model can be used to classify new documents.

Figure 4.
Estimating Classifier Accuracy with the Holdout Method

The estimate given by the holdout method is pessimistic, since only a portion of the initial data is used to derive the classifier. Another technique, called N-fold cross-validation, is often used in research. Cross-validation is a statistical technique that can mitigate the bias caused by any particular partition into training and test sets; it is also useful when the amount of data is limited. The method can be used to evaluate and estimate the performance of a classifier, with the aim of obtaining as honest an estimate as possible of the classification accuracy of the system. N-fold cross-validation involves
partitioning the dataset (the initial corpus) randomly into N equally sized, non-overlapping blocks (folds). The training-testing process is then run N times, each time with a different test set. For example, when N = 3 we have the following training and test sets:

Run 1: train on blocks 1, 2; test on block 3
Run 2: train on blocks 1, 3; test on block 2
Run 3: train on blocks 2, 3; test on block 1

Figure 5. 3-Fold Cross-Validation

For each cross-validation run the user is able to use a training set to build the classifier. Stratified N-fold cross-validation is a recommended method for estimating classifier accuracy because of its low bias and variance [13]. In stratified cross-validation, the folds are stratified so that the class distribution of the samples in each fold is approximately the same as that of the initial training set. Preparing the training set data for classification using pre-processing can help improve the accuracy, efficiency, and scalability of the evaluation; methods include stop word removal, punctuation removal, and stemming. Using these techniques to prepare the data and estimate classifier accuracy increases the overall computation time, but is valuable for evaluating a classifier and for selecting among several classifiers. The current project aims to build a system that acts as a wrapper around a text classifier, incorporating the suffix tree classifier used in the research by Pampapathi et al as an example. The next section and beyond describe the project in detail.
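The stratified partitioning described above can be sketched as follows; this is an illustrative outline, not the project's component, and the round-robin dealing is just one simple way to keep class proportions equal across folds.

```python
import random
from collections import defaultdict

def stratified_folds(labelled_docs, n, seed=0):
    """Split (label, doc) pairs into n folds, preserving each class's
    proportion in every fold.  Returns a list of n folds (lists of pairs)."""
    by_class = defaultdict(list)
    for label, doc in labelled_docs:
        by_class[label].append((label, doc))
    rng = random.Random(seed)
    folds = [[] for _ in range(n)]
    for label, docs in by_class.items():
        rng.shuffle(docs)                 # randomise within each class
        for i, item in enumerate(docs):
            folds[i % n].append(item)     # deal documents round-robin
    return folds

def runs(folds):
    """Yield (training_set, test_set) for each of the N cross-validation runs."""
    for i in range(len(folds)):
        train = [d for j, f in enumerate(folds) if j != i for d in f]
        yield train, folds[i]
```

Each run trains on N-1 folds and tests on the remaining one, exactly as in the 3-fold example above.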
6 HIGH-LEVEL APPLICATION DESCRIPTION

6.1 Description and Rationale

The aim of this project is to build a management and visualisation tool that allows researchers to perform data manipulation in support of underlying text classification algorithms. The tool provides a software infrastructure for a data mining system based on machine learning. The goal is to build a flexible framework that allows the underlying components to be changed with relative ease. Functions may be added to the system in the future, and adding new functionality should have minimal effect on the current system. The system is built as a wrapper around the two-step classification process. First, a component automatically builds a classifier from given training data. Secondly, the tool provides capabilities to perform classification and to evaluate the performance of a classifier. Additionally, it provides functionality for data sampling and various kinds of pre-processing. It is important for the researcher to clearly define the training set (known in this report as the "resource corpus") used for training the classifier. When the resource corpus is small, the user can choose to use the entire corpus in the study; if it is large, the tool gives the option of selecting sampling sets to represent it. A number of sampling methodologies are implemented that allow the user to select a sample reflecting the characteristics of the resource corpus from which it is drawn. Note that a resource corpus is grouped into classes, and this structure had to be taken into consideration when the sampling mechanism was developed. Three popular sampling methods will be developed, although others can be added, such as convenience sampling, judgement sampling, quota sampling, and snowball sampling.
Note that the user can choose to evaluate the data used to construct the classifier before actually building it. The tool is designed to be generic enough to analyse a corpus of any categorisation type, e.g. automated indexing of scientific articles, email routing, spam filtering, criminal profiling, and expertise profiling.

6.1.1 Build a Classifier

The tool allows the user to build a classifier. The current framework implements only the suffix tree-based classifier developed at Birkbeck College, but is flexible enough to incorporate other classification models in the future. The research on suffix trees applied to classification is new, and no such application currently exists. The learning process of the classifier follows the machine learning approach to automated text classification, whereby the system automatically builds a classifier for the categories of interest. From the graphical user interface (GUI), the user can select a corpus to use as training data. The application links to .dll files developed at Birkbeck College which allow the user to build a suffix tree from the selected corpus. The internal data representation is constructed by generalising from a training set of pre-classified documents. Once the classifier is built, the user can load new documents into the system to be classified.
6.1.2 Evaluate and Refine the Classifier

In research, once a classifier has been built it is desirable to evaluate its effectiveness. Even before the construction of the classifier, the tool provides a platform for users to perform a number of experiments on, and refinements of, the source (training) data. Hence, the second focus of the project is to provide a user-friendly front end and a base application for testing classification algorithms. The user can load a text-based corpus and perform standard pre-processing functions to remove noise and prepare the data for experimentation. There is also a choice of sampling methods for reducing the size of the initial corpus to make it more manageable. Sebastiani [2] notes that any classifier is prone to classification error, whether the classifier is human or machine. This is because a notion central to text classification, the membership of a document in a class based on the characteristics of the document and the class, is inherently subjective, since the characteristics of both cannot be formally specified. As a result, automatic text classifiers are evaluated using a set of pre-classified documents: the classifier's decisions are compared with the categories the documents were originally assigned to. For experimentation and evaluation purposes, this set of pre-classified documents is split into two sets, a training set and a test set, not necessarily of equal size. The tool implements an extra level of experimentation using n-fold cross-validation. Because the data is grouped by classes, this must be taken into account when employing cross-validation, so this project implements stratified cross-validation. Once a classifier has been constructed, it is possible to perform data classification experiments as well as other tasks such as single document analysis.
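The evaluation step described above, comparing the classifier's decision with each document's original category, reduces to a simple accuracy calculation. A minimal sketch (the classifier here is any callable from document to label; the labels in the usage example are invented):

```python
def accuracy(classifier, test_set):
    """Fraction of pre-classified test documents whose predicted class
    matches their known label.  test_set: list of (label, document)."""
    correct = sum(1 for label, doc in test_set if classifier(doc) == label)
    return correct / len(test_set)
```

Run once per cross-validation fold and averaged, this gives the overall estimate of classifier accuracy.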
For example, with the suffix tree-based classifier the user is able to view the structure of the suffix tree and the documents in the test sets, or to load a new document and obtain a full matrix of output data about it. The output data is persisted in an information system which is subsequently used to perform analysis and visualisation tasks.

6.2 Development and Technologies

Development was done in C#, using the .NET framework. The architecture of the system was designed as an extensible platform, enabling users and developers to leverage the existing framework for future upgrades. The tool is built from several components and aims to be modular. A number of controller components provide the tool's functionality, and a set of libraries provides the functionality for the suffix tree. These suffix tree libraries were provided by Birkbeck College, with the interface developed in close collaboration with Birkbeck researchers. The suffix tree data structure is built in memory and can become very large. One way to make better use of resources is to store the data structure physically as one tree, while representing it logically as individual trees for each class. Further discussion can be found in subsequent sections.
A Windows application was built as the client. This forms the interface through which the user gains access to the functionality of the tool. The output data is cached in a database. The main targeted users of the tool are researchers in the natural language text classification community, and other users who want to mine textual data.
7 DESIGN

7.1 Functional Requirements

Requirements for the application were collected from research on natural language text classification and from discussions with targeted users in the research community. Requirements are the capabilities and conditions to which the application must conform. The functional requirements of the system are captured using use cases. Use cases are a useful tool for describing how a user interacts with a system: they are written stories, easy to understand, that describe the interaction between the system and the user. Requirements often change over the course of development, and for this reason no attempt was made to define and freeze all requirements at the outset of the project. The following use cases were produced; note that some were added during the development of the system.

Use Case Name: Load Directory as Source Corpus
Primary Actor: User
Pre-conditions: The application is running
Post-conditions: A source corpus is loaded into the application
Main Success Scenarios:
Actor Action (or Intention): 1. The user selects a valid directory and has at least read access to the directory
System Responsibility: 2. The system checks directory path validity and access, and loads the directory into the system as a corpus. 3. Builds a tree structure of classes based on the sub-folders in the directory and displays the classes in the GUI

Use Case Name: View a Document in Corpus
Primary Actor: User
Pre-conditions: A corpus is successfully loaded
Post-conditions: None
Main Success Scenarios:
Actor Action (or Intention): 1. Select the document to view
System Responsibility: 2. Display the content of the document in the GUI

Use Case Name: Create Sampling Set
Primary Actor: User
Pre-conditions: A source corpus is successfully loaded
Post-conditions: A sampling set based on the source corpus is created; a new file directory is created for the corpus
Main Success Scenarios:
Actor Action (or Intention): 1. User selects how to select the sampling set. 2. User specifies a location to store the documents/files created for the sampling set
System Responsibility: 3. Creates a sampling set based on the parameters given by the user. 4. Creates the directory structure and documents/files in the location specified by the user. 5. Displays the new corpus in the GUI

Use Case Name: Run Pre-Processing
Primary Actor: User
Pre-conditions: A training set exists in the system
Post-conditions: A new pre-processed sampling set is created; a new file directory is created for the corpus
Main Success Scenarios:
Actor Action (or Intention): 1. Select the type of pre-processing to perform. 2. User specifies a location to store the documents/files created for the pre-processed set. 3. Run pre-processing
System Responsibility: 4. Performs pre-processing. 5. Creates a new pre-processed set. 6. Stores the directory structure and documents/files at the location specified by the user. 7. Displays the corpus as a directory structure in the GUI

Use Case Name: Run N-Fold Cross-Validation
Primary Actor: User
Pre-conditions: A sampling set is successfully created
Post-conditions: An N-fold cross-validation set is created virtually
Main Success Scenarios:
Actor Action (or Intention): 1. User selects the sampling set to process and the number of folds
System Responsibility: 2. Builds an n-fold cross-validation set based on the parameters given by the user, which includes the n runs,
each run containing a training set and a test set. 3. Displays the new cross-validation set in the GUI

Use Case Name: Create Classifier (Suffix Tree)
Primary Actor: User
Pre-conditions: A cross-validation set or classification set exists
Post-conditions: Classifier created in memory
Main Success Scenarios:
Actor Action (or Intention): 1. User activates an event to build a classifier for a cross-validation set or classification set. 2. User chooses any additional conditions to apply
System Responsibility: 3. Builds the classifier in memory, based on the corpus set selected. 4. Indicates in the GUI that the classifier for the corpus has been created

Use Case Name: Score Documents
Primary Actor: User
Pre-conditions: An n-fold cross-validation set is created; a classifier for the corpus set is created
Post-conditions: Documents in the cross-validation set are scored and the data is stored in the database
Main Success Scenarios:
Actor Action (or Intention): 1. User selects the cross-validation run to score
System Responsibility: 2. Scores all documents under the selected corpus set. 3. Inserts the score data into the database

Use Case Name: Classify Documents
Primary Actor: User
Pre-conditions: An n-fold cross-validation set is created; the classifier for the set is created and the documents have been scored
Post-conditions: Misclassified documents in the cross-validation set are flagged
Main Success Scenarios:
Actor Action (or Intention): 1. User selects the cross-validation run to classify
System Responsibility: 2. Classifies all documents under the selected cross-validation set. 3. Flags all misclassified documents in the GUI

Use Case Name: Create Classification Set
Primary Actor: User
Pre-conditions: A source corpus is successfully loaded
Post-conditions: A classification set is created virtually
Main Success Scenarios:
Actor Action (or Intention): 1. User selects the corpus set they want to use to create a classifier
System Responsibility: 2. Displays the new corpus in the GUI as a classification corpus set

Use Case Name: Load New Document to Classify
Primary Actor: User
Pre-conditions: A cross-validation set or classification set exists
Post-conditions: Substring matches and related output data are stored in the database
Main Success Scenarios:
Actor Action (or Intention): 1. User decides which suffix tree to use for classification and loads in a valid textual document to be classified and analysed
System Responsibility: 2. The document name and relevant information are displayed in the GUI as an item ready to be analysed. 3. Scores and classifies the document. 4. Stores the output data in the database

Use Case Name: View a Document
Primary Actor: User
Pre-conditions: Document loaded into the system
Post-conditions: None
Main Success Scenarios:
Actor Action (or Intention): 1. Select the document to view
System Responsibility: 2. Display the content of the document in the GUI
Use Case Name: View n-Gram Matches in a Document
Primary Actor: User
Pre-conditions: The document concerned is successfully loaded and the suffix tree classifier is created
Post-conditions: None
Main Success Scenarios:
Actor Action (or Intention): 1. User selects a string/substring in a document to match
System Responsibility: 2. Queries the classifier to retrieve the n-length substring matches. 3. Displays to the user the frequency of the selected string/substring

Use Case Name: View Statistics on Matches
Primary Actor: User
Pre-conditions: Document successfully loaded and scored, and output exists in the database
Post-conditions: Information is displayed in the GUI
Main Success Scenarios:
Actor Action (or Intention): 1. User selects to view output
System Responsibility: 2. System queries and retrieves the relevant data from the database. 3. Displays the output in table form in the GUI

Use Case Name: Visualise Representation of Classifier (View Suffix Tree)
Primary Actor: User
Pre-conditions: Classifier was successfully built
Post-conditions: A visual representation of the classifier is displayed in the GUI
Main Success Scenarios:
Actor Action (or Intention): 1. User selects the option to display the suffix tree
System Responsibility: 2. Builds a visual representation of the classifier and displays it in the GUI
Use Case Name: Delete Classifier
Primary Actor: User
Pre-conditions: Classifier was successfully built
Post-conditions: Classifier is deleted
Main Success Scenarios:
Actor Action (or Intention): 1. User selects the classifier to delete
System Responsibility: 2. Removes the classifier and clears the displayed tree in the GUI

7.2 Non-Functional Requirements

The non-functional requirements for the use cases are as follows.

7.2.1 Usability

The user should have a single main user interface for interacting with the system. The user interface should be user friendly, and the complexity of computation (e.g. building an n-fold cross-validation set, or scoring documents against a classification model) should be hidden from the user. An experimental run of the suffix tree classifier could involve as many as 126 scoring configurations, which together could take considerable time to calculate. It therefore makes sense to keep a store of all calculated scores rather than calculate them on the fly whenever they are requested. The results are cached in a data store, implemented in this project as a database, thereby optimising system responsiveness. Some system requests can only be activated once a pre-condition has been satisfied, e.g. the user can only score documents once the suffix tree has been created. The system should give informative warning messages if the user attempts to perform a task whose pre-conditions are not satisfied. Where appropriate, the system may automatically carry out the pre-conditions of a requested task before performing it.

7.2.2 Hardware and Software Constraints

The application should be easily extensible and scalable. Developers should be able both to add extra functionality and to expand the workload the application can handle with relative ease. The design should allow for future enhancement of the system, which should be reasonably easy to maintain and upgrade. Code should also be well documented.
The system should use an RDBMS to manage its data layer, but should not be dependent on any particular RDBMS.
7.2.3 Documentation

Help menus and tool tips will be available to help users interact with the system. The application will also come with a user manual, including screenshots, and with written documentation for its installation and configuration.

7.3 System Framework

It was decided to build the system from a number of components, each with a specialised function in the system. Figure 6 illustrates the main components and the system boundary. The next section describes the function of each component in more detail, and section 7.5 contains the class diagram. By isolating system responsibilities, the following main components were identified:

User Interface
Display Manager
Classifier (Central Manager, STClassifier Manager, STClassifier)
Sampling Set Generator
Pre-processor
Cross-validation
Results Manager (Database Manager, OLEDB, Database)

Figure 7 shows how the system is divided into a client/server architecture. The advantage of this set-up is its ease of maintenance, as the server implementation can be an abstraction to the client. All the functionality of the system is accessed through the graphical user interface (GUI); the implementation resides in the server, isolating users from system complexities not relevant to them. One of the main aims of the design was to create a flexible framework. The green boxes seen in Figure 8 represent new or alternative components that can be added to the system in the future with relative ease.
Figure 6. System Components and Boundary

Figure 7. Client Server Division
Figure 8. Additional or Alternative Components

7.4 Components in Detail

7.4.1 The Client - User Interface

The user interacts with the system via a single graphical user interface, which is also the client. In this project the client is implemented as a set of Windows forms and controls in .NET. There is one main form from which users can access all the functionality of the system, plus a number of other dialog boxes and forms to help with navigation and interaction. For example, the Select Scoring Method form requests from the user the scoring methodology to use when scoring a new document. Other, more generic forms such as the Select Dialog form are employed for a number of uses and do not display specific types of information (see section 10, Implementation Specifics, for further discussion). The client is simply an event handler for each of the GUI controls, which calls the Central Manager via the Display Manager for the actual data processing. The GUI contains no implementation but delegates to the Display Manager, thus decoupling the interface from the implementation. There is two-way communication between the client and the Display Manager: a user invokes an event, and the related messages are passed to the Display Manager, which forwards them to the Central Manager. The Central Manager subsequently either delegates the task to other, more specialised controllers, or resolves the request itself. The design of the screens was done in consultation with potential users.
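The delegation chain just described, where GUI event handlers contain no logic and simply forward messages, can be sketched as follows. This is an illustrative outline in Python rather than the project's C# Windows forms code; the class and method names are stand-ins.

```python
class CentralManager:
    """Server-side controller stand-in: resolves requests itself or
    delegates them to specialised controllers (not shown here)."""
    def handle(self, message):
        return f"handled:{message}"

class DisplayManager:
    """Layer between the client and the Central Manager: passes
    messages through and manages what is displayed back."""
    def __init__(self, central):
        self._central = central
    def forward(self, message):
        return self._central.handle(message)

class Client:
    """GUI stand-in: each event handler only delegates, so the
    interface stays decoupled from the implementation."""
    def __init__(self, display):
        self._display = display
    def on_load_corpus_clicked(self, path):
        return self._display.forward(f"load_corpus:{path}")
```

Because the client holds only a Display Manager reference, it could be swapped for a command-line front end without touching the server side.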
The user should be able to perform all the tasks described by the use cases in the Functional Requirements section (the functions are not reiterated here).
For this project, Windows forms were chosen for the implementation because most users are familiar with the Windows forms interface; this creates a familiar experience on initial interaction with the system and facilitates its use. In particular, the .NET framework provides a wealth of controls and functionality which help to build a user-friendly interface and hide the complexity of the underlying workings from the user. The different components are built as separate classes, so the user interface (the client) could be implemented using a different technology from Windows forms, such as a command line, as illustrated in Figure 9.

Figure 9. Client Interface and Its Collaborating Components

7.4.2 Display Manager

The Display Manager is a layer between the User Interface on one side and the Central Manager and the rest of the system on the other. It essentially passes messages between these two components. The Display Manager is responsible for the information displayed back to the user, and it also manages the input data.

7.4.3 The Classifier

As mentioned in the previous section, the Central Manager is part of the classifier. Figure 10 illustrates the classifier (enclosed by the red box) and its connecting components. The classifier comprises the Central Manager, a controller
that manages the underlying model of the classifier, and the underlying model itself. The Central Manager is a controller that handles the communication between all the main components of the system that communicate with the classifier. The Central Manager should provide the following functionality:

Select a sampling set for a corpus
Pre-process all documents in a corpus
Run cross-validation on a corpus
Create a classifier for a given corpus
Score all documents in a corpus
Classify all documents in a corpus
Obtain classification results for a corpus

Further controller classes called by the Central Manager provide more specialised functionality: the Output Manager, Suffix Tree Manager, Sampling Set Generator, Pre-processor, and Cross-validation. When a user loads a corpus into the system, it is managed by the Central Manager. If there is a request to create a sampling set, for example, the Central Manager knows where the corpus is located and delegates to the Sampling Set Generator the task of creating a sampling set based on parameters set by the user. Similarly, a request to perform pre-processing on the corpus is delegated by the Central Manager to the Pre-processor. The various components are designed to have specialised tasks; they do not need to know where the data is located, as this information is passed to them when the Central Manager invokes a request. The Sampling Set Generator does not need to know how the Pre-processor carries out its task, nor does it need to know about the Cross-validation component. Each of the three components receives data and requests from the Central Manager, performs its task, and returns any information back to the Central Manager. The classifier has to be connected to an internal model; in this project the suffix tree data structure is employed to model the representation of document characteristics.
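The dispatch pattern described above, in which the Central Manager hands each request, together with the data it needs, to a specialised component, can be sketched as follows. This is an illustrative outline, not the project's code; the component callables and their parameters are invented stand-ins.

```python
class CentralManager:
    """Illustrative dispatcher: delegates each request to a specialised
    component.  The components never see each other or the corpus store;
    they receive exactly the data passed with the request."""
    def __init__(self, sampler, preprocessor, cross_validator):
        self._sampler = sampler
        self._preprocessor = preprocessor
        self._cross_validator = cross_validator
        self._corpora = {}                  # corpus name -> document list

    def load_corpus(self, name, docs):
        self._corpora[name] = docs          # the manager knows where data lives

    def create_sampling_set(self, name, **params):
        return self._sampler(self._corpora[name], **params)

    def preprocess(self, name):
        return [self._preprocessor(d) for d in self._corpora[name]]

    def cross_validate(self, name, n):
        return self._cross_validator(self._corpora[name], n)
```

Because each component is injected as a callable, an alternative sampler or pre-processor can be plugged in without changing the manager.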
As seen in Figure 10, the classifier can be implemented with different types of models, such as Naïve Bayesian or neural networks. There is two-way communication between the Central Manager and the STClassifier via the STClassifier Manager. The STClassifier is a DLL library built by Birkbeck researchers. It provides public interfaces to: build the representation of documents using the suffix tree data structure; train the classifier; score a document; and return classification results. The STClassifier Manager controls the flow of messages between the Central Manager and the STClassifier. Its responsibilities include converting data to the format accepted by the STClassifier, and converting the output from the STClassifier, which is
passed back, via the STClassifier Manager, to the Central Manager. The STClassifier Manager is essentially a wrapper class for the STClassifier. The suffix tree is built using the contents of the documents in a training set. Once a suffix tree is built, it is cached in an ArrayList managed by the STClassifier Manager (an ArrayList is a C# collection class implemented in .NET). The suffix tree remains stored in memory until the user activates an event to delete it, so the system does not need to re-create the suffix tree for every subsequent action that references it; only methods in the STClassifier Manager are called, and it is not necessary to call methods in the STClassifier. The classifier generates output data when a request is invoked to classify and score documents; these can be time-consuming activities. The Central Manager decides what type of output data needs to be saved and passes the data from the classifier to the Results Manager. Figure 13 describes the design of the Results Manager.

Figure 10. The Classifier and Its Collaborating Components

7.4.4 Data Manipulation and Cleansing
When a corpus is loaded into the system as input data, the user can create sampling sets from the initial corpus and also prepare the data for experimentation by performing various types of pre-processing on it. The input data is given to the classifier, which sends it to the Sampling Set Generator to handle the generation of sampling sets. Various sampling methodologies can be plugged into the Sampling Set Generator; for this project, the system implements random sampling and systematic sampling.

The Pre-processor provides the functionality for pre-processing data passed to it. Similarly, various methods of pre-processing can be plugged into the system with relative ease. Currently, the system provides stemming, stop word removal, and punctuation removal. In order for a method to plug into the system, its class must implement the IMethod interface, which guarantees the following:

- A method class must have a Name property that returns the name of the method. This is necessary so that new methods added to the system can be identified by name.
- A method class must have a Run method. This method is where all the work is done.

A set of utility classes provides helper functionality such as a random number generator, a common divisor, and file system access.

[Figure omitted: the Sampling Set Generator with pluggable Random, Systematic, and Snowball methods, and the Pre-processor with pluggable Stemmer, Stop Word Removal, Punctuation Removal, and other methods, both controlled by the Central Manager and supported by the Utility classes.]

Figure 11. Data Manipulation and Cleansing Components and Their Collaborating Components

7.4.5 Experimentation

Setting up data for experimentation is the main responsibility of the Cross-Validation class. The Central Manager passes a corpus to the Cross-Validation component, which uses the data to build N-fold cross-validation sets. It divides the given corpus into N blocks and builds a training set and a test set for each of the N runs. The data is stored as an array that is passed back to the Central Manager.
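The fold-building step described above can be sketched as follows. The report's implementation is in C#, so this Python rendering is purely illustrative; the function name and return shape are assumptions, not the actual Cross-Validation API.

```python
def build_cross_validation_sets(documents, n_folds):
    """Split a corpus into n_folds blocks and build a (training set,
    test set) pair for each of the N runs."""
    if not 2 <= n_folds <= len(documents):
        raise ValueError("n_folds must be between 2 and the corpus size")
    # Deal the documents into N blocks round-robin (the report only
    # says the corpus is divided into N blocks; the exact split is an
    # implementation choice).
    blocks = [documents[i::n_folds] for i in range(n_folds)]
    runs = []
    for i in range(n_folds):
        test_set = blocks[i]
        # The training set for run i is every block except block i.
        training_set = [d for j, b in enumerate(blocks) if j != i for d in b]
        runs.append((training_set, test_set))
    return runs

runs = build_cross_validation_sets(list("abcdefghij"), 5)
```

Each document appears in exactly one test set across the N runs, and each run's training and test sets together cover the whole corpus.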
The methods the Cross-Validation class is expected to provide are:

- set the number of folds, N;
- run N-fold cross-validation on a given source of data;
- return the cross-validation sets in an array data structure.

Figure 12. Cross-Validation and Its Collaborating Components

7.4.6 Results Manager

The Results Manager handles the output of the classifier and its repository. The underlying RDBMS for this project is an Access database, which is used to cache the data generated by the classifier. The OLEDB component is responsible for direct communication with the database; it needs to provide the basic database functionality, such as read, write, and delete, in a generic fashion. All communication with the OLEDB library, and the data flow to and from the Results Manager, occurs through the Database Manager object, which manages the OLEDB component. The green boxes in Figure 13 illustrate that the information store for the system does not necessarily have to be an Access database: the system is designed to be able to store the data by different means with relative ease, e.g. XML files, SQL Server, etc.
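The pluggable-storage design described above can be illustrated with a minimal sketch. The system itself is C# and its real classes are the Database Manager and the OLEDB wrapper; the interface and class names below (ResultsStore, InMemoryStore) are illustrative assumptions in Python, not the actual API.

```python
from abc import ABC, abstractmethod

class ResultsStore(ABC):
    """Hypothetical storage interface: any backend (Access via OLEDB,
    XML files, SQL Server, ...) can be swapped in behind it."""
    @abstractmethod
    def write_score(self, doc_id, config_id, score): ...
    @abstractmethod
    def read_scores(self, doc_id): ...

class InMemoryStore(ResultsStore):
    """Stand-in backend used here in place of a real database."""
    def __init__(self):
        self._rows = []
    def write_score(self, doc_id, config_id, score):
        self._rows.append((doc_id, config_id, score))
    def read_scores(self, doc_id):
        return [(c, s) for d, c, s in self._rows if d == doc_id]

class ResultsManager:
    """Only this class knows which backend is in use, so swapping
    Access for XML files or SQL Server touches one constructor call."""
    def __init__(self, store: ResultsStore):
        self._store = store
    def save(self, doc_id, config_id, score):
        self._store.write_score(doc_id, config_id, score)
    def scores_for(self, doc_id):
        return self._store.read_scores(doc_id)

mgr = ResultsManager(InMemoryStore())
mgr.save("doc1", 1, 0.9)
```

The point of the indirection is the same as in the report's design: callers talk to the Results Manager, never to the storage backend directly.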
[Figure omitted: the Results Manager, controlled by the Central Manager, delegating to a Database Manager (OLEDB, Access database) or an XML File Manager (XML files).]

Figure 13. Results Manager and Its Collaborating Components

7.4.7 Error Handling

Adequate error handling is essential for an end-user application. Warnings and errors should be handled at the higher levels of the system, namely by the Display Manager, and then displayed to the user in a reasonable fashion. Errors that occur in the other classes should be propagated to the Display Manager. All classes apart from the User Interface and the Display Manager are expected to implement an IErrorRecord interface. A class that implements this interface guarantees that it has a property called error which returns the error message.
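The IErrorRecord contract described above can be sketched as follows. The system is C#, so this Python rendering is only illustrative, and the concrete class bodies (what Preprocessor.run does, how DisplayManager reports) are assumptions for demonstration.

```python
class ErrorRecord:
    """Mixin mirroring the report's IErrorRecord contract: every
    non-UI class exposes an 'error' property with the last message."""
    def __init__(self):
        self._error = ""
    @property
    def error(self):
        return self._error

class Preprocessor(ErrorRecord):
    """Illustrative worker class: it records errors rather than
    displaying them, leaving presentation to the Display Manager."""
    def run(self, content):
        try:
            return content.lower()   # stand-in for real pre-processing
        except Exception as exc:     # record the error, do not display it
            self._error = str(exc)
            return None

class DisplayManager:
    """Errors propagate up to here, and only here are shown to the user."""
    def report(self, component: ErrorRecord):
        if component.error:
            print(f"Warning: {component.error}")
```

A worker that fails sets its error property and returns a null result; the Display Manager then inspects the property and decides how to present the message.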
7.5 Class Diagram

Figure 14 shows a class diagram of the main components of the system discussed above: MainForm, Controllers::DisplayManager, Controllers::TreeViewNodeManager, Controllers::SampleSetGenerator, Controllers::Preprocessor, Controllers::CrossValidation, Classifier::CentralManager, Classifier::SuffixTreeManager, Output::DatabaseManager, Output::OLEDB, DataMining::StopWord, and EMSTreeClassifier.

[Class diagram omitted: full member and method signatures for the classes listed above, including the associations between the Display Manager, Central Manager, its controller and output classes, and the EMSTreeClassifier it manages.]

Figure 14. Class Diagram
8 DATABASE

8.1 Entities

All the data in the system is stored in an Access database. The following describes the organisation of the data that the system stores.

8.1.1 Score Table

When a user requests to score a new document or a set of documents, each document is scored against 126 configurations for each class. The data is cached in the Score table.

8.1.2 Source Table

The Source table stores the location properties of documents. This includes the physical pathname of the document and where it is logically located in the display tree.

8.1.3 Configuration Table

The Configuration table stores the 126 combinations of scoring methods used in Pampapathi et al.'s study. Each configuration consists of a scoring function, a match normalisation function, and a tree normalisation function.

8.1.4 Score Functions Table
This table contains the names and descriptions of the score functions.

8.1.5 Match Normalisation Functions Table

This table contains the names and descriptions of the match normalisation functions.

8.1.6 Tree Normalisation Functions Table

This table contains the names and descriptions of the tree normalisation functions.

8.1.7 Classification Condition Table

This table stores any classification conditions to be considered when classifying a document from a particular corpus.

8.1.8 Class Weights Table

This table stores the class weights used when classifying documents.

8.1.9 Temporary Max and Min Score Table
This is a temporary table used to cache the maximum and minimum scores for a class, grouped by document and configuration.

8.2 Views

The following are some of the main views that assist in querying the main tables for the data displayed in the user interface.

8.2.1 Weighted Scores

This view obtains the weighted scores by document and scoring configuration.

8.2.2 Maximum and Minimum Scores

This view obtains the maximum and minimum scores by document and scoring configuration.

8.2.3 Misclassified Documents

This view obtains the misclassified documents and related data.

8.3 Relation Design for the Main Tables

The main table of the database is the Scores table. It contains the scores for each document, scored by the different configuration combinations (see the Implementation
section for a description of the scoring configurations). Figure 15 shows the relationships between the main tables:

- tScoreFunction, tMatchNormalisation, tTreeNormalisation: lookup tables, each with an Index primary key and a Name.
- Config: ConfigId primary key, with foreign keys SF, MN, and TN referencing the three lookup tables, plus the corresponding SF Name, MN Name, and TN Name.
- Source: SourceId primary key, with True Class, Node Parent Path, Node Path, and File Path.
- Scores: ScoreId primary key, with foreign keys SourceId and ConfigId, plus Score Class, True Class, and Score.
- tempMaxMinWScores: foreign keys SourceId and ConfigId, plus MaxOfWScore and MinOfWScore.

Figure 15. Table Relations
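The grouping cached by the tempMaxMinWScores table (and surfaced by the Maximum and Minimum Scores view) can be illustrated with a small sketch. Python is used here for illustration only (the real system queries Access via OLEDB), and the function name and row layout are assumptions.

```python
from collections import defaultdict

def max_min_weighted_scores(rows):
    """Group weighted scores by (document, configuration) and keep the
    maximum and minimum of each group, as tempMaxMinWScores caches them.

    rows: iterable of (source_id, config_id, weighted_score) tuples,
    standing in for joined Scores/Source/Config records.
    """
    grouped = defaultdict(list)
    for source_id, config_id, wscore in rows:
        grouped[(source_id, config_id)].append(wscore)
    # One (MaxOfWScore, MinOfWScore) pair per (document, configuration).
    return {key: (max(v), min(v)) for key, v in grouped.items()}

rows = [("doc1", 1, 0.2), ("doc1", 1, 0.9), ("doc1", 2, 0.5)]
max_min = max_min_weighted_scores(rows)
```

In the real schema this aggregation is done once and cached, since recomputing it over all 126 configurations per document would repeat work on every query.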
9 IMPLEMENTATION

Due to the large size of the program, this report will not cover all the implementation details; instead, the discussion will focus on the main classes and highlight some specific implementation points. See Appendix B, Class Definitions.

9.1 Main User Interface

The main form of the user interface is divided into four resizable panes, each of which displays a different type of information to the user (see Figure 16):

- tvExplorer
- rtxtView/sTreeView
- lblSTreeDetail/listView
- rtxtInfo

The tvExplorer is a Windows Forms TreeView control which displays the different corpuses available in the system. The information is presented as a hierarchy of nodes, like the way files and folders are displayed in the left pane of Windows Explorer.

The rtxtView is implemented as a Windows Forms RichTextBox control. When the user selects a child node in tvExplorer that represents a document, rtxtView displays the content of the document. The rtxtView also allows users to perform dynamic n-gram (sub-string) matching on a document (see section 10.3 Dynamic Sub-String Matching).

The sTreeView is implemented as a TreeView control. It shares the same pane as the rtxtView control and is only made visible on the main form (with the rtxtView hidden) when the user requests the display of a suffix tree that has been created. At the same time, the lblSTreeDetail control, implemented as a Windows Forms Label control, displays a description of the suffix tree currently shown in the sTreeView control.

The listView is a Windows Forms ListView control which provides information related to the current content of the rtxtView control. The rtxtInfo is a RichTextBox control which displays a classification summary for a document.
[Figure omitted: screenshot of the main form showing the tvExplorer, rtxtView/sTreeView, lblSTreeDetail/listView, and rtxtInfo panes.]

Figure 16. Main User Interface

The main form is implemented as a .NET class called MainForm. Figure 17 shows the class members and the class interface. Note that there are other Windows Forms control classes which were implemented to control the flow of user-system interaction. Section 10, Implementation Specifics, will describe one of them in detail; see Appendix x for all the user interface classes.