This document provides an introduction and overview of the Stat project, which aims to create an open source machine learning framework in Java for text analysis. The Stat framework is designed to be simple, extensible, and performant. It aims to simplify common text analysis tasks for researchers and engineers by providing reusable tools and wrappers for existing NLP and machine learning packages. The document outlines the goals, scope, stakeholders and provides an initial requirements analysis for the Stat framework.
Requirement Analysis Version 0.1
by the Stat Team
Mehrbod Sharifi
Jing Yang
The Stat Project, guided by
Professor Eric Nyberg and Anthony Tomasic
Feb. 25, 2009
Chapter 1
Introduction to STAT
In this chapter, we introduce the Stat project, its motivation and scope, and also define the target
audience and stakeholders. We begin the discussion of why we believe such a framework will be
useful for software engineers and computer science researchers; we will provide more details and
evidence in the later chapters.
1.1 Overview
Stat is an open source machine learning framework in Java for text analysis. Originally, Stat
abbreviated Semi-Supervised Text Analysis Toolkit, which referred to the implementation of some
semi-supervised algorithms in this package; later on, however, we evolved toward defining a
framework as opposed to a particular implementation, and therefore the first S can now be
interpreted as "Simple" or "Statistical".
Applying machine learning approaches to extract information and uncover patterns from textual
data has become extremely popular in recent years. Accordingly, many software packages have
been developed to enable people to utilize machine learning for text analytics and automate such
processes. Users, however, find many of these existing packages difficult to use, even if they just
want to carry out a simple experiment; they have to spend much time learning the software, and
may finally find out that they still need to write their own programs to preprocess data to get
their target software running.
We notice this situation and observe that much of this work can be simplified. A new software
framework should be developed to ease the process of doing text analytics; we believe researchers
and engineers using our framework for textual data analysis will find the process convenient,
comfortable, and, quite possibly, enjoyable.
Existing software for applying machine learning to linguistic analysis has tremendously helped
researchers and engineers make new discoveries based on textual data, which is arguably one of
the most common forms of data in the real world.
As a result, many more researchers, engineers, and students are increasingly interested in using
machine learning approaches in their text analytics. These people, some of whom are experienced
users, find that existing software packages are generally not easy to learn or convenient to use.
In the next section, we outline our design goals and provide a summary of how they differentiate
Stat from existing software packages. We also define the scope of our work and our audience in
the sections that follow.
1.2 Goals
Here is an outline of our design goals for the new framework. These points will be clarified mostly
in the upcoming chapters, but we state them with a brief introduction in this section:
• Simplicity: This is the most important consideration. Essentially, we will reduce the
complexity of the API by limiting the hierarchy, the number of domain objects, and their
interactions. We achieve this by defining a clear distinction of responsibilities, and we
evaluate our success by how quickly someone completely unfamiliar with text analysis and
machine learning can understand the toolkit and start using it. This is explained further
in the next sections and chapters.
• Extensibility: We focus on facilitating the extension of the package, or in other words,
implementing within our framework. Combined with the simplicity, we hope that this will
encourage more people to contribute and enable the kind of proven success that can be seen
in Matlab or R, for example.
• Performance: As is widely known, dealing with text is computationally intensive, and
we will take this into consideration from the ground up (e.g., by using Java primitives
instead of objects).
• Features: Given the emphasis on extensibility, we will give lower priority to implementing
many features in this package. Instead, we will demonstrate how the package generalizes the
approaches of many other packages by "wrapping" those tools so they can be used in the
simplified manner, implicitly providing some training for users who would rather continue
by moving to any of those packages. As stated previously, we will provide implementations
of unsupervised and semi-supervised methods, which are what is lacking in this domain.
These objectives show how Stat will be different from existing software packages in this domain.
For example, although Weka has a comprehensive suite of machine learning algorithms, it is not
designed for text analysis, lacking natural support for representing and processing linguistic
concepts. MinorThird, on the other hand, though designed specifically as a package for text
analysis, turns out to be rather complicated and difficult to learn. It also does not support
semi-supervised and unsupervised learning, which are becoming increasingly important machine
learning approaches.
Another problem with many existing packages is that they often adopt their own specific input
and output formats. Real-world textual data, however, are generally in other formats that are not
readily understood by those packages. Researchers and engineers who want to make use of those
packages often find themselves spending much time seeking or writing ad hoc format conversion
code. This ad hoc code, which could have been reusable, is often written over and over again by
different users.
Researchers and engineers, when presented with common text analysis tasks, usually want a
text-specific, lightweight, reusable, understandable, and easy-to-learn package that helps them
get their work done efficiently and straightforwardly. Stat is designed to meet these requirements.
Motivated by the needs of users who want to simplify their work and experiments related to
textual data learning, we initiated the Stat project, dedicated to providing them with suitable
toolkits to facilitate their analytics tasks on textual data.
In a nutshell, Stat is an open source framework aimed at providing researchers and engi-
neers with an integrated set of simplified, reusable, and convenient toolkits for textual data
analysis. Based on this framework, researchers can carry out their machine learning exper-
iments on textual data conveniently, and engineers can build their own small applications
for text analytics or use the classes designed by others.
1.3 Scope
The previous section may give the impression of an impossible task. In this section, we clearly
state what is and is not included in this project.
The main deliverable for this project is a set of specifications, which defines a simplified frame-
work for text analysis based on NLP and machine learning. We explain how succinctly the
framework can be used and how easily it can be extended.
We also provide introductory implementations of the framework, including tools and packages
serving as foundation classes of the framework. They are:
• Dataset and framework object adaptors: A set of classes that will allow reading and
writing files in various formats, supporting importing and exporting datasets as well as
loading and saving framework objects.
• Linguistic and machine learning package wrappers: A set of classes that integrate
existing tools for NLP and machine learning so they can be used within the framework.
These wrappers hide the implementation and variation details of those packages to provide
a set of simplified and unified interfaces to framework users.
• Semi-supervised algorithms: Implementations of certain semi-supervised learning algo-
rithms that are not available in existing packages.
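To make the wrapper idea concrete, here is a minimal sketch of the kind of unified interface such a wrapper might expose. ClassifierWrapper and MajorityClassifier are hypothetical names invented for illustration; they are not part of Stat, Weka, or any existing package.

```java
import java.util.*;

// Sketch of a unified wrapper interface that hides the details of an
// underlying learning package (illustrative assumption, not a real API).
interface ClassifierWrapper {
    void train(List<String> texts, List<String> labels);
    String predict(String text);
}

// Trivial stand-in implementation: ignores the text entirely and always
// predicts the most frequent training label. A real wrapper would delegate
// to a package such as Weka or MinorThird behind the same interface.
class MajorityClassifier implements ClassifierWrapper {
    private String majority = null;

    public void train(List<String> texts, List<String> labels) {
        Map<String, Integer> counts = new HashMap<>();
        for (String label : labels) counts.merge(label, 1, Integer::sum);
        majority = Collections.max(counts.entrySet(),
                Map.Entry.comparingByValue()).getKey();
    }

    public String predict(String text) {
        return majority;
    }
}
```

Because every backend sits behind the same two methods, user code that trains and predicts would not change when switching from one wrapped package to another; that is the interoperability the wrappers are meant to provide.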
The goal is NOT to design the most comprehensive machine learning package, nor to compete
with or correct the previous packages. We will focus on the goals stated above to create our
framework from a different perspective.
1.4 Stakeholders
Below is the list of stakeholders and how this project will affect them:
• Researchers, particularly in language technology but also in other fields, would be able
to save time by focusing on their experiments instead of dealing with the various input/output
formats that text processing routinely requires. They can also easily switch between the
various tools available, and even contribute to STAT so that others can save time by using
their adaptors and algorithms.
• Software engineers, who are not familiar with machine learning, can start using the
package in their programs after a very short learning phase. STAT can help them develop
clear concepts of machine learning quickly. They can easily build their applications using
functionality provided by STAT and achieve a high level of performance.
• Developers of learning packages can provide plug-ins for STAT to allow easy integration
of their packages. They can also delegate some of their interoperability needs to this
program (some of which may be more time consuming to address within their own
packages).
• Beginners in text processing and mining, who want fundamental and easy-to-learn
capabilities for discovering patterns in text. They will benefit from this project, which
saves them time, facilitates their learning process, and may spark their interest in the
area of language technology.
Chapter 3
Existing Related Software Packages
In this chapter, we analyze a few main competitors of our project. We focus on two academic
toolkits, Weka and MinorThird. We comment on their strengths, explore their limitations, and
discuss why and how we can do better than these competitors.
3.1 Weka
Weka is a comprehensive collection of machine learning algorithms for solving data mining
problems, written in Java and open sourced under the GPL.
3.1.1 Strengths of Weka
Weka is a very popular software package for machine learning, due to its main strengths:
• Provides comprehensive machine learning algorithms. Weka supports most current
machine learning approaches for classification, clustering, regression, and association rules.
• Covers most aspects of a full data mining process. In addition to learning,
Weka supports common data preprocessing methods, feature selection, and visualization.
• Freely available. Weka is open source, released under the GNU General Public License.
• Cross-platform. Weka is cross-platform, being fully implemented in Java.
Because of its comprehensive support of machine learning algorithms, Weka is often used for
analytics on many forms of data, including textual data.
3.1.2 Limitations of using Weka for text analysis
However, Weka is not designed specifically for textual data analysis. The most critical drawback
of using Weka for processing text is that Weka does not provide "built-in" constructs for the
natural representation of linguistic concepts.1 Users interested in using Weka for text analysis
often find themselves needing to write ad hoc programs for text preprocessing and for conversion
to Weka's representation.
1 Though there are classes in Weka that support basic natural language processing, they are viewed as auxiliary
utilities. They make basic textual data processing with Weka possible, but neither conveniently nor straight-
forwardly.
• Not good at understanding various text formats. Weka is good at understanding its
standard .arff format, which is, however, not a convenient way of representing text. Users
have to work out how to convert textual data in various original formats, such as raw plain
text, XML, HTML, CSV, Excel, PDF, MS Word, or OpenOffice documents, into a form
Weka understands. As a result, they need to spend time finding or writing external tools
to complete this task before performing their actual analysis.
• Unnecessary data type conversion. Weka is superior at processing nominal (i.e.,
categorical) and numerical attributes, but not string attributes. In Weka, non-numerical
attributes are by default imported as nominal attributes, which is usually not a desirable
type for text (imagine treating different chunks of text as different values of a categorical
attribute). One has to explicitly use filters to do the conversion, which could have been
done automatically if Weka knew that text was being imported.
• Lack of specialized support for linguistic preprocessing. Linguistic preprocessing
is a very important aspect of textual data analysis, but it is not a concern of Weka. Weka
does not take this issue very seriously for users (at least, it is not dedicated to it). Weka
has a StringToWordVector class that performs all-in-one basic linguistic preprocessing,
including tokenization, stemming, stopword removal, tf-idf transformation, etc. However,
it is less flexible and lacks other techniques (such as part-of-speech tagging and n-gram
processing) for users who want fine-grained and advanced linguistic control.
• Unnatural representation of textual data learning concepts. Weka is designed for
general-purpose machine learning tasks and therefore has to accommodate many variations.
As a result, domain concepts in Weka are abstract and high-level, the package hierarchy is
deep, and the number of classes explodes. For example, we have to use Instance rather than
Document and Instances rather than Corpus. Concepts in Weka such as Attribute are
obscure in meaning for text processing. First adding many Attribute objects to a cryptic
FastVector, which is then passed to an Instances object in order to construct a dataset,
appears very awkward to users processing text. Categorizing filters first by
attribute/instance and then by supervised/unsupervised leaves non-expert users confused
and unable to find the right filters. Many users may feel uncomfortable using Weka
programmatically to carry out their experiments related to text.
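For reference, the kind of all-in-one preprocessing that StringToWordVector bundles can be sketched in a few lines of plain Java. This is an illustration of the individual steps (lowercasing, tokenization, stopword removal, term counting), not Weka code; the BasicPreprocessor class and its tiny stopword list are invented for this example.

```java
import java.util.*;

// Plain-Java sketch of the basic preprocessing steps that classes like
// Weka's StringToWordVector bundle together: lowercasing, tokenization,
// stopword removal, and term counting. (Illustrative only; the stopword
// list is a small sample.)
class BasicPreprocessor {
    private static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "of", "is", "and"));

    // Returns a term -> frequency map for one document.
    static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        // Lowercase, then split on any run of non-letters (a crude tokenizer).
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (token.isEmpty() || STOPWORDS.contains(token)) continue;
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }
}
```

A framework aimed at text would expose each of these steps as a separate, swappable component (tokenizer, stopword filter, weighting scheme) rather than one monolithic filter, which is exactly the flexibility the bullet above finds missing in Weka.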
In summary, users who want an enjoyable experience performing text analysis need built-in
capabilities that naturally support representing and processing text. They need specialized and
convenient tools that can help them finish the most common text analysis tasks straightforwardly
and efficiently. This cannot be done by Weka, despite its comprehensive tools, due to its
general-purpose nature.
Partial UML Domain Model of Weka (Preliminary)
[Diagram: Evaluation is built from and evaluates a Classifier, which classifies Instances;
StringToWordVector transforms attributes of Instances; Instances contain Instance objects and
Attribute objects; an Instance holds attributeValues; NominalToString transforms the type of
an Attribute, which holds possibleValues.]
Note: when you see that ClassA "contains" a number of ClassB, it is probable that Weka
implements this as ClassA maintaining a "FastVector" whose elements are instances of ClassB.
Figure 3.1: Partial domain model of Weka for basic text analysis
Chapter 4
Requirements Specifications
Here we first explain in detail the major features of our framework.
• Simplified. The APIs are clear, consistent, and straightforward. Users with reasonable
Java programming knowledge and basic machine learning concepts can learn our package
without much effort, understand its logical flow quickly, get started within a small amount
of time, and finish the most common tasks with a few lines of code. Since our framework is
not designed for general purposes or for including comprehensive features, there is room
for us to simplify the APIs and optimize for the most typical and frequent operations.
• Reusable. Built-in modular support is provided for the core routines across the various
phases of text analysis, including text format transformation, linguistic processing, machine
learning, and experimental evaluation. Additional functionality can easily be extended on
top of the core framework, and user-defined specifications are pluggable. Existing code can
be used across environments and can interoperate with external related packages, such as
Weka, MinorThird, and OpenNLP. (I use reusable instead of extensible because it covers a
higher-level concept we might also need and be able to follow; what's your idea?)
• To be added
4.1 Functional Requirements
In this section, we define most common use cases of our framework and address them in the degree
of detail of casual use case. The “functional requirements” of this project are that the users can
use libraries provided by our framework to complete these use cases more easily and comfortably
than not use.
Actors
Since our framework assumes that all users of interest program using our APIs, there is only
one human actor role, namely the programmer. This human actor is always the primary actor.
There are some possible secondary and system actors, namely the external packages our
framework integrates, depending on which specific use cases the primary actor is performing.
Casual Use Cases
Here we present some typical use cases of our framework in a casual format. For better under-
standing and separation of responsibilities, the use cases are divided into several categories,
where each category defines a typical step of doing text analysis.
• Dataset importing and exporting. In this category of use cases, a user wants to read
file(s) from different kinds of sources in different kinds of formats into specific data
structures representing a dataset in memory for further processing, or to write a dataset to
files in another format. Sample important use cases are listed here:
1. Use case 1. Read a list of raw text files placed in a specified directory of the local
file system into a RawCorpus in which each RawDocument represents a text file.
2. Use case 2. Read a list of HTML files placed in a specified directory of the local file
system, strip the tags, and store them in a RawCorpus in which each RawDocument
represents an HTML file.
3. Use case 3. Read an XML file with non-Unicode encoding from the Web, specified by
a URL, into a RawDocument, with fields appropriately populated.
• Object persistence. In this category of use cases, a user wants to persist objects of our
framework to disk in our internal format, so that they can be loaded later.
1. Use case 1.
2. Use case 2.
3. Use case 3.
• Structured information extraction.
1. Use case 1.
2. Use case 2.
3. Use case 3.
• Linguistic preprocessing.
1. Use case 1.
2. Use case 2.
3. Use case 3.
• Machine learning.
1. Use case 1.
2. Use case 2.
3. Use case 3.
• Experiment and evaluation.
1. Use case 1.
2. Use case 2.
3. Use case 3.
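As a sketch of how the first category above (dataset importing) might look, the following self-contained Java mirrors Use case 1. The RawCorpus and RawDocument names come from the use cases; everything else (the fields, the fromDirectory method, the *.txt glob) is an illustrative assumption, not a committed API.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

// Illustrative sketch of Use case 1: read a directory of raw text files into
// a RawCorpus, one RawDocument per file. Names follow the use cases above;
// the implementation details are assumptions.
class RawDocument {
    final String name;
    final String text;
    RawDocument(String name, String text) { this.name = name; this.text = text; }
}

class RawCorpus implements Iterable<RawDocument> {
    private final List<RawDocument> docs = new ArrayList<>();

    void add(RawDocument d) { docs.add(d); }
    int size() { return docs.size(); }
    public Iterator<RawDocument> iterator() { return docs.iterator(); }

    // Reads every *.txt file in the given directory into one corpus.
    static RawCorpus fromDirectory(Path dir) throws IOException {
        RawCorpus corpus = new RawCorpus();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, "*.txt")) {
            for (Path p : files) {
                corpus.add(new RawDocument(p.getFileName().toString(),
                                           Files.readString(p)));
            }
        }
        return corpus;
    }
}
```

The point of the use case is that importing a whole directory should be a single call; format-specific details (HTML tag stripping, XML field population) would live in further adaptors behind the same corpus abstraction.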
4.2 Non-functional Requirements
• Open source. It should be made available for public collaboration, allowing users to use,
change, improve, and redistribute the software.
• Portability. It should install, configure, and run consistently across platforms, given its
design and implementation on the Java runtime environment.
• Documentation. Its code should be readable, self-explanatory, and documented clearly
and unambiguously in critical or tricky parts. It should include an introductory guide for
users to get started and, preferably, provide sample datasets, tutorials, and demos for users
to run examples out of the box.
• Performance. It should be able to respond to the user within a reasonable amount of
time given a limited amount of data (unclear; needs to be specified). Preferably, it can
estimate the running time needed to perform a task and notify the user before the task is
actually executed (is this the responsibility of the framework designers?)
• Dependency. This is an actual issue: the package integrates other external packages and
has many dependencies. How do we resolve this issue? How do we distribute our package?