Requirement Analysis                  Version 0.1


                                                      by the Stat Team

                                                        Mehrbod Sharifi
                                                             Jing Yang




               The Stat Project, guided by

          Professor Eric Nyberg and Anthony Tomasic




                      Feb. 25, 2009
Chapter 1

Introduction to STAT

In this chapter, we introduce the Stat project, its motivation, and its scope, and we define the
target audience and stakeholders. We begin by discussing why we believe such a framework will be
useful to software engineers and computer science researchers; later chapters provide more
details and evidence.


1.1    Overview
Stat is an open source machine learning framework in Java for text analysis. Originally, the name
Stat stood for Semi-Supervised Text Analysis Toolkit, which referred to the implementation of
some semi-supervised algorithms in this package. Later, however, we moved toward defining a
framework rather than our particular implementation, and therefore the first S can now be
interpreted as "Simple" or "Statistical".

    Applying machine learning approaches to extract information and uncover patterns from
textual data has become extremely popular in recent years. Accordingly, many software packages
have been developed to enable people to use machine learning for text analytics and to automate
that process. Users, however, find many of these existing packages difficult to use, even if they
just want to carry out a simple experiment; they have to spend much time learning the software and
may still find that they need to write their own programs to preprocess data before their target
software will run.

    We have noticed this situation and observe that many of these steps can be simplified. A new
software framework should be developed to ease the process of doing text analytics; we believe
researchers and engineers using our framework for textual data analysis would find the process
convenient, comfortable, and probably enjoyable.
    Existing software for applying machine learning to linguistic analysis has tremendously
helped researchers and engineers make new discoveries based on textual data, which is
undeniably one of the most common forms of data in the real world.

    As a result, many more researchers, engineers, and possibly students are increasingly
interested in using machine learning approaches in their text analytics. These people, some of
whom are experienced users, find that existing software packages are generally neither easy to
learn nor convenient to use.
    In the next section, we outline our design goals and summarize how they differentiate Stat
from the existing software packages. We also define the scope of our work and our audience in
the sections that follow.

1.2     Goals
Here is an outline of our design goals for the new framework. These points are clarified mostly
in the upcoming chapters, but we state them with a brief introduction in this section:

   • Simplicity: This is the most important consideration. Essentially, we will reduce the
     complexity of the API by limiting the hierarchy and number of domain objects and their
     interactions. We achieve this by defining a clear separation of responsibilities, and we
     evaluate our success by how quickly someone completely unfamiliar with text analysis and
     machine learning can understand the toolkit and start using it. This is explained further in
     the next sections and chapters.

   • Extensibility: We focus on how to facilitate extension of the package, or in other words,
     implementing within our framework. Combined with simplicity, we hope that this will
     encourage more people to contribute and enable the kind of proven success seen in Matlab
     or R, for example.

   • Performance: As is widely known, dealing with text is computationally intensive, and
     we will take this into consideration from the ground up (e.g., using Java primitives instead
     of objects).

   • Features: Given the focus on extensibility, we will give lower priority to implementing many
     features for this package. Instead, we will demonstrate how the package generalizes the
     approaches of many other packages by "wrapping" those tools so that they can be used in a
     simplified manner, which also implicitly provides some training in them for users who would
     rather continue by moving to any of those packages. As stated previously, we will provide
     implementations of unsupervised and semi-supervised methods, which are lacking in this
     domain.
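
    The performance goal of preferring Java primitives to objects can be illustrated with a small
sketch. The class name SparseVector and its methods are hypothetical and are not part of Stat. The
sketch stores a sparse term-weight vector in parallel int[] and double[] arrays rather than in a
Map<Integer, Double>, so no boxed object is allocated per entry.

```java
import java.util.Arrays;

/** Hypothetical sparse feature vector backed by primitive arrays (illustration only). */
final class SparseVector {
    private final int[] indices;   // feature indices, assumed sorted in ascending order
    private final double[] values; // values[i] is the weight of feature indices[i]

    SparseVector(int[] sortedIndices, double[] values) {
        if (sortedIndices.length != values.length) {
            throw new IllegalArgumentException("indices and values must have equal length");
        }
        this.indices = sortedIndices.clone();
        this.values = values.clone();
    }

    /** Returns the weight stored for a feature, or 0.0 if the feature is absent. */
    double get(int featureIndex) {
        int pos = Arrays.binarySearch(indices, featureIndex);
        return pos >= 0 ? values[pos] : 0.0;
    }

    /** Dot product computed by merging the two sorted index arrays; no boxing occurs. */
    double dot(SparseVector other) {
        double sum = 0.0;
        int i = 0;
        int j = 0;
        while (i < indices.length && j < other.indices.length) {
            if (indices[i] == other.indices[j]) {
                sum += values[i] * other.values[j];
                i++;
                j++;
            } else if (indices[i] < other.indices[j]) {
                i++;
            } else {
                j++;
            }
        }
        return sum;
    }
}
```

    Whether this layout pays off in practice depends on the data and the JVM, so it should be
measured rather than assumed.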

    These objectives show how Stat will differ from existing software packages in this domain.
For example, although Weka has a comprehensive suite of machine learning algorithms, it is not
designed for text analysis and lacks native support for representing and processing linguistic
concepts. MinorThird, on the other hand, though designed specifically as a package for text
analysis, turns out to be rather complicated and difficult to learn. It also does not support
semi-supervised and unsupervised learning, which are becoming increasingly important machine
learning approaches.

   Another problem with many existing packages is that they often adopt their own specific input
and output formats. Real-world textual data, however, are generally in other formats that are not
readily understood by those packages. Researchers and engineers who want to make use of those
packages often find themselves spending much time seeking or writing ad hoc format conversion
code. This ad hoc code, which could have been reusable, is often written over and over again
by different users.

    Researchers and engineers, when presented with common text analysis tasks, usually want a
text-specific, lightweight, reusable, understandable, and easy-to-learn package that helps them
get their work done efficiently and straightforwardly. Stat is designed to meet these
requirements. Motivated by the needs of users who want to simplify their work and experiments
on learning from textual data, we initiated the Stat project, dedicated to providing them with
suitable toolkits to facilitate their analytics tasks on textual data.



      In a nutshell, Stat is an open source framework aimed at providing researchers and
      engineers with an integrated set of simplified, reusable, and convenient toolkits for textual
      data analysis. Based on this framework, researchers can carry out their machine learning
      experiments on textual data conveniently, and engineers can build their own small
      applications for text analytics or use the classes designed by others.


1.3     Scope
The previous section may give the impression of an impossible task. In this section, we clearly
state what is and is not included in this project.

   The main deliverable for this project is a set of specifications that define a simplified
framework for text analysis based on NLP and machine learning. We explain how succinctly the
framework can be used and how easily it can be extended.

    We also provide introductory implementations of the framework, including tools and packages
that serve as foundation classes of the framework. They are:

   • Dataset and framework object adaptors: A set of classes that allow reading and writing
     files in various formats, supporting the import and export of datasets as well as the
     loading and saving of framework objects.

   • Linguistic and machine learning package wrappers: A set of classes that integrate
     existing tools for NLP and machine learning so that they can be used within the framework.
     These wrappers hide the implementation and variation details of these packages to provide
     a set of simplified and unified interfaces to framework users.

   • Semi-Supervised algorithms: Implementation of certain Semi-Supervised learning algo-
     rithms that are not available from the existing packages.
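
    The wrapper idea described in the list above might be sketched as follows. The names
TextClassifier and MajorityLabelClassifier are hypothetical and do not come from the Stat
specification. A real wrapper would translate calls on the unified interface into the API of the
wrapped package; the stand-in here only predicts the most frequent training label, so that the
example stays self-contained.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical unified interface that framework users would program against. */
interface TextClassifier {
    void train(List<String> texts, List<String> labels);

    String classify(String text);
}

/**
 * Stand-in for a wrapper around an external package (illustration only).
 * It ignores the text content and always predicts the most frequent training label.
 */
final class MajorityLabelClassifier implements TextClassifier {
    private String majorityLabel;

    @Override
    public void train(List<String> texts, List<String> labels) {
        if (texts.size() != labels.size() || labels.isEmpty()) {
            throw new IllegalArgumentException("need one label per text and at least one text");
        }
        Map<String, Integer> counts = new HashMap<>();
        int best = 0;
        for (String label : labels) {
            int count = counts.merge(label, 1, Integer::sum);
            if (count > best) {
                best = count;
                majorityLabel = label;
            }
        }
    }

    @Override
    public String classify(String text) {
        if (majorityLabel == null) {
            throw new IllegalStateException("train must be called before classify");
        }
        return majorityLabel;
    }
}
```

    Because callers depend only on the interface, one wrapped tool could be swapped for another
without changing user code, which is the point of hiding the variation details.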

    The goal is NOT to design the most comprehensive machine learning package, or to compete
with or correct the previous packages. We will focus on the goals stated above to create our
framework from a different perspective.




1.4    Stakeholders
Below is the list of stakeholders and how this project will affect them:

   • Researchers, particularly in language technology but also in other fields, would be able
     to save time by focusing on their experiments instead of dealing with the various
     input/output formats that text processing routinely requires. They can also easily switch
     between the various tools available and even contribute to STAT so that others can save
     time by using their adaptors and algorithms.

   • Software engineers who are not familiar with machine learning can start using the
     package in their programs after a very short learning phase. STAT can help them develop a
     clear understanding of machine learning concepts quickly. They can easily build their
     applications using the functionality provided by STAT and achieve a high level of
     performance.

   • Developers of learning packages can provide plug-ins for STAT to ease integration of
     their packages. They can also delegate some of their interoperability needs to this
     program (some of which may be more time consuming to address within their own
     package).

   • Beginners in text processing and mining, who want fundamental and easy-to-learn
     capabilities for discovering patterns in text. They will benefit from this project
     because it saves them time, facilitates their learning process, and sparks their interest
     in the area of language technology.




Chapter 2

Survey Analysis

2.1   Goal

2.2   Design

2.3   Result




Chapter 3

Existing Related Software Packages

In this chapter, we analyze a few main competitors of our project. We focus on two academic
toolkits: Weka and MinorThird. We comment on their strengths, explore their limitations, and
discuss why and how we can do better than these competitors.


3.1        Weka
Weka is a comprehensive collection of machine learning algorithms for solving data mining problems.
It is implemented in Java and released as open source under the GPL.

3.1.1       Strengths of Weka
Weka is a very popular software package for machine learning, owing to its main strengths:

       • Provides comprehensive machine learning algorithms. Weka supports most current
         machine learning approaches for classification, clustering, regression, and association rules.
       • Covers most aspects of a full data mining process. In addition to learning, Weka
         supports common data preprocessing methods, feature selection, and visualization.
       • Freely available. Weka is open source, released under the GNU General Public License.
       • Cross-platform. Weka is fully implemented in Java and therefore runs across platforms.

Because it supports a comprehensive set of machine learning algorithms, Weka is often used for
analytics on many forms of data, including textual data.

3.1.2       Limitations of using Weka for text analysis
However, Weka is not designed specifically for textual data analysis. The most critical drawback
of using Weka for processing text is that Weka does not provide "built-in" constructs for the
natural representation of linguistic concepts (see note 1). Users interested in using Weka for
text analysis often find that they need to write ad hoc programs for text preprocessing and for
conversion to Weka's representation.

   Note 1: Though there are classes in Weka supporting basic natural language processing, they
   are viewed as auxiliary utilities. They make basic textual data processing with Weka possible,
   but not convenient or straightforward.

   • Not good at understanding various text formats. Weka is good at understanding its
     standard .arff format, which, however, is not a convenient way of representing text. Users
     have to worry about how to convert textual data in various original formats, such as
     raw plain text, XML, HTML, CSV, Excel, PDF, MS Word, Open Office documents, etc., into a
     form that Weka understands. As a result, they need to spend time seeking or writing
     external tools to complete this task before performing their actual analysis.
   • Unnecessary data type conversion. Weka is strong at processing nominal (i.e., categorical)
     and numerical attributes, but not string attributes. In Weka, non-numerical attributes
     are by default imported as nominal attributes, which is usually not a desirable type for text
     (imagine treating different chunks of text as different values of a categorical attribute).
     Users have to apply filters explicitly to do the conversion, which could have been done
     automatically if Weka knew that text was being imported.
   • Lack of specialized support for linguistic preprocessing. Linguistic preprocessing
     is a very important aspect of textual data analysis but not a primary concern of Weka, which
     does not provide dedicated support for it. Weka has a
     StringToWordVector class that performs all-in-one basic linguistic preprocessing, including
     tokenization, stemming, stopword removal, tf-idf transformation, etc. However, it is less
     flexible and lacks other techniques (such as part-of-speech tagging and n-gram processing)
     for users who want fine-grained and advanced linguistic control.
   • Unnatural representation of textual data learning concepts. Weka is designed for
     general-purpose machine learning tasks, so it has to protect against too many variations.
     As a result, domain concepts in Weka are abstract and high-level, the package hierarchy is
     deep, and the number of classes explodes. For example, we have to use Instance rather than
     Document and Instances rather than Corpus. Concepts in Weka such as Attribute are obscure
     in meaning for text processing. First adding many Attribute objects to a cryptic FastVector,
     which is then passed to an Instances object in order to construct a dataset, appears very
     awkward to users processing text. Categorizing filters first by attribute/instance and then
     by supervised/unsupervised confuses non-expert users and makes it hard for them to find the
     right filters. Many users may feel uncomfortable using Weka programmatically to carry out
     their text-related experiments.

    In summary, users who want an enjoyable experience performing text analysis need
built-in capabilities that naturally support representing and processing text. They need
specialized and convenient tools that can help them finish the most common text analysis tasks
straightforwardly and efficiently. Weka cannot provide this because of its general-purpose
nature, despite its comprehensive tools.
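
    To make the idea of fine-grained linguistic control more concrete, the sketch below exposes
tokenization, stopword removal, and n-gram construction as separate steps that a user can combine
or skip, in contrast to an all-in-one filter. The class Preprocessing and its method names are
hypothetical illustrations, not part of Stat or Weka, and the tokenizer is deliberately naive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical fine-grained preprocessing steps (illustration only). */
final class Preprocessing {
    private Preprocessing() {}

    /** Lowercases the text and splits it on runs of characters that are not letters or digits. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) { // split may yield a leading empty string
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Drops every token that appears in the given stopword set. */
    static List<String> removeStopwords(List<String> tokens, Set<String> stopwords) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!stopwords.contains(t)) {
                kept.add(t);
            }
        }
        return kept;
    }

    /** Builds word n-grams by joining each window of n consecutive tokens with a space. */
    static List<String> ngrams(List<String> tokens, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            grams.add(String.join(" ", tokens.subList(i, i + n)));
        }
        return grams;
    }
}
```

    For example, tokenizing "The cat sat on the mat.", removing the stopwords "the" and "on", and
requesting bigrams yields the two bigrams "cat sat" and "sat mat".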




[Figure: Partial UML Domain Model of Weka (Preliminary). Classes shown: Evaluation, Classifier,
Instances, Instance (attributeValues), Attribute (possibleValues), StringToWordVector, and
NominalToString. Associations shown: Evaluation evaluate Classifier; Evaluation evaluate-on
Instances; Classifier built-from Instances; Classifier classify Instance; Instances contain
Instance; Instances contain Attribute; StringToWordVector transform-attribute Instances;
NominalToString transform-type Attribute.]

Note: when you see ClassA "contains" a number of ClassB, it is probable that Weka implements
it as ClassA maintaining a "FastVector" whose elements are instances of ClassB.


    Figure 3.1: Partial domain model for Weka for basic text analysis




Chapter 4

Requirements specifications

Here we first explain in detail the major features of our framework.
   • Simplified. APIs are clear, consistent, and straightforward. Users with reasonable Java
     programming knowledge and a grasp of basic machine learning concepts can learn our package
     without much effort, understand its logical flow quickly, get started within a small amount
     of time, and finish the most common tasks with a few lines of code. Since our framework is
     not designed for general purposes or to include comprehensive features, there is room
     for us to simplify the APIs and optimize them for the most typical and frequent operations.

   • Reusable. Built-in modular support is provided for the core routines across the various
     phases of text analysis, including text format transformation, linguistic processing, machine
     learning, and experimental evaluation. Additional functionality can be added on top of
     the core framework easily, and user-defined specifications are pluggable. Existing code can
     be used across environments and can interoperate with related external packages, such as
     Weka, MinorThird, and OpenNLP. (I use "reusable" instead of "extendable" because it covers a
     higher-level concept that we might also need and be able to follow. What's your idea?)

   • To be added


4.1    Functional Requirements
In this section, we define the most common use cases of our framework and describe them at the
level of detail of a casual use case. The "functional requirements" of this project are that users
can use the libraries provided by our framework to complete these use cases more easily and
comfortably than they could without it.

Actors
Since our framework assumes that all users of interest program against our APIs, there is
only one human actor role, namely the programmer. This human actor is always the primary
actor. There are some possible secondary and system actors, namely the external packages our
framework integrates, depending on which specific use case the primary actor is performing.

Casual Use Cases
Here we present some typical use cases of our framework in a casual format. For better
understanding and separation of responsibilities, the use cases are divided into several
categories, where each category defines a typical step of text analysis.

• Dataset importing and exporting. In this category of use cases, a user wants to read
  file(s) from different kinds of sources in different kinds of formats into specific data
  structures that represent a dataset in memory for further processing, or to write a dataset to
  files in another format. Some important sample use cases are listed here:

    1. Use case 1. Read a list of raw text files placed in a specified directory of the local
       file system into a RawCorpus in which each RawDocument represents a text file.
    2. Use case 2. Read a list of HTML files placed in a specified directory of the local file
       system, strip the tags, and store them in a RawCorpus in which each RawDocument
       represents an HTML file.
    3. Use case 3. Read an XML file with a non-Unicode encoding from the Web, specified by a
       URL, into a RawDocument, with fields appropriately populated.

• Object persistence. In this category of use cases, a user wants to persist objects in our
  framework to disk in our internal format, from which they can be loaded later.

    1. Use case 1.
    2. Use case 2.
    3. Use case 3.

• Structured information extraction.

    1. Use case 1.
    2. Use case 2.
    3. Use case 3.

• Linguistic preprocessing.

    1. Use case 1.
    2. Use case 2.
    3. Use case 3.

• Machine learning.

    1. Use case 1.
    2. Use case 2.
    3. Use case 3.

• Experiment and evaluation.

    1. Use case 1.
    2. Use case 2.
    3. Use case 3.
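
As a rough illustration of use case 1 in the dataset importing category, the sketch below reads
every .txt file in a directory into a RawCorpus of RawDocument objects. Only the names RawCorpus
and RawDocument come from the use case; their fields, the fromDirectory method, the .txt filter,
and the UTF-8 encoding are assumptions made for this example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical holder for the unprocessed content of one text file. */
final class RawDocument {
    final String name;
    final String text;

    RawDocument(String name, String text) {
        this.name = name;
        this.text = text;
    }
}

/** Hypothetical in-memory collection of raw documents. */
final class RawCorpus {
    private final List<RawDocument> documents = new ArrayList<>();

    void add(RawDocument doc) {
        documents.add(doc);
    }

    List<RawDocument> documents() {
        return Collections.unmodifiableList(documents);
    }

    /** Reads every regular *.txt file directly inside dir (UTF-8 assumed) into a corpus. */
    static RawCorpus fromDirectory(Path dir) throws IOException {
        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*.txt")) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)) {
                    files.add(p);
                }
            }
        }
        Collections.sort(files); // directory iteration order is unspecified, so sort for stable results
        RawCorpus corpus = new RawCorpus();
        for (Path p : files) {
            String text = new String(Files.readAllBytes(p), StandardCharsets.UTF_8);
            corpus.add(new RawDocument(p.getFileName().toString(), text));
        }
        return corpus;
    }
}
```

With such an adaptor, the user-facing part of the use case reduces to a single call such as
RawCorpus.fromDirectory(Paths.get("data")), where "data" is a placeholder directory name.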




4.2   Non-functional Requirements
  • Open source. It should be made available for public collaboration, allowing users to use,
    change, improve, and redistribute the software.

  • Portability. It should install, configure, and run consistently across different
    platforms, given that it is designed and implemented for the Java runtime environment.

  • Documentation. Its code should be readable and self-explanatory, with clear and
    unambiguous documentation for critical or tricky parts. It should include an introductory
    guide to help users get started and, preferably, provide sample datasets, tutorials, and
    demos so that users can run examples out of the box.

  • Performance. It should be able to respond to the user within a reasonable amount of time
    given a limited amount of data (unclear, needs to be specified). Preferably, it can estimate
    the running time needed to perform a task and notify the user before the task is actually
    executed (is this the responsibility of the framework designers?).

  • Dependency. This is an open issue. The package integrates other external packages and has
    many dependencies. How do we resolve this issue? How do we distribute our package?





Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 

Requirement

  • 1. Requirement Analysis, Version 0.1, by the Stat Team (Mehrbod Sharifi, Jing Yang). The Stat Project, guided by Professor Eric Nyberg and Anthony Tomasic. Feb. 25, 2009.
  • 2. Chapter 1: Introduction to STAT

    In this chapter, we introduce the Stat project, its motivation, and its scope, and we define the target audience and stakeholders. We begin the discussion of why we believe such a framework will be useful for software engineers and computer science researchers; later chapters provide more detail and evidence.

    1.1 Overview

    Stat is an open source machine learning framework in Java for text analysis. Originally, the word Stat abbreviated "Semi-Supervised Text Analysis Toolkit", which referred to the implementation of some semi-supervised algorithms in this package. Later we moved to defining a framework rather than our particular implementation, so the first S can now be interpreted as "Simple" or "Statistical".

    Applying machine learning approaches to extract information and uncover patterns from textual data has become extremely popular in recent years. Accordingly, many software packages have been developed to let people use machine learning for text analytics and automate the process. Users, however, find many of these packages difficult to use, even for a simple experiment. They have to spend considerable time learning the software, and may still end up writing their own programs to preprocess data before their target software will run.

    We have noticed this situation and observe that much of it can be simplified. A new software framework should be developed to ease the process of doing text analytics. We believe researchers and engineers using our framework for textual data analysis would find the process convenient, comfortable, and probably enjoyable.

    Existing software that applies machine learning to linguistic analysis has tremendously helped researchers and engineers make new discoveries based on textual data, which is unarguably one of the most common forms of data in the real world. As a result, many more researchers, engineers, and possibly students are increasingly interested in using machine learning approaches in their text analytics. These people, some of whom are experienced users, find that existing software packages are generally not easy to learn or convenient to use.

    In the next section, we outline our design goals and summarize how they differentiate Stat from the existing software packages. We define the scope of our work and our audience in the sections that follow.
  • 3. 1.2 Goals

    Here is the outline of our design goals for the new framework. These points are clarified mostly in the upcoming chapters, but we state them here with a brief introduction:

    • Simplicity: This is the most important consideration. Essentially, we will reduce the complexity of the API by limiting the hierarchy and number of domain objects and their interactions. We achieve this by defining a clear separation of responsibilities, and we evaluate our success by how quickly someone completely unfamiliar with text analysis and machine learning can understand the toolkit and start using it. This is explained further in the next sections and chapters.

    • Extensibility: We focus on how to facilitate extending the package, in other words, implementing within our framework. Combined with simplicity, we hope this will encourage more people to contribute and enable the kind of proven success seen in Matlab or R, for example.

    • Performance: As is widely known, dealing with text is computationally intensive, and we will take this into consideration from the ground up (e.g., using Java primitives instead of objects).

    • Features: Given extensibility, we will give lower priority to implementing many features for this package. Instead, we will demonstrate how the package generalizes the approaches of many other packages by "wrapping" those tools so they can be used in a simplified manner. This also implicitly provides some training for users who would rather continue by moving to one of those packages. As stated previously, we will provide implementations of unsupervised and semi-supervised methods, which are lacking in this domain.

    These objectives show how Stat will differ from existing software packages in this domain. For example, although Weka has a comprehensive suite of machine learning algorithms, it is not designed for text analysis and lacks natural support for representing and processing linguistic concepts. MinorThird, on the other hand, though designed specifically as a package for text analysis, turns out to be rather complicated and difficult to learn. It also does not support semi-supervised and unsupervised learning, which are becoming increasingly important machine learning approaches.

    Another problem with many existing packages is that they often adopt their own specific input and output formats. Real-world textual data, however, generally come in other formats that those packages do not readily understand. Researchers and engineers who want to use those packages often spend much time seeking or writing ad hoc format conversion code. This ad hoc code, which could have been reusable, is often written over and over again by different users.

    Researchers and engineers, when presented with common text analysis tasks, usually want a text-specific, lightweight, reusable, understandable, and easy-to-learn package that helps them get their work done efficiently and straightforwardly. Stat is designed to meet these requirements. Motivated by the needs of users who want to simplify their work and experiments on textual data learning, we initiated the Stat project, dedicated to providing suitable toolkits that facilitate their analytics tasks on textual data.
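The performance goal in Section 1.2 mentions using Java primitives instead of objects. The following is only a minimal sketch of what that could mean for term counting; the class `TermCounts` and both method names are hypothetical and are not taken from any Stat release. It assumes term ids are dense integers starting at 0.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of any published Stat API.
final class TermCounts {
    private TermCounts() {}

    // Boxed variant: every key and count is an Integer object inside a HashMap.
    static Map<Integer, Integer> countBoxed(int[] termIds) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int id : termIds) {
            counts.merge(id, 1, Integer::sum);
        }
        return counts;
    }

    // Primitive variant: counts live in a flat int[] indexed by term id.
    // Assumes every id lies in [0, vocabularySize).
    static int[] countPrimitive(int[] termIds, int vocabularySize) {
        int[] counts = new int[vocabularySize];
        for (int id : termIds) {
            counts[id]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] document = {0, 2, 2, 1, 2};
        System.out.println(Arrays.toString(countPrimitive(document, 3)));
        System.out.println(countBoxed(document));
    }
}
```

The primitive variant avoids allocating one object per count, at the price of requiring a known, dense vocabulary; whether that trade-off suits a given corpus is a design decision the source leaves open.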
  • 4. In a nutshell, Stat is an open source framework aimed at providing researchers and engineers with an integrated set of simplified, reusable, and convenient toolkits for textual data analysis. Based on this framework, researchers can carry out their machine learning experiments on textual data conveniently, and engineers can build their own small applications for text analytics or use the classes designed by others.

    1.3 Scope

    The previous section may give the impression of an impossible task. In this section, we clearly state what is and is not included in this project.

    The main deliverable for this project is a set of specifications that defines a simplified framework for text analysis based on NLP and machine learning. We explain how succinctly the framework can be used and how easily it can be extended.

    We also provide introductory implementations of the framework, including tools and packages that serve as its foundation classes. They are:

    • Dataset and framework object adaptors: A set of classes for reading and writing files in various formats, supporting importing and exporting datasets as well as loading and saving framework objects.

    • Linguistic and machine learning package wrappers: A set of classes that integrate existing tools for NLP and machine learning so they can be used within the framework. These wrappers hide the implementation and variation details of these packages and provide a set of simplified, unified interfaces to framework users.

    • Semi-supervised algorithms: Implementations of certain semi-supervised learning algorithms that are not available in the existing packages.

    The goal is NOT to design the most comprehensive machine learning package, nor to compete with or correct the previous packages. We will focus on the goals stated above to create our framework from a different perspective.
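The wrapper idea in Section 1.3 can be sketched roughly as follows. Every name here (`TextClassifier`, `ExternalKeywordModel`, `ExternalKeywordModelWrapper`) is a hypothetical stand-in: the external model imitates a third-party package with its own calling convention, and the wrapper hides that convention behind one simple interface. This is an illustration under those assumptions, not the framework's actual design.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// All names below are hypothetical illustrations, not a published Stat API.

// Unified interface that a framework user would program against.
interface TextClassifier {
    String classify(String text);
}

// Stand-in for an external package with its own conventions:
// it takes a token array and returns a numeric label index.
final class ExternalKeywordModel {
    private final List<String> positiveWords = Arrays.asList("good", "great", "excellent");

    int predict(String[] tokens) {
        for (String token : tokens) {
            if (positiveWords.contains(token)) {
                return 1;
            }
        }
        return 0;
    }
}

// Wrapper that converts plain text to the external representation
// and maps the numeric result back to a readable label.
final class ExternalKeywordModelWrapper implements TextClassifier {
    private static final String[] LABELS = {"negative", "positive"};
    private final ExternalKeywordModel model = new ExternalKeywordModel();

    @Override
    public String classify(String text) {
        String[] tokens = text.toLowerCase(Locale.ROOT).split("\\s+");
        return LABELS[model.predict(tokens)];
    }
}
```

A caller only sees `TextClassifier`, so swapping one wrapped package for another would not change user code.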
  • 5. 1.4 Stakeholders

    Below is the list of stakeholders and how this project will affect them:

    • Researchers, particularly in language technology but also in other fields, will be able to save time by focusing on their experiments instead of dealing with the various input/output formats that text processing routinely requires. They can also easily switch between the various tools available, and even contribute to STAT so that others can save time by using their adaptors and algorithms.

    • Software engineers who are not familiar with machine learning can start using the package in their programs after a very short learning phase. STAT can help them develop clear concepts of machine learning quickly. They can build their applications easily using functionality provided by STAT and achieve a high level of performance.

    • Developers of learning packages can provide plug-ins for STAT to ease integration of their packages. They can also delegate some of their interoperability needs to this program (some of which may be more time-consuming to address within their own package).

    • Beginners in text processing and mining, who want fundamental and easy-to-learn capabilities for discovering patterns in text. They will benefit from this project by saving time, easing their learning process, and sparking their interest in the area of language technology.
  • 6. Chapter 2: Survey Analysis

    2.1 Goal

    2.2 Design

    2.3 Result
  • 7-8. Chapter 3: Existing Related Software Packages

    In this chapter, we analyze a few main competitors of our project. We focus on two academic toolkits: Weka and MinorThird. We comment on their strengths, explore their limitations, and discuss why and how we can do better than these competitors.

    3.1 Weka

    Weka is a comprehensive collection of machine learning algorithms for solving data mining problems, written in Java and open sourced under the GPL.

    3.1.1 Strengths of Weka

    Weka is very popular software for machine learning because of its main strengths:

    • Provides comprehensive machine learning algorithms. Weka supports most current machine learning approaches for classification, clustering, regression, and association rules.

    • Covers most aspects of a full data mining process. In addition to learning, Weka supports common data preprocessing methods, feature selection, and visualization.

    • Freely available. Weka is open source, released under the GNU General Public License.

    • Cross-platform. Weka is fully implemented in Java.

    Because of its support for comprehensive machine learning algorithms, Weka is often used for analytics on many forms of data, including textual data.

    3.1.2 Limitations of using Weka for text analysis

    However, Weka is not designed specifically for textual data analysis. The most critical drawback of using Weka for processing text is that it does not provide "built-in" constructs for a natural representation of linguistic concepts (1). Users interested in using Weka for text analysis often find that they need to write ad hoc programs for text preprocessing and conversion to the Weka representation.

    (1) Though there are classes in Weka supporting basic natural language processing, they are viewed as auxiliary utilities. They make basic textual data processing with Weka possible, but not convenient or straightforward.

    • Not good at understanding various text formats. Weka is good at understanding its standard .arff format, which is however not a convenient way of representing text. Users have to worry about how to convert textual data in various original formats, such as raw plain text, XML, HTML, CSV, Excel, PDF, MS Word, and Open Office documents, into something Weka understands. As a result, they need to spend time seeking or writing external tools to complete this task before performing their actual analysis.

    • Unnecessary data type conversion. Weka is strong at processing nominal (categorical) and numerical attributes, but not string attributes. In Weka, non-numerical attributes are imported as nominal attributes by default, which is usually not a desirable type for text (imagine treating different chunks of text as different values of a categorical attribute). One has to explicitly use filters to do a conversion, which could have been done automatically if Weka knew that text was being imported.

    • Lack of specialized support for linguistic preprocessing. Linguistic preprocessing is a very important aspect of textual data analysis but not a concern of Weka, which is not dedicated to handling this issue for users. Weka has a StringToWordVector class that performs all-in-one basic linguistic preprocessing, including tokenization, stemming, stopword removal, and tf-idf transformation. However, it is less flexible and lacks other techniques (such as part-of-speech tagging and n-gram processing) for users who want fine-grained and advanced linguistic controls.

    • Unnatural representation of textual data learning concepts. Weka is designed for general-purpose machine learning tasks, so it has to protect against too many variations. As a result, domain concepts in Weka are abstract and high-level, the package hierarchy is deep, and the number of classes explodes. For example, we have to use Instance rather than Document and Instances rather than Corpus. Concepts in Weka such as Attribute are obscure in meaning for text processing. First adding many Attribute objects to a cryptic FastVector, which is then passed to an Instances object in order to construct a dataset, appears very awkward to users processing text. Categorizing filters first by attribute/instance and then by supervised/unsupervised confuses non-expert users and makes it hard for them to find the right filters. Many users may feel uncomfortable using Weka programmatically to carry out their text experiments.

    In summary, users who want an enjoyable experience performing text analysis need built-in capabilities that naturally support representing and processing text. They need specialized and convenient tools that help them finish the most common text analysis tasks straightforwardly and efficiently. Weka cannot provide this because of its general-purpose nature, despite its comprehensive tools.
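To make the "fine-grained linguistic controls" point in Section 3.1.2 concrete, here is a minimal sketch of preprocessing split into separately callable steps (tokenization, stopword removal, n-gram generation) rather than one all-in-one transformation. The class `Preprocessing` and its methods are hypothetical and deliberately simple; they are not Weka or Stat APIs, and the tokenizer assumes plain ASCII English text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration of composable preprocessing steps.
final class Preprocessing {
    private Preprocessing() {}

    // Lower-cases the text and splits on runs of non-alphanumeric ASCII characters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Keeps only tokens that are not in the caller-supplied stopword set.
    static List<String> removeStopwords(List<String> tokens, Set<String> stopwords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stopwords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    // Builds word n-grams (n >= 1) by joining n consecutive tokens with a space.
    static List<String> ngrams(List<String> tokens, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            grams.add(String.join(" ", tokens.subList(i, i + n)));
        }
        return grams;
    }
}
```

Because each step takes and returns a plain token list, a user could reorder the steps, skip one, or insert an additional stage such as a stemmer or tagger between them.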
  • 9. [Figure: Partial UML domain model of Weka (preliminary), relating the classes Evaluation, Classifier, Instances, Instance, Attribute, StringToWordVector, and NominalToString.]

    Note: when ClassA "contains" a number of ClassB, Weka probably implements this as ClassA maintaining a "FastVector" whose elements are instances of ClassB.

    Figure 3.1: Partial domain model for Weka for basic text analysis
  • 10. Chapter 4: Requirements Specifications

    Here we first explain in detail the major features of our framework.

    • Simplified. APIs are clear, consistent, and straightforward. Users with reasonable Java programming knowledge and basic machine learning concepts can learn our package without much effort, understand its logical flow quickly, get started in a short time, and finish the most common tasks with a few lines of code. Since our framework is not designed for general purposes or for comprehensive features, there is room for us to simplify the APIs and optimize them for the most typical and frequent operations.

    • Reusable. Built-in modular support is provided for the core routines across the various phases of text analysis, including text format transformation, linguistic processing, machine learning, and experimental evaluation. Additional functionality can be added on top of the core framework easily, and user-defined specifications are pluggable. Existing code can be used across environments and can interoperate with related external packages, such as Weka, MinorThird, and OpenNLP. (I use "reusable" instead of "extendable" because it covers a higher-level concept that we might also need and be able to follow. What is your idea?)

    • To be added

    4.1 Functional Requirements

    In this section, we define the most common use cases of our framework and describe them at the level of detail of a casual use case. The "functional requirements" of this project are that users can complete these use cases more easily and comfortably with the libraries provided by our framework than without them.

    Actors

    Since our framework assumes that all users of interest are programming against our APIs, there is only one human actor role, namely the programmer. This human actor is always the primary actor. There are some possible secondary and system actors, namely the external packages our framework integrates, depending on which specific use case the primary actor is performing.

    Casual Use Cases

    Here we present some typical use cases of our framework in a casual format. For better understanding and separation of responsibilities, use cases are divided into categories, where each category defines a typical step of doing text analysis.
  • 11. • Dataset importing and exporting. In this category of use cases, a user wants to read files from different kinds of sources, in different kinds of formats, into specific data structures that represent a dataset in memory for further processing, or to write a dataset to files in another format. Some important sample use cases:
      1. Use case 1. Read a list of raw text files placed in a specified directory of the local file system into a RawCorpus in which each RawDocument represents a text file.
      2. Use case 2. Read a list of HTML files placed in a specified directory of the local file system, strip the tags, and store the result in a RawCorpus in which each RawDocument represents an HTML file.
      3. Use case 3. Read an XML file with a non-Unicode encoding from the Web, specified by a URL, into a RawDocument with its fields appropriately populated.

    • Object persistence. In this category of use cases, a user wants to persist framework objects to disk in our internal format so that they can be loaded later.
      1. Use case 1.
      2. Use case 2.
      3. Use case 3.

    • Structured information extraction.
      1. Use case 1.
      2. Use case 2.
      3. Use case 3.

    • Linguistic preprocessing.
      1. Use case 1.
      2. Use case 2.
      3. Use case 3.

    • Machine learning.
      1. Use case 1.
      2. Use case 2.
      3. Use case 3.

    • Experiment and evaluation.
      1. Use case 1.
      2. Use case 2.
      3. Use case 3.
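The first dataset-importing use case (raw text files in a local directory read into a RawCorpus of RawDocument objects) might look roughly like the sketch below. The names RawCorpus and RawDocument come from the use case text, but their fields and methods, the `TextDirectoryReader` class, the `*.txt` file filter, and the UTF-8 encoding are all assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical shape of a document: a file name plus its raw text.
final class RawDocument {
    final String name;
    final String text;

    RawDocument(String name, String text) {
        this.name = name;
        this.text = text;
    }
}

// Hypothetical shape of a corpus: an ordered collection of documents.
final class RawCorpus {
    private final List<RawDocument> documents = new ArrayList<>();

    void add(RawDocument document) {
        documents.add(document);
    }

    List<RawDocument> documents() {
        return Collections.unmodifiableList(documents);
    }

    int size() {
        return documents.size();
    }
}

// Hypothetical adaptor: one RawDocument per *.txt file in the directory.
final class TextDirectoryReader {
    private TextDirectoryReader() {}

    static RawCorpus read(Path directory) throws IOException {
        RawCorpus corpus = new RawCorpus();
        // The glob filter and UTF-8 decoding are assumptions of this sketch.
        try (DirectoryStream<Path> files = Files.newDirectoryStream(directory, "*.txt")) {
            for (Path file : files) {
                String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
                corpus.add(new RawDocument(file.getFileName().toString(), text));
            }
        }
        return corpus;
    }
}
```

A call such as `TextDirectoryReader.read(Paths.get("data"))` would then return the corpus in one line; the iteration order of the directory stream is not specified, so callers should not rely on document order.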
  • 12. 4.2 Non-functional Requirements

    • Open source. It should be made available for public collaboration, allowing users to use, change, improve, and redistribute the software.

    • Portability. It should install, configure, and run consistently across different platforms, given that it is designed and implemented on the Java runtime environment.

    • Documentation. Its code should be readable and self-explanatory, with clear and unambiguous documentation for critical or tricky parts. It should include an introductory guide to help users get started and, preferably, sample datasets, tutorials, and demos so that users can run examples out of the box.

    • Performance. It should respond to the user within a reasonable amount of time given a limited amount of data (unclear, needs to be specified). Preferably, it can estimate the running time needed to perform a task and notify the user before the task is actually executed (is this the responsibility of the framework designers?).

    • Dependency. This is an open issue. The package integrates other external packages and has many dependencies. How do we resolve this? How do we distribute our package?