3. MALLET
1 Introduction to MALLET
MALLET is a Java-based package for statistical natural language processing,
document classification, clustering, topic modeling, information extraction,
and other machine learning applications to text.
2 Where do we use MALLET?
1. Historical Topics and Trends
Our aim here is to automatically discover the general topics that appear
in a large newspaper corpus. MALLET is run over the articles from a period
of interest to find the top general topic groups. For example, if we wish to
know the top ten topic groups between the years 1965-1901, MALLET is
run on the articles from those years. In addition, we can find the topics
most strongly associated with a particular word, say "iron": we extract 5
lines on each side of every line containing "iron" and run MALLET again
on the extracted text to find the top general topic groups.
2. Detect spam mails
We can use the document classification capabilities of MALLET to
detect spam mails. A simple example of this would be a spam classifier
like you’d find in your email inbox. Since we know what good mail
looks like, and since we know what spam typically looks like, we can
craft a Naive Bayes classifier to make a statistical approximation as to
whether or not a new message is spam.
3. Extract important information
We can use the sequence tagging functionality that MALLET provides
to extract important information from data. By employing named-
entity recognition techniques, we can figure out exactly what a docu-
ment is talking about without having to read through the entire text
ourselves. Imagine someone hands you a book and asks you for all the
characters and locations featured throughout the text. Using named-
entity recognition, a computer can accomplish that task in mere seconds
as compared to the hours it would take a human.
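The context extraction described under Historical Topics and Trends (keeping 5 lines on each side of every line that mentions a word) can be sketched in a few lines of Python. This is a minimal illustration, not part of MALLET; the function name context_windows and the radius parameter are assumptions made for the example.

```python
def context_windows(lines, keyword, radius=5):
    """Return the lines within `radius` lines of any line containing `keyword`.

    Matching is a case-insensitive substring test, so "iron" also matches
    "ironworks"; a stricter word-boundary match may be preferable in practice.
    """
    keep = set()
    needle = keyword.lower()
    for i, line in enumerate(lines):
        if needle in line.lower():
            lo = max(0, i - radius)
            hi = min(len(lines), i + radius + 1)
            keep.update(range(lo, hi))
    # Overlapping windows are merged and the original line order is preserved.
    return [lines[i] for i in sorted(keep)]


if __name__ == "__main__":
    corpus = [f"line {n}" for n in range(20)]
    corpus[10] = "the iron works reopened"
    for line in context_windows(corpus, "iron"):
        print(line)
```

The extracted lines can then be written to a file and imported into MALLET like any other document.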
3 Getting Started
3.1 Installing MALLET
1. Download the latest version of MALLET from
http://mallet.cs.umass.edu/download.php
2. To build MALLET 2.0, you must have Apache Ant. You can download
it from http://ant.apache.org/
3. Set the environment variables pointing to Java Home, Ant Home
and Mallet Home (the MALLET directory).
4. Change to the MALLET directory and type:
ant
Example: C:\Users\VAIO\WebIR\mallet-2.0.7>ant
If ant finishes with "BUILD SUCCESSFUL", MALLET is ready
to use.
3.2 Using the Script
MALLET is driven by a command-line script named mallet. If you installed
MALLET in the directory WebIR\mallet-2.0.7, this script is in
WebIR\mallet-2.0.7\bin. If the current working directory is the MALLET
directory, you can invoke the script in this pattern:
bin\mallet [command] --option value --option value ...
Type bin\mallet to get a list of commands. Use the option --help with any
command to get a description of its valid options.
4 Importing Data files
To import a data file use the command:
bin\mallet import-file --input [filename]
--output [output filename] [options]
Similarly, to import an entire directory use:
bin\mallet import-dir --input [dir path]
--output [output filename] [options]
For example:
bin\mallet import-file --input sample-data\web\en\hill.txt
--output output.mallet
In the above example, the input data is hill.txt and the imported data is
written to the file output.mallet.
bin\mallet import-dir --input sample-data\web\*
--output output.mallet
In the above example, the input data is the set of folders inside the web
folder, and the imported data is written to the file output.mallet. In both
cases, stopwords are removed only if the --remove-stopwords option is given.
For more options, use the help by typing:
bin\mallet import-file --help or
bin\mallet import-dir --help
5 Natural Language Processing
MALLET includes routines for transforming text documents into numerical
representations that can then be processed efficiently. This process is
implemented through a flexible system called "pipes", which handle distinct
tasks such as tokenizing strings, removing stopwords, and converting
sequences into count vectors. MALLET reads Unicode text, so we can use
files in various languages and give MALLET rules for processing them. We
can use regular expressions to tokenize word segments in any language. For
example, if we type in
bin\mallet import-file --input sample-data\web\en\hill.txt
--output output.mallet --print-output --remove-stopwords
MALLET removes the stopwords, prints the output, and also writes the
output to the output.mallet file. A sample output
with and without removing stopwords is shown below:
(a) Without removing stopwords (b) Removing stopwords
Figure 1: Natural Language Processing using MALLET
The above figure shows MALLET's support for the English language.
In the snapshot, a simple text file "hill.txt" written in English is
imported. Each word is numbered, and its number of occurrences is also
shown. MALLET recognizes the stopwords, which can be kept in or removed
from the output file as the user requires. Currently, Chinese and Japanese
are the only languages whose text MALLET does not support.
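The idea of chaining pipes (tokenize, remove stopwords, convert to a count vector) can be illustrated outside MALLET with a short Python sketch. It does not reproduce MALLET's own pipe classes; the stopword list, the token pattern, and all function names here are assumptions chosen for the example.

```python
import re
from collections import Counter

# Illustrative stopword list only; MALLET ships its own, much longer list.
STOPWORDS = {"the", "a", "an", "and", "of", "in", "is", "to"}

# Unicode-aware token pattern: runs of letters in any script
# (word characters minus digits and underscore).
TOKEN = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Pipe 1: split raw text into lowercase word tokens."""
    return [token.lower() for token in TOKEN.findall(text)]


def remove_stopwords(tokens):
    """Pipe 2: drop tokens that appear in the stopword list."""
    return [token for token in tokens if token not in STOPWORDS]


def count_vector(tokens, vocabulary):
    """Pipe 3: map tokens to integer ids and count occurrences per id.

    `vocabulary` is a dict shared across documents; new words get the next id.
    """
    counts = Counter()
    for token in tokens:
        index = vocabulary.setdefault(token, len(vocabulary))
        counts[index] += 1
    return dict(counts)


def run_pipes(text, vocabulary):
    """Apply the three pipes in order."""
    return count_vector(remove_stopwords(tokenize(text)), vocabulary)


if __name__ == "__main__":
    vocab = {}
    print(run_pipes("The hill and the fort in the hill", vocab))
    print(vocab)
```

Sharing one vocabulary across documents keeps the word ids consistent, which is what lets count vectors from different documents be compared.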
6 Document classification
A classifier is an algorithm that distinguishes between a fixed set of classes,
such as "spam" vs. "non-spam", based on some previous training (note that
MALLET is also a machine learning tool). MALLET includes implementations
of several classification algorithms, among them Naive Bayes, Maximum
Entropy, and Decision Trees.
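To make the idea of a trained classifier concrete, here is a minimal multinomial Naive Bayes sketch in Python with add-one smoothing. It is an illustration of the general technique under simple assumptions, not MALLET's implementation; the class name, method names, and the toy training data are invented for the example.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Minimal multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_counts = Counter()            # documents seen per label
        self.word_counts = defaultdict(Counter)  # word counts per label
        self.vocabulary = set()

    def train(self, documents):
        """documents: iterable of (list_of_tokens, label) pairs."""
        for tokens, label in documents:
            self.class_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)

    def classify(self, tokens):
        """Return the label with the highest log posterior score."""
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            score = math.log(doc_count / total_docs)  # log prior
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                count = self.word_counts[label][token]  # 0 if unseen
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    classifier = NaiveBayes()
    classifier.train([
        (["win", "money", "now"], "spam"),
        (["cheap", "money", "offer"], "spam"),
        (["meeting", "at", "noon"], "non-spam"),
        (["lunch", "meeting", "tomorrow"], "non-spam"),
    ])
    print(classifier.classify(["money", "offer"]))
```

Working in log space avoids numeric underflow when many small word probabilities are multiplied together.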
To get started with the document classifier, first load the data into
MALLET format. Then follow these steps:
1. Train the classifier:
Suppose you have a MALLET data file called train.mallet. Use the
command:
bin\mallet train-classifier --input train.mallet
--output-classifier my.classifier
2. Choose the algorithm:
The default classification algorithm is Naive Bayes. To select
a different algorithm, use the --trainer option. For example, to use
the MaxEnt algorithm, use the following command:
bin\mallet train-classifier --input train.mallet
--output-classifier my.classifier --trainer MaxEnt
You can also try NaiveBayes, C45, or DecisionTree.
To compare multiple training algorithms, use the following command:
bin\mallet train-classifier --input labeled.mallet
--training-portion 0.9 --trainer MaxEnt
--trainer NaiveBayes
This command will compare the MaxEnt and the NaiveBayes algorithms
on the same split of the data.
3. Evaluation:
If we wish to know whether the classifier produces good results on data
not used in the training, we can split a single set of instances into
training and testing lists. For this purpose, you can use a command like:
bin\mallet train-classifier --input labeled.mallet
--training-portion 0.9
This command will randomly split the data into 90% training instances,
which will be used to train the classifier, and the remaining 10% testing
instances. MALLET will use the classifier to predict the class labels
of the testing instances, compare those to the true labels, and report
results. You can also try the various training options listed in
the MALLET help.
For example, you can try the following command:
bin\mallet train-classifier --input web.mallet
--trainer MaxEnt --trainer NaiveBayes
--training-portion 0.9 --num-trials 10
This command will run 10 trials, in which the input data is randomly
split into 90% training instances and 10% testing instances. For each
trial, MALLET trains a MaxEnt classifier and a Naive Bayes classifier
on the training instances, then prints accuracy results and a matrix of
correct and predicted labels for each classifier. An illustration is shown
on the next page.
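The evaluation procedure described in the steps above (random split, prediction on the held-out part, accuracy and a matrix of true versus predicted labels, repeated over several trials) can be sketched in Python as follows. This only mirrors the logic, not MALLET's code; the function names, the seed handling, and the make_predictor callback are assumptions for the example.

```python
import random
from collections import Counter


def split_instances(instances, training_portion, seed):
    """Shuffle a copy of the instances and cut it into (train, test) lists."""
    rng = random.Random(seed)
    shuffled = list(instances)
    rng.shuffle(shuffled)
    cut = int(round(training_portion * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def evaluate(predict, test_instances):
    """Return (accuracy, confusion); confusion[(true, predicted)] is a count."""
    confusion = Counter()
    for features, true_label in test_instances:
        confusion[(true_label, predict(features))] += 1
    correct = sum(n for (true, pred), n in confusion.items() if true == pred)
    total = sum(confusion.values())
    return (correct / total if total else 0.0), confusion


def run_trials(instances, make_predictor, training_portion=0.9, num_trials=10):
    """Repeat split/train/evaluate and collect one accuracy per trial.

    `make_predictor` is a hypothetical callback: it receives the training
    instances and returns a function mapping features to a predicted label.
    """
    accuracies = []
    for trial in range(num_trials):
        train, test = split_instances(instances, training_portion, seed=trial)
        predict = make_predictor(train)
        accuracy, _ = evaluate(predict, test)
        accuracies.append(accuracy)
    return accuracies
```

Averaging the accuracies over several random splits gives a steadier estimate than a single split, which is the motivation for running more than one trial.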
7 Sequence Tagging
Sometimes we have a very large database of sequences of distinct values,
for example a large gene database, and want to label each element of a
sequence. MALLET includes implementations of widely used sequence
algorithms, including hidden Markov models (HMMs) and linear-chain
conditional random fields (CRFs). These algorithms support applications
such as gene finding and named-entity recognition.
Simple Tagger
SimpleTagger is a command-line interface to the MALLET CRF class. To
use it, each line of the input file should represent one token, in the
format:
feature1 feature2 ... featuren label
For example, write the following in a file named "sample" and put it in
the MALLET directory.
Kirti CAPITALIZED noun
slept non-noun
here LOWERCASE STOPWORD non-noun
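The line format above can be read with a few lines of Python, which may help when preparing or checking input files. This is only a sketch of the format as described here (whitespace-separated features followed by a label); the function names are assumptions, and blank lines are simply skipped in this example.

```python
def parse_tagger_line(line, labeled=True):
    """Split one token line into (features, label).

    With labeled=True the last field is the label and the rest are features.
    With labeled=False every field is a feature and the label is None.
    Returns None for a blank line.
    """
    fields = line.split()
    if not fields:
        return None
    if labeled:
        return fields[:-1], fields[-1]
    return fields, None


def parse_tagger_text(text, labeled=True):
    """Parse a whole file's text, skipping blank lines."""
    parsed = (parse_tagger_line(line, labeled) for line in text.splitlines())
    return [item for item in parsed if item is not None]


if __name__ == "__main__":
    sample = "Kirti CAPITALIZED noun\nslept non-noun\nhere LOWERCASE STOPWORD non-noun\n"
    for features, label in parse_tagger_text(sample):
        print(label, features)
```

Note that the word itself ("Kirti", "slept", "here") is just another feature; only the last field on a training line is treated as the label.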
To train the CRF, use the following command while in the MALLET
directory:
java -cp class;lib\mallet-deps.jar
cc.mallet.fst.SimpleTagger --train true
--model-file nouncrf sample
This command trains the CRF. The --train true option specifies that this
is a training run, and --model-file names the file in which the trained
CRF is saved. Here the CRF file nouncrf is created in the MALLET
directory itself, but any other location can be given as convenient.
Now that we have trained the CRF, we can test it by creating a
new file called "test". Inside this file, we write:
CAPITAL Al
slept
here
To label this file, we apply the CRF stored in nouncrf by
typing:
java -cp class;lib\mallet-deps.jar
cc.mallet.fst.SimpleTagger
--model-file nouncrf test
which produces the following output:
Number of predicates: 5
noun CAPITAL Al
non-noun slept
non-noun here
8 Topic Models
Topic models provide a simple way to analyze large volumes of unlabeled text.
A "topic" consists of a cluster of words that frequently occur together. Using
some contextual clues, the topic models can connect the words with similar
meanings and distinguish between uses of words with multiple meanings.
The first step in building a topic model is to import a set of
documents. Suppose we want to import the files in the folder "en". Type the
command:
bin\mallet import-dir
--input sample-data\web\en --output output.mallet
--keep-sequence --remove-stopwords
This command removes all the stopwords, keeps each document as an ordered
sequence of words (rather than an unordered count vector), and writes the
output to an "output.mallet" file in the MALLET directory.
Now, type in the command:
bin\mallet train-topics
--input output.mallet
--num-topics 100 --output-state topic-state.gz
Here --num-topics [NUMBER] sets the number of topics to use. The larger
the number, the more fine-grained the resulting topics. The --output-state
option writes a compressed text file containing the words in the corpus
with their topic assignments. This file format can easily be parsed and
used by non-Java-based software. Note that the state file will be gzipped,
so it is helpful to provide a filename that ends in .gz.
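As an illustration of parsing the state file from non-Java software, the following Python sketch counts the most frequent words assigned to each topic. The layout assumed here is not stated in this document and should be checked against a real file: lines beginning with "#" are treated as headers, and the last two whitespace-separated fields of every other line are taken to be the word and its topic number. The function name is also an assumption.

```python
import gzip
from collections import Counter, defaultdict


def top_words_per_topic(state_path, top_n=10):
    """Read a gzipped state file and return {topic: [most frequent words]}.

    Assumed layout (verify against your own file): header lines start with
    "#"; on data lines the last two fields are the word and its topic id.
    """
    counts = defaultdict(Counter)
    with gzip.open(state_path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue
            fields = line.split()
            word, topic = fields[-2], int(fields[-1])
            counts[topic][word] += 1
    return {
        topic: [word for word, _ in counter.most_common(top_n)]
        for topic, counter in counts.items()
    }


if __name__ == "__main__":
    import sys

    # Usage: python top_words.py topic-state.gz
    for topic, words in sorted(top_words_per_topic(sys.argv[1]).items()):
        print(topic, " ".join(words))
```

Opening the file in text mode with gzip.open avoids decompressing it to disk first, which matters because state files for large corpora can be big.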