2. What is Text Classification
Text classification uses machine learning and NLP
to assign documents to separate categories
based on the linguistic features present
in the documents.
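To make "linguistic features" concrete, here is a minimal sketch of one common choice, bag-of-words counts. The object name, the helper names and the stopword list are illustrative assumptions and are not part of Nak.

```scala
object BagOfWordsSketch {
  // Illustrative stopword list (an assumption, not taken from any library).
  val stopwords: Set[String] = Set("the", "a", "an", "of", "in", "for", "by", "on")

  /** Lowercase, split on non-letter characters, and drop empty tokens and stopwords. */
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("[^a-z]+").toSeq.filter(t => t.nonEmpty && !stopwords(t))

  /** Count how often each remaining token occurs in the text. */
  def features(text: String): Map[String, Int] =
    tokenize(text).groupBy(identity).map { case (tok, occ) => (tok, occ.size) }
}
```

A classifier then works on these token counts rather than on the raw string.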
4. Classifier Training
Classifier Training
Training set -> Feature vectors of all instances -> Algorithm (some ML classification algorithm) -> Model file built
Test a Single Instance
Read a string and classify it
Prediction
Reads a string, preprocesses it, and classifies it as a particular class.
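The train-then-predict flow on this slide can be sketched in plain Scala. The slide does not name an algorithm, so this sketch assumes a multinomial Naive Bayes with add-one smoothing as a stand-in; all names here are hypothetical, and this is not what the Nak demo later in the deck uses.

```scala
object NaiveBayesSketch {
  /** Trained model: documents per label, token counts per label, and the vocabulary. */
  final case class Model(
      docCounts: Map[String, Int],
      tokenCounts: Map[String, Map[String, Int]],
      vocabulary: Set[String])

  /** Preprocessing: lowercase and split on non-letter characters. */
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("[^a-z]+").toSeq.filter(_.nonEmpty)

  /** Training: accumulate counts from (label, text) pairs. */
  def train(examples: Seq[(String, String)]): Model = {
    val byLabel = examples.groupBy(_._1)
    val docCounts = byLabel.map { case (label, exs) => (label, exs.size) }
    val tokenCounts = byLabel.map { case (label, exs) =>
      val tokens = exs.flatMap { case (_, text) => tokenize(text) }
      (label, tokens.groupBy(identity).map { case (t, occ) => (t, occ.size) })
    }
    Model(docCounts, tokenCounts, tokenCounts.values.flatMap(_.keys).toSet)
  }

  /** Prediction: preprocess the string, then pick the label with the highest log score. */
  def predict(model: Model, text: String): String = {
    val totalDocs = model.docCounts.values.sum.toDouble
    val tokens = tokenize(text)
    def score(label: String): Double = {
      val counts = model.tokenCounts(label)
      val total = counts.values.sum.toDouble
      val prior = math.log(model.docCounts(label) / totalDocs)
      // Add-one smoothing so a token unseen for this label does not zero out the score.
      prior + tokens.map { t =>
        math.log((counts.getOrElse(t, 0) + 1.0) / (total + model.vocabulary.size))
      }.sum
    }
    model.docCounts.keys.maxBy(score)
  }
}
```

Calling `train` on a handful of `(label, text)` pairs and then `predict` on a new string mirrors the "Test a Single Instance" box. The sketch expects at least one training example, since `maxBy` fails on an empty model.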
5. Testing the Model: Cross Validation
Load test set data
Randomize data
Cross validation? (Yes / No)
Preprocessing
Tokenize
Confusion matrix results
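The randomize and cross-validation steps can be sketched as a k-fold split. This is a minimal illustration under assumed names (`folds`, `splits`, a fixed seed of 42); it is not how any particular library implements cross validation.

```scala
import scala.util.Random

object CrossValidationSketch {
  /** Shuffle the data with a fixed seed, then deal it into k folds round-robin. */
  def folds[A](data: Seq[A], k: Int, seed: Long = 42L): Seq[Seq[A]] = {
    require(k >= 2 && k <= data.size, "k must be between 2 and the data size")
    val shuffled = new Random(seed).shuffle(data)
    (0 until k).map(i => shuffled.zipWithIndex.collect { case (x, j) if j % k == i => x })
  }

  /** For each fold, yield (training set, held-out test set). */
  def splits[A](data: Seq[A], k: Int, seed: Long = 42L): Seq[(Seq[A], Seq[A])] = {
    val fs = folds(data, k, seed)
    fs.indices.map { i =>
      // Everything outside fold i is training data; fold i is held out.
      val train = fs.zipWithIndex.collect { case (f, j) if j != i => f }.flatten
      (train, fs(i))
    }
  }
}
```

Each example ends up in exactly one held-out fold, so every instance is used for testing once and for training k - 1 times.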
6. Demo using Scala Breeze/Nak
For our demo we are using this SBT dependency
libraryDependencies += "org.scalanlp" % "nak" % "1.1.3"
https://github.com/scalanlp/nak
7. Simple Example for Training and Evaluation
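// Imports as used in Nak's 20 Newsgroups example
import nak.NakContext._
import nak.core._
import nak.data._
import nak.liblinear.LiblinearConfig
import nak.util.ConfusionMatrix
import java.io.File

object TwentyNewsExample extends App
{
// Root directory of the 20 Newsgroups corpus
val directoryLocation = "/media/home/Work/Backup/365mediaBackup/corpus/20news-bydate"
val newsgroupsDir = new File(directoryLocation)
implicit val isoCodec = scala.io.Codec("ISO-8859-1")
val stopwords = Set("the","a","an","of","in","for","by","on")
val trainDir = new File(newsgroupsDir, "small_train")
val trainingExamples = fromLabeledDirs(trainDir).toList
val featurizer = new BowFeaturizer(stopwords)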
8. Training and Evaluation code
//Training Process
val config = LiblinearConfig(cost=5.0)
val classifier = trainClassifier(config, featurizer, trainingExamples)
println("training done ")
//Evaluation Process
val evalDir = new File(newsgroupsDir, "small_test")
val maxLabelNews = maxLabel(classifier.labels) _
val comparisons = for (ex <- fromLabeledDirs(evalDir).toList) yield
(ex.label, maxLabelNews(classifier.evalRaw(ex.features)), ex.features)
val (goldLabels, predictions, inputs) = comparisons.unzip3
println(ConfusionMatrix(goldLabels, predictions, inputs))
}
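To show what the evaluation step summarizes, here is a minimal sketch that counts (gold, predicted) pairs and computes accuracy. It is an illustration with assumed names, not Nak's `ConfusionMatrix` implementation.

```scala
object ConfusionSketch {
  /** Count (gold, predicted) pairs; pairs with equal labels are correct predictions. */
  def confusion(gold: Seq[String], predicted: Seq[String]): Map[(String, String), Int] = {
    require(gold.size == predicted.size, "need one prediction per gold label")
    gold.zip(predicted).groupBy(identity).map { case (pair, occ) => (pair, occ.size) }
  }

  /** Fraction of examples whose predicted label equals the gold label. */
  def accuracy(gold: Seq[String], predicted: Seq[String]): Double = {
    require(gold.nonEmpty && gold.size == predicted.size, "need matching, non-empty label lists")
    gold.zip(predicted).count { case (g, p) => g == p }.toDouble / gold.size
  }
}
```

Off-diagonal entries such as `("sports", "space")` show which classes the model confuses with each other, which a single accuracy number hides.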
9. Code and dataset will be available
https://github.com/rsudharshan/DataScienceWithScala
Also on the Nak GitHub page