Introduction to text classification using scala

Sudharshan Rajendhiran

What is Text Classification
Text classification uses machine learning and NLP to assign documents to separate categories based on the linguistic features present in the documents.

Preprocessing
Flow: Folder containing one subfolder of text files per class (e.g. Pos, Neg) → TextDirectoryLoader → Instances with class and other attributes → Tokenizer → Vectorizer → Instances with feature vectors → Processed train set in memory
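
The tokenizer and vectorizer steps can be sketched in plain Scala. This is only an illustration of the idea, not the loader or featurizer used in the demo: the object name `BagOfWordsSketch`, its methods, and the tokenization rule (lowercase, split on non-letters) are assumptions made for this example.

```scala
object BagOfWordsSketch {
  // Hypothetical stopword list, mirroring the one used in the demo.
  val stopwords: Set[String] = Set("the", "a", "an", "of", "in", "for", "by", "on")

  // Tokenizer: lowercase, split on runs of non-letter characters, drop stopwords.
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("[^a-z]+").toSeq.filter(t => t.nonEmpty && !stopwords(t))

  // Build a vocabulary (token -> column index) from the training documents.
  def buildVocabulary(docs: Seq[String]): Map[String, Int] =
    docs.flatMap(tokenize).distinct.sorted.zipWithIndex.toMap

  // Vectorizer: count how often each vocabulary token occurs in a document.
  // Tokens outside the vocabulary are ignored.
  def vectorize(text: String, vocab: Map[String, Int]): Array[Int] = {
    val vector = new Array[Int](vocab.size)
    for (token <- tokenize(text); index <- vocab.get(token)) vector(index) += 1
    vector
  }
}
```

Building the vocabulary from the training documents only, and reusing it for test documents, keeps the feature columns identical between training and prediction.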

Classifier Training
Classifier training: the feature vectors of all training-set instances are passed to a training algorithm (some ML classification algorithm), which builds a model file.
Test a single instance: read a string, preprocess it, and classify it as a particular class (the prediction).
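
As one possible stand-in for "some ML classification algorithm", the sketch below trains a small multinomial Naive Bayes model and classifies a single string. Naive Bayes is chosen here only because it fits in a few lines; the demo itself uses Nak with a Liblinear configuration. All names (`NaiveBayesSketch`, `Model`, `train`, `classify`) are hypothetical.

```scala
object NaiveBayesSketch {
  // A trained model: per-class log priors and per-class token log likelihoods.
  final case class Model(
      logPrior: Map[String, Double],
      logLikelihood: Map[String, Map[String, Double]],
      vocabulary: Set[String])

  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("[^a-z]+").toSeq.filter(_.nonEmpty)

  // Training: estimate log P(class) and log P(token | class) with add-one smoothing.
  // Each example is a (label, text) pair.
  def train(examples: Seq[(String, String)]): Model = {
    val vocabulary = examples.flatMap { case (_, text) => tokenize(text) }.toSet
    val byLabel = examples.groupBy(_._1)
    val logPrior = byLabel.map { case (label, docs) =>
      label -> math.log(docs.size.toDouble / examples.size)
    }
    val logLikelihood = byLabel.map { case (label, docs) =>
      val tokens = docs.flatMap { case (_, text) => tokenize(text) }
      val counts = tokens.groupBy(identity).map { case (t, ts) => t -> ts.size }
      val total = tokens.size + vocabulary.size
      label -> vocabulary.map { t =>
        t -> math.log((counts.getOrElse(t, 0) + 1).toDouble / total)
      }.toMap
    }
    Model(logPrior, logLikelihood, vocabulary)
  }

  // Prediction: preprocess the string and return the class with the highest log score.
  // Tokens never seen in training are dropped.
  def classify(model: Model, text: String): String = {
    val tokens = tokenize(text).filter(model.vocabulary)
    model.logPrior.keys.maxBy { label =>
      model.logPrior(label) + tokens.map(model.logLikelihood(label)).sum
    }
  }
}
```

A call such as `NaiveBayesSketch.classify(model, "great film")` runs the same tokenizer that was used in training before scoring, which is the "preprocess and classify" step described above.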

Testing the Model: Cross Validation
Flow: Load test set data → Randomize data → Cross validation (Yes / No) → Preprocessing and tokenizing → Confusion matrix results
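
The randomize, cross-validate, and confusion-matrix steps can be sketched as follows. This is a generic k-fold illustration under assumed names (`CrossValidationSketch`, `kFolds`, `crossValidate`, `confusionMatrix`); it is independent of Nak, and the `train` argument stands for any function that turns labelled data into a predictor.

```scala
import scala.util.Random

object CrossValidationSketch {
  // Shuffle the data with a fixed seed, then deal it into k folds.
  def kFolds[A](data: Seq[A], k: Int, seed: Long): Seq[Seq[A]] = {
    val shuffled = new Random(seed).shuffle(data)
    (0 until k).map { fold =>
      shuffled.zipWithIndex.collect { case (item, index) if index % k == fold => item }
    }
  }

  // For each fold: train on the other folds, predict the held-out fold,
  // and collect (gold label, predicted label) pairs.
  def crossValidate[A](data: Seq[(String, A)], k: Int, seed: Long)(
      train: Seq[(String, A)] => (A => String)): Seq[(String, String)] = {
    val folds = kFolds(data, k, seed)
    folds.indices.flatMap { held =>
      val trainingSet = folds.indices.filter(_ != held).flatMap(i => folds(i))
      val predict = train(trainingSet)
      folds(held).map { case (gold, input) => (gold, predict(input)) }
    }
  }

  // Confusion matrix as counts keyed by (gold label, predicted label).
  def confusionMatrix(pairs: Seq[(String, String)]): Map[(String, String), Int] =
    pairs.groupBy(identity).map { case (pair, occurrences) => pair -> occurrences.size }
}
```

Fixing the seed makes the fold assignment reproducible, so two runs with different classifiers are evaluated on the same splits.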

Demo using Scala Breeze/Nak
For our demo we are using this SBT dependency
libraryDependencies += "org.scalanlp" % "nak" % "1.1.3"
https://github.com/scalanlp/nak

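Simple Example for Training and Evaluation
// Imports are assumed from the Nak example sources; the slide does not show them.
import java.io.File
import nak.NakContext._
import nak.core._
import nak.data._
import nak.liblinear.LiblinearConfig
import nak.util.ConfusionMatrix
object TwentyNewsExample extends App {
// Root of the 20 Newsgroups corpus; adjust to your local copy.
val directoryLocation = "/media/home/Work/Backup/365mediaBackup/corpus/20news-bydate"
val newsgroupsDir = new File(directoryLocation)
implicit val isoCodec = scala.io.Codec("ISO-8859-1")
val stopwords = Set("the","a","an","of","in","for","by","on")
val trainDir = new File(newsgroupsDir, "small_train")
val trainingExamples = fromLabeledDirs(trainDir).toList
val featurizer = new BowFeaturizer(stopwords)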

Training and Evaluation code
//Training Process
val config = LiblinearConfig(cost=5.0)
val classifier = trainClassifier(config, featurizer, trainingExamples)
println("training done ")
//Evaluation Process
val evalDir = new File(newsgroupsDir, "small_test")
val maxLabelNews = maxLabel(classifier.labels) _
val comparisons = for (ex <- fromLabeledDirs(evalDir).toList) yield
(ex.label, maxLabelNews(classifier.evalRaw(ex.features)), ex.features)
val (goldLabels, predictions, inputs) = comparisons.unzip3
println(ConfusionMatrix(goldLabels, predictions, inputs))
}
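
Given only the gold labels and predictions from the evaluation step, summary figures can be computed by hand. The sketch below is a plain-Scala illustration with hypothetical names (`EvaluationSketch`, `accuracy`, `precisionRecall`); it does not describe what Nak's `ConfusionMatrix` prints.

```scala
object EvaluationSketch {
  // Fraction of instances whose predicted label equals the gold label.
  def accuracy(gold: Seq[String], predicted: Seq[String]): Double = {
    require(gold.size == predicted.size && gold.nonEmpty,
      "need equally sized, non-empty label lists")
    gold.zip(predicted).count { case (g, p) => g == p }.toDouble / gold.size
  }

  // (precision, recall) for one label; 0.0 when the denominator is zero.
  def precisionRecall(gold: Seq[String], predicted: Seq[String], label: String): (Double, Double) = {
    val truePositives = gold.zip(predicted).count { case (g, p) => g == label && p == label }
    val predictedAsLabel = predicted.count(_ == label)
    val actuallyLabel = gold.count(_ == label)
    def ratio(n: Int, d: Int): Double = if (d == 0) 0.0 else n.toDouble / d
    (ratio(truePositives, predictedAsLabel), ratio(truePositives, actuallyLabel))
  }
}
```

With the demo's variables this would be called as `EvaluationSketch.accuracy(goldLabels, predictions)`, assuming both are sequences of label strings.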

Code and dataset will be available at
https://github.com/rsudharshan/DataScienceWithScala
and also on the Nak GitHub page.

Questions? Thank you.
