Introduction to Text Classification using Scala
Sudharshan
What is Text Classification?
Text classification uses machine learning and NLP to assign documents to
separate categories based on the linguistic features present in the
documents.
Preprocessing
Input: a folder containing one subfolder of text files per class (e.g. Pos, Neg)
TextDirectoryLoader: produces instances with class and other attributes
Tokenizer and Vectorizer: produce instances with feature vectors
Output: processed training set in memory
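The tokenizer and vectorizer stages can be sketched in plain Scala as follows. This is only an illustration under stated assumptions, not the loader or vectorizer used in the slides: the object name `PreprocessingSketch`, the regex-based tokenizer, and the count-based bag-of-words vector are all hypothetical choices made for the example.

```scala
object PreprocessingSketch {
  // Hypothetical stopword list, for illustration only.
  val stopwords: Set[String] = Set("the", "a", "an", "of", "in", "for", "by", "on")

  // Tokenizer: lowercase, split on runs of non-letters, drop empties and stopwords.
  def tokenize(text: String): List[String] =
    text.toLowerCase.split("[^a-z]+").toList.filter(t => t.nonEmpty && !stopwords(t))

  // Vectorizer: map each token to its count (a sparse bag-of-words vector).
  def vectorize(tokens: Seq[String]): Map[String, Int] =
    tokens.groupBy(identity).map { case (token, occurrences) => token -> occurrences.size }
}
```

Calling `PreprocessingSketch.vectorize(PreprocessingSketch.tokenize(text))` on each file in a class folder would give one feature vector per instance.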
Classifier Training
Input: training set (feature vectors of all instances)
Classifier training algorithm: some ML classification algorithm
Output: model file built
Test a Single Instance
Read a string, preprocess it, and classify it as a particular class
(the prediction).
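To make the train-then-predict flow concrete, here is a minimal sketch that uses multinomial Naive Bayes with add-one smoothing. Naive Bayes is an assumption chosen for brevity; the slides only say "some ML classification algorithm". The names `NaiveBayesSketch`, `Model`, `train`, and `classify` are hypothetical, and the model is kept in memory rather than written to a file.

```scala
object NaiveBayesSketch {
  // A deliberately simple tokenizer: lowercase and split on runs of non-letters.
  def tokenize(text: String): List[String] =
    text.toLowerCase.split("[^a-z]+").toList.filter(_.nonEmpty)

  // The trained model: documents per class, token counts per class, vocabulary.
  case class Model(
      docCounts: Map[String, Int],
      tokenCounts: Map[String, Map[String, Int]],
      vocabulary: Set[String])

  // Training: examples are (label, text) pairs; count documents and tokens per class.
  def train(examples: Seq[(String, String)]): Model = {
    val byLabel = examples.groupBy(_._1)
    val docCounts = byLabel.map { case (label, docs) => label -> docs.size }
    val tokenCounts = byLabel.map { case (label, docs) =>
      val tokens = docs.flatMap { case (_, text) => tokenize(text) }
      label -> tokens.groupBy(identity).map { case (t, occ) => t -> occ.size }
    }
    Model(docCounts, tokenCounts, tokenCounts.values.flatMap(_.keys).toSet)
  }

  // Prediction: preprocess the string, then pick the class with the highest
  // log-probability, using add-one (Laplace) smoothing for unseen tokens.
  def classify(model: Model, text: String): String = {
    val totalDocs = model.docCounts.values.sum.toDouble
    val vocabSize = model.vocabulary.size
    val tokens = tokenize(text)
    model.docCounts.keys.maxBy { label =>
      val counts = model.tokenCounts(label)
      val totalTokens = counts.values.sum
      val logPrior = math.log(model.docCounts(label) / totalDocs)
      val logLikelihood = tokens.map { t =>
        math.log((counts.getOrElse(t, 0) + 1.0) / (totalTokens + vocabSize))
      }.sum
      logPrior + logLikelihood
    }
  }
}
```

Working in log space avoids numeric underflow when a document has many tokens.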
Testing the Model: Cross Validation
[Flowchart: Load test set data; Randomize data; Cross validation; Preprocessing / Tokenize (with a Yes/No decision branch); Confusion matrix results]
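The randomize, split, and confusion-matrix steps can be sketched as below. This is an illustration under assumptions, not the procedure from the slides: the object `CrossValidationSketch`, the round-robin fold assignment, and the fixed seed are hypothetical choices for the example.

```scala
import scala.util.Random

object CrossValidationSketch {
  // Shuffle the data with a fixed seed, then deal it into k folds round-robin.
  def folds[A](data: Seq[A], k: Int, seed: Long = 42L): Seq[Seq[A]] = {
    val shuffled = new Random(seed).shuffle(data.toList)
    (0 until k).map(i => shuffled.zipWithIndex.collect { case (x, j) if j % k == i => x })
  }

  // For each fold, yield (training set, held-out test set).
  def splits[A](data: Seq[A], k: Int): Seq[(Seq[A], Seq[A])] = {
    val fs = folds(data, k)
    fs.indices.map { i =>
      val training = fs.zipWithIndex.collect { case (fold, j) if j != i => fold }.flatten
      (training, fs(i))
    }
  }

  // Confusion matrix as counts keyed by (gold label, predicted label).
  def confusionMatrix(gold: Seq[String], predicted: Seq[String]): Map[(String, String), Int] =
    gold.zip(predicted).groupBy(identity).map { case (pair, occ) => pair -> occ.size }
}
```

Each instance lands in exactly one held-out fold, so collecting the predictions across all folds gives one confusion matrix over the whole data set.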
Demo using Scala Breeze/Nak
For our demo we use this SBT dependency:
libraryDependencies += "org.scalanlp" % "nak" % "1.1.3"
https://github.com/scalanlp/nak
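Simple Example for Training and Evaluation
// Imports as in the Nak TwentyNewsgroupsExample; verify against your Nak version.
import java.io.File
import nak.NakContext._
import nak.core._
import nak.data._
import nak.liblinear.LiblinearConfig
import nak.util.ConfusionMatrix

object TwentyNewsExample extends App {
  // Root of the 20 Newsgroups corpus; adjust to your local copy.
  val directoryLocation = "/media/home/Work/Backup/365mediaBackup/corpus/20news-bydate"
  val newsgroupsDir = new File(directoryLocation)

  // The 20 Newsgroups files are read with the ISO-8859-1 codec.
  implicit val isoCodec = scala.io.Codec("ISO-8859-1")

  // Small example stopword set; use a more extensive list in practice.
  val stopwords = Set("the", "a", "an", "of", "in", "for", "by", "on")

  // Each subdirectory of small_train is one class label.
  val trainDir = new File(newsgroupsDir, "small_train")
  val trainingExamples = fromLabeledDirs(trainDir).toList
  val featurizer = new BowFeaturizer(stopwords)

  // Training and Evaluation code
  // Training process
  val config = LiblinearConfig(cost = 5.0)
  val classifier = trainClassifier(config, featurizer, trainingExamples)
  println("training done")

  // Evaluation process: predict a label for each held-out example
  val evalDir = new File(newsgroupsDir, "small_test")
  val maxLabelNews = maxLabel(classifier.labels) _
  val comparisons = for (ex <- fromLabeledDirs(evalDir).toList) yield
    (ex.label, maxLabelNews(classifier.evalRaw(ex.features)), ex.features)
  val (goldLabels, predictions, inputs) = comparisons.unzip3
  println(ConfusionMatrix(goldLabels, predictions, inputs))
}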
Code and dataset will be available at
https://github.com/rsudharshan/DataScienceWithScala
and also on the Nak GitHub page.
Questions?
Thank You
