RapidMiner5 2.9 - Word vector tool and RapidMiner
Word Vector Tool
The Word & Web Vector Tool (WVTool) is a flexible Java library for statistical language modeling and for integrating Web and Web-service-based data sources. It supports the creation of word vector representations of text documents in the vector space model, which is the point of departure for many text processing applications.
Installation
1. Download the archive from the WVTool SourceForge website.
2. Put it into the lib/plugins directory of your RapidMiner installation, for example: D:\Program Files\Rapid-I\RapidMiner5\lib\plugins
Word Vector Tool
The aim of the WVTool is to provide a simple-to-use, simple-to-extend, pure Java library for text and web mining. It can easily be invoked from any Java application.
WVTool bridges the gap between highly sophisticated linguistic packages, such as the GATE system, on the one side and the many partial solutions that are part of diverse text and information retrieval applications on the other.
Functions
Word List
A word list contains all terms used for vectorization, together with some statistics (e.g. the number of documents in which a term appears). The word list is needed during vectorization to define which terms are considered dimensions of the vector space, and for weighting purposes.
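To make the idea of a word list with per-term statistics concrete, here is a minimal sketch in plain Java. It does not use the WVTool API; the class name, the method name, and the naive tokenizer (lower-casing and splitting on non-letters) are assumptions made only for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Illustrative only: a hypothetical helper, not part of WVTool.
class WordListSketch {

    // Maps each term to the number of documents in which it appears.
    static Map<String, Integer> documentFrequencies(List<String> documents) {
        Map<String, Integer> frequencies = new TreeMap<>();
        for (String document : documents) {
            // Assumed tokenizer: lower-case the text and split on non-letters.
            Set<String> termsInDocument =
                    new HashSet<>(Arrays.asList(document.toLowerCase().split("[^a-z]+")));
            termsInDocument.remove(""); // split() can yield a leading empty token
            for (String term : termsInDocument) {
                // Each document contributes at most 1 to a term's count.
                frequencies.merge(term, 1, Integer::sum);
            }
        }
        return frequencies;
    }

    public static void main(String[] args) {
        List<String> documents = Arrays.asList("the cat sat", "the dog sat down", "a cat");
        System.out.println(documentFrequencies(documents));
    }
}
```

The keys of the resulting map play the role of the vector space dimensions, and the counts are the kind of statistic a weighting scheme could draw on.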
WVTool Function Inputs
- An input list that tells the system which text documents to process.
- A configuration object that tells the system which methods to use in the individual steps.
Defining the Input
The input list tells the WVTool which texts should be processed. Every item in the list contains the following information:
- A URI
- The language the document is written in (optional)
- The type of the document (optional)
- The character encoding of the document, e.g. UTF-8 (optional)
- A class label
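The fields of an input list item can be pictured as a small data holder. The sketch below is a hypothetical class written for this illustration, not a WVTool type; the field names and the example URI are assumptions. The fields marked optional in the list above are modeled with java.util.Optional.

```java
import java.net.URI;
import java.util.Optional;

// Illustrative only: a hypothetical data holder, not a WVTool class.
class InputEntrySketch {
    final URI uri;                   // where the document lives
    final Optional<String> language; // optional
    final Optional<String> type;     // optional
    final Optional<String> encoding; // optional, e.g. "UTF-8"
    final String classLabel;         // e.g. "positive" or "negative"

    // Pass null for any optional field that is unknown.
    InputEntrySketch(URI uri, String language, String type, String encoding, String classLabel) {
        this.uri = uri;
        this.language = Optional.ofNullable(language);
        this.type = Optional.ofNullable(type);
        this.encoding = Optional.ofNullable(encoding);
        this.classLabel = classLabel;
    }

    public static void main(String[] args) {
        // Hypothetical document location; the language is left unspecified.
        InputEntrySketch entry = new InputEntrySketch(
                URI.create("file:///corpus/doc1.txt"), null, "txt", "UTF-8", "positive");
        System.out.println(entry.uri + " encoding=" + entry.encoding.orElse("unknown")
                + " label=" + entry.classLabel);
    }
}
```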
Using Predefined Word Lists
In some cases it is necessary to define the dimensions of the vector space exactly while still leaving the counting of terms and documents to the WVTool. This can be achieved by calling the word list creation function with a list of String values.
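The effect of a predefined word list can be sketched as follows: the caller fixes the terms, and only the counting is derived from the documents. This is plain Java with hypothetical names, not the WVTool word list creation function; the tokenizer and the assumption that the predefined terms are already lower-case are choices made for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only: a hypothetical helper, not part of WVTool.
class PredefinedWordListSketch {

    // Counts, for each predefined term, the number of documents containing it.
    // Terms that occur in the documents but not in the list are ignored.
    static Map<String, Integer> countDocuments(List<String> predefinedTerms, List<String> documents) {
        Set<String> dimensions = new LinkedHashSet<>(predefinedTerms); // drops duplicates, keeps order
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String term : dimensions) {
            counts.put(term, 0); // every predefined term is a dimension, even if never seen
        }
        for (String document : documents) {
            // Assumed tokenizer: lower-case the text and split on non-letters.
            Set<String> tokens =
                    new HashSet<>(Arrays.asList(document.toLowerCase().split("[^a-z]+")));
            for (String term : dimensions) {
                if (tokens.contains(term)) {
                    counts.merge(term, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countDocuments(
                Arrays.asList("cat", "bird"),
                Arrays.asList("the cat sat", "a cat", "the dog")));
    }
}
```

Note that "bird" stays in the result with a count of zero and "dog" never becomes a dimension, which is the point of fixing the list up front.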
Text Input
The TextInput operator creates an ExampleSet from a collection of texts. The output ExampleSet contains one row for each text document and one column for each term.
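The row-per-document, column-per-term layout can be illustrated with a small matrix. This sketch is not the TextInput operator and does not produce a RapidMiner ExampleSet; it uses raw term counts as cell values, which is an assumption (the operator may apply a different weighting), and all names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: a hypothetical helper, not the TextInput operator.
class TermDocumentMatrixSketch {

    // Returns a matrix with one row per document and one column per term.
    // Cell [row][col] holds how often term col occurs in document row.
    static int[][] termCounts(List<String> terms, List<String> documents) {
        int[][] matrix = new int[documents.size()][terms.size()];
        for (int row = 0; row < documents.size(); row++) {
            // Assumed tokenizer: lower-case the text and split on non-letters.
            for (String token : documents.get(row).toLowerCase().split("[^a-z]+")) {
                int col = terms.indexOf(token);
                if (col >= 0) { // tokens outside the term list are skipped
                    matrix[row][col]++;
                }
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("cat", "dog", "sat");
        List<String> documents = Arrays.asList("the cat sat", "the dog sat down", "a cat and a cat");
        for (int[] row : termCounts(terms, documents)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```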
Text Classification, Clustering and Visualization
For text classification, the class labels (e.g. positive, negative) are defined in the TextInput operator, as described above. After clustering or dimensionality reduction, text documents can be visualized directly in the RapidMiner Visualization panel.
Creating and Maintaining Word Lists
Creating an Initial Word List: An initial word list can be created by using the following chain of operators:
Applying a Word List: You can apply a word list in two ways:
1. To use the actual weights, first create word vectors using the TextInput operator and then use the AttributeWeightsLoader and AttributeWeightsApplier on the resulting ExampleSet.
2. To use the word list only as a selection of relevant terms and leave the weighting to TextInput, use the AttributeWeightsLoader before TextInput. TextInput will then create vectors whose dimensions are only those terms in the word list that have a weight larger than zero.
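The selection rule in the second way (keep only terms whose weight is larger than zero) can be sketched in a few lines. This is plain Java with hypothetical names and made-up weights; it does not call the AttributeWeightsLoader or any other RapidMiner operator.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a hypothetical helper, not a RapidMiner operator.
class WordListSelectionSketch {

    // Returns the terms that remain as vector dimensions: those with weight > 0.
    static List<String> selectDimensions(Map<String, Double> termWeights) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, Double> entry : termWeights.entrySet()) {
            if (entry.getValue() > 0.0) {
                selected.add(entry.getKey());
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Made-up weights for the example; LinkedHashMap keeps insertion order.
        Map<String, Double> weights = new LinkedHashMap<>();
        weights.put("cat", 0.8);
        weights.put("the", 0.0);
        weights.put("dog", 0.3);
        System.out.println(selectDimensions(weights));
    }
}
```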
Updating a Word List: If you add new documents to your corpus, additional terms will usually become relevant and should be added to the word list. After the InteractiveAttributeWeighting operator pops up, use the load function to load your original word list.
More Questions? Reach us at support@dataminingtools.net Visit: www.dataminingtools.net







