Helen Bailey and Sands Fish, MIT Libraries
Text Analysis Methods for Digital Humanities
Workshop Exercise
MALLET
Pre-workshop, students should download and install the MALLET GUI on their laptops.
https://code.google.com/p/topic-modeling-tool/
They should also run the topic modeler on a sample text file with the default settings to make
sure it’s working correctly.
Helpful MALLET Resources
• MALLET GUI information
• Blog post on using the GUI and displaying output in Gephi
• Using MALLET on the command line
• Intro to Topic Modeling in general
• MALLET website
• Review of MALLET in Journal of Digital Humanities
• Using HO-LDA / Finding Number of Topics in Emergency Text classification
In-Class Exercise
1. Run MALLET on a known corpus. The demo used full-text examples, all from Project
Gutenberg: Adventures of Huckleberry Finn, Alice’s Adventures in Wonderland,
Andersen’s Fairy Tales, Grimm’s Fairy Tales, Life on the Mississippi, On the Origin of
Species, The Wizard of Oz.
2. Change the parameters to see how they impact the results. For example:
• Does preserving case matter?
• How does changing the number of iterations impact the results?
• What about changing the topic proportion threshold?
• How many topic words should you print? (What are you trying to discover? How
much info is useful?)
• What do the results tell you about this corpus? How could you use this to learn
about a corpus you weren’t familiar with?
3. MALLET implements the LDA (latent Dirichlet allocation) algorithm; discuss its details
briefly. Contrast it with hierarchical topic modeling (not available via MALLET).
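
To make the discussion of LDA and its parameters concrete, here is a minimal sketch of collapsed Gibbs sampling for LDA in plain Python. It is not MALLET's implementation, which is written in Java and far more optimized. The function names, the toy corpus, and the values chosen for alpha and beta are assumptions made for illustration. The sketch exposes the same knobs the exercise asks about: number of topics, number of iterations, number of topic words printed, and a topic proportion threshold.

```python
import random
from collections import Counter

def lda_gibbs(docs, num_topics, iterations, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized documents (illustrative only)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    # Count tables: topics per document, words per topic, tokens per topic.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    # Start from a random topic assignment for every token.
    assign = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(z_doc)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assign[d][i]
                # Remove this token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Unnormalized probability of each topic given all other tokens.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word

def top_words(topic_word, n):
    """The n most frequent words assigned to each topic."""
    return [[w for w, c in tw.most_common(n) if c > 0] for tw in topic_word]

def topic_proportions(doc_topic, threshold=0.05):
    """Per-document topic shares, keeping only shares at or above the threshold."""
    result = []
    for counts in doc_topic:
        total = sum(counts) or 1
        result.append({k: c / total for k, c in enumerate(counts) if c / total >= threshold})
    return result

# Toy corpus (hypothetical); real input would be the tokenized Gutenberg texts.
toy_docs = [
    ["river", "raft", "river", "steamboat"],
    ["species", "selection", "variation", "species"],
    ["raft", "river", "pilot"],
]
toy_doc_topic, toy_topic_word = lda_gibbs(toy_docs, num_topics=2, iterations=50)
print(top_words(toy_topic_word, 3))
print(topic_proportions(toy_doc_topic, threshold=0.2))
```

Rerunning with a different `iterations`, `num_topics`, or `seed` shows how sensitive the topics are to those choices, which is the point of step 2 above.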
Stanford Named Entity Recognizer
Pre-workshop, students should download and install the SNER GUI on their laptops.
http://nlp.stanford.edu/software/CRF-NER.shtml#Download
They should also run SNER on a sample file using the default classifier (or, if that’s not
available, the first classifier in the classifier folder), to make sure it’s working correctly.
Helpful SNER Resources
• Basic GUI tutorial
In-Class Exercise
• Run SNER on a known corpus. Change the classifier to see if results differ.
• Save tagged file output and open. What do you then need to do with that to make it
useful?
• Discuss the difference between entity extraction and entity disambiguation.
• Do we want to have them run the output through a concordance program?
• What might you do with this data? How could it interact with other tools to tell the
narrative?
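
One possible answer to "what do you then need to do with that" is to pull the tagged entities out of the saved file and count them. The sketch below assumes the tagged output uses inline XML-style tags such as `<PERSON>…</PERSON>`; the actual format depends on the output options in use, so treat the pattern as an assumption. The function names and the sample sentence are hypothetical.

```python
import re
from collections import Counter

# Assumed tag format: <TYPE>entity text</TYPE>, with TYPE in capitals.
TAG_PATTERN = re.compile(r"<([A-Z_]+)>(.*?)</\1>", re.DOTALL)

def extract_entities(tagged_text):
    """Return (entity_type, entity_text) pairs in order of appearance."""
    return [
        (match.group(1), " ".join(match.group(2).split()))  # collapse whitespace
        for match in TAG_PATTERN.finditer(tagged_text)
    ]

def count_entities(tagged_text):
    """Count how often each (type, text) pair occurs."""
    return Counter(extract_entities(tagged_text))

# Hypothetical sample of tagged output.
sample = (
    "<PERSON>Huck Finn</PERSON> floated down the <LOCATION>Mississippi</LOCATION> "
    "with <PERSON>Jim</PERSON>. <PERSON>Huck Finn</PERSON> never saw "
    "<LOCATION>St. Petersburg</LOCATION> again."
)
for (entity_type, text), count in count_entities(sample).most_common():
    print(entity_type, text, count)
```

Counting by surface string is extraction only: it does not decide whether "Huck" and "Huck Finn" are the same person, which is the disambiguation problem raised above.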
CLAVIN
• CLAVIN Tool by Berico Technologies
o Cartographic Location And Vicinity INdexer
• MIT Center for Civic Media open source CLAVIN Server for doing geo-parsing via HTTP
o Includes special "civic sauce" for determining the "aboutness" of a document,
narrowing down to the most likely place a document is talking about.
o According to Civic, this is the best quality geo-parsing service outside of Yahoo's
pay service.
• Uses Apache OpenNLP for location entity extraction under the hood.
Setup
1. Download source from https://github.com/sandsfish/CLAVIN-Server
2. Follow the instructions in the readme to build and set up the tool.
Evaluating Assumptions
● We’re providing sample text to work with. What do you already know about it? What do
you know from the data itself, and what information are you lacking?
● What characteristics of the sample data are likely contributing to the results you get from
these tools? (Lack of pre-processing, for example)
● Note how long it takes for these tools to run. Consider the size of the data set we’re
working with versus the size of possible data sets you may be interested in.
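
As an illustration of the pre-processing point, here is a small sketch that strips Project Gutenberg license boilerplate and tokenizes the remaining text, with an option to preserve case. The marker patterns are an assumption (the exact wording of the start and end lines varies between files), and the stopword list and function names are hypothetical placeholders.

```python
import re

# Assumed marker lines, e.g. "*** START OF THE PROJECT GUTENBERG EBOOK ... ***".
START_MARKER = re.compile(r"\*\*\*\s*START OF.*?\*\*\*", re.IGNORECASE | re.DOTALL)
END_MARKER = re.compile(r"\*\*\*\s*END OF.*?\*\*\*", re.IGNORECASE | re.DOTALL)

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "it", "was"}

def strip_gutenberg_boilerplate(text):
    """Keep only the text between the start and end markers, if they are present."""
    start = START_MARKER.search(text)
    if start:
        text = text[start.end():]
    end = END_MARKER.search(text)
    if end:
        text = text[:end.start()]
    return text.strip()

def tokenize(text, preserve_case=False, stopwords=STOPWORDS):
    """Split text into word tokens and drop stopwords."""
    if not preserve_case:
        text = text.lower()
    tokens = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    return [t for t in tokens if t.lower() not in stopwords]

# Hypothetical miniature file with header and footer boilerplate.
raw = (
    "Title: Example\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "The River was wide. The river was slow.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "License text"
)
body = strip_gutenberg_boilerplate(raw)
print(sorted(set(tokenize(body))))
print(sorted(set(tokenize(body, preserve_case=True))))
```

Comparing the two printed vocabularies shows how preserving case splits "River" and "river" into separate words, and running the tools with and without the license text removed shows how much boilerplate can leak into the results.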
