Text Analysis Methods for Digital Humanities
Helen Bailey and Sands Fish
MIT Libraries
Examples of Data Narratives
•  Visualizing Emancipation
•  Narrative Visualization of Whaling Ship Logs
•  Out of Sight, Out of Mind
Approaches to Storytelling with Data
•  EDA - Exploratory Data Analysis
•  Exploring data from a number of perspectives:
o  Temporal
o  Geographical
o  Statistical
o  Categorical
o  Relational
•  80% - Data Hacking, 20% - Narrative Construction, Visualization, etc.
"To use any sort of historical data, we must above all understand the
constraints under which it was collected. In this case, that means
retelling the history of why and how the ship's logs were first collected, and
how the constraints of digitization in the punch card era radically shape the
sort of evidence we can draw from them. The important thing about this sort
of work is that it helps us understand the overall biases of a particular
data set, which is crucial for limiting our interpretive leaps."
- Ben Schmidt, “Reading digital sources: a case study in ship's logs”
Inherent Biases & Limitations
•  Data capture methods and format
•  Purpose of data collection
•  Transformation over time
•  Authenticity and trust
Understand provenance
“Rather than replace humans, computers amplify human abilities. The
most productive line of inquiry, therefore, is not in identifying how automated
methods can obviate the need for researchers to read their text. Rather, the
most productive line of inquiry is to identify the best way to use both
humans and automated methods for analyzing texts.”
- Justin Grimmer and Brandon M. Stewart, “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts”
Acquiring Text
•  Full-text resources:
o  DSpace@MIT http://dspace.mit.edu/
o  Dome http://dome.mit.edu/
o  Digital Public Library of America http://dp.la
o  Europeana http://www.europeana.eu/portal/
o  HathiTrust http://www.hathitrust.org/
•  http://libguides.mit.edu/apis - metadata only
•  http://libguides.mit.edu/digitalhumanities
Data Management and Sharing
•  Sharing is assumed, and a data management plan is a funding requirement
•  Data storage options - anticipate interaction
o  Storage formats - non-proprietary and repurposable whenever possible
o  File system storage vs. database
•  Documentation of process
http://libraries.mit.edu/guides/subjects/data-management/
Formatting / Pre-Processing
•  Tool input requirements
•  Assumptions:
o  Text as a “bag of words”
o  Unigrams, bigrams
o  Word order (or not)
o  Stop words, capitalization, punctuation
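The pre-processing assumptions above can be made concrete with a small sketch. It is illustrative only: the stop-word list, the tokenizing pattern, and the function names are assumptions made for this example, not the behavior of any particular tool.

```python
import re
from collections import Counter

# Hypothetical stop-word list for illustration; real lists are much longer.
STOP_WORDS = {"the", "a", "of", "and", "to", "in"}

def tokenize(text, lowercase=True, drop_stop_words=True):
    """Split text into word tokens, discarding punctuation."""
    if lowercase:
        text = text.lower()
    tokens = re.findall(r"[A-Za-z0-9']+", text)
    if drop_stop_words:
        tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    return tokens

def ngrams(tokens, n):
    """Return the n-grams (as tuples) in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_words(tokens):
    """Count tokens, ignoring word order."""
    return Counter(tokens)

text = "The ship left the port. The port was quiet."
tokens = tokenize(text)
unigram_counts = bag_of_words(tokens)
bigram_counts = Counter(ngrams(tokens, 2))
```

Note that removing stop words and punctuation before building bigrams produces the pair ("port", "port") across a sentence boundary, which shows how each pre-processing choice shapes what a tool later sees.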
Featurizing Text
•  Each distinct word becomes a feature, also called a dimension
•  With thousands of distinct words, this is "high dimensional" data
•  Each document is represented as a vector in Euclidean space
•  Euclidean distance and angle are defined in any number of dimensions, not just 3
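A minimal sketch of this idea, assuming whitespace-separated words and raw counts as feature values (both are simplifications chosen for illustration): each distinct word in the corpus is one dimension, each document becomes a vector of counts, and the usual distance formula works for any number of dimensions.

```python
import math
from collections import Counter

def build_vocabulary(documents):
    """Sorted list of distinct words; each one is a dimension."""
    return sorted({word for doc in documents for word in doc.split()})

def vectorize(document, vocabulary):
    """Represent a document as word counts, one entry per dimension."""
    counts = Counter(document.split())
    return [counts[word] for word in vocabulary]

def euclidean_distance(u, v):
    """Straight-line distance; the formula is the same in 3 or 30,000 dimensions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical toy corpus.
docs = ["whale ship whale", "ship log", "whale ship whale"]
vocab = build_vocabulary(docs)
vectors = [vectorize(d, vocab) for d in docs]
```

Here documents 0 and 2 have identical vectors, so their distance is zero, while document 1 sits elsewhere in the same three-dimensional space.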
The Shape of Data
•  Data structures and formats
•  Informed (in part) by:
o  Tools
o  Co-occurrence
o  Data output formats
o  Entity type
o  Temporal, geographical perspective, etc.
Validation
From Ben Schmidt’s “Machine Learning at Sea”
Network Models
•  Representing data as a network
o  Types: technological, communication, transportation, energy, airplane routes, web linking patterns
o  Social
§  non-human animal interaction
§  membership in larger groups
§  spread of sexually transmitted diseases
§  co-authorship of scientific publications
§  trade agreements between nations
•  Mapping the News - Berkman's Controversy Work
o  Spidering
o  Influential actors over time
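As a rough illustration of representing data as a network, the sketch below builds a co-authorship graph as an adjacency structure and ranks authors by number of distinct collaborators. The author names are hypothetical, and degree is only one crude proxy for influence; it is not a description of the Berkman work.

```python
from collections import defaultdict
from itertools import combinations

def build_coauthor_network(papers):
    """Map each author to the set of people they have published with."""
    graph = defaultdict(set)
    for authors in papers:
        # Every pair of authors on the same paper shares an edge.
        for a, b in combinations(sorted(set(authors)), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph

def degree_ranking(graph):
    """Authors ordered by number of distinct co-authors, ties broken by name."""
    return sorted(graph, key=lambda node: (-len(graph[node]), node))

# Hypothetical author lists, one per paper.
papers = [["Ada", "Ben"], ["Ada", "Cy"], ["Ada", "Ben", "Dee"]]
graph = build_coauthor_network(papers)
ranking = degree_ranking(graph)
```

Running the same ranking on papers grouped by year would give a simple view of how the most connected actors change over time.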
Topic Modeling Tools
•  MALLET
o  Can run on unstructured plain text files
o  http://mallet.cs.umass.edu/topics.php
•  Stanford Topic Modeling Toolbox
o  Requires data in a CSV or TSV file
o  http://nlp.stanford.edu/software/tmt/tmt-0.4/
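Because the two tools expect different inputs, a corpus kept as plain text files may need converting to a tabular file. The sketch below writes one CSV row per text file. The `id` and `text` column names are assumptions made for this example; check the toolbox documentation for the layout it actually expects.

```python
import csv
import tempfile
from pathlib import Path

def texts_to_csv(input_dir, output_path):
    """Write one row (id, text) per .txt file in input_dir; return the row count."""
    rows = 0
    with open(output_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["id", "text"])
        for path in sorted(Path(input_dir).glob("*.txt")):
            text = path.read_text(encoding="utf-8")
            # Collapse whitespace so each document stays on one CSV row.
            writer.writerow([path.stem, " ".join(text.split())])
            rows += 1
    return rows

# Demonstration on a throwaway directory with two tiny documents.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "log1.txt").write_text("Sighted whales\noff the coast.", encoding="utf-8")
    Path(tmp, "log2.txt").write_text("Calm seas.", encoding="utf-8")
    out_path = Path(tmp, "corpus.csv")
    written = texts_to_csv(tmp, out_path)
    with open(out_path, newline="", encoding="utf-8") as handle:
        table = list(csv.reader(handle))
```

Using the file name as the identifier keeps each row traceable to its source document, which helps when documenting the process.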
Entity Extraction
•  Identifies known entities in specific categories
o  Locations
o  People
o  Organizations
o  Dates/times
•  Creates annotated text from unstructured text
•  Domain-specific
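To show what "annotated text from unstructured text" can look like, here is a toy dictionary-based tagger. It is a deliberate simplification: the gazetteer entries and the inline tag format are assumptions made for this example, and real recognizers typically rely on trained statistical models rather than a fixed lookup list.

```python
import re

# Hypothetical gazetteer: surface form -> category.
GAZETTEER = {
    "New Bedford": "LOCATION",
    "Herman Melville": "PERSON",
    "MIT Libraries": "ORGANIZATION",
}

def annotate(text, gazetteer):
    """Wrap every known entity in a tag named after its category."""
    # Longest names first so a longer entry wins over a shorter one it contains.
    names = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")

    def tag(match):
        name = match.group(1)
        category = gazetteer[name]
        return f"<{category}>{name}</{category}>"

    return pattern.sub(tag, text)

annotated = annotate("Herman Melville sailed from New Bedford.", GAZETTEER)
```

A name missing from the gazetteer passes through untagged, which is one way to see why entity extraction is domain-specific: the categories and known names have to match the material being analyzed.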
Entity Extraction Tools
•  Stanford Named Entity Recognizer
http://nlp.stanford.edu/software/CRF-NER.shtml
•  Illinois Named Entity Tagger
http://cogcomp.cs.illinois.edu/page/download_view/NETagger
•  DBPedia Spotlight
https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki
Geo-Parsing
•  Common Pitfalls
o  Set of places (GeoNames dictionary)
o  Dictionary determines how broad or narrow your search is
•  Enhancements to CLAVIN by Civic Media
o  Aboutness (uses mention counting)
o  HTTP access used for more advanced workflows
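The two points above, dictionary breadth and aboutness by mention counting, can be sketched together. This is an illustration under stated assumptions, not CLAVIN's implementation: the place lists are hypothetical stand-ins for a GeoNames-derived dictionary, and "aboutness" is reduced to picking the most frequently mentioned place.

```python
import re
from collections import Counter

# Hypothetical gazetteers; a real one is far larger.
NARROW = {"Boston", "Paris"}
BROAD = NARROW | {"Cambridge", "Salem"}

def count_place_mentions(text, gazetteer):
    """Count whole-word mentions of each place found in the text."""
    counts = Counter()
    for place in gazetteer:
        hits = re.findall(r"\b" + re.escape(place) + r"\b", text)
        if hits:
            counts[place] = len(hits)
    return counts

def aboutness(text, gazetteer):
    """Return the most-mentioned place, or None if no place is found."""
    counts = count_place_mentions(text, gazetteer)
    if not counts:
        return None
    # Break ties alphabetically so the result is deterministic.
    return min(counts, key=lambda place: (-counts[place], place))

article = "Cambridge voters met in Cambridge. A Boston paper covered the Cambridge vote."
narrow_result = aboutness(article, NARROW)
broad_result = aboutness(article, BROAD)
```

With the narrow dictionary the article appears to be about Boston; with the broader one it is about Cambridge. The same text yields different answers depending on which places the dictionary knows.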
