Documents as Data
Harvesting Knowledge from
Textual Resources with DADAlytics
Mary Mann, Sarah Ann Adams, Rose Gold,
Ilaria Della Monica, M. Cristina Pattuelli
Qualitative and Quantitative Methods in Libraries
Florence, May 28 - June 1, 2019
Semantic Lab
at Pratt Institute
@semlabteam bit.ly/QQMLSemLab
What is Linked Open Data
LOD: Recommended best practices
for exposing, sharing, and connecting
pieces of data, information, and
knowledge on the Semantic Web
conceived by Tim Berners-Lee in 2006
Diagrams by
Sarah Ann Adams
Linked Data Obstacles
- Availability of easy-to-use tools
- Technological understanding
How DADAlytics Helps
- Intuitive data service
- Lowers barrier to LOD creation
What is DADAlytics
Partners: Carnegie Hall, Tulane University, University of Minnesota, Harvard University, Villa I Tatti, Whitney Museum of American Art
Named-Entity Recognition (NER) Module
Sélavy Document Analysis Tool
[Diagram: entity types organization, location, date, person, misc]
[Diagram: document divided into Title, Subtitle, and Body]
Diagrams by
Sarah Ann Adams
The Six DADAlytics NER Tools

Tool: DBpedia Spotlight
Type: NLP tool with NER component
Training Data: DBpedia resources (Wikipedia-extracted structured content)
Further Info: dbpedia-spotlight.org

Tool: Stanford NLP
Type: NLP tool with NER component
Training Data: mix of CoNLL, MUC-6, MUC-7, and ACE named entity corpora using the english.muc.7class.distsim.crf.ser.gz classifier
Further Info: nlp.stanford.edu

Tool: NLTK
Type: NLP tool with NER component
Training Data: Groningen Meaning Bank corpus
Further Info: nltk.org

Tool: SpaCy
Type: NLP tool with NER component
Training Data: OntoNotes and Common Crawl
Further Info: spacy.io

Tool: OpeNER Project
Type: NLP tool with NER component
Training Data: Apache OpenNLP models
Further Info: opener-project.eu

Tool: TensorFlow Syntaxnet
Type: neural network part-of-speech tagger
Training Data: Parsey McParseface
Further Info: research.googleblog.com/2016/05/announcing-syntaxnet-worlds-most.html
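The slides do not say how the toolchain reconciles the output of the six tools, so the following is only a minimal sketch of one possible approach: pool the entities reported by whichever tools the user has enabled and settle each entity's type by majority vote. The function name, the data layout, and the sample entities are all assumptions made for illustration; none of them come from the DADAlytics code.

```python
from collections import Counter, defaultdict

# Hypothetical per-tool output: each tool returns (entity text, entity type) pairs.
# The sample entities are illustrative only, not taken from the diary evaluation.
TOOL_OUTPUT = {
    "spacy": [("Mary Berenson", "person"), ("Florence", "location")],
    "nltk": [("Mary Berenson", "person"), ("Florence", "person")],
    "stanford": [("Florence", "location"), ("Villa I Tatti", "organization")],
}


def merge_entities(tool_output, enabled_tools):
    """Pool entities from the enabled tools; pick each entity's type by majority vote."""
    votes = defaultdict(Counter)
    for tool in enabled_tools:
        for text, entity_type in tool_output.get(tool, []):
            votes[text][entity_type] += 1
    # most_common(1) yields the type with the highest vote count for that entity.
    return {text: counts.most_common(1)[0][0] for text, counts in votes.items()}


merged = merge_entities(TOOL_OUTPUT, enabled_tools=["spacy", "nltk", "stanford"])
for text, entity_type in sorted(merged.items()):
    print(f"{text}: {entity_type}")
```

Passing a shorter `enabled_tools` list mimics deselecting tools before processing: with only `"nltk"` enabled, "Florence" would come back as a person, which hints at why pooling several tools may give steadier results than any single one.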
Mary Berenson [1885] Public Domain, held at National Portrait Gallery
Mary Berenson and her Diaries
Mary (Whitall) Berenson
- art historian, art critic
- wife of art historian Bernard Berenson
- influenced Bernard’s work
- Archive held at Villa I Tatti
Mary and Bernard Berenson near Fernhurst, England, 1898, courtesy of the Villa I Tatti Berenson Library
DADA•Berenson
NAMES?
PLACES?
ARTISTS?
WORKS OF ART?
Photograph courtesy of the Villa I Tatti Berenson Library
Villa I Tatti Diary Project
Methodology
1] DIARY SELECTION
2] HANDWRITTEN DIARY TRANSCRIBED TO A DIGITAL DOCUMENT
3] CLASSIFICATION OF MISCELLANEOUS ENTITY TYPES
4] MANUALLY EXTRACT ENTITIES
5] RUN DIARY TEXT THROUGH DADALYTICS NER MODULE
6] COMPARE MANUAL EXTRACTION TO DADALYTICS OUTPUT
DADAlytics NER Demo
Entity Classification
DADAlytics Entity Categories: Person, Location, Date, Organization, Event, Miscellaneous
Diary-Specific Entity Types: Literature, Music, Poetry, Theater, Non-Fiction, Visual Art, Art Described by Era, Art Described by Region, Drawing, Painting, Photography, Pottery, Print, Sculpture, Stained Glass, Textile, Mural, Art Collection, Biographic, Cultural, Historic, Article, Journal, Lecture, Magazine, Newspaper, Thesis
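One way to picture the classification dictionary is as a lookup from an extracted "miscellaneous" entity to one of the diary-specific types on this slide. The sketch below assumes a plain Python dictionary; the entries, the `classify` helper, and the fallback to "Miscellaneous" are invented for illustration and are not drawn from the Semantic Lab's actual dictionary.

```python
# Hypothetical classification dictionary: maps a "miscellaneous" entity
# (lower-cased) to one of the diary-specific types listed on the slide.
# The entries are invented examples, not taken from the diary.
CLASSIFICATION = {
    "primavera": "Painting",
    "divine comedy": "Poetry",
    "burlington magazine": "Magazine",
}


def classify(entity_text, dictionary=CLASSIFICATION):
    """Return the diary-specific type for an entity, or 'Miscellaneous' if unlisted."""
    return dictionary.get(entity_text.strip().lower(), "Miscellaneous")


print(classify("Primavera"))
print(classify("An unlisted work"))
```

Normalizing case and whitespace before the lookup is a design choice of this sketch; a real dictionary for transcribed diaries would probably also need to handle spelling variants.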
Extraction Comparison Results
semlab.io/DADAlytics-ner-evaluation/
NLTK Example
Analysis of Results
[Chart of percentage values; axis and series labels not preserved in the extracted text]
Batista, D. (2018, May 9). Named-Entity evaluation metrics based on entity-level. Retrieved from www.davidsbatista.net/blog/2018/05/09/Named_Entity_Evaluation/
66.00%
50.00%
[For partial matches]
[For exact matches]
Rrose Sélavy (Marcel Duchamp), 1920 © Man Ray Trust/ADAGP, Paris and DACS, London 2015
SÉLAVY - DOCUMENT ANALYSIS TOOL
Marcel Duchamp as Rrose Sélavy
(pronounced “c’est la vie”)
[Diagram: document divided into Block 1, Block 2, and Block 3]
Diagram by
Sarah Ann Adams
Turning the document into blocks
Document clean up
Processing of document through Sélavy
The text that was formatted in Sélavy is now pushed through the NER tool for entity recognition, and will then be pulled back into Sélavy for further transformation.
Reviewing the entities
Next Steps
- Complete the development of the Sélavy module
- Test Sélavy using Mary Berenson’s diary, and then on other types of documents (interviews, finding aids, etc.)
- Evaluate the tool with the intended community of users
- Review and refine the tool and workflow
- Apply methodology to other Semantic Lab projects
Thank You
Semantic Lab
at Pratt Institute
--- S E M L A B C O - D I R E C T O R S ---
prof. m. cristina pattuelli
prof. matt miller
------ S E M L A B T E A M ------
mary mann
rose gold
sarah adams
taylor baker
megan lyon
------ C O N T A C T ------
w :: semlab.io
t :: @semlabteam
e :: foaf.Person@semlab.io
A special thank you to
Ilaria della Monica, Archivist, Villa I Tatti
NOUN PROJECT IMAGE CREDITS
tools by Nithinan Tatah from the Noun Project (slide 4)
personal solution by ProSymbols from the Noun Project (slide 4)
solution by Gregor Cresnar from the Noun Project (slide 4)
Person passive confused by Margaret Hagan from the Noun Project (slide 5)
Questions?

Editor's Notes

  1. SLIDE 1 [MARY] Hello, my name is Mary, from the Semantic Lab at Pratt Institute. We are very happy to be here at QQML to present “Documents as Data: Harvesting Knowledge from Textual Resources with DADAlytics.” Before we start, I’d like to invite you all to take a minute to stand up and stretch - I know we’ve all been sitting for a long time already! On the screen here you can see a link to this presentation, if you’d like to follow along. And just to give a roadmap of what we’ll be sharing with you, I will start by speaking about Linked Open Data and introducing the Linked Data creation tool package we’ve coined DADAlytics; then my colleague Rose will speak about the process of using a diary written by Mary Berenson from Villa I Tatti to test the first component of DADAlytics; lastly our colleague Sarah will speak about the results of that testing and then describe the second component of the DADAlytics tool package, Sélavy. So let’s get started.
  2. SLIDE 2 [MARY] DADAlytics is a tool package that helps institutions and researchers create linked open data. So before we get into DADAlytics itself, I’ll just give you a brief overview of what linked open data is and why you might want to create it. The internet as we currently know it is a series of documents linked by URLs. But the semantic web is a web of linked data, rather than just linked documents. In this context, “data” could mean anything from statistics to people to names of artworks. A semantic web with information stored as linked data allows for more granular searching of data, which has the added benefit of increasing discoverability of and access to the data and/or resources.
  3. SLIDE 3 [MARY] Because the generation of linked open data is relatively new to the cultural heritage domain, there are still significant barriers to entry, particularly in terms of understanding the processes and technology involved, and the availability of intuitive linked data tools. We recognize that there’s a significant time cost to creating linked open data at this stage in its development, which can make the process daunting for already-busy library and museum professionals. So the Semantic Lab envisioned DADAlytics, with the goal of creating a lightweight tool package that could enable every librarian, archivist, museum professional and digital humanities scholar to contribute to the Semantic web by creating linked open data, advancing scholarship and creating new knowledge.
  4. SLIDE 4 [MARY] The DADAlytics tool package is being developed by Matt Miller, one of the two co-directors of the Semantic Lab at Pratt, and was informed by the needs of our stakeholders, the partner institutions listed here. Representatives from these institutions gave us feedback on what they might want to see in a package designed to help them create linked open data. DADAlytics currently consists of two modules: a named-entity-recognition toolchain and a document analysis tool called Sélavy. My colleague Sarah will speak about Sélavy at the end of the presentation, but right now I’ll focus on the named-entity-recognition toolchain, or NER toolchain for short.
  5. SLIDE 5 [MARY] The NER toolchain is a combination of six existing open source tools that work together to recognize entities. In this context, an “entity” can loosely be thought of as a proper noun. The main categories of entities picked up by NER tools are dates, locations, organizations, people, and “miscellaneous”. Here’s a visual of the NER toolchain process at work on an archival document. First the document has to be transcribed into machine-readable text. Then you can simply copy and paste the text into the NER toolchain, and it returns something that looks like this (gesture to screen), where all of the detected entities are highlighted with a color block indicating what type of entity the tools believe they are.
  6. SLIDE 6 [MARY] The NER toolchain harnesses the strengths of six existing tools into one super-tool, which produces stronger combined results than any one of these tools could individually. That said, users have the option to select or deselect tools before processing a document through the NER toolchain. And now I’ll turn it over to my colleague Rose, who will talk about the testing of the NER toolchain...
  7. SLIDE 7 [ROSE] We used seven different types of written documents to test the NER toolchain my colleague Mary just described: a chapter from a fiction book, an interview transcript, metadata descriptions, a press release, an artist CV/resume, a portion of an EAD finding aid, and a diary, specifically the diary of art historian and critic Mary Berenson, whose papers are held, along with those of her husband Bernard Berenson, at Villa I Tatti, The Harvard University Center for Italian Renaissance Studies. While Mary worked in the shadow of her more renowned husband, she is credited with having had significant influence over his scholarly work as well as cultivating relationships with intellectuals, artists and collectors who surrounded the couple while they lived in Florence. Because of our partnership with I Tatti and their interest in knowing more details about Mary Berenson’s diaries, we decided to take a closer look at one of her diaries.
  8. SLIDE 8 [ROSE] Why Berenson’s diaries? As I Tatti archivist Ilaria della Monica puts it: “Mary recorded the travels she undertook with her second husband Bernard Berenson to visit museums, churches and private collections. She also took notes on books, music and the people they met.” Her diaries are thus rich in information, full of useful entities like the names of artworks and artists, places and people, books and theories. These entities are helpful because they recreate the world and orbit that Berenson moved within. This information is highly valued by I Tatti researchers and staff.
  9. SLIDE 9 [ROSE] We began by choosing the 1903 diary of the Berensons’ trip to America, which I Tatti researchers were particularly interested in knowing more about.
  10. SLIDE 10 [ROSE] I Tatti provided us with a transcribed and OCR’ed PDF of the diary.
  11. SLIDE 11 [ROSE] And once we had the diary, we began building a sort of dictionary of terms that we could use to classify entities. We’ll show you some of these terms later.
  12. SLIDE 12 [ROSE] We also began manually extracting entities from the document - and by entities I mean names of people, pieces of art, places they visited, and so on.
  13. SLIDE 13 [ROSE] This manual extraction informed the classification dictionary and vice versa, so the two were developed concurrently.
  14. SLIDE 14 [ROSE] Once this manual extraction was complete, we ran the diary through the NER toolchain...
  15. SLIDE 15 [ROSE] ...and compared the toolchain output to the results of manual extraction
  16. SLIDE 16 [ROSE] You’ve seen this image before. This is what the first page of Mary Berenson’s diary looked like after being processed by the NER toolchain.
  17. SLIDE 17 [ROSE] And here you can see the classification of entities. On the left are the entities that DADAlytics looks for: date, location, organization, person and miscellaneous. Miscellaneous can be domain specific, depending on what you choose to be relevant to your research needs. On the right side of this slide, you can see the nuances that became available when the miscellaneous entities were further categorized, for example this differentiation between drawings, paintings, and murals.
  18. SLIDE 18 [SARAH] Thank you Rose, for describing the process for preparing the Mary Berenson document. As previously mentioned by Mary, I will talk about the results of testing the DADAlytics NER tool with the Mary Berenson diary. This screenshot shows the overall results: the percentages of exact matches, partial matches, and no matches that each tool produced, out of the 3273 entities that were extracted manually from the diary. All of this information, including the extraction results from the 6 other types of documents, is available on our website (semlab.io/DADAlytics-ner-evaluation) as well as in our GitHub repository (https://github.com/SemanticLab/DADAlytics-ner-evaluation).
  19. SLIDE 19 [SARAH] Clicking on “more” shows each entity that was manually marked up and whether it was matched exactly, partially, or not at all by a given tool. This slide shows examples of the results of the NLTK tool on the Mary Berenson diary. Manually extracted entities were compared to the entities detected by the DADAlytics NER tool through a series of Python scripts written by Semantic Lab co-director Matt Miller. These Python scripts can also be found in our GitHub repository.
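To make the exact / partial / no-match comparison concrete, here is a minimal sketch in Python. It is not the Semantic Lab scripts from the repository: the function names are hypothetical, and the rule that a partial match means one string contains the other is an assumption made for illustration.

```python
def classify_match(manual_entity, tool_entities):
    """Return 'exact', 'partial', or 'none' for one manually extracted entity."""
    target = manual_entity.strip().lower()
    candidates = [e.strip().lower() for e in tool_entities]
    if target in candidates:
        return "exact"
    # Assumed rule: a partial match is any case where one string contains the other.
    for candidate in candidates:
        if candidate and (candidate in target or target in candidate):
            return "partial"
    return "none"


def summarize(manual_entities, tool_entities):
    """Express each match type as a percentage of the manually extracted entities."""
    counts = {"exact": 0, "partial": 0, "none": 0}
    for entity in manual_entities:
        counts[classify_match(entity, tool_entities)] += 1
    total = len(manual_entities)
    return {
        kind: (round(100 * n / total, 1) if total else 0.0)
        for kind, n in counts.items()
    }
```

Called with a manual list and one tool's output, `summarize` returns the three percentages that a per-tool results row would report.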
  20. SLIDE 20 [SARAH] So how did the DADAlytics NER tool do? This chart shows the exact match average of all six tools combined for each document (dark purple) as well as the additional partial match average of all six tools combined for each document (pink). As you can see, the DADAlytics NER tool picked up the fewest exact and partial matches for Mary Berenson’s diary. We believe the diary had low numbers because we conducted a more nuanced manual extraction of entities from it, which greatly expanded potential matches within the miscellaneous category. Although this sets the results of the diary apart from the results of the other documents, this granular encoding process was still useful because it can reveal the specific strengths or gaps of each tool. A more in-depth study of the precision, recall, and F1 measurements for the first 100 entities of the diary has been conducted and will be made available on our website.
  21. SLIDE 21 [SARAH] Moving back to looking at the results as a whole, what is a typical named-entity evaluation metric to which we can compare our results? In a 2018 article David S. Batista cites a metric of 50% precision for exact matches and 66% precision for partial matches as NER performance averages. Even though the results of the DADAlytics NER tool don’t meet those thresholds with all types of documents, the NER tool is useful in doing the heavy lifting of recognizing named entities, especially in large amounts of text, which lessens the researcher’s manual workload. This task becomes more powerful in conjunction with the second component of the DADAlytics tool package, Sélavy.
  22. SLIDE 22 [SARAH] Sélavy is a document analysis tool that will support the generation of linked data from text. While the NER tool is powerful in identifying entities, the strength of the Sélavy tool is in its ability to relate the entities to one another. This module is still in development by Matt Miller, but I’ll show a few screenshots in these last slides to give an idea of how this tool might be used.
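For readers less familiar with precision, recall, and F1, the standard definitions can be sketched as follows. The function name is hypothetical and the sketch makes no claim about how the study itself counted true and false positives.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Standard definitions; each value falls back to 0.0 when its denominator is zero."""
    predicted = true_positives + false_positives   # everything the tool tagged
    actual = true_positives + false_negatives      # everything the annotators tagged
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1
```

Precision asks how much of what the tool tagged was right, recall asks how much of the manual extraction the tool found, and F1 is the harmonic mean of the two.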
  23. SLIDE 23 [SARAH] The first step in using Sélavy is to determine the text blocks that make up a document. For example, the text block of Mary Berenson’s diary is a day. The Sélavy tool can automatically detect some document structure, and the user can also refine the structure using regular expressions. This is useful because, in the case of the diary, an entity can now be related to a specific day rather than to the whole diary.
  24. SLIDE 24 [SARAH] A traditional “find and replace” can also be used to clean up a document. For example, the administrative text that was on each page of the original PDF was removed so that it would not be run through Sélavy.
  25. SLIDE 25 [SARAH] Once a document has been sufficiently transformed, it is pushed through the DADAlytics NER tool for entity recognition, and then pushed back into the Sélavy module.
  26. SLIDE 26 [SARAH] This slide is an example of how Sélavy picked up the entity Bryn Mawr (a college in Pennsylvania) 18 times during the processing of the text. This is where the knowledge of a domain expert comes into play. A user can review the detected entities and decide whether or not to include them in a linked data set. A file of curated entities can then be downloaded in RDF, a standard data model for linked data.
  27. SLIDE 27 [SARAH] Complete the development of the Sélavy module; test Sélavy using Mary Berenson’s diary, and then on other types of documents (interviews, finding aids, etc.); evaluate the tool with the intended community of users; review and refine the tool and workflow; apply the methodology to other Semantic Lab projects.
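As a rough illustration of splitting a diary into day-sized text blocks with a regular expression, here is a minimal Python sketch. It is not how Sélavy is implemented; the header format (a full date such as "October 12, 1903" on a line of its own) is an assumption, and the real transcription may mark days differently.

```python
import re

# Assumed header format: "<Month> <day>, <year>" alone on a line.
DAY_HEADER = re.compile(
    r"^(January|February|March|April|May|June|July|August|"
    r"September|October|November|December) \d{1,2}, \d{4}$"
)


def split_into_days(text):
    """Return a list of (day_header, entry_text) pairs, one per detected day."""
    blocks = []
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if DAY_HEADER.match(stripped):
            current = {"day": stripped, "lines": []}
            blocks.append(current)
        elif current is not None and stripped:
            current["lines"].append(stripped)
        # Text before the first header is ignored in this sketch.
    return [(block["day"], " ".join(block["lines"])) for block in blocks]
```

With blocks like these, an entity found in a block can be tied to that block's day rather than to the diary as a whole.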
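The same kind of cleanup can be sketched outside the tool with a single regular-expression substitution. The boilerplate pattern below is purely hypothetical, since the actual administrative text on the PDF pages is not shown in this talk.

```python
import re

# Hypothetical boilerplate line; substitute the real repeated text from the PDF.
PAGE_BOILERPLATE = re.compile(
    r"^[ \t]*\[Archive transcription, page \d+\][ \t]*\n?",
    re.MULTILINE,
)


def remove_page_boilerplate(text):
    """Delete every line that consists only of the repeated administrative text."""
    return PAGE_BOILERPLATE.sub("", text)
```

Removing such lines before entity recognition keeps the repeated text from being tagged as entities on every page.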
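The review-then-export step can be pictured with the minimal Python sketch below: count how often each entity was detected, let a reviewer pick which ones to keep, and write the kept ones as N-Triples, one of the RDF serializations. This is not Sélavy's export code. The `http://example.org/entity/` namespace and the helper names are placeholders, and the URI builder assumes names made only of letters and spaces.

```python
from collections import Counter

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
BASE = "http://example.org/entity/"  # placeholder namespace, not a Semantic Lab URI


def count_mentions(detected_names):
    """Count how many times each entity name was detected in the text."""
    return Counter(name.strip() for name in detected_names)


def slug(name):
    """Build a simple URI fragment; assumes names of letters and spaces only."""
    return "-".join(name.lower().split())


def to_ntriples(accepted_names):
    """Serialize reviewer-approved entity names as one rdfs:label triple each."""
    lines = []
    for name in sorted(accepted_names):
        # Escape backslashes and double quotes for the N-Triples string literal.
        literal = name.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'<{BASE}{slug(name)}> <{RDFS_LABEL}> "{literal}" .')
    return "\n".join(lines)
```

A domain expert would inspect the counts, pass only the approved names to `to_ntriples`, and save the result as a `.nt` file.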
  28. SLIDE 28 [SARAH] Thank you so much for your time and attention. It has been a pleasure for us to be here. We’re happy to take any questions.