Language Resources, Language
Technology, Text Mining, the Semantic
Web: How interoperability of machines
can help humans in the multilingual web
Felix Sasaki
DFKI / University of Appl. Sciences Potsdam
W3C German-Austrian Office
felix.sasaki@dfki.de
W3C Workshop “The Multilingual Web - Where Are We?” 26-27 October 2010, Madrid
Purpose of this talk (1)
• Show gaps
– Between machines
– Between machines and humans
• … which we need to fill to bridge gaps
between humans
Purpose of this talk (2)
• Identify groups / communities
– To fill gaps
– To come together in new alliances
Basics:
What are machines doing
(not only on the Web)?
Language Technology
• Summarization
LT
“These texts are
about ... “
Language Technology
• Machine Translation
LT このワークショップ
は…で開催される
“The workshop
takes place in …“
Language Technology
• Spell and grammar checking
LT
“The workshop
takes place in …“
“The worksop
take place in …“
• And many more applications
• Coreference resolution, discourse analysis,
named entity recognition, natural language
generation, question answering, …
Text mining
• Finding out things you did not know
Text
mining
•“Text A and text B
are similar”
•“The text collection
has clusters of
topics: …”
Visualization
of results
Basics:
What are machines doing
(not only on the Web)?
How are they doing it?
They are using resources
Resources in language technology
• Sample resources for summarization
LT
“These texts are
about ... “
NLG output
text mining
output
stop word
list
…
Language Technology
• Sample resources in Machine Translation
LT このワークショップ
は…で開催される
“The workshop
takes place in …“
Lexicon Grammar
(Training)
corpora
…
Generation
Language Technology
• Sample resources for spell and grammar
checking
LT
“The workshop
takes place in …“
“The worksop
take place in …“
Lexicon Grammar …
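How a lexicon serves as a resource for spell checking can be sketched in a few lines of Python. This is a minimal illustration, not how any particular checker works: the six-word `LEXICON` and the function name `suggest_corrections` are assumptions made for this example, and the matching relies on the standard library's `difflib.get_close_matches`.

```python
import difflib

# Hypothetical mini lexicon; a real checker would load a full word list.
LEXICON = sorted({"the", "workshop", "take", "takes", "place", "in"})


def suggest_corrections(sentence: str) -> dict[str, str]:
    """Map each word missing from the lexicon to its closest lexicon entry."""
    suggestions = {}
    for word in sentence.lower().split():
        if word in LEXICON:
            continue
        # Ask for the single best match above difflib's default cutoff.
        close = difflib.get_close_matches(word, LEXICON, n=1)
        if close:
            suggestions[word] = close[0]
    return suggestions


result = suggest_corrections("The worksop take place in")
print(result)  # {'worksop': 'workshop'}
```

The lexicon alone catches "worksop" but not "take", because "take" is a valid word. Detecting the agreement error needs the second resource on the slide, a grammar.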
Text mining
• Sample resources for text mining
Text
mining
•“Text A and text B
are similar”
•“The text collection
has clusters of
topics: …”
Lexicon
Stop word
list
…
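The role of a stop word list in a statement like "Text A and text B are similar" can be shown with a small Python sketch. It assumes a tiny, made-up `STOP_WORDS` set and uses Jaccard overlap of the remaining words as one simple similarity measure among many. The slides do not name a specific measure.

```python
import re

# Hypothetical, tiny stop word list; real lists are language-specific resources.
STOP_WORDS = {"the", "a", "an", "and", "or", "because", "in", "of", "is", "are", "to"}


def content_words(text: str) -> set[str]:
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {t for t in tokens if t not in STOP_WORDS}


def jaccard(a: str, b: str) -> float:
    """Share of content words that two texts have in common (0.0 to 1.0)."""
    words_a, words_b = content_words(a), content_words(b)
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


text_a = "The workshop takes place in Madrid"
text_b = "A workshop in Madrid takes place"
text_c = "Frequently asked questions about online banking"

print(jaccard(text_a, text_b))  # 1.0
print(jaccard(text_a, text_c))  # 0.0
```

Without the stop word filter, words like "the" and "in" would count as shared vocabulary and blur the comparison.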
In general: you need three types of
data: input, resources, workflow
Input
Workflow
Output
Resources Resources …
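The three types of data on this slide can be modelled as a minimal Python sketch. The `Step` class, `run_workflow`, and the two sample steps are hypothetical names chosen for illustration. The point is only that a workflow can record which tool ran and which resources it used while it transforms the input.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str                  # which tool ran
    resources: list[str]       # which resources it used
    run: Callable[[str], str]  # the processing itself


def run_workflow(text: str, steps: list[Step]) -> tuple[str, list[tuple[str, list[str]]]]:
    """Apply each step in order and keep a record of what was done."""
    record = []
    for step in steps:
        text = step.run(text)
        record.append((step.name, step.resources))
    return text, record


STOP_WORDS = {"the", "in"}  # hypothetical resource


def drop_stop_words(text: str) -> str:
    return " ".join(w for w in text.split() if w not in STOP_WORDS)


steps = [
    Step("lowercase", [], str.lower),
    Step("stop-word filter", ["stop word list"], drop_stop_words),
]
output, record = run_workflow("The workshop takes place in Madrid", steps)
print(output)  # workshop takes place madrid
print(record)  # [('lowercase', []), ('stop-word filter', ['stop word list'])]
```

Keeping the record next to the output is one way to make the workflow itself visible to later tools.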
What gaps need to be filled for truly
“multilingual content processing”?
• Gap 1: machines don’t use metadata available
in the input
• Gap 2: machines don’t know about the
workflow (input) data goes through
• Gap 3: machines don’t make explicit
– “Who” they are
– What resources they are using
Gap 1: machines don’t use metadata
available in the input
• Input from www.postbank.de
„Ob Postbank direkt, Online-Banking,
Online-Brokerage oder myBHW. Die
häufigsten Fragen zu unseren
Transaktionssystemen finden Sie an
dieser Stelle.“
• Output via Google translate
“Whether Postbank direct, online
banking, online brokerage or myBHW.
Frequently asked questions about our
transaction systems can be found at
this location.”
Gap 1: machines don’t use metadata
available in the input
• Input from www.postbank.de
„Ob Postbank direkt, Online-Banking,
Online-Brokerage oder myBHW. Die
häufigsten Fragen zu unseren
Transaktionssystemen finden Sie an
dieser Stelle.“
• Output via Google translate
“Whether Postbank direct, online
banking, online brokerage or myBHW.
Frequently asked questions about our
transaction systems can be found at
this location.”
Fixed terminology
should not have
been translated.
But – the MT tool
had no chance to
“know” that –
why?
Gap 2: machines don’t know about
processes data goes through
• Input from the database – the
“hidden web”:
„Ob <term>Postbank direkt</term>,
<term>Online-Banking</term>,
<term>Online-Brokerage</term> …“
• Output on the Web:
„Ob <em>Postbank direkt</em>,
<em>Online-Banking</em>,
<em>Online-Brokerage</em> …“
fixed terminology (= metadata) …
… is lost on the Web
→ publication process
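The loss described on this slide can be reproduced with a short Python sketch. Both publishing functions are hypothetical stand-ins for a real publication process, and the regular expression is a simplification that assumes flat, well-formed `<term>` elements. The metadata-preserving variant borrows the `property="its:term"` attribute that appears later in this talk. Other encodings would work equally well.

```python
import re

TERM = re.compile(r"<term>(.*?)</term>")

source = (
    "Ob <term>Postbank direkt</term>, <term>Online-Banking</term>, "
    "<term>Online-Brokerage</term> …"
)


def publish_lossy(fragment: str) -> str:
    """Turn term markup into plain emphasis; the 'term' meaning is gone."""
    return TERM.sub(r"<em>\1</em>", fragment)


def publish_preserving(fragment: str) -> str:
    """Keep the emphasis but carry the term metadata along as an attribute."""
    return TERM.sub(r'<em property="its:term">\1</em>', fragment)


def recover_terms(html: str) -> list[str]:
    """What a downstream tool can still identify as fixed terminology."""
    return re.findall(r'<em property="its:term">(.*?)</em>', html)


print(recover_terms(publish_lossy(source)))       # []
print(recover_terms(publish_preserving(source)))
# ['Postbank direkt', 'Online-Banking', 'Online-Brokerage']
```

After the lossy step, a translation tool sees only emphasis and cannot tell terms from ordinary stressed words.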
Gap 3: no common identification …
• Of metadata and process chains (previous
slides)
• Of resources – e.g. what is a lexicon
– In machine translation?
– In localization?
– For a human reader?
– Ability to combine tools depends on knowing
about them (capabilities, resources) in detail
Who can fill these gaps – people
dealing with multilingual content
• Content producers
– Allow for terminology identification in source formats
/ CMS
• Localizers
– Make localization workflows aware of (process /
source content) metadata
• “Machine” experts
– Make their tools sensitive to source content metadata
and expose their capabilities (what resources /
workflows) in a clearly defined way
Who can fill these gaps – people
dealing with multilingual content
• Users
– Add metadata to source content
– Use (machine translation) tools without knowing the
details – e.g. in the browser!
• Browser vendors
– Create APIs which make use of automatic tools /
resource and workflow descriptions / source code
metadata
• …
→ The people in this room!
How can they fill the gaps?
• All these groups need to agree upon one
machine readable information space for filling
the gaps
• It’s actually already here – the Semantic Web!
What is the Semantic Web
• The Web as humans see it: Identification of
“meaning” e.g. via (typographic or other)
conventions
„Ob Postbank direkt …“
What is the Semantic Web
• The Web as machines see it: Identification of
meaning via RDF-based mechanisms (here via
RDFa)
„Ob <span property="its:term">Postbank direkt</span>
…“
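Once the term is marked up in a machine-readable way, a tool can act on it. The Python sketch below is an assumption-laden illustration, not a description of any real translation engine: `TermCollector`, `mask`, `unmask`, and `stand_in_translate` are hypothetical names, and the "translation" is just upper-casing, chosen so that the effect on unprotected text is easy to see.

```python
from html.parser import HTMLParser


class TermCollector(HTMLParser):
    """Collect the text of elements marked with property="its:term"."""

    def __init__(self):
        super().__init__()
        self.terms = []
        self._in_term = False

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("property") == "its:term":
            self._in_term = True
            self.terms.append("")

    def handle_endtag(self, tag):
        # Simplification: assumes term elements contain no nested markup.
        self._in_term = False

    def handle_data(self, data):
        if self._in_term:
            self.terms[-1] += data


def find_terms(html: str) -> list[str]:
    collector = TermCollector()
    collector.feed(html)
    collector.close()
    return collector.terms


def mask(text: str, terms: list[str]) -> str:
    """Replace each term with a placeholder the engine should leave alone."""
    for i, term in enumerate(terms):
        text = text.replace(term, f"__TERM{i}__")
    return text


def unmask(text: str, terms: list[str]) -> str:
    """Put the original terms back after processing."""
    for i, term in enumerate(terms):
        text = text.replace(f"__TERM{i}__", term)
    return text


def stand_in_translate(text: str) -> str:
    # Stand-in for an MT engine: it alters every word it sees.
    return text.upper()


html = 'Ob <span property="its:term">Postbank direkt</span> …'
plain = "Ob Postbank direkt …"

terms = find_terms(html)
translated = unmask(stand_in_translate(mask(plain, terms)), terms)
print(terms)       # ['Postbank direkt']
print(translated)  # OB Postbank direkt …
```

The term survives unchanged only because the markup told the tool where it was.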
What is the Semantic Web –
RDF in 30 seconds
• A framework for making statements about
resources, using URIs
• RDF can help to fill our gaps
1. Metadata in the input
2. Metadata for workflows
3. Identify 1., 2. and language technology resources
uniquely
• In one information space – the machine
readable Web
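The idea of "statements about resources, using URIs" can be illustrated without any RDF library by treating each statement as a (subject, predicate, object) tuple. Everything under `http://example.org/` below is a made-up identifier for this sketch, not an existing vocabulary. A real deployment would use an RDF toolkit and agreed vocabularies.

```python
EX = "http://example.org/"  # hypothetical namespace for this sketch

# One statement per gap: input metadata, workflow metadata, tool resources.
triples = [
    (EX + "doc/faq#term1", EX + "vocab/termText", "Postbank direkt"),
    (EX + "doc/faq", EX + "vocab/processedBy", EX + "tools/mt-engine"),
    (EX + "tools/mt-engine", EX + "vocab/usesResource", EX + "resources/de-en-lexicon"),
]


def match(statements, s=None, p=None, o=None):
    """Return all triples that agree with the given subject, predicate, object."""
    return [
        t for t in statements
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Which tools processed the document, and which resources did they use?
tools = [o for _, _, o in match(triples, s=EX + "doc/faq", p=EX + "vocab/processedBy")]
resources = [
    o
    for tool in tools
    for _, _, o in match(triples, s=tool, p=EX + "vocab/usesResource")
]
print(resources)  # ['http://example.org/resources/de-en-lexicon']
```

Because the tool and the lexicon are identified by URIs, statements from different parties can be combined in one query, which is the "one information space" the slide refers to.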
Instead of a summary – call for
(participation in) project proposals
• Who needs to come together
– Content producers, localizers, “machine” experts, browser
vendors, users
• What should their work be based upon
– Semantic Web technologies
– Clear interfaces to the human (e.g. browser) Web, like RDFa
• What we do not need
– Web-centred standardization of formats for language resources
themselves – that is already done elsewhere (see this session)
• Where is the place to do that work?
– W3C, since it needs to be part of core Web technologies
• For making it happen, we need a strong alliance of Web
technologies, other fields and machine technologies
META-NET
• EU-funded project, closely related to
“Multilingual Web”
• Main aim: build an alliance for improving
language technologies in Europe
• Laaarge: soon 40+ participating organizations
in 30+ countries
• Very important: bring users of language
technology in
META-NET
• Users and language technology companies:
in Europe not only large companies, but more
and more SMEs
• META-NET targets these small and fast
units – including you :-)
• EU has started special funding programs for
SMEs – see http://tinyurl.com/eu-lt-sme
(“objective 4.1”)
META-NET
• Event: META-NET Forum
• Brussels, November 17th/18th
• Aim: Bring users / language technology
developers / policy makers together
• Discuss a road map for the next 10 years of
language technology and its
applications
• Details and registration at
http://www.meta-net.eu/events


Editor's notes

  1. The purpose of this talk is as shown on this and on the following slide.
  2. Let us start with some basics: what are machines doing, not only on the Web? We focus on a few technology fields like “language technology” and “text mining”, which are sufficient to identify the gaps we want to point out.
  3. We start with the field of language technology and some of its applications. Language technology can be used e.g. for text summarization. A language technology component takes one or several documents as an input and produces as an output a summary of the document(s).
  4. Another application of language technology is machine translation. A language technology component takes a text as an input and produces an output in a different language.
  5. Another application is checking of spelling and grammar. On the left side we see the input sentence from the last slide with some mistakes. A language technology component takes this as an input and produces suggestions for corrections, as shown on the right side of the slide. These are only a few sample applications of language technology, to give you a rough idea of what this technology is used for. Some others are shown on this slide, but we will not go into details here.
  6. The purpose of text mining is to find out new information about large amounts of unstructured text. For example, you may find out from a text mining process that two texts A and B are similar, or that there is a cluster around certain topics in a text collection. Results of text mining are often visualized, which is only indicated on this slide.
  7. Now let us get closer to the point we want to make: what is common about what machines are doing? All the technologies we introduced rely on resources. Some of the resources will be described now.
  8. Summarization depends on the output of other processes. There are various approaches to summarization, making use of different kinds of resources or outputs. Sample outputs come from natural language generation or text mining. Text mining itself relies on a stop word list, since you don’t want words like “and, because, or, …” as part of the mining process.
  9. For machine translation, what you need depends very much on your approach. Again we will not go into details here, but list some prototypical resources. You may need a lexicon for your source and your target language, or grammar(s), again for both languages. You may use corpora for several reasons. With corpora you can “train” your translation component: a corpus can help you to generate a lexicon or a grammar from examples of real use. It can be used to calculate probabilities for translations, given examples of aligned source and target language sentences. And it can be used to test and enhance the quality of a translation, given (hand-written) examples.
  10. For spell or grammar checking, again you need a lexicon and a grammar, and potentially other resources.
  11. For text mining, you need a lexicon since you want to find e.g. similarities between texts about “multilingualism” and “multilingual”. The lexicon will help you to bring these two expressions together. You will also need something like a stop word list. It contains words like “and”, “when”, “because” which you do not want to take into account for your mining process.
  12. In the previous slides we described resources like a lexicon, grammars or corpora. Generalizing the picture, you may call the resources a kind of “data”. There are other types of data you need: you have input data which you want to summarize, translate, spell check or mine for new information. And there is a workflow in the examples we had: you describe for a language technology or mining process what is happening: a lexicon is used for translation, a corpus to find previously translated examples, a grammar to check correctness in the target language, and so on.
  13. The first gap we want to point out is: machines don’t use metadata which is available in the input. Here we see an example of a translation generated automatically, via Google Translate. The result looks OK, but let’s take a closer look.
  14. The machine translation process should have “known” that there is fixed terminology not to be translated. But it could never know that – why?
  15. The reason is that the information about the fixed terminology is not available in the data. By “in the data”, we mean the input data for the machine translation tool. Actually, the information about fixed terminology was available in the “hidden web”, that is, in a database. But the database does not appear on the Web. Here we come to gap 2: machines need to know about the process the data went through before it appeared on the “surface” of the Web. This is of course closely related to the first gap. Filling the second gap is in a sense the prerequisite for filling the first gap.
  16. The third gap is related to the two others: we want to be able to uniquely identify metadata (for source content) and processing chain descriptions. We also want to identify, in an equally clear manner, the characteristics of tools that use that kind of metadata.
  17. This slide shows which groups can help to fill these gaps. It is actually the people in this room! Note that the examples are just examples, useful for our “machine translation and terminology” problem. There is no time during this presentation to show other problems which can be solved in the same manner, but believe me that they exist.
  18. It is necessary that the people in this room come together and fill the gaps mentioned. It is also helpful if they do it in one machine-readable information space – the Semantic Web.
  19. The Semantic Web is the Web made processable for machines. On this slide you see how humans see the Web. “Meaning” is conveyed by the content itself and e.g. typographic conventions.
  20. In the Semantic Web, meaning is conveyed with specific means: via RDF-based mechanisms. The mechanism to add machine-readable meaning to Web pages is called RDFa and is exemplified here. The “property” attribute expresses that the content of the “span” element is to be interpreted as a term.
  21. Instead of a summary, let’s have a call for project proposals!
  22. These slides provide some background about META-NET.
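
Notes 8 and 11 describe how text mining relies on a lexicon (to bring “multilingualism” and “multilingual” together) and a stop word list. A minimal sketch of that idea follows. The word lists, the function names and the choice of Jaccard overlap as a similarity measure are assumptions made for illustration and do not come from the talk.

```python
import re

# Hypothetical resources, invented for this illustration only.
STOP_WORDS = {"and", "when", "because", "or", "the", "a", "of", "is", "on"}
# Toy lexicon: maps word forms to a shared entry.
LEXICON = {"multilingualism": "multilingual", "multilingual": "multilingual"}


def content_terms(text):
    """Lowercase, tokenize, drop stop words, map words to lexicon entries."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {LEXICON.get(tok, tok) for tok in tokens if tok not in STOP_WORDS}


def similarity(text_a, text_b):
    """Jaccard overlap of the two term sets (0.0 = disjoint, 1.0 = identical)."""
    a, b = content_terms(text_a), content_terms(text_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


text_a = "Multilingualism on the Web"
text_b = "The multilingual Web and the Semantic Web"
print(content_terms(text_a))
print(similarity(text_a, text_b))
```

Without the lexicon, the two texts would share only the term “web”; with it, both also contribute the entry “multilingual”, which is the effect note 11 describes.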
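
Note 20 says that an RDFa “property” attribute on a “span” element can mark its content as a term, which is the kind of metadata the machine translation tool in notes 13 to 15 was missing. The sketch below shows how a tool might read such markup using Python's standard `html.parser` module. The property value `ex:term`, the sample sentence and the class name are hypothetical, since the transcript does not reproduce the slide's markup. A real RDFa processor does considerably more (for example, resolving vocabulary prefixes); this sketch only collects text from elements that carry a `property` attribute and assumes a well-formed fragment.

```python
from html.parser import HTMLParser


class PropertyCollector(HTMLParser):
    """Collect (property value, text content) pairs from marked-up elements."""

    def __init__(self):
        super().__init__()
        self.found = []   # finished (property, text) pairs
        self._stack = []  # open elements: [tag, property or None, text parts]

    def handle_starttag(self, tag, attrs):
        self._stack.append([tag, dict(attrs).get("property"), []])

    def handle_data(self, data):
        # Text belongs to every open element that carries a property.
        for entry in self._stack:
            if entry[1] is not None:
                entry[2].append(data)

    def handle_endtag(self, tag):
        # Pop until the matching open tag (assumes well-formed input).
        while self._stack:
            open_tag, prop, parts = self._stack.pop()
            if prop is not None:
                self.found.append((prop, "".join(parts).strip()))
            if open_tag == tag:
                break


# Hypothetical markup; "ex:term" is a placeholder property value.
snippet = '<p>Use the <span property="ex:term">Semantic Web</span> stack.</p>'
collector = PropertyCollector()
collector.feed(snippet)
collector.close()
print(collector.found)
```

A translation tool could use the collected pairs to leave marked terms untranslated or to look them up in a terminology resource, instead of guessing from the surface text alone.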