Towards a Web Search
     Service for Minority
 Language Communities
                            Baden Hughes
       Department of Computer Science and
                      Software Engineering
               The University of Melbourne
             badenh@csse.unimelb.edu.au


17 January 2006        Hughes @ OpenRoad 2006   1
Diversity in Australia
     Well recognised cultural and linguistic diversity of
     Australia’s population
           SIL Ethnologue
             311 languages (14th edition, 2000)
             318 languages (15th edition, 2005)
             Australia in top 10 countries for linguistic diversity
             ( = languages in a country / languages globally )
           ABS: 364 languages (2005)
     Considerable number of low density languages used
     within immigrant communities

Inefficiency of Web Search
     General web search is a low precision activity
     in the best case scenario
           Google: 8 billion web pages
     Web search for materials in lesser-used
     languages is even lower precision than the
     general case
     Web search for minority (“low density”)
     languages is even lower precision again
           Mining the ‘long tail’ of the web is a specialist
           domain of research
Harvesting vs Enabling
     Previous work in linguistically-oriented data mining
     of web content to create derivative works: corpora,
     dictionaries
           None of these address the low precision issues for
           generalized web search
     Our work is aimed at increasing the likelihood that
     end users searching for resources in minority
     languages on the web will find useful results from
     searching
           Developing use-case specific tools for web search and
           leveraging existing broad coverage web search tools

Open Language Archives
Community (OLAC)
     OLAC is a consortium of linguistic data archives
           http://www.language-archives.org/
           34 archives, 28K+ objects in catalogue
     OLAC metadata is based on Dublin Core, with
     extensions for linguistically oriented properties,
     e.g. language, data type, subject language,
     linguistic subject
     OLAC is an Open Archives Initiative (OAI)
     subcommunity
           Uses standard OAI Protocol for Metadata Harvesting to
           promote data access and integration
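
A minimal sketch of how a harvester might address an OLAC archive over the OAI Protocol for Metadata Harvesting. The base URL is hypothetical, the helper name `list_records_url` is illustrative, and the `olac` metadata prefix is assumed here rather than taken from the slides. The code only builds the request URL; it does not contact any server.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="olac", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL.

    In OAI-PMH a resumptionToken is an exclusive argument: when one is
    supplied, it replaces metadataPrefix in the follow-up request.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


# Hypothetical archive endpoint, used for illustration only.
first_page = list_records_url("https://archive.example.org/oai")
next_page = list_records_url("https://archive.example.org/oai",
                             resumption_token="page-2")
```

A harvester would fetch `first_page`, read any resumption token from the response, and repeat with `next_page`-style requests until no token is returned.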

In vs About
     OLAC Metadata crucially distinguishes
     between
           The language a resource is in (‘language’)
           The language a resource is about (‘subject
           language’)
     Such differentiation allows for additional
     precision in classifying, indexing and
     searching for low density language resources
           ‘In-ness’ is more interesting than ‘About-ness’
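
The in/about distinction can be illustrated with a small record structure. This is a sketch only: the class `ResourceRecord`, its field names, the example URIs, and the use of three-letter language codes are all assumptions made for illustration, not the OLAC schema itself.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceRecord:
    """Hypothetical, simplified description of one web resource."""
    uri: str
    language: set = field(default_factory=set)          # languages the resource is IN
    subject_language: set = field(default_factory=set)  # languages the resource is ABOUT


def resources_in(records, code):
    """URIs of resources written or spoken in the given language."""
    return [r.uri for r in records if code in r.language]


def resources_about(records, code):
    """URIs of resources that discuss the given language."""
    return [r.uri for r in records if code in r.subject_language]


records = [
    # A text written in Dinka (illustrative URI).
    ResourceRecord("http://example.org/dinka-story", language={"din"}),
    # An English-language grammar describing Dinka (illustrative URI).
    ResourceRecord("http://example.org/dinka-grammar",
                   language={"eng"}, subject_language={"din"}),
]

in_dinka = resources_in(records, "din")
about_dinka = resources_about(records, "din")
```

A search for material in Dinka returns only the first record, while a search for material about Dinka returns only the second, which is the extra precision the two separate fields make possible.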
Service Architecture
     Building on previous work on robust strategies
     for identifying web resources for lesser-used
     languages, the LangGator service architecture
     provides
           Language-centric web resource identification and
           acquisition
           Language-centric resource description
           Language-aware end-user resource discovery
Crawler Internals
     Crawl seeded by language name variants
     (Ethnologue), place and country names and variants
     (Getty TGN), lexical items (Rosetta)
     Programmatic queries against Google, Yahoo, A9,
     DogPile
           Essentially guided metasearch
     Resulting URIs merged and sorted using rank
     aggregation techniques
     Highly ranked documents from metasearch used for
     focused crawling around URI
           TF/IDF for low frequency items in found documents
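
The slides do not say which rank aggregation technique the crawler uses. As one possible illustration, the sketch below merges result lists from several engines with a Borda count; the function name `borda_merge` and the sample URIs are hypothetical.

```python
from collections import defaultdict


def borda_merge(rankings):
    """Merge ranked URI lists with a Borda count.

    Each list awards n points to its top item, n - 1 to the next, and so
    on, where n is the number of distinct URIs across all lists. A URI
    missing from a list receives no points from that list.
    """
    universe = {uri for ranking in rankings for uri in ranking}
    n = len(universe)
    scores = defaultdict(int)
    for ranking in rankings:
        for position, uri in enumerate(ranking):
            scores[uri] += n - position
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(universe, key=lambda uri: (-scores[uri], uri))


# Illustrative result lists from three engines for the same query.
merged = borda_merge([
    ["a.example", "b.example", "c.example"],
    ["b.example", "a.example"],
    ["b.example", "c.example", "d.example"],
])
```

The top of the merged list would then serve as the seed set for focused crawling.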
Crawler Status
     Running intermittently since July 2004 on high
     bandwidth research infrastructure
     >1.6 million web resources have been identified in
     over 3000 languages
     Some exposed via standard OLAC search
     Majority exposed to standard search engines via
     DP9 gateway
           Full circle exploitation of web search
           Evaluation of precision improvement is ongoing
     More details in the accompanying paper (or Hughes 2005)
Metadata Descriptions
     Resources must be described separately from their
     realization, since the web-based language-centric
     resources are not held locally
     Metadata creation is an effort-intensive process
           Automatic description generation is well studied in the
           general digital libraries community (e.g. Paynter 2005)
     Some metadata elements are well supported by
     existing automatic metadata creation tools
     We focus particularly on language vs subject
     language metadata creation since it is of primary
     importance
Metadata Descriptions Status
     We use a combination of machine learning
     approaches to compare and classify a given
     resource against human curated gold standard data
     for known languages
           Primary data points: encoding, word n-grams, character n-grams
           Secondary data points: geographical referent co-location,
           lexical item occurrence, URI
     Around 40% of the >1.6 million URIs found by the crawler are
     currently described, using a probability of 0.8 or higher as the
     threshold for acceptable language identification
           Computationally bound at present; re-engineering in progress
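
A deliberately simplified sketch of thresholded language identification using one of the data points listed above, character n-grams. It is not the classifier described in the slides: a cosine similarity between trigram profiles stands in for the probability, the gold-standard profiles are built from single toy sentences, and all function names are illustrative.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count character n-grams, padding with spaces to capture word edges."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def identify(text, profiles, threshold=0.8):
    """Return (language, score), or (None, score) if below the threshold."""
    doc = char_ngrams(text)
    lang, score = max(((l, cosine(doc, p)) for l, p in profiles.items()),
                      key=lambda pair: pair[1])
    return (lang, score) if score >= threshold else (None, score)


# Toy stand-ins for human-curated gold standard data.
profiles = {
    "eng": char_ngrams("the quick brown fox jumps over the lazy dog"),
    "fra": char_ngrams("le renard brun saute par-dessus le chien paresseux"),
}

accepted = identify("the quick brown fox jumps over the lazy dog", profiles)
rejected = identify("zzzz qqqq", profiles)
```

Resources whose best score falls below the threshold are left undescribed rather than labelled with a doubtful language, which matches the acceptance rule stated above.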
Search Facilities
     Currently search delivered via OLAC Search Engine
     (http://www.language-archives.org/tools/search/)
     Features
           Web search style interface, UTF-8 support, no restrictions
           on string, operators, inline syntax
           Fuzzy string matching for geographical entities and
           language names
           ‘Click minimization’ strategy for empty search:
           pre-composed derivative queries
           Exploits Ethnologue and Getty ontologies
           Exploits linguistic knowledge (e.g. language families)
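
Fuzzy matching of language and place names can be sketched with Python's standard `difflib` module. The slides do not say which matching method the OLAC search engine uses, so this is an illustration only; the helper `fuzzy_lookup` and the short name list are hypothetical.

```python
import difflib


def fuzzy_lookup(query, names, cutoff=0.6):
    """Return known names that approximately match the query, best first."""
    # Compare case-insensitively, but report names in their original form.
    by_key = {name.casefold(): name for name in names}
    hits = difflib.get_close_matches(query.casefold(), list(by_key),
                                     n=3, cutoff=cutoff)
    return [by_key[hit] for hit in hits]


# Illustrative subset of language names.
known_names = ["Dinka", "Ewe", "Kabiye", "Thai"]

matches = fuzzy_lookup("Dinca", known_names)
no_matches = fuzzy_lookup("xyzzy", known_names)
```

A misspelt query such as "Dinca" still resolves to "Dinka", while a query with no plausible neighbour returns an empty list that the interface can answer with pre-composed alternatives.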
Search Facilities (continued)
     Localization-oriented interface
           XML core with XSL
           Entirely user preference driven with a default
           Post-query encoding/language change
           Code audit in progress to upgrade interface
           strings to XLIFF Portable Objects
           Interest in localization into French, Spanish,
           Bahasa Indonesia, Vietnamese, Thai
           More search architecture detail in Kamat and
           Hughes (2005)
Language Search: Dinka
     [Slide image not preserved]
Country Search: Togo
     [Slide image not preserved]
Future Work
     Increased frequency of web crawling
     More efficient and reliable language identification
     End user documentation and accessibility
     API documentation for third party data consumers and
     documentation for service/interface customization
     Map-based search GUI; better geographical
     context-aware search
     Language matching based on linguistic or
     geographical proximity
     Basic Language Resource Kits (BLARK)
     Integration with MyLanguage
Conclusion
     Language-centric broad coverage web search is a
     strongly motivated user function
     Major search providers do not focus on precision
     improvement per se, but their precision can be
     incrementally improved through covert means
     A multilingual web and multilingual web users can
     be supported effectively, even down to low densities
     Interested in leveraging our existing research and
     service development in other ways

Acknowledgements
     Research supported by the Australian
     Research Council under the funding program
     for Special Research Initiatives (E-Research)
     Grant SR0567353 “An Intelligent Search
     Infrastructure for Language Resources on the
     Web”.




Implementing Exponential Accelerators.pptxRich Reba
 
Go for Rakhi Bazaar and Pick the Latest Bhaiya Bhabhi Rakhi.pptx
Go for Rakhi Bazaar and Pick the Latest Bhaiya Bhabhi Rakhi.pptxGo for Rakhi Bazaar and Pick the Latest Bhaiya Bhabhi Rakhi.pptx
Go for Rakhi Bazaar and Pick the Latest Bhaiya Bhabhi Rakhi.pptxRakhi Bazaar
 
14680-51-4.pdf Good quality CAS Good quality CAS
14680-51-4.pdf  Good  quality CAS Good  quality CAS14680-51-4.pdf  Good  quality CAS Good  quality CAS
14680-51-4.pdf Good quality CAS Good quality CAScathy664059
 
Strategic Project Finance Essentials: A Project Manager’s Guide to Financial ...
Strategic Project Finance Essentials: A Project Manager’s Guide to Financial ...Strategic Project Finance Essentials: A Project Manager’s Guide to Financial ...
Strategic Project Finance Essentials: A Project Manager’s Guide to Financial ...Aggregage
 
5-Step Framework to Convert Any Business into a Wealth Generation Machine.pdf
5-Step Framework to Convert Any Business into a Wealth Generation Machine.pdf5-Step Framework to Convert Any Business into a Wealth Generation Machine.pdf
5-Step Framework to Convert Any Business into a Wealth Generation Machine.pdfSherl Simon
 
Excvation Safety for safety officers reference
Excvation Safety for safety officers referenceExcvation Safety for safety officers reference
Excvation Safety for safety officers referencessuser2c065e
 
Intermediate Accounting, Volume 2, 13th Canadian Edition by Donald E. Kieso t...
Intermediate Accounting, Volume 2, 13th Canadian Edition by Donald E. Kieso t...Intermediate Accounting, Volume 2, 13th Canadian Edition by Donald E. Kieso t...
Intermediate Accounting, Volume 2, 13th Canadian Edition by Donald E. Kieso t...ssuserf63bd7
 
Types of Cyberattacks - ASG I.T. Consulting.pdf
Types of Cyberattacks - ASG I.T. Consulting.pdfTypes of Cyberattacks - ASG I.T. Consulting.pdf
Types of Cyberattacks - ASG I.T. Consulting.pdfASGITConsulting
 
Healthcare Feb. & Mar. Healthcare Newsletter
Healthcare Feb. & Mar. Healthcare NewsletterHealthcare Feb. & Mar. Healthcare Newsletter
Healthcare Feb. & Mar. Healthcare NewsletterJamesConcepcion7
 
Entrepreneurial ecosystem- Wider context
Entrepreneurial ecosystem- Wider contextEntrepreneurial ecosystem- Wider context
Entrepreneurial ecosystem- Wider contextP&CO
 
WSMM Media and Entertainment Feb_March_Final.pdf
WSMM Media and Entertainment Feb_March_Final.pdfWSMM Media and Entertainment Feb_March_Final.pdf
WSMM Media and Entertainment Feb_March_Final.pdfJamesConcepcion7
 
Driving Business Impact for PMs with Jon Harmer
Driving Business Impact for PMs with Jon HarmerDriving Business Impact for PMs with Jon Harmer
Driving Business Impact for PMs with Jon HarmerAggregage
 
Neha Jhalani Hiranandani: A Guide to Her Life and Career
Neha Jhalani Hiranandani: A Guide to Her Life and CareerNeha Jhalani Hiranandani: A Guide to Her Life and Career
Neha Jhalani Hiranandani: A Guide to Her Life and Careerr98588472
 
Fundamentals Welcome and Inclusive DEIB
Fundamentals Welcome and  Inclusive DEIBFundamentals Welcome and  Inclusive DEIB
Fundamentals Welcome and Inclusive DEIBGregory DeShields
 
GUIDELINES ON USEFUL FORMS IN FREIGHT FORWARDING (F) Danny Diep Toh MBA.pdf
GUIDELINES ON USEFUL FORMS IN FREIGHT FORWARDING (F) Danny Diep Toh MBA.pdfGUIDELINES ON USEFUL FORMS IN FREIGHT FORWARDING (F) Danny Diep Toh MBA.pdf
GUIDELINES ON USEFUL FORMS IN FREIGHT FORWARDING (F) Danny Diep Toh MBA.pdfDanny Diep To
 
Jewish Resources in the Family Resource Centre
Jewish Resources in the Family Resource CentreJewish Resources in the Family Resource Centre
Jewish Resources in the Family Resource CentreNZSG
 

Dernier (20)

Technical Leaders - Working with the Management Team
Technical Leaders - Working with the Management TeamTechnical Leaders - Working with the Management Team
Technical Leaders - Working with the Management Team
 
Planetary and Vedic Yagyas Bring Positive Impacts in Life
Planetary and Vedic Yagyas Bring Positive Impacts in LifePlanetary and Vedic Yagyas Bring Positive Impacts in Life
Planetary and Vedic Yagyas Bring Positive Impacts in Life
 
WSMM Technology February.March Newsletter_vF.pdf
WSMM Technology February.March Newsletter_vF.pdfWSMM Technology February.March Newsletter_vF.pdf
WSMM Technology February.March Newsletter_vF.pdf
 
trending-flavors-and-ingredients-in-salty-snacks-us-2024_Redacted-V2.pdf
trending-flavors-and-ingredients-in-salty-snacks-us-2024_Redacted-V2.pdftrending-flavors-and-ingredients-in-salty-snacks-us-2024_Redacted-V2.pdf
trending-flavors-and-ingredients-in-salty-snacks-us-2024_Redacted-V2.pdf
 
Implementing Exponential Accelerators.pptx
Implementing Exponential Accelerators.pptxImplementing Exponential Accelerators.pptx
Implementing Exponential Accelerators.pptx
 
Go for Rakhi Bazaar and Pick the Latest Bhaiya Bhabhi Rakhi.pptx
Go for Rakhi Bazaar and Pick the Latest Bhaiya Bhabhi Rakhi.pptxGo for Rakhi Bazaar and Pick the Latest Bhaiya Bhabhi Rakhi.pptx
Go for Rakhi Bazaar and Pick the Latest Bhaiya Bhabhi Rakhi.pptx
 
14680-51-4.pdf Good quality CAS Good quality CAS
14680-51-4.pdf  Good  quality CAS Good  quality CAS14680-51-4.pdf  Good  quality CAS Good  quality CAS
14680-51-4.pdf Good quality CAS Good quality CAS
 
Strategic Project Finance Essentials: A Project Manager’s Guide to Financial ...
Strategic Project Finance Essentials: A Project Manager’s Guide to Financial ...Strategic Project Finance Essentials: A Project Manager’s Guide to Financial ...
Strategic Project Finance Essentials: A Project Manager’s Guide to Financial ...
 
5-Step Framework to Convert Any Business into a Wealth Generation Machine.pdf
5-Step Framework to Convert Any Business into a Wealth Generation Machine.pdf5-Step Framework to Convert Any Business into a Wealth Generation Machine.pdf
5-Step Framework to Convert Any Business into a Wealth Generation Machine.pdf
 
Excvation Safety for safety officers reference
Excvation Safety for safety officers referenceExcvation Safety for safety officers reference
Excvation Safety for safety officers reference
 
Intermediate Accounting, Volume 2, 13th Canadian Edition by Donald E. Kieso t...
Intermediate Accounting, Volume 2, 13th Canadian Edition by Donald E. Kieso t...Intermediate Accounting, Volume 2, 13th Canadian Edition by Donald E. Kieso t...
Intermediate Accounting, Volume 2, 13th Canadian Edition by Donald E. Kieso t...
 
Types of Cyberattacks - ASG I.T. Consulting.pdf
Types of Cyberattacks - ASG I.T. Consulting.pdfTypes of Cyberattacks - ASG I.T. Consulting.pdf
Types of Cyberattacks - ASG I.T. Consulting.pdf
 
Healthcare Feb. & Mar. Healthcare Newsletter
Healthcare Feb. & Mar. Healthcare NewsletterHealthcare Feb. & Mar. Healthcare Newsletter
Healthcare Feb. & Mar. Healthcare Newsletter
 
Entrepreneurial ecosystem- Wider context
Entrepreneurial ecosystem- Wider contextEntrepreneurial ecosystem- Wider context
Entrepreneurial ecosystem- Wider context
 
WSMM Media and Entertainment Feb_March_Final.pdf
WSMM Media and Entertainment Feb_March_Final.pdfWSMM Media and Entertainment Feb_March_Final.pdf
WSMM Media and Entertainment Feb_March_Final.pdf
 
Driving Business Impact for PMs with Jon Harmer
Driving Business Impact for PMs with Jon HarmerDriving Business Impact for PMs with Jon Harmer
Driving Business Impact for PMs with Jon Harmer
 
Neha Jhalani Hiranandani: A Guide to Her Life and Career
Neha Jhalani Hiranandani: A Guide to Her Life and CareerNeha Jhalani Hiranandani: A Guide to Her Life and Career
Neha Jhalani Hiranandani: A Guide to Her Life and Career
 
Fundamentals Welcome and Inclusive DEIB
Fundamentals Welcome and  Inclusive DEIBFundamentals Welcome and  Inclusive DEIB
Fundamentals Welcome and Inclusive DEIB
 
GUIDELINES ON USEFUL FORMS IN FREIGHT FORWARDING (F) Danny Diep Toh MBA.pdf
GUIDELINES ON USEFUL FORMS IN FREIGHT FORWARDING (F) Danny Diep Toh MBA.pdfGUIDELINES ON USEFUL FORMS IN FREIGHT FORWARDING (F) Danny Diep Toh MBA.pdf
GUIDELINES ON USEFUL FORMS IN FREIGHT FORWARDING (F) Danny Diep Toh MBA.pdf
 
Jewish Resources in the Family Resource Centre
Jewish Resources in the Family Resource CentreJewish Resources in the Family Resource Centre
Jewish Resources in the Family Resource Centre
 

Towards a Web Search Service for Minority Language Communities

  • 1. Towards a Web Search Service for Minority Language Communities
       Baden Hughes
       Department of Computer Science and Software Engineering
       The University of Melbourne
       badenh@csse.unimelb.edu.au
       17 January 2006, Hughes @ OpenRoad 2006
  • 2. Diversity in Australia
       Well recognised cultural and linguistic diversity of Australia’s population
             SIL Ethnologue
                   311 languages (14th edition, 2000)
                   318 languages (15th edition, 2005)
                   Australia in top 10 countries for linguistic diversity ( = languages in a country / languages globally )
             ABS: 364 languages (2005)
       Considerable number of low density languages used within immigrant communities
  • 3. Inefficiency of Web Search
       General web search is a low precision activity in the best case scenario
             Google: 8 billion web pages
       Web search for materials in lesser-used languages is even lower precision than the general case
       Web search for minority (“low density”) languages is even lower precision again
             Mining the ‘long tail’ of the web is a specialist domain of research
  • 4. Harvesting vs Enabling
       Previous work in linguistically-oriented data mining of web content to create derivative works: corpora, dictionaries
             None of these address the low precision issues for generalized web search
       Our work is aimed at increasing the likelihood that end users searching for resources in minority languages on the web will find useful results from searching
             Developing use-case specific tools for web search and leveraging existing broad coverage web search tools
  • 5. Open Language Archives Community (OLAC)
       OLAC is a consortium of linguistic data archives
             http://www.language-archives.org/
             34 archives, 28K+ objects in catalogue
       OLAC metadata is based on Dublin Core, with extensions for specifically linguistically-oriented properties
             eg language, data type, subject language, linguistic subject
       OLAC is an Open Archives Initiative (OAI) subcommunity
             Uses standard OAI Protocol for Metadata Harvesting to promote data access and integration
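As a rough illustration of the harvesting protocol mentioned on this slide, the sketch below builds OAI-PMH `ListRecords` request URLs without contacting any server. The base URL is a hypothetical placeholder, and the `olac` metadata prefix is an assumption about how an OLAC repository is configured (`oai_dc` is the prefix every OAI-PMH repository is required to support).

```python
from urllib.parse import urlencode

# Hypothetical endpoint: every OAI-PMH data provider publishes its own base URL.
BASE_URL = "https://archive.example.org/oai"


def list_records_url(base_url, metadata_prefix="olac", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL (no network access)."""
    if resumption_token is not None:
        # In OAI-PMH, resumptionToken is an exclusive argument:
        # it must not be combined with metadataPrefix.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


first_page = list_records_url(BASE_URL)
next_page = list_records_url(BASE_URL, resumption_token="page-2")
print(first_page)
print(next_page)
```

A harvester would fetch the first URL, parse the returned XML, and keep requesting with the returned resumption token until the repository stops supplying one.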
  • 6. In vs About
       OLAC Metadata crucially distinguishes between
             The language a resource is in (‘language’)
             The language a resource is about (‘subject language’)
       Such differentiation allows for additional precision in classifying, indexing and searching for low density language resources
       ‘In-ness’ is more interesting than ‘About-ness’
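The in/about distinction can be pictured with a small data structure. This is only a sketch: the `Record` class, its field names, and the three sample records are invented for illustration and are not the OLAC schema itself; `din` and `eng` are the ISO 639-3 codes for Dinka and English.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    title: str
    language: str          # language the resource is written or spoken in
    subject_language: str  # language the resource is about


# Made-up sample records for illustration only.
records = [
    Record("Dinka folk tales", language="din", subject_language="din"),
    Record("A grammar of Dinka", language="eng", subject_language="din"),
    Record("English primer", language="eng", subject_language="eng"),
]


def resources_in(code, items):
    """Titles of resources whose content is in the given language."""
    return [r.title for r in items if r.language == code]


def resources_about(code, items):
    """Titles of resources that describe the given language."""
    return [r.title for r in items if r.subject_language == code]


print(resources_in("din", records))
print(resources_about("din", records))
```

A community member looking for reading material in Dinka wants the first query; a linguist looking for descriptions of Dinka wants the second. A single undifferentiated "language" field could not separate the two.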
  • 7. Service Architecture
       Building on previous work in developing robust strategies for identifying web resources for lesser used languages on the web, the LangGator service architecture provides
             Language-centric web resource identification and acquisition
             Language-centric resource description
             Language-aware end-user resource discovery
  • 8. Crawler Internals
       Crawl seeded by language name variants (Ethnologue), place and country names and variants (Getty TGN), lexical items (Rosetta)
       Programmatic queries against Google, Yahoo, A9, DogPile
             Essentially guided metasearch
       Resulting URIs merged and sorted using rank aggregation techniques
       Highly ranked documents from metasearch used for focused crawling around URI
       TF/IDF for low frequency items in found documents
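The slide does not say which rank aggregation technique the crawler uses. As one simple possibility, the sketch below merges per-engine result lists with a Borda count; the engine names and URIs are invented, and the choice of Borda count is an assumption made purely for illustration.

```python
from collections import defaultdict


def borda_merge(rankings):
    """Merge ranked URI lists with a Borda count.

    Each list awards len(list) points to its first URI, one point fewer
    to the next, and so on. A URI absent from a list gets nothing from it.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, uri in enumerate(ranking):
            scores[uri] += n - position
    # Highest score first; ties broken alphabetically so output is deterministic.
    return sorted(scores, key=lambda uri: (-scores[uri], uri))


# Hypothetical result lists from three search engines for one seeded query.
results = {
    "engine_a": ["http://a.example/1", "http://b.example/2", "http://c.example/3"],
    "engine_b": ["http://b.example/2", "http://a.example/1"],
    "engine_c": ["http://b.example/2", "http://c.example/3", "http://d.example/4"],
}

merged = borda_merge(results.values())
for uri in merged:
    print(uri)
```

URIs that several engines rank highly rise to the top of the merged list, which is the property a focused crawler wants when choosing where to start.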
  • 9. Crawler Status
       Running intermittently since July 2004 on high bandwidth research infrastructure
       >1.6 million web resources have been identified in over 3000 languages
             Some exposed via standard OLAC search
             Majority exposed to standard search engines via DP9 gateway
             Full circle exploitation of web search
       Evaluation of precision improvement is ongoing
             More details in the paper (or Hughes 2005 paper)
  • 10. Metadata Descriptions
       Describing resources separately from their realization is required since the web based language-centric resources are not held locally
       Metadata creation is an effort intensive process
       Automatic description generation is well studied in the general digital libraries community (eg Paynter 2005)
       Some metadata elements are well supported by existing automatic metadata creation tools
       We focus particularly on language vs subject language metadata creation since it is of primary importance
  • 11. Metadata Descriptions Status
       We use a combination of machine learning approaches to compare and classify a given resource against human curated gold standard data for known languages
             Primary data points: encoding, word n-grams, character n-grams
             Secondary data points: geographical referent colocation, lexical item occurrence, URI
       Currently described around 40% of the >1.6 million URIs found by crawler at probability of 0.8 or higher as threshold for acceptable language identification
       Computationally bound at present, but re-engineering
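To make the character n-gram and threshold ideas concrete, here is a minimal sketch. It is not the classifier described on the slide: it uses only character trigrams with cosine similarity, the "gold standard" samples are two toy sentences, and the confidence value is a normalised similarity share rather than a calibrated probability. Only the 0.8 acceptance threshold is taken from the slide.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count character n-grams after lowercasing and collapsing whitespace."""
    text = " " + " ".join(text.lower().split()) + " "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def identify(text, profiles, threshold=0.8):
    """Return (language, confidence); language is None below the threshold."""
    doc = char_ngrams(text)
    scores = {lang: cosine(doc, prof) for lang, prof in profiles.items()}
    total = sum(scores.values())
    if total == 0:
        return None, 0.0
    best = max(scores, key=scores.get)
    # Share of total similarity: an assumption standing in for a real probability.
    confidence = scores[best] / total
    return (best if confidence >= threshold else None), confidence


# Toy stand-ins for human curated gold standard data.
profiles = {
    "eng": char_ngrams("the quick brown fox jumps over the lazy dog"),
    "fra": char_ngrams("le renard brun saute par dessus le chien paresseux"),
}

language, confidence = identify("the dog jumps over the fox", profiles)
print(language, round(confidence, 2))
```

Resources that fall below the threshold are left undescribed rather than mislabelled, which matches the slide's point that only part of the crawled set has been described so far.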
  • 12. Search Facilities
       Currently search delivered via OLAC Search Engine (http://www.language-archives.org/tools/search/)
       Features
             Web search style interface, UTF-8 support, no restrictions on string, operators, inline syntax
             Fuzzy string matching for geographical entities and language names
             ‘Click minimization’ strategy for empty search: pre-composed derivative queries
             Exploits Ethnologue and Getty ontologies
             Exploits linguistic knowledge (eg language families)
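Fuzzy matching of language and place names can be approximated with Python's standard `difflib` module. This sketch is an assumption about how such matching might look, not the OLAC search engine's own algorithm; the short name list stands in for the much larger Ethnologue and Getty vocabularies.

```python
import difflib

# Tiny stand-in for the language and place name lists used by the real service.
NAMES = ["Dinka", "Ewe", "Kabiye", "Togo"]


def fuzzy_lookup(query, names, cutoff=0.6):
    """Return known names that approximately match the query, best first."""
    by_key = {name.casefold(): name for name in names}
    hits = difflib.get_close_matches(query.casefold(), list(by_key), n=3, cutoff=cutoff)
    return [by_key[hit] for hit in hits]


print(fuzzy_lookup("Dinca", NAMES))
print(fuzzy_lookup("zzzz", NAMES))
```

Comparing case-folded strings lets a misspelt or differently capitalised query still reach the canonical name, and an empty result list is the cue for falling back to pre-composed derivative queries.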
  • 13. Search Facilities
       Localization-oriented interface
             XML core with XSL
             Entirely user preference driven with a default
             Post-query encoding/language change
       Currently code auditing for upgrading interface strings to XLIFF Portable Objects
       Interest for localization into French, Spanish, Bahasa Indonesia, Vietnamese, Thai
       More search architecture detail in Kamat and Hughes (2005)
  • 14. Language Search: Dinka
  • 15. Country Search: Togo
  • 16. Future Work
       Increased frequency of web crawling
       More efficient and reliable language identification
       End user documentation and accessibility
       API documentation for third party data consumers and documentation for service/interface customization
       Map based search GUI; better geographical context-aware search
       Linguistically or geographical proximity based language matching
       Basic Language Resource Kits (BLARK)
       Integration with MyLanguage
  • 17. Conclusion
       Language-centric broad coverage web search is a strongly motivated user function
       Major search providers do not focus on precision improvement per se, but can be incrementally improved through covert means
       A multilingual web and multilingual web users can be supported effectively, even down to low densities
       Interested in leveraging our existing research and service development in other ways
  • 18. Acknowledgements
       Research supported by the Australian Research Council under the funding program for Special Research Initiatives (E-Research) Grant SR0567353 “An Intelligent Search Infrastructure for Language Resources on the Web”.