Build a Searchable Knowledge Base
Jimmy Lai
Yahoo! Search Engineer
r97922028 [at] ntu.edu.tw
2014/05/18
http://www.slideshare.net/jimmy_lai/build-a-searchable-knowledge-base
Outline
• Introduction to Knowledge Base
• Construct a Knowledge Base
• Search the Knowledge Base
  • string match
  • synonym search
  • full text search
  • geo search
  • put all together
• More Applications
Knowledge
• Knowledge is power. - Francis Bacon, 1597
• Knowledge is boundless and connected. So, an
efficient interface to search and browse the
knowledge base is essential.
• Let’s try to build a searchable knowledge base.
Applications of a Knowledge Base
• Personal assistants: Siri, Google Now
• Search engines: Google’s Knowledge Graph
Construct a Knowledge Base
1. Find good data sources.
2. Aggregate data as knowledge entity.
3. Construct structured data of knowledge entity.
4. Search the knowledge base.
5. Navigate the knowledge base.
Wikipedia
• A collaboratively edited encyclopedia with more than 30M articles in 287 languages.
• A good source for a knowledge base. However, Wikipedia's data is not well structured.
http://www.theguardian.com/technology/blog/2009/aug/13/wikipedia-edits
DBpedia
• http://wiki.dbpedia.org/About
• Structured data from Wikipedia.
• A good data source for a knowledge base.
Knowledge Entity
• Identifier
• Abstract
• Relations
What can Python do for us
• Data wrangling
  • Process the raw text data
  • Aggregate the data from different sources
  • Output the data in JSON format
• Connecting the data flow between systems
  • Automation scripts for starting services and feeding data
  • A REST API implementing the search strategy
Example code
git clone git@github.com:jimmylai/knowledge.git
https://github.com/jimmylai/knowledge
• required python packages:
1. fabric
2. pysolr
3. django
Data Preparation
1. Download data from DBpedia:
   http://downloads.dbpedia.org/current/en/
2. Filter for the specific kinds of knowledge entity you want:
   bzcat instance_types_en.nt.bz2 | get_id_list.py
3. Parse and aggregate the entity data from the files below.

Data file                          Script            Data field
short_abstracts_en.nt.bz2          get_abstract.py   abstract
raw_infobox_properties_en.nt.bz2   get_relation.py   relations
geo_coordinates_en.nt.bz2          get_geo.py        latlon
redirects_en.nt.bz2                get_redirect.py   redirects
Aggregated Data Format
"http://dbpedia.org/resource/Lake_Yosemite": {
    "latlon": "37.376389,-120.428889",
    "redirects": [
        "Lake_yosemite"
    ],
    "abstract": "Lake Yosemite is an artificial freshwater lake located approximately five miles (8 km) east of Merced, California in the rolling Sierra Foothills. UC Merced is situated approximately half a mile (0.8 km) south of Lake Yosemite. The university is bounded by the lake on one side and two canals (Fairfield Canal and Le Grand Canal) run through the campus. In 2007, a myth featured in the Mythbusters' James Bond Special 1 episode was filmed and tested at Lake Yosemite.",
    "relations": {
        "type": "http://dbpedia.org/resource/Reservoir",
        "location": "http://dbpedia.org/resource/California"
    }
}
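The aggregation step can be sketched roughly as follows. This is a minimal illustration, not the code of the get_*.py scripts in the repository (which are not shown in these slides): the names parse_triple and aggregate are hypothetical, only abstracts and redirects are handled, and escape sequences inside N-Triples literals are not decoded.

```python
import json
import re
from collections import defaultdict

# Matches one N-Triples line: <subject> <predicate> object .
TRIPLE_RE = re.compile(r'^<([^>]+)>\s+<([^>]+)>\s+(.+?)\s*\.\s*$')


def parse_triple(line):
    """Return (subject, predicate, object), or None for comments and blank lines."""
    match = TRIPLE_RE.match(line.strip())
    if match is None:
        return None
    subject, predicate, obj = match.groups()
    if obj.startswith('<') and obj.endswith('>'):
        obj = obj[1:-1]                    # URI object: strip the angle brackets
    elif obj.startswith('"'):
        obj = obj[1:obj.rindex('"')]       # literal: drop quotes and any @lang / ^^type suffix
    return subject, predicate, obj


def aggregate(abstract_lines, redirect_lines):
    """Merge abstracts and redirects into one dict keyed by entity URI (hypothetical helper)."""
    entities = defaultdict(lambda: {"redirects": []})
    for line in abstract_lines:
        triple = parse_triple(line)
        if triple:
            subject, _, text = triple
            entities[subject]["abstract"] = text
    for line in redirect_lines:
        triple = parse_triple(line)
        if triple:
            source, _, target = triple
            # A redirect points from the alias page to the canonical entity.
            entities[target]["redirects"].append(source.rsplit('/', 1)[-1])
    return dict(entities)


if __name__ == "__main__":
    sample_abstracts = [
        '<http://dbpedia.org/resource/Lake_Yosemite> '
        '<http://www.w3.org/2000/01/rdf-schema#comment> "An artificial lake."@en .',
    ]
    sample_redirects = [
        '<http://dbpedia.org/resource/Lake_yosemite> '
        '<http://dbpedia.org/ontology/wikiPageRedirects> '
        '<http://dbpedia.org/resource/Lake_Yosemite> .',
    ]
    print(json.dumps(aggregate(sample_abstracts, sample_redirects), indent=2))
```

For the real dumps, the line iterables could come from bz2.open(path, "rt", encoding="utf-8") in the standard library, so the files never need to be decompressed on disk.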
Search by Solr
• Solr is a real-time, full-text search engine built on Apache Lucene.
• It provides a REST-like API.
• pysolr makes Solr easy to use from Python.
• Download version 4.8.0 (the latest at the time of this talk) from
  http://www.apache.org/dyn/closer.cgi/lucene/solr/4.8.0
  and extract it to the solr/solr-4.8.0 directory.
• Start the Solr server, then check the web UI:
  fab start_solr
  http://localhost:8983/solr/
Search - String Match
• To be able to search by entity name
python feed_data.py string_match

• config: solr/conf/string_match/schema.xml
<field name="name" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="abstract" type="string" indexed="false" stored="true" multiValued="false"/>
• Feed the entities to Solr. Each entity has a name field and an abstract field.
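As a rough sketch of the feeding step, the aggregated entities can be flattened into documents with the two fields the schema declares. The helper names entity_name and to_solr_docs are hypothetical, and deriving the name from the last segment of the DBpedia URI is an assumption; the slides do not show how feed_data.py builds the name.

```python
from urllib.parse import unquote


def entity_name(uri):
    """Derive a display name from a DBpedia resource URI (an assumption, not from the slides)."""
    return unquote(uri.rsplit('/', 1)[-1]).replace('_', ' ')


def to_solr_docs(entities):
    """Turn the aggregated entity dict into a list of flat documents for Solr."""
    docs = []
    for uri, data in entities.items():
        docs.append({
            "name": entity_name(uri),
            "abstract": data.get("abstract", ""),
        })
    return docs
```

With pysolr installed and the server running, a list like this could then be sent with pysolr.Solr(core_url).add(docs), where core_url is the URL of the string_match core.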
Search - String Match
http://localhost:8983/solr/string_match/select?q=name%3A%22San+Francisco%22&wt=json&indent=true
Search by entity name.
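The query URL above can be built from Python with the standard library alone. This is a small sketch; select_url and SOLR_BASE are illustrative names, and quotes inside the search value are not escaped.

```python
from urllib.parse import urlencode

# Base URL of the local Solr server started with "fab start_solr".
SOLR_BASE = "http://localhost:8983/solr"


def select_url(core, field, value):
    """Build a Solr select URL for an exact phrase match on one field."""
    params = {
        "q": '{}:"{}"'.format(field, value),  # e.g. name:"San Francisco"
        "wt": "json",                         # ask for a JSON response
        "indent": "true",                     # pretty-print the response
    }
    return "{}/{}/select?{}".format(SOLR_BASE, core, urlencode(params))


if __name__ == "__main__":
    print(select_url("string_match", "name", "San Francisco"))
```

The resulting URL can be opened in a browser or fetched with urllib.request.urlopen and decoded with json.load.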
Search - Synonym
• To be able to search by synonym of entity name
python feed_data.py synonym_string_match

• config: solr/conf/synonym_string_match/schema.xml
<field name="name" type="name_text" indexed="true" stored="true" multiValued="false"/>
<fieldType name="name_text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
…
• Restart the Solr server so that the synonym file is reloaded.
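The slides do not show how synonyms.txt is produced. One plausible source is the redirects field of the aggregated data, so the sketch below assumes that and writes explicit "alias => canonical" mappings in the Solr synonyms file format. The function name synonym_lines is hypothetical.

```python
def synonym_lines(entities):
    """Build synonyms.txt lines from redirects (assumed source of synonyms)."""
    lines = []
    for uri, data in sorted(entities.items()):
        canonical = uri.rsplit('/', 1)[-1].replace('_', ' ')
        aliases = []
        for redirect in data.get("redirects", []):
            alias = redirect.replace('_', ' ')
            # ignoreCase="true" already covers aliases that differ only by case.
            if alias.lower() == canonical.lower():
                continue
            # Skip aliases that would clash with the file's own separators.
            if ',' in alias or '=>' in alias:
                continue
            aliases.append(alias)
        if aliases:
            lines.append("{} => {}".format(", ".join(aliases), canonical))
    return lines


if __name__ == "__main__":
    sample = {
        "http://dbpedia.org/resource/San_Francisco": {
            "redirects": ["SF", "San_francisco", "Frisco"],
        },
    }
    print("\n".join(synonym_lines(sample)))
```

Writing the returned lines to the core's synonyms.txt and restarting Solr would make the aliases searchable under the canonical name.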
Synonym handling at index time
Synonym handling at query time
Search - Synonym
Search by synonym.
Search - Full Text Search
• To be able to search the full text of entity names and abstracts
python feed_data.py full_text_search

• config: solr/conf/full_text_search/schema.xml
<copyField source="name" dest="text"/>
<copyField source="abstract" dest="text"/>
• Feed the entities to Solr. The name and abstract fields of each entity are
copied to the text field, so full-text search works without specifying which
field to search.
Search - Full Text Search
Search - Geo Search
• To be able to search by distance given a location
python feed_data.py geo_search

• config: solr/conf/geo_search/schema.xml
<field name="location" type="location" indexed="true" stored="true" required="false" multiValued="false"/>
• Feed the entities to Solr. Each entity has a location field in
"latitude,longitude" format, for example "51.670100,-3.230100".
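A distance-limited query against the location field can be expressed with Solr's geofilt filter. The sketch below only builds the request parameters. The helper name geo_params is hypothetical, and the match-all default query is an assumption.

```python
from urllib.parse import urlencode


def geo_params(lat, lon, distance_km, query="*:*", field="location"):
    """Build Solr parameters that keep only documents within distance_km of a point."""
    return {
        "q": query,
        "fq": "{!geofilt}",                        # spatial filter query
        "sfield": field,                           # field holding "lat,lon"
        "pt": "{:.6f},{:.6f}".format(lat, lon),    # center point
        "d": str(distance_km),                     # radius in kilometers
        "wt": "json",
    }


if __name__ == "__main__":
    params = geo_params(37.376389, -120.428889, 10)
    print("http://localhost:8983/solr/geo_search/select?" + urlencode(params))
```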
Given condition on distance
Search - Put All Together
• Search Strategy
1. Input a query.
2. Search by synonym match.
3. Search by full text.
   1. If the input includes a location, filter the results by geo search.
• Implement the search strategy as an API
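The strategy above can be sketched as a plain function, independent of Django. The slides do not say how the synonym and full-text steps are combined, so this sketch assumes full-text search is a fallback used only when the synonym match returns nothing. The three search callables are hypothetical stand-ins for the actual Solr queries.

```python
def search(query, synonym_search, full_text_search, geo_filter=None, location=None):
    """Run the search strategy: synonym match, then full text, then optional geo filter."""
    results = synonym_search(query)
    if not results:
        # Assumption: fall back to full-text search only when nothing matched by name.
        results = full_text_search(query)
    if location is not None and geo_filter is not None:
        results = geo_filter(results, location)
    return results


if __name__ == "__main__":
    # Fake back ends standing in for the Solr cores.
    by_synonym = lambda q: ["San Francisco"] if q.lower() == "sf" else []
    by_full_text = lambda q: ["Lake Yosemite", "Lake Tahoe"]
    near = lambda results, loc: [r for r in results if r == "Lake Yosemite"]

    print(search("sf", by_synonym, by_full_text))
    print(search("lake", by_synonym, by_full_text, geo_filter=near,
                 location="37.376389,-120.428889"))
```

Passing the back ends in as arguments keeps the strategy testable without a running Solr server; a Django view would only need to parse the request, call this function, and serialize the result.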
Implement the search strategy in a Django view
Review
• A Knowledge Base with synonym, full-text and geo
search API.
• The knowledge entities are connected by relation.
More Applications
• Question answering system:
1. Query analysis: identify the intent (e.g. looking for a specific type of entity).
2. Search the knowledge base.
3. Return the knowledge entity.
Modern search engines don’t just return web page URLs; they give users direct answers.
More Data Sources and Knowledge Entities
• Open Data
• Open APIs
My Life in Yahoo!
• Build online services for billions of users.
• Big data mining on cloud infrastructures.
• Open and Innovative working environment.
• International teamwork and English communication.
• Business trips to Silicon Valley.
• Send me your resume if you need a referral.
r97922028 [at] ntu.edu.tw
32

Contenu connexe

Tendances (20)

Search Engines
Search EnginesSearch Engines
Search Engines
 
Web Search Engine
Web Search EngineWeb Search Engine
Web Search Engine
 
Search engine and web crawler
Search engine and web crawlerSearch engine and web crawler
Search engine and web crawler
 
Comparing Search Engines
Comparing Search EnginesComparing Search Engines
Comparing Search Engines
 
Search engine ppt
Search engine pptSearch engine ppt
Search engine ppt
 
Working of search engine
Working of search engineWorking of search engine
Working of search engine
 
Search engine ppt
Search engine pptSearch engine ppt
Search engine ppt
 
Search engine and web crawler
Search engine and web crawlerSearch engine and web crawler
Search engine and web crawler
 
How Internet Search Engines Work
How Internet Search Engines WorkHow Internet Search Engines Work
How Internet Search Engines Work
 
Search Engine
Search EngineSearch Engine
Search Engine
 
Smart Searching
Smart SearchingSmart Searching
Smart Searching
 
Search engines
Search enginesSearch engines
Search engines
 
Search engines powerpoint
Search engines powerpointSearch engines powerpoint
Search engines powerpoint
 
Search engine
Search engineSearch engine
Search engine
 
Training Project Report on Search Engines
Training Project Report on Search EnginesTraining Project Report on Search Engines
Training Project Report on Search Engines
 
Surfing the internet
Surfing the internetSurfing the internet
Surfing the internet
 
Working Of Search Engine
Working Of Search EngineWorking Of Search Engine
Working Of Search Engine
 
Search Engines
Search EnginesSearch Engines
Search Engines
 
Search Engine
Search EngineSearch Engine
Search Engine
 
Searching the Internet
Searching the Internet Searching the Internet
Searching the Internet
 

Similaire à Build a Searchable Knowledge Base

Scaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solrScaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solrTrey Grainger
 
Deep Web
Deep WebDeep Web
Deep WebSt John
 
WTF is Semantic Web?
WTF is Semantic Web?WTF is Semantic Web?
WTF is Semantic Web?milesw
 
(Re-)Discovering Lost Web Pages
(Re-)Discovering Lost Web Pages(Re-)Discovering Lost Web Pages
(Re-)Discovering Lost Web PagesMichael Nelson
 
05. EDT 513 Week 5 2023 Searching the Internet.pptx
05. EDT 513 Week 5 2023 Searching the Internet.pptx05. EDT 513 Week 5 2023 Searching the Internet.pptx
05. EDT 513 Week 5 2023 Searching the Internet.pptxGambari Amosa Isiaka
 
Semantic Web: introduction & overview
Semantic Web: introduction & overviewSemantic Web: introduction & overview
Semantic Web: introduction & overviewAmit Sheth
 
Linked Data (1st Linked Data Meetup Malmö)
Linked Data (1st Linked Data Meetup Malmö)Linked Data (1st Linked Data Meetup Malmö)
Linked Data (1st Linked Data Meetup Malmö)Anja Jentzsch
 
Introduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrIntroduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrRahul Jain
 
Lesson 2 network and the internet
Lesson 2 network and the internetLesson 2 network and the internet
Lesson 2 network and the internetMaria Theresa
 
Synchronicity: Just-In-Time Discovery of Lost Web Pages
Synchronicity: Just-In-Time Discovery of Lost Web PagesSynchronicity: Just-In-Time Discovery of Lost Web Pages
Synchronicity: Just-In-Time Discovery of Lost Web PagesMichael Nelson
 
Introduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of LuceneIntroduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of LuceneRahul Jain
 
20130310 solr tuorial
20130310 solr tuorial20130310 solr tuorial
20130310 solr tuorialChris Huang
 
Exploring the Semantic Web
Exploring the Semantic WebExploring the Semantic Web
Exploring the Semantic WebRoberto García
 
Linked Data and Discovery with Steve Meyer
Linked Data and Discovery with Steve MeyerLinked Data and Discovery with Steve Meyer
Linked Data and Discovery with Steve MeyerWiLS
 
Importing life science at a into Neo4j
Importing life science at a into Neo4jImporting life science at a into Neo4j
Importing life science at a into Neo4jSimon Jupp
 
INSPIRE Hackathon Webinar Intro to Linked Data and Semantics
INSPIRE Hackathon Webinar   Intro to Linked Data and SemanticsINSPIRE Hackathon Webinar   Intro to Linked Data and Semantics
INSPIRE Hackathon Webinar Intro to Linked Data and Semanticsplan4all
 

Similaire à Build a Searchable Knowledge Base (20)

Scaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solrScaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solr
 
Deep Web
Deep WebDeep Web
Deep Web
 
WTF is Semantic Web?
WTF is Semantic Web?WTF is Semantic Web?
WTF is Semantic Web?
 
(Re-)Discovering Lost Web Pages
(Re-)Discovering Lost Web Pages(Re-)Discovering Lost Web Pages
(Re-)Discovering Lost Web Pages
 
Google Dorks
Google DorksGoogle Dorks
Google Dorks
 
05. EDT 513 Week 5 2023 Searching the Internet.pptx
05. EDT 513 Week 5 2023 Searching the Internet.pptx05. EDT 513 Week 5 2023 Searching the Internet.pptx
05. EDT 513 Week 5 2023 Searching the Internet.pptx
 
Deep web
Deep webDeep web
Deep web
 
Semantic Web: introduction & overview
Semantic Web: introduction & overviewSemantic Web: introduction & overview
Semantic Web: introduction & overview
 
Linked Data (1st Linked Data Meetup Malmö)
Linked Data (1st Linked Data Meetup Malmö)Linked Data (1st Linked Data Meetup Malmö)
Linked Data (1st Linked Data Meetup Malmö)
 
Introduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrIntroduction to Apache Lucene/Solr
Introduction to Apache Lucene/Solr
 
Lesson 2 network and the internet
Lesson 2 network and the internetLesson 2 network and the internet
Lesson 2 network and the internet
 
Synchronicity: Just-In-Time Discovery of Lost Web Pages
Synchronicity: Just-In-Time Discovery of Lost Web PagesSynchronicity: Just-In-Time Discovery of Lost Web Pages
Synchronicity: Just-In-Time Discovery of Lost Web Pages
 
Introduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of LuceneIntroduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of Lucene
 
20130310 solr tuorial
20130310 solr tuorial20130310 solr tuorial
20130310 solr tuorial
 
Exploring the Semantic Web
Exploring the Semantic WebExploring the Semantic Web
Exploring the Semantic Web
 
Where's the Data?
Where's the Data?Where's the Data?
Where's the Data?
 
Linked Data and Discovery with Steve Meyer
Linked Data and Discovery with Steve MeyerLinked Data and Discovery with Steve Meyer
Linked Data and Discovery with Steve Meyer
 
NISO/DCMI May 22 Webinar: Semantic Mashups Across Large, Heterogeneous Insti...
 NISO/DCMI May 22 Webinar: Semantic Mashups Across Large, Heterogeneous Insti... NISO/DCMI May 22 Webinar: Semantic Mashups Across Large, Heterogeneous Insti...
NISO/DCMI May 22 Webinar: Semantic Mashups Across Large, Heterogeneous Insti...
 
Importing life science at a into Neo4j
Importing life science at a into Neo4jImporting life science at a into Neo4j
Importing life science at a into Neo4j
 
INSPIRE Hackathon Webinar Intro to Linked Data and Semantics
INSPIRE Hackathon Webinar   Intro to Linked Data and SemanticsINSPIRE Hackathon Webinar   Intro to Linked Data and Semantics
INSPIRE Hackathon Webinar Intro to Linked Data and Semantics
 

Plus de Jimmy Lai

Python Linters at Scale.pdf
Python Linters at Scale.pdfPython Linters at Scale.pdf
Python Linters at Scale.pdfJimmy Lai
 
EuroPython 2022 - Automated Refactoring Large Python Codebases
EuroPython 2022 - Automated Refactoring Large Python CodebasesEuroPython 2022 - Automated Refactoring Large Python Codebases
EuroPython 2022 - Automated Refactoring Large Python CodebasesJimmy Lai
 
Annotate types in large codebase with automated refactoring
Annotate types in large codebase with automated refactoringAnnotate types in large codebase with automated refactoring
Annotate types in large codebase with automated refactoringJimmy Lai
 
The journey of asyncio adoption in instagram
The journey of asyncio adoption in instagramThe journey of asyncio adoption in instagram
The journey of asyncio adoption in instagramJimmy Lai
 
Data Analyst Nanodegree
Data Analyst NanodegreeData Analyst Nanodegree
Data Analyst NanodegreeJimmy Lai
 
Distributed system coordination by zookeeper and introduction to kazoo python...
Distributed system coordination by zookeeper and introduction to kazoo python...Distributed system coordination by zookeeper and introduction to kazoo python...
Distributed system coordination by zookeeper and introduction to kazoo python...Jimmy Lai
 
Continuous Delivery: automated testing, continuous integration and continuous...
Continuous Delivery: automated testing, continuous integration and continuous...Continuous Delivery: automated testing, continuous integration and continuous...
Continuous Delivery: automated testing, continuous integration and continuous...Jimmy Lai
 
[LDSP] Solr Usage
[LDSP] Solr Usage[LDSP] Solr Usage
[LDSP] Solr UsageJimmy Lai
 
[LDSP] Search Engine Back End API Solution for Fast Prototyping
[LDSP] Search Engine Back End API Solution for Fast Prototyping[LDSP] Search Engine Back End API Solution for Fast Prototyping
[LDSP] Search Engine Back End API Solution for Fast PrototypingJimmy Lai
 
Text classification in scikit-learn
Text classification in scikit-learnText classification in scikit-learn
Text classification in scikit-learnJimmy Lai
 
Big data analysis in python @ PyCon.tw 2013
Big data analysis in python @ PyCon.tw 2013Big data analysis in python @ PyCon.tw 2013
Big data analysis in python @ PyCon.tw 2013Jimmy Lai
 
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...Jimmy Lai
 
Software development practices in python
Software development practices in pythonSoftware development practices in python
Software development practices in pythonJimmy Lai
 
Fast data mining flow prototyping using IPython Notebook
Fast data mining flow prototyping using IPython NotebookFast data mining flow prototyping using IPython Notebook
Fast data mining flow prototyping using IPython NotebookJimmy Lai
 
Documentation with sphinx @ PyHug
Documentation with sphinx @ PyHugDocumentation with sphinx @ PyHug
Documentation with sphinx @ PyHugJimmy Lai
 
Apache thrift-RPC service cross languages
Apache thrift-RPC service cross languagesApache thrift-RPC service cross languages
Apache thrift-RPC service cross languagesJimmy Lai
 
NetworkX - python graph analysis and visualization @ PyHug
NetworkX - python graph analysis and visualization @ PyHugNetworkX - python graph analysis and visualization @ PyHug
NetworkX - python graph analysis and visualization @ PyHugJimmy Lai
 
When big data meet python @ COSCUP 2012
When big data meet python @ COSCUP 2012When big data meet python @ COSCUP 2012
When big data meet python @ COSCUP 2012Jimmy Lai
 
Nltk natural language toolkit overview and application @ PyCon.tw 2012
Nltk  natural language toolkit overview and application @ PyCon.tw 2012Nltk  natural language toolkit overview and application @ PyCon.tw 2012
Nltk natural language toolkit overview and application @ PyCon.tw 2012Jimmy Lai
 
Nltk natural language toolkit overview and application @ PyHug
Nltk  natural language toolkit overview and application @ PyHugNltk  natural language toolkit overview and application @ PyHug
Nltk natural language toolkit overview and application @ PyHugJimmy Lai
 

Plus de Jimmy Lai (20)

Python Linters at Scale.pdf
Python Linters at Scale.pdfPython Linters at Scale.pdf
Python Linters at Scale.pdf
 
EuroPython 2022 - Automated Refactoring Large Python Codebases
EuroPython 2022 - Automated Refactoring Large Python CodebasesEuroPython 2022 - Automated Refactoring Large Python Codebases
EuroPython 2022 - Automated Refactoring Large Python Codebases
 
Annotate types in large codebase with automated refactoring
Annotate types in large codebase with automated refactoringAnnotate types in large codebase with automated refactoring
Annotate types in large codebase with automated refactoring
 
The journey of asyncio adoption in instagram
The journey of asyncio adoption in instagramThe journey of asyncio adoption in instagram
The journey of asyncio adoption in instagram
 
Data Analyst Nanodegree
Data Analyst NanodegreeData Analyst Nanodegree
Data Analyst Nanodegree
 
Distributed system coordination by zookeeper and introduction to kazoo python...
Distributed system coordination by zookeeper and introduction to kazoo python...Distributed system coordination by zookeeper and introduction to kazoo python...
Distributed system coordination by zookeeper and introduction to kazoo python...
 
Continuous Delivery: automated testing, continuous integration and continuous...
Continuous Delivery: automated testing, continuous integration and continuous...Continuous Delivery: automated testing, continuous integration and continuous...
Continuous Delivery: automated testing, continuous integration and continuous...
 
[LDSP] Solr Usage
[LDSP] Solr Usage[LDSP] Solr Usage
[LDSP] Solr Usage
 
[LDSP] Search Engine Back End API Solution for Fast Prototyping
[LDSP] Search Engine Back End API Solution for Fast Prototyping[LDSP] Search Engine Back End API Solution for Fast Prototyping
[LDSP] Search Engine Back End API Solution for Fast Prototyping
 
Text classification in scikit-learn
Text classification in scikit-learnText classification in scikit-learn
Text classification in scikit-learn
 
Big data analysis in python @ PyCon.tw 2013
Big data analysis in python @ PyCon.tw 2013Big data analysis in python @ PyCon.tw 2013
Big data analysis in python @ PyCon.tw 2013
 
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
 
Software development practices in python
Software development practices in pythonSoftware development practices in python
Software development practices in python
 
Fast data mining flow prototyping using IPython Notebook
Fast data mining flow prototyping using IPython NotebookFast data mining flow prototyping using IPython Notebook
Fast data mining flow prototyping using IPython Notebook
 
Documentation with sphinx @ PyHug
Documentation with sphinx @ PyHugDocumentation with sphinx @ PyHug
Documentation with sphinx @ PyHug
 
Apache thrift-RPC service cross languages
Apache thrift-RPC service cross languagesApache thrift-RPC service cross languages
Apache thrift-RPC service cross languages
 
NetworkX - python graph analysis and visualization @ PyHug
NetworkX - python graph analysis and visualization @ PyHugNetworkX - python graph analysis and visualization @ PyHug
NetworkX - python graph analysis and visualization @ PyHug
 
When big data meet python @ COSCUP 2012
When big data meet python @ COSCUP 2012When big data meet python @ COSCUP 2012
When big data meet python @ COSCUP 2012
 
Nltk natural language toolkit overview and application @ PyCon.tw 2012
Nltk  natural language toolkit overview and application @ PyCon.tw 2012Nltk  natural language toolkit overview and application @ PyCon.tw 2012
Nltk natural language toolkit overview and application @ PyCon.tw 2012
 
Nltk natural language toolkit overview and application @ PyHug
Nltk  natural language toolkit overview and application @ PyHugNltk  natural language toolkit overview and application @ PyHug
Nltk natural language toolkit overview and application @ PyHug
 

Dernier

Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRLLucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRLimonikaupta
 
➥🔝 7737669865 🔝▻ mehsana Call-girls in Women Seeking Men 🔝mehsana🔝 Escorts...
➥🔝 7737669865 🔝▻ mehsana Call-girls in Women Seeking Men  🔝mehsana🔝   Escorts...➥🔝 7737669865 🔝▻ mehsana Call-girls in Women Seeking Men  🔝mehsana🔝   Escorts...
➥🔝 7737669865 🔝▻ mehsana Call-girls in Women Seeking Men 🔝mehsana🔝 Escorts...nirzagarg
 
Hire↠Young Call Girls in Tilak nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esc...
Hire↠Young Call Girls in Tilak nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esc...Hire↠Young Call Girls in Tilak nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esc...
Hire↠Young Call Girls in Tilak nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esc...Delhi Call girls
 
"Boost Your Digital Presence: Partner with a Leading SEO Agency"
"Boost Your Digital Presence: Partner with a Leading SEO Agency""Boost Your Digital Presence: Partner with a Leading SEO Agency"
"Boost Your Digital Presence: Partner with a Leading SEO Agency"growthgrids
 
VVIP Pune Call Girls Sinhagad WhatSapp Number 8005736733 With Elite Staff And...
VVIP Pune Call Girls Sinhagad WhatSapp Number 8005736733 With Elite Staff And...VVIP Pune Call Girls Sinhagad WhatSapp Number 8005736733 With Elite Staff And...
VVIP Pune Call Girls Sinhagad WhatSapp Number 8005736733 With Elite Staff And...SUHANI PANDEY
 
2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
2nd Solid Symposium: Solid Pods vs Personal Knowledge GraphsEleniIlkou
 
Nanded City ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready ...
Nanded City ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready ...Nanded City ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready ...
Nanded City ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready ...tanu pandey
 
APNIC Updates presented by Paul Wilson at ARIN 53
APNIC Updates presented by Paul Wilson at ARIN 53APNIC Updates presented by Paul Wilson at ARIN 53
APNIC Updates presented by Paul Wilson at ARIN 53APNIC
 
Trump Diapers Over Dems t shirts Sweatshirt
Trump Diapers Over Dems t shirts SweatshirtTrump Diapers Over Dems t shirts Sweatshirt
Trump Diapers Over Dems t shirts Sweatshirtrahman018755
 
VIP Call Girls Pollachi 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Pollachi 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Pollachi 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Pollachi 7001035870 Whatsapp Number, 24/07 Bookingdharasingh5698
 
在线制作约克大学毕业证(yu毕业证)在读证明认证可查
在线制作约克大学毕业证(yu毕业证)在读证明认证可查在线制作约克大学毕业证(yu毕业证)在读证明认证可查
在线制作约克大学毕业证(yu毕业证)在读证明认证可查ydyuyu
 
VIP Model Call Girls Hadapsar ( Pune ) Call ON 9905417584 Starting High Prof...
VIP Model Call Girls Hadapsar ( Pune ) Call ON 9905417584 Starting  High Prof...VIP Model Call Girls Hadapsar ( Pune ) Call ON 9905417584 Starting  High Prof...
VIP Model Call Girls Hadapsar ( Pune ) Call ON 9905417584 Starting High Prof...singhpriety023
 
Sarola * Female Escorts Service in Pune | 8005736733 Independent Escorts & Da...
Sarola * Female Escorts Service in Pune | 8005736733 Independent Escorts & Da...Sarola * Female Escorts Service in Pune | 8005736733 Independent Escorts & Da...
Sarola * Female Escorts Service in Pune | 8005736733 Independent Escorts & Da...SUHANI PANDEY
 
Al Barsha Night Partner +0567686026 Call Girls Dubai
Al Barsha Night Partner +0567686026 Call Girls  DubaiAl Barsha Night Partner +0567686026 Call Girls  Dubai
Al Barsha Night Partner +0567686026 Call Girls DubaiEscorts Call Girls
 
20240510 QFM016 Irresponsible AI Reading List April 2024.pdf
20240510 QFM016 Irresponsible AI Reading List April 2024.pdf20240510 QFM016 Irresponsible AI Reading List April 2024.pdf
20240510 QFM016 Irresponsible AI Reading List April 2024.pdfMatthew Sinclair
 
Ganeshkhind ! Call Girls Pune - 450+ Call Girl Cash Payment 8005736733 Neha T...
Ganeshkhind ! Call Girls Pune - 450+ Call Girl Cash Payment 8005736733 Neha T...Ganeshkhind ! Call Girls Pune - 450+ Call Girl Cash Payment 8005736733 Neha T...
Ganeshkhind ! Call Girls Pune - 450+ Call Girl Cash Payment 8005736733 Neha T...SUHANI PANDEY
 
Real Men Wear Diapers T Shirts sweatshirt
Real Men Wear Diapers T Shirts sweatshirtReal Men Wear Diapers T Shirts sweatshirt
Real Men Wear Diapers T Shirts sweatshirtrahman018755
 
Busty Desi⚡Call Girls in Vasundhara Ghaziabad >༒8448380779 Escort Service
Busty Desi⚡Call Girls in Vasundhara Ghaziabad >༒8448380779 Escort ServiceBusty Desi⚡Call Girls in Vasundhara Ghaziabad >༒8448380779 Escort Service
Busty Desi⚡Call Girls in Vasundhara Ghaziabad >༒8448380779 Escort ServiceDelhi Call girls
 
( Pune ) VIP Baner Call Girls 🎗️ 9352988975 Sizzling | Escorts | Girls Are Re...
( Pune ) VIP Baner Call Girls 🎗️ 9352988975 Sizzling | Escorts | Girls Are Re...( Pune ) VIP Baner Call Girls 🎗️ 9352988975 Sizzling | Escorts | Girls Are Re...
( Pune ) VIP Baner Call Girls 🎗️ 9352988975 Sizzling | Escorts | Girls Are Re...nilamkumrai
 

Dernier (20)

Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRLLucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
 
➥🔝 7737669865 🔝▻ mehsana Call-girls in Women Seeking Men 🔝mehsana🔝 Escorts...
➥🔝 7737669865 🔝▻ mehsana Call-girls in Women Seeking Men  🔝mehsana🔝   Escorts...➥🔝 7737669865 🔝▻ mehsana Call-girls in Women Seeking Men  🔝mehsana🔝   Escorts...
➥🔝 7737669865 🔝▻ mehsana Call-girls in Women Seeking Men 🔝mehsana🔝 Escorts...
 
Hire↠Young Call Girls in Tilak nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esc...
Hire↠Young Call Girls in Tilak nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esc...Hire↠Young Call Girls in Tilak nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esc...
Hire↠Young Call Girls in Tilak nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esc...
 
"Boost Your Digital Presence: Partner with a Leading SEO Agency"
"Boost Your Digital Presence: Partner with a Leading SEO Agency""Boost Your Digital Presence: Partner with a Leading SEO Agency"
"Boost Your Digital Presence: Partner with a Leading SEO Agency"
 
VVIP Pune Call Girls Sinhagad WhatSapp Number 8005736733 With Elite Staff And...
VVIP Pune Call Girls Sinhagad WhatSapp Number 8005736733 With Elite Staff And...VVIP Pune Call Girls Sinhagad WhatSapp Number 8005736733 With Elite Staff And...
VVIP Pune Call Girls Sinhagad WhatSapp Number 8005736733 With Elite Staff And...
 
2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
 
Nanded City ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready ...

Build a Searchable Knowledge Base

  • 1. Build a Searchable Knowledge Base
      Jimmy Lai, Yahoo! Search Engineer
      r97922028 [at] ntu.edu.tw
      2014/05/18
      http://www.slideshare.net/jimmy_lai/build-a-searchable-knowledge-base
  • 2. Outline
      • Introduction to Knowledge Base
      • Construct a Knowledge Base
      • Search the Knowledge Base
          • string match
          • synonym search
          • full text search
          • geo search
          • put all together
      • More Applications
  • 3. Knowledge
      • Knowledge is power. - Francis Bacon, 1597
      • Knowledge is boundless and connected, so an efficient interface to search and browse the knowledge base is essential.
      • Let's try to build a searchable knowledge base.
  • 4. Application of Knowledge Base
      • Personal assistant: Siri, Google Now
      • Search engine: Google's Knowledge Graph
  • 5. Construct a Knowledge Base
      1. Find good data sources.
      2. Aggregate data as knowledge entities.
      3. Construct structured data for each knowledge entity.
      4. Search the knowledge base.
      5. Navigate the knowledge base.
  • 6. Wikipedia
      • A collaborative encyclopedia with more than 30M articles in 287 languages.
      • A good source for a knowledge base. However, Wikipedia's data is not well structured.
      http://www.theguardian.com/technology/blog/2009/aug/13/wikipedia-edits
  • 7. DBpedia
      • http://wiki.dbpedia.org/About
      • Structured data from Wikipedia.
      • A good data source for a knowledge base.
  • 10. What can Python do for us
      • Data wrangling
          • Process the raw text data
          • Aggregate the data from different sources
          • Output data in JSON format
      • Connecting the data flow between systems
          • Automation script for starting services and feeding data
          • REST API implementing the search strategy
  • 11. Example code
      git clone git@github.com:jimmylai/knowledge.git
      https://github.com/jimmylai/knowledge
      • Required Python packages:
          1. fabric
          2. pysolr
          3. django
  • 12. Data Preparation
      1. Download data from DBpedia: http://downloads.dbpedia.org/current/en/
      2. Filter the data down to the specific knowledge entities of interest:
         bzcat instance_types_en.nt.bz2 | get_id_list.py
      3. Parse and aggregate the entity data from the files:

         data file                          script            data field
         short_abstracts_en.nt.bz2          get_abstract.py   abstract
         raw_infobox_properties_en.nt.bz2   get_relation.py   relations
         geo_coordinates_en.nt.bz2          get_geo.py        latlon
         redirects_en.nt.bz2                get_redirect.py   redirects
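As a rough illustration of step 3, the sketch below parses N-Triples lines whose object is a literal (the shape of the abstract dump) and collects the text per entity. It is not the repository's get_abstract.py; the names `parse_literal_triple` and `aggregate_abstracts` and the sample line are assumptions made for this example.

```python
import re

# One N-Triples line with a literal object, e.g.
#   <subject> <predicate> "some text"@en .
TRIPLE_RE = re.compile(
    r'^<([^>]+)>\s+<([^>]+)>\s+"((?:[^"\\]|\\.)*)"'
    r'(?:@[A-Za-z-]+|\^\^<[^>]+>)?\s*\.\s*$'
)


def parse_literal_triple(line):
    """Return (subject, predicate, text), or None for any other kind of line."""
    match = TRIPLE_RE.match(line)
    if match is None:
        return None
    subject, predicate, text = match.groups()
    # Undo the two most common escapes; other escapes are left untouched.
    text = text.replace('\\"', '"').replace('\\\\', '\\')
    return subject, predicate, text


def aggregate_abstracts(lines, wanted_ids):
    """Collect the abstract of every entity whose URI is in wanted_ids."""
    entities = {}
    for line in lines:
        parsed = parse_literal_triple(line)
        if parsed is None:
            continue  # comments, blank lines, non-literal triples
        subject, _predicate, text = parsed
        if subject in wanted_ids:
            entities.setdefault(subject, {})["abstract"] = text
    return entities


# Hypothetical sample input in the shape of short_abstracts_en.nt
sample_lines = [
    "# a comment line that is skipped",
    '<http://dbpedia.org/resource/Lake_Yosemite> '
    '<http://www.w3.org/2000/01/rdf-schema#comment> '
    '"Lake Yosemite is an artificial freshwater lake."@en .',
]
wanted = {"http://dbpedia.org/resource/Lake_Yosemite"}
abstracts = aggregate_abstracts(sample_lines, wanted)
print(abstracts)
```

For the real dump, the lines could be streamed with `bz2.open(path, "rt", encoding="utf-8")` from the standard library instead of the in-memory list used here.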
  • 13. Aggregated Data Format
      "http://dbpedia.org/resource/Lake_Yosemite": {
        "latlon": "37.376389,-120.428889",
        "redirects": [
          "Lake_yosemite"
        ],
        "abstract": "Lake Yosemite is an artificial freshwater lake located approximately five miles (8 km) east of Merced, California in the rolling Sierra Foothills. UC Merced is situated approximately half a mile (0.8 km) south of Lake Yosemite. The university is bounded by the lake on one side and two canals (Fairfield Canal and Le Grand Canal) run through the campus. In 2007, a myth featured in the Mythbusters' James Bond Special 1 episode was filmed and tested at Lake Yosemite.",
        "relations": {
          "type": "http://dbpedia.org/resource/Reservoir",
          "location": "http://dbpedia.org/resource/California"
        }
      }
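A minimal sketch of how an entity in the aggregated format might be flattened into a document ready for indexing. The field names `name`, `abstract`, and `location` follow the schemas shown in the slides; the `id` field and the function names are assumptions for illustration, not the repository's feed_data.py.

```python
import json


def entity_name(uri):
    """Turn a DBpedia resource URI into a readable name (underscores to spaces)."""
    return uri.rsplit("/", 1)[-1].replace("_", " ")


def to_document(uri, entity):
    """Flatten one aggregated entity into a dict of plain fields."""
    doc = {
        "id": uri,  # assumed unique key; the real schema may differ
        "name": entity_name(uri),
        "abstract": entity.get("abstract", ""),
    }
    if "latlon" in entity:
        doc["location"] = entity["latlon"]  # kept as a "lat,lon" string
    return doc


aggregated = {
    "http://dbpedia.org/resource/Lake_Yosemite": {
        "latlon": "37.376389,-120.428889",
        "redirects": ["Lake_yosemite"],
        "abstract": "Lake Yosemite is an artificial freshwater lake located "
                    "approximately five miles (8 km) east of Merced, California "
                    "in the rolling Sierra Foothills.",
        "relations": {
            "type": "http://dbpedia.org/resource/Reservoir",
            "location": "http://dbpedia.org/resource/California",
        },
    }
}

documents = [to_document(uri, entity) for uri, entity in aggregated.items()]
print(json.dumps(documents, indent=2))
```

With pysolr installed and a Solr core running, a list like `documents` could then be sent with `pysolr.Solr(core_url).add(documents)`.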
  • 14. Search by Solr
      • Solr is a full-text, real-time search engine based on Apache Lucene.
      • Provides a REST-like API.
      • pysolr makes Solr easy to use from Python.
      • Download the latest version, 4.8.0, from http://www.apache.org/dyn/closer.cgi/lucene/solr/4.8.0 and extract it to the solr/solr-4.8.0 directory.
      • Start the Solr server and then check the web UI:
        fab start_solr
        http://localhost:8983/solr/
  • 15. Search - String Match
      • To be able to search by entity name:
        python feed_data.py string_match
      • config: solr/conf/string_match/schema.xml
        <field name="name" type="string" indexed="true" stored="true" multiValued="false"/>
        <field name="abstract" type="string" indexed="false" stored="true" multiValued="false"/>
      • Feed the entities to Solr. Each entity has name and abstract fields.
  • 16. Search - String Match
      Search by entity name:
      http://localhost:8983/solr/string_match/select?q=name%3A%22San+Francisco%22&wt=json&indent=true
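The query URL above can be assembled with the standard library rather than typed by hand. This is a small sketch; the helper name `select_url` and its default base URL are assumptions for the example.

```python
from urllib.parse import urlencode


def select_url(core, field, value, base="http://localhost:8983/solr"):
    """Build a Solr /select URL that matches `value` exactly in `field`."""
    # Escape backslashes and double quotes so the value stays one quoted string.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    params = {"q": f'{field}:"{escaped}"', "wt": "json", "indent": "true"}
    return f"{base}/{core}/select?{urlencode(params)}"


url = select_url("string_match", "name", "San Francisco")
print(url)
```

Because the `name` field has type `string`, the value is matched as a whole, so only an entity whose name matches the quoted value exactly is returned.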
  • 17. Search - Synonym
      • To be able to search by a synonym of the entity name:
        python feed_data.py synonym_string_match
      • config: solr/conf/synonym_string_match/schema.xml
        <field name="name" type="name_text" indexed="true" stored="true" multiValued="false"/>
        <fieldType name="name_text" class="solr.TextField" positionIncrementGap="100">
          <analyzer type="index">
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
            <filter class="solr.LowerCaseFilterFactory"/>
          </analyzer>
          …
      • Restart the Solr server to reload the synonym file.
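One plausible way to produce synonyms.txt is to treat each entity's DBpedia redirects as synonyms of its name. The slides do not show how the file is generated, so this is an assumption, as are the function name and the sample redirects. Each output line uses the comma-separated equivalence format that Solr's synonym filter reads.

```python
def synonym_lines(entities):
    """Return one 'name, synonym, ...' line per entity that has synonyms."""
    lines = []
    for uri, entity in sorted(entities.items()):
        names = [uri.rsplit("/", 1)[-1]] + list(entity.get("redirects", []))
        terms = []
        for name in names:
            term = name.replace("_", " ").strip()
            # Skip terms containing characters that are syntax in the file.
            if not term or "," in term or "=>" in term or term.startswith("#"):
                continue
            # ignoreCase="true" makes case variants redundant, so drop them.
            if term.lower() not in [t.lower() for t in terms]:
                terms.append(term)
        if len(terms) > 1:
            lines.append(", ".join(terms))
    return lines


# Hypothetical redirects, for illustration only
sample_entities = {
    "http://dbpedia.org/resource/San_Francisco": {"redirects": ["SF", "San_Fran"]},
    "http://dbpedia.org/resource/Lake_Yosemite": {"redirects": ["Lake_yosemite"]},
}
lines = synonym_lines(sample_entities)
print("\n".join(lines))
```

Lake Yosemite produces no line here because its only redirect differs from the name by case alone.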
  • 18. Synonym handling at index time
  • 19. Synonym handling at query time
  • 21. Search - Full Text Search
      • To be able to search across the full text of each entity:
        python feed_data.py full_text_search
      • config: solr/conf/full_text_search/schema.xml
        <copyField source="name" dest="text"/>
        <copyField source="abstract" dest="text"/>
      • Feed the entities to Solr. Each name and abstract field is copied to the text field. After that we can do full text search without specifying which field to search.
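To make the effect of the two copyField rules concrete, here is a toy emulation in plain Python. It only illustrates the idea of a catch-all `text` field; Solr's real analysis and ranking are far richer, and the function names and the all-words-must-match rule are simplifying assumptions.

```python
import re


def copy_fields(doc, sources=("name", "abstract"), dest="text"):
    """Return a copy of doc with the source values gathered into dest."""
    copied = dict(doc)
    copied[dest] = [doc[source] for source in sources if source in doc]
    return copied


def tokens(value):
    """Lowercased word tokens of a string."""
    return set(re.findall(r"\w+", value.lower()))


def matches(doc, query, field="text"):
    """True if every query word occurs somewhere in the given field."""
    field_tokens = set()
    for value in doc.get(field, []):
        field_tokens |= tokens(value)
    return tokens(query) <= field_tokens


doc = copy_fields({
    "name": "Lake Yosemite",
    "abstract": "An artificial freshwater lake east of Merced, California.",
})
print(matches(doc, "yosemite"))     # word from the name
print(matches(doc, "merced lake"))  # words from the abstract
print(matches(doc, "Nevada"))       # word in neither field
```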
  • 22. Search - Full Text Search
  • 23. Search - Geo Search
      • To be able to search by distance from a given location:
        python feed_data.py geo_search
      • config: solr/conf/geo_search/schema.xml
        <field name="location" type="location" indexed="true" stored="true" required="false" multiValued="false" />
      • Feed the entities to Solr. Each entity contains a location field in a format like "51.670100,-3.230100".
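A distance filter on the location field can be expressed with Solr's `geofilt` query parser, which takes the field (`sfield`), the center point (`pt`), and the radius in kilometers (`d`). The helper below is a sketch that only builds the request parameters; its name and the range check are assumptions for this example.

```python
def geo_filter_params(lat, lon, distance_km, field="location"):
    """Build Solr parameters that keep documents within distance_km of a point."""
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        raise ValueError("latitude/longitude out of range")
    return {
        "fq": "{!geofilt}",               # filter query using the geofilt parser
        "sfield": field,                  # spatial field to filter on
        "pt": f"{lat:.6f},{lon:.6f}",     # center point as "lat,lon"
        "d": str(distance_km),            # radius in kilometers
    }


params = geo_filter_params(51.6701, -3.2301, 10)
print(params)
```

These parameters can be merged into the parameters of any of the earlier queries, so the geo filter narrows the results of a synonym or full text search.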
  • 25. Search - Put All Together
      • Search strategy
          1. Input a query
          2. Search by synonym match
          3. Search by full text
          4. If a location is given, filter the results by geo search
      • Implement the search strategy as an API
  • 26. Implement the search strategy in a Django view
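The strategy can be sketched independently of Django as a function that takes the individual searches as callables. This is not the repository's view code. It assumes full text search is a fallback used only when the synonym match returns nothing, and the stub backends and their sample data are purely illustrative.

```python
def search(query, synonym_search, full_text_search, geo_filter=None, location=None):
    """Synonym match first, full text as a fallback, then an optional geo filter."""
    results = synonym_search(query)
    if not results:
        results = full_text_search(query)
    if location is not None and geo_filter is not None:
        results = geo_filter(results, location)
    return results


# Stub backends standing in for the Solr cores (hypothetical data).
ENTITIES = [
    {"name": "San Francisco", "synonyms": ["SF"],
     "abstract": "A city in California.", "location": (37.7749, -122.4194)},
    {"name": "Lake Yosemite", "synonyms": [],
     "abstract": "An artificial lake near Merced, California.",
     "location": (37.376389, -120.428889)},
]


def stub_synonym_search(query):
    q = query.lower()
    return [e for e in ENTITIES
            if q == e["name"].lower() or q in [s.lower() for s in e["synonyms"]]]


def stub_full_text_search(query):
    q = query.lower()
    return [e for e in ENTITIES
            if q in e["name"].lower() or q in e["abstract"].lower()]


def stub_geo_filter(results, location, max_degrees=1.0):
    # Crude bounding box in degrees; a real backend would use true distance.
    lat, lon = location
    return [e for e in results
            if abs(e["location"][0] - lat) <= max_degrees
            and abs(e["location"][1] - lon) <= max_degrees]


by_synonym = search("SF", stub_synonym_search, stub_full_text_search)
by_text = search("california", stub_synonym_search, stub_full_text_search)
near_merced = search("california", stub_synonym_search, stub_full_text_search,
                     geo_filter=stub_geo_filter, location=(37.4, -120.4))
print([e["name"] for e in by_synonym])
print([e["name"] for e in by_text])
print([e["name"] for e in near_merced])
```

In a Django view, the query and the optional location would come from the request parameters and the result list would be returned as JSON.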
  • 28. Review
      • A knowledge base with synonym, full-text, and geo search APIs.
      • The knowledge entities are connected by relations.
  • 29. More Applications
      • Question answering system:
          1. Query analysis: identify the intention (e.g. looking for a specific type of entity)
          2. Search in the knowledge base
          3. Return the knowledge entity
  • 30. Modern search engines don't just provide web page URLs. They provide direct answers to users.
  • 31. More Data Sources and Knowledge Entities
      • Open Data
      • Open APIs
  • 32. My Life in
      • Build online services for billions of users.
      • Big data mining on cloud infrastructures.
      • Open and innovative working environment.
      • International teamwork and English communication.
      • Business trips to Silicon Valley.
      • Send me your resume if you need a referral: r97922028 [at] ntu.edu.tw