Introduction to
Information Retrieval
Carsten Eickhoff
What is it all about?
1 Introduction
• Google and other Web search providers have been on
the rise for more than 10 years
• They are advancing into the role of
“media all-rounders”
• On Coursera, you have been learning about
the range of products offered by Google
• In this lecture, we will try to understand how
these services are made possible
IR Applications
• Many on-line activities involve IR technology
– Media consumption (music, videos, text, …)
– News tracking
– Social networking
– Online shopping
– Advertisement
– Mobile communication
Agenda
1. Introduction
2. IR Technology
a) Crawling
b) Indexing & Storage
c) Retrieval
d) Data-driven Technologies
3. IR for Children
4. Recommended Reading
Section 2

IR TECHNOLOGY
Searching and Term Matching

• Search engines do not understand queries!
• They only find things that are similar
Overview Web Search
[Diagram: Web, Query, Index, Documents, Result Set]
Section 2.1

CRAWLING
How does the Engine know
what is out there?
Spiders, Crawlers, Robots…
• Search engine companies send hordes of lightweight programs through the web
• Whenever they encounter a new or updated
Web page, they save it.
• From there, they follow the page’s outgoing
links to discover new content
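The save-then-follow-links loop described above can be sketched as a breadth-first traversal. This is only an illustration under stated assumptions: `TOY_WEB`, the `.example` URLs and the `crawl` function are hypothetical, and the "Web" is an in-memory dictionary rather than real HTTP traffic.

```python
from collections import deque

# Hypothetical in-memory "Web": each URL maps to the links found on that page.
TOY_WEB = {
    "a.example/": ["a.example/news", "b.example/"],
    "a.example/news": ["a.example/"],
    "b.example/": ["c.example/"],
    "c.example/": [],
}


def crawl(seed, fetch_links):
    """Breadth-first crawl: save each new page, then follow its outgoing links."""
    frontier = deque([seed])  # pages discovered but not yet visited
    seen = {seed}             # every URL that was ever put on the frontier
    saved = []                # visit order stands in for "saving" the page
    while frontier:
        url = frontier.popleft()
        saved.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return saved


order = crawl("a.example/", lambda url: TOY_WEB.get(url, []))
print(order)
```

The `seen` set is what keeps the crawler from looping forever on pages that link back to each other.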
The Crawling Process
Crawling the Web
• Web crawling is an ongoing process
• To keep up with the pace at which information
on the Web changes, pages are re-crawled
continuously
• Search engine providers have to make sure
that their crawlers are polite (i.e., do not place
a high or peaking load on web servers)
The Ethics of Saving Everything…
• Large shares of the Web are of private or
proprietary nature
• Crawlers should not visit those
• “robots.txt” regulates simple access rules for
each Web server
Access Policy Examples
• Examples of robot policies in robots.txt
Example 1:
User-agent: *
Disallow: /cyberworld/map/
User-agent: cybermapper
Disallow:

Example 2:
User-agent: *
Disallow: /
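As a rough check of how a crawler might read these two policies, the sketch below feeds them to Python's standard `urllib.robotparser` module. The bot name `somebot` and the `www.example.com` URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

EXAMPLE_1 = """\
User-agent: *
Disallow: /cyberworld/map/
User-agent: cybermapper
Disallow:
"""

EXAMPLE_2 = """\
User-agent: *
Disallow: /
"""


def load_policy(robots_txt):
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


policy_1 = load_policy(EXAMPLE_1)
policy_2 = load_policy(EXAMPLE_2)

# Hypothetical URLs on the server that published the policy.
MAP_URL = "http://www.example.com/cyberworld/map/index.html"
HOME_URL = "http://www.example.com/index.html"

print(policy_1.can_fetch("somebot", MAP_URL))      # generic robots are kept out of the map
print(policy_1.can_fetch("cybermapper", MAP_URL))  # the empty Disallow lets cybermapper in
print(policy_2.can_fetch("somebot", HOME_URL))     # "Disallow: /" blocks the whole site
```

Example 1 keeps all robots out of `/cyberworld/map/` except `cybermapper`; Example 2 excludes every robot from the whole server.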
The Hidden Web
The Deep Web
• Similar to the hidden Web, pages that only
become available after user interaction
• E.g., product pages in a Web shop
• Some crawlers can fill in information in forms
and explore the underlying pages
• But be careful, or your crawler buys a cruise
ship tour…
Section 2.2

INDEXING & STORAGE
Index Construction
• Naïve Approach:
– Make a copy of each page
– When a query comes in, look for the terms in all
saved pages

• But wait, is that clever?
– We need to answer queries in half a second…
Inverted Index
• Turn the problem around
• Build an inverted index that tells us which
documents contain each term
• At query time, we only consider those
documents that contain at least one query
term
Example: Inverted Index
Doc  he  drink  ink  likes  pink  thing  wink  Text
D1   2   1      0    2      0     0      1     He likes to wink, he likes to drink.
D2   1   3      0    1      0     0      0     He likes to drink, and drink, and drink.
D3   1   1      1    1      0     1      0     The thing he likes to drink is ink.
D4   1   1      1    1      1     0      0     The ink he likes to drink is pink.
D5   1   1      1    1      1     0      1     He likes to wink and drink pink ink.
Example: Sparse Index
• Most fields of the matrix would be empty
• So we store it as a sparse matrix
he:     D1:2  D2:1  D3:1  D4:1  D5:1
ink:    D3:1  D4:1  D5:1
pink:   D4:1  D5:1
thing:  D3:1
wink:   D1:1  D5:1
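A minimal sketch of how such a sparse index could be built for the five example sentences, and how it narrows a query down to candidate documents. The function names and the simple lowercase tokenizer are assumptions; the sketch also indexes every word, not only the terms shown on the slide.

```python
import re
from collections import Counter, defaultdict

DOCS = {
    "D1": "He likes to wink, he likes to drink.",
    "D2": "He likes to drink, and drink, and drink.",
    "D3": "The thing he likes to drink is ink.",
    "D4": "The ink he likes to drink is pink.",
    "D5": "He likes to wink and drink pink ink.",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic terms."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(docs):
    """Map each term to {doc_id: term frequency}; absent documents are not stored."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][doc_id] = count
    return index


def candidates(index, query):
    """Documents that contain at least one query term."""
    hits = set()
    for term in tokenize(query):
        hits.update(index.get(term, {}))
    return hits


index = build_index(DOCS)
print(index["he"])
print(sorted(candidates(index, "pink thing")))
```

Only the postings of the query terms are touched at query time, so the cost depends on how common those terms are rather than on the size of the collection.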
Positional Indices
• The inverted index allows us to know which
terms are in which documents
• But we lose position information
• Term proximity may be important
Query: “Square dance”
Doc 1: “…Times Square in New York is often host to dance performances...”
Doc 2: “… Square dance is a dance for four couples…”
Example: Positional Index
• To preserve position information, we store
each term occurrence as a tuple <doc,pos>
he:     D1:1  D1:5  D2:1  D3:3  D4:3  D5:1
ink:    D3:8  D4:2  D5:8
pink:   D4:8  D5:7
thing:  D3:2
wink:   D1:4  D5:4
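The positional postings above can be reproduced with a few lines of code. The sketch below also shows why positions matter, using a simple two-word phrase check. `build_positional_index` and `phrase_match` are illustrative names, and positions are counted from 1 as on the slide.

```python
import re
from collections import defaultdict

DOCS = {
    "D1": "He likes to wink, he likes to drink.",
    "D2": "He likes to drink, and drink, and drink.",
    "D3": "The thing he likes to drink is ink.",
    "D4": "The ink he likes to drink is pink.",
    "D5": "He likes to wink and drink pink ink.",
}


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def build_positional_index(docs):
    """Map each term to a list of (doc_id, position) tuples, positions starting at 1."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for position, term in enumerate(tokenize(text), start=1):
            index[term].append((doc_id, position))
    return index


def phrase_match(index, first, second):
    """Documents in which `second` occurs directly after `first`."""
    first_hits = set(index.get(first, []))
    return sorted(
        {doc for doc, pos in index.get(second, []) if (doc, pos - 1) in first_hits}
    )


pos_index = build_positional_index(DOCS)
print(pos_index["he"])
print(phrase_match(pos_index, "pink", "ink"))
```

Both D4 and D5 contain "pink" and "ink", but only D5 contains them as the adjacent phrase "pink ink", which a plain inverted index could not tell apart.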
What if the index becomes too large?
Distributed Indices
• If we cannot fit the index on a single machine
any more, we split it up
• Such distributed architectures introduce new
problems
– Crawling
– Indexing
– Retrieval
Map Reduce
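One way to picture MapReduce-style index construction is the single-process sketch below. It only imitates the three phases (map, shuffle, reduce) in memory. The two shards and all function names are hypothetical, and a real deployment would run the phases on many machines.

```python
import re
from collections import defaultdict
from itertools import chain

# Two hypothetical shards of the collection, as if stored on different machines.
SHARD_A = {
    "D1": "He likes to wink, he likes to drink.",
    "D2": "He likes to drink, and drink, and drink.",
}
SHARD_B = {
    "D3": "The thing he likes to drink is ink.",
    "D4": "The ink he likes to drink is pink.",
    "D5": "He likes to wink and drink pink ink.",
}


def map_document(doc_id, text):
    """Map step: emit one (term, doc_id) pair per token."""
    return [(term, doc_id) for term in re.findall(r"[a-z]+", text.lower())]


def shuffle(pairs):
    """Group the emitted doc ids by term, as the framework does between the phases."""
    grouped = defaultdict(list)
    for term, doc_id in pairs:
        grouped[term].append(doc_id)
    return grouped


def reduce_term(term, doc_ids):
    """Reduce step: collapse the doc ids of one term into a sorted postings list."""
    return term, sorted(set(doc_ids))


mapped = chain.from_iterable(
    map_document(doc_id, text)
    for shard in (SHARD_A, SHARD_B)
    for doc_id, text in shard.items()
)
postings = dict(reduce_term(term, ids) for term, ids in shuffle(mapped).items())
print(postings["wink"])
```

Each map call needs only one document and each reduce call only one term, which is what allows the work to be spread across machines.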
Section 2.3

RETRIEVAL
Boolean Retrieval
• The early days (1960s to 1980s)
• Searching in highly specialized collections
– Legal or medical documents
– Newspaper archives

• Searcher is a trained professional
Boolean Queries
• Complex queries in expert syntax describe
the information need (e.g., Westlaw)
Information need:
Information on the legal theories involved in preventing the disclosure of
trade secrets by employees formerly employed by a competing company.
Query:
"trade secret" /s disclos! /s prevent /s employe!
Information need:
Requirements for disabled people to be able to access a workplace.
Query:
disab! /p access! /s work-site work-place (employment /3 place)
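Underneath the expert syntax, Boolean retrieval comes down to set operations on postings. The sketch below shows only plain AND, OR and NOT on made-up postings; it does not implement Westlaw's proximity (`/s`, `/p`, `/3`) or truncation (`!`) operators.

```python
# Hypothetical postings: term -> set of ids of the documents that contain it.
POSTINGS = {
    "trade": {1, 2, 4},
    "secret": {1, 4},
    "disclosure": {1, 3},
    "patent": {2, 4},
}
ALL_DOCS = {1, 2, 3, 4}


def docs_with(term):
    """Postings of a term; unknown terms match no document."""
    return POSTINGS.get(term, set())


# trade AND secret AND NOT patent
result = docs_with("trade") & docs_with("secret") & (ALL_DOCS - docs_with("patent"))

# disclosure OR patent
either = docs_with("disclosure") | docs_with("patent")

print(result)
print(either)
```

The result is a plain set: a document either matches or it does not, and nothing says which match is better.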
Boolean Retrieval Summary
+
• Intuitive
• Gives searcher direct control over the retrieval process
-
• High cognitive load during query formulation and result set exploration
• No ordering to result sets
• Requires expert syntax
Ranked Retrieval
• Introduced in the early 1990s
• Returns documents in their order of relevance
• New problem: How to determine relevance?
• Retrieval model computes one relevance score
per document and sorts accordingly
Vector Space Models
• Geometric model
• Idea:
– Relevant documents are similar to the query
– The more similar they are, the higher they are
ranked
Vector Space Mappings
• Translate text documents into vectors
Document Text:
“IT WAS the best of times, it was the worst of times…”
Term:                    it  was  the  best  of  times  worst  cat
Binary Vector:           1   1    1    1     1   1      1      0
Frequency-based Vector:  2   2    2    1     2   2      1      0
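The two mappings can be sketched in a few lines. The vocabulary order follows the slide; the function names and the lowercase tokenizer are assumptions made for illustration.

```python
import re
from collections import Counter

VOCABULARY = ["it", "was", "the", "best", "of", "times", "worst", "cat"]
TEXT = "IT WAS the best of times, it was the worst of times"


def frequency_vector(text, vocabulary):
    """One component per vocabulary term: how often the term occurs in the text."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[term] for term in vocabulary]


def binary_vector(text, vocabulary):
    """One component per vocabulary term: 1 if the term occurs at all, else 0."""
    return [1 if count > 0 else 0 for count in frequency_vector(text, vocabulary)]


print(binary_vector(TEXT, VOCABULARY))
print(frequency_vector(TEXT, VOCABULARY))
```

Every document is mapped against the same vocabulary, so all vectors have the same length and can be compared component by component.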
Documents in Space
• Each vector is a point in a high-dimensional
space
• We tend to stop drawing at 3 dimensions, but
10,000 are also no problem mathematically ;-)
• Recall: Our document index already looks like
a vector!
Distances in Vector Spaces
• Similar documents lie close together
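[Figure: three documents plotted in a two-dimensional space with axes “cat” and “dog”, at (1,2), (1,1) and (2,0). Distance between (1,2) and (1,1): δ = 1; distance between (1,1) and (2,0): δ = 1.41]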
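The δ values in the figure are consistent with the Euclidean distance between the document vectors. A short sketch, assuming that measure (the helper name `euclidean` is illustrative):

```python
import math


def euclidean(p, q):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


d_near = euclidean((1, 2), (1, 1))
d_far = euclidean((1, 1), (2, 0))
print(d_near, round(d_far, 2))
```

The same function works unchanged for 10,000 dimensions, because it simply iterates over all components.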
Probabilistic Models
• Alternatively, use probabilistic methods
• For each document, compute the probability
of relevance to the query
• Rank by probability
Building Document Models
• Build a language model of each document
• Count term frequencies and divide by
document length
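[Figure: one term-frequency histogram per document over the terms he, drink, ink, likes, pink, thing, wink, the, to, and, is, for the following three documents]
He likes to wink, he likes to drink.
The thing he likes to drink is ink.
He likes to wink and drink pink ink.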
Ranking Documents
• Compute the likelihood of the query being
generated by the document models
• Order documents by decreasing probability
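A minimal sketch of this ranking, using unsmoothed language models (term frequency divided by document length) for three of the example sentences. The labels D1, D3 and D5, the function names and the query are assumptions. Practical systems add smoothing so that a single unseen query term does not force the score to zero, which this sketch leaves out.

```python
import re
from collections import Counter

DOCS = {
    "D1": "He likes to wink, he likes to drink.",
    "D3": "The thing he likes to drink is ink.",
    "D5": "He likes to wink and drink pink ink.",
}


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def language_model(text):
    """P(term | document) = term frequency / document length (no smoothing)."""
    tokens = tokenize(text)
    return {term: count / len(tokens) for term, count in Counter(tokens).items()}


def query_likelihood(model, query):
    """Probability that the document model generates every query term."""
    likelihood = 1.0
    for term in tokenize(query):
        likelihood *= model.get(term, 0.0)  # unseen term: probability 0
    return likelihood


def rank(docs, query):
    """(likelihood, doc_id) pairs ordered by decreasing likelihood."""
    scored = [
        (query_likelihood(language_model(text), query), doc_id)
        for doc_id, text in docs.items()
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


ranking = rank(DOCS, "he wink")
print(ranking)
```

For the query "he wink", D1 ranks first because "he" occurs twice in its eight words, and D3 drops to zero because it never mentions "wink".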
PageRank
• Google’s fabled original ranking criterion

• Based on the idea that page authoritativeness
should be rewarded during ranking
• Authoritativeness scores are iteratively
propagated along hyperlinks
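The iterative propagation can be sketched as a simple power iteration. The four-page link graph, the damping factor of 0.85 and the fixed number of iterations are assumptions chosen for illustration, not values taken from the lecture.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively pass each page's score along its outgoing links."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small base score regardless of its links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            # A page without outgoing links spreads its score over all pages.
            targets = outgoing if outgoing else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

scores = pagerank(LINKS)
print(sorted(scores, key=scores.get, reverse=True))
```

Page C collects links from A, B and D and ends up with the highest score, while D, which no page links to, keeps only the base score.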
Example: PageRank
Section 2.4

DATA-DRIVEN TECHNOLOGIES
Everyone wants your Data.
But what do they do with it?
Search Personalization
• People have very specific preferences and
interests (Topics, language, textual complexity,
location, etc.)
• Based on your previous search history, we can
find out about these preferences
• For future searches, the engine tries to
optimize for the newly found preference
Query Suggestion

• Query suggestions try to help people
formulate their information needs
• They are selected on the basis of frequent
queries that extend the current query terms
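A minimal sketch of frequency-based suggestion: count the logged queries that extend what the user has typed so far and offer the most frequent ones. The query log and the `suggest` function are made up for illustration.

```python
from collections import Counter

# Hypothetical query log, one entry per submitted query.
QUERY_LOG = [
    "information retrieval",
    "information retrieval book",
    "information retrieval book",
    "information retrieval lecture",
    "information theory",
    "inverted index",
]


def suggest(typed, log, k=3):
    """The k most frequent logged queries that extend the typed text."""
    counts = Counter(q for q in log if q.startswith(typed) and q != typed)
    return [query for query, _ in counts.most_common(k)]


suggestions = suggest("information retrieval", QUERY_LOG)
print(suggestions)
```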
Spelling Correction
Data-driven Spelling Correction
• Idea: Consensus is strong / errors are random
Spelling            Frequency
albert einstein     4834
albert einstien     525
albert einstine     149
albert einsten      27
albert einsteins    25
albert einstain     11
albert einstin      10
albert eintein      9
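The consensus idea can be sketched as: among logged spellings that are close to what the user typed, pick the most frequent one. The frequencies are those listed above. Using Levenshtein edit distance with a threshold of 2 to decide what counts as "close" is an assumption made for this sketch.

```python
SPELLING_COUNTS = {
    "albert einstein": 4834,
    "albert einstien": 525,
    "albert einstine": 149,
    "albert einsten": 27,
    "albert einsteins": 25,
    "albert einstain": 11,
    "albert einstin": 10,
    "albert eintein": 9,
}


def edit_distance(a, b):
    """Levenshtein distance, computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


def correct(query, counts, max_distance=2):
    """Most frequent logged spelling within max_distance edits of the query."""
    nearby = [s for s in counts if edit_distance(query, s) <= max_distance]
    return max(nearby, key=counts.get) if nearby else query


print(correct("albert einstien", SPELLING_COUNTS))
```

Because individual typos are scattered over many rare variants, the correct spelling wins by a wide margin without any dictionary.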
Dangers: Welcome to the Filter Bubble
• Data-driven methods are very powerful
• But what happens to the niche information
need?
• What if I want to see that video that nobody
else likes?
• Diversification techniques can help but there
is a danger of drowning in the mainstream
Section 3

IR FOR CHILDREN
Meet the Users
The PuppyIR Project
• European Union research project on child-friendly information access
• Investigate the specific needs of children
• Create an open-source framework to cater for
these needs
• 5 Universities, 3 Business partners
The Internet – A Place full of Kids
• Children interact with the Internet at a young
age (~ 4 years old)
• They spend more time online
• ~ 40% of British 10-year-olds have regular
unsupervised Internet access
The Internet – A Place for Kids?
• Many media platforms are mirrors of popular
culture
• Large parts of their content might not be
suitable for children
• Most IR systems are designed with adult users
in mind, not children
How to learn what Children Need?
• Literature study
• Surveys
– 300 US parents and teachers

• User studies
– 49 Dutch elementary school children

• Query log analyses
– Thousands of young Yahoo users
Children’s Deficits
• They type / spell badly
• They browse rather than search
• They are bad at keywording
• They struggle with search interfaces
• Everything is relevant
• Complex content is a challenge
Web Page Classification
• Can we automatically determine whether a
web site is suitable for children?
– Based on its content
– Based on its link neighborhood
Video Classification
Content Simplification
• Sometimes we do not want to completely
exclude certain documents
• But they still contain complex language that is
hard to grasp for kids
• In these cases, we can automatically offer
simplifications for difficult/technical terms
Demo: Museon/Gemeentemuseum
Demo: Emma Kinderziekenhuis
Section 4

RECOMMENDED READING
Introduction to IR

• Manning, Raghavan, Schütze
• Available online at: http://nlp.stanford.edu/IR-book/
• All-in-one introduction
Modern Information Retrieval
• Baeza-Yates, Ribeiro-Neto
• Good overview of
basic topics
Data Mining
• Witten, Frank
• Data mining and pattern
recognition essentials
Managing Gigabytes
• Witten, Moffat, Bell
• What to do when your data
gets large?
Speech and Language Processing
• Jurafsky, Martin
• Comprehensive overview
of speech and language
technology

Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfagholdier
 
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...PsychoTech Services
 
fourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingfourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingTeacherCyreneCayanan
 

Dernier (20)

General AI for Medical Educators April 2024
General AI for Medical Educators April 2024General AI for Medical Educators April 2024
General AI for Medical Educators April 2024
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajan
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room service
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
Advance Mobile Application Development class 07
Advance Mobile Application Development class 07Advance Mobile Application Development class 07
Advance Mobile Application Development class 07
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
 
fourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingfourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writing
 

Introduction to Information Retrieval

  • 2. What is it all about?
  • 3. 1 Introduction • Google and other Web search providers have been on the rise for more than 10 years • They are increasingly taking on the role of “media all-rounders” • On Coursera, you have been learning about the range of products offered by Google • In this lecture, we will try to understand how these services are made possible
  • 4. IR Applications • Many on-line activities involve IR technology – Media consumption (music, videos, text, …) – News tracking – Social networking – Online shopping – Advertisement – Mobile communication
  • 5. Agenda 1. Introduction 2. IR Technology a) Crawling b) Indexing & Storage c) Retrieval d) Data-driven Technologies 3. IR for Children 4. Recommended Reading
  • 7. Searching and Term Matching • Search engines do not understand queries! • They only find things that are similar
  • 10. How does the Engine know what is out there?
  • 12. Spiders, Crawlers, Robots… • Search engine companies send hordes of lightweight programs through the web • Whenever they encounter a new or updated Web page, they save it. • From there, they follow the page’s outgoing links to discover new content
  • 19. Crawling the Web • Web crawling is an ongoing process • To keep up with the pace at which information on the Web changes, pages are re-crawled continuously • Search engine providers have to make sure that their crawlers are polite (i.e., do not place a high or peaking load on web servers)
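The crawl loop described on the slides above (save a page, follow its outgoing links, stay polite) can be sketched as a breadth-first traversal. This is only an illustration under stated assumptions: the dictionary `TOY_WEB` is a made-up stand-in for the real Web so that the example runs offline, and `delay_seconds` is a hypothetical parameter that stands in for a politeness policy.

```python
import time
from collections import deque

# Hypothetical toy Web: each URL maps to the list of its outgoing links.
TOY_WEB = {
    "a.example/": ["a.example/news", "b.example/"],
    "a.example/news": ["a.example/"],
    "b.example/": ["c.example/", "a.example/news"],
    "c.example/": [],
}

def crawl(seed, web, delay_seconds=0.0):
    """Breadth-first crawl from `seed`; returns pages in the order they were saved."""
    frontier = deque([seed])   # pages discovered but not yet visited
    seen = {seed}              # avoids queueing the same page twice
    saved = []
    while frontier:
        url = frontier.popleft()
        links = web.get(url, [])    # stands in for an HTTP fetch of the page
        saved.append(url)           # "save" the page
        for link in links:          # follow outgoing links to discover new content
            if link not in seen:
                seen.add(link)
                frontier.append(link)
        time.sleep(delay_seconds)   # politeness: pause between requests
    return saved

order = crawl("a.example/", TOY_WEB)
print(order)
```

A continuously running crawler would additionally re-queue pages it has already seen so that updated content is picked up; that part is left out here.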
  • 20. The Ethics of Saving Everything… • Large shares of the Web are of private or proprietary nature • Crawlers should not visit those • “robots.txt” regulates simple access rules for each Web server
  • 21. Access Policy Examples • Examples of robot policies in robots.txt
      Example 1:
        User-agent: *
        Disallow: /cyberworld/map/
        User-agent: cybermapper
        Disallow:
      Example 2:
        User-agent: *
        Disallow: /
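To see how a crawler might interpret the two example policies, the sketch below feeds them to `urllib.robotparser` from the Python standard library. The crawler name `somebot` and the `www.example.com` URLs are placeholders chosen for illustration; they are not part of the slides.

```python
from urllib.robotparser import RobotFileParser

# Example 1: everyone is kept out of /cyberworld/map/, except "cybermapper",
# whose empty Disallow line means "nothing is disallowed".
EXAMPLE_1 = """\
User-agent: *
Disallow: /cyberworld/map/

User-agent: cybermapper
Disallow:
"""

# Example 2: no crawler may visit anything on this server.
EXAMPLE_2 = """\
User-agent: *
Disallow: /
"""

def load_policy(robots_txt):
    """Parse robots.txt content given as a string."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

policy_1 = load_policy(EXAMPLE_1)
policy_2 = load_policy(EXAMPLE_2)

MAP_URL = "http://www.example.com/cyberworld/map/index.html"
HOME_URL = "http://www.example.com/index.html"

print(policy_1.can_fetch("somebot", MAP_URL))      # blocked by the "*" rule
print(policy_1.can_fetch("cybermapper", MAP_URL))  # explicitly allowed
print(policy_1.can_fetch("somebot", HOME_URL))     # not covered by any Disallow
print(policy_2.can_fetch("somebot", HOME_URL))     # everything is disallowed
```

Note that robots.txt is a convention: it only works if the crawler chooses to check it before fetching.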
  • 23. The Deep Web • Similar to the hidden Web, pages that only become available after user interaction • E.g., product pages in a Web shop • Some crawlers can fill in information in forms and explore the underlying pages • But be careful, or your crawler buys a cruise ship tour…
  • 26. Index Construction • Naïve Approach: – Make a copy of each page – When a query comes in, look for the terms in all saved pages • But wait, is that clever? – We need to answer queries in half a second…
  • 27. Inverted Index • Turn the problem around • Build an inverted index that tells us which documents contain each query term • At query time, we only consider those documents that contain at least one query term
  • 28. Example: Inverted Index
      he  drink  ink  likes  pink  thing  wink   Document
      2   1      0    2      0     0      1      He likes to wink, he likes to drink.
      1   3      0    1      0     0      0      He likes to drink, and drink, and drink.
      1   1      1    1      0     1      0      The thing he likes to drink is ink.
      1   1      1    1      1     0      0      The ink he likes to drink is pink.
      1   1      1    1      1     0      1      He likes to wink and drink pink ink.
  • 29. Example: Sparse Index • Most fields of the matrix would be empty • So we store it as a sparse matrix
      he     D1:2  D2:1  D3:1  D4:1  D5:1
      ink    D3:1  D4:1  D5:1
      pink   D4:1  D5:1
      thing  D3:1
      wink   D1:1  D5:1
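The sparse inverted index for the five example sentences can be built in a few lines. The following is a minimal sketch, assuming a simple lowercase, letters-only tokenizer (the slides do not specify one); the function names are chosen for illustration.

```python
import re
from collections import Counter, defaultdict

DOCS = {
    1: "He likes to wink, he likes to drink.",
    2: "He likes to drink, and drink, and drink.",
    3: "The thing he likes to drink is ink.",
    4: "The ink he likes to drink is pink.",
    5: "He likes to wink and drink pink ink.",
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_inverted_index(docs):
    """Map each term to {doc_id: term frequency}; absent pairs are simply not stored."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][doc_id] = count
    return index

def candidates(index, query):
    """Return the documents that contain at least one query term."""
    result = set()
    for term in tokenize(query):
        result.update(index.get(term, {}))
    return result

index = build_inverted_index(DOCS)
print(index["he"])                        # postings list for "he"
print(candidates(index, "pink thing"))    # only these documents need scoring
```

At query time only the postings lists of the query terms are touched, which is what makes the approach faster than scanning every saved page.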
  • 30. Positional Indices • The inverted index allows us to know which terms are in which documents • But we lose position information • Term proximity may be important
      Query: “Square dance”
      Doc 1: “…Times Square in New York is often host to dance performances...”
      Doc 2: “… Square dance is a dance for four couples…”
  • 31. Example: Positional Index • To preserve position information, we store each term occurrence as a tuple <doc,pos>
      he     D1:1  D1:5  D2:1  D3:3  D4:3  D5:1
      ink    D3:8  D4:2  D5:8
      pink   D4:8  D5:7
      thing  D3:2
      wink   D1:4  D5:4
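A positional index makes phrase queries such as “Square dance” answerable. The sketch below stores 1-based token positions per document and checks whether the query terms occur at consecutive positions. The tokenizer and the document identifiers `doc1` and `doc2` are assumptions made for this example.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_positional_index(docs):
    """Map each term to {doc_id: [positions]}, with positions counted from 1."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, term in enumerate(tokenize(text), start=1):
            index[term][doc_id].append(position)
    return index

def phrase_match(index, phrase):
    """Return documents in which the phrase terms appear directly after one another."""
    terms = tokenize(phrase)
    if not terms or any(term not in index for term in terms):
        return set()
    hits = set()
    for doc_id, starts in index[terms[0]].items():
        for start in starts:
            # The k-th phrase term must sit exactly k positions after the first one.
            if all(start + k in index[term].get(doc_id, [])
                   for k, term in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits

RHYME_DOCS = {
    1: "He likes to wink, he likes to drink.",
    2: "He likes to drink, and drink, and drink.",
    3: "The thing he likes to drink is ink.",
    4: "The ink he likes to drink is pink.",
    5: "He likes to wink and drink pink ink.",
}
SQUARE_DOCS = {
    "doc1": "Times Square in New York is often host to dance performances",
    "doc2": "Square dance is a dance for four couples",
}

rhyme_index = build_positional_index(RHYME_DOCS)
square_index = build_positional_index(SQUARE_DOCS)
print(dict(rhyme_index["he"]))
print(phrase_match(square_index, "Square dance"))
```

Both documents contain "square" and "dance", but only the second contains them next to each other, so only it matches the phrase.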
  • 32. What if the index becomes too large?
  • 33. Distributed Indices • If we cannot fit the index on a single machine any more, we split it up • Such distributed architectures introduce new problems – Crawling – Indexing – Retrieval
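The slide does not say how the index is split. One common option is to partition by document, so that each machine indexes its own subset of pages and every query is sent to all machines. The sketch below simulates this in a single process; the class names `Shard` and `DistributedIndex` and the modulo assignment rule are assumptions for illustration, not the lecture's design.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

class Shard:
    """One machine's share of the index: term -> set of local document ids."""
    def __init__(self):
        self.index = defaultdict(set)

    def add(self, doc_id, text):
        for term in tokenize(text):
            self.index[term].add(doc_id)

    def search(self, term):
        return self.index.get(term.lower(), set())

class DistributedIndex:
    """Document-partitioned index: each document lives on exactly one shard."""
    def __init__(self, num_shards):
        self.shards = [Shard() for _ in range(num_shards)]

    def add(self, doc_id, text):
        # Assumed assignment rule: integer document id modulo number of shards.
        self.shards[doc_id % len(self.shards)].add(doc_id, text)

    def search(self, term):
        # Ask every shard, then merge the partial results.
        result = set()
        for shard in self.shards:
            result |= shard.search(term)
        return result

DOCS = {
    1: "He likes to wink, he likes to drink.",
    2: "He likes to drink, and drink, and drink.",
    3: "The thing he likes to drink is ink.",
    4: "The ink he likes to drink is pink.",
    5: "He likes to wink and drink pink ink.",
}

cluster = DistributedIndex(num_shards=2)
for doc_id, text in DOCS.items():
    cluster.add(doc_id, text)
print(cluster.search("ink"))
```

The merge step in `search` hints at the new retrieval problem the slide mentions: a query is only as fast as the slowest shard that has to answer it.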
  • 37. Boolean Retrieval • The early days (1960’s-80’s) • Searching in highly specialized collections – Legal or medical documents – Newspaper archives • Searcher is a trained professional
  • 38. Boolean Queries • Complex queries in expert syntax describe the information need (e.g., Westlaw)
      Information need: Information on the legal theories involved in preventing the disclosure of trade secrets by employees formerly employed by a competing company.
      Query: "trade secret" /s disclos! /s prevent /s employe!
      Information need: Requirements for disabled people to be able to access a workplace.
      Query: disab! /p access! /s work-site work-place (employment /3 place)
  • 39. Boolean Retrieval Summary
      + Intuitive
      + Gives searcher direct control over the retrieval process
      - High cognitive load during query formulation and result set exploration
      - No ordering to result sets
      - Requires expert syntax
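At its core, Boolean retrieval is set algebra over postings lists. The sketch below shows AND, OR and NOT on the five example sentences used for the inverted index. It deliberately does not attempt the Westlaw proximity and truncation operators (`/s`, `/p`, `!`); the helper `docs_with` is a name made up for this example.

```python
import re
from collections import defaultdict

DOCS = {
    1: "He likes to wink, he likes to drink.",
    2: "He likes to drink, and drink, and drink.",
    3: "The thing he likes to drink is ink.",
    4: "The ink he likes to drink is pink.",
    5: "He likes to wink and drink pink ink.",
}

def build_index(docs):
    """Map each term to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z]+", text.lower()):
            index[term].add(doc_id)
    return index

index = build_index(DOCS)

def docs_with(term):
    """Postings set for a single term (empty if the term is unknown)."""
    return index.get(term.lower(), set())

wink_and_drink = docs_with("wink") & docs_with("drink")   # AND = intersection
ink_or_thing = docs_with("ink") | docs_with("thing")      # OR  = union
drink_not_ink = docs_with("drink") - docs_with("ink")     # AND NOT = difference

print(wink_and_drink, ink_or_thing, drink_not_ink)
```

The results are plain sets: a document either matches or it does not, which is why Boolean result sets carry no ordering.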
  • 40. Ranked Retrieval • Introduced in the early 1990’s • Returns documents in their order of relevance • New problem: How to determine relevance? • Retrieval model computes one relevance score per document and sorts accordingly
  • 41. Vector Space Models • Geometric model • Idea: – Relevant documents are similar to the query – The more similar they are, the higher they are ranked
  • 42. Vector Space Mappings • Translate text documents into vectors
      Document Text: “IT WAS the best of times, it was the worst of times…”
                               it  was  the  best  of  times  worst  cat
      Binary Vector:           1   1    1    1     1   1      1      0
      Frequency-based Vector:  2   2    2    1     2   2      1      0
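The two mappings on this slide can be reproduced with a few lines of code. This is a minimal sketch that assumes a fixed vocabulary order and a lowercase, letters-only tokenizer; the function names are illustrative.

```python
import re
from collections import Counter

VOCABULARY = ["it", "was", "the", "best", "of", "times", "worst", "cat"]
TEXT = "IT WAS the best of times, it was the worst of times"

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def frequency_vector(text, vocabulary):
    """One dimension per vocabulary term, holding the term's count in the text."""
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]

def binary_vector(text, vocabulary):
    """One dimension per vocabulary term: 1 if the term occurs, otherwise 0."""
    return [1 if count > 0 else 0 for count in frequency_vector(text, vocabulary)]

print(binary_vector(TEXT, VOCABULARY))
print(frequency_vector(TEXT, VOCABULARY))
```

The term "cat" is part of the vocabulary but absent from the text, so its dimension is 0 in both vectors.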
  • 43. Documents in Space • Each vector is a point in a high-dimensional space • We tend to stop drawing at 3 dimensions, but 10,000 are also no problem mathematically ;-) • Recall: Our document index already looks like a vector!
  • 44. Distances in Vector Spaces • Similar documents lie close together [Figure: points (1,2), (1,1) and (2,0) in a plane with axes “cat” and “dog”; δ = 1 between (1,2) and (1,1), δ = 1.41 between (1,1) and (2,0)]
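The δ values on this slide are consistent with Euclidean distances between the three points. The sketch below recomputes them and ranks two documents by their distance to a query. Which point plays the role of the query is not stated on the slide; treating (1,1) as the query is an assumption made only for this example.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

QUERY = (1, 1)                                # assumed to be the query vector
DOCUMENTS = {"A": (1, 2), "B": (2, 0)}        # hypothetical document labels

distances = {name: euclidean(QUERY, vector) for name, vector in DOCUMENTS.items()}
ranking = sorted(distances, key=distances.get)   # closest document first

print(distances)
print(ranking)
```

Many vector space systems use the cosine of the angle between vectors rather than Euclidean distance, because it is less sensitive to document length; the ranking principle stays the same.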
  • 45. Probabilistic Models • Alternatively, use probabilistic methods • For each document, compute the probability of relevance towards the query • Rank by probability
  • 46. Building Document Models • Build a language model of each document • Count term frequencies and divide by document length
      Terms: is, and, to, the, wink, thing, pink, likes, ink, drink, he
      Documents: “He likes to wink, he likes to drink.” “The thing he likes to drink is ink.” “He likes to wink and drink pink ink.”
  • 47. Ranking Documents • Compute the likelihood of the query being generated by the document models • Order documents by decreasing probability
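The two steps above (build a model per document, then score the query under each model) can be sketched as follows. This follows the slides literally and uses unsmoothed relative frequencies; the document labels `D1` to `D3` and the function names are assumptions for illustration.

```python
import re
from collections import Counter

DOCS = {
    "D1": "He likes to wink, he likes to drink.",
    "D2": "The thing he likes to drink is ink.",
    "D3": "He likes to wink and drink pink ink.",
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def document_model(text):
    """Unigram model: term frequency divided by document length."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    return {term: count / len(tokens) for term, count in counts.items()}

def query_likelihood(query, model):
    """Probability that the document model generates every query term."""
    probability = 1.0
    for term in tokenize(query):
        probability *= model.get(term, 0.0)   # unseen term -> probability 0
    return probability

def rank(query, docs):
    """Return (doc_id, probability) pairs ordered by decreasing probability."""
    scored = [(doc_id, query_likelihood(query, document_model(text)))
              for doc_id, text in docs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

print(rank("likes drink", DOCS))
print(rank("pink ink", DOCS))
```

Without smoothing, a single missing query term drives a document's score to zero. Practical systems therefore usually mix in collection-wide term statistics, which is beyond what the slides cover.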
  • 48. PageRank • Google’s fabled original ranking criterion • Based on the idea that page authoritativeness should be rewarded during ranking • Authoritativeness scores are iteratively propagated along hyperlinks
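The iterative propagation of authority along hyperlinks can be illustrated with a small power iteration. This is a textbook-style sketch, not Google's production method: the four-page link graph is invented, and the damping factor of 0.85 and the fixed iteration count are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively propagate scores along links; returns {page: score}, summing to 1."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}          # start with uniform scores
    for _ in range(iterations):
        # Every page receives a small base score regardless of its in-links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's score evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its score over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank

# Hypothetical link graph: page -> pages it links to.
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}

scores = pagerank(LINKS)
for page, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

Page C is linked from three other pages and ends up with the highest score, while D, which nobody links to, keeps only the base score.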
  • 51. Everyone wants your Data. But what do they do with it?
  • 53. Search Personalization • People have very specific preferences and interests (Topics, language, textual complexity, location, etc.) • Based on your previous search history, we can find out about these preferences • For future searches, the engine tries to optimize for the newly found preference
  • 54. Query Suggestion • Query suggestions try to help people formulate their information needs • They are selected on the basis of frequent queries that extend the current query terms
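A simple form of this idea is to look up logged queries that extend what the user has typed so far and to offer the most frequent ones. The query log below is entirely made up, and `suggest` is a hypothetical helper; real systems use far larger logs and additional signals.

```python
from collections import Counter

# Hypothetical query log: query string -> number of times it was issued.
QUERY_LOG = Counter({
    "information retrieval": 120,
    "information theory": 80,
    "inverted index": 60,
    "information retrieval book": 45,
    "information retrieval evaluation": 30,
})

def suggest(typed, log, k=3):
    """Return up to k logged queries that extend `typed`, most frequent first."""
    prefix = typed.lower().strip()
    matches = [(query, count) for query, count in log.items()
               if query.startswith(prefix) and query != prefix]
    # Sort by descending frequency; break ties alphabetically for stable output.
    matches.sort(key=lambda item: (-item[1], item[0]))
    return [query for query, _ in matches[:k]]

print(suggest("information retrieval", QUERY_LOG))
print(suggest("information", QUERY_LOG, k=2))
```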
  • 56. Data-driven Spelling Correction • Idea: Consensus is strong / errors are random
      Spelling          Frequency
      albert einstein   4834
      albert einstien   525
      albert einstine   149
      albert einsten    27
      albert einsteins  25
      albert einstain   11
      albert einstin    10
      albert eintein    9
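The consensus idea can be turned into a tiny corrector: among all logged spellings that are close to what the user typed, pick the most frequent one. The frequencies below are the ones from the slide; the edit-distance threshold of 2 and the function names are assumptions for this sketch.

```python
SPELLING_COUNTS = {
    "albert einstein": 4834,
    "albert einstien": 525,
    "albert einstine": 149,
    "albert einsten": 27,
    "albert einsteins": 25,
    "albert einstain": 11,
    "albert einstin": 10,
    "albert eintein": 9,
}

def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]

def correct(query, counts, max_distance=2):
    """Return the most frequent logged spelling within max_distance of the query."""
    nearby = [spelling for spelling in counts
              if edit_distance(query, spelling) <= max_distance]
    if not nearby:
        return query   # nothing similar in the log: leave the query unchanged
    return max(nearby, key=counts.get)

print(correct("albert einstien", SPELLING_COUNTS))
print(correct("albert eintein", SPELLING_COUNTS))
```

Because individual errors are scattered across many variants while the correct spelling is shared by most users, frequency alone is enough to single out the intended query in this example.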
  • 57. Dangers: Welcome to the Filter Bubble • Data-driven methods are very powerful • But what happens to the niche information need? • What if I want to see that video that nobody else likes? • Diversification techniques can help but there is a danger of drowning in the mainstream
  • 58. Section 3 IR FOR CHILDREN
  • 60. The PuppyIR Project • European Union research project on child-friendly information access • Investigate the specific needs of children • Create an open-source framework to cater for these needs • 5 universities, 3 business partners
  • 61. The Internet – A Place full of Kids • Children interact with the Internet at a young age (~ 4 years old) • They spend more time online • ~ 40% of British 10-year-olds have regular unsupervised Internet access
  • 62. The Internet – A Place for Kids? • Many media platforms are mirrors of popular culture • Large parts of their content might not be suitable for children • Most IR systems are designed with adult users in mind, not children
  • 63. How to learn what Children Need? • Literature study • Surveys – 300 US parents and teachers • User studies – 49 Dutch elementary school children • Query log analyses – Thousands of young Yahoo users
  • 64. Children’s Deficits • They type / spell badly • They browse rather than search • They are bad at keywording • They struggle with search interfaces • Everything is relevant • Complex content is a challenge
  • 65. Web Page Classification • Can we automatically determine whether a web site is suitable for children? – Based on its content – Based on its link neighborhood
  • 67. Content Simplification • Sometimes we do not want to completely exclude certain documents • But they still contain complex language that is hard to grasp for kids • In these cases, we can automatically offer simplifications for difficult/technical terms
  • 71. Introduction to IR • Manning, Raghavan, Schütze • Available online at: http://nlp.stanford.edu/IR-book/ • All-in-one introduction
  • 72. Modern Information Retrieval • Baeza-Yates, Ribeiro-Neto • Good overview of basic topics
  • 73. Data Mining • Witten, Frank • Data mining and pattern recognition essentials
  • 74. Managing Gigabytes • Witten, Moffat, Bell • What to do when your data gets large?
  • 75. Speech and Language Processing • Jurafsky, Martin • Comprehensive overview of speech and language technology