5. Our Corpus:
1. The cow says moo.
2. The sheep says baa.
3. The dogs say woof.
4. The dog-cow says moof.

6.
>>> doc1 = "The cow says moo."
>>> doc2 = "The sheep says baa."
>>> doc3 = "The dogs say woof."
>>> doc4 = "The dog-cow says moof."

7. Brute force
>>> docs = [doc1, doc2, doc3, doc4]
>>> def searcher(term):
...     for doc in docs:
...         if doc.find(term) > -1:
...             print "found '%s' in '%s'" % (term, doc)
...
>>> searcher('moo')
found 'moo' in 'The cow says moo.'
found 'moo' in 'The dog-cow says moof.'

10. Tokenising #2
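Why tokenise at all? Substring search is too greedy: `str.find` matches inside words, so a query for 'moo' also hits 'moof'. A quick Python 3 check of this (a sketch, not from the slides):

```python
# Substring search matches inside words: 'moo' is found inside 'moof',
# which is a false positive for a whole-word query.
doc = "The dog-cow says moof."

print(doc.find('moo'))  # non-negative index: matched inside 'moof'
print('moo' in doc)     # True, even though the word 'moo' never appears
```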
>>> import re
>>> word = re.compile('\W+')
>>> word.split(doc1)
['The', 'cow', 'says', 'moo', '']
>>> doc4 = "The dog-cow says moof"
>>> word.split(doc4)
['The', 'dog', 'cow', 'says', 'moof']

11. Tokenising #3
>>> word = re.compile('\s|[^a-z-]', re.I)
>>> word.split(doc4)
['The', 'dog-cow', 'says', 'moof']

12. Data structures
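For contrast, the two tokenisers above differ on hyphenated words; side by side in Python 3 raw-string form (a sketch):

```python
import re

doc = "The dog-cow says moof."

naive = re.compile(r'\W+')                   # any non-word char splits, including '-'
keep_hyphens = re.compile(r'\s|[^a-zA-Z-]')  # whitespace or non-letter, but '-' survives

print(naive.split(doc))         # 'dog-cow' breaks into two tokens
print(keep_hyphens.split(doc))  # 'dog-cow' stays whole
```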
>>> doc1 = {'name': 'doc 1', 'content': "The cow says moo."}
>>> doc2 = {'name': 'doc 2', 'content': "The sheep says baa."}
>>> doc3 = {'name': 'doc 3', 'content': "The dogs say woof."}
>>> doc4 = {'name': 'doc 4', 'content': "The dog-cow says moof."}

13. Postings
>>> docs = [doc1, doc2, doc3, doc4]
>>> postings = {}
>>> for doc in docs:
...     for token in word.split(doc['content']):
...         if len(token) == 0:
...             continue
...         token = token.lower()
...         doc_name = doc['name']
...         if token not in postings:
...             postings[token] = [doc_name]
...         else:
...             postings[token].append(doc_name)

14. Postings
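In Python 3, the same inverted index is often built with `collections.defaultdict`, which removes the membership test entirely (a sketch, not the slide's code):

```python
import re
from collections import defaultdict

word = re.compile(r'\s|[^a-zA-Z-]')
docs = [
    {'name': 'doc 1', 'content': "The cow says moo."},
    {'name': 'doc 2', 'content': "The sheep says baa."},
]

# defaultdict(list) creates an empty posting list on first access,
# so new and existing tokens take the same code path.
postings = defaultdict(list)
for doc in docs:
    for token in word.split(doc['content']):
        if token:  # skip empty strings produced by the split
            postings[token.lower()].append(doc['name'])

print(dict(postings))
```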
>>> postings
{'sheep': ['doc 2'], 'says': ['doc 1', 'doc 2',
'doc 4'], 'cow': ['doc 1'], 'moof': ['doc 4'],
'dog-cow': ['doc 4'], 'woof': ['doc 3'], 'say':
['doc 3'], 'moo': ['doc 1'], 'baa': ['doc 2'],
'the': ['doc 1', 'doc 2', 'doc 3', 'doc 4'],
'dogs': ['doc 3']}

15. O(log n)
>>> def searcher(term):
...     if term in postings:
...         for match in postings[term]:
...             print "found '%s' in '%s'" % (term, match)
...
>>> searcher('says')
found 'says' in 'doc 1'
found 'says' in 'doc 2'
found 'says' in 'doc 4'

18. Tokenising #3
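This slide lists refinements to tokenising; the simplest to sketch is stop-word removal (the word list here is hypothetical, not from the slides):

```python
# Hypothetical stop-word list: very common words carry little signal,
# so they are often dropped before indexing.
STOP_WORDS = {'the', 'a', 'an', 'say', 'says'}

def remove_stop_words(tokens):
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(['The', 'dog-cow', 'says', 'moof']))
```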
Punctuation
Stemming
Stop words
Parts of Speech
Entity Extraction
Markup

19. Logistics
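On the storage and transport points below: one minimal way to serialise the postings index is the standard library's pickle (a sketch; real engines use purpose-built on-disk formats):

```python
import pickle

postings = {'moo': ['doc 1'], 'says': ['doc 1', 'doc 2', 'doc 4']}

# Serialise to bytes, as if writing to disk or shipping to another node
# in a cluster, then load it back.
blob = pickle.dumps(postings)
restored = pickle.loads(blob)

print(restored == postings)  # the index round-trips intact
```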
Storage
(serialising, transporting,
clustering)
Updates
Warming up

20. Ranking
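Density, the first item below, is usually scored with tf–idf; a minimal sketch over a toy corpus (assumptions: documents already tokenised and lower-cased, natural-log idf):

```python
import math

# Token lists per document (already tokenised and lower-cased).
docs = {
    'doc 1': ['the', 'cow', 'says', 'moo'],
    'doc 2': ['the', 'sheep', 'says', 'baa'],
    'doc 3': ['the', 'dogs', 'say', 'woof'],
}

def tf_idf(term, doc_name):
    tokens = docs[doc_name]
    tf = tokens.count(term) / len(tokens)                  # term frequency in the doc
    df = sum(1 for toks in docs.values() if term in toks)  # docs containing the term
    idf = math.log(len(docs) / df)                         # rarer terms weigh more
    return tf * idf

# 'the' occurs in every document, so idf = log(1) = 0 and it scores nothing;
# 'moo' is unique to doc 1 and scores higher.
print(tf_idf('the', 'doc 1'), tf_idf('moo', 'doc 1'))
```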
Density
(tf–idf)
Position
Date
Relationships
Feedback
Editorial

21. Interesting search
Lucene
(Hadoop, Solr, Nutch)
OpenFTS / MySQL
Sphinx
Hyper Estraier
Xapian
Other index types