5. Our Corpus:
1. The cow says moo.
2. The sheep says baa.
3. The dogs say woof.
4. The dog-cow says moof.

6.
>>> doc1 = "The cow says moo."
>>> doc2 = "The sheep says baa."
>>> doc3 = "The dogs say woof."
>>> doc4 = "The dog-cow says moof."

7. Brute force
>>> docs = [doc1, doc2, doc3, doc4]
>>> def searcher(term):
...     for doc in docs:
...         if doc.find(term) > -1:
...             print "found '%s' in '%s'" % (term, doc)
...
>>> searcher('moo')
found 'moo' in 'The cow says moo.'
found 'moo' in 'The dog-cow says moof.'

10. Tokenising #2
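Why tokenise at all? Substring search is too greedy: `str.find` matches inside words, so a query for 'moo' also hits 'moof'. A quick Python 3 check of this (a sketch, not from the slides):

```python
# Substring search matches inside words: 'moo' is found inside 'moof',
# which is a false positive for a whole-word query.
doc = "The dog-cow says moof."

print(doc.find('moo'))  # non-negative index: matched inside 'moof'
print('moo' in doc)     # True, even though the word 'moo' never appears
```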
>>> import re
>>> word = re.compile('\W+')
>>> word.split(doc1)
['The', 'cow', 'says', 'moo', '']
>>> doc4 = "The dog-cow says moof"
>>> word.split(doc4)
['The', 'dog', 'cow', 'says', 'moof']

11. Tokenising #3
>>> word = re.compile('\s|[^a-z-]', re.I)
>>> word.split(doc4)
['The', 'dog-cow', 'says', 'moof']

12. Data structures
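For contrast, the two tokenisers above differ on hyphenated words; side by side in Python 3 raw-string form (a sketch):

```python
import re

doc = "The dog-cow says moof."

naive = re.compile(r'\W+')                   # any non-word char splits, including '-'
keep_hyphens = re.compile(r'\s|[^a-zA-Z-]')  # whitespace or non-letter, but '-' survives

print(naive.split(doc))         # 'dog-cow' breaks into two tokens
print(keep_hyphens.split(doc))  # 'dog-cow' stays whole
```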
>>> doc1 = {'name': 'doc 1', 'content': "The cow says moo."}
>>> doc2 = {'name': 'doc 2', 'content': "The sheep says baa."}
>>> doc3 = {'name': 'doc 3', 'content': "The dogs say woof."}
>>> doc4 = {'name': 'doc 4', 'content': "The dog-cow says moof."}

13. Postings
>>> docs = [doc1, doc2, doc3, doc4]
>>> postings = {}
>>> for doc in docs:
...     for token in word.split(doc['content']):
...         if len(token) == 0:
...             continue
...         token = token.lower()
...         doc_name = doc['name']
...         if token not in postings:
...             postings[token] = [doc_name]
...         else:
...             postings[token].append(doc_name)

14. Postings
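In Python 3, the same inverted index is often built with `collections.defaultdict`, which removes the membership test entirely (a sketch, not the slide's code):

```python
import re
from collections import defaultdict

word = re.compile(r'\s|[^a-zA-Z-]')
docs = [
    {'name': 'doc 1', 'content': "The cow says moo."},
    {'name': 'doc 2', 'content': "The sheep says baa."},
]

# defaultdict(list) creates an empty posting list on first access,
# so new and existing tokens take the same code path.
postings = defaultdict(list)
for doc in docs:
    for token in word.split(doc['content']):
        if token:  # skip empty strings produced by the split
            postings[token.lower()].append(doc['name'])

print(dict(postings))
```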
>>> postings
{'sheep': ['doc 2'], 'says': ['doc 1', 'doc 2',
'doc 4'], 'cow': ['doc 1'], 'moof': ['doc 4'],
'dog-cow': ['doc 4'], 'woof': ['doc 3'], 'say':
['doc 3'], 'moo': ['doc 1'], 'baa': ['doc 2'],
'the': ['doc 1', 'doc 2', 'doc 3', 'doc 4'],
'dogs': ['doc 3']}

15. O(log n)
>>> def searcher(term):
...     if term in postings:
...         for match in postings[term]:
...             print "found '%s' in '%s'" % (term, match)
...
>>> searcher('says')
found 'says' in 'doc 1'
found 'says' in 'doc 2'
found 'says' in 'doc 4'

18. Tokenising #3
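This slide lists refinements to tokenising; the simplest to sketch is stop-word removal (the word list here is hypothetical, not from the slides):

```python
# Hypothetical stop-word list: very common words carry little signal,
# so they are often dropped before indexing.
STOP_WORDS = {'the', 'a', 'an', 'say', 'says'}

def remove_stop_words(tokens):
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(['The', 'dog-cow', 'says', 'moof']))
```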
Punctuation
Stemming
Stop words
Parts of Speech
Entity Extraction
Markup

19. Logistics
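On the storage and transport points below: one minimal way to serialise the postings index is the standard library's pickle (a sketch; real engines use purpose-built on-disk formats):

```python
import pickle

postings = {'moo': ['doc 1'], 'says': ['doc 1', 'doc 2', 'doc 4']}

# Serialise to bytes, as if writing to disk or shipping to another node
# in a cluster, then load it back.
blob = pickle.dumps(postings)
restored = pickle.loads(blob)

print(restored == postings)  # the index round-trips intact
```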
Storage
(serialising, transporting,
clustering)
Updates
Warming up

20. Ranking
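Density, the first item below, is usually scored with tf–idf; a minimal sketch over a toy corpus (assumptions: documents already tokenised and lower-cased, natural-log idf):

```python
import math

# Token lists per document (already tokenised and lower-cased).
docs = {
    'doc 1': ['the', 'cow', 'says', 'moo'],
    'doc 2': ['the', 'sheep', 'says', 'baa'],
    'doc 3': ['the', 'dogs', 'say', 'woof'],
}

def tf_idf(term, doc_name):
    tokens = docs[doc_name]
    tf = tokens.count(term) / len(tokens)                  # term frequency in the doc
    df = sum(1 for toks in docs.values() if term in toks)  # docs containing the term
    idf = math.log(len(docs) / df)                         # rarer terms weigh more
    return tf * idf

# 'the' occurs in every document, so idf = log(1) = 0 and it scores nothing;
# 'moo' is unique to doc 1 and scores higher.
print(tf_idf('the', 'doc 1'), tf_idf('moo', 'doc 1'))
```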
Density
(tf–idf)
Position
Date
Relationships
Feedback
Editorial

21. Interesting search
Lucene
(Hadoop, Solr, Nutch)
OpenFTS / MySQL
Sphinx
Hyper Estraier
Xapian
Other index types