In Search Of...
Ian Barber
@ianbarber
http://phpir.com
ian@ibuildings.com
integrating site search
Friday, 29 October 2010
How Search Works
Integrating Search
Improving Results
Using Search
Search Performance
Questions
[Diagram: documents pass through an analyser into the index; a query passes through a parser and is run against the index, which returns results.]
Tokenisation
“With AT&T’s help, the F.B.I Miami-Dade office had recovered $1.1 million from O’Healy’s Ponzi scheme, 10-15% more than expected.”
PHP Tokenisation
function tokenise($string) {
    $string = strtolower($string);
    // \w+ matches runs of word characters; each match is a (token, offset) pair
    preg_match_all('/\w+/', $string,
        $matches, PREG_OFFSET_CAPTURE);
    return $matches[0];
}
Document Term Pairs
Document ID Term
1 the
1 best
1 of
1 the
... ...
204 and
204 what
204 would
Inverted Index
Term    Documents
best    1 (4, 16), 4 (422), 129 (344) ...
what    24 (50, 98), 75 (33, 208) ...
would   99 (32, 599), 201 (344) ...
...     ...
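A minimal sketch of how the document term pairs could be folded into a positional inverted index. The function name buildIndex() and the two sample documents are assumptions made for illustration; the layout (term, then document ID, then a list of offsets) mirrors the table above.

```php
<?php
// Hypothetical helper: build a positional inverted index from raw documents.
// $documents maps document ID => text.
// The result maps term => document ID => list of character offsets.
function buildIndex(array $documents) {
    $index = array();
    foreach ($documents as $docID => $text) {
        preg_match_all('/\w+/', strtolower($text), $m, PREG_OFFSET_CAPTURE);
        foreach ($m[0] as $match) {
            list($term, $offset) = $match;
            $index[$term][$docID][] = $offset;
        }
    }
    ksort($index); // keep the dictionary of terms in sorted order
    return $index;
}

// Sample documents, invented for this sketch.
$index = buildIndex(array(
    1 => 'the best of the west',
    2 => 'what would the best do',
));
print_r($index['best']);
```

Storing offsets alongside the document IDs is what later makes phrase queries and excerpts possible without re-reading the documents.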
Boolean Query Merge
Query: Best Western Hotel
Result: Document 298
best 1 4 129 298 305 338
western 4 95 194 204 298 305
hotel 2 40 200 298 355 402
working 4 298 305
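The merge above can be sketched as a single pass over two sorted posting lists. The function name intersectPostings() is an assumption; the document IDs are the ones from the slide.

```php
<?php
// Intersect two ascending lists of document IDs in a single pass.
function intersectPostings(array $a, array $b) {
    $result = array();
    $i = 0;
    $j = 0;
    while ($i < count($a) && $j < count($b)) {
        if ($a[$i] === $b[$j]) {
            $result[] = $a[$i]; // present in both lists
            $i++;
            $j++;
        } elseif ($a[$i] < $b[$j]) {
            $i++; // advance whichever list is behind
        } else {
            $j++;
        }
    }
    return $result;
}

$best    = array(1, 4, 129, 298, 305, 338);
$western = array(4, 95, 194, 204, 298, 305);
$hotel   = array(2, 40, 200, 298, 355, 402);

// Merge two lists into a working set, then merge that with the next term.
$working = intersectPostings($best, $western);
$result  = intersectPostings($working, $hotel);
```

Because both lists are sorted, each is read once, so the cost grows with the combined length of the lists rather than their product.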
TF-IDF
function getWeight($docID, $term, $total) {
    // $term holds the postings for one term: docID => list of occurrences
    $tf = count($term[$docID]);
    // $total is the number of documents in the collection
    $idf = log($total / count($term), 2);
    return $tf * $idf;
}
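A small worked example of the same formula, assuming the postings layout of docID => list of positions. The function name tfIdf(), the collection size of 8 and the position values are all invented for illustration.

```php
<?php
// Term frequency times log2(N / document frequency).
// $postings maps docID => list of positions for ONE term (assumed layout).
function tfIdf(array $postings, $docID, $totalDocs) {
    if (!isset($postings[$docID])) {
        return 0.0; // the term does not occur in this document
    }
    $tf  = count($postings[$docID]);
    $idf = log($totalDocs / count($postings), 2);
    return $tf * $idf;
}

// Toy figures: 8 documents in the collection.
$best = array(1 => array(4, 16), 4 => array(422)); // in 2 of 8 documents
$the  = array(1 => array(0), 2 => array(3), 3 => array(1), 4 => array(7),
              5 => array(2), 6 => array(0), 7 => array(5), 8 => array(9));

$wBest = tfIdf($best, 1, 8); // 2 * log2(8 / 2)
$wThe  = tfIdf($the, 1, 8);  // 1 * log2(8 / 8)
```

A term that appears in every document gets an IDF of zero, so it contributes nothing to the score however often it occurs.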
Document Vector
socket what heavy steel ...
Doc 1 0.02 0.3 0.001 0 ...
Doc 2 0 0 0 0 ...
Doc 3 0.001 0.2 0 0 ...
Doc 4 0 0 0.002 0.003 ...
Ranked Query Merge
best      23     42     179    246    333    703
weight    0.008  0.002  0.023  0.039  0.014  0.001
western   42     88     120    179    246    798
weight    0.003  0.004  0.023  0.001  0.034  0.004
1 - 246: 0.073
2 - 179: 0.024
3 - 120: 0.023
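The ranking on this slide can be reproduced by summing the weights per document and sorting. This is a sketch only: the function name rankedMerge() and the array layout are assumptions, while the weights are copied from the slide.

```php
<?php
// term => (docID => weight), figures taken from the slide.
$postings = array(
    'best' => array(23 => 0.008, 42 => 0.002, 179 => 0.023,
                    246 => 0.039, 333 => 0.014, 703 => 0.001),
    'western' => array(42 => 0.003, 88 => 0.004, 120 => 0.023,
                       179 => 0.001, 246 => 0.034, 798 => 0.004),
);

// Add up the weight of every query term per document, best first.
function rankedMerge(array $postings, array $terms, $limit) {
    $scores = array();
    foreach ($terms as $term) {
        if (!isset($postings[$term])) {
            continue; // term not in the index
        }
        foreach ($postings[$term] as $docID => $weight) {
            $previous = isset($scores[$docID]) ? $scores[$docID] : 0;
            $scores[$docID] = $previous + $weight;
        }
    }
    arsort($scores);                              // highest score first
    return array_slice($scores, 0, $limit, true); // keep the docID keys
}

$top = rankedMerge($postings, array('best', 'western'), 3);
foreach ($top as $docID => $score) {
    printf("%d: %.3f\n", $docID, $score);
}
```

Unlike the boolean merge, a document does not need to contain every term: document 120 ranks third on the strength of "western" alone.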
PHP Similarity
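function score($queryString, $index) {
    $matches = array();
    // tokenise() returns (token, offset) pairs, so take element 0
    foreach (tokenise($queryString) as $token) {
        $qterm = $token[0];
        if (!isset($index[$qterm])) {
            continue; // term not in the index
        }
        foreach ($index[$qterm] as $id => $posting) {
            if (!isset($matches[$id])) {
                $matches[$id] = 0;
            }
            $matches[$id] += $posting['score'];
        }
    }
    arsort($matches); // sorts in place and returns a bool
    return $matches;
}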
Integrating Search
MySQL Full Text Search
CREATE TABLE example (
    id INT(11) NOT NULL auto_increment,
    title VARCHAR(255),
    content TEXT,
    PRIMARY KEY(id),
    FULLTEXT(title,content)
) Engine=MyISAM;
INSERT INTO example (title, content) VALUES
    ('Mikko & Bacon','Mikko loves bacon'),
    ('Marcello & Bacon','Marcello hates bacon'),
    ('Jo & Sausages','Johanna loves sausages'),
    ('Hollywood & Garlic','Lorenzo hates garlic'),
    ('James & Cheddar','James is keen on cheeses');
MySQL FTI Query
SELECT * FROM example WHERE
MATCH(title,content) AGAINST('loves bacon');
+----+------------------+------------------------+
| id | title            | content                |
+----+------------------+------------------------+
|  1 | Mikko & Bacon    | Mikko loves bacon      |
|  2 | Marcello & Bacon | Marcello hates bacon   |
|  3 | Jo & Sausages    | Johanna loves sausages |
+----+------------------+------------------------+
3 rows in set (0.00 sec)
Sphinx
http://www.sphinxsearch.com
Sphinx Configuration
source posts
{
    type = mysql
    sql_host = localhost
    sql_user = user
    sql_pass = password
    sql_db = search
    sql_query = \
        SELECT id, title, content FROM example;
    sql_attr_multi = uint tag from query; \
        SELECT example_id, tag_id FROM tags;
}
index posts
{
source = posts
path = /var/data/sphinx/example
morphology = stem_en
min_word_len = 3
min_prefix_len = 3
min_infix_len = 0
enable_star = 1
}
Stemming
http://tartarus.org/~martin/PorterStemmer
happening -> happen
happened -> happen
happens -> happen
Command Line Searching
indexer --config /etc/sphinx.conf --all
search --config /etc/sphinx.conf love bacon
displaying matches:
1. document=1, weight=3, tag=(1,2)
! id=1
! title=Mikko & Bacon
! content=Mikko loves bacon
words:
1. 'love': 2 documents, 2 hits
2. 'bacon': 2 documents, 4 hits
searchd --config /etc/sphinx.conf
Sphinx From PHP
$cl = new SphinxClient();
$cl->SetServer('localhost', 3312);
$cl->SetMatchMode(SPH_MATCH_ANY);
$result = $cl->Query('bac*');
$docIDs = array_keys($result["matches"]);
$cl->SetFilter('tag', array(1));
$result = $cl->Query('bac*');
$docIDs = array_keys($result["matches"]);
Swish-E
http://swish-e.org
pecl install swish-beta
Filesystem Index With Swish-E
fs-swish-e.conf:
IndexDir /var/data/documents
IndexFile fs-swish-e.index
IndexOnly .doc .docx .pdf
FuzzyIndexingMode Stemming_en1
FileFilter .pdf /usr/local/bin/swish_filter.pl
FileFilter .doc /usr/local/bin/swish_filter.pl
Command:
/usr/local/bin/swish-e -S fs -c fs-swish-e.conf
Crawling Content
www-swish-e.conf:
IndexDir /usr/local/lib/swish-e/spider.pl
IndexFile www-swish-e.index
SwishProgParameters default http://phpir.com/
FuzzyIndexingMode Stemming_en1
DefaultContents HTML
Command:
/usr/local/bin/swish-e -S prog -c www-swish-e.conf
Swish-E With Multiple Indices
$swish = new Swish(
'www-swish-e.index fs-swish-e.index'
);
$search = $swish->prepare();
$queryStr = 'search string goes here';
$result = $search->execute($queryStr);
$total = $result->hits;
while($r = $result->nextResult()) {
echo $r->swishdocpath; // url
}
Lucene
Build Index
$index = Zend_Search_Lucene::create('idx');
foreach($documents as $title => $content) {
    $doc = new Zend_Search_Lucene_Document();
    $doc->addField(
        Zend_Search_Lucene_Field::Text(
            'title', $title));
    $doc->addField(
        Zend_Search_Lucene_Field::UnStored(
            'content', $content));
    $index->addDocument($doc);
}
Query Zend Search Lucene
$results = $index->find('loves bacon');
foreach($results as $result) {
    echo $result->score, " ";
    echo $result->title, "\n";
}
Output:
0.81656279309067 Mikko and Bacon
0.24800278854758 Marcello & Bacon
Index HTML
$file = file_get_contents($url);
$doc = Zend_Search_Lucene_Document_Html::
    loadHTML($file);
$doc->addField(
    Zend_Search_Lucene_Field::Text(
        'url', $url));
$index->addDocument($doc);
Solr
http://lucene.apache.org/solr/
Solr Search Index
$options = array( 'hostname' => 'localhost',
'port' => 8983 );
$client = new SolrClient($options);
$doc = new SolrInputDocument();
$doc->addField('id', $id);
$doc->addField('cat', $category);
$doc->addField('title', $title);
$doc->addField('text', $text);
$response = $client->addDocument($doc);
$client->commit();
Solr Search Client
$client = new SolrClient($options);
$query = new SolrQuery('bacon');
$response = $client->query($query);
$r = $response->getResponse();
foreach($r['response']['docs'] as $d) {
echo $d->title[0] . "\n";
}
Xapian
http://xapian.org
Xapian In PHP
$db = new XapianWritableDatabase(
'idx', Xapian::DB_CREATE_OR_OPEN);
$i = new XapianTermGenerator();
$i->set_stemmer(new XapianStem("english"));
$doc = new XapianDocument();
$doc->set_data($content);
$doc->add_value(1, $title);
$i->set_document($doc);
$i->index_text($content);
$db->add_document($doc);
Xapian Search In PHP
$database = new XapianDatabase('idx');
$enquire = new XapianEnquire($database);
$qp = new XapianQueryParser();
$qp->set_stemmer(new XapianStem("english"));
$qp->set_database($database);
$qp->set_stemming_strategy(
XapianQueryParser::STEM_SOME);
$query = $qp->parse_query($queryString);
$enquire->set_query($query);
$matches = $enquire->get_mset(0, 10);
$i = $matches->begin();
while(!$i->equals($matches->end())) {
$n = $i->get_rank() + 1;
$data = $i->get_document()->get_data();
$title = $i->get_document()->get_value(1);
$score = $i->get_percent();
$i->next();
}
Improving Results
Anchor Text
Parse Anchor Text
$p = file_get_contents('http://phpir.com');
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($p); // instance call; the static form is deprecated
$links = $dom->getElementsByTagName('a');
foreach($links as $link) {
    $href = $link->getAttribute('href');
    $text = $link->nodeValue;
}
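One way the parsed links might be put to use is to group the anchor text by the URL it points at, so that it can be indexed with the target document. This is a sketch under assumptions: the inline HTML and the $anchors layout are invented for illustration, and it needs PHP's DOM extension.

```php
<?php
// Inline sample page, made up for this sketch.
$html = '<html><body>'
      . '<a href="/search">PHP search tutorial</a>'
      . '<a href="/bayes">spam filtering</a>'
      . '<a href="/search">site search</a>'
      . '</body></html>';

libxml_use_internal_errors(true); // tolerate sloppy real-world markup
$dom = new DOMDocument();
$dom->loadHTML($html);

// target URL => list of anchor texts that link to it
$anchors = array();
foreach ($dom->getElementsByTagName('a') as $link) {
    $href = $link->getAttribute('href');
    $text = trim($link->nodeValue);
    if ($href !== '' && $text !== '') {
        $anchors[$href][] = $text;
    }
}
print_r($anchors);
```

The collected text describes a page in the words other pages use for it, which is often closer to what people type into a search box than the page's own wording.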
Zone Weighting
[Figure: zones labelled 1, 2 and 3]
ZSL Zone Weighting
$doc = new Zend_Search_Lucene_Document();
$tfield = Zend_Search_Lucene_Field::Text
('title', $title);
$tfield->boost = 1.3;
$doc->addField($tfield);
$doc->addField(
Zend_Search_Lucene_Field::UnStored
('content', $content));
$index->addDocument($doc);
Document Authority
Document Weights in ZSL
$doc = new Zend_Search_Lucene_Document();
$doc->addField(
Zend_Search_Lucene_Field::Text
('title', $title));
$doc->addField(
Zend_Search_Lucene_Field::UnStored
('content', $content));
$doc->boost = 1 + ($numComments / 100);
$index->addDocument($doc);
Using Search
Summaries & Highlighting
Sphinx Extract & Highlight
$cl = new SphinxClient();
$cl->SetServer( "localhost", 3312 );
$q = 'bacon';
$r = $cl->Query($q);
foreach ($r["matches"] as $doc => $info) {
    $text[$doc] = getTextFromDB($doc);
}
$extracts = $cl->BuildExcerpts($text, 'posts', $q);
foreach($extracts as $extract) {
    echo $extract;
}
Xapian Spelling Correction
Indexer:
$indexer = new XapianTermGenerator();
$indexer->set_database($database);
$indexer->set_flags(
    XapianTermGenerator::FLAG_SPELLING);
Searcher:
$queryString = "strreplace or str_cmp";
$q = new XapianQueryParser();
$q->set_database($database);
$query = $q->parse_query($queryString,
    XapianQueryParser::FLAG_SPELLING_CORRECTION);
echo "Did you mean: " .
    $q->get_corrected_query_string() . "\n";
Spelling Correction Output
php xapsearch.php
Did you mean: str_replace or strcmp
4644 results found for “strreplace or str_cmp”:
1: 2% docid=572
[phpdocs/html/cc.license.html]
2: 2% docid=7169
[phpdocs/html/imagick.constants.html]
3: 2% docid=10086
[phpdocs/html/sqlite3result.fetcharray.html]
4: 2% docid=6132
[phpdocs/html/function.swf-posround.html]
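For intuition only, a "did you mean" suggestion can be approximated with PHP's built-in levenshtein() over a list of known terms. This is not how Xapian implements its spelling correction; the function name suggest(), the vocabulary and the distance limit are assumptions made for this sketch.

```php
<?php
// Hypothetical helper: pick the closest vocabulary term by edit distance.
function suggest($word, array $vocabulary, $maxDistance = 2) {
    $best = null;
    $bestDistance = $maxDistance + 1;
    foreach ($vocabulary as $candidate) {
        $distance = levenshtein($word, $candidate);
        if ($distance < $bestDistance) {
            $best = $candidate;
            $bestDistance = $distance;
        }
    }
    return $best; // null when nothing is close enough
}

// Illustrative vocabulary; a real one would come from the index.
$vocabulary = array('str_replace', 'strcmp', 'strlen', 'preg_replace');

$first  = suggest('strreplace', $vocabulary);
$second = suggest('str_cmp', $vocabulary);
echo "Did you mean: " . $first . " or " . $second . "\n";
```

Scanning the whole vocabulary is fine for a small list, but a real index would need something cheaper than comparing against every term.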
Results Sorting
Sorting in ZSL
$q = Zend_Search_Lucene_Search_QueryParser::
parse('search string');
$results = $index->find($q, 'title');
foreach($results as $result) {
echo '<h3>', $result->title, "</h3>\n";
$doc = getDocumentFromDB($result->did);
echo
$q->htmlFragmentHighlightMatches($doc);
}
Faceted Search
Faceted Search In Solr
$client = new SolrClient($options);
$query = new SolrQuery('bacon');
// facets must be requested before the query is sent
$query->setFacet(true);
$query->addFacetField('cat');
$response = $client->query($query);
$r = $response->getResponse();
$f = $r['facet_counts']['facet_fields'];
foreach($f['cat'] as $facet => $count) {
    echo $facet . " " . $count . "\n";
}
More Like This
More Like This
$rset = new XapianRset();
$rset->add_document(5959); // str_replace
$e = $enquire->get_eset(40, $rset);
$t = $e->begin();
for($t; !$t->equals($e->end()); $t->next()){
$qs[] = new XapianQuery($t->get_term(),
intval($t->get_weight()));
}
$query = new XapianQuery(
XapianQuery::OP_OR, $qs);
More Like This Example
php xapsim.php
1656 results found:
1: 100% docid=5959
[phpdocs/html/function.str-replace.html]
2: 47% docid=5956
[phpdocs/html/function.str-ireplace.html]
3: 24% docid=5328
[phpdocs/html/function.preg-replace.html]
4: 18% docid=5958
[phpdocs/html/function.str-repeat.html]
Search Performance
Index Updates
[Diagram: existing docs are indexed into a Main index and new docs into a smaller Delta index; queries run against Delta and Main together; the Delta is later merged into Main.]
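A toy sketch of the main plus delta idea using plain PHP arrays: new documents land in a small delta index, queries consult both, and the delta is folded into main from time to time. The function names, array layout and data are assumptions made for illustration, not the behaviour of any particular engine.

```php
<?php
// Toy indexes: term => list of document IDs (illustrative data).
$main  = array('bacon' => array(1, 2), 'garlic' => array(4));
$delta = array('bacon' => array(6), 'cheddar' => array(5));

// Query both indexes and combine the hits.
function searchBoth(array $main, array $delta, $term) {
    $hits = array_unique(array_merge(
        isset($main[$term]) ? $main[$term] : array(),
        isset($delta[$term]) ? $delta[$term] : array()
    ));
    sort($hits); // also reindexes the array from 0
    return $hits;
}

// Fold the delta into the main index and leave an empty delta behind.
function mergeDelta(array $main, array &$delta) {
    foreach ($delta as $term => $docIDs) {
        $existing = isset($main[$term]) ? $main[$term] : array();
        $merged = array_unique(array_merge($existing, $docIDs));
        sort($merged);
        $main[$term] = $merged;
    }
    $delta = array();
    return $main;
}

$before = searchBoth($main, $delta, 'bacon');
$main   = mergeDelta($main, $delta);
$after  = searchBoth($main, $delta, 'bacon');
```

Results are the same before and after the merge; the point of the split is that only the small delta has to be rebuilt when new documents arrive.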
Search Speed
Zend Search Lucene:
$index = Zend_Search_Lucene::open('index');
$index->optimize();
Sphinx:
indexer --merge main delta --rotate
Solr:
$client = new SolrClient($options);
$client->optimize();
Xapian:
xapian-compact xapindex xapindex2
Distributing Search
[Diagram: documents are split across several indexes, and the application queries all of them.]
Large Scale Search
http://www.nutch.org
http://hadoop.apache.org
Image Credits
Title http://www.flickr.com/photos/generated/2084287794/
What Do You Want http://www.flickr.com/photos/the_justified_sinner/2498066986/
You Are Here http://www.flickr.com/photos/alecvuijlsteke/2692475420/
Integrating Search http://www.flickr.com/photos/squeaks2569/3700355684/
Sphinx http://www.flickr.com/photos/generated/2084287794/
Lucene http://www.flickr.com/photos/mypanda/7731447/
Swish-e http://www.flickr.com/photos/ryan_fung/2239687100/
Solr http://www.flickr.com/photos/m-j-s/2724756177/
Xapian http://www.flickr.com/photos/olibac/3522056495/
Using Search http://www.flickr.com/photos/eneas/175027945/
Improving Search http://www.flickr.com/photos/x-ray_delta_one/3928200642/
Search Performance http://www.flickr.com/photos/maisonbisson/1634408/
Large Scale Search http://www.flickr.com/photos/zedzap/3663508847/
Questions?
Thank You!
Ian Barber
@ianbarber
http://phpir.com
ian@ibuildings.com
Friday, 29 October 2010

Contenu connexe

Similaire à In Search Of: Integrating Site Search (PHP Barcelona)

Bring Your Own Policy: Internet Use/BYOD Policy by consensus
Bring Your Own Policy:  Internet Use/BYOD Policy by consensus Bring Your Own Policy:  Internet Use/BYOD Policy by consensus
Bring Your Own Policy: Internet Use/BYOD Policy by consensus Michael Scheidell
 
Techwards uploadfile updated changes
Techwards uploadfile updated changesTechwards uploadfile updated changes
Techwards uploadfile updated changeselastica 123
 
Epsilon.pptx
Epsilon.pptxEpsilon.pptx
Epsilon.pptxOstoor
 
4.3 mixed scheme dark version
4.3 mixed scheme   dark version4.3 mixed scheme   dark version
4.3 mixed scheme dark versionhamza bekkali
 
Five Typography Tips for Better UX
Five Typography Tips for Better UXFive Typography Tips for Better UX
Five Typography Tips for Better UXMelissa Eggleston
 
State of the Art Presentation Templates- Compilation 5
State of the Art Presentation Templates- Compilation 5State of the Art Presentation Templates- Compilation 5
State of the Art Presentation Templates- Compilation 5Manish Parsuramka
 
Stark PowerPoint Template
Stark PowerPoint TemplateStark PowerPoint Template
Stark PowerPoint TemplateDenny Nugroho
 
Running head KONY 2017 SAMPLE TEMPLATE .docx
Running head KONY 2017 SAMPLE TEMPLATE                         .docxRunning head KONY 2017 SAMPLE TEMPLATE                         .docx
Running head KONY 2017 SAMPLE TEMPLATE .docxcowinhelen
 
week3_garst_107357_mockupv1
week3_garst_107357_mockupv1week3_garst_107357_mockupv1
week3_garst_107357_mockupv1Ashley Garst
 
Vision - Mission Business Template.pptx
Vision - Mission Business Template.pptxVision - Mission Business Template.pptx
Vision - Mission Business Template.pptxAjay Gangakhedkar
 
16.9 mixed scheme dark version
16.9 mixed scheme   dark version16.9 mixed scheme   dark version
16.9 mixed scheme dark versionhamza bekkali
 
16.9 mixed scheme dark version
16.9 mixed scheme   dark version16.9 mixed scheme   dark version
16.9 mixed scheme dark versionhamza bekkali
 

Similaire à In Search Of: Integrating Site Search (PHP Barcelona) (20)

Bring Your Own Policy: Internet Use/BYOD Policy by consensus
Bring Your Own Policy:  Internet Use/BYOD Policy by consensus Bring Your Own Policy:  Internet Use/BYOD Policy by consensus
Bring Your Own Policy: Internet Use/BYOD Policy by consensus
 
Techwards uploadfile updated changes
Techwards uploadfile updated changesTechwards uploadfile updated changes
Techwards uploadfile updated changes
 
Epsilon.pptx
Epsilon.pptxEpsilon.pptx
Epsilon.pptx
 
4.3 mixed scheme
4.3 mixed scheme4.3 mixed scheme
4.3 mixed scheme
 
4.3 mixed scheme dark version
4.3 mixed scheme   dark version4.3 mixed scheme   dark version
4.3 mixed scheme dark version
 
4.3 blue scheme
4.3 blue scheme4.3 blue scheme
4.3 blue scheme
 
4.3 red scheme
4.3 red scheme4.3 red scheme
4.3 red scheme
 
Newspaper
NewspaperNewspaper
Newspaper
 
Five Typography Tips for Better UX
Five Typography Tips for Better UXFive Typography Tips for Better UX
Five Typography Tips for Better UX
 
State of the Art Presentation Templates- Compilation 5
State of the Art Presentation Templates- Compilation 5State of the Art Presentation Templates- Compilation 5
State of the Art Presentation Templates- Compilation 5
 
Stark PowerPoint Template
Stark PowerPoint TemplateStark PowerPoint Template
Stark PowerPoint Template
 
Running head KONY 2017 SAMPLE TEMPLATE .docx
Running head KONY 2017 SAMPLE TEMPLATE                         .docxRunning head KONY 2017 SAMPLE TEMPLATE                         .docx
Running head KONY 2017 SAMPLE TEMPLATE .docx
 
week3_garst_107357_mockupv1
week3_garst_107357_mockupv1week3_garst_107357_mockupv1
week3_garst_107357_mockupv1
 
Pitch Deck Premium Classic
Pitch Deck Premium ClassicPitch Deck Premium Classic
Pitch Deck Premium Classic
 
Vision - Mission Business Template.pptx
Vision - Mission Business Template.pptxVision - Mission Business Template.pptx
Vision - Mission Business Template.pptx
 
16.9 mixed scheme dark version
16.9 mixed scheme   dark version16.9 mixed scheme   dark version
16.9 mixed scheme dark version
 
16.9 blue scheme
16.9 blue scheme16.9 blue scheme
16.9 blue scheme
 
16.9 mixed scheme dark version
16.9 mixed scheme   dark version16.9 mixed scheme   dark version
16.9 mixed scheme dark version
 
16.9 mixed scheme
16.9 mixed scheme16.9 mixed scheme
16.9 mixed scheme
 
16.9 blue scheme
16.9 blue scheme16.9 blue scheme
16.9 blue scheme
 

Plus de Ian Barber

How to stand on the shoulders of giants
How to stand on the shoulders of giantsHow to stand on the shoulders of giants
How to stand on the shoulders of giantsIan Barber
 
ZeroMQ: Messaging Made Simple
ZeroMQ: Messaging Made SimpleZeroMQ: Messaging Made Simple
ZeroMQ: Messaging Made SimpleIan Barber
 
Teaching Your Machine To Find Fraudsters
Teaching Your Machine To Find FraudstersTeaching Your Machine To Find Fraudsters
Teaching Your Machine To Find FraudstersIan Barber
 
ZeroMQ Is The Answer: PHP Tek 11 Version
ZeroMQ Is The Answer: PHP Tek 11 VersionZeroMQ Is The Answer: PHP Tek 11 Version
ZeroMQ Is The Answer: PHP Tek 11 VersionIan Barber
 
Debugging: Rules And Tools - PHPTek 11 Version
Debugging: Rules And Tools - PHPTek 11 VersionDebugging: Rules And Tools - PHPTek 11 Version
Debugging: Rules And Tools - PHPTek 11 VersionIan Barber
 
ZeroMQ Is The Answer: DPC 11 Version
ZeroMQ Is The Answer: DPC 11 VersionZeroMQ Is The Answer: DPC 11 Version
ZeroMQ Is The Answer: DPC 11 VersionIan Barber
 
ZeroMQ Is The Answer
ZeroMQ Is The AnswerZeroMQ Is The Answer
ZeroMQ Is The AnswerIan Barber
 
Deployment Tactics
Deployment TacticsDeployment Tactics
Deployment TacticsIan Barber
 
Debugging: Rules & Tools
Debugging: Rules & ToolsDebugging: Rules & Tools
Debugging: Rules & ToolsIan Barber
 
In Search Of... (Dutch PHP Conference 2010)
In Search Of... (Dutch PHP Conference 2010)In Search Of... (Dutch PHP Conference 2010)
In Search Of... (Dutch PHP Conference 2010)Ian Barber
 
In Search Of... integrating site search
In Search Of... integrating site search In Search Of... integrating site search
In Search Of: Integrating Site Search (PHP Barcelona)

  • 1. In Search Of... Ian Barber @ianbarber http://phpir.com ian@ibuildings.com integrating site search Friday, 29 October 2010
  • 2. How Search Works, Integrating Search, Improving Results, Using Search, Search Performance, Questions
  • 5. Tokenisation: “With AT&T’s help, the F.B.I Miami-Dade office had recovered $1.1 million from O’Healy’s Ponzi scheme, 10-15% more than expected.”
  • 6. PHP Tokenisation
      function tokenise($string) {
          $string = strtolower($string);
          // \w+ matches each run of word characters; PREG_OFFSET_CAPTURE
          // returns every token as a (token, offset) pair
          preg_match_all('/\w+/', $string,
              $matches, PREG_OFFSET_CAPTURE);
          return $matches[0];
      }
  • 7. Document Term Pairs
      Document ID   Term
      1             the
      1             best
      1             of
      1             the
      ...           ...
      204           and
      204           what
      204           would
  • 8. Inverted Index
      Term    Documents
      best    1 (4, 16), 4 (422), 129 (344) ...
      what    24 (50, 98), 75 (33, 208) ...
      would   99 (32, 599), 201 (344) ...
      ...     ...
  • 9. Boolean Query Merge
      Query: Best Western Hotel
      best      1    4  129  298  305  338
      western   4   95  194  204  298  305
      hotel     2   40  200  298  355  402
      working   4  298  305
      Result: Document 298
  • 10. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed sit amet ante vitae enim elementum semper sodales quis ipsum. Aliquam vel condimentum neque. Curabitur ornare feugiat ornare. Donec consectetur elit metus. Nulla eleifend tincidunt massa et euismod. Vestibulum vestibulum, justo vel egestas elementum, purus enim ornare quam, vel gravida est enim vel nibh. Nam non eros nisi, eget fringilla justo. Fusce vel risus vitae mauris vehicula facilisis sit amet in mi. Nulla ut turpis id felis sollicitudin dictum sed non ipsum. Praesent ut risus nulla, sed blandit leo. Curabitur volutpat laoreet lacus, ut consectetur arcu vestibulum vel. Donec dapibus fringilla arcu, et semper lacus egestas non. Quisque eu purus ut lacus egestas dapibus. Integer in velit id est dictum bibendum in id mi.
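  • 11. TF-IDF
      // $term is the posting list for one term: docID => list of positions
      function getWeight($docID, $term, $total) {
          $tf = count($term[$docID]);               // occurrences in this document
          $idf = log($total / count($term), 2);     // rarity across all documents
          return $tf * $idf;
      }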
  • 12. Document Vector
              socket  what  heavy  steel  ...
      Doc 1   0.02    0.3   0.001  0      ...
      Doc 2   0       0     0      0      ...
      Doc 3   0.001   0.2   0      0      ...
      Doc 4   0       0     0.002  0.003  ...
  • 13. Ranked Query Merge
      best      23     42     179    246    333    703
      weight    0.008  0.002  0.023  0.039  0.014  0.001
      western   42     88     120    179    246    798
      weight    0.003  0.004  0.023  0.001  0.034  0.004
      1 - 246: 0.073
      2 - 179: 0.024
      3 - 120: 0.023
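  • 14. PHP Similarity
      function score($queryString, $index) {
          $query = tokenise($queryString);
          $matches = array();
          foreach($query as $qterm) {
              // tokenise() returns (token, offset) pairs; element 0 is the term
              $term = $qterm[0];
              if(!isset($index[$term])) {
                  continue;
              }
              foreach($index[$term] as $id => $posting) {
                  if(!isset($matches[$id])) {
                      $matches[$id] = 0;
                  }
                  $matches[$id] += $posting['score'];
              }
          }
          // arsort() sorts in place and returns a boolean, so return the array
          arsort($matches);
          return $matches;
      }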
  • 16. MySQL Full Text Search
      CREATE TABLE example (
        id INT(11) NOT NULL auto_increment,
        title VARCHAR(255),
        content TEXT,
        PRIMARY KEY(id),
        FULLTEXT(title,content)
      ) Engine=MyISAM;

      INSERT INTO example (title, content) VALUES
        ('Mikko & Bacon','Mikko loves bacon'),
        ('Marcello & Bacon','Marcello hates bacon'),
        ('Jo & Sausages','Johanna loves sausages'),
        ('Hollywood & Garlic','Lorenzo hates garlic'),
        ('James & Cheddar','James is keen on cheeses');
  • 17. MySQL FTI Query
      SELECT * FROM example
      WHERE MATCH(title,content) AGAINST('loves bacon');
      +----+------------------+------------------------+
      | id | title            | content                |
      +----+------------------+------------------------+
      |  1 | Mikko & Bacon    | Mikko loves bacon      |
      |  2 | Marcello & Bacon | Marcello hates bacon   |
      |  3 | Jo & Sausages    | Johanna loves sausages |
      +----+------------------+------------------------+
      3 rows in set (0.00 sec)
  • 19. Sphinx Configuration
      source posts {
        type = mysql
        sql_host = localhost
        sql_user = user
        sql_pass = password
        sql_db = search
        sql_query = SELECT id, title, content FROM example;
        sql_attr_multi = uint tag from query; SELECT example_id, tag_id FROM tags;
      }
  • 20. index posts {
        source = posts
        path = /var/data/sphinx/example
        morphology = stem_en
        min_word_len = 3
        min_prefix_len = 3
        min_infix_len = 0
        enable_star = 1
      }
  • 22. Command Line Searching
      indexer --config /etc/sphinx.conf --all
      search --config /etc/sphinx.conf love bacon
      displaying matches:
      1. document=1, weight=3, tag=(1,2)
          id=1
          title=Mikko & Bacon
          content=Mikko loves bacon
      words:
      1. 'love': 2 documents, 2 hits
      2. 'bacon': 2 documents, 4 hits
      searchd --config /etc/sphinx.conf
  • 23. Sphinx From PHP
      $cl = new SphinxClient();
      $cl->SetServer('localhost', 3312);
      $cl->SetMatchMode(SPH_MATCH_ANY);
      $result = $cl->Query('bac*');
      $docIDs = array_keys($result["matches"]);

      $cl->SetFilter('tag', array(1));
      $result = $cl->Query('bac*');
      $docIDs = array_keys($result["matches"]);
  • 25. Filesystem Index With Swish-E
      fs-swish-e.conf:
        IndexDir /var/data/documents
        IndexFile fs-swish-e.index
        IndexOnly .doc .docx .pdf
        FuzzyIndexingMode Stemming_en1
        FileFilter .pdf /usr/local/bin/swish_filter.pl
        FileFilter .doc /usr/local/bin/swish_filter.pl
      /usr/local/bin/swish-e -S fs -c fs-swish-e.conf
  • 26. Crawling Content
      www-swish-e.conf:
        IndexDir /usr/local/lib/swish-e/spider.pl
        IndexFile www-swish-e.index
        SwishProgParameters default http://phpir.com/
        FuzzyIndexingMode Stemming_en1
        DefaultContents HTML
      /usr/local/bin/swish-e -S prog -c www-swish-e.conf
  • 27. Swish-E With Multiple Indices
      $swish = new Swish(
          'www-swish-e.index fs-swish-e.index'
      );
      $search = $swish->prepare();
      $queryStr = 'search string goes here';
      $result = $search->execute($queryStr);
      $total = $result->hits;
      while($r = $result->nextResult()) {
          echo $r->swishdocpath; // url
      }
  • 29. Build Index
      $index = Zend_Search_Lucene::create('idx');
      foreach($documents as $title => $content) {
          $doc = new Zend_Search_Lucene_Document();
          $doc->addField(
              Zend_Search_Lucene_Field::Text(
                  'title', $title));
          $doc->addField(
              Zend_Search_Lucene_Field::UnStored(
                  'content', $content));
          $index->addDocument($doc);
      }
  • 30. Query Zend Search Lucene
      $results = $index->find('loves bacon');
      foreach($results as $result) {
          echo $result->score, " ";
          echo $result->title, "\n";
      }
      Output:
      0.81656279309067 Mikko and Bacon
      0.24800278854758 Marcello & Bacon
  • 31. Index HTML
      $file = file_get_contents($url);
      $doc = Zend_Search_Lucene_Document_Html::
          loadHTML($file);
      $doc->addField(
          Zend_Search_Lucene_Field::Text(
              'url', $url));
      $index->addDocument($doc);
  • 33. Solr Search Index
      $options = array(
          'hostname' => 'localhost',
          'port' => 8983
      );
      $client = new SolrClient($options);
      $doc = new SolrInputDocument();
      $doc->addField('id', $id);
      $doc->addField('cat', $category);
      $doc->addField('title', $title);
      $doc->addField('text', $text);
      $response = $client->addDocument($doc);
      $client->commit();
  • 34. Solr Search Client
      $client = new SolrClient($options);
      $query = new SolrQuery('bacon');
      $response = $client->query($query);
      $r = $response->getResponse();
      foreach($r['response']['docs'] as $d) {
          echo $d->title[0] . "\n";
      }
  • 36. Xapian In PHP
      $db = new XapianWritableDatabase(
          'idx', Xapian::DB_CREATE_OR_OPEN);
      $i = new XapianTermGenerator();
      $i->set_stemmer(new XapianStem("english"));
      $doc = new XapianDocument();
      $doc->set_data($content);
      $doc->add_value(1, $title);
      $i->set_document($doc);
      $i->index_text($content);
      $db->add_document($doc);
  • 37. Xapian Search In PHP
      $database = new XapianDatabase('idx');
      $enquire = new XapianEnquire($database);
      $qp = new XapianQueryParser();
      $qp->set_stemmer(new XapianStem("english"));
      $qp->set_database($database);
      $qp->set_stemming_strategy(
          XapianQueryParser::STEM_SOME);
      $query = $qp->parse_query($queryString);
      $enquire->set_query($query);
  • 38. $matches = $enquire->get_mset(0, 10);
      $i = $matches->begin();
      while(!$i->equals($matches->end())) {
          $n = $i->get_rank() + 1;
          $data = $i->get_document()->get_data();
          $title = $i->get_document()->get_value(1);
          $score = $i->get_percent();
          $i->next();
      }
  • 41. Parse Anchor Text
      $p = file_get_contents('http://phpir.com');
      libxml_use_internal_errors(true);
      // loadHTML() is an instance method, so create the document first
      $dom = new DOMDocument();
      $dom->loadHTML($p);
      $links = $dom->getElementsByTagName('a');
      foreach($links as $link) {
          $href = $link->getAttribute('href');
          $text = $link->nodeValue;
      }
  • 43. ZSL Zone Weighting
      $doc = new Zend_Search_Lucene_Document();
      $tfield = Zend_Search_Lucene_Field::Text
          ('title', $title);
      $tfield->boost = 1.3;
      $doc->addField($tfield);
      $doc->addField(
          Zend_Search_Lucene_Field::UnStored
              ('content', $content));
      $index->addDocument($doc);
  • 45. Document Weights in ZSL
      $doc = new Zend_Search_Lucene_Document();
      $doc->addField(
          Zend_Search_Lucene_Field::Text
              ('title', $title));
      $doc->addField(
          Zend_Search_Lucene_Field::UnStored
              ('content', $content));
      $doc->boost = 1 + ($numComments / 100);
      $index->addDocument($doc);
  • 48. Sphinx Extract & Highlight
      $cl = new SphinxClient();
      $cl->SetServer( "localhost", 3312 );
      $q = 'bacon';
      $r = $cl->Query($q);
      $text = array();
      foreach ($r["matches"] as $doc => $info) {
          $text[$doc] = getTextFromDB($doc);
      }
      $extracts = $cl->BuildExcerpts($text, 'posts', $q);
      foreach($extracts as $extract) {
          echo $extract;
      }
  • 50. Xapian Spelling Correction
      Indexer:
        $indexer = new XapianTermGenerator();
        $indexer->set_database($database);
        $indexer->set_flags(
            XapianTermGenerator::FLAG_SPELLING);
      Searcher:
        $queryString = "strreplace or str_cmp";
        $q = new XapianQueryParser();
        $q->set_database($database);
        $query = $q->parse_query($queryString,
            XapianQueryParser::FLAG_SPELLING_CORRECTION);
        echo "Did you mean: " .
            $q->get_corrected_query_string() . "\n";
  • 51. Spelling Correction Output
      php xapsearch.php
      Did you mean: str_replace or strcmp
      4644 results found for “strreplace or str_cmp”:
      1: 2% docid=572 [phpdocs/html/cc.license.html]
      2: 2% docid=7169 [phpdocs/html/imagick.constants.html]
      3: 2% docid=10086 [phpdocs/html/sqlite3result.fetcharray.html]
      4: 2% docid=6132 [phpdocs/html/function.swf-posround.html]
  • 53. Sorting in ZSL
      $q = Zend_Search_Lucene_Search_QueryParser::
          parse('search string');
      $results = $index->find($q, 'title');
      foreach($results as $result) {
          echo '<h3>', $result->title, "</h3>\n";
          $doc = getDocumentFromDB($result->did);
          echo $q->htmlFragmentHighlightMatches($doc);
      }
  • 55. Faceted Search In Solr
      $client = new SolrClient($options);
      $query = new SolrQuery('bacon');
      // Facet options must be set before the query is sent
      $query->setFacet(true);
      $query->addFacetField('cat');
      $response = $client->query($query);
      $r = $response->getResponse();
      $f = $r['facet_counts']['facet_fields'];
      foreach($f['cat'] as $facet => $count) {
          echo $facet . " " . $count . "\n";
      }
  • 56. More Like This
  • 57. More Like This
      $rset = new XapianRset();
      $rset->add_document(5959); // str_replace
      $e = $enquire->get_eset(40, $rset);
      $t = $e->begin();
      for($t; !$t->equals($e->end()); $t->next()){
          $qs[] = new XapianQuery($t->get_term(),
              intval($t->get_weight()));
      }
      $query = new XapianQuery(
          XapianQuery::OP_OR, $qs);
  • 58. More Like This Example
      php xapsim.php
      1656 results found:
      1: 100% docid=5959 [phpdocs/html/function.str-replace.html]
      2: 47% docid=5956 [phpdocs/html/function.str-ireplace.html]
      3: 24% docid=5328 [phpdocs/html/function.preg-replace.html]
      4: 18% docid=5958 [phpdocs/html/function.str-repeat.html]
  • 60. Index Updates (diagram; box labels: Docs, New, Main, Delta, Query)
  • 61. Search Speed
      Zend Search Lucene:
        $index = Zend_Search_Lucene::open('index');
        $index->optimize();
      Sphinx:
        indexer --merge main delta --rotate
      Solr:
        $client = new SolrClient($options);
        $client->optimize();
      Xapian:
        xapian-compact xapindex xapindex2
  • 64. Image Credits
      Title: http://www.flickr.com/photos/generated/2084287794/
      What Do You Want: http://www.flickr.com/photos/the_justified_sinner/2498066986/
      You Are Here: http://www.flickr.com/photos/alecvuijlsteke/2692475420/
      Integrating Search: http://www.flickr.com/photos/squeaks2569/3700355684/
      Sphinx: http://www.flickr.com/photos/generated/2084287794/
      Lucene: http://www.flickr.com/photos/mypanda/7731447/
      Swish-e: http://www.flickr.com/photos/ryan_fung/2239687100/
      Solr: http://www.flickr.com/photos/m-j-s/2724756177/
      Xapian: http://www.flickr.com/photos/olibac/3522056495/
      Using Search: http://www.flickr.com/photos/eneas/175027945/
      Improving Search: http://www.flickr.com/photos/x-ray_delta_one/3928200642/
      Search Performance: http://www.flickr.com/photos/maisonbisson/1634408/
      Large Scale Search: http://www.flickr.com/photos/zedzap/3663508847/
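
Slides 7 to 9 go from document/term pairs to an inverted index and then to a Boolean query merge. The steps can be sketched in a few lines of PHP. This is only an illustration under stated assumptions: the function names buildIndex and booleanAnd and the three sample documents are invented here, and the tokeniser is a simplified version of the one on slide 6 that drops the offset capture.

```php
<?php
// Lower-case the text and split it into word tokens.
function tokenise(string $text): array
{
    preg_match_all('/\w+/', strtolower($text), $matches);
    return $matches[0];
}

// Build an inverted index: term => [docID => [token positions]].
function buildIndex(array $documents): array
{
    $index = [];
    foreach ($documents as $docId => $text) {
        foreach (tokenise($text) as $position => $term) {
            $index[$term][$docId][] = $position;
        }
    }
    return $index;
}

// Boolean AND merge: keep only documents found in every term's posting list.
function booleanAnd(array $index, string $query): array
{
    $result = null;
    foreach (tokenise($query) as $term) {
        $docIds = array_keys($index[$term] ?? []);
        $result = $result === null
            ? $docIds
            : array_values(array_intersect($result, $docIds));
    }
    return $result ?? [];
}

// Hypothetical sample documents, not taken from the slides.
$documents = [
    1 => 'The best of the west',
    2 => 'Best Western hotel in Barcelona',
    3 => 'A western hotel',
];
$index = buildIndex($documents);
$hits = booleanAnd($index, 'Best Western Hotel');
print_r($hits); // only document 2 contains all three terms
```

The running intersection in booleanAnd plays the role of the "working" row on slide 9: it shrinks as each further term is merged in.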
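
Slides 11 to 14 weight each term with TF-IDF and rank documents by summing the weights of the query terms they contain. A minimal sketch of that ranked merge follows, assuming a positional index shaped like the one on slide 8. The names tfIdf and rankDocuments and the sample index are hypothetical; unlike slide 14, the weights are computed at query time instead of being read from a stored 'score' field.

```php
<?php
// TF-IDF weight of one term in one document.
// $postings is the posting list for a single term: docID => [positions].
function tfIdf(array $postings, int $docId, int $totalDocs): float
{
    $tf = count($postings[$docId] ?? []);          // occurrences in the document
    $idf = log($totalDocs / count($postings), 2);  // rarer terms weigh more
    return $tf * $idf;
}

// Ranked merge: add up each query term's weight per document, best first.
function rankDocuments(array $index, array $queryTerms, int $totalDocs): array
{
    $scores = [];
    foreach ($queryTerms as $term) {
        foreach (array_keys($index[$term] ?? []) as $docId) {
            $scores[$docId] = ($scores[$docId] ?? 0.0)
                + tfIdf($index[$term], $docId, $totalDocs);
        }
    }
    arsort($scores); // sorts in place, highest score first, keeping docID keys
    return $scores;
}

// Hypothetical index over a collection of 4 documents.
$index = [
    'best'    => [1 => [4, 16], 2 => [7]],
    'western' => [2 => [8], 3 => [2]],
    'hotel'   => [2 => [9]],
];
$ranked = rankDocuments($index, ['best', 'western', 'hotel'], 4);
print_r($ranked);
```

Document 2 ranks first here because it matches all three terms, including "hotel", which appears in only one document and so carries the largest IDF.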
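
Slides 60 and 61 suggest a main-plus-delta scheme for index updates: new documents go into a small delta index, queries consult both indices, and the delta is periodically merged into the main index (as with indexer --merge main delta --rotate in Sphinx). The sketch below illustrates the idea with plain PHP arrays. It is an assumption about how the pattern might look, not the behaviour of any of the engines in the talk, and the function names are invented.

```php
<?php
// Add a document's distinct terms to an index of the form term => [docID, ...].
function addDocument(array &$index, int $docId, string $text): void
{
    preg_match_all('/\w+/', strtolower($text), $matches);
    foreach (array_unique($matches[0]) as $term) {
        $index[$term][] = $docId;
    }
}

// Query both indices and combine the posting lists for a single term.
function searchBoth(array $main, array $delta, string $term): array
{
    $term = strtolower($term);
    $ids = array_merge($main[$term] ?? [], $delta[$term] ?? []);
    sort($ids);
    return $ids;
}

// Fold the delta into the main index and start again with an empty delta.
function mergeDelta(array &$main, array &$delta): void
{
    foreach ($delta as $term => $docIds) {
        $main[$term] = array_merge($main[$term] ?? [], $docIds);
    }
    $delta = [];
}

$main = [];
$delta = [];
addDocument($main, 1, 'Mikko loves bacon');      // indexed in the last full build
addDocument($delta, 2, 'Marcello hates bacon');  // added since then

$before = searchBoth($main, $delta, 'bacon');
mergeDelta($main, $delta);
$after = searchBoth($main, $delta, 'bacon');
print_r($before);
print_r($after);
```

The point of the design is that the expensive full index is rebuilt rarely, while the small delta can be rebuilt often; search results are the same before and after the merge.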