SearchMonkey

Presentation by:

Paul Tarjan, Chief Technical Monkey
(ptarjan@yahoo-inc.com)

Online at:

http://www.slideshare.net/ptarjan/searchmonkey-presentation

http://developer.yahoo.com/searchmonkey

What is SearchMonkey?

an open platform for using structured data to build more
useful and relevant search results

[Figure: a search result before and after SearchMonkey]


Enhanced Result: Zagat

[Figure labels: image, links, key/value pairs or abstract]


Infobar: Wikipedia Preview

[Figure labels: summary, blob]


Part of the puzzle


Vocabularies

• Need to speak the same language
• I like to see girls of that... caliber.
• English, French, Spanish, Esperanto?
• URLs to the rescue
– Dublin Core (http://purl.org/dc/elements/1.1/)
– Friend of a Friend (http://xmlns.com/foaf/0.1/)
– XHTML Friends Network (http://gmpg.org/xfn/11/)
– … (many more)


Syntax

• Nouns, Verbs, and Adjectives, oh my!
• All phrases become lots of triples
• (Subject, Verb / Adj. / Prep. / etc, Object)
• Key / Value pairs ++
– Everything is a URL or String
– Subject doesn’t have to be the document


Syntax 2

• Key / Value pair
– Title = Awesome SearchMonkey Presentation
– Homepage = http://search.yahoo.com/searchmonkey
• Triples
– (self, http://purl.org/dc#title, “Awesome SearchMonkey Presentation”)
– (self, http://vcard#url, http://search.yahoo.com/searchmonkey)


Decompose to triples

• I like to eat red candy
– (self, http://example.com/likeEating, http://example.org/temp/redcandy)
– (http://example.org/temp/redcandy, http://example.com/isColored, http://example.org/colors/red)
– (http://example.org/temp/redcandy, http://example.com/isInstanceOf, http://example.org/food/candy)
• Unnamed nodes are O.K.
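
The decomposition above can be sketched in a few lines of Python. This is only an illustration, not part of SearchMonkey: the `SELF` placeholder, the tuple layout, and the `objects_of` helper are assumptions made here; the example.com and example.org URLs are the ones from the slide.

```python
# A triple is (subject, predicate, object); every part is a URL or a string.
SELF = "self"  # assumed placeholder for the current document, as on the slide
CANDY = "http://example.org/temp/redcandy"

triples = [
    (SELF, "http://example.com/likeEating", CANDY),
    (CANDY, "http://example.com/isColored", "http://example.org/colors/red"),
    (CANDY, "http://example.com/isInstanceOf", "http://example.org/food/candy"),
]


def objects_of(subject, predicate, store):
    """Return every object linked to `subject` by `predicate` (hypothetical helper)."""
    return [o for (s, p, o) in store if s == subject and p == predicate]


print(objects_of(CANDY, "http://example.com/isColored", triples))
```

Note that the subject of the last two triples is the candy, not the document, which is the point of "Subject doesn't have to be the document".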


How to get data to SearchMonkey?

Humans see:
• name
• picture of a person
• current job
• industry, …

Computers see:
an undifferentiated blob of HTML

Can we make computers smarter?


Artificial intelligence is hard. Plus …


How does it work?

1. Site owners/publishers share structured data with Yahoo!.
2. Site owners & third-party developers build SearchMonkey apps.
3. Consumers customize their search experience with Enhanced Results or Infobars.

[Diagram labels: Acme.com’s web pages, page extraction, RDF/Microformat markup, index, DataRSS feed, web services, Acme.com’s database]


Innards of SearchMonkey

• You build a web service inside our framework
• When a search page renders
– We check which SM apps are enabled
– We call them
• 50ms for in-page
• Long time for AJAX
– They return data in our template
– We render them (and cache)


Inside SM

[Diagram labels: Developer, Developer, Publisher]


Data Sources: RDF and Microformats

Name           Cached  Open  Mode     Notes
Yahoo! Index   yes     yes   Passive  Old-school Y! Index data
RDFa, eRDF     yes     yes   Passive  Vocab + markup decoupled
Microformats   yes     yes   Passive  Vocab + markup coupled
DataRSS feed   yes     no    Active   Atom + metadata
XSLT           no      no    Active   Good for prototyping
Web Service    no      no    Active   Brings in remote data


Approach #1: Embedded RDF

• Cached data
– allows Enhanced Results
– but not for dynamic data
• Reuse existing markup
– but requires site redesign
• Open approach
– everyone can use
• Passive, crawled by Y!
– less bureaucracy to set up

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:foaf="http://xmlns.com/foaf/0.1/"
    lang="en" xml:lang="en">
  <head>
    <title>The Amazing Home Page of Joe Smith</title>
  </head>
  <body>
    <h1 property="dc:title">Joe's Home Page</h1>
    <div rel="foaf:maker">
      <h2 property="foaf:name">Joe Smith</h2>
      <div rel="foaf:depiction"
          resource="http://joesmith.org/images/jsmith.png">
        <img src="/images/jsmith.png"
            alt="Smiling headshot of Joe" />
        <p property="dc:rights">Creative Commons
            Attribution 3.0 Unported</p>
      </div>
    </div>
    …


Approach #2: Embedded Microformats

• Cached data
• Reuse existing markup
– but requires site redesign
• Open approach
– everyone can use
• Passive, crawled by Y!
– less bureaucracy to set up

<div id="hcard-Joe-Smith" class="vcard">
  <span class="fn">Joe Smith</span>
  <div class="adr">
    <div class="street-address">123 Murphy Avenue</div>
    <span class="locality">Sunnyvale</span>,
    <span class="region">California</span>
    <span class="postal-code">94086</span>
  </div>
  <div class="tel">(408) 555-1234</div>
</div>…


Approach #3: DataRSS Feed

• Cached data
• Generate feed from DB
– and maintain afterwards
• Closed approach
– only Yahoo! gets data
• Actively provide a feed
– coordinate with Yahoo! to set up

<?profile http://search.yahoo.com/searchmonkey-profile ?>
<feed xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2005/Atom ../xsd/datarss.xsd"
    xmlns:dc="http://purl.org/dc/terms/" xmlns="http://www.w3.org/2005/Atom"
    xmlns:commerce="http://search.yahoo.com/searchmonkey/commerce/"
    xmlns:y="http://search.yahoo.com/datarss/">
  <id>http://local.yahoo.com/datarss/</id>
  <author><name>Peter Mika (pmika@yahoo-inc.com)</name></author>
  <title>Example data feed for Local</title>
  <updated>2008-07-16T04:05:06+07:00</updated>
  <entry>
    <title>Parcel 104</title>
    <id>http://local.yahoo.com/info-21583016-parcel-104-santa-clara</id>
    <updated>2008-07-16T04:05:06+07:00</updated>
    <content type="application/xml">
      <y:adjunct version="1.0" name="com.yahoo.local">
        <y:item rel="dc:subject">
          <y:type typeof="vcard:VCard commerce:Restaurant">
            <y:meta property="commerce:hoursOfOperation">
              Breakfast daily, Lunch Mon.-Fri., Dinner Mon.-Sat.


Approach #4: Extract with XSLT

• Generally not cached
– too slow, infobar only
– but good for dynamic data
• Scrape page with XSLT
– operates on cleaned up version of the DOM
– watch out for template changes
• Easy to prototype

<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:template match="/">
    <adjunctcontainer>
      <adjunct id="smid:{$smid}" version="1.0">
        <item rel="rel:Photo"
            resource="{//div[@class='hresume']//div[@class='image']/img/@src}"/>
        <item rel="rel:Card">
          <meta property="vcard:fn">
            <xsl:value-of select="//div[@class='hresume']//span[contains(@class,'fn')]"/>
          </meta>
          <meta property="vcard:title">
            <xsl:value-of select="//div[@class='hresume']//ul[@class='current']/li"/>
          </meta>
        </item>
      </adjunct>
    </adjunctcontainer>
  </xsl:template>
</xsl:stylesheet>


Prototyping with XSLT

• What if I don’t have structured data?
– I don’t own the site
– I do own the site, but I want to prototype first
• Build an XSLT custom data service first
– Write some XSLT to extract the data and transform it into DataRSS
– Mostly about finding the right XPath (use Firebug or XPather)
– Quick to implement, but brittle
– Can’t do a good Enhanced Result


Approach #5: Call a Web Service

• Generally not cached
– too slow, infobar only
– but good for dynamic data
• Call a Remote Web Service
– allows SearchMonkey apps to glue together
– can handle OpenSearch XML natively

<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
    xmlns:h="http://www.w3.org/1999/xhtml"
    xmlns:y="urn:yahoo:srch"
    xsi:schemaLocation="urn:yahoo:srch
        http://api.search.yahoo.com/SiteExplorerService/V1/PageDataResponse.xsd">
  <xsl:template match="/">
    <adjunctcontainer xmlns:my="http://example.com/ns/1.0">
      <adjunct id="smid:{$smid}" version="1.0">
        <meta property="my:link1">
          <xsl:value-of select="//y:Result[1]/y:Url"/>
        </meta>
        <meta property="my:result1">
          <xsl:value-of select="//y:Result[1]/y:Title"/>
        </meta>
      </adjunct>
    </adjunctcontainer>
  </xsl:template>
</xsl:stylesheet>


Creating an Infobar

• Infobar advantages
– Annotate someone else’s site
– Use links and images from other domains
• Mash up info from multiple sites
• Affiliate / coupon links? Hmmm…
– Can act on *, all websites
• But these apps can be annoying if poorly designed

• Key design principles
– Put something useful in the summary
– Be creative with the HTML


Resources

• Main:
– http://developer.yahoo.com/searchmonkey
• Lists and forums:
– searchmonkey-developers@yahoogroups.com
– http://suggestions.yahoo.com/searchmonkey
• RDF and Microformats:
– http://microformats.org
– http://www.w3.org/TR/xhtml-rdfa-primer/


Do it for real

• Demo


Ninja Coding Techniques:
Enter the Monkey

Typical SearchMonkey PHP code

$ret['title'] = Data::get('com.yahoo.uf.hresume/dc:subject/resume:contact/vcard:title');

// Image
$ret['image']['src'] = Data::get('com.yahoo.uf.hcard/rel:Card/vcard:photo/@resource');
$ret['image']['alt'] = SMDEFAULT;
$ret['image']['title'] = SMDEFAULT;
$ret['image']['allowResize'] = true;

// Key/value pairs - up to 4
$ret['dict'][0]['key'] = "Affiliation";
$ret['dict'][0]['value'] = Data::get('com.yahoo.uf.hresume/resume:affiliation/vcard:org/vcard:organization-name');
$ret['dict'][1]['key'] = "Contact";
$ret['dict'][1]['value'] = Data::get('com.yahoo.uf.hresume/dc:subject/resume:contact/@resource');


Your first mistake may be your last!


True ninjas leave no room for error

// Get the list of businesses. If we
// get at least one, extract the
// address and telephone number
$appNodeList = Data::xpath("/*/adjunct/item[@rel='rel:Listing']");
$yd = $appNodeList->item(0);
$adr = $tel = "";
$nodeList = Data::xpath("item[@rel='rel:Business']", $yd);
if ($nodeList->length != 0) {
    $nd = $nodeList->item(0);

    $adr = Data::xpathString("meta[@property='vcard:adr']", $nd);
    $tel = Data::xpathString("meta[@property='vcard:tel']", $nd);
}
// $r_rating and $r_summary are not defined on this slide
if ($r_rating != "") {
    $ratingstr = Data::getStarsFromNum($r_rating);
    if ($r_summary != "") {
        $ratingstr = $ratingstr . " " . $r_summary;
    }
}


Useful conditional tricks

• Check for empty data like this:
– if ('' == trim($var))
• Watch out for $a.'-'.$b.'-'.$c
– What happens if these variables are empty?
• You can create helper functions!
– getOutput() must return an array, but there’s no reason not to create other functions
– Call using self::function() instead of just function()
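
The empty-variable problem is easy to see in a small sketch. The one below is in Python rather than the app's PHP, purely for illustration; `join_nonempty` is a hypothetical helper name, not a SearchMonkey function.

```python
def join_nonempty(parts, sep="-"):
    """Join only the parts that still contain text after trimming."""
    cleaned = [p.strip() for p in parts if p is not None and p.strip() != ""]
    return sep.join(cleaned)


# Naive concatenation of "408", "" and "1234" with "-" would give "408--1234".
print(join_nonempty(["408", "", "1234"]))
```

The same shape works as a helper inside an app: trim each value, drop the empty ones, then join, so a missing field never leaves a dangling separator.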


Development (test, debug, collaborate)

• Your two best friends: input and output
• Collaborative development
– Create a shared Y!ID for your organization
– Export and import apps from the dashboard
• Bellwethers
– Start with just one or two, for simplicity
– Once app is working, hit “autofind” and look at all ten, see what breaks
– Always set the #1 bellwether to something that’s high-ranking; that’s your Gallery preview


Image Helper Functions

• Data::getStars(string $data_get_path)
– e.g. Data::getStars("smid:Jk8/review:rating")
• Data::getStarsFromNum(float $rating)
– Must scale $rating to fall between 0 and 5 inclusive
• Data::getImage(string $name)
– Adds icons to your app
• Data::getImage("information")
• Data::getImage("email")
• Data::getImage("edit")
•…
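
Since the rating passed to Data::getStarsFromNum() has to land between 0 and 5, a source rating on another scale needs converting first. A minimal sketch of that arithmetic, in Python for illustration only; `scale_rating` and its parameters are hypothetical names, and the clamping behaviour is an assumption about what is desirable, not documented SearchMonkey behaviour.

```python
def scale_rating(value, source_max, target_max=5.0):
    """Map a rating on a 0..source_max scale onto 0..target_max, clamped to the range."""
    if source_max <= 0:
        raise ValueError("source_max must be positive")
    scaled = value / source_max * target_max
    return max(0.0, min(target_max, scaled))


print(scale_rating(24, 30))  # a rating of 24 on a 30-point scale
```

Clamping keeps a bad source value (negative, or above the stated maximum) from producing a star count outside the allowed range.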


XML functions

• NodeList Data::xpath(string $query [, DOMNode $contextnode])
– More complicated than Data::get()
– Can count, iterate, find children
– Can fetch all vcard:fn, regardless of where they are
– Can find a node and grab its first four children
• string Data::xpathString(string $query [, DOMNode $contextnode])
– Convenience function if you don’t need to do further DOM manipulation
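
The difference between the node-list call and the string call can be illustrated outside the framework. This sketch uses Python's standard xml.etree.ElementTree as an analogy only; `findall` and `findtext` are ElementTree methods, not the Data:: API, and the adjunct fragment is a hypothetical one built from element names seen in the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical adjunct fragment; element and attribute names follow the slides.
DOC = ET.fromstring(
    "<adjunct>"
    "<item rel='rel:Listing'>"
    "<meta property='vcard:fn'>Parcel 104</meta>"
    "<meta property='vcard:tel'>(408) 555-1234</meta>"
    "</item>"
    "</adjunct>"
)

# Node-list style: the result can be counted, iterated, or sliced.
metas = DOC.findall(".//meta")
first_four = metas[:4]

# String style: one call when only the text is needed; empty string if absent.
tel = DOC.findtext(".//meta[@property='vcard:tel']", default="")
adr = DOC.findtext(".//meta[@property='vcard:adr']", default="")

print(len(metas), tel, repr(adr))
```

The string form saves the length check and item lookup when a single value is all the app needs.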


Infobar Design: Party like it’s 1999

• Sadly, can’t use CSS
– and the default stylesheet strips off most style
– thus lists won’t even display bullets or numbers, you have to fake this
• Layout: use tables (remember tables?)
• Fonts: can use <font color>, <font face>, <big>, <small>
• Make good use of images and links
• PRO TIP: Use PHP HEREDOC (<<<)


Let Infobars be Infobars

• Make use of the real estate


• Or be minimal

• But don’t do an Infobar that’s really just an Enhanced Result in disguise
– Use the blob and summary
– Don’t use the thumbnail, key/value pairs, …


Triggering on *

• This can be annoying for general audiences
– but it’s hard to abort an infobar before 50ms
– and you can’t do this in the PHP layer if you depend on an extractor or web service
– Data has to be provided by a feed or by structured markup
• For specialized audiences a “*” infobar might be ok


• Trigger on structured markup
– Ex: Creative Commons Infobar
• Use feeds to annotate the URLs you want
• Instead of *, do a comma-separated list of sites:
– www.uiuc.edu/*, www.stanford.edu/*, www.berkeley.edu/*, www.cmu.edu/*, …
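
To check a site list before pasting it into an app, the patterns can be tried against sample URLs. The sketch below assumes shell-style glob semantics via Python's standard fnmatch module; how SearchMonkey itself matches trigger patterns is not described in these slides, so treat the matching rule and the `triggers` helper as assumptions.

```python
from fnmatch import fnmatchcase

# The four patterns named on the slide.
PATTERNS = [
    "www.uiuc.edu/*",
    "www.stanford.edu/*",
    "www.berkeley.edu/*",
    "www.cmu.edu/*",
]


def triggers(url, patterns=PATTERNS):
    """Return True if the URL, with its scheme removed, matches any pattern."""
    bare = url.split("://", 1)[-1]
    return any(fnmatchcase(bare, pattern) for pattern in patterns)


print(triggers("http://www.stanford.edu/admissions"))
print(triggers("http://www.example.com/"))
```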


XSLT Extractors

• Use the Firebug extension for Firefox
– And XPather, another Firefox extension
• Typical pattern: a skeleton of DataRSS, into which you plug some XPath
– For more complex XSL:
• Use <xsl:template>
• <xsl:for-each> is clumsier

• Find a good ID to cling to
– Compare arxiv.org (easy) to acm.org (harder)


Examples

• Rubik’s Cube
• VTA Bus
• API Monkey
• BugMeNot
• RetailMeNot
• Amazon


questions?

