2. What’s happening in the world?
• “Search is the first thing people use on the Web now” – Doug Cutting, founder and core project manager of Nutch
• For certain types of searches, search engines are very good.
But I still see major failures, where they aren't delivering useful results.
At a deeper, almost political level, I think it's important that we as a
global society have some transparency in search.
• What are the algorithms involved?
• What are the reasons why one site comes up over another?
3. • If you consider one of the basic tasks of a search engine, it is to
make a decision: this page is good, or this page sucks
– Jimmy Wales, father of Wikipedia
• Computers are notoriously bad at making such judgments
• “Dear Jimbo, you do not know the power of machine learning”
4. • Google™ is the most powerful agency crawling the web
• Billions and billions of pages crawled
• A PageRank-based search system
• Wanna pay for some ranking points?
5. Features
• As soon as you compensate someone for a link (with cash, or
a returned link exchanged for barter reasons only), you break
the model.
• It doesn't mean that all these links are bad, or evil;
• It means that we can't evaluate their real merit.
• We are slaves to this fact, and can't change it.
6. What’s a spider?
• Is that a movie? Or an animal?
• Explores the web using a target-based search
• Uses a bag-of-words model (or an ontology) for searching
7. Google PageRank (1/2)
• How does it work?
• Rank(A) = (1 − d) + d · (Rank(T1)/C(T1) + Rank(T2)/C(T2) + … + Rank(Tn)/C(Tn))
• C(Ti) is the number of outbound links from page Ti
• Rank(A) depends on the Rank(•) of the pages Ti that link to A
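The formula above can be sketched as a fixed-point (power) iteration. The three-page toy graph and the damping factor d = 0.85 below are illustrative assumptions, not values from the talk:

```python
# Minimal PageRank sketch: links[p] lists the pages that p points to.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}
d = 0.85                                # damping factor (assumed)
rank = {p: 1.0 for p in links}          # uniform initial ranks

for _ in range(50):                     # iterate until (approximate) convergence
    new_rank = {}
    for page in links:
        # Pages Ti linking to `page`, each contributing Rank(Ti)/C(Ti)
        incoming = sum(rank[t] / len(links[t])
                       for t in links if page in links[t])
        new_rank[page] = (1 - d) + d * incoming
    rank = new_rank

print({p: round(r, 3) for p, r in rank.items()})
```

Here C receives links from both A and B, so it ends up with the highest rank, while B, with a single incoming link that shares A's vote with C, ends up lowest.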
8. Google PageRank (2/2)
Some fuzzy rules reportedly applied by Google™
• if Rank(A) is high:
Rank(B) = Rank(B) + k (for each page B that A links to)
• if Rank(A) is high:
Weight(li) = Weight(li) + w (for each link li on page A)
• if Rank(A) is low:
Weight(li) = Weight(li) (link weights are left unchanged)
9. Reinforced spidering: a classical problem
• The “mouse and maze” scenario
• States, actions and reward function
• state: position in the maze AND the
positions of the pieces of cheese still to be caught
• action: move right, left, up, down
• reward: ƒ = (1/d)·ß
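The maze scenario above can be sketched with standard value iteration over a tiny MDP. The 1-D corridor with a single piece of cheese at one end is my own simplification, not the talk’s actual maze or reward function:

```python
# Value-iteration sketch for a toy "mouse and maze" MDP (hypothetical
# 1-D corridor; the cheese sits at state GOAL).
N, GOAL, gamma = 5, 4, 0.9
ACTIONS = (-1, +1)                      # move left / move right

def step(s, a):
    return min(max(s + a, 0), N - 1)    # deterministic move; walls clamp

V = [0.0] * N
for _ in range(100):                    # Bellman backups until convergence
    V = [0.0 if s == GOAL else         # goal state is absorbing
         max((1.0 if step(s, a) == GOAL else 0.0) + gamma * V[step(s, a)]
             for a in ACTIONS)
         for s in range(N)]

# Greedy policy: the action maximizing immediate reward + discounted value
policy = [max(ACTIONS, key=lambda a:
              (1.0 if step(s, a) == GOAL else 0.0) + gamma * V[step(s, a)])
          for s in range(GOAL)]
print(V, policy)
```

After convergence every non-goal state prefers to move right, and state values decay geometrically (by gamma) with distance from the cheese.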
10. Reinforced spidering: a not-so-classical problem
• State: current crawler position
• Action: follow links from current position
• Reward: ƒ(q,d), calculated independently for every page
• Probability: P(s,a) from a query–page similarity calculation (naive
Bayes) and/or estimated a posteriori from end-user selections
11. Reinforced spidering: a not-so-classical problem
Features
• a web page is a formatted document
(<h1>,<h2>,<h3>,<p>,<a>)
• a web page belongs to a graph:
whenever the agent finds relevant information, it receives a reward.
Reinforcement learning is used to let the agent learn how to maximize
rewards while surfing the web in search of relevant information.
• the reward is defined by a relevance function measuring the relevance of
page d w.r.t. query q
12. • Given a query q, calculate the retrieval status value rsv0(q,d) independently for each
page d.
These are the immediate rewards of each page.
• Then we have to propagate the rewards along hyperlinks (with value iteration, for
example) through the graph:
rsvt+1(q,d) = rsv0(q,d) + (∆ / |links(d)|) · ∑d'∈links(d) rsvt(q,d')
where
• ∆ is the inflation coefficient (how much neighboring pages influence the current document)
• links(d) is the set of hyperlinks from d.
13. rsvt+1(q,d) = rsv0(q,d) + (∆ / |links(d)|) · ∑d'∈links(d) rsvt(q,d')
1. The formula is applied repeatedly to each document in a subset of the
collection
2. The subset contains the pages with a significant rsv0
3. After convergence, pages that are n links away from page d make a
contribution (reward) proportional to ∆^n times their rsv
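The propagation rule can be sketched directly. The three-page graph, the rsv0 scores, and ∆ = 0.5 below are made-up illustrative values:

```python
# Sketch of the rsv propagation rule above on a toy graph.
links = {                     # links(d): outgoing hyperlinks of each page
    "d1": ["d2", "d3"],
    "d2": ["d3"],
    "d3": [],                 # a sink page keeps its immediate reward
}
rsv0 = {"d1": 0.2, "d2": 0.7, "d3": 0.1}   # immediate rewards rsv0(q, d)
delta = 0.5                   # inflation coefficient (assumed)

rsv = dict(rsv0)              # start from the immediate rewards
for _ in range(30):           # value-iteration-style updates to a fixed point
    new = {}
    for d, out in links.items():
        if out:
            new[d] = rsv0[d] + (delta / len(out)) * sum(rsv[d2] for d2 in out)
        else:
            new[d] = rsv0[d]
    rsv = new
print({d: round(v, 4) for d, v in rsv.items()})
```

In the fixed point, d1’s score rises above its rsv0 because it links to the highly relevant d2: the reward of a page two links away reaches d1 scaled by ∆², exactly as point 3 describes.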