2. What’s happening in the world?
• “Search is the first thing people use on the Web now” – Doug Cutting, founder and core project manager of Nutch
• For certain types of searches, search engines are very good.
But I still see major failures, where they aren't delivering useful results.
At a deeper, almost political level, I think it's important that we as a
global society have some transparency in search.
• What are the algorithms involved?
• What are the reasons why one site comes up over another?
3. • If you consider one of the basic tasks of a search engine, it is to
make a decision: this page is good, or this page sucks
– Jimmy Wales, father of Wikipedia
• Computers are notoriously bad at making such judgments
• “Dear Jimbo, you do not know the power of machine learning”
4. • Google™ is the most powerful agency crawling the web
• Billions and billions of pages crawled
• A PageRank-based search system
• Wanna pay for some ranking points?
5. Features
• As soon as you compensate someone for a link (with cash, or
a returned link exchanged for barter reasons only), you break
the model.
• It doesn't mean that all these links are bad, or evil;
• It means that we can't evaluate their real merit.
• We are slaves to this fact, and can't change it.
6. What’s a spider?
• Is that a movie? Or an animal?
• Explores the web using a target-based search
• Uses a bag-of-words model (or an ontology) for searching
7. Google PageRank (1/2)
• How does it work?
• Rank(A) = (1 − d) + d · (Rank(T1)/C(T1) + Rank(T2)/C(T2) + … + Rank(Tn)/C(Tn))
• C(Ti) is the number of outbound links from page Ti
• Rank(A) depends on the Rank(•) of the pages Ti that link to A
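The formula above can be sketched as a fixed-point (power) iteration. The three-page toy graph and the damping factor d = 0.85 below are illustrative assumptions, not values from the talk:

```python
# Minimal PageRank sketch: links[p] lists the pages that p points to.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}
d = 0.85                                # damping factor (assumed)
rank = {p: 1.0 for p in links}          # uniform initial ranks

for _ in range(50):                     # iterate until (approximate) convergence
    new_rank = {}
    for page in links:
        # Pages Ti linking to `page`, each contributing Rank(Ti)/C(Ti)
        incoming = sum(rank[t] / len(links[t])
                       for t in links if page in links[t])
        new_rank[page] = (1 - d) + d * incoming
    rank = new_rank

print({p: round(r, 3) for p, r in rank.items()})
```

Here C receives links from both A and B, so it ends up with the highest rank, while B, with a single incoming link that shares A's vote with C, ends up lowest.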
8. Google PageRank (2/2)
Some fuzzy rules reportedly applied by Google™
• if Rank(A) is high:
Rank(B) = Rank(B) + k (for each page B that A links to)
• if Rank(A) is high:
Weight(li) = Weight(li) + w (for each link li on page A)
• if Rank(A) is low:
Weight(li) = Weight(li) (link weights are left unchanged)
9. Reinforced spidering: a classical problem
• The “mouse and maze” scenario
• States, actions and reward function
• state: position in the maze AND the
positions of the pieces of cheese still to be caught
• action: move right, left, up, down
• reward: ƒ = (1/d)·ß
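The maze scenario above can be sketched with standard value iteration over a tiny MDP. The 1-D corridor with a single piece of cheese at one end is my own simplification, not the talk’s actual maze or reward function:

```python
# Value-iteration sketch for a toy "mouse and maze" MDP (hypothetical
# 1-D corridor; the cheese sits at state GOAL).
N, GOAL, gamma = 5, 4, 0.9
ACTIONS = (-1, +1)                      # move left / move right

def step(s, a):
    return min(max(s + a, 0), N - 1)    # deterministic move; walls clamp

V = [0.0] * N
for _ in range(100):                    # Bellman backups until convergence
    V = [0.0 if s == GOAL else         # goal state is absorbing
         max((1.0 if step(s, a) == GOAL else 0.0) + gamma * V[step(s, a)]
             for a in ACTIONS)
         for s in range(N)]

# Greedy policy: the action maximizing immediate reward + discounted value
policy = [max(ACTIONS, key=lambda a:
              (1.0 if step(s, a) == GOAL else 0.0) + gamma * V[step(s, a)])
          for s in range(GOAL)]
print(V, policy)
```

After convergence every non-goal state prefers to move right, and state values decay geometrically (by gamma) with distance from the cheese.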
10. Reinforced spidering: a not-so-classical problem
• State: current crawler position
• Action: follow links from current position
• Reward: ƒ(q,d), calculated independently for every page
• Probability: P(s,a) from a query–page similarity calculation (naive
Bayes) and/or estimated a posteriori from end-user selections
11. Reinforced spidering: a not-so-classical problem
Features
• a web page is a formatted document
(<h1>,<h2>,<h3>,<p>,<a>)
• a web page belongs to a graph:
whenever the agent finds relevant information, it receives a reward.
Reinforcement learning is used to let the agent learn how to maximize
rewards while surfing the web in search of relevant information.
• the reward is defined by a relevance function measuring the relevance of
page d w.r.t. query q
12. • Given a query q, calculate the retrieval status value rsv0(q,d) independently for each
page d.
These are the immediate rewards of each page.
• Then we have to propagate the rewards along hyperlinks (with value iteration, for
example) through the graph:
rsvt+1(q,d) = rsv0(q,d) + (∆ / |links(d)|) · ∑d'∈links(d) rsvt(q,d')
where
• ∆ is the inflation coefficient (how much neighboring pages influence the current document)
• links(d) is the set of hyperlinks from d.
13. rsvt+1(q,d) = rsv0(q,d) + (∆ / |links(d)|) · ∑d'∈links(d) rsvt(q,d')
1. The formula is applied repeatedly to each document in a subset of the
collection
2. The subset contains the pages with a significant rsv0
3. After convergence, pages that are n links away from page d make a
contribution (reward) proportional to ∆^n times their rsv
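The propagation rule can be sketched directly. The three-page graph, the rsv0 scores, and ∆ = 0.5 below are made-up illustrative values:

```python
# Sketch of the rsv propagation rule above on a toy graph.
links = {                     # links(d): outgoing hyperlinks of each page
    "d1": ["d2", "d3"],
    "d2": ["d3"],
    "d3": [],                 # a sink page keeps its immediate reward
}
rsv0 = {"d1": 0.2, "d2": 0.7, "d3": 0.1}   # immediate rewards rsv0(q, d)
delta = 0.5                   # inflation coefficient (assumed)

rsv = dict(rsv0)              # start from the immediate rewards
for _ in range(30):           # value-iteration-style updates to a fixed point
    new = {}
    for d, out in links.items():
        if out:
            new[d] = rsv0[d] + (delta / len(out)) * sum(rsv[d2] for d2 in out)
        else:
            new[d] = rsv0[d]
    rsv = new
print({d: round(v, 4) for d, v in rsv.items()})
```

In the fixed point, d1’s score rises above its rsv0 because it links to the highly relevant d2: the reward of a page two links away reaches d1 scaled by ∆², exactly as point 3 describes.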