Web crawling and reinforcement learning
In-depth study for the Soft Computing course
Francesco Gadaleta
What’s happening in the world?
•   "Search is the first thing people use on the Web now" - Doug Cutting, a founder and core project manager at
    Nutch



•   For certain types of searches, search engines are very good.
    But I still see major failures, where they aren't delivering useful results.
    I think at a deeper almost political level, I think it's important that we as a
    global society have some transparency in search.

•   What are the algorithms involved?

•   What are the reasons why one site comes up over another one?

•   If you consider one of the basic tasks of a search engine, it is to
    make a decision: this page is good or this page sucks.
    - Jimmy Wales, father of Wikipedia




•   Computers are notoriously bad at making such judgments

•   “Dear Jimbo, you do not know the power of machine learning”
• Google™ is the most powerful agency crawling the web

  •   Billions and billions of pages crawled


  •   Page Ranking based search system


  •   Wanna pay for some ranking points?
Features

•   As soon as you compensate someone for a link (with cash, or
    a returned link exchanged for barter reasons only), you break
    the model.
•   It doesn't mean that all these links are bad, or evil;
•   It means that we can't evaluate their real merit.
•   We are slaves to this fact, and can't change it.
What’s a spider?


• Is that a movie? Or an animal?
• A program that explores the web using a target-based search
• Uses a bag-of-words (or an ontology) to describe what it is searching for
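
To make the bag-of-words idea concrete, here is a minimal Python sketch. It assumes a simple regex tokenizer and cosine similarity as the matching score; the slides specify neither, and the function names are hypothetical.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text and count word occurrences (word order is discarded)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors; 0.0 if either is empty."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Assumed toy data: a search target and two candidate pages.
target = bag_of_words("reinforcement learning web crawler")
page_a = bag_of_words("A web crawler trained with reinforcement learning")
page_b = bag_of_words("Recipes for apple pie")
```

A spider could then prefer `page_a` over `page_b`, because only the former shares words with the target.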
Google Page Ranking (1/2)
• How does it work?
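  •   $$\mathrm{Rank}(A) = (1-d) + d \left( \frac{\mathrm{Rank}(T_1)}{C(T_1)} + \frac{\mathrm{Rank}(T_2)}{C(T_2)} + \dots + \frac{\mathrm{Rank}(T_i)}{C(T_i)} \right)$$

  •   T1 ... Ti are the pages that link to A, and d is the damping factor
  •   C(Ti) is the number of outbound links from Ti
  •   Rank(j) depends on Rank(•) of other pages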
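
Because each rank depends on the ranks of other pages, the formula is usually solved by repeated substitution. The Python sketch below illustrates that iteration on a toy graph. The damping value d = 0.85, the fixed iteration count, and the names `page_rank` and `toy_web` are assumptions for illustration, not details from the slides.

```python
def page_rank(links, d=0.85, iterations=50):
    """Iterate Rank(A) = (1 - d) + d * sum(Rank(T) / C(T)) over pages T linking to A.

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    rank = {page: 1.0 for page in pages}  # assumed starting value
    for _ in range(iterations):
        new_rank = {}
        for page in pages:
            # Each page linking here passes on its rank divided by its outbound link count.
            incoming = sum(
                rank[src] / len(targets)
                for src, targets in links.items()
                if page in targets
            )
            new_rank[page] = (1 - d) + d * incoming
        rank = new_rank
    return rank

# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = page_rank(toy_web)
```

In this toy graph C ends up with the highest rank, since it collects links from both A and B.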
Google Page Ranking (2/2)

             Some real fuzzy rules made by Google™
               •   if Rank(A) is high:
                       Rank(B) = Rank(B) + k

               •   if Rank(A) is high:
                       Weight(li) = Weight(li) + w

               •   if Rank(A) is low:
                       Weight(li) = Weight(li)
Reinforced spidering: a classical problem
• The “mouse and maze” scenario
• States, actions and reward function
    • state: position in the maze AND
      positions of the pieces of cheese still to be caught
    • action: move right, left, up, down
    • reward: ƒ=1/d•ß
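
The mouse-and-maze setup can be sketched in Python as a small Markov decision process solved by value iteration. This is only an illustration under stated assumptions: the reward is simplified to 1.0 per piece of cheese eaten (the slide's ƒ is not reproduced because its exact form is not spelled out), walls clamp moves at the grid border, and the discount `gamma`, the grid size, and all function names are hypothetical.

```python
import itertools

ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action, size):
    """Apply an action and return (next_state, reward).

    A state is (mouse_position, frozenset_of_remaining_cheese).
    Assumed reward: 1.0 when a piece of cheese is eaten, 0.0 otherwise.
    """
    (row, col), cheese = state
    d_row, d_col = ACTIONS[action]
    position = (
        min(max(row + d_row, 0), size - 1),  # border walls clamp the move
        min(max(col + d_col, 0), size - 1),
    )
    if position in cheese:
        return (position, cheese - {position}), 1.0
    return (position, cheese), 0.0

def value_iteration(size, cheese, gamma=0.9, sweeps=50):
    """Compute the discounted value of every (position, remaining cheese) state."""
    cells = list(itertools.product(range(size), repeat=2))
    subsets = [
        frozenset(combo)
        for count in range(len(cheese) + 1)
        for combo in itertools.combinations(cheese, count)
    ]
    # The mouse never stands on uneaten cheese, so those states are skipped.
    states = [(cell, left) for cell in cells for left in subsets if cell not in left]
    values = {state: 0.0 for state in states}
    for _ in range(sweeps):
        values = {
            state: max(
                reward + gamma * values[next_state]
                for next_state, reward in (step(state, a, size) for a in ACTIONS)
            )
            for state in states
        }
    return values

values = value_iteration(size=3, cheese=[(0, 2)])
```

States closer to the cheese get higher values, which is what lets a greedy mouse find its way through the maze.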
Reinforced spidering: a not so classic problem
•   State: current crawler position

•   Action: follow links from current position

•   Reward: ƒ(q,d) calculated independently on every page

•   Probability: P(s,a) from a query-page similarity calculation (naive
    Bayes) and/or a posteriori from end-user selections
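
The roles of state, action and reward can be illustrated with a greedy best-first crawl over an in-memory graph. This Python sketch is not a learned policy: it assumes the reward of a linked page can be scored before visiting it, and the graph, the reward values and the name `best_first_crawl` are all hypothetical.

```python
import heapq

def best_first_crawl(graph, start, reward, budget):
    """Visit up to `budget` pages, always expanding the highest-reward known page next.

    graph:  dict mapping page -> list of linked pages (stands in for fetching HTML)
    reward: function page -> float, playing the role of f(q, d) for a fixed query
    """
    frontier = [(-reward(start), start)]  # max-heap via negated scores
    seen = {start}
    visited = []
    while frontier and len(visited) < budget:
        _, page = heapq.heappop(frontier)  # state: the crawler's current position
        visited.append(page)
        for target in graph.get(page, []):  # actions: follow one of the links
            if target not in seen:
                seen.add(target)
                heapq.heappush(frontier, (-reward(target), target))
    return visited

# Assumed toy web and per-page rewards for one fixed query.
toy_graph = {
    "home": ["sports", "ml"],
    "ml": ["rl", "news"],
    "sports": ["news"],
    "rl": [],
    "news": [],
}
toy_reward = {"home": 0.1, "sports": 0.0, "ml": 0.6, "rl": 0.9, "news": 0.2}.get
order = best_first_crawl(toy_graph, "home", toy_reward, budget=3)
```

With a budget of three pages the crawler goes from `home` to `ml` to `rl`, skipping the low-reward `sports` branch.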
Reinforced spidering: a not so classic problem
Features
• a web page is a formatted document
  (<h1>, <h2>, <h3>, <p>, <a>)

• a web page belongs to a graph:
  whenever the agent finds relevant information, it receives a reward.
  Reinforcement learning lets the agent learn how to maximize
  rewards while surfing the web in search of relevant information.

• the reward is defined by a relevance function measuring the relevance of
  page d with respect to query q
•   Given a query q, calculate the retrieval status value rsv_0(q,d) independently for each
    page d.
    These are the immediate rewards of each page.

•   Then we have to propagate the rewards along the hyperlinks (with value iteration, for example)
    through the graph:

    $$\mathrm{rsv}_{t+1}(q,d) = \mathrm{rsv}_0(q,d) + \frac{\Delta}{|\mathrm{links}(d)|} \sum_{d' \in \mathrm{links}(d)} \mathrm{rsv}_t(q,d')$$

    where

•   ∆ is the inflation coefficient (how much neighboring pages influence the current document)

•   links(d) is the set of hyperlinks from d.
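
The propagation step can be sketched in Python as follows. The value ∆ = 0.5, the fixed number of iterations, the toy scores and the name `propagate_rsv` are assumptions for illustration. The sketch also assumes every linked page has an entry in `rsv0`, and that a page without outgoing links keeps only its immediate reward.

```python
def propagate_rsv(rsv0, links, delta=0.5, iterations=50):
    """Iterate rsv_{t+1}(d) = rsv0(d) + delta / |links(d)| * sum of rsv_t(d') over d' in links(d)."""
    rsv = dict(rsv0)
    for _ in range(iterations):
        new_rsv = {}
        for d, base in rsv0.items():
            out = links.get(d, [])
            if out:
                neighbor_sum = sum(rsv[target] for target in out)
                new_rsv[d] = base + delta * neighbor_sum / len(out)
            else:
                new_rsv[d] = base  # no outgoing links: only the immediate reward
        rsv = new_rsv
    return rsv

# Hypothetical immediate rewards for one query: page d links to pages a and b.
rsv0 = {"d": 0.2, "a": 1.0, "b": 0.0}
links = {"d": ["a", "b"]}
scores = propagate_rsv(rsv0, links)
```

Here page `d` ends at 0.2 + 0.5 · (1.0 + 0.0) / 2 = 0.45, so a weakly relevant page is boosted by pointing to a relevant one.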
$$\mathrm{rsv}_{t+1}(q,d) = \mathrm{rsv}_0(q,d) + \frac{\Delta}{|\mathrm{links}(d)|} \sum_{d' \in \mathrm{links}(d)} \mathrm{rsv}_t(q,d')$$

1.  The formula is applied repeatedly for each document in a subset of the
    collection

2.  The subset contains the documents with a significant rsv_0

3.  After convergence, pages that are n links away from page d make a
    contribution (reward) proportional to ∆^n times their rsv
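
A short worked example shows where the factor ∆^n comes from. Assume a hypothetical chain d → d_1 → d_2 in which each page has exactly one outgoing link, d_2 has none, and only d_2 is relevant, with rsv_0(q,d_2) = r. At convergence:

$$\mathrm{rsv}(q,d_2) = r, \qquad \mathrm{rsv}(q,d_1) = 0 + \frac{\Delta}{1}\, r = \Delta r, \qquad \mathrm{rsv}(q,d) = 0 + \frac{\Delta}{1}\, \Delta r = \Delta^2 r$$

So d_2, which is n = 2 links away from d, contributes ∆^2 times its rsv. When pages have several outgoing links, each hop also divides the contribution by |links(·)|, which is why the contribution is proportional to ∆^n rather than equal to it.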
