SlideShare une entreprise Scribd logo
1  sur  22
Télécharger pour lire hors ligne
being google
 tom dyson
V.
metadata is easy
language is hard
Our Corpus:
 1. The cow says moo.
2. The sheep says baa.
3. The dogs say woof.
 4. The dog-cow says
         moof.
>>>   doc1   =   quot;The   cow says moo.quot;
>>>   doc2   =   quot;The   sheep says baa.quot;
>>>   doc3   =   quot;The   dogs say woof.quot;
>>>   doc4   =   quot;The   dog-cow says moof.quot;
Brute force
>>> docs = [doc1, doc2, doc3, doc4]

>>> def searcher(term):
...   for doc in docs:
...     if doc.find(term) > -1:
...       print quot;found '%s' in '%s'quot; % (term, doc)
...
>>> searcher('moo')
found 'moo' in 'The cow says moo.'
my first inverted index
Tokenising #1
>>> doc1.split()
['The', 'cow', 'says', 'moo.']
Tokenising #2
>>> import re
>>> word = re.compile('W+')
>>> word.split(doc1)
['The', 'cow', 'says', 'moo', '']

>>> doc4 = quot;The dog-cow says moofquot;
>>> word.split(doc4)
['The', 'dog', 'cow', 'says', 'moof']
Tokenising #3

>>> word = re.compile('s|[^a-z-]', re.I)
>>> word.split(doc4)
['The', 'dog-cow', 'says', 'moof', '']
Data structures
>>>   doc1   =   {'name':'doc   1',   'content':quot;The   cow says moo.quot;}
>>>   doc2   =   {'name':'doc   2',   'content':quot;The   sheep says baa.quot;}
>>>   doc3   =   {'name':'doc   3',   'content':quot;The   dogs say woof.quot;}
>>>   doc4   =   {'name':'doc   4',   'content':quot;The   dog-cow says moof.quot;}
Postings
>>> postings = {}

>>> for doc in docs:
...   for token in word.split(doc['content']):
...     if len(token) == 0: break
...     doc_name = doc['name']
...     if token not in postings:
...       postings[token.lower()] = [doc_name]
...     else:
...       postings[token.lower()].append(doc_name)
Postings
>>> postings
{'sheep': ['doc 2'], 'says': ['doc 1', 'doc 2',
'doc 4'], 'cow': ['doc 1'], 'moof': ['doc 4'],
'dog-cow': ['doc 4'], 'woof': ['doc 3'], 'say':
['doc 3'], 'moo': ['doc 1'], 'baa': ['doc 2'],
'The': ['doc 1', 'doc 2', 'doc 3', 'doc 4'],
'dogs': ['doc 3']}
O(log n)
>>> def searcher(term):
...   if term in postings:
...     for match in postings[term]:
...       print quot;found '%s' in '%s'quot; % (term, match)
...
>>> searcher('says')
found 'says' in 'doc 1'
found 'says' in 'doc 2'
found 'says' in 'doc 4'
More postings
‘sheep’: [‘doc 2’, [2]]
‘says’: [‘doc 1’, [3], ‘doc 2’, [3], ‘doc 4’, [3]]
and more postings
‘sheep’: [‘doc 2’, [‘field’: ‘body’], 2]]
‘google’: [‘intro’, [‘field’: ‘title’], 2]]
tokenising #3
  Punctuation
   Stemming
   Stop words
 Parts of Speech
Entity Extraction
     Markup
Logistics
          Storage
(serialising, transporting,
        clustering)
          Updates
       Warming up
ranking
   Density
    (tf–idf)
   Position
      Date
Relationships
  Feedback
   Editorial
interesting search
      Lucene
(Hadoop, Solr, Nutch)
  OpenFTS / MySQL
       Sphinx
   Hyper Estraier
      Xapian
  Other index types
being google
 tom dyson

Contenu connexe

Tendances

Presentasi mac'lc-02
Presentasi mac'lc-02Presentasi mac'lc-02
Presentasi mac'lc-02
maman__
 
Presentasi Mac'LC
Presentasi Mac'LCPresentasi Mac'LC
Presentasi Mac'LC
maman__
 

Tendances (20)

Javantura v2 - Replication with MongoDB - what could go wrong... - Philipp Krenn
Javantura v2 - Replication with MongoDB - what could go wrong... - Philipp KrennJavantura v2 - Replication with MongoDB - what could go wrong... - Philipp Krenn
Javantura v2 - Replication with MongoDB - what could go wrong... - Philipp Krenn
 
MongoDB: How it Works
MongoDB: How it WorksMongoDB: How it Works
MongoDB: How it Works
 
Intro to MongoDB and datamodeling
Intro to MongoDB and datamodeling Intro to MongoDB and datamodeling
Intro to MongoDB and datamodeling
 
Groovy ネタ NGK 忘年会2009 ライトニングトーク
Groovy ネタ NGK 忘年会2009 ライトニングトークGroovy ネタ NGK 忘年会2009 ライトニングトーク
Groovy ネタ NGK 忘年会2009 ライトニングトーク
 
RESTing with the new Yandex.Disk API, Clemens Аuer
RESTing with the new Yandex.Disk API, Clemens АuerRESTing with the new Yandex.Disk API, Clemens Аuer
RESTing with the new Yandex.Disk API, Clemens Аuer
 
Ruby Language: Array, Hash and Iterators
Ruby Language: Array, Hash and IteratorsRuby Language: Array, Hash and Iterators
Ruby Language: Array, Hash and Iterators
 
MongoDB Europe 2016 - ETL for Pros – Getting Data Into MongoDB The Right Way
MongoDB Europe 2016 - ETL for Pros – Getting Data Into MongoDB The Right WayMongoDB Europe 2016 - ETL for Pros – Getting Data Into MongoDB The Right Way
MongoDB Europe 2016 - ETL for Pros – Getting Data Into MongoDB The Right Way
 
Presentasi mac'lc-02
Presentasi mac'lc-02Presentasi mac'lc-02
Presentasi mac'lc-02
 
Presentasi Mac'LC
Presentasi Mac'LCPresentasi Mac'LC
Presentasi Mac'LC
 
My_sql_with_php
My_sql_with_phpMy_sql_with_php
My_sql_with_php
 
はじめてのGroovy
はじめてのGroovyはじめてのGroovy
はじめてのGroovy
 
コミュニケーションとしてのコード
コミュニケーションとしてのコードコミュニケーションとしてのコード
コミュニケーションとしてのコード
 
Mongo db
Mongo dbMongo db
Mongo db
 
Codigos
CodigosCodigos
Codigos
 
CouchDB @ red dirt ruby conference
CouchDB @ red dirt ruby conferenceCouchDB @ red dirt ruby conference
CouchDB @ red dirt ruby conference
 
PuppetCamp SEA @ Blk 71 - Nagios in under 10 mins with Puppet
PuppetCamp SEA @ Blk 71 -  Nagios in under 10 mins with PuppetPuppetCamp SEA @ Blk 71 -  Nagios in under 10 mins with Puppet
PuppetCamp SEA @ Blk 71 - Nagios in under 10 mins with Puppet
 
PuppetCamp SEA @ Blk 71 - Nagios in under 10 mins with Puppet
PuppetCamp SEA @ Blk 71 -  Nagios in under 10 mins with PuppetPuppetCamp SEA @ Blk 71 -  Nagios in under 10 mins with Puppet
PuppetCamp SEA @ Blk 71 - Nagios in under 10 mins with Puppet
 
Dev Jumpstart: Build Your First App with MongoDB
Dev Jumpstart: Build Your First App with MongoDBDev Jumpstart: Build Your First App with MongoDB
Dev Jumpstart: Build Your First App with MongoDB
 
Percona Live 4/15/15: Transparent sharding database virtualization engine (DVE)
Percona Live 4/15/15: Transparent sharding database virtualization engine (DVE)Percona Live 4/15/15: Transparent sharding database virtualization engine (DVE)
Percona Live 4/15/15: Transparent sharding database virtualization engine (DVE)
 
Underscore
UnderscoreUnderscore
Underscore
 

En vedette

The Making of The Carbon Account
The Making of The Carbon AccountThe Making of The Carbon Account
The Making of The Carbon Account
Tom Dyson
 

En vedette (7)

Personal Carbon Rationing
Personal Carbon RationingPersonal Carbon Rationing
Personal Carbon Rationing
 
Wagtail - Pourquoi un nouveau CMS?
Wagtail - Pourquoi un nouveau CMS?Wagtail - Pourquoi un nouveau CMS?
Wagtail - Pourquoi un nouveau CMS?
 
Hands on with Google App Engine
Hands on with Google App EngineHands on with Google App Engine
Hands on with Google App Engine
 
Dynamic Demand
Dynamic DemandDynamic Demand
Dynamic Demand
 
The mobile web
The mobile webThe mobile web
The mobile web
 
Wychwood CRAG launch
Wychwood CRAG launchWychwood CRAG launch
Wychwood CRAG launch
 
The Making of The Carbon Account
The Making of The Carbon AccountThe Making of The Carbon Account
The Making of The Carbon Account
 

Similaire à Being Google

What I learned from Seven Languages in Seven Weeks (IPRUG)
What I learned from Seven Languages in Seven Weeks (IPRUG)What I learned from Seven Languages in Seven Weeks (IPRUG)
What I learned from Seven Languages in Seven Weeks (IPRUG)
Kerry Buckley
 
python chapter 1
python chapter 1python chapter 1
python chapter 1
Raghu nath
 
Python chapter 2
Python chapter 2Python chapter 2
Python chapter 2
Raghu nath
 
Write better python code with these 10 tricks | by yong cui, ph.d. | aug, 202...
Write better python code with these 10 tricks | by yong cui, ph.d. | aug, 202...Write better python code with these 10 tricks | by yong cui, ph.d. | aug, 202...
Write better python code with these 10 tricks | by yong cui, ph.d. | aug, 202...
amit kuraria
 
Pyconie 2012
Pyconie 2012Pyconie 2012
Pyconie 2012
Yaqi Zhao
 
How can I make it so my code works so my command line can look like -p (1).docx
How can I make it so my code works so my command line can look like -p (1).docxHow can I make it so my code works so my command line can look like -p (1).docx
How can I make it so my code works so my command line can look like -p (1).docx
PaulntmMilleri
 
Desarrollando aplicaciones web en minutos
Desarrollando aplicaciones web en minutosDesarrollando aplicaciones web en minutos
Desarrollando aplicaciones web en minutos
Edgar Suarez
 
Python tutorial
Python tutorialPython tutorial
Python tutorial
Rajiv Risi
 

Similaire à Being Google (20)

What I learned from Seven Languages in Seven Weeks (IPRUG)
What I learned from Seven Languages in Seven Weeks (IPRUG)What I learned from Seven Languages in Seven Weeks (IPRUG)
What I learned from Seven Languages in Seven Weeks (IPRUG)
 
Five
FiveFive
Five
 
Basics of Python programming (part 2)
Basics of Python programming (part 2)Basics of Python programming (part 2)
Basics of Python programming (part 2)
 
My First Ruby
My First RubyMy First Ruby
My First Ruby
 
python chapter 1
python chapter 1python chapter 1
python chapter 1
 
Python chapter 2
Python chapter 2Python chapter 2
Python chapter 2
 
GE8151 Problem Solving and Python Programming
GE8151 Problem Solving and Python ProgrammingGE8151 Problem Solving and Python Programming
GE8151 Problem Solving and Python Programming
 
PLOTCON NYC: Behind Every Great Plot There's a Great Deal of Wrangling
PLOTCON NYC: Behind Every Great Plot There's a Great Deal of WranglingPLOTCON NYC: Behind Every Great Plot There's a Great Deal of Wrangling
PLOTCON NYC: Behind Every Great Plot There's a Great Deal of Wrangling
 
Learn 90% of Python in 90 Minutes
Learn 90% of Python in 90 MinutesLearn 90% of Python in 90 Minutes
Learn 90% of Python in 90 Minutes
 
Pre-Bootcamp introduction to Elixir
Pre-Bootcamp introduction to ElixirPre-Bootcamp introduction to Elixir
Pre-Bootcamp introduction to Elixir
 
Write better python code with these 10 tricks | by yong cui, ph.d. | aug, 202...
Write better python code with these 10 tricks | by yong cui, ph.d. | aug, 202...Write better python code with these 10 tricks | by yong cui, ph.d. | aug, 202...
Write better python code with these 10 tricks | by yong cui, ph.d. | aug, 202...
 
Python tutorial
Python tutorialPython tutorial
Python tutorial
 
Python tutorial
Python tutorialPython tutorial
Python tutorial
 
Coming Out Of Your Shell - A Comparison of *Nix Shells
Coming Out Of Your Shell - A Comparison of *Nix ShellsComing Out Of Your Shell - A Comparison of *Nix Shells
Coming Out Of Your Shell - A Comparison of *Nix Shells
 
Pyconie 2012
Pyconie 2012Pyconie 2012
Pyconie 2012
 
7li7w devcon5
7li7w devcon57li7w devcon5
7li7w devcon5
 
How can I make it so my code works so my command line can look like -p (1).docx
How can I make it so my code works so my command line can look like -p (1).docxHow can I make it so my code works so my command line can look like -p (1).docx
How can I make it so my code works so my command line can look like -p (1).docx
 
Desarrollando aplicaciones web en minutos
Desarrollando aplicaciones web en minutosDesarrollando aplicaciones web en minutos
Desarrollando aplicaciones web en minutos
 
Round PEG, Round Hole - Parsing Functionally
Round PEG, Round Hole - Parsing FunctionallyRound PEG, Round Hole - Parsing Functionally
Round PEG, Round Hole - Parsing Functionally
 
Python tutorial
Python tutorialPython tutorial
Python tutorial
 

Dernier

Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Dernier (20)

Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 

Being Google

  • 2. V.
  • 5. Our Corpus: 1. The cow says moo. 2. The sheep says baa. 3. The dogs say woof. 4. The dog-cow says moof.
  • 6. >>> doc1 = quot;The cow says moo.quot; >>> doc2 = quot;The sheep says baa.quot; >>> doc3 = quot;The dogs say woof.quot; >>> doc4 = quot;The dog-cow says moof.quot;
  • 7. Brute force >>> docs = [doc1, doc2, doc3, doc4] >>> def searcher(term): ... for doc in docs: ... if doc.find(term) > -1: ... print quot;found '%s' in '%s'quot; % (term, doc) ... >>> searcher('moo') found 'moo' in 'The cow says moo.'
  • 10. Tokenising #2 >>> import re >>> word = re.compile('W+') >>> word.split(doc1) ['The', 'cow', 'says', 'moo', ''] >>> doc4 = quot;The dog-cow says moofquot; >>> word.split(doc4) ['The', 'dog', 'cow', 'says', 'moof']
  • 11. Tokenising #3 >>> word = re.compile('s|[^a-z-]', re.I) >>> word.split(doc4) ['The', 'dog-cow', 'says', 'moof', '']
  • 12. Data structures >>> doc1 = {'name':'doc 1', 'content':quot;The cow says moo.quot;} >>> doc2 = {'name':'doc 2', 'content':quot;The sheep says baa.quot;} >>> doc3 = {'name':'doc 3', 'content':quot;The dogs say woof.quot;} >>> doc4 = {'name':'doc 4', 'content':quot;The dog-cow says moof.quot;}
  • 13. Postings >>> postings = {} >>> for doc in docs: ... for token in word.split(doc['content']): ... if len(token) == 0: break ... doc_name = doc['name'] ... if token not in postings: ... postings[token.lower()] = [doc_name] ... else: ... postings[token.lower()].append(doc_name)
  • 14. Postings >>> postings {'sheep': ['doc 2'], 'says': ['doc 1', 'doc 2', 'doc 4'], 'cow': ['doc 1'], 'moof': ['doc 4'], 'dog-cow': ['doc 4'], 'woof': ['doc 3'], 'say': ['doc 3'], 'moo': ['doc 1'], 'baa': ['doc 2'], 'The': ['doc 1', 'doc 2', 'doc 3', 'doc 4'], 'dogs': ['doc 3']}
  • 15. O(log n) >>> def searcher(term): ... if term in postings: ... for match in postings[term]: ... print quot;found '%s' in '%s'quot; % (term, match) ... >>> searcher('says') found 'says' in 'doc 1' found 'says' in 'doc 2' found 'says' in 'doc 4'
  • 16. More postings ‘sheep’: [‘doc 2’, [2]] ‘says’: [‘doc 1’, [3], ‘doc 2’, [3], ‘doc 4’, [3]]
  • 17. and more postings ‘sheep’: [‘doc 2’, [‘field’: ‘body’], 2]] ‘google’: [‘intro’, [‘field’: ‘title’], 2]]
  • 18. tokenising #3 Punctuation Stemming Stop words Parts of Speech Entity Extraction Markup
  • 19. Logistics Storage (serialising, transporting, clustering) Updates Warming up
  • 20. ranking Density (tf–idf) Position Date Relationships Feedback Editorial
  • 21. interesting search Lucene (Hadoop, Solr, Nutch) OpenFTS / MySQL Sphinx Hyper Estraier Xapian Other index types