Building Mini‐Google in Ruby 


                                                                               Ilya Grigorik 
                                                                                        @igrigorik 


Building Mini-Google in Ruby    http://bit.ly/railsconf-pagerank    @igrigorik #railsconf 
postrank.com/topic/ruby 




                                The slides…                           Twitter                        My blog 


Ruby + Math 
                                                                             PageRank 
               Optimization 




   Misc Fun                     Examples                                           Indexing 


PageRank                                        PageRank + Ruby 




   Tools + Optimization         Examples                                        Indexing 

Consume with care… 
                     everything that follows is based on released / public domain info 




Search‐engine graveyard 
                                                                   Google did pretty well… 




Query: Ruby 




                                                                                              Results 




       1. Crawl                            2. Index                                          3. Rank 




                                                                   Search pipeline 
                                                                                    50,000‐foot view 



       1. Crawl                            2. Index                                            3. Rank 
             Bah                           Interesting                                      Fun 




Circa 1997-1998:
          CPU speed                                         333 MHz
          RAM                                               32-64 MB
          Index                                             27,000,000 documents
          Index refresh                                     roughly once a month
          PageRank computation                              several days

          Laptop CPU                                        2.1 GHz
          VM RAM                                            1 GB
          1-million-page web                                ~10 minutes



Creating & Maintaining an Inverted Index 
                                                                     DIY and the gotchas within 




    require 'set'

    pages = {
      "1" => "it is what it is",
      "2" => "what is it",
      "3" => "it is a banana"
    }

    index = {}

    pages.each do |page, content|
      # split on whitespace (the backslash in /\s/ was lost on the slide)
      content.split(/\s+/).each do |word|
        if index[word]
          index[word] << page
        else
          # Set.new expects an enumerable, so wrap the page id in an array
          index[word] = Set.new([page])
        end
      end
    end

    # Resulting index (hash order varies by Ruby version):
    # {
    #   "it"=>#<Set: {"1", "2", "3"}>,
    #   "a"=>#<Set: {"3"}>,
    #   "banana"=>#<Set: {"3"}>,
    #   "what"=>#<Set: {"1", "2"}>,
    #   "is"=>#<Set: {"1", "2", "3"}>
    # }

                                                 Building an Inverted Index 

                                                 Index structure: Word => [Document] 

 # query: "what is banana"
 p index["what"] & index["is"] & index["banana"]
 # > #<Set: {}>

 # query: "a banana"
 p index["a"] & index["banana"]
 # > #<Set: {"3"}>

 # query: "what is"
 p index["what"] & index["is"]
 # > #<Set: {"1", "2"}>

 # index:
 # {
 #   "it"=>#<Set: {"1", "2", "3"}>,
 #   "a"=>#<Set: {"3"}>,
 #   "banana"=>#<Set: {"3"}>,
 #   "what"=>#<Set: {"1", "2"}>,
 #   "is"=>#<Set: {"1", "2", "3"}>
 # }
                                                                   Querying the index 
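The queries above intersect one posting set per word by hand. A minimal sketch of the same idea as a reusable helper follows; the name `search` and its handling of unknown words are assumptions for illustration, not part of the talk. Note that `index[word]` returns `nil` for a word that was never indexed, so the helper substitutes an empty set.

```ruby
require 'set'

# The index from the slide, written out literally so the example is self-contained.
index = {
  "it"     => Set["1", "2", "3"],
  "a"      => Set["3"],
  "banana" => Set["3"],
  "what"   => Set["1", "2"],
  "is"     => Set["1", "2", "3"]
}

# Hypothetical helper: AND-query over an inverted index.
# Returns the set of pages that contain every word of the query.
def search(index, query)
  postings = query.split(/\s+/).map { |word| index.fetch(word, Set.new) }
  return Set.new if postings.empty?
  postings.inject { |result, set| result & set }
end

p search(index, "what is")          # pages 1 and 2
p search(index, "what is banana")   # empty set
```

The result is still an unordered set, which is exactly the ranking problem the later slides address.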

                                                                   What order? 
                                                                   For the query "what is": [1, 2] or [2, 1]? 

                                                 Building an Inverted Index 

    Hmmm? The same indexing loop as before leaves open questions:
      PDF, HTML, RSS?
      Lowercase / Upcase?
      Compact index?
      Stop words?
      Persistence?
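Two of the questions above (case and stop words) can be sketched in a few lines. This is only an illustration under stated assumptions: the stop-word list, the `tokenize` and `build_index` names, and the choice to strip punctuation are all hypothetical, not what the talk or Ferret does.

```ruby
require 'set'

# Illustrative stop-word list (an assumption, not taken from the slides).
STOP_WORDS = Set["a", "is", "it"]

# Lowercase the text, keep runs of letters/digits, drop stop words.
def tokenize(text)
  text.downcase.scan(/[a-z0-9]+/).reject { |word| STOP_WORDS.include?(word) }
end

# Build word => Set of page ids, creating each set on first use.
def build_index(pages)
  index = Hash.new { |hash, word| hash[word] = Set.new }
  pages.each do |page, content|
    tokenize(content).each { |word| index[word] << page }
  end
  index
end

pages = {
  "1" => "It is what it is",
  "2" => "What is it?",
  "3" => "It is a BANANA."
}

index = build_index(pages)
p index.keys.sort   # only "banana" and "what" survive the stop-word filter
```

Dropping stop words shrinks the index, but it also makes a query such as "it is" unanswerable, which is one reason the later slides hand the problem to a library.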

                Ferret is a high-performance, full-featured text search engine library written for Ruby



require 'ferret'
    include Ferret

    index = Index::Index.new()

    index << {:title => "1", :content => "it is what it is"}
    index << {:title => "2", :content => "what is it"}
    index << {:title => "3", :content => "it is a banana"}

    index.search_each('content:"banana"') do |id, score|
     puts "Score: #{score}, #{index[id][:title]} "
    end


    > Score: 1.0, 3




                                Hmmm? 




   class Ferret::Analysis::Analyzer                                        class Ferret::Search::BooleanQuery 
   class Ferret::Analysis::AsciiLetterAnalyzer                             class Ferret::Search::ConstantScoreQuery 
   class Ferret::Analysis::AsciiLetterTokenizer                            class Ferret::Search::Explanation 
   class Ferret::Analysis::AsciiLowerCaseFilter                            class Ferret::Search::Filter 
   class Ferret::Analysis::AsciiStandardAnalyzer                           class Ferret::Search::FilteredQuery 
   class Ferret::Analysis::AsciiStandardTokenizer                          class Ferret::Search::FuzzyQuery 
   class Ferret::Analysis::AsciiWhiteSpaceAnalyzer                         class Ferret::Search::Hit 
   class Ferret::Analysis::AsciiWhiteSpaceTokenizer                        class Ferret::Search::MatchAllQuery 
   class Ferret::Analysis::HyphenFilter                                    class Ferret::Search::MultiSearcher 
   class Ferret::Analysis::LetterAnalyzer                                  class Ferret::Search::MultiTermQuery 
   class Ferret::Analysis::LetterTokenizer                                 class Ferret::Search::PhraseQuery 
   class Ferret::Analysis::LowerCaseFilter                                 class Ferret::Search::PrefixQuery 
   class Ferret::Analysis::MappingFilter                                   class Ferret::Search::Query 
   class Ferret::Analysis::PerFieldAnalyzer                                class Ferret::Search::QueryFilter 
   class Ferret::Analysis::RegExpAnalyzer                                  class Ferret::Search::RangeFilter 
   class Ferret::Analysis::RegExpTokenizer                                 class Ferret::Search::RangeQuery 
   class Ferret::Analysis::StandardAnalyzer                                class Ferret::Search::Searcher 
   class Ferret::Analysis::StandardTokenizer                               class Ferret::Search::Sort 
   class Ferret::Analysis::StemFilter                                      class Ferret::Search::SortField 
   class Ferret::Analysis::StopFilter                                      class Ferret::Search::TermQuery 
   class Ferret::Analysis::Token                                           class Ferret::Search::TopDocs 
   class Ferret::Analysis::TokenStream                                     class Ferret::Search::TypedRangeFilter 
   class Ferret::Analysis::WhiteSpaceAnalyzer                              class Ferret::Search::TypedRangeQuery 
   class Ferret::Analysis::WhiteSpaceTokenizer                             class Ferret::Search::WildcardQuery 



ferret.davebalmain.com/trac 




Ranking Results 
                                                                       0‐60 with PageRank… 




    index.search_each('content:"the brown cow"') do |id, score|
     puts "Score: #{score}, #{index[id][:title]} "
    end

    > Score: 0.827, 3
    > Score: 0.523, 5
    > Score: 0.125, 4

    Relevance?

                     Doc 3     Doc 5     Doc 4
            the        4         3         5
          brown        1         3         1
            cow        1         4         1
          Score        6        10         7

                                                                Naïve: Term Frequency 

    Skew: raw term counts reward very common words such as "the". 

                     Doc 3     Doc 5     Doc 4
            the        4         3         5
          brown        1         3         1
            cow        1         4         1

      # of docs containing each word:
            the        6
          brown        3
            cow        4

      Total # of documents:                       10

      Score = TF * IDF
      TF  = # occurrences / # words
      IDF = ln(# docs / # docs with W)

                                                                                                       TF-IDF 
                                              Term Frequency * Inverse Document Frequency 


                     Doc 3     Doc 5     Doc 4
            the        4         3         5
          brown        1         3         1
            cow        1         4         1

      # of docs containing each word:
            the        6
          brown        3
            cow        4

      Total # of documents:                        10
      # words in document:                         10

      Doc #3 score for 'the':    4/10 * ln(10/6) = 0.204
      Doc #3 score for 'brown':  1/10 * ln(10/3) = 0.120
      Doc #3 score for 'cow':    1/10 * ln(10/4) = 0.092

                                Score = 0.204 + 0.120 + 0.092 = 0.416                                     TF-IDF 
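The worked example above can be reproduced in a few lines of Ruby. This is a sketch, not the talk's code: the constant and method names are made up for illustration, and the 10-words-per-document figure is the one given for document 3; applying it to documents 5 and 4 as well is an assumption.

```ruby
# Term counts per document, copied from the table above.
COUNTS = {
  3 => { "the" => 4, "brown" => 1, "cow" => 1 },
  5 => { "the" => 3, "brown" => 3, "cow" => 4 },
  4 => { "the" => 5, "brown" => 1, "cow" => 1 }
}
DOCS_WITH     = { "the" => 6, "brown" => 3, "cow" => 4 }  # docs containing each word
TOTAL_DOCS    = 10
WORDS_PER_DOC = 10  # given for doc 3; assumed for the other documents

# TF * IDF for one term in one document, with a natural-log IDF.
def tf_idf(doc, term)
  tf  = COUNTS[doc][term].to_f / WORDS_PER_DOC
  idf = Math.log(TOTAL_DOCS.to_f / DOCS_WITH[term])
  tf * idf
end

# Document score for a query: the sum of its per-term TF-IDF values.
def score(doc, terms)
  terms.map { |term| tf_idf(doc, term) }.inject(0.0, :+)
end

query = %w[the brown cow]
puts score(3, query).round(3)   # 0.416, matching the slide
```

Because "the" appears in 6 of the 10 documents, its IDF is small, so a document can no longer win just by repeating it.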

                      W1        W2    …     WK 

         Doc 1        15        23    … 
         Doc 2        24        12    … 
         …            …         …     … 
         Doc N 

         Size = N * K * size of Ruby object
                                                                                     Ouch. 
          Pages = N = 10,000
          Words = K = 2,000
          Ruby Object = 20+ bytes

          Footprint = 384 MB                                              Frequency Matrix 

NArray is a Numerical N-dimensional Array class (implemented in C) 

       # create new NArray. initialize with 0.
       NArray.new(typecode, size, ...)
       # 1 byte unsigned integer
       NArray.byte(size,...)
       # 2 byte signed integer
       NArray.sint(size,...)
       # 4 byte signed integer
       NArray.int(size,...)
       # single precision float
       NArray.sfloat(size,...)
       # double precision float
       NArray.float(size,...)
       # single precision complex
       NArray.scomplex(size,...)
       # double precision complex
       NArray.complex(size,...)
       # Ruby object
       NArray.object(size,...)

                                                                                                NArray 
                                                                     http://narray.rubyforge.org/ 
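To see why a typed array helps, here is a rough back-of-the-envelope sketch using the Frequency Matrix numbers (10,000 pages, 2,000 words) and the element widths listed above. It uses plain Ruby arithmetic rather than NArray itself, and the `footprint_mib` helper is a hypothetical name; the 20-byte figure is the slide's lower bound, so the first row lands close to, but not exactly on, the 384 MB quoted earlier.

```ruby
PAGES = 10_000   # N, from the Frequency Matrix slide
WORDS = 2_000    # K, from the Frequency Matrix slide

# Bytes per matrix cell. 20 is the slide's lower bound for a Ruby object;
# the others are the NArray element widths listed above.
BYTES_PER_CELL = {
  "Ruby object" => 20,
  "NArray.int"  => 4,
  "NArray.sint" => 2,
  "NArray.byte" => 1
}

# Footprint of a dense PAGES x WORDS matrix, in MiB (2**20 bytes).
def footprint_mib(bytes_per_cell)
  (PAGES * WORDS * bytes_per_cell) / (2**20).to_f
end

BYTES_PER_CELL.each do |type, bytes|
  puts format("%-12s %6.1f MiB", type, footprint_mib(bytes))
end
```

The narrower types trade range for space: a 1-byte cell can only count up to 255 occurrences and a 2-byte signed cell up to 32,767.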



Links as votes 




                                                                                          PageRank 
                Problem: link gaming                                                    the google juice 




                                P = 0.85    Follow link from page he/she is currently on. 

                                P = 0.15    Teleport to a random location on the web. 

                                                                                  Random Surfer 
                                                                                      powerful abstraction 




                                Follow link from page he/she is currently on. 
                                Teleport to a random location on the web. 

        [Figure: surfer moving between Page N, Page M and Page K]
                                                                                                      Surfin' 
                                                                           rinse & repeat, ad nauseam 




                                On Page P, clicks on link to K          P = 0.85 
                                On Page K, clicks on link to M          P = 0.85 
                                On Page M, teleports to X               P = 0.15 
                                …
                                                                                                      Surfin' 
                                                                           rinse & repeat, ad nauseam 




        [Figure: web graph with pages N, X, K and M, labeled with probabilities P = 0.05, 0.20, 0.15 and 0.6]

                                                        Analyzing the Web Graph 
                                                                                  extracting PageRank 




What is PageRank? 
                                                                                               It’s a scalar! 

                                                                        What is PageRank? 
                                                                                         it's a probability! 
          Higher Pr, Higher Importance? 




Teleportation? 
                                                                                             sci‐fi fans, … ? 




                                                   Reasons for teleportation 
                                                                       enumerating edge cases 

          1. No in-links!
          2. No out-links!
          3. Isolated Web

          [Figure: example graphs over pages N, K, X and M illustrating each case]



                                •  Breadth First Search 
                                •  Depth First Search 
                                •  A* Search 
                                •  Lexicographic Search 
                                •  Dijkstra's Algorithm 
                                •  Floyd-Warshall 
                                •  Triangulation and Comparability detection 

require 'gratr/import'

dg = Digraph[1,2, 2,3, 2,4, 4,5, 6,4, 1,6]

dg.directed? # true
dg.vertex?(4) # true
dg.edge?(2,4) # true
dg.vertices # [5, 6, 1, 2, 3, 4]

Graph[1,2,1,3,1,4,2,5].bfs # [1, 2, 3, 4, 5]
Graph[1,2,1,3,1,4,2,5].dfs # [1, 2, 5, 3, 4]

                                                                        Exploring Graphs 
                                                                                gratr.rubyforge.com 



        P(T) = 0.15 / # of pages

        [Figure: graph of pages N, K, X, M and M, each labeled P(T) = 0.03]

                                                                              Teleportation 
                                                                                                  probabilities 



    Assume the web is N pages big 
    Assume that the probability of teleportation (t) is 0.15, and of following a link (s) is 0.85 
    Assume that the teleportation probability (E) is uniform 
    Assume that you start on any random page (uniform distribution L) 

    Then after one step, the probability you're on page X is: 

                      PageRank: Simplified Mathematical Def'n 
                                                                    cause that's how we roll 
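The one-step formula is not present in the text above. A plausible reconstruction from the four assumptions follows, under one extra assumption that is not stated on the slide: $G$ is the column-normalized link matrix, so $G_{XY}$ is the probability of following a link from page $Y$ to page $X$.

```latex
P_1(X) \;=\; s \sum_{Y=1}^{N} G_{XY}\, L(Y) \;+\; t\, E(X),
\qquad
L(Y) = E(X) = \frac{1}{N}, \quad s = 0.85, \quad t = 0.15
```

In vector form this reads $P_1 = s\,G\,L + t\,E$: with probability $s$ the surfer arrives at $X$ through a link, and with probability $t$ by teleporting there.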



Link Graph 

                           1    2    …    …    N 
              1            1    0    …    …    0      <- no link from 1 to N 
              2            0    1    …    …    1 
              …            …    …    …    …    … 
              …            …    …    …    …    … 
              N            0    1    …    …    1 

                     Huge!                                          G = The Link Graph 
                                                                           ginormous and sparse 



                                    Page     =>         Links to… 
                                {
                                    "1"      =>         [25, 26],
                                    "2"      =>         [1],
                                    "5"      =>         [123,2],
                                    "6"      =>         [67, 1]
                                }

                                                                            G as a dictionary 
                                                                                            more compact… 
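A small sketch of why the dictionary form is more compact: the dense matrix stores one cell per pair of pages, while the dictionary stores one entry per actual link. The four-page graph and the `to_dense` helper below are hypothetical, chosen only to keep the numbers easy to check.

```ruby
# Hypothetical link graph: page id => ids of the pages it links to.
links = {
  1 => [2, 3],
  2 => [3],
  3 => [1],
  4 => [3]
}

# Dense N x N form: row = linking page, column = linked page (1 = link, 0 = none).
# Page ids are assumed to run from 1 to n.
def to_dense(links, n)
  Array.new(n) do |from|
    row = Array.new(n, 0)
    links.fetch(from + 1, []).each { |to| row[to - 1] = 1 }
    row
  end
end

dense        = to_dense(links, 4)
dense_cells  = dense.flatten.size          # grows with N * N
stored_links = links.values.flatten.size   # grows with the number of links

puts "dense cells: #{dense_cells}, stored links: #{stored_links}"
```

For a real web graph the gap is far larger, since each page links to only a tiny fraction of all pages.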



                                Follow link from page he/she is currently on. 
                                Teleport to a random location on the web. 

                                                                           Computing PageRank 
                                                                                              the tedious way 
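The tedious way can be sketched as repeated application of the two surfer rules until the probabilities settle. This is a minimal illustration in plain Ruby, not the talk's implementation: the four-page graph is hypothetical, the fixed iteration count stands in for a proper convergence check, and treating a page with no out-links as linking to every page is an assumption.

```ruby
# Hypothetical link graph: page id => ids of the pages it links to.
LINKS = { 1 => [2, 3], 2 => [3], 3 => [1], 4 => [3] }

# Iteratively redistribute rank: s follows links, t = 1 - s teleports uniformly.
def pagerank(links, s = 0.85, iterations = 100)
  pages = (links.keys + links.values.flatten).uniq.sort
  n = pages.size
  t = 1.0 - s
  rank = pages.map { |page| [page, 1.0 / n] }.to_h   # uniform start

  iterations.times do
    nxt = pages.map { |page| [page, t / n] }.to_h    # teleportation share
    pages.each do |page|
      targets = links.fetch(page, [])
      targets = pages if targets.empty?   # assumption: dangling page jumps anywhere
      share = s * rank[page] / targets.size
      targets.each { |target| nxt[target] += share }
    end
    rank = nxt
  end
  rank
end

ranks = pagerank(LINKS)
ranks.sort_by { |_, r| -r }.each { |page, r| puts format("page %d: %.4f", page, r) }
```

Page 4 has no in-links, so it ends up with only its teleportation share of 0.15 / 4, while page 3, which every other page links to, ranks highest.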



Don’t trust me! Verify it yourself! 




                                Identity matrix 

                                                                      Computing PageRank 
                                                                                               in one swoop 
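The closed-form expression is not present in the text above. A sketch of the derivation follows, assuming $G$ is the column-normalized link matrix, $E$ the uniform teleportation vector, and $s = 0.85$, $t = 1 - s$; the result has the same shape as the expression `t*((i-s*g).invert)*p` in the PageRank in Ruby code, where `p` plays the role of $E$.

```latex
\begin{aligned}
P_{k+1} &= s\,G\,P_k + t\,E            && \text{one step of the random surfer} \\
P       &= s\,G\,P + t\,E              && \text{steady state: } P_{k+1} = P_k = P \\
(I - s\,G)\,P &= t\,E                  && I \text{ is the identity matrix} \\
P       &= t\,(I - s\,G)^{-1} E
\end{aligned}
```

To verify it yourself, substitute the last line back into the steady-state equation, or compare it against the result of iterating the first line many times on a small graph.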



Enough hand‐waving, dammit! 
                                                                                   show me the code 




Hot, Fast, Awesome 


                                                                   Birth of EM‐Proxy 
                                                                              flash of the obvious 




http://rb-gsl.rubyforge.org/ 




                                                                         Hot, Fast, Awesome 




                       Click there!  …  Give yourself a weekend.  


h:p://ruby‐gsl.sourceforge.net/ 
                       Click there!  …  Give yourself a weekend.  


Building Mini‐Google in Ruby       h:p://bit.ly/railsconf‐pagerank    @igrigorik #railsconf 
require "gsl"
include GSL

# INPUT:  link structure matrix (NxN)
# OUTPUT: PageRank scores (Nx1 matrix)
def pagerank(g)
  # Verify NxN
  raise ArgumentError, "link matrix must be NxN" if g.size1 != g.size2

  i = Matrix.I(g.size1)                         # identity matrix
  p = (1.0 / g.size1) * Matrix.ones(g.size1, 1) # teleportation vector

  # Constants
  s = 0.85   # probability of following a link
  t = 1 - s  # probability of teleportation

  # PageRank!
  t * ((i - s * g).invert) * p
end



PageRank in Ruby
6 lines, or less



N: P = 0.33    X: P = 0.33    K: P = 0.33

pagerank(Matrix[[0,0,1], [0,0,1], [1,0,0]])
> [0.33, 0.33, 0.33]

Ex: Circular Web
testing intuition…




N: P = 0.05    X: P = 0.07    K: P = 0.87

pagerank(Matrix[[0,0,0], [0.5,0,0], [0.5,1,1]])
> [0.05, 0.07, 0.87]

Ex: All roads lead to K
testing intuition…
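Both examples can be checked without installing GSL. The following is a sketch that applies the same closed form with Ruby's `matrix` library instead of rb-gsl; `pagerank_stdlib` is a hypothetical name, and the swap of library is an assumption made here for portability, not something the slides do.

```ruby
require "matrix"

# Same closed form as the GSL version: t * (I - s*G)^-1 * p
def pagerank_stdlib(g, s: 0.85)
  raise ArgumentError, "link matrix must be NxN" unless g.square?

  n = g.row_count
  i = Matrix.identity(n)
  p = Matrix.column_vector(Array.new(n, 1.0 / n))  # uniform teleportation vector
  t = 1.0 - s

  ((i - g * s).inverse * p * t).column(0).to_a
end

circular = Matrix[[0, 0, 1], [0, 0, 1], [1, 0, 0]]
all_to_k = Matrix[[0, 0, 0], [0.5, 0, 0], [0.5, 1, 1]]

puts pagerank_stdlib(circular).map { |x| x.round(2) }.inspect
puts pagerank_stdlib(all_to_k).map { |x| x.round(2) }.inspect
```

The exact score for K in the second example is 0.87875, so rounding to two places prints 0.88 where the slide shows 0.87.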




PageRank + Ferret
awesome search, ftw!




1: P = 0.05    2: P = 0.07    3: P = 0.87

require 'ferret'
include Ferret
index = Index::Index.new()

index << {:title => "1", :content => "it is what it is", :pr => 0.05 }
index << {:title => "2", :content => "what is it", :pr => 0.07 }
index << {:title => "3", :content => "it is a banana", :pr => 0.87 }

Store PageRank




# TF-IDF search: results ordered by relevance score
index.search_each('content:"world"') do |id, score|
  puts "Score: #{score}, #{index[id][:title]} (PR: #{index[id][:pr]})"
end

puts "*" * 50

# PageRank FTW! Same query, sorted by the stored PageRank (descending)
sf_pr = Search::SortField.new(:pr, :type => :float, :reverse => true)

index.search_each('content:"world"', :sort => sf_pr) do |id, score|
  puts "Score: #{score}, #{index[id][:title]}, (PR: #{index[id][:pr]})"
end

# TF-IDF order (others):
#   Score: 0.267119228839874, 3 (PR: 0.87)
#   Score: 0.17807948589325, 1 (PR: 0.05)
#   Score: 0.17807948589325, 2 (PR: 0.07)
#   ***********************************
# PageRank order (Google):
#   Score: 0.267119228839874, 3, (PR: 0.87)
#   Score: 0.17807948589325, 2, (PR: 0.07)
#   Score: 0.17807948589325, 1, (PR: 0.05)
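The effect of the sort field can be shown without Ferret. The following is a minimal sketch in plain Ruby, assuming the search has already produced a list of hits; the `hits` array is hypothetical, with scores rounded from the output above.

```ruby
# Hypothetical search hits: relevance score plus the stored PageRank.
hits = [
  { title: "3", score: 0.267, pr: 0.87 },
  { title: "1", score: 0.178, pr: 0.05 },
  { title: "2", score: 0.178, pr: 0.07 }
]

# Relevance only: documents 1 and 2 tie, so their order is arbitrary.
by_relevance = hits.sort_by { |h| -h[:score] }

# Stored PageRank, descending: the tie is resolved by link authority.
by_pagerank = hits.sort_by { |h| -h[:pr] }

puts by_relevance.map { |h| h[:title] }.inspect
puts by_pagerank.map { |h| h[:title] }.inspect
```

Sorting purely by PageRank ignores relevance altogether, which is only reasonable here because every hit already matched the query.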



Search*: Graphs are ubiquitous!
PageRank is a general purpose hammer

Username               GitCred
==============================
37signals              10.00
imbriaco               9.76
why                    8.74
rails                  8.56
defunkt                8.17
technoweenie           7.83
jeresig                7.60
mojombo                7.51
yui                    7.34
drnic                  7.34
pjhyett                6.91
wycats                 6.85
dhh                    6.84

http://bit.ly/3YQPU

PageRank + Social Graph
GitHub




Hmm…

Analyze the social graph:
- Filter messages by ‘TwitterRank’
- Suggest users by ‘TwitterRank’
- …

PageRank + Social Graph
Twitter




PageRank + Product Graph
E-commerce

Link items purchased in same cart… Run PR on it.
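The product-graph idea can be sketched in a few lines. The following assumes that two items are linked in both directions whenever they share a cart, weighted by how often that happens; the `carts` data and the `co_purchase_links` and `rank_items` helpers are hypothetical.

```ruby
# Hypothetical carts: each is a list of items bought together.
carts = [
  %w[ruby_book coffee],
  %w[ruby_book rails_book coffee],
  %w[rails_book coffee]
]

# Link every pair of items that shared a cart (both directions, weighted by count).
def co_purchase_links(carts)
  links = Hash.new { |h, k| h[k] = Hash.new(0) }
  carts.each do |cart|
    cart.uniq.permutation(2) { |a, b| links[a][b] += 1 }
  end
  links
end

# Power iteration over the weighted graph; s is the probability of following a link.
def rank_items(links, s: 0.85, iterations: 100)
  items = links.keys
  rank = items.to_h { |item| [item, 1.0 / items.size] }

  iterations.times do
    nxt = items.to_h { |item| [item, (1.0 - s) / items.size] }
    links.each do |from, outs|
      total = outs.values.sum.to_f
      outs.each { |to, weight| nxt[to] += s * rank[from] * weight / total }
    end
    rank = nxt
  end
  rank
end

ranks = rank_items(co_purchase_links(carts))
ranks.sort_by { |_, r| -r }.each { |item, r| puts format("%-12s %.3f", item, r) }
```

In this toy data coffee appears in every cart, so it ends up with the highest score.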



PageRank = Powerful Hammer
use it!

Personalization
how would you do it?

Teleportation distribution doesn’t have to be uniform!

yahoo.com is my homepage!

PageRank + Personalization
customize the teleportation vector
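Customizing the teleportation vector amounts to changing one input. The following is a minimal sketch in plain Ruby, assuming a column-stochastic link matrix and power iteration; `personalized_pagerank` and the three-page cycle are hypothetical.

```ruby
# g[i][j]     = probability of moving from page j to page i.
# teleport[i] = probability that a teleport lands on page i (sums to 1).
def personalized_pagerank(g, teleport, s: 0.85, iterations: 100)
  n = g.size
  rank = teleport.dup
  iterations.times do
    rank = (0...n).map do |i|
      s * (0...n).sum { |j| g[i][j] * rank[j] } + (1.0 - s) * teleport[i]
    end
  end
  rank
end

# Hypothetical 3-page cycle: 0 -> 1 -> 2 -> 0.
g = [[0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]

uniform  = personalized_pagerank(g, [1.0 / 3, 1.0 / 3, 1.0 / 3])
homepage = personalized_pagerank(g, [1.0, 0.0, 0.0])  # every teleport lands on page 0

puts uniform.map { |x| x.round(3) }.inspect
puts homepage.map { |x| x.round(3) }.inspect
```

With a uniform vector the cycle scores every page equally; sending every teleport to page 0 (the "homepage") lifts page 0 and the pages closest downstream of it.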




Make pages with links!

Gaming PageRank
for fun and profit (I don’t endorse it)
http://bit.ly/pagerank-spam

Slides: http://bit.ly/railsconf-pagerank

Ferret: http://bit.ly/ferret
RB-GSL: http://bit.ly/rb-gsl

PageRank on Wikipedia: http://bit.ly/wp-pagerank
Gaming PageRank: http://bit.ly/pagerank-spam

Michael Nielsen’s lectures on PageRank:
http://michaelnielsen.org/blog

Questions?

The slides…    Twitter    My blog


Building A Mini Google High Performance Computing In Ruby Presentation 1

  • 1. Building Mini-Google in Ruby  Ilya Grigorik  @igrigorik  Building Mini-Google in Ruby  http://bit.ly/railsconf-pagerank  @igrigorik #railsconf
  • 2. postrank.com/topic/ruby  The slides…  Twitter  My blog
  • 3. Ruby + Math  PageRank  Optimization  Misc Fun  Examples  Indexing
  • 4. PageRank  PageRank + Ruby  Tools + Optimization  Examples  Indexing
  • 5. Consume with care…  everything that follows is based on released / public domain info
  • 6. Search-engine graveyard  Google did pretty well…
  • 7. Query: Ruby  Results  1. Crawl  2. Index  3. Rank  Search pipeline  50,000-foot view
  • 8. Query: Ruby  Results  1. Crawl  2. Index  3. Rank  Bah  Interesting  Fun
  • 9. circa 1997-1998:
      CPU Speed: 333 MHz
      RAM: 32-64 MB
      Index: 27,000,000 documents
      Index refresh: once a month~ish
      PageRank computation: several days
      Laptop CPU: 2.1 GHz
      VM RAM: 1 GB
      1-Million page web: ~10 minutes
  • 10. Creating & Maintaining an Inverted Index  DIY and the gotchas within
  • 11. Building an Inverted Index
      require 'set'

      pages = {
        "1" => "it is what it is",
        "2" => "what is it",
        "3" => "it is a banana"
      }

      index = {}
      pages.each do |page, content|
        content.split(/\s/).each do |word|     # split on whitespace
          if index[word]
            index[word] << page
          else
            index[word] = Set.new([page])      # Set.new needs an enumerable, not a bare String
          end
        end
      end

      # resulting index (word => pages containing it):
      # { "it"=>#<Set: {"1", "2", "3"}>,
      #   "a"=>#<Set: {"3"}>,
      #   "banana"=>#<Set: {"3"}>,
      #   "what"=>#<Set: {"1", "2"}>,
      #   "is"=>#<Set: {"1", "2", "3"}> }
  • 13. Building an Inverted Index (same code as slide 11)  Word => [Document]
  • 14. Querying the index
      # query: "what is banana"
      p index["what"] & index["is"] & index["banana"]
      # > #<Set: {}>

      # query: "a banana"
      p index["a"] & index["banana"]
      # > #<Set: {"3"}>

      # query: "what is"
      p index["what"] & index["is"]
      # > #<Set: {"1", "2"}>

      # index being queried:
      # { "it"=>#<Set: {"1", "2", "3"}>,
      #   "a"=>#<Set: {"3"}>,
      #   "banana"=>#<Set: {"3"}>,
      #   "what"=>#<Set: {"1", "2"}>,
      #   "is"=>#<Set: {"1", "2", "3"}> }
  • 17. Querying the index (same code as slide 14)  What order?  [1, 2] or [2, 1]
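A minimal sketch of wrapping the set intersections above in a reusable query helper. The `search` name and the choice to return an empty set for unknown words are assumptions made for illustration; they are not part of the slides.

```ruby
require 'set'

# Toy index from the slides: word => set of page ids.
INDEX = {
  "it"     => Set["1", "2", "3"],
  "a"      => Set["3"],
  "banana" => Set["3"],
  "what"   => Set["1", "2"],
  "is"     => Set["1", "2", "3"]
}

# Hypothetical helper: AND-query over all words in the query string.
# Words missing from the index yield an empty result instead of a nil error.
def search(index, query)
  words = query.split
  return Set.new if words.empty?
  words.map { |word| index.fetch(word, Set.new) }.reduce(:&)
end

p search(INDEX, "what is")         # pages containing both words
p search(INDEX, "what is banana")  # no page has all three words
p search(INDEX, "purple banana")   # unknown word gives an empty set
```

The result is still an unordered set, which is exactly the "What order?" problem that ranking has to solve.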
  • 18. Building an Inverted Index (same code as slide 11)  Hmmm?  PDF, HTML, RSS?  Lowercase / Upcase?  Compact Index?  Stop words?  Persistence?
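Two of the gotchas listed above (case and stop words) can be sketched in a few lines. The stop-word list and the `build_index` name are assumptions for this toy example only; a real analyzer would do considerably more.

```ruby
require 'set'

# Assumed stop-word list for this toy example only.
STOP_WORDS = Set["it", "is", "a"]

# Build word => Set of page ids, lowercasing and dropping stop words.
def build_index(pages, stop_words = STOP_WORDS)
  index = Hash.new { |hash, word| hash[word] = Set.new }
  pages.each do |page, content|
    content.downcase.scan(/\w+/).each do |word|
      next if stop_words.include?(word)
      index[word] << page
    end
  end
  index
end

pages = {
  "1" => "It is what it is",
  "2" => "What is it?",
  "3" => "It is a BANANA"
}

index = build_index(pages)
p index.keys.sort   # only the non-stop words remain
p index["what"]     # pages that mention "what" in any case
```

Using `scan(/\w+/)` instead of splitting on whitespace also strips punctuation such as the trailing "?" in page 2.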
  • 20. Ferret is a high-performance, full-featured text search engine library written for Ruby
  • 21. require 'ferret'
      include Ferret

      index = Index::Index.new()
      index << {:title => "1", :content => "it is what it is"}
      index << {:title => "2", :content => "what is it"}
      index << {:title => "3", :content => "it is a banana"}

      index.search_each('content:"banana"') do |id, score|
        puts "Score: #{score}, #{index[id][:title]} "
      end
      # > Score: 1.0, 3
  • 22. (same code as slide 21)  Hmmm?
  • 23. Ferret::Analysis classes: Analyzer, AsciiLetterAnalyzer, AsciiLetterTokenizer, AsciiLowerCaseFilter, AsciiStandardAnalyzer, AsciiStandardTokenizer, AsciiWhiteSpaceAnalyzer, AsciiWhiteSpaceTokenizer, HyphenFilter, LetterAnalyzer, LetterTokenizer, LowerCaseFilter, MappingFilter, PerFieldAnalyzer, RegExpAnalyzer, RegExpTokenizer, StandardAnalyzer, StandardTokenizer, StemFilter, StopFilter, Token, TokenStream, WhiteSpaceAnalyzer, WhiteSpaceTokenizer
      Ferret::Search classes: BooleanQuery, ConstantScoreQuery, Explanation, Filter, FilteredQuery, FuzzyQuery, Hit, MatchAllQuery, MultiSearcher, MultiTermQuery, PhraseQuery, PrefixQuery, Query, QueryFilter, RangeFilter, RangeQuery, Searcher, Sort, SortField, TermQuery, TopDocs, TypedRangeFilter, TypedRangeQuery, WildcardQuery
  • 24. ferret.davebalmain.com/trac
  • 25. Ranking Results  0-60 with PageRank…
  • 26. Naïve: Term Frequency  Relevance?
      index.search_each('content:"the brown cow"') do |id, score|
        puts "Score: #{score}, #{index[id][:title]} "
      end
      # > Score: 0.827, 3
      # > Score: 0.523, 5
      # > Score: 0.125, 4

               Doc 3  Doc 5  Doc 4
      the        4      3      5
      brown      1      3      1
      cow        1      4      1
      Score      6     10      7
  • 27. Naïve: Term Frequency (same code and table as slide 26)  Skew (annotation next to the "the" row)
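The naive scores in the table are just column sums of raw term counts. A small sketch that reproduces them; the `naive_score` helper name is an assumption, and the counts are copied from the table.

```ruby
# Raw term counts per document, copied from the table above.
COUNTS = {
  3 => { "the" => 4, "brown" => 1, "cow" => 1 },
  5 => { "the" => 3, "brown" => 3, "cow" => 4 },
  4 => { "the" => 5, "brown" => 1, "cow" => 1 }
}

# Naive score: add up the raw counts of every query term.
def naive_score(term_counts, query_terms)
  query_terms.sum { |term| term_counts.fetch(term, 0) }
end

query  = %w[the brown cow]
scores = COUNTS.map { |doc, counts| [doc, naive_score(counts, query)] }.to_h
p scores
```

Document 4 outscores document 3 only because it repeats "the" more often, which is the skew the slide points at.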
  • 28. TF-IDF  Term Frequency * Inverse Document Frequency
      Score = TF * IDF
      TF  = # occurrences / # words
      IDF = ln(# docs / # docs with W)

               Doc 3  Doc 5  Doc 4   # of docs containing the word
      the        4      3      5       6
      brown      1      3      1       3
      cow        1      4      1       4

      Total # of documents: 10
      Skew (annotation on the raw counts)
  • 29. TF-IDF
      Total # of documents: 10  # words in document: 10
      Doc #3 score for 'the':   4/10 * ln(10/6) = 0.204
      Doc #3 score for 'brown': 1/10 * ln(10/3) = 0.120
      Doc #3 score for 'cow':   1/10 * ln(10/4) = 0.092
      Score = 0.204 + 0.120 + 0.092 = 0.416
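The worked example for document 3 can be checked with a few lines of Ruby. This is a sketch using the natural log, as the arithmetic above does; the `tf_idf` helper and constant names are assumptions.

```ruby
# TF-IDF for one term, using the natural log as in the worked example.
def tf_idf(occurrences, words_in_doc, total_docs, docs_with_term)
  tf  = occurrences.to_f / words_in_doc
  idf = Math.log(total_docs.to_f / docs_with_term)
  tf * idf
end

TOTAL_DOCS   = 10
WORDS_IN_DOC = 10   # document 3 has 10 words

# term => [occurrences in doc 3, number of docs containing the term]
DOC3_TERMS = { "the" => [4, 6], "brown" => [1, 3], "cow" => [1, 4] }

parts = DOC3_TERMS.map do |term, (occurrences, docs_with_term)|
  [term, tf_idf(occurrences, WORDS_IN_DOC, TOTAL_DOCS, docs_with_term)]
end.to_h

parts.each { |term, score| puts format("%-6s %.3f", term, score) }
puts format("total  %.3f", parts.values.sum)
```

Although "the" occurs four times, its contribution is damped because six of the ten documents contain it.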
  • 30. Frequency Matrix
               W1   W2   …   WN
      Doc 1    15   23   …
      Doc 2    24   12   …
      …
      Doc K
      Size = N * K * size of Ruby object
      Pages = N = 10,000  Words = K = 2,000  Ruby Object = 20+ bytes
      Footprint = 384 MB  Ouch.
  • 31. NArray  http://narray.rubyforge.org/
      NArray is a Numerical N-dimensional Array class (implemented in C)

      # create new NArray. initialize with 0.
      NArray.new(typecode, size, ...)
      # 1 byte unsigned integer
      NArray.byte(size,...)
      # 2 byte signed integer
      NArray.sint(size,...)
      # 4 byte signed integer
      NArray.int(size,...)
      # single precision float
      NArray.sfloat(size,...)
      # double precision float
      NArray.float(size,...)
      # single precision complex
      NArray.scomplex(size,...)
      # double precision complex
      NArray.complex(size,...)
      # Ruby object
      NArray.object(size,...)
  • 32. NArray  http://narray.rubyforge.org/  NArray is a Numerical N-dimensional Array class (implemented in C)
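A back-of-the-envelope sketch of why a packed numeric array helps. It uses plain arithmetic only (no NArray calls); the 20-byte figure is the slides' estimate for a Ruby object, and the other widths are the NArray element sizes listed above. With exactly 20 bytes per cell the total comes to roughly 381 MB, the same ballpark as the slide's 384 MB.

```ruby
PAGES = 10_000
WORDS = 2_000
CELLS = PAGES * WORDS

# Bytes per matrix cell. The 20-byte Ruby object figure is the slide's
# estimate; 4, 2 and 1 are the NArray int/sint/byte element widths above.
BYTES_PER_CELL = {
  "Ruby object" => 20,
  "NArray.int"  => 4,
  "NArray.sint" => 2,
  "NArray.byte" => 1
}

def megabytes(bytes)
  bytes / (1024.0 * 1024.0)
end

BYTES_PER_CELL.each do |label, width|
  puts format("%-12s %7.1f MB", label, megabytes(CELLS * width))
end
```

A 1-byte cell caps each count at 255, so the choice of element type is a trade-off between footprint and range.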
  • 33. PageRank  the google juice  Links as votes  Problem: link gaming
  • 34. Random Surfer  powerful abstraction  P = 0.85: Follow link from page he/she is currently on.  P = 0.15: Teleport to a random location on the web.
  • 35. Surfin'  rinse & repeat, ad nauseam  Follow link from page he/she is currently on.  Teleport to a random location on the web.  Page K  Page N  Page M
  • 36. Surfin'  rinse & repeat, ad nauseam  On Page P, clicks on link to K (P = 0.85)  On Page K, clicks on link to M (P = 0.85)  On Page M, teleports to X (P = 0.15)  …
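The random surfer can be simulated directly. This is a sketch on a made-up four-page web: the link structure, the `surf` name, and the fixed random seed are all assumptions for illustration. The fraction of steps spent on each page approximates that page's PageRank.

```ruby
# Toy web (an assumption for illustration): page => pages it links to.
LINKS = {
  "P" => ["K"],
  "K" => ["M"],
  "M" => ["P", "K"],
  "X" => ["M"]
}

# Simulate the random surfer and return the fraction of time spent per page.
def surf(links, steps:, follow_probability: 0.85, rng: Random.new(42))
  pages  = links.keys
  visits = Hash.new(0)
  page   = pages.sample(random: rng)
  steps.times do
    visits[page] += 1
    out = links[page]
    page = if out.empty? || rng.rand >= follow_probability
             pages.sample(random: rng)   # teleport anywhere
           else
             out.sample(random: rng)     # follow a link on the current page
           end
  end
  visits.transform_values { |count| count.to_f / steps }
end

share = surf(LINKS, steps: 100_000)
share.sort_by { |_, s| -s }.each { |page, s| puts format("%s %.3f", page, s) }
```

Page X has no in-links in this toy web, so the surfer only ever arrives there by teleporting and it ends up with the smallest share.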
  • 37. Analyzing the Web Graph  extracting PageRank  P = 0.05  P = 0.20  X  N  P = 0.15  K  M  P = 0.6
  • 38. What is PageRank?  It's a scalar!
  • 39. What is PageRank?  it's a probability!  P = 0.05  P = 0.20  X  N  P = 0.15  K  M  P = 0.6
  • 40. What is PageRank?  it's a probability!  Higher Pr, Higher Importance?  P = 0.05  P = 0.20  X  N  P = 0.15  K  M  P = 0.6
  • 41. Teleportation?  sci-fi fans, … ?
  • 42. Reasons for teleportation  enumerating edge cases  1. No in-links!  2. No out-links!  3. Isolated Web  (pages X, N, K, M)
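The first two edge cases are easy to detect once the link graph is stored as a hash. A sketch on a hypothetical four-page graph; the page names are borrowed from the figure, but the links and helper names are assumptions.

```ruby
# Hypothetical link graph: page => pages it links to.
LINK_GRAPH = {
  "X" => ["N"],
  "N" => ["K"],
  "K" => [],        # no out-links: the surfer would get stuck here
  "M" => ["M"]      # isolated: only links to itself
}

# Pages that no other page links to (self-links do not count).
def no_in_links(graph)
  linked = graph.flat_map { |page, targets| targets - [page] }
  graph.keys - linked
end

# Pages with nowhere to go (self-links do not count).
def no_out_links(graph)
  graph.select { |page, targets| (targets - [page]).empty? }.keys
end

p no_in_links(LINK_GRAPH)
p no_out_links(LINK_GRAPH)
```

Teleportation makes every one of these pages reachable and escapable, which is what keeps the surfer model well defined.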
  • 43. Exploring Graphs  gratr.rubyforge.com
      •  Breadth First Search
      •  Depth First Search
      •  A* Search
      •  Lexicographic Search
      •  Dijkstra's Algorithm
      •  Floyd-Warshall
      •  Triangulation and Comparability detection

      require 'gratr/import'

      dg = Digraph[1,2, 2,3, 2,4, 4,5, 6,4, 1,6]
      dg.directed?   # true
      dg.vertex?(4)  # true
      dg.edge?(2,4)  # true
      dg.vertices    # [5, 6, 1, 2, 3, 4]

      Graph[1,2,1,3,1,4,2,5].bfs # [1, 2, 3, 4, 5]
      Graph[1,2,1,3,1,4,2,5].dfs # [1, 2, 5, 3, 4]
  • 44. Teleportation  probabilities  P(T) = 0.15 / # of pages  (figure: every page labelled P(T) = 0.03)
  • 45. PageRank: Simplified Mathematical Def'n  cause that's how we roll
      Assume the web is N pages big
      Assume that probability of teleportation (t) is 0.15, and following link (s) is 0.85
      Assume that teleportation probability (E) is uniform
      Assume that you start on any random page (uniform distribution L)
      Then after one step, the probability you're on page X is:
  • 46. G = The Link Graph  ginormous and sparse  Huge!
             1   2   …   …   N
      1      1   0   …   …   0     (0 = no link from 1 to N)
      2      0   1   …   …   1
      …      …   …   …   …   …
      N      0   1   …   …   1
  • 47. G as a dictionary  more compact…
      # Page => Links to…
      { "1" => [25, 26],
        "2" => [1],
        "5" => [123, 2],
        "6" => [67, 1] }
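A sketch of expanding the compact dictionary back into a dense matrix when one is needed. It assumes 0-based page ids and a column convention in which entry [to][from] holds the probability of following a link from `from` to `to`; that convention is inferred from the "All roads lead to K" example later in the deck, not stated in the slides. The `link_matrix` name is hypothetical.

```ruby
# Page ids are 0-based here so they double as matrix indices (an
# assumption; the slide uses arbitrary page numbers).
LINKS_TO = { 0 => [1, 2], 1 => [2], 2 => [2] }

# Returns an NxN array of rows where g[to][from] is the probability of
# moving from page `from` to page `to` when following a link.
def link_matrix(links_to, size)
  g = Array.new(size) { Array.new(size, 0.0) }
  links_to.each do |from, targets|
    targets.each { |to| g[to][from] = 1.0 / targets.size }
  end
  g
end

link_matrix(LINKS_TO, 3).each { |row| p row }
```

Each column sums to 1 because a page's out-links share its vote equally; the dictionary form stores only the non-zero entries.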
  • 48. Computing PageRank  the tedious way  Follow link from page he/she is currently on.  Page K  Teleport to a random location on the web.
  • 49. Computing PageRank  in one swoop  Identity matrix  Don't trust me! Verify it yourself!
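The formulas on these two slides did not survive in this transcript. The following is a reconstruction, not the author's original notation: it is chosen to agree with the `pagerank` code later in the deck (`t*((i-s*g).invert)*p`) and assumes G is the column-stochastic link matrix, E the uniform teleportation vector, s = 0.85 and t = 0.15. The symbol q for the PageRank vector is an assumption.

```latex
\begin{align}
q_{k+1}     &= s\,G\,q_k + t\,E     && \text{one step of the surfer (the tedious way)} \\
q           &= s\,G\,q + t\,E       && \text{steady state: another step changes nothing} \\
(I - sG)\,q &= t\,E                 && \text{collect the } q \text{ terms; } I \text{ is the identity matrix} \\
q           &= t\,(I - sG)^{-1}\,E  && \text{PageRank in one swoop}
\end{align}
```

Because s < 1 and the columns of G sum to at most 1, the matrix I - sG is invertible, so the last line is always defined.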
  • 50. Enough hand-waving, dammit!  show me the code
  • 51. Hot, Fast, Awesome  Birth of EM-Proxy  flash of the obvious
  • 52. http://rb-gsl.rubyforge.org/  Hot, Fast, Awesome  Click there!  …  Give yourself a weekend.
  • 53. http://ruby-gsl.sourceforge.net/  Click there!  …  Give yourself a weekend.
  • 54. PageRank in Ruby  6 lines, or less  Verify NxN
      require "gsl"
      include GSL

      # INPUT: link structure matrix (NxN)
      # OUTPUT: pagerank scores
      def pagerank(g)
        raise if g.size1 != g.size2

        i = Matrix.I(g.size1)                       # identity matrix
        p = (1.0/g.size1) * Matrix.ones(g.size1,1)  # teleportation vector

        s = 0.85                                    # probability of following a link
        t = 1-s                                     # probability of teleportation

        t*((i-s*g).invert)*p
      end
  • 55. PageRank in Ruby  6 lines, or less  (same code as slide 54)  Constants…
  • 56. PageRank in Ruby  6 lines, or less  (same code as slide 54)  PageRank!
  • 57. Ex: Circular Web  testing intuition…  X: P = 0.33  N: P = 0.33  K: P = 0.33
      pagerank(Matrix[[0,0,1], [0,0,1], [1,0,0]])
      # > [0.33, 0.33, 0.33]
  • 58. Ex: All roads lead to K  testing intuition…  X: P = 0.05  N: P = 0.07  K: P = 0.87
      pagerank(Matrix[[0,0,0], [0.5,0,0], [0.5,1,1]])
      # > [0.05, 0.07, 0.87]
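For readers without GSL installed, the same scores can be approximated by power iteration over plain Ruby arrays. This swaps the matrix inversion used in the slides for repeated application of one surfer step, so it is a different technique that should converge to the same vector. The `pagerank_iterative` name, the tolerance, and the iteration cap are assumptions.

```ruby
# Power iteration: repeatedly apply q <- s*G*q + t*E until it settles.
# g is an array of rows with g[to][from] = link-following probability.
def pagerank_iterative(g, s: 0.85, tolerance: 1e-10, max_iterations: 1_000)
  n = g.size
  teleport = (1.0 - s) / n              # t * E, with uniform E
  q = Array.new(n, 1.0 / n)             # start on a uniformly random page
  max_iterations.times do
    nxt = g.map do |row|
      s * row.each_with_index.sum { |weight, from| weight * q[from] } + teleport
    end
    delta = nxt.zip(q).sum { |a, b| (a - b).abs }
    q = nxt
    break if delta < tolerance
  end
  q
end

all_roads = [[0, 0, 0], [0.5, 0, 0], [0.5, 1, 1]]
p pagerank_iterative(all_roads)
```

On the "All roads lead to K" matrix the result should land close to the slide's [0.05, 0.07, 0.87]. Iteration avoids building and inverting a dense NxN matrix, which matters once the web is larger than a toy example.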
  • 59. PageRank + Ferret  awesome search, ftw!
  • 60. Store PageRank  (figure: pages 1, 2, 3 with P = 0.05, 0.07, 0.87)
      require 'ferret'
      include Ferret

      index = Index::Index.new()
      index << {:title => "1", :content => "it is what it is", :pr => 0.05 }
      index << {:title => "2", :content => "what is it", :pr => 0.07 }
      index << {:title => "3", :content => "it is a banana", :pr => 0.87 }
  • 61. TF-IDF Search
      index.search_each('content:"world"') do |id, score|
        puts "Score: #{score}, #{index[id][:title]} (PR: #{index[id][:pr]})"
      end

      puts "*" * 50

      sf_pr = Search::SortField.new(:pr, :type => :float, :reverse => true)
      index.search_each('content:"world"', :sort => sf_pr) do |id, score|
        puts "Score: #{score}, #{index[id][:title]}, (PR: #{index[id][:pr]})"
      end

      # Score: 0.267119228839874, 3 (PR: 0.87)
      # Score: 0.17807948589325, 1 (PR: 0.05)
      # Score: 0.17807948589325, 2 (PR: 0.07)
      # ***********************************
      # Score: 0.267119228839874, 3, (PR: 0.87)
      # Score: 0.17807948589325, 2, (PR: 0.07)
      # Score: 0.17807948589325, 1, (PR: 0.05)
  • 62. (same code and output as slide 61)  PageRank FTW!
  • 63. (same code and output as slide 61)  Others: first result list  Google: second result list, sorted by PageRank
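The effect of the PageRank sort can be reproduced without Ferret by sorting the printed hits in plain Ruby. The `Hit` struct is an assumption for illustration, and the second ordering (text score first, PageRank as tie-breaker) is an alternative the slides do not show.

```ruby
# Search hits copied from the output above: title, TF-IDF score, PageRank.
Hit = Struct.new(:title, :score, :pr)

HITS = [
  Hit.new("3", 0.267119228839874, 0.87),
  Hit.new("1", 0.17807948589325,  0.05),
  Hit.new("2", 0.17807948589325,  0.07)
]

# Sort by PageRank, highest first (the ordering the :pr sort produces above).
by_pagerank = HITS.sort_by { |hit| -hit.pr }

# Assumed alternative: keep the text score primary, break ties with PageRank.
by_score_then_pr = HITS.sort_by { |hit| [-hit.score, -hit.pr] }

p by_pagerank.map(&:title)
p by_score_then_pr.map(&:title)
```

For these three hits both orderings agree, because pages 1 and 2 tie on text score and PageRank decides between them.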
  • 64. Search*: Graphs are ubiquitous!  PageRank is a general purpose hammer
  • 65. PageRank + Social Graph  GitHub  http://bit.ly/3YQPU
      Username      GitCred
      ==============================
      37signals     10.00
      imbriaco       9.76
      why            8.74
      rails          8.56
      defunkt        8.17
      technoweenie   7.83
      jeresig        7.60
      mojombo        7.51
      yui            7.34
      drnic          7.34
      pjhyett        6.91
      wycats         6.85
      dhh            6.84
  • 66. PageRank + Social Graph  Twitter  Hmm…  Analyze the social graph:
      -  Filter messages by 'TwitterRank'
      -  Suggest users by 'TwitterRank'
      -  …
  • 67. PageRank + Product Graph  E-commerce  Link items purchased in same cart… Run PR on it.
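A sketch of the first half of that idea: turning shopping carts into a link graph. The cart contents and the `product_graph` name are made up for illustration; the resulting hash has the same page-to-links shape as the dictionary form of G, so a PageRank routine could then be run on it.

```ruby
require 'set'

# Hypothetical carts: each cart is the list of items bought together.
CARTS = [
  %w[tent stove lantern],
  %w[tent lantern],
  %w[stove fuel]
]

# Link every pair of items that shared a cart (links go both ways).
def product_graph(carts)
  graph = Hash.new { |hash, item| hash[item] = Set.new }
  carts.each do |cart|
    cart.uniq.combination(2) do |a, b|
      graph[a] << b
      graph[b] << a
    end
  end
  graph
end

graph = product_graph(CARTS)
graph.each { |item, linked| puts "#{item} -> #{linked.to_a.sort.join(', ')}" }
```

This version ignores how often two items were bought together; weighting links by co-purchase count would be a natural refinement.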
  • 68. PageRank = Powerful Hammer  use it!
  • 69. Personalization  how would you do it?
  • 70. PageRank + Personalization  customize the teleportation vector  Teleportation distribution doesn't have to be uniform!  yahoo.com is my homepage!
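A sketch of what a customized teleportation vector does to the scores, using plain-array power iteration rather than the GSL code from the slides. The three-page cycle, the `personalized_pagerank` name, and the fixed iteration count are assumptions for illustration.

```ruby
# Same iteration as plain PageRank, but the teleport target is drawn from a
# custom distribution instead of a uniform one.
# g[to][from] = link-following probability; teleport must sum to 1.
def personalized_pagerank(g, teleport, s: 0.85, iterations: 200)
  q = teleport.dup
  iterations.times do
    q = g.each_with_index.map do |row, to|
      followed = row.each_with_index.sum { |weight, from| weight * q[from] }
      s * followed + (1.0 - s) * teleport[to]
    end
  end
  q
end

# Three pages in a cycle: 0 -> 1 -> 2 -> 0 (a made-up example).
cycle = [[0, 0, 1], [1, 0, 0], [0, 1, 0]]

uniform  = personalized_pagerank(cycle, [1.0 / 3, 1.0 / 3, 1.0 / 3])
homepage = personalized_pagerank(cycle, [1.0, 0.0, 0.0])  # always teleport to page 0

p uniform
p homepage
```

With a uniform vector the cycle is perfectly symmetric; teleporting only to page 0 (the "homepage") tilts the scores toward page 0 and the pages closest to it along the links.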
  • 71. Gaming PageRank  for fun and profit (I don't endorse it)  Make pages with links!  http://bit.ly/pagerank-spam
  • 72. Questions?
      Slides: http://bit.ly/railsconf-pagerank
      Ferret: http://bit.ly/ferret
      RB-GSL: http://bit.ly/rb-gsl
      PageRank on Wikipedia: http://bit.ly/wp-pagerank
      Gaming PageRank: http://bit.ly/pagerank-spam
      Michael Nielsen's lectures on PageRank: http://michaelnielsen.org/blog
      The slides…  Twitter  My blog