SlideShare une entreprise Scribd logo
1  sur  56
Hashing



                                                                ‣ hash functions
                                                                ‣ collision resolution
                                                                ‣ applications


References:
  Algorithms in Java, Chapter 14                                 Except as otherwise noted, the content of this presentation
  http://www.cs.princeton.edu/algs4                              is licensed under the Creative Commons Attribution 2.5 License.




Algorithms in Java, 4th Edition    · Robert Sedgewick and Kevin Wayne · Copyright © 2008             ·    September 2, 2009 2:46:10 AM
Optimize judiciously




     “ More computing sins are committed in the name of efficiency
     (without necessarily achieving it) than for any other single reason—
     including blind stupidity. ” — William A. Wulf



     “ We should forget about small efficiencies, say about 97% of the time:
     premature optimization is the root of all evil. ” — Donald E. Knuth



     “ We follow two rules in the matter of optimization:
       Rule 1: Don't do it.
       Rule 2 (for experts only). Don't do it yet - that is, not until
       you have a perfectly clear and unoptimized solution. ”
       — M. A. Jackson


   Reference: Effective Java by Joshua Bloch


                                                                              2
ST implementations: summary




                          guarantee                         average case
                                                                                        ordered      operations
implementation
                                                                                       iteration?     on keys
                 search    insert     delete   search hit      insert       delete


unordered list     N         N          N        N/2             N           N/2          no         equals()


 ordered array    lg N       N          N        lg N           N/2          N/2          yes       compareTo()


     BST           N         N          N      1.38 lg N     1.38 lg N        ?           yes       compareTo()


randomized BST   3 lg N    3 lg N     3 lg N   1.38 lg N     1.38 lg N     1.38 lg N      yes       compareTo()


red-black tree   3 lg N    3 lg N     3 lg N     lg N           lg N         lg N         yes       compareTo()




  Q. Can we do better?



                                                                                                                  3
Hashing: basic plan


Save items in a key-indexed table (index is a function of the key).


Hash function. Method for computing table index from key.             0

                                                                      1
                                                 hash("it") = 3
                                                                      2

                                                                      3   "it"
                                                                      4

                                                                      5




                                                                                 4
Hashing: basic plan


Save items in a key-indexed table (index is a function of the key).


Hash function. Method for computing table index from key.                      0

                                                                               1
                                                     hash("it") = 3
                                                                               2

                                                                               3   "it"
                                                                        ??     4
Issues                                         hash("times") = 3               5
•   Computing the hash function.
•   Equality test: Method for checking whether two keys are equal.
•   Collision resolution: Algorithm and data structure
    to handle two keys that hash to the same table index.


Classic space-time tradeoff.
•   No space limitation: trivial hash function with key as address.
•   No time limitation: trivial collision resolution with sequential search.
•   Limitations on both time and space: hashing (the real world).
                                                                                          5
‣ hash functions
‣ collision resolution
‣ applications




                         6
Computing the hash function


Idealistic goal: scramble the keys uniformly.
•   Efficiently computable.
•   Each table index equally likely for each key.
                                     thoroughly researched problem,
                                     still problematic in practical applications




Practical challenge. Need different approach for each Key type.


Ex: Social Security numbers.
•   Bad: first three digits.              573 = California, 574 = Alaska

•   Better: last three digits.
                                          (assigned in chronological order within a given geographic region)




Ex: phone numbers.
•   Bad: first three digits.
•   Better: last three digits.


                                                                                                               7
Hash codes and hash functions


Hash code. All Java classes have a method hashCode(), which returns an int.

                                                                           between -232 and 231 - 1


Hash function. An int between 0 and M-1 (for use as an array index).


First attempt.

                     String s = "call";
                     int code = s.hashCode();                    3045982

                     int hash = code % M;


                          7121        8191




Bug. Don't use (code % M) as array index.
1-in-a billion bug. Don't use (Math.abs(code) % M) as array index.
OK. Safe to use ((code & 0x7fffffff) % M) as array index.

                                       hex literal 31-bit mask
                                                                                                      8
Java’s hash code conventions


The method hashCode() is inherited from Object.
•   Ensures hashing can be used for every object type.
•   Enables expert implementations for each type.




Available implementations.
•   Default implementation: memory address of x.
•   Customized implementations: String, URL, Integer, Date, ….
•   User-defined types: users are on their own.



                                                                 9
Implementing hash code: phone numbers


Ex. Phone numbers: (609) 867-5309.


   public final class PhoneNumber
   {
      private final int area, exch, ext;

       public PhoneNumber(int area, int exch, int ext)
       {
          this.area = area;
          this.exch = exch;
          this.ext = ext;
       }

       ...

       public boolean equals(Object y)
       { /* as before */ }

       public int hashCode()
                                                          sufficiently random?
       { return 10007 * (area + 1009 * exch) + ext;   }
   }




                                                                                 10
Implementing hash code: strings


Ex. Strings (in Java 1.5).

         public int hashCode()                                                        char      Unicode

         {                                                                             …               …
            int hash = 0;                                                             'a'          97
            for (int i = 0; i < length(); i++)
               hash = s[i] + (31 * hash);                                             'b'          98

            return hash;                                                              'c'          99
                              ith character of s
         }
                                                                                       …               ...



•   Equivalent to h = 31L-1 · s0 + … + 312 · sL-3 + 311 · sL-2 + 310 · sL-1.
•   Horner's method to hash string of length L: L multiplies/adds.



           String s = "call";
Ex.
           int code = s.hashCode();               3045982 = 99·313 + 97·312 + 108·311 + 108·310
                                                           = 108 + 31· (108 + 31 · (97 + 31 · (99)))



Provably random? Well, no.
                                                                                                             11
A poor hash code design


Ex. Strings (in Java 1.1).
•   For long strings: only examine 8-9 evenly spaced characters.
•   Benefit: saves time in performing arithmetic.

             public   int hashCode()
             {
                int  hash = 0;
                int  skip = Math.max(1, length() / 8);
                for  (int i = 0; i < length(); i += skip)
                    hash = (37 * hash) + s[i];
                 return hash;
             }




•   Downside: great potential for bad collision patterns.

         http://www.cs.princeton.edu/introcs/13loop/Hello.java
         http://www.cs.princeton.edu/introcs/13loop/Hello.class
         http://www.cs.princeton.edu/introcs/13loop/Hello.html
         http://www.cs.princeton.edu/introcs/13loop/index.html
         http://www.cs.princeton.edu/introcs/12type/index.html

                                                                   12
Designing a good hash function


Requirements.
•   If x.equals(y), then we must also have (x.hashCode() == y.hashCode()).
•   Repeated calls to x.hashCode() must return the same value
    (provided no info used in equals() is changed).


Highly desirable. If !x.equals(y), then we want            x              y

(x.hashCode() != y.hashCode()).



                                                      x.hashCode()   y.hashCode()




Basic rule. Need to use the whole key to compute hash code.


Fundamental problem. Need a theorem for each type to ensure reliability.



                                                                                    13
Digression: using a hash function for data mining


Use content to characterize documents.


Applications.
•   Search documents on the web for documents similar to a given one.
•   Determine whether a new document belongs in one set or another.




                                                                   import javax.imageio.ImageIO;
                                                                   import java.io.*;
                                                                   import javax.swing.*;
                                                                   import java.awt.event.*;
                                                                   import java.awt.*;


                                                                   public class Picture {
                                                                     private BufferedImage image;
                                                                     private JFrame frame;
                                                                     ...




Context. Effective for literature, genomes, Java code, art, music, data, video.
                                                                                                    14
Digression: using a hash function for data mining


Approach.
•   Fix order k and dimension d.
•   Compute (hashCode() % d) for all k-grams in the document.
•   Result is d-dimensional vector profile of each document.



To compare documents: Consider angle θ separating vectors
•   cos θ close to 0: not similar.
•   cos θ close to 1: similar.
                                                  a
                                                                   b




                                                      θ



                                         cos θ = a ⋅ b /   a   b


                                                                       15
Digression: using a hash function for data mining


                                                  tale.txt                genome.txt

                                             10-grams with             10-grams with
                                    i                        freq                      freq
                                              hash code i               hash code i
% more tale.txt
it was the best of times            0                          0                        0
it was the worst of times
                                    1                          0                        0
it was the age of wisdom
it was the age of                  ...
foolishness
...                                          "best of ti"              "TTTCGGTTTG"
                                   435                         2                        2
                                             "foolishnes"              "TGTCTGCTGC"

% more genome.txt                  ...
CTTTCGGTTTGGAACC
GAAGCCGCGCGTCT                     8999      "it was the"      8                        0
TGTCTGCTGCAGC
                                   ...
ATCGTTC
...                               12122                        0       "CTTTCGGTTT"     3

                                   ...

                                  34543      "t was the b"     5       "ATGCGGTCGA"     4

                                   ...

                                  65535


                                 k = 10
                                 d = 65536
                                                             profile
                                                                                              16
Digression: using a hash function for data mining



      public class Document
      {
         private String name;
         private double[] profile;

          public Document(String name, int k, int d)
          {
             this.name = name;
             String doc = (new In(name)).readAll();
             int N = doc.length();
             profile = new double[d];
             for (int i = 0; i < N-k; i++)
             {
                int h = doc.substring(i, i+k).hashCode();
                profile[Math.abs(h % d)] += 1;
             }
          }

          public double simTo(Document that)
          { /* compute dot product and divide by magnitudes */   }

      }




                                                                     17
Digression: using a hash function for data mining

                         file                description

                         Cons               US Constitution

                         TomS                Tom Sawyer

                         Huck              Huckleberry Finn

                         Prej             Pride and Prejudice

                         Pict                a photograph

                         DJIA                financial data

                         Amaz       amazon.com website .html source

                         ACTG                   genome


    % java CompareAll   5 1000 < docs.txt
              Cons      TomS    Huck    Prej        Pict        DJIA   Amaz   ACTG
    Cons      1.00      0.89    0.87    0.88        0.35        0.70   0.63   0.58
    TomS      0.89      1.00    0.98    0.96        0.34        0.75   0.66   0.62
    Huck      0.87      0.98    1.00    0.94        0.32        0.74   0.65   0.61
    Prej      0.88      0.96    0.94    1.00        0.34        0.76   0.67   0.63
    Pict      0.35      0.34    0.32    0.34        1.00        0.29   0.48   0.24
    DJIA      0.70      0.75    0.74    0.76        0.29        1.00   0.62   0.58
    Amaz      0.63      0.66    0.65    0.67        0.48        0.62   1.00   0.45
    ACTG      0.58      0.62    0.61    0.63        0.24        0.58   0.45   1.00

                                                                                     18
‣ hash functions
‣ collision resolution
‣ applications




                         19
Helpful results from probability theory


Bins and balls. Throw balls uniformly at random into M bins.




                     0   1   2   3   4   5   6   7   8   9   10 11 12 13 14 15




Birthday problem. Expect two balls in the same bin after ~                       π M / 2 tosses.


Coupon collector. Expect every bin has ≥ 1 ball after ~ M ln M tosses.


Load balancing. After M tosses, expect most loaded bin has
Θ(log M / log log M) balls.




                                                                                                   20
Collisions


Collision. Two distinct keys hashing to same index.
•   Birthday problem ⇒ can't avoid collisions unless you have a ridiculous
    amount (quadratic) of memory.
•   Coupon collector + load balancing ⇒ collisions will be evenly distributed.


Challenge. Deal with collisions efficiently.




        approach 1: accept multiple collisions   approach 2: minimize collisions



                                                                                   21
Collision resolution: two approaches


  Separate chaining. [H. P. Luhn, IBM 1953]
  Put keys that collide in a list associated with index.


  Open addressing. [Amdahl-Boehme-Rocherster-Samuel, IBM 1953]
  When a new key collides, find next empty slot, and put it there.


                                                                          st[0]             jocularly


                                                                          st[1]               null

 st[0]     jocularly          seriously                                   st[2]              listen


 st[1]      listen                                                        st[3]             suburban

 st[2]       null

 st[3]     suburban          untravelled             considerating                            null

                                                                       st[30001]            browsing
st[8190]   browsing



           separate chaining (M = 8191, N = 15000)                   linear probing (M = 30001, N = 15000)

                                                                                                             22
Collision resolution approach 1: separate chaining


  Use an array of M < N linked lists.                                 good choice: M ~ N/10

  •   Hash: map key to integer i between 0 and M-1.
  •   Insert: put at front of ith chain (if not already there).
  •   Search: only need to search ith chain.




                                                                                key           hash

                                                                              "call"          7121


            jocularly          seriously                                       "me"           3480
 st[0]

 st[1]       listen                                                         "ishmael"         5017

 st[2]        null                                                         "seriously"         0

 st[3]      suburban          untravelled             considerating       "untravelled"        3

                                                                            "suburban"         3
st[8190]    browsing
                                                                                ...           ...

            separate chaining (M = 8191, N = 15000)


                                                                                                     23
Separate chaining ST: Java implementation (skeleton)


    public class ListHashST<Key, Value>
    {
        private int M = 8191;                              array doubling
        private Node[] st = new Node[M];                   code omitted

         private class Node
         {
             private Object key;                           no generics in
             private Object val;                           arrays in Java
             private Node next;
             public Node(Key key, Value val, Node next)
             {
                 this.key   = key;
                 this.val   = val;
                 this.next = next;
             }
         }

        private int hash(Key key)
        { return (key.hashCode() & 0x7ffffffff) % M;   }

        public void put(Key key, Value val)
        { /* see next slide */ }

        public Val get(Key key)
           { /* see next slide */   }
    }


                                                                            24
Separate chaining ST: Java implementation (put and get)




 public void put(Key key, Value val)
 {
    int i = hash(key);
    for (Node x = st[i]; x != null; x = x.next)
       if (key.equals(x.key))
       { x.val = val; return; }
    st[i] = new Node(key, value, first);
 }
                                                          identical to linked-list code,
                                                            except hash to pick a list
 public Value get(Key key)
 {
    int i = hash(key);
    for (Node x = st[i]; x != null; x = x.next)
       if (key.equals(x.key))
          return (Value) x.val;
    return null;
 }




                                                                                           25
Analysis of separate chaining


Separate chaining performance.
•   Cost is proportional to length of chain.
•   Average length of chain α = N / M.
•   Worst case: all keys hash to same chain.


Proposition. Let α > 1. For any t > 1, probability that chain length > t α is
exponentially small in t.

                      depends on hash map being random map


Parameters.
•   M too large ⇒ too many empty chains.
•   M too small ⇒ chains too long.
•   Typical choice: M ~ N/10 ⇒ constant-time ops.




                                                                                26
Collision resolution approach 2: linear probing


Use an array of size M >> N.                   good choice: M ~ 2N

•   Hash: map key to integer i between 0 and M-1.
•   Insert: put in slot i if free; if not try i+1, i+2, etc.
•   Search: search slot i; if occupied but no match, try i+1, i+2, etc.



            -    -   -    S   H   -    -   A   C    E    R     -     -

             0   1    2   3   4    5   6   7    8   9    10    11    12




                                                                          27
Collision resolution approach 2: linear probing


Use an array of size M >> N.                   good choice: M ~ 2N

•   Hash: map key to integer i between 0 and M-1.
•   Insert: put in slot i if free; if not try i+1, i+2, etc.
•   Search: search slot i; if occupied but no match, try i+1, i+2, etc.



            -    -   -    S   H   -    -   A   C    E    R     -     -

             0   1    2   3   4    5   6   7    8   9    10    11    12




            -    -   -    S   H   -    -   A   C    E    R     I     -    insert I
                                                                          hash(I) = 11
             0   1    2   3   4    5   6   7    8   9    10    11    12




                                                                                         27
Collision resolution approach 2: linear probing


Use an array of size M >> N.                   good choice: M ~ 2N

•   Hash: map key to integer i between 0 and M-1.
•   Insert: put in slot i if free; if not try i+1, i+2, etc.
•   Search: search slot i; if occupied but no match, try i+1, i+2, etc.



            -    -   -    S   H   -    -   A   C    E    R     -     -

             0   1    2   3   4    5   6   7    8   9    10    11    12




            -    -   -    S   H   -    -   A   C    E    R     I     -    insert I
                                                                          hash(I) = 11
             0   1    2   3   4    5   6   7    8   9    10    11    12




            -    -   -    S   H   -    -   A   C    E    R     I     N    insert N
                                                                          hash(N) = 8
             0   1    2   3   4    5   6   7    8   9    10    11    12




                                                                                         27
Linear probing ST implementation


   public class ArrayHashST<Key, Value>
   {                                         standard ugly casts
      private int M = 30001;
      private Value[] vals = (Value[]) new Object[maxN];            array doubling
      private Key[]   keys = (Key[])    new Object[maxN];           code omitted


       privat int hash(Key key) {     /* as before */    }

       public void put(Key key, Value val)
       {
          int i;
          for (i = hash(key); keys[i] != null; i = (i+1) % M)
             if (key.equals(keys[i]))
                 break;
          vals[i] = val;
          keys[i] = key;
       }

       public Value get(Key key)
       {
          for (int i = hash(key); keys[i] != null; i = (i+1) % M)
             if (key.equals(keys[i]))
                 return vals[i];
          return null;
       }
   }

                                                                                     28
Clustering


Cluster. A contiguous block of items.
Observation. New keys likely to hash into middle of big clusters.




                                                                    29
Clustering


Cluster. A contiguous block of items.
Observation. New keys likely to hash into middle of big clusters.




                                                                    29
Knuth's parking problem


Model. Cars arrive at one-way street with M parking spaces. Each desires a
random space i: if space i is taken, try i+1, i+2, …


Q. What is mean displacement of a car?


                                               displacement =3




Empty. With M/2 cars, mean displacement is ~ 3/2.
Full.   With M cars, mean displacement is ~         πM/8



                                                                             30
Analysis of linear probing

Linear probing performance.
•   Insert and search cost depend on length of cluster.
•   Average length of cluster α = N / M.                                 but keys more likely to

•
                                                                         hash to big clusters
    Worst case: all keys hash to same cluster.


Proposition. [Knuth 1962] Let α < 1 be the load factor.
      average probes for insert/search miss

                     1
                     —
                     2(    1 +
                                          1
                                      ( 1 − α )2   )   = ( 1 + α + 2α2 + 3α3 + 4α4 + . . . ) /2

      average probes for search hit

                     1
                     —
                     2
                      (    1 +
                                          1
                                      ( 1 − α )
                                                   )   =   1 + ( α + α2 + α3 + α4 + . . . ) /2



Parameters.
•   Load factor too small ⇒ too many empty array entries.
•   Load factor too large ⇒ clusters coalesce.
•   Typical choice: M ~ 2N ⇒ constant-time ops.
                                                                                                   31
Hashing: variations on the theme


Many improved versions have been studied.


Two-probe hashing. (separate chaining variant)
•   Hash to two positions, put key in shorter of the two chains.
•   Reduces average length of the longest chain to log log N.


Double hashing. (linear probing variant)
•   Use linear probing, but skip a variable amount, not just 1 each time.
•   Effectively eliminates clustering.
•   Can allow table to become nearly full.




                                                                            32
Double hashing


Idea. Avoid clustering by using second hash to compute skip for search.


Hash function. Map key to integer i between 0 and M-1.
Second hash function. Map key to nonzero skip value k.


Ex: k = 1 + (v mod 97).

         hashCode()




Effect. Skip values give different search paths for keys that collide.



Best practices. Make k and M relatively prime.


                                                                          33
Double hashing performance


Theorem. [Guibas-Szemerédi] Let α = N / M < 1 be average length of cluster.


    Average probes for insert/search miss

                                       1
                                                =   1 + α + α2 + α3 + α4 + . . .
                                    ( 1 − α )

    Average probes for search hit

                        1              1
                        —   ln                  =   1 + α/2 + α2 /3 + α3 /4 + α4 /5 + . . .
                        α           ( 1 − α )




Parameters. Typical choice: α ~ 1.2 ⇒ constant-time ops.


Disadvantage. Deletion is cumbersome to implement.




                                                                                              34
Hashing Tradeoffs


 Separate chaining vs. linear probing/double hashing.
 •   Space for links vs. empty table slots.
 •   Small table + linked allocation vs. big coherent array.



 Linear probing vs. double hashing.


                                              load factor

                                    50%       66%      75%    90%

                             get    1.5       2.0      3.0     5.5
                    linear
                   probing   put    2.5       5.0      8.5    55.5

                             get    1.4       1.6       1.8    2.6
                   double
                   hashing
                             put    1.5       2.0      3.0     5.5

                                          number of probes




                                                                     35
Summary of symbol-table implementations



                          guarantee                         average case
                                                                                        ordered      operations
implementation
                                                                                       iteration?     on keys
                 search    insert     delete   search hit      insert       delete


unordered list     N         N          N        N/2             N           N/2          no         equals()


 ordered array    lg N       N          N        lg N           N/2          N/2          yes       compareTo()


     BST           N         N          N      1.38 lg N     1.38 lg N        ?           yes       compareTo()


randomized BST   3 lg N    3 lg N     3 lg N   1.38 lg N     1.38 lg N     1.38 lg N      yes       compareTo()


red-black tree   3 lg N    3 lg N     3 lg N     lg N           lg N         lg N         yes       compareTo()


                                                                                                     equals()
    hashing       1*         1*        1*         1*            1*            1*          no        hashCode()


                   * assumes random hash function




                                                                                                                  36
Hashing versus balanced trees


Hashing
•   Simpler to code.
•   No effective alternative for unordered keys.
•   Faster for simple keys (a few arithmetic ops versus log N compares).
•   Better system support in Java for strings (e.g., cached hash code).
•   Does your hash function produce random values for your key type??


Balanced trees.
•   Stronger performance guarantee.
•   Can support many more ST operations for ordered keys.
•   Easier to implement compareTo() correctly than equals() and hashCode().


Java system includes both.
•   Red-black trees: java.util.TreeMap, java.util.TreeSet.
•   Hashing: java.util.HashMap, java.util.IdentityHashMap.


                                                                              37
Typical "full" symbol table API


  public class *ST<Key extends Comparable<Key>, Value> implements Iterable<Key>

                *ST()                                   create an empty symbol table

          void put(Key key, Value val)                 put key-value pair into the table

         Value get(Key key)                   return value paired with key; null if no such value

       boolean contains(Key key)                      is there a value paired with key?

           Key min()                                         return smallest key

           Key max()                                          return largest key

           Key ceil(Key key)                       return smallest key in table ≥ query key

           Key floor(Key key)                      return largest key in table ≤ query key

          void remove(Key key)                        remove key-value pair from table

  Iterator<Key> iterator()                              iterator through keys in table



Hashing is not suitable for implementing such an API (no order).
BSTs are easy to extend to support such an API (basic tree ops).


                                                                                                    38
‣ hash functions
‣ collision resolution
‣ applications




                         39
Searching challenge


Problem. Index for a PC or the web.
Assumptions. 1 billion++ words to index.


Which searching method to use?
•   Hashing implementation of ST.
•   Hashing implementation of SET.
•   Red-black-tree implementation of ST.
•   Red-black-tree implementation of SET.
•   Doesn’t matter much.




                                            40
Index for search in a PC



    ST<String, SET<File>> st = new ST<String, SET<File>>();
    for (File f : filesystem)
    {
       In in = new In(f);
       String[] words = in.readAll().split("s+");
       for (int i = 0; i < words.length; i++)
       {                                                      build index
           String s = words[i];
           if (!st.contains(s))
              st.put(s, new SET<File>());
           SET<File> files = st.get(s);
           files.add(f);
       }
    }




    SET<File> files = st.get(s);                              process lookup
    for (File f : files) ...                                  request




                                                                               41
Searching challenge


Problem. Index for an e-book.
Assumptions. Book has 100,000+ words.




Which searching method to use?
•   Hashing implementation of ST.
•   Hashing implementation of SET.
•   Red-black-tree implementation of ST.
•   Red-black-tree implementation of SET.
•   Doesn’t matter much.




                                            42
Index for a book


     public class Index
     {
        public static void main(String[] args)
        {
           String[] words = StdIn.readAll().split("s+");
                                                             read book and
           ST<String, SET<Integer>> st;
                                                             create ST
           st = new ST<String, SET<Integer>>();

             for (int i = 0; i < words.length; i++)
             {
                String s = words[i];
                if (!st.contains(s))                         process all words
                   st.put(s, new SET<Integer>());
                SET<Integer> pages = st.get(s);
                pages.add(page(i));
             }

             for (String s : st)
                                                             print index
                StdOut.println(s + ": " + st.get(s));

         }
     }




                                                                                 43
Searching challenge 5


Problem. Sparse matrix-vector multiplication.
Assumptions. Matrix dimension is 10,000; average nonzeros per row ~ 10.




Which searching method to use?
1) Unordered array.
2) Ordered linked list.
3) Ordered array with binary search.
4) Need better method, all too slow.
                                                       A   *   x   =   b
5) Doesn’t matter much, all fast enough.




                                                                           44
Sparse vectors and matrices


Vector. Ordered sequence of N real numbers.
Matrix. N-by-N table of real numbers.


     vector operations


          a =       [   0 3 15    ]   ,       b =   [−1   2 2   ]
          a + b = [−1 5 17                ]
          a o b = (0 ⋅ −1) + (3 ⋅ 2) + (15 ⋅ 2) = 36
          a     =       a o a =       0 2 + 3 2 + 15 2 = 3 26



 €
     matrix-vector multiplication


        0 1  1   −1    4
                       
        2 4 −2 ×  2 =  2
        0 3 15
                  2
                        36
                           



€
                                                                    45
Sparse vectors and matrices


An N-by-N matrix is sparse if it contains O(N) nonzeros.


Property. Large matrices that arise in practice are sparse.



2D array matrix representation.
•   Constant time access to elements.
•   Space proportional to N2.


Goal.
•   Efficient access to elements.
•   Space proportional to number of nonzeros.




                                                              46
Sparse vector data type



 public class SparseVector
 {
    private int N;                    // length
    private ST<Integer, Double> st;   // the elements

    public SparseVector(int N)
    {                                                   all 0s vector
       this.N = N;
       this.st = new ST<Integer, Double>();
    }

    public void put(int i, double value)
    {
       if (value == 0.0) st.remove(i);                  a[i] = value
       else              st.put(i, value);
    }

    public double get(int i)
    {
       if (st.contains(i)) return st.get(i);            return a[i]
       else                return 0.0;
    }

    ...


                                                                        47
Sparse vector data type (cont)



       public double dot(SparseVector that)
       {
          double sum = 0.0;
          for (int i : this.st)
                                                     dot product
             if (that.st.contains(i))
                sum += this.get(i) * that.get(i);
          return sum;
       }

       public double norm()                          2-norm
       { return Math.sqrt(this.dot(this));   }

       public SparseVector plus(SparseVector that)
       {
          SparseVector c = new SparseVector(N);
          for (int i : this.st)                      vector sum
             c.put(i, this.get(i));
          for (int i : that.st)
             c.put(i, that.get(i) + c.get(i));
          return c;
       }

   }


                                                                   48
Sparse matrix data type

   public class SparseMatrix
   {
      private final int N;               // length
      private SparseVector[] rows;       // the elements

       public SparseMatrix(int N)
       {                                                   all 0s matrix
          this.N = N;
          this.rows = new SparseVector[N];
          for (int i = 0; i < N; i++)
             this.rows[i] = new SparseVector(N);
       }

       public void put(int i, int j, double value)         a[i][j] = value
       { rows[i].put(j, value); }

       public double get(int i, int j)                     return a[i][j]
       { return rows[i].get(j); }

       public SparseVector times(SparseVector x)
       {
          SparseVector b = new SparseVector(N);
          for (int i = 0; i < N; i++)                      matrix-vector
             b.put(i, rows[i].dot(x));                     multiplication
          return b;
       }
   }
                                                                             49
Hashing in the wild: algorithmic complexity attacks


Is the random hash map assumption important in practice?
•   Obvious situations: aircraft control, nuclear reactor, pacemaker.
•   Surprising situations: denial-of-service attacks.


                   malicious adversary learns your hash function
                   (e.g., by reading Java API) and causes a big pile-up
                   in single slot that grinds performance to a halt




Real-world exploits. [Crosby-Wallach 2003]
•   Bro server: send carefully chosen packets to DOS the server,
    using less bandwidth than a dial-up modem.
•   Perl 5.8.0: insert carefully chosen strings into associative array.
•   Linux 2.4.20 kernel: save files with carefully chosen names.



                                                                          50
Algorithmic complexity attack on Java


Goal. Find strings with the same hash code.
Solution. The base-31 hash code is part of Java's string API.


     key    hashCode()

     "Aa"      2112

     "BB"      2112




Q. Does your hash function produce random values for your key type?
                                                                      51
Algorithmic complexity attack on Java


Goal. Find strings with the same hash code.
Solution. The base-31 hash code is part of Java's string API.


     key    hashCode()           key         hashCode()                  key          hashCode()

     "Aa"      2112           "AaAaAaAa"      -540425984             "BBAaAaAa"       -540425984

     "BB"      2112           "AaAaAaBB"      -540425984             "BBAaAaBB"       -540425984

                              "AaAaBBAa"      -540425984             "BBAaBBAa"       -540425984

                              "AaAaBBBB"      -540425984             "BBAaBBBB"       -540425984

                              "AaBBAaAa"      -540425984             "BBBBAaAa"       -540425984

                              "AaBBAaBB"      -540425984             "BBBBAaBB"       -540425984

                              "AaBBBBAa"      -540425984             "BBBBBBAa"       -540425984

                              "AaBBBBBB"      -540425984             "BBBBBBBB"       -540425984


                                        2N strings of length 2N that hash to same value!




Q. Does your hash function produce random values for your key type?
                                                                                                   51
One-way hash functions


One-way hash function. Hard to find a key that will hash to a desired value,
or to find two keys that hash to same value.


Ex. MD4, MD5, SHA-0, SHA-1, SHA-2, WHIRLPOOL, RIPEMD-160.

           known to be insecure




       String password = args[0];
       MessageDigest sha1 = MessageDigest.getInstance("SHA1");
       byte[] bytes = sha1.digest(password);

       /* prints bytes as hex string */




Applications. Digital fingerprint, message digest, storing passwords.
Caveat. Too expensive for use in ST implementations.


                                                                               52

Contenu connexe

Tendances

Basic Traversal and Search Techniques
Basic Traversal and Search TechniquesBasic Traversal and Search Techniques
Basic Traversal and Search TechniquesSVijaylakshmi
 
Algorithms Lecture 6: Searching Algorithms
Algorithms Lecture 6: Searching AlgorithmsAlgorithms Lecture 6: Searching Algorithms
Algorithms Lecture 6: Searching AlgorithmsMohamed Loey
 
Computational Complexity: Complexity Classes
Computational Complexity: Complexity ClassesComputational Complexity: Complexity Classes
Computational Complexity: Complexity ClassesAntonis Antonopoulos
 
Hashing In Data Structure
Hashing In Data Structure Hashing In Data Structure
Hashing In Data Structure Meghaj Mallick
 
Algorithm Using Divide And Conquer
Algorithm Using Divide And ConquerAlgorithm Using Divide And Conquer
Algorithm Using Divide And ConquerUrviBhalani2
 
Backtracking Algorithm.ppt
Backtracking Algorithm.pptBacktracking Algorithm.ppt
Backtracking Algorithm.pptSalmIbrahimIlyas
 
3.9 external sorting
3.9 external sorting3.9 external sorting
3.9 external sortingKrish_ver2
 
Master method
Master method Master method
Master method Rajendran
 
Open addressiing &amp;rehashing,extendiblevhashing
Open addressiing &amp;rehashing,extendiblevhashingOpen addressiing &amp;rehashing,extendiblevhashing
Open addressiing &amp;rehashing,extendiblevhashingSangeethaSasi1
 
Advanced topics in artificial neural networks
Advanced topics in artificial neural networksAdvanced topics in artificial neural networks
Advanced topics in artificial neural networksswapnac12
 
Newton’s Divided Difference Formula
Newton’s Divided Difference FormulaNewton’s Divided Difference Formula
Newton’s Divided Difference FormulaJas Singh Bhasin
 

Tendances (20)

Basic Traversal and Search Techniques
Basic Traversal and Search TechniquesBasic Traversal and Search Techniques
Basic Traversal and Search Techniques
 
Quadratic probing
Quadratic probingQuadratic probing
Quadratic probing
 
08 Hash Tables
08 Hash Tables08 Hash Tables
08 Hash Tables
 
Alpha beta
Alpha betaAlpha beta
Alpha beta
 
Hashing
HashingHashing
Hashing
 
Algorithms Lecture 6: Searching Algorithms
Algorithms Lecture 6: Searching AlgorithmsAlgorithms Lecture 6: Searching Algorithms
Algorithms Lecture 6: Searching Algorithms
 
Computational Complexity: Complexity Classes
Computational Complexity: Complexity ClassesComputational Complexity: Complexity Classes
Computational Complexity: Complexity Classes
 
Hashing In Data Structure
Hashing In Data Structure Hashing In Data Structure
Hashing In Data Structure
 
Algorithm Using Divide And Conquer
Algorithm Using Divide And ConquerAlgorithm Using Divide And Conquer
Algorithm Using Divide And Conquer
 
Hamiltonian path
Hamiltonian pathHamiltonian path
Hamiltonian path
 
Backtracking Algorithm.ppt
Backtracking Algorithm.pptBacktracking Algorithm.ppt
Backtracking Algorithm.ppt
 
3.9 external sorting
3.9 external sorting3.9 external sorting
3.9 external sorting
 
Master method
Master method Master method
Master method
 
b+ tree
b+ treeb+ tree
b+ tree
 
Insertion sort algorithm power point presentation
Insertion  sort algorithm power point presentation Insertion  sort algorithm power point presentation
Insertion sort algorithm power point presentation
 
Open addressiing &amp;rehashing,extendiblevhashing
Open addressiing &amp;rehashing,extendiblevhashingOpen addressiing &amp;rehashing,extendiblevhashing
Open addressiing &amp;rehashing,extendiblevhashing
 
Hash tables
Hash tablesHash tables
Hash tables
 
Join operation
Join operationJoin operation
Join operation
 
Advanced topics in artificial neural networks
Advanced topics in artificial neural networksAdvanced topics in artificial neural networks
Advanced topics in artificial neural networks
 
Newton’s Divided Difference Formula
Newton’s Divided Difference FormulaNewton’s Divided Difference Formula
Newton’s Divided Difference Formula
 

En vedette

Serverless (Distributed computing)
Serverless (Distributed computing)Serverless (Distributed computing)
Serverless (Distributed computing)Sri Prasanna
 
Advanced Topics Sorting
Advanced Topics SortingAdvanced Topics Sorting
Advanced Topics SortingSri Prasanna
 
Qr codes para tech radar
Qr codes para tech radarQr codes para tech radar
Qr codes para tech radarSri Prasanna
 
Hashing Technique In Data Structures
Hashing Technique In Data StructuresHashing Technique In Data Structures
Hashing Technique In Data StructuresSHAKOOR AB
 
Clock Synchronization (Distributed computing)
Clock Synchronization (Distributed computing)Clock Synchronization (Distributed computing)
Clock Synchronization (Distributed computing)Sri Prasanna
 

En vedette (6)

Serverless (Distributed computing)
Serverless (Distributed computing)Serverless (Distributed computing)
Serverless (Distributed computing)
 
Advanced Topics Sorting
Advanced Topics SortingAdvanced Topics Sorting
Advanced Topics Sorting
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Qr codes para tech radar
Qr codes para tech radarQr codes para tech radar
Qr codes para tech radar
 
Hashing Technique In Data Structures
Hashing Technique In Data StructuresHashing Technique In Data Structures
Hashing Technique In Data Structures
 
Clock Synchronization (Distributed computing)
Clock Synchronization (Distributed computing)Clock Synchronization (Distributed computing)
Clock Synchronization (Distributed computing)
 

Similaire à Hashing

Hashing.pptx
Hashing.pptxHashing.pptx
Hashing.pptxkratika64
 
Hash presentation
Hash presentationHash presentation
Hash presentationomercode
 
Presentation.pptx
Presentation.pptxPresentation.pptx
Presentation.pptxAgonySingh
 
Class 9: Consistent Hashing
Class 9: Consistent HashingClass 9: Consistent Hashing
Class 9: Consistent HashingDavid Evans
 
Data Structures : hashing (1)
Data Structures : hashing (1)Data Structures : hashing (1)
Data Structures : hashing (1)Home
 
Sienna 9 hashing
Sienna 9 hashingSienna 9 hashing
Sienna 9 hashingchidabdu
 
Faster persistent data structures through hashing
Faster persistent data structures through hashingFaster persistent data structures through hashing
Faster persistent data structures through hashingJohan Tibell
 
Algorithms notes tutorials duniya
Algorithms notes   tutorials duniyaAlgorithms notes   tutorials duniya
Algorithms notes tutorials duniyaTutorialsDuniya.com
 
Skiena algorithm 2007 lecture06 sorting
Skiena algorithm 2007 lecture06 sortingSkiena algorithm 2007 lecture06 sorting
Skiena algorithm 2007 lecture06 sortingzukun
 
Lecture14_15_Hashing.pptx
Lecture14_15_Hashing.pptxLecture14_15_Hashing.pptx
Lecture14_15_Hashing.pptxSLekshmiNair
 
Hash in datastructures by using the c language.pptx
Hash in datastructures by using the c language.pptxHash in datastructures by using the c language.pptx
Hash in datastructures by using the c language.pptxmy6305874
 
RecSplit Minimal Perfect Hashing
RecSplit Minimal Perfect HashingRecSplit Minimal Perfect Hashing
RecSplit Minimal Perfect HashingThomas Mueller
 

Similaire à Hashing (20)

Hashing.pptx
Hashing.pptxHashing.pptx
Hashing.pptx
 
Hash presentation
Hash presentationHash presentation
Hash presentation
 
Presentation.pptx
Presentation.pptxPresentation.pptx
Presentation.pptx
 
Class 9: Consistent Hashing
Class 9: Consistent HashingClass 9: Consistent Hashing
Class 9: Consistent Hashing
 
Lec5
Lec5Lec5
Lec5
 
Hashing
HashingHashing
Hashing
 
Data Structures : hashing (1)
Data Structures : hashing (1)Data Structures : hashing (1)
Data Structures : hashing (1)
 
Lec5
Lec5Lec5
Lec5
 
Hashing .pptx
Hashing .pptxHashing .pptx
Hashing .pptx
 
Hashing
HashingHashing
Hashing
 
session 15 hashing.pptx
session 15   hashing.pptxsession 15   hashing.pptx
session 15 hashing.pptx
 
Hashing
HashingHashing
Hashing
 
Sienna 9 hashing
Sienna 9 hashingSienna 9 hashing
Sienna 9 hashing
 
4.4 hashing02
4.4 hashing024.4 hashing02
4.4 hashing02
 
Faster persistent data structures through hashing
Faster persistent data structures through hashingFaster persistent data structures through hashing
Faster persistent data structures through hashing
 
Algorithms notes tutorials duniya
Algorithms notes   tutorials duniyaAlgorithms notes   tutorials duniya
Algorithms notes tutorials duniya
 
Skiena algorithm 2007 lecture06 sorting
Skiena algorithm 2007 lecture06 sortingSkiena algorithm 2007 lecture06 sorting
Skiena algorithm 2007 lecture06 sorting
 
Lecture14_15_Hashing.pptx
Lecture14_15_Hashing.pptxLecture14_15_Hashing.pptx
Lecture14_15_Hashing.pptx
 
Hash in datastructures by using the c language.pptx
Hash in datastructures by using the c language.pptxHash in datastructures by using the c language.pptx
Hash in datastructures by using the c language.pptx
 
RecSplit Minimal Perfect Hashing
RecSplit Minimal Perfect HashingRecSplit Minimal Perfect Hashing
RecSplit Minimal Perfect Hashing
 

Plus de Sri Prasanna

Plus de Sri Prasanna (20)

Qr codes para tech radar 2
Qr codes para tech radar 2Qr codes para tech radar 2
Qr codes para tech radar 2
 
Test
TestTest
Test
 
Test
TestTest
Test
 
assds
assdsassds
assds
 
assds
assdsassds
assds
 
asdsa
asdsaasdsa
asdsa
 
dsd
dsddsd
dsd
 
About stacks
About stacksAbout stacks
About stacks
 
About Stacks
About  StacksAbout  Stacks
About Stacks
 
About Stacks
About  StacksAbout  Stacks
About Stacks
 
About Stacks
About  StacksAbout  Stacks
About Stacks
 
About Stacks
About  StacksAbout  Stacks
About Stacks
 
About Stacks
About  StacksAbout  Stacks
About Stacks
 
About Stacks
About StacksAbout Stacks
About Stacks
 
About Stacks
About StacksAbout Stacks
About Stacks
 
Network and distributed systems
Network and distributed systemsNetwork and distributed systems
Network and distributed systems
 
Introduction & Parellelization on large scale clusters
Introduction & Parellelization on large scale clustersIntroduction & Parellelization on large scale clusters
Introduction & Parellelization on large scale clusters
 
Mapreduce: Theory and implementation
Mapreduce: Theory and implementationMapreduce: Theory and implementation
Mapreduce: Theory and implementation
 
Other distributed systems
Other distributed systemsOther distributed systems
Other distributed systems
 
Distributed file systems
Distributed file systemsDistributed file systems
Distributed file systems
 

Dernier

APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAssociation for Project Management
 
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...PsychoTech Services
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeThiyagu K
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room servicediscovermytutordmt
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfagholdier
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13Steve Thomason
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdfQucHHunhnh
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfAyushMahapatra5
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...Sapna Thakur
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactPECB
 
fourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingfourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingTeacherCyreneCayanan
 

Dernier (20)

APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
 
Advance Mobile Application Development class 07
Advance Mobile Application Development class 07Advance Mobile Application Development class 07
Advance Mobile Application Development class 07
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room service
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdf
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
fourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingfourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writing
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 

Hashing

  • 1. Hashing ‣ hash functions ‣ collision resolution ‣ applications References: Algorithms in Java, Chapter 14 Except as otherwise noted, the content of this presentation http://www.cs.princeton.edu/algs4 is licensed under the Creative Commons Attribution 2.5 License. Algorithms in Java, 4th Edition · Robert Sedgewick and Kevin Wayne · Copyright © 2008 · September 2, 2009 2:46:10 AM
  • 2. Optimize judiciously “ More computing sins are committed in the name of efficiency (without necessarily achieving it) than for any other single reason— including blind stupidity. ” — William A. Wulf “ We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. ” — Donald E. Knuth “ We follow two rules in the matter of optimization: Rule 1: Don't do it. Rule 2 (for experts only). Don't do it yet - that is, not until you have a perfectly clear and unoptimized solution. ” — M. A. Jackson Reference: Effective Java by Joshua Bloch 2
  • 3. ST implementations: summary guarantee average case ordered operations implementation iteration? on keys search insert delete search hit insert delete unordered list N N N N/2 N N/2 no equals() ordered array lg N N N lg N N/2 N/2 yes compareTo() BST N N N 1.38 lg N 1.38 lg N ? yes compareTo() randomized BST 3 lg N 3 lg N 3 lg N 1.38 lg N 1.38 lg N 1.38 lg N yes compareTo() red-black tree 3 lg N 3 lg N 3 lg N lg N lg N lg N yes compareTo() Q. Can we do better? 3
  • 4. Hashing: basic plan Save items in a key-indexed table (index is a function of the key). Hash function. Method for computing table index from key. 0 1 hash("it") = 3 2 3 "it" 4 5 4
  • 5. Hashing: basic plan Save items in a key-indexed table (index is a function of the key). Hash function. Method for computing table index from key. 0 1 hash("it") = 3 2 3 "it" ?? 4 Issues hash("times") = 3 5 • Computing the hash function. • Equality test: Method for checking whether two keys are equal. • Collision resolution: Algorithm and data structure to handle two keys that hash to the same table index. Classic space-time tradeoff. • No space limitation: trivial hash function with key as address. • No time limitation: trivial collision resolution with sequential search. • Limitations on both time and space: hashing (the real world). 5
  • 6. ‣ hash functions ‣ collision resolution ‣ applications 6
  • 7. Computing the hash function Idealistic goal: scramble the keys uniformly. • Efficiently computable. • Each table index equally likely for each key. thoroughly researched problem, still problematic in practical applications Practical challenge. Need different approach for each Key type. Ex: Social Security numbers. • Bad: first three digits. 573 = California, 574 = Alaska • Better: last three digits. (assigned in chronological order within a given geographic region) Ex: phone numbers. • Bad: first three digits. • Better: last three digits. 7
  • 8. Hash codes and hash functions Hash code. All Java classes have a method hashCode(), which returns an int. between -232 and 231 - 1 Hash function. An int between 0 and M-1 (for use as an array index). First attempt. String s = "call"; int code = s.hashCode(); 3045982 int hash = code % M; 7121 8191 Bug. Don't use (code % M) as array index. 1-in-a billion bug. Don't use (Math.abs(code) % M) as array index. OK. Safe to use ((code & 0x7fffffff) % M) as array index. hex literal 31-bit mask 8
  • 9. Java’s hash code conventions The method hashCode() is inherited from Object. • Ensures hashing can be used for every object type. • Enables expert implementations for each type. Available implementations. • Default implementation: memory address of x. • Customized implementations: String, URL, Integer, Date, …. • User-defined types: users are on their own. 9
  • 10. Implementing hash code: phone numbers Ex. Phone numbers: (609) 867-5309. public final class PhoneNumber { private final int area, exch, ext; public PhoneNumber(int area, int exch, int ext) { this.area = area; this.exch = exch; this.ext = ext; } ... public boolean equals(Object y) { /* as before */ } public int hashCode() sufficiently random? { return 10007 * (area + 1009 * exch) + ext; } } 10
  • 11. Implementing hash code: strings Ex. Strings (in Java 1.5). public int hashCode() char Unicode { … … int hash = 0; 'a' 97 for (int i = 0; i < length(); i++) hash = s[i] + (31 * hash); 'b' 98 return hash; 'c' 99 ith character of s } … ... • Equivalent to h = 31L-1 · s0 + … + 312 · sL-3 + 311 · sL-2 + 310 · sL-1. • Horner's method to hash string of length L: L multiplies/adds. String s = "call"; Ex. int code = s.hashCode(); 3045982 = 99·313 + 97·312 + 108·311 + 108·310 = 108 + 31· (108 + 31 · (97 + 31 · (99))) Provably random? Well, no. 11
  • 12. A poor hash code design Ex. Strings (in Java 1.1). • For long strings: only examine 8-9 evenly spaced characters. • Benefit: saves time in performing arithmetic. public int hashCode() { int hash = 0; int skip = Math.max(1, length() / 8); for (int i = 0; i < length(); i += skip) hash = (37 * hash) + s[i]; return hash; } • Downside: great potential for bad collision patterns. http://www.cs.princeton.edu/introcs/13loop/Hello.java http://www.cs.princeton.edu/introcs/13loop/Hello.class http://www.cs.princeton.edu/introcs/13loop/Hello.html http://www.cs.princeton.edu/introcs/13loop/index.html http://www.cs.princeton.edu/introcs/12type/index.html 12
  • 13. Designing a good hash function Requirements. • If x.equals(y), then we must also have (x.hashCode() == y.hashCode()). • Repeated calls to x.hashCode() must return the same value (provided no info used in equals() is changed). Highly desirable. If !x.equals(y), then we want x y (x.hashCode() != y.hashCode()). x.hashCode() y.hashCode() Basic rule. Need to use the whole key to compute hash code. Fundamental problem. Need a theorem for each type to ensure reliability. 13
  • 14. Digression: using a hash function for data mining Use content to characterize documents. Applications. • Search documents on the web for documents similar to a given one. • Determine whether a new document belongs in one set or another. import javax.imageio.ImageIO; import java.io.*; import javax.swing.*; import java.awt.event.*; import java.awt.*; public class Picture { private BufferedImage image; private JFrame frame; ... Context. Effective for literature, genomes, Java code, art, music, data, video. 14
  • 15. Digression: using a hash function for data mining Approach. • Fix order k and dimension d. • Compute (hashCode() % d) for all k-grams in the document. • Result is d-dimensional vector profile of each document. To compare documents: Consider angle θ separating vectors • cos θ close to 0: not similar. • cos θ close to 1: similar. a b θ cos θ = a ⋅ b / a b 15
  • 16. Digression: using a hash function for data mining tale.txt genome.txt 10-grams with 10-grams with i freq freq hash code i hash code i % more tale.txt it was the best of times 0 0 0 it was the worst of times 1 0 0 it was the age of wisdom it was the age of ... foolishness ... "best of ti" "TTTCGGTTTG" 435 2 2 "foolishnes" "TGTCTGCTGC" % more genome.txt ... CTTTCGGTTTGGAACC GAAGCCGCGCGTCT 8999 "it was the" 8 0 TGTCTGCTGCAGC ... ATCGTTC ... 12122 0 "CTTTCGGTTT" 3 ... 34543 "t was the b" 5 "ATGCGGTCGA" 4 ... 65535 k = 10 d = 65536 profile 16
  • 17. Digression: using a hash function for data mining public class Document { private String name; private double[] profile; public Document(String name, int k, int d) { this.name = name; String doc = (new In(name)).readAll(); int N = doc.length(); profile = new double[d]; for (int i = 0; i < N-k; i++) { int h = doc.substring(i, i+k).hashCode(); profile[Math.abs(h % d)] += 1; } } public double simTo(Document that) { /* compute dot product and divide by magnitudes */ } } 17
  • 18. Digression: using a hash function for data mining file description Cons US Constitution TomS Tom Sawyer Huck Huckleberry Finn Prej Pride and Prejudice Pict a photograph DJIA financial data Amaz amazon.com website .html source ACTG genome % java CompareAll 5 1000 < docs.txt Cons TomS Huck Prej Pict DJIA Amaz ACTG Cons 1.00 0.89 0.87 0.88 0.35 0.70 0.63 0.58 TomS 0.89 1.00 0.98 0.96 0.34 0.75 0.66 0.62 Huck 0.87 0.98 1.00 0.94 0.32 0.74 0.65 0.61 Prej 0.88 0.96 0.94 1.00 0.34 0.76 0.67 0.63 Pict 0.35 0.34 0.32 0.34 1.00 0.29 0.48 0.24 DJIA 0.70 0.75 0.74 0.76 0.29 1.00 0.62 0.58 Amaz 0.63 0.66 0.65 0.67 0.48 0.62 1.00 0.45 ACTG 0.58 0.62 0.61 0.63 0.24 0.58 0.45 1.00 18
  • 19. ‣ hash functions ‣ collision resolution ‣ applications 19
  • 20. Helpful results from probability theory Bins and balls. Throw balls uniformly at random into M bins. 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Birthday problem. Expect two balls in the same bin after ~ π M / 2 tosses. Coupon collector. Expect every bin has ≥ 1 ball after ~ M ln M tosses. Load balancing. After M tosses, expect most loaded bin has Θ(log M / log log M) balls. 20
  • 21. Collisions Collision. Two distinct keys hashing to same index. • Birthday problem ⇒ can't avoid collisions unless you have a ridiculous amount (quadratic) of memory. • Coupon collector + load balancing ⇒ collisions will be evenly distributed. Challenge. Deal with collisions efficiently. approach 1: accept multiple collisions approach 2: minimize collisions 21
  • 22. Collision resolution: two approaches Separate chaining. [H. P. Luhn, IBM 1953] Put keys that collide in a list associated with index. Open addressing. [Amdahl-Boehme-Rocherster-Samuel, IBM 1953] When a new key collides, find next empty slot, and put it there. st[0] jocularly st[1] null st[0] jocularly seriously st[2] listen st[1] listen st[3] suburban st[2] null st[3] suburban untravelled considerating null st[30001] browsing st[8190] browsing separate chaining (M = 8191, N = 15000) linear probing (M = 30001, N = 15000) 22
  • 23. Collision resolution approach 1: separate chaining Use an array of M < N linked lists. good choice: M ~ N/10 • Hash: map key to integer i between 0 and M-1. • Insert: put at front of ith chain (if not already there). • Search: only need to search ith chain. key hash "call" 7121 jocularly seriously "me" 3480 st[0] st[1] listen "ishmael" 5017 st[2] null "seriously" 0 st[3] suburban untravelled considerating "untravelled" 3 "suburban" 3 st[8190] browsing ... ... separate chaining (M = 8191, N = 15000) 23
  • 24. Separate chaining ST: Java implementation (skeleton) public class ListHashST<Key, Value> { private int M = 8191; array doubling private Node[] st = new Node[M]; code omitted private class Node { private Object key; no generics in private Object val; arrays in Java private Node next; public Node(Key key, Value val, Node next) { this.key = key; this.val = val; this.next = next; } } private int hash(Key key) { return (key.hashCode() & 0x7ffffffff) % M; } public void put(Key key, Value val) { /* see next slide */ } public Val get(Key key) { /* see next slide */ } } 24
  • 25. Separate chaining ST: Java implementation (put and get) public void put(Key key, Value val) { int i = hash(key); for (Node x = st[i]; x != null; x = x.next) if (key.equals(x.key)) { x.val = val; return; } st[i] = new Node(key, value, first); } identical to linked-list code, except hash to pick a list public Value get(Key key) { int i = hash(key); for (Node x = st[i]; x != null; x = x.next) if (key.equals(x.key)) return (Value) x.val; return null; } 25
  • 26. Analysis of separate chaining Separate chaining performance. • Cost is proportional to length of chain. • Average length of chain α = N / M. • Worst case: all keys hash to same chain. Proposition. Let α > 1. For any t > 1, probability that chain length > t α is exponentially small in t. depends on hash map being random map Parameters. • M too large ⇒ too many empty chains. • M too small ⇒ chains too long. • Typical choice: M ~ N/10 ⇒ constant-time ops. 26
  • 27. Collision resolution approach 2: linear probing Use an array of size M >> N. good choice: M ~ 2N • Hash: map key to integer i between 0 and M-1. • Insert: put in slot i if free; if not try i+1, i+2, etc. • Search: search slot i; if occupied but no match, try i+1, i+2, etc. - - - S H - - A C E R - - 0 1 2 3 4 5 6 7 8 9 10 11 12 27
  • 28. Collision resolution approach 2: linear probing Use an array of size M >> N. good choice: M ~ 2N • Hash: map key to integer i between 0 and M-1. • Insert: put in slot i if free; if not try i+1, i+2, etc. • Search: search slot i; if occupied but no match, try i+1, i+2, etc. - - - S H - - A C E R - - 0 1 2 3 4 5 6 7 8 9 10 11 12 - - - S H - - A C E R I - insert I hash(I) = 11 0 1 2 3 4 5 6 7 8 9 10 11 12 27
  • 29. Collision resolution approach 2: linear probing Use an array of size M >> N. good choice: M ~ 2N • Hash: map key to integer i between 0 and M-1. • Insert: put in slot i if free; if not try i+1, i+2, etc. • Search: search slot i; if occupied but no match, try i+1, i+2, etc. - - - S H - - A C E R - - 0 1 2 3 4 5 6 7 8 9 10 11 12 - - - S H - - A C E R I - insert I hash(I) = 11 0 1 2 3 4 5 6 7 8 9 10 11 12 - - - S H - - A C E R I N insert N hash(N) = 8 0 1 2 3 4 5 6 7 8 9 10 11 12 27
  • 30. Linear probing ST implementation public class ArrayHashST<Key, Value> { standard ugly casts private int M = 30001; private Value[] vals = (Value[]) new Object[maxN]; array doubling private Key[] keys = (Key[]) new Object[maxN]; code omitted privat int hash(Key key) { /* as before */ } public void put(Key key, Value val) { int i; for (i = hash(key); keys[i] != null; i = (i+1) % M) if (key.equals(keys[i])) break; vals[i] = val; keys[i] = key; } public Value get(Key key) { for (int i = hash(key); keys[i] != null; i = (i+1) % M) if (key.equals(keys[i])) return vals[i]; return null; } } 28
  • 31. Clustering Cluster. A contiguous block of items. Observation. New keys likely to hash into middle of big clusters. 29
  • 32. Clustering Cluster. A contiguous block of items. Observation. New keys likely to hash into middle of big clusters. 29
  • 33. Knuth's parking problem Model. Cars arrive at one-way street with M parking spaces. Each desires a random space i: if space i is taken, try i+1, i+2, … Q. What is mean displacement of a car? displacement =3 Empty. With M/2 cars, mean displacement is ~ 3/2. Full. With M cars, mean displacement is ~ πM/8 30
  • 34. Analysis of linear probing Linear probing performance. • Insert and search cost depend on length of cluster. • Average length of cluster α = N / M. but keys more likely to • hash to big clusters Worst case: all keys hash to same cluster. Proposition. [Knuth 1962] Let α < 1 be the load factor. average probes for insert/search miss 1 — 2( 1 + 1 ( 1 − α )2 ) = ( 1 + α + 2α2 + 3α3 + 4α4 + . . . ) /2 average probes for search hit 1 — 2 ( 1 + 1 ( 1 − α ) ) = 1 + ( α + α2 + α3 + α4 + . . . ) /2 Parameters. • Load factor too small ⇒ too many empty array entries. • Load factor too large ⇒ clusters coalesce. • Typical choice: M ~ 2N ⇒ constant-time ops. 31
  • 35. Hashing: variations on the theme Many improved versions have been studied. Two-probe hashing. (separate chaining variant) • Hash to two positions, put key in shorter of the two chains. • Reduces average length of the longest chain to log log N. Double hashing. (linear probing variant) • Use linear probing, but skip a variable amount, not just 1 each time. • Effectively eliminates clustering. • Can allow table to become nearly full. 32
  • 36. Double hashing Idea. Avoid clustering by using second hash to compute skip for search. Hash function. Map key to integer i between 0 and M-1. Second hash function. Map key to nonzero skip value k. Ex: k = 1 + (v mod 97). hashCode() Effect. Skip values give different search paths for keys that collide. Best practices. Make k and M relatively prime. 33
  • 37. Double hashing performance Theorem. [Guibas-Szemerédi] Let α = N / M < 1 be average length of cluster. Average probes for insert/search miss 1 = 1 + α + α2 + α3 + α4 + . . . ( 1 − α ) Average probes for search hit 1 1 — ln = 1 + α/2 + α2 /3 + α3 /4 + α4 /5 + . . . α ( 1 − α ) Parameters. Typical choice: α ~ 1.2 ⇒ constant-time ops. Disadvantage. Deletion is cumbersome to implement. 34
  • 38. Hashing Tradeoffs Separate chaining vs. linear probing/double hashing. • Space for links vs. empty table slots. • Small table + linked allocation vs. big coherent array. Linear probing vs. double hashing. load factor 50% 66% 75% 90% get 1.5 2.0 3.0 5.5 linear probing put 2.5 5.0 8.5 55.5 get 1.4 1.6 1.8 2.6 double hashing put 1.5 2.0 3.0 5.5 number of probes 35
  • 39. Summary of symbol-table implementations guarantee average case ordered operations implementation iteration? on keys search insert delete search hit insert delete unordered list N N N N/2 N N/2 no equals() ordered array lg N N N lg N N/2 N/2 yes compareTo() BST N N N 1.38 lg N 1.38 lg N ? yes compareTo() randomized BST 3 lg N 3 lg N 3 lg N 1.38 lg N 1.38 lg N 1.38 lg N yes compareTo() red-black tree 3 lg N 3 lg N 3 lg N lg N lg N lg N yes compareTo() equals() hashing 1* 1* 1* 1* 1* 1* no hashCode() * assumes random hash function 36
  • 40. Hashing versus balanced trees Hashing • Simpler to code. • No effective alternative for unordered keys. • Faster for simple keys (a few arithmetic ops versus log N compares). • Better system support in Java for strings (e.g., cached hash code). • Does your hash function produce random values for your key type?? Balanced trees. • Stronger performance guarantee. • Can support many more ST operations for ordered keys. • Easier to implement compareTo() correctly than equals() and hashCode(). Java system includes both. • Red-black trees: java.util.TreeMap, java.util.TreeSet. • Hashing: java.util.HashMap, java.util.IdentityHashMap. 37
  • 41. Typical "full" symbol table API public class *ST<Key extends Comparable<Key>, Value> implements Iterable<Key> *ST() create an empty symbol table void put(Key key, Value val) put key-value pair into the table Value get(Key key) return value paired with key; null if no such value boolean contains(Key key) is there a value paired with key? Key min() return smallest key Key max() return largest key Key ceil(Key key) return smallest key in table ≥ query key Key floor(Key key) return largest key in table ≤ query key void remove(Key key) remove key-value pair from table Iterator<Key> iterator() iterator through keys in table Hashing is not suitable for implementing such an API (no order). BSTs are easy to extend to support such an API (basic tree ops). 38
  • 42. ‣ hash functions ‣ collision resolution ‣ applications 39
  • 43. Searching challenge Problem. Index for a PC or the web. Assumptions. 1 billion++ words to index. Which searching method to use? • Hashing implementation of ST. • Hashing implementation of SET. • Red-black-tree implementation of ST. • Red-black-tree implementation of SET. • Doesn’t matter much. 40
  • 44. Index for search in a PC ST<String, SET<File>> st = new ST<String, SET<File>>(); for (File f : filesystem) { In in = new In(f); String[] words = in.readAll().split("s+"); for (int i = 0; i < words.length; i++) { build index String s = words[i]; if (!st.contains(s)) st.put(s, new SET<File>()); SET<File> files = st.get(s); files.add(f); } } SET<File> files = st.get(s); process lookup for (File f : files) ... request 41
  • 45. Searching challenge Problem. Index for an e-book. Assumptions. Book has 100,000+ words. Which searching method to use? • Hashing implementation of ST. • Hashing implementation of SET. • Red-black-tree implementation of ST. • Red-black-tree implementation of SET. • Doesn’t matter much. 42
  • 46. Index for a book public class Index { public static void main(String[] args) { String[] words = StdIn.readAll().split("s+"); read book and ST<String, SET<Integer>> st; create ST st = new ST<String, SET<Integer>>(); for (int i = 0; i < words.length; i++) { String s = words[i]; if (!st.contains(s)) process all words st.put(s, new SET<Integer>()); SET<Integer> pages = st.get(s); pages.add(page(i)); } for (String s : st) print index StdOut.println(s + ": " + st.get(s)); } } 43
  • 47. Searching challenge 5 Problem. Sparse matrix-vector multiplication. Assumptions. Matrix dimension is 10,000; average nonzeros per row ~ 10. Which searching method to use? 1) Unordered array. 2) Ordered linked list. 3) Ordered array with binary search. 4) Need better method, all too slow. A * x = b 5) Doesn’t matter much, all fast enough. 44
  • 48. Sparse vectors and matrices Vector. Ordered sequence of N real numbers. Matrix. N-by-N table of real numbers. vector operations a = [ 0 3 15 ] , b = [−1 2 2 ] a + b = [−1 5 17 ] a o b = (0 ⋅ −1) + (3 ⋅ 2) + (15 ⋅ 2) = 36 a = a o a = 0 2 + 3 2 + 15 2 = 3 26 € matrix-vector multiplication 0 1 1 −1  4       2 4 −2 ×  2 =  2 0 3 15    2   36   € 45
  • 49. Sparse vectors and matrices An N-by-N matrix is sparse if it contains O(N) nonzeros. Property. Large matrices that arise in practice are sparse. 2D array matrix representation. • Constant time access to elements. • Space proportional to N2. Goal. • Efficient access to elements. • Space proportional to number of nonzeros. 46
  • 50. Sparse vector data type public class SparseVector { private int N; // length private ST<Integer, Double> st; // the elements public SparseVector(int N) { all 0s vector this.N = N; this.st = new ST<Integer, Double>(); } public void put(int i, double value) { if (value == 0.0) st.remove(i); a[i] = value else st.put(i, value); } public double get(int i) { if (st.contains(i)) return st.get(i); return a[i] else return 0.0; } ... 47
  • 51. Sparse vector data type (cont) public double dot(SparseVector that) { double sum = 0.0; for (int i : this.st) dot product if (that.st.contains(i)) sum += this.get(i) * that.get(i); return sum; } public double norm() 2-norm { return Math.sqrt(this.dot(this)); } public SparseVector plus(SparseVector that) { SparseVector c = new SparseVector(N); for (int i : this.st) vector sum c.put(i, this.get(i)); for (int i : that.st) c.put(i, that.get(i) + c.get(i)); return c; } } 48
  • 52. Sparse matrix data type public class SparseMatrix { private final int N; // length private SparseVector[] rows; // the elements public SparseMatrix(int N) { all 0s matrix this.N = N; this.rows = new SparseVector[N]; for (int i = 0; i < N; i++) this.rows[i] = new SparseVector(N); } public void put(int i, int j, double value) a[i][j] = value { rows[i].put(j, value); } public double get(int i, int j) return a[i][j] { return rows[i].get(j); } public SparseVector times(SparseVector x) { SparseVector b = new SparseVector(N); for (int i = 0; i < N; i++) matrix-vector b.put(i, rows[i].dot(x)); multiplication return b; } } 49
  • 53. Hashing in the wild: algorithmic complexity attacks Is the random hash map assumption important in practice? • Obvious situations: aircraft control, nuclear reactor, pacemaker. • Surprising situations: denial-of-service attacks. malicious adversary learns your hash function (e.g., by reading Java API) and causes a big pile-up in single slot that grinds performance to a halt Real-world exploits. [Crosby-Wallach 2003] • Bro server: send carefully chosen packets to DOS the server, using less bandwidth than a dial-up modem. • Perl 5.8.0: insert carefully chosen strings into associative array. • Linux 2.4.20 kernel: save files with carefully chosen names. 50
  • 54. Algorithmic complexity attack on Java Goal. Find strings with the same hash code. Solution. The base-31 hash code is part of Java's string API. key hashCode() "Aa" 2112 "BB" 2112 Q. Does your hash function produce random values for your key type? 51
  • 55. Algorithmic complexity attack on Java Goal. Find strings with the same hash code. Solution. The base-31 hash code is part of Java's string API. key hashCode() key hashCode() key hashCode() "Aa" 2112 "AaAaAaAa" -540425984 "BBAaAaAa" -540425984 "BB" 2112 "AaAaAaBB" -540425984 "BBAaAaBB" -540425984 "AaAaBBAa" -540425984 "BBAaBBAa" -540425984 "AaAaBBBB" -540425984 "BBAaBBBB" -540425984 "AaBBAaAa" -540425984 "BBBBAaAa" -540425984 "AaBBAaBB" -540425984 "BBBBAaBB" -540425984 "AaBBBBAa" -540425984 "BBBBBBAa" -540425984 "AaBBBBBB" -540425984 "BBBBBBBB" -540425984 2N strings of length 2N that hash to same value! Q. Does your hash function produce random values for your key type? 51
  • 56. One-way hash functions One-way hash function. Hard to find a key that will hash to a desired value, or to find two keys that hash to same value. Ex. MD4, MD5, SHA-0, SHA-1, SHA-2, WHIRLPOOL, RIPEMD-160. known to be insecure String password = args[0]; MessageDigest sha1 = MessageDigest.getInstance("SHA1"); byte[] bytes = sha1.digest(password); /* prints bytes as hex string */ Applications. Digital fingerprint, message digest, storing passwords. Caveat. Too expensive for use in ST implementations. 52

Notes de l'éditeur

  1. [wayne S08] this lecture was a bit long - I trimmed some of the applications
  2. Goal: hash function scrambles keys, then each chain will have roughly 1/M fraction of elements Assumes table size is approximately M = 1000
  3. int between -2^32 and 2^31 - 1 Note: reference Java implementation caches String hash codes. Can&apos;t use h % M for index since h can be negative so % can return negative number. Plausible fix |h| % M doesn&apos;t work since |h| can be negative if h = -2^31. Should also count 1 bit-whacking operation in work to hash string of length W.
  4. OK, but not desirable, for two distinct keys to have the same hashCode() users are on their own - that&apos;s you!
  5. leading 0s are annoying if you take integer inputs - Probably easier (but worse style) to implement with strings than integers! 1009 and 10007 are primes just bigger than 999 and 9999
  6. Note: reference Java implementation caches String hash codes. this is essentially Java&apos;s implementation of hashCode() for strings. It&apos;s ok to add an int and a char. It&apos;s also ok if the result overflows since this is all well-defined in java. (2&apos;s complement integers)
  7. Java 1.1: only considered 8 characters, evenly spaced starting at beginning. Intent was for speed, but ended up creating performance problems because of bad collision patterns. (See complexity attacks later in lecture)
  8. Practical realities. Cost is an important consideration. True randomness is hard to achieve.
  9. Ex. After 23 people enter a room, expect two with same birthday.
  10. ridiculous amount = quadratic
  11. separate chaining is easy extension of linked list ST linear probing is easy extension of array ST
  12. Q. Now, what if we insert G and it has a hash value of 10. A. Wraparound
  13. Q. Now, what if we insert G and it has a hash value of 10. A. Wraparound
  14. http://portal.acm.org/citation.cfm?id=1103965 1/4 sqrt(2 pi * M) + 1/3 + 1/48 sqrt(2 pi / M) + O(1/M) 1/2 (1 + 1 / (1 - alpha)) - 1 / ((2 (1 - alpha)^3) M)) + O(1/M^2) for typical values of alpha Lesson: don&apos;t fill up linear probing hash table!
  15. estimates only accurate when alpha isn&apos;t too close to 1 landmark result in analysis of algorithms!
  16. 97 is good choice since it is relatively prime to M
  17. deep result of Guibas and Szemeredi. estimates only accurate when alpha isn&apos;t too close to 1
  18. java.util.HashMap uses separate chaining, java.util.IdentityHashMap uses linear probing
  19. Ex: Can use LLRB trees implement priority queues for distinct keys
  20. Caveat: use B-tree or similar structure for truly huge indices hash table uses less space
  21. care about ordered iteration
  22. http://people.scs.fsu.edu/~burkardt/data/sparse_cc/sparse_cc.html
  23. http://people.scs.fsu.edu/~burkardt/data/sparse_cc/sparse_cc.html
  24. we take little solace in knowing that we encountered a statistically insignificant event. http://www.cs.rice.edu/~scrosby/hash
  25. Alternative 1: hide details of String hashCode(), This is security-by-obscurity Alternate 2: use different data structure that guarantees performance
  26. SHA-1 has 160-bit output Caveat: too expensive to use one-way hash functions for ST because you can get a performance guarantee using ST and it will run faster in practice.