SlideShare une entreprise Scribd logo
1  sur  69
Télécharger pour lire hors ligne
STRUCTURES DE DONNEES
EXOTIQUES

     17h - 17h50 - Salle Seine A
•  Sam BESSALAH	

	

•  Independent Software Developer,	

	

•  Works for startups, finance shops, mostly in Big
      Data, Machine Learning related projects.	

	

•  Rambling on twitter as @samklr
Why care about data structures?
Powerful librairies, powerful frameworks. 	

                        But .. 	

                          	

                           	

« If all you have is a hammer, everything looks	

                    like a nail ». 	

                               Abraham H. Maslow
SKIPLISTS
Efficient structure for sorted data sets	

	

Time complexity for basic operations :	

Insertion in O(log N)	

Removal in O(log N)	

Contains and Retrieval in O(log N)	

Range operations in O(log N)	

Find the k-th element in the set in O(log N)
Start with a simple linked list
Add extra levels for express lines
Search for a value looks like this
Insert (X)	

Search X to find its place in the bottom list.	

       (Remember the bottom list contains all the elements ).	

	

      Find which other list should contain X	

        Use a controlled probabilistic distribution.	

        Flip a coin;	

           if HEADS	

               Promote x to next level up, then flip again	

	

       At the end we end up with a distribution of the data like this	

        - ½ of the elements promoted 0 level	

        - ¼ of the elemnts promoted 1 level	

        - 1/8 of the elements promoted 2 levels	

        - and so on
Remove (X)	

 Find X in all the levels it participates and delete	

 If One or more of the upper levels are empty,
remove them.
Insert	

void insert(E value) {
       SkipNode<E> x = header;
       SkipNode<E>[] update = new SkipNode[MAX_LEVEL + 1];

      for (int i = level; i >= 0; i--) {
       while (x.forward[i] != null && x.forward[i].value.compareTo(value) < 0) {
          x = x.forward[i];
       }
       update[i] = x;
    }
    x = x.forward[0];

  if (x == null || !x.value.equals(value)) {
          int lvl = randomLevel();
          if (lvl > level) {
            for (int i = level + 1; i <= lvl; i++) {
                update[i] = header;
            }
            level = lvl;
        }
…	

x = new SkipNode<E>(lvl, value);
      for (int i = 0; i <= lvl; i++) {
         x.forward[i] = update[i].forward[i];
         update[i].forward[i] = x;     }

      }
  }



  private int getLevel(){ // Coin Flipping
   int lvl = (int)(Math.log(1.-Math.random())/Math.log(1.-P));
         return Math.min(lvl, MAX_LEVEL);
   }
Fast structure on average, very fast in practice	

	

Can be implemented in a thread safe way without 	

locking the entire strcture, and instead acting 	

on pointers. In a lock free fashion, by using 	

CAS instructions.	

	

In Java since JDK 1.6 within the collections
java.util.concurrent.ConcurrentSkipListMap and	

java.util.concurrent.ConcurrentSkipListSet. 	

Both are non blocking, thread safe data structures with locality
of reference properties.	

   Ideal for cache implementation.
CAS instruction	

      CAS(address, expected_value, new_value)	

	

Atomically replaces the value at the address with
    the new_value if it is equal to the
    expected_value. 	

In one instruction CMPXCHG	

	


Returns true if successful, false otherwise.
Drawbacks

They can be memory hungry, by being space
inefficient.
TRIES
- Ordered Tree data structure, used to store associative
arrays. Usually, encoded keys through traversal, in the nodes of
the tree with value in the leaves.	

	

-Used for dictionnary (Map), word completion, 	

web requests parsing., etc.	

	

-Time complexity in O(k) where k is the length of	

searched string. Where it usually is O(length of the tree)
HASH ARRAY MAPPED TRIE 
           (HAMT)
                	


Functional data structure, for fast association map	

	

Based on the idea of hashing a key, and storing 	

the value on a trie. 	

	

Each node is an array of size 32, pointing to children
nodes of same size with maximum Seven	

deferencement, for fast key look ups.
How does it work?   	

                                  	

During association of K → V , compute hashes yielding an Integer
or a long, coded on the JVM in 32 bits	

	

Partition those bits to blocks of 5 bits, representing	

a level in the tree structure, from the right most (root)	

to the left most (leaves)	

	

The colored nodes have between 2 and 31 children	

	

Each colored node maintain a bitmap, encoding 	

how many children the nodes contains, and their indexes	

in the array.
Trie Structure (from Rich Hikey)
Hash	
  Array	
  Mapped	
  Tries	
  (HAMT)	
  
Hash	
  Array	
  Mapped	
  Tries	
  (HAMT)	
  


 0 = 0000002
Hash	
  Array	
  Mapped	
  Tries	
  (HAMT)	
  

                 0	
  
Hash	
  Array	
  Mapped	
  Tries	
  (HAMT)	
  


16 = 0100002     0	
  
Hash	
  Array	
  Mapped	
  Tries	
  (HAMT)	
  

                 0	
   16	
  
Hash	
  Array	
  Mapped	
  Tries	
  (HAMT)	
  


 4 = 0001002     0	
   16	
  
Hash	
  Array	
  Mapped	
  Tries	
  (HAMT)	
  


 4 = 0001002           16	
  



               0	
  
Hash	
  Array	
  Mapped	
  Tries	
  (HAMT)	
  

                        16	
  



          0	
   4	
  
Hash	
  Array	
  Mapped	
  Tries	
  (HAMT)	
  


12 = 0011002                 16	
  



               0	
   4	
  
Hash	
  Array	
  Mapped	
  Tries	
  (HAMT)	
  


12 = 0011002                 16	
  



               0	
   4	
  
Hash	
  Array	
  Mapped	
  Tries	
  (HAMT)	
  

                                 16	
  



          0	
   4	
     12	
  
Hash	
  Array	
  Mapped	
  Tries	
  (HAMT)	
  

                                 16	
  33	
  



          0	
   4	
     12	
  
Hash	
  Array	
  Mapped	
  Tries	
  (HAMT)	
  

                                 16	
  33	
   48	
  



          0	
   4	
     12	
  
Hash	
  Array	
  Mapped	
  Tries	
  (HAMT)	
  

                                 16	
     48	
  



          0	
   4	
     12	
                       33	
  37	
  
Hash	
  Array	
  Mapped	
  Tries	
  (HAMT)	
  

                                16	
     48	
  



               4	
     12	
                       33	
  37	
  


       0	
     3	
  
Hash	
  Array	
  Mapped	
  Tries	
  (HAMT)	
  



                     4	
     12	
          16	
  20	
  25	
     33	
  37	
     48	
     57	
  


       0	
   1	
     3	
     8	
   9	
  
Hash	
  Array	
  Mapped	
  Tries	
  (HAMT)	
  



                     4	
     12	
          16	
  20	
  25	
     33	
  37	
          48	
     57	
  


       0	
   1	
     3	
     8	
   9	
                                  Too	
  much	
  space!	
  
Hash	
  Array	
  Mapped	
  Tries	
  (HAMT)	
  



                     4	
     12	
          16	
  20	
  25	
     33	
  37	
     48	
  57	
  


        0	
   1	
   3	
      8	
   9	
  
Hash	
  Array	
  Mapped	
  Tries	
  (HAMT)	
  



                     4	
     12	
          16	
  20	
  25	
     33	
  37	
     48	
  57	
  


        0	
   1	
   3	
      8	
   9	
                Linear	
  search	
  at	
  every	
  level	
  -­‐	
  slow!	
  
Hash	
  Array	
  Mapped	
  Tries	
  (HAMT)	
  



                     4	
     12	
              16	
  20	
  25	
       33	
  37	
     48	
  57	
  


       0	
   1	
   3	
                8	
   9	
                     SoluHon	
  –	
  bitmap	
  index!	
  
                                                                    Relying	
  on	
  BITPOP	
  instrucHon.	
  
Hash	
  Array	
  Mapped	
  Tries	
  (HAMT)	
  

                                    48	
           57	
  




            1	
   0	
   1	
   0	
   48	
           57	
  



            1	
   0	
   1	
   0	
   48	
  57	
  



                            10	
   48	
  57	
  


   BITPOP(((1  ((hc  lev)  1F)) – 1)  BMP)
Complexity almost all in log32 N ~ O(7) ~ O(1) 	

                             	

        Requires only 7 max, array deferencements.
                                                 	

                             	

	

        O(1)                   O(log n)               O(n)	

      Append                                          concat	

      First 	

 	

 	

 	

 	

 	

 	

 	

 	

 	

   	

insert	

      Last 	

 	

 	

 	

 	

 	

 	

 	

 	

 	

    	

prepend	

      K-th	

      Update
Hash Array Mapped Tries
               (HAMT)
•  advantages:	

    •  low space consumption and shrinking	

    •  no contiguous memory region required	

    •  fast – logarithmic complexity, but with a low
       constant factor	

   •  used as efficient immutable maps	

   Clojure's PersistentHashMap and PersistentVector,
      Scala's Mutable Map.	

   •  no global resize phase – real time applications,
      potentially more scalable concurrent operations?
Concurrent Trie (Ctrie)	

•  goals:	

    •  thread-safe concurrent trie	

    •  maintain the advantages of HAMT	

    •  rely solely on CAS instructions	

    •  ensure lock-freedom and linearizability	

	

•  lookup – probably same as for HAMT
Lock	
  Free	
  Concurrent	
  Trie	
  (C	
  Trie)	
  
                                                                                                                 T2	
  



                                                                                                   	
  CAS	
  
                      4	
   9	
   12	
            20	
  25	
                  20	
  25	
  28	
  
                                                                                                   	
  CAS	
  
        0	
   1	
   3	
                    16	
  17	
            16	
  17	
  18	
  


                                                                                                                 T1	
  
SKETCHES	

(Or how to remove the fat in your datasets)
BLOOM FILTERS
Probabilistic data structures, designed to tell rapidly in a 	

memory efficient way, whether anelement is absent or	

not from a set.	

	

Made of Array of bits B of length n, and a hash function h	

   Insert (X) : for all i in set, B[h(x,i)] = 1	

   Query (X) : return FALSEif for all i, B[h(x,i)]=0	

	

Only returns if all k bits are set.
We can pick the error rate and optimize the size of the filter to match requirements.
TRADE OFFS	

	

More hashes enduce more accurate results, but render the
sketch less space efficient.	

	

Choice of the hash function is important. 	

	

Cryptographic hashes are great; because evenly distributed,
with less collisions.	

	

Watch out to time spent computing your hashes.
Cool properties	

	

Intersection and Union through AND and OR bitwise
operations	

	

No False negatives	

	

For union, helps with parallel construction of BF	

	

Fast approximation of set union, by using bit map instead of
set manipulation
Handling Deletion
                           	

 Bloom filters can handle insertions, but not
 deletions.	

                   xi    xj
 	

B	

 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0
 	

 If deleting xi means resetting 1s to 0s, then
 deleting xi will “delete” xj. 	

 Solution : Counting Bloom Filter
COUNTING BLOOM FILTERS	

Keeps track of inserts	

- Query(X) : return TRUE if for all i b[h(x,i)]0	

- Insert(X) : if (query(x) == false ){ //don't insert twice	

                  For all i b[h(x,i)]++ }	

- Remove(X) : if (query(x) == true ) {//don't remove absents	

                   For all i, b[h(x,i)]-- }
Usages
                             	

- Fast web events tracking	

- IO optimisations in databases like Cassandra	

 Hadoop, Hbases ..	

- Network IP filtering ...
Guava Bloom Filters	

Default gives a 3 % error. With a MurMurHashV3 function.	

Creating the BloomFilter	

BloomFilter bloomFilter = BloomFilter.create(Funnels.byteArrayFunnel(),
1000);	

	

//Putting elements into the filter	

//A BigInteger representing a key of some sort	

bloomFilter.put(bigInteger.toByteArray());	

	

//Testing for element in set	

boolean mayBeContained =
bloomFilter.mayContain(bitIntegerII.toByteArray());
//With a custom filter	

class BigIntegerFunnel implements FunnelBigInteger {	

             @Override	

             public void funnel(BigInteger from, Sink into) {	

                    into.putBytes(from.toByteArray());	

             }	

      }	

//Creating the BloomFilter	

BloomFilter bloomFilter = BloomFilter.create(new BigIntegerFunnel(), 1000);	

	

//Putting elements into the filter	

bloomFilter.put(bigInteger);	

	

//Testing for element in set	

boolean mayBeContained = bloomFilter.mayContain(bitIntegerII);
COUNT MIN SKETCHES
Family of memory efficient data structures, that can	

estimate frequency-related properties of the data set.	

	

Used to find occurrences of an element in a set, in 	

time / space efficient way, with a tunable accuracy.	

	

E.g Find heavy hitters elements; perform range queries	

(where the goal is to find the sum of frequencies of 	

  Elements in a certain range ), estimate quantiles.
How does it work?
Two-dimensional array (d x w) of integer counters. When a
value arrives, it is mapped to one position at each of d rows
using d different and preferably independent hash functions.
d : number of hash functions
w : hash array size
Algorithm :	

insert(x):	

for i =1 to d 	

          M[i,h(x,i)) ++ //Like counting bloom filters	

query(x):	

 return min {h(x,i) for all 1 ≤ i ≤ d}
Error Estimation
                                   	





Accuracy depends on the ratio between sketch size 	

and number of expected data. Works better with	

Highly uncorrelated, unstructured data.	

	

For higly skewed data, use noise estimation,	

  and compute the median estimation
HEAVY HITTERS 
                       (through count min sketches)
                                                  	

Find all the elements in a data sets with frequencies over a 	

Fixed threshold. K percent of the total number in the set.	

	

Use a count min sketched algorithm.	

	

Use case : detect most trafic consuming IP addresses, thwart 	

DdoS attacks by blacklisting those Ips. Detect market prices	

with highest bids swings
Algorithm.	

      1. Maintain a Count-min sketch of all the element in the set	

      2. Maintain a heap of top elements, initially empty, and a 	

         Counter N, of already processed elements.	

      3. For each element in the set	

        Add it in the set	

        Estimate the Frequency of the element. If higher than 	

        the threshold k*N, add it to heap. Continuously clean the 	

        The heap of all the elements below the new threshold.
OTHER INTERESTING SKETCHES
http://highlyscalable.wordpress.com/2012/05/01/probabilistic-
structures-web-analytics-data-mining/
Bibliographies  Blogs.
http://highscalability.com/blog/2012/4/5/big-data-counting-how-to-
count-a-billion-distinct-objects-us.html


Librairies. http://github.com/slearspring/stream-lib
        org.apache.cassandra.utils.{MurmurHashV3, BloomFilter}
        Google Guava.
Ideal Hash Trees by Phil BagWell
http://infoscience.epfl.ch/record/64398/files/idealhashtrees.pdf

Concurrent Tries in the Scala Parallel Collections
SkipLists By William Pugh.

Contenu connexe

Similaire à Structures de données exotiques

Faster persistent data structures through hashing
Faster persistent data structures through hashingFaster persistent data structures through hashing
Faster persistent data structures through hashingJohan Tibell
 
Chapter 10: hashing data structure
Chapter 10:  hashing data structureChapter 10:  hashing data structure
Chapter 10: hashing data structureMahmoud Alfarra
 
Presentation.pptx
Presentation.pptxPresentation.pptx
Presentation.pptxAgonySingh
 
Data Structures : hashing (1)
Data Structures : hashing (1)Data Structures : hashing (1)
Data Structures : hashing (1)Home
 
11_hashtable-1.ppt. Data structure algorithm
11_hashtable-1.ppt. Data structure algorithm11_hashtable-1.ppt. Data structure algorithm
11_hashtable-1.ppt. Data structure algorithmfarhankhan89766
 
Hashing.pptx
Hashing.pptxHashing.pptx
Hashing.pptxkratika64
 
computer notes - Data Structures - 35
computer notes - Data Structures - 35computer notes - Data Structures - 35
computer notes - Data Structures - 35ecomputernotes
 
Computer notes - Hashing
Computer notes - HashingComputer notes - Hashing
Computer notes - Hashingecomputernotes
 
Concurrent Hashing and Natural Parallelism : The Art of Multiprocessor Progra...
Concurrent Hashing and Natural Parallelism : The Art of Multiprocessor Progra...Concurrent Hashing and Natural Parallelism : The Art of Multiprocessor Progra...
Concurrent Hashing and Natural Parallelism : The Art of Multiprocessor Progra...Subhajit Sahu
 
Class 9: Consistent Hashing
Class 9: Consistent HashingClass 9: Consistent Hashing
Class 9: Consistent HashingDavid Evans
 
Introduction to Matlab
Introduction to MatlabIntroduction to Matlab
Introduction to MatlabAmr Rashed
 
Hashing In Data Structure Download PPT i
Hashing In Data Structure Download PPT iHashing In Data Structure Download PPT i
Hashing In Data Structure Download PPT icajiwol341
 
C Interview Questions for Fresher
C Interview Questions for FresherC Interview Questions for Fresher
C Interview Questions for FresherJaved Ahmad
 
C interview-questions-techpreparation
C interview-questions-techpreparationC interview-questions-techpreparation
C interview-questions-techpreparationsonu sharma
 
C interview-questions-techpreparation
C interview-questions-techpreparationC interview-questions-techpreparation
C interview-questions-techpreparationKgr Sushmitha
 

Similaire à Structures de données exotiques (20)

Faster persistent data structures through hashing
Faster persistent data structures through hashingFaster persistent data structures through hashing
Faster persistent data structures through hashing
 
Chapter 10: hashing data structure
Chapter 10:  hashing data structureChapter 10:  hashing data structure
Chapter 10: hashing data structure
 
Hashing
HashingHashing
Hashing
 
Presentation.pptx
Presentation.pptxPresentation.pptx
Presentation.pptx
 
Data Structures : hashing (1)
Data Structures : hashing (1)Data Structures : hashing (1)
Data Structures : hashing (1)
 
11_hashtable-1.ppt. Data structure algorithm
11_hashtable-1.ppt. Data structure algorithm11_hashtable-1.ppt. Data structure algorithm
11_hashtable-1.ppt. Data structure algorithm
 
4.4 hashing02
4.4 hashing024.4 hashing02
4.4 hashing02
 
Hashing.pptx
Hashing.pptxHashing.pptx
Hashing.pptx
 
computer notes - Data Structures - 35
computer notes - Data Structures - 35computer notes - Data Structures - 35
computer notes - Data Structures - 35
 
Computer notes - Hashing
Computer notes - HashingComputer notes - Hashing
Computer notes - Hashing
 
Commands list
Commands listCommands list
Commands list
 
Concurrent Hashing and Natural Parallelism : The Art of Multiprocessor Progra...
Concurrent Hashing and Natural Parallelism : The Art of Multiprocessor Progra...Concurrent Hashing and Natural Parallelism : The Art of Multiprocessor Progra...
Concurrent Hashing and Natural Parallelism : The Art of Multiprocessor Progra...
 
Class 9: Consistent Hashing
Class 9: Consistent HashingClass 9: Consistent Hashing
Class 9: Consistent Hashing
 
Introduction to Matlab
Introduction to MatlabIntroduction to Matlab
Introduction to Matlab
 
Hashing In Data Structure Download PPT i
Hashing In Data Structure Download PPT iHashing In Data Structure Download PPT i
Hashing In Data Structure Download PPT i
 
C Interview Questions for Fresher
C Interview Questions for FresherC Interview Questions for Fresher
C Interview Questions for Fresher
 
C interview-questions-techpreparation
C interview-questions-techpreparationC interview-questions-techpreparation
C interview-questions-techpreparation
 
C interview-questions-techpreparation
C interview-questions-techpreparationC interview-questions-techpreparation
C interview-questions-techpreparation
 
C interview Question and Answer
C interview Question and AnswerC interview Question and Answer
C interview Question and Answer
 
Hash function
Hash functionHash function
Hash function
 

Plus de Samir Bessalah

Machine Learning In Production
Machine Learning In ProductionMachine Learning In Production
Machine Learning In ProductionSamir Bessalah
 
Tuning tips for Apache Spark Jobs
Tuning tips for Apache Spark JobsTuning tips for Apache Spark Jobs
Tuning tips for Apache Spark JobsSamir Bessalah
 
Eventual Consitency with CRDTS
Eventual Consitency with CRDTSEventual Consitency with CRDTS
Eventual Consitency with CRDTSSamir Bessalah
 
Deep learning for mere mortals - Devoxx Belgium 2015
Deep learning for mere mortals - Devoxx Belgium 2015Deep learning for mere mortals - Devoxx Belgium 2015
Deep learning for mere mortals - Devoxx Belgium 2015Samir Bessalah
 
High Performance RPC with Finagle
High Performance RPC with FinagleHigh Performance RPC with Finagle
High Performance RPC with FinagleSamir Bessalah
 
scalable machine learning
scalable machine learningscalable machine learning
scalable machine learningSamir Bessalah
 
Algebird : Abstract Algebra for big data analytics. Devoxx 2014
Algebird : Abstract Algebra for big data analytics. Devoxx 2014Algebird : Abstract Algebra for big data analytics. Devoxx 2014
Algebird : Abstract Algebra for big data analytics. Devoxx 2014Samir Bessalah
 
Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013Samir Bessalah
 

Plus de Samir Bessalah (10)

Machine Learning In Production
Machine Learning In ProductionMachine Learning In Production
Machine Learning In Production
 
Tuning tips for Apache Spark Jobs
Tuning tips for Apache Spark JobsTuning tips for Apache Spark Jobs
Tuning tips for Apache Spark Jobs
 
Eventual Consitency with CRDTS
Eventual Consitency with CRDTSEventual Consitency with CRDTS
Eventual Consitency with CRDTS
 
Deep learning for mere mortals - Devoxx Belgium 2015
Deep learning for mere mortals - Devoxx Belgium 2015Deep learning for mere mortals - Devoxx Belgium 2015
Deep learning for mere mortals - Devoxx Belgium 2015
 
High Performance RPC with Finagle
High Performance RPC with FinagleHigh Performance RPC with Finagle
High Performance RPC with Finagle
 
scalable machine learning
scalable machine learningscalable machine learning
scalable machine learning
 
mesos-devoxx14
mesos-devoxx14mesos-devoxx14
mesos-devoxx14
 
Algebird : Abstract Algebra for big data analytics. Devoxx 2014
Algebird : Abstract Algebra for big data analytics. Devoxx 2014Algebird : Abstract Algebra for big data analytics. Devoxx 2014
Algebird : Abstract Algebra for big data analytics. Devoxx 2014
 
Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013
 
Scala+data
Scala+dataScala+data
Scala+data
 

Dernier

How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfhans926745
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 

Dernier (20)

How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 

Structures de données exotiques

  • 1. STRUCTURES DE DONNEES EXOTIQUES 17h - 17h50 - Salle Seine A
  • 2. •  Sam BESSALAH •  Independent Software Developer, •  Works for startups, finance shops, mostly in Big Data, Machine Learning related projects. •  Rambling on twitter as @samklr
  • 3. Why care about data structures?
  • 4. Powerful librairies, powerful frameworks. But .. « If all you have is a hammer, everything looks like a nail ». Abraham H. Maslow
  • 6. Efficient structure for sorted data sets Time complexity for basic operations : Insertion in O(log N) Removal in O(log N) Contains and Retrieval in O(log N) Range operations in O(log N) Find the k-th element in the set in O(log N)
  • 7. Start with a simple linked list
  • 8. Add extra levels for express lines
  • 9. Search for a value looks like this
  • 10. Insert (X) Search X to find its place in the bottom list. (Remember the bottom list contains all the elements ). Find which other list should contain X Use a controlled probabilistic distribution. Flip a coin; if HEADS Promote x to next level up, then flip again At the end we end up with a distribution of the data like this - ½ of the elements promoted 0 level - ¼ of the elemnts promoted 1 level - 1/8 of the elements promoted 2 levels - and so on
  • 11. Remove (X) Find X in all the levels it participates and delete If One or more of the upper levels are empty, remove them.
  • 12. Insert void insert(E value) { SkipNode<E> x = header; SkipNode<E>[] update = new SkipNode[MAX_LEVEL + 1]; for (int i = level; i >= 0; i--) { while (x.forward[i] != null && x.forward[i].value.compareTo(value) < 0) { x = x.forward[i]; } update[i] = x; } x = x.forward[0]; if (x == null || !x.value.equals(value)) { int lvl = randomLevel(); if (lvl > level) { for (int i = level + 1; i <= lvl; i++) { update[i] = header; } level = lvl; }
  • 13. … x = new SkipNode<E>(lvl, value); for (int i = 0; i <= lvl; i++) { x.forward[i] = update[i].forward[i]; update[i].forward[i] = x; } } } private int getLevel(){ // Coin Flipping int lvl = (int)(Math.log(1.-Math.random())/Math.log(1.-P)); return Math.min(lvl, MAX_LEVEL); }
  • 14. Fast structure on average, very fast in practice Can be implemented in a thread safe way without locking the entire strcture, and instead acting on pointers. In a lock free fashion, by using CAS instructions. In Java since JDK 1.6 within the collections java.util.concurrent.ConcurrentSkipListMap and java.util.concurrent.ConcurrentSkipListSet. Both are non blocking, thread safe data structures with locality of reference properties. Ideal for cache implementation.
  • 15. CAS instruction CAS(address, expected_value, new_value) Atomically replaces the value at the address with the new_value if it is equal to the expected_value. In one instruction CMPXCHG Returns true if successful, false otherwise.
  • 16. Drawbacks They can be memory hungry, by being space inefficient.
  • 17. TRIES
  • 18. - Ordered Tree data structure, used to store associative arrays. Usually, encoded keys through traversal, in the nodes of the tree with value in the leaves. -Used for dictionnary (Map), word completion, web requests parsing., etc. -Time complexity in O(k) where k is the length of searched string. Where it usually is O(length of the tree)
  • 19.
  • 20. HASH ARRAY MAPPED TRIE (HAMT) Functional data structure, for fast association map Based on the idea of hashing a key, and storing the value on a trie. Each node is an array of size 32, pointing to children nodes of same size with maximum Seven deferencement, for fast key look ups.
  • 21.
  • 22. How does it work? During association of K → V , compute hashes yielding an Integer or a long, coded on the JVM in 32 bits Partition those bits to blocks of 5 bits, representing a level in the tree structure, from the right most (root) to the left most (leaves) The colored nodes have between 2 and 31 children Each colored node maintain a bitmap, encoding how many children the nodes contains, and their indexes in the array.
  • 23. Trie Structure (from Rich Hikey)
  • 24. Hash  Array  Mapped  Tries  (HAMT)  
  • 25. Hash  Array  Mapped  Tries  (HAMT)   0 = 0000002
  • 26. Hash  Array  Mapped  Tries  (HAMT)   0  
  • 27. Hash  Array  Mapped  Tries  (HAMT)   16 = 0100002 0  
  • 28. Hash  Array  Mapped  Tries  (HAMT)   0   16  
  • 29. Hash  Array  Mapped  Tries  (HAMT)   4 = 0001002 0   16  
  • 30. Hash  Array  Mapped  Tries  (HAMT)   4 = 0001002 16   0  
  • 31. Hash  Array  Mapped  Tries  (HAMT)   16   0   4  
  • 32. Hash  Array  Mapped  Tries  (HAMT)   12 = 0011002 16   0   4  
  • 33. Hash  Array  Mapped  Tries  (HAMT)   12 = 0011002 16   0   4  
  • 34. Hash  Array  Mapped  Tries  (HAMT)   16   0   4   12  
  • 35. Hash  Array  Mapped  Tries  (HAMT)   16  33   0   4   12  
  • 36. Hash  Array  Mapped  Tries  (HAMT)   16  33   48   0   4   12  
  • 37. Hash  Array  Mapped  Tries  (HAMT)   16   48   0   4   12   33  37  
  • 38. Hash  Array  Mapped  Tries  (HAMT)   16   48   4   12   33  37   0   3  
  • 39. Hash  Array  Mapped  Tries  (HAMT)   4   12   16  20  25   33  37   48   57   0   1   3   8   9  
  • 40. Hash  Array  Mapped  Tries  (HAMT)   4   12   16  20  25   33  37   48   57   0   1   3   8   9   Too  much  space!  
  • 41. Hash  Array  Mapped  Tries  (HAMT)   4   12   16  20  25   33  37   48  57   0   1   3   8   9  
  • 42. Hash  Array  Mapped  Tries  (HAMT)   4   12   16  20  25   33  37   48  57   0   1   3   8   9   Linear  search  at  every  level  -­‐  slow!  
  • 43. Hash  Array  Mapped  Tries  (HAMT)   4   12   16  20  25   33  37   48  57   0   1   3   8   9   SoluHon  –  bitmap  index!   Relying  on  BITPOP  instrucHon.  
  • 44. Hash  Array  Mapped  Tries  (HAMT)   48   57   1   0   1   0   48   57   1   0   1   0   48  57   10   48  57   BITPOP(((1 ((hc lev) 1F)) – 1) BMP)
  • 45. Complexity almost all in log32 N ~ O(7) ~ O(1) Requires only 7 max, array deferencements. O(1) O(log n) O(n) Append concat First insert Last prepend K-th Update
  • 46. Hash Array Mapped Tries (HAMT) •  advantages: •  low space consumption and shrinking •  no contiguous memory region required •  fast – logarithmic complexity, but with a low constant factor •  used as efficient immutable maps Clojure's PersistentHashMap and PersistentVector, Scala's Mutable Map. •  no global resize phase – real time applications, potentially more scalable concurrent operations?
  • 47. Concurrent Trie (Ctrie) •  goals: •  thread-safe concurrent trie •  maintain the advantages of HAMT •  rely solely on CAS instructions •  ensure lock-freedom and linearizability •  lookup – probably same as for HAMT
  • 48. Lock  Free  Concurrent  Trie  (C  Trie)   T2    CAS   4   9   12   20  25   20  25  28    CAS   0   1   3   16  17   16  17  18   T1  
  • 49. SKETCHES (Or how to remove the fat in your datasets)
  • 51. Probabilistic data structures, designed to tell rapidly in a memory efficient way, whether anelement is absent or not from a set. Made of Array of bits B of length n, and a hash function h Insert (X) : for all i in set, B[h(x,i)] = 1 Query (X) : return FALSEif for all i, B[h(x,i)]=0 Only returns if all k bits are set.
  • 52. We can pick the error rate and optimize the size of the filter to match requirements.
  • 53. TRADE OFFS More hashes enduce more accurate results, but render the sketch less space efficient. Choice of the hash function is important. Cryptographic hashes are great; because evenly distributed, with less collisions. Watch out to time spent computing your hashes.
  • 54. Cool properties Intersection and Union through AND and OR bitwise operations No False negatives For union, helps with parallel construction of BF Fast approximation of set union, by using bit map instead of set manipulation
  • 55. Handling Deletion Bloom filters can handle insertions, but not deletions. xi xj B 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0 If deleting xi means resetting 1s to 0s, then deleting xi will “delete” xj. Solution : Counting Bloom Filter
  • 56. COUNTING BLOOM FILTERS Keeps track of inserts - Query(X) : return TRUE if for all i b[h(x,i)]0 - Insert(X) : if (query(x) == false ){ //don't insert twice For all i b[h(x,i)]++ } - Remove(X) : if (query(x) == true ) {//don't remove absents For all i, b[h(x,i)]-- }
  • 57. Usages - Fast web events tracking - IO optimisations in databases like Cassandra Hadoop, Hbases .. - Network IP filtering ...
  • 58. Guava Bloom Filters Default gives a 3 % error. With a MurMurHashV3 function. Creating the BloomFilter BloomFilter bloomFilter = BloomFilter.create(Funnels.byteArrayFunnel(), 1000); //Putting elements into the filter //A BigInteger representing a key of some sort bloomFilter.put(bigInteger.toByteArray()); //Testing for element in set boolean mayBeContained = bloomFilter.mayContain(bitIntegerII.toByteArray());
  • 59. //With a custom filter class BigIntegerFunnel implements FunnelBigInteger { @Override public void funnel(BigInteger from, Sink into) { into.putBytes(from.toByteArray()); } } //Creating the BloomFilter BloomFilter bloomFilter = BloomFilter.create(new BigIntegerFunnel(), 1000); //Putting elements into the filter bloomFilter.put(bigInteger); //Testing for element in set boolean mayBeContained = bloomFilter.mayContain(bitIntegerII);
  • 61. Family of memory efficient data structures, that can estimate frequency-related properties of the data set. Used to find occurrences of an element in a set, in time / space efficient way, with a tunable accuracy. E.g Find heavy hitters elements; perform range queries (where the goal is to find the sum of frequencies of Elements in a certain range ), estimate quantiles.
  • 62. How does it work? Two-dimensional array (d x w) of integer counters. When a value arrives, it is mapped to one position at each of d rows using d different and preferably independent hash functions. d : number of hash functions w : hash array size
  • 63. Algorithm : insert(x): for i =1 to d M[i,h(x,i)) ++ //Like counting bloom filters query(x): return min {h(x,i) for all 1 ≤ i ≤ d}
  • 64. Error Estimation Accuracy depends on the ratio between sketch size and number of expected data. Works better with Highly uncorrelated, unstructured data. For higly skewed data, use noise estimation, and compute the median estimation
  • 65. HEAVY HITTERS (through count min sketches) Find all the elements in a data sets with frequencies over a Fixed threshold. K percent of the total number in the set. Use a count min sketched algorithm. Use case : detect most trafic consuming IP addresses, thwart DdoS attacks by blacklisting those Ips. Detect market prices with highest bids swings
  • 66. Algorithm. 1. Maintain a Count-min sketch of all the element in the set 2. Maintain a heap of top elements, initially empty, and a Counter N, of already processed elements. 3. For each element in the set Add it in the set Estimate the Frequency of the element. If higher than the threshold k*N, add it to heap. Continuously clean the The heap of all the elements below the new threshold.
  • 69. Bibliographies Blogs. http://highscalability.com/blog/2012/4/5/big-data-counting-how-to- count-a-billion-distinct-objects-us.html Librairies. http://github.com/slearspring/stream-lib org.apache.cassandra.utils.{MurmurHashV3, BloomFilter} Google Guava. Ideal Hash Trees by Phil BagWell http://infoscience.epfl.ch/record/64398/files/idealhashtrees.pdf Concurrent Tries in the Scala Parallel Collections SkipLists By William Pugh.