17. Technical architecture
[Architecture diagram]
• DEX management server: HTTPS front end (nginx), play2, search index (Elastic Search, optionally in a cluster), DB
• DEX data processing engine: job queue, data processor (Scala / JRE 8), embedded Spark driver
• (Customer-provided) Hadoop cluster: YARN Resource Manager, HDFS NameNode, cluster nodes (…), dedicated folders in HDFS, WebHDFS
• Other elements: user, auth provider (LDAP), network filesystem mount point, perf monitor tools
• Legend: one node hosting the DEX components; customer-provided elements
18. Score enrichment process
[Process diagram]
• Dataset to enrich
• Analysis: Spark DataFrames, stats on columns, text analysis
• Matching ("fuzzy join") against the storage cluster: 10,000+ datasets
• Classification model; classification model with joined data
• Any column can be a join candidate a priori
20. K-Min Value (KMV) Synopsis
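• Hashing = dropping DVs (distinct values) uniformly on [0,1]
• KMV synopsis: $L = \{U_{(1)}, U_{(2)}, \ldots, U_{(k)}\}$, the k smallest hashed values
• Estimator: $\mathrm{card} \approx k / U_{(k)}$
• Unbiased: $\mathrm{card} \approx (k-1) / U_{(k)}$
– Cf paper…
• Space complexity: constant! (k values, whatever the size of the partition)
[Figure: a partition with D distinct values (a, e, b, …, with repeats such as a, a) is hashed onto [0,1]; the hashed values are spaced about 1/D apart on average; the k smallest ones, U(1), U(2), …, U(k), are kept (k-min).]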
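The construction above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the product: the function names (`hash01`, `kmv_synopsis`, `estimate_distinct`) are hypothetical, and SHA-1 is only one possible choice for a hash that spreads values roughly uniformly over [0,1].

```python
import hashlib
import heapq


def hash01(value):
    """Map a value to a pseudo-uniform point in [0, 1] (SHA-1 is an assumed choice)."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_synopsis(values, k):
    """Stream over values and keep only the k smallest distinct hashes."""
    heap = []     # max-heap (negated hashes) holding at most k entries
    kept = set()  # same hashes, for duplicate detection
    for v in values:
        h = hash01(v)
        if h in kept:
            continue  # repeated value: already in the synopsis
        if len(heap) < k:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            # h is smaller than the current k-th smallest: swap it in
            evicted = -heapq.heapreplace(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    return sorted(kept)


def estimate_distinct(synopsis, k):
    """Unbiased estimator (k - 1) / U(k); exact count when fewer than k hashes exist."""
    if len(synopsis) < k:
        return float(len(synopsis))
    return (k - 1) / synopsis[k - 1]
```

Memory use stays bounded by k regardless of how many values are streamed, which is the constant space complexity mentioned on the slide.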
22. (Multiset) Union of Partitions
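[Figure: the synopses L_A and L_B of partitions A and B, each holding k-min hashed values on [0,1], are merged into a single synopsis L that keeps the k smallest values, up to U(k).]
Combine KMV synopses: $L = L_A \oplus L_B$ (the k smallest values of $L_A \cup L_B$)
Theorem: L is a KMV synopsis of $A \cup B$
Can use previous unbiased estimator: $\mathrm{card} \approx (k-1) / U_{(k)}$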
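A minimal Python sketch of the combination step, under the same assumptions as any toy KMV code: the names `hash01`, `kmv` and `combine` are hypothetical, SHA-1 stands in for the hash function, and the synopsis builder materializes all hashes for brevity rather than streaming.

```python
import hashlib
import heapq


def hash01(value):
    """Map a value to a pseudo-uniform point in [0, 1] (SHA-1 is an assumed choice)."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv(values, k):
    """k smallest distinct hashes, ascending (non-streaming, for brevity)."""
    return heapq.nsmallest(k, {hash01(v) for v in values})


def combine(l_a, l_b):
    """Keep the k smallest distinct values of the two synopses, k = min of their sizes."""
    k = min(len(l_a), len(l_b))
    return heapq.nsmallest(k, set(l_a) | set(l_b))


# The merged synopsis equals the one built directly from the union of the data,
# so the union never has to be scanned again.
k = 256
part_a = range(0, 5000)
part_b = range(3000, 9000)
merged = combine(kmv(part_a, k), kmv(part_b, k))
direct = kmv(list(part_a) + list(part_b), k)
same_synopsis = merged == direct
union_estimate = (k - 1) / merged[-1]  # estimates the 9000 distinct values
```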
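23. (Multiset) Intersection of Partitions
$L = L_A \oplus L_B$ as with union (contains k elements)
Note: L corresponds to a uniform random sample of DVs in $A \cup B$
$K$ = # values in L that are also in $D(A \cap B)$
Theorem: Can compute $K$ from $L_A$ and $L_B$ alone (a value of L belongs to $D(A \cap B)$ exactly when it appears in both $L_A$ and $L_B$)
$K/k$ estimates the Jaccard similarity: $K/k$ estimates $D_\cap / D_\cup = |D(A \cap B)| / |D(A \cup B)|$
$\hat{D}_\cup = (k-1)/U_{(k)}$ estimates $D_\cup = |D(A \cup B)|$
Unbiased estimator of #DVs in the intersection: $\hat{D}_\cap = \frac{K}{k} \cdot \frac{k-1}{U_{(k)}}$
See paper for variance of estimator
Can extend to general compound partitions from ordinary set operations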
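The intersection estimator can be sketched as follows. This is an illustrative Python version under stated assumptions: the names `hash01`, `kmv` and `estimate_intersection` are hypothetical, SHA-1 is an arbitrary hash choice, and both synopses are assumed to hold at least one value.

```python
import hashlib
import heapq


def hash01(value):
    """Map a value to a pseudo-uniform point in [0, 1] (SHA-1 is an assumed choice)."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv(values, k):
    """k smallest distinct hashes, ascending (non-streaming, for brevity)."""
    return heapq.nsmallest(k, {hash01(v) for v in values})


def estimate_intersection(l_a, l_b):
    """Return (K/k, estimated number of distinct values shared by A and B)."""
    k = min(len(l_a), len(l_b))
    merged = heapq.nsmallest(k, set(l_a) | set(l_b))  # combined synopsis L
    in_both = set(l_a) & set(l_b)
    big_k = sum(1 for u in merged if u in in_both)    # K: values of L present in both
    jaccard = big_k / k
    union_estimate = (k - 1) / merged[-1]             # (k - 1) / U(k)
    return jaccard, jaccard * union_estimate


# Example: 2000 shared values out of 10000 distinct values overall.
syn_a = kmv(range(0, 6000), 1024)
syn_b = kmv(range(4000, 10000), 1024)
jaccard, shared = estimate_intersection(syn_a, syn_b)
```

Only the two synopses are read here; the underlying partitions are never joined.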
24. Lessons learned from using KMV for matching
SIGMOD 07
An approximate intersection metric, but:
• sufficient to eliminate irrelevant datasets (e.g. 3000 -> 75)
• 100-1000x faster to estimate a join with a KMV than to execute it
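The pre-filtering idea can be illustrated with a small Python sketch. Everything here is hypothetical: the function names, the `min_overlap` threshold and the candidate names are made up for illustration, and the slides do not say how the actual cut-off is chosen.

```python
import hashlib
import heapq


def hash01(value):
    """Map a value to a pseudo-uniform point in [0, 1] (SHA-1 is an assumed choice)."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv(values, k):
    """k smallest distinct hashes, ascending (non-streaming, for brevity)."""
    return heapq.nsmallest(k, {hash01(v) for v in values})


def estimated_overlap(l_a, l_b):
    """Estimated number of distinct values shared by two columns."""
    k = min(len(l_a), len(l_b))
    merged = heapq.nsmallest(k, set(l_a) | set(l_b))
    in_both = set(l_a) & set(l_b)
    big_k = sum(1 for u in merged if u in in_both)
    return (big_k / k) * ((k - 1) / merged[-1])


def prefilter(query_synopsis, candidate_synopses, min_overlap):
    """Keep candidates whose estimated overlap reaches min_overlap, best first."""
    scored = [
        (name, estimated_overlap(query_synopsis, syn))
        for name, syn in candidate_synopses.items()
    ]
    kept = [(name, est) for name, est in scored if est >= min_overlap]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Synopses are computed once per column; no join is executed to rank candidates.
query = kmv(range(0, 5000), 512)
candidates = {
    "overlapping_column": kmv(range(2500, 7500), 512),
    "unrelated_column": kmv(range(100000, 105000), 512),
}
shortlist = prefilter(query, candidates, min_overlap=500)
```

Only the shortlisted candidates would then go through the actual, more expensive join.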