1. Qiming Chen, Meichun Hsu, Rui Liu* HP Labs, Palo Alto, California, USA *HP Labs, Beijing, China Extend UDF Technology for Integrated Analytics
2. Motivations: Running data-intensive analytics outside the database causes significant overhead. Huge round-trip data transfer overhead between the database platform and the computation platform. The analytics layer is burdened with many generic data management issues. The opportunity to balance resource utilization between data management and analytic processing is lost. UDFs have been extensively investigated for pushing computation down into the database.
3. Challenges & Problems (1): UDFs lack formal support for relational input and output. Consequences: inability to model complex applications; inefficient execution, since the tuple-wise pipeline prohibits in-function batch and parallel processing.
4. Challenges & Problems (2): There is a conflict between UDF execution efficiency and ease of coding. UDFs are hard to code: analytics users have to deal with hard-to-follow system details, whereas MapReduce isolates system details from the developer. Encoding arguments into strings simplifies argument passing but incurs a performance penalty.
6. Solution (2) Simple Relation Object Mapping (SROM) Separate RVF into RVF shell and ‘user-function’ Automated RVF shell generation
10. A collection of sample images of ‘typical’ corner kick scenes in soccer games
11. Calculate Image Similarity: For each image, extract SIFT features (each point as a 128-dimensional vector) and generate a composite feature vector. The closeness of two images is determined by the similarity of their composite feature vectors. 8 8/31/2009
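The slide does not say how the composite feature vector is built or which similarity measure `sim` uses. The following is a minimal sketch under two loudly stated assumptions: the composite vector is the element-wise mean of an image's descriptors, and similarity is cosine similarity. The names `composite_vector` and the toy 4-dimensional descriptors are hypothetical stand-ins for real 128-dimensional SIFT output.

```python
import math


def composite_vector(descriptors):
    """Average equal-length descriptors into one vector (assumed pooling rule)."""
    if not descriptors:
        raise ValueError("need at least one descriptor")
    dim = len(descriptors[0])
    return [sum(d[i] for d in descriptors) / len(descriptors) for i in range(dim)]


def sim(u, v):
    """Cosine similarity (assumed measure); 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy 4-dimensional descriptors stand in for 128-dimensional SIFT vectors.
image_a = composite_vector([[1.0, 0.0, 0.0, 0.0], [1.0, 2.0, 0.0, 0.0]])
image_b = composite_vector([[2.0, 2.0, 0.0, 0.0]])
image_c = composite_vector([[0.0, 0.0, 3.0, 0.0]])

print(sim(image_a, image_b))  # same direction, so close to 1
print(sim(image_a, image_c))  # orthogonal, so 0
```

Any other pooling rule or distance would fit the same two-function shape; only the bodies would change.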
12. Rank Sample Images: SELECT Sid, COUNT(Neighbor) AS n FROM (SELECT P.ID AS Neighbor, (SELECT S.ID FROM CKSamples S WHERE sim(P.feature, S.feature) = (SELECT MAX(sim(P2.feature, S2.feature)) FROM CKSamples S2, CKImages P2 WHERE P2.ID = P.ID)) AS Sid FROM CKImages P) GROUP BY Sid ORDER BY n; Derive the closest sample image of each corner kick image (by maximal similarity). For each sample image s, calculate the number of images having s as the closest sample. Rank the sample images by that number.
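The three steps of the ranking query can be sketched outside SQL as follows. This is an illustration only: the `sim` used here (negative squared Euclidean distance) is an assumption, the toy data is invented for the example, and `rank_samples` is a hypothetical name. As in the query, samples that are nobody's closest sample do not appear in the result, and the order is ascending by count.

```python
from collections import Counter


def sim(u, v):
    """Assumed similarity: negative squared Euclidean distance (larger is closer)."""
    return -sum((a - b) ** 2 for a, b in zip(u, v))


def rank_samples(images, samples):
    """images, samples: dict mapping id -> feature vector."""
    # Step 1: closest sample for each image, by maximal similarity.
    closest = {}
    for pid, pfeat in images.items():
        closest[pid] = max(samples, key=lambda sid: sim(pfeat, samples[sid]))
    # Step 2: number of images that have each sample as their closest one.
    counts = Counter(closest.values())
    # Step 3: rank by that number (ascending, like ORDER BY n), ties by id.
    return sorted(counts.items(), key=lambda item: (item[1], item[0]))


images = {"p1": [0.0, 0.0], "p2": [0.1, 0.0], "p3": [1.0, 1.0]}
samples = {"s1": [0.0, 0.1], "s2": [0.9, 1.0]}
ranking = rank_samples(images, samples)
print(ranking)
```

When two samples tie for an image, `max` keeps the first one it sees, whereas the scalar subquery in the SQL version would return more than one row.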
13. Inefficiency of execution: SELECT Sid, COUNT(Neighbor) AS n FROM (SELECT P.ID AS Neighbor, (SELECT S.ID FROM CKSamples S WHERE sim(P.feature, S.feature) = (SELECT MAX(sim(P2.feature, S2.feature)) FROM CKSamples S2, CKImages P2 WHERE P2.ID = P.ID)) AS Sid FROM CKImages P) GROUP BY Sid ORDER BY n; The CKSamples relation is not cached: it is retrieved in a nested query for each (tuple) instance p of CKImages.
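To make the cost concrete, the sketch below mimics the nested evaluation with a toy table that counts how often it is scanned. It is not a model of any real query planner: `CountingTable`, `closest_sample_nested`, the data, and the `sim` function (negative squared distance) are all assumptions made for illustration.

```python
class CountingTable:
    """Toy stand-in for the CKSamples relation that counts full scans."""

    def __init__(self, rows):
        self._rows = list(rows)
        self.scans = 0

    def scan(self):
        self.scans += 1
        return list(self._rows)


def sim(u, v):
    """Assumed similarity: negative squared Euclidean distance."""
    return -sum((a - b) ** 2 for a, b in zip(u, v))


def closest_sample_nested(image_feature, samples):
    """Mimics the nested query for one CKImages tuple."""
    # Inner sub-query: maximal similarity over all samples (first scan).
    best = max(sim(image_feature, feat) for _, feat in samples.scan())
    # Outer sub-query: find the sample that reaches that maximum (second scan).
    for sid, feat in samples.scan():
        if sim(image_feature, feat) == best:
            return sid
    return None


images = [("p1", [0.0, 0.0]), ("p2", [0.1, 0.0]), ("p3", [1.0, 1.0])]
samples = CountingTable([("s1", [0.0, 0.1]), ("s2", [0.9, 1.0])])

closest = {pid: closest_sample_nested(feat, samples) for pid, feat in images}
print(closest, "scans of CKSamples:", samples.scans)
```

In this toy model the number of scans grows linearly with the number of images, which is the behavior the slide attributes to the uncached nested query.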
14. Relation Value Function: An RVF is specified as DEFINE RVF f (x, y, R1, R2) RETURN R3 { float a, b; Relation R1 (/*schema1*/); Relation R2 (/*schema2*/); Relation R3 (/*schema3*/); PROCEDURE fn(/*dll name*/); RETURN MODE SET_MODE; INVOCATION PATTERN BLOCK } RVFs can be naturally composed with other relational operators or sub-queries: SELECT * FROM rvf1(Q4, rvf2(Q1, Q2, Q3));
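Because an RVF takes relations in and returns a relation, its output can feed another RVF directly. The sketch below illustrates only that composition property, with relations modeled as lists of dicts. The bodies of `rvf1` and `rvf2` are invented for the example (the slide does not define them), as are the toy relations `q1` to `q4`.

```python
def rvf2(r1, r2, r3):
    """Hypothetical RVF: concatenates three relations with the same schema."""
    return r1 + r2 + r3


def rvf1(r4, r):
    """Hypothetical RVF: keeps rows of r whose id appears in r4."""
    wanted = {row["id"] for row in r4}
    return [row for row in r if row["id"] in wanted]


# Toy query results standing in for Q1..Q4.
q1 = [{"id": 1}]
q2 = [{"id": 2}]
q3 = [{"id": 3}]
q4 = [{"id": 1}, {"id": 3}]

# Same nesting as: SELECT * FROM rvf1(Q4, rvf2(Q1, Q2, Q3));
result = rvf1(q4, rvf2(q1, q2, q3))
print(result)
```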
16. PerTuple Input Mode: SELECT ID, Summary FROM per_image_summery_rvf ('SELECT feature FROM CKSamples');
17. Block Input Mode: SELECT r.sid, COUNT(r.neighbor) AS n FROM ck_rvf1 ('SELECT * FROM CKImages', 'SELECT * FROM CKSamples') r GROUP BY r.sid ORDER BY n;
18. PerTuple/Block Input Mode: SELECT Sid, COUNT(Neighbor) AS n FROM ( SELECT P.ID AS Neighbor, ck_rvf2 (P.ID, P.feature, 'SELECT * FROM CKSamples') AS Sid FROM CKImages P) GROUP BY Sid ORDER BY n;
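One way to picture the mixed per-tuple/block pattern: the scalar arguments arrive once per CKImages tuple, while the relation argument is loaded on the first call and reused afterwards. The class below is a sketch of that idea only. `CkRvf2`, its `loads` counter, the `load_samples` callable (standing in for the query string), and the `sim` function are all assumptions, not the prototype's actual interface.

```python
from collections import Counter


def sim(u, v):
    """Assumed similarity: negative squared Euclidean distance."""
    return -sum((a - b) ** 2 for a, b in zip(u, v))


class CkRvf2:
    """Hypothetical per-tuple/block RVF with a relation argument loaded once."""

    def __init__(self, load_samples):
        self._load_samples = load_samples  # stands in for the sample query string
        self._samples = None
        self.loads = 0

    def __call__(self, image_id, feature):
        # image_id is unused here; it is kept to mirror the slide's signature.
        if self._samples is None:
            # Block input: materialize the relation on the first call only.
            self._samples = list(self._load_samples())
            self.loads += 1
        sid, _ = max(self._samples, key=lambda s: sim(feature, s[1]))
        return sid


ck_samples = [("s1", [0.0, 0.1]), ("s2", [0.9, 1.0])]
ck_images = [("p1", [0.0, 0.0]), ("p2", [0.1, 0.0]), ("p3", [1.0, 1.0])]

ck_rvf2 = CkRvf2(lambda: ck_samples)
# Per-tuple input: one call per CKImages tuple, then GROUP BY Sid ORDER BY n.
counts = Counter(ck_rvf2(pid, feat) for pid, feat in ck_images)
ranking = sorted(counts.items(), key=lambda item: (item[1], item[0]))
print(ranking, "loads of CKSamples:", ck_rvf2.loads)
```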
19. Separating RVF Shell and User-Function: Separate an RVF into an RVF shell and a user-function. Provide high-level RVF shell APIs for building the shell, shielding RVF developers from DBMS internal details. Generate RVF shells based on RVF specifications and input and output modes.
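The separation can be pictured as a generated wrapper that converts rows to plain objects, calls the user-function, and converts the returned objects back to rows. The sketch below is a loose analogy in Python rather than the prototype's mechanism: `make_shell`, `ImageRow`, `SummaryRow`, and `summarize` are hypothetical names, and the row-to-object mapping is only meant to suggest the role SROM plays.

```python
from dataclasses import dataclass


@dataclass
class ImageRow:
    id: str
    feature: list


@dataclass
class SummaryRow:
    id: str
    dims: int


def make_shell(user_fn, input_cls, output_fields):
    """Hypothetical shell generator: rows -> objects -> user_fn -> rows."""

    def shell(rows):
        objects = [input_cls(*row) for row in rows]
        results = user_fn(objects)
        return [tuple(getattr(obj, name) for name in output_fields) for obj in results]

    return shell


def summarize(images):
    """User-function: plain object logic, with no knowledge of row layouts."""
    return [SummaryRow(img.id, len(img.feature)) for img in images]


shell = make_shell(summarize, ImageRow, ("id", "dims"))
rows_out = shell([("p1", [0.0, 1.0]), ("p2", [1.0])])
print(rows_out)
```

The user-function sees only `ImageRow` objects, so the same `summarize` could be tested without any database at all.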
24. Summary: Tackled two major limitations of UDF technology: lack of set input or output, which causes insufficient application modeling capability and inefficient execution; and difficulty in coding UDFs and integrating them with the query engine. Relation Value Function: extends UDF for pushing down data-intensive computation; RVF invocation patterns. Separate RVF into RVF shell and user function: RVF shell generation and Simple Relation Object Mapping (SROM). A prototype has been implemented on PostgreSQL.