Extend UDF Technology for Integrated Analytics
Qiming Chen, Meichun Hsu, Rui Liu*
HP Labs, Palo Alto, California, USA
*HP Labs, Beijing, China
Motivations
  • Running data-intensive analytics outside the database causes significant overhead
    • Huge round-trip data transfer overhead between the database platform and the computation platform
    • The analytics layer is burdened with many generic data management issues
    • The opportunity to balance resource utilization between data management and analytic processing is lost
  • UDFs have been extensively investigated for pushing down computation
Challenges & Problems (1)
  • UDFs lack formal support for relational input and output
    • Unable to model complex applications
    • Inefficient execution
    • Tuple-wise pipelining prohibits in-function batch and parallel processing
Challenges & Problems (2)
  • There is a conflict between UDF execution efficiency and ease of coding
  • UDFs are hard to code
    • Analytics users have to deal with hard-to-follow system details, whereas MapReduce isolates system details from the developer
    • Encoding arguments into strings simplifies argument passing but incurs a performance penalty
Solution (1)
  • Relation Valued Function (RVF)
  • Invocation pattern
    • Mechanisms for dealing with inputs and return values, e.g. tuple by tuple, or as a whole set
  • Classify Relation Valued Functions based on invocation patterns
    • Deterministic steps of system interaction
    • Single out application logic from system utilities
Solution (2)
  • Simple Relation Object Mapping (SROM)
  • Separate RVF into RVF shell and 'user-function'
  • Automated RVF shell generation
Example: Corner kick scene rank
  • Tables:
    • CKImages [ID, feature]: a large set of images of corner kicks
    • CKSamples [ID, feature]: a collection of sample images of 'typical' corner kick scenes (corner kicks in soccer games)
Calculate Image Similarity
  • For each image
    • Extract SIFT features
    • Each point as a 128-dimensional vector
    • Generate a composite feature vector
  • The closeness of two images is determined by the similarity of their composite feature vectors
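The slides do not say how the similarity of two composite feature vectors is computed. As a minimal sketch, assuming the vectors have equal length and that a score of 1 / (1 + squared Euclidean distance) is acceptable, a `sim` function might look like this. The signature and the measure are assumptions for illustration, not the authors' definition.

```c
#include <stddef.h>

/* Hypothetical similarity between two composite feature vectors of length n.
 * Assumption: 1 / (1 + squared Euclidean distance), so identical vectors
 * score 1.0 and the score falls toward 0 as the vectors move apart. */
double sim(const float *a, const float *b, size_t n)
{
    double d2 = 0.0;
    for (size_t i = 0; i < n; i++) {
        double diff = (double)a[i] - (double)b[i];
        d2 += diff * diff;
    }
    return 1.0 / (1.0 + d2);
}
```

Any measure that grows as two images become more alike would serve the same role in the ranking query.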
Rank Sample Images
    SELECT Sid, COUNT(Neighbor) AS n
    FROM (SELECT P.ID AS Neighbor,
                 (SELECT S.ID
                  FROM CKSamples S
                  WHERE sim(P.feature, S.feature) =
                        (SELECT MAX(sim(P2.feature, S2.feature))
                         FROM CKSamples S2, CKImages P2
                         WHERE P2.ID = P.ID)) AS Sid
          FROM CKImages P)
    GROUP BY Sid
    ORDER BY n;
  • Derive the closest sample image of each corner kick image (by maximal similarity)
  • For each sample image s, calculate the number of images having s as the closest sample
  • Rank the sample images by that number
Inefficiency of execution
    SELECT Sid, COUNT(Neighbor) AS n
    FROM (SELECT P.ID AS Neighbor,
                 (SELECT S.ID
                  FROM CKSamples S
                  WHERE sim(P.feature, S.feature) =
                        (SELECT MAX(sim(P2.feature, S2.feature))
                         FROM CKSamples S2, CKImages P2
                         WHERE P2.ID = P.ID)) AS Sid
          FROM CKImages P)
    GROUP BY Sid
    ORDER BY n;
  • The CKSamples relation is not cached
  • The CKSamples relation is retrieved in a nested query for each (tuple) instance p of CKImages
Relation Valued Function
  • An RVF is specified as
        DEFINE RVF f (x, y, R1, R2) RETURN R3 {
            float a, b;
            Relation R1 (/*schema1*/);
            Relation R2 (/*schema2*/);
            Relation R3 (/*schema3*/);
            PROCEDURE fn(/*dll name*/);
            RETURN MODE SET_MODE;
            INVOCATION PATTERN BLOCK
        }
  • RVFs can be naturally composed with other relational operators or sub-queries
        SELECT * FROM rvf1(Q4, rvf2(Q1, Q2, Q3));
Invocation Patterns
  • PerTuple Input Mode
  • Block Input Mode
  • PerTuple/Block Input Mode
  • Tuple Return Mode
  • Set Return Mode
PerTuple Input Mode
    SELECT ID, Summary
    FROM per_image_summery_rvf('SELECT feature FROM CKSamples');
Block Input Mode
    SELECT r.sid, COUNT(r.neighbor) AS n
    FROM ck_rvf1('SELECT * FROM CKImages',
                 'SELECT * FROM CKSamples') r
    GROUP BY r.sid
    ORDER BY n;
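In block input mode the function receives both relations as whole sets, so the samples are scanned from memory rather than fetched again for every image. The following is a minimal sketch of what the application logic behind such a function could look like. The `Row` type, the `DIM` constant, the `rank_samples` name, and the similarity measure are all hypothetical; the slides do not show this code.

```c
#include <stddef.h>

#define DIM 4  /* illustrative vector length, far smaller than real features */

/* Hypothetical in-memory tuple: an ID plus a composite feature vector. */
typedef struct { int id; float feature[DIM]; } Row;

/* Assumed similarity: 1 / (1 + squared Euclidean distance). */
static double sim(const float *a, const float *b)
{
    double d2 = 0.0;
    for (size_t i = 0; i < DIM; i++) {
        double diff = (double)a[i] - (double)b[i];
        d2 += diff * diff;
    }
    return 1.0 / (1.0 + d2);
}

/* Block-mode sketch: both relations arrive as whole arrays. On return,
 * counts[j] holds the number of images whose closest sample is samples[j]. */
void rank_samples(const Row *images, size_t n_images,
                  const Row *samples, size_t n_samples,
                  int *counts)
{
    for (size_t j = 0; j < n_samples; j++)
        counts[j] = 0;
    if (n_samples == 0)
        return;

    for (size_t i = 0; i < n_images; i++) {
        size_t best = 0;
        double best_sim = sim(images[i].feature, samples[0].feature);
        for (size_t j = 1; j < n_samples; j++) {
            double s = sim(images[i].feature, samples[j].feature);
            if (s > best_sim) {
                best_sim = s;
                best = j;
            }
        }
        counts[best]++;
    }
}
```

The grouping and counting that the SQL above does with GROUP BY is folded into the function here only to keep the sketch short; in the query on the slide the RVF returns (sid, neighbor) pairs and the query engine does the aggregation.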
PerTuple/Block Input Mode
    SELECT Sid, COUNT(Neighbor) AS n
    FROM (SELECT P.ID AS Neighbor,
                 ck_rvf2(P.ID, P.feature, 'SELECT * FROM CKSamples') AS Sid
          FROM CKImages P)
    GROUP BY Sid
    ORDER BY n;
Separating RVF Shell and User-Function
  • Separate an RVF into an RVF shell and a user-function
  • Provide high-level RVF shell APIs for building the shell
    • Shield RVF developers from DBMS internal details
  • Generate RVF shells based on RVF specifications and input and output modes
RVF Shell Generation
RVF-Shell and APIs
    SQLUDR_INT32 ck_rvf2(RVF_ARGS) {
        int rv;
        RVFCallContext *h;
        ck_rvf2_args *hARGS;
        CKSamples *samples;
        if (RVF_IS_FIRST_CALL()) { …. }
        if (RVF_IS_NORMAL_CALL()) {
            ….
            /* user-function */
            int sid = find_closest_sample(ID, feature, samples);
            RVF_RETURN_NEXT(sid);
            RVF_NORMAL_CALL_END(h);
        }
        if (RVF_IS_LAST_CALL()) { …. }
        return rv;
    }
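The first-call / normal-call / last-call structure of the shell can be imitated outside a DBMS to show why it helps: the block input is loaded once and reused for every tuple. The sketch below is a stand-alone simulation, not the RVF shell API. `CallPhase`, `ShellState`, `ck_shell`, the one-dimensional feature, and the hard-coded sample data are all hypothetical; only the name `find_closest_sample` comes from the slide, and its body here is an assumption.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins; none of these names belong to the real RVF API. */
typedef enum { CALL_FIRST, CALL_NORMAL, CALL_LAST } CallPhase;

typedef struct { int id; float feature; } Sample;  /* 1-D feature for brevity */

typedef struct {
    Sample *samples;    /* block input, cached across calls */
    size_t  n_samples;
    int     loads;      /* how many times the block input was loaded */
} ShellState;

/* User-function: application logic only. Requires n >= 1. */
static int find_closest_sample(float feature, const Sample *s, size_t n)
{
    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        float d = s[i].feature - feature;
        float b = s[best].feature - feature;
        if (d * d < b * b)
            best = i;
    }
    return s[best].id;
}

/* Shell: dispatches on the call phase. Returns the closest sample ID on a
 * normal call, 0 on a successful first or last call, and -1 on error. */
int ck_shell(ShellState *st, CallPhase phase, float feature)
{
    /* Stand-in for the result of the block-input query. */
    static const Sample source[] = { {10, 0.0f}, {20, 1.0f}, {30, 2.0f} };

    switch (phase) {
    case CALL_FIRST:                      /* load and cache the block input */
        st->samples = malloc(sizeof source);
        if (st->samples == NULL)
            return -1;
        memcpy(st->samples, source, sizeof source);
        st->n_samples = sizeof source / sizeof source[0];
        st->loads++;
        return 0;
    case CALL_NORMAL:                     /* one input tuple per call */
        if (st->samples == NULL || st->n_samples == 0)
            return -1;
        return find_closest_sample(feature, st->samples, st->n_samples);
    case CALL_LAST:                       /* release the cached relation */
        free(st->samples);
        st->samples = NULL;
        st->n_samples = 0;
        return 0;
    }
    return -1;
}
```

A caller would issue one `CALL_FIRST`, then one `CALL_NORMAL` per image tuple, then one `CALL_LAST`; the `loads` counter stays at 1 however many tuples are processed, which is the caching behaviour the nested-query formulation lacks.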

Simple Relation-Object Mapping (SROM)
    typedef struct { byte *mask; float4 *vector; } FloatVectorType;
    typedef struct { int ID; FloatVectorType feature; } CKImage;
    typedef struct { CKImage *CKImageArray; int tuple_num; } CKImages;
    typedef struct { int ID; FloatVectorType feature; } CKSample;
    typedef struct { CKSample *CKSampleArray; int tuple_num; } CKSamples;

    CREATE TYPE FloatVectorType AS (
        mask BIT VARYING(100),
        floatVector float4 []
    );
    CREATE TABLE CKImages (
        ID INTEGER NOT NULL,
        feature FloatVectorType
    );
    CREATE TABLE CKSamples (
        ID INTEGER NOT NULL,
        feature FloatVectorType
    );
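To make the mapping concrete, here is a minimal sketch of how a set of tuples might be wrapped into the `CKSamples` object from the slide. The struct layouts follow the slide; the `byte` and `float4` typedefs are stand-ins, and the helper names `cksamples_from_tuples` and `cksamples_free` are hypothetical. How the generated shell actually fills these structures is not shown in the slides.

```c
#include <stdlib.h>

typedef unsigned char byte;  /* assumption: stand-in for the slide's byte */
typedef float float4;        /* assumption: a 4-byte float */

typedef struct { byte *mask; float4 *vector; } FloatVectorType;
typedef struct { int ID; FloatVectorType feature; } CKSample;
typedef struct { CKSample *CKSampleArray; int tuple_num; } CKSamples;

/* Hypothetical helper: wrap n tuples (ids[i], vectors[i]) in a CKSamples
 * object. Vectors are borrowed, not copied, and masks are left NULL.
 * Returns NULL on a negative n or on allocation failure. */
CKSamples *cksamples_from_tuples(const int *ids, float4 **vectors, int n)
{
    if (n < 0)
        return NULL;

    CKSamples *rel = malloc(sizeof *rel);
    if (rel == NULL)
        return NULL;

    /* Allocate at least one slot so an empty relation still has storage. */
    rel->CKSampleArray = calloc(n > 0 ? (size_t)n : 1,
                                sizeof *rel->CKSampleArray);
    if (rel->CKSampleArray == NULL) {
        free(rel);
        return NULL;
    }

    for (int i = 0; i < n; i++) {
        rel->CKSampleArray[i].ID = ids[i];
        rel->CKSampleArray[i].feature.mask = NULL;
        rel->CKSampleArray[i].feature.vector = vectors[i];
    }
    rel->tuple_num = n;
    return rel;
}

/* Releases the relation object; the borrowed vectors are not freed. */
void cksamples_free(CKSamples *rel)
{
    if (rel != NULL) {
        free(rel->CKSampleArray);
        free(rel);
    }
}
```

With an object like this in hand, the user-function can iterate over `CKSampleArray` as an ordinary C array without touching tuple formats or DBMS memory contexts.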
Summary
  • Tackled two major limitations of UDF technology
    • Lack of set input or output, which causes insufficient application modeling capability and inefficient execution
    • Difficulty in coding and integrating UDFs with the query engine
  • Relation Valued Function
    • Extends UDFs for pushing down data-intensive computation
    • RVF invocation patterns
  • Separate RVF into RVF shell and user-function
    • RVF shell generation and Simple Relation Object Mapping (SROM)
  • Prototype has been implemented on PostgreSQL