High-Performance Holistic XML Twig Filtering Using GPUs
Ildar Absalyamov, Roger Moussalli, Walid Najjar and Vassilis Tsotras
Outline
- Motivation
- XML filtering in the literature
  - Software approaches
  - Hardware approaches
- Proposed GPU-based approach & detailed algorithm
- Optimizations
- Experimental evaluation
- Conclusions
Motivation – XML Pub-subs
- The filtering engine is at the heart of publish-subscribe systems (pub-subs)
  - Used to deliver news, blog updates, stock data, etc.
- XML is a standard format for data exchange
  - Expressive enough to capture message values as well as message structure, queried using XPath
- The growing volume of information requires exploring massively parallel, high-performance approaches to XML filtering

[Figure: pub-sub architecture. Publishers stream XML data into the filtering algorithm; subscribers add, update or delete query profiles; the engine forwards matches to the corresponding subscribers.]
Related work (software)
FSM-based approaches:
- XFilter (VLDB 2000): creates a separate FSM for each query
- YFilter (TODS 2003): combines individual paths, creating a single NFA
- LazyDFA (TODS 2004): uses deterministic FSMs
- XPush (SIGMOD 2003): lazily constructs a deterministic pushdown automaton
Related work (software)
Sequence-based approaches:
- FiST (VLDB 2005): converts the XML document into Prüfer sequences and matches the respective sequences

Others:
- XTrie (VLDB 2002): uses a trie-based index to match query prefixes
- AFilter (VLDB 2006): leverages prefix as well as suffix query indexes
Related work (hardware)
- “Accelerating XML query matching through custom stack generation on FPGAs” (HiPEAC 2010): introduced a dynamic-programming XML path filtering approach for FPGAs
- “Efficient XML path filtering using GPUs” (ADMS 2011): modified the original approach to perform path filtering on GPUs
- “Massively parallel XML twig filtering using dynamic programming on FPGAs” (ICDE 2011): extended the algorithm to support holistic twig filtering on FPGAs
Why GPUs
This work proposes a holistic XML twig filtering algorithm that runs on GPUs.

Why GPUs?
- Highly scalable, massively parallel architecture
- Flexibility comparable to software XML filtering engines

Why not FPGAs?
- Limited scalability due to the scarce hardware resources available on the chip
- Lack of query dynamicity: updating queries requires time-consuming reconfiguration of the FPGA hardware implementation
XML Document Preprocessing
- To run the algorithm in streaming mode, the XML tree structure needs to be flattened
- The XML document is presented as a stream of open(tag) and close(tag) events

Event stream example:
open(a) – open(b) – open(c) – close(c) – … – close(b) – open(d) – … – close(d) – close(a)
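A minimal sketch of this flattening step, using Python's standard XML parser (the event-tuple representation is an illustrative assumption, not the talk's actual encoding):

```python
import io
import xml.etree.ElementTree as ET

def to_event_stream(xml_text):
    """Flatten an XML document into a stream of open/close events."""
    events = []
    for kind, elem in ET.iterparse(io.StringIO(xml_text), events=("start", "end")):
        events.append(("open" if kind == "start" else "close", elem.tag))
    return events

# <a><b><c/></b><d/></a> yields:
# open(a) open(b) open(c) close(c) close(b) open(d) close(d) close(a)
stream = to_event_stream("<a><b><c/></b><d/></a>")
```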
Twig Filtering: approach
Twig processing consists of two steps:
1. Matching individual root-to-leaf paths
2. Reporting matches back to the root, while joining them at split nodes
Dynamic programming: algorithm
Example query: /a[//c]/d/*
- Every query is mapped to a DP table
- The DP table is a binary stack: its width is the query length and its height is the maximum XML depth
- Each node in the query is mapped to a stack column; every column has a prefix pointer
- Open and close events map to push and pop actions on the top-of-the-stack (TOS)

[Figure: DP table with columns $, /a, //c, /d, /* linked by prefix pointers.]
Dynamic programming: stacks
Two different types of stack are used for the two parts of the filtering algorithm. Stacks capture query matches via propagation of ‘1’s.

push stacks:
- Used for matching root-to-leaf paths
- The TOS is updated on open (i.e. push) events
- Propagation goes diagonally upwards and vertically upwards, from the query root to the query leaves
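The push-phase propagation rules can be sketched for a linear path query as follows. This is a simplified model under assumed data layout (one boolean per column per stack level); split-node handling and the pop phase are omitted:

```python
def tag_matches(col_tag, tag):
    return col_tag == "*" or col_tag == tag

def push_phase_leaf_match(events, tags, axes):
    """tags[i]/axes[i] describe query column i (column 0 is the dummy
    root '$'); axes[i] is '/' or '//'. Returns True if some root-to-leaf
    occurrence set a '1' in the leaf column."""
    n = len(tags)
    stack = [[False] * n]
    stack[-1][0] = True                 # '$' is matched at the start
    leaf_matched = False
    for ev, tag in events:
        if ev == "open":
            tos = stack[-1]
            new = [False] * n
            for i in range(n):
                # vertical carry: a '1' is carried up its own column so
                # a '//' child can still match at any deeper level
                carry = tos[i] and i + 1 < n and axes[i + 1] == "//"
                # diagonal propagation: prefix holds '1' one level below
                # and the open-event tag matches the column tag (or '*')
                diag = i > 0 and tos[i - 1] and tag_matches(tags[i], tag)
                new[i] = carry or diag
            if new[n - 1]:
                leaf_matched = True     # record leaf match (binary array)
            stack.append(new)
        else:
            stack.pop()                 # close event pops the TOS
    return leaf_matched

doc = [("open", "a"), ("open", "b"), ("open", "c"),
       ("close", "c"), ("close", "b"), ("close", "a")]
# /a//c matches; /a/c does not (c is not a direct child of a)
r1 = push_phase_leaf_match(doc, ["$", "a", "c"], [None, "/", "//"])
r2 = push_phase_leaf_match(doc, ["$", "a", "c"], [None, "/", "/"])
```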
Dynamic programming: stacks
pop stacks:
- Used for reporting leaf matches back to the root
- The TOS is updated on close (pop) events (open events clear the TOS)
- Propagation goes diagonally downwards and vertically downwards, from the query leaves to the root
Push stack: Example
Open(a): the dummy root node (‘$’) is always matched in the beginning. A ‘1’ is propagated diagonally upwards if:
- the prefix holds ‘1’
- the relationship with the prefix is ‘/’
- the open-event tag matches the column tag
Push stack: Example
Open(d): the ‘1’ is propagated diagonally upwards (in terms of the prefix pointer), as in the previous step.
Push stack: Example
Open(e): if the query node tag is a wildcard (‘*’), then any tag in an open event qualifies to be matched. Since ‘/*’ is a leaf node, the match is saved in a special binary match array.
Push stack: Example
Open(b): a ‘1’ propagates upwards within the prefix column if:
- the prefix holds ‘1’
- the relationship with the prefix is ‘//’
- (the tag in the open event can be arbitrary)
Push stack: Example
Open(c): if a ‘1’ is propagated to a query leaf node (‘//c’ in the example), the leaf is saved as matched.
Push stack: Example
Open(d): node ‘/d’ is not updated, since ‘/a’ is a split node whose children have different relationships (‘//’ with ‘c’ and ‘/’ with ‘d’). The split node maintains separate fields for these two kinds of children.
Pop stack: Example
Close(e): leaf nodes contain ‘1’ if they were saved in the binary match array during the first (push) phase. A ‘1’ is propagated diagonally downwards if:
- the node holds ‘1’ on the TOS
- the relationship with the prefix is ‘/’
- the close-event tag matches the column tag, or the column tag is ‘*’ (as shown here)
Pop stack: Example
Close(d): a split node (‘/a’ in the example) is matched only if all of its children propagate ‘1’ (a logical ‘and’ operation over the children). Since only one subtree of node ‘/a’ has been explored so far, the match cannot yet be propagated to it.
Pop stack: Example
Close(c): a ‘1’ is propagated downwards into a descendant (‘//’) node if:
- the node holds ‘1’ on the TOS
- the relationship with the prefix is ‘//’
- the close-event tag matches the column tag
Pop stack: Example
Close(d): the same rules apply, but nothing is propagated this time.
Pop stack: Example
Close(b): this time both children of the split node are matched, so the result of the ‘and’ operation propagates a ‘1’ down to node ‘/a’. As with the push stack, the split node keeps two separate fields for children with ‘/’ and ‘//’ relationships. Reporting a match from children to parent does not depend on the event tag.
Pop stack: Example
Close(a): the full query is matched if the dummy root node reports a match.
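The split-node join seen in the last few steps can be condensed into a tiny sketch. This is not the talk's stack machinery, only the join logic: a split node propagates a match upwards when the AND over its children holds, with ‘/’ and ‘//’ children tracked in separate fields, as the slides describe:

```python
class SplitNode:
    """Per-split-node match state during the pop phase (sketch)."""
    def __init__(self):
        self.slash_child_matched = False   # field for '/' children
        self.dslash_child_matched = False  # field for '//' children

    def report(self, axis):
        # a child subtree reports its match through the appropriate field
        if axis == "/":
            self.slash_child_matched = True
        else:
            self.dslash_child_matched = True

    def matched(self):
        # logical AND: every kind of child must have reported a match
        return self.slash_child_matched and self.dslash_child_matched

# For query /a[//c]/d: close(d) reports the '/' child, close(c) the '//'
# child; only after both does split node '/a' propagate a match.
a = SplitNode()
a.report("/")           # '/d' subtree matched
first = a.matched()     # only one subtree explored: no match yet
a.report("//")          # '//c' subtree matched
second = a.matched()    # now the split node reports a match
```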
GPU Architecture
- An SM (streaming multiprocessor) is a multicore processor consisting of multiple SPs (streaming processors)
- The SPs execute the same instructions (kernel)
- SPs within an SM communicate through a small, fast shared memory
- A block is a logical set of threads, scheduled onto the SPs within an SM

[Figure: a grid of SMs, each containing an instruction unit, SPs, SFUs and shared memory; blocks are scheduled onto SMs; all SMs share global memory.]
Filtering Parallelism on GPUs
- Intra-query parallelism: each stack column on the TOS is independently evaluated, in parallel, on an SP
- Inter-query parallelism: queries are scheduled in parallel on different SMs
- Inter-document parallelism: filtering several XML documents at a time using concurrent GPU kernels (co-scheduling kernels with different input parameters)
XML Event & Stack Entry Encoding
- The XML document is preprocessed and transferred to the GPU as a stream of byte-long (event, tag ID) pairs: a 1-bit push/pop flag and a 7-bit tag ID
- The event streams reside in the GPU global memory
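The byte-long event encoding can be sketched as simple bit packing. The flag-in-the-high-bit layout is an assumption for illustration; the slide only fixes the field widths (1 bit + 7 bits):

```python
PUSH, POP = 0, 1

def encode_event(kind, tag_id):
    """Pack a push/pop flag (1 bit) and a tag ID (7 bits) into one byte."""
    assert kind in (PUSH, POP) and 0 <= tag_id < 128
    return (kind << 7) | tag_id

def decode_event(byte):
    """Recover the (flag, tag ID) pair from an event byte."""
    return byte >> 7, byte & 0x7F

b = encode_event(POP, 42)   # one byte per event in the stream
```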
GPU Kernel Personality Encoding
- Each GPU kernel maps to one query node
- The offline query parser creates a personality for each kernel, which is later stored in GPU registers
- A personality is a 4-byte entry: isLeaf (1 bit), prefix relation (1 bit), children with ‘/’ (1 bit), children with ‘//’ (1 bit), prefixID (10 bits), tagID (7 bits)
GPU Optimizations
- Merging the push and pop stacks to save shared memory
- Coalescing global memory reads/writes
- Caching XML stream items in shared memory: the stream is read in chunks by looping in a strided manner, since shared memory size is limited
- Avoiding atomic functions by calling non-atomic analogs in a separate thread
Experimentation Setup
GPU experiments:
- NVidia Tesla C2075 (Fermi architecture), 448 cores
- NVidia Tesla K20 (Kepler architecture), 2496 cores

Software filtering experiments:
- YFilter filtering engine
- Dual 6-core 2.30 GHz Intel Xeon E5 machine with 30 GB of RAM
Experimentation Datasets
DBLP XML dataset:
- Documents of varied size (32 kB–2 MB), obtained by trimming the original DBLP XML
- Synthetic documents (following the DBLP schema) of fixed size 25 kB
- Maximum XML depth: 10

Queries generated by the YFilter XPath generator with varied parameters:
- Query size: 5, 10 and 15 nodes
- Number of split points: 1, 3 and 6
- Probability of ‘*’ nodes and ‘//’ relations: 10%, 30%, 50%
- Number of queries: 32–2k
Experiment Results: Throughput
- GPU throughput is constant until the “breaking” point, where all GPU cores become occupied
- The number of occupied cores depends on the number of queries and the query length
Experiment Results: Speedup
- GPU speedup depends on the XML document size: larger documents incur greater global memory read latency
- Speedup of up to 9x
- The ‘*’ and ‘//’ probability affects speedup, since it increases the YFilter NFA size
Batch Experiments
- Batched experiments exercise inter-document parallelism
  - They mimic real-world scenarios (batches of real-time documents)
  - Batches of 500 and 1000 documents were used
- It is not fair to compare against the single-threaded YFilter implementation in batch experiments
- “Pseudo”-multicore YFilter: run multiple copies of the program, distributing the document load
  - Each copy runs on its own core
  - Each copy filters a subset of the document set; the query set is fixed
  - The query load is the same for all copies, since it affects the NFA size; distributing the queries instead would deteriorate query sharing due to lack of commonalities
Batch Experiments: Throughput
- No breaking point: the GPU is always fully occupied by concurrently executing kernels
- Throughput increased up to 16 times compared with the single-document case
Batch Experiments: Speedup
- With the GPU fully utilized, an increase in the query length yields a speedup drop by a factor of 2
- Up to 16x speedup is achieved
- Slowdown after 512 queries
- The multicore YFilter version performs better than the ordinary single-threaded one
Conclusions
- Proposed the first holistic twig filtering algorithm on GPUs, effectively leveraging GPU parallelism
- Allows processing of thousands of queries, while also supporting dynamic query updates (vs. FPGAs)
- Up to 9x speedup over software systems in single-document experiments
- Up to 16x speedup over software systems in batch experiments
Thank you!

 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 

ADMS'13 High-Performance Holistic XML Twig Filtering Using GPUs

  • 1. High-Performance Holistic XML Twig Filtering Using GPUs Ildar Absalyamov, Roger Moussalli, Walid Najjar and Vassilis Tsotras
  • 2. Outline Motivation XML filtering in the literature Software approaches Hardware approaches Proposed GPU-based approach & detailed algorithm Optimizations Experimental evaluation Conclusions 2
  • 3. Motivation – XML Pub-subs Filtering engine is at the heart of publish-subscribe systems (pub-subs) Used to deliver news, blog updates, stock data, etc. XML is a standard format for data exchange Powerful enough to capture message values as well as its structure using XPath The growing volume of information requires exploring massively parallel high-performance approaches to XML filtering Add, Update or Delete Profiles Data Streams Publisher 0 Subscriber 0 Query Profiles Publisher 1 Subscriber 1 Filtering Algorithm Publisher N Matches Forward Matches Subscriber M 3
  • 4. Related work (software) XFilter (VLDB 2000) Creates a separate FSM for each query YFilter (TODS 2003) Combines individual paths, creates a single NFA LazyDFA (TODS 2004) Uses deterministic FSMs FSM-based approaches XPush (SIGMOD 2003) Lazily constructs a deterministic pushdown automaton 4
  • 5. Related work (software) FiST (VLDB 2005) Converts the XML document into Prüfer sequences and matches respective sequences sequence-based approaches XTrie (VLDB 2002) Uses a Trie-based index to match query prefixes AFilter (VLDB 2006) others Leverages prefix as well as suffix query indexes 5
  • 6. Related work (hardware) “Accelerating XML query matching through custom stack generation on FPGAs” (HiPEAC 2010) Introduced dynamic-programming XML path filtering approach for FPGAs “Efficient XML path filtering using GPUs” (ADMS 2011) Modified original approach to perform path filtering on GPUs “Massively parallel XML twig filtering using dynamic programming on FPGAs” (ICDE 2011) Extended algorithm to support holistic twig filtering on FPGAs 6
  • 7. Why GPUs This work proposes a holistic XML twig filtering algorithm, which runs on GPUs Why GPUs? Highly scalable, massively parallel architecture Flexibility comparable to software XML filtering engines Why not FPGAs? Limited scalability due to scarce hardware resources available on the chip Lack of query dynamicity - reconfiguring the FPGA hardware implementation takes time 7
  • 8. XML Document Preprocessing To be able to run the algorithm in streaming mode the XML tree structure needs to be flattened The XML document is represented as a stream of open(tag) and close(tag) events XML Document a b c d ... ... Event Stream open(a) – open(b) – open(c) – close(c) – … – close(b) – open(d) – … – close(d) – close(a) 8
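The flattening step above is easy to picture in code. The sketch below is our own illustration, not the authors' preprocessor (they use a SAX parser); it turns a small XML tree into the open/close event stream shown on the slide.

```python
# Minimal sketch of the preprocessing step: a preorder walk that emits
# open(tag)/close(tag) events. The paper uses a SAX parser; ElementTree
# is used here only to keep the example self-contained.
import xml.etree.ElementTree as ET

def event_stream(elem):
    """Yield ('open', tag) and ('close', tag) events in document order."""
    yield ('open', elem.tag)
    for child in elem:
        yield from event_stream(child)
    yield ('close', elem.tag)

doc = ET.fromstring('<a><b><c/></b><d/></a>')
events = list(event_stream(doc))
# events: open(a), open(b), open(c), close(c), close(b),
#         open(d), close(d), close(a) -- as on the slide
```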
  • 9. Twig Filtering: approach Twig processing consists of two steps Matching individual root-to-leaf paths Reporting matches back to the root, while joining them at split nodes a b c d ... ... a b c d ... ... 9
  • 10. Dynamic programming: algorithm Query: /a[//c]/d/* Every query is mapped to a DP table a c DP table - a binary stack Each stack is mapped to a query Every column has a prefix pointer Open and close events map to push and pop actions on the top-of-the-stack (TOS) * Query length XML Max depth Each node in the query is mapped to a stack column d $ /a //c /d Prefix pointers /* 10
  • 11. Dynamic programming: stacks Two different types of stack are used for different parts of the filtering algorithm Stacks capture query matches via propagation of ‘1’s push stacks: Used for matching root-to-leaf paths The TOS is updated on open (i.e. push) events Propagation goes diagonally upwards and vertically upwards from the query root to the query leaves 11
  • 12. Dynamic programming: stacks pop stacks: Used for reporting leaf matches back to the root The TOS is updated on close (pop) events (open events clear the TOS) Propagation goes diagonally downwards and vertically downwards from query leaves to the root 12
  • 13. Push stack: Example Dummy root node (‘$’) is always matched in the beginning XML Document a d b e c Open(a) TOS ‘1’ is propagated diagonally upwards if d 1 1 $ /a //c /d /* Prefix holds ‘1’ Relationship with prefix is ‘/’ Open event tag matches column tag 13
  • 14. Push stack: Example ‘1’ is propagated diagonally (in terms of prefix pointer) upwards as in previous example XML Document a d b e c Open(d) d 1 TOS 1 1 $ /a //c /d /* 14
  • 15. Push stack: Example If the query node tag is a wildcard (‘*’), then any tag in an open event qualifies to be matched XML Document a d TOS b e c Open(e) d 1 1 1 1 $ /a //c /d Since the ‘/*’ leaf node is matched, this fact is saved in a special binary array (colored in red in the example) /* 15
  • 16. Push stack: Example XML Document a d b e c Open(b) TOS ‘1’ propagates upwards in prefix column if Prefix holds ‘1’ Relationship with prefix is ‘//’ Tag in open event could be arbitrary d 1 1 1 $ /a //c /d /* 16
  • 17. Push stack: Example If ‘1’ is propagated to a query leaf node (‘//c’ in the example), it is saved as matched XML Document a d TOS b e c Open(c) d 1 1 1 1 $ /a //c /d /* 17
  • 18. Push stack: Example Node ‘/d’ is not updated, since ‘/a’ is a split node, whose children have different relationships (‘//’ with ‘c’ and ‘/’ with ‘d’) XML Document a d TOS b e c Open(d) 1 d The split node maintains different fields for these two kinds of children 1 1 1 $ /a //c /d /* 18
  • 19. Pop stack: Example XML Document Leaf nodes contain ‘1’ if these nodes were saved in binary match array during 1st algorithm phase a d TOS b e c Close(e) ‘1’ is propagated diagonally downwards if d 1 1 $ /a //c /d Node holds ‘1’ on TOS Relationship with prefix is ‘/’ Close event tag matches column tag or column tag is ‘*’ (shown in example) /* 19
  • 20. Pop stack: Example XML Document The split node (‘/a’ in the example) is matched only if all its children propagate ‘1’ a d b e c Close(d) TOS d  1 $ /a //c /d /* Since so far we have explored only one subtree of node ‘/a’, the match cannot be propagated to this node (a logical ‘and’ operation is used to match the split node) 20
  • 21. Pop stack: Example XML Document  a d TOS b  e c Close(c)  d  1 1 $ /a ‘1’ is propagated downwards in descendant node if 1 //c /d Node holds ‘1’ on TOS Relationship with prefix is ‘//’ Close event tag matches column tag /* 21
  • 22. Pop stack: Example XML Document a d TOS b e c Close(d) d 1 $ /a The same rules apply, however nothing is propagated 1 //c /d /* 22
  • 23. Pop stack: Example  This time both children of the split node are matched, thus the result of the ‘and’ operation propagates ‘1’ down to node ‘/a’  As with the push stack, the split node has two separate fields for children with ‘/’ and ‘//’ relationships  XML Document Reporting a match from children to parent does not depend on the event tag a d b e c Close(b) TOS 1 d 1 1 $ /a //c /d /* 23
  • 24. Pop stack: Example XML Document a d b e c Close(a) TOS Full query is matched if dummy root node reports match d 1 1 $ /a //c /d /* 24
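The push-stack rules walked through above can be condensed into a short CPU sketch. This is our own simplification, limited to a single linear root-to-leaf path (no split nodes, so only the first algorithm phase): each query step is a stack column, column 0 is the dummy root ‘$’, and an auxiliary OR-accumulator stack stands in for the vertical ‘1’ propagation that keeps an ancestor match alive across ‘//’ steps. All class and variable names are illustrative.

```python
# CPU sketch of the push-stack phase for one linear path query.
# A '/' step checks the prefix column on the old TOS (immediate parent);
# a '//' step checks whether the prefix matched at ANY stack level below,
# tracked by a per-column OR accumulator pushed alongside the TOS rows.
class PathMatcher:
    def __init__(self, steps):
        # steps: list of (relation, tag), e.g. [('/', 'a'), ('//', 'c')]
        self.steps = steps
        n = len(steps)
        # column 0 is the dummy root '$', always matched
        self.top = [[1] + [0] * n]    # push-stack rows (current TOS is last)
        self.acc = [[1] + [0] * n]    # OR over all rows so far, per column

    def open(self, tag):
        old_top, old_acc = self.top[-1], self.acc[-1]
        new = [1] + [0] * len(self.steps)
        for i, (rel, t) in enumerate(self.steps, start=1):
            if t != '*' and t != tag:         # wildcard skips the tag check
                continue
            if rel == '/':                    # parent-child: prefix on old TOS
                new[i] = old_top[i - 1]
            else:                             # '//': prefix anywhere below
                new[i] = old_acc[i - 1]
        self.top.append(new)
        self.acc.append([a | b for a, b in zip(old_acc, new)])
        return bool(new[-1])                  # leaf column matched on this open?

    def close(self, tag):
        self.top.pop()
        self.acc.pop()
```

For the path /a//c over the stream open(a), open(b), open(c), the leaf column first receives a ‘1’ on open(c), mirroring the ‘//’ propagation example in the slides; /a/d/* matches on open(e) under a–d, mirroring the wildcard example.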
  • 25. GPU Architecture SM is a multicore processor, consisting of multiple SPs SPs execute the same instructions (kernel) SPs within an SM communicate through small fast shared memory Block is a logical set of threads, scheduled on SPs within an SM [diagram: a Grid of SMs, each with an IU, an array of SPs, SFUs and shared memory (SMem); thread blocks Block1 … BlockN; global memory] 25
  • 26. Filtering Parallelism on GPUs  Intra-query parallelism  Each stack column on the TOS is independently evaluated in parallel on an SP  Inter-query parallelism  Queries are scheduled in parallel on different SMs  Inter-document parallelism  Filtering several XML documents at a time using concurrent GPU kernels (co-scheduling kernels with different input parameters) 26
  • 27. XML Event & Stack Entry Encoding   The XML document is preprocessed and transferred to the GPU as a stream of byte-long (event/tag ID) pairs XML event: push/pop (1 bit), tagID (7 bits) The event streams reside in the GPU global memory 27
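The byte layout above can be sketched as a pair of pack/unpack helpers. The 1-bit/7-bit split is from the slide; the polarity (high bit set = pop) is our assumption.

```python
# Sketch of the one-byte XML event record: 1 event bit + 7 tagID bits.
# The high-bit-means-pop convention is assumed, not stated on the slide.
def encode_event(is_pop, tag_id):
    assert 0 <= tag_id < 128          # tag dictionary limited to 7 bits
    return (int(is_pop) << 7) | tag_id

def decode_event(byte):
    """Return (is_pop, tag_id) from a one-byte event record."""
    return bool(byte >> 7), byte & 0x7F
```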
  • 28. GPU Kernel Personality Encoding  Each GPU kernel maps to one query node  The offline query parser creates a personality for each kernel, which is later stored in GPU registers  A personality is a 4-byte entry GPU personality: isLeaf (1 bit), prefix relation (1 bit), children with ‘/’ (1 bit), children with ‘//’ (1 bit), prefixID (10 bits), tagID (7 bits) 28
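A bit-packing sketch of the 4-byte personality. The field widths (four 1-bit flags, a 10-bit prefixID, a 7-bit tagID) follow the slide; the placement of the fields within the 32-bit word is our assumption.

```python
# Hypothetical packing of a query-node "personality" into a 32-bit word;
# the field order (flags in the top nibble, tagID in the low bits) is
# illustrative -- only the widths come from the slide.
def pack_personality(is_leaf, prefix_ad, child_pc, child_ad, prefix_id, tag_id):
    assert 0 <= prefix_id < 1024 and 0 <= tag_id < 128
    word = (int(is_leaf) << 31) | (int(prefix_ad) << 30)
    word |= (int(child_pc) << 29) | (int(child_ad) << 28)
    word |= (prefix_id << 7) | tag_id
    return word

def unpack_personality(word):
    """Invert pack_personality: flags, prefixID, tagID."""
    return (bool(word >> 31 & 1), bool(word >> 30 & 1),
            bool(word >> 29 & 1), bool(word >> 28 & 1),
            (word >> 7) & 0x3FF, word & 0x7F)
```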
  • 29. GPU Optimizations  Merging push and pop stacks to save shared memory  Coalescing global memory reads/writes  Caching XML stream items in shared memory   Reading the stream in chunks by looping in a strided manner, since shared memory size is limited  Avoiding usage of atomic functions  Calling non-atomic analogs in a separate thread 29
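The chunked, strided caching pattern in the third bullet can be illustrated on the CPU. The toy function below (our own sketch, no real shared memory involved) computes which global stream index each of T cooperating threads would copy into a CHUNK-sized staging buffer on each iteration; adjacent threads touch adjacent addresses, which is what makes the global reads coalesce.

```python
# Toy model of the strided chunk-caching loop: a block of n_threads
# threads walks an event stream of n_events bytes through a chunk-sized
# staging buffer (standing in for shared memory). Thread t copies
# elements t, t + n_threads, t + 2*n_threads, ... of each chunk.
def chunk_schedule(n_events, n_threads, chunk):
    """Return, per chunk, the list of (thread, global_index) copy pairs."""
    schedule = []
    for base in range(0, n_events, chunk):
        pairs = []
        for offset in range(0, chunk, n_threads):
            for t in range(n_threads):
                idx = base + offset + t
                if idx < n_events:       # last chunk may be partial
                    pairs.append((t, idx))
        schedule.append(pairs)
    return schedule
```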
  • 30. Experimentation Setup  GPU experiments    NVidia Tesla C2075 (Fermi architecture), 448 cores NVidia Tesla K20 (Kepler architecture), 2496 cores  Software filtering experiments   YFilter filtering engine Dual 6-core 2.30GHz Intel Xeon E5 machine with 30 GB of RAM 30
  • 31. Experimentation Datasets  DBLP XML dataset     Documents of varied size 32kB-2MB, obtained by trimming original DBLP XML Synthetic documents (having DBLP schema) of fixed size 25kB Maximum XML depth – 10 Queries generated by the YFilter XPath generator with varied parameters     Query size: 5,10 and 15 nodes Number of split points: 1,3 and 6 Probability of ‘*’ node and ‘//’ relations: 10%,30%,50% Number of queries: 32-2k 31
  • 32. Experiment Results: Throughput GPU throughput is constant until “breaking” point – point where all GPU cores are occupied  Number of occupied cores depends on number of queries and query length 32
  • 33. Experiment Results: Speedup GPU speedup depends on XML document size: larger docs incur greater global memory read latency  Speedup up to 9x  ‘*’ and ‘//’-probability affects speedup since it increases the YFilter NFA size 33
  • 34. Batch Experiments  Batched experiments show the usage of inter-document parallelism   Mimic real case scenarios (batches of real-time docs) Batches of size 500 and 1000 docs were used  It is not fair to use the single-threaded YFilter implementation in batch experiments  “Pseudo”-multicore YFilter: run multiple copies of the program, distributing the document load    Each copy runs on its own core Each copy filters a subset of the document set, the query set is fixed The query load is the same for all copies, since it affects NFA size. Distributing queries deteriorates query sharing due to lack of commonalities. 34
  • 35. Batch Experiments: Throughput No breaking point – the GPU is always fully occupied by concurrently executing kernels  Throughput increased up to 16 times in comparison with the single-document case 35
  • 36. Batch Experiments: Speedup  GPU fully utilized – an increase in query length yields a speedup drop by a factor of 2  Achieve up to 16x speedup  Slowdown after 512 queries  The multicore version performs better than the ordinary one 36
  • 37. Conclusions  Proposed the first holistic twig filtering approach using GPUs, effectively leveraging GPU parallelism  Allows processing of thousands of queries, whilst allowing dynamic query updates (vs. FPGA)  Up to 9x speedup over software systems in single-document experiments  Up to 16x speedup over software systems in batch experiments 37

Editor's notes

Slide 3: Publish-subscribe systems, used for information distribution, consist of several components, of which the filtering engine is the most important. Examples of data disseminated by such systems include news, blog updates, financial data and many other types of deliverable content. Adoption of XML as the standard format for message exchange made it possible to use XPath expressions to filter messages based on their content as well as their structure. However, existing pub-sub systems are not capable of scaling effectively, which is required because of the growing amount of information; therefore massively parallel approaches to XML filtering are needed.

Slide 4: A number of previous research works study the problem of XML filtering using software methods; they can be divided into three groups. The first uses finite-state machines to match XPath queries against given documents; a query is considered matched if a final state of the FSM is reached. XFilter converts each query into an FSM, where each query node corresponds to an individual state in one of the FSMs. YFilter exploits commonalities between queries to construct a single nondeterministic finite automaton. LazyDFA uses a deterministic automaton representation, constructed lazily. XPush creates a pushdown automaton for the same purpose.

Slide 5: The second type, sequence-based approaches, is represented by FiST, which converts the XML document and queries into Prüfer sequences and applies a substring search algorithm with additional postprocessing to determine whether the original query was matched. The third type consists of systems that use different indexing techniques to match XPath queries: XTrie builds a trie-like index to match query prefixes, while AFilter uses a pair of similar indexes to match a query's prefix as well as its suffix.

Slide 6: A number of our previous works study hardware approaches to the stated problem. The first paper proposed a streaming dynamic-programming approach for XML path filtering and an FPGA implementation of this algorithm. The ICDE paper extended this work by supporting twig queries. The paper presented at ADMS in 2011 used the original approach to implement a path filtering system on GPUs. All these works on hardware acceleration of XML filtering showed significant speedups over state-of-the-art software analogs.

Slide 7: The following work studies the problem of holistic XML twig filtering on GPUs. The nature of the problem allows it to fully leverage the massively parallel GPU hardware architecture. In the ICDE paper we already studied twig filtering on FPGAs, but despite the significant speedup we were able to achieve there, that approach had significant drawbacks. First, FPGA chip real estate is expensive, so the number of queries that can be filtered is limited. Second, there is a lack of query dynamicity: changing queries requires reprogramming the FPGA, a process which takes time and effort and cannot be done online.

Slide 8: The proposed filtering approach requires some preliminary work. First we need to convert the original XML document into a parsed stream, which is fed to our algorithm. The XML stream is obtained using a SAX XML parser, which traverses the XML document and generates an open(tag) event whenever it reads an opening tag, and a close(tag) event when a closing tag is observed.

Slide 9: The twig filtering algorithm consists of two parts: a) matching individual root-to-leaf paths and b) reporting those matches back to the root while joining them at split nodes.

Slide 10: To capture those matches our algorithm uses a dynamic programming strategy, where the role of the dynamic programming table is played by a binary stack, i.e. a stack containing only 0s and 1s (here and in later examples only 1s are shown; 0s are omitted for brevity). Stacks are generated during a preprocessing step. Each stack is mapped to a single XPath query, and every column within the stack is mapped to a query node. The length of the stack equals the length of the query; the depth of the stack should be at least the maximum depth of the XML document. During query parsing every node determines its prefix in the twig, which is saved as a stack column prefix pointer. Matches are always saved on the current top-of-the-stack (TOS). The TOS is updated (increased or decreased) in reaction to events read from the XML stream: an open event translates into a stack push operation and a close event into a pop operation.

Slide 11: To perform the two stages of the algorithm (root-to-leaf and reversed match propagation) we introduce two types of stacks, the push stack and the pop stack, which capture matches for each algorithm stage accordingly. Values in the push stack are updated only in response to open events (on a close event only the TOS is decreased). In contrast, values in the pop stack are updated on both open and close events, where an open event forces values on the pop stack TOS to be erased.

Slide 13: The following example shows match propagation rules in the push stack. When the stack is generated from the parsed XPath query, the first column is reserved for the dummy root node (referred to as $). At the beginning of root-to-leaf path matching, the TOS value for the column corresponding to the root node is set to 1, whereas all other values on the TOS are 0. On an open event, 1 can be propagated diagonally upwards to a node's column if its prefix column has 1 on the old TOS, the relationship between the node and its prefix is parent-child (/), and the tag in the open event equals the node's tag name.

Slide 15: In the case when a node has a wildcard tag (*), the same diagonal propagation rules apply, but the tag name check is omitted. If a leaf node is matched, it is saved in the match array, which is later used in the second phase of the algorithm.

Slide 16: To address ancestor-descendant semantics we introduce upwards propagation of 1, which applies if the following is true: the node's prefix has 1 on the old TOS, the relationship between the node and its prefix is ancestor-descendant (//), and the tag in the open event equals the node's tag name. Note that the match is propagated not in the node's column, but in its prefix.

Slide 17: Another example of upward diagonal propagation, where the leaf node is saved in the match array.

Slide 18: Note that 1 does not propagate from node /a to /d despite the fact that all requirements for upwards diagonal propagation hold. The reason is the following: node /a is a split node and has two children, //c and /d, with different types of parental relationships. In this case the stack column for node /a is split into two columns: one for children with the / relationship and one for children with the // relationship. In the example, the /a column has 0 in the /-children field and 1 in the //-children field on the TOS. Since nodes /a and /d are connected with a parent-child relationship, only the value from the respective field (0 in this case) can be propagated.

Slide 19: The second phase of the algorithm saves its match information in the pop stack. Propagation starts in leaf nodes, if these nodes were saved in the match array during the first phase of the algorithm. On a close event, 1 can be propagated diagonally downwards to a node's column if its prefix column has 1 on the old TOS, the relationship between the node and its prefix is parent-child (/), and the tag in the close event equals the node's prefix tag name, or the node's prefix tag is a wildcard (as shown in the example).

Slide 20: To match the split node (/a) we need to merge matches from all its children. Only one of the children has been explored so far, therefore we cannot match the split node yet.

Slide 21: Strict downwards propagation applies if the following is true: the node has 1 on the old TOS, the relationship between the node and its prefix is ancestor-descendant (//), and the tag in the close event equals the node's tag name. This time the match is propagated in the node's column, not in its prefix (unlike the same situation in the push stack).

Slide 23: The following example shows the situation when we need to combine matches from several paths in a split node. Since we want to match the twig holistically, 1 is propagated into the prefix node only if all its children have 1 on the TOS simultaneously. Again, since a split node can have children with different parental relationships, we need to take into account the /-children and //-children fields. Note that the event tag (b) is not equal to the tag names of the children nodes //c and /d; however, they are already matched, and reporting does not depend on tag name equality.

Slide 24: Finally, the second phase of the algorithm is completed when the dummy root node is matched.

Slide 25: GPUs have a very specific architecture.

Slide 26: It turns out that the twig filtering problem is especially well suited for this type of parallel architecture, since it can use several types of inherent parallelism provided by GPUs. Intra-query parallelism allows us to evaluate all stack columns simultaneously, executing them on streaming processors (SPs). This type of parallelism is useful because in practice a thread block (a set of threads scheduled to execute in parallel) co-allocates several queries, so all these queries obtain their results simultaneously. Inter-query parallelism allows concurrent scheduling of thread blocks on the set of streaming multiprocessors (SMs), which again allows queries to execute in parallel if they were not co-allocated in the same thread block in the first place. The concurrent kernel execution feature of the latest Fermi GPU architecture allows us to use inter-document parallelism: several GPU kernels with different parameters (XML documents) can be executed in parallel.

Slide 27: To minimize slow communication between the CPU and GPU, all preprocessing (XML parsing, XPath query parsing) is done on the CPU side and transferred to the GPU in a compressed format. The XML document is represented as a stream of 1-byte-long XML event records: 1 bit of the record encodes the type of XML event (either push or pop) and the remaining 7 bits encode the tag name of the event. Event streams reside in a pre-allocated buffer in GPU global memory.

Slide 28: Since the GPU kernel is the code executed by each query node, we need to store node-specific information in a parameter which is passed to the GPU kernel. This parameter is called a personality. A personality is a 4-byte record produced by the XPath query parser on the CPU and transmitted to the GPU at the beginning of kernel execution. The personality stores all properties of a particular query node: whether it is a leaf node or not, which parental relation (/ or //) it has with its prefix, whether the node has children with the / or // relationship (in case the node is a split node), a pointer to the node's prefix, and the tag name corresponding to the node.

Slide 29: A number of specific optimizations were applied to maximize GPU performance. First, since stacks reside in shared memory, which is very limited in size, we merge the push and pop stacks into a single data structure while maintaining their logical separation. Coalescing accesses to global memory is another common GPU optimization technique, which decreases GPU bus contention; we use memory coalescing when the GPU personality is read at the beginning of kernel execution. Since every thread within a thread block reads the same XML stream, it would be ideal if the XML stream resided in shared memory; however, this is not possible due to the tiny shared memory size. To tolerate this we cache a small part of the stream in shared memory and loop through the XML stream in a strided manner, caching a new block from the stream on each iteration. Finally, to apply the path merging semantics in the second phase of the algorithm, it would be natural to call atomic functions to avoid race conditions; however, our tests have shown that the use of atomics results in a huge performance drop, therefore we use a dedicated thread (the split node thread), which executes the merging logic serially.

Slide 30: In the experimental evaluation we used two GPUs (NVidia Tesla C2075 and NVidia Tesla K20) to compare the effect of different GPU architectures and different numbers of computational cores. As the reference implementation of a software filtering system we chose the YFilter engine as a state-of-the-art software filtering approach. To measure the performance of software filtering we used a server with two 6-core 2.30GHz Intel Xeon processors and 30 GB of memory.

Slide 31: As the experimental dataset we used the DBLP bibliography XML. To study the effect of document size we trimmed the original dataset into chunks of variable size (32kB-2MB). We also used synthetic 25kB XML documents, generated with the ToXGene XML generator from the DBLP DTD schema. Since the maximum depth of the DBLP XML is known, we limit the depth of our stacks to 10. The query dataset was generated with the YFilter XPath generator from the DBLP DTD. We used different parameters while generating queries to study their effect on XML filtering performance: query size (number of nodes in the query) varied from 5 to 15; the number of split points in the twig varied from 1 to 6; the probability of a wildcard node and the probability of an ancestor-descendant relationship between a node and its prefix varied from 10% to 50%; the total number of queries executed in parallel varied from 32 to 2048.

Slide 32: The following graph shows filtering throughput on the GPU (the graph is shown for a 1MB document, but it is the same for documents of other sizes). We can see that GPU throughput is constant up until some point, which we call the breaking point. At that point all available GPU cores are occupied, and having more queries results in their serial execution, which is reflected in the rapid drop in throughput. Up until the breaking point the GPU is underutilized, so some computation cores are simply wasted. The number of occupied GPU cores depends on query length and the number of queries; this correlation is clearly shown on the plot.

Slide 33: The following graph shows the speedup of GPU filtering over the software filtering approach. The largest speedup (~9x) is obtained on small documents (32kB); for bigger documents it rapidly drops to a constant value. This effect can be explained by the global memory latency overhead, which is an issue with bigger documents. A higher probability of wildcard nodes and //-relationships increases the speedup, since filtering on the GPU is not sensitive to these parameters at all, while YFilter, on the contrary, is affected by them, since they result in a bigger NFA.

Slide 34: To study the effect of inter-document parallelism we performed a number of batch experiments in which we filtered a set of XML documents through a given set of fixed queries (the size of the document set in our experiments was 500 and 1000 docs). In the previous single-document experiments the performance measurements were not really "fair", because YFilter is implemented as a single-threaded program and cannot load all CPU cores on our experiment server. In the batch experiments we introduce a "pseudo"-multicore version of YFilter, which simply starts a number of YFilter copies (equal to the number of available CPU cores), each with a subset of the original document set; in other words, we split the document load across these copies. Note that we cannot apply the same technique to the query load, since it would decrease the amount of path commonalities used to build the compact unified NFA.

Slide 35: The throughput graph in the batch experiments differs from the one we saw in the single-document case: there is no breaking point on the plot, throughput just degrades linearly with increasing query load. The reason is the following: no GPU cores are wasted now (as was the case prior to the breaking point in the single-document case); they are all used for concurrent kernel execution, where different kernels filter different XML streams. Another observation is that concurrent kernel execution increases maximum GPU throughput almost by a factor of 16 in comparison to the single-document experiment. Query length and query load (number of queries) are still the major factors influencing throughput. We can also see that the Tesla K20, which has more computational cores than the Tesla C2075, shows better results, although they do not differ by a factor of 6 (the ratio of cores in these GPUs) because the amount of concurrent kernel overlap is limited by the GPU architecture.

Slide 36: The speedup graph for the batch experiments also shows a different picture: the maximum speedup is up to 16x, and instead of flattening out for a greater number of queries the graph continues to drop by a factor of 2. Moreover, after 512 queries we see that GPU performance is even worse than the performance of YFilter. This can be explained by the nature of the throughput shown in the previous plot. Again, the Tesla K20 performs better than the Tesla C2075. The "pseudo"-multicore version also performs better than the regular one; however, again we do not see a 12x difference (the number of used cores) because of the startup overhead incurred by executing additional YFilter copies.

Slide 37: In conclusion, we have proposed an algorithm for holistic twig filtering tuned to execute effectively on the massively parallel GPU architecture, which uses all available sources of hardware parallelism provided by GPUs. Unlike the FPGA analog, this approach allows processing a theoretically unlimited number of queries, which can be changed on the fly without hardware reprogramming. We have shown that our approach is capable of reaching speedups of up to 9x in the single-document usage scenario and up to 16x in the batch document case.