HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple Twig Pattern Queries
Hyebong Choi1, Kyong-Ha Lee1, Soo-Hyong Kim1, Yoon-Joon Lee1 and Bongki Moon2
1Computer Science Department, KAIST, Korea
2Computer Science Department, University of Arizona, USA
hbchoi@dbserver.kaist.ac.kr bart7449@gmail.com kimsh@dbserver.kaist.ac.kr yoonjoon.lee@kaist.ac.kr bkmoon@cs.arizona.edu
Motivation
Big data in XML
▶ More than 100GB of protein sequences and their functional information are provided in XML format and updated every four weeks [2]
▶ Conventional XML tools such as single‐site XML DBMSes and XML pub/sub systems failed to process XML data of that size

XML DB: eXist [9] and BaseX [10] (query processing w/ 4000 twig queries)
Data size   eXist loading time   eXist query processing   BaseX loading time   BaseX query processing
1GB         5m 54s               failed                   2m 1s                2h 48m 7s
10GB        1h 5m 21s            failed                   19m 36s              30h 11m 34s
100GB       failed               ‐                        failed               ‐

YFilter [5]
Data size   Filtering time   Postprocessing time (twig pattern join)
1MB         2m 4s            0.264s
10MB        20s 14s 16s
100MB       3h 22m 6s        1h 1m 37s
1GB         failed           ‐

Performance
Experimental environment
‐ Hadoop 0.21.0 [1], CentOS 6.2, 1Gb switching hub
‐ 1 master: AMD Athlon II x4 620 (4‐cores), 8GB memory, 7200 RPM HDD
‐ 8 slaves: i5‐2500k processor (4‐cores), 8GB memory, 7200 RPM HDD

XML dataset statistics
Filename           UniRef100   UniParc   UniProtKB   XMark1000
File size (MB)     24,500      37,436    105,745     114,414
# of elements      335M        360M      2,110M      1,670M
# of attributes    589M        1,215M    2,783M      383M
Depth in avg.      4.5649      3.7753    4.3326      4.7375
Max depth          6           5         7           12
# distinct paths   30          24        149         548

[Charts not reproduced: Loading time; Overall execution time (Synthetic dataset, Real‐world dataset)]

System Architecture
[Figure not reproduced; the recoverable labels are listed below]
‐ Preprocessing: a large XML file -> Partitioning & Labeling -> XML blocks (block1 ... blockn) and label blocks -> copy to HDFS with block collocation
‐ Query processing: XPath queries -> Query decomposition -> path patterns, plus the relationship btw. path patterns & twig patterns; Query index builder -> query index (distributed cache)
‐ 1st M/R job: Mappers run path filtering over each XML block and its label block and emit path solutions; Reducers count path solutions and produce size information for path solutions
‐ Multi query optimizer: uses the size information for path solutions and the relationship btw. paths and twigs
‐ 2nd M/R job: Mappers tag path solutions with a Reducer ID; shuffle by ReducerId; Reducers run the holistic twig join and emit the final answers
‐ Key‐value pairs shown: <Path ID, label>, <Path ID, a list of labels>
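The Partitioning & Labeling step attaches a numeric label to every element, for example `1, 24, 1` for `<region>` in the working example. The sketch below is one way such labels could be produced. It assumes the three numbers are (start, end, level) region numbers, with a counter that ticks at every start tag and every end tag; this matches the label values on the poster but is not confirmed as the authors' implementation. The class and function names are hypothetical, and the sketch labels a whole document on one machine rather than block by block.

```python
import xml.sax


class RegionLabeler(xml.sax.ContentHandler):
    """Collects (tag, start, end, level) for every element of a document."""

    def __init__(self):
        super().__init__()
        self.counter = 0       # ticks once per start tag and once per end tag
        self.open_elems = []   # stack of (tag, start, level) for open elements
        self.labels = []       # finished labels, appended when an element closes

    def startElement(self, name, attrs):
        self.counter += 1
        # Level is the nesting depth; the root element has level 1.
        self.open_elems.append((name, self.counter, len(self.open_elems) + 1))

    def endElement(self, name):
        self.counter += 1
        tag, start, level = self.open_elems.pop()
        self.labels.append((tag, start, self.counter, level))


def label_document(xml_text):
    """Return all labels in document order (sorted by start position)."""
    handler = RegionLabeler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return sorted(handler.labels, key=lambda label: label[1])


def is_ancestor(anc, desc):
    """True if the region of anc strictly contains the region of desc."""
    return anc[1] < desc[1] and desc[2] < anc[2]


def is_parent(par, child):
    """True if par contains child and child is exactly one level deeper."""
    return is_ancestor(par, child) and child[3] == par[3] + 1


# The poster's Example.xml, written on one logical line.
EXAMPLE = (
    "<region><Africa>"
    '<item id="item0"><quantity>1</quantity><payment>Creditcard</payment></item>'
    '<item id="item1"><quantity>1</quantity><payment>Money order</payment></item>'
    "</Africa><Asia>"
    '<item id="item135"><quantity>2</quantity><payment>Personal Check</payment></item>'
    "</Asia></region>"
)

for tag, start, end, level in label_document(EXAMPLE):
    print(tag, start, end, level)
```

With labels of this form, ancestor and parent tests reduce to integer comparisons, so a join over path solutions does not need to revisit the XML text.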
Working Example
Preprocessing (partitioning & labeling): Example.xml is split into three blocks, each with a path offset and a label block

block_1 (path offset: /), Label_1
  <region>                                  1, 24, 1
  <Africa>                                  2, 15, 2
  <item id="item0">                         3, 8, 3
  <quantity>1</quantity>                    4, 5, 4
  <payment>Creditcard</payment>             6, 7, 4
  </item>
block_2 (path offset: /region/Africa), Label_2
  <item id="item1">                         9, 14, 3
  <quantity>1</quantity>                    10, 11, 4
  <payment>Money order</payment>            12, 13, 4
  </item>
  </Africa>
  <Asia>                                    16, 23, 2
block_3 (path offset: /region/Asia), Label_3
  <item id="item135">                       17, 22, 3
  <quantity>2</quantity>                    18, 19, 4
  <payment>Personal Check</payment>         20, 21, 4
  </item>
  </Asia>
  </region>

Query decomposition & converting to root‐to‐leaf paths
Twig query ID   Twig query
1               /region/Africa/item[quantity]/payment
2               //Asia/item

Path query ID   Path query
1.1             /region/Africa/item
1.2             /region/Africa/item/quantity
1.3             /region/Africa/item/payment
2               /region/Asia/item

1st M/R: path filtering with a query index
Path query ID   Path solution            Count
1.1             3, 8, 3    9, 14, 3      2
1.2             4, 5, 4    10, 11, 4     2
1.3             6, 7, 4    12, 13, 4     2
2               17, 22, 3                1
The counts are passed to the multi query optimizer.

2nd M/R: twig pattern join
Twig query ID   Path solution
1               6, 7, 4    12, 13, 4
2               17, 22, 3

[Performance charts not reproduced: Effect of converting paths to distinct paths; Effect of block collocation; Effect of multi query optimization]

HadoopXML
▶ It efficiently processes many twig pattern queries for a massive volume of XML data in parallel
‐ Block partitioning with no loss of structural information
‐ Path filtering with NFA‐style query indexes [5]
‐ I/O optimal holistic twig pattern joins [3]
▶ Simultaneous processing of multiple twig pattern queries
‐ Many twig pattern joins are distributed across nodes and executed in parallel
▶ Optimization of the I/O cost in MapReduce jobs
‐ Sharing input scans and intermediate path solutions
‐ Converting redundant path patterns with {//, *} to a few distinct root‐to‐leaf paths
‐ Collocation of XML blocks and corresponding label blocks
▶ Runtime load balancing & multi query optimization
‐ XML twig queries may share path patterns with each other
‐ For I/O reduction and workload balance, twig pattern queries that share path patterns are grouped together
‐ The twig query groups are assigned to reducers at runtime such that every reducer has the same overall cost of join operations

Path Filtering
[Figure not reproduced; the recoverable content is listed below]
‐ Input block: block_2 (path offset: /region/Africa) with Label_2
  <item id="item1">                         9, 14, 3
  <quantity>1</quantity>                    10, 11, 4
  <payment>Money order</payment>            12, 13, 4
  </item>
  </Africa>
  <Asia>                                    16, 23, 2
‐ Input to the 1st Mapper: startElement("region"), startElement("Africa") & SAX events from block_2
‐ NFA style query index: states &1 to &7 with transitions labeled region, Africa, Asia, item, item, quantity and payment; accepting states are marked with the path query IDs 1.1, 1.2, 1.3 and 2
‐ The mapper keeps a runtime stack while it runs the events through the query index

Path solutions for block_2
Path query ID   Path solution
1.1             9, 14, 3
1.2             10, 11, 4
1.3             12, 13, 4

References
[1] Hadoop. http://hadoop.apache.org, Apache Software Foundation.
[2] A. Bairoch et al. The universal protein resource (UniProt). Nucleic Acids Research, 33(suppl 1):D154–D159, 2005.
[3] N. Bruno et al. Holistic twig joins: optimal XML pattern matching. In Proceedings of ACM SIGMOD, pages 310–321. ACM, 2002.
[4] J. Dean et al. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
[5] Y. Diao et al. Path sharing and predicate evaluation for high‐performance XML filtering. ACM Transactions on Database Systems, 28(4):467–516, 2003.
[6] K. Lee et al. Parallel data processing with MapReduce: a survey. ACM SIGMOD Record, 40(4):11–20, 2011.
[7] Q. Li et al. Indexing and querying XML data for regular path expressions. In Proceedings of VLDB, pages 361–370, 2001.
[8] T. Nykiel et al. MRShare: Sharing across multiple queries in MapReduce. Proceedings of the VLDB Endowment, 3(1‐2):494–505, 2010.
[9] W. Meier. eXist: An open source native XML database. Web, Web‐Services, and Database Systems 2002, LNCS 2593, Springer, Berlin (2002), pp. 169–183.
[10] C. Grün et al. BaseX ‐ Processing and Visualizing XML with a native XML Database, http://www.basex.org/, 2010.

Load Balancing & Multi Query Optimization
▶ Twig pattern join, a specialized multi‐way join that reads multiple path solutions
‐ With a static one‐to‐one shuffling scheme, i.e. given partitioned path solutions, reducers generate incomplete join results
  Input queries: Q1: A join B join C; Q2: A join C join D; Q3: A join B join D
  Partitioned path solutions: A1, B1, C1, D1 and A2, B2, C2, D2
  Reducer1: Q1: A1 join B1 join C1; Q2: A1 join C1 join D1; Q3: A1 join B1 join D1
  Reducer2: Q1: A2 join B2 join C2; Q2: A2 join C2 join D2; Q3: A2 join B2 join D2
  Missing results! For example A1 join B2 join C2 and A2 join B1 join C2
▶ Runtime one‐to‐many data shuffling
‐ It distributes both queries and data at runtime
‐ Path solutions can be redundantly copied to reducers, involving redundant I/Os
‐ A straggling reduce task dominates the overall performance of M/R jobs
‐ Optimization problem: find the optimal way to distribute queries and path solutions across reducers so that every reducer is assigned an even workload
  Input queries: Q1: A join B join C; Q2: A join C join D; Q3: A join B join D
  Sizes shown next to the path solutions A, B, C and D: 30, 80, 90 and 5
  Reducer1: Q1: A join B join C; cost = |A|+|B|+|C| = 200
  Reducer2: Q2: A join C join D and Q3: A join B join D; cost = |A|+|C|+|D| + |A|+|B|+|D| = 240
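The reducer costs of 200 and 240 in the load balancing example come from summing path solution sizes per query. The sketch below reproduces that arithmetic and assigns queries to reducers with a greedy heuristic (largest query first, onto the least loaded reducer). The heuristic is a stand-in chosen for illustration, since the poster does not describe the optimizer's algorithm. The sizes |A| = 30, |B| = 80, |C| = 90 and |D| = 5 are read from the figure; which of 80 and 90 belongs to B or C is an assumption that does not change the reducer totals. The function names are hypothetical.

```python
import heapq


def query_cost(paths, sizes):
    """Cost of one twig join: the total size of the path solutions it reads."""
    return sum(sizes[p] for p in paths)


def assign_queries(queries, sizes, num_reducers):
    """Greedy assignment: largest query first, onto the least loaded reducer.

    Returns (plan, loads), where plan maps reducer id to a list of query
    names and loads maps reducer id to the summed cost of its queries.
    """
    heap = [(0, r) for r in range(num_reducers)]  # (current load, reducer id)
    heapq.heapify(heap)
    plan = {r: [] for r in range(num_reducers)}
    # Sort by descending cost; ties are broken by query name for determinism.
    order = sorted(queries, key=lambda q: (-query_cost(queries[q], sizes), q))
    for name in order:
        load, reducer = heapq.heappop(heap)
        plan[reducer].append(name)
        heapq.heappush(heap, (load + query_cost(queries[name], sizes), reducer))
    loads = {
        r: sum(query_cost(queries[q], sizes) for q in plan[r]) for r in plan
    }
    return plan, loads


# Sizes and queries from the poster's example (B/C assignment is assumed).
SIZES = {"A": 30, "B": 80, "C": 90, "D": 5}
QUERIES = {
    "Q1": ["A", "B", "C"],
    "Q2": ["A", "C", "D"],
    "Q3": ["A", "B", "D"],
}

plan, loads = assign_queries(QUERIES, SIZES, num_reducers=2)
print(plan)   # which queries each reducer runs
print(loads)  # summed join cost per reducer
```

On this input the heuristic puts Q1 alone on one reducer and Q2 and Q3 together on the other, matching the figure. Like the figure's cost formula, the sketch counts a path solution once per query that reads it, so it does not model savings from sharing a scan between queries on the same reducer.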
This work was partly supported by NRF grant funded by the Korea government (MEST) (no. 2011‐0016282)