HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple Twig Pattern Queries

Hyebong Choi‡ (hbchoi@dbserver.kaist.ac.kr), Kyong-Ha Lee‡ (bart7449@gmail.com), Soo-Hyong Kim‡ (kimsh@dbserver.kaist.ac.kr), Yoon-Joon Lee‡ (yoonjoon.lee@kaist.ac.kr), Bongki Moon§ (bkmoon@cs.arizona.edu)

‡ Computer Science Dept., KAIST, Daejeon, 301-781, Korea
§ Dept. of Computer Science, University of Arizona, Tucson, Arizona, 85721, USA

ABSTRACT
The volume of XML data is tremendous in many areas, especially in data logging and scientific applications, where XML data accumulate over time as new data are continuously collected. Processing such massive XML data against multiple twig pattern queries issued by multiple users in a timely manner is a challenge. We showcase HadoopXML, a system that simultaneously processes many twig pattern queries over a massive volume of XML data with Hadoop. Specifically, HadoopXML provides an efficient way to process a single large XML file in parallel. It processes multiple twig pattern queries simultaneously with a shared input scan, so users do not need to iterate M/R jobs for each query. HadoopXML also saves many I/Os by letting twig pattern queries share their path solutions with each other. Moreover, it provides a sophisticated runtime load balancing scheme that fairly assigns multiple twig pattern joins across nodes. With synthetic and real-world XML datasets, we demonstrate how efficiently HadoopXML processes many twig pattern queries in a shared and balanced way.

Categories and Subject Descriptors
H.2.4 [Database Management]: Systems—query processing; D.1.3 [Software]: Programming Techniques—Concurrent Programming

General Terms
Algorithms, Experimentation, Performance

Keywords
XML, parallel processing, query optimization, MapReduce

Copyright is held by the author/owner(s).
CIKM'12, October 29–November 2, 2012, Maui, HI, USA.
ACM 978-1-4503-1156-4/12/10.
1. INTRODUCTION
XML is one of the most prominent data formats, and a great deal of data has been produced in, or transformed into, this format. In particular, scientific data and log messages are often kept as XML. Such XML data are large and growing quickly. For example, UniProtKB, which provides a collection of functional information on proteins, now exceeds 108 GB in a single file [2]. Moreover, new elements and attributes are continuously appended to existing XML files as they are generated over time. In a typical scenario, users prepare their queries in advance, even before the XML data are completely produced. This setting resembles that of XML pub/sub systems, but differs in that the data are sometimes stored in a single huge XML file on disk whose volume grows continuously. This makes it difficult to process the data within XML pub/sub systems or single-site XML databases: conventional XML pub/sub systems are mainly designed for streams of small XML documents, and XML databases are not optimized for a big XML file that must also be appended to, or even replaced by a new XML file, frequently. Thus, it is prudent to process user queries over such XML data with MapReduce [4].

To address this issue, we devise HadoopXML, which efficiently processes a massive volume of XML data in parallel. HadoopXML is a set of applications developed on the popular MapReduce framework, Hadoop [1]. Its main features are as follows. First, it provides an efficient means to process a massive volume of XML data in parallel, partitioning XML data into blocks with no loss of structural information. Second, HadoopXML processes multiple twig pattern queries simultaneously; there is no need to iterate M/R jobs for each query in a query set. Third, HadoopXML enables query processing tasks to share input scans and intermediate results with each other: a path solution is shared by all twig pattern queries that contain the common path pattern. Moreover, it saves many I/Os by removing redundant intermediate results, substituting path patterns that include // and * with distinct root-to-leaf paths. Lastly, HadoopXML provides a sophisticated runtime load balancing scheme for evenly distributing twig joins across nodes. The rest of this proposal is organized as follows. Section 2 describes our system architecture. Section 3 explains the features of HadoopXML. Section 4 presents implementation details. Section 5 describes our demonstration scenarios.

2. SYSTEM ARCHITECTURE
HadoopXML processes XML data in three steps: a preprocessing step followed by two consecutive M/R jobs. In the preprocessing step (shown in Fig. 1), XML data are partitioned into equal-sized blocks and then loaded into HDFS. Elements are also labeled for use in twig pattern joins, and the labels are written into label blocks kept separate from the XML blocks. In this stage, HadoopXML additionally decomposes a given set of queries into linear path patterns. It then builds an NFA-style query index and a table that holds the mapping between the given queries and the decomposed path patterns.

[Figure 1: Preprocessing step in HadoopXML]

In the 1st M/R job, the query index is loaded into each mapper beforehand via the distributedCache mechanism in Hadoop. Mappers then read XML blocks as SAX streams and filter out only the labels of elements that match the decomposed path patterns. Reducers group the labels by pathId and count the number of labels for each pathId. The path solutions and their size information are stored in HDFS. After that, our multi-query optimizer decides which reducer in the next M/R job will perform which twig pattern join, balancing workloads across nodes based on the size information of the path solutions and the mapping table.
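For illustration, a minimal sketch of a path-filtering mapper for the 1st M/R job is given below. It is not taken from the HadoopXML source: it assumes, purely for illustration, that the record reader hands each element to the mapper as a "root-to-leaf path TAB label" record, and that the mapping from distinct root-to-leaf paths to path-pattern ids built during preprocessing has already been loaded (in HadoopXML this index is shipped to mappers via the DistributedCache).

import java.io.IOException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/** Sketch of the 1st-job mapper: filter element labels by path pattern
 *  and emit <pathId, label> pairs, which reducers group and count. */
public class PathFilterMapper
    extends Mapper<LongWritable, Text, IntWritable, Text> {

  // distinct root-to-leaf path -> ids of the decomposed path patterns it satisfies
  private final Map<String, List<Integer>> pathIndex = new HashMap<>();

  @Override
  protected void setup(Context context) {
    // In HadoopXML the query index is distributed to every mapper via
    // the DistributedCache; loading it is stubbed out in this sketch.
  }

  @Override
  protected void map(LongWritable offset, Text record, Context context)
      throws IOException, InterruptedException {
    // Assumed record layout (illustrative only): "<root-to-leaf path>\t<label>"
    String[] fields = record.toString().split("\t", 2);
    List<Integer> matched = pathIndex.get(fields[0]);
    if (matched == null) {
      return;                                   // element matches no query path
    }
    for (int pathId : matched) {
      // Emit <PathId, label>; reducers group labels by pathId and count them.
      context.write(new IntWritable(pathId), new Text(fields[1]));
    }
  }
}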
In the 2nd M/R job, mappers read the grouped path solutions and tag reducer ids onto them as keys. Since map outputs are shuffled by these intermediate keys, path solutions tagged with the same reducer id arrive at the same reducer together. Finally, reducers perform the twig pattern joins and write the final results to HDFS. Fig. 2 illustrates the data flows in the two M/R jobs of HadoopXML.

[Figure 2: (a) path query processing in the 1st M/R job; (b) twig pattern joins in the 2nd M/R job]
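The routing in the 2nd job can be expressed with a custom partitioner. The sketch below is illustrative rather than HadoopXML's actual code: it assumes the 2nd-job mapper has already written the reducer id chosen by the multi-query optimizer as the intermediate key; Job.setPartitionerClass(ReducerIdPartitioner.class) would install it.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

/** Route each path solution to the reducer chosen by the multi-query
 *  optimizer, so that all path solutions of a join meet at one reducer. */
public class ReducerIdPartitioner extends Partitioner<IntWritable, Text> {
  @Override
  public int getPartition(IntWritable reducerId, Text pathSolution, int numReducers) {
    // The key already *is* the target reducer id; the modulo is only a guard.
    return reducerId.get() % numReducers;
  }
}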
3. FEATURES OF HADOOPXML
HadoopXML has several features for efficient XML data processing. Since fault tolerance and scalability are Hadoop's primary design goals, Hadoop itself is not optimized for I/O efficiency [6]. We therefore strive to increase I/O efficiency without modifying Hadoop internals.
Partitioning with no loss of structural information
Along with the label values, each label block records a root-to-leaf path that captures the structural context at the start of the corresponding XML block. For example, consider an XML document with four elements: <a><b><c></c><d></d></b></a>. If the XML file is partitioned into two blocks and the second block contains the XML fragment </d></b></a>, we keep the root-to-leaf path /a/b/d for the start of that block. When a map task reads the second block, its query index is first fed with the SAX events restored from the root-to-leaf path string /a/b/d before the block itself is read. This guarantees that the query index in each map task starts with the correct internal state when processing XML blocks that lie in the "middle" of the SAX stream.
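The state restoration itself is simple enough to sketch. The snippet below is illustrative; QueryIndex is a hypothetical stand-in for the NFA-style query index, exposing only a SAX-like open-element callback.

/** Minimal stand-in for the NFA-style query index; only the SAX-like
 *  open-element callback is needed for state restoration. */
interface QueryIndex {
  void startElement(String tag);   // hypothetical, analogous to SAX startElement
}

/** Replay a stored root-to-leaf path such as "/a/b/d" as open-element
 *  events so the query index resumes in the correct internal state. */
class BlockStateRestorer {
  static void restoreState(QueryIndex index, String rootToLeafPath) {
    for (String tag : rootToLeafPath.split("/")) {
      if (!tag.isEmpty()) {        // skip the empty token before the leading '/'
        index.startElement(tag);
      }
    }
    // The block's own SAX events are then streamed into the index as usual.
  }
}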
Collocating XML blocks and label blocks
HadoopXML reads both XML blocks and their corresponding label blocks during query processing. If the two blocks are stored on different nodes, additional network I/Os occur as the system fetches blocks over the network, delaying the map stage. To increase spatial locality, we extend the block placement policy of HDFS so that an XML block and its corresponding label block are placed together on the same node.
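Conceptually, the extended placement policy only needs to pin each label block to the datanodes that already hold its XML block. The following sketch is not the actual HDFS block placement API, whose interface is more involved; the ".label"/".xml" naming and the lookup helper are assumptions made purely for illustration.

import java.util.Collections;
import java.util.List;

/** Simplified illustration of the collocation decision. */
public class CollocationSketch {

  /** Place a label block on the datanodes that already hold its XML block;
   *  otherwise keep the targets chosen by the default policy. */
  public List<String> chooseTargets(String blockFileName, List<String> defaultTargets) {
    if (blockFileName.endsWith(".label")) {                    // assumed naming convention
      List<String> collocated = lookupXmlBlockNodes(blockFileName.replace(".label", ".xml"));
      if (!collocated.isEmpty()) {
        return collocated;                                     // reuse the XML block's nodes
      }
    }
    return defaultTargets;                                     // default placement otherwise
  }

  /** Hypothetical lookup of where the matching XML block was placed. */
  private List<String> lookupXmlBlockNodes(String xmlBlockName) {
    return Collections.emptyList();                            // stub for illustration
  }
}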
Multiple twig pattern matchings in parallel
In HadoopXML, multiple join operations are distributed across nodes and executed in parallel, up to the number of reducers. Each join operation is implemented with an I/O-optimal holistic twig pattern join algorithm to improve I/O efficiency [3].
Sharing input scan and path solutions
MapReduce's batch nature makes it difficult to support ad-hoc queries as a DBMS does, and iterating the same M/R job, from input scan to reduce stage, for each query is wasteful in many cases. Moreover, many twig pattern queries share linear path patterns with each other in practice. Sharing path solutions reduces redundant processing of path patterns and saves many I/Os [8]. In this respect, we borrow the concept of path sharing from YFilter [5]. Path solutions are also shared by multiple twig pattern joins in HadoopXML: while joining path solutions for twig patterns, the join operations assigned to the same reducer share path solutions with each other whenever the corresponding path patterns are shared by the twig patterns. This helps reduce the overall I/O cost of join operations.
Converting to distinct path patterns
Many path patterns may match a single root-to-leaf path in practice. For example, assume that the path patterns /a//c, //c and /a/*/c all match the root-to-leaf path /a/b/c in an XML file. If these paths are treated as different from each other, three path solutions are redundantly produced for a single distinct path during query processing. By converting such redundant path patterns to the root-to-leaf paths that are distinct in the input XML, we substantially reduce the sizes of path solutions and save many I/Os. To support this feature, HadoopXML extracts the distinct root-to-leaf paths during data loading in the preprocessing step.
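A minimal sketch of this conversion is shown below. It is not HadoopXML's implementation; it simply rewrites a linear path pattern containing // and * into a regular expression and matches it against the distinct root-to-leaf paths extracted at load time. For the example above, all three patterns map to the single path /a/b/c, so only one path solution is produced.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Minimal sketch: map a linear path pattern containing // and * onto the
 *  distinct root-to-leaf paths of the input document, so that equivalent
 *  patterns (e.g. /a//c, //c, /a/*/c on /a/b/c) share one path solution. */
public class DistinctPathMatcher {

  /** Translate an XPath-like linear pattern into a regular expression. */
  static Pattern toRegex(String pathPattern) {
    String regex = pathPattern
        .replace("*", "[^/]+")            // wildcard step: exactly one element
        .replace("//", "/(?:[^/]+/)*");   // descendant axis: zero or more elements
    return Pattern.compile("^" + regex + "$");
  }

  /** Return the distinct root-to-leaf paths matched by the pattern. */
  static List<String> matchedDistinctPaths(String pathPattern, List<String> distinctPaths) {
    Pattern p = toRegex(pathPattern);
    List<String> matched = new ArrayList<>();
    for (String path : distinctPaths) {
      if (p.matcher(path).matches()) {
        matched.add(path);
      }
    }
    return matched;
  }
}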


Runtime load balancing and multi-query optimization
A straggling task delays overall job execution in MapReduce, and the problem becomes more severe when it happens in the reduce stage. MapReduce's native runtime scheduling does not work well for reducers in this respect. HadoopXML instead uses a dynamic shuffling scheme that balances workloads across reducers at runtime. To achieve this, HadoopXML estimates the cost of each twig join operation before the actual join: since the worst-case I/O and CPU complexity of the TwigStack algorithm is linear in the total size of its input path solutions, the cost of a join operation is computed from the sizes of its input path solutions, which are counted in the 1st reduce stage. The cost estimation also accounts for path solutions shared by multiple twig pattern queries. HadoopXML then assigns join operations to reducers at runtime such that every reducer receives roughly the same overall join cost.
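The assignment itself can be realized with a simple greedy heuristic. The sketch below is one plausible realization, not necessarily the exact policy of our multi-query optimizer: joins are sorted by estimated cost and each is handed to the currently least-loaded reducer.

import java.util.Arrays;
import java.util.Comparator;

/** Greedy cost-balanced assignment of twig joins to reducers:
 *  sort joins by estimated cost (sum of their input path-solution sizes)
 *  and repeatedly give the next join to the least-loaded reducer. */
public class JoinAssignmentSketch {

  /** costs[i] = estimated cost of join i; returns reducerOf[i]. */
  static int[] assign(long[] costs, int numReducers) {
    Integer[] order = new Integer[costs.length];
    for (int i = 0; i < costs.length; i++) order[i] = i;
    Arrays.sort(order, Comparator.comparingLong((Integer i) -> costs[i]).reversed());

    long[] load = new long[numReducers];
    int[] reducerOf = new int[costs.length];
    for (int joinId : order) {
      int lightest = 0;
      for (int r = 1; r < numReducers; r++) {
        if (load[r] < load[lightest]) lightest = r;
      }
      reducerOf[joinId] = lightest;    // this reducer id is tagged as the key in the 2nd job
      load[lightest] += costs[joinId];
    }
    return reducerOf;
  }
}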
4. IMPLEMENTATION
We implemented HadoopXML on Hadoop version 0.21.0. Our cluster consists of 9 nodes running CentOS 6.2. The master is equipped with an AMD Athlon II X4 620 processor, 8 GB of memory and a 7200 RPM HDD. The other nodes are designated as slaves, each with an Intel i5-2500K processor, 8 GB of memory and a 7200 RPM HDD. All nodes are connected via a Gigabit switching hub. We use the default settings of our Hadoop cluster for fair comparison. A region numbering scheme [7] is used for labeling XML elements, but modified to support big XML files: since end values in the numbering scheme are generated in postorder, labels would otherwise have to be kept in memory until endElement() is reached, which causes a memory space problem for such big XML files. Our scheme instead reads an XML block and promptly appends its labels to the corresponding label block; after data loading, HadoopXML sorts the labels by their start values in preorder. For path filtering, we use the NFA-style query index of YFilter [5]. We use the TwigStack algorithm [3] to implement holistic twig pattern joins in the 2nd M/R job, but other holistic join techniques can be used in HadoopXML without loss of generality. Finally, we extend the DataPlacementPolicy class in HDFS in order to collocate XML blocks and their corresponding label blocks.
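For reference, the classic region numbering that our labeling scheme builds on can be sketched with a SAX handler as below. This is a simplified illustration assuming (start, end, level) labels generated from a single counter; HadoopXML's modified scheme flushes labels per block instead of holding them until endElement().

import java.util.ArrayDeque;
import java.util.Deque;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Minimal sketch of region numbering (start, end, level) driven by SAX events. */
public class RegionLabelingHandler extends DefaultHandler {

  private long counter = 0;                                // global position counter
  private final Deque<long[]> open = new ArrayDeque<>();   // {start, level} of open elements

  @Override
  public void startElement(String uri, String local, String qName, Attributes atts) {
    open.push(new long[] { ++counter, open.size() + 1 });  // assign start value and level
  }

  @Override
  public void endElement(String uri, String local, String qName) {
    long[] s = open.pop();
    long end = ++counter;                                  // end value assigned in postorder
    emit(qName, s[0], end, s[1]);                          // label = (start, end, level)
  }

  /** In HadoopXML labels are appended to a label block; here we just print. */
  private void emit(String tag, long start, long end, long level) {
    System.out.printf("%s (%d, %d, %d)%n", tag, start, end, level);
  }
}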

5. DEMONSTRATION SCENARIO
Table 1 presents statistics for the XML datasets used in our experiments. The demonstration will use only a small fraction of one synthetic and one real-world dataset, due to the limited demonstration time and the nature of MPP (massively parallel processing) applications. We nevertheless present in Fig. 3 our experimental results obtained with all the datasets. Currently, HadoopXML supports a subset of the XPath 1.0 language, i.e. {/, //, *, @, []}.

Table 1: Statistics of XML datasets
Filename           UniRef100     UniParc        UniProtKB      XMark1000
File size (KB)     25,088,663    38,334,953     108,283,066    117,159,962
# of elements      335,153,446   360,376,852    2,110,330,358  1,670,594,672
# of attributes    589,568,839   1,215,063,103  383,127,024    2,783,354,175
Avg. depth         4.5649        3.7753         4.3326         4.7375
Max depth          6             5              7              12
# distinct paths   30            24             149            548

[Figure 3: Experimental results — (a) data loading time; (b) execution time for synthetic datasets; (c) execution time for real-world datasets; (d) effect of block collocation; (e) effect of converting paths to distinct paths; (f) effect of multi-query optimization]

In our demonstration, users will be given a list of sample XPath queries generated from the DTDs of the datasets in Table 1. Users can also edit the queries to their taste. They are then allowed to load sample XML files into HadoopXML and run their queries themselves. During processing, we explain step by step, using the Hadoop GUI, how the system processes a massive volume of XML data. Users can also see how the features of HadoopXML affect the overall performance of the system by turning the features on and off, e.g. block collocation, sharing of input scans and path solutions, load balancing, and so on.

Acknowledgments
We thank Jiaheng Lu for providing us with a Java version of the twig join algorithms. This work was partly supported by an NRF grant funded by the Korea government (MEST) (No. 2011-0016282).

6. REFERENCES
[1] Hadoop. http://hadoop.apache.org, Apache Software Foundation.
[2] A. Bairoch et al. The universal protein resource (UniProt). Nucleic Acids Research, 33(suppl 1):D154–D159, 2005.
[3] N. Bruno et al. Holistic twig joins: optimal XML pattern matching. In Proceedings of ACM SIGMOD, pages 310–321. ACM, 2002.
[4] J. Dean et al. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
[5] Y. Diao et al. Path sharing and predicate evaluation for high-performance XML filtering. ACM Transactions on Database Systems, 28(4):467–516, 2003.
[6] K. Lee et al. Parallel data processing with MapReduce: a survey. ACM SIGMOD Record, 40(4):11–20, 2012.
[7] Q. Li et al. Indexing and querying XML data for regular path expressions. In Proceedings of VLDB, pages 361–370, 2001.
[8] T. Nykiel et al. MRShare: Sharing across multiple queries in MapReduce. Proceedings of the VLDB Endowment, 3(1-2):494–505, 2010.

Contenu connexe

Tendances

Hadoop Mapreduce Performance Enhancement Using In-Node Combiners
Hadoop Mapreduce Performance Enhancement Using In-Node CombinersHadoop Mapreduce Performance Enhancement Using In-Node Combiners
Hadoop Mapreduce Performance Enhancement Using In-Node Combinersijcsit
 
Python in an Evolving Enterprise System (PyData SV 2013)
Python in an Evolving Enterprise System (PyData SV 2013)Python in an Evolving Enterprise System (PyData SV 2013)
Python in an Evolving Enterprise System (PyData SV 2013)PyData
 
A sql implementation on the map reduce framework
A sql implementation on the map reduce frameworkA sql implementation on the map reduce framework
A sql implementation on the map reduce frameworkeldariof
 
PERFORMANCE EVALUATION OF BIG DATA PROCESSING OF CLOAK-REDUCE
PERFORMANCE EVALUATION OF BIG DATA PROCESSING OF CLOAK-REDUCEPERFORMANCE EVALUATION OF BIG DATA PROCESSING OF CLOAK-REDUCE
PERFORMANCE EVALUATION OF BIG DATA PROCESSING OF CLOAK-REDUCEijdpsjournal
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Deanna Kosaraju
 
An experimental evaluation of performance
An experimental evaluation of performanceAn experimental evaluation of performance
An experimental evaluation of performanceijcsa
 
STUDY ON EMERGING APPLICATIONS ON DATA PLANE AND OPTIMIZATION POSSIBILITIES
STUDY ON EMERGING APPLICATIONS ON DATA PLANE AND OPTIMIZATION POSSIBILITIESSTUDY ON EMERGING APPLICATIONS ON DATA PLANE AND OPTIMIZATION POSSIBILITIES
STUDY ON EMERGING APPLICATIONS ON DATA PLANE AND OPTIMIZATION POSSIBILITIESijdpsjournal
 
CloudMC: A cloud computing map-reduce implementation for radiotherapy. RUBEN ...
CloudMC: A cloud computing map-reduce implementation for radiotherapy. RUBEN ...CloudMC: A cloud computing map-reduce implementation for radiotherapy. RUBEN ...
CloudMC: A cloud computing map-reduce implementation for radiotherapy. RUBEN ...Big Data Spain
 
MapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvementMapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvementKyong-Ha Lee
 
Database Research on Modern Computing Architecture
HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple Twig Pattern Queries

Categories and Subject Descriptors: H.2.4 [Database Management]: Systems—query processing; D.1.3 [Software]: Programming Techniques—concurrent programming

General Terms: Algorithms, Experimentation, Performance

Keywords: XML, parallel processing, query optimization, MapReduce

1. INTRODUCTION

XML is one of the most prominent data formats, and a great deal of data has been produced in, or transformed into, this format. In particular, scientific data and log messages are often kept as XML. Such XML data are large and growing very quickly. For example, UniProtKB, which provides a collection of functional information on proteins, now exceeds 108 GB in a single file [2]. Moreover, new elements and attributes are continuously appended to existing XML files as they are generated over time. In a typical scenario, users prepare their queries in advance, even before the XML data are completely produced. This setting is akin to that of XML pub/sub systems, but differs in that the data is sometimes stored in a single huge XML file that keeps growing.

The rest of this proposal is organized as follows. Section 2 describes our system architecture, Section 3 explains the features of HadoopXML, Section 4 presents implementation details, and Section 5 describes our demonstration scenarios.

2. SYSTEM ARCHITECTURE

HadoopXML processes XML data in three steps: a preprocessing step followed by two consecutive M/R jobs. In the preprocessing step (shown in Fig. 1), the XML data are partitioned into equal-sized blocks and loaded into HDFS. Elements are also labeled for use in twig pattern joins, and the labels are written into label blocks kept separate from the XML blocks. In this stage, HadoopXML additionally decomposes the given set of queries into linear path patterns, and then builds an NFA-style query index together with a table that records which decomposed path patterns belong to which query.

In the 1st M/R job, the query index is first loaded into each mapper via Hadoop's distributed cache mechanism. Mappers then read the XML blocks as SAX streams and filter out only the labels of elements that match the decomposed path patterns. Reducers group the labels by path id and count the number of labels for each path. The path solutions and their size information are stored in HDFS.
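To make the division of labor in the 1st M/R job concrete, the sketch below shows a stripped-down mapper and reducer pair in Java. It is only an illustration, not HadoopXML's actual code: it assumes every input record has already been flattened into a "root-to-leaf path <TAB> label" line (the real system drives an NFA-style index over SAX events), and the path patterns with their ids are passed through a hypothetical configuration property, hadoopxml.paths, rather than through the distributed cache.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PathFilteringJob {

  // Map side: keep only labels whose root-to-leaf path matches a queried path pattern.
  public static class PathFilterMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Map<String, String> pathIds = new HashMap<>();   // path pattern -> path id

    @Override
    protected void setup(Context context) {
      // Hypothetical property, e.g. "P1=/a/b/c,P2=/a/b/d".
      String spec = context.getConfiguration().get("hadoopxml.paths", "");
      for (String entry : spec.split(",")) {
        String[] kv = entry.split("=", 2);
        if (kv.length == 2) pathIds.put(kv[1], kv[0]);
      }
    }

    @Override
    protected void map(LongWritable offset, Text record, Context context)
        throws IOException, InterruptedException {
      String[] fields = record.toString().split("\t", 2);          // path \t (start, end, level)
      if (fields.length < 2) return;
      String pathId = pathIds.get(fields[0]);
      if (pathId != null) context.write(new Text(pathId), new Text(fields[1]));
    }
  }

  // Reduce side: group labels by path id, count them, and emit the path solution.
  public static class PathSolutionReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text pathId, Iterable<Text> labels, Context context)
        throws IOException, InterruptedException {
      StringBuilder solution = new StringBuilder();
      long count = 0;
      for (Text label : labels) {
        if (count > 0) solution.append(' ');
        solution.append(label);
        count++;
      }
      // The per-path count is the size information that the multi-query optimizer
      // later uses when it assigns twig joins to reducers for the 2nd job.
      context.write(pathId, new Text(count + "\t" + solution));
    }
  }
}

In the real pipeline the mapper additionally restores the structural state of its block (see Section 3) before filtering, so matches are not lost at block boundaries.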
Figure 1: Preprocessing step in HadoopXML

After that, our multi-query optimizer decides which reducer in the next M/R job will perform which twig pattern join, balancing the workload across nodes based on the size information of the path solutions and the query-to-path mapping table.

In the 2nd M/R job, mappers read the grouped path solutions and tag each group with the id of its designated reducer as the key. Since map outputs are shuffled by these intermediate keys, path solutions tagged with the same reducer id all arrive at the same reducer. Finally, the reducers perform the twig pattern joins and write the final answers to HDFS. Fig. 2 illustrates the data flow of the two M/R jobs in HadoopXML.

3. FEATURES OF HADOOPXML

HadoopXML offers several features for efficient XML data processing. Since fault tolerance and scalability are its primary goals, Hadoop itself is not optimized for I/O efficiency [6]. We therefore strive to improve I/O efficiency without modifying Hadoop internals.

Partitioning with no loss of structural information

Along with the labeled values, each label block records a root-to-leaf path that captures the structural context at the start of the corresponding XML block. For example, consider an XML document with four elements: <a><b><c></c><d></d></b></a>. If the file is partitioned into two blocks and the second block contains the fragment </d></b></a>, we keep the root-to-leaf path /a/b/d for the start of that block. When a map task reads the second block, the query index is first fed with the SAX events restored from the path string /a/b/d before the block itself is parsed. This guarantees that the query index in every map task starts in the correct internal state, even when it processes an XML block that lies in the "middle" of the SAX stream.
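The state restoration can be written in a few lines of SAX code. The sketch below is a simplified illustration, assuming the stored prefix path is available as a plain string; the handler stands in for the NFA-style query index, and the parsing of the block body itself is omitted.

import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

public class BlockPrefixReplay {

  // Feed synthetic startElement events for every tag on the stored root-to-leaf
  // path (e.g. "/a/b/d") so that the query index sees the same element stack it
  // would have seen had it parsed the document from its beginning.
  static void replayPrefix(String rootToLeafPath, DefaultHandler queryIndexHandler)
      throws Exception {
    Attributes noAttrs = new AttributesImpl();
    for (String tag : rootToLeafPath.split("/")) {
      if (!tag.isEmpty()) {
        queryIndexHandler.startElement("", tag, tag, noAttrs);
      }
    }
    // After the replay, the XML block itself is parsed and its SAX events are
    // forwarded to the same handler, continuing from the restored state.
  }
}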
Collocating XML blocks and label blocks

HadoopXML reads both the XML blocks and their corresponding label blocks during query processing. If the two blocks are stored on different nodes, additional network I/O is required to fetch one of them, which delays the map stage. To improve spatial locality, we extend the block placement policy of HDFS so that an XML block and its corresponding label block are placed on the same node.

Multiple twig pattern matchings in parallel

In HadoopXML, multiple join operations are distributed across nodes and executed in parallel, as many at a time as there are reducers. Each join operation is implemented with an I/O-optimal holistic twig pattern join algorithm to improve I/O efficiency [3].

Sharing input scan and path solutions

MapReduce's batch nature makes it difficult to support ad-hoc queries the way a DBMS does, and iterating the same M/R job, from input scan to reduce stage, for every query is wasteful in many cases. Moreover, many twig pattern queries share linear path patterns in practice. Sharing path solutions reduces redundant processing of path patterns and saves many I/Os [8]; in this respect we borrow the concept of path sharing from YFilter [5]. Path solutions are also shared across twig pattern joins in HadoopXML: when path solutions are joined to evaluate twig patterns, a group of join operations assigned to the same reducer share the path solutions of any path patterns that their twig patterns have in common. This reduces the overall I/O cost of the join operations.

Converting to distinct path patterns

Many path patterns may match a single root-to-leaf path in practice. For example, assume that the path patterns /a//c, //c, and /a/*/c all match the root-to-leaf path /a/b/c in an XML file. If these patterns are treated as distinct, three path solutions are produced redundantly for a single distinct path during query processing. By converting such redundant path patterns into the root-to-leaf paths that are distinct in the input XML, we reduce the sizes of the path solutions and save many I/Os. To support this feature, HadoopXML extracts the distinct root-to-leaf paths while the data is loaded in the preprocessing step.

Runtime load balancing and multi query optimization

A straggling task delays the entire job in MapReduce, and the problem becomes more severe when it happens in the reduce stage; MapReduce's native runtime scheduling does not work well for reducers in particular. HadoopXML therefore uses a dynamic shuffling scheme that balances workloads across reducers at runtime. To achieve this, HadoopXML estimates the cost of each twig join operation before the actual join: because the worst-case I/O and CPU complexity of the TwigStack algorithm is linear in the total size of its input path solutions, the cost of a join is computed from the sizes of its path solutions, which are counted in the 1st reduce stage. The estimate also accounts for path solutions shared by multiple twig pattern queries. Join operations are then assigned to reducers at runtime so that every reducer carries roughly the same overall join cost.
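A simple way to realize such an assignment is a greedy longest-processing-time heuristic: consider the joins in decreasing order of estimated cost and always hand the next join to the reducer with the smallest load so far. The sketch below illustrates this idea only; it is not necessarily the exact policy implemented by HadoopXML's multi-query optimizer.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class JoinAssignment {

  // A twig join whose cost is estimated as the total size of its input path solutions.
  record TwigJoin(String queryId, long cost) {}

  // Returns, for each reducer, the list of query ids assigned to it.
  static List<List<String>> assign(List<TwigJoin> joins, int numReducers) {
    List<List<String>> buckets = new ArrayList<>();
    for (int i = 0; i < numReducers; i++) buckets.add(new ArrayList<>());

    long[] load = new long[numReducers];
    // Min-heap of reducer ids, ordered by the load assigned so far.
    PriorityQueue<Integer> leastLoaded =
        new PriorityQueue<>(Comparator.comparingLong(r -> load[r]));
    for (int r = 0; r < numReducers; r++) leastLoaded.add(r);

    // Largest joins first; each goes to the reducer with the smallest current load.
    joins.stream()
        .sorted(Comparator.comparingLong(TwigJoin::cost).reversed())
        .forEach(j -> {
          int r = leastLoaded.poll();
          buckets.get(r).add(j.queryId());
          load[r] += j.cost();
          leastLoaded.add(r);   // re-insert the reducer with its updated load
        });
    return buckets;
  }
}

The costs fed to such a routine come directly from the size information gathered in the 1st reduce stage, so no extra pass over the data is needed before the 2nd job is balanced.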
Figure 2: (a) Path query processing in the 1st M/R job; (b) twig pattern joins in the 2nd M/R job

Figure 3: Experimental results: (a) data loading time, (b) execution time for the synthetic datasets, (c) execution time for the real-world datasets, (d) effect of block collocation, (e) effect of converting paths to distinct paths, (f) effect of multi-query optimization

4. IMPLEMENTATION

We implemented HadoopXML on Hadoop version 0.21.0. Our cluster consists of 9 nodes running CentOS 6.2. The master is equipped with an AMD Athlon II X4 620 processor, 8 GB of memory, and a 7200 RPM HDD. The other nodes are designated as slaves, each with an Intel i5-2500K processor, 8 GB of memory, and a 7200 RPM HDD. All nodes are connected via a Gigabit switching hub, and we use the default Hadoop settings on our cluster for fair comparison. The region numbering scheme [7] is used for labeling XML, but modified to support big XML files: since the end values of that scheme are generated in postorder, labels would otherwise have to be kept in memory until the matching endElement() event, which causes memory problems for such big files. Our scheme instead reads an XML block and promptly appends its labels to the corresponding label block; after data loading, HadoopXML sorts the labels by their start values, i.e., in preorder. For path filtering, we use the NFA-style query index of YFilter [5]. We use the TwigStack algorithm [3] to implement the holistic twig pattern joins in the 2nd M/R job, although other holistic join techniques could be used without loss of generality. Finally, we extend the DataPlacementPolicy class of HDFS to collocate XML blocks with their corresponding label blocks.
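For reference, the sketch below shows a compact SAX-based version of the classic region numbering scheme, in which every element receives a start value, an end value generated in postorder, and its level. It does not show the streaming modification described above, i.e. appending labels block by block instead of holding them in memory until endElement().

import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class RegionNumbering {

  // Prints one "(start, end, level) name" label per element of the given XML.
  public static void main(String[] args) throws Exception {
    String xml = "<a><b><c/><d/></b></a>";

    DefaultHandler labeler = new DefaultHandler() {
      long counter = 0;
      final Deque<long[]> open = new ArrayDeque<>();    // {start, level} of open elements
      final Deque<String> names = new ArrayDeque<>();

      @Override
      public void startElement(String uri, String local, String qName, Attributes atts) {
        open.push(new long[] { ++counter, open.size() + 1 });
        names.push(qName);
      }

      @Override
      public void endElement(String uri, String local, String qName) {
        long[] label = open.pop();
        long end = ++counter;                           // end values follow postorder
        System.out.printf("(%d, %d, %d) %s%n", label[0], end, label[1], names.pop());
      }
    };

    SAXParserFactory.newInstance().newSAXParser()
        .parse(new InputSource(new StringReader(xml)), labeler);
  }
}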
5. DEMONSTRATION SCENARIO

Table 1 presents statistics for the XML datasets used in our experiments. The demonstration will use only a small fraction of one synthetic and one real-world dataset, due to the limited demonstration time and the nature of MPP (massively parallel processing) applications. However, we still present the experimental results obtained with all of the datasets in Fig. 3. Currently, HadoopXML supports a subset of the XPath 1.0 language, i.e. {/, //, *, @, []}.

Table 1: Statistics of the XML datasets

Filename           UniRef100      UniParc        UniProtKB      XMark1000
File size (KB)     25,088,663     38,334,953     108,283,066    117,159,962
# of elements      335,153,446    360,376,852    2,110,330,358  1,670,594,672
# of attributes    589,568,839    1,215,063,103  383,127,024    2,783,354,175
Avg. depth         4.5649         3.7753         4.3326         4.7375
Max depth          6              5              7              12
# distinct paths   30             24             149            548

In our demonstration, users will be given a list of sample XPath queries generated from the DTDs of the datasets in Table 1, and they can also edit the queries to their taste. Users are then able to load sample XML files into HadoopXML and run their queries themselves. During processing, we explain step by step, using the Hadoop GUI, how the system handles a massive volume of XML data. Users can also observe how the features of HadoopXML affect overall performance, since each feature, e.g. block collocation, shared input scans and path solutions, and runtime load balancing, can be turned on and off.

Acknowledgments

We thank Jiaheng Lu for providing us with a Java version of the twig join algorithms. This work was partly supported by an NRF grant funded by the Korea government (MEST) (No. 2011-0016282).

6. REFERENCES

[1] Hadoop. http://hadoop.apache.org, Apache Software Foundation.
[2] A. Bairoch et al. The universal protein resource (UniProt). Nucleic Acids Research, 33(suppl 1):D154–D159, 2005.
[3] N. Bruno et al. Holistic twig joins: optimal XML pattern matching. In Proceedings of ACM SIGMOD, pages 310–321, 2002.
[4] J. Dean et al. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
[5] Y. Diao et al. Path sharing and predicate evaluation for high-performance XML filtering. ACM Transactions on Database Systems, 28(4):467–516, 2003.
[6] K. Lee et al. Parallel data processing with MapReduce: a survey. ACM SIGMOD Record, 40(4):11–20, 2012.
[7] Q. Li et al. Indexing and querying XML data for regular path expressions. In Proceedings of VLDB, pages 361–370, 2001.
[8] T. Nykiel et al. MRShare: sharing across multiple queries in MapReduce. Proceedings of the VLDB Endowment, 3(1–2):494–505, 2010.