A Study of I/O and Virtualization Performance with a Search Engine based on an XML database and Lucene
Ed Bueché, EMC
edward.bueche@emc.com, May 25, 2011
Agenda

§    My Background
§    Documentum xPlore Context and History
§    Overview of Documentum xPlore
§    Tips and Observations on IO and Host
      Virtualization




My Background
§    Ed Bueché
§    Information Intelligence Group within EMC
§    EMC Distinguished Engineer & xPlore Architect
§    Areas of expertise
      •  Content Management (especially performance &
         scalability)
      •  Database (SQL and XML) and Full text search
      •  Previous experience: Sybase and Bell Labs
§  Part of the EMC Documentum xPlore
    development team
      •  Pleasanton (CA), Grenoble (France), Shanghai,
         and Rotterdam (Netherlands)
Documentum search 101
•  Documentum Content Server provides an object/relational
   data model and query language
   —  Object metadata called attributes (sample: title, subject,
      author)
   —  Sub-types can be created with customer defined attributes
   —  Documentum Query Language (DQL)
   —  Example:
      SELECT object_name FROM foo
      WHERE subject = 'bar' AND customer_id = 'ID1234'
•  DQL also supports full text extensions
   —  Example:
      SELECT object_name FROM foo
      SEARCH DOCUMENT CONTAINS 'hello world'
      WHERE subject = 'bar' AND customer_id = 'ID1234'
Introducing Documentum xPlore

§  Provides Integrated Search for Documentum
   •  but is built as a standalone search engine to replace FAST Instream
§  Built over EMC xDB, Lucene, and leading content extraction and
    linguistic analysis software
Documentum Search
History-at-a-glance

§  Almost 15 years of integrated structured/unstructured search

Verity Integration, 1996 – 2005
•  Basic full text search through DQL
•  Basic attribute search
•  1 day → 1 hour latency
•  Embedded implementation

FAST Integration, 2005 – 2011
•  Combined structured / unstructured search
•  2 – 5 min latency
•  Score ordered results

xPlore Integration, 2010 – ???
•  Replaces FAST in DCTM
•  Integrated security
•  Deep facet computation
•  HA/DR improvements
•  Latency: typically seconds
•  Improved Administration
•  Virtualization Support
Enhancing Documentum Deployments
with Search

[Diagram, shown in two builds; labels: search, Documentum (DCTM) client, DQL, Content Server, SQL, RDBMS, xQuery, Metadata + content]
Some Basic Design Concepts
   behind Documentum xPlore
§  Inverted Indexes are not optimized for all use-
    cases
   •  B+-tree indexes can be far more efficient for
      simple, low-latency/highly dynamic scenarios
§  De-normalization can't efficiently solve all
    problems
   •  Update propagation problem can be deadly
   •  Joins are a necessary part of most applications
§  Applications need fine control over not only
    search criteria, but also result sets

Design concepts (cont'd)
§  Applications need fluid, changing metadata
    schemas that can be efficiently queried
  •  Adding metadata through joins with side-tables
     can be inefficient to query
§  Users want the power of Information Retrieval
    on their structured queries
§  Data Management, HA, DR shouldn't be an
    after-thought
§  When possible, operate within standards
§  Lucene is not a database. Most Lucene
    applications deploy with databases.
Lessons Learned…
Indexes, DB, and IR

[Diagram, built up over several slides. Vertical axis: fit to use-case. Horizontal axis: structured query use-cases to unstructured query use-cases. Labels: relational DB technology; full text index technology; transactions; advanced data management (partitions); JOINs; metadata query; full text searches; hierarchical data representations (XML); constantly changing schemas; scoring, relevance, entities.]
Documentum xPlore

•  Bring best-of-breed XML Database with powerful Apache Lucene Fulltext Engine
•  Provides structured and unstructured search leveraging XML and XQuery standards
•  Designed with Enterprise readiness, scalability and ingestion
•  Advanced Data Management functionality necessary for large scale systems
•  Industry leading linguistic technology and comprehensive format filters
•  Metrics and Analytics

[Architecture diagram; labels: xPlore API; Indexing Services; Search Services; Content Processing Services; Node & Data Management Services; Analytics; Admin Services; xDB API; xDB Query Processing & Optimization; xDB Transaction, Index & Page Management]
EMC xDB: Native XML database
§  Formerly XHive database
   •  100% java
   •  XML stored in persistent DOM format
      §  Each XML node can be located through a 64 bit identifier
      §  Structure mapped to pages
      §  Easy to operate on GB XML files
   •  Full Transactional Database
   •  Query Language: XQuery with full text extensions
§  Indexing & Optimization
   •  Palette of index options optimizer can pick from
   •  At its simplest: indexLookup(key) → node id

Libraries / Collections & Indexes

[Diagram; legend: shaded box = xDB segment]
Lucene Integration
§  Transactional
   •  Non-committed index updates in separate
      (typically in memory) lucene indexes
   •  Recently committed (but dirty) indexes backed by
      xDB log
   •  Query to index leverages Lucene multi-searcher
      with filter to apply update/delete blacklisting
§  Lucene indexes managed to fit into xDB's
    ARIES-based recovery mechanism
§  No changes to Lucene
   •  Goal: no obstacles to be as current as possible
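
The transactional scheme above can be pictured with a small model. This is only an illustrative sketch in Python, not xPlore or Lucene code: the names `SubIndex`, `search_all`, and `blacklist` are hypothetical, and real Lucene indexes are far richer than these dictionaries. It shows a query fanned out over a committed index and a separate in-memory index of uncommitted work, with a filter that hides superseded or deleted documents.

```python
from dataclasses import dataclass, field


@dataclass
class SubIndex:
    """Toy inverted index: term -> set of document ids (hypothetical model)."""
    postings: dict = field(default_factory=dict)

    def add(self, doc_id, text):
        for term in text.lower().split():
            self.postings.setdefault(term, set()).add(doc_id)

    def search(self, term):
        return set(self.postings.get(term.lower(), set()))


def search_all(term, sub_indexes, blacklist):
    """Search every sub-index, then drop blacklisted (index position, doc id) pairs."""
    hits = set()
    for position, index in enumerate(sub_indexes):
        for doc_id in index.search(term):
            if (position, doc_id) not in blacklist:
                hits.add(doc_id)
    return hits


committed = SubIndex()
committed.add(1, "hello world")
committed.add(2, "hello lucene")

# Uncommitted work lives in a separate in-memory index.
uncommitted = SubIndex()
uncommitted.add(2, "goodbye lucene")  # new version of document 2
uncommitted.add(3, "hello xdb")       # newly inserted document

# The stale copy of document 2 in the committed index (position 0) is hidden.
blacklist = {(0, 2)}

result = search_all("hello", [committed, uncommitted], blacklist)
print(sorted(result))
```

Keeping the blacklist outside the indexes is what lets the underlying index files stay untouched until commit, which fits the "no changes to Lucene" goal.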

Lucene Integration (cont'd)
§  Both value and full text queries supported
   •  XML elements mapped to lucene fields
   •  Tokenized and value-based fields available
§  Composite key queries supported
   •  Lucene much more flexible than traditional B-tree
      composite indexes
§  ACL and Facet information stored in Lucene
    field array
   •  Documentum's ACL security model is
      highly complex and potentially dynamic
   •  Enables secure facet computation
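
As a rough illustration of why keeping ACL and facet values together helps, the sketch below counts facet values only over hits the user is allowed to read. It is a simplified model, not the xPlore implementation: the hit layout, the group-based ACL check, and the function name `secure_facets` are all assumptions made for the example, and the real Documentum security model is considerably more involved.

```python
from collections import Counter


def secure_facets(hits, user_groups, facet_name):
    """Count facet values over only the hits the user may read."""
    counts = Counter()
    for hit in hits:
        # Simplified rule: a hit is visible if the user shares a group with its ACL.
        if user_groups & hit["acl"]:
            counts[hit["facets"][facet_name]] += 1
    return counts


# Hypothetical hits, each carrying its ACL and facet values side by side.
hits = [
    {"id": 1, "acl": {"sales", "legal"}, "facets": {"author": "ann"}},
    {"id": 2, "acl": {"legal"}, "facets": {"author": "bob"}},
    {"id": 3, "acl": {"sales"}, "facets": {"author": "ann"}},
]

counts = secure_facets(hits, {"sales"}, "author")
print(dict(counts))
```

Because the security check happens before counting, a facet count never reveals the existence of documents the user cannot see.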

xPlore has Lucene search engine
 capabilities plus…
✓  XQuery provides powerful query & data
    manipulation language
  •  A typical search engine can't even express a join
  •  Creation of arbitrary structure for result set
  •  Ability to call language-based functions or
     Java-based methods
✓  Ability to use B-tree based indexes when needed
  •  xDB optimizer decides this
✓  Transactional update and recovery of data/index
✓  Hierarchical data modeling capability
Tips and Observations on
     IO and Host Virtualization
§  Virtualization offers huge savings for companies
    through consolidation and automation
§  Both Disk and Host virtualization available
§  However, there are pitfalls to avoid
   •  One-size-fits-all
   •  Consolidation contention
   •  Availability of resources




Tip #1: Don't assume that
 one size fits all
§  Most IT shops will create VM or SAN
    templates that have a fixed resource
    consumption
  •  Reduces admin costs
  •  Example: Two CPU VM with 2 GB of memory
  •  Deviations from this must be made in a special
     request
§  Recommendations:
  •  Size correctly, don't accept insufficient resources
  •  Test pre-production environments
Same concept applies for disk
virtualization
[Diagram: three 50 GB volumes with I/O capacities of 100, 200, and 400 I/Os per sec]
§  The capacity of disks is typically expressed in terms of
    two metrics: space and I/O capacity
    •  Space defined in terms of GBytes
    •  I/O capacity defined in terms of I/Os per sec
§  NAS and SAN are forms of disk virtualization
    •  The space associated with a
       SAN volume (for example)
       could be striped over multiple
       disks
    •  The more disks allocated, the
       higher the I/O capacity
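
The relationship between striping and I/O capacity can be sketched with a little arithmetic. This is a back-of-the-envelope model only: it assumes perfectly linear scaling and a hypothetical 100 I/Os per sec per drive, chosen to match the 100 / 200 / 400 figures on the slide. Real arrays add caching, RAID overhead, and contention.

```python
def volume_iops(disks, iops_per_disk=100):
    """Estimated I/O capacity of a volume striped evenly over `disks` drives.

    Assumes linear scaling and a hypothetical per-drive rate.
    """
    if disks < 1:
        raise ValueError("a volume needs at least one disk")
    return disks * iops_per_disk


# The same 50 GB of space can carry very different I/O capacity.
for disks in (1, 2, 4):
    print(f"50 GB over {disks} disk(s): ~{volume_iops(disks)} I/Os per sec")
```

The point of the model is that space and I/O capacity are independent: asking for "50 GB" says nothing about how many I/Os per sec the volume can sustain.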
Linear mappings and LUNs

[Diagram: a logical volume with linear mapping over four LUNs; part is allocated for the index and the rest is free space in the volume]

§  When mapped directly to physical disks then this could concentrate I/O to fewer than a desired set of drives.
§  High-end SANs like Symmetrix can handle this situation with virtual LUNs
EMC Symmetrix:
Nondisruptive Mobility
Virtual LUN VP Mobility

§  Fast, efficient mobility
§  Maintains replication and quality of service during relocations
§  Supports up to thousands of concurrent VP LUN migrations
§  Recommendation: work with storage technicians to ensure backend storage has sufficient I/O

[Diagram: a VLUN moving between virtual pools; labels: Flash 400 GB RAID 5; Fibre Channel 600 GB 15K RAID 1; SATA 2 TB RAID 6; Tier 2]
Tip #2: Consolidation Contention
 §  Virtualization provides benefit from consolidation
 §  Consolidation gives resources to whichever workloads are active
    •  Your resources can be consumed by other VMs,
       other apps
    •  Physical resources can be over-stretched
 §  Recommendations:
    •  Track actual capacity vs. planned
       §  VMware: track number of times your VM is denied CPU
       §  SANs: track % I/O utilization vs. number of I/Os
    •  For VMware, leverage guaranteed minimum
       resource allocations and/or allocate to
       non-overloaded HW
Some VMware statistics
§  Ready metric
   •  Generated by vCenter and represents the
      number of cycles (across all CPUs) in which the VM
      was denied CPU
   •  Reported in milliseconds; real-time
      samples are taken at best every 20 secs
   •  For interactive apps: more than 10% of offered
      capacity is considered worrisome
§  Pages-in, Pages-out
   •  Can indicate over subscription of memory
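
To relate a raw Ready sample (milliseconds of denial in a 20 sec interval) to the 10% guideline, the conversion can be sketched as below. This is plain arithmetic, not a VMware API: the function names are hypothetical, and dividing by the number of virtual CPUs is an assumption made here to express the sample as a share of offered capacity.

```python
def ready_percent(ready_ms, interval_s=20, vcpus=1):
    """Convert a Ready sample (ms of CPU denial) to a % of offered capacity."""
    if interval_s <= 0 or vcpus < 1:
        raise ValueError("interval_s must be positive and vcpus at least 1")
    offered_ms = interval_s * 1000 * vcpus
    return 100.0 * ready_ms / offered_ms


def is_worrisome(ready_ms, interval_s=20, vcpus=1, threshold=10.0):
    """Apply the slide's guideline: above 10% is a concern for interactive apps."""
    return ready_percent(ready_ms, interval_s, vcpus) > threshold


for sample_ms in (500, 2000, 2500):
    pct = ready_percent(sample_ms)
    print(f"{sample_ms} ms in 20 s -> {pct:.1f}% ready, worrisome={is_worrisome(sample_ms)}")
```

With one virtual CPU, a sample has to exceed 2,000 ms per 20 sec interval to cross the 10% line.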



Sample %Ready for a production VM with xPlore
        deployment for an entire week

[Chart: %Ready over one week, y-axis from 0% to 16%. Annotations: "In this case avg resp time doubled and max resp time grew by 5x"; "official area that indicates pain".]
Actual Ready samples during
             several hour period

[Chart: Ready samples (# of millisecs VM denied CPU in 20 sec intervals), y-axis from 0 to 2500.]
Some Subtleties with
Interactive CPU denial

§  The Ready metric represents denial upon
    demand
  •  Interactive workloads can be bursty
  •  If no demand, then Ready counter will be low
§  Poor user response encourages less usage
  •  Like walking on a broken leg
  •  Causing fewer Ready samples

[Diagram: a denial spike within a 20 sec interval]
Sharing I/O capacity
   §  If multiple VMs (or servers) are sharing the
       same underlying physical volumes and the
       capacity is not managed properly
         •  then the available I/O capacity of the volume could
            be less than the theoretical capacity
   §  This can be seen if the OS tools show that the
       disk is very busy (high utilization) while the
       number of I/Os is lower than expected
[Diagram: a volume for another application and a volume for the Lucene application, both spread over the same set of drives and effectively sharing the I/O capacity]
Recommendations on diagnosing
disk I/O related issues
§  On Linux/UNIX
  •  Have IT group install SAR and IOSTAT
     §  Also install a disk I/O testing tool (like Bonnie¹)
  •  Compare Bonnie output with SAR & IOSTAT
     data
     §  High disk Utilization at much lower achieved rates could
         indicate contention from other applications
  •  Also, High SAR I/O wait time might be an
     indication of slow disks
§  On Windows
  •  Leverage the Windows Performance Monitor
  •  Objects: Processor, Physical Disk, Memory
Sample output from the Bonnie tool

bonnie -s 1024 -y -u -o_direct -v 10 -p 10
This will increase the size of the file to 2 Gb.
Examine the output. Focus on the random I/O area:
              ---Sequential Output (sync)----- ---Sequential Input-- --Rnd Seek-
              -CharUnlk- -DIOBlock- -DRewrite- -CharUnlk- -DIOBlock- --04k (10)-
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU   /sec %CPU
Mach2 10*2024 73928 97 104142 5.3 26246 2.9 8872 22.5 43794 1.9 735.7 15.2




  -s 1024 means that 2 GB files will be created
  -o_direct means that direct I/O (by-passing buffer cache) will be done
  -v 10 means that 10 different 2GB files will be created.
  -p 10 means that 10 different threads will query those files

This output means that the random read test saw 735 random I/Os per sec at 15% CPU busy.

 ¹ Bonnie is an open source disk I/O driver tool for Linux that can be useful for
 pretesting Linux disk environments prior to an xPlore/Lucene install.
Linux indicators compared to bonnie output

Notice that at 200+ I/Os per sec the underlying volume is 80% busy. Although there could be multiple causes,
one could be that some other VM is consuming the remaining I/O capacity (735 – 209 = 500+).

iostat output:
Device:                  tps      kB_read/s          kB_wrtn/s      kB_read         kB_wrtn
sde                   206.10        2402.40               0.80        24024               8

sar -d output:
09:29:17      DEV        tps   rd_sec/s   wr_sec/s    avgrq-sz   avgqu-sz       await      svctm     %util
09:29:27    dev8-65   209.24    4877.97       1.62       23.32       1.62        7.75       3.80     79.59

sar -u output:
09:29:17 PM             CPU       %user        %nice      %system     %iowait           %steal      %idle
09:29:27 PM             all       41.37         0.00         5.56           29.86         0.00      23.21
09:29:27 PM               0       62.44         0.00        10.56           25.38         0.00       1.62
09:29:27 PM               1       30.90         0.00         4.26           35.56         0.00      29.28
09:29:27 PM               2       36.35         0.00         3.96           30.76         0.00      28.93
09:29:27 PM               3       35.77         0.00         3.46           27.64         0.00      33.13

Note the high I/O wait in the %iowait column.
      See https://community.emc.com/docs/DOC-9179
      for additional example
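
The comparison between the benchmark and the live system can be written out as a short calculation. The sketch below uses only the numbers shown on these slides (735.7 random I/Os per sec from bonnie, 209.24 tps and 79.59 %util from sar -d). The function name is hypothetical, and the projection to 100% busy assumes throughput scales linearly with utilization, which is a simplification.

```python
def io_headroom(benchmark_iops, observed_iops, util_percent):
    """Compare observed I/O against a benchmark of the same volume.

    Returns (shortfall, projected): the I/Os per sec missing relative to the
    benchmark, and the rate expected if the device reached 100% busy.
    """
    if not 0 < util_percent <= 100:
        raise ValueError("util_percent must be in (0, 100]")
    shortfall = benchmark_iops - observed_iops
    projected = observed_iops / (util_percent / 100.0)
    return shortfall, projected


shortfall, projected = io_headroom(735.7, 209.24, 79.59)
print(f"Missing vs. benchmark: {shortfall:.0f} I/Os per sec")
print(f"Projected at 100% busy: {projected:.0f} I/Os per sec")
```

When the projected rate at full utilization is far below the benchmark, as it is here, contention from other consumers of the same drives is one plausible explanation.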
Tip #3: Try to ensure availability
of resources
§  Similar to the previous issue, but
   •  Resource displacement is not caused by overload
   •  Inactivity can cause Lucene resources to be displaced
   •  Not different from running on a large shared native OS host
§  Recommendation:
   •  Periodic warmup
       §  non-intrusive
   •  See next example
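
A periodic warmup of the kind recommended above could be as simple as reading the relevant bytes sequentially so that they land in the OS buffer cache. The sketch below is an assumption about how one might do this, not an xPlore feature: `warm_range` is a hypothetical helper, and for a compound index file the caller would have to supply the offset and length of the part to warm.

```python
import sys


def warm_range(path, offset=0, length=None, chunk_bytes=1024 * 1024):
    """Sequentially read part of a file so its pages enter the OS buffer cache.

    Returns the number of bytes read. With length=None, reads to end of file.
    """
    read_total = 0
    with open(path, "rb") as handle:
        handle.seek(offset)
        while length is None or read_total < length:
            if length is None:
                want = chunk_bytes
            else:
                want = min(chunk_bytes, length - read_total)
            chunk = handle.read(want)
            if not chunk:
                break
            read_total += len(chunk)
    return read_total


if __name__ == "__main__":
    # Usage (hypothetical): python warm.py <file> [<file> ...]
    for index_file in sys.argv[1:]:
        print(index_file, warm_range(index_file), "bytes read")
```

Run from a scheduler at a modest interval, this stays non-intrusive because it issues a small number of large sequential reads rather than many random ones.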
IO / caching test use-case
§  Unselective Term search
   •  100 sample queries
   •  Avg( hits per term) = 4,300+, max ~ 60,000
   •  Searching over 100s of DCTM object attributes + content
§  Medium result window
   •  Avg( results returned per query) = 350 (max: 800)
§  Stored Fields Utilized
   •  Some security & facet info
§  Goal:
   •  Pre-cache portions of the index to improve response time in
      scenarios such as reboot, buffer cache contention, & vm memory contention
Some xPlore Structures for Search¹

[Diagram of the structures involved:]
•  Dictionary of terms
•  Posting list (doc-ids for term)
•  Stored fields (facets and node-ids), from 1st doc to N-th doc
•  Facet decompression map
•  Security indexes (b-tree based)
•  xDB XML store (contains text for summary)

¹ Frequency and position structures ignored for simplicity
IO model for search in xPlore

[Diagram: a search for "term1 term2" produces a result set by way of these structures:]
•  Dictionary
•  Posting list (doc-ids for term)
•  Stored fields: xDB node-id plus facet / security info
•  Facet decompression map
•  Security lookup (b-tree based)
•  xDB XML store (contains text for summary)
Separation of covering values in
stored fields and summary

[Diagram; labels: potentially thousands of hits; potentially thousands of results; stored fields (random access); security lookup; facet calc; small structure holding the final facet calc values over thousands of results; small number for result window; xDB docs with text for summary; Res-1 through Res-350, each with its summary]
xPlore Memory Pool areas
at-a-glance

[Diagram of memory areas:]
•  Other vm working mem
•  xPlore caches
•  Lucene caches & working memory
•  xDB Buffer Cache
•  Native code content extraction & linguistic processing memory
•  Operating System File Buffer cache (dynamically sized)
•  xPlore Instance (fixed size) memory
Lucene data resides primarily in
OS buffer cache

[Diagram: the Lucene structures (dictionary of terms, posting lists, stored fields) and the xDB XML store shown alongside the memory pool areas]
§  Potential for many things to sweep Lucene from that cache
Test Env
§    32 GB memory
§    Direct attached storage (no SAN)
§    1.4 million documents
§    Lucene index size = 10 GB
§    Size of internal parts of Lucene CFS file
      •    Stored fields (fdt, fdx): 230 MB (2% of index)
      •    Term Dictionary (tis,tii): 537 MB (5% of index)
      •    Positions (prx): 8.78 GB (80% of index)
      •    Frequencies (frq) : 1.4 GB (13 % of index)
§  Text in xDB stored compressed separately

Some results of the query suite

Test                    Avg resp to consume    MB pre-cached   I/O per result   Total MB loaded into memory
                        all results (sec)                                       (cached + test)
Nothing cached          1.89                        0          0.89                 77
Stored fields cached    0.95                      241          0.38                272
Term dict cached        1.73                      537          0.79                604
Positions cached        1.58                    8,789          0.74              8,800
Frequencies cached      1.65                    1,406          0.63              1,436
Entire index cached     0.59                   10,970          < 0.05           10,970

•  Linux buffer cache cleared completely before each run
•  Resp as seen by final user in Documentum
•  Facets not computed in this example. Just a result set returned. With Facets
   response time difference more pronounced.
•  Mileage will vary depending on a series of factors that include query complexity,
   compositions of the index, and number of results consumed
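
The trade-off in the results table can be made explicit with a few ratios. The sketch below uses only the figures from the table; the names `RESULTS` and `summarize` are just for this example, and "seconds saved per GB cached" is a derived figure of merit introduced here, not something reported in the test.

```python
# Figures copied from the results table: (avg response sec, MB pre-cached).
RESULTS = {
    "Nothing cached": (1.89, 0),
    "Stored fields cached": (0.95, 241),
    "Term dict cached": (1.73, 537),
    "Positions cached": (1.58, 8789),
    "Frequencies cached": (1.65, 1406),
    "Entire index cached": (0.59, 10970),
}

BASELINE_SEC = RESULTS["Nothing cached"][0]
FULL_CACHE_SEC, FULL_CACHE_MB = RESULTS["Entire index cached"]


def summarize(name):
    """Return (% slower than fully cached, % of index cached, sec saved per GB)."""
    seconds, cached_mb = RESULTS[name]
    slower_pct = 100.0 * (seconds / FULL_CACHE_SEC - 1.0)
    index_pct = 100.0 * cached_mb / FULL_CACHE_MB
    if cached_mb:
        saved_per_gb = (BASELINE_SEC - seconds) / (cached_mb / 1024.0)
    else:
        saved_per_gb = 0.0
    return slower_pct, index_pct, saved_per_gb


for name in RESULTS:
    slower, share, saved = summarize(name)
    print(f"{name:22s} {slower:6.0f}% slower  {share:5.1f}% of index  {saved:5.2f} s saved/GB")
```

Ranking the rows by seconds saved per GB shows why the stored fields are the natural first candidate for pre-caching in this test.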
Other Notes
§  Caching 2% of index yields a response time
    that is only 60% greater than if the entire index
    were cached.
   •  Caching cost only 9 secs on a mirrored drive pair
   •  Caching cost 6800 large sequential I/Os vs.
      potentially 58,000 random I/Os
§  Mileage will vary, factors include
   •  Phrase search
   •  Wildcard search
   •  Multi-term search
§  SANs can grow I/O capacity as search
    complexity increases
Contact
§  Ed Bueché
  •  edward.bueche@emc.com
  •  http://community.emc.com/people/Ed_Bueche/blog
  •  http://community.emc.com/docs/DOC-8945




                                                  46

More Related Content

What's hot

Innovate Analytics with Oracle Data Mining & Oracle R
Innovate Analytics with Oracle Data Mining & Oracle RInnovate Analytics with Oracle Data Mining & Oracle R
Innovate Analytics with Oracle Data Mining & Oracle RCapgemini
 
SQLBits X SQL Server 2012 Beyond Relational
SQLBits X SQL Server 2012 Beyond RelationalSQLBits X SQL Server 2012 Beyond Relational
SQLBits X SQL Server 2012 Beyond RelationalMichael Rys
 
SharePoint Performance - Tales from the Field
SharePoint Performance - Tales from the FieldSharePoint Performance - Tales from the Field
SharePoint Performance - Tales from the FieldChris McNulty
 
Recently uploaded (20)

The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 

"A Study of I/O and Virtualization Performance with a Search Engine based on an XML database and Lucene"

  • 1. A Study of I/O and Virtualization Performance with a Search Engine based on an XML database and Lucene Ed Bueché, EMC edward.bueche@emc.com, May 25, 2011
  • 2. Agenda §  My Background §  Documentum xPlore Context and History §  Overview of Documentum xPlore §  Tips and Observations on I/O and Host Virtualization
  • 3. My Background §  Ed Bueché §  Information Intelligence Group within EMC §  EMC Distinguished Engineer & xPlore Architect §  Areas of expertise •  Content Management (especially performance & scalability) •  Database (SQL and XML) and Full text search •  Previous experience: Sybase and Bell Labs §  Part of the EMC Documentum xPlore development team •  Pleasanton (CA), Grenoble (France), Shanghai, and Rotterdam (Netherlands)
  • 4. Documentum search 101 •  Documentum Content Server provides an object/relational data model and query language —  Object metadata called attributes (sample: title, subject, author) —  Sub-types can be created with customer-defined attributes —  Documentum Query Language (DQL) —  Example: SELECT object_name FROM foo WHERE subject = 'bar' AND customer_id = 'ID1234' •  DQL also supports full-text extensions —  Example: SELECT object_name FROM foo SEARCH DOCUMENT CONTAINS 'hello world' WHERE subject = 'bar' AND customer_id = 'ID1234'
  • 5. Introducing Documentum xPlore §  Provides Integrated Search for Documentum •  but is built as a standalone search engine to replace FAST Instream §  Built over EMC xDB, Lucene, and leading content extraction and linguistic analysis software
  • 6. Documentum Search History-at-a-glance §  Almost 15 years of structured/unstructured integrated search
      §  Verity Integration (1996 – 2005) •  Basic full text search through DQL •  Basic attribute search •  1 day → 1 hour latency •  Embedded implementation
      §  FAST Integration (2005 – 2011) •  Combined structured / unstructured search •  2 – 5 min latency •  Score ordered results
      §  xPlore Integration (2010 – ???) •  Replaces FAST in DCTM •  Integrated security •  Deep facet computation •  HA/DR improvements •  Latency: typically seconds •  Improved Administration •  Virtualization Support
  • 7. Enhancing Documentum Deployments with Search [Diagram labels: DCTM client, DQL, Content Server, SQL, RDBMS, search]
  • 8. Enhancing Documentum Deployments with Search [Diagram labels: Documentum client, DQL, Content Server, SQL, RDBMS, search, xQuery, Metadata + content]
  • 9. Some Basic Design Concepts behind Documentum xPlore §  Inverted indexes are not optimized for all use-cases •  B+-tree indexes can be far more efficient for simple, low-latency/highly dynamic scenarios §  De-normalization can't efficiently solve all problems •  Update propagation problem can be deadly •  Joins are a necessary part of most applications §  Applications need fine control over not only search criteria, but also result sets
  • 10. Design concepts (cont'd) §  Applications need fluid, changing metadata schemas that can be efficiently queried •  Adding metadata through joins with side-tables can be inefficient to query §  Users want the power of Information Retrieval on their structured queries §  Data Management, HA, DR shouldn't be an after-thought §  When possible, operate within standards §  Lucene is not a database. Most Lucene applications deploy with databases.
  • 11. Lessons Learned… [Chart: "Fit to use-case" plotted across a range from structured query use-cases to unstructured query use-cases]
  • 12. Indexes, DB, and IR [Same chart; labels: Relational DB technology; Full text searches; Hierarchical data representations (XML); Constantly changing schemas; Scoring, Relevance, Entities]
  • 13. Indexes, DB, and IR [Same chart; labels: Full text index technology; Metadata query; JOINs; Advanced data management (partitions); Transactions]
  • 14. Indexes, DB, and IR [Same chart; labels: Relational DB technology; Full text index technology]
  • 15. Documentum xPlore •  Brings a best-of-breed XML database together with the powerful Apache Lucene full-text engine •  Provides structured and unstructured search leveraging XML and XQuery standards •  Designed with enterprise readiness, scalability and ingestion •  Advanced data management functionality necessary for large scale systems •  Industry leading linguistic technology and comprehensive format filters •  Metrics and Analytics
      [Architecture diagram labels: xPlore API; Indexing Services; Search Services; Content Processing Services; Node & Data Management Services; Analytics; Admin Services; xDB API; xDB Query Processing & Optimization; xDB Transaction, Index & Page Management]
  • 16. EMC xDB: Native XML database §  Formerly XHive database •  100% Java •  XML stored in persistent DOM format §  Each XML node can be located through a 64-bit identifier §  Structure mapped to pages §  Easy to operate on GB XML files •  Full transactional database •  Query language: XQuery with full text extensions §  Indexing & Optimization •  Palette of index options the optimizer can pick from •  At its simplest: indexLookup(key) → node id
  • 17. Libraries / Collections & Indexes = xDB segment
  • 18. Lucene Integration §  Transactional •  Non-committed index updates in separate (typically in-memory) Lucene indexes •  Recently committed (but dirty) indexes backed by xDB log •  Query to index leverages Lucene multi-searcher with filter to apply update/delete blacklisting §  Lucene indexes managed to fit into xDB's ARIES-based recovery mechanism §  No changes to Lucene •  Goal: no obstacles to be as current as possible
  • 19. Lucene Integration (cont'd) §  Both value and full text queries supported •  XML elements mapped to Lucene fields •  Tokenized and value-based fields available §  Composite key queries supported •  Lucene much more flexible than traditional B-tree composite indexes §  ACL and Facet information stored in Lucene field array •  Documentum's ACL security model is highly complex and potentially dynamic •  Enables secure facet computation
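
The transactional search path described on the two Lucene Integration slides (separate indexes for uncommitted updates, plus a blacklist filter over older indexes) can be illustrated with a toy model. This is only a sketch of the idea, not xPlore or Lucene code: `SubIndex`, `transactional_search`, and the rule that the blacklist applies only to the committed index are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SubIndex:
    """Toy stand-in for one Lucene index: maps a term to a set of doc ids."""
    postings: dict = field(default_factory=dict)

    def add(self, doc_id, text):
        for term in text.lower().split():
            self.postings.setdefault(term, set()).add(doc_id)

    def search(self, term):
        return set(self.postings.get(term.lower(), set()))


def transactional_search(term, committed, pending, blacklist):
    """Search both indexes; hide committed docs that the transaction
    has updated or deleted (the blacklist), then add pending hits."""
    visible_old = committed.search(term) - blacklist
    return sorted(visible_old | pending.search(term))


# Committed state: three documents already indexed.
committed = SubIndex()
committed.add(1, "hello world")
committed.add(2, "hello there")
committed.add(3, "hello again")

# In-flight transaction: doc 1 is rewritten, doc 2 is deleted.
pending = SubIndex()
pending.add(1, "goodbye world")
blacklist = {1, 2}  # old versions of docs 1 and 2 must not be returned

print(transactional_search("hello", committed, pending, blacklist))
print(transactional_search("world", committed, pending, blacklist))
```

The point of the design is that the committed index never has to be rewritten in place during a transaction: the old entries are simply filtered out at query time.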
  • 20. xPlore has Lucene search engine capabilities plus… ✓  XQuery provides a powerful query & data manipulation language •  A typical search engine can't even express a join •  Creation of arbitrary structure for result set •  Ability to call language-based functions or Java-based methods ✓  Ability to use B-tree based indexes when needed •  xDB optimizer decides this ✓  Transactional update and recovery of data/index ✓  Hierarchical data modeling capability
  • 21. Tips and Observations on I/O and Host Virtualization §  Virtualization offers huge savings for companies through consolidation and automation §  Both disk and host virtualization available §  However, there are pitfalls to avoid •  One-size-fits-all •  Consolidation contention •  Availability of resources
  • 22. Tip #1: Don't assume that one size fits all §  Most IT shops will create VM or SAN templates that have a fixed resource consumption •  Reduces admin costs •  Example: two-CPU VM with 2 GB of memory •  Deviations from this must be made in a special request §  Recommendations: •  Size correctly, don't accept insufficient resources •  Test pre-production environments
  • 23. Same concept applies for disk virtualization §  The capacity of disks is typically expressed in terms of two metrics: space and I/O capacity •  Space defined in terms of GBytes •  I/O capacity defined in terms of I/Os per sec §  NAS and SAN are forms of disk virtualization •  The space associated with a SAN volume (for example) could be striped over multiple disks •  The more disks allocated, the higher the I/O capacity
      [Diagram: three volumes of 50 GB each, with 100, 200, and 400 I/Os per sec capacity]
  • 24. Linear mappings and LUNs §  When mapped directly to physical disks, this could concentrate I/O on fewer than the desired set of drives. §  High-end SANs like Symmetrix can handle this situation with virtual LUNs
      [Diagram labels: four LUNs; logical volume with linear mapping; allocated for index; free space in volume]
  • 25. EMC Symmetrix: Nondisruptive Mobility (Virtual LUN VP Mobility) §  Fast, efficient mobility §  Maintains replication and quality of service during relocations §  Supports up to thousands of concurrent VP LUN migrations §  Recommendation: work with storage technicians to ensure backend storage has sufficient I/O
      [Diagram labels: Virtual Pools; VLUN; Flash 400 GB RAID 5; Fibre Channel 600 GB 15K RAID 1; SATA 2 TB RAID 6; Tier 2]
  • 26. Tip #2: Consolidation Contention §  Virtualization provides benefit from consolidation §  Consolidation provides resources to the active •  Your resources can be consumed by other VMs, other apps •  Physical resources can be over-stretched §  Recommendations: •  Track actual capacity vs. planned §  VMware: track number of times your VM is denied CPU §  SANs: track % I/O utilization vs. number of I/Os •  For VMware, leverage guaranteed minimum resource allocations and/or allocate to non-overloaded HW
  • 27. Some VMware statistics §  Ready metric •  Generated by vCenter and represents the number of cycles (across all CPUs) in which the VM was denied CPU •  Reported in milliseconds; a real-time sample happens at best every 20 secs •  For interactive apps: as a percentage of offered capacity, > 10% is considered worrisome §  Pages-in, Pages-out •  Can indicate over-subscription of memory
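
To relate the raw Ready samples (milliseconds per 20-second interval) to the 10% guideline above, the conversion can be sketched as follows. This is a minimal illustration: the function names are hypothetical, and dividing by the vCPU count assumes that "offered capacity" means interval length times the number of virtual CPUs.

```python
def ready_percent(ready_ms, interval_s=20.0, vcpus=1):
    """Ready time as a percentage of the CPU time offered in one sample.

    Assumption: offered capacity = interval length x number of vCPUs.
    """
    offered_ms = interval_s * 1000.0 * vcpus
    return 100.0 * ready_ms / offered_ms


def is_worrisome(pct, threshold=10.0):
    """Apply the slide's rule of thumb for interactive applications."""
    return pct > threshold


# Hypothetical Ready samples (ms of CPU denial per 20-second interval).
samples_ms = [120, 450, 2300, 80]
for ms in samples_ms:
    pct = ready_percent(ms)
    print(f"{ms:5d} ms -> {pct:5.2f}% ready, worrisome={is_worrisome(pct)}")
```

A sample of 2,000 ms in a 20-second interval on a single vCPU sits exactly at the 10% boundary, which helps when reading the raw-sample chart later in the deck.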
  • 28. Sample %Ready for a production VM with xPlore deployment for an entire week [Chart: %Ready over one week, y-axis 0% to 16%. Annotations: "official area that indicates pain"; "In this case avg resp time doubled and max resp time grew by 5x"]
  • 29. Actual Ready samples during several hour period [Chart: Ready samples (# of millisecs VM denied CPU in 20 sec intervals), y-axis 0 to 2500]
  • 30. Some Subtleties with Interactive CPU denial §  The Ready metric represents denial upon demand •  Interactive workloads can be bursty •  If no demand, then Ready counter will be low §  Poor user response encourages less usage •  Like walking on a broken leg •  Causing fewer Ready samples [Diagram: a denial spike within a 20 sec interval]
  • 31. Sharing I/O capacity §  If multiple VMs (or servers) are sharing the same underlying physical volumes and the capacity is not managed properly •  then the available I/O capacity of the volume could be less than the theoretical capacity §  This can be seen if the OS tools show that the disk is very busy (high utilization) while the number of I/Os is lower than expected
      [Diagram: a volume for the Lucene application and a volume for another application, both spread over the same set of drives and effectively sharing the I/O capacity]
  • 32. Recommendations on diagnosing disk I/O related issues §  On Linux/UNIX •  Have IT group install SAR and IOSTAT §  Also install a disk I/O testing tool (like Bonnie¹) •  Compare Bonnie output with SAR & IOSTAT data §  High disk utilization at much lower achieved rates could indicate contention from other applications •  Also, high SAR I/O wait time might be an indication of slow disks §  On Windows •  Leverage the Windows Performance Monitor •  Objects: Processor, Physical Disk, Memory
  • 33. Sample output from the Bonnie¹ tool
      bonnie -s 1024 -y -u -o_direct -v 10 -p 10
      This will increase the size of the file to 2 GB. Examine the output. Focus on the random I/O area:
                    ---Sequential Output (sync)------  ---Sequential Input--  --Rnd Seek-
                    -CharUnlk-  -DIOBlock-  -DRewrite-  -CharUnlk-  -DIOBlock-  --04k (10)-
      Machine MB    K/sec %CPU  K/sec  %CPU K/sec %CPU  K/sec %CPU  K/sec %CPU   /sec  %CPU
      Mach2 10*2024 73928 97    104142 5.3  26246 2.9   8872  22.5  43794 1.9    735.7 15.2
      This output means that the random read test saw 735 random I/Os per sec at 15% CPU busy.
      -s 1024 means that 2 GB files will be created
      -o_direct means that direct I/O (by-passing buffer cache) will be done
      -v 10 means that 10 different 2 GB files will be created
      -p 10 means that 10 different threads will query those files
      ¹ Bonnie is an open source disk I/O driver tool for Linux that can be useful for pretesting Linux disk environments prior to an xPlore/Lucene install.
  • 34. Linux indicators compared to Bonnie output
      Notice that at 200+ I/Os per sec the underlying volume is 80% busy. Although there could be multiple causes, one could be that some other VM is consuming the remaining I/O capacity (735 - 209 = 500+).
      iostat output:
      Device:  tps     kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
      sde      206.10  2402.40    0.80       24024    8
      SAR -d output:
      09:29:17  DEV      tps     rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz  await  svctm  %util
      09:29:27  dev8-65  209.24  4877.97   1.62      23.32     1.62      7.75   3.80   79.59
      SAR -u output (note the high I/O wait):
      09:29:17 PM  CPU  %user  %nice  %system  %iowait  %steal  %idle
      09:29:27 PM  all  41.37  0.00   5.56     29.86    0.00    23.21
      09:29:27 PM  0    62.44  0.00   10.56    25.38    0.00    1.62
      09:29:27 PM  1    30.90  0.00   4.26     35.56    0.00    29.28
      09:29:27 PM  2    36.35  0.00   3.96     30.76    0.00    28.93
      09:29:27 PM  3    35.77  0.00   3.46     27.64    0.00    33.13
      See https://community.emc.com/docs/DOC-9179 for additional example
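
The comparison made on this slide (benchmarked random I/O rate versus what `sar -d` reports under load) can be written down as a small check. This is a rough sketch: the function names and the two thresholds (`busy_threshold`, `rate_fraction`) are assumptions chosen for illustration, not values from the presentation.

```python
def remaining_iops(benchmark_iops, observed_iops):
    """I/O capacity the volume showed in the benchmark but is not delivering now."""
    return max(benchmark_iops - observed_iops, 0.0)


def looks_contended(benchmark_iops, observed_iops, util_pct,
                    busy_threshold=75.0, rate_fraction=0.5):
    """Flag a volume that is busy while delivering well under its benchmarked rate.

    busy_threshold and rate_fraction are hypothetical cut-offs.
    """
    busy = util_pct >= busy_threshold
    underdelivering = observed_iops < rate_fraction * benchmark_iops
    return busy and underdelivering


# Figures from the slides: Bonnie random seeks/sec, then sar -d tps and %util.
bonnie_iops = 735.7
sar_tps = 209.24
sar_util = 79.59

print(round(remaining_iops(bonnie_iops, sar_tps), 2))
print(looks_contended(bonnie_iops, sar_tps, sar_util))
```

A positive result is only a hint. As the slide notes, there can be several causes, and another VM consuming the shared drives is just one of them.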
  • 35. Tip #3: Try to ensure availability of resources §  Similar to the previous issue, but •  resource displacement not caused by overload, •  Inactivity can cause Lucene resources to be displaced •  Not different from running on large shared native OS host §  Recommendation: •  Periodic warmup §  non-intrusive •  See next example
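
A periodic warmup of the kind recommended here can be as simple as reading selected index files sequentially so that the operating system keeps them in its buffer cache. The sketch below is an assumption about how one might do this, not the mechanism xPlore ships with: `warm_file` and `warm_files` are hypothetical helpers, and which files to read depends on how the index is laid out on disk.

```python
import os
import tempfile


def warm_file(path, block_size=1024 * 1024):
    """Read a file sequentially in large blocks and return the bytes read.

    The data is discarded; the side effect is that the OS buffer cache
    is populated (assuming the file is not opened with direct I/O).
    """
    total = 0
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            total += len(chunk)
    return total


def warm_files(paths, block_size=1024 * 1024):
    """Warm several files and return the total number of bytes read."""
    return sum(warm_file(p, block_size) for p in paths)


# Demonstration on a throwaway file standing in for an index file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * (3 * 1024 * 1024 + 5))
    demo_path = tmp.name

bytes_read = warm_files([demo_path])
print(bytes_read)
os.remove(demo_path)
```

Large sequential reads are the reason this is cheap: the caching test later in the deck compares a few thousand large sequential I/Os against tens of thousands of random ones.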
  • 36. I/O / caching test use-case §  Unselective term search •  100 sample queries •  Avg(hits per term) = 4,300+, max ~ 60,000 •  Searching over 100s of DCTM object attributes + content §  Medium result window •  Avg(results returned per query) = 350 (max: 800) §  Stored fields utilized •  Some security & facet info §  Goal: •  Pre-cache portions of the index to improve response time in scenarios such as reboot, buffer cache contention, & VM memory contention
  • 37. Some xPlore Structures for Search¹ [Diagram labels: Dictionary of terms; Posting list (doc-ids for term); Stored fields (facets and node-ids), 1st doc to N-th doc; xDB XML store (contains text for summary); Security indexes (b-tree based); Facet decompression map] ¹ Frequency and position structures ignored for simplicity
  • 38. I/O model for search in xPlore [Diagram labels: Search term: term1 term2; Dictionary; Posting list (doc-ids for term); Stored fields (xDB node-id plus facet / security info); Security lookup (b-tree based); Facet decompression map; xDB XML store (contains text for summary); Result set]
  • 39. Separation of covering values in stored fields and summary [Diagram labels: Potentially thousands of hits; Potentially thousands of results; Security lookup; Facet calc over thousands of results; Final facet calc values (small structure); Stored fields (random access); Small number for result window; xDB docs with text for summary: Res-1-sum, Res-2-sum, Res-3-sum, …, Res-350-sum]
  • 40. xPlore Memory Pool areas at-a-glance [Diagram labels: Native code content extraction & linguistic processing memory; Operating System File Buffer cache (dynamically sized); Other VM working mem; xPlore Instance (fixed size) memory, containing xPlore caches, Lucene caches & working memory, and xDB Buffer Cache]
  • 41. Lucene data resides primarily in OS buffer cache §  Potential for many things to sweep Lucene from that cache [Diagram: the Lucene structures (dictionary of terms, posting lists, stored fields) and the xDB XML doc store shown against the memory pool areas of the previous slide]
  • 42. Test Env §  32 GB memory §  Direct attached storage (no SAN) §  1.4 million documents §  Lucene index size = 10 GB §  Size of internal parts of Lucene CFS file •  Stored fields (fdt, fdx): 230 MB (2% of index) •  Term dictionary (tis, tii): 537 MB (5% of index) •  Positions (prx): 8.78 GB (80% of index) •  Frequencies (frq): 1.4 GB (13% of index) §  Text in xDB stored compressed separately
  • 43. Some results of the query suite
      Test                  Avg resp to consume  MB pre-cached  I/O per result  Total MB loaded into memory
                            all results (sec)                                   (cached + test)
      Nothing cached        1.89                 0              0.89            77
      Stored fields cached  0.95                 241            0.38            272
      Term dict cached      1.73                 537            0.79            604
      Positions cached      1.58                 8,789          0.74            8,800
      Frequencies cached    1.65                 1,406          0.63            1,436
      Entire index cached   0.59                 10,970         < 0.05          10,970
      •  Linux buffer cache cleared completely before each run •  Resp as seen by final user in Documentum •  Facets not computed in this example, just a result set returned. With facets, the response time difference is more pronounced. •  Mileage will vary depending on a series of factors that include query complexity, composition of the index, and number of results consumed
  • 44. Other Notes §  Caching 2% of the index yields a response time that is only 60% greater than if the entire index was cached. •  Caching cost only 9 secs on a mirrored drive pair •  Caching cost 6800 large sequential I/Os vs. potentially 58,000 random I/Os §  Mileage will vary, factors include •  Phrase search •  Wildcard search •  Multi-term search §  SANs can grow I/O capacity as search complexity increases
  • 45. Contact §  Ed Bueché •  edward.bueche@emc.com •  http://community.emc.com/people/Ed_Bueche/blog •  http://community.emc.com/docs/DOC-8945
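
The headline figures on this slide can be cross-checked against the results table with a few lines of arithmetic. The input numbers below are taken from the slides; the derived values (percentage slowdown, cached share of the index, implied read throughput) are computed here for illustration and are not stated in the deck.

```python
# Inputs from the results table and notes in the presentation.
resp_stored_fields_s = 0.95   # avg response, stored fields pre-cached
resp_entire_index_s = 0.59    # avg response, entire index pre-cached
stored_fields_mb = 241        # MB pre-cached for the stored-fields test
entire_index_mb = 10970       # MB pre-cached for the entire-index test
warmup_seconds = 9            # reported cost of caching the stored fields


def percent_greater(a, b):
    """How much larger a is than b, as a percentage of b."""
    return (a / b - 1.0) * 100.0


slowdown_pct = percent_greater(resp_stored_fields_s, resp_entire_index_s)
cached_share_pct = 100.0 * stored_fields_mb / entire_index_mb
warmup_mb_per_s = stored_fields_mb / warmup_seconds

print(f"Response time penalty vs. full cache: {slowdown_pct:.0f}%")
print(f"Share of index pre-cached: {cached_share_pct:.1f}%")
print(f"Implied warmup read rate: {warmup_mb_per_s:.1f} MB/s")
```

The first two values line up with the slide's "60% greater" and "2% of index" claims; the read rate is simply what 241 MB in 9 seconds implies.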