Unifying Search Engine and NoSQL DBMS with a Universal Index
Chris Biow, MarkLogic Federal CTO

Agenda
- MarkLogic: company
- MarkLogic Server: DBMS product
- Search: indexing text
- Query: indexing XML structure and semantics
- Scalars: indexing ranges, analyzing aggregations
- Geo: indexing spatial locations
- Alerting: indexing queries, performing reverse query
- Strange Loop: composing forward and reverse query
- MVCC: temporizing the DBMS
  - No-compromise lock-free queries
- LSM (Log Structured Merge) Trees
  - No-compromise ingest, query, search
- Examples: BusinessWeek.com, WarriorGateway.org, USAF KM

About MarkLogic
MarkLogic Corporation makes a purpose-built database for unstructured information.
- >200 customers, >170 employees
- HQ: San Carlos, CA
- Lead investor: Sequoia Capital
- 2008: Top 5 fastest growing technology companies in Silicon Valley (Deloitte)
- 2009, 2010: Best DBMS (SIIA CODiE). Previously best Search, CMS.

What is MarkLogic Server?
- A hybrid (integrated parts)
- Special purpose DBMS for XML, with enterprise expectations
  - ACID transactions
  - DBA, backup, replication
- Search engine kernel, with enterprise expectations
  - Full text
  - Faceted navigation, at massive scale
  - Boolean, proximity, stemming, tokenization, decompounding, case, diacritics, …
- Application Server
  - HTTP
  - XCC Java/.NET
  - WebDAV

MarkLogic as Special DBMS
- Not relational (RDBMS)
- XML
  - The only data model required
  - Schema agnostic
  - Text a first-class citizen among data types
- XQuery (SQL)
- Search engine algorithms for many DB queries
  - O(1) initial lookup in number of docs
  - O(log(n)) in range indexing
- Very low DBA overhead (0.5 FTE / 100 hosts)
  - 5-minute install
  - 5-minute scale-out
- Database and search engine are the same

MarkLogic as NoSQL DBMS
- SQL XQuery!
  - Extensions: cts:search() / xdmp:document-insert()
- NoSQL Categories [per AKF Partners]
  - Key->Value store: URI -> document (XML, JSON, text, binary)
  - Extensible Record store: Extensible Markup Language
  - Document store: XML documents, natch
- Differentiators
  - ACID transactions in LAN cluster
  - Ad hoc XQuery
  - XML declares what is to be indexed, independently for each document
  - DBMS and Search Engine are the same

MarkLogic as Special Search Engine
- Understands document structure
- Transactional: high CRUD load
- Unicode
- Holds the documents
  - Update / reindexing
  - Delivery
- Geospatial: Box, Point/radius, polygon
- Alerting: Profiles, alerts, filters, tipping, selectors, “triggers,” …
- Analytics: Facets, Co-occurrence, word lexicons, …
- Everything composes (e.g. geo-alerting, geo-text-data, search-alerting)
- Processing near the data
- Relational joins and inferencing
- Database and search engine are the same
[Figure: XML tree of a timeline document with two message elements (@id 3, @id 5), each holding a status (“Oh boy…”, “Testing XProc”). Legend: Element node, Attribute Node, Text Node]

MarkLogic as Special App Server
- Native HTTP(S) server
  - RESTful
  - XML by default
  - Transform to HTML
  - Transform to PDF, MS Office, etc.
  - PKI with no dependencies
  - Optionally with external Auth
- HTTP(S) client
- XCC Java / .NET server
  - Similar to JDBC / ADO.NET
- WebDAV
  - Folder on the user’s desktop
[Figure: RESTful Architecture. Labels: URL Rewriter, Routes.xqy; Resources user/ and notes/ (user.xqy, note.xqy; Get, Put, Post, Delete); Representations (get, .json.xqy, .xml.xqy, …)]

MarkLogic at Scale
- Scale up: typically 1-2TB+ XML per server
- Scale out: low hundreds(++) of servers in a cluster
- Commodity hardware
  - Typically ~$15K HW budget per server
  - 2-CPU x 6-core/hyperthreaded
  - 32+ GB RAM
  - 3x disk: local mount with failover
- OS
  - Linux RHEL 5
  - Solaris 10
  - Windows 2003/8 (XP/Vista/7 for dev)

Collapsing the Stack
- The extended stack
  - Data (CSV: data)
  - DBMS (SQL: result sets)
  - Search Engine (Search languages: result page)
  - Web service (Java: XML, JSON)
  - Application server (Ruby: HTML)
- MarkLogic
  - Data (XML: xml)
  - DBMS, Search, Service, App (XQuery, XQuery, XQuery, XQuery)

Data Model
A database for unstructured (and semi-structured) information
[Figure: XML Data Model. Two example trees: a Document (Title, Author with First and Last, Metadata, nested Sections) and an fpML Trade (Product, ID, TradeLeg, Cashflow, Amount, Events)]

Example Document
[Figure: page layout of a Document with regions Title, Author, Abstract, Section, Section (cont’d), Metadata, Footer]

Serialized as XML
<article>
  <title>A Relational Model of Data for Large Shared Data Banks</title>
  <author><first-name>Edgar</first-name><last-name>Codd</last-name></author>
  <abstract>
    Future users of data banks must be protected from having to know how the data is organized in the machine (the internal representation). . . . Changes in data representation will often be needed . . .
  </abstract>
  <body>
    <section>
      <section> … has values which uniquely identify each element ... </section>
    </section>
    <section>… version of <product>IMS</product> provides the user . . . </section>
  </body>
  <metadata><vol>13</vol><number>6</number><year>1970</year></metadata>
</article>

Target Query Classes
Target to optimize performance for these kinds of queries
- Full-Text Search
  - Find all documents that contain the phrase “uniquely identify”.
- XML Structure
  - Find all articles that have an abstract.
- XML Semantics
  - Find all documents that mention the product “IMS”.
- Aggregate Queries
  - How many articles that contain “data base” were written in each of the last 5 decades?
- All of the above . . . at the same time
  - Count all articles that contain “data” in the title and mention the product “IMS” in a section, grouping by year.

1) Full-text Search
Find all documents that contain the phrase “uniquely identify”

UNIVERSAL INDEX
Term                   Document References
“which”                123, 127, 129, 152, 344, 791 . . .
“uniquely”             122, 125, 126, 129, 130, 167 . . .
“identify”             123, 126, 130, 142, 143, 167 . . .
“each”                 123, 130, 131, 135, 162, 177 . . .
“uniquely identify”    126, 130, 167, 212, 219, 377 . . .
“which uniquely”       . . .
. . .                  126, 130, 167, 212, 219, 377 . . .
“identify each”        . . .

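The term lists on this slide can be modeled with a small inverted index. The following is a minimal sketch in Python, assuming whitespace tokenization and two-word phrase terms; it is a toy model and not MarkLogic's implementation. The helper names (`terms`, `build_index`) and the sample documents other than 126 are hypothetical.

```python
from collections import defaultdict

def terms(text):
    """Yield single-word terms and adjacent two-word phrase terms."""
    words = text.lower().split()
    yield from words
    for first, second in zip(words, words[1:]):
        yield f"{first} {second}"

def build_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in terms(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Hypothetical corpus: only document 126 contains the exact phrase.
docs = {
    126: "has values which uniquely identify each element",
    129: "which uniquely names a row",
    142: "identify the user",
    143: "identify rows uniquely",
}
index = build_index(docs)

# A phrase lookup is a single term lookup.
phrase_hits = index.get("uniquely identify", [])

# Intersecting the two word lists is looser: it also finds document 143.
word_hits = sorted(set(index["uniquely"]) & set(index["identify"]))
```

Because the phrase is itself a term, the phrase query resolves with one lookup, while the word-only intersection over-matches documents that contain both words in a different order.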
2) XML Structure
Find all articles that have an abstract

UNIVERSAL INDEX
Term                   Document References
“which”                123, 127, 129, 152, 344, 791 . . .
“uniquely”             122, 125, 126, 129, 130, 167 . . .
“identify”             123, 126, 130, 142, 143, 167 . . .
“each”                 123, 130, 131, 135, 162, 177 . . .
“uniquely identify”    126, 130, 167, 212, 219, 377 . . .
<article>              . . .
. . .                  126, 130, 167, 212, 219, 377 . . .
<article>/<abstract>   . . .

3) XML Semantics
Find all documents that mention the product “IMS”

UNIVERSAL INDEX
Term                     Document References
“which”                  123, 127, 129, 152, 344, 791 . . .
“uniquely”               122, 125, 126, 129, 130, 167 . . .
“identify”               123, 126, 130, 142, 143, 167 . . .
“each”                   123, 130, 131, 135, 162, 177 . . .
“uniquely identify”      126, 130, 167, 212, 219, 377 . . .
<article>                . . .
. . .                    126, 130, 167, 212, 219, 377 . . .
<article>/<abstract>     . . .
<product>IMS</product>

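Structure and semantics can share the same term-list mechanism as words. The sketch below, in Python, assumes three kinds of terms written in the slide's notation: element names, parent/child pairs, and leaf element values. It is a toy model and not MarkLogic's implementation; `structure_terms`, `build_index`, and the sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def structure_terms(root):
    """Yield terms for element names, parent/child pairs, and leaf element values."""
    yield f"<{root.tag}>"
    stack = [root]
    while stack:
        parent = stack.pop()
        text = (parent.text or "").strip()
        if text and len(parent) == 0:
            # Leaf element with text: index the element together with its value.
            yield f"<{parent.tag}>{text}</{parent.tag}>"
        for child in parent:
            yield f"<{child.tag}>"
            yield f"<{parent.tag}>/<{child.tag}>"
            stack.append(child)

def build_index(docs):
    """Map each structural term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, xml_text in docs.items():
        for term in structure_terms(ET.fromstring(xml_text)):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Hypothetical documents; 127 mentions IMS only as plain text, not as a product.
docs = {
    126: "<article><abstract>Future users</abstract><body><section>version of "
         "<product>IMS</product> provides</section></body></article>",
    127: "<article><body><section>IMS is mentioned only as plain text"
         "</section></body></article>",
    130: "<memo><abstract>short</abstract></memo>",
}
index = build_index(docs)

articles_with_abstract = index.get("<article>/<abstract>", [])
mentions_ims_product = index.get("<product>IMS</product>", [])
```

Document 130 has an abstract but is not an article, and document 127 contains the word IMS without the markup, so neither appears in the results.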
4) Aggregates
How many of the articles that contain “data base” were written in each of the last 5 decades?

UNIVERSAL INDEX
Term                     Document References
“which”                  123, 127, 129, 152, 344, 791 . . .
“uniquely”               122, 125, 126, 129, 130, 167 . . .
“identify”               123, 126, 130, 142, 143, 167 . . .
“each”                   123, 130, 131, 135, 162, 177 . . .
“data base”              126, 130, 167, 212, 219, 377 . . .
<article>                . . .
. . .                    126, 130, 167, …
<article>/<abstract>     . . .
<product>IMS</product>
[Figure labels beside the index: YEAR, Volume]

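The aggregate can be pictured as a term list intersected with a sorted value index. The Python sketch below assumes a range index stored as (value, document id) pairs sorted by value, so a range lookup costs two binary searches. The years assigned to each document are invented for illustration, and the function names are hypothetical; this is not MarkLogic's implementation.

```python
from bisect import bisect_left, bisect_right
from collections import Counter

# Range index on year: (year, doc_id) pairs kept sorted by year.
# The years are made up; only the document ids come from the slide.
year_index = sorted([
    (1964, 101), (1970, 126), (1985, 130), (1992, 167),
    (2004, 212), (2004, 219), (2009, 377),
])

def range_lookup(index, low, high):
    """Return doc ids whose value lies in [low, high], using two binary searches."""
    start = bisect_left(index, (low, float("-inf")))
    end = bisect_right(index, (high, float("inf")))
    return [doc_id for _, doc_id in index[start:end]]

def count_by_decade(index, matching_docs):
    """Count matching documents per decade by walking the range index."""
    matching = set(matching_docs)
    counts = Counter()
    for year, doc_id in index:
        if doc_id in matching:
            counts[year // 10 * 10] += 1
    return dict(sorted(counts.items()))

# Term list for the phrase "data base" (hypothetical subset of the slide's list).
data_base_docs = [126, 130, 167, 212, 219]

docs_1980_to_1999 = range_lookup(year_index, 1980, 1999)
per_decade = count_by_decade(year_index, data_base_docs)
```

The counts come from the indexes alone; no document has to be opened to answer the question.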
5) All Of The Above
Count all articles that contain “data” in the title and mention the product IMS in a section, grouping by year.

The Universal Index

UNIVERSAL INDEX
Term                     Term List (Document References)
“which”                  123, 127, 129, 152, 344, 791 . . .
“uniquely”               122, 125, 126, 129, 130, 167 . . .
“identify”               123, 126, 130, 142, 143, 167 . . .
“each”                   123, 130, 131, 135, 162, 177 . . .
“data base”              126, 130, 167, 212, 219, 377 . . .
<article>                . . .
<article>/<abstract>     . . .
. . .                    126, 130, 167, …
<product>IMS</product>
[Figure label beside the index: Range Indexes]

Additional Uses For Universal Index
- Directories
  - Exclusive, hierarchical, analogous to file system
  - URI: /some/directory/hierarchy/me.xml
- Collections
  - Set-based, N:M relationship
  - Document URI : Collection URI
- Security
  - Invisible to your app
  - Document: Role, action

The Universal Index

UNIVERSAL INDEX
Term                             Term List (Document References)
“which”                          123, 127, 129, 152, 344, 791 . . .
“uniquely”                       122, 125, 126, 129, 130, 167 . . .
“identify”                       123, 126, 130, 142, 143, 167 . . .
“each”                           123, 130, 131, 135, 162, 177 . . .
“data base”                      126, 130, 167, 212, 219, 377 . . .
<article>                        . . .
<article>/<abstract>             . . .
. . .                            126, 130, 167, …
<product>IMS</product>
Directory(“/articles”)
Collection(Red)
Role:Editor + Action:Read
[Figure label beside the index: Range Indexes]

Universal Index: Schema Agnostic
XML is self-describing
<article>
  <title>MarkLogic Server: . . .</title>
  <author>
    <first-name>Dale</first-name>
    <last-name>Kim</last-name>
  </author>
  <abstract>
    . . . .<company>Mark Logic</company>
  </abstract>
  <body>
    <section>
      <section>. . .</section>
    </section>
    <section>. . . index . . . </section>
  </body>
  <copyright>Copyright©  . . . </copyright>
</article>

Load As Is
XML is self-describing. No Schema Needed!
[Figure: the article XML parsed into a tree. <article> has children <title> ("MarkLogic Server: . . ."), <author> (<first-name> "Dale", <last-name> "Kim"), <abstract> (containing <company> "MarkLogic"), <body> (nested <section> elements, one containing " . . . index. . . "), and <copyright>]

Spatial Indexing
Points ordered in latitude major order; special scan operators apply geospatial query constraints
[Figure: GEOSPATIAL INDEX, a list of latitude,longitude points (e.g. -10,10.5; 0,-167; 0,-124; 0,-29; 0,-12; 0,0; 10.1,35.553 …) each paired with a document reference (113, 123, 126, 127, 130, 167, …)]

Spatial Query
- Data examples
  - Latitude / Longitude
  - Any other pair (e.g. volume / price)
- Query types
  - Point (exact value)
  - Point-Radius (circle)
  - Lat/Lon bound (Mercator “rectangle”)
  - Polygon (10K+ vertices)
- Composition with…
  - Full Text
  - XML structure
  - XML semantics
  - Other range indexes (e.g. temporal)

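A point-radius query over a latitude-ordered index can be sketched as a binary-search scan followed by a distance filter. The Python below is a toy model under stated assumptions: points are stored as ((lat, lon), doc_id) pairs sorted in latitude major order, distance is great-circle (haversine) in miles, and the document ids 201 and 202 are hypothetical. It is not MarkLogic's scan operator.

```python
import math
from bisect import bisect_left, bisect_right

EARTH_RADIUS_MILES = 3958.8

def haversine_miles(a, b):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(h))

def circle_query(index, center, radius_miles):
    """Scan only the latitude band that can contain hits, then filter by distance."""
    lat_margin = math.degrees(radius_miles / EARTH_RADIUS_MILES)
    start = bisect_left(index, ((center[0] - lat_margin, -181.0), -1))
    end = bisect_right(index, ((center[0] + lat_margin, 181.0), float("inf")))
    return sorted({doc for point, doc in index[start:end]
                   if haversine_miles(point, center) <= radius_miles})

# Index sorted by latitude, then longitude (latitude major order).
geo_index = sorted([
    ((0.0, 0.0), 126),
    ((10.1, 35.553), 113),
    ((37.739976, -121.915821), 201),  # hypothetical doc near San Ramon
    ((37.53244, -122.270969), 202),   # hypothetical doc near San Carlos
])

san_ramon = (37.751658, -121.898387)
san_carlos = (37.507363, -122.247119)
near_san_ramon = circle_query(geo_index, san_ramon, 5)
near_san_carlos = circle_query(geo_index, san_carlos, 5)
```

The latitude band keeps the scan short; the exact distance check then discards points that fall inside the band but outside the circle.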
Query Registration
- Canonicalize a cts:query()
- Hash for an ID
- Resolve the query
- Cache the term list in memory
- Reuse as materialized sub-query
- (AKA Topics, Concepts, Macros, etc.)
- (This is not alerting)

Registered Query

UNIVERSAL INDEX
Term                                       Term List (Document References)
“which”                                    123, 127, 129, 152, 344, 791 . . .
“uniquely”                                 122, 125, 126, 129, 130, 167 . . .
“identify”                                 123, 126, 130, 142, 143, 167 . . .
“each”                                     123, 130, 131, 135, 162, 177 . . .
“data base”                                126, 130, 167, 212, 219, 377 . . .
<article>                                  . . .
<article>/<abstract>                       . . .
. . .                                      126, 130, 167, …
<product>IMS</product>
Directory(“/articles/”)
Collection(Red)
Role:Editor + Action:Read
cts:query(<cts:word-query><cts:text>…)
[Figure label beside the index: Range Indexes]

Query Indexing
- “Alerting”
  - Real-time search, selectors, tippers, standing queries, filters, “triggers*”, content-based routing, stream DBMS, etc.
- Search(query, Index[docs]) -> docs
- Alert(doc, Index[queries]) -> queries
- Queries are XML documents
  - XML serialization of cts:query()
- First step is O(1) in number of queries
- O(n) in results returned
- cts:reverse-query()
* MarkLogic also has pre- and post-commit DBMS triggers, which are unrelated
[Figure: Alert: “New patient matching your study profile”]

The Reverse Index

REVERSE INDEX
Query                                      Document References
year >= 1970 and (“data” and “size”)       437
year > 2003 and (“data” and “web”)         562
year < 2000 and (“data” and “web”)         597
(2000 <= year <= 2010) and “web”           623
. . .
[Figure: Unified Expression Trees. The stored queries share leaf terms (“data”, “size”, “web”, year >= 1970, year < 2000, year >= 2000, year > 2003, year <= 2010) joined by “and” nodes]

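Alert(doc, Index[queries]) -> queries can be modeled by indexing the stored queries under their own leaf terms. The Python sketch below is a toy model and not MarkLogic's implementation: each stored query is assumed to be a conjunction of required words plus an inclusive year range, years are integers (so year > 2003 becomes a lower bound of 2004), and the pairing of the ids 437, 562, 597, 623 with the four expressions is an assumption about the slide's layout.

```python
from collections import defaultdict

# Stored queries: required words plus an inclusive (low, high) year range.
QUERIES = {
    437: {"words": {"data", "size"}, "year": (1970, None)},  # year >= 1970
    562: {"words": {"data", "web"}, "year": (2004, None)},   # year > 2003
    597: {"words": {"data", "web"}, "year": (None, 1999)},   # year < 2000
    623: {"words": {"web"}, "year": (2000, 2010)},           # 2000 <= year <= 2010
}

def build_reverse_index(queries):
    """Map each word to the ids of the stored queries that require it."""
    index = defaultdict(set)
    for query_id, query in queries.items():
        for word in query["words"]:
            index[word].add(query_id)
    return index

def in_range(year, bounds):
    """True if year lies within the inclusive bounds; None means unbounded."""
    low, high = bounds
    return (low is None or year >= low) and (high is None or year <= high)

def reverse_query(index, queries, doc_words, doc_year):
    """Return ids of stored queries that the incoming document satisfies."""
    hits = defaultdict(int)
    for word in set(doc_words):
        for query_id in index.get(word, ()):
            hits[query_id] += 1
    return sorted(
        query_id for query_id, count in hits.items()
        if count == len(queries[query_id]["words"])
        and in_range(doc_year, queries[query_id]["year"])
    )

reverse_index = build_reverse_index(QUERIES)
alerts_2005 = reverse_query(reverse_index, QUERIES, "data on the web".split(), 2005)
alerts_1995 = reverse_query(reverse_index, QUERIES, "data size web".split(), 1995)
```

The work is driven by the words in the incoming document, not by the number of stored queries: queries that share no term with the document are never touched.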
Alerting in Composition
- Scalar
  - Query on scalar data with range queries
  - Alert on range data with scalar reverse-queries
- Geospatial
  - Query on point data with box, circle, polygon query constraint
    - Compose with text, structure, XML-semantic query
  - Alert on box, circle, polygon data with point reverse-query
    - Compose with text, structure, XML-semantic data
- Search
  - [Forward-]Query composes with Boolean operations (AND, OR, NOT, (()()(()()))
  - Reverse- and forward-query compose (AND, OR, NOT, (()()(()()))
  - Why would you ever want to do that?

Search Composed with Alerting
- In Soviet Russia, the document searches YOU!
  - If you express yourself as XML
- Documents are XML
  - Elements
    - Text
    - Typed data
    - Serialized query against documents [cts:query()]
  - Attributes
    - Typed data
- Composition
  - Boolean (AND) incorporating [forward-]query and reverse-query

Matchmaking
- Constraints upon each other
- Matching pairs or one-to-many
- Examples
  - Suitable date: man / woman mixed pool
  - Employment: job / resume
  - Medication: patient / drug
  - Search security: document / user
  - Battle: target / shooter
  - Carpool ride: driver / rider

Carpool
- Driver
  - Non-smoking woman driving from San Ramon to San Carlos, leaving at 8AM, listens to rock, pop, hip-hop, wants $10 for gas
  - Requires female passenger within five miles of start and end
- Passenger
  - Woman will pay up to $20
  - From: 3001 Summit View Dr, San Ramon, CA 94582
  - To: 400 Concourse Drive, Belmont, CA 94002
  - Requires non-smoking car
  - Won’t listen to country music

Driver
let $from := cts:point(37.751658, -121.898387) (: San Ramon :)
let $to := cts:point(37.507363, -122.247119) (: San Carlos :)
return xdmp:document-insert(
  "/driver.xml",
  <driver>
    <from>{$from}</from>
    <to>{$to}</to>
    <when>2010-01-20T08:00:00-08:00</when>
    <gender>female</gender>
    <smoke>no</smoke>
    <music>rock, pop, hip-hop</music>
    <cost>10</cost>
    <preferences>
      {cts:and-query((
        cts:element-value-query(xs:QName("gender"), "female"),
        cts:element-geospatial-query(xs:QName("from"), cts:circle(5, $from)),
        cts:element-geospatial-query(xs:QName("to"), cts:circle(5, $to))
      ))}
    </preferences>
  </driver>)

Passenger
xdmp:document-insert(
  "/passenger.xml",
  <passenger>
    <from>37.739976,-121.915821</from>
    <to>37.53244,-122.270969</to>
    <gender>female</gender>
    <preferences>
      {cts:and-query((
        cts:not-query(cts:element-word-query(xs:QName("music"), "country")),
        cts:element-range-query(xs:QName("cost"), "<=", 20),
        cts:element-value-query(xs:QName("smoke"), "no"),
        cts:element-value-query(xs:QName("gender"), "female")
      ))}
    </preferences>
  </passenger>)

Driver
(: I'm the driver, find me passengers :)
let $me := fn:doc("/driver.xml")/driver
for $match in cts:search(/passenger,
  cts:and-query((
    cts:query($me/preferences/element()),
    cts:reverse-query($me)
  )))[1 to 3]
return fn:base-uri($match)

Passenger
(: I'm a passenger, find me a driver :)
let $me := fn:doc("/passenger.xml")/passenger
for $match in cts:search(/driver,
  cts:and-query((
    cts:query($me/preferences/element()),
    cts:reverse-query($me)
  )))
return fn:base-uri($match)

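The logic of the composed search is that a match must pass in both directions: my preferences select the other party (forward query), and the other party's stored preferences select me (reverse query). The Python sketch below models only that logic, with plain predicates standing in for the stored cts:query elements and the geospatial constraints omitted for brevity. The `Party` class, the `matches` function, and the extra passengers are hypothetical, and nothing here uses an index.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Doc = Dict[str, object]

@dataclass
class Party:
    uri: str
    data: Doc
    preferences: Callable[[Doc], bool]  # stands in for the stored query

def matches(me: Party, pool: List[Party], limit: Optional[int] = None) -> List[str]:
    """Return URIs of parties that I accept and that accept me."""
    found = [
        other.uri for other in pool
        if me.preferences(other.data)      # forward: my preferences select them
        and other.preferences(me.data)     # reverse: their preferences select me
    ]
    return found[:limit]

def rider_preferences(max_cost):
    """Build a rider's preference predicate with a cost ceiling."""
    return lambda d: ("country" not in d.get("music", [])
                      and d.get("cost", 0) <= max_cost
                      and d.get("smoke") == "no"
                      and d.get("gender") == "female")

driver = Party(
    "/driver.xml",
    {"gender": "female", "smoke": "no", "music": ["rock", "pop", "hip-hop"], "cost": 10},
    lambda d: d.get("gender") == "female",
)
passenger = Party("/passenger.xml", {"gender": "female"}, rider_preferences(20))
# Hypothetical extra riders: one the driver rejects, one who rejects the driver.
passenger_2 = Party("/passenger2.xml", {"gender": "male"}, rider_preferences(20))
passenger_3 = Party("/passenger3.xml", {"gender": "female"}, rider_preferences(5))

riders_for_driver = matches(driver, [passenger, passenger_2, passenger_3], limit=3)
drivers_for_passenger = matches(passenger, [driver])
```

The same function serves both sides, which mirrors how the driver and passenger searches differ only in which document plays the role of `$me`.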
Search and Alerting Composed
- XML data typing expectations
  - In both directions
- Arbitrary schema
  - At each document
  - In each query
  - Don’t [have to] declare anything
  - You choose the logic for empty data
  - Specify typed range indexes for scalar, geo, faceting
- Search engine speed and scalability
  - O(1) term lookup
    - Word, structure, values
    - Query sub-expression
  - O(log(n)) range lookup and term list intersection
  - Shared-nothing (sharded) query evaluation

Document Security
- Document queries the user
  - Rules for who can see me, the document
- Open, ad-hoc security [lack of] model
  - Each document declares any rules it wants
  - “user eye color not blue”
- Descriptive model/schema if desired
  - Extensible without changing DBMS schema

Medication
- Patient
  - Diagnosis
  - Background
  - Idiopathic history
  - Vital Statistics
  - Treatment baseline
- Drug
  - Therapeutics
  - Side effects
  - Interactions
  - Contraindications

What is a strange loop?
- Not mere hierarchies of abstraction
  - Abstraction is simplification and distortion
  - Useful if bounded and well-ordered
  - Problems if poorly bounded: Watch Ourselves
- Strange loops are disorderings of the hierarchy: heterarchy
  - Hofstadter: “a paradoxical level-crossing feedback loop”
  - Good strange loops gracefully accommodate the disordering
- Established examples in Computer Science
  - Gödel’s Incompleteness Theorems
  - Self-compiling languages

Strange Loop: Fwd∘Rev Query
- Queries are an abstraction over the data
  - /myroot//foo[@bar=7]/bash
- Reverse-query
  - Indexed data are [serialized] queries.
  - Documents are the queries.
  - Queries can still be abstractions over a stream of docs.
- Composing forward and reverse query
  - Strange Loop!
[Figure: Escher, Drawing Hands]

Strange Loop: XQuery throughout the Stack
- Query language is an abstraction over the data
- But the query language is re-used in the application
  - The [No]SQL is the PL
- Declarative query becomes functional programming
  - Creeping
  - Lazy evaluation
  - Parallelization
  - Discarding unneeded work
  - Schrödinger’s tuple
- Not religious
  - Side effects available when required
    - DBMS transactions
    - xdmp:set()
[Figure: Escher, Waterfall]

Strange Loop: XQuery on XQuery In SVN Project organization No search Dependency tracking Import requires absolute paths Namespace prefix conflicts Surprise modules (functx) XQuery (with cts: extensions) to discover and automate imports
The Composable, Universal Index Full text XML structure XML semantics Range indexes Range queries Aggregations Co-occurrence Spatial indexes Query indexes
Database of documents Stored in partitions Database Partition3 Partition2 Partition1 Databases
Simple Architecture Host partition1 partition2 partition3
Shared Nothing Architecture Host 1 Host 2 partition1 partition2 partition3
Bi-Directionally Scalable Architecture Host 1 Host 3 Host 2 Host 4 Host 5 Host 6 Host k partition1 partition2 partition3 partitionm partition4
Multiversion Concurrency Control
[Figure: two versions of /articles/codd.xml drawn as document trees (Document, Title, Author, First, Last, Metadata, Year, Section nodes), annotated with the timestamps 523, 628, and ∞. Legend: c = Creation Timestamp, d = Deleted Timestamp]
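The creation and deletion timestamps on this slide suggest a simple visibility rule: a version is visible at time t when created ≤ t < deleted. The sketch below assumes that rule and a single logical clock. The `Version` and `Store` classes are hypothetical and are not MarkLogic's internals.

```python
import math
from dataclasses import dataclass


@dataclass
class Version:
    uri: str
    content: str
    created: int
    deleted: float = math.inf  # infinity means "not deleted yet"


class Store:
    def __init__(self):
        self.versions = []  # versions are appended, never overwritten
        self.clock = 0      # logical timestamp, advanced by each write

    def write(self, uri, content):
        """Add a new version and stamp the previous live version as deleted."""
        self.clock += 1
        for v in self.versions:
            if v.uri == uri and v.deleted == math.inf:
                v.deleted = self.clock
        self.versions.append(Version(uri, content, self.clock))
        return self.clock

    def read(self, uri, at):
        """Return the content visible at timestamp `at`, or None if none is."""
        for v in self.versions:
            if v.uri == uri and v.created <= at < v.deleted:
                return v.content
        return None
```

A reader that fixes its timestamp when it starts keeps seeing the same versions regardless of later writes, which is why queries in this model need no locks.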
Multiversion Concurrency Benefits
- High throughput
  - Queries don’t require locks
  - Queries and updates do not conflict
- ACID
  - Cluster consistency: 2-phase commit
- Zero-latency ingestion and indexing
  - Append only
  - Ingest/update rates of ~400GB per partition per day
[Figure: document tree for /articles/codd.xml (Document, Title, Author, First, Last, Metadata, Year, Section nodes) with timestamps 628 and ∞]
A Single Forest Host Stand1 Stand2 Standn Buffer Buffer Forestk …
1. Create A New Tree Host Stand1 Stand2 Standn Buffer Buffer Forestk …
2. Expire Trees Host Stand1 Stand2 Standn Buffer Buffer Forestk …
3. Save A Buffer To Disk Host Stand1 Stand2 Standn Buffer Buffer Forestk …
4. Optimization: Merge Stands Host Buffer Forestk
The Four Forest Operations
1. Create a new document
2. Mark a document as expired
3. Write buffer out to disk
   - For performance, double buffer
4. Merge
   - Optimization: reduces number of stands in forest
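The four operations can be sketched as a toy log-structured store. This is a simplification under stated assumptions: stands are modeled as in-memory dictionaries rather than files, expiry is modeled with a tombstone marker rather than timestamps, and double buffering is left out. The `Forest` class and `TOMBSTONE` marker are hypothetical names.

```python
TOMBSTONE = object()  # hypothetical marker for an expired document


class Forest:
    def __init__(self, buffer_limit=2):
        self.buffer = {}   # in-memory buffer: uri -> content or TOMBSTONE
        self.stands = []   # flushed stands, oldest first (dicts stand in for files)
        self.buffer_limit = buffer_limit

    def _put(self, uri, value):
        self.buffer[uri] = value
        if len(self.buffer) >= self.buffer_limit:
            self.flush()

    def insert(self, uri, content):
        """1. Create a new document (written to the buffer only)."""
        self._put(uri, content)

    def expire(self, uri):
        """2. Mark a document as expired; older stands are left untouched."""
        self._put(uri, TOMBSTONE)

    def flush(self):
        """3. Write the buffer out as a new stand and start an empty buffer."""
        if self.buffer:
            self.stands.append(self.buffer)
            self.buffer = {}

    def merge(self):
        """4. Merge all stands into one, dropping expired documents."""
        merged = {}
        for stand in self.stands:  # oldest first, so newer entries win
            merged.update(stand)
        live = {u: c for u, c in merged.items() if c is not TOMBSTONE}
        self.stands = [live] if live else []

    def get(self, uri):
        """Read the newest entry for uri: buffer first, then newest stand."""
        for stand in [self.buffer] + self.stands[::-1]:
            if uri in stand:
                value = stand[uri]
                return None if value is TOMBSTONE else value
        return None
```

Writes only ever touch the buffer, so ingest never rewrites existing stands; the merge step is where space is reclaimed and the number of stands a read must consult is reduced.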
How Did We Get Here? Founder: Christopher Lindblad MIT Architect of Ultraseek Server Intranet search engine product Met people that wanted to use a search engine like a database Rich query language Guaranteed correctness Transactions
So We Built XML as data model Ad hoc schema A search engine core Universal Index Transaction model based on multiversion concurrency High throughput while keeping . . . Performance and scalability of a search engine
Who Uses MarkLogic? Magazine Publishing Education Healthcare Software / Services Legal Tax Financial Enterprise Aggregation Scientific Technical Medical
Selected Federal Customers
- Intelligence Community: Office of the Director of National Intelligence, Intelligence Community Enterprise Services, …
- Department of Defense: Office of the Secretary of Defense, US Army, US Air Force, Defense Information Systems Agency, Defense Contract Management Agency
- Civilian

Mark Logic StrangeLoop 2010

  • 1. Unifying Search Engine and NoSQL DBMS with a Universal Index Chris Biow MarkLogic Federal CTO
  • 2. Agenda MarkLogic: company MarkLogic Server: DBMS product Search: indexing text Query: indexing XML structure and semantics Scalars: indexing ranges, analyzing aggregations Geo: indexing spatial locations Alerting: indexing queries, performing reverse query Strange Loop: composing forward and reverse query MVCC: temporizing the DBMS No-compromise lock-free queries LSM (Log Structured Merge) Trees No-compromise ingest , query, search Examples: BusinessWeek.com, WarriorGateway.org, USAF KM
  • 3. Agenda MarkLogic: company MarkLogic Server: DBMS product Search: indexing text Query: indexing XML structure and semantics Scalars: indexing ranges, analyzing aggregations Geo: indexing spatial locations Alerting: indexing queries, performing reverse query Strange Loop: composing forward and reverse query MVCC: temporizing the DBMS No-compromise lock-free queries LSM (Log Structured Merge) Trees No-compromise ingest , query, search Examples: BusinessWeek.com, WarriorGateway.org, USAF KM
  • 4. >200 customers, >170 employees HQ: San Carlos, CA Lead investor: Sequoia Capital 2008: Top 5 fastest growing technology companies in Silicon Valley (Deloitte) 2009, 2010: Best DBMS (SIIA CODiE). Previously best Search, CMS. About MarkLogic MarkLogic Corporation makes a purpose-built database for unstructured information
  • 5. Agenda MarkLogic: company MarkLogic Server: DBMS product Search: indexing text Query: indexing XML structure and semantics Scalars: indexing ranges, analyzing aggregations Geo: indexing spatial locations Alerting: indexing queries, performing reverse query Strange Loop: composing forward and reverse query MVCC: temporizing the DBMS No-compromise lock-free queries LSM (Log Structured Merge) Trees No-compromise ingest , query, search Examples: BusinessWeek.com, WarriorGateway.org, USAF KM
  • 6. What is MarkLogic Server? A hybrid (integrated parts) Special purpose DBMS for XML, with enterprise expectations ACID transactions DBA, backup, replication Search engine kernel, with enterprise expectations Full text Faceted navigation, at massive scale Boolean, proximity, stemming, tokenization, decompounding, case, diacritics, … Application Server HTTP XCC Java/.NET WebDAV
  • 7. MarkLogic as Special DBMS Not relational (RDBMS) XML The only data model required Schema agnostic Text a first-class citizen among data types XQuery (SQL) Search engine algorithms for many DB queries Order(1) initial lookup in number of docs O(log(n)) in range indexing Very low DBA overhead (0.5 FTE / 100 hosts) 5-minute install 5-minute scale-out Database and search engine are the same
  • 8. MarkLogic as NoSQL DBMS SQL XQuery ! Extensions: cts:search() / xdmp:document-insert() NoSQL Categories [per AKF Partners] Key->Value store URI -> document (XML, JSON, text, binary) Extensible Record store Extensible Markup Language Document store XML documents, natch Differentiators ACID transactions in LAN cluster Ad hoc XQuery XML declares what is to be indexed, independently for each document DBMS and Search Engine are the same
  • 9. MarkLogic as Special Search Engine timeline Understands document structure Transactional: high CRUD load Unicode Holds the documents Update / reindexing Delivery Geospatial: Box, Point/radius, polygon Alerting: Profiles, alerts, filters, tipping, selectors, “triggers,” … Analytics: Facets, Co-occurrence, word lexicons, … Everything composes (e.g. geo-alerting, geo-text-data, search-alerting) Processing near the data Relational joins and inferencing Database and search engine are the same message message @id 3 @id 5 status status Oh boy… Testing XProc Element node Attribute Node Text Node
  • 10. MarkLogic as Special App Server Native HTTP(S) server RESTful XML by default Transform to HTML Transform to PDF, MS Office, etc. PKI with no dependencies Optionally with external Auth HTTP(S) client XCC Java / .NET server Similar to JDBC / ADO.NET WebDAV Folder on the user’s desktop RESTful Architecture user/ Representations + get .json.xqy .xml.xqy + … user.xqy + Get + Put + Post + Delete Resources notes/ URL Rewriter note.xqy Routes.xqy
  • 11. MarkLogic at Scale Scale up: typically 1-2TB+ XML per server Scale out: low hundreds(++) of servers in a cluster Commodity hardware Typically ~$15K HW budget per server 2-CPU x 6-core/hyperthreaded 32+ GB RAM 3x disk: local mount with failover OS Linux RHEL 5 Solaris 10 Windows 2003/8 (XP/Vista/7 for dev)
  • 12. Collapsing the Stack The extended stack Data (CSV: data) DBMS (SQL: result sets) Search Engine (Search languages: result page) Web service (Java: XML, JSON) Application server (Ruby: HTML) MarkLogic Data (XML: xml) DBMS, Search, Service, App (XQuery, XQuery, XQuery, XQuery)
  • 13. Data Model A database for unstructured (and semi-structured) information XML Data Model fpML Document Trade Product Title Author Metadata Trade Cashflow Section ID ID Last TradeLeg First TradeLeg Amount TradeLeg Event Event Event Event Section Section Section Section
  • 14. Example Document Document Title Section Section (cont’d) Author Abstract Section Metadata Section Section Footer
  • 15. Serialized as XML <article> <title>A Relational Model of Data for Large Shared Data Banks</title> <author><first-name>Edgar</first-name><last-name>Codd</last-name></author> <abstract> Future users of data banks must be protected from having to know how the data is organized in the machine (the internal representation). . . . Changes in data representation will often be needed . . . </abstract> <body> <section> <section> … has values which uniquely identify each element ... </section> </section> <section>… version of <product>IMS</product> provides the user . . . </section> </body> <metadata><vol>13</vol><number>6</number><year>1970</year></metadata> </article>
  • 16. Target Query Classes Target to optimize performance for these kinds of queries Full-Text Search Find all documents that contain the phrase “uniquely identify”. XML Structure Find all articles that have an abstract. XML Semantics Find all documents that mention the product “IMS”. Aggregate Queries How many articles that contain “data base” were written in each of the last 5 decades. All of the above . . . Count all articles that contain “data” in the title and mention the product “IMS” in a section, grouping by year. at the same time
  • 17. Agenda MarkLogic: company MarkLogic Server: DBMS product Search: indexing text Query: indexing XML structure and semantics Scalars: indexing ranges, analyzing aggregations Geo: indexing spatial locations Alerting: indexing queries, performing reverse query Strange Loop: composing forward and reverse query MVCC: temporizing the DBMS No-compromise lock-free queries LSM (Log Structured Merge) Trees No-compromise ingest , query, search Examples: BusinessWeek.com, WarriorGateway.org, USAF KM
  • 18. 1) Full-text Search: Find all documents that contain the phrase “uniquely identify” (shown against the example <article> from slide 15)
  • 19. 1) Full-text Search: Find all documents that contain the phrase “uniquely identify”
      UNIVERSAL INDEX (term, then document references)
        “which”              123, 127, 129, 152, 344, 791 . . .
        “uniquely”           122, 125, 126, 129, 130, 167 . . .
        “identify”           123, 126, 130, 142, 143, 167 . . .
        “each”               123, 130, 131, 135, 162, 177 . . .
        “uniquely identify”  126, 130, 167, 212, 219, 377 . . .
        “which uniquely”     . . .
        “identify each”      . . .
        . . .                126, 130, 167, 212, 219, 377 . . .
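The term lists on slide 19 can be illustrated with a small sketch. This is not MarkLogic's implementation; it only assumes that single words and adjacent word pairs are both stored as terms, so a two-word phrase resolves with one lookup. The function names and the three sample documents are hypothetical.

```python
from collections import defaultdict

PUNCT = ".,;:!?\"'()"

def tokenize(text):
    """Lowercase, split on whitespace, and strip basic punctuation."""
    words = (w.strip(PUNCT).lower() for w in text.split())
    return [w for w in words if w]

def build_index(docs):
    """Map every word and every adjacent word pair to a sorted list of doc IDs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        words = tokenize(text)
        for w in words:
            index[w].add(doc_id)
        for a, b in zip(words, words[1:]):
            index[f"{a} {b}"].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Hypothetical documents; the IDs echo the slide but the text is made up.
docs = {
    126: "has values which uniquely identify each element",
    130: "keys uniquely identify each row",
    123: "which rows identify each user",
}
index = build_index(docs)
phrase_hits = index.get("uniquely identify", [])
print(phrase_hits)
```

Longer phrases would need either positions or an intersection of overlapping pair terms followed by verification; the sketch leaves that out.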
  • 21. 2) XML Structure: Find all articles that have an abstract (shown against the example <article> from slide 15)
  • 22. 2) XML Structure: Find all articles that have an abstract
      UNIVERSAL INDEX (term, then document references)
        “which”               123, 127, 129, 152, 344, 791 . . .
        “uniquely”            122, 125, 126, 129, 130, 167 . . .
        “identify”            123, 126, 130, 142, 143, 167 . . .
        “each”                123, 130, 131, 135, 162, 177 . . .
        “uniquely identify”   126, 130, 167, 212, 219, 377 . . .
        <article>             . . .
        <article>/<abstract>  . . .
        . . .                 126, 130, 167, 212, 219, 377 . . .
  • 23. 3) XML Semantics: Find all documents that mention the product “IMS” (shown against the example <article> from slide 15)
  • 24. 3) XML Semantics: Find all documents that mention the product “IMS”
      UNIVERSAL INDEX (term, then document references)
        “which”                 123, 127, 129, 152, 344, 791 . . .
        “uniquely”              122, 125, 126, 129, 130, 167 . . .
        “identify”              123, 126, 130, 142, 143, 167 . . .
        “each”                  123, 130, 131, 135, 162, 177 . . .
        “uniquely identify”     126, 130, 167, 212, 219, 377 . . .
        <article>               . . .
        <article>/<abstract>    . . .
        <product>IMS</product>  . . .
        . . .                   126, 130, 167, 212, 219, 377 . . .
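Slides 22 and 24 treat structure (`<article>/<abstract>`) and element values (`<product>IMS</product>`) as ordinary index terms next to the word terms. A rough sketch of generating such terms follows; the exact term shapes and the two sample documents are assumptions made for illustration, not MarkLogic's term format.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def structure_terms(root):
    """Yield root, parent/child, and leaf element-value terms for one document."""
    yield f"<{root.tag}>"
    for parent in root.iter():
        for child in parent:
            yield f"<{parent.tag}>/<{child.tag}>"
        text = (parent.text or "").strip()
        if text and len(parent) == 0:  # leaf element with text content
            yield f"<{parent.tag}>{text}</{parent.tag}>"

def build_structure_index(docs):
    """Map each structural or value term to the sorted doc IDs containing it."""
    index = defaultdict(set)
    for doc_id, xml in docs.items():
        for term in structure_terms(ET.fromstring(xml)):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Hypothetical, heavily shortened documents.
docs = {
    126: "<article><abstract>Future users</abstract><body><section>version of "
         "<product>IMS</product> provides</section></body></article>",
    130: "<article><body><section>no abstract here</section></body></article>",
}
index = build_structure_index(docs)
print(index["<article>/<abstract>"], index["<product>IMS</product>"])
```

With these terms in the same index as the words, "has an abstract" and "mentions product IMS" become the same kind of lookup as a word search.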
  • 26. 4) Aggregate Queries: How many of the articles that contain “data base” were written in each of the last 5 decades? (shown against the example <article> from slide 15)
  • 27. 4) Aggregates: How many of the articles that contain “data base” were written in each of the last 5 decades?
      UNIVERSAL INDEX (term, then document references)
        “which”                 123, 127, 129, 152, 344, 791 . . .
        “uniquely”              122, 125, 126, 129, 130, 167 . . .
        “identify”              123, 126, 130, 142, 143, 167 . . .
        “each”                  123, 130, 131, 135, 162, 177 . . .
        “data base”             126, 130, 167, 212, 219, 377 . . .
        <article>               . . .
        <article>/<abstract>    . . .
        <product>IMS</product>  . . .
        . . .                   126, 130, 167, …
      Range indexes: YEAR, Volume
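The aggregate on slide 27 combines a term list with a range index on year. The sketch below assumes the range index is a list of (value, doc ID) pairs kept sorted by value; the years and the extra document 400 are invented for the example.

```python
from bisect import bisect_left, bisect_right
from collections import Counter

# Hypothetical range index on <year>: (year, doc_id) pairs sorted by year.
year_index = sorted([
    (1970, 126), (1985, 130), (1992, 167), (1999, 212),
    (2004, 219), (2009, 377), (2009, 400),
])
# Term list for "data base", as a set of doc IDs.
data_base_docs = {126, 130, 167, 212, 219, 377}

def count_by_decade(range_index, matching_docs, first_year, last_year):
    """Count matching docs per decade using a binary-searched slice of the range index."""
    lo = bisect_left(range_index, (first_year, float("-inf")))
    hi = bisect_right(range_index, (last_year, float("inf")))
    counts = Counter()
    for year, doc_id in range_index[lo:hi]:
        if doc_id in matching_docs:
            counts[year // 10 * 10] += 1
    return dict(sorted(counts.items()))

counts = count_by_decade(year_index, data_base_docs, 1960, 2009)
print(counts)
```

The documents themselves are never opened: the counts come from the two index structures alone, which is the point of keeping scalar values in a separate ordered index.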
  • 28. 5) All Of The Above: Count all articles that contain “data” in the title and mention the product IMS in a section, grouping by year. (shown against the example <article> from slide 15)
  • 29. The Universal Index (with Range Indexes alongside)
      UNIVERSAL INDEX
        Term                    Term List (document references)
        “which”                 123, 127, 129, 152, 344, 791 . . .
        “uniquely”              122, 125, 126, 129, 130, 167 . . .
        “identify”              123, 126, 130, 142, 143, 167 . . .
        “each”                  123, 130, 131, 135, 162, 177 . . .
        “data base”             126, 130, 167, 212, 219, 377 . . .
        <article>               . . .
        <article>/<abstract>    . . .
        <product>IMS</product>  . . .
        . . .                   126, 130, 167, …
  • 30. Additional Uses For Universal Index
      - Directories: exclusive, hierarchical, analogous to a file system. URI: /some/directory/hierarchy/me.xml
      - Collections: set-based, N:M relationship. Document URI : Collection URI
      - Security: invisible to your app. Document: Role, action
  • 31. The Universal Index (with Range Indexes alongside)
      UNIVERSAL INDEX
        Term                          Term List (document references)
        “which”                       123, 127, 129, 152, 344, 791 . . .
        “uniquely”                    122, 125, 126, 129, 130, 167 . . .
        “identify”                    123, 126, 130, 142, 143, 167 . . .
        “each”                        123, 130, 131, 135, 162, 177 . . .
        “data base”                   126, 130, 167, 212, 219, 377 . . .
        <article>                     . . .
        <article>/<abstract>          . . .
        <product>IMS</product>        . . .
        Directory(“/articles”)        . . .
        Collection(Red)               . . .
        Role:Editor + Action:Read     . . .
        . . .                         126, 130, 167, …
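Because directory, collection, and security terms sit in the same index as words and structure, a constrained search reduces to intersecting term lists. A minimal sketch, assuming sorted doc-ID lists and a linear merge; the term strings and lists for the last three terms are hypothetical.

```python
def intersect(*term_lists):
    """Intersect sorted doc-ID lists with a linear merge, starting from the shortest."""
    lists = sorted(term_lists, key=len)
    result = lists[0]
    for other in lists[1:]:
        merged, i, j = [], 0, 0
        while i < len(result) and j < len(other):
            if result[i] == other[j]:
                merged.append(result[i])
                i += 1
                j += 1
            elif result[i] < other[j]:
                i += 1
            else:
                j += 1
        result = merged
    return result

# Hypothetical term lists in the style of slide 31.
universal_index = {
    '"data base"': [126, 130, 167, 212, 219, 377],
    'Directory("/articles")': [126, 130, 167, 300],
    "Collection(Red)": [130, 167, 212],
    "Role:Editor+Action:Read": [126, 130, 167, 212],
}
visible_hits = intersect(*universal_index.values())
print(visible_hits)
```

In this picture the security term is just one more list in the intersection, which is one way to read "invisible to your app".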
  • 32. Universal Index: Schema Agnostic. XML is self-describing
      <article>
        <title>MarkLogic Server: . . .</title>
        <author>
          <first-name>Dale</first-name>
          <last-name>Kim</last-name>
        </author>
        <abstract> . . . .<company>Mark Logic</company> </abstract>
        <body>
          <section>
            <section>. . .</section>
          </section>
          <section>. . . index . . . </section>
        </body>
        <copyright>Copyright© . . . </copyright>
      </article>
  • 33. Load As Is. XML is self-describing
      [Animation: the <article> document from slide 32 (company spelled “MarkLogic”) is taken apart into its elements: <article>, <title>, <author>, <first-name>, <last-name>, <abstract>, <company>, <body>, <section>, <copyright>.]
  • 34. Load As Is. XML is self-describing
      [Figure: the same document as a tree. <article> has children <title>, <author>, <abstract>, <body>, <copyright>; leaf values include "MarkLogic Server: . . .", "Dale", "Kim", "MarkLogic", " . . . index. . . ".]
  • 35. Load As Is. XML is self-describing. No Schema Needed!
      [Figure: same tree as slide 34.]
  • 37. Spatial Indexing: Points are ordered in latitude-major order; special scan operators apply geospatial query constraints
      [Figure: GEOSPATIAL INDEX. Entries pair latitude,longitude points (0,-167; 0,-124; 0,-29; 0,-12; 0,0; -10,10.5; 10.1,35.553; …) with document references (113, 123, 126, 127, 130, 167, ...).]
  • 38. Spatial Query
      - Data examples: Latitude / Longitude; any other pair (e.g. volume / price)
      - Query types: Point (exact value); Point-Radius (circle); Lat/Lon bound (Mercator “rectangle”); Polygon (10K+ vertices)
      - Composition with: Full Text; XML structure; XML semantics; other range indexes (e.g. temporal)
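The latitude-major ordering from slide 37 can be sketched as a sorted list of (lat, lon, doc ID) tuples: a binary search finds the latitude band, then a scan filters by longitude. The point data below is loosely taken from the slide's figure, and `box_query` is a hypothetical name. The sketch ignores boxes that cross the antimeridian.

```python
from bisect import bisect_left, bisect_right

# (lat, lon, doc_id) tuples sorted latitude-major; values are illustrative.
points = sorted([
    (-10.0, 10.5, 123), (0.0, -167.0, 126), (0.0, -124.0, 130),
    (0.0, -29.0, 130), (0.0, -12.0, 113), (0.0, 0.0, 127),
    (10.1, 35.553, 113),
])

def box_query(index, south, north, west, east):
    """Binary-search the latitude bounds, then scan that slice for longitude bounds."""
    lo = bisect_left(index, (south, float("-inf"), float("-inf")))
    hi = bisect_right(index, (north, float("inf"), float("inf")))
    return sorted({doc for _lat, lon, doc in index[lo:hi] if west <= lon <= east})

hits = box_query(points, south=-1, north=1, west=-30, east=1)
print(hits)
```

Circle and polygon constraints would follow the same shape: narrow by latitude first, then test each candidate point against the region.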
  • 40. Query Registration
      - Canonicalize a cts:query()
      - Hash for an ID
      - Resolve the query
      - Cache the term list in memory
      - Reuse as materialized sub-query (AKA Topics, Concepts, Macros, etc.)
      - (This is not alerting)
  • 41. Registered Query (with Range Indexes alongside)
      UNIVERSAL INDEX
        Term                                      Term List (document references)
        “which”                                   123, 127, 129, 152, 344, 791 . . .
        “uniquely”                                122, 125, 126, 129, 130, 167 . . .
        “identify”                                123, 126, 130, 142, 143, 167 . . .
        “each”                                    123, 130, 131, 135, 162, 177 . . .
        “data base”                               126, 130, 167, 212, 219, 377 . . .
        <article>                                 . . .
        <article>/<abstract>                      . . .
        <product>IMS</product>                    . . .
        Directory(“/articles/”)                   . . .
        Collection(Red)                           . . .
        Role:Editor + Action:Read                 . . .
        cts:query(<cts:word-query><cts:text>…)    . . .
        . . .                                     126, 130, 167, …
  • 42. Query Indexing: “Alerting”
      - Also known as: real-time search, selectors, tippers, standing queries, filters, “triggers*”, content-based routing, stream DBMS, etc.
      - Search(query, Index[docs]) -> docs
      - Alert(doc, Index[queries]) -> queries
      - Queries are XML documents: XML serialization of cts:query()
      - First step is O(1) in number of queries; O(n) in results returned
      - cts:reverse-query()
      - Example alert: “New patient matching your study profile”
      * MarkLogic also has pre- and post-commit DBMS triggers, which are unrelated
  • 43. The Reverse Index
      [Figure: REVERSE INDEX. Stored queries with document references (437, 562, 597, 623, . . .) on one side, unified expression trees on the other.]
      Stored queries:
        year >= 1970 and (“data” and “size”)
        year > 2003 and (“data” and “web”)
        year < 2000 and (“data” and “web”)
        (2000 <= year <= 2010) and “web”
      Shared leaves of the unified expression trees: “data”, “size”, “web”, year >= 1970, year < 2000, year >= 2000, year > 2003, year <= 2010
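A simplified sketch of the reverse index: stored queries are indexed by their word terms, an incoming document nominates the queries that share a term with it, and each nominee is evaluated with sub-expression results remembered across queries. The tuple query format and the assignment of IDs 437, 562, 597 to particular queries are assumptions; the slide does not make that mapping recoverable.

```python
import operator
from collections import defaultdict

OPS = {">=": operator.ge, ">": operator.gt, "<": operator.lt, "<=": operator.le}

# Stored queries as nested tuples; the ID-to-query mapping is hypothetical.
queries = {
    437: ("and", ("year", ">=", 1970), ("and", ("term", "data"), ("term", "size"))),
    562: ("and", ("year", ">", 2003), ("and", ("term", "data"), ("term", "web"))),
    597: ("and", ("year", "<", 2000), ("and", ("term", "data"), ("term", "web"))),
}

def leaf_terms(q):
    """Yield the word terms at the leaves of a query tree."""
    if q[0] == "term":
        yield q[1]
    elif q[0] in ("and", "or"):
        for sub in q[1:]:
            yield from leaf_terms(sub)

# Reverse index: word term -> IDs of stored queries that mention it.
reverse_index = defaultdict(set)
for qid, q in queries.items():
    for term in leaf_terms(q):
        reverse_index[term].add(qid)

def evaluate(q, doc, memo):
    """Evaluate a query tree against one document, remembering shared sub-queries."""
    if q not in memo:
        if q[0] == "term":
            memo[q] = q[1] in doc["words"]
        elif q[0] == "year":
            memo[q] = OPS[q[1]](doc["year"], q[2])
        elif q[0] == "and":
            memo[q] = all(evaluate(s, doc, memo) for s in q[1:])
        else:
            memo[q] = any(evaluate(s, doc, memo) for s in q[1:])
    return memo[q]

def alert(doc):
    """Return IDs of stored queries matched by the document."""
    nominated = set()
    for word in doc["words"]:
        nominated |= reverse_index.get(word, set())
    memo = {}
    return sorted(qid for qid in nominated if evaluate(queries[qid], doc, memo))

print(alert({"words": {"data", "web", "scale"}, "year": 2008}))
```

Nomination here uses word terms only, so a stored query made purely of range constraints would never be nominated; a fuller version would index range leaves as well.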
  • 44. Alerting in Composition
      - Scalar
          - Query on scalar data with range queries
          - Alert on range data with scalar reverse-queries
      - Geospatial
          - Query on point data with box, circle, polygon query constraint; compose with text, structure, XML-semantic query
          - Alert on box, circle, polygon data with point reverse-query; compose with text, structure, XML-semantic data
      - Search
          - [Forward-]Query composes with Boolean operations (AND, OR, NOT, (()()(()())))
          - Reverse- and forward-query compose (AND, OR, NOT, (()()(()())))
          - Why would you ever want to do that?
  • 46. Search Composed with Alerting
      - In Soviet Russia, the document searches YOU!
      - If you express yourself as XML
      - Documents are XML
          - Elements: text; typed data; serialized query against documents [cts:query()]
          - Attributes: typed data
      - Composition: Boolean (AND) incorporating [forward-]query and reverse-query
  • 47. Matchmaking
      - Constraints upon each other
      - Matching pairs or one-to-many
      - Examples
          - Suitable date: man / woman mixed pool
          - Employment: job / resume
          - Medication: patient / drug
          - Search security: document / user
          - Battle: target / shooter
          - Carpool ride: driver / rider
  • 48. Carpool
      - Driver
          - Non-smoking woman driving from San Ramon to San Carlos, leaving at 8AM, listens to rock, pop, hip-hop, wants $10 for gas
          - Requires female passenger within five miles of start and end
      - Passenger
          - Woman will pay up to $20
          - From: 3001 Summit View Dr, San Ramon, CA 94582
          - To: 400 Concourse Drive, Belmont, CA 94002
          - Requires non-smoking car
          - Won’t listen to country music
  • 49. Driver
      let $from := cts:point(37.751658, -121.898387) (: San Ramon :)
      let $to := cts:point(37.507363, -122.247119) (: San Carlos :)
      return xdmp:document-insert(
        "/driver.xml",
        <driver>
          <from>{$from}</from>
          <to>{$to}</to>
          <when>2010-01-20T08:00:00-08:00</when>
          <gender>female</gender>
          <smoke>no</smoke>
          <music>rock, pop, hip-hop</music>
          <cost>10</cost>
          <preferences>{
            (: serialized query: female passenger within 5 miles of both endpoints :)
            cts:and-query((
              cts:element-value-query(xs:QName("gender"), "female"),
              cts:element-geospatial-query(xs:QName("from"), cts:circle(5, $from)),
              cts:element-geospatial-query(xs:QName("to"), cts:circle(5, $to))
            ))
          }</preferences>
        </driver>)
  • 55. Passenger
      xdmp:document-insert(
        "/passenger.xml",
        <passenger>
          <from>37.739976,-121.915821</from>
          <to>37.53244,-122.270969</to>
          <gender>female</gender>
          <preferences>{
            (: serialized query: no country music, at most $20, non-smoking, female driver :)
            cts:and-query((
              cts:not-query(cts:element-word-query(xs:QName("music"), "country")),
              cts:element-range-query(xs:QName("cost"), "<=", 20),
              cts:element-value-query(xs:QName("smoke"), "no"),
              cts:element-value-query(xs:QName("gender"), "female")
            ))
          }</preferences>
        </passenger>)
  • 63. Driver
      (: I'm the driver, find me passengers :)
      let $me := fn:doc("/driver.xml")/driver
      for $match in cts:search(/passenger,
        cts:and-query((
          cts:query($me/preferences/element()),  (: forward: my preferences against passengers :)
          cts:reverse-query($me)                 (: reverse: passengers' preferences against me :)
        )))[1 to 3]
      return fn:base-uri($match)
  • 67. Passenger
      (: I'm a passenger, find me a driver :)
      let $me := fn:doc("/passenger.xml")/passenger
      for $match in cts:search(/driver,
        cts:and-query((
          cts:query($me/preferences/element()),  (: forward: my preferences against drivers :)
          cts:reverse-query($me)                 (: reverse: drivers' preferences against me :)
        )))
      return fn:base-uri($match)
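The logic of the carpool match can be restated outside MarkLogic as a plain sketch: each party carries data plus preferences about the other, and a match requires acceptance in both directions (the forward query and the reverse query). The dictionaries reuse the slide's values; the second passenger, the haversine helper, and all function names are hypothetical, and nothing here uses an index.

```python
from math import asin, cos, radians, sin, sqrt

def miles_between(a, b):
    """Great-circle distance in miles between two (lat, lon) points (haversine)."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 3958.8 * asin(sqrt(h))

driver = {"from": (37.751658, -121.898387), "to": (37.507363, -122.247119),
          "gender": "female", "smoke": "no",
          "music": ["rock", "pop", "hip-hop"], "cost": 10}
driver["preferences"] = [
    lambda p: p["gender"] == "female",
    lambda p: miles_between(p["from"], driver["from"]) <= 5,
    lambda p: miles_between(p["to"], driver["to"]) <= 5,
]

passenger = {"from": (37.739976, -121.915821), "to": (37.53244, -122.270969),
             "gender": "female"}
passenger["preferences"] = [
    lambda d: "country" not in d["music"],
    lambda d: d["cost"] <= 20,
    lambda d: d["smoke"] == "no",
    lambda d: d["gender"] == "female",
]

# Hypothetical second passenger who is too far away to match.
far_passenger = {"from": (34.05, -118.24), "to": (34.10, -118.30),
                 "gender": "female", "preferences": []}

def accepts(me, other):
    """True when every one of my preferences holds for the other party."""
    return all(rule(other) for rule in me["preferences"])

def mutual_matches(me, candidates):
    """Forward check (my preferences) AND reverse check (their preferences)."""
    return [uri for uri, other in candidates.items()
            if accepts(me, other) and accepts(other, me)]

matches = mutual_matches(driver, {"/passenger.xml": passenger, "/far.xml": far_passenger})
print(matches)
```

This version checks every candidate in turn. The slides' claim is that the same two-way test can be answered from indexes, because the stored preferences are themselves indexed queries.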
  • 69. Search and Alerting Composed
      - XML data typing expectations
          - In both directions
      - Arbitrary schema
          - At each document
          - In each query
          - Don’t [have to] declare anything
          - You choose the logic for empty data
          - Specify typed range indexes for scalar, geo, faceting
      - Search engine speed and scalability
          - O(1) term lookup: word, structure, values, query sub-expression
          - O(log(n)) range lookup and term list intersection
          - Shared-nothing (sharded) query evaluation
  • 70. Document Security
      - Document queries the user
          - Rules for who can see me, the document
      - Open, ad-hoc security [lack of] model
          - Each document declares any rules it wants
          - “user eye color not blue”
      - Descriptive model/schema if desired
      - Extensible without changing DBMS schema
  • 71. Medication
      - Patient: Diagnosis; Background; Idiopathic history; Vital Statistics; Treatment baseline
      - Drug: Therapeutics; Side effects; Interactions; Contraindications
  • 72. What is a strange loop?
      - Not mere hierarchies of abstraction
          - Abstraction is simplification and distortion
          - Useful if bounded and well-ordered
          - Problems if poorly bounded: Watch Ourselves
      - Strange loops are disorderings of the hierarchy: heterarchy
          - Hofstadter: “a paradoxical level-crossing feedback loop”
          - Good strange loops gracefully accommodate the disordering
      - Established examples in Computer Science
          - Gödel’s Incompleteness Theorems
          - Self-compiling languages
  • 73. Strange Loop: Fwd∘Rev Query
      - Queries are an abstraction over the data: /myroot//foo[@bar=7]/bash
      - Reverse-query
          - Indexed data are [serialized] queries.
          - Documents are the queries.
          - Queries can still be abstractions over a stream of docs.
      - Composing forward and reverse query: Strange Loop!
      [Image: Escher, Drawing Hands]
  • 74. Strange Loop: XQuery throughout the Stack
      - Query language is an abstraction over the data
      - But the query language is re-used in the application
      - The [No]SQL is the PL
      - Declarative query becomes functional programming
      - Creeping
      - Lazy evaluation
      - Parallelization
      - Discarding unneeded work
      - Schrödinger’s tuple
      - Not religious: side effects available when required
      - DBMS transactions
      - xdmp:set()
      [Image: Escher, Waterfall]
  • 75. Strange Loop: XQuery on XQuery
      - In SVN
      - Project organization
      - No search
      - Dependency tracking
      - Import requires absolute paths
      - Namespace prefix conflicts
      - Surprise modules (functx)
      - XQuery (with cts: extensions) to discover and automate imports
  • 76. The Composable, Universal Index
      - Full text
      - XML structure
      - XML semantics
      - Range indexes: range queries, aggregations, co-occurrence
      - Spatial indexes
      - Query indexes
  • 77. Databases: a database of documents, stored in partitions
      [Diagram: one Database made up of Partition 1, Partition 2, Partition 3.]
  • 78. Simple Architecture
      [Diagram: a single Host holding partition 1, partition 2, partition 3.]
  • 79. Shared Nothing Architecture
      [Diagram: Host 1 and Host 2, with partitions 1 to 3 distributed between them.]
  • 80. Bi-Directionally Scalable Architecture
      [Diagram: Hosts 1 to 6 … Host k, with partitions 1 to 4 … partition m.]
  • 82. Multiversion Concurrency Control
      [Figure: two versions of /articles/codd.xml shown as document trees (Document, Title, Author with First and Last, Metadata with Year, Section nodes). Each node version carries a creation timestamp (c) and a deleted timestamp (d); values shown are 523, 628, and ∞.]
  • 83. Multiversion Concurrency Benefits
      - High throughput
          - Queries don’t require locks
          - Queries and updates do not conflict
      - ACID
          - Cluster consistency: 2-phase commit
      - Zero-latency ingestion and indexing
          - Append only
          - Ingest/update rates of ~400GB per partition per day
      [Figure: /articles/codd.xml document tree with timestamps 628 and ∞.]
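The creation and deleted timestamps on slide 82 suggest how lock-free reads can work: a query runs at a fixed timestamp and sees only versions that were created at or before it and not yet deleted. The class below is a minimal single-process sketch of that idea, with hypothetical names; it does not model transactions, two-phase commit, or how MarkLogic stores versions.

```python
import math

class VersionedStore:
    """Append-only store: every write adds a version, nothing is overwritten."""

    def __init__(self):
        self.clock = 0
        self.versions = []  # dicts with uri, content, created, deleted

    def write(self, uri, content):
        """Mark the current version as deleted at the new timestamp, then append."""
        self.clock += 1
        for v in self.versions:
            if v["uri"] == uri and v["deleted"] == math.inf:
                v["deleted"] = self.clock
        self.versions.append(
            {"uri": uri, "content": content, "created": self.clock, "deleted": math.inf})
        return self.clock

    def read(self, uri, at):
        """Return the version visible at timestamp `at`, with no locking."""
        for v in self.versions:
            if v["uri"] == uri and v["created"] <= at < v["deleted"]:
                return v["content"]
        return None

store = VersionedStore()
t1 = store.write("/articles/codd.xml", "first version")
snapshot = store.clock            # a long-running query starts here
t2 = store.write("/articles/codd.xml", "second version")
print(store.read("/articles/codd.xml", snapshot), "|", store.read("/articles/codd.xml", t2))
```

The reader holding `snapshot` keeps seeing the first version even after the update, which is why queries and updates do not conflict in this scheme.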
  • 85. A Single Forest
      [Diagram: a Host holding Forest k, which consists of Stand 1, Stand 2, … Stand n and two Buffers.]
  • 86. 1. Create A New Tree
      [Diagram: same forest layout as slide 85.]
  • 87. 2. Expire Trees
      [Diagram: same forest layout as slide 85.]
  • 88. 3. Save A Buffer To Disk
      [Diagram: same forest layout as slide 85.]
  • 89. 4. Optimization: Merge Stands
      [Diagram: Host with Buffer and Forest k after merging.]
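The forest steps above follow the general log-structured merge pattern: writes go to a buffer, a full buffer is saved as an immutable stand, reads consult the newest data first, and stands are merged in the background. The class below is a generic LSM sketch under those assumptions; the key/value model, the tiny buffer limit, and the method names are hypothetical and say nothing about MarkLogic's on-disk format.

```python
class Forest:
    """Toy log-structured store: one in-memory buffer plus immutable sorted stands."""

    def __init__(self, buffer_limit=2):
        self.buffer_limit = buffer_limit
        self.buffer = {}
        self.stands = []  # oldest first; each stand is a sorted tuple of (key, value)

    def insert(self, key, value):
        """Write to the buffer; save it as a new stand once it is full."""
        self.buffer[key] = value
        if len(self.buffer) >= self.buffer_limit:
            self.flush()

    def flush(self):
        """Save the buffer as an immutable stand (stand-in for writing to disk)."""
        if self.buffer:
            self.stands.append(tuple(sorted(self.buffer.items())))
            self.buffer = {}

    def get(self, key):
        """Check the buffer first, then stands from newest to oldest."""
        if key in self.buffer:
            return self.buffer[key]
        for stand in reversed(self.stands):
            for k, v in stand:
                if k == key:
                    return v
        return None

    def merge(self):
        """Combine all stands into one, keeping the newest value for each key."""
        merged = {}
        for stand in self.stands:  # oldest first, so newer values overwrite older
            merged.update(stand)
        self.stands = [tuple(sorted(merged.items()))] if merged else []

forest = Forest(buffer_limit=2)
for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
    forest.insert(key, value)
stands_before = len(forest.stands)
forest.merge()
print(stands_before, len(forest.stands), forest.get("a"))
```

Merging trades background I/O for fewer stands to consult per read, which is why the slide lists it as an optimization rather than a required step.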
  • 93. How Did We Get Here?
      - Founder: Christopher Lindblad
          - MIT
          - Architect of Ultraseek Server, an intranet search engine product
      - Met people that wanted to use a search engine like a database
          - Rich query language
          - Guaranteed correctness
          - Transactions
  • 94. So We Built
      - XML as data model: ad hoc schema
      - A search engine core: Universal Index
      - Transaction model based on multiversion concurrency: high throughput while keeping . . .
      - Performance and scalability of a search engine
  • 96. Who Uses MarkLogic?
      Magazine; Publishing; Education; Healthcare; Software / Services; Legal; Tax; Financial; Enterprise; Aggregation; Scientific; Technical; Medical

Editor's notes

  1. The index is ordered latitude-major: an O(log n) lookup finds the latitude bounds, then a scan applies the longitude bounds.
  2. Generate all the normal indexing terms for the reverse-query document, then do a linear merge to match query-document terms with the root nodes of the unified expression tree. Based on which terms do or don't match, nominate documents that may contain matching queries. For each nominated query-document, evaluate from the root of the query tree on the right side towards the leaf nodes at the left of the slide. Once a subtree has been evaluated for one query-document, we remember the result and short-circuit that evaluation for any other query-documents that share the subquery.