1. Data Segmenting in Anzo
Contact:
Lee Feigenbaum
lee@cambridgesemantics.com
©2011 Cambridge Semantics Inc. All rights reserved.
2. Simple Introduction to Cambridge Semantics & Anzo
• Cambridge Semantics is a software startup founded
by a team of engineers from IBM’s Advanced Internet
Technology group in 2007
• We sell the Anzo platform and tools to (mainly)
Fortune 500 companies
• Anzo is Semantic Web middleware that often stores
large amounts of data for diverse uses
3. We Use Named Graphs
• Primary tool for segmenting data in Anzo
• Smallest unit of granularity for:
– Versioning & provenance
– Access control
– Notifications
– Replication
• (Concretely: we use TriG extensively)
4. Which Triples Go Into a Named Graph?
• Everything
– Effectively a triple store
• Single triple
– Gives per statement access control, etc.
• Whatever was in the source document
– OK in some cases, but documents are often an artificial
construct
– What happens when doing a bulk load of hundreds of
millions of triples?
• All triples that share a subject
– Decent compromise / default state in our experience
• Closure of triples from a given subject following
predicates annotated as “internal”
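The last two strategies can be sketched as follows. This is a minimal illustration, not Anzo code: triples are assumed to be plain Python tuples, the set of "internal" predicates is assumed to be supplied by the caller, and the function names and `ex:` identifiers are hypothetical.

```python
from collections import defaultdict

def segment_by_subject(triples):
    """Group triples into one graph per subject, keyed by the subject."""
    graphs = defaultdict(set)
    for s, p, o in triples:
        graphs[s].add((s, p, o))
    return dict(graphs)

def internal_closure(triples, root, internal_predicates):
    """Collect the triples reachable from root, following only 'internal' predicates."""
    by_subject = segment_by_subject(triples)
    graph, seen, stack = set(), set(), [root]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        for s, p, o in by_subject.get(node, ()):
            graph.add((s, p, o))
            # Only objects of internal predicates are pulled into this graph.
            if p in internal_predicates:
                stack.append(o)
    return graph

# Hypothetical sample data; the modeling of the debut showing is an assumption.
triples = [
    ("ex:PulpFiction", "ex:director", "ex:Tarantino"),
    ("ex:PulpFiction", "ex:debut", "ex:showing1"),
    ("ex:showing1", "ex:date", "1994-10-14"),
    ("ex:Tarantino", "ex:fullName", "Quentin Jerome Tarantino"),
]
by_subject = segment_by_subject(triples)
film_graph = internal_closure(triples, "ex:PulpFiction", {"ex:debut"})
```

Here `ex:director` is not marked internal, so the triples about the director stay out of the film's graph, while the showing's triples are pulled in.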
5. Typical Anzo Data Segmenting
[Figure: example of data segmented into graphs. Pulp Fiction: debut showing 10/14/1994; budget $8,500,000; director Tarantino. Tarantino: full name Quentin Jerome Tarantino; birth date 3/27/1963; directed Reservoir Dogs.]
6. Impact of Typical Anzo Data Segmenting
• Many, many (millions of) small graphs
• Often corresponds with the natural granularity at
which you want to do things like
permissions, versioning, alerting, etc.
• Significant overhead for per-graph metadata
– Sometimes encourages other partitioning schemes
7. Finding the Graph for a Particular Resource
• Default case: graph name is the same as the resource
name
– Not kosher, but works well
• Fallback case: system-wide SPARQL query
• General case: graph resolution framework that can
identify appropriate graph(s) via:
– SPARQL DESCRIBE query (just kicks the can down the road
a bit)
– Lookup (registry)
– Pattern matching (similar to POWDER)
• (Graphs do not have to be local; sometimes
resolution ends up retrieving them via HTTP or from
an RDB)
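A graph resolution framework along these lines might look like the sketch below. It is an assumption-laden illustration rather than Anzo's implementation: the class `GraphResolver`, its methods, and the order in which strategies are tried (registry, then patterns, then the default case) are all hypothetical, and the SPARQL-based strategies are omitted.

```python
import re

class GraphResolver:
    """Hypothetical resolver: registry lookup, then pattern rules, then the default case."""

    def __init__(self):
        self.registry = {}   # resource URI -> set of graph URIs
        self.patterns = []   # (compiled regex, graph URI) pairs

    def register(self, resource, graph):
        self.registry.setdefault(resource, set()).add(graph)

    def add_pattern(self, regex, graph):
        self.patterns.append((re.compile(regex), graph))

    def resolve(self, resource):
        # Lookup (registry)
        if resource in self.registry:
            return set(self.registry[resource])
        # Pattern matching on the resource URI
        matched = {g for rx, g in self.patterns if rx.match(resource)}
        if matched:
            return matched
        # Default case: the graph name is the same as the resource name
        return {resource}

resolver = GraphResolver()
resolver.register("ex:Tarantino", "ex:peopleGraph")
resolver.add_pattern(r"^ex:film/", "ex:filmsGraph")
```

With this setup, `resolver.resolve("ex:film/PulpFiction")` falls through the registry and is caught by the pattern rule, while an unknown resource resolves to a graph with its own name.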
8. Accessing Graphs
• Replication service
– Chunked to handle large graphs gracefully
– Client replicas kept up to date via JMS-driven notification
service
– Replicas are cached aggressively – encourages smaller
graphs to limit client memory footprint (e.g. in a Web
browser)
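The slides do not describe the replication protocol itself, so the following is only a generic sketch of chunking: a graph's statements are yielded in fixed-size batches so that neither side has to hold a large graph in a single message. The function name `chunked` and the batch size are assumptions.

```python
from itertools import islice

def chunked(statements, size):
    """Yield successive lists of at most `size` statements."""
    it = iter(statements)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

# Hypothetical usage: five statements sent in batches of two.
batches = list(chunked(range(5), 2))
```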
9. Linked Data in Anzo
• Data in Anzo can be exposed as linked data
• Anzo will dereference external URIs to get at
data, but that’s of limited utility
– Allows single-instance views, but not faceted browsing
• Anzo does not use linked data internally for data
access
• Linked Data consumption/publication is a
feature, not a core part of Anzo’s architecture
10. Accessing Graphs
• SPARQL queries
– Clients (e.g. Anzo on the Web faceted browser) target
subsets of the server data with SPARQL queries
– Impractical to enumerate millions of graphs in FROM or FROM
NAMED clauses
– Extend SPARQL with named datasets
• Server-based lists of graphs that comprise an RDF dataset (default
graph and named graphs)
• Add FROM DATASET clause to reference named datasets from a
query
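The idea of a named dataset can be illustrated with a small server-side lookup. This is a sketch under stated assumptions, not Anzo's API: `NamedDataset`, `resolve_datasets`, and the `ex:` URIs are hypothetical, and the exact syntax of the FROM DATASET clause is not shown because the slides do not give it.

```python
from dataclasses import dataclass, field

@dataclass
class NamedDataset:
    """Server-side list of graphs that make up one RDF dataset."""
    default_graphs: set = field(default_factory=set)
    named_graphs: set = field(default_factory=set)

def resolve_datasets(dataset_uris, registry):
    """Union the graph lists of every named dataset a query references."""
    default, named = set(), set()
    for uri in dataset_uris:
        ds = registry[uri]
        default |= ds.default_graphs
        named |= ds.named_graphs
    return default, named

# Hypothetical registry kept on the server.
DATASETS = {
    "ex:films": NamedDataset({"ex:PulpFiction"}, {"ex:PulpFiction", "ex:ReservoirDogs"}),
    "ex:people": NamedDataset(set(), {"ex:Tarantino"}),
}
default_graphs, named_graphs = resolve_datasets(["ex:films", "ex:people"], DATASETS)
```

The query then names one or two datasets instead of enumerating millions of graphs, and the server expands them into the default graph and named graphs of the RDF dataset.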
11. Anzo and other Sem Web Technologies
• Everything described in RDFS and OWL (used as a rich
data modeling language mostly)
• We publish RDFa
• We use JSON serializations of SPARQL results and RDF
• We implement SPARQL Update but don’t use it from our
tools
• SPARQL-based rules (used to be CONSTRUCT, now INSERT)
• We use SPARQL ASK queries for transaction preconditions
and validation
• We have our own long-in-the-tooth implementation of
the D2RQ mapping language that we don’t use often
12. This is the full architecture that drives the Anzo
Server and applications.
15. We can’t & shouldn’t standardize everything.
• Need to leave room for competitive differentiation
that goes beyond simply who has the “best”
implementation of a standard
• For standardization work, take a disciplined approach
to identifying what problems are both:
– Costly (a.k.a. valuable to solve)
– Impacting interoperability
16. What we could use
• We often get asked “can we use your tools against
<insert arbitrary SPARQL endpoint or linked data
source here>?”
– “No.”
• We need standards for & adoption of:
– Richly advertising contents of linked data sources
• cf. VoID
– Richly advertising capabilities of SPARQL endpoints
• cf. SPARQL 1.1 Service Description and Basic Federated Query
– Named datasets
– Various other SPARQL extensions (though we can work
around many of these)