The document discusses how the World Wide Web can be viewed as the world's largest NoSQL distributed data system. It describes how core web technologies like URIs, HTTP, and HTML enable querying and retrieving data from distributed sources. While the web has limitations like inconsistent results and availability issues, caching, APIs, and content delivery networks help optimize the system. The document argues the web's approach to querying distributed data sources can be improved by reforming URIs and hyperlinks to enhance query semantics and better support non-presentation needs.
4. History
• DNS (1983)
The first large-scale distributed data
system (DDS), using flat files
• WWW (1989)
“a single user-interface to many large classes of stored information such as reports, notes, data-bases, computer documentation and on-line systems help”
(Berners-Lee & Cailliau, 1989)
But Why NoSQL?
6. Typology of Hyperlink Queries
• Hypertext links come in two flavors:
transitive and intransitive
• Transitive queries are usually for inactive
content – presentation material to
supplement the user’s queried data
• Intransitive queries are user-actuated
and usually provide navigation and
business logic for the query
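One way to make the two link flavors concrete is to sort the links in a page by whether the browser fetches them on its own or waits for the user. The sketch below is illustrative only: the mapping of HTML tags to "transitive" and "intransitive" is an assumption made for this example, not something the slides specify, and `classify_links` is a hypothetical helper name.

```python
from html.parser import HTMLParser

# Assumed mapping (not from the slides): links the browser fetches
# automatically count as transitive; links that wait for a user action
# count as intransitive.
TRANSITIVE = {("img", "src"), ("script", "src"), ("link", "href")}
INTRANSITIVE = {("a", "href"), ("form", "action")}


class LinkClassifier(HTMLParser):
    """Collects link targets from a page, split into the two flavors."""

    def __init__(self):
        super().__init__()
        self.transitive = []
        self.intransitive = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        for name, value in attrs:
            if value is None:
                continue
            if (tag, name) in TRANSITIVE:
                self.transitive.append(value)
            elif (tag, name) in INTRANSITIVE:
                self.intransitive.append(value)


def classify_links(html):
    """Return (transitive, intransitive) link targets in document order."""
    parser = LinkClassifier()
    parser.feed(html)
    parser.close()
    return parser.transitive, parser.intransitive
```

Under this assumed mapping, a stylesheet or image reference lands in the first list and an anchor or form target in the second.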
8. What Do HTTP URIs Identify?
• Not a single resource
• WWWDB query syntax is split between
HTTP ‘verbs’ (POST, GET, PUT,
DELETE) and their objects, addressed
by URIs
• URI encapsulates a resource as the
object identified by a query
(Note that transitive and intransitive
hyperlinks almost always go to different
locations)
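The split between verbs and their objects can be sketched as a tiny in-memory store in which the URI path selects the object and the verb selects the operation. This is a simplified illustration, not a description of any real server: `run_query` is a hypothetical name, the store is a plain dictionary, and POST is treated like PUT purely to keep the example short.

```python
from urllib.parse import urlsplit


def run_query(store, verb, uri, body=None):
    """Apply an HTTP-style verb to the object addressed by the URI path."""
    path = urlsplit(uri).path  # the object of the query
    if verb == "GET":
        return store.get(path)
    if verb in ("PUT", "POST"):
        # Simplification: POST is handled like PUT in this sketch.
        store[path] = body
        return body
    if verb == "DELETE":
        return store.pop(path, None)
    raise ValueError(f"unsupported verb: {verb}")
```

For example, `run_query(store, "PUT", "http://example.com/users/1", {...})` followed by a `GET` on the same URI returns the stored object, and a `DELETE` removes it.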
9. CDN as a Caching Mechanism
• CDNs such as Akamai and CloudFront
provide local caching services for
WWWDB, mostly for static, presentation-
related objects
– Frequency-based caching for transitive
hyperlinks
– Most secondary queries go to the CDN
– 95%+ of all the bytes transported over the
Web
– ~90% of all WWWDB queries (HTTP
requests/responses)
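Frequency-based caching can be illustrated with a small cache that keeps the most often requested objects and only admits a new one when it has been requested more often than the coldest cached entry. This is a minimal sketch under stated assumptions, not how Akamai or CloudFront actually work: `FrequencyCache` and the `fetch` callable standing in for the origin server are hypothetical, and the capacity is assumed to be at least 1.

```python
from collections import Counter


class FrequencyCache:
    """Toy edge cache that keeps the most frequently requested URIs."""

    def __init__(self, capacity, fetch):
        self.capacity = capacity  # assumed to be >= 1
        self.fetch = fetch        # hypothetical callable: uri -> response body
        self.items = {}           # cached responses by URI
        self.hits = Counter()     # request count per URI

    def get(self, uri):
        self.hits[uri] += 1
        if uri in self.items:
            return self.items[uri]          # served from the cache
        value = self.fetch(uri)             # cache miss: go to the origin
        if len(self.items) >= self.capacity:
            coldest = min(self.items, key=lambda u: self.hits[u])
            # Admit the new object only if it is requested more often
            # than the least requested object currently cached.
            if self.hits[uri] <= self.hits[coldest]:
                return value
            del self.items[coldest]
        self.items[uri] = value
        return value
```

The admission check is one possible design choice: it keeps a single request for a rare object from evicting a popular one.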
10. APIs as Secondary Queries
• Active Subqueries
• Usually dynamic
• URIs function as a selection mechanism
• Often User-Actuated, Intransitive Events
• Query results often modify the display
11. REST as a Query Syntax Mechanism
• Common Semantics
– REST provides a
means of specifying
the proper query for
an object in a specific
state
• Demands NoSQL
due to state
constraints
• Uses query strings
for ranged searches
Image courtesy IBM
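A ranged search expressed through the query string might look like the sketch below. The parameter names `min` and `max`, the example URI, and the `ranged_search` helper are assumptions for illustration; REST itself does not prescribe them.

```python
from urllib.parse import urlsplit, parse_qs


def ranged_search(uri, records, field):
    """Filter records to those whose field lies within ?min=...&max=..."""
    params = parse_qs(urlsplit(uri).query)
    # A missing bound defaults to an open range on that side.
    low = float(params.get("min", ["-inf"])[0])
    high = float(params.get("max", ["inf"])[0])
    return [r for r in records if low <= r[field] <= high]
```

With records priced at 5, 20, and 80, a URI ending in `?min=10&max=50` selects only the record priced at 20.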
12. Indexing WWWDB
• Google, Bing, Yahoo! and other ‘index
searches’ on WWWDB
– Inconsistent results are accepted
• Query Cache or a Data Cache?
• Secondary Query Routing
• Alternative query indices – Wolfram
Alpha, Index Mundi, Twitter act as
‘almanacs’
13. Does the CAP Theorem Apply?
Yes, It Does, But Only Partially
• Partition and Availability – 404s, DDoS
• WWWDB Relaxes the Consistency
Constraint
• We accept inconsistent queries and
broken links as a tradeoff for real-time
availability and high-velocity updates
But We Can Do Better!
14. Drawbacks of the CAP Model
• Caching – Not all data is cached
everywhere
– Some sites are single-location/single source
– Hard (static) assets are far more widely
cached
• What does CAP mean when data is only
partially distributed?
– Very little – consistency only applies to part
of the queries
15. Improving WWWDB
• Better Data Clients
– HTML5 provides new query mechanisms via
WebSockets, Web Storage, and other
means
– Still mostly presentation-level improvements
• Better Caching, Distribution & Transport
– Work currently being done at IETF on HTTP
2.0
• Better Queries
– Very little work being done – more on this
16. RDF and the Semantic Web
• Changes query patterns but not storage
– Queries based on semantic ID of resource
• Requires content to be semantically
labeled
• Work on SPARQL reduces query
limitations
– But may also make things slower (!)
• Cloud computing and query distribution
will prove a more powerful force for
improving WWWDB than semantic
17. Browsers as Data Clients
• Presentation First!
– Data is treated as secondary
• Designed for Browsing Not Querying
– Query patterns are inefficient
– Semi-stateful nature of Web sessions
• Bedeviled with Legacy Issues
18. Optimizing Web Queries
• REST doesn’t imply FAST
– Use a domain model to limit query
endpoints
– May require unnecessary requests
• Query-string semantics allow for joins
and arbitrary comparisons
• Recognize that some queries require
state and use it
• Distribute intransitive queries more
widely
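As a rough illustration of a join carried in the query string, the sketch below reads two parameters and performs an equi-join between two in-memory tables. The parameter names `join` and `on`, the table layout, and the `join_query` helper are all hypothetical; no existing Web convention is implied.

```python
from urllib.parse import parse_qs


def join_query(tables, base, query):
    """Equi-join tables[base] with the table named by ?join=...&on=..."""
    params = parse_qs(query)
    other = params["join"][0]  # name of the table to join against
    key = params["on"][0]      # field shared by both tables
    # Index the joined table by key so each base row is matched in one lookup.
    index = {}
    for row in tables[other]:
        index.setdefault(row[key], []).append(row)
    return [
        {**left, **right}
        for left in tables[base]
        for right in index.get(left[key], [])
    ]
```

A request such as `/users?join=orders&on=uid` would then return one merged row per matching user and order pair, which is the kind of result that otherwise takes several round trips.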
19. Reforming Hypertext for Querying WWWDB
• Enlarge the number of link types
• Distinguish transitive links
• Add bidirectional linking
• Enhance the semantics of the query
string
• Make hypertext more useful for mobile
and other devices
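Typed, bidirectional links can be sketched as an index that records every link in both directions, so that a resource can be asked what points to it as well as what it points to. The `LinkIndex` class and the relation names used with it are hypothetical; today's hypertext has no such shared index.

```python
from collections import defaultdict


class LinkIndex:
    """Stores typed links and answers queries in either direction."""

    def __init__(self):
        self.outgoing = defaultdict(set)  # source -> {(rel, target)}
        self.incoming = defaultdict(set)  # target -> {(rel, source)}

    def add(self, source, target, rel):
        # Record the link once per direction so both lookups stay cheap.
        self.outgoing[source].add((rel, target))
        self.incoming[target].add((rel, source))

    def links_from(self, uri, rel=None):
        """Targets linked from uri, optionally limited to one link type."""
        return sorted(t for r, t in self.outgoing.get(uri, ()) if rel in (None, r))

    def links_to(self, uri, rel=None):
        """Sources linking to uri, optionally limited to one link type."""
        return sorted(s for r, s in self.incoming.get(uri, ()) if rel in (None, r))
```

Filtering by `rel` is where a larger set of link types would pay off: a client could ask only for the links of the kind it needs.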
20. IPv6 and Query Routing for WWWDB
• The IPv6 space is large enough to allow
for multiple query addressing schemes:
– Semantic addressing of objects by type
– Objects in the Internet of Things
– Dynamic, context driven addressing
21. Scaling the WWWDB
• This may require
expanding our notions
of URIs and links
(queries)
• Semantic mapping of
resources requires
additional complexity for
queries
• Explicit state
management for
efficiency
Every system has a
scaling limit
22. Final Thoughts
• The Web is the largest NoSQL Distributed
Data System
– URIs address the resultset of a NoSQL query
– Transitive and Intransitive hyperlinks
• We can add power and simplicity to our
queries by carefully reforming the URI
syntax and the current implementations of
hypertext
• HTTP and HTML are undergoing
significant evolution – now it’s time for
URIs!
23. Reconceiving the Web as a Distributed Data System
Thank You!
Daniel Austin
PayPal, Inc.
NoSQL Now! Conference
August 22, 2013
V1.2
@daniel_b_austin
Editor's Notes
We’ll use this idea to suggest improvements.
It has to be a ‘NoSQL’ system because it’s stateless and SQL is inherently stateful.