Reconceiving the Web as a Distributed (NoSQL) Data System

Daniel Austin
PayPal, Inc.
NoSQL Now! Conference
August 22, 2013
V1.2

The Big Idea
“The World-Wide Web is the World’s
Largest NoSQL Distributed Data
System”

History
• DNS (1983)
The first large-scale distributed data
system (DDS), using flat files
• WWW (1989)
“a single user-interface to many
large classes of stored information
such as reports, notes, data-
bases, computer documentation
and on-line systems help”
Berners-Lee & Cailliau,
1989
But Why NoSQL?

WWWDB: Anatomy
WWW
HTML
(Presentation)
URI
(Addressing)
HTTP
(Transport)

Typology of Hyperlink Queries
• Hypertext links come in two flavors:
transitive and intransitive
• Transitive queries are usually for inactive
content – presentation material to
supplement the user’s queried data
• Intransitive queries are user-actuated
and usually provide navigation and
business logic for the query
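
One way to make the two flavors concrete is to sort the links in an HTML page by how they are triggered. The sketch below assumes that links the browser fetches on its own (img, script, link, iframe) count as transitive, and that links waiting for a user action (a, form) count as intransitive. That mapping, and the names TRANSITIVE_TAGS, INTRANSITIVE_TAGS, and classify_links, are illustrative assumptions, not definitions from the talk.

```python
from html.parser import HTMLParser

# Assumed mapping (not from the talk): tag -> attribute holding the link target.
TRANSITIVE_TAGS = {"img": "src", "script": "src", "link": "href", "iframe": "src"}
INTRANSITIVE_TAGS = {"a": "href", "form": "action"}


class LinkClassifier(HTMLParser):
    """Collects link targets into transitive and intransitive lists."""

    def __init__(self):
        super().__init__()
        self.transitive = []
        self.intransitive = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag in TRANSITIVE_TAGS:
            target = attr_map.get(TRANSITIVE_TAGS[tag])
            if target:
                self.transitive.append(target)
        elif tag in INTRANSITIVE_TAGS:
            target = attr_map.get(INTRANSITIVE_TAGS[tag])
            if target:
                self.intransitive.append(target)


def classify_links(html):
    """Return (transitive, intransitive) link targets in document order."""
    parser = LinkClassifier()
    parser.feed(html)
    parser.close()
    return parser.transitive, parser.intransitive
```

Calling classify_links on a page's markup returns two lists, which gives a rough count of how many queries a page issues automatically versus how many wait for the user.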

Data Clients Query Data
Sources

What Do HTTP URIs Identify?
• Not a single resource
• WWWDB query syntax is split between
HTTP ‘verbs’ (POST, GET, PUT,
DELETE) and their objects, addressed
by URIs
• URI encapsulates a resource as the
object identified by a query
(Note that transitive and intransitive
hyperlinks almost always go to different
locations)
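
The verb-plus-object reading above can be sketched by splitting a request into its parts. The WebQuery type and the as_query helper are hypothetical names introduced for illustration; only the standard urllib.parse functions are real.

```python
from typing import NamedTuple
from urllib.parse import urlsplit, parse_qs


class WebQuery(NamedTuple):
    verb: str      # the HTTP method, read as the query operation
    source: str    # host that answers the query
    obj: str       # path naming the object of the verb
    params: dict   # query-string parameters that narrow the selection


def as_query(verb, uri):
    """Read an HTTP verb and URI as one query against a data source."""
    parts = urlsplit(uri)
    return WebQuery(
        verb=verb.upper(),
        source=parts.netloc,
        obj=parts.path or "/",
        params=parse_qs(parts.query),
    )
```

For example, as_query("GET", "http://example.com/orders?status=open") yields the verb GET, the object /orders, and the selection parameter status=open.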

CDN as a Caching Mechanism
• CDNs such as Akamai and CloudFront
provide local caching services for
WWWDB, mostly for static,
presentation-related objects
– Frequency-based caching for transitive
hyperlinks
– Most secondary queries go to the CDN
– 95%+ of all the bytes transported over the
Web
– ~90% of all WWWDB queries (HTTP
requests/responses)
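
Frequency-based caching can be illustrated with a small least-frequently-used cache. This is a minimal sketch and does not describe how any particular CDN works; FrequencyCache and its fetch callback are hypothetical names.

```python
from collections import Counter


class FrequencyCache:
    """Tiny least-frequently-used cache keyed by URI (illustrative only)."""

    def __init__(self, capacity, fetch):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.fetch = fetch        # callable that queries the origin server
        self.store = {}           # uri -> cached body
        self.hits = Counter()     # uri -> number of requests seen so far

    def get(self, uri):
        self.hits[uri] += 1
        if uri in self.store:
            return self.store[uri]
        body = self.fetch(uri)
        if len(self.store) >= self.capacity:
            # Evict the cached URI that has been requested least often.
            coldest = min(self.store, key=lambda u: self.hits[u])
            del self.store[coldest]
        self.store[uri] = body
        return body
```

Objects requested by many pages, such as a shared logo or stylesheet, accumulate hits quickly and therefore tend to stay in the cache.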

APIs as Secondary Queries
• Active Subqueries
• Usually dynamic
• URIs function as a selection mechanism
• Often User-Actuated, Intransitive Events
• Query results often modify the display

REST as a Query Syntax
Mechanism
• Common Semantics
– REST provides a means of specifying the proper query
for an object in a specific state
• Demands NoSQL due to state constraints
• Uses query strings for ranged searches
Image courtesy IBM
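
A ranged search expressed in a query string might look like the sketch below. The parameter names field, min, and max, and the ranged_select helper, are assumptions made for illustration; REST itself does not prescribe them.

```python
from urllib.parse import urlsplit, parse_qs


def ranged_select(uri, records):
    """Return records whose `field` value lies within [min, max].

    `field`, `min`, and `max` are hypothetical query-string parameters;
    a missing bound is treated as unbounded on that side.
    """
    params = parse_qs(urlsplit(uri).query)
    field = params["field"][0]
    low = float(params.get("min", ["-inf"])[0])
    high = float(params.get("max", ["inf"])[0])
    return [r for r in records if low <= r[field] <= high]
```

A request such as /payments?field=amount&min=10&max=50 would then select only the records whose amount falls in that range.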

Indexing WWWDB
• Google, Bing, Yahoo! and other ‘index
searches’ on WWWDB
– Inconsistent results are accepted
• Query Cache or a Data Cache?
• Secondary Query Routing
• Alternative query indices – Wolfram
Alpha, Index Mundi, Twitter act as
‘almanacs’

Does the CAP Theorem Apply?
Yes, It Does, But Only Partially
• Partition and Availability – 404s, DDoS
• WWWDB Relaxes the Consistency
Constraint
• We accept inconsistent queries and
broken links as a tradeoff for real-time
availability and high-velocity updates
But We Can Do Better!

Drawbacks of the CAP Model
• Caching – Not all data is cached
everywhere
– Some sites are single-location/single-source
– Hard (static) assets are far more widely
cached
• What does CAP mean when data is only
partially distributed?
– Very little – consistency only applies to part
of the queries

Improving WWWDB
• Better Data Clients
– HTML5 provides new query mechanisms via
WebSockets, Web Storage, and other
means
– Still mostly presentation-level improvements
• Better Caching, Distribution & Transport
– Work currently being done at IETF on HTTP
2.0
• Better Queries
– Very little work being done – more on this

RDF and the Semantic Web
• Changes query patterns but not storage
– Queries based on semantic ID of resource
• Requires content to be semantically
labeled
• Work on SPARQL reduces query
limitations
– But may also make things slower (!)
• Cloud computing and query distribution
will prove a more powerful force for
improving WWWDB than semantic

Browsers as Data Clients
• Presentation First!
– Data is treated as secondary
• Designed for Browsing Not Querying
– Query patterns are inefficient
– Semi-stateful nature of Web sessions
• Bedeviled with Legacy Issues

Optimizing Web Queries
• REST doesn’t imply FAST
– Use a domain model to limit query
endpoints
– May require unnecessary requests
• Query-string semantics allow for joins
and arbitrary comparisons
• Recognize that some queries require
state and use it
• Distribute intransitive queries more
widely
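
To illustrate joins and comparisons carried in the query string, the sketch below answers one request that would otherwise take several round trips. The join and where parameters, their colon-separated syntax, and the in-memory COLLECTIONS data are all invented for this example.

```python
import operator
from urllib.parse import urlsplit, parse_qs

# Hypothetical in-memory collections standing in for two Web resources.
COLLECTIONS = {
    "users": [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Lin"}],
    "orders": [
        {"order": 10, "user_id": 1, "total": 30},
        {"order": 11, "user_id": 2, "total": 5},
        {"order": 12, "user_id": 1, "total": 8},
    ],
}

COMPARATORS = {"gt": operator.gt, "lt": operator.lt, "eq": operator.eq}


def run_query(uri):
    """Evaluate ?join=<collection>:<left_key>:<right_key>&where=<field>:<op>:<value>."""
    parts = urlsplit(uri)
    params = parse_qs(parts.query)
    left_name = parts.path.strip("/")
    right_name, left_key, right_key = params["join"][0].split(":")
    field, op, value = params["where"][0].split(":")
    # Index the right-hand collection by its join key.
    index = {row[right_key]: row for row in COLLECTIONS[right_name]}
    results = []
    for row in COLLECTIONS[left_name]:
        match = index.get(row[left_key])
        if match is not None and COMPARATORS[op](row[field], float(value)):
            results.append({**match, **row})
    return results
```

With this syntax, /orders?join=users:user_id:id&where=total:gt:7 returns each qualifying order merged with its user record in a single response.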

Reforming Hypertext for
Querying WWWDB
• Enlarge the number of link types
• Distinguish transitive links
• Add bidirectional linking
• Enhance the semantics of the query
string
• Make hypertext more useful for mobile
and other devices
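
Typed, bidirectional links could be modeled as an index that records every link in both directions. This is a minimal sketch under that assumption; LinkIndex and the link-type names in the usage note are hypothetical.

```python
from collections import defaultdict


class LinkIndex:
    """Stores typed links so they can be followed forward or backward."""

    def __init__(self):
        self.outgoing = defaultdict(list)   # source -> [(link_type, target)]
        self.incoming = defaultdict(list)   # target -> [(link_type, source)]

    def add(self, source, link_type, target):
        self.outgoing[source].append((link_type, target))
        self.incoming[target].append((link_type, source))

    def links_from(self, source, link_type=None):
        """Targets linked from `source`, optionally filtered by link type."""
        return [t for kind, t in self.outgoing.get(source, [])
                if link_type in (None, kind)]

    def links_to(self, target, link_type=None):
        """Sources linking to `target`, optionally filtered by link type."""
        return [s for kind, s in self.incoming.get(target, [])
                if link_type in (None, kind)]
```

After idx.add("/a", "cites", "/b"), the call idx.links_to("/b") answers the question today's one-way hyperlinks cannot: which resources point here?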

IPv6 and Query Routing for
WWWDB
• The IPv6 space is large enough to allow
for multiple query addressing schemes:
– Semantic addressing of objects by type
– Objects in the Internet of Things
– Dynamic, context-driven addressing
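
As a rough illustration of addressing objects by type, the sketch below packs a type code into the low 64 bits of an IPv6 address. The bit layout, the TYPE_CODES registry, and the function names are invented for this example and do not correspond to any existing standard; the prefix 2001:db8::/64 comes from the range reserved for documentation.

```python
import ipaddress

# Hypothetical type registry; the codes are invented for this sketch.
TYPE_CODES = {"document": 0x0001, "sensor": 0x0002, "payment": 0x0003}
CODE_TYPES = {code: name for name, code in TYPE_CODES.items()}

# Taken from the IPv6 documentation range 2001:db8::/32.
PREFIX = ipaddress.IPv6Network("2001:db8::/64")


def address_for(object_type, object_id):
    """Pack a 16-bit type code and a 48-bit object id into the low 64 bits."""
    if not 0 <= object_id < 2**48:
        raise ValueError("object_id must fit in 48 bits")
    suffix = (TYPE_CODES[object_type] << 48) | object_id
    return ipaddress.IPv6Address(int(PREFIX.network_address) + suffix)


def type_of(address):
    """Recover the type name from the 16 bits above the 48-bit object id."""
    code = (int(ipaddress.IPv6Address(str(address))) >> 48) & 0xFFFF
    return CODE_TYPES[code]
```

Under this layout a router or client could tell from the address alone that a query targets, say, a sensor rather than a document.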

Scaling the WWWDB
• This may require expanding our notions of URIs
and links (queries)
• Semantic mapping of resources requires additional
complexity for queries
• Explicit state management for efficiency
Every system has a
scaling limit

Final Thoughts
• The Web is the largest NoSQL Distributed
Data System
– URIs address the result set of a NoSQL query
– Transitive and Intransitive hyperlinks
• We can add power and simplicity to our
queries by carefully reforming the URI
syntax and the current implementations of
hypertext
• HTTP and HTML are undergoing
significant evolution – now it’s time for
URIs!

Thank You!
@daniel_b_austin
