As the amount of GIS data we need to keep track of grows, the number of devices accessing it grows, and the volume of GIS writes grows, we’re finding that, much like real-time web applications, conventional RDBMSs are not well suited to scaling. This talk covers why GIS data is hard to scale in a conventional RDBMS, what nonrelational stores exist out there, and some basic examples of how to do spatial queries within a nonrelational store.
4. SimpleGeo
Scalable turnkey location infrastructure
Allows you to easily add geo-aware features to an existing application
The result: we need to store and query lots of data (the data set is already approaching 1TB, and we haven’t launched)
Tuesday, March 30, 2010
5. Scaling HTTP is easy
No shared state - shared-nothing architecture
• HTTP requests contain all of the information necessary to generate a response
• HTTP responses contain all of the information necessary for clients to interpret them
• In other words, requests are self-contained, and different requests can be routed to different servers
Uniform interface - allows middleware applications to proxy requests, creating a tiered architecture and making load balancing trivial
6. So what’s the problem?
Individual HTTP requests have no shared state, but the applications that communicate via HTTP can and do
Application state has to live somewhere
• The path of least resistance is usually a relational database
• But RDBMSs aren’t always the best tool for the job
7. Desirable Data Store Characteristics
Massively distributed
Horizontally scalable
Fault tolerant
Fast
Always available
8. Relational Databases
Based on the “relational model” first proposed by E.F. Codd in 1969
Tons of implementation experience and lots of robust open source and proprietary implementations
9. RDBMS Strengths
Theoretically pure
Clean abstraction
Declarative syntax
Mostly standardized
Easy to reason about data
10. ACID
Atomicity - if one part of a transaction fails, the entire transaction fails
Consistency - all data constraints must be met for a transaction to succeed
Isolation - other operations can’t see a transaction that has not yet completed
Durability - once the client has been notified that a transaction succeeded, the transaction will not be lost
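The four guarantees above are easiest to see in action. Here’s a minimal sketch using Python’s standard-library sqlite3 (the table and the CHECK constraint are invented for the example): when one statement of a transfer fails, rolling back discards the whole transaction as a unit.

```python
import sqlite3

# A minimal atomicity/consistency sketch: if any statement in a transaction
# fails, rolling back undoes all of it, so the data never ends up half-written.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "  name TEXT PRIMARY KEY,"
    "  balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    # Try to move more money than alice has; the CHECK constraint fails.
    conn.execute("UPDATE accounts SET balance = balance - 200 WHERE name = 'alice'")
    conn.execute("UPDATE accounts SET balance = balance + 200 WHERE name = 'bob'")
    conn.commit()
except sqlite3.IntegrityError:
    conn.rollback()  # the partial transfer is discarded as a unit

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # both rows are unchanged: {'alice': 100, 'bob': 0}
```

Client code can rely on never observing the half-finished transfer - that’s the simplifying assumption the speaker notes mention.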
11. RDBMS Weaknesses
SQL is opaque, and query planners don’t always do the right thing
• Geospatial SQL is particularly bad
The best ones are crazy expensive
Really bad at scaling writes
Strong consistency requirements make horizontal scaling difficult
12. RDBMS Writes
Relational databases almost always use B-Tree (or some other tree-based) indexes
Writes are typically implemented by doing an in-place update on disk
• Requires a random seek to a specific location on disk
• May require additional seeks to read indexes if they outgrow the disk cache
Disk seeks are bad.
13. CAP Theorem
There are three desirable characteristics of a shared data system that is deployed in a distributed environment like the web.
14. CAP Theorem
1. Consistency - every node in the system contains the same data (e.g., replicas are never out of date)
2. Availability - every request to a non-failing node in the system returns a response
3. Partition Tolerance - system properties (consistency and/or availability) hold even when the system is partitioned and data is lost
15. CAP Theorem
Choose two.
16. [Diagram] The client reads from and writes to both Node A and Node B, which replicate to each other (a master/master pair).
17. [Diagram] The client writes to Node A, which replicates the write to Node B.
18. [Diagram] Node B acknowledges the replicated write, and Node A responds to the client.
19. [Diagram] Node B fails (“o noes!”) before the write can replicate - should Node A still respond to the client?
20. What now?
1. Write fails: data store is unavailable
2. Write succeeds on Node A: data is
inconsistent
21. RDBMS Consistency
Relational databases prioritize consistency
Large-scale distributed systems need to be highly available
• As we add servers, a network partition or node failure becomes an inevitability
We could write an abstraction layer on top of a relational data store that trades consistency for availability
Or we could switch to a data store that prioritizes the characteristics we really want
22. Nonrelational DBs
Over the past couple of years, a number of specialized data stores have emerged
• CouchDB
• Cassandra
• Dynamo
• BigTable
• Riak
• Redis
• MongoDB
• SimpleDB
• Memcached
• MemcacheDB
23. Also Known As NoSQL
Not entirely appropriate, since SQL can be implemented on non-relational DBs
But SQL is an opaque abstraction with lots of features that are difficult or impossible to distribute efficiently
24. So what’s different?
Most “non-relational” stores specifically emphasize partition tolerance and availability
They typically provide a more relaxed guarantee: eventual consistency
26. BASE
Basically Available
Soft State
Eventually Consistent
27. Eventual Consistency
Write operations are attempted on the n nodes that are “authoritative” for the provided key
In the event of a network partition, data is written to another node in the cluster
When the network heals and nodes become available again, inconsistent data is updated
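The recovery story above can be sketched as a toy “last write wins” repair between two replicas. All the names here are invented for illustration; real stores use considerably more involved machinery (hinted handoff, read repair, and - in Dynamo’s case - vector clocks).

```python
import time

# Toy eventual-consistency sketch: writes carry timestamps, and when a
# partition heals, stale replicas are repaired by keeping the newest value.
class Replica:
    def __init__(self):
        self.data = {}  # key -> (value, timestamp)

    def write(self, key, value, ts=None):
        self.data[key] = (value, ts if ts is not None else time.time())

    def read(self, key):
        return self.data.get(key)

def repair(a, b):
    """After the network heals, push the newer version of each key to both replicas."""
    for key in set(a.data) | set(b.data):
        versions = [v for v in (a.read(key), b.read(key)) if v is not None]
        newest = max(versions, key=lambda v: v[1])  # last write wins
        a.data[key] = b.data[key] = newest

node_a, node_b = Replica(), Replica()
node_a.write("user:1", "sf", ts=1)   # replicated write before the partition
node_b.write("user:1", "sf", ts=1)
node_b.write("user:1", "nyc", ts=2)  # during a partition, only node B sees this
repair(node_a, node_b)               # network heals: node A catches up
print(node_a.read("user:1"))         # ('nyc', 2)
```

As the speaker note below says, this is a gross simplification - the repair strategy is usually a store’s most distinguishing feature.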
28. SimpleGeo Cassandra
No single point of failure
Efficient online cluster rebalancing allows for incremental scalability
Emphasizes availability and partition tolerance
• Eventually consistent
• The tradeoff between consistency and latency is exposed to the client
Battle-tested - large clusters at Facebook, Digg, and Twitter
29. Cassandra Data Model
Column - a tuple containing a name, a value, and a timestamp
Column Family - a group of columns that are stored together on disk
Row - an identifier for a specific group of columns in a column family
Super Column - a column that contains other columns
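One way to picture that layering - purely illustrative, with made-up keys and values - is as nested dictionaries:

```python
# The Cassandra data model as nested dictionaries (illustrative only):
# keyspace -> column family -> row key -> column name -> (value, timestamp)
keyspace = {
    "Records": {                                  # column family
        "com.simplegeo/1": {                      # row key
            "name":      ("Golden Gate Bridge", 1269907200),
            "latitude":  ("37.8199", 1269907200),
            "longitude": ("-122.4786", 1269907200),
        },
    },
}

# Reading one column of one row - the basic lookup operation:
value, ts = keyspace["Records"]["com.simplegeo/1"]["latitude"]
print(value)  # 37.8199
```

Note there’s no schema shared across rows: each row can carry whatever columns it likes, which is why the model is described as schema-less.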
34. Writes are crazy fast
Writes are appended to a commit log in the order they’re received - serial I/O
New data is stored in an in-memory table
The memory table is periodically synced to a file
Files are occasionally merged
Reads may end up checking multiple files (a Bloom filter helps) and merging results
• That’s okay because reads are pretty easy to scale
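The write path described above - append-only log, in-memory table, flush to immutable files, reads that check newest-first - can be sketched in miniature. This is a deliberate simplification, not Cassandra’s actual implementation:

```python
# Miniature log-structured store: writes never seek, reads merge across files.
class TinyLSM:
    def __init__(self):
        self.commit_log = []   # stand-in for the serial on-disk log
        self.memtable = {}     # in-memory table
        self.sstables = []     # flushed, immutable "files" (newest last)

    def write(self, key, value):
        self.commit_log.append((key, value))  # serial I/O, no random seeks
        self.memtable[key] = value

    def flush(self):
        # Periodically sync the memory table to an immutable file.
        if self.memtable:
            self.sstables.append(dict(self.memtable))
            self.memtable = {}

    def read(self, key):
        # Check the memtable first, then files from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            if key in table:
                return table[key]
        return None

db = TinyLSM()
db.write("a", 1)
db.flush()
db.write("a", 2)      # newer value lives in the memtable
print(db.read("a"))   # 2 - the newest value wins
db.flush()
print(db.read("a"))   # still 2, now from the newest flushed file
```

The occasional merge step (compaction) and the Bloom filters that let reads skip files are omitted here for brevity.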
35. How can I query?
Depends on the partitioner you use
• Random partitioner: makes it really easy to keep a cluster balanced, but can only do lookups by row key
• Order-preserving partitioner: stores data ordered by row key, so it can query for ranges of keys, but it’s a lot harder to keep balanced
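A rough sketch of the difference, with made-up node names (real Cassandra partitioners map keys onto a token ring rather than using modulo arithmetic or hard-coded ranges):

```python
import hashlib

NODES = ["node0", "node1", "node2"]

def random_partition(key):
    # Hash the key: keys spread evenly across nodes, but range queries are
    # impossible because adjacent keys land on unrelated nodes.
    digest = hashlib.md5(key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

def order_preserving_partition(key):
    # Assign contiguous key ranges to nodes: range scans hit few nodes,
    # but hot key ranges can leave the cluster badly unbalanced.
    if key < "i":
        return NODES[0]
    elif key < "r":
        return NODES[1]
    return NODES[2]

for k in ["apple", "apricot", "zebra"]:
    print(k, random_partition(k), order_preserving_partition(k))
# "apple" and "apricot" always share a node under the order-preserving
# scheme (so a range scan over "ap..." keys hits one node), but usually
# not under the random scheme.
```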
36. BYOI
• If you need an index on something other than the row key, you have to build an inverted index yourself
• Row key: the attribute you’re interested in plus the row key being indexed
• “dr5regy3zcfgr:com.simplegeo/1”
• But what about indexing multiple attributes?
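A sketch of the inverted-index idea, using the geohash row-key layout from the slide (the records and the second and third geohashes are made up for the example):

```python
# Records keyed by primary row key; the attribute we want to query by
# (a geohash) is buried inside the record.
records = {
    "com.simplegeo/1": "dr5regy3zcfgr",   # the geohash from the slide
    "com.simplegeo/2": "dr5regw2nwnup",   # made-up neighbor in the same cell
    "com.simplegeo/3": "9q8yyk8ytpxr5",   # made-up record somewhere else
}

# Build the inverted index: one entry per record, keyed "geohash:row_key".
index = {f"{gh}:{rk}": rk for rk, gh in records.items()}

# With an order-preserving partitioner, a range scan over index keys that
# start with "dr5reg" finds every record inside that geohash bounding box:
matches = [rk for ik, rk in sorted(index.items()) if ik.startswith("dr5reg")]
print(matches)  # ['com.simplegeo/2', 'com.simplegeo/1']
```

Appending the primary row key keeps index entries unique even when two records share a geohash - but as the slide asks, combining several indexed attributes this way gets awkward fast.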
37. The Curse of Dimensionality
Location data is multidimensional
Traditional GIS software typically uses some variation of a Quadtree or R-Tree for indexes
Like B-Trees, R-Trees need to be updated in place and are expensive to manipulate when they outgrow memory
38. Dimensionality Reduction
If we think of the world as a two-dimensional Cartesian plane, we can treat latitude and longitude as coordinates on that plane
Instead of using (x, y) coordinates, we can break the plane into a grid and number each box
• Space-filling curve: a continuous line that intersects every point in a two-dimensional plane
40. Geohash
A convenient dimensionality-reduction mechanism for (latitude, longitude) coordinates that uses a Z-Curve
Simply interleave the bits of a (latitude, longitude) pair and base32-encode the result
Interesting characteristics
• Easy to calculate and to reverse
• A geohash represents a bounding box, not a point
• Truncating bits from the end of a geohash yields a larger geohash that bounds the original
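The interleave-and-base32 recipe above is short enough to sketch in full; this follows the standard geohash algorithm:

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash's base32 alphabet

def geohash_encode(lat, lon, precision=12):
    """Interleave longitude/latitude bits by binary subdivision, then
    base32-encode 5 bits per character (the standard geohash algorithm)."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    bits = []
    is_lon = True  # a geohash starts with a longitude bit
    while len(bits) < precision * 5:
        rng, val = (lon_range, lon) if is_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            bits.append(1)
            rng[0] = mid   # the value is in the upper half of the range
        else:
            bits.append(0)
            rng[1] = mid   # the value is in the lower half of the range
        is_lon = not is_lon
    chars = []
    for i in range(0, len(bits), 5):
        n = 0
        for b in bits[i:i + 5]:
            n = (n << 1) | b
        chars.append(BASE32[n])
    return "".join(chars)

full = geohash_encode(57.64911, 10.40744, 11)
print(full)  # u4pruydqqvj
# Truncating characters yields a larger bounding box containing the point:
print(full.startswith(geohash_encode(57.64911, 10.40744, 6)))  # True
```

The truncation property falls out of the construction: a shorter geohash is the same bit stream cut off earlier, so it names a coarser grid cell that contains the finer one.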
41. Geohash Drawbacks
Z-Curves are not necessarily the most efficient space-filling curve for range queries
• Points on either end of the Z’s diagonal seem close together when they’re not
• Points next to each other on the spherical earth may end up on opposite sides of our plane
These inefficiencies mean we sometimes have to run multiple queries, or expand bounding-box queries to cover very large expanses
42. Geohash Alternatives
Hilbert curves: improve on Z-Curves but have different drawbacks
Non-algorithmic unique identifiers
• Provide unique identifiers for geopolitical and colloquial bounding polygons
• Yahoo! GeoPlanet’s WOEIDs are a good example
44. Memcache
Useful for storing ephemeral or short-lived data and for caching
Super crazy extra fast
Robust support from pretty much every language in the world
45. MemcacheDB
BDB-backed memcache
We use it for statistics
• We can’t use Cassandra because it doesn’t support eventually consistent increment and decrement operations (yet)
Giant con: it’s pretty much impossible to rebalance if you add a node
46. Pushpin Service
Custom storage solution
R-Tree index for fast lookups
Mostly fixed data sets, so it’s okay that we can’t update data efficiently
47. MySQL!
Our website still uses MySQL for some stuff... though we’re moving away from it
So first of all, I’ve been head-down coding like 14 hours a day for the past couple weeks. So these slides aren’t as polished as I’d like them to be. Luckily, I’ve been working on exactly this stuff, so it’s all in the front of my mind. And since this is a workshop I can be more interactive. Interrupt with questions any time!
I’ve been interested in GIS for a while, but I’m relatively new to the scene. I’ve done lots of work building scalable websites though. And, honestly, that’s a problem that’s more or less solved.
HTTP has no “session state,” but applications that communicate via HTTP do have to maintain state. Without application state there’d really be no reason to have a web site or web service - if there’s no application state (nothing the web server knows that the client doesn’t) then the algorithm can be completely distributed.
And, the truth is, relational databases are pretty awesome. They have a number of great characteristics. They’re well understood. And they’re robust.
Most RDBMS systems are ACID compliant or at least pretty close. These characteristics allow client code to make simplifying assumptions about the data that is returned from the data store.
RDBMS weaknesses are essentially the inverse of their strengths.
The upshot of all this is that write performance is much poorer than read performance. And writes are much harder to scale than reads, because writes have to happen on an authoritative node while reads can be scaled easily using replication.
Popularized by Eric Brewer at the Principles of Distributed Computing conference in 2000 as “Brewer’s Conjecture.” It was later formally proven by Seth Gilbert and Nancy Lynch.
You can design a scalable architecture that maintains some of the ACID characteristics of a typical relational datastore, but eventually you’ll have to relax some of these constraints.
Note that consistency, as defined here, does not mean the same thing as “consistency” in “ACID” - before it meant that all data constraints were met.
So this is probably something you’ve heard before. And people often just throw it out there without an explanation. But it’s pretty easy to prove to yourself by contradiction, so let’s try that.
Node A and B are a master/master pair, so they replicate data to one another. To meet the ACID requirements both nodes have to write the data durably before one of them can respond successfully to a write.
Abstraction layers: sharding, caching, and client-side replication are basically kludges on top of relational data stores that trade consistency for availability and partition tolerance - but why re-invent the wheel?
Specialized data stores: graph databases, document databases, key-value stores, and various combinations.
Facebook’s Hive project adds SQL on top of Hadoop, but SQL queries are translated into MapReduce jobs that are run across the distributed system
We should probably call these datastores NoACID instead of Nonrelational or NoSQL.
That’d make much more sense. But I digress.
Eventual consistency is a concept that was popularized by Amazon CTO Werner Vogels.
This is a gross simplification, and the approaches data stores take to perform node recovery, rebalancing, and repair are often their most distinguishing characteristics. This is actually why we chose Cassandra - the distributed cluster logic is more robust than any other store I’ve seen.
The outermost layer is the key.
Column families are stored together on disk and are the next layer of the structure.
And finally we have columns. You can have as many of them as you want, and each row and column family can have different columns. It’s schema-less - thus non-relational.
Cassandra partitioners are used to decide which node data should be stored on and which node responds to a query.
If we can get our data to fit a model where we’re simply retrieving items by key from a sorted set, then it’s pretty easy to store and query efficiently. Anything more complicated usually requires heuristics and deep insight into the data set to do at scale.
At the very least we need to index on (latitude, longitude)... we may also need to throw altitude and time into the mix. That’s four dimensions.
Space-filling curve: developed by Peano, refined by Hilbert.
The non-algorithmic approach is a massive undertaking that requires constant attention and involves a large amount of ambiguity.