This document provides an overview of how Apache Accumulo, a distributed key-value store, can be used to scale a social media application that allows users to post photos and comments to groups. It discusses how the application's data was initially stored in a relational database but scaling became difficult. By mapping the data to Accumulo cells indexed by keys, the data can be sharded across multiple servers. Accumulo provides features like automatic rebalancing of tablets that simplify scaling compared to sharding a relational database. It also ensures data durability and consistency through write-ahead logging.
Accumulo is a distributed key-value store based on the BigTable paper.
Technology introduction talks are often driven from the perspective of what a technology is and how it works rather than why you’d use it.
So this evening I’m going to start with a common use case and then talk about how Accumulo’s implementation addresses issues for that use.
This means you’ll have to bear with me as we establish how we might end up with a set of problems that Accumulo helps solve.
I’ve been involved in Accumulo for a while. Recently I’ve been working more on HBase, another distributed key-value store based on the BigTable paper.
I still deal with the scale problems of our supported customers and of the project communities I work within. I also spent my years prior to Cloudera building scalable systems I can’t talk about.
The upshot is that we’ll have to come up with a convincing contrivance.
Because Cute Cat Conversations Dot Com cares about giving users control over their pictures, we require that when a user removes someone from a group that person doesn’t still see old conversations.
So groups are both a distribution and an authorization mechanism
These are logical components. You’re probably running them on a single node.
All nice, tidy, and easy to reason about. We can straightforwardly look at how we can add conversations, comments, and manage groups with a minimal number of updates.
Our cat conversations get a little popular, so we need to make sure one component failure doesn’t sink us.
So you add another application server, and you set up replication for your database. Depending on the specifics of the database you use and how much time you invest, you might have to deal with brief outages while you handle failover yourself.
For most applications, as you gain users and activity there’ll be substantial gains to just doing sessionized load balancing against more application servers. In most cases, this will also buy you more robustness at the application layer.
More importantly, you still have a relational model for your data that’s easy to reason about and a data store that’s relatively easy to administer.
You’ll need to do some filtering and ordering in addition, but you’ll need to hit these two joins because you must turn the current user into a set of groups.
You might break this into two queries, one for a set of discussions and one for a set of comments, but you’re still going to have latency because of having to look across tables or round trip to the application server.
Now we can get everything we need for a given user’s page with a single table.
(We could also just set the image url to null for the rows that are comments)
However, originally each new comment or conversation start only required updating a single row. Like when this comment was added.
Now that same single comment involves writing 3x as many rows. If you look above, the same applies for posting a new conversation image.
When your read latencies are important this is an established optimization technique. It’s called fan-out because the writes from a given producer form a “fan” as they connect to a subset of the potential consumer lists.
This pattern comes up whenever you’re going to have a large scale number of readers in a time sensitive context, like the web, that receive updates from a smaller set of producers. Think social sharing sites like Pinterest, Twitter, or even Google Plus.
Generally, I think if you get to the point of implementing fan-out you should look seriously at moving to one of the distributed key-value stores. But traditional RDBMS isn’t done just yet.
When we want to build the page for a given user, we just need the rows corresponding to them.
As you can see, under this set up, each of the application servers will talk to some set of databases depending on what users they are servicing. When writes happen, they will need to be broadcast to every shard that contains a user in the appropriate group.
Hopefully it’s been ~15-20 minutes.
The cost for that lower operational overhead is that Accumulo is going to make us think about our data organization more.
The headline description from the project itself. We’ll break down the pieces of this description and how they end up easing the pain in our current scaled up application.
It’s important that we start with the fundamental limitation of Accumulo: it’s a key value store and does not provide a relational model.
You read and write values given a particular key.
Keys are made up of a row, a column, and a timestamp.
A column, in turn, is actually made up of three parts: a family, or general grouping of similar columns; a qualifier, which specifies the coordinate within the family; and a visibility.
We’ll cover how some of these key-parts are treated specially in a little bit. Generally, you can just think of it like a big multi-dimensional map.
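To make those key parts concrete, here’s a minimal sketch using the Java client API; the table, group, and column names are just illustrative, not anything Accumulo prescribes.

```java
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.security.ColumnVisibility;

public class KeyPartsExample {
  // One cell: row + (family, qualifier, visibility) + timestamp -> value.
  static Mutation oneCell() {
    Mutation m = new Mutation("conversation-42");             // row
    m.put("comment",                                          // column family
        "0001",                                               // column qualifier
        new ColumnVisibility("cat-group-1"),                  // visibility label
        System.currentTimeMillis(),                           // timestamp
        "Look at those whiskers!");                           // value
    return m;
  }
}
```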
We’ll cover this last bit more in a few minutes.
This is our read-oriented de-normalized conversation table
When we want to build the page for a given user, we just need the rows corresponding to them.
So we’ll take the cell-per-record approach, and use the reader id as a shard indicator in the row id.
Mind you, this is just a first pass.
Note that we’d set each cell’s visibility to be the group the message went to.
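A minimal sketch of that fan-out write, assuming a table named "conversations", a Connector obtained elsewhere, and a row scheme of reader id plus conversation id; none of those names come from Accumulo itself.

```java
import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.MutationsRejectedException;
import org.apache.accumulo.core.client.TableNotFoundException;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.security.ColumnVisibility;

public class FanOutWrite {
  // Write one cell per group member, with the group name as each cell's visibility.
  static void postComment(Connector conn, String conversationId, String comment,
      Iterable<String> groupMembers, String group)
      throws TableNotFoundException, MutationsRejectedException {
    BatchWriter writer = conn.createBatchWriter("conversations", new BatchWriterConfig());
    try {
      for (String reader : groupMembers) {
        // Row starts with the reader id so each user's page is one contiguous scan.
        Mutation m = new Mutation(reader + "_" + conversationId);
        m.put("comment", "", new ColumnVisibility(group), comment);
        writer.addMutation(m);
      }
    } finally {
      writer.close();
    }
  }
}
```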
If storage is at a premium, we can handle deleting cells we know a user won’t see in an offline way.
Because Accumulo only deals with cells at its core, it doesn’t presume that a column being present in one row means it will be present in another. It stores nothing when a column doesn’t exist. This means we can have extremely wide tables that are only sparsely populated; perfect for the fan-out of our cat conversations.
Accumulo asserts a total ordering across all keys.
Sorting is done key-component-wise with decreasing priority across: row, family, qualifier, visibility, and finally timestamp (timestamps sort newest-first).
A common difficulty for building on Accumulo is that you need an increased awareness of how parts of Accumulo will interact with your chosen data layout. Rather than something you can reason about once there are issues (like adding an index to a RDBMS), you need to work it out at the time of application design.
To be performant, we need to make sure that our access pattern for a given user will be a small number of these sequential ranges. That means we have to understand how our chosen keys lay out for Accumulo scans.
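A sketch of the read side under those assumptions: one prefix range per reader, with the reader’s groups passed as Authorizations so only cells they’re allowed to see come back. The table name, row scheme, and group names are illustrative.

```java
import java.util.Map;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.TableNotFoundException;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;

public class UserPageScan {
  // Build one reader's page as a single sequential sweep over their rows.
  static void printPage(Connector conn, String readerId) throws TableNotFoundException {
    Authorizations auths = new Authorizations("cat-group-1", "cat-group-2");
    Scanner scanner = conn.createScanner("conversations", auths);
    scanner.setRange(Range.prefix(readerId));  // all rows that start with this reader id
    for (Map.Entry<Key, Value> entry : scanner) {
      System.out.println(entry.getKey() + " -> " + entry.getValue());
    }
  }
}
```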
This layout and scan makes me think of two issues for our application.
Out of the box, Accumulo will give you lexicoders for all the primitive types as well as Java String, Date, and BigInteger objects. It will also let you build a sortable representation of a list of encoded values.
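For example, a small sketch; the particular fields being encoded are just illustrative.

```java
import java.util.Arrays;

import org.apache.accumulo.core.client.lexicoder.ListLexicoder;
import org.apache.accumulo.core.client.lexicoder.LongLexicoder;
import org.apache.accumulo.core.client.lexicoder.StringLexicoder;

public class LexicoderExample {
  static void demo() {
    // Longs encoded this way still sort numerically even though keys compare byte-wise.
    byte[] timestamp = new LongLexicoder().encode(1408000000000L);

    // A list of encoded values that sorts element by element.
    ListLexicoder<String> rowParts = new ListLexicoder<>(new StringLexicoder());
    byte[] rowId = rowParts.encode(Arrays.asList("alice", "conversation-42"));
  }
}
```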
This is how we’re laying things out again.
The first entry is always just the image; later entries never need it because the reader got it at the start of the scan.
Recall that earlier I mentioned Accumulo just treats the bytes as-is, and that it’s common for applications to give a cell value different meanings depending on the column.
So let’s remove the placeholders in our values and instead make it explicit when a cell is the image for the start of a conversation and when it’s a comment.
By default, Accumulo will only keep a single version of a given key around; it decides which one to keep based on whichever is newest according to the timestamp in the key.
To simplify our current data model, we’re going to configure it to keep an arbitrary number of versions. This will allow us to leave the “comment order” out of our key entirely. We can either set the time based on the posting client or we can rely on the order Accumulo receives updates. Either way, we’d receive comments most-recent-first when reading.
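One way to do that (a sketch, assuming our hypothetical "conversations" table; "vers" is the name Accumulo gives the VersioningIterator it configures on new tables):

```java
import java.util.EnumSet;

import org.apache.accumulo.core.client.AccumuloException;
import org.apache.accumulo.core.client.AccumuloSecurityException;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.TableNotFoundException;
import org.apache.accumulo.core.iterators.IteratorUtil.IteratorScope;

public class KeepAllVersions {
  // Remove the default VersioningIterator so every version of a key is retained.
  static void keepAllVersions(Connector conn)
      throws AccumuloException, AccumuloSecurityException, TableNotFoundException {
    conn.tableOperations().removeIterator(
        "conversations", "vers", EnumSet.allOf(IteratorScope.class));
  }
}
```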
We’ve complicated the mapping from our original database. We’re relying on the way scans work in Accumulo to simplify how we interact with our dataset
By relying on the timestamp and multiple cell versions, we do end up with most-recent-first ordering on comments. On the downside, we’ll have to reverse for display. On the plus side, we can easily do things like previews of most-recent-comment.
At its simplest, this just means that Accumulo will scale across many machines. Unlike our manual database sharding, this should be transparent to you.
Our diagram is a bit of an oversimplification
We can add in a bit of detail. The requests from our clients are going to be served within the cluster by a set of Tablet Servers.
Unfortunately, it won’t make much sense to talk about them without first going back to our data model for a second.
When I said earlier that we’ll treat the row like a database shard, I wasn’t just talking for our application.
Internally, Accumulo manages cells in groups of rows.
Practically, this means that the row is the atomic unit of parallelism within an Accumulo system. In our case, we don’t expect one user to be in so many other people’s cat conversation groups that a single machine couldn’t handle their stream. In other use cases we may have to account for this in our key design.
If you look closely, you can see the tablets!
In particular, this means that we should probably use a hash on our user ids to ensure we don’t get a contiguous block of group members all going to the same server.
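A sketch of one way to do that; the bucket count and the use of String.hashCode are illustrative (a stronger hash would be a better choice in practice).

```java
public class ShardedRowId {
  // Prefix the row with a small hash bucket so one group's members spread across
  // tablets instead of landing in a single contiguous block on one server.
  static String shardedRow(String readerId) {
    int bucket = (readerId.hashCode() & 0x7fffffff) % 256;  // 256 buckets, arbitrary
    return String.format("%02x_%s", bucket, readerId);
  }
}
```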
Besides having to know about the contiguous group members issue, we don’t need to embed any other knowledge about the way sharding is handled into our application.
Not having the logic in our application server also means that maintenance tasks like expanding our cluster are easier.
Accumulo is horizontally scalable and tested at very large cluster sizes.
Adding new hardware resources is equivalent to adding a new shard.
Hard engineering work means expensive.
That’s it. Once the server comes online, Accumulo’s internal coordination service will recognize that there are more physical resources available on the cluster and safely migrate Tablets from busier servers over to the new one.
Accumulo has no single point of failure and safely recovers from partial failures.
This is the CAP theorem, in brief.
You can’t choose to give up “partition tolerance.”
We had some fault tolerance. If we were fortunate, we had automatic failover. If we were _very_ fortunate, we had both without data loss.
If we lost more nodes in a particular shard than we had replication set up for, that set of users was just out of luck until someone got paged.
Whether this storage system favored availability or consistency is very implementation dependent. Most that I have dealt with chose availability because the replication was not synchronous.
Remember zooming in here?
We can’t write directly to persistent storage, because the files there are all sorted.
First, we write to a distributed write ahead log. These logs are append-only and written to other nodes via an underlying distributed file system. They are only used in the event that the node fails before we can update persistent storage.
Once we are assured that there is a safe copy for recovery, we write the update into our buffer of accepted writes.
Then we ack the client and the write is visible to the world.
Now it’s possible that after that ack we’ll have a failure.
Like this.
After a tunable timeout, there’s a coordinator system that will notice the node is down.
It will have the remaining Tablet Servers load the write ahead logs from the down server, and when they’re done, the Tablets from the down server will be reassigned.
This assignment is a lightweight RPC. It just tells the Tablet Server to take ownership of the Tablets, perform any recovery out of distributed storage, and then serve client requests.
Remember: we can’t write directly to persistent storage, because the files there are all sorted.
In order to keep the size of write ahead logs down, the Tablet Server occasionally flushes buffered writes out into newly sorted files on persistent storage.
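For what it’s worth, the client API also lets you request such a flush explicitly; a sketch, again assuming our "conversations" table.

```java
import org.apache.accumulo.core.client.AccumuloException;
import org.apache.accumulo.core.client.AccumuloSecurityException;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.TableNotFoundException;

public class RequestFlush {
  // Ask the Tablet Servers to write their in-memory buffer for this table out to
  // sorted files now, rather than waiting for the automatic flush.
  static void flushNow(Connector conn)
      throws AccumuloException, AccumuloSecurityException, TableNotFoundException {
    conn.tableOperations().flush("conversations", null, null, true);  // wait for completion
  }
}
```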
Now that we’ve covered the internals of recovery, we can see that in addition to easier migration paths, we also have better robustness guarantees because our shards will move themselves around as failures occur, allowing for a more graceful degradation in the face of failures.
What are the big gaps still?
In open source. There’s a private company that has modified Presto.
Replication lands in the release currently named 1.7.0.