Presentation by: Trey Cain and Mikko Lipasti
Paper and more information: http://soft.vub.ac.be/races/paper/edge-chasing-delayed-consistency-pushing-the-limits-of-weak-memory-models/
OK, so giving this talk is kind of a blast from the past for me. I actually did this work with Mikko Lipasti, who was my advisor at the time, when I was finishing up my PhD in 2004. We never presented it outside of my defense, after which I started my job at IBM and my research directions shifted gears. Since this was way back in 2004, if you don’t mind I just wanted to jog my memory a little bit at the outset of the talk, so like Marty McFly in Back to the Future I’m going to take a journey back to some highlights of 2004. So, this was the year that Mark Zuckerberg launched Facebook. Boy, I can barely remember life before Facebook. But then, it was also the year Janet Jackson suffered a wardrobe malfunction during the Super Bowl. How could anyone forget that? And lastly, it was the year that an incumbent president was being challenged by a Massachusetts politician in an election year. OK, so maybe things haven’t changed that much since then, both for politics and for multiprocessor systems. So that was eight years ago, and some things have changed but much has stayed the same. And when you think about it, relaxing synchronization is about as close as a programmer can come to time travel. Will I get the old value, or will I get the new value? Will I get what I expect, or will I get a wardrobe malfunction? OK, so pushing the gas pedal down to 88 mph, the flux capacitor is lit, and let’s go!
So like I said in the lightning round, when I saw the CFP for RACES, I knew this would be a great place to share this prior work. While most of the discussion so far has been about software mechanisms for relaxing synchronization, we were working within the constraints of a hardware developer. We were trying to achieve the same sort of scalable performance, while supporting legacy applications written to the PowerPC weakly ordered memory model, which we were unable to change. Given that constraint, the lever that we used was the hardware cache coherence protocol, where we attempted to avoid coherence misses by allowing a core to continue using stale data in its cache for as long as possible. By avoiding coherence misses, hopefully we would be able to improve performance. We came up with a new implementation of the PowerPC weakly ordered memory model, which we called edge-chasing delayed consistency.
Not that I need to motivate the problem to this audience, but you know shared memory multiprocessors are proliferating everywhere you look. While they used to be relegated to high-end servers, now many cell phones, TVs, game consoles and tablets are SMPs. And the performance of these SMPs suffers due to coherence misses, even for relatively small systems.
This graph measures a 16-core system with a 16MB L3 cache per core. It shows the number of L3 misses per 1000 instructions, broken down by type, where the lower blue portion of each bar is the number of coherence misses; as you can see, that is a significant fraction across all of the workloads.
So what I’m going to be describing is an optimized implementation of weak ordering called edge-chasing delayed consistency. This is not a new consistency model for the programmer; it is a new implementation of weak ordering that allows a cache line to continue being read after it has been invalidated by another processor. In fact, it is going to allow that cache line to be read until it is absolutely necessary that the core see a new version of the line, where the necessary conditions are dictated by the consistency model, and that time is when the reading processor is causally dependent upon the new value. It is going to continue reading the old data until it is necessary for it to observe the new data. That is, until it observes some value that the invalidation of the stale block precedes in the happens-before relationship, at which point it causally depends upon the new value.
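To make that causality condition concrete, here is a minimal C++ sketch (my illustration, not code from the paper; the C++ release/acquire pair stands in for the PowerPC memory barriers):

```cpp
#include <atomic>
#include <thread>

std::atomic<int> data{0};
std::atomic<int> flag{0};

// Core 0: write the data, then publish it. The release store plays the
// role of the memory barrier here.
void producer() {
    data.store(42, std::memory_order_relaxed);
    flag.store(1, std::memory_order_release);
}

// Core 1: under ECDC it may keep reading a stale copy of `data` from its
// cache even after core 0's store invalidated the line. But once it sees
// flag == 1, it is causally dependent on the store to `data` (that store
// happens-before the flag update), so the stale copy must be discarded.
void consumer() {
    while (flag.load(std::memory_order_acquire) != 1) { /* spin */ }
    int d = data.load(std::memory_order_relaxed);  // must observe 42 now
    (void)d;
}

int main() {
    std::thread t0(producer), t1(consumer);
    t0.join();
    t1.join();
}
```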
So we were interested in developing a coherence protocol that enforced the necessary conditions of a consistency model, not merely sufficient conditions. In order to do this, to really understand what is necessary, we relied on a formalism called a “constraint graph”, which many of you are probably aware of. In the constraint graph, the nodes are the dynamic memory operations of an execution, and the edges are the ordering constraints among them: edges between operations executed by the same thread, plus inter-thread dependence edges such as read-from edges. The key thing about the constraint graph is that if it is acyclic, then the execution is correct. If it contains a cycle, then it is impossible to put the set of operations in a total order, and therefore it is incorrect.
So we extended the definition of the constraint graph to weakly ordered systems, where instead of there being edges between every pair of instructions executed by a single thread, there are only edges between instructions and memory barriers, as well as a few other edges corresponding to single-threaded data dependences.
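As an illustration of how one might check an execution against such a graph, here is a toy sketch (mine, not the paper’s machinery): nodes are dynamic memory operations, edges are the ordering constraints just described, and a cycle means no total order exists.

```cpp
#include <vector>

// Toy constraint-graph checker (illustrative only). Nodes are dynamic
// memory operations; edges encode ordering constraints. Under weak
// ordering, intra-thread edges run only between operations and memory
// barriers, plus inter-thread dependence edges (e.g., read-from).
// The execution is legal iff the graph is acyclic.
struct ConstraintGraph {
    std::vector<std::vector<int>> succ;  // adjacency list, one entry per op

    explicit ConstraintGraph(int n) : succ(n) {}
    void addEdge(int from, int to) { succ[from].push_back(to); }

    // DFS with three colors: 0 = unvisited, 1 = on stack, 2 = done.
    bool hasCycleFrom(int v, std::vector<int>& color) const {
        color[v] = 1;
        for (int w : succ[v]) {
            if (color[w] == 1) return true;               // back edge: cycle
            if (color[w] == 0 && hasCycleFrom(w, color)) return true;
        }
        color[v] = 2;
        return false;
    }

    bool isLegalExecution() const {
        std::vector<int> color(succ.size(), 0);
        for (int v = 0; v < (int)succ.size(); ++v)
            if (color[v] == 0 && hasCycleFrom(v, color)) return false;
        return true;  // acyclic: the operations can be totally ordered
    }
};
```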
So the edge-chasing consistency model derives its name from a class of deadlock detection algorithms that have been described for distributed database systems.
With 30% updates, speedups of 2.74, 1.82, and 1.18 for these list lengths. With 100% updates, speedups of 3.11, 3.87, and 1.35 for these list lengths.
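For reference, the microbenchmark is a lock-free ordered-list insertion; a minimal sketch of the idea (my reconstruction, not the benchmark’s actual code, with deletion omitted) looks like this:

```cpp
#include <atomic>

// Illustrative CAS-based sorted-list insertion of the kind the
// microbenchmark measures. Readers traversing the list can tolerate a
// slightly stale view of the links, which is what lets ECDC keep serving
// stale cache lines and avoid coherence misses here.
struct Node {
    int key;
    std::atomic<Node*> next{nullptr};
};

void insert(std::atomic<Node*>& head, Node* n) {
    for (;;) {
        Node* prev = nullptr;
        Node* cur  = head.load(std::memory_order_acquire);
        while (cur && cur->key < n->key) {   // find the insertion point
            prev = cur;
            cur  = cur->next.load(std::memory_order_acquire);
        }
        n->next.store(cur, std::memory_order_relaxed);
        std::atomic<Node*>& link = prev ? prev->next : head;
        // Publish the new node; if another insert raced us, retry from the head.
        if (link.compare_exchange_weak(cur, n, std::memory_order_release,
                                       std::memory_order_relaxed))
            return;
    }
}
```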
Intolerable vs. tolerable misses; bars, left to right. We expect ECDC to improve performance by reducing false sharing misses and true sharing misses to data. As we can see from this chart, most of the reduction comes from misses to falsely shared data and misses to truly shared synchronization data. We do not believe that any of these applications exhibit the data-race-tolerant quality of the lock-free list insertion microbenchmark or of convergent iterative algorithms. Raytrace exhibits the most reduction: over 50 percent of all coherence misses can be tolerated using ECDC; however, most of these are synchronization misses. Other applications that can use a significant amount of stale data are TPC-H, SPECweb99, and SPECjbb2000.
So this graph shows the normalized execution time for three variants of the ECDC protocol, relative to a baseline coherence protocol, so lower is better. In terms of performance improvements for real applications, it is a little disappointing, around 4% for SPECweb99 and 7.5% for TPC-H. (don’t go back)
So our conclusion after staring at the data for a while was that the two success stories were mostly benefiting from the false sharing reduction. For the other applications, either there weren’t enough coherence misses, or the avoidance of those misses does not improve performance. For example, in the case of synchronization variables, you may be able to see the “locked” value for a little longer than you would otherwise. So instead of being stalled on a cache miss to retrieve the lock from the processor releasing it, you’re simply able to see the old value and spin longer (see the sketch below). It is unclear to us why one would expect results to be any different for applications that rely on lock-based synchronization. For other sorts of synchronization models, the story may be different: for example, lock-free data structures like the linked list example we showed, or perhaps the transactional programming model. So one final word of caution before concluding. While Hans tells us that data races are pure evil, Donald Knuth has stated that premature optimization is the root of all evil. If you have a Barnes-Hut, and a vision for attacking the problem, go for it; in other words, find your nail before inventing hammers.
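The lock-spinning point fits in a few lines; this is an illustrative sketch of mine, not code from the talk:

```cpp
#include <atomic>

// Illustrative spinlock. If ECDC lets a waiting core keep reading a stale
// `locked == true` from its cache, the only effect is extra iterations of
// the spin loop: the acquiring CAS still forces the core to observe the
// up-to-date value before it can enter the critical section, so avoiding
// the coherence miss buys nothing for lock-based synchronization.
struct SpinLock {
    std::atomic<bool> locked{false};

    void lock() {
        for (;;) {
            // Spin on a plain load; this read may legitimately be stale.
            while (locked.load(std::memory_order_relaxed)) { /* spin */ }
            bool expected = false;
            if (locked.compare_exchange_weak(expected, true,
                                             std::memory_order_acquire))
                return;
        }
    }

    void unlock() { locked.store(false, std::memory_order_release); }
};
```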
So, when I talk about causality and causal dependences, what do I mean by that?
Ended at 16:00
E.G. OoO processor
Ask Mikko
Infrastructure issues with models weaker than weak ordering