TEL-AVIV UNIVERSITY
RAYMOND AND BEVERLY SACKLER FACULTY OF EXACT SCIENCES
SCHOOL OF COMPUTER SCIENCE
An Experimental Evaluation of
Combinatorial Preconditioners
Thesis submitted in partial fulfillment of the requirements for the M.Sc. degree
of Tel-Aviv University by
Uri Unger
The research work for this thesis has been carried out at Tel-Aviv University
under the direction of Prof. Sivan Toledo
July 2007
Abstract
This thesis presents experimental results of a comparison between several
types of preconditioners. Primarily, it is a comparison between Vaidya's algo-
rithm, which is by now a well understood algorithm for constructing combina-
torial preconditioners, and a family of new algorithms presented in a series of
articles by Dan Spielman. These new algorithms include a novel spanning tree
constructor for building low-stretch spanning trees, and a new approach to tree
augmentation. We have implemented these algorithms in Java, using a flexible and robust graph-algorithmic framework we have created for this purpose.
Experimentation was conducted using a test harness that allows extensive exploration of each preconditioner's performance and behavior. This test harness makes use of several simple but useful tools, such as automatic generation of algorithmic parameters and a synthetic performance metric for assessing and comparing preconditioners.
Our main result is that the new augmentation algorithm can outperform Vaidya's algorithm, but not consistently: for most of our test matrices, its performance was similar to Vaidya's. On the other hand, using low-stretch trees as bases for augmentation does have a consistent positive effect on preconditioner performance, but the effect is not dramatic.
Contents

Abstract
Chapter 1. Introduction
Chapter 2. Background
 1. Direct Solvers
 2. Cholesky Factorization and Fill
 3. Iterative solvers
 4. Convergence of Iterative Solvers
 5. Preconditioners
 6. Incomplete Cholesky
Chapter 3. An Overview of Combinatorial Sparsification Algorithms
 1. Graphs and Symmetric Diagonally-Dominant Matrices
 2. Combinatorial Bounds on the Condition Number κ
 3. From Graph Algorithms to Linear Solvers
Chapter 4. Constructing Spanning Trees
 1. Maximum Spanning Trees
 2. Low Stretch Spanning Trees
Chapter 5. Augmenting Trees
 1. Vaidya's Augmentation Method
 2. Ultra Sparsify
Chapter 6. Partition-First Sparsification
Chapter 7. Additional Heuristics
 1. Spielman's Heuristic Algorithm
 2. Modified Vaidya
Chapter 8. Other Combinatorial Preconditioners
 1. Preconditioning in a Larger Space
 2. Combinatorial Preconditioners for Problems that are not Diagonally Dominant
 3. Recursive Combinatorial Algorithms
Chapter 9. Experimental Results
 1. Methodology
 2. Test Matrices
 3. Algorithm Legend
 4. Test Infrastructure
 5. Results
Chapter 10. Conclusions
Bibliography
CHAPTER 1
Introduction
The solution of large linear systems of the form Ax = b is arguably one of the most common problems in scientific computing applications. Such linear systems are usually solved using iterative methods. When using iterative methods, one can use a technique called preconditioning to speed up the convergence rate. This technique involves a carefully chosen matrix B, called the preconditioner, with the property that the system B^{-1}Ax = B^{-1}b converges faster than the original system while having the same solution. Chapter 2 lays the theoretical background needed for understanding this technique.
In 1991, Pravin Vaidya was the first to propose using combinatorial techniques for the construction of preconditioners [22]. Essentially, his methods
interpret the matrices A and B as graphs. The matrix B is constructed to be
a sub-graph of A. Vaidya then uses an embedding of A's edges as paths in B
in order to bound the convergence rate. This algebraic tool is called Support
Theory, and its essentials are presented in Chapter 3.
Although Vaidya never published his work, its theoretical and practical details were developed and published by others. Many of the theoretical details appear in the PhD thesis of Anil Joshi, a student of Vaidya, which was published in 1997 [16]. An implementation and experimental study of Vaidya's algorithm was carried out in 2001 in [7]. Numerous extensions to the basic theoretical framework were proposed in the following years. For example, a framework for constructing preconditioners whose graph representations contain more vertices than the original problem was presented in [4] and used in [1] and [19]. Another extension addressed ill-conditioned problems by splitting problem matrices into layers ([14]).
An additional line of improvements was the creation of new graph algorithms operating within the same or a similar framework to that of Vaidya ([21, 11, 17, 18, 20]). Theoretically, these new algorithms are capable of producing better graph-based preconditioners than Vaidya's, but how do these algorithms perform in practice? The behavior of Vaidya's algorithm is fairly well understood ([7], [1]), and is not always spectacular (incomplete Cholesky is often better). Can the new graph algorithms change the picture completely?
This thesis studies this question. We have built prototypes of the algorithms in [21, 11, 20] using a Java framework that allows us to easily implement algorithmic variants. We conducted extensive experiments with these algorithms in order to answer this question. All of the preconditioning algorithms that participated in our experimentation are described in Chapters 4, 5, 6 and 7.
Our results are presented in Chapter 9. They indicate that the new algorithms we have tested do not behave consistently better than Vaidya's algorithm, at least not dramatically so, although they do exhibit different behaviors, which are sometimes (but not always) significantly better.
CHAPTER 2
Background
Consider a linear system of the form Ax = b, where A is a known n-by-n coefficient matrix, b is a known n-by-1 vector and x is the unknown n-by-1 vector. We are especially interested in linear systems that are sparse, symmetric and positive definite. We say that an m-by-n matrix is sparse if it contains O(max(m, n)) non-zeros. A square matrix A is positive definite if x^T A x > 0 for all x ≠ 0.
Algorithms for solving linear systems fall into one of two main categories:
direct methods and iterative methods.
1. Direct Solvers
Direct methods perform a finite number of operations on the values in A and b in order to produce the solution vector x. Direct solvers usually work by factoring A into a product of simpler matrices. For example, the Gaussian elimination algorithm can be understood as factoring A into lower and upper triangular matrices L and U such that A = LU.
If A is symmetric and positive-definite, it is better to use the Cholesky factorization of A,
A = L L^T ,
where L is a lower triangular matrix. This decomposition is guaranteed to exist for all symmetric positive-definite matrices. The system then becomes L L^T x = b, which can be solved by two triangular substitution solves: a forward substitution that finds y such that L y = b, and a backward substitution that finds x such that L^T x = y.
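To make the two substitution passes concrete, here is a minimal dense Java sketch; it is not the thesis code, the class and method names are our own, and L is assumed to be a non-singular lower-triangular factor stored as a double[][].

```java
// Sketch: solving A x = b given a dense Cholesky factor A = L L^T.
public final class CholeskySolve {
    // Forward substitution: find y such that L y = b.
    static double[] forward(double[][] L, double[] b) {
        int n = b.length;
        double[] y = new double[n];
        for (int i = 0; i < n; i++) {
            double s = b[i];
            for (int j = 0; j < i; j++) s -= L[i][j] * y[j];
            y[i] = s / L[i][i];
        }
        return y;
    }
    // Backward substitution: find x such that L^T x = y.
    static double[] backward(double[][] L, double[] y) {
        int n = y.length;
        double[] x = new double[n];
        for (int i = n - 1; i >= 0; i--) {
            double s = y[i];
            for (int j = i + 1; j < n; j++) s -= L[j][i] * x[j];
            x[i] = s / L[i][i];
        }
        return x;
    }
    static double[] solve(double[][] L, double[] b) {
        return backward(L, forward(L, b));
    }
}
```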
2. Cholesky Factorization and Fill
The time complexity of a direct solver based on Cholesky factorization is proportional to the number of non-zeros in L. The non-zero count of L is at least that of A, but it may be much higher. In extreme cases, L may contain O(n^2) non-zeros even if A has only O(n) non-zeros. We use the term fill to refer to the amount and locations of the additional non-zeros in L in comparison to A. There is also a relation between the time complexity of finding the factor L and the amount of fill it introduces. Specifically, the work needed to calculate L is proportional to the sum of squares of the non-zero counts in the columns of L. This relationship between work and fill shows two things. First, sparser factors might require less work. Second, factors with balanced non-zero counts require less work to compute than factors with some relatively dense columns, even if the two factors have the same dimensions and the same total number of non-zeros.
There is a simple way to characterize the fill created during factorization. The characterization uses a graph representation of the matrix A called the Pattern Graph. The pattern graph PA of an n-by-n symmetric matrix A is an undirected graph PA = (V = {1, 2, ..., n}, E) whose edge set E contains all pairs (i, j) such that Ai,j ≠ 0. We define an operation called eliminate(PA, v) that modifies the pattern graph PA according to an input vertex v. The modification is simply the addition of a clique on the neighbors of v in PA whose indices are larger than v. Formally,
E(eliminate(PA, v)) = E(PA) ∪ {(i, j) | i > v and j > v and (i, v) ∈ E(PA) and (j, v) ∈ E(PA)} .
Successive elimination of all the vertices in PA leaves us with the Fill Graph: the pattern graph that represents the non-zero pattern of L.
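The eliminate operation is easy to express on adjacency sets. The following Java sketch is illustrative only; the adjacency-set representation and the names are our own assumptions, not the graph framework used in the thesis.

```java
import java.util.*;

// Sketch of the eliminate(P_A, v) operation on a pattern graph stored as
// adjacency sets. Eliminating all vertices in increasing order leaves the
// fill graph, i.e. the non-zero pattern of the Cholesky factor L.
public final class FillGraph {
    // adj.get(v) holds the neighbors of v; vertices are 0..n-1.
    static void eliminate(List<Set<Integer>> adj, int v) {
        List<Integer> higher = new ArrayList<>();
        for (int u : adj.get(v)) if (u > v) higher.add(u);
        // Add a clique on the neighbors of v with larger indices.
        for (int i = 0; i < higher.size(); i++)
            for (int j = i + 1; j < higher.size(); j++) {
                int a = higher.get(i), b = higher.get(j);
                adj.get(a).add(b);
                adj.get(b).add(a);
            }
    }
    static void fill(List<Set<Integer>> adj) {
        for (int v = 0; v < adj.size(); v++) eliminate(adj, v);
    }
}
```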
It is clear from this characterization of fill that the order in which we eliminate vertices may influence the amount of fill we experience. Therefore, it is standard practice to permute the rows and columns of A prior to factoring, using a permutation matrix P; that is, to factor the permuted matrix P^T A P instead. We apply P on both sides in order to permute both the columns and the rows of A. Finding a permutation matrix whose resulting fill is minimal is NP-hard. However, heuristics like Minimum Degree and Nested Dissection do find low-fill orderings in practice. Minimum Degree is a greedy heuristic: in each step it eliminates a vertex having the least number of non-eliminated neighbors. Nested Dissection is a more robust algorithm that usually works better than Minimum Degree for larger graphs. It is based on finding small vertex separators in the graph whose removal breaks the graph into two components of roughly the same size. The elimination ordering is constructed such that all the vertices in the first component are eliminated first, then the vertices of the second component, and lastly the vertices of the separator itself. The vertices of each component (and the separator) are ordered recursively using further dissections.
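As an illustration of the Minimum Degree heuristic just described, here is a naive Java sketch that operates on the same adjacency-set representation as the earlier fill-graph example; it is not an efficient ordering code, and all names are our own.

```java
import java.util.*;

// Sketch of the minimum-degree heuristic: repeatedly eliminate the vertex
// with the fewest non-eliminated neighbors, adding a clique on those
// neighbors. Quadratic-time illustration only.
public final class MinimumDegree {
    static int[] order(List<Set<Integer>> adj) {
        int n = adj.size();
        int[] perm = new int[n];
        boolean[] done = new boolean[n];
        for (int step = 0; step < n; step++) {
            int best = -1, bestDeg = Integer.MAX_VALUE;
            for (int v = 0; v < n; v++)
                if (!done[v] && adj.get(v).size() < bestDeg) { best = v; bestDeg = adj.get(v).size(); }
            perm[step] = best;
            done[best] = true;
            List<Integer> nbrs = new ArrayList<>(adj.get(best));
            for (int u : nbrs) adj.get(u).remove(best);       // detach the eliminated vertex
            for (int i = 0; i < nbrs.size(); i++)             // connect its remaining neighbors
                for (int j = i + 1; j < nbrs.size(); j++) {
                    adj.get(nbrs.get(i)).add(nbrs.get(j));
                    adj.get(nbrs.get(j)).add(nbrs.get(i));
                }
        }
        return perm;
    }
}
```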
The problem of fill, and the extra work it introduces even when good orderings are used, may render direct solvers for sparse matrices inefficient in both time and space in comparison to iterative methods.
3. Iterative solvers
An alternative to the direct methods are the iterative methods. These methods find a sequence of vectors x^(t) that approximate the solution vector x, such that x^(t) → x as t gets large. The iterative methods are run until the relative residual error
‖A x^(t) − b‖ / ‖b‖
is sufficiently low. Using iterative solvers is justified when A is very large and sparse, or if A exists only implicitly as a subroutine that, given a vector w, returns Aw. Usually, each iteration requires one multiplication of A by a vector and several more vector-only operations, and needs to keep in memory only a few vectors. These memory requirements are much lower than the requirements of direct solvers, chiefly because of the fill. Also, as long as the iteration count is kept below n, the time complexity of iterative solvers is lower too.
One class of iterative solvers that is widely used in practice is the class of Krylov-subspace solvers. In iteration t, these solvers find a vector x^(t) that minimizes ‖A x^(t) − b‖ within the Krylov subspace Kt = span{b, Ab, A^2 b, . . . , A^{t−1} b}. Looking for approximate solutions specifically within Krylov subspaces can be justified; see [15]. The Conjugate-Gradients (CG) method and the Minimal-Residual method (MINRES) are two Krylov-subspace methods that are appropriate for symmetric positive-definite matrices. MINRES is more theoretically appealing, because it minimizes the residual ‖A x^(t) − b‖ in the 2-norm, and it works even if A is indefinite. Its main drawback is that it suffers from a certain numerical instability when implemented in floating point. Conjugate-Gradients is more reliable numerically, but it minimizes the residual in a norm that is less useful than the 2-norm. Our experiments used Conjugate-Gradients exclusively.
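For reference, a compact Java sketch of the unpreconditioned Conjugate-Gradients iteration follows; it uses dense matrix-vector products purely to stay self-contained, and it is not the solver used in our experiments.

```java
// Sketch of the Conjugate-Gradients iteration for a symmetric
// positive-definite matrix, stopping on the relative residual.
public final class ConjugateGradients {
    static double[] multiply(double[][] A, double[] x) {
        double[] y = new double[x.length];
        for (int i = 0; i < x.length; i++)
            for (int j = 0; j < x.length; j++) y[i] += A[i][j] * x[j];
        return y;
    }
    static double dot(double[] a, double[] b) {
        double s = 0; for (int i = 0; i < a.length; i++) s += a[i] * b[i]; return s;
    }
    static double[] solve(double[][] A, double[] b, double tol, int maxIter) {
        int n = b.length;
        double[] x = new double[n];
        double[] r = b.clone();           // residual b - Ax, with x = 0
        double[] p = r.clone();           // search direction
        double rr = dot(r, r), bb = dot(b, b);
        for (int t = 0; t < maxIter && rr > tol * tol * bb; t++) {
            double[] Ap = multiply(A, p);
            double alpha = rr / dot(p, Ap);
            for (int i = 0; i < n; i++) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
            double rrNew = dot(r, r);
            double beta = rrNew / rr;
            for (int i = 0; i < n; i++) p[i] = r[i] + beta * p[i];
            rr = rrNew;
        }
        return x;
    }
}
```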
4. Convergence of Iterative Solvers
The relative residual norm can be bounded using spectral properties of the matrix A, and this bound can in turn be used to bound the iteration count. The essential step is to express the residual b − Ax^(t) using a polynomial. Since x^(t) ∈ Kt, there is some vector y such that x^(t) = y1 b + y2 Ab + · · · + yt A^{t−1} b. Therefore,
b − Ax^(t) = b − y1 Ab − y2 A^2 b − · · · − yt A^t b = p(A) b
for some polynomial p of degree t with p(0) = 1. It can be shown that the relative norm is bounded by
‖b − Ax^(t)‖2 / ‖b‖2 ≤ max_{i=1,...,n} |p(λi)| .
This result means that if there are low-degree polynomials that take the value 1 at 0 and are small on the eigenvalues of A, then Krylov-subspace solvers will converge quickly for A. For example, if the eigenvalues of A are clustered, then a polynomial with only one root within the cluster can take small values on all the eigenvalues in the cluster, and the solver would therefore converge quickly, since a relatively low-degree polynomial hits many eigenvalues.
In each iteration, Krylov-subspace solvers choose the polynomial p that minimizes the residual. To be a true minimizer, this polynomial must depend on A and b. In order to obtain a generic bound on convergence, we need to remove this dependency. To do so, let us first assume that the smallest and largest eigenvalues of A are λmin and λmax respectively. We then look for a sequence of polynomials pt such that pt(0) = 1 and max_{x∈[λmin,λmax]} |pt(x)| is as small as it can be for any polynomial of degree t. It turns out that such a sequence of polynomials exists, and it is derived from the Chebyshev polynomials. The Chebyshev sequence of polynomials is defined by the following recurrence:
Chebyshev sequence of polynomials is dened by the following recurrence:
c0(x) = 1
c1(x) = x
ct(x) = 2ct−1(x) − ct−2(x)
and the required polynomial pt(x) is dened as:
pt(x) = c−1
t
λmax + λmin
λmax − λmin
ct
λmax + λmin − 2x
λmax − λmin
The value of Pt(x), which bounds the relative residual error, can bounded by:
|pt(x)| ≤ 2
√
κ + 1
√
κ − 1
t
+
√
κ + 1
√
κ − 1
−t −1
≤ 2
√
κ − 1
√
κ + 1
t
where the value κ, called the spectral condition number of A, is defined as κ = λmax/λmin. For a fixed κ, we can expect convergence to a fixed tolerance (for example, 10^{−12}) within a constant number of iterations. As κ grows,
(√κ − 1) / (√κ + 1) ≈ 1 − 2/√κ ,
so we are guaranteed convergence to a fixed tolerance within O(√κ) iterations. This bound may not be very tight, as the true convergence rate depends on the clustering of the eigenvalues.
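The O(√κ) behavior of the bound is easy to see numerically. The following small Java sketch (all names are ours) computes the smallest t for which the bound 2((√κ − 1)/(√κ + 1))^t drops below a given tolerance; it reports the bound only, not the iteration count a solver actually needs in practice.

```java
// Sketch: smallest t with 2*((sqrt(kappa)-1)/(sqrt(kappa)+1))^t <= tol.
public final class IterationBound {
    static int iterations(double kappa, double tol) {
        double q = (Math.sqrt(kappa) - 1) / (Math.sqrt(kappa) + 1);
        double bound = 2.0;
        int t = 0;
        while (bound > tol) { bound *= q; t++; }
        return t;
    }
    public static void main(String[] args) {
        System.out.println(iterations(1e4, 1e-12));  // grows roughly like sqrt(kappa)
    }
}
```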
5. Preconditioners
The term preconditioning refers to the transformation of a linear system into another system with more favorable properties for iterative solution. Generally, preconditioning attempts to improve the spectral properties of the coefficient matrix, which affect the rate of convergence, as seen in the previous section. Specifically, we are looking for a matrix B such that the system B^{-1}Ax = B^{-1}b is easier to solve than the original system Ax = b. These two systems have the same solution. In this example, we have applied the matrix B^{-1} on the left, but it can also be applied on the right instead. When we use Krylov-subspace methods, it is not necessary to actually form the preconditioned matrix B^{-1}A (or AB^{-1}) explicitly. Instead, the iterative algorithm uses A as-is for matrix-vector products, and uses B^{-1} implicitly: a linear system of the form Bz = r is solved in each iteration.
Left (or right) preconditioning is not appropriate for algorithms like MINRES or Conjugate Gradients, since they rely on A's symmetry, whereas B^{-1}A is generally not symmetric. Luckily, there is another form of preconditioning called split preconditioning. Using this approach, we factor the preconditioner B = B1 B2 and solve the preconditioned system (B1^{-1} A B2^{-1})(B2 x) = B1^{-1} b to find y = B2 x, and then recover x = B2^{-1} y. If B is symmetric positive definite, we can factor B using the Cholesky factorization to obtain B = L L^T = B1 B2, and in this way retain symmetry, since L^{-1} A L^{-T} is symmetric.
When using a preconditioner, we need a more general definition of κ:
κ(A, B) = κ(B^{-1}A) = λmax(A, B) / λmin(A, B) ,
where the λ(A, B) are the generalized eigenvalues of the pair (A, B). Chapter 3 shows how this value can be bounded combinatorially.
Finding a good preconditioner is a balance between two often conflicting demands. On the one hand, we want B^{-1} to be easy to apply, or equivalently, to be able to solve Bz = y cheaply. This means that B should be sparser (or somehow easier to factor) than A; otherwise solving for B would be as expensive as solving the original problem with A. On the other hand, we need B^{-1}A to have a low condition number κ. The first demand tries to minimize the time we spend in each iteration, while the second tries to minimize the number of iterations. We will show later that these demands roughly state that B needs to approximate A, obviously with fewer non-zeros.
6. Incomplete Cholesky
There are a few heuristics that reduce ll in Cholesky factors at the expense
of providing only an approximation of the true factor. These heuristics are re-
ferred to as Incomplete Cholesky factorization. The general idea these algorithms
employ is to drop some values here and there during the normal calculation
10
of the factor L, i.e. setting some values forcefully to zero when some predened
rules apply. Droping these values naturally lowers the ll locally, but it may
also deeply inuence the calculation of the rest of the elements in L, as element
values of L propagate throughout the calculation.
The first type of incomplete Cholesky factorization we describe is the drop-tolerance incomplete Cholesky factorization. This algorithm drops an element if its absolute value is lower than some predefined threshold. Another form of incomplete Cholesky factorization is the so-called zero-level fill-in variant. This variant drops an element if it fills a cell Li,j for which Ai,j = 0; that is, this algorithm does not allow any additional fill to occur. The generalization of this approach gives us the k-level fill-in algorithm: this variant tracks the propagation of filled elements during the calculation of L, allowing additional fill only for elements whose rank is lower than k. The rank of an element li,j ≠ 0 is measured by the number of steps that were required for fill to propagate from the original non-zeros of A to the element li,j.
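A minimal dense Java sketch of the drop-tolerance rule follows; real incomplete-Cholesky codes work on sparse matrices and handle pivot breakdown, so this is only meant to show where the dropping rule is applied (it assumes the pivots remain positive, and the names are ours).

```java
// Sketch of a dense drop-tolerance incomplete Cholesky factorization:
// entries of L whose absolute value falls below the threshold are set to
// zero as soon as they are computed.
public final class IncompleteCholesky {
    static double[][] factor(double[][] A, double dropTol) {
        int n = A.length;
        double[][] L = new double[n][n];
        for (int j = 0; j < n; j++) {
            double d = A[j][j];
            for (int k = 0; k < j; k++) d -= L[j][k] * L[j][k];
            L[j][j] = Math.sqrt(d);                         // assumes the pivot stays positive
            for (int i = j + 1; i < n; i++) {
                double v = A[i][j];
                for (int k = 0; k < j; k++) v -= L[i][k] * L[j][k];
                v /= L[j][j];
                L[i][j] = Math.abs(v) < dropTol ? 0.0 : v;  // the dropping rule
            }
        }
        return L;
    }
}
```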
Incomplete Cholesky factors often work very well as preconditioners. The experiments reported in this thesis also included the drop-tolerance variant as a preconditioner.
CHAPTER 3
An Overview of Combinatorial Sparsification Algorithms
This thesis evaluates preconditioners that are constructed by sparsifying a
graph GA that represents the coecient matrix A. By sparsifying we mean that
edges are dropped from GA to form a sub-graph GB, and then GB is viewed as
a matrix B to serve as a preconditioner.
1. Graphs and Symmetric Diagonally-Dominant Matrices
An isomorphism between graphs and symmetric diagonally-dominant matri-
ces enables us to use graph algorithms to construct preconditioners.
Definition 1.1. Let A ∈ R^{n×n}. Row i of A is diagonally dominant if
Ai,i ≥ Σ_{j≠i} |Ai,j| .
If there is an equality, the row is called weakly dominant, and if there is a
strict inequality, the row is called strictly dominant. If all the rows of A are
dominant then A is called diagonally dominant, and if at least one row is strictly
dominant, A is strictly dominant.
From here on we assume that A is symmetric and diagonally dominant. Suppose that Ai,j < 0 for some i ≠ j. Let ⟨i, −j⟩ be a length-n column vector with a 1 in position i and a −1 in position j (and zeros everywhere else). The matrix |Ai,j| ⟨i, −j⟩⟨i, −j⟩^T is a rank-1 matrix with Ai,j in positions (i, j) and (j, i) and with |Ai,j| in positions (i, i) and (j, j). The matrix A − |Ai,j| ⟨i, −j⟩⟨i, −j⟩^T is also diagonally dominant, but with a zero in positions (i, j) and (j, i). We can therefore continue to subtract additional rank-1 matrices of this type from the difference. If Ai,j > 0, we can perform a similar trick, but with a vector ⟨i, j⟩ that has ones in positions i and j. Eventually, the only remaining non-zeros in the matrix will be on its diagonal. We can then subtract from it matrices of the form (Ai,i − Σ_{j≠i} |Ai,j|) ⟨i⟩⟨i⟩^T, where ⟨i⟩ is the i-th unit vector. Therefore, we can express A as
A = Σ_{i≠j, Ai,j<0} |Ai,j| ⟨i, −j⟩⟨i, −j⟩^T + Σ_{i≠j, Ai,j>0} |Ai,j| ⟨i, j⟩⟨i, j⟩^T + Σ_i (Ai,i − Σ_{j≠i} |Ai,j|) ⟨i⟩⟨i⟩^T .
This establishes the isomorphism: each term in the first two summations above represents an edge of a graph. We can view a symmetric diagonally-dominant (SDD) matrix A as a graph GA whose vertex set is {1, 2, . . . , n} and with an edge between i and j for each Ai,j ≠ 0. The edges are weighted and signed. The sign of the edge is positive if Ai,j < 0 and negative if Ai,j > 0 (there is a reason for this strange assignment of signs [3]). The weight of the edge can be set to either |Ai,j| or to √|Ai,j|; we can use either one, but choosing it properly will make the results below more elegant. The vertices also have weights: Ai,i − Σ_{j≠i} |Ai,j|.
Given an SDD matrix, we can clearly construct GA using the rules in the previous paragraphs. Conversely, given a graph GA with non-negative edge and vertex
weights and an assignment of signs to the edges, we can build the corresponding
matrix A, which will be SDD.
Definition 1.2. The Laplacian matrix A = (ai,j) of an undirected weighted graph GA = (V, E) with weight function w : E → R is a |V| × |V| symmetric matrix whose rows and columns correspond to the vertices of GA. Its element values are defined as follows:
ai,j = Σ_{(i,k)∈E} w(i, k)   if i = j
ai,j = −w(i, j)              if i ≠ j and (i, j) ∈ E
ai,j = 0                     otherwise
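Assembling such a Laplacian from an edge list is straightforward; the following dense Java sketch (with our own naming) mirrors the definition above.

```java
// Sketch: assembling the Laplacian of an undirected weighted graph given as
// a list of edges (u, v) with weights w. Off-diagonals get -w(u, v) and the
// diagonals accumulate the incident edge weights.
public final class Laplacian {
    static double[][] build(int n, int[][] edges, double[] weights) {
        double[][] A = new double[n][n];
        for (int e = 0; e < edges.length; e++) {
            int u = edges[e][0], v = edges[e][1];
            double w = weights[e];
            A[u][v] -= w;
            A[v][u] -= w;
            A[u][u] += w;
            A[v][v] += w;
        }
        return A;
    }
}
```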
In this work, we only treat matrices with non-positive off-diagonal elements (that is, matrices whose graphs have no negative edges). SDD matrices with positive off-diagonals are rare in applications. Moreover, a simple transformation reduces a linear system with a general SDD coefficient matrix to a problem with an SDD matrix with non-positive off-diagonals [12].
2. Combinatorial Bounds on the Condition Number κ
A fundamental question in the construction of such preconditioners is how to
bound κ(A, B). It turns out that we can bound κ(A, B) using an embedding π
of the edges of GA in paths in GB. Combinatorial properties of the embedding
provide bounds on κ(A, B).
Definition 2.1. Let GA and GB be weighted (but unsigned) graphs with vertex sets {1, 2, . . . , n}. A path embedding π maps each edge of GA to a simple path in GB. For every edge (i1, iℓ) in GA,
π(i1, iℓ) = {(i1, i2), (i2, i3), . . . , (iℓ−1, iℓ)}
is a simple path i1 ↔ i2 ↔ i3 ↔ · · · ↔ iℓ−1 ↔ iℓ in GB.
The following definition provides relevant metrics of the quality of an embedding. These metrics are defined on a per-edge basis; we later use them to define global metrics that bound κ(A, B).
Definition 2.2. The weighted dilation of an edge of GA in a path embedding π of GA into GB is
dilationπ(i1, i2) = Σ_{(j1,j2)∈π(i1,i2)} √( Ai1,i2 / Bj1,j2 ) .
The weighted congestion of an edge of GB is
congestionπ(j1, j2) = Σ_{(i1,i2) : (j1,j2)∈π(i1,i2)} √( Ai1,i2 / Bj1,j2 ) .
The weighted stretch of an edge of GA is
stretchπ(i1, i2) = Σ_{(j1,j2)∈π(i1,i2)} Ai1,i2 / Bj1,j2 .
The weighted crowding of an edge in GB is
crowdingπ(j1, j2) = Σ_{(i1,i2) : (j1,j2)∈π(i1,i2)} Ai1,i2 / Bj1,j2 .
Note that stretch is a summation of the squares of the quantities that constitute dilation, and similarly for crowding and congestion. Unfortunately, papers in the combinatorial-preconditioning literature are not consistent about these terms. When we use dilation and congestion, it makes sense to define the edge weights in GA and GB to be √|Ai,j|, since the dilation and congestion then become sums of edge-weight ratios. When we work with stretch and crowding, it makes more sense to define the edge weights to be the |Ai,j| themselves, for the same reason.
These definitions allow us to state results that relate the condition number of the preconditioned system to combinatorial metrics of an embedding of GA in GB. We do not show the proofs here; for the proofs, see [2, 5, 6, 20].
Lemma 2.3. Let A and B be weighted Laplacians with the same row sums and such that GB is a sub-graph of GA, and let π be a path embedding of GA into GB. Then
κ(A, B) ≤ Σ_{(i1,i2)∈GA} stretchπ(i1, i2)
κ(A, B) ≤ Σ_{(j1,j2)∈GB} crowdingπ(j1, j2)
κ(A, B) ≤ max_{(i1,i2)∈GA} dilationπ(i1, i2) · max_{(j1,j2)∈GB} congestionπ(j1, j2) .
These bounds can be tightened to
κ(A, B) ≤ max_{(j1,j2)∈GB} Σ_{(i1,i2)∈GA : (j1,j2)∈π(i1,i2)} stretchπ(i1, i2)
κ(A, B) ≤ max_{(i1,i2)∈GA} Σ_{(j1,j2)∈π(i1,i2)} crowdingπ(j1, j2) .
We omit two additional bounds that are similar to the ones just given.
3. From Graph Algorithms to Linear Solvers
These results can be used to develop efficient linear solvers. First, we run a graph algorithm to construct GB given GA. In the algorithms that we explore in this work, GB is a sub-graph of GA. We explain below the objectives of the sparsification phase. Next, we construct B from GB. The preconditioner B is then factored into its Cholesky and permutation factors,
B = P L L^T P^T .
The permutation is chosen so as to reduce the fill in the Cholesky factor of P^T B P. This factorization allows us to easily apply B^{-1}; the cost of applying B^{-1} is proportional to the number of non-zeros in L. Now that we have an easy way to apply B^{-1}, we invoke the Conjugate Gradients algorithm or the MINRES algorithm with B as a preconditioner. In fact, we perform the iterations somewhat more efficiently by solving P A P^T (P x) = P b for P x and then recovering x = P^T (P x). This eliminates the need to apply the permutations in every iteration.
The efficiency of this method depends on the efficiency of each phase. We want the algorithm that constructs GB to be efficient. We want B to be easy to factor. We want the number of Conjugate-Gradients or MINRES iterations to be small, and we want each iteration to be cheap. The cost of factoring B depends on its sparsity pattern. In general, small balanced vertex separators in GB lead to low factorization costs. The cost of every iteration also depends on the density of the Cholesky factor of B. The number of iterations in the iterative algorithm is bounded by O(√κ(A, B)). Therefore, the sparsification algorithm tries to achieve a small κ(A, B) and to ensure that B has a sparse Cholesky factor (under a suitable symmetric permutation). We can express both the bounds on κ(A, B) and the bounds on fill in terms of the graph structures of GA and GB, which allows us to use graph algorithms to sparsify A.
CHAPTER 4
Constructing Spanning Trees
1. Maximum Spanning Trees
One of the simplest ways to construct GB is to select a spanning tree of GA.
This ensures that an embedding π exists. A tree is also very cheap to factor: the
factorization can start from the leaves and go up, so there is no fill at all. Tree
preconditioners do not balance well the cost of factoring the preconditioner and
the cost of the iterations: κ(A, B) is usually too high, leading to a large number
of iterations. Therefore, sophisticated preconditioning methods usually augment
trees with additional edges. We explain the construction of trees in this chapter
and their augmentation in the next.
The first trees that were proposed for preconditioning were maximum spanning trees [22]. These trees maximize the sum of the edge weights in the tree. The identity of the maximum spanning tree depends only on the ordering of the edge weights, not on the exact weights, so it does not matter whether we weigh the edges by |Ai,j| or by √|Ai,j|; we get the same tree in either case. Maximum spanning trees can be constructed by any algorithm designed to compute minimum spanning trees, such as Prim's algorithm, Kruskal's algorithm, and others (see [8] for descriptions of these algorithms).
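A maximum spanning tree can be obtained, for example, by running Kruskal's algorithm on the edges sorted in decreasing weight order, as in the following Java sketch (our own simplified code, not the thesis implementation).

```java
import java.util.*;

// Sketch of a maximum spanning tree via Kruskal's algorithm: keep an edge
// whenever it joins two different components, visiting edges by decreasing
// weight. Uses a simple union-find structure with path halving.
public final class MaximumSpanningTree {
    static int find(int[] parent, int v) {
        while (parent[v] != v) { parent[v] = parent[parent[v]]; v = parent[v]; }
        return v;
    }
    // edges[e] = {u, v}; returns the indices of the edges kept in the tree.
    static List<Integer> build(int n, int[][] edges, double[] weights) {
        Integer[] order = new Integer[edges.length];
        for (int e = 0; e < edges.length; e++) order[e] = e;
        Arrays.sort(order, (a, b) -> Double.compare(weights[b], weights[a]));
        int[] parent = new int[n];
        for (int v = 0; v < n; v++) parent[v] = v;
        List<Integer> tree = new ArrayList<>();
        for (int e : order) {
            int ru = find(parent, edges[e][0]), rv = find(parent, edges[e][1]);
            if (ru != rv) { parent[ru] = rv; tree.add(e); }
        }
        return tree;
    }
}
```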
Maximum spanning trees have a property that makes it easy to analyze
κ(A, B). Let GB be a maximum spanning tree of GA and suppose that the
edge (i, j) is in GA but not in GB. Clearly, (i, j) is not heavier than any of the
edges along the single path between i and j in GB. If it were heavier than any
of them, we could increase the total weight of GB by including (i, j) in it and
dropping the lighter-than-(i, j) edge in that path.
Therefore, all the ratios in the definitions of congestion, dilation, stretch and crowding are bounded by 1 when GB is a maximum spanning tree of GA. This implies that the dilation and stretch are bounded by n − 1 (the maximal path length), and that the congestion and crowding are bounded by m − (n − 2) (the maximum number of paths that can use an edge of GB). Therefore, by the congestion-dilation product bound,
κ(A, B) ≤ (n − 1)(m − n + 2) = O(mn) .
The sum-of-stretch and sum-of-crowding bounds give similar expressions. This
bound may seem pessimistic, but experiments show that the maximum-spanning-
tree preconditioner really is quite bad.
Ignoring the cost of constructing the preconditioner (which is negligible in this case), the total cost of the solver is bounded by O(m^{1.5} n^{0.5}), since the dominant cost in each iteration is the application of the Cholesky factor of the preconditioner.
The sum-of-stretch bound can lead to better spanning-tree preconditioners
(this was originally noted by Erik Boman in an unpublished manuscript).
2. Low Stretch Spanning Trees
It is possible to construct a spanning tree of GA for which the average edge stretch is bounded by O(log^3 n). In the graph-algorithms literature, bounds on low-stretch trees are usually specified in terms of the average stretch per edge, and not in terms of the sum of stretches. It is clear, however, that the average and the sum always differ by exactly a factor of m, and thus the same construction minimizes both. Therefore, the sum of stretch (which bounds κ(A, B)) is bounded by O(m log^3 n). Ignoring the cost of building the tree, the solver performs O(m^{1.5} log^{1.5} n) operations.
The following notation is used throughout the description of the algorithm:
• The weight w(i, j) of an edge (i, j) in GA is simply |Ai,j|.
• The length ℓ(i, j) of an edge is the inverse of its weight, ℓ(i, j) = 1/w(i, j).
• The distance d(i, j) between two vertices is the length of the shortest path between i and j, where the length of a path is the sum of the lengths of the edges along it.
We also define the radius rG(v) of a graph G with a designated vertex v, called the root of G, as the maximal distance between v and any other vertex in G. The definition extends naturally to induced sub-graphs with a root.
The algorithm that we have implemented to construct low-stretch trees is due to Emek et al. [11]. At its heart lies a graph decomposition algorithm that produces a so-called star decomposition. Given a graph G = (V, E) and an arbitrary root vertex x0 ∈ V, this algorithm produces a partition of G's vertices into disjoint subsets {V0, . . . , Vk} such that x0 ∈ V0. The partitioning also produces a set of bridge edges EB = {(xi, yi) ∈ E(G)} such that xi ∈ V0 and yi ∈ Vi for all 1 ≤ i ≤ k. The algorithm guarantees that all the sub-graphs induced by the Vi's are connected and that the union of these sub-graphs together with the bridge edges has a radius of at most (1 + ε) times rG(x0). That is, the algorithm effectively removes all the edges that link one Vi to another Vj except for the bridge edges EB, without increasing the radius of G by much.
The construction of the sets {V0, . . . , Vk} uses a set-growing technique. In
order to grow a set, a root vertex must be chosen, from which the growing
starts. Then, each vertex in G is labeled with its distance to that root vertex,
and the vertices are sorted according to their distance. The distance function
is not necessarily d; it can be induced by any set of non-negative edge weights.
The actual set-growing phase iteratively adds vertices into the set one by one
in increasing distance order. While the set is growing, we keep track of two
metrics. The first is the cut weight: the sum of the weights of all edges with exactly one endpoint in the grown set. The second is the volume count: the number of edges with at least one endpoint inside the grown set. The growing stops when a certain ratio is achieved between these two metrics. The ratio is chosen to ensure halting. After each set has been fully grown, its vertices and their incident edges are removed from G before we grow the next set.
Each set Vi is grown independently, starting with V0. Growing V0 starts from the arbitrary root x0 and uses the distance d induced by the edge lengths ℓ.
Growing the rest of the sets is a little trickier. First, we need to choose their root vertices. For this purpose we consider the set S of all vertices that lie right outside V0 along some path of the shortest-paths tree rooted at x0; that is, all the vertices u ∈ V − V0 with a neighbor w ∈ V0 such that d(x0, u) = d(x0, w) + ℓ(w, u). Each of these vertices will serve as the growth root for exactly one of the sets V1, . . . , Vk. The set S also determines the bridge edges: for each such u and w, the edge (u, w) is in EB.

Figure 1. A low-stretch spanning tree and a maximum-weight spanning tree (below, Figure 2). The full graph is shown in Figure 3 in Chapter 7. The edges inside the ring are the heaviest, and are therefore contained within the maximum-weight spanning tree. The consequence of this is that the inner area within the smaller circle is connected to the area within the ring with only one edge. Thus there is a high stretch for all the (removed) edges that connect a vertex within the ring and a vertex within the internal circle. This is not the case in the low-stretch spanning tree.
We now define the distance function d′ that we use for growing V1, . . . , Vk. This distance function is induced by an edge-length function ℓ′(u, v) defined by the following rules:
• If u is on a shortest path from S to v, then ℓ′(u, v) = 0. Note that there may be several such shortest paths; for this rule to apply, it suffices that at least one of them obeys the rule's condition.
• Otherwise, ℓ′(u, v) = ℓ(u, v).
The length function ℓ′ induces a distance function d′ on the remaining vertices of G.
Figure 2. A maximum weight spanning tree. Contrast with the
above Figure 1.
We process the vertices of S iteratively, growing a new set Vi from each vertex vi ∈ S. Thus, the size of S determines k. In particular, all the vertices u for which the shortest path to S leads to v1 will be in the set V1. A set Vi may include vertices w whose distance to vj ∈ S is shorter than their distance to vi, if Vi is grown before Vj. This happens only if including w in Vi is beneficial for the decomposition, e.g., when the edge leading to w is heavy and is thus better left outside the cut.
The bridge edges are always included in the low-stretch tree, but they may not form a tree. If they do not, we recurse: we decompose each of the Vi's again, using x0 as the root of V0 and the vertices of S as the roots of the other Vi's. The recursion stops at sub-graphs with only one or two vertices. The resulting tree is the union of all the bridge edges produced by all the levels of recursion, together with the edges of the non-decomposed sub-graphs (each of which must contain no more than one edge, since it has no more than 2 vertices).
Figure 3. On the left: the set V0 (red) and the roots used for growing the sets V1, . . . , Vk (green). The arrows show the structure of the shortest-path DAG starting from these roots. If u is a root and v is some vertex reachable from u using the depicted arrows, then ℓ′(u, v) = 0. On the right: all the sets V0, . . . , Vk, each depicted with a different color. Note how the order in which we consider the roots may change the final result. For example, we can tell that the yellow set's growth preceded the white one, since some of the yellow vertices are reachable from the white root; these vertices would have been contained in the white set had the white set been grown first.
CHAPTER 5
Augmenting Trees
Trees are usually not effective preconditioners. One way to construct more effective preconditioners is to augment spanning trees of GA with extra edges taken from GA. We denote the spanning tree of GA by TA.
1. Vaidya's Augmentation Method
Vaidya's algorithm augments a spanning tree by cutting up the tree into
connected components and then adding the heaviest edge between every two
components. We do not add edges between components that do not have any
edges between them in the original graph or if the heaviest edge between them
is already part of the tree.
The cleverness of the algorithm lies in the partitioning algorithm. Ideally, we would have liked to partition the tree into a given number k of connected sub-trees with ⌊n/k⌋ to ⌈n/k⌉ vertices each. This is not always possible. Consider, for example, a star: any connected subset of its vertices with more than one vertex must contain its center, and therefore there can be only one such subset. On the other hand, a path can obviously be partitioned into exactly k connected components with ⌊n/k⌋ to ⌈n/k⌉ vertices each.
We use the following partitioning algorithm, which is a simplified and non-recursive version of the algorithm in [7]. The algorithm starts from a rooted tree and cuts out connected sub-trees when their size falls within a certain range. The algorithm visits all the vertices of the rooted tree in post-order (parent after its children). We use a breadth-first search to generate the post-order, but any post-ordering will do. The algorithm processes a vertex as follows. It uses the child pointers to find its children in the remaining tree (that is, not including children in the original tree that were already cut away). Each child structure contains a field that stores the number of vertices in the sub-tree rooted at that child (in the remaining tree; we do not count vertices that were already cut away). The algorithm sums these numbers and adds 1 (for the vertex itself). If the sum is at least n/k, the algorithm cuts the vertex from its parent. It is easy to see that a sub-tree rooted at a non-root vertex v can have at most 1 + degree(v) · n/k vertices and at least n/k. The sub-tree that remains at the root of the original tree also has fewer than 1 + degree(root) · n/k vertices, but its size is not bounded from below; it might even contain only one vertex. If the vertex degrees are bounded by dmax, then all the sub-trees except perhaps one have between n/k and 1 + dmax · n/k vertices.
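The following Java sketch illustrates the splitting rule with a recursive post-order traversal; the thesis implementation generates the post-order with a breadth-first search, and the threshold ⌈n/k⌉, the recursion, and all names here are our own simplifications.

```java
import java.util.*;

// Sketch of the tree-splitting step: process vertices in post-order, keep a
// running count of the not-yet-cut vertices below each vertex, and cut a
// subtree away once its size reaches the threshold.
// children.get(v) lists the children of v in the rooted tree; the root is 0.
public final class VaidyaTreePartition {
    static List<Integer> cutRoots(List<List<Integer>> children, int k) {
        int n = children.size();
        int threshold = (n + k - 1) / k;       // ceil(n / k)
        int[] remaining = new int[n];          // size of the remaining subtree of v
        List<Integer> roots = new ArrayList<>();
        postOrder(children, 0, remaining, threshold, roots);
        roots.add(0);                          // whatever is left stays with the root
        return roots;
    }
    static void postOrder(List<List<Integer>> children, int v, int[] remaining,
                          int threshold, List<Integer> roots) {
        int size = 1;
        for (int c : children.get(v)) {
            postOrder(children, c, remaining, threshold, roots);
            size += remaining[c];              // children already cut away contribute 0
        }
        if (v != 0 && size >= threshold) {     // cut v, with its remaining subtree, from its parent
            roots.add(v);
            remaining[v] = 0;
        } else {
            remaining[v] = size;
        }
    }
}
```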
The classic version of Vaidya's algorithm uses a maximum weight span-
ning tree (Section 1 in the previous chapter) and the augmentation algorithm
described in this section. Our experiments included applying the augmentation
step on low-stretch trees too.
2. Ultra Sparsify
Ultra Sparsify is a more sophisticated augmentation algorithm. The algorithm is based on theoretical algorithms from [21, 20]. The versions of Ultra Sparsify that are presented in these papers are very complex and use various (large) constants that are designed to make the algorithm asymptotically efficient. We have designed and implemented a simpler algorithm that is based on the same ingredients.
The algorithm uses three components to augment a spanning tree TA of a weighted graph GA. We describe these components below. The set of all edges considered for augmentation is denoted by Eaug = E(GA) − E(TA). The behavior of the algorithm is controlled by two external parameters, c ≥ 1 and k ≥ 1. For simplicity, we denote by T and G the graphs TA and GA respectively.
Partitioning the Edges of Eaug. The first component of the algorithm partitions the edge set Eaug into disjoint subsets E1, E2, . . . , Ep. Each edge e ∈ Eaug is labeled with two values: its weight w(e) and the weighted dilation of its path embedding in TA. We explain below how the weighted dilations are computed efficiently. Once the labels have been computed, we partition the edges into subsets such that all the edges in a given subset differ in their weight and dilation by at most a factor of two, i.e., we place into the same subset all the edges with the same ⌊log2 w(e)⌋ and ⌊log2 dilation(e)⌋ values.
To compute the weighted dilations, we first need to find a centroid vertex in T. A centroid vertex is a vertex whose removal from the tree creates connected components each with at most n/2 vertices. Every tree has at least one centroid, and at most two (Jordan, 1869). Finding a centroid vertex in T can be done in linear time as follows. First, we designate an arbitrary vertex in T as its root, forming a rooted tree. Then, for each vertex v ∈ T, we let R(v) be the vertex count of the sub-tree rooted at v, including v itself. The pre-calculated array R(v) allows us to quickly answer queries of the form: what is the maximum size of any of the sub-trees remaining after the removal of some query vertex v? We denote this value by Q(v). Calculating R(v) can be done in a straightforward way using DFS. The core of the algorithm now follows. We set v1 to be an arbitrary vertex in T. If the removal of v1 satisfies our desired property, then we are finished. Otherwise, there is some sub-tree TB whose size is bigger than n/2. We let v2 be any vertex adjacent to v1 that is closer to TB, and repeat. This algorithm must halt, since the sequence Q(v1), Q(v2), . . . is monotonically decreasing, and must therefore reach the target size.
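The subtree-size idea can also be used to find a centroid directly, by scanning for a vertex none of whose remaining components exceeds n/2. The Java sketch below does that instead of the walk just described; it finds the same kind of vertex, and the tree representation and names are our own assumptions.

```java
import java.util.*;

// Sketch of centroid finding: compute subtree sizes R(v) with one bottom-up
// pass over a BFS order, then pick a vertex whose removal leaves no
// component with more than n/2 vertices. adj holds the tree's adjacency
// lists; vertex 0 is used as the root.
public final class Centroid {
    static int find(List<List<Integer>> adj) {
        int n = adj.size();
        int[] parent = new int[n];
        int[] size = new int[n];
        int[] order = new int[n];              // vertices in BFS order from the root
        Arrays.fill(parent, -1);
        int head = 0, tail = 0;
        order[tail++] = 0;
        parent[0] = 0;
        while (head < tail) {
            int v = order[head++];
            for (int u : adj.get(v)) if (u != parent[v]) { parent[u] = v; order[tail++] = u; }
        }
        for (int i = n - 1; i >= 0; i--) {     // accumulate subtree sizes bottom-up
            int v = order[i];
            size[v]++;
            if (v != 0) size[parent[v]] += size[v];
        }
        for (int v = 0; v < n; v++) {
            int largest = n - size[v];         // component containing v's parent (0 for the root)
            for (int u : adj.get(v)) if (u != parent[v]) largest = Math.max(largest, size[u]);
            if (largest <= n / 2) return v;
        }
        return 0;                              // unreachable for a valid tree
    }
}
```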
Now that the centroid has been found, we proceed by labeling all the vertices of T with their distance to the centroid. As before, we define the length of an edge to be the inverse of its weight. Note that by this definition, the weighted dilation of an edge (u, v) ∈ Eaug is simply the distance between u and v in T times its weight. We now consider all the edges (u, v) ∈ Eaug for which the path from u to v in T passes through the centroid. The distance between u and v can be easily calculated by summing the distances between each of them and the centroid. After these distances have been calculated, we remove the centroid from T together with its adjacent edges and reiterate on the remaining sub-trees. Each iteration handles more edges from Eaug. We keep iterating until we have handled all the edges in Eaug. The fact that we used the centroid ensures that the iteration count is logarithmic in n.
Tree Partitioning. Unlike Vaidya's algorithm, which partitions the tree only once, Ultra Sparsify uses each set Ei to induce a separate partitioning of T. For a given Ei, define the i-weight of a vertex v to be the sum of the weights of the edges in Ei that are incident on v. Let
φ = k · Σ_{e∈Ei} w(e) / |Ei| ,
where k is a parameter given to the Ultra Sparsify algorithm. The partitioning algorithm receives the parameter φ and partitions the tree into at most
(4/φ) · Σ_{e∈Ei} w(e)
connected sub-trees, such that the total vertex i-weight of each non-singleton component is bounded by φ. Unlike the partitioning used in Vaidya's algorithm, the resulting sub-trees may be non-disjoint. Specifically, two sub-trees may share at most one common vertex. For any vertex v, we denote by Tv the set of all vertices contained in the sub-tree rooted at v, and by w′(v) the sum of their i-weights.
The partitioning algorithm works as if it tries to calculate w′(v0) for some input vertex v0 using a depth-first traversal of T. That is, we iterate through v0's children {vi : 1 ≤ i ≤ degree(v0)}, recursively calculating w′(vi) for each child and summing the results together with the i-weight of v0 itself to obtain w′(v0).
If the accumulated i-weight sum exceeds φ/2 at the t-th child for some t < degree(v0), that is, before all the children have been processed, then all the vertices in Tv1 ∪ · · · ∪ Tvt are moved into a new component and removed from the tree. If t > 1, we also add v0 itself to the new component in order to ensure connectivity, but we do not remove it from the tree, for the sake of the rest of its children.
Otherwise, if the resulting w′(v0) satisfies φ/2 ≤ w′(v0) ≤ φ, then all the vertices in Tv0 are removed from the tree and moved into a new component in the partition.
The last case we need to handle is w′(v0) > φ. If this happens, we create one component that contains the vertex v0 alone, and t components, one for each of the sets Tv1, . . . , Tvt.
When the algorithm returns from handling v, the remaining weight w′(v) is smaller than φ/2.
Augmentation by Sampling. Given an edge subset Ei and a partitioning
of T that Ei induces, the last component of the algorithm chooses a subset of Ei
to add to the preconditioner. We construct a contracted graph in which every
vertex x represents one of the connected components of T and each edge (x, y)
represents all the edges of Ei with one endpoint in component x and the other in
y. The weight b(x, y) of an edge (x, y) is the number of edges that it represents.
For each vertex x in the contracted graph we also compute a weight b(x)
which is simply the sum of the weights of the edges incident on it. We now
compute for each edge a ratio r(x, y) and a probability p(x, y):
r(x, y) = max{ b(x, y)/b(x), b(x, y)/b(y) }
p(x, y) = min( c · r(x, y), 1 )
We then drop the edge (x, y) from the contracted graph with probability 1 − p(x, y). Intuitively, edges that are light relative to their surroundings will be dropped with high probability. As c goes up, the probability of dropping edges decreases linearly.
For each edge (x, y) that is left in the contracted graph after the random sampling, we add to the preconditioner the heaviest edge in Ei that is represented by (x, y) in the contraction.
The overall set of augmentation edges is the union of the edges contributed by each Ei by this procedure.
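A Java sketch of the sampling step on the contracted graph follows; the ContractedEdge type and all names are illustrative assumptions, not the thesis code.

```java
import java.util.*;

// Sketch of the sampling step. b(x,y) counts the Eaug edges between
// components x and y, b(x) sums b(x,y) over the edges incident on x, and
// each contracted edge survives with probability min(c * r(x,y), 1).
public final class SamplingAugmentation {
    static final class ContractedEdge {
        final int x, y; final double b;
        ContractedEdge(int x, int y, double b) { this.x = x; this.y = y; this.b = b; }
    }
    static List<ContractedEdge> keepEdges(int numComponents, List<ContractedEdge> edges,
                                          double c, Random rng) {
        double[] vertexWeight = new double[numComponents];   // b(x)
        for (ContractedEdge e : edges) { vertexWeight[e.x] += e.b; vertexWeight[e.y] += e.b; }
        List<ContractedEdge> kept = new ArrayList<>();
        for (ContractedEdge e : edges) {
            double r = Math.max(e.b / vertexWeight[e.x], e.b / vertexWeight[e.y]);
            double p = Math.min(c * r, 1.0);
            if (rng.nextDouble() < p) kept.add(e);            // dropped with probability 1 - p
        }
        return kept;
    }
}
```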
Ultra Sparsify Variants. The algorithm described above is our simplest implementation of Ultra Sparsify, denoted in our results by the name noMerge. We have also implemented two additional variants. The first variant, denoted by Graph in our results, extends the way we classify edges by adding a third label e(u, v) for each edge (u, v) ∈ Eaug. These labels are used to refine the partitioning of Eaug described above; all the edges within a given subset Ei must have the same e value.
In order to assign these labels, we first normalize all of the tree's edge weights such that the heaviest edge weighs 1. We then construct a sequence of forests {T^(i)} such that T^(i) is the forest containing all edges from TA whose weight is greater than 2^{−i}. The label e(u, v) is set to be the minimal l such that u and v are within the same connected component in the forest T^(l). In other words, e(u, v) = l if there is a path between u and v in T such that all the edges on the path have weight of at least 2^{−l}, but there is no such path with weights heavier than 2^{−l+1}. Calculating e(u, v) can be done quickly by using a disjoint-set data structure as follows. We start with T^(0), which is the empty forest, and repeatedly build T^(i+1) by merging, in the forest T^(i), the endpoints of edges whose weight is greater than 2^{−i−1}. We finish when T^(i+1) is a full tree (has one connected component).
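The level-by-level union-find computation can be sketched as follows in Java; the weights are assumed to be already normalized so that the heaviest tree edge weighs 1, and the level cap and all names are our own choices.

```java
import java.util.*;

// Sketch of computing the labels e(u, v) with a disjoint-set structure.
// Tree edges are added level by level (level i admits weights > 2^-i);
// a non-tree edge is labeled with the first level at which its endpoints
// become connected.
public final class ForestLabels {
    static int find(int[] parent, int v) {
        while (parent[v] != v) { parent[v] = parent[parent[v]]; v = parent[v]; }
        return v;
    }
    // treeEdges[e] = {u, v} with normalized weight treeW[e]; augEdges[e] = {u, v}.
    static int[] labels(int n, int[][] treeEdges, double[] treeW, int[][] augEdges) {
        int[] parent = new int[n];
        for (int v = 0; v < n; v++) parent[v] = v;
        int[] label = new int[augEdges.length];
        Arrays.fill(label, -1);                          // -1 means not yet connected
        int unlabeled = augEdges.length;
        for (int level = 1; unlabeled > 0 && level < 64; level++) {
            double cutoff = Math.pow(2.0, -level);
            for (int e = 0; e < treeEdges.length; e++)
                if (treeW[e] > cutoff)                   // edge belongs to the forest T^(level)
                    parent[find(parent, treeEdges[e][0])] = find(parent, treeEdges[e][1]);
            for (int e = 0; e < augEdges.length; e++)
                if (label[e] < 0 && find(parent, augEdges[e][0]) == find(parent, augEdges[e][1])) {
                    label[e] = level;
                    unlabeled--;
                }
        }
        return label;
    }
}
```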
Our next variant of Ultra Sparsify uses the same e labels, but uses them to select a different set of augmenting edges. Recall that each Ei induces a partitioning of T, and this partitioning is later used to find augmenting edges. When the e labels are used in the construction of the Ei's, all the edges in a given Ei have the same label, say l. In this variant we partition the forest T^(l) rather than the entire tree T.
CHAPTER 6
Partition-First Sparsification
The algorithms that we described so far begin with a spanning tree and then
partition it (once or several times) in order to augment the tree with additional
edges. It is also possible to construct a preconditioner by partitioning the original
graph, building a preconditioner in each part, and then augmenting the union of
the sub-preconditioners with extra edges if necessary.
We are aware of two such algorithms. One is not described in the literature,
but is used in a Java code by Dan Spielman. The other is described in a theo-
retical paper by Koutis and Miller [17]. The Koutis-Miller algorithm works only
on planar graphs.
The algorithm by Koutis and Miller constructs a vertex cover of the graph
rather than a partition: each vertex belongs to at least one subset in the cover
but possibly more. Each edge is covered (both endpoints) by at least one subset
in the cover and at most two. The sizes of the (overlapping) sub-graphs induced by the cover are roughly equal, and the number of vertices contained in more than one sub-graph is small. The construction of the cover depends only on the
connectivity of GA, not on the edge weights.
Once the cover is found, the algorithm constructs a preconditioner in each
subset. Their union is the overall preconditioner. Clearly, if every edge is sup-
ported well by paths in at least one of the sub-graphs to which it belongs, then
all the edges are supported well. There is no need to add inter-subset edges.
To obtain low asymptotic running times, Koutis and Miller propose to call this algorithm recursively. That is, to construct an initial preconditioner using a cover with subsets whose size is bounded by a constant. Degree-1 and degree-2 vertices of this preconditioner are then eliminated using Gaussian elimination; once all the remaining vertices have degree three or more, the elimination stops, and we construct a new preconditioner for the partially-eliminated matrix. We have not implemented this algorithm.
We describe Spielman's partition-first algorithm in Chapter 7 below.
CHAPTER 7
Additional Heuristics
The total complexity of some of the algorithms that we have described so
far can be analyzed rigorously. This means that it is possible to bound the cost
of constructing the preconditioner, the cost of factoring it, and the cost of the
iterations.
It is also possible to use the same algorithmic building blocks to design heuris-
tic sparsification algorithms and heuristic solvers. It may or may not be possible
to bound their worst-case behavior, but they may still work well on some prob-
lems.
The description of Ultra Sparsify above is one example of such a heuristic. The algorithm that we have described is based on provably-effective algorithms, but the particular variant that we have described has not been rigorously analyzed.
1. Spielman's Heuristic Algorithm
Another heuristic that we have used in our experimental study is by Dan
Spielman. The heuristic is implemented in a Java code; to the best of our
knowledge, it is not described in a paper. This code also starts by decomposing
GA into disjoint vertex sets. This decomposition is again based on set growing
technique.
Each set is grown from a root vertex in two distinct phases. In the rst phase,
the set is grown until its size reaches 2S/3 for a given target size S. Once the
set contains at least 2S/3 vertices, we switch to the second phase. In the second
phase, we continue to grow the set as long as it has at most 4S/3 vertices, but
we also monitor the edge-count to vertex-count ratio. Once the second phase
ends, we choose for the nal set the state during the second phase at which the
ratio was highest. Note that during the second phase, some vertices may be
tentatively added to the set but removed when the set is nalized.
The actual growing of a set is done using a variant of Dijkstra's shortest-
paths algorithm. This variant views the graph as a network of resistors and
approximates the resistance between points in the network. Suppose that we
are processing a vertex v and relaxing the distance from the root to a neighbor
u. Let ¯d be the vector of shortest distance estimates. In Dijkstra's standard
algorithm, the new ¯d(u) is the minimum between ¯d(v) + d(v, u) and ¯d(u). In
Spielman's modification, the new ¯d(u) is set to be
( 1/¯d(u) + 1/(¯d(v) + d(v, u)) )^{−1} .
This value is always lower than both ¯d(u) and ¯d(v) + d(v, u). The entries of ¯d are initialized to ∞, so if this is the first relaxation done for u, then we have
( 1/¯d(u) + 1/(¯d(v) + d(v, u)) )^{−1} = ( 0 + 1/(¯d(v) + d(v, u)) )^{−1} = ¯d(v) + d(v, u) .
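The modified relaxation rule itself is tiny; the following Java sketch shows it in isolation (our own illustration, not Spielman's code).

```java
// Sketch of the modified relaxation rule: instead of taking the minimum of
// the old estimate and the new path length, the two are combined like
// parallel resistances. Distances start at Double.POSITIVE_INFINITY, for
// which the formula falls back to the ordinary Dijkstra update.
public final class ResistiveRelax {
    // dBarU is the current estimate for u, dBarV the estimate for v,
    // and dvu the length of the edge (v, u).
    static double relax(double dBarU, double dBarV, double dvu) {
        double candidate = dBarV + dvu;
        if (Double.isInfinite(dBarU)) return candidate;   // first relaxation of u
        return 1.0 / (1.0 / dBarU + 1.0 / candidate);
    }
}
```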
When the algorithm finishes forming one set, it selects a vertex that is not yet in any set as the root of the next set to grow. The vertices are considered as potential roots according to their distance from the first root (in order of increasing distance).
Once the graph is decomposed, we find a rooted spanning tree within each component. The root is the middle vertex of the component. This vertex is found as follows. We find all the vertices that are adjacent to vertices outside the component. We start a breadth-first search from these vertices and use as the root (the middle vertex) the last vertex found in this traversal. The tree itself is constructed by the same modified Dijkstra algorithm described above.
Finally, we reconnect with one edge each pair of components that was originally connected in the graph. The edge that is added is the one that minimizes the ratio
( cheur(u) + cheur(v) ) / w(u, v) ,
where cheur(v) is a heuristic metric assigned to all the vertices. The computation of cheur(v) is complex, so we do not describe it in detail. The choice of the connecting edge is designed to heuristically minimize the stretch of the connecting edges that are not chosen.
Figure 1. Spielman's heuristic partitioning, with a small set target size (16, on the left) and a bigger target size (36, on the right). The regular partitioning on the left was obtained only when using a target size of 16.
Figure 2. METIS partitioning, again with small and larger set target sizes. The partitioning is less regular than the partitioning in Figure 1.
Figure 3. A non uniform graph with edge weights ranging from
0.1 to 1.5E8. Edges are colored using a color palette that allocates
the reds to heavy edges and the blues to light edges.
Figure 4. Spielman heuristic partitioning. Note how the sets are
aligned with the ring of heavier edges.
Figure 5. In contrast to Figure 4, we can see that METIS does
not take edge weights into consideration.
2. Modied Vaidya
Another form of heuristics that we have tested is a variant of Vaidya's al-
gorithm: instead of adding the heaviest edge between components we add the
edge with the highest weighted dilation. The algorithm that calculates weighted
dilation for non-tree edges is a component of Ultra Sparsier. This allowed us
to observe what is the relative benet of using this single component of Ultra
Sparsier. We call this variant Modied Vaidya.
We have also experimented with additional variants and ideas, but when
initial experimentation indicated that a particular variant is not particularly
eective, we dropped it from our study.
CHAPTER 8
Other Combinatorial Preconditioners
The literature on combinatorial preconditioners contains other ideas that we
have not tested. There are three main categories of such ideas.
1. Preconditioning in a Larger Space
The first category of algorithms that we have not tested precondition A using the Schur complement B of a larger matrix M. To solve the preconditioning equation Bz = r for z, these algorithms first extend r with zeros to match the dimension of M. Denoting the extended vector by r̃, the algorithm now solves M z̃ = r̃ and then takes z to be the first entries of z̃. Because of the extension with zeros, this is equivalent to solving Bz = r where B is the Schur complement of M with respect to the elimination of M's last rows and columns. The resulting solver is exactly equivalent to using M as a preconditioner for the extended problem Ãx̃ = b̃, where Ã is A extended with zero rows and columns and b̃ is an extension of b with zeros.
From the algorithmic point of view, these algorithms really solve Ãx̃ = b̃ using M as a preconditioner, except that the extended vectors are explicitly extended only when M^{-1} is applied. In the rest of the algorithm, only their first entries are maintained. From the spectral point of view, convergence is governed by the generalized eigenvalues of (Ã, M), which are also the generalized eigenvalues of (A, B).
The first algorithm of this type is due to Gremban and Miller [13]. Their algorithm constructed a preconditioner M whose graph GM is a balanced binary tree whose leaves are the vertices of GA. Each internal vertex of GM is viewed as a set of vertices of GA. The weight of the edge from a vertex v in GM to its parent is the total weight of the edges in GA linking the set represented by v to the rest of GA. This construction leads to congestion-dilation bounds in which there is no congestion and the dilation is logarithmic. This bounds the generalized eigenvalues from one side. The other bound is much more difficult to analyze. Gremban and Miller presented such an analysis for regular grids. A later paper by Woo et al. [18] extended this idea to general graphs, but without an efficient algorithm to construct M.
Shklarski and Toledo used the same algebraic framework to construct preconditioners for matrices arising from finite-element discretizations, even when the matrices are not diagonally dominant [19]. In their method, called fretsaw preconditioning, the introduction of new vertices essentially allows the preconditioner to relax continuity constraints in the original problem. This framework was recently used by Daitch and Spielman to construct provably-effective preconditioners for certain two-dimensional problems in linear elasticity [9].
2. Combinatorial Preconditioners for Problems that are not
Diagonally Dominant
The fretsaw extension method of Shklarski and Toledo allows researchers to design combinatorial preconditioners for problems that are not diagonally dominant. There are additional techniques for handling specific families of problems.
One technique is again due to Gremban and Miller [12]. It reduces a symmetric diagonally-dominant problem with positive off-diagonals to a larger linear system (twice as large) with a coefficient matrix that is symmetric, diagonally dominant, and has only non-positive off-diagonals.
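To make the reduction concrete, here is a sketch of the construction as we understand it; the splitting notation below is ours, not from the thesis. Write A = D + An + Ap, where D is the diagonal of A, An contains the non-positive off-diagonal entries and Ap the positive ones. The reduced, twice-as-large system is then

\[
\hat{A} = \begin{pmatrix} D + A_n & -A_p \\ -A_p & D + A_n \end{pmatrix},
\qquad
\hat{A}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} b \\ -b \end{pmatrix},
\qquad
x = \tfrac{1}{2}\,(x_1 - x_2),
\]

and \(\hat{A}\) is symmetric, diagonally dominant, and has only non-positive off-diagonals, so the combinatorial machinery described above applies to it.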
Another technique for the same class of problems relies on constructing a maximum-weight basis for the matroid induced by the weighted incidence matrix U of A = UU^T. This maximum-weight basis is a generalization of the maximum spanning tree. As in Vaidya's algorithm, this maximum-weight basis can be augmented with extra edges. This algorithm is due to Boman, Chen, Hendrickson, and Toledo [3].
A more recent technique, which also appears to be more widely applicable (symmetric diagonally-dominant matrices with positive off-diagonals rarely arise in applications), is due to Boman, Hendrickson and Vavasis. This technique approximates each term Ae in a finite-element matrix A = Σe Ae by a diagonally-dominant approximation Le with only non-positive off-diagonals. These approximations are summed, and the sum L = Σe Le, which is also diagonally dominant, is approximated using a combinatorial sparsification algorithm. Avron, Chen, Shklarski and Toledo proposed a more general and more practical version of this method [1].
3. Recursive Combinatorial Algorithms
To achieve the best asymptotic running times, linear solvers that are based
on combinatorial preconditioners use recursion. We have already mentioned this
in the context of the algorithm of Koutis and Miller.
This technique works as follows. A combinatorial sparsification algorithm is used to sparsify GA, but not by much, in order to keep the bound on the number of iterations low (typically some large constant independent of n). The preconditioner is not factored. Instead, a partial elimination of degree-1 and degree-2 vertices is carried out until there are no more such vertices. This partial factorization is used in every preconditioning step. But to solve Bz = r, we also need to solve the reduced system. This is done iteratively, using another combinatorial preconditioner for the reduced system.
Preliminary experiments by Toledo suggested that it is difficult to obtain high performance with such recursive (or nested) preconditioners. We are not aware of any other experimental evidence showing that they are effective. Therefore, we have not tested them in this work.
CHAPTER 9
Experimental Results
We have conducted extensive experiments to evaluate these preconditioners.
This chapter describes the experiments and their results.
1. Methodology
We ran each preconditioning algorithm on each matrix several times using different algorithmic parameters. Each algorithm has parameters that control its behavior. For example, Vaidya's algorithm has one parameter, the target number of sub-graphs. Other preconditioners, like Ultra Sparsify, have two parameters. We did not know in advance which algorithmic parameters are worth investigating for each preconditioning algorithm and for each matrix. We also did not know which sampling resolution to use in order to explore the parameter space thoroughly. Lastly, we did not want to run preliminary tests only to find which regions of the parameter space are the interesting ones.
In order to address all these problems, we used an automatic mechanism for parameter generation. This mechanism allocates parameter values within the range (0, 1) in increasing resolution. All preconditioning algorithms were implemented such that their parameters were expected to be in the range (0, 1). Each algorithm interpreted its normalized parameters as required by its structure. For example, Vaidya's algorithm was hardwired to look for n^x sub-graphs, where n is the number of vertices in the graph, and drop-tolerance Cholesky factorization used a drop tolerance value of 2^(-16x) (x being the normalized parameter value). Each test run was allocated a serial number, beginning with 1. Our test harness used this serial number to generate the exact parameter values using a simple bijection function ord : N → (0, 1). This function generates all the values in the range (0, 1) in increasing resolution (1/2, 1/4, 3/4, 1/8, 3/8, 5/8, ...) as required. This approach was also extended to multi-parameter preconditioners using a second bijection kary(k) : N → N^k. Using this mechanism, we could simply run the tests continuously until enough data had accumulated.
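A minimal sketch of this mechanism follows (the class and method names are ours; the actual harness was written in Matlab). ord maps the run's serial number to a normalized value in (0, 1), and each algorithm then maps that value to its own parameter as described above.

    final class ParameterGeneration {
        // ord(1), ord(2), ord(3), ... = 1/2, 1/4, 3/4, 1/8, 3/8, 5/8, 7/8, 1/16, ...
        static double ord(int i) {
            int level = 31 - Integer.numberOfLeadingZeros(i);  // floor(log2(i))
            int denominator = 1 << (level + 1);
            int numerator = 2 * (i - (1 << level)) + 1;        // odd numerators within the level
            return (double) numerator / denominator;
        }

        // Example interpretations of the normalized parameter x, following the text.
        static long vaidyaTargetSubgraphs(double x, int n) {   // n^x sub-graphs
            return Math.round(Math.pow(n, x));
        }

        static double cholincDropTolerance(double x) {         // drop tolerance 2^(-16x)
            return Math.pow(2.0, -16.0 * x);
        }
    }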
For each run, we solved one linear system Ax = b with a random (known) solution x. We recorded the density of the factor of the preconditioner and the number of arithmetic operations that the solver performed, including both the factorization phase and the iterative phase. We did not attempt to quantify the computational effort involved in the graph algorithm that constructs the preconditioner. Doing so in a meaningful way requires equally-efficient implementations of all the algorithms, which we did not have. We focused instead on trying to find graph algorithms that produce effective preconditioners, leaving to further research the question of how fast each graph algorithm is.
Similarly, we used an abstract metric of performance rather than actual running times. We compared the algorithms according to the number of arithmetic operations that they perform. This does not always rank them by actual running time, but it is an effective way to factor out the performance characteristics of specific implementations of key subroutines, such as the sparse factorization code and the iterative solver.
We present the results using graphs that relate the density of the factor of the preconditioner to the total number of arithmetic operations that the solver performed (see Figure 1 for a typical example). The horizontal axis of the graphs ends at the fill of the complete Cholesky factorization of GA itself. Each graph also shows a blue line that bounds the work from above. This line shows the amount of work done by the direct solver. For some of the graphs we also provide a zoom-in. We also used another type of graph when we wanted to highlight a difference between two variants of the same algorithm. This other graph is a histogram whose variable is the ratio of performance between the compared algorithms.
In the tests performed on the augmentation algorithms (Vaidya and Ultra Sparsify) we used the same underlying tree as the augmentation base throughout the entire test runs.
2. Test Matrices
We used three families of test matrices. The first family consists of regular 2- and 3-dimensional meshes sized 216 × 216 and 36 × 36 × 36 respectively (a total of 6^6 = 46656 vertices). In 2-dimensional meshes internal vertices have degree 4, and in 3-dimensional meshes internal vertices have degree 6. We used meshes with both uniform edge weights and random edge weights (with uniform and independent weights in (0, 1)).
The second family of matrices comes from finite-elements solvers. These matrices were produced by an algorithm that approximates the coefficient matrix of a scalar elliptic finite-elements problem by a diagonally-dominant one [1]. All of these matrices correspond to discretizations of 3-dimensional structures. Their edge weights are usually not uniform (sometimes they vary by several orders of magnitude), but they are not random. In total, there were 4 such matrices:
matrix name   number of unknowns   problem's domain
SC_G          93202                A 10-by-10-by-10 cube containing a spherical shell of inner radius 3 and thickness 0.1.
CH_G          32867                A 1-by-1-by-1 cube with a 1-by-0.1-by-0.79 hole in its middle.
B_G           196053               A 1-by-1-by-10000 box.
C_G           326017               A unit box.
The third family of matrices is a small subset of the University of Florida
Sparse Matrix Collection. We have used only SPDDD matrices from this collec-
tion with at least 9000 vertices:
Name Number of unknowns Number of non-zeros
Andrews-Andrews 60000 760154
Gaertner-Nopoly 10774 70842
GHS_psdef-Apache1 80800 542184
GHS_psdef-Jnlbrng1 40000 199200
Norris-Fv1 9604 85264
Norris-Fv2 9801 87025
Norris-Fv3 9801 87025
For more details about these matrices, see [10].
3. Algorithm Legend
The following tables summarize the preconditioning algorithms that we have tested, along with their sources. Each algorithm has two names. The short names are used in the text, and the long ones in the graphs.
Short name Variant Naming in graphs
Ultra Sparsify graph UltraSparsify.graph.mst
Ultra Sparsify noMerging UltraSparsify.noMerging.mst
Ultra Sparsify subtree UltraSparsify.subtree.mst
Vaidya Standard vaidya augment.standard.center-root.mst
Vaidya Modified vaidya augment.special.center-root.mst
Cholinc - cholinc
javaClus - javaClus
Preconditioner name Source
Ultra Sparsify Dan Spielman's theoretical work [21, 20]
Vaidya Vaidya [7]
Modified Vaidya Our own heuristics
Cholinc Standard drop tolerance incomplete Cholesky
javaClus Dan Spielman's Java code
4. Test Infrastructure
In order to develop code quickly and to simplify the implementations of different heuristics, all the implementations of the graph algorithms used for building preconditioners were done in Java 5. These implementations used a flexible and robust graph-algorithmic framework we created for this purpose. Our framework supports all the basic graph operations that were required in order to implement the preconditioners that we have tested. Although there are many open source graph packages for Java, we did not find any that supported the operations we needed and whose performance was adequate. Still, we did not write everything from scratch; we used GNU's Trove (High performance collections for Java, http://trove4j.sourceforge.net/) and Colt (High Performance Scientific and Technical Computing in Java, http://dsd.lbl.gov/~hoschek/colt/). In total, around 15k lines of Java code were written, both for the framework and for the preconditioners' implementations. These line counts do not include Dan Spielman's Java code (javaClus), whose size is roughly similar to ours.
The test harness itself was implemented purely in Matlab, using around 10k lines of code. We used Matlab version 7.2, which supports calling directly into Java 5 code. We also used Matlab's internal incomplete Cholesky implementation. Since our performance metrics are abstract, we could run tests on different machines with different amounts of RAM and different architectures. In total, we used 4 machines to run all the experiments, sometimes concurrently.
Running all the tests was time consuming. A typical experiment on a moderately sized matrix (a 32 × 32 × 32 3D mesh, for example) takes up to a week to finish. For some of the bigger matrices, the tests took even longer to finish. Memory requirements were also quite high, with up to 5 GB of RAM needed for the tests on the larger matrices.
5. Results
Figures 1 and 2 show a typical performance graph. Each point in this graph is the result of a single solve of the same linear system, a uniform 3D mesh in this case, using a unique combination of algorithmic parameters. We can see that all of the preconditioning techniques can improve upon the direct solver, whose work requirement is depicted by the blue line. The most efficient preconditioner is the incomplete Cholesky factorization (Figure 3), and Spielman's Heuristic Algorithm (javaClus) comes second with around 21% more work. For almost all of our test matrices, the incomplete Cholesky factorization was the most efficient preconditioner. It is interesting to note, however, that javaClus was able to achieve the same amount of work as incomplete Cholesky with fewer non-zeros. A similar performance pattern is visible in 3D random meshes too (Figure 4).
An important aspect of these performance graphs is that they also allow us to easily grasp the overall behavior of the preconditioners. That is, the graphs show the performance effects of varying the algorithmic parameters of a preconditioner. Regular behaviors like continuous slopes or bands show that the algorithm is stable with respect to its parameters, which makes it easier to choose these parameters. For example, it is evident that javaClus is not particularly well behaved, since its results are more scattered. This behavior is visible in most of our results.
Our tests contained some matrices which were hard to precondition effectively. These matrices include the 2D meshes, both unweighted and random (Figure 6), the matrix B_G (Figure 8) and the matrix Norris-Fv3 (Figure 24). On these matrices, the direct solver was the most efficient algorithm.
Vaidya vs. Modified Vaidya. Our variant of Vaidya (described in Chapter 7) performed better than the standard Vaidya in most of the tests. In order to establish this in a precise way, we compared how each Vaidya variant performed on the same matrix and using the same algorithmic parameters. The ratios between the two variants' performances are plotted in histograms. A ratio r means that the Modified Vaidya variant outperformed the standard Vaidya by a factor of r. Figure 9 shows a few of these histograms. The average improvement of using Modified Vaidya over all experiments was around 4%. For one matrix (SC_G) we have seen an average improvement of about 16%.
Figure 1. Performance graphs for a 3D uniform mesh 36 × 36 × 36.
Figure 2. A magnification of Figure 1.
On the other hand, on one matrix (Andrews-Andrews, taken from the University of Florida matrix collection), Modified Vaidya performed 7% worse than standard Vaidya.
Ultra Sparsify. We have used the same comparison method to compare the three variants of Ultra Sparsifier. Our finding was that the differences between these three variants were negligible. Specifically, the average performance ratios among the three variants, as measured between runs that used the same matrix and the same algorithmic parameters, were less than 1%. We have noticed, however, that the variant graph was consistently slightly better than the two other variants.
Figure 3. A magnification of Figure 2.
Figure 4. Performance graphs for a 3D random mesh 36 × 36 × 36.
Comparing Ultra Sparsify to Vaidya shows that for about half of the matrices the performance of these preconditioners was similar, with neither preconditioner showing a consistent edge over the other. However, for the other half there were some big differences. For the matrix Andrews-Andrews, Ultra Sparsify performed about twice as well as Vaidya, and for the matrix GHS_psdef-Apache1 Vaidya performed about twice as well as Ultra Sparsify. We did not find any explanation for this phenomenon.
We were also interested in how varying each of the parameters k and c affects the Ultra Sparsifier preconditioner. Figure 25 shows graphs that depict the relation between each of these parameters and the total performance.
Figure 5. Magnification of Figure 4.
Figure 6. Performance graphs for 2D uniform and random
meshes of size 216 × 216.
It can be seen that the performance depends almost exclusively on the value of k. However, closer examination of the data reveals (as seen in the bottom figure) that for small values of k, the parameter c also has a big impact on performance. This dependency on k and c is the typical behavior of Ultra Sparsify, and is evident in the results of other matrices too. Generally, the best performance was achieved for n^0.4 ≤ k ≤ n^0.5 and 1 ≤ c ≤ 2.
Low Stretch Trees. Trees usually make poor preconditioners if not augmented with additional edges. Preliminary tests did show that preconditioning with a low stretch spanning tree is better than preconditioning with a maximum weight spanning tree.
Figure 7. Magnification of Figure 6.
Figure 8. Performance graphs for the matrix B_G. The direct
solver (its blue line is barely visible at the bottom of the frame)
was about 6 times more efficient than the most efficient preconditioner.
However, both trees performed much worse than any of the other preconditioners. In order to compare low stretch spanning trees with maximum weight spanning trees in a practical context, we conducted tests in which these trees were used as bases for augmentation using Vaidya and Ultra Sparsify. In order to conduct a fair comparison, we applied the same preconditioning algorithm and the same algorithmic parameters to both trees. We again show the results of these experiments using a set of histograms.
Figure 9. Ratios between the performance of standard Vaidya and
Modified Vaidya. A ratio r means that Modified Vaidya outperformed
the standard Vaidya by a factor of r. The matrices used
were a uniform 3D mesh (left), SC_G (middle) and B_G (right).
The histograms' variable is the ratio between the performances obtained from the two tree types. Specifically, a ratio r means that the low stretch tree outperformed the maximum weight tree by a factor of r. We show 5 such histograms in Figures 10 and 11, one histogram for each augmentation algorithm. The matrix used in this experiment was CH_G. These results show that low stretch trees do improve the quality of the resulting preconditioner over maximum weight spanning trees, but not by a large factor. The maximal improvement was evident in Vaidya, with an average improvement of about 14%. Ultra Sparsify showed an average improvement of less than 3%.
In order to evaluate the benefit of low stretch spanning trees in general, and not necessarily just one implementation of them, we conducted another experiment. For this experiment we implemented a straightforward algorithm that produces spanning trees with a bounded O(log n) stretch, but only for n × n × n 3D uniform meshes. This stretch bound is lower than the bound of the general low stretch spanning tree (of Section 2 in Chapter 4), and is obtainable only because the structure of the graph is known beforehand. In Figure 12 we can easily see that our lower stretch spanning tree is much more effective than the general algorithm. All variants of the augmentation algorithms performed better on our very low stretch tree.
Figure 10. Ratios of performance between low stretch and maximum
weight spanning trees for the algorithms Vaidya (left) and
Modified Vaidya (right). Ratios bigger than 1 mean that the low
stretch tree performed better.
Figure 11. Ratios of performance between low stretch and maximum
weight spanning trees for the algorithm Ultra Sparsify using
its three variants: graph (left), subtree (right) and noMerging
(bottom). Ratios bigger than 1 indicate that the low stretch tree
was more effective.
Figure 12. Performance graph of the augmentation algorithms
Vaidya and Ultra Sparsify, with spanning-tree bases of the general
low stretch spanning tree and our ad-hoc O(log n)-bounded low
stretch tree for a 3D mesh 32 × 32 × 32. The effectiveness of using
a lower stretch tree can be easily seen.
Figure 13. Performance graph for matrix C_G.
Figure 14. Zoom-in of Figure 13.
Figure 15. Performance graph for matrix CH_G.
Figure 16. Zoom-in of Figure 15.
Figure 17. Performance graph for matrix SC_G.
Figure 18. Zoom-in of Figure 17.
Figure 19. Performance graph for matrix Andrews-Andrews.
Figure 20. Zoom-in of Figure 19.
Figure 21. Performance graph for matrix Gaertner-Nopoly and
its zoom-in.
Figure 22. Performance graph for matrix GHS_psdef-Apache1
and its zoom-in.
Figure 23. Performance graph for matrix GHS_psdef-Jnlbrng1.
Figure 24. Performance graphs for the matrices Norris-Fv1, Norris-Fv2
and Norris-Fv3.
Figure 25. This figure shows the relation between the parameters
c (left) and k (right) and the performance of the Ultra Sparsify
preconditioner (using the noMerging variant and the matrix CH_G).
It seems that the value of k is more influential than that of c
regarding the preconditioner's performance. However, this is not
the case, as seen by examining the bottom figure. That figure
shows the relation between c and the total performance only for
20 ≤ k ≤ 30, the range of k in which performance was optimal. It is
evident that the value of c also has a big impact on performance.
CHAPTER 10
Conclusions
The experimental results that we have presented lead to several important
conclusions.
First, the different graph algorithms that we have tested lead to moderate quantitative performance differences, but these differences are not dramatic and, more importantly, are not always consistent. On some matrices, all the graph algorithms performed similarly (e.g., SC_G). On other matrices, the Vaidya variants performed better than the Ultra Sparsify variants (e.g., GHS_psdef-Apache1), or the Ultra Sparsify variants performed better (e.g., Andrews-Andrews). Spielman's heuristic, javaClus, outperformed all the other graph algorithms on regular 3D meshes, but was slower than all of them on several other matrices.
Second, the behaviors of the different graph algorithms are similar in several qualitative aspects. For example, incomplete Cholesky usually beats all of the other preconditioners, but when it does not, it is slower than all of them. Similarly, the direct solver either beats all of them or is slower than all of them.
Therefore, the main conclusion of this research is that the graph algorithms that have been recently proposed in the literature for graph sparsification are not significantly better than Vaidya's algorithm, whose behavior is now well understood [1, 7]. Performance can be improved using a better initial tree (a tree with very low stretch, for example), but such trees improve both Vaidya's and the newer algorithms' performance almost equally. Nevertheless, lower stretch trees do generally lead to better performance.
This may seem like a fairly trivial conclusion, but it is not. Without the extensive experimentation reported in this thesis, it is impossible to know whether the improved asymptotics of recent algorithms translate into better performance in practice. It was widely believed that the recently-proposed algorithms would outperform Vaidya by a wide margin, thereby improving the competitiveness of combinatorial preconditioners over other classes of solvers. Our data does not support this belief and, in the context of the matrices used for the experiments, contradicts it.
We have not attempted to understand what features of the matrices help
each algorithm excel. For example, we do not know what property of Andrews-
Andrews contributes to the high performance of Ultra Sparsify and what prop-
erty of GHS_psdef-Apache1 causes Vaidya to outperform Ultra Sparsify. This
remains an interesting question for future research.
We have also ignored the efficiency of the graph algorithms themselves, as opposed to the quality of the preconditioners that they produce. To address this issue, one would need to implement Ultra Sparsify and low-stretch-tree constructors efficiently. Such implementations usually require a significant amount of effort. Given our results, it is not clear whether such efforts are justified, especially since we cannot isolate a single variant that is clearly better than the others (so that the implementation effort could focus on it). One exception may be the construction of low-stretch trees; lower stretch seems to usually translate into better preconditioners, so such an implementation effort may be justified.
Acknowledgement. This research was supported by an IBM Faculty Part-
nership Award, by grant 848/04 from the Israel Science Foundation (founded by
the Israel Academy of Sciences and Humanities), and by grant 2002261 from the
United-States-Israel Bi-national Science Foundation.
Bibliography
[1] Haim Avron, Doron Chen, Gil Shklarski, and Sivan Toledo. Combinatorial preconditioners for scalar elliptic finite-elements problems. Submitted to SIAM Journal on Scientific Computing, November 2006.
[2] Marshall Bern, John R. Gilbert, Bruce Hendrickson, Nhat Nguyen, and Sivan Toledo. Support-graph preconditioners. SIAM Journal on Matrix Analysis and Applications, 27:930-951, 2006.
[3] Erik G. Boman, Doron Chen, Bruce Hendrickson, and Sivan Toledo. Maximum-weight-basis preconditioners. To appear in Numerical Linear Algebra with Applications, 29 pages, June 2001.
[4] Erik G. Boman, Bruce Hendrickson, and Stephen Vavasis. Solving elliptic finite element systems in near-linear time with support preconditioners. Submitted for publication, 2004.
[5] Erik G. Boman and Bruce Hendrickson. Support theory for preconditioning. SIAM Journal on Matrix Analysis and Applications, 25(3):694-717, 2004.
[6] Doron Chen, John R. Gilbert, and Sivan Toledo. Obtaining bounds on the two norm of a matrix from the splitting lemma. Electronic Transactions on Numerical Analysis, 21:28-46, 2005.
[7] Doron Chen and Sivan Toledo. Vaidya's preconditioners: Implementation and experimental study. Electronic Transactions on Numerical Analysis, 16:30-49, 2003.
[8] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms. MIT Press and McGraw-Hill, 2nd edition, 2001.
[9] Samuel I. Daitch and Daniel A. Spielman. Support-graph preconditioners for 2-dimensional trusses.
[10] T. Davis. University of Florida sparse matrix collection. NA Digest, 92(42), October 16, 1994.
[11] Michael Elkin, Yuval Emek, Daniel A. Spielman, and Shang-Hua Teng. Lower-stretch spanning trees. In STOC '05: Proceedings of the thirty-seventh annual ACM symposium on Theory of computing, pages 494-503, Baltimore, MD, USA, 2005.
[12] Keith D. Gremban. Combinatorial Preconditioners for Sparse, Symmetric, Diagonally Dominant Linear Systems. PhD thesis, School of Computer Science, Carnegie Mellon University, October 1996. Technical Report CMU-CS-96-123.
[13] Keith D. Gremban, Gary L. Miller, and Marco Zagha. Performance evaluation of a new parallel preconditioner. In Proceedings of the 9th International Parallel Processing Symposium, pages 65-69. IEEE Computer Society, 1995. A longer version is available as Technical Report CMU-CS-94-205, Carnegie-Mellon University.
[14] Victoria E. Howle and Stephen A. Vavasis. Preconditioning complex-symmetric layered systems arising in electrical power modeling. In Proceedings of the 5th Copper Mountain Conference On Iterative Methods, March 1998. 7 unnumbered pages.
[15] Ilse C. F. Ipsen and Carl D. Meyer. The idea behind Krylov methods. American Mathematical Monthly, 105(10):889-899, 1998.
[16] Anil Joshi. Topics in Optimization and Sparse Linear Systems. PhD thesis, Department of Computer Science, University of Illinois at Urbana-Champaign, 1997.
[17] Ioannis Koutis and Gary L. Miller. A linear work, O(n^(1/6)) time, parallel algorithm for solving planar Laplacians.
[18] Bruce M. Maggs, Gary L. Miller, Ojas Parekh, R. Ravi, and Shan Leung Maverick Woo. Finding effective support-tree preconditioners. Unpublished manuscript; available online at http://www-2.cs.cmu.edu/ maverick, 2005.
[19] Gil Shklarski and Sivan Toledo. Rigidity in finite-element matrices: Sufficient conditions for the rigidity of structures and substructures. Submitted to SIAM Journal on Matrix Analysis and Applications, January 2006.
[20] Daniel A. Spielman and Shang-Hua Teng. Solving sparse, symmetric, diagonally-dominant linear systems in time O(m^1.31). In Proceedings of the 44th Annual IEEE Symposium on Foundations of Computer Science.
[21] Daniel A. Spielman and Shang-Hua Teng. Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In STOC '04: Proceedings of the thirty-sixth annual ACM symposium on Theory of computing, pages 81-90, Chicago, IL, USA, 2004.
[22] Pravin M. Vaidya. Solving linear equations with symmetric diagonally dominant matrices by constructing good preconditioners. Unpublished manuscript. A talk based on the manuscript was presented at the IMA Workshop on Graph Theory and Sparse Matrix Computation, October 1991, Minneapolis.
57

Contenu connexe

Tendances

Comparative Analysis of Different Numerical Methods of Solving First Order Di...
Comparative Analysis of Different Numerical Methods of Solving First Order Di...Comparative Analysis of Different Numerical Methods of Solving First Order Di...
Comparative Analysis of Different Numerical Methods of Solving First Order Di...ijtsrd
 
S6 l04 analytical and numerical methods of structural analysis
S6 l04 analytical and numerical methods of structural analysisS6 l04 analytical and numerical methods of structural analysis
S6 l04 analytical and numerical methods of structural analysisShaikh Mohsin
 
AN EFFICIENT PARALLEL ALGORITHM FOR COMPUTING DETERMINANT OF NON-SQUARE MATRI...
AN EFFICIENT PARALLEL ALGORITHM FOR COMPUTING DETERMINANT OF NON-SQUARE MATRI...AN EFFICIENT PARALLEL ALGORITHM FOR COMPUTING DETERMINANT OF NON-SQUARE MATRI...
AN EFFICIENT PARALLEL ALGORITHM FOR COMPUTING DETERMINANT OF NON-SQUARE MATRI...ijdpsjournal
 
Contradictory of the Laplacian Smoothing Transform and Linear Discriminant An...
Contradictory of the Laplacian Smoothing Transform and Linear Discriminant An...Contradictory of the Laplacian Smoothing Transform and Linear Discriminant An...
Contradictory of the Laplacian Smoothing Transform and Linear Discriminant An...TELKOMNIKA JOURNAL
 
A Nonstandard Study of Taylor Ser.Dev.-Abstract+ Intro. M.Sc. Thesis
A Nonstandard Study of Taylor Ser.Dev.-Abstract+ Intro. M.Sc. ThesisA Nonstandard Study of Taylor Ser.Dev.-Abstract+ Intro. M.Sc. Thesis
A Nonstandard Study of Taylor Ser.Dev.-Abstract+ Intro. M.Sc. ThesisIbrahim Hamad
 
Quantum algorithm for solving linear systems of equations
 Quantum algorithm for solving linear systems of equations Quantum algorithm for solving linear systems of equations
Quantum algorithm for solving linear systems of equationsXequeMateShannon
 
Fractional Derivatives of Some Fractional Functions and Their Applications
Fractional Derivatives of Some Fractional Functions and Their ApplicationsFractional Derivatives of Some Fractional Functions and Their Applications
Fractional Derivatives of Some Fractional Functions and Their ApplicationsAssociate Professor in VSB Coimbatore
 
Analytical and Exact solutions of a certain class of coupled nonlinear PDEs u...
Analytical and Exact solutions of a certain class of coupled nonlinear PDEs u...Analytical and Exact solutions of a certain class of coupled nonlinear PDEs u...
Analytical and Exact solutions of a certain class of coupled nonlinear PDEs u...IJERA Editor
 
On Vector Functions With A Parameter
On Vector Functions With A ParameterOn Vector Functions With A Parameter
On Vector Functions With A ParameterQUESTJOURNAL
 
Computational electromagnetics
Computational electromagneticsComputational electromagnetics
Computational electromagneticsAwaab Fakih
 
Transportation Problem with Pentagonal Intuitionistic Fuzzy Numbers Solved Us...
Transportation Problem with Pentagonal Intuitionistic Fuzzy Numbers Solved Us...Transportation Problem with Pentagonal Intuitionistic Fuzzy Numbers Solved Us...
Transportation Problem with Pentagonal Intuitionistic Fuzzy Numbers Solved Us...IJERA Editor
 
Mc0079 computer based optimization methods--phpapp02
Mc0079 computer based optimization methods--phpapp02Mc0079 computer based optimization methods--phpapp02
Mc0079 computer based optimization methods--phpapp02Rabby Bhatt
 
Master of Computer Application (MCA) – Semester 4 MC0079
Master of Computer Application (MCA) – Semester 4  MC0079Master of Computer Application (MCA) – Semester 4  MC0079
Master of Computer Application (MCA) – Semester 4 MC0079Aravind NC
 

Tendances (16)

Comparative Analysis of Different Numerical Methods of Solving First Order Di...
Comparative Analysis of Different Numerical Methods of Solving First Order Di...Comparative Analysis of Different Numerical Methods of Solving First Order Di...
Comparative Analysis of Different Numerical Methods of Solving First Order Di...
 
S6 l04 analytical and numerical methods of structural analysis
S6 l04 analytical and numerical methods of structural analysisS6 l04 analytical and numerical methods of structural analysis
S6 l04 analytical and numerical methods of structural analysis
 
AN EFFICIENT PARALLEL ALGORITHM FOR COMPUTING DETERMINANT OF NON-SQUARE MATRI...
AN EFFICIENT PARALLEL ALGORITHM FOR COMPUTING DETERMINANT OF NON-SQUARE MATRI...AN EFFICIENT PARALLEL ALGORITHM FOR COMPUTING DETERMINANT OF NON-SQUARE MATRI...
AN EFFICIENT PARALLEL ALGORITHM FOR COMPUTING DETERMINANT OF NON-SQUARE MATRI...
 
Contradictory of the Laplacian Smoothing Transform and Linear Discriminant An...
Contradictory of the Laplacian Smoothing Transform and Linear Discriminant An...Contradictory of the Laplacian Smoothing Transform and Linear Discriminant An...
Contradictory of the Laplacian Smoothing Transform and Linear Discriminant An...
 
A Nonstandard Study of Taylor Ser.Dev.-Abstract+ Intro. M.Sc. Thesis
A Nonstandard Study of Taylor Ser.Dev.-Abstract+ Intro. M.Sc. ThesisA Nonstandard Study of Taylor Ser.Dev.-Abstract+ Intro. M.Sc. Thesis
A Nonstandard Study of Taylor Ser.Dev.-Abstract+ Intro. M.Sc. Thesis
 
Quantum algorithm for solving linear systems of equations
 Quantum algorithm for solving linear systems of equations Quantum algorithm for solving linear systems of equations
Quantum algorithm for solving linear systems of equations
 
Fractional Derivatives of Some Fractional Functions and Their Applications
Fractional Derivatives of Some Fractional Functions and Their ApplicationsFractional Derivatives of Some Fractional Functions and Their Applications
Fractional Derivatives of Some Fractional Functions and Their Applications
 
Analytical and Exact solutions of a certain class of coupled nonlinear PDEs u...
Analytical and Exact solutions of a certain class of coupled nonlinear PDEs u...Analytical and Exact solutions of a certain class of coupled nonlinear PDEs u...
Analytical and Exact solutions of a certain class of coupled nonlinear PDEs u...
 
On Vector Functions With A Parameter
On Vector Functions With A ParameterOn Vector Functions With A Parameter
On Vector Functions With A Parameter
 
Computational electromagnetics
Computational electromagneticsComputational electromagnetics
Computational electromagnetics
 
At35257260
At35257260At35257260
At35257260
 
Transportation Problem with Pentagonal Intuitionistic Fuzzy Numbers Solved Us...
Transportation Problem with Pentagonal Intuitionistic Fuzzy Numbers Solved Us...Transportation Problem with Pentagonal Intuitionistic Fuzzy Numbers Solved Us...
Transportation Problem with Pentagonal Intuitionistic Fuzzy Numbers Solved Us...
 
Ou3425912596
Ou3425912596Ou3425912596
Ou3425912596
 
Mc0079 computer based optimization methods--phpapp02
Mc0079 computer based optimization methods--phpapp02Mc0079 computer based optimization methods--phpapp02
Mc0079 computer based optimization methods--phpapp02
 
Master of Computer Application (MCA) – Semester 4 MC0079
Master of Computer Application (MCA) – Semester 4  MC0079Master of Computer Application (MCA) – Semester 4  MC0079
Master of Computer Application (MCA) – Semester 4 MC0079
 
Linearization
LinearizationLinearization
Linearization
 

En vedette

En vedette (16)

1sem
1sem1sem
1sem
 
Normativas para la presentación de trabajos de grado
Normativas para la presentación de trabajos de gradoNormativas para la presentación de trabajos de grado
Normativas para la presentación de trabajos de grado
 
watch RBC Canadian Open live
watch RBC Canadian Open livewatch RBC Canadian Open live
watch RBC Canadian Open live
 
ABSTRACT
ABSTRACTABSTRACT
ABSTRACT
 
Slideshare
SlideshareSlideshare
Slideshare
 
El Curriculum Vitae: tu lienzo en blanco #LaRutaDelEmpleo
El Curriculum Vitae: tu lienzo en blanco #LaRutaDelEmpleoEl Curriculum Vitae: tu lienzo en blanco #LaRutaDelEmpleo
El Curriculum Vitae: tu lienzo en blanco #LaRutaDelEmpleo
 
professionallevel
professionallevelprofessionallevel
professionallevel
 
Computación
ComputaciónComputación
Computación
 
RESUME PAMELA 2016
RESUME PAMELA 2016RESUME PAMELA 2016
RESUME PAMELA 2016
 
El Regionalismo Post-Hegemónico en el marco del nuevo Mercosur
El Regionalismo Post-Hegemónico en el marco del nuevo MercosurEl Regionalismo Post-Hegemónico en el marco del nuevo Mercosur
El Regionalismo Post-Hegemónico en el marco del nuevo Mercosur
 
Posiedzenie rg gniewino 21-23.2016
Posiedzenie rg   gniewino 21-23.2016Posiedzenie rg   gniewino 21-23.2016
Posiedzenie rg gniewino 21-23.2016
 
Std’s of bacterial etiology by Sunita Rajbanshi(AMDA)SSP
Std’s of bacterial etiology by Sunita Rajbanshi(AMDA)SSPStd’s of bacterial etiology by Sunita Rajbanshi(AMDA)SSP
Std’s of bacterial etiology by Sunita Rajbanshi(AMDA)SSP
 
Periodoncia
PeriodonciaPeriodoncia
Periodoncia
 
Retinopathy of prematurity (upload for site)
Retinopathy of prematurity (upload for site)Retinopathy of prematurity (upload for site)
Retinopathy of prematurity (upload for site)
 
Inequality in media
Inequality in mediaInequality in media
Inequality in media
 
Raspado, alisado, curetaje e intrumentos en periodoncia.
Raspado, alisado, curetaje e intrumentos en periodoncia.Raspado, alisado, curetaje e intrumentos en periodoncia.
Raspado, alisado, curetaje e intrumentos en periodoncia.
 

Similaire à Unger

Setting linear algebra problems
Setting linear algebra problemsSetting linear algebra problems
Setting linear algebra problemsJB Online
 
Efficient Solution of Two-Stage Stochastic Linear Programs Using Interior Poi...
Efficient Solution of Two-Stage Stochastic Linear Programs Using Interior Poi...Efficient Solution of Two-Stage Stochastic Linear Programs Using Interior Poi...
Efficient Solution of Two-Stage Stochastic Linear Programs Using Interior Poi...SSA KPI
 
A-Hybrid-Approach-Using-Particle-Swarm-Optimization-and-Simulated-Annealing-f...
A-Hybrid-Approach-Using-Particle-Swarm-Optimization-and-Simulated-Annealing-f...A-Hybrid-Approach-Using-Particle-Swarm-Optimization-and-Simulated-Annealing-f...
A-Hybrid-Approach-Using-Particle-Swarm-Optimization-and-Simulated-Annealing-f...Pourya Jafarzadeh
 
Linear regression [Theory and Application (In physics point of view) using py...
Linear regression [Theory and Application (In physics point of view) using py...Linear regression [Theory and Application (In physics point of view) using py...
Linear regression [Theory and Application (In physics point of view) using py...ANIRBANMAJUMDAR18
 
29 15021 variational final version khalid hammood(edit)
29 15021 variational final version khalid hammood(edit)29 15021 variational final version khalid hammood(edit)
29 15021 variational final version khalid hammood(edit)nooriasukmaningtyas
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentIJERD Editor
 
A Robust Method Based On LOVO Functions For Solving Least Squares Problems
A Robust Method Based On LOVO Functions For Solving Least Squares ProblemsA Robust Method Based On LOVO Functions For Solving Least Squares Problems
A Robust Method Based On LOVO Functions For Solving Least Squares ProblemsDawn Cook
 
A New Neural Network For Solving Linear Programming Problems
A New Neural Network For Solving Linear Programming ProblemsA New Neural Network For Solving Linear Programming Problems
A New Neural Network For Solving Linear Programming ProblemsJody Sullivan
 
Numerical solution of eigenvalues and applications 2
Numerical solution of eigenvalues and applications 2Numerical solution of eigenvalues and applications 2
Numerical solution of eigenvalues and applications 2SamsonAjibola
 
A New SR1 Formula for Solving Nonlinear Optimization.pptx
A New SR1 Formula for Solving Nonlinear Optimization.pptxA New SR1 Formula for Solving Nonlinear Optimization.pptx
A New SR1 Formula for Solving Nonlinear Optimization.pptxMasoudIbrahim3
 
Scilab Finite element solver for stationary and incompressible navier-stokes ...
Scilab Finite element solver for stationary and incompressible navier-stokes ...Scilab Finite element solver for stationary and incompressible navier-stokes ...
Scilab Finite element solver for stationary and incompressible navier-stokes ...Scilab
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentIJERD Editor
 
Introduction to finite element method
Introduction to finite element methodIntroduction to finite element method
Introduction to finite element methodshahzaib601980
 

Similaire à Unger (20)

Ijetr021210
Ijetr021210Ijetr021210
Ijetr021210
 
Setting linear algebra problems
Setting linear algebra problemsSetting linear algebra problems
Setting linear algebra problems
 
Efficient Solution of Two-Stage Stochastic Linear Programs Using Interior Poi...
Efficient Solution of Two-Stage Stochastic Linear Programs Using Interior Poi...Efficient Solution of Two-Stage Stochastic Linear Programs Using Interior Poi...
Efficient Solution of Two-Stage Stochastic Linear Programs Using Interior Poi...
 
A-Hybrid-Approach-Using-Particle-Swarm-Optimization-and-Simulated-Annealing-f...
A-Hybrid-Approach-Using-Particle-Swarm-Optimization-and-Simulated-Annealing-f...A-Hybrid-Approach-Using-Particle-Swarm-Optimization-and-Simulated-Annealing-f...
A-Hybrid-Approach-Using-Particle-Swarm-Optimization-and-Simulated-Annealing-f...
 
Linear regression [Theory and Application (In physics point of view) using py...
Linear regression [Theory and Application (In physics point of view) using py...Linear regression [Theory and Application (In physics point of view) using py...
Linear regression [Theory and Application (In physics point of view) using py...
 
29 15021 variational final version khalid hammood(edit)
29 15021 variational final version khalid hammood(edit)29 15021 variational final version khalid hammood(edit)
29 15021 variational final version khalid hammood(edit)
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and Development
 
PhysRevE.89.042911
PhysRevE.89.042911PhysRevE.89.042911
PhysRevE.89.042911
 
A Robust Method Based On LOVO Functions For Solving Least Squares Problems
A Robust Method Based On LOVO Functions For Solving Least Squares ProblemsA Robust Method Based On LOVO Functions For Solving Least Squares Problems
A Robust Method Based On LOVO Functions For Solving Least Squares Problems
 
A New Neural Network For Solving Linear Programming Problems
A New Neural Network For Solving Linear Programming ProblemsA New Neural Network For Solving Linear Programming Problems
A New Neural Network For Solving Linear Programming Problems
 
Numerical solution of eigenvalues and applications 2
Numerical solution of eigenvalues and applications 2Numerical solution of eigenvalues and applications 2
Numerical solution of eigenvalues and applications 2
 
Efficient projections
Efficient projectionsEfficient projections
Efficient projections
 
Efficient projections
Efficient projectionsEfficient projections
Efficient projections
 
A New SR1 Formula for Solving Nonlinear Optimization.pptx
A New SR1 Formula for Solving Nonlinear Optimization.pptxA New SR1 Formula for Solving Nonlinear Optimization.pptx
A New SR1 Formula for Solving Nonlinear Optimization.pptx
 
Scilab Finite element solver for stationary and incompressible navier-stokes ...
Scilab Finite element solver for stationary and incompressible navier-stokes ...Scilab Finite element solver for stationary and incompressible navier-stokes ...
Scilab Finite element solver for stationary and incompressible navier-stokes ...
 
Characteristics and simulation analysis of nonlinear correlation coefficient ...
Characteristics and simulation analysis of nonlinear correlation coefficient ...Characteristics and simulation analysis of nonlinear correlation coefficient ...
Characteristics and simulation analysis of nonlinear correlation coefficient ...
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and Development
 
Introduction to finite element method
Introduction to finite element methodIntroduction to finite element method
Introduction to finite element method
 
Chapter 5
Chapter 5Chapter 5
Chapter 5
 
9.pdf
9.pdf9.pdf
9.pdf
 

Unger

  • 1. TEL-AVIV UNIVERSITY RAYMOND AND BEVERLY SACKLER FACULTY OF EXACT SCIENCES SCHOOL OF COMPUTER SCIENCE An Experimental Evaluation of Combinatorial Preconditioners Thesis submitted in partial fulllment of the requirements for the M.Sc. degree of Tel-Aviv University by Uri Unger The research work for this thesis has been carried out at Tel-Aviv University under the direction of Prof. Sivan Toledo July 2007
  • 2.
  • 3. Abstract This thesis presents experimental results of a comparison between several types of preconditioners. Primarily, it is a comparison between Vaidya's algo- rithm, which is by now a well understood algorithm for constructing combina- torial preconditioners, and a family of new algorithms presented in a series of articles by Dan Spielman. These new algorithms include a novel spanning tree constructor for building low-stretch spanning trees, and a new approach to tree augmentation. We have implemented these algorithms in Java language, us- ing a exible and robust graph-algorithmic framework we have created for this purpose. Experimentation was conducted using a test harness that allows extensive exploration of each preconditioner performance and behavior. This test harness makes use of several simple but useful tools like automatic algorithmic parameters generation and the usage of a synthetic performance metric to assess and compare between preconditioners. Our main result is that the new augmentation algorithm can outperform Vaidya's algorithm, but not in a consistent way. Actually, for most of our test matrices, its performance was similar to that of Vaidya's. On the other hand, using low stretch trees as bases for augmentation does have a consistent positive eect on preconditioners performance, but it is not dramatic. 3
  • 4. Contents Abstract 3 Chapter 1. Introduction 6 Chapter 2. Background 7 1. Direct Solvers 7 2. Cholesky Factorization and Fill 7 3. Iterative solvers 8 4. Convergence of Iterative Solvers 9 5. Preconditioners 10 6. Incomplete Cholesky 10 Chapter 3. An Overview of Combinatorial Sparsication Algorithms 12 1. Graphs and Symmetric Diagonally-Dominant Matrices 12 2. Combinatorial Bounds on the Condition Number κ 13 3. From Graph Algorithms to Linear Solvers 14 Chapter 4. Constructing Spanning Trees 16 1. Maximum Spanning Trees 16 2. Low Stretch Spanning Trees 17 Chapter 5. Augmenting Trees 21 1. Vaidya's Augmentation Method 21 2. Ultra Sparsify 22 Chapter 6. Partition-First Sparsication 25 Chapter 7. Additional Heuristics 26 1. Spielman's Heuristic Algorithm 26 2. Modied Vaidya 31 Chapter 8. Other Combinatorial Preconditioners 32 1. Preconditioning in a Larger Space 32 2. Combinatorial Preconditioners for Problems that are not Diagonally Dominant 33 3. Recursive Combinatorial Algorithms 33 Chapter 9. Experimental Results 34 1. Methodology 34 2. Test Matrices 35 3. Algorithm Legend 36 4. Test Infrastructure 36 5. Results 37 Chapter 10. Conclusions 54 4
  • 6. CHAPTER 1 Introduction The solution of large linear systems of the form Ax = b is arguably one of the most common problems in scientic computing applications. Such linear systems are usually solved using iterative methods. When using iterative methods, one can use a technique called preconditioning to speed up the convergence rate. This technique involves a carefully chosen matrix B called the preconditioner whose particularity is that the system B−1 Ax = B−1 b converges faster than the original system, while having the same solution. Chapters 2 lays the theoretical background needed for understanding this technique. In 1991, Pravin Vaidya was the rst to propose using combinatorial tech- niques for the construction of preconditioners [22]. Essentially, his methods interpret the matrices A and B as graphs. The matrix B is constructed to be a sub-graph of A. Vaidya then uses an embedding of A's edges as paths in B in order to bound the convergence rate. This algebraic tool is called Support Theory, and its essentials are presented in Chapter 3. Although Vaidya never published his work, its theoretical and practical de- tails were formed and published by others. Much of the theoretical details are present in the PhD thesis of Anil Joshi, a student of Vaidya, which was published in 1997 [16]. An implementation and experimental study of Vaidya's algorithm was done in 2001 in [7]. Numerous extensions to the basic theoretical framework were proposed in the following years. For example, a framework for constructing preconditioners whose graph representations contains more vertices than in the original problem was presented in [4] and used in [1] and [19]. Another extension addressed the problem of ill conditioned problems by splitting problem matrices into layers ([14]). An additional line of improvements was the creation of new graph algorithms operating within the same or similar framework to that of Vaidya ([21, 11, 17, 18, 20]). Theoretically, these new algorithms are capable of producing better graph-based preconditioners than that of Vaidya, but how would these algorithms perform practically? The behavior of Vaidya is fairly well understood ([7], [1]), and is not always spectacular (incomplete Cholesky is often better). Can the new graph algorithms change the picture completely? This thesis studies this question. We have built prototypes of the algorithms in [21, 11, 20] using a Java framework that allows us to easily implement al- gorithmic variants. We conducted extensive experiments of these algorithms in order to answer this question. All of the preconditioning algorithms that partic- ipated in our experimentation are described in Chapters 4, 5, 6 and 7. Our results are presented in Chapter 9. They indicate that the new algorithms we have tested do not behave consistently better than Vaidya's algorithm, at least not dramatically so, although they do exhibit dierent behaviors, which are sometimes (but not always) signicantly better than Vaidya's algorithm. 6
CHAPTER 2

Background

Consider a linear system of the form Ax = b, where A is a known n-by-n coefficient matrix, b is a known n-by-1 vector and x is the unknown n-by-1 vector. We are especially interested in linear systems that are sparse, symmetric and positive definite. We say that an m-by-n matrix is sparse if it contains O(max(m, n)) non-zeros. A square matrix A is positive definite if x^T Ax > 0 for all x ≠ 0. Algorithms for solving linear systems fall into one of two main categories: direct methods and iterative methods.

1. Direct Solvers

Direct methods use a finite number of steps that treat the values in A and b symbolically in order to produce the solution vector x. Direct solvers usually work by factoring A into a number of simpler matrices. For example, the Gaussian elimination algorithm can be understood as factoring A into lower and upper triangular matrices L and U such that A = LU.

If A is symmetric and positive definite, it is better to use the Cholesky factorization of A, A = LL^T, where L is a lower triangular matrix. This decomposition is guaranteed to exist for all symmetric positive-definite matrices. The system then becomes LL^T x = b, which can be solved by two forward/backward substitution solves: one that finds y such that Ly = b and another that finds x such that L^T x = y.

2. Cholesky Factorization and Fill

The time complexity of a direct solver based on Cholesky factorization is proportional to the number of non-zeros in L. The non-zero count of L is at least that of A, but it may be much higher. In extreme cases, L may contain O(n^2) non-zeros even if A has only O(n) non-zeros. We use the term fill to refer to the amount and locations of the additional non-zeros of L in comparison to A.

There is also a relation between the time complexity of finding the factor L and the amount of fill it introduces. Specifically, the work needed in order to calculate L is proportional to the sum of squares of the non-zero counts in the columns of L. This relationship between work and fill shows two things. First, sparser factors may require less work. Second, factors with balanced non-zero counts require less work to compute than factors with some relatively dense columns, even if the two factors have the same dimension and the same total number of non-zeros.

There is a simple way to characterize the fill produced during factoring. The characterization uses a graph representation of the matrix A called the pattern graph. The pattern graph P_A of an n-by-n symmetric matrix A is an undirected graph P_A = (V = {1, 2, ..., n}, E) whose edge set E contains all pairs (i, j) such
that A_{i,j} ≠ 0. We define an operation called eliminate(P_A, v) that modifies the pattern graph P_A according to an input vertex v. The modification is simply the addition of a clique on the neighbors of v in P_A whose indices are larger than v. Formally,

E(eliminate(P_A, v)) = E(P_A) ∪ {(i, j) | i > v and j > v and (i, v) ∈ E(P_A) and (j, v) ∈ E(P_A)} .

Successive elimination of all the vertices in P_A leaves us with the fill graph: the pattern graph that represents the non-zero pattern of L.

It is clear from this characterization of fill that the order in which we eliminate vertices may influence the amount of fill we experience. Therefore, it is standard practice to permute the rows and columns of A prior to factoring using a permutation matrix P, that is, to factor the permuted system P^T AP instead. We apply P on both sides in order to permute both the columns and the rows of A. Finding a permutation matrix whose resulting fill is minimal is NP-hard. However, heuristics like Minimum Degree and Nested Dissection do find low-fill orderings in practice. Minimum Degree is a greedy heuristic: in each step it eliminates the vertex with the least number of non-eliminated neighbors. Nested Dissection is a more robust algorithm which usually works better than Minimum Degree for larger graphs. It is based on finding small vertex separators in the graph whose removal breaks the graph into two components of roughly the same size. The elimination ordering is constructed such that all the vertices in the first component are eliminated first, then the vertices of the second component, and lastly the vertices of the separator itself. The vertices of each component (and the separator) are ordered recursively using further dissections.

The problem of fill and the extra work it introduces, even when good orderings are used, may render direct solvers on sparse matrices inefficient in both time and space in comparison to iterative methods.
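To make the eliminate operation concrete, here is a minimal sketch of symbolic elimination on a pattern graph. It is an illustration only, not the code used in this thesis; the class and method names are ours, and vertices are assumed to be eliminated in the natural order 0, 1, ..., n−1.

```java
import java.util.*;

// Symbolic elimination on a pattern graph: eliminating a vertex adds a clique
// on its higher-numbered neighbors. After all vertices are processed, the
// adjacency sets describe the fill graph, i.e. the non-zero pattern of L + L^T.
class SymbolicFactorization {
    static List<Set<Integer>> fillGraph(int n, int[][] edges) {
        List<Set<Integer>> adj = new ArrayList<>();
        for (int v = 0; v < n; v++) adj.add(new TreeSet<>());
        for (int[] e : edges) { adj.get(e[0]).add(e[1]); adj.get(e[1]).add(e[0]); }
        for (int v = 0; v < n; v++) {
            List<Integer> higher = new ArrayList<>();          // neighbors with index > v
            for (int u : adj.get(v)) if (u > v) higher.add(u);
            for (int a = 0; a < higher.size(); a++)            // eliminate(P_A, v): make them a clique
                for (int b = a + 1; b < higher.size(); b++) {
                    adj.get(higher.get(a)).add(higher.get(b));
                    adj.get(higher.get(b)).add(higher.get(a));
                }
        }
        return adj;   // now includes the fill edges
    }
}
```

Counting, for each column j, the entries i > j in the resulting adjacency sets gives the column counts of L, and hence (by the sum-of-squares rule above) an estimate of the factorization work.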
3. Iterative Solvers

An alternative to the direct methods are the iterative methods. These methods compute a sequence of vectors x^(t) that approximates the solution vector x, such that x^(t) → x as t gets large. The iteration is run until the relative residual error

‖Ax^(t) − b‖ / ‖b‖

is sufficiently low. Using iterative solvers is justified when A is very large and sparse, or when A exists only implicitly as a subroutine that, given a vector w, returns Aw. Usually, each iteration requires one multiplication of A by a vector and several vector-only operations, and needs to keep only a few vectors in memory. These memory requirements are much lower than the requirements of direct solvers, chiefly because of the fill. Also, as long as the iteration count is kept below n, the time complexity of iterative solvers is lower too.

One class of iterative solvers that is used in practice is the Krylov-subspace solvers. In iteration t, these solvers find a vector x^(t) that minimizes ‖Ax^(t) − b‖ within the Krylov subspace K_t = span{b, Ab, A^2 b, ..., A^(t−1) b}. Looking for approximate solutions specifically within Krylov subspaces can be justified; see [15]. The Conjugate Gradients (CG) method and the Minimal Residual method (MINRES) are two Krylov-subspace methods that are appropriate for symmetric positive-definite matrices. MINRES is more theoretically appealing, because it minimizes the residual ‖Ax^(t) − b‖ in the 2-norm. It works even if A is indefinite. Its main drawback is that it suffers from a certain numerical instability when implemented in floating point. Conjugate Gradients is more reliable numerically, but it minimizes the residual in a norm that is less useful than the 2-norm. Our experiments used Conjugate Gradients exclusively.

4. Convergence of Iterative Solvers

The relative residual norm can be bounded using spectral properties of the matrix A, and this bound can in turn be used to bound the iteration count. The essential step is to express the residual b − Ax^(t) using a polynomial. Since x^(t) ∈ K_t, there is some vector y such that x^(t) = y_1 b + y_2 Ab + ... + y_t A^(t−1) b. Therefore,

b − Ax^(t) = b − y_1 Ab − y_2 A^2 b − ... − y_t A^t b = p(A) b

for some polynomial p of degree t with p(0) = 1. It can be shown that the relative norm is bounded by

‖b − Ax^(t)‖_2 / ‖b‖_2 ≤ max_{i=1,...,n} |p(λ_i)| .

This result means that if there are low-degree polynomials that are small on the eigenvalues of A and have the value 1 at 0, then Krylov-subspace solvers will converge quickly for A. For example, if the eigenvalues of A are clustered, then a polynomial with only one root inside the cluster can assume small values at all the eigenvalues in the cluster, so the solver converges quickly: a relatively low-degree polynomial hits many eigenvalues.

In each iteration, Krylov-subspace solvers choose the polynomial p that minimizes the residual. To be a true minimizer, this polynomial must depend on A and b. In order to obtain a generic bound on convergence, we need to remove this dependency. To do so, let us first assume that the smallest and largest eigenvalues of A are λ_min and λ_max respectively. We then look for a sequence of polynomials p_t such that p_t(0) = 1 and max_{x ∈ [λ_min, λ_max]} |p_t(x)| is as small as it can be for any polynomial of degree t. It turns out that such a sequence exists, and it is derived from the Chebyshev polynomials. The Chebyshev sequence of polynomials is defined by the recurrence

c_0(x) = 1
c_1(x) = x
c_t(x) = 2x c_{t−1}(x) − c_{t−2}(x) ,

and the required polynomial p_t(x) is defined as

p_t(x) = [ c_t( (λ_max + λ_min) / (λ_max − λ_min) ) ]^{−1} · c_t( (λ_max + λ_min − 2x) / (λ_max − λ_min) ) .

The value of p_t(x), which bounds the relative residual error, is itself bounded by

|p_t(x)| ≤ 2 [ ( (√κ + 1)/(√κ − 1) )^t + ( (√κ + 1)/(√κ − 1) )^{−t} ]^{−1} ≤ 2 ( (√κ − 1)/(√κ + 1) )^t ,

where the value κ, called the spectral condition number of A, is defined as κ = λ_max/λ_min. For a fixed κ, we can expect convergence to a fixed tolerance (for
example, 10^{−12}) to happen within a constant number of iterations. As κ grows,

(√κ − 1)/(√κ + 1) → 1 − 2/√κ ,

so we are guaranteed convergence to a fixed tolerance within O(√κ) iterations. This bound may not be very tight, as the true convergence rate depends on the clustering of the eigenvalues.

5. Preconditioners

The term preconditioning refers to the transformation of the linear system into another system with more favorable properties for iterative solution. Generally, preconditioning attempts to improve the spectral properties of the coefficient matrix, which affect the rate of convergence as seen in the previous section. Specifically, we are looking for a matrix B such that the system B^{−1}Ax = B^{−1}b is easier to solve than the original system Ax = b. These two systems have the same solution. In this example, we have applied the matrix B^{−1} on the left, but it can also be applied on the right instead.

When we use Krylov-subspace methods, it is not necessary to actually form the preconditioned matrix B^{−1}A (or AB^{−1}) explicitly. Instead, the iterative algorithm uses A as is for matrix-vector products, and uses B^{−1} implicitly: a linear system of the form Bz = r must be solved in each iteration.

Left (or right) preconditioning is not appropriate for algorithms like MINRES or Conjugate Gradients, since they rely on A's symmetry, whereas B^{−1}A is generally not symmetric. Luckily, there is another form of preconditioning called split preconditioning. Using this approach, we factor the preconditioner B = B_1 B_2 and solve the preconditioned system (B_1^{−1} A B_2^{−1})(B_2 x) = B_1^{−1} b to find y = B_2 x, and then solve for x = B_2^{−1} y. If B is symmetric positive definite, we can factor B using the Cholesky factorization to obtain B = LL^T = B_1 B_2, and in this way retain symmetry, as L^{−1} A L^{−T} is symmetric.

When using a preconditioner, we need a more general definition of κ, which is

κ(A, B) = κ(B^{−1}A) = λ_max(A, B) / λ_min(A, B) ,

where the λ(A, B) are the generalized eigenvalues of A and B. Chapter 3 shows how this value can be bounded combinatorially.

Finding a good preconditioner is a balance between two often contradicting demands. On the one hand, we want B^{−1} to be easy to apply, or equivalently, we want to be able to solve Bz = r cheaply. This means that B should be sparser (or easier to factor somehow) than A; otherwise, solving for B would be as complex as solving for A, which is our original problem. On the other hand, we need B^{−1}A to have a low condition number κ. The first demand tries to minimize the time we spend in each iteration, while the second demand tries to minimize the number of iterations. We will show later that these demands roughly state that B needs to approximate A, obviously with fewer non-zeros.

6. Incomplete Cholesky

There are a few heuristics that reduce fill in Cholesky factors at the expense of providing only an approximation of the true factor. These heuristics are referred to as incomplete Cholesky factorization. The general idea these algorithms employ is to drop some values here and there during the normal calculation
of the factor L, i.e., to forcefully set some values to zero when certain predefined rules apply. Dropping these values naturally lowers the fill locally, but it may also deeply influence the calculation of the rest of the elements of L, as element values of L propagate throughout the calculation.

The first type of incomplete Cholesky factorization we describe is the drop-tolerance incomplete Cholesky factorization. This algorithm drops an element if its absolute value is lower than some predefined threshold e. Another form of incomplete Cholesky factorization is the so-called zero-level fill-in. This variant drops an element if it fills a cell L_{i,j} for which A_{i,j} = 0. That is, this algorithm does not allow any additional fill to occur. The generalization of this approach gives us the k-level fill-in algorithm; this variant tracks the propagation of filled elements during the calculation of L, allowing additional fill only for elements whose rank is lower than k. The rank of an element l_{i,j} ≠ 0 is measured by the number of steps that were required for fill to propagate from the initial fill of A to the element l_{i,j}.

Incomplete Cholesky factors work very well as preconditioners. The experiments shown in this thesis also included using the drop-tolerance variant as a preconditioner.
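The drop-tolerance rule is easy to state in code. The following is a minimal dense sketch, not the sparse implementation used in our experiments; the class name and the dropTol parameter name are ours.

```java
// Drop-tolerance incomplete Cholesky, dense version for clarity. Entries of L
// whose absolute value falls below dropTol are forced to zero; the diagonal is
// never dropped. With dropTol = 0 this is the exact Cholesky factorization.
// (Breakdown, i.e. a non-positive pivot caused by dropping, is not handled here.)
class IncompleteCholesky {
    static double[][] factor(double[][] a, double dropTol) {
        int n = a.length;
        double[][] l = new double[n][n];
        for (int j = 0; j < n; j++) {
            double d = a[j][j];
            for (int k = 0; k < j; k++) d -= l[j][k] * l[j][k];
            l[j][j] = Math.sqrt(d);
            for (int i = j + 1; i < n; i++) {
                double s = a[i][j];
                for (int k = 0; k < j; k++) s -= l[i][k] * l[j][k];
                double v = s / l[j][j];
                l[i][j] = (Math.abs(v) < dropTol) ? 0.0 : v;   // the drop rule
            }
        }
        return l;
    }
}
```

Note that a dropped entry changes every later column that would have referenced it, which is exactly the propagation effect mentioned above; an incomplete factor can therefore differ from the exact factor far beyond the dropped positions.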
CHAPTER 3

An Overview of Combinatorial Sparsification Algorithms

This thesis evaluates preconditioners that are constructed by sparsifying a graph G_A that represents the coefficient matrix A. By sparsifying we mean that edges are dropped from G_A to form a sub-graph G_B, and then G_B is viewed as a matrix B that serves as a preconditioner.

1. Graphs and Symmetric Diagonally-Dominant Matrices

An isomorphism between graphs and symmetric diagonally-dominant matrices enables us to use graph algorithms to construct preconditioners.

Definition 1.1. Let A ∈ R^{n×n}. Row i of A is diagonally dominant if

A_{i,i} ≥ Σ_{j≠i} |A_{i,j}| .

If there is an equality, the row is called weakly dominant, and if there is a strict inequality, the row is called strictly dominant. If all the rows of A are dominant then A is called diagonally dominant, and if at least one row is strictly dominant, A is strictly dominant.

From here on we assume that A is symmetric and diagonally dominant. Suppose that A_{i,j} < 0 for some i ≠ j. Let ⟨i, −j⟩ be a length-n column vector with a 1 in position i and a −1 in position j (and zeros everywhere else). The matrix |A_{i,j}| ⟨i, −j⟩⟨i, −j⟩^T is a rank-1 matrix with A_{i,j} in positions (i, j) and (j, i) and with |A_{i,j}| in positions (i, i) and (j, j). The matrix A − |A_{i,j}| ⟨i, −j⟩⟨i, −j⟩^T is also diagonally dominant, but with a zero in positions (i, j) and (j, i). We can therefore continue to subtract from the difference additional rank-1 matrices of this type. If A_{i,j} > 0, we can perform a similar trick but with a vector ⟨i, j⟩ that has ones in positions i and j. Eventually, the only remaining non-zeros in the matrix will be on its diagonal. We can then subtract from it matrices of the form c ⟨i⟩⟨i⟩^T, where ⟨i⟩ is the i-th unit vector and c ≥ 0. Therefore, we can express A as

A = Σ_{i<j, A_{i,j}<0} |A_{i,j}| ⟨i, −j⟩⟨i, −j⟩^T + Σ_{i<j, A_{i,j}>0} |A_{i,j}| ⟨i, j⟩⟨i, j⟩^T + Σ_i ( A_{i,i} − Σ_{j≠i} |A_{i,j}| ) ⟨i⟩⟨i⟩^T .

This establishes the isomorphism: each term in the first two summations above represents an edge of a graph. We can view a symmetric diagonally-dominant (SDD) matrix A as a graph G_A whose vertex set is {1, 2, ..., n} and with an edge between i and j for each A_{i,j} ≠ 0. The edges are weighted and signed. The sign of the edge is positive if A_{i,j} < 0 and negative if A_{i,j} > 0 (there is a reason for this strange assignment of signs [3]). The weight of the edge can be set to either |A_{i,j}| or to √|A_{i,j}|; we can use either one, but choosing
it properly will make the results below more elegant. The vertices also have weights: the weight of vertex i is A_{i,i} − Σ_{j≠i} |A_{i,j}|.

Given an SDD matrix, we can clearly construct G_A using the rules in the previous paragraphs. Conversely, given a graph G_A with non-negative edge and vertex weights and an assignment of signs to the edges, we can build the corresponding matrix A, which will be SDD.

Definition 1.2. The Laplacian matrix A = (a_{i,j}) of an undirected weighted graph G_A = (V, E) with weight function w : E → R is a |V| × |V| symmetric matrix whose rows and columns correspond to the vertices of G_A. The matrix element values are defined as follows:

a_{i,j} = Σ_{(i,k) ∈ E} w_{i,k}   if i = j,
a_{i,j} = −w_{i,j}                if i ≠ j and (i, j) ∈ E,
a_{i,j} = 0                       otherwise.

In this work, we only treat matrices with non-positive off-diagonal elements (that is, matrices whose graphs have no negative edges). SDD matrices with positive off-diagonals are rare in applications. Moreover, a simple transformation reduces a linear system with a general SDD coefficient matrix to a problem with an SDD matrix with non-positive off-diagonals [12].

2. Combinatorial Bounds on the Condition Number κ

A fundamental question in the construction of such preconditioners is how to bound κ(A, B). It turns out that we can bound κ(A, B) using an embedding π of the edges of G_A in paths in G_B. Combinatorial properties of the embedding provide bounds on κ(A, B).

Definition 2.1. Let G_A and G_B be weighted (but unsigned) graphs with vertex sets {1, 2, ..., n}. A path embedding π maps all the edges of G_A to simple paths in G_B. For every edge (i_1, i_ℓ) in G_A, π(i_1, i_ℓ) = {(i_1, i_2), (i_2, i_3), ..., (i_{ℓ−1}, i_ℓ)} is a simple path i_1 ↔ i_2 ↔ i_3 ↔ ... ↔ i_{ℓ−1} ↔ i_ℓ in G_B.

The following definition provides the relevant metrics of the quality of an embedding. These metrics are defined on a per-edge basis; we later use them to define global metrics that bound κ(A, B). In the formulas below, W_A and W_B denote the edge weights of G_A and G_B.

Definition 2.2. The weighted dilation of an edge of G_A in a path embedding π of G_A into G_B is

dilation_π(i_1, i_2) = Σ_{(j_1,j_2) ∈ π(i_1,i_2)} W_A(i_1, i_2) / W_B(j_1, j_2) .

The weighted congestion of an edge of G_B is

congestion_π(j_1, j_2) = Σ_{(i_1,i_2) : (j_1,j_2) ∈ π(i_1,i_2)} W_A(i_1, i_2) / W_B(j_1, j_2) .
The weighted stretch of an edge of G_A is

stretch_π(i_1, i_2) = Σ_{(j_1,j_2) ∈ π(i_1,i_2)} ( W_A(i_1, i_2) / W_B(j_1, j_2) )^2 .

The weighted crowding of an edge in G_B is

crowding_π(j_1, j_2) = Σ_{(i_1,i_2) : (j_1,j_2) ∈ π(i_1,i_2)} ( W_A(i_1, i_2) / W_B(j_1, j_2) )^2 .

Note that stretch is a summation of the squares of the quantities that constitute dilation, and similarly for crowding and congestion. Unfortunately, papers in the combinatorial-preconditioning literature are not consistent about these terms. When we use dilation and congestion, it makes sense to define the edge weights in G_A and G_B to be |A_{i,j}|, since the dilation and congestion become sums of edge ratios. When we work with stretch and crowding, it makes more sense to define the edge weights in terms of the square roots of the |A_{i,j}|'s.

These definitions allow us to state results that relate the condition number of the preconditioned system to combinatorial metrics of an embedding of G_A in G_B. We do not show the proofs here. For the proofs, see [2, 5, 6, 20].

Lemma 2.3. Let A and B be weighted Laplacians with the same row sums and such that G_B is a sub-graph of G_A, and let π be a path embedding of G_A into G_B. Then

κ(A, B) ≤ Σ_{(i_1,i_2) ∈ G_A} stretch_π(i_1, i_2)

κ(A, B) ≤ Σ_{(j_1,j_2) ∈ G_B} crowding_π(j_1, j_2)

κ(A, B) ≤ max_{(i_1,i_2) ∈ G_A} dilation_π(i_1, i_2) · max_{(j_1,j_2) ∈ G_B} congestion_π(j_1, j_2) .

These bounds can be tightened to

κ(A, B) ≤ max_{(j_1,j_2) ∈ G_B} Σ_{(i_1,i_2) ∈ G_A : (j_1,j_2) ∈ π(i_1,i_2)} stretch_π(i_1, i_2)

κ(A, B) ≤ max_{(i_1,i_2) ∈ G_A} Σ_{(j_1,j_2) ∈ G_B : (j_1,j_2) ∈ π(i_1,i_2)} crowding_π(j_1, j_2) .

We omit two additional bounds that are similar to the ones just given.

3. From Graph Algorithms to Linear Solvers

These results can be used to develop efficient linear solvers. First, we run a graph algorithm to construct G_B given G_A. In the algorithms that we explore in this work, G_B is a sub-graph of G_A. We explain below the objectives of the sparsification phase. Next, we construct B from G_B. The preconditioner B is then factored into its Cholesky and permutation factors, B = PLL^T P^T.
The permutation is chosen so as to reduce the fill in the Cholesky factor of P^T BP. This factorization allows us to easily apply B^{−1}; the cost of applying B^{−1} is proportional to the number of non-zeros in L. Now that we have an easy way to apply B^{−1}, we invoke the Conjugate Gradients algorithm or the MINRES algorithm with B as a preconditioner. In fact, we perform the iterations somewhat more efficiently by solving PAP^T(Px) = Pb for Px and then recovering x = P^T(Px). This eliminates the need to apply the permutations in every iteration.

The efficiency of this method depends on the efficiency of each phase. We want the algorithm that constructs G_B to be efficient. We want B to be easy to factor. We want the number of Conjugate Gradients or MINRES iterations to be small, and we want each iteration to be cheap. The cost of factoring B depends on its sparsity pattern. In general, small balanced vertex separators in G_B lead to low factorization costs. The cost of every iteration also depends on the density of the Cholesky factor of B. The number of iterations in the iterative algorithm is bounded by O(√κ(A, B)). Therefore, the sparsification algorithm tries to achieve a small κ(A, B) and to ensure that B has a sparse Cholesky factor (under a suitable symmetric permutation). We can express both the bounds on κ(A, B) and the bounds on fill in terms of the graph structures of G_A and G_B, which allows us to use graph algorithms to sparsify A.
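For spanning-tree preconditioners, which the next chapter constructs, the stretch bound of Lemma 2.3 is particularly easy to evaluate: with edge weights taken as the square roots of the |A_{i,j}|'s, the stretch of an edge reduces to its |A| weight times the sum of the inverse tree-edge weights along its path. The following sketch is our own illustration (names and the rooted-tree representation are ours, not the thesis code).

```java
// Stretch of an edge (u, v) of G_A with weight w (here w = |A_{u,v}|) with
// respect to a spanning tree: w times the sum of the inverse weights of the
// tree edges on the u-v path. The tree is given by parent pointers, the weight
// of the edge to the parent, and vertex depths.
class StretchBound {
    static double edgeStretch(int u, int v, double w,
                              int[] parent, double[] weightToParent, int[] depth) {
        double sumOfInverses = 0.0;
        while (u != v) {                                  // climb until the endpoints meet
            if (depth[u] < depth[v]) { int t = u; u = v; v = t; }
            sumOfInverses += 1.0 / weightToParent[u];     // tree edge (u, parent[u]) is on the path
            u = parent[u];
        }
        return w * sumOfInverses;
    }
}
```

Summing edgeStretch over all edges of G_A (tree edges contribute 1 each) gives the first bound of Lemma 2.3.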
CHAPTER 4

Constructing Spanning Trees

1. Maximum Spanning Trees

One of the simplest ways to construct G_B is to select a spanning tree of G_A. This ensures that an embedding π exists. A tree is also very cheap to factor: the factorization can start from the leaves and work upwards, so there is no fill at all. Tree preconditioners do not balance well the cost of factoring the preconditioner and the cost of the iterations: κ(A, B) is usually too high, leading to a large number of iterations. Therefore, sophisticated preconditioning methods usually augment trees with additional edges. We explain the construction of trees in this chapter and their augmentation in the next.

The first trees that were proposed for preconditioning were maximum spanning trees [22]. These trees maximize the sum of the edge weights in the tree. The identity of the maximum spanning tree depends only on the ordering of the edge weights, not on the exact weights, so it does not matter whether we weigh the edges by |A_{i,j}| or by √|A_{i,j}|; we get the same tree in either case. Maximum spanning trees can be constructed by any algorithm designed to compute minimum spanning trees, such as Prim's algorithm, Kruskal's algorithm, and more (see [8] for descriptions of these algorithms).

Maximum spanning trees have a property that makes it easy to analyze κ(A, B). Let G_B be a maximum spanning tree of G_A and suppose that the edge (i, j) is in G_A but not in G_B. Clearly, (i, j) is not heavier than any of the edges along the single path between i and j in G_B. If it were heavier than any of them, we could increase the total weight of G_B by including (i, j) in it and dropping the lighter-than-(i, j) edge in that path. Therefore, all the ratios in the definitions of congestion, dilation, stretch and crowding are bounded by 1 when G_B is a maximum spanning tree of G_A. This implies that the dilation and stretch are bounded by n − 1 (the maximal path length), and the congestion and crowding by m − (n − 2) (the maximum number of paths that use an edge of G_B). Therefore, by the congestion-dilation product bound,

κ(A, B) ≤ (n − 1)(m − n + 2) = O(mn) .

The sum-of-stretch and sum-of-crowding bounds give similar expressions. This bound may seem pessimistic, but experiments show that the maximum-spanning-tree preconditioner really is quite bad. Ignoring the cost of constructing the preconditioner (which is negligible in this case), the total cost of the solver is bounded by O(m^{1.5} n^{0.5}), since the dominant cost in each iteration is the application of the Cholesky factor of the preconditioner.

The sum-of-stretch bound can lead to better spanning-tree preconditioners (this was originally noted by Erik Boman in an unpublished manuscript).
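As a reference point for the tree constructions discussed in this chapter, here is a minimal sketch of a maximum spanning tree built with Kruskal's algorithm and a union-find structure. It is an illustration in our own notation, not the thesis code.

```java
import java.util.*;

// Kruskal's algorithm run on edges sorted from heaviest to lightest yields a
// maximum spanning tree. Each edge is a {u, v, weight} triple.
class MaximumSpanningTree {
    static List<double[]> build(int n, List<double[]> edges) {
        edges.sort((a, b) -> Double.compare(b[2], a[2]));    // heaviest edges first
        int[] rep = new int[n];                              // union-find representatives
        for (int v = 0; v < n; v++) rep[v] = v;
        List<double[]> tree = new ArrayList<>();
        for (double[] e : edges) {
            int ru = find(rep, (int) e[0]), rv = find(rep, (int) e[1]);
            if (ru != rv) { rep[ru] = rv; tree.add(e); }     // keep the edge, merge components
        }
        return tree;                                         // n - 1 edges if G_A is connected
    }
    static int find(int[] rep, int v) {
        while (rep[v] != v) { rep[v] = rep[rep[v]]; v = rep[v]; }
        return v;
    }
}
```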
2. Low Stretch Spanning Trees

It is possible to construct a spanning tree of G_A for which the average edge stretch is bounded by O(log^3 n). In the graph-algorithms literature, bounds on low-stretch trees are usually specified in terms of the average stretch per edge, and not in terms of the sum of stretches. It is clear, however, that the average and the sum always differ by exactly a factor of m, and thus the same construction minimizes both. Therefore, the sum of stretch (which bounds κ(A, B)) is bounded by O(m log^3 n). Ignoring the cost of building the tree, the solver performs O(m^{1.5} log^{1.5} n) operations.

The following notation is used throughout the description of the algorithm:
• The weight w(i, j) of an edge (i, j) in G_A is simply |A_{i,j}|.
• The length ℓ(i, j) of an edge is the inverse of its weight, ℓ(i, j) = w^{−1}(i, j).
• The distance d(i, j) between two vertices is the length of the shortest path between i and j, where the length of a path is the sum of the lengths of the edges along it.

We also define the radius r_G(v) of a graph G with a designated vertex v, called the root of G, as the maximal distance between v and any other vertex in G. The definition extends naturally to induced sub-graphs with a root.

The algorithm that we have implemented to construct low-stretch trees is due to Emek et al. [11]. At its heart lies a graph decomposition algorithm that produces a so-called star decomposition. Given a graph G = (V, E) and an arbitrary root vertex x_0 ∈ V, this algorithm produces a partition of G's vertices into disjoint subsets {V_0, ..., V_k} such that x_0 ∈ V_0. The partitioning also produces a set of bridge edges E_B = {(x_i, y_i) ∈ E(G)} such that x_i ∈ V_0 and y_i ∈ V_i for all 1 ≤ i ≤ k. The algorithm guarantees that all the sub-graphs induced by the V_i's are connected and that the union of these sub-graphs together with the bridge edges has a radius of at most (1 + ε) times r_G(x_0). That is, the algorithm effectively removes all the edges that link one V_i to another V_j, except for the bridge edges E_B, without increasing the radius of G by much.

The construction of the sets {V_0, ..., V_k} uses a set-growing technique. In order to grow a set, a root vertex must be chosen, from which the growing starts. Then, each vertex in G is labeled with its distance to that root vertex, and the vertices are sorted according to their distance. The distance function is not necessarily d; it can be induced by any set of non-negative edge weights. The actual set-growing phase iteratively adds vertices into the set one by one, in increasing distance order. While the set is growing, we keep track of two metrics. The first is the cut weight: the sum of the weights of all edges with exactly one endpoint in the grown set. The second is the volume count: the number of edges with at least one endpoint inside the grown set. The growing stops when a certain ratio is achieved between these two metrics. The ratio is chosen to ensure halting. After each set is fully grown, its vertices and their incident edges are removed from G before we grow the next set.

Each set V_i is grown independently, starting with V_0. Growing V_0 starts from the arbitrary x_0 and uses the distance d induced by the edge lengths ℓ. Growing the rest of the sets is a little trickier. First, we need to choose their root vertices. For this purpose we consider the set S of all vertices which are right outside V_0 along some path of the shortest-paths tree rooted at x_0.
That is, S contains all the vertices u ∈ V − V_0 with a neighbor w ∈ V_0 such that d(x_0, u) = d(x_0, w) + ℓ(w, u). Each of these vertices will serve as the growth root for exactly one of the sets
V_1, ..., V_k. The set S also determines the bridge edges: for each such u and w, the edge (u, w) is in E_B.

Figure 1. A low-stretch spanning tree and a maximum-weight spanning tree (below, Figure 2). The full graph is shown in Figure 3 in Chapter 7. The edges inside the ring are the heaviest, and are therefore contained in the maximum-weight spanning tree. The consequence of this is that the inner area within the smaller circle is connected to the area within the ring by only one edge. Thus there is a high stretch for all the (removed) edges that connect a vertex within the ring to a vertex within the internal circle. This is not the case in the low-stretch spanning tree.

We now define the distance function d′ that we use for growing V_1, ..., V_k. This distance function is induced by an edge-length function ℓ′(u, v) defined by the following rules:
• If u is on a shortest path from S to v, then ℓ′(u, v) = 0. Note that there may be several such shortest paths; for this rule to apply, it suffices that at least one of them obeys the rule's condition.
• Otherwise, ℓ′(u, v) = ℓ(u, v).

The length function ℓ′ induces a distance function d′ on the remaining vertices of G.
Figure 2. A maximum-weight spanning tree. Contrast with Figure 1 above.

We process the vertices of S iteratively, growing a new set V_i from each vertex v_i ∈ S. Thus, the size of S determines k. In particular, all the vertices u for which the shortest path to S leads to v_1 will be in the set V_1. A set V_i may include vertices w whose distance to v_j ∈ S is shorter than their distance to v_i, if V_i is grown before V_j. This happens only if including w in V_i is beneficial for the decomposition, e.g., when the edge leading to w is heavy and is thus better kept off the cut.

The bridge edges are always included in the low-stretch tree, but they may not form a tree. If they do not, we recurse: we decompose each of the V_i's again, using x_0 as the root of V_0 and the vertices of S as the roots of the other V_i's. The recursion stops at sub-graphs with only one or two vertices. The resulting tree is the union of all the bridge edges produced by all the levels of recursion, together with the edges of the non-decomposed sub-graphs (each of which must contain no more than one edge, since it has no more than two vertices).
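The set-growing step is the part of the construction that is easiest to get wrong, so here is a minimal sketch of it in our own notation. The data layout and the exact stopping threshold are ours; the real implementation chooses the threshold so that halting is guaranteed.

```java
import java.util.*;

// Grows a set by adding vertices in order of increasing distance from the
// chosen root, while maintaining the cut weight (total weight of edges with
// exactly one endpoint inside) and the volume (number of edges with at least
// one endpoint inside). Growing stops once cutWeight / volume drops below a
// caller-supplied threshold.
class SetGrowing {
    static class Edge { int to; double w; Edge(int to, double w) { this.to = to; this.w = w; } }

    static Set<Integer> grow(int[] verticesByDistance, List<List<Edge>> adj, double threshold) {
        Set<Integer> grown = new HashSet<>();
        double cutWeight = 0.0;
        int volume = 0;
        for (int v : verticesByDistance) {
            grown.add(v);
            for (Edge e : adj.get(v)) {
                if (grown.contains(e.to)) cutWeight -= e.w;   // edge just became internal
                else { cutWeight += e.w; volume++; }          // edge now crosses the cut
            }
            if (volume > 0 && cutWeight / volume <= threshold) break;
        }
        return grown;
    }
}
```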
Figure 3. On the left: the set V_0 (red) and the roots used for growing the sets V_1, ..., V_k (green). The arrows show the structure of the shortest-path DAG starting from these roots. If u is a root and v is some vertex reachable from u using the depicted arrows, then d′(u, v) = 0. On the right: all the sets V_0, ..., V_k, each depicted in a different color. Note how the order in which we consider the roots may change the final result. For example, we can tell that the yellow set's growth preceded the white one, since some of the yellow vertices are reachable from the white root. These vertices would have been contained in the white set had the white set been grown first.
CHAPTER 5

Augmenting Trees

Trees are usually not effective preconditioners. One way to construct more effective preconditioners is to augment spanning trees of G_A with extra edges taken from G_A. We denote the spanning tree of G_A by T_A.

1. Vaidya's Augmentation Method

Vaidya's algorithm augments a spanning tree by cutting the tree into connected components and then adding the heaviest edge between every two components. We do not add edges between components that do not have any edges between them in the original graph, or if the heaviest edge between them is already part of the tree. The cleverness of the algorithm lies in the partitioning algorithm.

Ideally, we would have liked to partition the tree into a given number k of connected sub-trees with ⌊n/k⌋ to ⌈n/k⌉ vertices each. This is not always possible. Consider, for example, a star: any connected subset of its vertices with more than one vertex must contain its center, and therefore there can be only one such subset. On the other hand, a path can obviously be partitioned into exactly k connected components with ⌊n/k⌋ to ⌈n/k⌉ vertices each.

We use the following partitioning algorithm, which is a simplified and non-recursive version of the algorithm in [7]. The algorithm starts from a rooted tree and cuts out connected sub-trees whose size falls within a certain range. The algorithm visits all the vertices of the rooted tree in post-order (a parent after its children). We use a breadth-first search to generate the post-order, but any post-ordering will do. The algorithm processes a vertex as follows. It uses the child pointers to find its children in the remaining tree (that is, not including children in the original tree that were already cut away). Each child structure contains a field that stores the number of vertices in the sub-tree rooted at that child (in the remaining tree; we do not count vertices that were already cut away). The algorithm sums these numbers and adds 1 (for the vertex itself). If the sum is at least n/k, the algorithm cuts the vertex from its parent.

It is easy to see that a sub-tree rooted at a non-root vertex v can have at most 1 + degree(v) · n/k vertices and at least n/k. The sub-tree rooted at the root of the original tree also has fewer than 1 + degree(v) · n/k vertices, but its size is not bounded from below; it might even contain only one vertex. If the vertex degrees are bounded by d_max, then all the sub-trees except perhaps one have between n/k and 1 + d_max · n/k vertices.

The classic version of Vaidya's algorithm uses a maximum-weight spanning tree (Section 1 in the previous chapter) and the augmentation algorithm described in this section. Our experiments also included applying the augmentation step to low-stretch trees.
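The splitting pass can be summarized in a few lines. The following sketch is our own illustration (names and data layout are ours), using recursion instead of an explicit post-order list:

```java
import java.util.*;

// Splits a rooted tree into connected sub-trees: a vertex whose remaining
// subtree has gathered at least n/k vertices is cut away from its parent and
// recorded as the root of a new component. The leftover piece containing the
// root may be smaller than n/k.
class TreeSplitting {
    static List<Integer> componentRoots(int root, List<List<Integer>> children, int n, int k) {
        List<Integer> roots = new ArrayList<>();
        int leftover = visit(root, children, roots, (n + k - 1) / k);
        if (leftover > 0) roots.add(root);        // whatever remains attached to the root
        return roots;
    }
    // Returns the size of the part of v's subtree that stays attached to v's parent.
    static int visit(int v, List<List<Integer>> children, List<Integer> roots, int limit) {
        int size = 1;                             // v itself
        for (int c : children.get(v)) size += visit(c, children, roots, limit);
        if (size >= limit) { roots.add(v); return 0; }   // cut v away from its parent
        return size;
    }
}
```

After the split, the heaviest original edge between every pair of components that G_A connects is added to the tree, as described above.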
2. Ultra Sparsify

Ultra Sparsify is a more sophisticated augmentation algorithm, based on theoretical algorithms from [21, 20]. The versions of Ultra Sparsify that are presented in these papers are very complex and use various (large) constants that are designed to make the algorithm asymptotically efficient. We have designed and implemented a simpler algorithm that is based on the same ingredients.

The algorithm uses three components to augment a spanning tree T_A of a weighted graph G_A. We describe these components below. The set of all edges considered for augmentation is denoted by E_aug = E(G_A) − E(T_A). The behavior of the algorithm is controlled by two external parameters, c ≥ 1 and k ≥ 1. For simplicity, we denote by T and G the graphs T_A and G_A respectively.

Partitioning the Edges of E_aug. The first component of the algorithm partitions the edge set E_aug into disjoint subsets E_1, E_2, ..., E_p. Each edge e ∈ E_aug is labeled with two values: its weight w(e) and the weighted dilation of its path embedding in T_A. We explain below how the weighted dilations are computed efficiently. Once the labels have been computed, we partition the edges into subsets such that all the edges in a given subset differ in their weight and dilation by at most a factor of two, i.e., we place into the same subset all the edges with the same ⌊log_2 w(e)⌋ and ⌊log_2 dilation(e)⌋ values.

To compute the weighted dilations, we first need to find a centroid vertex in T. A centroid vertex is a vertex whose removal from the tree creates connected components, each with at most n/2 vertices. Every tree has at least one centroid, and at most two (Jordan, 1869). Finding a centroid vertex in T can be done in linear time as follows. First, we designate an arbitrary vertex in T as its root, forming a rooted tree. Then, for each vertex v ∈ T, we let R(v) be the vertex count of the sub-tree rooted at v, including v itself. The pre-calculated array R(v) allows us to quickly answer queries of the form: what is the maximum size of any of the sub-trees remaining after the removal of some query vertex v? We denote this value by Q(v). Calculating R(v) can be done in a straightforward way using DFS. The core of the algorithm now follows. We set v_1 to be an arbitrary vertex in T. If the removal of v_1 satisfies our desired property, then we are finished. Otherwise, there is some sub-tree T_B whose size is bigger than n/2. We let v_2 be any vertex adjacent to v_1 which is closer to T_B, and repeat. This procedure must halt, since the sequence Q(v_1), Q(v_2), ... is monotonically decreasing, and must therefore reach the target size.

Now that the centroid has been found, we proceed by labeling all the vertices of T with their distance to the centroid. As before, we define the length of an edge to be the inverse of its weight. Note that by this definition, the weighted dilation of an edge e = (u, v) ∈ E_aug is simply the distance between u and v in T times the weight of e. We now consider all the edges (u, v) ∈ E_aug for which the path from u to v in T passes through the centroid. The distance between u and v can be easily calculated by summing the distances between each of them and the centroid. After these distances have been calculated, we remove the centroid from T together with its adjacent edges and reiterate on the remaining sub-trees. Each iteration handles more edges from E_aug. We keep iterating until we have handled all the edges in E_aug. The fact that we used the centroid ensures that the iteration count is logarithmic in n.
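The centroid search is simple enough to show in full. This is a sketch in our own notation; R is the subtree-size array described above.

```java
import java.util.*;

// Walks from an arbitrary start vertex towards the largest remaining component
// until removing the current vertex leaves no component with more than n/2
// vertices. parent/children describe the rooted tree; R[v] is the size of the
// subtree rooted at v (computed beforehand by a DFS).
class Centroid {
    static int find(int start, int n, int[] parent, List<List<Integer>> children, int[] R) {
        int v = start;
        while (true) {
            int largest = n - R[v];                 // the component containing v's parent
            int heavyChild = -1;
            for (int c : children.get(v))
                if (R[c] > largest) { largest = R[c]; heavyChild = c; }
            if (largest <= n / 2) return v;         // Q(v) <= n/2: v is a centroid
            v = (heavyChild != -1) ? heavyChild : parent[v];  // step towards the big component
        }
    }
}
```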
Tree Partitioning. Unlike Vaidya's algorithm, which partitions the tree only once, Ultra Sparsify uses each set E_i to induce a separate partitioning of T. For a given E_i, define the i-weight of a vertex v to be the sum of the weights of the edges in E_i that are incident to v. Let

φ = k · Σ_{e ∈ E_i} w(e) / |E_i| ,

where k is a parameter given to the Ultra Sparsify algorithm. The partitioning algorithm receives the parameter φ and partitions the tree into at most (4/φ) Σ_{e ∈ E_i} w(e) connected sub-trees, such that the total vertex i-weight of each non-singleton component is bounded by φ. Unlike the partitioning used in Vaidya's algorithm, the resulting sub-trees may be non-disjoint. Specifically, two sub-trees may share at most one common vertex.

For any vertex v, we denote by T_v the set of all vertices contained in the sub-tree rooted at v, and by w′(v) the sum of their i-weights. The partitioning algorithm works as if it tries to calculate w′(v_0) for some input vertex v_0 using a depth-first traversal of T. That is, we iterate through v_0's children {v_i : 1 ≤ i ≤ degree(v_0)}, recursively calculating w′(v_i) for each child and summing the results, together with the i-weight of v_0 itself, to obtain w′(v_0). If the accumulated i-weight sum exceeds φ/2 at the t-th child for some t < degree(v_0), that is, before all the children were processed, then all vertices in ∪_{i=1}^{t} T_{v_i} are moved into a new component and removed from the tree. If t > 1, we also add v_0 itself to the new component in order to ensure connectivity, but we do not remove it from the tree, for the sake of the rest of its children. Otherwise, if the resulting w′(v_0) satisfies φ/2 ≤ w′(v_0) ≤ φ, then all vertices in T_{v_0} are removed from the tree and moved into a new component of the partition. The last case we need to handle is w′(v_0) > φ. If this happens, we create one component that contains the vertex v_0 alone, and t components, one per each of the sets T_{v_1}, ..., T_{v_t}. When the algorithm returns from handling v, the remaining weight w′(v) is smaller than φ/2.

Augmentation by Sampling. Given an edge subset E_i and the partitioning of T that E_i induces, the last component of the algorithm chooses a subset of E_i to add to the preconditioner. We construct a contracted graph in which every vertex x represents one of the connected components of T, and each edge (x, y) represents all the edges of E_i with one endpoint in component x and the other in y. The weight b(x, y) of an edge (x, y) is the number of edges that it represents. For each vertex x in the contracted graph we also compute a weight b(x), which is simply the sum of the weights of the edges incident to it. We now compute for each edge a ratio r(x, y) and a probability p(x, y):

r(x, y) = max( b(x, y)/b(x) , b(x, y)/b(y) )

p(x, y) = min( c · r(x, y) , 1 )
We then drop the edge (x, y) from the contracted graph with probability 1 − p(x, y). Intuitively, edges which are light relative to their surroundings will be dropped with high probability. As c goes up, the probability of dropping edges decreases linearly. For each edge (x, y) that is left in the contracted graph after the random sampling, we add the heaviest edge in E_i which is represented by (x, y) in the contraction. The overall set of augmentation edges is the union of the edges contributed by each E_i using this algorithm.

Ultra Sparsify Variants. The algorithm described above is our simplest implementation of Ultra Sparsify, denoted in our results by the name noMerge. We have also implemented two additional variants.

The first variant, denoted by Graph in our results, extends the way we classify edges by adding a third label e(u, v) for each edge (u, v) ∈ E_aug. These labels are used to refine the partitioning of E_aug described above; all the edges within a given subset E_i must have the same e value. In order to assign these labels, we first normalize all of the tree's edge weights such that the heaviest edge weighs 1. We then construct a sequence of forests {T^(i)} such that T^(i) is the forest containing all edges from T_A whose weight is greater than 2^{−i}. The label e(u, v) is set to be the minimal l such that u and v are within the same connected component in the forest T^(l). In other words, e(u, v) = l if there is a path between u and v in T such that all the edges on the path have weight of at least 2^{−l}, but there is no such path whose weights are all heavier than 2^{−l+1}. Calculating e(u, v) can be done quickly by using a disjoint-set data structure as follows. We start with T^(0), which is the empty forest, and repeatedly build T^(i+1) by merging, in the forest T^(i), the endpoints of edges whose weight is greater than 2^{−i−1}. We finish when T^(i+1) is a full tree (has one connected component).

Our next variant of Ultra Sparsify uses the same e labels, but uses them to select a different set of augmenting edges. Recall that each E_i induces a partitioning of T, and this partitioning is later used to find augmenting edges. When the e labels are used in the construction of the E_i's, all the edges in a given E_i have the same label, say l. In this variant we partition the forest T^(l) rather than the entire tree T.
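The e-label computation with a disjoint-set structure can be sketched as follows. This is our own illustration under the assumptions stated above (tree edge weights normalized so the heaviest weighs 1, all weights positive); the data layout and names are ours.

```java
import java.util.*;

// Labels each augmentation edge (u, v) with the first level l at which its
// endpoints become connected in the forest T^(l) of tree edges heavier than
// 2^-l. Tree edges are added in decreasing weight order using union-find.
// treeEdges and augEdges are {u, v, weight} triples.
class EdgeLevels {
    static int[] labels(int n, double[][] treeEdges, double[][] augEdges) {
        double[][] byWeight = treeEdges.clone();
        Arrays.sort(byWeight, (a, b) -> Double.compare(b[2], a[2]));   // heaviest first
        int[] rep = new int[n];
        for (int v = 0; v < n; v++) rep[v] = v;
        int[] label = new int[augEdges.length];
        Arrays.fill(label, -1);
        int added = 0;
        for (int level = 1; added < byWeight.length || hasUnlabeled(label); level++) {
            double cutoff = Math.pow(2.0, -level);
            while (added < byWeight.length && byWeight[added][2] > cutoff) {
                union(rep, (int) byWeight[added][0], (int) byWeight[added][1]);
                added++;
            }
            for (int i = 0; i < augEdges.length; i++)        // newly connected pairs get label = level
                if (label[i] < 0 && find(rep, (int) augEdges[i][0]) == find(rep, (int) augEdges[i][1]))
                    label[i] = level;
        }
        return label;
    }
    static boolean hasUnlabeled(int[] label) {
        for (int l : label) if (l < 0) return true;
        return false;
    }
    static void union(int[] rep, int a, int b) { rep[find(rep, a)] = find(rep, b); }
    static int find(int[] rep, int v) {
        while (rep[v] != v) { rep[v] = rep[rep[v]]; v = rep[v]; }
        return v;
    }
}
```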
CHAPTER 6

Partition-First Sparsification

The algorithms that we have described so far begin with a spanning tree and then partition it (once or several times) in order to augment the tree with additional edges. It is also possible to construct a preconditioner by partitioning the original graph, building a preconditioner in each part, and then augmenting the union of the sub-preconditioners with extra edges if necessary.

We are aware of two such algorithms. One is not described in the literature, but is used in a Java code by Dan Spielman. The other is described in a theoretical paper by Koutis and Miller [17]. The Koutis-Miller algorithm works only on planar graphs.

The algorithm by Koutis and Miller constructs a vertex cover of the graph rather than a partition: each vertex belongs to at least one subset in the cover, but possibly more. Each edge is covered (both endpoints) by at least one subset in the cover and at most two. The sizes of the (overlapping) sub-graphs induced by the cover are roughly equal, and the number of vertices contained in more than one sub-graph is small. The construction of the cover depends only on the connectivity of G_A, not on the edge weights.

Once the cover is found, the algorithm constructs a preconditioner in each subset. Their union is the overall preconditioner. Clearly, if every edge is supported well by paths in at least one of the sub-graphs to which it belongs, then all the edges are supported well. There is no need to add inter-subset edges.

To obtain low asymptotic running times, Koutis and Miller propose to call this algorithm recursively. That is, to construct an initial preconditioner using a cover with subsets of size bounded by a constant. Degree-one and degree-two vertices of this preconditioner are then eliminated using Gaussian elimination, but once all the vertices have degree three or more, the elimination stops, and we construct a new preconditioner for the partially-eliminated matrix. We have not implemented this algorithm.

We describe Spielman's partition-first algorithm in Section 1 of Chapter 7 below.
CHAPTER 7

Additional Heuristics

The total complexity of some of the algorithms that we have described so far can be analyzed rigorously. This means that it is possible to bound the cost of constructing the preconditioner, the cost of factoring it, and the cost of the iterations.

It is also possible to use the same algorithmic building blocks to design heuristic sparsification algorithms and heuristic solvers. It may or may not be possible to bound their worst-case behavior, but they may still work well on some problems. The description of Ultra Sparsify above describes one such heuristic: the algorithm is based on provably-effective algorithms, but the particular variant that we have described has not been rigorously analyzed.

1. Spielman's Heuristic Algorithm

Another heuristic that we have used in our experimental study is by Dan Spielman. The heuristic is implemented in a Java code; to the best of our knowledge, it is not described in a paper.

This code also starts by decomposing G_A into disjoint vertex sets. This decomposition is again based on a set-growing technique. Each set is grown from a root vertex in two distinct phases. In the first phase, the set is grown until its size reaches 2S/3 for a given target size S. Once the set contains at least 2S/3 vertices, we switch to the second phase. In the second phase, we continue to grow the set as long as it has at most 4S/3 vertices, but we also monitor the edge-count to vertex-count ratio. Once the second phase ends, we choose as the final set the state during the second phase at which this ratio was highest. Note that during the second phase, some vertices may be tentatively added to the set but removed when the set is finalized.

The actual growing of a set is done using a variant of Dijkstra's shortest-paths algorithm. This variant views the graph as a network of resistors and approximates the resistance between points in the network. Suppose that we are processing a vertex v and relaxing the distance from the root to a neighbor u. Let d̄ be the vector of shortest-distance estimates. In Dijkstra's standard algorithm, the new d̄(u) is the minimum between d̄(v) + d(v, u) and d̄(u). In Spielman's modification, the new d̄(u) is set to be

( 1/d̄(u) + 1/(d̄(v) + d(v, u)) )^{−1} .

This value is always lower than both d̄(u) and d̄(v) + d(v, u). The entries of d̄ are initialized to ∞, so if this is the first relaxation done for u, then we have

( 1/d̄(u) + 1/(d̄(v) + d(v, u)) )^{−1} = ( 0 + 1/(d̄(v) + d(v, u)) )^{−1} = d̄(v) + d(v, u) .
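The modified relaxation rule combines the old estimate and the new candidate path length like resistors in parallel. Here it is as a small sketch (the class and method names are ours):

```java
// Resistive relaxation: instead of taking the minimum of the old estimate and
// the new candidate, the two are combined like parallel resistances. dBarU and
// dBarV are the current estimates of u and v; edgeLength is d(v, u).
class ResistiveRelaxation {
    static double relax(double dBarU, double dBarV, double edgeLength) {
        double candidate = dBarV + edgeLength;            // the usual Dijkstra candidate
        if (Double.isInfinite(dBarU)) return candidate;   // first relaxation of u
        return 1.0 / (1.0 / dBarU + 1.0 / candidate);     // always smaller than both inputs
    }
}
```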
When the algorithm finishes forming one set, it selects a vertex that is not yet in any set as the root of the next set to grow. The vertices are considered as potential roots according to their distance from the first root (in order of increasing distance).

Once the graph is decomposed, we find a rooted spanning tree within each component. The root is the middle vertex of the component. This vertex is found as follows. We find all the vertices that are adjacent to vertices outside the component. We start a breadth-first search from these vertices and use as the root (the middle vertex) the last vertex found in this traversal. The tree itself is constructed by the same modified Dijkstra algorithm described above.

Finally, we reconnect with one edge each pair of components that was originally connected in the graph. The edge that is added is the one that minimizes the ratio

( c_heur(u) + c_heur(v) ) / w(u, v) ,

where c_heur(v) is a heuristic metric assigned to all the vertices. The computation of c_heur(v) is complex, so we do not describe it in detail. The choice of the connecting edge is designed to heuristically minimize the stretch of the connecting edges that are not chosen.

Figure 1. Spielman's heuristic partitioning, with a small target set size (16, on the left) and a bigger target size (36, on the right). The regular partitioning on the left was obtained only when using a target size of 16.
Figure 2. METIS partitioning, again with small and larger target set sizes. The partitioning is less regular than the partitioning in Figure 1.
Figure 3. A non-uniform graph with edge weights ranging from 0.1 to 1.5·10^8. Edges are colored using a color palette that allocates the reds to heavy edges and the blues to light edges.
Figure 4. Spielman's heuristic partitioning. Note how the sets are aligned with the ring of heavier edges.
Figure 5. In contrast to Figure 4, we can see that METIS does not take edge weights into consideration.

2. Modified Vaidya

Another heuristic that we have tested is a variant of Vaidya's algorithm: instead of adding the heaviest edge between components, we add the edge with the highest weighted dilation. The algorithm that calculates weighted dilation for non-tree edges is a component of Ultra Sparsify; this variant allowed us to observe the relative benefit of using this single component of Ultra Sparsify on its own. We call this variant Modified Vaidya.

We have also experimented with additional variants and ideas, but when initial experimentation indicated that a particular variant was not particularly effective, we dropped it from our study.
CHAPTER 8

Other Combinatorial Preconditioners

The literature on combinatorial preconditioners contains other ideas that we have not tested. There are three main categories of such ideas.

1. Preconditioning in a Larger Space

The first category of algorithms that we have not tested preconditions A using the Schur complement B of a larger matrix M. To solve the preconditioning equation Bz = r for z, these algorithms first extend r with zeros to match the dimension of M. Denoting the extended vector by r̃, the algorithm then solves Mz̃ = r̃ and takes z to be the first entries of z̃. Because of the extension with zeros, this is equivalent to solving Bz = r, where B is the Schur complement of M with respect to the elimination of M's last rows and columns.

The resulting solver is exactly equivalent to using M as a preconditioner for the extended problem Ãx̃ = b̃, where Ã is A extended with zero rows and columns and b̃ is an extension of b with zeros. From the algorithmic point of view, these algorithms really solve Ãx̃ = b̃ using M as a preconditioner, except that the extended vectors are explicitly formed only when M^{−1} is applied; in the rest of the algorithm, only their first entries are maintained. From the spectral point of view, convergence is governed by the generalized eigenvalues of (Ã, M), which are also the eigenvalues of (A, B).

The first algorithm of this type is due to Gremban and Miller [13]. Their algorithm constructs a preconditioner M whose graph G_M is a balanced binary tree whose leaves are the vertices of G_A. Each internal vertex of G_M is viewed as a set of vertices of G_A. The weight of the edge from a vertex v in G_M to its parent is the total weight of the edges in G_A linking the set represented by v to the rest of G_A. This construction leads to congestion-dilation bounds in which there is no congestion and the dilation is logarithmic. This bounds the generalized eigenvalues from one side; the other bound is much more difficult to analyze. Gremban and Miller presented such an analysis for regular grids. A later paper by Woo et al. [18] extended this idea to general graphs, but without an efficient algorithm to construct M.

Shklarski and Toledo used the same algebraic framework to construct preconditioners for matrices arising from finite-element discretizations, even when the matrices are not diagonally dominant [19]. In their method, called fretsaw preconditioning, the introduction of new vertices essentially allows the preconditioner to relax continuity constraints in the original problem. This framework was recently used by Daitch and Spielman to construct provably-effective preconditioners for certain two-dimensional problems in linear elasticity [9].
2. Combinatorial Preconditioners for Problems that are not Diagonally Dominant

The fretsaw extension method of Shklarski and Toledo allows researchers to design combinatorial preconditioners for problems that are not diagonally dominant. There are additional techniques for handling specific families of problems.

One technique is again due to Gremban and Miller [12]. It reduces a symmetric diagonally-dominant problem with positive off-diagonals to a larger linear system (twice as large) with a coefficient matrix that is symmetric, diagonally dominant, and has only non-positive off-diagonals.

Another technique for the same class of problems relies on constructing a maximum-weight basis for the matroid induced by the weighted incidence matrix U of A = UU^T. This maximum-weight basis is a generalization of the maximum spanning tree. As in Vaidya's algorithm, the maximum-weight basis can be augmented with extra edges. This algorithm is due to Boman, Chen, Hendrickson, and Toledo [3].

A more recent technique, which also appears to be more widely applicable (symmetric diagonally-dominant matrices with positive off-diagonals rarely arise in applications), is due to Boman, Hendrickson and Vavasis. This technique approximates each term A_e in a finite-element matrix A = Σ_e A_e by a diagonally-dominant approximation L_e with only non-positive off-diagonals. These approximations are summed, and the sum L = Σ_e L_e, which is also diagonally dominant, is approximated using a combinatorial sparsification algorithm. Avron, Chen, Shklarski and Toledo proposed a more general and more practical version of this method [1].

3. Recursive Combinatorial Algorithms

To achieve the best asymptotic running times, linear solvers that are based on combinatorial preconditioners use recursion. We have already mentioned this in the context of the algorithm of Koutis and Miller. This technique works as follows. A combinatorial sparsification algorithm is used to sparsify G_A, but not by much, in order to keep the bound on the number of iterations low (typically some large constant independent of n). The preconditioner is not factored. Instead, a partial elimination of degree-1 and degree-2 vertices is carried out until there are no more such vertices. This partial factorization is used in every preconditioning step. But to solve Bz = r, we need to solve the reduced system. This is done iteratively, using another combinatorial preconditioner for the reduced system.

Preliminary experiments by Toledo suggested that it is difficult to obtain high performance with such recursive (or nested) preconditioners. We are not aware of any other experimental evidence showing that they are effective. Therefore, we have not tested them in this work.
CHAPTER 9

Experimental Results

We have conducted extensive experiments to evaluate these preconditioners. This chapter describes the experiments and their results.

1. Methodology

We ran each preconditioning algorithm on each matrix several times, using different algorithmic parameters. Each algorithm has parameters that control its behavior. For example, Vaidya's algorithm has one parameter, the target number of sub-graphs. Other preconditioners, like Ultra Sparsify, have two parameters. We did not know in advance which algorithmic parameters are worth investigating for each preconditioning algorithm and for each matrix. We also did not know which sampling resolution to use in order to explore the parameter space thoroughly. Lastly, we did not want to run preliminary tests only to find which regions of the parameter space are the interesting ones.

In order to address all these problems, we used an automatic mechanism for parameter generation. This mechanism allocates parameter values within the range (0, 1) in increasing resolution. All preconditioning algorithms were implemented such that their parameters are expected to be in the range (0, 1), and each algorithm interpreted its normalized parameters as required by its structure. For example, Vaidya's algorithm was hardwired to look for n^x sub-graphs, where n is the number of vertices in the graph, and the drop-tolerance Cholesky factorization used a drop-tolerance value of 2^{−16x} (x being the normalized parameter value). Each test run was allocated a serial number, beginning with 1. Our test harness used this serial number to generate the exact parameter values using a simple bijection ord : N → (0, 1). This function generates all the values in the range (0, 1) in increasing resolution (1/2, 1/4, 3/4, 1/8, 3/8, 5/8, ...), as required. This approach was also extended to multiple-parameter preconditioners, using a second bijection kary_k : N → N^k. Using this mechanism, we could simply run the tests continuously until enough data had accumulated.

For each run, we solved one linear system Ax = b with a random (known) solution x. We recorded the density of the factor of the preconditioner and the number of arithmetic operations that the solver performed, including both the factorization phase and the iterative phase. We did not attempt to quantify the computational effort involved in the graph algorithm that constructs the preconditioner. Doing so in a meaningful way requires equally-efficient implementations of all the algorithms, which we did not have. We focused instead on trying to find graph algorithms that produce effective preconditioners, leaving to further research the question of how fast each graph algorithm is.
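A minimal sketch of the serial-number-to-parameter bijection described above (the class and method names are ours):

```java
// Maps test-run serial number i = 1, 2, 3, ... to the i-th element of the
// sequence 1/2, 1/4, 3/4, 1/8, 3/8, 5/8, 7/8, 1/16, ... which sweeps (0, 1)
// in ever-increasing resolution.
class ParameterGenerator {
    static double ord(int i) {
        int level = 32 - Integer.numberOfLeadingZeros(i);   // floor(log2(i)) + 1
        int indexInLevel = i - (1 << (level - 1));           // 0-based position within the level
        return (2.0 * indexInLevel + 1.0) / (1 << level);    // odd numerator over 2^level
    }
}
```

A Vaidya run with serial number i, for example, would then target n^{ord(i)} sub-graphs, and a drop-tolerance run would use the tolerance 2^{−16·ord(i)}.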
We present the results using graphs that relate the density of the factor of the preconditioner to the total number of arithmetic operations that the solver performed (see Figure 1 for a typical example). The horizontal axis of the graphs ends at the fill of the complete Cholesky factorization of GA itself. Each graph also shows a blue line that bounds the work from above; this line shows the amount of work done by the direct solver. For some of the graphs we also provide a zoom-in. We also used another type of graph when we wanted to highlight the differences between two variants of the same algorithm. This other graph is a histogram whose variable is the ratio of performance between the compared algorithms.

In the tests performed on the augmentation algorithms (Vaidya and Ultra Sparsify) we used the same underlying tree as the augmentation base throughout the entire test runs.

2. Test Matrices

We used three families of test matrices. The first family consists of regular 2- and 3-dimensional meshes, sized 216 × 216 and 36 × 36 × 36 respectively (a total of 6^6 = 46656 vertices each). In the 2-dimensional meshes internal vertices have degree 4, and in the 3-dimensional meshes internal vertices have degree 6. We used meshes with both uniform edge weights and with random edge weights (uniform and independent weights in (0, 1)).

The second family of matrices comes from finite-elements solvers. These matrices were produced by an approximation algorithm that approximates the coefficient matrix of a scalar elliptic finite-elements problem by a diagonally-dominant one [1]. All of these matrices correspond to discretizations of 3-dimensional structures. Their edge weights are usually not uniform (sometimes they vary by several orders of magnitude), but they are not random. In total, there were 4 such matrices:

Matrix name   Number of unknowns   Problem's domain
SC_G          93202                A 10-by-10-by-10 cube containing a spherical shell of inner radius 3 and thickness 0.1.
CH_G          32867                A 1-by-1-by-1 cube with a 1-by-0.1-by-0.79 hole in its middle.
B_G           196053               A 1-by-1-by-10000 box.
C_G           326017               A unit box.

The third family of matrices is a small subset of the University of Florida Sparse Matrix Collection. We have used only SPDDD matrices from this collection with at least 9000 vertices:
Name                 Number of unknowns   Number of non-zeros
Andrews-Andrews      60000                760154
Gaertner-Nopoly      10774                70842
GHS_psdef-Apache1    80800                542184
GHS_psdef-Jnlbrng1   40000                199200
Norris-Fv1           9604                 85264
Norris-Fv2           9801                 87025
Norris-Fv3           9801                 87025

For more details about these matrices, see [10].

3. Algorithm Legend

The following tables summarize the preconditioning algorithms that we have tested, together with their sources. Each algorithm has two names; the short names are used in the text, and the long ones in the graphs.

Short name       Variant     Naming in graphs
Ultra Sparsify   graph       UltraSparsify.graph.mst
Ultra Sparsify   noMerging   UltraSparsify.noMerging.mst
Ultra Sparsify   subtree     UltraSparsify.subtree.mst
Vaidya           Standard    vaidya augment.standard.center-root.mst
Vaidya           Modified    vaidya augment.special.center-root.mst
Cholinc          -           cholinc
javaClus         -           javaClus

Preconditioner name   Source
UltraSparsify         Dan Spielman's theoretical work [21, 20]
Vaidya                Vaidya [7]
Modified Vaidya       Our own heuristics
Cholinc               Standard drop-tolerance incomplete Cholesky
javaClus              Dan Spielman's Java code

4. Test Infrastructure

In order to develop code quickly and to simplify the implementations of different heuristics, all the implementations of the graph algorithms used for building preconditioners were done in Java 5. These implementations used a flexible and robust graph-algorithmic framework we created for this purpose. Our framework supports all the basic graph operations that were required in order to implement the preconditioners that we have tested. Although there are many open-source graph packages for Java, we did not find any that supported the operations we needed and whose performance was adequate. Still, we did not write everything from scratch; we used GNU's Trove (High Performance Collections for Java, http://trove4j.sourceforge.net/) and Colt (High Performance Scientific and Technical Computing in Java, http://dsd.lbl.gov/~hoschek/colt/). In total, around 15k lines of Java code were written, both for the framework and for the preconditioners' implementations. These line counts do not include Dan Spielman's Java code (javaClus), whose size is roughly similar to ours.
The test harness itself was implemented purely in Matlab, using around 10k lines of code. We used Matlab version 7.2, which supports calling directly into Java 5 code. We also used the internal incomplete Cholesky implementation of Matlab. Since our performance metrics are abstract, we could run tests on different machines with different amounts of RAM and different architectures. In total, we used 4 machines to run all the experiments, sometimes concurrently.

Running all the tests was time consuming. A typical experiment on a moderately sized matrix (a 32 × 32 × 32 3D mesh, for example) takes up to a week to finish. For some of the bigger matrices, the tests took even longer to finish. Memory requirements were also quite high, with up to 5 GB of RAM needed for the tests on the larger matrices.

5. Results

Figures 1 and 2 show a typical performance graph. Each point in this graph is the result of a single solve of the same linear system, a uniform 3D mesh in this case, using a unique combination of algorithmic parameters. We can see that all of the preconditioning techniques can improve upon the direct solver, whose work requirements are depicted by the blue line. The most efficient preconditioner is the incomplete Cholesky factorization (Figure 3), and Spielman's Heuristic Algorithm (javaClus) comes second with around 21% more work. For almost all of our test matrices, the incomplete Cholesky factorization was the most efficient preconditioner. It is interesting to note, however, that javaClus was able to achieve the same amount of work as incomplete Cholesky with fewer non-zeros. A similar performance pattern is visible in 3D random meshes too (Figure 4).

An important aspect of these performance graphs is that they also allow us to easily grasp the overall behavior of the preconditioners. That is, the graphs show the performance effects of varying the algorithmic parameters of a preconditioner. Regular behaviors, like continuous slopes or bands, show that the algorithm is stable with respect to its parameters, which makes it easier to choose these parameters. For example, it is evident that javaClus is not particularly well behaved, since its results are more scattered. This behavior is visible in most of our results.

Our tests contained some matrices which were hard to precondition effectively. These matrices include the 2D meshes, both unweighted and random (Figure 6), the matrix B_G (Figure 8) and the matrix Norris-Fv3 (Figure 24). On these matrices, the direct solver was the most efficient algorithm.

Vaidya vs. Modified Vaidya. Our variant of Vaidya (described in Chapter 7) performed better than the standard Vaidya in most of the tests. In order to establish this in a precise way, we compared how each Vaidya variant performed on the same matrix and using the same algorithmic parameters. The ratios between the two variants' performances are plotted in histograms. A ratio r means that the Modified Vaidya variant outperformed the standard Vaidya by a factor of r. Figure 9 shows a few of these histograms. The average improvement of using Modified Vaidya over all experiments was around 4%. For one matrix (SC_G) we have seen an average improvement of about 16%. On the other hand, on one matrix (Andrews-Andrews, taken from the University of Florida matrix collection), Modified Vaidya performed 7% worse than standard Vaidya.
Figure 1. Performance graphs for a 3D uniform mesh 36 × 36 × 36.

Figure 2. A magnification of Figure 1.

Ultra Sparsify. We have used the same comparison method to compare between the three variants of Ultra Sparsify. Our finding was that the differences between these three variants were negligible. Specifically, the average performance ratios among the three variants, as measured between runs that used the same matrix and the same algorithmic parameters, were less than 1%. We have noticed, however, that the variant graph was consistently slightly better than the two other variants.
Figure 3. A magnification of Figure 2.

Figure 4. Performance graphs for a 3D random mesh 36 × 36 × 36.

Comparing Ultra Sparsify to Vaidya shows that for about half of the matrices, the performance of these preconditioners was similar, with no single preconditioner showing a consistent edge over the other. For the other half, however, there were some big differences. For the matrix Andrews-Andrews, Ultra Sparsify performed about twice as well as Vaidya, and for the matrix GHS_psdef-Apache1, Vaidya performed about twice as well as Ultra Sparsify. We did not find any explanation for this phenomenon.

We were also interested in how varying each of the parameters k and c affects the Ultra Sparsify preconditioner. Figure 25 shows graphs that depict the relation between each of these parameters and the total performance. It can be seen that the performance depends almost exclusively on the value of k.
Figure 5. Magnification of Figure 4.

Figure 6. Performance graphs for 2D uniform and random meshes of size 216 × 216.

However, closer examination of the data revealed (as seen in the bottom figure) that for small values of k, the parameter c also has a big impact on performance. This dependence on k and c is the typical behavior of Ultra Sparsify, and is evident in the results of other matrices too. Generally, the best performance was achieved for n^0.4 ≤ k ≤ n^0.5 and 1 ≤ c ≤ 2.

Low Stretch Trees. Trees usually make poor preconditioners if not augmented with additional edges. Preliminary tests did show that preconditioning with a low stretch spanning tree is better than with a maximum weight spanning tree.
Figure 7. Magnification of Figure 6.

Figure 8. Performance graphs for the matrix B_G. The direct solver (its blue line is barely visible at the bottom of the frame) was about 6 times more efficient than the most efficient preconditioner.

However, both trees performed much worse than any of the other preconditioners. In order to compare low stretch spanning trees with maximum weight spanning trees in a practical context, we conducted tests in which these trees were used as bases for augmentation using Vaidya and Ultra Sparsify. In order to conduct a fair comparison, we applied the same preconditioning algorithm and the same algorithmic parameters to both trees. We again show the results of these experiments using a set of histograms. The histograms' variable is the ratio between the performances obtained from the two tree types.
Figure 9. Ratios between the performance of standard Vaidya and Modified Vaidya. A ratio r means that the Modified Vaidya outperformed the standard Vaidya by a factor of r. The matrices used were a uniform 3D mesh (left), SC_G (middle) and B_G (right).

Specifically, a ratio r means that the low stretch tree outperformed the maximum weight tree by a factor of r. We show 5 such histograms in Figures 10 and 11, one histogram per augmentation algorithm. The matrix used in this experiment was CH_G. These results show that low stretch trees do improve the quality of the resulting preconditioner over maximum weight spanning trees, but not by a large factor. The largest improvement was evident in Vaidya, with an average improvement of about 14%. Ultra Sparsify showed an average improvement of less than 3%.

In order to evaluate the benefit of low stretch spanning trees in general, and not necessarily just one implementation of them, we conducted another experiment. For this experiment we implemented a straightforward algorithm that produces spanning trees with bounded O(log n) stretch, but only for n × n × n 3D uniform meshes. This stretch bound is lower than the bound of the general low stretch spanning tree (of Section 2 in Chapter 4), and is obtainable only because the structure of the graph is known beforehand. In Figure 12 we can easily see that our lower stretch spanning tree is much more effective than the general algorithm. All variants of the augmentation algorithms performed better on our very low stretch tree.
Figure 10. Ratios of performance between low stretch and maximum weight spanning trees for the algorithms Vaidya (left) and Modified Vaidya (right). Ratios bigger than 1 mean that the low stretch tree performed better.

Figure 11. Ratios of performance between low stretch and maximum weight spanning trees for the algorithm Ultra Sparsify using its three variants: graph (left), subtree (right) and noMerging (bottom). Ratios bigger than 1 indicate that the low stretch tree was more effective.
Figure 12. Performance graph of the augmentation algorithms Vaidya and Ultra Sparsify, with spanning-tree bases of the general low stretch spanning tree and our ad-hoc O(log n)-bounded low stretch tree, for a 3D mesh 32 × 32 × 32. The effectiveness of using a lower stretch tree can be easily seen.

Figure 13. Performance graph for matrix C_G.
Figure 14. Zoom-in of Figure 13.

Figure 15. Performance graph for matrix CH_G.
Figure 16. Zoom-in of Figure 15.

Figure 17. Performance graph for matrix SC_G.
Figure 18. Zoom-in of Figure 17.

Figure 19. Performance graph for matrix Andrews-Andrews.
Figure 20. Zoom-in of Figure 19.
Figure 21. Performance graph for matrix Gaertner-Nopoly and its zoom-in.
Figure 22. Performance graph for matrix GHS_psdef-Apache1 and its zoom-in.
Figure 23. Performance graph for matrix GHS_psdef-Jnlbrng1.
Figure 24. Performance graphs for the matrices Norris-Fv1, Norris-Fv2 and Norris-Fv3.
Figure 25. This figure shows the relation between the parameters c (left) and k (right) and the performance of the Ultra Sparsify preconditioner (using the noMerging variant and the matrix CH_G). It seems that the value of k is more influential than that of c regarding the preconditioner's performance. However, this is not the case, as seen by examining the bottom figure, which shows the relation between c and the total performance only for 20 ≤ k ≤ 30; in this range, performance was optimal with respect to k. It is evident that the value of c also has a big impact on performance.
CHAPTER 10

Conclusions

The experimental results that we have presented lead to several important conclusions.

First, the different graph algorithms that we have tested lead to moderate quantitative performance differences, but these differences are not dramatic and, more importantly, are not always consistent. On some matrices, all the graph algorithms performed similarly (e.g., SC_G). On other matrices, the Vaidya variants performed better than the Ultra Sparsify variants (e.g., GHS_psdef-Apache1), or the Ultra Sparsify variants performed better (e.g., Andrews-Andrews). Spielman's heuristic, javaClus, outperformed all the other graph algorithms on regular 3D meshes, but was slower than all of them on several other matrices.

Second, the behaviors of the different graph algorithms are similar in several qualitative aspects. For example, incomplete Cholesky usually beats all of the other preconditioners, but when it does not, it is slower than all of them. Similarly, the direct solver either beats all of them or is slower than all of them.

Therefore, the main conclusions of this research are that the graph algorithms that have been recently proposed in the literature for graph sparsification are not significantly better than Vaidya's algorithm, whose behavior is now well understood [1, 7]. Performance can be improved using a better initial tree (a tree with very low stretch, for example), but such trees improve both Vaidya's and the newer algorithms' performance almost equally. Nevertheless, lower stretch trees do generally lead to better performance.

This may seem like a fairly trivial conclusion, but it is not. Without the extensive experimentation reported in this thesis, it is impossible to know whether the improved asymptotics of recent algorithms translate into better performance in practice. It was widely believed that the recently-proposed algorithms would outperform Vaidya by a wide margin, thereby improving the competitiveness of combinatorial preconditioners over other classes of solvers. Our data does not support this belief, and in the context of the matrices used for the experiments, contradicts it.

We have not attempted to understand what features of the matrices help each algorithm excel. For example, we do not know what property of Andrews-Andrews contributes to the high performance of Ultra Sparsify, or what property of GHS_psdef-Apache1 causes Vaidya to outperform Ultra Sparsify. This remains an interesting question for future research.

We have also ignored the efficiency of the graph algorithms themselves, as opposed to the quality of the preconditioners that they produce. To address this issue, one would need to implement Ultra Sparsify and low-stretch-tree constructors efficiently. Such implementations usually require a significant amount of effort. Given our results, it is not clear whether such efforts are justified, especially since we cannot isolate a single variant that is clearly better than the others (so that the implementation effort could focus on it). One exception may be the
construction of low-stretch trees; lower stretch seems to usually translate into better preconditioners, so such implementation effort may be justified.

Acknowledgement. This research was supported by an IBM Faculty Partnership Award, by grant 848/04 from the Israel Science Foundation (founded by the Israel Academy of Sciences and Humanities), and by grant 2002261 from the United States-Israel Binational Science Foundation.
Bibliography

[1] Haim Avron, Doron Chen, Gil Shklarski, and Sivan Toledo. Combinatorial preconditioners for scalar elliptic finite-elements problems. Submitted to SIAM Journal on Scientific Computing, November 2006.
[2] Marshall Bern, John R. Gilbert, Bruce Hendrickson, Nhat Nguyen, and Sivan Toledo. Support-graph preconditioners. SIAM Journal on Matrix Analysis and Applications, 27:930–951, 2006.
[3] Erik G. Boman, Doron Chen, Bruce Hendrickson, and Sivan Toledo. Maximum-weight-basis preconditioners. To appear in Numerical Linear Algebra with Applications, 29 pages, June 2001.
[4] Erik G. Boman, Bruce Hendrickson, and Stephen Vavasis. Solving elliptic finite element systems in near-linear time with support preconditioners. Submitted for publication, 2004.
[5] Erik G. Boman and Bruce Hendrickson. Support theory for preconditioning. SIAM Journal on Matrix Analysis and Applications, 25(3):694–717, 2004.
[6] Doron Chen, John R. Gilbert, and Sivan Toledo. Obtaining bounds on the two norm of a matrix from the splitting lemma. Electronic Transactions on Numerical Analysis, 21:28–46, 2005.
[7] Doron Chen and Sivan Toledo. Vaidya's preconditioners: Implementation and experimental study. Electronic Transactions on Numerical Analysis, 16:30–49, 2003.
[8] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms. MIT Press and McGraw-Hill, 2nd edition, 2001.
[9] Samuel I. Daitch and Daniel A. Spielman. Support-graph preconditioners for 2-dimensional trusses.
[10] T. Davis. University of Florida sparse matrix collection. NA Digest, 92(42), October 16, 1994.
[11] Michael Elkin, Yuval Emek, Daniel A. Spielman, and Shang-Hua Teng. Lower-stretch spanning trees. In STOC '05: Proceedings of the thirty-seventh annual ACM symposium on Theory of computing, pages 494–503, Baltimore, MD, USA, 2005.
[12] Keith D. Gremban. Combinatorial Preconditioners for Sparse, Symmetric, Diagonally Dominant Linear Systems. PhD thesis, School of Computer Science, Carnegie Mellon University, October 1996. Technical Report CMU-CS-96-123.
[13] Keith D. Gremban, Gary L. Miller, and Marco Zagha. Performance evaluation of a new parallel preconditioner. In Proceedings of the 9th International Parallel Processing Symposium, pages 65–69. IEEE Computer Society, 1995. A longer version is available as Technical Report CMU-CS-94-205, Carnegie Mellon University.
[14] Victoria E. Howle and Stephen A. Vavasis. Preconditioning complex-symmetric layered systems arising in electrical power modeling. In Proceedings of the 5th Copper Mountain Conference on Iterative Methods, March 1998. 7 unnumbered pages.
[15] Ilse C. F. Ipsen and Carl D. Meyer. The idea behind Krylov methods. American Mathematical Monthly, 105(10):889–899, 1998.
[16] Anil Joshi. Topics in Optimization and Sparse Linear Systems. PhD thesis, Department of Computer Science, University of Illinois at Urbana-Champaign, 1997.
[17] Ioannis Koutis and Gary L. Miller. A linear work, O(n^(1/6)) time, parallel algorithm for solving planar Laplacians.
[18] Bruce M. Maggs, Gary L. Miller, Ojas Parekh, R. Ravi, and Shan Leung Maverick Woo. Finding effective support-tree preconditioners. Unpublished manuscript; available online at http://www-2.cs.cmu.edu/~maverick, 2005.
[19] Gil Shklarski and Sivan Toledo. Rigidity in finite-element matrices: Sufficient conditions for the rigidity of structures and substructures. Submitted to SIAM Journal on Matrix Analysis and Applications, January 2006.
[20] Daniel A. Spielman and Shang-Hua Teng. Solving sparse, symmetric, diagonally-dominant linear systems in time O(m^1.31). In Proceedings of the 44th Annual IEEE Symposium on Foundations of Computer Science.
[21] Daniel A. Spielman and Shang-Hua Teng. Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In STOC '04: Proceedings of the thirty-sixth annual ACM symposium on Theory of computing, pages 81–90, Chicago, IL, USA, 2004.
[22] Pravin M. Vaidya. Solving linear equations with symmetric diagonally dominant matrices by constructing good preconditioners. Unpublished manuscript. A talk based on the manuscript was presented at the IMA Workshop on Graph Theory and Sparse Matrix Computation, October 1991, Minneapolis.