The best grid shape is determined by the algorithm implemented in the library and the underlying physical network. A one link physical network will favor or , where is the process grid. This affects the scalability of the algorithm, but reduces the overhead due to message collisions. It is possible to predict the best grid shape given the number of processes available. The current algorithms for the factorization or reduction routines can be split into two categories.

If at every step of the algorithm a block of columns and/or rows
needs to be broadcast, as in the *LU* or *QR* factorizations,
it is possible to pipeline this communication phase and overlap it
with some computation. The direction of the pipeline determines the
shape of the grid. For example, the *LU*, *QR* and *QL* factorizations
perform better for ``flat'' process grids ( ). These
factorizations share a common bottleneck of performing a reduction
operation along each column (for pivoting in *LU*, and for computing
a norm in *QR* and *QL*). The first implication of this observation
is that large latency message passing perform better on a ``flat''
grid than on a square grid. Secondly, after this reduction has been
performed, it is important to update the next block of columns as
fast as possible. This is done by broadcasting the current block of
columns using a ring topology, i.e, feeding the ongoing communication
pipe. Similarly, the performance of the *LQ* and *RQ* factorizations take
advantage of ``tall'' grids ( ) for the same reasons, but
transposed.

The theoretical efficiency of the *LU* factorization can be estimated
by (3):

For large *n*, the last term in the denominator dominates, and it is
minimized by choosing a slightly smaller than . works
well on Intel machines. For smaller *n*, the middle term dominates,
and it becomes more important to choose a small . Suppose that
we keep the ratio constant as *P* increases, thus we
have and , where *u* and *v*
are constant [10]. Moreover, let us ignore the
factor for a moment. In this case, and
are proportional to and must grow with *P*
to maintain efficiency. For sufficiently large , the
factor cannot be ignored, and the performance will slowly degrade
with the number of processors *P*. This phenomenon is observed in
practice as shown in Figure 10 for the efficiency of the
*LU* factorization on the Intel Paragon.

The second group of routines
are two-sided algorithms as opposed to one-sided algorithms.
In these cases, it is not
usually possible to maintain a communication pipeline, and thus
square or near square grids are more optimal. This is the case for
the algorithms used for implementing the Cholesky factorization,
the matrix inversion and the reduction to bidiagonal form (BRD),
Hessenberg form (HRD), tridiagonal form (TRD) and
the nonsymmetric *QR* eigenvalue algorithm (HQR). For example,
the update phase of the Cholesky factorization of a lower symmetric
matrix physically transposes the current block of columns of the
lower triangular factor.

Assume now that at most *P* processes are available. A natural
question arising is: could we decide what process grid
should be used? Similarly, depending
on *P*, it is not always possible to factor to
create the appropriate grid. For example, if *P* is prime, the
only possible grids are and . If such
grids are particularly bad for performance, it may be beneficial
to let some processors remain idle, so the remainder can be formed
into a ``squarer'' grid [24].
These problems can be analyzed by a complicated function of the
machine and problem parameters. It is possible to develop models
depending on the machine and problem parameters which accurately
estimate the impact of modifying the shape of the grid on the total
execution time, as well as predicting the necessary amount of extra
memory required for each routine.

**Figure:** Intel Paragon LU and QR Performance

**Figure:** IBM SP-2 and Intel Paragon Cholesky Factorization

**Figure:** Execution speed in Mflop/s for the band algorithm on the IBM SP-2.

**Figure:** Performance of the Parallel QR Algorithm on the Intel MP Paragon for
a 2x2 up to a 9x9 grid of processes.

**Figure:** Performance with different blocksize *NB* (*MB*=*NB*) and
prediction of percentage of time to performance operations.

**Figure:** IBM SP-2 and Intel Paragon Performance

Thu Jul 25 15:38:00 EDT 1996