1 Introduction
Krylov subspace (KSP) methods, such as GMRES [85, 87] and BiCGSTAB [100], are widely used for solving largescale sparse unsymmetric or indefinite linear systems, especially those arising from numerical discretizations of partial differential equations (PDEs). For relatively illconditioned matrices, the KSP methods can significantly benefit from a robust and efficient preconditioner. Incomplete factorization techniques, such as incomplete or Cholesky factorization for Mmatrices [75] and symmetric and positive definite (SPD) systems [49, 59, 69, 93] and incomplete LU (or ILU) factorization for symmetric indefinite [63, 91] and unsymmetric [67, 71, 82] systems, are among the most robust preconditioners. Earlier ILU methods, such as ILUT [82], ILUP [81], ILUTP [29, 85], ILUC [63, 64, 71], etc., lacked robustness for many indefinite systems; see e.g. [29, 42, 46, 67, 110] for some challenging benchmark problems that caused failures. With the more recent development of multilevel ILU techniques, such as ARMS [86, 88], ILUPACK [20, 22], MDRILU [109], ILU++ [73, 74], etc., the robustness of ILU has improved significantly for many applications. However, robustness and efficiency are sometimes problematic for highly illconditioned or saddlepoint systems, which often arise from discretizations of PDEs, such as Stokes, NavierStokes, and Helmholtz equations, in application areas such as computational fluid dynamics (CFD), climate modeling, materials science, structural mechanics, multiphysics coupling, etc.; see e.g. [86] and Section 5 in this work for some examples. The objective of this work is to improve the robustness and efficiency of multilevel ILU for such systems. To this end, we introduce a new preconditioner, called HILUCSI (pronounced as HiLuxi), which stands for Hierarchical Incomplete LUCrout with Scalabilityoriented and Inversebased dropping.
The development of HILUCSI was motivated by two observations. First, many linear systems from systems of PDEs are often nearly or partially symmetric, with some block structures. Without loss of generality, we assume the matrix has the form
(1.1) 
where . By near symmetry, we mean that has a nearly symmetric nonzero pattern, and . This (near) pattern symmetry is intrinsic to some numerical methods, such as finite differences [62], finite elements [30, 41], or finite volumes [61], because the local support of the basis functions (a.k.a. trial functions in finite elements) and that of the test functions in the variational formulations are typically the same or have significant overlaps. The numerical asymmetry in these cases tends to be small when it is due to small nonselfadjoint terms (such as advection in a diffusiondominant advectiondiffusion problem [41, p. 243]) or due to truncation errors (such as in a PetrovGalerkin method for a selfadjoint PDE [41, p. 88]). By partial symmetry, we mean that is nearly symmetric, although and may differ significantly. The strong asymmetry between and may be due to strongly imposed constraints in a variational formulation [40], highorder treatment of Neumann boundary conditions in finite elements [16] or finite differences [62], imposition of jump conditions in immersed/embedded boundary methods [58, 78], or a system of PDEs with mixed selfadjoint and nonselfadjoint subsystems (such as in a monolithically coupled fluidstructure interaction system [57]).
The second observation was that it is sometimes advantageous to treat a fully symmetric system as unsymmetric, especially for some indefinite systems. This claim seems to contradict the conventional wisdom that “it is rarely sensible to throw away symmetry in preconditioning” [106]. Indeed, some stateoftheart packages for symmetric indefinite systems always try to preserve symmetry and use some variant of the BunchKauffman pivoting [26, 47]; see e.g. [48, 63, 91, 92]. Such pivoting strategies require a mixture of and blocks, which not only complicate the implementation but also are sometimes less effective than unsymmetric ILU in avoiding small pivots. At the same time, the conventional wisdom is also sound in that leveraging symmetry can reduce the computational cost by nearly half for some problems, but may improve robustness, which we observe for some linear systems matrices from systems of PDEs. A novel feature of HILUCSI is to apply symmetric preprocessing at the top levels for nearly or partially symmetric matrices but unsymmetric factorization at lower levels for all indefinite systems, including symmetric indefinite systems. In addition, HILUCSI introduces a static deferring strategy for (nearly) symmetric saddlepoint problems, which eliminates the need for pivots for stability.
Due to the numerous applications of nearly or partially symmetric and indefinite systems, there have been significant interests in solving such systems in linear algebra, numerical PDE, and engineering communities. Many recent problems in the SuiteSparse Matrix Collections [33] and Matrix Market [17] belong to these classes, including some of the cases highlighted in [29, 67, 74, 86, 110]. To address these challenging problems, the PDE and multigrid communities have developed some customized preconditioners and solvers for specific equations; see [1, 2] for examples of special multigrid solvers and see [14] for a survey on special techniques for saddlepoint problems. In terms of generalpurpose solvers, a common practice is to treat such systems as general unsymmetric systems, such as in ARMS [88], ILU++ [73], supernodal ILUTP in SuperLU [67], etc. We note two exceptions. First, one may build a symmetric preconditioner for an unsymmetric , such as using in an algebraic preconditioner [32, 107] or using the selfadjoint terms in a physicsbased preconditioner [3]. However, the symmetric part may not approximate accurately for some applications, and factorization of may break down for partially symmetric systems, for example, if and in (1.1) as in the unsymmetric positivedefinite variant of symmetric saddlepoint problems [14]. Second, for a partially symmetric matrix where is symmetric but in (1.1), one may factorize using a symmetric ILU and then apply unsymmetric ILU on the Schur complement [14]. However, this approach breaks down if is singular, which can happen even if is wellconditioned [14]. HILUCSI differs from these earlier approaches in that it does not explicitly construct a symmetric approximation of , and it does not suffer from potential instabilities associated with factorizing a singular or .
Another novel feature of HILUCSI is its scalabilityoriented dropping together with modified inversebased dropping in the Crout version of multilevel ILU [22, 64]. Our scalabilityoriented dropping shares some similarity to the spacebased droppings (such as those in ILUT [82], ICMF [69], PARDISO 6 [92]) as well as areabased dropping in SuperLU [67]. Those space or areabased droppings traditionally focused on controlling space complexity. In contrast, the primary goal of our scalabilityoriented dropping is to achieve (near) lineartime complexity in the number of nonzeros in the input matrix. Although linear time complexity implies linear space complexity as a side product, the converse is not true in general. Also, using a pseudospectral analysis, we derive a modified inversebased dropping strategy, which is more robust than that of Bollhï¿œfer in [18, 21] and is more efficient than that of Bollhï¿œfer and Saad in [22]. Using a similar analysis, we propose a modification to the inversebased deferring of Bollhï¿œfer and Saad [22] to control not only the norms of and but also that of . We show that our new dropping strategies along with mixed symmetric and unsymmetric processing enabled superior robustness and efficiency for HILUCSI compared to ILUPACK [20] and SuperLU [67]. In addition, we show that the serial performance of HILUCSI with optimized parameters is competitive with the stateoftheart multithreaded direct solvers in MUMPS [6] and PARDISO [92] on 24 cores for large systems with more than one million unknowns.
The remainder of the paper is organized as follows. In Section 2, we review some background on incomplete LU factorization and its multilevel variants. In Section 3, we describe the algorithm components of HILUCSI for robustness. In Section 4, we address the efficiency issues with a focus on scalability with respect to problem sizes. In Section 5, we present numerical results with HILUCSI as a right preconditioner for restated GMRES and compare its performance with some stateoftheart packages. Finally, Section 6 concludes the paper with a discussion on future work.
2 Background and mathematical foundation
In this section, we review some incomplete LU preconditioners, especially multilevel ILU. Because there is a vast literature on preconditioning and ILU, we focus on only some of the most relevant techniques and mathematical analysis. For comprehensive reviews, we refer readers to some surveys [13, 27, 89, 106] and textbooks [85, 101].
2.1 Incomplete LDU factorizations
ILU, or more precisely incomplete LDU (or ILDU) factorization with pivoting, performs an approximate factorization
(2.1) 
where and are unit lower and upper triangular matrices, respectively, and and are row and column permutation matrices, which may be obtained from some static reordering, static or dynamic pivoting, or a combination of them. Let , and is a preconditioner of , or equivalently is a preconditioner of . We consider only right preconditioning in this work. In this context, when solving a linear system using an iterative method, especially a Krylov subspace method, one solves the rightpreconditioned linear system
(2.2) 
which ideally would converge much faster than solving the original linear system, and then .
2.1.1 Singlelevel ILU
We refer to the ILU with the basic form in (2.1) as a singlelevel ILU. Such a technique has been used to solve linear systems from PDEs since 1950s; see e.g. [104]. In 1977, Meijerink and van der Vorst [75] showed that incomplete Cholesky (IC) factorization is stable for a symmetric Mmatrix, i.e., a matrix with nonpositive offdiagonal entries and nonnegative entries in its inverse. Since then, IC has been extended to and become popular for SPD systems [49, 59, 69, 93]. However, ILU for unsymmetric or indefinite systems had turned out to be much more challenging, and it has been an active research topic over the past few decades; see e.g. [7, 19, 22, 29, 28, 72, 67, 82, 86].
In its simplest form, ILU does not involve any pivoting, and and preserve the sparsity patterns of the lower and upper triangular parts of , respectively. This approach is often referred to as ILU0 or ILU(0). To improve its robustness, one may allow fills, a.k.a. fillins, which are new nonzeros in the and factors. The fills may be introduced based on their levels in the elimination tree or based on the magnitude of numerical values. The former leads to the socalled ILU(), which zeros out all the fills of level or higher in the elimination tree. The combination of the two is known as ILU with dual thresholding (ILUT) [82]. The levelbased fills may be replaced with some other dropping to control the numbers of fills in each row or column. The ILU implementations in PETSc [9] and hypre [96] use some variants of ILUT with dual thresholding.
ILUT may encounter zero or tiny pivots, which can lead to a breakdown of the factorization. One may replace tiny pivots with a small value, but such a trick is not robust [29]. The robustness may be improved by using pivoting, leading to the socalled ILUP [81] and ILUTP [85]. The ILU implementations in MATLAB [97], SPARSKIT [83], and SuperLU [67], for example, are based on ILUTP. However, ILUTP cannot prevent small pivots [86], so it is still not robust in practice; see [67, 74, 86] and Section 5 in this work for some failed cases with ILUTP.
2.1.2 Multilevel ILU
A more sophisticated form of ILU takes advantage of the global block structure of similar to that in (1.1) to compute a block factorization of . Specifically, let and denote the row and column permutation matrices. A blockstructured preconditioner for the permuted matrix can be constructed via the approximation
(2.3) 
where , , , and is the Schur complement. If is further approximated recursively using such a block factorization, one obtains a multilevel ILU (MLILU) [10, 11, 84, 90, 109, 86, 22, 74], also known as multilevel block factorization [105]
. Given a block vector
, then can be computed as(2.4) 
There are several motivations to use multilevel ILU. One of them is to expose parallelism, such as in ILUM and BILUTM [84, 90]. The idea is to permute so that in (2.3) is a block diagonal matrix, of which each block can then be factorized independently. This approach was inspired by the domaindecomposition (such as additive Schwartz) methods [94], but ILUM factorizes the Schur complement recursively instead of resolving the overlapping regions of subdomains iteratively. Another motivation of multilevel ILU is to construct an algebraic analogy of multigrid methods [80, 108]; see e.g. [10, 11, 12]. However, the finecoarse grid relationship in multilevel ILU is quite different from the geometric [80, 25] and algebraic multigrid methods [54, 103, 102]. More recently, multilevel ILU has been used successfully in improving the robustness of ILU as a preconditioner; see e.g. [22, 86, 74, 109]. In this work, we strive to improve the robustness of multilevel ILU further along with its efficiency.
2.1.3 Criteria of ILU
When developing ILU methods, one must answer the following question: what is a good incomplete factorization? Intuitively, in (2.2) or in (2.3) should be as “close” to . However, more precise definitions are needed for analysis and derivations. In the following, we consider the robustness and efficiency issues separately. For simplicity, we will omit and in the discussions.
Criteria for stability and accuracy
For ILU to be a “robust” preconditioner, it should be accurate and stable. In [29], Chow and Saad consider these issues from an algorithmic point of view. Most notably, they emphasized the importance of avoiding small pivots. In [18], Bollhï¿œfer pointed out the importance of monitoring and
, of which the norms can be estimated incrementally as in
[55]. Based on this idea, Bollhï¿œfer and Saad developed a robust multilevel ILU approach [22], which dynamically defers the rows and columns that lead to large and .From an empirical point view, Benzi [13] measured the stability and accuracy of a preconditioner using and
, respectively. The Frobenius norm is used probably due to its ease of computation. These measures work well for SPD systems, but the correlation between a small
and a good preconditioner is weaker for general matrices. In this work, we use an alternative measure based on the spectral radius of .Definition 1.
A preconditioner is accurate for if for some ; the smaller the spectral radius, the more accurate the preconditioner.
This measure of the spectral radius is relevant to preconditioning because
for . Hence, the smaller
is, the better the eigenvalues of
are clustered, and the better the preconditioned KSP method may converge. Note that it is unnecessary for to be less than for the convergence of a KSP method. However, a small imposes constraints on stability. In addition, if , then can be used in place of a stationary iterative method as a smoother in multigrid methods [108], which can further accelerate convergence. Hence, we will use the approximate minimization of in our derivation.In the context of multilevel ILU, the accuracy and stability of the preconditioner is more complicated. For a twolevel incomplete factorization of an SPD system, Notay [76, Theorem 2] showed that the stability requires both the leading block and the Schur complement (i.e., and in (2.3)) to be wellconditioned. In addition, the Cauchy–Bunyakowski–Schwarz (CBS) constant [8, Chapter 9] should be (far) smaller than ; assuming in (2.3), this condition requires that
(2.5) 
These conditions are specific to SPD systems. As a heuristic for general matrices, we may argue that the leading blocks must be well conditioned at each level, and it is advantageous for the offdiagonal blocks to be as small as possible.
An unfortunate aspect of Notay’s analysis is that it demands well conditioned Schur complement, which is impractical for matrices from elliptic PDEs or secondorder systems of PDEs, because the condition numbers are as the mesh resolution tends to zero [41, 62]. Fortunately, in multilevel ILU, some preprocessing (such as equilibration [99, 34]) can be applied to the Schur complement, which can improve the condition number and in turn improve stability. In [22], Bollhï¿œfer and Saad emphasized the accuracy of the Schur complement, who suggested to use a formulation due to Tismenetsky [98], which unfortunately often leads to excessive fills. A simpler alternative is to tighten the dropping thresholds for the Schur complement, as used in [20, 74, 88, 109]. In this work, we use a more elaborate version of the latter strategy, which we refer to hierarchical dual thresholding.
Criteria for efficiency
Traditionally, the efficiency of ILU is equated to the “sparsity” of the and factors [29]. In this work, we give a more strict definition.
Definition 2.
A preconditioner is efficient if it can be constructed in (nearly) linear time in the number of nonzeros in the input matrix , and can be computed in (nearly) linear time for any .
For ILU, this definition requires not only the number of nonzeros in and , but also the computational time of the factorization, to be (approximately) proportional to the number of nonzeros in . For linear systems arising from finite differences or finite elements for PDEs, the number of nonzeros per row is typically bounded by a constant. For such systems, Definition 2 requires (near) linear time complexity of ILU in the number of unknowns. In addition, the efficiency requirement must be satisfied under the constraint of the accuracy of . This is admittedly a tall order because a direct sparse solver has superlinear time complexity for matrices from 2D and 3D PDEs [52]. To the best of our knowledge, none of the existing ILU technique achieved this efficiency goal under the accuracy constraint. Although ILU0 and a simple ILUT meet this criterion of efficiency, they do not satisfy the stability constraint in general. Note that our efficiency objective is similar to that of Hackbusch’s hierarchical matrices [51]. However, Hackbusch measures the accuracy by some norm of , but we measure it using , which is more appropriate for preconditioning [13].
2.2 Dropping strategies
For ILU, the dropping strategies are important for both robustness and efficiency. Most modern ILU techniques use some form of “dual thresholding” [82], which combines a dropping strategy based on numerical values along with another dropping strategy based on some combinatorial (or symbolic) properties, such as the level of the elimination tree or the number of fills in rows or columns. For the former, the dropping is controlled by a parameter known as drop tolerance (or in short, droptol or ). Different weighting schemes can be applied to numerical values. Let and denote the and factors of the ILU of the leading block of . In [18], Bollhï¿œfer proposed to use and as weights for the entries in the th column of and the th row of , respectively. Li et al. adopted it in [64] and referred to it as dropping based on condition number estimation. In [22, Theorem 6], Bollhï¿œfer and Saad proposed a much more strict dropping strategy, which weights the values with a userspecified threshold, which corresponds to an upper bound of . To distinguish the two variants, we refer to the former as variable inversebased dropping and the latter as fixed inversebased dropping. In [72], Mayer proposed to weigh the values in the th column of with some norm of the th row of and vice versa. In this work, we use a modified inversebased dropping, which we derive based on Definition 1. We also introduce a scalabilityoriented dropping to achieve near lineartime complexity.
We note two concepts closely related to dropping. The first is the Crout version of ILU (or ILUC) [64], also known as the leftlooking ILU for symmetric matrices [38]. Unlike the typically ordered Gaussian elimination, at the th step, the Crout ILU computes the th column of (i.e., ) by left looking, and analogously for the th row of (denoted by ). In this way, the numerical values in and are updated as late as possible, so it allows more accurate dropping compared to rightlooking algorithms. In addition, and can be estimated incrementally for the inversebased dropping in Crout ILU. The second concept is modified ILU (MILU) [85], introduced by Dupont et al. [37] and also by Gustafsson [50] for elliptic PDEs. MILU modifies the diagonal entries to compensate the discarded entries so that the sum of each row is preserved. This strategy improves the accuracy for certain applications [39], and it is adopted in SuperLU [67]. We do not use it in HILUCSI, because it was ineffective for general linear systems in our testing.
2.3 Pivoting versus deferring
It is well known that small pivots can make Gaussian elimination unstable [47], and similar for ILU [29]. Analogous to and factorizations, small pivots in ILU can be mitigated by using some variant of partial pivoting for unsymmetric matrices (such as column pivoting in ILUP and ILUTP [81, 85], row pivoting in SuperLU [67], and “dual pivoting” in [71]) and by using BunchKauffman pivots [26, 47] for symmetric indefinite matrices (such as in ILDUCBKP [63] and in ILUPACK [91]). The pivoting strategies may be dynamic in that they are determined at each step. However, unlike complete factorization, dynamic pivoting cannot guarantee the stability of the factorization, even for nonsingular systems. To overcome this issue, Saad [86] advocated static pivoting, which permutes the matrix in a preprocessing step. Saad’s work was motivated by that of Olschowka and Neumaier [77], who showed that any structurally nonsingular matrix can be permuted and scaled to obtain an Imatrix, whose diagonal entries have magnitude 1 and whose offdiagonal entries have a magnitude less than or equal to 1. To this end, Saad proposed a permutation strategy called ddPQ to achieve some weak diagonal dominance of a leading block, which he then used as the leading block in multilevel ILU without further permutation. However, the weak diagonal dominance from ddPQ cannot guarantee good pivots. In our testing, ddPQ is quite sensitive to the input matrix.
Instead of pivoting, Bollhöfer and Saad [22] proposed to defer the rows and columns to the next level dynamically if they lead to too large norms of or . Such a strategy naturally leads to a dynamic construction of multilevel structure. In this work, we use deferring to bound the norms of both and the inverse triangular factors, so that it would also achieve the same goal as pivoting in avoiding small pivots, in addition to controlling the triangular factors. Analogous to dynamic pivoting, we refer to these deferring strategies as dynamic deferring. An advantage of dynamic deferring versus dynamic pivoting is that the former allows using more efficient data structures, as will explain in Section 4. In addition, we introduce static deferring for (nearly) symmetric saddlepoint problems that have many zeros in the diagonal, as we will describe in Section 3. In Table 1, we compare different ILU methods in terms of pivoting and deferring.
2.4 Reordering and equilibration
Reordering is an effective preprocessing technique for complete and incomplete factorizations. Some commonly used reordering methods include reverse CuthillMckee (RCM) [70], approximate minimum degree (AMD) [5], nested dissection (ND) [45], etc. Among these techniques, RCM aims to reduce the bandwidth, and it is widely used for (nearly) SPD systems [15, 49]. In contrast, AMD aims to reduce fills, and it is effective in improving the quality of droppings, especially for general unsymmetric matrices. Both ILUPACK and SuperLU use some variants of AMD as the default reordering strategy.
While reordering is concerned with sparsity patterns, equilibration focuses on numerical properties. A simple equilibration technique is to scale rows and columns [47, 99]. A more sophisticated technique, known as Hungarian scaling in [56], combines scaling with column permutation of , so that is an Imatrix [77]. Motivated by the work of Olschowka and Neumaier, Duff and Koster [34, 35] developed similar strategies for sparse matrices, which were implemented in the MC64 routine of the HSL library [60]. The strategy was adopted by several packages, including ILUPACK, SuperLU, PARDISO, MUMPS [6], MATLAB R2019a [97], etc. An Imatrix can reduce the condition number; in addition, an Imatrix has a weak diagonal dominance like ddPQ [86]. MC64 was designed for unsymmetric matrices. A sophisticated equilibration for symmetric indefinite systems was developed by Duff and Pralet [36]. A simple symmetrization strategy was described used in the routine HSL_MC64 in the HSL library, which applies a postprocessing step after calling MC64, by setting and , so that preserves symmetry. We adopt the latter approach in this work.
3 Algorithmic Components for Robustness
To achieve robustness, we must take into account all the aspects reviewed in the previous section. At the same time, to keep the algorithm as simple as possible, we integrate only the essential components that have some mathematical or heuristic justifications, and we modify (and sometimes simplify) the techniques used in the stateoftheart ILU techniques. Through mathematical analysis and extensive numerical experimentation, we reach at the factorization procedure in HILUCSI as sketched in Figure 3.1. In the following, we focus on four of its algorithmic components.
3.1 Mixed symmetric and unsymmetric preprocessing
Like ILUPACK, HILUCSI leverages some preprocessing techniques, including reordering and equilibration, at each level of multilevel ILU. A novelty of HILUCSI is that it mixes the preprocessing techniques for symmetric and unsymmetric matrices in a hierarchical fashion. In particular, for nearly and partially symmetric matrices, we apply symmetric equilibration (similar to that in HSL_MC64 [60] as described in Section 2.4) and RCM reordering in the first level, and we apply unsymmetric equilibration and AMD reordering on at lower levels, where nzp denotes the nonzero pattern. In addition, after equilibration, we perform a postprocessing to defer zero (or tiny) diagonal entries to the next level, as we describe in Section 3.2; in this case, we apply symmetric equilibration and AMD reordering on the second level.
The mixed preprocessing requires some justification. First, we observe that equilibration is particularly important in multilevel ILU. Besides stabilizing the factorization of the Schur complement, after scaling, the norm of the leading block (i.e., ) is well bounded. Due to the CauchySchwarz inequality,
so bounding norms of , , and allows controlling , which Notay [76] showed to be important for SPD systems. In Section 3.2, we will give another simple argument to show that it is desirable to bound the norms of , and for unsymmetric matrices.
Second, we observe that the symmetric equilibration is often more effective than unsymmetric equilibration for nearly symmetric matrices arising from systems of PDEs, especially when used in conjunction with RCM. This behavior is probably because those matrices may have some block diagonal dominance as defined in [44], which may be destroyed by unsymmetric equilibration. Symmetric equilibration with RCM reordering permutes the dominant block diagonal within a narrower band, so that it may preserve block diagonal dominance more effectively. In addition, for SPD matrices, RCM tends to lead to smaller offdiagonal blocks and in turn a smaller CBS constant in (2.5), which also improves the quality of the preconditioner. In our experiments, the combination of symmetric equilibration with RCM reordering at the top level significantly improves both robustness and efficiency for nearly symmetric matrices. However, if the sparsity pattern is unsymmetric or highly irregular, unsymmetric equilibration and AMD reordering tend to be more robust; see Section 5.3 for some comparisons.
Note that unlike MC64, symmetric equilibration does not produce Imatrices. Indeed, any zeros in the diagonal of the input matrix would remain in the diagonal after symmetric equilibration. The zero or tiny diagonal entries require some special attention, which we address using static deferring.
3.2 Static and dynamic deferring
As shown in Table 1, HILUCSI differs from the other ILU techniques in that it avoids small pivots and illconditioned triangular factors using static and dynamic deferring. During static deferring, we symmetrically permute the zero and tiny diagonal entries to the lowerright corner of the matrix, to be factorized in the next level. During dynamic deferring, we also symmetrically permute a row and column to the lowerright corner if the diagonal is smaller than a threshold () or the estimated norms and exceed some threshold ( and ). In practice, we use the threshold for , , and , and a value of 3 or 5 is very robust.
Our strategy requires some justification. First, unlike the Bollhï¿œferSaad dynamic deferring [22], our dynamic deferring considers not only and but also and . Bounding norms of , , and can be justified from Definition 1, in that
Hence, given an “accurate” preconditioner , it is important that is bounded, which can be achieved computationally by bounding norms of , , and . In contrast, bounding and alone, as done in [22], is insufficient. We attribute to this difference as the key factor in improved robustness of HILUCSI compared to ILUPACK.
Second, note that dynamic deferring can effectively resolve zero or tiny pivots in most cases. However, our experiments show that it is advantageous to perform a static deferring, because zero or tiny pivots tend to lead to fast growth of and , which tends to lead to poorerquality preconditioners. The static deferring also eliminates the need of BunchKaufman pivoting, which not only simplifies the implementation but also improves robustness and efficiency in our comparison, especially for (nearly) symmetric saddlepoint problems. Finally, note that in the extreme case, a matrix may have all zero diagonals, which does not introduce complications because HILUCSI switches to unsymmetric equilibration in the next level.
3.3 Hierarchical dual thresholding
HILUCSI uses a combination of numerical and combinatorial dropping strategies, analogous to other “dual thresholding.”
3.3.1 Modified inversebased dropping
For numerical dropping, we use a modified inversebased thresholding. In particular, unlike the variable and fixed inversebased thresholding in [18] and [22], we use and as weights for dropping in the th column in and and th row in , respectively, where is a usespecified upper bound of during dynamic deferring.
We motivate the modified inversebased thresholding based on Definition 1. Let , where , and , where , , and denote the perturbations to , , and , respectively. Hence,
where we omit the higherorder terms that involve more than one matrix. Hence,
(3.1) 
Note that if we omit the factor and we bound and using their norm and norm, respectively, we would obtain the variable inversebased dropping in [18] and [64]; we refer readers to these references for the details on estimating and . However, the dropping strategy in [18] and [64] omitted the conditioning of the diagonal matrix, because their analysis was not based on . In dynamic deferring, we restrict the magnitude of the diagonal entries to be no smaller than , and we estimate the maximum magnitudes to be approximately . Hence, can be bounded approximately by , which leads to our modified inversebased dropping. Like the original inversebased dropping, the modified version can be easily incorporated into Crout ILU.
3.3.2 Scalabilityoriented dropping
For combinatorial dropping, we introduce a scalabilityoriented dropping. The basic idea is to limit the number of nonzeros (nnz) in the th column of , namely , by a factor of the input matrix, i.e.,
(3.2) 
where denotes the average number of nonzeros in the columns of and is introduced to avoid excessive dropping for small columns in a highly nonuniform matrix. We limit the nnz in the rows of in a symmetric fashion. In the multilevel setting, we limit the rows and columns in all the levels based on the original input matrix, instead of based on the Schur complement from the previous level. This strategy is important for the scalability analysis. Furthermore, before computing the Schur complement, we apply dropping to limit the nonzeros in each column in based on the righthand side of (3.2), and similarly for the rows in . In Section 4, we will show that with this strategy, HILUCSI achieves near lineartime complexity.
3.3.3 Hierarchical dual thresholding
In HILUCSI, the numerical dropping is controlled by and , and the combinatorial dropping is controlled by . In practice, we found that the combination , , and is very robust for almost all the systems we have tested, while , , and work well for most problems from systems of PDEs. In addition, we observe that the accuracy of the factorization of the second level is often the most critical because it corresponds to factorizing the largest Schur complement. For this reason, we refine the thresholds for level 2 by reducing by a factor of 10, reducing by up to a factor of 2 (while restricting ), and doubling . For lower levels, however, we revert back for efficiency but use the refined and for stability and accuracy.
4 Achieving efficiency and scalability
In this section, we focus on the efficiency issues of HILUCSI for largescale linear systems.
4.1 Linear time complexity
We show that HILUCSI meets the efficiency requirement in Definition 2, especially for those arising from systems of PDEs.
Proposition 3.
If the input matrix has a constant number of nonzeros per row and per column, excluding the preprocessing step, the total cost of HILUCSI on each level is linear in .
We outline the proof for the first level. Given a matrix , let , , , and be the output of the factorization of the current level. Omitting dynamic deferring and thresholding, it is easy to show that the number of floating point operations in Crout ILU is bounded by
(4.1) 
where and denote the th row and th column of , respectively. Given an efficient data structure, the floatingpoint operations in dynamic deferring is proportional to Crout ILU. Furthermore, in the scalabilityoriented dropping, we use quick select, which has an expected linear time complexity, to find the largest nonzeros, followed by quick sort after dropping. Hence, the asymptotic complexity of dropping is lower than that of Crout update. Assuming there is a constant number of nonzeros in each row and column,
and . Hence, Crout ILU with deferring takes linear time. Finally, by ensuring the nonzeros in each row of and in each column of are bounded by a constant, the Schur component can be computed in linear time. The above analysis also applies to the other levels. Finally, HILUCSI uses a dense factorization in the last level if its size is , so its cost is also linear in .
Note that the overall time complexity of HILUCSI may be slightly superlinear because the worstcase complexities of some preprocessing components (including MC64 and AMD [53, 35]) are superlinear. However, MC64 has an expected lineartime complexity under our assumptions [35], and the cost of AMD is typically negligible. In addition, the number of levels may not be a constant in the worst case, but it is typically small. Hence, by design HILUCSI is expected to deliver nearly lineartime complexity. In Section 5, we will show that HILUCSI indeed scales better than both ILUPACK and SuperLU, and its near linear complexity makes its performance competitive with highly optimized direct solvers for large systems.
4.2 Implementation details
To realize the near lineartime complexity in Proposition 3, we need an efficient data structure for sparse matrices. In particular, the data structure must support efficient sequential access of the th row of along with all the rows in , and similarly for and , as required by the th step of the Crout ILU. In addition, it must support efficient static and dynamic deferring. The standard storage formats, such as CSR (Compressed Sparse Row), CSC (Compressed Sparse Column), or AIJ, are insufficient. We use a biindex data structure, similar to that proposed by Jones and Plassmann [59] for the rowversion of incomplete Cholesky factorization and that used in Crout version of ILU without pivoting for unsymmetric matrices by Li, Saad, and Chow in [29, 64].
For simplicity, let us first consider the data structure in [64] for without deferring. Because is constructed row by row, we store it using the CSR format, in which the column indices for each row are sorted in ascending order. We then augment it using two size arrays: Ufirst and Ulist, where the former stores the index of the first nonzero in each row of in CSR, and the latter maintains the linked lists of the entries in Ufirst by columns. At the th step, all the nonzero entries in the th column of must be in Ufirst. Hence, we can access a complete linked list of the entries in the th columns in through Ulist. The same holds for , except that we use CSC as the base for storing .
The data structure in [64] does not support deferring or pivoting. We extend it to support deferring. For simplicity, we describe the procedure for as follows. At the th step, suppose there have been deferrals and the current column in is going to be the th deferral. We move the current column in to the th position. By induction, there will be a gap of for the column indices in . Hence, before processing the th column, we move column to the th position in , which eliminates the gap in a “justintime” fashion for . For the indices in , we eliminate the gap at the end of the Crout ILU for the current level. The processing of is symmetric to that of . To implement this efficiently, we enlarge some size arrays (e.g., the permutation vectors, Llist, and Ulist) to size . This memory overhead is negligible, and it enables us to preserve the amortized constant time per nonzero. In contrast, for Crout ILU with dynamic pivoting, such as those in [63] and [74], the data structure needs to support efficient sequential access to all the rows and columns in , and similarly for . That requirement would double the storage and also incur significantly more data movement. Note that ILUPACK [20] also extended the data structures in [64] to support deferring, but we were unable to find its implementation details for comparison.
5 Numerical results
We have implemented HILUCSI using the C++11 standard. In this section, we assess the robustness and efficiency of our implementation for some challenging benchmark problems and compare its performance against some stateoftheart packages. In particular, we chose ILUPACK v2.4 [20, 23] as the representative of multilevel ILU, partially because HILUCSI is based on the same Croutversion of multilevel ILU as in ILUPACK, and more importantly, ILUPACK has been optimized for both unsymmetric and symmetric matrices. In comparison with other packages, our tests showed that ILUPACK outperformed ARMS in ITSOL v2 [88] by up to an order of magnitude for larger unsymmetric systems, and the gap is even larger for symmetric systems; ILUPACK is significantly more robust than ILU++ [73, 74]. In addition, we chose the supernodal ILUTP in SuperLU v5.2.1 [65, 67] as a representative of singlelevel ILU, because ofits efficient supernodal implementation. In all of our tests, we used rightpreconditioning for restarted GMRES(30), which limits the dimension of the Krylov subspace to 30. We used for the convergence criteria of GMRES and limited the number of iterations to . For HILUCSI and ILUPACK, we used our own implementation of flexible GMRES; for SuperLU, we used the GMRES implementation in the latest PETSc v3.11.3. Furthermore, we also compare HILUCSI against the multithreaded direct solvers in MUMPS 5.2.0 [6], MKL PARDISO v2018/3.222, and PARDISO 6.0 [79, 92].
We conducted our tests on a single node of a cluster running CentOS 7.4 with two 2.5 GHz Intel Xeon CPU E52680v3 processors and 64 GB of RAM. We compiled HILUCSI, SuperLU and PETSc all using GCC 4.8.5 with the optimization option O3, and we used the binary release of ILUPACK v2.4 for GNU64. We accessed ILUPACK through its MATLAB mex functions, of which the overhead is negligible. For accurate timing, both turbo and power saving modes were turned off for the processors.
5.1 Robustness as “blackbox” preconditioners
We have tested HILUCSI using more than 60 challenging, largerscale benchmark problems that were highlighted in the ILU literature. These matrices were mostly from the SuiteSparse Matrix Collections [33] and the Matrix Market [17], representing a wide range of applications. In our tests, HILUCSI succeeded for all the nonsingular real linear systems and most singular systems with a lowdimensional null space and a meaningful righthand side.^{1}^{1}1HILUCSI failed for the system invextr1_new used in [110], where all their tested methods failed it as well. The problem is challenging because it has a large nullspace dimension of 2,910. We present results on select benchmark problems from some highlighted problems in [22], [67], and [110], together with two larger unsymmetric systems for NavierStokes (NS) equations. Table 2 summarizes these unsymmetric matrices, including their application areas, types, sizes, and estimated condition numbers. Note that PR02R is a singular system with a meaningful righthand side.
In addition, we generated two sets of symmetric saddlepoint problems using FEniCS v2017.1.0 [4] by discretizing the 3D Stokes equation and the mixed formulation of the Poisson equation. These equations have a wide range of applications in CFD, solid mechanics, heat transfer, etc. We discretized the former using Taylor–Hood elements [95], and discretized the latter using a mixture of linear BrezziDouglasMarini (BDM) elements [24] and degree0 discontinuous Galerkin elements [31]. These problems are challenging in that the matrices have some implicit and nonuniform block structures, and they have many zeros in the diagonals. To facilitate scalability study, for each set, we generated three linear systems using meshes of different resolutions. Note that the matrices generated by FEniCS do not enforce symmetry exactly and contain some nearly zero values due to rounding errors. We manually filtered out the small values that are close to machine precision and then enforced symmetry using .
We first assess the robustness of HILUCSI versus ILUPACK and SuperLU for GMRES(30). In this test, we treat ILU as nearly a “blackbox” preconditioner. We used a fixed droptol for all the codes, as in [67], and used the recommended defaults for the other parameters for most problems. In particular, for ILUPACK, we used MC64 matching, AMD ordering, and condest (i.e., ) 5. For SuperLU, when using its default options, we only solved four problems. We doubled its fill factor from 10 to 20, which allowed SuperLU to solve another five problems. For HILUCSI, we used our default values and for most cases, except that we used for twotone, which is not a PDEbased problem.
In Table 2, we report the overall runtimes (including both factorization and solve times) for each code, and the fastest runtime for each case is highlighted in bold. HILUCSI had a 100% success rate for these problems, and it was the fastest for 70% of the cases. In contrast, ILUPACK solved 85% of cases and it was the fastest for 5% of the cases. Among the successful cases, ILUPACK ran out of the 64GB of main memory for RM07R and it used about 30GB of swap space, so the timing result was unreliable. For the symmetric problems, ILUPACK automatically detects symmetric matrices and then applies ILDL factorization with mixed and pivots automatically. This optimization in ILUPACK benefitted its timing for those problems but hurt its robustness for the two larger systems from Stokes equations, which we could solve only by explicitly forcing ILUPACK to use unsymmetric ILU. In addition, ILUPACK was unable to solve PR02R, regardless of how we tuned condest. SuperLU was the least robust among the three: it solved only 45% of cases^{2}^{2}2In [67], SuperLU had a higher success rate using GMRES(50) and unlimited fill factor, but we used GMRES(30) (the default in PETSc [9]) and a fill factor 10 (the default in SuperLU) or 20. and it was the fastest for 25% of cases. Note that for the largest system solved by all the codes, namely atmosmodl, HILUCSI outperformed ILUPACK and SuperLU by a factor of 6 and 9, respectively. On the other hand, for a mediumsized problem, namely e40r5000, SuperLU outperformed HILUCSI and ILUPACK by a factor of 7.5 and 15, respectively. This result shows that SuperLU is efficient for its excellent cache performance, but its ILUTP is fragile compared to multilevel ILU in ILUPACK and HILUCSI. Overall, HILUCSI delivered the best robustness and efficiency.
Matrix  Appl.  n  nnz  Cond.  total time (sec.)  
H  I  S  
(nearly) patternsymmetric, indefinite systems from PDEs  
venkat01  2D Euler  62,424  1,717,792  2.5e6  6.81  8.40  5.91 
rma10  3D CFD  46,835  2,374,001  4.4e10  10.6  31.7  4.73 
mixtank_new  29,957  1,995,041  2.2e11  40.2  128    
Goodwin_071  3D  56,021  1,797,934  1.4e7  20.8  15.5  16.3 
Goodwin_095  Navier  100,037  3,226,066  3.4e7  39.9  42.6  21.1* 
Goodwin_127  Stokes  178,437  5,778,545  8.2e7  75.1  95.1  65.0* 
RM07R  (NS)  381,689  37,464,962  2.2e11  3.3e3  3e4*  
PR02R  2D NS  161,070  8,185,136  1.1e21  261  
e40r5000  17,281  553,956  2.5e10  18.7  36.8  2.4*  
xeono2  materials  157,464  3,866,688  1.4e5  44.8  198  
atmosmodl  Helmholtz  1,489,752  10,319,760  1.5e3  85.1  502  718 
other unsymmetric systems  
onetone1  circuit  36,057  335,552  9.4e6  0.44  1.12  
twotone  simulation  120,750  1,224,224  4.5e9  7.71*  18.5  
bbmat  2D airfoil  38,744  1,771,722  5.4e8  31.9  36.5  59.0* 
symmetric, saddlepoint problems from PDEs  
S3D1  3D Stokes  18,037  434,673  1.2e7  1.63  6.56  9.90* 
S3D2  121,164  3,821,793  8.9e7  80.7  
S3D3  853,376  31,067,368  6.3e8  781  
M3D1  3D  29,404  522,024  1.7e5  6.66  8.31  
M3D2  mixed  211,948  4,109,496  2.3e6  67.1  127  
M3D3  Poisson  1,565,908  31,826,184  3.8e7  599  1.3e3  
success rate  100%  85%  45%  
leader rate  70%  5%  25% 
5.2 Efficiency with optimized parameters
The default parameters in the preceding comparison are robust for general problems, but they may be too strict for linear systems from specific classes of PDEs. We now compare the software for some saddlepoint problems arising from PDEs, including some nearly symmetric indefinite systems and symmetric saddlepoint problems.
5.2.1 Nearly symmetric, indefinite systems
For nearly symmetric matrices, we use six PDEbased problems in Table 2, which are from different types of equations in CFD, including 2D Euler, 3D NavierStokes equations, and Helmholtz equations. Table 3 shows the comparison of HILUCSI, ILUPACK, and SuperLU for these systems in terms of the factorization times, total times, GMRES iterations, and nonzero rations. We highlighted the fastest runtimes in bold. For a fair comparison, we used for all the codes, used for both HILUCSI and ILUPACK, and used for HILUCSI. It can be seen that HILUCSI was the fastest for all the cases in terms of both factorization and total times. HILUCSI required more GMRES iterations than ILUPACK, while SuperLU required significantly more iterations for the largest systems. In addition, we note that HILUCSI could solve all the problems with , which would further improve the factorization time but at the cost of more GMRES iterations for some systems.
Matrix  factor. time (sec.)  total time (sec.)  GMRES iters.  nnz ratio  

H  I  S  H  I  S  H  I  S  H  I  S  
venkat01  1.11  3.50  2.64  1.25  3.62  2.94  7  5  7  3.0  2.8  2.4 
rma10  2.31  11.3  2.93  2.47  11.4  3.43  9  4  9  2.0  3.8  2.8 
atmosmodl  10.5  33.8  686  19.8  41.0  748  33  22  75  2.9  4.0  6.2 
Goodwin_071  4.01  5.40  4.12  4.99  5.63  7.95  41  12  52  4.8  3.5  5.0 
Goodwin_095  7.36  11.8  9.74  10.7  12.3  21.9  78  14  84  4.8  3.9  5.7 
Goodwin_127  13.3  32.1  22.5  17.7  33.3  146  56  16  449  4.8  5.0  6.1 
Figure 5.1 shows the relative speedups of HILUCSI and SuperLU versus ILUPACK in terms of factorization and solve times. It can be seen that HILUCSI outperformed ILUPACK for all six cases by a factor between 1.1 and 4.9. For the Goodwin problems, it is clear that the relative speedup increased as the problem sizes increased, thanks to the near lineartime complexity of HILUCSI as shown in Proposition 3. We note that ILUPACK has a parameter elbow for controlling the size of reserved memory, but the parameter made no difference in our testing. ILUPACK also has another parameter lfil for spacebased dropping, but its use is discouraged in its documentation, so we did not use it by default. Our tests showed that using a small lfil in ILUPACK decreases its robustness, while its time complexity still remained higher than HILUCSI.
In addition, we observe that although SuperLU outperformed ILUPACK in terms of factorization times for all the Goodwin problems, it underperformed in terms of the overall times for these problems, because its solve time was significantly larger due to too many GMRES iterations. This again shows the superior robustness of multilevel ILU in HILUCSI and ILUPACK versus the ILUTP in SuperLU.
5.2.2 Symmetric saddlepoint problems
We now assess the robustness and efficiency of HILUCSI as the problem sizes increase. To this end, we use the symmetric saddlepoint problems and compare HILUCSI with two different solvers in ILUPACK for symmetric and unsymmetric matrices, respectively. Because SuperLU v5.2.1 failed for most of these problems, we do not include it in this comparison. For these saddlepoint problems, because there are static deferring, our algorithm enabled symmetric matching in HILUCSI on the first two levels, and we applied RCM ordering for the first level and applied AMD ordering for all the other levels. For ILUPACK, we used AMD ordering, as recommended by ILUPACK’s documentation.
Table 4 shows the comparison of HILUCSI with ILUPACK in terms of the numbers of GMRES iterations, the nonzero ratios, and the numbers of levels, along with the runtimes of HILUCSI. Figure 5.2 shows the relative speedups of HILUCSI and symmetric ILUPACK relative to the unsymmetric ILUPACK. It can be seen that HILUCSI outperformed the unsymmetric ILUPACK by a factor of four to ten for these problems. The improvement was mostly due to the improved dropping in HILUCSI; in addition, HILUCSI also had fewer levels than unsymmetric ILUPACK. In contrast, the symmetric ILUPACK failed for the two larger systems for the Stokes equations due to encountering a structurally singular system during preprocessing. For the two larger cases for the mixed formulation of the Poisson equation, symmetric ILUPACK was notably less robust and required many more GMRES iterations. Among the four solved problems, symmetric ILUPACK improved the runtimes of unsymmetric ILUPACK by a factor of 1.2 to 2.6, due to performing computations only on the lower triangular part and different dropping strategies. Finally, note that the improvement from symmetric ILUPACK cannot be achieved if we did not explicit symmetrize the matrices or force ILUPACK to use the symmetric solver. In contrast, since HILUCSI treats the pattern symmetric matrices as nearly symmetric, it delivers more consistent performance even if the matrix is slightly unsymmetric due to rounding or truncation errors. Note that the timing results in Table 4 for HILUCSI did not take advantage of numerical symmetry. Using a symmetric kernel in the first two levels would further improve its performance by 10–20%. Hence, the penalty of losing numerical symmetric is relatively small when using HILUCSI. In addition, we observe that in Figure 5.2, the relative speedup of HILUCSI versus ILUPACK improves significantly as the size of the problem grows, similar to what we observed for unsymmetric systems.
Matrix  HILUCSI  GMRES iters.  nnz ratio  #levels  

factor.  total  H  IU  IS  H  IU  IS  H  IU  IS  
S3D1  0.44  0.45  4  3  7  2.0  4.6  6.4  3  6  5 
S3D2  5.56  5.83  7  3  2.5  6.3  3  6  
S3D3  61.7  64.1  7  4  2.7  8.4  4  9  
M3D1  0.69  0.78  14  6  11  2.7  7.9  5.8  4  8  5 
M3D2  6.25  7.75  26  11  29  2.6  9.5  7.3  5  10  5 
M3D3  52.9  76.8  53  24  62  2.6  10  7.2  6  15  5 
5.3 Benefits of mixed processing
To assess the effectiveness of mixing symmetric and unsymmetric preprocessing for HILUCSI, we applied symmetric preprocessing on zero, one, and two levels. Table 5 shows a comparison of the factorization times, total times, GMRES iterations, and nnz ratios for three different classes of problems. It can be seen that for matrices with fully unsymmetric structures, using symmetric preprocessing did not improve robustness and decreased efficiency. However, for many unsymmetric matrices with nearly symmetric structures, using symmetric preprocessing on the first level significantly improved robustness and efficiency. Furthermore, when static deferring is invoked in (nearly) symmetric saddlepoint problems, using two levels of symmetric preprocessing further reduced the factorizations times, but the total runtime remained about the same.
Matrix  factor. time  total time  GMRES iters.  nnz ratio  
H0  H1  H2  H0  H1  H2  H0  H1  H2  H0  H1  H2  
general unsymmetric systems  
bbmat  31.4  45.5  55.2  31.9  46.3  55.9  9  11  9  17  25  32 
(nearly) pattern symmetric systems  
rma10  4.85  2.31  2.53  5.02  2.47  2.69  67  9  9  3.4  2.0  2.3 
PR02R  256  293  261  300  14  15  28  32  
symmetric, saddlepoint problems  
M3D3  53.6  52.9  77.3  76.8  52  53  2.6  2.6  
M3D2  8.06  6.37  6.25  16.1  7.69  7.75  120  23  26  4.0  2.6  2.6 
5.4 Comparison with multithreaded solvers
Our comparisons above showed that HILUCSI improved the robustness and efficiency compared to the stateoftheart ILU preconditioners for iterative solvers. In recent years, significant research has been devoted to developing parallel solvers, such as PETSc [9], hypre [43], SuperLU_MT/SuperLU_Dist [65], pARMS [68], MUMPS [6], PARDISO [79, 92], finegrained ILU [28], etc. Since most computers nowadays have multiple cores, it is therefore a natural question to ask how well HILUCSI performs compared to parallel solvers when using all the cores on a desktop in a way “transparent” to the enduser. Hence, we compare our serial HILUCSI against three stateoftheart multithreaded solvers, including MUMPS 5.2.0, MKL PARDISO v2018/3.222, and PARDISO 6.0. Although these are all branded as direct solvers, MKL PARDISO and PARDISO trade stability for performance by pivoting within supernode only and replacing tiny pivots by a small epsilon, similar to that in [66]. In addition, PARDISO 6 has an implementation of multilevel ILU for symmetric indefinite systems based on ILUPACK [91], but its overall performance was worse than the direct solver in PARDISO 6, so we only report the comparison with the direct solvers.
First, we compare the robustness. MUMPS solved all the problems in Table 2. In contrast, both versions of PARDISO failed for RM07R and PR02R with the default settings, which have iterative refinements enabled, resulting in a success rate of 90%, which is comparable to ILUPACK but worse than HILUCSI. The failures were due to encountering too many tiny or zero pivots within supernodes, which “rendered the factorization welldefined but essentially useless” [92] for these linear systems.
Next, we assess the performance for larger systems. We focus on the four largest systems in Table 2. Figure 5.3 shows the relative speed of HILUCSI (including its timing for RM07R in Table 2 and its timing for the others in Tables 3 and 4) versus those three packages. For the three larger problems, namely atmosmodl, S3D3, and M3D3, which had approximately one million unknowns, HILUCSI outperformed MUMPS and both versions of PARDISO on a single core. In addition, the serial HILUCSI is competitive with the parallel performance of MUMPS and PARDISO on 24 cores. However, for RM07R, which is relatively small, MUMPS outperformed HILUCSI significantly both in serial and in parallel. Overall, HILUCSI was highly competitive for large systems with more than one million unknowns when using optimized parameters.
6 Conclusions and Future Work
In this paper, we described a multilevel incomplete LU factorization technique, called HILUCSI. HILUCSI is sophisticated in that it mixes the equilibration and reordering techniques for symmetric and unsymmetric matrices, and it also introduces static pivoting, modified inversebased dynamic deferring, and hierarchical dual thresholding. Nevertheless, compared to some of the advanced ILU packages, such as ILUPACK and SuperLU, HILUCSI is relatively simple in that it does not require partial or BunchKaufman pivoting for stability. We showed that HILUCSI achieved robustness and efficiency for a wide range of linear systems, especially for problems arising from systems of PDEs. Those systems are often challenging when using only symmetric or unsymmetric techniques alone. We demonstrated the robustness of HILUCSI as a rightpreconditioner of restarted GMRES for symmetric and unsymmetric saddlepoint problems from mixed Poisson, Stokes, and NavierStokes equations as well as some nonPDE problems. Our results showed that HILUCSI outperforms ILUPACK by a factor of four to ten for medium to large problems. Because HILUCSI scales better as the number of unknowns increases, its performance advantage would become even wider for larger problems.
In its current form, HILUCSI has several limitations. First, if the memory is very limited, there may be too many scalabilityoriented droppings and the preconditioner may lose robustness. We plan to optimize HILUCSI further for limitedmemory situations. Second, for vectorvalued PDEs, the matrices may exhibit block structures. It is worthwhile to explore such block structures to improve cache performance, similar to that in [67] and [49]. Finally, HILUCSI is presently sequential. Although its serial performance is competitive with the parallel performance of MUMPS on 24 cores for large systems, it is desirable to speed up HILUCSI via parallelization.
Acknowledgments
Results were obtained using the Seawulf and LIRED computer systems at the Institute for Advanced Computational Science of Stony Brook University, which were partially funded by the Empire State Development grant NYS #28451. We thank Dr. Matthias Bollhï¿œfer for sharing ILUPACK with us, and we thank the anonymous reviewers for their many valuable comments and suggestions on the earlier version of this paper, which have helped improve this work.
References
 [1] M. F. Adams. Algebraic multigrid methods for constrained linear systems with applications to contact problems in solid mechanics. Numer. Linear Algebra Appl., 11(23):141–153, 2004.
 [2] J. H. Adler, T. R. Benson, and S. P. MacLachlan. Preconditioning a massconserving discontinuous galerkin discretization of the stokes equations. Numer. Linear Algebra Appl., 24(3):e2047, 2017.
 [3] B. Aksoylu and H. Klie. A family of physicsbased preconditioners for solving elliptic equations on highly heterogeneous media. Appl. Numer. Math., 59(6):1159–1186, 2009.
 [4] M. Alnæs, J. Blechta, J. Hake, A. Johansson, B. Kehlet, A. Logg, C. Richardson, J. Ring, M. E. Rognes, and G. N. Wells. The FEniCS project version 1.5. Arc. Num. Softw., 3(100):9–23, 2015.
 [5] P. R. Amestoy, T. A. Davis, and I. S. Duff. An approximate minimum degree ordering algorithm. SIAM J. Matrix Anal. Appl., 17(4):886–905, 1996.
 [6] P. R. Amestoy, I. S. Duff, J.Y. L’Excellent, and J. Koster. MUMPS: a general purpose distributed memory sparse solver. In International Workshop on Applied Parallel Computing, pages 121–130. Springer, 2000.
 [7] H. Anzt, E. Chow, and J. Dongarra. ParILUT—a new parallel threshold ILU factorization. SIAM J. Sci. Comput., 40(4):C503–C519, 2018.
 [8] O. Axelsson. Iterative Solution Methods. Cambridge Press, 1994.
 [9] S. Balay, S. Abhyankar, M. F. Adams, J. Brown, P. Brune, K. Buschelman, L. Dalcin, V. Eijkhout, W. D. Gropp, D. Kaushik, M. G. Knepley, L. C. McInnes, K. Rupp, B. F. Smith, S. Zampini, and H. Zhang. PETSc Users Manual. Technical Report ANL95/11  Revision 3.7, Argonne National Laboratory, 2016.
 [10] R. E. Bank and R. K. Smith. The incomplete factorization multigraph algorithm. SIAM J. Sci. Comput., 20(4):1349–1364, 1999.
 [11] R. E. Bank and C. Wagner. Multilevel ILU decomposition. Numer. Math., 82(4):543–576, 1999.
 [12] R. E. Bank and J. Xu. The hierarchical basis multigrid method and incomplete LU decomposition. Contemporary Mathematics, 180:163–173, 1994.
 [13] M. Benzi. Preconditioning techniques for large linear systems: a survey. J. Comput. Phys., 182(2):418–477, 2002.
 [14] M. Benzi, G. H. Golub, and J. Liesen. Numerical solution of saddle point problems. Acta Numerica, 14:1–137, 4 2005.
 [15] M. Benzi, D. B. Szyld, and A. Van Duin. Orderings for incomplete factorization preconditioning of nonsymmetric problems. SIAM J. Sci. Comput., 20(5):1652–1670, 1999.
 [16] P. Bochev, J. Cheung, M. Perego, and M. Gunzburger. Optimally accurate higherorder finite element methods on polytopial approximations of domains with smooth boundaries. arXiv preprint arXiv:1710.05628, 2017.
 [17] R. F. Boisvert, R. Pozo, K. Remington, R. F. Barrett, and J. J. Dongarra. Matrix Market: a web resource for test matrix collections. In Qual. Numer. Softw., pages 125–137. Springer, 1997.
 [18] M. Bollhöfer. A robust ILU with pivoting based on monitoring the growth of the inverse factors. Linear Algebra Appl., 338(13):201–218, 2001.
 [19] M. Bollhöfer. A robust and efficient ILU that incorporates the growth of the inverse triangular factors. SIAM J. Sci. Comput., 25(1):86–103, 2003.
 [20] M. Bollhöfer, J. I. Aliaga, A. F. Martın, and E. S. QuintanaOrtí. ILUPACK. In Encyclopedia of Parallel Computing. Springer, 2011.
 [21] M. Bollhöfer and Y. Saad. On the relations between ILUs and factored approximate inverses. SIAM J. Matrix Anal. Appl., 24(1):219–237, 2002.
 [22] M. Bollhöfer and Y. Saad. Multilevel preconditioners constructed from inversebased ILUs. SIAM J. Sci. Comput., 27(5):1627–1650, 2006.
 [23] M. Bollhöfer and Y. Saad. ILUPACK preconditioning software package. Available online at http://ilupack.tubs.de/. Release V2.4, June., 2011.
 [24] F. Brezzi, J. Douglas, and L. D. Marini. Two families of mixed finite elements for second order elliptic problems. Numer. Math., 47(2):217–235, 1985.
 [25] W. L. Briggs, V. E. Henson, and S. F. McCormick. A Multigrid Tutorial. SIAM, second edition edition, 2000.
 [26] J. R. Bunch and L. Kaufman. Some stable methods for calculating inertia and solving symmetric linear systems. Math. Comput., 31(137):163–179, 1977.
 [27] T. F. Chan and H. A. Van Der Vorst. Approximate and incomplete factorizations. In Parallel Numerical Algorithms, pages 167–202. Springer, 1997.
 [28] E. Chow and A. Patel. Finegrained parallel incomplete LU factorization. SIAM J. Sci. Comput., 37(2):C169–C193, 2015.
 [29] E. Chow and Y. Saad. Experimental study of ILU preconditioners for indefinite matrices. J. Comput. Appl. Math., 86(2):387–414, 1997.
 [30] P. G. Ciarlet. The Finite Element Method for Elliptic Problems. SIAM, 2002.
 [31] B. Cockburn, G. E. Karniadakis, and C.W. Shu. The Development of Discontinuous Galerkin Methods. Springer, Berlin Heidelberg, 2000.
 [32] P. Concus and G. H. Golub. A generalized conjugate gradient method for nonsymmetric systems of linear equations. In Computing Methods in Applied Sciences and Engineering, pages 56–65. Springer, 1976.
 [33] T. A. Davis and Y. Hu. The University of Florida sparse matrix collection. ACM Trans. Math. Softw., 38(1):1, 2011.
 [34] I. S. Duff and J. Koster. The design and use of algorithms for permuting large entries to the diagonal of sparse matrices. SIAM J. Matrix Anal. Appl., 20(4):889–901, 1999.
 [35] I. S. Duff and J. Koster. On algorithms for permuting large entries to the diagonal of a sparse matrix. SIAM J. Matrix Anal. Appl., 22(4):973–996, 2001.
 [36] I. S. Duff and S. Pralet. Strategies for scaling and pivoting for sparse symmetric indefinite problems. SIAM J. Matrix Ana. Appl., 27(2):313–340, 2005.
 [37] T. Dupont, R. P. Kendall, and H. Rachford, Jr. An approximate factorization procedure for solving selfadjoint elliptic difference equations. SIAM J. Num. Anal., 5(3):559–573, 1968.
 [38] S. C. Eisenstat, M. H. Schultz, and A. H. Sherman. Algorithms and data structures for sparse symmetric Gaussian elimination. SIAM J. Sci. Stat. Comp., 2(2):225–237, 1981.
 [39] H. C. Elman. A stability analysis of incomplete LU factorizations. Math. Comput., pages 191–217, 1986.
 [40] H. C. Elman, D. J. Silvester, and A. J. Wathen. Finite elements and fast iterative solvers: with applications in incompressible fluid dynamics. Oxford University Press, 2014.
 [41] A. Ern and J.L. Guermond. Theory and Practice of Finite Elements, volume 159. Springer, 2013.
 [42] O. G. Ernst and M. J. Gander. Why it is difficult to solve Helmholtz problems with classical iterative methods. In Numerical Analysis of Multiscale Problems, pages 325–363. Springer, 2012.
 [43] R. D. Falgout and U. M. Yang. hypre: A library of high performance preconditioners. In Int. Conf. Comp. Sci., pages 632–641. Springer, 2002.
 [44] D. G. Feingold, R. S. Varga, et al. Block diagonally dominant matrices and generalizations of the Gerschgorin circle theorem. Pac. J. Math., 12(4):1241–1250, 1962.
 [45] A. George and J. W. Liu. Computer solution of large sparse positive definite systems. PrenticeHall, Englwood Cliffs. NJ, 1981.
 [46] A. Ghai, C. Lu, and X. Jiao. A comparison of preconditioned Krylov subspace methods for largescale nonsymmetric linear systems. Numer. Linear Algebra Appl., page e2215, 2017.
 [47] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins, 4th edition, 2013.

[48]
C. Greif, S. He, and P. Liu.
SYMILDL: Incomplete
factorization of symmetric indefinite and skewsymmetric matrices.
ACM Trans. Math. Softw., 44(1):1–21, 2017.  [49] A. Gupta and T. George. Adaptive techniques for improving the performance of incomplete factorization preconditioning. SIAM J. Sci. Comput., 32(1):84–110, 2010.
 [50] I. Gustafsson. A class of first order factorization methods. IT Numer. Math., 18(2):142–156, 1978.
 [51] W. Hackbusch. Hierarchical Matrices: Algorithms and Analysis, volume 49. Springer, 2015.
 [52] M. T. Heath. Scientific Computing: An Introductory Surve. McGrawHill, New York, second edition, 2002.
 [53] P. Heggernes, S. Eisenstat, G. Kumfert, and A. Pothen. The computational complexity of the minimum degree algorithm. In Proceedings of 14th Norwegian Computer Science Conference, pages 98–109, 2001.
 [54] V. E. Henson and U. M. Yang. BoomerAMG: a parallel algebraic multigrid solver and preconditioner. Applied Numerical Mathematics, 41(1):155–177, 2002.
 [55] N. J. Higham. A survey of condition number estimation for triangular matrices. SIAM Review, 29:575–596, 1987.
 [56] J. Hook, J. Pestana, F. Tisseur, and J. Hogg. Maxbalanced Hungarian scalings. SIAM J. Matrix Ana. Appl., 40(1):320–346, 2019.
 [57] B. Hübner, E. Walhorn, and D. Dinkler. A monolithic approach to fluid–structure interaction using space–time finite elements. Comput. Meth. Appl. Mech. Engrg., 193(2326):2087–2104, 2004.
 [58] H. Johansen and P. Colella. A Cartesian grid embedded boundary method for Poisson’s equation on irregular domains. J. Comput. Phys., 147(1):60–85, 1998.
 [59] M. T. Jones and P. E. Plassmann. An improved incomplete Cholesky factorization. ACM Trans. Math. Softw., 21(1):5–17, 1995.
 [60] S. R. A. Laboratory. HSL. a collection of Fortran codes for large scale scientific computation. http://www.hsl.rl.ac.uk/, retrieved on July 6th, 2019.
 [61] R. J. LeVeque. Finite Volume Methods for Hyperbolic Problems, volume 31. Cambridge University Press, 2002.
 [62] R. J. LeVeque. Finite Difference Methods for Ordinary and Partial Differential Equations: Steady State and Time Dependent Problems. SIAM, Philadelphia, 2007.
 [63] N. Li and Y. Saad. Crout versions of ILU factorization with pivoting for sparse symmetric matrices. Electron. T. Numer. Ana., 20:75–85, 2005.
 [64] N. Li, Y. Saad, and E. Chow. Crout versions of ILU for general sparse matrices. SIAM J. Sci. Comput., 25(2):716–728, 2003.
 [65] X. S. Li. An overview of SuperLU: Algorithms, implementation, and user interface. ACM Trans. Math. Softw., 31(3):302–325, 2005.
 [66] X. S. Li and J. Demmel. A scalable sparse direct solver using static pivoting. In Proceeding of the 9th SIAM conference on Parallel Processing for Scientific Computing, 1999.
 [67] X. S. Li and M. Shao. A supernodal approach to incomplete LU factorization with partial pivoting. ACM Trans. Math. Softw., 37(4), 2011.
 [68] Z. Li, Y. Saad, and M. Sosonkina. pARMS: a parallel version of the algebraic recursive multilevel solver. Numer. Linear Algebra Appl., 10(56):485–509, 2003.
 [69] C.J. Lin and J. J. Moré. Incomplete Cholesky factorizations with limited memory. SIAM J. Sci. Comput., 21(1):24–45, 1999.
 [70] W.H. Liu and A. H. Sherman. Comparative analysis of the Cuthill–McKee and the reverse Cuthill–McKee ordering algorithms for sparse matrices. SIAM J. Numer. Ana., 13(2):198–213, 1976.
 [71] J. Mayer. ILUCP: a Crout ILU preconditioner with pivoting. Numer. Linear Algebra Appl., 12(9):941–955, 2005.
 [72] J. Mayer. Alternative weighted dropping strategies for ILUTP. SIAM J. Sci. Comput., 27(4):1424–1437, 2006.
 [73] J. Mayer. ILU++: A new software package for solving sparse linear systems with iterative methods. In PAMM: Proceedings in Applied Mathematics and Mechanics, volume 7, pages 2020123–2020124. Wiley Online Library, 2007.
 [74] J. Mayer. A multilevel Crout ILU preconditioner with pivoting and row permutation. Numer. Linear Algebra Appl., 14(10):771–789, 2007.
 [75] J. A. Meijerink and H. A. van der Vorst. An iterative solution method for linear systems of which the coefficient matrix is a symmetric Mmatrix. Math. Comput., 31(137):148–162, 1977.
 [76] Y. Notay. Algebraic multigrid and algebraic multilevel methods: a theoretical comparison. Numer. Linear Algebra Appl., 12(56):419–451, 2005.
 [77] M. Olschowka and A. Neumaier. A new pivoting strategy for Gaussian elimination. Linear Algebra Appl., 240:131–151, 1996.
 [78] C. S. Peskin. The immersed boundary method. Acta Numerica, 11:479–517, 2002.
 [79] C. G. Petra, O. Schenk, M. Lubin, and K. Gärtner. An augmented incomplete factorization approach for computing the Schur complement in stochastic optimization. SIAM J. Sci. Comput., 36(2):C139–C162, 2014.
 [80] J. W. Ruge and K. Stüben. Multigrid Methods, chapter Algebraic Multigrid, pages 73–130. Number 13. SIAM, 1987.
 [81] Y. Saad. Preconditioning techniques for nonsymmetric and indefinite linear systems. J. Comput. Appl. Math., 24:89–105, 1988.
 [82] Y. Saad. ILUT: A dual threshold incomplete LU factorization. Numer. Linear Algebra Appl., 1, 1994.
 [83] Y. Saad. SPARSEKIT: a basic toolkit for sparse matrix computations. Technical report, University of Minnesota, 1994.
 [84] Y. Saad. ILUM: a multielimination ILU preconditioner for general sparse matrices. SIAM J. Sci. Comput., 17(4):830–847, 1996.
 [85] Y. Saad. Iterative Methods for Sparse Linear Systems. SIAM, 2nd edition, 2003.
 [86] Y. Saad. Multilevel ILU with reorderings for diagonal dominance. SIAM J. Sci. Comput., 27(3):1032–1057, 2005.
 [87] Y. Saad and M. Schultz. GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM J. Sci. Stat. Comput., 7(3):856–869, 1986.
 [88] Y. Saad and B. Suchomel. ARMS: An algebraic recursive multilevel solver for general sparse linear systems. Numer. Linear Algebra Appl., 9(5):359–378, 2002.
 [89] Y. Saad and H. A. Van Der Vorst. Iterative solution of linear systems in the 20th century. In Numerical Analysis: Historical Developments in the 20th Century, pages 175–207. Elsevier, 2001.
 [90] Y. Saad and J. Zhang. BILUTM: a domainbased multilevel block ILUT preconditioner for general sparse matrices. SIAM J. Matrix Anal. Appl., 21(1):279–299, 1999.
 [91] O. Schenk, M. Bollhöfer, and R. A. Römer. On largescale diagonalization techniques for the anderson model of localization. SIAM Rev., 50(1):91–112, 2008.
 [92] O. Schenk and K. Gärtner. Parallel sparse direct solver PARDISO – user guide version 6.0.0, 2018.
 [93] H. D. Simon et al. Incomplete LU preconditioners for conjugategradienttype iterative methods. SPE Reservoir Engineering, 3(01):302–306, 1988.
 [94] B. Smith, P. Bjorstad, and W. Gropp. Domain Decomposition: Parallel Multilevel Methods for Elliptic Partial Differential Equations. Cambridge University Press, 2004.
 [95] C. Taylor and P. Hood. A numerical solution of the NavierStokes equations using the finite element technique. Computers & Fluids, 1(1):73–100, 1973.
 [96] The HYPRE Team. hypre HighPerformance Preconditioners User’s Manual, 2017. version 2.12.2.
 [97] The MathWorks, Inc. MATLAB R2019a. Natick, MA, 2019.
 [98] M. Tismenetsky. A new preconditioning technique for solving large sparse linear systems. Linear Algebra Appl., 154(331–353), 1991.
 [99] A. van der Sluis. Condition numbers and equilibration of matrices. Numer. Math., 14:14–23, 1969.
 [100] H. A. van der Vorst. BiCGSTAB: A fast and smoothly converging variant of BiCG for the solution of nonsymmetric linear systems. SIAM J. Sci. Stat. Comput., 13(2):631–644, 1992.
 [101] H. A. Van der Vorst. Iterative Krylov Methods for Large Linear Systems, volume 13. Cambridge University Press, 2003.
 [102] P. Vanek, M. Brezina, and J. Mandel. Convergence of algebraic multigrid based on smoothed aggregation. Numerische Mathematik, 88:559–579, 2001.
 [103] P. Vaněk, J. Mandel, and M. Brezina. Algebraic multigrid by smoothed aggregation for second and fourth order elliptic problems. Computing, 56(3):179–196, 1996.
 [104] R. S. Varga. Factorization and normalized iterative methods. Technical report, Westinghouse Electric Corp., Pittsburgh, 1959.
 [105] P. S. Vassilevski. Multilevel Block Factorization Preconditioners: MatrixBased Analysis and Algorithms for Solving Finite Element Equations. Springer, 2008.
 [106] A. J. Wathen. Preconditioning. Acta Numerica, 24:329–376, 2015.
 [107] O. Widlund. A Lanczos method for a class of nonsymmetric systems of linear equations. SIAM J. Numer. Ana., 15(4):801–812, 1978.
 [108] G. Wittum. On the robustness of ILU smoothing. SIAM J. Sci. Stat. Comp., 10(4):699–717, 1989.
 [109] J. Zhang. A multilevel dual reordering strategy for robust incomplete LU factorization of indefinite matrices. SIAM J. Matrix Anal. Appl., 22(3):925–947, 2001.
 [110] Y. Zhu and A. H. Sameh. How to generate effective block jacobi preconditioners for solving large sparse linear systems. In Advances in Computational FluidStructure Interaction and Flow Simulation, pages 231–244. Springer, 2016.
Comments
There are no comments yet.