M. Merta, J. Zapletal. Acceleration of the BEM4I library using the Intel Xeon Phi coprocessors
M. Merta, A. Veit, J. Zapletal, D. Lukas. Parallel time domain boundary element method for 3-dimensional wave equation
J. Zapletal, M. Merta. Boundary element method for manycore architectures
J. Zapletal, M. Merta. Boundary element method for Helmholtz transmission problems
J. Zapletal, M. Merta, L. Maly. Boundary element quadrature schemes for multi- and many-core architectures
S. Dohr, M. Merta, G. Of, J. Zapletal. Efficient evaluation of space-time boundary integral operators on SIMD architectures
J. Zapletal, G. Of, M. Merta. Vectorized approach to the evaluation of boundary integral operators
J. Zapletal, M. Kravcenko, X. Claeys, M. Merta. Parallel adaptive cross approximation for distributed memory systems
M. Merta, J. Zapletal. Acceleration of the boundary element library BEM4I on the Knights Corner and Knights Landing architectures
J. Zapletal, M. Merta, L. Maly, G. Of. Boundary element method quadrature schemes for multi- and many-core architectures
M. Merta, J. Zapletal, M. Kravcenko, L. Maly. BEM4I: A massively parallel boundary element solver
J. Zapletal, M. Merta, L. Maly, G. Of. A parallel space-time boundary element method
We describe a parallel solver for the discretized weakly singular space-time boundary integral equation of the spatially two-dimensional heat equation. The global space-time nature of the system matrices leads to improved parallel scalability in distributed memory systems, in contrast to time-stepping methods, where the parallelization is usually limited to the spatial dimensions. We present a parallelization technique based on a decomposition of the input mesh into submeshes and a distribution of the corresponding blocks of the system matrices among processors. To ensure load balancing, the distribution is based on a cyclic decomposition of complete graphs. In addition, the solution of the global linear system requires an efficient preconditioner. We present a robust preconditioning strategy based on boundary integral operators of opposite order and extend the introduced parallel solver to the preconditioned system.
We present a highly parallel version of the boundary element method accelerated by the adaptive cross approximation for the efficient solution of scattering problems with composite scatterers. Individual boundary integral operators are treated independently, i.e. the boundary of every homogeneous subdomain is decomposed into clusters of elements defining a block structure of the local matrix. The blocks are distributed across computational nodes by a graph algorithm providing a load balancing strategy. The intra-node implementation further utilizes threading in shared memory and in-core SIMD vectorization to make use of all features of modern processors. The suggested approach is validated on a series of numerical experiments presented in the paper.
We present an approach for a distributed memory parallelization of the boundary element method. The given mesh is decomposed into submeshes and the respective matrix blocks are distributed among computational nodes (processes). The distribution, which takes care of the load balancing during the system matrix assembly and matrix-vector multiplication, is based on the cyclic graph decomposition. Moreover, since the individual matrix blocks are approximated using the adaptive cross approximation method, we describe a modification capable of dealing with zero blocks in the double-layer operator matrix, which are usually problematic for the original adaptive cross approximation algorithm. Convergence and scalability of the method are demonstrated on the half- and full-space sound scattering problems modeled by the Helmholtz equation.
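To illustrate the kind of algorithm being modified, the following is a hedged sketch of partially pivoted adaptive cross approximation with a simple zero-row fallback; the function names, tolerances, and stopping test are illustrative assumptions, not the actual BEM4I routine.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// Sketch of partially pivoted ACA with a zero-row fallback. entry(i, j)
// returns one matrix element; the routine builds A ~ sum_l U[l] * V[l]^T and
// returns the achieved rank. If every candidate pivot row has a (numerically)
// zero residual, the block is treated as a zero block and the current rank is
// returned -- the situation that stalls the textbook algorithm.
int acaApproximate(const std::function<double(int, int)>& entry, int m, int n,
                   double eps, std::vector<std::vector<double>>& U,
                   std::vector<std::vector<double>>& V) {
  std::vector<bool> rowUsed(m, false);
  int pivotRow = 0;
  for (int k = 0; k < std::min(m, n); ++k) {
    std::vector<double> v(n);
    int pivotCol = -1;
    while (true) {
      // residual row: A(pivotRow, :) minus the current low-rank approximation
      double maxAbs = 0.0;
      pivotCol = -1;
      for (int j = 0; j < n; ++j) {
        v[j] = entry(pivotRow, j);
        for (std::size_t l = 0; l < U.size(); ++l)
          v[j] -= U[l][pivotRow] * V[l][j];
        if (std::fabs(v[j]) > maxAbs) { maxAbs = std::fabs(v[j]); pivotCol = j; }
      }
      rowUsed[pivotRow] = true;
      if (maxAbs > 1e-14) break;  // usable pivot found
      // zero residual row: fall back to the next unused row
      pivotRow = -1;
      for (int i = 0; i < m; ++i)
        if (!rowUsed[i]) { pivotRow = i; break; }
      if (pivotRow < 0) return static_cast<int>(U.size());  // zero block / done
    }
    double piv = v[pivotCol];
    for (int j = 0; j < n; ++j) v[j] /= piv;
    // residual column at pivotCol; also pick the next pivot row
    std::vector<double> u(m);
    double uMax = 0.0;
    int nextRow = -1;
    for (int i = 0; i < m; ++i) {
      u[i] = entry(i, pivotCol);
      for (std::size_t l = 0; l < U.size(); ++l)
        u[i] -= U[l][i] * V[l][pivotCol];
      if (!rowUsed[i] && std::fabs(u[i]) > uMax) { uMax = std::fabs(u[i]); nextRow = i; }
    }
    U.push_back(u);
    V.push_back(v);
    // crude stopping test on the size of the new rank-one update
    double uNorm = 0.0, vNorm = 0.0;
    for (int i = 0; i < m; ++i) uNorm += u[i] * u[i];
    for (int j = 0; j < n; ++j) vNorm += v[j] * v[j];
    if (std::sqrt(uNorm * vNorm) < eps || nextRow < 0) break;
    pivotRow = nextRow;
  }
  return static_cast<int>(U.size());
}
```

On an exactly rank-one block the sketch recovers rank 1, and on an all-zero block it returns rank 0 instead of dividing by a zero pivot.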
In this paper we introduce a new parallel solver for the weakly singular space–time boundary integral equation for the heat equation. The space–time boundary mesh is decomposed into a given number of submeshes. Pairs of the submeshes represent dense blocks in the system matrices, which are distributed among computational nodes by an algorithm based on a cyclic decomposition of complete graphs ensuring load balance. In addition, we employ vectorization and threading in shared memory to ensure intra-node efficiency. We present scalability experiments on different CPU architectures to evaluate the performance of the proposed parallelization techniques. All levels of parallelism allow us to tackle large problems and lead to an almost optimal speedup.
In this paper, we describe an efficient analytic evaluation of boundary integral operators. Firstly, we concentrate on a novel approach based on the simultaneous evaluation of all three linear shape functions defined on a boundary triangle. This results in a speedup of 2.35-3.15 compared to the previous approach of separate evaluations. In the second part we comment on the OpenMP parallelized and vectorized implementation of the suggested formulae. The employed code optimizations include techniques such as data alignment and padding, array-of-structures to structure-of-arrays data transformation, and unit-strided memory accesses. The presented scalability results, with respect both to the number of threads employed and to the width of the SIMD registers, obtained on an Intel Xeon processor and two generations of the Intel Xeon Phi family of (co)processors, validate the performed optimizations and show that vectorization needs to be an inherent part of modern scientific codes.
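A minimal sketch of the array-of-structures to structure-of-arrays transformation mentioned above; the data layout and names are illustrative, not BEM4I's actual structures. The point of the SoA layout is that the hot loop then accesses memory with unit stride, which is what allows the compiler to vectorize it.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct PointAoS {
  double x, y, z, weight;  // one quadrature point, AoS layout
};

struct PointsSoA {
  std::vector<double> x, y, z, weight;  // one array per member, SoA layout
};

// Convert the AoS container into the vectorization-friendly SoA layout.
PointsSoA toSoA(const std::vector<PointAoS>& aos) {
  PointsSoA soa;
  for (const PointAoS& p : aos) {
    soa.x.push_back(p.x);
    soa.y.push_back(p.y);
    soa.z.push_back(p.z);
    soa.weight.push_back(p.weight);
  }
  return soa;
}

// A unit-strided, vectorizable reduction over the SoA data.
double weightedSumZ(const PointsSoA& p) {
  double s = 0.0;
#pragma omp simd reduction(+ : s)
  for (std::size_t i = 0; i < p.z.size(); ++i) s += p.weight[i] * p.z[i];
  return s;
}
```

Without the transformation, the same loop over `PointAoS` elements would load four interleaved doubles per iteration, defeating unit-stride vector loads.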
The paper presents parallelization of the boundary element method in distributed memory of a cluster equipped with many-core based compute nodes. A method for efficient distribution of boundary element matrices among MPI processes based on cyclic graph decompositions is described. In addition, we focus on the intra-node optimization of the code, which is necessary in order to fully utilize the many-core processors with wide SIMD registers. Numerical experiments carried out on a cluster consisting of the Intel Xeon Phi processors of the Knights Landing generation are presented.
In this work we describe the accelerated assembly of system matrices for the boundary element method using the Intel Xeon Phi coprocessors. We present a model problem, provide a brief overview of its discretization and acceleration of the system matrices assembly using the coprocessors, and test the accelerated version using a numerical benchmark.
In this work we present an approach for the distribution of system matrices occurring in the fast boundary element method (BEM) among computational nodes in order to accelerate their assembly. An underlying mesh is decomposed into a given number of submeshes, pairs of which represent blocks in a system matrix. The aim is to distribute the submeshes among computational nodes in a way which minimizes the number of mesh parts owned by a single process and which inherently defines the distribution of the system matrix. Additionally, each process owns exactly one diagonal block, since the assembly of these is typically the most time-consuming part of the fast BEM. The distribution of submeshes is based on a cyclic decomposition of complete graphs into dense subgraphs. We briefly present the method and the boundary element environment it is implemented in, and provide results of numerical experiments.
In the paper we study the performance of the regularized boundary element quadrature routines implemented in the BEM4I library developed by the authors. Apart from the results obtained on the classical multi-core architecture represented by the Intel Xeon processors, we concentrate on the portability of the code to the many-core family Intel Xeon Phi. Contrary to the GP-GPU programming accelerating many scientific codes, the standard x86 architecture of the Xeon Phi processors allows us to reuse the already existing multi-core implementation. Although in many cases a simple recompilation would lead to an inefficient utilization of the Xeon Phi, the effort invested in the optimization usually leads to a better performance on the multi-core Xeon processors as well. This makes the Xeon Phi an interesting platform for scientists developing a software library aimed at both modern portable PCs and high performance computing environments. Here we focus on the manually vectorized assembly of the local element contributions and the parallel assembly of the global matrices on shared memory systems. Due to the quadratic complexity of the standard assembly, we also present an assembly sparsified by the adaptive cross approximation based on the same acceleration techniques. The numerical experiments performed on the Xeon multi-core processor and two generations of the Xeon Phi many-core platform validate the proposed implementation and highlight the importance of vectorization necessary to exploit the features of modern hardware.
We consider the efficient numerical solution of the three-dimensional wave equation with Neumann boundary conditions via time-domain boundary integral equations. A space-time Galerkin method with C∞-smooth, compactly supported basis functions in time and piecewise polynomial basis functions in space is employed. We discuss the structure of the system matrix and its efficient parallel assembly. Different preconditioning strategies for the solution of the arising systems with block Hessenberg matrices are proposed and investigated numerically. Furthermore, a C++ implementation parallelized by OpenMP and MPI in shared and distributed memory, respectively, is presented. The code is part of the boundary element library BEM4I. Results of numerical experiments including convergence and scalability tests up to a thousand cores on a cluster are provided. The presented implementation shows good parallel scalability of the system matrix assembly. Moreover, the proposed algebraic preconditioner in combination with the FGMRES solver leads to a significant reduction of the computational time.
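For context, a common building block when preconditioning such block-structured time-domain systems is a forward substitution over the lower block-triangular part of the matrix. The sketch below is an illustrative assumption about that building block (scalar dense blocks, a naive in-place Gaussian elimination for the diagonal solves), not the preconditioner proposed in the paper.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Block = std::vector<std::vector<double>>;  // small dense s x s block

// Solve D y = r for one small dense diagonal block by Gaussian elimination
// without pivoting (sufficient for this illustration only).
std::vector<double> denseSolve(Block D, std::vector<double> r) {
  const int s = static_cast<int>(r.size());
  for (int k = 0; k < s; ++k)
    for (int i = k + 1; i < s; ++i) {
      double f = D[i][k] / D[k][k];
      for (int j = k; j < s; ++j) D[i][j] -= f * D[k][j];
      r[i] -= f * r[k];
    }
  for (int i = s - 1; i >= 0; --i) {
    for (int j = i + 1; j < s; ++j) r[i] -= D[i][j] * r[j];
    r[i] /= D[i][i];
  }
  return r;
}

// Block forward substitution: solve L x = b, where L is block lower
// triangular with dense blocks L[i][j], j <= i (one block row per time step).
std::vector<std::vector<double>> blockForwardSolve(
    const std::vector<std::vector<Block>>& L,
    const std::vector<std::vector<double>>& b) {
  const int n = static_cast<int>(b.size());
  std::vector<std::vector<double>> x(n);
  for (int i = 0; i < n; ++i) {
    std::vector<double> r = b[i];  // subtract contributions of earlier steps
    for (int j = 0; j < i; ++j)
      for (std::size_t a = 0; a < r.size(); ++a)
        for (std::size_t c = 0; c < x[j].size(); ++c)
          r[a] -= L[i][j][a][c] * x[j][c];
    x[i] = denseSolve(L[i][i], r);  // solve with the diagonal block
  }
  return x;
}
```

In a Hessenberg setting the superdiagonal blocks would be handled by the outer (F)GMRES iteration; the triangular part above is what a preconditioner application has to invert.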
The paper presents the boundary element method accelerated by the Intel Xeon Phi coprocessors. An overview of the boundary element method for the 3D Laplace equation is given, followed by a discussion of the discretization and its parallelization using OpenMP and the offload features of the Xeon Phi coprocessor. The results of numerical experiments for both single- and double-layer boundary integral operators are presented. In most cases the accelerated code significantly outperforms the original code running solely on Intel Xeon processors.
Shape optimization problems are one of the areas where the boundary element method can be applied efficiently. We present the application of the BEM4I library developed at IT4Innovations to a class of free surface Bernoulli problems in 3D. Apart from the boundary integral formulation of the related state and adjoint boundary value problems we present an implementation of a general scheme for the treatment of similar problems.
Although parallelization of computationally intensive algorithms has become a standard in the scientific community, the possibility of in-core vectorization is often overlooked. With the development of modern HPC architectures, however, neglecting such programming techniques may lead to inefficient code that hardly utilizes the theoretical performance of today's CPUs. The presented paper reports on explicit vectorization of quadratures stemming from the Galerkin formulation of boundary integral equations in 3D. To deal with the singular integral kernels, two common approaches, the semi-analytic and the fully numerical scheme, are used. We exploit modern SIMD (Single Instruction, Multiple Data) instruction sets to speed up the assembly of system matrices based on both of these regularization techniques. The efficiency of the code is further increased by standard shared-memory parallelization techniques and is demonstrated on a set of numerical experiments.
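A minimal sketch of what such a vectorized quadrature loop can look like for the easy case, assuming well-separated elements (no singularity, so the fully numerical scheme applies directly), precomputed quadrature points stored in structure-of-arrays form, and the 3D Laplace single-layer kernel; all names and the data layout are illustrative assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

const double kPi = 3.14159265358979323846;

// 3D Laplace single-layer kernel 1 / (4 * pi * |x - y|), evaluated for the
// difference vector x - y.
inline double laplaceKernel(double dx, double dy, double dz) {
  return 1.0 / (4.0 * kPi * std::sqrt(dx * dx + dy * dy + dz * dz));
}

// Tensor-product quadrature for one pair of well-separated elements. The
// trial-element points (sx, sy, sz, sw) are stored contiguously, so the inner
// loop is unit-strided and annotated for SIMD execution; the outer loop can
// additionally be distributed across threads.
double pairIntegral(const std::vector<double>& tx, const std::vector<double>& ty,
                    const std::vector<double>& tz, const std::vector<double>& tw,
                    const std::vector<double>& sx, const std::vector<double>& sy,
                    const std::vector<double>& sz, const std::vector<double>& sw) {
  double result = 0.0;
  for (std::size_t q = 0; q < tx.size(); ++q) {  // test-element points
    double inner = 0.0;
#pragma omp simd reduction(+ : inner)
    for (std::size_t p = 0; p < sx.size(); ++p)  // trial-element points
      inner += sw[p] * laplaceKernel(tx[q] - sx[p], ty[q] - sy[p], tz[q] - sz[p]);
    result += tw[q] * inner;
  }
  return result;
}
```

The singular and near-singular pairs would instead go through a regularizing transformation (semi-analytic or fully numerical) before reaching a loop of this shape.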
We propose a method for the parallel distribution of the densely populated matrices arising in boundary element discretizations of partial differential equations. In our method the underlying boundary element mesh consisting of n elements is decomposed into N submeshes. The related N x N submatrices are assigned to N concurrent processes to be assembled. Additionally, we require each process to hold exactly one diagonal submatrix, since its assembly is typically the most time consuming when applying fast boundary elements. We obtain a class of such optimal parallel distributions of the submeshes and corresponding submatrices by cyclic decompositions of undirected complete graphs. This results in a method whose theoretical complexity is (Formula presented.) in terms of time for the setup, assembly, and matrix action, as well as memory consumption per process. Nevertheless, numerical experiments up to n = 2744832 and N = 273 on a real-world geometry document that the method exhibits superior parallel scalability O((n/N) log n) of the overall time, while the memory consumption scales according to the theoretical estimate.
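The rule below is a simplified, hypothetical illustration of such a block distribution for odd N: off-diagonal blocks, viewed as edges of the complete graph K_N, are assigned cyclically by (i + j) mod N, and every diagonal block stays on its own process. This already yields a perfectly balanced off-diagonal load of (N - 1)/2 blocks per process, although it is not the exact graph decomposition constructed in the paper (which additionally minimizes the number of submeshes per process).

```cpp
#include <cassert>
#include <vector>

// Hypothetical cyclic assignment of the N x N matrix blocks to N processes.
// Block (i, j) corresponds to the pair of submeshes i and j.
int blockOwner(int i, int j, int N) {
  if (i == j) return i;  // diagonal block stays with process i
  return (i + j) % N;    // cyclic rule on the edges of K_N
}

// Count the off-diagonal blocks assigned to each process; for odd N every
// process receives exactly (N - 1) / 2 of them.
std::vector<int> offDiagonalLoad(int N) {
  std::vector<int> load(N, 0);
  for (int i = 0; i < N; ++i)
    for (int j = i + 1; j < N; ++j) ++load[blockOwner(i, j, N)];
  return load;
}
```

Each process then only needs the submeshes incident to its blocks, which is what couples the matrix distribution to the mesh distribution described above.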
A boundary element library for the parallel solution of engineering problems is presented in this paper. The boundary element method is especially suitable for solving problems stated in unbounded domains, since the problem is reduced to the boundary of the domain. The structure of the library is described with focus on the module for the parallel solution of wave scattering problems. The parallelization in shared memory using OpenMP is described and the possible parallelization in distributed memory using MPI is discussed. Results of numerical experiments are presented in the last part of the paper.
We present a newly developed library based on the boundary element method (BEM) for solving boundary value problems in 3D. The advantage of BEM over the widely used finite element method is clear when discretizing a problem in an unbounded domain. This is, for example, the case of sound scattering problems modelled by the Helmholtz equation, which is one of the possible applications of the library and is discussed in this paper.
In this paper we present a software for parallel solution of engineering problems based on the boundary element method. The library is written in C++ and utilizes OpenMP and MPI for parallelization in both shared and distributed memory. We give an overview of the structure of the library and present numerical results related to 3D sound-hard scattering in an unbounded domain represented by the boundary value problem for the Helmholtz equation. Scalability results for the assembly of system matrices sparsified by the adaptive cross approximation are also presented.