• M. Merta, J. Zapletal. Acceleration of the BEM4I library using the Intel Xeon Phi coprocessors

  • M. Merta, A. Veit, J. Zapletal, D. Lukas. Parallel time domain boundary element method for 3-dimensional wave equation

  • J. Zapletal, M. Merta. Boundary element method for manycore architectures

  • J. Zapletal, M. Merta. Boundary element method for Helmholtz transmission problems

  • J. Zapletal, M. Merta, L. Maly. Boundary element quadrature schemes for multi- and many-core architectures

  • S. Dohr, M. Merta, G. Of, J. Zapletal. Efficient evaluation of space-time boundary integral operators on SIMD architectures

  • J. Zapletal, G. Of, M. Merta. Vectorized approach to the evaluation of boundary integral operators


  • M. Merta, J. Zapletal. Acceleration of the boundary element library BEM4I on the Knights Corner and Knights Landing architectures

  • J. Zapletal, M. Merta, L. Maly, G. Of. Boundary element method quadrature schemes for multi- and many-core architectures

  • M. Merta, J. Zapletal, M. Kravcenko, L. Maly. BEM4I: A massively parallel boundary element solver



  • Zapletal, J., Merta, M., Maly, L. Boundary element quadrature schemes for multi- and many-core architectures. Computers and Mathematics with Applications, available online. DOI: 10.1016/j.camwa.2017.01.018

    In the paper we study the performance of the regularized boundary element quadrature routines implemented in the BEM4I library developed by the authors. Apart from the results obtained on the classical multi-core architecture represented by the Intel Xeon processors, we concentrate on the portability of the code to the many-core family Intel Xeon Phi. In contrast to the GP-GPU programming that accelerates many scientific codes, the standard x86 architecture of the Xeon Phi processors allows the existing multi-core implementation to be reused. Although in many cases a simple recompilation would lead to an inefficient utilization of the Xeon Phi, the effort invested in the optimization usually leads to better performance on the multi-core Xeon processors as well. This makes the Xeon Phi an interesting platform for scientists developing a software library aimed at both modern portable PCs and high performance computing environments. Here we focus on the manually vectorized assembly of the local element contributions and the parallel assembly of the global matrices on shared memory systems. Due to the quadratic complexity of the standard assembly we also present an assembly sparsified by the adaptive cross approximation based on the same acceleration techniques. The numerical experiments performed on the Xeon multi-core processor and two generations of the Xeon Phi many-core platform validate the proposed implementation and highlight the importance of vectorization necessary to exploit the features of modern hardware.

  • Veit, A., Merta, M., Zapletal, J., Lukas, D. Efficient solution of time-domain boundary integral equations arising in sound-hard scattering. International Journal for Numerical Methods in Engineering 107, Wiley 2016. DOI: 10.1002/nme.5187.

    We consider the efficient numerical solution of the three-dimensional wave equation with Neumann boundary conditions via time-domain boundary integral equations. A space-time Galerkin method with C-infinity-smooth, compactly supported basis functions in time and piecewise polynomial basis functions in space is employed. We discuss the structure of the system matrix and its efficient parallel assembly. Different preconditioning strategies for the solution of the arising systems with block Hessenberg matrices are proposed and investigated numerically. Furthermore, a C++ implementation parallelized by OpenMP and MPI in shared and distributed memory, respectively, is presented. The code is part of the boundary element library BEM4I. Results of numerical experiments including convergence and scalability tests up to a thousand cores on a cluster are provided. The presented implementation shows good parallel scalability of the system matrix assembly. Moreover, the proposed algebraic preconditioner in combination with the FGMRES solver leads to a significant reduction of the computational time.

  • Merta, M., Zapletal, J., Jaros, J. Many core acceleration of the boundary element method. In Kozubek et al. High Performance Computing in Science and Engineering. LNCS 9611, Springer 2016. DOI: 10.1007/978-3-319-40361-8_8.

    The paper presents the boundary element method accelerated by the Intel Xeon Phi coprocessors. An overview of the boundary element method for the 3D Laplace equation is given, followed by a discussion of its discretization and its parallelization using OpenMP and the offload features of the Xeon Phi coprocessor. The results of numerical experiments for both single- and double-layer boundary integral operators are presented. In most cases the accelerated code significantly outperforms the original code running solely on Intel Xeon processors.

  • Zapletal, J., Merta, M., Cermak, M. BEM4I applied to shape optimization problems. AIP Conf. Proc 1738, 360012, 2016. DOI: 10.1063/1.4952145

    Shape optimization problems are one of the areas where the boundary element method can be applied efficiently. We present the application of the BEM4I library developed at IT4Innovations to a class of free surface Bernoulli problems in 3D. Apart from the boundary integral formulation of the related state and adjoint boundary value problems we present an implementation of a general scheme for the treatment of similar problems.


  • Merta, M., Zapletal, J. Acceleration of boundary element method by explicit vectorization. Advances in Engineering Software 86, Elsevier 2015. DOI: 10.1016/j.advengsoft.2015.04.008.

    Although parallelization of computationally intensive algorithms has become standard in the scientific community, the possibility of in-core vectorization is often overlooked. With the development of modern HPC architectures, however, neglecting such programming techniques may lead to inefficient code that hardly utilizes the theoretical performance of today's CPUs. The presented paper reports on explicit vectorization for quadratures stemming from the Galerkin formulation of boundary integral equations in 3D. To deal with the singular integral kernels, two common approaches, the semi-analytic and the fully numerical scheme, are used. We exploit modern SIMD (Single Instruction Multiple Data) instruction sets to speed up the assembly of system matrices based on both of these regularization techniques. The efficiency of the code is further increased by standard shared-memory parallelization techniques and is demonstrated on a set of numerical experiments.

  • Lukas, D., Kovar, P., Kovarova, T., Merta, M. A parallel fast boundary element method using cyclic graph decompositions. Numerical Algorithms 70, Springer 2015. DOI: 10.1007/s11075-015-9974-9.

    We propose a method for the parallel distribution of densely populated matrices arising in boundary element discretizations of partial differential equations. In our method the underlying boundary element mesh consisting of n elements is decomposed into N submeshes. The related NxN submatrices are assigned to N concurrent processes to be assembled. Additionally we require each process to hold exactly one diagonal submatrix, since its assembly is typically the most time consuming when applying fast boundary elements. We obtain a class of such optimal parallel distributions of the submeshes and corresponding submatrices by cyclic decompositions of undirected complete graphs. This results in a method whose theoretical complexity is (Formula presented.) in terms of time for the setup, assembly, and matrix action, as well as memory consumption per process. Nevertheless, numerical experiments up to n=2744832 and N=273 on a real-world geometry document that the method exhibits superior parallel scalability O((n/N) log n) of the overall time, while the memory consumption scales according to the theoretical estimate.
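
The block distribution described in this abstract can be pictured with a small toy example. The sketch below is not taken from the paper or from BEM4I; it assumes N = 7 processes and uses the set {0, 1, 3} modulo 7 (a hypothetical choice made here for illustration, in which every nonzero difference modulo 7 occurs exactly once). Process p holds the submeshes obtained by shifting that set by p and assembles every submatrix whose row and column submesh it holds, plus its own diagonal submatrix. The function name `cyclic_distribution` is likewise an illustrative assumption.

```python
from itertools import product


def cyclic_distribution(difference_set, n_procs):
    """Map each submatrix block (i, j) to the process that assembles it.

    Assumes 0 is in difference_set, so process p holds submesh p and
    can own the diagonal block (p, p).
    """
    owner = {}
    for p in range(n_procs):
        # Submeshes held by process p: the base set shifted cyclically by p.
        held = [(p + d) % n_procs for d in difference_set]
        for i, j in product(held, repeat=2):
            if i != j:
                # Each off-diagonal block must be claimed by one process only.
                assert (i, j) not in owner
                owner[(i, j)] = p
        # Exactly one diagonal block per process.
        owner[(p, p)] = p
    return owner


N = 7            # number of submeshes and processes (toy size)
D = (0, 1, 3)    # illustrative base set modulo 7
owner = cyclic_distribution(D, N)
blocks_per_proc = [sum(1 for q in owner.values() if q == p) for p in range(N)]
```

In this toy setting all 49 blocks are covered without overlap, each process assembles 7 blocks, and each process needs only 3 of the 7 submeshes, which is the kind of balance the cyclic decomposition aims for.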

  • Merta, M., Zapletal, J. Library of parallel boundary element method based solvers for solution of the time-dependent wave equation. Civil-Comp Proceedings 107, Civil-Comp Press 2015.

    A boundary element library for parallel solution of engineering problems is presented in this paper. The boundary element method is especially suitable for solving problems stated in unbounded domains, since the problem is reduced to the boundary of the domain. The structure of the library is described with focus on the module for parallel solution of wave scattering problems. The parallelization in shared memory using OpenMP is described and the possible parallelization in distributed memory using MPI is discussed. The results for numerical experiments are presented in the last part of the paper.

  • Zapletal, J., Merta, M., Cermak, M. A novel boundary element library with applications. AIP Conf. Proc 1648, 830010, 2015. DOI: 10.1063/1.4913036

    We present a newly developed library based on the boundary element method (BEM) for solving boundary value problems in 3D. The advantage of BEM over the widely used finite element method is clear when discretizing a problem in an unbounded domain. This is, for example, the case of sound scattering problems modelled by the Helmholtz equation, which is one of the possible applications of the library and is discussed in this paper.


  • Merta, M., Zapletal, J. A parallel library for boundary element discretization of engineering problems. Mathematics and Computers in Simulation, Elsevier. DOI: 10.1016/j.matcom.2016.05.013. (In press)

    In this paper we present software for the parallel solution of engineering problems based on the boundary element method. The library is written in C++ and utilizes OpenMP and MPI for parallelization in shared and distributed memory, respectively. We give an overview of the structure of the library and present numerical results related to 3D sound-hard scattering in an unbounded domain represented by the boundary value problem for the Helmholtz equation. Scalability results for the assembly of system matrices sparsified by the adaptive cross approximation are also presented.