The CheSS solver uses an expansion based on Chebyshev polynomials to calculate the density matrix, thereby exploiting the sparsity of the overlap and Hamiltonian matrices. It works best for systems with a finite HOMO-LUMO gap and a small spectral width. CheSS exhibits a two-level parallelization using MPI and OpenMP and can scale to many thousands of cores (see Fig. 1). It has been made into a stand-alone library starting from the original codebase within BigDFT, and can be downloaded and installed freely from Stephan Mohr (BSC) has integrated the CheSS library into SIESTA.

The performances of CheSS are benchmarked against PEXSI and (Sca)LAPACK for the calculation of the density kernel and the inverse repectively. The comparison of the runtimes are shown in Figs. 2, 3 demonstrating that CheSS is the most efficient method.

Fig. 1 Left. Parallel scaling for the calculation of the inverse using CheSS. Speedup with respect to the minimal number of cores that was possible due to memory restrictions (80, 160 and 320, respectively). To ease the comparison, the curves for the 24000 matrix and 36000 matrix start at 2.0 and 4.0, respectively. The benchmarks were run on MareNostrum 3.
Right. Extreme scaling behavior of CheSS for the calculation of the inverse, going from 1536 cores up to 16384 cores. The runs were performed using 8 OpenMP threads, only varying the number of MPI tasks, and the speedup is shown with respect to the first data point. The benchmarks were done on the K computer.

Fig. 2 Comparison of the runtimes for the matrix inversion using CheSS, the Selected Inversion from PEXSI, ScaLAPACK and LAPACK, for various matrices and as a function of the condition number. All runs were performed in parallel, using 1920 cores (160 MPI tasks spanning 12 OpenMP threads each). The CheSS runs were started with well adjusted bounds for the eigenvalue spectrum, and the polynomial degree ranged from 60 to 260. For LAPACK, no results for matrices larger than 18000 are shown due to their long runtime. The benchmarks were done on MareNostrum 4.


Fig. 3 Comparison of the runtimes for the density kernel calculation using CheSS and PEXSI, for various matrices and as a function of the spectral width. All matrices had a HOMO-LUMO gap of 1 eV. The runs were performed in parallel using 1920 cores, once with a hybrid setup (160 MPI tasks spanning 12 OpenMP threads each) and once with an MPI-only setup. The CheSS runs were started with a good input guess for the eigenvalue bounds, with the polynomial degree ranging from 270 to 790. For PEXSI, the number of poles used for the expansion was set to 40, following Ref. 24. The benchmarks were done on MareNostrum 4.

More info here:

  • “Efficient computation of sparse matrix functions for large scale electronic structure calculations: The CheSS library”, S. Mohr, W. Dawson, M. Wagner, D. Caliste, T. Nakajima, L. Genovese,  J. Chem. Theory Comput. 13 (10), 4684–4698 (2017)  DOI: 10.1021/acs.jctc.7b00348,
  • Stephan Mohr invited talk during the MaX International Conference 2018 slides and video


One of the most important obstacles when keeping the codes up to date with hardware is the programming style based on old-style (i.e. non object-oriented) languages. Programming styles in community codes are often naive and lack the modularity and flexibility. From here, the need to disentangle such codes is essential for implementing new features or simply refactoring the application in order to efficiently run on the new architectures. Rewriting from scratch one of these codes is not an option because the communities behind these codes would be disrupted. One of the possible approaches that could permit to evolve the code is to progressively encapsulate the functions and subroutines, breaking up the main application in small (possibly weakly dependent) parts.

This strategy was followed by Quantum ESPRESSO: two main types of kernels were isolated in the independent directories and proposed as candidates for the domain-specific libraries for third-party applications.

The first set of kernels has been isolated in the LAXlib, available at here.
LAXlib contains all the low-level linear algebra routines of Quantum ESPRESSO, and in particular those used by the Davidson solver (e.g. the Cannon algorithm for the matrix-matrix product). The LAXlib also contains a mini-app that permits to evaluate the features of a HPC interconnect measuring the linear algebra routines contained therein.

The second library encapsulates all the FFT related functions, including the drivers for several different architectures, and it is available at
The FFTXlib library is self-contained and can be built without any dependencies on the remaining part of the Quantum ESPRESSO suite. Similarly, in the FFTXlib there is a mini-app that permits to mimic the FFT cycle for the SCF calculation of the charge density tuning the parallelization parameters of Quantum ESPRESSO. This mini-app has been also used to test the new implementation using the MPI-3 non blocking collectives. This proof-of-concept is available here and in Fig. 4 a wall-time comparison between the old and the new implementation is shown.

Fig. 4 Comparison between old and new implementation of the FFT. The dimension of the calculation refer to a buffer of data obtained through the mini-app by considering a fictitious system having a wave-function energy cut-off of 80 Rydberg and a cubic cell with a lattice parameter of 20 a.u. The calculation have been performed on a 2*18-core Intel(R) Xeon(R) E5-2697 v4 @ 2.30GHz.


CSCS is working on the initial implementation of the domain specific library for electronic structure calculations which encapsulates the full-potential linearized augmented plane-wave (FP-LAPW) and pseudopotential plane-wave (PP-PW) DFT ground state “engines”. The library is open-source (BSD 2-clause licence) and is freely available on the GitHub repository: . The library is written in C++11 with the MPI+OpenMP+CUDA programming model. In order to demonstrate the usage of the SIRIUS library the fork of Quantum ESPRESSO code was created and modified to work with the SIRIUS API: . In the current QE+SIRIUS implementation the generation of the effective potential and density mixing is done by QE and the rest (band diagonalization, density summation, charge augmentation and symmetrization) is done by SIRIUS. A comparison between the performances of QE and QE+SIRIUS is shown in Fig. 5.

Fig. 5 Performance comparison of the QE+SIRIUS and original QE codes. Time to solution for the DFT ground state of 53-atom CsCl (left panel) and 144 atom SiO 2 (right panel) is reported. QE code was run on the dual-socket CPU nodes (8-core Ivy Bridge, 12-core Haswell, 18-core Broadwell) while QE+SIRIUS code was run on a hybrid CPU-GPU node (12-core Haswell + P100 NVIDIA GPU card).

By closely analysing the MaX flagship codes the following common compute-intensive kernels can be identified:
● 3D FFT
● inner product of the wave-functions
● transformation of the wave-functions
● orthogonalization of the wave-functions

These kernels have been the focus of the initial implementation and testing phase within the SIRIUS framework while recent developments have focused on the refactoring of the SIRIUS library and on the creation of an independent sub-library with the aforementioned kernels. The sub-library is called SDDK (Slab Data Distribution Kit, ) and it is designed to work with the wave-functions distributed in “slabs”, where the G+k vector index is partitioned between MPI ranks and the band index is kept as a whole. The sub-library can be compiled and used independently of the main SIRIUS library. As an example we report in Fig. 6 the performances of the SDDK kernels as measured by a characteristic tests namely the application of the local potential to the set of wave-functions (benchmark of the wave-functions redistribution and FFT driver).

Fig. 6 Left: Execution time in seconds of the Vloc kernel on the dual-socket Intel Broadwell nodes (blue) and hybrid Intel Haswell + NVIDIA P100 nodes (green). Node-to-node comparison is performed. In case of CPU runs 4 MPI ranks (each with 9 OMP threads) were placed on every node. In case of CPU+GPU runs single MPI rank per node was spawned. Right: Parallel performance of the FFT driver for CPU nodes (blue) and hybrid nodes (green). One 18-thread MPI rank per CPU socket (2 ranks per node) was spawned in the case of CPU runs. One MPI rank per node was spawned in the case of GPU runs. The high performance of the single-node GPU run is explained by the absence of the MPI communication.