Quantum ESPRESSO (Quantum opEn-Source Package for Research in Electronic Structure, Simulation, and Optimisation) is a suite of open-source codes for quantum materials modelling based on the plane-wave pseudopotential method. The suite provides a wide set of common fundamental routines and has been the platform for important methodological innovations such as Car-Parrinello molecular dynamics and density-functional perturbation theory. It is released under the GNU GPL. Its core implements density-functional theory (DFT) in the plane-wave pseudopotential approach, but it also includes more advanced levels of theory: DFT+U, hybrid functionals, various functionals for van der Waals forces, many-body perturbation theory, and the adiabatic-connection fluctuation-dissipation theory.
Quantum ESPRESSO is widely used by scientists working in solid-state physics and materials science. The Quantum ESPRESSO manifesto has collected more than 10,000 citations since its publication in 2009; the users' mailing list counts about 3,000 messages per year; and schools and tutorials based on Quantum ESPRESSO have been held regularly (more than 30 since 2001) in various parts of the world, including, thanks to the collaboration with ICTP, many developing countries in Asia and Africa.
The community of Quantum ESPRESSO developers is also very large and spread all over the world; currently more than 40 active developers contribute directly to the main branch of the code. Experts from many HPC centres in Europe and the USA, as well as from hardware companies, contribute to improving the efficiency of the code. The code is part of the PRACE Unified European Applications Benchmark Suite and is being used for validation and co-design by major hardware chip vendors: Intel, NVIDIA, and ARM.
Parallelism and scaling
Quantum ESPRESSO codes may be compiled with pure MPI parallelism or with hybrid MPI + OpenMP parallelism; pw.x may also be compiled to exploit GPU acceleration. The MPI parallelism is organized as a multilevel hierarchy of MPI groups, called pools, band-groups, task-groups, and diag-groups (ortho-groups in cp.x). Scaling with the number of pools is linear in principle, although in practice it is case-dependent. Each pool may include several band-groups, whose MPI tasks may in turn be organized in task-groups. The data on the 3D grids (charge density and wavefunctions) are distributed among the tasks of each band-group, which performs the 3D FFT and matrix-vector multiplication operations on these data. Intra-band-group parallelism is also instrumental in reducing the per-node memory requirement.
Scaling with the number of MPI tasks per band-group is close to optimal as long as the number of tasks is comparable to the linear dimension of the 3D grid; beyond that point the parallelization saturates, and task-group parallelism is used to further distribute the workload. The diag-group uses the parallel linear algebra libraries ELPA and ScaLAPACK to perform exact diagonalization of matrices whose linear dimension is at least twice the number of bands. The scaling of these two mechanisms is shown in Fig. 1 for a recent profiling test case.
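The reason the band-group FFT parallelism saturates near the linear grid dimension can be seen in how a distributed 3D FFT works: each task owns a slab of xy-planes, transforms them locally in 2D, and a global transpose then exposes the z axis for local 1D transforms. The following is a minimal NumPy sketch of that pattern, simulated serially in one process; it is an illustration of the idea, not QE's actual FFTXlib:

```python
import numpy as np

def distributed_fft3d(grid, ntasks):
    """Toy model of a slab-decomposed parallel 3D FFT (serial simulation)."""
    nx, ny, nz = grid.shape
    assert nz % ntasks == 0, "planes must divide evenly among tasks"
    planes_per_task = nz // ntasks
    # Phase 1: every "task" applies 2D FFTs to its own xy-planes.
    slabs = [np.fft.fft2(grid[:, :, t * planes_per_task:(t + 1) * planes_per_task],
                         axes=(0, 1))
             for t in range(ntasks)]
    # Phase 2: global transpose (an MPI all-to-all in a real code) so that
    # each task ends up owning full columns along z.  Once ntasks approaches
    # nz, each task holds very few planes and this communication step
    # dominates -- the saturation described above.
    full = np.concatenate(slabs, axis=2)
    # Phase 3: local 1D FFTs along z complete the 3D transform.
    return np.fft.fft(full, axis=2)
```

Since the two local FFT phases together cover all three axes, the result coincides with a direct 3D FFT of the whole grid.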
Fig 1: Scaling of the average WALL time for scf iteration for a MAX CNTPOR8 benchmark case. The average time is also separated in h_psi (FFT + matrix vector products) and _diaghg (exact diagonalization) parts. The h_psi part scales with FFT up to 2048 cores: beyond that, scaling becomes non-optimal. Exact diagonalisation (_diaghg) with parallel linear algebra (ScaLAPACK) scales up to a 16×16 BLACS grid (256 MPI tasks). The number of tasks used for parallel linear algebra are indicated by the yellow labels. 4 OpenMP threads per MPI task are used for all runs.
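The 16×16 BLACS grid of Fig. 1 reflects the fact that parallel dense solvers such as ScaLAPACK are typically run on a square process grid, so a diag-group of n² tasks maps onto an n×n grid. A minimal sketch of that sizing rule (the helper `diag_grid` is hypothetical, not a QE routine):

```python
import math

def diag_grid(ntasks):
    """Largest square n x n process grid that fits into ntasks,
    in the spirit of how a diag-group lays out its MPI tasks for
    ScaLAPACK/ELPA.  Returns (rows, cols, tasks actually used)."""
    n = math.isqrt(ntasks)  # floor of the square root
    return n, n, n * n
```

For example, 256 available tasks yield a 16×16 grid using all 256 of them, while 300 tasks still yield only a 17×17 grid, leaving the remainder idle for the diagonalization step.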
Since version 6.4.1 of Quantum ESPRESSO, pw.x can be used on systems equipped with GPUs. Because compilation for GPUs requires a specific setup and linking to architecture-specific libraries, a GPU-ready version of the code is distributed via a dedicated repository, together with GPU-specific routines; users who wish to compile and run pw.x on GPU-enabled architectures can find it at https://gitlab.com/QEF/q-e-gpu. The most computation-intensive kernels executed with GPU acceleration are the parallel linear algebra and, to a degree that depends on the system size, the FFT routines. If the memory of a node is sufficient to store all the 3D grid data, 3D FFTs are performed directly with GPU-specific libraries on the devices of a single node. When the 3D data must instead be distributed among MPI tasks, GPU acceleration is used for the local 1D and 2D FFT operations, and the data are then scattered via MPI to complete the full 3D FFT. In the most common GPU setups this requires frequent data transfers between device and host memory; to improve efficiency, local FFT operations and MPI data scattering are therefore performed concurrently on different batches of wavefunctions: while one batch is processed on the GPU, the other batches are scattered via non-blocking MPI.
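The batched overlap of GPU compute with MPI communication is a classic software pipeline. The sketch below mimics it with two threads standing in for the GPU FFT and the non-blocking MPI scatter; `process_batches`, `local_fft`, and `scatter` are illustrative names, not QE APIs:

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def process_batches(batches, local_fft, scatter):
    """Toy pipeline: while the current batch is transformed (on the GPU
    in the real code), the previous batch's scatter (non-blocking MPI in
    the real code) is still in flight on a background thread."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = None  # scatter of the previous batch, possibly in flight
        for batch in batches:
            transformed = local_fft(batch)        # compute current batch
            if pending is not None:
                results.append(pending.result())  # drain previous scatter
            pending = pool.submit(scatter, transformed)  # overlap with next compute
        if pending is not None:
            results.append(pending.result())
    return results
```

The output is identical to processing the batches strictly one after another; the pipeline only changes when the communication happens, not what it produces.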
DFT toward the exascale: Quantum ESPRESSO on GPUs
Since release 6.4, a fully functional version of pw.x, the main quantum engine of Quantum ESPRESSO, can also be compiled and run on hybrid MPI + GPU systems based on NVIDIA GPUs. The porting to such platforms was done using the CUDA Fortran programming model.
The acceleration of the two most compute-intensive kernels of pw.x, namely FFTXlib for the 3D FFTs and LAXlib for parallel linear algebra, has proven crucial for performance portability.
Concerning the technical aspects of the porting, on GPUs LAXlib exploits the cuSOLVER library provided with the CUDA toolkit, while FFTXlib relies on cuFFT and performs the acceleration at various levels depending on the size of the FFT mesh. If the whole mesh fits into the memory of a single GPU, the 3D FFT is performed directly by the CUDA-specific 3D FFT kernel. If the mesh has to be distributed over multiple MPI tasks, GPU acceleration is obtained by performing the local 1D and 2D FFT operations on the device, and the data are then scattered via MPI to complete the 3D FFT. To reduce latency, FFTs on wavefunctions can be performed in batches, allowing the code to overlap the MPI communication for one batch with the GPU FFT operations on another. The pool parallelism over k-points is completely portable, and the FFT parallelism performs satisfactorily. Fig. 2 shows the speedup of Quantum ESPRESSO on different architectures equipped with NVIDIA Tesla V100 cards (courtesy of Josh Romero and Massimiliano Fatica from NVIDIA); the results demonstrate good portability of pw.x to accelerator-based hybrid systems.
Fig. 2: Speedup of Quantum ESPRESSO on different architectures equipped with NVIDIA Tesla V100 cards. Courtesy of Josh Romero and Massimiliano Fatica from NVIDIA.
Quantum ESPRESSO shows very good scalability by exploiting a hybrid MPI-OpenMP paradigm. Quantum ESPRESSO (more specifically, PWscf) is parallelized at different levels: k-points (linear scaling with the number of processors), bands, and plane waves/real-space grids (achieving high CPU scaling and memory distribution). Custom, domain-specific FFTs are implemented and parallelized over planes or sticks, and support task-group techniques. Parallel dense linear algebra is also exploited to improve scalability and memory distribution. Thanks to this multilevel parallelism, both the computation and the data structures are distributed so as to fully exploit massively multi-core parallel architectures.
Quantum ESPRESSO is open-source code distributed under the GNU General Public Licence. The source code is hosted on the GitHub hosting service. It builds with Fortran and C compilers and takes advantage of highly optimized libraries such as BLAS, LAPACK, ScaLAPACK, FFTW3, HDF5, GDLib, and GSL.
Two different parallelization paradigms are currently implemented in Quantum ESPRESSO: MPI and OpenMP. MPI is a well-established, general-purpose parallelization scheme; in Quantum ESPRESSO, several parallelization levels, specified at run time via command-line options to the executable, are implemented with MPI, and this is the first choice for execution on a parallel machine. OpenMP can be enabled via compiler directives (explicit OpenMP) or via multithreaded libraries (library OpenMP). Explicit OpenMP requires compiling for OpenMP execution; library OpenMP only requires linking to a multithreaded version of the mathematical libraries, e.g. ESSL SMP, ACML MP, or MKL (the latter is natively multithreaded).
Electronic-structure data, such as the wavefunctions, are written and read by the Quantum ESPRESSO PWscf executable using direct file access: each task stores its own information in a raw binary format. These files can be used as checkpoints for restarting a calculation.