To assess the performance of the MAX codes, we are building a system of continuous benchmarking. Given the complexity of the flagship codes, it is not realistic to design benchmarks that explore all running parameters or all the features relevant to every possible simulation. We therefore select a number of scientific challenges that are representative of flagship-code performance. These "use cases" constitute our benchmark set, against which the progress of the work done in MAX is evaluated.
Quantum Espresso
A medium-sized test case run on Marconi100 (32 Power9 cores and 4 NVIDIA Volta GPUs per node) is the computation of a few Ir atoms adsorbed on a graphene sheet. The system counts 686 atoms, 3100 bands, and 4 k-points, and is run with a spin-polarized GGA-PBE exchange-correlation functional. It is thus possible to use pool parallelism with up to 8 pools (4 k-points times 2 spin channels). The calculation uses a wave-function cutoff of 30 Ry, requiring an FFT grid of {180, 180, 216}. With this setup and a 2-pool parallelization, it requires a total host RAM of 556 GB. Through a series of tests we determined that the data of each pool must be distributed over at least 20 Volta cards, in order to leave enough memory on the device to perform the dense diagonalizations of the iterative subspace within the Davidson algorithm (6200x6200 matrices). To gain clearer insight into the performance of the code, we report in the figure above the average time taken by a single SCF iteration in a single pool. The plot shows that the performance is already close to optimal at 20 GPUs per pool, reaching its best at 40 GPUs per pool. The dense-diagonalization contribution is executed on a single device and therefore does not change, while the h_psi and residual parts show only a small improvement as the number of GPUs per pool increases.
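As a rough illustration of this memory constraint, the short Python sketch below estimates the per-GPU share of the pool data and the size of one Davidson subspace matrix. The even distribution of the pool data, the 16 GB of device memory per Volta card, and the complex double-precision storage are assumptions made for the estimate, not values reported above.

    # Illustrative estimate only: assumes the pool data is spread evenly over the
    # GPUs and that each Volta card has 16 GB of device memory (both assumptions).
    total_host_ram_gb = 556        # host RAM of the 2-pool run (from the text)
    pools = 2
    gpus_per_pool = 20             # minimum found to be workable

    pool_data_gb = total_host_ram_gb / pools        # ~278 GB per pool
    per_gpu_gb = pool_data_gb / gpus_per_pool       # ~13.9 GB per card

    n = 6200                                        # Davidson subspace dimension
    davidson_gb = n * n * 16 / 1024**3              # complex double: ~0.6 GB per matrix

    print(f"pool data per GPU:          {per_gpu_gb:.1f} GB")
    print(f"one {n}x{n} complex matrix: {davidson_gb:.2f} GB")
    print(f"headroom on a 16 GB card:   {16 - per_gpu_gb:.1f} GB")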
Siesta
Time per SCF step for a fragment of a SARS-CoV-2 protein surrounded by water molecules, with approximately 8800 atoms (58000 orbitals), using various SIESTA solvers. CPU refers to 32 MPI tasks per node on Marconi100 (Power9 architecture). GPU values are for 32 MPI tasks per node plus 4 Volta GPU devices per node. The prefix ELSI indicates the use of the ELPA diagonalization library through the ELSI interface. PEXSI is an alternative solver not based on diagonalization; two sets of PEXSI results (for 20 and 30 poles) are shown. The thin line shows the ideal scalability behavior; note the double logarithmic scale. The figure highlights the speedup achieved by the GPU diagonalization version and the very good scalability of the PEXSI solver.
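The ideal-scalability line mentioned above is simply the reference curve T_ideal(N) = T_ref * N_ref / N, i.e. a straight line of slope -1 on a double logarithmic plot. A minimal sketch, with a made-up reference point rather than the measured SIESTA timings:

    # Ideal strong-scaling reference: T_ideal(N) = T_ref * N_ref / N.
    # The reference point below is a placeholder, not a measured SIESTA timing.
    t_ref, n_ref = 300.0, 4        # e.g. 300 s per SCF step on 4 nodes (made up)

    for n in (4, 8, 16, 32, 64):
        print(f"{n:3d} nodes: ideal time per SCF step = {t_ref * n_ref / n:7.1f} s")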
Yambo
Complete GW workflow for an N7-AGNR graphene nanoribbon on a graphene sheet, a very large-scale system (64 irreducible k-points, 2000 bands, 5x10^5 G-vectors for the density), run with 4 MPI tasks per node, 32 threads per task, and 4 V100 GPUs per node on CINECA's Marconi100 cluster. The figure shows a scaling test up to an 8 PFlops run (1000 tasks), with a parallel efficiency above 50% relative to the 64-task run. In the same benchmark campaign a single run was performed on up to 600 nodes (2400 GPUs), corresponding to a portion of the machine of the order of ~20 PFlops.
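The parallel efficiency quoted above is defined relative to the smallest (64-task) run; the sketch below shows the standard definition with placeholder wall times, since the measured Yambo timings are not reproduced here.

    # Strong-scaling parallel efficiency relative to a reference run.
    # The wall times below are placeholders, not measured Yambo values.
    def parallel_efficiency(t_ref, n_ref, t, n):
        """E(n) = (t_ref * n_ref) / (t * n); 1.0 corresponds to ideal scaling."""
        return (t_ref * n_ref) / (t * n)

    t_64 = 1000.0     # hypothetical wall time (s) with 64 MPI tasks
    t_1000 = 120.0    # hypothetical wall time (s) with 1000 MPI tasks

    print(f"efficiency at 1000 tasks: {parallel_efficiency(t_64, 64, t_1000, 1000):.0%}")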
Fleur
Strong scaling of the TiO2 test case (1078 atoms, 1 k-point, one self-consistency iteration). The plot starts at 16 nodes; the percentages in the legend are relative to this run. The total scaling (purple) is compared with the scaling of the main parts (green, light blue, orange) and with ideal scaling (yellow). Measurements were performed on the CLAIX 2016 supercomputer at RWTH Aachen University.
CP2K
The latest improvement has focused on integrating the COSMA library and its pdgemm wrapper into the CP2K code and verifying the performance of the new implementation in RPA calculations of a 128-water-molecule system. We performed the runs on 128 and 1024 nodes of Piz Daint and collected the data listed below.
The COSMA library outperforms MKL on the CPU nodes and Cray's accelerated LibSci_acc on the GPU nodes. We were able to achieve 65% of peak performance on the hybrid GPU nodes (a rough check of this figure is sketched below).
Runs on 1024 nodes of Piz Daint showed that COSMA outperforms Cray's accelerated LibSci_acc in pdgemm calls by a factor of ~2.
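As a back-of-the-envelope check of the 65% figure, the sketch below divides an assumed aggregate pdgemm rate by an assumed per-node peak of a hybrid Piz Daint node (one P100 GPU plus a 12-core Haswell CPU); both numbers are illustrative assumptions rather than measured values.

    # Rough percent-of-peak check.  The per-node peak (P100 GPU ~4.7 TFlop/s +
    # Haswell CPU ~0.5 TFlop/s, double precision) and the measured aggregate
    # pdgemm rate are assumptions used only for illustration.
    nodes = 128
    peak_per_node_tflops = 4.7 + 0.5
    measured_tflops = 430.0                 # placeholder aggregate rate

    fraction = measured_tflops / (nodes * peak_per_node_tflops)
    print(f"fraction of peak: {fraction:.0%}")    # ~65% with these numbers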
BigDFT
Strong parallel scaling of the exact exchange implementation in BigDFT for different systems (from a very small 12-atom cell up to a 1,029-atom cell). The ideal curve is depicted in blue; CPU, GPU, and GPU-GPUDirect results are included for the small cells.
As shown in the figure, the GPU-accelerated version scales well down to 3 orbitals per node. When going to two or even one orbital per node, scalability degrades because the computation time is no longer sufficient to overlap the communications. For one of the large systems used in this study (768 atoms, i.e. 12800 orbitals), the 3200-node run (4 orbitals per node) was 75% faster than the 1600-node run.
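To make the orbital bookkeeping explicit, the sketch below reproduces the node counts implied by the orbitals-per-node figures and the strong-scaling efficiency of the 1600-to-3200-node step, interpreting "75% faster" as a 1.75x speedup.

    # Orbital distribution and strong-scaling efficiency for the 768-atom case,
    # interpreting "75% faster" as a 1.75x speedup when doubling the node count.
    orbitals = 12800

    for orbitals_per_node in (8, 4):
        print(f"{orbitals_per_node} orbitals/node -> {orbitals // orbitals_per_node} nodes")

    speedup = 1.75                 # 3200 nodes vs 1600 nodes
    efficiency = speedup / 2.0     # ideal speedup when doubling the nodes is 2x
    print(f"efficiency of the 1600 -> 3200 node step: {efficiency:.1%}")   # 87.5%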