A new scalability record in a materials science application

Another step towards exascale was taken by a team of MaX researchers at CNR (NANO & ISM), who ran a multi-petaflop simulation with the MaX flagship application Yambo.

A single GW calculation was run on 1000 Intel Knights Landing (KNL) nodes of the new Tier-0 MARCONI KNL partition, corresponding to 68,000 cores and a peak of about 3 PFlop/s. (The full MARCONI KNL partition comprises 50 racks and 3600 nodes, with 68 cores per node and a peak performance of about 11 PFlop/s.) The simulation, related to the growth of complex graphene nanoribbons on a metal surface, is part of an active research project combining computational spectroscopy with cutting-edge experimental data from teams in Austria, Italy, and Switzerland. The simulations were performed using computational resources granted by PRACE (via call 14).
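The quoted figures are internally consistent; a quick back-of-the-envelope check (a minimal Python sketch using only the numbers stated above) reproduces both the core count and the aggregate peak:

```python
# Sanity check of the MARCONI KNL figures quoted in the text.
nodes_total = 3600        # nodes in the full KNL partition
cores_per_node = 68       # cores per KNL node
peak_total = 11.0         # peak of the full partition, PFlop/s (approximate)

nodes_used = 1000         # nodes allocated to the GW run
cores_used = nodes_used * cores_per_node
peak_used = peak_total * nodes_used / nodes_total

print(f"cores used: {cores_used}")                  # -> 68000
print(f"peak available: ~{peak_used:.1f} PFlop/s")  # -> ~3.1 PFlop/s
```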

This result was made possible by the intensive work of the Yambo developer team on improving the performance of Yambo on large-scale HPC architectures. The parallel scaling of two of the main kernels of a Yambo GW run, namely the independent-particle linear response χ0 and the correlation self-energy Σc, was benchmarked by the CNR MaX team on the KNL partition of MARCONI by executing GW simulations for a realistic polymer used as a chemical precursor of chevron-shaped graphene nanoribbons. This one-dimensional system contains 136 atoms and 388 electrons in the unit cell, with 8 k-points in the irreducible Brillouin zone and about 3.5 million G-vectors to represent the charge density. An FFT grid of (144, 288, 180) was adopted, and 800 empty states were used in the GW calculation.
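For reference, these two kernels correspond to standard many-body expressions. In the usual Adler-Wiser convention (prefactors, signs, and broadening differ between references, so this is a textbook sketch rather than Yambo's exact implementation) the independent-particle response reads

\[
\chi^{0}_{\mathbf{G}\mathbf{G}'}(\mathbf{q},\omega)=\frac{2}{V}\sum_{\mathbf{k},v,c}\rho_{vc\mathbf{k}}(\mathbf{q},\mathbf{G})\,\rho^{*}_{vc\mathbf{k}}(\mathbf{q},\mathbf{G}')\left[\frac{1}{\omega-(\varepsilon_{c\mathbf{k}}-\varepsilon_{v\mathbf{k}-\mathbf{q}})+i\eta}-\frac{1}{\omega+(\varepsilon_{c\mathbf{k}}-\varepsilon_{v\mathbf{k}-\mathbf{q}})-i\eta}\right],
\]

with \(\rho_{vc\mathbf{k}}(\mathbf{q},\mathbf{G})=\langle c\,\mathbf{k}|e^{i(\mathbf{q}+\mathbf{G})\cdot\mathbf{r}}|v\,\mathbf{k}-\mathbf{q}\rangle\), while the correlation self-energy is the convolution

\[
\Sigma_{c}(\mathbf{r},\mathbf{r}';\omega)=\frac{i}{2\pi}\int d\omega'\,e^{i\omega'\eta}\,G(\mathbf{r},\mathbf{r}';\omega+\omega')\,W_{c}(\mathbf{r},\mathbf{r}';\omega'),
\]

where \(W_c = W - v\) is the frequency-dependent part of the screened Coulomb interaction. The sums over k-points and over valence/empty states are what make the 800 empty states and the ~3.5 million G-vectors quoted above the dominant cost factors, and they are also natural dimensions along which the workload can be distributed in parallel.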

In Fig. 1, panel (a), the execution times for the χ0 and Σc routines (the most expensive, in terms of CPU time, of a single GW run) are reported as a function of the number of cores. The speedup of the same routines is reported in panel (b). The results show good scalability up to 68,000 cores.
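The speedup in panel (b) is the usual ratio of execution times relative to the smallest core count in the scan. A minimal Python sketch of that bookkeeping follows; the timing values are illustrative placeholders, not the measured benchmark data, and the 1000-core baseline is an assumption for illustration:

```python
# Relative speedup and parallel efficiency from (cores, wall time) pairs.
# NOTE: the times below are hypothetical placeholders, not the Yambo benchmark data.
timings = {       # cores -> wall time (s)
    1000: 3600.0,
    4000: 950.0,
    16000: 260.0,
    68000: 75.0,
}

base_cores = min(timings)        # reference point of the scan
base_time = timings[base_cores]

for cores, t in sorted(timings.items()):
    speedup = base_time / t                      # S(N) = T(N0) / T(N)
    efficiency = speedup * base_cores / cores    # E(N) = S(N) * N0 / N
    print(f"{cores:6d} cores: speedup {speedup:6.1f}, efficiency {efficiency:5.1%}")
```

Ideal scaling corresponds to S(N) = N/N0, i.e. 100% efficiency; "good scalability up to 68,000 cores" means the measured curves stay close to that line.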

Figure 1: Panel (a): execution time of the χ0 and Σc routines as a function of the number of cores. Panel (b): speedup of the same routines.