Quantum ESPRESSO - YAMBO - Siesta - Fleur
Thanks to MAX support during phase I of the CoE, the fundamental and general functionalities of the Quantum ESPRESSO suite have been collected and reorganized as stand-alone modules. Some of these will be further refactored so as to be fully released as stand-alone libraries by the MAX CoE during phase II. Among the many contributions supported or carried out directly during phase I of MAX operation, we would like to highlight:
Refactoring planned for MAX phase I
Reorganization of I/O with the adoption of XML schemas
A more robust and interoperable format for the input and small-size output of Quantum ESPRESSO has been adopted. The new format is based on publicly released XML schemas. A toolchain based on Python scripts has been developed in order to automatically generate and adapt the Fortran routines to any modification of the schema. The adoption of the Document Object Model for XML I/O, as implemented in the FoX library, has been instrumental to the adoption of this toolchain. To simplify interoperability with Python applications we released the xmlschema package, which automatically builds objects for decoding and encoding in the new format. The large-size binary output has also been rationalized in a more portable format, and it is now possible to compile Quantum ESPRESSO so as to use the HDF5 format to write these files. A pure Python package (qeschema) and a Python extension based on Fortran sources (postqe) provide access to these new file formats for Python applications and basic post-processing tasks. (Release qe-6.2.1)
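To illustrate why the XML format simplifies interoperability, the sketch below decodes a few quantities from a drastically simplified document with the Python standard library only. The real files follow the published qes XML Schema and are best handled with the xmlschema/qeschema packages; the element layout below merely mimics that structure for illustration.

```python
# Sketch: decoding a (much simplified) Quantum ESPRESSO-like XML output with
# the standard library. Element names only mimic the real qes layout.
import xml.etree.ElementTree as ET

xml_doc = """<?xml version="1.0"?>
<qes:espresso xmlns:qes="http://www.quantum-espresso.org/ns/qes/qes-1.0">
  <output>
    <total_energy>
      <etot>-1.5754013e+01</etot>
    </total_energy>
    <band_structure>
      <nbnd>4</nbnd>
      <nelec>8.0</nelec>
    </band_structure>
  </output>
</qes:espresso>"""

root = ET.fromstring(xml_doc)
# Children carry no namespace here, so plain path queries work
etot = float(root.find("./output/total_energy/etot").text)
nbnd = int(root.find("./output/band_structure/nbnd").text)
print(etot, nbnd)
```

In practice the xmlschema package goes one step further: it validates the file against the schema and converts the whole document to nested Python objects in one call.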
Refactoring of FFTXlib and LAXlib
These two most compute-intensive libraries have been extensively refactored and disentangled from the rest of the code, and provided with a new, more portable interface. The new FFTXlib allows a more granular redistribution of 3D data, and it is possible to choose among different data distributions. (Release qe-6.2.1)
Creation of the KS_solvers library
To make the usage of the different iterative eigensolvers more transparent, the implementations of the algorithms have been collected in this single library. (Release qe-6.3)
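The kind of iterative eigensolver collected in KS_solvers can be sketched in a few lines on a dense toy matrix. The snippet below is a minimal Davidson iteration with a diagonal (Jacobi) preconditioner, a stand-in for the Fortran implementations in the library, not their actual algorithmic details.

```python
import numpy as np

def davidson_lowest(A, tol=1e-8, max_subspace=25, max_iter=200):
    """Lowest eigenpair of symmetric A via a toy Davidson iteration."""
    n = A.shape[0]
    V = np.zeros((n, 1)); V[0, 0] = 1.0          # starting subspace
    diagA = np.diag(A).copy()
    theta, x = 0.0, V[:, 0]
    for _ in range(max_iter):
        Vq, _ = np.linalg.qr(V)                  # orthonormal subspace basis
        w, S = np.linalg.eigh(Vq.T @ A @ Vq)     # Rayleigh-Ritz step
        theta, x = w[0], Vq @ S[:, 0]
        r = A @ x - theta * x                    # residual vector
        if np.linalg.norm(r) < tol:
            break
        denom = diagA - theta                    # Jacobi (diagonal) preconditioner
        denom[np.abs(denom) < 1e-8] = 1e-8
        t = r / denom                            # preconditioned correction
        if Vq.shape[1] >= max_subspace:          # restart from the best vector
            Vq = x[:, None]
        V = np.hstack([Vq, t[:, None]])
    return theta, x

# Toy Hamiltonian: diagonally dominant symmetric matrix
rng = np.random.default_rng(0)
n = 50
B = 0.01 * rng.standard_normal((n, n))
A = (B + B.T) / 2 + np.diag(np.arange(n, dtype=float))
theta, x = davidson_lowest(A)
print(theta, np.linalg.eigvalsh(A)[0])
```

Collecting such solvers behind one interface lets the calling code swap algorithms (Davidson, CG, PPCG, ...) without touching the rest of the electronic-structure machinery.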
Refactoring and development of specific libraries for enabling the usage of GPU accelerated systems
pw.x can be used on systems equipped with GPUs. As the compilation for GPUs requires a specific setup and linking to architecture-specific libraries, the code is distributed GPU-ready via a dedicated repository, together with specific routines. Users willing to compile and use pw.x on GPU-enabled architectures may find the code here. (Release qe-6.4.1)
Adoption of the ACE method for the Exact Exchange operator in hybrid functional calculation
The ACE algorithm presented in J. Chem. Theory Comput. 2016, 12, 2242–2249 has been implemented and has become the default algorithm for the evaluation of the Fock exchange operator in pw.x. The adoption of this more efficient algorithm has greatly reduced the workload of hybrid-functional computations, making them feasible for a much wider range of systems. (Release qe-6.3)
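The core of the ACE construction can be demonstrated in a small orthonormal basis: from the action of the exchange operator on the occupied orbitals one builds a low-rank operator that reproduces the exchange exactly on the occupied subspace. The sketch below uses random dense matrices as stand-ins for the real operators.

```python
# Sketch of the ACE construction: from W = Vx @ Psi, build the low-rank
# operator V_ACE = -Xi Xi^T that acts exactly like Vx on the occupied orbitals.
import numpy as np

rng = np.random.default_rng(1)
n, m = 40, 5                                         # basis size, occupied orbitals
Psi, _ = np.linalg.qr(rng.standard_normal((n, m)))   # orthonormal occupied orbitals
B = rng.standard_normal((n, n))
Vx = -B @ B.T                                        # exchange is negative definite

W = Vx @ Psi                           # action of Vx on the occupied orbitals
M = Psi.T @ W                          # m x m projected exchange (negative definite)
L = np.linalg.cholesky(-M)             # Cholesky factorization: -M = L L^T
Xi = W @ np.linalg.inv(L).T            # ACE projection vectors
V_ace = -Xi @ Xi.T                     # adaptively compressed exchange operator

print(np.max(np.abs(V_ace @ Psi - W)))  # vanishes: exact on the occupied subspace
```

The payoff is that applying V_ace costs only m inner products per orbital, instead of re-evaluating the full exchange at every iteration.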
Conversion tool for localized orbitals
The implementation of the SCDM method to construct a set of optionally orthogonal, localized basis functions associated with the density matrix of the system is available to Quantum ESPRESSO developers inside the code. The SCDM procedure has also been improved to efficiently treat realistic large-scale molecular systems: the new implementation provides similar localization with a much smaller memory usage. (Release qe-6.4)
Exact Exchange operator built in a localized representation
For hybrid-functional calculations in large systems it is possible to opt for an evaluation of the Fock exchange operator using a representation of the density matrix based on localized orbitals [Electron. Struct. 1 015009]. By setting a localization threshold it is possible to restrict the exchange operator to the orbital pairs with overlap beyond the indicated threshold; this significantly reduces the number of exchange integrals computed and makes their number grow linearly with the size of the system. To obtain the localized representation of the density matrix we use the recently proposed SCDM approach, which is fast and non-iterative. (Release qe-6.3)
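The SCDM idea mentioned above can be sketched on a toy problem: pivoted column selection on the occupied-orbital matrix picks "anchor" grid points, and the corresponding columns of the density matrix give localized orbitals spanning the same occupied subspace. The greedy pivoting below mimics a QR factorization with column pivoting; production codes use library QRCP routines.

```python
# Toy SCDM sketch: select anchor grid points by greedy pivoted Gram-Schmidt
# (equivalent in spirit to QR with column pivoting on Psi^T), then build
# localized orbitals spanning the occupied subspace.
import numpy as np

def scdm(Psi):
    n, m = Psi.shape
    R = Psi.T.copy()                   # m x n: columns indexed by grid points
    piv = []
    for _ in range(m):                 # greedy column pivoting
        j = int(np.argmax(np.sum(R**2, axis=0)))
        piv.append(j)
        q = R[:, j] / np.linalg.norm(R[:, j])
        R -= np.outer(q, q @ R)        # deflate the selected direction
    Phi = Psi @ Psi[piv, :].T          # density-matrix columns at the pivots
    # Loewdin orthonormalization: Phi (Phi^T Phi)^(-1/2)
    w, U = np.linalg.eigh(Phi.T @ Phi)
    return Phi @ U @ np.diag(w**-0.5) @ U.T

rng = np.random.default_rng(2)
Psi, _ = np.linalg.qr(rng.standard_normal((60, 6)))  # stand-in occupied orbitals
Phi = scdm(Psi)
# Phi is orthonormal and spans the same subspace as Psi:
print(np.max(np.abs(Phi @ Phi.T - Psi @ Psi.T)))
```

Because the selection needs only one pivoted factorization, the procedure is non-iterative, as stated above.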
Filtering of USPP and PAW augmentation charges and projectors
The evaluation of scalar products of trial wavefunctions with the projectors defining the non-local part of the pseudopotentials, and the addition of localized augmentation charges for ultrasoft pseudopotentials, are important parts of the calculation and absorb a significant fraction of the computational resources. These operations can be performed either in reciprocal space or in real space. The real-space evaluation, exploiting the localization of pseudopotential projectors and augmentation charges, is more efficient for large systems, but suffers from severe aliasing errors if the adopted plane-wave expansion is incomplete. For this reason the less efficient reciprocal-space evaluation is usually used. In order to improve efficiency, which is especially important when dealing with hybrid functionals, a Fourier-space filtering of the augmentation charges and projectors has been developed that removes the high Fourier components causing aliasing errors, thus allowing the use of the more efficient real-space algorithm with results equivalent to those of the reciprocal-space one. (Release qe-6.3)
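The effect of such a filter is easy to visualize in one dimension: the high-G components of a sharply localized "projector" are removed so that the function becomes representable on an incomplete plane-wave grid without aliasing. The sharp cutoff below is only illustrative; the actual implementation uses a smoother filter function.

```python
# 1D illustration of Fourier-space filtering of a localized projector.
import numpy as np

n = 256
x = np.linspace(-np.pi, np.pi, n, endpoint=False)
beta = np.exp(-50.0 * x**2)            # sharply localized "projector"
G = np.fft.fftfreq(n, d=x[1] - x[0]) * 2 * np.pi
Gcut = 30.0                            # plane-wave cutoff of the coarse grid

beta_G = np.fft.fft(beta)
beta_G[np.abs(G) > Gcut] = 0.0         # zero out components beyond the cutoff
beta_filt = np.real(np.fft.ifft(beta_G))

# No spectral weight beyond Gcut, yet close to the original in real space
print(np.max(np.abs(beta_filt - beta)))
```

After filtering, scalar products evaluated on the coarse real-space grid agree with the reciprocal-space result, because no components above the cutoff are left to alias back.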
Collective excitations with TDDFPT
The description of collective excitations involving the creation of electron-hole pairs (i.e. excitons) can be made only by taking into account the non-local nature of the exchange interaction between the electron and the hole. This can be achieved using many-body perturbation theory (MBPT) methods, which are computationally very demanding. We have implemented a less expensive approach based on TDDFPT with hybrid functionals in conjunction with models for the dielectric screening in materials. This required an extension of the turboEELS code to include the implementation of the non-local exchange interactions.
Modelling of magnons (spin waves), which are collective excitations of spin degrees of freedom, can be made using TDDFPT generalized to the magnetic case, i.e. by taking into account not only charge-density fluctuations but also spin (magnetization) density fluctuations due to the external perturbing magnetic field. The TDDFPT module has been extended to the magnetic case in the general noncollinear framework of the alignment of spins, including relativistic spin-orbit coupling effects [Gorni, T. Spin-fluctuation spectra in magnetic systems: a novel approach based on TDDFT. Ph.D. thesis, Scuola Internazionale Superiore di Studi Avanzati (SISSA), Trieste, Italy (2016). Gorni, T., Timrov, I. & Baroni, S. Dynamical spin-wave excitations from time-dependent density functional perturbation theory. Eur. Phys. J. B 91: 249 (2018)]. The method is based on the magnetic Liouville-Lanczos approach within TDDFPT in ALDA or AGGA, which gives a fairly accurate description of magnon dispersions in materials at a reasonable computational cost (which is, of course, lower than that of MBPT methods). The code is currently under further testing and refinement, and is scheduled to be distributed in one of the next official releases of Quantum ESPRESSO.
Thermal transport from AIMD via the Green-Kubo formalism
This new feature is based on the recent theory of heat conduction within DFT, developed in the group of Prof. Stefano Baroni in SISSA [Nature Phys. 12, 80–84, 2016], where an energy-current time series is extracted from AIMD simulations at thermodynamic equilibrium, and the thermal conductivity is computed within a Green-Kubo formulation. At present, the code post-processes the atomic positions and velocities, given as output of a previous AIMD simulation, in order to regenerate many needed DFT ingredients (Kohn-Sham orbitals and energies, electron charge density, etc.), whose storage during the simulation is unfeasible for memory reasons, and finally calculates the energy current. The code is in the verification and refinement phase and will be added to the Quantum ESPRESSO suite as soon as possible.
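The Green-Kubo step described above can be sketched on a synthetic flux time series: the transport coefficient is proportional to the time integral of the flux autocorrelation function. Here a simple AR(1) process with a known correlation time stands in for the DFT energy current, so the estimate can be checked against the analytic value.

```python
# Green-Kubo sketch: integrate the autocorrelation of a synthetic "energy
# flux" (AR(1) process), standing in for the DFT energy current.
import numpy as np

rng = np.random.default_rng(3)
a, dt, N = 0.8, 1.0, 200_000
xi = rng.standard_normal(N)
J = np.empty(N)
J[0] = 0.0
for i in range(1, N):                  # AR(1): <J(0) J(k dt)> = var * a^k
    J[i] = a * J[i - 1] + xi[i]

kmax = 60                              # integrate the ACF up to its decay
acf = np.array([np.mean(J[:N - k] * J[k:]) for k in range(kmax)])
kappa = dt * (0.5 * acf[0] + np.sum(acf[1:]))   # trapezoid-like GK integral

# Analytic value for an AR(1) flux: var * (1 + a) / (2 (1 - a)) * dt
exact = dt * (1.0 / (1.0 - a**2)) * (1.0 + a) / (2.0 * (1.0 - a))
print(kappa, exact)
```

In the real workflow the flux comes from the DFT post-processing described above, and the delicate point is truncating the autocorrelation integral before statistical noise dominates.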
WanT interpolation tools
A set of tools to exploit localised basis sets within Quantum ESPRESSO is now available in the WANT package [http://www.wannier-transport.org]. By projecting the plane-wave represented wave functions on localised pseudo-atomic-orbitals (i.e. atomic orbitals obtained within a pseudopotential description) it is possible to construct effective tight-binding Hamiltonians and to use them to Fourier interpolate physical quantities such as band structures, density of states, Fermi surfaces, Boltzmann transport coefficients, dielectric properties, to name only a few. Similar interpolations can also be obtained when using maximally-localised Wannier functions (MLWFs), also computed by WANT.
During MAX, the Yambo code has been refactored and improved following modern software design practices such as modularity, reuse of routines and libraries, and separation of concerns between hardware and software layers. Besides the reorganization of the code, rigorous self-testing and automation frameworks have been added to the suite. Methodological advances have been incorporated into the code to improve its performance, and concurrently new features have been implemented to compute novel state-of-the-art physical phenomena. To better exploit massively parallel architectures, the implementation of MPI and OpenMP has been improved. The main new features and algorithms released within MAX are listed below.
I/O: parallel and serial
I/O procedures have been made more flexible and efficient. Yambo stores binary data using the NetCDF library. Depending on the configuration flags, data can be stored in the classic NetCDF format (2 GB file-size limit, activated with --enable-netcdf-classic), in the 64-bit NetCDF format (no file size limit, the default), or in the HDF5 format (requires at least NetCDF v4 linked to HDF5, activated using the --enable-netcdf-hdf5 option). When the HDF5 format is specified, parallel I/O can also be activated (--enable-hdf5-par-io) to store the response function in G space or the kernel of the BSE. For the G-space response function, parallel I/O avoids extra communication among the MPI tasks and also reduces the amount of memory allocated per task. For the BSE case, parallel I/O makes it possible to perform restarts and to load a kernel computed in a previous calculation using a different parallel scheme and/or a different number of MPI tasks. (Release Yambo 4.4)
Interface with the FFTXLib domain specific library
The calls to the FFT kernel were modularized and an interface with FFTXLib was implemented. This library is now distributed with Yambo (under the name of FFTQE) and can be called by the main code (a subset of the features implemented in FFTXLib is currently supported, though a more extended interfacing is planned). (Release Yambo 4.2)
SLEPc & PETSc support
Yambo has been interfaced with the SLEPc library, which uses objects and methods of the PETSc library to implement Krylov-subspace algorithms for the iterative solution of eigenvalue problems. These are used in Yambo to obtain selected eigenpairs of the excitonic Hamiltonian. (Release Yambo 4.3)
Parallelization
A refactoring of the parallel structure has been carried out in recent years in order to take full advantage of nodes with many cores and a limited amount of memory per core. In particular, within MAX, a new level of MPI parallelism over the space degrees of freedom (plane waves) of the response matrix has been added to the multilevel MPI structure. Moreover, an important effort has been devoted to making the OpenMP scaling of the BSE kernel efficient. (Release Yambo 4.4)
Response function and Green’s function terminators
In order to reduce the number of virtual/empty Kohn-Sham states necessary to converge both the polarisability and the GW self-energy, we have implemented in Yambo both response (X) and self-energy (G) terminator techniques, as described in Bruneval, F. & Gonze, X. Accurate GW self-energies in a plane-wave basis using only a few empty states: Towards large systems. Phys. Rev. B 78, 085125 (2008). In this procedure extra terms, whose calculation implies a small computational overhead, are introduced to correct both the polarizability and the self-energy by approximating the effect of the states not explicitly taken into account. The method consists in replacing the energies of the empty states that lie above a certain threshold, and that are not explicitly treated, by a single adjustable parameter called the extrapolar energy. (Release Yambo 4.2)
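The terminator idea can be demonstrated on a toy sum over states: the sum is truncated, and the missing tail is recovered through closure relations, with the neglected states replaced by a single extrapolar energy. This sketch mimics the spirit of the Bruneval-Gonze correction, not its exact equations.

```python
# Toy "terminator": correct a truncated sum over empty states using closure
# relations and an extrapolar energy, without touching the neglected states.
import numpy as np

rng = np.random.default_rng(4)
n, Ncut = 200, 40
H = rng.standard_normal((n, n)); H = (H + H.T) / 2
e, U = np.linalg.eigh(H)               # toy "Kohn-Sham" spectrum, e[0] lowest
Op = rng.standard_normal((n, n)); Op = (Op + Op.T) / 2
O = U.T @ Op @ U                       # perturbation in the eigenbasis

v = 0                                  # "occupied" state
full  = sum(O[v, c]**2 / (e[v] - e[c]) for c in range(1, n))
trunc = sum(O[v, c]**2 / (e[v] - e[c]) for c in range(1, Ncut))

# Closure gives the neglected spectral weight and its first moment from
# matrix elements of O^2 and O H O, i.e. without the neglected states:
weight = (O @ O)[v, v] - sum(O[v, c]**2 for c in range(Ncut))
m1 = (O @ np.diag(e) @ O)[v, v] - sum(O[v, c]**2 * e[c] for c in range(Ncut))
ebar = m1 / weight                     # extrapolar energy
corr = trunc + weight / (e[v] - ebar)  # terminator-corrected sum

print(abs(trunc - full), abs(corr - full))
```

The corrected sum is much closer to the converged result than the bare truncation, which is exactly why far fewer empty states need to be computed explicitly.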
Interface with the WANT package for the interpolation of GW results
We have interfaced Yambo to the WANT package to exploit its real-space tools, based on pseudo-atomic-orbital projections and Wannier functions, to Fourier interpolate (real and complex) band structures and densities of states in the framework of GW. This feature is particularly appealing for GW calculations, where the cost of computing QP corrections per k-point is significantly higher than for DFT calculations. Technically, after having produced the standard DFT data required by Yambo, the projwfc.x tool of Quantum ESPRESSO needs to be run to dump the projections of the Kohn-Sham states on a basis of pseudo-atomic orbitals. The calculation of the GW corrections proceeds as usual and, once the database with the QP corrections has been generated, the WANT tools (bands.x, dos.x, cmplx_bands.x) can be called, specifying the QP database in the input file (among other keywords). Importantly, this procedure relies on the internal symmetrisation of k-points in the irreducible Brillouin zone performed by WANT. Overall, the procedure is very simple and quite user-friendly. (Release Yambo 4.2)
Parallel linear algebra distribution of response functions
One of the most serious bottlenecks identified in GW calculations is the accumulation and storage in memory of the frequency-dependent dielectric matrix X. This is commonly obtained as the solution of a Dyson equation for the response function, and the size of the dielectric matrix is related to the number of plane waves used to represent the system under study. In the latest release we implemented an advanced slicing of the response function, which is efficiently distributed in memory and workload. This slicing is a unique feature of Yambo, obtained by coding an ad hoc interface between the slicing and the BLACS/ScaLAPACK structure. In this way all steps of the Dyson equation are distributed in memory and, together with the use of the ScaLAPACK library, the dielectric matrix is never fully allocated in memory. This new feature allows us to overcome one of the most serious difficulties of many-body ab initio calculations. (Release Yambo 4.3)
Extension of the p2y interface with QUANTUM ESPRESSO
P2Y is the main utility interfacing Yambo with Quantum ESPRESSO. Its function is to read, check and import all the data (such as crystal structure, electronic-structure info, energies and wavefunctions) produced by the preliminary DFT steps. We have extended the p2y capabilities by implementing: (i) auto-recognition of the Quantum ESPRESSO data formats; (ii) parallel execution of p2y (parallelism is expressed at the MPI level over the conversion of wavefunction data blocks; this feature is particularly relevant when large-scale systems are addressed, whose wavefunction data volume easily exceeds 100 GB); and (iii) handling of non-linear core corrections (NLCC) and USPP data, pseudopotential options that broaden the class of potentials Yambo can deal with. (Release Yambo 4.3)
A real-time approach to the solution of the Bethe-Salpeter equation and non-linear optics
The evaluation of the Bethe-Salpeter kernel and the solution of the corresponding linear problem represent one of the most cumbersome and heavy steps in the evaluation of many-body effects on optical properties. Despite its complexity, the solution of the Bethe-Salpeter equation (BSE) is, de facto, the most up-to-date and accurate approach to calculate with sharp precision the optical and electron-energy-loss spectra for a wealth of materials. Two main bottlenecks combine to make the solution of the BSE cumbersome: the first is the calculation of the screened Coulomb interaction; the second is the actual storage of the Bethe-Salpeter kernel (BSK) and the solution of the linked linear problem. Regarding the first point, several lines of coding and development in Yambo are active to reduce the workload of the calculation of the screened interaction; the approaches in this direction are based on a more intensive use of the parallel infrastructure and on advanced termination techniques. The linear size of the BSK can be as large as hundreds of thousands for reasonably small nanostructures and, for realistic materials, can even reach a million: it is immediately clear that such a matrix cannot be stored on disk. In Yambo, the problem of the BSK is tackled in two different ways: one is an advanced and combined use of linear-algebra tools and parallel memory distribution; the other, totally alternative, is based on a reformulation of the problem using non-equilibrium many-body perturbation theory, which reduces the solution of the BSE to a real-time problem where the kernel is never calculated explicitly. In particular, Yambo can now calculate the polarization function P(t) directly in real time.
The same procedure is used to describe the time evolution of the system under the action of ultra-short laser pulses (a very active research field) and thus the generation of non-equilibrium carrier populations, the time-dependent magnetization, and other real-time properties of the system that can be deduced from the non-equilibrium density matrix. The real-time approach implemented in Yambo is also exploited to calculate the non-linear response function, via the time propagation of single-particle wave-functions (in collaboration with C. Attaccalite and M. Grüning). (Release Yambo 4.3)
Wigner-Seitz truncation scheme
The use of a truncated Coulomb potential makes it possible to achieve faster convergence with respect to the vacuum size, eliminating the interaction between repeated images. A new procedure, the Wigner-Seitz truncation scheme, has recently been implemented in Yambo. In this scheme the Coulomb interaction is truncated at the edge of the Wigner-Seitz super-cell compatible with the k-point sampling. This truncated Coulomb potential turns out to be suitable for finite systems as well as for 1D and 2D systems, provided that the supercell size, determined by the adopted k-point sampling, is large enough to obtain converged results. (Release Yambo 4.3)
Alternative formulation and computation of dipole matrix elements
Yambo implements new strategies to calculate the dipole matrix elements. The first one is the so-called "shifted grids" approach. This procedure uses wave-functions computed on four different grids in k-space, i.e., a starting k-grid plus three grids slightly shifted along the three Cartesian directions. This approach is computationally more efficient, although it requires the generation of more wave-functions. The second one is the "covariant" approach, which exploits the definition of the position operator in k-space. The dipole matrix elements are then evaluated as finite differences between the k-points of a single regular grid. A five-point midpoint formula is used, with a truncation error O(k^4). Finally, for finite systems, the dipole matrix elements can be directly evaluated in real space (R-space x approach). For details see D. Sangalli et al., J. Phys.: Condens. Matter 31 (2019) 325902. (Release Yambo 4.3)
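The five-point midpoint stencil behind the covariant approach is a standard finite-difference formula; applied to a scalar function it exhibits the quoted O(k^4) truncation error, as the sketch below verifies by halving the step.

```python
# Five-point midpoint derivative with O(h^4) truncation error, as used by the
# "covariant" dipole scheme (here applied to an ordinary scalar function).
import numpy as np

def deriv5(f, x, h):
    return (-f(x + 2*h) + 8*f(x + h) - 8*f(x - h) + f(x - 2*h)) / (12*h)

x0 = 0.7
err_h  = abs(deriv5(np.sin, x0, 1e-2) - np.cos(x0))
err_h2 = abs(deriv5(np.sin, x0, 2e-2) - np.cos(x0))
print(err_h, err_h2)   # halving h reduces the error by ~16x (fourth order)
```

In the code, the role of h is played by the spacing of the regular k-grid, so denser grids converge the dipoles very quickly.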
New Interpolation driver
The old interpolation routines in ypp have been reorganized and modularized (a new INTERPOLATION_driver has been added). The "Nearest Neighbour" and "Boltzmann" interpolation methods have been coded. (Release Yambo 4.4)
During the first phase of the MAX project, SIESTA has seen improvements in several areas. A number of modules and interfaces have been refactored, new solvers have been added, and performance levels increased. In addition, new physical functionalities have been made available. Note that SIESTA development has traditionally followed a conservative release model, with three major branches, “stable/beta/development”, available to users at any given time. In the Launchpad platform, the current “stable” branch is 4.0, the “beta” is 4.1, and the “development” branch is ‘trunk’. Some of the developments mentioned below are already integrated in 4.1, and some others in ‘trunk’. A few developments, while functional, have not yet been merged into ‘trunk’, but are nevertheless available in the platform to interested users.
Enhanced control of quantum engine operations through an embedded Lua interpreter
In order to provide fine-grained run-time control of parts of the code and/or to change implicit parameters while the code is running, we have implemented a scheme based on an embedded Lua interpreter. The Lua interpreter can process scripts that query the internal state of a SIESTA instance and modify the flow of operations. Currently the Lua interface can:
Query and/or change any of the internal energy components.
Set the values of the atomic coordinates and cell parameters. This feature enables the implementation of a variety of custom geometry relaxation and molecular-dynamics algorithms, all through Lua scripts, without touching the SIESTA code itself and without recompilation.
The ability to manipulate the coordinates also allows a much richer set of user-defined constraints in internal geometry MD or relaxations.
Adapt convergence criteria while running. This is convenient due to the complex nature of quantum engine convergence. There are numerous parameters that determine the convergence criteria, convergence path and speed. The mixing parameters and methods can be changed at will during the self-consistent cycle, allowing the testing and implementation of new heuristics much more efficiently.
This scripting interface, while not an API in the standard sense, provides the required flexibility to truly control the quantum engine on-the-fly. (Branch 4.1)
Refactoring of diagonalization-based solver interfaces
All the available diagonalization methods and options have been collected in a single module, with enhanced compatibility checks and a streamlined logic flow. One important by-product of the refactoring has been a reduction of the memory footprint in most cases. Secondly, more diagonalization methods are now available for all parts of the code, e.g. the k-point implementation of 2-stage solvers, MRRR, and ELPA. (Branch 4.1)
Enhanced parallelization opportunities
Hybrid OpenMP/MPI parallelization has been introduced in SIESTA. Currently, the matrices involved in the diagonalization step fit in the memory of a typical multi-core node even for systems of around a thousand atoms. This means that, for all practical purposes, calculations that need k-point sampling can be parallelized over k-points with MPI over the appropriate number of nodes, and within each node with OpenMP. (Branch 4.1)
Integration of the CheSS linear-scaling solver
The CheSS library has been integrated into SIESTA. The CheSS solver uses an expansion based on Chebyshev polynomials to calculate the density matrix, thereby exploiting the sparsity of the overlap and Hamiltonian matrices. It works best for systems with a finite HOMO-LUMO gap and a small spectral width. CheSS exhibits a two-level parallelization using MPI and OpenMP and can scale to many thousands of cores. It has been made into a stand-alone library starting from the original codebase within BigDFT, and can be downloaded and installed freely from here. (Branch ‘trunk’)
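The Chebyshev-expansion idea behind CheSS can be illustrated on a dense toy matrix: the finite-temperature density matrix f(H) (a Fermi function of the Hamiltonian) is built from Chebyshev polynomials of H using only matrix-matrix products, which is what lets a sparse implementation exploit the sparsity of H. The spectral bounds and the smearing value below are illustrative choices.

```python
# Sketch of a Chebyshev expansion of the Fermi function of a matrix,
# checked against direct diagonalization (dense toy matrices).
import numpy as np

def fermi(x, mu=0.0, beta=2.0):
    return 1.0 / (1.0 + np.exp(beta * (x - mu)))

rng = np.random.default_rng(5)
n = 30
H = rng.standard_normal((n, n)); H = (H + H.T) / 2
emin, emax = -12.0, 12.0               # bounds on the spectrum (assumed known)
Hs = (2*H - (emax + emin)*np.eye(n)) / (emax - emin)   # map spectrum to [-1,1]

# Chebyshev coefficients of f on [-1,1] from values at Chebyshev nodes
K = 200
theta = (np.arange(K) + 0.5) * np.pi / K
fvals = fermi((np.cos(theta) * (emax - emin) + emax + emin) / 2)
c = 2.0 / K * np.array([np.sum(fvals * np.cos(k * theta)) for k in range(K)])
c[0] *= 0.5

# Sum the matrix series with the three-term recurrence T_{k+1} = 2 Hs T_k - T_{k-1}
T_prev, T_cur = np.eye(n), Hs
D = c[0]*T_prev + c[1]*T_cur
for k in range(2, K):
    T_prev, T_cur = T_cur, 2*Hs @ T_cur - T_prev
    D += c[k] * T_cur

# Reference: f(H) via direct diagonalization
e, U = np.linalg.eigh(H)
D_exact = U @ np.diag(fermi(e)) @ U.T
print(np.max(np.abs(D - D_exact)))
```

The expansion length needed grows as the gap (or smearing width) shrinks relative to the spectral width, which is why CheSS works best for gapped systems with a small spectral width, as noted above.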
Integration of the libPSML and LibXC libraries and support of the PSML pseudopotential format
We have implemented an interface to the libPSML library that enables the processing of PSML files by SIESTA. This opens the door to the use of curated databases of pseudopotentials (notably from the oncvpsp code), enhancing the user experience and the interoperability of the code. Moreover, we have integrated the libGridXC DSL for exchange-correlation. As part of the work on enabling the use of PSML pseudopotentials, we have made the former SiestaXC library stand-alone, and further extended it by providing an interface to the de facto standard libXC library. (Branch ‘psml-support’)
Interface to the ELSI library
We have developed for SIESTA an interface to the ELSI library that provides transparent access to several high-performance solvers (ELPA, OMM, PEXSI) with minimal code changes. The interface works currently for collinear-spin calculations, and a spin-orbit version is nearing completion. With this interface, SIESTA can immediately benefit from the availability of new solvers in ELSI (for example, no further changes are needed to exploit the recently added NTPoly linear-scaling solver using density-matrix-purification). (Branch ‘trunk-elsi-dm’)
Implementation of full-Hamiltonian spin-orbit interaction
A complete rewrite of the spin-orbit coupling functionality has been carried out. The formalism uses lj projectors instead of the “scalar-relativistic plus spin-orbit” scheme, and exhibits also improved convergence and stability. This development will allow a more robust simulation of different systems in which the spin-orbit interaction is crucial. These systems include exotic materials like topological insulators, technologically relevant 2D materials like transition metal dichalcogenides, or photovoltaic materials like hybrid organic-inorganic perovskites. (Branch ‘trunk’)
Implementation of real-time TDDFT and electron-ion non-adiabatic coupled dynamics
The real-time formulation of time-dependent density-functional theory, TD-DFT(t), has been effectively parallelised in SIESTA, bringing it to the performance level of regular Born-Oppenheimer SIESTA, including k-point sampling, spin polarization, etc. The formalism used for the wave-function evolution in the presence of moving basis sets [Todorov, T. N. Time-dependent tight binding. Journal of Physics: Condensed Matter 13, 10125 (2001)] has been revised [Artacho, E. & O’Regan, D. D. Quantum mechanics in an evolving Hilbert space. Phys. Rev. B 95, 115155 (2017)], and this has led to a better optimisation of the time integrators. (Branch ‘trunk’)
Integration of linear-response functionality
SIESTA now includes an implementation of Density Functional Perturbation Theory (DFPT). The present implementation of DFPT was coded with the aim of calculating the vibrational properties of clusters and solids via the force-constant (FC) matrix, i.e., the second derivatives of the energy with respect to atomic displacements. This implementation is based on the old LINRES module by J. M. Pruneda, J. Junquera and P. Ordejon for the SIESTA-0.12 version (J. M. Pruneda, Ph.D. thesis, Universidad de Oviedo (2002)). For the calculation of the FC matrix, a standard SIESTA run is performed in order to calculate the unperturbed density matrix. Once the system reaches the self-consistent solution, the LINRES method is called. The density matrix is calculated using DFPT for each perturbed atom (an external loop over the perturbed atoms) and the contributions to the second energy derivative are added. It is important to note that the calculated perturbed density matrix for each perturbed atom is independent of the rest. Once the atom loop ends, the FC matrix is written to a file that can be processed to extract phonon frequencies and eigenvectors. (Branch ‘linres-parallel’)
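The final post-processing step mentioned above, from the FC matrix to phonon frequencies, amounts to diagonalizing the mass-weighted FC matrix. A minimal worked example: two masses coupled by a spring, which has one zero (translational) mode and one stretching mode of frequency sqrt(k(1/m1 + 1/m2)).

```python
# From the force-constant matrix to phonon frequencies: diagonalize the
# mass-weighted FC matrix (the q=0 dynamical matrix). Toy two-mass example.
import numpy as np

k, m1, m2 = 4.0, 1.0, 3.0
FC = np.array([[ k, -k],
               [-k,  k]])              # second derivatives of the energy
M = np.array([m1, m2])
D = FC / np.sqrt(np.outer(M, M))       # mass-weighted dynamical matrix
w2 = np.linalg.eigvalsh(D)             # eigenvalues are squared frequencies
freqs = np.sqrt(np.clip(w2, 0.0, None))
print(freqs)                           # 0 and sqrt(k * (1/m1 + 1/m2))
```

The clip guards against tiny negative eigenvalues from round-off on the acoustic (zero-frequency) modes.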
Added the cold-smearing method for the zero-temperature limit
To further extend the options available for improving convergence in the treatment of metals, the cold-smearing method by Marzari and Vanderbilt [Marzari, N., Vanderbilt, D., De Vita, A. & Payne, M. C. Thermal contraction and disordering of the Al(110) surface. Phys. Rev. Lett. 82, 3296–3299 (1999)] has been implemented. This joins the existing choices of Fermi-Dirac and Methfessel-Paxton smearing. (Branch ‘trunk’)
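For reference, the cold-smearing occupation function can be written in closed form; the sketch below follows the Marzari-Vanderbilt form as commonly implemented in plane-wave codes (the exact expression used inside SIESTA is not quoted here) and checks its zero/one limits far from the Fermi level.

```python
# Marzari-Vanderbilt "cold smearing" occupation function (sketch).
import math

def cold_smearing_occ(e, mu, sigma):
    """Occupation of a level at energy e, Fermi level mu, smearing width sigma."""
    x = (mu - e) / sigma - 1.0 / math.sqrt(2.0)
    return 0.5 * math.erf(x) + math.exp(-x * x) / math.sqrt(2.0 * math.pi) + 0.5

print(cold_smearing_occ(-10.0, 0.0, 0.1))  # deep below E_F -> ~1
print(cold_smearing_occ(+10.0, 0.0, 0.1))  # far above E_F -> ~0
```

Its appeal is that the smearing contribution to the free energy vanishes to higher order in sigma, so total energies are nearly independent of the smearing width.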
Hybrid MPI+OpenMP parallelisation
Profiling identified the construction of the Hamiltonian and overlap matrices and the calculation of the core-electron charge representation as two of the main code sections featuring limited parallel scaling. Tests showed that this was due to the increased memory and bandwidth demands of the MPI-only parallelisation. To reduce this demand, additional OpenMP parallelisation layers have been implemented. (Release 1)
Wrappers and calls to domain specific libraries
To exploit the high performance of vendor-provided multithreaded domain-specific libraries, an extensive effort has been undertaken to identify code parts that can benefit from calls to such libraries. In particular, calls to the BLAS library have been introduced, and tests indicated significant performance gains. Profiling also identified the diagonalisation as a code section critical to parallel performance, limited by the performance characteristics of the ScaLAPACK library used. To increase the scalability of this part of the program, wrappers have been introduced that enable the use of further DSLs providing the required functionality with more favourable performance. In detail, the utilization of the ELPA library, the ELEMENTAL library, and an in-house iterative solver has been realized in this way. (Release 1)
I/O revamping, based on XML and HDF5
The whole I/O structure of FLEUR has been redesigned from the ground up, based on XML and HDF5 concepts and libraries.
XML
To support the development of tools interfacing FLEUR and to enhance the software ergonomics of the code a validation-ready, machine- and human-readable XML I/O schema has been developed. As an alternative to the conventional multiple file input for FLEUR a single unifying XML based input file has been defined and implemented in terms of an XML Schema. This has been implemented in two code parts: i) the input generator now generates this XML input file starting from very basic data on the unit cell, and sets many parameters to reasonable defaults. ii) This XML file is used to specify all calculation parameters used by FLEUR and is validated against the associated XML Schema definition to enable the easy detection of input errors. An XML output file has been defined and implemented analogously. Though not required for FLEUR calculations, the XML Schema file defining the output file format is provided as part of the FLEUR code. The FLEUR AiiDA plugin is based on this XML I/O scheme.
HDF5 and distributed in-memory storage
So far, FLEUR used binary direct-access Fortran I/O to store eigenvectors. This cannot be used in parallel calculations, where a parallel I/O scheme is needed, which triggered a complete redesign of the eigenvector I/O. We created a common interface and implemented different I/O strategies: (i) the old Fortran direct-access I/O, as this is most efficient on small machines; (ii) an HDF5-based file I/O that can also be used with shared-memory parallelism. As parallel I/O from all nodes into a single file, as provided by HDF5, can still lead to scaling bottlenecks on some computers with many compute nodes (for example BlueGene/Q), we also implemented an alternative storage scheme: instead of doing disk I/O, we keep the data in working memory. This storage scheme relies on a distribution of the data over all nodes and uses one-sided MPI communication for the distribution and collection of the eigenvectors. It is implemented using the same interface as the two file I/O modes, and the user can choose between these modes via command-line switches. (Release 1)
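The design pattern described above, one interface with interchangeable storage backends, can be sketched compactly. The class and method names below are invented for illustration; FLEUR's actual interface is in Fortran and additionally includes the HDF5 and MPI one-sided backends.

```python
# Sketch of a common eigenvector-storage interface with two interchangeable
# backends: in-memory storage and binary file records (names are illustrative).
import abc, os, tempfile
import numpy as np

class EigenvectorStore(abc.ABC):
    @abc.abstractmethod
    def write(self, ik: int, z: np.ndarray): ...
    @abc.abstractmethod
    def read(self, ik: int) -> np.ndarray: ...

class MemoryStore(EigenvectorStore):
    """Keep eigenvectors in working memory (no disk I/O)."""
    def __init__(self):
        self._data = {}
    def write(self, ik, z):
        self._data[ik] = z.copy()
    def read(self, ik):
        return self._data[ik]

class FileStore(EigenvectorStore):
    """Binary direct-access-style storage, one record (file) per k-point."""
    def __init__(self, shape):
        self._shape = shape
        self._dir = tempfile.mkdtemp()
    def _path(self, ik):
        return os.path.join(self._dir, f"eig_{ik}.bin")
    def write(self, ik, z):
        z.astype(np.float64).tofile(self._path(ik))
    def read(self, ik):
        return np.fromfile(self._path(ik)).reshape(self._shape)

# The rest of the code sees only the common interface:
rng = np.random.default_rng(6)
z = rng.standard_normal((8, 8))
for store in (MemoryStore(), FileStore(z.shape)):
    store.write(0, z)
    assert np.allclose(store.read(0), z)
```

Because the numerical kernels talk only to the abstract interface, switching backends (here by constructing a different object, in FLEUR by a command-line switch) requires no changes elsewhere.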
Refactoring of the setup of the Hamiltonian and overlap matrices
A significant part of the execution time is spent in the generation of the Hamiltonian and overlap matrices. In order to facilitate the performance tuning of FLEUR and to encapsulate its functionality, the relevant parts have been further redesigned and encapsulated. The main challenge in this process is the zoo of different functionalities implemented in the code, which leads to many special cases that need to be considered and dealt with. While the simplest case of non-spin-polarized calculations without additional local basis functions is easy to consider and implement, the inclusion of non-collinear magnetism, spin-orbit coupling, and the use of additional local orbitals of various kinds requires a carefully designed refactoring process. Several tests have been performed to identify a suitable path, and the corresponding steps have been implemented to refactor this code part. (Release 2.1)
Redesign of the IO for the charge density and the potential
Besides the IO of the eigenvalues and eigenvectors, which was already redesigned in Release 1, we also implemented a new IO module for the charge density and the potential. Currently, these quantities can be written into binary files using standard Fortran IO functionality; in addition, an HDF5-based version of this IO functionality has been implemented. This development aimed at several goals: (i) better modularity of the code base, with all IO-relevant routines encapsulated into separate modules to increase maintainability and portability; (ii) use of modern, transferable standard libraries such as HDF5; (iii) the possibility to offer different IO schemes behind a clearly defined interface between the numerically intensive parts of the code and the IO. In this context we also aim at offering the possibility to avoid IO completely by keeping data in various levels of memory. (Release 2.1)
Performance enhancements for large setups
The performance enhancements and modifications of the most computationally expensive parts performed for Release 1 improved the scalability of the code significantly. As a result, other code sections, which initially did not require a significant share of the computational time, became bottlenecks for the scalability of the code. Hence, we implemented several small changes that help improve this situation: (i) we replaced our Fortran FFT code by calls to FFT routines available in optimized libraries; (ii) we used MPI+OpenMP parallelization to speed up the calculation of the potential; (iii) we also parallelized parts of the setup routines, in particular the calculation of the step function, which could consume significant time. While this is only performed once per self-consistency cycle, a serial setup should not take too much time in large parallel jobs. For a large test system, these modifications enabled the efficient use of more than 64 nodes with 24 cores each (CLAIX, RWTH Aachen). On JUQUEEN (Forschungszentrum Juelich), an IBM BlueGene supercomputer, we could demonstrate scaling to more than 16000 threads. (Release 2.1)
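The benefit of swapping hand-written FFT code for an optimized library routine comes from the algorithmic complexity gap: a naive discrete Fourier transform costs O(n²), while an FFT costs O(n log n). The sketch below (Python, not FLEUR's actual Fortran code) contrasts the two on the same input; the textbook radix-2 Cooley-Tukey recursion stands in for what an optimized library provides.

```python
import cmath

def dft(x):
    """Naive O(n^2) discrete Fourier transform, standing in for
    hand-written transform code."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def fft(x):
    """Radix-2 Cooley-Tukey FFT, O(n log n), standing in for an
    optimized library routine. Requires len(x) to be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])       # recurse on halves
    tw = [cmath.exp(-2j * cmath.pi * k / n) * odd[k] for k in range(n // 2)]
    return ([even[k] + tw[k] for k in range(n // 2)] +
            [even[k] - tw[k] for k in range(n // 2)])
```

Both routines compute the same transform; in production the point of the change is that vendor libraries add cache blocking, vectorization, and threading on top of the O(n log n) algorithm.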
Modularisation and refactoring
With the modularisation of the code for Hamiltonian generation and diagonalization having been the main target of previous refactoring efforts, we focused on the remaining code parts for this MAX release. In particular, we redesigned the basic charge density mixing routines of the code to achieve a flexible and transparent interface for future implementations of advanced preconditioners. In addition, the implementation of the XML IO has been extended to cover many specialised program features so far controlled by separate files. This allows such features to be used within the FLEUR-AiiDA interface in complex workflows. On a more fundamental level, several new data types have been added to the code, allowing more concise formulations of the algorithms and cleaner APIs. Some basic operations on these types have been implemented as type-bound procedures to increase data locality and encapsulation and to push the use of modern programming paradigms. (Release 3)
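The role of a mixing routine, and why a clean interface matters for plugging in preconditioners, can be shown with a minimal sketch (Python for brevity; function names are hypothetical, not FLEUR's API). A linear mixing step updates the input density with a damped residual, and the preconditioner appears as an optional hook acting on that residual:

```python
def simple_mixing(rho_in, rho_out, alpha=0.3, precondition=None):
    """One linear mixing step: rho_next = rho_in + alpha * P(rho_out - rho_in).
    `precondition` is the hook where an advanced preconditioner P plugs in;
    with P = identity this is plain simple mixing."""
    residual = [o - i for i, o in zip(rho_in, rho_out)]
    if precondition is not None:
        residual = precondition(residual)
    return [i + alpha * r for i, r in zip(rho_in, residual)]

def scf(f, rho0, alpha=0.3, tol=1e-10, maxiter=200):
    """Toy self-consistency loop: iterate the density map `f` to its
    fixed point, driven by the mixer."""
    rho = rho0
    for _ in range(maxiter):
        rho_out = f(rho)
        if max(abs(o - i) for i, o in zip(rho, rho_out)) < tol:
            return rho_out
        rho = simple_mixing(rho, rho_out, alpha)
    return rho
```

Because the mixer only sees input density, output density, and a residual transform, a Kerker-like or quasi-Newton preconditioner can later be swapped in without touching the surrounding self-consistency loop.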
Performance enhancements in magnetic calculations
While the previous releases of FLEUR already achieved significant performance gains and demonstrated improved scaling behaviour on many cores, in this release we focussed on transferring the knowledge obtained to more complex use cases. In particular, we investigated the differences between the standard non-magnetic or collinear magnetic case and setups in which non-collinear magnetism has to be treated. Different cases have to be considered here, each introducing additional complexity in the algorithm and additional needs for optimisation and parallelisation:
• The use of spin-orbit coupling in a second-variation scheme. Here the parallelisation efforts on the setup of the Hamiltonian and overlap matrices only affect the first-variation step, and additional effort was needed to achieve a similar level of parallel efficiency in the second-variation step.
• The calculation of general non-collinear structures includes a spin rotation of the Hamiltonian in the spheres, which interacts with the optimised setup and had to be adjusted.
• The treatment of spin spirals using the generalised Bloch theorem is even more complex, as it includes additional spin loops and hence introduces additional computational cost in the Hamiltonian generation.
Performance optimisation for all these special simulation setups has been carried out by extending the strategies used before to these code parts and by adjusting the structure of the code accordingly. The suite of regression tests has been extended to cover these aspects and to allow constant monitoring of the correctness of the implemented changes. (Release 3)
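The spin rotation mentioned for the non-collinear case acts on the 2x2 spin block of the Hamiltonian: a local, atom-specific spin frame with polar angles (theta, phi) is rotated into the global frame via an SU(2) matrix, H = U H_loc U†. The sketch below (Python, one common convention for U; not FLEUR's actual routines) illustrates the transformation on a single 2x2 block:

```python
import cmath

def spin_rotation(theta, phi):
    """SU(2) spin-rotation matrix for frame angles (theta, phi),
    in one common convention."""
    c, s = cmath.cos(theta / 2), cmath.sin(theta / 2)
    ep, em = cmath.exp(1j * phi / 2), cmath.exp(-1j * phi / 2)
    return [[c * ep,  s * em],
            [-s * ep, c * em]]

def matmul2(a, b):
    """2x2 complex matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(2))
             for j in range(2)] for i in range(2)]

def dagger(a):
    """Conjugate transpose of a 2x2 matrix."""
    return [[a[j][i].conjugate() for j in range(2)] for i in range(2)]

def rotate_spin_block(h_local, theta, phi):
    """Rotate a 2x2 spin block of the Hamiltonian from the local
    (atom-specific) spin frame to the global frame: H = U H_loc U^dagger."""
    u = spin_rotation(theta, phi)
    return matmul2(matmul2(u, h_local), dagger(u))
```

In the full code this rotation is applied inside the matrix setup for every sphere with its own quantisation axis, which is exactly where it interacts with the optimised Hamiltonian setup.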
EELS module
The development of the new Electron Energy Loss Spectroscopy (EELS) module has been finalized, and this feature is now available for production runs of FLEUR. The calculation of the required matrix elements between core and valence states, as well as the proper summation over these transitions, is implemented and has been successfully tested. The corresponding modules of the code can be found in the 'eels' directory of the source. (Release 3)
Higher-dimensional Wannier functions
The code implementing higher-dimensional Wannier functions is now included in the third MAX release. This code interfaces with special versions of the Wannier90 program that are able to deal with Wannier functions constructed from more than the usual Bloch states on the 3D k-point grid. Hence, a 4D or 5D parameter space is used for the generalized construction, and the evaluation of the corresponding matrix elements has been implemented in FLEUR. (Release 3)