Preparing for Exascale: Aurora Software Development – Packaging and Early Hardware

By coordinating efforts to improve the stability of early exascale hardware at the Argonne Leadership Computing Facility (ALCF), computer scientist Servesh Muralidharan strives to make it easier for application developers to use the Aurora test beds at Argonne’s Joint Laboratory for System Evaluation (JLSE). His work will facilitate a faster transition to the Aurora system upon delivery and help enable science from day one.

JLSE, a collaboration between Argonne’s computing divisions, has enabled researchers to refine several generations of Intel GPU test beds as the arrival of Aurora approaches. These test beds consist of preproduction GPU/CPU samples, often designed for verification and validation of key architectural features.

JLSE test beds

Muralidharan’s test beds are used to develop applications that can run optimally on the Intel Xe GPUs and Sapphire Rapids CPUs targeted for Aurora.

As with any early hardware, making these parts usable (i.e., ensuring that they can run applications well) is a difficult process. Intel provides specialized driver components and software development kits (SDKs), including compilers, to run applications on the early GPUs. These components and SDKs require customization to work in the JLSE environment and to be usable by application developers participating in the ALCF Early Science Program and the US Department of Energy (DOE) Exascale Computing Project. This customization is accomplished by working with multiple teams at Intel.

Through his background in computer science, Muralidharan has experience working with a variety of early hardware, including characterizing its stability and performance in order to validate its intended behavior. His work over the last year with several revisions of the Intel GPU test bed silicon has given him a better understanding of the system components and their interactions, so that he is able to maintain them and, if necessary, reconfigure them.

Coordinating hardware and software deployment

Argonne Computer Scientist Servesh Muralidharan

Muralidharan’s role with regard to packaging and early hardware, in which he coordinates efforts to deploy early hardware and software with JLSE’s systems operations teams, is twofold.

First, after the systems operations teams install and configure a server, Muralidharan creates custom driver stacks and validates the hardware’s behavior, then takes on the challenge of building usable software stacks on top of the hardware. He works through the different components until he reaches the SDK, where the compilers reside.

Second, once a usable test bed is in place, Muralidharan helps diagnose low-level issues resulting from day-to-day use of the system. These problems range from specific code causing a hardware failure to unexpected degradation in performance resulting from driver issues. Once the source of a problem is identified, Muralidharan works with the corresponding Intel team to resolve it and to assess the appropriate fixes on the JLSE test bed hardware.


Source: Nils Heinonen, scientific editor / editor-in-chief, Argonne National Laboratory

