Files in This Item:
|File||Description||Size||Format|
|thesis-src.zip||File not available for download||12.65 MB||Unknown|
|Edwards2010.pdf||PhD thesis||3.57 MB||Adobe PDF|
|Title: ||Optimising a fluid plasma turbulence simulation on modern high performance computers|
|Authors: ||Edwards, Thomas David|
|Supervisor(s): ||Hein, Joachim|
|Issue Date: ||2010|
|Publisher: ||The University of Edinburgh|
|Abstract: ||Nuclear fusion offers the potential of almost limitless energy from seawater and lithium without
the dangers of carbon emissions or long term radioactive waste. At the forefront of fusion
technology are the tokamaks, toroidal magnetic confinement devices that contain miniature
stars on Earth. Nuclei can only fuse by overcoming the strong electrostatic forces between
them, which requires high temperatures and pressures. The temperatures in a tokamak are
so great that the Deuterium-Tritium fusion fuel forms a plasma which must be kept hot and
under pressure to maintain the fusion reaction. Turbulence in the plasma causes disruption
by transporting mass and energy away from the core, reducing the efficiency of the reaction.
Understanding and controlling the mechanisms of plasma turbulence is key to building a fusion
reactor capable of producing sustained output.
The extreme temperatures make detailed empirical observations difficult to acquire, so numerical
simulations are used as an additional method of investigation. One numerical model
used to study turbulence and diffusion is CENTORI, a direct two-fluid magneto-hydrodynamic
simulation of a tokamak plasma developed by the Culham Centre for Fusion Energy (CCFE
formerly UKAEA:Fusion). It simulates the entire tokamak plasma with realistic geometry,
evolving bulk plasma quantities like pressure, density and temperature through millions of
timesteps. This requires CENTORI to run in parallel on a Massively Parallel Processing (MPP)
supercomputer to produce results in an acceptable time.
Any improvement in CENTORI’s performance increases the rate and/or total number of
results that can be obtained from access to supercomputer resources. This thesis presents the
substantial effort to optimise CENTORI on the current generation of academic supercomputers.
It investigates and reviews the properties of contemporary computer architectures, then
proposes, implements and executes a benchmark suite of CENTORI’s fundamental kernels.
The suite is used to compare the performance of three competing memory layouts of the primary
vector data structure using a selection of compilers on a variety of computer architectures.
The results show that no single memory layout is optimal on all platforms, so a flexible
strategy was adopted to pursue “portable” optimisation, i.e. optimisations that can easily be
added, adapted or removed on future platforms depending on their performance.
This required designing an interface of functions and datatypes that separates CENTORI’s
fundamental algorithms from repetitive, low-level implementation details. This approach offered
multiple benefits: representing CENTORI’s core equations as mathematical expressions in
Fortran source code allows rapid prototyping and development of new features; reducing the
total data volume by a factor of three cuts the data transferred over the memory bus to almost
a third; and reducing the number of intense floating point kernels lowers the effort of
optimising the application on new platforms.
The project proceeds to rewrite CENTORI using the new Application Programming Interface
(API) and evaluates two optimised implementations. The first is a traditional library
implementation that uses hand-optimised subroutines to implement the library functions. The
second uses a dynamic optimisation engine to perform automatic stripmining to improve the
performance of the memory hierarchy. The automatic stripmining implementation uses lazy
evaluation to delay calculations until absolutely necessary, allowing it to identify temporary
data structures and minimise them for optimal cache use. This novel technique is combined
with highly optimised implementations of the kernel operations and optimised parallel communication
routines to produce a significant improvement in CENTORI’s performance. The
maximum measured speed-up of the optimised versions over the original code was 3.4 times
on 128 processors on HPCx, 2.8 times on 1024 processors on HECToR and 2.3 times on 256
processors on HPC-FF.|
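The lazy-evaluation and automatic strip-mining idea described in the abstract can be sketched roughly as follows. This is a hypothetical Python illustration only, not CENTORI’s actual Fortran API: the `Lazy` class, the `lazy()` helper and the `STRIP` size are all invented for the sketch. The point is that operator overloading builds a deferred expression tree instead of computing immediately, and forcing the expression evaluates it one cache-sized strip at a time, so no full-size temporary arrays are ever materialised.

```python
STRIP = 1024  # strip length, assumed to be tuned to fit the cache

class Lazy:
    """A deferred elementwise expression over equal-length sequences."""
    def __init__(self, fn, operands):
        self.fn = fn              # elementwise kernel applied per strip
        self.operands = operands  # Lazy sub-expressions or raw sequences
        self.n = len(operands[0])

    def __len__(self):
        return self.n

    def __add__(self, other):
        return Lazy(lambda a, b: [x + y for x, y in zip(a, b)], [self, other])

    def __mul__(self, other):
        return Lazy(lambda a, b: [x * y for x, y in zip(a, b)], [self, other])

    def strip(self, lo, hi):
        """Evaluate only elements [lo, hi); recursion fuses the whole tree."""
        args = [op.strip(lo, hi) if isinstance(op, Lazy) else op[lo:hi]
                for op in self.operands]
        return self.fn(*args)

    def evaluate(self):
        """Force the expression, one cache-sized strip at a time."""
        out = []
        for lo in range(0, self.n, STRIP):
            out.extend(self.strip(lo, min(lo + STRIP, self.n)))
        return out

def lazy(seq):
    """Wrap a raw sequence as a leaf of the expression tree."""
    return Lazy(lambda x: x, [seq])

n = 5000
p, q, r = [1.0] * n, [2.0] * n, [3.0] * n
expr = lazy(p) + lazy(q) * lazy(r)   # builds an expression tree; no arithmetic yet
result = expr.evaluate()             # p + q*r, computed strip by strip
```

Delaying evaluation until `evaluate()` is what lets such an engine see the whole expression at once, identify the temporaries, and size them to the cache, which is the behaviour the abstract attributes to the dynamic optimisation engine.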
|Keywords: ||Massively Parallel Processing supercomputer; automatic stripmining implementation|
|Appears in Collections:||Physics thesis and dissertation collection|
Items in ERA are protected by copyright, with all rights reserved, unless otherwise indicated.