
dc.contributor.advisor: O'Boyle, Michael
dc.contributor.advisor: Dubach, Christopher
dc.contributor.author: Magni, Alberto
dc.date.accessioned: 2016-06-01T08:51:49Z
dc.date.available: 2016-06-01T08:51:49Z
dc.date.issued: 2016-06-27
dc.identifier.uri: http://hdl.handle.net/1842/15831
dc.description.abstract: In the last decade graphics processors (GPUs) have been extensively used to solve computationally intensive problems, and hardware manufacturers have shipped a variety of GPU architectures within just a few years. OpenCL has been introduced as the standard cross-vendor programming framework for GPU computing. Writing and optimising OpenCL applications is a challenging task: the programmer has to take care of several low-level details. This is even harder when the goal is to improve performance on a wide range of devices, since OpenCL does not guarantee performance portability. In this thesis we focus on the analysis and the portability of compiler optimisations. We describe the implementation of a portable compiler transformation: thread-coarsening. The transformation increases the amount of work carried out by a single thread running on the GPU, with the goal of reducing the number of redundant instructions executed by the parallel application. The first contribution is a technique to analyse the performance improvements and degradations caused by the compiler transformation: we study how hardware performance counters change when coarsening is applied, and in this way identify the root causes of execution-time variations due to coarsening. As a second contribution, we study the relative performance of coarsening over multiple input sizes. We show that the speedups given by coarsening are stable for problem sizes larger than a threshold that we call the saturation point, and we exploit the existence of the saturation point to speed up iterative compilation. The last contribution of the work is the development of a machine-learning technique that automatically selects a coarsening configuration that improves performance. The technique is based on an iterative model built using a neural network; the network is trained once per GPU model and used for several programs. To demonstrate the flexibility of our techniques, all our experiments have been run on multiple GPU models by different vendors. [en]
dc.contributor.sponsor: Engineering and Physical Sciences Research Council (EPSRC) [en]
dc.language.iso: en [en]
dc.publisher: The University of Edinburgh [en]
dc.relation.hasversion: Alberto Magni, Christophe Dubach, Michael F.P. O'Boyle. "A Large-Scale Cross-Architecture Evaluation of Thread-Coarsening". In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), November 2013. [en]
dc.relation.hasversion: Alberto Magni, Christophe Dubach, Michael F.P. O'Boyle. "Exploiting GPU Hardware Saturation for Fast Compiler Optimization". In Proceedings of the Workshop on General Purpose Processing Using GPUs (GPGPU), March 2014. [en]
dc.relation.hasversion: Alberto Magni, Christophe Dubach, Michael F.P. O'Boyle. "Automatic Optimization of Thread-Coarsening for Graphics Processors". In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT), August 2014. [en]
dc.subject: compilers [en]
dc.subject: graphics processors [en]
dc.subject: performance optimization [en]
dc.title: Analysis and parameter prediction of compiler transformation for graphics processors [en]
dc.type: Thesis or Dissertation [en]
dc.type.qualificationlevel: Doctoral [en]
dc.type.qualificationname: PhD Doctor of Philosophy [en]
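
As an illustration (not part of the repository record or the thesis itself), the thread-coarsening transformation described in the abstract can be sketched on a hypothetical OpenCL vector-addition kernel. Kernel and variable names below are assumptions chosen for the example, and a coarsening factor of 2 is used.

/* Original kernel: each work-item computes one element. */
__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *c)
{
    int i = get_global_id(0);
    c[i] = a[i] + b[i];
}

/* Coarsened by a factor of 2: each work-item computes two elements,
 * so the kernel is launched with half as many work-items. Per-thread
 * overhead (id computation, control flow) is amortised over more
 * useful instructions, reducing redundant work across the grid. */
__kernel void vec_add_coarsened(__global const float *a,
                                __global const float *b,
                                __global float *c)
{
    int i = get_global_id(0) * 2;
    c[i]     = a[i]     + b[i];
    c[i + 1] = a[i + 1] + b[i + 1];
}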

