Mapping parallel programs to heterogeneous multi-core systems
MetadataShow full item record
Heterogeneous computer systems are ubiquitous in all areas of computing, from mobile to high-performance computing. They promise to deliver increased performance at lower energy cost than purely homogeneous, CPU-based systems. In recent years GPU-based heterogeneous systems have become increasingly popular. They combine a programmable GPU with a multi-core CPU. GPUs have become flexible enough to not only handle graphics workloads but also various kinds of general-purpose algorithms. They are thus used as a coprocessor or accelerator alongside the CPU. Developing applications for GPU-based heterogeneous systems involves several challenges. Firstly, not all algorithms are equally suited for GPU computing. It is thus important to carefully map the tasks of an application to the most suitable processor in a system. Secondly, current frameworks for heterogeneous computing, such as OpenCL, are low-level, requiring a thorough understanding of the hardware by the programmer. This high barrier to entry could be lowered by automatically generating and tuning this code from a high-level and thus more user-friendly programming language. Both challenges are addressed in this thesis. For the task mapping problem a machine learning-based approach is presented in this thesis. It combines static features of the program code with runtime information on input sizes to predict the optimal mapping of OpenCL kernels. This approach is further extended to also take contention on the GPU into account. Both methods are able to outperform competing mapping approaches by a significant margin. Furthermore, this thesis develops a method for targeting GPU-based heterogeneous systems from OpenMP, a directive-based framework for parallel computing. OpenMP programs are translated to OpenCL and optimized for GPU performance. At runtime a predictive model decides whether to execute the original OpenMP code on the CPU or the generated OpenCL code on the GPU. This approach is shown to outperform both a competing approach as well as hand-tuned code.