Managing the memory hierarchy in GPUs
Dublish, Saumay Kumar
MetadataShow full item record
Pervasive use of GPUs across multiple disciplines is a result of continuous adaptation of the GPU architectures to address the needs of upcoming application domains. One such vital improvement is the introduction of the on-chip cache hierarchy, used primarily to filter the high bandwidth demand to the off-chip memory. However, in contrast to traditional CPUs, the cache hierarchy in GPUs is presented with significantly different challenges such as cache thrashing and bandwidth bottlenecks, arising due to small caches and high levels of memory traffic. These challenges lead to severe congestion across the memory hierarchy, resulting in high memory access latencies. In memory-intensive applications, such high memory access latencies often get exposed and can no longer be hidden through multithreading, and therefore adversely impact system performance. In this thesis, we address the inefficiencies across the memory hierarchy in GPUs that lead to such high levels of congestion. We identify three major factors contributing to poor memory system performance: first, disproportionate and insufficient bandwidth resources in the cache hierarchy; second, poor cache management policies; and third, high levels of multithreading. In order to revitalize the memory hierarchy by addressing the above limitations, we propose a three-pronged approach. First, we characterize the bandwidth bottlenecks present across the memory hierarchy in GPUs and identify the architectural parameters that are most critical in alleviating congestion. Subsequently, we explore the architectural design space to mitigate the bandwidth bottlenecks in a cost-effective manner. Second, we identify significant inter-core reuse in GPUs, presenting an opportunity to reuse data among the L1s. We exploit this reuse by connecting the L1 caches with a lightweight ring network to facilitate inter-core communication of shared data. We show that this technique reduces traffic to the L2 cache, freeing up the bandwidth for other accesses. Third, we present Poise, a machine learning approach to mitigate cache thrashing and bandwidth bottlenecks by altering the levels of multi-threading. Poise comprises a supervised learning model that is trained offline on a set of profiled kernels to make good warp scheduling decisions. Subsequently, a hardware inference engine is used to predict good warp scheduling decisions at runtime using the model learned during training. In summary, we address the problem of bandwidth bottlenecks across the memory hierarchy in GPUs by exploring how to best scale, supplement and utilize the existing bandwidth resources. These techniques provide an effective and comprehensive methodology to mitigate the bandwidth bottlenecks in the GPU memory hierarchy.