Speaker: Dani Voitsechov
Affiliation: Electrical Eng., Technion
Over the last decade, GPUs have been successfully deployed to accelerate a wide array of highly parallel general-purpose applications, ranging from machine learning tasks such as image classification, speech recognition, and natural language processing to complex mathematical computations such as Gaussian elimination and matrix multiplication.
These massively parallel processors are capable of achieving high computational throughput while maintaining high power efficiency. Nevertheless, existing GPUs employ a von Neumann compute engine and therefore suffer from that model's power inefficiencies.
This work presents a Multithreaded Coarse-Grain Reconfigurable Architecture (MT-CGRA) that combines coarse-grain reconfigurable computing with static and dynamic dataflow to deliver massive thread-level parallelism. The CUDA-compatible MT-CGRA architecture is positioned as a fast and energy-efficient design alternative for GPGPUs. The architecture maps a compute kernel, represented as a dataflow graph, onto a coarse-grain reconfigurable fabric composed of a grid of interconnected functional units. These functional units dynamically schedule instances of the same static instruction and thus enable streaming the data of multiple threads through the grid. The combination of statically mapped instructions and direct communication between functional units obviates the need for a full instruction pipeline and a centralized register file, whose energy overheads burden GPGPUs.
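The execution model described above can be illustrated with a toy software sketch. It is not the MT-CGRA design itself: the names `FunctionalUnit` and `run_kernel`, the token format, and the example kernel `y = a*x + b` are all assumptions made for illustration. Each unit holds one static instruction, operand tokens are tagged with a thread ID, and a unit fires for a given thread as soon as all of that thread's operands have arrived, passing its result directly to its consumers without a shared register file.

```python
from collections import defaultdict


class FunctionalUnit:
    """Hypothetical grid node that holds one statically mapped instruction."""

    def __init__(self, name, op, sources):
        self.name = name
        self.op = op
        self.sources = sources            # names of producers or kernel inputs
        self.pending = defaultdict(dict)  # thread id -> {source name: value}

    def receive(self, tid, source, value):
        """Buffer one operand token; fire once this thread's operands are complete."""
        slots = self.pending[tid]
        slots[source] = value
        if len(slots) == len(set(self.sources)):
            del self.pending[tid]
            return True, self.op(*(slots[s] for s in self.sources))
        return False, None


def run_kernel(units, inputs_per_thread, output):
    """Stream the tokens of all threads through the static dataflow graph."""
    consumers = defaultdict(list)
    for unit in units:
        for src in set(unit.sources):
            consumers[src].append(unit)

    # Each token is (thread id, producer name, value).
    tokens = [(tid, name, val)
              for tid, ins in inputs_per_thread.items()
              for name, val in ins.items()]
    results = {}
    while tokens:
        # LIFO order interleaves threads; the thread-ID tag keeps them apart.
        tid, src, val = tokens.pop()
        if src == output:
            results[tid] = val
        for unit in consumers[src]:
            fired, out = unit.receive(tid, src, val)
            if fired:
                tokens.append((tid, unit.name, out))
    return results


# Assumed example kernel: y = a * x + b, mapped onto two functional units.
units = [
    FunctionalUnit("mul", lambda a, x: a * x, ["a", "x"]),
    FunctionalUnit("add", lambda m, b: m + b, ["mul", "b"]),
]
inputs = {tid: {"a": 2, "x": tid, "b": 1} for tid in range(4)}
results = run_kernel(units, inputs, output="add")
print(sorted(results.items()))
```

In this sketch the instruction-to-unit mapping never changes while the kernel runs; only data tokens move, which is the property the abstract credits with removing the instruction pipeline and the centralized register file.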
Our simulations of various CUDA benchmarks running on the new system show that MT-CGRAs provide an average speedup of 2.5x (13.5x max) and reduce system power by an average of 7x (33x max) when compared to an equivalent Nvidia GPGPU.
* Ph.D. student under the supervision of Prof. Yoav Etsion