[MICRO 2020]


Deep learning systems have been applied mostly to Euclidean data such as images, video, and audio. In many applications, however, data and the relationships among them are better expressed as graphs. Graph Convolutional Networks (GCNs) appear to be a promising approach to efficiently learn from graph data structures, having shown advantages in many critical applications. As with other deep learning modalities, hardware acceleration is critical. The challenge is that real-world graphs are often extremely large and unbalanced; this poses significant performance demands and design challenges.


In this work, we propose Autotuning-Workload-Balancing GCN (AWB-GCN) to accelerate GCN inference. To address the issue of workload imbalance in processing real-world graphs, three hardware-based autotuning techniques are proposed: dynamic distribution smoothing, remote switching, and row remapping. In particular, AWB-GCN continuously monitors the sparse graph pattern, dynamically adjusts the workload distribution among a large number of processing elements (up to 4K PEs), and, after converging, reuses the ideal configuration. Evaluations are performed using an Intel D5005 FPGA with five commonly-used datasets. Results show that 4K-PE AWB-GCN can significantly elevate PE utilization by 7.7x on average and demonstrate considerable performance speedups over CPUs (3255x), GPUs (80.3x), and a prior GCN accelerator (5.1x).
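To illustrate why workload imbalance hurts PE utilization, here is a minimal software sketch. It is not the hardware autotuning used in AWB-GCN; the function names, the tiny power-law-like row profile, and the greedy reassignment policy are all assumptions made for illustration. Utilization is taken as the mean PE load divided by the peak PE load.

```python
import heapq

def utilization(loads):
    """Mean PE load divided by peak PE load (1.0 means perfectly balanced)."""
    peak = max(loads)
    return sum(loads) / (len(loads) * peak) if peak else 1.0

def static_assignment(row_nnz, num_pes):
    """Give each PE a contiguous block of rows, ignoring sparsity."""
    loads = [0] * num_pes
    block = -(-len(row_nnz) // num_pes)  # ceiling division
    for row, nnz in enumerate(row_nnz):
        loads[row // block] += nnz
    return loads

def remapped_assignment(row_nnz, num_pes):
    """Greedy remap (hypothetical policy): heaviest row goes to the lightest PE."""
    heap = [(0, pe) for pe in range(num_pes)]
    heapq.heapify(heap)
    loads = [0] * num_pes
    for nnz in sorted(row_nnz, reverse=True):
        load, pe = heapq.heappop(heap)
        loads[pe] = load + nnz
        heapq.heappush(heap, (loads[pe], pe))
    return loads

# Non-zeros per row of a made-up, heavily skewed sparse matrix.
row_nnz = [64, 48, 2, 1, 1, 3, 2, 1, 1, 2, 1, 2, 1, 1, 1, 1]
static_loads = static_assignment(row_nnz, 4)
balanced_loads = remapped_assignment(row_nnz, 4)
print(utilization(static_loads), utilization(balanced_loads))
```

Even after remapping, the heaviest row still bounds the peak load in this sketch, which hints at why whole-row reassignment alone may not be enough for very skewed graphs.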


[TC 2020],

[FCCM 2018],

[FPL 2018]


Deep Convolutional Neural Networks (CNNs) have revolutionized numerous applications, but the demand for ever more performance remains unabated. Scaling CNN computations to larger clusters is generally done by distributing tasks in batch mode using methods such as distributed synchronous SGD. Among the issues with this approach is that, to make the distributed cluster work with high utilization, the workload distributed to each node must be large; this implies nontrivial growth in the SGD mini-batch size.


In this work, we propose a framework, called FPDeep, which uses a hybrid of model and layer parallelism to configure distributed reconfigurable clusters to train CNNs. This approach has numerous benefits. First, the design does not suffer from performance loss due to batch size growth. Second, work and storage are balanced among nodes through novel workload and weight partitioning schemes. Part of the mechanism is the surprising finding that it is preferable to store excess weights in neighboring devices rather than in local off-chip memory. Third, the entire system is a fine-grained pipeline. This leads to high parallelism and utilization and also minimizes the time that features need to be cached while waiting for back-propagation. As a result, storage demand is reduced to the point where only on-chip memory is used for the convolution layers. Fourth, we find that the simplest topology, a 1D array, is preferred for interconnecting the FPGAs, thus enabling widespread applicability.
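As a rough sketch of what balancing work across a 1D array of devices can look like, the snippet below cuts an ordered list of layers into equal-work contiguous slices, letting a layer span two neighboring devices when needed. The function name, the per-layer work figures, and the equal-share policy are assumptions for illustration; the actual FPDeep partitioning schemes are not described here.

```python
from fractions import Fraction

def partition_layers(layer_work, num_devices):
    """Split layers, in order, into num_devices equal-work contiguous slices.

    Returns one list per device of (layer_index, fraction_of_that_layer).
    Exact fractions avoid rounding drift when a layer is split.
    """
    share = Fraction(sum(layer_work), num_devices)
    plan = [[] for _ in range(num_devices)]
    dev, room = 0, share
    for layer, work in enumerate(layer_work):
        left = Fraction(work)
        while left > 0:
            take = min(left, room)
            plan[dev].append((layer, take / work))
            left -= take
            room -= take
            # Move to the next device in the 1D array once this one is full.
            if room == 0 and dev < num_devices - 1:
                dev, room = dev + 1, share
    return plan

# Hypothetical per-layer work (arbitrary units) spread over three devices.
layer_work = [4, 2, 6]
plan = partition_layers(layer_work, 3)
for device, pieces in enumerate(plan):
    print(device, [(layer, str(frac)) for layer, frac in pieces])
```

In this example the third layer is shared by the second and third devices, so each device ends up with the same amount of work.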


[TPDS 2020],

[SC 2019], [ICS 2019]

[ASAP 2019]


Binarized Neural Networks (BNNs), which significantly reduce computational complexity and memory demand, have shown potential in cost- and power-restricted domains, such as IoT and smart edge devices, where reaching a certain accuracy bar is sufficient and real-time performance is highly desired.


In this work, we demonstrate that the highly-condensed BNN model can be shrunk significantly by dynamically pruning irregular redundant edges. Based on two new observations on BNN-specific properties, we propose an out-of-order (OoO) architecture, O3BNN-R, which can curtail edge evaluation when the binary output of a neuron can be determined early at runtime during inference. As with Instruction-Level Parallelism, fine-grained, irregular, runtime pruning opportunities are traditionally presumed to be difficult to exploit. To further enhance the pruning opportunities, we take an algorithm-architecture co-design approach in which we augment the loss function during the training stage with specialized regularization terms that favor edge pruning. We evaluate our design on an embedded FPGA using networks including VGG-16 and AlexNet for ImageNet and a VGG-like network for CIFAR-10. Results show that O3BNN-R without regularization can prune, on average, 30% of the operations without any accuracy loss, bringing a 2.2x inference speedup and, on average, a 34x energy-efficiency improvement over state-of-the-art BNN implementations on FPGA/GPU/CPU. With regularization at training, the performance is further improved, on average, by 15%.
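The idea of stopping edge evaluation once a neuron's binary output is already decided can be sketched in software as follows. This is a generic threshold-based early exit for an XNOR-popcount neuron, assumed here for illustration; it is not claimed to match the two BNN-specific observations or the out-of-order hardware of O3BNN-R, and the function name is hypothetical.

```python
def binary_neuron_early_exit(inputs, weights, threshold):
    """Return (output_bit, edges_evaluated) for an XNOR-popcount neuron.

    inputs and weights are equal-length sequences of 0/1 bits.
    The output is 1 if and only if the number of matching bits >= threshold.
    """
    n = len(inputs)
    matches = 0
    for k, (x, w) in enumerate(zip(inputs, weights), start=1):
        matches += 1 if x == w else 0  # XNOR of the two bits
        remaining = n - k
        if matches >= threshold:
            return 1, k  # output already fixed at 1; skip remaining edges
        if matches + remaining < threshold:
            return 0, k  # threshold no longer reachable; skip remaining edges
    return (1 if matches >= threshold else 0), n

x = [1, 0, 1, 1, 0, 0, 1, 0]
w = [1, 0, 1, 0, 1, 1, 0, 1]
print(binary_neuron_early_exit(x, w, 3))
print(binary_neuron_early_exit(x, w, 6))
```

With a threshold of 3 the output is known after three of the eight edges, and with a threshold of 6 it is known after six, so the rest of the evaluation can be skipped in both cases.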



[HPEC 2020]


Quantized Neural Networks (QNNs) have drawn tremendous attention because they dramatically reduce computation, communication, and storage demands with negligible loss of accuracy compared to full-precision Convolutional Neural Networks (CNNs). To find the optimal balance between performance and accuracy, researchers use different data-widths at different layers and even at different channels of QNNs. It is challenging to design a QNN accelerator that is efficient across QNNs with such varied and flexible model configurations.

In this work, we propose a novel Coarse-Grained Reconfigurable Architecture (CGRA)-based QNN acceleration framework, CQNN. CQNN has a large number of basic components for binary functions. By programming CQNN at runtime according to the target QNN models, these basic components are integrated efficiently to support QNN functions with any data-width and hyper-parameter requirements, and CQNN is reconfigured to have the optimal architecture for the target models. The framework includes a compiler, architecture design, simulator, and RTL generator. Experimental results show that CQNN can complete the inference of AlexNet and VGG-16 within 0.13ms and 2.63ms, respectively.
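One common way to build arbitrary-width arithmetic out of binary building blocks is bit-plane decomposition: a dot product of unsigned multi-bit values equals a weighted sum of AND-popcount results over all pairs of bit planes. The sketch below shows that identity in software. It is an assumption that this resembles how CQNN combines its basic components, and the function names are hypothetical.

```python
def bit_plane(values, bit):
    """Pack bit `bit` of every value into one integer, one bit per element."""
    plane = 0
    for idx, v in enumerate(values):
        plane |= ((v >> bit) & 1) << idx
    return plane

def dot_from_binary_ops(acts, weights, act_bits, weight_bits):
    """Unsigned dot product using only AND, popcount, and shift-accumulate."""
    total = 0
    for i in range(act_bits):
        a_plane = bit_plane(acts, i)
        for j in range(weight_bits):
            w_plane = bit_plane(weights, j)
            # Binary dot product of two bit planes, scaled by 2^(i + j).
            total += bin(a_plane & w_plane).count("1") << (i + j)
    return total

acts = [3, 1, 2, 0]     # 2-bit activations (made-up values)
weights = [5, 7, 2, 6]  # 3-bit weights (made-up values)
result = dot_from_binary_ops(acts, weights, act_bits=2, weight_bits=3)
print(result)
```

Changing `act_bits` or `weight_bits` changes only the number of binary passes, which is the sense in which the same binary primitive can serve different data-widths.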



[ICS 2020]


Recurrent neural networks (RNNs) have been widely adopted in temporal sequence analysis, where real-time performance is often in demand. However, RNNs suffer from a heavy computational workload, as the models often come with large weight matrices. Pruning schemes have been proposed for RNNs to eliminate redundant weight values. On one hand, non-structured pruning methods achieve a high pruning rate but introduce computation irregularity, which is unfriendly to parallel hardware. On the other hand, hardware-oriented structured pruning suffers from a low pruning rate due to restrictive constraints on the allowable pruning structure.


This work presents CSB-RNN, an optimized full-stack RNN framework with a novel compressed structured block (CSB) pruning technique. The CSB-pruned RNN model combines fine pruning granularity, which facilitates a high pruning rate, with a regular structure that benefits hardware parallelism. To address the challenges in parallelizing CSB-pruned model inference with fine-grained structural sparsity, we propose a novel hardware architecture with a dedicated compiler. Thanks to the architecture-compilation co-design, the hardware not only supports various RNN cell types but also addresses the challenging workload imbalance issue, and therefore significantly improves hardware efficiency (utilization). CSB-RNN inference achieves faster-than-real-time latency (0.79µs-6.58µs) in an FPGA implementation, which provides 1.12×-12.57× faster inference compared with the state of the art.



[SC 2019],

[ASAP 2019]


The implementation of Molecular Dynamics (MD) on FPGAs has received substantial attention. Previous work, however, has consisted of either proof-of-concept implementations of components, usually the range-limited force; full systems, but with much of the work shared by the host CPU; or prototype demonstrations, e.g., using OpenCL, that neither implement a whole system nor have competitive performance.


In this work, we present what we believe to be the first full-scale FPGA-based simulation engine, and show that its performance is competitive with a GPU (running Amber in an industrial production environment). The system features on-chip particle data storage and management, short- and long-range force evaluation, as well as bonded forces, motion update, and particle migration. Other contributions of this work include exploring numerous architectural trade-offs and analyzing various mapping schemes among particles/cells and the various on-chip compute units. The potential impact is that this system promises to be the basis for long-timescale Molecular Dynamics with a commodity cluster.



[ExaScale MPI 2018]


We are working on a novel approach to supporting communicators for in-switch, in-router, and in-NIC processing of MPI collectives.


MPI collective operations can often be performance killers in HPC applications; we seek to solve this bottleneck by offloading them to reconfigurable hardware within the switch itself, rather than, e.g., the NIC. We have designed a hardware accelerator MPI-FPGA to implement six MPI collectives in the network. Preliminary results show that MPI-FPGA achieves on average 3.9× speedup over conventional clusters in the most likely scenarios. Essential to this work is providing support for sub-communicator collectives. We introduce a novel mechanism that enables the hardware to support a large number of communicators of arbitrary shape, and that is scalable to very large systems. We show how communicator support can be integrated easily into an in-switch hardware accelerator to implement MPI communicators and so enable full offload of MPI collectives. While this mechanism is universally applicable, we implement it in an FPGA cluster; FPGAs provide the ability to couple communication and computation and so are an ideal testbed and have a number of other architectural benefits. MPI-FPGA is fully integrated into MPICH and so transparently usable by MPI applications.