Ren, Fengbo

Fido Tracker

Description

Millions of pets go missing every year, and this project offers a pet GPS tracking solution to help address the problem. An Arduino microcontroller was combined with a GPS module and a GSM module to form the device hardware, which was then connected to a mobile application developed specifically for this project. Amazon Web Services was used to significantly reduce the cost of connecting the hardware to the mobile app. The project produced a prototype pet GPS tracking device and mobile application, along with instructions so that any user can re-create the solution for their own purposes.

Date Created

2021-05

Agent

Author (aut): Kiaei, Ariana
Thesis director: Ren, Fengbo
Committee member: Abraham, Seth
Contributor (ctb): Computer Science and Engineering Program
Contributor (ctb): Electrical Engineering Program
Contributor (ctb): Barrett, The Honors College

FPGA Acceleration of CNNs Using OpenCL

Description

Convolutional Neural Networks (CNNs) have achieved state-of-the-art performance in numerous applications such as computer vision, natural language processing, and robotics. The advancement of High-Performance Computing systems equipped with dedicated hardware accelerators has also paved the way for the success of compute-intensive CNNs. Graphics Processing Units (GPUs), with their massive processing capability, have been of general interest for the acceleration of CNNs. Recently, Field Programmable Gate Arrays (FPGAs) have shown promise in CNN acceleration since they offer high performance while also being reconfigurable to support the evolution of CNNs. This work focuses on a design methodology to accelerate CNNs on FPGAs with low inference latency and high throughput, which are crucial for scenarios such as self-driving cars and video surveillance. It also includes optimizations that reduce resource utilization by a large margin with a small degradation in performance, making the design suitable for low-end FPGA devices as well.

FPGA accelerators often suffer from limited main memory bandwidth. In addition, highly parallel designs with large resource utilization often end up with a low operating frequency due to poor routing. This work employs data fetch and buffer mechanisms, designed specifically for the memory access pattern of CNNs, that overlap computation with memory access. It proposes a novel arrangement of the systolic processing element array that achieves a high frequency and consumes fewer resources than existing works. Support has also been extended to more complicated CNNs for video processing. On an Intel Arria 10 GX1150, the design operates at a frequency as high as 258 MHz and performs a single inference of VGG-16 and C3D in 23.5 ms and 45.6 ms, respectively. For VGG-16 and C3D, the design offers a throughput of 66.1 and 23.98 inferences/s, respectively. This design can outperform other FPGA 2D CNN accelerators by up to 9.7 times and 3D CNN accelerators by up to 2.7 times.

Date Created

2020

Agent

Author (aut): Ravi, Pravin Kumar
Thesis advisor (ths): Zhao, Ming
Committee member: Li, Baoxin
Committee member: Ren, Fengbo
Publisher (pbl): Arizona State University

Cooperative Driving of Connected Autonomous Vehicles Using Responsibility Sensitive Safety Rules

Description

In recent times, traffic congestion and motor accidents have been a major problem for transportation in major cities. Intelligent Transportation Systems have the potential to be an effective solution to this issue. Connected Autonomous Vehicles can cooperate at intersections, ramp merges, lane changes, and other conflicting scenarios in order to resolve conflicts and avoid collisions with other vehicles. Many works have been proposed for specific scenarios such as intersections, ramp merging, or lane changes, which only partially solve the conflict resolution problem. Moreover, deadlocks, one of the major issues in autonomous decision making, are not considered in some of these works; existing works either do not consider deadlocks or lack a safety proof. This thesis proposes a cooperative driving solution that provides complete navigation, conflict resolution, and deadlock resolution for connected autonomous vehicles. A graph-based model is used to resolve deadlocks between vehicles, and the responsibility sensitive safety (RSS) rules are used to ensure the safety of the autonomous vehicles during conflict detection and resolution. This algorithm provides a complete navigation solution for an autonomous vehicle from its source to its destination. The algorithm ensures that accidents do not occur even in the worst-case scenario and that the decision making is deadlock-free.
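The two ingredients named in this abstract can be illustrated with a minimal sketch. It is not the thesis's algorithm: the first function is the commonly published RSS minimum safe longitudinal gap for two vehicles travelling in the same direction, and the second is a generic cycle check on a wait-for graph, which is one standard way to detect deadlocks. All function names, parameter names, and example values are hypothetical.

```python
from typing import Dict, List


def rss_min_longitudinal_gap(v_rear: float, v_front: float, rho: float,
                             a_max_accel: float, b_min_brake: float,
                             b_max_brake: float) -> float:
    """Published RSS safe-gap formula (metres) for same-direction traffic.

    Worst case assumed: the rear car accelerates at a_max_accel during its
    response time rho, then brakes at only b_min_brake, while the front car
    brakes at b_max_brake. Speeds in m/s, accelerations in m/s^2.
    """
    v_rear_after = v_rear + rho * a_max_accel
    gap = (v_rear * rho
           + 0.5 * a_max_accel * rho ** 2
           + v_rear_after ** 2 / (2.0 * b_min_brake)
           - v_front ** 2 / (2.0 * b_max_brake))
    return max(0.0, gap)


def has_deadlock(wait_for: Dict[str, List[str]]) -> bool:
    """Return True if the wait-for graph contains a cycle.

    wait_for maps a vehicle id to the vehicles it is waiting on
    (an assumed representation, not taken from the thesis).
    """
    WHITE, GRAY, BLACK = 0, 1, 2
    color: Dict[str, int] = {}

    def visit(node: str) -> bool:
        color[node] = GRAY  # node is on the current DFS path
        for nxt in wait_for.get(node, []):
            state = color.get(nxt, WHITE)
            if state == GRAY:  # back edge: a cycle of waiting vehicles
                return True
            if state == WHITE and visit(nxt):
                return True
        color[node] = BLACK  # fully explored, no cycle through this node
        return False

    return any(color.get(n, WHITE) == WHITE and visit(n) for n in wait_for)


if __name__ == "__main__":
    print(rss_min_longitudinal_gap(20.0, 20.0, 1.0, 2.0, 4.0, 8.0))
    print(has_deadlock({"A": ["B"], "B": ["C"], "C": ["A"]}))
```

A vehicle that keeps at least the returned gap cannot hit the car in front under the stated worst-case assumptions; a cycle in the wait-for graph means that every vehicle on the cycle is waiting on another, so some priority rule has to break it.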

Date Created

2020

Agent

Author (aut): Allamsetti, Harshith
Thesis advisor (ths): Shrivastava, Aviral
Committee member: Sen, Arunabha
Committee member: Ren, Fengbo
Publisher (pbl): Arizona State University

Hardware Acceleration of Video analytics on FPGA using OpenCL

Description

With the exponential growth in video content over the last few years, analysis of videos is becoming more crucial for many applications such as self-driving cars, healthcare, and traffic management. Most of these video analysis applications use deep learning algorithms such as convolutional neural networks (CNNs) because of their high accuracy in object detection. Thus, enhancing the performance of CNN models becomes crucial for video analysis. CNN models are computationally expensive and often require high-end graphics processing units (GPUs) for acceleration. However, for real-time applications in an energy- and thermal-constrained environment such as traffic management, GPUs are less preferred because of their high power consumption and limited energy efficiency. They are also challenging to fit in a small space.

To enable real-time video analytics in emerging large-scale Internet of Things (IoT) applications, the computation must happen at the network edge (near the cameras) in a distributed fashion. Thus, edge computing must be adopted. Recent studies have shown that field-programmable gate arrays (FPGAs) are highly suitable for edge computing due to their architecture adaptiveness, high computational throughput for streaming processing, and high energy efficiency.

This thesis presents a generic OpenCL-defined CNN accelerator architecture optimized for FPGA-based real-time video analytics at the edge. The proposed CNN OpenCL kernel adopts a highly pipelined and parallelized 1-D systolic array architecture, which exploits both spatial and temporal parallelism for energy-efficient CNN acceleration on FPGAs. The large fan-in and fan-out of computational units to the memory interface are identified as the limiting factor that causes scalability issues in existing designs, and solutions are proposed to resolve the issue with compiler automation. The proposed CNN kernel is highly scalable and parameterized by three architecture parameters, namely pe_num, reuse_fac, and vec_fac, which can be adapted to achieve 100% utilization of the coarse-grained computation resources (e.g., DSP blocks) for a given FPGA. The proposed CNN kernel is generic and can be used to accelerate a wide range of CNN models without recompiling the FPGA kernel hardware. The performance of Alexnet, Resnet-50, Retinanet, and Light-weight Retinanet has been measured using the proposed CNN kernel on an Intel Arria 10 GX1150 FPGA. The measurement results show that the proposed CNN kernel, when mapped with 100% utilization of computation resources, can achieve a latency of 11 ms, 84 ms, 1614.9 ms, and 990.34 ms for Alexnet, Resnet-50, Retinanet, and Light-weight Retinanet, respectively, when the input feature maps and weights are represented using the 32-bit floating-point data type.

Date Created

2019

Agent

Author (aut): Dua, Akshay
Thesis advisor (ths): Ren, Fengbo
Committee member: Ogras, Umit Y.
Committee member: Seo, Jae-Sun
Publisher (pbl): Arizona State University

FPGAs as an Edge Computing Solution

Description

As the Internet of Things continues to expand, not only must our computing power grow alongside it, but our very approach must evolve. While the recent trend has been to centralize our computing resources in the cloud, it now looks beneficial to push more computing power towards the “edge” with so-called edge computing, reducing the immense strain on cloud servers and the latency experienced by IoT devices. A new computing paradigm also brings new opportunities for innovation, and one such innovation could be the use of FPGAs as edge servers. In this research project, I learn the design flow for developing OpenCL kernels and custom FPGA BSPs. Using these tools, I investigate the viability of using FPGAs as standalone edge computing devices. I conclude that, although the technology is a great fit, the current need for dynamically reprogrammable FPGAs to be closely coupled with a host CPU is holding them back from this purpose. I propose a modification to the architecture of the Intel Arria 10 GX that would allow it to be decoupled from its host CPU, allowing it to truly serve as a viable edge computing solution.

Date Created

2019-05

Agent

Author (aut): Barth, Brandon Albert
Thesis director: Ren, Fengbo
Committee member: Vrudhula, Sarma
Contributor (ctb): Computer Science and Engineering Program
Contributor (ctb): Barrett, The Honors College

Exploring the Implementation of Multiple Partial Reconfiguration Regions to use FPGAs in Edge Computing

Description

Edge computing is an emerging field that improves upon cloud computing by moving the service from a centralized server to several decentralized servers that are closer to the end user, decreasing latency, bandwidth, and cost requirements. Field-programmable gate array (FPGA) devices are highly reconfigurable and excel at highly parallelized tasks, making them popular in many applications, including digital signal processing and cryptography, and a great candidate for edge computation. The purpose of this project was to explore existing board support packages (BSPs) for the Arria 10 GX FPGA and propose a BSP design with multiple partial reconfiguration regions to better support the use of FPGAs in edge computing. In this project, the general OpenCL development flow was studied, the OpenCL workflow for Altera/Intel FPGAs was researched, the reference OpenCL BSP was explored to understand the connections between the modules, and a customized BSP with two partial reconfiguration regions was proposed. The existing BSP was explored using the Intel Quartus Prime software suite, and the block diagrams for the existing and proposed designs were created using Microsoft Visio.

Date Created

2019-05

Agent

Author (aut): Lam, Evan
Thesis director: Ren, Fengbo
Committee member: Vrudhula, Sarma
Contributor (ctb): Computer Science and Engineering Program
Contributor (ctb): Barrett, The Honors College

An IoT Solution to Air Quality Monitoring

Description

Pollution is an increasing problem around the world, and one of the main forms it takes is air pollution. Air pollution, from oxides and dioxides to particulate matter, continues to contribute to millions of deaths each year, which is more than the next three leading causes of environment-related death combined. Moreover, the problem is only growing as industrial plants, factories, and transportation continue to increase rapidly across the globe. Those most affected include less developed countries and individuals with pre-existing respiratory conditions. Although many citizens know about this issue, it is often unclear which times and locations are worst in terms of pollutant concentration, as it can vary with the time of day, local activity, and other factors. As a result, citizens lack the knowledge and resources to properly combat or avoid air pollution, as well as the data and evidence to support any sort of regulatory change. Many companies and organizations have tried to address this through Air Quality Indexes (AQIs), but these are not focused enough to help the everyday citizen and often fail to include many significant pollutants. Thus, we sought to address this issue in a cost-effective way by creating a network of IoT (Internet of Things) devices and deploying them in a select area of Tempe, Arizona. We utilized Arduino microprocessors and wireless radio frequency transceivers to send and receive air pollution data in real time. We then displayed this data in such a way that it could be released to the public via a web or mobile app. Furthermore, the product is cheap enough to be reproduced and sold in bulk, as well as scaled and customized to be compatible with dozens of different air quality sensors.

Date Created

2019-05

Agent

Co-author: Coury, Abrahm Philip
Co-author: Gillespie, Cody
Thesis director: Ren, Fengbo
Committee member: Shrivastava, Aviral
Contributor (ctb): Computer Science and Engineering Program
Contributor (ctb): Barrett, The Honors College

Distortion Robust Biometric Recognition

Description

Information forensics and security have come a long way in just a few years thanks to the recent advances in biometric recognition. The main challenge remains a proper design of a biometric modality that can be resilient to unconstrained conditions, such as quality distortions. This work presents a solution to face and ear recognition under unconstrained visual variations, with a main focus on recognition in the presence of blur, occlusion and additive noise distortions.

First, the dissertation addresses the problem of scene variations in the presence of blur, occlusion and additive noise distortions resulting from capture, processing and transmission. Despite their excellent performance, 'deep' methods are susceptible to visual distortions, which significantly reduce their performance. Sparse representations, on the other hand, have shown great potential in handling problems such as occlusion and corruption. In this work, an augmented SRC (ASRC) framework is presented to improve the performance of the Sparse Representation Classifier (SRC) in the presence of blur, additive noise and block occlusion, while preserving its robustness to scene-dependent variations. Different feature types are considered in the performance evaluation, including raw image pixels, HoG and deep learning VGG-Face. The proposed ASRC framework is shown to outperform the conventional SRC in terms of recognition accuracy, as well as other existing sparse-based methods and blur-invariant methods at medium to high levels of distortion, particularly when used with discriminative features.
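For readers unfamiliar with the baseline, the conventional SRC idea (code a test sample sparsely over a dictionary of training samples, then pick the class whose atoms reconstruct it best) can be sketched as below. This is only an illustration of the generic classifier, not the ASRC framework. SRC is usually formulated with l1 minimization; this sketch swaps in a simple greedy orthogonal matching pursuit solver to stay self-contained. The function names and toy data are hypothetical.

```python
import numpy as np


def omp(D: np.ndarray, y: np.ndarray, n_nonzero: int) -> np.ndarray:
    """Greedy orthogonal matching pursuit: approximate y with a few columns of D."""
    residual = y.astype(float).copy()
    support = []
    sol = np.zeros(0)
    for _ in range(n_nonzero):
        # Pick the dictionary atom most correlated with the current residual.
        idx = int(np.argmax(np.abs(D.T @ residual)))
        if idx in support:
            break
        support.append(idx)
        # Re-fit all selected atoms jointly by least squares.
        sol, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ sol
        if np.linalg.norm(residual) < 1e-10:
            break
    coef = np.zeros(D.shape[1])
    if support:
        coef[support] = sol
    return coef


def src_classify(D: np.ndarray, labels, y: np.ndarray, n_nonzero: int = 5):
    """Assign y to the class whose atoms give the smallest reconstruction residual."""
    labels = np.asarray(labels)
    coef = omp(D, y, n_nonzero)
    best_class, best_residual = None, np.inf
    for c in np.unique(labels):
        mask = labels == c
        # Keep only the coefficients that belong to class c.
        residual = np.linalg.norm(y - D[:, mask] @ coef[mask])
        if residual < best_residual:
            best_class, best_residual = c, residual
    return best_class


if __name__ == "__main__":
    D = np.eye(4)            # toy dictionary: one atom per column
    labels = [0, 0, 1, 1]    # class of each column
    y = np.array([0.0, 0.0, 0.9, 0.1])
    print(src_classify(D, labels, y, n_nonzero=2))
```

In practice the columns of `D` would be feature vectors (for example, raw pixels, HoG, or deep features) extracted from labeled training images.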

In order to assess the quality of features in improving both the sparsity of the representation and the classification accuracy, a feature sparse coding and classification index (FSCCI) is proposed and used for feature ranking and selection within both the SRC and ASRC frameworks.

The second part of the dissertation presents a method for unconstrained ear recognition using deep learning features. The unconstrained ear recognition is performed using transfer learning with deep neural networks (DNNs) as a feature extractor followed by a shallow classifier. Data augmentation is used to improve the recognition performance by augmenting the training dataset with image transformations. The recognition performance of the feature extraction models is compared with an ensemble of fine-tuned networks. The results show that, in the case where long training time is not desirable or a large amount of data is not available, the features from pre-trained DNNs can be used with a shallow classifier to give a comparable recognition accuracy to the fine-tuned networks.

Date Created

2018

Agent

Author (aut): Mounsef, Jinane
Thesis advisor (ths): Karam, Lina
Committee member: Papandreou-Suppapola, Antonia
Committee member: Li, Baoxin
Committee member: Ren, Fengbo
Publisher (pbl): Arizona State University

Algorithm Architecture Co-design for Dense and Sparse Matrix Computations

Description

With the end of Dennard scaling and Moore's law, architects have moved towards heterogeneous designs consisting of specialized cores to achieve higher performance and energy efficiency for a target application domain. Applications of linear algebra are ubiquitous in the fields of scientific computing, machine learning, statistics, etc., with matrix computations being fundamental to these linear-algebra-based solutions. Design of multiple dense (or sparse) matrix computation routines on the same platform is quite challenging. Added to the complexity is the fact that dense and sparse matrix computations have large differences in their storage and access patterns and are difficult to optimize on the same architecture. This thesis addresses this challenge and introduces a reconfigurable accelerator that supports both dense and sparse matrix computations efficiently.

The reconfigurable architecture has been optimized to execute the following linear algebra routines: GEMV (Dense General Matrix Vector Multiplication), GEMM (Dense General Matrix Matrix Multiplication), TRSM (Triangular Matrix Solver), LU Decomposition, Matrix Inverse, SpMV (Sparse Matrix Vector Multiplication), and SpMM (Sparse Matrix Matrix Multiplication). It is a multicore architecture where each core consists of a 2D array of processing elements (PEs).

The 2D array of PEs is of size 4x4 and is scheduled to perform 4x4 sized matrix updates efficiently. A sequence of such updates is used to solve a larger problem inside a core. A novel partitioned block compressed sparse data structure (PBCSC/PBCSR) is used to perform sparse kernel updates. Scalable partitioning and mapping schemes are presented that map input matrices of any given size to the multicore architecture. Design trade-offs related to the PE array dimension, the size of local memory inside a core, and the bandwidth between on-chip memories and the cores have been presented. An optimal core configuration is developed from this analysis. Synthesis results using a 7nm PDK show that the proposed accelerator can achieve a performance of up to 32 GOPS using a single core.
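To make the block-sparse idea concrete, here is a minimal sketch of a generic block compressed sparse row (BCSR) layout with 4x4 blocks and a sparse matrix-vector product built from dense 4x4 block updates. It does not reproduce the partitioned PBCSC/PBCSR structure of the thesis, whose details are not given in the abstract. The function names and the toy matrix are hypothetical.

```python
import numpy as np


def dense_to_bcsr(A: np.ndarray, bs: int = 4):
    """Store only the nonzero bs x bs blocks of A, row block by row block."""
    n_rows, n_cols = A.shape
    if n_rows % bs or n_cols % bs:
        raise ValueError("matrix dimensions must be multiples of the block size")
    blocks, col_idx, row_ptr = [], [], [0]
    for bi in range(n_rows // bs):
        for bj in range(n_cols // bs):
            blk = A[bi * bs:(bi + 1) * bs, bj * bs:(bj + 1) * bs]
            if np.any(blk != 0):
                blocks.append(blk.copy())  # dense payload of this block
                col_idx.append(bj)         # block-column index of this block
        row_ptr.append(len(blocks))        # end of this block row
    return blocks, col_idx, row_ptr


def bcsr_spmv(blocks, col_idx, row_ptr, x: np.ndarray, bs: int = 4) -> np.ndarray:
    """Compute y = A @ x as a sequence of dense bs x bs block updates."""
    y = np.zeros((len(row_ptr) - 1) * bs)
    for bi in range(len(row_ptr) - 1):
        for k in range(row_ptr[bi], row_ptr[bi + 1]):
            bj = col_idx[k]
            # One small dense update, the kind of work a 4x4 PE array could take.
            y[bi * bs:(bi + 1) * bs] += blocks[k] @ x[bj * bs:(bj + 1) * bs]
    return y


if __name__ == "__main__":
    A = np.zeros((8, 8))
    A[0:4, 4:8] = np.arange(16).reshape(4, 4)
    A[5, 1] = 2.0
    x = np.arange(8, dtype=float)
    blocks, col_idx, row_ptr = dense_to_bcsr(A)
    print(np.allclose(bcsr_spmv(blocks, col_idx, row_ptr, x), A @ x))
```

Storing whole blocks costs some padding zeros but lets every stored block be processed with the same fixed-size dense update, which suits a fixed 4x4 compute array.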

Date Created

2018

Agent

Author (aut): Animesh, Saurabh
Thesis advisor (ths): Chakrabarti, Chaitali
Committee member: Brunhaver, John
Committee member: Ren, Fengbo
Publisher (pbl): Arizona State University

Scratchpad Management in Software Managed Manycore Architectures

Description

Caches have long been used to reduce memory access latency. However, the increased complexity of cache coherence brings significant challenges in processor design as the number of cores increases. While making caches scalable is still an important research problem, some researchers are exploring the possibility of a more power-efficient form of SRAM called scratchpad memory (SPM). SPMs consume significantly less area and are more energy-efficient per access than caches, and therefore make the design of on-chip memories much simpler. Unlike caches, which fetch data from memory automatically, an SPM requires explicit instructions for data transfers. SPM-only architectures are thus called software managed manycore (SMM) architectures, since their data movements rely on software. SMM processors have been widely used in different areas, such as embedded computing, network processing, and even high performance computing. While SMM processors provide a low-power platform, the hardware alone does not guarantee power efficiency if applications on such processors deliver low performance. Efficient software techniques are therefore required. A large body of management techniques for SMM architectures is compiler-directed, as inserting data movement operations by hand forces programmers to trace the flow of data, which can be error-prone and sometimes difficult if not impossible. This thesis develops compiler-directed techniques to manage data transfers for embedded applications on SMMs efficiently. The techniques analyze the program to find the proper program points and insert data movement instructions accordingly. The techniques manage code, stack and heap data of applications, and reduce execution time by 14%, 52% and 80%, respectively, compared to their predecessors on typical embedded applications. On top of managing local data, a technique is also developed for shared data in SMM architectures. Experimental results show that it achieves more than a 2X speedup over the previous technique on average.

Date Created

2017

Agent

Author (aut): Cai, Jian
Thesis advisor (ths): Shrivastava, Aviral
Committee member: Wu, Carole
Committee member: Ren, Fengbo
Committee member: Dasgupta, Partha
Publisher (pbl): Arizona State University
