Hardware Acceleration of Video analytics on FPGA using OpenCL

Description
With the exponential growth in video content over the last few years, video analysis is becoming more crucial for many applications such as self-driving cars, healthcare, and traffic management. Most of these video analysis applications use deep learning algorithms such as convolutional neural networks (CNNs) because of their high accuracy in object detection. Thus, enhancing the performance of CNN models becomes crucial for video analysis. CNN models are computationally expensive and often require high-end graphics processing units (GPUs) for acceleration. However, for real-time applications in energy- and thermal-constrained environments such as traffic management, GPUs are less preferred because of their high power consumption and limited energy efficiency, and because they are difficult to fit in a small space.

To enable real-time video analytics in emerging large scale Internet of things (IoT) applications, the computation must happen at the network edge (near the cameras) in a distributed fashion. Thus, edge computing must be adopted. Recent studies have shown that field-programmable gate arrays (FPGAs) are highly suitable for edge computing due to their architecture adaptiveness, high computational throughput for streaming processing, and high energy efficiency.

This thesis presents a generic OpenCL-defined CNN accelerator architecture optimized for FPGA-based real-time video analytics at the edge. The proposed CNN OpenCL kernel adopts a highly pipelined and parallelized 1-D systolic array architecture, which exploits both spatial and temporal parallelism for energy-efficient CNN acceleration on FPGAs. The large fan-in and fan-out of computational units to the memory interface are identified as the limiting factor that causes scalability issues in existing designs, and solutions are proposed to resolve the issue with compiler automation. The proposed CNN kernel is highly scalable and parameterized by three architecture parameters, namely pe_num, reuse_fac, and vec_fac, which can be adapted to achieve 100% utilization of the coarse-grained computation resources (e.g., DSP blocks) for a given FPGA. The proposed CNN kernel is generic and can be used to accelerate a wide range of CNN models without recompiling the FPGA kernel hardware. The performance of Alexnet, Resnet-50, Retinanet, and Light-weight Retinanet has been measured using the proposed CNN kernel on an Intel Arria 10 GX1150 FPGA. The measurement results show that the proposed CNN kernel, when mapped with 100% utilization of computation resources, achieves a latency of 11ms, 84ms, 1614.9ms, and 990.34ms for Alexnet, Resnet-50, Retinanet, and Light-weight Retinanet, respectively, when the input feature maps and weights are represented using the 32-bit floating-point data type.
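As a rough illustration of the dataflow described above, the following Python sketch models a 1-D systolic chain functionally. It is not the thesis's OpenCL kernel. The function name systolic_1d_matvec is hypothetical, and the meanings given here to pe_num (number of chained processing elements) and vec_fac (input elements consumed per PE per cycle) are assumptions; reuse_fac is not modeled at all.

```python
def systolic_1d_matvec(weights, x, vec_fac):
    """Functional model of a 1-D systolic chain (illustrative assumption).

    weights: one row per processing element (PE); len(weights) plays the
             role of pe_num.
    x:       input vector, streamed in chunks of vec_fac elements.
    Each PE multiply-accumulates the chunk it currently holds, then the
    chunk moves to the next PE on the following cycle.
    Returns (per-PE accumulators, number of cycles until the chain drains).
    """
    pe_num = len(weights)
    chunks = [x[i:i + vec_fac] for i in range(0, len(x), vec_fac)]
    acc = [0.0] * pe_num
    pipe = [None] * pe_num          # chunk index held by each PE this cycle
    total_cycles = len(chunks) + pe_num - 1
    for t in range(total_cycles):
        # Shift the pipeline by one PE (last PE first, so nothing is overwritten).
        for p in range(pe_num - 1, 0, -1):
            pipe[p] = pipe[p - 1]
        pipe[0] = t if t < len(chunks) else None
        # Every PE that holds a chunk performs a vec_fac-wide dot product.
        for p in range(pe_num):
            c = pipe[p]
            if c is not None:
                base = c * vec_fac
                w_slice = weights[p][base:base + vec_fac]
                acc[p] += sum(w * v for w, v in zip(w_slice, chunks[c]))
    return acc, total_cycles
```

In this model each input chunk is fetched once and then passed from PE to PE, so only the first PE talks to the input stream. That is one common way a systolic chain keeps the memory-side fan-out small, although the abstract does not say how the thesis resolves the fan-in/fan-out issue.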
Date Created
2019
Agent

Energy-Efficient ASIC Accelerators for Machine/Deep Learning Algorithms

Description
While machine/deep learning algorithms have been successfully used in many practical applications including object detection and image/video classification, accurate, fast, and low-power hardware implementations of such algorithms are still a challenging task, especially for mobile systems such as Internet of Things, autonomous vehicles, and smart drones.

This work presents an energy-efficient programmable application-specific integrated circuit (ASIC) accelerator for object detection. The proposed ASIC supports multi-class detection (face/traffic sign/car license plate/pedestrian) of many objects (up to 50) in one image with different sizes (6 down-/11 up-scaling), with high accuracy (87% for face detection datasets). The proposed accelerator is composed of an integral channel detector with 2,000 classifiers for five rigid boosted templates, which together form a strong object detector. By jointly optimizing the algorithm and the hardware architecture, the prototype chip implemented in 65nm demonstrates real-time object detection at 20-50 frames/s with 22.5-181.7mW (0.54-1.75nJ/pixel) at 0.58-1.1V supply.



In this work, to reduce computation without accuracy degradation, an energy-efficient deep convolutional neural network (DCNN) accelerator is proposed that is based on a novel conditional computing scheme and integrates convolution with the subsequent max-pooling operation. This way, the total number of bit-wise convolutions can be reduced by ~2x without affecting the output feature values. This work also develops an optimized dataflow that exploits sparsity, maximizes data re-use and minimizes off-chip memory access, improving upon existing hardware works. The total off-chip memory access is reduced by 2.12x. In preliminary post-layout simulation results in 40nm, the proposed DCNN accelerator achieves a peak of 7.35 TOPS/W for VGG-16.
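The abstract does not spell out the conditional computing scheme, so the sketch below shows only one possible way to fuse convolution with max-pooling while leaving the pooled output unchanged. It assumes unsigned activations processed bit-serially from the most significant bit, and it stops computing a pooling candidate as soon as bounds on its remaining bits show it cannot be the maximum. The function name and the bound-based pruning rule are illustrative assumptions, not the thesis's method.

```python
def pooled_max_conditional(windows, weights, act_bits=8):
    """Exact max over a pooling window, skipping work for hopeless candidates.

    windows: one activation patch per pooling candidate; entries are
             unsigned integers below 2**act_bits.
    weights: signed integer filter taps shared by all candidates.
    Returns (max dot product, number of bit-plane dot products evaluated).
    """
    pos = sum(w for w in weights if w > 0)   # most the low bits can add, per unit
    neg = sum(w for w in weights if w < 0)   # most the low bits can subtract, per unit
    partial = [0] * len(windows)
    alive = list(range(len(windows)))
    work = 0
    for bit in range(act_bits - 1, -1, -1):
        for i in alive:
            # Bit-plane dot product: add the taps whose activation has this bit set.
            plane = sum(w for w, a in zip(weights, windows[i]) if (a >> bit) & 1)
            partial[i] += plane * (1 << bit)
            work += 1
        rem = (1 << bit) - 1                 # largest value of the unprocessed bits
        best_lower = max(partial[i] + rem * neg for i in alive)
        # Keep only candidates whose upper bound can still reach the best lower bound.
        alive = [i for i in alive if partial[i] + rem * pos >= best_lower]
    return max(partial[i] for i in alive), work
```

Because a candidate is dropped only when its upper bound falls below another candidate's lower bound, the returned value equals the maximum of the full-precision dot products, while the work counter shows how many bit-plane evaluations were actually needed.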

A number of recent efforts have attempted to design custom inference engines based on various approaches, including the systolic architecture, near-memory processing, and the in-memory computing concept. This work presents a comprehensive comparison of these approaches in a unified framework. This work also presents an energy-efficient in-memory computing accelerator for deep neural networks (DNNs) that integrates many instances of in-memory computing macros with an ensemble of peripheral digital circuits, which supports configurable multibit activations and large-scale DNNs seamlessly while substantially improving the chip-level energy efficiency. The proposed accelerator is fully designed in 65nm, demonstrating ultralow energy consumption for DNNs.
Date Created
2019
Agent

Power, Performance, and Energy Management of Heterogeneous Architectures

Description
Many-core modern multiprocessor systems-on-chip offer tremendous power and performance optimization opportunities by tuning thousands of potential voltage, frequency and core configurations. Applications running on these architectures are becoming increasingly complex. As the basic building blocks that make up the application change during runtime, different configurations may become optimal with respect to power, performance or other metrics. Identifying the optimal configuration at runtime is a daunting task due to the large number of workloads and configurations. Therefore, there is a strong need to evaluate the metrics of interest as a function of the supported configurations.

This thesis focuses on two different types of modern multiprocessor systems-on-chip (SoC): mobile heterogeneous systems and the tile-based Intel Xeon Phi architecture.

For mobile heterogeneous systems, this thesis presents a novel methodology that can accurately instrument different types of applications with specific performance monitoring calls. These calls provide a rich set of performance statistics at a basic block level while the application runs on the target platform. The target architecture used for this work (Odroid XU3) is capable of running at 4940 different frequency and core combinations. With the help of the instrumented applications, a vast amount of characterization data is collected that provides details about performance, power and CPU state at every instrumented basic block across 19 different types of applications. This data has enabled two runtime schemes. The first work provides a methodology to find optimal configurations in heterogeneous architectures using classifiers and demonstrates an average increase of 93%, 81% and 6% in performance per watt (PPW) compared to the interactive, ondemand and powersave governors, respectively. The second work, using the same data, presents a novel imitation learning framework for dynamically controlling the type, number, and frequencies of active cores to achieve an average of 109% PPW improvement compared to the default governors.
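To make the size of the configuration space concrete, the sketch below enumerates one decomposition that reproduces the 4940 figure. The frequency tables and core-count ranges are assumptions about the Odroid XU3's big.LITTLE SoC (13 little-cluster and 19 big-cluster frequency levels in 100 MHz steps, at least one little core online); the abstract itself only states the total. The helper best_ppw is likewise hypothetical and simply picks the measured configuration with the highest performance per watt.

```python
from itertools import product

# Assumed DVFS tables (not given in the abstract), in MHz.
LITTLE_FREQS_MHZ = list(range(200, 1401, 100))  # 13 levels
BIG_FREQS_MHZ = list(range(200, 2001, 100))     # 19 levels
LITTLE_CORES = [1, 2, 3, 4]      # assumption: one little core always stays online
BIG_CORES = [0, 1, 2, 3, 4]      # the big cluster may be fully off


def enumerate_configs():
    """All (little cores, little MHz, big cores, big MHz) combinations."""
    return list(product(LITTLE_CORES, LITTLE_FREQS_MHZ, BIG_CORES, BIG_FREQS_MHZ))


def best_ppw(measurements):
    """measurements maps a configuration to (performance, power in watts).

    Returns the configuration with the highest performance per watt.
    """
    return max(measurements, key=lambda c: measurements[c][0] / measurements[c][1])
```

Under these assumptions the count is 4 x 13 x 5 x 19 = 4940. An exhaustive sweep of this size per basic block is expensive, which is consistent with the abstract's use of classifiers and imitation learning to select configurations at runtime.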

This work also presents how to accurately profile the tile-based Intel Xeon Phi architecture while training different types of neural networks using an open image dataset on a deep learning framework. The data collected allows deep exploratory analysis. It also showcases how different hardware parameters affect the performance of the Xeon Phi.
Date Created
2019
Agent

How Does Technology Development Influence the Assessment of Parkinson’s Disease? A Systematic Review

Description
Parkinson’s disease (PD) is a neurological disorder with complicated and disabling motor and non-motor symptoms. The pathology for PD is difficult and expensive. Furthermore, it depends on patient diaries and the neurologist’s subjective assessment of clinical scales. Objective, accurate, and continuous patient monitoring has become possible with the advancement in mobile and portable equipment. Consequently, a significant amount of work has been done to explore new cost-effective and objective assessment methods for PD symptoms. For example, smart technologies, such as wearable sensors and optical motion capturing systems, have been used to analyze the symptoms of PD patients to assess their disease progression and even to detect signs in their nascent stage for early diagnosis of PD.

This review focuses on the use of modern equipment for PD applications developed in the last decade. Four significant fields of research were identified: Assistance to Diagnosis, Prognosis or Monitoring of Symptoms and their Severity, Predicting Response to Treatment, and Assistance to Therapy or Rehabilitation. This study reviews the papers published between January 2008 and December 2018 in the following four databases: Pubmed Central, Science Direct, IEEE Xplore and MDPI. After removing unrelated articles, articles published in languages other than English, duplicate entries and other articles that did not fulfill the selection criteria, 778 papers were manually investigated and included in this review. A general overview of PD applications, devices used and aspects monitored for PD management is provided in this systematic review.
Date Created
2019
Agent

Low Cost 3D Flow Estimation in Medical Ultrasound

Description
Medical ultrasound imaging is widely used today because it is non-invasive and cost-effective. Flow estimation helps in accurate diagnosis of vascular diseases and adds an important dimension to medical ultrasound imaging. Traditionally, flow estimation is done using Doppler-based methods, which only estimate velocity in the beam direction. Thus, when blood vessels are close to being orthogonal to the beam direction, there are large errors in the estimation results. In this dissertation, a low-cost blood flow estimation method that does not have the angle dependency of Doppler-based methods is presented.

First, a velocity estimator based on speckle tracking and synthetic lateral phase is proposed for clutter-free blood flow.

Speckle tracking is based on kernel matching and does not have any angle dependency. While velocity estimation in the axial dimension is accurate, lateral velocity estimation is challenging due to reduced resolution and lack of phase information. This work presents a two-tiered method which estimates the pixel-level movement using the sum of absolute differences, and then estimates the sub-pixel-level movement using synthetic phase information in the lateral dimension. Such a method achieves highly accurate velocity estimation with reduced complexity compared to a cross-correlation-based method. The average bias of the proposed estimation method is less than 2% for plug flow and less than 7% for parabolic flow.
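The first tier described above, pixel-level kernel matching with the sum of absolute differences (SAD), can be sketched as follows. This is a generic block-matching example, not the dissertation's implementation: the function names, the exhaustive search window, and the plain nested-list frame layout are assumptions, and the second tier (sub-pixel refinement from synthetic lateral phase) is not shown.

```python
def crop(frame, top, left, height, width):
    """Extract a height x width kernel from a 2-D frame (list of rows)."""
    return [row[left:left + width] for row in frame[top:top + height]]


def sad(a, b):
    """Sum of absolute differences between two equally sized kernels."""
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def track_kernel(prev, curr, top, left, height, width, search):
    """Return the integer (dy, dx) shift that minimizes SAD within +/- search."""
    ref = crop(prev, top, left, height, width)
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            t, l = top + dy, left + dx
            # Skip candidate positions that fall outside the current frame.
            if t < 0 or l < 0 or t + height > len(curr) or l + width > len(curr[0]):
                continue
            cost = sad(ref, crop(curr, t, l, height, width))
            if best is None or cost < best[0]:
                best = (cost, dy, dx)
    return best[1], best[2]
```

Dividing the returned displacement by the frame interval (and scaling by the pixel pitch) would give a coarse velocity estimate. SAD needs only subtractions and additions, which matches the abstract's point about reduced complexity relative to cross correlation.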

Second, clutter filtering is addressed. Blood is always accompanied by clutter, which originates from the vessel wall and surrounding tissues. As the magnitude of the blood signal is usually 40-60 dB lower than the magnitude of the clutter signal, clutter filtering is necessary before blood flow estimation. Clutter filters utilize the high-magnitude and low-frequency features of the clutter signal to remove it from the compound (blood + clutter) signal. Instead of a low-complexity FIR filter or high-complexity SVD-based filters, a power/subspace iteration based method is proposed here for clutter filtering. Excellent clutter filtering performance is achieved for both slow- and fast-moving clutter with lower complexity compared to SVD-based filters. For instance, use of the proposed method results in a bias of less than 8% and a standard deviation of less than 12% for fast-moving clutter when the beam-to-flow angle is 90°.
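The idea of replacing a full SVD with power iteration can be illustrated on real-valued data as below. The sketch estimates only the single dominant slow-time direction and projects it out; the dissertation's filter, its subspace dimension, and its handling of complex data are not described in the abstract, so the function names, the rank-one choice, and the iteration count here are assumptions.

```python
import math


def dominant_direction(rows, iters=50):
    """Power iteration for the dominant eigenvector of X^T X.

    rows: matrix X as a list of rows (spatial samples x slow-time samples).
    The product X^T (X v) is applied directly, so X^T X is never formed.
    The start vector is constant (DC) because clutter is low frequency.
    """
    m = len(rows[0])
    v = [1.0 / math.sqrt(m)] * m
    for _ in range(iters):
        proj = [sum(x * vi for x, vi in zip(row, v)) for row in rows]            # X v
        nxt = [sum(p * row[j] for p, row in zip(proj, rows)) for j in range(m)]  # X^T (X v)
        norm = math.sqrt(sum(c * c for c in nxt))
        if norm == 0.0:
            break
        v = [c / norm for c in nxt]
    return v


def remove_dominant(rows, iters=50):
    """Subtract each row's projection onto the dominant direction."""
    v = dominant_direction(rows, iters)
    filtered = []
    for row in rows:
        p = sum(x * vi for x, vi in zip(row, v))
        filtered.append([x - p * vi for x, vi in zip(row, v)])
    return filtered
```

Each iteration costs two matrix-vector products, which is where the complexity saving over a full SVD comes from when only a few dominant components are needed. A subspace (block) iteration would carry several orthonormalized vectors instead of one.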

Third, a flow rate estimation method based on kernel power weighting is proposed. As the velocity estimator is a kernel-based method, the estimation accuracy degrades near the vessel boundary. To account for kernels that are not fully inside the vessel, fractional weights are assigned to these kernels based on their signal power. The proposed method achieves excellent flow rate estimation results with less than 8% bias for both slow- and fast-moving clutter.

The performance of the velocity estimator is also evaluated for challenging models. A 2D version of the two-tiered method is able to accurately estimate velocity vectors in a spinning disk as well as in a carotid bifurcation model, both of which are part of the synthetic aperture vector flow imaging (SA-VFI) challenge of 2018. In fact, the proposed method ranked 3rd in the challenge on the testing dataset with the carotid bifurcation. The flow estimation method is also evaluated for blood flow in vessels with stenosis. Simulation results show that the proposed method is able to estimate the flow rate with less than 9% bias.
Date Created
2018
Agent

Optimized Stress Testing for Flexible Hybrid Electronics Designs

Description
Flexible hybrid electronics (FHE) is emerging as a promising solution to combine the benefits of printed electronics and silicon technology. FHE has many high-impact potential areas, such as wearable applications, health monitoring, and soft robotics, due to its physical advantages, which include light weight, low cost and the ability to conform to different shapes. However, physical deformations that can occur in the field lead to significant testing and validation challenges. For example, designers have to ensure that FHE devices continue to meet specs even when the components experience stress due to bending. Hence, physical deformation, which is hard to emulate, has to be part of the test procedures developed for FHE devices. This paper is the first to analyze the stress experienced at different parts of FHE devices under different bending conditions. It then develops a novel methodology to maximize the test coverage with a minimum number of test vectors with the help of a mixed integer linear programming formulation.
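Choosing the fewest test vectors that still cover every stress target has the shape of a set cover problem, which is commonly written as a 0-1 integer program: minimize the number of selected tests subject to every target being covered by at least one selected test. The sketch below solves a tiny instance by exhaustive search instead of calling an MILP solver. The paper's actual formulation is not given in the abstract, so the function, the test names, and the coverage sets are hypothetical.

```python
from itertools import combinations


def min_test_set(coverage, targets):
    """Smallest set of tests whose combined coverage includes all targets.

    coverage: dict mapping a test name to the set of targets it stresses.
    targets:  set of targets that must be covered.
    Returns a list of test names, or None if full coverage is impossible.
    Exhaustive search is only practical for small instances.
    """
    names = sorted(coverage)
    for k in range(len(names) + 1):
        for combo in combinations(names, k):
            covered = set().union(*(coverage[n] for n in combo))
            if targets <= covered:
                return list(combo)
    return None


# Hypothetical bending tests and the device regions each one stresses.
example_coverage = {
    "bend_x": {"r1", "r2"},
    "bend_y": {"r2", "r3"},
    "twist": {"r1", "r3", "r4"},
    "roll": {"r4"},
}
```

For the example data, no single test stresses all four regions, so at least two tests are required. An MILP solver would reach the same optimum on this instance while scaling to far larger ones.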
Date Created
2018
Agent

Passive Loop Filter Zoom Analog to Digital Converters

Description
This dissertation proposes and presents two different passive sigma-delta modulator zoom Analog to Digital Converter (ADC) architectures. The first ADC is a fully differential, synthesizable zoom-ADC architecture with a passive loop filter for low-frequency Built in Self-Test (BIST) applications. The detailed ADC architecture and a step-by-step process for designing the zoom-ADC, along with a synthesis tool that can target various design specifications, are presented. The design flow does not rely on the extensive knowledge of an experienced ADC designer. Two example BIST ADCs have been synthesized with different performance requirements in a 65nm CMOS process. The first ADC achieves 90.4dB Signal to Noise Ratio (SNR) in 512µs measurement time and consumes 17µW power. Another example achieves 78.2dB SNR in 31.25µs measurement time and consumes 63µW power. The second ADC architecture is a multi-mode, dynamically zooming passive sigma-delta modulator. The architecture is based on a 5b interpolating flash ADC as the zooming unit, and a passive discrete time sigma delta modulator as the fine conversion unit. The proposed ADC provides an Oversampling Ratio (OSR)-independent, dynamic zooming technique, employing an interpolating zooming front-end. The modulator covers between 0.1 MHz and 10 MHz signal bandwidth, which makes it suitable for cellular applications including 4G radio systems. By reconfiguring the OSR, bias current, and component parameters, optimal power consumption can be achieved for every mode. The ADC is implemented in 0.13 µm CMOS technology and it achieves an SNDR of 82.2/77.1/74.2/68 dB for 0.1/1.92/5/10MHz bandwidth with 1.3/5.7/9.6/11.9mW power consumption from a 1.2 V supply.
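The four reported operating modes can be compared with two standard ADC metrics: the effective number of bits, ENOB = (SNDR - 1.76) / 6.02, and the Schreier figure of merit, FoM = SNDR + 10 log10(BW / P). The sketch below applies them to the SNDR, bandwidth, and power figures quoted above. The dissertation may report different metrics, so the choice of these two and the helper names are assumptions.

```python
import math


def enob(sndr_db):
    """Effective number of bits from SNDR in dB."""
    return (sndr_db - 1.76) / 6.02


def schreier_fom(sndr_db, bandwidth_hz, power_w):
    """Schreier figure of merit in dB: SNDR + 10*log10(bandwidth / power)."""
    return sndr_db + 10.0 * math.log10(bandwidth_hz / power_w)


# (bandwidth in Hz, SNDR in dB, power in W) for the four modes quoted above.
MODES = [
    (0.1e6, 82.2, 1.3e-3),
    (1.92e6, 77.1, 5.7e-3),
    (5e6, 74.2, 9.6e-3),
    (10e6, 68.0, 11.9e-3),
]


def summarize(modes=MODES):
    """Return one (bandwidth, ENOB, FoM) tuple per operating mode."""
    return [(bw, enob(sndr), schreier_fom(sndr, bw, p)) for bw, sndr, p in modes]
```

For the 0.1 MHz mode this gives roughly 13.4 effective bits and a figure of merit of about 161 dB. Running summarize() lists the same quantities for the wider-bandwidth modes.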
Date Created
2018
Agent

DFT Solutions for Automated Test and Calibration of Forthcoming RF Integrated Transceivers

Description
As integrated technologies are scaling down, there is an increasing trend in the process, voltage and temperature (PVT) variations of highly integrated RF systems. Accounting for these variations during the design phase requires a tremendous amount of time for predicting RF performance and optimizing it accordingly. Thus, there is an increasing gap between the need to relax the RF performance requirements at the design phase for rapid development and the need to provide high-performance, low-cost RF circuits that function with PVT variations. No matter how carefully designed, RF integrated circuits (ICs) manufactured with advanced technology nodes necessitate lengthy post-production calibration and test cycles with expensive RF test instruments. Hence design-for-test (DFT) is proposed for low-cost and fast measurement of performance parameters during both post-production and in-field operation. For example, built-in self-test (BIST) is a DFT solution for low-cost on-chip measurement of RF performance parameters. In this dissertation, three aspects of automated test and calibration are covered for RF front-end blocks: the DFT mathematical model, BIST hardware, and built-in calibration.

First, the theoretical foundation of a post-production test of RF integrated phased array antennas is proposed by developing the mathematical model to measure gain and phase mismatches between antenna elements without any electrical contact. The proposed technique is fast and cost-efficient and uses near-field measurement of the power radiated from the antennas. Hence, it requires a single test setup, is easy to implement, and takes little time, which makes it viable for industrialized high-volume IC production test.

Second, a BIST model intended for the characterization of I/Q offset, gain and phase mismatch of IQ transmitters without relying on external equipment is introduced. The proposed BIST method is based on on-chip amplitude measurement, as in prior works; however, here the variations in the BIST circuit do not affect the target parameter estimation accuracy since the measurements are designed to be relative. The BIST circuit is implemented in 130nm technology and can be used for post-production and in-field calibration.

Third, a programmable low noise amplifier (LNA) is proposed which is adaptable to different application scenarios depending on the specification requirements. Its performance is optimized with regard to the required specifications, e.g., distance, power consumption, BER, and data rate. Statistical modeling is used to capture the correlations among measured performance parameters and calibration modes for fast adaptation. A machine learning technique is used to capture these non-linear correlations and build the probability distribution of a target parameter based on measurement results of the correlated parameters. The proposed concept is demonstrated by embedding built-in tuning knobs in an LNA design in 130nm technology. The tuning knobs are carefully designed to provide independent combinations of important performance parameters such as gain and linearity. A minimum number of switches is used to provide the desired tuning range without the need for an external analog input.
Date Created
2018
Agent

High Performance Power Management Integrated Circuits for Portable Devices

Description
Portable devices often require multiple power management ICs (PMICs) to power different sub-modules. Li-ion batteries are well suited for portable devices because of their small size, high energy density and long life cycle. Since the Li-ion battery is the major power source for portable devices, a fast and high-efficiency battery charging solution has become a major requirement in portable device applications.

In the first part of the dissertation, a high-performance Li-ion switching battery charger is proposed. A cascaded two loop (CTL) control architecture is used for seamless constant-current/constant-voltage (CC-CV) transition, and a time-based technique is utilized to minimize controller area and power consumption. The time-domain controller is implemented using a voltage controlled oscillator (VCO) and a voltage controlled delay line (VCDL). Several efficiency improvement techniques such as segmented power-FET, quasi-zero voltage switching (QZVS) and switching frequency reduction are proposed. The proposed switching battery charger is able to provide a maximum charging current of 2 A and has a peak efficiency of 93.3%. By configuring the charger as a boost converter, the charger is able to provide a maximum charging current of 1.5 A while achieving 96.3% peak efficiency.
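The CC-CV charging profile that the controller has to follow can be illustrated with a small behavioral simulation. This is not a model of the proposed CTL or time-domain controller: the battery is reduced to an ideal capacitor with a series resistor, and every parameter value except the 2 A current limit is a hypothetical placeholder.

```python
def simulate_cc_cv(capacity_f=2000.0, r_series=0.1, v_start=3.0,
                   i_cc=2.0, v_cv=4.2, i_term=0.1, dt=1.0, max_steps=100000):
    """Behavioral CC-CV charge of a capacitor-plus-resistor battery model.

    Returns a list of (time_s, mode, current_a, terminal_voltage_v) samples.
    Charging stops when the current falls to the termination threshold.
    """
    v_oc = v_start                      # open-circuit (internal) voltage
    trace = []
    for step in range(max_steps):
        # Largest current that keeps the terminal voltage at or below v_cv.
        i_cv_limit = (v_cv - v_oc) / r_series
        mode = "CC" if i_cc <= i_cv_limit else "CV"
        current = min(i_cc, i_cv_limit)
        if current <= i_term:
            break
        v_term = v_oc + current * r_series
        trace.append((step * dt, mode, current, v_term))
        v_oc += current * dt / capacity_f   # charge delivered during this step
    return trace
```

Taking the smaller of the two current commands makes the handover from constant current to constant voltage continuous in this toy model. The abstract attributes the seamless transition in the real charger to the cascaded two-loop control, which this sketch does not attempt to reproduce.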

The second part of the dissertation presents a digital low dropout regulator (DLDO) for system on a chip (SoC) in portable device applications. The proposed DLDO achieves faster transient settling time, lower undershoot/overshoot and higher PSR performance compared to the state of the art. With its good PSR performance, the proposed DLDO is able to power mixed-signal loads. To achieve a fast load transient response, a load transient detector (LTD) enables boost mode operation of the digital PI controller. The boost mode operation achieves sub-microsecond settling time, reducing the settling time by 50% to 250 ns, and the undershoot and overshoot by 35% to 250 mV and by 17% to 125 mV, respectively, without compromising the system stability.
Date Created
2018
Agent

Power-Performance Modeling and Adaptive Management of Heterogeneous Mobile Platforms

Description
Nearly 60% of the world population uses a mobile phone, which is typically powered by a system-on-chip (SoC). While the mobile platform capabilities range widely, responsiveness, long battery life and reliability are common design concerns that are crucial to remain competitive. Consequently, state-of-the-art mobile platforms have become highly heterogeneous by combining a powerful SoC with numerous other resources, including display, memory, power management IC, battery and wireless modems. Furthermore, the SoC itself is a heterogeneous resource that integrates many processing elements, such as CPU cores, GPU, video, image, and audio processors. Therefore, CPU cores do not dominate the platform power consumption under many application scenarios.

Competitive performance requires higher operating frequency, which leads to larger power consumption. In turn, power consumption increases the junction and skin temperatures, which have adverse effects on device reliability and user experience. As a result, allocating the power budget among the major platform resources and controlling temperature have become fundamental considerations for mobile platforms. Dynamic thermal and power management algorithms address this problem by putting a subset of the processing elements or shared resources into sleep states, or by throttling their frequencies. However, an ad hoc approach could easily cripple the performance if it slows down the performance-critical processing element. Furthermore, mobile platforms run a wide range of applications with time-varying workload characteristics, unlike early generations, which supported only limited functionality. As a result, there is a need for adaptive power and performance management approaches that consider the platform as a whole, rather than focusing on a subset. Towards this need, our specific contributions include (a) a framework to dynamically select the Pareto-optimal frequency and active cores for heterogeneous CPUs, such as the ARM big.LITTLE architecture, (b) a dynamic power budgeting approach for allocating optimal power consumption to the CPU and GPU using performance sensitivity models for each PE, (c) an adaptive GPU frame time sensitivity prediction model to aid power management algorithms, and (d) an online learning algorithm that constructs adaptive run-time models for non-stationary workloads.
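Contribution (a) refers to Pareto-optimal configurations, meaning those for which no other configuration is at least as good in both power and execution time and strictly better in one. The sketch below filters a list of measured configurations down to that front. The function name and the sample measurements are hypothetical, and the framework's actual selection logic is not described in the abstract.

```python
def pareto_front(configs):
    """Keep configurations not dominated in (power, execution time).

    configs: list of (name, power_w, exec_time_s); lower is better on both.
    Returns the non-dominated entries sorted by increasing power.
    """
    front = []
    for name, power, time_s in configs:
        dominated = any(
            p2 <= power and t2 <= time_s and (p2 < power or t2 < time_s)
            for _, p2, t2 in configs
        )
        if not dominated:
            front.append((name, power, time_s))
    return sorted(front, key=lambda c: c[1])


# Hypothetical measurements for a big.LITTLE CPU (not taken from the thesis).
example_configs = [
    ("1 little @ 0.6 GHz", 0.3, 10.0),
    ("4 little @ 1.4 GHz", 0.9, 4.0),
    ("2 big @ 1.0 GHz", 1.2, 4.5),
    ("4 big @ 2.0 GHz", 3.5, 1.5),
    ("4 big @ 1.6 GHz", 3.6, 2.0),
]
```

A runtime manager would then pick one point from the front according to the current power budget or performance target. The dominated configurations never need to be considered.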
Date Created
2018
Agent