Algorithm Architecture Co-design for Dense and Sparse Matrix Computations

156962-Thumbnail Image.png
Description
With the end of Dennard scaling and Moore's law, architects have moved towards

heterogeneous designs consisting of specialized cores to achieve higher performance

and energy efficiency for a target application domain. Applications of linear algebra

are ubiquitous in the field of scientific computing,

With the end of Dennard scaling and Moore's law, architects have moved towards

heterogeneous designs consisting of specialized cores to achieve higher performance

and energy efficiency for a target application domain. Applications of linear algebra

are ubiquitous in the field of scientific computing, machine learning, statistics,

etc. with matrix computations being fundamental to these linear algebra based solutions.

Design of multiple dense (or sparse) matrix computation routines on the

same platform is quite challenging. Added to the complexity is the fact that dense

and sparse matrix computations have large differences in their storage and access

patterns and are difficult to optimize on the same architecture. This thesis addresses

this challenge and introduces a reconfigurable accelerator that supports both dense

and sparse matrix computations efficiently.

The reconfigurable architecture has been optimized to execute the following linear

algebra routines: GEMV (Dense General Matrix Vector Multiplication), GEMM

(Dense General Matrix Matrix Multiplication), TRSM (Triangular Matrix Solver),

LU Decomposition, Matrix Inverse, SpMV (Sparse Matrix Vector Multiplication),

SpMM (Sparse Matrix Matrix Multiplication). It is a multicore architecture where

each core consists of a 2D array of processing elements (PE).

The 2D array of PEs is of size 4x4 and is scheduled to perform 4x4 sized matrix

updates efficiently. A sequence of such updates is used to solve a larger problem inside

a core. A novel partitioned block compressed sparse data structure (PBCSC/PBCSR)

is used to perform sparse kernel updates. Scalable partitioning and mapping schemes

are presented that map input matrices of any given size to the multicore architecture.

Design trade-offs related to the PE array dimension, size of local memory inside a core

and the bandwidth between on-chip memories and the cores have been presented. An

optimal core configuration is developed from this analysis. Synthesis results using a 7nm PDK show that the proposed accelerator can achieve a performance of upto

32 GOPS using a single core.
Date Created
2018
Agent

Using Capsule Networks for Image and Speech Recognition Problems

156936-Thumbnail Image.png
Description
In recent years, conventional convolutional neural network (CNN) has achieved outstanding performance in image and speech processing applications. Unfortunately, the pooling operation in CNN ignores important spatial information which is an important attribute in many applications. The recently proposed capsule

In recent years, conventional convolutional neural network (CNN) has achieved outstanding performance in image and speech processing applications. Unfortunately, the pooling operation in CNN ignores important spatial information which is an important attribute in many applications. The recently proposed capsule network retains spatial information and improves the capabilities of traditional CNN. It uses capsules to describe features in multiple dimensions and dynamic routing to increase the statistical stability of the network.

In this work, we first use capsule network for overlapping digit recognition problem. We evaluate the performance of the network with respect to recognition accuracy, convergence and training time per epoch. We show that capsule network achieves higher accuracy when training set size is small. When training set size is larger, capsule network and conventional CNN have comparable recognition accuracy. The training time per epoch for capsule network is longer than conventional CNN because of the dynamic routing algorithm. An analysis of the GPU timing shows that adjusting the capsule structure can help decrease the time complexity of the dynamic routing algorithm significantly.

Next, we design a capsule network for speech recognition, specifically, overlapping word recognition. We use both capsule network and conventional CNN to recognize 2 overlapping words in speech files created from 5 word classes. We show that capsule network achieves a considerably higher recognition accuracy (96.92%) compared to conventional CNN (85.19%). Our results show that capsule network recognizes overlapping word by recognizing each individual word in the speech. We also verify the scalability of capsule network by increasing the number of word classes from 5 to 10. Capsule network still shows a high recognition accuracy of 95.42% in case of 10 words while the accuracy of conventional CNN decreases sharply to 73.18%.
Date Created
2018
Agent

Low Cost 3D Flow Estimation in Medical Ultrasound

156894-Thumbnail Image.png
Description
Medical ultrasound imaging is widely used today because of it being non-invasive and cost-effective. Flow estimation helps in accurate diagnosis of vascular diseases and adds an important dimension to medical ultrasound imaging. Traditionally flow estimation is done using Doppler-based

Medical ultrasound imaging is widely used today because of it being non-invasive and cost-effective. Flow estimation helps in accurate diagnosis of vascular diseases and adds an important dimension to medical ultrasound imaging. Traditionally flow estimation is done using Doppler-based methods which only estimate velocity in the beam direction. Thus when blood vessels are close to being orthogonal to the beam direction, there are large errors in the estimation results. In this dissertation, a low cost blood flow estimation method that does not have the angle dependency of Doppler-based methods, is presented.

First, a velocity estimator based on speckle tracking and synthetic lateral phase is proposed for clutter-free blood flow.

Speckle tracking is based on kernel matching and does not have any angle dependency. While velocity estimation in axial dimension is accurate, lateral velocity estimation is challenging due to reduced resolution and lack of phase information. This work presents a two tiered method which estimates the pixel level movement using sum-of-absolute difference, and then estimates the sub-pixel level using synthetic phase information in the lateral dimension. Such a method achieves highly accurate velocity estimation with reduced complexity compared to a cross correlation based method. The average bias of the proposed estimation method is less than 2% for plug flow and less than 7% for parabolic flow.

Blood is always accompanied by clutter which originates from vessel wall and surrounding tissues. As magnitude of the blood signal is usually 40-60 dB lower than magnitude of the clutter signal, clutter filtering is necessary before blood flow estimation. Clutter filters utilize the high magnitude and low frequency features of clutter signal to effectively remove them from the compound (blood + clutter) signal. Instead of low complexity FIR filter or high complexity SVD-based filters, here a power/subspace iteration based method is proposed for clutter filtering. Excellent clutter filtering performance is achieved for both slow and fast moving clutters with lower complexity compared to SVD-based filters. For instance, use of the proposed method results in the bias being less than 8% and standard deviation being less than 12% for fast moving clutter when the beam-to-flow-angle is $90^o$.

Third, a flow rate estimation method based on kernel power weighting is proposed. As the velocity estimator is a kernel-based method, the estimation accuracy degrades near the vessel boundary. In order to account for kernels that are not fully inside the vessel, fractional weights are given to these kernels based on their signal power. The proposed method achieves excellent flow rate estimation results with less than 8% bias for both slow and fast moving clutters.

The performance of the velocity estimator is also evaluated for challenging models. A 2D version of our two-tiered method is able to accurately estimate velocity vectors in a spinning disk as well as in a carotid bifurcation model, both of which are part of the synthetic aperture vector flow imaging (SA-VFI) challenge of 2018. In fact, the proposed method ranked 3rd in the challenge for testing dataset with carotid bifurcation. The flow estimation method is also evaluated for blood flow in vessels with stenosis. Simulation results show that the proposed method is able to estimate the flow rate with less than 9% bias.
Date Created
2018
Agent

Energy Efficient Hardware Design of Neural Networks

156822-Thumbnail Image.png
Description
Hardware implementation of deep neural networks is earning significant importance nowadays. Deep neural networks are mathematical models that use learning algorithms inspired by the brain. Numerous deep learning algorithms such as multi-layer perceptrons (MLP) have demonstrated human-level recognition accuracy in

Hardware implementation of deep neural networks is earning significant importance nowadays. Deep neural networks are mathematical models that use learning algorithms inspired by the brain. Numerous deep learning algorithms such as multi-layer perceptrons (MLP) have demonstrated human-level recognition accuracy in image and speech classification tasks. Multiple layers of processing elements called neurons with several connections between them called synapses are used to build these networks. Hence, it involves operations that exhibit a high level of parallelism making it computationally and memory intensive. Constrained by computing resources and memory, most of the applications require a neural network which utilizes less energy. Energy efficient implementation of these computationally intense algorithms on neuromorphic hardware demands a lot of architectural optimizations. One of these optimizations would be the reduction in the network size using compression and several studies investigated compression by introducing element-wise or row-/column-/block-wise sparsity via pruning and regularization. Additionally, numerous recent works have concentrated on reducing the precision of activations and weights with some reducing to a single bit. However, combining various sparsity structures with binarized or very-low-precision (2-3 bit) neural networks have not been comprehensively explored. Output activations in these deep neural network algorithms are habitually non-binary making it difficult to exploit sparsity. On the other hand, biologically realistic models like spiking neural networks (SNN) closely mimic the operations in biological nervous systems and explore new avenues for brain-like cognitive computing. These networks deal with binary spikes, and they can exploit the input-dependent sparsity or redundancy to dynamically scale the amount of computation in turn leading to energy-efficient hardware implementation. This work discusses configurable spiking neuromorphic architecture that supports multiple hidden layers exploiting hardware reuse. It also presents design techniques for minimum-area/-energy DNN hardware with minimal degradation in accuracy. Area, performance and energy results of these DNN and SNN hardware is reported for the MNIST dataset. The Neuromorphic hardware designed for SNN algorithm in 28nm CMOS demonstrates high classification accuracy (>98% on MNIST) and low energy (51.4 - 773 (nJ) per classification). The optimized DNN hardware designed in 40nm CMOS that combines 8X structured compression and 3-bit weight precision showed 98.4% accuracy at 33 (nJ) per classification.
Date Created
2018
Agent

Energy Modeling of Machine Learning Algorithms on General Purpose Hardware

156813-Thumbnail Image.png
Description
Articial Neural Network(ANN) has become a for-bearer in the field of Articial Intel-

ligence. The innovations in ANN has led to ground breaking technological advances

like self-driving vehicles,medical diagnosis,speech Processing,personal assistants and

many more. These were inspired by evolution and working of our

Articial Neural Network(ANN) has become a for-bearer in the field of Articial Intel-

ligence. The innovations in ANN has led to ground breaking technological advances

like self-driving vehicles,medical diagnosis,speech Processing,personal assistants and

many more. These were inspired by evolution and working of our brains. Similar

to how our brain evolved using a combination of epigenetics and live stimulus,ANN

require training to learn patterns.The training usually requires a lot of computation

and memory accesses. To realize these systems in real embedded hardware many

Energy/Power/Performance issues needs to be solved. The purpose of this research

is to focus on methods to study data movement requirement for generic Neural Net-

work along with the energy associated with it and suggest some ways to improve the

design.Many methods have suggested ways to optimize using mix of computation and

data movement solutions without affecting task accuracy. But these methods lack a

computation model to calculate the energy and depend on mere back of the envelope calculation. We realized that there is a need for a generic quantitative analysis

for memory access energy which helps in better architectural exploration. We show

that the present architectural tools are either incompatible or too slow and we need

a better analytical method to estimate data movement energy. We also propose a

simplistic yet effective approach that is robust and expandable by users to support

various systems.
Date Created
2018
Agent

Stagioni: Temperature management to enable near-sensor processing for performance, fidelity, and energy-efficiency of vision and imaging workloads

156790-Thumbnail Image.png
Description
Vision processing on traditional architectures is inefficient due to energy-expensive off-chip data movements. Many researchers advocate pushing processing close to the sensor to substantially reduce data movements. However, continuous near-sensor processing raises the sensor temperature, impairing the fidelity of imaging/vision

Vision processing on traditional architectures is inefficient due to energy-expensive off-chip data movements. Many researchers advocate pushing processing close to the sensor to substantially reduce data movements. However, continuous near-sensor processing raises the sensor temperature, impairing the fidelity of imaging/vision tasks.

The work characterizes the thermal implications of using 3D stacked image sensors with near-sensor vision processing units. The characterization reveals that near-sensor processing reduces system power but degrades image quality. For reasonable image fidelity, the sensor temperature needs to stay below a threshold, situationally determined by application needs. Fortunately, the characterization also identifies opportunities -- unique to the needs of near-sensor processing -- to regulate temperature based on dynamic visual task requirements and rapidly increase capture quality on demand.

Based on the characterization, the work proposes and investigate two thermal management strategies -- stop-capture-go and seasonal migration -- for imaging-aware thermal management. The work present parameters that govern the policy decisions and explore the trade-offs between system power and policy overhead. The work's evaluation shows that the novel dynamic thermal management strategies can unlock the energy-efficiency potential of near-sensor processing with minimal performance impact, without compromising image fidelity.
Date Created
2019
Agent

Joint Optimization of Quantization and Structured Sparsity for Compressed Deep Neural Networks

156610-Thumbnail Image.png
Description
Deep neural networks (DNN) have shown tremendous success in various cognitive tasks, such as image classification, speech recognition, etc. However, their usage on resource-constrained edge devices has been limited due to high computation and large memory requirement.

To overcome these

Deep neural networks (DNN) have shown tremendous success in various cognitive tasks, such as image classification, speech recognition, etc. However, their usage on resource-constrained edge devices has been limited due to high computation and large memory requirement.

To overcome these challenges, recent works have extensively investigated model compression techniques such as element-wise sparsity, structured sparsity and quantization. While most of these works have applied these compression techniques in isolation, there have been very few studies on application of quantization and structured sparsity together on a DNN model.

This thesis co-optimizes structured sparsity and quantization constraints on DNN models during training. Specifically, it obtains optimal setting of 2-bit weight and 2-bit activation coupled with 4X structured compression by performing combined exploration of quantization and structured compression settings. The optimal DNN model achieves 50X weight memory reduction compared to floating-point uncompressed DNN. This memory saving is significant since applying only structured sparsity constraints achieves 2X memory savings and only quantization constraints achieves 16X memory savings. The algorithm has been validated on both high and low capacity DNNs and on wide-sparse and deep-sparse DNN models. Experiments demonstrated that deep-sparse DNN outperforms shallow-dense DNN with varying level of memory savings depending on DNN precision and sparsity levels. This work further proposed a Pareto-optimal approach to systematically extract optimal DNN models from a huge set of sparse and dense DNN models. The resulting 11 optimal designs were further evaluated by considering overall DNN memory which includes activation memory and weight memory. It was found that there is only a small change in the memory footprint of the optimal designs corresponding to the low sparsity DNNs. However, activation memory cannot be ignored for high sparsity DNNs.
Date Created
2018
Agent

Power-Performance Modeling and Adaptive Management of Heterogeneous Mobile Platforms​

156489-Thumbnail Image.png
Description
Nearly 60% of the world population uses a mobile phone, which is typically powered by a system-on-chip (SoC). While the mobile platform capabilities range widely, responsiveness, long battery life and reliability are common design concerns that are crucial to remain

Nearly 60% of the world population uses a mobile phone, which is typically powered by a system-on-chip (SoC). While the mobile platform capabilities range widely, responsiveness, long battery life and reliability are common design concerns that are crucial to remain competitive. Consequently, state-of-the-art mobile platforms have become highly heterogeneous by combining a powerful SoC with numerous other resources, including display, memory, power management IC, battery and wireless modems. Furthermore, the SoC itself is a heterogeneous resource that integrates many processing elements, such as CPU cores, GPU, video, image, and audio processors. Therefore, CPU cores do not dominate the platform power consumption under many application scenarios.

Competitive performance requires higher operating frequency, and leads to larger power consumption. In turn, power consumption increases the junction and skin temperatures, which have adverse effects on the device reliability and user experience. As a result, allocating the power budget among the major platform resources and temperature control have become fundamental consideration for mobile platforms. Dynamic thermal and power management algorithms address this problem by putting a subset of the processing elements or shared resources to sleep states, or throttling their frequencies. However, an adhoc approach could easily cripple the performance, if it slows down the performance-critical processing element. Furthermore, mobile platforms run a wide range of applications with time varying workload characteristics, unlike early generations, which supported only limited functionality. As a result, there is a need for adaptive power and performance management approaches that consider the platform as a whole, rather than focusing on a subset. Towards this need, our specific contributions include (a) a framework to dynamically select the Pareto-optimal frequency and active cores for the heterogeneous CPUs, such as ARM big.Little architecture, (b) a dynamic power budgeting approach for allocating optimal power consumption to the CPU and GPU using performance sensitivity models for each PE, (c) an adaptive GPU frame time sensitivity prediction model to aid power management algorithms, and (d) an online learning algorithm that constructs adaptive run-time models for non-stationary workloads.
Date Created
2018
Agent

Design of Resistive Synaptic Devices and Array Architectures for Neuromorphic Computing

156195-Thumbnail Image.png
Description
Over the past few decades, the silicon complementary-metal-oxide-semiconductor (CMOS) technology has been greatly scaled down to achieve higher performance, density and lower power consumption. As the device dimension is approaching its fundamental physical limit, there is an increasing demand for

Over the past few decades, the silicon complementary-metal-oxide-semiconductor (CMOS) technology has been greatly scaled down to achieve higher performance, density and lower power consumption. As the device dimension is approaching its fundamental physical limit, there is an increasing demand for exploration of emerging devices with distinct operating principles from conventional CMOS. In recent years, many efforts have been devoted in the research of next-generation emerging non-volatile memory (eNVM) technologies, such as resistive random access memory (RRAM) and phase change memory (PCM), to replace conventional digital memories (e.g. SRAM) for implementation of synapses in large-scale neuromorphic computing systems.

Essentially being compact and “analog”, these eNVM devices in a crossbar array can compute vector-matrix multiplication in parallel, significantly speeding up the machine/deep learning algorithms. However, non-ideal eNVM device and array properties may hamper the learning accuracy. To quantify their impact, the sparse coding algorithm was used as a starting point, where the strategies to remedy the accuracy loss were proposed, and the circuit-level design trade-offs were also analyzed. At architecture level, the parallel “pseudo-crossbar” array to prevent the write disturbance issue was presented. The peripheral circuits to support various parallel array architectures were also designed. One key component is the read circuit that employs the principle of integrate-and-fire neuron model to convert the analog column current to digital output. However, the read circuit is not area-efficient, which was proposed to be replaced with a compact two-terminal oscillation neuron device that exhibits metal-insulator-transition phenomenon.

To facilitate the design exploration, a circuit-level macro simulator “NeuroSim” was developed in C++ to estimate the area, latency, energy and leakage power of various neuromorphic architectures. NeuroSim provides a wide variety of design options at the circuit/device level. NeuroSim can be used alone or as a supporting module to provide circuit-level performance estimation in neural network algorithms. A 2-layer multilayer perceptron (MLP) simulator with integration of NeuroSim was demonstrated to evaluate both the learning accuracy and circuit-level performance metrics for the online learning and offline classification, as well as to study the impact of eNVM reliability issues such as data retention and write endurance on the learning performance.
Date Created
2018
Agent

Ionospheric Channel Modeling and Estimation

156077-Thumbnail Image.png
Description
The goal is to provide accurate measurement of the channel between a ground source and a receiving satellite.

The effects of the the ionosphere for ground to space propagation for radio waves in the 3-30 MHz HF band is an unstudied

The goal is to provide accurate measurement of the channel between a ground source and a receiving satellite.

The effects of the the ionosphere for ground to space propagation for radio waves in the 3-30 MHz HF band is an unstudied subject.

The effects of the ionosphere on radio propagation is a long studied subject, the primary focus has been ground to ground by means of ionospheric reflection and space to ground corrections of ionospheric distortions of GPS.

Because of the plasma properties of the ionosphere there is a strong dependence on the frequency of use.

GPS L1 1575.42 MHz and L2 1227.60 MHz are much less effected than the 3-30 MHz HF band used for skywave propagation.

The channel between the ground transmitter and the satellite receiver is characterized by 2 unique polarization modes with respective delays and Dopplers.

Accurate estimates of delay and Doppler are done using polynomial fit functions.

The application of polarimetric separation of the two propagating polarizations allows improved estimate quality of delay and Doppler of the respective mode.

These methods yield good channel models and an effective channel estimation method well suited for the ground to space propagation.
Date Created
2017
Agent