Development of Simulation Components for Wireless Communication

Description
This thesis work presents the simulation of Bluetooth and Wi-Fi radios in real-life interference environments. When information is transmitted over communication channels, data may be corrupted by noise and other channel impairments. To receive the information safely and correctly, error-correction coding schemes are generally employed in the design of communication systems. Simulations of wireless communication systems usually focus on one aspect of communication and neglect the others: currently available simulators perform either network-layer or physical-layer simulations. Many situations, however, call for simulations that capture the inter-layer behavior of communication systems. For such scenarios, a simulation environment called WiscaComm, which operates on time-domain samples, is built. WiscaComm allows the study of network and physical layer interactions in detail. The advantage of time-domain sampling is that it allows different radios to be simulated together, which the complex-baseband representation of symbols does not support as naturally. The environment also supports the study of multiple protocols operating simultaneously, which is of increasing importance today.
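
As a rough illustration of why shared time-domain samples make cross-radio interference simulation straightforward, the sketch below sums the sample streams of two toy radios on one time base. This is a minimal sketch under assumed parameters (sample rate, carriers, modulation), not WiscaComm's actual design.

```python
import numpy as np

# Minimal sketch: combining time-domain samples of two co-channel radios.
# All parameters (sample rate, carriers, gains) are illustrative assumptions.
fs = 100e6                      # common simulation sample rate (Hz)
t = np.arange(10000) / fs       # shared time base for one simulation slice

def radio_waveform(fc, symbol_rate, rng):
    """Toy BPSK radio: random symbols on carrier fc, sampled at fs."""
    n_sym = int(len(t) * symbol_rate / fs) + 1
    symbols = rng.choice([-1.0, 1.0], n_sym)
    baseband = symbols[(t * symbol_rate).astype(int)]
    return baseband * np.exp(2j * np.pi * fc * t)

rng = np.random.default_rng(0)
wifi = radio_waveform(fc=10e6, symbol_rate=2e6, rng=rng)   # stand-in "Wi-Fi"
bt   = radio_waveform(fc=11e6, symbol_rate=1e6, rng=rng)   # stand-in "Bluetooth"
noise = (rng.standard_normal(len(t)) + 1j * rng.standard_normal(len(t))) / np.sqrt(2)

# Because both radios exist as samples on a shared time base, interference
# is just addition -- no per-protocol baseband abstraction is needed.
channel = wifi + 0.5 * bt + 0.1 * noise
```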
Date Created
2017

Low-power Physical-layer Design for LTE Based Very Narrowband IoT (VNB-IoT) Communication

Description
With the new-age Internet of Things (IoT) revolution, there is a need to connect a wide range of devices with varying throughput and performance requirements. In this thesis, a wireless system is proposed that targets very low power, delay-insensitive IoT applications with low throughput requirements. The low-cost receivers for such devices will have very low complexity, consume very little power, and hence run for several years.

Long Term Evolution (LTE) is a standard developed and administered by the 3rd Generation Partnership Project (3GPP) for high-speed wireless communications for mobile devices. As part of Release 13, 3GPP introduced another standard, narrowband IoT (NB-IoT), to serve the needs of IoT applications with low throughput requirements. Working along similar lines, this thesis proposes yet another LTE-based solution, very narrowband IoT (VNB-IoT), which further reduces the complexity and power consumption of the user equipment (UE) while keeping the base station (BS) architecture as defined in NB-IoT.

In the downlink operation, the transmitter of the proposed system uses the NB-IoT resource block with each subcarrier modulated with data symbols intended for a different user. On the receiver side, each UE locks to a particular subcarrier frequency instead of the entire resource block and operates as a single-carrier receiver. On the uplink, the system uses single-tone transmission as specified in the NB-IoT standard.
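
The sketch below illustrates this downlink idea: one multicarrier symbol carries a different user's data on each subcarrier, and a UE recovers only its own subcarrier by correlating with a single complex exponential rather than computing a full FFT. The FFT size, subcarrier count, and bin mapping are assumptions for illustration, not values from the thesis.

```python
import numpy as np

# Illustrative single-subcarrier receiver for a shared multicarrier downlink.
N = 128          # FFT size (assumed)
K = 12           # subcarriers in the resource block, one user per subcarrier
rng = np.random.default_rng(1)

# One QPSK symbol per user, mapped onto adjacent subcarriers.
user_syms = (rng.choice([-1, 1], K) + 1j * rng.choice([-1, 1], K)) / np.sqrt(2)
grid = np.zeros(N, complex)
grid[1:K + 1] = user_syms               # occupy bins 1..K (assumed mapping)
tx = np.fft.ifft(grid) * N              # time-domain multicarrier symbol

# UE for user k: correlate with its own subcarrier only -- a single-carrier
# receiver, far simpler than demodulating the entire resource block.
k = 4
n = np.arange(N)
rx_sym = np.sum(tx * np.exp(-2j * np.pi * (k + 1) * n / N)) / N
assert np.allclose(rx_sym, user_syms[k])
```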

Performance of the proposed system is analyzed in an additive white Gaussian noise (AWGN) channel, followed by an analysis of the inter-carrier interference (ICI). A relationship between the overall filter bandwidth and the ICI is established towards the end.
Date Created
2017

Algorithm and Hardware Co-design for Learning On-a-chip

Description
Machine learning technology has achieved many incredible results in recent years, rivalling or exceeding human performance in intellectual tasks such as image recognition, face detection, and the game of Go. Many machine learning algorithms require huge amounts of computation, for example in the multiplication of large matrices. As silicon technology has scaled into the sub-14 nm regime, simply scaling down devices can no longer provide enough speedup. New device technologies and system architectures are needed to improve computing capacity. Hardware designed specifically for machine learning is in high demand, and effort must go into the joint design and optimization of both hardware and algorithms.

For machine learning acceleration, traditional SRAM- and DRAM-based systems suffer from low capacity, high latency, and high standby power. Emerging memories, such as Phase Change Random Access Memory (PRAM), Spin-Transfer Torque Magnetic Random Access Memory (STT-MRAM), and Resistive Random Access Memory (RRAM), are promising candidates offering low standby power, high data density, fast access, and excellent scalability. This dissertation proposes a hierarchical memory modeling framework and models PRAM and STT-MRAM at four different levels of abstraction. With the proposed models, various simulations are conducted to investigate performance, optimization, variability, reliability, and scalability.

Emerging memory devices such as RRAM can be organized as a 2-D crosspoint array to speed up the multiplication and accumulation operations in machine learning algorithms. This dissertation proposes a new parallel programming scheme that achieves in-memory learning with an RRAM crosspoint array. The programming circuitry is designed and simulated in TSMC 65 nm technology, showing a 900x speedup for the dictionary learning task compared to a CPU implementation.
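
The idealized sketch below shows why a crosspoint array accelerates multiply-accumulate: with input voltages driven onto the rows and cell conductances at the crosspoints, Kirchhoff's current law sums each column's currents in a single analog step. The values are arbitrary illustrations, not the dissertation's device parameters.

```python
import numpy as np

# Idealized crosspoint-array multiply-accumulate: I[j] = sum_i V[i] * G[i, j].
rng = np.random.default_rng(2)
G = rng.uniform(1e-6, 1e-4, size=(64, 32))   # cell conductances (siemens)
V = rng.uniform(0.0, 0.2, size=64)           # row input voltages (volts)

# One matrix-vector product "in memory": each column current is an
# accumulated sum computed by the array's physics, not by an ALU.
I = V @ G
```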

From the algorithm perspective, inspired by the high accuracy and low power of the brain, this dissertation proposes a biologically plausible feedforward-inhibition spiking neural network with a Spike-Rate-Dependent Plasticity (SRDP) learning rule. It achieves more than 95% accuracy on the MNIST dataset, which is comparable to the sparse coding algorithm but requires far fewer computations. The role of inhibition in this network is systematically studied and shown to improve hardware efficiency in learning.
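
To make the feedforward-inhibition structure concrete, here is a minimal leaky integrate-and-fire (LIF) sketch in which an inhibitory drive computed from the same inputs subtracts from the excitatory neurons' membranes. This only illustrates the network topology; the SRDP learning rule is not reproduced, and every constant below is an assumption.

```python
import numpy as np

# Minimal LIF network with feedforward inhibition (illustrative only).
rng = np.random.default_rng(3)
n_exc, n_in, steps = 10, 100, 200
w_exc = rng.uniform(0, 0.05, (n_exc, n_in))  # input -> excitatory weights
w_inh = 0.2                                   # inhibition strength (assumed)
v = np.zeros(n_exc)                           # membrane potentials
tau, v_th = 20.0, 1.0                         # time constant, spike threshold

for t in range(steps):
    spikes_in = rng.random(n_in) < 0.05       # Poisson-like input spikes
    inh = w_inh * spikes_in.sum() / n_in      # feedforward inhibitory drive
    v += (-v / tau) + w_exc @ spikes_in - inh # leak + excitation - inhibition
    fired = v >= v_th
    v[fired] = 0.0                            # reset neurons that spiked
```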
Date Created
2017

Intelligent Scheduling and Memory Management Techniques for Modern GPU Architectures

Description
With their massive multithreading execution capability, graphics processing units (GPUs) have been widely deployed to accelerate general-purpose parallel workloads (GPGPUs). However, using GPUs to accelerate computation does not always yield a good performance improvement, mainly because of three inefficiencies in modern GPU and system architectures.

First, not all parallel threads have a uniform amount of workload to fully utilize a GPU's computation ability, leading to a sub-optimal performance problem called warp criticality. To mitigate warp criticality, I propose a Criticality-Aware Warp Acceleration mechanism called CAWA. CAWA predicts and accelerates critical warp execution by allocating larger execution time slices and additional cache resources to the critical warp. The evaluation results show that with CAWA, GPUs can achieve an average speedup of 1.23x.
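
A toy illustration of the scheduling policy (not CAWA's actual predictor or microarchitecture): the warp predicted to finish last is the critical one, so it is granted extra scheduling slices so the whole thread block completes sooner. All numbers are made up.

```python
# Criticality-aware scheduling sketch: boost the slowest (critical) warp.
warps = {"w0": 120, "w1": 80, "w2": 200, "w3": 95}  # predicted remaining work

def pick_slices(remaining, boost=2):
    """Give `boost` issue slices to the predicted critical warp, 1 to others."""
    critical = max(remaining, key=remaining.get)
    return [(w, boost if w == critical else 1) for w in remaining]

print(pick_slices(warps))   # w2 is critical, so it receives extra slices
```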

Second, the shared cache storage in GPUs is often insufficient to accommodate the demands of the large number of concurrent threads. As a result, cache thrashing is commonly experienced in GPU cache memories, particularly in the L1 data caches. To alleviate the cache contention and thrashing problem, I develop an instruction-aware, control-loop-based adaptive bypassing algorithm called Ctrl-C. Ctrl-C learns cache reuse behavior and bypasses a portion of memory requests with the help of feedback control loops. The evaluation results show that Ctrl-C effectively improves cache utilization in GPUs and achieves an average speedup of 1.42x for cache-sensitive GPGPU workloads.
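
The sketch below shows the general feedback-control idea behind adaptive bypassing (an illustration, not Ctrl-C's actual controller): a proportional controller nudges the bypass fraction toward whatever value keeps the measured hit rate near a target. The target and gain are assumptions.

```python
# Feedback control loop for adaptive cache bypassing (illustrative sketch).
class BypassController:
    def __init__(self, target_hit_rate=0.6, gain=0.5):
        self.target = target_hit_rate
        self.gain = gain
        self.bypass_frac = 0.0   # fraction of requests sent around the cache

    def update(self, measured_hit_rate):
        # A low hit rate suggests thrashing: bypass more to protect hot lines.
        error = self.target - measured_hit_rate
        self.bypass_frac = min(1.0, max(0.0, self.bypass_frac + self.gain * error))
        return self.bypass_frac

ctrl = BypassController()
for hit_rate in [0.2, 0.3, 0.5, 0.65]:   # hit rate observed in each epoch
    print(ctrl.update(hit_rate))          # bypass fraction settles as hits rise
```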

Finally, GPU workloads and the co-located processes running on the host chip multiprocessor (CMP) in a heterogeneous system can contend for memory resources at multiple levels, resulting in significant performance degradation. To maximize system throughput and balance the performance degradation of all co-located applications, I design a scalable performance degradation predictor for heterogeneous systems, called HeteroPDP. HeteroPDP predicts application execution times and schedules OpenCL workloads to run on different devices based on the optimization goal. The evaluation results show that HeteroPDP improves system fairness from 24% to 65% when an OpenCL application is co-located with other processes, and gains an additional 50% speedup compared with always offloading the OpenCL workload to the GPU.

In summary, this dissertation aims to provide insights for future microarchitecture and system architecture designs by identifying, analyzing, and addressing three critical performance problems in modern GPUs.
Date Created
2017

Low Complexity Wireless Communication Digital Baseband Design

Description
This thesis addresses two problems in digital baseband design of wireless communication systems, namely, those in Internet of Things (IoT) terminals that support long range communications and those in full-duplex systems that are designed for high spectral efficiency.

IoT terminals for long-range communications are typically based on Orthogonal Frequency-Division Multiple Access (OFDMA) and spread-spectrum technologies. In order to design an efficient baseband architecture for such terminals, the workload profiles of both systems are analyzed. Since the frame detection unit has by far the highest computational load, a simple architecture that uses only a scalar datapath is proposed. To optimize for low energy consumption, application-specific instructions that minimize register accesses, and address generation units for streamlined memory access, are introduced. Two parameters, namely the correlation window size and the threshold value, affect the detection probability, the false alarm probability, and hence the energy consumption. Next, energy-optimal operating settings for the correlation window size and threshold value are derived for different channel conditions. For both good and bad channel conditions, if the target signal detection probability is greater than 0.9, the baseband processor consumes the least energy when the frame detection algorithm uses the longest correlation window and the highest threshold value.
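
For context, the sketch below shows the generic technique the two parameters control: sliding-window correlation against a known preamble, with a frame declared when the normalized correlation exceeds the threshold. The preamble, window size, and threshold here are assumptions, not the thesis's algorithm.

```python
import numpy as np

# Threshold-based frame detection with a correlation window (generic sketch).
def detect_frame(rx, preamble, threshold=0.8):
    """Return the first index where normalized correlation with the known
    preamble exceeds the threshold, or None if no frame is detected."""
    w = len(preamble)                          # correlation window size
    p_energy = np.sqrt(np.sum(np.abs(preamble) ** 2))
    for i in range(len(rx) - w + 1):
        seg = rx[i:i + w]
        corr = np.abs(np.vdot(preamble, seg))
        norm = p_energy * np.sqrt(np.sum(np.abs(seg) ** 2)) + 1e-12
        if corr / norm > threshold:            # higher threshold: fewer false alarms
            return i
    return None

rng = np.random.default_rng(4)
preamble = np.exp(2j * np.pi * rng.random(32))
rx = np.concatenate([0.05 * rng.standard_normal(100), preamble,
                     0.05 * rng.standard_normal(100)]).astype(complex)
print(detect_frame(rx, preamble))              # detects the frame near index 100
```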

A full-duplex system has high spectral efficiency but suffers from self-interference. Part of the interference can be cancelled digitally using equalization techniques. The cancellation performance and computational complexity of the competing equalization algorithms, namely Least Mean Square (LMS), Normalized LMS (NLMS), Recursive Least Square (RLS), and feedback equalizers based on LMS, NLMS, and RLS, are analyzed, and a trade-off between performance and complexity is established. The NLMS linear equalizer is found to be suitable for resource-constrained mobile devices, while the NLMS decision-feedback equalizer is more appropriate for base stations that are not energy constrained.
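
For reference, here is the textbook NLMS update named in the abstract, applied to identifying an unknown response; the normalization makes the step size robust to input power, which is why NLMS is attractive at low complexity. The filter length, step size, and test channel are arbitrary choices.

```python
import numpy as np

# Textbook NLMS adaptive filter (illustrative parameters).
def nlms(x, d, n_taps=8, mu=0.5, eps=1e-8):
    """Adapt w so the filter output tracks the desired signal d."""
    w = np.zeros(n_taps)
    for k in range(n_taps - 1, len(d)):
        xk = x[k - n_taps + 1:k + 1][::-1]      # most recent sample first
        e = d[k] - w @ xk                       # estimation error
        w += mu * e * xk / (xk @ xk + eps)      # normalized step: stable
                                                # regardless of input power
    return w

rng = np.random.default_rng(5)
x = rng.standard_normal(2000)
h = np.array([0.9, -0.4, 0.2])                  # unknown response to identify
d = np.convolve(x, h)[:len(x)]
print(np.round(nlms(x, d)[:3], 2))              # converges toward h
```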
Date Created
2017

Development of Multiple Protocols in Novel Simulation Environment

Description
When one considers the current state of wireless communications, it becomes clear that it is both absolutely amazing and something of a mess. Present communications standards are the result of local optimizations over time, which have led to a confusing set of suboptimal and fragile wireless standards. Starting from a clean sheet of paper, the Bliss Laboratory for Information, Signals, and Systems (BLISS) is considering a fluid set of communications standards co-optimized with flexible but power-efficient computational implementations, to enable the next revolution of wireless communications. The main aim is to enable much higher data rates, as well as much lower data rates with correspondingly lower power consumption, as the needs of users vary.

The thesis looks at the different parts of the work done to prime the development of the protocol development engine. It discusses channel modeling and the system integration of the receiver and channel noise. It also proposes a Carrier-Sense Multiple Access (CSMA) Media Access Control (MAC) layer protocol implementation for the Wi-Fi (Wireless Fidelity) protocol. This work also describes the Graphical User Interface (GUI), which is part of the Protocol Development Kit (PDK), a combination of the Protocol Recommendation Engine (PRE) and a simulation package that aids the development of protocols. It also sheds light on the Automatic Dependent Surveillance-Broadcast (ADS-B) radio protocol, which will eventually replace radar as Air Traffic Control's (ATC) primary tool for separating aircraft.
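
As a minimal sketch of the CSMA access method named above (not the thesis's MAC implementation), the following senses the channel, transmits if idle, and otherwise backs off a random number of slots with a doubling window. All limits and window sizes are assumptions.

```python
import random

# Minimal CSMA with binary exponential backoff (illustrative sketch).
def csma_send(channel_busy, max_attempts=5, base_window=4):
    slot = 0                                   # elapsed time in slots
    for attempt in range(max_attempts):
        if not channel_busy(slot):             # carrier sense
            return slot                        # idle: transmit in this slot
        # back off before sensing again; window doubles each attempt
        slot += random.randint(1, base_window * 2 ** attempt)
    return None                                # gave up: channel stayed busy

# Toy medium that is busy for the first 10 slots, then free.
print(csma_send(lambda s: s < 10))
```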

All the algorithms used in this thesis to define radio operation were in principle defined by mathematical descriptions; however, to test and implement these algorithms, they had to be converted to a computer language. There were multiple phases of this conversion. In the first phase, the algorithms were implemented in Matrix Laboratory (MATLAB). To aid this development, basic radio finite state machines and radio algorithmic tools were provided.
Date Created
2017

Computer Vision from Spatial-Multiplexing Cameras at Low Measurement Rates

Description
In settings such as UAVs and parking lots, it is typical to first collect an enormous number of pixels using conventional imagers, then employ expensive methods to compress the data by throwing away redundancy, and finally transmit the compressed data to a ground station. The past decade has seen the emergence of novel imagers called spatial-multiplexing cameras, which offer compression at the sensing level itself by providing arbitrary linear measurements of the scene instead of pixel-based sampling. In this dissertation, I discuss various approaches for effective information extraction from spatial-multiplexing measurements and present the trade-offs between reliability of performance and the computational/storage load of the system.

In the first part, I present a reconstruction-free approach to high-level inference in computer vision. Considering the specific case of activity analysis, I show that using correlation filters, one can perform effective action recognition and localization directly from a class of spatial-multiplexing cameras, called compressive cameras, even at measurement rates as low as 1%. In the second part, I outline a deep learning based, non-iterative, real-time algorithm to reconstruct images from compressively sensed (CS) measurements, which can outperform traditional iterative CS reconstruction algorithms in terms of reconstruction quality and time complexity, especially at low measurement rates. To overcome the limitations of compressive cameras, which operate with random measurements and are not tuned to any particular task, in the third part of the dissertation I propose a method to design spatial-multiplexing measurements that are tuned to facilitate the easy extraction of features useful in computer vision tasks such as object tracking. The work presented in this dissertation provides sufficient evidence for high-level inference in computer vision at extremely low measurement rates, and hence allows us to contemplate revamping current-day computer vision systems.
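
The sketch below makes the acquisition model concrete: each measurement is a linear projection of the scene, and the measurement rate is the ratio of measurements to pixels. The Gaussian sensing matrix and image size are illustrative assumptions; real spatial-multiplexing cameras realize the projections optically.

```python
import numpy as np

# Spatial-multiplexing acquisition at a 1% measurement rate (illustrative).
rng = np.random.default_rng(6)
n_pixels = 64 * 64
rate = 0.01                                   # 1% measurement rate
m = int(rate * n_pixels)                      # 40 measurements for 4096 pixels

Phi = rng.standard_normal((m, n_pixels)) / np.sqrt(m)  # sensing matrix
scene = rng.random(n_pixels)                  # stand-in for the true image
y = Phi @ scene                               # what the camera records

# Inference then works from y directly (reconstruction-free) or through a
# learned or iterative reconstruction, as discussed in the abstract.
```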
Date Created
2017

Locally Adaptive Stereo Vision Based 3D Visual Reconstruction

Description
Using stereo vision for 3D reconstruction and depth estimation has become a popular and promising research area, as it requires only a simple setup with passive cameras and a relatively efficient processing procedure. The work in this dissertation focuses on locally adaptive stereo vision methods and their application to different imaging setups and image scenes.

Solder ball height and substrate coplanarity inspection is essential to the detection of potential connectivity issues in semiconductor units. Current ball height and substrate coplanarity inspection tools are expensive and slow, which makes them difficult to use in a real-time manufacturing setting. In this dissertation, an automatic, stereo vision based, in-line ball height and coplanarity inspection method is presented. The proposed method combines an imaging setup with a computer vision algorithm for reliable, in-line ball height measurement. The imaging setup and calibration, ball height estimation, and substrate coplanarity calculation are presented with novel stereo vision methods. The results of the proposed method are evaluated in a measurement capability analysis (MCA) procedure and compared with the ground truth obtained by an existing laser scanning tool and an existing confocal inspection tool. The proposed system outperforms the existing inspection tools in terms of accuracy and stability.
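
Any rectified-stereo height measurement ultimately rests on the standard triangulation relation, depth Z = f·B/d, where f is the focal length in pixels, B the camera baseline, and d the disparity. The sketch below shows that relation with purely illustrative numbers; the dissertation's calibration and estimation pipeline is not reproduced here.

```python
# Depth from disparity in a rectified stereo pair (standard relation).
def depth_from_disparity(disparity_px, focal_px, baseline_mm):
    """Z = f * B / d: depth is inversely proportional to disparity."""
    return focal_px * baseline_mm / disparity_px

# Illustrative numbers only: 2000 px focal length, 40 mm baseline.
z_top = depth_from_disparity(disparity_px=100.0, focal_px=2000.0, baseline_mm=40.0)
z_sub = depth_from_disparity(disparity_px=99.9,  focal_px=2000.0, baseline_mm=40.0)
ball_height_mm = z_sub - z_top   # substrate depth minus ball-top depth
```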

In a rectified stereo vision system, stereo matching methods can be categorized into global methods and local methods. Local stereo methods are more suitable for real-time processing, with competitive accuracy compared with global methods. This work proposes a stereo matching method based on sparse locally adaptive cost aggregation. In order to reduce outlier disparity values that correspond to mismatches, a novel sparse disparity subset selection method is proposed that assigns a significance status to candidate disparity values and selects the significant disparity values adaptively. An adaptive guided filtering method that uses the disparity subset for refined cost aggregation and disparity calculation is demonstrated. The proposed stereo matching algorithm is tested on the Middlebury and KITTI stereo evaluation benchmark images. A performance analysis of the proposed method in terms of the ℓ0 norm of the disparity subset is presented to demonstrate the achieved efficiency and accuracy.
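
As a baseline for comparison, here is the generic local matcher that sparse locally adaptive aggregation improves on: per-pixel matching costs aggregated over a fixed window, followed by winner-take-all disparity selection. This is a textbook method with assumed parameters, not the dissertation's algorithm.

```python
import numpy as np
from scipy.ndimage import uniform_filter

# Baseline local stereo: window cost aggregation + winner-take-all (WTA).
def local_stereo(left, right, max_disp=16, win=5):
    h, w = left.shape
    cost = np.full((max_disp, h, w), np.inf)
    for d in range(max_disp):
        diff = np.abs(left[:, d:] - right[:, :w - d])   # per-pixel match cost
        cost[d, :, d:] = uniform_filter(diff, size=win) # aggregate over window
    return np.argmin(cost, axis=0)                      # WTA disparity per pixel

rng = np.random.default_rng(7)
right = rng.random((40, 60))
left = np.roll(right, 5, axis=1)    # synthetic pair with true disparity 5
disp = local_stereo(left, right)    # mostly 5, away from the image border
```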
Date Created
2017

Low Complexity Optical Flow Using Neighbor-Guided Semi-Global Matching

Description
Many real-time vision applications require accurate estimation of optical flow. This problem is quite challenging due to its extremely high computation and memory requirements. This thesis focuses on designing low-complexity dense optical flow algorithms.

First, a new method for optical flow based on Semi-Global Matching (SGM), a popular dynamic programming algorithm for stereo vision, is presented. In SGM, the disparity of each pixel is calculated by aggregating local matching costs over the entire image to resolve local ambiguity in texture-less and occluded regions. The proposed method, Neighbor-Guided Semi-Global Matching (NG-fSGM), achieves significantly lower complexity than SGM by 1) operating on a subset of the search space that has been aggressively pruned based on neighboring pixels' information, 2) using a simple cost aggregation function, and 3) approximating the aggregated cost array and embedding pixel-wise matching cost computation and flow computation in the aggregation. Evaluation on the Middlebury benchmark suite showed that, compared to a prior SGM extension for optical flow, the basic NG-fSGM provides robust optical flow with a 0.53% accuracy improvement, a 40x reduction in the number of operations, and a 6x reduction in memory size. To further reduce the complexity, a sparse-to-dense flow estimation method is proposed; it reduces the number of operations and the memory size by 68% and 47%, respectively, with only 0.42% accuracy degradation compared to the basic NG-fSGM.
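
For background, the sketch below implements the classic SGM cost-aggregation recurrence (due to Hirschmuller) along one scan direction for stereo disparity; NG-fSGM prunes this search space and simplifies the aggregation. The penalties P1 and P2 are typical illustrative values.

```python
import numpy as np

# Classic SGM aggregation along the left-to-right scan direction.
def aggregate_left_to_right(C, P1=1.0, P2=8.0):
    """C: matching cost volume of shape (H, W, D), float. Returns L."""
    C = np.asarray(C, dtype=float)
    H, W, D = C.shape
    L = np.empty_like(C)
    L[:, 0] = C[:, 0]
    for x in range(1, W):
        prev = L[:, x - 1]                       # (H, D) costs at previous pixel
        best = prev.min(axis=1, keepdims=True)   # best cost over all disparities
        up = np.full_like(prev, np.inf); up[:, 1:] = prev[:, :-1]
        dn = np.full_like(prev, np.inf); dn[:, :-1] = prev[:, 1:]
        L[:, x] = C[:, x] + np.minimum.reduce([
            prev,            # same disparity as the neighbor: no penalty
            up + P1,         # disparity changes by +-1: small penalty P1
            dn + P1,
            np.broadcast_to(best + P2, prev.shape),  # larger jump: penalty P2
        ]) - best            # normalization keeps L from growing unboundedly
    return L
```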

A parallel block-based version of NG-fSGM is also proposed. The image is divided into overlapping blocks, and the blocks are processed in parallel to improve throughput, latency, and power efficiency. To minimize the amount of overlap among blocks with minimal effect on accuracy, temporal information is used to estimate a flow map that guides flow vector selection for pixels along block boundaries. The proposed block-based NG-fSGM achieves a significant reduction in complexity with only 0.51% accuracy degradation compared to the basic NG-fSGM.
Date Created
2017

Designing Low Cost Error Correction Schemes for Improving Memory Reliability

Description
Memory systems are becoming increasingly error-prone, and thus guaranteeing their reliability is a major challenge. In this dissertation, new techniques to improve the reliability of both 2D and 3D dynamic random access memory (DRAM) systems are presented. The proposed schemes have higher reliability than current systems, but with lower power, better performance, and lower hardware cost.

First, a low-overhead solution that improves the reliability of commodity DRAM systems with no change to the existing memory architecture is presented. Specifically, five erasure and error correction (E-ECC) schemes are proposed that provide at least Chipkill-Correct protection for x4 (Schemes 1, 2, and 3), x8 (Scheme 4), and x16 (Scheme 5) DRAM systems. All schemes have superior error correction performance due to the use of strong symbol-based codes. In addition, the use of erasure codes extends the lifetime of 2D DRAM systems.

Next, two error correction schemes are presented for 3D DRAM memory systems. The first scheme is a rate-adaptive, two-tiered error correction scheme (RATT-ECC) that provides strong reliability (a 10^10x reduction in raw FIT rate) for an HBM-like 3D DRAM system that services CPU applications. The rate-adaptive feature of RATT-ECC enables permanent bank failures to be handled through sparing. It can also be used to significantly reduce refresh power consumption without decreasing reliability or timing performance.

The second scheme is a two-tiered error correction scheme (Config-ECC) that supports different-sized accesses in GPU applications with strong reliability. It addresses the mismatch between the data access size and a fixed-size ECC scheme by designing a flexible, product code based scheme. Config-ECC is built around a core unit designed for 32B accesses, with a simple extension to support 64B and 128B accesses. Compared to fixed 32B and 64B ECC schemes, Config-ECC reduces the failure-in-time (FIT) rate by 200x and 20x, respectively. It also reduces memory energy by 17% (in the dynamic mode) and 21% (in the static mode) compared to a state-of-the-art fixed 64B ECC scheme.
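
To illustrate the product-code principle behind such a scheme (a toy example, not Config-ECC's actual construction): data is arranged in a grid with a parity check per row and per column, so a single flipped bit is located by the intersection of the failing row and column checks.

```python
import numpy as np

# Toy product code: row parity x column parity over a 4x8 bit grid.
def encode(data_bits):
    d = np.asarray(data_bits, dtype=np.uint8).reshape(4, 8)  # 32-bit toy block
    return d, d.sum(axis=1) % 2, d.sum(axis=0) % 2           # row/col parity

def correct(d, row_par, col_par):
    bad_rows = np.flatnonzero(d.sum(axis=1) % 2 != row_par)
    bad_cols = np.flatnonzero(d.sum(axis=0) % 2 != col_par)
    if len(bad_rows) == 1 and len(bad_cols) == 1:            # single-bit error
        d[bad_rows[0], bad_cols[0]] ^= 1                     # flip it back
    return d

rng = np.random.default_rng(8)
d, rp, cp = encode(rng.integers(0, 2, 32))
original = d.copy()
d[2, 5] ^= 1                         # inject a single-bit fault
assert np.array_equal(correct(d, rp, cp), original)
```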
Date Created
2017