Automatic Program Optimization by Semantic Relaxation for Parallel Processing Accelerators

190906-Thumbnail Image.png
Description
Graphic Processing Units (GPUs) have become a key enabler of the big-data revolution, functioning as defacto co-processors to accelerate large-scale computation. As the GPU programming stack and tool support have matured, the technology has alsobecome accessible to programmers. However, optimizing

Graphic Processing Units (GPUs) have become a key enabler of the big-data revolution, functioning as defacto co-processors to accelerate large-scale computation. As the GPU programming stack and tool support have matured, the technology has alsobecome accessible to programmers. However, optimizing programs to run efficiently on GPUs requires developers to have both detailed understanding of the application logic and significant knowledge of parallel programming and GPU architectures. This dissertation proposes GEVO, a tool for automatically tuning the performance of GPU kernels in the LLVM representation to meet desired criteria. GEVO uses population-based search to find edits to programs compiled to LLVM-IR which improves performance on desired criteria and retains required functionality. The evaluation of GEVO on the Rodinia benchmark suite demonstrates many runtime optimization techniques. One of the key insights is that semantic relaxation enables GEVO to discover these optimizations that are usually prohibited by the compiler. GEVO also explores many other optimizations, including architecture- and application-specific. A follow-up evaluation of three bioinformatics applications at their different stages of development suggests that GEVO can optimize programs as well as human experts, sometimes even into a code base that is beyond a programmer’s reach. Furthermore, to unshackle the constraint of GEVO in optimizing neural network (NN) models, GEVO-ML is proposed by extending the representation capability of GEVO, where NN models and the training/prediction process are uniformly represented in a single intermediate language. An evaluation of GEVO-ML shows that GEVO-ML can optimize network models similar to how human developers improve model design, for example, by changing learning rates or pruning non-essential parameters. These results showcase the potential of automated program optimization tools to both reduce the optimization burden for researchers and provide new insights for GPU experts.
Date Created
2023
Agent

Concurrent Checkpointing for Embedded Real-Time Systems

156948-Thumbnail Image.png
Description
The Internet of Things ecosystem has spawned a wide variety of embedded real-time systems that complicate the identification and resolution of bugs in software. The methods of concurrent checkpoint provide a means to monitor the application state with the

The Internet of Things ecosystem has spawned a wide variety of embedded real-time systems that complicate the identification and resolution of bugs in software. The methods of concurrent checkpoint provide a means to monitor the application state with the ability to replay the execution on like hardware and software, without holding off and delaying the execution of application threads. In this thesis, it is accomplished by monitoring physical memory of the application using a soft-dirty page tracker and measuring the various types of overhead when employing concurrent checkpointing. The solution presented is an advancement of the Checkpoint and Replay In Userspace (CRIU) thereby eliminating the large stalls and parasitic operation for each successive checkpoint. Impact and performance is measured using the Parsec 3.0 Benchmark suite and 4.11.12-rt16+ Linux kernel on a MinnowBoard Turbot Quad-Core board.
Date Created
2018
Agent

Memory Subsystem Optimization Techniques for Modern High-Performance General-Purpose Processors

156791-Thumbnail Image.png
Description
General-purpose processors propel the advances and innovations that are the subject of humanity’s many endeavors. Catering to this demand, chip-multiprocessors (CMPs) and general-purpose graphics processing units (GPGPUs) have seen many high-performance innovations in their architectures. With these advances, the memory

General-purpose processors propel the advances and innovations that are the subject of humanity’s many endeavors. Catering to this demand, chip-multiprocessors (CMPs) and general-purpose graphics processing units (GPGPUs) have seen many high-performance innovations in their architectures. With these advances, the memory subsystem has become the performance- and energy-limiting aspect of CMPs and GPGPUs alike. This dissertation identifies and mitigates the key performance and energy-efficiency bottlenecks in the memory subsystem of general-purpose processors via novel, practical, microarchitecture and system-architecture solutions.

Addressing the important Last Level Cache (LLC) management problem in CMPs, I observe that LLC management decisions made in isolation, as in prior proposals, often lead to sub-optimal system performance. I demonstrate that in order to maximize system performance, it is essential to manage the LLCs while being cognizant of its interaction with the system main memory. I propose ReMAP, which reduces the net memory access cost by evicting cache lines that either have no reuse, or have low memory access cost. ReMAP improves the performance of the CMP system by as much as 13%, and by an average of 6.5%.

Rather than the LLC, the L1 data cache has a pronounced impact on GPGPU performance by acting as the bandwidth filter for the rest of the memory subsystem. Prior work has shown that the severely constrained data cache capacity in GPGPUs leads to sub-optimal performance. In this thesis, I propose two novel techniques that address the GPGPU data cache capacity problem. I propose ID-Cache that performs effective cache bypassing and cache line size selection to improve cache capacity utilization. Next, I propose LATTE-CC that considers the GPU’s latency tolerance feature and adaptively compresses the data stored in the data cache, thereby increasing its effective capacity. ID-Cache and LATTE-CC are shown to achieve 71% and 19.2% speedup, respectively, over a wide variety of GPGPU applications.

Complementing the aforementioned microarchitecture techniques, I identify the need for system architecture innovations to sustain performance scalability of GPG- PUs in the face of slowing Moore’s Law. I propose a novel GPU architecture called the Multi-Chip-Module GPU (MCM-GPU) that integrates multiple GPU modules to form a single logical GPU. With intelligent memory subsystem optimizations tailored for MCM-GPUs, it can achieve within 7% of the performance of a similar but hypothetical monolithic die GPU. Taking a step further, I present an in-depth study of the energy-efficiency characteristics of future MCM-GPUs. I demonstrate that the inherent non-uniform memory access side-effects form the key energy-efficiency bottleneck in the future.

In summary, this thesis offers key insights into the performance and energy-efficiency bottlenecks in CMPs and GPGPUs, which can guide future architects towards developing high-performance and energy-efficient general-purpose processors.
Date Created
2018
Agent

From Formal Requirement Analysis to Testing and Monitoring of Cyber-Physical Systems

155975-Thumbnail Image.png
Description
Cyber-Physical Systems (CPS) are being used in many safety-critical applications. Due to the important role in virtually every aspect of human life, it is crucial to make sure that a CPS works properly before its deployment. However, formal verification of

Cyber-Physical Systems (CPS) are being used in many safety-critical applications. Due to the important role in virtually every aspect of human life, it is crucial to make sure that a CPS works properly before its deployment. However, formal verification of CPS is a computationally hard problem. Therefore, lightweight verification methods such as testing and monitoring of the CPS are considered in the industry. The formal representation of the CPS requirements is a challenging task. In addition, checking the system outputs with respect to requirements is a computationally complex problem. In this dissertation, these problems for the verification of CPS are addressed. The first method provides a formal requirement analysis framework which can find logical issues in the requirements and help engineers to correct the requirements. Also, a method is provided to detect tests which vacuously satisfy the requirement because of the requirement structure. This method is used to improve the test generation framework for CPS. Finally, two runtime verification algorithms are developed for off-line/on-line monitoring with respect to real-time requirements. These monitoring algorithms are computationally efficient, and they can be used in practical applications for monitoring CPS with low runtime overhead.
Date Created
2017
Agent

Dynamic analysis of multithreaded embedded software to expose atomicity violations

154335-Thumbnail Image.png
Description
Concurrency bugs are one of the most notorious software bugs and are very difficult to manifest. Significant work has been done on detection of atomicity violations bugs for high performance systems but there is not much work related to detect

Concurrency bugs are one of the most notorious software bugs and are very difficult to manifest. Significant work has been done on detection of atomicity violations bugs for high performance systems but there is not much work related to detect these bugs for embedded systems. Although criteria to claim existence of bugs remains same, approach changes a bit for embedded systems. The main focus of this research is to develop a systemic methodology to address the issue from embedded systems perspective. A framework is developed which predicts the access interleaving patterns that may violate atomicity using memory references of shared variables and provides support to force and analyze these schedules for any output change, system fault or change in execution path.
Date Created
2016
Agent

SmartGateway Framework

154004-Thumbnail Image.png
Description
Cisco estimates that by 2020, 50 billion devices will be connected to the Internet. But 99% of the things today remain isolated and unconnected. Different connectivity protocols, proprietary access, varied device characteristics, security concerns are the main reasons for that

Cisco estimates that by 2020, 50 billion devices will be connected to the Internet. But 99% of the things today remain isolated and unconnected. Different connectivity protocols, proprietary access, varied device characteristics, security concerns are the main reasons for that isolated state. This project aims at designing and building a prototype gateway that exposes a simple and intuitive HTTP Restful interface to access and manipulate devices and the data that they produce while addressing most of the issues listed above. Along with manipulating devices, the framework exposes sensor data in such a way that it can be used to create applications like rules or events that make the home smarter. It also allows the user to represent high-level knowledge by aggregating the low-level sensor data. This high-level representation can be considered as a property of the environment or object rather than the sensor itself which makes interpreting the values more intuitive and accessible.
Date Created
2015
Agent

Dynamic analysis of embedded software

154003-Thumbnail Image.png
Description
Most embedded applications are constructed with multiple threads to handle concurrent events. For optimization and debugging of the programs, dynamic program analysis is widely used to collect execution information while the program is running. Unfortunately, the non-deterministic behavior of multithreaded

Most embedded applications are constructed with multiple threads to handle concurrent events. For optimization and debugging of the programs, dynamic program analysis is widely used to collect execution information while the program is running. Unfortunately, the non-deterministic behavior of multithreaded embedded software makes the dynamic analysis difficult. In addition, instrumentation overhead for gathering execution information may change the execution of a program, and lead to distorted analysis results, i.e., probe effect. This thesis presents a framework that tackles the non-determinism and probe effect incurred in dynamic analysis of embedded software. The thesis largely consists of three parts. First of all, we discusses a deterministic replay framework to provide reproducible execution. Once a program execution is recorded, software instrumentation can be safely applied during replay without probe effect. Second, a discussion of probe effect is presented and a simulation-based analysis is proposed to detect execution changes of a program caused by instrumentation overhead. The simulation-based analysis examines if the recording instrumentation changes the original program execution. Lastly, the thesis discusses data race detection algorithms that help to remove data races for correctness of the replay and the simulation-based analysis. The focus is to make the detection efficient for C/C++ programs, and to increase scalability of the detection on multi-core machines.
Date Created
2015
Agent

Genost: a system for introductory computer science education with a focus on computational thinking

153384-Thumbnail Image.png
Description
Computational thinking, the creative thought process behind algorithmic design and programming, is a crucial introductory skill for both computer scientists and the population in general. In this thesis I perform an investigation into introductory computer science education in the United

Computational thinking, the creative thought process behind algorithmic design and programming, is a crucial introductory skill for both computer scientists and the population in general. In this thesis I perform an investigation into introductory computer science education in the United States and find that computational thinking is not effectively taught at either the high school or the college level. To remedy this, I present a new educational system intended to teach computational thinking called Genost. Genost consists of a software tool and a curriculum based on teaching computational thinking through fundamental programming structures and algorithm design. Genost's software design is informed by a review of eight major computer science educational software systems. Genost's curriculum is informed by a review of major literature on computational thinking. In two educational tests of Genost utilizing both college and high school students, Genost was shown to significantly increase computational thinking ability with a large effect size.
Date Created
2015
Agent

Data movement energy characterization of emerging smartphone workloads for mobile platforms

153089-Thumbnail Image.png
Description
A benchmark suite that is representative of the programs a processor typically executes is necessary to understand a processor's performance or energy consumption characteristics. The first contribution of this work addresses this need for mobile platforms with MobileBench, a

A benchmark suite that is representative of the programs a processor typically executes is necessary to understand a processor's performance or energy consumption characteristics. The first contribution of this work addresses this need for mobile platforms with MobileBench, a selection of representative smartphone applications. In smartphones, like any other portable computing systems, energy is a limited resource. Based on the energy characterization of a commercial widely-used smartphone, application cores are found to consume a significant part of the total energy consumption of the device. With this insight, the subsequent part of this thesis focuses on the portion of energy that is spent to move data from the memory system to the application core's internal registers. The primary motivation for this work comes from the relatively higher power consumption associated with a data movement instruction compared to that of an arithmetic instruction. The data movement energy cost is worsened esp. in a System on Chip (SoC) because the amount of data received and exchanged in a SoC based smartphone increases at an explosive rate. A detailed investigation is performed to quantify the impact of data movement

on the overall energy consumption of a smartphone device. To aid this study, microbenchmarks that generate desired data movement patterns between different levels of the memory hierarchy are designed. Energy costs of data movement are then computed by measuring the instantaneous power consumption of the device when the micro benchmarks are executed. This work makes an extensive use of hardware performance counters to validate the memory access behavior of microbenchmarks and to characterize the energy consumed in moving data. Finally, the calculated energy costs of data movement are used to characterize the portion of energy that MobileBench applications spend in moving data. The results of this study show that a significant 35% of the total device energy is spent in data movement alone. Energy is an increasingly important criteria in the context of designing architectures for future smartphones and this thesis offers insights into data movement energy consumption.
Date Created
2014
Agent

Asymmetric multiprocessing real time operating system on multicore platforms

152993-Thumbnail Image.png
Description
The need for multi-core architectural trends was realized in the desktop computing domain fairly long back. This trend is also beginning to be seen in the deeply embedded systems such as automotive and avionics industry owing to ever increasing demands

The need for multi-core architectural trends was realized in the desktop computing domain fairly long back. This trend is also beginning to be seen in the deeply embedded systems such as automotive and avionics industry owing to ever increasing demands in terms of sheer computational bandwidth, responsiveness, reliability and power consumption constraints. The adoption of such multi-core architectures in safety critical systems is often met with resistance owing to the overhead in migration of the existing stable code base to the new system setup, typically requiring extensive re-design. This also brings about the need for exhaustive testing and validation that goes hand in hand with such a migration, especially in safety critical real-time systems.

This project highlights the steps to develop an asymmetric multiprocessing variant of Micrium µC/OS-II real-time operating system suited for a multi-core system. This RTOS variant also supports multi-core synchronization, shared memory management and multi-core messaging queues.

Since such specialized embedded systems are usually developed by system designers focused more so on the functionality than on the coding standards, the adoption of automatic production code generation tools, such as SIMULINK's Embedded Coder, is increasingly becoming the industry norm. Such tools are capable of producing robust, industry compliant code with very little roll out time. This project documents the process of extending SIMULINK's automatic code generation tool for the AMP variant of µC/OS-II on Freescale's MPC5675K, dual-core Microcontroller Unit. This includes code generation from task based models and multi-rate models. Apart from this, it also de-scribes the development of additional software tools to allow semantically consistent communication between task on the same kernel and those across the kernels.
Date Created
2014
Agent