Towards energy efficient computing with Linux: enabling task level power awareness and support for energy efficient accelerator

151941-Thumbnail Image.png
Description
With increasing transistor volume and reducing feature size, it has become a major design constraint to reduce power consumption also. This has given rise to aggressive architectural changes for on-chip power management and rapid development to energy efficient hardware accelerators.

With increasing transistor volume and reducing feature size, it has become a major design constraint to reduce power consumption also. This has given rise to aggressive architectural changes for on-chip power management and rapid development to energy efficient hardware accelerators. Accordingly, the objective of this research work is to facilitate software developers to leverage these hardware techniques and improve energy efficiency of the system. To achieve this, I propose two solutions for Linux kernel: Optimal use of these architectural enhancements to achieve greater energy efficiency requires accurate modeling of processor power consumption. Though there are many models available in literature to model processor power consumption, there is a lack of such models to capture power consumption at the task-level. Task-level energy models are a requirement for an operating system (OS) to perform real-time power management as OS time multiplexes tasks to enable sharing of hardware resources. I propose a detailed design methodology for constructing an architecture agnostic task-level power model and incorporating it into a modern operating system to build an online task-level power profiler. The profiler is implemented inside the latest Linux kernel and validated for Intel Sandy Bridge processor. It has a negligible overhead of less than 1\% hardware resource consumption. The profiler power prediction was demonstrated for various application benchmarks from SPEC to PARSEC with less than 4\% error. I also demonstrate the importance of the proposed profiler for emerging architectural techniques through use case scenarios, which include heterogeneous computing and fine grained per-core DVFS. Along with architectural enhancement in general purpose processors to improve energy efficiency, hardware accelerators like Coarse Grain reconfigurable architecture (CGRA) are gaining popularity. Unlike vector processors, which rely on data parallelism, CGRA can provide greater flexibility and compiler level control making it more suitable for present SoC environment. To provide streamline development environment for CGRA, I propose a flexible framework in Linux to do design space exploration for CGRA. With accurate and flexible hardware models, fine grained integration with accurate architectural simulator, and Linux memory management and DMA support, a user can carry out limitless experiments on CGRA in full system environment.
Date Created
2013
Agent

Compilation of stream programs onto embedded multicore architectures

151200-Thumbnail Image.png
Description
In recent years, we have observed the prevalence of stream applications in many embedded domains. Stream programs distinguish themselves from traditional sequential programming languages through well defined independent actors, explicit data communication, and stable code/data access patterns. In order to

In recent years, we have observed the prevalence of stream applications in many embedded domains. Stream programs distinguish themselves from traditional sequential programming languages through well defined independent actors, explicit data communication, and stable code/data access patterns. In order to achieve high performance and low power, scratch pad memory (SPM) has been introduced in today's embedded multicore processors. Current design frameworks for developing stream applications on SPM enhanced embedded architectures typically do not include a compiler that can perform automatic partitioning, mapping and scheduling under limited on-chip SPM capacities and memory access delays. Consequently, many designs are implemented manually, which leads to lengthy tasks and inferior designs. In this work, optimization techniques that automatically compile stream programs onto embedded multi-core architectures are proposed. As an initial case study, we implemented an automatic target recognition (ATR) algorithm on the IBM Cell Broadband Engine (BE). Then integer linear programming (ILP) and heuristic approaches were proposed to schedule stream programs on a single core embedded processor that has an SPM with code overlay. Later, ILP and heuristic approaches for Compiling Stream programs on SPM enhanced Multicore Processors (CSMP) were studied. The proposed CSMP ILP and heuristic approaches do not optimize for cycles in stream applications. Further, the number of software pipeline stages in the implementation is dependent on actor to processing engine (PE) mapping and is uncontrollable. We next presented a Retiming technique for Throughput optimization on Embedded Multi-core processors (RTEM). RTEM approach inherently handles cycles and can accept an upper bound on the number of software pipeline stages to be generated. We further enhanced RTEM by incorporating unrolling (URSTEM) that preserves all the beneficial properties of RTEM heuristic and also scales with the number of PEs through unrolling.
Date Created
2012
Agent

Threshold logic properties and methods: applications to post-CMOS design automation and gene regulation modeling

150901-Thumbnail Image.png
Description
Threshold logic has been studied by at least two independent group of researchers. One group of researchers studied threshold logic with the intention of building threshold logic circuits. The earliest research to this end was done in the 1960's. The

Threshold logic has been studied by at least two independent group of researchers. One group of researchers studied threshold logic with the intention of building threshold logic circuits. The earliest research to this end was done in the 1960's. The major work at that time focused on studying mathematical properties of threshold logic as no efficient circuit implementations of threshold logic were available. Recently many post-CMOS (Complimentary Metal Oxide Semiconductor) technologies that implement threshold logic have been proposed along with efficient CMOS implementations. This has renewed the effort to develop efficient threshold logic design automation techniques. This work contributes to this ongoing effort. Another group studying threshold logic did so, because the building block of neural networks - the Perceptron, is identical to the threshold element implementing a threshold function. Neural networks are used for various purposes as data classifiers. This work contributes tangentially to this field by proposing new methods and techniques to study and analyze functions implemented by a Perceptron After completion of the Human Genome Project, it has become evident that most biological phenomenon is not caused by the action of single genes, but due to the complex interaction involving a system of genes. In recent times, the `systems approach' for the study of gene systems is gaining popularity. Many different theories from mathematics and computer science has been used for this purpose. Among the systems approaches, the Boolean logic gene model has emerged as the current most popular discrete gene model. This work proposes a new gene model based on threshold logic functions (which are a subset of Boolean logic functions). The biological relevance and utility of this model is argued illustrated by using it to model different in-vivo as well as in-silico gene systems.
Date Created
2012
Agent

Smart compilers for reliable and power-efficient embedded computing

150743-Thumbnail Image.png
Description
Thanks to continuous technology scaling, intelligent, fast and smaller digital systems are now available at affordable costs. As a result, digital systems have found use in a wide range of application areas that were not even imagined before, including medical

Thanks to continuous technology scaling, intelligent, fast and smaller digital systems are now available at affordable costs. As a result, digital systems have found use in a wide range of application areas that were not even imagined before, including medical (e.g., MRI, remote or post-operative monitoring devices, etc.), automotive (e.g., adaptive cruise control, anti-lock brakes, etc.), security systems (e.g., residential security gateways, surveillance devices, etc.), and in- and out-of-body sensing (e.g., capsule swallowed by patients measuring digestive system pH, heart monitors, etc.). Such computing systems, which are completely embedded within the application, are called embedded systems, as opposed to general purpose computing systems. In the design of such embedded systems, power consumption and reliability are indispensable system requirements. In battery operated portable devices, the battery is the single largest factor contributing to device cost, weight, recharging time, frequency and ultimately its usability. For example, in the Apple iPhone 4 smart-phone, the battery is $40\%$ of the device weight, occupies $36\%$ of its volume and allows only $7$ hours (over 3G) of talk time. As embedded systems find use in a range of sensitive applications, from bio-medical applications to safety and security systems, the reliability of the computations performed becomes a crucial factor. At our current technology-node, portable embedded systems are prone to expect failures due to soft errors at the rate of once-per-year; but with aggressive technology scaling, the rate is predicted to increase exponentially to once-per-hour. Over the years, researchers have been successful in developing techniques, implemented at different layers of the design-spectrum, to improve system power efficiency and reliability. Among the layers of design abstraction, I observe that the interface between the compiler and processor micro-architecture possesses a unique potential for efficient design optimizations. A compiler designer is able to observe and analyze the application software at a finer granularity; while the processor architect analyzes the system output (power, performance, etc.) for each executed instruction. At the compiler micro-architecture interface, if the system knowledge at the two design layers can be integrated, design optimizations at the two layers can be modified to efficiently utilize available resources and thereby achieve appreciable system-level benefits. To this effect, the thesis statement is that, ``by merging system design information at the compiler and micro-architecture design layers, smart compilers can be developed, that achieve reliable and power-efficient embedded computing through: i) Pure compiler techniques, ii) Hybrid compiler micro-architecture techniques, and iii) Compiler-aware architectures''. In this dissertation demonstrates, through contributions in each of the three compiler-based techniques, the effectiveness of smart compilers in achieving power-efficiency and reliability in embedded systems.
Date Created
2012
Agent

A structured design methodology for high performance VLSI arrays

150703-Thumbnail Image.png
Description
The geometric growth in the integrated circuit technology due to transistor scaling also with system-on-chip design strategy, the complexity of the integrated circuit has increased manifold. Short time to market with high reliability and performance is one of the most

The geometric growth in the integrated circuit technology due to transistor scaling also with system-on-chip design strategy, the complexity of the integrated circuit has increased manifold. Short time to market with high reliability and performance is one of the most competitive challenges. Both custom and ASIC design methodologies have evolved over the time to cope with this but the high manual labor in custom and statistic design in ASIC are still causes of concern. This work proposes a new circuit design strategy that focuses mostly on arrayed structures like TLB, RF, Cache, IPCAM etc. that reduces the manual effort to a great extent and also makes the design regular, repetitive still achieving high performance. The method proposes making the complete design custom schematic but using the standard cells. This requires adding some custom cells to the already exhaustive library to optimize the design for performance. Once schematic is finalized, the designer places these standard cells in a spreadsheet, placing closely the cells in the critical paths. A Perl script then generates Cadence Encounter compatible placement file. The design is then routed in Encounter. Since designer is the best judge of the circuit architecture, placement by the designer will allow achieve most optimal design. Several designs like IPCAM, issue logic, TLB, RF and Cache designs were carried out and the performance were compared against the fully custom and ASIC flow. The TLB, RF and Cache were the part of the HEMES microprocessor.
Date Created
2012
Agent

System level power and thermal management on embedded processors

150660-Thumbnail Image.png
Description
Semiconductor scaling technology has led to a sharp growth in transistor counts. This has resulted in an exponential increase on both power dissipation and heat flux (or power density) in modern microprocessors. These microprocessors are integrated as the major components

Semiconductor scaling technology has led to a sharp growth in transistor counts. This has resulted in an exponential increase on both power dissipation and heat flux (or power density) in modern microprocessors. These microprocessors are integrated as the major components in many modern embedded devices, which offer richer features and attain higher performance than ever before. Therefore, power and thermal management have become the significant design considerations for modern embedded devices. Dynamic voltage/frequency scaling (DVFS) and dynamic power management (DPM) are two well-known hardware capabilities offered by modern embedded processors. However, the power or thermal aware performance optimization is not fully explored for the mainstream embedded processors with discrete DVFS and DPM capabilities. Many key problems have not been answered yet. What is the maximum performance that an embedded processor can achieve under power or thermal constraint for a periodic application? Does there exist an efficient algorithm for the power or thermal management problems with guaranteed quality bound? These questions are hard to be answered because the discrete settings of DVFS and DPM enhance the complexity of many power and thermal management problems, which are generally NP-hard. The dissertation presents a comprehensive study on these NP-hard power and thermal management problems for embedded processors with discrete DVFS and DPM capabilities. In the domain of power management, the dissertation addresses the power minimization problem for real-time schedules, the energy-constrained make-span minimization problem on homogeneous and heterogeneous chip multiprocessors (CMP) architectures, and the battery aware energy management problem with nonlinear battery discharging model. In the domain of thermal management, the work addresses several thermal-constrained performance maximization problems for periodic embedded applications. All the addressed problems are proved to be NP-hard or strongly NP-hard in the study. Then the work focuses on the design of the off-line optimal or polynomial time approximation algorithms as solutions in the problem design space. Several addressed NP-hard problems are tackled by dynamic programming with optimal solutions and pseudo-polynomial run time complexity. Because the optimal algorithms are not efficient in worst case, the fully polynomial time approximation algorithms are provided as more efficient solutions. Some efficient heuristic algorithms are also presented as solutions to several addressed problems. The comprehensive study answers the key questions in order to fully explore the power and thermal management potentials on embedded processors with discrete DVFS and DPM capabilities. The provided solutions enable the theoretical analysis of the maximum performance for periodic embedded applications under power or thermal constraints.
Date Created
2012
Agent

Energy management in solar powered wireless sensor networks

150486-Thumbnail Image.png
Description
The use of energy-harvesting in a wireless sensor network (WSN) is essential for situations where it is either difficult or not cost effective to access the network's nodes to replace the batteries. In this paper, the problems involved in controlling

The use of energy-harvesting in a wireless sensor network (WSN) is essential for situations where it is either difficult or not cost effective to access the network's nodes to replace the batteries. In this paper, the problems involved in controlling an active sensor network that is powered both by batteries and solar energy are investigated. The objective is to develop control strategies to maximize the quality of coverage (QoC), which is defined as the minimum number of targets that must be covered and reported over a 24 hour period. Assuming a time varying solar profile, the problem is to optimally control the sensing range of each sensor so as to maximize the QoC while maintaining connectivity throughout the network. Implicit in the solution is the dynamic allocation of solar energy during the day to sensing and to recharging the battery so that a minimum coverage is guaranteed even during the night, when only the batteries can supply energy to the sensors. This problem turns out to be a non-linear optimal control problem of high complexity. Based on novel and useful observations, a method is presented to solve it as a series of quasiconvex (unimodal) optimization problems which not only ensures a maximum QoC, but also maintains connectivity throughout the network. The runtime of the proposed solution is 60X less than a naive but optimal method which is based on dynamic programming, while the peak error of the solution is less than 8%. Unlike the dynamic programming method, the proposed method is scalable to large networks consisting of hundreds of sensors and targets. The solution method enables a designer to explore the optimal configuration of network design. This paper offers many insights in the design of energy-harvesting networks, which result in minimum network setup cost through determination of optimal configuration of number of sensors, sensing beam width, and the sampling time.
Date Created
2012
Agent

A New RNS 4-moduli set for the implementation of FIR filters

149788-Thumbnail Image.png
Description
Residue number systems have gained significant importance in the field of high-speed digital signal processing due to their carry-free nature and speed-up provided by parallelism. The critical aspect in the application of RNS is the selection of the moduli set

Residue number systems have gained significant importance in the field of high-speed digital signal processing due to their carry-free nature and speed-up provided by parallelism. The critical aspect in the application of RNS is the selection of the moduli set and the design of the conversion units. There have been several RNS moduli sets proposed for the implementation of digital filters. However, some are unbalanced and some do not provide the required dynamic range. This thesis addresses the drawbacks of existing RNS moduli sets and proposes a new moduli set for efficient implementation of FIR filters. An efficient VLSI implementation model has been derived for the design of a reverse converter from RNS to the conventional two's complement representation. This model facilitates the realization of a reverse converter for better performance with less hardware complexity when compared with the reverse converter designs of the existing balanced 4-moduli sets. Experimental results comparing multiply and accumulate units using RNS that are implemented using the proposed four-moduli set with the state-of-the-art balanced four-moduli sets, show large improvements in area (46%) and power (43%) reduction for various dynamic ranges. RNS FIR filters using the proposed moduli-set and existing balanced 4-moduli set are implemented in RTL and compared for chip area and power and observed 20% improvements. This thesis also presents threshold logic implementation of the reverse converter.
Date Created
2011
Agent

Modeling and implementation of threshold logic circuits and architectures

149343-Thumbnail Image.png
Description
Threshold logic has long been studied as a means of achieving higher performance and lower power dissipation, providing improvements by condensing simple logic gates into more complex primitives, effectively reducing gate count, pipeline depth, and number of interconnects. This work

Threshold logic has long been studied as a means of achieving higher performance and lower power dissipation, providing improvements by condensing simple logic gates into more complex primitives, effectively reducing gate count, pipeline depth, and number of interconnects. This work proposes a new physical implementation of threshold logic, the threshold logic latch (TLL), which overcomes the difficulties observed in previous work, particularly with respect to gate reliability in the presence of noise and process variations. Simple but effective models were created to assess the delay, power, and noise margin of TLL gates for the purpose of determining the physical parameters and assignment of input signals that achieves the lowest delay subject to constraints on power and reliability. From these models, an optimized library of standard TLL cells was developed to supplement a commercial library of static CMOS gates. The new cells were then demonstrated on a number of automatically synthesized, placed, and routed designs. A two-stage 2's complement integer multiplier designed with CMOS and TLL gates utilized 19.5% less area, 28.0% less active power, and 61.5% less leakage power than an equivalent design with the same performance using only static CMOS gates. Additionally, a two-stage 32-instruction 4-way issue queue designed with CMOS and TLL gates utilized 30.6% less area, 31.0% less active power, and 58.9% less leakage power than an equivalent design with the same performance using only static CMOS gates.
Date Created
2010
Agent