Analyzing and Improving the Reliability of Matrix Multiplication and Neural Networks on FPGAs

Description
Neural networks are increasingly becoming attractive solutions for automated systems within automotive, aerospace, and military industries. Since many applications in such fields are both real-time and safety-critical, strict performance and reliability constraints must be considered. To achieve high performance, specialized architectures are required. Given that over 90% of the workload in modern neural network topologies is dominated by matrix multiplication, accelerating said algorithm becomes of paramount importance. Modern neural network accelerators, such as Xilinx's Deep Processing Unit (DPU), adopt efficient systolic-like architectures. Thanks to their high degree of parallelism and design flexibility, Field-Programmable Gate Arrays (FPGAs) are among the most promising devices for speeding up matrix multiplication and neural network computation. However, SRAM-based FPGAs are also known to suffer from radiation-induced upsets in their configuration memories. To achieve high reliability, hardening strategies must be put in place. Traditional modular redundancy of inherently expensive modules, however, is not always feasible due to limited resource availability on target devices. Therefore, more efficient and cleverly designed hardening methods become a necessity. For instance, Algorithm-Based Fault-Tolerance (ABFT) exploits algorithm characteristics to deliver error detection/correction capabilities at significantly lower costs. First, experimental results with Xilinx's DPU indicate that failure rates can be over twice as high as the limits specified for terrestrial applications. In other words, the undeniable need for hardening in the state-of-the-art neural network accelerator for FPGAs is demonstrated.
Later, an extensive multi-level fault propagation analysis is presented, and an ultra-low-cost algorithm-based error detection strategy for matrix multiplication is proposed. By considering the specifics of the FPGA fault model, this novel hardening method decreases implementation costs by over a polynomial degree when compared to state-of-the-art solutions. A corresponding architectural implementation is suggested, incurring area and energy overheads lower than 1% for the vast majority of systolic array dimensions. Finally, the impact of fundamental design decisions, such as data precision in processing elements and overall degree of parallelism, on the reliability of hypothetical neural network accelerators is experimentally investigated. A novel way of predicting the compound failure rate of inherently inaccurate algorithms/applications in the presence of radiation is also provided.
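The checksum idea behind ABFT for matrix multiplication can be sketched as follows. This is a minimal illustration of the classical column-checksum check, assuming small integer matrices stored as Python lists; it is not the ultra-low-cost detection method proposed in the thesis, and the function names `matmul` and `column_checksum_ok` are hypothetical.

```python
def matmul(a, b):
    """Plain triple-loop product of two matrices given as lists of rows."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]


def column_checksum_ok(a, b, c):
    """Return True if the column sums of c equal (column sums of a) times b.

    For an error-free product c = a * b, summing the rows of c gives the same
    vector as multiplying the summed rows of a by b. A mismatch flags an error
    without recomputing the whole product.
    """
    inner, cols = len(b), len(b[0])
    a_checksum = [sum(row[k] for row in a) for k in range(inner)]
    expected = [sum(a_checksum[k] * b[k][j] for k in range(inner))
                for j in range(cols)]
    actual = [sum(row[j] for row in c) for j in range(cols)]
    return expected == actual


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    c = matmul(a, b)
    print(column_checksum_ok(a, b, c))   # fault-free product passes the check
    c[0][1] += 1                         # emulate a single corrupted output
    print(column_checksum_ok(a, b, c))   # the corrupted product is flagged
```

The check costs one extra vector-matrix product and one column sum, which is why checksum-style detection is generally far cheaper than duplicating or triplicating the multiplier.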
Date Created
2021
Agent

Pre-Silicon Analysis of a Single Event Transient Pulse Measurement Test Structure in a FinFET Process

Description
A Single Event Transient (SET) is a transient voltage pulse induced by an ionizing radiation particle striking a combinational logic node in a circuit. The probability of a storage element capturing the transient pulse depends on the width of the pulse. Measuring the rate of occurrence and the distribution of SET pulse widths is essential to understand the likelihood of soft errors and to develop cost-effective mitigation schemes. Existing research measures the pulse width of SETs in bulk Complementary Metal-Oxide-Semiconductor (CMOS) and Silicon On Insulator (SOI) technologies, but not in Fin Field-Effect Transistor (FinFET) technologies. This thesis focuses on developing a test structure in a FinFET process to generate, propagate, and separate SETs, and on building a time-to-digital converter to measure the SET pulse width.

The proposed SET test structure statistically separates SETs generated at NMOS and PMOS devices based on the difference in restoring current. It consists of N-collection devices to collect events at NMOS and P-collection devices to collect events at PMOS. The events that occur in the PMOS of an N-collection device and the NMOS of a P-collection device are false events. The logic gates of the collection devices are skewed to perform pulse expansion so that a minimally sustained SET propagates without being suppressed by the contamination delay. A symmetric tree structure with an S-R latch event detector localizes the SET. A pulse injection structure based on Cartesian coordinates injects external pulses at specific nodes to perform instrumentation and calibrate the measurement. A thermometer-encoded chain (vernier chain) with mismatched delay paths measures the width of the SET.

For low Linear Energy Transfer (LET) tests, the false events are entirely masked and do not propagate, since the amount of charge that must be deposited for successful event propagation is significantly high. In high LET tests, both the actual events and the false events propagate, but they can be separated based on the SET location and the width of the output event. The vernier chain has a high measurement resolution of ~3.5 ps, which aids in separating the events.
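How a vernier chain with mismatched delay paths turns a pulse width into a thermometer code can be sketched with an idealized timing model. This is only an illustration: the per-stage delays of 23.5 ps and 20.0 ps are hypothetical values chosen so that their difference matches the ~3.5 ps resolution quoted above, and the function names are assumptions, not the thesis's circuit.

```python
def vernier_thermometer_code(pulse_width_ps, slow_delay_ps, fast_delay_ps, stages):
    """Return the thermometer code captured by an idealized vernier chain.

    The leading edge of the pulse travels down the slow path and the trailing
    edge down the fast path. Each stage latches 1 while the leading edge still
    arrives first, so the trailing edge closes the gap by
    (slow_delay_ps - fast_delay_ps) per stage.
    """
    code = []
    for stage in range(1, stages + 1):
        leading_edge_arrival = stage * slow_delay_ps
        trailing_edge_arrival = pulse_width_ps + stage * fast_delay_ps
        code.append(1 if leading_edge_arrival < trailing_edge_arrival else 0)
    return code


def width_bounds_ps(code, slow_delay_ps, fast_delay_ps):
    """Bracket the pulse width implied by the number of ones in the code."""
    resolution = slow_delay_ps - fast_delay_ps
    ones = sum(code)
    return ones * resolution, (ones + 1) * resolution


if __name__ == "__main__":
    # Hypothetical stage delays; only their 3.5 ps difference matters here.
    code = vernier_thermometer_code(20.0, 23.5, 20.0, stages=8)
    print(code, width_bounds_ps(code, 23.5, 20.0))
```

In this model the measurement resolution is the difference between the two stage delays rather than either delay itself, which is what lets a vernier chain resolve widths well below a single gate delay.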
Date Created
2020
Agent

Software Techniques For Dependable Execution

Description
Advances in semiconductor technology have brought computer-based systems into virtually all aspects of human life. This unprecedented integration of semiconductor-based systems in our lives has significantly increased the domain and the number of safety-critical applications, that is, applications with unacceptable consequences of failure. Software-level error resilience schemes are attractive because they can provide commercial-off-the-shelf microprocessors with adaptive and scalable reliability.

Among all software-level error resilience solutions, in-application instruction-replication-based approaches have been widely used and are deemed to be the most effective. However, existing instruction-based replication schemes protect only part of the computation, i.e., arithmetic and logical instructions, and leave the rest unprotected. To improve the efficacy of instruction-level redundancy-based approaches, we developed several error detection and error correction schemes. nZDC (near Zero silent Data Corruption) is an instruction duplication scheme that protects the execution of the whole application. Rather than detecting errors on the register operands of memory and control flow operations, nZDC checks the results of such operations. nZDC ensures the correct execution of a memory write instruction by reloading the stored value and checking it against the redundantly computed value. nZDC also introduces a novel control flow checking mechanism that replicates compare and branch instructions and detects both wrong-direction branches and unwanted jumps. Fault injection experiments show that nZDC can improve the error coverage of state-of-the-art schemes by more than 10x without incurring any additional performance penalty.

Furthermore, we introduced two error recovery solutions. InCheck is our backward recovery solution, which makes lightweight error-free checkpoints at basic block granularity. In the case of an error, InCheck reverts program execution to the beginning of the last executed basic block and resumes execution with the aid of the preserved information. NEMESIS is our forward recovery scheme, which runs three versions of the computation and detects errors by checking the results of all memory write and branch operations. In the case of a mismatch, the NEMESIS diagnosis routine decides whether the error is recoverable. If it is, the NEMESIS recovery routine reverts the effect of the error from the program state and resumes normal program execution from the error detection point.
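The store-checking step described for nZDC (write, reload, compare against the redundantly computed value) can be modeled at a high level as follows. This is a conceptual sketch only: nZDC operates on machine instructions inserted by the compiler, whereas this example uses a Python dictionary as a stand-in for memory, and the names `checked_store` and `SilentDataCorruptionDetected` are hypothetical.

```python
class SilentDataCorruptionDetected(Exception):
    """Raised when a reloaded value disagrees with the redundant computation."""


def checked_store(memory, address, value, shadow_value):
    """Store value, reload it, and compare it with the shadow copy.

    value comes from the main instruction stream and shadow_value from the
    duplicated stream. Checking the reloaded result covers errors in the
    computed value as well as in the store itself.
    """
    memory[address] = value        # store performed by the main stream
    reloaded = memory[address]     # reload what actually landed in memory
    if reloaded != shadow_value:   # compare with the redundant computation
        raise SilentDataCorruptionDetected(
            f"address {address:#x}: stored {reloaded}, expected {shadow_value}")


if __name__ == "__main__":
    memory = {}
    checked_store(memory, 0x10, 6 * 7, 6 * 7)           # both streams agree
    try:
        checked_store(memory, 0x14, (6 * 7) ^ 1, 6 * 7)  # emulated bit flip
    except SilentDataCorruptionDetected as error:
        print("detected:", error)
```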
Date Created
2018
Agent

VLIW Remotely Reconfigurable DSP Element

Description
The purpose of the Very Long Instruction Word (VLIW) Remotely Reconfigurable DSP Element project is to use VLIW as a design process, to design the hardware components of a reconfigurable DSP element, and to ascertain the overall length of the Very Long Instruction Word. This project focuses solely on hardware components designed by hand to specifications set by General Dynamics Mission Systems and, using those designs, on finding the overall length of the VLIW for use in future work. To design each of the elements, General Dynamics specified several requirements. Each element was then designed individually according to those requirements. After the initial design, each element was sent back to General Dynamics for a design review, and after revision, all parts were linked together for an overall calculation of the length of the VLIW. The VLIW reconfigurable DSP element is not a new concept, but a proof of concept has yet to be published. Future work includes a software proof of concept (done by the ASU Capstone team), followed by further development by General Dynamics. Should they choose to continue with this project, they will continue testing on FPGA boards and perhaps pursue development into an ASIC. Overall, General Dynamics proposed this project with deep space payloads in mind, for which it has the most applications.
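The final step described above, computing the overall VLIW length once every element is designed, amounts to summing the control bits each element needs. The sketch below assumes that the instruction word is a simple concatenation of per-element control fields; the element names and bit widths are hypothetical placeholders and are not taken from the project.

```python
def vliw_length_bits(field_widths):
    """Return the total instruction word length for concatenated fields."""
    for name, width in field_widths.items():
        if width <= 0:
            raise ValueError(f"field {name!r} must have a positive width")
    return sum(field_widths.values())


if __name__ == "__main__":
    # Hypothetical elements and widths, for illustration only.
    example_fields = {
        "multiplier_control": 4,
        "adder_control": 3,
        "register_select": 10,
        "memory_address": 16,
    }
    print(vliw_length_bits(example_fields))
```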
Date Created
2016-12
Agent

6T-SRAM 1Mb design with test structures and post silicon validation

Description
Static random-access memories (SRAMs) are an integral part of design systems, serving as caches and data memories, and occupy one-third of the design space. This work presents an embedded low-power SRAM on a triple-well process that allows body-biasing control. In addition to normal-mode operation, the design includes Physical Unclonable Function (PUF) [Suh07] and Sense Amplifier Test (SA Test) modes. With the PUF mode structures, the fabrication and environmental mismatches in the bit cells are used to generate unique identification bits. These bits are fixed and are known as the preferred state of an SRAM bit cell. The direct-access test structure is a measurement unit for offset voltage analysis of the sense amplifiers. These designs are manufactured using a foundry bulk CMOS 55 nm low-power (LP) process. The SRAM bit-cell and peripheral circuit design is discussed in detail; for certain cases, circuit simulation analysis is performed with random variations embedded in the SPICE models. Further, post-silicon testing results are discussed for normal operation of the SRAMs and for the special test modes. The silicon and circuit simulation results for various tests are presented.
Date Created
2017
Agent

Radiation effects measurement test structure using GF 32-nm SOI process

Description
This thesis describes the design of a Single Event Transient (SET) duration measurement test structure on the Global Foundries (previously IBM) 32-nm silicon-on-insulator (SOI) process. The test structure is designed for portability and allows quick design and implementation on a new process node. Such a test structure is critical in analyzing the effects of radiation on complementary metal-oxide-semiconductor (CMOS) circuits. The focus of this thesis is to study the change in pulse width during propagation of a SET pulse and to build a test structure that measures the duration of a SET pulse generated in real time. This test structure can estimate the SET pulse duration with 10 ps resolution. It receives the input SET propagated through a SET capture structure made using a chain of combinational gates. The impact of propagating the SET through a >200-deep collection structure is studied. A novel methodology of deploying a Thick Gate TID structure is proposed and analyzed to build a multi-stage chain of combinational gates. For a long chain of combinational gates, the most critical issue, pulse width broadening and shortening, is analyzed across critical process corners. The impact of using regular standard cells on pulse width modification is compared with that of NMOS- and/or PMOS-skewed gates in the chain of combinational gates. A possible resolution to the pulse width change is demonstrated using the circuit and layout design of a chain of inverters and two- and three-input NOR gates. The SET capture circuit is also tested in simulation by introducing a glitch signal that mimics an individual ion strike that could perturb SET propagation. Design techniques and skewed gates are deployed to dampen the glitch that occurs under the effect of radiation. Simulation results and layout structures of the SET capture circuit and the chain of combinational gates are presented.
Date Created
2017
Agent

Electrostatic Analysis of Gate All Around (GAA) Nanowire over FinFET

Description
CMOS technology has been scaled down to 7 nm, with FinFETs replacing planar MOSFET devices. Due to short channel effects, the FinFET structure was developed to provide better electrostatic control of subthreshold leakage and saturation current than planar MOSFETs while delivering the desired current drive. The FinFET structure has an undoped or fully depleted fin, which supports immunity from random dopant fluctuations (RDF), a phenomenon that is prominent at sub-50 nm technology nodes because of the smaller number of dopant atoms and that causes threshold voltage (Vth) roll-off by reducing Vth. However, as advanced CMOS technologies shrink to the 5 nm technology node, subthreshold leakage and drain-induced barrier lowering (DIBL) are driving the introduction of new metal-oxide-semiconductor field-effect transistor (MOSFET) structures to improve performance. Gate-all-around (GAA) field-effect transistors have been shown to be potential candidates for these advanced nodes. In nanowire devices, because the gate is present on all sides of the channel, DIBL should be lower than in FinFETs.

A 3-D technology computer-aided design (TCAD) device simulation is performed to compare the performance of FinFET and GAA nanowire structures with vertically stacked horizontal nanowires. Subthreshold slope, DIBL, and saturation current are measured and compared between these devices. The FinFET device performance has been matched to the ASAP7 compact model, including the impact of tensile and compressive strain on NMOS and PMOS, respectively. The metal work function is adjusted for the desired current drive. The nanowires show better electrostatic performance than FinFETs, with excellent improvement in DIBL and subthreshold slope. This proves that horizontal nanowires can be a potential candidate for the 5 nm technology node. A GAA nanowire structure for the 5 nm technology node is characterized with a gate length of 15 nm. The structure is scaled down from the 7 nm node to the 5 nm node by using a scaling factor of 0.7.
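The two electrostatic figures of merit compared above can be extracted from simulated I-V data with the standard textbook definitions, sketched below. The threshold voltages, drain biases, and current values in the example are hypothetical placeholders and are not results from the thesis; the function names are likewise assumptions.

```python
import math


def dibl_mv_per_v(vth_linear, vth_saturation, vds_linear, vds_saturation):
    """DIBL: threshold voltage shift per volt of drain bias, in mV/V."""
    return 1000.0 * (vth_linear - vth_saturation) / (vds_saturation - vds_linear)


def subthreshold_slope_mv_per_decade(vgs_low, id_low, vgs_high, id_high):
    """Gate voltage swing needed for a tenfold current change, in mV/decade."""
    decades = math.log10(id_high / id_low)
    return 1000.0 * (vgs_high - vgs_low) / decades


if __name__ == "__main__":
    # Hypothetical values: Vth of 0.30 V at Vds = 0.05 V and 0.27 V at Vds = 0.70 V.
    print(dibl_mv_per_v(0.30, 0.27, 0.05, 0.70))
    # Hypothetical subthreshold points: 1 nA at Vgs = 0.00 V, 10 nA at Vgs = 0.07 V.
    print(subthreshold_slope_mv_per_decade(0.00, 1e-9, 0.07, 1e-8))
```

Lower values of both metrics indicate better gate control over the channel, which is the sense in which the nanowires are reported to outperform the FinFETs.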
Date Created
2017
Agent

Automated place and route methodologies for multi-project test chips

Description
This work describes the development of automated flows to generate pad rings, mixed-signal power grids, and mega cells in a multi-project test chip. Three major design flows were created to build the test chip. The first was the pad ring flow, which was used as the starting block for creating the test chip. This flow placed all of the signals for the chip in the desired order along the outside of the die and created the power ring that supplies the chip with a robust power source.

The second flow was used to put together a flash block based on a Xilinx XCFxxP. This flow was somewhat similar to the pad ring flow, except that optimizations and a clock tree were added. A couple of design redos were needed due to timing and orientation constraints.

Finally, the last flow was the top-level flow, in which all of the components are combined to create a finished test chip ready for fabrication. The main components were the finished flash block, HERMES, test structures, and a clock instance, along with the pad ring flow for the creation of the pad ring and power ring.

Also discussed is some work done on a previous multi-project test chip: the creation of power gaters that were used as switches to turn the power on and off for some flash modules. To control the power gaters, the functionality of some pad drivers was changed so that they output a higher voltage than that seen in the core of the chip.
Date Created
2015
Agent

Methodical design approaches to multiple node collection robustness for flip-flop soft error mitigation

Description
The space environment comprises cosmic ray particles, heavy ions and high energy electrons and protons. Microelectronic circuits used in space applications such as satellites and space stations are prone to upsets induced by these particles. With transistor dimensions shrinking due to continued scaling, terrestrial integrated circuits are also increasingly susceptible to radiation upsets. Hence radiation hardening is a requirement for microelectronic circuits used in both space and terrestrial applications.

This work begins by exploring the different radiation hardened flip-flops that have been proposed in the literature and classifies them based on the different hardening techniques.

A reduced-power delay element for the temporal hardening of sequential digital circuits is presented. The single event transient tolerance of the delay element is demonstrated by simulations using it in a radiation-hardened-by-design master-slave flip-flop (FF). Using the proposed delay element saves up to 25% of total FF power at a 50% activity factor. The delay element is used in the implementation of an 8-bit 8051 designed in the TSMC 130 nm bulk CMOS process.

A single impinging ionizing radiation particle is increasingly likely to upset multiple circuit nodes and produce logic transients that contribute to the soft error rate in most modern scaled process technologies. The design of flip-flops is made more difficult by increasing multi-node charge collection, which requires that charge storage and other sensitive nodes be separated so that one impinging radiation particle does not affect redundant nodes simultaneously. We describe a correct-by-construction design methodology to determine a priori which hardened FF nodes must be separated, as well as a general interleaving scheme to achieve this separation. We apply the methodology to radiation-hardened flip-flops and demonstrate optimal physical circuit organization for protection against multi-node charge collection.

Finally, the methodology is utilized to provide critical node separation for a new hardened flip-flop design that reduces the power and area by 31% and 35% respectively compared to a temporal FF with similar hardness. The hardness is verified and compared to other published designs via the proposed systematic simulation approach that comprehends multiple node charge collection and tests resiliency to upsets at all internal and input nodes. Comparison of the hardness, as measured by estimated upset cross-section, is made to other published designs. Additionally, the importance of specific circuit design aspects to achieving hardness is shown.
Date Created
2015
Agent

Radiation hardened pulse based D flip flop design

Description
The D flip-flop acts as a sequencing element in the design of any pipelined system. Radiation Hardening by Design (RHBD) allows hardened circuits to be fabricated on commercially available CMOS manufacturing processes. Recently, single event transients (SETs) have become as important as single event upsets (SEUs) in radiation-hardened high-speed digital designs. A novel temporal pulse-based RHBD flip-flop design is presented. Temporally delayed pulses produced by a radiation-hardened pulse generator sample the data in three redundant pulse latches. The proposed RHBD flip-flop has been statistically designed and fabricated on the 90 nm TSMC LP process. Detailed simulations of the flip-flop operation in both normal and radiation environments are presented. Spatial separation of critical nodes in the physical design of the flip-flop is carried out to mitigate multi-node charge collection upsets. The proposed flip-flop is also used in commercial CAD flows for high-performance chip designs: it is used in the design and auto-place-route (APR) of an advanced encryption system, and the resulting metrics are analyzed.
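The principle of temporal sampling into three redundant latches can be illustrated with a simple behavioral model. This sketch assumes that the three samples are combined by a majority vote, which the abstract does not state explicitly, and it uses hypothetical timing values; it is not a model of the fabricated circuit.

```python
def majority(a, b, c):
    """Bitwise majority of three sampled values."""
    return (a & b) | (b & c) | (a & c)


def temporally_sampled_value(data_at, first_sample_ps, delay_ps):
    """Sample the data input at three delayed instants and vote.

    data_at is a function returning the logic value (0 or 1) at a time in ps.
    A transient shorter than delay_ps can corrupt at most one sample.
    """
    samples = [data_at(first_sample_ps + i * delay_ps) for i in range(3)]
    return majority(*samples)


def glitchy_one(glitch_start_ps, glitch_end_ps):
    """Return a data waveform that is 1 except during a transient to 0."""
    return lambda t: 0 if glitch_start_ps <= t < glitch_end_ps else 1


if __name__ == "__main__":
    # Hypothetical timing: samples at 110, 160, and 210 ps.
    print(temporally_sampled_value(glitchy_one(100, 130), 110, 50))  # short SET
    print(temporally_sampled_value(glitchy_one(100, 170), 110, 50))  # long SET
```

In this model a 30 ps transient is filtered out because it overlaps only the first sample, while a 70 ps transient spans two samples and wins the vote, which shows why the sampling delay must exceed the longest SET the design is meant to tolerate.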
Date Created
2014
Agent