Analysis Methods for No-Confounding Screening Designs

Description

Nonregular designs are a preferable alternative to regular resolution four designs because they avoid complete confounding of two-factor interactions. As a result, nonregular designs can estimate and identify a few active two-factor interactions. However, because of the sometimes complex alias structure of nonregular designs, standard screening strategies can fail to identify all active effects. This research considers two-level nonregular screening designs with orthogonal main effects and, by utilizing knowledge of the alias structure, proposes a design-based model selection process for analyzing them.

The Aliased Informed Model Selection (AIMS) strategy is a design-specific approach that is compared to three generic model selection methods: stepwise regression, the least absolute shrinkage and selection operator (LASSO), and the Dantzig selector. The AIMS approach substantially increases the power to detect active main effects and two-factor interactions relative to these generic methodologies. This research identifies design-specific model spaces: sets of models that obey strong heredity, are all estimable, and exhibit no model confounding. These spaces are then used in the AIMS method, along with design-specific aliasing rules, to guide model selection decisions. Model spaces and alias rules are identified for three designs: the 16-run no-confounding 6-, 7-, and 8-factor designs. The designs are demonstrated with several examples as well as simulations that show the superiority of AIMS in model selection.
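
A minimal sketch of how a generic penalized-regression selector such as the LASSO can be applied to a two-level screening design with two-factor interaction columns is shown below; the design, the assumed true model, and the noise level are illustrative assumptions, not the 16-run no-confounding designs or the AIMS procedure developed in this work.

```python
# A minimal sketch of generic LASSO screening on a hypothetical two-level design;
# it is not the AIMS procedure or an actual no-confounding design from this work.
import numpy as np
from itertools import combinations
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(1)

n_runs, n_factors = 16, 6
X_main = rng.choice([-1.0, 1.0], size=(n_runs, n_factors))   # illustrative +/-1 design

# Append all two-factor interaction columns to the main-effect columns.
pairs = list(combinations(range(n_factors), 2))
X = np.column_stack([X_main] + [X_main[:, i] * X_main[:, j] for i, j in pairs])
labels = [f"x{i+1}" for i in range(n_factors)] + [f"x{i+1}:x{j+1}" for i, j in pairs]

# Assumed true model: two active main effects and one active two-factor interaction.
beta = np.zeros(X.shape[1])
beta[0], beta[2], beta[n_factors + pairs.index((0, 2))] = 3.0, 2.0, 1.5
y = X @ beta + rng.normal(scale=0.5, size=n_runs)

# Generic LASSO selection: effects with nonzero shrunken coefficients are declared active.
lasso = LassoCV(cv=4).fit(X, y)
active = [(name, round(coef, 2)) for name, coef in zip(labels, lasso.coef_) if abs(coef) > 1e-6]
print("Effects selected by LASSO:", active)
```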

A final piece of the research provides a method for augmenting no-confounding designs based on the model spaces and maximum average D-efficiency. Several augmented designs are provided for different situations. A final simulation with the augmented designs shows strong results for augmenting with four additional runs when time and resources permit.
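
As a rough illustration of augmentation guided by D-efficiency, the sketch below greedily adds the candidate run that most improves a per-run D-efficiency measure for an assumed main-effects-plus-interactions model; the base design, candidate set, and model are hypothetical stand-ins, not the model-space-averaged criterion developed here.

```python
# A minimal sketch of greedy augmentation by D-efficiency under illustrative assumptions.
import numpy as np
from itertools import combinations, product

def model_matrix(design):
    """Intercept + main effects + all two-factor interactions for a +/-1 design."""
    n, k = design.shape
    cols = [np.ones(n)] + [design[:, i] for i in range(k)]
    cols += [design[:, i] * design[:, j] for i, j in combinations(range(k), 2)]
    return np.column_stack(cols)

def d_efficiency(design):
    """Per-run scaled determinant of the information matrix (larger is better)."""
    X = model_matrix(design)
    n, p = X.shape
    det = np.linalg.det(X.T @ X)
    return 0.0 if det <= 0 else det ** (1.0 / p) / n

rng = np.random.default_rng(2)
base = rng.choice([-1.0, 1.0], size=(16, 4))                  # 4 factors keep the toy model estimable
candidates = np.array(list(product([-1.0, 1.0], repeat=4)))   # all possible two-level runs

# Greedily append 4 extra runs, each time taking the candidate with the best D-efficiency.
augmented = base.copy()
for _ in range(4):
    scores = [d_efficiency(np.vstack([augmented, c])) for c in candidates]
    augmented = np.vstack([augmented, candidates[int(np.argmax(scores))]])

print("D-efficiency before augmentation:", round(d_efficiency(base), 4))
print("D-efficiency after augmentation: ", round(d_efficiency(augmented), 4))
```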
Date Created
2020

Active Learning with Explore and Exploit Equilibriums

Description

In conventional supervised learning tasks, information retrieval from extensive collections of data happens automatically at low cost, whereas in many real-world problems obtaining labeled data can be hard, time-consuming, and expensive. Consider healthcare systems, for example, where unlabeled medical images are abundant while labeling requires a considerable amount of knowledge from experienced physicians. Active learning addresses this challenge with an iterative process that selects instances from the unlabeled data to annotate and improve the supervised learner. At each step, the query of examples to be labeled can be considered a dilemma between exploitation of the supervised learner's current knowledge and exploration of the unlabeled input features.
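
A minimal sketch of this dilemma, under illustrative assumptions (a synthetic dataset, a random forest learner, and a fixed mixing weight), scores each pool point by model uncertainty (exploitation) and by distance from the labeled set (exploration); it is not one of the strategies proposed in this dissertation.

```python
# A minimal sketch of a combined exploitation/exploration query score for pool-based
# active learning; the data, learner, and fixed weight are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
labeled = rng.choice(len(X), size=20, replace=False)          # small initial labeled set
pool = np.setdiff1d(np.arange(len(X)), labeled)               # unlabeled pool

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X[labeled], y[labeled])

# Exploitation: uncertainty of the current model (small margin between class probabilities).
proba = clf.predict_proba(X[pool])
exploit = 1.0 - np.abs(proba[:, 0] - proba[:, 1])

# Exploration: preference for pool points far from anything already labeled.
explore = pairwise_distances(X[pool], X[labeled]).min(axis=1)
explore = explore / explore.max()

# Query the batch that maximizes a weighted combination of the two criteria.
alpha = 0.5                                                   # illustrative fixed trade-off weight
scores = alpha * exploit + (1 - alpha) * explore
batch = pool[np.argsort(scores)[-10:]]
print("Indices queried for labeling:", batch)
```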

Motivated by the need for efficient active learning strategies, this dissertation proposes new algorithms for batch-mode, pool-based active learning. The research considers the following questions: How can unsupervised knowledge of the input features (exploration) improve learning when combined with supervised learning (exploitation)? How can exploration be characterized in active learning when the data are high-dimensional? Finally, how can the balance between exploration and exploitation be adapted over the course of learning?

The first contribution proposes a new active learning algorithm, Cluster-based Stochastic Query-by-Forest (CSQBF), which provides a batch-mode strategy that accelerates learning with added value from exploration and improved exploitation scores. CSQBF balances exploration and exploitation using a probabilistic scoring criterion based on classification probabilities from a tree-based ensemble model within each data cluster.
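
The general idea of cluster-based batch querying with tree-ensemble probabilities can be sketched as follows; this toy version simply queries the most uncertain forest prediction within each k-means cluster and should not be read as the CSQBF scoring criterion itself.

```python
# A minimal sketch of cluster-based batch querying with ensemble probabilities;
# an illustrative approximation, not the CSQBF criterion defined in the dissertation.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=1)
rng = np.random.default_rng(1)
labeled = rng.choice(len(X), size=20, replace=False)
pool = np.setdiff1d(np.arange(len(X)), labeled)

forest = RandomForestClassifier(n_estimators=200, random_state=1).fit(X[labeled], y[labeled])
clusters = KMeans(n_clusters=10, n_init=10, random_state=1).fit_predict(X[pool])

# Within each cluster, query the point whose ensemble class probability is closest to 0.5.
proba = forest.predict_proba(X[pool])[:, 1]
batch = []
for c in np.unique(clusters):
    members = np.where(clusters == c)[0]
    batch.append(pool[members[np.argmin(np.abs(proba[members] - 0.5))]])
print("One query per cluster:", batch)
```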

The second contribution introduces two more query strategies, Double Margin Active Learning (DMAL) and Cluster Agnostic Active Learning (CAAL), that combine consistent exploration and exploitation modules into a coherent and unified measure for label queries. Instead of assuming a fixed clustering structure, CAAL and DMAL adopt a soft-clustering strategy, which provides a new approach to formalizing exploration in active learning.

The third contribution addresses the challenge of dynamically balancing the exploration and exploitation criteria throughout the active learning process. Two adaptive algorithms are proposed based on feedback-driven bandit optimization frameworks, which handle this issue by learning the relationship between the exploration-exploitation trade-off and the active learner's performance.
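
A minimal sketch of the bandit viewpoint, with hypothetical arms (candidate trade-off weights), a simulated reward, and a simple epsilon-greedy rule standing in for the feedback-driven frameworks developed here:

```python
# A minimal sketch of adapting the exploration/exploitation weight with a bandit;
# arms, reward, and the epsilon-greedy rule are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(3)
arms = np.array([0.0, 0.25, 0.5, 0.75, 1.0])   # candidate exploration/exploitation weights
counts = np.zeros(len(arms))
values = np.zeros(len(arms))                   # running mean reward per arm
epsilon = 0.1

def observed_reward(weight):
    """Placeholder for the learner's accuracy gain after querying a batch with this weight."""
    return rng.normal(loc=1.0 - (weight - 0.6) ** 2, scale=0.05)

for _ in range(200):                           # one iteration per active-learning round
    if rng.random() < epsilon:
        arm = int(rng.integers(len(arms)))     # occasionally explore the arm space
    else:
        arm = int(np.argmax(values))           # otherwise exploit the best weight so far
    reward = observed_reward(arms[arm])        # in practice: measured learner improvement
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]

print("Estimated best trade-off weight:", arms[int(np.argmax(values))])
```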
Date Created
2020

Contributions to Optimal Experimental Design and Strategic Subdata Selection for Big Data

Description

In this dissertation two research questions in the field of applied experimental design were explored. First, methods for augmenting the three-level screening designs called Definitive Screening Designs (DSDs) were investigated. Second, schemes for strategic subdata selection for nonparametric predictive modeling with big data were developed.

Under sparsity, the structure of DSDs can allow for the screening and optimization of a system in one step, but in non-sparse situations estimation of second-order models requires augmentation of the DSD. In this work, augmentation strategies for DSDs were considered, given the assumption that the correct form of the model for the response of interest is quadratic. Series of augmented designs were constructed and explored, and power calculations, model-robustness criteria, model-discrimination criteria, and simulation study results were used to identify the number of augmented runs necessary for (1) effectively identifying active model effects, and (2) precisely predicting a response of interest. When the goal is identification of active effects, it is shown that supersaturated designs are sufficient; when the goal is prediction, it is shown that little is gained by augmenting beyond the design that is saturated for the full quadratic model. Surprisingly, augmentation strategies based on the I-optimality criterion do not lead to better predictions than strategies based on the D-optimality criterion.
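
As a rough illustration of the criteria involved, the sketch below scores two hypothetical 4-run augmentations of a stand-in three-level design under a full quadratic model, using the determinant of the information matrix (a D-type summary) and the average prediction variance over a grid (an I-type summary); the designs and candidates are illustrative assumptions, not the DSDs or augmentation series studied in this work.

```python
# A minimal sketch comparing candidate augmentations of a toy three-level design
# under a full quadratic model; not the actual DSD augmentation strategies studied here.
import numpy as np
from itertools import combinations, product

def quadratic_model_matrix(design):
    """Intercept, main effects, two-factor interactions, and pure quadratic terms."""
    n, k = design.shape
    cols = [np.ones(n)] + [design[:, i] for i in range(k)]
    cols += [design[:, i] * design[:, j] for i, j in combinations(range(k), 2)]
    cols += [design[:, i] ** 2 for i in range(k)]
    return np.column_stack(cols)

def d_criterion(design):
    X = quadratic_model_matrix(design)
    return np.linalg.det(X.T @ X)

def avg_prediction_variance(design, grid):
    X = quadratic_model_matrix(design)
    M_inv = np.linalg.pinv(X.T @ X)            # pseudo-inverse guards a singular toy design
    F = quadratic_model_matrix(grid)
    return float(np.mean(np.sum((F @ M_inv) * F, axis=1)))   # mean of f(x)' (X'X)^-1 f(x)

rng = np.random.default_rng(4)
k = 3
base = np.vstack([rng.choice([-1.0, 0.0, 1.0], size=(12, k)),
                  np.zeros((1, k))])                          # stand-in base design with a center run
grid = np.array(list(product([-1.0, 0.0, 1.0], repeat=k)))    # 3^k evaluation grid
cand_a = rng.choice([-1.0, 0.0, 1.0], size=(4, k))            # two competing 4-run augmentations
cand_b = rng.choice([-1.0, 0.0, 1.0], size=(4, k))

for name, cand in [("A", cand_a), ("B", cand_b)]:
    aug = np.vstack([base, cand])
    print(name, "det(X'X):", round(d_criterion(aug), 2),
          "avg prediction variance:", round(avg_prediction_variance(aug, grid), 3))
```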

Computational limitations can render standard statistical methods infeasible in the face of massive datasets, necessitating subsampling strategies. In the big data context, the primary objective is often prediction but the correct form of the model for the response of interest is likely unknown. Here, two new methods of subdata selection were proposed. The first is based on clustering, the second is based on space-filling designs, and both are free from model assumptions. The performance of the proposed methods was explored visually via low-dimensional simulated examples; via real data applications; and via large simulation studies. In all cases the proposed methods were compared to existing, widely used subdata selection methods. The conditions under which the proposed methods provide advantages over standard subdata selection strategies were identified.
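
A minimal sketch of the clustering-based idea, under illustrative assumptions (synthetic data, k-means with one representative per cluster), is shown below; it is not the specific subdata selection schemes proposed in the dissertation.

```python
# A minimal sketch of model-free, clustering-based subdata selection: cluster the full
# dataset and keep the point nearest each centroid. Data and subdata size are illustrative.
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import pairwise_distances_argmin_min

rng = np.random.default_rng(5)
X_full = rng.normal(size=(100_000, 8))        # stand-in for a massive predictor matrix
n_sub = 1_000                                 # size of the subdata to retain

km = MiniBatchKMeans(n_clusters=n_sub, n_init=3, random_state=5).fit(X_full)

# Keep the observation closest to each cluster centroid as the representative subdata point.
idx, _ = pairwise_distances_argmin_min(km.cluster_centers_, X_full)
subdata = X_full[np.unique(idx)]
print("Selected subdata shape:", subdata.shape)
```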
Date Created
2020

Reliability Assessment Methodologies for Photovoltaic Modules

Description

The main objective of this research is to develop reliability assessment methodologies that quantify the effect of various environmental factors on photovoltaic (PV) module performance degradation. Manufacturers of PV modules typically provide a warranty of about 25 years against 20% power degradation from the initial specified power rating. Accelerated Life Testing (ALT) plays an important role in quantifying the reliability of such modules, but several obstacles need to be tackled to conduct such experiments, since little historical field data is available. Even when some time-series data on maximum output power (Pmax) exist, they may not be sufficient for developing failure- or degradation-mode-specific accelerated tests. To study a specific failure mode, it is essential to use a failure-mode-specific performance variable (such as short-circuit current, open-circuit voltage, or fill factor) that is directly affected by that failure mode, rather than overall power, which is affected by one or more of these variables. To address these issues, this research is divided into three phases.

The first phase develops models to study climate-specific failure modes using failure-mode-specific parameters instead of power degradation. The limited field data, collected after a long exposure period (say, 18 to 21 years), are used to model the degradation rate, and the developed model is then calibrated to account for several unknown environmental effects using the available qualification testing data. The second phase presents a cumulative damage modeling method to quantify the effects of various environmental variables on the overall power production of the PV module. This cumulative degradation modeling approach is used to model the power degradation path and to quantify the effects of high-frequency, multivariate environmental inputs (such as temperature and humidity measured every minute or hour) on very sparse response data (power measurements taken quarterly or annually). The third phase develops an optimal planning and inference framework using an Iterative Accelerated Life Testing (I-ALT) methodology. All of the proposed methodologies are demonstrated and validated using appropriate case studies.
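
As a simple illustration of degradation-rate modeling from sparse measurements, the sketch below fits a linear degradation path to hypothetical normalized Pmax readings and projects the time to the 20% warranty threshold; the data and the linear form are assumptions for illustration only.

```python
# A minimal sketch of estimating a degradation rate from sparse performance readings
# and projecting the time to the 20% power-loss warranty threshold; values are illustrative.
import numpy as np

# Hypothetical maximum-power (Pmax) readings, normalized to the initial power rating.
years = np.array([0, 3, 6, 9, 12, 15, 18, 21], dtype=float)
pmax = np.array([1.000, 0.985, 0.973, 0.958, 0.944, 0.931, 0.917, 0.905])

# Fit a simple linear degradation path: Pmax(t) = intercept + rate * t.
rate, intercept = np.polyfit(years, pmax, deg=1)
years_to_warranty = (0.80 - intercept) / rate     # time until Pmax drops to 80% of rating

print(f"Estimated degradation rate: {abs(rate) * 100:.2f}% of rated power per year")
print(f"Projected years to 20% degradation: {years_to_warranty:.1f}")
```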
Date Created
2020

Multivariate Statistical Modeling and Analysis of Accelerated Degradation Testing Data for Reliability Prediction

Description

A degradation process, as a course of progressive deterioration, exists in many engineering systems. Since most failure mechanisms of these systems can be traced to the underlying degradation process, utilizing degradation data for reliability prediction is much needed. In industry, accelerated degradation tests (ADTs) are widely used to obtain timely reliability information about the system under test. This dissertation develops methodologies for ADT data modeling and analysis.

In the first part of this dissertation, ADT is introduced along with three major challenges in ADT data analysis: the modeling framework, the inference method, and the need to analyze multi-dimensional processes. To overcome these challenges, the second part develops a hierarchical approach to modeling a univariate degradation process that leads to a nonlinear mixed-effects regression model. With this modeling framework, the issues of ignoring uncertainties in both data analysis and lifetime prediction, as presented in an International Organization for Standardization (ISO) standard, are resolved. The third part addresses an approach to modeling a bivariate degradation process. It is developed using copula theory, which brings the benefits of both model flexibility and inference convenience, and it is paired with an efficient Bayesian method for reliability evaluation. The last part develops an extension to a multivariate modeling framework. Three fundamental copula classes are applied to model the complex dependence structure among correlated degradation processes. The advantages of the proposed modeling framework and the effect of ignoring tail dependence are demonstrated through simulation studies. Applications of the copula-based multivariate degradation models to both system reliability evaluation and remaining useful life prediction are provided.
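
A minimal sketch of the copula idea for a bivariate degradation process is shown below, using a Gaussian copula with illustrative gamma-distributed increments, correlation, and failure thresholds; it is not one of the fitted models or copula classes from this work.

```python
# A minimal sketch of a copula-style bivariate degradation simulation: marginal
# degradation increments are made dependent through a Gaussian copula.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n_units, n_steps, rho = 1000, 50, 0.7

# Gaussian copula: correlated normals -> uniforms -> arbitrary marginal increments.
cov = np.array([[1.0, rho], [rho, 1.0]])
z = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=(n_units, n_steps))
u = stats.norm.cdf(z)
inc1 = stats.gamma(a=2.0, scale=0.05).ppf(u[..., 0])   # increments of degradation path 1
inc2 = stats.gamma(a=1.5, scale=0.08).ppf(u[..., 1])   # increments of degradation path 2

# A unit fails when either cumulative degradation path crosses its threshold.
path1, path2 = inc1.cumsum(axis=1), inc2.cumsum(axis=1)
failed = (path1[:, -1] > 6.0) | (path2[:, -1] > 7.0)
print("Simulated probability of failure by the end of test:", failed.mean())
```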

In summary, this dissertation studies and explores the use of statistical methods in analyzing ADT data. All proposed methodologies are demonstrated by case studies.
Date Created
2020

Separation in Optimal Designs for the Logistic Regression Model

Description

Optimal design theory provides a general framework for the construction of experimental designs for categorical responses. For a binary response, where the possible result is one of two outcomes, the logistic regression model is widely used to relate a set of experimental factors with the probability of a positive (or negative) outcome. This research investigates and proposes alternative designs to alleviate the problem of separation in small-sample D-optimal designs for the logistic regression model. Separation causes the non-existence of maximum likelihood parameter estimates and presents a serious problem for model fitting purposes.

First, it is shown that exact, multi-factor D-optimal designs for the logistic regression model can be susceptible to separation. Several logistic regression models are specified, and exact D-optimal designs of fixed sizes are constructed for each model. Sets of simulated response data are generated to estimate the probability of separation in each design. This simulation study demonstrates that small-sample D-optimal designs are prone to separation and that the risk of separation depends on the specified model. Additionally, it is demonstrated that exact designs of equal size constructed for the same models may have significantly different chances of encountering separation.
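
A minimal sketch of this kind of simulation study, with an illustrative design and assumed true coefficients, estimates the probability of complete separation by checking linear-program feasibility for each simulated response vector; it is not the exact designs or models evaluated in this research.

```python
# A minimal sketch of estimating the probability of complete separation by simulation;
# the design, coefficients, and simulation size are illustrative assumptions.
import numpy as np
from scipy.optimize import linprog
from scipy.special import expit

def completely_separated(X, y):
    """Feasible iff some b gives x'b > 0 for all y=1 and x'b < 0 for all y=0."""
    s = np.where(y == 1, 1.0, -1.0)
    res = linprog(c=np.zeros(X.shape[1]), A_ub=-(s[:, None] * X),
                  b_ub=-np.ones(len(y)), bounds=[(None, None)] * X.shape[1],
                  method="highs")
    return res.success

rng = np.random.default_rng(7)
design = rng.uniform(-1, 1, size=(12, 2))                  # stand-in for a small exact design
X = np.column_stack([np.ones(len(design)), design])        # intercept + two factors
beta = np.array([0.5, 2.0, -1.5])                          # assumed true logistic coefficients

n_sep, n_sim = 0, 500
for _ in range(n_sim):
    y = rng.binomial(1, expit(X @ beta))                   # simulate binary responses
    n_sep += completely_separated(X, y)
print("Estimated probability of separation:", n_sep / n_sim)
```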

The second portion of this research establishes an effective strategy for augmentation, where additional design runs are judiciously added to eliminate separation that has occurred in an initial design. A simulation study is used to demonstrate that augmenting runs in regions of maximum prediction variance (MPV), where the predicted probability of either response category is 50%, most reliably eliminates separation. However, it is also shown that MPV augmentation tends to yield augmented designs with lower D-efficiencies.
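
A minimal sketch of the MPV idea, with hypothetical fitted coefficients and a candidate grid, simply ranks candidate runs by how close their predicted probability is to 0.5:

```python
# A minimal sketch of choosing augmentation runs near maximum prediction variance;
# the fitted coefficients and candidate grid are illustrative assumptions.
import numpy as np
from scipy.special import expit

beta_hat = np.array([0.3, 1.8, -1.2])                       # illustrative fitted coefficients
grid = np.linspace(-1, 1, 21)
cand = np.array([[1.0, a, b] for a in grid for b in grid])  # intercept + two-factor candidates

p_hat = expit(cand @ beta_hat)
order = np.argsort(np.abs(p_hat - 0.5))                     # closest to 0.5 = highest variance
print("Top MPV augmentation candidates (x1, x2):")
print(cand[order[:4], 1:])
```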

The final portion of this research proposes a novel compound optimality criterion, DMP, that is used to construct locally optimal and robust compromise designs. A two-phase coordinate exchange algorithm is implemented to construct exact locally DMP-optimal designs. To address design dependence issues, a maximin strategy is proposed for designating a robust DMP-optimal design. A case study demonstrates that the maximin DMP-optimal design maintains comparable D-efficiencies to a corresponding Bayesian D-optimal design while offering significantly improved separation performance.
Date Created
2019

Data-Driven Decision-Making for Medications Management Modalities

Description

One of the critical issues in the U.S. healthcare sector is medications management. Mismanagement of medications not only brings more unfavorable medical outcomes for patients but also imposes avoidable medical expenditures, which account for part of the enormous $750 billion that the American healthcare system wastes annually. The lack of efficiency in medical outcomes can be due to several reasons. One of them is drug intensification: a problem associated with more aggressive management of medications and its negative consequences for patients.

To address this and many other challenges related to medications mismanagement, I take advantage of data-driven methodologies in which a decision-making framework for identifying optimal medications management strategies is established from real-world data. This data-driven approach has the advantage of supporting decision-making processes with data analytics, so the resulting decisions can be validated against verifiable data. Compared to merely theoretical methods, this methodology is therefore more applicable to patients as the ultimate beneficiaries of the healthcare system.

Based on this premise, this dissertation analyzes and advances three streams of research that are influenced by issues involving the management of medications and treatments in different medical contexts. In particular, I discuss (1) the management of medications and treatment modalities for new-onset diabetes after solid organ transplantation and (2) the epidemic of opioid prescription and abuse.
Date Created
2019

Interaction Testing, Fault Location, and Anonymous Attribute-Based Authorization

Description

This dissertation studies three classes of combinatorial arrays with practical applications in testing, measurement, and security. Covering arrays are widely studied in software and hardware testing to indicate the presence of faulty interactions. Locating arrays extend covering arrays to identify the interactions causing a fault by requiring additional conditions on how interactions are covered in rows. This dissertation introduces a new class, anonymizing arrays, to guarantee a degree of anonymity by bounding the probability that a particular row is identified by the interaction presented. Similarities among these arrays lead to common algorithmic techniques for their construction, which this dissertation explores. Differences arising from their application domains lead to the unique features of each class, requiring that the techniques be tailored to the specifics of each problem.
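
To make the covering property concrete, the sketch below checks whether every t-way combination of levels appears in at least one row of a small, classical binary array; the checker and example are illustrative and are not the construction algorithms developed in this dissertation.

```python
# A minimal sketch of a strength-t coverage checker for a candidate covering array.
import numpy as np
from itertools import combinations

def is_covering_array(A, t, levels):
    """Return True if every t-way level combination is covered by some row of A."""
    n_rows, k = A.shape
    for cols in combinations(range(k), t):
        seen = {tuple(row) for row in A[:, cols]}
        if len(seen) < levels ** t:          # some t-tuple of levels never occurs
            return False
    return True

# A classical strength-2 covering array on 3 binary factors in 4 rows.
A = np.array([[0, 0, 0],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]])
print("Strength-2 covering array:", is_covering_array(A, t=2, levels=2))
print("Strength-3 covering array:", is_covering_array(A, t=3, levels=2))
```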

One contribution of this work is a conditional expectation algorithm to build covering arrays via an intermediate combinatorial object. Conditional expectation efficiently finds intermediate-sized arrays that are particularly useful as ingredients for additional recursive algorithms. A cut-and-paste method creates large arrays from small ingredients. Performing transformations on the copies makes further improvements by reducing redundancy in the composed arrays and leads to fewer rows.

This work contains the first algorithm for constructing locating arrays for general values of $d$ and $t$. A randomized computational search framework verifies whether a candidate array is $(\bar{d},t)$-locating by partitioning the search space, and it performs random resampling if a candidate fails. Algorithmic parameters determine which columns to resample and when to add additional rows to the candidate array. Additionally, the performance of these parameters is analyzed to provide guidance on how to tune them to prioritize speed, accuracy, or a combination of both.

This work proposes anonymizing arrays as a class related to covering arrays with a higher coverage requirement and constraints. The algorithms for covering and locating arrays are tailored to anonymizing array construction. An additional property, homogeneity, is introduced to meet the needs of attribute-based authorization. Two metrics, local and global homogeneity, are designed to compare anonymizing arrays with the same parameters. Finally, a post-optimization approach reduces the homogeneity of an anonymizing array.
Date Created
2019

Locating Arrays: Construction, Analysis, and Robustness

Description

Modern computer systems are complex engineered systems involving a large collection of individual parts, each with many parameters, or factors, affecting system performance. One way to understand these complex systems and their performance is through experimentation. However, most modern computer systems involve such a large number of factors that thorough experimentation on all of them is impossible. An initial screening step is thus necessary to determine which factors are relevant to the system's performance and which factors can be eliminated from experimentation.

Factors may impact system performance in different ways. A factor at a specific level may significantly affect performance as a main effect, or in combination with other main effects as an interaction. For screening, it is necessary both to identify the presence of these effects and to locate the factors responsible for them. A locating array is a relatively new experimental design that causes every main effect and interaction to occur and distinguishes all sets of d main effects and interactions from each other in the tests where they occur. This design is therefore helpful in screening complex systems.
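
A minimal sketch of the simplest case of this property (single main effects, i.e. d = 1 and t = 1) checks that every factor-level appears in a distinct, nonempty set of rows; the example array is illustrative, and real locating arrays must also handle interactions and larger d.

```python
# A minimal sketch of a (1,1)-locating check: each (factor, level) pair must be
# covered and occur in a unique set of rows so a single active effect can be located.
import numpy as np

def is_1_1_locating(A, levels):
    """True if every (factor, level) pair occurs in a unique, nonempty set of rows."""
    seen = set()
    n_rows, k = A.shape
    for factor in range(k):
        for level in range(levels):
            rows = frozenset(np.where(A[:, factor] == level)[0])
            if not rows or rows in seen:      # level never tested, or indistinguishable
                return False
            seen.add(rows)
    return True

A = np.array([[0, 0, 0],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 0],
              [1, 1, 1]])
print("(1,1)-locating:", is_1_1_locating(A, levels=2))
```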

The process of screening using locating arrays involves multiple steps. First, a locating array is constructed for all possibly significant factors. Next, the system is executed for all tests indicated by the locating array and a response is observed. Finally, the response is analyzed to identify the significant system factors for future experimentation. However, simply constructing a reasonably sized locating array for a large system is no easy task and analyzing the response of the tests presents additional difficulties due to the large number of possible predictors and the inherent imbalance in the experimental design itself. Further complications can arise from noise in the system or errors in testing.

This thesis has three contributions. First, it provides an algorithm to construct locating arrays using the Lovász Local Lemma with Moser-Tardos resampling. Second, it gives an algorithm to analyze the system response efficiently. Finally, it studies the robustness of the analysis to the heavy-hitters assumption underlying the approach as well as to varying amounts of system noise.
Date Created
2018

Data Fusion and Systems Engineering Approaches for Quality and Performance Improvement of Health Care Systems: From Diagnosis to Care to System-level Decision-making

Description

Technology advancements in diagnostic imaging, smart sensing, and health information systems have resulted in a data-rich environment in health care, which offers a great opportunity for Precision Medicine. The objective of my research is to develop data fusion and system informatics approaches for quality and performance improvement of health care. In my dissertation, I focus on three emerging problems in health care and develop novel statistical models and machine learning algorithms to tackle these problems from diagnosis to care to system-level decision-making.

The first topic is diagnosis/subtyping of migraine to customize effective treatment for different subtypes of patients. Existing clinical definitions of subtypes use somewhat arbitrary boundaries based primarily on patient self-reported symptoms, which are subjective and error-prone. My research develops a novel Multimodality Factor Mixture Model that discovers subtypes of migraine from multimodality MRI imaging data, which provide complementary, accurate measurements of the disease. Patients in the different subtypes show significantly different clinical characteristics of the disease. Treatment tailored and optimized for patients of the same subtype paves the way toward Precision Medicine.
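
As a generic illustration of mixture-model subtyping (not the Multimodality Factor Mixture Model itself), the sketch below fits an off-the-shelf Gaussian mixture to hypothetical imaging-derived features and assigns each patient to a subtype.

```python
# A minimal sketch of generic mixture-model subtyping on hypothetical imaging features;
# it is a stand-in baseline, not the model developed in this work.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(8)
# Hypothetical imaging-derived features for 300 patients (e.g., regional measurements).
features = np.vstack([rng.normal(loc=0.0, scale=1.0, size=(150, 12)),
                      rng.normal(loc=1.5, scale=1.0, size=(150, 12))])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=8).fit(features)
subtype = gmm.predict(features)
print("Patients assigned to each subtype:", np.bincount(subtype))
```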

The second topic focuses on coordinated patient care. Care coordination between nurses and with other health care team members is important for providing high-quality and efficient care to patients. The recently developed Nurse Care Coordination Instrument (NCCI) is the first of its kind that enables large-scale quantitative data to be collected. My research develops a novel Multi-response Multi-level Model (M3) that enables transfer learning in NCCI data fusion. M3 identifies key factors that contribute to improving care coordination, and facilitates the design and optimization of nurses’ training, workload assignment, and practice environment, which leads to improved patient outcomes.

The last topic is system-level decision-making for early detection of Alzheimer's disease (AD) at the early stage of Mild Cognitive Impairment (MCI), by predicting each MCI patient's risk of converting to AD using imaging and proteomic biomarkers. My research proposes a systems engineering approach that integrates multiple perspectives, including prediction accuracy, biomarker cost and availability, patient heterogeneity, and diagnostic efficiency, and allows for a system-wide optimized decision regarding the biomarker testing process for predicting MCI conversion.
Date Created
2018