Machine Learning and Causal Inference: Theory, Examples, and Computational Results

Description
This dissertation covers several topics in machine learning and causal inference. First, the question of “feature selection,” a common byproduct of regularized machine learning methods, is investigated theoretically in the context of treatment effect estimation. This involves a detailed review and extension of frameworks for estimating causal effects, together with an in-depth theoretical study. Next, various computational approaches to estimating causal effects with machine learning methods are compared with these theoretical desiderata in mind. Several improvements to current methods for causal machine learning are identified, and compelling angles for further study are pinpointed. Finally, SHAP, a method commonly used for “explaining” the predictions of machine learning algorithms, is evaluated critically through a statistical lens.
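
As a concrete illustration of the kind of SHAP-based “explanation” examined in the final chapter, the sketch below computes SHAP values for a fitted tree ensemble and summarizes them by mean absolute value per feature. It assumes the open-source shap and scikit-learn packages and uses a synthetic dataset; it shows standard SHAP usage, not the dissertation's own analysis.

```python
# A minimal sketch of computing SHAP values for a fitted model,
# assuming the open-source `shap` and `scikit-learn` packages.
import numpy as np
import shap
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                      # synthetic features
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=500)

model = GradientBoostingRegressor().fit(X, y)

# TreeExplainer computes Shapley values efficiently for tree ensembles.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)             # shape: (n_samples, n_features)

# Mean absolute SHAP value per feature: a common "importance" summary
# whose statistical interpretation is what the chapter scrutinizes.
print(np.abs(shap_values).mean(axis=0))
```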
Date Created
2023

Case Studies in Machine Learning of Reduced Form Models for Causal Inference

Description
This dissertation develops versatile modeling tools to estimate causal effects when conditional unconfoundedness is not immediately satisfied. Chapter 2 provides a brief overview of common techniques in causal inference, with a focus on models relevant to the data explored in later chapters. The rest of the dissertation focuses on the development of novel “reduced form” models designed to address the particular challenges of different datasets. Chapter 3 explores the question of whether forecasts of bankruptcy cause bankruptcy. The question arises from the observation that companies issued going concern opinions were more likely to go bankrupt in the following year, leading people to speculate that the opinions themselves caused the bankruptcies via a “self-fulfilling prophecy.” A Bayesian machine learning sensitivity analysis is developed to answer this question. In exchange for additional flexibility and fewer assumptions, this approach gives up point identification of causal effects; the sensitivity analysis therefore studies a wide range of plausible scenarios for the causal effect of going concern opinions on bankruptcy. Simulations report the model’s performance on several metrics in comparison with other popular methods, along with a robust analysis of its sensitivity to misspecification. Results on empirical data indicate that forecasts of bankruptcy likely do have a small causal effect. Chapter 4 studies the effects of vaccination on COVID-19 mortality at the state level in the United States. The dynamic nature of the pandemic complicates more straightforward regression adjustments and invalidates many alternative models. The chapter comments on the limitations of both mechanistic approaches and traditional statistical methods for epidemiological data. Instead, a state space model is developed that allows the study of the ever-changing dynamics of the pandemic’s progression. In the first stage, the model decomposes the observed mortality data into component surges; this information is later used in a semi-parametric regression model for causal analysis. Results are investigated thoroughly for empirical justification and stress-tested in simulated settings.
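
As a rough illustration of the sensitivity-analysis logic described above (sweeping an unidentified quantity over a range of plausible scenarios), the sketch below varies a hypothetical confounding-bias parameter and reports the implied effect under each scenario. It uses synthetic data and a simple bias adjustment; it is not the Bayesian machine learning model developed in Chapter 3.

```python
# A schematic sensitivity analysis: sweep an unidentified bias parameter
# and report the implied causal effect under each scenario. This is a toy
# illustration of the general idea, not the chapter's Bayesian model;
# all quantities here are synthetic.
import numpy as np

rng = np.random.default_rng(1)
n = 2000
opinion = rng.binomial(1, 0.3, n)                    # going concern opinion issued
bankrupt = rng.binomial(1, 0.05 + 0.10 * opinion)    # observed bankruptcy

naive_effect = bankrupt[opinion == 1].mean() - bankrupt[opinion == 0].mean()

# The bias parameter stands in for confounding that cannot be estimated
# from the data; each value defines one "plausible scenario".
for bias in np.linspace(0.0, 0.08, 5):
    print(f"assumed confounding bias {bias:.2f} -> implied effect {naive_effect - bias:.3f}")
```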
Date Created
2023

Optimal Designs under Logistic Mixed Models

Description
Longitudinal data involving multiple subjects are common in medical and social science areas. I consider generalized linear mixed models (GLMMs) applied to such longitudinal data and the problem of searching for optimal designs under such models. In this case, the optimality criteria from optimal design theory depend on the model parameters, which leads to local optimality. Moreover, the information matrix under a GLMM does not have a closed-form expression. My dissertation includes three topics related to this design problem. The first part searches for locally optimal designs under GLMMs with longitudinal data. I apply the penalized quasi-likelihood (PQL) method to approximate the information matrix and compare several approximations to show the superiority of PQL over the alternatives. Under different local parameters and design restrictions, locally D- and A-optimal designs are constructed based on this approximation. An interesting finding is that locally optimal designs sometimes assign different designs to different subjects. Finally, the robustness of these locally optimal designs is discussed. In the second part, an unknown observational covariate is added to the previous model. With an unknown observational variable in the experiment, expected optimality criteria are considered. Under different assumptions about the unknown variable and different parameter settings, locally optimal designs are constructed and discussed. In the last part, Bayesian optimal designs are considered under logistic mixed models. Bayesian optimal designs are generated under different priors on the local parameters. Computing Bayesian designs under such a model is usually time-consuming; here, the running time is reduced to an acceptable level while maintaining accurate results. I also discuss the robustness of these Bayesian optimal designs, which is the motivation for applying such an approach.
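
As a minimal sketch of the kind of computation involved, the code below evaluates a D-criterion (the log-determinant of an approximate information matrix) for a logistic random-intercept model under two candidate within-subject measurement schedules. The working-covariance approximation, parameter values, and schedules are illustrative assumptions, not the dissertation's PQL derivation.

```python
# A minimal sketch of a D-optimality evaluation for a logistic
# random-intercept model, using a quasi-likelihood-style approximation of
# the information matrix (illustrative only; the dissertation's PQL-based
# approximation is more involved). Parameter values are hypothetical.
import numpy as np

beta = np.array([-1.0, 0.5])      # assumed local values (intercept, slope)
sigma2_b = 0.5                    # random-intercept variance

def d_criterion(time_points, n_subjects=10):
    """log det of the approximate information matrix for one within-subject
    schedule, replicated over n_subjects identical subjects."""
    X = np.column_stack([np.ones(len(time_points)), time_points])
    Z = np.ones((len(time_points), 1))                 # random intercept
    mu = 1.0 / (1.0 + np.exp(-X @ beta))
    W_inv = np.diag(1.0 / (mu * (1.0 - mu)))           # GLM working variances
    V = W_inv + sigma2_b * (Z @ Z.T)                   # marginal working covariance
    info = n_subjects * X.T @ np.linalg.solve(V, X)
    return np.linalg.slogdet(info)[1]

# Compare two candidate within-subject measurement schedules.
print(d_criterion(np.array([0.0, 0.5, 1.0])))
print(d_criterion(np.array([0.0, 0.1, 0.2])))
```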
Date Created
2022

Advances in Directional Goodness-of-fit Testing of Binary Data under Model Misspecification in Case of Sparseness

Description
A goodness-of-fit test is a hypothesis test used to assess whether a given model fits the data well. It is extremely difficult to find a universal goodness-of-fit test that can handle all types of statistical models. Moreover, the traditional Pearson’s chi-square goodness-of-fit test is generally considered an omnibus test rather than a directional test, so it is hard to find the source of poor fit when the null hypothesis is rejected, and the test loses its validity and effectiveness under certain special conditions. Sparseness is one such condition. One effective way to overcome the adverse effects of sparseness is to use limited-information statistics. This dissertation covers two topics on constructing and using limited-information statistics to overcome sparseness for binary data. In the first topic, the theoretical framework of pairwise concordance is provided, along with the transformation matrix used to extract the corresponding marginals and their generalizations. A series of new chi-square test statistics and corresponding orthogonal components are then proposed for detecting model misspecification in longitudinal binary data. One important conclusion is that the test statistic $X^2_{2c}$ can be taken as an extension of $X^2_{[2]}$, the second-order marginal version of the traditional Pearson’s chi-square statistic. In the second topic, the research interest is the effect of different intercept patterns when the Lagrange multiplier (LM) test is used to find the source of misfit for two items in a 2-PL IRT model. Several other directional chi-square test statistics are included for comparison. The simulation results show that the intercept pattern does affect the performance of the goodness-of-fit tests, especially the power to find the source of misfit when one exists. More specifically, the power is directly affected by the “intercept distance” between the two misfitting variables. Another finding is that the LM test statistic offers the best balance between accurate Type I error rates and high empirical power, indicating that the LM test is robust.
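
The sketch below illustrates the second-order (pairwise) marginals on which such limited-information statistics are built: observed pairwise proportions are compared with model-implied ones. For simplicity the “fitted model” here is independence and the data are synthetic; the dissertation's statistics use the transformation-matrix machinery described above.

```python
# A small sketch of the second-order (pairwise) marginals underlying
# limited-information statistics. Observed pairwise proportions are
# compared with those expected under a fitted model; here the "fitted
# model" is simple independence, purely for illustration.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
Y = rng.binomial(1, 0.4, size=(300, 4))     # binary responses, 4 items

p_hat = Y.mean(axis=0)                      # first-order marginals
for j, k in combinations(range(Y.shape[1]), 2):
    observed = (Y[:, j] * Y[:, k]).mean()   # observed P(Y_j = 1, Y_k = 1)
    expected = p_hat[j] * p_hat[k]          # expected under independence
    print(f"pair ({j},{k}): observed {observed:.3f}, expected {expected:.3f}")
```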
Date Created
2022

Partition of Unity Methods for Solving Partial Differential Equations on Surfaces

Description
Solving partial differential equations on surfaces has many applications, including modeling chemical diffusion, pattern formation, geophysics, and texture mapping. This dissertation presents two techniques for solving time-dependent partial differential equations on various surfaces using the partition of unity method. A novel spectral cubed sphere method that utilizes the windowed Fourier technique is presented and used both for approximating functions on spherical domains and for solving partial differential equations. The spectral cubed sphere method is applied to solve the transport equation as well as the diffusion equation on the unit sphere. The second approach is a partition of unity method with local radial basis function approximations. This technique is also used to explore the effect of the node distribution, as it is well known that node choice plays an important role in the accuracy and stability of an approximation. A greedy algorithm is implemented to generate good interpolation nodes using a column-pivoted QR factorization. The partition of unity radial basis function method is applied to solve the diffusion equation on the sphere as well as a system of reaction-diffusion equations on multiple surfaces, including the surface of a red blood cell, a torus, and the Stanford bunny. The accuracy and stability of both methods are investigated.
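
As a small sketch of the greedy node-selection idea, the code below ranks candidate nodes on the unit sphere with a column-pivoted QR factorization of a Gaussian RBF kernel matrix and keeps the leading pivots as interpolation nodes. The candidate set, kernel, and shape parameter are made up for illustration; this is not the dissertation's exact implementation.

```python
# A minimal sketch of selecting "good" RBF interpolation nodes from a set
# of candidates via column-pivoted QR, in the spirit of the greedy
# algorithm described above. Candidate points and the kernel shape
# parameter are arbitrary illustrative choices.
import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(3)
candidates = rng.normal(size=(200, 3))
candidates /= np.linalg.norm(candidates, axis=1, keepdims=True)   # project onto unit sphere

# Gaussian RBF kernel matrix over the candidate set.
eps = 2.0
d2 = np.sum((candidates[:, None, :] - candidates[None, :, :]) ** 2, axis=-1)
A = np.exp(-eps**2 * d2)

# Column-pivoted QR orders columns (candidate nodes) by how much new
# information each adds; keep the first m pivots as interpolation nodes.
m = 50
_, _, piv = qr(A, pivoting=True)
nodes = candidates[piv[:m]]
print(nodes.shape)
```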
Date Created
2021

Contributions to Optimal Experimental Design and Strategic Subdata Selection for Big Data

Description
In this dissertation, two research questions in the field of applied experimental design were explored. First, methods for augmenting the three-level screening designs called Definitive Screening Designs (DSDs) were investigated. Second, schemes for strategic subdata selection for nonparametric predictive modeling with big data were developed.

Under sparsity, the structure of DSDs can allow for the screening and optimization of a system in one step, but in non-sparse situations estimation of second-order models requires augmentation of the DSD. In this work, augmentation strategies for DSDs were considered, given the assumption that the correct form of the model for the response of interest is quadratic. Series of augmented designs were constructed and explored, and power calculations, model-robustness criteria, model-discrimination criteria, and simulation study results were used to identify the number of augmented runs necessary for (1) effectively identifying active model effects, and (2) precisely predicting a response of interest. When the goal is identification of active effects, it is shown that supersaturated designs are sufficient; when the goal is prediction, it is shown that little is gained by augmenting beyond the design that is saturated for the full quadratic model. Surprisingly, augmentation strategies based on the I-optimality criterion do not lead to better predictions than strategies based on the D-optimality criterion.

Computational limitations can render standard statistical methods infeasible in the face of massive datasets, necessitating subsampling strategies. In the big data context, the primary objective is often prediction, but the correct form of the model for the response of interest is likely unknown. Here, two new methods of subdata selection were proposed. The first is based on clustering, the second is based on space-filling designs, and both are free from model assumptions. The performance of the proposed methods was explored visually via low-dimensional simulated examples, via real data applications, and via large simulation studies. In all cases the proposed methods were compared to existing, widely used subdata selection methods. The conditions under which the proposed methods provide advantages over standard subdata selection strategies were identified.
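
The sketch below illustrates the clustering-based strategy in its simplest form: cluster the full dataset in covariate space and retain the observation nearest to each centroid, so the subdata cover the covariate space without any model assumptions. The data sizes and the use of k-means are illustrative choices, not the exact proposed method.

```python
# A sketch of cluster-based, model-free subdata selection: cluster the
# covariates and keep the point nearest to each cluster center. This
# illustrates the general strategy only; sizes are kept small for speed.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
X = rng.normal(size=(20_000, 5))           # "big" covariate matrix
k = 200                                    # target subdata size

km = KMeans(n_clusters=k, n_init=1, random_state=0).fit(X)
# For each cluster, take the observation closest to its centroid.
dists = km.transform(X)                    # distances to every centroid
subdata_idx = np.array([np.argmin(dists[:, j]) for j in range(k)])
subdata = X[subdata_idx]
print(subdata.shape)
```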
Date Created
2020

Essays on the Modeling of Binary Longitudinal Data with Time-dependent Covariates

Description
Longitudinal studies contain correlated data due to the repeated measurements on the same subject. The changing values of the time-dependent covariates and their association with the outcomes present another source of correlation. Most methods used to analyze longitudinal data average the effects of time-dependent covariates on outcomes over time and provide a single regression coefficient per time-dependent covariate. This denies researchers the opportunity to follow the changing impact of time-dependent covariates on the outcomes. This dissertation addresses this issue through the use of partitioned regression coefficients in three different papers.

In the first paper, an alternative approach to the partitioned Generalized Method of Moments logistic regression model for longitudinal binary outcomes is presented. This method relies on Bayes estimators and is utilized when the partitioned Generalized Method of Moments model provides numerically unstable estimates of the regression coefficients. It is used to model obesity status in the Add Health study and cognitive impairment diagnosis in the National Alzheimer’s Coordination Center database.

The second paper develops a model that allows the joint modeling of two or more binary outcomes that provide an overall measure of a subject’s trait over time. The simultaneous modeling of all outcomes provides a complete picture of the overall measure of interest. This approach accounts for the correlation among and between the outcomes across time and for the changing effects of time-dependent covariates on the outcomes. The model is used to analyze four outcomes measuring the overall quality of life in the Chinese Longitudinal Healthy Longevity Study.

The third paper presents an approach that allows for estimation of cross-sectional and lagged effects of the covariates on the outcome as well as the feedback of the response on future covariates. This is done in two parts: in part one, the effects of time-dependent covariates on the outcomes are estimated; in part two, the influence of the outcome on future values of the covariates is measured. These model parameters are obtained through a Generalized Method of Moments procedure that uses valid moment conditions between the outcome and the covariates. Child morbidity in the Philippines and obesity status in the Add Health data are analyzed.
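
A small sketch of the partitioned-coefficient data structure underlying all three papers is given below: a time-dependent covariate enters the design matrix as separate concurrent and lagged columns, each receiving its own coefficient (estimated by GMM or, in the first paper, by Bayes estimators). The data frame and column names are hypothetical.

```python
# A sketch of the "partitioned coefficient" idea: instead of one
# coefficient per time-dependent covariate, the covariate's concurrent and
# lagged values enter as separate columns, each with its own coefficient.
import pandas as pd

df = pd.DataFrame({
    "id":   [1, 1, 1, 2, 2, 2],
    "time": [1, 2, 3, 1, 2, 3],
    "x":    [0.2, 0.5, 0.9, 0.1, 0.4, 0.8],   # time-dependent covariate
    "y":    [0, 0, 1, 0, 1, 1],               # binary outcome
})

# Partition x into a concurrent column and a one-period-lagged column.
df["x_t"]   = df["x"]
df["x_lag"] = df.groupby("id")["x"].shift(1)

# x_t and x_lag would each receive their own regression coefficient
# in the partitioned model.
print(df)
```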
Date Created
2020

Comparison of Denominator Degrees of Freedom Approximations for Linear Mixed Models in Small-Sample Simulations

Description
Whilst linear mixed models offer a flexible approach to handling data with multiple sources of random variability, hypothesis testing for the fixed effects often encounters obstacles when the sample size is small and the underlying distribution of the test statistic is unknown. Five methods of approximating the denominator degrees of freedom (residual, containment, between-within, Satterthwaite, Kenward-Roger) have been developed to overcome this problem. This study aims to evaluate the performance of these five methods with a mixed model consisting of a random intercept and a random slope. Specifically, simulations are conducted to provide insight into the F-statistics, denominator degrees of freedom, and p-values each method gives under different settings of the sample structure, the fixed-effect slopes, and the missing-data proportion. The simulation results show that the residual method performs the worst in terms of F-statistics and p-values. Also, the Satterthwaite and Kenward-Roger methods tend to be more sensitive to changes in the design. The Kenward-Roger method performs the best in terms of F-statistics when the null hypothesis is true.
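
For orientation, one commonly presented form of the Satterthwaite approximation for a single contrast $\ell^\top\beta$ is

$$\nu \;=\; \frac{2\,\bigl(\ell^\top \widehat{C}(\hat{\theta})\,\ell\bigr)^{2}}{g^\top A\, g}, \qquad g \;=\; \left.\frac{\partial\,\ell^\top C(\theta)\,\ell}{\partial \theta}\right|_{\theta=\hat{\theta}},$$

where $C(\theta)$ is the covariance matrix of the fixed-effect estimates as a function of the variance parameters $\theta$, and $A$ is the asymptotic covariance matrix of $\hat{\theta}$. This is the standard one-dimensional form, shown only as a reference point; the simulations here compare this approach against the residual, containment, between-within, and Kenward-Roger alternatives.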
Date Created
2020

Optimal Sampling Designs for Functional Data Analysis

Description
Functional regression models are widely considered in practice. To precisely understand an underlying functional mechanism, a good sampling schedule for collecting informative functional data is necessary, especially when data collection is limited. However, little research has so far been conducted on optimal sampling schedule design for functional regression models. To address this design issue, efficient approaches are proposed for generating the best sampling plan in the functional regression setting. First, three optimal experimental designs are considered under a function-on-function linear model: the schedule that maximizes the relative efficiency for recovering the predictor function, the schedule that maximizes the relative efficiency for predicting the response function, and the schedule that maximizes a mixture of the relative efficiencies of both the predictor and response functions. The obtained sampling plan allows a precise recovery of the predictor function and a precise prediction of the response function. The proposed approach can also be reduced to identify the optimal sampling plan for a scalar-on-function linear regression model. In addition, the optimality criterion for predicting a scalar response using a functional predictor is derived when a quadratic relationship between these two variables is present, and proofs of important properties of the derived optimality criterion are provided. To find such designs, a comparably fast algorithm that can generate nearly optimal designs is proposed. As the optimality criterion includes quantities that must be estimated from prior knowledge (e.g., a pilot study), the effectiveness of the suggested optimal design depends highly on the quality of the estimates. However, in many situations these estimates are unreliable; thus, a bootstrap aggregating (bagging) approach is employed to enhance the quality of the estimates and to find sampling schedules that are stable to their misspecification. Through case studies, it is demonstrated that the proposed designs outperform other designs in terms of accurately predicting the response and recovering the predictor. It is also demonstrated that the bagging-enhanced design generates a more robust sampling plan under misspecification of the estimated quantities.
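
The sketch below conveys the flavor of searching for an informative sampling schedule: starting from the endpoints, candidate time points are added greedily to maximize a simple log-determinant information criterion over a polynomial basis. The basis and criterion are stand-ins for illustration only; the dissertation derives relative-efficiency criteria specific to the functional models and pairs the search with bagging-based estimates.

```python
# A schematic greedy search for an informative sampling schedule: from a
# grid of candidate time points, repeatedly add the point that most
# increases a simple information-style criterion. The polynomial basis
# and log-det criterion are stand-ins, not the dissertation's criteria.
import numpy as np

grid = np.linspace(0, 1, 101)                   # candidate sampling times
degree = 4

def info_criterion(times):
    """log det of the Gram matrix of a polynomial basis at the chosen times
    (a small ridge keeps the matrix invertible for tiny schedules)."""
    F = np.vander(np.asarray(times), degree + 1, increasing=True)
    return np.linalg.slogdet(F.T @ F + 1e-8 * np.eye(degree + 1))[1]

selected = [0.0, 1.0]                           # start from the endpoints
while len(selected) < 8:
    best = max((t for t in grid if t not in selected),
               key=lambda t: info_criterion(selected + [t]))
    selected.append(best)
print(sorted(selected))
```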
Date Created
2020

Locally Optimal Experimental Designs for Mixed Responses Models

Description
Bivariate responses that comprise mixtures of binary and continuous variables are common in medical, engineering, and other scientific fields. Many works exist concerning the analysis of such mixed data, but research on optimal designs for this type of experiment is still scarce. The joint mixed responses model considered here combines an ordinary linear model for the continuous response with a generalized linear model for the binary response. Using the complete class approach, tighter upper bounds on the number of support points required for finding locally optimal designs are derived for the mixed responses models studied in this work.

In the first part of this dissertation, a theoretical result was developed to facilitate the search for locally symmetric optimal designs for mixed responses models with one continuous covariate. The study was then extended to mixed responses models that include group effects. Two types of mixed responses models with group effects were investigated. The first type includes models having no common parameters across subject groups, and the second type allows some common parameters (e.g., a common slope) across groups. In addition to complete class results, an efficient algorithm (PSO-FM) was proposed to search for the A- and D-optimal designs. Finally, the first-order mixed responses model is extended to a quadratic mixed responses model in which a quadratic polynomial predictor is placed in the linear model component.
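
As a toy stand-in for the design search, the code below brute-forces a locally D-optimal two-point design for a simple logistic component at assumed local parameter values. It is only meant to show what evaluating a D-criterion at a candidate design looks like; the PSO-FM algorithm proposed here searches far larger design spaces and uses the joint information matrix of the mixed continuous and binary responses.

```python
# A toy stand-in for the design search: brute-force evaluation of a
# D-criterion over two-point designs for a simple logistic component.
# Local parameter values are assumed for illustration.
import numpy as np
from itertools import combinations

beta = np.array([0.0, 1.0])                    # assumed local parameter values

def d_logistic(points):
    """log det of the logistic information matrix at a two-point design."""
    X = np.column_stack([np.ones(len(points)), points])
    mu = 1.0 / (1.0 + np.exp(-X @ beta))
    W = np.diag(mu * (1.0 - mu))
    return np.linalg.slogdet(X.T @ W @ X)[1]

grid = np.linspace(-3, 3, 61)
best = max(combinations(grid, 2), key=d_logistic)
print(best)   # close to the known optimum near ±1.54 for these parameter values
```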
Date Created
2020