Supervised and ensemble classification of multivariate functional data: applications to lupus diagnosis

Description
This dissertation investigates the classification of systemic lupus erythematosus (SLE) in the presence of non-SLE alternatives, while developing novel curve classification methodologies with wide-ranging applications. Functional data representations of plasma thermogram measurements and the corresponding derivative curves provide predictors yet to be investigated for SLE identification. Functional nonparametric classifiers form a methodological basis, which is used herein to develop a) the family of ESFuNC segment-wise curve classification algorithms and b) per-pixel ensembles based on logistic regression and fused-LASSO. The proposed methods achieve test set accuracy rates as high as 94.3%, while returning information about regions of the temperature domain that are critical for population discrimination. The undertaken analyses suggest that derivative-based information contributes significantly to improved classification performance relative to recently published studies on SLE plasma thermograms.
Date Created
2018

Three essays on shrinkage estimation and model selection of linear and nonlinear time series models

Description
The primary objective in time series analysis is forecasting. Raw data often exhibits nonstationary behavior: trends, seasonal cycles, and heteroskedasticity. After data is transformed to a weakly stationary process, autoregressive moving average (ARMA) models may capture the remaining temporal dynamics to improve forecasting. Estimation of ARMA can be performed through regressing current values on previous realizations and proxy innovations. The classic paradigm fails when dynamics are nonlinear; in this case, parametric, regime-switching specifications model changes in level, ARMA dynamics, and volatility, using a finite number of latent states. If the states can be identified using past endogenous or exogenous information, a threshold autoregressive (TAR) or logistic smooth transition autoregressive (LSTAR) model may simplify complex nonlinear associations to conditional weakly stationary processes. For ARMA, TAR, and STAR, order parameters quantify the extent to which past information is associated with the future. Unfortunately, even if model orders are known a priori, the possibility of over-fitting can lead to sub-optimal forecasting performance. By intentionally overestimating these orders, a linear representation of the full model is exploited and Bayesian regularization can be used to achieve sparsity. Global-local shrinkage priors for AR, MA, and exogenous coefficients are adopted to pull posterior means toward 0 without over-shrinking relevant effects. This dissertation introduces, evaluates, and compares Bayesian techniques that automatically perform model selection and coefficient estimation of ARMA, TAR, and STAR models. Multiple Monte Carlo experiments illustrate the accuracy of these methods in finding the "true" data generating process. Practical applications demonstrate their efficacy in forecasting.
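The regression-based ARMA estimation mentioned in this abstract (regressing current values on previous realizations and proxy innovations) can be sketched in a few lines. The following is a minimal illustration only, not the dissertation's Bayesian method. It assumes NumPy, a zero-mean ARMA(1,1) series, and a two-step ordinary least squares scheme in which residuals from a long autoregression stand in for the unobserved innovations. The function names, the lag length of 20, and the simulated parameter values are hypothetical choices made for the example.

```python
import numpy as np

def simulate_arma11(n, phi, theta, seed=0):
    """Simulate y[t] = phi*y[t-1] + e[t] + theta*e[t-1] with N(0,1) innovations."""
    rng = np.random.default_rng(seed)
    burn = 200  # discard initial values so the start-up effect fades
    e = rng.standard_normal(n + burn)
    y = np.zeros(n + burn)
    for t in range(1, n + burn):
        y[t] = phi * y[t - 1] + e[t] + theta * e[t - 1]
    return y[burn:]

def lagged_matrix(x, lags):
    """Columns are x[t-1], ..., x[t-lags] for t = lags, ..., len(x)-1."""
    n = len(x)
    return np.column_stack([x[lags - k : n - k] for k in range(1, lags + 1)])

def fit_arma11_two_step(y, long_ar=20):
    """Two-step OLS: long AR residuals serve as proxy innovations."""
    # Step 1: fit a long autoregression and keep its residuals.
    X = lagged_matrix(y, long_ar)
    target = y[long_ar:]
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    proxy = target - X @ beta  # aligned with y[long_ar:]

    # Step 2: regress y[t] on y[t-1] and the proxy innovation at t-1.
    y_trim = y[long_ar:]
    Z = np.column_stack([y_trim[:-1], proxy[:-1]])
    coef, *_ = np.linalg.lstsq(Z, y_trim[1:], rcond=None)
    return coef[0], coef[1]  # (AR estimate, MA estimate)

# Hypothetical example values, chosen only for illustration.
y = simulate_arma11(5000, phi=0.6, theta=0.3, seed=1)
phi_hat, theta_hat = fit_arma11_two_step(y)
print(phi_hat, theta_hat)
```

Deliberately overstating the lag orders in the second regression, as the abstract describes, would simply add more columns to `Z`; the shrinkage priors then decide which of those coefficients survive.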
Date Created
2018

Locally D-optimal designs for generalized linear models

Description
Generalized Linear Models (GLMs) are widely used for modeling responses with non-normal error distributions. When the values of the covariates in such models are controllable, finding an optimal (or at least efficient) design could greatly facilitate the work of collecting and analyzing data. In fact, many theoretical results are obtained on a case-by-case basis, while in other situations, researchers also rely heavily on computational tools for design selection.

Three topics are investigated in this dissertation, each focusing on one type of GLM. Topic I considers GLMs with factorial effects and one continuous covariate. Factors may interact with one another and there is no restriction on the possible values of the continuous covariate. The locally D-optimal design structures for such models are identified and results for obtaining smaller optimal designs using orthogonal arrays (OAs) are presented. Topic II considers GLMs with multiple covariates under the assumptions that all but one covariate are bounded within specified intervals and interaction effects among those bounded covariates may also exist. An explicit formula for D-optimal designs is derived and OA-based smaller D-optimal designs for models with one or two two-factor interactions are also constructed. Topic III considers multiple-covariate logistic models. All covariates are nonnegative and there is no interaction among them. Two types of D-optimal design structures are identified and their global D-optimality is proved using the celebrated equivalence theorem.
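As a small illustration of what a locally D-optimal design means, the sketch below evaluates the D-criterion (the determinant of the Fisher information matrix) for a logistic model with a single covariate and linear predictor b0 + b1*x. This is a much simpler setting than any of the three topics above, and it is an assumption made only for the example. The parameter guesses, the restriction to symmetric two-point designs with equal weights, the search grid, and all function names are likewise hypothetical.

```python
import math

def logistic_weight(eta):
    """GLM weight p(1-p) for the logistic model at linear predictor eta."""
    p = 1.0 / (1.0 + math.exp(-eta))
    return p * (1.0 - p)

def d_criterion(points, weights, b0, b1):
    """Determinant of the 2x2 information matrix sum_i w_i v_i [1 x_i; x_i x_i^2]."""
    m00 = m01 = m11 = 0.0
    for x, w in zip(points, weights):
        v = w * logistic_weight(b0 + b1 * x)
        m00 += v
        m01 += v * x
        m11 += v * x * x
    return m00 * m11 - m01 * m01

def best_symmetric_two_point(b0=0.0, b1=1.0):
    """Grid search over designs whose two points satisfy b0 + b1*x = -c and +c."""
    best_c, best_det = None, -1.0
    for k in range(1, 401):
        c = k / 100.0  # candidate value of |eta| at the design points
        points = [(-c - b0) / b1, (c - b0) / b1]
        det = d_criterion(points, [0.5, 0.5], b0, b1)
        if det > best_det:
            best_c, best_det = c, det
    return best_c, best_det

c_star, det_star = best_symmetric_two_point()
print(c_star, det_star)
```

The design is "local" because the information matrix depends on the assumed values of b0 and b1. For this simple model the search settles near |eta| of about 1.54, which agrees with the well-known two-point solution for the two-parameter logistic model.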
Date Created
2018

A study of components of Pearson's chi-square based on marginal distributions of cross-classified tables for binary variables

Description
The Pearson and likelihood ratio statistics are well-known in goodness-of-fit testing and are commonly used for models applied to multinomial count data. When data are from a table formed by the cross-classification of a large number of variables, these goodness-of-fit statistics may have lower power and inaccurate Type I error rate due to sparseness. Pearson's statistic can be decomposed into orthogonal components associated with the marginal distributions of observed variables, and an omnibus fit statistic can be obtained as a sum of these components. When the statistic is a sum of components for lower-order marginals, it has good performance for Type I error rate and statistical power even when applied to a sparse table. In this dissertation, goodness-of-fit statistics using orthogonal components based on second-, third-, and fourth-order marginals were examined. If lack-of-fit is present in higher-order marginals, then a test that incorporates the higher-order marginals may have a higher power than a test that incorporates only first- and/or second-order marginals. To this end, two new statistics based on the orthogonal components of Pearson's chi-square that incorporate third- and fourth-order marginals were developed, and the Type I error, empirical power, and asymptotic power under different sparseness conditions were investigated. Individual orthogonal components as test statistics to identify lack-of-fit were also studied. The performance of individual orthogonal components was also compared with that of other popular lack-of-fit statistics. When the number of manifest variables becomes larger than 20, most of the statistics based on marginal distributions have limitations in terms of computer resources and CPU time. To address this problem, when the number of manifest variables is larger than or equal to 20, the performance of a bootstrap-based method to obtain p-values for the Pearson-Fisher statistic, fit to a confirmatory dichotomous variable factor analysis model, and the performance of the Tollenaar and Mooijaart (2003) statistic were investigated.
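For readers unfamiliar with the two baseline statistics named at the start of this abstract, the sketch below computes the Pearson chi-square and the likelihood ratio statistic for a small table of counts. It does not implement the orthogonal-component decomposition studied in the dissertation. The 2x2 table, the independence model used for the expected counts, and the function names are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import math

def pearson_chi_square(observed, expected):
    """Pearson statistic: sum over cells of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def likelihood_ratio_g2(observed, expected):
    """Likelihood ratio statistic: 2 * sum over cells of O * log(O / E).

    Cells with a zero observed count contribute zero, which is the
    limit of O * log(O / E) as O goes to 0.
    """
    return 2.0 * sum(o * math.log(o / e)
                     for o, e in zip(observed, expected) if o > 0)

def independence_expected(table):
    """Expected counts for a two-way table under row/column independence."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    return [[r * c / n for c in col_totals] for r in row_totals]

# Hypothetical cross-classification of two binary variables.
table = [[30, 20],
         [20, 30]]
expected = independence_expected(table)

obs_flat = [o for row in table for o in row]
exp_flat = [e for row in expected for e in row]

x2 = pearson_chi_square(obs_flat, exp_flat)
g2 = likelihood_ratio_g2(obs_flat, exp_flat)
print(x2, g2)
```

With many binary variables the number of cells grows as a power of two, so most expected counts become very small. That sparseness is what degrades both statistics and motivates working with lower-order marginals instead.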
Date Created
2018

Essays on the identification and modeling of variance

Description
In the presence of correlation, generalized linear models cannot be employed to obtain regression parameter estimates. To appropriately address the extra variation due to correlation, methods to estimate and model the additional variation are investigated. A general form of the mean-variance relationship is proposed which incorporates the canonical parameter. The two variance parameters are estimated using generalized method of moments, negating the need for a distributional assumption. The mean-variance relation estimates are applied to clustered data and implemented in an adjusted generalized quasi-likelihood approach through an adjustment to the covariance matrix. In the presence of significant correlation in hierarchically structured data, the adjusted generalized quasi-likelihood model shows improved performance for random effect estimates. In addition, submodels to address deviation in skewness and kurtosis are provided to jointly model the mean, variance, skewness, and kurtosis. The additional models identify covariates influencing the third and fourth moments. A cutoff to trim the data is provided which improves parameter estimation and model fit. For each topic, findings are demonstrated through comprehensive simulation studies and numerical examples. Examples evaluated include data on children’s morbidity in the Philippines, adolescent health from the National Longitudinal Study of Adolescent to Adult Health, as well as proteomic assays for breast cancer screening.
Date Created
2018

Three essays on correlated binary outcomes: detection and appropriate models

Description
Correlation is common in many types of data, including those collected through longitudinal studies or in a hierarchical structure. In the case of clustering, or repeated measurements, there is inherent correlation between observations within the same group, or between observations obtained on the same subject. Longitudinal studies also introduce association between the covariates and the outcomes across time. When multiple outcomes are of interest, association may exist between the various models. These correlations can lead to issues in model fitting and inference if not properly accounted for. This dissertation presents three papers discussing appropriate methods to properly consider different types of association. The first paper introduces an ANOVA-based measure of intraclass correlation for three-level hierarchical data with binary outcomes, and corresponding properties. This measure is useful for evaluating when the correlation due to clustering warrants a more complex model. This measure is used to investigate AIDS knowledge in a clustered study conducted in Bangladesh. The second paper develops the Partitioned generalized method of moments (Partitioned GMM) model for longitudinal studies. This model utilizes valid moment conditions to separately estimate the varying effects of each time-dependent covariate on the outcome over time using multiple coefficients. The model is fit to data from the National Longitudinal Study of Adolescent to Adult Health (Add Health) to investigate risk factors of childhood obesity. In the third paper, the Partitioned GMM model is extended to jointly estimate regression models for multiple outcomes of interest. Thus, this approach takes into account both the correlation between the multivariate outcomes and the correlation due to time dependency in longitudinal studies. The model utilizes an expanded weight matrix and objective function composed of valid moment conditions to simultaneously estimate optimal regression coefficients. This approach is applied to Add Health data to simultaneously study drivers of outcomes including smoking, social alcohol usage, and obesity in children.
Date Created
2018

Three essays on comparative simulation in three-level hierarchical data structure

Description
Though the likelihood is a useful tool for obtaining estimates of regression parameters, it is not readily available in the fit of hierarchical binary data models. The correlated observations negate the opportunity to have a joint likelihood when fitting hierarchical logistic regression models. Inferences for the regression and covariance parameters, as well as the intraclass correlation coefficients, are usually obtained through conditional likelihood. In those cases, I have resorted to the Laplace approximation and a large sample theory approach for point and interval estimates, such as Wald-type confidence intervals and profile likelihood confidence intervals. These methods rely on distributional assumptions and large sample theory. However, when dealing with small hierarchical datasets they often result in severe bias or non-convergence. I present a generalized quasi-likelihood approach and a generalized method of moments approach; neither relies on distributional assumptions, only on moments of the response. As an alternative to the typical large sample theory approach, I present bootstrapped hierarchical logistic regression models, which provide more accurate interval estimates for small binary hierarchical data. These models substitute computation for the traditional Wald-type and profile likelihood confidence intervals. I use a latent variable approach with a new split bootstrap method for estimating intraclass correlation coefficients when analyzing binary data obtained from a three-level hierarchical structure. It is especially useful with small sample sizes and is easily extended to more levels. Comparisons are made to existing approaches through both theoretical justification and simulation studies. Further, I demonstrate my findings through an analysis of three numerical examples: one based on cancer remission data, one related to China’s antibiotic abuse study, and a third related to teacher effectiveness in schools in a state in the southwestern US.
Date Created
2017

Optimal Experimental Designs for Mixed Categorical and Continuous Responses

Description
This study concerns optimal designs for experiments where responses consist of both binary and continuous variables. Many experiments in engineering, medical studies, and other fields have such mixed responses. Although in recent decades several statistical methods have been developed for jointly modeling both types of response variables, an effective way to design such experiments remains unclear. To address this void, some useful results are developed to guide the selection of optimal experimental designs in such studies. The results are mainly built upon a powerful tool called the complete class approach and a nonlinear optimization algorithm. The complete class approach was originally developed for a univariate response, but it is extended to the case of bivariate responses of mixed variable types. Consequently, the number of candidate designs is significantly reduced. An optimization algorithm is then applied to efficiently search the small class of candidate designs for the D- and A-optimal designs. Furthermore, the optimality of the obtained designs is verified by the general equivalence theorem. In the first part of the study, the focus is on a simple, first-order model. The study is then expanded to a model with a quadratic polynomial predictor. The obtained designs can help to render a precise statistical inference in practice or serve as a benchmark for evaluating the quality of other designs.
Date Created
2017

Optimum Experimental Design Issues in Functional Neuroimaging Studies

Description
Functional magnetic resonance imaging (fMRI) is a popular tool for studying human brain functions. High-quality experimental designs are crucial to the success of fMRI experiments as they allow the collection of informative data for making precise and valid inference with minimum cost. The primary goal of this study is to identify the best sequence of mental stimuli (i.e., the fMRI design) with respect to some statistically meaningful optimality criteria. This work focuses on two related topics in this research field. The first topic is finding optimal designs for fMRI when the design matrix is uncertain. This challenging design issue occurs in many modern fMRI experiments, in which the design matrix of the statistical model depends on both the selected design and the experimental subject's uncertain behavior during the experiment. As a result, the design matrix cannot be fully determined at the design stage, which makes it difficult to select a good design. For the commonly used linear model with autoregressive errors, this study proposes a very efficient approach for obtaining high-quality fMRI designs for such experiments. The proposed approach is built upon an analytical result and an efficient computer algorithm. It is shown through case studies that the proposed approach can outperform the existing method in terms of computing time and the quality of the obtained designs. The second topic of the research is to find optimal designs for fMRI when a wavelet-based technique is considered in the fMRI data analysis. An efficient computer algorithm to search for optimal fMRI designs for such cases is developed. This algorithm is inspired by simulated annealing and a recently proposed algorithm by Saleh et al. (2017). As demonstrated in the case studies, the proposed approach makes it possible to efficiently obtain high-quality designs for fMRI studies, and is practically useful.
Date Created
2017

fMRI design under autoregressive model with one type of stimulus

Description
Functional magnetic resonance imaging (fMRI) is used to study brain activity due to stimuli presented to subjects in a scanner. It is important to conduct statistical inference on the resulting time series fMRI data. It is also important to select optimal designs for practical experiments. Design selection under autoregressive models has not been thoroughly discussed before. This paper derives general information matrices for orthogonal designs under an autoregressive model with an arbitrary number of correlation coefficients. We further provide the minimum trace of orthogonal circulant designs under an AR(1) model, which is used as a criterion to compare practical designs such as M-sequence designs and circulant (almost) orthogonal array designs. We also explore optimal designs under an AR(2) model. In practice, there can be more than one type of stimulus, but in this paper we consider only the simplest situation, with a single type of stimulus.
Date Created
2017