Data Fusion and Systems Engineering Approaches for Quality and Performance Improvement of Health Care Systems: From Diagnosis to Care to System-level Decision-making

Technology advancements in diagnostic imaging, smart sensing, and health information systems have resulted in a data-rich environment in health care, which offers a great opportunity for Precision Medicine. The objective of my research is to develop data fusion and system informatics approaches for quality and performance improvement of health care. In my dissertation, I focus on three emerging problems in health care and develop novel statistical models and machine learning algorithms to tackle these problems from diagnosis to care to system-level decision-making.

The first topic is the diagnosis and subtyping of migraine, with the goal of customizing effective treatment for different subtypes of patients. Existing clinical definitions of subtypes use somewhat arbitrary boundaries based primarily on patient self-reported symptoms, which are subjective and error-prone. My research develops a novel Multimodality Factor Mixture Model that discovers subtypes of migraine from multimodality MRI data, which provide complementary, accurate measurements of the disease. Patients in the different subtypes show significantly different clinical characteristics of the disease. Treatment tailored and optimized for patients of the same subtype paves the road toward Precision Medicine.

The second topic focuses on coordinated patient care. Care coordination between nurses and with other health care team members is important for providing high-quality and efficient care to patients. The recently developed Nurse Care Coordination Instrument (NCCI) is the first of its kind that enables large-scale quantitative data to be collected. My research develops a novel Multi-response Multi-level Model (M3) that enables transfer learning in NCCI data fusion. M3 identifies key factors that contribute to improving care coordination, and facilitates the design and optimization of nurses’ training, workload assignment, and practice environment, which leads to improved patient outcomes.

The last topic is system-level decision-making for early detection of Alzheimer’s disease (AD) at the stage of Mild Cognitive Impairment (MCI), by predicting each MCI patient’s risk of converting to AD from imaging and proteomic biomarkers. My research proposes a systems engineering approach that integrates multiple perspectives, including prediction accuracy, biomarker cost and availability, patient heterogeneity, and diagnostic efficiency, and allows for a system-wide optimized decision on the biomarker testing process for predicting MCI conversion.

Structure-Regularized Partition-Regression Models for Nonlinear System-Environment Interactions

Under different environmental conditions, the relationship between the design and operational variables of a system and the system’s performance is likely to vary and is difficult to describe with a single model. The environmental variables (e.g., temperature, humidity) are not controllable, while the variables of the system (e.g., heating, cooling) are mostly controllable. This phenomenon has been widely seen in the areas of building energy management, mobile communication networks, and wind energy. To account for the complicated interaction between a system and the multivariate environment under which it operates, a Sparse Partitioned-Regression (SPR) model is proposed, which automatically searches for a partition of the environmental variables and fits a sparse regression within each subdivision of the partition. SPR is an innovative approach that integrates recursive partitioning and high-dimensional regression model fitting within a single framework. Moreover, theoretical studies of SPR are conducted to derive oracle inequalities for the SPR estimators, which bound the difference between the risk of the SPR estimators and the Bayes risk. These theoretical studies show that the performance of the SPR estimator is almost (up to numerical constants) as good as that of an ideal estimator that can be theoretically achieved but is not available in practice. Finally, a Tree-Based Structure-Regularized Regression (TBSR) approach is proposed, motivated by the fact that model performance can be improved in certain scenarios by joint estimation across subdivisions. It leverages the idea that models for different subdivisions may share some similarities and can borrow strength from each other. The proposed approaches are applied to two real datasets in the domain of building energy. (1) SPR is used to predict energy consumption from building design and operational variables, outdoor environmental variables, and their interactions, based on the Department of Energy’s EnergyPlus data sets. SPR produces a high level of prediction accuracy and provides insights into the design, operation, and management of energy-efficient buildings. (2) TBSR is used to predict future temperature conditions, which could help decide whether to activate the Heating, Ventilation, and Air Conditioning (HVAC) systems in an energy-efficient manner.
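The idea of partitioning on an environmental variable and fitting a separate regression in each subdivision can be illustrated with a minimal sketch. This is not the SPR algorithm from the dissertation: it searches a single threshold on one environmental variable and fits ordinary (non-sparse) least squares on each side. The function name, the synthetic "temperature", "setpoint", and "energy" data, and the `min_leaf` parameter are all assumptions made for illustration.

```python
import numpy as np

def fit_partitioned_regression(env, X, y, min_leaf=10):
    """Pick one threshold on a single environmental variable, then fit an
    ordinary least-squares model on each side of the split.
    Returns (total squared error, threshold, [left coefs, right coefs])."""
    Xb = np.column_stack([np.ones(len(y)), X])  # prepend an intercept column
    best = None
    for t in np.unique(env)[1:]:                # candidate thresholds
        left = env < t
        right = ~left
        if left.sum() < min_leaf or right.sum() < min_leaf:
            continue
        sse, coefs = 0.0, []
        for mask in (left, right):
            beta, *_ = np.linalg.lstsq(Xb[mask], y[mask], rcond=None)
            resid = y[mask] - Xb[mask] @ beta
            sse += float(resid @ resid)
            coefs.append(beta)
        if best is None or sse < best[0]:
            best = (sse, float(t), coefs)
    return best

# Hypothetical data: the effect of a controllable setpoint on energy use
# flips sign when the (uncontrollable) temperature crosses 20.
rng = np.random.default_rng(0)
n = 200
temperature = rng.uniform(0.0, 40.0, size=n)
setpoint = rng.normal(size=(n, 1))
slope = np.where(temperature < 20.0, 2.0, -1.0)
energy = slope * setpoint[:, 0] + 0.01 * rng.normal(size=n)

sse, threshold, coefs = fit_partitioned_regression(temperature, setpoint, energy)
print("threshold:", threshold)
print("slope below threshold:", coefs[0][1])
print("slope above threshold:", coefs[1][1])
```

A recursive version would apply the same search inside each subdivision, and a sparse version would replace the least-squares fit with an L1-penalized one.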

Design and Mining of Health Information Systems for Process and Patient Care Improvement

In healthcare facilities, health information systems (HISs) are used to serve different purposes. The radiology department adopts multiple HISs in managing their operations and patient care. In general, the HISs that touch radiology fall into two categories: tracking HISs and archive HISs. Electronic Health Records (EHR) is a typical tracking HIS, which tracks the care each patient receives at multiple encounters and facilities. Archive HISs are typically specialized databases to store large-size data collected as part of the patient care. A typical example of an archive HIS is the Picture Archive and Communication System (PACS), which provides economical storage and convenient access to diagnostic images from multiple modalities. How to integrate such HISs and best utilize their data remains a challenging problem due to the disparity of HISs as well as high-dimensionality and heterogeneity of the data. My PhD dissertation research includes three inter-connected and integrated topics and focuses on designing integrated HISs and further developing statistical models and machine learning algorithms for process and patient care improvement.

Topic 1: Design of super-HIS and tracking of quality of care (QoC). My research developed an information technology that integrates multiple HISs in radiology, and proposed QoC metrics defined upon the data that measure various dimensions of care. The DDD assisted the clinical practices and enabled an effective intervention for reducing lengthy radiologist turnaround times for patients.

Topic 2: Monitoring and change detection of QoC data streams for process improvement. With the super-HIS in place, high-dimensional data streams of QoC metrics are generated. I developed a statistical model for monitoring high-dimensional data streams that integrates Singular Value Decomposition (SVD) and process control. The algorithm was applied to QoC metrics data and was additionally extended to monitoring traffic data in communication networks.
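One common way to combine SVD with process control is to project each new observation onto the leading singular directions of in-control data and track a Hotelling-style T² statistic against a control limit. The sketch below shows that generic scheme, not the dissertation's model. The function names, the synthetic 20-dimensional stream, the choice of k = 2 components, and the empirical 99th-percentile limit are all assumptions.

```python
import numpy as np

def fit_monitor(reference, k):
    """Learn the mean and the top-k right singular vectors of in-control data."""
    mean = reference.mean(axis=0)
    centered = reference - mean
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    var = s[:k] ** 2 / (len(reference) - 1)  # variance along each kept direction
    return mean, vt[:k], var

def t2_statistic(x, mean, components, var):
    """Hotelling-style T^2 of one observation in the reduced k-dimensional space."""
    scores = components @ (x - mean)
    return float(np.sum(scores ** 2 / var))

# Hypothetical in-control stream: 20 metrics driven by 2 latent factors.
rng = np.random.default_rng(1)
loadings = rng.normal(size=(2, 20))
reference = rng.normal(size=(500, 2)) @ loadings + 0.1 * rng.normal(size=(500, 20))

mean, components, var = fit_monitor(reference, k=2)
ref_t2 = np.array([t2_statistic(x, mean, components, var) for x in reference])
limit = np.percentile(ref_t2, 99)            # empirical control limit

# A new observation whose first latent factor has shifted by 10 units.
shifted = np.array([10.0, 0.0]) @ loadings
alarm = t2_statistic(shifted, mean, components, var) > limit
print("control limit:", limit, "alarm raised:", bool(alarm))
```

Monitoring in the reduced space keeps the number of tracked statistics small even when the number of raw metrics is large, which is the practical appeal of pairing a decomposition with a control chart.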

Topic 3: Deep transfer learning of archive HIS data for computer-aided diagnosis (CAD). The novelty of the CAD system is the development of a deep transfer learning algorithm that combines the ideas of transfer learning and multi-modality image integration under the deep learning framework. Our system achieved high accuracy in breast cancer diagnosis compared with conventional machine learning algorithms.

An exploration of statistical modelling methods on simulation data case study: biomechanical predator-prey simulations

Modern, advanced statistical tools from data mining and machine learning have become commonplace in molecular biology in large part because of the “big data” demands of various kinds of “-omics” (e.g., genomics, transcriptomics, metabolomics, etc.). However, in other fields of biology where empirical data sets are conventionally smaller, more traditional statistical methods of inference are still very effective and widely used. Nevertheless, with the decrease in cost of high-performance computing, these fields are starting to employ simulation models to generate insights into questions that have been elusive in the laboratory and field. Although these computational models allow for exquisite control over large numbers of parameters, they also generate data at a qualitatively different scale than most experts in these fields are accustomed to. Thus, more sophisticated methods from big-data statistics have an opportunity to better facilitate the often-forgotten area of bioinformatics that might be called “in-silicomics”.

As a case study, this thesis develops methods for the analysis of large amounts of data generated from a simulated ecosystem designed to understand how mammalian biomechanics interact with environmental complexity to modulate the outcomes of predator–prey interactions. These simulations investigate how other biomechanical parameters relating to the agility of animals in predator–prey pairs are better predictors of pursuit outcomes. Traditional modelling techniques such as forward, backward, and stepwise variable selection are initially used to study these data, but the number of parameters and potentially relevant interaction effects render these methods impractical. Consequently, new modelling techniques such as LASSO regularization are used and compared to the traditional techniques in terms of accuracy and computational complexity. Finally, the splitting rules and instances in the leaves of classification trees provide the basis for future simulation with an economical number of additional runs. In general, this thesis shows the increased utility of these sophisticated statistical techniques with simulated ecological data compared to the approaches traditionally used in these fields. These techniques combined with methods from industrial Design of Experiments will help ecologists extract novel insights from simulations that combine habitat complexity, population structure, and biomechanics.

Metabolic Remodeling of Membrane Glycerolipids in the Microalga Nannochloropsis Oceanica Under Nitrogen Deprivation

The lack of lipidome analytical tools has limited our ability to gain new knowledge about lipid metabolism in microalgae, especially for membrane glycerolipids. An electrospray ionization mass spectrometry-based lipidomics method was developed for Nannochloropsis oceanica IMET1, which resolved 41 membrane glycerolipid molecular species belonging to eight classes. Changes in membrane glycerolipids under nitrogen deprivation and high-light (HL) conditions were uncovered. The results showed that the amounts of the plastidial membrane lipids monogalactosyldiacylglycerol and phosphatidylglycerol, and of the extraplastidic lipids diacylglyceryl-O-4′-(N,N,N-trimethyl) homoserine and phosphatidylcholine, decreased drastically under HL and nitrogen deprivation stresses. Algal cells accumulated considerably more digalactosyldiacylglycerol and sulfoquinovosyldiacylglycerols under stress. The genes encoding enzymes responsible for biosynthesis, modification, and degradation of glycerolipids were identified by mining a time-course global RNA-seq data set. The analysis suggested that the reduction in lipid contents under nitrogen deprivation is not attributable to retarded biosynthesis processes, at least at the gene expression level, as most genes involved in their biosynthesis were unaffected by nitrogen supply, and several genes were even significantly up-regulated. Additionally, a conceptual eicosapentaenoic acid (EPA) biosynthesis network is proposed based on the lipidomic and transcriptomic data, which underlines the import of EPA from cytosolic glycerolipids to the plastid for synthesizing EPA-containing chloroplast membrane lipids.


MRI-Based Texture Analysis to Differentiate Sinonasal Squamous Cell Carcinoma from Inverted Papilloma

ABSTRACT BACKGROUND AND PURPOSE: Sinonasal inverted papilloma (IP) can harbor squamous cell carcinoma (SCC). Consequently, differentiating these tumors is important. The objective of this study was to determine if MRI-based texture analysis can differentiate SCC from IP and provide supplementary information to the radiologist. MATERIALS AND METHODS: Adult patients who had IP or SCC resected were eligible (coexistent IP and SCC were excluded). Inclusion required tumor size greater than 1.5 cm and a pre-operative MRI with axial T1, axial T2, and axial T1 post-contrast sequences. Five well-established texture analysis algorithms were applied to an ROI from the largest tumor cross-section. For a training dataset, machine-learning algorithms were used to identify the most accurate model, and performance was also evaluated in a validation dataset. Based on three separate blinded reviews of the ROI, isolated tumor, and entire images, two neuroradiologists predicted tumor type in consensus. RESULTS: The IP and SCC cohorts were matched for age and gender, while SCC tumor volume was larger (p=0.001). The best classification model achieved similar accuracies for training (17 SCC, 16 IP) and validation (7 SCC, 6 IP) datasets of 90.9% and 84.6% respectively (p=0.537). The machine-learning accuracy for the entire cohort (89.1%) was better than that of the neuroradiologist ROI review (56.5%, p=0.0004) but not significantly different from the neuroradiologist review of the tumors (73.9%, p=0.060) or entire images (87.0%, p=0.748). CONCLUSION: MRI-based texture analysis has potential to differentiate SCC from IP and may provide incremental information to the neuroradiologist, particularly for small or heterogeneous tumors.
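The reported percentages can be cross-checked against the cohort sizes with simple arithmetic. The abstract gives only the percentages and the group sizes. The counts of correctly classified cases below are inferred here as the only whole numbers that reproduce each percentage, so they are assumptions and not figures stated in the source.

```python
def accuracy_pct(correct, total):
    """Accuracy as a percentage rounded to one decimal place."""
    return round(100 * correct / total, 1)

train_n = 17 + 16             # SCC + IP in the training set
valid_n = 7 + 6               # SCC + IP in the validation set
cohort_n = train_n + valid_n  # entire cohort

# Inferred (not reported) numbers of correctly classified cases.
inferred_correct = {
    "model, training": (30, train_n),
    "model, validation": (11, valid_n),
    "model, entire cohort": (30 + 11, cohort_n),
    "neuroradiologists, ROI": (26, cohort_n),
    "neuroradiologists, tumor": (34, cohort_n),
    "neuroradiologists, entire image": (40, cohort_n),
}

recomputed = {name: accuracy_pct(c, n) for name, (c, n) in inferred_correct.items()}
for name, pct in recomputed.items():
    print(f"{name}: {pct}%")
```

Under these inferred counts, the training and validation results sum to the whole-cohort figure, which is consistent with the 89.1% reported for the entire cohort.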

Cost Driven Agent Based Simulation of the Department of Defense Acquisition System

The Department of Defense (DoD) acquisition system is a complex system riddled with cost and schedule overruns. These overruns are very serious issues, as the acquisition system is responsible for aiding U.S. warfighters. Hence, a failing acquisition process could be a threat to national security. Furthermore, the DoD acquisition system is responsible for the proper allocation of billions of taxpayer dollars and employs many civilian and military personnel. Much research has been done on the acquisition system in the past, with little impact or success. One reason for this lack of success in improving the system is the lack of accurate models with which to test theories. This research continues the effort on the Enterprise Requirements and Acquisition Model (ERAM), a discrete event simulation model of the DoD acquisition system. We propose to extend ERAM using agent-based simulation principles because of the many interactions among the subsystems of the acquisition system. We initially identify ten sub models needed to simulate the acquisition system. This research focuses on three sub models related to the budget of acquisition programs. In this thesis, we present the data collection, data analysis, initial implementation, and initial validation needed to facilitate these sub models and lay the groundwork for a full agent-based simulation of the DoD acquisition system.

Machine Learning Methods for Diagnosis, Prognosis and Prediction of Long-term Treatment Outcome of Major Depression

Major Depression, clinically called Major Depressive Disorder, is a mood disorder that affects about one eighth of the population in the US and is projected to be the second leading cause of disability in the world by the year 2020. Recent advances in biotechnology have enabled us to collect a great variety of data, which could offer a deeper understanding of the disorder and advance personalized medicine.

This dissertation focuses on developing methods for three different aspects of predictive analytics related to the disorder: automatic diagnosis, prognosis, and prediction of long-term treatment outcome. The data used for each task have their specific characteristics and demonstrate unique problems. Automatic diagnosis of melancholic depression is made on the basis of metabolic profiles and micro-array gene expression profiles where the presence of missing values and strong empirical correlation between the variables is not unusual. To deal with these problems, a method of generating a representative set of features is proposed. Prognosis is made on data collected from rating scales and questionnaires which consist mainly of categorical and ordinal variables and thus favor decision tree based predictive models. Decision tree models are known for the notorious problem of overfitting. A decision tree pruning method that overcomes the shortcomings of a greedy nature and reliance on heuristics inherent in traditional decision tree pruning approaches is proposed. The method is further extended to prune Gradient Boosting Decision Tree and tested on the task of prognosis of treatment outcome. Follow-up studies evaluating the long-term effect of the treatments on patients usually measure patients' depressive symptom severity monthly, resulting in the actual time of relapse upper bounded by the observed time of relapse. To resolve such uncertainty in response, a general loss function where the hypothesis could take different forms is proposed to predict the risk of relapse in situations where only an interval for time of relapse can be derived from the observed data.
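When relapse is only checked at monthly visits, the data give an interval (last visit without relapse, first visit with relapse) and not an exact time. A standard textbook way to learn from such data is an interval-censored likelihood. The sketch below uses it with a constant-hazard (exponential) model as a stand-in. It is not the general loss function proposed in the dissertation, and the function names, the grid search, and the example intervals are hypothetical.

```python
import math

def interval_nll(rate, intervals):
    """Mean negative log-likelihood of relapse intervals (left, right] under an
    exponential model, where P(left < T <= right) = exp(-rate*left) - exp(-rate*right)."""
    total = 0.0
    for left, right in intervals:
        prob = math.exp(-rate * left) - math.exp(-rate * right)
        total -= math.log(prob)
    return total / len(intervals)

def fit_rate(intervals):
    """Grid search over relapse rates (per month) from 0.01 to 2.00."""
    candidates = [i / 100 for i in range(1, 201)]
    return min(candidates, key=lambda r: interval_nll(r, intervals))

# Hypothetical follow-up data: relapse first observed at the month-1, month-3,
# month-3, and month-6 visits, so the true time lies in the preceding month.
intervals = [(0, 1), (2, 3), (2, 3), (5, 6)]
rate = fit_rate(intervals)
print("estimated monthly relapse rate:", rate)
```

For a single interval (2, 3] the minimizer has a closed form, rate = ln(3/2), which gives a quick sanity check on the grid search.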

Scaling Up Large-scale Sparse Learning and Its Application to Medical Imaging

Large-scale $\ell_1$-regularized loss minimization problems arise in high-dimensional applications such as compressed sensing and high-dimensional supervised learning, including classification and regression problems. In many applications, it remains challenging to apply the sparse learning model to large-scale problems that have massive data samples with high-dimensional features. One popular and promising strategy is to scale up the optimization problem in parallel. Parallel solvers run on multiple cores of a shared-memory system or in a distributed environment to speed up the computation, but their practical use is limited by the huge dimension of the feature space and by synchronization problems.
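For readers unfamiliar with the problem class, the squared-loss case of $\ell_1$-regularized minimization (the Lasso) minimizes $\frac{1}{2n}\|y - Xw\|_2^2 + \lambda\|w\|_1$. The sketch below solves it with plain sequential coordinate descent and soft-thresholding. It is a single-threaded baseline for illustration, not the asynchronous or distributed solvers described in the dissertation, and the function names and synthetic data are assumptions.

```python
import numpy as np

def soft_threshold(v, lam):
    """Shrink v toward zero by lam; values within [-lam, lam] become zero."""
    return np.sign(v) * max(abs(v) - lam, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    """Cyclic coordinate descent for (1/2n)||y - Xw||^2 + lam * ||w||_1."""
    n, d = X.shape
    w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0) / n       # (1/n) * ||x_j||^2 for each feature
    residual = y.astype(float)              # equals y - Xw because w starts at 0
    for _ in range(n_sweeps):
        for j in range(d):
            residual += X[:, j] * w[j]      # remove feature j's contribution
            rho = X[:, j] @ residual / n    # correlation with partial residual
            w[j] = soft_threshold(rho, lam) / col_sq[j]
            residual -= X[:, j] * w[j]      # add the updated contribution back
    return w

# Hypothetical data: only the first two of ten features matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_w = np.zeros(10)
true_w[:2] = [3.0, -2.0]
y = X @ true_w + 0.1 * rng.normal(size=200)

w = lasso_cd(X, y, lam=0.1)
print(np.round(w, 3))
```

Each coordinate update reads and writes the shared residual, which is one source of the synchronization issues that arise when such updates are run in parallel.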

In this dissertation, I pursue this direction with a particular focus on scaling up the optimization of sparse learning for supervised and unsupervised learning problems. For supervised learning, I first propose an asynchronous parallel solver to optimize the large-scale sparse learning model in a multithreading environment. I then propose a distributed framework to conduct the learning process when the dataset is stored across different machines. The proposed model is further extended to the study of genetic risk factors for Alzheimer's Disease (AD) across different research institutions, integrating a group feature selection framework to rank the top risk SNPs for AD. For the unsupervised learning problem, I propose a highly efficient solver, termed Stochastic Coordinate Coding (SCC), that scales up the optimization of dictionary learning and sparse coding problems. A common issue in medical imaging research is that longitudinal features of patients at different time points are best studied together. To further improve the dictionary learning model, I propose a multi-task dictionary learning method that learns the different tasks simultaneously and uses shared and individual dictionaries to encode both consistent and changing imaging features.

Multi-Parametric MRI and Texture Analysis to Visualize Spatial Histologic Heterogeneity and Tumor Extent in Glioblastoma

Background: Genetic profiling represents the future of neuro-oncology but suffers from inadequate biopsies in heterogeneous tumors like Glioblastoma (GBM). Contrast-enhanced MRI (CE-MRI) targets enhancing core (ENH) but yields adequate tumor in only ~60% of cases. Further, CE-MRI poorly localizes infiltrative tumor within surrounding non-enhancing parenchyma, or brain-around-tumor (BAT), despite the importance of characterizing this tumor segment, which universally recurs. In this study, we use multiple texture analysis and machine learning (ML) algorithms to analyze multi-parametric MRI, and produce new images indicating tumor-rich targets in GBM.

Methods: We recruited primary GBM patients undergoing image-guided biopsies and acquired pre-operative MRI: CE-MRI, Dynamic-Susceptibility-weighted-Contrast-enhanced-MRI, and Diffusion Tensor Imaging. Following image coregistration and region of interest placement at biopsy locations, we compared MRI metrics and regional texture with histologic diagnoses of high- vs low-tumor content (≥80% vs <80% tumor nuclei) for corresponding samples. In a training set, we used three texture analysis algorithms and three ML methods to identify MRI-texture features that optimized model accuracy to distinguish tumor content. We confirmed model accuracy in a separate validation set.

Results: We collected 82 biopsies from 18 GBMs throughout ENH and BAT. The MRI-based model achieved 85% cross-validated accuracy to diagnose high- vs low-tumor in the training set (60 biopsies, 11 patients). The model achieved 81.8% accuracy in the validation set (22 biopsies, 7 patients).

Conclusion: Multi-parametric MRI and texture analysis can help characterize and visualize GBM’s spatial histologic heterogeneity to identify regional tumor-rich biopsy targets.
