Pharmacogenomics of Selective Serotonin Reuptake Inhibitor Treatment for Major Depressive Disorder: a Genome Wide Association Study

Description
A genome-wide association study (GWAS) of treatment outcomes for citalopram and escitalopram, two frontline SSRI treatments for Major Depressive Disorder, was conducted with 529 subjects on an imputed dataset. While no variants reached genome-wide significance, several potentially interesting variants were identified that warrant further exploration. These findings have the potential to elucidate novel mechanisms underlying drug response for SSRIs. Future work will apply machine learning and deep learning methods to capture non-linear effects and will involve a biologist or geneticist to provide more specialized knowledge for interpreting the results.
Date Created
2024-05
Agent

Evaluating the Heterogeneity of Logistic Regression Models to Predict Coronary Artery Disease Status

Description
Coronary artery disease (CAD) is one of the most commonly diagnosed heart diseases globally, affecting about 5% of adults over the age of twenty [1]. Lifestyle changes can lower the risk of developing CAD and are especially important for individuals with high genetic risk [1]. In this study, we sought to predict the likelihood of developing CAD using genetic, demographic, and clinical variables. Leveraging genetic and clinical data from the UK Biobank on over 500,000 individuals, we separated 500 individuals who were genetically similar to a target individual from another 500 genetically dissimilar individuals, using the Mahalanobis distance to compute genetic similarity to the target. This process was repeated for 10 target individuals as a proof of concept. CAD-related variables, including age, relevant clinical factors, and polygenic risk score, were then used to train models predicting CAD status in the genetically similar and genetically dissimilar groups, in order to determine which group predicts the likelihood of CAD more accurately. To reduce heterogeneity between sexes and races, the analyses were restricted to British male Caucasians. The models using the more similar individuals demonstrated better predictive performance. The area under the receiver operating characteristic curve (AUC) was significantly higher for the 'similar' than for the 'dissimilar' groups, indicating better predictive capability (AUC=0.67 vs. 0.65, respectively; p-value<0.05). These findings support the potential of precision prevention strategies: predictive models of disease for a given target individual should be built from individuals more similar to that target, even within an otherwise homogeneous group (e.g., British Caucasians). Although intuitive, this is not routinely done. Further validation and exploration of additional predictors are warranted to enhance the predictive accuracy and applicability of the model.
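The similarity-based grouping described in this abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that each individual is summarized by two genetic coordinates (for example, two principal components); the function names, the two-dimensional restriction, and the toy data are hypothetical and are not taken from the study.

```python
from math import sqrt


def covariance_2d(points):
    """Sample covariance matrix of a list of (x, y) points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return ((sxx, sxy), (sxy, syy))


def mahalanobis_2d(point, target, cov):
    """Mahalanobis distance between two 2-D points under covariance `cov`."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Explicit inverse of the 2x2 covariance matrix.
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx, dy = point[0] - target[0], point[1] - target[1]
    quad = dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)
    return sqrt(quad)


def split_by_similarity(points, target, k):
    """Return indices of the k nearest and k farthest points from `target`."""
    cov = covariance_2d(points)
    ranked = sorted(range(len(points)),
                    key=lambda i: mahalanobis_2d(points[i], target, cov))
    return ranked[:k], ranked[-k:]


# Toy cohort (made-up coordinates); the target coincides with the first point.
cohort = [(0.0, 0.0), (1.0, 1.0), (2.0, 1.0), (10.0, 9.0), (11.0, 12.0)]
similar, dissimilar = split_by_similarity(cohort, target=(0.0, 0.0), k=2)
```

In the study's setting, k would be 500 and a separate model would then be trained on each of the two groups; the model training and the AUC comparison are outside the scope of this sketch.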
Date Created
2024-05
Agent

Learning RNA Viral Disease Dynamics from Molecular Sequence Data

Description
The severity of the health and economic devastation resulting from outbreaks of viruses such as Zika, Ebola, SARS-CoV-1 and, most recently, SARS-CoV-2 underscores the need for tools which aim to delineate critical disease dynamical features underlying observed patterns of infectious disease spread. The growing emphasis placed on genome sequencing to support pathogen outbreak response highlights the need to adapt traditional epidemiological metrics to leverage this increasingly rich data stream. Further, the rapidity with which pathogen molecular sequence data is now generated, coupled with the advent of sophisticated Bayesian statistical techniques for pathogen molecular sequence analysis, creates an unprecedented opportunity to disrupt and innovate public health surveillance using 21st century tools. Bayesian phylogeography is a modeling framework which assumes discrete traits -- such as age, location of sampling, or species -- evolve according to a continuous-time Markov chain process along a phylogenetic tree topology which is inferred from molecular sequence data.

While myriad studies exist which reconstruct patterns of discrete trait evolution along an inferred phylogeny, attempts to translate the results of phylogeographic analyses into actionable metrics that can be used by public health agencies to direct the development of interventions aimed at reducing pathogen spread are conspicuously absent from the literature. In this dissertation, I focus on developing an intuitive metric, the phylogenetic risk ratio (PRR), which I use to translate the results of Bayesian phylogeographic modeling studies into a form actionable by public health agencies. I apply the PRR to two case studies: i) age-associated diffusion of influenza A/H3N2 during the 2016-17 US epidemic and ii) host-associated diffusion of West Nile virus in the US. I discuss the limitations of this approach (and of Bayesian phylogeographic approaches generally) when studying non-geographic traits for which limited metadata is available in public molecular sequence databases, and statistically principled solutions to the missing metadata problem in the phylogenetic context. Then, I perform a simulation study to evaluate the statistical performance of the missing metadata solution. Finally, I provide a solution for researchers who are interested in using the PRR and phylogenetic UTMs in their own genomic epidemiological studies yet are deterred by the idiosyncratic, error-prone processes required to implement these methods using popular Bayesian phylogenetic inference software packages. My solution, Build-A-BEAST, is a publicly available, object-oriented system written in Python which aims to reduce the complexity and idiosyncrasy of creating the XML files necessary to perform the aforementioned analyses.
This dissertation extends the conceptual framework of Bayesian phylogeographic methods, develops a method to translate the output of phylogenetic models into an actionable form, evaluates the use of priors for missing metadata, and, finally, provides a solution which eases the implementation of these methods. In doing so, I lay the foundation for future work in disseminating and implementing Bayesian phylogeographic methods for routine public health surveillance.
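The continuous-time Markov chain assumption behind discrete trait models can be made concrete with a small sketch. Given a rate matrix Q, the probability that a trait in state i at the start of a branch is in state j after time t is the (i, j) entry of P(t) = exp(Qt). The code below is only an illustration of that relationship: the two-state trait, the rate values, and the function names are hypothetical, and it is not how Build-A-BEAST or any phylogenetic software computes this quantity.

```python
from math import exp


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_matrix(q, t, terms=30):
    """Approximate P(t) = exp(Q t) with a Taylor series plus scaling and squaring."""
    n = len(q)
    # Halve the time step until the scaled matrix is small, then square back.
    norm = max(sum(abs(x) for x in row) for row in q) * t
    squarings = 0
    while norm > 0.5:
        norm /= 2.0
        squarings += 1
    step = t / (2 ** squarings)
    a = [[x * step for x in row] for row in q]
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    p = [row[:] for row in identity]
    term = [row[:] for row in identity]
    for k in range(1, terms + 1):
        # term holds (Q * step)^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, a)]
        p = [[p[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        p = mat_mul(p, p)
    return p


# Hypothetical two-state trait with made-up rates (state 0 -> 1 and 1 -> 0).
RATE_01, RATE_10 = 0.3, 0.1
Q = [[-RATE_01, RATE_01], [RATE_10, -RATE_10]]
P = transition_matrix(Q, t=2.0)

# Closed form for a two-state chain, used here only as a sanity check.
total = RATE_01 + RATE_10
closed_form_p00 = RATE_10 / total + (RATE_01 / total) * exp(-total * 2.0)
```

Each row of P sums to one, and for the two-state case the numerical result can be compared against the closed-form expression shown at the end of the block.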
Date Created
2020
Agent

Mathematics of Dengue Transmission Dynamics and Assessment of Wolbachia-based Interventions

Description
Dengue is a mosquito-borne arboviral disease that causes significant public health burden in many tropical and sub-tropical parts of the world (where dengue is endemic). This dissertation is based on using mathematical modeling approaches, coupled with rigorous analysis and computation, to study the transmission dynamics and control of dengue disease. In Chapter 2, a new deterministic model was designed and used to assess the impact of local fluctuation of temperature and mosquito vertical (transovarial) transmission on the population abundance of dengue mosquitoes and disease in a population. The model, which takes the form of a deterministic system of nonlinear differential equations, was parametrized using data from the Chiang Mai province of Thailand. The disease-free equilibrium of the model was shown to be globally asymptotically stable when a certain epidemiological quantity is less than unity. Vertical transmission was shown to have only marginal impact on the disease dynamics, and its effect is temperature-dependent. Dengue burden in the province is maximized when the mean monthly temperature lies in the range 26-28 °C. A new deterministic model was designed in Chapter 3 to assess the impact of the release of Wolbachia-infected mosquitoes on curtailing the mosquito population and dengue disease in a population. The model, which stratifies the mosquito population in terms of sex and Wolbachia-infection status, was rigorously analysed to characterize the bifurcation property of the model as well as the asymptotic stability of the various disease-free equilibria.
Simulations, using Wolbachia-based mosquito control from Queensland, Australia, showed that the frequent release of mosquitoes infected with the bacterium can lead to the effective control of the local wild mosquito population, and that the level of control increases with the number of Wolbachia-infected mosquitoes released (up to 90% reduction in the wild mosquito population, from their baseline values, can be achieved). It was also shown that the well-known feature of cytoplasmic incompatibility has very little effect on the effectiveness of the Wolbachia-based mosquito control.
Date Created
2020
Agent

Effects of LCMV Infection on Murine Fetal Development in Immunized Mothers

Description
Despite a continuously growing body of evidence that they are one of the major causes of pregnancy loss, preterm birth, pregnancy complications, and developmental abnormalities leading to high rates of morbidity and mortality, viruses are often overlooked and underestimated as teratogens. The Zika virus epidemic beginning in Brazil in 2015 brought teratogenic viruses into the spotlight for the public health community and popular media, and its infamy may bring about positive motivation and funding for novel treatments and vaccination strategies against it and a variety of other viruses that can lead to severe congenital disease. Lymphocytic choriomeningitis virus (LCMV) is famous in the biomedical community for its historic and continued utility in mouse models of the human immune system, but it is rarely a source of clinical concern in terms of its teratogenic risk to humans, despite its ability to cause consistently severe ocular and neurological abnormalities in cases of congenital infection. Developing a safe and effective LCMV vaccine remains difficult, as the robust immune response typical of LCMV can be either efficiently protective or lethally pathological based on relatively small changes in the host type, viral strain, viral dose, method of infection/immunization, or molecular characteristics of synthetic vaccination. Introducing the immunologically unique state of pregnancy and fetal development adds further complexity. This thesis consists of a literature review of teratogenic viruses as a whole, of LCMV and its complications during pregnancy, of LCMV immunopathology, and of current understanding of vaccination against LCMV and against other teratogenic viruses, as well as a hypothetical experimental design intended to begin bridging the gap between LCMV vaccinology and LCMV teratogenicity by bringing a vaccine study of LCMV into the context of viral challenge during pregnancy.
Date Created
2020-05
Agent

The Elucidation of Potential New Factors that Influence and Impact Type 2 Diabetes Mellitus Prevalence in Pima Indian populations

Description
Introduction: Diabetes Mellitus (DM) is a significant health problem in the United States, with over 20 million adults diagnosed with the condition. Type 2 Diabetes Mellitus, characterized by insulin resistance, has been associated with various adverse conditions such as chronic kidney disease and peripheral artery disease. The presence of Type 2 Diabetes in an individual is also associated with risk factors such as genetic markers and ethnicity. Native Americans are particularly susceptible, being over twice as likely to present with Type 2 DM as non-Hispanic whites. Of particular concern is the Pima Indian population in Arizona, which has the highest prevalence of Type 2 DM in the world. Many risk factors have been associated with this population, such as genetic markers and lifestyle changes, but there has been little research using raw data to find the most pertinent factors for diabetes incidence.

Objective: There were three main objectives of the study. One objective was to elucidate potential new relationships via linear regression. Another objective was to determine which factors were indicative of Type 2 DM in the population. Finally, the last objective was to compare the incidence of Type 2 DM in the dataset to trends seen elsewhere.

Methods: The dataset was loaded into Python from an open-source site (with citation). The dataset, created in 1990, comprised 768 female patients and 9 attributes (Number of Pregnancies, Plasma Glucose Levels, Systolic Blood Pressure, Triceps Skin Thickness, Insulin Levels, BMI, Diabetes Pedigree Function, Age and Diabetes Presence (0 or 1)). The dataset was then cleaned using mean or median imputation. After cleaning, linear regression was used to assess the relationships between certain factors in the population, excluding the Diabetes Pedigree Function and Diabetes Presence, with significance assessed via the p-value. Reverse stepwise logistic regression was used to determine the most pertinent factors for Type 2 DM, based on the Akaike Information Criterion and on statistical significance in the model. Finally, data from the Centers for Disease Control and Prevention (CDC) Diabetes Surveillance was assessed for relationships between Female DM Percentage in Pinal County and Obesity or Physical Inactivity via simple logistic regression for statistical significance.

Results: The majority of the relationships examined were statistically significant. The most pertinent factors for Type 2 DM in the dataset were the number of pregnancies, plasma glucose levels, and blood pressure. In the USDS data from the CDC, the relationships between Female DM Percentage and the obesity and inactivity percentages were statistically significant.

Conclusion: The trends found in the study matched the trends found in the literature. Per the results, recommendations for better diabetes control include more medical education as well as better blood sugar monitoring. Further analysis could examine other factors, such as genetic factors, and include epidemiological analysis. In conclusion, the study accomplished its main objectives.
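The backward (reverse) stepwise selection by Akaike Information Criterion mentioned in the Methods can be sketched as follows. This is a generic illustration, not the study's code: the model-fitting step is abstracted behind a hypothetical `fit_log_likelihood` callable that returns the maximized log-likelihood for a given feature set, and the toy log-likelihood values are made up.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


def backward_stepwise(features, fit_log_likelihood):
    """Drop one feature at a time while doing so lowers the AIC.

    `fit_log_likelihood(features)` is assumed to fit a logistic regression
    with an intercept plus the given features and return its log-likelihood.
    """
    current = list(features)
    best = aic(fit_log_likelihood(current), len(current) + 1)
    while current:
        candidates = []
        for f in current:
            reduced = [g for g in current if g != f]
            score = aic(fit_log_likelihood(reduced), len(reduced) + 1)
            candidates.append((score, f, reduced))
        score, _dropped, reduced = min(candidates, key=lambda c: c[0])
        if score >= best:
            break  # no single removal improves the AIC
        best, current = score, reduced
    return current, best


# Toy stand-in for model fitting: each feature adds a fixed (made-up)
# amount to the log-likelihood of an intercept-only model.
GAIN = {"glucose": 40.0, "bmi": 10.0, "skin_thickness": 0.2}


def toy_log_likelihood(features):
    return -100.0 + sum(GAIN[f] for f in features)


selected, selected_aic = backward_stepwise(list(GAIN), toy_log_likelihood)
```

In this toy run, `skin_thickness` is dropped because its small gain in log-likelihood does not offset the penalty for an extra parameter, while removing either remaining feature would raise the AIC. In practice the callable would wrap an actual logistic regression fit.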
Date Created
2020-05
Agent

Biomedical Information Extraction Pipelines for Public Health in the Age of Deep Learning

Description
Unstructured texts containing biomedical information from sources such as electronic health records, scientific literature, discussion forums, and social media offer an opportunity to extract information for a wide range of applications in biomedical informatics. Building scalable and efficient pipelines for natural language processing and extraction of biomedical information plays an important role in the implementation and adoption of applications in areas such as public health. Advancements in machine learning and deep learning techniques have enabled rapid development of such pipelines. This dissertation presents entity extraction pipelines for two public health applications: virus phylogeography and pharmacovigilance. For virus phylogeography, geographical locations are extracted from biomedical scientific texts for metadata enrichment in the GenBank database containing 2.9 million virus nucleotide sequences. For pharmacovigilance, tools are developed to extract adverse drug reactions from social media posts to open avenues for post-market drug surveillance from non-traditional sources. Across these pipelines, high variance is observed in extraction performance among the entities of interest while using state-of-the-art neural network architectures. To explain the variation, linguistic measures are proposed to serve as indicators for entity extraction performance and to provide deeper insight into the domain complexity and the challenges associated with entity extraction. For both the phylogeography and pharmacovigilance pipelines presented in this work, the annotated datasets and applications are open source and freely available to the public to foster further research in public health.
Date Created
2019
Agent

Knowledge-driven methods for geographic information extraction in the biomedical domain

Description
Accounting for over a third of all emerging and re-emerging infections, viruses represent a major public health threat, which researchers and epidemiologists across the world have been attempting to contain for decades. Recently, genomics-based surveillance of viruses through methods such as virus phylogeography has grown into a popular tool for infectious disease monitoring. When conducting such surveillance studies, researchers need to manually retrieve geographic metadata denoting the location of infected host (LOIH) of viruses from public sequence databases such as GenBank and any publication related to their study. The large volume of semi-structured and unstructured information that must be reviewed for this task, along with the ambiguity of geographic locations, makes it especially challenging. Prior work has demonstrated that the majority of GenBank records lack sufficient geographic granularity concerning the LOIH of viruses. As a result, reviewing full-text publications is often necessary for conducting in-depth analysis of virus migration, which can be a very time-consuming process. Moreover, integrating geographic metadata pertaining to the LOIH of viruses from different sources, including different fields in GenBank records as well as full-text publications, and normalizing the integrated metadata to unique identifiers for subsequent analysis, are also challenging tasks, often requiring expert domain knowledge. Therefore, automated information extraction (IE) methods could help significantly accelerate this process, positively impacting public health research. However, very few research studies have attempted the use of IE methods in this domain.

This work explores the use of novel knowledge-driven geographic IE heuristics for extracting, integrating, and normalizing the LOIH of viruses based on information available in GenBank and related publications; when evaluated on manually annotated test sets, the methods were found to have a high accuracy and shown to be adequate for addressing this challenging problem. It also presents GeoBoost, a pioneering software system for georeferencing GenBank records, as well as a large-scale database containing over two million virus GenBank records georeferenced using the algorithms introduced here. The methods, database and software developed here could help support diverse public health domains focusing on sequence-informed virus surveillance, thereby enhancing existing platforms for controlling and containing disease outbreaks.
Date Created
2019
Agent

The Relationship between Wastewater Toxic Substances and Alzheimer’s disease

Description
Alzheimer’s disease (AD) is a neurodegenerative disease resulting in loss of cognitive function and is not considered part of the typical aging process. Because the exact molecular mechanisms behind AD are not known, recent research has begun to study environmental effects on AD. The reported associations between various toxins and AD have been mixed and unclear. In order to better understand the role of the environment and toxic substances in AD, we conducted a literature review and geospatial analysis of environmental, specifically wastewater, contaminants that have biological plausibility for increasing the risk of development or exacerbation of AD. This literature review assisted us in selecting 10 wastewater toxic substances that displayed a mixed or one-sided relationship with the symptoms or prevalence of Alzheimer’s for our data analysis. We compared data on toxic substances in wastewater treatment plants with the crude rate of AD in the different Census regions of the United States to test for possible linear relationships. Using data from the Targeted National Sewage Sludge Survey (TNSSS) and the Centers for Disease Control and Prevention (CDC), we developed an application using R Shiny to allow users to interactively visualize both datasets as choropleths of the United States and understand the importance of this area of research. Pearson’s correlation coefficient was calculated; arsenic and cadmium displayed positive linear correlations with AD, while other analytes demonstrated mixed correlations with AD. This application and data analysis serve as a methodological model for further geospatial analysis on AD. Further data analysis and visualization at a finer geographic scale are necessary for more accurate and reliable evidence of a causal relationship between the wastewater substance analytes and AD.
GitHub Repository: https://github.com/komal-agrawal/AD_GIS.git
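As an illustration of the correlation step described above, Pearson's correlation coefficient between a regional analyte concentration and a regional crude AD rate can be computed as in the following sketch. The original analysis was carried out in R; this Python version, the function name, and the four regional values are illustrative assumptions only and do not reproduce the study's data or results.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


# Made-up values, one per Census region (not from TNSSS or the CDC).
analyte_concentration = [1.0, 2.0, 3.0, 4.0]
ad_crude_rate = [10.0, 12.0, 15.0, 19.0]
r = pearson_r(analyte_concentration, ad_crude_rate)
```

With only a handful of regions, a large coefficient is weak evidence on its own, which is consistent with the abstract's call for analysis at a finer scale.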
Date Created
2019-05
Agent