Comparison of ARIMA and Random Forest Time Series Models for Prediction of Avian Influenza H5N1 Outbreaks

Description

Background: Time series models can play an important role in disease prediction. Incidence data can be used to predict the future occurrence of disease events. Developments in modeling approaches provide an opportunity to compare different time series models for predictive power.

Results: We applied ARIMA and Random Forest time series models to incidence data of outbreaks of highly pathogenic avian influenza (H5N1) in Egypt, available through the online EMPRES-I system. We found that the Random Forest model outperformed the ARIMA model in predictive ability. Furthermore, we found that the Random Forest model is effective for predicting outbreaks of H5N1 in Egypt.
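
The exact EMPRES-I series and model settings are not reproduced in this abstract; the sketch below, with illustrative data and parameters, shows one common way to set up such a comparison, using statsmodels for the ARIMA fit and a scikit-learn random forest over lagged counts as a one-step-ahead forecaster.

```python
# Sketch: comparing ARIMA and Random Forest forecasts on a monthly outbreak-count
# series. The series, lag order, and train/test split are illustrative assumptions,
# not the settings used in the study.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

counts = pd.Series(np.random.poisson(5, 120))   # placeholder for EMPRES-I monthly counts
train, test = counts[:96], counts[96:]

# ARIMA: fit on the training window, forecast the held-out window.
arima = ARIMA(train, order=(2, 1, 1)).fit()
arima_pred = arima.forecast(steps=len(test))

# Random Forest: regress each value on its previous n_lags values (one-step-ahead).
n_lags = 6
X = np.column_stack([counts.shift(i) for i in range(1, n_lags + 1)])[n_lags:]
y = counts.values[n_lags:]
split = len(train) - n_lags
rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X[:split], y[:split])
rf_pred = rf.predict(X[split:])

print("ARIMA MAE:", mean_absolute_error(test, arima_pred))
print("RF    MAE:", mean_absolute_error(y[split:], rf_pred))
```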

Conclusions: Random Forest time series modeling provides enhanced predictive ability over existing time series models for the prediction of infectious disease outbreaks. This result, along with those showing the concordance between bird and human outbreaks (Rabinowitz et al. 2012), provides a new approach to predicting these dangerous outbreaks in bird populations based on existing, freely available data. Our analysis uncovers the time-series structure of outbreak severity for highly pathogenic avian influenza (H5N1) in Egypt.

Date Created
2014-08-13
Agent

A new image quantitative method for diagnosis and therapeutic response

Description
Accurate quantitative information about tumor/lesion volume plays a critical role in diagnosis and treatment assessment. Current clinical practice emphasizes efficiency but sacrifices accuracy (bias and precision). On the other hand, many computational algorithms focus on improving accuracy but are often time consuming and cumbersome to use, and most of them lack validation studies on real clinical data. All of this hinders the translation of these advanced methods from bench to bedside.

In this dissertation, I present a user-interactive image application to rapidly extract accurate quantitative information about abnormalities (tumors/lesions) from multi-spectral medical images, such as measuring brain tumor volume from MRI. This is enabled by a GPU level set method, an intelligent algorithm that learns image features from user inputs, and a simple and intuitive graphical user interface with 2D/3D visualization. In addition, a comprehensive workflow is presented for validating image quantitative methods in clinical studies.
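
The GPU implementation and the user-guided feature learning are specific to this application and are not reproduced here; as a rough, CPU-only illustration of the underlying level-set idea, the sketch below evolves a region-based (Chan-Vese-style) contour on a 2D slice with NumPy. All parameters are illustrative.

```python
# Rough CPU sketch of a region-based level-set evolution on a 2D image slice; the
# dissertation's method runs on the GPU and adds learned, user-guided features,
# which are omitted here.
import numpy as np

def level_set_segment(image, n_iter=200, dt=0.2, mu=1.0):
    image = image.astype(float)
    image = (image - image.min()) / (np.ptp(image) + 1e-8)   # normalize intensities
    h, w = image.shape
    yy, xx = np.mgrid[:h, :w]
    # Initialize phi as the signed distance to a circle in the image center.
    phi = np.sqrt((yy - h / 2.0) ** 2 + (xx - w / 2.0) ** 2) - min(h, w) / 4.0
    for _ in range(n_iter):
        inside, outside = phi < 0, phi >= 0
        c_in = image[inside].mean() if inside.any() else 0.0
        c_out = image[outside].mean() if outside.any() else 0.0
        # Region term: pixels closer to the inside mean pull phi down (grow the region),
        # pixels closer to the outside mean push it up; the Laplacian is a crude smoother.
        region = (image - c_in) ** 2 - (image - c_out) ** 2
        lap = (np.roll(phi, 1, 0) + np.roll(phi, -1, 0) +
               np.roll(phi, 1, 1) + np.roll(phi, -1, 1) - 4 * phi)
        phi = phi + dt * (region + mu * lap)
    return phi < 0   # binary mask of the segmented region
```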

This application has been evaluated and validated in multiple cases, including quantifying healthy brain white matter volume from MRI and brain lesion volume from CT or MRI. The evaluation studies show that this application achieves results comparable to state-of-the-art computer algorithms. More importantly, the retrospective validation study on measuring intracerebral hemorrhage volume from CT scans demonstrates not only that its measurement attributes are superior to the current practice method in terms of bias and precision, but also that this is achieved without a significant delay in acquisition time. In other words, it could be useful in clinical trials and clinical practice, especially when intervention and prognostication rely upon an accurate baseline lesion volume or upon detecting change in serial lesion volumetric measurements. This application is also useful to biomedical research areas that require accurate quantitative information about anatomies from medical images. In addition, the morphological information is retained, which is useful for research that requires an accurate delineation of anatomic structures, such as surgery simulation and planning.
Date Created
2016
Agent

Health information extraction from social media

Description
Social media is becoming increasingly popular as a platform for sharing personal health-related information. This information can be utilized for public health monitoring tasks such as pharmacovigilance via the use of Natural Language Processing (NLP) techniques. One of the critical steps in information extraction pipelines is Named Entity Recognition (NER), where mentions of entities such as diseases are located in text and their entity types are identified. However, the language in social media is highly informal, and user-expressed health-related concepts are often non-technical, descriptive, and challenging to extract. There has been limited progress in addressing these challenges, and advanced machine learning-based NLP techniques have been underutilized. This work explores the effectiveness of different machine learning techniques, particularly deep learning, in addressing the challenges associated with extraction of health-related concepts from social media. Deep learning has recently attracted a lot of attention in machine learning research and has shown remarkable success in several applications, particularly image and speech recognition. However, thus far, deep learning techniques have been relatively unexplored for biomedical text mining and, in particular, this is the first attempt at applying deep learning to health information extraction from social media.

This work presents ADRMine, which uses a Conditional Random Field (CRF) sequence tagger for the extraction of complex health-related concepts. It utilizes a large volume of unlabeled user posts for automatic learning of embedding cluster features, a novel application of deep learning to modeling the similarity between tokens. ADRMine significantly improved medical NER performance compared to the baseline systems.
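
ADRMine's exact feature set and training data are not included in this abstract; the sketch below only illustrates the general pattern of token-level CRF features that include an embedding-derived cluster ID, assuming the sklearn-crfsuite package and a precomputed token-to-cluster mapping (e.g., from word2vec vectors clustered with k-means).

```python
# Sketch of a CRF tagger with embedding-cluster features, in the spirit of ADRMine.
# `embedding_cluster` is assumed to map tokens to cluster IDs obtained by clustering
# word vectors learned from unlabeled posts; it is a placeholder, not ADRMine's data.
import sklearn_crfsuite

def token_features(tokens, i, embedding_cluster):
    tok = tokens[i]
    feats = {
        "lower": tok.lower(),
        "is_title": tok.istitle(),
        "has_digit": any(c.isdigit() for c in tok),
        "suffix3": tok[-3:].lower(),
        "emb_cluster": embedding_cluster.get(tok.lower(), "UNK"),
    }
    if i > 0:
        feats["prev_lower"] = tokens[i - 1].lower()
    if i < len(tokens) - 1:
        feats["next_lower"] = tokens[i + 1].lower()
    return feats

def sentence_features(tokens, embedding_cluster):
    return [token_features(tokens, i, embedding_cluster) for i in range(len(tokens))]

# X: list of sentences (lists of feature dicts); y: list of BIO label sequences.
# crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
# crf.fit(X, y); crf.predict(X_new)
```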

This work also presents DeepHealthMiner, a deep learning pipeline for health-related concept extraction. Most machine learning methods require sophisticated, task-specific manual feature design, which is a challenging step in processing the informal and noisy content of social media. DeepHealthMiner automatically learns classification features using neural networks and a large volume of unlabeled user posts. Using a relatively small labeled training set, DeepHealthMiner accurately identified most of the concepts, including consumer expressions that were not observed in the training data or in standard medical lexicons, outperforming the state-of-the-art baseline techniques.
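
The DeepHealthMiner architecture is likewise not spelled out in this abstract; the sketch below illustrates the general idea of classifying each token from a window of pre-trained embeddings, using scikit-learn's MLPClassifier as a stand-in for the neural pipeline. The embedding lookup and labels are assumed inputs.

```python
# Sketch: token classification from a window of pre-trained word embeddings, standing
# in for DeepHealthMiner's neural pipeline. `embeddings` (token -> vector) is assumed
# to be learned beforehand from unlabeled user posts.
import numpy as np
from sklearn.neural_network import MLPClassifier

def window_vector(tokens, i, embeddings, dim, window=2):
    parts = []
    for j in range(i - window, i + window + 1):
        if 0 <= j < len(tokens):
            parts.append(embeddings.get(tokens[j].lower(), np.zeros(dim)))
        else:
            parts.append(np.zeros(dim))            # padding outside the sentence
    return np.concatenate(parts)

# X = np.array([window_vector(s, i, embeddings, dim) for s in sents for i in range(len(s))])
# y = flattened BIO labels aligned with X
# clf = MLPClassifier(hidden_layer_sizes=(200, 100), max_iter=50).fit(X, y)
```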
Date Created
2016
Agent

Context-aware adaptive hybrid semantic relatedness in biomedical science

Description
Text mining of biomedical literature and clinical notes is a very active field of research in biomedical science. Semantic analysis is one of the core modules of many Natural Language Processing (NLP) solutions. Methods for calculating the semantic relatedness of two concepts can be very useful in solutions to different problems such as relationship extraction, ontology creation, and question answering [1–6]. Several techniques exist for calculating the semantic relatedness of two concepts, utilizing different knowledge sources and corpora. So far, researchers have attempted to find the best hybrid method for each domain by manually combining semantic relatedness techniques and data sources. In this work, attempts were made to eliminate the need to manually combine semantic relatedness methods for any new context or resource by proposing an automated method that finds the combination of semantic relatedness techniques and resources achieving the best semantic relatedness score in every context. This may help the research community find the best hybrid method for each context given the available algorithms and resources.
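
The selection procedure is not detailed in this abstract; one simple way to realize the idea of automatically finding the best combination for a given context is to score candidate measures against a domain reference set and learn combination weights, as sketched below with placeholder measure functions and reference data.

```python
# Sketch: choosing and weighting semantic-relatedness measures for a new context by
# fitting their scores to a small gold-standard set of concept pairs. The measure
# functions and reference data are placeholders, not the dissertation's resources.
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import LinearRegression

def best_hybrid(measures, gold_pairs, gold_scores):
    """measures: dict name -> f(concept_a, concept_b) returning a relatedness score."""
    score_matrix = np.array([[m(a, b) for (a, b) in gold_pairs]
                             for m in measures.values()]).T
    # Rank individual measures by correlation with the human reference scores.
    ranking = {name: spearmanr(score_matrix[:, i], gold_scores)[0]
               for i, name in enumerate(measures)}
    # Learn a weighted combination of all measures for this context.
    combiner = LinearRegression().fit(score_matrix, gold_scores)
    return ranking, combiner
```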
Date Created
2016
Agent

Integrative analysis of genomic aberrations in cancer and xenograft Models

Description
No two cancers are alike. Cancer is a dynamic and heterogeneous disease; such heterogeneity arises among patients with the same cancer type, among cancer cells within the same individual's tumor, and even among cells within the same sub-clone over time. The recent application of next-generation sequencing and precision medicine techniques is the driving force behind uncovering the complexity of cancer and improving clinical practice. The core concept of precision medicine is to move away from crowd-based, best-for-most treatment and take individual variability into account when optimizing prevention and treatment strategies. Next-generation sequencing is the method used to sift through the entire 3 billion letters of each patient's DNA genetic code in a massively parallel fashion.

The deluge of next-generation sequencing data has shifted the bottleneck of cancer research from multiple "-omics" data collection to integrative analysis and data interpretation. In this dissertation, I attempt to address two distinct, but interdependent, challenges. The first is to design computational algorithms and tools that can process and extract useful information from the raw data in an efficient, robust, and reproducible manner. The second is to develop high-level computational methods and data frameworks for integrating and interpreting these data. Specifically, Chapter 2 presents a tool called Snipea (SNv Integration, Prioritization, Ensemble, and Annotation) that further identifies, prioritizes and annotates somatic SNVs (single nucleotide variants) called from multiple variant callers. Chapter 3 describes a novel alignment-based algorithm to accurately and losslessly classify sequencing reads from xenograft models. Chapter 4 describes a direct and biologically motivated framework and associated methods for identifying putative aberrations causing survival differences in GBM patients by integrating whole-genome sequencing, exome sequencing, RNA sequencing, methylation array and clinical data. Lastly, Chapter 5 explores longitudinal and intratumor heterogeneity studies to reveal the temporal and spatial context of tumor evolution. The long-term goal is to help patients with cancer, particularly those who are in front of us today. Genome-based analysis of a patient's tumor can identify genomic alterations unique to that tumor that are candidate therapeutic targets to decrease therapy resistance and improve clinical outcome.
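
Chapter 3's algorithm is not reproduced in this abstract; the sketch below only shows the basic alignment-score comparison that underlies read classification in xenograft samples, assuming reads have been aligned separately to the human (graft) and mouse (host) references and that the aligner wrote alignment scores to the standard AS tag (read here with pysam).

```python
# Sketch: assign each xenograft read to human or mouse by comparing alignment scores
# (AS tags) from separate alignments to the two references. Ambiguous reads are kept
# apart rather than discarded, so no read is lost.
import pysam

def read_scores(bam_path):
    scores = {}
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(until_eof=True):
            if not read.is_unmapped and read.has_tag("AS"):
                scores[read.query_name] = max(scores.get(read.query_name, float("-inf")),
                                              read.get_tag("AS"))
    return scores

def classify(human_bam, mouse_bam):
    human, mouse = read_scores(human_bam), read_scores(mouse_bam)
    labels = {}
    for name in set(human) | set(mouse):
        h, m = human.get(name, float("-inf")), mouse.get(name, float("-inf"))
        labels[name] = "human" if h > m else "mouse" if m > h else "ambiguous"
    return labels
```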
Date Created
2015
Agent

Ensuring high-quality colonoscopy by reducing polyp miss-rates

Description
Colorectal cancer is the second-highest cause of cancer-related deaths in the United States, with approximately 50,000 estimated deaths in 2015. Advanced-stage colorectal cancer has a poor five-year survival rate of 10%, whereas diagnosis in the early stages of development shows a much more favorable five-year survival rate of 90%. Early diagnosis of colorectal cancer is achievable if colorectal polyps, a possible precursor to cancer, are detected and removed before developing into malignancy.

The preferred method for polyp detection and removal is optical colonoscopy. A colonoscopic procedure consists of two phases: (1) insertion phase during which a flexible endoscope (a flexible tube with a tiny video camera at the tip) is advanced via the anus and then gradually to the end of the colon--called the cecum, and (2) withdrawal phase during which the endoscope is gradually withdrawn while colonoscopists examine the colon wall to find and remove polyps. Colonoscopy is an effective procedure and has led to a significant decline in the incidence and mortality of colon cancer. However, despite many screening and therapeutic advantages, 1 out of every 4 polyps and 1 out of 13 colon cancers are missed during colonoscopy.

There are many factors that contribute to missed polyps and cancers including poor colon preparation, inadequate navigational skills, and fatigue. Poor colon preparation results in a substantial portion of colon covered with fecal content, hindering a careful examination of the colon. Inadequate navigational skills can prevent a colonoscopist from examining hard-to-reach regions of the colon that may contain a polyp. Fatigue can manifest itself in the performance of a colonoscopist by decreasing diligence and vigilance during procedures. Lack of vigilance may prevent a colonoscopist from detecting the polyps that briefly appear in the colonoscopy videos. Lack of diligence may result in hasty examination of the colon that is likely to miss polyps and lesions.

To reduce polyp and cancer miss rates, this research presents a quality assurance system with 3 components. The first component is an automatic polyp detection system that highlights the regions with suspected polyps in colonoscopy videos. The goal is to encourage more vigilance during procedures. The suggested polyp detection system consists of several novel modules: (1) a new patch descriptor that characterizes image appearance around boundaries more accurately and more efficiently than widely-used patch descriptors such as HoG, LBP, and Daisy; (2) a 2-stage classification framework that is able to enhance low-level image features prior to classification. Unlike the traditional way of image classification where a single patch undergoes the processing pipeline, our system fuses the information extracted from a pair of patches for more accurate edge classification; (3) a new vote accumulation scheme that robustly localizes objects with curvy boundaries in fragmented edge maps. Our voting scheme produces a probabilistic output for each polyp candidate but, unlike the existing methods (e.g., Hough transform), does not require any predefined parametric model of the object of interest; and (4) a unique three-way image representation coupled with convolutional neural networks (CNNs) for classifying the polyp candidates. Our image representation efficiently captures a variety of features such as color, texture, shape, and temporal information and significantly improves the performance of the subsequent CNNs for candidate classification. This contrasts with the existing methods that mainly rely on a subset of the above image features for polyp detection. Furthermore, this research is the first to investigate the use of CNNs for polyp detection in colonoscopy videos.
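
The descriptor, two-stage classifier, and voting modules are specific to this work and are not reproduced here; as a rough illustration of the final step only, the sketch below defines a small convolutional network (written in PyTorch purely for illustration) that classifies candidate patches whose three channels stand in for the color/texture, shape, and temporal views described above.

```python
# Sketch: a small CNN for classifying polyp-candidate patches whose three input
# channels stand in for the color/texture, shape, and temporal views described above.
# The architecture and patch size are illustrative, not the dissertation's network.
import torch
import torch.nn as nn

class CandidateCNN(nn.Module):
    def __init__(self, patch_size=32):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * (patch_size // 4) ** 2, 2)  # polyp vs. non-polyp

    def forward(self, x):                      # x: (batch, 3, patch, patch)
        h = self.features(x)
        return self.classifier(h.flatten(1))

# logits = CandidateCNN()(torch.randn(8, 3, 32, 32))  # -> shape (8, 2)
```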

The second component of our quality assurance system is an automatic image quality assessment for colonoscopy. The goal is to encourage more diligence during procedures by warning against hasty and low-quality colon examination. We detect a low-quality colon examination by identifying a number of consecutive non-informative frames in videos. We base our methodology for detecting non-informative frames on two key observations: (1) non-informative frames most often show an unrecognizable scene with few details and blurry edges, and thus their information can be locally compressed into a few Discrete Cosine Transform (DCT) coefficients, whereas informative images include many more details and their information content cannot be summarized by a small subset of DCT coefficients; (2) information content is spread all over the image in the case of informative frames, whereas in non-informative frames, depending on image artifacts and degradation factors, details may appear in only a few regions. We use the former observation in designing our global features and the latter in designing our local image features. We demonstrated that the suggested new features are superior to the existing features based on wavelet and Fourier transforms.
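
As a concrete illustration of these two observations (not the dissertation's actual implementation), the sketch below computes a global score, the fraction of a frame's DCT energy captured by a small low-frequency block, and a local score reflecting how evenly detail is spread across image blocks; block sizes are illustrative.

```python
# Sketch of the two DCT-based cues for non-informative frames: (1) globally, how much
# of the frame's energy a few low-frequency DCT coefficients capture, and (2) locally,
# how evenly detail is spread across blocks. Thresholds and sizes are illustrative.
import numpy as np
from scipy.fft import dctn

def global_compactness(gray, k=8):
    c = dctn(gray.astype(float), norm="ortho")
    # Close to 1 means almost all energy sits in the low-frequency block (blurry frame).
    return (c[:k, :k] ** 2).sum() / ((c ** 2).sum() + 1e-12)

def local_detail_spread(gray, block=32):
    h, w = gray.shape
    detail = []
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            c = dctn(gray[y:y + block, x:x + block].astype(float), norm="ortho")
            detail.append((c ** 2).sum() - c[0, 0] ** 2)   # energy outside the DC term
    detail = np.array(detail)
    return (detail > detail.mean()).mean()   # fraction of blocks carrying real detail
```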

The third component of our quality assurance system is a 3D visualization system. The goal is to provide colonoscopists with feedback about the regions of the colon that have remained unexamined during colonoscopy, thereby helping them improve their navigational skills. The suggested system is based on a new algorithm that combines depth and position information for 3D reconstruction. We propose to use a depth camera and a tracking sensor to obtain the depth and position information. Our system contrasts with existing works, where depth and position information are unreliably estimated from the colonoscopy frames. We conducted a use case experiment demonstrating that the suggested 3D visualization system can determine the unseen regions of the navigated environment. However, due to technology limitations, we were not able to evaluate our 3D visualization system using a phantom model of the colon.
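
The reconstruction algorithm itself is not given in this abstract; the sketch below only shows the basic geometry it builds on: back-projecting a depth image through the camera intrinsics and transforming the points with the pose reported by the tracking sensor. The intrinsics (fx, fy, cx, cy) and pose (R, t) are placeholders.

```python
# Sketch: back-project a depth frame into 3D and move it into world coordinates using
# the tracked camera pose, the basic operation behind fusing depth + position data.
# fx, fy, cx, cy (intrinsics) and R, t (pose from the tracking sensor) are placeholders.
import numpy as np

def depth_to_world(depth, fx, fy, cx, cy, R, t):
    h, w = depth.shape
    v, u = np.mgrid[:h, :w]
    z = depth.astype(float)
    x = (u - cx) * z / fx                      # pinhole back-projection
    y = (v - cy) * z / fy
    pts_cam = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    pts_world = pts_cam @ R.T + t              # rigid transform into the world frame
    return pts_world[z.reshape(-1) > 0]        # drop pixels with no depth reading
```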
Date Created
2015
Agent

Combining Phylogeography and Spatial Epidemiology to Uncover Predictors of H5N1 Influenza A Virus Diffusion

Description

Emerging and re-emerging infectious diseases of zoonotic origin like highly pathogenic avian influenza pose a significant threat to human and animal health due to their elevated transmissibility. Identifying the drivers of such viruses is challenging, and estimation of spatial diffusion is complicated by the fact that the variability of viral spread among locations could be caused by a complex array of unknown factors. Several techniques exist to help identify these drivers, including bioinformatics, phylogeography, and spatial epidemiology, but these methods are generally applied separately and do not take advantage of their complementary nature. Here, we studied an approach that integrates these techniques and identifies the most important drivers of viral spread, focusing on H5N1 influenza A virus in Egypt because of its recent emergence as an epicenter for the disease. We used a Bayesian phylogeographic generalized linear model (GLM) to reconstruct spatiotemporal patterns of viral diffusion while simultaneously assessing the impact of factors contributing to transmission. We also calculated the cross-species transmission rates among hosts in order to identify the species driving transmission. The densities of both human and avian species were supported contributors, along with latitude, longitude, elevation, and several meteorological variables. Also supported was the presence of a genetic motif found near the hemagglutinin cleavage site. Various genetic, geographic, demographic, and environmental predictors each play a role in H5N1 diffusion. Further development and expansion of phylogeographic GLMs such as this will enable health agencies to identify variables that can curb virus diffusion and reduce morbidity and mortality.
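
The phylogeographic GLM is estimated jointly with the phylogeny (e.g., in BEAST) and cannot be reduced to a few lines; the sketch below, with placeholder data, only illustrates the GLM idea itself: modeling log-scale diffusion rates between location pairs as a linear combination of standardized predictors.

```python
# Sketch of the GLM idea behind phylogeographic diffusion models: log-scale rates of
# movement between location pairs are modeled as a linear combination of standardized
# predictors (distance, host density, elevation, ...). Real analyses estimate this
# jointly with the phylogeny; this stand-alone regression uses placeholder data and
# is only illustrative.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n_pairs = 90                                   # ordered location pairs
predictors = rng.normal(size=(n_pairs, 4))     # e.g., distance, human/avian density, elevation
log_rates = rng.normal(size=n_pairs)           # placeholder diffusion rates (log scale)

X = sm.add_constant(predictors)
fit = sm.OLS(log_rates, X).fit()
print(fit.params)                              # one coefficient per predictor
```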

Date Created
2015-01-01
Agent

Informatics approaches for integrative analysis of disparate high-throughput genomic datasets in cancer

Description
The processes of a human somatic cell are very complex, with various genetic mechanisms governing its fate. Such cells undergo various genetic mutations, which translate to the genetic aberrations that we see in cancer. There are more than 100 types of cancer, each having many more subtypes, with aberrations unique to each. In the past two decades, the widespread application of high-throughput genomic technologies, such as microarrays and next-generation sequencing, has led to the revelation of many such aberrations. Known types and subtypes can be readily identified using gene-expression profiling and, more importantly, high-throughput genomic datasets have helped identify novel subtypes with distinct signatures. Recent studies showing the use of gene-expression profiling in clinical decision making for breast cancer patients underscore the utility of high-throughput datasets. Beyond prognosis, understanding the underlying cellular processes is essential for effective cancer treatment. Various high-throughput techniques are now available to look at a particular aspect of a genetic mechanism in cancer tissue. To look at these mechanisms individually is akin to looking at a broken watch: taking apart each of its parts, looking at them individually, and finally making a list of all the faulty ones. Integrative approaches are needed to transform one-dimensional cancer signatures into multi-dimensional interaction and regulatory networks, consequently bettering our understanding of cellular processes in cancer. Here, I attempt to (i) address ways to effectively identify high-quality variants when multiple assays on the same sample are available, through two novel tools, snpSniffer and NGSPE; and (ii) glean new biological insight into multiple myeloma through two novel integrative analysis approaches making use of disparate high-throughput datasets. While these methods focus on multiple myeloma datasets, the informatics approaches are applicable to all cancer datasets and will thus help advance cancer genomics.
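
The snpSniffer/NGSPE logic is not detailed in this abstract; a minimal version of the underlying idea, treating variants called concordantly by multiple assays on the same sample as higher confidence, is sketched below using simple (chrom, pos, ref, alt) keys.

```python
# Sketch: cross-assay concordance check for variant calls on the same sample.
# Each call set is a collection of (chrom, pos, ref, alt) tuples, e.g., parsed from VCFs.
def concordance(call_sets):
    """call_sets: dict assay_name -> set of (chrom, pos, ref, alt)."""
    all_calls = set().union(*call_sets.values())
    support = {v: [name for name, calls in call_sets.items() if v in calls]
               for v in all_calls}
    # Variants seen by every assay are treated as high confidence.
    high_conf = {v for v, assays in support.items() if len(assays) == len(call_sets)}
    return support, high_conf

# support, high_conf = concordance({"exome": exome_calls, "genome": wgs_calls})
```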
Date Created
2014
Agent

Structural variant detection: a novel approach

Description
Genomic structural variation (SV) is defined as gross alterations in the genome, broadly classified as insertions/duplications, deletions, inversions and translocations. DNA sequencing has ushered structural variant discovery beyond laboratory detection techniques to high-resolution informatics approaches. Bioinformatics tools for computational discovery of SVs, however, are still missing variants in the complex cancer genome. This study aimed to define the genomic context leading to tool failure and to design a novel algorithm addressing this context.

Methods: The study tested the widely held but unproven hypothesis that tools fail to detect variants which lie in repeat regions. The publicly available 1000 Genomes dataset with experimentally validated variants was tested with the SVDetect tool for the presence of true positive (TP) SVs versus false negative (FN) SVs, expecting that FNs would be overrepresented in repeat regions. Further, the novel algorithm, designed to informatically capture the biological etiology of translocations (non-allelic homologous recombination and the 3-D placement of chromosomes in cells as context), was tested using a simulated dataset. Translocations were created in known translocation hotspots, and the tool implementing the novel algorithm was compared with SVDetect and BreakDancer.

Results: 53% of false negative (FN) deletions were within repeat structure, compared to 81% of true positive (TP) deletions. Similarly, 33% of FN insertions versus 42% of TP, 26% of FN duplications versus 57% of TP, and 54% of FN novel sequences versus 62% of TP were within repeats. Repeat structure was not driving the tools' inability to detect variants and could not be used as context. The novel algorithm with a redefined context, when tested against SVDetect and BreakDancer, was able to detect 10/10 simulated translocations with the 30X coverage dataset and 100% allele frequency, while SVDetect captured 4/10 and BreakDancer detected 6/10. For the 15X coverage dataset with 100% allele frequency, the novel algorithm was able to detect all ten translocations, albeit with fewer supporting reads; BreakDancer detected 4/10 and SVDetect detected 2/10.

Conclusion: This study showed that the presence of repetitive elements within a structural variant did not, in general, influence a tool's ability to capture it. The context-based algorithm proved better than current tools even with half the genome coverage of the accepted protocol and provides an important first step for novel translocation discovery in the cancer genome.
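
The annotation pipeline used for the repeat analysis is not described in the text; its overlap test can be reproduced in outline as below, given SV calls and repeat intervals (e.g., from RepeatMasker) represented as simple (chrom, start, end) tuples.

```python
# Sketch: fraction of structural variants overlapping repeat regions, computed
# separately for true-positive and false-negative call sets. Inputs are simple
# (chrom, start, end) tuples; a real analysis would load them from BED/VCF files.
from collections import defaultdict

def overlaps_repeat(sv, repeats_by_chrom):
    chrom, start, end = sv
    return any(s < end and start < e for s, e in repeats_by_chrom.get(chrom, []))

def repeat_fraction(svs, repeats):
    repeats_by_chrom = defaultdict(list)
    for chrom, s, e in repeats:
        repeats_by_chrom[chrom].append((s, e))
    hits = sum(overlaps_repeat(sv, repeats_by_chrom) for sv in svs)
    return hits / len(svs) if svs else 0.0

# Compare, e.g., repeat_fraction(fn_deletions, repeats) vs. repeat_fraction(tp_deletions, repeats)
```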
Date Created
2014
Agent