COVID-19 Literature Search via Natural Language Processing Tools

Description

The COVID-19 pandemic began in March of 2020 and drastically affected the global human population. Millions of people died from SARS-CoV-2 infection, while many who survived developed devastating sequelae of the disease. In addition, the closure of schools and businesses led to international economic struggle in 2020 as global economies declined. Since the beginning of the pandemic, over 200,000 scientific articles have been published and compiled into a database that grows daily, a rare occurrence within the scientific community. This thesis uses natural language processing tools in Python and the VOSviewer software to perform a bibliometric analysis of 205,712 papers pertaining to COVID-19 published between January 2020 and February 2021. We first investigate how to analyze these publications most effectively by comparing title and abstract keyword searches. We then identify the focus of the current scientific literature via co-occurrence analysis and clustering. Finally, we discuss how these topics evolved over the course of 14 months.
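As a rough illustration of the keyword co-occurrence step mentioned in this abstract, the sketch below counts how often pairs of keywords appear together in the same title. It is a minimal example under stated assumptions, not the thesis's actual pipeline: the function name `cooccurrence_counts`, the sample titles, and the keyword list are all hypothetical, and plain substring matching stands in for whatever tokenization the author used.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents, keywords):
    """Count how many documents contain each pair of keywords.

    Hypothetical helper: matching is a case-insensitive substring test,
    which is a simplification of real keyword extraction.
    """
    keywords = [k.lower() for k in keywords]
    pair_counts = Counter()
    for text in documents:
        lowered = text.lower()
        # Keywords present in this document, sorted so each pair has one canonical order
        present = sorted({k for k in keywords if k in lowered})
        pair_counts.update(combinations(present, 2))
    return pair_counts


# Made-up titles for illustration only; these are not from the thesis dataset
titles = [
    "Vaccine efficacy against SARS-CoV-2 transmission",
    "Mental health of students during school closure",
    "Vaccine hesitancy and mental health",
]
counts = cooccurrence_counts(titles, ["vaccine", "mental health", "transmission"])
for pair, n in sorted(counts.items()):
    print(pair, n)
```

Running the same function on abstracts instead of titles would give one simple way to compare the two search strategies. A pair-count table of this kind is also the sort of input that co-occurrence clustering operates on.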

Date Created
2021-05

Properties of Disordered Regions of Proteins in RNA Granules

Description
RNA granules are assemblies of RNA and proteins inside cells that serve multiple roles. They include a variety of organelles with diverse functions, such as germ cell P granules, stress granules, and neuronal granules. Intrinsically disordered domains are abundant in the proteins responsible for RNA granules, and they have been implicated in the formation and degradation of RNA granules through a liquid-liquid phase separation (LLPS) process. LLPS is typically a reversible process in which a homogeneous fluid demixes into two distinct liquid phases. Here, 47 RNA granule proteins with such disordered regions have been surveyed. These proteins have been simulated using coarse-grained molecular simulations to determine how their size depends on temperature. Upper critical solution temperature (UCST) and lower critical solution temperature (LCST) phase behaviors can be determined from the data gathered on the scaling and phase behaviors of these proteins. We discovered that RNA granule proteins contain fewer charged amino acids than general disordered sequences. This is in line with the observation that charged amino acids are less preferred in sequences that phase separate at physiologically relevant temperatures. More interestingly, there seems to be an even mix of sequences showing UCST, LCST, and no phase behavior, and the behavior averaged over all these proteins depends only weakly on temperature between 300 and 325 K. This average suggests that these proteins might collectively contribute to RNA granules in a way that adapts to small temperature fluctuations.
Date Created
2020-05

Predicting Dimensions of Intrinsically Disordered Proteins

Description
In recent years, experimental and theoretical evidence has pointed to the existence of biologically active proteins that either include unstructured regions or are entirely unstructured. Referred to as intrinsically disordered proteins (IDPs), they are now known to be involved in diverse functions, much as any folded protein. Mutations in IDPs have been implicated in multiple neurodegenerative diseases. Given the disordered nature of IDPs, few structural features can be used to quantify the disordered state. One such pair of variables is the radius of gyration (Rg) and the corresponding Flory’s scaling exponent, both of which characterize the dimension and size of the protein. It is generally understood that the sequence of an IDP affects its Rg and scaling exponent. Properties such as amino acid hydrophobicity and charge can play important roles in determining the Rg of an IDP, much as they affect the structure of a folded protein. However, it is nontrivial to directly predict Rg and the scaling exponent from an IDP sequence. In this thesis, a coarse-grained model is used to simulate the Rg and scaling exponents of 10,000 randomly generated sequences mimicking the amino acid propensities of a typical IDP sequence. This database is then fed into an artificial neural network model to predict the scaling exponent directly from the sequence. The framework not only makes accurate and precise predictions (<1% error) compared with the simulation-obtained scaling exponent, but also suggests important sequence descriptors for such prediction. In addition, by varying the number of sequences used to train the model, we suggest that a minimum dataset of 100 sequences might be sufficient to achieve a 5% prediction error, shedding light on possible predictive models built with only experimental inputs.
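To make the scaling exponent in this abstract concrete: Flory scaling is commonly written as Rg ≈ b·N^ν, where N is the chain length, ν is the scaling exponent, and b is a prefactor. The sketch below recovers ν from a least-squares fit in log-log space. It is an illustration only, assuming synthetic data generated with a known exponent; the abstract does not state how the thesis extracts ν from its simulations, and the function name `fit_scaling_exponent` is hypothetical.

```python
import math


def fit_scaling_exponent(chain_lengths, rg_values):
    """Fit Rg = b * N**nu by ordinary least squares on log(Rg) versus log(N).

    Hypothetical helper for illustration; returns (nu, b).
    """
    xs = [math.log(n) for n in chain_lengths]
    ys = [math.log(r) for r in rg_values]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    # Slope of the log-log line is the scaling exponent nu
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    nu = sxy / sxx
    # Intercept of the log-log line gives the prefactor b
    prefactor = math.exp(mean_y - nu * mean_x)
    return nu, prefactor


# Synthetic data (arbitrary units) built with nu = 0.5 and b = 0.2; not thesis results
lengths = [50, 100, 200, 400]
rg = [0.2 * n ** 0.5 for n in lengths]
nu, b = fit_scaling_exponent(lengths, rg)
print(f"nu = {nu:.3f}, b = {b:.3f}")
```

With real simulation output, the fitted ν would carry statistical noise, so the quality of the fit would need to be checked before using it as a training label.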
Date Created
2019-05