Robust and Rapid Algorithms Facilitate Large-Scale Whole Genome Sequencing Downstream Analysis in an Integrative Framework

Description

Whole genome sequencing (WGS) is a promising strategy for uncovering the variants and genes responsible for human diseases and traits. However, robust platforms for comprehensive downstream analysis are lacking. In the present study, we first proposed three novel algorithms (sequence gap-filled gene feature annotation, bit-block encoded genotypes, and sectional fast access to text lines) to address three fundamental problems. These three algorithms then formed the infrastructure of a robust parallel computing framework, KGGSeq, which integrates downstream analysis functions for whole genome sequencing data. KGGSeq is equipped with a comprehensive set of analysis functions for quality control, filtration, annotation, pathogenic prediction and statistical tests. In tests with whole genome sequencing data from the 1000 Genomes Project, KGGSeq annotated several thousand more reliable non-synonymous variants than other widely used tools (e.g. ANNOVAR and SnpEff). It took only around half an hour on a small server with 10 CPUs to access the genotypes of ∼60 million variants in 2504 subjects, while a popular alternative tool required around one day. KGGSeq's bit-block genotype format used 1.5% or less space to flexibly represent phased or unphased genotypes with multiple alleles, and calculated genotypic correlation over 1000 times faster.
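To illustrate why a bit-encoded genotype layout can speed up correlation calculations, the following is a minimal sketch and not KGGSeq's actual bit-block format. It assumes unphased, biallelic genotypes coded as alternative-allele counts (0, 1, 2) with no missing values, and the function names `encode`, `popcount` and `correlation` are hypothetical. Each variant is stored as two bit masks (heterozygous and homozygous-alternative carriers), so every sum needed for Pearson's r reduces to bitwise ANDs and bit counts.

```python
from math import sqrt


def encode(genotypes):
    """Pack alt-allele counts (0, 1, 2) into two integer bit masks.

    Bit i of `het` is set when subject i has genotype 1; bit i of `hom`
    is set when subject i has genotype 2. Illustrative layout only.
    """
    het = 0
    hom = 0
    for i, g in enumerate(genotypes):
        if g == 1:
            het |= 1 << i
        elif g == 2:
            hom |= 1 << i
        elif g != 0:
            raise ValueError(f"unsupported genotype code: {g!r}")
    return het, hom


def popcount(mask):
    """Count the set bits in a non-negative integer."""
    return bin(mask).count("1")


def correlation(a, b, n):
    """Pearson correlation of two encoded variants over n subjects."""
    het_a, hom_a = a
    het_b, hom_b = b
    # Sums of genotypes and of squared genotypes come from bit counts.
    sum_a = popcount(het_a) + 2 * popcount(hom_a)
    sum_b = popcount(het_b) + 2 * popcount(hom_b)
    sq_a = popcount(het_a) + 4 * popcount(hom_a)
    sq_b = popcount(het_b) + 4 * popcount(hom_b)
    # Sum of products: 1*1, 1*2, 2*1 and 2*2 genotype pairs.
    cross = (
        popcount(het_a & het_b)
        + 2 * (popcount(het_a & hom_b) + popcount(hom_a & het_b))
        + 4 * popcount(hom_a & hom_b)
    )
    var_a = n * sq_a - sum_a ** 2
    var_b = n * sq_b - sum_b ** 2
    if var_a == 0 or var_b == 0:
        return float("nan")  # a monomorphic variant has no defined r
    return (n * cross - sum_a * sum_b) / sqrt(var_a * var_b)


x = [0, 1, 2, 1, 0, 2]
y = [2, 1, 0, 1, 2, 0]
r_same = correlation(encode(x), encode(x), len(x))
r_opposite = correlation(encode(x), encode(y), len(x))
```

In this sketch `r_same` compares a variant with itself and `r_opposite` compares it with its mirror image (2 minus each genotype). Phased genotypes, multiple alleles and missing calls, which the abstract says the real format handles, would need additional bit planes and are left out here.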

Date Created
2017-01-23
Agent

Exploring Genetic Associations With ceRNA Regulation in the Human Genome

Description

Competing endogenous RNAs (ceRNAs) are RNA molecules that sequester shared microRNAs (miRNAs), thereby affecting the expression of other targets of those miRNAs. Whether genetic variants in a ceRNA can affect its biological function and disease development is still an open question. Here we identified a large number of genetic variants associated with ceRNA function using Geuvadis RNA-seq data for 462 individuals from the 1000 Genomes Project. We call these loci competing endogenous RNA expression quantitative trait loci, or ‘cerQTL’, and found that a large number of them were unexplored in conventional eQTL mapping. We identified many cerQTLs that have undergone recent positive selection in different human populations, and showed that single nucleotide polymorphisms in gene 3′ UTRs at the miRNA seed binding regions can simultaneously regulate gene expression changes in both cis and trans through the ceRNA mechanism. We also discovered that cerQTLs are significantly enriched among trait/disease-associated variants reported from genome-wide association studies in miRNA binding sites, suggesting that disease susceptibility could be attributed to ceRNA regulation. Further in vitro functional experiments demonstrated that the cerQTL rs11540855 can regulate ceRNA function. These results provide a comprehensive catalog of functional non-coding regulatory variants that may be responsible for ceRNA crosstalk at the post-transcriptional level.

Date Created
2017-05-02
Agent

Long Noncoding RNA LINC00305 Promotes Inflammation by Activating the AHRR-NF-κB Pathway in Human Monocytes

Description

Accumulating data from genome-wide association studies (GWAS) have provided a collection of novel candidate genes associated with complex diseases, such as atherosclerosis. We identified an atherosclerosis-associated single-nucleotide polymorphism (SNP) located in an intron of the long noncoding RNA (lncRNA) LINC00305 by searching the GWAS database. Although the function of LINC00305 is unknown, we found that LINC00305 expression is enriched in atherosclerotic plaques and monocytes. Overexpression of LINC00305 promoted the expression of inflammation-associated genes in THP-1 cells and reduced the expression of contractile markers in co-cultured human aortic smooth muscle cells (HASMCs). We showed that overexpression of LINC00305 activated nuclear factor-kappa B (NF-κB) and that inhibition of NF-κB abolished LINC00305-mediated activation of cytokine expression. Mechanistically, LINC00305 interacted with lipocalin-1 interacting membrane receptor (LIMR), enhanced the interaction of LIMR and aryl-hydrocarbon receptor repressor (AHRR), and promoted protein expression as well as nuclear localization of AHRR. Moreover, LINC00305 activated NF-κB exclusively in the presence of LIMR and AHRR. In light of these findings, we propose that LINC00305 promotes monocyte inflammation by facilitating LIMR and AHRR cooperation and AHRR activation, which eventually activates NF-κB, thereby inducing HASMC phenotype switching.

Date Created
2017-04-10
Agent

Cepip: Context-Dependent Epigenomic Weighting for Prioritization of Regulatory Variants and Disease-Associated Genes

Description

It remains challenging to predict regulatory variants in particular tissues or cell types because gene regulation is highly context-specific. By connecting large-scale epigenomic profiles to expression quantitative trait loci (eQTLs) in a wide range of human tissues/cell types, we identify critical chromatin features that predict variant regulatory potential. We present cepip, a joint likelihood framework for estimating a variant’s regulatory probability in a context-dependent manner. Our method exhibits significant GWAS signal enrichment and is superior to existing cell type-specific methods. Furthermore, using phenotypically relevant epigenomes to weight the GWAS single-nucleotide polymorphisms, we improve the statistical power of the gene-based association test.
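The idea of weighting GWAS SNPs by prior regulatory evidence can be sketched with a generic weighted p-value scheme. This is an assumption made for illustration only: it is not cepip's joint likelihood model, nor the gene-based association test used in the study. The function name `weighted_p_values` and the example weights are hypothetical. Each p-value is divided by its weight after the weights are rescaled to average 1, so SNPs with stronger prior support are favored without changing the overall weight budget.

```python
def weighted_p_values(p_values, weights):
    """Rescale weights to mean 1, then return min(1, p / weight) per SNP.

    Generic weighted-hypothesis sketch; the weights stand in for any
    per-SNP regulatory score and must all be positive.
    """
    if len(p_values) != len(weights):
        raise ValueError("p_values and weights must have the same length")
    if not weights:
        return []
    if any(w <= 0 for w in weights):
        raise ValueError("weights must be positive")
    mean_weight = sum(weights) / len(weights)
    return [
        min(1.0, p * mean_weight / w)
        for p, w in zip(p_values, weights)
    ]


# Two SNPs with equal raw evidence but different hypothetical weights.
adjusted = weighted_p_values([0.01, 0.01], [3.0, 1.0])
```

With these inputs the first SNP, which carries the larger weight, ends up with a smaller adjusted p-value than the second even though their raw p-values are identical.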

Date Created
2017-03-16
Agent