Gene Network Inference via Sequence Alignment and Rectification

Faucon, Philippe Christophe

While techniques for reading DNA have existed in some capacity for decades, the ability to accurately edit genomes at scale has remained elusive. Novel techniques have recently been introduced to aid in the writing of DNA sequences. Although writing DNA is now more accessible, it remains expensive, justifying the increased interest in in silico predictions of cell behavior. To accurately predict the behavior of cells, it is necessary to model the cell environment extensively, including gene-to-gene interactions, as completely as possible.

Significant algorithmic advances have been made in identifying these interactions, but despite these improvements current techniques fail to infer some edges and fail to capture some complexities in the network. Much of this limitation is due to heavily underdetermined problems, in which tens of thousands of variables must be inferred from datasets with the power to resolve only a small fraction of them. Additionally, failure to correctly resolve gene isoforms using short reads contributes significantly to noise in gene quantification measures.
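
The underdetermination described above can be illustrated with a toy case. The sketch below is not taken from the dissertation: it assumes a simple linear model in which a target gene's expression is a weighted sum of its regulators, and the names `predict`, `samples`, `network_a`, and `network_b` are hypothetical. With two samples and three candidate regulators, two different sets of edge weights reproduce the observations exactly, so the data alone cannot distinguish between them.

```python
def predict(weights, expression):
    """Target expression under an assumed linear model: sum of weight * regulator level."""
    return sum(w * x for w, x in zip(weights, expression))

# Hypothetical data: two samples, each measuring three candidate regulators.
samples = [(1.0, 2.0, 3.0), (2.0, 1.0, 3.0)]

# Two different candidate networks (edge weights from each regulator to the target).
network_a = (1.0, 1.0, 0.0)  # regulators 1 and 2 drive the target
network_b = (0.0, 0.0, 1.0)  # regulator 3 alone drives the target

outputs_a = [predict(network_a, s) for s in samples]
outputs_b = [predict(network_b, s) for s in samples]

# Both networks explain the observations equally well.
indistinguishable = outputs_a == outputs_b
print(outputs_a, outputs_b, indistinguishable)
```

Under this assumed model, adding a third sample in which regulator 3 varies independently of regulators 1 and 2 would separate the two candidates. Real inference problems face the same issue at a scale of tens of thousands of variables.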

This dissertation introduces novel mathematical models, machine learning techniques, and biological techniques to address the problems described above. Mathematical models are proposed for the simulation of gene network motifs and for raw read simulation. Machine learning techniques are presented for DNA sequence matching and DNA sequence correction.

Results provide novel insights into the low-level functionality of gene networks. Also shown is the ability to use normalization techniques to aggregate data for gene network inference, leading to larger data sets while minimizing increases in inter-experimental noise. Results also demonstrate that the high error rates of third-generation sequencing differ significantly from previous error profiles, and that these errors can be modeled, simulated, and rectified. Finally, techniques are provided for correcting these errors while preserving the benefits of third-generation sequencing.

Copyright Statement