Statistical Sequence Alignment of Protein Coding Regions

Description
Sequence alignment is an essential method in bioinformatics and the basis of many analyses, including phylogenetic inference, ancestral sequence reconstruction, and gene annotation. Sequence artifacts and errors made in alignment reconstruction can impact downstream analyses, leading to erroneous conclusions in comparative and functional genomic studies. While such errors are eventually fixed in the reference genomes of model organisms, many genomes used by researchers contain these artifacts, often forcing researchers to discard large amounts of data to prevent artifacts from impacting results. I developed COATi, a statistical, codon-aware pairwise aligner designed to align protein-coding sequences in the presence of artifacts commonly introduced by sequencing or annotation errors, such as early stop codons and abiological frameshifts. Unlike common sequence aligners, which rely on amino acid translations, only model insertions and deletions between codons, or lack a statistical model, COATi combines a codon substitution model specifically designed for protein-coding regions, a complex insertion-deletion model, and a sequencing base-calling error step. The alignment algorithm is based on finite state transducers (FSTs), computational machines well-suited for modeling sequence evolution. I show that COATi outperforms available methods using a simulated empirical pairwise alignment dataset as a benchmark. The FST-based model and alignment algorithm in COATi are resource-intensive for sequences longer than a few kilobases. To address this constraint, I developed an approximate model compatible with traditional dynamic programming alignment algorithms. I describe how the original codon substitution model is transformed to build an approximate model and how the alignment algorithm is implemented by modifying the popular Gotoh algorithm. I simulated a benchmark of alignments and measured how well the marginal models approximate the original method.
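For readers unfamiliar with the Gotoh algorithm mentioned above, the following is a minimal sketch of the textbook three-matrix recursion for global alignment with affine gap costs. It is not COATi's modified version: the function name `gotoh_score` and the match, mismatch, and gap scores are illustrative assumptions, whereas the dissertation derives its scores from a marginal codon substitution model.

```python
import math


def gotoh_score(a, b, match=1.0, mismatch=-1.0, gap_open=-3.0, gap_extend=-1.0):
    """Best global alignment score with affine gaps.

    A gap of length k scores gap_open + (k - 1) * gap_extend.
    All scores are toy values chosen for illustration only.
    """
    neg = -math.inf
    n, m = len(a), len(b)
    # M: a[i-1] aligned to b[j-1]
    # X: a[i-1] aligned to a gap (gap in b)
    # Y: b[j-1] aligned to a gap (gap in a)
    M = [[neg] * (m + 1) for _ in range(n + 1)]
    X = [[neg] * (m + 1) for _ in range(n + 1)]
    Y = [[neg] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # Either open a new gap or extend the one already open.
            X[i][j] = max(M[i - 1][j] + gap_open,
                          Y[i - 1][j] + gap_open,
                          X[i - 1][j] + gap_extend)
            Y[i][j] = max(M[i][j - 1] + gap_open,
                          X[i][j - 1] + gap_open,
                          Y[i][j - 1] + gap_extend)
    return max(M[n][m], X[n][m], Y[n][m])
```

Keeping separate matrices for the match state and the two gap states is what lets the recursion charge a different cost for opening a gap than for extending one, while still running in time proportional to the product of the sequence lengths.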
Finally, I present a novel tool for analyzing sequence alignments. Available metrics can measure the similarity between two alignments or the column uncertainty within an alignment but cannot produce a site-specific comparison of two or more alignments. AlnDotPlot is an R software package inspired by traditional dot plots that can provide valuable insights when comparing pairwise alignments. I describe AlnDotPlot and showcase its utility in displaying a single alignment, comparing different pairwise alignments, and summarizing alignment space.
Date Created
2023

Profiling of Indel Phases in Coding Regions

Description
Advances in sequencing technology have generated an enormous amount of data over the past decade. Equally advanced computational methods are needed to conduct comparative and functional genomic studies on these datasets, in particular tools that appropriately interpret indels within an evolutionary framework. The evolutionary history of indels is complex and often involves repetitive genomic regions, which makes identification, alignment, and annotation difficult. While previous studies have found that indel lengths in both DNA and proteins obey a power law, probabilistic models for indel evolution have rarely been explored due to their computational complexity. In my research, I first explore an application of an expectation-maximization algorithm for maximum-likelihood training of a codon substitution model. I demonstrate the training accuracy of the expectation-maximization algorithm on my substitution model. I then apply this algorithm to a published dataset of 90 species pairs and find a negative correlation between branch length and the non-synonymous selection coefficient. Second, I develop a post-alignment fixation method to profile each indel event into three different phases according to its codon position. This step is needed because current codon-aware models can only place gaps between codons, which leads to misalignment of the sequences. By comparing the proportions of the indel phases, I find that the mouse-rat species pair is under purifying selection. I also demonstrate the power of my sliding-window method by comparing the post-aligned and original gap positions. Third, I create an indel-phase Moore machine that includes the indel rates of the three phases, length distributions, and codon substitution models. I then design a Gillespie simulation that is capable of generating true sequence alignments.
Next, I develop an importance sampling method within the expectation-maximization algorithm that can successfully train the indel-phase model and infer accurate parameter estimates from alignments. Finally, I extend the indel phase analysis to the dataset of 90 species pairs across three alignment methods: the MAFFT+sw method developed in chapter 3, the coati-sampling method applied in chapter 4, and the coati-max method. I also explore a non-linear relationship between the dN/dS and Zn/(Zn+Zs) ratios across the 90 species pairs.
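The idea of assigning each indel a phase by codon position can be illustrated with a short sketch. It assumes a common convention that the abstract does not spell out: the reference sequence is in frame from its first base, and an indel's phase is the number of reference bases preceding the gap, modulo 3 (phase 0 falls between codons, phases 1 and 2 fall inside a codon). The function name `indel_phases` and the tuple layout are hypothetical, and this is not the dissertation's sliding-window method.

```python
def indel_phases(ref, qry):
    """List (kind, length, phase) for each gap run in a pairwise alignment.

    ref and qry are equal-length aligned strings using '-' for gaps.
    Assumes ref is in frame from its first base (an illustrative assumption).
    """
    if len(ref) != len(qry):
        raise ValueError("aligned sequences must have equal length")
    events = []
    ref_pos = 0  # number of reference bases consumed so far
    i, n = 0, len(ref)
    while i < n:
        if ref[i] == "-":
            # Gap in the reference: an insertion relative to ref.
            j = i
            while j < n and ref[j] == "-":
                j += 1
            events.append(("insertion", j - i, ref_pos % 3))
            i = j
        elif qry[i] == "-":
            # Gap in the query: a deletion relative to ref.
            start = ref_pos
            j = i
            while j < n and qry[j] == "-" and ref[j] != "-":
                ref_pos += 1
                j += 1
            events.append(("deletion", j - i, start % 3))
            i = j
        else:
            ref_pos += 1
            i += 1
    return events
```

For example, `indel_phases("ATGAAACCC", "ATGA---CC")` reports a single deletion of length 3 in phase 1, because the gap begins after the first base of the second codon.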
Date Created
2022

Spatial genetic structure under limited dispersal: theory, methods and consequences of isolation-by-distance

Description
Isolation-by-distance is a specific type of spatial genetic structure that arises when parent-offspring dispersal is limited. Many natural populations exhibit localized dispersal, and as a result, individuals that are geographically near each other will tend to have greater genetic similarity than individuals that are farther apart. It is important to identify isolation-by-distance because it can impact the statistical analysis of population samples and it can help us better understand evolutionary dynamics. For this dissertation I investigated several aspects of isolation-by-distance. First, I looked at how the shape of the dispersal distribution affects the observed pattern of isolation-by-distance. If, as theory predicts, the shape of the distribution has little effect, then it would be more practical to model isolation-by-distance using a simple dispersal distribution rather than replicating the complexities of more realistic distributions. Therefore, I developed an efficient algorithm to simulate dispersal based on a simple triangular distribution, and using a simulation, I confirmed that the pattern of isolation-by-distance was similar to that produced by other, more realistic distributions. Second, I developed a Bayesian method to quantify isolation-by-distance using genetic data by estimating Wright’s neighborhood size parameter. I analyzed the performance of this method using simulated data and a microsatellite data set from two populations of Maritime pine, and I found that the neighborhood size estimates had good coverage and low error. Finally, one of the major consequences of isolation-by-distance is an increase in inbreeding. Plants are often particularly susceptible to inbreeding, and as a result, they have evolved many inbreeding avoidance mechanisms. Using a simulation, I determined which mechanisms are more successful at preventing inbreeding associated with isolation-by-distance.
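As a rough illustration of how a triangular dispersal distribution relates to Wright's neighborhood size, the sketch below draws offspring displacements from a symmetric triangular distribution and evaluates the standard two-dimensional formula Nb = 4πDσ², where D is population density and σ² is the axial variance of parent-offspring dispersal. The abstract does not describe the dissertation's algorithm, so every detail here is an assumption: the function names are hypothetical, the two axes are sampled independently, and Python's built-in `random.triangular` stands in for the author's sampler.

```python
import math
import random


def triangular_variance(width):
    """Variance of a symmetric triangular distribution on [-width, width]."""
    return width ** 2 / 6.0


def disperse(x, y, width, rng):
    """Move an offspring from its parent at (x, y).

    Each axis gets an independent symmetric triangular offset
    (an illustrative assumption, not the dissertation's algorithm).
    """
    dx = rng.triangular(-width, width, 0.0)
    dy = rng.triangular(-width, width, 0.0)
    return x + dx, y + dy


def neighborhood_size(density, sigma2):
    """Wright's neighborhood size in two dimensions: 4 * pi * D * sigma^2."""
    return 4.0 * math.pi * density * sigma2


def empirical_axial_variance(width, n, seed=0):
    """Sample variance of the x-offset over n simulated dispersal events."""
    rng = random.Random(seed)
    xs = [disperse(0.0, 0.0, width, rng)[0] for _ in range(n)]
    mean = sum(xs) / n
    return sum((v - mean) ** 2 for v in xs) / (n - 1)
```

With a half-width of 3 distance units, the axial variance is 1.5, so a population at a density of one individual per unit area would have a neighborhood size of about 18.8 under these assumptions.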
Date Created
2015