Statistical Sequence Alignment of Protein Coding Regions

Description
Sequence alignment is an essential method in bioinformatics and the basis of many analyses, including phylogenetic inference, ancestral sequence reconstruction, and gene annotation. Sequence artifacts and errors made in alignment reconstruction can impact downstream analyses, leading to erroneous conclusions in comparative and functional genomic studies. While such errors are eventually fixed in the reference genomes of model organisms, many genomes used by researchers contain these artifacts, often forcing researchers to discard large amounts of data to prevent artifacts from impacting results. I developed COATi, a statistical, codon-aware pairwise aligner designed to align protein-coding sequences in the presence of artifacts commonly introduced by sequencing or annotation errors, such as early stop codons and abiological frameshifts. Unlike common sequence aligners, which rely on amino acid translations, model insertions and deletions only between codons, or lack a statistical model, COATi combines a codon substitution model specifically designed for protein-coding regions, a complex insertion-deletion model, and a sequencing base-calling error step. The alignment algorithm is based on finite state transducers (FSTs), computational machines well suited for modeling sequence evolution. I show that COATi outperforms available methods using a simulated empirical pairwise alignment dataset as a benchmark. The FST-based model and alignment algorithm in COATi are resource-intensive for sequences longer than a few kilobases. To address this constraint, I developed an approximate model compatible with traditional dynamic programming alignment algorithms. I describe how the original codon substitution model is transformed to build an approximate model and how the alignment algorithm is implemented by modifying the popular Gotoh algorithm. I simulated a benchmark of alignments and measured how well the marginal models approximate the original method.
Finally, I present a novel tool for analyzing sequence alignments. Available metrics can measure the similarity between two alignments or the column uncertainty within an alignment but cannot produce a site-specific comparison of two or more alignments. AlnDotPlot is an R software package inspired by traditional dot plots that can provide valuable insights when comparing pairwise alignments. I describe AlnDotPlot and showcase its utility in displaying a single alignment, comparing different pairwise alignments, and summarizing alignment space.
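As a rough illustration of the dynamic programming scheme this description refers to, the following minimal Python sketch computes a global alignment score with affine gap penalties in the style of the Gotoh algorithm. It is not COATi's implementation. The function name, the nucleotide match/mismatch scores, and the gap penalties are assumptions chosen for illustration, whereas the dissertation describes a modified version built on a codon substitution model.

```python
from math import inf


def gotoh_score(a: str, b: str, match: int = 1, mismatch: int = -1,
                gap_open: int = -3, gap_extend: int = -1):
    """Best global alignment score under affine gap costs.

    A gap of length k costs gap_open + (k - 1) * gap_extend.
    All scoring values are illustrative assumptions, not COATi parameters.
    """
    n, m = len(a), len(b)
    # Three matrices, one per alignment state:
    #   M: a[i-1] is aligned to b[j-1]
    #   X: a[i-1] is aligned to a gap (gap in b)
    #   Y: b[j-1] is aligned to a gap (gap in a)
    M = [[-inf] * (m + 1) for _ in range(n + 1)]
    X = [[-inf] * (m + 1) for _ in range(n + 1)]
    Y = [[-inf] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    # Borders: a prefix aligned entirely against gaps.
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # Opening a gap pays gap_open; continuing one pays gap_extend.
            X[i][j] = max(M[i - 1][j] + gap_open,
                          X[i - 1][j] + gap_extend,
                          Y[i - 1][j] + gap_open)
            Y[i][j] = max(M[i][j - 1] + gap_open,
                          Y[i][j - 1] + gap_extend,
                          X[i][j - 1] + gap_open)
    return max(M[n][m], X[n][m], Y[n][m])
```

Keeping separate matrices for the match and gap states is what lets a long gap be charged one opening cost plus a cheaper per-position extension cost, while the running time stays proportional to the product of the two sequence lengths.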
Date Created
2023

Biodiversity and Biogeography of Myrmecosymbioses and Vanuatuan Ants

Description
Biogeography places the geographical distribution of biodiversity in an evolutionary context. Ants (Hymenoptera: Formicidae), being a group of ubiquitous, ecologically dominant, and diverse insects, are useful model systems to understand the evolutionary origins and mechanisms of biogeographical patterns across spatial scales. On a global scale, ants have been used to test hypotheses on the origin and maintenance of the remarkably consistent latitudinal diversity gradient, in which biodiversity peaks in the equatorial tropics and decreases towards the poles. Additionally, ants have been used to posit and test theories of island biogeography, such as the mechanisms of the species-area relationship, the increase of biodiversity with cumulative land area. However, there are still unanswered questions about ant biogeography, such as how specialized life histories contribute to their global biogeographical patterns. Furthermore, there remain island systems in the world’s biodiversity hotspots that harbor far fewer ant species than predicted by the species-area relationship, which suggests that they may be ripe for discovery. In this dissertation, I use natural history, taxonomic, geographic, and phylogenetic data to study ant biodiversity and biogeography across spatial scales. First, I study the global biodiversity and biogeography of a specialized set of symbiotic interactions between ant species, here referred to as myrmecosymbioses, with an emphasis on social parasitism, in which one species exploits the parental care behavior and social colony environment of another species. In addition to characterizing a new myrmecosymbiosis, I use a global biogeographic and phylogenetic dataset to show that ant social parasitism is distributed along an inverse latitudinal diversity gradient, with species richness and independent evolutionary origins of social parasitism peaking in the northern hemisphere, where free-living ant diversity is lowest.
Second, I study the unexplored ant fauna of the Vanuatuan archipelago in the South Pacific. Using approximately 10,000 Vanuatuan ant specimens coupled with phylogenomics, I fill in a historical knowledge gap of South Pacific ant biogeography and demonstrate that the Vanuatuan ant fauna is a novel biodiversity hotspot. With these studies, I provide insights into how specialized life histories and unique island biotas shape the global distribution of biodiversity in different ways, especially in the ants.
Date Created
2023

Black-footed Ferret Conservation History, Methodology, and Discussion

Description

Black-footed ferrets have become one of the most popular conservation success stories because of the miraculous rediscovery of the species after it was declared extinct and because of its growing population today. The stability of the species is still highly variable, as the ferrets are threatened by disease, habitat fragmentation, human infringement, and the extermination of their main prey item, the prairie dog. The complexity of the issue arises from negative public perceptions of prairie dogs, which lead to less citizen support for their protection and in turn undermine progress in black-footed ferret conservation. General issues with the bureaucracy of conservation delay the formal protection of species at risk, which would be especially important for species that are actively being removed or exterminated by humans, like the prairie dog. Careful analysis of the black-footed ferret and the prairie dog through the lenses of their natural histories, conservation histories, and modern conservation methods suggests that the public’s opinion and support are the greatest tools for the protection of species at risk because of the complexity of conservation and the rallying bureaucratic motion.

Date Created
2023-05

Revision of the Genus Pachnaeus (Coleoptera: Curculionidae: Entiminae)

Description
The weevil genus Pachnaeus Schoenherr, 1826 (Coleoptera: Curculionidae: Entiminae: Eustylini Lacordaire) is revised to accommodate 21 species, including the following 10 new species from the northern Caribbean region: Pachnaeus andersoni sp. nov. (Little Cayman), Pachnaeus eisenbergi sp. nov. (Jamaica), Pachnaeus godivae sp. nov. (Cayman Brac), Pachnaeus gordoni sp. nov. (Jamaica), Pachnaeus howdenae sp. nov. (Bahamas), Pachnaeus ivieorum sp. nov. (Bahamas with adventive records from Florida), Pachnaeus maestrensis sp. nov. (Cuba), Pachnaeus morelli sp. nov. (Haiti), Pachnaeus obrienorum sp. nov. (Cuba and Bahamas), and Pachnaeus quadrilineatus sp. nov. (Jamaica). Pachnaeus can be distinguished from similar, co-occurring taxa such as Exophthalmus quadrivittatus (Olivier, 1807), Exophthalmus roseipes (Chevrolat, 1876), Exophthalmus vittatus (Linnaeus, 1758), and Diaprepes abbreviatus (Linnaeus, 1758) by (1) the presence of postocular vibrissae, (2) endophallus primarily membranous and sac-like proximally, and long (>3 × width), tubular, and sclerotized distally, (3) additional endophallic sclerites typically absent, (4) a never bicarinate, typically tricarinate, rostrum, and several additional characteristics of the pedon, endophallus, pronotal structure, rostral structure, and scaling. Based on these characters, Pachnaeus sommeri (Munck af Rosenschoeld in Schoenherr, 1840) comb. nov. and Pachnaeus gowdeyi (Marshall, 1926) comb. nov. are transferred into the genus from Exophthalmus Schoenherr and Lachnopus Schoenherr, respectively. This revision provides genus and species redescriptions, diagnoses, illustrations, and the first comprehensive key to all 21 species within the present circumscription of Pachnaeus, in addition to reviewing the known biology and observed intraspecific variation within species.
The complex taxonomic history of the genus is reviewed, and the evolutionary relationships of its presumed constituent clades are proposed through the construction of informal species groups and subgroups based on diagnosable shared traits. Lectotypes for Pachnaeus citri Marshall, Pachnaeus costatus Perroud, and Exophthalmus sommeri Munck af Rosenschoeld in Schoenherr and paralectotypes of P. citri (3 specimens) and E. sommeri (4 specimens) are designated. New state and national records are reported for Pachnaeus azurescens Gyllenhal in Schoenherr for Florida, U.S.A. and new national records are reported for Pachnaeus litus (Germar) for the Bahamas. Validity of the names Docorhinus Schoenherr, 1823 and Pachnaeus Schoenherr, 1826 is treated. Generic placement of Pachnaeus roseipes Chevrolat, 1876 is explored.
Date Created
2022

Analysis of Specificity of Associations Between Myrmecophilous Mites and their Host Species

Description

Ants are a widespread group of eusocial insects, and myrmecophily describes the habit of species that live in association with ants. Many mites are myrmecophilous and interact with their hosts in many ways, such as phoresis or parasitism. The relationship between ants and mites is of interest because parasitic species could be used to control the spread of invasive ant species. For this project, I reviewed the existing literature on myrmecophilous mites around the world and compiled a database of ant-mite associations, which I then used to characterize factors such as host specificity, attachment sites, and biogeographical patterns. This work demonstrates that existing research on myrmecophilous mites has been both geographically and taxonomically biased and highlights the need for much more comprehensive surveys of mites living in association with ants.

Date Created
2021-05

Methods for Detecting Mutations in Non-model Organisms

Description
Next-generation sequencing is a powerful tool for detecting genetic variation. However, it is also error-prone, with error rates that are much larger than mutation rates. This can make mutation detection difficult; and while increasing sequencing depth can often help, sequence-specific errors and other non-random biases cannot be detected by increased depth. The problem of accurate genotyping is exacerbated when there is not a reference genome or other auxiliary information available.
I explore several methods for sensitively detecting mutations in non-model organisms using an example Eucalyptus melliodora individual. I use the structure of the tree to find bounds on its somatic mutation rate and evaluate several algorithms for variant calling. I find that conventional methods are suitable if the genome of a close relative can be adapted to the study organism. However, with structured data, a likelihood framework that is aware of this structure is more accurate. I use the techniques developed here to evaluate a reference-free variant calling algorithm.
I also use this data to evaluate a k-mer based base quality score recalibrator (KBBQ), a tool I developed to recalibrate base quality scores attached to sequencing data. Base quality scores can help detect errors in sequencing reads, but are often inaccurate. The most popular method for correcting this issue requires a known set of variant sites, which is unavailable in most cases. I simulate data and show that errors in this set of variant sites can cause calibration errors. I then show that KBBQ accurately recalibrates base quality scores while requiring no reference or other information and performs as well as other methods.
Finally, I use the Eucalyptus data to investigate the impact of quality score calibration on the quality of output variant calls and show that improved base quality score calibration increases the sensitivity and reduces the false positive rate of a variant calling algorithm.
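For readers unfamiliar with base quality scores, the short Python sketch below shows the standard Phred relationship between a quality score Q and an error probability p, Q = -10 log10(p), and how a reported score can be compared with an empirically observed error rate. It does not reproduce KBBQ or any recalibration method from the dissertation. The function names and the example counts in the usage note are hypothetical and serve only to illustrate what "calibration" means here.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Error probability implied by a Phred-scaled quality score."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred-scaled quality score for an error probability 0 < p <= 1."""
    return -10 * math.log10(p)


def empirical_quality(n_errors: int, n_bases: int) -> float:
    """Phred score matching an observed error rate (hypothetical helper).

    A well-calibrated set of bases reported at quality Q should have an
    empirical quality close to Q.
    """
    if n_errors <= 0 or n_bases <= 0:
        raise ValueError("need at least one error and one base")
    return error_prob_to_phred(n_errors / n_bases)
```

For example, if 10,000 bases were all reported at Q30 but 100 of them turned out to be errors, `empirical_quality(100, 10_000)` would be close to 20, meaning the reported scores overstate the accuracy by about 10 Phred units.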
Date Created
2020

The Role of Multiple Expression Sites and Mosaic Gene Conversion in Antigenic Variation in African Trypanosomes

Description
Although extracellular throughout their lifecycle, trypanosomes are able to persist despite strong host immune responses through a process known as antigenic variation involving a large, highly diverse family of surface glycoprotein (VSG) genes, only one of which is expressed at a time. Previous studies have used mathematical models to investigate the relationship between VSG switching and the dynamics of trypanosome infections, but none have explored the role of multiple VSG expression sites or the contribution of mosaic gene conversion events involving VSG pseudogenes.
Date Created
2020-05

Diversity and Distribution of the Desert Stink Beetles: Systematics of the Amphidorini LeConte, 1862 (Coleoptera: Tenebrionidae)

Description
Understanding the diversity, evolutionary relationships, and geographic distribution of species is foundational knowledge in biology. However, this knowledge is lacking for many diverse lineages of the tree of life. This is the case for the desert stink beetles in the tribe Amphidorini LeConte, 1862 (Coleoptera: Tenebrionidae), a lineage of arid-adapted flightless beetles found throughout western North America. Four interconnected studies that jointly increase our knowledge of this group are presented. First, the darkling beetle fauna of the Algodones sand dunes in southern California is examined as a case study to explore the scientific practice of checklist creation. An updated list of the species known from this region is presented, with a critical focus on material now made available through digitization and global aggregation. This part concludes with recommendations for future biodiversity checklist authors. Second, the psammophilic genus Trogloderus LeConte, 1879 is revised. Six new species are described, and the first multi-gene phylogeny for the genus is inferred. In addition, historical biogeographic reconstructions along with novel hypotheses of speciation patterns within the Intermountain Region are given. In particular, the Kaibab Plateau and Kaiparowits Formation are found to have promoted speciation on the Colorado Plateau. The Owens Valley and prehistoric Bouse Embayment are similarly hypothesized to have driven species diversification in southern California. Third, a novel phylogenomic analysis for the tribe Amphidorini is presented, based on 29 de novo partial transcriptomes. Three putative ortholog sets were discovered and analyzed to infer the relationships between species groups and genera. The existing classification of the tribe is found to be highly inadequate, though the earliest-diverging relationships within the tribe are still in question.
Finally, the new phylogenetic framework is used to provide a genus-level revision for the Amphidorini, which previously contained six valid genera and 253 valid species. This updated classification includes more than 100 taxonomic changes and results in the revised tribe consisting of 16 genera, with three being described as new to science.
Date Created
2018

Modeling Health Indicators of Arizona State's Women's Soccer Team

Description
Winning records are critical to a team's morale, success, and future. As such, players need to perform their best when they are called into a game to ensure the best possible chance of contributing to the team's success. During the 2013 fall season of Arizona State's NCAA women's soccer team, quantities such as heart rate workload, weight loss, and playing time were measured for twenty-five players and analyzed with mathematical software using least squares regression lines and other mathematical relationships. Equations and box plots were produced for each player in the hope that the coaches could tailor practices to the athletes' physical needs to increase performance and results for the upcoming fall 2014 season. The playing time and heart rate workload model suggests that increased playing time increases heart rate workload in a linear fashion, though the increase varies by player. The model for the team proposes that the heart rate workload changes in response to playing time according to the equation y=2.67x+127.41 throughout the season. The weight loss and heart rate workload model suggests that establishing a relationship between the two variables is complex, since neither the linear nor the power regression model fit the data. Future studies can focus on the Rate of Perceived Exertion scale, which can supplement the heart rate workload and provide valuable information on players' fatigue levels.
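To make the team-level equation concrete, the small Python sketch below evaluates y=2.67x+127.41 and shows the closed-form least squares fit for a straight line. Only the slope and intercept come from the description above. The function names are hypothetical, the units of x and y are not stated in the source, and the thesis's actual software and data are not reproduced here.

```python
def team_workload(playing_time: float) -> float:
    """Heart rate workload predicted by the reported team-level line."""
    return 2.67 * playing_time + 127.41


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic points placed exactly on the reported line (not real player data).
xs = [0.0, 30.0, 60.0, 90.0]
ys = [team_workload(x) for x in xs]
slope, intercept = fit_line(xs, ys)
```

Because the synthetic points lie exactly on the line, the fit recovers the reported slope and intercept. Under this model, each additional unit of playing time adds 2.67 units of heart rate workload for the team as a whole.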
Date Created
2014-05

Bayesian Biogeographical Analyses with Beast: Assessment Using Simulated Data

Description
Biogeography is the study of the spatial distribution of the earth's biota, both in the present and the past. Traditionally, biogeographical studies have relied on a combination of surveys of existing populations, fossil evidence, and the geological record of the earth. However, with the advent of relatively inexpensive methods of DNA sequencing, it is now possible to use information concerning the genetic relatedness of individuals in populations to address questions about how those populations came to be where they are today. For example, biogeographical studies of HIV-1 provide strong support for the hypothesis that this virus arose in Africa through a host switch from chimpanzees to humans and only began to spread to human populations located on other continents some 60 to 70 years ago (Sharp & Hahn, 2010).
Date Created
2015-05