Early Career Performance Models: Regression-Based Forecasting Models for Predicting Future Major League Baseball Player Performance

Description

The widespread use of statistical analysis in sports, particularly baseball, has made it increasingly necessary for small and mid-market teams to find ways to maintain their analytical advantages over large-market clubs. In baseball, an opportunity exists for teams with limited financial resources to sign players under team control to long-term contracts before other teams can bid for their services in free agency. If small and mid-market clubs can successfully identify talented players early, they can save money, achieve cost certainty, and remain competitive for longer periods of time. These deals are also advantageous to players, since they receive job security and greater financial dividends earlier in their careers. The objective of this paper is to develop a regression-based predictive model that teams can use to forecast the performance of young baseball players with limited Major League experience. Several tasks were conducted to achieve this goal: (1) Data were obtained from Major League Baseball and Lahman's Baseball Database and sorted using Excel macros for easier analysis. (2) Players were separated into three positional groups based on similar fielding requirements and offensive profiles: Group I comprises first and third basemen, Group II contains second basemen, shortstops, and center fielders, and Group III contains left and right fielders. (3) Based on the context of baseball and the nature of offensive performance metrics, only players who achieved more than 200 plate appearances within the first two years of their major league debut are included in the analysis. (4) The statistical software package JMP was used to create regression models for each group and analyze the residuals for any irregularities or normality violations. Once the models were developed, slight adjustments were made to improve the accuracy of the forecasts and identify opportunities for future work. It was found that Group I and Group III were the easiest player groupings to forecast, while Group II required several attempts to improve the model.
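
As a rough sketch of steps (1)–(4), the filtering, grouping, and model-fitting workflow might look like the following. The file name, column names, and the choice of OPS as the response are illustrative assumptions, not the thesis's actual data layout (the original analysis used JMP rather than Python).

```python
# Hypothetical sketch of the early-career forecasting pipeline.
# File and column names are assumptions for illustration only.
import pandas as pd
import statsmodels.formula.api as smf

batting = pd.read_csv("batting.csv")  # assumed columns: playerID, position, debut_year, year, PA, OPS, future_OPS

# (3) keep players with more than 200 plate appearances in their first two ML seasons
first_two = batting[batting["year"] <= batting["debut_year"] + 1]
pa_totals = first_two.groupby("playerID")["PA"].sum()
eligible = pa_totals[pa_totals > 200].index
early = first_two[first_two["playerID"].isin(eligible)]

# (2) positional groups with similar fielding requirements and offensive profiles
groups = {
    "I": ["1B", "3B"],
    "II": ["2B", "SS", "CF"],
    "III": ["LF", "RF"],
}

# (4) fit a separate regression model per group and inspect the fit
for name, positions in groups.items():
    subset = early[early["position"].isin(positions)]
    model = smf.ols("future_OPS ~ OPS + PA", data=subset).fit()
    print(f"Group {name}: R^2 = {model.rsquared:.3f}")
```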
Date Created
2013-05
Agent

A Designed Experiments Approach to Optimizing MALDI-TOF MS Spectrum Processing Parameters Enhances Detection of Antibiotic Resistance in Campylobacter Jejuni

Description

MALDI-TOF MS has been utilized as a reliable and rapid tool for microbial fingerprinting at the genus and species levels. Recently, there has been keen interest in using MALDI-TOF MS beyond the genus and species levels to rapidly identify antibiotic-resistant strains of bacteria. The purpose of this study was to enhance strain-level resolution for Campylobacter jejuni through the optimization of spectrum processing parameters using a series of designed experiments. A collection of 172 strains of C. jejuni was assembled from Luxembourg, New Zealand, North America, and South Africa, consisting of four groups of antibiotic-resistant isolates: (1) 65 strains resistant to cefoperazone, (2) 26 resistant to cefoperazone and beta-lactams, (3) 5 strains resistant to cefoperazone, beta-lactams, and tetracycline, and (4) 76 strains resistant to cefoperazone, teicoplanin, amphotericin B, and cephalothin.

Initially, a model set of 16 strains (three biological replicates and three technical replicates per isolate, yielding a total of 144 spectra) of C. jejuni was subjected to each designed experiment to enhance detection of antibiotic resistance. The optimal parameters were then applied to the larger collection of 172 isolates (two biological replicates and three technical replicates per isolate, yielding a total of 1,031 spectra). We observed an increase in antibiotic resistance detection whenever a curve-based similarity coefficient (Pearson or ranked Pearson) was applied rather than a peak-based coefficient (Dice) and/or the optimized preprocessing parameters were applied. Increases in antimicrobial resistance detection were scored using the jackknife maximum similarity technique following cluster analysis. For the four groups of antibiotic-resistant isolates, the optimized preprocessing parameters increased detection in the respective groups by (1) 5%, (2) 9%, (3) 10%, and (4) 2%. A second categorization was created from the collection, consisting of 31 strains resistant to beta-lactams and 141 strains sensitive to beta-lactams. With the optimal preprocessing parameters applied, beta-lactam resistance detection increased by 34%. These results suggest that spectrum processing parameters, which are rarely optimized or adjusted, affect the performance of MALDI-TOF MS-based detection of antibiotic resistance and can be fine-tuned to enhance screening performance.
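To make the curve-based versus peak-based distinction concrete, the sketch below computes a Pearson correlation over full intensity curves and a Dice coefficient over binned peak positions for two synthetic spectra. The spectra, peak-picking rule, and bin width are assumptions for illustration, not the settings of the spectrum-processing software used in the study.

```python
# Minimal sketch contrasting a curve-based (Pearson) and a peak-based (Dice)
# similarity coefficient on two synthetic spectra.
import numpy as np

rng = np.random.default_rng(0)
mz = np.linspace(2000, 20000, 4000)

def synthetic_spectrum(peak_centers, noise_sd):
    y = np.zeros_like(mz)
    for c in peak_centers:
        y += np.exp(-0.5 * ((mz - c) / 15.0) ** 2)  # narrow Gaussian "peaks"
    return y + rng.normal(0, noise_sd, mz.size)

centers = [3100, 4500, 5200, 6800, 9600, 11200]
spectrum_a = synthetic_spectrum(centers, noise_sd=0.02)
spectrum_b = synthetic_spectrum(centers[:-1] + [11900], noise_sd=0.02)  # one peak shifted

# curve-based similarity: Pearson correlation over the full intensity curves
pearson = np.corrcoef(spectrum_a, spectrum_b)[0, 1]

# peak-based similarity: Dice coefficient on binned peak positions
def peak_bins(y, threshold=0.5, bin_width=50.0):
    idx = [i for i in range(1, y.size - 1)
           if y[i] > threshold and y[i] >= y[i - 1] and y[i] >= y[i + 1]]
    return {int(mz[i] // bin_width) for i in idx}

bins_a, bins_b = peak_bins(spectrum_a), peak_bins(spectrum_b)
dice = 2 * len(bins_a & bins_b) / (len(bins_a) + len(bins_b))
print(f"Pearson (curve-based): {pearson:.3f}   Dice (peak-based): {dice:.3f}")
```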

Date Created
2016-05-31
Agent

A Designed Experiments Approach to Optimization of Automated Data Acquisition During Characterization of Bacteria With MALDI-TOF Mass Spectrometry

Description

MALDI-TOF MS has been shown capable of rapidly and accurately characterizing bacteria. Highly reproducible spectra are required to ensure reliable characterization. Prior work has shown that spectra acquired manually can have higher reproducibility than those acquired automatically. For this reason, the objective of this study was to optimize automated data acquisition to yield spectra with reproducibility comparable to those acquired manually. Fractional factorial design was used to design experiments for robust optimization of settings, in which the values of five parameters commonly used to facilitate automated data acquisition (peak selection mass range, signal-to-noise ratio (S:N), base peak intensity, minimum resolution, and number of shots summed) were varied. Pseudomonas aeruginosa was used as a model bacterium in the designed experiments, and spectra were acquired using an intact cell sample preparation method. Optimum automated data acquisition settings (i.e., those settings yielding the highest reproducibility of replicate mass spectra) were obtained based on statistical analysis of spectra of P. aeruginosa. Finally, spectrum quality and reproducibility obtained from non-optimized and optimized automated data acquisition settings were compared for P. aeruginosa, as well as for two other bacteria, Klebsiella pneumoniae and Serratia marcescens. Results indicated that reproducibility increased from 90% to 97% (p-value ≈ 0.002) for P. aeruginosa when more shots were summed and, interestingly, decreased from 95% to 92% (p-value ≈ 0.013) with increased threshold minimum resolution. With regard to spectrum quality, highly reproducible spectra were more likely to have high spectrum quality as measured by several quality metrics, except for base peak resolution. Interaction plots suggest that, in cases of low threshold minimum resolution, high reproducibility can be achieved with fewer shots. Optimization yielded more reproducible spectra than non-optimized settings for all three bacteria.
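For readers less familiar with the design side, a 16-run half fraction in the five acquisition parameters can be generated as sketched below. The coded ±1 levels would be mapped to actual low/high instrument settings, which are not reproduced here; the generator choice is a standard one and is not claimed to match the study's exact design.

```python
# Sketch of a 2^(5-1) fractional factorial design for the five acquisition parameters.
# Coded levels only; the real low/high instrument settings are not shown here.
from itertools import product

factors = ["mass_range", "signal_to_noise", "base_peak_intensity",
           "min_resolution", "shots_summed"]

runs = []
for a, b, c, d in product([-1, 1], repeat=4):
    e = a * b * c * d            # defining relation E = ABCD (half fraction, 16 runs)
    runs.append(dict(zip(factors, (a, b, c, d, e))))

for i, run in enumerate(runs, 1):
    print(i, run)
```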

Date Created
2014-03-24
Agent

Measurement systems analysis studies: a look at the partition of variation (POV) method

Description

The Partition of Variance (POV) method is a simple way to identify large sources of variation in manufacturing systems. This method partitions the variance by estimating the variance of the means (between variance) and the mean of the variances (within variance). The project shows that the method correctly identifies the variance source when compared to the ANOVA method. The variance estimators deteriorate when varying degrees of non-normality are introduced through simulation; however, the POV method is shown to be a more stable measure of variance in the aggregate. The POV method also provides non-negative, stable estimates for interaction when compared to the ANOVA method, and it is more stable particularly in low sample size situations. Based on these findings, it is suggested that POV is not a replacement for more complex analysis methods but rather a supplement to them. POV is ideal for preliminary analysis due to its ease of implementation, its simplicity of interpretation, and its lack of dependency on statistical analysis packages or statistical knowledge.
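A toy sketch of the two components described above, the variance of the group means and the mean of the group variances, is shown below. The data are made up, and the exact estimators and corrections used in the thesis may differ.

```python
# Toy illustration of the two variance components behind POV:
# "between" = variance of the group means, "within" = mean of the group variances.
import numpy as np

# e.g., measurements from four tools, five parts each (made-up numbers)
groups = [
    [10.1, 10.3, 9.9, 10.2, 10.0],
    [10.8, 11.0, 10.7, 10.9, 11.1],
    [9.6, 9.8, 9.7, 9.5, 9.9],
    [10.4, 10.2, 10.5, 10.3, 10.6],
]

group_means = [np.mean(g) for g in groups]
group_vars = [np.var(g, ddof=1) for g in groups]

between = np.var(group_means, ddof=1)   # variance of the means
within = np.mean(group_vars)            # mean of the variances
print(f"between-group variance: {between:.4f}, within-group variance: {within:.4f}")
```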
Date Created
2015
Agent

A probabilistic framework of transfer learning: theory and application

Description

Transfer learning refers to statistical machine learning methods that integrate the knowledge of one domain (source domain) and the data of another domain (target domain) in an appropriate way, in order to develop a model for the target domain that is better than a model using the data of the target domain alone. Transfer learning emerged because classic machine learning, when used to model different domains, has to take one of two mechanical approaches: it will either assume the data distributions of the different domains to be the same and thereby develop one model that fits all, or develop one model for each domain independently. Transfer learning, on the other hand, aims to mitigate the limitations of the two approaches by accounting for both the similarity and the specificity of related domains. The objective of my dissertation research is to develop new transfer learning methods and demonstrate the utility of the methods in real-world applications. Specifically, in my methodological development, I focus on two different transfer learning scenarios: spatial transfer learning across different domains and temporal transfer learning along time in the same domain. Furthermore, I apply the proposed spatial transfer learning approach to the modeling of degenerate biological systems. Degeneracy is a well-known characteristic, widely existing in many biological systems, that contributes to the heterogeneity, complexity, and robustness of biological systems. In particular, I study one application involving a degenerate biological system: using transcription factor (TF) binding sites to predict gene expression across multiple cell lines. I also apply the proposed temporal transfer learning approach to change detection in dynamic network data. Change detection is a classic research area in Statistical Process Control (SPC), but change detection in network data has received limited study. I integrate the temporal transfer learning method, called the Network State Space Model (NSSM), with SPC and formulate change detection in dynamic networks as a covariance monitoring problem. I demonstrate the performance of the NSSM in change detection of dynamic social networks.
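The "borrow strength from a related domain" idea can be illustrated with a generic stand-in: a ridge regression whose coefficients are shrunk toward a source-domain model rather than toward zero. This is only a minimal sketch of the underlying intuition, not the spatial or temporal methods developed in the dissertation.

```python
# Generic illustration of transfer learning as shrinkage toward a source-domain model.
# Not the dissertation's method; a simple stand-in for the intuition.
import numpy as np

def transfer_ridge(X_t, y_t, beta_source, lam):
    """Solve min ||y_t - X_t b||^2 + lam * ||b - beta_source||^2 in closed form."""
    p = X_t.shape[1]
    A = X_t.T @ X_t + lam * np.eye(p)
    b = X_t.T @ y_t + lam * beta_source
    return np.linalg.solve(A, b)

rng = np.random.default_rng(1)
beta_true = np.array([1.0, -2.0, 0.5])
beta_source = beta_true + rng.normal(0, 0.1, 3)     # source domain is similar but not identical
X_t = rng.normal(size=(15, 3))                      # small target-domain sample
y_t = X_t @ beta_true + rng.normal(0, 0.5, 15)

print("target-only fit:   ", np.linalg.lstsq(X_t, y_t, rcond=None)[0].round(2))
print("transfer (lam=5.0):", transfer_ridge(X_t, y_t, beta_source, lam=5.0).round(2))
```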
Date Created
2015
Agent

A model fusion based framework for imbalanced classification problem with noisy dataset

Description

Data imbalance and data noise often coexist in real-world datasets. Data imbalance affects the learning classifier by degrading its recognition power on the minority class, while data noise affects the learning classifier by providing inaccurate information and thus misleading the classifier. Because of these differences, data imbalance and data noise have been treated separately in the data mining field. Yet such an approach ignores their mutual effects and, as a result, may lead to new problems. A desirable solution is to tackle these two issues jointly. Noting the complementary nature of generative and discriminative models, this research proposes a unified model fusion based framework to handle imbalanced classification with noisy datasets.

The phase I study focuses on the imbalanced classification problem. A generative classifier, the Gaussian Mixture Model (GMM), is studied because it can learn the distribution of the imbalanced data and thereby improve the discrimination power on imbalanced classes. By fusing this knowledge into a cost SVM (cSVM), a CSG method is proposed. Experimental results show the effectiveness of CSG in dealing with imbalanced classification problems.

The phase II study expands the research scope to include noisy datasets in the imbalanced classification problem. A model fusion based framework, K Nearest Gaussian (KNG), is proposed. KNG employs a generative modeling method, GMM, to model the training data as Gaussian mixtures and form adjustable confidence regions that are less sensitive to data imbalance and noise. Motivated by the K-nearest neighbor algorithm, the neighboring Gaussians are used to classify the testing instances. Experimental results show that the KNG method greatly outperforms traditional classification methods in dealing with imbalanced classification problems with noisy datasets.
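A rough sketch of the "nearest Gaussians" idea, fitting a mixture per class and letting the closest components vote, is given below. The confidence-region construction and distance measure of the actual KNG method are simplified here; component density is used as the closeness criterion, and the toy data are made up.

```python
# Simplified sketch: per-class Gaussian mixtures whose K highest-density components
# vote on the label of a test point. Stands in for, rather than reproduces, KNG.
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def fit_class_mixtures(X, y, n_components=2):
    """Fit one Gaussian mixture per class."""
    return {label: GaussianMixture(n_components=n_components, random_state=0).fit(X[y == label])
            for label in np.unique(y)}

def predict_kng(mixtures, x, k=3):
    """Let the k mixture components with the highest density at x vote on the label."""
    scored = []
    for label, gmm in mixtures.items():
        for mean, cov in zip(gmm.means_, gmm.covariances_):
            scored.append((multivariate_normal(mean, cov).logpdf(x), label))
    scored.sort(reverse=True)
    votes = [label for _, label in scored[:k]]
    return max(set(votes), key=votes.count)

# tiny imbalanced toy example: 95 majority points near (0,0), 8 minority points near (3,3)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (95, 2)), rng.normal(3, 1, (8, 2))])
y = np.array([0] * 95 + [1] * 8)
mixtures = fit_class_mixtures(X, y)
print(predict_kng(mixtures, np.array([3.0, 3.0])))   # usually assigned the minority class
```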

The phase III study addresses the issues of feature selection and parameter tuning for the KNG algorithm. To further improve the performance of KNG, a Particle Swarm Optimization based method (PSO-KNG) is proposed. PSO-KNG formulates model parameters and data features into the same particle vector and can thus search for the best feature and parameter combination jointly. The experimental results show that PSO can greatly improve the performance of KNG, with better accuracy and much lower computational cost.
Date Created
2014
Agent

Analysis of no-confounding designs using the Dantzig selector

Description

No-confounding (NC) designs in 16 runs for 6, 7, and 8 factors are non-regular fractional factorial designs that have been suggested as attractive alternatives to the regular minimum aberration resolution IV designs because they do not completely confound any two-factor interactions with each other. These designs allow for potential estimation of main effects and a few two-factor interactions without the need for follow-up experimentation. Analysis methods for non-regular designs are an area of ongoing research, because standard variable selection techniques such as stepwise regression may not always be the best approach. The current work investigates the use of the Dantzig selector for analyzing no-confounding designs. Through a series of examples it shows that this technique is very effective for identifying the set of active factors in no-confounding designs when there are three or four active main effects and up to two active two-factor interactions.
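For reference, in standard notation the Dantzig selector estimates the coefficient vector by solving

```latex
\hat{\beta}_{\mathrm{DS}} \;=\; \arg\min_{\beta \in \mathbb{R}^{p}} \|\beta\|_{1}
\quad \text{subject to} \quad
\big\| X^{\top} (y - X\beta) \big\|_{\infty} \le \delta ,
```

where y is the response vector, X is the model matrix of candidate effects, and δ is a tuning constant bounding how strongly the residuals may correlate with any column of X.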

To evaluate the performance of the Dantzig selector, a simulation study was conducted and the results were analyzed based on the percentage of type II errors. In addition, an alternative to the 6-factor NC design, called the Alternate No-confounding design in six factors, is introduced in this study. The performance of this Alternate NC design in 6 factors is then evaluated using the Dantzig selector as the analysis method. Lastly, a section is dedicated to comparing the performance of the NC-6 and Alternate NC-6 designs.
Date Created
2014
Agent

A P-value based approach for phase II profile monitoring

Description

A P-value based method is proposed for statistical monitoring of various types of profiles in phase II. The performance of the proposed method is evaluated by the average run length criterion under various shifts in the intercept, slope, and error standard deviation of the model. In our proposed approach, P-values are computed at each level within a sample. If at least one of the P-values is less than a pre-specified significance level, the chart signals out of control. The primary advantage of our approach is that only one control chart is required to monitor several parameters simultaneously: the intercept, slope(s), and the error standard deviation. A comprehensive comparison of the proposed method and the existing KMW-Shewhart method for monitoring linear profiles is conducted. In addition, the effect that the number of observations within a sample has on the performance of the proposed method is investigated. The proposed method was also compared to the T^2 method discussed in Kang and Albin (2000) for multivariate, polynomial, and nonlinear profiles. A simulation study shows that, overall, the proposed P-value method performs satisfactorily for different profile types.
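The signaling rule can be sketched for a simple linear profile with known in-control parameters as below. The per-level test here is a plain z-test against the in-control model and is only a stand-in for, not a reproduction of, the test statistics developed in the thesis; the parameter values are made up.

```python
# Simplified stand-in for the monitoring rule: one P-value per level of the profile,
# signal if any P-value falls below a pre-specified significance level.
import numpy as np
from scipy.stats import norm

beta0, beta1, sigma, alpha = 2.0, 1.5, 0.2, 0.005   # assumed in-control linear profile
x_levels = np.linspace(0, 1, 10)

def chart_signals(y_sample):
    residuals = y_sample - (beta0 + beta1 * x_levels)
    p_values = 2 * (1 - norm.cdf(np.abs(residuals) / sigma))   # one P-value per level
    return np.min(p_values) < alpha

rng = np.random.default_rng(2)
in_control = beta0 + beta1 * x_levels + rng.normal(0, sigma, x_levels.size)
shifted = (beta0 + 0.8) + beta1 * x_levels + rng.normal(0, sigma, x_levels.size)
# the in-control sample rarely signals; the intercept-shifted one almost always does
print(chart_signals(in_control), chart_signals(shifted))
```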
Date Created
2013
Agent

Optimal experimental design for accelerated life testing and design evaluation

Description

Product reliability is now a top concern of manufacturers, and customers prefer products that perform well over long periods. In order to estimate the lifetime of a product, accelerated life testing (ALT) is used, because most products can last years or even decades, making life testing under normal use conditions impractical. Much research has been done in the ALT area, and optimal design for ALT is a major topic. This dissertation consists of three main studies. First, a methodology for finding optimal designs for ALT with right censoring and interval censoring is developed; it employs the proportional hazard (PH) model and the generalized linear model (GLM) to simplify the computational process. A sensitivity study is also given to show how the parameters affect the designs. Second, an extended version of the I-optimal design for ALT is discussed, and a dual-objective design criterion is defined and illustrated with several examples. In addition, several graphical tools are developed to evaluate different candidate designs. Finally, model-checking designs are discussed for the case when more than one model is available.
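The proportional hazards model referenced above has the standard form (notation assumed here, with x the vector of accelerating variables in coded units):

```latex
\lambda(t \mid x) \;=\; \lambda_{0}(t)\, \exp\!\left( x^{\top} \beta \right),
```

where λ0(t) is the baseline hazard and β is the vector of regression coefficients; it is this structure, combined with the GLM formulation, that simplifies the computation of the optimal designs.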
Date Created
2013
Agent

Projection properties and analysis methods for six to fourteen factor no-confounding designs in 16 runs

Description

During the initial stages of experimentation, there are usually a large number of factors to be investigated. Fractional factorial (2^(k-p)) designs are particularly useful during this initial phase of experimental work. These experiments, often referred to as screening experiments, help reduce the large number of factors to a smaller set. The 16-run regular fractional factorial designs for six, seven, and eight factors are in common use. These designs allow clear estimation of all main effects when the three-factor and higher order interactions are negligible, but all two-factor interactions are aliased with each other, making estimation of these effects problematic without additional runs. Alternatively, certain non-regular designs called no-confounding (NC) designs by Jones and Montgomery (Jones & Montgomery, Alternatives to resolution IV screening designs in 16 runs, 2010) partially confound the main effects with the two-factor interactions but do not completely confound any two-factor interactions with each other. The NC designs are useful for independently estimating main effects and two-factor interactions without additional runs. While several methods have been suggested for the analysis of data from non-regular designs, stepwise regression is familiar to practitioners, available in commercial software, and widely used in practice. Given that an NC design has been run, however, the performance of stepwise regression for model selection is unknown. In this dissertation I present a comprehensive simulation study evaluating stepwise regression for analyzing both regular fractional factorial designs and NC designs. Next, the projection properties of the six-, seven-, and eight-factor NC designs are studied, since these properties allow the development of methods for analyzing the designs. Lastly, the 9- to 14-factor NC designs and their projection properties onto three and four factors are presented, along with recommendations on analysis methods for these designs.
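Since stepwise regression is central to the analysis study described above, a minimal forward-selection-by-P-value loop on a simulated 16-run design is sketched below. A regular half fraction stands in for the design matrix (the actual NC designs have different structure), the entry threshold is an assumption, and the removal steps of full stepwise regression are omitted.

```python
# Minimal forward stepwise selection by P-value on a simulated 16-run screening design.
import numpy as np
import statsmodels.api as sm
from itertools import product

# a regular 16-run half fraction (E = ABCD) stands in for the design matrix here
base = np.array(list(product([-1, 1], repeat=4)))
X = np.column_stack([base, base.prod(axis=1)])
names = ["A", "B", "C", "D", "E"]

rng = np.random.default_rng(3)
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(0, 1, 16)   # A and C are the active effects

selected, alpha_in = [], 0.05
while True:
    remaining = [j for j in range(X.shape[1]) if j not in selected]
    if not remaining:
        break
    pvals = {j: sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit().pvalues[-1]
             for j in remaining}
    best = min(pvals, key=pvals.get)
    if pvals[best] > alpha_in:
        break
    selected.append(best)

print("selected factors:", [names[j] for j in selected])
```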
Date Created
2012
Agent