Batting for the Long Run: How MLB Service Time Influences the Predictive Accuracy of Expected Statistics

Description

In 2015, a new way to track baseball games was introduced to MLB, marking the beginning of the Statcast Revolution. This new tracking system brought with it a number of new statistics, including expected statistics. Expected statistics estimate what a player's results should have been, on average, given the same actions; this is explored further in the paper. While expected statistics are not intended to predict future performance, I theorized that there may be some relationship, particularly for younger players. There is no existing research on this topic, and if a correlation between expected statistics and future performance does exist, it would give teams a new way to project their players. To search for such a correlation, I computed the predictive accuracy of expected batting average and expected slugging for 12 MLB players from their rookie seasons through their eighth seasons and combined the results to construct an interval in which I could be confident the correlation lay. Overall, I could not conclude with confidence that there is a correlation between the predictive accuracy of expected statistics and the length of time a player has played in MLB. While this conclusion does not offer new insight into how to better predict a player's future performance, the methodology and findings still present opportunities to gain a better understanding of the predictive properties of expected statistics.
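
As a rough illustration of the approach described above (not the author's actual code or data), the sketch below computes the correlation between service year and predictive accuracy, here measured as the negative absolute error between a season's expected batting average and the next season's actual average, and builds a confidence interval for that correlation with the Fisher z-transformation. All numbers are placeholders.

```python
# A minimal sketch (not the author's actual code) of the general approach: correlate
# MLB service year with how well expected stats predicted the following season, then
# build a confidence interval for that correlation. All numbers are placeholders.
import numpy as np
from scipy import stats

# Hypothetical records: (service_year, xBA in year t, actual BA in year t+1)
records = np.array([
    [1, 0.251, 0.244],
    [2, 0.262, 0.258],
    [3, 0.270, 0.281],
    [4, 0.266, 0.252],
    [5, 0.259, 0.263],
])

service_year = records[:, 0]
# Predictive accuracy: negative absolute error (higher = better prediction)
accuracy = -np.abs(records[:, 1] - records[:, 2])

# Pearson correlation between service time and predictive accuracy
r, p_value = stats.pearsonr(service_year, accuracy)

# 95% confidence interval via the Fisher z-transformation
z = np.arctanh(r)
se = 1.0 / np.sqrt(len(records) - 3)
lo, hi = np.tanh(z - 1.96 * se), np.tanh(z + 1.96 * se)
print(f"r = {r:.3f}, p = {p_value:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```
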
Date Created
2024-05
Agent

An Analysis of the Boundary Explorer Adaptive Sampling Technique

Description

With the explosion of autonomous systems under development, complex simulation models are being tested and relied on far more than in the recent past. This uptick in autonomous systems being modeled and then tested magnifies both the advantages and disadvantages of simulation experimentation. An inherent problem in autonomous systems development is that small changes in factor settings can result in large changes in a response's performance. These occurrences look like cliffs in a metamodel's response surface and are referred to as performance mode boundary regions. These regions represent areas of interest in the autonomous system's decision-making process; therefore, performance mode boundary regions are areas of interest for autonomous systems developers.

Traditional augmentation methods aid experimenters seeking different objectives, often by improving a certain design property of the factor space (such as variance) or a design's modeling capabilities. While useful, these augmentation techniques do not target the response-focused areas of interest that need attention in autonomous systems testing. The Boundary Explorer Adaptive Sampling Technique, or BEAST, is a set of design augmentation algorithms. The adaptive sampling algorithm targets performance mode boundaries with additional samples; the gap-filling augmentation algorithm targets sparsely sampled areas in the factor space. BEAST allows sampling to adapt to information obtained from previous iterations of experimentation and to target these regions of interest. Exploiting the advantages of simulation model experimentation, BEAST can provide additional iterations of experimentation, offering clarity and high fidelity in areas of interest along potentially steep gradient regions. The objective of this thesis is to research and present BEAST, then compare BEAST's algorithms to other design augmentation techniques. Comparisons are made against traditional methods already implemented in SAS Institute's JMP software and against emerging adaptive sampling techniques such as the Range Adversarial Planning Tool (RAPT). The goal is to gain a deeper understanding of how BEAST works and where it stands in the design augmentation space for practical applications. With an understanding of how BEAST operates and how well it performs, recommendations for future research to improve BEAST's capabilities are presented.
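
The following sketch illustrates the general idea of boundary-targeted adaptive sampling. It is a simplified stand-in, not the BEAST implementation; the threshold, neighbor count, and toy step response are assumptions made only for the example.

```python
# A simplified illustration (not BEAST itself) of boundary-targeted adaptive sampling:
# given existing simulation runs, propose new points near pairs of neighboring samples
# whose responses fall in different performance modes (i.e., straddle a cliff).
import numpy as np
from scipy.spatial import cKDTree

def propose_boundary_points(X, y, threshold, k=5, n_new=10):
    """Propose midpoints between neighboring samples that straddle a response cliff."""
    tree = cKDTree(X)
    candidates = []
    for i, x in enumerate(X):
        # k nearest neighbors of each existing design point (index 0 is the point itself)
        _, idx = tree.query(x, k=k + 1)
        for j in idx[1:]:
            # A large response jump between close neighbors suggests a boundary region
            if abs(y[i] - y[j]) > threshold:
                candidates.append(0.5 * (X[i] + X[j]))   # sample the midpoint
    if not candidates:
        return np.empty((0, X.shape[1]))
    candidates = np.unique(np.round(candidates, 6), axis=0)
    return candidates[:n_new]

# Toy example: a step response with a performance mode boundary at x0 = 0.5
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(40, 2))
y = (X[:, 0] > 0.5).astype(float)
new_points = propose_boundary_points(X, y, threshold=0.5)
print(new_points)
```
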
Date Created
2024
Agent

Design of Experiments and Reliability Growth on Repairable Systems

Description

Reliability growth is not a new topic in either engineering or statistics and has been a major focus for the past few decades. The increasing complexity of high-tech systems and their interconnected components implies that reliability problems will continue to exist and may require more complex solutions. The most heavily used experimental designs in assessing and predicting a system's reliability are the "classical designs", such as full factorial designs, fractional factorial designs, and Latin square designs. They are so heavily used because they are optimal in their own right and have served superbly well in providing efficient insight into the underlying structure of industrial processes. However, cases do arise when the classical designs do not cover a particular practical situation. Repairable systems are such a case, in that they usually have limits on the maximum number of runs or factors with too many levels. This research explores the D-optimal design criterion as it applies to the Poisson regression model on repairable systems with a number of independent variables and under varying assumptions, including the total time tested at a specific design point with fixed parameters, the use of a Bayesian approach with unknown parameters, and how the design region affects the optimal design. Applying experimental design to these complex repairable systems may reveal interactions between stressors and provide better failure data. This novel approach of accounting for time and the design space in the early stages of testing repairable systems should, in theory, improve the reliability, maintainability, and availability of the final engineering design.
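
As a minimal illustration of the fixed-parameter case (a sketch under assumed parameter values and a coarse candidate grid, not the dissertation's full methodology), the snippet below evaluates the D-optimality criterion for a Poisson regression model with a log link, with the test time at each design point entering as an exposure, and searches exhaustively for a small optimal design.

```python
# Evaluate log det of the Fisher information X' W X for Poisson regression with a log
# link, where W reflects the total time tested at each design point, then search a
# coarse grid of candidate points for the best small design. Parameters are assumed.
import numpy as np
from itertools import product

beta = np.array([0.5, -1.0, 0.8])   # assumed fixed parameters: intercept + 2 factors
test_time = 10.0                     # total time tested at each design point

def log_det_information(design):
    """log det of the Fisher information for the Poisson regression model above."""
    X = np.column_stack([np.ones(len(design)), design])
    w = test_time * np.exp(X @ beta)          # expected counts per design point
    M = X.T @ (w[:, None] * X)
    sign, logdet = np.linalg.slogdet(M)
    return logdet if sign > 0 else -np.inf

# Exhaustive search over 4-run designs on a coarse candidate grid (illustration only)
grid = [-1.0, 0.0, 1.0]
candidates = list(product(grid, repeat=2))
best = max(product(candidates, repeat=4), key=lambda d: log_det_information(np.array(d)))
print("D-optimal 4-run design (on this grid):", best)
```
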
Date Created
2023
Agent

Analysis of No-Confounding Designs in 16 Runs for 9-14 Factors

Description

Nonregular designs for 9-14 factors in 16 runs are a vital alternative to the regular minimum aberration resolution III fractional factorials. Because there is no complete aliasing between main effects and two-factor interactions, these designs are useful for avoiding potential confusion in results. However, this kind of design carries another complication: complete confounding among some of the two-factor interactions. This research focuses on using three different analysis methods and comparing their results: stepwise regression, the least absolute shrinkage and selection operator (LASSO), and the Dantzig selector. In previous research, Metcalfe studied the nonregular designs for 6-8 factors and several analysis methods, and developed a new method, the Aliased Informed Model Selection (AIMS), for those designs. This research builds upon that work. Here, simulation with JMP scripting is used to generate random models for analyzing designs from the class of nonregular fractions with 9-14 factors in 16 runs. The cases are then analyzed with the methods above and the success rate of each is recorded. The randomly generated models contain either main effects only, or main effects and two-factor interactions, as active effects, with effect sizes of 2 and 3 standard deviations. The nonregular designs used in this research are the 9-factor and 11-factor designs. Results show clear consistency across all the methods when only main effects are active. However, adding interactions to the active effects degrades the success rate substantially for the Dantzig selector. Moreover, as the number of active effects exceeds approximately half of the degrees of freedom for the design, the performance of all the methods decreases. Finally, recommendations are given for further investigation, such as AIMS, other variable selection methods, and design augmentation.
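
The sketch below shows what one simulation iteration of this kind of study might look like: a response is generated from a 16-run two-level design with a few active main effects at 2 standard deviations, and LASSO is used to declare active effects. The Hadamard-based design matrix is a placeholder, not one of the specific nonregular designs analyzed here.

```python
# One illustrative simulation iteration (placeholder design, not the studied designs):
# generate a response with 3 active main effects of size 2 sigma and check which
# effects LASSO declares active.
import numpy as np
from scipy.linalg import hadamard
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(1)

# Placeholder 16-run, 9-factor two-level design taken from a Hadamard matrix
H = hadamard(16)
X = H[:, 1:10].astype(float)

# Randomly choose 3 active main effects with size 2 sigma (sigma = 1)
truth = rng.choice(9, size=3, replace=False)
beta = np.zeros(9)
beta[truth] = 2.0
y = X @ beta + rng.normal(0, 1, size=16)

lasso = LassoCV(cv=4).fit(X, y)
declared = np.flatnonzero(np.abs(lasso.coef_) > 1e-6)
print("true active:", sorted(truth), "declared active:", declared.tolist())
```
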
Date Created
2022
Agent

Machine Learning Models for High-Dimensional Matched Data

Description

Matching or stratification is commonly used in observational studies to remove bias due to confounding variables. Analyzing matched data sets requires specific methods that handle dependency among observations within a stratum. Also, modern studies often include hundreds or thousands of variables. Traditional methods for matched data sets are challenged by high-dimensional settings, mixed-type variables (numerical and categorical), and nonlinear and interaction effects. Furthermore, machine learning research for such structured data is quite limited. This dissertation addresses this important gap and proposes machine learning models for identifying informative variables from high-dimensional matched data sets. The first part of this dissertation proposes a machine learning model to identify informative variables from high-dimensional matched case-control data sets. The outcome of interest in this study design is binary (case or control), and each stratum is assumed to have one unit from each outcome level. The proposed method, referred to as Matched Forest (MF), is effective with a large number of variables and at identifying interaction effects. The second part of this dissertation proposes three enhancements of the MF algorithm. First, a regularization framework is proposed to improve variable selection performance in excessively high-dimensional settings. Second, a classification method is proposed to classify unlabeled pairs of data. Third, two metrics are proposed to estimate the effects of important variables identified by MF. The third part proposes a machine learning model based on neural networks to identify important variables from a more general matched case-control data set in which each stratum has one unit from the case outcome level and more than one unit from the control outcome level. This method, referred to as Matched Neural Network (MNN), performs better than current algorithms at identifying variables with interaction effects. Lastly, a generalized machine learning model is proposed to identify informative variables from high-dimensional matched data sets where the outcome has more than two levels. This method outperforms existing algorithms in the literature in identifying variables with complex nonlinear and interaction effects.
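
As a simplified illustration of how matched structure can be exploited (not the Matched Forest algorithm itself), the sketch below takes within-stratum differences of the covariates in a 1:1 matched design and ranks variables by random forest importance; the data are synthetic.

```python
# A simplified illustration (not Matched Forest itself) of respecting 1:1 matched
# case-control structure: take within-stratum covariate differences and rank variables
# by how well they separate case from control. Data below are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
n_strata, p = 200, 50

# Hypothetical matched data: one case and one control per stratum; variable 0 is informative
cases = rng.normal(0, 1, size=(n_strata, p))
controls = rng.normal(0, 1, size=(n_strata, p))
cases[:, 0] += 1.0

# Within-stratum differences; randomize the sign so the label is not trivially constant
signs = rng.choice([-1, 1], size=n_strata)
diffs = signs[:, None] * (cases - controls)
labels = (signs == 1).astype(int)   # 1 if the difference is case-minus-control

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(diffs, labels)
top = np.argsort(rf.feature_importances_)[::-1][:5]
print("top-ranked variables:", top.tolist())
```
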
Date Created
2021
Agent

Making Bayesian Optimization Practical in the Context of High Dimensional, Highly Expensive, Black-Box Functions

Description

Complex systems appear when interaction among system components creates emergent behavior that is difficult to predict from component properties. The growth of the Internet of Things (IoT) and embedded technology has increased complexity across several sectors (e.g., automotive, aerospace, agriculture, city infrastructures, home technologies, healthcare) where the paradigm of cyber-physical systems (CPSs) has become a standard. While CPSs enable unprecedented capabilities, they raise new challenges in system design, certification, control, and verification. When optimizing system performance, computationally expensive simulation tools are often required, and search algorithms that sequentially interrogate a simulator to learn promising solutions are in great demand. This class of algorithms is known as black-box optimization. However, the generality that makes black-box optimization desirable also causes computational efficiency difficulties when it is applied to real problems. This thesis focuses on Bayesian optimization, a prominent black-box optimization family, and proposes new principles, translated into implementable algorithms, to scale Bayesian optimization to highly expensive, large-scale problems. Four problem contexts are studied and approaches are proposed for practically applying Bayesian optimization concepts, namely: (1) increasing sample efficiency of a highly expensive simulator in the presence of other sources of information, where multi-fidelity optimization is used to leverage complementary information sources; (2) accelerating global optimization in the presence of local searches by avoiding over-exploitation with adaptive restart behavior; (3) scaling optimization to high-dimensional input spaces by integrating game-theoretic mechanisms with traditional techniques; (4) accelerating optimization by embedding function structure when the reward function is a minimum of several functions. In the first context this thesis produces two multi-fidelity algorithms, a sample-driven and a model-driven approach, which are implemented to optimize a serial production line; in the second context the Stochastic Optimization with Adaptive Restart (SOAR) framework is produced and analyzed with multiple applications to CPS falsification problems; in the third context the Bayesian optimization with sample fictitious play (BOFiP) algorithm is developed with an implementation in high-dimensional neural network training; in the last problem context the minimum surrogate optimization (MSO) framework is produced and combined with both Bayesian optimization and the SOAR framework, with applications in simultaneous falsification of multiple CPS requirements.
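
For readers unfamiliar with the basic machinery, the following is a bare-bones Bayesian optimization loop using a Gaussian process surrogate and expected improvement. It is a generic sketch, not the SOAR, BOFiP, or MSO algorithms developed in this dissertation, and the objective function is a toy stand-in for an expensive simulator.

```python
# A bare-bones Bayesian optimization loop (generic sketch, not the algorithms developed
# in the dissertation): fit a Gaussian process surrogate and pick the next sample by
# maximizing expected improvement over a candidate grid.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expensive_black_box(x):            # stand-in for a costly simulation
    return np.sin(3 * x) + 0.5 * x

rng = np.random.default_rng(3)
X = rng.uniform(0, 3, size=(4, 1))     # small initial design
y = expensive_black_box(X).ravel()
grid = np.linspace(0, 3, 200).reshape(-1, 1)

for _ in range(10):
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True).fit(X, y)
    mu, sigma = gp.predict(grid, return_std=True)
    best = y.min()
    # Expected improvement for minimization
    z = (best - mu) / np.maximum(sigma, 1e-9)
    ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)
    x_next = grid[np.argmax(ei)].reshape(1, -1)
    y_next = expensive_black_box(x_next).ravel()
    X, y = np.vstack([X, x_next]), np.append(y, y_next)

print("best x found:", X[np.argmin(y)].item(), "objective:", y.min())
```
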
Date Created
2021
Agent

The Use of Simulation in a Foundry Setting

Description

Woodland/Alloy Casting, Inc. is an aluminum foundry known for providing high-quality molds to its customers in industries such as aviation, electrical, defense, and nuclear power. However, as the company has grown larger over the past three years, it has begun to struggle with the on-time delivery of its orders. Woodland prides itself on a high-grade process that includes core processing, molding, cleaning, and heat treating. To be completed, each mold has to flow through each part of the system flawlessly. Throughout this process, significant bottlenecks occur that limit the number of molds leaving the system. To combat this issue, this project uses a simulation of the foundry to test how best to schedule work so as to optimize the use of resources. Simulation can be an effective tool for testing improvements in systems where making changes to the physical system is too expensive. ARENA is a simulation tool that allows for manipulation of resources and processes while also allowing both random and selected schedules to be run through the foundry's production process. By using an ARENA simulation to test different scheduling techniques, the risk of missing production runs is minimized during the experimental period, so many different options can be tested to see how they affect the production line. In this project, several feasible scheduling techniques are compared in simulation to determine which schedules allow the highest number of molds to be completed.
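
The thesis itself uses ARENA; as an open-source analogue of the same idea, the sketch below models molds flowing through the four stages named above with the SimPy library, and could be rerun under different release schedules. Processing times, capacities, and the release rule are placeholders, not Woodland's actual parameters.

```python
# A minimal SimPy analogue (the thesis uses ARENA) of molds flowing through the four
# process stages, suitable for comparing release schedules. All numbers are placeholders.
import simpy

PROCESS_TIMES = {"core": 2.0, "molding": 3.0, "cleaning": 1.5, "heat_treat": 4.0}
completed = []

def mold(env, name, stations):
    for stage, station in stations.items():
        with station.request() as req:     # wait for the stage to become available
            yield req
            yield env.timeout(PROCESS_TIMES[stage])
    completed.append((name, env.now))

def release_schedule(env, stations, interarrival=1.0, n_molds=20):
    for i in range(n_molds):
        env.process(mold(env, f"mold-{i}", stations))
        yield env.timeout(interarrival)    # release molds at a fixed interval

env = simpy.Environment()
stations = {stage: simpy.Resource(env, capacity=1) for stage in PROCESS_TIMES}
env.process(release_schedule(env, stations))
env.run(until=200)
print(f"{len(completed)} molds completed; last finished at t = {completed[-1][1]:.1f}")
```
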
Date Created
2019-05
Agent

Statistical Analysis of Power Differences between Experimental Design Software Packages

Description

Based on findings of previous studies, there was speculation that two well-known experimental design software packages, JMP and Design Expert, produced varying power outputs given the same design and user inputs. For context and scope, another popular experimental design software package, Minitab® Statistical Software version 17, was added to the comparison. The study compared multiple test cases run on the three software packages, focusing on 2^k and 3^k factorial designs and adjusting the effect size (in standard deviations), the number of categorical factors, the number of levels, the number of factors, and the number of replicates. All six cases were run on all three programs at one, two, and three replicates each, where possible. There was an issue at the one-replicate stage, however: Minitab does not allow one-replicate full factorial designs, and Design Expert will not provide power outputs for a single replicate unless there are three or more factors. From the analysis of these results, it was concluded that the differences between JMP 13 and Design Expert 10 were well within the margin of error and likely caused by rounding. The differences between JMP 13, Design Expert 10, and Minitab 17, on the other hand, indicated a fundamental difference in the way Minitab addresses power calculation compared to the latest versions of JMP and Design Expert. This is likely caused by Minitab's use of dummy-variable coding as its default, in contrast to the orthogonal coding default of the other two. Although dummy-variable and orthogonal coding for factorial designs do not produce different model results, the coding method does affect the power calculations. All three programs can be adjusted to use either method of coding, but the exact instructions for doing so are difficult to find, so a follow-up guide on changing the coding for factorial variables would help address this issue.
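
To make the comparison concrete, the sketch below shows a textbook power calculation for a single two-level effect in an orthogonally coded factorial, with the effect size expressed in standard deviations. It is not a reproduction of the internal formulas used by JMP, Design Expert, or Minitab.

```python
# Textbook power for one two-level factor effect of size delta (in sigma units) in a
# full factorial with N runs under orthogonal (+/-1) coding; not the packages' internal
# formulas, only the standard noncentral-t version.
import numpy as np
from scipy import stats

def factorial_effect_power(n_runs, n_model_terms, delta_sd, alpha=0.05):
    """Power for one effect in an orthogonally coded factorial, effect size in sigma units."""
    # Under +/-1 coding the coefficient is delta/2 with standard error sigma / sqrt(N),
    # so the noncentrality of the t statistic is (delta/2) * sqrt(N).
    df = n_runs - n_model_terms
    ncp = (delta_sd / 2.0) * np.sqrt(n_runs)
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    return 1 - stats.nct.cdf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)

# Example: 2^3 full factorial with 2 replicates (16 runs), main-effects model, 2-sigma effect
print(round(factorial_effect_power(n_runs=16, n_model_terms=4, delta_sd=2.0), 3))
```
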

Date Created
2017-05
Agent

Modeling Fantasy Baseball Player Popularity Using Twitter Activity

Description

Social media is used by people every day to discuss the nuances of their lives. Major League Baseball (MLB) is a popular sport in the United States, and as such has generated a great deal of activity on Twitter. As fantasy baseball continues to grow in popularity, so does research into better algorithms for picking players. Most of the research done in this area focuses on improving the prediction of a player's individual performance. However, the crowd-sourcing power afforded by social media may enable more informed predictions about players' performances. Most amateur gamblers choose players based on popularity and personal preference. While some of these trends (particularly the long-term ones) are captured by ranking systems, this research focused on predicting the daily spikes in popularity (and therefore price or draft order) by comparing the number of mentions a player received on Twitter to their previous mentions. In doing so, it was demonstrated that improved fantasy baseball predictions can be made by leveraging social media data.
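
A small sketch of the spike measure described above, using made-up mention counts: each day's mentions are compared with the player's own recent baseline to flag days when interest, and therefore likely price or draft order, jumps.

```python
# Compare a player's daily Twitter mention count to a rolling baseline of their own
# recent mentions to flag popularity spikes. Counts below are hypothetical.
import pandas as pd

mentions = pd.DataFrame({
    "player": ["Player A"] * 7,
    "date": pd.date_range("2017-04-01", periods=7),
    "count": [120, 110, 130, 125, 480, 140, 135],   # day 5 is an artificial spike
})

baseline = mentions["count"].rolling(window=3, min_periods=1).mean().shift(1)
mentions["spike_ratio"] = mentions["count"] / baseline
print(mentions[["date", "count", "spike_ratio"]].round(2))
```
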
Date Created
2017-05
Agent

Early Career Performance Models: Regression-Based Forecasting Models for Predicting Future Major League Baseball Player Performance

Description

The widespread use of statistical analysis in sports, particularly baseball, has made it increasingly necessary for small and mid-market teams to find ways to maintain their analytical advantages over large-market clubs. In baseball, an opportunity exists for teams with limited financial resources to sign players under team control to long-term contracts before other teams can bid for their services in free agency. If small and mid-market clubs can successfully identify talented players early, they can save money, achieve cost certainty, and remain competitive for longer periods of time. These deals are also advantageous to players, since they receive job security and greater financial dividends earlier in their careers. The objective of this paper is to develop a regression-based predictive model that teams can use to forecast the performance of young baseball players with limited Major League experience. Several tasks were conducted to achieve this goal: (1) Data was obtained from Major League Baseball and Lahman's Baseball Database and sorted using Excel macros for easier analysis. (2) Players were separated into three positional groups based on similar fielding requirements and offensive profiles: Group I comprises first and third basemen; Group II contains second basemen, shortstops, and center fielders; and Group III contains left and right fielders. (3) Based on the context of baseball and the nature of offensive performance metrics, only players who achieved more than 200 plate appearances within the first two years of their major league debut are included in this analysis. (4) The statistical software package JMP was used to create regression models for each group and analyze the residuals for any irregularities or normality violations. Once the models were developed, slight adjustments were made to improve the accuracy of the forecasts and identify opportunities for future work. It was discovered that Group I and Group III were the easiest player groupings to forecast, while Group II required several attempts to improve the model.
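
As a schematic of the pipeline above (with synthetic rows and placeholder column names rather than the Lahman data actually used), the sketch below filters to players with more than 200 early-career plate appearances, splits them into the three positional groups, and fits a simple regression for each group.

```python
# Schematic of the described pipeline: filter by early plate appearances, split into
# the three positional groups, fit one regression per group. Rows and column names are
# placeholders, not the actual data set.
import pandas as pd
from sklearn.linear_model import LinearRegression

GROUPS = {"I": {"1B", "3B"}, "II": {"2B", "SS", "CF"}, "III": {"LF", "RF"}}

players = pd.DataFrame({
    "position": ["1B", "SS", "RF", "CF", "3B", "LF"],
    "early_pa": [350, 410, 180, 520, 290, 600],      # PA in first two MLB seasons
    "early_ops": [0.780, 0.710, 0.650, 0.820, 0.745, 0.760],
    "future_ops": [0.810, 0.700, 0.640, 0.850, 0.760, 0.770],
})

eligible = players[players["early_pa"] >= 200]
for name, positions in GROUPS.items():
    group = eligible[eligible["position"].isin(positions)]
    if len(group) < 2:
        continue                                      # too few rows to fit anything
    model = LinearRegression().fit(group[["early_ops"]], group["future_ops"])
    print(f"Group {name}: slope = {model.coef_[0]:.2f}, intercept = {model.intercept_:.2f}")
```
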
Date Created
2013-05
Agent