Detecting Specification Mismatches using Machine Learning-Based Analysis of CPU Manuals

Description
Correctly implemented instructions are essential to computer architecture and to the security of a computer. Incorrectly implemented instructions create a risk of security vulnerabilities such as privilege escalation. Current methods of checking for specification mismatches are variants of the manual approach and symbolic execution. These methods can be time-consuming or suffer from scalability and efficiency problems. This thesis proposes improving on them with the aid of machine learning, specifically large language models (LLMs), tested on the RISC-V architecture. RISC-V is chosen for its simplicity and its smaller instruction set compared to architectures like x86. ChatGPT is the LLM of choice because of its rising popularity and its capability. The approach combines manual steps with ChatGPT to test how well ChatGPT generates expressions and test cases for detecting specification mismatches. The ChatGPT-generated test cases are evaluated on a RISC-V framework to determine whether they can be used in the future to detect specification mismatches and whether the approach can be applied to more complicated architectures.
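As a minimal illustration of what a test case for a specification mismatch can look like, the sketch below encodes the RV32I ADDI semantics (add the sign-extended 12-bit immediate to rs1 and keep the low 32 bits) as a reference expression and compares an implementation against it. The function names, the deliberately faulty implementation, and the test inputs are hypothetical and are not taken from the thesis.

```python
MASK32 = 0xFFFFFFFF  # RV32: register results wrap to 32 bits


def sign_extend_12(imm):
    """Interpret the low 12 bits of imm as a signed two's-complement value."""
    imm &= 0xFFF
    return imm - 0x1000 if imm & 0x800 else imm


def addi_spec(rs1, imm):
    """Reference expression for ADDI: rs1 + sext(imm), overflow ignored."""
    return (rs1 + sign_extend_12(imm)) & MASK32


def addi_buggy(rs1, imm):
    """Hypothetical faulty implementation that zero-extends the immediate."""
    return (rs1 + (imm & 0xFFF)) & MASK32


def find_mismatches(impl, cases):
    """Return the (rs1, imm) inputs where impl disagrees with the reference."""
    return [(rs1, imm) for rs1, imm in cases if impl(rs1, imm) != addi_spec(rs1, imm)]


# Illustrative inputs: small values, the largest positive immediate,
# negative immediates, and 32-bit wraparound.
CASES = [(0, 1), (5, 0x7FF), (5, 0xFFF), (0xFFFFFFFF, 1), (0, 0x800)]

if __name__ == "__main__":
    for rs1, imm in find_mismatches(addi_buggy, CASES):
        print(f"mismatch: rs1={rs1:#x} imm={imm:#x}")
```

Only the inputs with a negative immediate expose the faulty implementation here, which is why test inputs for such a check need to cover the boundary values of each operand.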
Date Created
2024

Accelerating Deep Learning Inference in Relational Database Systems

Description
Deep learning has become a potent method for drawing conclusions and forecasts from massive amounts of data. In practical applications, however, conventional deep learning frameworks frequently run into problems, especially when the data is stored in relational database systems. Recent years have therefore seen a stream of research on integrating machine learning model inference with relational databases, with benefits such as avoiding privacy issues and data transfer overheads. The logic for performing inference with a DNN model can be encapsulated in a user-defined function (UDF). These UDFs can then be integrated with the query interface of the DBMS and executed by the query execution engine. While it is relatively straightforward to use UDFs to implement machine learning algorithms with parallelism, such implementations are not always optimal and may have trouble balancing the database's threading against the threading of the libraries that the UDFs invoke. Since relational databases provide native support for relational operators, a cost model can be used to decide when to selectively transform UDF-based inference logic into a model-parallel implementation for optimal performance. This thesis therefore focuses on the following: 1. Designing a domain-specific language for implementing UDFs using the Velox library, which can be lowered to a graph-based intermediate representation (IR); 2. Providing a cost model that aids the decision to convert a UDF-centric implementation to a relation-centric one.
Date Created
2024

Regression Test Suite for Logic Tutor on the Web

Description
This Creative Project has two goals: to learn and leave documentation on the version control system Git, and to develop a regression test suite using different testing strategies. Through research of various sources, a set of 62 test cases was developed to verify that a program's logic is correct. As a result, a few defects were found in the source code and reported to the developer.
Date Created
2023-12

Optimizing Consistency and Performance Trade-off in Distributed Log-Structured Merge-Tree-based Key-Value Stores

Description
Distributed databases, such as Log-Structured Merge-Tree Key-Value Stores (LSM-KVS), are widely used in modern infrastructure. One of the primary challenges in these databases is ensuring consistency, meaning that all nodes have the same view of data at any given time. However, maintaining consistency requires a trade-off: the stronger the consistency, the more resources are needed to replicate data across replicas, which decreases database performance. Addressing this trade-off poses two challenges: first, developing and managing multiple consistency levels within a single system, and second, assigning consistency levels so as to balance the consistency-performance trade-off effectively. This thesis introduces Self-configuring Consistency In Distributed LSM-KVS (SCID), a service that leverages unique properties of LSM-KVS to manage consistency levels and automates level assignment with ML. To address the first challenge, SCID combines dynamic read-only instances and logical KV-based partitions to enable on-demand updates of read-only instances and to facilitate the logical separation of groups of key-value pairs. SCID uses logical partitions as consistency levels and on-demand updates in dynamic read-only instances to allow for multiple consistency levels. To address the second challenge, the thesis presents an ML-based solution, SCID-ML, to manage the consistency-performance trade-off more effectively. We evaluate SCID and find that it improves write throughput by up to 50% and achieves 62% accuracy for consistency-level predictions.
Date Created
2023

Analyzing the Impact of Software Configurations on Dynamic Code Coverage

Description
Large software systems tend to have a large number of configuration options that can be tuned to varying degrees in order to run the software in a specific way. These configuration options change the execution of the software and therefore affect its code coverage. This gives rise to the problem of measuring how much a given configuration change affects the code coverage of the software. It also raises the question of how to map code coverage to a configuration change effectively. Solutions to these problems could increase efficiency in various areas of software security, such as maximizing code coverage in fuzz testing and identifying vulnerabilities in specific configurations. In this work, I analyze widely used software, such as the database cache Redis and web servers like Nginx and Apache httpd. I perform fuzz tests on multiple configurations of each of these programs to measure the difference in code coverage caused by each configuration. I use coverage instrumentation to obtain traces for each program in each of its configurations, and then I analyze these traces to understand the configuration's impact on the software's code coverage. In conclusion, I describe a method to measure how much code coverage differs for each configuration with respect to the default configuration of the software, show that certain configurations differ from the default configuration in code coverage much more than others, analyze the overlap in code coverage between the configurations, and finally find the root causes of the differing code coverage.
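One simple way to quantify how far a configuration's coverage departs from the default is to treat each trace as a set of covered code locations and compare the sets. The sketch below is an illustration under that assumption only: the function name, the use of Jaccard distance as the metric, and the sample location IDs are hypothetical and are not taken from the thesis.

```python
def coverage_delta(default_trace, config_trace):
    """Compare two coverage traces given as iterables of covered location IDs."""
    default_cov = set(default_trace)
    config_cov = set(config_trace)
    overlap = default_cov & config_cov
    union = default_cov | config_cov
    # Jaccard distance: 0.0 means identical coverage, 1.0 means no overlap.
    distance = 1.0 - len(overlap) / len(union) if union else 0.0
    return {
        "only_in_config": config_cov - default_cov,
        "only_in_default": default_cov - config_cov,
        "overlap": overlap,
        "distance": distance,
    }


if __name__ == "__main__":
    # Hypothetical location IDs (for example basic-block or edge identifiers).
    default_trace = [1, 2, 3, 4]
    config_trace = [3, 4, 5]
    result = coverage_delta(default_trace, config_trace)
    print(sorted(result["only_in_config"]), sorted(result["only_in_default"]))
    print(round(result["distance"], 3))
```

The locations that appear only under the non-default configuration are a natural starting point when looking for the root cause of a coverage difference.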
Date Created
2023