Unsupervised Bayesian data cleaning techniques for structured data
Description
Recent efforts in data cleaning have focused mostly on problems such as data deduplication, record matching, and data standardization; few of them address fixing incorrect attribute values in tuples. Correcting values in tuples is typically done by finding a minimum-cost repair of the tuples that violate static constraints such as conditional functional dependencies (CFDs), which have to be provided by domain experts or learned from a clean sample of the database. In this thesis, I provide a method for correcting individual attribute values in a structured database using a Bayesian generative model and a statistical error model, both learned directly from the noisy database. I thus avoid the need for a domain expert or master data. I also show how to efficiently perform consistent query answering over a dirty database using this model, for cases where write permissions to the database are unavailable. I also present a MapReduce architecture for performing this computation in a distributed manner. I evaluate these methods on both synthetic and real data.
Date Created
2014
Agent
- Author (aut): De, Sushovan
- Thesis advisor (ths): Kambhampati, Subbarao
- Committee member: Chen, Yi
- Committee member: Candan, K. Selcuk
- Committee member: Liu, Huan
- Publisher (pbl): Arizona State University