Upcoming Events:

PhD Defence: Yu Huang

Date & Time:

Microsoft Teams Channel: CAS Graduate Seminars and Defences

Event Contact:
Dr. Emil Sekerinski, Chair
Dr. Farhana H. Zulkernine
Dr. Frantisek Franek
Dr. Wenbo He
Dr. Fei Chiang, Supervisor

Relational Data Curation by Deduplication, Anonymization, and Diversification


Enterprises have been acquiring large amounts of data from a variety of sources with the goal of extracting valuable insights and enabling more informed analysis. These processes rely on the data analysis algorithms and techniques used, as well as on the quality of the acquired datasets. Unfortunately, organizations continue to be hindered by poor data quality as they wrangle their data to extract value, since real datasets are rarely error-free. Poor data quality is a pervasive problem that spans all industries and can cause imprecision and inconsistency in analysis tasks, costing billions of dollars. The volume of the datasets, the pace of data acquisition, and the variety of data sources make these data hard to clean, especially when data privacy and data diversity need to be considered. In this thesis, we propose three algorithms that address, respectively, data duplication, data cleaning and data privacy, and data diversity.