Systems for Big Data, Data Science, and Machine Learning

This work focuses on information infrastructures and systems for data science and machine learning, with an emphasis on graph data management and mining, cloud data management, data stream analytics, interactive data exploration, uncertain data management, high-performance genomic data processing, and RFID/sensor data management.

Interplay Between Data and Models

In the past few years, data management research has become intertwined with research on simulation, machine learning, and optimization models in novel ways, with the aim of supporting robust decision making under uncertainty. Our research has addressed a variety of challenges arising at the interface of models and data. These include methods for moving stochastic analytics closer to the data, scaling model-based analytics over large data, using data for semi-automatic model creation, feeding data-hungry models in the presence of sparse data, efficiently maintaining models in the face of changing data, overcoming biases in training data, and using data to engender trust in a model.

Usability and Analysis

Because data is now a staple of so many aspects of human activity, the audience for data technologies has expanded to include a varied range of users, from non-experts wishing to peruse datasets to domain experts with specialized data processing needs. Data systems have not adapted to address these demands effectively: databases’ specialized query languages and structure create barriers for non-experts, while the lack of native support for important computing needs leaves experts to develop application-specific solutions themselves. Our work removes data-use barriers by simplifying non-experts' access to data and by augmenting database functionality with advanced problem-solving capabilities, which simplifies analytics workflows by moving them closer to the data.

Provenance, Causality, and Explanations

Data is critical in almost every aspect of society, including education, technology, healthcare, the economy, and science. Poor understanding and handling of data, poor data quality, and errors in data-driven processes are detrimental in all domains that rely on data. The goal of this research is to address these challenges: to develop tools that improve our understanding of data and facilitate the diagnosis of errors, and to extend the capabilities of modern database systems to support complex decision and strategy-planning queries.

Fairness and Diversity

Data-driven software has the ability to shape human behavior: it affects the products we view and purchase, the news articles we read, the social interactions we engage in, and, ultimately, the opinions we form. Yet data is an imperfect medium, tainted by errors, omissions, and biases. As a result, discrimination shows up in many data-driven applications, such as advertisements, hotel bookings, image search, and vendor services. Biases in data and software risk forming, propagating, and perpetuating biases in society. Data management research should develop tools to detect bias, skew, and misuse in data-driven processes, to inform users about them, and to mitigate their effects.

Private Dissemination and Analysis of Data

The goal of this work is to understand how accurately aggregate properties of a dataset can be studied while preserving the privacy of individual participants. Our recent work focuses on complex graph-structured data and trace data. Details, publications, and code releases are available on the project pages.

Privacy, Provenance, and Data Retention

The goal of this work is to achieve the benefits of preserving history (accountability through the ability to audit the past) while avoiding the threats to privacy posed by preserved data. Our work has included investigations of database forensics and models for the protection of audit histories. Details and publications are available on the project page.