#### Presentation Title

Improving Algorithmic Fairness Through Sequential Dataset Construction

#### Abstract

Problem Statement:

One form of algorithmic bias is model error that is unevenly distributed across groups because the training data are not representative. Active learning, a subfield of machine learning, frames model building as a sequential data set construction process. Iteratively training a model and building its data set to optimize algorithmic fairness could reveal which aspects of the data affect model fairness and which new data are most useful for improving it.
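As a rough illustration of "unevenly distributed model inaccuracies", the sketch below computes a misclassification rate per group and the gap between the best- and worst-served groups. The function names, the toy labels, and the choice of error-rate gap as the fairness measure are assumptions made for illustration; the abstract does not specify which fairness metric will be used.

```python
from collections import defaultdict


def group_error_rates(y_true, y_pred, groups):
    """Return the misclassification rate for each group label."""
    errors = defaultdict(int)
    counts = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        counts[group] += 1
        errors[group] += int(truth != pred)
    return {group: errors[group] / counts[group] for group in counts}


def error_rate_gap(y_true, y_pred, groups):
    """Difference between the highest and lowest per-group error rate."""
    rates = group_error_rates(y_true, y_pred, groups)
    return max(rates.values()) - min(rates.values())


# Toy example (made-up values): the model is perfect on group "a"
# and wrong on half of group "b".
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

rates = group_error_rates(y_true, y_pred, groups)
gap = error_rate_gap(y_true, y_pred, groups)
```

A gap of zero would mean errors are spread evenly across groups under this particular measure; other fairness definitions can disagree with it.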

Methods:

The ProPublica Machine Bias data set (https://github.com/propublica/compas-analysis) will be used to train a predictive recidivism model. An agent model will be trained to incrementally construct a data set for the recidivism model. Model fairness will be analyzed over the course of training to understand which characteristics of the input data drive improvements in model performance. The agent model's effects will be compared with those of a random model.
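The loop described above can be sketched as follows. This is a minimal sketch under several assumptions, not the method used in this work: the data are synthetic rather than the ProPublica records, the predictive model is a one-feature threshold classifier, and a hand-written heuristic (`worst_group_policy`, which samples from the group with the highest current error rate) stands in for the trained agent. All names are hypothetical.

```python
import random


def make_pool(n_per_group, seed=0):
    """Synthetic (feature, group, label) records; not real COMPAS data."""
    rng = random.Random(seed)
    pool = []
    for group, shift in (("a", 0.0), ("b", 1.5)):
        for _ in range(n_per_group):
            label = rng.randint(0, 1)
            pool.append((rng.gauss(shift + 2.0 * label, 1.0), group, label))
    return pool


def fit_threshold(data):
    """Toy model: the midpoint between the two class means."""
    means = []
    for cls in (0, 1):
        xs = [x for x, _, y in data if y == cls]
        means.append(sum(xs) / len(xs) if xs else float(cls))
    return (means[0] + means[1]) / 2.0


def error_rates(threshold, data):
    """Per-group misclassification rate of the threshold classifier."""
    stats = {}
    for x, group, y in data:
        wrong, total = stats.get(group, (0, 0))
        stats[group] = (wrong + int((x > threshold) != bool(y)), total + 1)
    return {g: wrong / total for g, (wrong, total) in stats.items()}


def random_policy(unlabeled, pool, rates, rng):
    """Baseline: pick any unlabeled record uniformly at random."""
    return rng.choice(unlabeled)


def worst_group_policy(unlabeled, pool, rates, rng):
    """Heuristic stand-in for the agent: sample from the worst-served group."""
    worst = max(rates, key=rates.get)
    candidates = [i for i in unlabeled if pool[i][1] == worst]
    return rng.choice(candidates or unlabeled)


def run(policy, pool, holdout, budget, seed=0):
    """Grow the training set one record at a time, tracking the error gap."""
    rng = random.Random(seed)
    unlabeled = list(range(len(pool)))
    rng.shuffle(unlabeled)
    labeled = [pool[unlabeled.pop()] for _ in range(4)]  # small starting set
    history = []
    for _ in range(budget):
        rates = error_rates(fit_threshold(labeled), holdout)
        history.append(max(rates.values()) - min(rates.values()))
        chosen = policy(unlabeled, pool, rates, rng)
        unlabeled.remove(chosen)
        labeled.append(pool[chosen])
    return history, labeled


pool = make_pool(200, seed=1)
holdout = make_pool(200, seed=2)
random_history, random_set = run(random_policy, pool, holdout, budget=30)
agent_history, agent_set = run(worst_group_policy, pool, holdout, budget=30)
```

Comparing `agent_history` with `random_history` step by step mirrors the planned comparison between the agent and a random model. The heuristic is not guaranteed to beat random selection on any particular run.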

Results:

Pending

Conclusions:

The cost of acquiring new data to improve the representativeness of a data set or the fairness of a model is a limiting factor in health care and criminal justice systems. Methods from active learning and reinforcement learning may significantly reduce the amount of data required to achieve comparable increases in fairness. We hope the results of this work can help people understand what kind of data is needed to improve fairness, how much of it is needed, and what gains can be expected from it. Future directions include incorporating data acquisition costs into a multi-objective optimization problem and exploring the role of feedback (e.g., perpetuating stereotypes) between the model and the environment.

#### Primary Faculty Mentor Name

Chris Danforth

#### Status

Graduate

#### Student College

College of Engineering and Mathematical Sciences

#### Program/Major

Complex Systems

#### Primary Research Category

Engineering & Physical Sciences
