Improving Algorithmic Fairness Through Sequential Dataset Construction
Conference Year
January 2019
Abstract
Problem Statement:
One potential form of algorithmic bias is model inaccuracy that falls unevenly across groups because the training data are not representative. In the machine learning subfield of active learning, model building is framed as a sequential data set construction process. Iteratively training a model while building its data set, with algorithmic fairness as the optimization target, could yield insights into which aspects of the data affect fairness and which new data are most useful for improving it.
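The abstract does not fix a fairness metric. One natural choice, and the disparity ProPublica highlighted in its COMPAS analysis, is the gap in false positive rates between demographic groups. A minimal sketch in Python (the helper names are ours, not part of the proposed method):

import numpy as np

def false_positive_rate(y_true, y_pred):
    # FPR = FP / (FP + TN), computed over the true negative (non-recidivating) class.
    negatives = y_true == 0
    if negatives.sum() == 0:
        return 0.0
    return (y_pred[negatives] == 1).mean()

def fpr_gap(y_true, y_pred, group):
    # Spread between the best and worst group-wise false positive rates.
    rates = [false_positive_rate(y_true[group == g], y_pred[group == g])
             for g in np.unique(group)]
    return max(rates) - min(rates)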
Methods:
The ProPublica Machine Bias data set (https://github.com/propublica/compas-analysis) will be used to train a predictive recidivism model. An agent model will be trained to incrementally construct the recidivism model's training set. Model fairness will be analyzed over the course of training to identify the characteristics of the input data that drive improvements in fairness and predictive performance. The effects of the agent model will be compared against a random-selection baseline.
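As a concrete illustration of this pipeline, the sketch below pairs a scikit-learn logistic regression with a greedy, fairness-driven acquisition rule, reusing the false_positive_rate and fpr_gap helpers above. This is an assumption-laden stand-in: the abstract's agent model may well be a learned (e.g., reinforcement-learning) policy rather than the fixed heuristic shown here, and all function and parameter names are hypothetical.

import numpy as np
from sklearn.linear_model import LogisticRegression

def sequential_construction(X_pool, y_pool, g_pool, X_val, y_val, g_val,
                            n_rounds=50, batch=20, seed=0):
    rng = np.random.default_rng(seed)
    # Start from a small random seed set, then grow it round by round.
    labeled = rng.choice(len(X_pool), size=100, replace=False).tolist()
    unlabeled = sorted(set(range(len(X_pool))) - set(labeled))
    gap_history = []
    for _ in range(n_rounds):
        model = LogisticRegression(max_iter=1000)
        model.fit(X_pool[labeled], y_pool[labeled])
        y_hat = model.predict(X_val)
        # Track fairness over the course of data set construction.
        gap_history.append(fpr_gap(y_val, y_hat, g_val))
        # Illustrative acquisition rule: draw the next batch from the group
        # with the worst current false positive rate, on the intuition that
        # more data from that group reduces its excess error.
        rates = {g: false_positive_rate(y_val[g_val == g], y_hat[g_val == g])
                 for g in np.unique(g_val)}
        worst = max(rates, key=rates.get)
        candidates = [i for i in unlabeled if g_pool[i] == worst] or unlabeled
        picks = rng.choice(candidates, size=min(batch, len(candidates)),
                           replace=False).tolist()
        labeled += picks
        unlabeled = sorted(set(unlabeled) - set(picks))
    return labeled, gap_history

The random baseline described above corresponds to replacing the acquisition rule with a uniform draw from the unlabeled pool; comparing gap_history between the two variants gives the fairness-over-training comparison.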
Results:
Pending
Conclusions:
The cost of acquiring new data to improve the representativeness of a data set, or the fairness of a model, is one limiting factor in health care and criminal justice systems. Applying methods from active learning and reinforcement learning could substantially reduce the amount of data required to achieve comparable gains in fairness. We hope the results of this work help clarify what kind of data is needed to improve fairness, how much of it is required, and what gains can be expected from new data. Future directions include incorporating data acquisition costs into a multi-objective optimization problem (a sketch of one possible trade-off follows below) and exploring the role of feedback between the model and its environment (e.g., perpetuating stereotypes).
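One way the multi-objective direction might be set up, purely as a hypothetical sketch: score each candidate example by its estimated fairness gain minus a cost penalty, with a weight trading the two objectives off. The names and the linear trade-off are assumptions for illustration, not part of the proposed method.

def cost_aware_score(est_fairness_gain, acquisition_cost, lam=0.1):
    # Larger is better; lam sets how strongly acquisition cost is penalized.
    return est_fairness_gain - lam * acquisition_cost

def select_batch(candidates, gains, costs, batch=20, lam=0.1):
    # Rank candidate indices by cost-adjusted score and keep the top batch.
    ranked = sorted(candidates,
                    key=lambda i: cost_aware_score(gains[i], costs[i], lam),
                    reverse=True)
    return ranked[:batch]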
Primary Faculty Mentor Name
Chris Danforth
Status
Graduate
Student College
College of Engineering and Mathematical Sciences
Program/Major
Complex Systems
Primary Research Category
Engineering & Physical Sciences