\section{Introduction}
%\acomment{'Decision maker' sounds and looks much better than 'decider'! Can we use that?}
%\acomment{We should be careful with the word bias and unbiased, they may refer to statistical bias of estimator, some bias in the decision maker based on e.g. race, and finally selection bias.}
Nowadays, many decisions that affect the course of human lives are being automated. In addition to lowering costs, computational models can enhance decision making in terms of accuracy and fairness.
The advantage of using models does not necessarily lie in pure throughput -- that a machine can make more decisions -- but rather in that a machine can provide bounds on its uncertainty, can learn from a vast set of information, and, with care, can be made as unbiased as possible.
However, before deploying any decision-making algorithm, it should be evaluated to show that it actually improves on the previous, often human, decision maker -- a judge, a doctor, etc. -- whose decisions determine which outcome labels are available. This evaluation is far from trivial.
%Although, evaluating algorithms in conventional settings is trivial, when (almost) all of the labels are available, numerous metrics have been proposed and are in use in multiple fields.
Specifically, `selective labels' settings arise in situations where the data are the product of a decision mechanism that prevents us from observing outcomes for part of the data \cite{lakkaraju2017selective}. As a typical example, consider bail-or-jail decisions in judicial settings: a judge decides whether to grant bail to a defendant based on whether the defendant is considered likely to violate bail conditions while awaiting trial -- and therefore a violation can occur only in case bail is granted. Similar scenarios arise naturally in many applications, from economics to medicine.
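To make the bail example concrete, the selective labeling of outcomes can be sketched as follows; the notation here is purely illustrative (it is not the notation used later in the paper), with $t \in \{0,1\}$ denoting the bail decision and $y$ the outcome of interest:
% Illustrative sketch only: formal notation is introduced in later sections.
\[
y_{\mathrm{obs}} =
\begin{cases}
y, & \text{if } t = 1 \text{ (bail granted)},\\
\text{unobserved}, & \text{if } t = 0 \text{ (defendant jailed)}.
\end{cases}
\]
That is, the recorded label is itself a function of the decision, so the observed labels cannot be treated as a random sample of the population.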
%For example, given a data set of bail violations and bail/jail decision according some background factors, there will never be bail violations on those subjects kept in jail by the current decision making mechanism, hence the evaluation of a decision bailing such subjects is left undefined.
Such settings give rise to questions about the effect of alternative decision mechanisms -- e.g., `how many defendants would violate bail conditions if more bail decisions were granted?'. In other words, one faces the challenge of estimating the performance of an alternative, potentially automated, decision policy that might make different decisions than the ones found in the existing data.
%In settings like judicial bail decisions, some outcomes cannot be observed due to the nature of the decisions.
This can be seen as a complicated missing data problem in which the missingness of an item is connected with its outcome, so that the available labels are not a random sample of the true population. Lakkaraju et al. recently named this the selective labels problem \cite{lakkaraju2017selective}. Related issues have been addressed in the causal inference literature under the heading of selection bias, although that discussion has mainly concentrated on recovering causal effects, and the model structure considered has usually been different (Pearl, Bareinboim, etc.). Pearl calls the unobservability of the outcome under an alternative decision the `fundamental problem' of causal inference \cite{bookofwhy}.
Recently, Lakkaraju et al. presented a method called contraction for evaluating decision-making mechanisms \cite{lakkaraju2017selective}. It assumes that subjects are randomly assigned to decision makers with given leniency levels. Under these assumptions, an estimate of the performance can be obtained by essentially considering the most lenient decision maker. Contraction was shown to perform well compared to methods previously presented.
For contraction to work, we need lenient decision makers who decide on a large number of subjects. If the goal is to obtain a better decision maker, sufficiently lenient decision makers may not exist in the data. Furthermore, evaluating on only one decision maker when possibly many are present may produce estimates with higher variance. In reality, the decision makers in the data are not perfect but may, for example, be biased. Our aim is to develop a method that overcomes these challenges and limitations.
In this paper we propose a novel, modular framework that provides a systematic way of evaluating decision makers from selectively labeled data. Our approach is based on imputing the missing labels using counterfactual reasoning. We also build on Jung et al., who present a method for constructing optimal policies, and show that their approach can be applied to the selective labels setting as well \cite{jung2018algorithmic}.
%to evaluate the performance of predictive models in settings where selective labeling and latent confounding is present. We use theory of counterfactuals and causal inference to formally define the problem.
We define a flexible Bayesian model and perform the inference with state-of-the-art tools.
%\begin{itemize}
%\item What we study
% \begin{itemize}
% \item We studied methods to evaluate the performance of predictive algorithms/models when the historical data suffers from selective labeling and unmeasured confounding.
% \end{itemize}
%item Motivation for the study%
% \begin{itemize}
% \item %Fairness has been discussed in the existing literature and numerous publications are available for interested readers. Our emphasis on this paper is on pure performance, getting the predictions accurate.
% \item
% \item %
% \end{itemize}
%\item Present the setting and challenge:
% \begin{itemize}
% \item
%
% \item
% \item %Characteristically, in many of the settings the decisions hiding the outcomes are made by different deciders
% \item Labels are missing non-randomly, decisions might be made by different deciders who differ in leniency.
% \item So this might lead to situation where subjects with same characteristics may be given different decisions due to the differing leniency.
% \item Of course the differing decisions might be attributable to some unobserved information that the decision-maker might have had available ude to meeting with the subject.
% \item %The explainability of black-box models has been discussed in X. We don't discuss fairness.
% \item In settings like judicial bail decisions, some outcomes cannot be observed due to the nature of the decisions. This results in a complicated missing data problem where the missingness of an item is connected with its outcome and where the available labels aren't a random sample of the true population. Recently this problem has been named the selective labels problem.
%\end{itemize}
%\item
%Related work
%\begin{itemize}
% \item In %the original paper, Lakkaraju et al. presented contraction which performed well compared to other methods previously presented in the literature.
% \item We wanted to benchmark our approach to that and show that we can improve on their algorithm in terms of restrictions and accuracy.
% \item %Restrictions = our method doesn't have so many assumptions (random assignments, agreement rate, etc.) and can estimate the performance on all levels of leniency despite the judge with the highest leniency. See fig 5 from Lakkaraju
% \item
% \item They didn't have selective labeling nor did they consider that the judges would differ in leniency.%
% \item
% \item %Latent confounding has bee discussed by X when discussing the effect of latent confounders to ORs. ec etc.
% \end{itemize}
%\item Our contribution
% \begin{itemize}
% \item
% \end{itemize}
%\end{itemize}