\newcommand{\ourtitle}{Working title: From would-have-beens to should-have-beens: Counterfactuals in model evaluation}
...
...
\begin{abstract}
%We show how a causality-based approach can be used to estimate the performance of prediction algorithms in `selective labels' settings -- with particular application to `bail-or-jail' judicial decisions.
An increasing number of important decisions affecting people's lives are being made by machine learning and AI systems.
We study how to evaluate the quality of such decision makers.
The major difficulty in such an evaluation is that the existing decision makers in use, whether AI or human, influence the data on which the evaluation is based. For example, when
deciding whether a defendant should be given bail or kept in jail, we are not able to directly observe the possible offences of defendants whom the decision-making system in use decides to keep in jail. To evaluate decision makers in these difficult settings, we derive a flexible Bayesian approach that utilizes counterfactual-based imputation. Compared to the previous state of the art, the approach gives more accurate estimates of decision quality, with lower variance. The approach is also shown to be robust to variations in the decision mechanisms that generated the data.
\end{abstract}
...
...
\section{Introduction}
\acomment{'Decision maker' sounds and looks much better than 'decider'! Can we use that?}
\acomment{We should be careful with the word bias and unbiased, they may refer to statistical bias of estimator, some bias in the decision maker based on e.g. race, and finally selection bias.}
\begin{itemize}
\item What we study
...
...
\end{itemize}
\section{The Selective Labels Framework}
We begin by formalizing the selective labels setting.
Let the binary variable $T$ denote a decision, where $T=1$ is interpreted as a positive decision. The binary variable $Y$ measures an outcome that is affected by the decision $T$. The selective labels issue is that, in the observed data, whenever $T=0$ we deterministically\footnote{Alternatively, we could see it as not observing the value of $Y$ when $T=0$, inducing a problem of selection bias.\acomment{Want to keep this interpretation in the footnote not to interfere with the main interpretation.}} have $Y=1$.
For example, let
$T$ denote the decision to jail ($T=0$) or bail ($T=1$) a defendant.
The outcome $Y=0$ then marks that the defendant offended, and $Y=1$ that they did not. When a defendant is jailed ($T=0$), the defendant obviously cannot violate bail, and thus always $Y=1$.
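Concretely, writing $Y^{*}$ for the outcome that would be realized under a positive decision (auxiliary notation used only for this illustration), the recorded label is
\begin{equation*}
Y =
\begin{cases}
Y^{*}, & T = 1,\\
1, & T = 0,
\end{cases}
\end{equation*}
so negative outcomes ($Y=0$) can only ever be recorded for subjects who received a positive decision.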
\subsection{Decision Makers}
A decision maker $D$ makes the decision $T$ based on the characteristics of the subject. A decision maker may be a human or a machine learning system. They seek to predict the outcome $Y$ based on what they know, and then decide $T$ based on this prediction: a negative decision $T=0$ is preferred for subjects predicted to have a negative outcome $Y=0$, and a positive decision $T=1$ when the outcome is predicted to be positive $Y=1$. We especially consider machine learning systems that need to use similar data as used for the evaluation; they also need to take the selective labels issue into account.
In the bail-or-jail example, a decision maker seeks to jail ($T=0$) all dangerous defendants that would violate their bail ($Y=0$), but to bail ($T=1$) the defendants that would not.
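As a minimal sketch of such a machine decision maker (the function names, the logistic-regression model, and the thresholding rule are illustrative assumptions, not a method proposed in this paper), one can fit a risk model only on the subjects whose labels are observed ($T=1$ in the data), and then give positive decisions to the fraction of subjects judged least risky:
\begin{verbatim}
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_decision_maker(X, T, Y):
    # Outcome labels are informative only where T == 1
    # (selective labels): fit the risk model on those cases.
    # X has shape (n, d); T and Y are 0/1 vectors of length n.
    labeled = (T == 1)
    model = LogisticRegression()
    model.fit(X[labeled], Y[labeled])  # Y = 0 marks a failure
    return model

def decide(model, X_new, leniency):
    # Positive decision (T = 1) for the `leniency` fraction of
    # subjects with the highest predicted P(Y = 1).
    p_good = model.predict_proba(X_new)[:, 1]
    threshold = np.quantile(p_good, 1.0 - leniency)
    return (p_good >= threshold).astype(int)
\end{verbatim}
Note that fitting only on the released subjects ignores the selection bias this induces; the sketch merely illustrates the data constraint that such a system faces.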
\subsection{Evaluating Decision Makers}
The goodness of a decision maker can be examined as follows. The acceptance rate (AR) is the number of positive decisions ($T=1$) divided by the number of all decisions. The failure rate (FR) is the number of undesired outcomes ($Y=0$) divided by the number of all decisions.
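In symbols, over $n$ decisions with recorded decision--outcome pairs $(t_i, y_i)$,
\begin{equation*}
\textrm{AR} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\{t_i = 1\},
\qquad
\textrm{FR} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\{y_i = 0\},
\end{equation*}
where $\mathbf{1}\{\cdot\}$ denotes the indicator function.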
% One special characteristic of FR in this setting is that a failure can only occur with a positive decision ($T=1$).
%That means that a failure rate of zero can be achieved just by not giving any positive decisions but that is not the ultimate goal.
A good decision maker makes a large number of positive decisions with a low failure rate.
However, the data we have do not directly provide a way to evaluate FR. If a decision maker decides $T=1$ for a subject that had $T=0$ in the data, the outcome $Y$ recorded in the data is based on the decision $T=0$, and hence $Y=1$ regardless of the decision taken by $D$. The number of negative outcomes $Y=0$ for these decisions needs to be estimated in some non-trivial way.
In the example, the difficulty occurs when a decision maker decides to bail ($T=1$) a defendant that was jailed in the data: we cannot directly observe whether the defendant would have offended or not.
Therefore, the aim here is to estimate the FR at any given AR for any decision maker $D$. Such an estimate is vital for deploying machine learning and AI systems in everyday use.
% Given the selective labeling of data and the latent confounders present, our goal is to create an evaluator module that can output a reliable estimate of a given decider module's performance. We use acceptance rate and failure rate as measures against which we compare our evaluators because they have direct and easily understandable counterparts in the real world / applicable domains. The evaluator module should be able to accurately estimate the failure rate for all levels of leniency and all data sets.
%The "eventual goal" is to create such an evaluator module that it can outperform (have a lower failure on all levels of acceptance rate) the deciders in the data generating process. The problem is of course comparing the performance of the deciders. We try to address that.
\begin{figure}
\caption{$R$ is the leniency of the decision maker, $T$ is a binary decision, and $Y$ is the outcome that is selectively labeled. Background features $X$ for a subject affect both the decision and the outcome. Additional background features $Z$ are visible only to the decision maker in use.}\label{fig:model}
\end{figure}
We model the selective labels setting as summarized in Figure~\ref{fig:model} \cite{lakkaraju2017selective}.
The outcome $Y$ is affected by the observed background factors $X$ and the unobserved background factors $Z$. These background factors also influence the decision $T$ taken in the data. Hence $Z$ includes information that was used by the decision maker in the data but that is not available to us as observations.
There may also be other background factors that affect $Y$ but not $T$. In addition, we assume the decision is affected by an observed leniency level $R \in [0,1]$ of the decision maker.
We use a propensity score framework to model $X$ and $Z$: they are assumed to be continuous Gaussian variables, with the interpretation that they represent summarized risk factors, such that higher values denote a higher risk of a negative outcome ($Y=0$). The Gaussianity assumption is hence motivated by the central limit theorem.
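To make the model concrete, the following is a minimal simulation sketch of the data-generating process in Figure~\ref{fig:model} (the logistic link and the unit coefficients are illustrative assumptions, not the exact specification used in our experiments):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def generate_data(n, leniency):
    # X, Z: Gaussian risk factors; higher value = higher
    # risk of a negative outcome (Y = 0).
    X = rng.normal(size=n)   # observed by us and the decision maker
    Z = rng.normal(size=n)   # observed only by the decision maker
    # The decision maker in the data releases the `leniency`
    # fraction of subjects judged least risky, using X and Z.
    risk = sigmoid(X + Z)
    T = (risk <= np.quantile(risk, leniency)).astype(int)
    # Outcome: high risk lowers P(Y = 1); jailing (T = 0)
    # deterministically yields Y = 1 (selective labeling).
    Y = rng.binomial(1, sigmoid(-(X + Z)))
    Y[T == 0] = 1
    return X, Z, T, Y
\end{verbatim}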
%In this section, we define the key terms used in this paper, present the modular framework for selective labels problems and state our problem.
%Antti: In conference papers we do not waste space for such in this paper stuff!! In journals one can do that.