Let the binary variable $T$ denote a decision, where $T=1$ is interpreted as a positive decision. The binary variable $Y$ measures an outcome that is affected by the decision $T$. The selective labels issue is that in the observed data, whenever $T=0$, deterministically $Y=1$.\footnote{Alternatively, we could see not observing the value of $Y$ when $T=0$ as inducing a problem of selection bias.\acomment{Want to keep this interpretation in the footnote not to interfere with the main interpretation.}}
In the bail-or-jail example, $T$ denotes the decision to jail ($T=0$) or bail ($T=1$) a defendant.
The outcome $Y=0$ then marks that the defendant offended, while $Y=1$ marks that the defendant did not. When a defendant is jailed ($T=0$), the defendant obviously cannot violate the bail, and thus always $Y=1$.
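In other words, the recorded data satisfy
\[
P(Y = 1 \mid T = 0) = 1 ,
\]
so a negative decision never reveals the outcome that a positive decision would have produced.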
\begin{figure}
\begin{tikzpicture}[->,auto,node distance=1.8cm]
\node[state] (R) {$R$};
\node[state] (X) [right of=R] {$X$};
\node[state] (T) [below of=X] {$T$};
\node[state] (Z) [rectangle, right of=X] {$Z$};
\node[state] (Y) [below of=Z] {$Y$};
\path (R) edge (T)
      (X) edge (T)
          edge (Y)
      (Z) edge (T)
          edge (Y)
      (T) edge (Y);
\end{tikzpicture}
\caption{$R$ is the leniency of the decision maker, $T$ is a binary decision, and $Y$ is the outcome, observed only for some decisions. Background features $X$ of a subject affect both the decision and the outcome. Additional background features $Z$ are visible only to the decision maker in use.}\label{fig:model}
\end{figure}
\subsection{Decision Makers}
A decision maker $D$ makes the decision $T$ based on the characteristics of the subject; the decision maker may be a human or a machine learning system. It seeks to predict the outcome $Y$ based on what it knows and then decides $T$ based on this prediction: a negative decision $T=0$ is preferred for subjects predicted to have a negative outcome $Y=0$, and a positive decision $T=1$ for subjects predicted to have a positive outcome $Y=1$. We especially consider machine learning systems that need to use data similar to the data used for the evaluation; they also need to take the selective labels issue into account.
Figure~\ref{fig:model} shows the considered framework \cite{lakkaraju2017selective}.
In the bail-or-jail example, a decision maker seeks to jail ($T=0$) all dangerous defendants that would violate their bail ($Y=0$), but to bail ($T=1$) the defendants that would not.
We assume the decision is affected by the leniency level $R \in [0,1]$ of the decision maker, by background factors $X$ observed in the data, and by other background factors $Z$ not observed in the data.
\subsection{Evaluating Decision Makers}
The quality of a decision maker can be examined as follows. The acceptance rate (AR) is the number of positive decisions ($T=1$) divided by the number of all decisions. The failure rate (FR) is the number of undesired outcomes ($Y=0$) divided by the number of all decisions.
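Writing $t_i$ and $y_i$ for the decision and the recorded outcome of subject $i$, for $n$ subjects these rates are
\[
\mathrm{AR} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{t_i = 1\},
\qquad
\mathrm{FR} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{y_i = 0\},
\]
where $\mathbf{1}\{\cdot\}$ denotes the indicator function.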
% One special characteristic of FR in this setting is that a failure can only occur with a positive decision ($T=1$).
%That means that a failure rate of zero can be achieved just by not giving any positive decisions but that is not the ultimate goal.
A good decision maker makes a large number of positive decisions while maintaining a low failure rate.
However, the data we have do not directly provide a way to evaluate FR. If a decision maker decides $T=1$ for a subject that had $T=0$ in the data, the outcome $Y$ recorded in the data is based on the decision $T=0$, and hence $Y=1$ regardless of the decision taken by $D$. The number of negative outcomes $Y=0$ for these decisions therefore needs to be estimated in some non-trivial way.
In the example situation, the difficulty occurs when a decision maker decides to bail ($T=1$) a defendant that was jailed in the data: we cannot directly observe whether the defendant would have offended or not.
This is again the selective labels issue: in the observed data, whenever $T=0$ (i.e., the defendant is jailed), deterministically $Y=1$ (i.e., no offences by the defendant are recorded).
Therefore, the aim here is to estimate the FR at any given AR for any decision maker $D$. Such an estimate is vital for deploying machine learning and AI systems in everyday use.
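The bias from evaluating with recorded labels can be illustrated with a small simulation. The one-dimensional Gaussian risk score, the logistic failure probability, and the median-threshold decision maker below are purely illustrative assumptions, not part of the model described here; the sketch only demonstrates that recorded labels understate the failures of a more lenient decision maker.

```python
import math
import random

random.seed(0)
n = 50_000

# Hypothetical one-dimensional risk score per defendant (illustrative).
risk = [random.gauss(0.0, 1.0) for _ in range(n)]

# True potential outcome if bailed: offend (Y=0) with probability sigmoid(risk).
y_if_bailed = [0 if random.random() < 1.0 / (1.0 + math.exp(-r)) else 1
               for r in risk]

# The decision maker in the data bails (T=1) the less risky half.
threshold = sorted(risk)[n // 2]
t_data = [1 if r < threshold else 0 for r in risk]

# Selective labels: whenever T=0 (jail), the recorded outcome is Y=1.
y_recorded = [y if t == 1 else 1 for y, t in zip(y_if_bailed, t_data)]

# Evaluate a fully lenient decision maker D that bails everyone (AR = 1):
# the naive estimate from recorded labels misses failures among the jailed.
fr_naive = sum(1 for y in y_recorded if y == 0) / n
fr_true = sum(1 for y in y_if_bailed if y == 0) / n
print(f"naive FR: {fr_naive:.3f}, true FR: {fr_true:.3f}")
```

Because the jailed half contains the riskiest defendants, the naive estimate systematically undercounts the failures that $D$'s lenient decisions would actually produce.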
% Given the selective labeling of data and the latent confounders present, our goal is to create an evaluator module that can output a reliable estimate of a given decider module's performance. We use acceptance rate and failure rate as measures against which we compare our evaluators because they have direct and easily understandable counterparts in the real world / applicable domains. The evaluator module should be able to accurately estimate the failure rate for all levels of leniency and all data sets.
%The "eventual goal" is to create such an evaluator module that it can outperform (have a lower failure on all levels of acceptance rate) the deciders in the data generating process. The problem is of course comparing the performance of the deciders. We try to address that.
\subsection{Modeling the Situation}
We model the selective labels setting as summarized in Figure~\ref{fig:model} \cite{lakkaraju2017selective}.
The outcome $Y$ is affected by the observed background factors $X$ and by the unobserved background factors $Z$. These background factors also influence the decision $T$ taken in the data; hence $Z$ includes information that was used by the decision maker in the data but that is not available to us as observations.
In addition, there may be other background factors that affect $Y$ but not $T$. Finally, we assume the decision is affected by an observed leniency level $R \in [0,1]$ of the decision maker.
We use a propensity score framework to model $X$ and $Z$: they are assumed continuous Gaussian variables, with the interpretation that they represent summarized risk factors such that higher values denote higher risk for a negative outcome ($Y=0$). Hence the Gaussianity assumption here is motivated by the central limit theorem.
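The graphical model of Figure~\ref{fig:model} can be turned into a small generative sketch. The logistic links and unit coefficients below are illustrative assumptions layered on top of the stated Gaussianity of $X$ and $Z$; they are not part of the model specification.

```python
import math
import random

random.seed(1)

def logit(p):
    return math.log(p / (1.0 - p))

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def simulate_case(r):
    """One subject judged by a decision maker with leniency r in (0, 1).

    x and z are Gaussian risk factors (higher value = higher risk of Y=0);
    x is recorded in the data, z is seen only by the decision maker.
    """
    x = random.gauss(0.0, 1.0)
    z = random.gauss(0.0, 1.0)

    # Decision: higher leniency r raises the chance of a positive decision,
    # higher risk x + z lowers it (the logistic link is an illustrative choice).
    t = 1 if random.random() < sigmoid(logit(r) - x - z) else 0

    # Outcome: under T=0 the recorded outcome is deterministically Y=1;
    # under T=1 the subject fails (Y=0) with probability sigmoid(x + z).
    y = 1 if t == 0 else (0 if random.random() < sigmoid(x + z) else 1)
    return x, t, y

cases = [simulate_case(r=0.7) for _ in range(10_000)]
ar = sum(t for _, t, _ in cases) / len(cases)
fr = sum(1 for _, _, y in cases if y == 0) / len(cases)
print(f"AR: {ar:.2f}, FR: {fr:.2f}")
```

Note that in any data generated this way, failures can only occur together with positive decisions, so FR never exceeds AR, and all subjects with $T=0$ have the recorded outcome $Y=1$.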