deciding whether a defendant should be given bail or kept in jail, we are not a
edge (Y)
(T) edge (Y);
\end{tikzpicture}
\caption{$R$ is the leniency of the decision maker, $T$ is a binary decision, and $Y$ is the outcome, observed only for some decisions. Background features $X$ of a subject affect both the decision and the outcome. Additional background features $Z$ are visible only to the decision maker. }\label{fig:model}
\end{figure}
Figure~\ref{fig:model} shows the considered framework \cite{lakkaraju2017selective}.
Binary variable $T$ denotes the decision: when $T=0$ the defendant is jailed and when $T=1$ the defendant is given bail. We assume the decision is affected by the leniency level $R \in [0,1]$, by background factors $X$ observed in the data, and by other background factors $Z$ not observed in the data. We model $X$ and $Z$ as continuous Gaussian variables. These are propensity scores~\cite{}.
The binary variable $Y$ measures the outcome: if $Y=0$ the defendant offended and if $Y=1$ the defendant did not. This outcome is affected by the observed background factors $X$ and the unobserved background factors $Z$. In addition, there may be other background factors that affect $Y$ but not $T$.
The selective labels issue is that in the observed data, when $T=0$ (i.e. the defendant is jailed) then deterministically\footnote{Alternatively, we could see it as not observing the value of $Y$ when $T=0$, inducing a problem of selection bias.\acomment{Want to keep this interpretation in the footnote not to interfere with the main interpretation.}} $Y=1$ (i.e. no offences by the defendant).
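The framework above can be illustrated with a small simulation. The functional forms below (logistic risk, quantile-based bail decision) are illustrative assumptions of ours, not specified in the text; only the variable roles ($R$, $X$, $Z$, $T$, $Y$) and the selective labeling of $Y$ come from the model description.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Background features: X observed in the data, Z seen only by the decision maker.
x = rng.normal(size=n)
z = rng.normal(size=n)
# Leniency R in [0, 1] of the decision maker handling each subject.
r = rng.uniform(size=n)

# Assumed risk model: higher x + z means higher risk of offending.
risk = 1.0 / (1.0 + np.exp(-(x + z)))

# Assumed decision rule: bail (T=1) when the subject's risk is below the
# quantile of the risk distribution implied by the decision maker's leniency.
t = (risk < np.quantile(risk, r)).astype(int)

# Latent outcome: Y=0 means the defendant would offend (again an assumed form).
y_latent = (rng.uniform(size=n) > risk).astype(int)

# Selective labels: a jailed defendant (T=0) cannot offend, so the recorded
# outcome is deterministically Y=1.
y_observed = np.where(t == 1, y_latent, 1)
```

Under this sketch, the overall fraction of positive decisions tracks the average leniency, and every $T=0$ record carries the deterministic label $Y=1$.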
\subsection{Decision Makers}
We especially consider machine learning systems that use data similar to that used for the evaluation.
\subsection{Evaluation}
Acceptance rate (AR) is the number of positive decisions ($T=1$) divided by the number of all decisions. Failure rate (FR) is the number of undesired outcomes ($Y=0$) divided by the number of all decisions.
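These two definitions can be written out as a short sketch; the function names are ours, and the example decision and outcome vectors are hypothetical.

```python
import numpy as np

def acceptance_rate(t):
    """AR: fraction of positive decisions (T=1) among all decisions."""
    t = np.asarray(t)
    return (t == 1).mean()

def failure_rate(t, y):
    """FR: fraction of undesired outcomes (Y=0) among all decisions.

    A failure (Y=0) can only be recorded together with a positive
    decision (T=1), since Y=1 deterministically when T=0.
    """
    t, y = np.asarray(t), np.asarray(y)
    return ((t == 1) & (y == 0)).mean()

# Hypothetical data: 5 decisions, one bailed defendant offended.
t = np.array([1, 1, 0, 1, 0])
y = np.array([1, 0, 1, 1, 1])  # Y is deterministically 1 where T=0
ar = acceptance_rate(t)   # 0.6
fr = failure_rate(t, y)   # 0.2
```

Note that both rates are normalized by the number of all decisions, not by the number of positive decisions.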
% One special characteristic of FR in this setting is that a failure can only occur with a positive decision ($T=1$).
%That means that a failure rate of zero can be achieved just by not giving any positive decisions but that is not the ultimate goal.
A good decision maker makes a large number of positive decisions with a low failure rate. Therefore, the goal is to estimate the FR at any given AR for any decision maker $D$. The difficulty occurs when a decision maker decides to bail a defendant who was jailed in the data: we cannot directly observe whether the defendant would have offended or not.
% Given the selective labeling of data and the latent confounders present, our goal is to create an evaluator module that can output a reliable estimate of a given decider module's performance. We use acceptance rate and failure rate as measures against which we compare our evaluators because they have direct and easily understandable counterparts in the real world / applicable domains. The evaluator module should be able to accurately estimate the failure rate for all levels of leniency and all data sets.