%!TEX root = sl.tex
% The above command helps compiling in TexShop on a Mac. Hitting typeset compiles sl.tex directly instead of producing an error here.

\section{Experiments}

\todo{Michael}{We should make sure that what we describe in the experiments follows what we have in the technical part of the paper, without describing the technical details again. This may require a light re-write of the experiments. See also other notes below.}
\todo{Michael}{Create and use macros for all main terms and mathematical quantities, so that they stay consistent throughout the paper. Already done for previous sections}

We thoroughly tested the proposed evaluation method in terms of its accuracy, variability and robustness. We employed both synthetic and real data, including decisions by several different kinds of decision makers. We evaluated the performance of several different kinds of decision makers, using different evaluators, including the contraction method of \citet{lakkaraju2017selective} as the current state of the art.

%on synthetic and real data in comparison with the the current state of the art.


%HERE WE CANNOT SAY THAT OUR AIM WAS TO SHOW ROBUSTNESS
%THAT WOULD INDICATE THAT WE DESIGN TEST TO SHOW IT
% WE TEST

%To establish that our method is robust to differing decision making processes, we employed multiple decision makers in our experiments. In this section we present the decision makers and our results from experiments with synthetic and realistic data. 
%Each experiment is described in terms of 
%\begin{itemize}
%	\item A dataset \dataset. In turn, this is defined in terms of 
%		\begin{itemize}
%			\item features \obsFeatures and \unobservable of the cases and
%			\item decision makers \human.
%		\end{itemize}
%		\item A decision maker \machine.
%		\item Algorithms for evaluation.
%\end{itemize}
\subsection{Synthetic Data} \label{sec:syntheticsetting}
For our experiments with synthetic data, we built on the simulation setting of \citet{lakkaraju2017selective}.
We sampled a total of  $\datasize=5k$ subjects. 
The features \obsFeatures and \unobservable were drawn independently from standard Gaussians.
%
% \acomment{More details needed, t} 
%and decisions \decision for both human and machine decision makers were assigned using different deciders as described in section \ref{sec:decisionmakers}.
%
Outcomes for the subjects were sampled from
\begin{eqnarray}
\prob{\outcome=0~|~\obsFeaturesValue, \unobservableValue} & =&
	\invlogit(\alpha_\outcome + \beta_\obsFeatures^T \obsFeaturesValue + \beta_\unobservable \unobservableValue + \epsilon_\outcome ),
\end{eqnarray}
where the coefficients $\beta_\obsFeatures$ and $\beta_\unobservable$ were both initially set to $1$.
%
The intercept $\alpha_\outcome$, which determines the baseline probability of a negative result, was set to $0$.
Stochasticity was added to the behaviour of the subjects through the noise term $\epsilon_\outcome$, drawn from a zero-mean Gaussian distribution with variance $0.1$.
The subjects were assigned to the decision makers (described in the next subsection) at random, such that each decision maker received a total of 100 subjects.
The leniencies \leniency of the $\judgeAmount=50$ decision makers \human were drawn from Uniform$(0.1,~0.9)$.
%
%The error terms $\epsilon_\decisionValue$ and $\epsilon_\outcomeValue$ were sampled independently from zero-mean Gaussian distributions with variance $0.1$.
%
Finally, the data set was randomly split into training and test sets, each containing the decisions of 25 judges; the training set was used to learn the models for the machine decision makers.
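
For concreteness, the following minimal sketch reproduces this generative process (Python with NumPy; all variable names are illustrative and not taken from our actual implementation):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
N, J = 5000, 50            # subjects and judges
beta_x = beta_z = 1.0      # coefficients for X and Z
alpha_y = 0.0              # intercept: baseline log-odds of Y = 0

# Observable X and unobservable Z, independent standard Gaussians.
x = rng.standard_normal(N)
z = rng.standard_normal(N)

# P(Y = 0 | x, z) from the logistic model, with noise of variance 0.1.
eps = rng.normal(0.0, np.sqrt(0.1), N)
p_neg = 1.0 / (1.0 + np.exp(-(alpha_y + beta_x * x + beta_z * z + eps)))
y = (rng.random(N) >= p_neg).astype(int)   # Y = 1: positive outcome

# 100 subjects per judge; leniencies drawn from Uniform(0.1, 0.9).
judge = rng.permutation(np.repeat(np.arange(J), N // J))
leniency = rng.uniform(0.1, 0.9, J)
\end{verbatim}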

%We use this basic setting [as described by~\citet{lakkaraju2017selective}] and extend it to various directions.

%\acomment{Cannot describe synthetic data generation without the decision makers.} \rcomment{If you change the order as has been done, it's not needed?} 
%Let's describe how we do the following.
%\begin{itemize}
%	\item How we generate features \obsFeatures and \unobservable of cases; how we generate outcomes \outcome for positive decisions. 
%	\item How we assign cases and leniency levels to decision makers \human.
%	
%	
%	
%
%	
%	\item Decision makers \human can be one of two types (one type used in each experiment) :
%		\begin{itemize}
%			\item Random.
%			\item Using same parameters as model for \outcome, i.e., $\gamma_x = \beta_{_\obsFeaturesValue}$ and $\gamma_\unobservableValue = \beta_{_\unobservableValue}$.
%		\end{itemize}
%	\item Decision makers \machine(\leniencyValue):
%		\begin{itemize}
%			\item Random.
%			\item Decision maker with $\gamma_x = \beta_{_\obsFeaturesValue}$ and $\gamma_\unobservableValue = 0$.
%			\item Decision makers learned (naively) from a separate set of labeled data (what we've called the `training set').
%			\item Decision maker learned from Stan.
%		\end{itemize}
%	\item How we evaluate:
%		\begin{itemize}
%			\item Our approach (Counterfactuals).
%			\item Contraction.
%			\item Labeled outcomes.
%			\item True evaluation.
%		\end{itemize}
%\end{itemize}
%The \emph{default} decision-maker \human assigns decisions based on the probabilistic expression given in Equation \ref{eq:judgemodel}.
%As mentioned, the decision-maker can be deemed as good, if the values for the coefficients $\beta_\obsFeaturesValue$ and $\beta_\unobservableValue$ are close those of $\gamma_\obsFeaturesValue$ and $\gamma_\unobservableValue$.
%To examine the robustness of our method, we checked that our model can produce accurate estimates even if decision-makers \human would assign their decisions at random.
%In contrast, Lakkaraju et al. employed a human decision-maker making decisions based on quantity $\sigma(\gamma_\obsFeaturesValue \obsFeaturesValue + \gamma_\unobservableValue \unobservableValue + \epsilon_\decisionValue)$.
%Their decision-maker would then assign positive decisions to subjects belonging to the $\leniencyValue \cdot 100$ percentile rendering the decision dependent \cite{lakkaraju2017selective}.
%We deployed two machine decision-makers on the synthetic data sets.
%The decision-maker \machine can be \emph{random} and giving out positive decisions only with probability \leniencyValue ignoring information concerning the subjects in variables \obsFeatures and \unobservable.
%The machine \machine can also be taught naively to make decisions based on a separate set of labeled data.
%This \emph{default} decision-maker \machine uses the observable features \obsFeatures and observed outcomes (with $\decision = 1$) to predict probability for the outcome and assigns positive decisions...

\subsection{Decision Makers} \label{sec:decisionmakers}

The following decision makers can be used both as human (\human) and machine (\machine) decision makers. We used all of them as human decision makers, but only the \emph{random} and \emph{batch} decision makers as machine decision makers.
\begin{itemize}
\item \textbf{Random}: With leniency \leniencyValue, a random decision maker selects a fraction \leniencyValue of the subjects assigned to it uniformly at random, gives them positive decisions ($\decision = 1$) and the rest negative decisions ($\decision = 0$).
\item \textbf{Batch}: Following~\citet{lakkaraju2017selective}, this decision maker sorts its subjects by a risk score and then releases the fraction \leniencyValue of subjects with the lowest scores. In the experiments, the risk scores were given by
\begin{equation} \label{eq:riskscore}
\invlogit(\gamma_\obsFeaturesValue\obsFeaturesValue + \gamma_\unobservableValue\unobservableValue + \epsilon_\decisionValue).
\end{equation}
\item \textbf{Independent}: The independent decision maker is similar to the batch decision maker, except that its decisions on different subjects are independent. Let $G$ denote the cumulative distribution function of the risk scores over all defendants. An independent decision maker with leniency \leniencyValue gives a positive decision exactly to those defendants whose risk score falls below $G^{-1}(\leniencyValue)$, i.e., within the fraction \leniencyValue of the lowest scores. In the experiments, the risk scores were computed using Equation~\ref{eq:riskscore} without the error term $\epsilon_\decisionValue$, which makes the decisions deterministic (see the sketch after this list).
\item \textbf{Probabilistic}: Each subject is released with probability equal to a judge-specific risk score. In the experiments, the risk scores were computed with Equation~\ref{eq:judgemodel}, where the leniency was input through the intercept $\alpha_j$.
\end{itemize}
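
The sketch below contrasts the batch and independent decision rules (Python with NumPy and SciPy; the function names and the closed-form quantile function are our illustrative assumptions for the synthetic setting):
\begin{verbatim}
import numpy as np
from scipy.special import expit   # inverse logit
from scipy.stats import norm

def batch_decide(scores, r):
    """Release the r fraction of this judge's subjects with the
    lowest risk scores (decisions depend on the whole batch)."""
    t = np.zeros(len(scores), dtype=int)
    t[np.argsort(scores)[: int(r * len(scores))]] = 1
    return t

def independent_decide(scores, r, G_inv):
    """Release each subject independently of the others whenever
    its risk score falls below the population quantile G^{-1}(r)."""
    return (scores < G_inv(r)).astype(int)

# In the synthetic setting the noiseless risk score is expit(x + z)
# with x + z ~ N(0, 2), so its quantile function has a closed form:
G_inv = lambda r: expit(np.sqrt(2.0) * norm.ppf(r))
\end{verbatim}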

Decision makers in the data (\human) have access to \unobservable, while the evaluated decision makers do not. When a decision maker is used as a machine decision maker \machine, the model producing its risk scores is learned from a separate data set.



\subsection{Evaluators}
We evaluated each evaluator by comparing the \failurerate estimate it produced for the machine decision maker \machine at leniency \leniencyValue to the true failure rate.
%
The true failure rate of \machine (\textbf{true evaluation}) was computed by employing the designated decision maker at leniency \leniencyValue on data in which all true outcome labels are available regardless of the decision.
Note that the true failure rate cannot be computed on empirical data sets, since selective labeling leaves some true outcomes unobserved.
The naive baseline, \textbf{labeled outcomes}, employs the decision maker at leniency \leniencyValue on the data containing only subjects with positive decisions; subjects who were given negative decisions are simply dropped.
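
As a sketch of these two evaluators (Python with NumPy; the helper name is ours, and we assume the failure-rate convention of \citet{lakkaraju2017selective}, namely the fraction of all subjects that are released but have a negative outcome):
\begin{verbatim}
import numpy as np

def failure_rate(scores, y, r):
    """Release the r fraction with the lowest scores;
    FR = number of released failures (y == 0) / number of subjects."""
    released = np.argsort(scores)[: int(r * len(scores))]
    return np.sum(y[released] == 0) / len(scores)

# scores, y_true, t_data and r as in the data-generation sketch above.
# True evaluation: every outcome is known (synthetic data only).
fr_true = failure_rate(scores, y_true, r)

# Labeled outcomes: first drop subjects with negative data decisions.
kept = t_data == 1
fr_labeled = failure_rate(scores[kept], y_true[kept], r)
\end{verbatim}
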
Antti Hyttinen's avatar
Antti Hyttinen committed
The theoretical background of the \textbf{counterfactual imputation} method was presented in the previous sections.
%
The method first imputes the predicted outcomes for the subjects who received a negative decision in the data, and then employs the chosen decision maker \machine at leniency \leniencyValue.
%
The \failurerate estimate is computed from the resulting, possibly imputed, outcomes of the subjects.
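
Continuing the sketch above, counterfactual imputation only changes which outcome labels enter the computation (here we assume the imputation step has already produced an array \texttt{y\_imputed} from the posterior model described earlier):
\begin{verbatim}
# y_obs holds the recorded labels where t_data == 1; elsewhere unused.
y_filled = np.where(t_data == 1, y_obs, y_imputed)
fr_counterfactual = failure_rate(scores, y_filled, r)
\end{verbatim}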

The current state-of-the-art method is \textbf{contraction}, presented by \citet{lakkaraju2017selective}.
%
It is an algorithm designed specifically to estimate the true failure rate of a machine decision maker under selective labeling.
%
According to the original presentation, contraction takes as input the data set and risk scores in the interval $[0, 1]$, and produces a \failurerate estimate by contracting the set of subjects assigned to the most lenient decision maker \human and computing the estimate from that group.
Finally, the absolute errors of the \failurerate estimates of contraction, counterfactual imputation and labeled outcomes with respect to the true failure rate were compared, to assess the precision and accuracy of the evaluators.
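
A minimal sketch of contraction, under our reading of \citet{lakkaraju2017selective} (variable names are illustrative):
\begin{verbatim}
import numpy as np

def contraction(scores_q, t_q, y_q, r):
    """Failure rate of the evaluated model at leniency r, estimated
    from the subjects of the most lenient judge q (leniency >= r).
    scores_q: model's predicted risk for judge q's subjects
    t_q:      judge q's decisions (1 = released)
    y_q:      outcomes, observed only where t_q == 1 (y == 0: failure)
    """
    n = len(scores_q)
    released = np.flatnonzero(t_q == 1)
    # Keep only the r * n lowest-risk released subjects, i.e.
    # "contract" judge q's released set down to the target leniency.
    order = released[np.argsort(scores_q[released])]
    kept = order[: int(r * n)]
    return np.sum(y_q[kept] == 0) / n
\end{verbatim}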

\subsection{Results on Synthetic Data}
%
\subsubsection{The Basic Setting}
%we need to explain this for the other plots to make any sense.

\begin{figure}
\includegraphics[width=\linewidth]{./img/_deciderH_batch_deciderM_batch_maxR_0_9coefZ1_0_all} 
\caption{Evaluation of the batch machine decision maker on synthetic data generated with batch human decision makers. In this basic setting, both counterfactual imputation and contraction match the true evaluation curve closely, but counterfactual imputation exhibits markedly lower standard deviations (shown by the error bars).}\label{fig:basic}
\end{figure}

Figure~\ref{fig:basic} shows the basic evaluation of a batch decision maker. Here, an evaluator is good if it matches the true evaluation at every leniency.
%
In this basic setting our approach achieves more accurate estimates with lower variance than the state-of-the-art contraction.
%
\subsubsection{Different Decision Makers}

\begin{figure}
%\centering
\includegraphics[width=\linewidth]{./img/sl_errors}
\caption{Error of the estimates w.r.t.\ the true evaluation.
% with leniences from $0.1$ to $0.9$ and coefficient $\beta_\unobservable=\gamma_\unobservable=1$. 
Error bars denote standard deviations. The presented method (counterfactual imputation) offers steady estimates with low variance robustly across different decision makers, whereas the performance of contraction varies considerably both within and across decision makers.
 }
\label{fig:results_errors}
\end{figure}
We then used the same setting, this time evaluating different decision makers (\machine) and also employing different decision makers when generating the data (\human); the resulting errors are shown in Figure~\ref{fig:results_errors}.
\subsubsection{Limiting the Leniency}
 
We experimented with limiting the maximum leniency to $0.5$, to show that our method is able to estimate the true failure rate despite a lower maximum leniency.
%
The experiment was run as explained in Section~\ref{sec:syntheticsetting}, except that the leniencies of the decision makers were sampled uniformly from the interval $(0.1,~0.5)$.
%
The results, presented in Figure~\ref{fig:results_rmax05}, show that contraction is only able to evaluate the machine decision maker up to leniency $0.5$, whereas the presented approach can estimate the true failure rate at all levels of leniency.
 
\begin{figure}
%\centering
\includegraphics[width=\linewidth]{./img/sl_rmax05_H_independent_M_batch_fixed}
\caption{Evaluating the batch decision maker on data generated with independent decision makers and with leniency at most $0.5$. The proposed method (counterfactual imputation) offers good estimates of the failure rate at all levels of leniency, whereas contraction can estimate the failure rate only up to leniency $0.5$.
% with lower accuracy. The figure also exemplifies how the conventional evaluation (labeled outcomes) gives very optimistic estimates for the machine's performance.
}
\label{fig:results_rmax05}
\end{figure}
% We show how the different evaluation methods perform for decision makers \machine of different type and leniency level.

%We also make a plot to show how our Counterfactuals method infers correctly values Z based on X and T.

%\subsubsection*{The effect of unobservables}

%Perform the same experiment but with $\beta_{_\unobservableValue} \gg \beta_{_\obsFeaturesValue}$.

%\subsubsection*{The effect of observed leniency}

%Perform the same experiment, but the minimum leniency of \human is now larger than that of \machine.

%\subsubsection{Overall}

Thus, overall, in this synthetic setting our method surpasses contraction and allows for accurate evaluation in settings where contraction cannot achieve accurate evaluations.

\subsection{COMPAS Data}

COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) is equivant's (formerly Northpointe\footnote{\url{https://www.equivant.com}}) set of tools for assisting decision-making in the criminal justice system. 
%
COMPAS provides needs assessments and risk estimates of recidivism. 
%
The COMPAS score is derived from prior criminal history, socio-economic and personal factors, among other things, and it predicts recidivism over the following two years \cite{brennan2009evaluating}.
The system came under scrutiny in 2016, after ProPublica published an article claiming that the tool was biased against black people \cite{angwin2016machine}.
%
In the ensuing discussion, \citet{kleinberg2016inherent} showed that the fairness criteria used by ProPublica and Northpointe cannot be satisfied simultaneously.

The COMPAS data set used in this study contains recidivism data from Broward County, Florida, USA, made available by ProPublica\footnote{\url{https://github.com/propublica/compas-analysis}}.
%
The original data contained information on $18\,610$ defendants who were given a COMPAS score during 2013 or 2014. 
%
After removing defendants who were not assessed at the pretrial stage, $11\,757$ defendants were left. 
%
Additionally, defendants for whom the COMPAS score could not be matched with a corresponding charge were removed from the analysis, resulting in a data set of $7\,214$ observations. 
%
Following ProPublica's data cleaning process, the final data comprised $6\,172$ offenders.
%
The data includes the subjects' demographic information, such as gender, age and race, together with information on their previous offences.

For the analysis, we created $\judgeAmount=12$ synthetic judges with fixed leniency levels $0.1$, $0.5$ and $0.9$, such that four judges shared each leniency level.
The $\datasize=6\,172$ subjects were distributed to these judges as evenly as possible and at random.
%
In this semi-synthetic scenario, the judges based their decisions on the COMPAS score, releasing the fraction of defendants with the lowest scores given by their leniency.
%
For example, a synthetic judge with leniency $0.4$ would release the $40\%$ of defendants with the lowest COMPAS scores.
%
Those who were given a negative decision had their outcome label set to positive ($\outcome = 1$).
%
After assigning the decisions, the data was split 10 times into training and test sets, each containing the decisions of 6 judges.
%
A logistic regression model was trained on the training data to predict two-year recidivism from categorised age, race, gender, the number of priors, and the degree of the crime COMPAS screened for (felony or misdemeanour), using only observations with positive decisions.
As the COMPAS score is derived from a larger set of predictors than the aforementioned five \cite{brennan2009evaluating}, unobservable information would then be encoded in the COMPAS score.
%
The fitted logistic regression model was used as the decision maker \machine on the test data, and the same features were given as input to the counterfactual imputation.
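
A sketch of this semi-synthetic construction (Python with pandas; column names such as \texttt{compas\_score} and \texttt{two\_year\_recid} are illustrative stand-ins for the fields in the ProPublica data):
\begin{verbatim}
import numpy as np
import pandas as pd

J = 12
leniency = np.repeat([0.1, 0.5, 0.9], 4)      # four judges per level

# df: one row per defendant after the ProPublica-style cleaning.
df = df.sample(frac=1.0, random_state=0)      # shuffle
df["judge"] = np.arange(len(df)) % J          # near-even assignment

# Each judge releases the fraction of their subjects with the
# lowest COMPAS scores given by their leniency.
def decide(g):
    r = leniency[g["judge"].iloc[0]]
    return (g["compas_score"] < g["compas_score"].quantile(r)).astype(int)

df["decision"] = df.groupby("judge", group_keys=False).apply(decide)

# Selective labeling: outcomes of detained subjects are set positive.
df["outcome"] = 1 - df["two_year_recid"]   # assuming y = 1: no recidivism
df.loc[df["decision"] == 0, "outcome"] = 1
\end{verbatim}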

\subsection{Results on COMPAS Data}
Results for the experiments with the COMPAS data are shown in Figure~\ref{fig:results_compas}.
The mean absolute error of the proposed method at all levels of leniency was consistently lower than that of contraction, regardless of the number of judges used in the experiments.
%The experiments showed an agreement rate ranging from 0.23 to 0.53 for contraction.
%That combined with a high maximum acceptance rate and approximately 514 subjects per decision-maker should guarantee a comparable performance for contraction.
The experiments gave an estimate of X for $\gamma_\unobservable$, which indicates that the COMPAS score encoded some additional information.
%
In conclusion, ...
\begin{figure}
%\centering
\includegraphics[width=\linewidth]{./img/sl_errors_compas}
\caption{Results with COMPAS data. The proposed method (counterfactual imputation) shows good performance regardless of the number of judges, whereas the performance of contraction gets notably worse as the decisions in the data are spread across a larger number of judges.
%The figure shows how performance of contraction depends on the number of subjects assigned to the most lenient decision-maker whereas the performance of the proposed method is stable.
}
\label{fig:results_compas}
\end{figure}


\hide{

%% These are old results. Do we want a similar table?
\begin{table}[]
\begin{tabular}{@{}lll@{}}
\toprule
Setting 										& Contraction & Counterfactuals \\ \midrule
Default 										& 0.00648     & 0.00510         \\
100 cases per judge, $\datasize=5k$ 				& 0.00629     & 0.00481         \\
$\leniency \sim \text{Uniform}(0.1,~0.5)$ 			& N/A         & 0.01385         \\
Decision-maker \human random 					& 0.01522     & 0.00137         \\ 
Decision-maker \machine random 					& 0.03005     & 0.00327              \\
Lakkaraju's decision-maker \human \cite{lakkaraju2017selective} & 0.01187          & 0.00288            \\ \bottomrule
\end{tabular}
\caption{Comparison of the mean absolute error w.r.t.\ the true evaluation between contraction and the counterfactual-based imputation, with leniencies from $0.1$ to $0.9$. The table shows that our method can perform well despite violations of the assumptions (e.g.\ decision-maker \human giving random, non-informative decisions). Here the data had $\max(\leniencyValue)=0.9$.}
\end{table}
\subsubsection*{Old Content}
\rcomment{ I presume MM's preferences were that the outcome would be from Bernoulli distribution and that the decisions would be independent. So, let's first explain those ways thoroughly and then mention what we changed as discussed.}

We experimented with synthetic data sets to examine accuracy, unbiasedness and robustness to violations of the assumptions. 

We sampled $N=7k$ samples of $X$, $Z$, and $W$ as independent standard Gaussians. 
\todo{Michael to Riku}{W does not appear in our model.}
We then drew the outcome $Y$ from a Bernoulli distribution with parameter $p = 1 - \invlogit(\beta_{_\obsFeaturesValue} \obsFeaturesValue+\beta_{_\unobservableValue} \unobservableValue+\beta_ww)$ so that $P(Y=0|X, Z, W) =  \invlogit(\beta_{_\obsFeaturesValue} \obsFeaturesValue+\beta_{_\unobservableValue} \unobservableValue+\beta_ww)$, where the coefficients for $X$, $Z$ and $W$ were set to $1$, $1$ and $0.2$, respectively. Then the leniency levels $R$ for each of the $M=14$ judges were assigned pairwise so that each of the pairs had leniencies $0.1,~0.2,\ldots, 0.7$. 
\todo{Michael to Riku}{We have assumed all along that the outcome \outcome causally follows from leniency \leniency. So we cannot suddenly say that we assign leniency after we have generated the outcomes. Let's follow strictly our model definition. If what you describe above is equivalent to what we have in the model section, then let's simply say that we do things as in the model section. Otherwise, let's do a light re-arrangement of experiments so that we follow the model exactly. In any case, in this section we should not have any model description -- we should only say what model parameters we used for the previous defined model.}
The subjects were assigned randomly to the judges so each received $500$ subjects. The data was divided in half to form a training set and a test set. This process follows the suggestion of Lakkaraju et al. \cite{lakkaraju2017selective}. \acomment{Check before?}
The \emph{default} decision maker in the data predicts a subjects' probability for recidivism to be $P(\decision = 0~|~\obsFeatures, \unobservable) = \invlogit(\beta_{_\obsFeaturesValue} \obsFeaturesValue+\beta_{_\unobservableValue} \unobservableValue)$. Each of the decision-makers is assigned a leniency value, so the decision is then assigned by comparing the value of $P(\decision = 0~|~\obsFeatures, \unobservable)$ to the value of the inverse cumulative density function $F^{-1}_{P(\decision = 0~|~\obsFeatures, \unobservable)}(r)=F^{-1}(r)$. Now, if $F^{-1}(r) < P(\decision = 0~|~\obsFeatures, \unobservable)$ the subject is given a negative decision $\decision = 0$ and a positive otherwise. \rcomment{Needs double checking.} This ensures that the decisions are independent and that the ratio of positive decisions to negative decisions converges to $r$. Then the outcomes for which the decision was negative, were set to $0$.
We used a number of different decision mechanisms. A \emph{limited} decision-maker works as the default, but predicts the risk of a negative outcome using only the recorded features \obsFeatures, so that $P(\decision = 0~|~\obsFeatures, \unobservable) = \invlogit(\beta_{_\obsFeaturesValue} \obsFeaturesValue)$. Hence it is unable to observe $Z$. A \emph{biased} decision-maker works similarly to the default decision-maker, but the values of the observed features \obsFeatures seen by the decision-maker are altered. We modified the values so that if the value of \obsFeaturesValue was greater than $1$, it was multiplied by $0.75$ to induce more positive decisions. Similarly, if the subject's \obsFeaturesValue was in the interval $(-2,~-1)$, we added $0.5$ to induce more negative decisions. Additionally, the effect of non-informative decisions was investigated by deploying a \emph{random} decision-maker. Given leniency $R$, a random decision-maker gives a positive decision $T=1$ with probability $R$.

In contrast, Lakkaraju et al. essentially order the subjects and assign $T=1$ to the percentage given by the leniency $R$. We see this as unrealistic: the decision on a subject should not depend on the decisions on other subjects. In the bail example this would induce unethical behaviour: a judge would need to jail one defendant today in order to release another tomorrow.
We treat the observations as independent; the leniency is still a good estimate of the acceptance rate, since the acceptance rate converges to the leniency. 

\paragraph{Evaluators} 
	We deployed multiple evaluator modules to estimate the true failure rate of the decider module. The estimates should be close to those produced by the true evaluation module, and they will eventually be compared to the human evaluation curve. 
\begin{itemize}
	\item  \emph{True evaluation:} True evaluation depicts the true performance of a model. The estimate is computed by first sorting the subjects into a descending order based on the prediction of the model. Then the true failure rate estimate is computable directly from the outcome labels of the top $1-r\%$ of the subjects. True evaluation can only be computed on synthetic data sets as the ground truth labels are missing.
	%\item \emph{Human evaluation:} Human evaluation presents the performance of the decision-makers who observe the latent variable. Human evaluation curve is computed by binning the decision-makers with similar values of leniency into bins and then computing their failure rate from the ground truth labels. \rcomment{Not computing now.}
	\item \emph{Labeled outcomes:} Labeled outcomes algorithm is the conventional method of computing the failure rate. We proceed as in the true evaluation method but use only the available outcome labels to estimate the failure rate.
	\item \emph{Contraction:} Contraction is an algorithm designed specifically to estimate the failure rate of a black-box predictive model under selective labeling. See previous section.
\end{itemize}

\paragraph{Results} We deployed the evaluators on the synthetic data set presented above; the results are in Figures \ref{fig:results_main} and \ref{fig:results_main_2}. The presented method can recover the true performance of a model for all levels of leniency. In the figures we see how the contraction algorithm can only estimate the true performance up to the leniency of the most lenient decision-maker, whereas the proposed method can do so for arbitrary levels of leniency. To create comparable results we also employed the batch decision-making mechanism presented by Lakkaraju et al. on a synthetic data set which had $N=9k$ instances and $M=14$ judges with leniencies $0.1, \ldots, 0.9$. The mean absolute error of contraction ($0.00265$) was approximately $90\%$ higher than that of the presented method ($0.00139$).

Similar results were obtained from experiments with the aforementioned \emph{limited} \rcomment{Not done yet.}, \emph{biased} and \emph{random} deciders. The mean absolute errors for the estimates / other results are presented in .... This shows that the counterfactual-based imputation method is robust to changes in the data generating mechanisms and therefore accommodates multiple scenarios. The results from experiments with the biased decision-makers show that the proposed method can perform well despite the biased decisions of the current decision-makers. This is important because...
	

\begin{itemize}
%\item COMPAS data set
%	\begin{itemize}
%	\item Size, availability, COMPAS scoring
%		\begin{itemize}
%		\item COMPAS general recidivism risk score is made to ,
%		\item The final data set comprises of 6172 subjects assessed at Broward county, California. The data was preprocessed to include only subjects assessed at the pretrial stage and (something about traffic charges).
%		\item Data was made available ProPublica.
%		\item Their analysis and results are presented in the original article "Machine Bias" in which they argue that the COMPAS metric assigns biased risk evaluations based on race.
%		\item Data includes the subjects' demographic information (incl. gender, age, race) and information on their previous offences. 
%		\end{itemize}
%	\item Subsequent modifications for analysis 
%		\begin{itemize}
%		\item We created 9 synthetic judges with leniencies 0.1, 0.2, ..., 0.9. 
%		\item Subjects were distributed to all the judges evenly and at random to enable comparison to contraction method
%		\item We employed similar decider module as explained in Lakkaraju's paper, input was the COMPAS Score 
%		\item As the COMPAS score is derived mainly from so it can be said to have external information available, not coded into the four above-mentioned variables. (quoted text copy-pasted from here)
%		\item Data was split to test and training sets
%		\item A logistic regression model was built to predict two-year recidivism from categorized age, gender, the number of priors, degree of crime COMPAS screened for (felony/misdemeanor)
%		\item We used these same variables as input to the CBI evaluator.
%		\end{itemize}
%	\item Results
%		\begin{itemize}
%		\item Results from this analysis are presented in figure X. In the figure we see that CBI follows the true evaluation curve very closely.
%		\item We can also deduce from the figure that if this predictive model was to be deployed, it wouldn't necessarily improve on the decisions made by these synthetic judges.
%		\end{itemize}
%	\end{itemize}
\item Catalonian data (this could just be for our method? Hide ~25\% of outcome labels and show that we can estimate the failure rate for ALL levels of leniency despite the leniency of this one judge is only 0.25) (2nd priority)
	\begin{itemize}
	\item Size, availability, RisCanvi scoring
	\item Subsequent modifications for analysis
	\item Results
	\end{itemize}
\end{itemize}