%!TEX root = sl.tex
% The above command helps compiling in TexShop on a Mac. Hitting typeset compiles sl.tex directly instead of producing an error here.

\section{Experiments}
\label{sec:experiments}
We test the accuracy and robustness of \cfbi in evaluating the performance of decision makers of different kinds. 
%
Toward this end, we employ both synthetic and real data.
%
We compare \cfbi especially with the recent \contraction technique of \citet[KDD'17]{lakkaraju2017selective}.
\spara{Reproducibility}. Our manuscript contains a full specification of the parameters of the models and synthetic datasets we used, as well as links to the real public datasets.
%
The implementation uses Python 3.6.9 and PyStan v.2.19.0.0 with cmdstanpy 0.4.3, and will be made available online upon publication.

\subsection{Synthetic Data} 
\label{sec:syntheticsetting}
We begin our experiments with synthetic data, in order to investigate various properties of our approach.
%
To set up the experimentation, we follow the setting of \citet{lakkaraju2017selective}.

We base our experiments on 10 synthetic data sets.
%
Each dataset we experiment with consists of $\datasize=5{,}000$ randomly generated cases.
%
The features \obsFeatures and \unobservable of each case are drawn independently from standard Gaussians.
%
Each case is assigned randomly to one out of $\judgeAmount=50$ decision makers, such that each decision maker receives a total of $100$ cases.
%
The leniency \leniency of each decision maker is drawn from Uniform$(0.1,~0.9)$.
A decision $\decision$ is made for each case by the assigned decision maker.
%
The exact method of assigning a decision is specified in the next subsection (Sec.~\ref{sec:dm_exps}).
If the decision is positive, then a binary outcome is sampled from a Bernoulli distribution:
\begin{equation}
\prob{\outcome = 0~|~\decision=1, \obsFeaturesValue, \unobservableValue}  =  \invlogit(b_\obsFeatures \obsFeaturesValue + b_\unobservable \unobservableValue + e_\outcome)  \label{eq:Ysampling} 
\end{equation}% Note the ``inverted'' probability and the added noise term $e_\outcome$ compared to Eq. 1.
with $b_\obsFeatures = b_\unobservable = 1$.
Additional noise is added to the outcome of each case via $e_\outcome$, which is drawn from a zero-mean Gaussian distribution with small variance, $e_\outcome\sim \gaussian{0}{0.1}$. The dataset is split in half into training and test sets, such that each decision maker appears in only one of them. The evaluated decision maker $\machine$ is trained on the training set, while the evaluation is based only on the test set.
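For concreteness, the generative process described above can be sketched as follows. This is an illustrative Python sketch with our own variable names; the actual experiments use PyStan for inference, and the decisions \decision themselves are made by the decision makers of Section~\ref{sec:dm_exps}.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n_cases, n_judges = 5000, 50
b_x = b_z = 1.0

x = rng.standard_normal(n_cases)              # observed feature X
z = rng.standard_normal(n_cases)              # unobserved feature Z
e_y = rng.normal(0.0, np.sqrt(0.1), n_cases)  # outcome noise, variance 0.1

# Each judge receives 100 cases; leniency is drawn from Uniform(0.1, 0.9).
judge = rng.permutation(np.repeat(np.arange(n_judges), n_cases // n_judges))
leniency = rng.uniform(0.1, 0.9, n_judges)

# Outcome model above: P(Y = 0 | T = 1, x, z) = invlogit(b_x*x + b_z*z + e_y).
p_fail = 1.0 / (1.0 + np.exp(-(b_x * x + b_z * z + e_y)))
y = (rng.random(n_cases) >= p_fail).astype(int)   # Y = 1 is a good outcome

# Train/test split by decision maker: each judge appears in only one half.
train_judges = rng.choice(n_judges, n_judges // 2, replace=False)
is_train = np.isin(judge, train_judges)
\end{verbatim}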

\subsection{Decision Makers} \label{sec:decisionmakers}
\label{sec:dm_exps}
Our experimentation involves two categories of decision makers: (i) the set of decision makers \humanset, whose decisions are recorded in the dataset, and (ii) the decision maker \machine, whose performance is to be evaluated on the log of cases decided by \humanset.
We describe both below.

\mpara{Decisions by \humanset.} %\newline
The decisions of decision makers \humanset are based on their perception of the dangerousness of a case, which we refer to as the {\it risk score}.
%
With synthetic data we compute the risk score as
\begin{equation} \label{eq:risk}
\text{risk score} = b_\obsFeatures \obsFeatures + b_\unobservable \unobservable.
\end{equation}
For the {\it first} type of decision makers we consider, we assume that decisions are rational and well-informed, and that a decision maker with leniency \leniencyValue makes a positive decision only for the \leniencyValue fraction of cases that are most likely to lead to a positive outcome. 
Specifically, we assume that the decision-makers know the cumulative distribution function $F$ that the risk scores $s = b_\obsFeatures \obsFeaturesValue + b_\unobservable \unobservableValue$ of defendants follow. 
This is a reasonable assumption in settings where decision makers have accurate knowledge of the joint feature distribution, as such knowledge allows one to compute $F$.
%
For example, an experienced judge who has tried a large volume and variety of defendants may have a good idea of the range of cases that appear in court and of which of them pose higher risk.
%
Consider a decision maker with leniency $\leniency = \leniencyValue$ who decides a case with risk score $s$: a positive decision is made only if $s$ falls within the lowest \leniencyValue portion of scores according to $F$, i.e., if $F(s) \leq \leniencyValue$.
Since in our setting the distribution $F$ is given and fixed, such decisions for different cases happen independently based on their risk score.
%
Because of this, we refer to this type of decision makers as \independent.

We also experiment with a second type of decision maker, namely \batch, also used in \cite{lakkaraju2017selective}.
Decision makers of this type are assumed to consider all cases assigned to them at once, as a batch; sort them by the risk score of Equation~\ref{eq:risk}; and, for leniency $\leniency = \leniencyValue$, release the $\leniencyValue$ portion of the batch with the lowest risk scores.
Such decision makers still have good knowledge of the relative risk posed by the cases assigned to them, but they are also shortsighted, as their decision for a case \emph{depends} on the other cases in their batch.
For example, if a decision maker is randomly assigned a batch of cases that are all very likely to lead to a good outcome, a large portion $1-\leniencyValue$ of them will still receive a negative decision.

Finally, we consider a third type of decision maker, namely \random. It simply makes a positive decision with probability \leniencyValue.
%
We include this to test the evaluation methods also in settings where some of their assumptions may be violated.
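To make the three decision rules concrete, the following sketch implements them for the synthetic setting, where the risk score $s = \obsFeatures + \unobservable$ follows $\gaussian{0}{2}$; the function names are ours and purely illustrative.
\begin{verbatim}
import numpy as np
from scipy.stats import norm

def decide_independent(s, r):
    # Positive decision iff F(s) <= r, where F is the CDF of the risk scores;
    # here s = X + Z with X, Z ~ N(0, 1), so F is the CDF of N(0, 2).
    F = norm(loc=0.0, scale=np.sqrt(2.0)).cdf
    return (F(s) <= r).astype(int)

def decide_batch(s, r):
    # Sort the batch by risk score and release the r fraction with the
    # lowest scores.
    released = np.zeros(len(s), dtype=int)
    released[np.argsort(s)[: int(r * len(s))]] = 1
    return released

def decide_random(s, r, rng=None):
    # Positive decision with probability r, independently of the risk score.
    rng = rng or np.random.default_rng(0)
    return (rng.random(len(s)) < r).astype(int)
\end{verbatim}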
\mpara{Decisions by \machine.} %\newline
%
For \machine, we consider the same three types of decision makers as for \humanset above, with one difference: decision makers \humanset have access to \unobservable, while \machine does not.
%
Their definitions are adapted in the obvious way -- i.e., for \independent and \batch, the risk score involves only the values of feature \obsFeatures.
Risk scores are computed with a logistic regression model trained on the training set.
%
For the \independent decision maker \machine, the cumulative distribution function of the risk scores is constructed using the empirical distribution of risk scores of all the observations in the training data.
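A corresponding sketch for \machine is given below; it is illustrative only, and we assume, as in the selective labels setting, that only cases with a positive decision carry outcome labels in the training set.
\begin{verbatim}
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_machine(x_train, y_train, t_train):
    # Fit P(Y = 0 | X) on labeled training cases (positive decisions only).
    labeled = t_train == 1
    model = LogisticRegression().fit(x_train[labeled].reshape(-1, 1),
                                     1 - y_train[labeled])
    # The empirical distribution of training risk scores stands in for F.
    risk_train = model.predict_proba(x_train.reshape(-1, 1))[:, 1]
    return model, np.sort(risk_train)

def machine_decide_independent(model, sorted_risk, x_test, r):
    # Positive decision iff the empirical CDF of the predicted risk is <= r.
    risk = model.predict_proba(x_test.reshape(-1, 1))[:, 1]
    F_hat = np.searchsorted(sorted_risk, risk, side='right') / len(sorted_risk)
    return (F_hat <= r).astype(int)
\end{verbatim}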
\begin{figure}
\includegraphics[width=1.1\linewidth,trim={0 0 0 1.8cm},clip]{./img/with_epsilon_deciderH_independent_deciderM_batch_maxR_0_9coefZ1_0_all} 
\caption{Evaluation of a \batch decision maker on data with \independent decision makers. Error bars show the standard deviation of the \failurerate estimate across 10 datasets. In this basic setting, both our \cfbi and \contraction are able to match the true evaluation curve closely, but the former exhibits lower variation, as shown by the error bars.}
\label{fig:basic}
\end{figure}

\subsection{Evaluators}
\label{sec:evaluators}
The purpose of the experiments is to investigate the performance of different methods in evaluating \machine on a dataset that records cases decided by \humanset.
%
We call those methods ``evaluators''.
%
The desired properties for an evaluator are accuracy (i.e., the evaluator should estimate the failure rate of \machine well) and robustness (i.e., beyond performing well on average, the evaluator should perform consistently across all cases).

The first evaluator we consider is \cfbi, the proposed method of this paper, described in Section~\ref{sec:imputation}.
%
To summarize: \cfbi uses the dataset to learn a model, i.e., a distribution for the parameters involved in formulas~\ref{eq:defendantmodel} and~\ref{eq:judgemodel}; using this distribution, it predicts the outcome of the cases for which \humanset made a negative decision and \machine makes a positive one; and finally it evaluates the failure rate of \machine on the dataset.
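Schematically, once the model has been fitted (with Stan in our implementation) and posterior predictive probabilities of a negative outcome have been drawn for the unlabeled cases, the evaluation step reduces to the following; this is an illustrative sketch, and the full model is given in Section~\ref{sec:imputation}.
\begin{verbatim}
import numpy as np

def cbi_failure_rate(y_obs, t_h, t_m, p_fail_draws):
    # y_obs        : outcomes recorded in the data (valid only where t_h == 1)
    # t_h, t_m     : decisions in the data (H) and by the evaluated machine (M)
    # p_fail_draws : posterior predictive draws of P(Y = 0 | case), one row
    #                per draw
    n = len(y_obs)
    observed = (t_m == 1) & (t_h == 1)   # outcome available in the data
    hidden   = (t_m == 1) & (t_h == 0)   # outcome imputed counterfactually
    fr_draws = [(np.sum(y_obs[observed] == 0) + np.sum(p[hidden])) / n
                for p in p_fail_draws]
    return float(np.mean(fr_draws))      # posterior mean of the failure rate
\end{verbatim}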

The second evaluator we consider is \contraction, proposed in recent work~\cite{lakkaraju2017selective}.
It is designed specifically to estimate the true failure rate of a machine decision maker in the selective labels setting.
Contraction bases its evaluation only on the cases assigned to the most lenient decision maker $\human_l$ in the data. 
Because of the lower leniency of the evaluated decision maker $\machine$, the approach assumes  $\machine$ makes a negative decision for all cases for which $\human_l$ makes a negative decision.
The cases with a positive decision by $\human_l$ are sorted according to the lowest leniency level at which they receive positive decisions by  $\machine$. % R sort q
The sorted list is then {\it contracted} to match the leniency level $\leniencyValue$ at which \machine is evaluated. Because all outcomes for cases in this list are available in the data,
we can estimate the failure rate as the number of cases in the contracted list with a negative outcome divided by the number of cases assigned to $\human_l$. Because cases are assigned to decision makers at random, this estimates the failure rate on the whole dataset.
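For reference, a schematic sketch of \contraction following the description above is given below; the variable names are ours.
\begin{verbatim}
import numpy as np

def contraction_failure_rate(y_obs, t_h, judge, leniency, machine_risk, r):
    # Evaluate M at leniency r using only the caseload of the most lenient
    # decision maker h_l in the data.
    q = judge == np.argmax(leniency)          # cases assigned to h_l
    released = q & (t_h == 1)                 # positive decisions by h_l
    # Sort the released cases in the order in which M would release them
    # (equivalently, by M's risk score) and contract the list to leniency r.
    order = np.where(released)[0][np.argsort(machine_risk[released])]
    kept = order[: int(r * q.sum())]
    return np.sum(y_obs[kept] == 0) / q.sum()
\end{verbatim}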
%
In addition, we consider two baselines.
As a first baseline, we consider the method that evaluates the failure rate of \machine based only on those cases that received a positive decision by \humanset in the data.
This is referred to as \labeledoutcomes.
As a second baseline, we consider a method that performs straightforward imputation:
given a training dataset, it considers only those cases that received a positive decision and builds a logistic regression model on them;
it then uses the predictions of this logistic regression to impute the outcome in the test data for those cases where \machine makes a positive decision but \humanset had made a negative decision.
%
We refer to this evaluator as \logisticregression.
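Both baselines are simple to state precisely, as in the schematic sketch below; as above, we assume only cases with a positive decision by \humanset carry outcome labels, and all names are illustrative.
\begin{verbatim}
import numpy as np
from sklearn.linear_model import LogisticRegression

def labeled_outcomes_failure_rate(y_obs, t_h, t_m):
    # Failure rate of M computed only on cases with a recorded outcome.
    labeled = t_h == 1
    return np.sum((y_obs == 0) & labeled & (t_m == 1)) / labeled.sum()

def logistic_imputation_failure_rate(x_tr, y_tr, t_h_tr,
                                     x_te, y_te, t_h_te, t_m_te):
    # Impute outcomes for test cases that M releases but H did not.
    lab = t_h_tr == 1
    model = LogisticRegression().fit(x_tr[lab], (y_tr[lab] == 0).astype(int))
    p_fail = model.predict_proba(x_te)[:, 1]
    observed = (t_m_te == 1) & (t_h_te == 1)
    hidden   = (t_m_te == 1) & (t_h_te == 0)
    return (np.sum(y_te[observed] == 0) + np.sum(p_fail[hidden])) / len(y_te)
\end{verbatim}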

Finally, all evaluators are compared with the optimal evaluator that has access to actual outcomes.
%
While such an evaluator is not realistic to have in practice, it is available for synthetically generated data.
This is referred to as \trueevaluation.
\subsection{Results}
We now report the accuracy of the different evaluators (Section~\ref{sec:evaluators}) in evaluating different types of automated decision makers \machine on datasets generated with different types of decision makers \humanset (Section~\ref{sec:decisionmakers}).
\spara{The basic setting.}   Figure~\ref{fig:basic} shows estimated failure rates for each of the evaluators, at different leniency levels, when decisions in the data were made by \independent decision makers, while \machine was of \batch type.
In interpreting this plot, we should consider an evaluator to be accurate if its curve follows well that of the optimal evaluator, i.e., \trueevaluation.
In this scenario, we see that \cfbi and \contraction are quite accurate, while both the naive evaluation of \labeledoutcomes and the straightforward imputation by \logisticregression perform quite poorly.
%
In addition, \cfbi exhibits considerably lower variation than \contraction, as shown by the error bars.
\begin{figure}
%\centering
\includegraphics[width=\linewidth,trim={0 0 0 0.25cm},clip]{./img/sl_errors_betaZ1}
\caption{Mean absolute error (MAE) of the estimates w.r.t. the true evaluation.
Error bars show the standard deviation of the absolute error over 10 datasets. The proposed method (\cfbi) offers stable, low-variance estimates across different decision makers. The error of \contraction
varies considerably both within and across different decision makers.}
\label{fig:results_errors}
\end{figure}

Figure~\ref{fig:results_errors} shows the aggregate absolute errors of the two evaluators, \cfbi and \contraction.
%
Each error bar is based on all datasets and leniencies from $0.1$ to $0.8$, for different types of decision makers for $\humanset$ and for \machine.
The overall result is that \cfbi evaluates the decision makers accurately and robustly across different decision makers.
It is able to learn model parameters that capture the behavior of decision makers employed in the data and use that model to evaluate any decision maker \machine using only selectively labeled data.
\contraction shows consistently poorer performance, and markedly larger variation as shown by the error bars.
Our interpretation is that this is due to the fact that \contraction crucially depends on the cases assigned to the most lenient decision makers, while \cfbi makes full use of all the data.


\begin{figure}
\includegraphics[width=1.1\linewidth,trim={0 0 0 1.8cm},clip]{./img/with_epsilon_deciderH_independent_deciderM_batch_maxR_0_5coefZ1_0_all}
\caption{Evaluating a \batch decision maker on data employing \independent decision makers with leniency at most $0.5$. The proposed method (\cfbi) offers sensible estimates of the failure rate at all levels of leniency, whereas \contraction produces failure rates only up to leniency $0.5$.}
\label{fig:results_rmax05}
\end{figure}
\spara{The effect of limited leniency.}  
Figure~\ref{fig:results_rmax05} shows the results when the leniency of decision makers in the data was restricted to at most $0.5$, rather than at most $0.9$ as was the case for Figure~\ref{fig:basic}.
Here, \contraction is only able to estimate the failure rate up to leniency $0.5$, which is the highest leniency of decision makers in the data -- for higher leniency levels it does not output any results.
On the contrary, the proposed method \cfbi produces failure rate estimates for all leniencies.
We note, of course, that when we compare with \trueevaluation, the accuracy of \cfbi decreases for the largest leniencies -- a fact to be expected, as no decisions at such high leniency exist in the data.
This observation is important: evaluators based on modeling, such as \cfbi, may well allow evaluation at higher leniency levels than those of the (often human) decision makers employed in the data.


\spara{The effect of unobservables.} So far in our synthetic experiments, we have assumed that observed and unobserved features are of equal importance in determining the possible outcomes, an assumption encoded in the values of the parameters $b_\obsFeatures$ and $b_\unobservable$, which were both set to $1$ (see Section~\ref{sec:syntheticsetting}).
%
To explore situations where the importance of unobservables is higher, we now also consider settings with
$b_\obsFeatures = 1$, $b_\unobservable = 5$.
The results are shown in Figure~\ref{fig:highz}. 
In these settings, the decisions in the data are based mostly on background factors not observed by the decision maker $\machine$ being evaluated; thus, the performance of $\machine$ is worse than in Figure~\ref{fig:results_errors}.
Nevertheless, the proposed method (\cfbi) is able to evaluate different decision makers $\machine$ accurately. 
%
Again, \contraction shows consistently worse performance in comparison. Furthermore, compared to the basic case (Figure~\ref{fig:results_errors}), the performance of \contraction deteriorates further, indicating some sensitivity to unobservables.

\vspace{2pt}

Overall, in these synthetic settings, our method achieves more accurate results with considerably less variation than \contraction, and it allows for evaluation in situations where the strong assumptions of \contraction inhibit evaluation altogether.

\begin{figure}
\begin{center}\includegraphics[width=\linewidth,trim={0 0 0 0.25cm},clip]{img/sl_errors_betaZ5}
\end{center}
\caption{MAE of the estimates w.r.t. the true evaluation when the effect of the unobserved $\unobservable$ is high ($b_\unobservable=5$). Although the quality of the evaluated decision maker \machine is poorer, the proposed approach (\cfbi) can still evaluate it accurately. \contraction shows higher variance and lower accuracy.}
\label{fig:highz}
\end{figure}
\subsection{COMPAS Data}
COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) is a set of tools by equivant (formerly Northpointe)
%\footnote{\url{https://www.equivant.com}}
for assisting decision making in the criminal justice system.
COMPAS provides needs assessments and risk estimates of recidivism. 
The COMPAS score is derived from prior criminal history, socio-economic and personal factors, among other things, and it predicts recidivism over the following two years \cite{brennan2009evaluating}.

The COMPAS dataset used in this study is recidivism data from Broward County, Florida, USA, made available by ProPublica\footnote{\url{https://github.com/propublica/compas-analysis}}.
%
Judges and defendants in the data correspond to decision makers and cases in our setting (Section~\ref{sec:setting}), respectively.
The original data contained information about $18{,}610$ defendants who were given a COMPAS score during 2013 or 2014. 
After removing defendants whose cases were not processed at the pretrial stage, $11{,}757$ defendants remained.
Additionally, defendants for whom the COMPAS score could not be matched with a corresponding charge were removed from the analysis, resulting in a dataset of $7{,}214$ observations.
After following ProPublica's data cleaning process, the final data consisted of $6{,}172$ offenders.
%
The data includes the subjects' demographic information, such as gender, age and race, together with information on their previous offences.

For the analysis, we deployed $\judgeAmount \in \{12, 24, 48\}$ synthetic judges with fixed leniency levels $0.1$, $0.5$ and $0.9$, so that one third of the decision makers shared each leniency level.
The $\datasize=6{,}172$ subjects were distributed to the \judgeAmount judges uniformly at random.
In this scenario, the judges based their decisions on the COMPAS score, releasing the fraction of defendants with the lowest score according to their leniency.
For example, a synthetic judge with leniency $0.5$ would release the $50\%$ of defendants with the lowest COMPAS scores.
%
Defendants who were given a negative decision had their outcome label set to positive ($\outcome = 1$).
%
After assigning the decisions, the data was split 10 times into training and test sets, each containing the decisions of half of the judges.
%
A logistic regression model was trained on the training data to predict two-year recidivism from categorised age, race, gender, the number of prior crimes, and the degree of the crime COMPAS screened for (felony or misdemeanour), using only observations with positive decisions.
As the COMPAS score is derived from a larger set of predictors than the aforementioned five \cite{brennan2009evaluating}, the unobservable information is thus encoded in the COMPAS score.
The trained logistic regression model was used as decision maker \machine on the test data, and the same features were given as input to the counterfactual imputation.
The deployed machine decision maker was defined to release the \leniencyValue fraction of defendants with the lowest predicted probability of a negative outcome.
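The construction of the synthetic judges for the COMPAS data can be sketched as follows; this is illustrative only, the column name \texttt{decile\_score} refers to the COMPAS score in the ProPublica data, and the exact preprocessing is part of our released code.
\begin{verbatim}
import numpy as np
import pandas as pd

def assign_synthetic_judges(df, n_judges, rng):
    # Distribute defendants uniformly at random to judges whose fixed
    # leniencies are 0.1, 0.5 and 0.9, one third of the judges each.
    leniencies = np.tile([0.1, 0.5, 0.9], n_judges // 3)
    df = df.copy()
    df['judge'] = rng.permutation(np.arange(len(df)) % n_judges)
    df['leniency'] = leniencies[df['judge'].to_numpy()]
    df['decision'] = 0
    for j in range(n_judges):
        # Release the leniency fraction of the caseload with the lowest
        # COMPAS score.
        cases = df[df['judge'] == j].sort_values('decile_score')
        n_release = int(leniencies[j] * len(cases))
        df.loc[cases.index[:n_release], 'decision'] = 1
    return df
\end{verbatim}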

\begin{figure}
%\centering
\includegraphics[width=\linewidth,trim={0 0 0 0.25cm},clip]{./img/sl_errors_compas}
\caption{Results with the COMPAS data. Error bars represent the standard deviation of the absolute \failurerate estimation errors across all levels of leniency w.r.t. the true evaluation. \cfbi gives both accurate and precise estimates regardless of the number of judges used. The performance of \contraction gets notably worse as the number of judges increases.
}
\label{fig:results_compas}
\end{figure}
Figure~\ref{fig:results_compas} shows the absolute errors of the failure rate estimates for the \batch machine decision maker, as a function of the number of judges in the data (the judges are also \batch decision makers).
The MAE of our \cfbi at all levels of leniency is consistently lower than that of \contraction for each number of judges used in the experiments.
%
Quite notably, the error of \contraction grows when there are more judges in the data, and the variance of its failure rate estimates increases, as the most lenient judges are assigned fewer and fewer subjects.
Again, we attribute this behavior to the fact that \contraction crucially depends on the most lenient decision makers, while \cfbi makes full use of the data.