%!TEX root = sl.tex
% The above command helps compiling in TexShop on a Mac. Hitting typeset compiles sl.tex directly instead of producing an error here.
We test the accuracy and robustness of \cfbi (counterfactual-based imputation) in evaluating the performance of decision makers of different kinds.
%
Towards this end, we employ both synthetic and real data.
%
We compare \cfbi especially with the recent {\it contraction} technique of \citet{lakkaraju2017selective}.
\spara{Reproducibility}. Our manuscript contains a full specification of the parameters of the models and synthetic datasets we used, as well as links to the real public datasets.
%
The implementation uses Python 3.6.9 and PyStan v.2.19.0.0 with cmdstanpy 0.4.3 -- and will be made available online upon publication.
We begin our experiments with synthetic data, in order to demonstrate various properties of our approach.
To set up the experimentation, we follow the setting of \citet{lakkaraju2017selective}.
Each synthetic dataset we experiment with consists of $\datasize=5,000$ randomly generated cases.
%
The features \obsFeatures and \unobservable of each case are drawn independently from standard Gaussians.
%
Each case is assigned randomly to one out of $\judgeAmount=50$ decision makers, such that each decision maker receives a total of $100$ cases.
%
The leniency \leniency of each decision maker is drawn independently of other decision makers from Uniform$(0.1,~0.9)$.
%
As soon as a case is assigned to a decision maker, a decision $\decision$ is made for the case.
%
The exact way this happens for different types of decision makers is described in the next subsection (Sec.~\ref{sec:dm_exps}).
%
If the decision is positive, then an outcome is assigned to the case according to Eq.~\ref{eq:defendantmodel}
%
with $\alpha_\outcome = 0$ and $\beta_\obsFeatures = \beta_\unobservable = 1$.
%
Note that, in the event of a positive decision, the intercept $\alpha_\outcome$ determines the base probability for a negative result -- and the choice of $\alpha_\outcome = 0$ means that positive and negative outcomes are equally likely to appear in the dataset (expected proportion is $50\%-50\%$) among cases with positive decisions.
%
Additional noise is added to the outcome of each case via $\epsilon_\outcome$, which is drawn from a zero-mean Gaussian distribution with small variance, $\epsilon_\outcome\sim \gaussian{0}{0.1}$.
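For concreteness, the following sketch (in Python/NumPy; variable names are ours) illustrates how one such synthetic dataset can be generated. The logistic form of the outcome model is an assumption of ours about Eq.~\ref{eq:defendantmodel}; outcomes are drawn for all cases here, so that they are available also for cases that later receive a negative decision.
\begin{verbatim}
import numpy as np
from scipy.special import expit  # logistic sigmoid

rng = np.random.default_rng(0)

N, J = 5000, 50                      # cases and decision makers
x = rng.normal(0.0, 1.0, N)          # observed feature X
z = rng.normal(0.0, 1.0, N)          # unobserved feature Z
eps = rng.normal(0.0, 0.1, N)        # small additional outcome noise

# Assign exactly 100 cases to each decision maker, at random.
judge = rng.permutation(np.repeat(np.arange(J), N // J))
leniency = rng.uniform(0.1, 0.9, J)  # one leniency value per decision maker

# Assumed logistic outcome model: a higher risk score makes a
# negative outcome (Y = 0) more likely.
alpha_y, beta_x, beta_z = 0.0, 1.0, 1.0
p_negative = expit(alpha_y + beta_x * x + beta_z * z + eps)
y = (rng.uniform(size=N) > p_negative).astype(int)  # Y = 1 is positive
\end{verbatim}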
Our experimentation involves two categories of decision makers: (i) the set of decision makers \humanset, the decisions of which are reflected in a dataset, and (ii) the decision maker \machine, whose performance is to be evaluated on the log of cases decided by \humanset.
Among cases that receive a positive decision, the probability of a positive or a negative outcome depends on the quantity below (see Equation~\ref{eq:defendantmodel}), to which we refer as the `{\it risk score}' of each case
\begin{equation}
\text{risk score} = \beta_\obsFeatures \obsFeaturesValue + \beta_\unobservable \unobservableValue .
\end{equation}
%
Higher values indicate that a negative outcome is more likely.
%
For the {\it first} type of decision makers we consider, we assume that decisions are rational and well-informed, and that a decision maker with leniency \leniencyValue makes a positive decision only for the \leniencyValue fraction of cases that are most likely to lead to a positive outcome.
Specifically, we assume that the decision-makers know the cumulative distribution function $G$ that the risk scores $s = \beta_\obsFeatures \obsFeaturesValue + \beta_\unobservable \unobservableValue$ of defendants follow.
This is a reasonable assumption to make in settings where decision makers have accurate knowledge of the joint feature distribution \prob{\obsFeatures = \obsFeaturesValue, \unobservable =\unobservableValue}, as well as the risk parameters $\beta_\obsFeatures$ and $\beta_\unobservable$ -- as such knowledge allows one to calculate $G$.
For example, a seasoned judge who has tried a large volume and variety of cases may have a good idea about the various cases that appear at court and which of them pose higher risk.
Considering a decision maker with leniency $\leniency = \leniencyValue$ who decides a case with risk score $s$, a positive decision is made only if $s$ is in the \leniencyValue portion of the lowest scores according to $G$, i.e. if
\begin{equation}
s \leq G^{-1}(\leniencyValue).
\end{equation}
%
See Appendix~\ref{sec:independent} for more details.
%
Since in our setting the distribution $G$ is given and fixed, such decisions for different cases happen independently based on their risk score.
%
Because of this, we refer to this type of decision makers as \independent.
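As an illustration, the \independent decision rule admits the following minimal sketch under our synthetic data distribution, where the risk scores are Gaussian and hence $G^{-1}$ has a closed form (the function name is ours).
\begin{verbatim}
import numpy as np
from scipy.stats import norm

def independent_decisions(x, z, leniency, beta_x=1.0, beta_z=1.0):
    """Positive decision (T = 1) iff the case's risk score is below the
    leniency-quantile of the score distribution G.  With X, Z ~ N(0, 1)
    independently, the risk score is N(0, beta_x**2 + beta_z**2)."""
    s = beta_x * x + beta_z * z
    threshold = norm.ppf(leniency,
                         loc=0.0,
                         scale=np.sqrt(beta_x ** 2 + beta_z ** 2))
    return (s <= threshold).astype(int)
\end{verbatim}
Given the arrays of the previous sketch, the decisions of all \independent decision makers are obtained as \texttt{independent\_decisions(x, z, leniency[judge])}.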
In addition, we experiment with a different type of decision makers, namely \batch.
%
Decision makers of this type are assumed to consider all cases assigned to them at once, as a batch; sort them by risk score; and, for leniency $\leniency = \leniencyValue$, release the $\leniencyValue$ portion of the batch with the lowest (best) risk scores.
Such decision makers still have a good knowledge of the relative risk that the cases assigned to them pose, but they are also shortsighted, as they make decisions for a case \emph{depending} on other cases in their batch.
For example, if a decision maker is randomly assigned a batch of cases that are all very likely to lead to a good outcome, a large portion $1-\leniencyValue$ of them will still be handed a negative decision.
Finally, we consider a random version of \batch as a third type of decision maker, namely \random.
%
Decision makers of this type simply select uniformly at random a portion $\leniency=\leniencyValue$ of cases from the batch assigned to them, make a positive decision for them -- and make a negative decision for the remaining cases.
%
\random decision makers make poor decisions -- but they do not introduce selection bias, as their decision is not correlated with the possible outcome.
%
For this reason, including them in the experiments is useful, as it allows us to test performance of different evaluation approaches in the `extreme' scenario of absence of selection bias in the data.
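For illustration, minimal sketches of the \batch and \random decision rules over a single decision maker's batch are given below (function names are ours; for decision makers in \humanset the score is $\beta_\obsFeatures \obsFeaturesValue + \beta_\unobservable \unobservableValue$, while the corresponding \machine, described next, scores cases using \obsFeatures only).
\begin{verbatim}
import numpy as np

def batch_decisions(scores, leniency):
    """Release (T = 1) the leniency-fraction of the batch with the
    lowest risk scores; the rest get a negative decision."""
    t = np.zeros(len(scores), dtype=int)
    t[np.argsort(scores)[:int(leniency * len(scores))]] = 1
    return t

def random_decisions(n_cases, leniency, rng):
    """Release a uniformly random leniency-fraction of the batch."""
    t = np.zeros(n_cases, dtype=int)
    released = rng.choice(n_cases, size=int(leniency * n_cases),
                          replace=False)
    t[released] = 1
    return t
\end{verbatim}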
\mpara{Decisions by \machine}\newline
%
For \machine, we consider the same three types of decision makers as for \humanset above, with one difference: decision makers \humanset have access to \unobservable, while \machine does not.
%
Their definitions are adapted in the obvious way -- i.e., for \independent and \batch, the risk score depends only on the values of the observed features \obsFeatures.
\begin{figure}
\includegraphics[width=1.1\linewidth]{./img/with_epsilon_deciderH_independent_deciderM_batch_maxR_0_9coefZ1_0_all}
\caption{Evaluation of a batch decision maker on synthetic data with independent decision makers in the data. Error bars denote the standard deviation of the \failurerate estimate across data splits. In this basic setting, both our \cfbi and contraction are able to match the true evaluation curve closely, but the former exhibits lower standard deviations, as shown by the error bars.}
\label{fig:basic}
\end{figure}
In addition to counterfactual-based imputation (\cfbi) presented in this paper, we consider three other ways of evaluating decision makers. For the synthetic data, we can obtain the outcomes even for cases with negative decisions. We call the acceptance rate versus failure rate tradeoff curve obtained by using the true outcomes \textbf{True evaluation}. Note that in a realistic setting the true evaluation would not be available. We also report the failure rate computed using only the cases that were released in the data as \textbf{Labeled outcomes}. This naive baseline has previously been shown to considerably underestimate the true failure rate \citep{lakkaraju2017selective}.
A recent method for evaluating decision makers in this setting is \textbf{Contraction} of \citet{lakkaraju2017selective}.
It is designed specifically to estimate the true failure rate of a machine decision maker in the selective labels setting.
Contraction bases its evaluation only on the cases assigned to the most lenient decision maker in the data. These cases are sorted according to the lowest leniency level at which they would be released by the evaluated decision maker $\machine$. The failure rate estimate for a given leniency level $\leniencyValue$ is then computed from the labeled outcomes, after contracting away the positive decisions on the cases that would be released only at higher leniency levels.
Assuming that the distribution of cases assigned to each decision maker is similar, this failure rate estimate generalizes to the whole dataset.
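For reference, the following is a minimal sketch of the contraction estimator as we read it from \citet{lakkaraju2017selective}; the interface and variable names are ours, and the bookkeeping in the original implementation may differ in details.
\begin{verbatim}
import numpy as np

def contraction(decision, outcome, predicted_risk, r):
    """Contraction estimate of the failure rate at acceptance rate r,
    computed on the caseload of the most lenient decision maker only.
    decision[i] = 1 if case i was released in the data, outcome[i] = 0
    marks a failure (observed only when decision[i] = 1), and
    predicted_risk[i] is the evaluated decision maker's risk score."""
    n = len(decision)
    released = np.flatnonzero(decision == 1)
    # Keep the r*n released cases deemed least risky by the evaluated
    # decision maker; riskier released cases are "contracted" away.
    # Requires r <= leniency of the most lenient decision maker.
    kept = released[np.argsort(predicted_risk[released])][:int(r * n)]
    return np.sum(outcome[kept] == 0) / n
\end{verbatim}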
%
To produce a train-test split, we randomly choose the decisions of half of the judges to form the training dataset, while the rest are assigned to the test dataset; this splitting is repeated $10$ times.
%
The training datasets were used only to train the machine decision makers.
The evaluation approaches produced a separate \failurerate estimate for each test dataset.
Curves in Figures~\ref{fig:basic} and \ref{fig:results_rmax05} present the mean of these estimates at each level of leniency.
The error bars denote the standard deviation of the estimated failure rates across the test datasets.
%
For the summary Figures~\ref{fig:results_errors}, \ref{fig:highz} and \ref{fig:results_compas}, the failure rate estimate for each test dataset was compared to the estimate given by the true evaluation.
%
Error bars in those figures stand for the standard deviation of this error across the test datasets.
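In code, the quantities behind the summary figures amount to the following (array names are ours).
\begin{verbatim}
import numpy as np

def error_bars(estimates, true_values):
    """estimates[s, k]: an evaluator's failure-rate estimate on test
    split s at leniency level k; true_values[s, k]: the corresponding
    true evaluation.  Returns the mean error and its standard deviation
    across splits, i.e. the plotted points and error bars."""
    errors = np.asarray(estimates) - np.asarray(true_values)
    return errors.mean(axis=0), errors.std(axis=0)
\end{verbatim}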
Figure~\ref{fig:basic} shows the basic evaluation of a batch decision maker on data employing independent decision makers, over different leniencies. An evaluation method is good if it matches the true evaluation (only available for synthetic data) at every leniency.
In this basic setting, the proposed \cfbi achieves estimates with considerably lower variance (shown by the error bars) than the state-of-the-art contraction.
%
The naive approach of considering only cases whose outcome is not masked by the decision (labeled outcomes) underestimates the failure rate considerably.
\begin{figure}
\includegraphics[width=\linewidth]{./img/sl_errors_betaZ1}
\caption{Error of the \failurerate estimate with respect to true evaluation, for different types of decision makers employed in the data and being evaluated. Error bars denote the standard deviation of the error. The presented method (\cfbi) is able to offer stable estimates with low variance robustly across different decision makers, whereas the performance of contraction varies considerably within and across different decision makers.}
\label{fig:results_errors}
\end{figure}
Figure~\ref{fig:results_errors} summarizes the error rates of the top-performing evaluators, for different decision makers being evaluated and different decision makers employed in the data. \cfbi evaluates the decision makers with low error rates robustly: it is able to correctly model the decisions of the decision makers employed in the data and use that to evaluate any further decision maker using only selectively labeled data. Contraction shows markedly poorer performance here; this may be due to the strong assumptions it needs to make, which may not hold for all decision makers.
\begin{figure}
\includegraphics[width=1.1\linewidth]{./img/with_epsilon_deciderH_independent_deciderM_batch_maxR_0_5coefZ1_0_all}
\caption{Evaluating a batch decision maker on data employing independent decision makers with leniency at most $0.5$. The proposed method (\cfbi) offers good estimates of the failure rates for all levels of leniency, whereas contraction can estimate the failure rate only up to leniency $0.5$.}
\label{fig:results_rmax05}
\end{figure}
Figure~\ref{fig:results_rmax05} shows the evaluation over leniencies, similarly to Figure~\ref{fig:basic}, but this time the leniencies of the decision makers in the data were limited to at most $0.5$.
%
Contraction is only able to estimate the failure rate up to leniency $0.5$; for higher leniency rates it does not output any results.
%
Our method (\cfbi) can produce failure rate estimates for all leniencies, although the accuracy of the failure rate estimates at the largest leniencies is lower than with unlimited leniency.
This observation is important because decision makers based on advanced machine learning techniques may well allow for higher leniency rates than those of the (often human) decision makers employed in the data.
Figure~\ref{fig:highz} shows experiments where the effect of the unobserved $\unobservable$ is higher, i.e., we used $\beta_\unobservable=\gamma_\unobservable=5$ when generating the data. In this case the decisions in the data are based mostly on background factors not observed by the decision maker $\machine$ being evaluated, so the performance of $\machine$ is not expected to be good. Nevertheless, the proposed method (\cfbi) is able to evaluate different decision makers $\machine$ accurately. Contraction shows markedly worse performance in comparison, also compared to its performance in Figure~\ref{fig:results_errors}, where the effect of $\unobservable$ on the decisions in the data was not as high.
Thus, overall, in these synthetic settings our method achieves more accurate results with considerably less variation than the state-of-the-art contraction, and it allows for evaluation in situations where the strong assumptions of contraction inhibit evaluation altogether.
COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) is equivant's (formerly Northpointe\footnote{\url{https://www.equivant.com}}) set of tools for assisting decision-making in the criminal justice system.
%
COMPAS provides needs assessments and risk estimates of recidivism.
%
The COMPAS score is derived from prior criminal history, socio-economic and personal factors, among other things, and it predicts recidivism in the following two years \cite{brennan2009evaluating}.
%
The system came under scrutiny in 2016 after ProPublica published an article claiming that the tool was biased against black people \cite{angwin2016machine}.
%
After the ensuing discussion, \citet{kleinberg2016inherent} showed that the fairness criteria used by ProPublica and Northpointe could not be reconciled.
The COMPAS data set used in this study is recidivism data from Broward County, Florida, USA, made available by ProPublica\footnote{\url{https://github.com/propublica/compas-analysis}}.
%
The original data contained information on $18\,610$ defendants who were given a COMPAS score during 2013 or 2014.
%
After removing defendants who were not assessed at the pretrial stage, $11\,757$ defendants were left.
%
Additionally, defendants for whom the COMPAS score could not be matched with a corresponding charge were removed from the analysis, resulting in a data set of $7\,214$ observations.
%
Following ProPublica's data cleaning process, the final data consisted of $6\,172$ offenders.
%
The data includes the subjects' demographic information, such as gender, age and race, together with information on their previous offences.
For the analysis, we deployed $\judgeAmount \in \{12, 24, 48\}$ synthetic judges with fixed leniency levels $0.1$, $0.5$ and $0.9$, so that a third of the judges shared each leniency level.
%
The $\datasize=6\,172$ subjects were distributed to the \judgeAmount judges as evenly as possible and at random.
%
In this scenario, the judges based their decisions on the COMPAS score, releasing the fraction of defendants with the lowest scores according to their leniency.
%
For example, if a synthetic judge had leniency $0.4$, they would release the $40\%$ of defendants with the lowest COMPAS scores.
%
Those who were given a negative decision had their outcome label set to positive, $\outcome = 1$.
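A sketch of this assignment and decision process is given below (function and variable names are ours).
\begin{verbatim}
import numpy as np

def synthetic_judges(compas_score, n_judges, rng):
    """Distribute defendants evenly at random to synthetic judges with
    leniencies 0.1, 0.5 and 0.9 (a third of the judges each); every
    judge releases the lowest-scoring leniency-fraction of their own
    caseload."""
    n = len(compas_score)
    leniencies = np.repeat([0.1, 0.5, 0.9], n_judges // 3)
    judge = rng.permutation(np.arange(n) % n_judges)  # even, random
    decision = np.zeros(n, dtype=int)
    for j in range(n_judges):
        idx = np.flatnonzero(judge == j)
        n_release = int(leniencies[j] * len(idx))
        decision[idx[np.argsort(compas_score[idx])[:n_release]]] = 1
    return judge, decision
\end{verbatim}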
%
After assigning the decisions, the data was split 10 times into training and test sets, each containing the decisions of half of the judges.
%
A logistic regression model was trained on the training data to predict recidivism from categorised age, race, gender, the number of prior crimes, and the degree of the crime COMPAS screened for (felony or misdemeanour), using only observations with positive decisions.
As the COMPAS score is derived from a larger set of predictors than the aforementioned five \cite{brennan2009evaluating}, the unobservable information would then be encoded in the COMPAS score.
%
The trained logistic regression model was used as decision maker \machine on the test data, and the same features were given as input to the counterfactual imputation.
The deployed machine decision maker was defined to release the \leniencyValue fraction of the defendants with the lowest predicted probability of a negative outcome.
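A sketch of how such a machine decision maker can be constructed is shown below; the feature matrices are assumed to contain the five encoded features listed above, and all names are ours.
\begin{verbatim}
import numpy as np
from sklearn.linear_model import LogisticRegression

def machine_decisions(X_train, y_train, released_train, X_test, leniency):
    """Fit a logistic regression on the released training cases only,
    then release the leniency-fraction of test defendants with the
    lowest predicted probability of a negative outcome (y = 0)."""
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train[released_train == 1], y_train[released_train == 1])
    p_negative = model.predict_proba(X_test)[:, 0]  # P(y = 0)
    decision = np.zeros(len(X_test), dtype=int)
    decision[np.argsort(p_negative)[:int(leniency * len(X_test))]] = 1
    return decision
\end{verbatim}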
\begin{figure}
\begin{center}\includegraphics[width=\linewidth]{img/sl_errors_betaZ5}
\end{center}
\caption{Error of estimate w.r.t.\ true evaluation when the effect of the unobserved $\unobservable$ is high ($\beta_\unobservable=\gamma_\unobservable=5$). Although the decision maker quality is poorer, the proposed approach (\cfbi) can still evaluate the decision maker accurately. Contraction shows higher variance and less accuracy.}
\label{fig:highz}
\end{figure}
\begin{figure}
%\centering
\includegraphics[width=\linewidth]{./img/sl_errors_compas}
\caption{Results with COMPAS data. Error bars represent the standard deviation of the \failurerate estimate errors across all levels of leniency with regard to true evaluation. \cfbi gives both accurate and precise estimates regardless of the number of judges used. The performance of contraction gets notably worse when the data includes decisions by an increasing number of judges.}
\label{fig:results_compas}
\end{figure}
Figure~\ref{fig:results_compas} shows the failure rate errors of the batch machine decision maker as a function of the number of judges in the data (the judges are also batch decision makers).
The mean error of our \cfbi at all levels of leniency is consistently lower than that of contraction, regardless of the number of judges used in the experiments. The error of contraction grows when there are more judges in the data; in particular, its variance increases as the most lenient judge is assigned fewer and fewer subjects.