%!TEX root = sl.tex
% The above command helps compiling in TexShop on a Mac. Hitting typeset compiles sl.tex directly instead of producing an error here.

\section{Experiments}

\todo{Michael}{We should make sure that what we describe in the experiments follows what we have in the technical part of the paper, without describing the technical details again. This may require a light re-write of the experiments. See also other notes below.}
\todo{Michael}{Create and use macros for all main terms and mathematical quantities, so that they stay consistent throughout the paper. Already done for previous sections}

We thoroughly tested our proposed method for evaluating decision maker performance in terms of accuracy, variability, and robustness. We employed both synthetic and real data, including decisions by several different kinds of decision makers. In particular, we compare performance to the state-of-the-art contraction technique of \citet{lakkaraju2017selective}.

\subsection{Synthetic Data} \label{sec:syntheticsetting}
For our experiments with synthetic data, we build on the simulation setting of \citet{lakkaraju2017selective}.
We sampled a total of $\datasize=5\,000$ subjects.
The features \obsFeatures and \unobservable were drawn independently from standard Gaussians.
Outcomes for the subjects were sampled from:
\begin{eqnarray}
\prob{\outcome=0~|~\obsFeaturesValue, \unobservableValue} & =&
	\invlogit(\alpha_\outcome + \beta_\obsFeatures^T \obsFeaturesValue + \beta_\unobservable \unobservableValue + \epsilon_\outcome )	
\end{eqnarray}
where the values of the coefficients $\beta_\obsFeatures$ and $\beta_\unobservable$ were both initially set to $1$.
%
The intercept $\alpha_\outcome$, determining the baseline probability of a negative result, was set to $0$.
Stochasticity was added to the behaviour of the subjects by including the noise term $\epsilon_\outcome$, drawn from a zero-mean Gaussian distribution with variance $0.1$.
The subjects were assigned to the decision makers (described in the next subsection) at random, such that each decision maker received a total of 100 subjects.
The leniencies \leniency of the $\judgeAmount=50$ decision makers \human were then drawn from Uniform$(0.1,~0.9)$.
%
%The error terms $\epsilon_\decisionValue$ and $\epsilon_\outcomeValue$ were sampled independently from zero-mean Gaussian distributions with variance $0.1$.
%
The data set was finally split at random into training and test sets, each containing the decisions of 25 judges; the training set was used to fit the models of the machine decision makers.
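The data-generating process described above can be sketched as follows. This is a minimal sketch under our reading of the setup: invlogit is taken to be the logistic sigmoid, and all variable names are ours, not part of the original implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 5000          # number of subjects
beta_x = 1.0      # coefficient of the observed feature X
beta_z = 1.0      # coefficient of the unobserved feature Z
alpha = 0.0       # intercept: baseline probability of a negative result

def invlogit(a):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-a))

x = rng.standard_normal(N)                 # observed features
z = rng.standard_normal(N)                 # unobserved features
eps = rng.normal(0.0, np.sqrt(0.1), N)     # subject-level noise, variance 0.1
p_neg = invlogit(alpha + beta_x * x + beta_z * z + eps)  # P(Y = 0 | x, z)
y = (rng.random(N) >= p_neg).astype(int)   # outcome: Y = 0 with probability p_neg

# 50 judges with 100 subjects each, assigned at random;
# leniencies drawn from Uniform(0.1, 0.9)
judge = rng.permutation(np.repeat(np.arange(50), 100))
leniency = rng.uniform(0.1, 0.9, 50)
```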


\subsection{Decision Makers} \label{sec:decisionmakers}


We used several different kinds of decision makers in the experiments, both to provide the simulated decisions in the data and as candidate decision makers to be evaluated. Recall from Section~2 that the portion of subjects for which each decision maker makes a positive decision can be controlled by the leniency level $\leniencyValue$.

The simplest decision maker, \textbf{Random}, simply selects at random a portion $\leniencyValue$ of the subjects assigned to it, makes a positive decision $\decision=1$ for them, and a negative decision $\decision=0$ for the remaining subjects.
Following~\citet{lakkaraju2017selective}, the \textbf{Batch} decision maker sorts its subjects by risk score and then releases the portion $\leniencyValue$ of subjects with the lowest scores. In the experiments, the risk scores were given by the expression
\begin{equation} \label{eq:riskscore}
\invlogit(\gamma_\obsFeatures\obsFeaturesValue + \gamma_\unobservable\unobservableValue + \epsilon_\decisionValue).
\end{equation}

The previous decision makers may seem unfair, as their decision for one subject \emph{depends} on the other subjects. Put simply, they may need to make a negative decision for a subject today in order to make a positive decision for some subject tomorrow. To this end, we formulated an \textbf{Independent} decision maker, generalizing the batch decision maker, as follows.
The risk scores of the defendants follow some distribution with cumulative distribution function $G$. Given leniency \leniencyValue, the independent decision maker makes a positive decision for a defendant if their risk score is in the lowest \leniencyValue portion of this distribution, i.e., if the risk score is lower than the inverse cumulative distribution function $G^{-1}$ evaluated at \leniencyValue. In the experiments, the risk scores were computed using Equation~\ref{eq:riskscore} without the error term $\epsilon_\decisionValue$.
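The difference between the two decision rules can be sketched as follows. The function names are hypothetical, and the standard-normal risk scores with an empirical quantile standing in for $G^{-1}$ are illustrative assumptions, not the exact experimental configuration.

```python
import numpy as np

def batch_decide(risk, r):
    """Batch rule: release (decision 1) the fraction r of this judge's
    subjects with the lowest risk scores; all others get decision 0."""
    n_release = int(r * len(risk))
    order = np.argsort(risk)
    decision = np.zeros(len(risk), dtype=int)
    decision[order[:n_release]] = 1
    return decision

def independent_decide(risk, r, G_inv):
    """Independent rule: release a subject iff its own risk score is below
    the population quantile G_inv(r), independently of other subjects."""
    return (risk < G_inv(r)).astype(int)

# Example: standard-normal risk scores; use an empirical quantile as G^{-1}.
rng = np.random.default_rng(1)
population = rng.standard_normal(100_000)
G_inv = lambda r: np.quantile(population, r)

risk = rng.standard_normal(100)
d_batch = batch_decide(risk, 0.3)               # exactly 30 released
d_indep = independent_decide(risk, 0.3, G_inv)  # about 30 released on average
```

Note that the independent rule gives the same decision for a subject no matter which other subjects happen to be assigned to the same judge, while the batch rule does not.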

The decision makers in the data and the evaluated decision makers differ in the observability of \unobservable: the former have access to \unobservable and include it in their regression model, while the latter omit \unobservable completely. All parameters of the regression models of the evaluated decision makers are learned from the training data set; evaluation is based solely on the test set.
\subsection{Evaluators}
In addition to the counterfactual imputation (\textbf{Counterfactuals}) presented in this paper, we consider three other ways of evaluating decision makers. For the synthetic data, we can obtain the outcomes even for subjects with a negative decision. We plot these as \textbf{True evaluation}. Note that in a realistic setting the true evaluation would not be available. We also report the failure rate computed only on the subjects that were released in the data as \textbf{Labeled outcomes}. This naive baseline has previously been shown to considerably underestimate the true failure rate.

The state-of-the-art method for this setting is \textbf{Contraction} of \citet{lakkaraju2017selective}.
It is designed specifically to estimate the true failure rate of a machine decision maker in the selective labels setting.
Contraction bases its evaluation only on the subjects assigned to the most lenient decision maker in the data. These subjects are sorted according to the lowest leniency level at which they would be released by the evaluated decision maker $\machine$. The failure rate estimate for a given leniency level $\leniencyValue$ is then computed from the labeled outcomes after contracting the positive decisions on the subjects released only at higher leniency levels.
Assuming that the distribution of subjects assigned to each decision maker is similar, this failure rate generalizes to the whole data set.
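Our reading of the core contraction computation can be sketched as follows. This is a hedged sketch with hypothetical function and argument names, not the reference implementation of \citet{lakkaraju2017selective}.

```python
import numpy as np

def contraction(released, failed, model_risk, r):
    """Contraction-style failure rate estimate at target leniency r,
    computed from the most lenient judge's subjects only.

    released   : boolean array, True where the judge made a positive decision
    failed     : boolean array, outcome observed only where released is True
    model_risk : risk score of the evaluated decision maker for each subject
    r          : leniency of the evaluated decision maker (at most the
                 judge's own leniency)
    """
    n_total = len(released)
    idx = np.flatnonzero(released)
    # Keep the r * n_total released subjects that the evaluated model
    # considers least risky, i.e. those it would also release at leniency r.
    keep = idx[np.argsort(model_risk[idx])][: int(r * n_total)]
    # Observed failures among the kept subjects, relative to all subjects.
    return failed[keep].sum() / n_total
```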
\subsection{Results}
\begin{figure}
\includegraphics[width=\linewidth]{./img/_deciderH_batch_deciderM_batch_maxR_0_9coefZ1_0_all} 
\caption{Evaluation of the batch decision maker on synthetic data with batch decision makers. In this basic setting, both counterfactuals and contraction are able to match the true evaluation curve closely, but counterfactuals exhibits markedly lower standard deviations (shown by the error bars).
\acomment{Why no error bars for true evaluation? How exactly were the error bars determined?}
 }\label{fig:basic}
\end{figure}

Figure~\ref{fig:basic} shows the basic evaluation of a batch decision maker on data employing batch decision makers, over different leniencies. Here an evaluation method is good if it matches the true evaluation (only available for synthetic data) at every leniency.
%
In this basic setting our approach (counterfactuals) achieves more accurate estimates with lower variance than the state-of-the-art contraction. 
%
The naive approach of comparing only cases where outcome is not masked by decision (labeled outcomes) underestimates the failure rate considerably.


\begin{figure}
%\centering
\includegraphics[width=\linewidth]{./img/sl_errors}
\caption{Error of the estimate w.r.t.\ the true evaluation.
Error bars denote standard deviations. The presented method (counterfactuals) is able to offer steady estimates with low variance robustly across different decision makers, while the performance of contraction varies considerably within and across decision makers. \acomment{Does not communicate that H includes many decision makers!}
 }
\label{fig:results_errors}
\end{figure}
Figure~\ref{fig:results_errors} shows the summarized errors of the top evaluators, for different decision makers being evaluated and employed in the data.\footnote{More detailed plots similar to Figure~\ref{fig:basic} can be found in the supplementary material.} Counterfactuals robustly evaluates the decision makers with low error. It is able to correctly infer the behaviour of the decision makers employed in the data and use that to evaluate any further decision makers using only selectively labeled data. Contraction shows markedly poorer performance here; this may be due to the strong assumptions it needs to make, which may not hold for all decision makers.

\begin{figure}
\includegraphics[width=\linewidth]{./img/sl_rmax05_H_independent_M_batch_fixed}
\caption{Evaluating the batch decision maker on data employing independent decision makers with leniency at most $0.5$. The proposed method (counterfactuals) offers good estimates of the failure rates at all levels of leniency, whereas contraction can estimate the failure rate only up to leniency $0.5$.
}
\label{fig:results_rmax05}
\end{figure}

Figure~\ref{fig:results_rmax05} shows the evaluation over leniencies similarly to Figure~\ref{fig:basic}, but this time the maximum leniency of the decision makers in the data was limited to $0.5$.
%
Contraction is only able to estimate the failure rate up to leniency $0.5$; for higher leniency rates it does not output any results.
%
Our method (counterfactuals) can produce failure rate estimates for all leniencies, although the accuracy of the estimates for the largest leniencies is lower than with unlimited leniency.
%
This observation is vitally important, since decision makers based on advanced machine learning techniques may well allow for the use of higher leniency rates than those (often human) employed in the data.

\begin{figure}
\begin{center}\includegraphics[width=\linewidth]{img/sl_errors_betaZ5}
\end{center}
\caption{Error of the estimate w.r.t.\ the true evaluation when the effect of the unobserved $\unobservable$ is high ($\beta_\unobservable=\gamma_\unobservable=5$). Although the decision maker quality is poorer, the proposed approach (counterfactuals) can still evaluate the decision makers accurately. Contraction shows higher variance and lower accuracy.}
\label{fig:highz}
\end{figure}
Figure~\ref{fig:highz} shows experiments where the effect of the unobserved $\unobservable$ is higher, i.e., we used $\beta_\unobservable=\gamma_\unobservable=5$ when generating the data. In this case, the decisions in the data are made based on background factors not observed by the decision maker $\machine$ being evaluated, so its performance is expected to be poor. Nevertheless, the proposed method (counterfactuals) is able to evaluate the different decision makers $\machine$ accurately. Contraction shows markedly poorer performance in comparison, also relative to its performance in Figure~\ref{fig:results_errors}, where the effect of $\unobservable$ on the decisions in the data was not as high.

Thus overall, in these synthetic settings our method achieves more accurate results with considerably less variation than the state-of-the-art contraction, and it allows for evaluation in situations where the strong assumptions of contraction inhibit evaluation altogether.

\subsection{COMPAS data}

COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) is equivant's (formerly Northpointe\footnote{\url{https://www.equivant.com}}) set of tools for assisting decision-making in the criminal justice system. 
%
COMPAS provides needs assessments and risk estimates of recidivism.
%
The COMPAS score is derived from prior criminal history as well as socio-economic and personal factors, among other things, and it predicts recidivism within the following two years \cite{brennan2009evaluating}.
The system came under scrutiny in 2016 after ProPublica published an article claiming that the tool was biased against black defendants \cite{angwin2016machine}.
%
Following the discussion, \citet{kleinberg2016inherent} showed that the fairness criteria used by ProPublica and Northpointe could not have been satisfied simultaneously.

The COMPAS data set used in this study is recidivism data from Broward County, Florida, USA, made available by ProPublica\footnote{\url{https://github.com/propublica/compas-analysis}}.
%
The original data contained information about $18\,610$ defendants who were given a COMPAS score during 2013 or 2014.
%
After removing defendants who were not processed at the pretrial stage, $11\,757$ defendants were left.
%
Additionally, defendants for whom the COMPAS score could not be matched with a corresponding charge were removed from the analysis, resulting in a data set of $7\,214$ observations.
%
Following ProPublica's data cleaning process, the final data consisted of $6\,172$ offenders.
%
The data includes the subjects' demographic information, such as gender, age and race, together with information on their previous offences.

\acomment{Riku, update this paragraph to what is currently correct!} For the analysis, we created $\judgeAmount=12$ synthetic judges with fixed leniency levels $0.1$, $0.5$ and $0.9$, such that 4 decision makers shared each leniency level.
The $\datasize=6\,172$ subjects were distributed to these judges as evenly as possible and at random.
%
In this semi-synthetic scenario, the judges based their decisions on the COMPAS score, releasing the fraction of defendants with the lowest score according to their leniency.
%
For example, if a synthetic judge had leniency $0.4$, they would release the $40\%$ of defendants with the lowest COMPAS scores.
%
Those who were given a negative decision had their outcome label set to positive $\outcome = 1$.
%
After assigning the decisions, the data was split 10 times into training and test sets containing the decisions of 6 judges each.
%
A logistic regression model was trained on the training data to predict two-year recidivism from categorised age, race, gender, number of priors, and the degree of the crime COMPAS screened for (felony or misdemeanour), using only observations with positive decisions.
As the COMPAS score is derived from a larger set of predictors than the aforementioned five \cite{}, the unobservable information would then be encoded in the COMPAS score.
%
The trained logistic regression model was used as the decision maker \machine on the test data, and the same features were given as input to the counterfactual imputation.
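The construction of the semi-synthetic judges can be sketched as follows. This is a sketch only: the COMPAS decile scores and recidivism labels below are random stand-ins for the corresponding columns of the real data, and the variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_judges = 6172, 12
leniency = np.repeat([0.1, 0.5, 0.9], 4)          # 4 synthetic judges per level

compas = rng.integers(1, 11, n)                   # stand-in COMPAS decile scores
y = rng.integers(0, 2, n)                         # stand-in two-year recidivism labels
judge = rng.permutation(np.arange(n) % n_judges)  # even random assignment to judges

decision = np.zeros(n, dtype=int)
for j in range(n_judges):
    subj = np.flatnonzero(judge == j)
    # each judge releases the leniency[j] fraction with the lowest COMPAS scores
    order = subj[np.argsort(compas[subj], kind="stable")]
    decision[order[: int(leniency[j] * len(subj))]] = 1

y = np.where(decision == 0, 1, y)                 # negative decision => label set to 1
```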

%\subsection{Results}

\begin{figure}
%\centering
\includegraphics[width=\linewidth]{./img/sl_errors_compas}
\caption{Results with the COMPAS data. The proposed method (counterfactuals) shows accurate performance regardless of the number of judges, while the performance of contraction gets notably worse as the data includes decisions by an increasing number of judges.
}
\label{fig:results_compas}
\end{figure}
Figure~\ref{fig:results_compas} shows the failure rate errors for the batch decision maker as a function of the number of judges (also batch decision makers).
The mean error of the proposed method (counterfactuals) at all levels of leniency is consistently lower than that of contraction, regardless of the number of judges used in the experiments. The error of contraction grows when there are more judges in the data. In particular, the variance increases as the most lenient judge receives fewer and fewer subjects.