\section{Experiments}

\todo{Michael}{We should make sure that what we describe in the experiments follows what we have in the technical part of the paper, without describing the technical details again. This may require a light re-write of the experiments. See also other notes below.}
\todo{Michael}{Create and use macros for all main terms and mathematical quantities, so that they stay consistent throughout the paper. Already done for previous sections}

In this section we present our results from experiments with synthetic and realistic data. 
%
Each experiment is described in terms of 
\begin{itemize}
	\item A dataset \dataset. In turn, this is defined in terms of 
		\begin{itemize}
			\item features \obsFeatures and \unobservable of the cases and
			\item decision makers \human.
		\end{itemize}
		\item A decision maker \machine.
		\item Algorithms for evaluation.
\end{itemize}

\subsection{Synthetic setting}

We describe the following elements of the synthetic setting.
\begin{itemize}
	\item How we generate features \obsFeatures and \unobservable of cases; how we generate outcomes \outcome for positive decisions. 
	\item How we assign cases and leniency levels to decision makers \human.
	\item Decision makers \human can be one of two types (one type used in each experiment):
		\begin{itemize}
			\item Random.
			\item Using the same parameters as the model for \outcome, i.e., $\gamma_\obsFeaturesValue = \beta_{_\obsFeaturesValue}$ and $\gamma_\unobservableValue = \beta_{_\unobservableValue}$.
		\end{itemize}
	\item Decision makers \machine(\leniencyValue):
		\begin{itemize}
			\item Random.
			\item Decision maker with $\gamma_\obsFeaturesValue = \beta_{_\obsFeaturesValue}$ and $\gamma_\unobservableValue = 0$.
			\item Decision makers learned (naively) from a separate set of labeled data (what we've called the `training set').
			\item Decision maker learned from Stan.
		\end{itemize}
	\item How we evaluate:
		\begin{itemize}
			\item Our approach (Counterfactuals).
			\item Contraction.
			\item Labeled outcomes.
			\item True evaluation.
		\end{itemize}
\end{itemize}

For our experiments we sampled $\datasize=20k$ observations of \obsFeatures and \unobservable independently from standard Gaussians.
%
The leniencies \leniency for the $\judgeAmount=40$ decision-makers \human were then drawn from Uniform$(0.1,~0.9)$ and rounded to one decimal place.
%
The subjects were randomly distributed to the judges, and decisions \decision and outcomes \outcome were sampled from Bernoulli distributions (see Equations~\ref{eq:judgemodel} and~\ref{eq:defendantmodel}).
%
The values for the coefficients $\gamma_\obsFeaturesValue,~\gamma_\unobservableValue,~\beta_\obsFeaturesValue$ and $\beta_\unobservableValue$ were all set to $1$.
%
The intercept $\alpha_\outcomeValue$ determining the baseline probability for a negative result was set to $0$.
%
The noise terms $\epsilon_\decisionValue$ and $\epsilon_\outcomeValue$ were sampled independently from zero-mean Gaussian distributions with variances $0.1$ and $0.2^2$, respectively.
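
For concreteness, the following is a minimal Python (NumPy) sketch of this sampling step. It assumes the logistic form of Equation~\ref{eq:defendantmodel} for the outcome, with the parameter values stated above, and is meant as an illustration of the setup rather than the exact experiment code.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
N, M = 20_000, 40                                # subjects and judges

x = rng.normal(size=N)                           # observable features X
z = rng.normal(size=N)                           # unobservable features Z
r = np.round(rng.uniform(0.1, 0.9, size=M), 1)   # judge leniencies R
judge = rng.integers(0, M, size=N)               # random assignment to judges

beta_x, beta_z, alpha_y = 1.0, 1.0, 0.0
eps_y = rng.normal(scale=0.2, size=N)            # outcome noise, variance 0.2^2

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# P(Y = 0 | X, Z) is logistic in the features; Y = 0 marks a negative result.
p_negative = sigmoid(alpha_y + beta_x * x + beta_z * z + eps_y)
y = rng.binomial(1, 1.0 - p_negative)
\end{verbatim}
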
The \emph{default} decision-maker \human assigns decisions based on the probabilistic expression given in Equation \ref{eq:judgemodel}.
%
As mentioned, the decision-maker can be deemed good if the values of the coefficients $\beta_\obsFeaturesValue$ and $\beta_\unobservableValue$ are close to those of $\gamma_\obsFeaturesValue$ and $\gamma_\unobservableValue$.
%
To examine the robustness of our method, we also verified that our model produces accurate estimates even when the decision-makers \human assign their decisions at random.
%
In contrast, Lakkaraju et al.~\cite{lakkaraju2017selective} employed a human decision-maker who ranks subjects by the quantity $\sigma(\gamma_\obsFeaturesValue \obsFeaturesValue + \gamma_\unobservableValue \unobservableValue + \epsilon_\decisionValue)$
%
and assigns positive decisions to the $\leniencyValue \cdot 100\%$ of subjects with the lowest values of this quantity, which makes the decisions within a judge mutually dependent.
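
Continuing the sketch above, the random \human and the Lakkaraju-style \human can be written as follows; the \emph{default} \human of Equation~\ref{eq:judgemodel} is omitted here, and the variable names are ours.
\begin{verbatim}
gamma_x, gamma_z = 1.0, 1.0
eps_t = rng.normal(scale=np.sqrt(0.1), size=N)   # decision noise, variance 0.1
risk = sigmoid(gamma_x * x + gamma_z * z + eps_t)

t_random = np.zeros(N, dtype=int)                # random H
t_rank = np.zeros(N, dtype=int)                  # Lakkaraju-style H
for j in range(M):
    idx = np.where(judge == j)[0]
    # random H: positive decision with probability equal to the judge's leniency
    t_random[idx] = rng.binomial(1, r[j], size=idx.size)
    # ranking H: release the r[j] * 100% of the caseload with the lowest risk
    n_release = int(np.floor(r[j] * idx.size))
    released = idx[np.argsort(risk[idx])[:n_release]]
    t_rank[released] = 1
\end{verbatim}
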
We deployed multiple machine decision-makers on the synthetic data sets.
%
The decision-maker \machine can be \emph{random}, giving out positive decisions with probability \leniencyValue and ignoring the information about the subjects contained in \obsFeatures and \unobservable.
%
The machine \machine can also be trained naively on a separate set of labeled data.
%
This \emph{default} decision-maker \machine uses the observable features \obsFeatures and the observed outcomes (cases with $\decision = 1$) to predict the probability of a negative outcome, which is then used to assign decisions.
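
The following is a sketch of this naively trained \machine: a logistic regression fit on a separate labeled half of the data, using only the observable feature \obsFeatures and only the cases with a positive decision. The use of scikit-learn is an illustrative assumption; any probabilistic classifier could play the same role.
\begin{verbatim}
from sklearn.linear_model import LogisticRegression

# Separate training split; only cases with T = 1 have an observed outcome,
# and only the observable feature X is available to the machine.
train_half = rng.permutation(N)[: N // 2]
labeled = train_half[t_rank[train_half] == 1]
clf = LogisticRegression().fit(x[labeled].reshape(-1, 1), y[labeled])

# Predicted probability of a negative outcome (Y = 0) for every subject;
# M(r) releases the r * 100% of subjects with the lowest predicted risk.
pred_negative = clf.predict_proba(x.reshape(-1, 1))[:, 0]
\end{verbatim}
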
We evaluated the machine decision-maker \machine at leniency \leniencyValue by comparing the \failurerate estimates produced by the different evaluation algorithms to the true failure rate.
%
The true \failurerate was computed by first sorting all observations based on the predictions given by \machine and then assigning positive decisions to the $\leniencyValue \cdot 100\%$ of observations having the lowest predicted probability for negative outcome.
%
This \emph{true evaluation} estimate was then computed directly from the ground truth labels even for the observations with $\decision_\human=0$.
%
The naive \emph{labeled outcomes} algorithm was also deployed on the data set.
%
It is the conventional method for computing an algorithm's \failurerate, and its estimate is computed similarly to that of true evaluation.
%
The algorithm first sorts all observations with $\decision_\human=1$ based on the predictions given by \machine(\leniencyValue) and then assigns positive decisions to the $\leniencyValue \cdot 100\%$ of observations having the lowest predicted probability for a negative outcome.
%
The mean absolute errors of the \failurerate estimates w.r.t.\ true evaluation from contraction, counterfactual imputation and labeled outcomes were then compared to each other.
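
For reference, the true evaluation and labeled outcomes estimates of the sketch can be computed with a single helper; the normalisation of the failure rate over all subjects considered follows \cite{lakkaraju2017selective} and is our assumption, and the leniency level $0.5$ is only an example.
\begin{verbatim}
def failure_rate(pred_negative, y, r, mask=None):
    # Release the r * 100% of the considered subjects with the lowest
    # predicted risk and return the fraction of the considered subjects
    # that are released and have a negative outcome (Y = 0).
    idx = np.arange(len(y)) if mask is None else np.where(mask)[0]
    n_release = int(np.floor(r * idx.size))
    released = idx[np.argsort(pred_negative[idx])[:n_release]]
    return np.sum(y[released] == 0) / idx.size

r_level = 0.5
fr_true = failure_rate(pred_negative, y, r_level)            # all ground-truth labels
fr_labeled = failure_rate(pred_negative, y, r_level,
                          mask=(t_rank == 1))                # observed labels only
abs_error = abs(fr_labeled - fr_true)    # one term of the mean absolute error
\end{verbatim}
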

\note{Riku}{Last chapter is now written in line with decision-maker \machine giving out predictions in interval [0, 1].}
\spara{Results} We show how the different evaluation methods perform for decision makers \machine of different type and leniency level.

We also include a plot to show how our counterfactual method correctly infers the values of \unobservable based on \obsFeatures and \decision.

\subsubsection*{The effect of unobservables}
Perform the same experiment but with $\beta_{_\unobservableValue} \gg \beta_{_\obsFeaturesValue}$.
\subsubsection*{The effect of observed leniency}

Perform the same experiment, but the minimum leniency of \human is now larger than that of \machine.

\noindent
\hrulefill

\subsubsection{Analysis on COMPAS data}

COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) is equivant's (formerly Northpointe\footnote{\url{https://www.equivant.com}}) set of tools for assisting decision-making in the criminal justice system. 
%
COMPAS provides needs assessments and risk estimates of recidivism. 
%
The COMPAS score is derived from, among other things, prior criminal history and socio-economic and personal factors, and it predicts recidivism over the following two years~\cite{brennan2009evaluating}.
%
The system came under scrutiny in 2016, when ProPublica published an article claiming that the tool was biased against black people~\cite{angwin2016machine}.
%
In the ensuing discussion, \citet{kleinberg2016inherent} showed that the fairness criteria used by ProPublica and Northpointe cannot be satisfied simultaneously.

The COMPAS data set used in this study is recidivism data from Broward County, Florida, USA, made available by ProPublica\footnote{\url{https://github.com/propublica/compas-analysis}}.
%
We preprocessed the two-year recidivism data as ProPublica did for their article.
%
The original data contained information about $18\,610$ defendants who were given a COMPAS score during 2013 or 2014. 
%
After removing defendants who were not assessed at the pretrial stage, $11\,757$ defendants remained. 
%
Additionally, defendants for whom the COMPAS score could not be matched with a corresponding charge were removed from the analysis, resulting in a data set of $7\,214$ observations. 
%
Following ProPublica's analysis, after the final data cleaning we were left with $6\,172$ offences.
%
The data includes the subjects' demographic information, such as gender, age and race, together with information on their previous offences.

For the analysis, we created $\judgeAmount=16$ synthetic judges \human with leniencies sampled from Uniform$(0.1, 0.9)$.
%
The subjects were distributed to the judges as evenly as possible and at random.
%
In this semi-synthetic scenario, the judges based their decisions on the COMPAS score, releasing the fraction of defendants with the lowest score according to their leniency.
%
E.g., if a synthetic judge had leniency $0.4$, they would release the $40\%$ of defendants with the lowest COMPAS scores.
%
Those who were given a negative decision had their outcome label set to positive ($\outcome = 1$).
%
The data was then split into training and test sets, and a logistic regression model was built to predict two-year recidivism from categorised age, gender, number of priors and the degree of crime COMPAS screened for (felony or misdemeanour).
%
This model was used as decision-maker \machine and these same features were used as input for the counterfactual imputation.
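
The following sketch illustrates this semi-synthetic construction. The column names follow the published ProPublica data set, the preprocessing described above is assumed to have been applied already, and the split and model details are illustrative.
\begin{verbatim}
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
df = pd.read_csv("compas-scores-two-years.csv")  # after ProPublica-style filtering

# 16 synthetic judges with leniencies drawn from Uniform(0.1, 0.9)
J = 16
leniency = rng.uniform(0.1, 0.9, size=J)
df["judge"] = rng.permutation(np.arange(len(df)) % J)   # even, random caseloads

# Each judge releases the fraction of their caseload with the lowest COMPAS score
df["T"] = 0
for j in range(J):
    caseload = df[df["judge"] == j].sort_values("decile_score")
    released = caseload.index[: int(np.floor(leniency[j] * len(caseload)))]
    df.loc[released, "T"] = 1

# Y = 0 marks two-year recidivism; jailed subjects have their label set to Y = 1
df["Y"] = 1 - df["two_year_recid"]
df.loc[df["T"] == 0, "Y"] = 1

# Decision-maker M: logistic regression predicting two-year recidivism from
# categorised age, gender, number of priors and charge degree (felony/misdemeanour)
features = pd.get_dummies(df[["age_cat", "sex", "priors_count", "c_charge_degree"]])
train = df.sample(frac=0.5, random_state=0).index
M = LogisticRegression(max_iter=1000).fit(features.loc[train],
                                          df.loc[train, "two_year_recid"])
\end{verbatim}
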

\spara{Results} \textbf{TODO}...

\begin{table}
\begin{tabular}{@{}lll@{}}
\toprule
Setting 										& Contraction & Counterfactuals \\ \midrule
Default 										& 0.00555     & 0.00504         \\
100 cases per judge, $\datasize=5k$ 				& 0.00629     & 0.00473         \\
$\leniency \sim \text{Uniform}(0.1,~0.5)$ 			& N/A         & 0.01385         \\
Decision-maker \human random 					& 0.01535     & 0.00074         \\ 
Decision-maker \machine random 					& 0.03005          & 0.00458              \\
Lakkaraju's decision-maker \human \cite{lakkaraju2017selective} & 0.01187          & 0.00288            \\ \bottomrule
\end{tabular}
\caption{Comparison of mean absolute error w.r.t.\ true evaluation between contraction and the counterfactual-based method we have presented. The standard deviation of our method was considerably smaller and more stable. The table shows that our method can perform well despite violations of the assumptions (e.g., decision-maker \human giving random and non-informative decisions).}
\end{table}
\subsubsection*{Old Content}
\rcomment{ I presume MM's preferences were that the outcome would be from Bernoulli distribution and that the decisions would be independent. So, let's first explain those ways thoroughly and then mention what we changed as discussed.}

We experimented with synthetic data sets to examine accuracy, unbiasedness and robustness to violations of the assumptions. 

We sampled $N=7k$ samples of $X$, $Z$, and $W$ as independent standard Gaussians. 
\todo{Michael to Riku}{W does not appear in our model.}
We then drew the outcome $Y$ from a Bernoulli distribution with parameter $p = 1 - \invlogit(\beta_{_\obsFeaturesValue} \obsFeaturesValue+\beta_{_\unobservableValue} \unobservableValue+\beta_ww)$, so that $P(Y=0|X, Z, W) =  \invlogit(\beta_{_\obsFeaturesValue} \obsFeaturesValue+\beta_{_\unobservableValue} \unobservableValue+\beta_ww)$, where the coefficients for $X$, $Z$ and $W$ were set to $1$, $1$ and $0.2$ respectively. Then the leniency levels $R$ for each of the $M=14$ judges were assigned pairwise, so that each of the pairs had leniencies $0.1,~0.2,\ldots, 0.7$. 
\todo{Michael to Riku}{We have assumed all along that the outcome \outcome causally follows from leniency \leniency. So we cannot suddenly say that we assign leniency after we have generated the outcomes. Let's follow strictly our model definition. If what you describe above is equivalent to what we have in the model section, then let's simply say that we do things as in the model section. Otherwise, let's do a light re-arrangement of experiments so that we follow the model exactly. In any case, in this section we should not have any model description -- we should only say what model parameters we used for the previous defined model.}
The subjects were assigned randomly to the judges so each received $500$ subjects. The data was divided in half to form a training set and a test set. This process follows the suggestion of Lakkaraju et al. \cite{lakkaraju2017selective}. \acomment{Check before?}
The \emph{default} decision maker in the data predicts a subject's probability for recidivism to be $P(\decision = 0~|~\obsFeatures, \unobservable) = \invlogit(\beta_{_\obsFeaturesValue} \obsFeaturesValue+\beta_{_\unobservableValue} \unobservableValue)$. Each of the decision-makers is assigned a leniency value, and the decision is made by comparing the value of $P(\decision = 0~|~\obsFeatures, \unobservable)$ to the value of the inverse cumulative distribution function $F^{-1}_{P(\decision = 0~|~\obsFeatures, \unobservable)}(r)=F^{-1}(r)$. Now, if $F^{-1}(r) < P(\decision = 0~|~\obsFeatures, \unobservable)$, the subject is given a negative decision $\decision = 0$, and a positive decision otherwise. \rcomment{Needs double checking.} This ensures that the decisions are independent and that the fraction of positive decisions converges to $r$. The outcomes for which the decision was negative were then set to $0$.
We used a number of different decision mechanisms. A \emph{limited} decision-maker works as the default, but predicts the risk of a negative outcome using only the recorded features \obsFeatures, so that $P(\decision = 0~|~\obsFeatures) = \invlogit(\beta_{_\obsFeaturesValue} \obsFeaturesValue)$. Hence it is unable to observe $Z$. A \emph{biased} decision-maker works similarly to the default decision-maker, but the values of the observed features \obsFeatures seen by the decision-maker are altered. We modified the values so that if the value of \obsFeaturesValue was greater than $1$, it was multiplied by $0.75$ to induce more positive decisions. Similarly, if the subject's \obsFeaturesValue was in the interval $(-2,~-1)$, we added $0.5$ to induce more negative decisions. Additionally, the effect of non-informative decisions was investigated by deploying a \emph{random} decision-maker. Given leniency $R$, a random decision-maker gives a positive decision $T=1$ with probability $R$.
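
For clarity, a sketch of the alternative deciders, reusing the feature arrays, coefficients and the logistic function from the sketches of the synthetic setting above; only the distortion rules stated in the text are encoded.
\begin{verbatim}
def biased_view(x):
    # Distort the feature values as seen by the biased decision-maker.
    xb = x.copy()
    xb[x > 1] *= 0.75                   # induces more positive decisions
    xb[(x > -2) & (x < -1)] += 0.5      # induces more negative decisions
    return xb

p_limited = sigmoid(beta_x * x)                            # limited: cannot observe Z
p_biased = sigmoid(beta_x * biased_view(x) + beta_z * z)   # biased: distorted X
# Random decider: T = 1 with probability equal to the leniency R of the judge.
\end{verbatim}
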

In contrast, Lakkaraju et al. essentially order the subjects and assign $T=1$ to the fraction given by the leniency $R$. We see this as unrealistic: the decision on one subject should not depend on the decisions on other subjects. In our example this would induce unethical behaviour: a single judge would need to jail a defendant today in order to release a defendant tomorrow.
We treat the observations as independent, and the leniency is still a good estimate of the acceptance rate, since the acceptance rate converges to the leniency. 

\paragraph{Evaluators} 
	We deployed multiple evaluator modules to estimate the true failure rate of the decider module. The estimates should be close to those of the true evaluation evaluator module, and they will eventually be compared to the human evaluation curve. 
\begin{itemize}
	\item  \emph{True evaluation:} True evaluation depicts the true performance of a model. The estimate is computed by first sorting the subjects into descending order based on the predictions of the model. The true failure rate estimate is then computed directly from the outcome labels of the top $1-r\%$ of the subjects. True evaluation can only be computed for synthetic data sets, where the ground-truth labels are available for all subjects.
	%\item \emph{Human evaluation:} Human evaluation presents the performance of the decision-makers who observe the latent variable. Human evaluation curve is computed by binning the decision-makers with similar values of leniency into bins and then computing their failure rate from the ground truth labels. \rcomment{Not computing now.}
	\item \emph{Labeled outcomes:} The labeled outcomes algorithm is the conventional method of computing the failure rate. We proceed as in the true evaluation method, but use only the available outcome labels to estimate the failure rate.
	\item \emph{Contraction:} Contraction is an algorithm designed specifically to estimate the failure rate of a black-box predictive model under selective labeling. See previous section.
\end{itemize}

\paragraph{Results} We deployed the evaluators on the synthetic data set presented above; the results are shown in Figures \ref{fig:results_main} and \ref{fig:results_main_2}. The presented method can recover the true performance of a model for all levels of leniency. In the figures we see that the contraction algorithm can only estimate the true performance up to the leniency level of the most lenient decision-maker, whereas the proposed method can do so for arbitrary levels of leniency. To create comparable results, we also applied the batch decision-making mechanism presented by Lakkaraju et al. to a synthetic data set with $N=9k$ instances and $M=14$ judges with leniencies $0.1, \ldots, 0.9$. The mean absolute error of contraction ($0.00265$) was approximately $90\%$ higher than that of the presented method ($0.00139$).

Similar results were obtained from experiments with the aforementioned \emph{limited} \rcomment{Not done yet.}, \emph{biased} and \emph{random} deciders. The mean absolute errors for the estimates / other results are presented in .... This shows that the counterfactual-based imputation method is robust to changes in the data-generating mechanisms and therefore covers multiple scenarios. The results from the experiments with the biased decision-makers show that the proposed method can perform well despite the biased decisions of the current decision-makers. This is important because...
	
\begin{figure}
%\centering
\includegraphics[width=\linewidth]{./img/sl_results_independent_decisions}
\caption{Failure rate vs Acceptance rate with independent decisions -- comparison of the methods, error bars denote standard deviation of the estimate. Here we can see that the new proposed method (red) can recover the true failure rate more accurately than the contraction algorithm (blue). In addition, the new method can accurately track the \emph{true evaluation} curve (green) for all levels of leniency regardless of the leniency of the most lenient decision maker.}
\label{fig:results_main}
\end{figure}

\begin{figure}
%\centering
\includegraphics[width=\linewidth]{./img/sl_results_batch_decisions}
\caption{Failure rate vs Acceptance rate with batch decisions -- comparison of the methods, error bars denote standard deviation of the estimate. Here we can see that the new proposed method (red) can recover the true failure rate more accurately than the contraction algorithm (blue). In addition, the new method can accurately track the \emph{true evaluation} curve for all levels of leniency regardless of the leniency of the most lenient decision maker.}
\label{fig:results_main_2}
\end{figure}

\begin{figure}
%\centering
\includegraphics[width=\linewidth]{./img/sl_results_batch_decisions_error_figure}
\caption{Error w.r.t. True evaluation vs Acceptance rate, error bars denote standard deviations. }
\label{fig:results_main_err}
\end{figure}


\begin{figure}
%\centering
\includegraphics[width=\linewidth]{./img/sl_results_compas_error}
\caption{COMPAS data: Error w.r.t. True evaluation vs Acceptance rate, error bars denote standard deviations. (Preliminary figure) }
\label{fig:results_compas}
\end{figure}

\begin{itemize}
%\item COMPAS data set
%	\begin{itemize}
%	\item Size, availability, COMPAS scoring
%		\begin{itemize}
%		\item COMPAS general recidivism risk score is made to ,
%		\item The final data set comprises of 6172 subjects assessed at Broward county, California. The data was preprocessed to include only subjects assessed at the pretrial stage and (something about traffic charges).
%		\item Data was made available ProPublica.
%		\item Their analysis and results are presented in the original article "Machine Bias" in which they argue that the COMPAS metric assigns biased risk evaluations based on race.
%		\item Data includes the subjects' demographic information (incl. gender, age, race) and information on their previous offences. 
%		\end{itemize}
%	\item Subsequent modifications for analysis 
%		\begin{itemize}
%		\item We created 9 synthetic judges with leniencies 0.1, 0.2, ..., 0.9. 
%		\item Subjects were distributed to all the judges evenly and at random to enable comparison to contraction method
%		\item We employed similar decider module as explained in Lakkaraju's paper, input was the COMPAS Score 
%		\item As the COMPAS score is derived mainly from so it can be said to have external information available, not coded into the four above-mentioned variables. (quoted text copy-pasted from here)
%		\item Data was split to test and training sets
%		\item A logistic regression model was built to predict two-year recidivism from categorized age, gender, the number of priors, degree of crime COMPAS screened for (felony/misdemeanor)
%		\item We used these same variables as input to the CBI evaluator.
%		\end{itemize}
%	\item Results
%		\begin{itemize}
%		\item Results from this analysis are presented in figure X. In the figure we see that CBI follows the true evaluation curve very closely.
%		\item We can also deduce from the figure that if this predictive model was to be deployed, it wouldn't necessarily improve on the decisions made by these synthetic judges.
%		\end{itemize}
%	\end{itemize}
\item Catalonian data (this could just be for our method? Hide ~25\% of outcome labels and show that we can estimate the failure rate for ALL levels of leniency despite the leniency of this one judge is only 0.25) (2nd priority)
	\begin{itemize}
	\item Size, availability, RisCanvi scoring
	\item Subsequent modifications for analysis
	\item Results
	\end{itemize}
\end{itemize}