\documentclass[sigconf,anonymous]{acmart}
    % \documentclass[sigconf]{acmart}
    
    
    
    \usepackage{tikz}
    \usepackage{tikz-cd}
    \usetikzlibrary{arrows,automata, positioning}
    
    
    % Packages
    \usepackage{type1cm}     % type1 computer modern font
    \usepackage{graphicx}     % advanced figures
    \usepackage{xspace}     % fix space in macros
    \usepackage{balance}     % to better equalize the last page
    \usepackage{multirow}     % multi rows for tables
    \usepackage[font={bf}, tableposition=top]{caption}     % captions on top for tables
    \usepackage{bold-extra}     % bold + {small capital, italic}
    \usepackage{siunitx}          % \num for decimal grouping
    \usepackage[vlined,linesnumbered,ruled,noend]{algorithm2e}     % algorithms
    \usepackage{booktabs}     % nicer tables
    %\usepackage[hyphens]{url}     % handle long urls
    %\usepackage[bookmarks, pdftex, colorlinks=false]{hyperref}     % clickable references
    %\usepackage[square,numbers]{natbib}     % better references
    \usepackage{microtype}    % compress text
    \usepackage{units}     % nicer slanted fractions
    \usepackage{mathtools}     % amsmath++
    %\usepackage{amssymb}     % math symbols
    %\usepackage{amsmath}
    \usepackage{relsize}
    \usepackage{caption}
    \captionsetup{belowskip=6pt,aboveskip=2pt} % to save space.
    %\usepackage{subcaption}
    % \usepackage{multicolumn}
    \usepackage[]{inputenc}
    \usepackage{xfrac}
    \RequirePackage{graphicx,color}
    \usepackage[font={small}]{subfig} % subfig, 4 figures in a row
    \usepackage{pifont}
    \usepackage{footnote} % show footnotes in tables
    \makesavenoteenv{table}
    
    
    \newcommand{\acomment}[1]{{{\color{orange} [A: #1]}}}
    \newcommand{\rcomment}[1]{{{\color{red} [R: #1]}}}
    \newcommand{\mcomment}[1]{{{\color{blue} [M: #1]}}}
    
    \newtheorem{problem}{Problem}
    
    
    \newcommand{\ourtitle}{Evaluating Decision Makers over Selectively Labeled Data}
    
    
    \input{macros}
    
    \usepackage{chato-notes}
    
    
    
    \title{\ourtitle}
    
    \author{Michael Mathioudakis}
    \affiliation{%
      \institution{University of Helsinki}
      \city{Helsinki} 
      \country{Finland} 
    }
    \email{michael.mathioudakis@helsinki.fi}
    
    
    \begin{abstract}
    
    As an increasing number of decisions affecting people's lives are made by AI systems, automating the evaluation of such systems becomes increasingly important.
    %
    
One major challenge for evaluation is that the decisions themselves often skew the data on which the evaluation is performed.
    
    %
    % For example, when deciding whether a defendant should be granted bail or rather be led to jail, a decision is deemed successful if it grants bail to defendants who would honor the conditions of the bail and leads to jail ones who would violate them.
    %
    % However, in such cases, we are only able to directly evaluate the mechanism when it grants bail, while we cannot observe the potential bail violations by defendants who were led to jail. 
    %
    
For example, when a bank decides whether a customer should be granted a loan, a decision is deemed successful if it grants the loan to a customer who would honor its conditions, but not to one who would violate them.
    %
However, in such cases, we can only directly evaluate the decision to grant the loan, while we cannot observe whether customers who were not granted the loan would indeed have violated its conditions. 
    
    %
    
To evaluate the decision not to grant the loan, one approach is to infer the outcome in the hypothetical case that the loan had been granted.
    
    %
    
    In this paper, we develop a Bayesian approach towards this end that uses counterfactual-based imputation to infer unobserved outcomes.
    
    %
Compared to the previous state of the art, our approach estimates the quality of decisions more accurately and with lower variance. 
%
The approach is also shown to be robust to variations in the decision mechanisms that generated the data.
    
    %
    \mcomment{On one hand, since we use judicial data in our experiments, it makes sense to use the bail-or-jail case in the abstract. On the other hand, this does not connect with the motivation we provide to evaluate the decision of (computer/ML/AI) systems, since jail-or-bail decisions are not currently made by such systems. The bank loan example might look better in the abstract.}
    %
    
    \end{abstract}
    
    
    \begin{document}
    
    
    \fancyhead{}
    \maketitle
    
    \renewcommand{\shortauthors}{Authors}
    
    
    
    \input{introduction}
    
    \input{setting}
    
    \section{Counterfactual-Based Imputation For Selective Labels}
    
    %\acomment{This chapter should be our contributions. One discuss previous results we build over but one should consider putting them in the previous section.}
    
    \subsection{Causal Modeling}
    
    
    
We model the selective labels setting as summarized in Figure~\ref{fig:model}, following Lakkaraju et al.~\cite{lakkaraju2017selective}.
    
The outcome $Y$ is affected by the observed background factors $X$ and the unobserved background factors $Z$. These background factors also influence the decision $T$ taken in the data; hence $Z$ includes information that was used by the decision maker in the data but that is not available to us as observations.
There may also be other background factors that affect $Y$ but not $T$. In addition, we assume the decision is affected by the observed leniency level $R \in [0,1]$ of the decision maker.
    
    
We use a propensity score framework to model $X$ and $Z$: they are assumed to be continuous Gaussian variables, with the interpretation that they represent summarized risk factors, such that higher values denote a higher risk of a negative outcome ($Y=0$). The Gaussianity assumption is motivated by the central limit theorem.
    
    \acomment{Not sure if this is good to discuss here or in the next section: if we would like the next section be full of our contributions and not lakkarajus, we should place it here.}
    
    \subsection{Imputation}
    
    %\acomment{We need to start by noting that with a simple example how we assume this to work. If X indicates a safe subject that is jailed, then we know that (I dont know how this applies to other produces) that Z must have indicated a serious risk. This makes $Y=0$ more likely than what regression on $X$ suggests.} done by Riku!
    
    %\acomment{I do not understand what we are doing from this section. It needs to be described ASAP.}
    
    
    
Our approach is based on the fact that, in almost all cases, some information regarding the latent variable is recoverable. For illustration, let us consider defendant $i$ who has been given a negative decision $\decisionValue_i = 0$. If the defendant's observed features $\featuresValue_i$ indicate that this subject would be safe to release, we can deduce that the unobserved variable $\unobservableValue_i$ must have indicated high risk, since the defendant was nevertheless jailed. In turn, this makes $Y=0$ more likely than what would be predicted based on $\featuresValue_i$ alone.
In the opposite situation, where the features $\featuresValue_i$ clearly imply that the defendant is dangerous and the defendant is subsequently jailed, the decision carries little additional information about the latent variable.
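To make this concrete, the following minimal sketch (in Python; our own illustration with hypothetical coefficients, not part of the method) estimates $E[\unobservable~|~\decision = 0, \features = x]$ by simulation under a logistic decision model: conditioning on a negative decision shifts the latent risk upwards, and the shift is largest when the observed features indicate low risk.

\begin{verbatim}
import numpy as np
from scipy.special import expit  # inverse logit

rng = np.random.default_rng(0)
beta_xt, beta_zt = 1.0, 1.0      # hypothetical coefficients
z = rng.standard_normal(1_000_000)

def mean_z_given_negative(x):
    # P(T = 0 | x, z) under a logistic decision model
    p_negative = expit(beta_xt * x + beta_zt * z)
    jailed = rng.random(z.size) < p_negative   # T = 0
    return z[jailed].mean()                    # E[Z | T = 0, X = x]

print(mean_z_given_negative(-2.0))  # x looks safe: well above 0
print(mean_z_given_negative(+2.0))  # x looks risky: near the prior mean 0
\end{verbatim}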
    
    
    \acomment{Could emphasize the above with a plot, x and z in the axis and point styles indicating the decision.}
    
    
    \acomment{The above assumes that the decision maker in the data is good and not bad.}
    
    
In counterfactual-based imputation, we use counterfactual values of the outcome $\outcome_{\decisionValue=1}$ to impute the missing labels. The structural causal model (SCM) required to compute the counterfactuals is summarized in Figure~\ref{fig:model}. Using Stan, we model the observed data as 
    
\begin{align}
 \outcome ~|~\decision = 1, x & \sim \text{Bernoulli}(\invlogit(\alpha_y + \beta_{xy} x + \beta_{zy} z)) \nonumber \\
 \decision ~|~ j, x & \sim \text{Bernoulli}(\invlogit(\alpha_{j} + \beta_{xt} x + \beta_{zt} z)). \label{eq:data_model}
\end{align}
    \acomment{Are the coefs really the same in the model?}
    
    
That is, we fit one logistic regression model for the decisions, based on the observable features \features and the identity of the judge, using all of the data. The identity of the judge is encoded in the intercept $\alpha_j$; we use a different intercept for each judge. We model the observed outcomes (those with $\decision = 1$) with a separate logistic regression model, learning the coefficients $\beta_{xy}$ for the observed features, $\beta_{zy}$ for the unobserved features, the single intercept $\alpha_y$, and the values of the latent variable \unobservable.
    
Using the samples from the posterior distribution of all parameters, as given by Stan, we can estimate the values of the counterfactuals. Formally, the counterfactuals are drawn from the posterior predictive distribution
    
\[
p(\tilde{y}~|~y) = \int_\Omega p(\tilde{y}~|~\theta) \, p(\theta~|~y) \, d\theta.
\]
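Given $S$ draws $\theta^{(1)}, \ldots, \theta^{(S)}$ from the posterior, this integral is approximated by the standard Monte Carlo average
\[
p(\tilde{y}~|~y) \approx \frac{1}{S} \sum_{s=1}^{S} p(\tilde{y}~|~\theta^{(s)}), \qquad \theta^{(s)} \sim p(\theta~|~y).
\]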
    
In practice, once we have run Stan, we have $S$ samples of all model parameters from the posterior distribution $p(\theta~|~y)$, i.e., the probability of the parameters given the data. We then use these values to sample probable outcomes for the missing values. Consider an observation $i$ for which the outcome $\outcomeValue_i$ is missing: from Stan we obtain $S$ samples of the coefficients, the intercepts and $\unobservableValue_i$, characterizing their joint posterior distribution. Plugging each of these parameter samples into the outcome model on the first line of Equation~\ref{eq:data_model}, we draw counterfactual values for the outcome from $y_{i, \decisionValue=1}  \sim \text{Bernoulli}(\invlogit(\alpha_y + \beta_{xy} x_i + \beta_{zy} z_i))$. In essence, we use the sampled parameter values from the posterior to sample new values for the missing outcomes. As we have $S$ ``guesses'' for each missing outcome, we compute the failure rate separately for each set of guesses and report their mean.
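As a minimal sketch of this imputation step (in Python; we assume the posterior draws have already been extracted from Stan into NumPy arrays, and the function and variable names are ours):

\begin{verbatim}
import numpy as np
from scipy.special import expit

def impute_counterfactuals(alpha_y, beta_xy, beta_zy, z, x, y, t, rng):
    # alpha_y, beta_xy, beta_zy: (S,) arrays of posterior draws;
    # z: (S, N) posterior draws of the latent variables;
    # x, y, t: length-N data arrays, y unobserved where t == 0.
    S, N = z.shape
    y_tilde = np.tile(y.astype(float), (S, 1))  # S copies of the outcomes
    p1 = expit(alpha_y[:, None]                 # P(Y = 1 | T = 1, x, z)
               + beta_xy[:, None] * x[None, :]
               + beta_zy[:, None] * z)
    draws = (rng.random((S, N)) < p1).astype(float)
    y_tilde[:, t == 0] = draws[:, t == 0]       # impute only missing labels
    return y_tilde                              # S "guesses" per subject
\end{verbatim}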
    
    %\begin{itemize}
    %\item Theory \\ (Present here (1) what counterfactuals are, (2) motivation for structural equations, (3) an example or other more easily approachable explanation of applying them, (4) why we used computational methods)
    %	\begin{itemize}
    %	\item Counterfactuals are 
    %		\begin{itemize}
    %		\item hypothesized quantities that encode the would-have-been relation of the outcome and the treatment assignment.
    %		\item Using counterfactuals, we can discuss hypothetical events that didn't happen. 
    %		\item Using counterfactuals requires defining a structural causal model.
    %		\item Pearl's Book of Why: "The fundamental problem"
    %		\end{itemize}
    %	\item By defining structural equations / a graph
    %		\begin{itemize}
    %		\item we can begin formulating causal questions to get answers to our questions.
    %		\item Once we have defined the equations, counterfactuals are obtained by... (abduction, action, prediction, don't we apply the do operator on the \decision, so that we obtain $\outcome_{\decision=1}(x)$?)
    %		\item We denote the counterfactual "Y had been y had T been t" with...
    %		\item By first estimating the distribution of the latent variable Z we can impose 
    %		\item Now counterfactuals can be defined as
    %			\begin{definition}[Unit-level counterfactuals \cite{pearl2010introduction}]
    %			Let $M$ be a structural model and $M_x$ a modified version of $M$, with the equation(s) of $X$ replaced by $X = x$. Denote the solution for $Y$ in the equations of $M_x$ by the symbol $Y_{M_x}(u)$. The counterfactual $Y_x(u)$ (Read: "The value of Y in unit u, had X been x") is given by:
    %			\begin{equation} \label{eq:counterfactual}
    %				Y_x(u) := Y_{M_x}(u)
    %			\end{equation}
    %			\end{definition}
    %		\end{itemize}
    %	\item In a high level
    %		\begin{itemize}
    %		\item there is usually some data recoverable from the unobservables. For example, if the observable attributes are contrary to the outcome/decision we can claim that the latent variable included some significant information.
    %		\item We retrieve this information using the prespecified structural equations. After estimating the desired parameters, we can estimate the value of the counterfactual (not observed) outcome by switching the value of \decision and doing the computations through the rest of the graph...
    %		\end{itemize}
    %	\item Because the causal effect of \decision to \outcome is not identifiable, we used a Bayesian approach
    %	\item Recent advances in the computational methods provide us with ways of inferring the value of the latent variable by applying Bayesian techniques to... Previously this kind of analysis required us to define X and compute Y...
    %\end{itemize}
    %\item Model (Structure, equations in a general and more specified level, assumptions, how we construct the counterfactual...) 
    %	\begin{itemize}
    %	\item Structure is as is in the diagram. Square around Z represents that it's unobservable/latent.
    %	The features of the subjects include observable and -- possibly -- unobservable features, denoted with X and Z respectively. The only feature of a decider is their leniency R (depicting some baseline probability of a positive decision). The decisions given will be denoted with T and the resulting outcomes with Y, where 0 stands for negative outcome or decision and 1 for positive.
    %	\item The causal diagram presents how decision T is affected by the decider's leniency (R), the subject's observable private features (X) and the latent information regarding the subject's tendency for a negative outcome (Z). Correspondingly the outcome (Y) is affected only by the decision T and the above-mentioned features X and Z. 
    %	\item The causal directions and implied independencies are readable from the diagram. We assume X and Z to be independent.
    %	\item The structural equations connecting the variables can be formalized in a general level as (see Jung)
    %		\begin{align} \label{eq:structural_equations}
    %		\outcome_0 & = NA \\ \nonumber
    %		\outcome_1 & \sim f(\featuresValue, \unobservableValue; \beta_{\featuresValue\outcomeValue}, \beta_{\unobservableValue\outcomeValue}) \\ \nonumber
    %		\decision      & \sim g(\featuresValue, \unobservableValue; \beta_{\featuresValue\decisionValue}, \beta_{\unobservableValue\decisionValue}, \alpha_j) \\ \nonumber
    %		\outcome & =\outcome_\decisionValue \\ \nonumber
    %		\end{align}
    %	where the beta and alpha coefficients are the path coefficients specified in the causal diagram
    %	\item This general formulation of the selective labels problem enables the use of this approach even when the outcome is not binary. Notably this approach -- compared to that of Jung et al. -- explicates the selective labels issue to the structural equations when we deterministically set the value of outcome y to be one in the event of a negative decision. In addition, we allow the judges to differ in the baseline probabilities for positive decisions, which is by definition leniency.
    %	\item Now by imposing a value for the decision \decision we can obtain the counterfactual by simply assigning the desired value to the equations in \ref{eq:structural_equations}. This assumes that... (Consistency constraint) Now we want to know {\it what would have been the outcome \outcome for this individual \featuresValue had the decision been $\decision = 1$, or more specifically $\outcome_{\decision = 1}(\featuresValue)$}.
    %	\item To compute the value for the counterfactuals, we need to obtain estimates for the coefficients and latent variables. We specified a Bayesian (/structural) model, which requires establishing a set of probabilistic expressions connecting the observed quantities to the parameters of interest. The relationships of the variables and coefficients are presented in equation \ref{eq:structural_equations} and figure X in a general level. We modelled the observed data as  
    %%		\begin{align} \label{eq:data_model}
    %%		 y(1) & \sim \text{Bernoulli}(\invlogit(\beta_{xy} x + \beta_{zy} z)) \\ \nonumber
    %%		 t & \sim \text{Bernoulli}(\invlogit(\alpha_{j} + \beta_{xt} x + \beta_{zt}z)). \\ \nonumber
    %%		\end{align}
    %	\item Bayesian models also require the specification of prior distributions for the variables of interest to obtain an estimate of their distribution after observations, the posterior distribution.
    %	\item Identifiability of models with unobserved confounding has been discussed by eg McCandless et al and Gelman. As by Gelman we note that scale-invariance has been tackled with specifying the priors.  (?)
    %	\item Specify, motivate and explain priors here if space.
    %	\end{itemize}
    %\item Computation (Stan in general, ...)
    %	\begin{itemize}
    %	\item Using the model specified in equation \ref{eq:data_model}, we used Stan to estimate the intercepts, path coefficients and latent variables. Stan provides tools for efficient computational estimates of posterior distributions.  Stan uses No-U-Turn Sampling (NUTS), an extension of Hamiltonian Monte Carlo (HMC) algorithm, to computationally estimate the posterior distribution for inferences. (In a high level, the sampler utilizes the gradient of the posterior to compute potential and kinetic energy of an object in the multi-dimensional surface of the posterior to draw samples from it.) Stan also has implementations of black-box variational inference algorithms and direct optimization algorithms for the posterior distribution but they were deemed to be insufficient for estimating the posterior in this setting
    %	\item Chain lengths were set to X and number of chains deployed was Y. (Explain algorithm fully later)
    %	\end{itemize}
    %\end{itemize}
    
    \begin{figure}
        \begin{tikzpicture}[->,>=stealth',node distance=1.5cm, semithick]
    
      \tikzstyle{every state}=[fill=none,draw=black,text=black]
    
      \node[state] (R)                    {$R$};
      \node[state] (X) [right of=R] {$X$};
      \node[state] (T) [below of=X] {$T$};
      \node[state] (Z) [rectangle, right of=X] {$Z$};
      \node[state] (Y) [below of=Z] {$Y$};
    
      \path (R) edge (T)
            (X) edge (T)
    	     edge (Y)
            (Z) edge (T)
    	     edge (Y)
            (T) edge (Y);
    \end{tikzpicture}
\caption{$R$ is the leniency of the decision maker, $T$ is a binary decision, and $Y$ is the outcome, which is selectively labeled. Background features $X$ of a subject affect both the decision and the outcome. Additional background features $Z$ are visible only to the decision maker.}\label{fig:model}
    \end{figure}
    
    
    \begin{algorithm}
    	%\item Potential outcomes / CBI \acomment{Put this in section 3? Algorithm box with these?}
    
    \DontPrintSemicolon
    \KwIn{Test data set $\dataset = \{x, j, t, y\}$, acceptance rate $r$} 
    \KwOut{Failure rate at acceptance rate $r$} 
Using Stan, draw $S$ samples of all the parameters from the posterior distribution defined by Equation~\ref{eq:data_model}. Every item of the vector \unobservableValue is treated as a parameter.\;
    \For{i in $1, \ldots, S$}{
    	\For{j in $1, \ldots, \datasize$}{
		Draw a new outcome $\tilde{\outcome}_{j}$ from $\text{Bernoulli}(\invlogit(\alpha_{y}[i] + \beta_{xy}[i] x_j + \beta_{zy}[i] z[i, j]))$\;
    	}
    	Impute missing values using outcomes drawn in the previous step.\;
    	Sort the observations in ascending order based on the predictions of the predictive model.\;
	Estimate the failure rate as $\frac{1}{\datasize}\sum_{k=1}^{\datasize\cdot r} \indicator{\outcomeValue_k=0}$ and append it to $\mathcal{U}$.\;
    }
    \Return{Mean of $\mathcal{U}$.}
    
    \caption{Counterfactual-based imputation}	\end{algorithm}
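A sketch of the evaluation loop of the algorithm above (in Python, continuing the previous sketch; \texttt{risk\_scores} stands for the predictions of the evaluated model, higher meaning riskier):

\begin{verbatim}
import numpy as np

def failure_rate_cbi(y_tilde, risk_scores, r):
    # y_tilde: (S, N) imputed outcome vectors; r: acceptance rate
    S, N = y_tilde.shape
    order = np.argsort(risk_scores)     # ascending: least risky first
    released = order[:int(r * N)]       # subjects the model would release
    fr = (y_tilde[:, released] == 0).sum(axis=1) / N
    return fr.mean()                    # mean failure rate over S draws
\end{verbatim}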
    
    %\section{Extension To Non-Linearity (2nd priority)}
    
    
    
    % If X has multiple dimensions or the relationships between the features and the outcomes are clearly non-linear the presented approach can be extended to accomodate non-lineairty. Jung proposed that... Groups... etc etc.
    
    
    \section{Related work}
    
    Discuss this: \cite{DBLP:conf/icml/Kusner0LS19}
    
    
    \begin{itemize}
    \item Lakkaraju and contraction. \cite{lakkaraju2017selective}
    
    	\item Contraction
    		\begin{itemize}
    		\item Algorithm by Lakkaraju et al. Assumes that the subjects are assigned to the judges at random and requires that the judges differ in leniency. 
		\item Can estimate the true failure rate only up to the leniency of the most lenient decision-maker.
    		\item Performance is affected by the number of people judged by the most lenient decision-maker, the agreement rate and the leniency of the most lenient decision-maker. (Performance is guaranteed / better when ...)
    		\item Works only on binary outcomes
    		\item (We show that our method isn't constrained by any of these)
		\item A sketch of the algorithm is given below, after this list.
    %\begin{algorithm}[] 			% enter the algorithm environment
    %\caption{Contraction algorithm \cite{lakkaraju17}} 		% give the algorithm a caption
    %\label{alg:contraction} 			% and a label for \ref{} commands later in the document
    %\begin{algorithmic}[1] 		% enter the algorithmic environment
    %\REQUIRE Labeled test data $\D$ with probabilities $\s$ and \emph{missing outcome labels} for observations with $T=0$, acceptance rate r
    %\ENSURE
    %\STATE Let $q$ be the decision-maker with highest acceptance rate in $\D$.
    %\STATE $\D_q = \{(x, j, t, y) \in \D|j=q\}$
    %\STATE \hskip3.0em $\rhd$ $\D_q$ is the set of all observations judged by $q$
    %\STATE
    %\STATE $\RR_q = \{(x, j, t, y) \in \D_q|t=1\}$
    %\STATE \hskip3.0em $\rhd$ $\RR_q$ is the set of observations in $\D_q$ with observed outcome labels
    %\STATE
    %\STATE Sort observations in $\RR_q$ in descending order of confidence scores $\s$ and assign to $\RR_q^{sort}$.
    %\STATE \hskip3.0em $\rhd$ Observations deemed as high risk by the black-box model $\mathcal{B}$ are at the top of this list
    %\STATE
    %\STATE Remove the top $[(1.0-r)|\D_q |]-[|\D_q |-|\RR_q |]$ observations of $\RR_q^{sort}$ and call this list $\mathcal{R_B}$
    %\STATE \hskip3.0em $\rhd$ $\mathcal{R_B}$ is the list of observations assigned to $t = 1$ by $\mathcal{B}$
    %\STATE
    %\STATE Compute $\mathbf{u}=\sum_{i=1}^{|\mathcal{R_B}|} \dfrac{\delta\{y_i=0\}}{| \D_q |}$.
    %\RETURN $\mathbf{u}$
    %\end{algorithmic}
    %\end{algorithm}
    		\end{itemize}
    
    \item Counterfactuals/Potential outcomes. \cite{pearl2010introduction} (also Rubin)
    \item Approach of Jung et al for optimal policy construction. \cite{jung2018algorithmic}
    \item Discussions of latent confounders in multiple contexts.
\item Imputation methods and other approaches to selective labels, e.g., \cite{dearteaga2018learning}
    \end{itemize}
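For reference, a compact sketch of the contraction estimator (in Python; our own transcription of the algorithm of Lakkaraju et al.~\cite{lakkaraju2017selective}, with variable names of our choosing):

\begin{verbatim}
import numpy as np

def contraction(judge_ids, t, y, scores, r):
    # Failure rate at acceptance rate r; scores: higher = riskier
    judges = np.unique(judge_ids)
    acc = np.array([t[judge_ids == j].mean() for j in judges])
    q = judges[np.argmax(acc)]         # most lenient decision-maker

    in_q = judge_ids == q              # D_q: all cases judged by q
    released = in_q & (t == 1)         # R_q: cases with observed labels
    n_dq, n_rq = in_q.sum(), released.sum()

    # Drop the riskiest labeled cases so that a fraction r of D_q
    # remains released, mimicking the evaluated model's decisions
    order = np.argsort(-scores[released])
    n_remove = int((1.0 - r) * n_dq) - (n_dq - n_rq)
    kept = np.where(released)[0][order][max(n_remove, 0):]
    return (y[kept] == 0).sum() / n_dq
\end{verbatim}

As the sketch makes explicit, the estimate is informative only when the acceptance rate $r$ does not exceed the leniency of the most lenient decision-maker (otherwise \texttt{n\_remove} would be negative).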
    
    \section{Experiments}
    
In this section, we present results from experiments with synthetic and realistic data. We show that our approach provides the most accurate estimates of the performance of a predictive model at all levels of leniency.
    
    \subsection{Synthetic data}
    
    \rcomment{ I presume MM's preferences were that the outcome would be from Bernoulli distribution and that the decisions would be independent. So, let's first explain those ways thoroughly and then mention what we changed as discussed.}
    
    
    
We experimented with synthetic data sets to examine the accuracy and unbiasedness of our approach, and its robustness to violations of the assumptions. 
    
    
We sampled $N=7000$ observations of $X$, $Z$, and $W$ as independent standard Gaussians. We then drew the outcome $Y$ from a Bernoulli distribution with parameter $p = 1 - \invlogit(\beta_xx+\beta_zz+\beta_ww)$, so that $P(Y=0~|~X, Z, W) = \invlogit(\beta_xx+\beta_zz+\beta_ww)$, where the coefficients of $X$, $Z$ and $W$ were set to $1$, $1$ and $0.2$, respectively. The leniency levels $R$ of the $M=14$ judges were then assigned pairwise, so that the judges had leniencies $0.1,~0.2,\ldots, 0.7$. The subjects were assigned to the judges at random, so that each judge received $500$ subjects. The data was divided in half to form a training set and a test set. This process follows the suggestion of Lakkaraju et al.~\cite{lakkaraju2017selective}. \acomment{Check before?}
    
The \emph{default} decision maker in the data predicts a subject's probability of recidivism as $P(\decision = 0~|~\features, \unobservable) = \invlogit(\beta_xx+\beta_zz)$. Each decision-maker is assigned a leniency value $r$, and the decision is made by comparing $P(\decision = 0~|~\features, \unobservable)$ to the $r$-quantile $F^{-1}(r)$ of the distribution of this risk score: if $F^{-1}(r) < P(\decision = 0~|~\features, \unobservable)$, the subject is given a negative decision $\decision = 0$, and a positive one otherwise. \rcomment{Needs double checking.} This ensures that the decisions are independent and that the acceptance rate stochastically converges to $r$. Finally, the outcomes for which the decision was negative were set to $0$.
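The generation of the synthetic data and of the default decisions can be sketched as follows (in Python; a condensed illustration under our reading of the thresholding rule, where the quantile has the closed form $F^{-1}(r) = \invlogit(\sqrt{2}\,\Phi^{-1}(r))$ because $X + Z \sim N(0, 2)$):

\begin{verbatim}
import numpy as np
from scipy.special import expit
from scipy.stats import norm

rng = np.random.default_rng(0)
N, M = 7000, 14
beta_x, beta_z, beta_w = 1.0, 1.0, 0.2

x, z, w = rng.standard_normal((3, N))
p_fail = expit(beta_x * x + beta_z * z + beta_w * w)  # P(Y = 0 | X, Z, W)
y = (rng.random(N) >= p_fail).astype(int)

# Judges in pairs with leniencies 0.1, ..., 0.7; 500 subjects each
leniency = np.repeat(np.arange(0.1, 0.71, 0.1), 2)
judge = rng.permutation(np.repeat(np.arange(M), N // M))
r = leniency[judge]

# Default decider: negative decision iff risk exceeds F^{-1}(r)
risk = expit(beta_x * x + beta_z * z)      # P(T = 0 | X, Z)
threshold = expit(np.sqrt(2.0) * norm.ppf(r))
t = (risk <= threshold).astype(int)        # T = 1: positive decision
y_obs = np.where(t == 1, y, 0)             # selective labeling
\end{verbatim}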
    
We used a number of different decision mechanisms. A \emph{limited} decision-maker works like the default one, but predicts the risk of a negative outcome using only the recorded features \features, so that $P(\decision = 0~|~\features) = \invlogit(\beta_xx)$; hence it is unable to observe $Z$. A \emph{biased} decision-maker also works like the default one, but the values of the observed features \features seen by the decision-maker are altered: if the value of \featuresValue lay in the interval .. it was multiplied by $0.75$ to induce more positive decisions, and similarly, if the subject's \featuresValue was in the .. we deducted ... to induce more negative decisions. Additionally, the effect of non-informative decisions was investigated by deploying a \emph{random} decision-maker: given leniency $R$, a random decision-maker gives a positive decision $T=1$ with probability $R$.
    
    
In contrast, Lakkaraju et al. essentially order the subjects and assign $T=1$ to the fraction given by the leniency $R$. We see this as unrealistic: the decision on one subject should not depend on the decisions on other subjects. In the bail example, this would induce unethical behaviour: a judge would need to jail one defendant today in order to release another tomorrow.
We instead treat the observations as independent; leniency is still a good estimate of the acceptance rate, since the acceptance rate converges to the leniency. \acomment{As a reviewer I would perhaps ask to see the results for the Lakkaraju mechanism.}
    
     
     %This is a decider module. We experimented with different combinations of decider and data generating modules to show X / see Y. (to see that our method is robust against non-informative, biased and bad decisions . Due to space constraints we defer these results...)
    
    \paragraph{Evaluators} 
    
	We deployed multiple evaluator modules to estimate the true failure rate of the decider module. A good evaluator's estimates should be close to those of the true evaluation module; the estimates will eventually also be compared to the human evaluation curve. 
    
    \begin{itemize}
    
	\item  \emph{True evaluation:} True evaluation depicts the true performance of a model. The estimate is computed by first sorting the subjects by the predictions of the model and then computing the failure rate directly from the ground-truth outcome labels of the fraction $r$ of subjects with the lowest predicted risk. True evaluation can only be computed for synthetic data sets, where all ground-truth labels are available.
    
    	%\item \emph{Human evaluation:} Human evaluation presents the performance of the decision-makers who observe the latent variable. Human evaluation curve is computed by binning the decision-makers with similar values of leniency into bins and then computing their failure rate from the ground truth labels. \rcomment{Not computing now.}
    
	\item \emph{Labeled outcomes:} The labeled outcomes algorithm is the conventional method of computing the failure rate: we proceed as in the true evaluation method, but use only the observed outcome labels to estimate the failure rate (see the sketch after this list).
	\item \emph{Contraction:} Contraction is an algorithm designed specifically to estimate the failure rate of a black-box predictive model under selective labeling; see the previous section.
    
    \end{itemize}
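For completeness, the two baseline evaluators amount to a few lines each (a sketch in Python; \texttt{y\_true} is available only for synthetic data, and we adopt one natural reading of the labeled outcomes method, dividing by the total number of subjects as in the true evaluation):

\begin{verbatim}
import numpy as np

def true_evaluation(y_true, scores, r):
    # Ground-truth failure rate among the r fraction the model releases
    released = np.argsort(scores)[:int(r * len(scores))]
    return (y_true[released] == 0).sum() / len(scores)

def labeled_outcomes(y_obs, t, scores, r):
    # Same, but using only the selectively observed labels (t == 1)
    released = np.argsort(scores)[:int(r * len(scores))]
    released = released[t[released] == 1]
    return (y_obs[released] == 0).sum() / len(scores)
\end{verbatim}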
    
\paragraph{Results} We deployed the evaluators on the synthetic data set presented above; the results are shown in Figure \ref{fig:results_main}. The presented method can recover the true performance of a model at all levels of leniency. The figure shows how the contraction algorithm can only estimate the true performance up to the leniency of the most lenient decision-maker. (The mean absolute errors on leniency levels from $0.1$ to $0.6$ were $0.007605$ for contraction and $0.001912$ for our method; our error is approximately $75\%$ smaller.)
    
    
    
    (Target for this section from problem formulation: show that our evaluator is unbiased/accurate (show mean absolute error), robust to changes in data generation (some table perhaps, at least should discuss situations when the decisions are bad/biased/random = non-informative or misleading), also if the decider in the modelling step is bad and its information is used as input, what happens.)
    
    	
    \begin{figure}
    %\centering
    \includegraphics[width=\linewidth]{./img/sl_results_independent_decisions}
    \caption{Failure rate vs Acceptance rate with independent decisions -- comparison of the methods, error bars denote standard deviation of the estimate. Here we can see that the new proposed method (red) can recover the true failure rate more accurately than the contraction algorithm (blue). In addition, the new method can accurately track the \emph{true evaluation} curve (green) for all levels of leniency regardless of the leniency of the most lenient decision maker.}
    \label{fig:results_main}
    \end{figure}
    
    \begin{figure}
    %\centering
    \includegraphics[width=\linewidth]{./img/sl_results_batch_decisions}
    \caption{Failure rate vs Acceptance rate with batch decisions -- comparison of the methods, error bars denote standard deviation of the estimate. Here we can see that the new proposed method (red) can recover the true failure rate more accurately than the contraction algorithm (blue). In addition, the new method can accurately track the \emph{true evaluation} curve for all levels of leniency regardless of the leniency of the most lenient decision maker. \rcomment{Contraction at 0.7 is a bug. Standard deviations are in the order $0.003$ so their bars are quite tiny.}}
\label{fig:results_batch}
    \end{figure}
    
    
    \subsection{Realistic data}
    In this section we present results from experiments with (realistic) data sets. 
    
    
    \subsubsection{Analysis on COMPAS data}
    
COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) is Northpointe's (now operating under a different name) tool for guiding decisions in the criminal justice system. The COMPAS tool provides judges with risk estimates of the probability of recidivism and of failure to appear. The COMPAS score is mainly derived from ``prior criminal history, criminal associates, drug involvement, and early indicators of juvenile delinquency problems'', and it predicts recidivism over the following two years. Use of the COMPAS score as the sole basis for judgement has been prohibited by law; judges must base their decisions on other factors as well. 
    
    
The COMPAS data set contains recidivism data from Broward County, Florida, USA. The data set was preprocessed by ProPublica for their article ``Machine Bias''. The original data contained information about \num{18610} defendants who were given a COMPAS score during 2013 or 2014. After removing defendants who were not assessed at the pretrial stage, \num{11757} defendants were left. Additionally, defendants for whom the COMPAS score could not be matched with a corresponding charge were removed from the analysis, resulting in a data set of \num{7214} observations. Following ProPublica's reasoning, after final data cleaning we were left with \num{6172} offences. The data includes the subjects' demographic information, such as gender, age and race, and information on their previous offences.
    
For the analysis, we created $9$ synthetic judges with leniencies $0.1, 0.2, \ldots, 0.9$. All subjects were distributed to the judges evenly and at random. In this semi-synthetic scenario, each judge bases their decisions on the COMPAS score, releasing the fraction of defendants given by their leniency. Those who were given a negative decision had their outcome label hidden. The data was then split into training and test sets, and a logistic regression model was built to predict two-year recidivism from categorised age, gender, the number of prior offences, and the degree of crime COMPAS screened for (felony or misdemeanour). We experimented with other models, but the results remained the same. These same features were used as input to the counterfactual imputation method.
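The semi-synthetic deciders can be sketched as follows (in Python; \texttt{compas\_score} stands for the COMPAS risk score column as a NumPy array, and the function name is ours):

\begin{verbatim}
import numpy as np

def synthetic_judges(compas_score, rng, n_judges=9):
    # Each judge releases the fraction of their caseload with the
    # lowest COMPAS scores given by their leniency 0.1, ..., 0.9
    n = len(compas_score)
    leniency = np.arange(1, n_judges + 1) / 10.0
    judge = rng.permutation(np.arange(n) % n_judges)
    t = np.zeros(n, dtype=int)
    for j in range(n_judges):
        idx = np.where(judge == j)[0]
        k = int(leniency[j] * len(idx))
        t[idx[np.argsort(compas_score[idx])[:k]]] = 1  # release k safest
    return judge, t
\end{verbatim}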
    
    \begin{itemize}
    
    %\item COMPAS data set
    %	\begin{itemize}
    %	\item Size, availability, COMPAS scoring
    %		\begin{itemize}
    %		\item COMPAS general recidivism risk score is made to ,
    %		\item The final data set comprises of 6172 subjects assessed at Broward county, California. The data was preprocessed to include only subjects assessed at the pretrial stage and (something about traffic charges).
    %		\item Data was made available ProPublica.
    %		\item Their analysis and results are presented in the original article "Machine Bias" in which they argue that the COMPAS metric assigns biased risk evaluations based on race.
    %		\item Data includes the subjects' demographic information (incl. gender, age, race) and information on their previous offences. 
    %		\end{itemize}
    %	\item Subsequent modifications for analysis 
    %		\begin{itemize}
    %		\item We created 9 synthetic judges with leniencies 0.1, 0.2, ..., 0.9. 
    %		\item Subjects were distributed to all the judges evenly and at random to enable comparison to contraction method
    %		\item We employed similar decider module as explained in Lakkaraju's paper, input was the COMPAS Score 
    %		\item As the COMPAS score is derived mainly from so it can be said to have external information available, not coded into the four above-mentioned variables. (quoted text copy-pasted from here)
    %		\item Data was split to test and training sets
    %		\item A logistic regression model was built to predict two-year recidivism from categorized age, gender, the number of priors, degree of crime COMPAS screened for (felony/misdemeanor)
    %		\item We used these same variables as input to the CBI evaluator.
    %		\end{itemize}
    %	\item Results
    %		\begin{itemize}
    %		\item Results from this analysis are presented in figure X. In the figure we see that CBI follows the true evaluation curve very closely.
    %		\item We can also deduce from the figure that if this predictive model was to be deployed, it wouldn't necessarily improve on the decisions made by these synthetic judges.
    %		\end{itemize}
    %	\end{itemize}
    
    \item Catalonian data (this could just be for our method? Hide ~25\% of outcome labels and show that we can estimate the failure rate for ALL levels of leniency despite the leniency of this one judge is only 0.25) (2nd priority)
    	\begin{itemize}
    	\item Size, availability, RisCanvi scoring
    	\item Subsequent modifications for analysis
    	\item Results
    	\end{itemize}
    \end{itemize}
    
    \section{Discussion}
    
    \begin{itemize}
    \item Conclusions 
    \item Future work / Impact
    \end{itemize}
    
    
    % \textbf{Acknowledgments.}
    %The computational resources must be mentioned. 
    
    
    %\clearpage
    % \balance
    \bibliographystyle{ACM-Reference-Format}
    \bibliography{biblio}
    %\balancecolumns % GM June 2007
    
    \end{document}