
\documentclass[sigconf,anonymous]{acmart}
% \documentclass[sigconf]{acmart}


\usepackage{tikz}
\usepackage{tikz-cd}
\usetikzlibrary{arrows,automata, positioning}

% Packages
\usepackage{type1cm}     % type1 computer modern font
\usepackage{graphicx}     % advanced figures
\usepackage{xspace}     % fix space in macros
\usepackage{balance}     % to better equalize the last page
\usepackage{multirow}     % multi rows for tables
\usepackage[font={bf}, tableposition=top]{caption}     % captions on top for tables
\usepackage{bold-extra}     % bold + {small capital, italic}
\usepackage{siunitx}          % \num for decimal grouping
\usepackage[vlined,linesnumbered,ruled,noend]{algorithm2e}     % algorithms
\usepackage{booktabs}     % nicer tables
%\usepackage[hyphens]{url}     % handle long urls
%\usepackage[bookmarks, pdftex, colorlinks=false]{hyperref}     % clickable references
%\usepackage[square,numbers]{natbib}     % better references
\usepackage{microtype}    % compress text
\usepackage{units}     % nicer slanted fractions
\usepackage{mathtools}     % amsmath++
%\usepackage{amssymb}     % math symbols
%\usepackage{amsmath}
\usepackage{relsize}
\usepackage{caption}
\captionsetup{belowskip=6pt,aboveskip=2pt} % to save space.
%\usepackage{subcaption}
% \usepackage{multicolumn}
\usepackage[]{inputenc}
\usepackage{xfrac}
\RequirePackage{graphicx,color}
\usepackage[font={small}]{subfig} % subfig, 4 figures in a row
\usepackage{pifont}
\usepackage{footnote} % show footnotes in tables
\makesavenoteenv{table}

\newcommand{\acomment}[1]{{{\color{orange} [A: #1]}}}
\newcommand{\rcomment}[1]{{{\color{red} [R: #1]}}}
\newcommand{\mcomment}[1]{{{\color{blue} [M: #1]}}}

\newtheorem{problem}{Problem}

\newcommand{\ourtitle}{Evaluating Decision Makers over Selectively Labeled Data}

\input{macros}
\usepackage{chato-notes}


\title{\ourtitle}

\author{Michael Mathioudakis}
\affiliation{%
  \institution{University of Helsinki}
  \city{Helsinki} 
  \country{Finland} 
}
\email{michael.mathioudakis@helsinki.fi}


\begin{abstract}
As AI systems make a growing number of decisions that affect people's lives, automating the evaluation of such systems becomes increasingly important.
%
One major challenge for evaluation is that often decisions skew the data on which the evaluation is performed. 
%
% For example, when deciding whether a defendant should be granted bail or rather be led to jail, a decision is deemed successful if it grants bail to defendants who would honor the conditions of the bail and leads to jail ones who would violate them.
%
% However, in such cases, we are only able to directly evaluate the mechanism when it grants bail, while we cannot observe the potential bail violations by defendants who were led to jail. 
%
For example, when a bank decides whether to grant a customer a loan, a decision is deemed successful if it grants a loan to a customer who would honor its conditions, but not to one who would violate them.
%
However, in such cases, we are only able to directly evaluate the decision to grant the loan, while we cannot observe whether customers who were not granted the loan would indeed violate its conditions. 
%
To evaluate the decision not to grant the loan, one approach is to infer the outcome in the hypothetical case that the loan were granted.
%
In this paper, we develop a Bayesian approach towards this end that uses counterfactual-based imputation to infer unobserved outcomes.
%
Compared to the previous state of the art, our approach estimates the quality of decisions more accurately and with lower variance.
%
The approach is also shown to be robust to variations in the decision mechanisms that generated the data.
%
\mcomment{On one hand, since we use judicial data in our experiments, it makes sense to use the bail-or-jail case in the abstract. On the other hand, this does not connect with the motivation we provide to evaluate the decision of (computer/ML/AI) systems, since jail-or-bail decisions are not currently made by such systems. The bank loan example might look better in the abstract.}
%
\end{abstract}


\begin{document}


\fancyhead{}
\maketitle

\renewcommand{\shortauthors}{Authors}


\input{introduction}

%
%\section{The Selective Labels Framework}
%
%
%
%
%We begin by formalizing the selective labels setting.
%
%Let binary variable $T$ denote a decision, where $T=1$ is interpreted as a positive decision. The binary variable $Y$ measures some outcome that is affected by the decision $T$.  The selective labels issue is that in the observed data when $T=0$ then deterministically\footnote{Alternatively, we could see it as not observing the value of $Y$ when $T=0$ inducing a problem of missing data.
%%\acomment{Want to keep this interpretation in the footnote not to interfere with the main interpretation.}
%} $Y=1$.
%
%For example, consider that
%$T$ denotes a decision to jail $T=0$ or bail $T=1$. 
%Outcome $Y=0$ then marks that a defendant offended and $Y=1$ the defendant did not. When a defendant is jailed $T=0$ the defendant obviously did not violate the bail and thus always $Y=1$.
%
%\subsection{Decision Makers}
%
%A decision maker $D(r)$ makes the decision $T$ based on the characteristics of the subject. We assume the decision maker gets input leniency $r$, which defines what percentage of subjects the decision maker makes a positive decision for. A decision maker may be human or a machine learning system. They seek to predict outcome $Y$ based on what they know and then decide $T$ based on this prediction: a negative decision $T=0$ is prefered for subjects predicted to have negative outcome $Y=0$ and a positive decision $T=1$ when the outcome is predicted as positive $Y=1$.  
%
%
%% We especially consider machine learning system that need to use similar data as used for the evaluation; they also need to take into account the selective labels issue.
%
%In the bail or jail example, a decision maker seeks to jail $T=0$ all dangerous defendants that would violate their bail ($Y=0$), but let out the defendants that will not violate their bail. The leniency $r$ refers to the portion of bail decisions.
%
%
%The difference between the decision makers in the data and $D(r)$ is that usually we cannot observe all the information that has been available to the decision makers in the data.
%% In addition, we usually cannot observe the full decision-making process of the decider in the data step contrary to the decider in the modelling step.
%With unobservables we refer to some latent, usually non-written information regarding a certain outcome that is only available to the decision-maker. For example, a judge in court can observe the defendant's behaviour and level of remorse which might be indicative of bail violation. We denote the latent information regarding a person's guilt with variable \unobservable.
%
%
%\subsection{Evaluating Decision Makers}
%
%
% The goodness of a decision maker can be examined as follows. 
%%Acceptance rate (AR) is the number of positive decisions ($T=1$) divided by the number of all decisions. 
%%DO WE NEED ACCEPTANCE RATE ANY MORE 
%Failure rate (FR) is the number of undesired outcomes ($Y=0$) divided by the number of all decisions. 
%% One special characteristic of FR in this setting is that a failure can only occur with a positive decision ($T=1$).
%%That means that a failure rate of zero can be achieved just by not giving any positive decisions but that is not the ultimate goal.
%A good decision maker achieves as low failure rate FR  as possible, for any leniency level. 
%
%However, the data we have does not directly provide a way to evaluate FR. If a decision maker decides $T=1$ for a subject that had $T=0$ in the data, the outcome $Y$ recorded in the data is based on the decision $T=0$ and hence $Y=1$ regardless of the decision taken by $D$. The number of negative outcomes $Y=0$ for these decision needs to be calculated in some non-trivial way.
%
%In the example situation the difficulty is occurs when a decision maker decides to bail $T=0$ a defendant that has been jailed in the data, we cannot directly observe whether the defendant was about to offend or not.
%
%Therefore, the aim is here to give an estimate of the FR at any given AR for any decision maker $D$, formalized as follows:
%\begin{problem}
%Given selectively labeled data, and a decision maker $D(r)$, give an estimate of the failure rate FR for any leniency $r$.
%\end{problem}
%\noindent
%The estimate of the evaluator should also be accurate for all levels of leniency.
%This estimate is vital in the employment machine learning and AI systems to every day use. 
%
%
%% Given the selective labeling of data and the latent confounders present, our goal is to create an evaluator module that can output a reliable estimate of a given decider module's performance. We use acceptance rate and failure rate as measures against which we compare our evaluators because they have direct and easily understandable counterparts in the real world / applicable domains. The evaluator module should be able to accurately estimate the failure rate for all levels of leniency and all data sets.
%
%%The "eventual goal" is to create such an evaluator module that it can outperform (have a lower failure on all levels of acceptance rate) the deciders in the data generating process. The problem is of course comparing the performance of the deciders. We try to address that.
%
%
%
%%\setcounter{section}{1}
%
%
%
%%\section{ Framework ( by Riku)}
%%In this section, we define the key terms used in this paper, present the modular framework for selective labels problems and state our problem.
%%Antti: In conference papers we do not waste space for such in this paper stuff!! In journals one can do that.
%
%%\begin{itemize}
%%\item Definitions \\
%%	In this paper we apply our approach on binary (positive / negative) outcomes, but our approach is readily extendable to accompany continuous or categorical responses. Then we could use e.g. sum of squared errors or other appropriate metrics as the measure for good performance.
%%	With positive or negative outcomes we refer to...
%	%\begin{itemize}
%	%\item Failure rate
%%		\begin{itemize}
%%		\item %Failure rate (FR) is defined as the ratio of undesired outcomes to given decisions. One special characteristic of FR in this setting is that a failure can only occur with a positive decision / we can only observe the outcome when the corresponding decision is positive.
%%		\item %That means that a failure rate of zero can be achieved just by not giving any positive decisions but that is not the ultimate goal. (rather about finding a good balance. > Resource issues in prisons etc.)
%		%\end{itemize}
%%	\item Acceptance rate
%		%\begin{itemize}
%%		\item %Acceptance rate (AR) or leniency is defined as the ratio of positive decisions to all decisions that a decision-maker will give. (Semantically, what is the difference between AR and leniency? AR is always computable, leniency doesn't manifest.) A: a good question! can we get ir of one
%		%\item
%		
%		% In some settings, (justice, medicine) people might want to find out if X\% are accepted what is the resulting failure rate, and what would be the highest acceptance rate to have to have the failure rate at an acceptable level. 
%%		\item We want to know the trade-off between acceptances and failure rate.
%%		\item %Lakkaraju mentioned the problem in the data that judges which have a higher leniency have labeled a larger portion of the data (which might results in bias).
%%		\item As mentioned earlier, these differences in AR might lead to subjects getting different decisions while haven the same observable and unobservable characteristics.
%		%\end{itemize}
%%	\item % Some deciders might have an incentive for positive decisions if it can mean e.g. savings. Judge makes saving by not jailing a defendant. Doctor makes savings by not assigning patient for a higher intensity care. (move to motivation?)
%	%\item 
%%\end{itemize}
%%\begin{itemize}
%%\item Modules \\
%%	We separated steps that modify the data into separate modules to formally define how they work. With observational data sets, the data goes through only a modelling step and an evaluation step. With synthetic data, we also need to define a data generating  step. We call the blocks doing these steps {\it modules}. To fully define a module, one must define its input and output. Modules have different functions, inputs and outputs. Modules are interchangeable with a similar type of module if they share the same input and output (You can change decider module of type A with decider module of type B). With this modular framework we achieve a unified way of presenting the key differences in different settings.
%%	\begin{itemize}
%%	\item Decider modules
%		%\begin{itemize}
%%		\item In general, the decider module assigns predictions to the observations based on some information.
%%		\item %The information available to a decision-maker in the decider module includes observable and -- possibly -- unobservable features, denoted with X and Z respectively.
%%		\item %The predictions given by a decider module can be relative or absolute. With relative predictions we refer to that a decider module can give out a ranking of the subjects based on their predicted tendency towards an outcome. Absolute predictions can be either binary or continuous in nature. For example, they can correspond to yes or no decisions or to a probability value.
%%		\item %Inner workings (procedure/algorithm) of the module may or may not be known. In observational data sets, the mechanism or the decider which has labeled the data is usually unknown. E.g. we do not -- eactly -- know how judges obtain a decision. Conversely, in synthetic data sets the procedure creating the decisions is fully known because we define the process.
%%		\item The decider (module) in the data step has unobservable information available for making the decisions. 
%%		\item %The behaviour of the decider module in the data generating step can be defined in many ways. We have used both the method presented by Lakkaraju et al. and two methods of our own. We created these two deciders to remove the interdependencies of the decisions made by the decider Lakkaraju et al. presented.
%%		\item 		\end{itemize}
%%	\item Evaluator modules
%		%\begin{itemize}
%%		\item Evaluator module gets the decisions, observable features of the subject and predictions made by the deciders and outputs an estimate of...
%%		\item The evaluator module outputs a reliable estimate of a decider module's performance. The estimate is created by the evaluator module and it should 
%		%	\begin{itemize}
%%		%	\item be precise and unbiased
%%			\item have a low variance
%%			\item be as robust as possible to slight changes in the data generation. 
%		%	\end{itemize}
%%		\item The estimate of the evaluator should also be accurate for all levels of leniency.
%		%\end{itemize}
%%	\end{itemize}
%%\item Example: in observational data sets, the deciders have already made decision concerning the subjects and we have a selectively labeled data set available. In the modular framework we refer to the actions of the human labelers as a decider module which has access to latent information. 
%%\item Problem formulation \\
%
%
%%The "eventual goal" is to create such a decider module that it can outperform (have a lower failure on all levels of acceptance rate) the deciders in the data generating process. The problem is of course comparing the performance of the deciders. We try to address that.
%
%%(It's important to somehow keep these two different goals separate.)
%%We show that our method is robust against violations and modifications in the data generating mechanisms.
%
%%\end{itemize}
\input{setting}

\section{Counterfactual-Based Imputation For Selective Labels}

%\acomment{This chapter should be our contributions. One discuss previous results we build over but one should consider putting them in the previous section.}

\subsection{Causal Modeling}


We model the selective labels setting as summarized in Figure~\ref{fig:model}~\cite{lakkaraju2017selective}.

The outcome $Y$ is affected by the observed background factors $X$ and the unobserved background factors $Z$. These background factors also influence the decision $T$ recorded in the data. Hence $Z$ includes information that was used by the decision maker in the data but that is not available to us as observations.
There may also be other background factors that affect $Y$ but not $T$. In addition, we assume that the decision is affected by an observed leniency level $R \in [0,1]$ of the decision maker.


We use a propensity score framework to model $X$ and $Z$: they are assumed to be continuous Gaussian variables, interpreted as summarized risk factors such that higher values denote higher risk of a negative outcome ($Y=0$). Because each variable summarizes many underlying factors, the Gaussianity assumption is motivated by the central limit theorem.


\subsection{Imputation}
%\acomment{We need to start by noting that with a simple example how we assume this to work. If X indicates a safe subject that is jailed, then we know that (I dont know how this applies to other produces) that Z must have indicated a serious risk. This makes $Y=0$ more likely than what regression on $X$ suggests.} done by Riku!


%\acomment{I do not understand what we are doing from this section. It needs to be described ASAP.}


Our approach is based on the observation that, in almost all cases, some information about the latent variable is recoverable. For illustration, consider a defendant $i$ who has received a negative decision $\decisionValue_i = 0$. If the defendant's private features $\featuresValue_i$ indicate that the subject would be safe to release, we can deduce that the unobservable variable $\unobservableValue_i$ indicated high risk, since
%contained so significant information that 
the defendant was nevertheless jailed. In turn, this makes $Y=0$ more likely than what would have been predicted based on $\featuresValue_i$ alone.
Conversely, when the features $\featuresValue_i$ clearly indicate risk and the defendant is subsequently jailed, the decision tells us little about the latent variable.
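
To make this intuition concrete, the following is a sketch under the decision model of Equation~\ref{eq:data_model}, with the additional assumption (made only for this illustration) that the latent variable has a prior density $p(z)$ that does not depend on $\featuresValue_i$. By Bayes' rule, a negative decision by judge $j$ updates the distribution of the latent variable as
\[
p(z \mid x_i, \decisionValue_i = 0) \;\propto\; p(z)\,\bigl(1 - \invlogit(\alpha_j + \beta_{xt} x_i + \beta_{zt} z)\bigr).
\]
When $\alpha_j + \beta_{xt} x_i$ is large, so that the observed features alone would make a positive decision likely, the second factor is close to zero unless $\beta_{zt} z$ is strongly negative, and the posterior concentrates on values of $z$ that signal high risk. When $\alpha_j + \beta_{xt} x_i$ is already strongly negative, the second factor is close to one for a wide range of $z$, and the posterior stays close to the prior.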

\acomment{Could emphasize the above with a plot, x and z in the axis and point styles indicating the decision.}
\acomment{The above assumes that the decision maker in the data is not totally bad.}


In counterfactual-based imputation, we use counterfactual values of the outcome $\outcome_{\decisionValue=1}$ to impute the missing labels. The structural causal model (SCM) required to compute the counterfactuals is presented in Figure~\ref{fig:model}. Using Stan, we model the observed data as
\begin{align} \label{eq:data_model}
 \outcome ~|~\decision = 1, x & \sim \text{Bernoulli}(\invlogit(\alpha_y + \beta_{xy} x + \beta_{zy} z)) \\ \nonumber
 \decision ~|~D, x & \sim \text{Bernoulli}(\invlogit(\alpha_{j} + \beta_{xt} x + \beta_{zt}z)).
\end{align}


That is, using all of the data, we fit one logistic regression model for the decisions based on the observable features \features and the identity of the judge. The identity of the judge is encoded in the intercept $\alpha_j$; that is, each judge has a separate intercept. We model the observed outcomes, i.e., those with $\decision = 1$, with a separate logistic regression model. Together, the two models let us learn the coefficients $\beta_{xy}$ for the observed features and $\beta_{zy}$ for the unobserved features, the single intercept $\alpha_y$, and plausible values of the latent variable \unobservable.

Using the posterior samples of all parameters given by Stan, we can estimate the values of the counterfactuals. Formally, the counterfactuals are drawn from the posterior predictive distribution
\[
p(\tilde{y} \mid y) = \int_\Omega p(\tilde{y} \mid \theta)\, p(\theta \mid y)\, d\theta.
\]

In practice, Stan provides $S$ samples of all model parameters from the posterior distribution $p(\theta|y)$, i.e., the distribution of the parameters given the data. We use these samples to draw plausible outcomes for the missing values. Suppose, for example, that the outcome $\outcomeValue_i$ is missing for some observation $i$. Each of the $S$ posterior samples contains values for the coefficients, the intercepts and $\unobservableValue_i$. We plug these values into the model on the first line of Equation~\ref{eq:data_model} and draw a counterfactual value for the outcome $Y$ from the distribution $y_{i, \decisionValue=1}  \sim \text{Bernoulli}(\invlogit(\alpha_y + \beta_{xy} x_i + \beta_{zy} z_i))$. In essence, we use the parameter values sampled from the posterior to sample new values for the missing outcomes. As this yields $S$ ``guesses'' for each missing outcome, we compute the failure rate for each set of guesses and report the mean.
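
The procedure above amounts to a Monte Carlo approximation of the posterior predictive integral. As a sketch, write $\theta^{(s)}$ for the $s$-th posterior sample and $\widehat{\mathrm{FR}}^{(s)}$ for the failure rate computed after imputing the missing outcomes with draws based on $\theta^{(s)}$; this notation is introduced here for illustration only. Then
\begin{align*}
p(\tilde{y} \mid y) &\approx \frac{1}{S} \sum_{s=1}^{S} p(\tilde{y} \mid \theta^{(s)}), & \theta^{(s)} &\sim p(\theta \mid y), \\
\widehat{\mathrm{FR}} &= \frac{1}{S} \sum_{s=1}^{S} \widehat{\mathrm{FR}}^{(s)}.
\end{align*}
The spread of the values $\widehat{\mathrm{FR}}^{(s)}$ across the $S$ samples can also serve as a rough indication of the uncertainty of the estimate.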

%\begin{itemize}
%\item Theory \\ (Present here (1) what counterfactuals are, (2) motivation for structural equations, (3) an example or other more easily approachable explanation of applying them, (4) why we used computational methods)
%	\begin{itemize}
%	\item Counterfactuals are 
%		\begin{itemize}
%		\item hypothesized quantities that encode the would-have-been relation of the outcome and the treatment assignment.
%		\item Using counterfactuals, we can discuss hypothetical events that didn't happen. 
%		\item Using counterfactuals requires defining a structural causal model.
%		\item Pearl's Book of Why: "The fundamental problem"
%		\end{itemize}
%	\item By defining structural equations / a graph
%		\begin{itemize}
%		\item we can begin formulating causal questions to get answers to our questions.
%		\item Once we have defined the equations, counterfactuals are obtained by... (abduction, action, prediction, don't we apply the do operator on the \decision, so that we obtain $\outcome_{\decision=1}(x)$?)
%		\item We denote the counterfactual "Y had been y had T been t" with...
%		\item By first estimating the distribution of the latent variable Z we can impose 
%		\item Now counterfactuals can be defined as
%			\begin{definition}[Unit-level counterfactuals \cite{pearl2010introduction}]
%			Let $M$ be a structural model and $M_x$ a modified version of $M$, with the equation(s) of $X$ replaced by $X = x$. Denote the solution for $Y$ in the equations of $M_x$ by the symbol $Y_{M_x}(u)$. The counterfactual $Y_x(u)$ (Read: "The value of Y in unit u, had X been x") is given by:
%			\begin{equation} \label{eq:counterfactual}
%				Y_x(u) := Y_{M_x}(u)
%			\end{equation}
%			\end{definition}
%		\end{itemize}
%	\item In a high level
%		\begin{itemize}
%		\item there is usually some data recoverable from the unobservables. For example, if the observable attributes are contrary to the outcome/decision we can claim that the latent variable included some significant information.
%		\item We retrieve this information using the prespecified structural equations. After estimating the desired parameters, we can estimate the value of the counterfactual (not observed) outcome by switching the value of \decision and doing the computations through the rest of the graph...
%		\end{itemize}
%	\item Because the causal effect of \decision to \outcome is not identifiable, we used a Bayesian approach
%	\item Recent advances in the computational methods provide us with ways of inferring the value of the latent variable by applying Bayesian techniques to... Previously this kind of analysis required us to define X and compute Y...
%\end{itemize}
%\item Model (Structure, equations in a general and more specified level, assumptions, how we construct the counterfactual...) 
%	\begin{itemize}
%	\item Structure is as is in the diagram. Square around Z represents that it's unobservable/latent.
%	The features of the subjects include observable and -- possibly -- unobservable features, denoted with X and Z respectively. The only feature of a decider is their leniency R (depicting some baseline probability of a positive decision). The decisions given will be denoted with T and the resulting outcomes with Y, where 0 stands for negative outcome or decision and 1 for positive.
%	\item The causal diagram presents how decision T is affected by the decider's leniency (R), the subject's observable private features (X) and the latent information regarding the subject's tendency for a negative outcome (Z). Correspondingly the outcome (Y) is affected only by the decision T and the above-mentioned features X and Z. 
%	\item The causal directions and implied independencies are readable from the diagram. We assume X and Z to be independent.
%	\item The structural equations connecting the variables can be formalized in a general level as (see Jung)
%		\begin{align} \label{eq:structural_equations}
%		\outcome_0 & = NA \\ \nonumber
%		\outcome_1 & \sim f(\featuresValue, \unobservableValue; \beta_{\featuresValue\outcomeValue}, \beta_{\unobservableValue\outcomeValue}) \\ \nonumber
%		\decision      & \sim g(\featuresValue, \unobservableValue; \beta_{\featuresValue\decisionValue}, \beta_{\unobservableValue\decisionValue}, \alpha_j) \\ \nonumber
%		\outcome & =\outcome_\decisionValue \\ \nonumber
%		\end{align}
%	where the beta and alpha coefficients are the path coefficients specified in the causal diagram
%	\item This general formulation of the selective labels problem enables the use of this approach even when the outcome is not binary. Notably this approach -- compared to that of Jung et al. -- explicates the selective labels issue to the structural equations when we deterministically set the value of outcome y to be one in the event of a negative decision. In addition, we allow the judges to differ in the baseline probabilities for positive decisions, which is by definition leniency.
%	\item Now by imposing a value for the decision \decision we can obtain the counterfactual by simply assigning the desired value to the equations in \ref{eq:structural_equations}. This assumes that... (Consistency constraint) Now we want to know {\it what would have been the outcome \outcome for this individual \featuresValue had the decision been $\decision = 1$, or more specifically $\outcome_{\decision = 1}(\featuresValue)$}.
%	\item To compute the value for the counterfactuals, we need to obtain estimates for the coefficients and latent variables. We specified a Bayesian (/structural) model, which requires establishing a set of probabilistic expressions connecting the observed quantities to the parameters of interest. The relationships of the variables and coefficients are presented in equation \ref{eq:structural_equations} and figure X in a general level. We modelled the observed data as  
%%		\begin{align} \label{eq:data_model}
%%		 y(1) & \sim \text{Bernoulli}(\invlogit(\beta_{xy} x + \beta_{zy} z)) \\ \nonumber
%%		 t & \sim \text{Bernoulli}(\invlogit(\alpha_{j} + \beta_{xt} x + \beta_{zt}z)). \\ \nonumber
%%		\end{align}
%	\item Bayesian models also require the specification of prior distributions for the variables of interest to obtain an estimate of their distribution after observations, the posterior distribution.
%	\item Identifiability of models with unobserved confounding has been discussed by eg McCandless et al and Gelman. As by Gelman we note that scale-invariance has been tackled with specifying the priors.  (?)
%	\item Specify, motivate and explain priors here if space.
%	\end{itemize}
%\item Computation (Stan in general, ...)
%	\begin{itemize}
%	\item Using the model specified in equation \ref{eq:data_model}, we used Stan to estimate the intercepts, path coefficients and latent variables. Stan provides tools for efficient computational estimates of posterior distributions.  Stan uses No-U-Turn Sampling (NUTS), an extension of Hamiltonian Monte Carlo (HMC) algorithm, to computationally estimate the posterior distribution for inferences. (In a high level, the sampler utilizes the gradient of the posterior to compute potential and kinetic energy of an object in the multi-dimensional surface of the posterior to draw samples from it.) Stan also has implementations of black-box variational inference algorithms and direct optimization algorithms for the posterior distribution but they were deemed to be insufficient for estimating the posterior in this setting
%	\item Chain lengths were set to X and number of chains deployed was Y. (Explain algorithm fully later)
%	\end{itemize}
%\end{itemize}

\begin{figure}
    \begin{tikzpicture}[->,>=stealth',node distance=1.5cm, semithick]

  \tikzstyle{every state}=[fill=none,draw=black,text=black]

  \node[state] (R)                    {$R$};
  \node[state] (X) [right of=R] {$X$};
  \node[state] (T) [below of=X] {$T$};
  \node[state] (Z) [rectangle, right of=X] {$Z$};
  \node[state] (Y) [below of=Z] {$Y$};

  \path (R) edge (T)
        (X) edge (T)
	     edge (Y)
        (Z) edge (T)
	     edge (Y)
        (T) edge (Y);
\end{tikzpicture}
\caption{$R$ is the leniency of the decision maker, $T$ is a binary decision, and $Y$ is the outcome that is selectively labeled. Background features $X$ of a subject affect the decision and the outcome. Additional background features $Z$ are visible only to the decision maker in the data.}\label{fig:model}
\end{figure}

\begin{algorithm}
	%\item Potential outcomes / CBI \acomment{Put this in section 3? Algorithm box with these?}
\DontPrintSemicolon
\KwIn{Test data set $\dataset = \{x, j, t, y\}$, acceptance rate $r$} 
\KwOut{Failure rate at acceptance rate $r$} 
Using Stan, draw $S$ samples of all parameters from the posterior distribution of the model defined in Equation~\ref{eq:data_model}. Every element of the vector \unobservableValue is treated as a parameter.\;
\For{i in $1, \ldots, S$}{
	\For{j in $1, \ldots, \datasize$}{
		Draw new outcome $\tilde{\outcome}_{j}$ from $\text{Bernoulli}(\invlogit(\alpha_{y}[i] + \beta_{xy}[i] x_j + \beta_{zy}[i] z[i, j]))$\;
	}
	Impute missing values using outcomes drawn in the previous step.\;
	Sort the observations in ascending order based on the predictions of the predictive model.\;
	Estimate the FR as $\frac{1}{\datasize}\sum_{k=1}^{\datasize\cdot r} \indicator{\outcomeValue_k=0}$ and assign to $\mathcal{U}$.\;
}
\Return{Mean of $\mathcal{U}$.}
	
\caption{Counterfactual-based imputation}
\end{algorithm}
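
As a worked illustration of the failure rate computed in the last step of the algorithm, consider hypothetical numbers that are not taken from any experiment: suppose $\datasize = 1000$ and $r = 0.3$, so that the $\datasize \cdot r = 300$ subjects ranked first receive a positive decision. If $45$ of these subjects have outcome $\outcomeValue_k = 0$, either observed or imputed, then
\[
\mathrm{FR} = \frac{1}{\datasize} \sum_{k=1}^{\datasize \cdot r} \indicator{\outcomeValue_k = 0} = \frac{45}{1000} = 0.045 .
\]
Note that the denominator is the number of all subjects rather than the number of positive decisions, so the failure rate can never exceed the acceptance rate $r$.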

%\section{Extension To Non-Linearity (2nd priority)}

% If X has multiple dimensions or the relationships between the features and the outcomes are clearly non-linear the presented approach can be extended to accomodate non-lineairty. Jung proposed that... Groups... etc etc.


\section{Related work}

Discuss this: \cite{DBLP:conf/icml/Kusner0LS19}

\begin{itemize}
\item Lakkaraju and contraction. \cite{lakkaraju2017selective}
	\item Contraction
		\begin{itemize}
		\item Algorithm by Lakkaraju et al. Assumes that the subjects are assigned to the judges at random and requires that the judges differ in leniency. 
		\item Can estimate the true failure rate only up to the leniency of the most lenient decision-maker.
		\item Performance is affected by the number of people judged by the most lenient decision-maker, the agreement rate and the leniency of the most lenient decision-maker. (Performance is guaranteed / better when ...)
		\item Works only on binary outcomes
		\item (We show that our method isn't constrained by any of these)
		\item The algorithm goes as follows...
%\begin{algorithm}[] 			% enter the algorithm environment
%\caption{Contraction algorithm \cite{lakkaraju17}} 		% give the algorithm a caption
%\label{alg:contraction} 			% and a label for \ref{} commands later in the document
%\begin{algorithmic}[1] 		% enter the algorithmic environment
%\REQUIRE Labeled test data $\D$ with probabilities $\s$ and \emph{missing outcome labels} for observations with $T=0$, acceptance rate r
%\ENSURE
%\STATE Let $q$ be the decision-maker with highest acceptance rate in $\D$.
%\STATE $\D_q = \{(x, j, t, y) \in \D|j=q\}$
%\STATE \hskip3.0em $\rhd$ $\D_q$ is the set of all observations judged by $q$
%\STATE
%\STATE $\RR_q = \{(x, j, t, y) \in \D_q|t=1\}$
%\STATE \hskip3.0em $\rhd$ $\RR_q$ is the set of observations in $\D_q$ with observed outcome labels
%\STATE
%\STATE Sort observations in $\RR_q$ in descending order of confidence scores $\s$ and assign to $\RR_q^{sort}$.
%\STATE \hskip3.0em $\rhd$ Observations deemed as high risk by the black-box model $\mathcal{B}$ are at the top of this list
%\STATE
%\STATE Remove the top $[(1.0-r)|\D_q |]-[|\D_q |-|\RR_q |]$ observations of $\RR_q^{sort}$ and call this list $\mathcal{R_B}$
%\STATE \hskip3.0em $\rhd$ $\mathcal{R_B}$ is the list of observations assigned to $t = 1$ by $\mathcal{B}$
%\STATE
%\STATE Compute $\mathbf{u}=\sum_{i=1}^{|\mathcal{R_B}|} \dfrac{\delta\{y_i=0\}}{| \D_q |}$.
%\RETURN $\mathbf{u}$
%\end{algorithmic}
%\end{algorithm}
		\end{itemize}
\item Counterfactuals/Potential outcomes. \cite{pearl2010introduction} (also Rubin)
\item Approach of Jung et al.\ for optimal policy construction. \cite{jung2018algorithmic}
\item Discussions of latent confounders in multiple contexts.
\item Imputation methods and other approaches to selective labels, e.g., \cite{dearteaga2018learning}
\end{itemize}

\section{Experiments}

In this section we present the results of our experiments with synthetic and realistic data. Among the compared methods, our approach provides the most accurate estimates of a predictive model's performance at all levels of leniency.

\subsection{Synthetic data}

\rcomment{ I presume MM's preferences were that the outcome would be from Bernoulli distribution and that the decisions would be independent. So, let's first explain those ways thoroughly and then mention what we changed as discussed.}

We experimented with synthetic data sets to examine the accuracy and unbiasedness of the evaluators, and their robustness to violations of the assumptions.

We sampled $N=7000$ observations of $X$, $Z$, and $W$ as independent standard Gaussians. We then drew the outcome $Y$ from a Bernoulli distribution with parameter $p = 1 - \invlogit(\beta_xx+\beta_zz+\beta_ww)$, so that $P(Y=0|X, Z, W) = \invlogit(\beta_xx+\beta_zz+\beta_ww)$, where the coefficients for $X$, $Z$ and $W$ were set to $1$, $1$ and $0.2$, respectively. The leniency levels $R$ of the $M=14$ judges were assigned pairwise, so that two judges shared each of the leniencies $0.1,~0.2,\ldots, 0.7$. The subjects were assigned to the judges at random so that each judge received $500$ subjects. The data was divided in half to form a training set and a test set. This process follows the suggestion of Lakkaraju et al. \cite{lakkaraju2017selective}. \acomment{Check before?}
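
For concreteness, the data-generating process above can be summarised as follows, where the subject index $i$ is notation introduced here only for this summary:
\begin{align*}
X_i, Z_i, W_i &\sim \mathcal{N}(0, 1) \quad \text{independently for } i = 1, \ldots, N, \\
P(Y_i = 0 \mid x_i, z_i, w_i) &= \invlogit(x_i + z_i + 0.2\, w_i).
\end{align*}
The sample sizes are consistent with each other: $M = 14$ judges with $500$ subjects each give $14 \cdot 500 = 7000 = N$ observations, and the even split leaves $3500$ observations for each of the training and test sets.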

The \emph{default} decision maker in the data predicts a subject's probability of recidivism to be $P(\decision = 0~|~\features, \unobservable) = \invlogit(\beta_xx+\beta_zz)$. Each decision-maker is assigned a leniency value $r$, and the decision is made by comparing the value of $P(\decision = 0~|~\features, \unobservable)$ to the value of the inverse cumulative distribution function $F^{-1}_{P(\decision = 0~|~\features, \unobservable)}(r)=F^{-1}(r)$. If $F^{-1}(r) < P(\decision = 0~|~\features, \unobservable)$, the subject is given a negative decision $\decision = 0$, and a positive decision otherwise. \rcomment{Needs double checking.} This ensures that the decisions are made independently of each other and that the acceptance rate converges stochastically to $r$. Finally, the outcomes of subjects who received a negative decision were set to $0$.
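
As a sanity check of this rule, write $p = \invlogit(\beta_xx+\beta_zz)$ for the default decision maker's risk estimate (a shorthand used only here). Assuming that $F$ is continuous and that $\invlogit$ denotes the standard logistic function, the probability of a positive decision is
\begin{equation*}
P(\decision = 1) = P\bigl(p \le F^{-1}(r)\bigr) = F\bigl(F^{-1}(r)\bigr) = r .
\end{equation*}
With the coefficient values $\beta_x = \beta_z = 1$ stated for the outcome model and independent standard Gaussian $X$ and $Z$, we have $x + z \sim \mathcal{N}(0, 2)$, and under these assumptions the threshold has the closed form
\begin{equation*}
F^{-1}(r) = \invlogit\bigl(\sqrt{2}\,\Phi^{-1}(r)\bigr),
\end{equation*}
where $\Phi$ is the standard Gaussian cumulative distribution function.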
 
We used a number of different decision mechanisms. A \emph{limited} decision-maker works like the default one, but predicts the risk of a negative outcome using only the recorded features \features, so that $P(\decision = 0~|~\features, \unobservable) = \invlogit(\beta_xx)$. Hence it is unable to observe $Z$. A \emph{biased} decision maker works like the default decision-maker, but the values of the features \features that it observes are altered. We modified the values so that if the value of \featuresValue lay in the interval .. it was multiplied by 0.75 to induce more positive decisions. Similarly, if the subject's \featuresValue was in the .. we deducted ... to induce more negative decisions. Additionally, the effect of non-informative decisions was investigated by deploying a \emph{random} decision-maker. Given leniency $R$, a random decision-maker gives a positive decision $T=1$ with probability $R$.

In contrast, Lakkaraju et al. essentially order the subjects and assign $T=1$ to the fraction of them given by the leniency $R$. We see this as unrealistic: the decision on a subject should not depend on the decisions on other subjects. In the example, this would induce unethical behaviour: a single judge would need to jail a defendant today in order to release a defendant tomorrow.
We instead treat the observations as independent; the leniency remains a good estimate of the acceptance rate, because the acceptance rate converges to the leniency.
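
To illustrate the convergence, consider a judge with leniency $r$ who makes $n$ independent decisions, each positive with probability $r$; the symbols $n$ and $\hat{r}$ are introduced here for illustration. The realised acceptance rate $\hat{r} = \frac{1}{n}\sum_{i=1}^{n} T_i$ then satisfies
\begin{equation*}
\mathrm{E}[\hat{r}] = r, \qquad \mathrm{sd}(\hat{r}) = \sqrt{\frac{r(1-r)}{n}} .
\end{equation*}
For example, with $n = 500$ subjects per judge the standard deviation is at most $\sqrt{0.25/500} \approx 0.022$ (attained at $r = 0.5$), so the acceptance rate of each judge stays close to the assigned leniency.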
 
 %This is a decider module. We experimented with different combinations of decider and data generating modules to show X / see Y. (to see that our method is robust against non-informative, biased and bad decisions . Due to space constraints we defer these results...)

\paragraph{Evaluators} 
	We deployed multiple evaluator modules to estimate the true failure rate of the decider module. The estimates should be close to those of the true evaluation evaluator module, and they will eventually be compared to the human evaluation curve.
\begin{itemize}
	\item  \emph{True evaluation:} True evaluation depicts the true performance of a model. The estimate is computed by first sorting the subjects in descending order based on the prediction of the model. The true failure rate is then computable directly from the outcome labels of the top $(1-r)\cdot 100\%$ of the subjects. True evaluation can only be computed on synthetic data sets, as the ground truth labels are missing in real data.
	%\item \emph{Human evaluation:} Human evaluation presents the performance of the decision-makers who observe the latent variable. Human evaluation curve is computed by binning the decision-makers with similar values of leniency into bins and then computing their failure rate from the ground truth labels. \rcomment{Not computing now.}
	\item \emph{Labeled outcomes:} The labeled outcomes algorithm is the conventional method of computing the failure rate. We proceed as in the true evaluation method but use only the available outcome labels to estimate the failure rate.
	\item \emph{Contraction:} Contraction is an algorithm designed specifically to estimate the failure rate of a black-box predictive model under selective labeling. See previous section.
\end{itemize}

\paragraph{Results} We deployed the evaluators on the synthetic data set described above; the results are shown in Figure \ref{fig:results_main}. The proposed method recovers the true performance of a model at all levels of leniency. The figure also shows that the contraction algorithm can only estimate the true performance up to the leniency of the most lenient decision-maker. (The mean absolute errors on leniency levels from 0.1 to 0.6 were 0.007605 for contraction and 0.001912 for our method, so our error is approximately $75\%$ smaller.)
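
The relative reduction in mean absolute error quoted above follows directly from the two reported values:
\begin{equation*}
1 - \frac{0.001912}{0.007605} \approx 1 - 0.251 = 0.749 ,
\end{equation*}
that is, roughly $75\%$.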

(Target for this section from problem formulation: show that our evaluator is unbiased/accurate (show mean absolute error), robust to changes in data generation (some table perhaps, at least should discuss situations when the decisions are bad/biased/random = non-informative or misleading), also if the decider in the modelling step is bad and its information is used as input, what happens.)
	
\begin{figure}
%\centering
\includegraphics[width=\linewidth]{./img/sl_results_independent_decisions}
\caption{Failure rate vs Acceptance rate with independent decisions -- comparison of the methods, error bars denote standard deviation of the estimate. Here we can see that the new proposed method (red) can recover the true failure rate more accurately than the contraction algorithm (blue). In addition, the new method can accurately track the \emph{true evaluation} curve (green) for all levels of leniency regardless of the leniency of the most lenient decision maker.}
\label{fig:results_main}
\end{figure}

\begin{figure}
%\centering
\includegraphics[width=\linewidth]{./img/sl_results_batch_decisions}
\caption{Failure rate vs Acceptance rate with batch decisions -- comparison of the methods, error bars denote standard deviation of the estimate. Here we can see that the new proposed method (red) can recover the true failure rate more accurately than the contraction algorithm (blue). In addition, the new method can accurately track the \emph{true evaluation} curve for all levels of leniency regardless of the leniency of the most lenient decision maker. %\rcomment{Contraction at 0.7 is a bug. Standard deviations are in the order $0.003$ so their bars are quite tiny.} Non longer present.
}
\label{fig:results_batch}
\end{figure}

\subsection{Realistic data}
In this section we present results from experiments with (realistic) data sets. 

\subsubsection{Analysis on COMPAS data}

COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) is Northpointe's (now under a different name) tool for guiding decisions in the criminal justice system. The COMPAS tool provides judges with risk estimates regarding the probability of recidivism and failure to appear. The COMPAS score is mainly derived from ``prior criminal history, criminal associates, drug involvement, and early indicators of juvenile delinquency problems'' and it predicts recidivism in the following two years. Using the COMPAS score as the sole basis for judgement is prohibited by law; judges must base their decisions on other factors too.

The COMPAS data set is recidivism data from Broward County, Florida, USA. The data set was preprocessed by ProPublica for their article ``Machine Bias''. The original data contained information about \num{18610} defendants who were given a COMPAS score during 2013 or 2014. After removing defendants who were not assessed at the pretrial stage, \num{11757} defendants were left. Additionally, defendants for whom the COMPAS score could not be matched with a corresponding charge were removed from the analysis, resulting in a data set of \num{7214} observations. Following ProPublica's reasoning, after final data cleaning we were left with \num{6172} offences. The data includes the subjects' demographic information, such as gender, age and race, and information on their previous offences.

For the analysis, we created 9 synthetic judges with leniencies $0.1, 0.2, \ldots, 0.9$. The subjects were distributed to the judges evenly and at random. In this semi-synthetic scenario, each judge bases their decisions on the COMPAS score, releasing the fraction of defendants given by their leniency. Those who were given a negative decision had their outcome label hidden. The data was then split into training and test sets, and a logistic regression model was built to predict two-year recidivism from categorised age, gender, the number of priors, and the degree of the crime COMPAS screened for (felony/misdemeanour). We experimented with other models but the results remained the same. The same features were used as input for the counterfactual imputation method.
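
As a rough check of the sizes involved, and assuming that the $6172$ observations are split as evenly as possible and that each judge releases exactly the fraction of defendants given by their leniency, each of the $9$ judges receives
\begin{equation*}
\frac{6172}{9} \approx 686
\end{equation*}
subjects, and the expected share of observed outcome labels equals the mean leniency,
\begin{equation*}
\frac{1}{9}\sum_{k=1}^{9} \frac{k}{10} = \frac{45}{90} = 0.5 ,
\end{equation*}
so that about half of the outcome labels are hidden.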

\begin{itemize}
%\item COMPAS data set
%	\begin{itemize}
%	\item Size, availability, COMPAS scoring
%		\begin{itemize}
%		\item COMPAS general recidivism risk score is made to ,
%		\item The final data set comprises of 6172 subjects assessed at Broward county, California. The data was preprocessed to include only subjects assessed at the pretrial stage and (something about traffic charges).
%		\item Data was made available ProPublica.
%		\item Their analysis and results are presented in the original article "Machine Bias" in which they argue that the COMPAS metric assigns biased risk evaluations based on race.
%		\item Data includes the subjects' demographic information (incl. gender, age, race) and information on their previous offences. 
%		\end{itemize}
%	\item Subsequent modifications for analysis 
%		\begin{itemize}
%		\item We created 9 synthetic judges with leniencies 0.1, 0.2, ..., 0.9. 
%		\item Subjects were distributed to all the judges evenly and at random to enable comparison to contraction method
%		\item We employed similar decider module as explained in Lakkaraju's paper, input was the COMPAS Score 
%		\item As the COMPAS score is derived mainly from so it can be said to have external information available, not coded into the four above-mentioned variables. (quoted text copy-pasted from here)
%		\item Data was split to test and training sets
%		\item A logistic regression model was built to predict two-year recidivism from categorized age, gender, the number of priors, degree of crime COMPAS screened for (felony/misdemeanor)
%		\item We used these same variables as input to the CBI evaluator.
%		\end{itemize}
%	\item Results
%		\begin{itemize}
%		\item Results from this analysis are presented in figure X. In the figure we see that CBI follows the true evaluation curve very closely.
%		\item We can also deduce from the figure that if this predictive model was to be deployed, it wouldn't necessarily improve on the decisions made by these synthetic judges.
%		\end{itemize}
%	\end{itemize}
\item Catalonian data (this could just be for our method? Hide ~25\% of outcome labels and show that we can estimate the failure rate for ALL levels of leniency despite the leniency of this one judge is only 0.25) (2nd priority)
	\begin{itemize}
	\item Size, availability, RisCanvi scoring
	\item Subsequent modifications for analysis
	\item Results
	\end{itemize}
\end{itemize}

\section{Discussion}

\begin{itemize}
\item Conclusions 
\item Future work / Impact
\end{itemize}


% \textbf{Acknowledgments.}
%The computational resources must be mentioned. 

%\clearpage
% \balance
\bibliographystyle{ACM-Reference-Format}
\bibliography{biblio}
%\balancecolumns % GM June 2007

\end{document}