
\documentclass[sigconf,anonymous]{acmart}
% \documentclass[sigconf]{acmart}


\usepackage{tikz}
\usepackage{tikz-cd}
\usetikzlibrary{arrows,automata, positioning}

% Packages
\usepackage{type1cm}     % type1 computer modern font
\usepackage{graphicx}     % advanced figures
\usepackage{xspace}     % fix space in macros
\usepackage{balance}     % to better equalize the last page
\usepackage{multirow}     % multi rows for tables
\usepackage[font={bf}, tableposition=top]{caption}     % captions on top for tables
\usepackage{bold-extra}     % bold + {small capital, italic}
\usepackage{siunitx}          % \num for decimal grouping
\usepackage[vlined,linesnumbered,ruled,noend]{algorithm2e}     % algorithms
\usepackage{booktabs}     % nicer tables
%\usepackage[hyphens]{url}     % handle long urls
%\usepackage[bookmarks, pdftex, colorlinks=false]{hyperref}     % clickable references
%\usepackage[square,numbers]{natbib}     % better references
\usepackage{microtype}    % compress text
\usepackage{units}     % nicer slanted fractions
\usepackage{mathtools}     % amsmath++
%\usepackage{amssymb}     % math symbols
%\usepackage{amsmath}
\usepackage{relsize}
\usepackage{caption}
\captionsetup{belowskip=6pt,aboveskip=2pt} % to save space.
%\usepackage{subcaption}
% \usepackage{multicolumn}
\usepackage[]{inputenc}
\usepackage{xfrac}
\RequirePackage{graphicx,color}
\usepackage[font={small}]{subfig} % subfig, 4 figures in a row
\usepackage{pifont}
\usepackage{footnote} % show footnotes in tables
\makesavenoteenv{table}

\newcommand{\acomment}[1]{{{\color{orange} [A: #1]}}}
\newcommand{\rcomment}[1]{{{\color{red} [R: #1]}}}
\newcommand{\mcomment}[1]{{{\color{blue} [M: #1]}}}

%\newcommand{\ourtitle}{Working title: From would-have-beens to should-have-beens: Counterfactuals in model evaluation}

\newcommand{\ourtitle}{Evaluating Decision Makers over Selectively Labeled Data}

\input{macros}
\usepackage{chato-notes}


\title{\ourtitle}

\author{Michael Mathioudakis}
\affiliation{%
  \institution{University of Helsinki}
  \city{Helsinki} 
  \country{Finland} 
}
\email{michael.mathioudakis@helsinki.fi}


\begin{abstract}
%We show how a causality-based approach can be used to estimate the performance of prediction algorithms in `selective labels' settings -- with particular application to `bail-or-jail' judicial decisions.
An increasing number of important decisions affecting people's lives are being made by machine learning and AI systems.
We study how to evaluate the quality of such decision makers.
The major difficulty in such evaluation is that the decision makers already in use, whether AI or human, influence the data the evaluation is based on. For example, when
deciding whether a defendant should be given bail or kept in jail, we are not able to directly observe the possible offences by defendants that the decision making system in use decides to keep in jail. To evaluate decision makers in these difficult settings, we derive a flexible Bayesian approach that utilizes counterfactual-based imputation. Compared to the previous state of the art, the approach gives more accurate estimates of the decision quality with lower variance. The approach is also shown to be robust to different variations in the decision mechanisms in the data.
\end{abstract}


\begin{document}


\fancyhead{}
\maketitle

\renewcommand{\shortauthors}{Authors}


\section{Introduction} 
\acomment{'Decision maker' sounds and looks much better than 'decider'! Can we use that?}

\acomment{We should be careful with the word bias and unbiased, they may refer to statistical bias of estimator, some bias in the decision maker based on e.g. race, and finally selection bias.}

\begin{itemize}
\item What we study
	\begin{itemize}
		\item We studied methods to evaluate the performance of predictive algorithms/models when the historical data suffers from selective labeling and unmeasured confounding.
	\end{itemize}
\item Motivation for the study
	\begin{itemize}
		\item Many decisions that affect the course of human lives are being made.
		\item Computational models could enhance the decision-making process in accuracy and fairness.
		\item The advantage of using models does not necessarily lie in sheer throughput, i.e., that a machine can make more decisions. Rather, a machine can give bounds for uncertainty, can learn from a vast set of information and, with care, can be made as unbiased as possible.
		\item Fairness has been discussed in the existing literature and numerous publications are available for interested readers. Our emphasis in this paper is on pure performance, getting the predictions accurate.
		\item Before deploying any algorithms, they should be evaluated to show that they actually improve on human decision-making.
		\item Evaluating algorithms in conventional settings, where (almost) all of the labels are available, is straightforward: numerous metrics have been proposed and are in use in multiple fields.
	\end{itemize}
\item Present the setting and challenge:
	\begin{itemize}
		\item Specifically, `Selective labels' settings arise in situations where data are the product of a decision mechanism that prevents us from observing outcomes for part of the data.
		\item A typical example is that of bail-or-jail decisions in judicial settings: a judge decides whether to grant bail to a defendant based on whether the defendant is considered likely to violate bail conditions while awaiting trial -- and therefore a violation might occur only in case bail is granted. Naturally similar scenarios are observed throughout many walks of life from banking to medicine.
		\item Such settings give rise to questions about the effect of alternative decision mechanisms -- e.g., `how many defendants would violate bail conditions if bail were granted more often?'.
		\item In other words, one faces the challenge of estimating the performance of an alternative, potentially automated, decision policy that might make different decisions than the ones found in the existing data.
		\item Characteristically, in many of these settings the decisions that hide the outcomes are made by different deciders.
		\item Labels are missing non-randomly, and the decisions might be made by different deciders who differ in leniency.
		\item This might lead to a situation where subjects with the same characteristics are given different decisions because of the differing leniency.
		\item Of course, the differing decisions might also be attributable to some unobserved information that the decision-maker had available due to meeting with the subject.
		\item The explainability of black-box models has been discussed in X. We don't discuss fairness.
		\item In settings like judicial bail decisions, some outcomes cannot be observed due to the nature of the decisions. This results in a complicated missing data problem where the missingness of an item is connected with its outcome and where the available labels aren't a random sample of the true population. Recently this problem has been named the selective labels problem.
	\end{itemize}
\item Related work
	\begin{itemize}
		\item In the original paper, Lakkaraju et al. presented contraction which performed well compared to other methods previously presented in the literature. 
		\item We wanted to benchmark our approach against contraction and show that we can improve on their algorithm in terms of restrictions and accuracy.
		\item Restrictions: our method makes fewer assumptions (random assignments, agreement rate, etc.) and can estimate the performance at all levels of leniency, regardless of the leniency of the most lenient judge. See Fig.~5 of Lakkaraju et al.
		\item Jung et al. presented a method for constructing optimal policies; we show that their approach can also be applied to the selective labels setting.
		\item They did not consider selective labeling, nor that the judges would differ in leniency.
		\item The selective labels issue has been addressed in the causal inference literature under the name of selection bias. The discussion has mainly concentrated on recovering causal effects, and the model structure has usually been different (Pearl, Bareinboim, etc.).
		\item Latent confounding has been discussed by X when discussing the effect of latent confounders on ORs, etc.
	\end{itemize}
\item Our contribution
	\begin{itemize}
	\item In this paper we propose a (novel modular) framework that provides a systematic way of presenting these missing data problems by breaking them into different modules and making the function of each module explicit.
	\item In addition, we present an approach for inferring / imputing the missing labels to evaluate the performance of predictive models in settings where selective labeling and latent confounding are present. We use the theory of counterfactuals and causal inference to formally define the problem. In computation we use the latest tools. / ``a flexible, Bayesian approach''.
	\end{itemize}
\end{itemize}


\section{The Selective Labels Framework}


We begin by formalizing the selective labels setting.

Let the binary variable $T$ denote a decision, where $T=1$ is interpreted as a positive decision. The binary variable $Y$ measures some outcome that is affected by the decision $T$. The selective labels issue is that, in the observed data, $T=0$ deterministically\footnote{Alternatively, we could see it as not observing the value of $Y$ when $T=0$, which induces a problem of selection bias.\acomment{Want to keep this interpretation in the footnote not to interfere with the main interpretation.}} implies $Y=1$.

For example, consider that
$T$ denotes a decision to jail ($T=0$) or bail ($T=1$) a defendant.
The outcome $Y=0$ then marks that the defendant offended and $Y=1$ that the defendant did not. When a defendant is jailed ($T=0$), the defendant obviously cannot violate the bail, and thus always $Y=1$.

\subsection{Decision Makers}

A decision maker $D$ makes the decision $T$ based on the characteristics of the subject. A decision maker may be a human or a machine learning system. They seek to predict the outcome $Y$ based on what they know and then decide $T$ based on this prediction: a negative decision $T=0$ is preferred for subjects predicted to have a negative outcome $Y=0$, and a positive decision $T=1$ when the outcome is predicted to be positive, $Y=1$. We especially consider machine learning systems that need to use data similar to the data used for the evaluation; such systems also need to take the selective labels issue into account.

In the bail or jail example, a decision maker seeks to jail $T=0$ all dangerous defendants that would violate their bail ($Y=0$), but let out the defendants that will not violate their bail.


\subsection{Evaluating Decision Makers}

The goodness of a decision maker can be examined as follows. Acceptance rate (AR) is the number of positive decisions ($T=1$) divided by the number of all decisions. Failure rate (FR) is the number of undesired outcomes ($Y=0$) divided by the number of all decisions. 
% One special characteristic of FR in this setting is that a failure can only occur with a positive decision ($T=1$).
%That means that a failure rate of zero can be achieved just by not giving any positive decisions but that is not the ultimate goal.
A good decision maker makes a large number of positive decisions with a low failure rate.

However, the data we have does not directly provide a way to evaluate FR. If a decision maker decides $T=1$ for a subject that had $T=0$ in the data, the outcome $Y$ recorded in the data is based on the decision $T=0$, and hence $Y=1$ regardless of the decision taken by $D$. The number of negative outcomes $Y=0$ for these decisions needs to be calculated in some non-trivial way.

In the example, the difficulty occurs when a decision maker decides to bail ($T=1$) a defendant who was jailed in the data: we cannot directly observe whether the defendant would have offended or not.

Therefore, the aim here is to give an estimate of the FR at any given AR for any decision maker $D$. This estimate is vital for deploying machine learning and AI systems in everyday use.
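
As a minimal formalization of the two rates, suppose a decision maker $D$ makes decisions $t_1, \dots, t_N$ for $N$ subjects, and let $y_1, \dots, y_N$ be the outcomes under those decisions. The symbols $N$, $t_i$, $y_i$ and the indicator $\mathbf{1}[\cdot]$ are notation assumed here for illustration. The definitions above can then be written as
\begin{equation*}
\mathrm{AR} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}[t_i = 1],
\qquad
\mathrm{FR} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}[t_i = 1,\ y_i = 0].
\end{equation*}
Since $Y=0$ can only occur when $T=1$, the condition $t_i = 1$ in FR is implied by $y_i = 0$ and is written out only for emphasis. For instance, with hypothetical numbers, if $D$ makes $6$ positive decisions among $N=10$ subjects and $2$ of them result in $Y=0$, then $\mathrm{AR} = 0.6$ and $\mathrm{FR} = 0.2$.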


% Given the selective labeling of data and the latent confounders present, our goal is to create an evaluator module that can output a reliable estimate of a given decider module's performance. We use acceptance rate and failure rate as measures against which we compare our evaluators because they have direct and easily understandable counterparts in the real world / applicable domains. The evaluator module should be able to accurately estimate the failure rate for all levels of leniency and all data sets.

%The "eventual goal" is to create such an evaluator module that it can outperform (have a lower failure on all levels of acceptance rate) the deciders in the data generating process. The problem is of course comparing the performance of the deciders. We try to address that.


\subsection{Causal Modeling}

\begin{figure}
    \begin{tikzpicture}[->,>=stealth',node distance=1.5cm, semithick]

  \tikzstyle{every state}=[fill=none,draw=black,text=black]

  \node[state] (R)                    {$R$};
  \node[state] (X) [right of=R] {$X$};
  \node[state] (T) [below of=X] {$T$};
  \node[state] (Z) [rectangle, right of=X] {$Z$};
  \node[state] (Y) [below of=Z] {$Y$};

  \path (R) edge (T)
        (X) edge (T)
	     edge (Y)
        (Z) edge (T)
	     edge (Y)
        (T) edge (Y);
\end{tikzpicture}
\caption{$R$ is the leniency of the decision maker, $T$ is a binary decision, and $Y$ is the outcome that is selectively labeled. Background features $X$ for a subject affect the decision and the outcome. Additional background features $Z$ are visible only to the decision maker in use.}\label{fig:model}
\end{figure}

We model the selective labels setting as summarized in Figure~\ref{fig:model}~\cite{lakkaraju2017selective}.

The outcome $Y$ is affected by the observed background factors $X$ and the unobserved background factors $Z$. These background factors also influence the decision $T$ taken in the data. Hence $Z$ includes information that was used by the decision maker in the data but that is not available to us as observations.
There may also be other background factors that affect $Y$ but not $T$. In addition, we assume the decision is affected by some observed leniency level $R \in [0,1]$ of the decision maker.


We use a propensity score framework to model $X$ and $Z$: they are assumed continuous Gaussian variables, with the interpretation that they represent summarized risk factors such that higher values denote higher risk for a negative outcome ($Y=0$). Hence the Gaussianity assumption here is motivated by the central limit theorem.

\acomment{Not sure if this is good to discuss here or in the next section: if we would like the next section be full of our contributions and not lakkarajus, we should place it here.}

\setcounter{section}{1}


\section{Framework (by Riku)}
%In this section, we define the key terms used in this paper, present the modular framework for selective labels problems and state our problem.
%Antti: In conference papers we do not waste space for such in this paper stuff!! In journals one can do that.

\begin{itemize}
\item Definitions \\
	In this paper we apply our approach to binary (positive / negative) outcomes, but it is readily extendable to accommodate continuous or categorical responses. In that case we could use, e.g., the sum of squared errors or another appropriate metric as the measure of good performance.
	With positive or negative outcomes we refer to...
	\begin{itemize}
	\item Failure rate
		\begin{itemize}
		\item Failure rate (FR) is defined as the ratio of undesired outcomes to given decisions. One special characteristic of FR in this setting is that a failure can only occur with a positive decision / we can only observe the outcome when the corresponding decision is positive.
		\item That means that a failure rate of zero can be achieved just by not giving any positive decisions but that is not the ultimate goal. (rather about finding a good balance. > Resource issues in prisons etc.)
		\end{itemize}
	\item Acceptance rate
		\begin{itemize}
		\item Acceptance rate (AR) or leniency is defined as the ratio of positive decisions to all decisions that a decision-maker will give. (Semantically, what is the difference between AR and leniency? AR is always computable, leniency doesn't manifest.)
		\item In some settings (justice, medicine), people might want to find out what the resulting failure rate is if X\% are accepted, and what the highest acceptance rate is that keeps the failure rate at an acceptable level.
		\item We want to know the trade-off between acceptances and failure rate.
		\item Lakkaraju mentioned the problem in the data that judges with a higher leniency have labeled a larger portion of the data (which might result in bias).
		\item As mentioned earlier, these differences in AR might lead to subjects getting different decisions while having the same observable and unobservable characteristics.
		\end{itemize}
	\item With decider or decision-maker we might refer to a judge, a doctor, ... who makes the decisions on which labels are available. % Some deciders might have an incentive for positive decisions if it can mean e.g. savings. Judge makes saving by not jailing a defendant. Doctor makes savings by not assigning patient for a higher intensity care. (move to motivation?)
	\item With unobservables we refer to some latent, usually non-written information regarding a certain outcome that is only available to the decision-maker. For example, a judge in court can observe the defendant's behaviour and level of remorse which might be indicative of bail violation. We denote the latent information regarding a person's guilt with variable \unobservable.
\end{itemize}
\item Modules \\
	We separated steps that modify the data into separate modules to formally define how they work. With observational data sets, the data goes through only a modelling step and an evaluation step. With synthetic data, we also need to define a data generating  step. We call the blocks doing these steps {\it modules}. To fully define a module, one must define its input and output. Modules have different functions, inputs and outputs. Modules are interchangeable with a similar type of module if they share the same input and output (You can change decider module of type A with decider module of type B). With this modular framework we achieve a unified way of presenting the key differences in different settings.
	\begin{itemize}
	\item Decider modules
		\begin{itemize}
		\item In general, the decider module assigns predictions to the observations based on some information.
		\item The information available to a decision-maker in the decider module includes observable and -- possibly -- unobservable features, denoted with $X$ and $Z$ respectively.
		\item The predictions given by a decider module can be relative or absolute. By relative predictions we mean that a decider module outputs a ranking of the subjects based on their predicted tendency towards an outcome. Absolute predictions can be either binary or continuous in nature. For example, they can correspond to yes or no decisions or to a probability value.
		\item The inner workings (procedure/algorithm) of the module may or may not be known. In observational data sets, the mechanism or the decider which has labeled the data is usually unknown. E.g., we do not know exactly how judges arrive at a decision. Conversely, in synthetic data sets the procedure creating the decisions is fully known because we define the process.
		\item The decider (module) in the data step has unobservable information available for making the decisions. 
		\item The behaviour of the decider module in the data generating step can be defined in many ways. We have used both the method presented by Lakkaraju et al. and two methods of our own. We created these two deciders to remove the interdependencies of the decisions made by the decider Lakkaraju et al. presented.
		\item The difference between the deciders in the data and modelling steps is that usually we cannot observe all the information that has been available to the decider in the data step as opposed to the decider in the modelling step. In addition, we usually cannot observe the full decision-making process of the decider in the data step contrary to the decider in the modelling step.
		\end{itemize}
	\item Evaluator modules
		\begin{itemize}
		\item Evaluator module gets the decisions, observable features of the subject and predictions made by the deciders and outputs an estimate of...
		\item The evaluator module outputs a reliable estimate of a decider module's performance. The estimate is created by the evaluator module and it should 
			\begin{itemize}
			\item be precise and unbiased
			\item have a low variance
			\item be as robust as possible to slight changes in the data generation. 
			\end{itemize}
		\item The estimate of the evaluator should also be accurate for all levels of leniency.
		\end{itemize}
	\end{itemize}
%\item Example: in observational data sets, the deciders have already made decision concerning the subjects and we have a selectively labeled data set available. In the modular framework we refer to the actions of the human labelers as a decider module which has access to latent information. 
\item Problem formulation \\
Given the selective labeling of data, multiple decision-makers and the latent confounders present, our goal is to create an evaluator module that can output a reliable estimate of a given decider module's performance. We use acceptance rate and failure rate as measures against which we compare our evaluators because they have direct and easily understandable counterparts in the applicable domains.

%The "eventual goal" is to create such a decider module that it can outperform (have a lower failure on all levels of acceptance rate) the deciders in the data generating process. The problem is of course comparing the performance of the deciders. We try to address that.

%(It's important to somehow keep these two different goals separate.)
We show that our method is robust against violations and modifications in the data generating mechanisms.

\end{itemize}

\section{Counterfactual-Based Imputation For Selective Labels}

\begin{itemize}
\item Theory \\ (Present here (1) what counterfactuals are, (2) motivation for structural equations, (3) an example or other more easily approachable explanation of applying them, (4) why we used computational methods)
	\begin{itemize}
	\item Counterfactuals are 
		\begin{itemize}
		\item hypothesized quantities that encode the would-have-been relation of the outcome and the treatment assignment.
		\item Using counterfactuals, we can discuss hypothetical events that didn't happen. 
		\item Using counterfactuals requires defining a structural causal model.
		\item Pearl's Book of Why: "The fundamental problem"
		\end{itemize}
	\item By defining structural equations / a graph
		\begin{itemize}
		\item we can begin formulating causal questions to get answers to our questions.
		\item Once we have defined the equations, counterfactuals are obtained by... (abduction, action, prediction, don't we apply the do operator on the \decision, so that we obtain $\outcome_{\decision=1}(x)$?)
		\item We denote the counterfactual ``$Y$ would have been $y$ had $T$ been $t$'' with...
		\item By first estimating the distribution of the latent variable Z we can impose 
		\item Now counterfactuals can be defined as
			\begin{definition}[Unit-level counterfactuals \cite{pearl2010introduction}]
			Let $M$ be a structural model and $M_x$ a modified version of $M$, with the equation(s) of $X$ replaced by $X = x$. Denote the solution for $Y$ in the equations of $M_x$ by the symbol $Y_{M_x}(u)$. The counterfactual $Y_x(u)$ (read: ``the value of $Y$ in unit $u$, had $X$ been $x$'') is given by:
			\begin{equation} \label{eq:counterfactual}
				Y_x(u) := Y_{M_x}(u)
			\end{equation}
			\end{definition}
		\end{itemize}
	\item At a high level
		\begin{itemize}
		\item some information about the unobservables is usually recoverable. For example, if the observable attributes are contrary to the outcome/decision, we can claim that the latent variable included some significant information.
		\item We retrieve this information using the prespecified structural equations. After estimating the desired parameters, we can estimate the value of the counterfactual (not observed) outcome by switching the value of \decision and doing the computations through the rest of the graph...
		\end{itemize}
	\item Because the causal effect of \decision on \outcome is not identifiable, we used a Bayesian approach.
	\item Recent advances in the computational methods provide us with ways of inferring the value of the latent variable by applying Bayesian techniques to... Previously this kind of analysis required us to define X and compute Y...
\end{itemize}
\item Model (Structure, equations in a general and more specified level, assumptions, how we construct the counterfactual...) 
	\begin{itemize}
	\item The structure is as shown in the diagram. The square around $Z$ indicates that it is unobservable/latent.
	The features of the subjects include observable and -- possibly -- unobservable features, denoted with $X$ and $Z$ respectively. The only feature of a decider is their leniency $R$ (depicting some baseline probability of a positive decision). The decisions given are denoted with $T$ and the resulting outcomes with $Y$, where 0 stands for a negative outcome or decision and 1 for a positive one.
	\item The causal diagram presents how the decision $T$ is affected by the decider's leniency ($R$), the subject's observable private features ($X$) and the latent information regarding the subject's tendency for a negative outcome ($Z$). Correspondingly, the outcome ($Y$) is affected only by the decision $T$ and the above-mentioned features $X$ and $Z$.
	\item The causal directions and implied independencies are readable from the diagram. We assume $X$ and $Z$ to be independent.
	\item The structural equations connecting the variables can be formalized in a general level as (see Jung)
		\begin{align} \label{eq:structural_equations}
		\outcome_0 & = 1 \\ \nonumber
		\outcome_1 & \sim f(\featuresValue, \unobservableValue; \beta_{\featuresValue\outcomeValue}, \beta_{\unobservableValue\outcomeValue}) \\ \nonumber
		\decision      & \sim g(\featuresValue, \unobservableValue; \beta_{\featuresValue\decisionValue}, \beta_{\unobservableValue\decisionValue}, \alpha_j) \\ \nonumber
		\outcome & =\outcome_\decisionValue
		\end{align}
	where the $\beta$ and $\alpha$ coefficients are the path coefficients of the causal diagram.
	\item This general formulation of the selective labels problem enables the use of this approach even when the outcome is not binary. Notably, this approach -- compared to that of Jung et al. -- makes the selective labels issue explicit in the structural equations by deterministically setting the value of the outcome to one in the event of a negative decision. In addition, we allow the judges to differ in their baseline probabilities for positive decisions, which is by definition leniency.
	\item Now by imposing a value for the decision \decision we can obtain the counterfactual by simply assigning the desired value to the equations in~\eqref{eq:structural_equations}. This assumes that... (Consistency constraint) Now we want to know {\it what would have been the outcome \outcome for this individual \featuresValue had the decision been $\decision = 1$, or more specifically $\outcome_{\decision = 1}(\featuresValue)$}.
	\item To compute the value for the counterfactuals, we need to obtain estimates for the coefficients and latent variables. We specified a Bayesian (/structural) model, which requires establishing a set of probabilistic expressions connecting the observed quantities to the parameters of interest. The relationships of the variables and coefficients are presented at a general level in Equation~\eqref{eq:structural_equations} and Figure X. We modelled the observed data as
		\begin{align} \label{eq:data_model}
		 y(1) & \sim \text{Bernoulli}(\invlogit(\beta_{xy} x + \beta_{zy} z)) \\ \nonumber
		 t & \sim \text{Bernoulli}(\invlogit(\alpha_{j} + \beta_{xt} x + \beta_{zt}z)).
		\end{align}
	\item Bayesian models also require the specification of prior distributions for the variables of interest to obtain an estimate of their distribution after observations, the posterior distribution.
	\item Identifiability of models with unobserved confounding has been discussed by, e.g., McCandless et al. and Gelman. Following Gelman, we note that scale-invariance has been tackled by specifying the priors. (?)
	\item Specify, motivate and explain priors here if space.
	\end{itemize}
\item Computation (Stan in general, ...)
	\begin{itemize}
	\item Using the model specified in Equation~\eqref{eq:data_model}, we used Stan to estimate the intercepts, path coefficients and latent variables. Stan provides tools for efficient computational estimation of posterior distributions. Stan uses No-U-Turn Sampling (NUTS), an extension of the Hamiltonian Monte Carlo (HMC) algorithm, to computationally estimate the posterior distribution for inferences. (At a high level, the sampler utilizes the gradient of the posterior to compute the potential and kinetic energy of an object on the multi-dimensional surface of the posterior to draw samples from it.) Stan also has implementations of black-box variational inference algorithms and direct optimization algorithms for the posterior distribution, but they were deemed to be insufficient for estimating the posterior in this setting.
	\item Chain lengths were set to X and number of chains deployed was Y. (Explain algorithm fully later)
	\end{itemize}
\end{itemize}
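
As a sketch of how the fitted model could be used for imputation (an illustration under stated assumptions, not a full description of the procedure), let $(\beta_{xy}^{(s)}, \beta_{zy}^{(s)}, z_i^{(s)})$, $s = 1, \dots, S$, denote posterior draws for the model in Equation~\eqref{eq:data_model}. The indices $i$ and $s$, the number of draws $S$, the number of subjects $N$ and the symbols $\hat{p}_i$, $\tilde{y}_i$ and $t_i^{D}$ are notation assumed here for illustration. For a subject $i$ with $t_i = 0$ in the data, the probability of a positive outcome under the counterfactual decision $T=1$ could be estimated by averaging over the draws,
\begin{equation*}
\hat{p}_i = \frac{1}{S} \sum_{s=1}^{S} \invlogit\bigl(\beta_{xy}^{(s)} x_i + \beta_{zy}^{(s)} z_i^{(s)}\bigr).
\end{equation*}
Writing $\tilde{y}_i = y_i$ when the outcome was observed ($t_i = 1$) and $\tilde{y}_i = \hat{p}_i$ otherwise, the failure rate of an evaluated decision maker $D$ with decisions $t_i^{D}$ could then be estimated as
\begin{equation*}
\widehat{\mathrm{FR}} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}[t_i^{D} = 1]\,(1 - \tilde{y}_i),
\end{equation*}
where $\mathbf{1}[\cdot]$ is the indicator function. Averaging probabilities rather than sampled binary outcomes is one possible design choice; it removes the sampling noise of the imputed labels from the estimate.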

\section{Extension To Non-Linearity (2nd priority)}

% If X has multiple dimensions or the relationships between the features and the outcomes are clearly non-linear the presented approach can be extended to accomodate non-lineairty. Jung proposed that... Groups... etc etc.

\section{Related work}

\begin{itemize}
\item Lakkaraju and contraction. \cite{lakkaraju2017selective}
\item Counterfactuals/Potential outcomes. \cite{pearl2010introduction} (also Rubin)
\item Approach of Jung et al. for optimal policy construction. \cite{jung2018algorithmic}
\item Discussions of latent confounders in multiple contexts.
\item Imputation methods and other approaches to selective labels, e.g., \cite{dearteaga2018learning}
\end{itemize}

\section{Experiments}

In this section we present our results from experiments with synthetic and realistic data. We show that our approach provides the best estimates for evaluating the performance of a predictive model on all levels of leniency.

\subsection{Synthetic data}

\rcomment{ I presume MM's preferences were that the outcome would be from Bernoulli distribution and that the decisions would be independent. So, let's first explain those ways thoroughly and then mention what we changed as discussed.}

We experimented with synthetic data sets to examine accuracy, unbiasedness and robustness to violations of the assumptions.

We sampled $N = \num{50000}$ samples of $X$, $Z$, and $W$ as independent standard Gaussians. We then drew the outcome $Y$ from a Bernoulli distribution with parameter $p = 1 - \invlogit(\beta_x x+\beta_z z+\beta_w w)$ so that $P(Y=0 \mid X=x, Z=z, W=w) = \invlogit(\beta_x x+\beta_z z+\beta_w w)$, where the coefficients $\beta_x$, $\beta_z$ and $\beta_w$ were set to $1$, $1$ and $0.2$, respectively. This process follows the suggestion of Lakkaraju et al.~\cite{lakkaraju2017selective}.
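
To make the outcome mechanism concrete, the sampling scheme above can be written compactly as
\begin{align*}
X, Z, W &\sim \mathcal{N}(0, 1) \quad \text{independently}, \\
Y \mid X = x, Z = z, W = w &\sim \text{Bernoulli}\bigl(1 - \invlogit(x + z + 0.2\,w)\bigr).
\end{align*}
As a quick check of the scale, with example values chosen purely for illustration and assuming $\invlogit(a) = 1/(1+e^{-a})$: a subject with $x = z = w = 0$ has $P(Y=0) = \invlogit(0) = 0.5$, whereas a subject with $x = z = 1$ and $w = 0$ has $P(Y=0) = \invlogit(2) \approx 0.88$.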

%This is one data generation module.
% It can be / was modified by changing the outcome producing mechanism. For other experiments we changed the outcome generating mechanism so that the outcome was assigned value 1 if
We used a number of different decision mechanisms in our simulations. The decisions were assigned by computing the quantile the subject belongs to. The quantile was obtained as the inverse cdf of ... . This way the observations were independent and the leniency would still be a good estimate of the acceptance rate. (The acceptance rate would stochastically converge to the leniency.)
This is a decider module. We experimented with different combinations of decider and data generating modules to show X / see Y. (To see that our method is robust against non-informative, biased and bad decisions. Due to space constraints we defer these results...)

\paragraph{Algorithms} 
	We deployed multiple evaluator modules to estimate the true failure rate of the decider module. The estimates of the evaluator modules should be close to the true evaluation, and they will eventually be compared to the human evaluation curve.
	\begin{itemize}
	\item True evaluation
		\begin{itemize}
		\item Depicts the true performance of the model: ``How well would this model perform had it been deployed?''
		\item Not available when using observational data.
		\item Calculated by ordering the observations based on the predictions from the black-box model $\mathcal{B}$ and computing the failure rate from the ground truth labels.
		\end{itemize}
	\item Human evaluation
		\begin{itemize}
		\item The performance of the deciders in the data generation step. We binned deciders with similar values of leniency and computed their failure rate.
		\item In observational data sets, we can only record the decisions and acceptance rates of these decision-makers.
		\item This curve is ultimately the benchmark for the performance of a model.
		\end{itemize}
	\item Labeled outcomes
		\begin{itemize}
		\item Vanilla estimator of a model's performance. Obtained by first ordering the observations by the predictions assigned by the decider in the modelling step.
		\item Then the $(1-r) \cdot 100\,\%$ of subjects deemed most dangerous are detained, i.e.\ given a negative decision. The failure rate is computed as the ratio of negative outcomes to the number of subjects.
		\end{itemize}
	\item Contraction
		\begin{itemize}
		\item Algorithm by Lakkaraju et al.~\cite{lakkaraju2017selective}. Assumes that the subjects are assigned to the judges at random and requires that the judges differ in leniency.
		\item Can estimate the true failure rate only up to the leniency of the most lenient decision-maker.
		\item Performance is affected by the number of subjects judged by the most lenient decision-maker, the agreement rate and the leniency of the most lenient decision-maker. (Performance is guaranteed / better when ...)
		\item Works only on binary outcomes.
		\item (We show that our method is not constrained by any of these.)
		\item The algorithm goes as follows...
%\begin{algorithm}[] 			% enter the algorithm environment
%\caption{Contraction algorithm \cite{lakkaraju17}} 		% give the algorithm a caption
%\label{alg:contraction} 			% and a label for \ref{} commands later in the document
%\begin{algorithmic}[1] 		% enter the algorithmic environment
%\REQUIRE Labeled test data $\D$ with probabilities $\s$ and \emph{missing outcome labels} for observations with $T=0$, acceptance rate r
%\ENSURE
%\STATE Let $q$ be the decision-maker with highest acceptance rate in $\D$.
%\STATE $\D_q = \{(x, j, t, y) \in \D|j=q\}$
%\STATE \hskip3.0em $\rhd$ $\D_q$ is the set of all observations judged by $q$
%\STATE
%\STATE $\RR_q = \{(x, j, t, y) \in \D_q|t=1\}$
%\STATE \hskip3.0em $\rhd$ $\RR_q$ is the set of observations in $\D_q$ with observed outcome labels
%\STATE
%\STATE Sort observations in $\RR_q$ in descending order of confidence scores $\s$ and assign to $\RR_q^{sort}$.
%\STATE \hskip3.0em $\rhd$ Observations deemed as high risk by the black-box model $\mathcal{B}$ are at the top of this list
%\STATE
%\STATE Remove the top $[(1.0-r)|\D_q |]-[|\D_q |-|\RR_q |]$ observations of $\RR_q^{sort}$ and call this list $\mathcal{R_B}$
%\STATE \hskip3.0em $\rhd$ $\mathcal{R_B}$ is the list of observations assigned to $t = 1$ by $\mathcal{B}$
%\STATE
%\STATE Compute $\mathbf{u}=\sum_{i=1}^{|\mathcal{R_B}|} \dfrac{\delta\{y_i=0\}}{| \D_q |}$.
%\RETURN $\mathbf{u}$
%\end{algorithmic}
%\end{algorithm}
		\end{itemize}
	\item Potential outcomes / CBI
		\begin{itemize}
		\item Take the test set.
		\item Compute the posterior for the parameters and variables presented in Equation~\ref{eq:data_model}.
		\item Using the posterior predictive distribution, draw estimates for the counterfactuals.
		\item Impute the missing outcomes using the estimates from the previous step.
		\item Obtain a point estimate for the failure rate by computing the mean.
		\item Estimates for the counterfactuals $Y(1)$ for the unobserved values of $Y$ were obtained using the posterior expectations from Stan. We used the NUTS sampler to estimate the posterior. When the values for...
		\end{itemize}
	\end{itemize}
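
To make the contraction estimate concrete, the following sketch writes out the quantity it computes, together with a small worked example. The notation is ours and is meant as an illustration of the algorithm of Lakkaraju et al.~\cite{lakkaraju2017selective} as we read it, not as a restatement of it: $\mathcal{D}_q$ denotes the observations judged by the most lenient decision-maker $q$, $\mathcal{R}_q \subseteq \mathcal{D}_q$ those with observed outcome labels, and $\mathcal{R}_{\mathcal{B}}$ the observations that remain after removing from $\mathcal{R}_q$ the
\[
k = (1-r)\,|\mathcal{D}_q| - \bigl(|\mathcal{D}_q| - |\mathcal{R}_q|\bigr)
\]
observations that the black-box model $\mathcal{B}$ deems riskiest. The estimated failure rate at acceptance rate $r$ is then
\[
\hat{u}(r) = \frac{1}{|\mathcal{D}_q|} \sum_{i \in \mathcal{R}_{\mathcal{B}}} \delta\{y_i = 0\}.
\]
As an example with hypothetical numbers, let $|\mathcal{D}_q| = 1000$, $|\mathcal{R}_q| = 900$ and $r = 0.5$. Then $k = 500 - 100 = 400$, so $|\mathcal{R}_{\mathcal{B}}| = 500$; if $50$ of these observations have $y_i = 0$, the estimate is $\hat{u}(0.5) = 50/1000 = 0.05$. Since $k = |\mathcal{R}_q| - r\,|\mathcal{D}_q|$ becomes negative once $r$ exceeds the acceptance rate of $q$, the estimate is only defined up to the leniency of the most lenient decision-maker.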
\paragraph{Results} 
(Target for this section from problem formulation: show that our evaluator is unbiased/accurate (show mean absolute error), robust to changes in data generation (some table perhaps, at least should discuss situations when the decisions are bad/biased/random = non-informative or misleading), also if the decider in the modelling step is bad and its information is used as input, what happens.)
	\begin{itemize}
	\item Accuracy: we have defined two metrics, acceptance rate and failure rate. In this section we show that our method can accurately recover the true failure rate at all acceptance rates with low mean absolute error. As Figure X shows, our method can recover the true performance of the predictive model with good accuracy. The mean absolute errors w.r.t.\ the true evaluation were 0.XXX and 0.XXX for the contraction and CBI approaches, respectively.
	\item In Figure X we also present how our method can track the true evaluation curve with low variance.
	\end{itemize}
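
The mean absolute error referred to above can be written out as follows. This is a sketch under the assumption that the estimates are compared on a finite grid of acceptance rates $r_1, \dots, r_K$; the grid and the symbols $\widehat{\mathrm{FR}}$ (estimated failure rate) and $\mathrm{FR}_{\text{true}}$ (failure rate given by the true evaluation) are our notation for this illustration.
\[
\mathrm{MAE} = \frac{1}{K} \sum_{k=1}^{K} \bigl| \widehat{\mathrm{FR}}(r_k) - \mathrm{FR}_{\text{true}}(r_k) \bigr|
\]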
%\end{itemize}

\subsection{Realistic data}
In this section we present results from experiments with (realistic) data sets. 

\begin{itemize}
\item COMPAS data set
	\begin{itemize}
	\item Size, availability, COMPAS scoring
		\begin{itemize}
		\item COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) is a tool by Northpointe (now Equivant) for guiding decisions in the criminal justice system.
		\item The COMPAS general recidivism risk score is designed to predict recidivism within the following two years.
		\item The final data set comprises 6172 subjects assessed in Broward County, Florida. The data was preprocessed to include only subjects assessed at the pretrial stage and (something about traffic charges).
		\item The data was made available by ProPublica.
		\item Their analysis and results are presented in the original article ``Machine Bias'', in which they argue that the COMPAS metric assigns biased risk evaluations based on race.
		\item The data includes the subjects' demographic information (incl. gender, age, race) and information on their previous offences.
		\end{itemize}
	\item Subsequent modifications for analysis 
		\begin{itemize}
		\item We created 9 synthetic judges with leniencies 0.1, 0.2, ..., 0.9. 
		\item Subjects were distributed to all the judges evenly and at random to enable comparison with the contraction method.
		\item We employed a decider module similar to the one described by Lakkaraju et al.~\cite{lakkaraju2017selective}; its input was the COMPAS score.
		\item As the COMPAS score is derived mainly from ``prior criminal history, criminal associates, drug involvement, and early indicators of juvenile delinquency problems'', it can be said to have external information available that is not coded into the four variables used by the predictive model. (quoted text copy-pasted from here)
		\item The data was split into training and test sets.
		\item A logistic regression model was built to predict two-year recidivism from categorized age, gender, the number of priors, and the degree of the crime COMPAS screened for (felony/misdemeanor).
		\item We used these same variables as input to the CBI evaluator.
		\end{itemize}
	\item Results
		\begin{itemize}
		\item Results from this analysis are presented in Figure X. In the figure we see that CBI follows the true evaluation curve very closely.
		\item We can also deduce from the figure that if this predictive model were deployed, it would not necessarily improve on the decisions made by these synthetic judges.
		\end{itemize}
	\end{itemize}
\item Catalonian data (this could just be for our method? Hide $\sim$25\% of outcome labels and show that we can estimate the failure rate for ALL levels of leniency even though the leniency of this one judge is only 0.25) (2nd priority)
	\begin{itemize}
	\item Size, availability, RisCanvi scoring
	\item Subsequent modifications for analysis
	\item Results
	\end{itemize}
\end{itemize}

\section{Discussion}

\begin{itemize}
\item Conclusions 
\item Future work / Impact
\end{itemize}


% \textbf{Acknowledgments.}
%The computational resources must be mentioned. 

%\clearpage
% \balance
\bibliographystyle{ACM-Reference-Format}
\bibliography{biblio}
%\balancecolumns % GM June 2007

\end{document}