%!TEX root = sl.tex
% The above command helps compiling in TexShop on a Mac. Hitting typeset compiles sl.tex directly instead of producing an error here.

\section{Introduction} 
\label{sec:introduction}
\begin{figure*}[t]
\begin{center}
\includegraphics[height=1.75in]{img/intro2problem}
\end{center}
\caption{
The figure shows a situation in the {\it bail-or-jail} scenario, where a machine makes decisions for the {\it same defendants} previously decided by a judge. When the machine decides to allow bail for a defendant whom the judge had denied bail, we cannot evaluate the machine's decision directly.
\label{fig:example}}
\end{figure*}

As digitalisation reaches more and more aspects of life, so do automated decisions made by algorithms based on statistical models learned from data~\cite{lakkaraju2017selective}.
This is most prevalent in web services, where algorithms decide what results search engines show, what news stories appear on social media platforms, and what products online stores recommend.
But automated-decision systems are also starting to appear in other settings -- credit scoring, choice of medical treatment, insurance pricing, and even judicial decisions (COMPAS~\cite{brennan2009evaluating} and RisCanvi~\cite{tolan2019why} are two examples of algorithmic tools used to evaluate the risk of recidivism in the US and Catalan prison systems, respectively).
Before deploying machines to make automated decisions, there is a need to evaluate their performance in real settings.
In practice, this is done by simulating their deployment over a log of past cases and measuring how well they would have performed had they replaced the human decision makers or other decision systems currently in place.
Herein lies a challenge: previously-made decisions affect the data on which the evaluation is performed, in a way that prevents straightforward evaluation.
Let us explain this with an example, also illustrated in Figure~\ref{fig:example}.
\noindent\hrulefill
In some judicial systems, in the time between a person's arrest and trial, the person may stay out of prison if the judge allows them the possibility to post bail (i.e., deposit an amount of money that serves as a promise to appear in court on the day of the trial and honor other conditions).
The decision whether to allow bail or send the defendant to jail is deemed successful if bail is allowed to defendants who honor the bail conditions and denied to those who would violate them.
Now suppose we consider a machine-based automated-decision system with the potential to assist or replace the judge as decision maker.
Before the system is actually deployed, we wish to evaluate its performance.
To do that, we ask the machine to provide its decisions for past cases decided by judges in the judicial system.
However, we are able to directly evaluate the machine's decisions only for cases where the human judge it is supposed to replace had allowed bail.
Why? Because if the judge had not allowed bail in the first place, we do not have a chance to observe whether the defendant would have violated the bail conditions.
To evaluate the machine's decisions in such cases, one approach would be to infer the defendant's behavior in the hypothetical case they had been allowed bail.
Here lies another challenge: the judge might have made their decision based on more information than appears on record (and on which the machine bases its decision).
For example, the judge might have witnessed anxious or aggressive behavior by the defendant during the ruling.
This possibility should be taken into account when evaluating the machine's performance.
\noindent\hrulefill
The above exemplifies a general class of cases where a machine is to be evaluated against data from past decisions: a machine is asked to make a binary decision (e.g., allow bail or not, grant a loan or not, provide medication or not) for a specific case, based on a set of recorded features (in the bail-or-jail scenario, such features can be the defendant's age, gender, charged offense, criminal record, etc.); the decision can be successful or not (e.g., it fails if the bail conditions are violated); and some decisions prevent us from directly evaluating alternative decisions (e.g., if the machine proposes to allow bail but the judge who had decided the case denied it, we cannot evaluate the machine's decision directly).
%
Our task, then, is to evaluate the quality of machine decisions against a log of past cases.
%
For the evaluation to be accurate, we should take into account that past decisions may have been based on more information than the recorded features on which the machine bases its decision.
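To make this concrete, the following minimal synthetic sketch (illustrative only; the variables, risk model, and thresholds are our own assumptions, not taken from any dataset used in this paper) shows how naive evaluation over the labeled cases misleads: a judge who also sees unobserved information $z$ filters out risky defendants, so the machine's failure rate computed only on jointly released cases underestimates its true failure rate.
\begin{verbatim}
# Minimal synthetic sketch (our own illustration): selective labels
# bias the naive evaluation of a machine decision maker.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)   # recorded feature, seen by judge and machine
z = rng.normal(size=n)   # unobserved information, seen by judge only

# True risk of violating bail depends on both x and z.
y = rng.binomial(1, 1 / (1 + np.exp(-(x + z))))  # 1 = would violate

judge_releases = (x + z) < 0      # judge uses both x and z
machine_releases = x < 0          # machine uses only x

# Outcomes are observed only where the judge released.
labeled = judge_releases
naive = y[machine_releases & labeled].mean()   # naive estimate
truth = y[machine_releases].mean()             # known only in simulation

print(f"naive failure-rate estimate: {naive:.3f}")
print(f"true failure rate:           {truth:.3f}")
\end{verbatim}
Here the naive estimate is overly optimistic, because the judge's use of $z$ removes high-risk defendants from the labeled set.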
%
% The theory of causal inference~\cite{pearl2010introduction} gives us the tools to address this task in a principled manner.
There is a rich literature on problems that arise in similar settings.
%
At its core, our task is to answer a `what-if' question: ``what would the outcome have been if a different decision had been made?''
%
\citet{bookofwhy} refer to this task as the `fundamental problem' in causal inference.
%
In the literature, settings where a portion of the data is not observed due to some filtering mechanism are said to exhibit {\it selection bias} (see, for example, \citet{hernan2004structural}).
%
Research on selection bias has produced results on recovering the structure of the generative model (i.e., the mechanism that gives rise to the bias) and on estimating causal effects (e.g.,~\citet{pearl1995empirical} and~\citet{bareinboim2012controlling}).

Recently, \citet{lakkaraju2017selective} referred to the problem of evaluation in such settings as the `{\it selective labels problem}'.
They presented {\it contraction}, a method for evaluating decision making mechanisms in a setting where subjects are randomly assigned to decision makers with varying leniency levels. 
The {\it contraction} technique takes advantage of the assumed random assignment and variance in leniency: essentially it measures the performance of the evaluated system using the cases of the most lenient judge. 
We note, however, that for contraction to work, the most lenient decision makers must decide a large number of cases.
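For concreteness, the following sketch captures the core idea of contraction as we read it (the names and bookkeeping are ours; see \citet{lakkaraju2017selective} for the actual algorithm):
\begin{verbatim}
# Sketch of the contraction idea (our reading, not a reference
# implementation): emulate the machine at acceptance rate r using
# only the caseload of the most lenient judge.
import numpy as np

def contraction(risk, released, violated, r):
    # risk:     machine's predicted risk per case (higher = riskier)
    # released: boolean mask, judge released the defendant
    # violated: boolean mask, defendant violated bail (observed
    #           only where released is True)
    # r:        target acceptance rate; must not exceed the
    #           judge's own leniency
    n = len(risk)
    idx = np.where(released)[0]
    idx = idx[np.argsort(risk[idx])]    # safest released cases first
    kept = idx[: int(r * n)]            # cases the machine releases
    return violated[kept].sum() / n     # estimated failure rate
\end{verbatim}
Because the estimate is built from the most lenient judge's released cases, its variance grows quickly when that judge has decided few subjects, which is the limitation noted above.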
In another recent paper, \citet{jung2018algorithmic} studied unobserved confounding in the context of creating optimal decision policies. 
%
They approached the problem with Bayesian modelling, but they do not consider the selective labels issue, where decisions deterministically determine the outcomes, nor the effect of having multiple decision makers with differing levels of leniency. { ... \bf TODO}
\todo{Michael}{Check, expand and elaborate on related work.}
\spara{Our contributions}
In this paper, we build upon the problem setting used in~\citet{lakkaraju2017selective} and present a novel, modular framework to evaluate decision makers over selectively labeled data. 
Our approach makes use of causal modeling to represent our assumptions about the process that generated the data and uses counterfactual reasoning to impute unobserved outcomes in the data.
We experiment with synthetic data to highlight various properties of our approach.
We also perform an empirical evaluation in realistic settings, using real data from COMPAS.
Our results indicate that {... \bf TODO}.
\todo{Michael}{Summarize our experimental results}
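For orientation, the impute-then-evaluate pattern at the heart of our framework can be summarized by the following generic skeleton (the interface is hypothetical and deliberately omits the causal model that produces the imputations, which is described later in the paper):
\begin{verbatim}
# Generic skeleton of impute-then-evaluate (interface is ours;
# the counterfactual model behind `impute` is specified later).
import numpy as np

def evaluate(machine_releases, y_observed, impute):
    # machine_releases: boolean mask, machine would release
    # y_observed: outcome (1 = violation), NaN where the past
    #             decision hid it
    # impute: callable mapping indices of unlabeled cases to
    #         estimates of P(violation), e.g. from a causal model
    #         that accounts for unobserved confounding
    y = y_observed.astype(float).copy()
    missing = np.flatnonzero(np.isnan(y))
    y[missing] = impute(missing)   # counterfactual estimates
    return y[machine_releases].mean()
\end{verbatim}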