%!TEX root = sl.tex
% The above command helps compiling in TexShop on a MAc. Hitting typeset complies sl.tex directly instead of producing an error here.

\section{Introduction} 
\label{sec:introduction}
\begin{figure*}[t]
\begin{center}
\includegraphics[height=1.75in]{img/intro2problem}
\end{center}
\caption{
A situation in the {\it bail-or-jail} scenario, where a machine makes decisions for the {\it same defendants} previously decided by a judge. When the machine decides to allow bail but the judge had denied it, we cannot directly evaluate the machine's decision.
\label{fig:example}}
\end{figure*}

As digitalisation affects more and more aspects of life, so do the automated decisions that are made by algorithms based on statistical models learned from data~\cite{lakkaraju2017selective}.
This is quite prevalent for web services, where algorithms decide what results are shown by search engines, what news stories appear on social media platforms, or what products are recommended on online stores.
But automated-decision systems are also starting to appear in other domains -- credit scoring, choice of medical treatment, insurance pricing, and even judicial decisions (COMPAS~\cite{brennan2009evaluating} and RisCanvi~\cite{tolan2019why} are two examples of algorithmic tools used to evaluate the risk of recidivism in the US and Catalan prison systems, respectively).
In contrast to human decision making, automatic decision making by AI systems holds promise for better decision quality~\cite{}, and possibly even guarantees of fairness~\cite{}.
But before deploying machines to make automated decisions, there is a need to evaluate their performance in real settings.
In practice, this is done by simulating their deployment over a log of past cases and measuring how well they would have performed, had they been used in place of the human decision makers or other decision systems currently deployed.
Herein lies a challenge: previously made decisions (based on case features, some of which are not recorded in the data) affect the data on which the evaluation is performed, in a way that prevents straightforward evaluation.
Let us explain this with an example, also illustrated in Figure~\ref{fig:example}.
\noindent\hrulefill
In some judicial systems, in the time between a person's arrest and trial, the person may stay out of prison if the judge allows them the possibility to post bail (i.e., deposit an amount of money that serves as a promise to appear in court on the day of the trial and honor other conditions).
The decision of whether to allow bail or send the defendant to jail is deemed successful if bail is allowed to defendants who honor the bail conditions and denied to those who would violate them.
Now suppose we consider a machine-based automated-decision system with the potential to assist or replace the judge as decision maker.
Before the system is actually deployed, we wish to evaluate its performance.
To do that, we ask the machine to provide its decisions for past cases decided by judges in the judicial system.
However, we are only able to directly evaluate the machine's decisions for cases where the human judge had allowed bail.
Why? Because if the judge had not allowed bail in the first place, we do not have a chance to observe whether the defendant would have violated the bail conditions.
To evaluate the machine's decisions in such cases, one approach would be to infer the defendant's behavior in the hypothetical case they had been allowed bail.
Here lie some challenges.
First, the inference should take into account the bias in the observed data.
And second, the judge might have made their decision based on more information than appears on record (and based on which the machine makes its decision).
For example, the judge might have witnessed anxious or aggressive behavior by the defendant during the ruling.
% And third, the decisions of more than one judges may be reflected in the data.
\noindent\hrulefill
The above exemplifies a general class of settings where a machine is to be evaluated against data from past decisions: a machine is asked to make a binary decision (e.g., allow bail or not, grant a loan or not, provide medication or not) for a specific case, based on a set of recorded features (in the bail-or-jail scenario, such features can be the defendant's age, gender, charged offence, criminal record, etc.); the decision can be successful or not (e.g., it fails if the bail conditions are violated); and some decisions prevent us from directly evaluating alternative decisions (e.g., if the machine proposes to allow bail but the judge who decided the case denied it, we cannot evaluate the machine's decision directly).
%
Our task, then, is to evaluate the quality of machine decisions against a log of past cases.
%
For the evaluation to be accurate, we should take into account the bias in the observations, as well as the fact that past decisions may have been based on more information than the recorded features on which the machine bases its decision.
%
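To make the bias concrete, here is a hypothetical toy simulation of the bail-or-jail data-generating process (all variable names, distributions, and thresholds are our own illustrative assumptions, not a model proposed in this paper): outcomes are recorded only for released defendants, and the judge's decision depends on information $z$ that the machine never sees.

```python
import random

def generate_case(rng):
    """One toy defendant record; all parameters are illustrative assumptions."""
    x = rng.gauss(0, 1)   # recorded features (e.g., summarized criminal record)
    z = rng.gauss(0, 1)   # information only the judge observes (e.g., demeanor)
    t = int(x + z < 0.5)  # judge releases defendants who look low-risk on x AND z
    # outcome (bail violation) is recorded only when the defendant was released
    y = int(x + z + rng.gauss(0, 1) > 1.5) if t else None
    return {'x': x, 'z': z, 't': t, 'y': y}

rng = random.Random(0)
data = [generate_case(rng) for _ in range(1000)]

# A machine's decisions can be directly evaluated only on released cases:
evaluable = [c for c in data if c['t'] == 1]
# Their observed violation rate understates risk for the full population,
# because the judge released exactly those who looked safe on x and z.
```

The unobserved $z$ is what makes the selection non-ignorable: even conditioning on all recorded features $x$, the released and jailed populations differ.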
% The theory of causal inference~\cite{pearl2010introduction} gives us the tools to address this task in a principled manner.
There is a rich literature on problems that arise in similar settings; our specific problem can be approached from several viewpoints.
%One can approach the challenges from different viewpoints, selection bias, missing data, , offline policy evaluation, and latent confounding sensitivity, analysis .
% COUNTERFACTUALS: FUNDAMENTAL PROBLEM
At its core, our task is to answer a `what-if' question: ``what would the outcome have been if a different decision had been made?'' (a counterfactual). This is often referred to as the `fundamental problem' of causal inference~\cite{holland1986statistics, bookofwhy}.
% SELECTION BIAS
Settings where data samples are chosen through some intricate filtering mechanism are said to exhibit {\it selection bias} (see, for example, \citet{hernan2004structural}). In the present case, any model predicting outcomes can only be trained on samples where the decision was positive.
% MISSING DATA %IMPUTATION 
Settings where some variables are not observed for all samples exhibit \emph{missing data}. Here, the outcomes for samples with a negative decision are considered missing, or labeled with some default value.
%Research on selection bias has achieved results in recovery the structure of the generative model (i.e., the mechanism that results in bias) and estimating causal effects (e.g.,~\citet{pearl1995empirical} and~\citet{bareinboim2012controlling}).
%OFFLINE POLICY EVALUATION
\emph{Offline policy evaluation} refers to evaluating a decision policy over a dataset recorded under another policy~\cite{Jung2}, which is also the case here: the recorded decisions were always made under a particular policy.
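For context, a minimal sketch of one standard offline estimator, inverse propensity scoring (not the method proposed in this paper): it reweights logged outcomes by how likely the evaluated policy was to take the logged action. All names below are illustrative; note that it assumes the logging policy's action probabilities are known and positive, an assumption the selective-labels setting does not grant.

```python
def ips_estimate(log, target_prob):
    """Inverse propensity scoring (IPS) for offline policy evaluation.

    `log` holds tuples (x, action, reward, logging_prob) collected under
    the deployed policy; `target_prob(x, action)` returns the evaluated
    policy's probability of taking `action` in context `x`.
    Requires logging_prob > 0 wherever the target policy acts.
    """
    return sum(r * target_prob(x, a) / p for (x, a, r, p) in log) / len(log)

# Toy usage: the logging policy chose between two actions uniformly
# (logging_prob = 0.5); the evaluated policy always takes action 0.
log = [("case1", 0, 1.0, 0.5), ("case2", 1, 0.0, 0.5)]
always_zero = lambda x, a: 1.0 if a == 0 else 0.0
estimate = ips_estimate(log, always_zero)  # (1.0/0.5 + 0.0)/2 = 1.0
```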
%COUNFOUNDING AND SENSITIVITY ANALYSIS


%WE WANT TO CITE HERE ALL SELECTIVE LABELS PAPERS TO SELL THIS VIEWPOINT
Recently, \citet{lakkaraju2017selective} referred to the problem of evaluation in such settings as the `{\it selective labels problem}', emphasizing the fact that outcomes in the data are selectively labeled (see also \cite{dearteaga2018learning,kleinberg2018human}).
\citet{lakkaraju2017selective} also presented {\it contraction}, a method for evaluating decision making mechanisms in a setting where subjects are randomly assigned to decision makers with varying leniency levels. 
The {\it contraction} technique takes advantage of the assumed random assignment and variance in leniency: essentially it measures the performance of the evaluated system using the cases of the most lenient judge. 
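A rough sketch of the contraction estimator, under its assumptions of random assignment of subjects to judges and variation in leniency (field names and data layout below are our own illustrative choices, not the original implementation):

```python
def contraction(cases, target_rate):
    """Sketch of the contraction estimator (Lakkaraju et al., 2017).

    `cases` is a list of dicts with hypothetical keys: 'judge',
    'released' (bool), 'failed' (bool, observed only if released), and
    'machine_risk' (the evaluated machine's predicted risk score).
    Assumes subjects were randomly assigned to judges.
    """
    # identify the most lenient judge (highest release rate)
    judges = {c['judge'] for c in cases}
    def release_rate(j):
        load = [c for c in cases if c['judge'] == j]
        return sum(c['released'] for c in load) / len(load)
    lenient = max(judges, key=release_rate)
    caseload = [c for c in cases if c['judge'] == lenient]
    # outcomes are observed only for cases the lenient judge released
    released = sorted((c for c in caseload if c['released']),
                      key=lambda c: c['machine_risk'])
    # the machine keeps the lowest-risk cases until its target acceptance rate
    n_keep = int(target_rate * len(caseload))
    kept = released[:n_keep]
    # estimated failure rate of the machine at this acceptance rate
    return sum(c['failed'] for c in kept) / len(caseload)
```

The estimate is computed over the full caseload of the most lenient judge, which is why the method needs a sufficiently lenient judge with many cases: the machine's released set must fit inside the set of cases whose outcomes were actually observed.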
%We note, however, that for contraction to work, we need lenient decision makers making decisions on a large number of subjects.
% THIS BELONGS TO RELATED WORK
%In another recent paper, \citet{Jung2} studied unobserved confounding in the context of creating optimal decision policies. 
%They approached the problem with Bayesian modelling, but they don't consider the selective labels issue or the possibility that the decisions reflected in the data may be taken by more than one decision makers with differing levels of leniency.
\spara{Our contributions}
In this paper, we build upon the problem setting used in~\citet{lakkaraju2017selective} and present a novel, modular framework to evaluate decision makers over selectively labeled data. 
Our approach makes use of causal modeling to represent our assumptions about the process that generated the data and uses counterfactual reasoning to impute unobserved outcomes in the data.
We experiment with synthetic data to highlight various properties of our approach.
We also perform an empirical evaluation in realistic settings, using real recidivism data from COMPAS~\cite{angwin2016machine,brennan2009evaluating}.
The results indicate that our method produces more accurate estimates with considerably less variation than the state of the art; moreover, unlike the contraction approach tailored to this setting~\cite{lakkaraju2017selective}, it does not depend on the existence of lenient decision makers in the data.
%\todo{Michael}{Summarize our experimental results}