\section{Introduction} 
\label{sec:introduction}
As digitalisation affects more and more aspects of life, so do the automated decisions that are made by algorithms based on statistical models learned from data.
This is quite prevalent for web services, where algorithms decide what results are shown by search engines, what news stories appear on social media platforms, or what products are recommended on online stores.
But automated-decision systems are also starting to appear in other settings -- credit scoring, choice of medical treatment, insurance pricing, and even judicial decisions (COMPAS and RisCanvi are two examples of algorithmic tools used to evaluate the risk of recidivism in the US and Catalan prison systems, respectively).

Before deploying such automated-decision systems, there is a need to evaluate their performance in real settings.
In practice, this is done by simulating their deployment over a log of past cases and measuring how well they would have performed, had they been used in place of the human decision makers or other decision systems currently in place.
Herein lies a challenge: previously-made decisions affect the data on which the evaluation is performed, in a way that prevents straightforward evaluation.
Let us explain this with an example.

\hrulefill

\spara{Example}. 
In some judicial systems, in the time between a person's arrest and trial, the person may stay out of prison if the judge grants them the possibility to post bail (i.e., deposit an amount of money that serves as a promise to appear in court on the day of the trial and honor other conditions).
The decision to grant or deny bail is deemed successful if bail is granted to defendants who honor the bail conditions and denied to those who would violate them.
Now suppose we consider an automated-decision system with the potential to assist or replace the judge as decision maker.
Before the system is actually deployed, we wish to evaluate the system's performance.
To do that, we ask the system to provide its decisions for past cases decided by judges in the judicial system.
However, we can directly evaluate the system's decisions only for cases where the human judge it is supposed to replace had granted bail.
Why? Because if the judge had not granted bail in the first place, we do not have a chance to observe whether the defendant would have violated the bail conditions.
To evaluate the system's decisions in such cases, one approach would be to infer the defendant's behavior in the hypothetical case they had been granted bail.
Here lies another challenge: the judge might have based their decision on more information than appears on the record -- and the record is all the system has to base its decision on.
For example, the judge might have witnessed anxious or aggressive behavior by the defendant during the ruling.
Such possibility should be taken into account when evaluating the system's performance.

\hrulefill

\todo{MM}{Let's add a figure to describe the example and problem setting.}

The example above illustrates a general class of cases where a system is to be evaluated against data from past decisions: the system is asked to make a binary decision (e.g., grant bail or not, grant a loan or not, administer medication or not) for a specific case, based on a set of recorded features for the case (in the bail-or-jail scenario, such features can be the defendant's age, gender, charged offense, criminal record, etc.); the decision can be successful or not (e.g., it fails if the bail conditions are violated); and some decisions prevent us from evaluating alternative decisions (e.g., if the system proposes to grant bail but the judge who decided the case denied it, we cannot evaluate the system's decision directly).
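To make the setting concrete, the following is a minimal synthetic sketch of how selectively labeled data arises; the functional forms, coefficients, and variable names are illustrative assumptions, not the data-generating process used in our experiments.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n = 10000

# Recorded features of each case, summarized here by a single score x;
# z stands for information seen only by the judge, not on the record.
x = rng.normal(size=n)
z = rng.normal(size=n)

# Judge's decision: grant bail (t = 1) when the perceived risk, based on
# both x and the unrecorded z, is low enough.
t = (0.7 * x + 0.7 * z < 1.0).astype(int)

# Outcome: y = 1 if the defendant would honor the bail conditions.
y = (x + z + rng.normal(scale=0.5, size=n) < 1.5).astype(int)

# Selective labels: the outcome is observed only when bail was granted.
y_observed = np.where(t == 1, y, np.nan)
\end{verbatim}
In such data, a system that would grant bail to some cases with \texttt{t = 0} cannot be evaluated directly, and the unrecorded \texttt{z} makes the labeled cases an unrepresentative sample of all cases.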

\spara{Related Work}
\citet{bookofwhy} refer to the problem of missing the outcome under an alternative decision as the ``fundamental problem'' of causal inference.
In the causal inference literature, such cases are said to exhibit {\it selection bias}.
Research on selection bias has mainly concentrated on recovering the structure of the generative model (i.e., the mechanism that results in bias) and on estimating causal effects (Pearl, Bareinboim, etc.).
Recently, \citet{lakkaraju2017selective} referred to the problem of evaluation in such settings as the ``{\it selective labels problem}''.
They presented {\it contraction}, a method for evaluating decision-making mechanisms in a setting where cases are randomly assigned to decision makers with varying leniency levels.
The {\it contraction} technique takes advantage of the assumed random assignment and the variation in leniency: essentially, it measures the performance of the evaluated system using the cases of the most lenient judge.
We note, however, that for contraction to work, we need lenient decision makers who each decide on a large number of subjects.
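As a rough illustration of the idea (not the authors' exact algorithm), the sketch below estimates the failure rate of an evaluated decision maker at a target acceptance rate $r$ using only the caseload of the most lenient judge; the function and variable names are our own illustrative choices.
\begin{verbatim}
import numpy as np

def contraction_sketch(failure_prob, accepted, failed, leniency):
    # failure_prob: evaluated system's predicted failure probability for
    #               each case in the most lenient judge's caseload
    # accepted:     1 if that judge granted bail (outcome observed), else 0
    # failed:       1 if bail conditions were violated (valid only where
    #               accepted == 1)
    # leniency:     target acceptance rate r, assumed not to exceed the
    #               judge's own acceptance rate
    n = len(failure_prob)
    n_accept = int(leniency * n)              # cases the system may accept
    observed = np.flatnonzero(accepted == 1)  # outcomes are known only here
    # Keep the n_accept lowest-risk cases among those the judge accepted.
    kept = observed[np.argsort(failure_prob[observed])][:n_accept]
    # Estimated failure rate at leniency r over the judge's full caseload.
    return failed[kept].sum() / n
\end{verbatim}
The estimate is defined only when $r$ does not exceed the acceptance rate of the most lenient judge, which is one way to see why contraction needs very lenient decision makers with large caseloads.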
In another recent paper, \citet{jung2018algorithmic} \mcomment{TODO write what they do}.

\todo{MM}{Expand and elaborate on related work.}
\spara{Our contributions}
In this paper, we present a novel, modular framework to evaluate decision makers using selectively labeled data. 
Our approach uses causal modeling to represent our assumptions about the data-generating process and counterfactual reasoning to impute the unobserved outcomes.
We experiment with synthetic data to highlight various properties of our approach.
We also perform an empirical evaluation in realistic settings, using real data from COMPAS and RisCanvi.
Our results indicate that \mcomment{TODO we are the champions}.
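To fix intuition for the imputation step, the sketch below shows its simplest possible variant: fit an outcome model on the labeled (bail-granted) cases and impute the missing outcomes for the rest. This naive version ignores unobserved confounding and the leniency structure and is meant only to illustrate the shape of the computation, not our framework; the function name and the use of logistic regression are illustrative assumptions.
\begin{verbatim}
import numpy as np
from sklearn.linear_model import LogisticRegression

def naive_imputation_eval(X, accepted, outcome, decisions):
    # X:         recorded features of each case
    # accepted:  1 if the past judge granted bail (outcome observed)
    # outcome:   1 if bail conditions were honored (valid where accepted == 1)
    # decisions: 1 if the evaluated system would grant bail
    model = LogisticRegression().fit(X[accepted == 1], outcome[accepted == 1])
    imputed = outcome.astype(float)
    missing = accepted == 0
    # Replace unobserved outcomes with the model's predicted probability
    # that the bail conditions would have been honored.
    imputed[missing] = model.predict_proba(X[missing])[:, 1]
    # Failure rate: fraction of cases where the evaluated system grants
    # bail but the (observed or imputed) outcome is a violation.
    return np.mean(decisions * (1.0 - imputed))
\end{verbatim}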

\todo{MM}{Summarize our experimental results}