\section{Introduction} 

\mcomment{
	New outline for introduction:
	\begin{itemize}
		\item Digitalisation has led to many automated decisions.
		\item This is quite prevalent for web services (e.g., search engine results, content recommendation on media websites, product recommendation on online stores).
		\item But automated, AI-made, or AI-assisted decisions are starting to appear in other situations as well --
		\item for example, credit scoring, choice of medical treatment, insurance pricing, and also judicial decisions (examples: the COMPAS score; the RisCanvi tool used for risk assessment in the Catalan prison system).
		\item Before deploying such tools, there is also a need to evaluate their performance in real settings.
		\item In practice, this would mean simulating their use over a log of past cases and measuring how well they would have performed, had they been used to replace the human decision makers or other decision mechanism currently in place.
		\item Here lies a challenge: previously-made decisions affect the data on which the evaluation is performed, in a way that prevents straightforward evaluation.
		\item Let us explain this with an example.
		\item In some judicial systems, between a person's arrest and trial, the person may stay out of prison until trial if the judge grants them the possibility to post bail (i.e., deposit an amount of money that serves as a promise to appear in court on the day of the trial and honor some other conditions).
		\item The decision of whether to grant bail or send the defendant to jail is deemed successful if bail is granted to defendants who would honor its conditions and denied to those who would violate them.
		\item Now suppose we consider an AI system with the potential to assist or replace the judge as decision maker.
		\item Before the system is actually deployed, we wish to evaluate the system's performance.
		\item To do that, we could experimentally ask the system to provide its decisions for past cases.
		\item However, we are able to directly evaluate the system's decisions only for cases where the human judge it is supposed to replace had granted bail.
		\item Why? Because if the judge had not granted bail in the first place, we do not have a chance to observe whether the defendant would have violated the bail conditions.
		\item To evaluate the system's decisions in such cases, one approach would be to infer the defendant's behavior in the hypothetical case they had been granted bail.
		\item Here lies another challenge: the judge might have based their decision on more information than what appears on record (which is all the system has to base its decision on).
		\item For example, the judge might have witnessed anxious or aggressive behavior by the defendant during the ruling.
		\item Such a possibility should be taken into account when evaluating the system's performance.
		\item The example above illustrates a general class of cases where a system is to be evaluated against data from past decisions:
		\item the system is asked to make a binary decision (e.g., grant bail or not, grant a loan or not, provide medication or not, etc.) for a specific case based on a set of recorded features (for the bail-or-jail scenario, such features can be the defendant's age, gender, charged offense, criminal record, etc.);
		\item the decision can be successful or not (e.g., whether the bail conditions are violated);
		\item and some decisions may prevent us from directly evaluating the outcome (e.g., if the system proposes to grant bail but the judge who had decided the case denied bail); a rough formalization of this setting is sketched right after this list.
	\end{itemize}
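
	A rough formalization of the setting described above (a sketch only; the notation here is illustrative and not necessarily the one adopted later in the paper): each case carries recorded features $x$ and possibly unrecorded information $z$; the incumbent decision maker issues a binary decision $t \in \{0, 1\}$ (e.g., $t = 1$ for granting bail); and a binary outcome $y \in \{0, 1\}$ (e.g., $y = 1$ if the bail conditions are honored) is recorded only when $t = 1$. The available data thus take the form
	\[
		\{\, (x_i, t_i, y_i) : t_i = 1 \,\} \;\cup\; \{\, (x_i, t_i) : t_i = 0 \,\},
	\]
	which is exactly what prevents a straightforward evaluation of an alternative decision maker.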

	\spara{Related Work}
	\begin{itemize}
		\item In the causal inference literature, such situations are said to exhibit {\it selection bias}.
		\item Discussion has mainly concentrated on recovering causal effects, and the model structures considered have usually been different from the one considered here (Pearl, Bareinboim, etc.).
		\item Pearl calls the fact that the outcome under the alternative decision is never observed the 'fundamental problem' of causal inference \cite{bookofwhy}.
		\item Recently, \citet{lakkaraju2017selective} referred to the problem of evaluation in such settings as the 'selective labels problem'.
		\item They presented {\it contraction}, a method for evaluating decision making mechanisms in a setting where cases are randomly assigned to decision makers with varying leniency levels.
		\item The {\it contraction} technique takes advantage of the assumed random assignment and the variance in leniency: essentially, it measures the performance of the evaluated system using the cases of the most lenient judge (a rough sketch is given after this list).
		\item We note, however, that for contraction to work, we need lenient decision makers making decisions on a large number of subjects.
		\item In another recent paper, \citet{jung2018algorithmic} ...
	\end{itemize}
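
	A rough sketch of how contraction works, as we read \citet{lakkaraju2017selective} (their notation and details take precedence over this summary): let $q$ be the most lenient decision maker, with acceptance rate $r_q$, let $\mathcal{D}_q$ be the cases assigned to $q$, and let $\mathcal{R}_q \subseteq \mathcal{D}_q$ be the cases $q$ released, i.e., those with observed outcomes. To evaluate the system at acceptance rate $r \leq r_q$, contraction removes from $\mathcal{R}_q$ the $(r_q - r)\,|\mathcal{D}_q|$ cases that the system deems riskiest, obtaining a set $\mathcal{R}_B$, and estimates the failure rate as
	\[
		\hat{u}(r) \;=\; \frac{|\{\, i \in \mathcal{R}_B : y_i = 0 \,\}|}{|\mathcal{D}_q|}.
	\]
	The estimate is sensible to the extent that the cases the system would release at rate $r$ are (approximately) contained among those the most lenient decision maker actually released -- hence the requirement for a very lenient decision maker with a large caseload.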

	\spara{Contributions}
	\begin{itemize}
		\item In this paper, we present a novel, modular framework to evaluate decision makers using selectively labeled data. 
		\item Our approach makes use of causal modeling to represent our assumptions about the process that generated the data, and uses counterfactual reasoning to impute unobserved outcomes in the data (an illustrative sketch follows this list).
		\item We experiment with synthetic data to highlight various properties of our approach.
		\item We also perform an empirical evaluation in realistic settings, using real data from COMPAS and RisCanvi.
	\end{itemize}
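
	To convey the idea of the counterfactual imputation mentioned above (an illustrative sketch only; the actual estimator and its assumptions are developed in the body of the paper): for cases where the recorded decision was positive we can use the observed outcome, while for the remaining cases we impute the outcome that the assumed causal model predicts had bail been granted, conditioning also on the negative decision itself, since it may reflect unrecorded information $z$. In the notation of the sketch above,
	\[
		\hat{y}_i \;=\; y_i \quad \mbox{if } t_i = 1,
		\qquad\qquad
		\hat{y}_i \;=\; \mathrm{E}\!\left[\, y_i \mid x_i,\, t_i = 0 \,\right] \quad \mbox{if } t_i = 0,
	\]
	where the expectation is the model's prediction of the would-be outcome under bail, given the recorded features and the negative decision. The evaluated decision maker's failure rate is then computed over the cases it would release, using $\hat{y}_i$ in place of the partially missing $y_i$.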
}

\todo{MM}{Expand and elaborate on related work.}