Commit 8492da70 authored by Michael Mathioudakis's avatar Michael Mathioudakis

Make a pass over the introduction

Updated the structure of the introduction, mainly to make the problem more clear
parent de6e6f98
\section{Introduction}
\mcomment{
New outline for introduction:
\begin{itemize}
\item Digitalisation has led to many automated decisions.
\item This is quite prevalent for web services (eg. search engine results, content recommendation on media websites, product recommendation on online stores).
\item But automated, AI-made or AI-assisted decisions start to appear also in other situations --
\item for example for: credit scoring, choice of medical treatment, insurance pricing, but also judicial decisions (examples: COMPAS score; RisCanvi tool used in the Catalan prison system for risk assessment).
\item Before deploying such tools, there is also a need to evaluate their performance in real settings.
\item In practice, this would mean simulating their use over a log of past cases and measuring how well they would have performed, had they been used to replace the human decision makers or other decision mechanism currently in place.
\item Here lies a challenge: previously-made decisions affect the data on which the evaluation is performed, in a way that prevents straightforward evaluation.
\item Let us explain this with an example.
\item In some judicial systems, between a person's arrest and trial, the person may stay out of prison until trial if the judge grants them the possibility to post bail (i.e., deposit an amount of money that serves as a promise to appear in court on the day of the trial and honor some other conditions).
\item The decision of whether to grant bail or lead to jail is deemed successful if bail is granted to defendants who would honor the conditions of the bail and not to ones who would violate them.
\item Now suppose we consider an AI system with the potential to assist or replace the judge as decision maker.
\item Before the system is actually deployed, we wish to evaluate the system's performance.
\item To do that, we could experimentally ask the system to provide its decisions for past cases.
\item However, we are able to directly evaluate the system's decisions only for cases where the human judge it is supposed to replace had granted bail.
\item Why? Because if the judge had not granted bail in the first place, we do not have a chance to observe whether the defendant would have violated the bail conditions.
\item To evaluate the system's decisions in such cases, one approach would be to infer the defendant's behavior in the hypothetical case they had been granted bail.
\item Here lies another challenge: that the judge might have made their decision based on more information than appears on record (and based on which the system makes its decision).
\item For example, the judge might have witnessed anxious or aggressive behavior by the defendant during the ruling.
\item Such possibility should be taken into account when evaluating the system's performance.
\item The example above is mentioned to exemplify a general class of cases where a system is to be evaluated against data from past decisions:
\item the system is asked to make a binary decision (e.g., grant bail or not, grant a loan or not, provide medication or not) for a specific case based on a set of recorded features (for the bail-or-jail scenario, such features can be the defendant's age, gender, charged offense, criminal record, etc.);
\item the decision can be successful or not (e.g., whether the bail conditions are violated);
\item and some decisions may prevent us from directly evaluating the outcome (e.g., if the system proposes to grant bail but the judge who had decided the case denied bail).
\end{itemize}
\spara{Related Work}
\begin{itemize}
\item In the causal inference literature, such situations are said to exhibit {\it selection bias}.
\item Discussion has mainly been concentrated on recovering causal effects + model structure has usually been different (Pearl, Bareinboim etc.).
\item Pearl calls missing the outcome under an alternative decision the ``fundamental problem'' in causal inference \cite{bookofwhy}.
\item Recently, \citet{lakkaraju2017selective} referred to the problem of evaluation in such settings as the ``selective labels problem''.
\item They presented {\it contraction}, a method for evaluating decision making mechanisms in a setting where decisions are randomly assigned to decision makers with varying leniency levels.
\item The {\it contraction} technique takes advantage of the assumed random assignment and variance in leniency: essentially it measures the performance of the evaluated system using the cases of the most lenient judge.
\item We note, however, that for contraction to work, we need lenient decision makers making decisions on a large number of subjects.
\item In another recent paper, \citet{jung2018algorithmic} ...
\end{itemize}
\spara{Contributions}
\begin{itemize}
\item In this paper, we present a novel, modular framework to evaluate decision makers using selectively labeled data.
\item Our approach makes use of causal modeling to represent our assumptions about the process that generated the data and
\item uses counterfactual reasoning to impute unobserved outcomes in the data.
\item We experiment with synthetic data to highlight various properties of our approach.
\item We also perform an empirical evaluation in realistic settings, using real data from COMPAS and RisCanvi.
\end{itemize}
}
As digitalisation affects more and more aspects of life, so do automated decisions made by algorithms based on statistical models learned from data.
Such decisions are quite prevalent in web services, where algorithms decide which results a search engine returns, which news stories appear on social media platforms, and which products are recommended by online stores.
However, automated-decision systems are also appearing in other settings: credit scoring, choice of medical treatment, insurance pricing, and even judicial decisions (regarding the latter, COMPAS and RisCanvi are algorithmic tools used to assess the risk of recidivism in the US and the Catalan prison system, respectively).
Before deploying such automated-decision systems, there is a need to evaluate their performance in real settings.
In practice, this is done by simulating their deployment over a log of past cases and measuring how well they would have performed, had they been used instead of the human decision makers or other decision mechanism currently in place.
Herein lies a challenge: previously made decisions affect the data on which the evaluation is performed, in a way that prevents a straightforward assessment of performance.
Let us explain this with an example.
\hrulefill
\spara{Example}.
In some judicial systems, a person may stay out of prison between their arrest and trial if the judge grants them the possibility to post bail (i.e., to deposit an amount of money that serves as a promise to appear in court on the day of the trial and to honor other conditions).
The decision of whether to grant bail or send the defendant to jail is deemed successful if bail is granted to defendants who would honor the bail conditions and denied to those who would violate them.
Now suppose we consider an automated-decision system with the potential to assist or replace the judge as decision maker.
Before the system is actually deployed, we wish to evaluate the system's performance.
To do that, we ask the system to provide its decisions for past cases decided by judges in the judicial system.
However, we can directly evaluate the system's decisions only for cases where the human judge it is supposed to replace had granted bail.
Why? Because if the judge had not granted bail in the first place, we do not have a chance to observe whether the defendant would have violated the bail conditions.
To evaluate the system's decisions in such cases, one approach would be to infer the defendant's behavior in the hypothetical case where they had been granted bail.
Here lies another challenge: the judge might have based their decision on more information than what appears on record (which is all that the system bases its decision on).
For example, the judge might have witnessed anxious or aggressive behavior by the defendant during the ruling.
Such a possibility should be taken into account when evaluating the system's performance.
\hrulefill
\todo{MM}{Let's add a figure to describe the example and problem setting.}
The example above exemplifies a general class of cases where a system is to be evaluated against data from past decisions: the system is asked to make a binary decision (e.g., grant bail or not, grant a loan or not, provide medication or not) for a specific case, based on a set of recorded features for the case (in the bail-or-jail scenario, such features can be the defendant's age, gender, charged offense, criminal record, etc.); the decision can be successful or not (e.g., it fails if the bail conditions are violated); and some past decisions prevent us from directly evaluating the system's decisions (e.g., if the system proposes to grant bail but the judge who decided the case denied it, we cannot observe whether the defendant would have honored the bail conditions).
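To make the setting concrete, the following minimal sketch (in Python; variable names and distributions are ours and purely illustrative, not taken from any dataset used in this paper) generates selectively labeled data of this kind: a decision maker bases their decision on recorded features as well as private information, and the outcome is recorded only for positive decisions.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Recorded features X and private information Z seen only by the judge.
X = rng.normal(size=n)
Z = rng.normal(size=n)

# Probability that the defendant would violate bail depends on X and Z.
p_violate = 1 / (1 + np.exp(-(X + Z)))

# The judge grants bail (T = 1) when the perceived risk is low enough;
# the judge sees Z, but the evaluated system does not.
T = (p_violate < 0.5).astype(int)

# The outcome Y (1 = violation) is observed only if bail was granted.
Y_full = rng.binomial(1, p_violate)       # ground truth, partly unobservable
Y_obs = np.where(T == 1, Y_full, np.nan)  # the selectively labeled data
\end{verbatim}
Any evaluation that uses only the labeled cases implicitly conditions on the past decisions, which is the source of the bias discussed next.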
\spara{Related Work}
\citet{bookofwhy} refer to the problem of missing the outcome under an alternative decision as the ``fundamental problem'' in causal inference.
In the causal inference literature, such cases are said to exhibit {\it selection bias}.
Research on selection bias has mainly concentrated on recovering the structure of the generative model (i.e., the mechanism that results in bias) and on estimating causal effects (Pearl, Bareinboim, etc.).
Recently, \citet{lakkaraju2017selective} referred to the problem of evaluation in such settings as the ``{\it selective labels problem}''.
They presented {\it contraction}, a method for evaluating decision making mechanisms in a setting where decisions are randomly assigned to decision makers with varying leniency levels.
The {\it contraction} technique takes advantage of the assumed random assignment and variance in leniency: essentially it measures the performance of the evaluated system using the cases of the most lenient judge.
We note, however, that for contraction to work, we need lenient decision makers each making decisions on a large number of subjects.
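To make the idea concrete, here is a minimal sketch of a contraction-style estimator based on the description above (function and variable names are ours; this is not necessarily the exact algorithm of \citet{lakkaraju2017selective}): among the cases released by the most lenient judge, the evaluated model keeps the ones it considers least risky until the target acceptance rate is reached, and the observed failures among the kept cases are counted relative to that judge's full caseload.
\begin{verbatim}
import numpy as np

def contraction_failure_rate(risk_scores, released, failed, acceptance_rate):
    # risk_scores: the evaluated model's risk predictions for the most
    #   lenient judge's caseload; released/failed: boolean arrays with the
    #   judge's decisions and the observed outcomes (failed is only
    #   meaningful where released is True).
    n_total = len(risk_scores)
    n_keep = int(acceptance_rate * n_total)  # cases the model would release
    assert n_keep <= released.sum(), \
        "acceptance rate must not exceed the lenient judge's release rate"

    # Among the released cases (outcomes observed), keep the n_keep cases
    # the model considers least risky.
    idx_released = np.flatnonzero(released)
    kept = idx_released[np.argsort(risk_scores[idx_released])][:n_keep]

    # Estimated failure rate at the given acceptance rate.
    return failed[kept].sum() / n_total
\end{verbatim}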
In another recent paper, \citet{jung2018algorithmic} \mcomment{TODO write what they do}.
\todo{MM}{Expand and elaborate on related work.}
\spara{Our contributions}
In this paper, we present a novel, modular framework to evaluate decision makers using selectively labeled data.
Our approach makes use of causal modeling to represent our assumptions about the process that generated the data, and uses counterfactual reasoning to impute the unobserved outcomes.
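As a schematic illustration of such an evaluation (a deliberately naive sketch, not our actual method: it imputes the missing outcomes with a plain predictive model fitted on the labeled cases, and therefore ignores any unobserved information the original decision maker may have used), one could proceed as follows.
\begin{verbatim}
import numpy as np
from sklearn.linear_model import LogisticRegression

def naive_failure_rate(X, T, Y, system_release):
    # X: recorded features (2-D array); T: past decisions (1 = released);
    # Y: observed outcomes (1 = failure), valid only where T == 1;
    # system_release: the evaluated system's proposed decisions.
    model = LogisticRegression().fit(X[T == 1], Y[T == 1])
    # Use observed outcomes where available, imputed probabilities elsewhere.
    y_hat = np.where(T == 1, Y, model.predict_proba(X)[:, 1])
    # Expected failure rate if the system's decisions had been followed.
    return y_hat[system_release == 1].mean()
\end{verbatim}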
We experiment with synthetic data to highlight various properties of our approach.
We also perform an empirical evaluation in realistic settings, using real data from COMPAS and RisCanvi.
Our results indicate that \mcomment{TODO we are the champions}.
\todo{MM}{Summarize our experimental results}