\documentclass[sigconf,anonymous]{acmart}
% \documentclass[sigconf]{acmart}
% Packages
\usepackage{type1cm} % type1 computer modern font
\usepackage{graphicx} % advanced figures
\usepackage{xspace} % fix space in macros
\usepackage{balance} % to better equalize the last page
\usepackage{multirow} % multi rows for tables
\usepackage[font={bf}, tableposition=top]{caption} % captions on top for tables
\usepackage{bold-extra} % bold + {small capital, italic}
\usepackage{siunitx} % \num for decimal grouping
\usepackage[vlined,linesnumbered,ruled,noend]{algorithm2e} % algorithms
\usepackage{booktabs} % nicer tables
%\usepackage[hyphens]{url} % handle long urls
%\usepackage[bookmarks, pdftex, colorlinks=false]{hyperref} % clickable references
%\usepackage[square,numbers]{natbib} % better references
\usepackage{microtype} % compress text
\usepackage{units} % nicer slanted fractions
\usepackage{mathtools} % amsmath++
%\usepackage{amssymb} % math symbols
%\usepackage{amsmath}
\usepackage{relsize}
\usepackage{caption}
\captionsetup{belowskip=6pt,aboveskip=2pt} % to save space.
%\usepackage{subcaption}
% \usepackage{multicolumn}
\usepackage[]{inputenc}
\usepackage{xfrac}
\RequirePackage{graphicx,color}
\usepackage[font={small}]{subfig} % subfig, 4 figures in a row
\usepackage{pifont}
\usepackage{footnote} % show footnotes in tables
\makesavenoteenv{table}
\newcommand{\ourtitle}{A Causal Approach for Selective Labels}
\input{macros}
\title{\ourtitle}
\author{Michael Mathioudakis}
\affiliation{%
\institution{University of Helsinki}
\city{Helsinki}
\country{Finland}
}
\email{michael.mathioudakis@helsinki.fi}
\begin{abstract}
We show how a causality-based approach can be used to estimate the performance of prediction algorithms in `selective labels' settings -- with particular application to `bail-or-jail' judicial decisions.
\end{abstract}
\begin{document}
\fancyhead{}
\maketitle
\renewcommand{\shortauthors}{Authors}
\section{Introduction}
`Selective labels' settings arise in situations where data are the product of a decision mechanism that prevents us from observing certain variables for part of the data.
A typical example is that of bail-or-jail decisions in judicial settings: a judge decides whether to grant bail to a defendant based on whether the defendant is considered likely to violate bail conditions while awaiting trial -- and therefore a violation might occur only in case bail is granted.
Such settings give rise to questions about the effect of alternative decision mechanisms -- e.g., `how many defendants would violate bail conditions if more bail decisions were granted?'.
In other words, one faces the challenge of estimating the performance of an alternative, potentially automated, decision policy that might make different decisions than the one reflected in the judicial data.
The challenge was addressed by Lakkaraju et al.~\cite{lakkaraju2017selective}, in a setting that involved multiple judges of varying leniency, under the assumption that defendants are assigned to judges randomly. Lakkaraju et al. estimate the performance of an automated decision-making algorithm (`algorithm', for short) via a technique they call `contraction', which proceeds as follows (a code sketch is given after the list):
\begin{itemize}
\item It considers a set of judges with the same number $N$ of judged defendants each.
\item Judges are ordered from most lenient (most bail decisions) to least lenient.
Let $n_i$ be the number of bail decisions for judge $\#i$. We have $n_{i+1} \leq n_i$.
\item The algorithm considers the $n_i$ defendants that were granted bail by the $i$-th judge.
\item Among them, it jails the $n_i - n_{i+1}$ defendants it finds most likely to violate bail and grants bail to the remaining $n_{i+1}$.
\item Its performance is measured as the number of defendants it grants bail to but who, according to the data, eventually violated the bail.
\item This performance is compared to that of judge $\#(i+1)$, based on the cases that judge bailed.
\end{itemize}
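For concreteness, the following is a minimal Python sketch of one contraction step, under our reading of the procedure above; all variable names are ours, not Lakkaraju et al.'s.
\begin{verbatim}
import numpy as np

def contraction_step(risk, violated, n_next):
    """Sketch of contraction for one pair of adjacent judges.

    risk     : predicted violation probabilities for the n_i
               defendants bailed by judge #i
    violated : 1 if the defendant violated bail in the data, else 0
    n_next   : n_{i+1}, the number of bail decisions of judge #(i+1)
    """
    risk, violated = np.asarray(risk), np.asarray(violated)
    order = np.argsort(risk)       # ascending predicted risk
    bailed = order[:n_next]        # jail the riskiest, bail the rest
    return violated[bailed].sum()  # bailed defendants who violated
\end{verbatim}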
The above procedure gives us a comparison between the performance of the algorithm and that of judges at the $n_{i+1}/N$ leniency level (leniency measured as the rate of bail decisions).
A major drawback of the {\it contraction} technique is that it requires the data to include judges at the leniency level of interest.
In this document, we describe a different approach, based on causal analysis, that allows us to estimate the performance of a decision-making system at any leniency level.
\section{Setting}
Consider a judge who decides whether to grant bail to a defendant based on whether the defendant is considered likely to violate bail conditions while awaiting trial.
We use variable \decision to store the outcome of the bail-or-jail decision, with $\decision = 1$ denoting a bail decision and $\decision = 0$ a jail decision.
Whether the defendant violates the bail conditions depends on the bail-or-jail decision \decision and the features \features of the defendant.
The decision is based on the following variables: first, the features \features of the defendant, which we assume to be observed; second, the leniency of the judge, expressed as a variable \leniency.
Specifically, we assume that every judge evaluates a given candidate according to the probability
\[
\prob{\outcome = 0 | \features = \featuresValue, \doop{\decision = 1}}
\]
that the candidate will violate bail conditions ($\outcome = 0$) if they were granted bail.
We write $\outcome = 1$ to refer to the case where the defendant does not violate bail, whether because they comply with the bail conditions or because bail is not granted; $\outcome = 0$ denotes a violation.
The \doop{condition} expression signifies that, in evaluating the probability, we consider the event where the condition (here, $\decision = 1$) is imposed on the data-generation process (and therefore alters the generative model).
In addition, we assume that every judge assigns the same value to the above probability, given by a function \score{\featuresValue}:
\[
\score{\featuresValue} = \prob{\outcome = 0 | \features = \featuresValue, \doop{\decision = 1}}
\]
The assumption that, essentially, all judges have the same model for the probability that a defendant would violate bail is not far-fetched for the purposes of our analysis, particularly taking into account that \score{\featuresValue} can be learned from the observed data
\[
\prob{\outcome = 0 | \features = \featuresValue, \doop{\decision = 1}} = \prob{\outcome = 0 | \features = \featuresValue, \decision = 1}
\]
and that data are publicly accessible, allowing us to assume that all judges have access to the same information.
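Since the interventional probability coincides with the conditional one on bailed cases, \score{\featuresValue} can be fit with any probabilistic classifier trained on the cases with $\decision = 1$ only. A minimal sketch follows; the use of scikit-learn and logistic regression is our choice for illustration, not prescribed by the method.
\begin{verbatim}
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_score(X, T, Y):
    # X: features; T: decisions; Y: outcomes (observed only when T = 1)
    bailed = T == 1
    clf = LogisticRegression()
    clf.fit(X[bailed], (Y[bailed] == 0).astype(int))
    # score(x) = P(Y = 0 | X = x, T = 1), learned from bailed cases only
    return lambda X_new: clf.predict_proba(X_new)[:, 1]
\end{verbatim}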
Where judges {\it do differ} is at the level of their leniency \leniency.
Following the above assumptions, a judge with leniency \leniency = \leniencyValue grants bail to the defendants for which $F(\featuresValue) < \leniencyValue$, where $F$ is the cumulative distribution function given below.
\begin{equation}
F(\featuresValue_0) = \int \indicator{\prob{\outcome = 0| \decision = 1, \features = \featuresValue} > \prob{\outcome = 0| \decision = 1, \features = \featuresValue_0}} \, d\prob{\featuresValue}
\end{equation}
which should be equal to
\begin{equation}
F(\featuresValue_0) = \int \prob{\featuresValue} \, \indicator{\prob{\outcome = 0| \decision = 1, \features = \featuresValue} > \prob{\outcome = 0| \decision = 1, \features = \featuresValue_0}} \, d\featuresValue
\end{equation}
\note[RL]{
Should the inequality be reversed? With some derivations
\begin{equation}
F(\featuresValue_0) = \int {\prob{\featuresValue} \indicator{\score{\featuresValue} > \score{\featuresValue_0} } ~ d\featuresValue}
\end{equation}
}
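In practice, $F$ can be replaced by the empirical distribution of learned scores over the dataset. A minimal sketch follows, with the comparison direction as written above; it should be flipped if the reversal suggested in the note turns out to be correct.
\begin{verbatim}
import numpy as np

def empirical_F(scores):
    # Empirical version of F: the fraction of observed defendants
    # whose score exceeds score(x_0), as in the indicator above.
    scores = np.asarray(scores)
    return lambda s0: np.mean(scores > s0)
\end{verbatim}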
The bail-or-jail scenario is just one example of settings that involve a decision $\decision \in\{0,1\}$ that is based on individual features \features and leniency (acceptance rate) \leniency -- and where a behavior of interest \outcome is observed only for the cases where \decision = 1.
The diagram of the causal model is shown in Figure~\ref{fig:causalmodel}.
Our results are applicable to other scenarios with the same causal model.
\begin{figure}
\begin{center}
\includegraphics[width=\columnwidth]{img/causalmodel.png}
\end{center}
\caption{Causal model.}
\label{fig:causalmodel}
\end{figure}
\subsection{Analysis Task}
We will use existing machine-learning techniques from the literature to learn the function \score{\featuresValue}, with the goal of building a decision system that outperforms judges.
The challenge we face is to estimate accurately the performance of the decision system -- given that we are in a `selective labels' setting.
Performance is measured {\it for a given leniency level} as the rate at which bail is granted {\it and} the defendant violates it.
In other words, performance is measured as the probability that a decision leads to an undesired outcome.
We wish to calculate the probability of an undesired outcome ($\outcome = 0$) at a fixed leniency level.
\begin{align*}
& \prob{\outcome = 0 | \doop{\leniency = \leniencyValue}} \\
& = \sum_\decisionValue \prob{\outcome = 0, \decision = \decisionValue | \doop{\leniency = \leniencyValue}} \\
& = \prob{\outcome = 0, \decision = 0 | \doop{\leniency = \leniencyValue}} + \prob{\outcome = 0, \decision = 1 | \doop{\leniency = \leniencyValue}} \\
& = \prob{\outcome = 0, \decision = 1 | \doop{\leniency = \leniencyValue}} \\
& = \sum_\featuresValue \prob{\outcome = 0, \decision = 1, \features = \featuresValue | \doop{\leniency = \leniencyValue}} \\
& = \sum_\featuresValue \prob{\outcome = 0, \decision = 1 | \doop{\leniency = \leniencyValue}, \features = \featuresValue} \prob{\features = \featuresValue | \doop{\leniency = \leniencyValue}} \\
& = \sum_\featuresValue \prob{\outcome = 0, \decision = 1 | \doop{\leniency = \leniencyValue}, \features = \featuresValue} \prob{\features = \featuresValue} \\
& = \sum_\featuresValue \prob{\outcome = 0 | \decision = 1, \doop{\leniency = \leniencyValue}, \features = \featuresValue} \prob{\decision = 1 | \doop{\leniency = \leniencyValue}, \features = \featuresValue} \prob{\features = \featuresValue} \\
& = \sum_\featuresValue \prob{\outcome = 0 | \decision = 1, \features = \featuresValue} \prob{\decision = 1 | \leniency = \leniencyValue, \features = \featuresValue} \prob{\features = \featuresValue}
\end{align*}
The third equality drops the $\decision = 0$ term because a jailed defendant cannot violate bail, so $\prob{\outcome = 0, \decision = 0 | \doop{\leniency = \leniencyValue}} = 0$; the last equality uses the causal model: \leniency has no parents, so intervening on it is equivalent to conditioning on it, and \outcome is independent of \leniency given \decision and \features.
Instantiating the above derivation with the model \score{\featuresValue} learned from the data
\[
\score{\featuresValue} = \prob{\outcome = 0 | \features = \featuresValue, \decision = 1},
\]
the {\it generalized performance} \generalPerformance of that model is given by the following formula.
\begin{equation}
\generalPerformance = \sum_\featuresValue \score{\featuresValue} \indicator{F(\featuresValue) < r} \prob{\features = \featuresValue}
\label{eqn:gp}
\end{equation}
Equation~\ref{eqn:gp} can be calculated for a given model \datadistr{\featuresValue} = \prob{\features = \featuresValue} of individual features.
Alternatively, we can have an empirical measure \empiricalPerformance of performance over the $\datasize$ data points in dataset \dataset, given by the following equation.
\begin{equation}
\empiricalPerformance = \frac{1}{\datasize} \sum_{(\featuresValue, \outcomeValue)\in\dataset} \score{\featuresValue} \indicator{F(\featuresValue) < r}
\label{eqn:ep}
\end{equation}
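Combining a fitted score with the empirical $F$ gives a direct estimator of \empiricalPerformance; the following is a minimal sketch with names of our own choosing.
\begin{verbatim}
import numpy as np

def empirical_performance(scores, r):
    # scores: score(x) for the n data points; r: target leniency.
    scores = np.asarray(scores)
    F = np.array([np.mean(scores > s) for s in scores])  # empirical CDF
    # Average score over the cases that would be bailed at leniency r.
    return np.mean(scores * (F < r))
\end{verbatim}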
\subsection{Comments}
Roughly speaking, the above formulas should work well if `bail' cases (\decision = 1) cover the area spanned by the observed features of defendants well -- i.e., if there are no large regions of \features with no or too few bail cases.
If there are such areas, then we cannot do much about the lack of data.
One reasonable modeling choice, however, is to impose the following priors on \score{\featuresValue} (a smoothing sketch is given after the list):
\begin{enumerate}
\item $\score{\featuresValue} \approx 1$ for areas near values of \features for which we have observed data but few bail decisions (i.e., we assume a-priori that a defendant is more likely to violate bail -- a belief that will change if the data tell us otherwise);
\item $\score{\featuresValue} \approx 0$ for areas near unobserved values of \features (i.e., we assume that people who are unlikely to ever be taken to court would probably `play nice' and not violate bail).
\end{enumerate}
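One concrete way to realize these priors, offered only as a hedged sketch: a kernel-weighted local estimate of \score{\featuresValue}, shrunk toward a region-dependent prior mean via Beta pseudo-counts. The kernel, bandwidth, and all names are our assumptions, not part of the method above.
\begin{verbatim}
import numpy as np

def smoothed_score(x0, X_bailed, violated, prior_mean,
                   strength=2.0, h=0.5):
    # prior_mean ~ 1 near observed features with few bail decisions,
    # prior_mean ~ 0 near unobserved features (the priors above).
    w = np.exp(-((X_bailed - x0) ** 2).sum(axis=1) / (2 * h ** 2))
    n_eff = w.sum()               # effective local sample size
    k_eff = (w * violated).sum()  # effective local violations
    # Posterior mean under Beta(strength * prior_mean,
    # strength * (1 - prior_mean)) pseudo-counts.
    return (k_eff + strength * prior_mean) / (n_eff + strength)
\end{verbatim}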
Lack of data for large areas of \features is a potential problem for the {\it contraction} technique of Lakkaraju et al., as well.
Unlike contraction, though, our approach does not require data at any particular leniency level.
Moreover, it is easy to see from the derivation of Eq.~\ref{eqn:gp} that our approach would work identically in the case where defendants are not assigned to judges at random (i.e., if there were a causal relation $\features\rightarrow\leniency$).
\section{Results}
Below we present our results in various settings.
\subsection{Without unobservables}
The causal model for this scenario corresponds to that depicted in Figure \ref{fig:causalmodel}.
For the analysis, we randomly assigned 500 subjects to each of 100 judges.
Every judge's leniency rate $\leniency$ was sampled uniformly from the half-open interval $[0.1, 0.9)$.
Private features $\features$ were defined as i.i.d.\ standard Gaussian random variables.
Next, the probability of a negative result ($\outcome = 0$) was calculated as
\[
\prob{\outcome = 0| \features = \featuresValue} = \dfrac{1}{1+\exp\{-\featuresValue\}} = p_{y_0}
\]
and consequently $\outcome \sim \text{Bernoulli}(1 - p_{y_0})$.
The decision variable $\decision$ was set to $0$ if the value $p_{y_0}$ resided in the top $(1-\leniencyValue)\cdot 100\%$ of the subjects assigned to that judge.
Results for estimating the causal quantity $\prob{\outcome = 0 | \doop{\leniency = \leniencyValue}}$ with various levels of leniency $\leniencyValue$ are presented in Figure \ref{fig:without_unobservables}.
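For reference, the following is a sketch of this synthetic setup; the random seed is arbitrary, and the exact code behind the figure may differ.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n_judges, n_subjects = 100, 500

x = rng.normal(size=(n_judges, n_subjects))  # i.i.d. standard Gaussians
r = rng.uniform(0.1, 0.9, size=n_judges)     # leniency per judge
p_y0 = 1.0 / (1.0 + np.exp(-x))              # P(Y = 0 | X = x)
y = rng.binomial(1, 1.0 - p_y0)              # Y ~ Bernoulli(1 - p_y0)

# Jail (T = 0) the top (1 - r) fraction of p_y0 on each judge's docket.
t = np.ones((n_judges, n_subjects), dtype=int)
for j in range(n_judges):
    n_jail = int(round((1.0 - r[j]) * n_subjects))
    riskiest = np.argsort(p_y0[j])[::-1][:n_jail]
    t[j, riskiest] = 0
\end{verbatim}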
\begin{figure}
\begin{center}
\includegraphics[width=\columnwidth]{img/without_unobservables.png}
\end{center}
\caption{$\prob{\outcome = 0 | \doop{\leniency = \leniencyValue}}$ with varying levels of acceptance rate. Error bars denote standard error of the mean.}
\label{fig:without_unobservables}
\end{figure}