%!TEX root = sl.tex
% The above command helps compiling in TeXShop on a Mac. Hitting typeset compiles sl.tex directly instead of producing an error here.
    
    
    \section{Setting and problem statement}
    
    \label{sec:setting}
    
    \begin{figure}
        \begin{tikzpicture}[->,>=stealth',node distance=1.5cm, semithick]
    
      \tikzstyle{every state}=[fill=none,draw=black,text=black]
    
    
      \node[state] (R)                    {$\leniency$};
      \node[state] (X) [right of=R] {$\obsFeatures$};
      \node[state] (T) [below of=X] {$\decision$};
      \node[state] (Z) [rectangle, right of=X] {$\unobservable$};
      \node[state] (Y) [below of=Z] {$\outcome$};
    
    
      \path (R) edge (T)
            (X) edge (T)
    	     edge (Y)
            (Z) edge (T)
    	     edge (Y)
            (T) edge (Y);
    \end{tikzpicture}
    \caption{The causal model of the decision makers \human and \machine. 
    
    $\leniency$ is the leniency of the decision maker, $\decision$ is a binary decision, $\outcome$ is the outcome that is selectively labeled. Background features  $\obsFeatures$ for a subject affect the decision and the outcome. Additional background features  $\unobservable$ are visible only to decision maker \human. }\label{fig:causalmodel}
    
    \end{figure}
    
    \note{Antti}{Lakkaraju had many decision makers. Can we have just one or do we run into trouble somewhere? Perhaps need to add judge(s) at places.}  
    
    
    
    %The setting we consider is described in terms of {\it two decision processes}.
    We consider data recorded from a decision making process with the following characteristics.
    
    In the process, a decision maker \human considers  a case described by a set of features \allFeatures and makes a binary decision $\decision = \decision_{_\human} \in\{0, 1\}$, nominally referred to as {\it positive} ($\decision = 1$) or {\it negative} ($\decision = 0$).
    
    Intuitively, in our bail-or-jail example of Section~\ref{sec:introduction}, \human corresponds to the human judge deciding whether to grant bail ($\decision = 1$) or not ($\decision = 0$).
    
The decision is followed by a binary outcome $\outcome = \outcome_{_\human}$, which is nominally referred to as {\it successful} ($\outcome = 1$) or {\it unsuccessful} ($\outcome = 0$).
    
    An outcome can be {\it unsuccessful} ($\outcome = 0$) only if the decision that preceded it was positive ($\decision = 1$).
    
    If the decision was not positive ($\decision = 0$), then the outcome is considered by default successful ($\outcome = 1$).
    
Back in our example, the outcome is unsuccessful only if the judge grants bail ($\decision = 1$) but the defendant violates its terms ($\outcome = 0$).
    
    Otherwise, if the decision of the judge was to keep the defendant in jail ($\decision = 0$), the outcome is by default successful ($\outcome = 1$) since there can be no bail violation.
    
    Moreover, we assume that decision maker \human is associated with a leniency level $\leniency$, which determines the fraction of cases for which they produce a positive decision, in expectation. 
    
    %Formally, for leniency level $\leniency = r\in [0, 1]$, we have
    %\begin{equation}
    %	P(\decision = 1 | \leniency = \leniencyValue) = \sum_{\allFeatures} P(\decision = 1, \allFeatures~|~\leniency = \leniencyValue) = \leniencyValue .
    %\end{equation}
    %Antti I think this formula is mostly misleading
    
The product of this process is a record $(\human, \obsFeatures, \decision, \outcome)$ that contains only a subset $\obsFeatures\subseteq \allFeatures$ of the features of the case, the decision $\decision$ of the decision maker, and the outcome $\outcome$ -- but leaves no trace of the remaining features $\unobservable = \allFeatures \setminus \obsFeatures$.
    
    Intuitively, in our example, $\obsFeatures$ corresponds to publicly recorded information about the bail-or-jail case decided by the judge (e.g., the gender and age of the defendant) and $\unobservable$ corresponds to features that are observed by the judge but do not appear on record (e.g., whether the defendant appeared anxious in court).
    
The set of records $\dataset = \{(\human, \obsFeatures, \decision, \outcome)\}$ produced by decision maker \human becomes part of what we refer to as the {\bf dataset} -- and the dataset may include records from more than one decision maker.
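As an illustration, the data-generating process described above can be sketched in code. All functional forms below (the risk score, the failure probability, the thresholding rule) are assumptions made purely for the sake of the example, not part of the model:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_records(n=10_000, leniency=0.5):
    """Illustrative sketch of the causal model in Figure 1 (assumed
    functional forms).  X: recorded features, Z: features seen only by
    the judge, T: binary decision, Y: selectively labeled outcome."""
    X = rng.normal(size=n)                 # recorded background features
    Z = rng.normal(size=n)                 # unrecorded background features
    risk = X + Z                           # judge's perceived risk of failure
    # Leniency r: the judge releases the fraction r of lowest-risk cases,
    # so P(T = 1) = r in expectation.
    T = (risk <= np.quantile(risk, leniency)).astype(int)
    p_fail = 1.0 / (1.0 + np.exp(-risk))   # failure more likely at high risk
    # Selective labeling: Y = 0 is possible only when T = 1;
    # Y = 1 by default whenever T = 0.
    Y = np.where(T == 1, (rng.random(n) >= p_fail).astype(int), 1)
    # The dataset keeps only (judge id, X, T, Y); Z leaves no trace.
    return list(zip(["H"] * n, X, T, Y))

records = simulate_records()
```

Note that Z is generated and used by the simulated judge but never stored in the returned records, mirroring how \unobservable is omitted from the dataset.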
    
    Figure~\ref{fig:causalmodel} shows the causal diagram that describes the operation of a single decision-maker \human.
    
    
    
    %\note{Riku}{Defining decision-maker \machine to give out binary decisions makes it difficult to define contraction. Contraction takes probabilistic predictions for negative outcome in interval [0, 1] as input. Then it computes new failure rate estimate based on the observed outcomes of subjects assigned to the most lenient decision-maker.}  
    
    Based on the recorded data, we wish to evaluate a decision maker \machine that considers a case from the dataset -- and makes its own binary decision $\decision = \decision_{_\machine}$ based on recorded features $\obsFeatures$, followed by a binary outcome $\outcome = \outcome_{_\machine}$.
    
    In our example, \machine corresponds to a machine-based automated-decision system that is considered for replacing the human judge in bail-or-jail decisions.
    
    % Notice that we assume \machine has access only to some of the features that were available to \human, to model cases where the system would use only the recorded features and not other ones that would be available to a human judge.
    
The definitions and semantics of decision $\decision$ and outcome $\outcome$ are the same as those described above for decision maker \human.
    
    Moreover, decision maker \machine is also associated with a leniency level $\leniency$, defined as before for \human.
    
    The causal diagram for decision-maker \machine is the same as that for \human (Figure~\ref{fig:causalmodel}), except that \machine does not observe variables $\unobservable$.
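A minimal sketch of such a decision maker \machine, which sees only the recorded features, is given below. The risk score used here (the raw value of X) is a hypothetical stand-in for whatever predictive model \machine would actually employ:

```python
import numpy as np

rng = np.random.default_rng(2)

def machine_decide(X, leniency=0.5):
    """Hypothetical machine decision maker M.  Unlike the human judge,
    M observes only the recorded features X (never Z), and releases the
    fraction `leniency` of cases it scores as lowest-risk."""
    X = np.asarray(X, dtype=float)
    risk = X                               # M's risk estimate depends on X only
    return (risk <= np.quantile(risk, leniency)).astype(int)

T_M = machine_decide(rng.normal(size=1000), leniency=0.3)
```

By construction, \machine has the same thresholding semantics for leniency as \human, but its decisions can disagree with the judge's because it cannot condition on the unobserved features.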
    
    %\subsection{Evaluating decision makers}
    
    The quality of a decision maker is measured in terms of its {\bf failure rate} \failurerate -- i.e., the fraction of undesired outcomes ($\outcome=0$) out of all the cases for which a decision is made. 
    
    A good decision maker achieves as low failure rate \failurerate  as possible.
    
Note, however, that a decision maker that always makes a negative decision $\decision=0$ has failure rate $\failurerate = 0$ by definition.
    
    For comparisons to be meaningful, we compare decision makers at the same leniency level $\leniency$.
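The failure rate, and the degenerate always-negative decision maker that motivates comparing at fixed leniency, can be sketched as:

```python
import numpy as np

def failure_rate(Y):
    """Failure rate FR: the fraction of undesired outcomes (Y = 0) among
    all cases for which a decision was made.  Because Y = 1 by default
    whenever the decision was negative, only positive decisions can
    contribute failures."""
    return float(np.mean(np.asarray(Y) == 0))

# A decision maker that always decides T = 0 gets Y = 1 by default on
# every case, so it trivially achieves FR = 0 -- which is why decision
# makers are only compared at the same leniency level r.
assert failure_rate(np.ones(5, dtype=int)) == 0.0
```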
    
    The main challenge in estimating \failurerate is that in general the dataset does not directly provide a way to evaluate \failurerate. 
    
    In particular, let us consider the case where we wish to evaluate decision maker \machine\ -- and suppose that \machine is making a decision $\decision_{_\machine}$ for the case corresponding to record $(\human, \obsFeatures, \decision_{_\human}, \outcome_{_\human})$, based on the recorded features \obsFeatures.
    
Suppose also that the decision by \human was $\decision_{_\human} = 0$, in which case the outcome is by default successful, $\outcome_{_\human} = 1$.
    
    If the decision by \machine is $\decision_{_\machine} = 1$, then it is not possible to tell directly from the dataset what its outcome $\outcome_{_\machine}$ would be.
    
The approach we take to deal with this challenge is to use counterfactual reasoning to infer $\outcome_{_\machine}$ (see Section~\ref{sec:imputation} below).
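The following sketch (with randomly generated decisions, purely for illustration) identifies the records for which $\outcome_{_\machine}$ cannot be read off the data: those where \machine makes a positive decision but \human did not, so no true outcome was ever recorded:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
t_H = rng.integers(0, 2, n)                           # judge's recorded decisions
y_H = np.where(t_H == 0, 1, rng.integers(0, 2, n))    # selective labeling of outcomes
t_M = rng.integers(0, 2, n)                           # hypothetical machine decisions

# M's outcome is directly available only when either M also decides
# t = 0 (outcome is 1 by default) or the judge released the subject
# (t_H = 1), so the true outcome was recorded in the data.
directly_evaluable = (t_M == 0) | (t_H == 1)
print(f"{np.mean(~directly_evaluable):.0%} of cases need counterfactual inference")
```

The remaining cases, where $\decision_{_\machine} = 1$ but $\decision_{_\human} = 0$, are exactly those for which the outcome must be inferred counterfactually.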
    
    Ultimately, our goal is to obtain an estimate of the failure rate \failurerate for a decision maker \machine.
    
    \begin{problem}[Evaluation]
    
Given a dataset $\{(\human, \obsFeatures, \decision, \outcome)\}$ and a decision maker \machine, provide an estimate of the failure rate \failurerate at a given leniency level $\leniency = \leniencyValue$.
    
    \label{problem:the}
    
    \end{problem}
    \noindent
    
    %Sometimes, we may have control over the leniency level of the decision maker we evaluate.
    
    %In such cases, we would like to evaluate decision maker $\machine = \machine(\leniency = \leniencyValue)$ at various leniency levels $\leniency$.
    
    %Ideally, the estimate returned by the evaluation should also be accurate for all levels of leniency.