\section{Setting and problem statement}
\todo{Michael}{Add identity of judge to dataset.}

The setting we consider is described in terms of two decision processes.
%
In the first one, a decision maker $H$ considers a case described by a set of features $F$ and makes a binary decision $T = T_{_H} \in \{0, 1\}$, nominally referred to as {\it positive} ($T = 1$) or not ($T = 0$).
%
Intuitively, in our bail-or-jail example of Section~\ref{sec:introduction}, $H$ corresponds to the human judge deciding whether to grant bail ($T = 1$) or not ($T = 0$).
%
The decision is followed by a binary outcome $Y = Y_{_H}$, which is nominally referred to as {\it successful} ($Y = 1$) or not ($Y = 0$).
%
An outcome can be {\it unsuccessful} ($Y = 0$) only if the decision that preceded it was positive ($T = 1$).
%
If the decision was not positive ($T = 0$), then the outcome is considered successful ($Y = 1$) by default.
%
Back in our example, the decision of the judge is unsuccessful only if the judge grants bail ($T = 1$) but the defendant violates its terms ($Y = 0$).
%
Otherwise, if the decision of the judge was to keep the defendant in jail ($T = 0$), the outcome is successful ($Y = 1$) by default, since there can be no violation.
%
Moreover, we assume that decision maker $H$ is associated with a leniency level $R$, which determines, in expectation, the fraction of cases for which they produce a positive decision.
%
Formally, for leniency level $R = r \in [0, 1]$, we have
\begin{equation}
P(T = 1 | R = r) = \sum_{X, Z} P(T = 1, X, Z | R = r) = r .
\end{equation}
The product of this process is a record $(X, T, Y)$ that contains only a subset $X \subseteq F$ of the features of the case, the decision $T$ of the judge, and the outcome $Y$ -- but leaves no trace of the remaining features $Z = F \setminus X$.
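The first decision process can be illustrated with a small simulation. The sketch below is a minimal illustration only: the Gaussian features, the additive risk score, and the logistic failure model are assumptions of ours for demonstration, not part of the setting, which fixes only the structure $(X, Z) \to T \to Y$ and the leniency constraint $P(T = 1 | R = r) = r$.

```python
import math
import random

random.seed(0)

def simulate_judge(n_cases, r):
    """Simulate judge H with leniency r; return records (x, t, y).

    Illustrative assumptions: X and Z are standard Gaussian, the judge
    ranks cases by the (partly unrecorded) risk X + Z, and released
    defendants fail with a logistic probability of that risk.
    """
    X = [random.gauss(0, 1) for _ in range(n_cases)]   # recorded feature
    Z = [random.gauss(0, 1) for _ in range(n_cases)]   # unrecorded feature
    risk = [x + z for x, z in zip(X, Z)]
    # Threshold rule releasing the r-fraction of least risky cases,
    # so that P(T = 1 | R = r) = r in expectation.
    cutoff = sorted(risk)[int(r * n_cases)]
    T = [1 if s < cutoff else 0 for s in risk]
    Y = []
    for t, s in zip(T, risk):
        if t == 0:
            Y.append(1)                        # jailed: success by default
        else:
            p_fail = 1 / (1 + math.exp(-s))    # riskier -> more likely to fail
            Y.append(0 if random.random() < p_fail else 1)
    # Each record keeps only (X, T, Y); Z leaves no trace.
    return list(zip(X, T, Y))

records = simulate_judge(10_000, r=0.5)
released = sum(t for _, t, _ in records)
print(released / len(records))                       # close to r = 0.5
print(all(y == 1 for _, t, y in records if t == 0))  # True: T = 0 implies Y = 1
```

Note that the unrecorded features $Z$ influence both the decision and the outcome but are absent from the returned records, which is precisely what will complicate the evaluation of a second decision maker later on.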
%
Intuitively, in our example, $X$ corresponds to publicly recorded information about the bail-or-jail case decided by the judge (e.g., the gender and age of the defendant) and $Z$ corresponds to features that are observed by the judge but do not appear on record (e.g., whether the defendant appeared anxious).
%
The set of records $\{(H, X, T, Y)\}$ produced by decision maker $H$ becomes part of what we refer to as the {\bf dataset} -- and the dataset may include records from more than one decision maker.
%
Figure~\ref{fig:model} shows the causal diagram that describes the operation of a single decision maker $H$.

In the second decision process, a decision maker $M$ considers a case from the dataset, described by the set of recorded features $X$, and makes its own binary decision $T = T_{_M}$ based on those features, followed by a binary outcome $Y = Y_{_M}$.
%
In our example, $M$ corresponds to an automated decision system that is considered as a replacement for the human judge in bail-or-jail decisions.
%
Notice that we assume $M$ has access only to some of the features that were available to $H$, to model cases where the system would use only the recorded features and not others that would be available to a human judge.
%
The definitions and semantics of decision $T$ and outcome $Y$ follow those of the first process and are not repeated.
%
Moreover, decision maker $M$ is also associated with a leniency level $R$, defined as before for $H$.
%
Figure~\ref{fig:machine_model} shows the causal diagram that describes the operation of decision maker $M$.
\todo{Michael}{Show diagram for machine decider, also.
The setting can be summarized in one figure.}
\note{Michael}{I changed the notation and now refer to the two decision makers as $H$ and $M$, for ``human'' and ``system'', respectively.}

\subsection{Evaluating Decision Makers}

The goodness of a decision maker is measured in terms of its failure rate {\bf FR} -- i.e., the fraction of undesired outcomes ($Y = 0$) out of all the cases for which a decision is made.
%
A good decision maker achieves as low a failure rate FR as possible.
%
Note, however, that a decision maker that always makes a negative decision ($T = 0$) has failure rate $FR = 0$, by definition.
%
For comparisons to be meaningful, we therefore compare decision makers at the same leniency level $R$.

The main challenge in estimating FR, however, is that in general the dataset does not directly provide a way to evaluate it.
%
In particular, let us consider the case where we wish to evaluate decision maker $M$ -- and suppose that $M$ is making a decision $T_{_M}$ for the case corresponding to record $(H, X, T_{_H}, Y_{_H})$.
%
Suppose also that the decision by $H$ was $T_{_H} = 0$, in which case the outcome is successful by default, $Y_{_H} = 1$.
%
If the decision by $M$ is $T_{_M} = 1$, then it is not possible to tell directly from the dataset what the outcome $Y_{_M}$ would have been in the hypothetical case where decision maker $M$'s decision had been followed in the first place.
%
The approach we take to deal with this challenge is to use counterfactual reasoning to infer $Y_{_M}$.

Ultimately, our goal is to obtain an estimate of the failure rate FR for a decision maker $M = M(r)$ that is associated with a given leniency level $R = r$:
\begin{problem}[Evaluation]
Given a dataset $\{(H, X, T, Y)\}$ and a decision maker $M$, provide an estimate of the failure rate FR.
\end{problem}
\noindent
\mcomment{I think that leniency does not need to be part of the problem formulation, since imputation allows us to evaluate a decision maker even if we do not know its leniency level.}
Typically, we would like to evaluate decision maker $M$ at various leniency levels.
%
Ideally, the estimate returned by the evaluation should also be accurate for all levels of leniency.
\todo{Michael}{Create and use macros for all main terms and mathematical quantities, so that they stay consistent throughout the paper.}
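The evaluation challenge described above can be made concrete with a short sketch. The data-generating process and the decision rule of $M$ below are hypothetical stand-ins (Gaussian features, a threshold rule on $X$), chosen only to show where the dataset runs out: whenever $M$ would release a defendant whom $H$ jailed, the outcome $Y_{_M}$ is a counterfactual that no record contains.

```python
import math
import random

random.seed(1)

# Illustrative dataset of records (x, t_h, y_h) from a human judge H,
# generated under assumed Gaussian features and a logistic failure model.
records = []
for _ in range(10_000):
    x, z = random.gauss(0, 1), random.gauss(0, 1)
    t_h = 1 if x + z < 0 else 0                       # leniency roughly r = 0.5
    y_h = 1                                           # success by default
    if t_h == 1 and random.random() < 1 / (1 + math.exp(-(x + z))):
        y_h = 0                                       # released and violated
    records.append((x, t_h, y_h))

def machine_decision(x):
    """A hypothetical decision maker M that sees only the recorded X."""
    return 1 if x < 0.0 else 0

observed_failures = unknown = 0
for x, t_h, y_h in records:
    t_m = machine_decision(x)
    if t_m == 0:
        continue                            # Y_M = 1 by default, no failure
    if t_h == 1:
        observed_failures += (y_h == 0)     # H also released: Y_M observable
    else:
        unknown += 1                        # H jailed: Y_M is counterfactual

print(observed_failures / len(records))     # failures among observable cases only
print(unknown / len(records))               # fraction of decisions with no outcome
```

The second printed quantity is exactly the gap that counterfactual inference must fill: for those cases, no naive count over the dataset can yield FR, which is why the Evaluation problem cannot be solved by direct tabulation.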