\section{Setting and problem statement}
\todo{Michael}{Add identity of judge to dataset.}

The setting we consider is described in terms of two decision processes.
%
In the first one, a decision maker $H$ considers a case described by a set of features $F$ and makes a binary decision $T = T_{_H} \in \{0, 1\}$, nominally referred to as {\it positive} ($T = 1$) or not ($T = 0$).
%
Intuitively, in our bail-or-jail example of Section~\ref{sec:introduction}, $H$ corresponds to the human judge deciding whether to grant bail ($T = 1$) or not ($T = 0$).
%
The decision is followed by a binary outcome $Y = Y_{_H}$, which is nominally referred to as {\it successful} ($Y = 1$) or not ($Y = 0$).
%
An outcome can be {\it unsuccessful} ($Y = 0$) only if the decision that preceded it was positive ($T = 1$).
%
If the decision was not positive ($T = 0$), then the outcome is considered successful ($Y = 1$) by default.
%
Back in our example, the decision of the judge is unsuccessful only if the judge grants bail ($T = 1$) but the defendant violates its terms ($Y = 0$).
%
Otherwise, if the decision of the judge was to keep the defendant in jail ($T = 0$), the outcome is successful ($Y = 1$) by default, since there can be no violation.
%
Moreover, we assume that decision maker $H$ is associated with a leniency level $R$, which determines, in expectation, the fraction of cases for which they produce a positive decision.
%
Formally, for leniency level $R = r \in [0, 1]$, we have
\begin{equation}
P(T = 1 | R = r) = \sum_{X, Z} P(T = 1, X, Z | R = r) = r .
\end{equation}
The product of this process is a record $(X, T, Y)$ that contains only a subset $X \subseteq F$ of the features of the case, the decision $T$ of the judge, and the outcome $Y$ -- but leaves no trace of the remaining features $Z = F \setminus X$.
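The first decision process can be illustrated with a small simulation. The sketch below is a minimal illustration only: the Gaussian features, the additive risk score, and the logistic failure model are assumptions of ours for demonstration, not part of the setting, which fixes only the structure $(X, Z) \to T \to Y$ and the leniency constraint $P(T = 1 | R = r) = r$.

```python
import math
import random

random.seed(0)

def simulate_judge(n_cases, r):
    """Simulate judge H with leniency r; return records (x, t, y).

    Illustrative assumptions: X and Z are standard Gaussian, the judge
    ranks cases by the (partly unrecorded) risk X + Z, and released
    defendants fail with a logistic probability of that risk.
    """
    X = [random.gauss(0, 1) for _ in range(n_cases)]   # recorded feature
    Z = [random.gauss(0, 1) for _ in range(n_cases)]   # unrecorded feature
    risk = [x + z for x, z in zip(X, Z)]
    # Threshold rule releasing the r-fraction of least risky cases,
    # so that P(T = 1 | R = r) = r in expectation.
    cutoff = sorted(risk)[int(r * n_cases)]
    T = [1 if s < cutoff else 0 for s in risk]
    Y = []
    for t, s in zip(T, risk):
        if t == 0:
            Y.append(1)                        # jailed: success by default
        else:
            p_fail = 1 / (1 + math.exp(-s))    # riskier -> more likely to fail
            Y.append(0 if random.random() < p_fail else 1)
    # Each record keeps only (X, T, Y); Z leaves no trace.
    return list(zip(X, T, Y))

records = simulate_judge(10_000, r=0.5)
released = sum(t for _, t, _ in records)
print(released / len(records))                       # close to r = 0.5
print(all(y == 1 for _, t, y in records if t == 0))  # True: T = 0 implies Y = 1
```

Note that the unrecorded features $Z$ influence both the decision and the outcome but are absent from the returned records, which is precisely what will complicate the evaluation of a second decision maker later on.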
%
Intuitively, in our example, $X$ corresponds to publicly recorded information about the bail-or-jail case decided by the judge (e.g., the gender and age of the defendant) and $Z$ corresponds to features that are observed by the judge but do not appear on record (e.g., whether the defendant appeared anxious).
%
The set of records $\{(H, X, T, Y)\}$ produced by decision maker $H$ becomes part of what we refer to as the {\bf dataset} -- and the dataset may include records from more than one decision maker.
%
Figure~\ref{fig:model} shows the causal diagram that describes the operation of a single decision maker $H$.

In the second decision process, a decision maker $M$ considers a case from the dataset, described by the set of recorded features $X$, and makes its own binary decision $T = T_{_M}$ based on those features, followed by a binary outcome $Y = Y_{_M}$.
%
In our example, $M$ corresponds to an automated decision system that is considered as a replacement for the human judge in bail-or-jail decisions.
%
Notice that we assume $M$ has access only to some of the features that were available to $H$, to model cases where the system would use only the recorded features and not others that would be available to a human judge.
%
The definitions and semantics of decision $T$ and outcome $Y$ follow those of the first process and are not repeated.
%
Moreover, decision maker $M$ is also associated with a leniency level $R$, defined as before for $H$.
%
Figure~\ref{fig:machine_model} shows the causal diagram that describes the operation of decision maker $M$.
\todo{Michael}{Show diagram for machine decider, also.
The setting can be summarized in one figure.}
\note{Michael}{I changed the notation and now refer to the two decision makers as $H$ and $M$, for ``human'' and ``system'', respectively.}

\subsection{Evaluating Decision Makers}

The goodness of a decision maker is measured in terms of its failure rate {\bf FR} -- i.e., the fraction of undesired outcomes ($Y = 0$) out of all the cases for which a decision is made.
%
A good decision maker achieves as low a failure rate FR as possible.
%
Note, however, that a decision maker that always makes a negative decision ($T = 0$) has failure rate $FR = 0$, by definition.
%
For comparisons to be meaningful, we therefore compare decision makers at the same leniency level $R$.

The main challenge in estimating FR, however, is that in general the dataset does not directly provide a way to evaluate it.
%
In particular, let us consider the case where we wish to evaluate decision maker $M$ -- and suppose that $M$ is making a decision $T_{_M}$ for the case corresponding to record $(H, X, T_{_H}, Y_{_H})$.
%
Suppose also that the decision by $H$ was $T_{_H} = 0$, in which case the outcome is successful by default, $Y_{_H} = 1$.
%
If the decision by $M$ is $T_{_M} = 1$, then it is not possible to tell directly from the dataset what the outcome $Y_{_M}$ would have been in the hypothetical case where decision maker $M$'s decision had been followed in the first place.
%
The approach we take to deal with this challenge is to use counterfactual reasoning to infer $Y_{_M}$.

Ultimately, our goal is to obtain an estimate of the failure rate FR for a decision maker $M = M(r)$ that is associated with a given leniency level $R = r$:
\begin{problem}[Evaluation]
Given a dataset $\{(H, X, T, Y)\}$ and a decision maker $M$, provide an estimate of the failure rate FR.
\end{problem}
\noindent
\mcomment{I think that leniency does not need to be part of the problem formulation, since imputation allows us to evaluate a decision maker even if we do not know its leniency level.}
Typically, we would like to evaluate decision maker $M$ at various leniency levels.
%
Ideally, the estimate returned by the evaluation should also be accurate for all levels of leniency.
\todo{Michael}{Create and use macros for all main terms and mathematical quantities, so that they stay consistent throughout the paper.}
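The evaluation challenge described above can be made concrete with a short sketch. The data-generating process and the decision rule of $M$ below are hypothetical stand-ins (Gaussian features, a threshold rule on $X$), chosen only to show where the dataset runs out: whenever $M$ would release a defendant whom $H$ jailed, the outcome $Y_{_M}$ is a counterfactual that no record contains.

```python
import math
import random

random.seed(1)

# Illustrative dataset of records (x, t_h, y_h) from a human judge H,
# generated under assumed Gaussian features and a logistic failure model.
records = []
for _ in range(10_000):
    x, z = random.gauss(0, 1), random.gauss(0, 1)
    t_h = 1 if x + z < 0 else 0                       # leniency roughly r = 0.5
    y_h = 1                                           # success by default
    if t_h == 1 and random.random() < 1 / (1 + math.exp(-(x + z))):
        y_h = 0                                       # released and violated
    records.append((x, t_h, y_h))

def machine_decision(x):
    """A hypothetical decision maker M that sees only the recorded X."""
    return 1 if x < 0.0 else 0

observed_failures = unknown = 0
for x, t_h, y_h in records:
    t_m = machine_decision(x)
    if t_m == 0:
        continue                            # Y_M = 1 by default, no failure
    if t_h == 1:
        observed_failures += (y_h == 0)     # H also released: Y_M observable
    else:
        unknown += 1                        # H jailed: Y_M is counterfactual

print(observed_failures / len(records))     # failures among observable cases only
print(unknown / len(records))               # fraction of decisions with no outcome
```

The second printed quantity is exactly the gap that counterfactual inference must fill: for those cases, no naive count over the dataset can yield FR, which is why the Evaluation problem cannot be solved by direct tabulation.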