\section{Setting and problem statement}
The setting we consider is described in terms of two decision processes.
%
In the first one, a decision maker $H$ considers a case described by a set of features $F$ and makes a binary decision $T\in\{0, 1\}$, nominally referred to as {\it positive} ($T = 1$) or not ($T = 0$).
%
Intuitively, in our bail-or-jail example of Section~\ref{sec:introduction}, $H$ corresponds to the human judge deciding whether to grant bail ($T = 1$) or not ($T = 0$).
%
The decision is followed by a binary outcome $Y$, which is nominally referred to as {\it successful} ($Y = 1$) or not ($Y = 0$).
%
An outcome can be {\it unsuccessful} ($Y = 0$) only if the decision that preceded it was positive ($T = 1$).
%
If the decision was not positive ($T = 0$), then the outcome is considered by default successful ($Y = 1$).
%
Back in our example, the decision of the judge is unsuccessful only if the judge grants bail ($T = 1$) but the defendant violates its terms ($Y = 0$).
%
Otherwise, if the decision of the judge was to keep the defendant in jail ($T = 0$), the outcome is by default successful ($Y = 1$) since there can be no violation.
%
Moreover, we assume that decision maker $H$ is associated with a leniency level $R$, which determines the fraction of cases for which they produce a positive decision, in expectation. 
%
Formally, for leniency level $R = r\in [0, 1]$, we have
\begin{equation}
	P(T = 1 | R = r) = \sum_{X, Z} P(T = 1, X, Z | R = r) = r .
\end{equation}
The outcome of this process is a record $(X, T, Y)$ that contains only a subset $X \subseteq F$ of the features of the case, together with the decision and the outcome -- but leaves no trace of the remaining features $Z = F \setminus X$.
%
Intuitively, in our example, $X$ corresponds to publicly recorded information about the bail-or-jail case decided by the judge (e.g., the gender and age of the defendant) and $Z$ corresponds to features that are observed by the judge but do not appear on record (e.g., whether the defendant appeared stressed).
%
The set of records $\{(X, T, Y)\}$ produced by decision maker $H$ constitutes what we refer to as the {\bf dataset}.
%
Figure~\ref{fig:model} shows the causal diagram that describes the operation of decision maker $H$.
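To make the first process concrete, the following is a minimal simulation sketch of decision maker $H$. The specific distributions of $X$ and $Z$, the logistic risk score, and the quantile rule used to enforce the leniency constraint are illustrative assumptions for this sketch only, not part of the setting.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n, r = 1000, 0.5       # number of cases and leniency level R = r

# Features of each case: X is recorded, Z is seen by H but never recorded.
X = rng.normal(size=n)
Z = rng.normal(size=n)

# H scores each case using both X and Z and, to satisfy
# P(T = 1 | R = r) = r, decides positively for the fraction r
# of cases it deems least risky (an illustrative decision rule).
risk = 1 / (1 + np.exp(-(X + Z)))
T_H = (risk <= np.quantile(risk, r)).astype(int)

# Outcome under a positive decision (hypothetical for the rest);
# selective labels: after a negative decision, Y = 1 by definition.
Y_latent = rng.binomial(1, 1 / (1 + np.exp(X + Z)))
Y = np.where(T_H == 1, Y_latent, 1)

dataset = np.column_stack([X, T_H, Y])   # records (X, T, Y); Z leaves no trace
\end{verbatim}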
In the second decision process, a decision maker $S$ considers a case described by a set of features $X$ and makes a binary decision $T$ based on those features, followed by a binary outcome $Y$.
%
In our example, $S$ corresponds to an automated decision system that is considered for replacing the human judge in bail-or-jail decisions.
%
Notice that $S$ has access only to some of the features that were available to $H$; this models cases where the system uses only the recorded features and not others that would be available to a human judge.
%
The definitions and semantics of decision $T$ and outcome $Y$ follow those of the first process and are not repeated.
%
Moreover, decision maker $S$ is also associated with a leniency level $R$, defined as before for $H$.
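As an illustration of how such a decision maker could operate in practice -- an assumption made only for this sketch, not a requirement of the setting -- $S$ could fit a predictive model for $Y$ on the recorded features $X$ and decide positively for the fraction $r$ of cases with the highest predicted probability of a successful outcome. Continuing the simulation sketch above:
\begin{verbatim}
from sklearn.linear_model import LogisticRegression

# Hypothetical decision maker S(r), trained on the recorded features only.
# Note that its training data is itself selectively labeled: Y = 1 whenever
# the decision in the data was negative.
model = LogisticRegression().fit(X.reshape(-1, 1), Y)
p_success = model.predict_proba(X.reshape(-1, 1))[:, 1]   # predicted P(Y = 1 | X)

# Enforce leniency R = r: positive decision for the fraction r of cases
# that S predicts are most likely to have a successful outcome.
T_S = (p_success >= np.quantile(p_success, 1 - r)).astype(int)
\end{verbatim}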
\note{Michael}{I changed the notation and now refer to the two decision makers as H and S, for "human" and system", respectively.}

\subsection{Evaluating Decision Makers}


The goodness of a decision maker is measured in terms of its failure rate {\bf FR} -- i.e., the fraction of undesired outcomes ($Y=0$) out of all the cases for which a decision is made. 
%
A good decision maker achieves as low a failure rate FR as possible.
%
Note, however, that a decision maker that always makes a negative decision ($T=0$) has failure rate $FR = 0$ by definition.
%
For comparisons to be meaningful, we compare decision makers at the same leniency level $R$.
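Spelled out, for a decision maker that decides $n$ cases at leniency $R = r$, with resulting outcomes $y_1, \dots, y_n$, the failure rate is
\begin{equation}
	\mbox{FR} = \frac{\left|\{\, i : y_i = 0 \,\}\right|}{n} ,
\end{equation}
which in expectation equals $P(Y = 0 \,|\, R = r)$.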


The main challenge in estimating FR, however, is that in general the dataset does not directly provide a way to evaluate it.
%
In particular, let us consider the case where we wish to evaluate decision maker $S$ -- and suppose that $S$ is making a decision $T_{_S}$ for the case corresponding to record $(X, T_{_H}, Y_{_H})$.
%
Suppose also that the decision by $H$ was $T_{_H} = 0$, in which case the outcome is by default successful, $Y_{_H} = 1$.
%
If the decision by $S$ is $T_{_S} = 1$, then it is not possible to tell directly from the dataset what the outcome $Y_{_S}$ would have been in the hypothetical case where decision maker $S$'s decision had been followed in the first place.
%
The approach we take to deal with this challenge is to use counterfactual reasoning to infer $Y_{_S}$.
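Concretely, continuing the sketches above, only part of $S$'s decisions can be scored directly from the dataset; the rest require a counterfactual estimate of the outcome. The helper \texttt{infer\_counterfactual} below is a hypothetical placeholder for that estimate, which is the subject of the following sections.
\begin{verbatim}
def infer_counterfactual(x):
    # Hypothetical placeholder: stands for the counterfactual estimate of
    # the outcome under a positive decision; not defined at this point.
    raise NotImplementedError

y_S = np.empty(n)
for i in range(n):
    if T_S[i] == 0:
        y_S[i] = 1                  # negative decision: successful by definition
    elif T_H[i] == 1:
        y_S[i] = Y[i]               # H also decided positively: outcome on record
    else:
        y_S[i] = infer_counterfactual(X[i])   # T_H = 0, T_S = 1: Y unobserved
failure_rate_S = np.mean(y_S == 0)  # failure rate FR of S at leniency r
\end{verbatim}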

Ultimately, our goal is to obtain an estimate of the failure rate FR for a decision maker $S = S(r)$ that is associated with a given leniency level $R = r$:
\begin{problem}[Evaluation]
Given a dataset $\{(X, T, Y)\}$ and a decision maker $S(r)$ with leniency $R = r$, provide an estimate of its failure rate FR.
\end{problem}
\noindent
Ideally, the estimate returned by the evaluation should also be accurate for all levels of leniency.
\todo{Michael}{Create and use macros for all main terms and mathematical quantities, so that they stay consistent throughout the paper.}