\section{Counterfactual-Based Imputation For Selective Labels}
\label{sec:imputation}
\begin{figure*}
\begin{center}
\includegraphics[height=2in]{img/setting}
\end{center}
\caption{The figure summarizes our approach for counterfactual-based imputation.
%
Negative decisions by decision maker $M$ ($T_{_M} = 0$) are evaluated as successful ($Y_{_M} = 1$) (shown with dashed arrows). For negative decisions by decision maker $H$ ($T_{_H} = 0$), the outcome is evaluated according to the table of imputed outcomes $\hat\outcome$ (dotted arrows).
%
Imputed outcomes are produced from the dataset outcomes by making a counterfactual prediction for those cases where $H$ had made a negative decision (solid arrows).
\label{fig:approach}}
\end{figure*}
If decision maker \machine makes a positive decision for a case where decision maker \human had made a negative decision, how can we infer the outcome \outcome in the hypothetical case where \machine's decision had been followed? 
%
Such questions fall straight into the realm of causal analysis and particularly the evaluation of counterfactuals -- an approach that we follow in this paper.

The challenges we face are two-fold.
%
Firstly, we do not have direct observations of the outcome under \machine's positive decision.
%
A first thought, then, would be to simply {\it predict} the outcome based on the features of the case.
%
In the bail-or-jail scenario, for example, we could investigate whether certain features of the defendant (e.g., their age and marital status) are good predictors of whether they comply with the bail conditions -- and use them if they do.
%
Secondly, however, not all features that are available to \human are available to \machine in the setting we consider -- and so, making direct predictions based on the available features alone can be suboptimal.
%
This is because some information about the unobserved features \unobservable can often be recovered from the decision of decision maker \human.
%
This is exactly what our counterfactual approach achieves.


For illustration, let us consider a defendant who received a negative decision from the human judge.
Suppose also that, among defendants with similar recorded features \obsFeatures who were released, none violated the bail conditions -- and therefore, the defendant should be considered safe to release based on \obsFeatures alone.
If the judge was both lenient and precise -- i.e., able to make exactly those positive decisions that lead to successful outcomes -- then it is quite possible that the negative decision is attributable to unfavorable non-recorded features \unobservable.
And therefore, if a positive decision were made, {\it the above reasoning makes a negative outcome more likely than what would have been predicted based on the recorded features \obsFeatures of released defendants alone}.
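In causal terms, the judge's negative decision is itself informative about \unobservable: by Bayes' rule (an illustrative expression, written with the notation introduced earlier; the formal treatment follows below),
\begin{equation*}
	\prob{\unobservable = \unobservableValue | \obsFeatures = \obsFeaturesValue, \decision_\human = 0}
	\;\propto\;
	\prob{\decision_\human = 0 | \obsFeatures = \obsFeaturesValue, \unobservable = \unobservableValue}\,
	\prob{\unobservable = \unobservableValue | \obsFeatures = \obsFeaturesValue} ,
\end{equation*}
so that, when the recorded features \obsFeaturesValue alone would suggest a positive decision, the posterior over \unobservable shifts towards values that make a negative decision -- and, by the reasoning above, an unsuccessful outcome -- more likely.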


\acomment{Could emphasize the above with a plot, x and z in the axis and point styles indicating the decision.}
\mcomment{Actually, the paragraph above describes a scenario where {\it labeled outcomes} and possibly {\it contraction} would fail. Specifically, create cases where:
(i) Z has a much larger coefficient than X, and (ii) the judge is good (the two logistic functions for judge decision and outcome are the same), and (iii) the machine is trained on labeled outcomes. The machine will see that the outcome is successful regardless of X, because Z will dominate the positive (and negative) decisions. So it will learn that everyone can be released. Labeled outcomes will evaluate the machine as good -- but our approach will uncover its true performance.}
\todo{Michael}{Create suggested plots and experiments above. Since the plots are on synthetic data, just like the experiments, I suggest to have them in the experimental section.}
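The following is a minimal, hypothetical sketch of the data-generating process described in the comment above (the coefficient values are made up for illustration; the actual synthetic setup is specified in the experimental section):
\begin{verbatim}
# Illustrative sketch of the scenario described above (made-up coefficients):
# Z dominates both the judge's decisions and the outcomes, so among released
# defendants the labeled outcomes look successful regardless of X.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(size=n)             # recorded features X
z = rng.normal(size=n)             # non-recorded features Z

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

beta_x, beta_z = 0.5, 5.0          # (i) Z has a much larger coefficient
p = sigmoid(beta_x * x + beta_z * z)
t = rng.binomial(1, p)             # (ii) judge decision, same logistic as outcome
y = rng.binomial(1, p)             # outcome if released (1 = success)

print("success rate among released:", y[t == 1].mean())   # close to 1
\end{verbatim}
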
Having provided the intuition for our approach, in what follows we describe it in detail.
%
We recall that the goal is to provide a solution to Problem~\ref{problem:the} -- and, to do that, we wish to address those cases where $\decision_\machine = 1$ while $\decision_\human = 0$, for which evaluation cannot be done directly.
%
In other words, we wish to answer a `what-if' question: for each specific case where \human decided $\decision_\human = 0$, what if we had intervened to alter the decision to $\decision = 1$?
%
In the formalism of causality theory~\cite{pearl2010introduction}, we wish to evaluate the counterfactual expectation
\begin{align}
	\hat{\outcome} = & \expectss{\decision \leftarrow 1}{\outcome | \obsFeatures = \obsFeaturesValue, \decision_\human = 0; \dataset} \nonumber\\
	= & \probss{\decision \leftarrow 1}{\outcome = 1 | \obsFeatures = \obsFeaturesValue, \decision_\human = 0; \dataset} \label{eq:counterfactual}
\end{align}
The probability expression above concerns a specific entry in the dataset with features $\obsFeatures = \obsFeaturesValue$, for which decision maker \human made a negative decision $\decision_\human = 0$.
It is read as follows: conditional on what we know from the data entry ($\obsFeatures = \obsFeaturesValue$, $\decision_\human = 0$) as well as from the entire dataset \dataset, consider the probability that the outcome would have been positive ($\outcome = 1$) in the hypothetical case we had intervened to make the decision positive.
Notice that the presence of \dataset in the conditioning part of expression~(\ref{eq:counterfactual}) gives us more information about the data entry than the entry-specific quantities ($\obsFeatures = \obsFeaturesValue$, $\decision_\human = 0$) alone and is thus not redundant.
%
In particular, it provides information about the leniency and other parameters of decision maker \human, which in turn is important for inferring the unobserved features \unobservable, as discussed in the beginning of this section.


Our approach for those cases unfolds over three steps: first, it learns a model over the dataset; then, it computes counterfactuals to predict the unobserved outcomes; and finally, it uses these predictions to evaluate the decisions of \machine.

\spara{Learning a model} We take a Bayesian approach to learn a model over the dataset \dataset.
%
In particular, we consider the full probabilistic model defined in Equations (\ref{eq:priorx} - \ref{eq:defendantmodel}) and obtain the posterior distribution of its parameters \parameters = \{ \text{{\bf TODO}} \} conditional on the dataset.
%
Notice that by ``parameters'' here we refer to all quantities that are not considered as known with certainty from the input, and so parameters include unobserved features \unobservable.
%
Formally, we obtain
\begin{equation}
	\prob{\parameters | \dataset} = \frac{\prob{\dataset | \parameters} \prob{\parameters}}{\prob{\dataset}} .
\end{equation}
%
In practice, we use the MCMC functionality of Stan\footnote{\url{https://mc-stan.org/}} to obtain a sample \sample of this posterior distribution, where each element of \sample contains one instance of parameters \parameters.
%
Sample \sample can now be used to compute various probabilistic quantities of interest, including a (posterior) distribution of \unobservable for each entry in dataset \dataset.
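In code, this step could look roughly as follows (a minimal sketch using the \texttt{cmdstanpy} interface to Stan; the file name \texttt{model.stan}, the data dictionary \texttt{data\_dict} and the variable name \texttt{z} are illustrative assumptions, not fixed by our implementation):
\begin{verbatim}
# Minimal sketch: fit the full probabilistic model with Stan's MCMC and keep
# the posterior sample S. Assumes the model has been written to "model.stan"
# and the dataset D packed into data_dict in the format that file declares.
from cmdstanpy import CmdStanModel

model = CmdStanModel(stan_file="model.stan")
fit = model.sample(data=data_dict, chains=4)   # MCMC sample S of the posterior

# Each draw is one instance of the parameters, including the unobserved
# feature z of every data entry (variable name "z" assumed here).
z_draws = fit.stan_variable("z")               # shape: (number of draws, N)
\end{verbatim}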

\spara{Computing counterfactuals} 
Having obtained a posterior probability distribution for parameters \parameters, we can now expand expression~(\ref{eq:counterfactual}) as follows.
\begin{align}
	\hat{\outcome} & = \probss{\decision \leftarrow 1}{\outcome = 1 | \obsFeatures = \obsFeaturesValue, \decision_\human = 0; \dataset} \nonumber \\
	& = \int_\parameters\probss{\decision \leftarrow 1}{\outcome = 1 | \obsFeatures = \obsFeaturesValue, \parameters, \decision_\human = 0; \dataset}\ \diff{\prob{\parameters | \dataset}} \nonumber \\
	& = \int_\parameters\probss{\decision \leftarrow 1}{\outcome = 1 | \obsFeatures = \obsFeaturesValue, \parameters}\ \diff{\prob{\parameters | \dataset}} \nonumber \\
	& = \int_\parameters\prob{\outcome = 1 | \obsFeatures = \obsFeaturesValue, \decision = 1, \parameters}\ \diff{\prob{\parameters | \dataset}} \label{eq:expandcf}
\end{align}
The first factor in the integrand of expression~(\ref{eq:expandcf}) is provided by the model in Equation~\ref{eq:defendantmodel}, while the second is obtained from the posterior sampled by MCMC, as explained above.
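In practice, we approximate this integral with a Monte Carlo average over the posterior sample \sample obtained in the previous step,
\begin{equation*}
	\hat{\outcome} \;\approx\; \frac{1}{|\sample|} \sum_{\parameters \in \sample} \prob{\outcome = 1 | \obsFeatures = \obsFeaturesValue, \decision = 1, \parameters} .
\end{equation*}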

Note that imputation is only needed for entries with $\decision_\human = 0$ and $\decision_\machine = 1$.
%
For entries where \machine makes a negative decision ($\decision_\machine = 0$), no failure can occur and the outcome is evaluated as successful, $\hat{\outcome} = 1$ (see Figure~\ref{fig:approach}); for entries where both decisions are positive, we trivially have
\begin{equation}
	\hat{\outcome} = \outcome ,
\end{equation}
where \outcome is the outcome recorded in the dataset \dataset.

\spara{Evaluation of decisions}
Expression~(\ref{eq:expandcf}) gives us a direct way to evaluate the outcome of decision $\decision_\machine = 1$ for any data entry for which $\decision_\human = 0$.
%
Note though that, unlike the recorded outcomes \outcome, which take integer values $\{0, 1\}$, the imputed outcome $\hat{\outcome}$ may take fractional values $\hat{\outcome} \in [0, 1]$.


Having obtained outcome estimates for data entries with $\decision_\human = 0$ and $\decision_\machine = 1$, it is now straightforward to obtain an estimate for the failure rate $FR$ of decision maker \machine: it is simply the average value of $1 - \hat{\outcome}$ over all data entries.
%
Our approach is summarized in Figure~\ref{fig:approach}.
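Schematically, the evaluation step amounts to the following (a sketch; the array names are illustrative, \texttt{y\_hat\_cf} holds the imputed success probabilities of expression~(\ref{eq:expandcf}), and the handling of the remaining entries follows Figure~\ref{fig:approach}):
\begin{verbatim}
import numpy as np

# t_H, t_M: decisions of H and M; y: recorded outcomes (meaningful when
# t_H == 1); y_hat_cf: imputed success probabilities for t_H == 0, t_M == 1.
def failure_rate(t_H, t_M, y, y_hat_cf):
    y_hat = np.ones(len(y), dtype=float)    # T_M = 0: evaluated as successful
    observed = (t_H == 1) & (t_M == 1)
    y_hat[observed] = y[observed]           # both positive: recorded outcome
    imputed = (t_H == 0) & (t_M == 1)
    y_hat[imputed] = y_hat_cf[imputed]      # counterfactual imputation
    return float(np.mean(1.0 - y_hat))      # failure rate of decision maker M
\end{verbatim}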
\hide{
	In counterfactual-based imputation we use counterfactual values of the outcome $\outcome_{\decisionValue=1}$ to impute the missing labels. The SCM required to compute the counterfactuals is presented in figure \ref{fig:causalmodel}. Using Stan, we model the observed data as 
	% \begin{align} \label{eq:data_model}
	%  \outcome ~|~\decision = 1, x & \sim \text{Bernoulli}(\invlogit(\alpha_y + \beta_{xy} x + \beta_{zy} z)) \\ \nonumber
	%  \decision ~|~D, x & \sim \text{Bernoulli}(\invlogit(\alpha_{j} + \beta_{xt} x + \beta_{zt}z)). \\ \nonumber
	% \end{align}
	That is, we fit one logistic regression model modelling the decisions based on the observable features \obsFeatures and the identity of the judge using all of the data. The identity of the judge is encoded into the intercept $\alpha_j$. (We use different intercepts for different judges.) We model the observed outcomes with $\decision = 1$ with a separate regression model to learn the parameters: coefficients $\beta_{xy}$ for the observed features, $\beta_{zy}$ for the unobserved features, the sole intercept $\alpha_y$ and the possible value for the latent variable \unobservable.

	Using the samples from the posterior distribution for all the parameters given by Stan, we can estimate the values of the counterfactuals. The counterfactuals are formally drawn from the posterior predictive distribution
	\[
	p(\tilde{y} | y) = \int_\Omega p(\tilde{y} | \theta)\, p(\theta | y)\, d\theta .
	\]

	In practice, once we have run Stan, we have $S$ samples of all the model parameters from the posterior distribution $p(\theta|y)$ (the probability of the parameters given the data). We then use these values to sample probable outcomes for the missing values. For example, suppose the outcome $\outcomeValue_i$ of some observation is missing. From Stan we obtain a sample of size $S$ of the coefficients, the intercepts and $\unobservableValue_i$, representing their posterior distribution. We plug these values into the model presented in the first line of Equation~\ref{eq:data_model} and draw counterfactual values for the outcome from $y_{i, \decisionValue=1} \sim \text{Bernoulli}(\invlogit(\alpha_y + \beta_{xy} x_i + \beta_{zy} z_i))$. In essence, we use the sampled parameter values from the posterior to sample new values for the missing outcomes. As we have $S$ ``guesses'' for each missing outcome, we compute the failure rate for each set of guesses and report their mean.


	\begin{algorithm}
	\DontPrintSemicolon
	\KwIn{Test data set $\dataset = \{x, j, t, y\}$, acceptance rate $r$} 
	\KwOut{Failure rate at acceptance rate $r$} 
	Using Stan, draw $S$ samples of all the parameters from the posterior distribution defined in Equation~\ref{eq:data_model}. Every item of the vector \unobservableValue is treated as a parameter.\;
	\For{i in $1, \ldots, S$}{
		\For{j in $1, \ldots, \datasize$}{
			Draw new outcome $\tilde{\outcome}_{j}$ from $\text{Bernoulli}(\invlogit(\alpha_{y}[i] + \beta_{xy}[i] x_j + \beta_{zy}[i] z[i, j]))$\;
		}
		Impute missing values using outcomes drawn in the previous step.\;
		Sort the observations in ascending order based on the predictions of the predictive model.\;
		Estimate the FR as $\frac{1}{\datasize}\sum_{k=1}^{\datasize\cdot r} \indicator{\outcomeValue_k=0}$ and add it to the set $\mathcal{U}$.\;
	\Return{Mean of $\mathcal{U}$.}
		
	\caption{Counterfactual-based imputation}
	\end{algorithm}
}