%!TEX root = sl.tex
% The above command helps compiling in TexShop on a Mac: hitting Typeset compiles sl.tex directly instead of producing an error here.


\section{Counterfactual-based Imputation (Antti's version)}

We use the following structural equation model over the graph structure in Figure~2:

\noindent
\hrulefill
\begin{align}
R & := \epsilon_R, \quad   %\epsilon_r \sim N(0,\sigma_z^2)  
\nonumber \\
Z & := \epsilon_Z, \quad   %\epsilon_z \sim N(0,\sigma_z^2)
 \nonumber \\
 X & := \epsilon_X, \quad   %\epsilon_z \sim N(0,\sigma_z^2)
 \nonumber \\
T & := g(R,X,Z,\epsilon_T),  \nonumber\\
Y & := f(T,X,Z,\epsilon_Y).  \nonumber
\end{align}
\hrulefill
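
For concreteness, the following Python sketch simulates data from this SCM; the logistic forms chosen for $g$ and $f$ and all noise scales are illustrative assumptions of the sketch, not implied by the model above.
\begin{verbatim}
# Illustrative simulation of the SCM above. The logistic forms of g and f
# and the noise scales are assumptions of this sketch only.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

R = rng.normal(size=n)   # judge-related exogenous variable
X = rng.normal(size=n)   # observed features
Z = rng.normal(size=n)   # unobserved risk factor

# T := g(R, X, Z, eps_T): one possible g, thresholding a noisy score.
eps_T = rng.normal(size=n)
T = (sigmoid(R - X - Z + eps_T) > 0.5).astype(int)

# Y := f(T, X, Z, eps_Y); a negative decision (T = 0) is recorded as a
# successful outcome (Y = 1), consistent with the model used later.
eps_Y = rng.normal(size=n)
Y = np.where(T == 1, (sigmoid(-X - Z + eps_Y) > 0.5).astype(int), 1)
\end{verbatim}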

For any case where $T=0$ in the data, we calculate the counterfactual value of $Y$ had we instead had $T=1$. We use Pearl's three-step approach, consisting of abduction, action, and prediction. We first describe the procedure for fixed parameters, and later generalize to the case where the parameters are learned from data.

In the abduction step we update the distribution of the disturbance terms $(\epsilon_R, \epsilon_Z, \epsilon_X, \epsilon_T,\epsilon_Y)$ to take into account the evidence $T=0,Y=1,X=x$. At this point we make use of the additional information that a negative decision carries about the unobserved risk factor $Z$. We can directly update



\section{Counterfactual-Based Imputation For Selective Labels}
\label{sec:imputation}
\begin{figure*}
\begin{center}
\includegraphics[height=2in]{img/setting}
\end{center}
\caption{The figure summarizes our approach for counterfactual-based imputation.
%
Negative decisions by decision maker $M$ ($\decision_{_\machine} = 0$) are evaluated as successful ($\outcome_{_\machine} = 1$) (shown with dashed arrows). For negative decisions by decision maker $\human$ ($\decision_{_\human} = 0$), the outcome is evaluated according to the table of imputed outcomes $\hat\outcome$ (dotted arrows).
Imputed outcomes are produced from the dataset outcomes by making a counterfactual prediction for those cases where $\human$ had made a negative decision (solid arrows).
\label{fig:approach}}
\end{figure*}
If decision maker \machine makes a positive decision for a case where decision maker \human had made a negative decision, how can we infer the outcome \outcome in the hypothetical case where \machine's decision had been followed? 
%
Such questions fall straight into the realm of causal analysis and particularly the evaluation of counterfactuals -- an approach that we follow in this paper.

The challenges we face are two-fold. 
%
Firstly, we do not have direct observations of the outcome under \machine's positive decision.
%
A first thought, then, would be to simply {\it predict} the outcome based on the features of the case.
%
In the bail-or-jail scenario, for example, we could investigate whether certain features of the defendant (e.g., their age and marital status) are good predictors of whether they comply with the bail conditions -- and use them if they do.
%
The second challenge is that, in the setting we consider, not all features available to \human are available to \machine; as a result, direct predictions based on the available features can be suboptimal.
%
This is because some information regarding the unobserved features \unobservable can often be recovered via the decision of decision maker \human.
%
This is exactly what our counterfactual approach achieves.


For illustration, let us consider a defendant who received a negative decision by the human judge.
Suppose also that, among defendants with similar recorded features \obsFeatures who were released, none violated the bail conditions -- and therefore, judging from observations alone, the defendant should be considered safe to release based on \obsFeatures.
However, if the judge was both lenient and precise -- i.e., able to make exactly those positive decisions that lead to successful outcomes -- then it is very possible that the negative decision is attributable to unfavorable non-recorded features \unobservable. 
And therefore, if a positive decision were made, {\it the above reasoning suggests that a negative outcome is more likely than what would be predicted based on the recorded features \obsFeatures of released defendants alone}.
\begin{figure}
\includegraphics[width=\linewidth]{img/decisions_ZvsX}
\caption{...}
%\label{fig:}
\end{figure}
\note{Michael}{Actually, the paragraph above describes a scenario where {\it labeled outcomes} and possibly {\it contraction} would fail. Specifically, create cases where:
(i) Z has much larger coefficient than X, and (ii) the judge is good (the two logistic functions for judge decision and outcome are the same), and (iii) the machine is trained on labeled outcomes. The machine will see that the outcome is successful regardless of X, because Z will dominate the positive (and negative) decisions. So it will learn that everyone can be released. Labeled outcomes will evaluate the machine as good -- but our approach will uncover its true performance.}
\todo{Michael}{Create suggested plots and experiments above. Since the plots are on synthetic data, just like the experiments, I suggest to have them in the experimental section.}
\begin{figure}
\includegraphics[width=\linewidth]{img/leniency_figure}
\caption{Illustration of the relationship between \leniency, \obsFeatures and \unobservable. Points $A$, $B$, $C$ and $D$ represent four subjects, each with different features \obsFeatures and \unobservable.
%
Lines $\leniencyValue_1$ and $\leniencyValue_2$ show the decision boundaries of decision-makers with different leniencies.
%
Subjects $A$ and $C$ share the same features \obsFeatures, yet receive different decisions from decision-maker $1$ -- though not from decision-maker $2$ -- due to their difference in \unobservable.
%
The figure also explicates the interplay of features \obsFeatures and \unobservable: comparing subjects $A$ and $D$, one might claim $D$ to be more dangerous than $A$ based on features \obsFeatures alone. However, assuming that decision-maker $2$ uses feature \unobservable efficiently, they will keep the decision the same, as they observe a reduction in \unobservable.}
\label{fig:leniency}
\end{figure}

\subsection{Our approach}
Having provided the intuition for our approach, in what follows we describe it in detail.
%
Recall that the goal is to provide a solution to Problem~\ref{problem:the} -- to do so, we wish to address those cases where $\decision_\machine = 1$ while $\decision_\human = 0$, for which evaluation cannot be performed directly.
%
In other words, we wish to answer a `what-if' question: for each specific case where \human decided $\decision_\human = 0$, what if we had intervened to alter the decision to $\decision = 1$?
%
In the formalism of causal inference~\cite{pearl2010introduction}, we wish to evaluate the counterfactual expectation
\begin{align}
	\cfoutcome = & \expectss{\decision \leftarrow 1}{\outcome~| \obsFeatures = \obsFeaturesValue, \decision_\human = 0; \dataset} \nonumber\\
	= & \probss{\decision \leftarrow 1}{\outcome = 1 | \obsFeatures = \obsFeaturesValue, \decision_\human = 0; \dataset} \label{eq:counterfactual}
\end{align}
The probability expression above concerns a specific entry in the dataset with features \obsFeatures, for which decider \human made a decision $\decision_\human = 0$.
It is read as follows: conditional on what we know from the data entry ($\obsFeatures = \obsFeaturesValue$, $\decision_\human = 0$) as well as from the entire dataset \dataset, consider the probability that the outcome would have been positive ($\outcome = 1$) in the hypothetical case we had intervened to make the decision positive.
Notice that the presence of \dataset in the conditional part of Expression~(\ref{eq:counterfactual}) gives us more information about the data entry than the entry-specific quantities ($\obsFeatures = \obsFeaturesValue$, $\decision_\human = 0$) alone, and is thus not redundant.
%
In particular, it provides information about the leniency and other parameters of decider \human, which in turn is important to infer information about the unobserved variables \unobservable, as discussed in the beginning of this section.

Our approach for those cases unfolds over three steps: first, it learns a model over the dataset; then, it computes counterfactuals to predict unobserved outcomes; and finally, it uses predictions to evaluate a set of decisions.

\spara{Learning a model} We take a Bayesian approach to learn a model over the dataset \dataset.
%
In particular, we consider the full probabilistic model defined in Equations \ref{eq:judgemodel} -- \ref{eq:defendantmodel} and obtain the posterior distribution of its parameters $\parameters = \{ \alpha_\outcomeValue, \alpha_j, \beta_\obsFeaturesValue, \beta_\unobservableValue, \gamma_\obsFeaturesValue, \gamma_\unobservableValue, \unobservable_i\}$, where $j = 1, \ldots, \judgeAmount$ and $i = 1, \ldots, \datasize$, conditional on the dataset.
Notice that by ``parameters'' here we refer to all quantities that are not considered known with certainty from the input; in particular, the parameters include the unobserved features \unobservable.
%
Formally, we obtain
\begin{equation}
	\prob{\parameters | \dataset} = \frac{\prob{\dataset | \parameters} \prob{\parameters}}{\prob{\dataset}} .
\end{equation}
%
In practice, we use the MCMC functionality of Stan\footnote{\url{https://mc-stan.org/}} to obtain a sample \sample of this posterior distribution, where each element of \sample contains one instance of parameters \parameters.
%
Sample \sample can now be used to compute various probabilistic quantities of interest, including a (posterior) distribution of \unobservable for each entry in dataset \dataset.
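
For concreteness, the sampling step could look like the following Python sketch, which uses the CmdStanPy interface to Stan; the file name \texttt{model.stan} as well as the data-field and parameter names are illustrative assumptions, not fixed by our method.
\begin{verbatim}
# Sketch of the MCMC step. Assumes the hierarchical model is written in
# 'model.stan' (hypothetical), and that the numpy arrays x, t, y,
# judge_id and the integer n_judges describe the dataset D.
from cmdstanpy import CmdStanModel

data = {
    "N": len(x),      # number of cases
    "J": n_judges,    # number of decision makers
    "x": x,           # observed features
    "jj": judge_id,   # decision maker of each case (1-based)
    "t": t,           # decisions
    "y": y,           # recorded outcomes
}

model = CmdStanModel(stan_file="model.stan")
fit = model.sample(data=data, chains=4, iter_sampling=1000)

# Each posterior draw is one instance of the parameters theta, including
# a draw of the unobserved feature Z for every case in the dataset.
draws = {name: fit.stan_variable(name)
         for name in ("alpha_y", "alpha_j", "beta_x", "beta_z",
                      "gamma_x", "gamma_z", "Z")}
\end{verbatim}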

\spara{Computing counterfactuals} 
Having obtained a posterior probability distribution for parameters \parameters in parameter space \parameterSpace, we can now expand expression~(\ref{eq:counterfactual}) as follows:
\begin{align}
	\cfoutcome & = \probss{\decision \leftarrow 1}{\outcome = 1 | \obsFeatures = \obsFeaturesValue, \decision_\human = 0; \dataset} \nonumber \\
	& = \int_\parameterSpace\probss{\decision \leftarrow 1}{\outcome = 1 | \obsFeatures = \obsFeaturesValue, \parameters, \decision_\human = 0; \dataset}\ \prob{\parameters | \dataset}\ \diff{\parameters} \nonumber \\
	& = \int_\parameterSpace\prob{\outcome = 1 | \obsFeatures = \obsFeaturesValue, \doop{\decision = 1}, \parameters}\ \prob{\parameters | \dataset}\ \diff{\parameters} \nonumber \\
	& = \int_\parameterSpace\prob{\outcome = 1 | \obsFeatures = \obsFeaturesValue, \doop{\decision = 1}, \alpha, \beta_{_\obsFeatures}, \beta_{_\unobservable}, \unobservable}\ \prob{\parameters | \dataset}\ \diff{\parameters}
\end{align}
The value of the first factor in the integrand of the expression above is provided by the model in Equation~\ref{eq:defendantmodel}, while the second is sampled by MCMC, as explained above.
%
The result is computed numerically over the sample:
\begin{equation}
	\cfoutcome \approx \frac{1}{|\sample|} \sum_{\parameters\in\sample}\prob{\outcome = 1 | \obsFeatures = \obsFeaturesValue, \decision = 1, \alpha, \beta_{_\obsFeatures}, \beta_{_\unobservable}, \unobservable} \label{eq:expandcf}
\end{equation}
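
For a single data entry with features $\obsFeaturesValue$, this estimate can be computed as in the following sketch, which continues from the illustrative \texttt{draws} dictionary of the learning step.
\begin{verbatim}
# Monte Carlo estimate of the counterfactual outcome for case i.
# draws["Z"] is assumed to have shape (S, N), one draw of Z per
# posterior sample and case; the other entries have shape (S,).
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def counterfactual_outcome(draws, i, x_i):
    p = sigmoid(draws["alpha_y"]
                + draws["beta_x"] * x_i
                + draws["beta_z"] * draws["Z"][:, i])
    return p.mean()  # average over the S posterior draws
\end{verbatim}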

Note that, for all data entries other than the ones with $\decision_\human = 0$ and $\decision_\machine = 1$, we trivially have 
\begin{equation}
	\cfoutcome = \outcome
\end{equation}
where \outcome is the outcome recorded in the dataset \dataset.

\spara{Evaluation of decisions}
Expression~\ref{eq:expandcf} gives us a direct way to evaluate the outcome of decisions $\decision_\machine$ for any data entry for which $\decision_\human = 0$.
%
Note, though, that unlike the recorded outcomes for entries with $\decision_\human = 1$, which take binary values $\outcome \in \{0, 1\}$, \cfoutcome may take fractional values $\cfoutcome \in [0, 1]$.
Having obtained outcome estimates for data entries with $\decision_\human = 0$ and $\decision_\machine = 1$, it is now straightforward to obtain an estimate for the failure rate $\failurerate$ of decision maker \machine: it is simply the average value of $1 - \cfoutcome$ over all data entries, where negative decisions by \machine are evaluated as successful ($\cfoutcome = 1$).
%
Our approach is summarized in Figure~\ref{fig:approach}.
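
In code, the evaluation step reduces to a simple average, as in the following sketch, where \texttt{y\_hat} stands for the imputed outcomes \cfoutcome.
\begin{verbatim}
# Failure rate of decision maker M from imputed outcomes. By convention,
# entries with a negative decision by M count as successful (y_hat = 1),
# so only positive decisions can contribute failures.
import numpy as np

def failure_rate(t_M, y_hat):
    y_hat = np.where(t_M == 0, 1.0, y_hat)
    return np.mean(1.0 - y_hat)
\end{verbatim}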
\subsection{Model definition} \label{sec:model_definition}

The causal diagram of Figure~\ref{fig:causalmodel} provides the structure of causal relationships for quantities of interest.
%
In addition, we consider \judgeAmount instances $\{\human_j, j = 1, 2, \ldots, \judgeAmount\}$ of decision makers \human.
%
For the purposes of Bayesian modelling, we present the hierarchical model and explicate our assumptions about the relationships and the quantities below.
%
Note that index $j$ refers to decision maker $\human_j$ and \invlogit is the standard logistic function.

\noindent
\hrulefill
\begin{align}
\prob{\unobservable = \unobservableValue} & = (2\pi)^{-\nicefrac{1}{2}}\exp(-\unobservableValue^2/2)  \nonumber \\
\prob{\decision = 1~|~\leniency_j = \leniencyValue, \obsFeatures = \obsFeaturesValue, \unobservable = \unobservableValue} & = \invlogit(\alpha_j + \gamma_\obsFeaturesValue\obsFeaturesValue + \gamma_\unobservableValue \unobservableValue + \epsilon_\decisionValue),  \label{eq:judgemodel} \\
	\text{where}~ \alpha_{j} & \approx \logit(\leniencyValue_j) \label{eq:leniencymodel}\\
\prob{\outcome=1~|~\decision, \obsFeatures=\obsFeaturesValue, \unobservable=\unobservableValue} & =
	\begin{cases}
		1,~\text{if}~\decision = 0\\
		\invlogit(\alpha_\outcomeValue + \beta_\obsFeaturesValue \obsFeaturesValue + \beta_\unobservableValue \unobservableValue + \epsilon_\outcomeValue),~\text{o/w} \label{eq:defendantmodel}
	\end{cases}
\end{align}
\hrulefill

As stated in the equations above, we consider normalized features \obsFeatures and \unobservable.
%
Moreover, the probability that the decision maker makes a positive decision takes the form of a logistic function (Equation~\ref{eq:judgemodel}).
% 
Note that we are making the simplifying assumption that coefficients $\gamma$ are the same for all decision makers, but decision makers are allowed to differ in intercept $\alpha_j \approx \logit(\leniencyValue_j)$ so as to model varying leniency levels among them (Eq. \ref{eq:leniencymodel}).
%
The probability that the outcome is successful conditional on a positive decision (Eq.~\ref{eq:defendantmodel}) is also provided by a logistic function, applied on the same features as the logistic formula of equation \ref{eq:judgemodel}.
%
In general, these two logistic functions may differ in their coefficients.
%
However, in many settings, a decision maker would be considered good if the two functions were the same -- i.e., if the probability to make a positive decision was the same as the probability to obtain a successful outcome after a positive decision.
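
To make the data-generating process concrete, the following sketch simulates decisions and outcomes for a single decision maker; the coefficient values and the leniency level are illustrative assumptions, and the noise terms $\epsilon$ are omitted for simplicity.
\begin{verbatim}
# Illustrative simulation of the model above for one decision maker.
# Coefficient values and leniency are assumptions of this sketch.
import numpy as np
from scipy.special import expit   # the standard logistic function

rng = np.random.default_rng(1)
n = 10_000
leniency = 0.5

X = rng.normal(size=n)   # normalized observed features
Z = rng.normal(size=n)   # normalized unobserved features

alpha_j = np.log(leniency / (1.0 - leniency))  # alpha_j ~ logit(leniency)
gamma_x, gamma_z = 1.0, 1.0
T = rng.binomial(1, expit(alpha_j + gamma_x * X + gamma_z * Z))

# A "good" decision maker: the outcome model reuses the same coefficients.
alpha_y, beta_x, beta_z = 0.0, 1.0, 1.0
p_success = expit(alpha_y + beta_x * X + beta_z * Z)
Y = np.where(T == 1, rng.binomial(1, p_success), 1)  # Y = 1 when T = 0
\end{verbatim}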

For the Bayesian modelling, the priors for the coefficients $\gamma_\obsFeaturesValue, ~\beta_\obsFeaturesValue, ~\gamma_\unobservableValue$ and $\beta_\unobservableValue$ were defined using the gamma-mixture representation of Student's t-distribution with $6$ degrees of freedom.
%
The gamma-mixture is obtained by first sampling a precision parameter from the Gamma($\nicefrac{\nu}{2},~\nicefrac{\nu}{2}$) distribution and then drawing the coefficient from a zero-mean Gaussian distribution with variance equal to the inverse of that precision.
%
The Student's t-distribution was chosen over the Gaussian as the prior for its better robustness against outliers \cite{ohagan1979outlier}.
%
The precision parameters $\eta_\unobservableValue, ~\eta_{\beta_\obsFeaturesValue}$ and $\eta_{\gamma_\obsFeaturesValue}$ were sampled independently from Gamma$(\nicefrac{6}{2},~\nicefrac{6}{2})$, and the coefficients were then sampled from Gaussian distributions with mean $0$ and variances $\eta_\unobservableValue^{-1}, ~\eta_{\beta_\obsFeaturesValue}^{-1}$ and $\eta_{\gamma_\obsFeaturesValue}^{-1}$, as shown below. The coefficients for the unobserved confounder \unobservable were restricted to positive values to ensure identifiability.
\begin{align}
\eta_\unobservableValue, ~\eta_{\beta_\obsFeaturesValue}, ~\eta_{\gamma_\obsFeaturesValue} & \sim \text{Gamma}(3, 3) \\
\gamma_\unobservableValue, ~\beta_\unobservableValue & \sim N_+(0, \eta_\unobservableValue^{-1}) \\
\gamma_\obsFeaturesValue & \sim N(0, \eta_{\gamma_\obsFeaturesValue}^{-1}) \\
\beta_\obsFeaturesValue & \sim N(0, \eta_{\beta_\obsFeaturesValue}^{-1})
\end{align}
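
A coefficient from this gamma-mixture prior can be sampled as in the following sketch; note that NumPy parameterizes the Gamma distribution by shape and scale, where the scale is the inverse of the rate.
\begin{verbatim}
# Sampling one coefficient from the gamma-mixture representation of a
# Student's t prior with nu = 6 degrees of freedom.
import numpy as np

rng = np.random.default_rng(2)
nu = 6

# eta ~ Gamma(nu/2, nu/2) with rate nu/2, i.e. scale 2/nu in NumPy.
eta = rng.gamma(shape=nu / 2.0, scale=2.0 / nu)
coef = rng.normal(loc=0.0, scale=np.sqrt(1.0 / eta))  # N(0, 1/eta)

# Coefficients of the unobserved confounder Z are restricted to
# positive values for identifiability.
coef_z = abs(rng.normal(loc=0.0, scale=np.sqrt(1.0 / eta)))
\end{verbatim}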

The intercepts for the \judgeAmount decision-makers and for the outcome \outcome were given hierarchical Gaussian priors with variances $\sigma_\decisionValue^2$ and $\sigma_\outcomeValue^2$, as shown below. Note that the decision-makers share a joint variance parameter $\sigma_\decisionValue^2$.
\begin{align}
\sigma_\decisionValue^2, ~\sigma_\outcomeValue^2 & \sim N_+(0, \tau^2) \\
\alpha_j & \sim N(0, \sigma_\decisionValue^2) \\
\alpha_\outcomeValue & \sim N(0, \sigma_\outcomeValue^2)
\end{align}
%
\note{Riku}{The prior for $\alpha_\decisionValue$ should probably be changed to Logistic($0, 1$) because if $X \sim U(0, 1)$ then $\sigma^{-1}(X) \sim \text{Logistic}(0, 1)$.}
%
The variance parameters $\sigma_\decisionValue^2$ and $\sigma_\outcomeValue^2$ were drawn independently from zero-mean Gaussian distributions restricted to the positive real numbers, with variance $\tau^2 = 1$; other values of $\tau$ were tested but observed to have no effect.
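
The hierarchical intercept priors can likewise be sampled as follows; the number of decision makers in the sketch is an illustrative assumption.
\begin{verbatim}
# Hierarchical priors for the intercepts: a shared half-Gaussian prior
# on the variances, then Gaussian intercepts (tau = 1, as in the text).
import numpy as np

rng = np.random.default_rng(3)
tau = 1.0
n_judges = 10

sigma2_t = abs(rng.normal(0.0, tau))   # sigma_t^2 ~ N_+(0, tau^2)
sigma2_y = abs(rng.normal(0.0, tau))   # sigma_y^2 ~ N_+(0, tau^2)

alpha_j = rng.normal(0.0, np.sqrt(sigma2_t), size=n_judges)
alpha_y = rng.normal(0.0, np.sqrt(sigma2_y))
\end{verbatim}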

\subsection{Decision makers built on counterfactuals}
So far in our discussion, we have focused on the task of evaluating the performance of a decision-maker \machine that is specified as input to the task.
In light of our approach for evaluation, however, it is now possible to utilize the counterfactual outcomes \cfoutcome produced therein,  and build on them a decision-maker \cfmachine.

Doing so is straightforward: consider the set of entries $\{\obsFeatures, \cfoutcome\}$ from the dataset after the application of counterfactual-based imputation, and build a probabilistic model for \prob{\cfoutcome | \obsFeatures}.
%
For any case with recorded features \obsFeatures, \cfmachine will use the model to evaluate the probability of successful outcome in the event of positive decision -- and take a positive decision if the probability is above a threshold to satisfy a leniency constraint (i.e., make a positive decision in a fraction \leniency of cases).
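
One minimal way to realize \cfmachine is sketched below, under the assumption that a weighted logistic regression is an acceptable model for \prob{\cfoutcome | \obsFeatures}; the fractional outcomes are handled by duplicating each entry with weights \cfoutcome and $1 - \cfoutcome$.
\begin{verbatim}
# Sketch of a decision maker built on imputed outcomes: fit a model of
# P(y_hat | x), then make positive decisions for the fraction r of
# cases with the highest predicted probability of success.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_cf_machine(X, y_hat):
    # Duplicate each entry, once as a success (weight y_hat) and once
    # as a failure (weight 1 - y_hat), to handle fractional labels.
    X2 = np.vstack([X, X])
    labels = np.concatenate([np.ones(len(X)), np.zeros(len(X))])
    weights = np.concatenate([y_hat, 1.0 - y_hat])
    return LogisticRegression().fit(X2, labels, sample_weight=weights)

def decide(model, X, leniency):
    p = model.predict_proba(X)[:, 1]
    threshold = np.quantile(p, 1.0 - leniency)
    return (p >= threshold).astype(int)
\end{verbatim}
Thresholding at the $(1 - \leniencyValue)$-quantile of the predicted probabilities yields a positive decision in a fraction \leniency of cases, satisfying the leniency constraint.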
\hide{
	In counterfactual-based imputation we use counterfactual values of the outcome $\outcome_{\decisionValue=1}$ to impute the missing labels. The SCM required to compute the counterfactuals is presented in Figure~\ref{fig:causalmodel}. Using Stan, we model the observed data as
	\begin{align} \label{eq:data_model}
	 \outcome ~|~\decision = 1, x & \sim \text{Bernoulli}(\invlogit(\alpha_y + \beta_{xy} x + \beta_{zy} z)) \\
	 \decision ~|~D, x & \sim \text{Bernoulli}(\invlogit(\alpha_{j} + \beta_{xt} x + \beta_{zt} z)). \nonumber
	\end{align}
	That is, we fit one logistic regression model for the decisions, based on the observable features \obsFeatures and the identity of the judge, using all of the data. The identity of the judge is encoded in the intercept $\alpha_j$ (we use a different intercept for each judge). We model the observed outcomes with $\decision = 1$ using a separate regression model, learning the coefficients $\beta_{xy}$ for the observed features and $\beta_{zy}$ for the unobserved features, the sole intercept $\alpha_y$, and the values of the latent variable \unobservable.

	Using the samples from the posterior distribution of all the parameters given by Stan, we can estimate the values of the counterfactuals. Formally, the counterfactuals are drawn from the posterior predictive distribution
	\[
	p(\tilde{y} \mid y) = \int_\Omega p(\tilde{y} \mid \theta)\, p(\theta \mid y)\, d\theta.
	\]

	In practice, once we have run Stan, we have $S$ samples of all the parameters of the model from the posterior distribution $p(\theta|y)$ (the probability of the parameters given the data). We then use those values to sample probable outcomes for the missing values. Suppose, for instance, that the outcome $\outcomeValue_i$ of some observation is missing. Using Stan, we obtain a sample of size $S$ for the coefficients, the intercepts and $\unobservableValue_i$. We plug these values into the model on the first line of Equation~\ref{eq:data_model} and draw counterfactual values for the outcome $\outcome$ from the distribution $y_{i, \decisionValue=1} \sim \text{Bernoulli}(\invlogit(\alpha_y + \beta_{xy} x_i + \beta_{zy} z_i))$. In essence, we use the sampled parameter values from the posterior to sample new values for the missing outcomes. As we have $S$ ``guesses'' for each missing outcome, we compute the failure rate for each set of guesses and use their mean.


	\begin{algorithm}
	\DontPrintSemicolon
	\KwIn{Test data set $\dataset = \{x, j, t, y\}$, acceptance rate $r$} 
	\KwOut{Failure rate at acceptance rate $r$} 
	Using Stan, draw $S$ samples of all the parameters from the posterior distribution of the model defined in Equation~\ref{eq:data_model}. Every item of the vector \unobservableValue is treated as a parameter.\;
	\For{i in $1, \ldots, S$}{
		\For{j in $1, \ldots, \datasize$}{
		Draw new outcome $\tilde{\outcome}_{j}$ from $\text{Bernoulli}(\invlogit(\alpha_{y}[i] + \beta_{xy}[i] x_j + \beta_{zy}[i] z[i, j]))$\;
		}
		Impute missing values using outcomes drawn in the previous step.\;
		Sort the observations in ascending order based on the predictions of the predictive model.\;
		Estimate the \failurerate as $\frac{1}{\datasize}\sum_{k=1}^{\datasize\cdot r} \indicator{\outcomeValue_k=0}$ and append it to $\mathcal{U}$.\;
	\Return{Mean of $\mathcal{U}$.}
		
	\caption{Counterfactual-based imputation}
	\end{algorithm}