\section{Counterfactual-Based Imputation For Selective Labels}

%\acomment{This chapter should be our contributions. One discuss previous results we build over but one should consider putting them in the previous section.}

\subsection{Causal Modeling}



We model the selective labels setting as summarized in Figure~\ref{fig:causalmodel}~\cite{lakkaraju2017selective}.

The outcome $Y$ is affected by observed background factors $X$ and unobserved background factors $Z$. These background factors also influence the decision $T$ recorded in the data. Hence $Z$ includes information that was used by the decision maker but that is not available to us as observations.
There may also be further background factors that affect $Y$ but not $T$. Finally, we assume the decision is affected by an observed leniency level $R \in [0,1]$ of the decision maker.


We use a propensity score framework to model $X$ and $Z$: both are assumed to be continuous Gaussian variables, with the interpretation that they summarize risk factors such that higher values denote a higher risk of a negative outcome ($Y=0$). The Gaussianity assumption is motivated by the central limit theorem, as each variable aggregates many individual risk factors.
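For concreteness, this setting can be simulated with a minimal synthetic sketch; all coefficient values and sign choices below are illustrative assumptions, not estimates from the paper.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5_000

# Observed and unobserved risk factors, summarized as standard Gaussians
x = rng.normal(size=n)
z = rng.normal(size=n)

def invlogit(a):
    return 1.0 / (1.0 + np.exp(-a))

# Illustrative coefficients: negative signs so that higher risk (large x, z)
# lowers both the release probability and the success probability
alpha_j, beta_xt, beta_zt = 0.0, -1.0, -1.0  # decision model (per-judge intercept)
alpha_y, beta_xy, beta_zy = 0.0, -1.0, -1.0  # outcome model

t = rng.binomial(1, invlogit(alpha_j + beta_xt * x + beta_zt * z))  # T=1: release
y = rng.binomial(1, invlogit(alpha_y + beta_xy * x + beta_zy * z)).astype(float)
y[t == 0] = np.nan  # selective labels: outcome unobserved when jailed
```

The last line encodes the selective labels problem itself: outcomes exist only for released subjects.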


\subsection{Imputation}
%\acomment{We need to start by noting that with a simple example how we assume this to work. If X indicates a safe subject that is jailed, then we know that (I dont know how this applies to other produces) that Z must have indicated a serious risk. This makes $Y=0$ more likely than what regression on $X$ suggests.} done by Riku!


%\acomment{I do not understand what we are doing from this section. It needs to be described ASAP.}



Our approach is based on the observation that, in almost all cases, some information about the latent variable is recoverable. For illustration, consider a defendant $i$ who has received a negative decision $\decisionValue_i = 0$. If the defendant's observed features $\featuresValue_i$ indicate that the subject would be safe to release, we can deduce that the unobserved variable $\unobservableValue_i$ must have indicated high risk, since the defendant was nevertheless jailed. In turn, this makes $Y=0$ more likely than a regression on $\featuresValue_i$ alone would suggest.
Conversely, when the features $\featuresValue_i$ clearly indicate risk and the defendant is subsequently jailed, the decision carries little additional information about the latent variable.

\acomment{Could emphasize the above with a plot, x and z in the axis and point styles indicating the decision.}
\acomment{The above assumes that the decision maker in the data is not totally bad.}
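This intuition can be checked numerically. The sketch below (the decision model and its sign convention are hypothetical: jailing probability rises with total risk $x + z$) estimates $\mathrm{E}[Z \mid T=0, x]$ by weighting prior draws of $Z$ by the jailing probability.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=200_000)  # prior draws of the latent risk factor

def p_jail(x, z):
    # Hypothetical decision model: P(T=0 | x, z) increases with total risk x + z
    return 1.0 / (1.0 + np.exp(-(x + z)))

# Posterior mean of Z among jailed subjects, E[Z | T=0, x],
# via importance weighting of the prior draws
for x in (-2.0, 2.0):
    w = p_jail(x, z)
    print(f"x = {x:+.1f}: E[Z | T=0, x] = {np.average(z, weights=w):.2f}")
```

A safe-looking subject ($x = -2$) who was jailed must have had a clearly elevated $Z$, whereas for a risky-looking subject ($x = +2$) the jailing decision reveals almost nothing about $Z$.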


In counterfactual-based imputation we use counterfactual values of the outcome, $\outcome_{\decisionValue=1}$, to impute the missing labels. The structural causal model (SCM) required to compute the counterfactuals is presented in Figure~\ref{fig:causalmodel}. Using Stan, we model the observed data as
\begin{align} \label{eq:data_model}
 \outcome ~|~ \decision = 1, x & \sim \text{Bernoulli}(\invlogit(\alpha_y + \beta_{xy} x + \beta_{zy} z)) \\ \nonumber
 \decision ~|~ R, x & \sim \text{Bernoulli}(\invlogit(\alpha_{j} + \beta_{xt} x + \beta_{zt} z)).
\end{align}


That is, we fit one logistic regression model for the decisions, using all of the data, based on the observable features \features and the identity of the judge. The identity of the judge is encoded in the intercept $\alpha_j$: we use a separate intercept for each judge. We model the observed outcomes (those with $\decision = 1$) with a second regression model to learn the remaining parameters: the coefficients $\beta_{xy}$ for the observed features and $\beta_{zy}$ for the unobserved features, the single intercept $\alpha_y$, and the values of the latent variable \unobservable.

Using the samples from the posterior distribution of all parameters given by Stan, we can estimate the values of the counterfactuals. Formally, the counterfactuals are drawn from the posterior predictive distribution
\[
p(\tilde{y} \mid y) = \int_\Omega p(\tilde{y} \mid \theta)\, p(\theta \mid y)\, d\theta.
\]

In practice, Stan gives us $S$ samples of all model parameters from the posterior distribution $p(\theta \mid y)$, the probability of the parameters given the data. We then use these samples to draw probable values for the missing outcomes. Suppose the outcome $\outcomeValue_i$ of some observation is missing. From Stan we obtain $S$ samples of the coefficients, the intercepts, and $\unobservableValue_i$. Plugging each of these parameter samples into the outcome model on the first line of Equation~\ref{eq:data_model}, we draw counterfactual values for the outcome from $y_{i, \decisionValue=1} \sim \text{Bernoulli}(\invlogit(\alpha_y + \beta_{xy} x_i + \beta_{zy} z_i))$. In essence, we use the sampled parameter values from the posterior to sample new values for the missing outcomes. As this yields $S$ ``guesses'' for each missing outcome, we compute the failure rate separately for each set of guesses and report the mean.
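The sampling step can be sketched as follows. The posterior draws below are random stand-ins for the output of Stan, and all numeric values are hypothetical; in practice the arrays would be extracted from the fitted Stan model.

```python
import numpy as np

rng = np.random.default_rng(1)
invlogit = lambda a: 1.0 / (1.0 + np.exp(-a))

S = 1_000                              # number of posterior samples from Stan
x = np.array([-1.0, 0.0, 0.5, 2.0])    # subjects with missing outcomes
n = len(x)

# Stand-ins for posterior draws (hypothetical values, not a real fit)
alpha_y = rng.normal(0.0, 0.1, size=S)
beta_xy = rng.normal(-1.0, 0.1, size=S)
beta_zy = rng.normal(-1.0, 0.1, size=S)
z = rng.normal(size=(S, n))            # one posterior draw of z_i per sample

# For each posterior sample s, draw a counterfactual outcome under T = 1
p = invlogit(alpha_y[:, None] + beta_xy[:, None] * x[None, :] + beta_zy[:, None] * z)
y_tilde = rng.binomial(1, p)           # shape (S, n): S "guesses" per missing outcome

# Failure rate for each set of guesses, then its posterior mean
fr = (y_tilde == 0).mean(axis=1)
print(fr.mean())
```

Each row of `y_tilde` is one complete imputation of the missing outcomes, so `fr.mean()` averages the failure rate over the posterior uncertainty in both the parameters and the latent variables.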

%\begin{itemize}
%\item Theory \\ (Present here (1) what counterfactuals are, (2) motivation for structural equations, (3) an example or other more easily approachable explanation of applying them, (4) why we used computational methods)
%	\begin{itemize}
%	\item Counterfactuals are 
%		\begin{itemize}
%		\item hypothesized quantities that encode the would-have-been relation of the outcome and the treatment assignment.
%		\item Using counterfactuals, we can discuss hypothetical events that didn't happen. 
%		\item Using counterfactuals requires defining a structural causal model.
%		\item Pearl's Book of Why: "The fundamental problem"
%		\end{itemize}
%	\item By defining structural equations / a graph
%		\begin{itemize}
%		\item we can begin formulating causal questions to get answers to our questions.
%		\item Once we have defined the equations, counterfactuals are obtained by... (abduction, action, prediction, don't we apply the do operator on the \decision, so that we obtain $\outcome_{\decision=1}(x)$?)
%		\item We denote the counterfactual "Y had been y had T been t" with...
%		\item By first estimating the distribution of the latent variable Z we can impose 
%		\item Now counterfactuals can be defined as
%			\begin{definition}[Unit-level counterfactuals \cite{pearl2010introduction}]
%			Let $M$ be a structural model and $M_x$ a modified version of $M$, with the equation(s) of $X$ replaced by $X = x$. Denote the solution for $Y$ in the equations of $M_x$ by the symbol $Y_{M_x}(u)$. The counterfactual $Y_x(u)$ (Read: "The value of Y in unit u, had X been x") is given by:
%			\begin{equation} \label{eq:counterfactual}
%				Y_x(u) := Y_{M_x}(u)
%			\end{equation}
%			\end{definition}
%		\end{itemize}
%	\item In a high level
%		\begin{itemize}
%		\item there is usually some data recoverable from the unobservables. For example, if the observable attributes are contrary to the outcome/decision we can claim that the latent variable included some significant information.
%		\item We retrieve this information using the prespecified structural equations. After estimating the desired parameters, we can estimate the value of the counterfactual (not observed) outcome by switching the value of \decision and doing the computations through the rest of the graph...
%		\end{itemize}
%	\item Because the causal effect of \decision to \outcome is not identifiable, we used a Bayesian approach
%	\item Recent advances in the computational methods provide us with ways of inferring the value of the latent variable by applying Bayesian techniques to... Previously this kind of analysis required us to define X and compute Y...
%\end{itemize}
%\item Model (Structure, equations in a general and more specified level, assumptions, how we construct the counterfactual...) 
%	\begin{itemize}
%	\item Structure is as is in the diagram. Square around Z represents that it's unobservable/latent.
%	The features of the subjects include observable and -- possibly -- unobservable features, denoted with X and Z respectively. The only feature of a decider is their leniency R (depicting some baseline probability of a positive decision). The decisions given will be denoted with T and the resulting outcomes with Y, where 0 stands for negative outcome or decision and 1 for positive.
%	\item The causal diagram presents how decision T is affected by the decider's leniency (R), the subject's observable private features (X) and the latent information regarding the subject's tendency for a negative outcome (Z). Correspondingly the outcome (Y) is affected only by the decision T and the above-mentioned features X and Z. 
%	\item The causal directions and implied independencies are readable from the diagram. We assume X and Z to be independent.
%	\item The structural equations connecting the variables can be formalized in a general level as (see Jung)
%		\begin{align} \label{eq:structural_equations}
%		\outcome_0 & = NA \\ \nonumber
%		\outcome_1 & \sim f(\featuresValue, \unobservableValue; \beta_{\featuresValue\outcomeValue}, \beta_{\unobservableValue\outcomeValue}) \\ \nonumber
%		\decision      & \sim g(\featuresValue, \unobservableValue; \beta_{\featuresValue\decisionValue}, \beta_{\unobservableValue\decisionValue}, \alpha_j) \\ \nonumber
%		\outcome & =\outcome_\decisionValue \\ \nonumber
%		\end{align}
%	where the beta and alpha coefficients are the path coefficients specified in the causal diagram
%	\item This general formulation of the selective labels problem enables the use of this approach even when the outcome is not binary. Notably this approach -- compared to that of Jung et al. -- explicates the selective labels issue to the structural equations when we deterministically set the value of outcome y to be one in the event of a negative decision. In addition, we allow the judges to differ in the baseline probabilities for positive decisions, which is by definition leniency.
%	\item Now by imposing a value for the decision \decision we can obtain the counterfactual by simply assigning the desired value to the equations in \ref{eq:structural_equations}. This assumes that... (Consistency constraint) Now we want to know {\it what would have been the outcome \outcome for this individual \featuresValue had the decision been $\decision = 1$, or more specifically $\outcome_{\decision = 1}(\featuresValue)$}.
%	\item To compute the value for the counterfactuals, we need to obtain estimates for the coefficients and latent variables. We specified a Bayesian (/structural) model, which requires establishing a set of probabilistic expressions connecting the observed quantities to the parameters of interest. The relationships of the variables and coefficients are presented in equation \ref{eq:structural_equations} and figure X in a general level. We modelled the observed data as  
%%		\begin{align} \label{eq:data_model}
%%		 y(1) & \sim \text{Bernoulli}(\invlogit(\beta_{xy} x + \beta_{zy} z)) \\ \nonumber
%%		 t & \sim \text{Bernoulli}(\invlogit(\alpha_{j} + \beta_{xt} x + \beta_{zt}z)). \\ \nonumber
%%		\end{align}
%	\item Bayesian models also require the specification of prior distributions for the variables of interest to obtain an estimate of their distribution after observations, the posterior distribution.
%	\item Identifiability of models with unobserved confounding has been discussed by eg McCandless et al and Gelman. As by Gelman we note that scale-invariance has been tackled with specifying the priors.  (?)
%	\item Specify, motivate and explain priors here if space.
%	\end{itemize}
%\item Computation (Stan in general, ...)
%	\begin{itemize}
%	\item Using the model specified in equation \ref{eq:data_model}, we used Stan to estimate the intercepts, path coefficients and latent variables. Stan provides tools for efficient computational estimates of posterior distributions.  Stan uses No-U-Turn Sampling (NUTS), an extension of Hamiltonian Monte Carlo (HMC) algorithm, to computationally estimate the posterior distribution for inferences. (In a high level, the sampler utilizes the gradient of the posterior to compute potential and kinetic energy of an object in the multi-dimensional surface of the posterior to draw samples from it.) Stan also has implementations of black-box variational inference algorithms and direct optimization algorithms for the posterior distribution but they were deemed to be insufficient for estimating the posterior in this setting
%	\item Chain lengths were set to X and number of chains deployed was Y. (Explain algorithm fully later)
%	\end{itemize}
%\end{itemize}

\begin{figure}
    \begin{tikzpicture}[->,>=stealth',node distance=1.5cm, semithick]

  \tikzstyle{every state}=[fill=none,draw=black,text=black]

  \node[state] (R)                    {$R$};
  \node[state] (X) [right of=R] {$X$};
  \node[state] (T) [below of=X] {$T$};
  \node[state] (Z) [rectangle, right of=X] {$Z$};
  \node[state] (Y) [below of=Z] {$Y$};

  \path (R) edge (T)
        (X) edge (T)
	     edge (Y)
        (Z) edge (T)
	     edge (Y)
        (T) edge (Y);
\end{tikzpicture}
\caption{$R$ is the leniency of the decision maker, $T$ is a binary decision, and $Y$ is the outcome that is selectively labeled. Background features $X$ of a subject affect both the decision and the outcome. Additional background features $Z$ are visible only to the decision maker.}\label{fig:causalmodel}
\end{figure}

\begin{algorithm}
	%\item Potential outcomes / CBI \acomment{Put this in section 3? Algorithm box with these?}
\DontPrintSemicolon
\KwIn{Test data set $\dataset = \{x, j, t, y\}$, acceptance rate $r$} 
\KwOut{Failure rate at acceptance rate $r$} 
Using Stan, draw $S$ samples of all parameters from the posterior distribution defined in Equation~\ref{eq:data_model}. Every element of the vector \unobservableValue is treated as a parameter.\;
\For{i in $1, \ldots, S$}{
	\For{j in $1, \ldots, \datasize$}{
		Draw new outcome $\tilde{\outcome}_{j}$ from $\text{Bernoulli}(\invlogit(\alpha_y[i] + \beta_{xy}[i] x_j + \beta_{zy}[i] z[i, j]))$
	}
	Impute missing values using outcomes drawn in the previous step.\;
	Sort the observations in ascending order based on the predictions of the predictive model.\;
	Estimate the failure rate as $\frac{1}{\datasize}\sum_{k=1}^{\datasize\cdot r} \indicator{\outcomeValue_k=0}$ and append it to $\mathcal{U}$.\;
}
\Return{Mean of $\mathcal{U}$.}
	
\caption{Counterfactual-based imputation}
\end{algorithm}
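The failure-rate step of the algorithm can be sketched as below (the function and variable names are our own); note that, as in the algorithm, the count of failures is normalized by the full data size $\datasize$, not by the number of released subjects.

```python
import numpy as np

def failure_rate(pred_risk, outcome, r):
    """Failure rate when the fraction r of subjects judged least risky
    is released; outcome 0 = failure, 1 = success (imputed where missing).
    Normalized by the full data size, as in the algorithm."""
    n = len(outcome)
    released = np.argsort(pred_risk)[: int(n * r)]  # ascending predicted risk
    return (outcome[released] == 0).sum() / n

risk = np.array([0.9, 0.1, 0.4, 0.7, 0.2])
y = np.array([0, 1, 0, 0, 1])
print(failure_rate(risk, y, 0.6))  # releases the 3 least risky subjects
```

Here the three least risky subjects (indices 1, 4, 2) are released, one of them fails, and the result is $1/5 = 0.2$.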

%\section{Extension To Non-Linearity (2nd priority)}

% If X has multiple dimensions or the relationships between the features and the outcomes are clearly non-linear the presented approach can be extended to accomodate non-lineairty. Jung proposed that... Groups... etc etc.