We use a propensity score framework to model $X$ and $Z$: they are assumed continuous.
\acomment{I do not understand what we are doing from this section. It needs to be described ASAP.}
\begin{itemize}
\item Theory \\ (Present here (1) what counterfactuals are, (2) motivation for structural equations, (3) an example or other more easily approachable explanation of applying them, (4) why we used computational methods)
\begin{itemize}
\item Counterfactuals are
\begin{itemize}
\item hypothesized quantities that encode the would-have-been relation of the outcome and the treatment assignment.
\item Using counterfactuals, we can discuss hypothetical events that didn't happen.
\item Using counterfactuals requires defining a structural causal model.
\item Pearl's Book of Why: "The fundamental problem"
\end{itemize}
\item By defining structural equations / a graph
\begin{itemize}
\item we can begin formulating causal questions to get answers to our questions.
\item Once we have defined the equations, counterfactuals are obtained by... (abduction, action, prediction, don't we apply the do operator on the \decision, so that we obtain $\outcome_{\decision=1}(x)$?)
\item We denote the counterfactual "Y had been y had T been t" with...
\item By first estimating the distribution of the latent variable Z we can impose
\begin{definition}[Counterfactual]
Let $M$ be a structural model and $M_x$ a modified version of $M$, with the equation(s) of $X$ replaced by $X = x$. Denote the solution for $Y$ in the equations of $M_x$ by the symbol $Y_{M_x}(u)$. The counterfactual $Y_x(u)$ (read: "the value of $Y$ in unit $u$, had $X$ been $x$") is given by:
\begin{equation}\label{eq:counterfactual}
Y_x(u) := Y_{M_x}(u)
\end{equation}
\end{definition}
\end{itemize}
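The abduction--action--prediction recipe above can be illustrated on a hypothetical toy SCM (not the model used in this paper): with structural equation $Y = 2X + U_Y$, abduction recovers the noise term $U_Y$ from an observed unit, action replaces the equation of $X$ with $X := x'$, and prediction recomputes $Y$ under the modified model.

```python
# Toy linear SCM (illustrative only): X = U_x, Y = 2*X + U_y.

def abduct_u_y(x_obs, y_obs):
    # Abduction: solve the structural equation Y = 2*X + U_y for U_y.
    return y_obs - 2 * x_obs

def counterfactual_y(x_obs, y_obs, x_new):
    # Action + prediction: set X := x_new in the modified model M_x
    # and recompute Y with the abducted noise term held fixed.
    u_y = abduct_u_y(x_obs, y_obs)
    return 2 * x_new + u_y

# A unit with X = 1, Y = 5 implies U_y = 3; had X been 0, Y would be 3.
print(counterfactual_y(1, 5, 0))  # -> 3
```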
\item At a high level
\begin{itemize}
\item there is usually some data recoverable from the unobservables. For example, if the observable attributes are contrary to the outcome/decision we can claim that the latent variable included some significant information.
\item We retrieve this information using the prespecified structural equations. After estimating the desired parameters, we can estimate the value of the counterfactual (not observed) outcome by switching the value of \decision and doing the computations through the rest of the graph...
\end{itemize}
\item Because the causal effect of \decision on \outcome is not identifiable, we used a Bayesian approach
\item Recent advances in computational methods provide us with ways of inferring the value of the latent variable by applying Bayesian techniques to... Previously this kind of analysis required us to define X and compute Y...
\end{itemize}
\item Model (Structure, equations in a general and more specified level, assumptions, how we construct the counterfactual...)
\begin{itemize}
\item Structure is as is in the diagram. Square around Z represents that it's unobservable/latent.
The features of the subjects include observable and -- possibly -- unobservable features, denoted with X and Z respectively. The only feature of a decider is their leniency R (depicting some baseline probability of a positive decision). The decisions given will be denoted with T and the resulting outcomes with Y, where 0 stands for negative outcome or decision and 1 for positive.
\item The causal diagram presents how decision T is affected by the decider's leniency (R), the subject's observable private features (X) and the latent information regarding the subject's tendency for a negative outcome (Z). Correspondingly the outcome (Y) is affected only by the decision T and the above-mentioned features X and Z.
\item The causal directions and implied independencies are readable from the diagram. We assume X and Z to be independent.
\item The structural equations connecting the variables can be formalized at a general level as (see Jung)
where the beta and alpha coefficients are the path coefficients specified in the causal diagram
\item This general formulation of the selective labels problem enables the use of this approach even when the outcome is not binary. Notably, compared to that of Jung et al., this approach makes the selective labels issue explicit in the structural equations when we deterministically set the value of the outcome $y$ to one in the event of a negative decision. In addition, we allow the judges to differ in their baseline probabilities of a positive decision, which is by definition their leniency.
\item Now by imposing a value for the decision \decision we can obtain the counterfactual by simply assigning the desired value to the equations in \ref{eq:structural_equations}. This assumes that... (Consistency constraint) Now we want to know {\it what would have been the outcome \outcome for this individual \featuresValue had the decision been $\decision=1$, or more specifically $\outcome_{\decision=1}(\featuresValue)$}.
\item To compute the values of the counterfactuals, we need estimates of the coefficients and latent variables. We specified a Bayesian (structural) model, which requires establishing a set of probabilistic expressions connecting the observed quantities to the parameters of interest. The relationships of the variables and coefficients are presented at a general level in equation \ref{eq:structural_equations} and figure X. We modelled the observed data as
\begin{align}
y(1) &\sim \text{Bernoulli}(\invlogit(\beta_{xy} x + \beta_{zy} z)) \nonumber \\
t &\sim \text{Bernoulli}(\invlogit(\alpha_{j} + \beta_{xt} x + \beta_{zt} z)). \nonumber
\end{align}
\item Bayesian models also require the specification of prior distributions for the quantities of interest in order to obtain an estimate of their distribution after observing data, the posterior distribution.
\item Identifiability of models with unobserved confounding has been discussed by, e.g., McCandless et al. and Gelman. Following Gelman, we note that scale-invariance has been addressed through the specification of the priors. (?)
\item Specify, motivate and explain priors here if space.
\end{itemize}
\item Computation (Stan in general, ...)
\begin{itemize}
\item Using the model specified in equation \ref{eq:data_model}, we used Stan to estimate the intercepts, path coefficients and latent variables. Stan provides tools for efficient computation of posterior distributions. Stan uses the No-U-Turn Sampler (NUTS), an extension of the Hamiltonian Monte Carlo (HMC) algorithm, to computationally estimate the posterior distribution for inference. (At a high level, the sampler uses the gradient of the log posterior to compute the potential and kinetic energy of a particle on the multi-dimensional posterior surface and thereby draw samples from it.) Stan also implements black-box variational inference algorithms and direct optimization of the posterior distribution, but these were deemed insufficient for estimating the posterior in this setting.
\item Chain lengths were set to X and number of chains deployed was Y. (Explain algorithm fully later)
\end{itemize}
\end{itemize}
In counterfactual-based imputation we use counterfactual values of the outcome $\outcome_{\decisionValue=1}$ to impute the missing labels. The SCM required to compute the counterfactuals is presented in figure X. Using Stan, we model the observed data as
\begin{align}\label{eq:data_model}
y(1) &\sim \text{Bernoulli}(\invlogit(\alpha_y + \beta_{xy} x + \beta_{zy} z)) \\ \nonumber
t &\sim \text{Bernoulli}(\invlogit(\alpha_{j} + \beta_{xt} x + \beta_{zt} z)).
\end{align}
That is, we fit one logistic regression model for the decisions, based on the observable features \features and the identity of the judge, using all of the data. The identity of the judge is encoded in the intercept $\alpha_j$ (we use a different intercept for each judge). We model the observed outcomes with $\decision=1$ with a separate regression model to learn the coefficients $\beta_{xy}$ for the observed features and $\beta_{zy}$ for the unobserved features, and a single intercept $\alpha_y$.
Using the samples from the posterior distribution for all the parameters given by Stan, we can estimate the values of the counterfactuals. The counterfactuals are formally drawn from the posterior predictive distribution
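Written out in standard form (our notation, sketching the integral that the posterior samples approximate), the posterior predictive distribution of the counterfactual outcome marginalizes the outcome model over the posterior of the parameters and latent variables:

```latex
% Sketch in our notation: theta collects the intercept, path coefficients
% and latent variables of the outcome model.
p\big(y_{i,\,\decisionValue=1} \mid \text{data}\big)
  \;=\; \int p\big(y_{i,\,\decisionValue=1} \mid \theta\big)\,
             p\big(\theta \mid \text{data}\big)\, \mathrm{d}\theta,
\qquad \theta = (\alpha_y, \beta_{xy}, \beta_{zy}, \unobservable).
```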
In practice, once we have run Stan, we have $S$ samples of all model parameters from the posterior distribution $p(\theta|y)$ (the probability of the parameters given the data). We then use these values to sample probable outcomes for the missing values. Suppose the outcome $\outcomeValue_i$ of some observation is missing. From Stan we obtain $S$ posterior draws of the coefficients, the intercepts and $\unobservableValue_i$. Plugging these draws into the model on the first line of equation \ref{eq:data_model}, we can draw counterfactual values for the outcome $Y$ from the distribution $y_{i, \decisionValue=1}\sim\text{Bernoulli}(\invlogit(\alpha_y +\beta_{xy} x_i +\beta_{zy} z_i))$. In essence, we use the sampled parameter values from the posterior to sample new values for the missing outcomes. As we have $S$ ``guesses'' for each missing outcome, we compute the failure rate for each set of guesses and use their mean.
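As a sketch of this imputation step (the dictionary keys `alpha_y`, `beta_xy`, `beta_zy`, `z` are hypothetical names for the posterior draws; the actual Stan output names may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

def invlogit(a):
    return 1.0 / (1.0 + np.exp(-a))

def impute_failure_rate(post, x, y_obs, t):
    """For units with t == 0, draw y_{t=1} from the posterior predictive
    (one draw per posterior sample s) and return the mean failure rate
    over the S samples. `post["z"]` is an S x N matrix of latent draws."""
    S = post["alpha_y"].shape[0]
    rates = np.empty(S)
    for s in range(S):
        p = invlogit(post["alpha_y"][s]
                     + post["beta_xy"][s] * x
                     + post["beta_zy"][s] * post["z"][s])
        y_draw = rng.binomial(1, p)               # counterfactual y_{t=1}
        y_full = np.where(t == 1, y_obs, y_draw)  # keep observed outcomes
        rates[s] = np.mean(y_full == 0)           # failure = negative outcome
    return rates.mean()
```

Each iteration of the loop corresponds to one set of "guesses"; the returned mean is the point estimate described above.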
\begin{algorithm}
%\item Potential outcomes / CBI \acomment{Put this in section 3? Algorithm box with these?}
\begin{itemize}
\item Take the test set.
\item Using Stan, draw $S$ samples of all the parameters of the model defined in equation \ref{eq:data_model} from the posterior distribution. Every element of the vector \unobservableValue is treated as a parameter.
\item For $i$ in $1, \ldots, S$:
\item\hskip1.0em For $j$ in $1, \ldots, \datasize$:
\item\hskip2.0em Draw a counterfactual outcome $\outcome_{\decisionValue=1}$ from $\text{Bernoulli}(\invlogit(\alpha_{y}[i]+\beta_{xy}[i] x_j +\beta_{zy}[i] z[i, j]))$.
\item\hskip1.0em Impute the missing outcomes with the drawn counterfactuals, compute the failure rate and add it to $\mathcal{U}$.
\item Obtain a point estimate for the failure rate by computing the mean of the $S$ estimates in $\mathcal{U}$.
\item Estimates for the counterfactuals $Y(1)$ for the unobserved values of $Y$ were obtained using the posterior expectations from Stan. We used the NUTS sampler to estimate the posterior. When the values for...
...
...
In this section we present our results from experiments with synthetic and realistic data.
We experimented with synthetic data sets to examine accurateness, unbiasedness and robustness to violations of the assumptions.
We sampled $N=50k$ observations of $X$, $Z$, and $W$ as independent standard Gaussians. We then drew the outcome $Y$ from a Bernoulli distribution with parameter $p = 1-\invlogit(\beta_x x+\beta_z z+\beta_w w)$, so that $P(Y=0|X, Z, W)=\invlogit(\beta_x x+\beta_z z+\beta_w w)$, where the coefficients for $X$, $Z$ and $W$ were set to $1$, $1$ and $0.2$ respectively. We sampled a leniency level $R$ for each of the $M=100$ judges uniformly from $[0.1, 0.9]$. We assigned the subjects to judges at random so that each judge decided on $500$ subjects. In the example, this mimics having 100 judges each deciding for 500 defendants. The data was divided in half to form a training set and a test set. This process follows the suggestion of Lakkaraju et al. \cite{lakkaraju2017selective}. \acomment{Check before?}
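A minimal sketch of this generator, assuming the coefficients and judge assignment described above (function and variable names are ours, not from the experiment code):

```python
import numpy as np

rng = np.random.default_rng(1)

def generate(N=50_000, M=100, b_x=1.0, b_z=1.0, b_w=0.2):
    # Independent standard Gaussian features; W is unobserved noise.
    x, z, w = rng.standard_normal((3, N))
    # P(Y = 0 | x, z, w) = invlogit(b_x*x + b_z*z + b_w*w).
    p_fail = 1.0 / (1.0 + np.exp(-(b_x * x + b_z * z + b_w * w)))
    y = rng.binomial(1, 1.0 - p_fail)
    # One leniency level per judge, drawn uniformly from [0.1, 0.9];
    # subjects are assigned at random, N // M subjects per judge.
    leniency = rng.uniform(0.1, 0.9, size=M)
    judge = rng.permutation(np.repeat(np.arange(M), N // M))
    r = leniency[judge]
    half = N // 2  # 50/50 train/test split as in the text
    return dict(x=x, z=z, w=w, y=y, judge=judge, r=r, split=half)
```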
%This is one data generation module.
% It can be / was modified by changing the outcome producing mechanism. For other experiments we changed the outcome generating mechanism so that the outcome was assigned value 1 if
The \emph{default} decision maker in the data predicts a subject's probability of recidivism as $P(\decision=0~|~\features, \unobservable)=\invlogit(\beta_x x+\beta_z z)$. Each decision-maker is assigned a leniency value $r$, and the decision is made by comparing $P(\decision=0~|~\features, \unobservable)$ to the value of the inverse cumulative distribution function $F^{-1}_{P(\decision=0~|~\features, \unobservable)}(r)=F^{-1}(r)$. If $F^{-1}(r) < P(\decision=0~|~\features, \unobservable)$ the subject is given a negative decision $\decision=0$, and a positive decision otherwise. \rcomment{Needs double checking.} This ensures that the decisions made are independent and that the rate of positive decisions stochastically converges to $r$. Then -- as explained in section \ref{sec:framework} -- the outcomes for which the decision was negative were deterministically set to $1$.
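This decision rule can be sketched with the empirical quantile of the predicted risks standing in for $F^{-1}$ (a hypothetical helper, not the experiment code):

```python
import numpy as np

def decide(risk, r):
    """risk: array of P(T=0 | x, z) per subject; r: judge leniency.
    Subjects whose risk exceeds the empirical r-quantile of the risk
    distribution get a negative decision (T=0); the rest get T=1, so
    the fraction of positive decisions is approximately r."""
    threshold = np.quantile(risk, r)        # empirical stand-in for F^{-1}(r)
    return (risk <= threshold).astype(int)  # T=1 iff risk below threshold
```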
We used a number of different decision mechanisms. A \emph{limited} decision maker works like the default but uses the regression model $Y \sim \invlogit(\beta_x x)$; hence it is unable to observe $Z$.
A \emph{biased} decision maker works similarly to the limited one, but the logistic regression model is ... which biases the decision.
Given leniency $R$, a \emph{random} decision maker decides $T=1$ with probability $R$.
In contrast, Lakkaraju et al. essentially order the subjects and decide $T=1$ for the percentage of subjects given by the leniency $R$. We see this as unrealistic: the decisions