From 3d1679c42547a9fddb557f10bb3899e87e20d02a Mon Sep 17 00:00:00 2001
From: Riku-Laine <28960190+Riku-Laine@users.noreply.github.com>
Date: Fri, 9 Aug 2019 16:13:53 +0300
Subject: [PATCH] Chapters 6 and 3

---
 paper/sl.tex | 63 +++++++++++++++++++++-------------------------------
 1 file changed, 25 insertions(+), 38 deletions(-)

diff --git a/paper/sl.tex b/paper/sl.tex
index e933f90..d239154 100755
--- a/paper/sl.tex
+++ b/paper/sl.tex
@@ -324,6 +324,7 @@ We use a propensity score framework to model $X$ and $Z$: they are assumed conti
 
 \acomment{I do not understand what we are doing from this section. It needs to be described ASAP.}
 
+Our approach is based on the fact that in almost all cases, some information regarding the latent variable is recoverable. For illustration, consider defendant $i$ who has been given a negative decision $\decisionValue_i = 0$. If the defendant's private features $\featuresValue_i$ indicate that the subject would have been safe to release, we can deduce that the unobservable variable $\unobservableValue_i$ must have contained information significant enough that the defendant had to be jailed. In the opposite situation, where the features $\featuresValue_i$ clearly imply that the defendant is dangerous and they are subsequently jailed, we obtain much less information about the latent variable.
 
 In counterfactual-based imputation we use counterfactual values of the outcome $\outcome_{\decisionValue=1}$ to impute the missing labels. The SCM required to compute the counterfactuals is presented in figure \ref{fig:model}. Using Stan, we model the observed data as 
 \begin{align} \label{eq:data_model}
@@ -331,9 +332,9 @@ In counterfactual-based imputation we use counterfactual values of the outcome $
  \decision ~|~D, x & \sim \text{Bernoulli}(\invlogit(\alpha_{j} + \beta_{xt} x + \beta_{zt}z)). \\ \nonumber
 \end{align}
 
-That is, we fit one logistic regression model modelling the decisions based on the observable features \features and the identity of the judge using all of the data. The identity of the judge is encoded into the intercept $\alpha_j$. (We use different intercepts for different judges.) We model the observed outcomes with $\decision = 1$ with a separate regression model to learn the coefficients $\beta_{xy}$ for the observed features and $\beta_{zy}$ for the unobserved features and the sole intercept $\alpha_y$.
+That is, we fit one logistic regression model for the decisions, based on the observable features \features and the identity of the judge, using all of the data. The identity of the judge is encoded into the intercept $\alpha_j$ (we use a different intercept for each judge). We model the observed outcomes, i.e. the cases with $\decision = 1$, with a separate regression model to learn the remaining parameters: the coefficients $\beta_{xy}$ for the observed features, $\beta_{zy}$ for the unobserved features, a single intercept $\alpha_y$, and the values of the latent variable \unobservable.
 
-Using the samples from the posterior distribution for all the parameters given by Stan, we can estimate the values of the counterfactuals. Parameters are the $\alpha$ intercepts, $\beta$ coefficients and the latent variable. The counterfactuals are formally drawn from the posterior predictive distribution
+Using the samples from the posterior distribution of all the parameters given by Stan, we can estimate the values of the counterfactuals. The counterfactuals are formally drawn from the posterior predictive distribution
 \[
p(\tilde{y}~|~y) = \int_\Omega p(\tilde{y}~|~\theta)\,p(\theta~|~y)\,d\theta.
 \]
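+For concreteness, a minimal Python sketch of the Monte Carlo approximation of this integral is given below; the array names (\texttt{alpha\_y}, \texttt{beta\_xy}, \texttt{beta\_zy}, \texttt{z\_samples}) are illustrative stand-ins for the $S$ posterior draws extracted from Stan, not a fixed interface.
+\begin{verbatim}
+import numpy as np
+
+def draw_counterfactual_outcomes(alpha_y, beta_xy, beta_zy, z_samples, x, rng):
+    """Draw y~ from p(y~ | y) by Monte Carlo: one Bernoulli draw per
+    posterior sample. Shapes: alpha_y, beta_xy, beta_zy are (S,);
+    z_samples is (S, N); x is (N,). Returns an (S, N) array."""
+    logit = (alpha_y[:, None] + beta_xy[:, None] * x[None, :]
+             + beta_zy[:, None] * z_samples)
+    p = 1.0 / (1.0 + np.exp(-logit))   # inverse logit
+    return rng.binomial(1, p)          # y~ ~ Bernoulli(p), elementwise
+\end{verbatim}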
@@ -406,20 +407,21 @@ In practise, once we have used Stan, we have $S$ samples from all of the paramet
 
 \begin{algorithm}
 	%\item Potential outcomes / CBI \acomment{Put this in section 3? Algorithm box with these?}
-		\begin{itemize}
-		\item Take test set
-		\item Using Stan, draw S samples of the all parameters from the posterior distribution defined in equation \ref{eq:data_model}. Every item of the vector \unobservableValue is treated as a parameter.
-		\item For i in $1, \ldots, S$:
-		\item \hskip1.0em For j in $1, \ldots, \datasize$:
-		\item \hskip2.0em Draw new outcome $\outcome_{t=0}$ from $\text{Bernoulli}(\invlogit(\alpha_{j}[i] + \beta_{xt}[i] x + \beta_{zt}[i] z[i, j])$. 
-		\item \hskip1.0em Compute the failure rate and assign to $\mathcal{U}$.
-		\item Using the sampled values from the previous step, compute \datasize failure rate estimates.
-		\item Impute the missing outcomes using the estimates from previous step
-		\item Obtain a point estimate for the failure rate by computing the mean.
-		\item Estimates for the counterfactuals Y(1) for the unobserved values of Y were obtained using the posterior expectations from Stan. We used the NUTS sampler to estimate the posterior. When the values for...
-		\end{itemize}
+\DontPrintSemicolon
+\KwIn{Test data set $\dataset = \{x, j, t, y\}$, acceptance rate $r$} 
+\KwOut{Failure rate at acceptance rate $r$} 
+Using Stan, draw $S$ samples of all the parameters from the posterior distribution defined in equation \ref{eq:data_model}. Every item of the vector \unobservableValue is treated as a parameter.\;
+\For{i in $1, \ldots, S$}{
+	\For{j in $1, \ldots, \datasize$}{
+		Draw new outcome $\tilde{\outcome}_{j}$ from $\text{Bernoulli}(\invlogit(\alpha_{y}[i] + \beta_{xy}[i] x_j + \beta_{zy}[i] z[i, j]))$\;
+	}
+	Impute the missing outcomes using the values drawn in the previous step.\;
+	Sort the observations in ascending order based on the predictions of the predictive model.\;
+	Estimate the failure rate as $\frac{1}{\datasize}\sum_{k=1}^{\datasize\cdot r} \indicator{\outcomeValue_k=0}$ and append it to $\mathcal{U}$.\;
+}
+\Return{Mean of $\mathcal{U}$.}
 	
-\caption{Counterfactual based imputation}	\end{algorithm}
+\caption{Counterfactual-based imputation}	\end{algorithm}
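+The following Python sketch mirrors the algorithm above, assuming the posterior draws and the counterfactual outcomes have already been collected into NumPy arrays (all names are illustrative):
+\begin{verbatim}
+import numpy as np
+
+def failure_rate_cbi(pred_risk, y_obs, released, y_cf, r):
+    """Counterfactual-based imputation estimate of the failure rate at
+    acceptance rate r. pred_risk: (N,) model predictions (higher =
+    riskier); y_obs: (N,) observed outcomes (0 = failure); released:
+    (N,) boolean, True where t = 1; y_cf: (S, N) counterfactuals."""
+    S, N = y_cf.shape
+    order = np.argsort(pred_risk)        # ascending: least risky first
+    n_rel = int(r * N)
+    rates = np.empty(S)
+    for i in range(S):
+        y = np.where(released, y_obs, y_cf[i])   # impute missing labels
+        rates[i] = np.sum(y[order][:n_rel] == 0) / N
+    return rates.mean()                  # point estimate over samples
+\end{verbatim}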
 
 \section{Extension To Non-Linearity (2nd priority)}
 
@@ -481,8 +483,7 @@ We sampled $N=50k$ samples of  $X$, $Z$, and $W$ as independent standard Gaussia
 
 The \emph{default} decision maker in the data predicts a subject's probability of recidivism to be $P(\decision = 0~|~\features, \unobservable) = \invlogit(\beta_xx+\beta_zz)$. Each of the decision-makers is assigned a leniency value, and the decision is then made by comparing the value of $P(\decision = 0~|~\features, \unobservable)$ to the value of the inverse cumulative distribution function $F^{-1}_{P(\decision = 0~|~\features, \unobservable)}(r)=F^{-1}(r)$. Now, if $F^{-1}(r) < P(\decision = 0~|~\features, \unobservable)$, the subject is given a negative decision $\decision = 0$, and a positive decision otherwise. \rcomment{Needs double checking.} This ensures that the decisions are independent and that the fraction of positive decisions stochastically converges to $r$. The outcomes for which the decision was negative were then set to $0$.
  
-We used a number of different decision mechanisms. A \emph{limited} decision-maker works as the default, but predicts the risk for a negative outcome uses regression model $Y \sim  \invlogit(\beta_xx)$. Hence it is unable to observe $Z$.  A \emph{biased} decision maker works similarly as limited but the logistic regression model is .. where biases decision.
-Given leniency $R$, a \emph{random} decision maker decides on $T=1$ probability given by $R$.
+We used a number of different decision mechanisms. A \emph{limited} decision-maker works as the default, but predicts the risk of a negative outcome using only the recorded features \features, so that $P(\decision = 0~|~\features) = \invlogit(\beta_xx)$. Hence it is unable to observe $Z$. A \emph{biased} decision-maker works similarly to the default decision-maker, but the values of the observed features \features seen by the decision-maker are altered. We modified the values so that if the value of \featuresValue lay in the interval .. it was multiplied by 0.75 to induce more positive decisions. Similarly, if the subject's \featuresValue was in the .. we deducted ... to induce more negative decisions. Additionally, the effect of non-informative decisions was investigated by deploying a \emph{random} decision-maker: given leniency $R$, a random decision-maker gives a positive decision $T=1$ with probability $R$. A sketch of these mechanisms is given below.
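+As a rough Python sketch of the default and random mechanisms (the coefficient defaults, and the closed-form quantile exploiting the fact that $\beta_xx+\beta_zz$ is Gaussian for independent standard Gaussian $X$ and $Z$, are our assumptions):
+\begin{verbatim}
+import numpy as np
+from scipy.stats import norm
+
+def default_decider(x, z, r, beta_x=1.0, beta_z=1.0):
+    """T = 0 iff P(T=0 | x, z) exceeds the population r-quantile of
+    that probability, so the fraction of positive decisions -> r."""
+    p_neg = 1.0 / (1.0 + np.exp(-(beta_x * x + beta_z * z)))
+    # beta_x*x + beta_z*z ~ N(0, beta_x^2 + beta_z^2); invlogit is
+    # monotone, so quantiles map through it directly.
+    f_inv_r = 1.0 / (1.0 + np.exp(-norm.ppf(r, scale=np.hypot(beta_x, beta_z))))
+    return np.where(f_inv_r < p_neg, 0, 1)
+
+def random_decider(n, r, rng):
+    """Non-informative decision-maker: T = 1 with probability r."""
+    return rng.binomial(1, r, size=n)
+\end{verbatim}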
 
 In contrast, Lakkaraju et al. essentially order the subjects and decide $T=1$ for the fraction of subjects given by the leniency $R$. We see this as unrealistic: the decisions
on one subject should not depend on the decisions on other subjects. In the example, this would induce unethical behaviour: a single judge would need to jail a defendant today in order to release a defendant tomorrow.
@@ -493,26 +494,12 @@ We treat the observations as independent and the still the leniency would be a g
 \paragraph{Evaluators} 
	We deployed multiple evaluator modules to estimate the true failure rate of the decider module. These estimates should be close to those of the true evaluation module, and they will eventually be compared to the human evaluation curve. 
 	\begin{itemize}
-	\item True evaluation
-		\begin{itemize}
-		\item Depicts the true performance of the model. "How well would this model perform had it been deployed?" 
-		\item Not available when using observational data. 
-		\item Calculated by ordering the observations based on the predictions from the black-box model B and counting the failure rate from the ground truth labels.
-		\end{itemize}
-	\item Human evaluation
-		\begin{itemize}
-		\item The performance of the deciders in the data generation step. We binned deciders with similar values of leniency and counted their failure rate.
-		\item In observational data sets, we can only record the decisions and acceptance rates of these decision-makers. 
-		\item This curve is eventually the benchmark for the performance of a model.
-		\end{itemize}
-	\item Labeled outcomes
-		\begin{itemize}
-		\item Vanilla estimator of a model's performance. Obtained by first ordering the observations by the predictions assigned by the decider in the modelling step.
-		\item Then 1-r \% of the most dangerous are detained and given a negative decision. The failure rate is computed as the ratio of negative outcomes to the number of subjects.
-		\end{itemize}
-
-
+	\item \emph{True evaluation:} True evaluation depicts the true performance of a model: how well the model would have performed had it been deployed. The estimate is computed by first sorting the subjects into ascending order based on the predictions of the model and then computing the failure rate directly from the ground-truth outcome labels of the $r \cdot 100\%$ of subjects with the lowest predicted risk, i.e. those that would be released. True evaluation can only be computed on synthetic data sets, as observational data lack the outcome labels of the subjects who received a negative decision.
+	\item \emph{Human evaluation:} Human evaluation depicts the performance of the decision-makers, who observe the latent variable. The human evaluation curve is computed by grouping decision-makers with similar values of leniency into bins and then computing their failure rate from the ground-truth labels. \rcomment{Not computing now.}
+	\item \emph{Labeled outcomes:} The labeled outcomes algorithm is the conventional method of computing the failure rate. We proceed as in the true evaluation method, but use only the available outcome labels to estimate the failure rate.
+	\item \emph{Contraction:} Contraction is an algorithm designed specifically to estimate the failure rate of a black-box predictive model under selective labeling; see the previous section. A sketch of the first and third estimators follows this list.
 	\end{itemize}
+
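+A minimal Python sketch of the true evaluation and labeled outcomes estimators (function and argument names are ours, for illustration only):
+\begin{verbatim}
+import numpy as np
+
+def true_evaluation(pred_risk, y_true, r):
+    """Failure rate among the r fraction with the lowest predicted
+    risk, using ground-truth labels (synthetic data only)."""
+    order = np.argsort(pred_risk)          # ascending risk
+    n = len(pred_risk)
+    released = order[: int(r * n)]
+    return np.sum(y_true[released] == 0) / n
+
+def labeled_outcomes(pred_risk, y_obs, released_mask, r):
+    """As above, but under selective labeling only the labels of
+    released subjects (t = 1) are available."""
+    n = len(pred_risk)
+    risk = pred_risk[released_mask]
+    y = y_obs[released_mask][np.argsort(risk)]
+    return np.sum(y[: int(r * n)] == 0) / n
+\end{verbatim}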
 \paragraph{Results} 
 (Target for this section from problem formulation: show that our evaluator is unbiased/accurate (show mean absolute error), robust to changes in data generation (some table perhaps, at least should discuss situations when the decisions are bad/biased/random = non-informative or misleading), also if the decider in the modelling step is bad and its information is used as input, what happens.)
 	\begin{itemize}
@@ -528,9 +515,9 @@ In this section we present results from experiments with (realistic) data sets.
 
 COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) is Northpointe's (now under a different name) tool for guiding decisions in the criminal justice system. The COMPAS tool provides judges with risk estimates regarding the probability of recidivism and failure to appear. The COMPAS score is mainly derived from ``prior criminal history, criminal associates, drug involvement, and early indicators of juvenile delinquency problems'' and it predicts recidivism over the following two years. The sole use of the COMPAS score as a basis for judgement has been prohibited by law; judges must base their decisions on other factors too. 
 
-The COMPAS data set is recidivism data from Broward county, California, USA. The data set was preprocessed by ProPublica for their article Machine Bias. The original data contained information about $18 610$ defendants who were given a COMPAS score during 2013 or 2014. After removing defendants who were not preprocessed at pretrial stage $11 757$ defendants were left. Additionally, defendants for whom the COMPAS score couldn't be matched with a corresponding charge were removed from analysis resulting in a data set consisting of $7 214$ observations. Following ProPublica's reasoning, after final data cleaning we were left with $6 172$ offences. Data includes the subjects' demographic information (incl. gender, age, race) and information on their previous offences.
+The COMPAS data set is recidivism data from Broward County, Florida, USA. The data set was preprocessed by ProPublica for their article ``Machine Bias''. The original data contained information about $18\,610$ defendants who were given a COMPAS score during 2013 or 2014. After removing defendants who were not processed at the pretrial stage, $11\,757$ defendants were left. Additionally, defendants for whom the COMPAS score could not be matched with a corresponding charge were removed from the analysis, resulting in a data set of $7\,214$ observations. Following ProPublica's reasoning, after final data cleaning we were left with $6\,172$ offences. The data include the subjects' demographic information, such as gender, age and race, and information on their previous offences.
 
-For the analysis we created 9 synthetic judges with leniencies $0.1, 0.2, \ldots, 0.9$. All the subjects were distributed to the judges evenly and at random. In this semi-synthetic scenario, the judge would base their decisions on the COMPAS score, releasing the fraction of defendants according to their leniency. Those who were given a negative decision had their outcome label hidden. The data was then split to training and test sets and a logistic regression model was built to predict two-year recidivism from categorised age, gender, the number of priors, degree of crime COMPAS screened for (felony/misdemeanour). We experimented with other models but the results remained the same. These same features were used as an input for the counterfactual imputing method.
+For the analysis, we created 9 synthetic judges with leniencies $0.1, 0.2, \ldots, 0.9$. All the subjects were distributed to the judges evenly and at random. In this semi-synthetic scenario, a judge bases their decisions on the COMPAS score, releasing the fraction of defendants given by their leniency. Those who were given a negative decision had their outcome label hidden. The data were then split into training and test sets, and a logistic regression model was built to predict two-year recidivism from categorised age, gender, the number of prior offences, and the degree of the crime COMPAS screened for (felony/misdemeanour). We experimented with other models, but the results remained the same. The same features were used as input for the counterfactual imputation method. A sketch of the judge construction is given below.
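+A short Python sketch of this semi-synthetic judge construction (names are ours; \texttt{compas\_score} stands for the risk score column):
+\begin{verbatim}
+import numpy as np
+
+def synthetic_judges(compas_score, rng, leniencies=np.arange(0.1, 1.0, 0.1)):
+    """Distribute subjects evenly at random over 9 judges; each judge
+    releases the fraction of their least-risky subjects (by COMPAS
+    score) given by their leniency. Returns judge ids and decisions."""
+    n = len(compas_score)
+    judge = rng.permutation(n) % len(leniencies)   # even random assignment
+    t = np.zeros(n, dtype=int)
+    for j, r in enumerate(leniencies):
+        idx = np.where(judge == j)[0]
+        order = idx[np.argsort(compas_score[idx])]  # least risky first
+        t[order[: int(r * len(idx))]] = 1           # positive decisions
+    return judge, t
+\end{verbatim}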
 
 \begin{itemize}
 %\item COMPAS data set
-- 
GitLab