Commit 5f1c99b7 authored by Riku-Laine's avatar Riku-Laine
Effect of classifier and sanity check

\title{Notes}
\author{RL, 11 June 2019}
%\date{} % Activate to display a given date or no date
\begin{document}
\maketitle
This document presents the implementations of RL at the pseudocode level.
\item[T :] decision variable, bail/positive decision equal to 1, jail/negative decision equal to 0
\item[Y :] result variable, no crime/positive result equal to 1, crime/negative result equal to 0
\item[MAE :] Mean absolute error
\item[SL :] Selective labels; for more information, see the paper by Lakkaraju et al. \cite{lakkaraju17}
\item[Labeled data :] data that has been censored, i.e. if a negative decision is given ($T=0$), then $Y$ is set to NA.
\item[Full data :] data that has all labels available, i.e. \emph{even if} a negative decision is given ($T=0$), $Y$ will still be available.
\item[Unobservables :] unmeasured confounders, latent variables, Z
Mnemonic rule for the binary coding: zero bad (crime or jail), one good!
The motivating idea behind the SL paper of Lakkaraju et al. \cite{lakkaraju17} is to evaluate whether machines could improve on human performance. In the general case, comparing the performance of human and machine evaluations is simple. In the domains addressed by Lakkaraju et al., simple comparisons would be unethical, and therefore algorithms are required. (Other approaches have also been proposed; for example, De-Arteaga et al. propose a data augmentation algorithm \cite{dearteaga18}.)
The general idea of the SL paper is to train some predictive model with selectively labeled data. The question is then: how would this predictive model perform if it were to make independent bail-or-jail decisions? That quantity cannot be calculated from real-life data sets due to the ethical reasons above and the hidden labels. We can, however, use more selectively labeled data to estimate its performance. But because the available data is biased, the performance estimates are too good or "overly optimistic" if they are calculated in the conventional way ("labeled outcomes only"). This is why the authors propose the contraction algorithm.
One of the concepts to note when reading the Lakkaraju paper is the difference between the global goal of prediction and the goal in this specific setting. The global goal is to have a low failure rate with a high acceptance rate, but at the moment we are not interested in it. The goal in this setting is to estimate the true failure rate of the model with unseen biased data. That is, given only selectively labeled data and an arbitrary black-box model $\mathcal{B}$, we are interested in predicting the performance of model $\mathcal{B}$ on the whole data set with all ground truth labels.
On the formalisation of R: we discussed how Lakkaraju's paper treats the variable R in a seemingly nonsensical way; it is as if a judge would have to let someone go today in order to detain some other defendant tomorrow, just to keep their acceptance rate at some $r$. A more intuitive way of thinking about $r$ would be the "threshold perspective": if a judge sees that a defendant has probability $p_x$ of committing a crime if released, the judge detains the defendant whenever $p_x > r$, i.e. when the defendant would be too dangerous to release. The problem in this case is that we cannot observe this innate $r$; we can only observe the decisions given by the judges. Lakkaraju et al. avoid computing $r$ twice by forcing the "acceptance threshold" to be an "acceptance rate", so that the effect of changing $r$ can be computed from the data directly.
\section{Data generation}
In the setting with unobservables Z, we first sample an acceptance rate r for all judges.
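A minimal sketch of this generation process is given below, assuming standard Gaussian X and Z, a logistic model for the probability of a negative result, and judges who release the fraction $r$ of their subjects with the lowest perceived risk. The parameter values and the uniform range for $r$ are assumptions for illustration, not taken verbatim from the implementation.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def generate_data_with_Z(n_subjects=500, n_judges=100,
                         beta_X=1.0, beta_Z=1.0):
    # One acceptance rate r per judge (range assumed for illustration).
    r = rng.uniform(0.1, 0.9, size=n_judges)
    data = []
    for j in range(n_judges):
        X = rng.normal(size=n_subjects)  # observable feature
        Z = rng.normal(size=n_subjects)  # unobservable confounder
        # Probability of a negative result (Y = 0, crime).
        p_bad = 1 / (1 + np.exp(-(beta_X * X + beta_Z * Z)))
        # The judge sees the risk and releases (T = 1) the
        # fraction r[j] of subjects with the lowest risk.
        T = (p_bad <= np.quantile(p_bad, r[j])).astype(int)
        Y = rng.binomial(1, 1 - p_bad)  # Y = 1 means no crime
        Y_obs = np.where(T == 1, Y, np.nan)  # selective labeling
        data.append((r[j], X, Z, T, Y, Y_obs))
    return data
\end{verbatim}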
\section{Model fitting} \label{sec:model_fitting}
The models being fitted are logistic regression models from the scikit-learn package. The solver is set to lbfgs (as there is no closed-form solution) and the intercept is estimated by default. The resulting LogisticRegression model object provides convenient functions for fitting the model and obtaining probabilities for class labels. Please see the documentation at \url{https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html} or ask me (RL) for more details. Similar analyses were conducted using a random forest classifier, but the results (see section \ref{sec:random_forest}) were practically identical.
All of the algorithms 4--7 and the contraction algorithm are model agnostic, i.e. they do not depend on a specific predictive model; the model only has to provide probabilities for a given output with some determined input. Lakkaraju et al. say in their paper: "We train logistic regression on this training set. We also experimented with other predictive models and observed similar behaviour."
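For concreteness, a minimal sketch of the fitting step is given below. The variable names (\texttt{x\_train\_labeled} etc.) are illustrative placeholders, not identifiers from the actual implementation.

\begin{verbatim}
from sklearn.linear_model import LogisticRegression

# Fit on the labeled training data only, i.e. rows with T = 1,
# for which the outcome Y is observed.
model = LogisticRegression(solver="lbfgs")  # intercept fitted by default
model.fit(x_train_labeled, y_train_labeled)

# Column 1 of predict_proba holds P(Y = 1 | X = x).
p_good = model.predict_proba(x_test)[:, 1]
\end{verbatim}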
The following quantities are estimated from the data:
\item Labeled outcomes: The "traditional"/vanilla estimate of model performance. See algorithm \ref{alg:labeled_outcomes}.
\item Human evaluation: The failure rate of human decision-makers who have access to the latent variable Z. Decision-makers with similar values of leniency are binned and treated as one hypothetical decision-maker. See algorithm \ref{alg:human_eval}.
\item Contraction: See algorithm 1 of \cite{lakkaraju17}.
\item Causal model: In essence, the empirical performance is calculated over the test set as $$\dfrac{1}{n}\sum_{(x, y)\in D}f(x)\delta(F(x) < r)$$ where $$f(x) = P(Y=0|T=1, X=x)$$ is a logistic regression model (see section \ref{sec:model_fitting}) trained on the labeled data to predict Y from X, and $$ F(x_0) = \int_{x\in\mathcal{X}} P(x)\delta(f(x) < f(x_0)) ~ dx.$$ All observations, even ones with missing outcome labels, can be used, since the empirical performance does not depend on them. $P(x)$ is the Gaussian pdf from the scipy.stats package, integrated over the interval $[-15, 15]$ with 40000 steps using the si.simps function from scipy.integrate, which uses Simpson's rule to estimate the value of the integral (docs: \url{https://docs.scipy.org/doc/scipy/reference/generated/scipy.integrate.simps.html}). A numerical sketch of this computation is given after this list. \label{causal_cdf}
\end{itemize}
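The causal model computation above can be sketched numerically as follows, assuming a single feature X and the fitted \texttt{model} object from section \ref{sec:model_fitting}; variable names are illustrative.

\begin{verbatim}
import numpy as np
import scipy.stats as stats
import scipy.integrate as si

# f(x) = P(Y = 0 | T = 1, X = x), column 0 of predict_proba.
def f(x):
    return model.predict_proba(np.reshape(x, (-1, 1)))[:, 0]

# Precompute the Gaussian pdf and f on a grid over [-15, 15].
grid = np.linspace(-15, 15, 40000)
pdf = stats.norm.pdf(grid)
f_grid = f(grid)

# F(x0) = integral of P(x) * delta(f(x) < f(x0)) dx, by Simpson's rule.
def F(x0):
    return si.simps(pdf * (f_grid < f(np.atleast_1d(x0))[0]), x=grid)

# Empirical performance at acceptance rate r over the test set.
def empirical_performance(x_test, r):
    return np.mean(f(x_test) * (np.array([F(x) for x in x_test]) < r))
\end{verbatim}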
The plotted curves are constructed using the pseudocode presented in algorithm \ref{alg:perf_comp}.
If we assign $\beta_Z=0$, almost all failure rates drop to zero in the interval
\caption{$\beta_Z=0$}
\label{fig:betaZ_0}
\end{subfigure}
\caption{Effect of $\beta_Z$. Failure rate vs. acceptance rate with unobservables in the data (see algorithm \ref{alg:data_with_Z}). Logistic regression was trained on labeled training data. Results from algorithm \ref{alg:perf_comp} with $N_{iter}=4$.}
\label{fig:betaZ_comp}
\end{figure}
\subsection{Noise added to the decision and data generated without unobservables}
In this part, Gaussian noise with zero mean and 0.1 variance was added to the predictions; a sketch of this step is given after figure \ref{fig:sigma_figure}.
\begin{figure}[H]
\centering
\includegraphics[width=0.75\textwidth]{sl_without_Z_3iter_sigma_sqrt_01}
\caption{Failure rate with varying levels of leniency without unobservables. Logistic regression was trained on labeled training data with $N_{iter}$ set to 3.}
\label{fig:sigma_figure}
\end{figure}
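A sketch of the noise step, under the assumption that the noise enters the predicted probabilities (\texttt{p\_bad} from the data generation sketch) before the decisions are made. Note that NumPy's \texttt{scale} parameter is the standard deviation, hence \texttt{np.sqrt(0.1)} for a variance of 0.1, matching the \texttt{sigma\_sqrt\_01} in the figure file name.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
# Zero-mean Gaussian noise with variance 0.1, clipped so that the
# noisy values remain valid probabilities.
noise = rng.normal(loc=0.0, scale=np.sqrt(0.1), size=p_bad.shape)
p_noisy = np.clip(p_bad + noise, 0.0, 1.0)
\end{verbatim}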
\subsection{Predictions with random forest classifier} \label{sec:random_forest}
In this section the predictive model was switched to a random forest classifier to examine the effect of changing the model. The results are practically identical to the ones presented in figure \ref{fig:results} previously. The resulting outcome is presented in figure \ref{fig:random_forest}; a sketch of the model swap is given below.
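Because the algorithms only require class probabilities, the switch is a drop-in replacement; the hyperparameters below are illustrative, not taken from the implementation.

\begin{verbatim}
from sklearn.ensemble import RandomForestClassifier

# Any classifier exposing predict_proba can replace the logistic
# regression; hyperparameters here are illustrative, not tuned.
model = RandomForestClassifier(n_estimators=100)
model.fit(x_train_labeled, y_train_labeled)
p_good = model.predict_proba(x_test)[:, 1]
\end{verbatim}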
\begin{figure}[H]
\centering
\begin{subfigure}[b]{0.5\textwidth}
\includegraphics[width=\textwidth]{sl_withoutZ_4iter_randomforest}
\caption{Results without unobservables, \\$N_{iter}=4$.}
\label{fig:results_without_Z}
\end{subfigure}
~ %add desired spacing between images, e. g. ~, \quad, \qquad, \hfill etc.
%(or a blank line to force the subfigure onto a new line)
\begin{subfigure}[b]{0.5\textwidth}
\includegraphics[width=\textwidth]{sl_withZ_6iter_betaZ_1_0_randomforest}
\caption{Results with unobservables, $\beta_Z=1$ and \\$N_{iter}=6$.}
\label{fig:results_with_Z}
\end{subfigure}
\caption{Failure rate vs. acceptance rate with varying levels of leniency. A random forest classifier was trained on labeled training data.}
\label{fig:random_forest}
\end{figure}
\subsection{Sanity check for predictions}
Predictions were checked by drawing a graph of predicted Y versus X; the results are presented in figure \ref{fig:sanity_check}, and a plotting sketch is given after the figure. The figure indicates that the predicted class labels and the probabilities for them are consistent with the ground truth.
\begin{figure}[H]
\centering
\includegraphics[width=0.75\textwidth]{sanity_check}
\caption{Predicted class label and probability of $Y=1$ versus X. Prediction was done with a logistic regression model. Colors of the points denote the ground truth (yellow = 1, purple = 0). The data set was created with unobservables.}
\label{fig:sanity_check}
\end{figure}
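A sketch of how such a figure can be produced, assuming a one-dimensional test set and the fitted \texttt{model} from above; the viridis colormap maps 0 to purple and 1 to yellow, matching the caption.

\begin{verbatim}
import matplotlib.pyplot as plt

# Predicted P(Y = 1) versus X, colored by the ground-truth label.
p_good = model.predict_proba(x_test.reshape(-1, 1))[:, 1]
plt.scatter(x_test, p_good, c=y_test, cmap="viridis", s=8)
plt.xlabel("X")
plt.ylabel("P(Y = 1)")
plt.colorbar(label="ground truth Y")
plt.show()
\end{verbatim}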
\begin{thebibliography}{9}
\bibitem{lakkaraju17}
Lakkaraju, H., Kleinberg, J., Leskovec, J., Ludwig, J. and Mullainathan, S. The Selective Labels Problem: Evaluating Algorithmic Predictions in the Presence of Unobservables. In \emph{Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining} (KDD '17), 2017.
\bibitem{dearteaga18}
De-Arteaga, M., Dubrawski, A. and Chouldechova, A. Learning under selective labels in the presence of expert consistency. arXiv preprint arXiv:1807.00905, 2018.
\end{thebibliography}
\end{document}