Commit 12b19e55 authored by Riku-Laine's avatar Riku-Laine
First new figures, text

\useunder{\uline}{\ul}{}
\usepackage{multirow}
\usepackage{enumitem}
%\usepackage[toc,page]{appendix}
%
%\renewcommand\appendixtocname{Appendices -- Mostly old}
\begin{abstract}
%This document presents the implementations of RL in pseudocode level. First, I present most of the nomenclature used in these notes. Then I proceed to give my personal views and comments on the motivation behind Selective labels paper. In sections \ref{sec:framework} and \ref{sec:modular_framework}, I present two frameworks for the selective labels problems. The latter of these has subsequently been established the current framework, but former is included for documentation purposes. In the following sections, I present the data generating algorithms and algorithms for obtaining failure rates using different methods in the old framework. Results are presented in section \ref{sec:results}. All the algorithms used in the modular framework are presented with motivations in section \ref{sec:modules}.
For notes of the meetings, please refer to the \href{https://helsinkifi-my.sharepoint.com/personal/rikulain_ad_helsinki_fi/Documents/Meeting_Notes.docx?web=1}{{\ul Word document}}. Older discussions have been moved to the appendices.
\end{abstract}
Mnemonic rule for the binary coding: zero bad (crime or jail), one good!
\section{Introduction}
\setcounter{figure}{-1}
\begin{wrapfigure}{r}{0.3\textwidth} %this figure will be at the right
\centering
\begin{tikzpicture}[->,>=stealth',node distance=1.5cm, semithick]
$$DG+DEC = DATA \rightarrow MODEL \rightarrow EVALUATION$$
In our framework, the flow of information is divided into three steps: data generation, modelling and evaluation. The framework definition relies on precise definitions of the properties of each step: a step is fully defined when its input and output are unambiguously described.
The data generation step generates all the data: the features and the selectively labeled outcomes of the subjects, and the features of the deciders. In our setting we discuss mainly five variables. Each subject has observed features X, unobserved features Z (\emph{unobservables} for short), and an outcome Y. The deciders (judges, doctors) assigning the labels are characterized by their leniency R and the decisions T they give to subjects. The central variants for feature and decision generation are presented in section \ref{sec:modules}; note that we have separated the decision process from the feature generation. The effects of the variables on each other are presented as a directed graph in figure \ref{fig:initial_model}.
% Next, the training data is given to a predictive model, which learns a model that is then applied to the test data, assigning each instance a probability of Y being zero. Models such as neural networks, logistic regression, linear regression etc.; we use logistic regression embedded in the evaluation modules.
After the feature generation, the data is passed to a predictive model in the modelling step. The model, e.g. a regression model, is trained on one portion of the original data. Using the trained model, we assign predicted probabilities of a negative outcome to the observations in the other part of the data. It is the performance of this model that we are interested in.
% Finally, the data with its scores is given to an evaluation algorithm. The goal of the evaluation algorithm is to give as accurate and consistent an estimate as possible of the model's FR at a given leniency. The evaluation algorithms are in the modules section. A good evaluation is accurate (small bias and variance) and robust to different DG methods.
Finally, an evaluation algorithm tries to output a reliable estimate of the model's failure rate. A good evaluation algorithm gives a precise, unbiased and low-variance estimate of the failure rate of the model at a given leniency $r$. As explained, the setting is characterized by the \emph{selective labeling} of the data.
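The data generation and decision steps above can be sketched as a minimal simulation. All parameter values below (number of subjects and judges, the leniency range, the risk score $x+z$) are hypothetical illustrations, not the actual modules of section \ref{sec:modules}:

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Data generation (hypothetical parameterization, for illustration) ---
# Subjects have observed features X and unobservables Z; following the signs
# in the outcome model, a higher x + z means a better prospect
# (mnemonic: zero bad, one good).
n, n_judges = 5000, 10
x = rng.normal(size=n)
z = rng.normal(size=n)                       # hidden from the predictive model
judge = rng.integers(n_judges, size=n)
leniency = np.linspace(0.2, 0.8, n_judges)   # R: fraction each judge releases

# --- Decision step ---
# Each judge releases (T = 1) the `leniency` fraction of their cases that
# look best on x + z; outcomes of jailed subjects remain unlabeled.
t = np.zeros(n, dtype=int)
for j in range(n_judges):
    idx = np.where(judge == j)[0]
    k = int(leniency[j] * len(idx))
    t[idx[np.argsort(-(x + z)[idx])[:k]]] = 1

# --- Outcome, selectively labeled ---
y = rng.binomial(1, 1 / (1 + np.exp(-(x + z))))  # Y = 1 is the good outcome
y_observed = np.where(t == 1, y, -1)             # -1 marks a missing label
```

The modelling step would then train, e.g., a logistic regression on the labeled ($T=1$) part of one data portion and score the rest; the evaluator only sees `y_observed`, never `y`.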
\section{Potential outcomes in model evaluation}
The potential outcomes evaluator uses Stan to infer the latent variable and the path coefficients, and to estimate the expectation of Y for the missing outcomes. The full hierarchical model is presented in eq. \ref{eq:po} below. Expectations for the missing outcomes Y are obtained as means of the posterior predictive distribution rather than as point estimates from the maximum of the joint posterior. (The joint posterior appeared to be bimodal or skewed, and therefore the mean gave better estimates.)
Priors for the $\beta$ coefficients were chosen to be sufficiently non-informative without restricting the density estimation. Coefficients for the latent Z were restricted to be positive for posterior estimation; the resulting prior is the half-normal distribution ($X\sim N(0, 1) \Rightarrow Y=|X| \sim$ Half-Normal). The $\alpha$ intercepts are included only in the decision model, to emulate the differences in leniency between the $M$ judges: subjects with equal $x$ and $z$ will then have different probabilities of bail depending on the judge's leniency.
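The half-normal fact used above can be checked numerically; the half-normal with $\sigma = 1$ has mean $\sqrt{2/\pi}\approx 0.798$ and all of its mass on $[0, \infty)$:

```python
import numpy as np

rng = np.random.default_rng(1)
half = np.abs(rng.normal(size=1_000_000))  # X ~ N(0,1)  =>  |X| ~ Half-Normal

# Sample mean should be close to sqrt(2/pi) ~= 0.798,
# and no sample can be negative.
print(round(half.mean(), 3), half.min() >= 0)
```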
\begin{align} \label{eq:po}
Y~|~t,~x,~z,~\beta_{xy},~\beta_{zy} & \sim \text{Bernoulli}(\invlogit(\beta_{xy}x + \beta_{zy}z)) \\ \nonumber
T~|~x,~z,~\alpha_j,~\beta_{xt},~\beta_{zt} & \sim \text{Bernoulli}(\invlogit(\alpha_j + \beta_{xt}x + \beta_{zt}z)) \\ \nonumber
Z & \sim N(0, 1) \\ \nonumber
p(\alpha_j) & \propto 1 \hskip1.0em \text{for } j \in \{1, 2, \ldots, M\}
\end{align}
The model is fitted on the test set, and the expectations of the potential outcomes $Y_{T=1}$ are used in place of the missing outcomes.
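Given posterior predictive expectations for the unlabeled cases, the failure rate at leniency $r$ can be computed as below. This is a simplified sketch with made-up inputs; in the actual evaluator the expectations `y_expect` come from the Stan fit:

```python
import numpy as np

def failure_rate(pred_risk, y_obs, y_expect, r):
    """Failure rate when the model releases the fraction r it deems safest.

    pred_risk : model's predicted probability of a bad outcome (Y = 0)
    y_obs     : observed outcomes, -1 where the label is missing (T = 0)
    y_expect  : posterior predictive E[Y_{T=1}] used for the missing labels
    r         : leniency, the fraction of subjects released
    """
    n = len(pred_risk)
    y_filled = np.where(y_obs == -1, y_expect, y_obs)
    released = np.argsort(pred_risk)[: int(r * n)]   # release the safest
    # A failure is a released subject with (expected) outcome Y = 0.
    return (1 - y_filled[released]).sum() / n

# Toy usage with made-up inputs:
rng = np.random.default_rng(2)
p_bad = rng.uniform(size=1000)
y_exp = 1 - p_bad                        # pretend the expectations are exact
y_obs = np.where(rng.uniform(size=1000) < 0.5, (p_bad < 0.5).astype(int), -1)
fr = failure_rate(p_bad, y_obs, y_exp, r=0.5)
```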
\section{Results}
The four main result figures are presented in figure set \ref{fig:results_bayes}.
\end{figure}
The robustness checks are presented in figure set \ref{fig:results_robustness}, depicting deciders who
\begin{enumerate}[label=\Alph*)]
\item assign bail randomly with probability $r$,
\item favour or dislike defendants with certain values of X, and
\item decide badly, using a wrong coefficient $\beta_x$.
\end{enumerate}

The last figure (D) illustrates the situation where the data generation mechanism is exactly the same as the model specified in eq. \ref{eq:po}, with $\beta$ coefficients equal to 1 and $\alpha$ coefficients equal to 0.
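Decider variants A--C can be sketched as follows. These are hypothetical minimal implementations with assumed bias magnitudes; the actual modules are described in section \ref{sec:modules}:

```python
import numpy as np

rng = np.random.default_rng(3)

def random_decider(x, z, r):
    # A) assign bail (T = 1) randomly with probability r, ignoring the case
    return rng.binomial(1, r, size=len(x))

def biased_decider(x, z, r, favoured):
    # B) favoured defendants' x is seen one unit better, others' one worse
    #    (the +-1.0 shift is an assumed magnitude, for illustration only)
    x_seen = np.where(favoured, x + 1.0, x - 1.0)
    score = x_seen + z
    return (score > np.quantile(score, 1 - r)).astype(int)

def bad_decider(x, z, r, beta_x=-1.0):
    # C) bad judge: wrong sign on beta_x, so x pushes the decision backwards
    score = beta_x * x + z
    return (score > np.quantile(score, 1 - r)).astype(int)

x, z = rng.normal(size=2000), rng.normal(size=2000)
decisions = [random_decider(x, z, 0.5),
             biased_decider(x, z, 0.5, favoured=x > 0),
             bad_decider(x, z, 0.5)]
```

All three release roughly the fraction $r$ of subjects, but only variant A is independent of the case features.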
The figures show that the contraction method is fairly robust to changes in data and decision generation, although it has some variance. The potential outcomes approach consistently underestimates the true failure rate, resulting in mediocre performance (compared to the imputation algorithms presented in the SL paper). The mean absolute errors of contraction with respect to the true evaluation were on the order of 0.009...0.005, compared to our MAEs of approximately 0.015...0.035. A worrying aspect of the analysis is that even when the data generating process follows exactly the specified hierarchical model, the error in the failure rate is still approximately 0.015.
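For reference, our reading of the contraction estimator from the SL paper, against which we compare, can be sketched as follows (a simplified sketch; the variable names are ours):

```python
import numpy as np

def contraction(judge, t, y, pred_risk, leniency, r):
    """Contraction estimate of the failure rate at acceptance rate r.

    Uses only the most lenient judge's caseload: the model "re-jails" the
    riskiest of that judge's released subjects until only the fraction r
    remains released, and failures among those kept are counted.
    Requires r <= leniency of the most lenient judge.
    """
    q = np.argmax(leniency)                    # most lenient judge
    dq = np.where(judge == q)[0]               # q's caseload
    released = dq[t[dq] == 1]                  # labeled (T = 1) cases
    keep = int(r * len(dq))                    # how many stay released at r
    safest = released[np.argsort(pred_risk[released])[:keep]]
    return (y[safest] == 0).sum() / len(dq)    # failures / all of q's cases

# Toy check with one judge: cases 0..79 are released, the 20 riskiest of
# them (60..79) fail (Y = 0), and the model ranks risk perfectly.
judge = np.zeros(100, dtype=int)
t = (np.arange(100) < 80).astype(int)
y = (np.arange(100) < 60).astype(int)
pred_risk = np.arange(100) / 100.0
est = contraction(judge, t, y, pred_risk, np.array([0.8]), r=0.7)
```

In the toy check, contracting to $r=0.7$ keeps the 70 safest of the 80 released cases, 10 of which fail, so the estimate is $10/100 = 0.1$.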
Notably, the diagnostic figures in section \ref{sec:diagnostic} show that a few failure rate estimates are far too high, appearing as outliers ("flyers"), which implies problems in model identifiability, probably a bimodal posterior. (In the output, when Stan searches for the minimum of the negative log posterior (the maximum of the posterior), it occasionally converges to a log probability of approximately -200...-400, while most of the time the minimum is at -7000...-9000.)
% The figures show that the potential outcomes approach estimates the true failure rate with a slightly better performance than the contraction algorithm (MAEs 0.00X...0.00Y compared to 0.00T...0.00U respectively). The results are in for only some of the core situations discussed, but will be provided for the Thursday meeting. The diagnostic figures in section \ref{sec:diagnostic} also show that the predictions given by the potential outcomes approach have a lower variance.
Imposing constraints on the model in equation \ref{eq:po} did not yield significantly better results. The model was constrained so that there was only one intercept $\alpha$ and so that $\beta_{xt}=\beta_{xy}$ and $\beta_{zt}=\beta_{zy}$. Another caveat of the current approach is its scalability to large data sets.
\begin{figure}[H]
\centering
\begin{subfigure}[b]{0.475\textwidth}
\includegraphics[width=\textwidth]{sl_result_biased}
\caption{Biased decisions.}
\label{fig:bias}
\end{subfigure}
\begin{subfigure}[b]{0.475\textwidth}
\includegraphics[width=\textwidth]{sl_result_bad}
\caption{Data generated exactly as in the model.}
%\label{fig:}
\end{subfigure}
\caption{Robustness check figures: failure rate vs. acceptance rate with varying levels of leniency, for different combinations of deciders and data generation mechanisms. Only one data set was used, which affects the performance of contraction.}
\label{fig:results_robustness}
\end{figure}