diff --git a/analysis_and_scripts/notes.tex b/analysis_and_scripts/notes.tex
index 0ca8d60b3e925f6c5f03c631b169899d18711cb2..83c9e58bbcfe489213037b46adb7fab312373c13 100644
--- a/analysis_and_scripts/notes.tex
+++ b/analysis_and_scripts/notes.tex
@@ -13,6 +13,11 @@
 \useunder{\uline}{\ul}{}
 \usepackage{multirow}
 
+%\usepackage[toc,page]{appendix}
+%
+%\renewcommand\appendixtocname{Appendices -- Mostly old}
+%\renewcommand\appendixpagename{Appendices -- Mostly old}
+
 \usepackage{wrapfig} % wrap figures
 
 \usepackage{booktabs}% http://ctan.org/pkg/booktabs
@@ -88,8 +93,9 @@
 \usepackage{subcaption}
 \graphicspath{ {../figures/} }
 
-\title{Notes}
-\author{RL, 1 July 2019}
+\title[Potential outcomes in model evaluation]{Notes, working title: \\ From would-have-beens to should-have-beens: Potential outcomes in model evaluation}
+
+\author{RL}
 %\date{}                                           % Activate to display a given date or no date
 
 \begin{document}
@@ -99,9 +105,12 @@
 \tableofcontents
 
 \begin{abstract}
-This document presents the implementations of RL in pseudocode level. First, I present most of the nomenclature used in these notes. Then I proceed to give my personal views and comments on the motivation behind Selective labels paper. In sections \ref{sec:framework} and \ref{sec:modular_framework}, I present two frameworks for the selective labels problems. The latter of these has subsequently been established the current framework, but former is included for documentation purposes. In the following sections, I present the data generating algorithms and algorithms for obtaining failure rates using different methods in the old framework. Results are presented in section \ref{sec:results}. All the algorithms used in the modular framework are presented with motivations in section \ref{sec:modules}. 
+%This document presents the implementations of RL in pseudocode level. First, I present most of the nomenclature used in these notes. Then I proceed to give my personal views and comments on the motivation behind Selective labels paper. In sections \ref{sec:framework} and \ref{sec:modular_framework}, I present two frameworks for the selective labels problems. The latter of these has subsequently been established the current framework, but former is included for documentation purposes. In the following sections, I present the data generating algorithms and algorithms for obtaining failure rates using different methods in the old framework. Results are presented in section \ref{sec:results}. All the algorithms used in the modular framework are presented with motivations in section \ref{sec:modules}. 
 
 For notes of the meetings please refer to the \href{https://helsinkifi-my.sharepoint.com/personal/rikulain_ad_helsinki_fi/Documents/Meeting_Notes.docx?web=1}{{\ul word document}}.
+
+Old discussions have been moved to the appendices.
+
 \end{abstract}
 
 \section*{Terms and abbreviations}
@@ -122,7 +131,7 @@ For notes of the meetings please refer to the \href{https://helsinkifi-my.sharep
 
 Mnemonic rule for the binary coding: zero bad (crime or jail), one good!
 
-\section{RL's notes about the selective labels paper (optional reading)} \label{sec:comments}
+\section{Introduction}
 
 \setcounter{figure}{-1}
 \begin{wrapfigure}{r}{0.3\textwidth} %this figure will be at the right
@@ -148,9 +157,150 @@ Mnemonic rule for the binary coding: zero bad (crime or jail), one good!
 \label{fig:initial_model}
 \end{wrapfigure}
 
+Complex computational algorithms are already deployed in numerous fields, such as healthcare and justice, to make influential decisions. These machine decisions have a great impact on human lives, and they should therefore be audited to show that they improve on human decision-making. In conventional settings, outcome labels are readily available, performance evaluation is straightforward, and numerous metrics have been proposed in the literature (ROC curves, AUC, etc.). In settings such as judicial bail decisions, however, some outcomes cannot be observed due to the selective labeling of the data. This results in a complicated missing data problem where the missingness of an item is connected to its outcome, so the available labels are a non-random sample of the underlying population. Approaches addressing this problem can be separated into two major classes: imputation approaches and other approaches.
+
+%In this paper we propose a novel framework for presenting these missing data problems and present an approach for inferring the missing labels to evaluate the performance of predictive models. We use a flexible Bayesian approach and etc. Additionally we show that our method is robust against violations and modifications in the data generating mechanisms.
+
+%\begin{itemize}
+%\item Evaluating predictive models is important. Models are even now deployed in the fields such as X, Y and Z to make influential decisions.
+%\item Missing data problems ? Multiple kinds, MAR, MNAR, MCAR, what not. In the current setting data missingness is correlated with the outcome. Labeling has been performed by expert decision-makers wanting to avoid negative outcomes. This causes non-random missingness which has been recently been called selective labels/-ing.
+%\item Approaches to this problem can roughly be divided into three categories: ignore, impute and other methods. Imputation methods X, Y, Z. Other is contraction presented by Lakkaraju et al which utilises...
+%\item Several nuances to this setting a more relaxed setting in Jung et al preprint where counterfactuals were used to construct optimal policies. There wasn't selective labels.
+%\item in addition selective labels.
+%\item Interest in presenting a method for robust evaluation of predictive algorithms using potential outcomes.
+%\end{itemize}
+
+\section{Framework}
+
+$$\text{DG} + \text{DEC} = \text{DATA} \rightarrow \text{MODEL} \rightarrow \text{EVALUATION}$$
+
+In our framework, the flow of information is divided into three steps: data generation, modelling and evaluation. The framework relies on a precise definition of each step: a step is fully defined when its input and output are unambiguously described.
+
+The data generation step generates all the data, including the features and the selectively labeled outcomes; we have separated the labeling process from the feature generation. In our setting we discuss mainly five variables: R for leniency, X for observed features, Z for unobserved features (\emph{unobservables} in short), T for the decisions and Y for the outcomes. The central variants for data generation are presented in section \ref{sec:modules}. The effects of the variables on each other are presented as a directed graph in figure \ref{fig:initial_model}.
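+
+As a concrete illustration, the following Python sketch mirrors one data generation variant (outcome from Bernoulli with unobservables, cf. algorithm \ref{alg:dg:coinflip_with_z}); the function name and parameter values are illustrative, not the exact implementation:
+
+\begin{verbatim}
+import numpy as np
+from scipy.special import expit   # inverse logit
+
+rng = np.random.default_rng(0)
+
+def generate_data(n=5000, m=10, beta_x=1.0, beta_z=1.0):
+    x = rng.normal(size=n)                  # observed features X
+    z = rng.normal(size=n)                  # unobservables Z
+    j = rng.integers(m, size=n)             # judge identity
+    alpha = rng.normal(size=m)              # judge-specific leniency intercepts
+    # decision T = 1 means bail, outcome Y = 1 means a good outcome
+    t = rng.binomial(1, expit(alpha[j] + beta_x * x + beta_z * z))
+    y = rng.binomial(1, expit(beta_x * x + beta_z * z))
+    y_obs = np.where(t == 1, y, -1)         # Y is observed only if released
+    return x, z, j, t, y, y_obs
+\end{verbatim}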
+
+After the feature generation, the data is passed to a predictive model in the modelling step. The model, e.g. a regression model, is trained using one portion of the original data. Using the trained model, we assign each observation in the remaining portion a predicted probability of a negative outcome.
+
+Finally, an evaluation algorithm outputs an estimate of the model's failure rate. A good evaluation algorithm gives an unbiased, low-variance estimate of the failure rate of the model at a given leniency.
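+
+To make the pipeline concrete, here is a minimal sketch of the modelling and evaluation steps, continuing from the data generation sketch above; the evaluator used here is the naive ``labeled outcomes'' one, i.e.\ exactly the biased estimate that a good evaluation algorithm must improve on:
+
+\begin{verbatim}
+import numpy as np
+from sklearn.linear_model import LogisticRegression
+
+def train_and_evaluate(x, t, y_obs, r=0.5):
+    n = len(x); half = n // 2
+    labeled = t[:half] == 1                 # train only on released cases
+    model = LogisticRegression()
+    model.fit(x[:half][labeled].reshape(-1, 1), y_obs[:half][labeled])
+    p_y0 = model.predict_proba(x[half:].reshape(-1, 1))[:, 0]  # P(Y = 0)
+    release = np.argsort(p_y0)[: int(r * (n - half))]  # r% least risky
+    # "labeled outcomes": failures counted only among labeled releases
+    failures = (y_obs[half:][release] == 0).sum()
+    return failures / (n - half)
+\end{verbatim}
+
+The evaluator modules in section \ref{sec:modules} replace the last two lines with estimates that account for the missing labels.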
+
+%Next, the training data is given to the predictive model, which learns a model and then applies it to the test data, assigning each instance a probability of Y being zero. Models such as neural networks, logistic regression, linear regression, etc.\ could be used; we use logistic regression models embedded in the evaluation modules.
+
+%Finally, the data with its scores is given to the evaluation algorithm. The goal of the evaluation algorithm is to give as accurate and consistent an estimate as possible of the model's failure rate at a given leniency. The evaluation algorithms are presented in the modules section. A good evaluation is accurate (small bias and variance) and robust to different data generation methods.
+
+
+\section{Potential outcomes in model evaluation}
+
+The potential outcomes evaluator uses Stan to infer the latent variable and the path coefficients, and to estimate the expectation of Y for the missing outcomes. The full hierarchical model is presented in eq. \ref{eq:po} below. Priors for the $\beta$ coefficients were chosen to be sufficiently non-informative without restricting the density estimation. The coefficients for the latent Z were restricted to be positive for posterior estimation; the resulting prior is the half-normal distribution ($X\sim N(0, 1) \Rightarrow Y=|X| \sim$ Half-Normal). The $\alpha_j$ intercepts appear only in the decision equation to emulate differences in leniency: subjects with equal x and z then have different probabilities of bail depending on the judge's leniency.
+
+\begin{align} \label{eq:po}
+ Y~|~t,~x,~z,~\beta_{xy},~\beta_{zy} & \sim \text{Bernoulli}(\invlogit(\beta_{xy}x + \beta_{zy}z)) \\ \nonumber
+ T~|~x,~z,~\alpha_j,~\beta_{xt},~\beta_{zt} & \sim \text{Bernoulli}(\invlogit(\alpha_j + \beta_{xt}x + \beta_{zt}z)) \\ \nonumber
+ Z &\sim N(0, 1) \\ \nonumber
+  \beta_{xt}, \beta_{xy} & \sim N(0, 10^2) \\ \nonumber
+  \beta_{zt}, \beta_{zy} & \sim N_+(0, 10^2) \\ \nonumber
+  p(\alpha_j) & \propto 1 \hskip1.0em \text{for } j \in \{1, 2, \ldots, M\}
+\end{align}
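+
+For reference, a sketch of the unnormalised log posterior that eq. \ref{eq:po} implies (in Python, with illustrative names; in the Stan implementation the positivity of $\beta_{zt}, \beta_{zy}$ is enforced by the parameter declarations):
+
+\begin{verbatim}
+import numpy as np
+from scipy.special import expit
+from scipy.stats import norm
+
+def log_posterior(b_xt, b_zt, b_xy, b_zy, alpha, z, x, j, t, y):
+    lp = norm.logpdf(z).sum()                         # Z ~ N(0, 1)
+    lp += norm.logpdf([b_xt, b_xy], scale=10).sum()   # N(0, 10^2) priors
+    lp += norm.logpdf([b_zt, b_zy], scale=10).sum()   # half-normal kernels
+    p_t = expit(alpha[j] + b_xt * x + b_zt * z)       # decision model
+    lp += np.where(t == 1, np.log(p_t), np.log1p(-p_t)).sum()
+    obs = t == 1                                      # Y observed iff T = 1
+    p_y = expit(b_xy * x[obs] + b_zy * z[obs])        # outcome model
+    lp += np.where(y[obs] == 1, np.log(p_y), np.log1p(-p_y)).sum()
+    return lp                                         # flat prior on alpha
+\end{verbatim}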
+
+The model is fitted on the test set, and for the subjects with $T=0$ the expectations of the potential outcomes under a positive decision are used in place of the missing outcomes.
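+
+In code, the substitution could look like the sketch below; the names are illustrative, and \verb|e_y0_missing| stands for the posterior expectation of $\pr(Y(1)=0)$ (a negative outcome had a positive decision been given) from the fit above:
+
+\begin{verbatim}
+import numpy as np
+
+def failure_rate_po(p_y0_model, t, y_obs, e_y0_missing, r):
+    n = len(p_y0_model)
+    release = np.argsort(p_y0_model)[: int(r * n)]  # r% least risky
+    fails = np.where(t[release] == 1,
+                     (y_obs[release] == 0).astype(float),  # observed failures
+                     e_y0_missing[release])                # imputed expectations
+    return fails.sum() / n
+\end{verbatim}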
+
+\section{Results}
+
+Here we present the results obtained from four core settings and four settings created for robustness checks. Our core settings include situations where the data is generated with the unobservables Z. The settings are characterized by the outcome generation mechanism and the decision assignment.
+
+The four main result figures are presented in figure set \ref{fig:results_bayes}, with captions indicating the outcome and decision mechanisms. Figures showing the variance of the failure rate estimates are in appendix section \ref{sec:diagnostic}.
+
+\begin{figure}[]
+    \centering
+    \begin{subfigure}[b]{0.475\textwidth}
+        \includegraphics[width=\textwidth]{sl_result_bernoulli_independent}
+        \caption{Outcome Y from Bernoulli, independent decisions using the quantiles.}
+        %\label{fig:}
+    \end{subfigure}
+    \quad %add desired spacing between images, e. g. ~, \quad, \qquad, \hfill etc. 
+      %(or a blank line to force the subfigure onto a new line)
+    \begin{subfigure}[b]{0.475\textwidth}
+        \includegraphics[width=\textwidth]{sl_result_threshold_independent}
+        \caption{Outcome Y from threshold rule, independent decisions using the quantiles.}
+        %\label{fig:}
+    \end{subfigure}
+    \begin{subfigure}[b]{0.475\textwidth}
+        \includegraphics[width=\textwidth]{sl_result_bernoulli_batch}
+        \caption{Outcome Y from Bernoulli, non-independent decisions.}
+        %\label{fig:}
+    \end{subfigure}
+    \quad %add desired spacing between images, e. g. ~, \quad, \qquad, \hfill etc. 
+      %(or a blank line to force the subfigure onto a new line)
+    \begin{subfigure}[b]{0.475\textwidth}
+        \includegraphics[width=\textwidth]{sl_result_threshold_batch}
+        \caption{Outcome Y from threshold rule, non-independent decisions.}
+        %\label{fig:}
+    \end{subfigure}
+    \caption{Core results: Failure rate vs. acceptance rate with varying levels of leniency. Different combinations of deciders and data generation mechanisms.}
+    \label{fig:results_bayes}
+\end{figure}
+
+The robustness checks are presented in figure set \ref{fig:results_robustness}, depicting deciders who
+\begin{itemize} 
+\item assign bail randomly with probability $r$, 
+\item favour or dislike defendants with certain values of X, or
+\item decide with a wrong coefficient $\beta_x$ (a ``bad judge'').
+\end{itemize}
+The last figure illustrates the situation where the data generation mechanism is exactly the one specified in eq. \ref{eq:po}, with $\beta$ coefficients equal to 1 and $\alpha$ coefficients equal to 0. The three decider variants are sketched in code below.
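+
+A minimal sketch of the three decider variants, continuing from the data generation sketch in the framework section; the cutoffs and the size of the bias in \verb|decide_biased| are illustrative assumptions, not the exact values used:
+
+\begin{verbatim}
+import numpy as np
+from scipy.special import expit
+
+rng = np.random.default_rng(0)
+
+def decide_random(x, r):
+    return rng.binomial(1, r, size=len(x))      # bail with probability r
+
+def decide_biased(x, z, low=-1.0, high=1.0, bias=0.5):
+    # favour defendants with large X, dislike those with small X
+    p = expit(x + z) + bias * (x > high) - bias * (x < low)
+    return rng.binomial(1, np.clip(p, 0.0, 1.0))
+
+def decide_bad_judge(x, z, j, alpha, beta_x=0.2, beta_z=1.0):
+    # "bad judge": decisions with the wrong coefficient beta_x = 0.2
+    return rng.binomial(1, expit(alpha[j] + beta_x * x + beta_z * z))
+\end{verbatim}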
+
+The figures show that the contraction method is fairly robust to changes in the data and decision generation, although its estimates have some variance. The potential outcomes approach consistently underestimates the true failure rate, resulting in mediocre performance (compared to the imputation algorithms presented in the SL paper). The mean absolute errors of contraction with respect to the true evaluation were on the order of 0.005--0.009, compared to our MAEs of approximately 0.015--0.035. A worrying aspect of the analysis is that even when the data generating process follows exactly the specified hierarchical model, the MAE is still approximately 0.015.
+
+Notably, the diagnostic figures in section \ref{sec:diagnostic} show that a few failure rate estimates are far too high, appearing as outliers and implying problems in model identifiability, probably due to a bimodal posterior. (When Stan searches for the maximum of the posterior, i.e.\ the minimum of the negative log posterior, it occasionally converges to a log probability of approximately $-200$ to $-400$, whereas most of the time the optimum is at $-7000$ to $-9000$.)
+
+Imposing constraints on the model in equation \ref{eq:po} did not yield significantly better results. The model was constrained so that there was only one intercept $\alpha$ and so that $\beta_{xt}=\beta_{xy}$ and $\beta_{zt}=\beta_{zy}$. Another caveat of the current approach is its scalability to big data sets.
+
+\begin{figure}[H]
+    \centering
+    \begin{subfigure}[b]{0.475\textwidth}
+        \includegraphics[width=\textwidth]{sl_result_random}
+        \caption{Random decisions.}
+        %\label{fig:}
+    \end{subfigure}
+    \quad %add desired spacing between images, e. g. ~, \quad, \qquad, \hfill etc. 
+      %(or a blank line to force the subfigure onto a new line)
+    \begin{subfigure}[b]{0.475\textwidth}
+        \includegraphics[width=\textwidth]{sl_result_biased}
+        \caption{Biased decisions.}
+        %\label{fig:}
+    \end{subfigure}
+    \begin{subfigure}[b]{0.475\textwidth}
+        \includegraphics[width=\textwidth]{sl_result_bad}
+        \caption{Bad judge with $\beta_x=0.2$.}
+        %\label{fig:}
+    \end{subfigure}
+    \quad %add desired spacing between images, e. g. ~, \quad, \qquad, \hfill etc. 
+      %(or a blank line to force the subfigure onto a new line)
+    \begin{subfigure}[b]{0.475\textwidth}
+        \includegraphics[width=\textwidth]{sl_result_bernoulli_bernoulli}
+        \caption{Data generation as corresponding to model.}
+        %\label{fig:}
+    \end{subfigure}
+    \caption{Robustness check figures: Failure rate vs. acceptance rate with varying levels of leniency. Different combinations of deciders and data generation mechanisms.}
+    \label{fig:results_robustness}
+\end{figure}
+
+\begin{thebibliography}{9} % Might have been apa
+
+\bibitem{lakkaraju17} 
+   Lakkaraju, H., Kleinberg, J., Leskovec, J., Ludwig, J. and Mullainathan, S. The Selective Labels Problem: Evaluating Algorithmic Predictions in the Presence of Unobservables. In Proceedings of KDD. 2017.
+
+\end{thebibliography}
+
+%\end{document}
+
+\newpage
+
+%\begin{appendices}
+\appendix
+
+\section{RL's comments on the selective labels paper} \label{sec:comments}
+
 \emph{This chapter is to present my comments and insight regarding the topic.}
 
-The motivating idea behind the SL paper of Lakkaraju et al. \cite{lakkaraju17} is to evaluate if machines could improve on human performance. In general case, comparing the performance of human and machine evaluations is simple. In the domains addressed by Lakkaraju et al. simple comparisons would be unethical and therefore algorithms are required. (Other approaches, such as a data augmentation algorithm has been proposed by De-Arteaga \cite{dearteaga18}.)
+The motivating idea behind the SL paper of Lakkaraju et al. \cite{lakkaraju17} is to evaluate whether machines could improve on human performance. In the general case, comparing the performance of human and machine evaluations is simple. In the domains addressed by Lakkaraju et al., such direct comparisons would be unethical, and therefore algorithms are required.
 
 The general idea of the SL paper is to train some predictive model with selectively labeled data. The question is then ``how would this predictive model perform if it were to make independent bail-or-jail decisions?'' That quantity cannot be computed from real-life data sets due to ethical reasons and the hidden labels. We can, however, use the selectively labeled data to estimate its performance. But because the available data is biased, the performance estimates are too good, or ``overly optimistic'', if they are calculated in the conventional way (``labeled outcomes only''). This is why they propose the contraction algorithm.
 
@@ -512,192 +662,15 @@ Now the equation \ref{eq:ep} simply calculates the mean of the probabilities for
 \end{algorithmic}
 \end{algorithm}
 
-
-\section{Results} \label{sec:results}
-
-Results obtained from running algorithm \ref{alg:perf_comp} are presented in table \ref{tab:results} and figure \ref{fig:results}. All parameters are in their default values and a logistic regression model is trained.
-
-\begin{table}[H]
-\centering
-\caption{Mean absolute error (MAE) w.r.t true evaluation. \\ \emph{RL: Updated 26 June.}}
-\begin{tabular}{l | c c}
-Method & MAE without Z & MAE with Z \\ \hline
-Labeled outcomes 	& 0.107249375 	& 0.0827844\\
-Human evaluation 	& 0.002383729 	& 0.0042517\\
-Contraction 		& 0.004633164		& 0.0075497\\
-Causal model, ep 	& 0.000598624 	& 0.0411532\\
-\end{tabular}
-\label{tab:results}
-\end{table}
-
-
-\begin{figure}[]
-    \centering
-    \begin{subfigure}[b]{0.5\textwidth}
-        \includegraphics[width=\textwidth]{sl_without_Z_8iter}
-        \caption{Results without unobservables}
-        \label{fig:results_without_Z}
-    \end{subfigure}
-    ~ %add desired spacing between images, e. g. ~, \quad, \qquad, \hfill etc. 
-      %(or a blank line to force the subfigure onto a new line)
-    \begin{subfigure}[b]{0.5\textwidth}
-        \includegraphics[width=\textwidth]{sl_with_Z_8iter_betaZ_1_0}
-        \caption{Results with unobservables, $\beta_Z=1$.}
-        \label{fig:results_with_Z}
-    \end{subfigure}
-    \caption{Failure rate vs. acceptance rate with varying levels of leniency. Logistic regression was trained on labeled training data. \emph{RL: Updated 26 June.}}
-    \label{fig:results}
-\end{figure}
-
-\subsection{$\beta_Z=0$ and data generated with unobservables.}
-
-If we assign $\beta_Z=0$, almost all failure rates drop to zero in the interval 0.1, ..., 0.3 but the human evaluation failure rate. Results are presented in figures \ref{fig:betaZ_1_5} and \ref{fig:betaZ_0}. 
-
-The disparities between figures \ref{fig:results_without_Z} and \ref{fig:betaZ_0} (result without unobservables and with $\beta_Z=0$) can be explained in the slight difference in the data generating process, namely the effect of $\epsilon$. The effect of adding $\epsilon$ (noise to the decisions) is further explored in section \ref{sec:epsilon}.
-
-\begin{figure}[]
-    \centering
-    \begin{subfigure}[b]{0.475\textwidth}
-        \includegraphics[width=\textwidth]{sl_with_Z_4iter_betaZ_1_5}
-        \caption{Results with unobservables, $\beta_Z$ set to 1.5 in algorithm \ref{alg:data_with_Z}.}
-        \label{fig:betaZ_1_5}
-    \end{subfigure}
-    \quad %add desired spacing between images, e. g. ~, \quad, \qquad, \hfill etc. 
-      %(or a blank line to force the subfigure onto a new line)
-    \begin{subfigure}[b]{0.475\textwidth}
-        \includegraphics[width=\textwidth]{sl_with_Z_4iter_beta0}
-        \caption{Results with unobservables, $\beta_Z$ set to 0 in algorithm \ref{alg:data_with_Z}.}
-        \label{fig:betaZ_0}
-    \end{subfigure}
-    \caption{Effect of $\beta_z$. Failure rate vs. acceptance rate with unobservables in the data (see algorithm \ref{alg:data_with_Z}). Logistic regression was trained on labeled training data. Results from algorithm \ref{alg:perf_comp}.}
-    \label{fig:betaZ_comp}
-\end{figure}
-
-\subsection{Noise added to the decision and data generated without unobservables} \label{sec:epsilon}
-
-In this part, Gaussian noise with zero mean and 0.1 variance was added to the probabilities $P(Y=0|X=x)$ after sampling Y but before ordering the observations in line 5 of algorithm \ref{alg:data_without_Z}. Results are presented in Figure \ref{fig:sigma_figure}.
-
-\begin{figure}[]
-    \centering
-    \includegraphics[width=0.5\textwidth]{sl_without_Z_3iter_sigma_sqrt_01}
-    \caption{Failure rate with varying levels of leniency without unobservables. Noise has been added to the decision probabilities. Logistic regression was trained on labeled training data.}
-    \label{fig:sigma_figure}
-\end{figure}
-
-\subsection{Predictions with random forest classifier} \label{sec:random_forest}
-
-In this section the predictive model was switched to random forest classifier to examine the effect of changing the predictive model. Results are practically identical to those presented in figure \ref{fig:results} previously and are presented in figure \ref{fig:random_forest}.
-
-\begin{figure}[]
-    \centering
-    \begin{subfigure}[b]{0.475\textwidth}
-        \includegraphics[width=\textwidth]{sl_withoutZ_4iter_randomforest}
-        \caption{Results without unobservables.}
-        \label{fig:results_without_Z_rf}
-    \end{subfigure}
-    \quad %add desired spacing between images, e. g. ~, \quad, \qquad, \hfill etc. 
-      %(or a blank line to force the subfigure onto a new line)
-    \begin{subfigure}[b]{0.475\textwidth}
-        \includegraphics[width=\textwidth]{sl_withZ_6iter_betaZ_1_0_randomforest}
-        \caption{Results with unobservables, $\beta_Z=1$.}
-        \label{fig:results_with_Z_rf}
-    \end{subfigure}
-    \caption{Failure rate vs. acceptance rate with varying levels of leniency. Random forest classifier was trained on labeled training data}
-    \label{fig:random_forest}
-\end{figure}
-
-\subsection{Sanity check for predictions}
-
-Predictions were checked by drawing a graph of predicted Y versus X, results are presented in figure \ref{fig:sanity_check}. The figure indicates that the predicted class labels and the probabilities for them are consistent with the ground truth.
-
-\begin{figure}[]
-    \centering
-    \includegraphics[width=0.5\textwidth]{sanity_check}
-    \caption{Predicted class label and probability of $Y=1$ versus X. Prediction was done with a logistic regression model. Colors of the points denote ground truth (yellow = 1, purple = 0). Data set was created with the unobservables.}
-    \label{fig:sanity_check}
-\end{figure}
-
-\subsection{Fully random model $\M$}
-
-Given our framework defined in section \ref{sec:framework}, the results presented next are with model $\M$ that outputs probabilities 0.5 for every instance of $x$. Labeling process is still as presented in algorithm \ref{alg:data_with_Z}.  
-
-\begin{figure}[]
-    \centering
-    \begin{subfigure}[b]{0.475\textwidth}
-        \includegraphics[width=\textwidth]{sl_without_Z_15iter_random_model}
-        \caption{Failure rate vs. acceptance rate. Data without unobservables. Machine predictions with random model.}
-        \label{fig:random_predictions_without_Z}
-    \end{subfigure}
-    \quad %add desired spacing between images, e. g. ~, \quad, \qquad, \hfill etc. 
-      %(or a blank line to force the subfigure onto a new line)
-    \begin{subfigure}[b]{0.475\textwidth}
-        \includegraphics[width=\textwidth]{sl_with_Z_15iter_fully_random_model}
-        \caption{Failure rate vs. acceptance rate. Data with unobservables. Machine predictions with random model.}
-        \label{fig:random_predictions_with_Z}
-    \end{subfigure}
-    \caption{Failure rate vs. acceptance rate with varying levels of leniency. Machine predictions were done with completely random model, that is prediction $P(Y=0|X=x)=0.5$ for all $x$.}
-    \label{fig:random_predictions}
-\end{figure}
-
-\subsection{Modular framework -- Monte Carlo evaluator} \label{sec:modules_mc}
-
-For these results, data was generated either with module in algorithm \ref{alg:dg:coinflip_with_z} (drawing Y from Bernoulli distribution with parameter $\pr(Y=0|X, Z, W)$ as previously) or with module in algorithm \ref{alg:dg:threshold_with_Z} (assign Y based on the value of $\invlogit(\beta_XX+\beta_ZZ)$). Decisions were determined using one of the two modules: module in algorithm \ref{alg:decider:quantile} (decision based on quantiles) or \ref{alg:decider:lakkaraju} ("human" decision-maker as in \cite{lakkaraju17}). Curves were computed with True evaluation (algorithm \ref{alg:eval:true_eval}), Labeled outcomes (\ref{alg:eval:labeled_outcomes}), Human evaluation (\ref{alg:eval:human_eval}), Contraction (\ref{alg:eval:contraction}) and Monte Carlo evaluators (\ref{alg:eval:mc}). Results are presented in figure \ref{fig:modules_mc}. The corresponding MAEs are presented in table \ref{tab:modules_mc}.
-
-From the result table we can see that the MAE is at the lowest when the data generating process corresponds closely to the Monte Carlo algorithm.
-
-\begin{table}[H]
-\centering
-\caption{Mean absolute error w.r.t true evaluation. See modules used in section \ref{sec:modules_mc}. Bern = Bernoulli,  indep. = independent, TH = threshold}
-\begin{tabular}{l | c c c c}
-Method & Bern + indep. & Bern + non-indep. & TH + indep. & TH + non-indep.\\ \hline
-Labeled outcomes 	& 0.111075	& 0.103235	& 0.108506 & 0.0970325\\
-Human evaluation 	& 0.027298	& NaN (TBA)	& 0.049582 & 0.0033916\\
-Contraction 		& 0.004206	& 0.004656	& 0.005557 & 0.0034591\\
-Monte Carlo	 	& 0.001292	& 0.016629	& 0.009429 & 0.0179825\\
-\end{tabular}
-\label{tab:modules_mc}
-\end{table}
-
-
-\begin{figure}[H]
-    \centering
-    \begin{subfigure}[b]{0.475\textwidth}
-        \includegraphics[width=\textwidth]{sl_with_Z_10iter_coinflip_quantile_defaults_mc}
-        \caption{Outcome Y from Bernoulli, independent decisions using the quantiles.}
-        %\label{fig:modules_mc_without_Z}
-    \end{subfigure}
-    \quad %add desired spacing between images, e. g. ~, \quad, \qquad, \hfill etc. 
-      %(or a blank line to force the subfigure onto a new line)
-    \begin{subfigure}[b]{0.475\textwidth}
-        \includegraphics[width=\textwidth]{sl_with_Z_20iter_threshold_quantile_defaults_mc}
-        \caption{Outcome Y from threshold rule, independent decisions using the quantiles.}
-        %\label{fig:modules_mc_with_Z}
-    \end{subfigure}
-    \begin{subfigure}[b]{0.475\textwidth}
-        \includegraphics[width=\textwidth]{sl_with_Z_10iter_coinflip_lakkarajudecider_defaults_mc}
-        \caption{Outcome Y from Bernoulli, non-independent decisions.}
-        %\label{fig:modules_mc_without_Z}
-    \end{subfigure}
-    \quad %add desired spacing between images, e. g. ~, \quad, \qquad, \hfill etc. 
-      %(or a blank line to force the subfigure onto a new line)
-    \begin{subfigure}[b]{0.475\textwidth}
-        \includegraphics[width=\textwidth]{sl_with_Z_10iter_threshold_lakkarajudecider_defaults_mc}
-        \caption{Outcome Y from threshold rule, non-independent decisions.}
-        %\label{fig:modules_mc_with_Z}
-    \end{subfigure}
-    \caption{Failure rate vs. acceptance rate with varying levels of leniency. Different combinations of deciders and data generation modules. See other modules used in section \ref{sec:modules_mc}}
-    \label{fig:modules_mc}
-\end{figure}
-
 \section{Modules} \label{sec:modules}
 
-Different types of modules (data generation, decider and evaluator) are presented in this section. Summary table is presented last. See section \ref{sec:modular_framework} for some discussion on the properties of each module.
+Different types of modules (data generation, decider and evaluator) are presented in this section; a summary table of the modules is presented last. See section \ref{sec:modular_framework} for discussion of the properties of each module.
 
 \subsection{Data generation modules} 
 
 The purpose of a data generation module is to define a process which generates all of the data consisting of all the features and outcomes. The data generation module should include information on what are the distributions of the features and how each of them affect the generation of the outcome.
 
-We have three different kinds of data generating modules (DG modules). The DG modules can be separated by two features: inclusion of unobservables and the outcome generating mechanism. Private features are mainly sampled from standard Gaussians. See summary of DG modules from table \ref{tab:dg_modules}.
+In our synthetic setting, we have three different kinds of data generation modules (DG modules). The DG modules can be distinguished by two features: the inclusion of unobservables and the outcome generation mechanism. Private features are mainly sampled from standard Gaussians. See the summary of the DG modules in table \ref{tab:dg_modules}.
 
 \begin{table}[h]
 \centering
@@ -712,7 +685,7 @@ We have three different kinds of data generating modules (DG modules). The DG mo
 \end{tabular}
 \end{table}
 
-\begin{algorithm}[h] 			% enter the algorithm environment
+\begin{algorithm}[H] 			% enter the algorithm environment
 \caption{Data generation module: outcome from Bernoulli without unobservables} 		% give the algorithm a caption
 \label{alg:dg:coinflip_without_z} 			% and a label for \ref{} commands later in the document
 \begin{algorithmic}[1] 		% enter the algorithmic environment
@@ -726,7 +699,7 @@ We have three different kinds of data generating modules (DG modules). The DG mo
 \end{algorithmic}
 \end{algorithm}
 
-\begin{algorithm}[h] 			% enter the algorithm environment
+\begin{algorithm}[H] 			% enter the algorithm environment
 \caption{Data generation module: outcome by threshold with unobservables} 		% give the algorithm a caption
 \label{alg:dg:threshold_with_Z} 			% and a label for \ref{} commands later in the document
 \begin{algorithmic}[1] 		% enter the algorithmic environment
@@ -744,7 +717,7 @@ We have three different kinds of data generating modules (DG modules). The DG mo
 \end{algorithmic}
 \end{algorithm}
 
-\begin{algorithm}[h] 			% enter the algorithm environment
+\begin{algorithm}[H] 			% enter the algorithm environment
 \caption{Data generation module: outcome from Bernoulli with unobservables} 		% give the algorithm a caption
 \label{alg:dg:coinflip_with_z} 			% and a label for \ref{} commands later in the document
 \begin{algorithmic}[1] 		% enter the algorithmic environment
@@ -969,7 +942,7 @@ In practise, in lines 1--3 and 10--13 of algorithm \ref{alg:eval:mc} we do as in
 computes the correct expectation automatically. Using the expectation, we then compute the probability for the counterfactual $\pr(Y(1) = 0)$ (the probability of a negative outcome had a positive decision been given). In line 9 the imputation can be performed in a couple of ways: either by taking a random guess with probability $\pr(Y(1) = 0)$ or by assigning the most likely value for Y.
 
 \begin{algorithm}[H] 			% enter the algorithm environment
-\caption{Evaluator module: Monte Carlo evaluator, imputation} 		% give the algorithm a caption
+\caption{Evaluator module: Analytic solution} 		% give the algorithm a caption
 \label{alg:eval:mc} 			% and a label for \ref{} commands later in the document
 \begin{algorithmic}[1] 		% enter the algorithmic environment
 \REQUIRE Data $\D$ with properties $\{x_i, j_i, t_i, y_i\}$, acceptance rate r
@@ -990,46 +963,6 @@ computes the correct expectation automatically. Using the expectation, we then c
 \end{algorithmic}
 \end{algorithm}
 
-%Comments this approach:
-%\begin{itemize}
-%\item Propensity ($\pr(T=1| X, Z)$) is taken as given and in correct form. In reality it is not known (?)
-%\item The equation for the inverse cdf \ref{eq:cum_inv} assumes the joint pdf of $\invlogit(\beta_XX+\beta_ZZ)$ known when in real data X might be multidimensional and non-normal etc.
-%\item 
-%\end{itemize}
-
-In the future, we should utilize an even more Bayesian approach to be able to include priors for the different $\beta$ coefficients. Priors are needed for learning the values for the coefficients. 
-
-The following hierarchical model was used as an initial approach to the problem. Data was generated with unobservables and both outcome Y and decision T were drawn from Bernoulli distributions. The $\beta$ coefficients were systematically overestimated as shown in figure \ref{fig:posteriors}.
-
-\begin{align} \label{eq1}
- 1-t~|~x,~z,~\beta_x,~\beta_z & \sim \text{Bernoulli}(\invlogit(\beta_xx + \beta_zz)) \\ \nonumber
- Z &\sim N(0, 1) \\ \nonumber
-% \alpha_j & \sim N(0, 100), j \in \{1, \ldots, N_{judges} \} \\
-  \beta_x & \sim N(0, 10^2) \\ \nonumber
-  \beta_z & \sim N_+(0, 10^2) 
-\end{align}
-
-
-\begin{figure}[]
-    \centering
-    \begin{subfigure}[b]{0.45\textwidth}
-        \includegraphics[width=\textwidth]{sl_posterior_betax}
-        \caption{Posterior of $\beta_x$.}
-        %\label{fig:random_predictions_without_Z}
-    \end{subfigure}
-    \quad %add desired spacing between images, e. g. ~, \quad, \qquad, \hfill etc. 
-      %(or a blank line to force the subfigure onto a new line)
-    \begin{subfigure}[b]{0.45\textwidth}
-        \includegraphics[width=\textwidth]{sl_posterior_betaz}
-        \caption{Posterior of $\beta_z$.}
-        %\label{fig:posteriors}
-    \end{subfigure}
-    \caption{Coefficient posteriors from model \ref{eq1}.}
-    \label{fig:posteriors}
-\end{figure}
-
-\newpage
-
 \subsection{Summary table}
 
 Summary table of different modules.
@@ -1042,16 +975,16 @@ Summary table of different modules.
     \multicolumn{3}{c}{Module type} \\[.5\normalbaselineskip]
     \textbf{Data generator} & \textbf{Decider} & \textbf{Evaluator}  \\
     \midrule
-    {\ul Without unobservables}	& Independent decisions		& {\ul Labeled outcomes} \\
+    {\ul Without unobservables}	& {\ul Independent decisions}	& {\ul Labeled outcomes} \\
      					 	& 1. draw T from a Bernoulli	& \tabitem Data $\D$ with properties $\{x_i, t_i, y_i\}$ \\
     {\ul With unobservables}       	& with $P(T=0|X, Z)$			& \tabitem acceptance rate r \\
-    \tabitem $P(Y=0|X, Z, W)$ 	& 2. determine with $F^{-1}(r)$	& \tabitem knowledge that X affects Y \\[.5\normalbaselineskip]
+    \tabitem $P(Y=0|X, Z, W)$ 	& 						& \tabitem knowledge that X affects Y \\[.5\normalbaselineskip]
     
-     {\ul With unobservables}	& Non-independent decisions  	& {\ul True evaluation} \\
-     \tabitem assign Y by		& 3. sort by $P(T=0|X, Z)$		& \tabitem Data $\D$ with properties $\{x_i, t_i, y_i\}$ \\
-     "threshold rule"			& and assign $t$ by $r$  		& and \emph{all outcome labels} \\
-     						&   						& \tabitem acceptance rate r \\
-     						&   						& \tabitem knowledge that X affects Y \\[.5\normalbaselineskip]
+     {\ul With unobservables}	& 2. determine with $F^{-1}(r)$		& {\ul True evaluation} \\
+     \tabitem assign $Y=1$		& 							& \tabitem Data $\D$ with properties $\{x_i, t_i, y_i\}$ \\
+     if $P(Y=0|X, Z, W) \geq 0.5$	& {\ul Non-independent decisions}	& and \emph{all outcome labels} \\
+     						& 3. sort by $P(T=0|X, Z)$			& \tabitem acceptance rate r \\
+     						& and assign $t$ by $r$			& \tabitem knowledge that X affects Y \\[.5\normalbaselineskip]
      
      &  & {\ul Human evaluation} \\
      &  & \tabitem Data $\D$ with properties $\{x_i, j_i, t_i, y_i\}$ \\
@@ -1062,28 +995,262 @@ Summary table of different modules.
      &  & \tabitem acceptance rate r \\
      &  & \tabitem knowledge that X affects Y \\[.5\normalbaselineskip]
      
-     &  & {\ul Causal model} \\
-     &  & \tabitem Data $\D$ with properties $\{x_i, t_i, y_i\}$ \\
-     &  & \tabitem acceptance rate r \\
-     &  & \tabitem knowledge that X affects Y \\[.5\normalbaselineskip]
+%     &  & {\ul Causal model} \\
+%     &  & \tabitem Data $\D$ with properties $\{x_i, t_i, y_i\}$ \\
+%     &  & \tabitem acceptance rate r \\
+%     &  & \tabitem knowledge that X affects Y \\[.5\normalbaselineskip]
      
-     &  & {\ul Monte Carlo evaluator} \\
+     &  & {\ul Analytic solution} \\
      &  & \tabitem Data $\D$ with properties $\{x_i, j_i, t_i, y_i\}$ \\
      &  & \tabitem acceptance rate r \\
      &  & \tabitem knowledge that X affects Y \\
      &  & \tabitem more intricate knowledge about $\M$ ? \\[.5\normalbaselineskip]
+     
+     &  & {\ul Potential outcomes evaluator} \\
+     &  & \tabitem Data $\D$ with properties $\{x_i, j_i, t_i, y_i\}$ \\
+     &  & \tabitem acceptance rate r \\
+     &  & \tabitem knowledge that X affects Y \\[.5\normalbaselineskip]
     \bottomrule
   \end{tabular}
   \label{tab:modules}
 \end{table}
 
-\begin{thebibliography}{9} % Might have been apa
+\section{Old results} \label{sec:results}
 
-\bibitem{dearteaga18}
-   De-Arteaga, Maria. Learning Under Selective Labels in the Presence of Expert Consistency. 2018. 
-\bibitem{lakkaraju17} 
-   Lakkaraju, Himabindu. The Selective Labels Problem: Evaluating Algorithmic Predictions in the Presence of Unobservables. 2017. 
+Results obtained from running algorithm \ref{alg:perf_comp} are presented in table \ref{tab:results} and figure \ref{fig:results}. All parameters are at their default values and a logistic regression model is trained.
 
-\end{thebibliography}
+\begin{table}[H]
+\centering
+\caption{Mean absolute error (MAE) w.r.t true evaluation. \\ \emph{RL: Updated 26 June.}}
+\begin{tabular}{l | c c}
+Method & MAE without Z & MAE with Z \\ \hline
+Labeled outcomes 	& 0.107249375 	& 0.0827844\\
+Human evaluation 	& 0.002383729 	& 0.0042517\\
+Contraction 		& 0.004633164		& 0.0075497\\
+Causal model, ep 	& 0.000598624 	& 0.0411532\\
+\end{tabular}
+\label{tab:results}
+\end{table}
+
+
+\begin{figure}[]
+    \centering
+    \begin{subfigure}[b]{0.5\textwidth}
+        \includegraphics[width=\textwidth]{sl_without_Z_8iter}
+        \caption{Results without unobservables}
+        \label{fig:results_without_Z}
+    \end{subfigure}
+    ~ %add desired spacing between images, e. g. ~, \quad, \qquad, \hfill etc. 
+      %(or a blank line to force the subfigure onto a new line)
+    \begin{subfigure}[b]{0.5\textwidth}
+        \includegraphics[width=\textwidth]{sl_with_Z_8iter_betaZ_1_0}
+        \caption{Results with unobservables, $\beta_Z=1$.}
+        \label{fig:results_with_Z}
+    \end{subfigure}
+    \caption{Failure rate vs. acceptance rate with varying levels of leniency. Logistic regression was trained on labeled training data. \emph{RL: Updated 26 June.}}
+    \label{fig:results}
+\end{figure}
+
+\subsection{$\beta_Z=0$ and data generated with unobservables.}
+
+If we set $\beta_Z=0$, almost all failure rates except the human evaluation failure rate drop to zero in the acceptance rate interval 0.1--0.3. Results are presented in figures \ref{fig:betaZ_1_5} and \ref{fig:betaZ_0}. 
+
+The disparities between figures \ref{fig:results_without_Z} and \ref{fig:betaZ_0} (results without unobservables and with $\beta_Z=0$) can be explained by a slight difference in the data generating process, namely the effect of $\epsilon$. The effect of adding $\epsilon$ (noise in the decisions) is explored further in section \ref{sec:epsilon}.
+
+\begin{figure}[]
+    \centering
+    \begin{subfigure}[b]{0.475\textwidth}
+        \includegraphics[width=\textwidth]{sl_with_Z_4iter_betaZ_1_5}
+        \caption{Results with unobservables, $\beta_Z$ set to 1.5 in algorithm \ref{alg:data_with_Z}.}
+        \label{fig:betaZ_1_5}
+    \end{subfigure}
+    \quad %add desired spacing between images, e. g. ~, \quad, \qquad, \hfill etc. 
+      %(or a blank line to force the subfigure onto a new line)
+    \begin{subfigure}[b]{0.475\textwidth}
+        \includegraphics[width=\textwidth]{sl_with_Z_4iter_beta0}
+        \caption{Results with unobservables, $\beta_Z$ set to 0 in algorithm \ref{alg:data_with_Z}.}
+        \label{fig:betaZ_0}
+    \end{subfigure}
+    \caption{Effect of $\beta_z$. Failure rate vs. acceptance rate with unobservables in the data (see algorithm \ref{alg:data_with_Z}). Logistic regression was trained on labeled training data. Results from algorithm \ref{alg:perf_comp}.}
+    \label{fig:betaZ_comp}
+\end{figure}
+
+\subsection{Noise added to the decision and data generated without unobservables} \label{sec:epsilon}
+
+In this experiment, Gaussian noise with zero mean and variance 0.1 was added to the probabilities $P(Y=0|X=x)$ after sampling Y but before ordering the observations in line 5 of algorithm \ref{alg:data_without_Z}, as sketched below. Results are presented in figure \ref{fig:sigma_figure}.
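+
+A sketch of that modification (variable names illustrative, with a uniform stand-in for the probabilities):
+
+\begin{verbatim}
+import numpy as np
+
+rng = np.random.default_rng(0)
+p_y0 = rng.uniform(size=1000)     # stand-in for P(Y=0|X=x)
+# zero-mean noise with variance 0.1 (std sqrt(0.1)), added after
+# sampling Y but before ordering the observations for the decisions
+p_y0_noisy = p_y0 + rng.normal(0.0, np.sqrt(0.1), size=len(p_y0))
+order = np.argsort(p_y0_noisy)
+\end{verbatim}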
+
+\begin{figure}[]
+    \centering
+    \includegraphics[width=0.5\textwidth]{sl_without_Z_3iter_sigma_sqrt_01}
+    \caption{Failure rate with varying levels of leniency without unobservables. Noise has been added to the decision probabilities. Logistic regression was trained on labeled training data.}
+    \label{fig:sigma_figure}
+\end{figure}
+
+\subsection{Predictions with random forest classifier} \label{sec:random_forest}
+
+In this section the predictive model was switched to a random forest classifier to examine the effect of changing the predictive model. The results, presented in figure \ref{fig:random_forest}, are practically identical to those presented earlier in figure \ref{fig:results}.
+
+\begin{figure}[]
+    \centering
+    \begin{subfigure}[b]{0.475\textwidth}
+        \includegraphics[width=\textwidth]{sl_withoutZ_4iter_randomforest}
+        \caption{Results without unobservables.}
+        \label{fig:results_without_Z_rf}
+    \end{subfigure}
+    \quad %add desired spacing between images, e. g. ~, \quad, \qquad, \hfill etc. 
+      %(or a blank line to force the subfigure onto a new line)
+    \begin{subfigure}[b]{0.475\textwidth}
+        \includegraphics[width=\textwidth]{sl_withZ_6iter_betaZ_1_0_randomforest}
+        \caption{Results with unobservables, $\beta_Z=1$.}
+        \label{fig:results_with_Z_rf}
+    \end{subfigure}
+    \caption{Failure rate vs. acceptance rate with varying levels of leniency. A random forest classifier was trained on labeled training data.}
+    \label{fig:random_forest}
+\end{figure}
+
+\subsection{Sanity check for predictions}
+
+Predictions were checked by drawing a graph of predicted Y versus X, results are presented in figure \ref{fig:sanity_check}. The figure indicates that the predicted class labels and the probabilities for them are consistent with the ground truth.
+
+\begin{figure}[]
+    \centering
+    \includegraphics[width=0.5\textwidth]{sanity_check}
+    \caption{Predicted class label and probability of $Y=1$ versus X. Prediction was done with a logistic regression model. Colors of the points denote ground truth (yellow = 1, purple = 0). Data set was created with the unobservables.}
+    \label{fig:sanity_check}
+\end{figure}
+
+\subsection{Fully random model $\M$}
+
+Given our framework defined in section \ref{sec:framework}, the results presented next are for a model $\M$ that outputs probability 0.5 for every instance $x$. The labeling process is still as presented in algorithm \ref{alg:data_with_Z}.  
+
+\begin{figure}[]
+    \centering
+    \begin{subfigure}[b]{0.475\textwidth}
+        \includegraphics[width=\textwidth]{sl_without_Z_15iter_random_model}
+        \caption{Failure rate vs. acceptance rate. Data without unobservables. Machine predictions with random model.}
+        \label{fig:random_predictions_without_Z}
+    \end{subfigure}
+    \quad %add desired spacing between images, e. g. ~, \quad, \qquad, \hfill etc. 
+      %(or a blank line to force the subfigure onto a new line)
+    \begin{subfigure}[b]{0.475\textwidth}
+        \includegraphics[width=\textwidth]{sl_with_Z_15iter_fully_random_model}
+        \caption{Failure rate vs. acceptance rate. Data with unobservables. Machine predictions with random model.}
+        \label{fig:random_predictions_with_Z}
+    \end{subfigure}
+    \caption{Failure rate vs. acceptance rate with varying levels of leniency. Machine predictions were done with completely random model, that is prediction $P(Y=0|X=x)=0.5$ for all $x$.}
+    \label{fig:random_predictions}
+\end{figure}
+
+\subsection{Modular framework -- Monte Carlo evaluator} \label{sec:modules_mc}
+
+For these results, data was generated either with the module in algorithm \ref{alg:dg:coinflip_with_z} (drawing Y from a Bernoulli distribution with parameter $\pr(Y=0|X, Z, W)$ as previously) or with the module in algorithm \ref{alg:dg:threshold_with_Z} (assigning Y based on the value of $\invlogit(\beta_XX+\beta_ZZ)$). Decisions were determined using one of two modules: algorithm \ref{alg:decider:quantile} (decisions based on quantiles) or algorithm \ref{alg:decider:lakkaraju} (``human'' decision-maker as in \cite{lakkaraju17}). Curves were computed with the True evaluation (algorithm \ref{alg:eval:true_eval}), Labeled outcomes (\ref{alg:eval:labeled_outcomes}), Human evaluation (\ref{alg:eval:human_eval}), Contraction (\ref{alg:eval:contraction}) and Monte Carlo (\ref{alg:eval:mc}) evaluators. Results are presented in figure \ref{fig:modules_mc}. The corresponding MAEs are presented in table \ref{tab:modules_mc}.
+
+From the result table we can see that the MAE is at its lowest when the data generating process corresponds closely to the assumptions of the Monte Carlo algorithm.
+
+\begin{table}[]
+\centering
+\caption{Mean absolute error w.r.t true evaluation. See modules used in section \ref{sec:modules_mc}. Bern = Bernoulli,  indep. = independent, TH = threshold}
+\begin{tabular}{l | c c c c}
+Method & Bern + indep. & Bern + non-indep. & TH + indep. & TH + non-indep.\\ \hline
+Labeled outcomes 	& 0.111075	& 0.103235	& 0.108506 & 0.0970325\\
+Human evaluation 	& 0.027298	& NaN (TBA)	& 0.049582 & 0.0033916\\
+Contraction 		& 0.004206	& 0.004656	& 0.005557 & 0.0034591\\
+Monte Carlo	 	& 0.001292	& 0.016629	& 0.009429 & 0.0179825\\
+\end{tabular}
+\label{tab:modules_mc}
+\end{table}
+
+
+\begin{figure}[]
+    \centering
+    \begin{subfigure}[b]{0.475\textwidth}
+        \includegraphics[width=\textwidth]{sl_with_Z_10iter_coinflip_quantile_defaults_mc}
+        \caption{Outcome Y from Bernoulli, independent decisions using the quantiles.}
+        %\label{fig:modules_mc_without_Z}
+    \end{subfigure}
+    \quad %add desired spacing between images, e. g. ~, \quad, \qquad, \hfill etc. 
+      %(or a blank line to force the subfigure onto a new line)
+    \begin{subfigure}[b]{0.475\textwidth}
+        \includegraphics[width=\textwidth]{sl_with_Z_20iter_threshold_quantile_defaults_mc}
+        \caption{Outcome Y from threshold rule, independent decisions using the quantiles.}
+        %\label{fig:modules_mc_with_Z}
+    \end{subfigure}
+    \begin{subfigure}[b]{0.475\textwidth}
+        \includegraphics[width=\textwidth]{sl_with_Z_10iter_coinflip_lakkarajudecider_defaults_mc}
+        \caption{Outcome Y from Bernoulli, non-independent decisions.}
+        %\label{fig:modules_mc_without_Z}
+    \end{subfigure}
+    \quad %add desired spacing between images, e. g. ~, \quad, \qquad, \hfill etc. 
+      %(or a blank line to force the subfigure onto a new line)
+    \begin{subfigure}[b]{0.475\textwidth}
+        \includegraphics[width=\textwidth]{sl_with_Z_10iter_threshold_lakkarajudecider_defaults_mc}
+        \caption{Outcome Y from threshold rule, non-independent decisions.}
+        %\label{fig:modules_mc_with_Z}
+    \end{subfigure}
+    \caption{Failure rate vs. acceptance rate with varying levels of leniency. Different combinations of deciders and data generation modules. See the other modules used in section \ref{sec:modules_mc}.}
+    \label{fig:modules_mc}
+\end{figure}
+
+\section{Diagnostic figures} \label{sec:diagnostic}
+
+Here we present supplementary figures for all the settings in the main results section.
+
+\begin{figure}[]
+    \centering
+    \includegraphics[width=\textwidth]{sl_diagnostic_bernoulli_independent_without_Z}
+    \caption{Failure rate estimates by method at different levels of leniency: outcome from Bernoulli, independent decisions, without unobservables.}
+    %\label{fig:}
+\end{figure}
+
+\begin{figure}[]
+    \centering
+    \includegraphics[width=\textwidth]{sl_diagnostic_bernoulli_independent_with_Z}
+    \caption{Failure rate estimates by method at different levels of leniency: outcome from Bernoulli, independent decisions, with unobservables.}
+    %\label{fig:}
+\end{figure}
+
+\begin{figure}[]
+    \centering
+    \includegraphics[width=\textwidth]{sl_diagnostic_threshold_independent_with_Z}
+    \caption{Failure rate estimates by method at different levels of leniency: outcome from threshold rule, independent decisions, with unobservables.}
+    %\label{fig:}
+\end{figure}
+
+\begin{figure}[]
+    \centering
+    \includegraphics[width=\textwidth]{sl_diagnostic_bernoulli_batch_with_Z}
+    \caption{Failure rate estimates by method at different levels of leniency: outcome from Bernoulli, non-independent decisions, with unobservables.}
+    %\label{fig:}
+\end{figure}
+
+\begin{figure}[]
+    \centering
+    \includegraphics[width=\textwidth]{sl_diagnostic_threshold_batch_with_Z}
+    \caption{Failure rate estimates by method at different levels of leniency: outcome from threshold rule, non-independent decisions, with unobservables.}
+    %\label{fig:}
+\end{figure}
+
+\begin{figure}[]
+    \centering
+    \includegraphics[width=\textwidth]{sl_diagnostic_random_decider_with_Z}
+    \caption{Failure rate estimates by method at different levels of leniency: random decisions, with unobservables.}
+    %\label{fig:}
+\end{figure}
+
+\begin{figure}[]
+    \centering
+    \includegraphics[width=\textwidth]{sl_diagnostic_biased_decider_with_Z}
+    \caption{Failure rate estimates by method at different levels of leniency: biased decisions, with unobservables.}
+    %\label{fig:}
+\end{figure}
+
+\begin{figure}[]
+    \centering
+    \includegraphics[width=\textwidth]{sl_diagnostic_bad_decider_with_Z}
+    \caption{Failure rate estimates by method at different levels of leniency: bad judge, with unobservables.}
+    %\label{fig:}
+\end{figure}
+
+%\end{appendices}
 
 \end{document}
\ No newline at end of file
diff --git a/figures/sl_diagnostic_bad_decider_with_Z.png b/figures/sl_diagnostic_bad_decider_with_Z.png
index f3bc10405dcc7380822f32435c56f262b182c921..66e78f52d96eb18653a051709726d474d20708e7 100644
Binary files a/figures/sl_diagnostic_bad_decider_with_Z.png and b/figures/sl_diagnostic_bad_decider_with_Z.png differ
diff --git a/figures/sl_diagnostic_bernoulli_batch_with_Z.png b/figures/sl_diagnostic_bernoulli_batch_with_Z.png
index 3443d2c6670fd75d7268997708b50dc0f6ebe554..278cc6bf34102302e4f42f3f75a794469eafb660 100644
Binary files a/figures/sl_diagnostic_bernoulli_batch_with_Z.png and b/figures/sl_diagnostic_bernoulli_batch_with_Z.png differ
diff --git a/figures/sl_diagnostic_bernoulli_bernoulli_with_Z.png b/figures/sl_diagnostic_bernoulli_bernoulli_with_Z.png
new file mode 100644
index 0000000000000000000000000000000000000000..b7ad6f0053c38b7b8980995aa89811286d382658
Binary files /dev/null and b/figures/sl_diagnostic_bernoulli_bernoulli_with_Z.png differ
diff --git a/figures/sl_diagnostic_bernoulli_independent_with_Z.png b/figures/sl_diagnostic_bernoulli_independent_with_Z.png
index 84def65c0c1c567b5eadc51dfec1638fff1d87ff..415be561e488af49607d86d1eb3d9f71ee911d24 100644
Binary files a/figures/sl_diagnostic_bernoulli_independent_with_Z.png and b/figures/sl_diagnostic_bernoulli_independent_with_Z.png differ
diff --git a/figures/sl_diagnostic_bernoulli_independent_without_Z.png b/figures/sl_diagnostic_bernoulli_independent_without_Z.png
index 1ba887c1bd6b006682c9c2e36d55829478e85f85..5d02998c188f501ba9e27a2fb6267afed7e29441 100644
Binary files a/figures/sl_diagnostic_bernoulli_independent_without_Z.png and b/figures/sl_diagnostic_bernoulli_independent_without_Z.png differ
diff --git a/figures/sl_diagnostic_biased_decider_with_Z.png b/figures/sl_diagnostic_biased_decider_with_Z.png
index bf4012e4da51ec965f9c7c3966301c90bbbe46d5..eec776b8626c1699d26164d9032ebf724caa853a 100644
Binary files a/figures/sl_diagnostic_biased_decider_with_Z.png and b/figures/sl_diagnostic_biased_decider_with_Z.png differ
diff --git a/figures/sl_diagnostic_random_decider_with_Z.png b/figures/sl_diagnostic_random_decider_with_Z.png
index 8e06181822ac585d0e991340106ac160faad648a..3c48ee41e71f0d113a5765227c0f2efc47b95959 100644
Binary files a/figures/sl_diagnostic_random_decider_with_Z.png and b/figures/sl_diagnostic_random_decider_with_Z.png differ
diff --git a/figures/sl_diagnostic_threshold_batch_with_Z.png b/figures/sl_diagnostic_threshold_batch_with_Z.png
index 4d6e94ed9969a8dcf937fa3d7a581167c991d526..bb3ea821d1761c23d2bfa6ef4dff943db7bd80ec 100644
Binary files a/figures/sl_diagnostic_threshold_batch_with_Z.png and b/figures/sl_diagnostic_threshold_batch_with_Z.png differ
diff --git a/figures/sl_diagnostic_threshold_independent_with_Z.png b/figures/sl_diagnostic_threshold_independent_with_Z.png
index 5d3caf1589fb4db79e5da4779317b779f3d02aad..00eef169d356b0c7321fac8dd96d39490f6ee405 100644
Binary files a/figures/sl_diagnostic_threshold_independent_with_Z.png and b/figures/sl_diagnostic_threshold_independent_with_Z.png differ
diff --git a/figures/sl_result_bad.png b/figures/sl_result_bad.png
new file mode 100644
index 0000000000000000000000000000000000000000..0c4f98f0ffe137ebf75c7927c3e84a2674da879c
Binary files /dev/null and b/figures/sl_result_bad.png differ
diff --git a/figures/sl_result_bernoulli_batch.png b/figures/sl_result_bernoulli_batch.png
new file mode 100644
index 0000000000000000000000000000000000000000..0f87e356250b267c6b46769e6c29744e8f073f2e
Binary files /dev/null and b/figures/sl_result_bernoulli_batch.png differ
diff --git a/figures/sl_result_bernoulli_bernoulli.png b/figures/sl_result_bernoulli_bernoulli.png
new file mode 100644
index 0000000000000000000000000000000000000000..7d7973560fdfc8e14c5fc33a543115f0a39559f7
Binary files /dev/null and b/figures/sl_result_bernoulli_bernoulli.png differ
diff --git a/figures/sl_result_bernoulli_independent.png b/figures/sl_result_bernoulli_independent.png
new file mode 100644
index 0000000000000000000000000000000000000000..8a9738a67d51174f293e1368181221efe2e5225d
Binary files /dev/null and b/figures/sl_result_bernoulli_independent.png differ
diff --git a/figures/sl_result_biased.png b/figures/sl_result_biased.png
new file mode 100644
index 0000000000000000000000000000000000000000..3ecdcf9457def2c3c536b26342e2e2e55b53df70
Binary files /dev/null and b/figures/sl_result_biased.png differ
diff --git a/figures/sl_result_random.png b/figures/sl_result_random.png
new file mode 100644
index 0000000000000000000000000000000000000000..65f50bdc57e2ef3c065f5bca4265f075f5ba3ff5
Binary files /dev/null and b/figures/sl_result_random.png differ
diff --git a/figures/sl_result_threshold_batch.png b/figures/sl_result_threshold_batch.png
new file mode 100644
index 0000000000000000000000000000000000000000..0200bfd449e1a4cf9227670910e1bb592d97816e
Binary files /dev/null and b/figures/sl_result_threshold_batch.png differ
diff --git a/figures/sl_result_threshold_independent.png b/figures/sl_result_threshold_independent.png
new file mode 100644
index 0000000000000000000000000000000000000000..f862b236fd1fa18df5e666631d0f79ba2fa98a1f
Binary files /dev/null and b/figures/sl_result_threshold_independent.png differ