\usepackage{geometry} % See geometry.pdf to learn the layout options. There are lots.
%\geometry{a4paper} % ... or letterpaper or a5paper or ...
%\geometry{landscape} % Activate for rotated page geometry
\usepackage[parfill]{parskip} % Activate to begin paragraphs with an empty line rather than an indent
\usepackage{graphicx}
\usepackage{amssymb}
\usepackage{epstopdf}
%\DeclareGraphicsRule{.tif}{png}{.png}{`convert #1 `dirname #1`/`basename #1 .tif`.png}
\usepackage{pgf}
\usepackage{tikz}
\usetikzlibrary{arrows,automata}
\usepackage{algorithm}% http://ctan.org/pkg/algorithms
\usepackage{algorithmic}% http://ctan.org/pkg/algorithms
\renewcommand{\algorithmicrequire}{\textbf{Input:}}
\renewcommand{\algorithmicensure}{\textbf{Procedure:}}
\renewcommand{\algorithmicreturn}{\textbf{Return}}
\newcommand{\pr}{\mathbb{P}} % probability symbol
\newcommand{\D}{\mathcal{D}} % data set
\newcommand{\s}{\mathcal{S}} % "fancy S"
\newcommand{\M}{\mathcal{M}} % "fancy M"
\newcommand{\B}{\mathcal{B}} % "fancy B"
\newcommand{\RR}{\mathcal{R}} % contraction algorithm's R
\renewcommand{\descriptionlabel}[1]{\hspace{\labelsep}\textnormal{#1}}
\makeatletter
%Table of Contents
\setcounter{tocdepth}{3}
% Add bold to \section titles in ToC and remove . after numbers
\renewcommand{\tocsection}[3]{%
\indentlabel{\@ifnotempty{#2}{\bfseries\ignorespaces#1 #2\quad}}\bfseries#3}
% Remove . after numbers in \subsection
\renewcommand{\tocsubsection}[3]{%
\indentlabel{\@ifnotempty{#2}{\ignorespaces#1 #2\quad}}#3}
%\let\tocsubsubsection\tocsubsection% Update for \subsubsection
%...
\newcommand\@dotsep{4.5}
\def\@tocline#1#2#3#4#5#6#7{\relax
\ifnum #1>\c@tocdepth % then omit
\else
\par \addpenalty\@secpenalty\addvspace{#2}%
\begingroup \hyphenpenalty\@M
\@ifempty{#4}{%
\@tempdima\csname r@tocindent\number#1\endcsname\relax
}{%
\@tempdima#4\relax
}%
\parindent\z@ \leftskip#3\relax \advance\leftskip\@tempdima\relax
\rightskip\@pnumwidth plus1em \parfillskip-\@pnumwidth
#5\leavevmode\hskip-\@tempdima{#6}\nobreak
\leaders\hbox{$\m@th\mkern \@dotsep mu\hbox{.}\mkern \@dotsep mu$}\hfill
\nobreak
\hbox to\@pnumwidth{\@tocpagenum{\ifnum#1=1\bfseries\fi#7}}\par% <-- \bfseries for \section page
\nobreak
\endgroup
\fi}
\AtBeginDocument{%
\expandafter\renewcommand\csname r@tocindent0\endcsname{0pt}
}
\def\l@subsection{\@tocline{2}{0pt}{2.5pc}{5pc}{}}
\makeatother
\usepackage{subcaption}
\graphicspath{ {../figures/} }
\begin{document}

\begin{abstract}
This document presents my (RL's) implementations at the pseudocode level. First, I present the nomenclature used in these notes. Then I give my personal views and comments on the motivation behind the Selective labels paper. In the following sections, I present the data generating algorithms and the algorithms for obtaining failure rates with different methods. Finally, I present some results that I was asked to show at the meeting on Friday the $7^{th}$.
\end{abstract}
\section*{Terms and abbreviations}
\begin{description}
\item[R :] acceptance rate, leniency of decision maker, $r \in [0, 1]$
\item[X :] personal features, observable to a predictive model
\item[Z :] some features of a subject, unobservable to a predictive model, latent variable
\item[W :] noise added to result variable Y
\item[T :] decision variable, bail/positive decision equal to 1, jail/negative decision equal to 0
\item[Y :] result variable, no crime/positive result equal to 1, crime/negative result equal to 0
\item[MAE :] Mean absolute error
\item[SL :] Selective labels, for more information see Lakkaraju's paper \cite{lakkaraju17}
\item[Labeled data :] data that has been censored, i.e.\ if a negative decision is given ($T=0$), then $Y$ is set to NA.
\item[Full data :] data with all labels available, i.e.\ \emph{even if} a negative decision is given ($T=0$), $Y$ is still available.
\item[Unobservables :] unmeasured confounders, latent variables, Z
\end{description}

Mnemonic rule for the binary coding: zero bad (crime or jail), one good!
\section{RL's notes about the selective labels paper (optional reading)} \label{sec:comments}
\emph{This section presents my comments and insights regarding the topic.}
The motivating idea behind the SL paper of Lakkaraju et al. \cite{lakkaraju17} is to evaluate whether machines could improve on human performance. In the general case, comparing the performance of human and machine evaluations is simple. In the domains addressed by Lakkaraju et al., however, such direct comparisons would be unethical, and therefore algorithms are required. (Other approaches, such as a data augmentation algorithm, have been proposed by De-Arteaga \cite{dearteaga18}.)
The general idea of the SL paper is to train a predictive model with selectively labeled data. The question is then: how would this predictive model perform if it were to make independent bail-or-jail decisions? That quantity cannot be computed from real-life data sets due to ethical reasons and the hidden labels. We can, however, use held-out selectively labeled data to estimate the model's performance. But because the available data is biased, performance estimates computed in the conventional way (``labeled outcomes only'') are overly optimistic. This is why Lakkaraju et al. propose the contraction algorithm.
One of the concepts to note when reading the Lakkaraju paper is the difference between the global goal of prediction and the goal in this specific setting. The global goal is to achieve a low failure rate with a high acceptance rate, but that is not our present concern. The goal in this setting is to estimate the true failure rate of a model from unseen, biased data. That is, given only selectively labeled data and an arbitrary black-box model $\mathcal{B}$, we are interested in predicting the performance of model $\mathcal{B}$ on the whole data set with all ground-truth labels.
On the formalisation of $R$: we discussed how Lakkaraju's paper treats the variable $R$ in a seemingly nonsensical way; it is as if a judge had to let someone go today in order to detain some other defendant tomorrow, just to keep their acceptance rate at some $r$. A more intuitive way of thinking about $r$ is the ``threshold perspective'': if a judge sees that a defendant has probability $p_x$ of committing a crime if released, the judge detains the defendant when $p_x > r$, i.e.\ when the defendant is too dangerous to release. The problem is that we cannot observe this innate $r$; we can only observe the decisions the judges give. By forcing the ``acceptance threshold'' to be an ``acceptance rate'', Lakkaraju avoids estimating $r$ separately, and the effect of changing $r$ can then be computed from the data directly.
\section{Framework definition -- 13 June discussion}
First, data is generated through a \textbf{data generating process (DGP)}. The DGP comprises generating the private features of the subjects, generating the acceptance rates of the judges, and assigning the subjects to the judges. The \textbf{acceptance rate (AR)} is defined as the ratio of positive decisions to all decisions given by a judge. As a formula, \[ AR = \dfrac{\#\{\text{Positive decisions}\}}{\#\{\text{Decisions}\}}. \] The data generation process is depicted in the first box of Figure \ref{fig:separation}.
Next, the generated data goes to the \textbf{labeling process}. In the labeling process, it is determined which instances of the data will have an outcome label available. This is done by humans and is presented in lines 5--7 of algorithm \ref{alg:data_without_Z} and 5--8 of algorithm \ref{alg:data_with_Z}.
In the third step, the labeled data is given to a machine that will make decisions or predictions using some features of the data. The machine outputs either binary decisions (yes/no), probabilities (real numbers in the interval $[0, 1]$) or a metric for ordering all the instances. The machine is denoted by $\M$.

Finally, the decisions and/or predictions made by the machine $\M$ and the human judges (see the dashed arrow in figure \ref{fig:separation}) are evaluated using an \textbf{evaluation algorithm}. Evaluation algorithms take the decisions, probabilities or orderings generated in the previous steps as input and output an estimate of the failure rate. The \textbf{failure rate (FR)} is defined as the ratio of undesired outcomes to given decisions. One special characteristic of FR in this setting is that a failure can only occur with a positive decision. More explicitly, \[ FR = \dfrac{\#\{\text{Failures}\}}{\#\{\text{Decisions}\}}. \] A second characteristic of FR is that the number of positive decisions, and therefore FR itself, can be controlled through the acceptance rate defined above.

Given the above framework, the goal is to create an evaluation algorithm that can accurately estimate the failure rate of any model $\M$ if it were to replace the human decision-makers in the labeling process. The estimates have to be made using only data that the human decision-makers have labeled, and the failure rate has to be estimated accurately across various levels of acceptance rate. The accuracy of the estimates can be compared by computing e.g.\ the mean absolute error w.r.t.\ the estimates given by the \nameref{alg:true_eval} algorithm.
\begin{figure}
\centering
\begin{tikzpicture}[->,>=stealth',shorten >=1pt,auto,node distance=1.5cm]
\tikzstyle{every state}=[fill=none,draw=black,text=black, rectangle, minimum width=6cm]
\node[state] (D) {Data generation};
\node[state] (J) [below of=D] {Labeling process (human)};
\node[state] (MP) [below of=J] {$\mathcal{M}$ Machine decisions / predictions};
\node[state] (EA) [below of=MP] {Evaluation algorithm};
\path (D) edge (J)
(J) edge (MP)
edge [bend right=81, dashed] (EA)
(MP) edge (EA);
\end{tikzpicture}
\caption{The selective labels framework. The dashed arrow indicates how human evaluations are assessed without machine intervention using the \nameref{alg:human_eval} algorithm.}
\label{fig:separation}
\end{figure}
\section{Data generation}

Both of the data generating algorithms are presented in this section.
\subsection{Without unobservables (see also algorithm \ref{alg:data_without_Z})}
In the setting without unobservables $Z$, we first sample an acceptance rate $r$ for each of the $M=100$ judges uniformly from the half-open interval $[0.1, 0.9)$. Then we randomly assign 500 unique subjects to each judge (50000 in total) and simulate their features $X$ as i.i.d.\ standard Gaussian random variables with zero mean and unit variance. The probability of a negative outcome is then $$P(Y=0|X=x) = \dfrac{1}{1+\exp(-x)}=\sigma(x).$$ Because $P(Y=1|X=x) = 1-P(Y=0|X=x) = 1-\sigma(x)$, the outcome variable $Y$ can be sampled from a Bernoulli distribution with parameter $1-\sigma(x)$. The data is then sorted for each judge by the probability $P(Y=0|X=x)$ in descending order. If a subject is in the top $(1-r) \cdot 100\%$ of the observations assigned to a judge, the decision variable $T$ is set to zero, and otherwise to one.
\begin{algorithm}[] % enter the algorithm environment
\caption{Create data without unobservables} % give the algorithm a caption
\label{alg:data_without_Z} % and a label for \ref{} commands later in the document
\begin{algorithmic}[1] % enter the algorithmic environment
\REQUIRE Number of judges $M=100$ and number of subjects distributed to each of them $N=500$ s.t. $N_{total} = N \cdot M$
\ENSURE
\STATE Sample an acceptance rate for each of the $M$ judges from $U(0.1, 0.9)$ and round it to one decimal place.
\STATE Sample feature $X$ for each of the $N_{total}$ observations from a standard Gaussian.
\STATE Calculate $P(Y=0|X=x)=\sigma(x)$ for each observation
\STATE Sample Y from Bernoulli distribution with parameter $1-\sigma(x)$.
\STATE Sort the data by (1) the judges and (2) by probabilities $P(Y=0|X=x)$ in descending order.
\STATE \hskip3.0em $\rhd$ Now the most dangerous subjects for each of the judges are at the top.
\STATE If a subject belongs to the top $(1-r) \cdot 100\%$ of the observations assigned to a judge, set $T=0$, else set $T=1$.
\STATE Halve the data into training and test sets at random.
\STATE For both halves, set $Y=$ NA if the decision is negative ($T=0$).
\RETURN labeled training data, full training data, labeled test data, full test data
\end{algorithmic}
\end{algorithm}
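For concreteness, below is a minimal Python sketch of algorithm \ref{alg:data_without_Z}. It assumes numpy, pandas and scipy; the function and column names are illustrative, not taken from the original code.
\begin{verbatim}
import numpy as np
import pandas as pd
from scipy.special import expit  # logistic sigmoid

def generate_data_without_Z(M=100, N=500, seed=0):
    rng = np.random.RandomState(seed)
    n_total = M * N
    # Acceptance rate per judge from U(0.1, 0.9), rounded to one decimal.
    r = np.round(rng.uniform(0.1, 0.9, size=M), 1)
    df = pd.DataFrame({'judge': np.repeat(np.arange(M), N),
                       'r': np.repeat(r, N),
                       'X': rng.normal(size=n_total)})
    df['p_y0'] = expit(df['X'])                    # P(Y=0|X=x) = sigma(x)
    df['Y'] = rng.binomial(1, 1 - df['p_y0'])      # Y ~ Bernoulli(1 - sigma(x))
    # Sort by judge and danger; detain the top (1-r)*100% per judge.
    df = df.sort_values(['judge', 'p_y0'], ascending=[True, False])
    rank = df.groupby('judge').cumcount()          # 0 = most dangerous
    df['T'] = (rank >= (1 - df['r']) * N).astype(int)
    df['Y_labeled'] = df['Y'].where(df['T'] == 1)  # censor: Y = NA when T = 0
    train = df.sample(frac=0.5, random_state=seed) # random half split
    test = df.drop(train.index)
    return train, test
\end{verbatim}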
\subsection{With unobservables (see also algorithm \ref{alg:data_with_Z})}
In the setting with unobservables $Z$, we first sample an acceptance rate $r$ for each of the $M=100$ judges uniformly from the half-open interval $[0.1, 0.9)$. Then we randomly assign 500 unique subjects to each judge (50000 in total) and simulate their features $X$, $Z$ and $W$ as i.i.d.\ standard Gaussian random variables with zero mean and unit variance. The probability of a negative outcome is then $$P(Y=0|X=x, Z=z, W=w)=\sigma(\beta_Xx+\beta_Zz+\beta_Ww),$$ where $\beta_X=\beta_Z=1$ and $\beta_W=0.2$. Next, the outcome $Y$ is set to 0 if $P(Y=0|X, Z, W) \geq 0.5$ and to 1 otherwise. The conditional probability of a negative decision ($T=0$) is defined as $$P(T=0|X=x, Z=z)=\sigma(\beta_Xx+\beta_Zz)+\epsilon,$$ where $\epsilon \sim N(0, 0.1)$. Next, the data is sorted for each judge by the probability $P(T=0|X, Z)$ in descending order. If a subject is in the top $(1-r) \cdot 100\%$ of the observations assigned to a judge, the decision variable $T$ is set to zero, and otherwise to one.
\begin{algorithm}[] % enter the algorithm environment
\caption{Create data with unobservables} % give the algorithm a caption
\label{alg:data_with_Z} % and a label for \ref{} commands later in the document
\begin{algorithmic}[1] % enter the algorithmic environment
\REQUIRE Number of judges $M=100$, number of subjects distributed to each of them $N=500$ s.t.\ $N_{total} = N \cdot M$, and coefficients $\beta_X=1$, $\beta_Z=1$ and $\beta_W=0.2$.
\ENSURE
\STATE Sample an acceptance rate for each of the $M$ judges from $U(0.1, 0.9)$ and round it to one decimal place.
\STATE Sample features $X$, $Z$ and $W$ independently for each of the $N_{total}$ observations from a standard Gaussian.
\STATE Calculate $P(Y=0|X, Z, W)$ for each observation.
\STATE Set Y to 0 if $P(Y = 0| X, Z, W) \geq 0.5$ and to 1 otherwise.
\STATE Calculate $P(T=0|X, Z)$ for each observation and attach to data.
\STATE Sort the data by (1) the judges and (2) by probabilities $P(T=0|X, Z)$ in descending order.
\STATE \hskip3.0em $\rhd$ Now the most dangerous subjects for each of the judges are at the top.
\STATE If a subject belongs to the top $(1-r) \cdot 100\%$ of the observations assigned to that judge, set $T=0$, else set $T=1$.
\STATE Halve the data into training and test sets at random.
\STATE For both halves, set $Y=$ NA if the decision is negative ($T=0$).
\RETURN labeled training data, full training data, labeled test data, full test data
\end{algorithmic}
\end{algorithm}
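A corresponding sketch of algorithm \ref{alg:data_with_Z}, under the same assumptions as the previous sketch (the random train/test split is omitted here). Note that $\epsilon \sim N(0, 0.1)$ is read as variance $0.1$, i.e.\ standard deviation $\sqrt{0.1}$.
\begin{verbatim}
def generate_data_with_Z(M=100, N=500, b_X=1.0, b_Z=1.0, b_W=0.2, seed=0):
    rng = np.random.RandomState(seed)
    n_total = M * N
    r = np.round(rng.uniform(0.1, 0.9, size=M), 1)
    df = pd.DataFrame({'judge': np.repeat(np.arange(M), N),
                       'r': np.repeat(r, N),
                       'X': rng.normal(size=n_total),
                       'Z': rng.normal(size=n_total),   # latent variable
                       'W': rng.normal(size=n_total)})  # noise term
    # Deterministic outcome: Y = 0 iff P(Y=0|X,Z,W) >= 0.5.
    p_y0 = expit(b_X * df['X'] + b_Z * df['Z'] + b_W * df['W'])
    df['Y'] = (p_y0 < 0.5).astype(int)
    # Judges rank subjects by a noisy score that uses X and Z but not W;
    # epsilon ~ N(0, 0.1) is taken to have variance 0.1.
    df['p_t0'] = (expit(b_X * df['X'] + b_Z * df['Z'])
                  + rng.normal(0, np.sqrt(0.1), n_total))
    df = df.sort_values(['judge', 'p_t0'], ascending=[True, False])
    rank = df.groupby('judge').cumcount()
    df['T'] = (rank >= (1 - df['r']) * N).astype(int)
    df['Y_labeled'] = df['Y'].where(df['T'] == 1)
    return df
\end{verbatim}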
\section{Model fitting} \label{sec:model_fitting}
The fitted models are logistic regression models from the scikit-learn package. The solver is set to lbfgs (as there is no closed-form solution) and an intercept is estimated by default. The resulting LogisticRegression object provides convenient methods for fitting the model and obtaining class-label probabilities. Please see the documentation at \url{https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html} or ask me (RL) for more details. Similar analyses were conducted with a random forest classifier, but the results (see section \ref{sec:random_forest}) were practically identical.
Algorithms \ref{alg:true_eval}--\ref{alg:causal_model} are model agnostic, i.e.\ they do not depend on a specific predictive model; the model only has to output probabilities for the outcome given some input. Lakkaraju et al.\ state in their paper: ``We train logistic regression on this training set. We also experimented with other predictive models and observed similar behaviour.''
NB: sklearn's regression models cannot be fitted if the data includes missing values. Therefore, list-wise deletion is performed in cases of missing data (the whole record is discarded).
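As a minimal sketch (assuming the hypothetical \texttt{train} and \texttt{test} data frames from the data generation sketches above), the fitting step might look as follows:
\begin{verbatim}
from sklearn.linear_model import LogisticRegression

# List-wise deletion: drop training rows whose outcome was censored.
train_obs = train.dropna(subset=['Y_labeled'])
f = LogisticRegression(solver='lbfgs')   # intercept estimated by default
f.fit(train_obs[['X']], train_obs['Y_labeled'].astype(int))
# Probabilities for class Y=0 on the test set; predict_proba orders
# its columns by f.classes_, so column 0 corresponds to Y=0.
s = f.predict_proba(test[['X']])[:, 0]
\end{verbatim}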
\section{Plotting}
The following quantities are computed from the data:
\begin{itemize}
\item True evaluation: The true failure rate of the model. Can only be calculated for synthetic data sets. See algorithm \ref{alg:true_eval} and discussion in section \ref{sec:comments}.
\item Labeled outcomes: The ``traditional''/vanilla estimate of model performance. See algorithm \ref{alg:labeled_outcomes}.
\item Human evaluation: The failure rate of human decision-makers who have access to the latent variable Z. Decision-makers with similar values of leniency are binned and treated as one hypothetical decision-maker. See algorithm \ref{alg:human_eval}.
\item Contraction: See algorithm \ref{alg:contraction} from \cite{lakkaraju17}.
\item Causal model: In essence, the empirical performance is calculated over the test set as $$\dfrac{1}{n}\sum_{(x, y)\in \D}f(x)\delta(F(x) < r),$$ where $$f(x) = P(Y=0|T=1, X=x)$$ is a logistic regression model (see section \ref{sec:model_fitting}; a random forest is used in section \ref{sec:random_forest}) trained on the labeled data to predict $Y$ from $X$, and $$ F(x_0) = \int_{x\in\mathcal{X}} P(x)\delta(f(x) < f(x_0)) ~ dx.$$ All observations, even those with missing outcome labels, can be used, since the empirical performance does not depend on the outcome labels. $P(x)$ is the Gaussian pdf from the scipy.stats package, and the integral is evaluated over the interval $[-15, 15]$ in 40000 steps with the si.simps function from scipy.integrate, which uses Simpson's rule (docs: \url{https://docs.scipy.org/doc/scipy/reference/generated/scipy.integrate.simps.html}). See algorithm \ref{alg:causal_model} and the sketch after this list. \label{causal_cdf}
\end{itemize}
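Below is a sketch of the $F(x_0)$ computation described in the causal model item, assuming a fitted scikit-learn model \texttt{f} as in section \ref{sec:model_fitting}; the function name is illustrative.
\begin{verbatim}
import numpy as np
import scipy.stats as stats
import scipy.integrate as si

def F_of_x0(f, x0_values, lo=-15, hi=15, steps=40000):
    """F(x0) = integral of P(x) * delta(f(x) < f(x0)) dx, Simpson's rule."""
    x = np.linspace(lo, hi, steps)
    p_x = stats.norm.pdf(x)                          # P(x), standard Gaussian
    f_x = f.predict_proba(x.reshape(-1, 1))[:, 0]    # f(x) = P(Y=0|X=x)
    f_0 = f.predict_proba(np.reshape(x0_values, (-1, 1)))[:, 0]
    return np.array([si.simps(p_x * (f_x < f0), x) for f0 in f_0])

# Empirical performance at leniency r (algorithm 8):
# F = F_of_x0(f, test['X'].values); ep = np.mean(s * (F < r))
\end{verbatim}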
The plotted curves are constructed using pseudo code presented in algorithm \ref{alg:perf_comp}.
\begin{algorithm}[] % enter the algorithm environment
\caption{Performance comparison} % give the algorithm a caption
\label{alg:perf_comp} % and a label for \ref{} commands later in the document
\begin{algorithmic}[1] % enter the algorithmic environment
\REQUIRE Number of iterations $N_{iter}$
\ENSURE
\FORALL{$r$ in $0.1, 0.2, ..., 0.9$}
\FOR{i = 1 \TO $N_{iter}$}
\STATE Create data using either Algorithm \ref{alg:data_without_Z} or \ref{alg:data_with_Z}.
\STATE Train a logistic regression model using observations in the training set with available outcome labels and assign to $f$.
\STATE Using $f$, estimate probabilities $\s$ for Y=0 in both test sets (labeled and full) for all observations and attach them to the respective data sets.
\STATE Compute failure rate of true evaluation with leniency $r$ and full test data using algorithm \ref{alg:true_eval}.
\STATE Compute failure rate of labeled outcomes approach with leniency $r$ and labeled test data using algorithm \ref{alg:labeled_outcomes}.
\STATE Compute failure rate of human judges with leniency $r$ and labeled test data using algorithm \ref{alg:human_eval}.
\STATE Compute the failure rate of the contraction algorithm with leniency $r$ and labeled test data using algorithm \ref{alg:contraction}.
\STATE Compute the empirical performance of the causal model with leniency $r$, predictive model $f$ and labeled test data using algorithm \ref{alg:causal_model}.
\ENDFOR
\STATE Calculate the mean of the $N_{iter}$ failure rates for each algorithm at the current leniency $r$.
\STATE Calculate the standard error of that mean for each algorithm at the current leniency $r$.
\ENDFOR
\STATE Plot the mean failure rates against the levels of leniency $r$.
\STATE Calculate the mean absolute error of each algorithm w.r.t.\ the true evaluation.
\end{algorithmic}
\end{algorithm}
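As a sketch, one way to wire the pieces together along the lines of algorithm \ref{alg:perf_comp}, using the hypothetical helper functions sketched in this document (human evaluation and the causal model are omitted for brevity):
\begin{verbatim}
import numpy as np
from sklearn.linear_model import LogisticRegression

N_iter = 3
lenience = np.round(np.arange(0.1, 1.0, 0.1), 1)
fr = {name: np.zeros((len(lenience), N_iter))
      for name in ('true', 'labeled', 'contraction')}
for j, r in enumerate(lenience):
    for i in range(N_iter):
        train, test = generate_data_without_Z(seed=100 * j + i)
        train_obs = train.dropna(subset=['Y_labeled'])
        f = LogisticRegression(solver='lbfgs')
        f.fit(train_obs[['X']], train_obs['Y_labeled'].astype(int))
        test = test.assign(s=f.predict_proba(test[['X']])[:, 0])
        fr['true'][j, i] = true_evaluation(test['Y'], test['s'], r)
        fr['labeled'][j, i] = labeled_outcomes(test['Y_labeled'],
                                               test['s'], r)
        fr['contraction'][j, i] = contraction(test, r)
# Mean failure rate and its standard error per leniency and method,
# plus mean absolute error w.r.t. the true evaluation.
mean = {k: v.mean(axis=1) for k, v in fr.items()}
sem = {k: v.std(axis=1, ddof=1) / np.sqrt(N_iter) for k, v in fr.items()}
mae = {k: np.mean(np.abs(mean[k] - mean['true'])) for k in mean}
\end{verbatim}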
\begin{algorithm}[] % enter the algorithm environment
\caption{True evaluation} % give the algorithm a caption
\label{alg:true_eval} % and a label for \ref{} commands later in the document
\begin{algorithmic}[1] % enter the algorithmic environment
\REQUIRE Full test data $\D$ with probabilities $\s$ and \emph{all outcome labels}, acceptance rate r
\STATE Sort the data by the probabilities $\s$ in ascending order.
\STATE \hskip3.0em $\rhd$ Now the most dangerous subjects are last.
\STATE Calculate the number to release $N_{free} = |\D| \cdot r$.
\RETURN $\frac{1}{|\D|}\sum_{i=1}^{N_{free}}\delta\{y_i=0\}$
\end{algorithmic}
\end{algorithm}
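Under the same assumptions, a minimal sketch of the true evaluation (\texttt{y} holds all outcome labels, \texttt{s} the predicted probabilities of $Y=0$):
\begin{verbatim}
import numpy as np

def true_evaluation(y, s, r):
    order = np.argsort(s)                  # ascending: most dangerous last
    n_free = int(len(y) * r)               # number of subjects to release
    released = np.asarray(y)[order][:n_free]
    return np.sum(released == 0) / len(y)  # failures / all decisions
\end{verbatim}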
\begin{algorithm}[] % enter the algorithm environment
\caption{Labeled outcomes} % give the algorithm a caption
\label{alg:labeled_outcomes} % and a label for \ref{} commands later in the document
\begin{algorithmic}[1] % enter the algorithmic environment
\REQUIRE Labeled test data $\D$ with probabilities $\s$ and \emph{missing outcome labels} for observations with $T=0$, acceptance rate r
\STATE Assign observations with observed outcomes to $\D_{observed}$.
\STATE Sort $\D_{observed}$ by the probabilities $\s$ in ascending order.
\STATE \hskip3.0em $\rhd$ Now the most dangerous subjects are last.
\STATE Calculate the number to release $N_{free} = |\D_{observed}| \cdot r$.
\RETURN $\frac{1}{|\D|}\sum_{i=1}^{N_{free}}\delta\{y_i=0\}$
\end{algorithmic}
\end{algorithm}
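A corresponding sketch of the labeled outcomes estimate, following algorithm \ref{alg:labeled_outcomes} as written (note the $1/|\D|$ normalisation):
\begin{verbatim}
def labeled_outcomes(y_labeled, s, r):
    y = np.asarray(y_labeled, dtype=float)
    obs = ~np.isnan(y)                      # D_observed: outcome recorded
    order = np.argsort(np.asarray(s)[obs])  # sort observed cases by risk
    n_free = int(obs.sum() * r)             # N_free = |D_observed| * r
    released = y[obs][order][:n_free]
    return np.sum(released == 0) / len(y)   # 1/|D| as in algorithm 5
\end{verbatim}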
\begin{algorithm}[] % enter the algorithm environment
\caption{Human evaluation} % give the algorithm a caption
\label{alg:human_eval} % and a label for \ref{} commands later in the document
\begin{algorithmic}[1] % enter the algorithmic environment
\REQUIRE Labeled test data $\D$ with probabilities $\s$ and \emph{missing outcome labels} for observations with $T=0$, acceptance rate r
\ENSURE
\STATE Assign judges with leniency in $[r-0.05, r+0.05]$ to $\mathcal{J}$
\STATE $\D_{released} = \{(x, j, t, y) \in \D~|~t=1 \wedge j \in \mathcal{J}\}$
\STATE \hskip3.0em $\rhd$ Subjects judged \emph{and} released by judges with correct leniency.
\RETURN $\frac{1}{|\mathcal{J}|}\sum_{i=1}^{|\D_{released}|}\delta\{y_i=0\}$
\end{algorithmic}
\end{algorithm}
\begin{algorithm}[] % enter the algorithm environment
\caption{Contraction algorithm \cite{lakkaraju17}} % give the algorithm a caption
\label{alg:contraction} % and a label for \ref{} commands later in the document
\begin{algorithmic}[1] % enter the algorithmic environment
\REQUIRE Labeled test data $\D$ with probabilities $\s$ and \emph{missing outcome labels} for observations with $T=0$, acceptance rate r
\STATE Let $q$ be the decision-maker with highest acceptance rate in $\D$.
\STATE $\D_q = \{(x, j, t, y) \in \D|j=q\}$
\STATE \hskip3.0em $\rhd$ $\D_q$ is the set of all observations judged by $q$
\STATE $\mathcal{R}_q = \{(x, j, t, y) \in \D_q|t=1\}$
\STATE \hskip3.0em $\rhd$ $\mathcal{R}_q$ is the set of observations in $\D_q$ with observed outcome labels
\STATE Sort observations in $\mathcal{R}_q$ in descending order of confidence scores $\s$ and assign to $\mathcal{R}_q^{sort}$.
\STATE \hskip3.0em $\rhd$ Observations deemed as high risk by the black-box model $\mathcal{B}$ are at the top of this list
\STATE
\STATE Remove the top $[(1.0-r)|\D_q |]-[|\D_q |-|\mathcal{R}_q |]$ observations of $\mathcal{R}_q^{sort}$ and call this list $\mathcal{R_B}$
\STATE \hskip3.0em $\rhd$ $\mathcal{R_B}$ is the list of observations assigned to $t = 1$ by $\mathcal{B}$
\STATE
\STATE Compute $\mathbf{u}=\sum_{i=1}^{|\mathcal{R_B}|} \dfrac{\delta\{y_i=0\}}{| \D_q |}$.
\RETURN $\mathbf{u}$
\end{algorithmic}
\end{algorithm}
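A sketch of contraction over a pandas DataFrame with the hypothetical columns used in the earlier sketches plus the score column \texttt{s}; $r$ is assumed to be at most $q$'s acceptance rate so that the number of removed observations is non-negative.
\begin{verbatim}
def contraction(df, r):
    q = df.loc[df['r'].idxmax(), 'judge']  # most lenient decision-maker q
    D_q = df[df['judge'] == q]             # everyone judged by q
    R_q = D_q[D_q['T'] == 1]               # released by q: labels observed
    R_sort = R_q.sort_values('s', ascending=False)  # highest risk first
    # B detains the top (1-r)|D_q| cases in total; the first
    # |D_q| - |R_q| detentions were already made by q.
    n_remove = int((1.0 - r) * len(D_q)) - (len(D_q) - len(R_q))
    R_B = R_sort.iloc[n_remove:]           # cases B would release
    return np.sum(R_B['Y_labeled'] == 0) / len(D_q)
\end{verbatim}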
\begin{algorithm}[] % enter the algorithm environment
\caption{Causal model, empirical performance (ep, see also section \ref{causal_cdf})} % give the algorithm a caption
\label{alg:causal_model} % and a label for \ref{} commands later in the document
\begin{algorithmic}[1] % enter the algorithmic environment
\REQUIRE Labeled test data $\D$ with probabilities $\s$ and \emph{missing outcome labels} for observations with $T=0$, predictive model $f$, pdf $P_X(x)$ for features $x$, acceptance rate r
\STATE For all $x_0 \in \D$ evaluate $F(x_0) = \int_{x\in\mathcal{X}} P_X(x)\delta(f(x)<f(x_0)) ~dx$ and assign to $\mathcal{F}_{predictions}$
\STATE Create boolean array $T_{causal} = \mathcal{F}_{predictions} < r$.
\RETURN $\frac{1}{|\D|}\sum_{i=1}^{\D} \s \cdot T_{causal}$ which is equal to $\frac{1}{|\D|}\sum_{x\in\D} f(x)\delta(F(x) < r)$
\end{algorithmic}
\end{algorithm}

\section{Results}

Results obtained by running algorithm \ref{alg:perf_comp} with $N_{iter}$ set to 3 are presented in table \ref{tab:results} and figure \ref{fig:results}. All parameters are at their default values and a logistic regression model is trained.
\begin{table}
\caption{Mean absolute error (MAE) w.r.t.\ the true evaluation}
\begin{center}
\begin{tabular}{l | c c}
Method & MAE without $Z$ & MAE with $Z$ \\ \hline
Labeled outcomes & 0.1076 & 0.0817 \\
Human evaluation & 0.0044 & 0.0043 \\
Contraction & 0.0110 & 0.0054 \\
Causal model, ep & 0.0011 & 0.0415 \\
\end{tabular}
\end{center}
\label{tab:results}
\end{table}%
\begin{figure}
\centering
\begin{subfigure}[b]{0.5\textwidth}
\includegraphics[width=\textwidth]{sl_without_Z_3iter}
\caption{Results without unobservables}
\label{fig:results_without_Z}
\end{subfigure}
~ %add desired spacing between images, e. g. ~, \quad, \qquad, \hfill etc.
%(or a blank line to force the subfigure onto a new line)
\begin{subfigure}[b]{0.5\textwidth}
\includegraphics[width=\textwidth]{sl_with_Z_3iter_betaZ_1_0}
\caption{Results with unobservables, $\beta_Z=1$.}
\label{fig:results_with_Z}
\end{subfigure}
\caption{Failure rate vs. acceptance rate with varying levels of leniency. Logistic regression was trained on labeled training data. $N_{iter}$ was set to 3.}\label{fig:results}
\end{figure}
\subsection{$\beta_Z=0$ and data generated with unobservables.}
If we set $\beta_Z=0$, all failure rates except that of the human evaluation drop to almost zero in the leniency interval $0.1, \ldots, 0.3$. Results are presented in Figures \ref{fig:betaZ_1_5} and \ref{fig:betaZ_0}.
The differences between figures \ref{fig:results_without_Z} and \ref{fig:betaZ_0} could be explained by a slight difference in the data generating process, namely the effect of $W$ or $\epsilon$. The effect of adding $\epsilon$ (noise in the decisions) is explored further in section \ref{sec:epsilon}.
\begin{figure}
\centering
\begin{subfigure}[b]{0.5\textwidth}
\includegraphics[width=\textwidth]{sl_with_Z_4iter_betaZ_1_5}
\caption{With unobservables, $\beta_Z$ set to 1.5 in algorithm \ref{alg:data_with_Z}.}
\label{fig:betaZ_1_5}
\end{subfigure}
~ %add desired spacing between images, e. g. ~, \quad, \qquad, \hfill etc.
%(or a blank line to force the subfigure onto a new line)
\begin{subfigure}[b]{0.5\textwidth}
\includegraphics[width=\textwidth]{sl_with_Z_4iter_beta0}
\caption{With unobservables, $\beta_Z$ set to 0 in algorithm \ref{alg:data_with_Z}.}
\label{fig:betaZ_0}
\end{subfigure}
\caption{Effect of $\beta_Z$. Failure rate vs. acceptance rate with unobservables in the data (see algorithm \ref{alg:data_with_Z}). Logistic regression was trained on labeled training data. Results from algorithm \ref{alg:perf_comp} with $N_{iter}=4$.}
\label{fig:betaZ_comp}
\end{figure}
\subsection{Noise added to the decision and data generated without unobservables} \label{sec:epsilon}
In this part, Gaussian noise with zero mean and variance $0.1$ was added to the probabilities $P(Y=0|X=x)$ after sampling $Y$ but before sorting the observations in line 5 of algorithm \ref{alg:data_without_Z}. Results are presented in Figure \ref{fig:sigma_figure}.
\begin{figure}
\centering
\includegraphics[width=0.75\textwidth]{sl_without_Z_3iter_sigma_sqrt_01}
\caption{Failure rate with varying levels of leniency, without unobservables and with noise added to the decisions. Logistic regression was trained on labeled training data with $N_{iter}$ set to 3.}
\label{fig:sigma_figure}
\end{figure}
\subsection{Predictions with random forest classifier} \label{sec:random_forest}
In this section, the predictive model was switched to a random forest classifier to examine the effect of changing the predictive model. The results, presented in figure \ref{fig:random_forest}, are practically identical to those presented in figure \ref{fig:results}.
\begin{figure}
\centering
\begin{subfigure}[b]{0.5\textwidth}
\includegraphics[width=\textwidth]{sl_withoutZ_4iter_randomforest}
\caption{Results without unobservables, \\$N_{iter}=4$.}
\label{fig:results_without_Z_rf}
\end{subfigure}
~ %add desired spacing between images, e. g. ~, \quad, \qquad, \hfill etc.
%(or a blank line to force the subfigure onto a new line)
\begin{subfigure}[b]{0.5\textwidth}
\includegraphics[width=\textwidth]{sl_withZ_6iter_betaZ_1_0_randomforest}
\caption{Results with unobservables, $\beta_Z=1$ and \\$N_{iter}=6$.}
\label{fig:results_with_Z_rf}
\end{subfigure}
\caption{Failure rate vs. acceptance rate with varying levels of leniency. Random forest classifier was trained on labeled training data.}
\label{fig:random_forest}
\end{figure}
\subsection{Sanity check for predictions}
Predictions were checked by drawing a graph of predicted $Y$ versus $X$; the results are presented in figure \ref{fig:sanity_check}. The figure indicates that the predicted class labels and their probabilities are consistent with the ground truth.
\begin{figure}
\centering
\includegraphics[width=0.75\textwidth]{sanity_check}
\caption{Predicted class label and probability of $Y=1$ versus $X$. Predictions were made with a logistic regression model. Point colors denote the ground truth (yellow = 1, purple = 0). The data set was created with unobservables.}
\label{fig:sanity_check}
\end{figure}
\begin{thebibliography}{9}
\bibitem{dearteaga18}
De-Arteaga, Maria, et al. \emph{Learning Under Selective Labels in the Presence of Expert Consistency}. 2018.
\bibitem{lakkaraju17}
Lakkaraju, Himabindu, et al. \emph{The Selective Labels Problem: Evaluating Algorithmic Predictions in the Presence of Unobservables}. 2017.
\end{thebibliography}
\end{document}