\usepackage{geometry} % See geometry.pdf to learn the layout options. There are lots.
%\geometry{a4paper} % ... or letterpaper or a5paper or ...
%\geometry{landscape} % Activate for rotated page geometry
\usepackage[parfill]{parskip} % Activate to begin paragraphs with an empty line rather than an indent
\usepackage{graphicx}
\usepackage{amssymb}
\usepackage{epstopdf}
%\DeclareGraphicsRule{.tif}{png}{.png}{`convert #1 `dirname #1`/`basename #1 .tif`.png}
\usepackage{algorithm}% http://ctan.org/pkg/algorithms
\usepackage{algorithmic}% http://ctan.org/pkg/algorithms
\renewcommand{\algorithmicrequire}{\textbf{Input:}}
\renewcommand{\algorithmicensure}{\textbf{Procedure:}}
\renewcommand{\algorithmicreturn}{\textbf{Return}}
\renewcommand{\descriptionlabel}[1]{\hspace{\labelsep}\textnormal{#1}}
\makeatletter
%Table of Contents
\setcounter{tocdepth}{3}
% Add bold to \section titles in ToC and remove . after numbers
\renewcommand{\tocsection}[3]{%
\indentlabel{\@ifnotempty{#2}{\bfseries\ignorespaces#1 #2\quad}}\bfseries#3}
% Remove . after numbers in \subsection
\renewcommand{\tocsubsection}[3]{%
\indentlabel{\@ifnotempty{#2}{\ignorespaces#1 #2\quad}}#3}
%\let\tocsubsubsection\tocsubsection% Update for \subsubsection
%...
\newcommand\@dotsep{4.5}
\def\@tocline#1#2#3#4#5#6#7{\relax
\ifnum #1>\c@tocdepth % then omit
\else
\par \addpenalty\@secpenalty\addvspace{#2}%
\begingroup \hyphenpenalty\@M
\@ifempty{#4}{%
\@tempdima\csname r@tocindent\number#1\endcsname\relax
}{%
\@tempdima#4\relax
}%
\parindent\z@ \leftskip#3\relax \advance\leftskip\@tempdima\relax
\rightskip\@pnumwidth plus1em \parfillskip-\@pnumwidth
#5\leavevmode\hskip-\@tempdima{#6}\nobreak
\leaders\hbox{$\m@th\mkern \@dotsep mu\hbox{.}\mkern \@dotsep mu$}\hfill
\nobreak
\hbox to\@pnumwidth{\@tocpagenum{\ifnum#1=1\bfseries\fi#7}}\par% <-- \bfseries for \section page
\nobreak
\endgroup
\fi}
\AtBeginDocument{%
\expandafter\renewcommand\csname r@tocindent0\endcsname{0pt}
}
\def\l@subsection{\@tocline{2}{0pt}{2.5pc}{5pc}{}}
\makeatother
\usepackage{subcaption}
\graphicspath{ {../figures/} }
\begin{document}

\begin{abstract}
This document presents RL's implementations at the pseudocode level. First, I present the nomenclature used in these notes. Then I proceed to give my personal views and comments on the motivation behind the selective labels paper. In the following sections, I present the data generating algorithms and the algorithms for obtaining failure rates with different methods. Finally, I present some results that I was asked to present in the meeting on Friday the $7^{th}$.
\end{abstract}
\section*{Terms and abbreviations}
\begin{description}
\item[R :] acceptance rate, leniency of decision maker, $r \in [0, 1]$
\item[X :] personal features, observable to a predictive model
\item[Z :] some features of a subject, unobservable to a predictive model, latent variable
\item[W :] noise added to result variable Y
\item[T :] decision variable, bail/positive decision equal to 1, jail/negative decision equal to 0
\item[Y :] result variable, no crime/positive result equal to 1, crime/negative result equal to 0
\item[MAE :] Mean absolute error
\item[SL :] Selective labels, for more information see Lakkaraju's paper \cite{lakkaraju17}
\item[Labeled data :] data that has been censored, i.e., if a negative decision is given ($T=0$), then Y is set to NA.
\item[Full data :] data that has all labels available, i.e., \emph{even if} a negative decision is given ($T=0$), Y is still available.
\item[Unobservables :] unmeasured confounders, latent variables, Z
\end{description}

Mnemonic rule for the binary coding: zero is bad (crime or jail), one is good!
\section{RL's notes about the selective labels paper (optional reading)} \label{sec:comments}
\emph{This chapter is to present my comments and insight regarding the topic.}
The motivating idea behind the SL paper of Lakkaraju et al. \cite{lakkaraju17} is to evaluate whether machines could improve on human performance. In the general case, comparing the performance of human and machine evaluations is simple. In the domains addressed by Lakkaraju et al., simple comparisons would be unethical, and therefore algorithms are required. (Other approaches, such as the data augmentation algorithm proposed by De-Arteaga \cite{dearteaga18}, also exist.)
The general idea of the SL paper is to train a predictive model with selectively labeled data. The question is then: how would this predictive model perform if it were to make independent bail-or-jail decisions? That quantity cannot be calculated from real-life data sets due to ethical reasons and hidden labels. We can, however, use selectively labeled data to estimate its performance. But because the available data is biased, the performance estimates are ``overly optimistic'' if they are calculated in the conventional way (``labeled outcomes only''). This is why the authors propose the contraction algorithm.
One of the concepts to keep in mind when reading the Lakkaraju paper is the difference between the global goal of prediction and the goal in this specific setting. The global goal is to have a low failure rate with a high acceptance rate, but at the moment we are not interested in it. The goal in this setting is to estimate the true failure rate of the model from unseen, biased data. That is, given only selectively labeled data and an arbitrary black-box model $\mathcal{B}$, we are interested in predicting the performance of model $\mathcal{B}$ on the whole data set with all ground truth labels.
On the formalisation of R: we discussed how Lakkaraju's paper treats the variable R in a seemingly nonsensical way; it is as if a judge would have to let someone go today in order to detain some other defendant tomorrow, to keep their acceptance rate at some $r$. A more intuitive way of thinking about $r$ would be the ``threshold perspective'': if a judge sees that a defendant has probability $p_x$ of committing a crime if released, the judge detains the defendant when $p_x > r$, i.e., when the defendant is too dangerous to release. The problem in this case is that we cannot observe this innate $r$; we can only observe the decisions given by the judges. By forcing the ``acceptance threshold'' to be an ``acceptance rate'', Lakkaraju avoids computing $r$ twice, and the effect of changing $r$ can then be computed from the data directly.
\section{Data generation}

Both of the data generating algorithms are presented in this section.
\subsection{Without unobservables (see also algorithm \ref{alg:data_without_Z})}
In the setting without unobservables Z, we first sample an acceptance rate $r$ for each of the $M=100$ judges uniformly from the half-open interval $[0.1; 0.9)$. Then we randomly assign 500 unique subjects to each of the judges (50000 in total) and simulate their features X as i.i.d. standard Gaussian random variables with zero mean and unit variance. The probability of a negative outcome is then calculated as $$P(Y=0|X=x) = \dfrac{1}{1+\exp(-x)}=\sigma(x).$$ Because $P(Y=1|X=x) = 1-P(Y=0|X=x) = 1-\sigma(x)$, the outcome variable Y can be sampled from a Bernoulli distribution with parameter $1-\sigma(x)$. The data is then sorted for each judge by the probabilities $P(Y=0|X=x)$ in descending order. If a subject is in the top $(1-r) \cdot 100 \%$ of observations assigned to a judge, the decision variable T is set to zero, and otherwise to one.
\begin{algorithm}[] % enter the algorithm environment
\caption{Create data without unobservables} % give the algorithm a caption
\label{alg:data_without_Z} % and a label for \ref{} commands later in the document
\begin{algorithmic}[1] % enter the algorithmic environment
\REQUIRE Number of judges $M=100$ and number of subjects distributed to each of them $N=500$ s.t. $N_{total} = N \cdot M$
\ENSURE
\STATE Sample an acceptance rate for each of the $M$ judges from $U(0.1; 0.9)$ and round to the nearest tenth.
\STATE Sample features X for each of the $N_{total}$ observations from the standard Gaussian.
\STATE Calculate $P(Y=0|X=x)=\sigma(x)$ for each observation
\STATE Sample Y from Bernoulli distribution with parameter $1-\sigma(x)$.
\STATE Sort the data by (1) the judges and (2) by probabilities $P(Y=0|X=x)$ in descending order.
\STATE \hskip3.0em $\rhd$ Now the most dangerous subjects for each of the judges are at the top.
\STATE If subject belongs to the top $(1-r) \cdot 100 \%$ of observations assigned to a judge, set $T=0$ else set $T=1$.
\STATE Halve the data to training and test sets at random.
\STATE For both halves, set $Y=$ NA if decision is negative ($T=0$).
\RETURN labeled training data, full training data, labeled test data, full test data
\end{algorithmic}
\end{algorithm}
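For concreteness, the following is a minimal Python sketch of algorithm \ref{alg:data_without_Z}. It is not the exact code used for the experiments; the function and column names are mine, and numpy and pandas are assumed.

\begin{verbatim}
import numpy as np
import pandas as pd

def generate_data_without_Z(M=100, N=500, seed=0):
    rng = np.random.RandomState(seed)
    # Leniency per judge from U(0.1, 0.9), rounded to the nearest tenth.
    r = np.round(rng.uniform(0.1, 0.9, size=M), 1)
    df = pd.DataFrame({'judge': np.repeat(np.arange(M), N),
                       'r': np.repeat(r, N),
                       'X': rng.normal(size=M * N)})
    df['prob_Y0'] = 1 / (1 + np.exp(-df['X']))    # P(Y=0|X) = sigma(X)
    df['Y'] = rng.binomial(1, 1 - df['prob_Y0'])  # Y ~ Bernoulli(1 - sigma(X))
    # Within each judge, detain (T=0) the top (1-r) fraction by risk.
    df = df.sort_values(['judge', 'prob_Y0'], ascending=[True, False])
    rank = df.groupby('judge').cumcount()         # 0 = most dangerous
    df['T'] = np.where(rank < (1 - df['r']) * N, 0, 1)
    # Random 50/50 split; the labeled versions censor Y where T=0.
    train = df.sample(frac=0.5, random_state=seed)
    test = df.drop(train.index)
    labeled_train, labeled_test = train.copy(), test.copy()
    for d in (labeled_train, labeled_test):
        d.loc[d['T'] == 0, 'Y'] = np.nan
    return labeled_train, train, labeled_test, test
\end{verbatim}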
\subsection{With unobservables (see also algorithm \ref{alg:data_with_Z})}
In the setting with unobservables Z, we first sample an acceptance rate $r$ for each of the $M=100$ judges uniformly from the half-open interval $[0.1; 0.9)$. Then we randomly assign 500 unique subjects to each of the judges (50000 in total) and simulate their features X, Z and W as i.i.d. standard Gaussian random variables with zero mean and unit variance. The probability of a negative outcome is then calculated as $$P(Y=0|X=x, Z=z, W=w)=\sigma(\beta_Xx+\beta_Zz+\beta_Ww),$$ where $\beta_X=\beta_Z =1$ and $\beta_W=0.2$. Next, the value of the outcome Y is set to 0 if $P(Y = 0| X, Z, W) \geq 0.5$ and to 1 otherwise. The conditional probability of a negative decision (T=0) is defined as $$P(T=0|X=x, Z=z)=\sigma(\beta_Xx+\beta_Zz)+\epsilon,$$ where $\epsilon \sim N(0, 0.1)$. Next, the data is sorted for each judge by the probabilities $P(T=0|X, Z)$ in descending order. If a subject is in the top $(1-r) \cdot 100 \%$ of observations assigned to a judge, the decision variable T is set to zero, and otherwise to one.
\begin{algorithm}[] % enter the algorithm environment
\caption{Create data with unobservables} % give the algorithm a caption
\label{alg:data_with_Z} % and a label for \ref{} commands later in the document
\begin{algorithmic}[1] % enter the algorithmic environment
\REQUIRE Number of judges $M=100$, number of subjects distributed to each of them $N=500$ s.t. $N_{total} = N \cdot M$, and $\beta_X=1$, $\beta_Z=1$, $\beta_W=0.2$.
\ENSURE
\STATE Sample an acceptance rate for each of the $M$ judges from $U(0.1; 0.9)$ and round to the nearest tenth.
\STATE Sample features X, Z and W for each of the $N_{total}$ observations independently from the standard Gaussian.
\STATE Calculate $P(Y=0|X, Z, W)$ for each observation.
\STATE Set Y to 0 if $P(Y = 0| X, Z, W) \geq 0.5$ and to 1 otherwise.
\STATE Calculate $P(T=0|X, Z)$ for each observation and attach to data.
\STATE Sort the data by (1) the judges and (2) by the probabilities $P(T=0|X, Z)$ in descending order.
\STATE \hskip3.0em $\rhd$ Now the most dangerous subjects for each of the judges are at the top.
\STATE If subject belongs to the top $(1-r) \cdot 100 \%$ of observations assigned to that judge, set $T=0$ else set $T=1$.
\STATE Halve the data to training and test sets at random.
\STATE For both halves, set $Y=$ NA if decision is negative ($T=0$).
\RETURN labeled training data, full training data, labeled test data, full test data
\end{algorithmic}
\end{algorithm}
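Analogously, here is a sketch of algorithm \ref{alg:data_with_Z}. Again the names are mine, and I take the standard deviation of $\epsilon$ to be $\sqrt{0.1}$ (i.e., variance 0.1), since the notation $N(0, 0.1)$ is ambiguous.

\begin{verbatim}
import numpy as np
import pandas as pd

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def generate_data_with_Z(M=100, N=500, b_X=1.0, b_Z=1.0, b_W=0.2, seed=0):
    rng = np.random.RandomState(seed)
    r = np.round(rng.uniform(0.1, 0.9, size=M), 1)
    df = pd.DataFrame({'judge': np.repeat(np.arange(M), N),
                       'r': np.repeat(r, N)})
    for col in ('X', 'Z', 'W'):                   # i.i.d. N(0, 1) features
        df[col] = rng.normal(size=M * N)
    # Deterministic outcome: Y = 0 iff P(Y=0|X,Z,W) >= 0.5.
    p_y0 = sigmoid(b_X * df['X'] + b_Z * df['Z'] + b_W * df['W'])
    df['Y'] = (p_y0 < 0.5).astype(int)
    # Decisions see X and Z only, plus Gaussian noise epsilon.
    eps = rng.normal(0, np.sqrt(0.1), size=M * N)
    df['prob_T0'] = sigmoid(b_X * df['X'] + b_Z * df['Z']) + eps
    df = df.sort_values(['judge', 'prob_T0'], ascending=[True, False])
    rank = df.groupby('judge').cumcount()
    df['T'] = np.where(rank < (1 - df['r']) * N, 0, 1)
    return df   # train/test split and censoring as in the previous sketch
\end{verbatim}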
\section{Model fitting} \label{sec:model_fitting}
The fitted models are logistic regression models from the scikit-learn package. The solver is set to lbfgs (as there is no closed-form solution) and the intercept is estimated by default. The resulting LogisticRegression model object provides convenient functions for fitting the model and obtaining probabilities for class labels. Please see the documentation at \url{https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html} or ask me (RL) for more details. Similar analyses were conducted using a random forest classifier, but the results (see section \ref{sec:random_forest}) were practically identical.
All of the evaluation algorithms (algorithms \ref{alg:true_eval}--\ref{alg:causal_model}), including contraction, are model agnostic, i.e., they do not depend on a specific predictive model; the model only has to output probabilities for a given input. Lakkaraju et al. write: ``We train logistic regression on this training set. We also experimented with other predictive models and observed similar behaviour.''
NB: scikit-learn's logistic regression model cannot be fitted if the data include missing values. Therefore, list-wise deletion is performed in cases of missing data (the whole record is discarded).
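A minimal sketch of the fitting step, assuming the data frames produced by the data generation sketches above (the variable names are mine):

\begin{verbatim}
from sklearn.linear_model import LogisticRegression

# List-wise deletion: keep only rows whose outcome label is observed.
observed = labeled_train.dropna(subset=['Y'])

model = LogisticRegression(solver='lbfgs')  # intercept fitted by default
model.fit(observed[['X']], observed['Y'])

# P(Y=0|X=x) for the test set; predict_proba columns follow
# model.classes_, which is [0, 1] here.
S = model.predict_proba(test[['X']])[:, 0]
\end{verbatim}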
\section{Plotting}
The following quantities are computed from the data:
\begin{itemize}
\item True evaluation: The true failure rate of the model. Can only be calculated for synthetic data sets. See algorithm \ref{alg:true_eval} and discussion in section \ref{sec:comments}.
\item Labeled outcomes: The "traditional"/vanilla estimate of model performance. See algorithm \ref{alg:labeled_outcomes}.
\item Human evaluation: The failure rate of human decision-makers who have access to the latent variable Z. Decision-makers with similar values of leniency are binned and treated as one hypothetical decision-maker. See algorithm \ref{alg:human_eval}.
\item Contraction: See algorithm \ref{alg:contraction} from \cite{lakkaraju17}.
\item Causal model: In essence, the empirical performance is calculated over the test set as $$\dfrac{1}{|\mathcal{D}|}\sum_{(x, y)\in \mathcal{D}}f(x)\delta(F(x) < r),$$ where $$f(x) = P(Y=0|T=1, X=x)$$ is a logistic regression model (see section \ref{sec:model_fitting}) trained on the labeled data to predict Y from X, and $$ F(x_0) = \int_{x\in\mathcal{X}} P(x)\delta(f(x) < f(x_0)) ~ dx.$$ All observations, even the ones with missing outcome labels, can be used, since the empirical performance does not depend on the outcome labels. $P(x)$ is the Gaussian pdf from the scipy.stats package, integrated over the interval $[-15, 15]$ in 40000 steps with the si.simps function from scipy.integrate, which uses Simpson's rule to estimate the value of the integral (docs: \url{https://docs.scipy.org/doc/scipy/reference/generated/scipy.integrate.simps.html}). A Python sketch of this computation is given after this list. \label{causal_cdf}
\end{itemize}
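The following sketch shows one way to compute $F(x_0)$ and the empirical performance as described above, assuming a one-dimensional X and a fitted model wrapped as a function (the function name is mine; the loop over the test set is slow but straightforward):

\begin{verbatim}
import numpy as np
import scipy.stats as st
import scipy.integrate as si

def causal_failure_rate(f, x_test, r, lo=-15, hi=15, steps=40000):
    # f(x) = P(Y=0|T=1, X=x), e.g. a wrapped logistic regression model.
    x_grid = np.linspace(lo, hi, steps)
    f_grid = f(x_grid)
    p_grid = st.norm.pdf(x_grid)            # P(x): standard Gaussian pdf
    f_test = f(np.asarray(x_test))
    # F(x0) = integral of P(x) * delta(f(x) < f(x0)) dx, Simpson's rule.
    F = np.array([si.simps(p_grid * (f_grid < f0), x_grid)
                  for f0 in f_test])
    return np.mean(f_test * (F < r))

# Example wrapper for the scikit-learn model of the previous section:
# f = lambda x: model.predict_proba(x.reshape(-1, 1))[:, 0]
\end{verbatim}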
The plotted curves are constructed using the pseudocode presented in algorithm \ref{alg:perf_comp}.
\begin{algorithm}[] % enter the algorithm environment
\caption{Performance comparison} % give the algorithm a caption
\label{alg:perf_comp} % and a label for \ref{} commands later in the document
\begin{algorithmic}[1] % enter the algorithmic environment
\REQUIRE Number of iterations $N_{iter}$
\ENSURE
\FORALL{$r$ in $0.1, 0.2, ..., 0.9$}
\FOR{i = 1 \TO $N_{iter}$}
\STATE Create data using either Algorithm \ref{alg:data_without_Z} or \ref{alg:data_with_Z}.
\STATE Train a logistic regression model using observations in the training set with available outcome labels and assign to $f$.
\STATE Using $f$, estimate probabilities $\mathcal{S}$ for Y=0 in both test sets (labeled and full) for all observations and attach them to the respective data sets.
\STATE Compute failure rate of true evaluation with leniency $r$ and full test data using algorithm \ref{alg:true_eval}.
\STATE Compute failure rate of labeled outcomes approach with leniency $r$ and labeled test data using algorithm \ref{alg:labeled_outcomes}.
\STATE Compute failure rate of human judges with leniency $r$ and labeled test data using algorithm \ref{alg:human_eval}.
\STATE Compute failure rate of contraction algorithm with leniency $r$ and labeled test data.
\STATE Compute the empirical performance of the causal model with leniency $r$, predictive model $f$ and labeled test data using algorithm \ref{alg:causal_model}.
\ENDFOR
\STATE Calculate the mean of the failure rates over the $N_{iter}$ iterations for each algorithm separately.
\STATE Calculate the standard error of the mean for each algorithm separately.
\ENDFOR
\STATE Plot the failure rates at the given levels of leniency $r$.
\STATE Calculate the mean absolute error of each algorithm with respect to true evaluation.
\end{algorithmic}
\end{algorithm}
\begin{algorithm}[] % enter the algorithm environment
\caption{True evaluation} % give the algorithm a caption
\label{alg:true_eval} % and a label for \ref{} commands later in the document
\begin{algorithmic}[1] % enter the algorithmic environment
\REQUIRE Test data $\mathcal{D}$ with probabilities $\mathcal{S}$ and \emph{all outcome labels}, acceptance rate r
\ENSURE
\STATE Sort the data by the probabilities $\mathcal{S}$ in ascending order.
\STATE \hskip3.0em $\rhd$ Now the most dangerous subjects are last.
\STATE Calculate the number to release $N_{free} = |\mathcal{D}| \cdot r$.
\RETURN $\frac{1}{|\mathcal{D}|}\sum_{i=1}^{N_{free}}\delta\{y_i=0\}$
\end{algorithmic}
\end{algorithm}
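A sketch of algorithm \ref{alg:true_eval} in Python (the names are mine). Algorithm \ref{alg:labeled_outcomes} is the same computation applied to the subset of observations with observed outcomes.

\begin{verbatim}
import numpy as np

def true_evaluation(S, y, r):
    order = np.argsort(S)         # ascending risk: most dangerous last
    n_free = int(len(S) * r)      # number of subjects to release
    released = np.asarray(y)[order][:n_free]
    return np.sum(released == 0) / len(S)
\end{verbatim}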
\begin{algorithm}[] % enter the algorithm environment
\caption{Labeled outcomes} % give the algorithm a caption
\label{alg:labeled_outcomes} % and a label for \ref{} commands later in the document
\begin{algorithmic}[1] % enter the algorithmic environment
\REQUIRE Labeled test data $\mathcal{D}$ with probabilities $\mathcal{S}$ and \emph{missing outcome labels} for observations with $T=0$, acceptance rate r
\ENSURE
\STATE Assign observations with observed outcomes to $\mathcal{D}_{observed}$.
\STATE Sort $\mathcal{D}_{observed}$ by the probabilities $\mathcal{S}$ in ascending order.
\STATE \hskip3.0em $\rhd$ Now the most dangerous subjects are last.
\STATE Calculate the number to release $N_{free} = |\mathcal{D}_{observed}| \cdot r$.
\RETURN $\frac{1}{|\mathcal{D}|}\sum_{i=1}^{N_{free}}\delta\{y_i=0\}$
\end{algorithmic}
\end{algorithm}
\begin{algorithm}[] % enter the algorithm environment
\caption{Human evaluation} % give the algorithm a caption
\label{alg:human_eval} % and a label for \ref{} commands later in the document
\begin{algorithmic}[1] % enter the algorithmic environment
\REQUIRE Labeled test data $\mathcal{D}$ with probabilities $\mathcal{S}$ and \emph{missing outcome labels} for observations with $T=0$, acceptance rate r
\ENSURE
\STATE Assign judges with leniency in $[r-0.05, r+0.05]$ to $\mathcal{J}$
\STATE $\mathcal{D}_{released} = \{(x, j, t, y) \in \mathcal{D}|t=1 \wedge j \in \mathcal{J}\}$
\STATE \hskip3.0em $\rhd$ Subjects judged \emph{and} released by judges with correct leniency.
\RETURN $\frac{1}{|\mathcal{J}|}\sum_{i=1}^{|\mathcal{D}_{released}|}\delta\{y_i=0\}$
\end{algorithmic}
\end{algorithm}
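A sketch of algorithm \ref{alg:human_eval}, assuming the data frame columns of the generation sketches. Note that it normalizes by the number of matching judges, exactly as in the pseudocode above.

\begin{verbatim}
import numpy as np

def human_evaluation(df, r):
    # Judges whose leniency is within 0.05 of the target rate r.
    mask = (df['r'] >= r - 0.05) & (df['r'] <= r + 0.05)
    judges = df.loc[mask, 'judge'].unique()
    released = df[df['judge'].isin(judges) & (df['T'] == 1)]
    return np.sum(released['Y'] == 0) / len(judges)
\end{verbatim}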
\begin{algorithm}[] % enter the algorithm environment
\caption{Contraction algorithm \cite{lakkaraju17}} % give the algorithm a caption
\label{alg:contraction} % and a label for \ref{} commands later in the document
\begin{algorithmic}[1] % enter the algorithmic environment
\REQUIRE Labeled test data $\mathcal{D}$ with probabilities $\mathcal{S}$ and \emph{missing outcome labels} for observations with $T=0$, acceptance rate r
\ENSURE
\STATE Let $q$ be the decision-maker with the highest acceptance rate in $\mathcal{D}$.
\STATE $\mathcal{D}_q = \{(x, j, t, y) \in \mathcal{D}|j=q\}$
\STATE \hskip3.0em $\rhd$ $\mathcal{D}_q$ is the set of all observations judged by $q$
\STATE
\STATE $\mathcal{R}_q = \{(x, j, t, y) \in \mathcal{D}_q|t=1\}$
\STATE \hskip3.0em $\rhd$ $\mathcal{R}_q$ is the set of observations in $\mathcal{D}_q$ with observed outcome labels
\STATE
\STATE Sort observations in $\mathcal{R}_q$ in descending order of confidence scores $\mathcal{S}$ and assign to $\mathcal{R}_q^{sort}$.
\STATE \hskip3.0em $\rhd$ Observations deemed as high risk by the black-box model $\mathcal{B}$ are at the top of this list
\STATE
\STATE Remove the top $[(1.0-r)|\mathcal{D}_q |]-[|\mathcal{D}_q |-|\mathcal{R}_q |]$ observations of $\mathcal{R}_q^{sort}$ and call this list $\mathcal{R_B}$
\STATE \hskip3.0em $\rhd$ $\mathcal{R_B}$ is the list of observations assigned to $t = 1$ by $\mathcal{B}$
\STATE
\STATE Compute $\mathbf{u}=\sum_{i=1}^{|\mathcal{R_B}|} \dfrac{\delta\{y_i=0\}}{| \mathcal{D}_q |}$.
\RETURN $\mathbf{u}$
\end{algorithmic}
\end{algorithm}
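A sketch of the contraction algorithm, assuming a data frame with columns judge, T, Y and the predicted risk S (the names are mine; the clamp to zero guards against leniencies above $q$'s acceptance rate):

\begin{verbatim}
import numpy as np

def contraction(df, r):
    # q: the decision-maker with the highest acceptance rate.
    q = df.groupby('judge')['T'].mean().idxmax()
    D_q = df[df['judge'] == q]
    R_q = D_q[D_q['T'] == 1]            # subjects released by q
    # Most dangerous (highest predicted risk) first.
    R_sort = R_q.sort_values('S', ascending=False)
    # Remove the subjects the model itself would detain at leniency r.
    n_remove = int((1.0 - r) * len(D_q)) - (len(D_q) - len(R_q))
    R_B = R_sort.iloc[max(n_remove, 0):]
    return np.sum(R_B['Y'] == 0) / len(D_q)
\end{verbatim}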
\begin{algorithm}[] % enter the algorithm environment
\caption{Causal model, empirical performance (ep)} % give the algorithm a caption
\label{alg:causal_model} % and a label for \ref{} commands later in the document
\begin{algorithmic}[1] % enter the algorithmic environment
\REQUIRE Labeled test data $\mathcal{D}$ with probabilities $\mathcal{S}$ and \emph{missing outcome labels} for observations with $T=0$, predictive model f, acceptance rate r
\STATE Create the boolean array $T_{causal} = F(x) < r$, evaluated for each $x \in \mathcal{D}$; see ``Causal model'' in section \ref{causal_cdf}.
\RETURN $\frac{1}{|\mathcal{D}|}\sum_{i=1}^{|\mathcal{D}|} \mathcal{S}_i \cdot T_{causal, i}$, which is equal to $\frac{1}{|\mathcal{D}|}\sum_{x\in\mathcal{D}} f(x)\delta(F(x) < r)$
\end{algorithmic}
\end{algorithm}
\section{Results}

Results obtained from running algorithm \ref{alg:perf_comp} with $N_{iter}$ set to 3 are presented in table \ref{tab:results} and figure \ref{fig:results}.
\begin{table}[H]
\caption{Mean absolute error (MAE) w.r.t. the true evaluation}
\begin{center}
\begin{tabular}{l | c c}
Method & MAE without Z & MAE with Z \\ \hline
Labeled outcomes & 0.107563333 & 0.0817483\\
Human evaluation & 0.004403964 & 0.0042597\\
Contraction & 0.011049707 & 0.0054146\\
Causal model, ep & 0.001074039 & 0.0414928\\
\end{tabular}
\end{center}
\label{tab:results}
\end{table}%
\begin{figure}[H]
\centering
\begin{subfigure}[b]{0.5\textwidth}
\includegraphics[width=\textwidth]{sl_without_Z_3iter}
\caption{Results without unobservables}
\label{fig:results_without_Z}
\end{subfigure}
~ %add desired spacing between images, e. g. ~, \quad, \qquad, \hfill etc.
%(or a blank line to force the subfigure onto a new line)
\begin{subfigure}[b]{0.5\textwidth}
\includegraphics[width=\textwidth]{sl_with_Z_3iter_betaZ_1_0}
\caption{Results with unobservables, $\beta_Z=1$.}
\label{fig:results_with_Z}
\end{subfigure}
\caption{Failure rate vs. acceptance rate with varying levels of leniency. Logistic regression was trained on labeled training data. $N_{iter}$ was set to 3.}\label{fig:results}
\end{figure}
\subsection{$\beta_Z=0$ and data generated with unobservables.}
If we set $\beta_Z=0$, all failure rates except that of human evaluation drop close to zero in the leniency interval $0.1, \ldots, 0.3$. Results are presented in figures \ref{fig:betaZ_1_5} and \ref{fig:betaZ_0}.

The differences between figures \ref{fig:results_without_Z} and \ref{fig:betaZ_0} could be explained by the slight difference in the data generating process, namely the effect of $W$ or $\epsilon$. The effect of adding $\epsilon$ (noise in the decisions) is explored further in section \ref{sec:epsilon}.
\begin{figure}[H]
\centering
\begin{subfigure}[b]{0.5\textwidth}
\includegraphics[width=\textwidth]{sl_with_Z_4iter_betaZ_1_5}
\caption{With unobservables, $\beta_Z$ set to 1.5 in algorithm \ref{alg:data_with_Z}.}
\label{fig:betaZ_1_5}
\end{subfigure}
~ %add desired spacing between images, e. g. ~, \quad, \qquad, \hfill etc.
%(or a blank line to force the subfigure onto a new line)
\begin{subfigure}[b]{0.5\textwidth}
\includegraphics[width=\textwidth]{sl_with_Z_4iter_beta0}
\caption{With unobservables, $\beta_Z$ set to 0 in algorithm \ref{alg:data_with_Z}.}
\label{fig:betaZ_0}
\end{subfigure}
\caption{Effect of $\beta_Z$. Failure rate vs. acceptance rate with unobservables in the data (see algorithm \ref{alg:data_with_Z}). Logistic regression was trained on labeled training data. Results from algorithm \ref{alg:perf_comp} with $N_{iter}=4$.}
\label{fig:betaZ_comp}
\end{figure}
\subsection{Noise added to the decision and data generated without unobservables} \label{sec:epsilon}
In this part, Gaussian noise with zero mean and variance 0.1 was added to the probabilities $P(Y=0|X=x)$ after sampling Y but before sorting the observations on line 5 of algorithm \ref{alg:data_without_Z}. Results are presented in figure \ref{fig:sigma_figure}.
\begin{figure}[H]
\centering
\includegraphics[width=0.75\textwidth]{sl_without_Z_3iter_sigma_sqrt_01}
\caption{Failure rate at varying levels of leniency, without unobservables and with noise added to the decisions. Logistic regression was trained on labeled training data, with $N_{iter}$ set to 3.}
\label{fig:sigma_figure}
\end{figure}
\subsection{Predictions with random forest classifier} \label{sec:random_forest}
In this section the predictive model was switched to a random forest classifier to examine the effect of changing the model. The results are practically identical to the ones presented in figure \ref{fig:results}; the outcome is shown in figure \ref{fig:random_forest}.
\begin{figure}[H]
\centering
\begin{subfigure}[b]{0.5\textwidth}
\includegraphics[width=\textwidth]{sl_withoutZ_4iter_randomforest}
\caption{Results without unobservables, \\$N_{iter}=4$.}
\label{fig:rf_results_without_Z}
\end{subfigure}
~ %add desired spacing between images, e. g. ~, \quad, \qquad, \hfill etc.
%(or a blank line to force the subfigure onto a new line)
\begin{subfigure}[b]{0.5\textwidth}
\includegraphics[width=\textwidth]{sl_withZ_6iter_betaZ_1_0_randomforest}
\caption{Results with unobservables, $\beta_Z=1$ and \\$N_{iter}=6$.}
\label{fig:rf_results_with_Z}
\end{subfigure}
\caption{Failure rate vs. acceptance rate with varying levels of leniency. Random forest classifier was trained on labeled training data.}
\label{fig:random_forest}
\end{figure}
\subsection{Sanity check for predictions}
Predictions were checked by plotting the predicted Y against X; the results are presented in figure \ref{fig:sanity_check}. The figure indicates that the predicted class labels and their probabilities are consistent with the ground truth.
\begin{figure}[H]
\centering
\includegraphics[width=0.75\textwidth]{sanity_check}
\caption{Predicted class label and probability of $Y=1$ versus X. Predictions were made with a logistic regression model. Colors of the points denote the ground truth (yellow = 1, purple = 0). The data set was created with unobservables.}
\label{fig:sanity_check}
\end{figure}
\begin{thebibliography}{9}
\bibitem{dearteaga18}
De-Arteaga, Maria, et al. Learning Under Selective Labels in the Presence of Expert Consistency. 2018.
\bibitem{lakkaraju17}
Lakkaraju, Himabindu, et al. The Selective Labels Problem: Evaluating Algorithmic Predictions in the Presence of Unobservables. 2017.
\end{thebibliography}
\end{document}