\usepackage[margin=1in]{geometry} % See geometry.pdf to learn the layout options. There are lots.
%\geometry{a4paper} % ... or letterpaper or a5paper or ...
%\geometry{landscape} % Activate for rotated page geometry
%\usepackage[parfill]{parskip} % Activate to begin paragraphs with an empty line rather than an indent
\usepackage{graphicx}
\usepackage{amssymb}
\usepackage{epstopdf}
%\DeclareGraphicsRule{.tif}{png}{.png}{`convert #1 `dirname #1`/`basename #1 .tif`.png}
\usepackage[normalem]{ulem} %tables
\useunder{\uline}{\ul}{}
\usepackage{booktabs}% http://ctan.org/pkg/booktabs
\newcommand{\tabitem}{~~\llap{\textbullet}~~}
\usepackage{algorithm}% http://ctan.org/pkg/algorithms
\usepackage{algorithmic}% http://ctan.org/pkg/algorithms
\renewcommand{\algorithmicrequire}{\textbf{Input:}}
\renewcommand{\algorithmicensure}{\textbf{Procedure:}}
\renewcommand{\algorithmicreturn}{\textbf{Return}}
\newcommand{\pr}{\mathbb{P}} % probability symbol
\newcommand{\D}{\mathcal{D}} % data set
\newcommand{\s}{\mathcal{S}} % "fancy S"
\newcommand{\M}{\mathcal{M}} % "fancy M"
\newcommand{\B}{\mathcal{B}} % "fancy B"
\newcommand{\RR}{\mathcal{R}} % contraction algorithm's R
\renewcommand{\descriptionlabel}[1]{\hspace{\labelsep}\textnormal{#1}}
\makeatletter
%Table of Contents
\setcounter{tocdepth}{3}
% Add bold to \section titles in ToC and remove . after numbers
\renewcommand{\tocsection}[3]{%
\indentlabel{\@ifnotempty{#2}{\bfseries\ignorespaces#1 #2\quad}}\bfseries#3}
% Remove . after numbers in \subsection
\renewcommand{\tocsubsection}[3]{%
\indentlabel{\@ifnotempty{#2}{\ignorespaces#1 #2\quad}}#3}
%\let\tocsubsubsection\tocsubsection% Update for \subsubsection
%...
\newcommand\@dotsep{4.5}
\def\@tocline#1#2#3#4#5#6#7{\relax
\ifnum #1>\c@tocdepth % then omit
\else
\par \addpenalty\@secpenalty\addvspace{#2}%
\begingroup \hyphenpenalty\@M
\@ifempty{#4}{%
\@tempdima\csname r@tocindent\number#1\endcsname\relax
}{%
\@tempdima#4\relax
}%
\parindent\z@ \leftskip#3\relax \advance\leftskip\@tempdima\relax
\rightskip\@pnumwidth plus1em \parfillskip-\@pnumwidth
#5\leavevmode\hskip-\@tempdima{#6}\nobreak
\leaders\hbox{$\m@th\mkern \@dotsep mu\hbox{.}\mkern \@dotsep mu$}\hfill
\nobreak
\hbox to\@pnumwidth{\@tocpagenum{\ifnum#1=1\bfseries\fi#7}}\par% <-- \bfseries for \section page
\nobreak
\endgroup
\fi}
\AtBeginDocument{%
\expandafter\renewcommand\csname r@tocindent0\endcsname{0pt}
}
\def\l@subsection{\@tocline{2}{0pt}{2.5pc}{5pc}{}}
\makeatother
%\makeatletter
%\def\listofalgorithms{\@starttoc{loa}\listalgorithmname}
%\def\l@algorithm{\@tocline{0}{3pt plus2pt}{0pt}{1.9em}{}}
%\renewcommand{\ALG@name}{AlGoRiThM}
%\renewcommand{\listalgorithmname}{List of \ALG@name s}
%\makeatother
\usepackage{subcaption}
\graphicspath{ {../figures/} }
\begin{document}
\begin{abstract}
This document presents RL's implementations at the pseudocode level. First, I present most of the nomenclature used in these notes. Then I give my personal views and comments on the motivation behind the selective labels paper. In section \ref{sec:framework}, I define the framework for this problem and give the required definitions. In the following sections, I present the data generating algorithms and the algorithms for obtaining failure rates with different methods. Finally, in the last section, I present results under multiple different settings.
\end{abstract}
\section*{Terms and abbreviations}
\begin{description}
\item[R :] acceptance rate, leniency of decision maker, $r \in [0, 1]$
\item[X :] personal features, observable to a predictive model
\item[Z :] some features of a subject, unobservable to a predictive model, latent variable
\item[W :] noise added to result variable Y
\item[T :] decision variable, bail/positive decision equal to 1, jail/negative decision equal to 0
\item[Y :] result variable, no crime/positive result equal to 1, crime/negative result equal to 0
\item[MAE :] Mean absolute error
\item[SL :] Selective labels, for more information see Lakkaraju's paper \cite{lakkaraju17}
\item[Labeled data :] data that has been censored, i.e. if a negative decision is given (T=0), then Y is set to NA.
\item[Full data :] data that has all labels available, i.e. \emph{even if} a negative decision is given (T=0), Y is still available.
\item[Unobservables :] unmeasured confounders, latent variables, Z
\end{description}

Mnemonic rule for the binary coding: zero is bad (crime or jail), one is good!
\section{RL's notes about the selective labels paper (optional reading)} \label{sec:comments}
\setcounter{figure}{-1}
\begin{wrapfigure}{r}{0.3\textwidth} %this figure will be at the right
\centering
\begin{tikzpicture}[->,>=stealth',node distance=1.5cm, semithick]
\tikzstyle{every state}=[fill=none,draw=black,text=black]
\node[state] (R) {$R$};
\node[state] (X) [right of=R] {$X$};
\node[state] (T) [below of=X] {$T$};
\node[state] (Z) [rectangle, right of=X] {$Z$};
\node[state] (Y) [below of=Z] {$Y$};
\path (R) edge (T)
(X) edge (T)
edge (Y)
(Z) edge (T)
edge (Y)
(T) edge (Y);
\end{tikzpicture}
\caption{Initial model.}
\label{fig:initial_model}
\end{wrapfigure}
\emph{This chapter is to present my comments and insight regarding the topic.}
The motivating idea behind the SL paper of Lakkaraju et al. \cite{lakkaraju17} is to evaluate whether machines could improve on human performance. In the general case, comparing the performance of human and machine evaluations is simple. In the domains addressed by Lakkaraju et al., simple comparisons would be unethical, and therefore algorithms are required. (Other approaches, such as a data augmentation algorithm, have been proposed by De-Arteaga et al. \cite{dearteaga18}.)
The general idea of the SL paper is to train a predictive model with selectively labeled data. The question is then "how would this predictive model perform if it were to make independent bail-or-jail decisions?" That quantity cannot be computed from real-life data sets due to ethical reasons and hidden labels. We can, however, use more selectively labeled data to estimate the model's performance. But because the available data is biased, performance estimates computed in the conventional way ("labeled outcomes only") are overly optimistic. This is why they propose the contraction algorithm.
One of the concepts to note when reading the Lakkaraju paper is the difference between the global goal of prediction and the goal in this specific setting. The global goal is to achieve a low failure rate with a high acceptance rate, but at the moment we are not interested in that. The goal in this setting is to estimate the true failure rate of the model from unseen biased data. That is, given only selectively labeled data and an arbitrary black-box model $\B$, we are interested in predicting the performance of $\B$ on the whole data set with all ground truth labels.
On acceptance rate R: We discussed how Lakkaraju's paper treats variable R in a seemingly nonsensical way; it is as if a judge would have to let someone go today in order to detain some other defendant tomorrow to keep their acceptance rate at some $r$. A more intuitive way of thinking about $r$ is the "threshold perspective": if a judge sees that a defendant has probability $p_x$ of committing a crime if let out, the judge detains the defendant if $p_x > r$, i.e. the defendant is deemed too dangerous to let out. Now, it can be claimed that judges do have a threshold, even in Lakkaraju's setting. Let $c$ be a judge's acceptance threshold. Decision $T$ can then be defined with function $f_T(x, z)$ as follows:
\begin{equation}
\pr(T=0|x, z)=\begin{cases}
p_T, & \text{if $f_T(x, z) < c$}\\
1-p_T, & \text{otherwise}.
\end{cases}
\end{equation}
If $c$ is defined so that the ratio of positive decisions to all decisions equals $r$, we arrive at a data generation process similar to Lakkaraju's, as presented in algorithm \ref{alg:data_with_Z}.
Finally, excerpts from Lakkaraju et al. \cite{lakkaraju17} about counterfactual inference; the numbered references below are from their paper [sic]:
\begin{quote}
There has also been some work on inferring labels using counterfactual inference techniques [9, 12, 16, 36, 38] and leveraging these estimates when computing standard evaluation metrics. However, counterfactual inference techniques explicitly assume that there are no unmeasured confounders (that is, no unobservable variables Z) that could affect the outcome Y. This assumption does not typically hold in cases where human decisions are providing data labels [7, 21]. Thus, the combination of two ingredients -- selective labels and non-trivial unobservables -- poses problems for these existing techniques.
Counterfactual inference. Counterfactual inference techniques have been used extensively to estimate treatment effects in observational studies. These techniques have found applications in a variety of fields such as machine learning, epidemiology, and sociology [3, 8--10, 30, 34]. Along the lines of Johansson et al. [16], counterfactual inference techniques can be broadly categorized as: (1) parametric methods which model the relationship between observed features, treatments, and outcomes. Examples include any type of regression model such as linear and logistic regression, random forests and regression trees [12, 33, 42]. (2) non-parametric methods such as propensity score matching, nearest-neighbor matching, which do not explicitly model the relationship between observed features, treatments, and outcomes [4, 15, 35, 36, 41]. (3) doubly robust methods which combine the two aforementioned classes of techniques typically via a propensity score weighted regression [5, 10]. The effectiveness of parametric and non-parametric methods depends on the postulated regression model and the postulated propensity score model respectively. If the postulated models are not identical to the true models, then these techniques result in biased estimates of outcomes. Doubly robust methods require only one of the postulated models to be identical to the true model in order to generate unbiased estimates. However, due to the presence of unobservables, we cannot guarantee that either of the postulated models will be identical to the true models.
\end{quote}
\section{Framework definition -- 13 June discussion} \label{sec:framework}
\emph{In this section we define some key terms and concepts and derive a more unambiguous framework for the selective labels problem. The framework is presented in writing and as a picture in figures \ref{fig:framework} and \ref{fig:framework_data_flow}.}
First, data is generated through a \textbf{data generating process (DGP)}. The DGP comprises generating the private features of the subjects, generating the acceptance rates of the judges and assigning the subjects to the judges. The \textbf{acceptance rate (AR)} is defined as the ratio of positive decisions to all decisions that a judge gives. As a formula, \[ AR = \dfrac{\#\{Positive~decisions\}}{\#\{Decisions\}}. \] The data generating process is depicted in the first box of figure \ref{fig:framework}.
Next, all of the generated data goes to the \textbf{labeling process}. In the labeling process, it is determined which instances of the data will have an outcome label available. This is done by humans and is presented on lines 5--7 of algorithm \ref{alg:data_without_Z} and lines 5--8 of algorithm \ref{alg:data_with_Z}. The data is then split randomly into training and test data sets, $\D_{train}$ and $\D_{test}$ respectively.
After labeling, the labeled training data is used to train a machine that makes decisions or predictions using some features of the data. The test data is then given to the machine, and it outputs either binary decisions (yes/no), probabilities (a real number in the interval $[0, 1]$) or an ordering of all the instances in the test data set. The machine is denoted by $\M$.
Finally, the decisions and/or predictions made by the machine $\M$ and the human judges (see the dashed arrows in figure \ref{fig:framework}) are evaluated using an \textbf{evaluation algorithm}. Evaluation algorithms take the decisions, probabilities or ordering generated in the previous steps as input and output an estimate of the machine's failure rate on the test data. The \textbf{failure rate (FR)} is defined as the ratio of undesired outcomes to given decisions. One special characteristic of FR in this setting is that a failure can only occur with a positive decision. More explicitly, \[ FR = \dfrac{\#\{Failures\}}{\#\{Decisions\}}. \] A second characteristic of FR is that the number of positive decisions, and therefore FR itself, can be controlled through the acceptance rate defined above.
Given the above framework, the goal is to create an evaluation algorithm that can accurately estimate the failure rate of any model $\M$ if it were to replace the human decision-makers in the labeling process. The estimates have to be made using only data that the human decision-makers have labeled, and the failure rate has to be estimated accurately for various levels of acceptance rate. The accuracy of the estimates can be compared by computing e.g. the mean absolute error w.r.t. the estimates given by the \nameref{alg:true_eval} algorithm.
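To make the definitions concrete, below is a minimal Python sketch of how AR and FR could be computed from decision and outcome vectors. The arrays \texttt{t} and \texttt{y} are hypothetical toy data, not the actual generated data.
\begin{verbatim}
import numpy as np

# Hypothetical toy data: t[i] = decision (1 = release), y[i] = outcome
# (0 = crime); y is NaN when the label was censored by a negative decision.
t = np.array([1, 1, 0, 1, 0])
y = np.array([1., 0., np.nan, 1., np.nan])

acceptance_rate = np.mean(t == 1)        # #{positive decisions} / #{decisions}
failures = np.sum((t == 1) & (y == 0))   # failures can occur only when t == 1
failure_rate = failures / len(t)         # #{failures} / #{decisions}
\end{verbatim}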
\begin{figure}[H]
\centering
\begin{tikzpicture}[->, >=stealth, shorten >=1pt, auto, node distance=1.5cm, semithick]
\tikzstyle{every state}=[fill=none, draw=black, text=black, rectangle, minimum width=6.0cm]
\node[state] (DG) {Data generation};
\node[state] (LP) [below of=DG] {Labeling process (human)};
\node[state] (MP) [below of=LP] {Machine predictions};
\node[state] (EA) [below of=MP] {Evaluation algorithm};
\path (DG) edge (LP)
      edge [bend right=90, dashed] node [left] {(1)} (MP)
      (LP) edge (MP)
      edge [bend left=90, dashed] node {(2)} (EA)
      (MP) edge (EA);
\end{tikzpicture}
\caption{The selective labels framework. Dashed arrow (1) marks the flow of unlabeled data to the \nameref{alg:true_eval} algorithm and (2) indicates how human evaluations are evaluated without machine intervention using the \nameref{alg:human_eval} algorithm.}
\label{fig:framework}
\end{figure}
\begin{figure} [H]
\centering
\begin{tikzpicture}[->,>=stealth',shorten >=1pt,auto,node distance=1.5cm,
semithick]
\tikzstyle{every state}=[fill=none,draw=black,text=black, rectangle, minimum width=6.0cm]
\node[state] (DG) {Data generation};
\node[state] (LP) [below of = DG] {Labeling process (human)};
\node[state] (MT) [below left=1.0cm and -4cm of LP] {Model training};
\node[state] (MP) [below=1.0cm of MT] {Machine predictions};
\node[state] (EA) [below right=0.75cm and -4cm of MP] {Evaluation algorithm};
\path (DG) edge (LP)
      edge [out=180, in=180, dashed] node [left] {$\D_{test,~unlabeled}$ (1)} (MP)
(LP) edge [bend right=19] node [left] {$\D_{train}$} (MT)
edge [bend left=60] node [right] {$\D_{test}$} (MP)
edge [bend left=75, dashed] node [right] {$\D_{test}$ (2)} (EA)
(MT) edge node {$\M$} (MP)
(MP) edge (EA);
\end{tikzpicture}
\caption{The selective labels framework with explicit data flow. Dashed arrow (1) marks the flow of unlabeled data to the \nameref{alg:true_eval} algorithm and (2) indicates how human evaluations are evaluated without machine intervention using the \nameref{alg:human_eval} algorithm.}
\label{fig:framework_data_flow}
\end{figure}
\section{Modular framework -- based on 19 June discussion} \label{sec:modular_framework}
\begin{wrapfigure}{r}{0.25\textwidth} %this figure will be at the right
\centering
\begin{tikzcd}
\arrow[d] & \arrow[d] & \arrow[d] \\
X \arrow[rd] & Z \arrow[d] & W \arrow[ld] \\
& Y &
\end{tikzcd}
\caption{$\M$}
\end{wrapfigure}
\emph{Below is the framework as it was written on the whiteboard; RL's own remarks on how he understood it follow.}
\begin{description}
\item[Data generator] ~\\
\textbf{Input:} [none] \\
\textbf{Output:} $X, Z, W, Y$ as specified by $\M$
\item[Decider] ~\\
\textbf{Input:}
\begin{itemize}
\item one defendant
\item $\M$
\end{itemize}
\textbf{Output:}
\begin{itemize}
\item argmax likelihood $y$
\item $\pr(Y=0~|~input)$
\end{itemize}
\item[Evaluator] ~\\
\textbf{Input:}
\begin{itemize}
\item Data sample $(X, T, Y)$
\item something about $\M$ and something about Decider($r$)
\end{itemize}
\textbf{Output:}
\begin{itemize}
\item $\mathbb{E}[FR~|~input]$
\item curve
\end{itemize}
\end{description}
The above framework is now separated into three different modules: data generation, decider and evaluator. In the first module, all the data points $\{x_i, z_i, w_i, y_i\}$ for all $i=1, \ldots, n$ are created. Outcome $y_i$ is available for all observations.
The next module, the decider, assigns a decision to each observation in a given/defined way. This 'decision' can be the most likely value of y (argmax likelihood y, usually binary 0 or 1), the probability of an outcome or an ordering of the defendants.
The evaluator module takes a data sample, some information about the data generation and some information about the decider as input. The data sample includes features $X,~T$ and $Y$, where $Y \in \{0, 1, NA\}$ as specified before. The "some information about $\M$" might be knowledge of the distributions of some of the variables or of their interdependencies. In our example, we know that $X$ is standard Gaussian and independent of the other variables. Of the decider it is known that its decisions are affected by leniency and the private features X. Our objective is to simulate the decision-maker's process within the data sample. To do this we need to learn the predictive model $\B$ with the restriction that Z cannot be observed. With this setting, we now need to define an algorithm which outputs $\mathbb{E}[FR~|~input]$ for a given model $\B$ and leniency $r$.
\begin{quote}
\emph{MM:} For example, consider an evaluation process that knows (i.e., is given as input) the decision process and what decisions it took for a few data points. The same evaluation process knows only some of the attributes of those data points -- and therefore it has only partial information about the data generation process. To make the example more specific, consider the case of decision process $\s$ mentioned above, which does not know W -- and consider an evaluation process that knows exactly how $\s$ works and what decisions it took for a few data points, but does not know either W or Z of those data points. This evaluation process outputs the expected value of FR according to the information that's given to it.
\end{quote}
\textbf{Example.} For illustration, let's consider the contraction algorithm in this framework. Data is generated from three Gaussians and the outcome Y is decided as in algorithm \ref{alg:data_with_Z}. The deciders are humans assigning the decisions as in algorithm \ref{alg:data_with_Z}. Next, the evaluator module
\begin{itemize}
\item takes the whole of the observable data ($\{x_i, t_i, y_i\}$) as input
\item knows that X affects the results and the decisions
\item knows that the variables are Gaussians.
\end{itemize}
Now the evaluator module splits the data into training and test sets and trains a predictive model $\B$ on the training set. Finally, the contraction algorithm outputs an estimate of the failure rate following its definition (see algorithm \ref{alg:contraction}) using predictions from $\B$.
\section{Data generation}
Both of the data generating algorithms are presented in this section.
\subsection{Without unobservables (see also algorithm \ref{alg:data_without_Z})}
In the setting without unobservables Z, we first sample an acceptance rate $r$ for each of the $M=100$ judges uniformly from the half-open interval $[0.1, 0.9)$. Then we assign 500 unique subjects to each judge at random (50,000 in total) and simulate their features X as i.i.d. standard Gaussian random variables with zero mean and unit variance. The probability of a negative outcome is then calculated as
\begin{equation}
P(Y=0|X=x) = \dfrac{1}{1+\exp(-x)}=\sigma(x).
\end{equation}
Because $P(Y=1|X=x) = 1-P(Y=0|X=x) = 1-\sigma(x)$, the outcome variable Y can be sampled from a Bernoulli distribution with parameter $1-\sigma(x)$. The data is then sorted for each judge by the probabilities $P(Y=0|X=x)$ in descending order. If a subject is in the top $(1-r) \cdot 100 \%$ of the observations assigned to a judge, the decision variable T is set to zero, and otherwise to one.
\begin{algorithm}[] % enter the algorithm environment
\caption{Create data without unobservables} % give the algorithm a caption
\label{alg:data_without_Z} % and a label for \ref{} commands later in the document
\begin{algorithmic}[1] % enter the algorithmic environment
\REQUIRE Number of judges $M=100$ and number of subjects distributed to each of them $N=500$ s.t. $N_{total} = N \cdot M$
\ENSURE
\STATE Sample an acceptance rate for each of the $M$ judges from $U(0.1, 0.9)$ and round to the nearest tenth.
\STATE Sample features X for each $N_{total}$ observations from standard Gaussian.
\STATE Calculate $P(Y=0|X=x)=\sigma(x)$ for each observation
\STATE Sample Y from Bernoulli distribution with parameter $1-\sigma(x)$.
\STATE Sort the data by (1) the judges and (2) by probabilities $P(Y=0|X=x)$ in descending order.
\STATE \hskip3.0em $\rhd$ Now the most dangerous subjects for each of the judges are at the top.
\STATE If subject belongs to the top $(1-r) \cdot 100 \%$ of observations assigned to a judge, set $T=0$ else set $T=1$.
\STATE Halve the data to training and test sets at random.
\STATE For both halves, set $Y=$ NA if decision is negative ($T=0$).
\RETURN labeled training data, full training data, labeled test data, full test data
\end{algorithmic}
\end{algorithm}
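To complement the pseudocode, here is a minimal Python sketch of algorithm \ref{alg:data_without_Z}; the random seed and variable names are my own choices, and the train/test split and censoring are omitted.
\begin{verbatim}
import numpy as np
from scipy.special import expit   # expit(x) = sigma(x) = 1 / (1 + exp(-x))

rng = np.random.default_rng(0)
M, N = 100, 500                                   # judges, subjects per judge

r = np.round(rng.uniform(0.1, 0.9, size=M), 1)    # leniency of each judge
judge = np.repeat(np.arange(M), N)                # assign N subjects per judge
x = rng.standard_normal(M * N)                    # features X
prob_y0 = expit(x)                                # P(Y=0 | X=x) = sigma(x)
y = rng.binomial(1, 1.0 - prob_y0)                # Y ~ Bernoulli(1 - sigma(x))

# Within each judge, jail (T=0) the top (1 - r) share by P(Y=0 | X=x).
t = np.ones(M * N, dtype=int)
for j in range(M):
    idx = np.where(judge == j)[0]
    order = idx[np.argsort(prob_y0[idx])[::-1]]   # most dangerous first
    n_jail = int((1.0 - r[j]) * N)
    t[order[:n_jail]] = 0
# Halving into train/test and setting Y to NA where T == 0 follow as in
# the final steps of the pseudocode.
\end{verbatim}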
\subsection{With unobservables (see also algorithm \ref{alg:data_with_Z})}
In the setting with unobservables Z, we first sample an acceptance rate $r$ for each of the $M=100$ judges uniformly from the half-open interval $[0.1, 0.9)$. Then we assign 500 unique subjects to each judge at random (50,000 in total) and simulate their features X, Z and W as i.i.d. standard Gaussian random variables with zero mean and unit variance. The probability of a negative outcome is then calculated as
\begin{equation}
P(Y=0|X=x, Z=z, W=w)=\sigma(\beta_Xx+\beta_Zz+\beta_Ww)~,
\end{equation}
where $\beta_X=\beta_Z=1$ and $\beta_W=0.2$. Next, the value of the outcome Y is set to 0 if $P(Y = 0| X, Z, W) \geq 0.5$ and to 1 otherwise. The conditional probability of a negative decision (T=0) is defined as
\begin{equation}
P(T=0|X=x, Z=z)=\sigma(\beta_Xx+\beta_Zz)+\epsilon~,
\end{equation}
where $\epsilon \sim N(0, 0.1)$. Next, the data is sorted for each judge by the probabilities $P(T=0|X, Z)$ in descending order. If a subject is in the top $(1-r) \cdot 100 \%$ of the observations assigned to a judge, the decision variable T is set to zero, and otherwise to one.
\begin{algorithm}[] % enter the algorithm environment
\caption{Create data with unobservables} % give the algorithm a caption
\label{alg:data_with_Z} % and a label for \ref{} commands later in the document
\begin{algorithmic}[1] % enter the algorithmic environment
\REQUIRE Number of judges $M=100$, number of subjects distributed to each of them $N=500$ s.t $N_{total} = N \cdot M$, $\beta_X=1, \beta_Z=1$ and $\beta_W=0.2$.
\ENSURE
\STATE Sample an acceptance rate for each of the $M$ judges from $U(0.1, 0.9)$ and round to the nearest tenth.
\STATE Sample features X, Z and W for each $N_{total}$ observations from standard Gaussian independently.
\STATE Calculate $P(Y=0|X, Z, W)$ for each observation.
\STATE Set Y to 0 if $P(Y = 0| X, Z, W) \geq 0.5$ and to 1 otherwise.
\STATE Calculate $P(T=0|X, Z)$ for each observation and attach to data.
\STATE Sort the data by (1) the judges and (2) by probabilities $P(T=0|X, Z)$ in descending order.
\STATE \hskip3.0em $\rhd$ Now the most dangerous subjects for each of the judges are at the top.
\STATE If subject belongs to the top $(1-r) \cdot 100 \%$ of observations assigned to that judge, set $T=0$ else set $T=1$.
\STATE Halve the data to training and test sets at random.
\STATE For both halves, set $Y=$ NA if decision is negative ($T=0$).
\RETURN labeled training data, full training data, labeled test data, full test data
\end{algorithmic}
\end{algorithm}
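Continuing the sketch above (reusing \texttt{rng}, \texttt{expit}, \texttt{x}, \texttt{M} and \texttt{N}), the parts that change in algorithm \ref{alg:data_with_Z} could look as follows:
\begin{verbatim}
# Continues the previous sketch: rng, expit, x, M and N are defined there.
z = rng.standard_normal(M * N)
w = rng.standard_normal(M * N)
beta_x, beta_z, beta_w = 1.0, 1.0, 0.2

prob_y0 = expit(beta_x * x + beta_z * z + beta_w * w)
y = (prob_y0 < 0.5).astype(int)            # Y = 0 iff P(Y=0|X,Z,W) >= 0.5

# Decisions depend on X and Z only, with Gaussian noise epsilon.
prob_t0 = expit(beta_x * x + beta_z * z) + rng.normal(0.0, 0.1, size=M * N)
# ...then sort within each judge by prob_t0 and set T exactly as before.
\end{verbatim}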
\section{Model fitting} \label{sec:model_fitting}
The fitted models are logistic regression models from the scikit-learn package. The solver is set to lbfgs (as there is no closed-form solution) and an intercept is estimated by default. The resulting LogisticRegression model object provides convenient functions for fitting the model and obtaining probabilities for the class labels. Please see the documentation at \url{https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html} or ask me (RL) for more details. Similar analyses were conducted using a random forest classifier, but the results (see section \ref{sec:random_forest}) were practically identical.
Algorithms 4--8 are all model agnostic, i.e. they do not depend on a specific predictive model; the model only has to output probabilities for the outcome given some input. Lakkaraju et al. say in their paper: "We train logistic regression on this training set. We also experimented with other predictive models and observed similar behaviour."
NB: sklearn's regression model cannot be fitted if the data include missing values. Therefore, list-wise deletion is performed in cases of missing data (the whole record is discarded).
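A sketch of the fitting procedure described above follows; the three-row \texttt{DataFrame} is a hypothetical placeholder for the generated data.
\begin{verbatim}
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical placeholder data: Y is NaN whenever T == 0 (censored label).
train = pd.DataFrame({'X': [0.3, -1.2, 0.8], 'T': [1, 0, 1],
                      'Y': [0.0, np.nan, 1.0]})

labeled = train.dropna(subset=['Y'])          # list-wise deletion
model = LogisticRegression(solver='lbfgs')    # intercept fitted by default
model.fit(labeled[['X']], labeled['Y'])

# predict_proba returns class probabilities; column 0 is P(Y=0 | X=x).
prob_y0 = model.predict_proba(train[['X']])[:, 0]
\end{verbatim}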
\section{Plotting}
The following quantities are computed from the data:
\begin{itemize}
\item True evaluation: The true failure rate of the model. Can only be calculated for synthetic data sets. See algorithm \ref{alg:true_eval} and discussion in section \ref{sec:comments}.
\item Labeled outcomes: The "traditional"/vanilla estimate of model performance. See algorithm \ref{alg:labeled_outcomes}.
\item Human evaluation: The failure rate of human decision-makers who have access to the latent variable Z. Decision-makers with similar values of leniency are binned and treated as one hypothetical decision-maker. See algorithm \ref{alg:human_eval}.
\item Contraction: See algorithm \ref{alg:contraction} from \cite{lakkaraju17}.
\item Causal model: In essence, the empirical performance is calculated over the test set as
\begin{equation}\label{eq:ep}
\dfrac{1}{n}\sum_{(x, y)\in D}f(x)\delta(F(x) < r)
\end{equation}
where
\begin{equation}\label{eq:causal_prediction}
f(x) = P(Y=0|T=1, X=x)
\end{equation}
is a logistic regression model (see section \ref{sec:model_fitting}, random forest used in section \ref{sec:random_forest}) trained on the labeled data predicting Y from X and
\begin{equation} \label{eq:causal_cdf}
F(x_0) = \int_{x\in\mathcal{X}} P(x)\delta(f(x) < f(x_0)) ~ dx.
\end{equation}
All observations, even ones with missing outcome labels, can be used, since the empirical performance doesn't depend on the labels. $P(x)$ is the Gaussian pdf from the scipy.stats package, and the integral is evaluated over the interval $[-15, 15]$ with 40000 steps using the si.simps function from scipy.integrate, which applies Simpson's rule (docs: \url{https://docs.scipy.org/doc/scipy/reference/generated/scipy.integrate.simps.html}); a numerical sketch follows this list. See also algorithm \ref{alg:causal_model} and the motivation in section \ref{sec:motivation}. \label{causal_cdf}
\end{itemize}
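Below is a minimal numerical sketch of equation \ref{eq:causal_cdf} as described in the causal model item above; the helper name \texttt{cdf\_of\_f} is mine, and \texttt{si.simps} is named \texttt{simpson} in newer SciPy versions.
\begin{verbatim}
import numpy as np
import scipy.stats as stats
import scipy.integrate as si

def cdf_of_f(f, x0, lo=-15.0, hi=15.0, steps=40000):
    """F(x0) = integral of P(x) * delta(f(x) < f(x0)) dx.

    f is a vectorized function giving P(Y=0 | X=x); P(x) is the standard
    Gaussian pdf, integrated over [lo, hi] as described in the text.
    """
    grid = np.linspace(lo, hi, steps)
    integrand = stats.norm.pdf(grid) * (f(grid) < f(x0))
    return si.simps(integrand, x=grid)   # Simpson's rule

# e.g. with the fitted model of the previous section:
# f = lambda v: model.predict_proba(np.atleast_2d(v).T)[:, 0]
\end{verbatim}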
The plotted curves are constructed using the pseudocode presented in algorithm \ref{alg:perf_comp}.
\subsection{Short motivation of the causal model presented above} \label{sec:motivation}
The causal model tries to predict the probability of an adverse outcome $Y=0$ when an acceptance rate $r$ is imposed. Estimating such a probability in the selective labels setting consists of two parts: predicting the probability that an individual commits a crime if they are given bail, and deciding whether to bail or jail them.
In equations \ref{eq:ep} and \ref{eq:causal_prediction}, $f(x)$ gives the probability of a negative outcome given the private features $x$. In equation \ref{eq:ep}, $\delta(F(x) < r)$ indicates the defendant's bail decision: they are let out if the proportion of people less dangerous than them is under $r$. For example, if a defendant $x_0$ arrives in front of a judge with leniency 0.65, they will not be let out if the judge deems that $F(x_0) > 0.65$, that is, if the judge thinks that more than 65\% of the defendants are \emph{less} dangerous than them.
Equation \ref{eq:ep} then simply calculates the mean of the probabilities, forcing the probability of crime to zero for defendants who are not given bail.
\begin{algorithm}[] % enter the algorithm environment
\caption{Performance comparison} % give the algorithm a caption
\label{alg:perf_comp} % and a label for \ref{} commands later in the document
\begin{algorithmic}[1] % enter the algorithmic environment
\REQUIRE Number of iterations $N_{iter}$
\ENSURE
\FORALL{$r$ in $0.1, 0.2, ..., 0.9$}
\FOR{i = 1 \TO $N_{iter}$}
\STATE Create data using either Algorithm \ref{alg:data_without_Z} or \ref{alg:data_with_Z}.
\STATE Train a logistic regression model using observations in the training set with available outcome labels and assign to $f$.
\STATE Using $f$, estimate probabilities $\s$ for Y=0 in both test sets (labeled and full) for all observations and attach them to the respective data sets.
\STATE Compute failure rate of true evaluation with leniency $r$ and full test data using algorithm \ref{alg:true_eval}.
\STATE Compute failure rate of labeled outcomes approach with leniency $r$ and labeled test data using algorithm \ref{alg:labeled_outcomes}.
\STATE Compute failure rate of human judges with leniency $r$ and labeled test data using algorithm \ref{alg:human_eval}.
\STATE Compute failure rate of contraction algorithm with leniency $r$ and labeled test data.
\STATE Compute the empirical performance of the causal model with leniency $r$, predictive model $f$ and labeled test data using algorithm \ref{alg:causal_model}.
\ENDFOR
\STATE Calculate the mean of the failure rates over the $N_{iter}$ iterations for each algorithm separately.
\STATE Calculate the standard error of the mean for each algorithm separately.
\ENDFOR
\STATE Plot the failure rates against the levels of leniency $r$.
\STATE Calculate the mean absolute error of each algorithm w.r.t. the true evaluation.
\end{algorithmic}
\end{algorithm}
\begin{algorithm}[] % enter the algorithm environment
\caption{True evaluation} % give the algorithm a caption
\label{alg:true_eval} % and a label for \ref{} commands later in the document
\begin{algorithmic}[1] % enter the algorithmic environment
\REQUIRE Full test data $\D$ with probabilities $\s$ and \emph{all outcome labels}, acceptance rate r
\STATE Sort the data by the probabilities $\s$ to ascending order.
\STATE \hskip3.0em $\rhd$ Now the most dangerous subjects are last.
\STATE Calculate the number to release $N_{free} = |\D| \cdot r$.
\RETURN $\frac{1}{|\D|}\sum_{i=1}^{N_{free}}\delta\{y_i=0\}$
\end{algorithmic}
\end{algorithm}
\begin{algorithm}[] % enter the algorithm environment
\caption{Labeled outcomes} % give the algorithm a caption
\label{alg:labeled_outcomes} % and a label for \ref{} commands later in the document
\begin{algorithmic}[1] % enter the algorithmic environment
\REQUIRE Labeled test data $\D$ with probabilities $\s$ and \emph{missing outcome labels} for observations with $T=0$, acceptance rate r
\STATE Assign observations with observed outcomes to $\D_{observed}$.
\STATE Sort $\D_{observed}$ by the probabilities $\s$ to ascending order.
\STATE \hskip3.0em $\rhd$ Now the most dangerous subjects are last.
\STATE Calculate the number to release $N_{free} = |\D_{observed}| \cdot r$.
\RETURN $\frac{1}{|\D_{observed}|}\sum_{i=1}^{N_{free}}\delta\{y_i=0\}$
\end{algorithmic}
\end{algorithm}
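A compact Python sketch of algorithms \ref{alg:true_eval} and \ref{alg:labeled_outcomes} follows; the function names are mine and \texttt{prob\_y0} stands for the probabilities $\s$.
\begin{verbatim}
import numpy as np

def true_evaluation(prob_y0, y, r):
    """True evaluation: all outcome labels are available."""
    order = np.argsort(prob_y0)                # ascending: most dangerous last
    n_free = int(len(y) * r)                   # number of subjects to release
    released = order[:n_free]
    return np.sum(y[released] == 0) / len(y)   # failures / decisions

def labeled_outcomes(prob_y0, y, t, r):
    """Labeled outcomes: use only observations with observed outcomes."""
    observed = t == 1
    return true_evaluation(prob_y0[observed], y[observed], r)
\end{verbatim}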
\begin{algorithm}[] % enter the algorithm environment
\caption{Human evaluation} % give the algorithm a caption
\label{alg:human_eval} % and a label for \ref{} commands later in the document
\begin{algorithmic}[1] % enter the algorithmic environment
\REQUIRE Labeled test data $\D$ with probabilities $\s$ and \emph{missing outcome labels} for observations with $T=0$, acceptance rate r
\ENSURE
\STATE Assign judges with leniency in $[r-0.05, r+0.05]$ to $\mathcal{J}$
\STATE $\D_{\mathcal{J}} = \{(x, j, t, y) \in \D~|~j \in \mathcal{J}\}$ and $\D_{released} = \{(x, j, t, y) \in \D_{\mathcal{J}}~|~t=1\}$
\STATE \hskip3.0em $\rhd$ Subjects judged \emph{and} released by judges with the correct leniency.
\RETURN $\frac{1}{|\D_{\mathcal{J}}|}\sum_{i=1}^{|\D_{released}|}\delta\{y_i=0\}$
\end{algorithmic}
\end{algorithm}
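Similarly, a sketch of algorithm \ref{alg:human_eval}; here \texttt{judge\_r} holds each judge's leniency, and the denominator counts all decisions of the binned judges, following the failure rate definition as fixed above.
\begin{verbatim}
import numpy as np

def human_evaluation(judge_r, judge, t, y, r):
    """Failure rate of judges whose leniency is within 0.05 of r."""
    in_bin = np.abs(judge_r[judge] - r) <= 0.05   # bin judges by leniency
    released = in_bin & (t == 1)                  # judged *and* released
    if not np.any(in_bin):
        return np.nan                             # no judge at this leniency
    return np.sum(y[released] == 0) / np.sum(in_bin)
\end{verbatim}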
\begin{algorithm}[] % enter the algorithm environment
\caption{Contraction algorithm \cite{lakkaraju17}} % give the algorithm a caption
\label{alg:contraction} % and a label for \ref{} commands later in the document
\begin{algorithmic}[1] % enter the algorithmic environment
\REQUIRE Labeled test data $\D$ with probabilities $\s$ and \emph{missing outcome labels} for observations with $T=0$, acceptance rate r
\STATE Let $q$ be the decision-maker with highest acceptance rate in $\D$.
\STATE $\D_q = \{(x, j, t, y) \in \D|j=q\}$
\STATE \hskip3.0em $\rhd$ $\D_q$ is the set of all observations judged by $q$
\STATE $\RR_q = \{(x, j, t, y) \in \D_q|t=1\}$
\STATE \hskip3.0em $\rhd$ $\RR_q$ is the set of observations in $\D_q$ with observed outcome labels
\STATE Sort observations in $\RR_q$ in descending order of confidence scores $\s$ and assign to $\RR_q^{sort}$.
\STATE \hskip3.0em $\rhd$ Observations deemed as high risk by the black-box model $\mathcal{B}$ are at the top of this list
\STATE Remove the top $[(1.0-r)|\D_q |]-[|\D_q |-|\RR_q |]$ observations of $\RR_q^{sort}$ and call this list $\mathcal{R_B}$
\STATE \hskip3.0em $\rhd$ $\mathcal{R_B}$ is the list of observations assigned to $t = 1$ by $\mathcal{B}$
\STATE Compute $\mathbf{u}=\sum_{i=1}^{|\mathcal{R_B}|} \dfrac{\delta\{y_i=0\}}{| \D_q |}$.
\RETURN $\mathbf{u}$
\end{algorithmic}
\end{algorithm}
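A sketch of the contraction computation on a pandas \texttt{DataFrame} with columns \texttt{judge}, \texttt{t}, \texttt{y} and scores \texttt{s} $= \s$; the column names are my own.
\begin{verbatim}
import numpy as np
import pandas as pd

def contraction(df, r):
    """Contraction estimate of the failure rate at leniency r."""
    q = df.groupby('judge')['t'].mean().idxmax()      # most lenient judge
    d_q = df[df['judge'] == q]                        # D_q: all judged by q
    r_q = d_q[d_q['t'] == 1]                          # R_q: released by q
    r_sorted = r_q.sort_values('s', ascending=False)  # highest risk first
    k = int((1.0 - r) * len(d_q)) - (len(d_q) - len(r_q))
    r_b = r_sorted.iloc[max(k, 0):]                   # kept (released) by B
    return np.sum(r_b['y'] == 0) / len(d_q)
\end{verbatim}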
\begin{algorithm}[] % enter the algorithm environment
\caption{Causal model, empirical performance (ep, see also section \ref{causal_cdf})} % give the algorithm a caption
\label{alg:causal_model} % and a label for \ref{} commands later in the document
\begin{algorithmic}[1] % enter the algorithmic environment
\REQUIRE Labeled test data $\D$ with probabilities $\s$ and \emph{missing outcome labels} for observations with $T=0$, predictive model $f$, pdf $P_X(x)$ for features $x$, acceptance rate r
\STATE For all $x_0 \in \D$ evaluate $F(x_0) = \int_{x\in\mathcal{X}} P_X(x)\delta(f(x)<f(x_0)) ~dx$ and assign to $\mathcal{F}_{predictions}$
\STATE Create boolean array $T_{causal} = \mathcal{F}_{predictions} < r$.
\RETURN $\frac{1}{|\D|}\sum_{i=1}^{|\D|} \s_i \cdot T_{i, causal}$, which is equal to $\frac{1}{|\D|}\sum_{x\in\D} f(x)\delta(F(x) < r)$
\end{algorithmic}
\end{algorithm}
\section{Results}
Results obtained by running algorithm \ref{alg:perf_comp} with $N_{iter}$ set to 3 are presented in table \ref{tab:results} and figure \ref{fig:results}. All parameters are at their default values and a logistic regression model is trained.
\begin{table}[H]
\centering
\caption{Mean absolute error (MAE) w.r.t. true evaluation. \\ \emph{RL: Updated 26 June.}}
\label{tab:results}
\begin{tabular}{l | c c}
Method & MAE without Z & MAE with Z \\ \hline
Labeled outcomes & 0.107249375 & 0.0827844\\
Human evaluation & 0.002383729 & 0.0042517\\
Contraction & 0.004633164 & 0.0075497\\
Causal model, ep & 0.000598624 & 0.0411532\\
\end{tabular}
\end{table}
\begin{figure}[H]
\centering
\begin{subfigure}[b]{0.5\textwidth}
\includegraphics[width=\textwidth]{sl_without_Z_8iter}
\caption{Results without unobservables}
\label{fig:results_without_Z}
\end{subfigure}
~ %add desired spacing between images, e. g. ~, \quad, \qquad, \hfill etc.
%(or a blank line to force the subfigure onto a new line)
\begin{subfigure}[b]{0.5\textwidth}
\includegraphics[width=\textwidth]{sl_with_Z_8iter_betaZ_1_0}
\caption{Results with unobservables, $\beta_Z=1$.}
\label{fig:results_with_Z}
\end{subfigure}
\caption{Failure rate vs. acceptance rate with varying levels of leniency. Logistic regression was trained on labeled training data. $N_{iter}$ was set to 8. \emph{RL: Updated 26 June.}}
\label{fig:results}
\end{figure}
\subsection{$\beta_Z=0$ and data generated with unobservables.}
If we set $\beta_Z=0$, all failure rates except that of the human evaluation drop to almost zero in the interval $0.1, \ldots, 0.3$. Results are presented in figures \ref{fig:betaZ_1_5} and \ref{fig:betaZ_0}.
The disparities between figures \ref{fig:results_without_Z} and \ref{fig:betaZ_0} (results without unobservables and with $\beta_Z=0$) can be explained by a slight difference in the data generating process, namely the effect of $\epsilon$. The effect of adding $\epsilon$ (noise in the decisions) is further explored in section \ref{sec:epsilon}.
\begin{figure}[H]
\centering
\begin{subfigure}[b]{0.475\textwidth}
\includegraphics[width=\textwidth]{sl_with_Z_4iter_betaZ_1_5}
\caption{Results with unobservables, $\beta_Z$ set to 1.5 in algorithm \ref{alg:data_with_Z}.}
\label{fig:betaZ_1_5}
\end{subfigure}
\quad %add desired spacing between images, e. g. ~, \quad, \qquad, \hfill etc.
%(or a blank line to force the subfigure onto a new line)
\begin{subfigure}[b]{0.475\textwidth}
\includegraphics[width=\textwidth]{sl_with_Z_4iter_beta0}
\caption{Results with unobservables, $\beta_Z$ set to 0 in algorithm \ref{alg:data_with_Z}.}
\label{fig:betaZ_0}
\end{subfigure}
\caption{Effect of $\beta_Z$. Failure rate vs. acceptance rate with unobservables in the data (see algorithm \ref{alg:data_with_Z}). Logistic regression was trained on labeled training data. Results from algorithm \ref{alg:perf_comp} with $N_{iter}=4$.}
\label{fig:betaZ_comp}
\end{figure}
\subsection{Noise added to the decision and data generated without unobservables} \label{sec:epsilon}
In this part, Gaussian noise with zero mean and 0.1 variance was added to the probabilities $P(Y=0|X=x)$ after sampling Y but before ordering the observations in line 5 of algorithm \ref{alg:data_without_Z}. Results are presented in Figure \ref{fig:sigma_figure}.
\begin{figure}[H]
\centering
\includegraphics[width=0.5\textwidth]{sl_without_Z_3iter_sigma_sqrt_01}
\caption{Failure rate with varying levels of leniency without unobservables. Noise has been added to the decision probabilities. Logistic regression was trained on labeled training data with $N_{iter}$ set to 3.}
\label{fig:sigma_figure}
\end{figure}
\subsection{Predictions with random forest classifier} \label{sec:random_forest}
In this section, the predictive model was switched to a random forest classifier to examine the effect of changing the predictive model. The results, presented in figure \ref{fig:random_forest}, are practically identical to those presented in figure \ref{fig:results}.
\begin{figure}[H]
\centering
\begin{subfigure}[b]{0.475\textwidth}
\includegraphics[width=\textwidth]{sl_withoutZ_4iter_randomforest}
\caption{Results without unobservables with \\$N_{iter}=4$.}
\label{fig:results_without_Z_rf}
\end{subfigure}
\quad %add desired spacing between images, e. g. ~, \quad, \qquad, \hfill etc.
%(or a blank line to force the subfigure onto a new line)
\begin{subfigure}[b]{0.475\textwidth}
\includegraphics[width=\textwidth]{sl_withZ_6iter_betaZ_1_0_randomforest}
\caption{Results with unobservables, $\beta_Z=1$ and \\$N_{iter}=6$.}
\label{fig:results_with_Z_rf}
\end{subfigure}
\caption{Failure rate vs. acceptance rate with varying levels of leniency. A random forest classifier was trained on labeled training data.}
\label{fig:random_forest}
\end{figure}
\subsection{Sanity check for predictions}
Predictions were checked by drawing a graph of predicted Y versus X; the results are presented in figure \ref{fig:sanity_check}. The figure indicates that the predicted class labels and their probabilities are consistent with the ground truth.
\begin{figure}[H]
\centering
\includegraphics[width=0.5\textwidth]{sanity_check}
\caption{Predicted class label and probability of $Y=1$ versus X. Prediction was done with a logistic regression model. Colors of the points denote ground truth (yellow = 1, purple = 0). Data set was created with the unobservables.}
\label{fig:sanity_check}
\end{figure}
\subsection{Fully random model $\M$}
Given our framework defined in section \ref{sec:framework}, the results presented next are for a model $\M$ that outputs probability 0.5 for every instance $x$. The labeling process is still as presented in algorithm \ref{alg:data_with_Z}.
\begin{figure}[H]
\centering
\begin{subfigure}[b]{0.475\textwidth}
\includegraphics[width=\textwidth]{sl_without_Z_15iter_random_model}
\caption{Failure rate vs. acceptance rate. Data without unobservables and $N_{iter}=15$. Machine predictions with random model.}
\label{fig:random_predictions_without_Z}
\end{subfigure}
\quad %add desired spacing between images, e. g. ~, \quad, \qquad, \hfill etc.
%(or a blank line to force the subfigure onto a new line)
\begin{subfigure}[b]{0.475\textwidth}
\includegraphics[width=\textwidth]{sl_with_Z_15iter_fully_random_model}
\caption{Failure rate vs. acceptance rate. Data with unobservables and $N_{iter}=15$. Machine predictions with random model.}
\label{fig:random_predictions_with_Z}
\end{subfigure}
\caption{Failure rate vs. acceptance rate with varying levels of leniency. Machine predictions were made with a completely random model, i.e. $P(Y=0|X=x)=0.5$ for all $x$.}
\label{fig:random_predictions}
\end{figure}
\subsection{Modular framework -- Monte Carlo evaluator} \label{sec:modules_mc}
For these results, data was generated either with the module in algorithm \ref{alg:dg:coinflip_with_z} (drawing Y from a Bernoulli distribution with parameter $\pr(Y=0|X, Z, W)$, as previously) or with the module in algorithm \ref{alg:dg:threshold_with_Z} (assigning Y based on the value of $\sigma(\beta_XX+\beta_ZZ+\beta_WW)$). Decisions were determined using one of two modules: the module in algorithm \ref{alg:decider:quantile} (decisions based on quantiles) or \ref{alg:decider:human} (a "human" decision-maker as in \cite{lakkaraju17}). Curves were computed with the True evaluation (algorithm \ref{alg:eval:true_eval}), Labeled outcomes (\ref{alg:eval:labeled_outcomes}), Human evaluation (\ref{alg:eval:human_eval}), Contraction (\ref{alg:eval:contraction}) and Monte Carlo (\ref{alg:eval:mc}) evaluators. Results are presented in figure \ref{fig:modules_mc}. The corresponding MAEs are presented in table \ref{tab:modules_mc}.
From the results table we can see that the MAE is lowest when the data generating process corresponds closely to the Monte Carlo algorithm.
\begin{table}[H]
\centering
\caption{Mean absolute error w.r.t. true evaluation. See the modules used in section \ref{sec:modules_mc}. Bern = Bernoulli, indep. = independent, TH = threshold.}
\begin{tabular}{l | c c c c}
Method & Bern + indep. & Bern + non-indep. & TH + indep. & TH + non-indep.\\ \hline
Labeled outcomes & 0.111075 & 0.103235 & 0.108506 &\\
Human evaluation & 0.027298 & NaN (TBA) & 0.049582 &\\
Contraction & 0.004206 & 0.004656 & 0.005557 &\\
Monte Carlo & 0.001292 & 0.016629 & 0.009429 &\\
\end{tabular}
\label{tab:modules_mc}
\end{table}
\begin{figure}[H]
\centering
\begin{subfigure}[b]{0.475\textwidth}
\includegraphics[width=\textwidth]{sl_with_Z_10iter_coinflip_quantile_defaults_mc}
\caption{Outcome Y from Bernoulli, independent decisions using the quantiles and $N_{iter}=10$.}
%\label{fig:modules_mc_without_Z}
\end{subfigure}
\quad %add desired spacing between images, e. g. ~, \quad, \qquad, \hfill etc.
%(or a blank line to force the subfigure onto a new line)
\begin{subfigure}[b]{0.475\textwidth}
\includegraphics[width=\textwidth]{sl_with_Z_20iter_threshold_quantile_defaults_mc}
\caption{Outcome Y from threshold rule, independent decisions using the quantiles and $N_{iter}=20$.}
%\label{fig:modules_mc_with_Z}
\end{subfigure}
\begin{subfigure}[b]{0.475\textwidth}
\includegraphics[width=\textwidth]{sl_with_Z_10iter_coinflip_lakkarajudecider_defaults_mc}
\caption{Outcome Y from Bernoulli, non-independent decisions and $N_{iter}=10$.}
%\label{fig:modules_mc_without_Z}
\end{subfigure}
\quad %add desired spacing between images, e. g. ~, \quad, \qquad, \hfill etc.
%(or a blank line to force the subfigure onto a new line)
\begin{subfigure}[b]{0.475\textwidth}
\includegraphics[width=\textwidth]{sl_with_Z_4iter_threshold_lakkarajudecider_defaults_mc}
\caption{Outcome Y from threshold rule, non-independent decisions and $N_{iter}=4$.}
%\label{fig:modules_mc_with_Z}
\end{subfigure}
\caption{Failure rate vs. acceptance rate with varying levels of leniency and different combinations of decider and data generation modules. See the modules used in section \ref{sec:modules_mc}.}
\label{fig:modules_mc}
\end{figure}
\section{Modules}
Different types of modules (data generation, decider and evaluator) are presented in this section. A summary table is presented last. See section \ref{sec:modular_framework} for a more thorough breakdown of the properties of each module.
\begin{algorithm}[] % enter the algorithm environment
\caption{Data generation module: "coin-flip results" without unobservables} % give the algorithm a caption
\label{alg:dg:coinflip_without_z} % and a label for \ref{} commands later in the document
\begin{algorithmic}[1] % enter the algorithmic environment
\REQUIRE Parameters: Total number of subjects $N_{total}$
\ENSURE
\FORALL{$i$ in $1, \ldots, N_{total}$}
\STATE Draw $x_i$ from a standard Gaussian.
\STATE Attach to data.
\ENDFOR
\RETURN data
\end{algorithmic}
\end{algorithm}
\begin{algorithm}[] % enter the algorithm environment
\caption{Data generation module: "results by threshold" with unobservables} % give the algorithm a caption
\label{alg:dg:threshold_with_Z} % and a label for \ref{} commands later in the document
\begin{algorithmic}[1] % enter the algorithmic environment
\REQUIRE Parameters: Total number of subjects $N_{total},~\beta_X=1,~\beta_Z=1$ and $\beta_W=0.2$.
\ENSURE
\FORALL{$i$ in $1, \ldots, N_{total}$}
\STATE Draw $x_i, z_i$ and $w_i$ from standard Gaussians independently.
\IF{$\sigma(\beta_Xx_i+\beta_Zz_i+\beta_Ww_i) \geq 0.5$}
\STATE {Set $y_i$ to 0.}
\ELSE
\STATE {Set $y_i$ to 1.}
\ENDIF
\STATE Attach to data.
\ENDFOR
\RETURN data
\end{algorithmic}
\end{algorithm}
\begin{algorithm}[] % enter the algorithm environment
\caption{Data generation module: "coin-flip results" with unobservables} % give the algorithm a caption
\label{alg:dg:coinflip_with_z} % and a label for \ref{} commands later in the document
\begin{algorithmic}[1] % enter the algorithmic environment
\REQUIRE Parameters: Total number of subjects $N_{total},~\beta_X=1,~\beta_Z=1$ and $\beta_W=0.2$.
\ENSURE
\FORALL{$i$ in $1, \ldots, N_{total}$}
\STATE Draw $x_i, z_i$ and $w_i$ from standard Gaussians independently.
\STATE Draw $y_i$ from Bernoulli$(1-\sigma(\beta_Xx_i+\beta_Zz_i+\beta_Ww_i))$.
\STATE Attach to data.
\ENDFOR
\RETURN data
\end{algorithmic}
\end{algorithm}
%For decider modules, input as terms of knowledge and parameters should be as explicitly specified as possible.
\begin{algorithm}[] % enter the algorithm environment
\caption{Decider module: human judge as specified by Lakkaraju et al. \cite{lakkaraju17}} % give the algorithm a caption
\label{alg:decider:human} % and a label for \ref{} commands later in the document
\begin{algorithmic}[1] % enter the algorithmic environment
\REQUIRE Data with features $X, Z$ of size $N_{total}$, knowledge that both of them affect the outcome Y and that they are independent / Parameters: $M=100, \beta_X=1, \beta_Z=1$.
\STATE Sample an acceptance rate for each of the $M$ judges from Uniform$(0.1, 0.9)$ and round to the nearest tenth.
\STATE Calculate $\pr(T=0|X, Z) = \sigma(\beta_XX+\beta_ZZ) + \epsilon$, where $\epsilon \sim N(0, 0.1)$, for each observation and attach to data.
\STATE Sort the data by (1) the judges and (2) by the probabilities in descending order.
\STATE \hskip3.0em $\rhd$ Now the most dangerous subjects for each of the judges are at the top.
\STATE If subject belongs to the top $(1-r) \cdot 100 \%$ of observations assigned to that judge, set $T=0$ else set $T=1$.
\STATE Set $Y=$ NA if decision is negative ($T=0$). \emph{Might not be performed.}
\RETURN data with decisions.
\end{algorithmic}
\end{algorithm}
\begin{algorithm}[] % enter the algorithm environment
\caption{Decider module: "coin-flip decisions" (pseudo-leniencies set at 0.5)} % give the algorithm a caption
\label{alg:decider:coinflip} % and a label for \ref{} commands later in the document
\begin{algorithmic}[1] % enter the algorithmic environment
\REQUIRE Data with features $X, Z$ of size $N_{total}$, knowledge that both of them affect the outcome Y and that they are independent / Parameters: $\beta_X=1, \beta_Z=1$.
\STATE For each observation $i$, draw $t_i$ from Bernoulli$(1-\sigma(\beta_Xx_i+\beta_Zz_i))$, so that $\pr(T=0|X, Z) = \sigma(\beta_Xx_i+\beta_Zz_i)$.
\STATE Set $Y=$ NA if decision is negative ($T=0$). \emph{Might not be performed.}
\RETURN data with decisions.
\end{algorithmic}
\end{algorithm}
\begin{algorithm}[] % enter the algorithm environment
\caption{Decider module: "quantile decisions"} % give the algorithm a caption
\label{alg:decider:quantile} % and a label for \ref{} commands later in the document
\begin{algorithmic}[1] % enter the algorithmic environment
\REQUIRE Data with features $X, Z$ of size $N_{total}$, knowledge that both of them affect the outcome Y and that they are independent / Parameters: $\beta_X=1, \beta_Z=1$.
\ENSURE
\STATE Sample an acceptance rate for each of the $M$ judges from Uniform$(0.1, 0.9)$ and round to the nearest tenth.
\STATE Assign each observation to a judge at random.
\STATE Calculate $\pr(T=0|X, Z) = \sigma(\beta_XX+\beta_ZZ)$ for each observation and attach to data.
\FORALL{$i$ in $1, \ldots, N_{total}$}
\IF{$\sigma(\beta_Xx_i+\beta_Zz_i) \geq F^{-1}_{\pr(T=0|X, Z)}(r)$ \footnotemark} % Footnote text below algorithm
\STATE{Set $t_i=0$.}
\ELSE
\STATE{Set $t_i=1$.}
\ENDIF
\STATE Attach to data.
\ENDFOR
\STATE Set $Y=$ NA if decision is negative ($T=0$). \emph{Might not be performed.}
\RETURN data with decisions.
\end{algorithmic}
\end{algorithm}
\footnotetext{The inverse cumulative distribution function (quantile function) $F^{-1}$ was constructed by first sampling $10^7$ observations from $N(0, 2)$ (the sum of two standard Gaussians) and applying the logistic function $\sigma(x)$ (the inverse of the logit). The value of $F^{-1}(r)$ was then computed from this array using numpy's \texttt{quantile} function.}
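The construction in the footnote could be implemented as follows; sampling from $N(0, 2)$ (variance 2) uses standard deviation $\sqrt{2}$.
\begin{verbatim}
import numpy as np
from scipy.special import expit   # logistic function, the inverse of logit

rng = np.random.default_rng(0)

# Empirical distribution of sigma(X + Z), where X + Z ~ N(0, 2):
samples = expit(rng.normal(0.0, np.sqrt(2.0), size=10**7))

def F_inv(r):
    """Quantile function of P(T=0 | X, Z), read off with np.quantile."""
    return np.quantile(samples, r)
\end{verbatim}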
\begin{algorithm}[] % enter the algorithm environment
\caption{Evaluator module: Contraction algorithm \cite{lakkaraju17}} % give the algorithm a caption
\label{alg:eval:contraction} % and a label for \ref{} commands later in the document
\begin{algorithmic}[1] % enter the algorithmic environment
\REQUIRE Data $\D$ with properties $\{x_i, j_i, t_i, y_i\}$, acceptance rate r, knowledge that X affects Y
\STATE Split data to test set and training set.
\STATE Train a predictive model $\B$ on training data.
\STATE Estimate probability scores $\s$ using $\B$ for all observations in test data and attach to test data.
\STATE Let $q$ be the decision-maker with highest acceptance rate in $\D$.
\STATE $\D_q = \{(x, j, t, y) \in \D|j=q\}$
\STATE \hskip3.0em $\rhd$ $\D_q$ is the set of all observations judged by $q$
\STATE $\RR_q = \{(x, j, t, y) \in \D_q|t=1\}$
\STATE \hskip3.0em $\rhd$ $\RR_q$ is the set of observations in $\D_q$ with observed outcome labels
\STATE Sort observations in $\RR_q$ in descending order of confidence scores $\s$ and assign to $\RR_q^{sort}$.
\STATE \hskip3.0em $\rhd$ Observations deemed as high risk by the black-box model $\mathcal{B}$ are at the top of this list
\STATE Remove the top $[(1.0-r)|\D_q |]-[|\D_q |-|\RR_q |]$ observations of $\RR_q^{sort}$ and call this list $\mathcal{R_B}$
\STATE \hskip3.0em $\rhd$ $\mathcal{R_B}$ is the list of observations assigned to $t = 1$ by $\mathcal{B}$
\STATE Compute $\mathbf{u}=\sum_{i=1}^{|\mathcal{R_B}|} \dfrac{\delta\{y_i=0\}}{| \D_q |}$.
\RETURN $\mathbf{u}$
\end{algorithmic}
\end{algorithm}
\begin{algorithm}[] % enter the algorithm environment
\caption{Evaluator module: True evaluation} % give the algorithm a caption
\label{alg:eval:true_eval} % and a label for \ref{} commands later in the document
\begin{algorithmic}[1] % enter the algorithmic environment
\REQUIRE Data $\D$ with properties $\{x_i, t_i, y_i\}$ and \emph{all outcome labels}, acceptance rate r, knowledge that X affects Y
\ENSURE
\STATE Split data to test set and training set.
\STATE Train a predictive model $\B$ on training data.
\STATE Estimate probability scores $\s$ using $\B$ for all observations in test data and attach to test data.
\STATE Sort the data by the probabilities $\s$ to ascending order.
\STATE \hskip3.0em $\rhd$ Now the most dangerous subjects are last.
\STATE Calculate the number to release $N_{free} = |\D| \cdot r$.
\RETURN $\frac{1}{|\D|}\sum_{i=1}^{N_{free}}\delta\{y_i=0\}$
\end{algorithmic}
\end{algorithm}
\begin{algorithm}[] % enter the algorithm environment
\caption{Evaluator module: Labeled outcomes} % give the algorithm a caption
\label{alg:eval:labeled_outcomes} % and a label for \ref{} commands later in the document
\begin{algorithmic}[1] % enter the algorithmic environment
\REQUIRE Data $\D$ with properties $\{x_i, t_i, y_i\}$, acceptance rate r, knowledge that X affects Y
\ENSURE
\STATE Split data to test set and training set.
\STATE Train a predictive model $\B$ on training data.
\STATE Estimate probability scores $\s$ using $\B$ for all observations in test data and attach to test data.
\STATE Assign observations in test data with observed outcomes (T=1) to $\D_{observed}$.
\STATE Sort $\D_{observed}$ by the probabilities $\s$ to ascending order.
\STATE \hskip3.0em $\rhd$ Now the most dangerous subjects are last.
\STATE Calculate the number to release $N_{free} = |\D_{observed}| \cdot r$.
\RETURN $\frac{1}{|\D_{observed}|}\sum_{i=1}^{N_{free}}\delta\{y_i=0\}$
\end{algorithmic}
\end{algorithm}
\begin{algorithm}[] % enter the algorithm environment
\caption{Evaluator module: Human evaluation} % give the algorithm a caption
\label{alg:eval:human_eval} % and a label for \ref{} commands later in the document
\begin{algorithmic}[1] % enter the algorithmic environment
\REQUIRE Data $\D$ with properties $\{x_i, j_i, t_i, y_i\}$, acceptance rate r
\ENSURE
\STATE \emph{Split data to test set and training set and discard the training set.}
\STATE Assign judges with acceptance rate in $[r-0.05, r+0.05]$ to $\mathcal{J}$
\STATE $\D_{\mathcal{J}} = \{(x, j, t, y) \in \D~|~j \in \mathcal{J}\}$ and $\D_{released} = \{(x, j, t, y) \in \D_{\mathcal{J}}~|~t=1\}$
\STATE \hskip3.0em $\rhd$ Subjects judged \emph{and} released by judges with the correct leniency.
\RETURN $\frac{1}{|\D_{\mathcal{J}}|}\sum_{i=1}^{|\D_{released}|}\delta\{y_i=0\}$
\end{algorithmic}
\end{algorithm}
\begin{algorithm}[] % enter the algorithm environment
\caption{Evaluator module: Causal evaluator (?)} % give the algorithm a caption
\label{alg:eval:causal_eval} % and a label for \ref{} commands later in the document
\begin{algorithmic}[1] % enter the algorithmic environment
\REQUIRE Data $\D$ with properties $\{x_i, t_i, y_i\}$, acceptance rate r
\ENSURE
\STATE Split data to test set and training set.
\STATE Train a predictive model $\B$ on training data.
\STATE Estimate probability scores $\s$ using $\B$ for all observations in test data and attach to test data.
\FORALL{$x_i \in \D_{test}$}
\STATE Evaluate $F(x_i) = \int_{x\in\mathcal{X}} P_X(x)\delta(f(x)<f(x_i)) ~dx$ and append to $\mathcal{F}_{predictions}$
\ENDFOR
\STATE Create boolean array $T_{causal} = \mathcal{F}_{predictions} < r$.
\RETURN $\frac{1}{|\D_{test}|}\sum_{i=1}^{|\D_{test}|} \s_i \cdot T_{i, causal}$ which is equal to $\frac{1}{|\D|}\sum_{x\in\D} f(x)\delta(F(x) < r)$
\end{algorithmic}
\end{algorithm}
\begin{algorithm}[] % enter the algorithm environment
\caption{Evaluator module: Monte Carlo evaluator, imputation} % give the algorithm a caption
\label{alg:eval:mc} % and a label for \ref{} commands later in the document
\begin{algorithmic}[1] % enter the algorithmic environment
\REQUIRE Data $\D$ with properties $\{x_i, j_i, t_i, y_i\}$, acceptance rate r
\ENSURE
\STATE Split data to test set and training set.
\STATE Train a predictive model $\B$ on training data.
\STATE Estimate probability scores $\s$ using $\B$ for all observations in test data and attach to test data.
\STATE Sample $N_{sim}$ observations from a standard Gaussian and assign to Z.
\STATE Sample $N_{sim}$ observations from the sum of two standard Gaussians ($N(0, 2)$) and assign to \texttt{quants}.
\STATE Transform the values in \texttt{quants} using the logistic function $\sigma$ (the inverse of the logit).
\STATE Compute the values of the inverse cdf of the observations in \texttt{quants} for the acceptance rates r of each judge and assign to $Q_r$.
\FORALL{$i$ in $1, \ldots, N_{test}$}
\IF{$t_i = 0$}
\STATE{Take all $Z + \epsilon > logit(Q_{r,i})-x_i$, \footnotemark~where $\epsilon \sim N(0, 0.1)$.}
\ELSE
\STATE{Take all $Z + \epsilon < logit(Q_{r,i})-x_i$, where $\epsilon \sim N(0, 0.1)$.}
\ENDIF
\STATE Compute $\bar{z}=\frac{1}{n}\sum z$ over the values taken.
\STATE Draw predictions $\hat{p}_{i,y}$ from Bernoulli$(1-logit^{-1}(x_i+\bar{z}))$ and attach to data.
\ENDFOR
\STATE Impute missing observations using $\hat{p}_y$.
\STATE Sort the data by the probabilities $\s$ to ascending order.
\STATE \hskip3.0em $\rhd$ Now the most dangerous subjects are last.
\STATE Calculate the number to release $N_{free} = |\D_{test}| \cdot r$.
\RETURN Compute $\frac{1}{|\D_{test}|}\sum_{i=1}^{N_{free}}\delta\{y_i=0\}$ using the observed and imputed observations.
\end{algorithmic}
\end{algorithm}
\footnotetext{$logit^{-1}(x+z)>a \Leftrightarrow x+z > logit(a) \Leftrightarrow z > logit(a)-x$}
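A sketch of the core imputation step of the Monte Carlo evaluator, under the if/else reading reconstructed above; the pool size and helper name are my own choices.
\begin{verbatim}
import numpy as np
from scipy.special import expit, logit

rng = np.random.default_rng(0)
z_pool = rng.standard_normal(10**5)      # Monte Carlo samples of latent Z

def impute_y(x_i, t_i, q_r):
    """Impute an outcome for one subject given the judge's threshold q_r.

    q_r is F^{-1}(r) on the P(T=0 | X, Z) scale, so logit(q_r) - x_i
    bounds the Z values consistent with the observed decision.
    """
    eps = rng.normal(0.0, 0.1, size=z_pool.shape)
    bound = logit(q_r) - x_i
    keep = (z_pool + eps > bound) if t_i == 0 else (z_pool + eps < bound)
    z_bar = np.mean(z_pool[keep])        # average of the plausible Z values
    return rng.binomial(1, 1.0 - expit(x_i + z_bar))
\end{verbatim}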
\begin{table}[h!]
\caption{Summary of modules (under construction)}
\begin{tabular}{lll}
\toprule
\multicolumn{3}{c}{Module type} \\[.5\normalbaselineskip]
\textbf{Data generator} & \textbf{Decider} & \textbf{Evaluator} \\
\midrule
{\ul Without unobservables} & Independent decisions & {\ul Labeled outcomes} \\
& 1. flip a coin by & \tabitem Data $\D$ with properties $\{x_i, t_i, y_i\}$ \\
{\ul With unobservables} & $P(T=0|X, Z)$ & \tabitem acceptance rate r \\
\tabitem $P(Y=0|X, Z, W)$ & 2. determine with $F^{-1}(r)$ & \tabitem knowledge that X affects Y \\[.5\normalbaselineskip]
{\ul With unobservables} & Non-independent decisions & {\ul True evaluation} \\
\tabitem assign Y by & 3. sort by $P(T=0|X, Z)$ & \tabitem Data $\D$ with properties $\{x_i, t_i, y_i\}$ \\
"threshold rule" & and assign $t$ by $r$ & and \emph{all outcome labels} \\
& & \tabitem acceptance rate r \\
& & \tabitem knowledge that X affects Y \\[.5\normalbaselineskip]
& & {\ul Human evaluation} \\
& & \tabitem Data $\D$ with properties $\{x_i, j_i, t_i, y_i\}$ \\
& & \tabitem acceptance rate r \\[.5\normalbaselineskip]
& & {\ul Contraction} \\
& & \tabitem Data $\D$ with properties $\{x_i, j_i, t_i, y_i\}$ \\
& & \tabitem acceptance rate r \\
& & \tabitem knowledge that X affects Y \\[.5\normalbaselineskip]
& & {\ul Causal model (?)} \\
& & \tabitem Data $\D$ with properties $\{x_i, t_i, y_i\}$ \\
& & \tabitem acceptance rate r \\
& & \tabitem knowledge that X affects Y \\[.5\normalbaselineskip]
& & {\ul Monte Carlo evaluator} \\
& & \tabitem Data $\D$ with properties $\{x_i, j_i, t_i, y_i\}$ \\
& & \tabitem acceptance rate r \\
& & \tabitem knowledge that X affects Y \\
& & \tabitem more intricate knowledge about $\M$ ? \\
\bottomrule
\end{tabular}
\end{table}
\begin{thebibliography}{9}
\bibitem{dearteaga18}
De-Arteaga, Maria, Artur Dubrawski, and Alexandra Chouldechova. Learning Under Selective Labels in the Presence of Expert Consistency. 2018.
\bibitem{lakkaraju17}
Lakkaraju, Himabindu, Jon Kleinberg, Jure Leskovec, Jens Ludwig, and Sendhil Mullainathan. The Selective Labels Problem: Evaluating Algorithmic Predictions in the Presence of Unobservables. In Proceedings of KDD 2017.
\end{thebibliography}
\end{document}