This document presents the implementations of RL at the pseudocode level. First, I present most of the nomenclature used in these notes. Then I proceed to give my personal views and comments on the motivation behind the Selective labels paper. In sections \ref{sec:framework} and \ref{sec:modular_framework}, I present two frameworks for the selective labels problem. The latter of these has subsequently been established as the current framework, but the former is included for documentation purposes. In the following sections, I present the data generating algorithms and the algorithms for obtaining failure rates with different methods in the old framework. Results are presented in section \ref{sec:results}. All the algorithms used in the modular framework are presented with motivations in section \ref{sec:modules}.
For notes of the meetings please refer to the \href{https://helsinkifi-my.sharepoint.com/personal/rikulain_ad_helsinki_fi/Documents/Meeting_Notes.docx?web=1}{{\ul word document}}.
\end{abstract}
\section*{Terms and abbreviations}
...
...
\subsection{Without unobservables (see also algorithm \ref{alg:data_without_Z})}
In the setting without unobservables Z, we first sample an acceptance rate $r$ for each of the $M=100$ judges uniformly from the half-open interval $[0.1; 0.9)$. Then we assign 500 unique subjects to each of the judges at random (50000 in total) and simulate their features X as i.i.d. standard Gaussian random variables with zero mean and unit variance. The probability of a negative outcome is then calculated as
\begin{equation}\label{eq:inv_logit}
P(Y=0|X=x) = \dfrac{1}{1+\exp(-x)}=logit^{-1}(x).
\end{equation}
Because $P(Y=1|X=x)=1-P(Y=0|X=x)=1-logit^{-1}(x)$, the outcome variable Y can be sampled from a Bernoulli distribution with parameter $1-logit^{-1}(x)$. The data is then sorted for each judge by the probabilities $P(Y=0|X=x)$ in descending order. If a subject is in the top $(1-r)\cdot100\%$ of the observations assigned to a judge, the decision variable T is set to zero and otherwise to one.
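To make the procedure concrete, below is a minimal Python (NumPy) sketch of this generating process. The variable names and the random seed are illustrative; parameter values are as above.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
M, n_per_judge = 100, 500
N = M * n_per_judge

# Leniency r for each judge from U(0.1, 0.9), rounded to the nearest tenth.
r = np.round(rng.uniform(0.1, 0.9, size=M), 1)

judge = np.repeat(np.arange(M), n_per_judge)   # 500 subjects per judge
x = rng.normal(size=N)                         # feature X ~ N(0, 1)
p_y0 = 1 / (1 + np.exp(-x))                    # P(Y = 0 | X = x)
y = rng.binomial(1, 1 - p_y0)                  # Y ~ Bernoulli(1 - p_y0)

# Each judge denies the (1 - r)*100% of their subjects with the highest
# probability of a negative outcome.
t = np.ones(N, dtype=int)
for j in range(M):
    idx = np.where(judge == j)[0]
    riskiest_first = idx[np.argsort(-p_y0[idx])]
    n_denied = int(np.ceil((1 - r[j]) * len(idx)))
    t[riskiest_first[:n_denied]] = 0
\end{verbatim}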
\begin{algorithm}[] % enter the algorithm environment
\caption{Create data without unobservables}% give the algorithm a caption
...
...
\ENSURE
\STATE Sample acceptance rates for each of the M judges from $U(0.1; 0.9)$ and round to the nearest tenth.
\STATE Sample features X for each of the $N_{total}$ observations from a standard Gaussian.
\STATE Calculate $P(Y=0|X=x)=logit^{-1}(x)$ for each observation.
\STATE Sample Y from a Bernoulli distribution with parameter $1-logit^{-1}(x)$.
\STATE Sort the data by (1) the judges and (2) by probabilities $P(Y=0|X=x)$ in descending order.
\STATE\hskip3.0em $\rhd$ Now the most dangerous subjects for each of the judges are at the top.
\STATE If subject belongs to the top $(1-r)\cdot100\%$ of observations assigned to a judge, set $T=0$ else set $T=1$.
...
...
In the setting with unobservables Z, we first sample an acceptance rate r for each of the $M=100$ judges uniformly from the half-open interval $[0.1; 0.9)$. Then we assign 500 unique subjects (50000 in total) to each of the judges at random and simulate their features X, Z and W as i.i.d. standard Gaussian random variables with zero mean and unit variance. The probability of a negative outcome is then calculated as
\begin{equation}
P(Y=0|X, Z, W) = \dfrac{1}{1+\exp(-\beta_XX-\beta_ZZ-\beta_WW)} = logit^{-1}(\beta_XX+\beta_ZZ+\beta_WW),
\end{equation}
where $\beta_X=\beta_Z =1$ and $\beta_W=0.2$. Next, the value of the outcome Y is set to 0 if $P(Y =0| X, Z, W)\geq0.5$ and to 1 otherwise. The conditional probability of a negative decision (T=0) is defined as
\begin{equation}
P(T=0|X, Z) = logit^{-1}(\beta_XX+\beta_ZZ)+\epsilon,
\end{equation}
where $\epsilon\sim N(0, 0.1)$. Next, the data is sorted for each judge by the probabilities $P(T=0|X, Z)$ in descending order. If a subject is in the top $(1-r)\cdot100\%$ of the observations assigned to a judge, the decision variable T is set to zero and otherwise to one.
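A corresponding Python sketch of the parts that differ from the previous setting (coefficient values as above; names and the seed are illustrative):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
N = 50000
x, z, w = rng.normal(size=(3, N))            # X, Z, W ~ N(0, 1) independently
p_y0 = 1 / (1 + np.exp(-(x + z + 0.2 * w)))  # beta_X = beta_Z = 1, beta_W = 0.2
y = np.where(p_y0 >= 0.5, 0, 1)              # thresholded outcome

eps = rng.normal(0, np.sqrt(0.1), size=N)    # N(0, 0.1); 0.1 is the variance
p_t0 = 1 / (1 + np.exp(-(x + z))) + eps      # decisions depend on X and Z only

# Sorting within judges and setting T proceeds exactly as in the previous
# sketch, but using p_t0 in place of p_y0.
\end{verbatim}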
...
...
\end{algorithm}
\section{Results}\label{sec:results}
Results obtained from running algorithm \ref{alg:perf_comp} are presented in table \ref{tab:results} and figure \ref{fig:results}. All parameters are at their default values and a logistic regression model is trained.
\begin{table}[H]
\centering
...
...
\caption{Failure rate with varying levels of leniency without unobservables. Noise has been added to the decision probabilities. Logistic regression was trained on labeled training data.}
\label{fig:sigma_figure}
\end{figure}
...
...
\subsection{Modular framework -- Monte Carlo evaluator}\label{sec:modules_mc}
For these results, data was generated either with the module in algorithm \ref{alg:dg:coinflip_with_z} (drawing Y from a Bernoulli distribution with parameter $\pr(Y=0|X, Z, W)$ as previously) or with the module in algorithm \ref{alg:dg:threshold_with_Z} (assigning Y based on the value of $logit^{-1}(\beta_XX+\beta_ZZ)$). Decisions were determined using one of two modules: the module in algorithm \ref{alg:decider:quantile} (decisions based on quantiles) or \ref{alg:decider:lakkaraju} ("human" decision-maker as in \cite{lakkaraju17}). Curves were computed with the True evaluation (algorithm \ref{alg:eval:true_eval}), Labeled outcomes (\ref{alg:eval:labeled_outcomes}), Human evaluation (\ref{alg:eval:human_eval}), Contraction (\ref{alg:eval:contraction}) and Monte Carlo (\ref{alg:eval:mc}) evaluators. Results are presented in figure \ref{fig:modules_mc}. The corresponding MAEs are presented in table \ref{tab:modules_mc}.
From the result table we can see that the MAE is lowest when the data generating process corresponds closely to the Monte Carlo algorithm.
...
...
\label{fig:modules_mc}
\end{figure}
\section{Modules}\label{sec:modules}
Different types of modules (data generation, decider and evaluator) are presented in this section. A summary table is presented last. See section \ref{sec:modular_framework} for some discussion on the properties of each module.
\subsection{Data generation modules}
The purpose of a data generation module is to define a process which generates all of the data, consisting of all the features and outcomes. The data generation module should include information on the distributions of the features and on how each of them affects the generation of the outcome.

We have three different kinds of data generating modules (DG modules). The DG modules can be separated by two features: the inclusion of unobservables and the outcome generating mechanism. Private features are mainly sampled from standard Gaussians. See table \ref{tab:dg_modules} for a summary of the DG modules.
\ENSURE
\FORALL{observations}
\STATE Draw $x, z$ and $w$ from standard Gaussians independently.
\STATE Draw $y$ from Bernoulli$(1-logit^{-1}(\beta_Xx+\beta_Zz+\beta_Ww))$.
\ENDFOR
\RETURN data
\end{algorithmic}
...
...
\subsection{Decider modules}
We have three different kinds of decider modules. Their distinguishing feature is whether the decisions are independent: for example, in algorithm \ref{alg:decider:lakkaraju} the decisions of a decision-maker depend on the other subjects assigned to that decision-maker.
Below is presented the decision-maker of Lakkaraju et al. \cite{lakkaraju17}. The decision-maker (1) takes all the subjects as a batch, (2) makes an approximation of the subjects' tendencies for negative outcomes and (3) assigns the decisions by giving a positive decision to the $r\cdot100\%$ of subjects least likely to fail. The resulting decisions are not independent as they depend on the presence of the other observations.
\begin{algorithm}[H] % enter the algorithm environment
\caption{Decider module: decision-maker by Lakkaraju et al. \cite{lakkaraju17}}% give the algorithm a caption
\label{alg:decider:lakkaraju}% and a label for \ref{} commands later in the document
\begin{algorithmic}[1] % enter the algorithmic environment
\REQUIRE Data with features $X, Z$, knowledge that both of them affect the outcome Y and that they are independent / Parameters: $M=100, \beta_X=1, \beta_Z=1$.
\ENSURE
\STATE Sample acceptance rates for each of the M judges from Uniform$(0.1; 0.9)$ and round to the nearest tenth.
\STATE Assign each observation to a judge at random.
\STATE Calculate $\pr(T=0|X, Z)=logit^{-1}(\beta_XX+\beta_ZZ)+\epsilon$ for each observation and attach to data.
\STATE Sort the data by (1) the judges and (2) by the probabilities in descending order.
\STATE If subject belongs to the top $(1-r)\cdot100\%$ of observations assigned to that judge, set $T=0$ else set $T=1$.
\STATE Set $Y=$ NA if decision is negative ($T=0$).
\RETURN data with decisions.
\end{algorithmic}
\end{algorithm}
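The same decider can be expressed as a Python function over a pandas data frame. This is a sketch: the column names \texttt{x}, \texttt{z} and \texttt{Y}, the seed and the noise term are assumptions following the text above.
\begin{verbatim}
import numpy as np
import pandas as pd

def lakkaraju_decider(df, M=100, beta_x=1.0, beta_z=1.0, seed=0):
    """Each judge denies the riskiest (1 - r)*100% of their subjects."""
    rng = np.random.default_rng(seed)
    df = df.copy()
    df["judge"] = rng.integers(0, M, size=len(df))
    leniency = np.round(rng.uniform(0.1, 0.9, size=M), 1)
    df["r"] = leniency[df["judge"]]
    eps = rng.normal(0, np.sqrt(0.1), size=len(df))
    df["p_t0"] = 1 / (1 + np.exp(-(beta_x * df["x"] + beta_z * df["z"]))) + eps
    df["T"] = 1
    for j, grp in df.groupby("judge"):
        n_denied = int(np.ceil((1 - grp["r"].iloc[0]) * len(grp)))
        df.loc[grp["p_t0"].nlargest(n_denied).index, "T"] = 0
    df.loc[df["T"] == 0, "Y"] = np.nan       # outcome unobserved when denied
    return df
\end{verbatim}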
One discussed way of making the decisions independent was to "flip a coin at some probability". An implementation of that idea is presented below in algorithm \ref{alg:decider:coinflip}. As $\pr(T=0|X, Z)=logit^{-1}(\beta_XX+\beta_ZZ)$, the parameter of the Bernoulli distribution is set to $1-logit^{-1}(\beta_XX+\beta_ZZ)$. In the practical implementation, as some algorithms need to know the leniency of the decision-maker, the acceptance rate is then computed from the decisions.
\begin{algorithm}[H] % enter the algorithm environment
\caption{Decider module: decisions from Bernoulli}% give the algorithm a caption
...
...
@@ -775,18 +789,16 @@ One discussed way of making the decisions independent was to "flip a coin at som
\begin{algorithmic}[1] % enter the algorithmic environment
\REQUIRE Data with features $X, Z$, knowledge that both of them affect the outcome Y and that they are independent / Parameters: $\beta_X=1, \beta_Z=1$.
\ENSURE
\STATE Draw $t$ from Bernoulli$(1-logit^{-1}(\beta_Xx+\beta_Zz))$ for all observations.
\STATE Compute the acceptance rate.
\STATE Set $Y=$ NA if decision is negative ($T=0$). \emph{Optional.}
\RETURN data with decisions.
\end{algorithmic}
\end{algorithm}
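A minimal Python sketch of the coin-flip decider, assuming features generated as above (names and the seed are illustrative):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
x, z = rng.normal(size=(2, 50000))        # features as generated above
p_t0 = 1 / (1 + np.exp(-(x + z)))         # P(T = 0 | X, Z), beta_X = beta_Z = 1
t = rng.binomial(1, 1 - p_t0)             # T ~ Bernoulli(1 - p_t0), independent
acceptance_rate = t.mean()                # leniency computed from the decisions
# Optionally set Y to NA where T = 0, as in the batch decider.
\end{verbatim}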
A quantile-based decider module is presented in algorithm \ref{alg:decider:quantile}. The algorithm tries to emulate Lakkaraju's decision-maker while giving out independent decisions. The independence is achieved by comparing the values of $logit^{-1}(\beta_Xx+\beta_Zz)$ to the corresponding value of the inverse cumulative distribution function $F^{-1}_{logit^{-1}(\beta_XX+\beta_ZZ)}$, or $F^{-1}$ in short. The derivation of $F^{-1}$ is deferred to the next section. By the law of large numbers, the fraction of positive decisions is guaranteed to converge to $r$.
\textbf{Example} Consider a decision-maker with leniency 0.60 who gets a new subject $\{x, z\}$ with a predicted probability $logit^{-1}(\beta_Xx+\beta_Zz)\approx0.7$ for a negative outcome with some coefficients $\beta$. As the judge has leniency 0.6, their cut-point is $F^{-1}(0.60)\approx0.65$. That is, the judge will not give a positive decision to anyone with a failure probability greater than 0.65, so our example subject will receive a negative decision.
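One simple way to approximate $F^{-1}$ numerically, and the resulting decision rule, sketched in Python before the pseudocode below (the sample size of $10^7$ and all names are illustrative):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
# Reference sample of logit^{-1}(beta_X X + beta_Z Z) with X, Z ~ N(0, 1).
ref = 1 / (1 + np.exp(-(rng.normal(size=10**7) + rng.normal(size=10**7))))
F_inv = lambda q: np.quantile(ref, q)     # empirical quantile function

# Decision for a single new subject (x, z) by a judge with leniency r:
x, z, r = 0.5, 0.4, 0.60
p = 1 / (1 + np.exp(-(x + z)))            # predicted P(Y = 0 | x, z)
t = 0 if p >= F_inv(r) else 1             # deny if above the r-th quantile
\end{verbatim}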
\begin{algorithm}[H] % enter the algorithm environment
\caption{Decider module: "quantile decisions"}% give the algorithm a caption
...
...
\ENSURE
\STATE Sample acceptance rates for each of the M judges from Uniform$(0.1; 0.9)$ and round to the nearest tenth.
\STATE Assign each observation to a judge at random.
\STATE Construct the quantile function $F^{-1}(q)$.
\STATE Calculate $\pr(T=0|X, Z)= logit^{-1}(\beta_XX+\beta_ZZ)$ for all observations.
\STATE If $logit^{-1}(\beta_Xx+\beta_Zz)\geq F^{-1}(r)$ set $t=0$, otherwise set $t=1$.
\STATE Set $Y=$ NA if decision is negative ($T=0$). \emph{Optional.}
Evaluator modules take some version of data as input and output an estimate of the failure rate given the input. More discussion on the evaluator module is in section \ref{sec:modular_framework}.
...
...
The True evaluation module computes the "true failure rate" of a predictive model \emph{had it been deployed to make independent decisions}. Computing the true failure rate "had the model been deployed" requires all outcome labels, which is why the true failure rate can only be computed on synthetic data.
In practice, the module first trains a model $\B$ and assigns each observation with a probability score $\s$ using it as described above. Then the observations are sorted in ascending order by the scores so that most risky subjects are last (subjects with the highest predicted probability for a negative outcome). Taking the first $r \cdot100\%$ of observations, the true failure rate can be computed straight from the ground truth.
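A sketch of this computation in Python. The definition of the failure rate used here (failures among the released divided by the total number of subjects) is an assumption of this sketch; the scores $\s$ come from the fitted model $\B$.
\begin{verbatim}
import numpy as np

def true_evaluation(y, scores, r):
    """True failure rate had the model released the r*100% of subjects it
    considers least risky.  Requires all outcome labels (synthetic data).
    Assumes failure rate = (# failures among released) / (# all subjects)."""
    order = np.argsort(scores)                 # ascending: least risky first
    released = order[:int(r * len(y))]
    return np.sum(y[released] == 0) / len(y)
\end{verbatim}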
\begin{algorithm}[H] % enter the algorithm environment
\caption{Evaluator module: True evaluation}% give the algorithm a caption
...
...
\end{algorithmic}
\end{algorithm}
The performance of human decision-makers (the decider in the modular framework) is evaluated with algorithm \ref{alg:eval:human_eval}. From Lakkaraju et al. \cite{lakkaraju17}:
\begin{quote}
This [failure rate estimation for human decision-makers] can be done by grouping decision-makers with similar values of acceptance rate into bins and treating each bin as a single hypothetical decision-maker. We can then compute the failure rate and acceptance rate values for each such bin and plot them as a curve. We refer to this curve as the human evaluation curve.
\end{quote}
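A sketch of this binning in Python. The bin half-width of 0.05 and the failure-rate denominator are assumptions of the sketch; column names follow the algorithm below.
\begin{verbatim}
import pandas as pd

def human_evaluation(df, r, width=0.05):
    """Failure rate of the judges whose acceptance rate lies within
    [r - width, r + width].  Failures can only be observed among the
    released subjects (Y is NA when T = 0)."""
    binned = df[(df["r"] - r).abs() <= width]
    failures = ((binned["T"] == 1) & (binned["Y"] == 0)).sum()
    return failures / len(binned)
\end{verbatim}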
\begin{algorithm}[H] % enter the algorithm environment
...
...
\begin{algorithmic}[1] % enter the algorithmic environment
\REQUIRE Data $\D$ with properties $\{j_i, t_i, y_i\}$, acceptance rate r
\ENSURE
\STATE Assign judges with similar acceptance rate (e.g. acceptance rate in $[r-0.05, r+0.05]$) to $\mathcal{J}$
%The basic idea of the Monte Carlo approach below: predict Z and impute Y based on it. Explain everything and simplify.
The latest approach to the problem is presented below in algorithm \ref{alg:eval:mc}. The high-level idea is to use the counterfactual outcomes Y(1) to impute the missing outcomes and then compute the failure rate. To infer the probability $\pr(Y(1)=0)$, we need to make some inference about the latent variable Z. We can always infer some information about Z. If features X and the decision T are contradictory, we can infer more about Z than when X and T are aligned.
The algorithm is based on the following equation which expresses the posterior probability of Z after observing T, X and R:
\begin{equation}\label{eq:posterior_Z}
...
...
\end{equation}
In equations \ref{eq:Tprob} and \ref{eq:Tdet}, $\pr(Y=0|x, z, DG)$ is the predicted probability of a negative outcome given $x$ and $z$. The probability $\pr(Y=0|x, z, DG)$ is predicted by the judge, and here we used the approximation
\begin{equation}
\pr(Y=0|x, z, DG) = logit^{-1}(\beta_Xx+\beta_Zz)
\end{equation}
which is an increasing function of $z$ when $x$ is given. We do not know the $\beta$ coefficients, so here we used the knowledge that they are equal to one. (In the future, they should be inferred.)
where the parameters are as discussed and erf is the error function.
With this knowledge, it can be stated that if we observed $T=0$ with some $x$ and $r$ it must have been that $logit^{-1}(\beta_Xx+\beta_Zz)\geq F^{-1}(r)$. Using basic algebra we obtain that
\begin{equation}\label{eq:bounds}
logit^{-1}(x + z) \geq F^{-1}(r) \Leftrightarrow x+z \geq logit(F^{-1}(r)) \Leftrightarrow z \geq logit(F^{-1}(r)) - x
\end{equation}
as the logit and its inverse are strictly increasing functions and hence preserve ordering for all pairs of values in their domains. From equations \ref{eq:posterior_Z}, \ref{eq:Tprob} and \ref{eq:bounds} we can conclude that $\pr(Z < logit(F^{-1}(r))- x | T=0, X=x, R=r)=0$ and that elsewhere the distribution of Z follows a truncated Gaussian with a lower bound of $logit(F^{-1}(r))- x$. The expectation of Z can be computed analytically. All this follows analogously for cases with $T=1$, with some inequalities reversed.
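For reference, since the prior of Z is a standard Gaussian, these truncated expectations are given by inverse Mills ratios: with truncation point $a = logit(F^{-1}(r)) - x$, truncation from below corresponds to the case $T=0$ and truncation from above to the case $T=1$,
\begin{equation}
E(Z | Z \geq a) = \frac{\phi(a)}{1-\Phi(a)}, \qquad E(Z | Z < a) = -\frac{\phi(a)}{\Phi(a)},
\end{equation}
where $\phi$ and $\Phi$ are the density and cumulative distribution function of the standard Gaussian.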
In practice, in lines 1--3 and 10--13 of algorithm \ref{alg:eval:mc} we proceed as in the True evaluation algorithm, with the distinction that some of the values of Y are imputed using the corresponding counterfactual probabilities. In line 4 we compute the bounds as motivated above. In the for-loop (lines 5--8) we compute the expectation of Z given the decision and the knowledge that the distribution of Z follows a truncated Gaussian. The equation
\begin{equation}
\hat{z} = (1-t) \cdot E(Z | Z > Q_r) + t \cdot E(Z | Z < Q_r)
\end{equation}
computes the correct expectation automatically. Using the expectation, we then compute the probability of the counterfactual $\pr(Y(1)=0)$ (the probability of a negative outcome had a positive decision been given). In line 9 the imputation can be performed in a couple of ways: either by taking a random guess with probability $\pr(Y(1)=0)$ or by assigning the most likely value to Y.
\begin{algorithm}[H] % enter the algorithm environment
\caption{Evaluator module: Monte Carlo evaluator, imputation}% give the algorithm a caption