diff --git a/analysis_and_scripts/notes.tex b/analysis_and_scripts/notes.tex
index fd7b61496ce2aa7c988202b78807ef732c929f9d..ff801dbe3a19e74de18f51e1dbb072610967ff6a 100644
--- a/analysis_and_scripts/notes.tex
+++ b/analysis_and_scripts/notes.tex
@@ -125,9 +125,11 @@ On acceptance rate R: We discussed how Lakkaraju's paper treats variable R in a
 
 If $c$ is defined so that the ratio of positive decisions to all decisions equals $r$, we arrive at a data generation process similar to Lakkaraju's, as presented in algorithm \ref{alg:data_with_Z}.
 
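+For illustration, one concrete way to choose such a threshold is to set $c$ to the empirical $(1-r)$-quantile of the decision scores, so that a fraction $r$ of the cases receives a positive decision. The sketch below is our own illustration of this idea, not code from Lakkaraju; all names in it are ours:
+
+\begin{verbatim}
+import numpy as np
+
+def threshold_for_acceptance_rate(scores, r):
+    # A positive decision is given iff the score exceeds c; taking c
+    # as the (1 - r)-quantile makes the share of positive decisions ~ r.
+    return np.quantile(scores, 1.0 - r)
+
+rng = np.random.default_rng(0)
+scores = rng.normal(size=10_000)   # stand-in for decision scores
+c = threshold_for_acceptance_rate(scores, r=0.3)
+print((scores > c).mean())         # prints approximately 0.3
+\end{verbatim}
+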
-Finally, chapter from Lakkaraju \cite{lakkaraju17} about counterfactual inference, see references from their paper [sic]:
+Finally, two passages from Lakkaraju et al.\ \cite{lakkaraju17} about counterfactual inference; the bracketed reference numbers are from their paper:
 
 \begin{quote}
+There has also been some work on inferring labels using counterfactual inference techniques [9, 12, 16, 36, 38] and leveraging these estimates when computing standard evaluation metrics. However, counterfactual inference techniques explicitly assume that there are no unmeasured confounders (that is, no unobservable variables Z) that could affect the outcome Y. This assumption does not typically hold in cases where human decisions are providing data labels [7, 21]. Thus, the combination of two ingredients -- selective labels and non-trivial unobservables -- poses problems for these existing techniques.
+
 Counterfactual inference. Counterfactual inference techniques have been used extensively to estimate treatment effects in observational studies. These techniques have found applications in a variety of fields such as machine learning, epidemiology, and sociology [3, 8--10, 30, 34]. Along the lines of Johansson et al. [16], counterfactual inference techniques can be broadly categorized as: (1) parametric methods which model the relationship between observed features, treatments, and outcomes. Examples include any type of regression model such as linear and logistic regression, random forests and regression trees [12, 33, 42]. (2) non-parametric methods such as propensity score matching, nearest-neighbor matching, which do not explicitly model the relationship between observed features, treatments, and outcomes [4, 15, 35, 36, 41]. (3) doubly robust methods which combine the two aforementioned classes of techniques typically via a propensity score weighted regression [5, 10]. The effectiveness of parametric and non-parametric methods depends on the postulated regression model and the postulated propensity score model respectively. If the postulated models are not identical to the true models, then these techniques result in biased estimates of outcomes. Doubly robust methods require only one of the postulated models to be identical to the true model in order to generate unbiased estimates. However, due to the presence of unobservables, we cannot guarantee that either of the postulated models will be identical to the true models.
 \end{quote}
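+
+As a concrete toy illustration of the doubly robust idea in the quote above, the following sketch combines an outcome regression with an inverse-propensity correction (the AIPW estimator of $E[Y(1)]$). It assumes fully observed confounders, i.e.\ exactly the assumption that fails in the selective labels setting, and every name in it is ours:
+
+\begin{verbatim}
+import numpy as np
+from sklearn.linear_model import LogisticRegression
+
+def aipw_mean_outcome(X, T, Y):
+    # Doubly robust (AIPW) estimate of E[Y(1)] for binary Y:
+    # an outcome model plus an inverse-propensity correction;
+    # unbiased if either of the two postulated models is correct.
+    propensity = LogisticRegression().fit(X, T).predict_proba(X)[:, 1]
+    outcome = LogisticRegression().fit(X[T == 1], Y[T == 1])
+    mu1 = outcome.predict_proba(X)[:, 1]
+    return np.mean(mu1 + T * (Y - mu1) / propensity)
+\end{verbatim}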
 
@@ -141,7 +143,7 @@ Next, all of the generated data goes to the \textbf{labeling process}. In th
 
 After labeling, the labeled training data is used to train a machine that makes either decisions or predictions from some features of the data. The test data is then given to the machine, which outputs, for every instance in the test data set, either a binary decision (yes/no), a probability (a real number in the interval $[0, 1]$), or a position in an ordering. The machine will be denoted by $\M$.
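+
+For example, with a logistic regression model (the model also used in the experiments below), the three kinds of output could be produced as follows; the toy data here is ours and only for illustration:
+
+\begin{verbatim}
+import numpy as np
+from sklearn.linear_model import LogisticRegression
+
+rng = np.random.default_rng(0)
+X_train = rng.normal(size=(1000, 2))             # toy features
+y_train = (X_train.sum(axis=1) > 0).astype(int)  # toy labels
+X_test = rng.normal(size=(500, 2))
+
+M = LogisticRegression().fit(X_train, y_train)   # the machine M
+probabilities = M.predict_proba(X_test)[:, 1]    # reals in [0, 1]
+decisions = (probabilities > 0.5).astype(int)    # binary yes/no
+ordering = np.argsort(-probabilities)            # ordering of instances
+\end{verbatim}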
 
-Finally the decisions and/or predictions made by the machine $\M$ and human judges (see dashed arrow in figure \ref{fig:framework}) will be evaluated using an \textbf{evaluation algorithm}. Evaluation algorithms will take the decisions, probabilities or ordering generated in the previous steps as input and then output an estimate of the failure rate. \textbf{Failure rate (FR)} is defined as the ratio of undesired outcomes to given decisions. One special characteristic of FR in this setting is that a failure can only occur with a positive decision. More explicitly \[ FR = \dfrac{\#\{Failures\}}{\#\{Decisions\}}. \] Second characteristic of FR is that the number of positive decisions and therefore FR itself can be controlled through acceptance rate defined above.
+Finally, the decisions and/or predictions made by the machine $\M$ and the human judges (see the dashed arrow in figure \ref{fig:framework}) are evaluated using an \textbf{evaluation algorithm}. Evaluation algorithms take the decisions, probabilities or ordering generated in the previous steps as input and output an estimate of the machine's failure rate in the test data. The \textbf{failure rate (FR)} is defined as the ratio of undesired outcomes to given decisions. One special characteristic of FR in this setting is that a failure can only occur with a positive decision. More explicitly, \[ FR = \dfrac{\#\{Failures\}}{\#\{Decisions\}}. \] A second characteristic of FR is that the number of positive decisions, and therefore FR itself, can be controlled through the acceptance rate defined above.
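+
+A minimal sketch of this definition (our own illustration; we use the convention that outcome $0$ is the undesired one):
+
+\begin{verbatim}
+import numpy as np
+
+def failure_rate(decisions, outcomes):
+    # FR = #{failures} / #{decisions}; a failure can only be
+    # observed after a positive decision (decision == 1).
+    failures = np.sum((decisions == 1) & (outcomes == 0))
+    return failures / len(decisions)
+
+decisions = np.array([1, 1, 0, 1, 0])
+outcomes  = np.array([1, 0, 1, 1, 0])
+print(failure_rate(decisions, outcomes))  # 1 failure / 5 decisions = 0.2
+\end{verbatim}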
 
 Given the above framework, the goal is to create an evaluation algorithm that can accurately estimate the failure rate of any model $\M$ if it were to replace the human decision makers in the labeling process. The estimates have to be made using only data that the human decision makers have labeled, and the failure rate has to be estimated accurately at various levels of the acceptance rate. The accuracy of the estimates can be compared by computing, e.g., the mean absolute error w.r.t.\ the estimates given by the \nameref{alg:true_eval} algorithm.
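+
+For instance, if an evaluator produces failure-rate estimates over a grid of acceptance rates, its accuracy against the true evaluation could be summarized as below (the numeric values are made-up placeholders, only for illustration):
+
+\begin{verbatim}
+import numpy as np
+
+def mean_absolute_error(estimated_fr, true_fr):
+    # Pointwise comparison of two failure-rate curves over the
+    # same grid of acceptance rates.
+    return np.mean(np.abs(np.asarray(estimated_fr) - np.asarray(true_fr)))
+
+# FR estimates at acceptance rates 0.1, 0.2, ..., 0.9 (placeholders)
+estimated = [0.02, 0.04, 0.07, 0.11, 0.16, 0.22, 0.29, 0.37, 0.46]
+true_fr   = [0.02, 0.05, 0.08, 0.12, 0.17, 0.23, 0.30, 0.38, 0.47]
+print(mean_absolute_error(estimated, true_fr))  # ~0.009
+\end{verbatim}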
 
@@ -180,7 +182,7 @@ Given the above framework, the goal is to create an evaluation algorithm that ca
           \node[state] (EA)  [below right=0.75cm and -4cm of MD] {Evaluation algorithm};
         
           \path (DG) edge (LP)
-                (LP) edge [bend left=-15] node [right, pos=0.6] {$\D_{train}$} (MT)
+                (LP) edge [bend left=-19] node [right, pos=0.6] {$\D_{train}$} (MT)
                       edge [bend left=45] node [right]  {$\D_{test}$} (MD)
                      edge [bend left=70, dashed] node [right]  {$\D_{test}$}  (EA)
                 (MT) edge node {$\M$} (MD)
@@ -414,14 +416,14 @@ The differences between figures \ref{fig:results_without_Z} and \ref{fig:betaZ_0
     \centering
     \begin{subfigure}[b]{0.5\textwidth}
         \includegraphics[width=\textwidth]{sl_with_Z_4iter_betaZ_1_5}
-        \caption{With unobservables, $\beta_Z$ set to 1.5 in algorithm \ref{alg:data_with_Z}.}
+        \caption{Results with unobservables, $\beta_Z$ set to 1.5 in algorithm \ref{alg:data_with_Z}.}
         \label{fig:betaZ_1_5}
     \end{subfigure}
     ~ %add desired spacing between images, e. g. ~, \quad, \qquad, \hfill etc. 
       %(or a blank line to force the subfigure onto a new line)
     \begin{subfigure}[b]{0.5\textwidth}
         \includegraphics[width=\textwidth]{sl_with_Z_4iter_beta0}
-        \caption{With unobservables, $\beta_Z$ set to 0 in algorithm \ref{alg:data_with_Z}.}
+        \caption{Results with unobservables, $\beta_Z$ set to 0 in algorithm \ref{alg:data_with_Z}.}
         \label{fig:betaZ_0}
     \end{subfigure}
     \caption{Effect of $\beta_Z$. Failure rate vs. acceptance rate with unobservables in the data (see algorithm \ref{alg:data_with_Z}). Logistic regression was trained on labeled training data. Results from algorithm \ref{alg:perf_comp} with $N_{iter}=4$.}