@@ -204,6 +204,8 @@ The Gaussians were restricted to the positive real axis and both had mean $0$ an
\acomment{Needs updating?}
\rcomment{A similar decision maker has been proposed by \citet{kleinberg2018human}, see p.~256. They formalize the decision maker's threshold as a trade-off point between the costs of incarceration and of committing a crime (as evaluated by that judge).}
In section \ref{sec:decisionmakers} we presented an {\it independent} decision maker.
%
Here we motivate it.
...
...
@@ -230,11 +232,11 @@ The decision is assigned deterministically based on the features:
%
In the above equation, $\prob{\outcome=0|~\obsFeatures=\obsFeaturesValue, \unobservable=\unobservableValue}$ is the probability of a negative outcome given the features, as predicted by the judge.
%
The prediction is computed with equation \ref{eq:judgemodel} and it assumes that the judge is nearly perfect, i.e. that $\gamma_\unobservable\approx\beta_\unobservable$ and $\gamma_\obsFeatures\approx\beta_\obsFeatures$.
We note that the right hand side of equation \ref{eq:defendantmodel} defines a random variable when the values of \obsFeatures and \unobservable are not known.
%
The random variable is a logistic transformation of the sum of two Gaussian random variables and hence follows a \emph{logit-normal distribution}.
%
The inverse cumulative distribution function $F^{-1}(\leniencyValue')$ in equation \ref{eq:Tdet} is then the inverse cumulative distribution function of a logit-normal distribution whose underlying Gaussian has mean $\mu=0$ and variance $s^2=\beta_\obsFeatures^2+\beta_\unobservable^2$ (a closed-form expression is sketched below).
\caption{OLD FIGURE: Evaluation of batch decision maker on synthetic data with independent decision makers in the data. Error bars denote the standard deviation of the \failurerate estimate across data splits. In this basic setting, both our \cfbi and contraction are able to match the true evaluation curve closely, but the former exhibits lower standard deviations, as shown by the error bars. \rcomment{Here labeled outcomes is divided by the number of all subjects in the data.}
}
\label{fig:basic}
\end{figure}
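Under the assumption stated above, namely that $\beta_\obsFeatures \obsFeatures + \beta_\unobservable \unobservable$ is Gaussian with mean $0$ and variance $s^2=\beta_\obsFeatures^2+\beta_\unobservable^2$, this inverse cumulative distribution function has a closed form. As a sketch, writing $\sigma$ for the standard logistic function and $\Phi^{-1}$ for the standard normal quantile function (notation introduced only here),
\[
F^{-1}(\leniencyValue') \;=\; \sigma\!\left(s\,\Phi^{-1}(\leniencyValue')\right),
\qquad s=\sqrt{\beta_\obsFeatures^2+\beta_\unobservable^2},
\]
because the logistic function is strictly increasing and therefore maps the $\leniencyValue'$-quantile of the underlying Gaussian to the $\leniencyValue'$-quantile of the logit-normal distribution.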
\subsection{Evaluators}
In addition to the counterfactual-based imputation (\cfbi) presented in this paper, we consider three other ways of evaluating decision makers. For the synthetic data, we can obtain the outcomes even for cases with negative decisions. We call the acceptance rate vs.\ failure rate trade-off curve obtained by using the true outcomes \textbf{True evaluation}. Note that in a realistic setting the true evaluation would not be available. We also report the failure rate computed using only the cases that were released in the data as \textbf{Labeled outcomes}. This naive baseline has previously been shown to considerably underestimate the true failure rate \citep{lakkaraju2017selective}.
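As an illustration of the difference between these two evaluators, the following sketch (not the actual implementation; the function names, the risk-score input and the convention that outcome $0$ denotes a failure are assumptions made only here) computes both failure rates for a decision maker that releases the fraction of subjects with the lowest predicted risk, dividing the failure count by the number of all subjects in both cases.
\begin{verbatim}
import numpy as np

def machine_release(risk_scores, accept_rate):
    # Release the accept_rate fraction of subjects with the lowest predicted risk.
    n = len(risk_scores)
    released = np.zeros(n, dtype=bool)
    released[np.argsort(risk_scores)[:int(accept_rate * n)]] = True
    return released

def true_evaluation(y_true, risk_scores, accept_rate):
    # Failure rate using the true outcomes of all subjects (synthetic data only).
    released = machine_release(risk_scores, accept_rate)
    return np.sum(released & (y_true == 0)) / len(y_true)

def labeled_outcomes(y_observed, released_in_data, risk_scores, accept_rate):
    # Naive baseline: outcomes are observed only for subjects released in the
    # data, but the failure count is still divided by the number of all subjects.
    released = machine_release(risk_scores, accept_rate)
    labeled = released & released_in_data
    return np.sum(labeled & (y_observed == 0)) / len(y_observed)
\end{verbatim}
Because the labeled-outcomes baseline can only count failures among subjects that were released in the data, it systematically undercounts failures, which is the underestimation effect noted above.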
...
...
@@ -117,13 +125,17 @@ To produce a train-test split, we randomly choose the decisions of half of the j
%
The training data sets were used only to train the machine decision makers.
The evaluation algorithms produced separate \failurerate estimates for each test data set.
%
The curves in figures \ref{fig:basic} and \ref{fig:results_rmax05} present the mean of these estimates at each level of leniency.
%
The error bars denote the standard deviation of the estimated failure rates across the test data sets.
%
For the summary figures \ref{fig:results_errors}, \ref{fig:highz} and \ref{fig:results_compas}, the failure rate estimate for each test data set was compared to the estimate given by the true evaluation.
%
The error bars in these figures stand for the standard deviation of this error.
\acomment{You really need to describe how the error bars are obtained. It is not enough to say that they are sds. They could be sds from bootstrapping, cross-validation, over several data sets, over decision makers etc. For example as follows: \textbf{We divided the data set 10 times into learning and test sets. We learned the decision maker $\machine$ from the learning set and evaluated its performance on the test set using different evaluators. The error bars denote the std. deviation from the means in this process.}}
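To make the protocol described above concrete, the sketch below reproduces it schematically: the judges are split in half, the machine decision maker is trained on one half, each evaluator is run on the other half, and the reported point estimates and error bars are the means and standard deviations over the test data sets. The data layout (a pandas data frame with a \texttt{judge\_id} column) and all names are illustrative assumptions, not the actual implementation.
\begin{verbatim}
import numpy as np

def run_protocol(datasets, train_machine, evaluators, seed=0):
    # Schematic version of the evaluation protocol: for each synthetic data
    # set (assumed to be a pandas DataFrame with a "judge_id" column), split
    # the judges in half, train the machine decision maker on one half and
    # run every evaluator on the other half.  The reported point estimate is
    # the mean over the test sets; the error bar is the standard deviation.
    rng = np.random.default_rng(seed)
    estimates = {name: [] for name in evaluators}
    for data in datasets:
        judges = data["judge_id"].unique()
        train_judges = rng.choice(judges, size=len(judges) // 2, replace=False)
        is_train = data["judge_id"].isin(train_judges)
        train, test = data[is_train], data[~is_train]
        machine = train_machine(train)   # training data used only for this step
        for name, evaluator in evaluators.items():
            estimates[name].append(evaluator(machine, test))
    return {name: (np.mean(vals), np.std(vals))
            for name, vals in estimates.items()}
\end{verbatim}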
\spara{Original start of subsection}
...
...
@@ -153,6 +165,11 @@ Figure~\ref{fig:results_errors} shows the summarized error rates for top evaluat
\caption{OLD FIGURE: Evaluating batch decision maker on data employing independent decision makers and with leniency at most $0.5$. The proposed method (\cfbi) offers good estimates of the failure rate for all levels of leniency, whereas contraction can estimate the failure rate only up to leniency $0.5$. \rcomment{Here labeled outcomes is divided by the number of all subjects in the data.}}
\label{fig:results_rmax05}
\end{figure}
Figure~\ref{fig:results_rmax05} shows the evaluation over leniencies similarly to Figure~\ref{fig:basic}, but this time the maximum leniency of the decision makers in the data was limited to below $0.5$.
@@ -51,9 +51,4 @@ Finally, more applied work can be found for example in~\cite{murder,tolan2019why
%\subsection{Imputation}
\rcomment{The imputation approach presented by \citet{kleinberg2018human} on p.~270 is a bit different from the one we have used. Our approach is more similar to Lakkaraju's original paper \cite{lakkaraju2017selective}. Their approach has three stages: 1. use observed outcomes whenever available, 2. impute results conventionally up to the leniency of the most lenient decision maker, and 3. impute normally, but multiply the prediction by $\alpha$.}