The first evaluator we consider is \cfbi, the proposed method of this paper.
To summarize: \cfbi uses the dataset to learn a model, i.e., a distribution for the parameters involved in formulas~\ref{eq:defendantmodel} and~\ref{eq:judgemodel}; using this distribution, it predicts the outcome of the cases for which \humanset made a negative decision and \machine makes a positive one; and finally it evaluates the failure rate of \machine on the dataset.
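In outline, and only as a hedged illustration, this pipeline could be sketched in Python as below; \texttt{fit\_posterior} and \texttt{p\_bad} are hypothetical stand-ins for fitting the model of formulas~\ref{eq:defendantmodel} and~\ref{eq:judgemodel} and for the predicted probability of a bad outcome under sampled parameters, respectively.
\begin{verbatim}
import numpy as np

# Sketch of the counterfactual-based imputation (cfbi) evaluator;
# illustrative only, with hypothetical helper functions.
def cfbi_failure_rate(data, machine, n_samples=1000):
    posterior = fit_posterior(data)    # hypothetical: learn a
                                       # distribution over parameters
    failures = 0.0
    for case in data:
        if not machine.makes_positive_decision(case):
            continue                   # a negative decision cannot fail
        if case.human_decision_positive:
            failures += float(case.outcome_is_bad)  # outcome observed
        else:
            # Outcome unobserved (the human decided negatively):
            # impute its probability, averaging over the posterior.
            failures += np.mean([p_bad(case, theta)
                                 for theta in posterior.sample(n_samples)])
    return failures / len(data)        # failure rate of M on the data
\end{verbatim}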
The second evaluator we consider is \contraction, proposed in recent work~\cite{lakkaraju2017selective}.
It is designed specifically to estimate the true failure rate of a machine decision maker in the selective labels setting.
%
\contraction bases its evaluation only on the cases assigned to the most lenient decision maker in the data.
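For intuition, the following is a minimal Python sketch of \contraction as we understand it from~\cite{lakkaraju2017selective}; field names such as \texttt{released} and \texttt{reoffended} are hypothetical. To evaluate \machine at leniency $r$, it keeps the $r\cdot|D|$ lowest-risk cases among those released by the most lenient decision maker and counts the observed failures among them.
\begin{verbatim}
# Sketch of contraction (after Lakkaraju et al., 2017); illustrative only.
def contraction_failure_rate(D, machine_risk, r):
    # D: the caseload of the most lenient decision maker in the data.
    R = [c for c in D if c.released]       # only these cases have labels
    if r > len(R) / len(D):
        # Cannot evaluate at leniencies above that of the most
        # lenient decision maker in the data.
        return None
    # Keep the r*|D| lowest-risk cases among those released in the data.
    kept = sorted(R, key=machine_risk)[:int(r * len(D))]
    return sum(c.reoffended for c in kept) / len(D)
\end{verbatim}
The early return makes explicit a property we return to below: \contraction cannot produce estimates for leniencies higher than those present in the data.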
For the results we describe immediately below, we executed the pipeline for multiple leniency levels.
\spara{The basic scenario.} Figure~\ref{fig:basic} shows estimated failure rates for each of the evaluators, at different leniency levels, when decisions in the data were made by a \batch decision maker, while \machine was of \independent type.
%
The error bars denote the standard deviation of the estimates in this process.
%
At the same time, both the naive evaluation of \labeledoutcomes and the straightforward imputation by \logisticregression perform quite poorly.
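As a point of reference, a hedged sketch of the naive \labeledoutcomes evaluator (hypothetical field names, as above) makes the source of its bias apparent: it computes the failure rate over labeled cases only, i.e., over cases that the decision makers in the data already chose to release.
\begin{verbatim}
# Sketch of the naive evaluator (illustrative only): it never sees the
# outcomes of cases with a negative decision in the data, so its
# estimate is biased unless those decisions were made at random.
def labeled_outcomes_failure_rate(data, machine):
    labeled = [c for c in data if c.human_decision_positive]
    failures = sum(c.outcome_is_bad for c in labeled
                   if machine.makes_positive_decision(c))
    return failures / len(labeled)
\end{verbatim}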
%
\begin{figure}
%\centering
\includegraphics[width=\linewidth]{./img/sl_errors_betaZ1}
\caption{Aggregate error rates for the best evaluators (\cfbi and \contraction); the error varies considerably within and across different decision makers.}
\label{fig:results_errors}
\end{figure}
\spara{The effect of limited leniency.}
Figure~\ref{fig:results_rmax05} shows the results when the leniency of decision makers in the data was restricted to values below $0.5$, rather than below $0.9$ as was the case for Figure~\ref{fig:basic}.
%
\contraction is only able to estimate the failure rate up to leniency $0.5$, which is the highest leniency of decision makers in the data -- for higher leniency rates it does not output any results.
%
On the contrary, the proposed method \cfbi produces failure rate estimates for all leniencies: it is able to learn model parameters that capture the behavior of the decision makers employed in the data, and to use that model to evaluate any decision maker \machine using only selectively labeled data.
%
When we compare with \trueevaluation, we see that the accuracy of \cfbi decreases for the largest leniencies -- a fact to be expected, as such cases do not exist in the data.
%
This observation is important in the sense that evaluators based on elaborate machine learning techniques, such as \cfbi, may well allow for evaluation at higher leniency rates than those (often human) employed in the data.
%
Figure~\ref{fig:results_errors} shows the aggregate error rates for the best evaluators (\cfbi and \contraction), for different types of decision makers in the data and for \machine.
%
The overall result is that \cfbi evaluates the decision makers accurately and robustly across different decision makers, while \contraction shows consistently poorer performance and markedly larger variation, as shown by the error bars.
%
Again, our interpretation is that this is due to the fact that \contraction crucially depends on the data points that correspond to the most lenient decision makers, while \cfbi makes full use of all data.
\begin{figure}
\includegraphics[width=1.1\linewidth]{./img/with_epsilon_deciderH_independent_deciderM_batch_maxR_0_5coefZ1_0_all}
\label{fig:results_rmax05}
\end{figure}
\spara{The effect of unobservables.} So far in our synthetic experiments, we have assumed that observed and unobserved features are of equal importance in determining possible outcomes, an assumption encoded in the value of parameters $\beta_\obsFeatures = \beta_\unobservable = 1$ (see Section~\ref{sec:syntheticsetting}).
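For concreteness, one way such a weighting can enter a synthetic data generator is sketched below; the logistic form is only an illustrative assumption, and the actual construction is the one described in Section~\ref{sec:syntheticsetting}.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

# Illustrative outcome model: observed features X and unobserved
# features Z enter with weights beta_X and beta_Z; setting both to
# 1.0 recovers the equal-importance case discussed above.
def sample_outcomes(n, beta_X=1.0, beta_Z=1.0):
    X = rng.normal(size=n)
    Z = rng.normal(size=n)
    p_bad = 1.0 / (1.0 + np.exp(beta_X * X + beta_Z * Z))
    return X, Z, rng.random(n) < p_bad     # True marks a bad outcome
\end{verbatim}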
%
Nevertheless, even when the weight $\beta_\unobservable$ of the unobservables is higher, the proposed method (\cfbi) is able to evaluate different decision makers accurately.
%
\contraction shows again consistently worse performance, also in comparison to its own performance in Figure~\ref{fig:results_errors} (again, this was to be expected due to the higher weight of $\beta_\unobservable$).
\vspace{2pt}
Thus overall, in these synthetic settings our method achieves more accurate results with considerably less variation than \contraction, allowing for evaluation in situations where the strong assumptions of \contraction inhibit evaluation altogether.
COMPAS provides needs assessments and risk estimates of recidivism.
%
The COMPAS score is derived from prior criminal history and socio-economic and personal factors, among other things, and it predicts recidivism within the following two years~\cite{brennan2009evaluating}.
%
The system came under scrutiny in 2016 after ProPublica published an article claiming that the tool was ethnically biased~\cite{angwin2016machine}.
In the ensuing discussion, \citet{kleinberg2016inherent} showed that the fairness criteria used by ProPublica and Northpointe cannot be satisfied simultaneously.