Commit 012b97de authored by Antti Hyttinen
parent b46de73d
@@ -34,9 +34,10 @@ The leniency \leniency of each decision maker is drawn
from Uniform$(0.1,~0.9)$.
%
A decision $\decision$ is made for each case by the assigned decision maker.
%
The exact way decisions are made by the different types of decision makers we consider is described in the next subsection (Sec.~\ref{sec:dm_exps}).
%
If the decision is positive, then an outcome is assigned to the case according to Eq.~\ref{eq:defendantmodel}
%
@@ -88,7 +89,8 @@ Because of this, we refer to this type of decision maker as \independent.
In addition, we experiment with a second type of decision maker, namely \batch, also used in \cite{lakkaraju2017selective}.
%
Decision makers of this type are assumed to consider all cases assigned to them at once, as a batch: they sort the cases by risk score and, for leniency $\leniency = \leniencyValue$, release the fraction $\leniencyValue$ of the batch with the best risk scores.
%
@@ -96,8 +98,11 @@ Such decision makers still have a good knowledge of the relative risk that the c
%
For example, even if a decision maker is randomly assigned a batch of cases that are all very likely to lead to a good outcome, a portion $1-\leniencyValue$ of them will still be handed a negative decision.
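For concreteness, the following is a minimal Python sketch of such a decision maker; the function name and the representation of risk scores (an array in which lower values denote lower risk, i.e., better scores) are illustrative assumptions, not part of our implementation.
\begin{verbatim}
import numpy as np

def batch_decisions(risk_scores, leniency):
    # Positive decision for the `leniency` fraction of the batch
    # with the best (lowest) risk scores; negative for the rest.
    n = len(risk_scores)
    n_release = int(leniency * n)
    order = np.argsort(risk_scores)    # best scores first
    decisions = np.zeros(n, dtype=int)
    decisions[order[:n_release]] = 1
    return decisions
\end{verbatim}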
Finally, we consider a third type of decision maker, namely \random.
%
Decision makers of this type simply select uniformly at random a portion $\leniency=\leniencyValue$ of the cases assigned to them, make a positive decision for those cases -- and a negative decision for the remaining ones.
%
\random decision makers make poor decisions -- but they do not introduce selection bias, as their decisions are not correlated with the possible outcomes.
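A corresponding sketch of a \random decision maker (again with illustrative names) differs only in that the released portion is drawn uniformly at random, independently of any risk information:
\begin{verbatim}
import numpy as np

def random_decisions(n_cases, leniency, rng=None):
    # Positive decision for a uniformly random `leniency` fraction
    # of the cases; negative decision for the rest.
    rng = rng or np.random.default_rng()
    decisions = np.zeros(n_cases, dtype=int)
    released = rng.choice(n_cases, size=int(leniency * n_cases),
                          replace=False)
    decisions[released] = 1
    return decisions
\end{verbatim}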
@@ -108,8 +113,9 @@ For this reason, including them in the experiments is useful, as it allows us to
%
For \machine, we consider the same three types of decision makers as for \humanset above, with one difference: decision makers \humanset have access to \unobservable, while \machine does not.
%
Their definitions are adapted in the obvious way -- i.e., for \independent and \batch, the risk score depends only on the values of feature \obsFeatures.
\begin{figure}
@@ -150,12 +156,16 @@ Then the failure rate estimate for a given leniency level $\leniencyValue$ can b
In addition, we consider two baselines.
%
As a first baseline, we consider the method that evaluates the failure rate of \machine based only on those cases that received a positive decision by \humanset in the data.
%
This is referred to as \labeledoutcomes~\cite{lakkaraju2017selective}.
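A minimal sketch of this baseline follows; the names are hypothetical, and we assume outcomes coded so that $0$ denotes failure and that the estimate is normalized by the number of labeled cases.
\begin{verbatim}
import numpy as np

def labeled_outcomes(machine_releases, human_releases, outcomes):
    # Use only cases released by the humans, since only those
    # have an observed outcome in the data.
    labeled = human_releases == 1
    failures = (machine_releases == 1) & labeled & (outcomes == 0)
    return np.sum(failures) / np.sum(labeled)
\end{verbatim}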
As a second baseline, we consider a method that performs straightforward imputation:
given a training dataset, it considers only those cases that were accompanied by a positive decision and builds a logistic regression model on them;
it then uses the prediction of this logistic regression to impute the outcome in the test data for those cases where \machine makes a positive decision but \humanset had made a negative decision.
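The following is a minimal sketch of this baseline, assuming binary outcomes coded as $1$ (good) and $0$ (failure); the names and the failure-rate normalization (over all test cases) are illustrative assumptions.
\begin{verbatim}
import numpy as np
from sklearn.linear_model import LogisticRegression

def imputation_estimate(X_tr, human_tr, y_tr,
                        X_te, human_te, y_te, machine_te):
    # Fit only on training cases with a positive (human) decision.
    pos = human_tr == 1
    model = LogisticRegression().fit(X_tr[pos], y_tr[pos])
    y_hat = y_te.astype(float)
    # Impute where the machine releases but the human did not.
    missing = (machine_te == 1) & (human_te == 0)
    y_hat[missing] = model.predict(X_te[missing])
    # Failure rate of the machine on the (partly imputed) outcomes.
    return np.mean((machine_te == 1) & (y_hat == 0))
\end{verbatim}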
@@ -175,11 +185,15 @@ We refer to it as \trueevaluation.
\subsection{Pipeline}
\label{sec:pipeline}
Having described the synthetic data, decision makers, and evaluators, we now summarize how the above components are put together into the experimental pipeline.
As a first step, we generate three synthetic datasets (Section~\ref{sec:syntheticsetting}).
%
The decisions in each dataset \dataset have been made by a different type of decision maker (Section~\ref{sec:dm_exps}).
%
As a second step, we create three instances of decision maker \machine -- again, one for each type of decision maker.
%
@@ -203,25 +217,29 @@ And finally, in a fourth step, all evaluators are employed to estimate the failu
For the results we describe immediately below, we executed the pipeline for multiple random train-test splits and different leniency levels for \machine.
%
Figure~\ref{fig:basic} shows the estimated failure rates for each of the evaluators, at different leniency levels, for the setting where the decisions in the data were made by \batch decision makers, while \machine was of the \independent type.
%
In interpreting this plot, we should consider an evaluator to be good if its curve closely follows that of the optimal evaluator, i.e., \trueevaluation.
%
In this scenario, we see that \cfbi exhibits considerably lower variation, as shown by the error bars\footnote{To obtain the error bars, we divided the data into training and test datasets $10$ times. We learned the decision maker \machine from the training set and evaluated its performance on the test sets using the different evaluators. The error bars denote the standard deviation of the estimates over this process.}, than the second-best method, \contraction.
%
At the same time, the naive evaluation of \labeledoutcomes, as well as the straightforward imputation by \logisticregression, perform quite poorly.
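A sketch of the repeated-split procedure described in the footnote is given below; \texttt{train\_machine} and the evaluator signature are placeholders for the corresponding components described above, and the split proportions are an assumption.
\begin{verbatim}
import numpy as np
from sklearn.model_selection import train_test_split

def estimate_with_error_bars(X, d, y, evaluator, leniency,
                             n_splits=10, seed=0):
    estimates = []
    for i in range(n_splits):
        X_tr, X_te, d_tr, d_te, y_tr, y_te = train_test_split(
            X, d, y, test_size=0.5, random_state=seed + i)
        machine = train_machine(X_tr, d_tr, y_tr)   # placeholder
        estimates.append(evaluator(machine, X_te, d_te,
                                   y_te, leniency))
    # Error bars: standard deviation over the repeated splits.
    return np.mean(estimates), np.std(estimates)
\end{verbatim}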
@@ -234,15 +252,20 @@ varies considerably within and across different decision makers.}
\label{fig:results_errors}
\end{figure}
\spara{The effect of limited leniency.}
%
Figure~\ref{fig:results_rmax05} shows results generated in the same way as those of Figure~\ref{fig:basic}, with the only difference that the leniency of the decision makers in the data was restricted to values below $0.5$, rather than below $0.9$ as was the case for Figure~\ref{fig:basic}.
%
We observe that \contraction is only able to estimate the failure rate up to leniency $0.5$, which is the highest leniency of the decision makers in the data -- for higher leniency levels it does not output any results.
%
This is because \contraction depends crucially on the most lenient decision maker in the data to estimate the performance of the rest.
%
In contrast, the proposed method \cfbi produces failure rate estimates for all leniency levels.
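To see where this limitation comes from, consider a minimal sketch of \contraction as described in \cite{lakkaraju2017selective} (hypothetical names; outcome $0$ denotes failure): the method contracts the released cases of the most lenient decision maker $q$, and therefore cannot evaluate leniency levels above $q$'s own.
\begin{verbatim}
import numpy as np

def contraction(released_q, outcomes_q, machine_risk, leniency):
    # released_q, outcomes_q, machine_risk: arrays over the cases
    # assigned to the most lenient decision maker q.
    n = len(machine_risk)
    r_q = released_q.sum() / n        # leniency of q in the data
    if leniency > r_q:
        return None                   # no estimate is possible
    # Remove the machine's riskiest cases from q's released set
    # until a `leniency` fraction of all n cases remains.
    idx = np.where(released_q == 1)[0]
    order = idx[np.argsort(-machine_risk[idx])]
    kept = order[int((r_q - leniency) * n):]
    return np.sum(outcomes_q[kept] == 0) / n
\end{verbatim}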
%