\caption{Evaluation of \batch decision maker on data with \independent. Error bars show the standard deviation of the \failurerate estimate across 10 datasets. In this basic setting, both our \cfbi and \contraction are able to match the true evaluation curve closely, but the former exhibits lower variation, as shown by the error bars.
}
\label{fig:basic}
\end{figure}
...
%For the results included in the figure, decisions in the data were made by \batch decision maker, while \machine was of \independent type.
%
In interpreting this plot, we should consider an evaluator to be accurate if its curve closely follows that of the optimal evaluator, i.e., \trueevaluation.
%
In this scenario, both \cfbi and \contraction are quite accurate.
%
In addition, \cfbi exhibits considerably lower variation than \contraction, as shown by the error bars.
%\footnote{To obtain the error bars, we divided the data $10$ times to training and test datasets.
%
%We learned the decision maker \machine from the training set and evaluated its performance on the test sets using different evaluators.
%
%The error bars denote the standard deviation of the estimates in this process.}
%
At the same time, both the naive evaluation of \labeledoutcomes and the straightforward imputation by \logisticregression perform quite poorly.
\caption{Mean absolute error (MAE) of estimate w.r.t. true evaluation.
Error bars show the standard deviation of the absolute error over 10 datasets. The presented method (\cfbi) offers stable estimates with low variance robustly across different decision makers. The error of \contraction varies considerably within and across different decision makers.}
\label{fig:results_errors}
\end{figure}
...
%
%Specifically,
Figure~\ref{fig:results_errors} shows the aggregate absolute error rates of the two evaluators, \cfbi and \contraction.
%, when the full range of leniencies is present in the data (like in Figure~\ref{fig:basic}).
%
Each error bar is based on all datasets and leniencies from $0.1$ to $0.8$, for different types of decision makers for $\humanset$ and for \machine.
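%
% Illustrative formalization of the aggregation described above; the index
% notation ($\ell$ for leniency levels, $d$ for datasets) is ours, not from
% the experimental setup itself.
Concretely, each bar can be read as an aggregate of the form
\[
\mathrm{MAE} \;=\; \frac{1}{|L|\,|D|} \sum_{\ell \in L} \sum_{d \in D} \bigl|\, \widehat{\mathrm{FR}}(\ell, d) - \mathrm{FR}^{*}(\ell, d) \,\bigr|,
\]
where $L = \{0.1, 0.2, \ldots, 0.8\}$ is the set of leniency levels, $D$ the set of datasets, $\widehat{\mathrm{FR}}$ the evaluator's \failurerate estimate, and $\mathrm{FR}^{*}$ the true \failurerate.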
%
The overall result is that \cfbi evaluates accurately and robustly across the different decision makers.