\caption{Left: Evaluation of \batch decision maker on data with \independent. Error bars show std. of the \failurerate estimate across 10 datasets. In this basic setting, both our \cfbi and \contraction follow the true evaluation curve closely, but \cfbi exhibits lower variation.
Right: Evaluating \batch on data employing \independent and with leniency at most $0.5$. \cfbi offers sensible estimates of the failure rates for all levels of leniency, whereas \contraction does so only up to leniency $0.5$.}
...
...
@@ -188,7 +188,7 @@ In addition, \cfbi exhibits considerably lower variation than \contraction.
\caption{Mean absolute error (MAE) of estimate w.r.t. true evaluation.
Error bars show std. of the absolute error over 10 datasets. \cfbi offers robust estimates across all decision makers. The error of \contraction varies within and across different decision makers.}
\label{fig:results_errors}
...
...
@@ -308,7 +308,7 @@ The deployed machine decision maker was defined to release \leniencyValue fracti
\caption{MAE of estimate w.r.t. true evaluation when the effect of the unobserved $\unobservable$ is high ($b_\unobservable=5$). The decision quality is poorer, but \cfbi can still evaluate the decisions accurately. \contraction shows higher variance and lower accuracy.}
\label{fig:highz}
\end{figure}% RL: Note that only machine decision maker is poorer, not the human.
...
...
@@ -316,7 +316,7 @@ The deployed machine decision maker was defined to release \leniencyValue fracti
\caption{Results with COMPAS data. Error bars show std. of the absolute \failurerate estimate errors across all levels of leniency w.r.t. true evaluation. \cfbi gives estimates that are both more accurate and more precise, regardless of the number of judges used.
% Performance of \contraction gets notably worse as the number of judges increases.
\and HIIT, Department of Computer Science, University of Helsinki, Helsinki, Finland
}
%
\maketitle% typeset the header of the contribution
...
...
@@ -91,7 +92,7 @@ Based on this model, we compute counterfactuals to impute missing outcomes, whic
As we demonstrate over real and synthetic data, our approach estimates the quality of decisions more accurately and robustly than previous methods.
@@ -109,7 +110,7 @@ As we demonstrate over real and synthetic data, our approach estimates the quali
\input{conclusions}
\subsubsection{Acknowledgments.}
The authors acknowledge the computer capacity from the Finnish Grid and Cloud Infrastructure (urn:nbn:fi:research-infras-2016072533). AH was supported by Academy of Finland grants 295673 and 316771, and by HIIT.