@@ -376,7 +376,7 @@ We treat the observations as independent and the still the leniency would be a g
%This is a decider module. We experimented with different combinations of decider and data generating modules to show X / see Y. (to see that our method is robust against non-informative, biased and bad decisions . Due to space constraints we defer these results...)
\paragraph{Algorithms}
\paragraph{Evaluators}
We deployed multiple evaluator modules to estimate the true failure rate of the decider module. Their estimates should be close to those of the true evaluation evaluator module, and they will eventually be compared with the human evaluation curve.