@@ -359,10 +359,25 @@ We experimented with synthetic data sets to examine accurateness, unbiasedness a
We drew $N=50{,}000$ observations of $X$, $Z$, and $W$ as independent standard Gaussians. We then drew the outcome $Y$ from a Bernoulli distribution with parameter $p = 1-\invlogit(\beta_xx+\beta_zz+\beta_ww)$, so that $P(Y=0 \mid X, Z, W)=\invlogit(\beta_xx+\beta_zz+\beta_ww)$, where the coefficients for $X$, $Z$, and $W$ were set to $1$, $1$, and $0.2$ respectively. This process follows the suggestion of Lakkaraju et al. \cite{lakkaraju2017selective}.
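As a concrete illustration, the sampling step above can be sketched in Python as follows. This is a minimal sketch of the described procedure, not the authors' code; variable names are ours, and the coefficient values are those stated in the text.

```python
import numpy as np

# Illustrative sketch of the synthetic data generation described in the
# text (not the authors' actual implementation).
rng = np.random.default_rng(0)

def invlogit(a):
    """Logistic sigmoid, i.e. the inverse of the logit function."""
    return 1.0 / (1.0 + np.exp(-a))

N = 50_000
# X, Z, W are independent standard Gaussians.
X, Z, W = rng.standard_normal((3, N))
beta_x, beta_z, beta_w = 1.0, 1.0, 0.2  # coefficients from the text

# P(Y = 0 | X, Z, W) = invlogit(beta_x*x + beta_z*z + beta_w*w),
# so Y is Bernoulli with success probability 1 - invlogit(...).
p_y1 = 1.0 - invlogit(beta_x * X + beta_z * Z + beta_w * W)
Y = rng.binomial(1, p_y1)
```

By symmetry of the Gaussian features, the marginal rate of $Y=1$ is about one half in this sketch.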
\acomment{How were the leniencies drawn?}\acomment{Explain the how we have several different judges?}
We experimented with different combinations of decider and data generating modules, to see that our method is robust against non-informative, biased, and otherwise poor decisions. Due to space constraints we defer these results ...
The \emph{default} decision maker fits a logistic regression model $Y \sim\invlogit(\beta_xx+\beta_zz)$ using the training set. Decisions were assigned by computing the quantile each subject belongs to; the quantile was obtained as the inverse cdf of ... . Given leniency $R$, the decision maker assigns $T=1$ to the $R$ percent of test-set subjects whose probability of $Y=1$ falls in the highest quantiles.
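The exact cdf used above is left unspecified in the text. Purely as an illustration, the sketch below assumes the quantile of a subject is the Gaussian cdf of its linear score $\beta_xx+\beta_zz$ (which, for standard-normal features, has variance $\beta_x^2+\beta_z^2$); `default_decider` and `gaussian_cdf` are hypothetical names of ours. Under this assumption each decision depends only on the subject's own features, so decisions are independent, yet the fraction with $T=1$ still converges to the leniency $R$.

```python
import numpy as np
from math import erf, sqrt

def gaussian_cdf(s, var):
    """Cdf of a zero-mean Gaussian with variance `var`, evaluated at s."""
    return 0.5 * (1.0 + erf(s / sqrt(2.0 * var)))

def default_decider(x, z, R, beta_x=1.0, beta_z=1.0):
    """Hypothetical sketch of the default decider (assumed cdf choice).

    A higher score means higher P(Y=0), i.e. higher risk, so subjects in
    the lowest-risk R fraction of quantiles receive T=1.
    """
    score = beta_x * x + beta_z * z
    q = gaussian_cdf(score, beta_x**2 + beta_z**2)  # subject's quantile
    return 1 if q < R else 0

rng = np.random.default_rng(1)
x, z = rng.standard_normal(2)
t = default_decider(x, z, R=0.5)  # a single independent decision
```

Because the quantile of each subject is uniform on $(0,1)$ under the assumed cdf, the acceptance rate across many independent subjects is approximately $R$.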
We used a number of different decision mechanisms. A \emph{limited} decision maker works as the default, but uses the regression model $Y \sim\invlogit(\beta_xx)$, i.e., it does not observe $Z$.
A \emph{biased} decision maker works similarly, but its logistic regression model is .. , which biases the decisions ...
Given leniency $R$, a \emph{random} decision maker assigns $T=1$ with probability $R$.
In contrast, Lakkaraju et al. \cite{lakkaraju2017selective} essentially order the subjects and assign $T=1$ to the fraction of subjects given by the leniency $R$. We see this as unrealistic: the decision on one subject should not depend on the decisions on other subjects. In our example this would induce unethical behaviour: a judge would need to jail somebody today in order to release a defendant tomorrow.
We treat the observations as independent; even so, the leniency remains a good estimate of the acceptance rate, since the acceptance rate converges stochastically to the leniency.
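The convergence claim is easy to verify for the random decision maker: its decisions are i.i.d.\ Bernoulli($R$), so the empirical acceptance rate approaches the leniency as the number of subjects grows. A minimal sketch (`random_decider` is our hypothetical name):

```python
import numpy as np

# Sketch of the "random" decision maker: given leniency R, each subject
# independently receives T=1 with probability R. No subject's decision
# depends on any other subject's decision.
rng = np.random.default_rng(2)

def random_decider(n_subjects, R):
    return rng.binomial(1, R, size=n_subjects)

T = random_decider(50_000, R=0.3)
acceptance_rate = T.mean()  # close to the leniency 0.3 for large samples
```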
\paragraph{Algorithms}
We deployed multiple evaluator modules to estimate the true failure rate of the decider module. These estimates should be close to those of the true evaluation evaluator module, and they are eventually compared against the human evaluation curve.