Each case is assigned randomly to one out of $\judgeAmount=50$ decision makers.
%
The leniency \leniency of each decision maker is drawn independently of other decision makers from Uniform$(0.1,~0.9)$.
%
As soon as a case is assigned to a decision maker, a decision $\decision$ is made for the case.
%
The exact way this happens for different types of decision makers is described in the next subsection (Sec.~\ref{sec:dm_exps}).
%
If the decision is positive, then an outcome is assigned to the case according to Eq.~\ref{eq:defendantmodel}.
%
...
...
Additional noise is added to the outcome of each case via $\epsilon_\outcome$, which was drawn from a zero-mean Gaussian distribution with small variance, $\epsilon_\outcome\sim\gaussian{0}{0.1}$.
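For concreteness, the following Python sketch illustrates one way to simulate this data-generating process. The feature distributions, the coefficient values, and the placement of the noise term inside the logit are illustrative assumptions only; the actual outcome model is the one specified by Eq.~\ref{eq:defendantmodel}.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n_judges, n_cases = 50, 5000

# Observed and unobserved features of each case (distributions assumed here).
x = rng.normal(size=n_cases)
z = rng.normal(size=n_cases)   # plays the role of the unobservable quantity

# Leniency of each decision maker, drawn from Uniform(0.1, 0.9).
leniency = rng.uniform(0.1, 0.9, size=n_judges)
# Each case is assigned uniformly at random to one of the decision makers.
judge = rng.integers(0, n_judges, size=n_cases)

# Risk score in the spirit of Eq. (defendantmodel); alpha_y, beta_x, beta_z
# are placeholders, and the noise term has variance 0.1.
alpha_y, beta_x, beta_z = 0.0, 1.0, 1.0
eps = rng.normal(0.0, np.sqrt(0.1), size=n_cases)
risk_score = alpha_y + beta_x * x + beta_z * z + eps

# Lower risk scores make a negative outcome more likely; the outcome is
# recorded only for cases that receive a positive decision.
p_positive_outcome = 1.0 / (1.0 + np.exp(-risk_score))
y = rng.binomial(1, p_positive_outcome)
\end{verbatim}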
\subsection{Decision Makers}
\label{sec:dm_exps}
Our experimentation involves two categories of decision makers: (i) the set of decision makers \humanset, whose decisions are recorded in the dataset, and (ii) the decision maker \machine, whose performance is to be evaluated on the log of cases decided by \humanset.
%
We describe both of them below.
\mpara{Decisions by \humanset}\newline
%
Among cases that receive a positive decision, the probability of a positive or negative outcome is determined by the quantity below (see Equation~\ref{eq:defendantmodel}), which we refer to as the `{\it risk score}' of the case.
Lower values indicate that a negative outcome is more likely.
%
We assume that the decision makers are well-informed and rational: their decisions reflect the probability that a case would have a positive or negative outcome.
%
The remaining parameter $\alpha_\judgeValue$ is set so as to conform with a pre-determined level of leniency \leniency.
%
Specifically, suppose that the risk scores of all defendants follow a distribution with cumulative distribution function $G$.
%
Now, given a decision maker with leniency $\leniency=\leniencyValue$, we set the value of $\alpha_\judgeValue$ so that the decision maker is more likely to make a positive decision when a defendant's risk score falls in the lowest \leniencyValue fraction of scores, i.e., when the risk score is lower than the inverse cumulative distribution function $G^{-1}$ evaluated at \leniencyValue.
%
See Appendix~\ref{sec:independent} for in-depth details.
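As a minimal illustration of this calibration, the empirical quantile of the risk scores can stand in for $G^{-1}$; the sketch below (Python) treats $\alpha_\judgeValue$ as a threshold placed at that quantile, so that a positive decision is more likely than not exactly for the lowest \leniencyValue fraction of scores. The exact parameterization used in the experiments is the one given in Appendix~\ref{sec:independent}.
\begin{verbatim}
import numpy as np

def decision_probability(risk_scores, leniency):
    """Probability of a positive decision for a judge with the given
    leniency.  The threshold alpha_j is the empirical quantile
    G^{-1}(leniency); this is an illustrative parameterization only."""
    alpha_j = np.quantile(risk_scores, leniency)            # empirical G^{-1}(r)
    return 1.0 / (1.0 + np.exp(-(alpha_j - risk_scores)))   # > 0.5 iff score < alpha_j
\end{verbatim}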
We used several different kinds of decision makers in the experiments, both to provide the simulated decisions recorded in the dataset and as candidate decision makers to be evaluated. Recall from Section~\ref{sec:setting} that the portion of cases for which each decision maker makes a positive decision can be controlled by the leniency level $\leniency$.
\spara{Variants.}
The simplest decision maker, \textbf{Random}, selects a portion $\leniency=\leniencyValue$ of the cases assigned to it uniformly at random, makes a positive decision $\decision=1$ for those cases, and a negative decision $\decision=0$ for the remaining cases.
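A sketch of this decision maker (Python; the helper name is ours):
\begin{verbatim}
import numpy as np

def random_decisions(n_cases, leniency, rng):
    """Positive decision for a random `leniency` fraction of the cases."""
    decisions = np.zeros(n_cases, dtype=int)
    n_positive = int(round(leniency * n_cases))
    positive_idx = rng.choice(n_cases, size=n_positive, replace=False)
    decisions[positive_idx] = 1
    return decisions
\end{verbatim}
For instance, \texttt{random\_decisions(100, 0.5, np.random.default\_rng(0))} gives a positive decision to 50 of 100 cases, irrespective of their features.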
%
...
...
Following~\citet{lakkaraju2017selective}, we also used a \textbf{Batch} decision maker. This decision maker sorts the cases assigned to it by their risk scores and then releases the $\leniencyValue$ fraction of the cases with the lowest scores. In the experiments, the risk scores were computed using the expression given in Equation~\ref{eq:defendantmodel}.
The previous decision maker may seem unfair, since its decision for a subject \emph{depends} on the other cases in the same batch: it may need to make a negative decision for a subject today in order to make a positive decision for some subject tomorrow. To address this, we formulated an \textbf{Independent} decision maker, generalizing the batch decision maker, in the following way.
The risk scores of all defendants follow some distribution with cumulative distribution function $G$. Given a decision maker with leniency $\leniency=\leniencyValue$, the independent decision maker makes a positive decision if a defendant's risk score is in the lowest \leniencyValue fraction of scores, i.e., if the risk score is lower than the inverse cumulative distribution function $G^{-1}$ evaluated at \leniencyValue. In the experiments, the risk scores were computed by logistic regression according to Equation~\ref{eq:defendantmodel}. See Appendix~\ref{sec:independent} for in-depth details.
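The contrast between the two is sketched below (Python; helper names are ours): the batch decision maker thresholds within its own batch of cases, whereas the independent decision maker thresholds against the population-level quantile $G^{-1}(\leniencyValue)$, so its decision for a case does not depend on the other cases assigned to it.
\begin{verbatim}
import numpy as np

def batch_decisions(batch_scores, leniency):
    """Release the `leniency` fraction of this judge's own batch with
    the lowest risk scores."""
    n_release = int(round(leniency * len(batch_scores)))
    order = np.argsort(batch_scores)              # lowest scores first
    decisions = np.zeros(len(batch_scores), dtype=int)
    decisions[order[:n_release]] = 1
    return decisions

def independent_decisions(batch_scores, leniency, population_scores):
    """Release a case iff its risk score falls below the population
    quantile G^{-1}(leniency), independently of the other cases."""
    threshold = np.quantile(population_scores, leniency)
    return (batch_scores < threshold).astype(int)
\end{verbatim}
Both variants give a positive decision to (approximately) a \leniencyValue fraction of cases, but only the latter decides each case on its own merits.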
The decision makers in the data and the evaluated decision makers differ in the observability of \unobservable: the former have access to \unobservable and include it in their logistic regression model, while the latter omit \unobservable completely. All parameters of the logistic regression models of the evaluated decision makers are learned from the training dataset; evaluation is based solely on the test set.
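To illustrate the difference in observability, the sketch below (Python, using scikit-learn) fits such a logistic regression from the training data using the observed features only, i.e., omitting \unobservable. We assume here, as is usual in the selective labels setting, that the model is fit on the training cases that received a positive decision and hence have an observed outcome; the helper name and feature layout are ours.
\begin{verbatim}
from sklearn.linear_model import LogisticRegression

def fit_machine(x_train, decisions_train, outcomes_train):
    """Fit the evaluated decision maker's logistic regression.
    The unobservable z is not available and is therefore omitted, and
    only cases with a positive decision have an observed outcome."""
    labeled = decisions_train == 1
    X = x_train[labeled].reshape(-1, 1)    # observed feature(s) only, no z
    return LogisticRegression().fit(X, outcomes_train[labeled])
\end{verbatim}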
...
...
\subsection{Results}
We produced 10 train-test splits of the resulting dataset: in each split, the decisions of half of the judges (25 of the $\judgeAmount=50$ judges) were randomly assigned to the training dataset, while the decisions of the remaining judges formed the test dataset.
%
The training datasets were used only to train the machine decision makers.
Each evaluation algorithm then produced a separate \failurerate estimate on each test dataset at each tested leniency level.
%
For each evaluation algorithm and each leniency level, the mean of these estimates over the test datasets is used as the point estimate, and the error bars in Figures~\ref{fig:basic} and \ref{fig:results_rmax05} show the standard deviation of the estimates over the 10 train-test splits.
%
For the summary Figures~\ref{fig:results_errors}, \ref{fig:highz} and \ref{fig:results_compas}, the failure rate estimates for each test dataset at all leniency levels were compared to the estimate given by the true evaluation algorithm.
%
The errors of these estimates were computed; the reported point estimates are the mean errors, and the error bars denote the standard deviation of the errors across the 10 train-test splits, i.e., across the 10 repetitions of learning the decision maker \machine on a training set and evaluating it on the corresponding test set with each evaluator.
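The protocol can be summarized by the following sketch (Python). The callables \texttt{split\_by\_judges}, \texttt{fit\_machine} and the entries of \texttt{evaluators} are hypothetical stand-ins for the components described above and for the compared evaluation algorithms; the function only encodes the aggregation into point estimates and error bars.
\begin{verbatim}
import numpy as np

def run_protocol(dataset, split_by_judges, fit_machine, evaluators,
                 leniency_levels, n_splits=10, seed=0):
    """10 train-test splits over judges; one failure-rate estimate per
    evaluator, split and leniency level; report means and std deviations."""
    rng = np.random.default_rng(seed)
    estimates = {name: {r: [] for r in leniency_levels} for name in evaluators}
    for _ in range(n_splits):
        train, test = split_by_judges(dataset, rng)   # half of the judges each
        machine = fit_machine(train)                  # trained on training data only
        for name, evaluate in evaluators.items():
            for r in leniency_levels:
                estimates[name][r].append(evaluate(machine, test, r))
    means = {name: {r: float(np.mean(v)) for r, v in per_r.items()}
             for name, per_r in estimates.items()}
    stds = {name: {r: float(np.std(v)) for r, v in per_r.items()}
            for name, per_r in estimates.items()}
    return means, stds
\end{verbatim}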
Figure~\ref{fig:basic} shows the basic evaluation of a batch decision maker on a dataset whose decisions were made by batch decision makers, over different leniency levels. Here, an evaluation method is good if its estimates match the true evaluation (available only for synthetic data) at every leniency level.
%
In this basic setting, the proposed \cfbi achieves estimates with considerably lower variance (shown by the error bars) than the state-of-the-art contraction method.