Towards this end, we employ both synthetic and real data.
%
In particular, we compare \cfbi against \contraction~\cite{lakkaraju2017selective}.
%
\spara{Reproducibility}.
The implementation uses Python 3.6.9 and PyStan v.2.19.0.0 with cmdstanpy 0.4.3, and will be made available online upon publication.
%
Our manuscript contains a full specification of the parameters of the models and synthetic datasets we used, as well as links to the real public datasets.
\subsection{Synthetic Data}
\label{sec:syntheticsetting}
...
...
with $b_\obsFeatures = b_\unobservable = 1$.
%
Additional noise is added to the outcome of each case via $e_\outcome$, which was drawn from a zero-mean Gaussian distribution with small variance, $e_\outcome\sim\gaussian{0}{0.1}$. The dataset was split in half into training and test sets, such that each decision maker appears in only one of them. The evaluated decision maker $\machine$ is trained on the training set, while the evaluation is based only on the test set.
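%
For concreteness, the following Python sketch illustrates this generation process; the coefficients $b_\obsFeatures = b_\unobservable = 1$ and the noise term follow the specification above, while the standard-normal feature distributions and the dataset size are illustrative assumptions.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n = 10_000                        # illustrative dataset size

x = rng.normal(size=n)            # observed feature X (assumed N(0, 1))
z = rng.normal(size=n)            # unobserved feature Z (assumed N(0, 1))
e = rng.normal(0.0, np.sqrt(0.1), size=n)  # outcome noise, variance 0.1

b_x = b_z = 1.0
p_pos = 1.0 / (1.0 + np.exp(-(b_x * x + b_z * z + e)))  # logistic model
y = rng.binomial(1, p_pos)        # outcome: 1 = positive, 0 = negative

half = n // 2                     # split in half so that each decision
train, test = slice(0, half), slice(half, n)  # maker is in one set only
\end{verbatim}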
Our experimentation involves two categories of decision makers: (i) the set of decision makers \humanset that produce the decisions recorded in the dataset; and (ii) the decision maker \machine that we evaluate.
%
We describe both of them below.
\mpara{Decisions by \humanset.}
%
Decision makers \humanset decide based on their perception of the dangerousness of a case, to which we refer as the {\it risk score}.
%
With synthetic data we compute the risk score as
\begin{equation}\label{eq:risk}
...
...
For the {\it first} type of decision makers we consider, we assume that decision makers have an accurate view of how risky each case is relative to the general population of cases.
%
Specifically, we assume that the decision makers know the cumulative distribution function $F$ that the risk scores $s = b_\obsFeatures\obsFeaturesValue+ b_\unobservable\unobservableValue$ of defendants follow.
%
This is a reasonable assumption to make when decision makers have accurate knowledge of the joint feature distribution, as such knowledge allows one to calculate $F$.
%
For example, an experienced judge who has tried a large volume and variety of defendants may have a good idea about the various cases that appear at court and which of them pose higher risk.
%
...
...
Since in our setting the distribution $F$ is given and fixed, such decisions are made for each case independently of other cases.
%
Because of this, we refer to this type of decision makers as \independent.
In addition, we consider a different type of decision makers, namely \batch, also used in \cite{lakkaraju2017selective}.
%
Decision makers of this type consider all cases assigned to them at once, as a batch; sort them by the risk score in Equation~\ref{eq:risk}; and, for leniency \leniency = \leniencyValue, release the portion \leniencyValue of the batch with the lowest risk scores.
%
Such decision makers still have a good knowledge of the relative risk that the cases assigned to them pose, but they are also shortsighted, as they make decisions for a case \emph{depending} on other cases in their batch.
%
For example, if a decision maker is randomly assigned a batch of cases that are all very likely to lead to a good outcome, a large portion $1-\leniencyValue$ of them will still be handed a negative decision.
%
Finally, we consider a third type of decision maker, namely \random. It simply makes a positive decision with probability \leniencyValue.
%
We include this to test the evaluation methods also in settings where some of their assumptions may be violated.
%
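The three types of decision makers admit a compact implementation, sketched below in Python; thresholding the risk quantile $F(s)$ at leniency \leniencyValue is our reading of the \independent rule, and all function and variable names are ours.
\begin{verbatim}
import numpy as np
from scipy.stats import rankdata

def risk_score(x, z, b_x=1.0, b_z=1.0):
    return b_x * x + b_z * z               # the risk score defined above

def decide_independent(s, F, r):
    # positive decision iff the risk quantile F(s) is below leniency r
    return (F(s) <= r).astype(int)

def decide_batch(s, r):
    # release the r-fraction of the batch with the lowest risk scores
    q = (rankdata(s, method="ordinal") - 1) / len(s)
    return (q < r).astype(int)

def decide_random(s, r, rng):
    # positive decision with probability r, independently of the case
    return rng.binomial(1, r, size=len(s))
\end{verbatim}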
\mpara{Decisions by \machine.}
%
For \machine, we consider the same three types of decision makers as for \humanset above, with one difference: decision makers \humanset have access to \unobservable, while \machine does not.
%
Their definitions are adapted in the obvious way: for \independent and \batch, the risk score is based only on the values of feature \obsFeatures.
%
Risk scores are computed with a logistic regression model trained on the training set.
%
For the \independent decision maker \machine, the cumulative distribution function of the risk scores is constructed using the empirical distribution of risk scores of all the observations in the training data.
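%
A minimal sketch of this construction follows (variable and function names are ours; for brevity, it leaves implicit that outcome labels are available only for cases with a positive decision).
\begin{verbatim}
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_machine(x_train, y_train):
    # M observes only feature X, never Z
    clf = LogisticRegression().fit(x_train.reshape(-1, 1), y_train)

    def risk(x_new):
        # predicted probability of a negative outcome is the risk score
        return clf.predict_proba(x_new.reshape(-1, 1))[:, 0]

    ref = np.sort(risk(x_train))   # risk scores of all training cases

    def ecdf(s):
        # empirical distribution function of the training risk scores
        return np.searchsorted(ref, s, side="right") / len(ref)

    return risk, ecdf
\end{verbatim}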
...
...
\end{figure}
\subsection{Evaluators}
\label{sec:evaluators}
We aim to investigate the performance of different methods in evaluating \machine on a dataset that records cases decided by \humanset.
%
We call those methods ``evaluators''.
%
The desired properties of an evaluator are accuracy (i.e., it should estimate well the failure rate of \machine) and robustness (i.e., it should perform consistently across cases).
The first evaluator we consider is \cfbi, the method we propose in Section~\ref{sec:imputation}.
%
To summarize: \cfbi uses the dataset to learn a model, i.e., a distribution for the parameters involved in Equations~\ref{eq:defendantmodel} and~\ref{eq:judgemodel}; using this distribution, it predicts the outcome of the cases for which \humanset made a negative decision and \machine makes a positive one; and finally it evaluates the failure rate of \machine on the dataset.
The second evaluator we consider is \contraction, proposed in recent work~\cite{lakkaraju2017selective}.
%
It is designed specifically to estimate the true failure rate of a machine decision maker in the selective labels setting.
%
\contraction bases its evaluation only on the cases assigned to the most lenient decision maker $\human_l$ in the data.
...
...
Because of the lower leniency of the evaluated decision maker $\machine$, the approach uses only a subset of the cases released by $\human_l$.
%
The cases with a positive decision by $\human_l$ are sorted according to the lowest leniency level at which they receive positive decisions by $\machine$.
The sorted list is then {\it contracted} to match the leniency level \leniencyValue at which \machine is evaluated. Because all outcomes for cases in this list are available in the data, we can estimate \failurerate as the number of cases in the contracted list with a negative outcome divided by the number of cases assigned to $\human_l$. Moreover, because cases are assigned to decision makers at random, this quantity also estimates the failure rate on the whole dataset.
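%
A sketch of the resulting computation follows, following our reading of \cite{lakkaraju2017selective}; variable names are ours.
\begin{verbatim}
import numpy as np

def contraction(t_l, y_l, risk_m, r):
    # t_l, y_l: decisions and observed outcomes for the cases assigned
    # to the most lenient decision maker H_l (y_l valid where t_l == 1);
    # risk_m: M's risk score for the same cases; r: leniency of M
    n = len(t_l)
    released = t_l == 1
    order = np.argsort(risk_m[released])  # cases M releases first, first
    y_sorted = y_l[released][order]
    k = int(r * n)                        # contract the list to leniency r
    return np.sum(y_sorted[:k] == 0) / n  # failures over all cases of H_l
\end{verbatim}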
...
...
This is referred to as \trueevaluation.
\subsection{Results}
\label{sec:results}
We show the accuracy of the different evaluators (Section~\ref{sec:evaluators}) on different decision makers \machine over data sets employing different decision makers \humanset (Section~\ref{sec:decisionmakers}).
\spara{The basic setting.} Figure~\ref{fig:basic} shows estimated failure rates for each of the evaluators, at different leniency levels, when decisions in the data were made by an \independent decision maker, while \machine was of the \batch type.
%
In interpreting this plot, we should consider an evaluator to be accurate if its curve closely follows that of the optimal evaluator, i.e., \trueevaluation.
%
In this scenario, we see that \cfbi and \contraction are quite accurate, while the naive evaluation on \labeledoutcomes and the straightforward imputation by \logisticregression perform quite poorly.
%
In addition, \cfbi exhibits considerably lower variation than \contraction, as shown by the error bars.
Error bars show standard deviation of the absolute error over 10 datasets. \cfbi exhibits consistently lower error and variation than \contraction.
\end{figure}
Figure~\ref{fig:results_errors} shows the aggregate absolute error rates of the two evaluators, \cfbi and \contraction.
%
Each error bar is based on all datasets and leniencies from $0.1$ to $0.8$, for different combinations of decision-maker types for \humanset and \machine.
...
...
The overall result is that \cfbi evaluates the decision makers accurately and robustly.
It is able to learn model parameters that capture the behavior of decision makers employed in the data and use that model to evaluate any decision maker \machine using only selectively labeled data.
\contraction shows consistently poorer performance, and markedly larger variation as shown by the error bars.
%
Again, our interpretation is that this is because \contraction crucially depends on the cases assigned to the most lenient decision maker, while \cfbi makes full use of all data.
\caption{Evaluating \batch on data employing \independent and with leniency at most $0.5$. \cfbi offers sensible estimates of the failure rates for all levels of leniency, whereas \contraction produces estimates only up to leniency $0.5$.}
\label{fig:results_rmax05}
\end{figure}
...
...
a set of tools for assisting decision-making in the criminal justice system.
COMPAS provides needs assessments and risk estimates of recidivism.
It is derived from prior criminal history as well as socio-economic and personal factors, and it predicts recidivism over the following two years \cite{brennan2009evaluating}.
%
The COMPAS dataset used in this study is recidivism data from Broward County, Florida, USA, made available by ProPublica\footnote{\url{https://github.com/propublica/compas-analysis}}.
%
Judges and defendants in the data correspond to decision makers and cases in our setting (Section~\ref{sec:setting}), respectively.
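%
For reference, the data can be loaded directly from that repository; the file name below is, to our understanding, the two-year recidivism file used in ProPublica's analysis.
\begin{verbatim}
import pandas as pd

# two-year recidivism data from ProPublica's compas-analysis repository
url = ("https://raw.githubusercontent.com/propublica/"
       "compas-analysis/master/compas-scores-two-years.csv")
compas = pd.read_csv(url)
\end{verbatim}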
Note also that we are making the simplifying assumption that coefficients $\gamma_\obsFeatures$ and $\gamma_\unobservable$ are shared by all decision makers.
Parameter $\alpha_{\judgeValue}$ controls the leniency of a decision maker $\human_\judgeValue\in\humanset$.
We take a Bayesian approach to learn the model from the dataset.
%
In particular, we consider the full probabilistic model defined in Equations \ref{eq:defendantmodel} and \ref{eq:judgemodel} and obtain the posterior distribution of its parameters $\parameters=\{\alpha_\outcome, \beta_\obsFeatures, \beta_\unobservable, \gamma_\obsFeatures, \gamma_\unobservable\}\cup\bigcup_{\human_\judgeValue\in\humanset}\{\alpha_\judgeValue\}$, which includes intercepts $\alpha_\judgeValue$ for all $\human_\judgeValue$ employed in the data.
We use suitable prior distributions to ensure the identifiability of the parameters (Appendix~\ref{sec:priors}).
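%
To make the model concrete, the sketch below writes out its log density in Python, assuming the logistic forms of Equations~\ref{eq:defendantmodel} and~\ref{eq:judgemodel} and a standard-normal latent \unobservable, and omitting the priors of Appendix~\ref{sec:priors}; the parameter names are our shorthand for the symbols above.
\begin{verbatim}
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def log_density(p, z, x, t, y, judge):
    # p: dict with a_y, b_x, b_z, g_x, g_z and per-judge intercepts a_j;
    # z: latent unobserved feature, one value per case (sampled by MCMC)
    p_t = sigmoid(p["a_j"][judge] + p["g_x"] * x + p["g_z"] * z)
    p_y = sigmoid(p["a_y"] + p["b_x"] * x + p["b_z"] * z)
    ll = np.sum(np.log(np.where(t == 1, p_t, 1.0 - p_t)))  # decisions
    obs = t == 1              # outcomes are recorded only when released
    ll += np.sum(np.log(np.where(y[obs] == 1, p_y[obs], 1.0 - p_y[obs])))
    ll += np.sum(-0.5 * z ** 2)  # standard-normal Z (up to a constant)
    return ll
\end{verbatim}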
\subsection{Computing Counterfactual Outcomes}
...
...
For a fully defined model (with fixed parameters), the counterfactual expectation can be computed directly.
In essence, we determine the distribution of the unobserved features $\unobservable$ using the decision, observed features $\obsFeaturesValue$, and the leniency of the employed decision maker, and then determine the distribution of $\outcome$ conditional on all features, integrating over the unobserved features (Appendix~\ref{sec:counterfactuals}). Note that the decision maker model in Equation~\ref{eq:judgemodel} affects the distribution of the unobserved features $\prob{\unobservable|\judgeValue, \decision=0,\obsFeaturesValue}$.
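%
This computation admits a simple Monte Carlo approximation, sketched below for a single case and a single posterior draw of the parameters (same shorthand names as in the previous sketch; the standard-normal prior on \unobservable is again an assumption): draws of \unobservable are weighted by the probability of the observed negative decision.
\begin{verbatim}
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def counterfactual_outcome(p, x, judge, rng, m=1000):
    z = rng.normal(size=m)              # draws from the prior of Z
    # weight of each draw: p(T = 0 | x, z, judge), i.e. the probability
    # of the negative decision actually observed for this case
    w = 1.0 - sigmoid(p["a_j"][judge] + p["g_x"] * x + p["g_z"] * z)
    p_y = sigmoid(p["a_y"] + p["b_x"] * x + p["b_z"] * z)
    return np.sum(w * p_y) / np.sum(w)  # self-normalized importance sampling
\end{verbatim}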
Having obtained a posterior probability distribution for parameters \parameters we can estimate the counterfactual outcome value based on the data:
\begin{equation}
...
...
Note though that, unlike $\outcome$, which takes integer values $\{0, 1\}$, \cfoutcome takes continuous values in the interval $[0, 1]$.
Having obtained outcome estimates for all data entries, it is now straightforward to obtain an estimate for the failure rate $\failurerate$ of decision maker \machine: it is the average of $1 - \cfoutcome$ over all data entries.
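In code, under our reading that \cfoutcome is filled in for every entry (the observed outcome where available, the counterfactual estimate where \humanset decided negatively but \machine decides positively, and $1$ where \machine decides negatively, since such cases cannot fail), the estimate is:
\begin{verbatim}
import numpy as np

def failure_rate(y_hat):
    # y_hat: outcome estimate for every data entry, in [0, 1]
    return np.mean(1.0 - y_hat)
\end{verbatim}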
%
Our approach is summarized in Figure~\ref{fig:approach}.
%
...
...
We will refer to it as \textbf{\cfbi}, for {\underline c}ounter{\underline f}actual-{\underline b}ased {\underline i}mputation.