In the future we will examine further generalizing the setting and modeling assumptions.
%
Since our approach predicts outcomes based on decisions made by educated decision makers, it is an open question whether this information can also be used when learning the statistical models that automatic decision makers are ultimately based on.
%
We believe such approaches will allow for better evaluations in new application fields, ensuring the accuracy and fairness of automatic decision-making procedures that can then be adopted in society.
\subsection{Decision makers}
We used several different kinds of decision makers in the experiments, both to provide the simulated decisions in the data and as candidate decision makers to be evaluated. Recall from Section~\ref{sec:setting} that the portion of subjects each decision maker makes a positive decision for can be controlled by the leniency level $\leniency$.
The simplest decision maker, \textbf{Random}, simply selects a portion $\leniencyValue$ of the subjects assigned to it uniformly at random, makes a positive decision $\decision=1$ for them, and makes a negative decision $\decision=0$ for the remaining subjects.
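For concreteness, a minimal sketch of this decision maker in Python (assuming numpy; function and variable names are ours, not the paper's implementation):
\begin{verbatim}
import numpy as np

def random_decisions(n_subjects, leniency, seed=0):
    # Positive decision (T = 1) for a random fraction `leniency`
    # of the subjects; negative decision (T = 0) for the rest.
    rng = np.random.default_rng(seed)
    decisions = np.zeros(n_subjects, dtype=int)
    n_positive = int(leniency * n_subjects)
    decisions[rng.choice(n_subjects, n_positive, replace=False)] = 1
    return decisions
\end{verbatim}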
Following~\citet{lakkaraju2017selective}, the \textbf{Batch} decision maker sorts its subjects by risk score and then releases the portion $\leniencyValue$ of the subjects with the lowest scores. In the experiments the risk scores were given by logistic regression according to Equation~\ref{eq:defendantmodel}.
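Continuing the numpy-based sketch above, the batch decision maker can be illustrated as follows (again only a sketch; risk scores are assumed given as an array):
\begin{verbatim}
def batch_decisions(risk_scores, leniency):
    # Sort subjects by risk score and release (T = 1) the
    # `leniency` portion with the lowest scores.
    n_release = int(leniency * len(risk_scores))
    decisions = np.zeros(len(risk_scores), dtype=int)
    decisions[np.argsort(risk_scores)[:n_release]] = 1
    return decisions
\end{verbatim}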
The previous decision makers may seem unfair, as the decision they make for one subject \emph{depends} on the other subjects. To put it simply, they may need to make a negative decision for a subject today in order to make a positive decision for some subject tomorrow. To this end, we formulated an \textbf{Independent} decision maker, generalizing the batch decision maker, in the following way.
The risk scores of all defendants follow some distribution with cumulative distribution function $G$. Given leniency $\leniency=\leniencyValue$, the independent decision maker makes a positive decision whenever a defendant's risk score falls within the lowest $\leniencyValue$ portion of the scores, i.e., whenever the risk score is below $G^{-1}(\leniencyValue)$, the value of the inverse cumulative distribution function at $\leniencyValue$. In the experiments the risk scores were computed by logistic regression according to Equation~\ref{eq:defendantmodel}; see Appendix~A.3 for in-depth details.
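A sketch of the independent decision maker, approximating $G^{-1}(\leniencyValue)$ by an empirical quantile over a reference sample of risk scores (an illustrative simplification, not the paper's exact construction):
\begin{verbatim}
def independent_decisions(risk_scores, leniency, reference_scores):
    # G^{-1}(leniency) approximated by the empirical quantile of a
    # reference sample of risk scores; each subject is then decided
    # independently of the other subjects in the current batch.
    threshold = np.quantile(reference_scores, leniency)
    return (risk_scores < threshold).astype(int)
\end{verbatim}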
The decision makers in the data and the evaluated decision makers differ in the observability of \unobservable: the former have access to \unobservable and include it in their logistic regression model, while the latter omit \unobservable completely. All parameters of the logistic regression models for the evaluated decision makers are learned from the training data set; evaluation is based solely on the test set.
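The difference in observability can be sketched as follows (scikit-learn based; the inline data generation is only a stand-in for the synthetic setting, with illustrative coefficients):
\begin{verbatim}
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 1))   # observed features
Z = rng.normal(size=(1000, 1))   # unobserved confounder
p_fail = 1 / (1 + np.exp(-(X[:, 0] + Z[:, 0])))
y = (rng.random(1000) < p_fail).astype(int)

# Decision makers in the data observe Z and include it ...
model_data = LogisticRegression().fit(np.hstack([X, Z]), y)
# ... while evaluated decision makers omit Z completely.
model_eval = LogisticRegression().fit(X, y)
\end{verbatim}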
%
%
\subsection{Evaluators}
In addition to the counterfactual imputation (\textbf{Counterfactuals}) presented in this paper, we consider three other ways of evaluating decision makers. For the synthetic data, we can obtain the outcomes even for subjects with a negative decision; we plot these as \textbf{True evaluation}. Note that in a realistic setting the true evaluation would not be available. We also report the failure rate computed using only the subjects that were released in the data as \textbf{Labeled outcomes}. This naive baseline has previously been shown to considerably underestimate the true failure rate.
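To make the distinction concrete, the two evaluators can be sketched as below (illustrative only; we assume here, for the sketch, that outcomes are coded so that a failure is a positive decision followed by outcome $0$):
\begin{verbatim}
def true_failure_rate(decisions, outcomes):
    # Requires all outcomes, so only computable on synthetic data.
    return np.mean((decisions == 1) & (outcomes == 0))

def labeled_outcomes_failure_rate(decisions, outcomes, observed):
    # Counts failures only among subjects whose outcome was observed
    # (released in the data); masked failures are missed, which is
    # why this baseline underestimates the true failure rate.
    return np.mean((decisions == 1) & (outcomes == 0) & observed)
\end{verbatim}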
The state-of-the-art method for evaluating decision makers in this setting is \textbf{Contraction} by \citet{lakkaraju2017selective}.
%
It is designed specifically to estimate the true failure rate of a machine decision maker in the selective labels setting.
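A sketch of contraction as we understand it from \citet{lakkaraju2017selective} (under the same illustrative outcome coding as above; the q-prefixed arrays refer to subjects assigned to the most lenient decision maker in the data):
\begin{verbatim}
def contraction(q_decisions, q_outcomes, eval_risk, leniency):
    # Subjects released by the most lenient decision maker q.
    released = np.where(q_decisions == 1)[0]
    # Keep the leniency * |D_q| of them that the evaluated decision
    # maker considers safest (lowest predicted risk).
    n_keep = int(leniency * len(q_decisions))
    kept = released[np.argsort(eval_risk[released])][:n_keep]
    # Estimated failure rate relative to all subjects assigned to q.
    return np.sum(q_outcomes[kept] == 0) / len(q_decisions)
\end{verbatim}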
%
...
...
Figure~\ref{fig:basic} shows the basic evaluation of a batch decision maker on data employing batch decision makers over different leniencies. Here an evaluation metric is good if it matches the true evaluation (only available for synthetic data) at every leniency.
%
In this basic setting our approach (counterfactuals) achieves estimates with considerably lower variance (shown by the error bars) than the state-of-the-art contraction.
%
The naive approach of comparing only the cases where the outcome is not masked by the decision (labeled outcomes) underestimates the failure rate considerably.
...
...
\begin{figure}
\caption{Error of the estimate w.r.t.\ the true evaluation when the effect of the unobserved $\unobservable$ is high ($\beta_\unobservable=\gamma_\unobservable=5$). Although the decision maker quality is poorer, the proposed approach (counterfactuals) can still evaluate the decision makers accurately. Contraction shows higher variance and less accuracy.}
\label{fig:highz}
\end{figure}
Figure~\ref{fig:highz} shows experiments where the effect of the unobserved $\unobservable$ is higher, i.e., we used $\beta_\unobservable=\gamma_\unobservable=5$ when generating the data. In this case the decisions in the data are made mostly based on background factors not observed by the decision maker $\machine$ being evaluated, so the performance of $\machine$ is not expected to be good. Nevertheless, the proposed method (counterfactuals) is able to evaluate different decision makers $\machine$ accurately. Contraction shows markedly poorer performance in comparison, also relative to contraction in Figure~\ref{fig:results_errors}, where the effect of $\unobservable$ on the decisions in the data was not as high.