As digitalisation affects more and more aspects of life, so do the automated decisions made by algorithmic systems.
This is particularly prevalent in web services, where algorithms decide what results are shown by search engines, what news stories appear on social media platforms, or what products are recommended in online stores.
But automated-decision systems are also starting to appear in other domains: credit scoring, choice of medical treatment, insurance pricing, and even judicial decisions (COMPAS~\cite{brennan2009evaluating} and RisCanvi~\cite{tolan2019why} are two examples of algorithmic tools used to evaluate the risk of recidivism in the US and Catalan prison systems, respectively).
In contrast to human decision making, automated decision making by AI systems holds the promise of better decision quality~\cite{kleinberg2018human}, and possibly even guarantees of fairness~\cite{DBLP:conf/icml/Kusner0LS19}.
But before deploying machines to make automated decisions, there is a need to evaluate their performance in real settings.
In practice, this is done by simulating their deployment over a log of past cases and measuring how well they would have performed had they been used instead of the human decision makers or other decision system currently in place.
Herein lies a challenge: previously made decisions (based on case features, some of which are not recorded in the data) affect the data over which the evaluation is performed, in a way that prevents straightforward evaluation.
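To make the problem concrete, the following minimal sketch (assuming a log with hypothetical columns \texttt{decision}, \texttt{outcome}, \texttt{x1}, and \texttt{x2}, a scikit-learn-style \texttt{model.predict}, and the encoding \texttt{outcome == 0} for a failure) shows how a naive replay can only score cases whose recorded decision was positive; cases the evaluated model would accept but the log rejected carry no outcome and are silently dropped, which is precisely what biases the estimate.
\begin{verbatim}
import pandas as pd

def naive_failure_rate(log: pd.DataFrame, model) -> float:
    """Naive estimate of the failure rate of `model`, using only
    outcomes observed under the historical decisions in `log`."""
    # Decisions the evaluated model would make on the logged cases.
    would_release = model.predict(log[["x1", "x2"]]) == 1
    # Outcomes are recorded only where the historical decision was positive.
    observed = (log["decision"] == 1).to_numpy()
    evaluable = log[would_release & observed]
    # Cases the model would release but the log rejected carry no outcome
    # and are dropped here, which is what biases this naive estimate.
    return float((evaluable["outcome"] == 0).mean())
\end{verbatim}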
...
...
There is a rich literature on problems that arise in similar settings, which we relate to our specific setting below.
At its core, our task is to answer a `what-if' question, asking ``what would the outcome have been if a different decision had been made'' (a counterfactual); this is often referred to as the `fundamental problem' of causal inference~\cite{holland1986statistics, bookofwhy}.
% SELECTION BIAS
Settings where data samples are chosen through some intricate filtering mechanism are said to exhibit {\it selection bias} (see, for example, \citet{hernan2004structural}). In the present case, any model predicting outcomes can only be trained on samples for which the decision was positive.
%WHAT WE DO NOT HAVE THIS???
% MISSING DATA %IMPUTATION
Settings where some variables are not observed for all samples have \emph{missing data} (see, for example, \citet{little2019statistical}). Here, the outcomes for samples with a negative decision are considered missing, or labeled with some default value.
%
\emph{Latent confounding} refers to the presence of unobserved variables that affect two or more of the observed variables (see, for example, \citet{pearl2000}). In our case, there generally are features not recorded in the data that confound both the decision and the outcome.
%Research on selection bias has achieved results in recovery the structure of the generative model (i.e., the mechanism that results in bias) and estimating causal effects (e.g.,~\citet{pearl1995empirical} and~\citet{bareinboim2012controlling}).
%OFFLINE POLICY EVALUATION
\emph{Offline policy assessment} refers to the evaluation of a decision policy over a dataset recorded under another policy (see, for example, \citet{Jung2}); this is also the case here, as the recorded decisions are generally based on a particular policy.
%COUNFOUNDING AND SENSITIVITY ANALYSIS
%WE WANT TO CITE HERE ALL SELECTIVE LABELS PAPERS TO SELL THIS VIEWPOINT
Recently, \citet{lakkaraju2017selective} referred to the problem of evaluation in such settings as the `{\it selective labels problem}', emphasizing the fact that outcomes in the data are selectively labeled based on the decisions
(also considered by~\citet{dearteaga2018learning,kleinberg2018human}).
%
\citet{lakkaraju2017selective} also presented {\it contraction}, a method for evaluating decision making mechanisms in a setting where subjects are randomly assigned to decision makers with varying leniency levels.
%
The {\it contraction} technique takes advantage of the assumed random assignment and variance in leniency: essentially it measures the performance of the evaluated system using the cases of the most lenient judge.
%
We note, however, that for contraction to work, lenient decision makers must have decided on a large number of subjects.
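%
A minimal sketch of this idea follows (assuming hypothetical column names \texttt{released} and \texttt{failure} for the most lenient judge's caseload, and a \texttt{risk\_score} function standing in for the evaluated model's risk predictions); see \citet{lakkaraju2017selective} for the exact algorithm.
\begin{verbatim}
import pandas as pd

def contraction_sketch(cases: pd.DataFrame, risk_score,
                       acceptance_rate: float) -> float:
    """Contraction-style estimate on the most lenient judge's caseload."""
    # Outcomes are observed only for the cases this judge released.
    released = cases[cases["released"] == 1].copy()
    released["risk"] = risk_score(released)
    # Keep the lowest-risk released cases, up to the evaluated model's
    # target acceptance rate over the judge's whole caseload; because the
    # judge was even more lenient, their outcomes are all observed.
    n_keep = int(acceptance_rate * len(cases))
    kept = released.nsmallest(n_keep, "risk")
    # Failure rate is reported relative to the full caseload.
    return kept["failure"].sum() / len(cases)
\end{verbatim}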
% THIS BELONGS TO RELATED WORK
In another recent paper, \citet{Jung2} studied unobserved confounding in the context of creating optimal decision policies.
%
They approached the problem with Bayesian modelling, but they do not consider selective labeling or the possibility that the decisions reflected in the data may have been made by several decision makers with differing levels of leniency.
\spara{Our contributions}
In this paper, we build on the problem setting of~\citet{lakkaraju2017selective} and present a novel, modular framework for evaluating decision makers over selectively labeled data.