Commit 472ef7c6 authored by Antti Hyttinen

Section 2.

parent d0499dff
@@ -90,7 +90,7 @@ deciding whether of defendant should be given bail or kept in jail, we are not a
The advantage of using models does not necessarily lie in pure performance, that is, in a machine being able to make more decisions, but rather in that a machine can give bounds for uncertainty, can learn from a vast amount of information and, with care, can be made as unbiased as possible.
However, before deploying any decision-making algorithm, it should be evaluated to show that it actually improves on the previous, often human, decision maker: a judge, a doctor, ... who makes the decisions that determine which outcome labels are available. This evaluation is far from trivial.
%Although, evaluating algorithms in conventional settings is trivial, when (almost) all of the labels are available, numerous metrics have been proposed and are in use in multiple fields.
Specifically, `Selective labels' settings arise in situations where data are the product of a decision mechanism that prevents us from observing outcomes for part of the data \cite{lakkaraju2017selective}. As a typical example, consider bail-or-jail decisions in judicial settings: a judge decides whether to grant bail to a defendant based on whether the defendant is considered likely to violate bail conditions while awaiting trial -- and therefore a violation can occur only if bail is granted. Naturally, similar scenarios arise in many applications, from economics to medicine.
%For example, given a data set of bail violations and bail/jail decision according some background factors, there will never be bail violations on those subjects kept in jail by the current decision making mechanism, hence the evaluation of a decision bailing such subjects is left undefined.
@@ -100,16 +100,27 @@ Such settings give rise to questions about the effect of alternative decision me
%In settings like judicial bail decisions, some outcomes cannot be observed due to the nature of the decisions.
This can be seen as a complicated missing data problem in which the missingness of an item is connected with its outcome, so that the available labels are not a random sample of the true population. Lakkaraju et al. recently named this the selective labels problem \cite{lakkaraju2017selective}. The selective labels issue has also been addressed in the causal inference literature under the heading of selection bias, although that discussion has mainly concentrated on recovering causal effects, and the assumed model structure has usually been different (Pearl, Bareinboim, etc.). Pearl calls the missingness of the outcome under an alternative decision the `fundamental problem' of causal inference \cite{bookofwhy}.
\begin{itemize}
\item What we study
\begin{itemize}
\item We studied methods to evaluate the performance of predictive algorithms/models when the historical data suffers from selective labeling and unmeasured confounding.
\end{itemize}
\item Motivation for the study
\begin{itemize}
Recently, Lakkaraju et al. presented a method for the evaluation of decision-making mechanisms, called contraction \cite{lakkaraju2017selective}. It assumes that subjects are randomly assigned to decision makers, each with a given leniency level. Under these assumptions, an estimate of the performance can be obtained by essentially considering only the most lenient decision maker. Contraction was shown to perform well compared to previously presented methods.
For contraction to work, we need lenient decision makers who have made decisions on a large number of subjects. If we need to evaluate a decision maker more lenient than those in the data, there may not be sufficiently lenient decision makers to compare against. Furthermore, evaluating on only one decision maker when possibly many are present may produce estimates with higher variance. In reality, the decision makers in the data are also not perfect; they may, for example, be biased. Our aim is to develop a method that overcomes these challenges and limitations.
In this paper we propose a novel, modular framework that provides a systematic way of evaluating decision makers from selectively labeled data. Our approach is based on imputing the missing labels using counterfactual reasoning. We also build on Jung et al., who present a method for constructing optimal policies; we show that their approach can also be applied to the selective labels setting \cite{jung2018algorithmic}.
%to evaluate the performance of predictive models in settings where selective labeling and latent confounding is present. We use theory of counterfactuals and causal inference to formally define the problem.
We define a flexible Bayesian approach, and the inference is performed with modern computational tools.
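For concreteness, the following is a minimal sketch of the contraction estimator, based on our reading of \cite{lakkaraju2017selective}; the Python function name \texttt{contraction} and the column names \texttt{judge}, \texttt{decision} and \texttt{outcome} are illustrative choices of ours, not part of the original algorithm's interface.
\begin{verbatim}
import numpy as np
import pandas as pd

def contraction(df, risk, r):
    """Estimate the failure rate of a predictive model at target
    acceptance rate r, using only the most lenient decision maker.

    df   -- data frame with columns 'judge', 'decision' (1 = positive)
            and 'outcome' (0 = failure, recorded only when decision == 1)
    risk -- the model's predicted failure risk for each row of df
    r    -- target acceptance rate
    """
    df = df.assign(risk=np.asarray(risk))
    leniency = df.groupby('judge')['decision'].mean()
    q = leniency.idxmax()                  # most lenient decision maker
    D_q = df[df['judge'] == q]             # cases assigned to q
    R_q = D_q[D_q['decision'] == 1]        # cases given a positive decision
    keep = int(r * len(D_q))               # model accepts r * |D_q| cases
    assert keep <= len(R_q), "r exceeds the leniency of the most lenient judge"
    R_B = R_q.nsmallest(keep, 'risk')      # keep the least risky accepted cases
    return (R_B['outcome'] == 0).sum() / len(D_q)
\end{verbatim}
The sketch makes the limitations discussed above visible: only the cases of a single, maximally lenient decision maker contribute to the estimate, and the assertion fails whenever the target leniency $r$ exceeds that decision maker's own leniency.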
%\begin{itemize}
%\item What we study
% \begin{itemize}
% \item We studied methods to evaluate the performance of predictive algorithms/models when the historical data suffers from selective labeling and unmeasured confounding.
% \end{itemize}
%item Motivation for the study%
% \begin{itemize}
% \item %Fairness has been discussed in the existing literature and numerous publications are available for interested readers. Our emphasis on this paper is on pure performance, getting the predictions accurate.
% \item
@@ -125,24 +136,25 @@ This can be seen as a complicated missing data problem where the missingness of
% \item So this might lead to situation where subjects with same characteristics may be given different decisions due to the differing leniency.
% \item Of course the differing decisions might be attributable to some unobserved information that the decision-maker might have had available due to meeting with the subject.
% \item %The explainability of black-box models has been discussed in X. We don't discuss fairness.
\item In settings like judicial bail decisions, some outcomes cannot be observed due to the nature of the decisions. This results in a complicated missing data problem where the missingness of an item is connected with its outcome and where the available labels aren't a random sample of the true population. Recently this problem has been named the selective labels problem.
\end{itemize}
\item Related work
\begin{itemize}
\item In the original paper, Lakkaraju et al. presented contraction which performed well compared to other methods previously presented in the literature.
\item We wanted to benchmark our approach to that and show that we can improve on their algorithm in terms of restrictions and accuracy.
\item Restrictions = our method does not require as many assumptions (random assignment, agreement rate, etc.) and can estimate the performance at all levels of leniency, not only up to the leniency of the most lenient judge. See Fig. 5 of Lakkaraju et al.
\item Jung et al. presented a method for constructing optimal policies; we show that their approach can also be applied to the selective labels setting.
\item They didn't have selective labeling nor did they consider that the judges would differ in leniency.
\item The selective labels issue has been addressed in the causal inference literature under the heading of selection bias; that discussion has mainly concentrated on recovering causal effects, and the assumed model structure has usually been different (Pearl, Bareinboim, etc.).
\item Latent confounding has been discussed by X in the context of the effect of latent confounders on ORs, etc.
\end{itemize}
\item Our contribution
\begin{itemize}
\item In this paper we propose a novel, modular framework that provides a systematic way of presenting these missing data problems by breaking them into different modules and explicating their functions.
\item In addition, we present an approach for inferring (imputing) the missing labels in order to evaluate the performance of predictive models in settings where selective labeling and latent confounding are present. We use the theory of counterfactuals and causal inference to formally define the problem, and we carry out the computation with a flexible Bayesian approach.
\end{itemize}
\end{itemize}
% \item In settings like judicial bail decisions, some outcomes cannot be observed due to the nature of the decisions. This results in a complicated missing data problem where the missingness of an item is connected with its outcome and where the available labels aren't a random sample of the true population. Recently this problem has been named the selective labels problem.
%\end{itemize}
%\item
%Related work
%\begin{itemize}
% \item In %the original paper, Lakkaraju et al. presented contraction which performed well compared to other methods previously presented in the literature.
% \item We wanted to benchmark our approach to that and show that we can improve on their algorithm in terms of restrictions and accuracy.
% \item %Restrictions = our method doesn't have so many assumptions (random assignments, agreement rate, etc.) and can estimate the performance on all levels of leniency despite the judge with the highest leniency. See fig 5 from Lakkaraju
% \item
% \item They didn't have selective labeling nor did they consider that the judges would differ in leniency.%
% \item
% \item %Latent confounding has bee discussed by X when discussing the effect of latent confounders to ORs. ec etc.
% \end{itemize}
%\item Our contribution
% \begin{itemize}
% \item
% \end{itemize}
%\end{itemize}
\section{The Selective Labels Framework}
@@ -168,15 +180,21 @@ A decision maker $D(r)$ makes the decision $T$ based on the characteristics of t
In the bail-or-jail example, a decision maker seeks to jail ($T=0$) all dangerous defendants who would violate their bail ($Y=0$), but to release the defendants who would not. The leniency $r$ refers to the proportion of positive (bail) decisions.
The difference between the decision makers in the data and $D(r)$ is that usually we cannot observe all the information that has been available to the decision makers in the data.
% In addition, we usually cannot observe the full decision-making process of the decider in the data step contrary to the decider in the modelling step.
With unobservables we refer to latent, usually unrecorded information regarding the outcome that is available only to the decision maker. For example, a judge in court can observe the defendant's behaviour and level of remorse, which might be indicative of a bail violation. We denote the latent information regarding a person's guilt with the variable \unobservable.
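As a toy illustration of this setting (a sketch under our own simplifying assumptions -- Gaussian features, a logistic outcome model and a quantile-based decision rule -- and not the data-generating process used in our experiments), selectively labeled data could be generated as follows.
\begin{verbatim}
import numpy as np

def generate_selective_labels(n=10_000, r=0.5, seed=0):
    """Toy data: X is observed, Z is seen only by the decision maker,
    and the outcome Y is informative only when the decision T is 1."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=n)                    # observed features
    Z = rng.normal(size=n)                    # unobservables
    perceived_risk = X + Z                    # what the decision maker acts on
    # a decision maker with leniency r releases the r*n least risky subjects
    T = (perceived_risk <= np.quantile(perceived_risk, r)).astype(int)
    p_fail = 1.0 / (1.0 + np.exp(-(X + Z)))   # probability of failure (Y = 0)
    Y_latent = rng.binomial(1, 1.0 - p_fail)  # 1 = good outcome, 0 = failure
    Y = np.where(T == 1, Y_latent, 1)         # a negative decision forces Y = 1
    return X, T, Y
\end{verbatim}
Only $X$, $T$ and $Y$ would be available to an evaluator; $Z$ and the latent outcomes of the subjects with $T=0$ are exactly what makes the evaluation problem below non-trivial.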
\subsection{Evaluating Decision Makers}
The goodness of a decision maker can be examined as follows.
%Acceptance rate (AR) is the number of positive decisions ($T=1$) divided by the number of all decisions.
%DO WE NEED ACCEPTANCE RATE ANY MORE
Failure rate (FR) is the number of undesired outcomes ($Y=0$) divided by the number of all decisions.
% One special characteristic of FR in this setting is that a failure can only occur with a positive decision ($T=1$).
%That means that a failure rate of zero can be achieved just by not giving any positive decisions but that is not the ultimate goal.
A good decision maker achieves as low a failure rate (FR) as possible, at any leniency level.
However, the data we have do not directly provide a way to evaluate FR. If a decision maker decides $T=1$ for a subject that had $T=0$ in the data, the outcome $Y$ recorded in the data is based on the decision $T=0$, and hence $Y=1$ regardless of the decision taken by $D$. The number of negative outcomes ($Y=0$) for these decisions needs to be estimated in some non-trivial way.
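To restate this in symbols (the indicator and potential-outcome notation here are ours, introduced only for illustration, with $Y_i(t)$ denoting the outcome subject $i$ would have under decision $t$): the failure rate of the decision maker observed in the data is
\[
\mathrm{FR} \;=\; \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\{Y_i = 0\},
\]
whereas the quantity needed for an alternative decision maker $D(r)$ with decisions $T_i^{D(r)}$ is
\[
\mathrm{FR}\bigl(D(r)\bigr) \;=\; \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\bigl\{Y_i\bigl(T_i^{D(r)}\bigr) = 0\bigr\},
\]
which involves the counterfactual outcomes $Y_i(1)$ of subjects with $T_i = 0$ in the data.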
@@ -187,9 +205,8 @@ Therefore, the aim is here to give an estimate of the FR at any given AR for any
Given selectively labeled data, and a decision maker $D(r)$, give an estimate of the failure rate FR for any leniency $r$.
\end{problem}
\noindent
This estimate is vital for deploying machine learning and AI systems in everyday use, and it should be accurate at all levels of leniency.
% Given the selective labeling of data and the latent confounders present, our goal is to create an evaluator module that can output a reliable estimate of a given decider module's performance. We use acceptance rate and failure rate as measures against which we compare our evaluators because they have direct and easily understandable counterparts in the real world / applicable domains. The evaluator module should be able to accurately estimate the failure rate for all levels of leniency and all data sets.
@@ -230,69 +247,72 @@ We use a propensity score framework to model $X$ and $Z$: they are assumed conti
\acomment{Not sure if this is good to discuss here or in the next section: if we would like the next section to be full of our contributions and not Lakkaraju's, we should place it here.}
\setcounter{section}{1}
%\setcounter{section}{1}
\section{ Framework ( by Riku)}
%\section{ Framework ( by Riku)}
%In this section, we define the key terms used in this paper, present the modular framework for selective labels problems and state our problem.
%Antti: In conference papers we do not waste space for such in this paper stuff!! In journals one can do that.
\begin{itemize}
\item Definitions \\
In this paper we apply our approach to binary (positive/negative) outcomes, but it is readily extendable to continuous or categorical responses, in which case e.g. the sum of squared errors or another appropriate metric could be used as the measure of performance.
With positive or negative outcomes we refer to...
\begin{itemize}
\item Failure rate
\begin{itemize}
\item Failure rate (FR) is defined as the ratio of undesired outcomes to all given decisions. One special characteristic of FR in this setting is that a failure can only occur with a positive decision, i.e. we can only observe the outcome when the corresponding decision is positive.
\item That means that a failure rate of zero could be achieved simply by never giving a positive decision, but that is not the ultimate goal; rather, the goal is a good balance (e.g. because of resource issues in prisons, etc.).
\end{itemize}
\item Acceptance rate
\begin{itemize}
\item Acceptance rate (AR), or leniency, is defined as the ratio of positive decisions to all decisions that a decision-maker gives. (Semantically, what is the difference between AR and leniency? AR is always computable from the data, whereas leniency does not manifest directly.)
\item In some settings (justice, medicine) people might want to find out what the resulting failure rate is if X\% are accepted, and what the highest acceptance rate would be that keeps the failure rate at an acceptable level.
\item We want to know the trade-off between acceptances and failure rate.
\item Lakkaraju et al. mention the problem that judges with a higher leniency have labeled a larger portion of the data (which might result in bias).
\item As mentioned earlier, these differences in AR might lead to subjects getting different decisions while having the same observable and unobservable characteristics.
\end{itemize}
\item With decider or decision-maker we refer to a judge, a doctor, ... who makes the decisions that determine which labels are available. % Some deciders might have an incentive for positive decisions if it can mean e.g. savings. A judge makes savings by not jailing a defendant. A doctor makes savings by not assigning a patient to higher-intensity care. (move to motivation?)
\item With unobservables we refer to some latent, usually non-written information regarding a certain outcome that is only available to the decision-maker. For example, a judge in court can observe the defendant's behaviour and level of remorse which might be indicative of bail violation. We denote the latent information regarding a person's guilt with variable \unobservable.
\end{itemize}
\item Modules \\
We separate the steps that modify the data into distinct modules in order to formally define how they work. With observational data sets, the data go through only a modelling step and an evaluation step; with synthetic data, we also need to define a data-generating step. We call the blocks performing these steps {\it modules}. To fully define a module, one must define its input and output. Modules have different functions, inputs and outputs, and two modules of the same type are interchangeable if they share the same input and output (you can swap a decider module of type A for a decider module of type B). With this modular framework we achieve a unified way of presenting the key differences between settings; a minimal interface sketch in code is given after this list.
\begin{itemize}
\item Decider modules
\begin{itemize}
\item In general, the decider module assigns predictions to the observations based on some information.
\item The information available to a decision-maker in the decider module includes observable and -- possibly -- unobservable features, denoted with X and Z respectively.
\item The predictions given by a decider module can be relative or absolute. By relative predictions we mean that a decider module can output a ranking of the subjects based on their predicted tendency towards an outcome. Absolute predictions can be either binary or continuous in nature; for example, they can correspond to yes/no decisions or to probability values.
\item The inner workings (procedure/algorithm) of the module may or may not be known. In observational data sets, the mechanism or decider which has labeled the data is usually unknown; e.g. we do not -- exactly -- know how judges arrive at a decision. Conversely, in synthetic data sets the procedure creating the decisions is fully known because we define the process.
\item The decider (module) in the data step has unobservable information available for making the decisions.
\item The behaviour of the decider module in the data generating step can be defined in many ways. We have used both the method presented by Lakkaraju et al. and two methods of our own. We created these two deciders to remove the interdependencies of the decisions made by the decider Lakkaraju et al. presented.
\item The difference between the deciders in the data and modelling steps is that usually we cannot observe all the information that was available to the decider in the data step, whereas we can for the decider in the modelling step. In addition, we usually cannot observe the full decision-making process of the decider in the data step, unlike that of the decider in the modelling step.
\end{itemize}
\item Evaluator modules
\begin{itemize}
\item The evaluator module gets the decisions, the observable features of the subjects and the predictions made by the deciders, and outputs an estimate of...
\item The evaluator module outputs an estimate of a decider module's performance. This estimate should
\begin{itemize}
\item be precise and unbiased
\item have a low variance
\item be as robust as possible to slight changes in the data generation.
\end{itemize}
\item The estimate of the evaluator should also be accurate for all levels of leniency.
\end{itemize}
\end{itemize}
%\begin{itemize}
%\item Definitions \\
% In this paper we apply our approach on binary (positive / negative) outcomes, but our approach is readily extendable to accompany continuous or categorical responses. Then we could use e.g. sum of squared errors or other appropriate metrics as the measure for good performance.
% With positive or negative outcomes we refer to...
%\begin{itemize}
%\item Failure rate
% \begin{itemize}
% \item %Failure rate (FR) is defined as the ratio of undesired outcomes to given decisions. One special characteristic of FR in this setting is that a failure can only occur with a positive decision / we can only observe the outcome when the corresponding decision is positive.
% \item %That means that a failure rate of zero can be achieved just by not giving any positive decisions but that is not the ultimate goal. (rather about finding a good balance. > Resource issues in prisons etc.)
%\end{itemize}
% \item Acceptance rate
%\begin{itemize}
% \item %Acceptance rate (AR) or leniency is defined as the ratio of positive decisions to all decisions that a decision-maker will give. (Semantically, what is the difference between AR and leniency? AR is always computable, leniency doesn't manifest.) A: a good question! can we get ir of one
%\item
% In some settings, (justice, medicine) people might want to find out if X\% are accepted what is the resulting failure rate, and what would be the highest acceptance rate to have to have the failure rate at an acceptable level.
% \item We want to know the trade-off between acceptances and failure rate.
% \item %Lakkaraju mentioned the problem in the data that judges which have a higher leniency have labeled a larger portion of the data (which might results in bias).
% \item As mentioned earlier, these differences in AR might lead to subjects getting different decisions while haven the same observable and unobservable characteristics.
%\end{itemize}
% \item % Some deciders might have an incentive for positive decisions if it can mean e.g. savings. Judge makes saving by not jailing a defendant. Doctor makes savings by not assigning patient for a higher intensity care. (move to motivation?)
%\item
%\end{itemize}
%\begin{itemize}
%\item Modules \\
% We separated steps that modify the data into separate modules to formally define how they work. With observational data sets, the data goes through only a modelling step and an evaluation step. With synthetic data, we also need to define a data generating step. We call the blocks doing these steps {\it modules}. To fully define a module, one must define its input and output. Modules have different functions, inputs and outputs. Modules are interchangeable with a similar type of module if they share the same input and output (You can change decider module of type A with decider module of type B). With this modular framework we achieve a unified way of presenting the key differences in different settings.
% \begin{itemize}
% \item Decider modules
%\begin{itemize}
% \item In general, the decider module assigns predictions to the observations based on some information.
% \item %The information available to a decision-maker in the decider module includes observable and -- possibly -- unobservable features, denoted with X and Z respectively.
% \item %The predictions given by a decider module can be relative or absolute. With relative predictions we refer to that a decider module can give out a ranking of the subjects based on their predicted tendency towards an outcome. Absolute predictions can be either binary or continuous in nature. For example, they can correspond to yes or no decisions or to a probability value.
% \item %Inner workings (procedure/algorithm) of the module may or may not be known. In observational data sets, the mechanism or the decider which has labeled the data is usually unknown. E.g. we do not -- eactly -- know how judges obtain a decision. Conversely, in synthetic data sets the procedure creating the decisions is fully known because we define the process.
% \item The decider (module) in the data step has unobservable information available for making the decisions.
% \item %The behaviour of the decider module in the data generating step can be defined in many ways. We have used both the method presented by Lakkaraju et al. and two methods of our own. We created these two deciders to remove the interdependencies of the decisions made by the decider Lakkaraju et al. presented.
% \item \end{itemize}
% \item Evaluator modules
%\begin{itemize}
% \item Evaluator module gets the decisions, observable features of the subject and predictions made by the deciders and outputs an estimate of...
% \item The evaluator module outputs a reliable estimate of a decider module's performance. The estimate is created by the evaluator module and it should
% \begin{itemize}
% % \item be precise and unbiased
% \item have a low variance
% \item be as robust as possible to slight changes in the data generation.
% \end{itemize}
% \item The estimate of the evaluator should also be accurate for all levels of leniency.
%\end{itemize}
% \end{itemize}
%\item Example: in observational data sets, the deciders have already made decision concerning the subjects and we have a selectively labeled data set available. In the modular framework we refer to the actions of the human labelers as a decider module which has access to latent information.
\item Problem formulation \\
Given the selective labeling of data, multiple decision-makers and the latent confounders present, our goal is to create an evaluator module that can output a reliable estimate of a given decider module's performance. We use acceptance rate and failure rate as measures against which we compare our evaluators because they have direct and easily understandable counterparts in the applicable domains.
%\item Problem formulation \\
%The "eventual goal" is to create such a decider module that it can outperform (have a lower failure on all levels of acceptance rate) the deciders in the data generating process. The problem is of course comparing the performance of the deciders. We try to address that.
%(It's important to somehow keep these two different goals separate.)
We show that our method is robust against violations and modifications in the data generating mechanisms.
%We show that our method is robust against violations and modifications in the data generating mechanisms.
\end{itemize}
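The following minimal interface sketch makes the modular structure above concrete; the class and method names (\texttt{Decider}, \texttt{Evaluator}, \texttt{decide}, \texttt{failure\_rate}) are hypothetical and chosen for illustration, not the implementation used in our experiments.
\begin{verbatim}
from abc import ABC, abstractmethod
import numpy as np

class Decider(ABC):
    """Decider module: maps (observed and, possibly, unobserved)
    features to binary decisions at a given leniency."""
    @abstractmethod
    def decide(self, X, r, Z=None):
        """Return decisions T (1 = positive) at leniency r in [0, 1]."""

class Evaluator(ABC):
    """Evaluator module: estimates the failure rate a decider would
    attain, from selectively labeled data (Y recorded only if T = 1)."""
    @abstractmethod
    def failure_rate(self, X, T, Y, decider, r):
        """Return an estimate of FR for `decider` at leniency r."""

class QuantileDecider(Decider):
    """Toy decider: gives a positive decision to the fraction r of
    subjects with the lowest risk under a fixed scoring function."""
    def __init__(self, risk_score):
        self.risk_score = risk_score        # callable mapping X to risk
    def decide(self, X, r, Z=None):
        risk = self.risk_score(X)
        return (risk <= np.quantile(risk, r)).astype(int)
\end{verbatim}
Under this interface, contraction and the counterfactual imputation approach of the next section are simply two different \texttt{Evaluator} implementations, which is what makes modules of the same type interchangeable.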
%\end{itemize}
\section{Counterfactual-Based Imputation For Selective Labels}