\acomment{I do not understand what we are doing from this section. It needs to be described ASAP.}
In counterfactual-based imputation we use counterfactual values of the outcome $\outcome_{\decisionValue=1}$ to impute the missing labels. The SCM required to compute the counterfactuals is presented in figure \ref{fig:model}. Using Stan, we model the observed data as
\begin{align}\label{eq:data_model}
\outcome(1) &\sim \text{Bernoulli}(\invlogit(\alpha_y + \beta_{xy} x + \beta_{zy} z)), \\
\decision ~|~ D, x &\sim \text{Bernoulli}(\invlogit(\alpha_{j} + \beta_{xt} x + \beta_{zt} z)). \nonumber
\end{align}
That is, we fit one logistic regression model to the decisions, based on the observable features \features and the identity of the judge, using all of the data. The identity of the judge is encoded into the intercept $\alpha_j$; we use a different intercept for each judge. We model the outcomes observed when $\decision=1$ with a separate logistic regression model, to learn the coefficients $\beta_{xy}$ for the observed features, $\beta_{zy}$ for the unobserved features, and the single intercept $\alpha_y$.
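To make the data model concrete, the following is a minimal Python sketch of the generative process in equation \ref{eq:data_model}; the parameter values and variable names are ours, for illustration only (the actual inference is done with Stan).
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def invlogit(a):
    return 1.0 / (1.0 + np.exp(-a))

# Illustrative parameter values (not estimates).
alpha_y, beta_xy, beta_zy = 0.0, 1.0, 1.0   # outcome model
beta_xt, beta_zt = 1.0, 1.0                 # decision model
alpha_j = rng.normal(size=100)              # one intercept per judge

n = 1000
x = rng.normal(size=n)                      # observable features
z = rng.normal(size=n)                      # unobservable features
j = rng.integers(0, 100, size=n)            # judge handling each case

# Decision model: logistic regression with judge-specific intercepts.
t = rng.binomial(1, invlogit(alpha_j[j] + beta_xt * x + beta_zt * z))

# Outcome under a positive decision; observed only when t == 1.
y1 = rng.binomial(1, invlogit(alpha_y + beta_xy * x + beta_zy * z))
y_obs = np.where(t == 1, y1, np.nan)        # labels missing when t == 0
\end{verbatim}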
Using the samples from the posterior distribution of all the parameters given by Stan, we can estimate the values of the counterfactuals. The parameters are the $\alpha$ intercepts, the $\beta$ coefficients, and the latent variable $\unobservable$. The counterfactuals are formally drawn from the posterior predictive distribution
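In code, this imputation step might look as follows, continuing the sketch above; the arrays of posterior draws are hypothetical placeholders standing in for Stan's actual output.
\begin{verbatim}
# Hypothetical arrays of S posterior draws returned by Stan:
# alpha_y_s, beta_xy_s, beta_zy_s of shape (S,), z_s of shape (S, n).
S = 2000
alpha_y_s = rng.normal(0.0, 0.1, size=S)    # placeholders standing in
beta_xy_s = rng.normal(1.0, 0.1, size=S)    # for the actual posterior
beta_zy_s = rng.normal(1.0, 0.1, size=S)    # draws
z_s = rng.normal(size=(S, n))

# Counterfactual outcome under decision t = 1, for every posterior draw.
p_cf = invlogit(alpha_y_s[:, None]
                + beta_xy_s[:, None] * x[None, :]
                + beta_zy_s[:, None] * z_s)
y_cf = rng.binomial(1, p_cf)                # (S, n) posterior predictive

# Keep observed labels where available, impute the rest.
y_full = np.where(t[None, :] == 1, y_obs[None, :], y_cf)
\end{verbatim}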
We sampled $N=50{,}000$ observations of $X$, $Z$, and $W$ as independent standard Gaussians. We then drew the outcome $Y$ from a Bernoulli distribution with parameter $p = 1-\invlogit(\beta_x x+\beta_z z+\beta_w w)$, so that $P(Y=0~|~X, Z, W)=\invlogit(\beta_x x+\beta_z z+\beta_w w)$, where the coefficients for $X$, $Z$ and $W$ were set to $1$, $1$, and $0.2$, respectively. We sampled a leniency level $R$ for each of the $M=100$ judges uniformly from $[0.1, 0.9]$ and assigned the subjects to the judges at random so that each judge decided on $500$ subjects; this mimics having $100$ judges each deciding for $500$ defendants. The data was divided in half to form a training set and a test set. This process follows the suggestion of Lakkaraju et al. \cite{lakkaraju2017selective}. \acomment{Check before?}
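A sketch of this data-generating process in Python (the variable names are ours):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def invlogit(a):
    return 1.0 / (1.0 + np.exp(-a))

N, M = 50_000, 100
beta_x, beta_z, beta_w = 1.0, 1.0, 0.2

X, Z, W = rng.normal(size=(3, N))   # independent standard Gaussians

# P(Y = 0 | X, Z, W) = invlogit(beta_x X + beta_z Z + beta_w W).
Y = rng.binomial(1, 1.0 - invlogit(beta_x * X + beta_z * Z + beta_w * W))

R = rng.uniform(0.1, 0.9, size=M)   # leniency of each judge
judge = rng.permutation(np.repeat(np.arange(M), N // M))  # 500 cases each
\end{verbatim}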
The \emph{default} decision maker in the data predicts a subject's probability of recidivism as $P(\decision=0~|~\features, \unobservable)=\invlogit(\beta_x x+\beta_z z)$. Each decision maker is assigned a leniency value $r$, and the decision is made by comparing the predicted probability to the value of the inverse cumulative distribution function of the predicted probabilities, $F^{-1}_{P(\decision=0~|~\features, \unobservable)}(r)=F^{-1}(r)$: if $F^{-1}(r) < P(\decision=0~|~\features, \unobservable)$, the subject is given a negative decision $\decision=0$, and a positive decision otherwise. \rcomment{Needs double checking.} This ensures that the decisions are made independently and that the acceptance rate converges stochastically to $r$. The outcomes for which the decision was negative were then deterministically set to $0$.
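Continuing the sketch above, one way to implement this decision maker is to take $F^{-1}(r)$ as the empirical $r$-quantile of the predicted probabilities; using the empirical quantile is an implementation choice of ours.
\begin{verbatim}
# Predicted probability of recidivism, P(T = 0 | X, Z), for each case.
risk = invlogit(beta_x * X + beta_z * Z)

# F^{-1}(r): empirical r-quantile of the risk distribution, per judge.
threshold = np.quantile(risk, R)[judge]

# Negative decision when F^{-1}(r) < risk, positive otherwise.
T = np.where(threshold < risk, 0, 1)

# Outcomes with a negative decision are deterministically set to 0.
Y_obs = np.where(T == 0, 0, Y)
\end{verbatim}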
We used a number of different decision mechanisms. A \emph{limited} decision maker works as the default, but predicts the risk of a negative outcome with the regression model $Y \sim \invlogit(\beta_x x)$; hence it is unable to observe $Z$. A \emph{biased} decision maker works similarly to the limited one, but its logistic regression model is .. which biases the decision.
Given leniency $R$, a \emph{random} decision maker decides $T=1$ with probability given by $R$.
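In the same sketch, the limited and random decision makers could be written as follows; the biased decision maker is omitted here because its regression model is still unspecified above.
\begin{verbatim}
# Limited decision maker: as the default, but cannot observe Z.
risk_lim = invlogit(beta_x * X)
T_lim = np.where(np.quantile(risk_lim, R)[judge] < risk_lim, 0, 1)

# Random decision maker: T = 1 with probability equal to the leniency.
T_rand = rng.binomial(1, R[judge])
\end{verbatim}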
In contrast, Lakkaraju et al. \cite{lakkaraju2017selective} essentially order the subjects and decide $T=1$ for exactly the fraction of subjects given by the leniency $R$. We see this as unrealistic: the decisions
...
...
\subsection{Realistic data}
In this section we present results from experiments with realistic data sets.
\subsubsection{Analysis on COMPAS data}
COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) is Northpointe's (now operating under a different name) tool for guiding decisions in the criminal justice system. The COMPAS tool provides judges with risk estimates of the probability of recidivism and of failure to appear. The COMPAS score is derived mainly from ``prior criminal history, criminal associates, drug involvement, and early indicators of juvenile delinquency problems'', and it predicts recidivism over the following two years. Using the COMPAS score as the sole basis for judgement is prohibited by law; judges must base their decisions on other factors as well.
The COMPAS data set consists of recidivism data from Broward County, Florida, USA. The data set was preprocessed by ProPublica for their article ``Machine Bias''. The original data contained information on $18610$ defendants who were given a COMPAS score during 2013 or 2014. After removing defendants who were not assessed at the pretrial stage, $11757$ defendants were left. Additionally, defendants for whom the COMPAS score could not be matched with a corresponding charge were removed from the analysis, resulting in a data set of $7214$ observations. Following ProPublica's reasoning, after final data cleaning we were left with $6172$ offences. The data includes the subjects' demographic information (incl. gender, age, race) and information on their previous offences.
For the analysis we created $9$ synthetic judges with leniencies $0.1, 0.2, \ldots, 0.9$. All subjects were distributed to the judges evenly and at random, which also enables comparison with the contraction method. In this semi-synthetic scenario, each judge bases their decisions on the COMPAS score, releasing the fraction of defendants given by their leniency. Subjects who were given a negative decision had their outcome label hidden. The data was then split into training and test sets, and a logistic regression model was built to predict two-year recidivism from categorised age, gender, the number of prior offences, and the degree of the crime COMPAS screened for (felony/misdemeanour). We experimented with other predictive models, but the results remained essentially the same. These same features were used as input for the counterfactual-based imputation method.
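A sketch of this semi-synthetic setup in Python, assuming the ProPublica release of the data (\texttt{compas-scores-two-years.csv}) and its standard column names; the processing details are our own reconstruction.
\begin{verbatim}
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Column names below follow the ProPublica release of the data.
df = pd.read_csv("compas-scores-two-years.csv")

# Nine synthetic judges with leniencies 0.1, ..., 0.9,
# with subjects assigned evenly and at random.
df["judge"] = rng.permutation(np.arange(len(df)) % 9)
df["R"] = (df["judge"] + 1) / 10.0

# Each judge releases the fraction of their defendants with the
# lowest COMPAS scores that corresponds to their leniency.
def decide(g):
    cutoff = g["decile_score"].quantile(g["R"].iloc[0])
    return (g["decile_score"] <= cutoff).astype(int)

df["T"] = df.groupby("judge", group_keys=False).apply(decide)

# Hide the outcome labels of defendants who were not released.
df.loc[df["T"] == 0, "two_year_recid"] = np.nan

# Predict two-year recidivism from the four features, training on
# the released defendants of a random half of the data.
X = pd.get_dummies(df[["age_cat", "sex", "c_charge_degree"]])
X["priors_count"] = df["priors_count"]
train = df.sample(frac=0.5, random_state=0).index
idx = train[df.loc[train, "T"].to_numpy() == 1]
model = LogisticRegression(max_iter=1000)
model.fit(X.loc[idx], df.loc[idx, "two_year_recid"])
\end{verbatim}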
As the COMPAS score is derived mainly from prior criminal history, criminal associates, drug involvement, and early indicators of juvenile delinquency problems, the synthetic judges can be regarded as having access to information that is not encoded in the four features mentioned above. Results from this analysis are presented in figure X. The figure shows that counterfactual-based imputation follows the true evaluation curve very closely. We can also deduce from the figure that if this predictive model were to be deployed, it would not necessarily improve on the decisions made by these synthetic judges.
\subsubsection{Catalonian data}
(Second priority.) This data set could be used to evaluate our method alone: hide approximately 25\% of the outcome labels and show that we can estimate the failure rate for \emph{all} levels of leniency, even though the leniency of this single judge is only $0.25$.