Commit 7c13656d authored by Riku-Laine's avatar Riku-Laine

Figure updates and experiment text update

parent 31962431
@@ -228,7 +228,7 @@ where erf is the error function. After assigning the decisions with this method,
\begin{figure}
\begin{center}\includegraphics[width=0.5\linewidth]{img/decisions_ZvsX}
\end{center}
\caption{This is only one judge (batch decision-maker with an error term $\epsilon_\decisionValue$). Leniency 0.5. What proportion of the subjects commit a crime if released? 150 out of 300 have $T=1$. What is the proportion of $Y$? 56 have $T=1, Y=0$ and 94 have $T=1, Y=1$. How many subjects had $Y=0$ before censoring with the decision? 157.}
%\label{fig:}
\end{figure}
@@ -30,7 +30,7 @@ We employed the following decision makers in our experiments:
$y$ given features
%$y \sim x$ or $y\sim x+z$
and releases a fraction $r$ of the subjects.
\item \textbf{Independent}: Each subject's decision is made independently of the other subjects, using a cumulative distribution function derived from the logistic regression model. Given a subject with features \obsFeatures and \unobservable, we compare the value $\invlogit(\gamma_\obsFeaturesValue\obsFeaturesValue + \gamma_\unobservableValue\unobservableValue)$ to the quantile function $F^{-1}$ of the random variable $\invlogit(\gamma_\obsFeaturesValue\obsFeatures + \gamma_\unobservableValue\unobservable)$: for a decision-maker with leniency $\leniencyValue$, the subject is assigned a positive decision if $\invlogit(\gamma_\obsFeaturesValue\obsFeaturesValue + \gamma_\unobservableValue\unobservableValue) < F^{-1}(r)$ (see the sketch below). \acomment{EXPLAIN BETTER. MAKE DETERMINISTIC?}
\item \textbf{Probabilistic}: Each subject is released with a probability given by the logistic regression model, where the leniency enters through $\alpha_j$.
\end{itemize}
Decision makers in the data (\human) have access to \unobservable, whereas the evaluated decision makers do not. All model parameters for the evaluated decision makers are learned from the training data set.
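For concreteness, the following minimal Python sketch illustrates the independent decision rule above. It is not the original implementation; the function and variable names are ours, and the threshold $F^{-1}(r)$ is approximated by Monte Carlo under the assumption that \obsFeatures and \unobservable are standard Gaussians.
\begin{verbatim}
import numpy as np
from scipy.special import expit  # inverse logit

def independent_decisions(x, z, gamma_x, gamma_z, r, n_mc=100000):
    """Release subject i iff invlogit(gamma_x*x_i + gamma_z*z_i) < F^{-1}(r)."""
    scores = expit(gamma_x * x + gamma_z * z)
    # Monte Carlo approximation of the r-quantile F^{-1}(r) of the score
    # distribution, assuming X and Z are standard Gaussians.
    x_mc, z_mc = np.random.normal(size=n_mc), np.random.normal(size=n_mc)
    threshold = np.quantile(expit(gamma_x * x_mc + gamma_z * z_mc), r)
    return (scores < threshold).astype(int)  # decision T, 1 = release
\end{verbatim}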
@@ -66,9 +66,9 @@ Let's describe how we do the following.
\end{itemize}
\end{itemize}
For our experiments we sampled $\datasize=5k$ observations of \obsFeatures and \unobservable independently from standard Gaussians.
%
The leniencies \leniency for the $\judgeAmount=50$ decision-makers \human were then drawn from Uniform$(0.1,~0.9)$ and rounded to one decimal place.
%
The subjects were randomly distributed to the judges and decisions \decision and outcomes \outcome were sampled from Bernoulli distributions (see equations \ref{eq:judgemodel} and \ref{eq:defendantmodel}).
%
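The sampling step can be summarised with the following Python sketch; the coefficient values and the link between leniency and $\alpha_j$ shown here are placeholders for equations \ref{eq:judgemodel} and \ref{eq:defendantmodel}, not the values used in the experiments.
\begin{verbatim}
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(0)
n, n_judges = 5000, 50

x = rng.normal(size=n)                    # observable feature X
z = rng.normal(size=n)                    # unobservable feature Z
judge = rng.integers(n_judges, size=n)    # random assignment of subjects
r = np.round(rng.uniform(0.1, 0.9, size=n_judges), 1)  # judge leniencies

# Decisions and outcomes are Bernoulli draws; alpha_j = logit(r_j) and the
# unit coefficients below are purely illustrative placeholders.
alpha = np.log(r / (1 - r))[judge]
t = rng.binomial(1, expit(alpha - x - z))  # decision T by judge H
y = rng.binomial(1, expit(-x - z))         # outcome Y
\end{verbatim}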
@@ -88,13 +88,13 @@ In contrast, Lakkaraju et al. employed a human decision-maker making decisions b
%
Their decision-maker would then assign positive decisions to the subjects in the lowest $\leniencyValue \cdot 100$ percentiles, which renders the decisions mutually dependent \cite{lakkaraju2017selective}.
We deployed two machine decision-makers on the synthetic data sets.
%
The decision-maker \machine can be \emph{random}, giving out positive decisions with probability \leniencyValue and ignoring the information about the subjects in variables \obsFeatures and \unobservable.
%
The machine \machine can also be trained naively to make decisions based on a separate set of labeled data.
%
This \emph{default} decision-maker \machine uses the observable features \obsFeatures and the observed outcomes (cases with $\decision = 1$) to predict the probability of the outcome and assigns positive decisions...
We evaluated the machine decision-maker \machine at leniency \leniencyValue by comparing the estimated \failurerate to the true failure rate.
%
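As a concrete illustration of the default decision-maker, the Python sketch below trains \machine on the labeled cases of a training set and assigns positive decisions to the \leniencyValue fraction with the lowest predicted risk. This is only one way to fill in the ellipsis above; the names, and the assumption that $\outcome=1$ is the outcome being predicted as risky, are ours.
\begin{verbatim}
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_default_m(x_train, t_train, y_train):
    # Fit M naively on the observed outcomes only (cases with T = 1).
    labeled = t_train == 1
    model = LogisticRegression()
    model.fit(x_train[labeled].reshape(-1, 1), y_train[labeled])
    return model

def decide_at_leniency(model, x_test, r):
    # Predicted probability of the (assumed) negative outcome Y = 1.
    risk = model.predict_proba(x_test.reshape(-1, 1))[:, 1]
    threshold = np.quantile(risk, r)
    return (risk <= threshold).astype(int)  # release the r least risky
\end{verbatim}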
@@ -110,8 +110,6 @@ The algorithm first sorts all observations with $\decision_\human=1$ based on the
%
The mean absolute errors of the \failurerate estimates w.r.t. the true evaluation from contraction, counterfactual imputation and labeled outcomes were then compared to each other.
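For completeness, the following Python sketch shows contraction as we understand it from Lakkaraju et al. \cite{lakkaraju2017selective}; it is our paraphrase of their algorithm, the variable names are ours, and it assumes $\outcome=1$ encodes a failure when counting.
\begin{verbatim}
import numpy as np

def contraction(y, t, judge, acceptance_rate, pred, r):
    """Failure-rate estimate at leniency r from the most lenient judge q.
    Assumes y = 1 encodes a failure and pred is M's predicted failure risk."""
    q = max(acceptance_rate, key=acceptance_rate.get)   # most lenient judge
    in_q = judge == q
    released = in_q & (t == 1)                           # cases of q with T = 1
    order = np.argsort(-pred[released])                  # most risky first
    n_remove = int((acceptance_rate[q] - r) * in_q.sum())
    kept = order[n_remove:]                              # still released at leniency r
    return y[released][kept].sum() / in_q.sum()
\end{verbatim}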
\note{Riku}{Last chapter is now written in line with decision-maker \machine giving out predictions in interval [0, 1].}
\spara{Results} We show how the different evaluation methods perform for decision-makers \machine of different types and leniency levels.
We also provide a plot showing how our Counterfactuals method correctly infers the values of $Z$ based on $X$ and $T$.
@@ -124,7 +122,12 @@ Perform the same experiment but with $\beta_{_\unobservableValue} \gg \beta_{_\o
Perform the same experiment, but the minimum leniency of \human is now larger than that of \machine.
\begin{figure}
%\centering
\includegraphics[width=\linewidth]{./img/sl_rmax05}
\caption{Results when $\max(\leniencyValue)=0.5$. Here we observe that our proposed method is able to estimate the true failure rate accurately even beyond the maximum leniency observed in the data. Contraction, however, can estimate the true failure rate only up to $\max(\leniencyValue)$, and does so with lower accuracy.}
\label{fig:results_rmax05}
\end{figure}
\noindent
\hrulefill
@@ -143,51 +146,67 @@ After the discussion, \citet{kleinberg2016inherent} showed that the criteria for
The COMPAS data set used in this study is recidivism data from Broward County, Florida, USA made available by ProPublica\footnote{\url{https://github.com/propublica/compas-analysis}}.
%
We preprocessed the two-year recidivism data as ProPublica did for their article.
%
The original data contained information about $18\,610$ defendants who were given a COMPAS score during 2013 or 2014.
%
After removing defendants who were not assessed at the pretrial stage, $11\,757$ defendants remained.
%
Additionally, defendants for whom the COMPAS score could not be matched with a corresponding charge were removed from the analysis, resulting in a data set of $7\,214$ observations.
%
Following ProPublica's data cleaning process, the final data set consisted of $6\,172$ offenders.
%
The data includes the subjects' demographic information, such as gender, age and race, together with information on their previous offences.
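For reference, a pandas sketch of the filtering, assuming the column names and criteria from ProPublica's compas-analysis repository; our preprocessing follows the same steps, but the code below is illustrative rather than our exact pipeline.
\begin{verbatim}
import pandas as pd

raw = pd.read_csv("compas-scores-two-years.csv")     # ProPublica's two-year data
compas = raw[
    raw["days_b_screening_arrest"].between(-30, 30)  # charge matches screening
    & (raw["is_recid"] != -1)                        # recidivism flag available
    & (raw["c_charge_degree"] != "O")                # drop ordinary traffic offences
    & (raw["score_text"] != "N/A")                   # a COMPAS score exists
]
# With these criteria the two-year data reduces to the 6 172 offenders above.
\end{verbatim}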
For the analysis, we created $\judgeAmount=12$ synthetic judges with fixed leniency levels $0.1$, $0.5$ and $0.9$, so that four decision-makers shared each leniency level.
%
The $\datasize=6\,172$ subjects were distributed to these judges as evenly as possible and at random.
%
In this semi-synthetic scenario, the judges based their decisions on the COMPAS score, releasing the fraction of defendants with the lowest score according to their leniency.
%
E.g. if a synthetic judge had leniency $0.4$, they would release the $40\%$ of defendants with the lowest COMPAS scores.
%
Those who were given a negative decision had their outcome label set to positive $\outcome = 1$.
%
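A Python sketch of this semi-synthetic decision assignment is given below; it reuses the data frame from the preprocessing sketch, the column names are from the ProPublica data, and the variable names and outcome coding are our assumptions.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n = len(compas)
leniency = np.repeat([0.1, 0.5, 0.9], 4)     # 12 judges, 4 per leniency level
judge = rng.permutation(np.arange(n) % 12)   # even random assignment

t = np.zeros(n, dtype=int)
score = compas["decile_score"].to_numpy()
for j in range(12):
    idx = np.where(judge == j)[0]
    n_release = int(round(leniency[j] * len(idx)))
    # release the leniency fraction with the lowest COMPAS score
    t[idx[np.argsort(score[idx])][:n_release]] = 1

y = compas["two_year_recid"].to_numpy().copy()
y[t == 0] = 1   # negative decision: outcome label set to 1, as in the text
                # (the exact coding of the outcome variable is an assumption)
\end{verbatim}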
After assigning the decisions, the data was split 10 times into training and test sets, each containing the decisions of 6 judges.
%
Using only the observations with positive decisions, a logistic regression model was trained on the training data to predict two-year recidivism from categorised age, race, gender, number of priors, and the degree of crime COMPAS screened for (felony or misdemeanour).
%
As the COMPAS score is derived from a larger set of predictors than the aforementioned five \cite{}, the unobservable information would then be encoded in the COMPAS score.
%
The fitted logistic regression model was used as the decision-maker \machine in the test data, and the same features were given as input for the counterfactual imputation.
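The model fitting on one training split could look like the following sketch; the feature encoding, in particular the one-hot encoding of the categorical variables, is an assumption on our part.
\begin{verbatim}
import pandas as pd
from sklearn.linear_model import LogisticRegression

def fit_machine(train, t_train):
    # One-hot encode the categorical features and keep the prior count.
    X = pd.get_dummies(
        train[["age_cat", "race", "sex", "c_charge_degree"]], drop_first=True
    ).assign(priors_count=train["priors_count"].to_numpy())
    labeled = t_train == 1                 # use only positive decisions
    model = LogisticRegression(max_iter=1000)
    model.fit(X[labeled], train["two_year_recid"].to_numpy()[labeled])
    return model, list(X.columns)
\end{verbatim}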
\spara{Results} \textbf{TODO}...
Results for experiments with the COMPAS data are shown in figures \ref{fig:results_compas} and \ref{fig:results_errors}.
%
The mean absolute error for the proposed method at all levels of leniency was XXX, which was lower than that of contraction (YYYY).
%
The experiments showed an agreement rate ranging from ZZZZ to QQQQ for contraction.
%
That, combined with a high maximum acceptance rate and approximately 514 subjects per decision-maker ($6\,172 / 12 \approx 514$), should guarantee comparable performance for contraction.
%
The experiments gave an estimate of XXX for $\gamma_\unobservable$, which indicates that the COMPAS score encoded some additional information.
\begin{figure}
%\centering
\includegraphics[width=\linewidth]{./img/sl_compas_all}
\caption{Results with COMPAS data.}
\label{fig:results_compas}
\end{figure}
\begin{figure}
%\centering
\includegraphics[width=\linewidth]{./img/sl_errors}
\caption{Error of the estimate w.r.t. the true evaluation using different decision-makers with leniencies from $0.1$ to $0.9$ and coefficients $\beta_\unobservable=\gamma_\unobservable=1$. Error bars denote standard deviations. The figure shows that the performance of the presented method is robust to the mechanism and accuracy of the decision-making processes \human and \machine.}
\label{fig:results_errors}
\end{figure}
\noindent
\hrulefill
\hide{
%% These are old results. Do we want a similar table?
\begin{table}[]
\begin{tabular}{@{}lll@{}}
\toprule
@@ -203,7 +222,6 @@ Lakkaraju's decision-maker \human \cite{lakkaraju2017selective} & 0.01187
\label{tab:}
\end{table}
\hide{
\subsubsection*{Old Content}
\rcomment{ I presume MM's preferences were that the outcome would be from Bernoulli distribution and that the decisions would be independent. So, let's first explain those ways thoroughly and then mention what we changed as discussed.}
Image files changed in this commit:
paper/img/sl_absolute_errors.png (29.4 KiB)
paper/img/sl_compas__all.png (66.4 KiB)
paper/img/sl_compas_results.png (68.5 KiB)
paper/img/sl_errors.png (39.5 KiB)