\documentclass[11pt]{amsart}
\usepackage{geometry} % See geometry.pdf to learn the layout options. There are lots.
\geometry{a4paper} % ... or letterpaper or a5paper or ...
%\geometry{landscape} % Activate for for rotated page geometry
\usepackage[parfill]{parskip} % Activate to begin paragraphs with an empty line rather than an indent
\usepackage{graphicx}
\usepackage{amssymb}
\usepackage{epstopdf}
\usepackage{hyperref}
\DeclareGraphicsRule{.tif}{png}{.png}{`convert #1 `dirname #1`/`basename #1 .tif`.png}
\usepackage{algorithm}% http://ctan.org/pkg/algorithms
\usepackage{algorithmic}% http://ctan.org/pkg/algorithms
\renewcommand{\algorithmicrequire}{\textbf{Input:}}
\renewcommand{\algorithmicensure}{\textbf{Procedure:}}
\title{Notes}
\author{RL, 10 June 2019}
%\date{} % Activate to display a given date or no date
\begin{document}
\maketitle
\begin{abstract}
This document presents RL's implementations at the pseudocode level.
\end{abstract}
\section*{Terms and abbreviations}
\begin{description}
\item[R] acceptance rate, i.e.\ the leniency of the decision maker, $r \in [0, 1]$
\item[X] personal features, observable to a predictive model
\item[Z] some features of a subject, unobservable to a predictive model, latent variable
\item[W] noise added to result variable Y
\item[T] decision variable, bail/positive decision equal to 1, jail/negative decision equal to 0
\item[Y] result variable, no crime/positive result equal to 1, crime/negative result equal to 0
\item[SL] Selective labels
\item[Labeled data] data that has been censored, i.e.\ if a negative decision is given ($T=0$), then Y is set to NA.
\item[Unobservables] unmeasured confounders, latent variables, Z
\end{description}
Rule of thumb for the binary coding: zero bad (crime or jail), one good!
\section{RL's notes about the selective labels paper (optional reading)}
\emph{This chapter is to present my comments and insight regarding the topic.}
The motivating idea behind the SL paper of Lakkaraju et al. \cite{lakkaraju17} is to evaluate whether machines could improve on human performance. In the general case, comparing the performance of human and machine evaluations is simple. In the domains addressed by Lakkaraju et al., however, such direct comparisons would be unethical, and therefore evaluation algorithms are required. (Other data augmentation algorithms have been proposed by De-Arteaga \cite{dearteaga18}.)
The general idea of the SL paper is to train a predictive model on selectively labeled data. The question is then: "how would this predictive model perform if it were to make independent bail-or-jail decisions?" That quantity cannot be computed from real-life data sets for the same ethical reasons. We can, however, use further selectively labeled data to estimate its performance. But because the data are biased, performance estimates computed in the conventional way ("labeled outcomes only") are overly optimistic. This is why the authors propose the contraction algorithm.
One concept to note when reading the Lakkaraju paper is the difference between the global goal of prediction and the goal in this specific setting. The global goal is a low failure rate at a high acceptance rate, but that is not what we are interested in at the moment. The goal in this setting is to estimate the true failure rate of the model from unseen, biased data. That is, given selectively labeled data and an arbitrary black-box model $\mathcal{B}$, we are interested in estimating the model's performance on the whole data set with all ground-truth labels.
\section{Data generation}
In this chapter I explain both data-generating algorithms.
\subsection{Without unobservables (see also algorithm \ref{alg:data_without_Z})}
In the setting without unobservables Z, we first sample an acceptance rate $r$ for each of the $M=100$ judges uniformly from the half-open interval $[0.1, 0.9)$. We then randomly assign $N=500$ unique subjects to each judge (50000 in total) and simulate their features X as i.i.d.\ standard Gaussian random variables with zero mean and unit variance. The probability of a negative outcome is then calculated as $$P(Y=0|X=x) = \dfrac{1}{1+\exp(-x)}=\sigma(x).$$ Because $P(Y=1|X=x) = 1-P(Y=0|X=x) = 1-\sigma(x)$, the outcome variable Y can be sampled from a Bernoulli distribution with parameter $1-\sigma(x)$. The data are then sorted for each judge by the probabilities $P(Y=0|X=x)$ in descending order. If a subject is in the top $(1-r) \cdot 100 \%$ of the observations assigned to a judge, the decision variable T is set to zero; otherwise it is set to one.
\begin{algorithm}[] % enter the algorithm environment
\caption{Create data without unobservables} % give the algorithm a caption
\label{alg:data_without_Z} % and a label for \ref{} commands later in the document
\begin{algorithmic}[1] % enter the algorithmic environment
\REQUIRE Number of judges $M=100$ and number of subjects distributed to each of them $N=500$ s.t. $N_{total} = N \cdot M$
\ENSURE
\STATE Sample an acceptance rate for each of the $M$ judges from $U(0.1, 0.9)$ and round it to the nearest tenth.
\STATE Sample features X for each $N_{total}$ observations from standard Gaussian.
\STATE Calculate $P(Y=0|X=x)=\sigma(x)$ for each observation
\STATE Sample Y from Bernoulli distribution with parameter $1-\sigma(x)$.
\STATE Sort the data (1) by judge and (2) by the probabilities $P(Y=0|X=x)$ in descending order.
\STATE \hskip3.0em $\rhd$ Now the most dangerous subjects for each of the judges are at the top.
\STATE If subject belongs to the top $(1-r) \cdot 100 \%$ of observations assigned to a judge, set $T=0$ else set $T=1$.
\STATE Randomly halve the data into training and test sets.
\STATE For both halves, set $Y=$ NA if decision is negative ($T=0$).
\end{algorithmic}
\end{algorithm}
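The data generation above can be sketched in NumPy roughly as follows. This is a minimal illustration, not the actual implementation: the function name \texttt{generate\_data\_without\_Z} and the array layout are my own choices, and the rounding/sorting steps mirror the algorithm as written.

```python
import numpy as np

def generate_data_without_Z(M=100, N=500, seed=0):
    """Sketch of the data generation without unobservables (illustrative)."""
    rng = np.random.default_rng(seed)
    # Acceptance rate r per judge: U(0.1, 0.9), rounded to the nearest tenth
    r = np.round(rng.uniform(0.1, 0.9, size=M), 1)
    judge = np.repeat(np.arange(M), N)        # N subjects per judge
    x = rng.standard_normal(M * N)            # features X ~ N(0, 1)
    p_y0 = 1.0 / (1.0 + np.exp(-x))           # P(Y=0 | X=x) = sigma(x)
    y = rng.binomial(1, 1.0 - p_y0)           # Y ~ Bernoulli(1 - sigma(x))
    # Within each judge, sort by P(Y=0) descending; jail the top (1-r) share
    t = np.empty(M * N, dtype=int)
    for j in range(M):
        idx = np.where(judge == j)[0]
        order = idx[np.argsort(-p_y0[idx])]   # most dangerous first
        n_jail = int((1.0 - r[j]) * N)
        t[order[:n_jail]] = 0                 # negative decision
        t[order[n_jail:]] = 1                 # positive decision
    # Selective labeling: outcome unobserved whenever T = 0
    y_obs = np.where(t == 1, y.astype(float), np.nan)
    return judge, x, t, y, y_obs, r
```

The random train/test split of the last two steps is omitted here; it is a plain 50/50 shuffle of the rows.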
\subsection{With unobservables (see also algorithm \ref{alg:data_with_Z})}
In the setting with unobservables Z, we first sample an acceptance rate $r$ for each of the $M=100$ judges uniformly from the half-open interval $[0.1, 0.9)$. We then randomly assign $N=500$ unique subjects to each judge (50000 in total) and simulate their features X, Z and W as i.i.d.\ standard Gaussian random variables with zero mean and unit variance. The probability of a negative outcome is then calculated as $$P(Y=0|X=x, Z=z, W=w)=\sigma(\beta_Xx+\beta_Zz+\beta_Ww),$$ where $\beta_X=\beta_Z=1$ and $\beta_W=0.2$. Next, the outcome Y is set to 0 if $P(Y = 0| X, Z, W) \geq 0.5$ and to 1 otherwise. The conditional probability of a negative decision is defined as $$P(T=0|X=x, Z=z)=\sigma(\beta_Xx+\beta_Zz)+\epsilon,$$ where $\epsilon \sim N(0, 0.1)$. The data are then sorted for each judge by the probabilities $P(T=0|X, Z)$ in descending order. If a subject is in the top $(1-r) \cdot 100 \%$ of the observations assigned to a judge, the decision variable T is set to zero; otherwise it is set to one.
\begin{algorithm}[] % enter the algorithm environment
\caption{Create data with unobservables} % give the algorithm a caption
\label{alg:data_with_Z} % and a label for \ref{} commands later in the document
\begin{algorithmic}[1] % enter the algorithmic environment
\REQUIRE Number of judges $M=100$, number of subjects distributed to each of them $N=500$ s.t. $N_{total} = N \cdot M$, $\beta_X=1, \beta_Z=1$ and $\beta_W=0.2$.
\ENSURE
\STATE Sample an acceptance rate for each of the $M$ judges from $U(0.1, 0.9)$ and round it to the nearest tenth.
\STATE Sample features X, Z and W for each $N_{total}$ observations from standard Gaussian independently.
\STATE Calculate $P(Y=0|X, Z, W)$ for each observation
\STATE Set Y to 0 if $P(Y = 0| X, Z, W) \geq 0.5$ and to 1 otherwise.
\STATE Sort the data (1) by judge and (2) by the probabilities $P(T=0|X, Z)$ in descending order.
\STATE \hskip3.0em $\rhd$ Now the most dangerous subjects for each of the judges are at the top.
\STATE If subject belongs to the top $(1-r) \cdot 100 \%$ of observations assigned to that judge, set $T=0$ else set $T=1$.
\STATE Randomly halve the data into training and test sets.
\STATE For both halves, set $Y=$ NA if decision is negative ($T=0$).
\end{algorithmic}
\end{algorithm}
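A sketch of the variant with unobservables, again with illustrative names only. Note one assumption: the text writes $\epsilon \sim N(0, 0.1)$, which I read here as standard deviation $0.1$; if $0.1$ is the variance, replace it with \texttt{np.sqrt(0.1)}.

```python
import numpy as np

def generate_data_with_Z(M=100, N=500, beta_X=1.0, beta_Z=1.0,
                         beta_W=0.2, seed=0):
    """Sketch of the data generation with unobservables Z (illustrative)."""
    rng = np.random.default_rng(seed)
    sigma = lambda v: 1.0 / (1.0 + np.exp(-v))
    r = np.round(rng.uniform(0.1, 0.9, size=M), 1)
    judge = np.repeat(np.arange(M), N)
    x, z, w = (rng.standard_normal(M * N) for _ in range(3))
    p_y0 = sigma(beta_X * x + beta_Z * z + beta_W * w)
    y = (p_y0 < 0.5).astype(int)              # Y = 0 iff P(Y=0|...) >= 0.5
    # Decision score: judges see Z but not W; epsilon ~ N(0, 0.1),
    # read here as standard deviation 0.1 (an assumption)
    p_t0 = sigma(beta_X * x + beta_Z * z) + rng.normal(0.0, 0.1, size=M * N)
    t = np.empty(M * N, dtype=int)
    for j in range(M):
        idx = np.where(judge == j)[0]
        order = idx[np.argsort(-p_t0[idx])]   # highest decision score first
        n_jail = int((1.0 - r[j]) * N)
        t[order[:n_jail]] = 0
        t[order[n_jail:]] = 1
    y_obs = np.where(t == 1, y.astype(float), np.nan)
    return judge, x, z, t, y, y_obs, r
```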
\section{Plotting / "Performance comparison"}
\subsection{Model fitting}
The fitted models are logistic regression models from the scikit-learn package. The solver is set to lbfgs, and the intercept is estimated by default. The resulting LogisticRegression object provides convenient methods for fitting the model and predicting class-label probabilities. Please see the documentation at \url{https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html} or ask me (RL) for more details.
NB: These models cannot be fitted if the data contain missing values. Therefore, listwise deletion is applied to records with missing data (the whole record is discarded).
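The fitting step with listwise deletion can be sketched as follows; the helper name \texttt{fit\_on\_labeled} and the single-feature reshape are my own, chosen for the synthetic data with one feature X.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_on_labeled(x, y_obs):
    """Fit a logistic regression on the labeled rows only, i.e. apply
    listwise deletion of records with Y = NA (illustrative helper)."""
    mask = ~np.isnan(y_obs)                      # drop censored records
    model = LogisticRegression(solver="lbfgs")   # intercept fitted by default
    model.fit(x[mask].reshape(-1, 1), y_obs[mask].astype(int))
    return model
```

After fitting, \texttt{model.predict\_proba(x\_new.reshape(-1, 1))[:, 0]} gives $P(Y=0|X=x)$, since scikit-learn orders probability columns by \texttt{model.classes\_}.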
\subsection{Curves}
The following quantities are estimated from the data:
\begin{itemize}
\item True evaluation: The true failure rate of the model. Can only be calculated for synthetic data sets. See algorithm \ref{alg:true_eval}.
\item Labeled outcomes: The "traditional"/vanilla estimate of model performance. See algorithm \ref{alg:labeled_outcomes}.
\item Human evaluation: The failure rate of human decision-makers who have access to the latent variable Z. Decision-makers with similar leniency values are binned and treated as one hypothetical decision-maker. See algorithm \ref{alg:human_eval}.
\item Contraction: See algorithm 1 of \cite{lakkaraju17}
\item Causal model: In essence, the empirical performance is calculated over the test set as $$\dfrac{1}{n}\sum_{(x, y)\in D}f(x)\delta(F(x) < r)$$ where $$f(x) = P(Y=0|T=1, X=x)$$ is a predictive model trained on the labeled data and $$ F(x_0) = \int P(x)\delta(f(x) > f(x_0)) ~ dx.$$ Here $P(x)$ is the standard Gaussian pdf from the scipy.stats package, and the integral is evaluated over the interval $[-15, 15]$ with 40000 steps using the scipy.integrate.simps function, which applies Simpson's rule. (docs: \url{https://docs.scipy.org/doc/scipy/reference/generated/scipy.integrate.simps.html})
\end{itemize}
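The integral $F(x_0)$ above can be sketched directly from its definition. One note: in recent SciPy the function is called \texttt{scipy.integrate.simpson} (the older alias \texttt{simps}, referenced in the text, has been removed); the grid bounds and step count follow the text.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import simpson  # called `simps` in older SciPy versions

def F(x0, f, lo=-15.0, hi=15.0, steps=40000):
    """F(x0) = integral of P(x) * delta(f(x) > f(x0)) dx over [lo, hi],
    where P(x) is the standard Gaussian pdf and f is a scalar predictive
    model returning P(Y=0 | T=1, X=x). Evaluated with Simpson's rule."""
    grid = np.linspace(lo, hi, steps)
    integrand = norm.pdf(grid) * (f(grid) > f(x0))
    return simpson(integrand, x=grid)
```

Sanity check: for any strictly increasing $f$ (such as $\sigma$), $f(x) > f(x_0)$ iff $x > x_0$, so $F(x_0) = 1 - \Phi(x_0)$.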
The plotted curves are constructed using pseudo code presented in algorithm \ref{alg:perf_comp}.
\begin{algorithm}[] % enter the algorithm environment
\caption{Performance comparison} % give the algorithm a caption
\label{alg:perf_comp} % and a label for \ref{} commands later in the document
\begin{algorithmic}[1] % enter the algorithmic environment
\REQUIRE Number of iterations $N_{iter}$
\ENSURE
\FORALL{$r$ in $0.1, 0.2, ..., 0.9$}
\FOR{i = 1 \TO $N_{iter}$}
\STATE Create data using either Algorithm \ref{alg:data_without_Z} or \ref{alg:data_with_Z}.
\STATE Train a logistic regression model using observations in the training set with available outcome labels.
\STATE Estimate failure rate of true evaluation with leniency $r$ using algorithm \ref{alg:true_eval}.
\STATE Estimate failure rate of labeled outcomes approach with leniency $r$ using algorithm \ref{alg:labeled_outcomes}.
\STATE Estimate failure rate of human judges with leniency $r$ using algorithm \ref{alg:human_eval}.
\STATE Estimate failure rate of contraction algorithm with leniency $r$.
\STATE Estimate the empirical performance of the causal model with leniency $r$ using algorithm \ref{alg:causal_model}.
\ENDFOR
\STATE Calculate mean of the failure rate over the iterations for each algorithm separately.
\STATE Calculate standard error of the mean over the iterations for each algorithm separately.
\ENDFOR
\STATE Plot the failure rates with given levels of leniency $r$.
\STATE Calculate absolute mean errors of each algorithm compared to the true evaluation.
\end{algorithmic}
\end{algorithm}
\begin{algorithm}[] % enter the algorithm environment
\caption{True evaluation} % give the algorithm a caption
\label{alg:true_eval} % and a label for \ref{} commands later in the document
\begin{algorithmic}[1] % enter the algorithmic environment
\REQUIRE Test data $\mathcal{D}$ with probabilities $\mathcal{S}$ and \emph{all outcome labels}, acceptance rate r
\ENSURE
\STATE Sort the data by the probabilities $\mathcal{S}$ in ascending order.
\STATE \hskip3.0em $\rhd$ Now the most dangerous subjects are last.
\STATE Calculate the number to release $N_{free} = |\mathcal{D}| \cdot r$.
\RETURN $\frac{1}{|\mathcal{D}|}\sum_{i=1}^{N_{free}}\delta\{y_i=0\}$
\end{algorithmic}
\end{algorithm}
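The true-evaluation procedure can be sketched as below; \texttt{p\_y0} stands for the probabilities $\mathcal{S}$ and the function name is my own.

```python
import numpy as np

def true_evaluation(p_y0, y, r):
    """Sketch of the true-evaluation failure rate: release the r-fraction
    with the smallest predicted P(Y=0), count failures (Y = 0) among
    them, and divide by the full test-set size."""
    p_y0, y = np.asarray(p_y0), np.asarray(y)
    order = np.argsort(p_y0)                  # ascending: safest first
    n_free = int(len(y) * r)                  # number to release
    return np.sum(y[order[:n_free]] == 0) / len(y)
```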
\begin{algorithm}[] % enter the algorithm environment
\caption{Labeled outcomes} % give the algorithm a caption
\label{alg:labeled_outcomes} % and a label for \ref{} commands later in the document
\begin{algorithmic}[1] % enter the algorithmic environment
\REQUIRE Labeled test data $\mathcal{D}$ with probabilities $\mathcal{S}$ and \emph{missing outcome labels} for observations with ($T=0$), acceptance rate r
\ENSURE
\STATE Assign observations with observed outcomes to $\mathcal{D}_{observed}$.
\STATE Sort $\mathcal{D}_{observed}$ by the probabilities $\mathcal{S}$ in ascending order.
\STATE \hskip3.0em $\rhd$ Now the most dangerous subjects are last.
\STATE Calculate the number to release $N_{free} = |\mathcal{D}_{observed}| \cdot r$.
\RETURN $\frac{1}{|\mathcal{D}|}\sum_{i=1}^{N_{free}}\delta\{y_i=0\}$
\end{algorithmic}
\end{algorithm}
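The labeled-outcomes estimate differs only in that censored rows are dropped first. A sketch (illustrative names; the $1/|\mathcal{D}|$ normalization over the full labeled test set follows the pseudocode as written):

```python
import numpy as np

def labeled_outcomes(p_y0, y_obs, r):
    """Sketch of the labeled-outcomes estimate: keep only rows with an
    observed outcome, release the safest r-fraction of those, and
    normalize by the full data size as in the pseudocode."""
    p_y0 = np.asarray(p_y0, dtype=float)
    y_obs = np.asarray(y_obs, dtype=float)
    observed = ~np.isnan(y_obs)               # D_observed
    p_obs, y = p_y0[observed], y_obs[observed]
    order = np.argsort(p_obs)                 # ascending: safest first
    n_free = int(len(y) * r)                  # |D_observed| * r
    return np.sum(y[order[:n_free]] == 0) / len(y_obs)
```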
\begin{algorithm}[] % enter the algorithm environment
\caption{Human evaluation} % give the algorithm a caption
\label{alg:human_eval} % and a label for \ref{} commands later in the document
\begin{algorithmic}[1] % enter the algorithmic environment
\REQUIRE Labeled test data $\mathcal{D}$ with probabilities $\mathcal{S}$ and \emph{missing outcome labels} for observations with ($T=0$), acceptance rate r
\ENSURE
\STATE Assign judges with leniency in $[r-0.05, r+0.05]$ to $\mathcal{J}$
\STATE $\mathcal{D}_{released} = \{(x, j, t, y) \in \mathcal{D}|t=1 \wedge j \in \mathcal{J}\}$
\STATE \hskip3.0em $\rhd$ Subjects judged \emph{and} released by judges with correct leniency.
\RETURN $\frac{1}{|\mathcal{J}|}\sum_{i \in \mathcal{D}_{released}}\delta\{y_i=0\}$
\end{algorithmic}
\end{algorithm}
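A sketch of the human-evaluation estimate; the argument layout (per-observation judge ids plus a per-judge leniency vector) is my own, and the $1/|\mathcal{J}|$ normalization follows the pseudocode as written.

```python
import numpy as np

def human_evaluation(judge, judge_r, t, y_obs, r):
    """Sketch of the human-evaluation estimate: pool judges whose
    leniency is within 0.05 of r and count failures among the subjects
    they released, normalizing by the number of pooled judges."""
    judge, t = np.asarray(judge), np.asarray(t)
    y_obs = np.asarray(y_obs, dtype=float)
    # Judges J with leniency in [r - 0.05, r + 0.05]
    J = np.where(np.abs(np.asarray(judge_r) - r) <= 0.05)[0]
    # Subjects judged *and* released by judges in J
    released = np.isin(judge, J) & (t == 1)
    return np.sum(y_obs[released] == 0) / len(J)
```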
\begin{algorithm}[] % enter the algorithm environment
\caption{Causal model, empirical performance} % give the algorithm a caption
\label{alg:causal_model} % and a label for \ref{} commands later in the document
\begin{algorithmic}[1] % enter the algorithmic environment
\REQUIRE Labeled test data $\mathcal{D}$ with probabilities $\mathcal{S}$ and \emph{missing outcome labels} for observations with ($T=0$), predictive model f, acceptance rate r
\ENSURE
\STATE $T_{causal} = \mathrm{cdf}(\mathcal{D}, f) < r$
\RETURN $\frac{1}{|\mathcal{D}|}\sum_{i \in \mathcal{D}} \mathcal{S}_i \cdot T_{causal,i} = \frac{1}{|\mathcal{D}|}\sum_x f(x)\delta(F(x) < r)$
\end{algorithmic}
\end{algorithm}
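The empirical performance of the causal model then reduces to averaging $f(x)$ over the subjects the causal rule releases. A sketch, taking $f$ and $F$ as callables (for instance $F$ as the Simpson-rule integral defined in the text):

```python
import numpy as np

def causal_empirical_performance(x_test, f, F, r):
    """Sketch of the causal model's empirical performance:
    (1/n) * sum over x of f(x) * delta(F(x) < r)."""
    x_test = np.asarray(x_test, dtype=float)
    release = np.array([F(xi) for xi in x_test]) < r   # T_causal per subject
    return float(np.sum(np.where(release, f(x_test), 0.0)) / len(x_test))
```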
\begin{thebibliography}{9}
\bibitem{dearteaga18}
De-Arteaga, Maria; Dubrawski, Artur; Chouldechova, Alexandra. Learning Under Selective Labels in the Presence of Expert Consistency. 2018.
\bibitem{lakkaraju17}
Lakkaraju, Himabindu; Kleinberg, Jon; Leskovec, Jure; Ludwig, Jens; Mullainathan, Sendhil. The Selective Labels Problem: Evaluating Algorithmic Predictions in the Presence of Unobservables. In KDD, 2017.
\end{thebibliography}
\end{document}