diff --git a/analysis_and_scripts/notes.tex b/analysis_and_scripts/notes.tex
index d9a1449b3d66214f6986cc374fcbbe5072c88c8a..a5129846339450bccf9ba1849933c1dc9c6265d4 100644
--- a/analysis_and_scripts/notes.tex
+++ b/analysis_and_scripts/notes.tex
@@ -6,7 +6,7 @@
 \usepackage{graphicx}
 \usepackage{amssymb}
 \usepackage{epstopdf}
-\usepackage{hyperref}
+\usepackage[colorlinks=true]{hyperref}
 %\DeclareGraphicsRule{.tif}{png}{.png}{`convert #1 `dirname #1`/`basename #1 .tif`.png}
 
 \usepackage{algorithm}% http://ctan.org/pkg/algorithms
@@ -16,6 +16,45 @@
 \renewcommand{\descriptionlabel}[1]{\hspace{\labelsep}\textnormal{#1}}
 
+\makeatletter
+%Table of Contents
+\setcounter{tocdepth}{3}
+
+% Add bold to \section titles in ToC and remove . after numbers
+\renewcommand{\tocsection}[3]{%
+  \indentlabel{\@ifnotempty{#2}{\bfseries\ignorespaces#1 #2\quad}}\bfseries#3}
+% Remove . after numbers in \subsection
+\renewcommand{\tocsubsection}[3]{%
+  \indentlabel{\@ifnotempty{#2}{\ignorespaces#1 #2\quad}}#3}
+%\let\tocsubsubsection\tocsubsection% Update for \subsubsection
+%...
+
+\newcommand\@dotsep{4.5}
+\def\@tocline#1#2#3#4#5#6#7{\relax
+  \ifnum #1>\c@tocdepth % then omit
+  \else
+    \par \addpenalty\@secpenalty\addvspace{#2}%
+    \begingroup \hyphenpenalty\@M
+    \@ifempty{#4}{%
+      \@tempdima\csname r@tocindent\number#1\endcsname\relax
+    }{%
+      \@tempdima#4\relax
+    }%
+    \parindent\z@ \leftskip#3\relax \advance\leftskip\@tempdima\relax
+    \rightskip\@pnumwidth plus1em \parfillskip-\@pnumwidth
+    #5\leavevmode\hskip-\@tempdima{#6}\nobreak
+    \leaders\hbox{$\m@th\mkern \@dotsep mu\hbox{.}\mkern \@dotsep mu$}\hfill
+    \nobreak
+    \hbox to\@pnumwidth{\@tocpagenum{\ifnum#1=1\bfseries\fi#7}}\par% <-- \bfseries for \section page
+    \nobreak
+    \endgroup
+  \fi}
+\AtBeginDocument{%
+\expandafter\renewcommand\csname r@tocindent0\endcsname{0pt}
+}
+\def\l@subsection{\@tocline{2}{0pt}{2.5pc}{5pc}{}}
+\makeatother
+
 \usepackage{subcaption}
 \graphicspath{ {../figures/} }
@@ -26,6 +65,8 @@
 \maketitle
 
+\tableofcontents
+
 \begin{abstract}
 This document presents RL's implementations at the pseudocode level. First, I present the nomenclature used in these notes. Then I proceed to give my personal views and comments on the motivation behind the Selective labels paper. In the following sections, I present the data generating algorithms and the algorithms for obtaining failure rates with different methods. In the end, I present some results that I was asked to present in the meeting on Friday $7^{th}$.
 \end{abstract}
@@ -33,16 +74,17 @@ This document presents the implementations of RL in pseudocode level. First, I p
 \section*{Terms and abbreviations}
 \begin{description}
-\item[R:] acceptance rate, leniency of decision maker, $r \in [0, 1]$
-\item[X:] personal features, observable to a predictive model
-\item[Z:] some features of a subject, unobservable to a predictive model, latent variable
-\item[W:] noise added to result variable Y
-\item[T:] decision variable, bail/positive decision equal to 1, jail/negative decision equal to 0
-\item[Y:] result variable, no crime/positive result equal to 1, crime/negative result equal to 0
-\item[SL:] Selective labels, see \cite{lakkaraju17}
-\item[Labeled data:] data that has been censored, i.e. if negative decision is given (T=0), then Y is set to NA.
-\item[Full data:] data that has all labels available, i.e. \emph{even if} negative decision is given (T=0), Y will still be available.
-\item[Unobservables:] unmeasured confounders, latent variables, Z
+\item[R :] acceptance rate, leniency of decision maker, $r \in [0, 1]$
+\item[X :] personal features, observable to a predictive model
+\item[Z :] some features of a subject, unobservable to a predictive model, latent variable
+\item[W :] noise added to result variable Y
+\item[T :] decision variable, bail/positive decision equal to 1, jail/negative decision equal to 0
+\item[Y :] result variable, no crime/positive result equal to 1, crime/negative result equal to 0
+\item[MAE :] mean absolute error
+\item[SL :] Selective labels, see \cite{lakkaraju17}
+\item[Labeled data :] data that has been censored, i.e. if a negative decision is given (T=0), then Y is set to NA.
+\item[Full data :] data that has all labels available, i.e. \emph{even if} a negative decision is given (T=0), Y will still be available.
+\item[Unobservables :] unmeasured confounders, latent variables, Z
 \end{description}
 
 Mnemonic rule for the binary coding: zero bad (crime or jail), one good!
@@ -59,11 +101,11 @@ One of the concepts to denote when reading the Lakkaraju paper is the difference
 
 \section{Data generation}
 
-In this chapter I explain both of the data generating algorithms.
+Both of the data generating algorithms are presented in this chapter.
 
 \subsection{Without unobservables (see also algorithm \ref{alg:data_without_Z})}
 
-In the setting without unobservables Z, we first sample an acceptance rate r for all $M=100$ judges uniformly from a half-open interval $[0.1; 0.9)$. Then we assign 500 unique subjects for each of the judges randomly (50000 in total) and simulate their features X as i.i.d standard Gaussian random variables with zero mean and unit (1) variance. Then, probability for negative outcome is calculated as $$P(Y=0|X=x) = \dfrac{1}{1+\exp(-x)}=\sigma(x).$$ Because $P(Y=1|X=x) = 1-P(Y=0|X=x) = 1-\sigma(x)$ the outcome variable Y can be sampled from Bernoulli distribution with parameter $1-\sigma(x)$. The data is then sorted for each judge by the probabilities $P(Y=0|X=x)$ in descending order. If the subject is in the top $(1-r) \cdot 100 \%$ of observations assigned to a judge, the decision variable T is set to zero and otherwise to one.
+In the setting without unobservables Z, we first sample an acceptance rate $r$ for each of the $M=100$ judges uniformly from the half-open interval $[0.1; 0.9)$. Then we randomly assign 500 unique subjects to each judge (50000 in total) and simulate their features X as i.i.d.\ standard Gaussian random variables with zero mean and unit variance. The probability of a negative outcome is then calculated as $$P(Y=0|X=x) = \dfrac{1}{1+\exp(-x)}=\sigma(x).$$ Because $P(Y=1|X=x) = 1-P(Y=0|X=x) = 1-\sigma(x)$, the outcome variable Y can be sampled from a Bernoulli distribution with parameter $1-\sigma(x)$. The data is then sorted for each judge by the probabilities $P(Y=0|X=x)$ in descending order. If a subject is in the top $(1-r) \cdot 100 \%$ of the observations assigned to a judge, the decision variable T is set to zero, and otherwise to one.
 
 \begin{algorithm}[] % enter the algorithm environment
 \caption{Create data without unobservables} % give the algorithm a caption
@@ -112,7 +154,7 @@ In the setting with unobservables Z, we first sample an acceptance rate r for al
 
 The models that are being fitted are logistic regression models from the scikit-learn package. The solver is set to lbfgs (as there is no closed-form solution) and an intercept is estimated by default.
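+
+For concreteness, the fitting step looks roughly like the sketch below. This is not the exact script: the names \texttt{train} and \texttt{test} are illustrative stand-ins for data frames with columns X, T and Y, where Y is NA whenever the outcome was censored.
+
+\begin{verbatim}
+from sklearn.linear_model import LogisticRegression
+
+# List-wise deletion: drop records whose outcome Y is missing.
+labeled = train.dropna(subset=['Y'])
+
+# lbfgs solver; an intercept is estimated by default.
+model = LogisticRegression(solver='lbfgs')
+model.fit(labeled[['X']], labeled['Y'])
+
+# Column 0 of predict_proba holds P(Y = 0 | X = x), because
+# model.classes_ is sorted in ascending order.
+p_y0 = model.predict_proba(test[['X']])[:, 0]
+\end{verbatim}
+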
The resulting LogisticRegression model object provides convenient functions for fitting the model and for obtaining class label probabilities. Please see the documentation at \url{https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html} or ask me (RL) for more details.
 
-All of the algorithms 4--7 and the contraction algorithm are model agnostic. Lakkaraju says in their paper "We train logistic regression on this training set. We also experimented with other predictive models and observed similar behaviour."
+All of the algorithms 4--7 and the contraction algorithm are model agnostic, i.e. they do not depend on a specific predictive model: the model only has to output probabilities for the class labels given some input. Lakkaraju et al. say in their paper: ``We train logistic regression on this training set. We also experimented with other predictive models and observed similar behaviour.''
 
 NB: sklearn's logistic regression model cannot be fitted if the data includes missing values, so list-wise deletion is done in cases of missing data (the whole record is discarded).
 
@@ -125,7 +167,7 @@ The following quantities are estimated from the data:
 \item Labeled outcomes: The ``traditional''/vanilla estimate of model performance. See algorithm \ref{alg:labeled_outcomes}.
 \item Human evaluation: The failure rate of human decision-makers who have access to the latent variable Z. Decision-makers with similar values of leniency are binned and treated as one hypothetical decision-maker. See algorithm \ref{alg:human_eval}.
 \item Contraction: See algorithm 1 of \cite{lakkaraju17}.
-\item Causal model: In essence, the empirical performance is calculated over the test set as $$\dfrac{1}{n}\sum_{(x, y)\in D}f(x)\delta(F(x) < r)$$ where $$f(x) = P(Y=0|T=1, X=x)$$ is a logistic regression model (see \ref{sec:model_fitting}) trained on the labeled data and $$ F(x_0) = \int_{x\in\mathcal{X}} P(x)\delta(f(x) < f(x_0)) ~ dx.$$ All observations, even ones with missing outcome labels, can be used since empirical performance doesn't depend on them. $P(x)$ is Gaussian pdf from scipy.stats package and it is integrated over interval [-15, 15] with 40000 steps using si.simps function from scipy.integrate which uses Simpson's rule in estimating the value of the integral. (docs: \url{https://docs.scipy.org/doc/scipy/reference/generated/scipy.integrate.simps.html}) \label{causal_cdf}
+\item Causal model: In essence, the empirical performance is calculated over the test set as $$\dfrac{1}{n}\sum_{(x, y)\in D}f(x)\delta(F(x) < r)$$ where $$f(x) = P(Y=0|T=1, X=x)$$ is a logistic regression model (see \ref{sec:model_fitting}) predicting Y from X, trained on the labeled data, and $$ F(x_0) = \int_{x\in\mathcal{X}} P(x)\delta(f(x) < f(x_0)) ~ dx.$$ All observations, even ones with missing outcome labels, can be used, since the empirical performance doesn't depend on the outcome labels. $P(x)$ is the Gaussian pdf from the scipy.stats package, and the integral is evaluated over the interval $[-15, 15]$ in 40000 steps with the si.simps function from scipy.integrate, which applies Simpson's rule. (docs: \url{https://docs.scipy.org/doc/scipy/reference/generated/scipy.integrate.simps.html}) \label{causal_cdf}
 \end{itemize}
 
 The plotted curves are constructed using the pseudocode presented in algorithm \ref{alg:perf_comp}.
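+
+As a companion to the description above, the $F(x_0)$ integral and the empirical performance translate to code roughly as follows. This is a sketch, not the exact implementation: \texttt{model} is assumed to be the fitted LogisticRegression object of section \ref{sec:model_fitting}, X is one-dimensional, and the remaining names are illustrative.
+
+\begin{verbatim}
+import numpy as np
+from scipy import stats, integrate
+
+# f(x) = P(Y = 0 | T = 1, X = x) from the fitted model.
+f = lambda x: model.predict_proba(np.reshape(x, (-1, 1)))[:, 0]
+
+def F(x_0):
+    # F(x_0) = integral of P(x) * delta(f(x) < f(x_0)) dx,
+    # evaluated on [-15, 15] with 40000 steps (Simpson's rule).
+    x = np.linspace(-15, 15, 40000)
+    indicator = (f(x) < f(x_0)[0]).astype(float)
+    return integrate.simps(stats.norm.pdf(x) * indicator, x=x)
+
+def empirical_performance(x_test, r):
+    # (1/n) * sum of f(x) * delta(F(x) < r) over the test set.
+    accepted = np.array([F(x) < r for x in x_test], dtype=float)
+    return np.mean(f(x_test) * accepted)
+\end{verbatim}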
@@ -210,11 +252,46 @@ The plotted curves are constructed using pseudo code presented in algorithm \ref
 
 \section{Results}
 
-\subsection{If $\beta_Z=0$ when data is generated with unobservables.}
+Results obtained from running algorithm \ref{alg:perf_comp} with $N_{iter}$ set to 3 are presented in Table \ref{tab:results} and Figure \ref{fig:results}.
+
+\begin{table}[H]
+\caption{Mean absolute error (MAE) w.r.t. the true evaluation}
+\begin{center}
+\begin{tabular}{l | c c}
+Method & MAE without Z & MAE with Z \\ \hline
+Labeled outcomes & 0.107563333 & 0.0817483\\
+Human evaluation & 0.004403964 & 0.0042597\\
+Contraction & 0.011049707 & 0.0054146\\
+Causal model (empirical performance) & 0.001074039 & 0.0414928\\
+\end{tabular}
+\end{center}
+\label{tab:results}
+\end{table}%
+
+
+\begin{figure}[H]
+    \centering
+    \begin{subfigure}[b]{0.5\textwidth}
+        \includegraphics[width=\textwidth]{sl_without_Z_3iter}
+        \caption{Results without unobservables}
+        \label{fig:results_without_Z}
+    \end{subfigure}
+    ~ %add desired spacing between images, e. g. ~, \quad, \qquad, \hfill etc.
+    %(or a blank line to force the subfigure onto a new line)
+    \begin{subfigure}[b]{0.5\textwidth}
+        \includegraphics[width=\textwidth]{sl_with_Z_3iter_betaZ_1_0}
+        \caption{Results with unobservables, $\beta_Z=1$.}
+        \label{fig:results_with_Z}
+    \end{subfigure}
+    \caption{Failure rate vs. acceptance rate with varying levels of leniency. Logistic regression was trained on labeled training data. $N_{iter}$ was set to 3.}\label{fig:results}
+\end{figure}
+
+
+\subsection{$\beta_Z=0$ and data generated with unobservables}
 
 If we assign $\beta_Z=0$, all failure rates except the human evaluation failure rate drop to almost zero in the interval $0.1, \ldots, 0.3$. The curves are drawn in Figures \ref{fig:betaZ_1_5} and \ref{fig:betaZ_0}.
 
-\begin{figure}
+\begin{figure}[H]
     \centering
     \begin{subfigure}[b]{0.5\textwidth}
         \includegraphics[width=\textwidth]{sl_with_Z_4iter_betaZ_1_5}
@@ -231,14 +308,14 @@ If we assign $\beta_Z=0$, almost all failure rates drop to zero in the interval
     \caption{Failure rate vs. acceptance rate with unobservables in the data. Logistic regression was trained on labeled training data. Results from algorithm \ref{alg:perf_comp} with $N_{iter}=4$. Data was generated with algorithm \ref{alg:data_with_Z}.}\label{fig:betaZ_comp}
 \end{figure}
 
-\subsection{If noise is added to the decision made when data is generated without unobservables}
+\subsection{Noise added to the decision and data generated without unobservables}
 
-Results are presented in Figure \ref{fig:sigma_figure}.
+In this part, Gaussian noise with zero mean and variance 0.1 was added to the probabilities $P(Y=0|X=x)$ after sampling Y but before ordering the observations on line 5 of algorithm \ref{alg:data_without_Z}. Results are presented in Figure \ref{fig:sigma_figure}; a code sketch of the modification is given below the figure.
 
 \begin{figure}[H]
     \centering
     \includegraphics[width=0.5\textwidth]{sl_without_Z_3iter_sigma_sqrt_01}
-    \caption{Failure rate with varying levels of leniency without unobservables. }
+    \caption{Failure rate with varying levels of leniency without unobservables.
+    Logistic regression was trained on labeled training data and $N_{iter}$ was set to 3.}
     \label{fig:sigma_figure}
 \end{figure}
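+
+The modification can be sketched as follows (illustrative names: \texttt{x} holds the features of one judge's subjects and \texttt{r} is that judge's acceptance rate, as in algorithm \ref{alg:data_without_Z}; a variance of 0.1 corresponds to a standard deviation of $\sqrt{0.1}$):
+
+\begin{verbatim}
+import numpy as np
+
+rng = np.random.default_rng()
+
+p_y0 = 1 / (1 + np.exp(-x))      # P(Y = 0 | X = x) = sigma(x)
+y = rng.binomial(1, 1 - p_y0)    # sample Y first
+
+# Add zero-mean Gaussian noise (variance 0.1) to the probabilities...
+noisy = p_y0 + rng.normal(0, np.sqrt(0.1), size=p_y0.shape)
+
+# ...and use the noisy values for the descending sort on line 5.
+order = np.argsort(-noisy)
+t = np.ones_like(y)
+t[order[:int((1 - r) * len(x))]] = 0   # top (1-r)*100% get T = 0
+\end{verbatim}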
diff --git a/figures/sl_with_Z_3iter_betaZ_1_0.png b/figures/sl_with_Z_3iter_betaZ_1_0.png
new file mode 100644
index 0000000000000000000000000000000000000000..980ecb1fee3dc36490544720afb800cdaf19e0ad
Binary files /dev/null and b/figures/sl_with_Z_3iter_betaZ_1_0.png differ
diff --git a/figures/sl_without_Z_3iter.png b/figures/sl_without_Z_3iter.png
new file mode 100644
index 0000000000000000000000000000000000000000..26b3c25cb94884389428a117b149d285d5019a64
Binary files /dev/null and b/figures/sl_without_Z_3iter.png differ