diff --git a/analysis_and_scripts/notes.tex b/analysis_and_scripts/notes.tex
index f86991205de6fb489ab17523b200c7f36d088828..ef4cd69fcb84cda3f6aeea9655438285deec6ee5 100644
--- a/analysis_and_scripts/notes.tex
+++ b/analysis_and_scripts/notes.tex
@@ -67,7 +67,7 @@
 \maketitle
 
-\section*{Contents}
+%\section*{Contents}
 
 \tableofcontents
 
@@ -160,7 +160,7 @@ In the setting with unobservables Z, we first sample an acceptance rate r for al
 The fitted models are logistic regression models from the scikit-learn package. The solver is set to lbfgs (as there is no closed-form solution) and the intercept is estimated by default. The resulting LogisticRegression model object provides convenient methods for fitting the model and for obtaining class-label probabilities; a minimal fitting sketch is given after the list below. Please see the documentation at \url{https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html} or ask me (RL) for more details. Similar analyses were conducted with a random forest classifier, but the results (see section \ref{sec:random_forest}) were practically identical.
 
-All of the algorithms 4--7 and the contraction algorithm are model agnostic, i.e. do not depend on a specific predictive model. The model has to give probabilities for given output with some determined input. Lakkaraju says in their paper "We train logistic regression on this training set. We also experimented with other predictive models and observed similar behaviour."
+All of the algorithms 4--8 are model agnostic, i.e.\ they do not depend on a specific predictive model; the model only has to output probabilities for the class labels given some input. Lakkaraju et al.\ state in their paper: "We train logistic regression on this training set. We also experimented with other predictive models and observed similar behaviour."
 
 NB: sklearn's regression models cannot be fitted if the data includes missing values, so list-wise deletion is applied in cases of missing data (the whole record is discarded).
 
@@ -173,7 +173,7 @@ The following quantities are computed from the data:
 \item Labeled outcomes: The "traditional"/vanilla estimate of model performance. See algorithm \ref{alg:labeled_outcomes}.
 \item Human evaluation: The failure rate of human decision-makers who have access to the latent variable Z. Decision-makers with similar values of leniency are binned and treated as one hypothetical decision-maker. See algorithm \ref{alg:human_eval}.
 \item Contraction: See algorithm \ref{alg:contraction} from \cite{lakkaraju17}.
-\item Causal model: In essence, the empirical performance is calculated over the test set as $$\dfrac{1}{n}\sum_{(x, y)\in D}f(x)\delta(F(x) < r)$$ where $$f(x) = P(Y=0|T=1, X=x)$$ is a logistic regression model (see \ref{sec:model_fitting}) trained on the labeled data predicting Y from X and $$ F(x_0) = \int_{x\in\mathcal{X}} P(x)\delta(f(x) < f(x_0)) ~ dx.$$ All observations, even ones with missing outcome labels, can be used since empirical performance doesn't depend on them. $P(x)$ is Gaussian pdf from scipy.stats package and it is integrated over interval [-15, 15] with 40000 steps using si.simps function from scipy.integrate which uses Simpson's rule in estimating the value of the integral. (docs: \url{https://docs.scipy.org/doc/scipy/reference/generated/scipy.integrate.simps.html}) \label{causal_cdf}
+\item Causal model: In essence, the empirical performance is calculated over the test set as $$\dfrac{1}{n}\sum_{(x, y)\in D}f(x)\delta(F(x) < r)$$ where $$f(x) = P(Y=0|T=1, X=x)$$ is a logistic regression model (see \ref{sec:model_fitting}; a random forest is used in section \ref{sec:random_forest}) trained on the labeled data to predict Y from X, and $$ F(x_0) = \int_{x\in\mathcal{X}} P(x)\delta(f(x) < f(x_0)) ~ dx.$$ All observations, even those with missing outcome labels, can be used, since the empirical performance does not depend on the outcome labels. $P(x)$ is the Gaussian pdf from the scipy.stats package, and the integral is evaluated over the interval $[-15, 15]$ with 40000 steps using the si.simps function from scipy.integrate, which estimates the integral with Simpson's rule. (docs: \url{https://docs.scipy.org/doc/scipy/reference/generated/scipy.integrate.simps.html}) See algorithm \ref{alg:causal_model} and the sketches after this list. \label{causal_cdf}
 \end{itemize}
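+
+To make the model fitting concrete, here is a minimal sketch of how such a model can be fitted and queried with scikit-learn. It is illustrative only: the data generation is a stand-in and the variable names (\texttt{X}, \texttt{T}, \texttt{Y}) are hypothetical, not the ones used in the actual scripts.
+
+\begin{verbatim}
+import numpy as np
+import pandas as pd
+from sklearn.linear_model import LogisticRegression
+
+rng = np.random.RandomState(0)
+# Stand-in data: feature X, decision T (1 = released) and outcome Y,
+# which is missing (NaN) whenever the subject was not released.
+data = pd.DataFrame({'X': rng.normal(size=1000)})
+data['T'] = (data['X'] < 0).astype(int)
+p_y1 = 1 / (1 + np.exp(-data['X']))
+data['Y'] = np.where(data['T'] == 1,
+                     (rng.uniform(size=1000) < p_y1).astype(int),
+                     np.nan)
+
+# List-wise deletion: sklearn models cannot be fitted with missing values.
+labeled = data.dropna(subset=['Y'])
+
+model = LogisticRegression(solver='lbfgs')  # intercept estimated by default
+model.fit(labeled[['X']], labeled['Y'])
+
+# predict_proba returns one column per class, ordered as in model.classes_;
+# column 0 is P(Y = 0 | T = 1, X = x), i.e. f(x) above.
+prob_y0 = model.predict_proba(labeled[['X']])[:, 0]
+\end{verbatim}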
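+
+Similarly, the following sketch continues from the previous one and shows one way to evaluate $F(x_0)$ and the empirical performance of the causal model with Simpson's rule, following the "Causal model" item above. The test set \texttt{test} is again a hypothetical stand-in.
+
+\begin{verbatim}
+from scipy import stats
+from scipy.integrate import simps
+
+# Grid over the feature space: interval [-15, 15] with 40000 steps.
+grid = np.linspace(-15, 15, 40000)
+p_x = stats.norm.pdf(grid)                                # Gaussian pdf P(x)
+f_grid = model.predict_proba(grid.reshape(-1, 1))[:, 0]   # f(x) on the grid
+
+def F(x0):
+    # F(x0) = integral of P(x) * delta(f(x) < f(x0)) dx, by Simpson's rule
+    f_x0 = model.predict_proba([[x0]])[0, 0]
+    return simps(p_x * (f_grid < f_x0), x=grid)
+
+# Empirical performance at acceptance rate r: mean of f(x) over the
+# whole test set, restricted to subjects with F(x) < r.
+test = data                                   # stand-in test set
+r = 0.5
+F_test = np.array([F(x0) for x0 in test['X']])
+f_test = model.predict_proba(test[['X']])[:, 0]
+ep = np.sum(f_test * (F_test < r)) / len(test)
+\end{verbatim}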
 
 The plotted curves are constructed using the pseudocode presented in algorithm \ref{alg:perf_comp}.
 
@@ -196,8 +196,8 @@
 \STATE Compute failure rate of the contraction algorithm with leniency $r$ and labeled test data.
 \STATE Compute the empirical performance of the causal model with leniency $r$, predictive model $f$ and labeled test data using algorithm \ref{alg:causal_model}.
 \ENDFOR
-  \STATE Calculate means of the failure rates for each value of leniency and for each algorithm separately.
-  \STATE Calculate standard error of the mean for each value of leniency and for each algorithm separately.
+  \STATE Calculate means of the failure rates for each level of leniency and for each algorithm separately.
+  \STATE Calculate standard error of the mean for each level of leniency and for each algorithm separately.
 \ENDFOR
 \STATE Plot the failure rates with the given levels of leniency $r$.
 \STATE Calculate the mean absolute error of each algorithm with respect to the true evaluation.
 
@@ -208,7 +208,7 @@
 \caption{True evaluation} % give the algorithm a caption
 \label{alg:true_eval} % and a label for \ref{} commands later in the document
 \begin{algorithmic}[1] % enter the algorithmic environment
-\REQUIRE Test data $\mathcal{D}$ with probabilities $\mathcal{S}$ and \emph{all outcome labels}, acceptance rate r
+\REQUIRE Full test data $\mathcal{D}$ with probabilities $\mathcal{S}$ and \emph{all outcome labels}, acceptance rate $r$
 \ENSURE
 \STATE Sort the data by the probabilities $\mathcal{S}$ in ascending order.
 \STATE \hskip3.0em $\rhd$ Now the most dangerous subjects are last.
 
@@ -238,7 +238,7 @@
 \REQUIRE Labeled test data $\mathcal{D}$ with probabilities $\mathcal{S}$ and \emph{missing outcome labels} for observations with $T=0$, acceptance rate $r$
 \ENSURE
 \STATE Assign judges with leniency in $[r-0.05, r+0.05]$ to $\mathcal{J}$
-\STATE $\mathcal{D}_{released} = \{(x, j, t, y) \in \mathcal{D}|t=1 \wedge j \in \mathcal{J}\}$
+\STATE $\mathcal{D}_{released} = \{(x, j, t, y) \in \mathcal{D}~|~t=1 \wedge j \in \mathcal{J}\}$
 \STATE \hskip3.0em $\rhd$ Subjects judged \emph{and} released by judges with the correct leniency.
 \RETURN $\frac{1}{|\mathcal{J}|}\sum_{i \in \mathcal{D}_{released}}\delta\{y_i=0\}$
 \end{algorithmic}
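+
+As an illustration, the return value of algorithm \ref{alg:human_eval} could be computed along the following lines; this is a sketch with hypothetical column names (\texttt{judge\_id}, \texttt{leniency}, \texttt{T}, \texttt{Y}), not the names used in the actual scripts.
+
+\begin{verbatim}
+import numpy as np
+
+def human_eval_failure_rate(test, r):
+    # Judges whose leniency is within 0.05 of the target acceptance rate.
+    judged = test[np.abs(test['leniency'] - r) <= 0.05]
+    # Subjects judged *and* released by those judges; only they have labels.
+    released = judged[judged['T'] == 1]
+    n_judges = judged['judge_id'].nunique()        # |J|
+    # Return value of the algorithm: (1 / |J|) * sum of delta(y_i = 0).
+    return np.sum(released['Y'] == 0) / n_judges
+\end{verbatim}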
 
@@ -269,12 +269,13 @@
 \end{algorithm}
 
 \begin{algorithm}[] % enter the algorithm environment
-\caption{Causal model, empirical performance (ep)} % give the algorithm a caption
+\caption{Causal model, empirical performance (ep, see also section \ref{causal_cdf})} % give the algorithm a caption
 \label{alg:causal_model} % and a label for \ref{} commands later in the document
 \begin{algorithmic}[1] % enter the algorithmic environment
-\REQUIRE Labeled test data $\mathcal{D}$ with probabilities $\mathcal{S}$ and \emph{missing outcome labels} for observations with $T=0$, predictive model f, acceptance rate r
+\REQUIRE Labeled test data $\mathcal{D}$ with probabilities $\mathcal{S}$ and \emph{missing outcome labels} for observations with $T=0$, predictive model $f$, pdf $P_X(x)$ for features $x$, acceptance rate $r$
 \ENSURE
-\STATE Create boolean array $T_{causal} = cdf(\mathcal{D}, f) < r$. See "Causal model" in \ref{causal_cdf}.
+\STATE For all $x_0 \in \mathcal{D}$, evaluate $F(x_0) = \int_{x\in\mathcal{X}} P_X(x)\delta(f(x)<f(x_0)) ~dx$ and assign the values to $\mathcal{F}_{predictions}$
+\STATE Create the boolean array $T_{causal} = \mathcal{F}_{predictions} < r$
 \RETURN $\frac{1}{|\mathcal{D}|}\sum_{i=1}^{|\mathcal{D}|} \mathcal{S}_i \cdot T_{causal,i}$, which is equal to $\frac{1}{|\mathcal{D}|}\sum_{x\in\mathcal{D}} f(x)\delta(F(x) < r)$
 \end{algorithmic}
 \end{algorithm}
 
@@ -282,9 +283,9 @@
 \section{Results}
 
-Results obtained from running algorithm \ref{alg:perf_comp} with $N_{iter}$ set to 3 are presented in table \ref{tab:results} and figure \ref{fig:results}.
+Results obtained from running algorithm \ref{alg:perf_comp} with $N_{iter}$ set to 3 are presented in table \ref{tab:results} and figure \ref{fig:results}. All parameters are at their default values and a logistic regression model is trained.
 
-\begin{table}[H]
+\begin{table}[]
 \caption{Mean absolute error (MAE) w.r.t.\ true evaluation}
 \begin{center}
 \begin{tabular}{l | c c}
@@ -299,7 +300,7 @@ Causal model, ep & 0.001074039 & 0.0414928\\
 \end{table}%
 
-\begin{figure}[H]
+\begin{figure}[]
 \centering
 \begin{subfigure}[b]{0.5\textwidth}
 \includegraphics[width=\textwidth]{sl_without_Z_3iter}
@@ -323,7 +324,7 @@ If we assign $\beta_Z=0$, almost all failure rates drop to zero in the interval
 The differences between figures \ref{fig:results_without_Z} and \ref{fig:betaZ_0} could be explained by the slight difference in the data generating process, namely the effect of $W$ or $\epsilon$. The effect of adding $\epsilon$ (noise to the decisions) is explored further in section \ref{sec:epsilon}.
 
-\begin{figure}[H]
+\begin{figure}[]
 \centering
 \begin{subfigure}[b]{0.5\textwidth}
 \includegraphics[width=\textwidth]{sl_with_Z_4iter_betaZ_1_5}
@@ -345,7 +346,7 @@ The differences between figures \ref{fig:results_without_Z} and \ref{fig:betaZ_0
 In this part, Gaussian noise with zero mean and variance 0.1 was added to the probabilities $P(Y=0|X=x)$ after sampling Y but before ordering the observations in line 5 of algorithm \ref{alg:data_without_Z}; a minimal sketch of this perturbation is given after the figure. Results are presented in figure \ref{fig:sigma_figure}.
 
-\begin{figure}[H]
+\begin{figure}[]
 \centering
 \includegraphics[width=0.75\textwidth]{sl_without_Z_3iter_sigma_sqrt_01}
 \caption{Failure rate with varying levels of leniency without unobservables. Logistic regression was trained on labeled training data with $N_{iter}$ set to 3.}
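+
+The perturbation itself is a one-liner; a minimal numpy sketch, continuing the earlier ones (variable names are hypothetical):
+
+\begin{verbatim}
+# Zero-mean Gaussian noise with variance 0.1, i.e. std = sqrt(0.1),
+# added to P(Y=0|X=x) after Y is sampled but before the observations
+# are ordered by these probabilities.
+noisy_prob_y0 = prob_y0 + rng.normal(loc=0, scale=np.sqrt(0.1),
+                                     size=prob_y0.shape)
+\end{verbatim}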
 
@@ -354,21 +355,21 @@
 \subsection{Predictions with random forest classifier}
 \label{sec:random_forest}
 
-In this section the predictive model was switched to random forest classifier to examine the effect of changing the model. Results are practically identical to then ones presented in figure \ref{fig:results} previously. The resulting outcome is presented in \ref{fig:random_forest}.
+In this section the predictive model was switched to a random forest classifier to examine the effect of changing the predictive model. The results, presented in figure \ref{fig:random_forest}, are practically identical to those presented in figure \ref{fig:results}. A sketch of this swap is given below.
 
-\begin{figure}[H]
+\begin{figure}[]
 \centering
 \begin{subfigure}[b]{0.5\textwidth}
 \includegraphics[width=\textwidth]{sl_withoutZ_4iter_randomforest}
 \caption{Results without unobservables, \\$N_{iter}=4$.}
- \label{fig:results_without_Z}
+ \label{fig:results_without_Z_rf}
 \end{subfigure}
 ~ %add desired spacing between images, e.g. ~, \quad, \qquad, \hfill etc.
 %(or a blank line to force the subfigure onto a new line)
 \begin{subfigure}[b]{0.5\textwidth}
 \includegraphics[width=\textwidth]{sl_withZ_6iter_betaZ_1_0_randomforest}
 \caption{Results with unobservables, $\beta_Z=1$ and \\$N_{iter}=6$.}
- \label{fig:results_with_Z}
+ \label{fig:results_with_Z_rf}
 \end{subfigure}
 \caption{Failure rate vs.\ acceptance rate with varying levels of leniency. A random forest classifier was trained on the labeled training data.}
 \label{fig:random_forest}
 
@@ -378,7 +379,7 @@
 Predictions were sanity checked by plotting the predicted Y against X; the results are presented in figure \ref{fig:sanity_check}. The figure indicates that the predicted class labels and their probabilities are consistent with the ground truth. A sketch of the plotting code follows the figure.
 
-\begin{figure}[H]
+\begin{figure}[]
 \centering
 \includegraphics[width=0.75\textwidth]{sanity_check}
 \caption{Predicted class label and probability of $Y=1$ versus X. The prediction was made with a logistic regression model. Colors of the points denote the ground truth (yellow = 1, purple = 0). The data set was created with the unobservables.}
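+
+Since the estimators are model agnostic, swapping the predictive model is a drop-in change: any classifier exposing \texttt{predict\_proba} will do. A minimal sketch, continuing the earlier ones (hyperparameters are illustrative):
+
+\begin{verbatim}
+from sklearn.ensemble import RandomForestClassifier
+
+# Replace the logistic regression model with a random forest; the rest
+# of the pipeline only calls predict_proba and is unchanged.
+model = RandomForestClassifier(n_estimators=100, random_state=0)
+model.fit(labeled[['X']], labeled['Y'])
+prob_y0 = model.predict_proba(labeled[['X']])[:, 0]
+\end{verbatim}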
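+
+The sanity-check figure can be reproduced along these lines (matplotlib; a sketch only, reusing the fitted model and the labeled stand-in data from the earlier sketches):
+
+\begin{verbatim}
+import matplotlib.pyplot as plt
+
+# Predicted P(Y = 1 | X) and hard class labels for the labeled subjects,
+# plotted against X and colored by the observed ground truth Y.
+p_y1 = model.predict_proba(labeled[['X']])[:, 1]
+y_hat = model.predict(labeled[['X']])
+
+plt.scatter(labeled['X'], p_y1, c=labeled['Y'], s=8, label='P(Y=1|X=x)')
+plt.scatter(labeled['X'], y_hat, c=labeled['Y'], s=8, marker='x',
+            label='predicted class')
+plt.xlabel('X')
+plt.legend()
+plt.show()
+\end{verbatim}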