diff --git a/analysis_and_scripts/notes.tex b/analysis_and_scripts/notes.tex index 8b943d63e80dc00cb6db1cbf8822d68a859eb6dc..243c6aef766ed538881b3a27735fa063a34bc4ac 100644 --- a/analysis_and_scripts/notes.tex +++ b/analysis_and_scripts/notes.tex @@ -171,20 +171,22 @@ Complex computational algorithms have already been deployed in numerous fields s \section{Framework} -$$DG+DEC = DATA \rightarrow MODEL \rightarrow EVALUATION$$ +$$FEATURES + DECISION = DATA \rightarrow MODEL \rightarrow EVALUATION$$ -In our framework, the flow of information is divided into three steps: data generation, modelling and evaluation steps. Our framework definition relies on precise definitions on the properties of each step. A step is fully defined when its input and output are unambiguously described. +In our framework, the flow of information is divided into three steps: the data, modelling and evaluation steps. Our framework relies on precise definitions of the properties of each step. A step is fully defined when its input and output are unambiguously described. -The data generation step generates all the data, including the features and the selectively labeled outcomes of the subjects and the features of the deciders. In our setting we discuss mainly five different variables: subjects have feature X for observed features, Z for unobserved features (or \emph{unobservables} in short), and Y for the outcome. The deciders (judges, doctors) assigning the labels are defined by their leniency R and the decisions given to subjects T. The central variants for feature and decision generation are presented in section \ref{sec:modules}. (we have separated the decision process from the feature generation) The effects of variables on each other is presented as a directed graph in figure \ref{fig:initial_model}. +The data step generates all the data: the features and outcomes of the subjects, and the features of the deciders together with their decisions concerning the subjects. The features of the subjects comprise observable and unobservable features, denoted by X and Z respectively. The only feature of a decider is their leniency R. Decisions are denoted by T and the resulting outcomes by Y, with 0 for a negative and 1 for a positive outcome. -After the feature generation, the data is moved to a predictive model in the modelling step. The model, e.g. a regression model, is trained using one portion of the original data. Using the trained model we assign predictions for negative outcomes for the observations in the other part of the data. It is this model the performance of which we are interested in. +%In our setting we discuss mainly five different variables: subjects have feature X for observed features, Z for unobserved features (or \emph{unobservables} in short), and Y for the outcome. The deciders (judges, doctors) assigning the labels are defined by their leniency R and the decisions given to subjects T. The central variants for feature and decision generation are presented in section \ref{sec:modules}. (we have separated the decision process from the feature generation) The effects of variables on each other is presented as a directed graph in figure \ref{fig:initial_model}. -Finally, an evaluation algorithm tries to output a reliable estimate of the model's failure rate. A good evaluation algorithm gives a precise, unbiased and low variance estimate of the failure rate of the model with a given leniency $r$. As explained the setting is characterized by the \emph{selctive labeling} of the data.
+Once the data is available, it is given to a predictive model in the modelling step. The model, e.g.\ a regression model, is trained on one portion of the original data. Using the trained model and the observed covariates, we predict the probability of a negative outcome for each observation in the other part of the data. It is the performance of this model that we are interested in. + +Finally, the data with the predictions is given to an evaluation algorithm, which tries to output a reliable estimate of the model's failure rate. A good evaluation algorithm gives a precise, unbiased and low-variance estimate of the failure rate of the model at a given leniency $r$. % As explained, the setting is characterized by the \emph{selective labeling} of the data. \section{Potential outcomes in model evaluation} -Potential outcomes evaluator uses Stan to infer the latent variable and the path coefficients and to estimate expectation of Y for the missing outcomes. Full hierarchical model is presented in eq. \ref{eq:po} below. Expectations for the missing outcomes Y are obtained as \sout{point estimates from the maximum of the joint posterior} means of the posterior predictive distribution. (Joint posterior appeared to be bimodal or skewed and therefore the mean gave better estimates.) +The potential outcomes evaluator uses Stan to infer the latent variable and the path coefficients, and to estimate the expectation of the potential outcome Y(1) for the missing outcomes. The full hierarchical model is presented in eq.~\ref{eq:po} below. Expectations for the missing outcomes are obtained as \sout{point estimates from the maximum of the joint posterior} means of the posterior predictive distribution. (The joint posterior appeared to be bimodal or skewed, and therefore the mean gave better estimates.) Priors for the $\beta$ coefficients were chosen to be sufficiently non-informative without restricting the density estimation. Coefficients for the latent Z were restricted to be positive for posterior estimation. The resulting distribution is the half-normal distribution ($X\sim N(0, 1) \Rightarrow Y=|X| \sim$ Half-Normal). The $\alpha$ intercepts appear only in the decision equation, to emulate the differences in leniency across the $M$ judges. Subjects with equal x and z will then have different probabilities of bail depending on the judge's leniency. @@ -197,11 +199,11 @@ Priors for the $\beta$ coefficients were chosen to be sufficiently non-informati p(\alpha_j) & \propto 1 \hskip1.0em \text{for } j \in \{1, 2, \ldots, M\} \end{align} -Model is fitted on the test set and the expectations of potential outcomes $Y_{T=1}$ are used in place of missing outcomes. +The model is fitted on the test set, and the expectations of the potential outcomes $Y(1)$ are used in place of the missing outcomes.
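+For concreteness, the hierarchical model described above could be written in Stan roughly as follows. This is a minimal sketch rather than the exact model of eq.~\ref{eq:po}: the logistic links, the prior scales and all variable names are illustrative assumptions, and the outcome likelihood is evaluated only for subjects with T = 1 to reflect the selective labeling.
+\begin{verbatim}
+data {
+  int<lower=1> N;                  // subjects in the test fold
+  int<lower=1> M;                  // number of judges
+  int<lower=1, upper=M> judge[N];  // judge deciding each case
+  vector[N] x;                     // observable feature X
+  int<lower=0, upper=1> t[N];      // decision T (1 = bail granted)
+  int<lower=0, upper=1> y[N];      // outcome Y; placeholder when t[n] == 0
+}
+parameters {
+  vector[N] z;              // latent unobservable Z, one per subject
+  real beta_xt;             // effect of X on the decision
+  real beta_xy;             // effect of X on the outcome
+  real<lower=0> beta_zt;    // effects of Z, constrained positive
+  real<lower=0> beta_zy;
+  vector[M] alpha;          // judge-specific intercepts (leniency)
+}
+model {
+  z ~ normal(0, 1);
+  beta_xt ~ normal(0, 10);  // scales assumed "sufficiently non-informative"
+  beta_xy ~ normal(0, 10);
+  beta_zt ~ normal(0, 1);   // with the <lower=0> constraint: half-normal
+  beta_zy ~ normal(0, 1);
+  // no sampling statement for alpha: improper flat prior p(alpha_j) propto 1
+  for (n in 1:N) {
+    t[n] ~ bernoulli_logit(alpha[judge[n]] + beta_xt * x[n] + beta_zt * z[n]);
+    if (t[n] == 1)  // outcomes enter the likelihood only when observed
+      y[n] ~ bernoulli_logit(beta_xy * x[n] + beta_zy * z[n]);
+  }
+}
+generated quantities {
+  // per-draw success probability of the potential outcome Y(1); the
+  // posterior means of these stand in for the missing outcomes
+  vector[N] y1_prob;
+  for (n in 1:N)
+    y1_prob[n] = inv_logit(beta_xy * x[n] + beta_zy * z[n]);
+}
+\end{verbatim}
+Averaging \texttt{y1\_prob} over the posterior draws then gives the expectations of the potential outcomes $Y(1)$ referred to above.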
\section{Results} -Here we present the results obtained from four core settings and four settings created for robustness check. Our core settings include situations where data is generated with the unobservables Z. Settings are characterized by the outcome creating mechanism and decision assignment. +Here we present the results obtained from four core settings \sout{and four settings created for robustness check}. Our core settings include situations where the data are generated with the unobservables Z. The settings are characterized by the outcome-generating mechanism and the decision assignment. The four main result figures are presented in figure set \ref{fig:results_bayes} with captions indicating the outcome and decision mechanisms. Figures showing the variance of the failure rate estimates are in appendix section \ref{sec:diagnostic}. @@ -235,7 +237,7 @@ The four main result figures are presented in figure set \ref{fig:results_bayes} \label{fig:results_bayes} \end{figure} -The robustness checks are presented in figure set \ref{fig:results_robustness} depicting deciders who +The robustness checks (TBA) are presented in figure set \ref{fig:results_robustness} depicting deciders who \begin{enumerate}[label=\Alph*)] \item assign bail randomly with probability $r$, \item favour or dislike some defendants with certain values of X @@ -243,43 +245,43 @@ The robustness checks are presented in figure set \ref{fig:results_robustness} d \end{enumerate} The last figure (D) illustrates the situation where the data generation mechanism is exactly the same as in the model specified in eq.~\ref{eq:po}, with $\beta$ coefficients equal to 1 and $\alpha$ coefficients equal to 0.
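+Written out, and assuming for illustration standard normal features and the same logistic links as in the Stan sketch above, this data generation mechanism would read
+\begin{align*}
+X, Z &\sim N(0, 1), \\
+T \mid X, Z &\sim \mathrm{Bernoulli}\big(\sigma(X + Z)\big), \\
+Y \mid X, Z &\sim \mathrm{Bernoulli}\big(\sigma(X + Z)\big),
+\end{align*}
+where $\sigma$ denotes the logistic function; the judge intercepts drop out since all $\alpha_j = 0$ and all $\beta$ coefficients equal 1.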
-The figures show that the contraction method is fairly robust to changes in data and decision generation although it has some variance. The potential outcomes approach consistently underestimates the true failure rate resulting in mediocre performance (compared to the imputation algorithms presented in the SL paper). The mean absolute errors of contraction with regard to true evaluation were in the order of 0.009...0.005 compared to our MAEs of approximately 0.015...0.035. Worrying aspect of the analysis is that even if the data generating process follows exactly the specified hierarchical model, the failure rate is still approximately 0.015. - -Notable is that the diagnostic figures in section \ref{sec:diagnostic} show that few failure rate estimates are too high showing as "flyers/outliers" implying problems in model identifiability, probably as in bimodal posterior. (In the output when Stan tries to find the minimum of the negative log posterior (max of posterior) it occasionally converges to log probability of approximately -200...-400 when most of the times the minimum is at -7000...-9000.) - -% The figures show that the potential outcomes approach estimates the true failure rate with a slightly better performance than the contraction algorithm (MAEs 0.00X...0.00Y compared to 0.00T...0.00U respectively). The results are in for only some of the core situations discussed, but will be provided for the Thursday meeting. The diagnostic figures in section \ref{sec:diagnostic} also show that the predictions given by the potential outcomes approach have a lower variance. - -Imposing constraints to model in equation \ref{eq:po} did not yield significantly better results. The model was constrained so that there was only one intercept $\alpha$ and so that $\beta_{xt}=\beta_{xy}$ and $\beta_{zt}=\beta_{zy}$. Other caveats of the current approach is its scalability to large data sets. - -\begin{figure}[H] - \centering - \begin{subfigure}[b]{0.475\textwidth} - \includegraphics[width=\textwidth]{sl_result_random} - \caption{Random decisions.} - %\label{fig:} - \end{subfigure} - \quad %add desired spacing between images, e. g. ~, \quad, \qquad, \hfill etc. - %(or a blank line to force the subfigure onto a new line) - \begin{subfigure}[b]{0.475\textwidth} - \includegraphics[width=\textwidth]{sl_result_biased} - \caption{Biased decisions.} - \label{fig:bias} - \end{subfigure} - \begin{subfigure}[b]{0.475\textwidth} - \includegraphics[width=\textwidth]{sl_result_bad} - \caption{Bad judge with $\beta_x=0.2$.} - %\label{fig:} - \end{subfigure} - \quad %add desired spacing between images, e. g. ~, \quad, \qquad, \hfill etc. - %(or a blank line to force the subfigure onto a new line) - \begin{subfigure}[b]{0.475\textwidth} - \includegraphics[width=\textwidth]{sl_result_bernoulli_bernoulli} - \caption{Data generation as corresponding to model.} - %\label{fig:} - \end{subfigure} - \caption{Robustness check figures: Failure rate vs. acceptance rate with varying levels of leniency. Different combinations of deciders and data generation mechanisms. Only one data set used, which affects the performance of contraction.} - \label{fig:results_robustness} -\end{figure} +%The figures show that the contraction method is fairly robust to changes in data and decision generation although it has some variance. The potential outcomes approach consistently underestimates the true failure rate resulting in mediocre performance (compared to the imputation algorithms presented in the SL paper). The mean absolute errors of contraction with regard to true evaluation were in the order of 0.009...0.005 compared to our MAEs of approximately 0.015...0.035. Worrying aspect of the analysis is that even if the data generating process follows exactly the specified hierarchical model, the failure rate is still approximately 0.015. + +%Notable is that the diagnostic figures in section \ref{sec:diagnostic} show that few failure rate estimates are too high showing as "flyers/outliers" implying problems in model identifiability, probably as in bimodal posterior. (In the output when Stan tries to find the minimum of the negative log posterior (max of posterior) it occasionally converges to log probability of approximately -200...-400 when most of the times the minimum is at -7000...-9000.) + +The figures show that the contraction method is fairly robust to changes in the data and decision generation, although its estimates have some variance. The figures also show that the potential outcomes approach estimates the true failure rate slightly better than the contraction algorithm (MAEs of 0.00164 and 0.00165, compared to approximately 0.003...0.005, respectively). Results are in for only three of the core situations discussed (figs. 2a--c); the rest will be provided for the Thursday meeting. The diagnostic figures in section \ref{sec:diagnostic} also show that the predictions given by the potential outcomes approach have lower variance. + +Imposing constraints on the model in equation \ref{eq:po} might yield better results. The model could be constrained to have only one intercept $\alpha$ and to satisfy $\beta_{xt}=\beta_{xy}$ and $\beta_{zt}=\beta_{zy}$. A caveat of the current approach is its limited scalability to large (high-dimensional) data sets.
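+Under these constraints, and with the logistic links assumed in the sketch above, the model would reduce to
+\begin{align*}
+T \mid X, Z &\sim \mathrm{Bernoulli}\big(\sigma(\alpha + \beta_x X + \beta_z Z)\big), \\
+Y \mid X, Z &\sim \mathrm{Bernoulli}\big(\sigma(\beta_x X + \beta_z Z)\big),
+\end{align*}
+with a single intercept $\alpha$ retained only in the decision equation and shared coefficients $\beta_x = \beta_{xt} = \beta_{xy}$ and $\beta_z = \beta_{zt} = \beta_{zy}$, which would reduce the number of free parameters.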
+%\begin{figure}[H] +% \centering +% \begin{subfigure}[b]{0.475\textwidth} +% \includegraphics[width=\textwidth]{sl_result_random} +% \caption{Random decisions.} +% %\label{fig:} +% \end{subfigure} +% \quad %add desired spacing between images, e. g. ~, \quad, \qquad, \hfill etc. +% %(or a blank line to force the subfigure onto a new line) +% \begin{subfigure}[b]{0.475\textwidth} +% \includegraphics[width=\textwidth]{sl_result_biased} +% \caption{Biased decisions.} +% \label{fig:bias} +% \end{subfigure} +% \begin{subfigure}[b]{0.475\textwidth} +% \includegraphics[width=\textwidth]{sl_result_bad} +% \caption{Bad judge with $\beta_x=0.2$.} +% %\label{fig:} +% \end{subfigure} +% \quad %add desired spacing between images, e. g. ~, \quad, \qquad, \hfill etc. +% %(or a blank line to force the subfigure onto a new line) +% \begin{subfigure}[b]{0.475\textwidth} +% \includegraphics[width=\textwidth]{sl_result_bernoulli_bernoulli} +% \caption{Data generation as corresponding to model.} +% %\label{fig:} +% \end{subfigure} +% \caption{Robustness check figures: Failure rate vs. acceptance rate with varying levels of leniency. Different combinations of deciders and data generation mechanisms. Only one data set used, which affects the performance of contraction.} +% \label{fig:results_robustness} +%\end{figure} \begin{thebibliography}{9} % Might have been apa diff --git a/figures/sl_diagnostic_bernoulli_batch_with_Z.png b/figures/sl_diagnostic_bernoulli_batch_with_Z.png index 278cc6bf34102302e4f42f3f75a794469eafb660..dd4f6a1d7379581b4765c6fc269ca79ea3bd842a 100644 Binary files a/figures/sl_diagnostic_bernoulli_batch_with_Z.png and b/figures/sl_diagnostic_bernoulli_batch_with_Z.png differ diff --git a/figures/sl_diagnostic_threshold_independent_with_Z.png b/figures/sl_diagnostic_threshold_independent_with_Z.png index 00eef169d356b0c7321fac8dd96d39490f6ee405..ca1e8348af24a9a23b9f681eb824cade23908578 100644 Binary files a/figures/sl_diagnostic_threshold_independent_with_Z.png and b/figures/sl_diagnostic_threshold_independent_with_Z.png differ diff --git a/figures/sl_result_bernoulli_batch.png b/figures/sl_result_bernoulli_batch.png index 0f87e356250b267c6b46769e6c29744e8f073f2e..09358849764ff25ad51a27c8e0a397842b2096d1 100644 Binary files a/figures/sl_result_bernoulli_batch.png and b/figures/sl_result_bernoulli_batch.png differ diff --git a/figures/sl_result_threshold_independent.png b/figures/sl_result_threshold_independent.png index f862b236fd1fa18df5e666631d0f79ba2fa98a1f..b3a16f7cb3fceaba10c4e5b0dd4d66f072f29a9b 100644 Binary files a/figures/sl_result_threshold_independent.png and b/figures/sl_result_threshold_independent.png differ